
Updating paperless-ngx document titles with LLMs

2025-03-20 #ml

Using local language models to populate multilingual document metadata


paperless-ngx is a great self-hosted open source tool to manage administrative documents (letters, bills, forms, etc) painlessly.


Screenshot from the paperless-ngx documentation.

After upload (through browser, mobile app or email), paperless-ngx performs Optical Character Recognition on the documents, and metadata can be added manually or automatically. At the end of this process, a fully searchable document archive is available, a significant improvement over sifting through piles of paper.

Document metadata

The primary metadata for a document are the correspondent, the title, the category (document type), the date, tags and the archive id, plus custom fields such as a monetary amount.


A document with metadata: correspondent, title, category, date, tags, archive id.

LLM-based metadata extraction

Unlike correspondent, category and date, there is no simple built-in way to automatically assign the title and monetary amount metadata.
In a workflow where documents are either imported from the producer (bank statements, PDF attached to emails) or scanned manually, the default title (the filename) is usually a pretty poor description of the contents.

This, and extracting information from documents more generally, is a task well suited to large language models, as discussed in this paperless-ngx GitHub issue, and several tools have been developed to this effect:

Tool           Fields                                      Language    Remarks
paperless-gpt  title, tags and corresp.                    Go          LLM-based OCR, web UI for change review
doclytics      title, corresp., date, category and others  Rust
paperless-ai   title, corresp., tags, category             JavaScript  Web UI for change review

The mode of operation of each is:

  1. Retrieve document to process from paperless-ngx via the REST API.

  2. Send the document to an LLM API server, with an adequate prompt to retrieve the desired fields.
    The data sent is either:

    • the raw document itself (PDF, images) or
    • the content already OCR’ed by paperless-ngx via OCRmyPDF (wrapping tesseract).

    The first option has the advantage of potentially higher accuracy, especially with complex documents.

  3. Parse the output, usually constrained to be JSON.

  4. Update the paperless-ngx metadata via the API, possibly after user validation.

Thanks to the ubiquity of OpenAI API compatibility, these tools are usually compatible with a wide range of commercial or locally-hosted models (e.g. via ollama).
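
As an illustration, here is a minimal sketch of this loop in Rust (not one of the tools above, nor the implementation linked at the end of this post). The endpoints and fields follow the paperless-ngx REST API and the OpenAI chat completions API; the URLs, model name and environment variables are assumptions, and the reqwest (with the blocking and json features) and serde_json crates are used.

use serde_json::{json, Value};

// Simplified prompt; see the full prompt discussed later in this post.
const PROMPT: &str = r#"Provide a concise title in English for the text. Output in JSON, with key "title". Text:"#;

fn retitle(document_id: u64) -> Result<(), Box<dyn std::error::Error>> {
    let paperless = std::env::var("PAPERLESS_URL")?; // e.g. http://localhost:8000
    let token = std::env::var("PAPERLESS_TOKEN")?;   // paperless-ngx API token
    let client = reqwest::blocking::Client::new();

    // 1. Retrieve the document; the `content` field holds the OCR'ed text.
    let doc: Value = client
        .get(format!("{paperless}/api/documents/{document_id}/"))
        .header("Authorization", format!("Token {token}"))
        .send()?
        .error_for_status()?
        .json()?;
    let content = doc["content"].as_str().unwrap_or_default();

    // 2. Send it to an OpenAI-compatible server (here, ollama), asking for JSON output.
    let completion: Value = client
        .post("http://localhost:11434/v1/chat/completions")
        .json(&json!({
            "model": "qwen2.5:14b-instruct-q5_K_M",
            "messages": [
                { "role": "system", "content": PROMPT },
                { "role": "user", "content": content }
            ],
            "response_format": { "type": "json_object" }
        }))
        .send()?
        .error_for_status()?
        .json()?;

    // 3. Parse the JSON answer.
    let raw = completion["choices"][0]["message"]["content"].as_str().unwrap_or("{}");
    let answer: Value = serde_json::from_str(raw)?;

    // 4. Update the title via the paperless-ngx API (no user validation here).
    client
        .patch(format!("{paperless}/api/documents/{document_id}/"))
        .header("Authorization", format!("Token {token}"))
        .json(&json!({ "title": answer["title"] }))
        .send()?
        .error_for_status()?;
    Ok(())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let id: u64 = std::env::args().nth(1).ok_or("usage: retitle <document id>")?.parse()?;
    retitle(id)
}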

Requirements

The solution I was seeking had the following requirements:

Out of scope are:

Evaluating open models

From the requirements above, the model should be:

At the time of writing (March 2025), good candidates are:

Company    Model      Parameters  Context [t]  Multilingual  Speed [t/s]    CPU offloading
Microsoft  phi-4      14B         16k          No [1]        990 / 24       5% (Q5)
Alibaba    Qwen 2.5   14B         128k         Yes [2]       970 / 21       5% (Q5)
Meta       llama 3.1  8B          128k         Yes [3]       2000 / 35      - (Q8)
Mistral    NeMo       12B         128k         Yes [4]       1300 / 32      - (Q5)
Mistral    Small 3    24B         32k          Yes [5]       500 / 9        30% (Q4)
Google     Gemma 3    12B         128k         Yes [6]       1230 / 31 [7]  - (Q5)

The speeds are given in tokens per second, separately for prompt processing and generation.

The speed and CPU offloading (for a 5k-token context size with Q4 quantization) are measured on an RTX 3060 with 12 GB of VRAM and an AMD Ryzen 9 7900, using ollama for inference.

[1] From the Microsoft phi-4 model summary:

Multilingual data constitutes about 8% of our overall data. […] The model is trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. phi-4 is not intended to support multilingual use.

[2] From the announcements of Qwen2 (June 2024) and Qwen2.5:

Significant efforts were directed towards augmenting both the volume and quality of pretraining and instruction-tuning datasets across a diverse linguistic spectrum, beyond English and Chinese, to bolster its multilingual competencies. Although large language models possess an inherent capacity to generalize to other languages, we explicitly highlight the inclusion of 27 additional languages in our training. They also maintain multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more. Below, we provide basic information about the models and details of the supported languages.

See also the announcement of the original Qwen model (January 2024):

Models are sufficiently trained with 2-3 trillion tokens. The pretraining data are multilingual, and thus Qwen is essentially a multilingual model instead of a model of a single language or bilingual. Note that due to the limitations of our pretraining data, the model is strongly capable of English and Chinese and also capable of other languages, such as Spanish, French, and Japanese.

[3] From the llama 3.1 model card:

Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

[4] From the Mistral NeMo announcement:

The model is designed for global, multilingual applications. It is trained on function calling, has a large context window, and is particularly strong in English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi.

[5] From the Mistral Small 3 model card:

Supports dozens of languages, including English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, and Polish.

[6] From the Gemma 3 model card and the technical report:

Multilingual support in over 140 languages

We also increase the amount of multilingual data to improve language coverage. We add both monolingual and parallel data, and we handle the imbalance in language representation […]

[7] Disabling KV cache quantization and running under llama.cpp rather than ollama, see below.

Speed


Comparing processing speeds (prompt and generation) of two models in Grafana, using Prometheus to scrape the metrics exposed by llama.cpp.

A typical document is around 1000 tokens, with an output of ~10 tokens. This gives the following approximate processing speeds:


Number of parameters vs processing time.
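
For example, combining the prompt processing and generation speeds from the table above, the time per document is roughly

time per document ≈ 1000 / (prompt speed) + 10 / (generation speed)

which gives about 1000/970 + 10/21 ≈ 1.5 s for Qwen 2.5 and 1000/500 + 10/9 ≈ 3.1 s for Mistral Small 3.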

The throughput can be increased through parallel processing, see below. However, this is already reasonable enough to process a moderately-sized existing collection, as well as new documents.

Accuracy evaluation

The following system prompt was used after some trial-and-error:

Provide a title and amount (CHF, if relevant) for the text.
Respect the following rules for the title:
- Concise (no unnecessary words)
- No markup.
- In English, even if the text is in another language
- No names of persons or companies
- Max 5 words
Output in JSON, with keys "title" and "amount".
Text:

The “even if the text is in another language” instruction avoids cases where the models ignore the “in English” request when the text is in another language. For now, we use JSON output, but see below.

The “No names” instruction prevents the title from containing superfluous information when combined with the correspondent.

A rough manual evaluation of the generated titles alone gives the following:

Model             Size  Assessment
Meta llama 3.1    8B    Poor titles, not always in English.
Mistral NeMo      12B   Outputs markup (**Title**: ...), and titles are not always in English for non-English documents.
Alibaba Qwen 2.5  14B   Relatively good titles, although sometimes subtly wrong (e.g. "No-show fee" instead of "Doctor appointment") and sometimes including the monetary amount in the title.
Mistral Small 3   24B   Generally excellent at providing accurate titles, in English.
Gemma 3           12B   Despite being half the size of Mistral Small 3, seems to produce equally good (if not slightly worse) output for this task.

Unsurprisingly, models that do not claim to be multilingual (e.g. phi-4) perform relatively poorly on non-English documents.

Optimizing inference

Given we are performing inference with limited resources, it is worth applying simple optimizations to speed up processing.

Gemma 3 inference speed

Running Google’s Gemma 3 under ollama with the same parameters as other models yields a prompt processing speed of only 180 tokens/s, an order of magnitude lower than other models of this size.

According to this llama.cpp issue (see also this ollama issue), this happens when combining flash attention with KV cache quantization, due to the larger attention head size. Disabling KV cache quantization (at the cost of increased memory usage) gives a processing speed of 1000 tokens/s; this is the value reported in the table above.

In practice, the maximum context size that then fits with the Q5-quantized model on 12 GB VRAM is 6144 (or 9216 using a Q4 model quantization):

init: kv_size = 6144, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
llama_context: KV self size  = 2304.00 MiB, K (f16): 1152.00 MiB, V (f16): 1152.00 MiB
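
This figure is consistent with a back-of-the-envelope estimate. Assuming Gemma 3 12B caches 2048 values per layer for K and as many for V (8 KV heads with a head dimension of 256), the unquantized f16 cache takes

6144 tokens × 48 layers × 2048 values × 2 bytes = 1152 MiB

each for K and V, i.e. the 2304 MiB reported above.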

Simplified structured output grammar

Using JSON structured output to retrieve the title and amount from the model is natural, and supported by ollama (and more generally the OpenAI API).

This works by transforming the JSON schema to a GBNF grammar, which is then used to constrain the sampling, combined with adequate prompting of the model.
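
For instance, for the title/amount output used in the evaluation above, the schema passed along with the request (e.g. in ollama's format field) might be a sketch along these lines:

{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "amount": { "type": "string" }
  },
  "required": ["title", "amount"]
}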

However, from the tables above, while prompt processing is relatively fast, generation speed is slow, and producing

{
  "title": "",
  "amount": ""
}

already requires around 16 tokens.

It is therefore beneficial to use a more compact grammar, e.g.

root ::= string linebreak monetaryAmount
string ::= char+
char ::= [^\n\r]
linebreak ::= "\n" | "\r\n" | "\r"
monetaryAmount ::= amount | "-"
amount ::= digits "." decimalPart
digits ::= [0-9]+
decimalPart ::= [0-9]{2}

while adapting the prompt with:

Output the title on line 1, the amount (float or - if there is none) on line 2.

(when the sampling is not constrained by the grammar, the output matches the format requested in the prompt most of the time, but not always.)

This way, the model is not required to output useless tokens.

Ollama does not (yet) support passing arbitrary GBNF grammars, so this requires using llama.cpp directly (e.g. through llama-server). Note that ollama GGUF models can in many cases [8] be used directly in llama.cpp:

$ # From the ollama data/models folder
$ llama-server -m blobs/sha256-(cat manifests/registry.ollama.ai/library/qwen2.5/14b-instruct-q5_K_M | jq -r '.layers[] | select(.mediaType == "application/vnd.ollama.image.model") | .digest' | cut -d: -f2) ...
[8] Although not Gemma 3; get the GGUF directly here.
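
For reference, a sketch of a client request passing this grammar to llama-server, assuming its /completion endpoint with the grammar and n_predict fields and a server listening on localhost:8080:

use serde_json::{json, Value};

// The compact grammar from above, embedded as a string.
const GRAMMAR: &str = r#"
root ::= string linebreak monetaryAmount
string ::= char+
char ::= [^\n\r]
linebreak ::= "\n" | "\r\n" | "\r"
monetaryAmount ::= amount | "-"
amount ::= digits "." decimalPart
digits ::= [0-9]+
decimalPart ::= [0-9]{2}
"#;

fn extract(prompt: &str) -> Result<(String, Option<String>), Box<dyn std::error::Error>> {
    let response: Value = reqwest::blocking::Client::new()
        .post("http://localhost:8080/completion")
        .json(&json!({
            "prompt": prompt,
            "n_predict": 32,    // title + amount fit comfortably
            "grammar": GRAMMAR  // constrain sampling to "<title>\n<amount or ->"
        }))
        .send()?
        .error_for_status()?
        .json()?;

    // The constrained output is the title on line 1 and the amount (or "-") on line 2.
    let content = response["content"].as_str().unwrap_or_default();
    let mut lines = content.lines();
    let title = lines.next().unwrap_or_default().to_string();
    let amount = lines.next().filter(|a| *a != "-").map(str::to_string);
    Ok((title, amount))
}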

Parallel generation

The consistently large difference in speed between prompt processing and generation stems from the fact that the former is parallelized across tokens, while the latter cannot be, due to the autoregressive nature of generation.

Performing parallel queries, which can be batched, significantly speeds up the generation phase.
This is supported by llama-server: queries are queued and assigned to “slots”, which are then batched together. The total context size given by the --ctx-size flag (dictated by the available memory) is divided evenly among the slots, whose number is set by the --parallel flag.

The following table is produced with the llama-batched-bench example from llama.cpp, and gives timings for a prompt of size 1000 and a generation of 10 tokens (matching our use case), with batch size from 1 to 4, on Mistral Small 3:

Batch size  Prompt processing [t/s]  Generation [t/s]  Total time [s]
1           585.44                    9.87             2.722
2           569.59                   18.67             4.582
3           594.28                   24.99             6.248
4           586.26                   29.10             8.198

A ~30% gain in documents per second can be observed when documents are processed in parallel (with an underlying 3x gain in raw generation speed). This is easy to do client-side.

This is useful for Mistral Small 3 (up to 4 batches of 5k tokens can fit in memory), but not for Gemma 3 until the KV cache quantization issue is fixed, as only one batch fits in memory.
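
A sketch of this client-side parallelism, reusing the hypothetical retitle() function from the earlier sketch and spreading the documents over as many threads as the server has slots (llama-server then queues and batches the concurrent requests):

use std::thread;

// Process documents concurrently; `slots` should match the server's --parallel value.
fn retitle_all(document_ids: Vec<u64>, slots: usize) {
    // Round-robin the documents over the workers.
    let mut chunks: Vec<Vec<u64>> = vec![Vec::new(); slots];
    for (i, id) in document_ids.into_iter().enumerate() {
        chunks[i % slots].push(id);
    }
    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            thread::spawn(move || {
                for id in chunk {
                    if let Err(e) = retitle(id) {
                        eprintln!("document {id}: {e}");
                    }
                }
            })
        })
        .collect();
    for handle in handles {
        let _ = handle.join();
    }
}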

Context length

It is important to manage the available context length properly, especially when dealing with potentially large documents and limited VRAM. In the worst case, a long document fills the whole context window, the beginning of the prompt (including the instructions) is shifted out, and the model keeps generating without ever producing the requested fields.

Shifting the context and allowing infinite output is the default in llama.cpp, triggering this problem.
On the opposite end, a frequent issue when using ollama is that inputs are silently truncated to the default context size of 2048 tokens.
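
A simple client-side safeguard (a rough sketch, counting ~4 characters per token instead of using the model's actual tokenizer) is to truncate the document text before sending it, and to set an explicit limit on the number of generated tokens (n_predict / max_tokens):

// Truncate the OCR'ed text so that prompt plus output fit within the context window.
fn truncate_for_context(content: &str, ctx_size: usize, reserved_tokens: usize) -> &str {
    let max_chars = ctx_size.saturating_sub(reserved_tokens) * 4; // crude tokens-to-chars estimate
    match content.char_indices().nth(max_chars) {
        Some((byte_offset, _)) => &content[..byte_offset], // cut at a character boundary
        None => content,                                   // already short enough
    }
}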

See also this issue on doclytics or this one on paperless-ai.

llama.cpp compilation options

The Arch User Repository provides a llama.cpp-cuda package.
I was initially surprised to see that prompt evaluation was 5x slower than under ollama (although with a 2x faster generation speed), see this issue.

Comparing compilation options between ollama and the AUR package, it turns out that the -DGGML_BLAS=ON build option enabled in the latter moves more computations than expected to the CPU.
Disabling it gives the expected performance. It would, however, be nice to also get the faster generation speed.

Plugging the APIs together

Implementing the general mode of operation described above is then relatively trivial.

Minor details in addition to the above:

See the Rust implementation here.


Screenshot of the CLI tool reassigning titles from the scanner’s scan.pdf.