Support PDF file build #110
Conversation
```python
if self._is_table_chunk(chunk):
    continue

display_text = getattr(chunk, "text", "") or ""
```
❓ In what case does a chunk have no "text" attribute?
Shouldn't we ignore the chunk in that case? It doesn't make sense IMO to have a `display_text` that is an empty string. Even if we manage to contextualise the chunk below and get a good description, the search would then return an empty string as its result...
Good point. I will adapt the implementation. Thank you for noticing.
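A minimal sketch of what that adaptation could look like, reusing the loop from the diff above (`_is_table_chunk` and `chunker.contextualize` come from the diff; the surrounding `chunks` loop and the skip condition are illustrative assumptions):

```python
for chunk in chunks:
    if self._is_table_chunk(chunk):
        continue

    display_text = getattr(chunk, "text", "") or ""
    if not display_text.strip():
        # No text to show in search results, so there is no point in
        # embedding a contextualised description of an empty chunk.
        continue

    embed_text = chunker.contextualize(chunk=chunk)
```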
```python
display_text = getattr(chunk, "text", "") or ""
embed_text = chunker.contextualize(chunk=chunk)

for part in self.splitter.split(embed_text, tokenizer=tokenizer):
```
Aren't there some parameters we can give to the chunker directly so that it knows the maximum size a contextualisation can be?
It feels a bit strange to apply this splitting after the "contextualisation" was computed: we would be dividing that contextualisation arbitrarily, without knowing the context, so we might end up with embeddings that have lost all of their meaning. It would be great if the method that creates the contextualisation (and hence knows the full context) were aware of the limitation.
Actually, you're already giving the `model_name` and `max_tokens` to the Tokenizer when it is initialised. Shouldn't that be enough for it to know it shouldn't create a context bigger than `max_tokens`?
Indeed, splitting after contextualization can be suboptimal if we split away the context anchor (headings, etc.). In practice, the tokenizer's `max_tokens` controls chunking, but contextualization can still push a chunk over budget, so we need a safety net. I'll keep the post-contextualization splitting, but I'll make it preserve the prefix. That way, each embedding slice retains the same semantic context instead of being an arbitrary fragment.
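A rough sketch of that prefix-preserving variant, under the assumption that the contextualisation has the form `<context prefix> + <original chunk text>` and that the splitter accepts a per-call token budget (both are assumptions; the helper name is hypothetical):

```python
def split_with_context(embed_text: str, chunk_text: str, splitter, tokenizer, tokens_budget: int) -> list[str]:
    # Recover the context prefix that contextualize() prepended to the chunk body.
    idx = embed_text.rfind(chunk_text)
    prefix = embed_text[:idx] if idx != -1 else ""

    # Reserve room for the prefix so that prefix + slice stays within the budget.
    body_budget = max(tokens_budget - tokenizer.count_tokens(prefix), 1)

    # Split only the chunk body, then re-attach the same prefix to every slice so
    # each embedding keeps the headings/anchors it was contextualised with.
    parts = splitter.split(chunk_text, tokenizer=tokenizer, max_tokens=body_budget)
    return [prefix + part for part in parts]
```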
| """ | ||
|
|
||
| model_name: str = "nomic-ai/nomic-embed-text-v1.5" | ||
| tokens_budget: int = 1900 |
So those default values correspond to the defaults of "nomic-ai/nomic-embed-text-v1.5"?
I think we should move this class closer to where we choose which model to use for embeddings: the `model_name` could use the same constant we have for the embeddings model. That would also help us remember to change these hardcoded values if we ever change the default model.
For now, I think it's fine to hardcode it in your Chunker, but I guess what you had in mind was that this Policy should be provided to the Plugin in `divide_context_into_chunks`?
Hi! Indeed, this is a piece that is hardcoded and could probably be done better if we had access to the embedding model (most embedding models can tell you their maximum token budget, so it could even be configured dynamically). The reason I left it as is for now is that we always use the same model. I think moving to a non-hardcoded version would make more sense once we add the functionality for allowing any model to do the embedding.
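For reference, a sketch of how the budget could be derived from the model rather than hardcoded, assuming the embedding model's tokenizer is available through Hugging Face `transformers` (the `ChunkPolicy` name, the `from_model` constructor and the safety margin are assumptions):

```python
from dataclasses import dataclass

from transformers import AutoTokenizer


@dataclass
class ChunkPolicy:
    model_name: str = "nomic-ai/nomic-embed-text-v1.5"
    tokens_budget: int = 1900

    @classmethod
    def from_model(cls, model_name: str, safety_margin: int = 128) -> "ChunkPolicy":
        # Most embedding models expose their context window on the tokenizer;
        # keep a margin for special tokens and the contextualisation prefix.
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        return cls(model_name=model_name, tokens_budget=tokenizer.model_max_length - safety_margin)
```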
```python
if tokenizer.count_tokens(fast_embed) <= self.policy.tokens_budget:
    return [EmbeddableChunk(embeddable_text=fast_embed, content=table_md)]

df = table.export_to_dataframe(doc=doc)
```
Stupid question: is using dataframes like that better than simply dividing the Markdown into chunks? (and re-adding the table header to all chunks)
That's definitely not a stupid question and it can be done. I used this functionality to avoid re-inventing the wheel.
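For comparison, the Markdown-only variant the question describes could look roughly like this, splitting the table into row groups and repeating the header on each chunk (the function name and token accounting are illustrative, not part of the PR):

```python
def split_markdown_table(table_md: str, tokenizer, tokens_budget: int) -> list[str]:
    """Split a Markdown table into chunks, repeating the header row on each one."""
    lines = table_md.strip().splitlines()
    if len(lines) < 2:
        return [table_md]
    header, separator, rows = lines[0], lines[1], lines[2:]

    chunks: list[str] = []
    current: list[str] = []
    for row in rows:
        candidate = "\n".join([header, separator, *current, row])
        if current and tokenizer.count_tokens(candidate) > tokens_budget:
            chunks.append("\n".join([header, separator, *current]))
            current = [row]
        else:
            current.append(row)
    if current:
        chunks.append("\n".join([header, separator, *current]))
    return chunks
```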
```python
converter = DocumentConverter(format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)})
stream = DocumentStream(name=file_name, stream=BytesIO(pdf_bytes))
return converter.convert(stream).document
```
Do you have an example of how bad this looks when serialised to YAML? 🙂
And could it be understood by an LLM if it were given that YAML context file directly?
As we were talking about in Tuesday's meeting, we might need to come up with a solution that creates a context with some meaning but still allows us to re-create the chunks later 🤔
Maybe we actually need to store the path of the PDF file in the context, so that the chunker re-reads the file, and the context could simply give the outline of the file.
Something like:
```yaml
context_metadata:
  original_file_path: my_absolute_path
context:
  table_of_contents:
    ...
```
It looks really ugly. I can share it with you in Slack. I decided not to care too much about it, because it might indeed make sense to get rid of the context just for files.
This PR makes YAML exports deterministic by preserving model-defined
field order when serializing dataclasses and Pydantic models.
PyYAML's default behaviour can produce noisy diffs due to unstable key
ordering.
# Changes
* Updated `yaml.py`'s YAML representer to handle dataclasses and
Pydantic models
* Generic objects (anything with `__dict__`) now serialize public
attributes (those not prefixed with `_`) with keys sorted alphabetically to avoid
nondeterministic attribute order.
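A condensed sketch of the kind of representer this describes, assuming PyYAML and Pydantic v2 (the `DeterministicDumper` name is an assumption; the actual change lives in `yaml.py`):

```python
import dataclasses

import yaml
from pydantic import BaseModel


class DeterministicDumper(yaml.SafeDumper):
    pass


def _represent_object(dumper, obj):
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        # Dataclasses keep their declared field order.
        data = {f.name: getattr(obj, f.name) for f in dataclasses.fields(obj)}
    elif isinstance(obj, BaseModel):
        # Pydantic v2: model_dump() preserves the model-defined field order.
        data = obj.model_dump()
    else:
        # Generic objects: public attributes only, sorted alphabetically so the
        # output does not depend on insertion order.
        data = {k: v for k, v in sorted(vars(obj).items()) if not k.startswith("_")}
    return dumper.represent_dict(data)


# Fallback for any type that does not already have a dedicated representer.
DeterministicDumper.add_multi_representer(object, _represent_object)
```

Dumping with `yaml.dump(obj, Dumper=DeterministicDumper, sort_keys=False)` then keeps the declared field order for models, while generic objects stay alphabetical.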
# Limitations
Database samples can't be guaranteed to be deterministic. At the moment,
a sampling query looks like this:
```sql
SELECT * FROM "{schema}"."{table}"
```
At the point in the code where we obtain samples, we know the columns
of the table, so an `ORDER BY` clause could be created. This, however,
would still not guarantee that the same rows are returned.
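For illustration, building that clause from the known columns could look roughly like this (the function name and the `LIMIT` parameter are assumptions; a real implementation should use the driver's identifier quoting rather than naive string formatting):

```python
def build_sample_query(schema: str, table: str, columns: list[str], limit: int = 100) -> str:
    # Ordering by every known column makes the sample stable while the data does
    # not change; it still cannot guarantee the same rows after inserts or deletes.
    order_by = ", ".join(f'"{col}"' for col in columns)
    return f'SELECT * FROM "{schema}"."{table}" ORDER BY {order_by} LIMIT {limit}'
```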
This PR adds PDF support using Docling.
# Changes
## PDFPlugin
* Uses Docling's `DocumentConverter`; the output of `build_file_context()` is the Docling document object.
## Docling Chunking Service
* Table chunks are marked with a `TABLE` header.
## Tables
* If the whole table fits in the token budget, we embed a single chunk whose `embeddable_text` covers the entire table.
* If the markdown would exceed the token budget, we embed multiple chunks, each with its own `embeddable_text`. Each chunk repeats the table context so rows are interpretable on their own.
* The display text in both cases is Markdown.
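A sketch of that row-batching step, assuming the table is available as a `pandas.DataFrame` (as in the `export_to_dataframe` call reviewed above) and that `table_context` carries the caption/column description that gets repeated (the names here are illustrative):

```python
import pandas as pd


def table_to_embeddable_texts(df: pd.DataFrame, table_context: str, tokenizer, tokens_budget: int) -> list[str]:
    """Group table rows into batches that fit the budget, repeating the table context on each."""
    texts: list[str] = []
    batch: list[str] = []
    for _, row in df.iterrows():
        row_text = ", ".join(f"{col}: {row[col]}" for col in df.columns)
        candidate = "\n".join([table_context, *batch, row_text])
        if batch and tokenizer.count_tokens(candidate) > tokens_budget:
            texts.append("\n".join([table_context, *batch]))
            batch = [row_text]
        else:
            batch.append(row_text)
    if batch:
        texts.append("\n".join([table_context, *batch]))
    return texts
```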