Support PDF file build #110
Conversation
```python
if self._is_table_chunk(chunk):
    continue

display_text = getattr(chunk, "text", "") or ""
```
❓ In what case does a chunk have no "text" attribute?
Shouldn't we ignore the chunk in that case? It doesn't make sense IMO to have a `display_text` that is an empty string. Even if we manage to contextualise the chunk below and get a good description, the search would then return an empty string as its result...
Good point. I will adapt the implementation. Thank you for noticing.
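A minimal sketch of what that adaptation could look like, reusing the loop from the diff above (`_is_table_chunk` and `chunker.contextualize` come from the diff; the surrounding `chunks` loop and the skip condition are illustrative assumptions):

```python
for chunk in chunks:
    if self._is_table_chunk(chunk):
        continue

    display_text = getattr(chunk, "text", "") or ""
    if not display_text.strip():
        # No text to show in search results, so there is no point in
        # embedding a contextualised description of an empty chunk.
        continue

    embed_text = chunker.contextualize(chunk=chunk)
```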
```python
display_text = getattr(chunk, "text", "") or ""
embed_text = chunker.contextualize(chunk=chunk)

for part in self.splitter.split(embed_text, tokenizer=tokenizer):
```
Aren't there some parameters we can give to the chunker directly so that it knows the maximum size a contextualisation can be?
It feels a bit strange to apply this splitting after the "contextualisation" was computed: we would be dividing that contextualisation arbitrarily, without knowing the context, so we might end up with embeddings that have lost all of their meaning. It would be great if the method that creates the contextualisation (and hence knows the full context) were aware of the limitation.
Actually, you're already giving the `model_name` and `max_tokens` to the Tokenizer when it is initialised. Shouldn't that be enough for it to know it shouldn't create a context bigger than `max_tokens`?
Indeed, splitting after contextualization can be suboptimal if we split away the context anchor (headings, etc.). In practice, the tokenizer's `max_tokens` controls chunking, but contextualization can still push a chunk over budget, so we need a safety net. I'll keep the post-contextualization splitting, but I'll make it preserve the prefix. That way, each embedding slice retains the same semantic context instead of being an arbitrary fragment.
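A rough sketch of that prefix-preserving variant, under the assumption that the contextualisation has the form `<context prefix> + <original chunk text>` and that the splitter accepts a per-call token budget (both are assumptions; the helper name is hypothetical):

```python
def split_with_context(embed_text: str, chunk_text: str, splitter, tokenizer, tokens_budget: int) -> list[str]:
    # Recover the context prefix that contextualize() prepended to the chunk body.
    idx = embed_text.rfind(chunk_text)
    prefix = embed_text[:idx] if idx != -1 else ""

    # Reserve room for the prefix so that prefix + slice stays within the budget.
    body_budget = max(tokens_budget - tokenizer.count_tokens(prefix), 1)

    # Split only the chunk body, then re-attach the same prefix to every slice so
    # each embedding keeps the headings/anchors it was contextualised with.
    parts = splitter.split(chunk_text, tokenizer=tokenizer, max_tokens=body_budget)
    return [prefix + part for part in parts]
```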
| """ | ||
|
|
||
| model_name: str = "nomic-ai/nomic-embed-text-v1.5" | ||
| tokens_budget: int = 1900 |
So those default values correspond to the defaults of "nomic-ai/nomic-embed-text-v1.5"?
I think we should move this class closer to where we choose which model to use for embeddings: the `model_name` could use the same constant we have for the embeddings model. That would also help us remember to change these hardcoded values if we ever change the default model.
For now, I think it's fine to hardcode it in your Chunker, but I guess what you had in mind was that this Policy should be provided to the Plugin in `divide_context_into_chunks`?
Hi! Indeed, this is a piece that is hardcoded and could probably be done better if we had access to the embedding model (most embedding models can tell you their maximum token budget, so it could even be configured dynamically). The reason I left it as is for now is that we always use the same model. I think moving to a non-hardcoded version would make more sense once we add the functionality for allowing any model to do the embedding.
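For reference, a sketch of how the budget could be derived from the model rather than hardcoded, assuming the embedding model's tokenizer is available through Hugging Face `transformers` (the `ChunkPolicy` name, the `from_model` constructor and the safety margin are assumptions):

```python
from dataclasses import dataclass

from transformers import AutoTokenizer


@dataclass
class ChunkPolicy:
    model_name: str = "nomic-ai/nomic-embed-text-v1.5"
    tokens_budget: int = 1900

    @classmethod
    def from_model(cls, model_name: str, safety_margin: int = 128) -> "ChunkPolicy":
        # Most embedding models expose their context window on the tokenizer;
        # keep a margin for special tokens and the contextualisation prefix.
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        return cls(model_name=model_name, tokens_budget=tokenizer.model_max_length - safety_margin)
```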
```python
if tokenizer.count_tokens(fast_embed) <= self.policy.tokens_budget:
    return [EmbeddableChunk(embeddable_text=fast_embed, content=table_md)]

df = table.export_to_dataframe(doc=doc)
```
Stupid question: is using dataframes like that better than simply dividing the Markdown into chunks? (and re-adding the table header to all chunks)
That's definitely not a stupid question and it can be done. I used this functionality to avoid re-inventing the wheel.
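For comparison, the Markdown-only variant the question describes could look roughly like this, splitting the table into row groups and repeating the header on each chunk (the function name and token accounting are illustrative, not part of the PR):

```python
def split_markdown_table(table_md: str, tokenizer, tokens_budget: int) -> list[str]:
    """Split a Markdown table into chunks, repeating the header row on each one."""
    lines = table_md.strip().splitlines()
    if len(lines) < 2:
        return [table_md]
    header, separator, rows = lines[0], lines[1], lines[2:]

    chunks: list[str] = []
    current: list[str] = []
    for row in rows:
        candidate = "\n".join([header, separator, *current, row])
        if current and tokenizer.count_tokens(candidate) > tokens_budget:
            chunks.append("\n".join([header, separator, *current]))
            current = [row]
        else:
            current.append(row)
    if current:
        chunks.append("\n".join([header, separator, *current]))
    return chunks
```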
```python
converter = DocumentConverter(format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)})
stream = DocumentStream(name=file_name, stream=BytesIO(pdf_bytes))
return converter.convert(stream).document
```
Do you have an example of how bad this looks when serialised to YAML? 🙂
And could it be understood by an LLM if it were given that YAML context file directly?
As we were talking about in Tuesday's meeting, we might need to come up with a solution that creates a context with some meaning but still allows us to re-create the chunks later 🤔
Maybe we actually need to store the path of the PDF file in the context, so that the chunker re-reads the file, and the context could simply give the outline of the file.
Something like:
```yaml
context_metadata:
  original_file_path: my_absolute_path
context:
  table_of_contents:
    ...
```
It looks really ugly. I can share it with you in Slack. I decided not to care too much about it, because it might indeed make sense to get rid of the context just for files.
This PR makes YAML exports deterministic by preserving model-defined
field order when serializing dataclasses and Pydantic models.
PyYAML's default behaviour can produce noisy diffs due to unstable key
ordering.
# Changes
* Updated `yaml.py`'s YAML representer to handle dataclasses and
Pydantic models
* Generic objects (anything with `__dict__`) now serialize public
attributes (those not prefixed with `_`) with keys sorted alphabetically to avoid
nondeterministic attribute order.
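A condensed sketch of the kind of representer this describes, assuming PyYAML and Pydantic v2 (the `DeterministicDumper` name is an assumption; the actual change lives in `yaml.py`):

```python
import dataclasses

import yaml
from pydantic import BaseModel


class DeterministicDumper(yaml.SafeDumper):
    pass


def _represent_object(dumper, obj):
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        # Dataclasses keep their declared field order.
        data = {f.name: getattr(obj, f.name) for f in dataclasses.fields(obj)}
    elif isinstance(obj, BaseModel):
        # Pydantic v2: model_dump() preserves the model-defined field order.
        data = obj.model_dump()
    else:
        # Generic objects: public attributes only, sorted alphabetically so the
        # output does not depend on insertion order.
        data = {k: v for k, v in sorted(vars(obj).items()) if not k.startswith("_")}
    return dumper.represent_dict(data)


# Fallback for any type that does not already have a dedicated representer.
DeterministicDumper.add_multi_representer(object, _represent_object)
```

Dumping with `yaml.dump(obj, Dumper=DeterministicDumper, sort_keys=False)` then keeps the declared field order for models, while generic objects stay alphabetical.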
# Limitations
Database samples can't be guaranteed to be deterministic. At the moment,
a sampling query looks like this:
```sql
SELECT * FROM "{schema}"."{table}"
```
At the point in the code where we obtain samples, we know the columns
of the table, so an `ORDER BY` clause could be created. This, however,
would still not guarantee that the same rows are returned.
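For illustration, building that clause from the known columns could look roughly like this (the function name and the `LIMIT` parameter are assumptions; a real implementation should use the driver's identifier quoting rather than naive string formatting):

```python
def build_sample_query(schema: str, table: str, columns: list[str], limit: int = 100) -> str:
    # Ordering by every known column makes the sample stable while the data does
    # not change; it still cannot guarantee the same rows after inserts or deletes.
    order_by = ", ".join(f'"{col}"' for col in columns)
    return f'SELECT * FROM "{schema}"."{table}" ORDER BY {order_by} LIMIT {limit}'
```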
This PR adds PDF support using Docling.
# Changes
## PDFPlugin
* Uses Docling's `DocumentConverter`; the output of `build_file_context()` is the Docling document object.
## Docling Chunking Service
* Table chunks are marked with a `TABLE` header.
## Tables
* If the whole table fits in the token budget, we embed a single chunk whose `embeddable_text` covers the entire table.
* If the markdown would exceed the token budget, we embed multiple chunks, each with its own `embeddable_text`. Each chunk repeats the table context so rows are interpretable on their own.
* The display text in both cases is Markdown.
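A sketch of that row-batching step, assuming the table is available as a `pandas.DataFrame` (as in the `export_to_dataframe` call reviewed above) and that `table_context` carries the caption/column description that gets repeated (the names here are illustrative):

```python
import pandas as pd


def table_to_embeddable_texts(df: pd.DataFrame, table_context: str, tokenizer, tokens_budget: int) -> list[str]:
    """Group table rows into batches that fit the budget, repeating the table context on each."""
    texts: list[str] = []
    batch: list[str] = []
    for _, row in df.iterrows():
        row_text = ", ".join(f"{col}: {row[col]}" for col in df.columns)
        candidate = "\n".join([table_context, *batch, row_text])
        if batch and tokenizer.count_tokens(candidate) > tokens_budget:
            texts.append("\n".join([table_context, *batch]))
            batch = [row_text]
        else:
            batch.append(row_text)
    if batch:
        texts.append("\n".join([table_context, *batch]))
    return texts
```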