-
@sujee Fuzzy dedup runs on complete documents, not partial ones. So unless you chunk documents into very small segments, I do not think you will ever run into this. Maybe run fuzzy dedup before chunking and exact dedup after?
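To make the suggested ordering concrete, here is a minimal sketch of the "exact after chunking" step. It assumes the documents have already passed document-level fuzzy dedup; `chunk_text` and `exact_dedup` are hypothetical helpers written for illustration, not DPK transforms.

```python
import hashlib


def chunk_text(text: str, size: int = 40) -> list[str]:
    """Hypothetical chunker: fixed number of whitespace-separated words per chunk."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def exact_dedup(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each chunk, compared by a hash of normalized text."""
    seen = set()
    unique = []
    for chunk in chunks:
        # Normalize case and whitespace so trivially different copies hash the same.
        normalized = " ".join(chunk.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


# Assumption: these documents already survived document-level fuzzy dedup.
docs = ["alpha beta gamma delta", "alpha beta epsilon zeta"]
chunks = [c for d in docs for c in chunk_text(d, size=2)]
unique_chunks = exact_dedup(chunks)
print(unique_chunks)
```

In this toy input the two documents share the chunk "alpha beta", so the exact pass keeps only one copy of it while both documents remain.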
-
@sujee I support moving to document-level dedup followed by chunking, as we have been discussing.
-
Page headers and footers should already be removed from 1) the export to Markdown and 2) the document chunking.
-
Docling allows excluding some items from the exports (e.g. Markdown exports or chunking). Page headers and footers are excluded automatically, so this should already be happening in DPK as well. The default list of exported labels is here: https://github.com/DS4SD/docling-core/blob/main/docling_core/types/doc/document.py#L38
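As a rough illustration of label-based export, here is a library-free sketch. The item layout (`label` / `text` dicts), the `EXPORT_LABELS` set, and `export_text` are all hypothetical and are not Docling's actual schema or API; the real default list is the one linked above.

```python
# Hypothetical allow-list of labels to export; page headers/footers are left out.
EXPORT_LABELS = {"title", "section_header", "text", "list_item", "table"}


def export_text(items: list[dict], labels: set[str] = EXPORT_LABELS) -> str:
    """Join the text of items whose label is in the allow-list."""
    return "\n\n".join(item["text"] for item in items if item["label"] in labels)


# Hypothetical parsed page: each item carries a label and its text.
items = [
    {"label": "page_header", "text": "ACME Annual Report"},
    {"label": "section_header", "text": "Results"},
    {"label": "text", "text": "Revenue grew."},
    {"label": "page_footer", "text": "Page 3 of 20"},
]

exported = export_text(items)
print(exported)
```

Because the export is driven by an allow-list, anything labeled as a page header or footer is dropped without any post-processing of the exported text.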
-
Q: Is there a case for running named entity recognition first, so that fuzzy dedupe does not treat segments as duplicates when only an important word changes?
So if I have 2 segments like
John won the pickleball tournament
Jane won the pickleball tournament
Will fuzzy dedupe eliminate one of them? They differ by only one word, but their meanings are very different.
@blublinsky @Bytes-Explorer
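To get a feel for the numbers, here is a small sketch that computes the Jaccard similarity of the two example segments over word shingles. It assumes the fuzzy dedup step compares documents by (an approximation of) Jaccard similarity, as MinHash-based approaches commonly do; the `shingles` and `jaccard` helpers are illustrative, not the DPK implementation.

```python
def shingles(text: str, k: int = 1) -> set[str]:
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


s1 = "John won the pickleball tournament"
s2 = "Jane won the pickleball tournament"

sim_words = jaccard(shingles(s1, k=1), shingles(s2, k=1))    # 4 shared / 6 total
sim_bigrams = jaccard(shingles(s1, k=2), shingles(s2, k=2))  # 3 shared / 5 total
print(round(sim_words, 3), round(sim_bigrams, 3))
```

Whether one segment gets dropped depends on the configured similarity threshold. In a five-word segment a single changed word is a large fraction of the text, so the similarity stays well below 1. In a long document the same one-word change would leave the similarity close to 1, which is where the concern in the question becomes more relevant.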
Q: Is there ever a case for intentionally including duplicated data to "bring more attention" to that data?
This relates to eliminating duplicate chunks. But we are moving towards deduping documents instead of chunks, so this may be a moot point. However, we can point to some studies / benchmarks on chunk-level deduping.
@Bytes-Explorer
Q: Can we filter out page footers? Are page footers also included in chunks?
I know footers are captured by pdf2pq. Is there a flag to ignore them?
One way I can think of to eliminate footers is parsing the `contents` column and removing the `footers` section. Is there any other way? @dolfim-ibm
Q: Controlling PDF parsing
From: https://discord.com/channels/1276554812359442504/1306700512795824198
@dolfim-ibm ?
Q: Are there examples of agentic RAG with dataprepkit?
coming soon 😄
@Bytes-Explorer @shahrokhDaijavad