Feedback from DPK + RAG workshop from NumHack hackathon (2024-11-14) #801

sujee · 2024-11-14T20:05:45Z

sujee
Nov 14, 2024

Q: is there a case for running named entity recognition first, to prevent fuzzy dedupe from treating segments that differ only in important words (such as names) as duplicates?

So if I have 2 segments like

John won the pickleball tournament
Jane won the pickleball tournament

Will fuzzy dedupe eliminate one of them, even though they mean very different things?
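To make the concern concrete, here is a minimal sketch that scores the two segments with Jaccard similarity over word shingles, a measure that fuzzy dedup approaches commonly approximate. It is not DPK's fuzzy dedup implementation; the helper names `shingles` and `jaccard` and the shingle size are assumptions chosen for illustration.

```python
def shingles(text: str, n: int = 1) -> set[tuple[str, ...]]:
    """Return the set of lowercased word n-grams (shingles) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


seg_a = "John won the pickleball tournament"
seg_b = "Jane won the pickleball tournament"

# 4 shared words out of 6 distinct words overall
score = jaccard(shingles(seg_a), shingles(seg_b))
print(f"{score:.2f}")
```

Whether such a pair is actually dropped depends on the shingle size and the similarity threshold in use. The longer the shared text around the differing name, the closer the score gets to 1.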

Q: is there ever a case for intentionally including duplicated data to "bring more attention" to that data?

This concerns eliminating duplicate chunks. But we are moving towards deduping documents instead of chunks, so it may be a moot point. However, we can point to some studies / benchmarks on chunk-level deduping.
@Bytes-Explorer

Q: can we filter out page-footers? Are page-footers also included in chunks?

I know footers are captured by pdf2pq. Is there a flag to ignore them?
One way I can think of to eliminate footers is to parse the contents column and remove the footer sections. Is there any other way?
@dolfim-ibm
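A minimal sketch of the "parse the contents column and remove footers" idea. The JSON layout (an `items` list whose entries carry `label` and `text`) and the label names `page-footer` / `page-header` are assumptions for illustration only; the real structure produced by pdf2pq may differ.

```python
import json

# Assumed label names, taken from the wording used in this thread.
EXCLUDED_LABELS = {"page-footer", "page-header"}


def drop_labeled_items(contents_json: str, excluded=EXCLUDED_LABELS) -> list[dict]:
    """Parse one contents cell and keep only items whose label is not excluded."""
    doc = json.loads(contents_json)
    # "items" is a hypothetical key, not a confirmed pdf2pq field name.
    return [item for item in doc["items"] if item.get("label") not in excluded]


# Hypothetical contents cell for a single document.
sample = json.dumps({
    "items": [
        {"label": "paragraph", "text": "John won the pickleball tournament"},
        {"label": "page-footer", "text": "Page 3 of 10"},
    ]
})

kept = drop_labeled_items(sample)
print([item["text"] for item in kept])
```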

Q: Controlling PDF parsing

From: https://discord.com/channels/1276554812359442504/1306700512795824198

Hello, my question was about parsing with Docling. From my understanding, the document is parsed using layout detection, identifying elements like headers, footers, etc. How can we configure this to exclude elements such as page footers, the Table of Contents, etc. from further processing?

@dolfim-ibm ?

Q: are there examples with agentic rag with dataprepkit?

coming soon 😄
@Bytes-Explorer @shahrokhDaijavad

blublinsky · 2024-11-14T21:20:21Z

blublinsky
Nov 14, 2024
Collaborator

@sujee fuzzy dedup runs on complete documents, not partial ones. So unless you chunk documents into very small segments, I do not think you will ever run into this. Maybe run fuzzy before chunking and exact after?
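As an illustration of the "exact after chunking" step, here is a minimal sketch that drops chunks whose text is byte-for-byte identical after trimming whitespace. It is not DPK's exact dedup transform; the function name `exact_dedup` and the choice of SHA-256 are assumptions.

```python
import hashlib


def exact_dedup(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each chunk, comparing by content hash."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


chunks = [
    "John won the pickleball tournament",
    "Jane won the pickleball tournament",
    "John won the pickleball tournament",  # exact repeat of the first chunk
]

unique_chunks = exact_dedup(chunks)
print(unique_chunks)
```

Because only identical text collides, the John and Jane chunks both survive; only the literal repeat is removed.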

1 reply

sujee Nov 15, 2024
Author

fair assessment.
We started with chunk dedupe. But we are going for doc-level dedupe. Will let you know.
thx!

Bytes-Explorer · 2024-11-15T04:46:52Z

Bytes-Explorer
Nov 15, 2024
Maintainer

@sujee I support going to document-level dedup and then chunking, as we have been discussing.

0 replies

dolfim-ibm · 2024-11-15T12:46:39Z

dolfim-ibm
Nov 15, 2024
Collaborator

Q: can we filter out page-footers? Are page-footers also included in chunks?

I know footers are captured by pdf2pq. Is there a flag to ignore them? One way I can think of to eliminate footers is to parse the contents column and remove the footer sections. Is there any other way?

Page headers and footers should already be removed from 1) the export to markdown, 2) the document chunking.
If there is any example where they are retained, we should have a look at the specific case.

0 replies

dolfim-ibm · 2024-11-15T12:52:27Z

dolfim-ibm
Nov 15, 2024
Collaborator

Q: Controlling PDF parsing

From: https://discord.com/channels/1276554812359442504/1306700512795824198

Hello, my question was about parsing with Docling. From my understanding, the document is parsed using layout detection, identifying elements like headers, footers, etc. How can we configure this to exclude elements such as page footers, the Table of Contents, etc. from further processing?

@dolfim-ibm ?

Docling allows excluding some items from the exports (e.g. markdown exports or chunking). Page headers and footers are excluded automatically, so this should already be happening in DPK as well.

The default list of exported labels is here: https://github.com/DS4SD/docling-core/blob/main/docling_core/types/doc/document.py#L38
I recently posted a minimal example of excluding a label from it: DS4SD/docling#172 (reply in thread)
At the moment we don't expose this as a DPK config parameter. It would be quite a heavy parameter. Maybe the simplest would be to provide an exclude list.
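A rough sketch of how an exclude list could be resolved against a default label set. Everything here is hypothetical: the label strings are an illustrative subset and not the actual default list linked above, and `resolve_export_labels` is not an existing Docling or DPK function.

```python
# Illustrative subset of label names; NOT the real default export list.
DEFAULT_EXPORT_LABELS = frozenset({
    "title",
    "section_header",
    "paragraph",
    "table",
    "list_item",
    "document_index",
})


def resolve_export_labels(exclude: list[str],
                          defaults: frozenset = DEFAULT_EXPORT_LABELS) -> set[str]:
    """Return the default labels minus the user-supplied exclude list."""
    unknown = set(exclude) - defaults
    if unknown:
        # Fail loudly on typos instead of silently ignoring them.
        raise ValueError(f"unknown labels in exclude list: {sorted(unknown)}")
    return set(defaults) - set(exclude)


# Example: drop the Table of Contents from exports and chunking.
labels = resolve_export_labels(["document_index"])
print(sorted(labels))
```

An exclude list keeps the configuration small: users name only what they want removed, and the defaults can evolve without breaking existing configs.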

0 replies
