Fix malformed input by creating missing Top-Level-Section-Titles from Table Of Contents #58
Elijas started this conversation in General Chat & Ideas
-
@deenaawny-github-account
Continued in Discord:
-
I would like to work on fixing this!
-
https://www.sec.gov/Archives/edgar/data/19617/000001961723000432/jpm-20230630.htm
The JPM 10-Q report does not have any text headings for the part1item1 (Financial Statements) and part1item2 (Management's Discussion) top-level sections. However, the report does have Table of Contents links pointing to the general area where these sections start.
The task is to insert "dummy" TopLevelSectionTitle elements where they belong.
FAQ
How to insert an element?
Let's do a hypothetical example of inserting a TitleElement above a TextElement. The ProcessingStep walks through all elements in a page. Once we find the TextElement we want to insert the TitleElement above, we do so by replacing the TextElement with a CompositeElement with two inner_elements. The first is the newly created TitleElement, and the second is the TextElement we previously had.
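The replacement described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: the three dataclasses are simplified stand-ins that only borrow the names TextElement, TitleElement, CompositeElement and inner_elements from the description, and insert_title_above is a hypothetical helper invented for this example.

```python
from dataclasses import dataclass, field


# Simplified stand-ins for illustration only; the real element classes differ.
@dataclass
class TextElement:
    text: str


@dataclass
class TitleElement:
    text: str


@dataclass
class CompositeElement:
    inner_elements: list = field(default_factory=list)


def insert_title_above(elements, target_text, title_text):
    """Hypothetical helper: walk all elements and wrap the matching
    TextElement in a CompositeElement that holds a new TitleElement first
    and the original TextElement second."""
    result = []
    for element in elements:
        if isinstance(element, TextElement) and element.text == target_text:
            title = TitleElement(title_text)
            result.append(CompositeElement(inner_elements=[title, element]))
        else:
            # Every other element is passed through unchanged.
            result.append(element)
    return result


elements = [TextElement("intro"), TextElement("statements body")]
processed = insert_title_above(elements, "statements body", "Financial Statements")
```

The list keeps its length: the TextElement is not moved, it is replaced in place by a container that holds both the new title and the old text.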
What if I want to scan the document first and only then make insertions?
Set _NUM_ITERATIONS = 2 and your new processing step, which inherits from ElementwiseProcessingStep, will iterate over all elements twice. Here is an example.
Where to store temporary data while processing the document?
In instance attributes (self. variables) of the ElementwiseProcessingStep class. These values are reset for every newly parsed document.
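The last two answers combine naturally: the first iteration only scans and records what it sees in self. variables, and the second iteration makes the insertions. The sketch below shows that pattern under stated assumptions. TwoPassTitleInserter is a hypothetical stand-in that mimics the behaviour described above; it does not inherit from the real ElementwiseProcessingStep, and the element classes, the process method, and the toc_targets mapping are all invented for illustration.

```python
from dataclasses import dataclass, field


# Simplified stand-ins for illustration only; the real element classes differ.
@dataclass
class TextElement:
    text: str


@dataclass
class TopLevelSectionTitle:
    identifier: str


@dataclass
class CompositeElement:
    inner_elements: list = field(default_factory=list)


class TwoPassTitleInserter:
    """Hypothetical step: pass 1 records which section titles already exist,
    pass 2 inserts dummy titles only for the sections that are missing."""

    _NUM_ITERATIONS = 2

    def __init__(self, toc_targets):
        # Assumed input: maps the text a Table of Contents link points at
        # to the section identifier that should start there.
        self._toc_targets = toc_targets
        self._existing = set()

    def process(self, elements):
        # Temporary data lives on self and is reset for every new document.
        self._existing = set()
        for iteration in range(self._NUM_ITERATIONS):
            elements = [self._process_element(e, iteration) for e in elements]
        return elements

    def _process_element(self, element, iteration):
        if iteration == 0:
            # First pass: scan only, change nothing.
            if isinstance(element, TopLevelSectionTitle):
                self._existing.add(element.identifier)
            return element
        # Second pass: wrap link targets whose section title is missing.
        if isinstance(element, TextElement):
            identifier = self._toc_targets.get(element.text)
            if identifier is not None and identifier not in self._existing:
                title = TopLevelSectionTitle(identifier)
                return CompositeElement(inner_elements=[title, element])
        return element


step = TwoPassTitleInserter(
    {"statements body": "part1item1", "discussion body": "part1item2"}
)
document = [
    TopLevelSectionTitle("part1item2"),
    TextElement("discussion body"),
    TextElement("statements body"),
]
result = step.process(document)
```

In this example only the part1item1 title is inserted, because the scan in the first pass already found a part1item2 title in the document.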