Fix malformed input by creating missing Top-Level-Section-Titles from Table Of Contents #58

Elijas · 2023-11-27T18:31:20Z

Elijas
Nov 27, 2023
Maintainer

https://www.sec.gov/Archives/edgar/data/19617/000001961723000432/jpm-20230630.htm

The JPM 10-Q report has no text headings of any kind for the part1item1 (Financial Statements) and part1item2 (Management's Discussion) Top Level Sections.

However, the report does have Table of Contents links pointing to the general area of where these sections start.

The task is to insert "dummy" TopLevelSectionTitle elements where they belong.

FAQ

How to insert an element?

Take a hypothetical example of inserting a TitleElement above a TextElement. The ProcessingStep walks through all elements in a page. Once we find the TextElement above which we want to insert the TitleElement, we replace that TextElement with a CompositeElement that has two inner_elements. The first is the newly created TitleElement, and the second is the TextElement we previously had.
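A minimal sketch of that replacement might look like the following. The classes here are simplified stand-ins written for this illustration, not the real sec-parser classes (whose constructors differ), and `insert_title_above` is a hypothetical helper name.

```python
from dataclasses import dataclass


# Stand-in classes for illustration only; the real sec-parser element
# classes are built from HTML tags and have different constructors.
@dataclass
class Element:
    text: str = ""


@dataclass
class TextElement(Element):
    pass


@dataclass
class TitleElement(Element):
    pass


@dataclass
class CompositeElement(Element):
    inner_elements: tuple = ()


def insert_title_above(elements, should_insert, title_text):
    """Wrap each matching TextElement in a CompositeElement that holds
    (new TitleElement, original TextElement). Hypothetical helper."""
    result = []
    for element in elements:
        if isinstance(element, TextElement) and should_insert(element):
            title = TitleElement(text=title_text)
            result.append(CompositeElement(inner_elements=(title, element)))
        else:
            # Everything else passes through unchanged.
            result.append(element)
    return result


elements = [
    TextElement("Intro"),
    TextElement("Consolidated statements of income"),
]
processed = insert_title_above(
    elements,
    should_insert=lambda el: el.text.startswith("Consolidated"),
    title_text="Financial Statements",
)
```

The original TextElement is kept as the second inner element, so no content is lost; only its position in the tree changes.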

What if I want to scan the document first and only then make insertions?

Set _NUM_ITERATIONS = 2, and your new processing step, which inherits from ElementwiseProcessingStep, will iterate over all elements twice. Here is an example.

Where to store temporary data while processing the document?

As instance attributes (self.* variables) of your ElementwiseProcessingStep subclass. These values are reset for every newly parsed document.
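The two answers above (scan first, insert later; keep temporary data on self) can be sketched together. This is a self-contained toy, not the real ElementwiseProcessingStep: the `TwoPassStep` class, its `process` and `_process_element` signatures, and the use of plain strings as elements are all assumptions made for illustration.

```python
class TwoPassStep:
    """Toy stand-in for a step that visits every element twice."""

    _NUM_ITERATIONS = 2

    def process(self, elements):
        # Temporary data lives on self and is reset for each document.
        self._headings_seen = 0
        for iteration in range(self._NUM_ITERATIONS):
            elements = [self._process_element(e, iteration) for e in elements]
        return elements

    def _process_element(self, element, iteration):
        is_heading = element.isupper()
        if iteration == 0:
            # First pass: only collect information, change nothing.
            if is_heading:
                self._headings_seen += 1
            return element
        # Second pass: modify elements using what the first pass found.
        if is_heading:
            return f"{element} [heading, {self._headings_seen} total]"
        return element


step = TwoPassStep()
first_doc = step.process(["PART I", "some text", "PART II"])
second_doc = step.process(["ONLY HEADING"])
```

Because `_headings_seen` is reassigned at the start of `process`, the count from the first document does not leak into the second.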

Elijas · 2023-11-28T13:26:37Z

Elijas
Nov 28, 2023
Maintainer Author

@deenaawny-github-account
I have an initial idea of how to implement the feature; let me know what you think. Feel free to design a completely different approach as well. This is just one possible idea to start the brainstorm, and not necessarily the best one.

After parsing the Table Of Contents, but before parsing the TopLevelSectionTitles, we would like to insert "TopSectionStartClue" elements based on the Table Of Contents, right at the places the ToC hyperlinks point to.
Therefore we would create a "TopSectionStartCluesInserter" processing step (inheriting from "ElementwiseProcessingStep") and put it in between the other two processing steps.
The "TopSectionStartCluesInserter" would skip all elements until it finds the TableOfContents element, and save it to the self._table_of_contents_element variable. Then it would parse this table of contents to get the <a href="LINK"> values. Each LINK value would be matched to a TopLevelSection. It would then continue to scan all elements, searching for the elements that each LINK points to.
The element "X" that a LINK points to is replaced with a composite element with two inner_elements. The first is the new TopSectionStartClue element, and the second is the original "X" element.
After the TopSectionStartCluesInserter finishes inserting all the clues, the TopLevelSectionManagerFor10Q class can then use those clues along with the standard approach (where the standard approach is just scanning for "Item 1" text).
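As a rough sketch of the idea above, assuming each element exposes its raw HTML as a string: the element classes, the regular expressions, and the `SECTION_BY_LINK_TEXT` mapping below are all hypothetical stand-ins, and a real implementation would build on sec-parser's own element and step classes rather than these.

```python
import re
from dataclasses import dataclass


# Stand-in classes for illustration only (not the sec-parser classes).
@dataclass
class Element:
    html: str


@dataclass
class TableOfContentsElement(Element):
    pass


@dataclass
class TopSectionStartClue:
    section_id: str


@dataclass
class CompositeElement:
    inner_elements: tuple


LINK_PATTERN = re.compile(r'<a\s+href="#([^"]+)"[^>]*>(.*?)</a>', re.I | re.S)
ID_PATTERN = re.compile(r'\bid="([^"]+)"')

# Hypothetical mapping from ToC link text to top-level section identifiers.
SECTION_BY_LINK_TEXT = {
    "financial statements": "part1item1",
    "management's discussion": "part1item2",
}


class TopSectionStartCluesInserter:
    def process(self, elements):
        # Per-document state, reset on every call.
        self._table_of_contents_element = None
        self._section_by_anchor = {}
        return [self._process_element(e) for e in elements]

    def _process_element(self, element):
        if self._table_of_contents_element is None:
            # Skip everything until the Table Of Contents is found.
            if isinstance(element, TableOfContentsElement):
                self._table_of_contents_element = element
                self._parse_links(element.html)
            return element
        # After the ToC: look for elements that a ToC link points to.
        for anchor in ID_PATTERN.findall(element.html):
            section_id = self._section_by_anchor.get(anchor)
            if section_id is not None:
                clue = TopSectionStartClue(section_id)
                return CompositeElement(inner_elements=(clue, element))
        return element

    def _parse_links(self, toc_html):
        for anchor, link_html in LINK_PATTERN.findall(toc_html):
            link_text = re.sub(r"<[^>]+>", "", link_html).lower()
            for phrase, section_id in SECTION_BY_LINK_TEXT.items():
                if phrase in link_text:
                    self._section_by_anchor[anchor] = section_id


elements = [
    Element("<p>Cover page</p>"),
    TableOfContentsElement(
        '<table><a href="#fs">Consolidated Financial Statements</a>'
        '<a href="#mda">Management\'s Discussion and Analysis</a>'
        '<a href="#other">Glossary</a></table>'
    ),
    Element('<div id="mda">Introduction</div>'),
    Element("<p>More text</p>"),
    Element('<div id="fs">Consolidated statements of income</div>'),
    Element('<div id="other">Glossary of terms</div>'),
]
result = TopSectionStartCluesInserter().process(elements)
```

A single pass is enough in this sketch because the Table Of Contents appears before the elements it links to; links that do not map to a known section are simply ignored.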

Continued in Discord:
https://discord.com/channels/1164249739836018698/1178935913116606504/1179050875138875492

0 replies

john0isaac · 2023-12-21T19:18:56Z

john0isaac
Dec 21, 2023

I would like to work on fixing this!

2 replies

Elijas Dec 22, 2023
Maintainer Author

Awesome! Let us first reprioritize our short-term roadmap and create the GitHub Issues for it. Stay tuned!

john0isaac Dec 22, 2023

@Elijas ok, I will be waiting for the update.
