Design function returning start/end indices for key sections in SC 13D and SC 13G #69

Open
bdcallen opened this issue Jan 21, 2020 · 0 comments
@bdcallen
Contributor

@iangow

> Look at this Schedule 13D for instance, which is pretty typical of most of these forms (there are some exceptions of course, which will be more difficult). It has the part at the start associated with the header file. Then there is a title page (which includes the cusip number underneath which is "(Cusip Number)" on the next line) which usually ends with a paragraph that reads at the end "...but shall be subject to all other provisions of the Act (however, see the Notes).", though in this case there is a footnote. Then there is a set of cover pages (in this case just one, but can be more than one, particularly when there is more than one cusip involved), which in SC 13D has questions 1 (Name of Reporting Person) through to 14 (Type of Reporting Person) (in SC 13G it is 1 to 12). Then there is a section which contains the "Items" of the filing, usually 1 through to 10 (on amendments, the items where there has been no amendment are usually omitted). Finally, Item 10 contains the certification statement, which is then followed by the signatures, and then the exhibits (the indexes/titles of which are usually stated in Item 7). I actually have been working to scrape the whole of these documents, first by separating out the different sections. Furthermore, I think the cusip numbers we get can be a whole lot cleaner if we scrape the whole form, as we can localize where the cusips are usually found, and then potentially guess what the cusips are in the case that they have less than 8 characters using other information in the form (for instance 'Common Stock' is almost always the first security for which a cusip is assigned for a given issuer, and normally the 7th and 8th digits (the issue identifier) are '10' for the first security assigned a cusip).
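
To make the CUSIP-guessing idea in the quoted comment concrete, here is a minimal sketch. It assumes we only have the 6-character issuer code, appends the issue identifier '10' as the comment suggests for a first security, and computes the ninth character with the standard CUSIP check-digit rule. The function names `cusip_check_digit` and `guess_full_cusip` are hypothetical and not part of the existing code.

```python
def cusip_check_digit(cusip8: str) -> int:
    """Return the check digit for the first 8 characters of a CUSIP."""
    if len(cusip8) != 8:
        raise ValueError("expected exactly 8 characters")
    total = 0
    for i, ch in enumerate(cusip8.upper()):
        if ch in "0123456789":
            value = int(ch)
        elif "A" <= ch <= "Z":
            value = ord(ch) - ord("A") + 10   # A=10, B=11, ..., Z=35
        elif ch in "*@#":
            value = 36 + "*@#".index(ch)      # *=36, @=37, #=38
        else:
            raise ValueError(f"invalid CUSIP character: {ch!r}")
        if i % 2 == 1:                        # every second character is doubled
            value *= 2
        total += value // 10 + value % 10     # add the digits of the value
    return (10 - total % 10) % 10


def guess_full_cusip(issuer6: str) -> str:
    """Hypothetical heuristic: assume the issue identifier is '10'
    (the usual value for an issuer's first security, per the comment above)."""
    if len(issuer6) != 6:
        raise ValueError("expected a 6-character issuer code")
    cusip8 = issuer6.upper() + "10"
    return cusip8 + str(cusip_check_digit(cusip8))
```

A guess produced this way is only a candidate. It would still need to be checked against other information in the form, since not every first security carries the issue identifier '10'.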

Judging by this initial comment from issue #62, the vast majority of SC 13D and SC 13G forms follow a common structure: the header, then a title page, then the cover pages with questions 1 to 14 (1 to 12 for SC 13G), then an items section, then signatures, then exhibits. For most forms (something like 90%), the starts and ends of these sections can be found with a handful of key regular expressions, just as in the simpler case of the CUSIP numbers.
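
A rough sketch of the regular-expression approach, under stated assumptions: the three patterns below are illustrative guesses at typical section markers, not the patterns actually used, and `find_section_spans` is a hypothetical name. Each section is taken to end where the next detected section begins.

```python
import re

# Illustrative patterns only; real filings will need more variants.
SECTION_PATTERNS = [
    ("cover_page", re.compile(r"names?\s+of\s+reporting\s+persons?", re.I)),
    ("items", re.compile(r"^[ \t]*item\s+1\b", re.I | re.M)),
    ("signatures", re.compile(r"^[ \t]*signatures?\b", re.I | re.M)),
]


def find_section_spans(text):
    """Return {section: (start, end)} with None for sections not found."""
    starts = []
    pos = 0
    for name, pattern in SECTION_PATTERNS:
        match = pattern.search(text, pos)
        if match:
            starts.append((name, match.start()))
            pos = match.end()          # later sections must come after this one
        else:
            starts.append((name, None))

    spans = {name: None for name, _ in starts}
    found = [(name, start) for name, start in starts if start is not None]
    for i, (name, start) in enumerate(found):
        # A section runs until the next detected section, or to the end of the text.
        end = found[i + 1][1] if i + 1 < len(found) else len(text)
        spans[name] = (start, end)
    return spans
```

Searching each pattern only after the previous match keeps the sections in document order, at the cost of missing forms where the order differs. Those are exactly the cases worth flagging separately.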

I think we need a function that finds the starting and ending indices of these sections, along with other information such as whether the form uses an alternate style, upper/lower bounds, and so on. I also think it would be helpful to have a program that writes this function's output to a table in the database, so that we can get key information on the cases that do not follow the normal pattern.
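
As a sketch of the table-writing step, the example below builds one row per filing with a flag for non-standard layouts and stores the rows in a table. Everything here is an assumption for illustration: it uses an in-process SQLite database as a stand-in for the project database, and the column names, the `standard_layout` flag, and the functions `build_index_row` and `write_rows` are hypothetical.

```python
import sqlite3

SECTIONS = ("cover_page", "items", "signatures")
COLUMNS = ["file_name", "standard_layout"] + [f"{s}_start" for s in SECTIONS]


def build_index_row(file_name, starts):
    """Build one table row from a {section: start index or None} mapping."""
    found = [starts.get(section) for section in SECTIONS]
    # Standard layout here means: every section found, in strictly increasing order.
    standard = all(v is not None for v in found) and all(
        a < b for a, b in zip(found, found[1:])
    )
    row = {"file_name": file_name, "standard_layout": int(standard)}
    for section, value in zip(SECTIONS, found):
        row[f"{section}_start"] = value
    return row


def write_rows(conn, rows):
    """Create the table if needed and insert the rows."""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS sc13dg_indexes ({', '.join(COLUMNS)})"
    )
    placeholders = ", ".join(f":{c}" for c in COLUMNS)
    conn.executemany(f"INSERT INTO sc13dg_indexes VALUES ({placeholders})", rows)
    conn.commit()
```

With the rows in a table, the unusual filings can be pulled out with a single query such as `SELECT file_name FROM sc13dg_indexes WHERE standard_layout = 0`, which is the main reason for persisting the indices rather than recomputing them each time.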

bdcallen added a commit that referenced this issue Jan 31, 2020
- Includes function get_key_indices which is designed to return a one-row dataframe containing key information on each sc13d or g document

- Includes write_indexes_to_table which writes the results of get_key_indices into edgar.sc13dg_indexes

- Relates to #69
bdcallen added a commit that referenced this issue Feb 6, 2020
- Includes code to clean text, get bounded segments of the text

- Relates to #69
bdcallen added a commit that referenced this issue Feb 7, 2020
bdcallen added a commit that referenced this issue Feb 7, 2020
bdcallen added a commit that referenced this issue Feb 7, 2020
bdcallen added a commit that referenced this issue Feb 19, 2020
- Some changes made due to mistakes/problems found in outcomes from table

- Relates to #69
bdcallen added a commit that referenced this issue Feb 19, 2020