Inspect structural similarity between SEC filings with common properties. #45
Replies: 8 comments
-
Hi, if so, let us call the property whose values we are segmenting by the segmenting property (in this example […]). Let me know if my understanding and approach seem sensible.
-
Hi @tumble-weed 👋, thank you for showing interest in this issue.
We are not interested in "distributions" in any way 😀
That is an interesting idea you have there! You are very close in your understanding, but not entirely correct. Instead of seeing how the segmenting property affects all the other properties, we are only interested in how it affects linkToFilingDetails, which contains the URL of the HTML filing. The purpose of this issue is to find out whether these segmenting properties affect the HTML structure of filings. By "HTML structure" I mean anything that may affect parsing by sec-parser. The easiest way to check is to download the filing document and have a look at it. Let me know if this clears up your doubts!
-
Hi, @INF800. But in terms of HTML structure, I think it can be the semantic structure of the HTML extracted by sec-parser.
-
Hi @kameleon-ad 👋,
Yes, exactly.
Thank you for this important point. By HTML structure I mean the inherent structure of the HTML filing available via […]
-
@kameleon-ad @tumble-weed Also, I just put together a starter notebook to help you with a few other things, such as the problems you may face while downloading the HTML files using Python. These happen because the SEC website only accepts requests with specific headers. The notebook code will help you fetch HTML files for each property type and put them in the following folder structure.
└── 10-Q
    ├── exchange
    │   ├── BATS_n50
    │   ├── NASDAQ_n50
    │   ├── Not Available_n50
    │   ├── NYSEARCA_n50
    │   ├── NYSEMKT_n50
    │   ├── NYSE_n50
    │   └── OTC_n50
    ├── isDelisted
    │   ├── False_n50
    │   ├── Not Available_n50
    │   └── True_n50
    ├── isUS
    │   ├── False_n50
    │   ├── Not Available_n50
    │   └── True_n50
    ├── market_cap_category
    │   ├── Large ($10-200B)_n50
    │   ├── Medium ($2-10B)_n50
    │   ├── Mega (>$200B)_n50
    │   ├── Micro ($50-300M)_n50
    │   ├── Nano ($0-50M)_n50
    │   ├── Not Available_n50
    │   └── Small ($0.3-2B)_n50
    ...
The notebook can be found here
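For anyone not using the notebook, here is a minimal sketch of the two ideas in this comment: sending an identifying header with each request, and mirroring the folder layout shown above. It is not the notebook's code. The function names, the `n=50` default, and the placeholder User-Agent string are assumptions for illustration; replace the User-Agent with your own name and contact email.

```python
from pathlib import Path
from urllib.request import Request, urlopen

# Assumption: a descriptive User-Agent is the header the SEC site expects.
# The value below is a placeholder, not a working identity.
HEADERS = {"User-Agent": "Sample Name sample@example.com"}


def target_dir(root, form_type, prop, value, n=50):
    """Return a path such as root/10-Q/exchange/NASDAQ_n50."""
    return Path(root) / form_type / prop / f"{value}_n{n}"


def build_request(url):
    """Attach the identifying headers to a request for one filing URL."""
    return Request(url, headers=HEADERS)


def download(url, dest_dir, filename):
    """Fetch one filing (e.g. a linkToFilingDetails URL) and save it to disk."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with urlopen(build_request(url), timeout=30) as response:
        path = dest_dir / filename
        path.write_bytes(response.read())
    return path
```

Calling `download(url, target_dir("data", "10-Q", "exchange", "NASDAQ"), "filing.html")` would save one file under `data/10-Q/exchange/NASDAQ_n50`; add a short pause between calls so you stay within the site's rate limits.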
-
Hi @INF800, nice to meet you. Maybe we just need to add some features to the CSV file that describe the structure of each HTML filing. The first thing to do is define that structural information, for example the number of 'table of contents' items, the number of tables in the HTML, etc. Once it is defined, we fill in the data in the CSV file.
Then we cluster the HTML filings and calculate the correlation between the resulting clusters and every feature (currency, location, isUS, market_cap_category, exchange, filingYear, isDelisted, category, sicSector, sicIndustry), so we can see which features make a difference in the HTML filing. Sorry, it's just a raw idea; I haven't tried it yet.
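To make the idea a bit more concrete, here is a rough sketch of what the structural columns could look like, using only the Python standard library. The chosen features (`n_tables`, `n_rows`, `n_links`, `n_tags`, `max_depth`) and the function names are assumptions for illustration, not an agreed feature set; counting 'table of contents' items would need filing-specific heuristics and is left out. The `crosstab` helper is a simple stand-in for the correlation step: it only counts how often each cluster label co-occurs with each property value.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StructureCounter(HTMLParser):
    """Counts start tags and tracks the deepest nesting level seen."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def structural_features(html):
    """Return one row of hypothetical structural features for a filing."""
    parser = StructureCounter()
    parser.feed(html)
    parser.close()
    counts = parser.tag_counts
    return {
        "n_tables": counts["table"],
        "n_rows": counts["tr"],
        "n_links": counts["a"],
        "n_tags": sum(counts.values()),
        "max_depth": parser.max_depth,
    }


def crosstab(cluster_labels, property_values):
    """Count co-occurrences of (cluster label, property value) pairs."""
    return Counter(zip(cluster_labels, property_values))
```

If one cluster label lines up almost entirely with one value of, say, sicSector in the crosstab, that property is a candidate for affecting the HTML structure.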
-
Hello @Risdan224, thank you for showing interest in this issue! You are right, we can leverage unsupervised clustering methods for this task. Inspecting manually is not a viable option when dealing with hundreds of thousands of files; it is just a quick solution for now. If you want to use clustering methods, you can create and experiment with your own feature sets. There are some interesting, if somewhat old, projects such as html-cluster, which is based on the page-compare project, that can help you with feature engineering. Please have a look at them and let me know what you think. I'd suggest starting with simple unsupervised methods first.
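As one example of a simple unsupervised starting point, the sketch below compares filings by their sequence of HTML tags and groups them greedily. It does not use the API of html-cluster or page-compare; it is a standard-library approximation of tag-sequence comparison. The function names and the 0.8 threshold are assumptions to be tuned on real filings.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Records the order in which start tags appear in a document."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def structural_similarity(html_a, html_b):
    """Similarity in [0, 1] between the tag sequences of two documents."""
    matcher = SequenceMatcher(None, tag_sequence(html_a),
                              tag_sequence(html_b), autojunk=False)
    return matcher.ratio()


def greedy_clusters(docs, threshold=0.8):
    """Put each document in the first cluster whose representative is
    at least `threshold` similar; otherwise start a new cluster."""
    representatives = []
    labels = []
    for doc in docs:
        for index, rep in enumerate(representatives):
            if structural_similarity(doc, rep) >= threshold:
                labels.append(index)
                break
        else:
            representatives.append(doc)
            labels.append(len(representatives) - 1)
    return labels
```

The greedy pass depends on document order and compares every document against every representative, so it is only meant for a quick first look at a few hundred sampled filings, not the full corpus.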
-
Hello @tumble-weed @Risdan224 @kameleon-ad, please feel free to share your findings whenever you get a chance. I was actively working on this over the past week and would like to discuss your perspectives.
-
👁️ Inspect structural similarity between SEC filings with common properties
🦠 Problem
We need to figure out if different filing property values correspond to different structures of HTML filings.
For example, the figure below shows different possible sicSector values for 10-Q filings. You need to figure out if filings with different sicSector values are structurally dissimilar or not. One way to do this is to sample 10 HTML filings with sicSector=Mining and another 10 HTML filings with sicSector=Services and manually inspect if they are similar or not. If they are dissimilar, sicSector is a valid property for sampling a representative sample set.
Do the same for currency, location, isUS, market_cap_category, exchange, filingYear, isDelisted, category, sicSector, sicIndustry of 10-K, 10-Q and 8-K filings and figure out if their different values correspond to different structures of HTML filings.
🌟 Other Approaches
Automate this by using unsupervised clustering methods.
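Whichever approach is used, the first step is the same sampling described under Problem: pick a fixed number of filings for each value of a property. A minimal sketch follows, assuming the filing metadata is available as a list of dicts with one key per property plus linkToFilingDetails; the function name, the seed, and the "Not Available" fallback label for missing values are illustrative assumptions.

```python
import random
from collections import defaultdict


def sample_per_value(rows, prop, n=10, seed=0):
    """Return up to n filing URLs for each distinct value of `prop`."""
    groups = defaultdict(list)
    for row in rows:
        # Missing or empty values are grouped under one fallback label.
        value = row.get(prop) or "Not Available"
        groups[value].append(row["linkToFilingDetails"])
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return {
        value: rng.sample(urls, min(n, len(urls)))
        for value, urls in groups.items()
    }
```

For example, `sample_per_value(rows, "sicSector", n=10)` would give 10 URLs for Mining, 10 for Services, and so on, ready for manual inspection or for feeding into a clustering step.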