This project is a Spring Boot-based service that downloads and updates CIK (Central Index Key) data from the SEC (U.S. Securities and Exchange Commission) for public companies.
The service stores the CIK data in a RDBMS (e.g. MySQL) database and supports enrichment with stock exchange information.
There are endpoints to retrieve the CIK data by CIK or ticker.
CIK is a unique identifier assigned by the U.S. Securities and Exchange Commission (SEC) to companies and individuals who are required to file disclosures with the SEC. It is used to track and access filings and forms submitted by these entities.
The CIK is used in the following ways:
- To uniquely identify companies and individuals required to submit filings to the SEC.
- To search for and access filings and forms submitted by these entities in the SEC's EDGAR (Electronic Data Gathering, Analysis, and Retrieval) database.
CUSIP is a unique identifier for financial instruments, such as stocks and bonds, issued in the United States and Canada. The CUSIP system is owned by the American Bankers Association (ABA) and is operated by S&P Global Market Intelligence.
A CUSIP consists of a nine-character alphanumeric code, where the first six characters represent the issuer, the next two characters indicate the issue, and the last character is a check digit.
CUSIP is used in the following ways:
- To uniquely identify financial instruments, such as stocks and bonds, in the context of trading, clearance, and settlement.
- To facilitate the processing and communication of financial transactions, ensuring accurate tracking and reporting.
A cron expression is used to schedule the CIK data update process.
- Download CIK data from the SEC and store it in a MySQL database.
- 2 sources of CIK data, one with stock exchange information.
- Scheduled CIK data updates using cron expressions.
- Configurable properties for URLs and scheduling.
- Retrieve stock information by CIK or ticker through RESTful API endpoints.
- Monitor and log CIK data updates and process executions.
- Utilize OpenAPI/Swagger for API documentation and testing.
- Download filings of specific form types or the ones containing '13' in their form type, each stored in its respective folder.
- Enrich stock information by extracting additional data such as SIC and sector.
- Retry failed enrichment attempts up to 3 times with a 5-second interval between each attempt.
- Flag records with errors and retry only once in the next run for these records.
- Enrich stock information with CUSIP extracted from the Edgar filings.
- Generate a mapping file based on the given filing types.
- Java 17
- MySQL 8 or Postgres
- Optional: Docker
All the configuration settings can be found in the src/main/resources/application.yml
file. You can change the following properties:
edgar.company-tickers-url
: URL for downloading the company tickers JSON data.- Default value:
https://www.sec.gov/files/company_tickers.json
- Default value:
edgar.company-tickers-exchange-url
: URL for downloading the company tickers with exchange JSON data.- Default value:
https://www.sec.gov/files/company_tickers_exchange.json
- Default value:
edgar.enrich-sector-url
: URL for enriching the sector data based on the CIK value.- Default value:
https://www.sec.gov/cgi-bin/browse-edgar?CIK={cik}
- Default value:
edgar.cik-update-cron
: Cron expression for scheduling CIK data updates.- Default value:
0 0 0 1 * ?
- Default value:
edgar.cik-exchange-update-cron
: Cron expression for scheduling CIK data updates with exchange information.- Default value:
0 */20 * * * *
- Default value:
edgar.unenriched-cron
: Cron expression for scheduling the enrichment of unenriched CIKs.- Default value:
0 0 0 * * *
- Default value:
edgar.sector-enrich-cron
: Cron expression for scheduling the enrichment of sector data.- Default value:
*/10 * * * * *
- Default value:
edgar.use-tickers
: Enable or disable the company tickers downloader.- Default value:
false
- Default value:
edgar.use-tickers-exchange
: Enable or disable the company tickers with exchange downloader.- Default value:
true
- Default value:
edgar.use-sector-enrich
: Enable or disable the sector enrichment feature.- Default value:
true
- Default value:
You can modify these properties according to your requirements. For example, you may change the cron expressions to adjust the frequency of scheduled tasks or enable/disable certain features of the application.
Install MySQL and create a Database:
CREATE DATABASE edgar;
CREATE USER 'edgar_user'@'%' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON edgar.* TO 'edgar_user'@'%';
FLUSH PRIVILEGES;
Or use the following for Postgres:
CREATE DATABASE edgar;
CREATE USER edgar_user WITH ENCRYPTED PASSWORD 'password';
GRANT ALL PRIVILEGES ON DATABASE edgar TO edgar_user;
ALTER USER edgar_user CREATEDB;
And the following connection string in your application.properties
.
jdbc:postgresql://localhost:5432/edgar
To build and start a database container, you can use the following commands from /scripts/postgres
:
docker build -t edgar-postgres .
docker volume create edgar-postgres-data
docker run -d --name edgar-postgres -p 5432:5432 -v edgar-postgres-data:/var/lib/postgresql/data postgres:latest
Or the equivalent for MySQL:
docker build -t edgar-mysql .
docker volume create edgar-mysql-data
docker run -d --name edgar-mysql -p 3306:3306 -e MYSQL_ROOT_PASSWORD=password -v edgar-mysql-data:/var/lib/mysql mysql:latest
Note: The performance of batch updates is much better, MySQL has issues with identity IDs for batch and Postgres does not.
After an initial execution you can see that the main table gets populated:
- Clone the repository:
git clone https://github.com/danielsobrado/edgar-cik-ticker-service.git
cd edgar-cik-ticker-service
-
Update the src/main/resources/application.yml file with your MySQL connection information.
-
Build the project:
./gradlew build
- Run the project:
./gradlew bootRun
The service will start and automatically download and update CIK data based on the specified cron expressions in the application.yml file.
GET /api/v1/stocks/cik/{cik}
: Retrieves the stock information by CIK.GET /api/v1/stocks/ticker/{ticker}
: Retrieves the stock information by ticker.GET /api/v1/stocks?sector={sector}
: Retrieves a list of stock information by sector.GET /api/v1/stocks?sic={sic}
: Retrieves a list of stock information by SIC code (tag).
POST /api/v1/stocks/enrich/ticker/{ticker}
: Enriches the stock information by ticker.GET /api/v1/stocks/enrich/cusip
: Enriches the stock information with CUSIP extracted from the Edgar filings.GET /api/v1/stocks/enrich/cusip/from-filings
: Generates a mapping file based on the given filing types.
GET /api/v1/stocks/export/csv
: Exports the CIK data to a CSV file.
GET /api/v1/stocks/filings/{filingType}
: Downloads filings of the given filing type.
GET /api/v1/stocks/formTypes
: Retrieves a list of distinct form types from the full index.
POST /api/v1/stocks/fullIndex
: Initiates the download of the full index.GET /api/v1/edgar/download_full_index?year={year}&quarter={quarter}
: Downloads and processes the full index for a specific year and quarter.
Example of end point use:
This endpoint downloads and processes the EDGAR master index file for the specified year and quarter.
GET /api/v1/edgar/download_full_index
Parameter | Type | Description |
---|---|---|
year | int | The year of the index file to download. |
quarter | int | The quarter of the index file to download (1-4). |
The endpoint returns an HTTP 204 No Content status on successful processing, or an appropriate error status with a message if there's an issue.
- Status:
204 No Content
-
Status:
400 Bad Request
- Description: Invalid input parameters.
- Example:
{"message": "Invalid year or quarter."}
-
Status:
500 Internal Server Error
- Description: An error occurred during processing.
- Example:
{"message": "Failed to download and process the master index file."}
This application keeps track of the last execution time of the CIK data update process. The purpose of this tracking is to ensure that the process is executed immediately if the last execution date is more than one month ago, or if the tracking table is empty (e.g., the application is run for the first time).
A CIK can be enhanced utilizing a REST API endpoint using another service that scrapes Edgar to provide further data. The maximum rate for this service will be one inquiry per ten seconds. We'll have a scheduled procedure that will check the database for entries that haven't been enriched and enrich them one at a time using a cron job that can be configured. Using configuration we can determine whether to enable this service or not.
The number of unenriched records in the system will be counted by another scheduled activity, which will run once every day. The previous scrapper won't need to be run if there are no records to be enriched.
There are other fields that can be enriched from the filings:
The enrichCik
method handles various scenarios when enriching a CIK. The following are the possible error cases and how they are managed:
- No matching Ticker Symbol: When the provided ticker does not match any CIK, the method will return an
EnrichedData
object with the following values:
sic
: "Not Found"sector
: "Not Found"
- No matching CIK: When the provided CIK does not have a matching record, the method will return an
EnrichedData
object with the following values:
sic
: "No CIK"sector
: "No CIK"
- Sector not available: When the sector cannot be extracted from the page, the method will return an
EnrichedData
object with the following values:
sic
: (extracted SIC value)sector
: "Not Available"
- Error during enrichment: When an error occurs during the enrichment process, such as an IOException, the method will retry the operation up to three times (with a 5-second interval between attempts). If the enrichment still fails after the retries, it will mark the record with an error message and save it to the repository. For records flagged with errors, the method will only retry once in the next run.
- Add a
lastError
field to theStock
class to store the error message from the last failed attempt. Update theStock
class with the necessary getter and setter methods for this field.
- The CIK data is updated every month, so the cron expression for the CIK data update process should be set to run once a month.
- There are some cases where the CIK is duplicated in the CIK data file. In these cases, the service will use the first CIK found for the ticker. See:
As the form isn't submitted until 45 days following the end of the quarter, a lot might change regarding investments during that time. It's also possible that by this point, the projected price changes have already taken place.
Even with holdings information, it might be challenging to determine an asset manager's net position. A fund may be net long, net short, or more likely have a limited pay-off if it discloses holding long positions in the shares, put options, and call options. Moreover, there may be unlisted positions that call for longs in the underlying security while the manager retains a net short exposure or an investor may be partially hedged, such as short positions in convertible bonds.
This code is not production ready, passwords are in clear text, additional logging is required. Portions of the code have been generated using AI (e.g copilot and GPT4).
Please feel free to contact me if you have any questions or suggestions.