Analyze long-term trends from weekly news publications.
To replicate the environment and support Jupyter notebooks, follow these steps:

```bash
# Install pipenv
pip install pipenv

# Enter the virtual environment
pipenv shell

# Install ipykernel so the environment can be used as a Jupyter kernel
pipenv install ipykernel

# Install Jupyter Notebook and JupyterLab
pipenv install notebook jupyterlab
```
To run a notebook:

```bash
pipenv shell
pipenv run jupyter notebook
```
The project includes several scripts for data extraction and processing:

- Fetches the initial RSS feed data
- Stores the raw feed data for further processing
- Processes RSS feed entries using GPT-4o-mini
- Extracts two types of content:
  - Individual News:
    - Start and end dates
    - Ticker symbol
    - News count
    - Growth percentage
    - News text
  - Market News (1-day and 1-week summaries):
    - Model name
    - Time period
    - News count
    - Market summary text
- Adds a source link to each entry
- Saves the data in a flattened Parquet format (a minimal sketch of this step follows the list)
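For illustration only, here is a minimal sketch of what the flattening step could look like. The helper name `flatten_entries` and the intermediate `extracted` structure are assumptions, not the actual contents of `02_get_content_data_flattened.py`; the output fields match the Parquet schema documented below.

```python
# Hypothetical sketch of the flattening step (names and input structure are
# assumptions, not the project's actual code).
import pandas as pd


def flatten_entries(extracted: dict, link: str) -> pd.DataFrame:
    """Turn one feed entry's extracted content into flat rows."""
    rows = []

    # Individual stock news: one row per ticker.
    for item in extracted.get("individual_news", []):
        rows.append({
            "type": "individual",
            "start_date": item["start_date"],
            "end_date": item["end_date"],
            "ticker": item["ticker"],
            "count": item["count"],
            "growth": item["growth"],
            "text": item["text"],
            "link": link,  # source link added to every entry
        })

    # Market-wide summaries: one row per period (1day / 1week).
    for period in ("1day", "1week"):
        summary = extracted.get(f"market_{period}")
        if summary:
            rows.append({
                "type": f"market_{period}",
                "start_date": summary["start_date"],
                "end_date": summary["end_date"],
                "ticker": "multiple_tickers",
                "count": summary["count"],
                "model": summary["model"],
                "text": summary["text"],
                "link": link,
            })

    return pd.DataFrame(rows)
```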
To run the content extraction:

```bash
python scripts/02_get_content_data_flattened.py
```
The project implements text search capabilities using minsearch, allowing efficient search across all data fields:

- `type`: News entry type (individual/market)
- `start_date` & `end_date`: Time period of the news
- `ticker`: Company/stock ticker symbol
- `count`: Number of news items
- `growth`: Growth percentage
- `text`: Main news content
- `model`: Model name for market summaries
- Full-text search across all fields
- Field boosting (prioritizes matches in important fields):
- text (3x boost)
- type and ticker (2x boost)
- growth and model (1.5x boost)
- other fields (1x boost)
- Link-based filtering for source tracking (a minimal indexing sketch follows this list)
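The sketch below shows how such an index could be built with minsearch and wrapped in a `search_news` helper. The field names and boost values mirror the lists above, and the `Index`/`fit`/`search` calls follow minsearch's documented interface, but treat this wrapper as an illustrative assumption rather than the project's actual implementation.

```python
# Sketch: build a minsearch index over the flattened news data and wrap it
# in a search_news helper similar to the one used in the notebooks.
import pandas as pd
from minsearch import Index

df = pd.read_parquet("data/news_feed_flattened.parquet")
documents = df.fillna("").astype(str).to_dict(orient="records")

index = Index(
    text_fields=["type", "start_date", "end_date", "ticker",
                 "count", "growth", "text", "model"],
    keyword_fields=["link"],  # used for exact-match filtering by source link
)
index.fit(documents)

# Default boosts mirroring the priorities listed above.
DEFAULT_BOOST = {
    "text": 3.0,
    "type": 2.0, "ticker": 2.0,
    "growth": 1.5, "model": 1.5,
}


def search_news(query, link=None, boost_dict=None, num_results=10):
    """Full-text search with optional link filtering and custom boosts."""
    filter_dict = {"link": link} if link else {}
    return index.search(
        query,
        filter_dict=filter_dict,
        boost_dict=boost_dict or DEFAULT_BOOST,
        num_results=num_results,
    )
```

Boosting `text` highest means the actual news content dominates the ranking, while `type` and `ticker` still let narrow queries (e.g. a single ticker symbol) surface the right entries.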
Example usage in notebooks:
```python
# Basic search
results = search_news("technology growth")

# Search with link filtering
results = search_news("market analysis", link="specific_url")

# Custom field boosting
custom_boost = {
    "ticker": 3,
    "text": 2,
    "type": 1,
}
results = search_news("AAPL earnings", boost_dict=custom_boost)
```
RSS feed with news (mostly weekly; some weeks are missing), covering around 46 weeks, roughly one year of data:
- RSS Feed URL: https://pythoninvest.com/rss-feed-612566707351.xml
- This represents the weekly financial news feed section of the website: https://pythoninvest.com/#weekly-fin-news-feed
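For reference, the feed can be fetched and inspected with the `feedparser` library. This is a standalone illustration under that assumption; the project's own ingestion script may use a different client.

```python
# Standalone illustration: fetch the weekly news feed and inspect its entries.
# feedparser is an assumption here, not necessarily what the project uses.
import feedparser

FEED_URL = "https://pythoninvest.com/rss-feed-612566707351.xml"

feed = feedparser.parse(FEED_URL)
print(f"Entries in feed: {len(feed.entries)}")

for entry in feed.entries[:3]:
    # Each entry carries a weekly news digest plus a link back to the site.
    print(entry.get("published"), entry.get("title"), entry.get("link"))
```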
The processed data is saved in Parquet format with the following structure:
- Individual News Entries:

  ```
  {
      "type": "individual",
      "start_date": "date",
      "end_date": "date",
      "ticker": "symbol",
      "count": number,
      "growth": percentage,
      "text": "news content",
      "link": "source_url"
  }
  ```

- Market News Entries:

  ```
  {
      "type": "market_[period]",  # period can be "1day" or "1week"
      "start_date": "date",
      "end_date": "date",
      "ticker": "multiple_tickers",
      "count": number,
      "model": "model_name",
      "text": "market summary",
      "link": "source_url"
  }
  ```
The data is saved to `data/news_feed_flattened.parquet` using Brotli compression for efficient storage.
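As a quick check, the file can be written and read back with pandas. The `compression="brotli"` argument is supported by pandas' `to_parquet` (it requires `pyarrow` or `fastparquet` with the Brotli codec installed); the column names below come from the schema above.

```python
# Quick check: write with Brotli compression and read the flattened data back.
import pandas as pd

PARQUET_PATH = "data/news_feed_flattened.parquet"

# Writing (as done at the end of the extraction step):
# df.to_parquet(PARQUET_PATH, compression="brotli")

df = pd.read_parquet(PARQUET_PATH)
print(df["type"].value_counts())  # individual vs. market_1day / market_1week
print(df[["start_date", "end_date", "ticker", "growth"]].head())
```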