A production-grade web scraping system that automatically extracts financial news articles for all 500 S&P 500 companies from Google News and Yahoo Finance. This project aggregates 5 years of market-relevant news (2020-2025) into a structured, queryable dataset for quantitative finance analysis, machine learning training, or financial intelligence platforms.
Target Dataset: 450+ S&P 500 tickers | Date Range: 2020-2025 | Estimated Output: 130K+ articles
- Comprehensive Ticker Coverage: Automatically scrapes all S&P 500 companies from Wikipedia
- Intelligent Proxy Rotation: 10-pool proxy system with automatic failover to bypass IP blocking
- Error Recovery: Automatic retry mechanism for rate-limiting (429) errors with exponential backoff
- Multi-Language Support: Parses dates in English, Spanish, and Portuguese
- Full-Text Extraction: Scrapes complete article bodies from Yahoo Finance
- Duplicate Detection: URL deduplication prevents redundant processing
- Incremental Processing: Tracks scraped tickers, allowing safe resumption from checkpoints
- Date-Range Filtering: 30-day chunking to bypass Google News 30-page limit
- Data Validation: Confirms ticker/security name presence in articles before storing
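The 30-day date-range chunking listed above can be sketched in a few lines (a minimal illustration; the helper name `date_chunks` is not part of the project's code):

```python
from datetime import date, timedelta

def date_chunks(start, end, chunk_days=30):
    # Yield (chunk_start, chunk_end) pairs covering [start, end)
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=chunk_days), end)
        yield cur, nxt
        cur = nxt

# Example: 2020-01-01 to 2020-03-15 splits into three chunks
chunks = list(date_chunks(date(2020, 1, 1), date(2020, 3, 15)))
print(len(chunks))  # → 3
```

Each chunk becomes one Google News query, which keeps every query under the 30-page cap.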
| Layer | Technology | Purpose |
|---|---|---|
| HTTP Requests | `requests` | Fetch webpages with proxy support, timeout handling |
| HTML Parsing | `BeautifulSoup` | DOM navigation and CSS selector-based extraction |
| Data Processing | `pandas` | Wikipedia table parsing, CSV management |
| Proxy Infrastructure | SmartProxy (gate.smartproxy.com) | IP rotation, geo-blocking bypass |
| Date Processing | `datetime`, `re` | Relative date parsing, multi-language support |
| Language | Python 3.7+ | Core implementation language |
- Python 3.7+
- pip or conda
- Active proxy service account (SmartProxy recommended)
```bash
# Clone the repository
git clone https://github.com/gitEricsson/financial-news-aggregator.git
cd financial-news-aggregator

# Install dependencies
pip install -r requirements.txt

# Or with conda
conda create -n scraper python=3.8
conda activate scraper
pip install -r requirements.txt
```

`requirements.txt`:

```
pandas>=1.3.0
requests>=2.26.0
beautifulsoup4>=4.9.3
```

Install via:

```bash
pip install pandas requests beautifulsoup4
```
1. Obtain SmartProxy Credentials
   - Sign up at SmartProxy
   - Get your username and password
   - Note: the free tier typically has limited concurrent connections
2. Configure in Code
Edit the proxy credentials section in the main script:

```python
PROXY_USERNAME = 'your_username'
PROXY_PASSWORD = 'your_password'
PROXY_BASE_URL = 'gate.smartproxy.com'
PROXY_PORTS = [10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, 10009, 10010]
```

Modify the output CSV filename:

```python
filename = "/path/to/your/output/all_tickers_news_2020-01-01_to_2025-01-01.csv"
```

Exclude test tickers from scraping:

```python
quality_control = ['ABT', 'A', 'ARE', 'T', 'C', 'D', 'IT', 'J', 'K', 'L', 'ALL', 'KEY']
```

Basic usage:

```python
from scraper import scrape_multiple_tickers

# Define parameters
start_date = "2020-01-01"
end_date = "2025-01-01"
tickers = ['AAPL', 'MSFT', 'GOOGL']  # Or use the full S&P 500 list

# Run scraper
scrape_multiple_tickers(
    tickers=tickers,
    start_date=start_date,
    end_date=end_date,
    chunk_size_days=30,  # 30-day date chunks
    num_pages=30         # Pages per chunk (~10 results per page)
)
```

Full S&P 500 run:

```python
import pandas as pd
from scraper import scrape_multiple_tickers

# Fetch all S&P 500 tickers from Wikipedia
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tables = pd.read_html(url)
sp500_df = tables[0]
all_tickers = sp500_df["Symbol"].tolist()

# Define quality control and exclusions
quality_control = ['ABT', 'A', 'ARE', 'T', 'C', 'D', 'IT', 'J', 'K', 'L', 'ALL', 'KEY']
already_scraped = ['MMM', 'AOS', 'ABBV']  # From previous runs

# Filter tickers
tickers_to_scrape = [t for t in all_tickers if t not in quality_control and t not in already_scraped]

# Run
scrape_multiple_tickers(
    tickers=tickers_to_scrape,
    start_date="2020-01-01",
    end_date="2025-01-01",
    chunk_size_days=30,
    num_pages=30
)
```

The script automatically detects already-scraped tickers and skips them:

```bash
# Script will resume from where it left off
python run_scraper.py
```

Sample output row:

```csv
title,link,date,original_date,source,snippet,full_text,ticker
"Apple Q4 Earnings Beat Expectations",https://finance.yahoo.com/...,2024-11-01,1 day ago,Yahoo Finance,"Apple reported...",Full article text here,AAPL
```

| Column | Type | Description |
|---|---|---|
| `title` | string | Article headline |
| `link` | string | URL to original article |
| `date` | YYYY-MM-DD | Parsed publication date |
| `original_date` | string | Raw date from source ("2 days ago", etc.) |
| `source` | string | News outlet (Yahoo Finance, Reuters, etc.) |
| `snippet` | string | Article preview/summary |
| `full_text` | string | Complete article body (if available) |
| `ticker` | string | S&P 500 symbol |
- Validation: Articles must mention ticker or security name
- Deduplication: URLs tracked to prevent duplicates
- Missing Values: "N/A" for unavailable fields
- Date Parsing: Supports multiple languages (EN/ES/PT)
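The validation and deduplication rules above can be sketched roughly like this (an illustrative sketch only; `seen_urls` and `is_valid_article` are hypothetical names, not the project's actual identifiers):

```python
# Hypothetical sketch of the validation + deduplication rules above;
# `seen_urls` and `is_valid_article` are illustrative names only.
seen_urls = set()

def is_valid_article(article, ticker, security_name):
    # Mandatory fields: title and link
    if not article.get("title") or not article.get("link"):
        return False
    # Deduplication: skip URLs we have already stored
    if article["link"] in seen_urls:
        return False
    # Validation: article must mention the ticker or the security name
    text = " ".join(article.get(k, "") for k in ("title", "snippet", "full_text"))
    if ticker not in text and security_name.lower() not in text.lower():
        return False
    seen_urls.add(article["link"])
    return True

article = {"title": "Apple Q4 Earnings", "link": "https://example.com/a",
           "snippet": "Apple reported..."}
print(is_valid_article(article, "AAPL", "Apple"))  # → True
print(is_valid_article(article, "AAPL", "Apple"))  # → False (duplicate URL)
```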
```text
┌──────────────────────────────────────────────────────────┐
│ 1. DATA SOURCE ACQUISITION                               │
│ ├─ Fetch S&P 500 ticker list from Wikipedia              │
│ ├─ Create ticker → security name mapping                 │
│ └─ Apply quality control filters                         │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 2. PROXY INFRASTRUCTURE                                  │
│ ├─ Generate 10-pool proxy list                           │
│ ├─ Validate each proxy with test request                 │
│ └─ Keep only working proxies in rotation                 │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 3. FOR EACH TICKER (450+)                                │
│ ├─ Check if already scraped (skip if yes)                │
│ ├─ Query: "Yahoo Finance {TICKER}"                       │
│ └─ Date range: 2020-2025 in 30-day chunks                │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 4. PER DATE CHUNK (30 pages = ~300 results)              │
│ ├─ Send request with random proxy & user agent           │
│ ├─ Parse HTML with BeautifulSoup                         │
│ ├─ Extract: title, link, date, source, snippet           │
│ ├─ Handle 429: rotate proxy & retry                      │
│ └─ Delay: 2-5 seconds (random)                           │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 5. YAHOO FINANCE SCRAPING                                │
│ ├─ For Yahoo Finance links only                          │
│ ├─ Fetch full article body                               │
│ ├─ Extract text from <div class="body yf-tsvcyu">        │
│ └─ Handle errors gracefully ("N/A" on fail)              │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 6. DATA VALIDATION & DEDUPLICATION                       │
│ ├─ Check if URL already seen                             │
│ ├─ Verify ticker/security name in content                │
│ ├─ Require title + link (mandatory fields)               │
│ └─ Parse relative dates to YYYY-MM-DD                    │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 7. PERSISTENT STORAGE                                    │
│ ├─ Append to CSV (incremental)                           │
│ ├─ Track scraped ticker                                  │
│ ├─ Enable safe resumption                                │
│ └─ Delay: 5-10 seconds before next ticker                │
└──────────────────────────────────────────────────────────┘
```
| Function | Purpose |
|---|---|
| `generate_proxy_list()` | Creates proxy URLs from credentials |
| `validate_proxy(proxy)` | Tests proxy with a GET request |
| `get_working_proxies(proxy_list)` | Filters valid proxies |
| `parse_relative_date(date_str)` | Converts "2 days ago" → "2025-02-21" |
| `scrape_google_news(query, dates)` | Searches Google News within a date range |
| `scrape_yahoo_article(url)` | Extracts full text from Yahoo Finance |
| `scrape_google_news_in_chunks()` | Bypasses the 30-page limit with chunking |
| `scrape_multiple_tickers()` | Main orchestration function |
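For orientation, here is a plausible sketch of the first two helpers from the table; the proxy URL format and the test endpoint are assumptions, not the project's exact code:

```python
# Plausible sketches of generate_proxy_list() and validate_proxy(); the
# URL format and test endpoint are assumptions, not the project's code.
PROXY_USERNAME = "your_username"
PROXY_PASSWORD = "your_password"
PROXY_BASE_URL = "gate.smartproxy.com"
PROXY_PORTS = [10001, 10002, 10003]

def generate_proxy_list():
    # One authenticated proxy URL per port in the pool
    return [
        f"http://{PROXY_USERNAME}:{PROXY_PASSWORD}@{PROXY_BASE_URL}:{port}"
        for port in PROXY_PORTS
    ]

def validate_proxy(proxy):
    import requests  # imported lazily so the sketch runs without a network call
    try:
        r = requests.get("https://httpbin.org/ip",
                         proxies={"http": proxy, "https": proxy}, timeout=10)
        return r.status_code == 200
    except requests.RequestException:
        return False

print(generate_proxy_list()[0])
# → http://your_username:your_password@gate.smartproxy.com:10001
```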
Automatic Handling:
- Detect 429 status code
- Rotate to next proxy in pool
- Wait 5 seconds
- Retry up to 3 times
- Skip page if all retries fail
```python
# Configured in scrape_google_news():
retries = 3  # Number of retry attempts
for attempt in range(retries):
    if response.status_code == 429:
        proxy_index = (proxy_index + 1) % len(proxy_list)
        time.sleep(5)
        continue
```

Default to "N/A":

```python
title = link_element.find('div', attrs={'role': 'heading'})
title = title.text if title else "N/A"  # Safe extraction
```

Multi-Language Fallback:

```python
import re
from datetime import datetime, timedelta

def parse_relative_date(date_str):
    # English: "2 days ago" | Spanish: "hace 2 días" | Portuguese: "há 2 dias"
    # (simplified: handles day-granularity phrases only)
    match = re.search(r'(\d+)\s*(day|d[ií]a)', date_str, re.IGNORECASE)
    if match:
        days = int(match.group(1))
        return (datetime.today() - timedelta(days=days)).strftime('%Y-%m-%d')
    return date_str  # Return the original string if all strategies fail
```

- Per Ticker: ~5-10 minutes (depending on result volume)
- All 450 Tickers: ~37-75 hours (continuous)
- With Interruptions: Resumes from last ticker
- Memory: ~500MB-1GB (DataFrame in progress)
- Disk: ~2-5GB (CSV output for 130K articles)
- Network: High bandwidth (proxy service charges apply)
- CPU: Low (I/O bound, not compute intensive)
1. Parallel Processing (future enhancement):

   ```python
   # Use ThreadPoolExecutor to scrape multiple tickers concurrently
   from concurrent.futures import ThreadPoolExecutor

   with ThreadPoolExecutor(max_workers=5) as executor:
       ...  # submit one scrape job per ticker
   ```

2. Reduce Date Range:

   ```python
   # Scrape 1 year instead of 5
   start_date = "2024-01-01"
   end_date = "2025-01-01"
   ```

3. Fewer Pages Per Chunk:

   ```python
   scrape_multiple_tickers(..., num_pages=10)  # Default 30
   ```
- Respects website rate limits (2-5 second delays)
- Rotates proxies on 429 errors
- Random user agents (3 variations)
- 30-page limit bypassed via date chunking (narrower date windows, not deeper paging)
- Use responsibly and check each website's `robots.txt`
- Respect `Crawl-delay` directives
- Consider using official APIs where available
- Attribution: Always credit original sources
- Cost: Proxy service charges per GB
- Bandwidth: ~500MB-1GB for full S&P 500 scrape
- Estimated Cost: $5-50 depending on provider
Problem: Scraper returns no articles

Solution:
- Check proxy connectivity: `python -c "from scraper import validate_proxy; validate_proxy('http://...')"`
- Verify the date format: must be `YYYY-MM-DD`
- Check ticker symbols: must be exact S&P 500 symbols
Problem: Persistent 429 (rate-limit) errors

Solution:
- Increase the delay: `time.sleep(10)` instead of `time.sleep(5)`
- Use fewer proxies temporarily
- Reduce pages per chunk: `num_pages=10`
- Wait 24 hours before retrying
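The feature list mentions exponential backoff; here is a hedged sketch of what that could look like in place of a fixed wait (the function `backoff_delay` is illustrative, not part of the project):

```python
import random

def backoff_delay(attempt, base=5, cap=60):
    # 5s, 10s, 20s, 40s, then capped at 60s; jitter spreads retries out
    return min(base * (2 ** attempt), cap) + random.uniform(0, 1)

# First retry waits grow geometrically (plus up to 1s of jitter)
waits = [backoff_delay(a) for a in range(3)]
```

Doubling the wait on each retry backs off faster from persistent blocks than retrying every 5 seconds.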
Problem: Yahoo Finance full text comes back "N/A"

Solution:
- Yahoo Finance blocks aggressive scraping; consider rate limiting
- Use a fresh proxy pool
- Check if the article was moved or deleted
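The Yahoo extraction step can be illustrated against a static HTML snippet (the CSS class `body yf-tsvcyu` comes from the pipeline diagram above and may change whenever Yahoo updates its markup):

```python
from bs4 import BeautifulSoup

# Static HTML standing in for a fetched Yahoo Finance page
html = '<html><div class="body yf-tsvcyu"><p>Apple reported earnings.</p></div></html>'

def extract_article_text(html):
    # Mirror of the extraction step: find the article body div, else "N/A"
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", class_="body yf-tsvcyu")
    return body.get_text(strip=True) if body else "N/A"

print(extract_article_text(html))            # → Apple reported earnings.
print(extract_article_text("<html></html>")) # → N/A
```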
Problem: Memory errors on large runs

Solution:
- Process in smaller ticker batches:

```python
batch_size = 50
for i in range(0, len(tickers), batch_size):
    scrape_multiple_tickers(tickers[i:i+batch_size], ...)
```
```python
import pandas as pd

# Load results
df = pd.read_csv('all_tickers_news_2020-01-01_to_2025-01-01.csv')

# Overview
print(df.shape)    # (130000+, 8)
print(df.columns)
print(df.head())

# Articles per ticker
df['ticker'].value_counts().head()

# Articles over time
df['date'] = pd.to_datetime(df['date'])
df.groupby(df['date'].dt.year).size()  # Count by year
```

Add sentiment scores with TextBlob:

```python
from textblob import TextBlob

# Add sentiment scores
df['sentiment'] = df['full_text'].apply(
    lambda x: TextBlob(x).sentiment.polarity if isinstance(x, str) else 0
)

# Average sentiment by ticker
df.groupby('ticker')['sentiment'].mean().sort_values(ascending=False)
```

- Parallel Processing: ThreadPoolExecutor for 10-50 concurrent tickers
- Database Backend: Move from CSV to SQLite/PostgreSQL for better querying
- Sentiment Analysis: Integrate TextBlob or VADER for article sentiment
- NLP Pipeline: Extract entities, keywords, and topics
- Real-Time Updates: Scheduled daily/weekly runs for fresh data
- API Endpoint: Flask/FastAPI for querying results
- Caching Layer: Redis for deduplication across runs
- Monitoring Dashboard: Grafana/Streamlit for progress tracking
- Official APIs: Integration with Alpha Vantage, NewsAPI for supplementary data
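The SQLite enhancement above could start as small as this (a sketch only; the schema mirrors the CSV columns, and a `UNIQUE` constraint on `link` would give cross-run deduplication for free):

```python
import sqlite3

# Hypothetical SQLite backend; schema mirrors the CSV columns above
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE articles (
    title TEXT, link TEXT UNIQUE, date TEXT, original_date TEXT,
    source TEXT, snippet TEXT, full_text TEXT, ticker TEXT)""")
con.execute("INSERT OR IGNORE INTO articles (title, link, date, ticker) VALUES (?, ?, ?, ?)",
            ("Apple Q4 Earnings", "https://example.com/a", "2024-11-01", "AAPL"))
# UNIQUE(link) makes re-inserting the same URL a no-op
con.execute("INSERT OR IGNORE INTO articles (title, link, date, ticker) VALUES (?, ?, ?, ?)",
            ("Duplicate", "https://example.com/a", "2024-11-01", "AAPL"))
count = con.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(count)  # → 1
```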
| Metric | Value |
|---|---|
| Total Tickers | 500 (S&P 500) |
| Target Tickers | 450+ (after QC filter) |
| Date Range | 5 years (2020-2025) |
| Estimated Articles | 130,000+ |
| Date Chunks | ~60 per ticker (30-day chunks) |
| Pages Per Chunk | 30 (~300 results) |
| Estimated Requests | 810,000+ |
| Processing Time | 40-80 hours (continuous) |
| Output Size | 2-5 GB |
| Languages Supported | English, Spanish, Portuguese |
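The request estimate in the table follows directly from the chunking parameters:

```python
# Request estimate from the table: tickers × chunks per ticker × pages per chunk
tickers, chunks_per_ticker, pages_per_chunk = 450, 60, 30
total_requests = tickers * chunks_per_ticker * pages_per_chunk
print(total_requests)  # → 810000
```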
- BeautifulSoup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- Google News: https://news.google.com/
- S&P 500 Companies: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
- SmartProxy: https://smartproxy.com/
- Python Requests: https://docs.python-requests.org/
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Disclaimer: This tool is for educational and research purposes only. Users are responsible for complying with the terms of service of websites being scraped and all applicable laws.
Ericsson Raphael || Gozie Ibekwe
Financial Data Engineering | Web Scraping | Quantitative Finance
For issues, questions, or suggestions:
- Open an issue on GitHub
- Check existing issues for solutions
- Review troubleshooting section above
- S&P 500 data sourced from Wikipedia
- News data sourced from Google News and Yahoo Finance
- Proxy infrastructure by SmartProxy
- Built with Python, BeautifulSoup, and Pandas
Last Updated: February 23, 2026
Version: 1.5.0
Status: Production Ready