Financial News Aggregator - S&P 500 Web Scraper

📊 Project Overview

A production-grade web scraping system that automatically extracts financial news articles for all 500 S&P 500 companies from Google News and Yahoo Finance. This project aggregates 5 years of market-relevant news (2020-2025) into a structured, queryable dataset for quantitative finance analysis, machine learning training, or financial intelligence platforms.

Target Dataset: 450+ S&P 500 tickers | Date Range: 2020-2025 | Estimated Output: 130K+ articles


✨ Key Features

  • Comprehensive Ticker Coverage: Automatically scrapes all S&P 500 companies from Wikipedia
  • Intelligent Proxy Rotation: pool of 10 proxies with automatic failover to bypass IP blocking
  • Error Recovery: Automatic retry mechanism for rate-limiting (429) errors with exponential backoff
  • Multi-Language Support: Parses dates in English, Spanish, and Portuguese
  • Full-Text Extraction: Scrapes complete article bodies from Yahoo Finance
  • Duplicate Detection: URL deduplication prevents redundant processing
  • Incremental Processing: Tracks scraped tickers, allowing safe resumption from checkpoints
  • Date-Range Filtering: 30-day chunking to bypass Google News 30-page limit
  • Data Validation: Confirms ticker/security name presence in articles before storing

🛠️ Technology Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| HTTP Requests | `requests` | Fetch webpages with proxy support, timeout handling |
| HTML Parsing | `BeautifulSoup` | DOM navigation and CSS selector-based extraction |
| Data Processing | `pandas` | Wikipedia table parsing, CSV management |
| Proxy Infrastructure | SmartProxy (`gate.smartproxy.com`) | IP rotation, geo-blocking bypass |
| Date Processing | `datetime`, `re` | Relative date parsing, multi-language support |
| Language | Python 3.7+ | Core implementation language |

📋 Installation

Prerequisites

  • Python 3.7+
  • pip or conda
  • Active proxy service account (SmartProxy recommended)

Setup

# Clone the repository
git clone https://github.com/gitEricsson/financial-news-aggregator.git
cd financial-news-aggregator

# Install dependencies
pip install -r requirements.txt

# Or with conda
conda create -n scraper python=3.8
conda activate scraper
pip install -r requirements.txt

Dependencies

pandas>=1.3.0
requests>=2.26.0
beautifulsoup4>=4.9.3

Install via:

pip install pandas requests beautifulsoup4

⚙️ Configuration

Proxy Setup

  1. Obtain SmartProxy Credentials

    • Sign up at SmartProxy
    • Get username and password
    • Note: Free tier typically has limited concurrent connections
  2. Configure in Code

Edit the proxy credentials section in the main script:

PROXY_USERNAME = 'your_username'
PROXY_PASSWORD = 'your_password'
PROXY_BASE_URL = 'gate.smartproxy.com'
PROXY_PORTS = [10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, 10009, 10010]
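The helpers that expand these settings into proxy URLs are not shown in this README; a minimal sketch consistent with the `generate_proxy_list()` and `validate_proxy()` names listed under Key Functions might look like this (the exact signatures and the test URL are assumptions):

```python
import requests

def generate_proxy_list(username, password, base_url, ports):
    # One gateway endpoint per port; each port maps to a different exit IP
    return [f"http://{username}:{password}@{base_url}:{port}" for port in ports]

def validate_proxy(proxy, timeout=10):
    # A proxy is considered working if a simple GET succeeds through it
    try:
        r = requests.get("https://httpbin.org/ip",
                         proxies={"http": proxy, "https": proxy},
                         timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False
```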

Output File Path

Modify the output CSV filename:

filename = "/path/to/your/output/all_tickers_news_2020-01-01_to_2025-01-01.csv"

Quality Control Tickers

Exclude test tickers from scraping:

quality_control = ['ABT', 'A', 'ARE', 'T', 'C', 'D', 'IT', 'J', 'K', 'L', 'ALL', 'KEY']

🚀 Usage

Basic Usage

from scraper import scrape_multiple_tickers

# Define parameters
start_date = "2020-01-01"
end_date = "2025-01-01"
tickers = ['AAPL', 'MSFT', 'GOOGL']  # Or use full S&P 500 list

# Run scraper
scrape_multiple_tickers(
    tickers=tickers,
    start_date=start_date,
    end_date=end_date,
    chunk_size_days=30,    # 30-day date chunks
    num_pages=30           # Pages per chunk (~10 results per page)
)

Advanced: Full S&P 500 Scrape

import pandas as pd
from scraper import scrape_multiple_tickers

# Fetch all S&P 500 tickers from Wikipedia
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tables = pd.read_html(url)
sp500_df = tables[0]
all_tickers = sp500_df["Symbol"].tolist()

# Define quality control and exclusions
quality_control = ['ABT', 'A', 'ARE', 'T', 'C', 'D', 'IT', 'J', 'K', 'L', 'ALL', 'KEY']
already_scraped = ['MMM', 'AOS', 'ABBV']  # From previous runs

# Filter tickers
tickers_to_scrape = [t for t in all_tickers if t not in quality_control and t not in already_scraped]

# Run
scrape_multiple_tickers(
    tickers=tickers_to_scrape,
    start_date="2020-01-01",
    end_date="2025-01-01",
    chunk_size_days=30,
    num_pages=30
)

Resuming After Interruption

The script automatically detects already-scraped tickers and skips them:

# Script will resume from where it left off
python run_scraper.py
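The checkpoint mechanism itself is not shown in this README. One way it could work, given that the script appends each ticker's rows to the output CSV, is to read back the `ticker` column on startup (a sketch under that assumption, not the actual implementation):

```python
import csv
import os

def already_scraped_tickers(filename):
    # Collect the tickers already present in the output CSV, if it exists;
    # the caller can then skip these on the next run
    if not os.path.exists(filename):
        return set()
    with open(filename, newline="", encoding="utf-8") as f:
        return {row["ticker"] for row in csv.DictReader(f)}
```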

📊 Output Format

CSV Structure

title,link,date,original_date,source,snippet,full_text,ticker
"Apple Q4 Earnings Beat Expectations",https://finance.yahoo.com/...,2024-11-01,1 day ago,Yahoo Finance,"Apple reported...",Full article text here,AAPL

Column Descriptions

| Column | Type | Description |
| --- | --- | --- |
| `title` | string | Article headline |
| `link` | string | URL to the original article |
| `date` | YYYY-MM-DD | Parsed publication date |
| `original_date` | string | Raw date from source ("2 days ago", etc.) |
| `source` | string | News outlet (Yahoo Finance, Reuters, etc.) |
| `snippet` | string | Article preview/summary |
| `full_text` | string | Complete article body (if available) |
| `ticker` | string | S&P 500 symbol |

Data Quality

  • Validation: Articles must mention ticker or security name
  • Deduplication: URLs tracked to prevent duplicates
  • Missing Values: "N/A" for unavailable fields
  • Date Parsing: Supports multiple languages (EN/ES/PT)
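Combined, the validation and deduplication rules above amount to a single predicate per article. A minimal sketch (function name and dict keys are hypothetical, chosen to match the CSV columns):

```python
def is_valid_article(article, seen_urls, ticker, security_name):
    # Mandatory fields: title and link must both be present
    if not article.get("title") or not article.get("link"):
        return False
    # URL-level deduplication against previously stored articles
    if article["link"] in seen_urls:
        return False
    # The article must mention the ticker or the security name
    text = " ".join(str(article.get(k, "")) for k in ("title", "snippet", "full_text"))
    return ticker in text or security_name.lower() in text.lower()
```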

🏗️ Architecture

System Workflow

┌──────────────────────────────────────────────────────────┐
│ 1. DATA SOURCE ACQUISITION                               │
│    ├─ Fetch S&P 500 ticker list from Wikipedia           │
│    ├─ Create ticker → security name mapping              │
│    └─ Apply quality control filters                      │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 2. PROXY INFRASTRUCTURE                                  │
│    ├─ Generate 10-pool proxy list                        │
│    ├─ Validate each proxy with test request              │
│    └─ Keep only working proxies in rotation              │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 3. FOR EACH TICKER (450+)                                │
│    ├─ Check if already scraped (skip if yes)             │
│    ├─ Query: "Yahoo Finance {TICKER}"                    │
│    └─ Date range: 2020-2025 in 30-day chunks             │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 4. PER DATE CHUNK (30 pages = ~300 results)              │
│    ├─ Send request with random proxy & user agent        │
│    ├─ Parse HTML with BeautifulSoup                      │
│    ├─ Extract: title, link, date, source, snippet       │
│    ├─ Handle 429: rotate proxy & retry                   │
│    └─ Delay: 2-5 seconds (random)                        │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 5. YAHOO FINANCE SCRAPING                                │
│    ├─ For Yahoo Finance links only                       │
│    ├─ Fetch full article body                            │
│    ├─ Extract text from <div class="body yf-tsvcyu">     │
│    └─ Handle errors gracefully ("N/A" on fail)           │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 6. DATA VALIDATION & DEDUPLICATION                       │
│    ├─ Check if URL already seen                          │
│    ├─ Verify ticker/security name in content             │
│    ├─ Require title + link (mandatory fields)            │
│    └─ Parse relative dates to YYYY-MM-DD                 │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 7. PERSISTENT STORAGE                                    │
│    ├─ Append to CSV (incremental)                        │
│    ├─ Track scraped ticker                               │
│    ├─ Enable safe resumption                             │
│    └─ Delay: 5-10 seconds before next ticker             │
└──────────────────────────────────────────────────────────┘

Key Functions

| Function | Purpose |
| --- | --- |
| `generate_proxy_list()` | Creates proxy URLs from credentials |
| `validate_proxy(proxy)` | Tests a proxy with a GET request |
| `get_working_proxies(proxy_list)` | Filters out non-working proxies |
| `parse_relative_date(date_str)` | Converts "2 days ago" → "2025-02-21" |
| `scrape_google_news(query, dates)` | Searches Google News within a date range |
| `scrape_yahoo_article(url)` | Extracts full text from Yahoo Finance |
| `scrape_google_news_in_chunks()` | Bypasses the 30-page limit with date chunking |
| `scrape_multiple_tickers()` | Main orchestration function |
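The date chunking that `scrape_google_news_in_chunks()` is described as performing can be sketched as a generator of abutting sub-ranges; the real implementation may split dates differently:

```python
from datetime import date, timedelta

def date_chunks(start, end, chunk_size_days=30):
    # Yield (start, end) sub-ranges covering [start, end) in chunk-sized steps,
    # so each Google News query stays well under the 30-page result cap
    start, end = date.fromisoformat(start), date.fromisoformat(end)
    current = start
    while current < end:
        nxt = min(current + timedelta(days=chunk_size_days), end)
        yield current.isoformat(), nxt.isoformat()
        current = nxt
```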

⚠️ Error Handling & Recovery

HTTP 429 (Too Many Requests)

Automatic Handling:

  1. Detect 429 status code
  2. Rotate to next proxy in pool
  3. Wait 5 seconds
  4. Retry up to 3 times
  5. Skip page if all retries fail
# Simplified from scrape_google_news():
retries = 3  # Number of retry attempts
for attempt in range(retries):
    response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
    if response.status_code == 429:
        proxy_index = (proxy_index + 1) % len(proxy_list)  # Rotate to next proxy
        time.sleep(5)                                      # Back off before retry
        continue
    break  # Non-429 response: stop retrying

Missing HTML Elements

Default to "N/A":

title = link_element.find('div', attrs={'role': 'heading'})
title = title.text if title else "N/A"  # Safe extraction

Date Parsing Failures

Multi-Language Fallback:

import re
from datetime import datetime, timedelta

def parse_relative_date(date_str):
    # Relative-date phrases: English ("2 days ago"),
    # Spanish ("hace 2 días"), Portuguese ("há 2 dias")
    match = re.search(r"(\d+)\s*(day|día|dia)", date_str)
    if match:
        days = int(match.group(1))
        return (datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d")
    return date_str  # Return the original string if all strategies fail

⚡ Performance Characteristics

Expected Runtime

  • Per Ticker: ~5-10 minutes (depending on result volume)
  • All 450 Tickers: ~37-75 hours (continuous)
  • With Interruptions: Resumes from last ticker

Resource Usage

  • Memory: ~500MB-1GB (DataFrame in progress)
  • Disk: ~2-5GB (CSV output for 130K articles)
  • Network: High bandwidth (proxy service charges apply)
  • CPU: Low (I/O bound, not compute intensive)

Optimization Tips

  1. Parallel Processing (future enhancement):

    # Use ThreadPoolExecutor for multiple tickers
    from concurrent.futures import ThreadPoolExecutor
    with ThreadPoolExecutor(max_workers=5) as executor:
        # ...
  2. Reduce Date Range:

    # Scrape 1 year instead of 5
    start_date = "2024-01-01"
    end_date = "2025-01-01"
  3. Fewer Pages Per Chunk:

    scrape_multiple_tickers(..., num_pages=10)  # Default 30

🔒 Security & Ethical Considerations

Rate Limiting

  • Respects website rate limits (2-5 second delays)
  • Rotates proxies on 429 errors
  • Random user agents (3 variations)
  • Date chunking retrieves deeper history without paging past Google News's 30-page cap
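The user-agent rotation mentioned above can be sketched as follows; the three strings the script actually uses are not shown in this README, so these are placeholders:

```python
import random

USER_AGENTS = [
    # Hypothetical examples; substitute the script's real strings
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def random_headers():
    # Vary the browser fingerprint on each request
    return {"User-Agent": random.choice(USER_AGENTS)}
```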

Terms of Service

  • Use responsibly and check each website's robots.txt
  • Respect Crawl-delay directives
  • Consider using official APIs where available
  • Attribution: Always credit original sources

Proxy Usage

  • Cost: Proxy service charges per GB
  • Bandwidth: ~500MB-1GB for full S&P 500 scrape
  • Estimated Cost: $5-50 depending on provider

🐛 Troubleshooting

Issue: No articles found

Solution:

  • Check proxy connectivity: python -c "from scraper import validate_proxy; validate_proxy('http://...')"
  • Verify date format: Must be YYYY-MM-DD
  • Check ticker symbols: Must be exact S&P 500 symbols

Issue: "Too Many Requests" errors persist

Solution:

  • Increase delay: time.sleep(10) instead of 5
  • Use fewer proxies temporarily
  • Reduce pages per chunk: num_pages=10
  • Wait 24 hours before retrying

Issue: Yahoo Finance articles not scraping

Solution:

  • Yahoo Finance blocks aggressive scraping; consider rate limiting
  • Use a fresh proxy pool
  • Check if article was moved/deleted

Issue: Out of memory

Solution:

  • Process in smaller ticker batches:
    batch_size = 50
    for i in range(0, len(tickers), batch_size):
        scrape_multiple_tickers(tickers[i:i+batch_size], ...)

📈 Data Analysis Examples

Load and Explore

import pandas as pd

# Load results
df = pd.read_csv('all_tickers_news_2020-01-01_to_2025-01-01.csv')

# Overview
print(df.shape)  # (130000+, 8)
print(df.columns)
print(df.head())

# Articles per ticker
df['ticker'].value_counts().head()

# Articles over time
df['date'] = pd.to_datetime(df['date'])
df.groupby(df['date'].dt.year).size()  # Count by year

Sentiment Analysis

from textblob import TextBlob

# Add sentiment scores
df['sentiment'] = df['full_text'].apply(
    lambda x: TextBlob(x).sentiment.polarity if isinstance(x, str) else 0
)

# Average sentiment by ticker
df.groupby('ticker')['sentiment'].mean().sort_values(ascending=False)

🚀 Future Enhancements

  • Parallel Processing: ThreadPoolExecutor for 10-50 concurrent tickers
  • Database Backend: Move from CSV to SQLite/PostgreSQL for better querying
  • Sentiment Analysis: Integrate TextBlob or VADER for article sentiment
  • NLP Pipeline: Extract entities, keywords, and topics
  • Real-Time Updates: Scheduled daily/weekly runs for fresh data
  • API Endpoint: Flask/FastAPI for querying results
  • Caching Layer: Redis for deduplication across runs
  • Monitoring Dashboard: Grafana/Streamlit for progress tracking
  • Official APIs: Integration with Alpha Vantage, NewsAPI for supplementary data

📝 Project Statistics

| Metric | Value |
| --- | --- |
| Total Tickers | 500 (S&P 500) |
| Target Tickers | 450+ (after QC filter) |
| Date Range | 5 years (2020-2025) |
| Estimated Articles | 130,000+ |
| Date Chunks | ~60 per ticker (30-day chunks) |
| Pages Per Chunk | 30 (~300 results) |
| Estimated Requests | 810,000+ |
| Processing Time | 40-80 hours (continuous) |
| Output Size | 2-5 GB |
| Languages Supported | English, Spanish, Portuguese |

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

⚖️ License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer: This tool is for educational and research purposes only. Users are responsible for complying with the terms of service of websites being scraped and all applicable laws.


👤 Author

Ericsson Raphael || Gozie Ibekwe
Financial Data Engineering | Web Scraping | Quantitative Finance


📞 Support

For issues, questions, or suggestions:

  • Open an issue on GitHub
  • Check existing issues for solutions
  • Review troubleshooting section above

🙏 Acknowledgments

  • S&P 500 data sourced from Wikipedia
  • News data sourced from Google News and Yahoo Finance
  • Proxy infrastructure by SmartProxy
  • Built with Python, BeautifulSoup, and Pandas

Last Updated: February 23, 2026
Version: 1.5.0
Status: Production Ready
