Financial News Aggregator - S&P 500 Web Scraper

📊 Project Overview

A production-grade web scraping system that automatically extracts financial news articles for all 500 S&P 500 companies from Google News and Yahoo Finance. This project aggregates 5 years of market-relevant news (2020-2025) into a structured, queryable dataset for quantitative finance analysis, machine learning training, or financial intelligence platforms.

Target Dataset: 450+ S&P 500 tickers | Date Range: 2020-2025 | Estimated Output: 130K+ articles


✨ Key Features

  • Comprehensive Ticker Coverage: Automatically scrapes all S&P 500 companies from Wikipedia
  • Intelligent Proxy Rotation: pool of 10 proxies with automatic failover to bypass IP blocking
  • Error Recovery: Automatic retry mechanism for rate-limiting (429) errors with exponential backoff
  • Multi-Language Support: Parses dates in English, Spanish, and Portuguese
  • Full-Text Extraction: Scrapes complete article bodies from Yahoo Finance
  • Duplicate Detection: URL deduplication prevents redundant processing
  • Incremental Processing: Tracks scraped tickers, allowing safe resumption from checkpoints
  • Date-Range Filtering: 30-day chunking to bypass Google News 30-page limit
  • Data Validation: Confirms ticker/security name presence in articles before storing

🛠️ Technology Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| HTTP Requests | `requests` | Fetch webpages with proxy support, timeout handling |
| HTML Parsing | `BeautifulSoup` | DOM navigation and CSS selector-based extraction |
| Data Processing | `pandas` | Wikipedia table parsing, CSV management |
| Proxy Infrastructure | SmartProxy (`gate.smartproxy.com`) | IP rotation, geo-blocking bypass |
| Date Processing | `datetime`, `re` | Relative date parsing, multi-language support |
| Language | Python 3.7+ | Core implementation language |

📋 Installation

Prerequisites

  • Python 3.7+
  • pip or conda
  • Active proxy service account (SmartProxy recommended)

Setup

# Clone the repository
git clone https://github.com/gitEricsson/financial-news-aggregator.git
cd financial-news-aggregator

# Install dependencies
pip install -r requirements.txt

# Or with conda
conda create -n scraper python=3.8
conda activate scraper
pip install -r requirements.txt

Dependencies

pandas>=1.3.0
requests>=2.26.0
beautifulsoup4>=4.9.3

Install via:

pip install pandas requests beautifulsoup4

⚙️ Configuration

Proxy Setup

  1. Obtain SmartProxy Credentials

    • Sign up at SmartProxy
    • Get username and password
    • Note: Free tier typically has limited concurrent connections
  2. Configure in Code

Edit the proxy credentials section in the main script:

PROXY_USERNAME = 'your_username'
PROXY_PASSWORD = 'your_password'
PROXY_BASE_URL = 'gate.smartproxy.com'
PROXY_PORTS = [10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, 10009, 10010]
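The helpers that expand these settings into proxy URLs are not shown in this README; a minimal sketch consistent with the `generate_proxy_list()` and `validate_proxy()` names listed under Key Functions might look like this (the exact signatures and the test URL are assumptions):

```python
import requests

def generate_proxy_list(username, password, base_url, ports):
    # One gateway endpoint per port; each port maps to a different exit IP
    return [f"http://{username}:{password}@{base_url}:{port}" for port in ports]

def validate_proxy(proxy, timeout=10):
    # A proxy is considered working if a simple GET succeeds through it
    try:
        r = requests.get("https://httpbin.org/ip",
                         proxies={"http": proxy, "https": proxy},
                         timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False
```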

Output File Path

Modify the output CSV filename:

filename = "/path/to/your/output/all_tickers_news_2020-01-01_to_2025-01-01.csv"

Quality Control Tickers

Exclude test tickers from scraping:

quality_control = ['ABT', 'A', 'ARE', 'T', 'C', 'D', 'IT', 'J', 'K', 'L', 'ALL', 'KEY']

🚀 Usage

Basic Usage

from scraper import scrape_multiple_tickers

# Define parameters
start_date = "2020-01-01"
end_date = "2025-01-01"
tickers = ['AAPL', 'MSFT', 'GOOGL']  # Or use full S&P 500 list

# Run scraper
scrape_multiple_tickers(
    tickers=tickers,
    start_date=start_date,
    end_date=end_date,
    chunk_size_days=30,    # 30-day date chunks
    num_pages=30           # Pages per chunk (~10 results per page)
)

Advanced: Full S&P 500 Scrape

import pandas as pd
from scraper import scrape_multiple_tickers

# Fetch all S&P 500 tickers from Wikipedia
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tables = pd.read_html(url)
sp500_df = tables[0]
all_tickers = sp500_df["Symbol"].tolist()

# Define quality control and exclusions
quality_control = ['ABT', 'A', 'ARE', 'T', 'C', 'D', 'IT', 'J', 'K', 'L', 'ALL', 'KEY']
already_scraped = ['MMM', 'AOS', 'ABBV']  # From previous runs

# Filter tickers
tickers_to_scrape = [t for t in all_tickers if t not in quality_control and t not in already_scraped]

# Run
scrape_multiple_tickers(
    tickers=tickers_to_scrape,
    start_date="2020-01-01",
    end_date="2025-01-01",
    chunk_size_days=30,
    num_pages=30
)

Resuming After Interruption

The script automatically detects already-scraped tickers and skips them:

# Script will resume from where it left off
python run_scraper.py
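The checkpoint mechanism itself is not shown in this README. One way it could work, given that the script appends each ticker's rows to the output CSV, is to read back the `ticker` column on startup (a sketch under that assumption, not the actual implementation):

```python
import csv
import os

def already_scraped_tickers(filename):
    # Collect the tickers already present in the output CSV, if it exists;
    # the caller can then skip these on the next run
    if not os.path.exists(filename):
        return set()
    with open(filename, newline="", encoding="utf-8") as f:
        return {row["ticker"] for row in csv.DictReader(f)}
```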

📊 Output Format

CSV Structure

title,link,date,original_date,source,snippet,full_text,ticker
"Apple Q4 Earnings Beat Expectations",https://finance.yahoo.com/...,2024-11-01,1 day ago,Yahoo Finance,"Apple reported...",Full article text here,AAPL

Column Descriptions

| Column | Type | Description |
| --- | --- | --- |
| `title` | string | Article headline |
| `link` | string | URL to the original article |
| `date` | YYYY-MM-DD | Parsed publication date |
| `original_date` | string | Raw date from source ("2 days ago", etc.) |
| `source` | string | News outlet (Yahoo Finance, Reuters, etc.) |
| `snippet` | string | Article preview/summary |
| `full_text` | string | Complete article body (if available) |
| `ticker` | string | S&P 500 symbol |

Data Quality

  • Validation: Articles must mention ticker or security name
  • Deduplication: URLs tracked to prevent duplicates
  • Missing Values: "N/A" for unavailable fields
  • Date Parsing: Supports multiple languages (EN/ES/PT)
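Combined, the validation and deduplication rules above amount to a single predicate per article. A minimal sketch (function name and dict keys are hypothetical, chosen to match the CSV columns):

```python
def is_valid_article(article, seen_urls, ticker, security_name):
    # Mandatory fields: title and link must both be present
    if not article.get("title") or not article.get("link"):
        return False
    # URL-level deduplication against previously stored articles
    if article["link"] in seen_urls:
        return False
    # The article must mention the ticker or the security name
    text = " ".join(str(article.get(k, "")) for k in ("title", "snippet", "full_text"))
    return ticker in text or security_name.lower() in text.lower()
```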

🏗️ Architecture

System Workflow

┌──────────────────────────────────────────────────────────┐
│ 1. DATA SOURCE ACQUISITION                               │
│    ├─ Fetch S&P 500 ticker list from Wikipedia           │
│    ├─ Create ticker → security name mapping              │
│    └─ Apply quality control filters                      │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 2. PROXY INFRASTRUCTURE                                  │
│    ├─ Generate 10-pool proxy list                        │
│    ├─ Validate each proxy with test request              │
│    └─ Keep only working proxies in rotation              │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 3. FOR EACH TICKER (450+)                                │
│    ├─ Check if already scraped (skip if yes)             │
│    ├─ Query: "Yahoo Finance {TICKER}"                    │
│    └─ Date range: 2020-2025 in 30-day chunks             │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 4. PER DATE CHUNK (30 pages = ~300 results)              │
│    ├─ Send request with random proxy & user agent        │
│    ├─ Parse HTML with BeautifulSoup                      │
│    ├─ Extract: title, link, date, source, snippet       │
│    ├─ Handle 429: rotate proxy & retry                   │
│    └─ Delay: 2-5 seconds (random)                        │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 5. YAHOO FINANCE SCRAPING                                │
│    ├─ For Yahoo Finance links only                       │
│    ├─ Fetch full article body                            │
│    ├─ Extract text from <div class="body yf-tsvcyu">     │
│    └─ Handle errors gracefully ("N/A" on fail)           │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 6. DATA VALIDATION & DEDUPLICATION                       │
│    ├─ Check if URL already seen                          │
│    ├─ Verify ticker/security name in content             │
│    ├─ Require title + link (mandatory fields)            │
│    └─ Parse relative dates to YYYY-MM-DD                 │
└────────────────────┬─────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 7. PERSISTENT STORAGE                                    │
│    ├─ Append to CSV (incremental)                        │
│    ├─ Track scraped ticker                               │
│    ├─ Enable safe resumption                             │
│    └─ Delay: 5-10 seconds before next ticker             │
└──────────────────────────────────────────────────────────┘

Key Functions

| Function | Purpose |
| --- | --- |
| `generate_proxy_list()` | Creates proxy URLs from credentials |
| `validate_proxy(proxy)` | Tests a proxy with a GET request |
| `get_working_proxies(proxy_list)` | Filters out non-working proxies |
| `parse_relative_date(date_str)` | Converts "2 days ago" → "2025-02-21" |
| `scrape_google_news(query, dates)` | Searches Google News within a date range |
| `scrape_yahoo_article(url)` | Extracts full text from Yahoo Finance |
| `scrape_google_news_in_chunks()` | Bypasses the 30-page limit with date chunking |
| `scrape_multiple_tickers()` | Main orchestration function |
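The date chunking that `scrape_google_news_in_chunks()` is described as performing can be sketched as a generator of abutting sub-ranges; the real implementation may split dates differently:

```python
from datetime import date, timedelta

def date_chunks(start, end, chunk_size_days=30):
    # Yield (start, end) sub-ranges covering [start, end) in chunk-sized steps,
    # so each Google News query stays well under the 30-page result cap
    start, end = date.fromisoformat(start), date.fromisoformat(end)
    current = start
    while current < end:
        nxt = min(current + timedelta(days=chunk_size_days), end)
        yield current.isoformat(), nxt.isoformat()
        current = nxt
```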

⚠️ Error Handling & Recovery

HTTP 429 (Too Many Requests)

Automatic Handling:

  1. Detect 429 status code
  2. Rotate to next proxy in pool
  3. Wait 5 seconds
  4. Retry up to 3 times
  5. Skip page if all retries fail
# Simplified from scrape_google_news():
retries = 3  # Number of retry attempts
for attempt in range(retries):
    response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
    if response.status_code == 429:
        proxy_index = (proxy_index + 1) % len(proxy_list)  # Rotate to next proxy
        time.sleep(5)                                      # Back off before retry
        continue
    break  # Non-429 response: stop retrying

Missing HTML Elements

Default to "N/A":

title = link_element.find('div', attrs={'role': 'heading'})
title = title.text if title else "N/A"  # Safe extraction

Date Parsing Failures

Multi-Language Fallback:

import re
from datetime import datetime, timedelta

def parse_relative_date(date_str):
    # Relative-date phrases: English ("2 days ago"),
    # Spanish ("hace 2 días"), Portuguese ("há 2 dias")
    match = re.search(r"(\d+)\s*(day|día|dia)", date_str)
    if match:
        days = int(match.group(1))
        return (datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d")
    return date_str  # Return the original string if all strategies fail

⚡ Performance Characteristics

Expected Runtime

  • Per Ticker: ~5-10 minutes (depending on result volume)
  • All 450 Tickers: ~37-75 hours (continuous)
  • With Interruptions: Resumes from last ticker

Resource Usage

  • Memory: ~500MB-1GB (DataFrame in progress)
  • Disk: ~2-5GB (CSV output for 130K articles)
  • Network: High bandwidth (proxy service charges apply)
  • CPU: Low (I/O bound, not compute intensive)

Optimization Tips

  1. Parallel Processing (future enhancement):

    # Use ThreadPoolExecutor for multiple tickers
    from concurrent.futures import ThreadPoolExecutor
    with ThreadPoolExecutor(max_workers=5) as executor:
        # ...
  2. Reduce Date Range:

    # Scrape 1 year instead of 5
    start_date = "2024-01-01"
    end_date = "2025-01-01"
  3. Fewer Pages Per Chunk:

    scrape_multiple_tickers(..., num_pages=10)  # Default 30

🔒 Security & Ethical Considerations

Rate Limiting

  • Respects website rate limits (2-5 second delays)
  • Rotates proxies on 429 errors
  • Random user agents (3 variations)
  • Date chunking retrieves deeper history without paging past Google News's 30-page cap
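The user-agent rotation mentioned above can be sketched as follows; the three strings the script actually uses are not shown in this README, so these are placeholders:

```python
import random

USER_AGENTS = [
    # Hypothetical examples; substitute the script's real strings
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def random_headers():
    # Vary the browser fingerprint on each request
    return {"User-Agent": random.choice(USER_AGENTS)}
```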

Terms of Service

  • Use responsibly and check each website's robots.txt
  • Respect Crawl-delay directives
  • Consider using official APIs where available
  • Attribution: Always credit original sources

Proxy Usage

  • Cost: Proxy service charges per GB
  • Bandwidth: ~500MB-1GB for full S&P 500 scrape
  • Estimated Cost: $5-50 depending on provider

🐛 Troubleshooting

Issue: No articles found

Solution:

  • Check proxy connectivity: python -c "from scraper import validate_proxy; validate_proxy('http://...')"
  • Verify date format: Must be YYYY-MM-DD
  • Check ticker symbols: Must be exact S&P 500 symbols

Issue: "Too Many Requests" errors persist

Solution:

  • Increase delay: time.sleep(10) instead of 5
  • Use fewer proxies temporarily
  • Reduce pages per chunk: num_pages=10
  • Wait 24 hours before retrying

Issue: Yahoo Finance articles not scraping

Solution:

  • Yahoo Finance blocks aggressive scraping; consider rate limiting
  • Use a fresh proxy pool
  • Check if article was moved/deleted

Issue: Out of memory

Solution:

  • Process in smaller ticker batches:
    batch_size = 50
    for i in range(0, len(tickers), batch_size):
        scrape_multiple_tickers(tickers[i:i+batch_size], ...)

📈 Data Analysis Examples

Load and Explore

import pandas as pd

# Load results
df = pd.read_csv('all_tickers_news_2020-01-01_to_2025-01-01.csv')

# Overview
print(df.shape)  # (130000+, 8)
print(df.columns)
print(df.head())

# Articles per ticker
df['ticker'].value_counts().head()

# Articles over time
df['date'] = pd.to_datetime(df['date'])
df.groupby(df['date'].dt.year).size()  # Count by year

Sentiment Analysis

from textblob import TextBlob

# Add sentiment scores
df['sentiment'] = df['full_text'].apply(
    lambda x: TextBlob(x).sentiment.polarity if isinstance(x, str) else 0
)

# Average sentiment by ticker
df.groupby('ticker')['sentiment'].mean().sort_values(ascending=False)

🚀 Future Enhancements

  • Parallel Processing: ThreadPoolExecutor for 10-50 concurrent tickers
  • Database Backend: Move from CSV to SQLite/PostgreSQL for better querying
  • Sentiment Analysis: Integrate TextBlob or VADER for article sentiment
  • NLP Pipeline: Extract entities, keywords, and topics
  • Real-Time Updates: Scheduled daily/weekly runs for fresh data
  • API Endpoint: Flask/FastAPI for querying results
  • Caching Layer: Redis for deduplication across runs
  • Monitoring Dashboard: Grafana/Streamlit for progress tracking
  • Official APIs: Integration with Alpha Vantage, NewsAPI for supplementary data

📝 Project Statistics

| Metric | Value |
| --- | --- |
| Total Tickers | 500 (S&P 500) |
| Target Tickers | 450+ (after QC filter) |
| Date Range | 5 years (2020-2025) |
| Estimated Articles | 130,000+ |
| Date Chunks | ~60 per ticker (30-day chunks) |
| Pages Per Chunk | 30 (~300 results) |
| Estimated Requests | 810,000+ |
| Processing Time | 40-80 hours (continuous) |
| Output Size | 2-5 GB |
| Languages Supported | English, Spanish, Portuguese |

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

⚖️ License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer: This tool is for educational and research purposes only. Users are responsible for complying with the terms of service of websites being scraped and all applicable laws.


👤 Author

Ericsson Raphael || Gozie Ibekwe
Financial Data Engineering | Web Scraping | Quantitative Finance


📞 Support

For issues, questions, or suggestions:

  • Open an issue on GitHub
  • Check existing issues for solutions
  • Review troubleshooting section above

🙏 Acknowledgments

  • S&P 500 data sourced from Wikipedia
  • News data sourced from Google News and Yahoo Finance
  • Proxy infrastructure by SmartProxy
  • Built with Python, BeautifulSoup, and Pandas

Last Updated: February 23, 2026
Version: 1.5.0
Status: Production Ready
