A production-ready web scraping system for extracting restaurant longevity data from Yelp reviews. Built to analyze 5,897 restaurants across the United States to determine time-in-business metrics through alternative data collection.
This project demonstrates systematic alternative data extraction for quantitative analysis. By scraping Yelp's oldest review dates, we can estimate restaurant operational longevity, a key metric for assessing business stability and risk in the restaurant financing sector.
- Successfully scraped 5,897 restaurants across all 50 states
- 92% success rate in finding Yelp URLs
- 88% data completeness for oldest review extraction
- Handled 200+ CAPTCHA challenges during scraping
- Robust anti-detection measures implemented
Originally developed to assess portfolio risk for alternative restaurant financing, this system provides:
- Time-in-business metrics from alternative data sources
- Business closure status monitoring
- Geographic distribution analysis
- Historical operational timeline reconstruction
The system uses a two-phase approach to maximize efficiency and data quality:
Phase 1: Fast URL discovery via Tavily API (no CAPTCHAs, 92% success rate)
Phase 2: Selenium-based review scraping (handles JavaScript, 88% extraction success)
Python 3.8+
Chrome browser
- Clone the repository
git clone https://github.com/yourusername/restaurant-age-yelp-scraper.git
cd restaurant-age-yelp-scraper
- Set up virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
- Configure API key
cp config.example.yml config.yml
# Edit config.yml and add your Tavily API key
python -m src.url_finder data/sample_input.csv data/urls.csv YOUR_API_KEY
Input format: CSV with columns: Location Name, Address, City, State
Output: CSV with Yelp URLs and search metadata
python -m src.review_scraper data/urls.csv data/results.csv
Output: CSV with oldest review dates, ratings, and business status
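To make the Phase 1 input format concrete, here is a minimal sketch that reads a CSV with the four documented columns and rejects files that lack any of them. The function name `load_restaurants` and the sample row are illustrative assumptions, not part of the project's `src/` modules.

```python
import csv
import io

# Column names taken from the documented input format.
REQUIRED_COLUMNS = ["Location Name", "Address", "City", "State"]


def load_restaurants(handle):
    """Read restaurant rows from an open CSV file, checking the documented columns."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"input CSV is missing columns: {missing}")
    # Keep only the documented columns and trim stray whitespace.
    return [{c: (row[c] or "").strip() for c in REQUIRED_COLUMNS} for row in reader]


# Made-up sample data, for illustration only.
sample = io.StringIO(
    "Location Name,Address,City,State\n"
    "Example Diner, 123 Main St ,Springfield,IL\n"
)
rows = load_restaurants(sample)
print(rows[0]["Address"])
```

Validating the header up front means a malformed file fails before any API calls are spent on it.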
Check out the analysis notebook for detailed quantitative analysis including statistical modeling, survival curves, and risk scoring.
Real Data: All visualizations below are generated from the complete dataset of 5,897 scraped restaurants. These represent actual patterns and statistics from the full scraping project.
The data shows a right-skewed distribution with most restaurants being relatively young (median 8.2 years), but a significant tail of established businesses operating 15+ years.
Key Insight: Restaurants that survive past the 3-year mark show dramatically lower closure rates, making this a critical threshold for risk assessment.
- Oldest restaurant found: Operating since 2005 (19+ years)
- Median time-in-business: 8.2 years
- Critical threshold: 3-year mark shows 69% reduction in closure risk
- Closure correlation: Newer restaurants (<3 years) show 31% closure rate vs. 12% for 8+ year establishments
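For readers who want to reproduce this kind of comparison, the sketch below computes the relative reduction in closure rate between two cohorts, using the 31% and 12% figures quoted above. The helper name `relative_reduction` is an assumption for illustration. The 69% figure at the 3-year threshold is a separate comparison and cannot be re-derived from these two rates alone.

```python
def relative_reduction(baseline_rate, comparison_rate):
    """Fraction by which comparison_rate is lower than baseline_rate."""
    return (baseline_rate - comparison_rate) / baseline_rate


young_rate = 0.31        # closure rate for restaurants under 3 years (from the text)
established_rate = 0.12  # closure rate for 8+ year establishments (from the text)

reduction = relative_reduction(young_rate, established_rate)
print(f"{reduction:.0%}")  # prints "61%"
```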
- Custom user-agent rotation
- Disabled automation indicators
- Natural scrolling and delays
- Randomized request timing
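The four measures above might be wired together roughly as follows. This is a sketch under assumptions, not the project's actual configuration: the user-agent pool, the delay bounds, and the function names are all hypothetical. `build_driver` imports Selenium lazily, so the helper functions can be exercised without a browser installed.

```python
import random

# Hypothetical user-agent pool; real strings should track current browser releases.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def chrome_arguments(rng=random):
    """Command-line switches: a randomly chosen user agent, automation hint disabled."""
    return [
        f"--user-agent={rng.choice(USER_AGENTS)}",
        "--disable-blink-features=AutomationControlled",
    ]


def polite_delay(rng=random, low=2.0, high=6.0):
    """Random pause in seconds between requests; the bounds are assumed values."""
    return rng.uniform(low, high)


def build_driver():
    """Create a Chrome driver with the switches above (requires selenium and Chrome)."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in chrome_arguments():
        options.add_argument(argument)
    # Removes the "controlled by automated test software" switch.
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    return webdriver.Chrome(options=options)
```

A caller would typically `time.sleep(polite_delay())` between page loads and scroll in small increments rather than jumping to the bottom of the page.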
- 4-tier cascading search strategy
- Date validation with regex patterns
- Duplicate detection and handling
- Automatic retry mechanisms
- Automatic progress saving (every 10 records)
- Resume capability after interruption
- Comprehensive logging
- Error handling and recovery
- Rate limiting and polite scraping
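Progress saving every 10 records and resuming after an interruption can be sketched as an append-only CSV that is flushed periodically and re-read on startup. The column names, the `scrape` callable, and the function names below are assumptions for illustration. The real scraper may store its checkpoints differently.

```python
import csv
import os

SAVE_EVERY = 10  # matches the "every 10 records" interval stated above


def already_done(path):
    """Return the set of URLs already present in the results file, if it exists."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["url"] for row in csv.DictReader(f)}


def run(urls, out_path, scrape, save_every=SAVE_EVERY):
    """Scrape each URL not yet in out_path, forcing results to disk periodically."""
    done = already_done(out_path)
    needs_header = not os.path.exists(out_path) or os.path.getsize(out_path) == 0
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "oldest_review_date"])
        if needs_header:
            writer.writeheader()
        pending = 0
        for url in urls:
            if url in done:
                continue  # resume: skip work finished in an earlier run
            writer.writerow({"url": url, "oldest_review_date": scrape(url)})
            pending += 1
            if pending >= save_every:
                f.flush()
                os.fsync(f.fileno())  # make sure the batch reaches the disk
                pending = 0
```

Because rows are only ever appended, a crash loses at most the records written since the last flush, and rerunning the same command picks up where it left off.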
restaurant-age-yelp-scraper/
├── src/
│   ├── url_finder.py        # Phase 1: URL discovery
│   ├── review_scraper.py    # Phase 2: Review extraction
│   └── __init__.py
├── data/
│   ├── sample_input.csv     # Example input data
│   └── sample_output.csv    # Example results
├── notebooks/
│   └── analysis.ipynb       # Data analysis & visualizations
├── docs/
│   └── METHODOLOGY.md       # Technical methodology
├── README.md
├── LICENSE
├── requirements.txt
├── config.example.yml
└── .gitignore
For detailed technical methodology, see docs/METHODOLOGY.md.
- Name + Street + City + State (most specific)
- Name + City + State (high precision)
- Project Name + City + State (handles brand variations)
- Base Name + City + State (fallback for complex names)
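One way to express the four tiers above is to generate the queries in order from most to least specific and stop at the first one that returns a URL. This sketch makes several assumptions: "Project Name" is treated as an optional alternate name field, the `base_name` simplification rule is invented for illustration, and `search` stands in for whatever function wraps the Tavily call.

```python
import re


def base_name(name):
    """Assumed simplification: drop text after ' - ' or '(' and store numbers like '#12'."""
    name = re.split(r"\s+-\s+|\(", name, maxsplit=1)[0]
    name = re.sub(r"\s*#\s*\d+\b", "", name)
    return name.strip()


def tiered_queries(name, street, city, state, project_name=None):
    """Build search queries from most to least specific, skipping duplicates."""
    place = f"{city} {state}"
    candidates = [
        f"{name} {street} {place}",          # tier 1: name + street + city + state
        f"{name} {place}",                   # tier 2: name + city + state
        f"{project_name or name} {place}",   # tier 3: project name + city + state
        f"{base_name(name)} {place}",        # tier 4: base name + city + state
    ]
    seen, ordered = set(), []
    for query in candidates:
        query = " ".join(query.split())  # normalise whitespace
        if query.lower() not in seen:
            seen.add(query.lower())
            ordered.append(query)
    return ordered


def first_hit(queries, search):
    """Try each query in order; return (position, url) for the first hit."""
    for position, query in enumerate(queries, start=1):
        url = search(query)
        if url:
            return position, url
    return None, None
```

Recording which query succeeded alongside the URL gives a rough confidence signal, since a match found only by the last fallback deserves more scrutiny than one found by the full address.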
- Sorts reviews by date (ascending)
- Validates date format with regex
- Filters promotional content
- Extracts rating and review text
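The extraction steps above could look roughly like this once the review elements have been parsed into dictionaries. The date format (`Jan 5, 2015`), the dictionary keys, and the `promotional` flag are assumptions for illustration. The page's actual markup and the project's regex may differ. Note that `%b` parsing depends on an English locale.

```python
import re
from datetime import datetime

# Assumed display format, e.g. "Jan 5, 2015".
DATE_RE = re.compile(
    r"^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, \d{4}$"
)


def parse_review_date(text):
    """Return a date if text matches the assumed format and is a real calendar day."""
    text = text.strip()
    if not DATE_RE.match(text):
        return None
    try:
        return datetime.strptime(text, "%b %d, %Y").date()
    except ValueError:
        return None  # shape is right but the day does not exist, e.g. "Feb 30, 2015"


def oldest_review(reviews):
    """Pick the earliest validly dated, non-promotional review."""
    dated = []
    for review in reviews:
        if review.get("promotional"):
            continue  # skip content flagged as promotional
        parsed = parse_review_date(review.get("date", ""))
        if parsed is not None:
            dated.append((parsed, review))
    if not dated:
        return None
    parsed, review = min(dated, key=lambda pair: pair[0])
    return {
        "date": parsed.isoformat(),
        "rating": review.get("rating"),
        "text": review.get("text"),
    }
```

Validating with both a regex and `strptime` catches strings that merely look like dates, so a stray "Updated review" label or an impossible day never becomes the reported opening estimate.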
- Restaurant portfolio risk assessment
- Business longevity predictions
- Market entry/exit analysis
- Geographic expansion patterns
- Competitive landscape analysis
- Industry trend identification
- Location-based success factors
- Consumer sentiment over time
This tool is designed for:
- Research and analysis purposes
- Public data aggregation
- Portfolio risk assessment
Please ensure compliance with:
- Yelp's Terms of Service
- robots.txt guidelines
- Rate limiting best practices
- Local data privacy regulations
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Tavily API for efficient URL discovery
- Selenium WebDriver for robust scraping
- BeautifulSoup for HTML parsing
For questions or collaboration opportunities, please open an issue or reach out via email.
Note: This is a portfolio project demonstrating web scraping, data engineering, and quantitative analysis skills. The techniques shown here can be adapted for various alternative data collection use cases.



