NotCoolYoshi/nlp-cpi-nowcasting


NLP Economic Nowcasting

Predicting CPI Surprises from Google Trends, FOMC Sentiment, and Real-Time Energy Prices

Sean Dai · Arizona

Python License: MIT


Revision History

v1.0 (March 2026): Initial release.

v1.1 (April 2026): Corrected SUR Wald test standard errors. See Releases for full details.

v1.2 (April 2026): Added real-time nowcasting infrastructure (update_actual.py, realtime_predictions.csv, nowcast_today.py).


Overview

This repository contains the full code and results for an independent empirical study constructing a machine learning nowcasting model for monthly CPI surprises. The final Ridge regression model (Sample C) achieves:

  • RMSE = 0.2322% vs AR(1) benchmark of 0.2337%
  • OOS-R² = +14.1% over the full 2018–2025 evaluation period
  • OOS-R² = +26.9% pre-COVID, +12.4% post-COVID
  • Directional accuracy = 65.9% (mean 66.9% across 8 years)
  • Outperforms both the Cleveland Fed (RMSE = 0.2483%) and Michigan Survey (RMSE = 0.2442%) benchmarks

The key methodological finding is that the T/p ratio (observations per estimated parameter) is the binding constraint in high-dimensional alternative data nowcasting: a 6-feature model (T/p = 23.0) reaches an OOS-R² roughly ten times that of a 61-feature model (T/p = 2.3), +14.1% versus +1.4%.
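
For readers who want to recompute the headline metrics, here is a minimal sketch of RMSE, out-of-sample R², and directional accuracy. The function names and the toy numbers are illustrative assumptions, not taken from the repository, and the README does not say which benchmark the reported OOS-R² is measured against, so the benchmark is left as an argument.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def oos_r2(actual, model_pred, benchmark_pred):
    """Out-of-sample R^2: 1 - SSE(model) / SSE(benchmark).

    Positive values mean the model beats the benchmark forecast.
    """
    sse_model = sum((a - m) ** 2 for a, m in zip(actual, model_pred))
    sse_bench = sum((a - b) ** 2 for a, b in zip(actual, benchmark_pred))
    return 1.0 - sse_model / sse_bench

def directional_accuracy(actual, predicted):
    """Share of periods where the predicted sign matches the realized sign."""
    hits = sum((a > 0) == (p > 0) for a, p in zip(actual, predicted))
    return hits / len(actual)

# Toy surprises in percentage points (made up for illustration only).
actual = [0.2, -0.1, 0.3, -0.2]
model = [0.1, -0.2, 0.1, 0.1]
benchmark = [0.0, 0.0, 0.0, 0.0]  # hypothetical "no surprise" benchmark

r2 = oos_r2(actual, model, benchmark)
acc = directional_accuracy(actual, model)
```

Because OOS-R² is defined relative to a benchmark, the same model forecasts can score very differently against an AR(1), a historical mean, or a zero forecast.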


Paper

The current findings will be published on SSRN; I'll add the link here once it is available. While the paper is "complete", I have yet to compare the findings against updated data and verify that the model fares as well on live data as it does in simulation. Stay tuned!


Inspiration

This project grew out of a question I kept returning to while studying Real Analysis through Stanford Pre-Collegiate Studies: where does this mathematics actually show up in the real world? The convergence proofs, continuity conditions, and function space arguments felt powerful in the abstract, but I wanted to see them do something concrete.

Macroeconomic forecasting turned out to be the answer. The consistency of an OLS estimator depends on a sequence of random variables converging in probability: the same ε-N convergence I had been proving in analysis. The stationarity condition for an AR(1) model is a geometric series convergence argument. The existence of the OLS projection is guaranteed by the completeness of L² function spaces. These were not analogies but rather the exact same mathematical results, applied to a problem with real stakes: predicting inflation before it is officially measured.
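
As an illustration of the geometric series argument mentioned above, here is the standard textbook derivation for a zero-mean AR(1) process. The notation (φ for the autoregressive coefficient, σ² for the innovation variance) is assumed here and may differ from the paper's.

```latex
\begin{aligned}
y_t &= \phi\, y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \text{ i.i.d. with mean } 0 \text{ and variance } \sigma^2 \\
y_t &= \phi^{k} y_{t-k} + \sum_{j=0}^{k-1} \phi^{j} \varepsilon_{t-j} \qquad \text{(back-substituting } k \text{ times)} \\
\operatorname{Var}(y_t) &= \sigma^2 \sum_{j=0}^{\infty} \phi^{2j} = \frac{\sigma^2}{1 - \phi^2} \qquad \text{for } |\phi| < 1
\end{aligned}
```

The sum in the last line is a geometric series with ratio φ², which converges exactly when |φ| < 1; for |φ| ≥ 1 the variance does not settle to a finite constant.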

The specific motivation came from watching the Federal Reserve and professional forecasters repeatedly fail to anticipate the 2021–2022 inflation surge. Markets had access to real-time signals — pump prices at gas stations, Google search behavior, the language the Fed used in its own statements — that traditional models based on lagged official data simply could not incorporate. This project is an attempt to close that gap using alternative data, NLP, and the mathematical foundations that make the statistical claims honest rather than just empirical.


What This Project Is

Every month, the Bureau of Labor Statistics releases the Consumer Price Index, which is a measure of how much prices changed over the past month. Financial markets, central banks, and businesses all form expectations about what that number will be before it is released. When the actual number differs from those expectations, it is called a CPI surprise, and it can move bond markets, shift Federal Reserve policy, and alter corporate pricing decisions within minutes of publication.
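
A minimal sketch of that definition: the surprise is the realized month-over-month change minus the expectation held before the release. The function names and index levels below are made up for illustration, and the expectation series actually used in construct_surprise.py is not described in this README.

```python
def mom_pct_change(index_levels):
    """Month-over-month percent change of a price index."""
    return [100.0 * (cur / prev - 1.0)
            for prev, cur in zip(index_levels, index_levels[1:])]

def cpi_surprise(actual_mom, expected_mom):
    """Surprise = actual MoM change minus expected MoM change (percentage points)."""
    return [a - e for a, e in zip(actual_mom, expected_mom)]

# Hypothetical index levels and expectations, for illustration only.
levels = [300.0, 300.9, 301.5]
actual_mom = mom_pct_change(levels)   # two monthly changes
expected_mom = [0.2, 0.3]             # assumed pre-release expectations
surprise = cpi_surprise(actual_mom, expected_mom)
```

A positive surprise means inflation came in hotter than expected; a negative one means it came in cooler.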

This project builds a machine learning model that predicts CPI surprises before they are officially released, using three real-time data sources that encode economic reality faster than official statistics can measure it:

  • Google Trends — when consumers search "gas prices" or "grocery inflation," they are experiencing price changes in real time, weeks before CPI captures them
  • FOMC Statements — the Federal Reserve's post-meeting language, scored for hawkishness using a financial NLP model (FinBERT), encodes the Fed's private assessment of inflation risk
  • EIA Gasoline Prices — weekly retail pump prices published every Monday, which directly drive the energy component of CPI
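
To make the energy signal concrete, the sketch below turns weekly pump prices into a monthly "acceleration" feature. It assumes acceleration means the month-to-month change in the monthly percent change; the exact definition in feature_engineering.py may differ, and the function name and prices are invented for the example.

```python
import pandas as pd

def gas_price_acceleration(weekly_prices: pd.Series) -> pd.DataFrame:
    """Aggregate weekly prices to months, then take first and second changes.

    Assumes `weekly_prices` has a DatetimeIndex of weekly observation dates.
    """
    monthly = weekly_prices.resample("MS").mean()  # monthly average price
    mom_pct = monthly.pct_change() * 100.0         # month-over-month % change
    accel = mom_pct.diff()                         # change in the % change
    return pd.DataFrame({"price": monthly, "mom_pct": mom_pct, "accel": accel})

# Thirteen hypothetical Monday observations: 5 in January, 4 in February, 4 in March 2024.
dates = pd.date_range("2024-01-01", periods=13, freq="W-MON")
prices = pd.Series([3.00] * 5 + [3.15] * 4 + [3.45] * 4, index=dates)
features = gas_price_acceleration(prices)
```

In this toy series prices rise 5% in February and about 9.5% in March, so the March acceleration is positive: prices are not only rising but rising faster.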

The core finding is methodological: a 6-variable model with a good observations-to-parameters ratio dramatically outperforms a 61-variable model with a poor one. Sophistication in model architecture matters far less than having enough data per parameter to estimate reliably. The final Ridge regression model achieves OOS-R² = +14.1% over the 2018–2025 out-of-sample window and outperforms both the Cleveland Fed and Michigan Survey professional forecasts.
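
The following is a rough sketch of how a Ridge nowcast can be evaluated out of sample with an expanding window, on synthetic data only. The penalty `alpha`, the split point, the expanding-mean benchmark, and all function names are assumptions made for illustration; the repository's models.py may use a different scheme.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge on standardized features: (X'X + alpha*I) b = X'(y - mean)."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    Xs = (X - mu) / sd
    y_mean = y.mean()
    p = X.shape[1]
    beta = np.linalg.solve(Xs.T @ Xs + alpha * np.eye(p), Xs.T @ (y - y_mean))
    return mu, sd, y_mean, beta

def ridge_predict(model, X):
    """Apply the training-sample scaling, then the fitted coefficients."""
    mu, sd, y_mean, beta = model
    return y_mean + ((X - mu) / sd) @ beta

def expanding_window_forecasts(X, y, first_test, alpha=1.0):
    """Refit on rows [0, t) and forecast row t, for every t >= first_test."""
    preds = []
    for t in range(first_test, len(y)):
        model = ridge_fit(X[:t], y[:t], alpha)
        preds.append(ridge_predict(model, X[t:t + 1])[0])
    return np.array(preds)

# Synthetic example with T = 138 observations and p = 6 features (T/p = 23.0).
rng = np.random.default_rng(0)
T, p, first_test = 138, 6, 100
X = rng.normal(size=(T, p))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=T)

preds = expanding_window_forecasts(X, y, first_test)
bench = np.array([y[:t].mean() for t in range(first_test, T)])  # expanding-mean benchmark
sse_model = float(((y[first_test:] - preds) ** 2).sum())
sse_bench = float(((y[first_test:] - bench) ** 2).sum())
```

Each forecast uses only data available before the month being predicted, which is what makes the comparison genuinely out of sample.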


Data Sources

| Source | Series | Access |
|---|---|---|
| FRED (St. Louis Fed) | CPI, Core CPI, Unemployment, Fed Funds Rate, T10YIE, WTI Oil, M2, INDPRO | Free API |
| Google Trends | 9 inflation-related search queries | pytrends |
| Federal Reserve | 129 FOMC post-meeting statements (2010–2025) | Public HTML |
| EIA | Weekly retail gasoline prices (GASREGW) | FRED API |

Note: Raw data files are not committed to this repository. Run the pipeline scripts in order to reproduce the dataset.
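
As a rough idea of what the FRED pull involves, here is a standard-library sketch against the public FRED `series/observations` endpoint. It is not the repository's pull_fred.py; the helper names and the start date are assumptions, and the key is read from the `FRED_API_KEY` environment variable, so a `.env` file would need to be loaded separately.

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import urlopen

FRED_URL = "https://api.stlouisfed.org/fred/series/observations"

def build_fred_url(series_id, api_key, start="2006-01-01"):
    """Build the request URL for one series (start date is an assumed default)."""
    params = {
        "series_id": series_id,
        "api_key": api_key,
        "file_type": "json",
        "observation_start": start,
    }
    return f"{FRED_URL}?{urlencode(params)}"

def parse_observations(payload):
    """Return (date, value) pairs, skipping FRED's "." marker for missing values."""
    return [(obs["date"], float(obs["value"]))
            for obs in payload["observations"]
            if obs["value"] != "."]

def fetch_series(series_id, api_key=None):
    """Download and parse one series; needs network access and a valid key."""
    api_key = api_key or os.environ["FRED_API_KEY"]
    with urlopen(build_fred_url(series_id, api_key)) as resp:
        return parse_observations(json.load(resp))

# Offline check of the parser on a hand-written payload (values are made up).
sample = {"observations": [
    {"date": "2024-01-01", "value": "3.07"},
    {"date": "2024-01-08", "value": "."},
    {"date": "2024-01-15", "value": "3.06"},
]}
parsed = parse_observations(sample)
url = build_fred_url("GASREGW", "demo_key")
```

Calling `fetch_series("GASREGW")` or `fetch_series("T10YIE")` would then return the weekly gasoline and breakeven inflation series listed in the table.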


Quickstart

# Clone the repository
git clone https://github.com/NotCoolYoshi/nlp-cpi-nowcasting.git
cd nlp-cpi-nowcasting

# Install dependencies
pip install -r requirements.txt

# Add your FRED API key
echo "FRED_API_KEY=your_key_here" > .env

# Run the full pipeline
python src/pull_fred.py
python src/pull_trends.py
python src/pull_eia.py
python src/scrape_fomc.py
python src/fomc_sentiment.py
python src/construct_surprise.py
python src/feature_engineering.py
python src/models.py
python src/shap_analysis.py
python src/chow_test.py
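
If running ten commands by hand is tedious, a small runner along these lines could execute them in order and stop at the first failure. This is a hypothetical helper, not a file in the repository; the script names are copied from the list above and the `runner` parameter exists only to make the function easy to test.

```python
import subprocess
import sys

# Pipeline order as listed in the Quickstart.
PIPELINE = [
    "pull_fred.py",
    "pull_trends.py",
    "pull_eia.py",
    "scrape_fomc.py",
    "fomc_sentiment.py",
    "construct_surprise.py",
    "feature_engineering.py",
    "models.py",
    "shap_analysis.py",
    "chow_test.py",
]

def run_pipeline(scripts=PIPELINE, src_dir="src", runner=subprocess.run):
    """Run each script with the current interpreter.

    check=True makes subprocess.run raise CalledProcessError on a non-zero
    exit code, so later steps never run on top of a failed earlier step.
    """
    for name in scripts:
        cmd = [sys.executable, f"{src_dir}/{name}"]
        print("running:", " ".join(cmd))
        runner(cmd, check=True)

# Usage, from the repository root: run_pipeline()
```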

Repository Structure

src/           All Python scripts (run in order listed above)
notebooks/     Jupyter notebooks for exploration and visualization
data/          Data directory (populated after running pipeline)
figures/       All paper figures (Figures 1–20)
results/       Model output CSVs
paper/         Full research paper PDF

Key Results

Three-Sample Progression

| Sample | Features | T/p | Best OOS-R² | Mean Dir. Acc. |
|---|---|---|---|---|
| A (61 feat, 2010–2025) | 61 | 2.3 | +1.4% | 58.3% |
| B (61 feat, 2006–2025) | 61 | 3.1 | -3.2% | 45.6% |
| C (6 feat + EIA) | 6 | 23.0 | +14.1% | 66.9% |
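
The T/p column can be read backwards to get a rough sense of how many observations sit behind each sample. The counts below are only implied by the rounded ratios in the table; they are not figures reported by the author.

```latex
\begin{aligned}
T &\approx (T/p) \times p \\
\text{Sample A:}\quad T &\approx 2.3 \times 61 \approx 140 \\
\text{Sample B:}\quad T &\approx 3.1 \times 61 \approx 189 \\
\text{Sample C:}\quad T &\approx 23.0 \times 6 = 138
\end{aligned}
```

On this reading, Samples A and C use a similar number of observations, so the jump in T/p comes almost entirely from cutting the feature count from 61 to 6.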

SHAP Feature Importance (Sample C Ridge)

| Feature | Source | Importance |
|---|---|---|
| EIA gas price acceleration | EIA | 36% |
| Google Trends: gas prices (L3) | Google Trends | 20% |
| WTI crude oil MoM (L1) | FRED | 17% |
| 10yr breakeven inflation diff | FRED | 14% |
| Google Trends: gas price accel | Google Trends | 8% |
| FOMC hawkishness (L1) | FinBERT | 5% |
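
For a linear model such as Ridge, SHAP values have a simple closed form when features are treated as independent: each feature's contribution is its coefficient times its deviation from the feature mean. The sketch below shows how percentage shares like those in the table could be derived from that formula. It is an assumption that shap_analysis.py works this way, and the data and function names are invented.

```python
import numpy as np

def linear_shap_values(X, coef):
    """SHAP values for a linear model under feature independence:
    phi[i, j] = coef[j] * (X[i, j] - mean_j)."""
    return (X - X.mean(axis=0)) * coef

def importance_shares(shap_values):
    """Mean absolute SHAP value per feature, normalized to sum to 1."""
    mean_abs = np.abs(shap_values).mean(axis=0)
    return mean_abs / mean_abs.sum()

# Two hypothetical standardized features and made-up coefficients.
X = np.array([[-1.0, -1.0],
              [ 1.0, -1.0],
              [-1.0,  1.0],
              [ 1.0,  1.0]])
coef = np.array([3.0, 1.0])

phi = linear_shap_values(X, coef)
shares = importance_shares(phi)

# Local accuracy: contributions plus the average prediction recover each prediction.
predictions = X @ coef
reconstructed = phi.sum(axis=1) + predictions.mean()
```

With equally scaled features the shares simply track the absolute coefficients, which is why the first feature gets three quarters of the total here.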

Citation

@misc{dai2026nowcasting,
  author = {Dai, Sean},
  title  = {NLP Economic Nowcasting: Predicting CPI Surprises from 
             Google Trends, FOMC Sentiment, and Real-Time Energy Prices},
  year   = {2026},
  url    = {https://github.com/NotCoolYoshi/nlp-cpi-nowcasting}
}

Contact

Sean Dai · sean.zhdai@gmail.com · Arizona

About

Machine learning CPI nowcasting using Google Trends, FinBERT FOMC sentiment, and real-time EIA gasoline prices. Ridge regression beats Cleveland Fed benchmark.
