Sean Dai · Arizona
v1.0 (March 2026): Initial release.
v1.1 (April 2026): Corrected SUR Wald test standard errors. See Releases for full details.
v1.2 (April 2026): Added real-time nowcasting infrastructure (update_actual.py, realtime_predictions.csv, nowcast_today.py).
This repository contains the full code and results for an independent empirical study constructing a machine learning nowcasting model for monthly CPI surprises. The final Ridge regression model (Sample C) achieves:
- RMSE = 0.2322% vs AR(1) benchmark of 0.2337%
- OOS-R² = +14.1% over the full 2018–2025 evaluation period
- OOS-R² = +26.9% pre-COVID, +12.4% post-COVID
- Directional accuracy = 65.9% (mean 66.9% across 8 years)
- Outperforms both the Cleveland Fed (RMSE = 0.2483%) and Michigan Survey (RMSE = 0.2442%) benchmarks
The key methodological finding is that the T/p ratio is the binding constraint in high-dimensional alternative data nowcasting: a 6-feature model (T/p = 23.0) outperforms a 61-feature model (T/p = 2.3) by a factor of 10 in OOS-R².
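The OOS-R² figures above compare the model against the AR(1) benchmark. As a minimal sketch (assuming the standard out-of-sample R² convention, 1 − SSE(model)/SSE(benchmark); the repo's own computation lives in models.py):

```python
import numpy as np

def oos_r2(actual, model_pred, benchmark_pred):
    """Out-of-sample R^2 of a model relative to a benchmark:
    1 - SSE(model) / SSE(benchmark). Positive => model beats benchmark."""
    actual = np.asarray(actual, dtype=float)
    sse_model = np.sum((actual - np.asarray(model_pred, dtype=float)) ** 2)
    sse_bench = np.sum((actual - np.asarray(benchmark_pred, dtype=float)) ** 2)
    return 1.0 - sse_model / sse_bench

# Toy example with illustrative surprise values (not actual data)
actual = [0.1, -0.2, 0.3, 0.0]
model  = [0.05, -0.15, 0.25, 0.05]   # errors of +/-0.05 each month
ar1    = [0.0, 0.0, 0.1, -0.1]       # larger benchmark errors
print(round(oos_r2(actual, model, ar1), 3))  # → 0.9
```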
My findings will be published on SSRN; I'll add the link here once it's live. While the paper is "complete," I still need to compare my findings against updated data and verify how the model fares on real releases, not just backtested simulations. Stay tuned!
This project grew out of a question I kept returning to while studying Real Analysis through Stanford Pre-Collegiate Studies: where does this mathematics actually show up in the real world? The convergence proofs, continuity conditions, and function space arguments felt powerful in the abstract, but I wanted to see them do something concrete.
Macroeconomic forecasting turned out to be the answer. The consistency of an OLS estimator depends on a sequence of random variables converging in probability: the same ε-N convergence I had been proving in analysis. The stationarity condition for an AR(1) model is a geometric series convergence argument. The existence of the OLS projection is guaranteed by the completeness of L² function spaces. These were not analogies but rather the exact same mathematical results, applied to a problem with real stakes: predicting inflation before it is officially measured.
The specific motivation came from watching the Federal Reserve and professional forecasters repeatedly fail to anticipate the 2021–2022 inflation surge. Markets had access to real-time signals — pump prices at gas stations, Google search behavior, the language the Fed used in its own statements — that traditional models based on lagged official data simply could not incorporate. This project is an attempt to close that gap using alternative data, NLP, and the mathematical foundations that make the statistical claims honest rather than just empirical.
Every month, the Bureau of Labor Statistics releases the Consumer Price Index, which is a measure of how much prices changed over the past month. Financial markets, central banks, and businesses all form expectations about what that number will be before it is released. When the actual number differs from those expectations, it is called a CPI surprise, and it can move bond markets, shift Federal Reserve policy, and alter corporate pricing decisions within minutes of publication.
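In code, the surprise is just the gap between the released number and the prior expectation (the actual expectation series used in this project is built in construct_surprise.py; the numbers below are illustrative, not real data):

```python
def cpi_surprise(actual_mom: float, expected_mom: float) -> float:
    """CPI surprise in percentage points: realized month-over-month
    inflation minus the expectation formed before the release."""
    return actual_mom - expected_mom

# Illustrative example: consensus expected +1.1% MoM, release came in at +1.3%
print(round(cpi_surprise(1.3, 1.1), 2))  # → 0.2, a positive (upside) surprise
```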
This project builds a machine learning model that predicts CPI surprises before they are officially released, using three real-time data sources that encode economic reality faster than official statistics can measure it:
- Google Trends — when consumers search "gas prices" or "grocery inflation," they are experiencing price changes in real time, weeks before CPI captures them
- FOMC Statements — the Federal Reserve's post-meeting language, scored for hawkishness using a financial NLP model (FinBERT), encodes the Fed's private assessment of inflation risk
- EIA Gasoline Prices — weekly retail pump prices published every Monday, which directly drive the energy component of CPI
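The exact hawkishness scoring lives in fomc_sentiment.py and is not reproduced here. One simple way to collapse FinBERT-style per-sentence class probabilities (positive/negative/neutral) into a statement-level index is a mean net-tone score; treating negative financial tone as hawkish-leaning is an assumption made purely for illustration:

```python
from typing import Dict, List

def hawkishness_score(sentence_probs: List[Dict[str, float]]) -> float:
    """Illustrative statement-level score from per-sentence FinBERT-style
    probabilities {'positive', 'negative', 'neutral'}.
    Assumption: hawkish ~ negative tone about inflation risk.
    Returns mean(negative - positive) over sentences, in [-1, 1]."""
    if not sentence_probs:
        return 0.0
    return sum(p["negative"] - p["positive"] for p in sentence_probs) / len(sentence_probs)

# Two-sentence toy statement
probs = [
    {"positive": 0.1, "negative": 0.7, "neutral": 0.2},  # hawkish-leaning sentence
    {"positive": 0.4, "negative": 0.3, "neutral": 0.3},  # mildly dovish sentence
]
print(round(hawkishness_score(probs), 2))  # → 0.25
```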
The core finding is methodological: a 6-variable model with a good observations-to-parameters ratio dramatically outperforms a 61-variable model with a poor one. Sophistication in model architecture matters far less than having enough data per parameter to estimate reliably. The final Ridge regression model achieves OOS-R² = +14.1% over the 2018–2025 out-of-sample window and outperforms both the Cleveland Fed and Michigan Survey professional forecasts.
| Source | Series | Access |
|---|---|---|
| FRED (St. Louis Fed) | CPI, Core CPI, Unemployment, Fed Funds Rate, T10YIE, WTI Oil, M2, INDPRO | Free API |
| Google Trends | 9 inflation-related search queries | pytrends |
| Federal Reserve | 129 FOMC post-meeting statements (2010–2025) | Public HTML |
| EIA | Weekly retail gasoline prices (GASREGW) | FRED API |
Note: Raw data files are not committed to this repository. Run the pipeline scripts in order to reproduce the dataset.
```bash
# Clone the repository
git clone https://github.com/YOUR_USERNAME/nlp-cpi-nowcasting.git
cd nlp-cpi-nowcasting

# Install dependencies
pip install -r requirements.txt

# Add your FRED API key
echo "FRED_API_KEY=your_key_here" > .env

# Run the full pipeline
python src/pull_fred.py
python src/pull_trends.py
python src/pull_eia.py
python src/scrape_fomc.py
python src/fomc_sentiment.py
python src/construct_surprise.py
python src/feature_engineering.py
python src/models.py
python src/shap_analysis.py
python src/chow_test.py
```

```
src/        All Python scripts (run in order listed above)
notebooks/  Jupyter notebooks for exploration and visualization
data/       Data directory (populated after running pipeline)
figures/    All paper figures (Figures 1–20)
results/    Model output CSVs
paper/      Full research paper PDF
```
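The pull scripts read the FRED API key from .env. As an illustrative sketch (not the actual pull_fred.py), a request URL for FRED's documented observations endpoint can be assembled like this:

```python
from urllib.parse import urlencode

FRED_BASE = "https://api.stlouisfed.org/fred/series/observations"

def fred_url(series_id: str, api_key: str, start: str = "2006-01-01") -> str:
    """Build a FRED observations request URL. The endpoint and parameter
    names (series_id, api_key, file_type, observation_start) are FRED's
    public API; this helper itself is a hypothetical sketch."""
    params = {
        "series_id": series_id,
        "api_key": api_key,
        "file_type": "json",
        "observation_start": start,
    }
    return f"{FRED_BASE}?{urlencode(params)}"

print(fred_url("CPIAUCSL", "your_key_here"))
```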
| Sample | Features | T/p | Best OOS-R² | Mean Dir. Acc |
|---|---|---|---|---|
| A (61 feat, 2010–2025) | 61 | 2.3 | +1.4% | 58.3% |
| B (61 feat, 2006–2025) | 61 | 3.1 | -3.2% | 45.6% |
| C (6 feat + EIA) | 6 | 23.0 | +14.1% | 66.9% |
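The T/p effect in the table above can be reproduced on synthetic data with a closed-form ridge estimator (a sketch only, not the repo's models.py; the data and penalty are illustrative, and the typical outcome is that the 55 added noise features degrade out-of-sample fit):

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit_predict(X_tr, y_tr, X_te, lam=1.0):
    # Closed-form ridge: w = (X'X + lam*I)^(-1) X'y
    p = X_tr.shape[1]
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(p), X_tr.T @ y_tr)
    return X_te @ w

T = 140                                   # roughly monthly-sample scale
signal = rng.standard_normal(T)
# 6 noisy-but-informative features vs. the same 6 plus 55 pure-noise features
X_small = np.column_stack([signal + 0.5 * rng.standard_normal(T) for _ in range(6)])
X_big = np.column_stack([X_small, rng.standard_normal((T, 55))])
y = signal + 0.3 * rng.standard_normal(T)

split = 100                               # train on first 100 obs, test on the rest
def oos_mse(X):
    pred = ridge_fit_predict(X[:split], y[:split], X[split:])
    return float(np.mean((y[split:] - pred) ** 2))

print(f"T/p = {split/6:5.1f} -> OOS MSE (6 features):  {oos_mse(X_small):.3f}")
print(f"T/p = {split/61:5.1f} -> OOS MSE (61 features): {oos_mse(X_big):.3f}")
```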
| Feature | Source | Importance |
|---|---|---|
| EIA gas price acceleration | EIA | 36% |
| Google Trends: gas prices (L3) | Google Trends | 20% |
| WTI crude oil MoM (L1) | FRED | 17% |
| 10yr breakeven inflation diff | FRED | 14% |
| Google Trends: gas price accel | Google Trends | 8% |
| FOMC hawkishness (L1) | FinBERT | 5% |
```bibtex
@misc{dai2026nowcasting,
  author = {Dai, Sean},
  title  = {NLP Economic Nowcasting: Predicting CPI Surprises from
            Google Trends, FOMC Sentiment, and Real-Time Energy Prices},
  year   = {2026},
  url    = {https://github.com/YOUR_USERNAME/nlp-cpi-nowcasting}
}
```

Sean Dai · sean.zhdai@gmail.com · Arizona