Sean Dai · Arizona
v1.0 (March 2026): Initial release.
v1.1 (April 2026): Corrected SUR Wald test standard errors. See Releases for full details.
v1.2 (April 2026): Added real-time nowcasting infrastructure (update_actual.py, realtime_predictions.csv, nowcast_today.py).
This repository contains the full code and results for an independent empirical study constructing a machine learning nowcasting model for monthly CPI surprises. The final Ridge regression model (Sample C) achieves:
- RMSE = 0.2322% vs AR(1) benchmark of 0.2337%
- OOS-R² = +14.1% over the full 2018–2025 evaluation period
- OOS-R² = +26.9% pre-COVID, +12.4% post-COVID
- Directional accuracy = 65.9% (mean 66.9% across 8 years)
- Outperforms both the Cleveland Fed (RMSE = 0.2483%) and Michigan Survey (RMSE = 0.2442%) benchmarks
The key methodological finding is that the T/p ratio is the binding constraint in high-dimensional alternative data nowcasting: a 6-feature model (T/p = 23.0) outperforms a 61-feature model (T/p = 2.3) by a factor of 10 in OOS-R².
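The OOS-R² figures above compare the model against the AR(1) benchmark. As a minimal sketch (assuming the standard out-of-sample R² convention, 1 − SSE(model)/SSE(benchmark); the repo's own computation lives in models.py):

```python
import numpy as np

def oos_r2(actual, model_pred, benchmark_pred):
    """Out-of-sample R^2 of a model relative to a benchmark:
    1 - SSE(model) / SSE(benchmark). Positive => model beats benchmark."""
    actual = np.asarray(actual, dtype=float)
    sse_model = np.sum((actual - np.asarray(model_pred, dtype=float)) ** 2)
    sse_bench = np.sum((actual - np.asarray(benchmark_pred, dtype=float)) ** 2)
    return 1.0 - sse_model / sse_bench

# Toy example with illustrative surprise values (not actual data)
actual = [0.1, -0.2, 0.3, 0.0]
model  = [0.05, -0.15, 0.25, 0.05]   # errors of +/-0.05 each month
ar1    = [0.0, 0.0, 0.1, -0.1]       # larger benchmark errors
print(round(oos_r2(actual, model, ar1), 3))  # → 0.9
```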
My findings will be published on SSRN; I'll add the link here once it's live. While the paper is "complete," I still need to compare my findings against updated data and verify how the model fares on real releases, not just backtested simulations. Stay tuned!
This project grew out of a question I kept returning to while studying Real Analysis through Stanford Pre-Collegiate Studies: where does this mathematics actually show up in the real world? The convergence proofs, continuity conditions, and function space arguments felt powerful in the abstract, but I wanted to see them do something concrete.
Macroeconomic forecasting turned out to be the answer. The consistency of an OLS estimator depends on a sequence of random variables converging in probability: the same ε-N convergence I had been proving in analysis. The stationarity condition for an AR(1) model is a geometric series convergence argument. The existence of the OLS projection is guaranteed by the completeness of L² function spaces. These were not analogies but rather the exact same mathematical results, applied to a problem with real stakes: predicting inflation before it is officially measured.
The specific motivation came from watching the Federal Reserve and professional forecasters repeatedly fail to anticipate the 2021–2022 inflation surge. Markets had access to real-time signals — pump prices at gas stations, Google search behavior, the language the Fed used in its own statements — that traditional models based on lagged official data simply could not incorporate. This project is an attempt to close that gap using alternative data, NLP, and the mathematical foundations that make the statistical claims honest rather than just empirical.
Every month, the Bureau of Labor Statistics releases the Consumer Price Index, which is a measure of how much prices changed over the past month. Financial markets, central banks, and businesses all form expectations about what that number will be before it is released. When the actual number differs from those expectations, it is called a CPI surprise, and it can move bond markets, shift Federal Reserve policy, and alter corporate pricing decisions within minutes of publication.
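In code, the surprise is just the gap between the released number and the prior expectation (the actual expectation series used in this project is built in construct_surprise.py; the numbers below are illustrative, not real data):

```python
def cpi_surprise(actual_mom: float, expected_mom: float) -> float:
    """CPI surprise in percentage points: realized month-over-month
    inflation minus the expectation formed before the release."""
    return actual_mom - expected_mom

# Illustrative example: consensus expected +1.1% MoM, release came in at +1.3%
print(round(cpi_surprise(1.3, 1.1), 2))  # → 0.2, a positive (upside) surprise
```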
This project builds a machine learning model that predicts CPI surprises before they are officially released, using three real-time data sources that encode economic reality faster than official statistics can measure it:
- Google Trends — when consumers search "gas prices" or "grocery inflation," they are experiencing price changes in real time, weeks before CPI captures them
- FOMC Statements — the Federal Reserve's post-meeting language, scored for hawkishness using a financial NLP model (FinBERT), encodes the Fed's private assessment of inflation risk
- EIA Gasoline Prices — weekly retail pump prices published every Monday, which directly drive the energy component of CPI
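The exact hawkishness scoring lives in fomc_sentiment.py and is not reproduced here. One simple way to collapse FinBERT-style per-sentence class probabilities (positive/negative/neutral) into a statement-level index is a mean net-tone score; treating negative financial tone as hawkish-leaning is an assumption made purely for illustration:

```python
from typing import Dict, List

def hawkishness_score(sentence_probs: List[Dict[str, float]]) -> float:
    """Illustrative statement-level score from per-sentence FinBERT-style
    probabilities {'positive', 'negative', 'neutral'}.
    Assumption: hawkish ~ negative tone about inflation risk.
    Returns mean(negative - positive) over sentences, in [-1, 1]."""
    if not sentence_probs:
        return 0.0
    return sum(p["negative"] - p["positive"] for p in sentence_probs) / len(sentence_probs)

# Two-sentence toy statement
probs = [
    {"positive": 0.1, "negative": 0.7, "neutral": 0.2},  # hawkish-leaning sentence
    {"positive": 0.4, "negative": 0.3, "neutral": 0.3},  # mildly dovish sentence
]
print(round(hawkishness_score(probs), 2))  # → 0.25
```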
The core finding is methodological: a 6-variable model with a good observations-to-parameters ratio dramatically outperforms a 61-variable model with a poor one. Sophistication in model architecture matters far less than having enough data per parameter to estimate reliably. The final Ridge regression model achieves OOS-R² = +14.1% over the 2018–2025 out-of-sample window and outperforms both the Cleveland Fed and Michigan Survey professional forecasts.
| Source | Series | Access |
|---|---|---|
| FRED (St. Louis Fed) | CPI, Core CPI, Unemployment, Fed Funds Rate, T10YIE, WTI Oil, M2, INDPRO | Free API |
| Google Trends | 9 inflation-related search queries | pytrends |
| Federal Reserve | 129 FOMC post-meeting statements (2010–2025) | Public HTML |
| EIA | Weekly retail gasoline prices (GASREGW) | FRED API |
Note: Raw data files are not committed to this repository. Run the pipeline scripts in order to reproduce the dataset.
```bash
# Clone the repository
git clone https://github.com/YOUR_USERNAME/nlp-cpi-nowcasting.git
cd nlp-cpi-nowcasting

# Install dependencies
pip install -r requirements.txt

# Add your FRED API key
echo "FRED_API_KEY=your_key_here" > .env

# Run the full pipeline
python src/pull_fred.py
python src/pull_trends.py
python src/pull_eia.py
python src/scrape_fomc.py
python src/fomc_sentiment.py
python src/construct_surprise.py
python src/feature_engineering.py
python src/models.py
python src/shap_analysis.py
python src/chow_test.py
```

```
src/        All Python scripts (run in order listed above)
notebooks/  Jupyter notebooks for exploration and visualization
data/       Data directory (populated after running pipeline)
figures/    All paper figures (Figures 1–20)
results/    Model output CSVs
paper/      Full research paper PDF
```
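The pull scripts read the FRED API key from .env. As an illustrative sketch (not the actual pull_fred.py), a request URL for FRED's documented observations endpoint can be assembled like this:

```python
from urllib.parse import urlencode

FRED_BASE = "https://api.stlouisfed.org/fred/series/observations"

def fred_url(series_id: str, api_key: str, start: str = "2006-01-01") -> str:
    """Build a FRED observations request URL. The endpoint and parameter
    names (series_id, api_key, file_type, observation_start) are FRED's
    public API; this helper itself is a hypothetical sketch."""
    params = {
        "series_id": series_id,
        "api_key": api_key,
        "file_type": "json",
        "observation_start": start,
    }
    return f"{FRED_BASE}?{urlencode(params)}"

print(fred_url("CPIAUCSL", "your_key_here"))
```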
| Sample | Features | T/p | Best OOS-R² | Mean Dir. Acc |
|---|---|---|---|---|
| A (61 feat, 2010–2025) | 61 | 2.3 | +1.4% | 58.3% |
| B (61 feat, 2006–2025) | 61 | 3.1 | -3.2% | 45.6% |
| C (6 feat + EIA) | 6 | 23.0 | +14.1% | 66.9% |
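The T/p effect in the table above can be reproduced on synthetic data with a closed-form ridge estimator (a sketch only, not the repo's models.py; the data and penalty are illustrative, and the typical outcome is that the 55 added noise features degrade out-of-sample fit):

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit_predict(X_tr, y_tr, X_te, lam=1.0):
    # Closed-form ridge: w = (X'X + lam*I)^(-1) X'y
    p = X_tr.shape[1]
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(p), X_tr.T @ y_tr)
    return X_te @ w

T = 140                                   # roughly monthly-sample scale
signal = rng.standard_normal(T)
# 6 noisy-but-informative features vs. the same 6 plus 55 pure-noise features
X_small = np.column_stack([signal + 0.5 * rng.standard_normal(T) for _ in range(6)])
X_big = np.column_stack([X_small, rng.standard_normal((T, 55))])
y = signal + 0.3 * rng.standard_normal(T)

split = 100                               # train on first 100 obs, test on the rest
def oos_mse(X):
    pred = ridge_fit_predict(X[:split], y[:split], X[split:])
    return float(np.mean((y[split:] - pred) ** 2))

print(f"T/p = {split/6:5.1f} -> OOS MSE (6 features):  {oos_mse(X_small):.3f}")
print(f"T/p = {split/61:5.1f} -> OOS MSE (61 features): {oos_mse(X_big):.3f}")
```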
| Feature | Source | Importance |
|---|---|---|
| EIA gas price acceleration | EIA | 36% |
| Google Trends: gas prices (L3) | Google Trends | 20% |
| WTI crude oil MoM (L1) | FRED | 17% |
| 10yr breakeven inflation diff | FRED | 14% |
| Google Trends: gas price accel | Google Trends | 8% |
| FOMC hawkishness (L1) | FinBERT | 5% |
```bibtex
@misc{dai2026nowcasting,
  author = {Dai, Sean},
  title  = {NLP Economic Nowcasting: Predicting CPI Surprises from
            Google Trends, FOMC Sentiment, and Real-Time Energy Prices},
  year   = {2026},
  url    = {https://github.com/YOUR_USERNAME/nlp-cpi-nowcasting}
}
```

Sean Dai · sean.zhdai@gmail.com · Arizona