This repository demonstrates a comprehensive analytics workflow for marketing performance analysis, covering user funnel optimization, retention cohorts, trend analysis, and predictive modeling. The project showcases modern data analysis techniques including SQL-driven reporting, dashboard development, data integrity monitoring, and ROI forecasting to extract valuable insights from marketing data and drive strategic decision-making.
- Actionable analytics: funnel, ROI/ROAS, retention, and churn signals tied directly to UA decisions.
- Cohort retention: D1/D7 cohort analysis by acquisition channel and platform highlights where to invest or iterate.
- Predictive modeling: churn-classification pipeline (LogReg + XGBoost) with documented lift and Tableau integration.
- Reporting craft: reproducible SQL, polished notebooks, and exportable figures/dashboard embeds.
- Funnel: Install -> Onboarding ~95.6%, Onboarding -> D1 ~47.6%, D1 -> Purchase ~16.8%.
- Retention: Overall D7 ~33.3%; Organic keeps ~73.8% of its D1 returners (D7 ~35.2%).
- ROI: Organic ROAS ~1.85 (ROI ~+85%); paid channels range 0.19-0.34 ROAS (ROI -66% to -81%).
- Churn model: ROC-AUC ~0.61, PR-AUC ~0.58, accuracy ~0.60; top 10% risk bucket captures ~78% of churn (lift ~1.17x).
- Dataset: Cookie Cats (Kaggle) user-level/mobile game telemetry, enriched with synthetic user acquisition attributes (e.g., acquisition_channel, CAC/ad spend, revenue fields) to enable ROI/ROAS analysis.
- KPIs: D1/D7 retention, conversion funnel step rates, ROI/ROAS by channel, ARPDAU/Revenue trends (optional), and a focused prediction target (e.g., D7 or revenue proxy).
- Decisions supported: Budget reallocation across channels, creative/testing priorities, onboarding/FTUE optimizations, and retention-oriented product bets.
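The ROI/ROAS KPIs above reduce to a small aggregation. The sketch below is illustrative only: the column names (`acquisition_channel`, `ad_spend`, `revenue`) mirror the synthetic enrichment described above, and the toy numbers are chosen to land near the reported Organic/Facebook figures.

```python
import pandas as pd

# Toy user-level rows; real data has one row per install with enriched UA fields.
df = pd.DataFrame({
    "acquisition_channel": ["Organic", "Organic", "Facebook", "Facebook"],
    "ad_spend":            [1.0, 1.0, 2.0, 3.0],
    "revenue":             [1.5, 2.2, 0.4, 0.55],
})

by_channel = df.groupby("acquisition_channel")[["revenue", "ad_spend"]].sum()
by_channel["roas"] = by_channel["revenue"] / by_channel["ad_spend"]  # revenue per $ spent
by_channel["roi"] = by_channel["roas"] - 1.0   # ROI = (revenue - spend) / spend
print(by_channel.sort_values("roas", ascending=False))
```

A channel is above break-even exactly when ROAS > 1 (equivalently ROI > 0), which is why Organic at ~1.85 stands out against paid channels at 0.19-0.34.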
- `0.9-SQL-Validation-and-Samples.ipynb` -> `sql_analysis.ipynb`
  - Runs 3 canonical queries (daily installs, funnel step rates, ROI/ROAS) and validates notebook vs. SQL outputs.
- `1.0-EDA-and-Funnel.ipynb` -> `eda.ipynb`
  - Defines the funnel (install -> onboarding -> D1 -> purchase) and produces the first KPI table and a funnel chart.
- `2.0-ROI-and-ROAS-by-Channel.ipynb` -> `roas_analysis.ipynb`
  - Computes ROI/ROAS by acquisition channel (plus optional platform), exports ranked tables and visuals, and ends with 3 actionable recommendations.
- `2.1-Retention-Cohorts.ipynb`
  - D1/D7 cohort heatmaps and top/bottom channel lists, with short commentary on implications.
- `3.0-Churn-Model.ipynb`
  - Builds the churn feature set, trains LogReg & XGBoost, records ROC/PR metrics, and exports risk segments plus Tableau-ready artefacts.
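The funnel step rates produced in the EDA notebook follow one convention worth making explicit: each rate is conditional on the previous step, not on installs. A minimal sketch with made-up flags (the real column names in the processed data may differ):

```python
import pandas as pd

# Toy per-user step flags; 1 means the user reached that funnel step.
df = pd.DataFrame({
    "installed":   [1, 1, 1, 1, 1],
    "onboarded":   [1, 1, 1, 1, 0],
    "returned_d1": [1, 1, 0, 0, 0],
    "purchased":   [1, 0, 0, 0, 0],
})

steps = ["installed", "onboarded", "returned_d1", "purchased"]
counts = df[steps].sum()
# Step rate = users reaching this step / users reaching the previous step.
step_rates = (counts / counts.shift(1)).dropna()
print(step_rates)
```

Under this convention the reported ~16.8% D1 -> Purchase rate is a share of D1 returners, not of all installs.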
This section explains how to run the project end-to-end, how SQL and notebooks are wired together, and where artifacts are exported. It is written to be copy-paste runnable on a fresh machine.
- Python 3.10+
- pip (or pipx if preferred)
- Git
- (Optional) DuckDB is installed via `requirements.txt` and used to run SQL over CSV/Parquet without a DB server.
Create and activate a virtual environment:

```shell
# macOS/Linux
python -m venv .venv && source .venv/bin/activate

# Windows (PowerShell)
python -m venv .venv; .\.venv\Scripts\Activate.ps1
```

Install requirements:

```shell
./scripts/install.ps1
```

After installing requirements, you can reproduce the pipeline without opening notebooks:
1. `make data` - rebuilds `clean_data.csv` and `events.parquet` via the Typer CLI wrapper.
2. `make features` - creates `features.csv` and `labels.csv` with the churn flag.
3. `make train` - trains the churn models, writes metrics to `reports/tables/`, and stores the best pipeline under `models/churn_model.pkl`.
4. (Optional) `make predict` - uses the trained model to generate `predictions.csv` with churn probabilities.

Shortcut: `make pipeline` runs steps 1-3 sequentially.
Each rule calls `python -m mobile_game_analytics_pipeline.<command>` under the hood, so you can invoke the commands directly if you prefer finer control over paths.
If you don't have `make` (and don't want to install it), run the equivalent commands directly:

```shell
# with the virtual environment active
python -m mobile_game_analytics_pipeline.dataset
python -m mobile_game_analytics_pipeline.features
python -m mobile_game_analytics_pipeline.modeling.train
python -m mobile_game_analytics_pipeline.modeling.predict  # optional
```
```text
mobile_game_analytics_pipeline/
├─ mobile_game_analytics_pipeline/
│  ├─ __init__.py
│  ├─ config.py
│  ├─ dataset.py            # Typer command: rebuild synthetic data
│  ├─ features.py           # Typer command: create features & labels
│  ├─ modeling/
│  │  ├─ __init__.py
│  │  ├─ train.py           # Typer command: train churn model
│  │  └─ predict.py         # Typer command: generate predictions
│  └─ …
├─ data/
│  ├─ raw/
│  │  └─ cookie_cats.csv
│  ├─ processed/            # clean_data.csv, events.parquet, features.csv, labels.csv
│  ├─ config/
│  │  └─ synthetic.yaml
│  └─ make_dataset.py
├─ notebooks/
│  ├─ 0.9-SQL-Validation-and-Samples.ipynb
│  ├─ 1.0-EDA-and-Funnel.ipynb
│  ├─ 2.0-ROI-and-ROAS-by-Channel.ipynb
│  ├─ 2.1-Retention-Cohorts.ipynb
│  └─ 3.0-Churn-Model.ipynb
├─ references/
│  └─ sql/
├─ reports/
│  ├─ tables/
│  ├─ figures/
│  └─ executive_summary.md
├─ tests/
│  └─ test_synthetic.py
├─ Makefile
├─ requirements.txt
└─ README.md
```
- Primary input: `data/processed/clean_data.csv` (or `data/processed/events.parquet`).
- Notebook 0.9 (SQL) automatically creates a DuckDB view `events` from one of these files.
- If neither file is present, the notebooks raise a clear error.
Run the notebooks top-down, in this order:
1. `0.9-SQL-Validation-and-Samples.ipynb` - loads SQL from `references/sql/` and executes it via DuckDB; exports validation tables to `reports/tables/`.
2. `1.0-EDA-and-Funnel.ipynb` - defines the funnel and exports `funnel.csv` and `funnel.png`.
3. `2.0-ROI-and-ROAS-by-Channel.ipynb` - computes ROI/ROAS by channel; exports `roi_by_channel.csv` and `roi_by_channel.png`.
4. `2.1-Retention-Cohorts.ipynb` - produces D1/D7 cohort heatmaps; exports `retention_by_channel.csv` and `retention_heatmap.png`.
5. `3.0-Churn-Model.ipynb` - churn classification (LogReg + XGBoost/LightGBM); exports `model_metrics.json` and one key figure (`roc_pr_curves.png`).
SQL is stored in `references/sql/` and loaded by the notebooks at runtime:

- `references/sql/daily_installs.sql`
- `references/sql/funnel_step_rates.sql`
- `references/sql/roi_by_channel.sql`
All notebooks export tables and figures to a consistent location:

- Tables: `reports/tables/funnel.csv`, `funnel_long.csv`, `roi_by_channel.csv`, `roi_by_channel_long.csv`, `retention_by_channel.csv`, `retention_cohort_by_version.csv`
- Figures: `reports/figures/funnel.png`, `roi_by_channel.png`, `retention_heatmap.png`, `forecast_plot.png` or `roc_pr_curves.png`
- Summary: `reports/executive_summary.md` (curated manually from notebook findings)
- `pytest -q tests/test_synthetic.py` validates schema, channel/country/platform distributions, the ROI formula, and retention ratios for the synthetic dataset.
- CLI smoke test: `make pipeline` (or run the Typer commands manually) regenerates data, features, and churn-model artefacts end-to-end.
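One invariant the synthetic-data test can assert is that ROI and ROAS are mutually consistent (ROI = ROAS - 1 whenever both come from the same revenue and spend). The helper below is a hypothetical illustration in the spirit of `tests/test_synthetic.py`, not its actual contents:

```python
import pandas as pd

def check_roi_consistency(df: pd.DataFrame, tol: float = 1e-9) -> bool:
    """Return True if ROI == ROAS - 1 holds row-by-row (positive spend assumed)."""
    roas = df["revenue"] / df["ad_spend"]
    roi = (df["revenue"] - df["ad_spend"]) / df["ad_spend"]
    return bool(((roi - (roas - 1.0)).abs() < tol).all())

demo = pd.DataFrame({"revenue": [3.7, 0.95], "ad_spend": [2.0, 5.0]})
print(check_roi_consistency(demo))
```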
This repository uses pre-commit hooks (formatting, lint checks, etc.). A fresh clone does not install the hooks automatically; run `pre-commit install` once after setting up your environment.
- Determinism: set seeds for model training and sampling (e.g., `numpy`, `random`, model libraries).
- No leakage: split by time for validation (rolling or holdout) and build features only from the past window.
- Versioning: tag releases when major deliverables change (e.g., `v0.1` MVP, `v0.2` cohorts deep dive, `v0.3` modeling).
- Environment: keep `requirements.txt` up to date; pin critical libs if needed for a clean re-run.
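The determinism point above can be captured in one helper; this is a minimal sketch that covers the stdlib and NumPy generators, and would need extending for model libraries (e.g., an XGBoost `random_state`):

```python
import os
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Seed the common sources of randomness for reproducible runs."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)

set_seed(42)
first = np.random.rand(3)
set_seed(42)
second = np.random.rand(3)
print((first == second).all())  # same seed, identical draws
```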
This section summarizes key insights, embeds exported figures, and outlines limitations and next actions. Replace the placeholder metrics below once the latest notebooks are executed.
**Acquisition & Funnel**

- Conversion funnel: Install -> Onboarding -> D1 return -> Purchase. Current run shows:
  - Install -> Onboarding: ~`95.6%` (baseline FTUE completion)
  - Onboarding -> D1: ~`47.6%` (early retention health)
  - D1 -> Purchase: ~`16.8%` (monetization gate)
- Action: Focus UX experiments on the largest drop (e.g., onboarding), and validate with an A/B test.
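Validating an onboarding change with an A/B test usually comes down to a two-proportion z-test on the step rate. A self-contained sketch using only the stdlib; the sample sizes and the 47.6% vs. 50.1% rates are hypothetical:

```python
from math import erf, sqrt

def two_prop_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided two-proportion z-test; returns (z, p_value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)                 # pooled conversion rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))        # pooled standard error
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal-CDF tail
    return z, p_value

# Hypothetical onboarding experiment: 47.6% control vs. 50.1% variant D1 rate.
z, p = two_prop_ztest(476, 1000, 501, 1000)
print(f"z={z:.2f}, p={p:.3f}")
```

At these sample sizes a 2.5-point lift is not yet significant, which is the practical argument for sizing the test before reallocating budget.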
**ROI/ROAS by Channel**

- Top channel: `Organic` delivers ROAS ~1.85 and ROI ~+85%, and is the only source above break-even.
- Underperformers: Paid UA remains below break-even (e.g., `TikTok` ROAS ~0.34, `Instagram` ROAS ~0.23, `Facebook` ROAS ~0.19).
- Action: Reallocate +10-20% of UA budget to top channels; test creative iteration for low-ROAS channels before further spend.
**Retention Cohorts (D1/D7)**

- D1 retention: overall average ~`45.5%`; `Organic` leads at ~`47.7%`.
- D7 retention: ~`33.3%` of installs return on day 7, and `Organic` keeps ~`73.8%` of its D1 returners through D7.
- Action: Prioritize best-quality sources (high D7) for long-term value; refine onboarding for channels with high D1 but weak D7.
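The cohort table behind these numbers is a per-channel mean of the retention flags, plus the D7-of-D1 "stickiness" ratio. A toy sketch with assumed column names (`retention_1`, `retention_7`, `acquisition_channel`):

```python
import pandas as pd

# Toy per-install flags; real data carries one row per install.
df = pd.DataFrame({
    "acquisition_channel": ["Organic"] * 4 + ["Facebook"] * 4,
    "retention_1": [1, 1, 1, 0, 1, 0, 0, 0],
    "retention_7": [1, 1, 0, 0, 0, 0, 0, 0],
})

cohort = df.groupby("acquisition_channel")[["retention_1", "retention_7"]].mean()
cohort["d7_of_d1"] = cohort["retention_7"] / cohort["retention_1"]  # D1 returners kept to D7
print(cohort)
```

The `d7_of_d1` column is the ratio quoted above for Organic (~73.8%): of the users who came back on day 1, how many were still around on day 7.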
**Prediction/Forecast**

- Churn model (LogReg + XGBoost): ROC-AUC 0.607, PR-AUC 0.580, accuracy 0.602; the top 10% risk bucket captures ~78% of churn (lift 1.17x).
- Key segments: the highest churn risk clusters in Facebook and TikTok installs on Google Play; see `reports/tables/churn_risk_segments.csv` for the channel/platform drill-down.
- Artifacts: `reports/tables/backtest_scores.csv`, `reports/tables/model_metrics.json`, `reports/tables/churn_risk_segments.csv`, `reports/figures/roc_pr_curves.png`.
- Note: scores are measured on the synthetic demo dataset; expect lower performance on production data.
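One common way to read the bucket numbers (and the one consistent with ~78% at lift ~1.17x, implying a churn base rate near 67%) is: the churn rate *inside* the top-risk bucket, divided by the overall churn rate. A sketch with synthetic labels and scores, assuming that definition:

```python
import numpy as np

def top_bucket_stats(y_true: np.ndarray, scores: np.ndarray, frac: float = 0.10):
    """Churn rate inside the top-`frac` risk bucket and its lift over the base rate."""
    order = np.argsort(scores)[::-1]          # highest predicted risk first
    k = max(1, int(len(scores) * frac))
    precision = y_true[order[:k]].mean()      # share of the bucket that actually churns
    lift = precision / y_true.mean()          # vs. the overall churn rate
    return precision, lift

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)                  # toy churn labels (~50% base rate)
s = 0.5 * y + 0.6 * rng.random(1000)          # informative but noisy risk scores
precision, lift = top_bucket_stats(y, s)
print(f"bucket churn rate={precision:.2f}, lift={lift:.2f}")
```

A lift near 1.0 would mean the model's top bucket is no better than random targeting, so even a modest 1.17x only pays off when the retention intervention is cheap.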
Record the finalized numbers in `reports/executive_summary.md` as a single-page narrative for reviewers.
- Synthetic enrichment: user-acquisition fields (e.g., `acquisition_channel`, CAC/ad spend) are synthetically enriched and may not reflect production distributions.
- Schema/coverage: missing events or short time windows can bias retention and ROI estimates; metrics are indicative.
- Attribution simplification: channel attribution is single-touch in this demo; multi-touch or MMM would alter the ROI interpretation.
- Model scope: forecasts/classifiers are compact prototypes (no intensive hyperparameter tuning). Calibration and backtesting are included to keep results honest.
- Code is released under the MIT License (see `LICENSE`).
- Dataset: Cookie Cats (Kaggle), used here for educational/demo purposes with synthetic UA enrichment.



