Project Experiment
Project Experiment is a definitive, 25-module knowledge base synthesizing academic research, industry best practices, and real-world case studies from leading technology and financial services companies. Developed through collaborative synthesis by multiple expert systems and refined through rigorous editorial consolidation, this resource provides both theoretical foundations and practical implementation guidance for analytics professionals, data scientists, and business leaders.
This knowledge base is organized into five thematic groups that represent the natural progression of experimentation maturity, from foundational concepts through advanced techniques to organizational transformation:
- Foundations & Statistical Methods (Topics 01-05): Core concepts, decision frameworks, and fundamental statistical techniques
- Advanced Statistical Techniques (Topics 06-10): Variance reduction, specialized metrics, and channel-specific applications
- Analytical Sophistication (Topics 11-15): Heterogeneous effects, personalization, and operational excellence
- Stakeholder Management & Governance (Topics 16-20): Communication, compliance, unified measurement, and future trends
- Organizational Maturity (Topics 21-25): Implementation roadmaps, common pitfalls, and cultural transformation
- 21. From Holdouts to Experimentation - The complete roadmap for building an experimentation program from scratch. Start here if you're new to experimentation or leading organizational transformation.
- 02. When Is an Experiment Done? - Moving beyond arbitrary p-values and 95% confidence thresholds to sophisticated decision frameworks, with implementations from elite companies including Spotify, Netflix, and Airbnb.
- 14. Building an Experimentation Operating Model - Organizational design for scaled experimentation: governance, CoE models, platform selection, stakeholder alignment.
- 06. Variance Reduction: CUPED, CUPAC, and Beyond - Reduce sample size requirements by 30-50% with advanced techniques from Microsoft and Netflix.
- 11. From ATE to CATE - Extract value from "failed" experiments using heterogeneous treatment effects and causal forests.
- 05. Sequential Testing Methods - Rigorous frameworks for early stopping without inflating false positives.
- 08. Email Experimentation in the Post-Apple MPP Era - Adapting to Apple's Mail Privacy Protection and the death of open rates.
- 09. Measuring Incrementality - Gold standard holdout designs for measuring true incremental value.
- 03. The Role of Null Results - Why 70-90% of experiments "fail" at elite companies and how to build a learning culture.
- 25. Debunking the 30-Day Myth - Data-driven experiment duration decisions and the 4-quadrant decision matrix.
- 23. Early Experiment Signals - What you can and cannot learn in the first 48 hours.
- 01. Experimentation in Regulated Finance - Model Risk Management (SR 11-7), Fair Lending (ECOA), Conduct Risk frameworks for financial services.
- 18. Legal and Compliance Considerations - Navigating GDPR, CCPA, TCPA, and sector-specific regulations.
Governance, compliance, and decision science in high-stakes environments
A comprehensive analysis of how experimentation functions within regulated industries (financial services, healthcare, insurance). Covers Model Risk Management (SR 11-7), Conduct Risk frameworks, Fair Lending regulations (ECOA), and the governance structures that distinguish regulated experimentation from Silicon Valley's "move fast and break things" culture. Essential reading for anyone working in regulated sectors or building experimentation programs that require rigorous oversight.
Key Topics: Three Lines of Defense, Model Risk Management, Conduct Risk, Fair Lending (ECOA), SR 11-7 compliance, capital planning integration, gamification risks
Sequential testing, Bayesian methods, and sophisticated stopping criteria
Consolidated from 3 expert sources: Claude, Gemini, and Main synthesis
The definitive guide to moving beyond arbitrary 95% confidence thresholds and fixed experiment durations. Covers group sequential testing (GST), always-valid inference (AVI), Bayesian decision rules, expected loss frameworks, and risk-adjusted thresholds by metric hierarchy. Includes detailed implementation guidance from Spotify, Netflix, Airbnb, Microsoft, Booking.com, and other elite experimentation organizations.
Key Frameworks: O'Brien-Fleming boundaries, Pocock spending functions, mSPRT, GSPRT, Expected Loss minimization, Probability to Be Best (P2BB), threshold of caring
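As a quick illustration of the Bayesian decision rules this module covers, here is a minimal Monte Carlo sketch of Probability to Be Best (P2BB) and expected loss for a two-variant conversion test. It assumes Beta(1, 1) priors; the counts and the threshold-of-caring value are illustrative placeholders, not figures from the module.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative observed data (not from any real experiment)
conversions = {"control": 480, "treatment": 525}
exposures = {"control": 10_000, "treatment": 10_000}

# A Beta(1, 1) prior updated with binomial data yields a Beta posterior
samples = {
    arm: rng.beta(1 + conversions[arm],
                  1 + exposures[arm] - conversions[arm],
                  size=100_000)
    for arm in conversions
}

diff = samples["treatment"] - samples["control"]

# Probability to Be Best: share of posterior draws where treatment wins
p2bb = (diff > 0).mean()

# Expected loss of shipping treatment: average conversion-rate shortfall
# across the draws where control is actually better
expected_loss = np.maximum(-diff, 0).mean()

threshold_of_caring = 0.0005  # hypothetical tolerance, in conversion-rate units
print(f"P2BB (treatment): {p2bb:.3f}")
print(f"Expected loss of shipping treatment: {expected_loss:.5f}")
print("Ship treatment" if expected_loss < threshold_of_caring else "Keep testing")
```

The decision rule shown (ship when expected loss falls below a pre-registered threshold of caring) is one of several stopping criteria the module compares.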
Research Appendix (10 documents):
- Bayesian A/B Testing Beyond Frequentist Methods
- Bayesian Decision Rules and Expected Loss
- Company Case Studies on Stopping Criteria
- Experiments at Airbnb
- GrowthBook Decision Framework
- Netflix: Meta-Analysis and Optimal Stopping
- Minimum Detectable Effect and Power Analysis
- Organizational Decision Frameworks
- Spotify Sequential Testing: Comparisons
- Spotify Sequential Testing: Research Notes
Building a culture that values learning from failure
Consolidated from 3 expert sources: Claude, Gemini, and Main synthesis
Why 70-90% of experiments "fail" at elite companies, and why this is a sign of maturity, not dysfunction. Explores the organizational culture, knowledge management systems, and decision frameworks needed to extract value from null results. Includes the "Ship Flat" decision matrix, Edmondson's failure typology, and detailed case studies from Booking.com, Netflix, Microsoft, Amazon, and Airbnb.
Key Frameworks: Edmondson's failure framework (preventable/complex/intelligent), "Ship Flat" decision matrix, Sample Ratio Mismatch diagnostics, A/A testing protocols, knowledge repository design
Research Appendix (4 documents):
- Centralized A/B Testing Repository
- Jeff Bezos 2016 Letter to Amazon Shareholders
- Netflix: A Culture of Learning
- Research Notes: Null and Negative Results
Statistical power analysis for channels with limited traffic
The essential guide to running experiments in low-velocity environments: direct mail, B2B sales, premium segments, annual purchase cycles, and regulatory-limited channels. Covers power calculations, Minimum Detectable Effect (MDE) tradeoffs, duration estimation, and practical strategies when traditional A/B testing isn't feasible.
Key Topics: Power analysis, MDE calculation, sample size requirements, variance estimation, cluster randomization, quasi-experimental designs, difference-in-differences
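To make the power-analysis tradeoffs concrete, here is a minimal sketch of the standard two-proportion sample-size calculation (normal approximation, equal allocation). The baseline rate and MDE are illustrative values, not recommendations.

```python
from math import ceil

from scipy.stats import norm


def sample_size_per_arm(base_rate: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-arm sample size for a two-sided two-proportion z-test."""
    p1, p2 = base_rate, base_rate + mde_abs
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return ceil(numerator / mde_abs ** 2)


# Illustrative: 2% baseline conversion, detect an absolute lift of 0.4pp
print(sample_size_per_arm(0.02, 0.004))  # roughly 21,000 users per arm
```

In low-velocity channels, running this calculation in reverse (fixing the available sample and solving for the detectable MDE) is often the more honest exercise.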
Rigorous frameworks for early stopping without inflating false positives
A comprehensive guide to sequential testing frameworks that allow "peeking" at results without alpha inflation. Covers Wald's Sequential Probability Ratio Test (SPRT), alpha spending functions, group sequential designs, and practical implementations from Netflix, Optimizely, and VWO.
Key Frameworks: SPRT, alpha spending functions, O'Brien-Fleming boundaries, Lan-DeMets approach, futility monitoring, conditional power
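As a concrete illustration of the simplest framework listed here, a minimal sketch of Wald's SPRT for a Bernoulli metric follows. The hypothesized rates, error levels, and simulated data are illustrative assumptions.

```python
import numpy as np


def sprt_binary(observations, p0: float, p1: float,
                alpha: float = 0.05, beta: float = 0.2) -> str:
    """Wald's SPRT for a Bernoulli rate: H0 p=p0 vs H1 p=p1 (p1 > p0).
    Decides as soon as the log-likelihood ratio crosses a boundary,
    so "peeking" after every observation is valid by construction."""
    upper = np.log((1 - beta) / alpha)   # accept H1 above this
    lower = np.log(beta / (1 - alpha))   # accept H0 below this
    llr = 0.0
    for x in observations:  # x is 0 or 1
        llr += np.log(p1 / p0) if x else np.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1 (effect present)"
        if llr <= lower:
            return "accept H0 (no effect)"
    return "continue sampling"


rng = np.random.default_rng(7)
stream = rng.binomial(1, 0.025, size=50_000)  # simulated stream, true rate 2.5%
print(sprt_binary(stream, p0=0.02, p1=0.025))
```

Group sequential designs trade this observation-by-observation monitoring for a small number of pre-planned interim looks with alpha spending.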
Advanced methods to reduce sample size requirements and duration
Detailed coverage of variance reduction techniques that can reduce sample size requirements by 30-50%. Covers CUPED (Controlled-experiments Using Pre-Experiment Data), CUPAC (CUPED using Predictions as Covariates), stratification, regression adjustment, and modern machine learning approaches. Includes Microsoft and Netflix implementations.
Key Techniques: CUPED, CUPAC, stratified randomization, post-stratification, regression adjustment, doubly robust estimation, machine learning-based covariate adjustment
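The core CUPED adjustment is compact enough to show inline. Below is a minimal sketch on simulated data, assuming a single pre-experiment covariate; the data-generating numbers are illustrative only.

```python
import numpy as np


def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """CUPED: remove the part of the in-experiment metric y that is
    predictable from the pre-experiment covariate x_pre. Estimating
    theta on pooled data leaves the treatment effect unbiased."""
    theta = np.cov(y, x_pre, ddof=1)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())


rng = np.random.default_rng(0)
n = 20_000
x_pre = rng.gamma(2.0, 5.0, n)                 # pre-period spend per user
treat = rng.integers(0, 2, n).astype(bool)     # random assignment
y = 0.8 * x_pre + rng.normal(0, 4, n) + 0.5 * treat  # true lift = 0.5

y_adj = cuped_adjust(y, x_pre)
raw_var = y[treat].var() / treat.sum() + y[~treat].var() / (~treat).sum()
adj_var = y_adj[treat].var() / treat.sum() + y_adj[~treat].var() / (~treat).sum()
print(f"lift (raw):   {y[treat].mean() - y[~treat].mean():.3f}")
print(f"lift (CUPED): {y_adj[treat].mean() - y_adj[~treat].mean():.3f}")
print(f"variance reduction: {1 - adj_var / raw_var:.1%}")
```

CUPAC follows the same template but replaces the single covariate with a machine-learned prediction of the outcome.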
Statistical complexities of ratio metrics and user-level dependencies
Addresses the unique challenges of ratio metrics (revenue per user, sessions per visit, CTR) and correlated user behavior. Covers the Delta method, bootstrap approaches, clustered standard errors, and when simple t-tests fail. Critical for anyone analyzing engagement, revenue, or behavioral metrics.
Key Topics: Delta method, bootstrap estimation, Taylor expansion, clustered standard errors, intra-cluster correlation, ratio estimators
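To show why the Delta method matters here, a minimal sketch of a delta-method confidence interval for a ratio-of-means metric (per-user clicks over per-user sessions) follows; the simulated data are illustrative.

```python
import numpy as np


def delta_method_ratio_var(num: np.ndarray, den: np.ndarray) -> float:
    """Delta-method variance of mean(num) / mean(den), where num and den
    are per-user totals. Aggregating to the randomization unit (the user)
    is what handles within-user correlation that a naive t-test ignores."""
    n = len(num)
    mu_n, mu_d = num.mean(), den.mean()
    var_n, var_d = num.var(ddof=1), den.var(ddof=1)
    cov_nd = np.cov(num, den, ddof=1)[0, 1]
    # First-order Taylor expansion of num_bar / den_bar around (mu_n, mu_d)
    return (var_n / mu_d ** 2
            - 2 * mu_n * cov_nd / mu_d ** 3
            + mu_n ** 2 * var_d / mu_d ** 4) / n


rng = np.random.default_rng(1)
sessions = rng.poisson(5, 10_000) + 1          # per-user session counts
clicks = rng.binomial(sessions, 0.3)           # per-user click totals
ctr = clicks.mean() / sessions.mean()
se = delta_method_ratio_var(clicks, sessions) ** 0.5
print(f"CTR = {ctr:.4f} ± {1.96 * se:.4f} (95% CI, delta method)")
```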
Adapting email analytics after Apple's tracking pixel blocking
Comprehensive analysis of how Apple's Mail Privacy Protection (MPP) fundamentally changed email analytics. Covers the technical mechanisms of pixel blocking, alternative metrics (clicks, conversions, list hygiene), proxy metric validation, and new experimental approaches that don't rely on open rates.
Key Topics: Apple MPP mechanics, click-based metrics, conversion tracking, list hygiene indicators, engagement scoring, proxy metric validation
Holdout designs and causal inference for marketing campaigns
The gold standard for measuring true incremental value of email campaigns using holdout groups. Covers holdout design, attribution modeling, long-term effects, cannibalization detection, and strategic frameworks for building incrementality testing programs.
Key Frameworks: Global holdouts, rolling holdouts, stratified holdouts, ghost ads, synthetic controls, attribution modeling integration
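As a quick sketch of the basic holdout readout described above: compute the incremental lift as the difference in conversion rates between mailed users and the holdout, with a two-sample z-test. The 95/5 split and outcomes below are simulated placeholders.

```python
import numpy as np
from scipy.stats import norm

# Illustrative per-user conversion outcomes: 95% mailed, 5% global holdout
rng = np.random.default_rng(3)
mailed = rng.binomial(1, 0.031, 95_000)
holdout = rng.binomial(1, 0.025, 5_000)

lift = mailed.mean() - holdout.mean()           # incremental conversion rate
se = (mailed.var(ddof=1) / len(mailed)
      + holdout.var(ddof=1) / len(holdout)) ** 0.5
z = lift / se
p = 2 * (1 - norm.cdf(abs(z)))

print(f"incremental lift: {lift:.4f} "
      f"(95% CI {lift - 1.96 * se:.4f} to {lift + 1.96 * se:.4f}, p={p:.3f})")
# Share of observed conversions the campaign actually caused
print(f"truly incremental share: {lift / mailed.mean():.1%}")
```

Note the asymmetric split: the small holdout keeps opportunity cost low but widens the confidence interval, one of the design tradeoffs this module works through.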
Balancing short-term engagement with long-term subscriber value
Analyzes the complex relationship between email frequency and long-term subscriber value. Covers frequency optimization, fatigue detection, lifetime value modeling, recency-frequency-monetary (RFM) analysis, and how to balance short-term engagement metrics with long-term list health.
Key Topics: Fatigue curves, optimal frequency estimation, suppression list management, reactivation strategies, LTV modeling for email
Heterogeneous treatment effects and conditional average treatment effects
When an experiment shows no overall effect (ATE = 0), Conditional Average Treatment Effects (CATE) analysis can uncover heterogeneous treatment effects across segments. Covers causal forests, meta-learners (S-learner, T-learner, X-learner), honest causal trees, and practical implementation strategies for finding value in "failed" experiments.
Key Methods: Causal forests, generalized random forests, meta-learners, honest splitting, double machine learning, targeted maximum likelihood estimation
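To make the meta-learner idea concrete, here is a minimal T-learner sketch on simulated data with a segment-dependent effect; scikit-learn gradient boosting stands in for whatever outcome model you would actually use.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
n = 20_000
X = rng.normal(size=(n, 3))                # user covariates
t = rng.integers(0, 2, n)                  # random assignment
tau = np.where(X[:, 0] > 0, 2.0, 0.0)      # true effect exists in one segment only
y = X @ np.array([1.0, 0.5, -0.5]) + tau * t + rng.normal(0, 1, n)

# T-learner: fit separate outcome models on treated and control units,
# then estimate CATE as the difference of their predictions
m1 = GradientBoostingRegressor().fit(X[t == 1], y[t == 1])
m0 = GradientBoostingRegressor().fit(X[t == 0], y[t == 0])
cate = m1.predict(X) - m0.predict(X)

print(f"estimated ATE: {cate.mean():.2f}")                       # ~1.0, looks modest
print(f"CATE where x0 > 0:  {cate[X[:, 0] > 0].mean():.2f}")     # ~2.0
print(f"CATE where x0 <= 0: {cate[X[:, 0] <= 0].mean():.2f}")    # ~0.0
```

The diluted ATE masks a strong effect in half the population, exactly the "failed experiment" pattern CATE analysis is meant to surface.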
Decision frameworks for segment-level personalization
A rigorous framework for deciding when segment-level personalization is justified versus when it introduces unnecessary complexity or overfitting. Covers the bias-variance tradeoff in personalization, multiple testing corrections, false discovery rates, validation frameworks, and organizational readiness assessment.
Key Frameworks: Bias-variance tradeoff, cross-validation for personalization rules, Bonferroni correction, FDR control, effect size thresholds
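Since multiple testing is the central statistical hazard in segment-level analysis, here is a minimal sketch of the Benjamini-Hochberg procedure for FDR control; the p-values are illustrative.

```python
import numpy as np


def benjamini_hochberg(p_values, q: float = 0.10) -> np.ndarray:
    """Benjamini-Hochberg: boolean mask of hypotheses rejected at FDR q.
    Rejects the k smallest p-values, where k is the largest rank whose
    sorted p-value satisfies p_(k) <= q * k / m."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject


# Illustrative p-values from per-segment treatment-effect tests
segment_p = [0.001, 0.008, 0.039, 0.041, 0.20, 0.63]
print(benjamini_hochberg(segment_p))  # only segments surviving FDR control
```

Compared with Bonferroni, BH trades a weaker guarantee (expected share of false discoveries rather than any false discovery) for substantially more power across many segments.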
Adaptive allocation algorithms and exploration-exploitation tradeoffs
Comprehensive comparison of traditional A/B testing versus adaptive algorithms. Covers multi-armed bandits (epsilon-greedy, UCB, Thompson Sampling), contextual bandits, regret minimization, and practical guidance on when each approach is appropriate.
Key Algorithms: Epsilon-greedy, Upper Confidence Bound (UCB), Thompson Sampling, LinUCB, contextual bandits, reinforcement learning
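As a concrete taste of the adaptive allocation this module compares against fixed-split A/B tests, here is a minimal Bernoulli Thompson Sampling sketch; the arm rates are simulated placeholders.

```python
import numpy as np

rng = np.random.default_rng(11)
true_rates = [0.020, 0.025, 0.032]     # hidden conversion rates per arm
successes = np.ones(3)                 # Beta(1, 1) prior pseudo-counts
failures = np.ones(3)

for _ in range(50_000):
    # Thompson Sampling: draw one sample per arm from its Beta posterior
    # and play the arm with the highest draw
    draws = rng.beta(successes, failures)
    arm = int(np.argmax(draws))
    reward = rng.random() < true_rates[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

plays = successes + failures - 2       # subtract the prior pseudo-counts
print("plays per arm:", plays.astype(int))
print("posterior means:", np.round(successes / (successes + failures), 4))
```

Traffic concentrates on the best arm as evidence accumulates, which minimizes regret but complicates unbiased effect estimation, the core tradeoff discussed here.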
Organizational design for scaled experimentation capabilities
Strategic guide for building organization-wide experimentation capabilities. Covers governance structures, Centers of Excellence (CoE) models, federated vs. centralized architectures, tool selection, platform architecture, velocity metrics, and change management for embedding experimentation into company culture.
Key Topics: CoE design, platform selection, experimentation velocity, democratization vs. governance, training programs, stakeholder alignment
Metric hierarchies and Overall Evaluation Criteria
Frameworks for designing primary, secondary, and guardrail metrics that align with strategic business objectives. Covers metric selection criteria, leading vs. lagging indicators, Overall Evaluation Criteria (OEC) design, composite metrics, and how to avoid "metric gaming."
Key Frameworks: Metric hierarchies, OEC design, guardrail specification, leading indicator validation, proxy metric assessment
Translating statistical rigor into executive decision-making
Practical guidance on presenting statistical results to non-technical executives. Covers narrative structure, visualization best practices, confidence interval communication, effect size interpretation, and how to discuss statistical nuance without losing business context.
Key Skills: Executive storytelling, data visualization, p-value translation, confidence intervals for business audiences, decision memo structure
Ethical frameworks and reputational risk management
Explores the ethical and reputational dimensions of experimentation. When do customers perceive experiments as innovation vs. manipulation? Covers transparency frameworks, opt-in/opt-out considerations, informed consent, and case studies of experimentation that damaged trust (Facebook emotional contagion study, OkCupid compatibility experiments).
Key Topics: Research ethics, informed consent, deceptive practices, transparency obligations, reputational risk, customer trust recovery
Navigating GDPR, CCPA, and sector-specific regulations
Comprehensive legal framework covering GDPR Article 6(1)(f) (legitimate interests), CCPA opt-out rights, CAN-SPAM compliance, TCPA restrictions, and sector-specific regulations. Includes guidance on consent management, data retention, cross-border data transfers, and when experiments require legal review.
Key Regulations: GDPR, CCPA, CAN-SPAM, TCPA, HIPAA (healthcare), GLBA (financial services), data localization requirements
Integrating experimentation with Marketing Mix Modeling and attribution
Explains how controlled experiments complement (rather than replace) Marketing Mix Modeling and multi-touch attribution. Covers the strengths and limitations of each approach, integration strategies, calibration techniques, and how leading companies use all three methods together.
Key Topics: MMM-experiment integration, attribution model validation through experiments, calibration techniques, incrementality vs. attribution
Emerging trends and technological frontiers
Forward-looking analysis of emerging trends: AI-powered experiment design, automated variance reduction, privacy-preserving experimentation techniques (differential privacy, federated learning, secure multi-party computation), synthetic controls at scale, and the evolution from "tests" to "continuous optimization systems."
Emerging Trends: Automated experiment design, privacy-preserving techniques, synthetic controls, federated learning, continuous optimization
Organizational transformation roadmap
Consolidated from 3 expert sources: Claude, Gemini, and Main synthesis
Strategic roadmap for organizations transitioning from basic holdout testing to mature experimentation programs. Covers the causality gap, defensive foundation (holdout methodology), catalyst strategy (the single win), tactical playbook (First Five experiments), four-phase maturity model, and common mistakes. Includes detailed case studies from eBay ($50M revelation), True Classic (DTC transformation), Microsoft (experimentation flywheel), and Booking.com (cultural sustainability).
Key Frameworks: Four-phase maturity model (Crawl/Walk/Run/Fly), Microsoft's experimentation flywheel, CATS hypothesis framework, First Five experiments, global holdout methodology
Research Appendix (6 documents):
- Experimentation Program Mistakes
- Getting Buy-In for Experimentation
- Holdouts: Measuring Impact Accurately
- Organizational Evolution Research Notes
- Conversion Maturity Model
- Why Most Programs Fail
Building organizational confidence through low-risk initial tests
How to design early experiments that build organizational confidence rather than generate political resistance. Covers stakeholder engagement, risk assessment frameworks, choosing low-risk initial tests, success criteria definition, and building credibility through transparency.
Key Strategies: Stakeholder mapping, risk assessment matrix, pilot experiment selection, transparent reporting, confidence building
Diagnostics, validation, and early decision-making
Consolidated from 3 expert sources: Claude, Gemini, and Main synthesis
Comprehensive guide to what early experiment data can and cannot tell you. Covers Sample Ratio Mismatch (SRM) detection, instrumentation verification, baseline covariate balance, temporal dynamics (email lifecycle, novelty effects), leading vs. lagging metrics, statistical frameworks for early monitoring (SPRT, Bayesian P2BB), and the critical 48-hour checklist.
Key Frameworks: SRM severity levels, email engagement timelines, leading-lagging metric hierarchy, sequential monitoring protocols, 48-hour validation checklist
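As a quick illustration of the first item on the 48-hour checklist, here is a minimal SRM check via a chi-square goodness-of-fit test on assignment counts. The counts are illustrative, and the 0.001 alarm threshold is a common convention rather than a universal rule.

```python
from scipy.stats import chisquare

# Illustrative assignment counts for an intended 50/50 split
observed = [50_421, 49_198]
total = sum(observed)
expected = [total / 2, total / 2]

stat, p = chisquare(observed, f_exp=expected)
# A tiny p-value flags a Sample Ratio Mismatch: the observed split deviates
# from design by more than randomization alone can plausibly explain.
print(f"chi-square={stat:.2f}, p={p:.2e}")
if p < 0.001:  # conventional SRM alarm threshold
    print("SRM detected: pause analysis and debug assignment/logging")
```

An SRM almost always indicates an engineering defect (bot filtering, redirect loss, logging gaps) rather than a treatment effect, which is why it gates all downstream analysis.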
Research Appendix:
Understanding how experiment effects evolve over time
Comprehensive analysis of how experiment effects evolve over time. Covers novelty effects, primacy effects, user learning curves, habituation, long-term equilibrium effects, and how to design experiments that capture both immediate and sustained impacts.
Key Concepts: Novelty effects, primacy bias, habituation, learning curves, long-run equilibrium, carryover effects
Data-driven experiment duration decisions
Consolidated from 3 expert sources: Claude, Gemini, and Main synthesis
Critical analysis of the widespread but unfounded "30-day minimum" rule. Covers the statistical and organizational origins of this myth, the 4-quadrant experimentation matrix (signal velocity × risk), economic tradeoffs (opportunity cost analysis), when shorter or longer durations are actually appropriate, and practical duration calculation frameworks. Includes special considerations for financial services and regulated industries.
Key Frameworks: 4-quadrant matrix (signal velocity × risk), 6-question pre-commitment framework, duration calculation methodology, risk-based decision logic
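In the spirit of the duration-calculation frameworks above, here is a minimal sketch that derives duration from the power analysis and eligible traffic rather than a fixed 30-day rule, rounding up to whole weeks to cover weekly seasonality; all numbers are illustrative.

```python
from math import ceil


def experiment_duration_days(required_per_arm: int, arms: int,
                             eligible_daily_traffic: int,
                             min_full_weeks: int = 1) -> int:
    """Duration driven by sample-size needs, not a fixed 30-day default:
    days to reach the required sample, rounded up to whole weeks so every
    day of the week is represented at least min_full_weeks times."""
    raw_days = ceil(required_per_arm * arms / eligible_daily_traffic)
    weeks = max(ceil(raw_days / 7), min_full_weeks)
    return weeks * 7


# Illustrative: 21,000 users per arm needed, 2 arms, 12,000 eligible users/day
print(experiment_duration_days(21_000, 2, 12_000))  # 7 days, not 30
```

The same arithmetic can equally justify running longer than 30 days when traffic is thin or effects accrue slowly, which is the module's central point: duration should be calculated, not defaulted.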
Research Appendix (4 documents):
- 30 Day Dogma: Critical Analysis (Expanded)
- 30 Day Dogma: Critical Analysis (Original)
- Research Notes on Duration Analysis
- Why Organizations Default to 30 Days
Start here: 04 → 05 → 06 → 07 → 02 → 03. Then explore: 08 → 09 → 10 (email specialization) or 11 → 12 → 13 (advanced methods). Complete with: 14 → 15 → 23.
Start here: 21 → 01 → 16 → 22. Then explore: 02 → 03 → 15 → 17 → 18. Complete with: 19 → 20.
Start here: 05 → 06 → 07 → 11 → 13. Deep dive: 02 (research appendix), 03 (research appendix). Advanced topics: 12 → 20.
Start here: 21 → 14 → 22 → 15. Then explore: 02 → 03 → 16 → 17 → 18. Complete with: 23 → 24 → 25 → 19 → 20.
Statistical Foundations:
- Sequential testing (GST, AVI, SPRT)
- Bayesian methods (Expected Loss, P2BB, Threshold of Caring)
- Variance reduction (CUPED, CUPAC, stratification)
- Power analysis and sample size calculation
- Multiple testing corrections (Bonferroni, FDR)
Causal Inference:
- Heterogeneous treatment effects (CATE, causal forests)
- Meta-learners (S-learner, T-learner, X-learner)
- Instrumental variables and quasi-experiments
- Synthetic controls and difference-in-differences
- Doubly robust estimation
Organizational Design:
- Centers of Excellence (CoE) architecture
- Three Lines of Defense (regulated environments)
- Maturity models and transformation roadmaps
- Metric hierarchies and Overall Evaluation Criteria
- Knowledge management and failure libraries
Regulatory & Compliance:
- Model Risk Management (SR 11-7)
- Fair Lending (ECOA)
- Data privacy (GDPR, CCPA)
- Marketing regulations (CAN-SPAM, TCPA)
Industry Case Studies: Netflix, Airbnb, Spotify, Microsoft, Amazon, Booking.com, LinkedIn, eBay, Facebook, Uber, Google, Intuit, Vanguard, Capital One, JPMorgan Chase, True Classic, Robinhood, Optimizely, GrowthBook
- Email Experimentation at Vanguard: The Canon
- When Is an Experiment Done?
- Where Experiments Fit in Marketing Measurement
Five critical topics (02, 03, 21, 23, 25) represent consolidated authoritative documents synthesized from multiple expert AI systems (Claude, Gemini, and custom synthesis agents). These consolidations integrate the strongest insights, frameworks, and case studies from each source while eliminating redundancy and maintaining rigorous McKinsey/HBR editorial standards. Original source documents and extensive research appendices remain accessible through the linked materials.
This knowledge base represents a comprehensive synthesis of academic research, industry publications, regulatory guidance, and practical implementation experience in experimentation methodology. All content is designed for analytics professionals, data scientists, and business leaders working in marketing, product, and strategy roles.
Last updated: February 2026 | Maintained by expert editorial synthesis
- Agents Companion v2
- Agile Coordination
- AI Agents (2025β2030)
- Anthropic: Multi-Agent Research
- Efficient Agents (Cost Reduction)
- Google's AI Ecosystem
- Agents vs Agentic AI
- McKinsey: What is an Agent?
- OpenAI: Building Agents
- OpenAI Swarm Guide
- Ascendance of Agentic AI
- System Prompts Handbook
- AI-Enabled Hardware
- B2B Sales Tech & AI
- OpenAI: AI in Enterprise
- Scaling AI Use Cases
- Human-Centered AI
- AI & Productivity
- B2B Growth via Gen AI
- 42signals Synthesis
- CI in U.S. E-Commerce
- [CI: Amazon Arena](Competitive-Intelligence-in-the-E-commerce%20and%20Amazon-Arena_-Current-State-and-Future-Horizons.md)
- Premium Brands on Amazon
- Retail Media
- Claude Code Architecture (Full)
- Claude Code Best Practices
- Text-to-SQL RAG Patterns
- Streamlit Chatbot Guide
- NL-to-SQL Development
- Prompt Engineering Guide
- LangGraph & LangChain
- Notion Mastery
- Deep Research in Claude Code
- Recursive Deep Research
- Web Tech Overview