Bayesian power analysis #276 #277

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from
Closed
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -20,6 +20,7 @@ repos:
rev: 1.7.0
hooks:
- id: nbqa-ruff
args: [ --fix ]
- repo: https://github.com/econchick/interrogate
rev: 1.5.0
hooks:
289 changes: 282 additions & 7 deletions causalpy/pymc_experiments.py
@@ -12,7 +12,7 @@
"""

import warnings # noqa: I001
from typing import Union
from typing import Union, Dict

import arviz as az
import matplotlib.pyplot as plt
@@ -26,7 +26,7 @@
from causalpy.custom_exceptions import BadIndexException
from causalpy.custom_exceptions import DataException, FormulaException
from causalpy.plot_utils import plot_xY
from causalpy.utils import _is_variable_dummy_coded
from causalpy.utils import _is_variable_dummy_coded, compute_bayesian_tail_probability

LEGEND_FONT_SIZE = 12
az.style.use("arviz-darkgrid")
@@ -330,15 +330,290 @@ def plot(self, counterfactual_label="Counterfactual", **kwargs):

return fig, ax

def summary(self) -> None:
def _summary_intervention(self, alpha: float = 0.05, **kwargs) -> pd.DataFrame:
"""
Calculate and summarize the intervention analysis results in a DataFrame format.

This function performs cumulative and mean calculations on the posterior predictive distributions,
computes Bayesian tail probabilities, posterior estimations, causal effects, and confidence intervals.
It optionally applies corrections to the cumulative and mean calculations.

Parameters:
- alpha (float, optional): The significance level for confidence interval calculations. Default is 0.05.
- kwargs (Dict[str, Any], optional): Additional keyword arguments.
- "correction" (bool or Dict[str, float]): If True, applies predefined corrections to cumulative and mean results.
If a dictionary, the corrections for 'cumulative' and 'mean' should be provided. Default is False.

Returns:
- pd.DataFrame: A DataFrame with one row per aggregate ("cumulative" and "mean") and one column per
statistical measure: Bayesian tail probability, posterior estimation, causal effect, and confidence interval.
"""
correction = kwargs.get("correction", False)

results = {}
ci = (alpha * 100) / 2

# Cumulative calculations
cumulative_results = self.post_y.sum()
_mu_samples_cumulative = (
self.post_pred["posterior_predictive"]
.mu.stack(sample=("chain", "draw"))
.sum("obs_ind")
)

# Mean calculations
mean_results = self.post_y.mean()
_mu_samples_mean = (
self.post_pred["posterior_predictive"]
.mu.stack(sample=("chain", "draw"))
.mean("obs_ind")
)

if not isinstance(correction, bool):
_mu_samples_cumulative += correction["cumulative"]
_mu_samples_mean += correction["mean"]

# Bayesian Tail Probability
results["bayesian_tail_probability"] = {
"cumulative": compute_bayesian_tail_probability(
posterior=_mu_samples_cumulative, x=cumulative_results
),
"mean": compute_bayesian_tail_probability(
posterior=_mu_samples_mean, x=mean_results
),
}

# Posterior Mean
results["posterior_estimation"] = {
"cumulative": np.mean(_mu_samples_cumulative.values),
"mean": np.mean(_mu_samples_mean.values),
}

results["results"] = {"cumulative": cumulative_results, "mean": mean_results}

# Causal Effect
results["causal_effect"] = {
"cumulative": cumulative_results
- results["posterior_estimation"]["cumulative"],
"mean": mean_results - results["posterior_estimation"]["mean"],
}

# Confidence Intervals
results["ci"] = {
"cumulative": [
np.percentile(_mu_samples_cumulative, ci),
np.percentile(_mu_samples_cumulative, 100 - ci),
],
"mean": [
np.percentile(_mu_samples_mean, ci),
np.percentile(_mu_samples_mean, 100 - ci),
],
}

# Convert to DataFrame
results_df = pd.DataFrame(results)

return results_df

def summary(self, version="coefficients", **kwargs) -> Union[None, pd.DataFrame]:
"""
Print text output summarising the results
"""
if version == "coefficients":
print(f"{self.expt_type:=^80}")
print(f"Formula: {self.formula}")
# TODO: extra experiment specific outputs here
self.print_coefficients()
elif version == "intervention":
return self._summary_intervention(**kwargs)
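
For context, a hedged usage sketch of the new summary path (the `result` object and the correction offsets below are hypothetical, assuming a fitted PrePostFit-based experiment such as SyntheticControl):

# Usage sketch (hypothetical `result` object):
result.summary()  # default: prints the coefficient summary, returns None
summary_df = result.summary(
    version="intervention",
    correction={"cumulative": 10.0, "mean": 0.5},  # hypothetical offsets
)
print(summary_df)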

def _power_estimation(self, alpha: float = 0.05, correction: bool = False) -> Dict:
"""
Estimate the statistical power of an intervention based on cumulative and mean results.

print(f"{self.expt_type:=^80}")
print(f"Formula: {self.formula}")
# TODO: extra experiment specific outputs here
self.print_coefficients()
This function calculates posterior estimates, systematic differences, confidence intervals, and
minimum detectable effects (MDE) for both cumulative and mean measures. It can apply corrections to
account for systematic differences in the data.

[Collaborator review comment] At the moment the parameters and returns parts of the docstrings are not rendering properly when you build the docs. Can you check the notation used in the other docstrings and just double check this is rendering ok?

Parameters:
- alpha (float, optional): The significance level for confidence interval calculations. Default is 0.05.
- correction (bool, optional): If True, applies corrections to account for systematic differences in
cumulative and mean calculations. Default is False.

Returns:
- Dict: A dictionary containing key statistical measures such as posterior estimation,
systematic differences, confidence intervals, and posterior MDE for both cumulative and mean results.
"""
results = {}
ci = (alpha * 100) / 2

# Cumulative calculations
cumulative_results = self.post_y.sum()
_mu_samples_cumulative = (
self.post_pred["posterior_predictive"]
.mu.stack(sample=("chain", "draw"))
.sum("obs_ind")
)

# Mean calculations
mean_results = self.post_y.mean()
_mu_samples_mean = (
self.post_pred["posterior_predictive"]
.mu.stack(sample=("chain", "draw"))
.mean("obs_ind")
)

# Posterior Mean
results["posterior_estimation"] = {
"cumulative": np.mean(_mu_samples_cumulative.values),
"mean": np.mean(_mu_samples_mean.values),
}

results["results"] = {"cumulative": cumulative_results, "mean": mean_results}

results["_systematic_differences"] = {
"cumulative": results["results"]["cumulative"]
- results["posterior_estimation"]["cumulative"],
"mean": results["results"]["mean"]
- results["posterior_estimation"]["mean"],
}

if correction:
_mu_samples_cumulative += results["_systematic_differences"]["cumulative"]
_mu_samples_mean += results["_systematic_differences"]["mean"]

results["ci"] = {
"cumulative": [
np.percentile(_mu_samples_cumulative, ci),
np.percentile(_mu_samples_cumulative, 100 - ci),
],
"mean": [
np.percentile(_mu_samples_mean, ci),
np.percentile(_mu_samples_mean, 100 - ci),
],
}

cumulative_upper_mde = (
results["ci"]["cumulative"][1]
- results["posterior_estimation"]["cumulative"]
)
cumulative_lower_mde = (
results["posterior_estimation"]["cumulative"]
- results["ci"]["cumulative"][0]
)

mean_upper_mde = (
results["ci"]["mean"][1] - results["posterior_estimation"]["mean"]
)
mean_lower_mde = (
results["posterior_estimation"]["mean"] - results["ci"]["mean"][0]
)

results["posterior_mde"] = {
"cumulative": (cumulative_upper_mde + cumulative_lower_mde) / 2,
"mean": (mean_upper_mde + mean_lower_mde) / 2,
}
return results
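
Regarding the rendering comment above, a hedged sketch of how the Parameters/Returns block might look if the project's other docstrings use Sphinx-style field lists (an assumption; the convention actually used should be checked):

# Hypothetical docstring rewrite, not part of this diff:
def _power_estimation(self, alpha: float = 0.05, correction: bool = False) -> Dict:
    """
    Estimate the statistical power of an intervention based on cumulative and mean results.

    :param alpha: Significance level for the interval calculations. Default 0.05.
    :param correction: If True, apply corrections for systematic differences in the
        cumulative and mean calculations. Default False.
    :returns: Dictionary of posterior estimates, systematic differences, intervals,
        and posterior MDE for both cumulative and mean results.
    """
    ...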

[Collaborator review comment] The user-facing methods power_summary and plot_power are currently methods of the PrePostFit class. You can see from the UML class diagram that these methods will be available to both the SyntheticControl experiment class and the InterruptedTimeSeries class.

Clearly from the example you are using Bayesian power analysis in the context of synthetic control, but can I ask... does it make sense that these methods are also available in the interrupted time series situation? I don't know the answer, I've not thought about it. But we want to make sure that these methods are placed at the right point in the class hierarchy.

If the Bayesian power analysis is also appropriate for the interrupted time series context, then the best possible outcome would be to also create an example notebook for that. But I'd be fine if you instead created a separate issue to deal with that in the near future.

But if this new functionality only makes sense in the context of synthetic control, then can I ask you to either a) move these new methods into the SyntheticControl class, or b) create a Bayesian power analysis mix-in class instead. I don't mind which, but I think CausalPy will probably transition from the current class hierarchy to using mix-ins.

def power_summary(
    self, alpha: float = 0.05, correction: bool = False
) -> pd.DataFrame:
"""
Summarize the power estimation results in a DataFrame format.

This function calls '_power_estimation' to perform the power estimation calculations and then
converts the resulting dictionary into a pandas DataFrame for easier analysis and visualization.

[Collaborator review comment] Right now, the docs are set up to not render docstrings for private methods. At the moment I think this is fine because the API docs will focus only on what users are intended to use. If we keep it like that, then we might want to remove references to private methods in the public docstrings.

[Collaborator review comment] Same comment as above about the parameters and returns blocks not rendering properly when you build and check the docs locally.

Parameters:

- alpha (float, optional): The significance level for confidence interval calculations used in power estimation. Default is 0.05.
- correction (bool, optional): Indicates whether to apply corrections in the power estimation process. Default is False.

Returns:
- pd.DataFrame: A DataFrame representing the power estimation results, including posterior estimations,
systematic differences, confidence intervals, and posterior MDE for cumulative and mean results.
"""
return pd.DataFrame(self._power_estimation(alpha=alpha, correction=correction))
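
As one possible response to the mix-in suggestion above, a minimal sketch of option (b) (the class name and placement are hypothetical, assuming the power methods move over unchanged):

# Hypothetical mix-in, not part of this diff:
class BayesianPowerAnalysisMixin:
    """Bundles the Bayesian power analysis methods for any experiment class
    that exposes `post_y` and `post_pred`."""

    def power_summary(
        self, alpha: float = 0.05, correction: bool = False
    ) -> pd.DataFrame:
        return pd.DataFrame(self._power_estimation(alpha=alpha, correction=correction))

# Experiment classes would then opt in explicitly, e.g.:
# class SyntheticControl(BayesianPowerAnalysisMixin, PrePostFit): ...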

def power_plot(self, alpha: float = 0.05, correction: bool = False) -> plt.Figure:
[Collaborator review comment] Can I request this is called plot_power instead?

"""
Generate and return a figure containing plots that visualize power estimation results.

This function creates a two-panel plot (for mean and cumulative measures) to visualize the posterior distributions
along with the confidence intervals, real mean, and posterior mean values. It allows for adjustments based on
systematic differences if the correction is applied.

[Collaborator review comment] Same comment as above about the parameters and returns blocks not rendering properly when you build and check the docs locally.

Parameters:

- alpha (float, optional): The significance level for confidence interval calculations used in power estimation. Default is 0.05.
- correction (bool, optional): Indicates whether to apply corrections for systematic differences in the plotting process. Default is False.

Returns:
- plt.Figure: A matplotlib figure object containing the plots.
"""
_estimates = self._power_estimation(alpha=alpha, correction=correction)

fig, axs = plt.subplots(1, 2, figsize=(20, 6)) # Two subplots side by side

# Adjustments for Mean and Cumulative plots
for i, key in enumerate(["mean", "cumulative"]):
_mu_samples = self.post_pred["posterior_predictive"].mu.stack(
sample=("chain", "draw")
)
if key == "mean":
_mu_samples = _mu_samples.mean("obs_ind")
elif key == "cumulative":
_mu_samples = _mu_samples.sum("obs_ind")

if correction:
_mu_samples += _estimates["_systematic_differences"][key]

# Histogram and KDE
sns.histplot(
_mu_samples,
bins=30,
kde=True,
ax=axs[i],
color="C0",
stat="density",
alpha=0.6,
)
kde_x, kde_y = (
sns.kdeplot(_mu_samples, color="C1", fill=True, ax=axs[i])
.get_lines()[0]
.get_data()
)

# Adjust y-limits based on max density
max_density = max(kde_y)
axs[i].set_ylim(0, max_density + 0.05 * max_density) # Adding 5% buffer

# Fill between for the percentile interval
axs[i].fill_betweenx(
y=np.linspace(0, max_density + 0.05 * max_density, 100),
x1=_estimates["ci"][key][0],
x2=_estimates["ci"][key][1],
color="C0",
alpha=0.3,
label="C.I",
)

# Vertical lines for the means
axs[i].axvline(
_estimates["results"][key], color="C3", linestyle="-", label="Real Mean"
)
if not correction:
axs[i].axvline(
_estimates["posterior_estimation"][key],
color="C4",
linestyle="--",
label="Posterior Mean",
)

axs[i].set_title(f"Posterior of mu ({key.capitalize()})")
axs[i].set_xlabel("mu")
axs[i].set_ylabel("Density")
axs[i].legend()

return fig
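
A hedged usage sketch of the two public entry points (assuming a fitted `result` experiment as above; the output path is a placeholder):

# Usage sketch (hypothetical `result` object):
power_df = result.power_summary(alpha=0.05, correction=True)
print(power_df)

fig = result.power_plot(alpha=0.05, correction=False)
fig.savefig("power_analysis.png", dpi=150)  # placeholder path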


class InterruptedTimeSeries(PrePostFit):
28 changes: 28 additions & 0 deletions causalpy/utils.py
@@ -1,7 +1,9 @@
"""
Utility functions
"""
import numpy as np
import pandas as pd
from scipy.stats import norm


def _is_variable_dummy_coded(series: pd.Series) -> bool:
@@ -13,3 +15,29 @@ def _is_variable_dummy_coded(series: pd.Series) -> bool:
def _series_has_2_levels(series: pd.Series) -> bool:
"""Check that the variable in the provided Series has 2 levels"""
return len(pd.Categorical(series).categories) == 2


def compute_bayesian_tail_probability(posterior, x) -> float:
"""
Calculate the probability of a given value falling within the distribution defined by the posterior.

[Collaborator review comment] Same comment as above about the parameters and returns blocks not rendering properly when you build and check the docs locally.

Args:
- posterior: a list or array-like object of posterior samples defining the distribution
- x: a numeric value for which to calculate the probability of being in the distribution

Returns:
- prob: a numeric value representing the probability of x being in the distribution
"""
lower_bound, upper_bound = min(posterior), max(posterior)
mean, std = np.mean(posterior), np.std(posterior)

cdf_lower = norm.cdf(lower_bound, mean, std)
cdf_upper = 1 - norm.cdf(upper_bound, mean, std)
cdf_x = norm.cdf(x, mean, std)

if cdf_x <= 0.5:
probability = 2 * (cdf_x - cdf_lower) / (1 - cdf_lower - cdf_upper)
else:
probability = 2 * (1 - cdf_x + cdf_lower) / (1 - cdf_lower - cdf_upper)
[Collaborator review comment] I think it would be safe, and beneficial, to make sure this line is covered by tests.


return abs(round(probability, 2))
[Collaborator review comment] My feeling is that we should not round the result at this point, only when displaying the results.
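
For illustration, a minimal hedged example exercising the new helper on synthetic draws (the expected values in the comments are approximate, not verified output):

import numpy as np
from causalpy.utils import compute_bayesian_tail_probability

rng = np.random.default_rng(42)
posterior = rng.normal(loc=0.0, scale=1.0, size=4_000)

# A value at the centre of the posterior should give a probability near 1,
# while a value deep in the tail should give a probability near 0.
compute_bayesian_tail_probability(posterior, 0.0)  # ~1.0
compute_bayesian_tail_probability(posterior, 3.0)  # ~0.0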

6 changes: 3 additions & 3 deletions docs/source/_static/interrogate_badge.svg
(SVG badge file; diff not displayed.)
1 change: 1 addition & 0 deletions docs/source/examples.rst
@@ -16,6 +16,7 @@ Synthetic Control
notebooks/sc_pymc.ipynb
notebooks/sc_skl.ipynb
notebooks/sc_pymc_brexit.ipynb
notebooks/power_analysis.ipynb
notebooks/geolift1.ipynb

