
Make sample_posterior_predictive API less surprising #7069

Open · ricardoV94 opened this issue Dec 15, 2023 · 9 comments

ricardoV94 (Member) commented Dec 15, 2023

Description

The current behavior of sample_posterior_predictive regarding "volatility" and var_names is very opaque and error-prone.

If you include a model variable in var_names, it will always be resampled, even if it is an unobserved variable with values in the trace. Furthermore, any variable that depends on a variable being sampled ("volatile") will also be resampled, even if it is not in var_names, and becomes "volatile" itself. As a result, it's easy for users to accidentally resample the whole model from the prior when they only wanted to make predictions and copy some variables from the posterior to the predictions group.

Most of the time, when I've seen users include unobserved variables in var_names, they only wanted to copy those over from the posterior to the posterior_predictive group, alongside the newly sampled observed variables. From their perspective, the volatile logic was a bug.

Furthermore, if an unobserved variable depends on MutableData/coords that have changed, it will also be resampled, and the same cascade of volatility propagation follows.
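A minimal sketch of the surprising case (model and variable names are illustrative, not taken from any user report):

import numpy as np
import pymc as pm

with pm.Model() as m:
    mu = pm.Normal("mu")
    y = pm.Normal("y", mu=mu, sigma=1, observed=np.zeros(10))
    idata = pm.sample()

with m:
    # A user may only want to copy mu's posterior draws over next to the
    # new y draws, but including it in var_names marks it volatile: mu is
    # resampled from the prior, and y, which depends on it, follows suit.
    pp = pm.sample_posterior_predictive(idata, var_names=["mu", "y"])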

If this doesn't make sense, I tried to explain the current behavior in #7014

Why did we ever think this behavior was a good idea?

Both behaviors are sensible in that they provide internally consistent draws, and avoid reusing posteriors in cases where they likely make little theoretical sense (if you have an intercept for the cities ["Paris", "Lisbon"], and now changed the coords to be ["London", "Madrid"], there is no reason to think the old posteriors should apply to the new cities), or little computational sense (the length of the intercepts may have changed, and reusing the old ones would likely cause shape errors or silent bugs).

Aside: Maybe in that example users set the new coords to be ["Paris", "Lisbon", "London"], or they changed the shape from 2 to 3, expecting the new dim to be sampled but the original two reused. That would be nice but hard to do, and on the verge of mind reading if only the shape changed. Such behavior should not be the competence of sample_posterior_predictive, but of a helper model-transformation function. sample_posterior_predictive should only have to worry about whether the variables with the same name as those in the posterior trace can be used as is, or must be resampled.
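For concreteness, a sketch of the cities example (assuming mutable coords via coords_mutable and Model.set_dim; the data are made up):

import numpy as np
import pymc as pm

with pm.Model(coords_mutable={"city": ["Paris", "Lisbon"]}) as m:
    intercept = pm.Normal("intercept", dims="city")
    y = pm.Normal("y", mu=intercept, sigma=1, observed=np.array([1.0, 2.0]), dims="city")
    idata = pm.sample()

with m:
    # The old posterior draws of `intercept` say nothing about the new
    # cities, so sample_posterior_predictive treats it as volatile
    m.set_dim("city", new_length=2, coord_values=["London", "Madrid"])
    pp = pm.sample_posterior_predictive(idata)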

The current flexibility is not only a burden! It enables many interesting things, like https://www.pymc-labs.io/blog-posts/out-of-model-predictions-with-pymc/

Anyway, the original motivation for these changes was Deterministics, which should sometimes be resampled (because they depend on MutableData that has changed) and at other times reused, to avoid useless re-computations in sample_posterior_predictive.

We initially went overboard and resampled everything that depended on MutableData, regardless of whether it had changed, which was fixed in #6147.

Because the behavior still seemed too implicit and hard to debug, we tried to make logging explicit, although some environments like vscode/colab tend to suppress those logs (#5973).

Latest proposal

IMO the issue with the current behavior is that it's too silent/implicit and only rarely what users want. Instead of making it easy for power users to do what they want, I would err on the side of caution and alert users in those cases where reusing posterior draws may not be reasonable. My proposal:

  • Distinguish between var_names and sample_vars: the first determines which variables are included in the returned trace, the second which variables are actually sampled (see the sketch after this list). By default, observed variables are included in both, and the function behaves as the current default. I think most users already think of var_names this way from its use in ArviZ functions.

  • For variables that depend on constant_data / coords / dims that have changed but are not in sample_vars, instead of defaulting to resampling them, try to reuse their draws from the trace, but issue a warning that the posterior draws may no longer be valid since the model structure has changed. If they are in sample_vars, there is no warning and the behavior is as now. We could make it a hard error... but someone will certainly complain that it's actually what they wanted to do, and ask how they can do it then.

  • The warning should happen regardless of whether those constant_data / coords / dims are defined in the model as MutableData/ConstantData or coords/coords_mutable (brought up in #6876: sample_posterior_predictive does not consider changes in "ConstantData" and "Constant coords").

  • Deterministics, which are the reason much of the behavior was written as it is (we want to update Deterministics if model data / coords have changed, but reuse them from the trace otherwise), still follow the current implicit volatile logic. They cannot be included in sample_vars, but can be in var_names, in which case they are recomputed or copied depending on whether there were any changes in the model structure leading up to them (or whether they depend on a variable that is being sampled).
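A hypothetical call under this proposal could look like the following (sample_vars is the proposed keyword; it does not exist in the current API):

pp = pm.sample_posterior_predictive(
    idata,
    var_names=["mu", "y"],  # variables to include in the output group
    sample_vars=["y"],      # variables actually resampled; "mu" is copied over
)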

More extreme proposal (rejected in the comments)

Only allow sampling of observed variables in sample_posterior_predictive, and create a new function predict (or whatever, just not so verbose) that has all the flexibility sample_posterior_predictive currently has. It could also be used for the prior when no idata is passed.

This could also allow the values of ObservedRVs to be used in place of the variables when they are not being sampled, which is often useful in forecasting models.


Recent examples where var_names did not do what the user expected:

ricardoV94 changed the title from "Make sample_posterior_predictive API more intuitive" to "Make sample_posterior_predictive API less surprising" on Dec 15, 2023
zaxtax (Contributor) commented Dec 18, 2023

I don't think sample_posterior_predictive needs to be made more restrictive. That the posterior predictive is usually used for replications of the observed data is more of a convention. There is a documentation deficit in advertising the power of the function, and that should be addressed. There should be at least one example for the function where what's produced isn't replications of the observed data.

I also think there are coding conventions we could adopt that would make the functionality less surprising. For example, in the blog post, there is this model:

with pm.Model() as m:
    mu = pm.Normal("mu")
    # sigma must be positive; pm.HalfNormal assumed here (pm.Normal in the
    # original snippet looks like a typo)
    sigma = pm.HalfNormal("sigma")
    y = pm.GaussianRandomWalk(
        "y",
        init_dist=pm.Normal.dist(),
        mu=mu,
        sigma=sigma,
        observed=y_obs,  # y_obs: the observed series
    )

    idata = pm.sample(random_seed=rng)

And this forecast:

with pm.Model() as forecast_m:
    mu = pm.Normal("mu")

    # Flat sigma for illustration purposes
    sigma = pm.Flat("sigma")

    # init_dist now starts on last observed value of y
    pm.GaussianRandomWalk(
        "y_forecast",
        init_dist=pm.DiracDelta.dist(y_obs[-1]),
        mu=mu,
        sigma=sigma,
        steps=99,
    )

    pp = pm.sample_posterior_predictive(
        idata, 
        var_names=["y_forecast"], 
        predictions=True, 
        random_seed=rng,
    )

I think the forecast should just drop mu and sigma and re-use them from the original model. That makes it clear they aren't coming from the forecast model:

with pm.Model() as forecast_m:
    # mu and sigma on the right-hand side are the variables of the original
    # model m, re-registered here only as Deterministics
    mu = pm.Deterministic("mu", mu)
    sigma = pm.Deterministic("sigma", sigma)
    # init_dist now starts on last observed value of y
    pm.GaussianRandomWalk(
        "y_forecast",
        init_dist=pm.DiracDelta.dist(y_obs[-1]),
        mu=mu,
        sigma=sigma,
        steps=99,
    )

    pp = pm.sample_posterior_predictive(
        idata, 
        var_names=["y_forecast"], 
        predictions=True, 
        random_seed=rng,
    )

I know this convention conflicts with the way I have been using pmx.as_model to fit models and predict forecasts. But if you aren't doing that, I don't see what's gained by writing a prior distribution that we know isn't even used.

ricardoV94 (Member Author) commented

I think the forecast should just drop mu and sigma and re-use them from the original model. It makes it clear they aren't coming from the forecast model

I don't like that. Sharing variables across models is error-prone, and it would force users to create a model even in settings where they are just using a stored InferenceData and don't need the original model at all. There was some discussion of defining an explicit input variable, instead of a dummy that gets ignored, like x = pm.from_trace("x"). In either case, I think the forecast model should be self-contained.

Regarding documentation, there is an open PR just for that: #7014

zaxtax (Contributor) commented Dec 18, 2023

It doesn't need to be passed in, and I'm equally happy with an explicit input variable, especially if the error messages are helpful when you forget to include a trace. But I don't like this business of saying "we ignore the distribution, so don't worry".

ricardoV94 (Member Author) commented Dec 18, 2023

But ignoring model variables is exactly what happens in normal uses of sample_posterior_predictive (not just forecasting).

zaxtax (Contributor) commented Dec 18, 2023

I think articulating which model variables are ignored is also something that could be better documented, since my mental model right now is: all variables present in the trace object that were not observations.

lucianopaz (Contributor) commented

I'll first address the points @zaxtax raised. I agree with @ricardoV94: the idea of having a Model instance is for it to work as a sort of book-keeping helper; it has to store some meta information about the data-generating process. If you want to extend an existing Model with new variables, there should be other ways to do so. Two methods come to mind. The first is to use the clone_model functionality that came with model_from_fgraph and its converse, and do:

with pm.model.fgraph.clone_model(model) as forecast_model:
    # Extend the cloned model, reusing its copies of mu and sigma
    pm.GaussianRandomWalk(
        "y_forecast",
        init_dist=pm.DiracDelta.dist(y_obs[-1]),
        mu=forecast_model["mu"],
        sigma=forecast_model["sigma"],
        steps=99,
    )
    pp = pm.sample_posterior_predictive(idata, var_names=["y_forecast"])

The alternative that @ricardoV94 is suggesting is to use a completely new model instance but define mu and sigma like this:

import pytensor.tensor as pt

with pm.Model() as forecast_model:  # a completely new model instance
    # Plain pytensor inputs: their values must be provided at sampling time
    mu = pt.scalar("mu")
    sigma = pt.scalar("sigma")
    pm.GaussianRandomWalk(
        "y_forecast",
        init_dist=pm.DiracDelta.dist(y_obs[-1]),
        mu=mu,
        sigma=sigma,
        steps=99,
    )
    pp = pm.sample_posterior_predictive(idata, var_names=["y_forecast"])

This second approach basically says that mu and sigma must be provided as inputs to whatever function gets compiled down the line. It works out of the box with pm.sample_posterior_predictive, as long as the names you give the variables are actually present in the InferenceData object. If any variable is missing, you'll run into a TypeError or MissingInputError. This approach does not play well with either pm.sample or pm.sample_prior_predictive, but if we wanted to, we could try to add support for it (it would play nicer with a functional approach to supplying input values to variables than with changing shared variables).

The original discussion

Let me start off by saying that I don't want to change the meaning of the var_names argument. If we want to change anything, it should be done by adding a new keyword argument that behaves differently, and eventually deprecate var_names in favor of the new stuff.

It's true that var_names is playing two roles at the same time, and that has proven to be both confusing and cumbersome for some uses. I agree with adding two new arguments that distinguish between what we want to mark as volatile, and thus resample instead of grabbing from the idata, and what we want to return in the dataset at the end of sampling. I don't like the more extreme version where we add an entirely new function called predict.

The only thing I think we might want to invest some time in is designing some lower-level functions (in a similar spirit to compile_forward_sampling_function) to handle the situation where a user wants to also specify input variables for the function that gets compiled. Having that would allow us to use plain old pytensor.Variable objects as input nodes for whatever arbitrary values a user wants to pass in, making the functional paradigm more transparent to implement.

The main reason I can think of for wanting this is to be able to run distributed predictions from the same compiled model.

ricardoV94 (Member Author) commented Jan 5, 2024

Another thing I am inclined towards, regardless of which kwarg we use, is to require users to explicitly enumerate the variables that exist in the trace but must be resampled due to volatility.

I would guess 99.9% of users don't really want to resample variables that are in the trace due to the volatility logic; when it happens, it's usually a bug / conceptual mistake.

For instance, this user recently wrote a model where a variable's shape depended on MutableData, while trying to figure out how to resize things for sample_posterior_predictive. It's obvious from context that they do not want that variable to be sampled from the prior when they do the resizing, so they must write it differently (probably using a mutable advanced index downstream of the RV specification): https://discourse.pymc.io/t/shape-error-when-making-out-of-sample-predictions/13568/9?u=ricardov94

We could raise when a variable in the trace is found to be volatile but was not manually added to sample_vars (or whatever the kwarg ends up being), as sketched below. We can instruct users that, if this is what they want (which I doubt will often be the case in practice), they just have to add it.
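A rough sketch of that check (all names here are hypothetical, not existing pymc internals):

def check_volatile_trace_vars(volatile_vars, trace_vars, sample_vars):
    # Hypothetical helper: raise if a trace variable would be silently
    # resampled due to volatility without the user explicitly opting in
    unexpected = (set(volatile_vars) & set(trace_vars)) - set(sample_vars)
    if unexpected:
        raise ValueError(
            f"Variables {sorted(unexpected)} exist in the trace but are volatile. "
            "Add them to `sample_vars` if you really want to resample them."
        )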

It makes sample_posterior_predictive much more opinionated, but I believe this is outweighed by making it much more fool-proof for users who are not aware of the internals (99.1%?).

The low-level function we use doesn't need to be opinionated/foolproof, just the user-facing sample_posterior_predictive. WDYT?

OriolAbril (Member) commented

I like separating var_names and sample_vars, and this extra change in behaviour sounds good too.

ricardoV94 (Member Author) commented

I have updated the top comment with the latest proposal.
