Interfaces for probability distributions #140

sbfnk (Maintainer) · 2023-11-16T12:06:20Z

Probability distributions are ubiquitous in epidemiological work and used, e.g., for specifying the number of secondary infections generated by an infectious person (the offspring distribution) or delays in the process of infection or reporting (e.g. incubation periods). At the moment we have at least 4 different ways of specifying them across our packages, which is perhaps not ideal from a user perspective:

1. As a function that takes a single vector, x, and returns a density. This was discussed in the context of the cfr package and decided on as the approach most in line with best practice in R.
2. As a character string that refers to a function that needs to be present in the environment. This is the approach taken in epichains (and previously bpmodels) for specifying the offspring distribution in the likelihood function. Confusingly, the serial interval distribution can then be specified using a function that takes a single argument in the corresponding simulation function, and there's an open issue to harmonise this one way or the other. The whole issue of function vs. string was discussed extensively. My conclusion from that discussion was that passing distributions as functions was preferable, but in the case of the offspring distribution impossible, because we need to know the distribution in order to look up the corresponding log-likelihood function (if it exists) provided in the package (examples).
3. Using a function interface, the proposed approach in EpiNow2. Here, again, we need to know the distribution in order to pass it to Stan, but we also want to be able to nest distributions so that, e.g., a parameter of one distribution is drawn from another distribution. The proposed interface is similar to the approach taken in the well-established rstanarm package and broadly in line with the probabilistic programming paradigm represented by Stan.
4. Using epidist objects, as done in epiparameter. I think the aim here is to represent as much of the available information about the distribution as possible so that it can be used for further processing in many different ways (but happy to hear of other design motivations).
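To make the nesting in approach 3 concrete, here is a minimal sketch in base R. It is not the EpiNow2 interface: `new_dist()`, `draw_one()`, the `samplers` table and the `dist_sketch` class are all hypothetical names made up for this illustration. The point is only that the family is stored by name (so it could be handed on to something like Stan), while any parameter may itself be a distribution.

```r
# Hypothetical constructor: a distribution is a family name plus named
# parameters, each of which is either a number or another distribution.
new_dist <- function(family, ...) {
  structure(list(family = family, params = list(...)), class = "dist_sketch")
}

# Map family names to base R samplers (only two, for illustration).
samplers <- list(normal = rnorm, gamma = rgamma)

# Draw one value, first resolving any parameter that is itself a distribution.
draw_one <- function(d) {
  resolved <- lapply(d$params, function(p) {
    if (inherits(p, "dist_sketch")) draw_one(p) else p
  })
  do.call(samplers[[d$family]], c(list(n = 1), resolved))
}

# A gamma delay whose shape is uncertain and drawn from a normal distribution.
delay <- new_dist(
  "gamma",
  shape = new_dist("normal", mean = 4, sd = 0.1),
  rate = 2
)
set.seed(1)
x <- draw_one(delay)
```

Because `delay$family` is still the plain string "gamma", code that needs to know which distribution it was given can find out without inspecting a function body.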

I don't have any immediate solution for harmonising this beyond perhaps using a more comprehensive distribution interface such as distr6, which would introduce another layer of complexity. I'd be keen to hear people's thoughts and/or suggestions for improving both interoperability and clarity to users whilst fulfilling the specific requirements in the different packages.

TimTaylor · 2023-11-16T13:42:20Z

Quick thoughts whilst in my head. It may be that multiple approaches are needed but that the implementations of those can be standardised.

1 (or variations of it) is good for flexibility and can have a relatively straightforward pattern but you do need some validation to ensure they do what you want, e.g.

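do_something <- function(x, f, ...) {
    # valid_function() and valid_output() are placeholders for package-specific checks
    if (!valid_function(f))
        stop("invalid function")
    # re-throw any failure inside f with a more informative message
    out <- tryCatch(f(x, ...), error = function(e) stop("useful error message"))
    if (!valid_output(out))
        stop("another useful error message")

    # do something useful
}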
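As one possible way of filling in the placeholder checks above, the sketch below uses only base R. The function name `eval_density()` and the specific checks (one non-negative, non-missing number per element of `x`) are assumptions chosen for illustration; a real package would pick checks that fit its own use case.

```r
# Hypothetical helper: evaluate a user-supplied density `f` at `x`,
# checking the function and its output before using them.
eval_density <- function(x, f, ...) {
  if (!is.function(f)) {
    stop("`f` must be a function, e.g. function(x) dgamma(x, 2, 1)")
  }
  # Re-throw any error from `f` with some context added.
  out <- tryCatch(
    f(x, ...),
    error = function(e) {
      stop("`f` failed when called on `x`: ", conditionMessage(e))
    }
  )
  if (!is.numeric(out) || length(out) != length(x) ||
      anyNA(out) || any(out < 0)) {
    stop("`f` must return one non-negative, non-missing number per element of `x`")
  }
  out
}

dens <- eval_density(c(0.5, 1, 2), function(x) dgamma(x, shape = 2, rate = 1))
```

Passing a character string such as `"dgamma"` instead of a function, or a function that returns a single value for a longer `x`, stops with an error rather than failing later in the calculation.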

2 is good when you want to only support specific distributions

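do_something <- function(x, f = c("weibull", "norm"), ...) {
    # restrict f to the supported distribution names
    f <- match.arg(f)
    # look up the matching density function by name
    ddist <- list(weibull = dweibull, norm = dnorm)
    f <- ddist[[f]]
    out <- f(x, ...)

    # now do something useful
}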

I viewed 4 more as a catalogue from which end users could extract parameters from the literature and then use them in functions with interfaces 1, 2 and (maybe) 4. Developers may want to create methods that work with those objects for convenience, but I assumed the main thrust was making known parameters easily available.

I'm not familiar with 3 so would like to understand it more before I comment.

TimTaylor · 2023-11-16T13:52:28Z

Also I'd be similarly wary about introducing another abstraction via a package (but ... see edit below). I know {distr6} was just given as an example but it would be a no go for me as not on CRAN (nor will it ever be AFAICT).

EDIT: That said, I've now had time to look at 3 and I really like the idea here and how it allows nesting of distributions. However, I'm very ignorant of the wider package ecosystem in regards to implementations like this. Are there any out there you could make use of rather than creating another? If the answer is no, then this feels like something that could be a package in its own right.

So my current thinking is that there's a place for 1, 2 and 3. Functions that utilise 1 or 2 could always be replaced with generics and methods provided for 3 (and 4 for that matter) if the more custom approach makes the experience better for users.
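A rough sketch of what "generics and methods" could look like here, using S3. Everything in it is hypothetical: the generic `as_density()`, the `dist_sketch` class (a list with a `family` name and a `params` list) and the two supported families are made up for illustration and are not taken from any of the packages discussed.

```r
# Hypothetical generic: turn whatever the user supplied into a density function.
as_density <- function(dist, ...) UseMethod("as_density")

# Approach 1: a function is used as-is.
as_density.function <- function(dist, ...) dist

# Approach 2: a name from a fixed set of supported distributions.
as_density.character <- function(dist, ...) {
  dist <- match.arg(dist, c("weibull", "norm"))
  base <- list(weibull = dweibull, norm = dnorm)[[dist]]
  function(x) base(x, ...)
}

# Approaches 3 and 4: a distribution object carrying its family and parameters.
as_density.dist_sketch <- function(dist, ...) {
  do.call(as_density, c(list(dist$family), dist$params))
}

obj <- structure(
  list(family = "weibull", params = list(shape = 2, scale = 1)),
  class = "dist_sketch"
)
d1 <- as_density(function(x) dweibull(x, 2, 1))(1.5)
d2 <- as_density("weibull", shape = 2, scale = 1)(1.5)
d3 <- as_density(obj)(1.5)
```

All three calls evaluate the same Weibull density, so a downstream function would only need to call `as_density()` once and could then ignore which interface the user chose.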

Hope this makes sense. Interested to hear what others think.

joshwlambert (Collaborator) · 2023-11-17T17:37:07Z

Thanks for raising this, @sbfnk. I echo much of what @TimTaylor has said.

In terms of what should be accepted by functions/packages that use distributions, I think 1 is preferred, with the aim at some point for Epiverse-TRACE packages (and potentially others) to accept 4 as well.

My feeling with 4 (the <epidist> class) is that it should utilise existing infrastructure for working with probability distributions. It currently does this by using {distributional} and {distcrete}; the reasoning for these choices is outlined in this old {epiparameter} issue.

One possibility, building on something mentioned by Tim:

> Are there any out there you could make use of rather than creating another? If this answer is no then this feels like something that could be a package in its own right.

If option 3 became a standalone package, it could be utilised as the distribution infrastructure for <epidist> (but I haven't taken an in-depth look at the {EpiNow2} PR for this implementation, so I'm not 100% sure of the details).

1 reply

adamkucharski Nov 19, 2023
Maintainer

Thanks for flagging. Having to wrangle parameters/distributions (or learn a new standard) can be a major barrier to new users, so in general I'd be keen on solutions that integrate with existing practices/standards in R as much as possible. If we can't avoid some additional complexity (e.g. for passing to Stan or importing from a library), I'd favour functionality that keeps these obstacles away from the everyday user. Perhaps a topic for an upcoming meeting (as it also relates to the issue of how to pass parameter uncertainty between tasks/packages)?

sbfnk (Maintainer, Author) · 2024-03-28T10:28:41Z

Some updates on this:

- Approach 2 in {epichains} has been superseded by approach 1: the log-likelihood lookup now uses the name of the function that has been passed (if it has a name). So we're down to 3 approaches.
- A first version of approach 3 has now been implemented in {EpiNow2} in a way that has been designed to be independent from other functionality in the package. That means that in principle it could be taken out and developed as a separate package that both {EpiNow2} and {epiparameter} depend on. The plan is to explore this once it has become part of an {EpiNow2} release (1.5.0 planned for April 2024), involving all relevant stakeholders and taking a conservative approach with respect to reverse dependencies.
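For readers wondering how a lookup by function name (the first update above) can work at all, here is a minimal base R sketch. It is not the {epichains} implementation: `loglik_table`, `find_loglik()` and the single Poisson entry are hypothetical and only illustrate the general technique of recovering the name of an argument with `substitute()`.

```r
# Hypothetical table of analytical log-likelihoods, keyed by the name of the
# function the user passes (the single entry is illustrative).
loglik_table <- list(
  rpois = function(x, lambda) sum(dpois(x, lambda, log = TRUE))
)

# Recover the name of the function passed as `offspring_dist`, if it has one,
# and return the matching log-likelihood (or NULL when there is none).
find_loglik <- function(offspring_dist) {
  expr <- substitute(offspring_dist)
  if (!is.name(expr)) {
    return(NULL) # anonymous function or any other expression
  }
  loglik_table[[as.character(expr)]]
}

ll <- find_loglik(rpois)
no_ll <- find_loglik(function(n) rpois(n, 2))
```

In this sketch an anonymous function, or a named function without a table entry, yields `NULL`, at which point a caller would have to fall back on something else, such as simulation.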

2 replies

TimTaylor Mar 28, 2024

Is this one of the few cases where a function prefix would be useful? I prefer the capitalised approach but do worry about collisions. Aesthetically horrible but ... 🤔

sbfnk Mar 28, 2024
Maintainer Author

Possibly, yes. Kind of annoying that it's only the gamma that's the issue. Looking at the implementations that inspired ours: {distributions3} has the same problem, {rstanarm} avoids it by avoiding the gamma altogether, and {distributional} uses prefixes. There really isn't a perfect solution.

Related discussion with various opinions happened at https://community.epinowcast.org/t/use-of-enw-prefix/42/2
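A small illustration of the kind of collision under discussion, assuming the clash is with `stats::Gamma()`, the GLM family constructor that ships with R (the thread does not name the colliding function, so this is a guess). The capitalised constructor and the `dist_` prefix below are hypothetical stand-ins, not the actual {EpiNow2} or {distributional} functions.

```r
# stats already exports Gamma(), which returns a GLM family object.
fam <- stats::Gamma()

# A hypothetical capitalised distribution constructor with the same name
# masks it for anyone who calls Gamma() unqualified.
Gamma <- function(shape, rate) {
  structure(
    list(family = "gamma", shape = shape, rate = rate),
    class = "dist_sketch"
  )
}
masked <- Gamma(shape = 2, rate = 1)

# A prefixed name (prefix chosen arbitrarily here) leaves stats::Gamma alone.
dist_gamma <- Gamma
rm(Gamma)
still_family <- Gamma()
prefixed <- dist_gamma(shape = 2, rate = 1)
```

The other capitalised names one might use for common delay distributions do not appear to clash with anything in the packages attached by default, which would be consistent with the gamma being the only awkward case.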
