From 7044696fc5e87eb76a5c2341e3715311e20909a6 Mon Sep 17 00:00:00 2001 From: Fabricia Nascimento Date: Tue, 8 Oct 2024 14:52:30 +0100 Subject: [PATCH] updated vignette --- .gitignore | 1 - vignettes/SenegalHIVmodel.Rmd | 943 ++++++++++++++++++++++++++++++++++ 2 files changed, 943 insertions(+), 1 deletion(-) create mode 100644 vignettes/SenegalHIVmodel.Rmd diff --git a/.gitignore b/.gitignore index 83dbfa2..9c57e26 100644 --- a/.gitignore +++ b/.gitignore @@ -8,7 +8,6 @@ Growth reorganizing_data.R *.rda *.tex -*.Rmd .Rhistory .Rbuildignore .Rapp.history diff --git a/vignettes/SenegalHIVmodel.Rmd b/vignettes/SenegalHIVmodel.Rmd new file mode 100644 index 0000000..0b07114 --- /dev/null +++ b/vignettes/SenegalHIVmodel.Rmd @@ -0,0 +1,943 @@ +--- +title: "HIV Senegal Model" +author: "Fabricia F. Nascimento" +date: "`r Sys.Date()`" +output: + bookdown::html_vignette2: + #bookdown::pdf_book: + toc: TRUE + fig_caption: TRUE + global_numbering: TRUE + citation_package: biblatex +bibliography: bib/senegalHIVmodel.bib +pkgdown: + as_is: true +fontsize: 12pt +vignette: > + %\VignetteIndexEntry{HIV Senegal Model} + %\VignetteEngine{rmarkdown::render} + %\VignetteEncoding{UTF-8} +--- + +```{r setup, include = FALSE} +library(magrittr) + +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) + +options(kableExtra.html.bsTable = T) +``` + +# Introduction + +This vignette will demonstrate how we estimated transmission rates using +HIV genetic sequences from Senegal. This also provides a guidance on how analysis +described in [Nascimento _et al_. 2020](https://www.sciencedirect.com/science/article/pii/S1755436519300660) was +carried out.We analysed HIV-1 sequences from subtypes B, C and 02_AG. + +Here we will provide a snapshot of these analyses. + +# Basic requirements +This vignette assumes that you know the basics of R and have the following packages +already installed: + +* ape: for phylogenetic trees; +* akima: necessary for interpolation of data (used for the calculation of the +likelihood by the phydynR package); +* [BayesianTools]("https://github.com/florianhartig/BayesianTools"): package for +Bayesian inference; +* CODA: is a series of tools to analze the output of Markov chain Monte Carlo; +* devtools: useful for installing packages directly from github repository; +* [phydynR]("https://github.com/emvolz-phylodynamics/phydynR"): implements the +coalescent simulation and likelihood function for phylodynamic analysis; +* [treedater]("https://github.com/emvolz/treedater"): fits a molecular clock to +a phylogenetic tree. +* [senegalHIVmodel](https://github.com/thednainus/senegalHIVmodel/tree/master): contains the example data to run the analysis described in this tutorial. + + +## Load the necessary packages: +```{r message = FALSE} + library(ape) + library(akima) + library(BayesianTools) + library(coda) + library(phydynR) + library(treedater) + library(senegalHIVmodel) + library(ggplot2) +``` + +# The Model +The model we fit is based on the structured coalescent models [@Volz2012]. These +models are used to estimate epidemiological parameters using a phylogenetic tree +and information on states of each tip of the tree. These states are discrete-trait +information representing each sequences. + +In our mathematical model we have 4 different discrete-traits associated to each +genetic sequence: + +* $gpf$ = HIV sample from the general population -- females; +* $gpm$ = HIV sample from the general population -- males; +* $msm$ = HIV sample from men that have sex with other men; +* $src$ = source sample, which are HIV samples from individuals who are from +other countries and not from Senegal. + +## Stage of infection + +We fit the HIV epidemic in Senegal using ordinary differential equations (ODE) +and only 1 stage of infection. This means that infected individuals would die +and not recover from the infection. In our model we represented it as $\gamma$ +rate. We used 1 stage of infection, because the metadata available for the +Senegal sequences did not have information that we could use to determine the +stage of HIV infection at the time the samples were collected. + +## How transmissions were modelled? + +* An infected $msm$ ($I_{msm}$) could transmit to another $msm$ with +probability $p_{msm2msm}$ +* An infected $msm$ ($I_{msm}$) could transmit to a $gpf$ with probability +$(1 - p_{msm2msm})$ +* An infected $gpf$ ($I_{gpf}$) could transmit to a $gpm$ with probability +$p_{gpf2gpm}$ +* An infected $gpf$ ($I_{gpf}$) could transmit to a $msm$ with probability +$(1 - p_{gpf2gpm})$ +* An infected $gpm$ could also transmit to a $gpf$. For this event, we used the +risk ratio of a male to transmit to a female, and fixed it to $1.02$. This is the +parameter $male_{x}$ of our model. + +See Figure \@ref(fig:figure1) for a schematic representation of the transmission +model for HIV in Senegal. In this figure $gpf$, $gpm$ and $msm$ represent the +infected individuals. + +```{r figure1, out.width = "100%", fig.align="center", fig.cap="Transmission model for HIV in Senegal. $gpf$, $gpm$ and $msm$ represent infected individuals.", echo = FALSE} +knitr::include_graphics(path = "../vignettes/images/SN_model_v2.png") +``` + +## How about HIV incidence rate? + +We also modeled the HIV incidence rate as a function of time ($t$) in $msm$ and +the $gp$ (general population) as different spline functions [@Eilers1996], that +in our ODEs are represented by $\lambda(t)$ and $\mu(t)$, respectively. + +## The $source$ compartment + +Finally, to model the HIV epidemic in Senegal, we also added an additional +compartment named "source" ($src$), that represents the rate in which HIV lineages +are imported to Senegal from other countries. We modeled this as a constant +effective population size rate with two parameters to be estimated -- $srcNe$: +the effective source population size; and the $import$ rate. Because the number +of imported HIV balances the number of exported HIV, the infected $src$ +individuals along time are not represented in the ODEs. + +## The ODEs or mathematical model equations + +Based on **Figure \@ref(fig:figure1)** and the parameters we would like to estimate, +the ordinary differential equations (ODEs) of our model are: + +$\dot{I}_{gpf} = male_x \mu(t) I_{gpm} + (1 - p_{msm2msm}) \lambda(t) I_{msm} - \gamma I_{gpf}$ + +$\dot{I}_{gpm} = p_{gpf2gpm} \mu(t) I_{gpf} - \gamma I_{gpm}$ + +$\dot{I}_{msm} = (1 - p_{gpf2gpm}) \mu(t) I_{gpf} + p_{msm2msm} \lambda(t) I_{msm} - \gamma I_{msm}$ + +For a publication, you should choose mathematical symbols such as $\phi$, or $p$, +etc to make your equations nicer. However, for simplicity and to make it easier +to understand the meaning of the parameters, we will keep the values as it is, +such as $p_{gpf2gpm}$, etc. + +## How to express the mathematical model in R? + +In our model, we are interested in HIV transmissions in the $gp$ and in the $msm$ +risk group. But also remember that we have to consider the imported HIV, which +is in the $src$ compartment. Based on that, $gpf$, $gpm$, $msm$, and $src$ are +the demes of our model, and are represented as a vector in R as: + +```{r} +demes <- c("gpm", "gpf", "msm", "src") +``` + +Because we use spline functions to determine the shape of the curve for the +transmission rates in $gp$ and $msm$, we should provide the initial `T0` and +final `T1` times for our simulations. + +```{r} +T0 <- 1978 +T1 <- 2014 +``` + +We can also go ahead and set the value for the stage of infection, that in our +model is just one stage. We assume we know this value. + +```{r} +GAMMA <- 1/10 +``` + +We should also create a list to set up a template for the parameter values of +our model. You can set this to values that you think are appropriate. Remember +that the majority of parameter values in this list will be estimated. +In R, this list of parameter values can be created as below: + +```{r} +THETA <- list( + gpsp0 = 6/10, + gpsp1 = 4/10, + gpsp2 = 1/10, + gpsploc = 1987, + msmsp0 = 4/10, + msmsp1 = 4/10, + msmsp2 = 2/10, + msmsploc = 1995, + maleX = 1.02, + import = 1/20, + srcNe = 20, + gpspline = function( t, parms ){ + if (t < T0 ) return( parms$gpsp0 ) + if (t > T1) return (parms$gpsp2) + with(parms, pmax(0.025, approx( x = c(T0, gpsploc, T1), y=c(gpsp0, gpsp1, gpsp2) , xout = t, rule = 2)$y) ) + }, + msmspline = function( t, parms){ + if (t < T0 ) return( parms$msmsp0 ) + if (t > T1) return ( parms$msmsp2 ) + with(parms, pmax(0.025, approx( x = c(T0, msmsploc, T1), y=c(msmsp0, msmsp1, msmsp2) , xout = t, rule = 2)$y) ) + }, + pmsm2msm = 0.85, + pgpf2gpm = 0.85, + initmsm = 1, + initgp = 1 +) +``` + +Note that parameters $gpsp0$, $gpsp1$, $gpsp2$, and $gpsploc$ are necessary to +estimate the flexible piecewise linear function for the $gp$ ($gpspline$ in R +or $\mu(t)$ in the ODEs). +And parameters $msmsp0$, $msmsp1$, $msmsp2$, $msmsploc$ are necessary to +estimate the flexible piecewise linear function for the $msm$ risk group +($msmspline$ in R or $\lambda(t)$ in the ODEs). + +```{r echo = FALSE} + +Parameter = c("Transmission rate gp0", + "Transmission rate gp1", + "Transmission rate gp2", + "Time interval for gp", + "Transmission rate msm0", + "Transmission rate msm1", + "Transmission rate msm2", + "Time interval for msm", + "Infectiouness ratio from male to female", + "Importation rate", + "Effective population size of src", + "Probability of infected msm to infect another msm", + "Probability of infected gpf to infect a gpm", + "Initial number of infected msm", + "Initial number of infected gp") +`Symbol in R` = c("gpsp0", + "gpsp1", + "gpsp2", + "gpsploc", + "msmsp0", + "msmsp1", + "msmsp2", + "msmsploc", + "maleX", + "import", + "srcNe", + "pmsm2msm", + "pgpf2gpm", + "initmsm", + "initgp") +`Initial values` = c("6/10", + "4/10", + "1/10", + "1987", + "4/10", + "4/10", + "2/10", + "1995", + "1.2", + "1.20", + "1.10", + "0.85", + "0.85", + "1.0", + "1.0") +data_list = data.frame(Parameter, `Symbol in R`, `Initial values`) +colnames(data_list)<-c("Parameter","Symbol in R", + "Initial values") + +kableExtra::kbl(data_list, booktabs = TRUE, + caption = "Parameter definition, values and symbols used in the R script.") %>% + kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) + +``` + + +We also need to setup some initial conditions of our model. We should set an +arbitrary large number for the $src$ population. + +```{r} +SRCSIZE <<- 1e5 +X0 <- c(gpm = unname(THETA$initgp/2), + gpf = unname(THETA$initgp/2), + msm = unname(THETA$initmsm), + src = SRCSIZE) +``` + + +### Setting up the birth, migration and death rates + +The calculation of births and migrations of our model are expressed as +4 $\times$ 4 matrices, which represent a transmission or movement from one deme +to another deme. Lineages also die at a same rate. + +First, we have to setup the components of the model. This can be done using the `setup.model.equations` function in this research compendium. + +```{r message = FALSE} +eqns <- senegalHIVmodel::setup.model.equations(demes) +attach(eqns) +``` + + +### Setting up the birth matrix + +We should set up the birth matrix to allow transmissions from one deme to +another deme as in __Figure \@ref(fig:figure1) __. + +```{r, echo=FALSE} +births['msm', 'msm'] <- "$msm \\to msm$" +births['msm', 'gpf'] <- "$msm \\to gpf$" + +births['gpm', 'gpf'] <- "$gpm \\to gpf$" +births['gpf', 'gpm'] <- "$gpf \\to gpm$" +births['gpf', 'msm'] <- "$gpf \\to msm$" + +# f = (1/2)*(Y^2)/Ne +births['src', 'src'] <- "$src \\to src$" + +kableExtra::kbl(births, booktabs = TRUE, escape = FALSE, + caption = "Illustration of allowed transmissions between demes.") %>% + kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) + +``` + + +To set up these transmissions between demes, each element in the matrix is a +string that will be passed as R code. + +```{r message = FALSE} +births['msm', 'msm'] <- "parms$msmspline(t, parms) * msm * parms$pmsm2msm" +births['msm', 'gpf'] <- "parms$msmspline(t, parms) * msm * (1-parms$pmsm2msm)" + +births['gpm', 'gpf'] <- "parms$gpspline(t, parms) * gpm * parms$maleX" +births['gpf', 'gpm'] <- "parms$gpspline(t, parms) * gpf * parms$pgpf2gpm" +births['gpf', 'msm'] <- "parms$gpspline(t, parms) * gpf * (1-parms$pgpf2gpm)" + +# f = (1/2)*(Y^2)/Ne +births['src', 'src'] <- "0.5 * SRCSIZE^2 / parms$srcNe" +``` + + +### Setting up the migration matrix + +Similar to the birth matrix, we also allow migrations from $gpf$, $gpm$, or $msm$ +to the $src$; and from $src$ to $gpf$, $gpm$, or $msm$. This is modelled as +a constant effective population size. + +```{r echo=FALSE} +migs['src', 'gpm'] <- "$src \\to gpm$" +migs['src', 'gpf'] <- "$src \\to gpf$" +migs['src', 'msm'] <- "$src \\to msm$" + +migs['gpm', 'src'] <- "$gpm \\to src$" +migs['gpf', 'src'] <- "$gpf \\to src$" +migs['msm', 'src'] <- "$msm \\to src$" + +kableExtra::kbl(migs, booktabs = TRUE, escape = FALSE, + caption = "Illustration of allowed migrations between demes.") %>% + kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) + +``` + + +We also set up migrations between demes where each element in the matrix is a +string that will be passed as R code. + +```{r} +migs['src', 'gpm'] <- "parms$import * gpm" +migs['src', 'gpf'] <- "parms$import * gpf" +migs['src', 'msm'] <- "parms$import * msm" + +migs['gpm', 'src'] <- "parms$import * gpm" +migs['gpf', 'src'] <- "parms$import * gpf" +migs['msm', 'src'] <- "parms$import * msm" +``` + +### Setting up the vector for the death rates + +Similarly to the birth and migration matrices, we also set up death rates where +each element is a string that will be passed as R code. + +```{r} +deaths['msm'] <- "GAMMA * msm" +deaths['gpf'] <- "GAMMA * gpf" +deaths['gpm'] <- "GAMMA * gpm" +deaths['src'] <- "0.5 * SRCSIZE^2 / parms$srcNe" +``` + +### The demographic model + +After setting up all the components of the mathematical model, we can build the +demographic process using the function `build.demographic.process` from the +`phydynR` package. The `dm` output can be used as input to coalescent models +for the calculation of the likelihood, when fitting the model using a +Markov chain Monte Carlo (MCMC), for example. + +```{r} +dm <- build.demographic.process(births = births, + deaths = deaths, + migrations = migs, + parameterNames = names(THETA), + rcpp = FALSE, + sde = FALSE) +``` + +For more information on the the input data for the `build.demographic.process` +function, you should see its R documentation using the command +`?phydynR::build.demographic.process`. + +## Load additional data + +After setting up all the equations of the model in a way that R can understand, +the model can now be fitted to a binary and dated phylogenetic tree. In our +specific case, we estimated a maximum likelihood trees using +[RAxML-NG](https://github.com/amkozlov/raxml-ng) for each subtype, as we analyzed +HIV-1 subtypes B, C and 02_AG. + +Each tree had a relaxed clock fitted using the R package [treedater](https://github.com/emvolz/treedater). After that, the trees were merged +into a single tree using the R script [_merge_trees.R_](https://github.com/thednainus/senegalHIVmodel/blob/master/analyses/scripts/R-scripts/Processing/merge_trees.R). The final tree did not contain sequences from children, +and from which risk group or sex was NA (not available). + +We load the binary dated tree in R using the following: + +```{r} +tree.all <- read.tree( + system.file("data/bindTree_CGR_GTR+Gp12+3_droppedTip.tre", + package = "senegalHIVmodel")) +``` + + +After, we should load all the information for the metadata. This aims to create +a discrete-trait data for all tips in the phylogenetic tree. Remember that our discrete-trait data in this model are the _gpf_, _gpm_, _msm_, and _src_. + +First we need to read in R all metadata as below: + +```{r} +# Metadata that will be used in our model for the CGR (close global reference) +# sequences and for sequences from Senegal. +# CGRs are referred in the mathematical model as src (source) data +# src are HIV sequences that are from other countries and not from Senegal + +# the 1st column is the sequence names +# the 2nd column is the state (gpf, gpm, msm, or src) of each sequence +all_data <- readRDS(system.file("data/states.RDS", package = "senegalHIVmodel")) +``` + + +After organizing all metadata, we further organize it in a way that the R package `phydynR` will understand it. +For that, we create a matrix to receive the information on states (discrete-traits) for each tip of the tree + + +```{r} +gpm <- gpf <- msm <- src <- rep(0, length(tree.all$tip.label)) + +# Adds 1 to where states matches "gpm", and so on. +gpm[all_data$States == "gpm"] <- 1 +gpf[all_data$States == "gpf"] <- 1 +msm[all_data$States == "msm"] <- 1 +src[all_data$States == "src"] <- 1 + +sampleStates <- cbind(gpm, gpf, msm, src) +rownames(sampleStates) <- all_data$tip.name +``` + +Now, we need to read the estimated times (in calendar units) for each sequence +in the phylogenetic tree + +```{r} +times <- readRDS( + system.file("data/bindTree_CGR_GTR+Gp12+3_droppedTip_sts.RDS", + package = "senegalHIVmodel")) +``` + + +Finally, we can create an an object of `DatedTree` [phydynR package]. This is +the tree that should be used in the calculation of the likelihood to estimate +parameter values using `phydynR` + +```{r} +dated.tree <- phydynR::DatedTree(phylo = tree.all, + sampleTimes = times, + sampleStates = sampleStates, + minEdgeLength = 2/52, + tol = 0.1) +``` + +# Calculation of the likelihood + +After setting up the mathematical model and the data, we have the following: + +* `dated.tree` = a phylogenetic tree of class DatedTree +* `THETA` = template for parameter values +* `dm` = the demographic process +* `X0` = initial conditions + +The above objects will be used in the calculation of the likelihood using the +`phydynR::colik` function. + +```{r, eval=FALSE} +phydynR::colik(tree = dated.tree, + theta = THETA, + demographic.process.model = dm, + x0 = X0, + t0 = 1978, + res = 1e3, + timeOfOriginBoundaryCondition = FALSE, + AgtY_penalty = 1, + maxHeight = 35) +``` + +Note that for the calculation of the likelihood you can provide a value for +maximum height `maxHeight`. This parameter "tells" the function up to which point +back in time the calculation of the likelihood should be done. If computer +resources are not a problem, you can leave this as the default. Using the +whole tree for the calculation of the likelihood can make it slower than +when using just a portion back in time. + + +For the Senegal data, we used a `maxHeight` = 35. This means that it will go back +in time in the tree at approximately 1979. This was merely chosen to put the +HIV epidemics in Senegal around this time, for estimation of the parameters in +our model. + +# Estimation of epidemiological parameters + +Now that we are happy with everything, we are ready to estimate the parameters of +our model. For the Senegal HIV model, we chose to estimate the parameters using a +Markov chain Monte Carlo (MCMC) as implemented in the R package [BayesianTools]("https://github.com/florianhartig/BayesianTools"). + +## Which parameters to estimate? + +For this example, we decided to estimate the following parameters: + +**Parameters for estimating the spline function for the _gp_:** + +* _gpsp0_ +* _gpsp1_ +* _gpsp2_ +* _gpsploc_ + +**Parameters for estimating the spline function for the _msm_:** + +* _msmsp0_ +* _msmsp1_ +* _msmsp2_ +* _msmsploc_ + +**Parameters that controls the _src_:** + +* _import_ +* _srcNe_ + +**Probability of certain events to occur:** + +* _pmsm2msm_ +* _pgpf2gpm_ +* _maleX_ + +**Parameters to estimate the initial population size:** + +* _initmsm_ (initial number of individuals in the _msm_ population) +* _initgp_ (initial number of individuals in the _gp_ population) + +## Estimating the parameters + +To estimate these parameters we should set up an object function. This object +function will receive the proposals of the MCMC. The reason of using an object +function is to make it easier to change the values of the parameters to be estimated +in THETA (our parameter template as explained before). Note that not all +parameters listed in THETA will be estimated; some will stay fixed to the +predifined values. + +```{r} +obj_fun <- function(parameters){ + # we use unname here because "parameters" can be as vectors or matrix, and + # sometimes it comes with column names, which I chose to remove these column names + # in here. + parameters <- unname(parameters) + + # add the values of THETA to a new variable named THETA.new + THETA.new <- THETA + + # change the values in THETA.new to the new proposals that will be evaluated + THETA.new$gpsp0 <- parameters[1] + THETA.new$gpsp1 <- parameters[2] + THETA.new$gpsp2 <- parameters[3] + THETA.new$gpsploc <- parameters[4] + THETA.new$msmsp0 <- parameters[5] + THETA.new$msmsp1 <- parameters[6] + THETA.new$msmsp2 <- parameters[7] + THETA.new$msmsploc <- parameters[8] + THETA.new$maleX <- parameters[9] + THETA.new$import <- parameters[10] + THETA.new$srcNe <- parameters[11] + THETA.new$pmsm2msm <- parameters[12] + THETA.new$pgpf2gpm <- parameters[13] + THETA.new$initmsm <- parameters[14] + THETA.new$initgp <- parameters[15] + + # X0 is the initial conditions for the 4 demes (gpf, gpm, msm, src) + X0 <- c(gpm = unname(THETA.new$initgp/2), + gpf = unname(THETA.new$initgp/2), + msm = unname(THETA.new$initmsm) , + src = 1e5) + + # After changing the parameter values to the new proposals, a likelihood is + # calculated with the function phydynR::colik. + # Note that this function uses several global variables, such as, dated.tree, + # dm, and X0 + mll <- colik(tree = dated.tree, + theta = THETA.new, + demographic.process.model = dm, + x0 = X0, + t0 = 1978, + res = 1e3, #TODO + timeOfOriginBoundaryCondition = FALSE, + AgtY_penalty = 1, + maxHeight = 35) + + return(mll) + +} +``` + +We can now estimate these parameters using the MCMC. For that we need to decide +the priors -- our best knowledge on the parameter value before the data is +analysed -- on each parameter that will be estimated. + +Because we are using the `BayesianTools` R package, we need to specify the density function (that represents our prior) for each parameter as below: + +```{r} +# Specify a density function to be used in the +# prior specification (see below) +densities <- function(par){ + # d1 to d3 and d5 to d7 I am using a lognormal distribution with mean = R0 = 1.1 and sigma = 1 + # d4 and d8 uniform distribution between the start time and the most recent sample + # d10 exponential distribution with mean around 30 + # d11 exponential distribution with mean around 1/100 + d1 = dgamma(par[1], shape = 3, rate = 3/0.1, log = TRUE) #gpsp0 + d2 = dgamma(par[2], shape = 3, rate = 3/0.1, log = TRUE) #gpsp1 + d3 = dgamma(par[3], shape = 3, rate = 3/0.1, log = TRUE) #gpsp2 + d4 = dunif(par[4], min = 1978, max = 2014, log = TRUE) #gpsploc + d5 = dgamma(par[5], shape = 3, rate = 3/0.1, log = TRUE) #msmsp0 + d6 = dgamma(par[6], shape = 3, rate = 3/0.1, log = TRUE) #msmsp1 + d7 = dgamma(par[7], shape = 3, rate = 3/0.1, log = TRUE) #msmsp2 + d8 = dunif(par[8], min = 1978, max = 2014, log = TRUE) #msmsploc + d9 = dunif(par[9], min = 0.5, max = 2.0, log = TRUE) #maleX + d10 = dexp(par[10], rate = 30, log = TRUE) #import + d11 = dexp(par[11], rate = 1/100, log = TRUE) #srcNe + d12 = dbeta(par[12], shape1 = 16, shape2 = 4, log = TRUE) #pmsm2msm + d13 = dbeta(par[13], shape1 = 16, shape2 = 4, log = TRUE) #pgpf2gpm + d14 = dexp(par[14], rate = 1/3, log = TRUE) #initmsm + d15 = dexp(par[15], rate = 1/3, log = TRUE) #initgp + + return(d1 + d2 + d3 + d4 + d5 + d6 + d7 + d8 + d9 + d10 + d11 + d12 + d13 + d14 + d15) +} +``` + + +Now, we can provide a sampler, which is optional, as described in the `BayesianTools` package, as below: + +```{r} +# Create sampling, this is optional. +# However, if a sampler function is not provided, the user will have to provide +# explicit starting values for for many of the MCMC sampler +sampler <- function(n=1){ + d1 = rgamma(n, shape = 4, rate = 4/0.6) #gpsp0 + d2 = rgamma(n, shape = 4, rate = 4/0.4) #gpsp1 + d3 = rgamma(n, shape = 4, rate = 4/0.1) #gpsp2 + d4 = runif(n, min = 1985, max = 2000) #gpsploc + d5 = rgamma(n, shape = 4, rate = 4/0.4) #msmsp0 + d6 = rgamma(n, shape = 4, rate = 4/0.4) #msmsp1 + d7 = rgamma(n, shape = 4, rate = 4/0.2) #msmsp2 + d8 = runif(n, min = 1985, max = 2005) #msmsploc + d9 = runif(n, min = 0.5, max = 2.0) #maleX + d10 = runif(n, 1/40, 1/5) #import + d11 = runif(n, 5, 1000) #srcNe + d12 = rbeta(n, shape1 = 16, shape2 = 4) #pmsm2msm + d13 = rbeta(n, shape1 = 16, shape2 = 4) #pgpf2gpm + d14 = runif(n, 1, 3) #initmsm + d15 = runif(n, 1, 3) #initgp + + return(cbind(d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, d15)) +} + +``` + +After setting up the densities and sampler functions we can now set up our prior +by using the following: + +```{r} +# Create prior (necessary for the BayesianTools package) +prior <- createPrior(density = densities, + sampler = sampler, + lower = c(0.05, 0.05, 0.05, 1978, + 0.05, 0.05, 0.05, 1978, + 0.5, 0, 1, 0, 0, 1, 1), + upper = c(1, 1, 1, 2014, + 1, 1, 1, 2014, + 2, 0.30, 5000, 1, + 1, 300, 300)) +``` + +We are now ready to estimate the parameter values using MCMC. We chose the +DEzs sampler. DEzs stands for Differential-Evolution MCMC zs and this is beyond +what will cover in this vignette. However, if you can check Ter Braak and Vrugt [-@TerBraak2008] for more information. + +First, we run a MCMC as below for ca. 10,000 iterations. After checking the traces and decided they are _ok_, we can use this initial run to provided a z-matrix as explained [here](https://github.com/florianhartig/BayesianTools/issues/79). + +Note that it beyond the scope of this vignette to explain about Markov chain +Monte Carlo (MCMC), but you can read more in this other [tutorial](https://thednainus.wordpress.com/2017/03/08/part1-introduction/). + +Note that that the iterations parameter will take a long time to run. +```{r eval = FALSE} + +settings = list(iterations = 10000, nrChains = 1, thin = 1) +# Create bayesianSetup +bayesianSetup <- createBayesianSetup(likelihood = obj_fun , prior = prior) +# you do not need to run this part as it will take a while to run +# run the MCMC +#out <- BayesianTools::runMCMC(bayesianSetup = bayesianSetup, +# sampler = "DEzs", +# settings = settings) +``` + + +Now we can use the results obtained with the previous run and create another +run but this time providing a Zmatrix as below: + +```{r eval = FALSE} +# Get an ok sample (the run above is not good, however it can provide a good Z matrix) +# For more information on this: https://github.com/florianhartig/BayesianTools/issues/79 + +#read previous run +out <- readRDS(system.file("data/MCMC_example/out_results.RDS", + package = "senegalHIVmodel")) +x <- BayesianTools::getSample(out, start = 2000) + +# Get the range for the parameter estimates for the previous run +rangePost = apply(x, 2, range) + +#get unique values of x +u_x <- unique(x) + +#cretae new Z matrix based on previous run +# now I am estimating 15 parameters (hence ncol=15) +newZ = matrix(runif(1800, rangePost[1,], rangePost[2,]), ncol = 15, byrow = T) + + +# Because I will run several analysis in parallel, and to avoid the initial values to be identical +# I will provide as argument position 1 (pos1), position 2 (pos2), and position 3 (pos3) +# from the unique values of x (u_x) +pos1 = 72 +pos2 = 73 +pos3 = 74 +iter = 80000 # number of iterations +settings = list(Z = newZ, + startValue = u_x[c(pos1, pos2, pos3), ], + nrChains = 1, + iterations = iter, + thin = 1) + + +# Create bayesianSetup +# Here we will take advantage of the parallel option. +#bayesianSetup <- createBayesianSetup(likelihood = obj_fun, +# prior = prior, +# parallel = 3) + +#Note that this will take a while to run as well +#outZ <- runMCMC(bayesianSetup = bayesianSetup, sampler = "DEzs", settings = settings ) +#stopParallel(bayesianSetup) +``` + +The MCMC will take a while to run, so we will instead load the results: + +```{r} + +outZ <- readRDS(system.file("data/MCMC_example/outZ_results.RDS", + package = "senegalHIVmodel")) + + +``` + +We can plot the traces: + +```{r fig.asp = 0.8, fig.width = 6.5} + +#par("mar") +par(mar=c(1,5,5,1)) + +#here we will plot just the 4 initial parameters for the linear function +#representing the transmission rates for the general population +#save than upload figure +plot(outZ, whichParameters = c(1:4)) + +``` + +We can remove the first initial 5,000 iter as burnin and get the ESS (effective +sample size). You can see that ESS values a greater than 200 for all parameters. + +```{r} + +out_sample <- BayesianTools::getSample(outZ, start = 5000, coda = TRUE) + +# ESS value for all parameters +coda::effectiveSize(out_sample) + +``` + +## Estimating effective number of infections + +We can now estimate and plot the effective number of infections through time for +$gpf$, $gpm$ and $msm$. +For that, we will sample 100 combination of parameter values from the +posterior distribution. For the paper we published, we sampled 1000 combination +of parameter values. +Then for each combination of parameter values from the posterior distributions, +we will solve the demographic process _dm_ function that we created in +the beginning to this tutorial. +We will use the function _post_traj_mx_ from the R package `senegalHIVmodel`. If +you have a look at this function, you will see that we are solving dm for +each combination of parameter value from the posterior distribution. + + +```{r eval = FALSE} + out_sample <- BayesianTools::getSample(outZ, start = 5000, coda = FALSE) + + # sample n combination of parameters from the posterior + # here we will use only a size = 10 samples to make it faster to run + # in the paper, we used size = 1,000 samples from the posterior probability + run.n <- out_sample[sample(nrow(out_sample), size = 10, replace = FALSE), ] + + # solve the demographic model for 100 combination of parameter values + # that was sampled from the posterior probability + run_o.n <- do.call("cbind", apply(run.n, 1, senegalHIVmodel::post_traj_mx, THETA)) +``` + +We can now load the solved demographic process for 100 samples for the +posterior distribution. Note that in the paper, we used 1,000 samples from the +posterior distribution. + +```{r} +run_o.n <- readRDS(system.file("data/solved_dm.RDS", package = "senegalHIVmodel")) +``` + +Now, we can solve the demographic process function for the maximum a posteriori +(MAP). + +```{r} + + #then we will fo the same for the MAP (maximum a posteriori value) + run_map <- BayesianTools::MAP(outZ, start = 5000)$parametersMAP + + # solve the demographic model for MAP + run_map_o <- senegalHIVmodel::post_traj_mx(parameters = run_map, THETA = THETA) + +``` + +After solving the demographic model for each combination of parameter value, +we can use the function _senegalHIVmodel::df_sizes_prop_ to get the proportion +of $gpm$, $gpf$ and $msm$ for the effective population size, which is the 4th +element from the solved demographic process. +And then we can plot the proportion of $gpf$, $gpm$ and $msm$. + + +```{r fig.asp = 0.8, fig.width = 7} +# Note that the element 4 in the sizes in the solved demographic function +eff_sizes <- senegalHIVmodel::df_sizes_prop(sizes.p = run_o.n[4,], + sizes.map = run_map_o[4], + times = run_o.n[[1]], + Nrep = 100, + Ntime = 1000) + +ggplot(data = eff_sizes, aes(x = times)) + + geom_ribbon(aes(ymin = lower, ymax = upper, fill = group), alpha = 0.70) + + geom_line(aes(y = median, colour = group, linetype = "solid")) + + geom_line(aes(y = MAP, colour = group, linetype = "dashed")) + + facet_wrap( ~ group) + + ggtitle("All subtypes") + + ylab("Effective number of infections (proportion)") + + xlab("Time (years)") + + scale_fill_brewer() + scale_colour_brewer() + theme_bw() + + theme(legend.position="bottom", text = element_text(size = 14)) + + scale_linetype_manual(labels = c('MAP', "Median"), values = c("dashed", "solid")) +``` + + + +## Estimating the reproduction number ($R_0$) + + +We can calculate the $(R_0)$ for this model using the equations $\mu(t)/\gamma$ +for $gp$ and $\lambda(t)/\gamma$ for $msm$. +Note that $\mu(t)$ and $\lambda(t)$ are the transmission rates through time +for $gp$ and $msm$, respectively, and were estimated with MCMC. + +We can then use the function `senegalHIVmodel::df_r0`. This function will +basically estimate transmission rate for each group ($gp$ or $msm$) by solving +the function $\mu(t)$ and $\lambda(t)$ as described in THETA (above). +Then it estimate the $(R_0)$ as described above. + +```{r} +# These are the values used for the simulations +# Initial time for simulations +T0 <- 1978 +# Final time for simulations +T1 <- 2014 +# Duration of infection. In our model we assumed 1 stage of infection +GAMMA <- 1/10 + +times <- seq(1978, 2014, length.out = 1000) + +all_subtypes <- df_r0(run = outZ, + burnin = 5000, + par_names = c("gpsp0", "gpsp1", "gpsp2", "gpsploc", + "msmsp0", "msmsp1", "msmsp2", "msmsploc", + "maleX", "import", "srcNe", + "pmsm2msm", "pgpf2gpm", + "initmsm", "initgp"), + times = times, + T0 = T0, + T1 = T1, + GAMMA = GAMMA) + +``` + +We can then look at the median and credible interval for the $R_0$ for $gp$ and +$msm$ in 2014. + +```{r} + +#for gp +all_subtypes[1000,] + +#for msm +all_subtypes[2000,] +``` + +Note that the values obtained here might not match exactly as described in the +paper. This is because we tested many different models and run the MCMC for longer. + +# References