+
Example analysis of microbiome data
+
This analysis is the same as that presented in the fido
manuscript (Justin D. Silverman et al.
2019). I will reanalyze a previously published study comparing
@@ -189,113 +190,101 @@
Example analysis of microbiome data
2014). To do this I will fit a pibble model using CD status,
inflammation status and age as covariates (plus a constant intercept
term).
-
library(MicrobeDS)
-library(phyloseq)
-library(dplyr)
-library(fido)
-
-set.seed(899)
-
-data("RISK_CCFA")
-# drop low abundant taxa and samples
-dat <- RISK_CCFA %>%
- subset_samples(disease_stat!="missing",
- immunosup!="missing") %>%
- subset_samples(diagnosis %in% c("no", "CD")) %>%
- subset_samples(steroids=="false") %>%
- subset_samples(antibiotics=="false") %>%
- subset_samples(biologics=="false") %>%
- subset_samples(biopsy_location=="Terminal ileum") %>%
- tax_glom("Family") %>%
- prune_samples(sample_sums(.) >= 5000,.) %>%
- filter_taxa(function(x) sum(x > 3) > 0.10*length(x), TRUE)
+
Create Design Matrix and OTU Table
-
sample_dat <- as.data.frame(as(sample_data(dat),"matrix")) %>%
- mutate(age = as.numeric(as.character(age)),
- diagnosis = relevel(factor(diagnosis, ordered = FALSE), ref="no"),
- disease_stat = relevel(factor(disease_stat, ordered = FALSE), ref="non-inflamed"))
-X <- t(model.matrix(~diagnosis + disease_stat+age, data=sample_dat))
-Y <- otu_table(dat)
-
-# Investigate X and Y look like
-X[,1:5]
-#> 1939.SKBTI.0175 1939.SKBTI047 1939.SKBTI051 1939.SKBTI063
-#> (Intercept) 1.00000 1.00000 1.00 1.00000
-#> diagnosisCD 1.00000 1.00000 1.00 1.00000
-#> disease_statinflamed 0.00000 1.00000 1.00 1.00000
-#> age 15.16667 14.33333 15.75 13.58333
-#> 1939.SKBTI072
-#> (Intercept) 1.00
-#> diagnosisCD 1.00
-#> disease_statinflamed 1.00
-#> age 15.75
-Y[1:5,1:5]
-#> OTU Table: [5 taxa and 5 samples]
-#> taxa are rows
-#> 1939.SKBTI.0175 1939.SKBTI047 1939.SKBTI051 1939.SKBTI063 1939.SKBTI072
-#> 4442127 0 9 0 14 2
-#> 74305 1 2 35 1 0
-#> 663573 36 1 0 2 1
-#> 2685602 10 264 211 276 83
-#> 4339819 0 37 42 70 22
+
+sample_dat <- as.data.frame(as(sample_data(dat),"matrix")) %>%
+ mutate(age = as.numeric(as.character(age)),
+ diagnosis = relevel(factor(diagnosis, ordered = FALSE), ref="no"),
+ disease_stat = relevel(factor(disease_stat, ordered = FALSE), ref="non-inflamed"))
+X <- t(model.matrix(~diagnosis + disease_stat+age, data=sample_dat))
+Y <- otu_table(dat)
+
+# Investigate X and Y look like
+X[,1:5]
+#> 1939.SKBTI.0175 1939.SKBTI047 1939.SKBTI051 1939.SKBTI063
+#> (Intercept) 1.00000 1.00000 1.00 1.00000
+#> diagnosisCD 1.00000 1.00000 1.00 1.00000
+#> disease_statinflamed 0.00000 1.00000 1.00 1.00000
+#> age 15.16667 14.33333 15.75 13.58333
+#> 1939.SKBTI072
+#> (Intercept) 1.00
+#> diagnosisCD 1.00
+#> disease_statinflamed 1.00
+#> age 15.75
+Y[1:5,1:5]
+#> OTU Table: [5 taxa and 5 samples]
+#> taxa are rows
+#> 1939.SKBTI.0175 1939.SKBTI047 1939.SKBTI051 1939.SKBTI063 1939.SKBTI072
+#> 4442127 0 9 0 14 2
+#> 74305 1 2 35 1 0
+#> 663573 36 1 0 2 1
+#> 2685602 10 264 211 276 83
+#> 4339819 0 37 42 70 22
Next specify priors. We are going to start by specifying a prior on
-the covariance between log-ratios \(\Sigma\). I like to do this by thinking
+the covariance between log-ratios \(\Sigma\). I like to do this by thinking
about a prior on the covariance between taxa on the log-scale (i.e.,
between the log of their absolute abundances not the log-ratios). I will
-refer to this covariance on log-absolute abundances \(\Omega\). For example, here I will build a
-prior that states that the mean of \(\Omega\) is the identity matrix \(I_D\). From From Aitchison (1986), we know that if we assume that
+refer to this covariance on log-absolute abundances \(\Omega\). For example, here I will build a
+prior that states that the mean of \(\Omega\) is the identity matrix \(I_D\). From From Aitchison (1986), we know that if we assume that
the taxa have a covariance \(\Omega\)
-in terms of log-absolute abundance then their correlation in the \(\text{ALR}_D\) is given by \[ \Sigma = G \Omega G^T \] where \(G\) is a \(D-1
+in terms of log-absolute abundance then their correlation in the \(\text{ALR}_D\) is given by \[ \Sigma = G \Omega G^T \] where \(G\) is a \(D-1
\times D\) matrix given by \(G =
-[I_{D-1}; -1_{D-1}]\) (i.e., \(G\) is the \(\text{ALR}_D\) contrast matrix).
-Additionally, we know that the Inverse Wishart mode is given by \(\frac{\Xi}{\upsilon + D}\). Finally, note
+[I_{D-1}; -1_{D-1}]\) (i.e., \(G\) is the \(\text{ALR}_D\) contrast matrix).
+Additionally, we know that the Inverse Wishart mode is given by \(\frac{\Xi}{\upsilon + D}\). Finally, note
that \(\upsilon\) essentially controls
our uncertainty in \(\Sigma\) about
this prior mean. Here I will take \(\upsilon =
D+3\). This then gives us \(\Xi =
-(\upsilon - D) GIG^T\). We scale \(\Xi\) by a factor of 1/2 to make \(Tr(\Xi)=D-1\).
-
upsilon <- ntaxa(dat)+3
-Omega <- diag(ntaxa(dat))
-G <- cbind(diag(ntaxa(dat)-1), -1)
-Xi <- (upsilon-ntaxa(dat))*G%*%Omega%*%t(G)
-
Finally I specify my priors for \(\Theta\) (mean of \(\Lambda\)) and \(\Gamma\) (covariance between columns of
+(\upsilon - D) GIG^T\). We scale \(\Xi\) by a factor of 1/2 to make \(Tr(\Xi)=D-1\).
+
+
Finally I specify my priors for \(\Theta\) (mean of \(\Lambda\)) and \(\Gamma\) (covariance between columns of
\(\Lambda\); i.e., covariance between
-the covariates). I will center my prior for \(\Lambda\) about zero, and assume that the
+the covariates). I will center my prior for \(\Lambda\) about zero, and assume that the
covariates are independent.
-
Theta <- matrix(0, ntaxa(dat)-1, nrow(X))
-Gamma <- diag(nrow(X))
+
I strongly recommend users perform prior predictive checks to make
sure their priors make sense to them. fido makes this easy, all
the main fitting functions (e.g., pibble
) will
automatically sample from the prior predictive distribution if
Y
is left as NULL
(e.g., without data your
posterior is just your prior).
-
priors <- pibble(NULL, X, upsilon, Theta, Gamma, Xi)
-print(priors)
-#> pibblefit Object (Priors Only):
-#> Number of Samples: 250
-#> Number of Categories: 49
-#> Number of Covariates: 4
-#> Number of Posterior Samples: 2000
-#> Contains Samples of Parameters:Eta Lambda Sigma
-#> Coordinate System: alr, reference category: 49
+
+priors <- pibble(NULL, X, upsilon, Theta, Gamma, Xi)
+print(priors)
+#> pibblefit Object (Priors Only):
+#> Number of Samples: 250
+#> Number of Categories: 49
+#> Number of Covariates: 4
+#> Number of Posterior Samples: 2000
+#> Contains Samples of Parameters:Eta Lambda Sigma
+#> Coordinate System: alr, reference category: 49
The main fitting functions in the fido package output
special fit objects (e.g., pibble
outputs an object of
class pibblefit
). These fit objects are just lists with
@@ -307,9 +296,7 @@
Example analysis of microbiome data
simply the ALR coordinate system where the last category (49 above) is
taken as reference (this will be generalized in future versions). More
specifically for a vector
\(x\)
-representing the proportions of categories
\(\{1, \dots, D\}\) we can write
\[x^* = \left( \log \frac{x_1}{x_D}, \dots, \log
+representing the proportions of categories \(\{1, \dots, D\}\) we can write \[x^* = \left( \log \frac{x_1}{x_D}, \dots, \log
\frac{x_{D-1}}{x_D}\right).\] As mentioned above however, I have
designed fido to work with many different coordinate systems
including the ALR (with respect to any category), CLR, ILR, or
@@ -324,25 +311,27 @@ Example analysis of microbiome data
suppressed when pibblefit
objects are in the proportions
coordinate system. As an example, lets look at viewing a summary of the
prior for \(\Lambda\) with respect to
-the CLR coordinate system.
-priors <- to_clr(priors)
-summary(priors, pars="Lambda", gather_prob=TRUE, as_factor=TRUE, use_names=TRUE)
-#> $Lambda
-#> # A tibble: 784 x 9
-#> Parameter coord covariate val .lower .upper .width .point .interval
-#> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
-#> 1 Lambda 1 1 0.0509 -0.528 0.596 0.5 mean qi
-#> 2 Lambda 1 2 0.00199 -0.567 0.575 0.5 mean qi
-#> 3 Lambda 1 3 -0.0373 -0.620 0.532 0.5 mean qi
-#> 4 Lambda 1 4 -0.00205 -0.554 0.549 0.5 mean qi
-#> 5 Lambda 2 1 0.00991 -0.564 0.535 0.5 mean qi
-#> 6 Lambda 2 2 -0.00000782 -0.572 0.580 0.5 mean qi
-#> 7 Lambda 2 3 -0.0247 -0.570 0.518 0.5 mean qi
-#> 8 Lambda 2 4 0.0165 -0.536 0.589 0.5 mean qi
-#> 9 Lambda 3 1 0.0138 -0.584 0.620 0.5 mean qi
-#> 10 Lambda 3 2 -0.00502 -0.542 0.581 0.5 mean qi
-#> # ... with 774 more rows
+the CLR coordinate system.
+
+priors <- to_clr(priors)
+summary(priors, pars="Lambda", gather_prob=TRUE, as_factor=TRUE, use_names=TRUE)
+#> $Lambda
+#> # A tibble: 784 x 9
+#> Parameter coord covariate val .lower .upper .width .point .interval
+#> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
+#> 1 Lambda 1 1 0.0509 -0.528 0.596 0.5 mean qi
+#> 2 Lambda 1 2 0.00199 -0.567 0.575 0.5 mean qi
+#> 3 Lambda 1 3 -0.0373 -0.620 0.532 0.5 mean qi
+#> 4 Lambda 1 4 -0.00205 -0.554 0.549 0.5 mean qi
+#> 5 Lambda 2 1 0.00991 -0.564 0.535 0.5 mean qi
+#> 6 Lambda 2 2 -0.00000782 -0.572 0.580 0.5 mean qi
+#> 7 Lambda 2 3 -0.0247 -0.570 0.518 0.5 mean qi
+#> 8 Lambda 2 4 0.0165 -0.536 0.589 0.5 mean qi
+#> 9 Lambda 3 1 0.0138 -0.584 0.620 0.5 mean qi
+#> 10 Lambda 3 2 -0.00502 -0.542 0.581 0.5 mean qi
+#> # ... with 774 more rows
By default the summary
function returns a list (with
possible elements Lambda
, Sigma
, and
Eta
) summarizing each posterior parameter based on
@@ -355,40 +344,46 @@
Example analysis of microbiome data
of the names
functions for pibblefit
objects
to provide some more specific names for the covariates (helpful when we
then plot).
-names_covariates(priors) <- rownames(X)
-p <- plot(priors, par="Lambda")
-#> Scale for 'colour' is already present. Adding another scale for 'colour', which will
-#> replace the existing scale.
-p + ggplot2::xlim(c(-10, 10))
-
+
+names_covariates(priors) <- rownames(X)
+p <- plot(priors, par="Lambda")
+#> Scale for 'colour' is already present. Adding another scale for 'colour', which will
+#> replace the existing scale.
+p + ggplot2::xlim(c(-10, 10))
+
This looks fairly reasonable to me. So I am going to go ahead and fit
the model with data. fido
provides a helper method called
refit
that we will use to avoid passing prior parameters
again.
-priors$Y <- Y # remember pibblefit objects are just lists
-posterior <- refit(priors, optim_method="lbfgs")
+
+priors$Y <- Y # remember pibblefit objects are just lists
+posterior <- refit(priors, optim_method="lbfgs")
Unlike the main pibble function, the refit
method can be called on objects in any coordinate system and all
transformations to and from the default coordinate system are handled
-internally. This is one nice thing about using the
+internally. This is one nice thing about using the
refit
method. That said, new objects added to the
pibblefit
object need to be added in the proper coordinates
-For example, if we wanted to replace our prior for \(\Xi\) for an object in CLR coordinates, we
+For example, if we wanted to replace our prior for \(\Xi\) for an object in CLR coordinates, we
would had to transform our prior for Xi
to CLR coordinates
before adding it to the priors
object.
Now I are also going to add in the taxa names to make it easier to
interpret the results.
-tax <- tax_table(dat)[,c("Class", "Family")]
-tax <- apply(tax, 1, paste, collapse="_")
-names_categories(posterior) <- tax
+
Before doing anything else lets look at the posterior predictive
distribution to assess model fit. This can be accessed through the
-method ppc
.
-ppc(posterior) + ggplot2::coord_cartesian(ylim=c(0, 30000))
-
+method ppc
.
+
+
There are a few things to note about this plot. First, when zoomed
out like this it looks it is hard to make much of it. This is a fairly
large dataset we are analyzing and its hard to view an uncertainty
@@ -396,8 +391,9 @@
Example analysis of microbiome data
interval in grey and black and the observed counts in green.
fido also has a simpler function that summarizes the posterior
predictive check.
-ppc_summary(posterior)
-#> Proportions of Observations within 95% Credible Interval: 0.9898776
+
+ppc_summary(posterior)
+#> Proportions of Observations within 95% Credible Interval: 0.9898776
Here we see that the model appears to be fitting well (at least based
on the posterior predictive check) and that only about 1.5% of
observations fall outside of the 95% posterior predictive density (this
@@ -407,77 +403,79 @@
Example analysis of microbiome data
using ppc
. One is to predict the counts based on the
samples of \(\eta\) (Eta; as we did
above); the other is to predict “from scratch” that is to predict
-starting form the posterior samples of \(\Lambda\) (Lambda) then sampling \(\eta\) and only then sampling \(Y\). This later functionality can be
+starting form the posterior samples of \(\Lambda\) (Lambda) then sampling \(\eta\) and only then sampling \(Y\). This later functionality can be
accessed by also passing the parameters from_scratch=TRUE
to the ppc
function. Note: these two posterior predictive
checks have different meanings, one is not better than the other.
-ppc(posterior, from_scratch=TRUE) +ggplot2::coord_cartesian(ylim=c(0, 30000))
-
-ppc_summary(posterior, from_scratch=TRUE)
-#> Proportions of Observations within 95% Credible Interval: 0.9721633
+
+
+
+ppc_summary(posterior, from_scratch=TRUE)
+#> Proportions of Observations within 95% Credible Interval: 0.9721633
Now we are going to finally look at the posterior distribution of our
regression parameters, but because there are so many we will focus on
just those that have a 95% credible interval not including zero (i.e.,
those that the model is fairly certain are non-zero). We are also going
to ignore the intercept term and just look at parameters associated with
age and disease status.
-posterior_summary <- summary(posterior, pars="Lambda")$Lambda
-focus <- posterior_summary[sign(posterior_summary$p2.5) == sign(posterior_summary$p97.5),]
-focus <- unique(focus$coord)
-plot(posterior, par="Lambda", focus.coord = focus, focus.cov = rownames(X)[2:4])
-#> Scale for 'colour' is already present. Adding another scale for 'colour', which will
-#> replace the existing scale.
-
+
+posterior_summary <- summary(posterior, pars="Lambda")$Lambda
+focus <- posterior_summary[sign(posterior_summary$p2.5) == sign(posterior_summary$p97.5),]
+focus <- unique(focus$coord)
+plot(posterior, par="Lambda", focus.coord = focus, focus.cov = rownames(X)[2:4])
+#> Scale for 'colour' is already present. Adding another scale for 'colour', which will
+#> replace the existing scale.
+
The first, and most obvious ting to notice is that the covariate
age
has pretty much no effect at all, whatever effect it
may have is incredibly weak. So we are going to remove age from the plot
and just look at those coordinates with non-zero effect for diagnosis
CD
-posterior_summary <- filter(posterior_summary, covariate=="diagnosisCD")
-focus <- posterior_summary[sign(posterior_summary$p2.5) == sign(posterior_summary$p97.5),]
-focus <- unique(focus$coord)
-
-tax_table(dat)[taxa_names(dat)[which(names_coords(posterior) %in% focus)]]
-#> Taxonomy Table: [12 taxa by 7 taxonomic ranks]:
-#> Kingdom Phylum Class Order
-#> 74305 "Bacteria" "Proteobacteria" "Epsilonproteobacteria" "Campylobacterales"
-#> 4449236 "Bacteria" "Proteobacteria" "Betaproteobacteria" "Burkholderiales"
-#> 1105919 "Bacteria" "Proteobacteria" "Betaproteobacteria" "Burkholderiales"
-#> 4477696 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Pasteurellales"
-#> 4448331 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Enterobacteriales"
-#> 4154872 "Bacteria" "Bacteroidetes" "Flavobacteriia" "Flavobacteriales"
-#> 4452538 "Bacteria" "Fusobacteria" "Fusobacteriia" "Fusobacteriales"
-#> 341322 "Bacteria" "Firmicutes" "Bacilli" "Turicibacterales"
-#> 1015143 "Bacteria" "Firmicutes" "Bacilli" "Gemellales"
-#> 176318 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales"
-#> 1788466 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales"
-#> 1896700 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales"
-#> Family Genus Species
-#> 74305 "Helicobacteraceae" NA NA
-#> 4449236 "Alcaligenaceae" NA NA
-#> 1105919 "Oxalobacteraceae" NA NA
-#> 4477696 "Pasteurellaceae" NA NA
-#> 4448331 "Enterobacteriaceae" NA NA
-#> 4154872 "[Weeksellaceae]" NA NA
-#> 4452538 "Fusobacteriaceae" NA NA
-#> 341322 "Turicibacteraceae" NA NA
-#> 1015143 "Gemellaceae" NA NA
-#> 176318 "Christensenellaceae" NA NA
-#> 1788466 "Lachnospiraceae" NA NA
-#> 1896700 "Peptostreptococcaceae" NA NA
-plot(posterior, par="Lambda", focus.coord = focus, focus.cov = rownames(X)[2])
-#> Scale for 'colour' is already present. Adding another scale for 'colour', which will
-#> replace the existing scale.
-
+
+posterior_summary <- filter(posterior_summary, covariate=="diagnosisCD")
+focus <- posterior_summary[sign(posterior_summary$p2.5) == sign(posterior_summary$p97.5),]
+focus <- unique(focus$coord)
+
+tax_table(dat)[taxa_names(dat)[which(names_coords(posterior) %in% focus)]]
+#> Taxonomy Table: [12 taxa by 7 taxonomic ranks]:
+#> Kingdom Phylum Class Order
+#> 74305 "Bacteria" "Proteobacteria" "Epsilonproteobacteria" "Campylobacterales"
+#> 4449236 "Bacteria" "Proteobacteria" "Betaproteobacteria" "Burkholderiales"
+#> 1105919 "Bacteria" "Proteobacteria" "Betaproteobacteria" "Burkholderiales"
+#> 4477696 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Pasteurellales"
+#> 4448331 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Enterobacteriales"
+#> 4154872 "Bacteria" "Bacteroidetes" "Flavobacteriia" "Flavobacteriales"
+#> 4452538 "Bacteria" "Fusobacteria" "Fusobacteriia" "Fusobacteriales"
+#> 341322 "Bacteria" "Firmicutes" "Bacilli" "Turicibacterales"
+#> 1015143 "Bacteria" "Firmicutes" "Bacilli" "Gemellales"
+#> 176318 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales"
+#> 1788466 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales"
+#> 1896700 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales"
+#> Family Genus Species
+#> 74305 "Helicobacteraceae" NA NA
+#> 4449236 "Alcaligenaceae" NA NA
+#> 1105919 "Oxalobacteraceae" NA NA
+#> 4477696 "Pasteurellaceae" NA NA
+#> 4448331 "Enterobacteriaceae" NA NA
+#> 4154872 "[Weeksellaceae]" NA NA
+#> 4452538 "Fusobacteriaceae" NA NA
+#> 341322 "Turicibacteraceae" NA NA
+#> 1015143 "Gemellaceae" NA NA
+#> 176318 "Christensenellaceae" NA NA
+#> 1788466 "Lachnospiraceae" NA NA
+#> 1896700 "Peptostreptococcaceae" NA NA
+plot(posterior, par="Lambda", focus.coord = focus, focus.cov = rownames(X)[2])
+#> Scale for 'colour' is already present. Adding another scale for 'colour', which will
+#> replace the existing scale.
+