---
title: "Microbiome data processing"
author: "Leo Lahti, Sudarshan Shetty et al."
bibliography: 
- bibliography.bib
output:
  BiocStyle::html_document:
    number_sections: no
    toc: yes
    toc_depth: 4
    toc_float: true
    self_contained: true
    thumbnails: true
    lightbox: true
    gallery: true
    use_bookdown: false
    highlight: haddock
---
<!--
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{microbiome tutorial - Preprocessing}
  %\usepackage[utf8]{inputenc}
  %\VignetteEncoding{UTF-8}  
-->


**Processing phyloseq objects**

Instructions to manipulate microbiome data sets using tools from the [phyloseq package](http://joey711.github.io/phyloseq/) and some extensions from the [microbiome package](https://github.com/microbiome/microbiome), including subsetting, aggregating and filtering.


Load example data:

```{r data21, warning=FALSE, message=FALSE}
library(phyloseq)
library(microbiome)
library(knitr)
data(atlas1006)   
# Rename the example data (which is a phyloseq object)
pseq <- atlas1006
```


## Summarizing the contents of a phyloseq object  

```{r summarize_pseq} 
summarize_phyloseq(pseq)
```


## Retrieving data elements from phyloseq object  

A phyloseq object contains an OTU table (taxa abundances), sample
metadata, a taxonomy table (mapping between OTUs and higher-level
taxonomic classifications), and a phylogenetic tree (relations between
the taxa). Some of these are optional.

Pick metadata as data.frame:

```{r data21b, message=FALSE, error=FALSE, warning=FALSE}
meta <- meta(pseq)
```

Taxonomy table:

```{r data22, message=FALSE, error=FALSE, warning=FALSE}
taxonomy <- tax_table(pseq)
```


Abundances for taxonomic groups ('OTU table') as a taxa x samples matrix:

```{r data23, message=FALSE, error=FALSE, warning=FALSE}
# Absolute abundances
otu.absolute <- abundances(pseq)

# Relative abundances
otu.relative <- abundances(pseq, "compositional")
```


Total read counts:


```{r reads, message=FALSE, error=FALSE, warning=FALSE}
reads_sample <- readcount(pseq)
# check for first 5 samples
reads_sample[1:5]

```
Add the read count per sample to the phyloseq object metadata:

```{r reads2, message=FALSE, error=FALSE, warning=FALSE}
sample_data(pseq)$reads_sample <- reads_sample

# reads_sample is added as the last column in sample_data of the pseq object.
head(meta(pseq)[,c("sample", "reads_sample")])
```


Melting phyloseq data for easier plotting:

```{r data25, message=FALSE, error=FALSE, warning=FALSE}
df <- psmelt(pseq)
kable(head(df))
```


## Sample operations

Sample names and variables

```{r preprocess2b, message=FALSE, error=FALSE, warning=FALSE}
head(sample_names(pseq))
```

Total OTU abundance in each sample

```{r preprocess2c, warning=FALSE, message=FALSE, eval=FALSE}
s <- sample_sums(pseq)
```

Abundance of a given taxon in each sample

```{r preprocess2d, message=FALSE, error=FALSE, warning=FALSE}
head(abundances(pseq)["Akkermansia",])
```


Select a subset by metadata fields:

```{r data-subsetting, eval=FALSE}
# The value must match a level in the data; see unique(meta(pseq)$nationality)
pseq.subset <- subset_samples(pseq, nationality == "AFR")
```


Select a subset by providing sample names: 

```{r data-subsetting2, eval=FALSE, message=FALSE, error=FALSE, warning=FALSE}
# Check sample names for African Females in this phyloseq object
s <- rownames(subset(meta(pseq), nationality == "AFR" & sex == "Female"))

# Pick the phyloseq subset with these sample names
pseq.subset2 <- prune_samples(s, pseq)
```


Pick samples at the baseline time points only:

```{r baseline, message=FALSE, error=FALSE, warning=FALSE}
pseq0 <- baseline(pseq)
```


## Data transformations

The microbiome package provides a wrapper for standard sample/OTU transforms. For arbitrary transforms, use the `transform_sample_counts` function in the phyloseq package. The log10 transform is applied as log10(1 + x) if the data contain zeroes. The "Z", "clr", "hellinger", and "shift" transformations are also available.

Relative abundances (note that the input data need to be on the absolute scale, not logarithmic):

```{r data10, message=FALSE, error=FALSE, warning=FALSE}
pseq.compositional <- microbiome::transform(pseq, "compositional")
```
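As a minimal base R sketch of what the compositional and log10(1 + x) transforms compute, the following uses a small made-up taxa x samples matrix (`toy_counts` and its values are hypothetical and not part of the example data). It illustrates the arithmetic only and is not meant as the package's internal implementation.

```{r}
# Toy taxa x samples count matrix (hypothetical values)
toy_counts <- matrix(c(10,  0, 30,
                       90, 50, 70),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("taxonA", "taxonB"),
                                     c("s1", "s2", "s3")))

# Compositional transform: divide each sample (column) by its total count
toy_rel <- sweep(toy_counts, 2, colSums(toy_counts), "/")
colSums(toy_rel)

# log10(1 + x) keeps zero counts finite
toy_log <- log10(1 + toy_counts)
toy_log
```

Every column of `toy_rel` sums to one, and the zero count of `taxonA` in sample `s2` remains zero after the log transform.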

The CLR ("clr") transformation is also available, and comes with a pseudocount to avoid zeroes. An alternative is to impute the unobserved (zero) values. Common choices are multiplicative Kaplan-Meier smoothing spline (KMSS) replacement, multiplicative lognormal replacement, and multiplicative simple replacement. These are available in the zCompositions R package (functions `multKM`, `multLN`, and `multRepl`, respectively). Use at least `n.draws = 1000` in practice; a smaller value is used here to speed up the example.

```{r data10b, message=FALSE, error=FALSE, warning=FALSE}
data(dietswap)
x <- dietswap
# Compositional data 
x2 <- microbiome::transform(x, "compositional")
```

```{r data10bb, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, results="hide"}
# library(zCompositions)
# This would be recommended for compositional zero-inflated data:
# KMSS replacement for zero-inflated compositional values
# x3 <- multKM(transform(x, "compositional"),
#         label = 0, dl = rep(0, ncol(x)), n.draws = 10, n.knots = NULL)
# x4 <- transform(x3, "clr")
```
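The centered log-ratio itself can be sketched in base R as follows. The matrix `toy_clr_counts`, the pseudocount value of 0.5, and the choice to add it to the raw counts are assumptions made for this illustration; `microbiome::transform(x, "clr")` may handle zeroes differently.

```{r}
# Toy taxa x samples count matrix with zeroes (hypothetical values)
toy_clr_counts <- matrix(c(10,  0, 30,
                           90, 50, 70,
                            5,  5,  0),
                         nrow = 3, byrow = TRUE,
                         dimnames = list(c("taxonA", "taxonB", "taxonC"),
                                         c("s1", "s2", "s3")))

# Assumed pseudocount so that log() is defined for zero counts
toy_pseudocount <- 0.5
toy_logx <- log(toy_clr_counts + toy_pseudocount)

# CLR: subtract the per-sample mean of the log values
# (equivalent to dividing by the geometric mean before taking logs)
toy_clr <- sweep(toy_logx, 2, colMeans(toy_logx), "-")

# CLR values sum to (numerically) zero within each sample
colSums(toy_clr)
```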


## Variable operations

Sample variable names

```{r preprocess4, message=FALSE, error=FALSE, warning=FALSE}
sample_variables(pseq)
```

Pick values for a given variable

```{r preprocess4b, message=FALSE, error=FALSE, warning=FALSE}
head(get_variable(pseq, sample_variables(pseq)[1]))
```

Assign new fields to metadata

```{r preprocess4b2, message=FALSE, error=FALSE, warning=FALSE}
# Calculate diversity for samples
div <- microbiome::alpha(pseq, index = "shannon")

# Assign the estimated diversity to sample metadata
# (alpha() returns a data.frame; take its single column as a vector)
sample_data(pseq)$diversity <- div[[1]]
```


## Taxa operations


Number of taxa

```{r preprocess3e, warning=FALSE, message=FALSE, eval=FALSE}
n <- ntaxa(pseq)
```


Most abundant taxa

```{r toptaxa, warning=FALSE, message=FALSE}
topx <- top_taxa(pseq, n = 10)
```


Names

```{r preprocess3f, warning=FALSE, message=FALSE}
ranks <- rank_names(pseq)  # Taxonomic levels
taxa  <- taxa(pseq)        # Taxa names at the analysed level
```


Subset taxa

```{r preprocess3c, warning=FALSE, message=FALSE}
pseq.bac <- subset_taxa(pseq, Phylum == "Bacteroidetes")
```


Prune (select) taxa:

```{r preprocess3b, warning=FALSE, message=FALSE}
# List of genera in the Bacteroidetes phylum
taxa <- map_levels(NULL, "Phylum", "Genus", pseq)$Bacteroidetes

# With given taxon names
ex2 <- prune_taxa(taxa, pseq)

# Taxa with positive sum across samples
ex3 <- prune_taxa(taxa_sums(pseq) > 0, pseq)
```


Filter by user-specified function values (here variance):

```{r preprocess3d, warning=FALSE, message=FALSE}
f <- filter_taxa(pseq, function(x) var(x) > 1e-05, TRUE)
```
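`filter_taxa` applies the given function to the abundance vector of each taxon across samples and, with the third argument set to `TRUE`, keeps only the taxa for which it returns `TRUE`. The same logic can be sketched on a plain matrix in base R; `toy_var_counts` and its row names are hypothetical.

```{r}
# Toy taxa x samples matrix (hypothetical values)
toy_var_counts <- matrix(c(5,  5,   5,
                           1, 10, 100,
                           0,  0,   1),
                         nrow = 3, byrow = TRUE,
                         dimnames = list(c("flat", "variable", "rare"),
                                         c("s1", "s2", "s3")))

# Evaluate the filter function on each taxon (row)
toy_keep <- apply(toy_var_counts, 1, function(x) var(x) > 1e-05)

# Keep only the taxa that pass the filter
toy_filtered <- toy_var_counts[toy_keep, , drop = FALSE]
rownames(toy_filtered)
```

The taxon with constant abundance has zero variance and is dropped, while the other two are retained.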


List unique phylum-level groups: 

```{r preprocess3g, warning=FALSE, message=FALSE}
head(get_taxa_unique(pseq, "Phylum"))
```

Pick the taxa abundances for a given sample:

```{r preprocess3gg, warning=FALSE, message=FALSE}
samplename <- sample_names(pseq)[[1]]

# Pick abundances of all taxa for this sample
tax.abundances <- abundances(pseq)[, samplename]
```

## Storing ASV sequences in the refseq slot

```{r}
library(jeevanuDB)
```

```{r}
# Example data
data("moving_pictures")
# Rename
ps <- moving_pictures
taxa_names(ps)[1:2]
```

The taxa names are unique sequences. This is common when using `dada2` for denoising and identifying amplicon sequence variants.
```{r}
print(ps) 
```

It is useful to store these sequences in the `refseq()` slot of the phyloseq object. For this we can use `microbiome::add_refseq()`.

```{r}
ps <- add_refseq(ps, tag = "ASV")
# check again
taxa_names(ps)[1:2]
```

```{r}
print(ps)
```

As can be seen, the sequence ids are converted to ASV ids and the sequences are stored within the `refseq()` slot.

## Labelling ASVs with best taxonomy  

Converting or amending the ASV ids with their best (lowest available) taxonomic classification can be very handy. This can be achieved with `microbiome::add_besthit()`.

```{r}
ps <- add_besthit(ps)
taxa_names(ps)[1:2]
```

If required, the names can be cleaned as follows.   
```{r}
taxa_names(ps) <- gsub("[a-x]__", "", taxa_names(ps))
taxa_names(ps)[1:2]
```

## Merging operations

Aggregate taxa to higher taxonomic levels. This is particularly useful if the phylogenetic tree is missing. When it is available, see [merge_samples, merge_taxa and tax_glom](http://joey711.github.io/phyloseq/merge.html).

```{r data24b, message=FALSE, error=FALSE, warning=FALSE}
pseq2 <- aggregate_taxa(pseq, "Phylum") 
```
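Conceptually, aggregating to a higher taxonomic level sums the abundances of all taxa that share the same higher-level label. A minimal base R sketch of that idea, using a hypothetical genus-level matrix `toy_genus_counts` and a hand-written genus-to-phylum lookup `toy_phylum`:

```{r}
# Toy genus x samples count matrix (hypothetical values)
toy_genus_counts <- matrix(c(10, 20,
                              5,  5,
                              1,  0),
                           nrow = 3, byrow = TRUE,
                           dimnames = list(c("Bacteroides", "Prevotella",
                                             "Akkermansia"),
                                           c("s1", "s2")))

# Genus-to-phylum lookup for the three toy genera
toy_phylum <- c(Bacteroides = "Bacteroidetes",
                Prevotella  = "Bacteroidetes",
                Akkermansia = "Verrucomicrobia")

# Sum the rows that belong to the same phylum
toy_phylum_counts <- rowsum(toy_genus_counts,
                            group = toy_phylum[rownames(toy_genus_counts)])
toy_phylum_counts
```

The two Bacteroidetes genera are collapsed into one row, while the column (sample) totals are unchanged.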


Merge the desired taxa into a single category (for instance "Other"). Here, we merge all Bacteroides groups into a single group named Bacteroides.

```{r merge2, message=FALSE, error=FALSE, warning=FALSE}
pseq2 <- merge_taxa2(pseq, pattern = "^Bacteroides", name = "Bacteroides") 
```


Merging phyloseq objects with the phyloseq package:

```{r preprocess-merge, warning=FALSE, message=FALSE, eval=FALSE}
merge_phyloseq(pseqA, pseqB)
```

Join the OTU/ASV table and taxonomy into one data frame:

```{r taxa-abund-full-taxonomy, message=FALSE, error=FALSE, warning=FALSE}
library(dplyr) 
library(microbiome)
data("atlas1006") # example data from microbiome pkg
otu.tax.tibble <- combine_otu_tax(atlas1006, column.id = "ASVID")
# View the first five rows and columns
otu.tax.tibble[1:5,1:5]
```

## Get tibbles  

In many cases, users may want to extract slots from a phyloseq object as tibbles for custom data processing.  
This can be achieved with following functions.  

```{r}
otu.tib <- otu_tibble(atlas1006, column.id = "FeatureID")

tax.tib <- tax_tibble(atlas1006, column.id = "FeatureID")

sample.tib <- sample_tibble(atlas1006, column.id = "SampleID")
```

## Convert to tidy long format   

As an alternative to `phyloseq::psmelt`, we provide `psmelt2` which returns a tibble in long format. 

```{r}
ps.long.tib <- psmelt2(pseq)
head(ps.long.tib)
```
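The long format has one row per taxon and sample combination. The reshaping step can be sketched in base R on a plain matrix; `toy_abund` is made up, and the column names chosen below are hypothetical and not necessarily those returned by `psmelt2`.

```{r}
# Toy taxa x samples abundance matrix (hypothetical values)
toy_abund <- matrix(c(10,  0, 30,
                      90, 50, 70),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("taxonA", "taxonB"),
                                    c("s1", "s2", "s3")))

# One row per (taxon, sample) pair
toy_long <- as.data.frame(as.table(toy_abund), stringsAsFactors = FALSE)
names(toy_long) <- c("FeatureID", "SampleID", "abundance")
toy_long
```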

## Rarefaction

```{r preprocess-rarify, warning=FALSE, message=FALSE, eval=FALSE}
pseq.rarified <- rarefy_even_depth(pseq)
```
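Rarefaction subsamples the reads of each sample to a common depth. The base R sketch below illustrates the idea on a made-up matrix, sampling without replacement down to the smallest library size; `toy_rare_counts` and the helper `toy_rarefy_column` are hypothetical. Note that `rarefy_even_depth` has a `replace` argument that defaults to `TRUE`, so its default behaviour differs from this sketch.

```{r}
set.seed(1)

# Toy taxa x samples count matrix (hypothetical values)
toy_rare_counts <- matrix(c(50,  5,
                            30, 15,
                            20, 30),
                          nrow = 3, byrow = TRUE,
                          dimnames = list(c("taxonA", "taxonB", "taxonC"),
                                          c("s1", "s2")))

# Common depth: the smallest per-sample read total
toy_depth <- min(colSums(toy_rare_counts))

# Hypothetical helper: subsample one sample's reads without replacement
toy_rarefy_column <- function(counts, depth) {
  reads  <- rep(seq_along(counts), times = counts)  # one entry per read
  picked <- sample(reads, size = depth, replace = FALSE)
  tabulate(picked, nbins = length(counts))          # recount per taxon
}

toy_rarefied <- apply(toy_rare_counts, 2, toy_rarefy_column, depth = toy_depth)
rownames(toy_rarefied) <- rownames(toy_rare_counts)
colSums(toy_rarefied)
```

After rarefaction both samples have the same total. Sample `s2` already had the minimum depth, so its counts are unchanged.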


## Taxonomy 

Convert between taxonomic levels (here from Genus, Akkermansia, to
Phylum, Verrucomicrobia):

```{r phylogeny-example2, warning=FALSE, message=FALSE}
m <- map_levels("Akkermansia", "Genus", "Phylum", tax_table(pseq))
print(m)
```


## Metadata

Visualize frequencies of the levels of a given factor (sex) within the
indicated groups (bmi_group):

```{r phylogeny-example3, warning=FALSE, message=FALSE, fig.width=10, fig.height=6}
p <- plot_frequencies(sample_data(pseq), "bmi_group", "sex")
print(p)

# Retrieving the actual data values:
# kable(head(p$data), digits = 2)
```


Custom functions are provided to cut age or BMI information into discrete classes.

```{r preprocessing-metadata2, warning=FALSE, message=FALSE}
group_bmi(c(22, 28, 31), "standard")
group_age(c(17, 41, 102), "decades")
```


Add metadata to a phyloseq object. For reproducibility, we just use the existing metadata in this example, but this can be replaced by another data.frame (samples x fields). 

```{r add_metadata, eval=TRUE, message=FALSE, error=FALSE, warning=FALSE}
# Example data
data(dietswap)
pseq <- dietswap

# Pick the existing metadata from a phyloseq object
# (or retrieve this from another source)
df <- meta(pseq)

# Merge the metadata back in the phyloseq object
pseq2 <- merge_phyloseq(pseq, sample_data(df))
```