From 831a8e18b8e64141860e821ec65a73323febfb44 Mon Sep 17 00:00:00 2001 From: Yichen Wang Date: Sun, 5 Nov 2023 16:57:07 -0500 Subject: [PATCH] update and add more Rmd doc --- README.md | 128 +------- _pkgdown.yml | 2 +- .../Integrating_scRNA_and_scATAC_data.Rmd | 301 +++++++++--------- vignettes/articles/installation.Rmd | 100 ++++++ 4 files changed, 250 insertions(+), 281 deletions(-) create mode 100644 vignettes/articles/installation.Rmd diff --git a/README.md b/README.md index 8f3c9814..431519a2 100644 --- a/README.md +++ b/README.md @@ -4,6 +4,8 @@ # LIGER (Linked Inference of Genomic Experimental Relationships) +>Now we have a comprehensive documentation site for the latest development version of [rliger 2.0](https://mvfki.github.io/liger/index.html)! + LIGER (installed as `rliger` ) is a package for integrating and analyzing multiple single-cell datasets, developed by the Macosko lab and maintained/extended by the Welch lab. It relies on integrative non-negative matrix factorization to identify shared and dataset-specific factors. Check out our [Cell paper](https://doi.org/10.1016/j.cell.2019.05.006) for a more complete description of the methods and analyses. To access data used in our SN and BNST analyses, visit our [study](https://portals.broadinstitute.org/single_cell/study/SCP466) on the @@ -30,9 +32,11 @@ We have also designed LIGER to interface with existing single-cell analysis pack [Seurat](https://satijalab.org/seurat/). ## Feedback + Consider filling out our [feedback form](https://forms.gle/bhvp3K6tiHwf976r8) to help us improve the functionality and accessibility of LIGER. ## Usage + For usage examples and guided walkthroughs, check the `vignettes` directory of the repo. * [Iterative Single-Cell Multi-Omic Integration Using Online iNMF](http://htmlpreview.github.io/?https://github.com/welch-lab/liger/blob/master/vignettes/online_iNMF_tutorial.html) @@ -49,126 +53,9 @@ For usage examples and guided walkthroughs, check the `vignettes` directory of t * [Running Liger directly on Seurat objects using Seurat wrappers](https://htmlpreview.github.io/?https://github.com/satijalab/seurat.wrappers/blob/master/docs/liger.html) -## System Requirements - -### Hardware requirements -The `rliger` package requires only a standard computer with enough RAM to support the in-memory operations. For minimal performance, please make sure that the computer has at least about 2 GB of RAM. For optimal performance, we recommend a computer with the following specs: - -* RAM: 16+ GB -* CPU: 4+ cores, 2.3 GHz/core - -### Software requirements - -The package development version is tested on *Linux* operating systems and *Mac OSX*. - -* Linux: CentOS 7, Manjaro 5.3.18 -* Mac OSX: Mojave (10.14.1), Catalina (10.15.2) - -The `rliger` package should be compatible with Windows, Mac, and Linux operating systems. - -Before setting up the `rliger` package, users should have R version 3.4.0 or higher, and several packages set up from CRAN and other repositories. The user can check the dependencies in `DESCRIPTION`. - -## Installation - -LIGER is written in R and is also available on the Comprehensive R Archive Network (CRAN). Note that the package name is `rliger` to avoid a naming conflict with an unrelated package. To install the version on CRAN, follow these instructions: - -1. Install [R](https://www.r-project.org/) (>= 3.6) -2. Install [Rstudio](https://posit.co/download/rstudio-desktop/) (recommended) -3. Type the following R command: -``` -install.packages('rliger') -``` -To install the latest development version directly from GitHub, type the following commands instead of step 3: -``` -install.packages('devtools') -library(devtools) -install_github('welch-lab/liger') -``` -Note that the GitHub version requires installing from source, which may involve additional installation steps on MacOS (see below). - -### Additional Steps for Installing LIGER from Source (recommended before step 3) -Installation from CRAN is easy because pre-compiled binaries are available for Windows and MacOS. However, a few additional steps are required to install from source on MacOS/Windows (e.g. Install RcppArmadillo). -(MacOS) Installing RcppArmadillo on R>=3.4 requires Clang >= 4 and gfortran-6.1. For newer versions of R (R>=3.5), it's recommended to follow the instructions in this [post](https://thecoatlessprofessor.com/programming/r-compiler-tools-for-rcpp-on-macos/). Follow the instructions below if you have R version 3.4.0-3.4.4. - -1. Install gfortran as suggested [here](https://gcc.gnu.org/wiki/GFortranBinaries) -2. Download clang4 from this [page](https://mac.R-project.org/libs/clang-4.0.0-darwin15.6-Release.tar.gz) -3. Uncompress the resulting zip file and type into Terminal (`sudo` if needed): -``` -mv /path/to/clang4/ /usr/local/ -``` -4. Create `.R/Makevars` file containing following: -``` -# The following statements are required to use the clang4 binary -CC=/usr/local/clang4/bin/clang -CXX=/usr/local/clang4/bin/clang++ -CXX11=/usr/local/clang4/bin/clang++ -CXX14=/usr/local/clang4/bin/clang++ -CXX17=/usr/local/clang4/bin/clang++ -CXX1X=/usr/local/clang4/bin/clang++ -LDFLAGS=-L/usr/local/clang4/lib -``` -For example, use the following Terminal commands: -``` -cd ~ -mkdir .R -cd .R -nano Makevars -``` -Paste in the required text above and save with `Ctrl-X`. - -### Additional Installation Steps for Online Learning using LIGER -The HDF5 library is required for implementing online learning in LIGER on data files in HDF5 format. It can be installed via one of the following commands: - -| System | Command -|:------------------------------------------|:---------------------------------| -|**OS X (using Homebrew or Conda)** | `brew install hdf5` or `conda install -c anaconda hdf5` -|**Debian-based systems (including Ubuntu)**| `sudo apt-get install libhdf5-dev` -|**Systems supporting yum and RPMs** | `sudo yum install hdf5-devel` - -For Windows, the latest HDF5 1.12.0 is available at https://www.hdfgroup.org/downloads/hdf5/. - -### Detailed Instructions for FIt-SNE Installation -Note that the runUMAP function (which calls the `uwot` package) also scales to large datasets and does not require additional installation steps. -However, using FIt-SNE is recommended for computational efficiency if you want to perform t-SNE on very large datasets. -Installing and compiling the necessary software requires the use of git, FIt-SNE, and FFTW. For a -basic overview of installation, visit this [page](https://github.com/KlugerLab/FIt-SNE). - -Basic installation for most Unix machines can be achieved with the following commands after downloading -the latest version of FFTW from [here](http://www.fftw.org/). In the fftw directory, run: -``` -./configure -make -make install -``` -(Additional [instructions](http://www.fftw.org/fftw3_doc/Installation-and-Customization.html) if -necessary). -Then in desired directory: -``` -git clone https://github.com/KlugerLab/FIt-SNE.git -cd FIt-SNE -g++ -std=c++11 -O3 src/sptree.cpp src/tsne.cpp src/nbodyfft.cpp -o bin/fast_tsne -pthread -lfftw3 -lm -pwd -``` -Use the output of `pwd` as the `fitsne.path` parameter in runTSNE. - -Note that the above instructions require root access. To install into a specified folder (such as your home directory) on a server, use the `--prefix` option: -``` -./configure --prefix= -make -make install -git clone https://github.com/KlugerLab/FIt-SNE.git -cd FIt-SNE -g++ -std=c++11 -O3 src/sptree.cpp src/tsne.cpp src/nbodyfft.cpp -I/include/ -L/lib/ -o bin/fast_tsne -pthread -lfftw3 -lm -pwd -``` - -### Install Time and Expected Run Time - -The installation process of `rliger` should take less than 30 minutes. - -The expected run time is 1 - 4 hours depending on dataset size and downstream analysis of the user’s choice. ## Sample Datasets + The `rliger` package provides a small sample dataset for basic demos of the functions. You can find it in folder `liger/tests/testdata/small_pbmc_data.RDS`. We also provide a set of scRNA-seq and scATAC-seq datasets for real-world style demos. These datasets are as follows: @@ -190,8 +77,3 @@ We also provide a set of scRNA-seq and scATAC-seq datasets for real-world style * scATAC and scRNA data provided by 10X Genomics. The data sources are: + `pbmc.atac.expression.mat.RDS`: raw data can be accessed [here](https://support.10xgenomics.com/single-cell-atac/datasets/1.1.0/atac_v1_pbmc_10k), created by Cell Ranger ATAC 1.1.0; + `pbmc.rna.expression.mat.RDS`: raw data can be accessed [here](https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3), created by Cell Ranger 3.0.0. - -Corresponding tutorials can be found in section **Usage** above. - -## License -This project is covered under the **GNU General Public License 3.0**. diff --git a/_pkgdown.yml b/_pkgdown.yml index 43b2c2b7..1f21f1af 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -95,7 +95,7 @@ navbar: title: "rliger" left: - text: "Installation" - href: # + href: articles/installation.html - text: "Tutorials" menu: - text: "Integrating multiple scRNAseq data" diff --git a/vignettes/articles/Integrating_scRNA_and_scATAC_data.Rmd b/vignettes/articles/Integrating_scRNA_and_scATAC_data.Rmd index 0bc93287..a7c67e4d 100644 --- a/vignettes/articles/Integrating_scRNA_and_scATAC_data.Rmd +++ b/vignettes/articles/Integrating_scRNA_and_scATAC_data.Rmd @@ -2,19 +2,21 @@ title: "Joint definition of cell types from single-cell gene expression and chromatin accessibility data (human bone marrow mononuclear cells)" author: "Jialin Liu and Joshua Welch" date: "3/27/2020" -output: html_document +output: + html_document: + toc: 3 + toc_float: true --- ```{r setup, include=FALSE} -knitr::opts_chunk$set(echo = TRUE) +knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, cache = TRUE) ``` - ->NOTE: The old version of this tutorial has been archived. The users can access it from [here](https://macoskolab.github.io/liger/walkthrough_rna_atac.html). - In this tutorial, we will demonstrate LIGER’s ability to jointly define cell types by leveraging multiple single-cell modalities. For example, the integration of single-cell RNA-seq and single-cell ATAC-seq enables cell type definitions that incorporate both gene expression and chromatin accessibility data. Such joint analysis allows not only the taxonomic categorization of cell types, but also a deeper understanding of their underlying regulatory networks. The pipeline for jointly analyzing scRNA-seq and scATAC-seq is similar to that for integrating multiple scRNA-seq datasets in that both rely on joint matrix factorization and quantile normalization. The main differences are: (1) scATAC-seq data needs to be processed into gene-level values; (2) gene selection is performed on the scRNA-seq data only; and (3) downstream analyses can use both gene-level and intergenic information. -## Stage I: Preprocessing and Normalization (40 - 50 minutes) +## Stage I: Preprocessing and Normalization + +### 1. Loading data In order to jointly analyze scRNA and scATAC-seq data, we first need to transform the scATAC-seq data--a genome-wide epigenomic measurement--into gene-level counts which are comparable to gene expression data from scRNA-seq. Most previous single-cell studies have used an approach inspired by traditional bulk ATAC-seq analysis: identifying chromatin accessibility peaks, then summing together all peaks that overlap each gene. This strategy is also appealing because the 10X CellRanger pipeline, a commonly used commercial package, automatically outputs such peak counts. However, we find this strategy less desirable because: (1) peak calling is performed using all cells, which biases against rare cell populations; (2) gene body accessibility is often more diffuse than that of specific regulatory elements, and thus may be missed by peak calling algorithms; and (3) information from reads outside of peaks is discarded, further reducing the amount of data in the already sparse measurements. Instead of summing peak counts, we find that the simplest possible strategy seems to work well: counting the total number of ATAC-seq reads within the gene body and promoter region (typically 3 kb upstream) of each gene in each cell. @@ -26,44 +28,53 @@ For convenience, we have prepared the pre-processed data which are ready to use. - [GSM4138888_scATAC_BMMC_D5T1_peak_counts.RDS](https://figshare.com/ndownloader/files/40054864) - [hg19_genes.bed](https://figshare.com/ndownloader/files/40054870) -```{r, message=F, results='hide'} +```{r, results='hide'} library(rliger2) -D5T1 <- readRDS("GSM4138888_scATAC_BMMC_D5T1.RDS") -rna1 <- readRDS("GSM4138872_scRNA_BMMC_D1T1.rds") -rna2 <- readRDS("GSM4138873_scRNA_BMMC_D1T2.rds") -bmmc.rna <- cbind(rna1,rna2) -rm(rna1, rna2) +if (!file.exists("liger_BMMC_rna_D1T1.rds")) + download.file("https://figshare.com/ndownloader/files/40054858", + destfile = "liger_BMMC_rna_D1T1.rds") +D1T1 <- readRDS("liger_BMMC_rna_D1T1.rds") + +if (!file.exists("liger_BMMC_rna_D1T2.rds")) + download.file("https://figshare.com/ndownloader/files/40054861", + destfile = "liger_BMMC_rna_D1T2.rds") +D1T2 <- readRDS("liger_BMMC_rna_D1T2.rds") + +if (!file.exists("liger_BMMC_atac_D5T1.rds")) + download.file("https://figshare.com/ndownloader/files/40054891", + destfile = "liger_BMMC_atac_D5T1.rds") +D5T1 <- readRDS("liger_BMMC_atac_D5T1.rds") ``` -### Start from CellRanger output +#### Start from CellRanger output You can also follow the following tutorial if you would like to start from the very beginning. Note that in this part, we included the details of running this preprocessing workflow for only one sample. Users should re-run this counting step multiple times if for more than one scATAC-seq sample. Note also that several commands need to be run through the Command Line Interface instead of the R Console or IDE (RStudio). We also employ the bedmap command from the BEDOPS tool to make a list of cell barcodes that overlap each gene and promoter. The gene body and promoter indexes are .bed files, which indicate gene and promoter coordinates. Since bedmap expects sorted inputs, the fragment output from CellRanger, gene body and promoter indexes should all be sorted. -We show below how to perform these steps for scATAC data generated by the 10X Chromium system, the most widely used scATAC-seq protocol. The starting input for this process is the file fragments.tsv output by CellRanger, which contains all ATAC reads that passed filtering steps. +We show below how to perform these steps for scATAC data generated by the 10X Chromium system, the most widely used scATAC-seq protocol. The starting input for this process is the file fragments.tsv output by CellRanger, which contains all ATAC reads that passed filtering steps. ````{=html}
Detailed CellRanger output preprocessing -```` +```` **1.** We must first sort fragments.tsv by chromosome, start, and end position using the sort command line utility. The `-k` option lets the user sort the file on a certain column; including multiple `-k` options allows sorting by multiple columns simultaneously. The `n` behind `-k` stands for "numeric ordering". Here the sorted BED file order is defined first by lexicographic chromosome order (using the parameter `-k1,1`), then by ascending integer start coordinate order (using parameter `-k2,2n`), and finally by ascending integer end coordinate order (using parameter `-k3,3n`). Note that this step may take a while, since the input fragment file is usually very large (for example, a typical fragment file of 4-5 GB can take about 40 minutes). -```{r Count, eval = FALSE} +```{bash Count, eval = FALSE} sort -k1,1 -k2,2n -k3,3n GSM4138888_scATAC_BMMC_D5T1.fragments.tsv > atac_fragments.sort.bed ``` Gene body and promoter indexes should also be sorted using the same strategy for sorting fragments output files: -```{r Count-2, eval = FALSE} +```{bash Count-2, eval = FALSE} sort -k 1,1 -k2,2n -k3,3n hg19_genes.bed > hg19_genes.sort.bed sort -k 1,1 -k2,2n -k3,3n hg19_promoters.bed > hg19_promoters.sort.bed ``` **2.** Use `bedmap` command to calculate overlapping elements between indexes and fragment output files: -```{r Count-3, eval = FALSE} +```{bash Count-3, eval = FALSE} bedmap --ec --delim "\t" --echo --echo-map-id hg19_promoters.sort.bed atac_fragments.sort.bed > atac_promoters_bc.bed bedmap --ec --delim "\t" --echo --echo-map-id hg19_genes.sort.bed atac_fragments.sort.bed > atac_genes_bc.bed ``` @@ -97,7 +108,7 @@ barcodes <- bc_filt **4.** We can then use LIGER’s `makeFeatureMatrix()` function to calculate accessibility counts for gene body and promoter individually. This function takes the output from `bedmap` and efficiently counts the number of fragments overlapping each gene and promoter. We could count the genes and promoters in a single step, but choose to calculate them separately in case it is necessary to look at gene or promoter accessibility individually in downstream analyses. ```{r Count-6, eval = FALSE} -library(liger) +library(rliger2) gene.counts <- makeFeatureMatrix(genes.bc, barcodes) promoter.counts <- makeFeatureMatrix(promoters.bc, barcodes) ``` @@ -114,7 +125,7 @@ colnames(D5T1)=paste0("D5T1_",colnames(D5T1)) **5.** Once the gene-level scATAC-seq counts are generated, the `read10X()` function from LIGER can be used to read scRNA-seq count matrices output by CellRanger. You can pass in a directory (or a list of directories) containing raw outputs (for example, `"/Sample_1/outs/filtered_feature_bc_matrix"`) to the parameter `sample.dirs`. Next, a vector of names to use for the sample (or samples, corresponding to `sample.dirs`) should be passed to parameter `sample.names` as well. LIGER can also use data from any other protocol, as long as it is provided in a genes x cells R matrix format. ```{r loading-1, eval=FALSE} -bmmc.rna <- read10X(sample.dirs = list("/path_to_sample"), sample.names = list("rna")) +bmmc.rna <- read10X(path = "/path_to/filtered_feature_barcode_matrix", sampleNames = "rna") ``` ````{=html} @@ -122,160 +133,150 @@ bmmc.rna <- read10X(sample.dirs = list("/path_to_sample"), sample.names = list("
```` -### LIGER preprocessing +### 2. preprocessing -**6.** We can now create a LIGER object with the `createLiger()` function. We also remove unneeded variables to conserve memory. +We can now create a LIGER object with the `createLiger()` function. We also remove unneeded variables to conserve memory. Additionally, we merge the two RNA datasets into one. -```{r loading-2, warning=FALSE, results='hide', message=FALSE, error=FALSE} -bmmc.data <- list(atac = D5T1, rna = bmmc.rna) -int.bmmc <- createLiger(bmmc.data, modal = c("atac", "rna")) -rm(genes.bc, promoters.bc, gene.counts, promoter.counts, D5T1, bmmc.rna, bmmc.data) -gc() -int.bmmc +```{r loading-2} +bmmc.rna <- cbind(D1T1, D1T2) +bmmcLiger <- createLiger(list(atac = D5T1, rna = bmmc.rna), + modal = c("atac", "rna")) ``` -**7.** Preprocessing steps are needed before running iNMF. Each dataset is normalized to account for differences in total gene-level counts across cells using the normalize function. Next, highly variable genes from each dataset are identified and combined for use in downstream analysis. Note that by setting the parameter `datasets.use = 2`, genes will be selected only from the scRNA-seq dataset (the second dataset) by the `selectGenes()` function. We recommend not using the ATAC-seq data for variable gene selection because the statistical properties of the ATAC-seq data are very different from scRNA-seq, violating the assumptions made by the statistical model we developed for selecting genes from RNA data. Finally, the `scaleNotCenter()` function scales normalized datasets without centering by the mean, giving the non-negative input data required by iNMF. +Preprocessing steps are needed before running iNMF. Each dataset is normalized to account for differences in total gene-level counts across cells using the normalize function. Next, highly variable genes from each dataset are identified and combined for use in downstream analysis. Note that by setting the parameter `useDataset = "rna"`, genes will be selected only from the scRNA-seq dataset (the second dataset) by the `selectGenes()` function. We recommend not using the ATAC-seq data for variable gene selection because the statistical properties of the ATAC-seq data are very different from scRNA-seq, violating the assumptions made by the statistical model we developed for selecting genes from RNA data. Finally, the `scaleNotCenter()` function scales normalized datasets without centering by the mean, giving the non-negative input data required by iNMF. -```{r pre-1, message=F, results='hide'} -int.bmmc <- normalize(int.bmmc) -int.bmmc <- selectGenes(int.bmmc, datasets.use = 2) -int.bmmc <- scaleNotCenter(int.bmmc) +```{r pre-1} +bmmcLiger <- bmmcLiger %>% + normalize() %>% + selectGenes(useDatasets = "rna") %>% + scaleNotCenter() ``` -## Stage II: Joint Matrix Factorization (3 - 10 minutes) +## Stage II: Joint Matrix Factorization -**8.** We next perform joint matrix factorization (iNMF) on the normalized and scaled RNA and ATAC data. This step calculates metagenes--sets of co-expressed genes that distinguish cell populations--containing both shared and dataset-specific signals. The cells are then represented in terms of the "expression level" of each metagene, providing a low-dimensional representation that can be used for joint clustering and visualization. To run iNMF on the scaled datasets, we use the `optimizeALS()` function with proper hyperparameter settings. +### 3. Determine parameters and perform iNMF integration -To run iNMF on the scaled datasets, use `optimizeALS()` function with proper hyperparameters setting: +We next perform joint matrix factorization (iNMF) on the normalized and scaled RNA and ATAC data. This step calculates metagenes--sets of co-expressed genes that distinguish cell populations--containing both shared and dataset-specific signals. The cells are then represented in terms of the "expression level" of each metagene, providing a low-dimensional representation that can be used for joint clustering and visualization. To run iNMF on the scaled datasets, we use the `runIntegration()` function with proper hyperparameter settings. -```{r 4-1, message=F, results='hide'} -int.bmmc <- optimizeALS(int.bmmc, k = 20, maxIter = 30) +```{r 4-1} +bmmcLiger <- runIntegration(bmmcLiger, k = 20) ``` -Important parameters are as follows: +## Stage III: Quantile Normalization and Joint Clustering -- *k*. Integer value specifying the inner dimension of factorization, or number of factors. Higher `k` is recommended for datasets with more substructure. We find that a value of `k` in the range 20 - 40 works well for most datasets. Because this is an unsupervised, exploratory analysis, there is no single "right" value for `k`, and in practice, users choose `k` from a combination of biological prior knowledge and other information. -- *lambda*. This is a regularization parameter. Larger values penalize dataset-specific effects more strongly, causing the datasets to be better aligned, but possibly at the cost of higher reconstruction error. The default value is `5`. We recommend using this value for most analyses, but find that it can be lowered to 1 in cases where the dataset differences are expected to be relatively small, such as scRNA-seq data from the same tissue but different individuals. -- *thresh*. This sets the convergence threshold. Lower values cause the algorithm to run longer. The default is `1e-6`. -- *maxIter*. This variable sets the maximum number of iterations to perform. The default value is `30`. +### 4. Align the factors -## Stage III: Quantile Normalization and Joint Clustering (1 minute) +Using the metagene factors calculated by iNMF, we then assign each cell to the factor on which it has the highest loading, giving joint clusters that correspond across datasets. We then perform quantile normalization by dataset, factor, and cluster to fully integrate the datasets. To perform this analysis, typing in: -**9.** Using the metagene factors calculated by iNMF, we then assign each cell to the factor on which it has the highest loading, giving joint clusters that correspond across datasets. We then perform quantile normalization by dataset, factor, and cluster to fully integrate the datasets. To perform this analysis, typing in: +```{r 4-2} +bmmcLiger <- quantileNorm(bmmcLiger) +``` -```{r 4-2, message=FALSE, warning=FALSE, results='hide'} -int.bmmc <- quantileNorm(int.bmmc) +### 5. Clustering + +The `quantileNorm()` function gives joint clusters that correspond across datasets, which are often completely satisfactory and sufficient for downstream analyses. However, if desired, after quantile normalization, users can additionally run the Leiden algorithm for community detection, which is widely used in single-cell analysis and excels at merging small clusters into broad cell classes. This can be achieved by running the `runCluster()` function. Several tuning parameters, including `resolution`, `nNeighbors`, and `prune` control the number of clusters produced by this function. For this dataset, we use a resolution of `0.2`. + +```{r} +bmmcLiger <- runCluster(bmmcLiger, nNeighbors = 30, resolution = 0.2) ``` -Important parameters of `quantileNorm()` are as follows: +>Starting from rliger 2.0.0, cluster labeling will be stored in cell metadata, which can be accessed with `cellMeta(bmmcLiger)`. Use argument `clusterName` to specify unique variable names for the result can enable storing multiple cluster labeling variables at the same time. -- `nNeighbors`. This sets the number of nearest neighbors for within-dataset KNN graph. The default is `20`. -- `quantiles`. This sets the number of quantiles to use for quantile normalization. The default is `50`. -- `minCells`. This indicates the minimum number of cells to consider a cluster as shared across datasets. The default is `20`. -- `useDims`. This sets the indices of factors to use for quantile normalization. The user can pass in a vector of indices indicating specific factors. This is helpful for excluding factors capturing biological signals such as the cell cycle or technical signals such as mitochondrial genes. The default is all `k` of the factors. -- `center`. This indicates whether to center the data when scaling factors. The default is `FALSE`. This option should be set to `TRUE` when metagene loadings have a mean above zero, as with dense data such as DNA methylation. -- `maxSample`. This sets the maximum number of cells used for quantile normalization of each cluster and factor. The default is `1000`. -- `refineKNN`. This indicates whether to increase robustness of cluster assignments using KNN graph. The default is `TRUE`. -- `eps`. This sets the error bound of the nearest neighbor search. The default is `0.9`. Lower values give more accurate nearest neighbor graphs but take much longer to computer. -- `reference`. This indicates the name of the dataset to be used as a reference for quantile normalization. By default, the dataset with the largest number of cells is used. +## Stage IV: Visualization and Downstream Analysis -**10.** The `quantileNorm()` function gives joint clusters that correspond across datasets, which are often completely satisfactory and sufficient for downstream analyses. However, if desired, after quantile normalization, users can additionally run the Leiden algorithm for community detection, which is widely used in single-cell analysis and excels at merging small clusters into broad cell classes. This can be achieved by running the `runLeidenCluster()` function. Several tuning parameters, including `resolution`, `nNeighbors`, and `prune` control the number of clusters produced by this function. For this dataset, we use a resolution of `0.2`, which yields 26 clusters (see below). +### 6. Generate dimensionality reduced embedding -```{r, message=F, results='hide'} -int.bmmc <- runLeidenCluster(int.bmmc, nNeighbors = 30, resolution = 0.2) +In order to visualize the clustering results, the user can use two dimensionality reduction methods supported by LIGER: t-SNE and UMAP. We find that often for datasets containing continuous variation such as cell differentiation, UMAP better preserves global relationships, whereas t-SNE works well for displaying discrete cell types, such as those in the brain. The UMAP algorithm (called by the `runUMAP()` function) scales readily to large datasets. The `runTSNE()` function also includes an option to use "FFtSNE", a highly scalable implementation of t-SNE that can efficiently process huge datasets. For the BMMC dataset, we expect to see continuous lineage transitions among the differentiating cells, so we use UMAP to visualize the data in two dimensions: + +```{r} +bmmcLiger <- runUMAP(bmmcLiger, nNeighbors = 30, minDist = 0.3) ``` -## Stage IV: Visualization (2 - 3 minutes) and Downstream Analysis (30 - 40 minutes) +>Starting from rliger 2.0.0, dimensionality reduction matrices will be stored in cell metadata, which can be accessed with `cellMeta(bmmcLiger)`. Use argument `dimredName` to specify unique variable names for the result can enable storing multiple low-dimensional representation matrices as variables at the same time. -**11.** In order to visualize the clustering results, the user can use two dimensionality reduction methods supported by LIGER: t-SNE and UMAP. We find that often for datasets containing continuous variation such as cell differentiation, UMAP better preserves global relationships, whereas t-SNE works well for displaying discrete cell types, such as those in the brain. The UMAP algorithm (called by the `runUMAP()` function) scales readily to large datasets. The `runTSNE()` function also includes an option to use "FFtSNE", a highly scalable implementation of t-SNE that can efficiently process huge datasets. For the BMMC dataset, we expect to see continuous lineage transitions among the differentiating cells, so we use UMAP to visualize the data in two dimensions: +### 7. Create plots -```{r, message=F, results='hide'} -int.bmmc <- runUMAP(int.bmmc, distance = 'cosine', nNeighbors = 30, minDist = 0.3) -``` +We provide a variety of utilities for visualization and analysis of clustering, gene expression across datasets, and comparisons of cluster assignments. Here we demonstrate several commonly used examples. -**12.** We can then visualize each cell, colored by cluster or dataset. +`plotByDatasetAndCluster()` returns two graphs, generated by t-SNE or UMAP in the previous step. The first colors cells by dataset of origin, and the second by cluster as determined by previous clustering step. The plots provide visual confirmation that the datasets are well aligned and the clusters are consistent with the shape of the data as revealed by UMAP. -```{r, message=F, fig.align='center', fig.width=10} +The two subplots can individually be generated with `plotDatasetDimRed()` and `plotClusterDimRed()`, respectively. + +```{r, fig.align='center', fig.width=10} options(ligerDotSize = 0.5) -plotByDatasetAndCluster(int.bmmc, labelTextSize = 3) +plotByDatasetAndCluster(bmmcLiger) ``` -**13.** LIGER employs the Wilcoxon rank-sum test to identify marker genes that are differentially expressed in each cell type using the following settings. We provide parameters that allow the user to select which datasets to use (`useDatasets`) and whether to compare across clusters or across datasets within each cluster (`method` with options `"clusters"` or `"datasets"`). To identify marker genes for each cluster combining scATAC and scRNA profiles, typing in: -```{r , eval=FALSE} -int.bmmc.wilcoxon <- runWilcoxon(int.bmmc, method = 'clusters') -``` +### 8. Differential expression -Important parameters of runWilcoxon are as follows: +LIGER employs the Wilcoxon rank-sum test to identify marker genes that are differentially expressed in each cell type using the following settings. We provide parameters that allow the user to select which datasets to use (`useDatasets`) and whether to compare across clusters or across datasets within each cluster (`method` with options `"clusters"` or `"datasets"`). To identify marker genes for each cluster combining scATAC and scRNA profiles, typing in: -- `useDatasets`. This selects which dataset(s) to be included. By default all datasets are included. -- `method`. This indicates whether to compare across clusters or across datasets with each cluster. Setting `method = "clusters"` compares each feature’s (genes, peaks, etc.) loading between clusters combining all datasets, which gives us the most specific features for each cluster. On the other hand, setting `method = "datasets"` gives us the most differentially loaded features for every cluster between datasets. +```{r} +marker.cluster <- runMarkerDEG(bmmcLiger) +``` -**14.** The number of marker genes identified by `runWilcoxon()` varies and depends on the datasets used. The function outputs a data.frame that the user can then filter to select markers which are statistically and biologically significant. For example, one strategy is to filter the output by taking markers which have `padj` (Benjamini-Hochberg adjusted p-value) less than 0.05 and `logFC` (log fold change between observations in group versus out) larger than 3: +The number of significant genes identified by `runMarkerDEG()` varies and depends on the datasets used. And the raw output of the function contains the statistics of all tested genes in all groups (clusters). In order to pick out the top markers for each cluster, we strongly suggest using package "dplyr", which provides a user-friendly interface for data table manipulation. The following code chunk first filters the markers which are statistically and biologically significant. For example, we filter the output by taking markers which have padj (Benjamini-Hochberg adjusted p-value) less than 0.05 and logFC (log fold change between observations in group versus out) larger than 3. Then for each cluster, we sort the markers primarily by its padj value in ascending order. Given that mathematically, the lowest padj values are rounded to 0 as they are too small, for genes tying on this metric, we then sort the markers by logFC in descending order. Finally, we select the top 20 markers for each cluster. -```{r , eval=FALSE} -int.bmmc.wilcoxon <- int.bmmc.wilcoxon[int.bmmc.wilcoxon$padj < 0.05,] -int.bmmc.wilcoxon <- int.bmmc.wilcoxon[int.bmmc.wilcoxon$logFC > 3,] +```{r} +library(dplyr) +marker.cluster <- marker.cluster %>% + filter(padj < 0.05, logFC > 3) %>% + group_by(group) %>% + arrange(padj, -logFC, .by_group = TRUE) %>% + top_n(20) ``` -You can then re-sort the markers by its `padj` value in ascending order and choose the top 100 for each cell type. For example, we can subset and re-sort the output for Cluster 1 and take the top 20 markers by typing these commands: +We can then view the selected top 20 markers for cluster 1 by: -```{r , eval=FALSE} -wilcoxon.cluster_1 <- int.bmmc.wilcoxon[int.bmmc.wilcoxon$group == 1, ] -wilcoxon.cluster_1 <- wilcoxon.cluster_1[order(wilcoxon.cluster_1$padj), ] -markers.cluster_1 <- wilcoxon.cluster_1[1:20, ] +```{r} +marker.cluster %>% filter(group == 3) ``` -**15.** We also provide functions to check these markers by visualizing their expression and gene loadings across datasets. You can use the `plotGeneDimRed()` to visualize the expression or accessibility of a marker gene, which is helpful for visually confirming putative marker genes or investigating the distribution of known markers across the cell sequenced. Such plots can also confirm that divergent datasets are properly aligned. +We also provide functions to check these markers by visualizing their expression and gene loadings across datasets. You can use the `plotGeneDimRed()` to visualize the expression or accessibility of a marker gene, which is helpful for visually confirming putative marker genes or investigating the distribution of known markers across the cell sequenced. Such plots can also confirm that divergent datasets are properly aligned. For instance, we can plot S100A9, which the Wilcoxon test identified as a marker for Cluster 1, and MS4A1, a marker for Cluster 4: -```{r , message=FALSE, warning=FALSE, paged.print=FALSE, fig.align='center', fig.height=6} -plots <- plotGeneDimRed(int.bmmc, c("S100A9", "MS4A1"), splitBy = "dataset", - titles = c(names(int.bmmc), names(int.bmmc))) +```{r, fig.align='center', fig.height=6} +plots <- plotGeneDimRed(bmmcLiger, c("S100A9", "MS4A1"), splitBy = "dataset", + titles = c(names(bmmcLiger), names(bmmcLiger))) cowplot::plot_grid(plotlist = plots, nrow = 2) ``` These plots indicate that S100A9 and MS4A1 are indeed specific markers for Cluster 1 and Cluster 4, respectively, with high values in these cell groups and low values elsewhere. Furthermore, we can see that the distributions are strikingly similar between the RNA and ATAC datasets, indicating that LIGER has properly aligned the two data types. -**16.** A key advantage of using iNMF instead of other dimensionality reduction approaches such as PCA is that the dimensions are individually interpretable. For example, a particular cell type is often captured by a single dimension of the space. Furthermore, iNMF identifies both shared and dataset-specific features along each dimension, giving insight into exactly how corresponding cells across datasets are both similar and different. Too visualize this information, users need to run `getFactorMarkers()` first in order to have the statistics prepared for ranking the factor loading. The function `plotGeneLoadings()` then creates visualization for the exploration. For example, we can visualize the factor loading of Factor 7 typing in: +### 9. Interpret the factors -```{r message=FALSE, warning=FALSE, results='hide', fig.keep='all', fig.align='center', fig.height=7} -factorMarker <- getFactorMarkers(int.bmmc, dataset1 = "atac", dataset2 = "rna") -plotGeneLoadings(int.bmmc, markerTable = factorMarker, useFactor = 7, nLabel = 15) +A key advantage of using iNMF instead of other dimensionality reduction approaches such as PCA is that the dimensions (factors) are individually interpretable. For example, a particular cell type is often captured by a single dimension of the space. Furthermore, iNMF identifies both shared and dataset-specific features along each dimension, giving insight into exactly how corresponding cells across datasets are both similar and different. Too visualize this information, users need to run `getFactorMarkers()` first in order to have the statistics prepared for ranking the factor loading. The function `plotGeneLoadings()` then creates visualization for the exploration. For example, we can visualize the factor loading of Factor 7 typing in: + +```{r, results='hide', fig.keep='all', fig.align='center', fig.height=7} +factorMarker <- getFactorMarkers(bmmcLiger, dataset1 = "atac", dataset2 = "rna") +plotGeneLoadings(bmmcLiger, markerTable = factorMarker, useFactor = 12, nLabel = 15) ``` -```{r message=FALSE, warning=FALSE, paged.print=FALSE, fig.align='center', fig.height=6} -plots <- plotGeneDimRed(int.bmmc, c("CCR6", "NCF1"), splitBy = "dataset", - titles = c(names(int.bmmc), names(int.bmmc))) +```{r, fig.align='center', fig.height=6} +plots <- plotGeneDimRed(bmmcLiger, c("CCR6", "NCF1"), splitBy = "dataset", + titles = c(names(bmmcLiger), names(bmmcLiger))) cowplot::plot_grid(plotlist = plots, nrow = 2) ``` These plots confirm that the expression and accessibility of these genes show clear differences. CCR6 shows nearly ubiquitous chromatin accessibility but is expressed only in clusters 2 and 4. The accessibility is highest in these clusters, but the ubiquitous accessibility suggests that the expression of CCR6 is somewhat decoupled from its accessibility, likely regulated by other factors. Conversely, NCF1 shows high expression in clusters 1, 6, 9, 10, 12, 14 and 16, despite no clear enrichment in chromatin accessibility within these clusters. This may again indicate decoupling between the expression and chromatin accessibility of NCF1. Another possibility is that the difference is due to technical effects--the gene body of NCF1 is short (~15KB), and short genes are more difficult to capture in scATAC-seq than in scRNA-seq because there are few sites for the ATAC-seq transposon to insert. -**17.** Single-cell measurements of chromatin accessibility and gene expression provide an unprecedented opportunity to investigate epigenetic regulation of gene expression. Ideally, such investigation would leverage paired ATAC-seq and RNA-seq from the same cells, but such simultaneous measurements are not generally available. However, using LIGER, it is possible to computationally infer “pseudo-multi-omic” profiles by linking scRNA-seq profiles--using the jointly inferred iNMF factors--to the most similar scATAC-seq profiles. After this imputation step, we can perform downstream analyses as if we had true single-cell multi-omic profiles. For example, we can identify putative enhancers by correlating the expression of a gene with the accessibility of neighboring intergenic peaks across the whole set of single cells. - ->For convenience, we have prepared the pre-processed peak-level count data which is ready to use. The data can be downloaded [here](https://www.dropbox.com/sh/5e9cy4qabs89kc8/AADy3jDxHx94j6A57t7A42u3a?dl=0). +### 10. Inferring linkage between ATAC and RNA data -```{r, message=FALSE, warning=FALSE, paged.print=FALSE, results=FALSE} -pmat <- readRDS("GSM4138888_scATAC_BMMC_D5T1_peak_counts.RDS") -``` +Single-cell measurements of chromatin accessibility and gene expression provide an unprecedented opportunity to investigate epigenetic regulation of gene expression. Ideally, such investigation would leverage paired ATAC-seq and RNA-seq from the same cells, but such simultaneous measurements are not generally available. However, using LIGER, it is possible to computationally infer “pseudo-multi-omic” profiles by linking scRNA-seq profiles--using the jointly inferred iNMF factors--to the most similar scATAC-seq profiles. After this imputation step, we can perform downstream analyses as if we had true single-cell multi-omic profiles. For example, we can identify putative enhancers by correlating the expression of a gene with the accessibility of neighboring intergenic peaks across the whole set of single cells. -You can also expand the following tutorial to start from the beginning. +You can also expand the following tutorial to start from the beginning. ````{=html}
Detailed CellRanger peak count preprocessing -```` +```` To achieve this, we first need a matrix of accessibility counts within intergenic peaks. The CellRanger pipeline for scATAC-seq outputs such a matrix by default, so we will use this as our starting point. The count matrix, peak genomic coordinates, and source cell barcodes output by CellRanger are stored in a folder named `filtered_peak_matrix/` in the output directory. The user can load these and convert them into a peak-level count matrix by typing these commands: ```{r, eval=FALSE} -barcodes <- read.table('/outs/filtered_peak_bc_matrix/barcodes.tsv', sep = '\t', header = FALSE, as.is = TRUE)$V1 -peak.names <- read.table('/outs/filtered_peak_bc_matrix/peaks.bed', sep = '\t', header = FALSE) -peak.names <- paste0(peak.names$V1, ':', peak.names$V2, '-', peak.names$V3) -pmat <- readMM('/outs/filtered_peak_bc_matrix/matrix.mtx') -dimnames(pmat) <- list(peak.names, barcodes) +peak <- read10XATAC("/path/to/filtered_peak_barcode_matrix", sampleNames = "atac") ``` ````{=html} @@ -283,71 +284,64 @@ dimnames(pmat) <- list(peak.names, barcodes)
```` -**18.** The peak-level count matrix is usually large, containing hundreds of thousands of peaks. We next filter this set of peaks to identify those showing cell-type-specific accessibility. To do this, we perform the Wilcoxon rank-sum test and pick those peaks which are differentially accessible within a specific cluster. Before running the test, however, we need to make sure: +The peak-level count matrix is usually large, containing hundreds of thousands of peaks. We next filter this set of peaks to identify those showing cell-type-specific accessibility. To do this, we perform the Wilcoxon rank-sum test and pick those peaks which are differentially accessible within a specific cluster. Before running the test, however, we need to make sure: 1. The peak-level count matrix includes the same cells as the gene-level counts matrix -2. Peak count matrix is normalized to sum to 1 within each cell. +2. Peak count matrix is normalized to sum to 1 within each cell. -```{r message=FALSE, warning=FALSE, paged.print=FALSE, results=FALSE} -rawPeak(int.bmmc, "atac") <- pmat -int.bmmc <- normalizePeak(int.bmmc) +```{r} +if (!file.exists("liger_BMMC_atac_D5T1_peak.rds")) + download.file("https://figshare.com/ndownloader/files/40054864", + destfile = "liger_BMMC_atac_D5T1_peak.rds") +D5T1.peak <- readRDS("liger_BMMC_atac_D5T1_peak.rds") + +rawPeak(bmmcLiger, "atac") <- D5T1.peak +bmmcLiger <- normalizePeak(bmmcLiger) ``` -> Recall that we set `modal` argument when intiating the object with `createLiger()`. This creates "ligerATACDataset" class object as the dataset-specific container for the ATAC-seq data. Only in this way can we insert the peak count matrix to its designated slot with `rawPeak<-()` method. +> Recall that we set `modal` argument when intiating the object with `createLiger()`. This creates "ligerATACDataset" class object as the dataset-specific container for the scATAC data. Only in this way can we insert the peak count matrix to its designated slot with `rawPeak<-()` method. Now we can perform the Wilcoxon test on the peak counts with setting `usePeak = TRUE`: -```{r, message=FALSE, warning=FALSE, paged.print=FALSE, results=FALSE} -peak.wilcoxon <- runWilcoxon(int.bmmc, useDatasets = "atac", usePeak = TRUE, method = 'clusters') +```{r, eval=FALSE} +peak.wilcoxon <- runWilcoxon(bmmcLiger, useDatasets = "atac", usePeak = TRUE) ``` -**19.** We can now use the results of the Wilcoxon test to retain only peaks showing differential accessibility across our set of joint clusters. Here we kept peaks with Benjamini-Hochberg adjusted p-value < 0.05 and log fold change > 2. +We can now use the results of the Wilcoxon test to retain only peaks showing differential accessibility across our set of joint clusters. Here we kept peaks with Benjamini-Hochberg adjusted p-value < 0.05 and log fold change > 2. -```{r, message=FALSE, warning=FALSE, paged.print=FALSE, results=FALSE} -peak.wilcoxon <- peak.wilcoxon[peak.wilcoxon$padj < 0.05,] -peak.wilcoxon <- peak.wilcoxon[peak.wilcoxon$logFC > 2,] +```{r, eval=FALSE} +peak.wilcoxon <- peak.wilcoxon %>% filter(padj < 0.05, logFC > 2) peaks.sel <- unique(peak.wilcoxon$feature) -rawPeak(int.bmmc, "atac") <- rawPeak(int.bmmc, "atac")[peaks.sel,] -int.bmmc <- normalizePeak(int.bmmc) +rawPeak(bmmcLiger, "atac") <- rawPeak(bmmcLiger, "atac")[peaks.sel,] +bmmcLiger <- normalizePeak(bmmcLiger) ``` -**20.** Using this set of differentially accessible peaks, we now impute a set of “pseudo-multi-omic” profiles by inferring the intergenic peak accessibility for scRNA-seq profiles based on their nearest neighbors in the joint LIGER space. LIGER provides a function named `imputeKNN()` that performs this task, yielding a set of profiles containing both gene expression and chromatin accessibility measurements for the same single cells: +Using this set of differentially accessible peaks, we now impute a set of “pseudo-multi-omic” profiles by inferring the intergenic peak accessibility for scRNA-seq profiles based on their nearest neighbors in the joint LIGER space. LIGER provides a function named `imputeKNN()` that performs this task, yielding a set of profiles containing both gene expression and chromatin accessibility measurements for the same single cells: -```{r, message=FALSE, warning=FALSE, paged.print=FALSE, results=FALSE} -int.bmmc <- imputeKNN(int.bmmc, reference = 'atac', queries = "rna") +```{r, eval=FALSE} +bmmcLiger <- imputeKNN(bmmcLiger, reference = 'atac', queries = "rna") ``` -Important parameters of `imputeKNN()` are as follows: - -- `reference`. Dataset containing values to impute into query dataset(s). For example, setting `reference = "atac"` uses the values in dataset "atac" to predict chromatin accessibility values for scRNA-seq profiles. -- `queries`. Dataset to be augmented by imputation. For example, setting `query = "rna"` predicts chromatin accessibility values for scRNA-seq profiles. -- `nNeighbors`. The maximum number of nearest neighbors to use for imputation. The imputation algorithm simply builds a k-nearest neighbor graph using the aligned LIGER latent space, then averages values from the reference dataset across neighboring cells. The default value is `20`. -- `norm`. This indicates whether to normalize data after imputation. The default is `TRUE`. -- `scale`. This indicates whether to scale data after imputation. The default is `FALSE`. - -> Under the hook, `imputeKNN()` converts the dataset-specific container object for scRNA-seq data to "ligerATACDataset" class, which allows storing the predicted chromatin accessibility values while retaining the gene expression. +> Under the hook, `imputeKNN()` converts the dataset-specific container object for scRNA-seq data to "ligerATACDataset" class, which allows storing the predicted chromatin accessibility values while retaining the gene expression. > Old versions of LIGER worked in a way that the predicted accessibility value occupies the space where the original gene expression is stored. This usually led to the creation of a new object when running the imputation. Now both types of features are allowed to exist together, so users do not need to create a new object anymore. -**21.** Now that we have both the (imputed) peak-level counts matrix and the (observed) gene expression counts matrix for the same cells, we can evaluate the relationships between pairs of genes and peaks, linking genes to putative regulatory elements. We use a simple strategy to identify such gene-peak links: Calculate correlation between gene expression and peak accessibility of all peaks within 500 KB of a gene, then retain all peaks showing statistically significant correlation with the gene. The `linkGenesAndPeaks()` function performs this analysis: +Now that we have both the (imputed) peak-level counts matrix and the (observed) gene expression counts matrix for the same cells, we can evaluate the relationships between pairs of genes and peaks, linking genes to putative regulatory elements. We use a simple strategy to identify such gene-peak links: Calculate correlation between gene expression and peak accessibility of all peaks within 500 KB of a gene, then retain all peaks showing statistically significant correlation with the gene. The `linkGenesAndPeaks()` function performs this analysis: ```{r, eval=FALSE} -regnet <- linkGenesAndPeaks(int.bmmc, useDataset = "rna", pathToCoords = "hg19_genes.bed", method = "spearman", alpha = 0.05, useGenes = , verbose = , path_to_coords = , genes.list = , dist = ) -``` - -Important parameters of linkGenesAndPeaks are as follows: +if (!file.exists("hg19_genes.bed")) + download.file("https://figshare.com/ndownloader/files/40054870", + destfile = "hg19_genes.bed") -- `useGenes`. A vector of the genes to be tested. If not specified, this function will use all available genes in the dataset specified by `useDataset` by default. -- `method`. This specifies the type of correlation to calculate -- one of `"spearman"` (default), `"pearson"`, or `"kendall"`. -- `alpha`. Significance threshold for correlation p-value. Peak-gene correlations with p-values below this threshold are considered significant. The default is `0.05`. -- `pathToCoords`. The path to the gene coordinates file (in BED format). We recommend passing in the same BED file used for making barcode list in **Step 1** (need to expand the CellRanger preprocessing block). +regnet <- linkGenesAndPeaks(bmmcLiger, useDataset = "rna", pathToCoords = "hg19_genes.bed", method = "spearman", alpha = 0.05) +``` -**22.** The output of this function is a sparse matrix with peak names as rows and gene symbols as columns, with each element indicating the correlation between peak *i* and gene *j*. The value would be 0 if the corresponding gene and peak are not significantly linked. For example, we can subset the results for marker gene S100A9, which is highly expressed in cluster 6, 8, 9, 10 and 12: +The output of this function is a sparse matrix with peak names as rows and gene symbols as columns, with each element indicating the correlation between peak *i* and gene *j*. The value would be 0 if the corresponding gene and peak are not significantly linked. For example, we can subset the results for marker gene S100A9, which is highly expressed in cluster 6, 8, 9, 10 and 12: ```{r, eval=FALSE} S100A9 <- regnet[, 'S100A9'] S100A9 <- S100A9[abs(S100A9) > 0] -View(S100A9[order(abs(S100A9), decreasing = TRUE)]) +S100A9[order(abs(S100A9), decreasing = TRUE)] ``` We also provide a function to transform the peaks-gene correlation matrix into an Interact Track supported by UCSC Genome Browser for visualizing the calculated linkage between genes and correlated peaks. To do this, tying in: @@ -356,14 +350,7 @@ We also provide a function to transform the peaks-gene correlation matrix into a exportInteractTrack(regnet, useGenes = "S100A9", pathToCoords = "hg19_genes.bed") ``` -Important parameters of makeInteractTrack are as follows: - -- `corrMat`. A peaks x genes sparse matrix containing inferred gene-peak links, the output of `linkGenesAndPeaks()`. -- `useGenes`. A vector of the gene symbols to be included in the output Interact Track file. If not specified, this function will use all the gene symbols from `corrMat` by default. -- `pathToCoords`. The path to the gene coordinates file (in BED format). We recommend passing in the same BED file path as used for `linkGenesAndPeaks()`. -- `outputPath`. The path to the directory in which the Interact Track file will be stored. The default is the working directory. - -The output of this function will be a UCSC Interact Track file named ‘Interact_Track.bed’ containing linkage information of the specified genes and correlated peaks stored in given directory. The user can then upload this file as a custom track using this page https://genome.ucsc.edu/cgi-bin/hgCustom and display it in the UCSC Genome browser. +The output of this function will be a UCSC Interact Track file named ‘Interact_Track.bed’ containing linkage information of the specified genes and correlated peaks stored in given directory. The user can then upload this file as a custom track using this page https://genome.ucsc.edu/cgi-bin/hgCustom and display it in the UCSC Genome browser. As an example, the three peaks most correlated to S100A9 expression are shown below in the UCSC genome browser. One of the peaks overlaps with the TSS of S100A8, a neighboring gene that is co-expressed with S100A9, while another peak overlaps with the TSS of S100A9 itself. The last peak, chr1:153358896-153359396, does not overlap with a gene body and shows strong H3K27 acetylation across ENCODE cell lines, indicating that this is likely an intergenic regulatory element. @@ -371,8 +358,8 @@ As an example, the three peaks most correlated to S100A9 expression are shown be If we plot the accessibility of this peak and the expression of S100A9, we can see that the two are indeed very correlated and show strong enrichment in clusters 1 and 3. Thus, the intergenic peak likely serves as a cell-type-specific regulator of S100A9. -```{r , message=FALSE, warning=FALSE, paged.print=FALSE, fig.align='center', fig.height=8, fig.width=6} -S100A9 <- plotGeneDimRed(int.bmmc, "S100A9", splitBy = "dataset") -peak1 <- plotPeakDimRed(int.bmmc, "chr1:153358896-153359396") -cowplot::plot_grid(peak1, S100A9[[2]], nrow = 2, align = "v", axis = "lr") +```{r, fig.align='center', fig.height=8, fig.width=6} +S100A9 <- plotGeneDimRed(bmmcLiger, "S100A9", splitBy = "dataset") +peak1 <- plotPeakDimRed(bmmcLiger, "chr1:153358896-153359396") +cowplot::plot_grid(peak1, S100A9$S100A9.rna, nrow = 2, align = "v", axis = "lr") ``` diff --git a/vignettes/articles/installation.Rmd b/vignettes/articles/installation.Rmd new file mode 100644 index 00000000..33e76613 --- /dev/null +++ b/vignettes/articles/installation.Rmd @@ -0,0 +1,100 @@ +--- +title: "Install LIGER with R" +author: "Yichen Wang" +date: "2023-11-05" +output: + html_document: + toc: 3 + toc_float: true +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE, eval = FALSE) +``` + +Before setting up the "rliger" package, users should have [R](https://www.r-project.org/) version 3.6.0 or higher. + +## LIGER is available on CRAN + +To install the latest stable release, please run the following command in R console: + +```{R} +install.packages("rliger") +``` + +## Lastest developmental version + +The developmental versions are accessible from GitHub. We are currently contributing to the repository "mvfki/liger" in branch "newObj". + +```{R} +if (!requireNamespace("devtools")) install.packages("devtools") +devtools::install_github("mvfki/liger@newObj") +``` + +If you encounter issues when building from GitHub (source), please see below sections for details about some dependencies. + +**Please note that the latest version (>=1.9.9) is massively updated compared to prior versions, and functions are not directly compatible with the old version objects which users might possess. Please use `readLiger()` to load old RDS files or `convertOldLiger()` to convert a loaded object into the up-to-date structure. You might need to be careful when overwriting existing analysis because we don't provide methods to convert new the structure backwards.** + +### Compiler setup + +Installation from CRAN is easy because pre-compiled binaries are available for Windows and MacOS. However, a few additional steps are required to install from source on MacOS/Windows. + +Windows users will need to install [Rtools](https://cran.r-project.org/bin/windows/Rtools/) which can be downloaded from CRAN. After downloading it, open the file and follow the prompt to install. + +MacOS users will need Clang and gfortran, which can also be found on [CRAN Mac Tools](https://mac.r-project.org/tools/). + +### Installing HDF5 Library + +HDF5 library is required for interacting with H5 files. It can be installed via the following commands. + +| System | Command +|:------------------------------------------|:---------------------------------| +|OS X (using Homebrew or Conda) | `brew install hdf5` or `conda install -c anaconda hdf5` +|Debian-based systems (including Ubuntu) | `sudo apt-get install libhdf5-dev` +|Systems supporting yum and RPMs | `sudo yum install hdf5-devel` +|Windows | Go to [HDF5 website](https://www.hdfgroup.org/downloads/hdf5/) and download the proper installer + +### Installing RcppPlanc + +[RcppPlanc](https://github.com/welch-lab/RcppPlanc) is an extension R package built basing on [Planc](https://github.com/ramkikannan/planc). We implemented highly optimized iNMF algorithm and its variants here for the new version upgrade. It can be obtained also from CRAN with: + +```{R} +install.packages("RcppPlanc") +``` + +For latest developmental version, please refer to [RcppPlanc GitHub repository](https://github.com/welch-lab/RcppPlanc) for detail. + +### Installing Flt-SNE + +For dimensionality reduction, we have options of UMAP and t-SNE. For UMAP, we depend on "uwot" which is readily scalable with large datasets. For t-SNE, we by default use "Rtsne", which is not scalable for large datasets. We allow using another implementation of t-SNE, namingly Flt-SNE, for efficient computation. Flt-SNE is not distributed with the package or CRAN, thus users need to install it following steps below. + +#### 1. Install FFTW library. + +Download FFTW from [here](http://www.fftw.org/). For UNIX like users, run the following command in terminal. + +```{bash} +tar -xvzf fftw-x.x.x.tar.gz +cd fftw-x.x.x.tar.gz +./configure +make +make install +``` + +#### 2. Install Flt-SNE + +`cd` to a desired location, and run + +```{bash} +git clone https://github.com/KlugerLab/FIt-SNE.git +cd FIt-SNE +g++ -std=c++11 -O3 src/sptree.cpp src/tsne.cpp src/nbodyfft.cpp -o bin/fast_tsne -pthread -lfftw3 -lm +pwd +``` + +#### 3. Run t-SNE with Flt-SNE + +Now, we've had the binary compiled, and command `pwd` shows the location to the binary. Then back to an R LIGER analysis, the following command can be used to invoke the binary. + +```{R} +ligerObj <- runTSNE(ligerObj, method = "fftRtsne", fitsnePath = "whatPWDShows") +```