Skip to content

Distribution analysis across clusters of a parent population overlaid with a rare subpopulation (DISCOV-R)

License

Notifications You must be signed in to change notification settings

BenaroyaResearch/briDiscovr

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

briDiscovr

Functions implementing and supporting the "Distribution analysis across clusters of a parent population overlaid with a rare subpopulation" (DISCOV-R) analysis method (Wiedeman, Muir, et al. 2020).

NB - This code has been tested with R version 4.0.1 and 4.0.2. The change from R version 3.x to 4.x came with a number of substantial differences, and so while there's a possibility this package may work with earlier versions of R, it should be considered untested.


Installation

  1. Install R v4.0+ from https://cloud.r-project.org/
  1. Open R and install devtools using install.packages("devtools")
  2. Install this package using devtools::install_github("BenaroyaResearch/briDiscovr")
  • If you encounter difficulties with any dependency packages, please see the "Dependencies" section at the bottom of this page.

Usage

This package is based around the discovrExperiment object class, which is a list that holds all of the information about your experiment. You can save, load, and copy discovrExperiment objects to facilitate different phases of the analysis.

Preparing to start your analysis

Before running DISCOV-R, you'll need to prepare a marker information file and an. fcs information file. Both of these can be created in Excel, and should be saved in the 'comma-separated values (.csv)' format.

marker information file

This file should contain 3 columns:

  • A column for the common name of your marker (eg: "CD45RA") - by default this column is expected to be named 'fixed'
  • A column for the marker name as it's represented in the .fcs data (eg: "143Nd_CD45RA") - by default this column is expected to be named 'desc'
  • A column named 'useToCluster' that indicates whether the marker will be used for clustering. This column can only contain the values "TRUE"" or "FALSE".

Example "markerInformation.csv":

fixed desc useToCluster
CD45 89Y_CD45 TRUE
GRZMA 141Pr_GRZMA FALSE
... ... ...

.fcs information file

This file should contain 3 columns:

  • A column named 'subject' indicating the subject/sample ID (eg: "SubjXpb997364476").
    • Note that this should NOT start with a number.
  • A column named 'cellSubset' indicating the parent and child populations to be used in the analysis (eg: "A2_CD8_T_Cells")
  • A column named 'filename' indicating where to find the .fcs file for that patient and cell subset (eg: "/Users/mrosasco/Documents/projects/briUtils/briDiscovr/testFiles/Xpb997364476_A2_CD8_T_Cells.fcs").
    • Note that it's possible to use paths relative to your working directory, but it's generally preferred to provide an absolute path.

Example "fcsManifest.csv":

subject cellSubset filename
SubjXpb997364476 A2_CD8_T_Cells /Users/mrosasco/Documents/projects/briUtils/briDiscovr/testFiles/Xpb997364476_A2_CD8_T_Cells.fcs
SubjXpb997364476 A2_Er168+_EBV /Users/mrosasco/Documents/projects/briUtils/briDiscovr/testFiles/Xpb997364476_A2_Er168+_EBV.fcs
... ... ...

Downsampling FCS files

Some large datasets may exhaust the available memory or require a very long time to analyze. The package includes code to downsample each FCS file to a maximum number of events, in order to reduce memory usage and computational time. This is done using the function downsampleFcsFiles(). Detailed documentation can be accessed by ?downsampleFcsList. There are two different ways to use this function. (1) write out a new set of FCS files and a new .fcs information file, or (2) generate a list of vectors that can be passed to setupDiscovrExperiment() to downsample the files as they are read in.

This downsampling can be made reproducible by specifying a random number seed using the seed argument. In addition, you can specify to only downsample files from specific populations (e.g. the parent population when working with antigen-specific cells and a parent population).

library(briDiscovr)

myDownsampleVectorList <- downsampleFcsList(
    fcsInfoFile = "fcsManifest.csv",
    maxEvents = 10000,
    downsampleMode = "storeVectors",
    seed = 12345
)

Setting up your experiment

Once you have the information about your study design in your marker info and .fcs info files, you're ready to begin your analysis. This is done by using the setupDiscovrExperiment() function as shown below. More information about how to use this function can be accessed using ?setupDiscovrExperiment.

library(briDiscovr)

myExpt <- setupDiscovrExperiment(
    markerInfoFile = "markerInformation.csv",
    fcsInfoFile = "fcsManifest.csv",
    parentPopulation = "A2_CD8_T_Cells"
)

This will load in the .fcs files, check for any issues with the data, clean the marker names, and transform the event data using an arcsinh transform. For details on the appropriate transformation parameters for CyTOF and FACS data, please see ?setupDiscovrExperiment.

If you have run downsampling and generated a list of vectors for downsampling the input .fcs files, this can be passed to setupDiscovrExperiment() using the argument downsampleVectorList.

Once the data are loaded, you can access summary information by using the functions print() and getSubjectCounts(). This is useful to do before clustering as a 'sanity check' to make sure that your markers look as expected, and that you have reasonable numbers of event counts for all individuals in your dataset.

print(myExpt)
## Prints out:
# An object of class 'discovrExperiment'
# All markers: CD45, GRZMA, CD57, CD45RA, CD38, CD8, CD4, KLRG1, CD14, CD127, CD226, HELIOS, CD2, NKG2C, CD3, TIGIT, CD25, CD27, CD161, TBET, CD39, EOMES, CXCR3, CD95, CD19, NKG2A, CCR7, CD122, CD103, KI67, PD1, CD56, CD16
# Clustering markers: CD45, GRZMA, CD57, CD45RA, CD38, CD8, CD4, KLRG1, CD14, CD127, CD226, HELIOS, CD2, NKG2C, CD3, TIGIT, CD25, CD27, CD161, TBET, CD39, EOMES, CXCR3, CD95, CD19, NKG2A, CCR7, CD122, CD103, KI67, PD1, CD56, CD16
# Experiment status: initialized

getSubjectCounts(myExpt)
## Prints out:
#             sample nEvents
# 1 SubjXpb997364476  172634

The experiment can then be saved using normal R data handling utilities. For example:

saveRDS(myExpt, "200709-testExpt.RDS")
myExpt <- readRDS("200709-testExpt.RDS")

Clustering

Clustering is handled through the function clusterDiscovrExperiment(). Note that this step takes the longest, and may require several hours depending on the size of the dataset. Additional information about this function can be accessed using ?clusterDiscovrExperiment. You can use the utility function getSubjectClusters() to assess the results of the clustering.

Note that this function returns a discovrExperiment object, and so you'll need to assign this to a variable. This can be the same variable you use to store your initialized discovrExperiment.

myExpt <- clusterDiscovrExperiment(myExpt)
# May run for several hours

print(myExpt)
## Prints:
# An object of class 'discovrExperiment'
# All markers: CD45, GRZMA, CD57, CD45RA, CD38, CD8, CD4, KLRG1, CD14, CD127, CD226, HELIOS, CD2, NKG2C, CD3, TIGIT, CD25, CD27, CD161, TBET, CD39, EOMES, CXCR3, CD95, CD19, NKG2A, CCR7, CD122, CD103, KI67, PD1, CD56, CD16
# Clustering markers: CD45, GRZMA, CD57, CD45RA, CD38, CD8, CD4, KLRG1, CD14, CD127, CD226, HELIOS, CD2, NKG2C, CD3, TIGIT, CD25, CD27, CD161, TBET, CD39, EOMES, CXCR3, CD95, CD19, NKG2A, CCR7, CD122, CD103, KI67, PD1, CD56, CD16
# Experiment status: clustered

getSubjectClusters(myExpt)
## Prints:
#             sample kClusters
# 1 SubjXpb997364476        26

Metaclustering

Metaclustering is handled throught the function metaclusterDiscovrExperiment(). The documentation at ?metaclusterDiscovrExperiment contains information about how to change cluster occupancy thresholds, target numbers of metaclusters, metaclustering distance and linkage metrics, and how to exclude markers or samples from the metacluster analysis. Note that dropping markers or samples will not affect the clustered results, and to exclude samples altogether from the clustering a new experiment should be started from the initialization step.

Note that as with the clustering function, metaclusterDiscovrExperiment will return a new discovrExperiment object that needs to be assigned to a variable.

myExpt <- metaclusterDiscovrExperiment(myExpt)

print(myExpt)
## Prints:
# An object of class 'discovrExperiment'
# All markers: CD45, GRZMA, CD57, CD45RA, CD38, CD8, CD4, KLRG1, CD14, CD127, CD226, HELIOS, CD2, NKG2C, CD3, TIGIT, CD25, CD27, CD161, TBET, CD39, EOMES, CXCR3, CD95, CD19, NKG2A, CCR7, CD122, CD103, KI67, PD1, CD56, CD16
# Clustering markers: CD45, GRZMA, CD57, CD45RA, CD38, CD8, CD4, KLRG1, CD14, CD127, CD226, HELIOS, CD2, NKG2C, CD3, TIGIT, CD25, CD27, CD161, TBET, CD39, EOMES, CXCR3, CD95, CD19, NKG2A, CCR7, CD122, CD103, KI67, PD1, CD56, CD16
# Experiment status: metaclustered

After performing metaclustering, you can retrieve the event counts and fraction of events for each subject and subset that were assigned to each metacluster using the utility function getMetaclusterOccupancy.

getMetaclusterOccupancy(myExpt)

## Prints:
#             subject    metacluster A2_CD8_T_Cells_frac A2_Er168+_EBV_frac A2_Tm169+_CMV_frac A2_Yb174+_Flu_frac A2_CD8_T_Cells_cts
# 1  SubjXpb997364476  metacluster_1                0.01               0.00               0.00               0.00               1984
# 2  SubjXpb997364476  metacluster_2                0.16               0.11               0.24               0.12              25815
# 3  SubjXpb997364476  metacluster_3                0.15               0.14               0.19               0.15              23955
# 4  SubjXpb997364476  metacluster_4                0.29               0.42               0.04               0.01              46547
# 5  SubjXpb997364476  metacluster_5                0.05               0.08               0.08               0.01               8647
# 6  SubjXpb997364476  metacluster_6                0.11               0.07               0.13               0.01              18105
# 7  SubjXpb997364476  metacluster_7                0.01               0.00               0.16               0.44               1950
# 8  SubjXpb997364476  metacluster_8                0.04               0.03               0.01               0.02               6746
# 9  SubjXpb997364476  metacluster_9                0.07               0.04               0.03               0.02              11589
# 10 SubjXpb997364476 metacluster_10                0.04               0.03               0.01               0.02               6430
# 11 SubjXpb997364476 metacluster_11                0.05               0.06               0.10               0.20               7683
# 12 SubjXpb997364476 metacluster_12                0.02               0.01               0.01               0.00               3720
#    A2_Er168+_EBV_cts A2_Tm169+_CMV_cts A2_Yb174+_Flu_cts
# 1                  2                 3                 0
# 2                 53               530                88
# 3                 64               432               110
# 4                194                80                 7
# 5                 39               170                 4
# 6                 34               300                 9
# 7                  2               356               321
# 8                 12                31                14
# 9                 18                75                11
# 10                14                22                12
# 11                26               219               146
# 12                 5                15                 0

These values are returned as a standard data frame, and can be saved or manipulated as usual, eg:

metaxFreqValues = getMetaclusterOccupancy(myExpt)
write.csv(
  metaxFreqValues, 
  file = file.path("Users", "mrosasco", "discovrOutput", format(Sys.Date(), "%y%m%d-clusterOccupancy.csv"))
)

Heatmaps

Once you've performed metaclustering, you can use the function makeMetaclusterHeatmaps() to produce and save heatmaps to summarize your data. By default this function assumes that the first subset listed in the fcsInfoFile used to initialize the experimnet is the parent population (eg: all CD8+s), and that all the remaining cell subsets are the rare/child populations. Details on explicitly assigning these populations for plotting are available in the docs at ?makeMetaclusterHeatmaps. By default this function will save all heatmaps in the current working directory with the current date as the prefix, but this can be changed as in the example below.

makeMetaclusterHeatmaps(
  myExpt,
  filenamePrefix = file.path(
    "Users", "mrosasco", "discovrOutput", format(Sys.Date(), "%y%m%d-")
  )
)

UMAP

Once you've performed metaclustering, you can use the function runUmapDiscovrExperiment() to run UMAP on z-score normalized event marker values within each sample. It optionally downsamples the cells to speed the process. This function returns a data frame with the UMAP coordinates for each cell, as well as the original cell population, sample information, and metacluster if available. The outputs are intended to be visualized using plotting software such as ggplot2.

umapObj <- runUmapDiscovrExperiment(
  myExpt,
  umapMarkers = NULL,
  downsampleFreq = c(parentPopulation = 100, childPopulations = 1),
  seed = NULL,
  returnUmapObject = FALSE,
  returnExpressionZScores = FALSE)

#Plot the UMAP, color by metacluster

umapObj %>%
  ggplot(aes(x = UMAP1, 
             y = UMAP2, 
             color = metacluster)) +
  geom_point()

Dependencies

This package depends on bioconductor packages, which may run into issues when installing automatically. If you encounter any difficulties during installation, you can manually install the bioconductor dependencies using the following code:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
    
BiocManager::install("flowCore")
BiocManager::install("ComplexHeatmap")

The remaining depndencies should be automatically installed via CRAN, but if you encounter any issues you can manually install using the following code:

install.packages(pkgs=c("rlang", "igraph", "RANN", "Rcpp", "methods", "tidyverse", "reshape2", "RColorBrewer", "circlize", "viridisLite")

About

Distribution analysis across clusters of a parent population overlaid with a rare subpopulation (DISCOV-R)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published