This repository corresponds to the analysis and pipeline outlined in *Characterizing the impacts of dataset imbalance on single-cell data integration*. If you use the integration guidelines or any aspect of this pipeline, please consider citing our work.
- Using the imbalanced integration guidelines
- Reproducing the paper analysis
- Custom data perturbation configuration setup
- Citation information
A separate README for the imbalanced integration guidelines, with full environment installation instructions, is available in the `docs` folder.
Please note that `mamba` (https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html) and `snakemake` (https://snakemake.readthedocs.io/en/stable/getting_started/installation.html) are required to run the pipeline through `conda`. After installing `conda` (https://conda.io/projects/conda/en/latest/user-guide/install/index.html), please add `mamba` to the base environment, and install `snakemake` in the base or a new environment. Please note that the snakemake version has to be restricted, as recent changes break the current setup:
```
conda install -n base -c conda-forge mamba
mamba create -c conda-forge -c bioconda "snakemake<7.19.1" -n snakemake
mamba activate snakemake
```
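As an optional sanity check, you can confirm that both tools are available in the activated environment:

```
mamba --version
snakemake --version
```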
The environments necessary to run the pipeline will be automatically installed by snakemake. The only exception is the `analysis` pipeline, which is outlined in step 5 below.
The following steps are necessary to reproduce the paper analysis:
1. Clone the GitHub repository:

    ```
    git clone https://github.com/hsmaan/Iniquitate.git
    ```
2. Download the resources utilized in the study, then extract and move them to the Iniquitate directory. The data can be downloaded from Figshare or Google Drive:

    - Figshare: https://doi.org/10.6084/m9.figshare.24625302.v1
    - Google Drive: download the data from https://drive.google.com/file/d/1gWsYEI_u0Bn-7liar1XmvcrFqdt3IHjV/view?usp=sharing. Alternatively, you can use gdown (https://github.com/wkentaro/gdown) if a command-line download is needed or desired.
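      For example, a gdown-based download might look as follows (the file ID is taken from the Google Drive link above; this is a sketch, not part of the pipeline itself):

      ```
      pip install gdown
      gdown "https://drive.google.com/uc?id=1gWsYEI_u0Bn-7liar1XmvcrFqdt3IHjV" -O resources.tar.gz
      ```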
    After downloading:

    ```
    tar -xzvf resources.tar.gz
    # If the extracted folder is not named `resources`, rename it first:
    # mv [file_folder] resources
    mv resources Iniquitate
    ```
3. Run the different configurations utilized in the study through the Snakemake pipeline:

    - Change the configuration option at the top of `workflow/Snakefile`. The following configs were utilized for different analyses in the study:
        - `config_control`
        - `config_lowcap_modified`
        - `config_pdac_comp`
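      For example, to select the control configuration, the `configfile` line at the top of `workflow/Snakefile` would read as follows (this assumes the config file names mirror the config names above, following the custom-config example later in this README):

        ```
        # workflow/Snakefile, line 3: select the config for the desired analysis
        configfile: "configs/config_control.json"
        ```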
    - Run the Snakemake pipeline specific to the selected config:

        ```
        snakemake --unlock
        snakemake -j 1000 \
            --use-conda \
            --cluster-config cluster.json \
            --cluster "sbatch \
                --mem={cluster.mem} \
                --gres=gpu:{cluster.gpu} \
                -t {cluster.time} \
                -p {cluster.partition} \
                -c {threads}" \
            --restart-times 0 \
            --latency-wait 300 \
            --keep-going \
            --rerun-incomplete
        ```
4. Create a `cluster.json` configuration for HPC execution. The above Snakemake run utilizes a `workflow/cluster.json` configuration file and HPC parallelization of the various steps in the pipeline. Users will need to create a `cluster.json` file specific to their HPC setup that has resources for all of the rules in `workflow/Snakefile`. Alternatively, users can also choose to employ Snakemake profiles; details can be found here: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html.

    Further, all of the temporary integration files will by default be stored in `/tmp`, based on the `TMPDIR` variable in `workflow/envs/integrate.yaml`. These files can be quite large and pile up during runtime, even though they are deleted at the end of each run. We therefore recommend that users set this directory to one that contains enough space - at least 100 GB.
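    A minimal `cluster.json` sketch is shown below. The rule names and resource values are placeholders rather than the repository's actual requirements; the keys must match the `{cluster.*}` fields referenced in the `sbatch` command above, and the `__default__` entry covers any rule without its own settings:

    ```
    {
        "__default__": {
            "mem": "16G",
            "gpu": 0,
            "time": "12:00:00",
            "partition": "general"
        },
        "integrate": {
            "mem": "64G",
            "gpu": 1,
            "time": "24:00:00",
            "partition": "gpu"
        }
    }
    ```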
5. Analyze the results using the R and Python scripts/notebooks:

    - Begin by installing the analysis conda environment:

        ```
        cd Iniquitate/workflow/envs
        mamba env create -f analysis.yaml
        ```

    - First run the Python notebook to completion for the metric analysis:

        ```
        conda activate iniq_analysis
        jupyter-notebook 01_Fig_7_Imbal_Metric_Analysis.ipynb
        ```

    - Run the R scripts in order, through RStudio or the command line:

        ```
        conda activate iniq_analysis
        Rscript 05_Iniq_Control_Fig_2_Analysis_Plots.R
        Rscript 06_Iniq_Control_Fig_2_Analysis_Stat_Tests.R
        ...
        ```
It is not possible to re-run all of the perturbation experiments and downstream analyses in a reasonable amount of time without high-performance computing (HPC), and it is therefore highly recommended that the workflow be parallelized over HPC.

It is also recommended to run the R and Python analysis notebooks in an HPC environment, as some of the steps are memory-intensive. In particular, we don't recommend running Rscripts 08 or 09 without HPC, as they involve time-intensive sampling experiments.
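As a rough illustration, one of the heavier R scripts could be submitted to a SLURM cluster along the following lines (the resource values and the `08_<script_name>.R` placeholder are hypothetical and should be adjusted to the actual script and HPC setup):

```
# Hypothetical SLURM submission for a memory-intensive analysis script
sbatch --mem=128G -t 24:00:00 -c 4 \
    --wrap "conda run -n iniq_analysis Rscript 08_<script_name>.R"
```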
The same software requirements for the paper analysis apply to custom data perturbation experiments (conda, mamba, snakemake). Please see the first part of *Reproducing the paper analysis* for details on how to install these requirements.
The following steps are necessary to use a custom dataset:
1. Process the batches/samples as necessary and convert them to `h5ad` format. Create a folder in `resources/h5ad_files/int_datasets/` (e.g. `resources/h5ad_files/int_datasets/custom_data`) and move the batches/samples to this location.
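    For example (the dataset folder and batch file names here are placeholders):

    ```
    mkdir -p resources/h5ad_files/int_datasets/custom_data
    mv batch_1.h5ad batch_2.h5ad resources/h5ad_files/int_datasets/custom_data/
    ```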
2. Modify the custom dataset configuration file (`workflow/configs/config_custom.json`). This .json file has the following parameters (a sketch of the file's structure is shown after this list):

    - `config_name` - this can be left as custom, or changed to a different name, but this same name must be used when modifying the `Snakefile`
    - `int_datasets` - this is a nested dictionary of the datasets to be used in the downsampling experiments. In this case, it's best to use the name of the dataset folder in `resources/h5ad_files/int_datasets` as the top-level name
        - `data_folder` - this value should be the same as the folder containing the batches/samples in `resources/h5ad_files/int_datasets/`
        - `ds_celltypes` - this value indicates how many cell-types to downsample and/or ablate in each run
        - `ds_proportions` - this value indicates what proportion of cell-types remain in the downsampled batch after the perturbation. In this case, 0 would indicate ablation, and 0.1 would indicate the same level of downsampling used for the main experiments in the study
        - `num_batches` - the number of batches to downsample in each perturbation. The 0 option is included here so that control experiments are possible (no perturbation to any batches)
        - `repetitions` - how many experiments to perform based on the given grid of `ds_celltypes`, `ds_proportions`, and `num_batches`. 200 is a good starting point. If the space of possible cell-types is very large (n celltypes > 20), then it may be useful to increase this value to ensure each cell-type is downsampled/ablated enough times in the total number of runs
    - `int_ti_datasets` - if any datasets have an underlying trajectory, and PAGA-based integration needs to be performed, then they should be added here with the same options indicated in `int_datasets`
    - `query_to_reference` - a "Yes" or "No" option indicating whether or not to perform query-to-reference experiments. Currently this functionality is not available, but custom query-to-reference setups will be available soon. This should be left as "No"
    - `celltype_list` - if the user has a list of specific cell-types to downsample (and not others), they can be included here as a json list of strings based on their names. We don't recommend specifying certain cell-types, as a-priori knowledge of the effects of downsampling/perturbation may not be accurate
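    A minimal sketch of what such a config might look like, assuming a dataset folder named `custom_data` (the exact key nesting may differ from the shipped `config_custom.json`, which should be treated as the authoritative template):

    ```
    {
        "config_name": "custom",
        "int_datasets": {
            "custom_data": {
                "data_folder": "custom_data",
                "ds_celltypes": [1],
                "ds_proportions": [0, 0.1],
                "num_batches": [0, 1],
                "repetitions": 200
            }
        },
        "int_ti_datasets": {},
        "query_to_reference": "No",
        "celltype_list": []
    }
    ```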
3. Modify the `Snakefile` in `workflow/Snakefile` at line 3, in reference to the name of the specific config being used. In the example given, the `configfile` line would be changed to:

    ```
    configfile: "configs/config_custom.json"
    ```
4. Run snakemake:

    ```
    snakemake --unlock
    snakemake -j 1000 \
        --use-conda \
        --cluster-config cluster.json \
        --cluster "sbatch \
            --mem={cluster.mem} \
            --gres=gpu:{cluster.gpu} \
            -t {cluster.time} \
            -p {cluster.partition} \
            -c {threads}" \
        --restart-times 0 \
        --latency-wait 300 \
        --keep-going \
        --rerun-incomplete
    ```
    Note that the above Snakemake run utilizes a `workflow/cluster.json` configuration file and HPC parallelization of the various steps in the pipeline. As with the paper analysis, users will need to create a `cluster.json` file specific to their HPC setup that has resources for all of the rules in `workflow/Snakefile`.
5. Analyze the result files using the R scripts - all of the R scripts in the `workflow/analysis/R` folder can be used to analyze the results of the perturbation experiments. Currently, an exhaustive list of custom analysis scripts is not available, but the existing scripts can be modified to suit the needs of the user. We provide one custom script for analysis of KNN classification accuracy in `workflow/analysis/R/knn_example.R`. Please note that this file still needs to be modified in the appropriate input locations, which are indicated in the comments of the file.
Maan, H. et al. (2024) ‘Characterizing the impacts of dataset imbalance on single-cell data integration’, *Nature Biotechnology*. Available at: https://doi.org/10.1038/s41587-023-02097-9.