Commit 21220e5

add all new features and updated rulegraph to the docs #19
1 parent ed6244d commit 21220e5

3 files changed: +586 -248 lines changed

README.md

+130 -22
@@ -1,7 +1,9 @@
11
# Unsupervised Analysis Workflow
2-
A general purpose [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow to perform selected unsupervised analyses and visualizations of high-dimensional (normalized) data.
2+
A general purpose [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow to perform unsupervised analyses (dimensionality reduction and cluster analysis) and visualizations of high-dimensional data.
33

4-
**If you use this workflow in a publication, don't forget to give credits to the authors by citing the URL of this (original) repository (and its DOI, see Zenodo badge above -> coming soon).**
4+
This workflow adheres to the module specifications of [MR.PARETO](https://github.com/epigen/mr.pareto), an effort to augment research by modularizing (biomedical) data science. For more details and modules check out the project's repository.
5+
6+
**If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI [coming soon]().**
57

68
![Workflow Rulegraph](./workflow/dags/rulegraph.svg)
79

@@ -14,26 +16,38 @@ Table of contents
1416
* [Usage](#usage)
1517
* [Configuration](#configuration)
1618
* [Examples](#examples)
19+
* [scRNA-seq Analysis](#single-cell-rna-sequencing-scrna-seq-data-analysis)
1720
* [Links](#links)
21+
* [Resources](#resources)
22+
* [Publications](#publications)
23+
1824

1925
# Authors
2026
- [Stephan Reichl](https://github.com/sreichl)
27+
- [Raphael Bednarsky](https://github.com/bednarsky)
28+
- [Christoph Bock](https://github.com/chrbock)
2129

2230
# Software
2331
This project wouldn't be possible without the following software and their dependencies:
2432

2533
| Software | Reference (DOI) |
2634
| :------------: | :-----------------------------------------------: |
35+
| clusterCrit | https://CRAN.R-project.org/package=clusterCrit |
36+
| clustree | https://doi.org/10.1093/gigascience/giy083 |
2737
| ComplexHeatmap | https://doi.org/10.1093/bioinformatics/btw313 |
2838
| densMAP | https://doi.org/10.1038/s41587-020-00801-7 |
2939
| ggally | https://CRAN.R-project.org/package=GGally |
3040
| ggplot2 | https://ggplot2.tidyverse.org/ |
3141
| ggrepel | https://CRAN.R-project.org/package=ggrepel |
42+
| igraph | https://doi.org/10.5281/zenodo.3630268 |
43+
| leidenalg | https://doi.org/10.5281/zenodo.1469356 |
3244
| pandas | https://doi.org/10.5281/zenodo.3509134 |
3345
| patchwork | https://CRAN.R-project.org/package=patchwork |
3446
| PCA | https://doi.org/10.1080/14786440109462720 |
3547
| plotly express | https://plot.ly |
48+
| pymcdm | https://doi.org/10.1016/j.softx.2023.101368 |
3649
| scikit-learn | http://jmlr.org/papers/v12/pedregosa11a.html |
50+
| scipy | https://doi.org/10.1038/s41592-019-0686-2 |
3751
| Snakemake | https://doi.org/10.12688/f1000research.29032.2 |
3852
| umap-learn | https://doi.org/10.21105/joss.00861 |
3953

@@ -55,58 +69,152 @@ Uniform Manifold Approximation projection (UMAP) from umap-learn (ver) [ref] was
5569
Hierarchically clustered heatmaps of scaled data (z-score) were generated using the R package ComplexHeatmap (ver) [ref]. The distance metric [metric] and clustering method [clustering_method] were used to determine the hierarchical clustering of observations (rows) and features (columns), respectively. The heatmap was annotated with metadata [metadata_of_interest]. The values were colored by the top percentiles (0.01/0.99) of the data to avoid shifts in the coloring scheme caused by outliers.
5670

5771
**Visualization**
58-
The R-packages ggplot2 (ver) [ref] and patchwork (ver) [ref] were used to generate all 2D visualizations colored by metadata [metadata] and/or feature(s) [features_to_plot].
59-
Interactive visualizations in self-contained HTML files of all 2D and 3D projections and embeddings were generated using plotly express (ver) [ref].
72+
The R-packages ggplot2 (ver) [ref] and patchwork (ver) [ref] were used to generate all 2D visualizations colored by metadata [metadata], feature(s) [features_to_plot], and/or clustering results.
73+
Interactive visualizations in self-contained HTML files of all 2D and 3D projections/embeddings were generated using plotly express (ver) [ref].
6074

61-
**The analysis and visualizations described here were performed using a publicly available Snakemake [ver] (ref) workflow [ref - cite this workflow here].**
75+
**The analysis and visualizations described here were performed using a publicly available Snakemake [ver] (ref) workflow [DOI]().**
6276

6377

6478
# Features
6579
The workflow performs the following analyses on each dataset provided in the annotation file. A result folder "unsupervised_analysis" is generated containing a folder for each dataset.
80+
81+
## Dimensionality Reduction
6682
- Principal Component Analysis (PCA) keeping all components (.pickle and .CSV)
6783
- diagnostics (.PNG):
6884
- variance: scree-plot and cumulative explained variance-plot of all and top 10% principal components
69-
- pairs: sequential pair-wise PCs for up to 10 PCs using scatter- and density-plots colored by metadata_of_interest
85+
- pairs: sequential pair-wise PCs for up to 10 PCs using scatter- and density-plots colored by [metadata_of_interest]
7086
- loadings: showing the magnitude and direction of the 10 most influential features for each PC combination
7187
- Uniform Manifold Approximation & Projection (UMAP)
72-
- k-nearest-neighbor graph (.pickle): generated using the maximum n_neighorhood parameter together with the provided metrics
73-
- low dimensional embedding (.pickle and .CSV): using the precomputed-knn graph from before, embeddings are parametrized using min_dist and n_components
88+
- k-nearest-neighbor graph (.pickle): generated using the [n_neighbors] parameter together with the provided [metrics].
89+
- if loading the pickle fails, pin the Python version to 3.9 (relevant if you want to use the graph downstream)
90+
- low dimensional embedding (.pickle and .CSV): using the precomputed-knn graph from before, embeddings are parametrized using [min_dist] and [n_components]
7491
- densMAP (optional): local density preserving regularization as additional dimensionality reduction method (i.e., all UMAP parameter combinations and downstream visualizations apply)
7592
- diagnostics (.PNG): 2D embedding colored by PCA coordinates, vector quantization coordinates, approximated local dimension, neighborhood Jaccard index
7693
- connectivity (.PNG): graph/network-connectivity plot with edge-bundling (hammer algorithm variant)
7794
- Hierarchically Clustered Heatmap (.PNG)
78-
- hierarchically clustered heatmaps of scaled data (z-score) with configured distance metrics and clustering methods (all combinations are computed), and annotated with metadata_of_interest.
95+
- hierarchically clustered heatmaps of scaled data (z-score) with configured distance [metrics] and clustering methods ([hclust_methods]). All combinations are computed, and annotated with [metadata_of_interest].
7996
- Visualization
80-
- 2D metadata and feature plots (.PNG) of the first 2 principal components and all 2D embeddings, depending on the method
81-
- interactive 2D and 3D visualizations (self contained HTML files) of all projections and embeddings (**not in the example results or report due to large sizes**)
97+
- 2D metadata and feature plots (.PNG) of the first 2 principal components and all 2D embeddings, respectively.
98+
- interactive 2D and 3D visualizations as self-contained HTML files of all projections/embeddings.
8299
- Results directories for each dataset have the following structure:
83100
- "method" (containing all the data as .pickle and/or .CSV files)
84101
- plots (for all visualizations as .PNG files)
85102

103+
## Cluster Analysis
104+
> _"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."_ from _Algorithms for Clustering Data (1988)_ by Jain & Dubes
105+
106+
- Clustering
107+
- Leiden algorithm
108+
- Applied to the UMAP KNN graphs specified by the respective parameters (metric, n_neighbors).
109+
- All algorithm specific parameters are supported: [partition_types], [resolutions], and [n_iterations].
110+
- Clustification: an ML-based clustering approach that iteratively merges clusters based on misclassification
111+
0. User: Specify a clustering method [method].
112+
1. Choose the clustering with the most clusters as starting point (i.e., overclustered).
113+
2. Iterative classification using the cluster labels.
114+
- Stratified 5-fold CV
115+
- RF with 100 trees (i.e., defaults)
116+
- Retain predicted labels
117+
3. Merging of clusters.
118+
- Build a normalized confusion matrix using the predicted labels.
119+
- Make it symmetric and upper triangle, resulting in a similarity graph.
120+
- Check stopping criterion: if maximum edge weight < 2.5% -> STOP and return current cluster labels
121+
- Merge the two clusters connected by the maximum edge weight.
122+
4. Back to 2. using the new labels.
123+
- Clustree analysis and visualization
124+
- The following clustree specific parameters are supported: [count_filter], [prop_filter], and [layout].
125+
- default: produces the standard clustree visualization, ordered by number of clusters and annotated.
126+
- custom: extends default by adding [metadata_of_interest] as additional "clusterings".
127+
- metadata and features, specified in the config, are highlighted on top of the clusterings using aggregation functions
128+
- numeric: available aggregation functions: mean, median, max, min
129+
- categorical: available aggregation functions: "pure" or "majority"
130+
- Cluster Validation
131+
- External cluster indices are determined comparing all clustering results with all categorical metadata
132+
- all complementary indices from sklearn are used: [AMI](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_mutual_info_score.html#sklearn.metrics.adjusted_mutual_info_score), [ARI](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html#sklearn.metrics.adjusted_rand_score), [FMI](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fowlkes_mallows_score.html#sklearn.metrics.fowlkes_mallows_score), [**Homogeneity** and **Completeness** and **V**-Measure](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_completeness_v_measure.html#sklearn.metrics.homogeneity_completeness_v_measure)
133+
- Internal cluster indices are determined for each clustering and [metadata_of_interest]
134+
- 6 complementary indices are used
135+
- 5 from the package [clusterCrit](https://rdrr.io/cran/clusterCrit/man/intCriteria.html): Silhouette, Calinski-Harabasz, C-index, Dunn index, Davies-Bouldin score.
136+
- 1 weighted Bayesian Information Criterion (BIC) approach, previously described in [Reichl 2018 - Chapter 4.2.2 - Internal Indices](https://repositum.tuwien.at/handle/20.500.12708/3488)
137+
- Due to the computational cost, PCA results representing 90% of variance explained are used as input, and a [sample_proportion] can be configured.
138+
- Caveat: internal cluster indices are linear, i.e., they use Euclidean distance metrics.
139+
- Multiple-criteria decision-making (MCDM) using TOPSIS for ranking clustering results
140+
- The MCDM method TOPSIS is applied to the internal cluster indices to rank all clustering results (and [metadata_of_interest]) from best to worst.
141+
- This approach has been described in [Reichl 2018 - Chapter 4.3.1 - The Favorite Approach](https://repositum.tuwien.at/handle/20.500.12708/3488)
142+
- Caveat: Silhouette score sometimes generates NA due to a known [bug](https://github.com/cran/clusterCrit/pull/1/commits/b37a5e361d0a12f9d3900089aa03e3947d0d4ef7). Clusterings with NA scores are removed before TOPSIS is applied.
143+
- Visualization
144+
- all clustering results as 2D and interactive 2D & 3D plots for all available embeddings/projections.
145+
- external cluster indices as hierarchically clustered heatmaps, aggregated in one panel.
146+
- internal cluster indices as one heatmap with clusterings (and [metadata_of_interest]) sorted by TOPSIS ranking from top to bottom and cluster indices split by type (cost/benefit functions to be minimized/maximized).
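
The merging step of Clustification (step 3 above) can be sketched in plain Python. This is a minimal illustration, not the workflow's actual implementation: the function name `merge_step` is hypothetical, the confusion matrix is assumed to be row-normalized, and "symmetric" is assumed to mean averaging both directions. The predicted labels are taken as input; in the workflow they would come from the stratified 5-fold CV random forest classification of step 2, which is not reproduced here.

```python
from collections import Counter


def merge_step(labels, predicted, threshold=0.025):
    """Perform one merging step; return (new_labels, merged_pair or None)."""
    clusters = sorted(set(labels))
    counts = Counter(zip(labels, predicted))
    sizes = Counter(labels)

    # Row-normalized confusion matrix (assumption): fraction of cluster a
    # whose members were predicted as cluster b.
    conf = {(a, b): counts[(a, b)] / sizes[a] for a in clusters for b in clusters}

    # Symmetrize by averaging both directions (assumption) and scan only the
    # upper triangle, i.e., each unordered pair of clusters once.
    best_pair, best_weight = None, 0.0
    for i, a in enumerate(clusters):
        for b in clusters[i + 1:]:
            weight = (conf[(a, b)] + conf[(b, a)]) / 2
            if weight > best_weight:
                best_pair, best_weight = (a, b), weight

    # Stopping criterion: maximum edge weight below 2.5% -> keep current labels.
    if best_pair is None or best_weight < threshold:
        return list(labels), None

    # Merge the two clusters connected by the maximum edge weight.
    keep, drop = best_pair
    return [keep if lab == drop else lab for lab in labels], best_pair


# Toy example: clusters 0 and 1 are frequently confused, cluster 2 is not.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
predicted = [0, 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 2]
new_labels, merged = merge_step(labels, predicted)
print(merged)  # (0, 1)
```

Calling `merge_step` repeatedly with freshly predicted labels until it returns `None` as the merged pair corresponds to the loop between steps 2 and 4.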
147+
86148

87149
# Usage
88150
Here are some tips for the usage of this workflow:
89-
- Start with minimal parameter combinations and without UMAP diagnostics and connectivity plots (computational expensive and slow)
90-
- Heatmaps require **a lot** of memory, hence the memory allocation is solved dynamically. If nevertheless a out-of-memory exception occurs the flag `--retries 2` can be used to trigger automatic resubmission upon failure with twice (2x) the memory.
151+
- Start with minimal parameter combinations and without UMAP diagnostics and connectivity plots (they are computationally expensive and slow).
152+
- Heatmaps require **a lot** of memory, hence the memory allocation is determined dynamically based on retries. If an out-of-memory exception occurs, the flag `--retries X` can be used to trigger automatic resubmission up to X times upon failure, with X times the memory.
153+
- Clustification performance scales with available cores, i.e., more cores mean faster internal parallelization of RF training & testing.
154+
- Cluster indices are extremely compute-intensive and scale linearly with every additional clustering result and specified metadata.
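
The retry-based memory scaling mentioned above can be illustrated with a small Python function. This is a sketch under assumptions: the function name `scaled_memory_mb` and the base value of 16000 MB are hypothetical and not taken from the workflow. In a Snakefile, this pattern is commonly written as a resource callable such as `mem_mb=lambda wildcards, attempt: attempt * base_mb`, where Snakemake supplies `attempt` (1 on the first run, 2 on the first retry, and so on).

```python
def scaled_memory_mb(attempt, base_mb=16000):
    """Return the memory request in MB for a given attempt number.

    attempt is 1 for the first run and increases by one with every retry;
    base_mb is a hypothetical starting allocation.
    """
    return attempt * base_mb


# Memory requested for the first run and two retries (e.g., with --retries 2).
requests = [scaled_memory_mb(attempt) for attempt in (1, 2, 3)]
print(requests)  # [16000, 32000, 48000]
```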
155+
91156

92157
# Configuration
93158
Detailed specifications can be found here [./config/README.md](./config/README.md)
94159

95160
# Examples
96-
We provide a minimal example of the analysis of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits) imported from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) in the [test folder](./.test/):
161+
We provide a minimal example of the analysis of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits) imported from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) in the [test folder](.test/):
97162
- config
98-
- project configuration: digits_unsupervised_analysis_config.yaml
163+
- configuration: config/config.yaml
99164
- sample annotation: digits_unsupervised_analysis_annotation.csv
100165
- data
101-
- dataset: digits_data.csv
102-
- metadata: digits_labels.csv
103-
- results
104-
- containing all results in a dataset oriented structure (without large interactive HTML or object files)
105-
- detailed self-contained HTML [report](./.test/report.html) for distribution and reproducibility
106-
- performance/speed: on a HPC it took less than 6.5 minutes to complete a full run (with up to 32GB of memory per task)
166+
- dataset (1797 observations, 64 features): digits_data.csv
167+
- metadata (consisting of the ground truth label "target"): digits_labels.csv
168+
- results will be generated in a subfolder .test/results/
169+
- performance: on an HPC it took less than 5 minutes to complete a full run (with up to 32GB of memory per task)
170+
171+
# single-cell RNA sequencing (scRNA-seq) data analysis
172+
Unsupervised analyses, dimensionality reduction and cluster analysis, are cornerstones of scRNA-seq data analyses.
173+
Below are configurations of the two most commonly used frameworks, [scanpy](https://scanpy.readthedocs.io/en/stable/index.html) (Python) and [Seurat](https://satijalab.org/seurat/) (R), alongside the original packages' defaults for comparison and to facilitate reproducibility:
174+
175+
UMAP for dimensionality reduction
176+
- [umap-learn](https://umap-learn.readthedocs.io/en/latest/api.html)
177+
- initialization: spectral
178+
- metric: Euclidean
179+
- neighbors: 15
180+
- min. distance: 0.1
181+
- [scanpy](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.neighbors.html#scanpy.pp.neighbors)
182+
- initialization: spectral
183+
- metric: Euclidean
184+
- neighbors: 15
185+
- min. distance: **0.5**
186+
- [Seurat](https://satijalab.org/seurat/reference/runumap)
187+
- initialization: **PCA**
188+
- method: "uwot" (not umap-learn package)
189+
- metric: **Cosine (or Correlation)**
190+
- neighbors: **30**
191+
- min. distance: **0.3**
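
The UMAP settings listed above can be collected in a small lookup table to see at a glance where a framework deviates from the umap-learn defaults. This is only a sketch: the dictionary `UMAP_PRESETS` and the helper `differences_from_reference` are hypothetical names, the keys follow umap-learn's parameter naming, and the lowercase string values are a normalization chosen here, not the frameworks' exact argument values.

```python
# Settings as listed above; keys follow umap-learn parameter names.
UMAP_PRESETS = {
    "umap-learn": {"init": "spectral", "metric": "euclidean", "n_neighbors": 15, "min_dist": 0.1},
    "scanpy": {"init": "spectral", "metric": "euclidean", "n_neighbors": 15, "min_dist": 0.5},
    "Seurat": {"init": "pca", "metric": "cosine", "n_neighbors": 30, "min_dist": 0.3},
}


def differences_from_reference(framework, reference="umap-learn"):
    """Return the settings of `framework` that differ from `reference`."""
    ref = UMAP_PRESETS[reference]
    return {key: value for key, value in UMAP_PRESETS[framework].items() if value != ref[key]}


print(differences_from_reference("scanpy"))  # {'min_dist': 0.5}
print(sorted(differences_from_reference("Seurat")))  # ['init', 'metric', 'min_dist', 'n_neighbors']
```

Note that Seurat uses the "uwot" implementation rather than umap-learn, so passing these values to umap-learn would approximate, not reproduce, Seurat's behavior.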
192+
193+
Leiden algorithm for clustering
194+
- [leidenalg](https://leidenalg.readthedocs.io/en/stable/reference.html)
195+
- no defaults
196+
- [scanpy](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.leiden.html)
197+
- input: batch balanced UMAP KNN graph
198+
- partition type: RBConfigurationVertexPartition
199+
- resolution: 1
200+
- [Seurat](https://github.com/satijalab/seurat/blob/763259d05991d40721dee99c9919ec6d4491d15e/R/clustering.R#L344)
201+
- input: SNN graph
202+
- partition type: RBConfigurationVertexPartition
203+
- resolution: 0.8
204+
107205

108206
# Links
109207
- [GitHub Repository](https://github.com/epigen/unsupervised_analysis/)
110208
- [GitHub Page](https://epigen.github.io/unsupervised_analysis/)
111-
- [Zenodo Repository (coming soon)]()
209+
- [Zenodo Repository]()
112210
- [Snakemake Workflow Catalog Entry](https://snakemake.github.io/snakemake-workflow-catalog?usage=epigen/unsupervised_analysis)
211+
212+
# Resources
213+
- Recommended compatible [MR.PARETO](https://github.com/epigen/mr.pareto) modules
214+
- for upstream processing:
215+
- [ATAC-seq Processing](https://github.com/epigen/atacseq_pipeline) to quantify chromatin accessibility.
216+
- [Split, Filter, Normalize and Integrate Sequencing Data](https://github.com/epigen/spilterlize_integrate) to process sequencing data.
217+
218+
# Publications
219+
The following publications successfully used this module for their analyses.
220+
- ...
