A general purpose [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow to perform selected unsupervised analyses and visualizations of high-dimensional (normalized) data.
2
+
A general purpose [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow to perform unsupervised analyses (dimensionality reduction and cluster analysis) and visualizations of high-dimensional data.
3
3
4
-
**If you use this workflow in a publication, don't forget to give credits to the authors by citing the URL of this (original) repository (and its DOI, see Zenodo badge above -> coming soon).**
4
+
This workflow adheres to the module specifications of [MR.PARETO](https://github.com/epigen/mr.pareto), an effort to augment research by modularizing (biomedical) data science. For more details and modules check out the project's repository.
5
+
6
+
**If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI [coming soon]().**
@@ -55,58 +69,152 @@ Uniform Manifold Approximation projection (UMAP) from umap-learn (ver) [ref] was
55
69
Hierarchically clustered heatmaps of scaled data (z-score) were generated using the R package ComplexHeatmap (ver) [ref]. The distance metric [metric] and clustering method [clustering_method] were used to determine the hierarchical clustering of observations (rows) and features (columns), respectively. The heatmap was annotated with metadata [metadata_of_interest]. The values were colored by the top percentiles (0.01/0.99) of the data to avoid shifts in the coloring scheme caused by outliers.
56
70
57
71
**Visualization**
58
-
The R-packages ggplot2 (ver) [ref] and patchwork (ver) [ref] were used to generate all 2D visualizations colored by metadata [metadata] and/or feature(s) [features_to_plot].
59
-
Interactive visualizations in self-contained HTML files of all 2D and 3D projections and embeddings were generated using plotly express (ver) [ref].
72
+
The R-packages ggplot2 (ver) [ref] and patchwork (ver) [ref] were used to generate all 2D visualizations colored by metadata [metadata], feature(s) [features_to_plot], and/or clustering results.
73
+
Interactive visualizations in self-contained HTML files of all 2D and 3D projections/embeddings were generated using plotly express (ver) [ref].
60
74
61
-
**The analysis and visualizations described here were performed using a publicly available Snakemake [ver] (ref) workflow [ref - cite this workflow here].**
75
+
**The analysis and visualizations described here were performed using a publicly available Snakemake [ver] (ref) workflow [DOI]().**
62
76
63
77
64
78
# Features
65
79
The workflow performs the following analyses on each dataset provided in the annotation file. A result folder "unsupervised_analysis" is generated containing a folder for each dataset.
80
+
81
+
## Dimensionality Reduction
66
82
- Principal Component Analysis (PCA) keeping all components (.pickle and .CSV)
67
83
- diagnostics (.PNG):
68
84
- variance: scree-plot and cumulative explained variance-plot of all and top 10% principal components
69
-
- pairs: sequential pair-wise PCs for up to 10 PCs using scatter- and density-plots colored by metadata_of_interest
85
+
- pairs: sequential pair-wise PCs for up to 10 PCs using scatter- and density-plots colored by [metadata_of_interest]
70
86
- loadings: showing the magnitude and direction of the 10 most influential features for each PC combination
- k-nearest-neighbor graph (.pickle): generated using the maximum n_neighorhood parameter together with the provided metrics
73
-
- low dimensional embedding (.pickle and .CSV): using the precomputed-knn graph from before, embeddings are parametrized using min_dist and n_components
88
+
- k-nearest-neighbor graph (.pickle): generated using the [n_neighbors] parameter together with the provided [metrics].
89
+
- fix any pickle load issue by specifying Python version to 3.9 (in case you want to use the graph downstream)
90
+
- low dimensional embedding (.pickle and .CSV): using the precomputed-knn graph from before, embeddings are parametrized using [min_dist] and [n_components]
74
91
- densMAP (optional): local density preserving regularization as additional dimensionality reduction method (i.e., all UMAP parameter combinations and downstream visualizations apply)
75
92
- diagnostics (.PNG): 2D embedding colored by PCA coordinates, vector quantization coordinates, approximated local dimension, neighborhood Jaccard index
76
93
- connectivity (.PNG): graph/network-connectivity plot with edge-bundling (hammer algorithm variant)
77
94
- Hierarchically Clustered Heatmap (.PNG)
78
-
- hierarchically clustered heatmaps of scaled data (z-score) with configured distance metrics and clustering methods (all combinations are computed), and annotated with metadata_of_interest.
95
+
- hierarchically clustered heatmaps of scaled data (z-score) with configured distance [metrics] and clustering methods ([hclust_methods]). All combinations are computed, and annotated with [metadata_of_interest].
79
96
- Visualization
80
-
- 2D metadata and feature plots (.PNG) of the first 2 principal components and all 2D embeddings, depending on the method
81
-
- interactive 2D and 3D visualizations (self contained HTML files) of all projections and embeddings (**not in the example results or report due to large sizes**)
97
+
- 2D metadata and feature plots (.PNG) of the first 2 principal components and all 2D embeddings, respectively.
98
+
- interactive 2D and 3D visualizations as self contained HTML files of all projections/embeddings.
82
99
- Results directories for each dataset have the following structure:
83
100
- "method" (containing all the data as .pickle and/or .CSV files)
84
101
- plots (for all visualizations as .PNG files)
85
102
103
+
## Cluster Analysis
104
+
> _"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."_ from _Algorithms for Clustering Data (1988)_ by Jain & Dubes
105
+
106
+
- Clustering
107
+
- Leiden algorithm
108
+
- Applied to the UMAP KNN graphs specified by the respective parameters (metric, n_neighbors).
109
+
- All algorithm specific parameters are supported: [partition_types], [resolutions], and [n_iterations].
110
+
- Clustification: an ML-based clustering approach that iteratively merges clusters based on misclassification
111
+
0. User: Specify a clustering method [method].
112
+
1. Choose the clustering with the most clusters as the starting point (i.e., overclustered).
113
+
2. Iterative classification using the cluster labels.
114
+
- Stratified 5-fold CV
115
+
- RF with 100 trees (i.e., defaults)
116
+
- Retain predicted labels
117
+
3. Merging of clusters.
118
+
- Build a normalized confusion matrix using the predicted labels.
119
+
- Make it symmetric and upper triangle, resulting in a similarity graph.
120
+
- Check stopping criterion: if maximum edge weight < 2.5% -> STOP and return current cluster labels
121
+
- Merge the two clusters connected by the maximum edge weight.
122
+
4. Back to 2. using the new labels.
123
+
- Clustree analysis and visualization
124
+
- The following clustree specific parameters are supported: [count_filter], [prop_filter], and [layout].
125
+
- default: produces the standard clustree visualization, ordered by number of clusters and annotated.
126
+
- custom: extends default by adding [metadata_of_interest] as additional "clusterings".
127
+
- metadata and features, specified in the config, are highlighted on top of the clusterings using aggregation functions
128
+
- numeric: available aggregation functions: mean, median, max, min
129
+
- categorical: available aggregation functions: "pure" or "majority"
130
+
- Cluster Validation
131
+
- External cluster indices are determined comparing all clustering results with all categorical metadata
132
+
- all complementary indices from sklearn are used: [AMI](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_mutual_info_score.html#sklearn.metrics.adjusted_mutual_info_score), [ARI](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html#sklearn.metrics.adjusted_rand_score), [FMI](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fowlkes_mallows_score.html#sklearn.metrics.fowlkes_mallows_score), [**Homogeneity** and **Completeness** and **V**-Measure](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_completeness_v_measure.html#sklearn.metrics.homogeneity_completeness_v_measure)
133
+
- Internal cluster indices are determined for each clustering and [metadata_of_interest]
134
+
- 6 complementary indices are used
135
+
- 5 from the package [clusterCrit](https://rdrr.io/cran/clusterCrit/man/intCriteria.html): Silhouette, Calinski-Harabasz, C-index, Dunn index, Davies-Bouldin Score.
136
+
- 1 weighted Bayesian Information Criterion (BIC) approach, previously described in [Reichl 2018 - Chapter 4.2.2 - Internal Indices](https://repositum.tuwien.at/handle/20.500.12708/3488)
137
+
- Due to the computational cost, PCA results representing 90% of variance explained are used as input, and a [sample_proportion] can be configured.
138
+
- Caveat: internal cluster indices are linear, i.e., they use Euclidean distance metrics.
139
+
- Multiple-criteria decision-making (MCDM) using TOPSIS for ranking clustering results
140
+
- The MCDM method TOPSIS is applied to the internal cluster indices to rank all clustering results (and [metadata_of_interest]) from best to worst.
141
+
- This approach has been described in [Reichl 2018 - Chapter 4.3.1 - The Favorite Approach](https://repositum.tuwien.at/handle/20.500.12708/3488)
142
+
- Caveat: Silhouette score sometimes generates NA due to a known [bug](https://github.com/cran/clusterCrit/pull/1/commits/b37a5e361d0a12f9d3900089aa03e3947d0d4ef7). Clusterings with NA scores are removed before TOPSIS is applied.
143
+
- Visualization
144
+
- all clustering results as 2D and interactive 2D & 3D plots for all available embeddings/projections.
145
+
- external cluster indices as hierarchically clustered heatmaps, aggregated in one panel.
146
+
- internal cluster indices as one heatmap with clusterings (and [metadata_of_interest]) sorted by TOPSIS ranking from top to bottom and cluster indices split by type (cost/benefit functions to be minimized/maximized).
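
The TOPSIS ranking of internal cluster indices described above can be illustrated with a minimal, dependency-free sketch. It is not the workflow's implementation: the function name `topsis_rank`, the equal weights, the vector normalization, and the toy index values are all assumptions made for illustration. Only two indices are shown, Silhouette (benefit, to be maximized) and Davies-Bouldin (cost, to be minimized). Clusterings with a missing (NaN) score are dropped before ranking, mirroring the caveat above.

```python
import math

def topsis_rank(scores, benefit):
    """Rank alternatives with TOPSIS (illustrative sketch, equal weights).

    scores:  dict mapping a clustering name to a list of index values
    benefit: list of bools, True if the index is to be maximized
    Returns (names sorted from best to worst, closeness per name).
    """
    # Drop alternatives with missing scores (hypothetical NA handling).
    valid = {n: v for n, v in scores.items() if not any(math.isnan(x) for x in v)}
    names = list(valid)
    k = len(benefit)
    weight = 1.0 / k  # assumption: all indices weighted equally

    # Vector-normalize each index (column) and apply the weight.
    norms = [math.sqrt(sum(valid[n][j] ** 2 for n in names)) or 1.0 for j in range(k)]
    scaled = {n: [weight * valid[n][j] / norms[j] for j in range(k)] for n in names}

    # Ideal and anti-ideal points depend on the index type (benefit/cost).
    cols = [[scaled[n][j] for n in names] for j in range(k)]
    ideal = [max(c) if b else min(c) for c, b in zip(cols, benefit)]
    anti = [min(c) if b else max(c) for c, b in zip(cols, benefit)]

    # Relative closeness to the ideal point: 1 is best, 0 is worst.
    closeness = {}
    for n in names:
        d_best = math.dist(scaled[n], ideal)
        d_worst = math.dist(scaled[n], anti)
        total = d_best + d_worst
        closeness[n] = d_worst / total if total > 0 else 0.0

    ranking = sorted(names, key=lambda n: closeness[n], reverse=True)
    return ranking, closeness

# Toy values: [Silhouette (benefit), Davies-Bouldin (cost)] per clustering.
toy_scores = {
    "clustering_A": [0.6, 0.5],
    "clustering_B": [0.3, 1.2],
    "clustering_C": [float("nan"), 0.9],  # removed before ranking
    "clustering_D": [0.5, 0.7],
}
ranking, closeness = topsis_rank(toy_scores, benefit=[True, False])
print(ranking)
```

In this toy example `clustering_A` is best on both indices and `clustering_B` is worst on both, so they end up at the top and bottom of the ranking, with `clustering_D` in between.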
147
+
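
The merging step of Clustification (step 3 above) can be sketched as follows. This is an illustration under stated assumptions, not the workflow's code: the function name `merge_step` is hypothetical, the confusion matrix is assumed to be row-normalized, and symmetrization is assumed to sum both directions of misclassification. In the workflow the predicted labels would come from the cross-validated random forest; here they are hard-coded toy values.

```python
from collections import Counter

def merge_step(labels, predicted, threshold=0.025):
    """Perform one Clustification-style merge (illustrative sketch).

    labels:    current cluster label per observation
    predicted: label predicted for each observation by a classifier
    Returns (new_labels, merged_pair); merged_pair is None when the
    stopping criterion (maximum edge weight < threshold) is met.
    """
    clusters = sorted(set(labels))
    counts = Counter(zip(labels, predicted))
    sizes = Counter(labels)

    # Row-normalized confusion matrix: fraction of cluster a predicted as b.
    conf = {(a, b): counts[(a, b)] / sizes[a] for a in clusters for b in clusters}

    # Symmetrize and keep the upper triangle: undirected similarity edges.
    edges = {
        (a, b): conf[(a, b)] + conf[(b, a)]
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    }
    if not edges:  # only one cluster left
        return list(labels), None

    (a, b), weight = max(edges.items(), key=lambda kv: kv[1])
    if weight < threshold:  # stopping criterion
        return list(labels), None

    # Merge cluster b into cluster a.
    return [a if label == b else label for label in labels], (a, b)

# Toy example: clusters 0 and 1 are frequently confused, cluster 2 is not.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
predicted = [0, 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 2]
merged_labels, pair = merge_step(labels, predicted)
print(pair, merged_labels)

# With perfect predictions no edge exceeds the threshold, so iteration stops.
final_labels, stop_pair = merge_step(merged_labels, merged_labels)
```

A full run would alternate between re-training the classifier on the merged labels (step 2) and calling such a merge step (step 3) until no pair is returned.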
86
148
87
149
# Usage
88
150
Here are some tips for the usage of this workflow:
89
-
- Start with minimal parameter combinations and without UMAP diagnostics and connectivity plots (computational expensive and slow)
90
-
- Heatmaps require **a lot** of memory, hence the memory allocation is solved dynamically. If nevertheless a out-of-memory exception occurs the flag `--retries 2` can be used to trigger automatic resubmission upon failure with twice (2x) the memory.
151
+
- Start with minimal parameter combinations and without UMAP diagnostics and connectivity plots (they are computationally expensive and slow).
152
+
- Heatmaps require **a lot** of memory, hence the memory allocation is determined dynamically based on retries. If an out-of-memory exception occurs, the flag `--retries X` can be used to trigger automatic resubmission X times upon failure with X times the memory.
153
+
- Clustification performance scales with available cores, i.e., more cores mean faster internal parallelization of RF training & testing.
154
+
- Cluster indices are extremely compute-intensive and scale linearly with every additional clustering result and specified metadata.
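
The retry-based memory allocation mentioned in the tips above can be sketched as a resource callable. Snakemake allows resources to be given as functions that receive the current `attempt` number; how this workflow's rules actually define their memory is not shown here, so the base value `BASE_MEM_MB` and the function name `scaled_mem_mb` are hypothetical, and linear scaling with the attempt number is an assumption.

```python
# Hypothetical base memory request in MB (assumption, not from the workflow).
BASE_MEM_MB = 16000

def scaled_mem_mb(wildcards, attempt):
    """Return the memory request for a given attempt (1 = first try).

    In a Snakefile such a function could be referenced from a rule as
    `resources: mem_mb=scaled_mem_mb`, so that every retry triggered by
    `--retries` asks for proportionally more memory.
    """
    return attempt * BASE_MEM_MB

# With `--retries 2` a job can run up to three times (attempts 1 to 3).
requests = [scaled_mem_mb(None, attempt) for attempt in (1, 2, 3)]
print(requests)
```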
155
+
91
156
92
157
# Configuration
93
158
Detailed specifications can be found here [./config/README.md](./config/README.md)
94
159
95
160
# Examples
96
-
We provide a minimal example of the analysis of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits) imported from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) in the [test folder](./.test/):
161
+
We provide a minimal example of the analysis of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits) imported from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) in the [test folder](.test/):
- metadata (consisting of the ground truth label "target"): digits_labels.csv
168
+
- results will be generated in a subfolder .test/results/
169
+
- performance: on an HPC it took less than 5 minutes to complete a full run (with up to 32GB of memory per task)
170
+
171
+
# Single-cell RNA sequencing (scRNA-seq) data analysis
172
+
Unsupervised analyses, i.e., dimensionality reduction and cluster analysis, are cornerstones of scRNA-seq data analyses.
173
+
Below are configurations of the two most commonly used frameworks, [scanpy](https://scanpy.readthedocs.io/en/stable/index.html) (Python) and [Seurat](https://satijalab.org/seurat/) (R), and the original package's defaults as comparison and to facilitate reproducibility: