Commit ad4b6f5

Replace MDS solver; match variable naming to paper (#4)
Pandora now uses scikit-allel MDS instead of scikit-learn MDS and the confidence_level variable/CLI flag was renamed to convergence tolerance to match the terminology of the paper
1 parent ae0846b commit ad4b6f5

26 files changed (+305, -446 lines changed)

docs/cli_config.rst

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Configuration options:
 - ``file_format``, default = ``EIGENSTRAT``, Name of the file format your dataset is in. Supported formats are ``ANCESTRYMAP``, ``EIGENSTRAT``, ``PED``, ``PACKEDPED``, ``PACKEDANCESTRYMAP``. For more information see Section `Input data`_ below.
 - ``convertf``, default = ``convertf``, File path pointing to an executable of Eigensoft's ``convertf`` tool. ``convertf`` is used if the provided dataset is not in ``EIGENSTRAT`` format. Default is ``convertf``. This will only work if ``convertf`` is installed systemwide.
 - ``bootstrap_convergence_check``, default = ``True``, If true, instead of computing ``n_replicates`` bootstraps and embeddings, Pandora will check for convergence once every ``max(10, threads)`` bootstrap embeddings are computed. If according to our heuristic (see TODO for more details) the bootstrap procedure converged, all remaining tasks are cancelled and the stability is determined uisng only the number of replicates computed when convergence is determined. Due to the runtime overhead of the convergence check compared to the runtime of MDS computations, we only advice using this convergence check for PCA analyses. Note that this parameter is only relevant if ``analysis_mode`` is ``AnalysisMode.BOOTSTRAP``.
-- ``bootstrap_convergence_confidence_level``, default=0.05, Determines the level of confidence when checking for bootstrap convergence. A value of :math:`X` means that we allow deviations of up to :math:`X * 100\%` between pairwise bootstrap comparisons and still assume convergence.
+- ``bootstrap_convergence_tolerance``, default=0.05, Determines the level of deviation tolerance when checking for bootstrap convergence. A value of :math:`X` means that we allow deviations of up to :math:`X * 100\%` between pairwise bootstrap comparisons and still assume convergence.
 - ``n_replicates``, default = 100, Number of bootstrap replicates or sliding windows to compute
 - ``keep_replicates``, default = ``false``, Whether to store all intermediate datasets files (``.geno``, ``.snp``, ``.ind``). Note that this will result in a substantial storage consumption. Note that in case of bootstrapping, the bootstrapped indices are stored as checkpoints for full reproducibility in any case.
 - ``n_components``, default = 10, Number of components to compute and compare for PCA or MDS analyses. We recommend 10 for PCA analyses and 2 for MDS analyses. The default is 10 since the default for ``embedding_algorithm`` is ``PCA``.
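
Because the config key is renamed in this commit, configs written for earlier versions would still carry ``bootstrap_convergence_confidence_level``. The following is a minimal sketch, not part of Pandora, of how an already-loaded config dictionary could be adapted. The function ``migrate_config`` and the two constants are hypothetical names introduced only for illustration, and the choice to let an explicitly set new key win over the old one is an assumption.

```python
from typing import Any, Dict

# Key names taken from the diff above; the constants themselves are hypothetical.
OLD_KEY = "bootstrap_convergence_confidence_level"
NEW_KEY = "bootstrap_convergence_tolerance"


def migrate_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of ``config`` with the old key renamed to the new one.

    Hypothetical helper, not part of Pandora's API.
    """
    migrated = dict(config)  # shallow copy, the input is left untouched
    if OLD_KEY in migrated:
        old_value = migrated.pop(OLD_KEY)
        # Assumption: an explicitly set new key takes precedence over the old one.
        migrated.setdefault(NEW_KEY, old_value)
    return migrated
```

Calling ``migrate_config({"n_replicates": 10, "bootstrap_convergence_confidence_level": 0.01})`` yields a dictionary that contains ``bootstrap_convergence_tolerance`` instead of the old key, with the value unchanged.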

docs/getting_started.rst

Lines changed: 4 additions & 4 deletions
@@ -46,12 +46,12 @@ have to add the path to ``smartpca`` to the ``config-example.yaml``.

 You should then see an output similar to this:::

-Pandora version 1.0.3 released by The Exelixis Lab
+Pandora version 2.0.0 released by The Exelixis Lab
 Developed by: Julia Haag
 Latest version: https://github.com/tschuelia/Pandora
 Questions/problems/suggestions? Please open an issue on GitHub.

-Pandora was called at 06-Nov-2023 16:26:50 as follows:
+Pandora was called at 18-Apr-2024 16:26:50 as follows:

 /Users/julia/micromamba/envs/pandora/bin/pandora -c config_example.yaml

@@ -63,7 +63,7 @@ You should then see an output similar to this:::
 n_replicates: 10
 keep_replicates: False
 bootstrap_convergence_check: True
-bootstrap_convergence_confidence_level: 0.05
+bootstrap_convergence_tolerance: 0.05
 n_components: 10
 embedding_algorithm: PCA
 smartpca: smartpca
@@ -85,7 +85,7 @@ You should then see an output similar to this:::
 [00:00:02] Running SmartPCA on the input dataset.
 [00:00:02] Plotting embedding results for the input dataset.
 [00:00:18] Drawing 10 bootstrapped datasets and running PCA.
-[00:00:18] NOTE: Bootstrap convergence check is enabled. Will terminate bootstrap computation once convergence is determined. Convergence confidence level: 0.05
+[00:00:18] NOTE: Bootstrap convergence check is enabled. Will terminate bootstrap computation once convergence is determined. Convergence tolerance: 0.05
 [00:00:27] Bootstrapping done. Number of replicates computed: 10
 [00:00:27] Comparing bootstrapping embedding results.
 [00:00:34] Plotting bootstrapping embedding results.

docs/usage.rst

Lines changed: 2 additions & 2 deletions
@@ -40,11 +40,11 @@ The Command line interface, as well as when using the Eigen-based Pandora Python
 Smartpca is a powerful PCA tool that implements a lot of genotype-data specific routines and optimizations and provides a lot of useful options for meaningful PCA analyses such as outlier detection.
 Pandora supports all custom configuration settings of smartpca. See the Section :ref:`SmartPCA` for more information. For MDS analyses, Pandora will use
 smartpca to generate the Fst-distance matrix as input for MDS. Note that this distance matrix computes the distances between population and not between samples.
-The subsequent MDS analysis is performed using the scikit-learn MDS implementation.
+The subsequent MDS analysis is performed using the scikit-allel MDS implementation.
 If you have genotype data in Eigenfiles but want to be able to do a more flexible analysis, consider using the alternative NumPy interface. Pandora provides a method
 to load your genotype data in EIGENSTRAT format as numpy array.

-If you are using the NumPy-based Pandora interface, PCA and MDS is performed using the scikit-learn implementations. For both analyses, Pandora supports different types of data imputation, see the API documentation for more information.
+If you are using the NumPy-based Pandora interface, PCA and MDS is performed using the scikit-learn and scikit-allel implementations respectively. For both analyses, Pandora supports different types of data imputation, see the API documentation for more information.
 Per default, Pandora will apply SNP-wise mean imputation. The default distance metric for MDS analysis is the pairwise euclidean distance between all samples in your data. However, Pandora provides alternative distance metrics
 and allows you to define your own distance metric as well. Again, see the API documentation for further information.

pandora/bootstrap.py

Lines changed: 16 additions & 18 deletions
@@ -29,7 +29,7 @@
 def _bootstrap_convergence_check(
     bootstraps: List[Union[NumpyDataset, EigenDataset]],
     embedding: EmbeddingAlgorithm,
-    bootstrap_convergence_confidence_level: float,
+    bootstrap_convergence_tolerance: float,
     threads: int,
     logger: Optional[loguru.Logger] = None,
 ):
@@ -44,14 +44,12 @@ def _bootstrap_convergence_check(
             f"Unrecognized embedding option {embedding}. Supported are 'pca' and 'mds'."
         )

-    return _bootstrap_converged(
-        embeddings, bootstrap_convergence_confidence_level, threads
-    )
+    return _bootstrap_converged(embeddings, bootstrap_convergence_tolerance, threads)


 def _bootstrap_converged(
     bootstraps: List[Embedding],
-    bootstrap_convergence_confidence_level: float,
+    bootstrap_convergence_tolerance: float,
     threads: int,
 ):
     """Checks for convergence by comparing the Pandora Stabilities for 10 subsets of the given list of bootstraps."""
@@ -76,7 +74,7 @@ def _bootstrap_converged(
         stabilities[j] = stability_s2

         relative_difference = abs(stability_s2 - stability_s1) / (stability_s1 + 1e-6)
-        if round(relative_difference, 3) > bootstrap_convergence_confidence_level:
+        if round(relative_difference, 3) > bootstrap_convergence_tolerance:
             return False
     return True

@@ -252,7 +250,7 @@ def run(
         self,
         threads: int,
         bootstrap_convergence_check: bool,
-        bootstrap_convergence_confidence_level: float,
+        bootstrap_convergence_tolerance: float,
         embedding: EmbeddingAlgorithm,
         logger: Optional[loguru.Logger] = None,
     ):
@@ -296,7 +294,7 @@ def run(
                 converged = _bootstrap_convergence_check(
                     bootstraps,
                     embedding,
-                    bootstrap_convergence_confidence_level,
+                    bootstrap_convergence_tolerance,
                     threads,
                     logger,
                 )
@@ -390,7 +388,7 @@ def bootstrap_and_embed_multiple(
     redo: bool = False,
     keep_bootstraps: bool = False,
     bootstrap_convergence_check: bool = True,
-    bootstrap_convergence_confidence_level: float = 0.05,
+    bootstrap_convergence_tolerance: float = 0.05,
     smartpca_optional_settings: Optional[Dict] = None,
     logger: Optional[loguru.Logger] = None,
 ) -> List[EigenDataset]:
@@ -434,8 +432,8 @@ def bootstrap_and_embed_multiple(
     bootstrap_convergence_check : bool, default=True
         Whether to automatically determine bootstrap convergence. If ``True``, will only compute as many replicates as
         required for convergence according to our heuristic (see Notes below).
-    bootstrap_convergence_confidence_level : float, default=0.05
-        Determines the level of confidence when checking for bootstrap convergence. A value of X means that we allow
+    bootstrap_convergence_tolerance : float, default=0.05
+        Determines the deviation tolerance when checking for bootstrap convergence. A value of X means that we allow
         deviations of up to :math:`X * 100\\%` between pairwise bootstrap comparisons and still assume convergence.
     smartpca_optional_settings : Dict, default=None
         Additional smartpca settings.
@@ -462,7 +460,7 @@ def bootstrap_and_embed_multiple(
     We first create 10 random subsets of size :math:`int(N/2)` by sampling from all :math:`N` replicates.
     We then compute the Pandora Stability (PS) for each of the 10 subsets and compute the relative difference of PS
     values between all possible pairs of subsets :math:`(PS_1, PS_2)` by computing :math:`\\frac{\\left|PS_1 - PS_2\\right|}{PS_2}`.
-    We assume convergence if all pairwise relative differences are below X * 100% were X is the set confidence level.
+    We assume convergence if all pairwise relative differences are below X * 100% were X is the set tolerance.
     If we determine that the bootstrap has converged, all remaining bootstrap computations are cancelled.

     (*) The reasoning for checking every ``max(10, threads)`` is the following: if Pandora runs on a machine with e.g. 48
@@ -517,7 +515,7 @@ def bootstrap_and_embed_multiple(
     bootstraps, finished_indices = parallel_bootstrap_process_manager.run(
         threads=threads,
         bootstrap_convergence_check=bootstrap_convergence_check,
-        bootstrap_convergence_confidence_level=bootstrap_convergence_confidence_level,
+        bootstrap_convergence_tolerance=bootstrap_convergence_tolerance,
         embedding=embedding,
         logger=logger,
     )
@@ -568,7 +566,7 @@ def bootstrap_and_embed_multiple_numpy(
     ] = euclidean_sample_distance,
     imputation: Optional[str] = "mean",
     bootstrap_convergence_check: bool = True,
-    bootstrap_convergence_confidence_level: float = 0.05,
+    bootstrap_convergence_tolerance: float = 0.05,
 ) -> List[NumpyDataset]:
     """Draws ``n_replicates`` bootstrap datasets of the provided NumpyDataset and performs PCA/MDS analysis (as
     specified by ``embedding``) for each bootstrap.
@@ -611,8 +609,8 @@ def bootstrap_and_embed_multiple_numpy(
     bootstrap_convergence_check : bool, default=True
         Whether to automatically determine bootstrap convergence. If ``True``, will only compute as many replicates as
         required for convergence according to our heuristic (see Notes below).
-    bootstrap_convergence_confidence_level : float, default=0.05
-        Determines the level of confidence when checking for bootstrap convergence. A value of X means that we allow
+    bootstrap_convergence_tolerance : float, default=0.05
+        Determines the level of deviation tolerance when checking for bootstrap convergence. A value of X means that we allow
         deviations of up to :math:`X * 100\\%` between pairwise bootstrap comparisons and still assume convergence.

     Returns
@@ -633,7 +631,7 @@ def bootstrap_and_embed_multiple_numpy(
     We first create 10 random subsets of size :math:`int(N/2)` by sampling from all :math:`N` replicates.
     We then compute the Pandora Stability (PS) for each of the 10 subsets and compute the relative difference of PS
     values between all possible pairs of subsets :math:`(PS_1, PS_2)` by computing :math:`\\frac{\\left|PS_1 - PS_2\\right|}{PS_2}`.
-    We assume convergence if all pairwise relative differences are below X * 100% were X is the set confidence level.
+    We assume convergence if all pairwise relative differences are below X * 100% were X is the set tolerance.
     If we determine that the bootstrap has converged, all remaining bootstrap computations are cancelled.

     (*) The reasoning for checking every ``max(10, threads)`` is the following: if Pandora runs on a machine with e.g. 48
@@ -666,7 +664,7 @@ def bootstrap_and_embed_multiple_numpy(
     bootstraps, _ = parallel_bootstrap_process_manager.run(
         threads=threads,
         bootstrap_convergence_check=bootstrap_convergence_check,
-        bootstrap_convergence_confidence_level=bootstrap_convergence_confidence_level,
+        bootstrap_convergence_tolerance=bootstrap_convergence_tolerance,
         embedding=embedding,
     )

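
To make the role of the renamed parameter concrete, the comparison step touched in ``_bootstrap_converged`` can be sketched in isolation. This is a simplified illustration, not the function from the commit. It assumes the per-subset Pandora Stability values have already been computed and are passed in as plain floats. The name ``stabilities_converged`` is hypothetical, and enumerating pairs with ``itertools.combinations`` is an assumption, since the loop itself is not visible in the diff. The relative-difference expression and the rounding to three decimals are copied from the changed lines.

```python
from itertools import combinations
from typing import Sequence


def stabilities_converged(stabilities: Sequence[float], tolerance: float = 0.05) -> bool:
    """Return True if every pairwise relative difference is within ``tolerance``.

    Hypothetical stand-alone sketch; the real function works on embeddings.
    """
    # Assumption: pairs are enumerated once each, in input order.
    for stability_s1, stability_s2 in combinations(stabilities, 2):
        # Expression as in the diff; 1e-6 avoids division by zero.
        relative_difference = abs(stability_s2 - stability_s1) / (stability_s1 + 1e-6)
        if round(relative_difference, 3) > tolerance:
            return False
    return True
```

With the default tolerance of 0.05, stabilities such as ``[0.90, 0.91, 0.92]`` count as converged, while ``[0.90, 0.80]`` do not, because their relative difference is roughly 11%.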
