v1.1.0

@Bribak Bribak released this 31 Jan 15:09
· 214 commits to master since this release
d7502e9

Change Log

glycan_data

  • Updated sugarbase database and all models

stats

  • Newly added module to glycowork
  • Moved all the statistics functions from motif.processing into this module: cohen_d, mahalanobis_distance, mahalanobis_variance, variance_stabilization, MissForest, impute_and_normalize, and variance_based_filtering
  • Added fast_two_sum, two_sum, expansion_sum, hlm, update_cf_for_m_n, jtkdist, jtkinit, jtkstat, and jtkx as helper functions for the JTK test
  • Added get_BF to calculate Jeffreys' approximate Bayes factor based on sample size and p-value
  • Added get_alphaN to calculate sample size-appropriate significance cut-offs informed by Bayesian statistics
  • Added pi0_tst and TST_grouped_benjamini_hochberg to perform a Two-Stage adaptive Benjamini-Hochberg procedure based on groups (e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3175141/ or https://www.biorxiv.org/content/10.1101/2024.01.13.575531v1)
  • Added test_inter_vs_intra_group to estimate intra- versus inter-group correlation with a mixed-effects model for groupings of glycans based on domain expertise
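
To make the idea behind get_BF more concrete, here is a minimal sketch of one common form of Jeffreys' approximate Bayes factor, computed from a two-sided p-value and a sample size. The function name `jeffreys_approx_bf01` and the exact formula (sqrt(n) * exp(-W / 2), with W the Wald statistic recovered from the p-value) are assumptions for illustration; the options and defaults of get_BF in glycowork may differ.

```python
import math
from statistics import NormalDist


def jeffreys_approx_bf01(n: int, p: float) -> float:
    """Approximate Bayes factor in favour of the null (hypothetical helper).

    Assumes a two-sided p-value from a test with one degree of freedom,
    so the Wald statistic W equals the squared normal quantile.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(1.0 - p / 2.0)  # two-sided p-value -> |z|
    wald = z * z
    return math.sqrt(n) * math.exp(-0.5 * wald)


# The same p-value is weaker evidence against the null in a larger sample.
bf_small = jeffreys_approx_bf01(100, 0.05)
bf_large = jeffreys_approx_bf01(10_000, 0.05)
print(round(bf_small, 3), round(bf_large, 3))
```

Under this approximation, a fixed p-value supports the null more strongly as n grows, which is the motivation for sample size-dependent significance cut-offs such as those returned by get_alphaN.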

motif

regex

  • Newly added module to glycowork
  • Added the get_match function and associated functions to implement a regular expression system for glycans. This allows for powerful queries to detect and extract motifs of arbitrary complexity.

processing

  • Moved cohen_d, mahalanobis_distance, mahalanobis_variance, variance_stabilization, MissForest, impute_and_normalize, and variance_based_filtering into glycan_data.stats to refocus processing on handling glycan sequences
  • Extended canonicalize_composition to cases like ‘5_4_2_1’, ‘5421’, and ‘(Hex)2 (HexNAc)2 (Deoxyhexose)1 (NeuAc)2 + (Man)3(GlcNAc)2’
  • GlycoCT and WURCS handling for universal input now encompasses more monosaccharides and more modifications
  • Expanded oxford_to_iupac to handle more complex sequences, including sulfation, LacdiNAc, hybrid structures, extended Neu5Ac, complex fucosylation, and more custom linkage specifications
  • enforce_class can now deal with free glycans regardless of whether they end in ‘-ol’ or not

annotate

  • annotate_dataset and downstream functions now accept a new keyword in “feature_set”, called “custom”. If “custom” is added to “feature_set”, a list of custom motifs can and must be added via the “custom_motifs” keyword argument. “custom” can be mixed and matched with all other keywords in “feature_set”
  • annotate_dataset now also accepts glyco-regular expressions via the “custom” keyword in “feature_set”. These expressions need to be added within the “custom_motifs” keyword argument and have to start with an “r”, such as "rHex-HexNAc-([Hex|Fuc]){1,2}-HexNAc". Normal motifs and glyco-regular expressions can be freely mixed within “custom_motifs”
  • Added group_glycans_core, group_glycans_sia_fuc, and group_glycans_N_glycan_type to group glycans by core structure (for O-glycans), Sia/Fuc/FucSia/Rest, or complex/hybrid/high-man/rest (for N-glycans)
  • Fixed a bug in get_k_saccharides, in which redundant columns were not always correctly removed
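
The "r" prefix convention for “custom_motifs” described above can be illustrated with a small, self-contained sketch. The helper `split_custom_motifs` is hypothetical and is not part of glycowork; whether the library strips the prefix in the same way internally is an assumption.

```python
def split_custom_motifs(custom_motifs):
    """Separate glyco-regular expressions (leading 'r') from plain motifs.

    Hypothetical helper for illustration only. Only a lowercase 'r' marks
    an expression, so a motif starting with 'Rha' stays a plain motif.
    """
    plain = [m for m in custom_motifs if not m.startswith("r")]
    regexes = [m[1:] for m in custom_motifs if m.startswith("r")]
    return plain, regexes


custom_motifs = [
    "Fuc(a1-2)Gal",                         # normal motif
    "rHex-HexNAc-([Hex|Fuc]){1,2}-HexNAc",  # glyco-regular expression
]
plain, regexes = split_custom_motifs(custom_motifs)
print(plain)
print(regexes)
```

Per the notes above, a mixed list like this would be passed to annotate_dataset via “custom_motifs”, with “custom” included in “feature_set”.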

analysis

  • Added get_jtk to analyze circadian expression of glycans in temporal glycomics datasets using the Jonckheere–Terpstra–Kendall (JTK) algorithm. Its interface for motifs, imputation, and related options is analogous to that of differential expression.
  • get_differential_expression, get_glycanova, and get_jtk now use get_alphaN to calculate a sample size-appropriate significance cut-off (see https://journals.sagepub.com/doi/10.1177/14761270231214429) and add a ‘significant’ column to the output to display whether the corrected p-values lie below this threshold
  • Added the “zscores” keyword argument to get_pvals_motifs. Set “zscores” to False if the input data are not yet z-score transformed, and the function will perform the transformation
  • For statistical calculations, get_pvals_motifs now weighs motif occurrences by z-score magnitude, rather than only using a cut-off for enrichment calculations
  • Added effect size calculations to get_pvals_motifs, reported in the output as Cohen’s d
  • Changed get_pvals_motifs so that both enrichments and depletions are tested (with depletions resulting in negative effect sizes)
  • Added select_grouping to find out which grouping of glycans has the highest intra- versus inter-group correlation, as estimated by glycan_data.stats.test_inter_vs_intra_group
  • When “motifs = False” and “grouped_BH = True”, get_differential_expression now tries to use the Two-Stage adaptive Benjamini-Hochberg procedure based on groups for multiple testing correction, if meaningful groups can be found in the glycans [note this makes everything at least one order of magnitude slower, though most datasets should still finish in a few seconds]
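
As background for the “grouped_BH” option, the following is a minimal sketch of the ungrouped two-stage adaptive Benjamini-Hochberg procedure (in the style of Benjamini, Krieger, and Yekutieli). It is not glycowork's TST_grouped_benjamini_hochberg, and the group-based variant referenced above is not reproduced here. All function names are hypothetical.

```python
def bh_reject(pvals, q):
    """Plain Benjamini-Hochberg step-up: returns a list of reject flags."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value is below its BH threshold
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * q / m:
            k = rank
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


def two_stage_bh(pvals, q=0.05):
    """Two-stage adaptive BH: estimate the number of true nulls, then rerun."""
    m = len(pvals)
    q1 = q / (1.0 + q)             # stage 1 runs at a slightly stricter level
    stage1 = bh_reject(pvals, q1)
    r1 = sum(stage1)
    if r1 == 0 or r1 == m:         # nothing or everything rejected: stop
        return stage1
    m0_hat = m - r1                # estimated number of true null hypotheses
    return bh_reject(pvals, q1 * m / m0_hat)


pvals = [0.001, 0.002, 0.003, 0.004, 0.045, 0.5]
print(sum(bh_reject(pvals, 0.05)), sum(two_stage_bh(pvals, 0.05)))
```

In this toy example the adaptive second stage rejects one more hypothesis than plain BH at the same nominal level, because the first stage suggests that few true nulls remain.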

draw

  • In GlycoDraw, the “highlight_motif” keyword argument can now use glyco-regular expressions in addition to regular motifs (just add a single ‘r’ before your glyco-regular expression to indicate that it is indeed a regular expression)
  • Added plot_glycans_excel to allow for the automated insertion of GlycoDraw SNFG pictures into an Excel file containing glycan sequences

graph

  • categorical_node_match_wildcard now uses string IDs for matching instead of integer IDs, so that even two graphs generated with different libs can now be compared successfully via compare_glycans or subgraph_isomorphism
  • compare_glycans and subgraph_isomorphism (and all functions using them) now support negation by prepending “!”. For instance, “!Fuc(a1-?)Gal(b1-4)GlcNAc” will match subsequences that have a monosaccharide that is NOT Fuc before the Gal. It is highly recommended to generate your own lib via get_lib if you use negation, as monosaccharides such as !Fuc are not within lib and will cause indexing errors.
  • Added “?1-?” as another ultimate wildcard (promoting it from a strong narrow wildcard)
  • Fixed some cases where “Monosaccharide” was not treated as an ultimate wildcard in graph operations
  • Fixed an issue in graph_to_string in which glycans of size 1 (e.g., “GalNAc”) sometimes were missing their first character
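
The matching semantics described above (negation via “!” and ultimate wildcards) can be sketched at the level of single tokens. This is a toy illustration under stated assumptions, not glycowork's graph matching code; `token_matches` and `ULTIMATE_WILDCARDS` are hypothetical names.

```python
# Tokens that match anything, per the notes above (assumed set for this sketch).
ULTIMATE_WILDCARDS = {"Monosaccharide", "?1-?"}


def token_matches(pattern: str, observed: str) -> bool:
    """Compare one pattern token (monosaccharide or linkage) with an observed one."""
    if pattern in ULTIMATE_WILDCARDS:
        return True                     # wildcard: accept any token
    if pattern.startswith("!"):
        return observed != pattern[1:]  # negation: anything except the named token
    return pattern == observed          # otherwise require an exact match


print(token_matches("!Fuc", "Gal"))   # Gal is not Fuc
print(token_matches("!Fuc", "Fuc"))   # negation excludes Fuc itself
print(token_matches("?1-?", "b1-4"))  # ultimate linkage wildcard
```

Real glycan comparison operates on whole graphs, including narrower wildcards that this sketch leaves out.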

network

  • Updated pre-calculated biosynthetic networks for milk oligosaccharides

biosynthesis

  • Refactored find_diff to make networks compatible with the automated, dynamic wildcards (i.e., “?” now behaves as it should and doesn’t necessarily cause over-branching of the network)
  • In highlight_network, the “motif” keyword argument can now use glyco-regular expressions in addition to regular motifs (just add a single ‘r’ before your glyco-regular expression to indicate that it is indeed a regular expression)

ml

model_training

  • In training_setup, upgraded the loss functions for all classification problems to PolyLoss with label smoothing (see https://arxiv.org/abs/2204.12511 for details).
  • In training_setup, the number of classes (for multiclass or multilabel classification) can now be specified via the new “num_classes” keyword argument
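
For readers unfamiliar with PolyLoss, here is a minimal pure-Python sketch of the Poly-1 variant with label smoothing for a single example: cross-entropy on smoothed labels plus epsilon * (1 - pt), where pt is the probability mass assigned to the smoothed target. The function names and the values epsilon = 1.0 and smoothing = 0.1 are assumptions for illustration and are not taken from glycowork's training_setup.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_one_hot(target, num_classes, smoothing):
    """One-hot vector scaled by (1 - smoothing) plus smoothing / num_classes."""
    off = smoothing / num_classes
    return [(1.0 - smoothing) + off if i == target else off
            for i in range(num_classes)]


def poly1_cross_entropy(logits, target, epsilon=1.0, smoothing=0.1):
    """Poly-1 loss for one example: smoothed cross-entropy + epsilon * (1 - pt)."""
    probs = softmax(logits)
    labels = smooth_one_hot(target, len(logits), smoothing)
    ce = -sum(y * math.log(p) for y, p in zip(labels, probs))
    pt = sum(y * p for y, p in zip(labels, probs))
    return ce + epsilon * (1.0 - pt)


logits = [2.0, 0.5, -1.0]
print(poly1_cross_entropy(logits, target=0))
```

With epsilon set to 0 and smoothing set to 0, the function reduces to ordinary cross-entropy, which makes the extra polynomial term easy to isolate.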