Change Log
glycan_data
Updated sugarbase database and all models
stats
Newly added module to glycowork
Moved all the statistics functions from motif.processing into this module: cohen_d, mahalanobis_distance, mahalanobis_variance, variance_stabilization, MissForest, impute_and_normalize, and variance_based_filtering
Added fast_two_sum, two_sum, expansion_sum, hlm, update_cf_for_m_n, jtkdist, jtkinit, jtkstat, and jtkx helper functions for JTK test
Added get_BF to calculate Jeffreys' approximate Bayes factor based on sample size and p-value
Added get_alphaN to calculate sample size-appropriate significance cut-offs informed by Bayesian statistics
Added test_inter_vs_intra_group to estimate intra- versus inter-group correlation with a mixed-effects model for groupings of glycans based on domain expertise
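To illustrate the idea behind get_BF and get_alphaN, here is a minimal sketch of one common p-value-based approximation: the Jeffreys approximate Bayes factor under a unit-information prior, plus the significance cut-off obtained by inverting it. The names jab01 and alpha_for_bf are hypothetical, and glycowork's own functions may use different priors, arguments, and defaults.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def jab01(n, p):
    """Approximate Bayes factor in favor of the null hypothesis.

    n: sample size; p: two-sided p-value (0 < p <= 1).
    Hypothetical helper, not glycowork's get_BF.
    """
    # z-score corresponding to the two-sided p-value
    z = _STD_NORMAL.inv_cdf(1 - p / 2)
    return math.sqrt(n) * math.exp(-0.5 * z * z)


def alpha_for_bf(n, bf=3.0):
    """Two-sided alpha at which evidence against the null reaches `bf`.

    Solves jab01(n, alpha) == 1 / bf for alpha; needs bf * sqrt(n) >= 1.
    Hypothetical helper, not glycowork's get_alphaN.
    """
    z = math.sqrt(2 * math.log(bf * math.sqrt(n)))
    return 2 * (1 - _STD_NORMAL.cdf(z))


for n in (10, 100, 1000):
    print(n, round(jab01(n, 0.05), 3), round(alpha_for_bf(n), 4))
```

Under these assumptions, a fixed p-value of 0.05 carries less evidence against the null as the sample size grows, so the cut-off shrinks with n.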
motif
regex
Newly added module to glycowork
Added the get_match function and associated functions to implement a regular expression system for glycans. This allows for powerful queries to detect and extract motifs of arbitrary complexity.
processing
Moved cohen_d, mahalanobis_distance, mahalanobis_variance, variance_stabilization, MissForest, impute_and_normalize, and variance_based_filtering into glycan_data.stats to re-focus processing on processing glycan sequences
Extended canonicalize_composition to cases like ‘5_4_2_1’, ‘5421’, and ‘(Hex)2 (HexNAc)2 (Deoxyhexose)1 (NeuAc)2 + (Man)3(GlcNAc)2’
GlycoCT and WURCS handling for universal input now encompass more monosaccharides and more modifications
Expanded oxford_to_iupac to handle more complex sequences, including sulfation, LacdiNAc, hybrid structures, extended Neu5Ac, complex fucosylation, and more custom linkage specifications
enforce_class can now deal with free glycans regardless of whether they end in ‘-ol’ or not
annotate
annotate_dataset and downstream functions now accept a new keyword in “feature_set”, called “custom”. If “custom” is added to “feature_set”, a list of custom motifs can and must be added via the “custom_motifs” keyword argument. “custom” can be mixed and matched with all other keywords in “feature_set”
annotate_dataset now also accepts glyco-regular expressions via the “custom” keyword in “feature_set”. These expressions need to be added within the “custom_motifs” keyword argument and have to start with an “r”, such as "rHex-HexNAc-([Hex|Fuc]){1,2}-HexNAc". Normal motifs and glyco-regular expressions can be freely mixed within “custom_motifs”
Added group_glycans_core, group_glycans_sia_fuc, and group_glycans_N_glycan_type to group glycans by core structure (for O-glycans), Sia/Fuc/FucSia/Rest, or complex/hybrid/high-man/rest (for N-glycans)
Fixed a bug in get_k_saccharides, in which redundant columns were not always correctly removed
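The rules for “custom” in annotate_dataset described above can be sketched as follows. This is an illustration only: split_custom_motifs is a hypothetical helper, "known" is assumed to be one of the other “feature_set” keywords, and the actual library may validate and dispatch these inputs differently.

```python
def split_custom_motifs(feature_set, custom_motifs=None):
    """Hypothetical helper: enforce the 'custom' rule and separate plain
    motifs from glyco-regular expressions (marked by a leading 'r')."""
    if "custom" in feature_set and not custom_motifs:
        raise ValueError("'custom' in feature_set requires custom_motifs")
    plain, regexes = [], []
    for motif in custom_motifs or []:
        # Monosaccharide names start with a capital letter, so a leading
        # lowercase 'r' can flag a glyco-regular expression.
        (regexes if motif.startswith("r") else plain).append(motif)
    return plain, regexes


feature_set = ["known", "custom"]  # "known" is an assumed keyword
custom_motifs = ["Fuc(a1-2)Gal", "rHex-HexNAc-([Hex|Fuc]){1,2}-HexNAc"]

plain, regexes = split_custom_motifs(feature_set, custom_motifs)
print(plain)
print(regexes)

# With glycowork installed, the same two objects would be passed on as
# keyword arguments (the positional input is an assumption):
# annotate_dataset(glycans, feature_set=feature_set, custom_motifs=custom_motifs)
```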
analysis
Added get_jtk to analyze circadian expression of glycans in temporal glycomics datasets using the Jonckheere–Terpstra–Kendall (JTK) algorithm, with the usual interface for motifs, imputation, etc., analogous to differential expression
get_differential_expression, get_glycanova, and get_jtk now use get_alphaN to calculate a sample size-appropriate significance cut-off (see https://journals.sagepub.com/doi/10.1177/14761270231214429) and add a ‘significant’ column to the output to display whether the corrected p-values lie below this threshold
Added the “zscores” keyword argument to get_pvals_motifs; set “zscores” to False if the input data are not yet z-score transformed, and the function will perform the z-score transformation
For statistical calculations, get_pvals_motifs will now weight the motif occurrences by z-score magnitude, rather than only using a cut-off for enrichment calculations
Added effect size calculations (as Cohen’s d) to get_pvals_motifs; these are included in the output
Changed get_pvals_motifs such that both enrichments and depletions are now tested (with depletions resulting in negative effect sizes)
Added select_grouping to find out which grouping of glycans has the highest intra- versus inter-group correlation, as estimated by glycan_data.stats.test_inter_vs_intra_group
When “motifs = False” and “grouped_BH = True”, get_differential_expression now tries to use the Two-Stage adaptive Benjamini-Hochberg procedure based on groups for multiple testing correction, if meaningful groups can be found in the glycans [note this makes everything at least one order of magnitude slower, though most datasets should still finish in a few seconds]
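As a reminder of how the sign of the Cohen’s d effect sizes mentioned above should be read, here is a minimal sketch of the textbook pooled-standard-deviation form. The function cohens_d and the example values are illustrative assumptions; glycowork's cohen_d may differ in signature and details (for example, paired designs).

```python
import math
from statistics import mean, variance


def cohens_d(x, y):
    """Cohen's d with a pooled standard deviation (textbook form)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)


# Made-up z-scores for glycans with and without a given motif
with_motif = [-1.2, -0.8, -1.0]
without_motif = [0.1, -0.1, 0.0]

# A negative value indicates depletion of the first group relative to the second
print(cohens_d(with_motif, without_motif))
```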
draw
In GlycoDraw, the “highlight_motif” keyword argument can now use glyco-regular expressions in addition to regular motifs (just add a single ‘r’ before your glyco-regular expression to indicate that it is indeed a regular expression)
Added plot_glycans_excel to allow for the automated insertion of GlycoDraw SNFG pictures into an Excel file containing glycan sequences
graph
categorical_node_match_wildcard now uses string ID for matching, instead of integer ID, which means even two graphs, generated with two different libs, can now be successfully compared via compare_glycans or subgraph_isomorphism
compare_glycans and subgraph_isomorphism (and all functions using these functions) now support negation, by prepending “!”. For instance, “!Fuc(a1-?)Gal(b1-4)GlcNAc” will match subsequences that have a monosaccharide that is NOT Fuc before the Gal. It is highly recommended to generate your own lib via get_lib if you use negation, as monosaccharides such as !Fuc are not within lib and will cause indexing errors.
Added “?1-?” as another ultimate wildcard (promoting it from a strong narrow wildcard)
Fixed some cases where “Monosaccharide” was not treated as an ultimate wildcard in graph operations
Fixed an issue in graph_to_string in which glycans of size 1 (e.g., “GalNAc”) sometimes were missing their first character
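The negation and ultimate-wildcard behavior described above can be pictured with a toy label matcher. This is a deliberately simplified sketch on flat label sequences, not glycowork's graph matching: label_matches and sequence_matches are hypothetical names, and narrow wildcards such as “a1-?” are not handled here.

```python
ULTIMATE_WILDCARDS = {"Monosaccharide", "?1-?"}


def label_matches(pattern, label):
    """Toy comparison of one pattern label against one glycan label."""
    if pattern.startswith("!"):
        # Negation: anything except the named label matches
        return label != pattern[1:]
    if pattern in ULTIMATE_WILDCARDS:
        # Ultimate wildcards match any label
        return True
    return pattern == label


def sequence_matches(pattern, labels):
    """Compare two equal-length label sequences position by position."""
    return len(pattern) == len(labels) and all(map(label_matches, pattern, labels))


motif = ["!Fuc", "?1-?", "Gal", "b1-4", "GlcNAc"]
print(sequence_matches(motif, ["Neu5Ac", "a2-3", "Gal", "b1-4", "GlcNAc"]))
print(sequence_matches(motif, ["Fuc", "a1-2", "Gal", "b1-4", "GlcNAc"]))
```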
network
Updated pre-calculated biosynthetic networks for milk oligosaccharides
biosynthesis
Refactored find_diff to make networks compatible with the automated, dynamic wildcards (i.e., ? behave as they should and don’t necessarily cause over-branching of the network)
In highlight_network, the “motif” keyword argument can now use glyco-regular expressions in addition to regular motifs (just add a single ‘r’ before your glyco-regular expression to indicate that it is indeed a regular expression)
ml
model_training
In training_setup, upgraded the loss functions for all classification problems to PolyLoss with label smoothing (see https://arxiv.org/abs/2204.12511 for details).
In training_setup, number of classes (for multiclass or multilabel classification) can now be specified via the new “num_classes” keyword argument
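For readers unfamiliar with PolyLoss, the sketch below computes the Poly-1 variant with label smoothing for a single sample in plain Python: cross-entropy against the smoothed target plus epsilon times (1 - Pt), where Pt is the smoothed-target-weighted predicted probability. The function name and the default values for epsilon and label_smoothing are assumptions for illustration; glycowork's training_setup may use a different variant and hyperparameters.

```python
import math


def poly1_cross_entropy(logits, target, epsilon=1.0, label_smoothing=0.1):
    """Poly-1 loss for one sample: CE + epsilon * (1 - Pt)."""
    num_classes = len(logits)
    # Numerically stable log-softmax
    shift = max(logits)
    log_total = math.log(sum(math.exp(v - shift) for v in logits))
    log_probs = [v - shift - log_total for v in logits]
    # Smoothed one-hot target distribution
    smooth = [
        label_smoothing / num_classes + (1 - label_smoothing) * (i == target)
        for i in range(num_classes)
    ]
    cross_entropy = -sum(t * lp for t, lp in zip(smooth, log_probs))
    pt = sum(t * math.exp(lp) for t, lp in zip(smooth, log_probs))
    return cross_entropy + epsilon * (1 - pt)


# A confident correct prediction should cost less than a confident wrong one
print(poly1_cross_entropy([4.0, 0.0, 0.0], target=0))
print(poly1_cross_entropy([4.0, 0.0, 0.0], target=1))
```

Setting epsilon to 0 recovers ordinary label-smoothed cross-entropy.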