Releases: BojarLab/glycowork
v1.0.0
Change Log
- Added a Zenodo badge, to have a release-specific doi for glycowork
glycan_data
- Updated sugarbase database; sugarbase is now pickled, so literal evaluations are necessary
- Harmonized glycan column names across generated dataframes; all use ‘glycan’ now, ‘target’ has been deprecated
loader
- Updated motif_list to be compatible with new position encoding
- Added Internal_LewisX and Internal_LewisA to motif_list (renamed LewisX and LewisA to Terminal_LewisX and Terminal_LewisA, respectively)
- Made df_species static again to speed up package import
- Added find_nth_reverse helper function that finds the starting index of the nth occurrence of a substring from the end of the string
- Added remove_unmatched_brackets helper function to strip unmatched opening or closing brackets from glycan strings
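To illustrate what the two new helpers do, here is a minimal standalone sketch. It is not glycowork's implementation: the function names mirror the entries above, but the signatures, the non-overlapping search, the return value of -1 for a missing occurrence, and the restriction to square brackets are all assumptions made for this example.

```python
def find_nth_reverse(string, substring, n):
    """Return the start index of the nth (non-overlapping) occurrence of
    substring counted from the end of string, or -1 if there is none."""
    if not substring or n < 1:
        return -1
    end = len(string)
    idx = -1
    for _ in range(n):
        idx = string.rfind(substring, 0, end)
        if idx == -1:
            return -1
        # The next search only looks to the left of this match
        end = idx
    return idx


def remove_unmatched_brackets(glycan):
    """Drop '[' or ']' characters that have no matching partner."""
    open_positions = []   # indices of '[' still waiting for a ']'
    unmatched = set()
    for i, char in enumerate(glycan):
        if char == "[":
            open_positions.append(i)
        elif char == "]":
            if open_positions:
                open_positions.pop()
            else:
                unmatched.add(i)
    unmatched.update(open_positions)
    return "".join(c for i, c in enumerate(glycan) if i not in unmatched)
```

With these definitions, find_nth_reverse("Gal(b1-4)GlcNAc(b1-2)Man", "(", 2) points at the first opening parenthesis, and remove_unmatched_brackets("Fuc(a1-3)]Gal(b1-4)GlcNAc") drops the stray closing bracket.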
motif
- Added more masses to mz_to_composition.csv / mass_dict: Acetonitrile, Formate, Cl-, HCO3-, and NH4+
processing
- Extended canonicalize_iupac to cases like "NeuGcα3Galβ3(NeuAcα6)GalNAcol" and even more modification formulations, e.g., “6S-GlcNAc”
- Added canonicalize_composition to convert compositions formatted either in the style of HexNAc2Hex1Fuc3Neu5Ac1 or N2H1F3A1 into dictionaries used by glycowork
- Added GalNAc4S to permitted reducing end monosaccharides for O-linked glycans in enforce_class
- MissForest now has a maximum number of iterations and will check for convergence each iteration (immediately finishing upon converging), yielding some speed-ups in most cases
- The output of min_process_glycans no longer contains empty strings for glycans ending in a linkage
- Updated choose_correct_isoform to be compatible with the change in min_process_glycans
- Added get_possible_linkages to retrieve linkages matching a wildcarded linkage
- Added get_possible_monosaccharides to retrieve monosaccharides matching a monosaccharide type (HexNAc, etc.)
- Added decorators, rescue_glycans and rescue_compositions, to canonicalize them in case a decorated function errors out
- Added linearcode_to_iupac to support LinearCode as input format for glycowork (this will be called within canonicalize_iupac and the decorators); note that for now coverage may not be perfect yet
- Added iupac_extended_to_condensed to support IUPAC-extended as input format for glycowork (this will be called within canonicalize_iupac and the decorators); note that for now coverage may not be perfect yet
- Added glycoct_to_iupac to support GlycoCT as input format for glycowork (this will be called within canonicalize_iupac and the decorators); note that for now coverage may not be perfect yet
- Added wurcs_to_iupac to support WURCS as input format for glycowork (this will be called within canonicalize_iupac and the decorators); note that for now coverage may not be perfect yet
- Added oxford_to_iupac to support Oxford as input format for glycowork (this will be called within canonicalize_iupac and the decorators); note that for now coverage is limited
- check_nomenclature (formerly in motif.tokenization) now handles outputting warning messages for trying to use non-string, non-graph nomenclatures or SMILES with glycowork functions
- Expanded find_isomorphs to generate more isomorphic sequence variants, thereby increasing the chances that choose_correct_isoform will have access to the canonical sequence
- Fixed a rare issue with canonicalize_iupac where sequences coming from structure_to_basic would sometimes be formatted incorrectly if they contained dHex
- Fixed an issue in find_isomorphs in which double branches were not always correctly swapped
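The composition conversion described above can be pictured with a small standalone sketch. This is not glycowork's canonicalize_composition; the function name parse_composition, the shorthand table, and the choice of output names (for example keeping Fuc rather than mapping it to dHex) are assumptions for illustration only.

```python
import re

# Hypothetical shorthand table for illustration; glycowork's own mapping may differ.
SHORTHAND = {"H": "Hex", "N": "HexNAc", "F": "Fuc", "A": "Neu5Ac", "G": "Neu5Gc"}
FULL_NAMES = ["Neu5Ac", "Neu5Gc", "HexNAc", "dHex", "Hex", "Fuc"]

# Longer names come first so that 'HexNAc' is not read as 'Hex' and the
# digit inside 'Neu5Ac' is not mistaken for a count.
TOKEN = re.compile("(" + "|".join(FULL_NAMES + list(SHORTHAND)) + r")(\d+)")


def parse_composition(comp):
    """Parse 'HexNAc2Hex1Fuc3Neu5Ac1' or 'N2H1F3A1' into a {name: count} dict."""
    counts = {}
    for name, number in TOKEN.findall(comp):
        name = SHORTHAND.get(name, name)  # expand one-letter codes
        counts[name] = counts.get(name, 0) + int(number)
    return counts
```

Under these assumptions, both input styles from the entry above resolve to the same dictionary, which is the point of canonicalizing compositions before passing them on.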
analysis
- get_heatmap now no longer tries to convert data to relative abundances if negative values are detected in the input
- All functions using dataframes as inputs in analysis can now also be used by providing full filepaths to the .csv file instead
- Optimized some of the code for readability and speed (everything should be at least a bit faster now)
annotate
- get_k_saccharides is now allowed to generate new dynamic motifs with tokens outside of lib (via expand_lib)
- annotate_glycan and annotate_dataset now also support narrow wildcards
- Fixed an issue in count_unique_subgraphs_of_size_k in which branched motifs were not always correctly formatted (i.e., opening/closing brackets)
- get_k_saccharides now outputs dataframes with counts as default and can yield the old nested lists of motifs by setting the new keyword just_motifs to True
- Fixed an edge case in which get_k_saccharides sometimes overcounted individual monosaccharides if their strings overlapped
graph
- subgraph_isomorphism and compare_glycans now support using wildcards and position encoding at the same time. The extra keyword argument is now deprecated and the functions auto-detect whether anything has been specified in wildcards and/or termini_list
- subgraph_isomorphism and compare_glycans now support automatically inferred narrow wildcards to allow for (i) matching linkages like a1-? to only specified linkages within that group (e.g., a1-3 but not b1-3 etc.) and (ii) matching monosaccharide types like HexNAc to only specified monosaccharides of that type (e.g., GlcNAc but not Glc, etc.)
- The wildcard_list keyword argument in all graph & annotation functions is now deprecated as wildcards are inferred automatically via narrow wildcards and native full wildcards (?1-? and Monosaccharide)
- subgraph_isomorphism now behaves as expected for testing motifs ending in linkages on glycans ending in linkages
- subgraph_isomorphism can now return the matched subgraphs in the input glycan with the new return_matches keyword argument
- glycan_to_nxGraph is now decorated with the rescue_glycans decorator, which auto-canonicalizes IUPAC strings if they are not in the format preferred by glycowork
- Fixed mismatch of labels and string_labels in categorical_node_match_wildcard
- Fixed an issue in subgraph_isomorphism in which, when using positional encoding, sometimes the mirror image of a motif was incorrectly captured if the termini aligned
- termini_list within subgraph_isomorphism now only requires the specification of monosaccharide positions
- Added expand_termini_list helper function to facilitate the expansion of monosaccharide-only termini_list into full termini_list behind the scenes
- Added support for shorthand notation of position encoding, now either ‘terminal’ or ‘t’ will work
- Improved handling of complex branching in graph_to_string; should be fewer unexpected translations now
- Fixed an issue in graph_to_string in which induced subgraphs could cause errors due to unexpected or weirdly sorted node indices
- Fixed an edge case in which the reducing end could be sometimes calculated as ‘internal’ when termini=‘calc’ in glycan_to_nxGraph
- Deprecated a duplicate character_to_label and string_to_labels
- Deprecated categorical_termini_match; the functionality is now handled within categorical_node_match_wildcard
- Deprecated the wildcards keyword argument from compare_glycans as this will now be detected internally, if wildcards are provided via wildcard_list
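The idea behind narrow wildcards can be sketched in a few lines. The code below is a toy illustration and not the matching logic inside subgraph_isomorphism or compare_glycans; the abbreviated inventories LINKAGES and HEXNAC and the function names possible_linkages and narrow_match are hypothetical.

```python
# Hypothetical, abbreviated inventories; glycowork keeps much larger sets.
LINKAGES = {"a1-2", "a1-3", "a1-4", "a1-6", "a2-3", "a2-6",
            "b1-2", "b1-3", "b1-4", "b1-6"}
HEXNAC = {"GlcNAc", "GalNAc", "ManNAc"}


def possible_linkages(wildcarded):
    """Return all known linkages compatible with a pattern such as 'a1-?'."""
    return {
        link for link in LINKAGES
        if len(link) == len(wildcarded)
        and all(p == "?" or p == c for p, c in zip(wildcarded, link))
    }


def narrow_match(pattern, token):
    """Check whether a concrete token satisfies a possibly wildcarded pattern."""
    if pattern == token:
        return True
    if pattern == "HexNAc":
        # A monosaccharide type only matches members of that type
        return token in HEXNAC
    if "?" in pattern:
        # A wildcarded linkage only matches linkages within its group
        return token in possible_linkages(pattern)
    return False
```

In this sketch, narrow_match("a1-?", "a1-3") succeeds while narrow_match("a1-?", "b1-3") fails, and "HexNAc" accepts GlcNAc but not Glc, mirroring the two cases listed above.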
tokenization
- Composition functions (e.g., composition_to_mass) are now decorated with rescue_compositions, which means that they can be used with compositions like “H3N2” (basically anything that canonicalize_composition can handle)
- Deprecated character_to_label as it’s now handled within string_to_labels
- Moved check_nomenclature into motif.processing
- Optimized some of the code for readability and speed (most things should be at least a bit faster now)
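The "rescue" pattern mentioned above, retrying a function with canonicalized input after a failure, can be sketched as a plain decorator. This is an assumption-laden illustration rather than glycowork's rescue_compositions: the names rescue, canonicalize, and count_monosaccharides are hypothetical, and the stand-in canonicalizer only understands single-digit counts of H and N.

```python
from functools import wraps


def canonicalize(comp):
    """Stand-in canonicalizer (hypothetical): expands 'H3N2' into a dict."""
    names = {"H": "Hex", "N": "HexNAc"}
    return {names[comp[i]]: int(comp[i + 1]) for i in range(0, len(comp), 2)}


def rescue(func):
    """Retry func with a canonicalized first argument if the first call fails."""
    @wraps(func)
    def wrapper(comp, *args, **kwargs):
        try:
            return func(comp, *args, **kwargs)
        except Exception:
            # Only reached when the input was not already in the expected format
            return func(canonicalize(comp), *args, **kwargs)
    return wrapper


@rescue
def count_monosaccharides(comp):
    # Expects a dict; a plain string has no .values() and triggers the rescue
    return sum(comp.values())
```

Calling count_monosaccharides("H3N2") and count_monosaccharides({"Hex": 3, "HexNAc": 2}) then gives the same result, and well-formed input pays no extra canonicalization cost.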
draw
- Support motif highlighting in GlycoDraw: by providing the highlight_motif keyword argument, motifs can be highlighted (everything else will be set to low opacity). Works with IUPAC-condensed motifs and named motifs from known
- Support wildcards in motif highlighting with the highlight_wildcard_list keyword argument, for instance highlighting all Gal(?1-?)GlcNAc subunits (for Gal(b1-?)GlcNAc you don’t need highlight_wildcard_list, as narrow wildcards are handled automatically)
- Support positional encoding in motif highlighting with the highlight_termini_list keyword argument, for instance highlighting all terminal, non-reducing end Gal(b1-?)GlcNAc subunits (yes, you can use both wildcards and positional encoding at the same time😊)
- Support drawing of repeat structures (indicated by brackets and the number of repeats) via the new repeat keyword argument. Internal repeats can also be specified with the additional repeat_range keyword argument.
- Optimized some of the code for readability and speed (most things should be at least a bit faster now)
network
biosynthesis
- Optimized some of the code for readability and speed (everything should be up to 2x faster now)
evolution
- Optimized some of the code for readability and speed (everything should be at least a bit faster now)
ml
- Optimized some of the code for readability and speed (most things should be at least a bit faster now)
v0.8.1-zenodo
Literally no code changes at this point (0.9 is expected to come in December) but Zenodo requires a new release to mint a doi
v0.8.1
v0.8.0
Change Log
For Version 0.8.0
- Linted the package with flake8
- Increased code coverage
- Added another optional extras install, [chem], including glyles, requests, and pubchempy
glycan_data
- Changed lib to be a dict of type glycoletters:index, as it’s faster to index a dict vs. a long list; also adapted all functions using lib to reflect this change
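Why a dict helps here can be seen in a toy example. The library contents below are made up for illustration and are far smaller than glycowork's lib; only the glycoletter:index layout is taken from the entry above.

```python
glycoletters = ["Gal", "a1-3", "Man", "GlcNAc", "b1-4", "Fuc"]  # toy library

# Old layout: a sorted list, where finding a position is a linear scan
lib_list = sorted(glycoletters)

# New layout: glycoletter -> index, an average constant-time hash lookup
lib_dict = {letter: index for index, letter in enumerate(lib_list)}


def encode(tokens, lib):
    """Translate glycoletters into their integer indices using a dict lib."""
    return [lib[token] for token in tokens]


# Both layouts agree on the index; only the lookup cost differs
same_index = lib_list.index("Man") == lib_dict["Man"]
```

For a library with thousands of glycoletters, replacing repeated lib_list.index(...) calls with lib_dict[...] removes a scan per token, which is where the speed-up comes from.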
loader
- Added replace_every_second helper function
- Updated linkages list
- Changed linkages and Hex etc to be sets instead of lists
motif
processing
- Added variance_stabilization for variance stabilization normalization, both globally and group-specific
- Added in_lib helper function to check whether all glycoletters of glycan are in lib
- Deprecated small_motif_find
- cohen_d now also returns the variance of the effect size and supports paired samples as well (calculating Cohen’s dz in this case)
- Added mahalanobis_distance to calculate Mahalanobis distance as an effect size for multivariate comparisons
- Added mahalanobis_variance to estimate variance of Mahalanobis distance via bootstrapping
- Added MissForest for random forest based data imputation
- Cleaned up canonicalize_iupac and made it slightly faster
- Added variance_based_filtering
- Added impute_and_normalize and underlying helper functions
- Fixed numpy random seed for reproducibility
- Sped-up presence_to_matrix
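For readers unfamiliar with the distinction between Cohen's d and Cohen's dz mentioned above, here is a minimal sketch using only the standard library. It follows the textbook definitions (pooled standard deviation for independent groups, standard deviation of the differences for paired samples) and is not a copy of glycowork's cohen_d, which, per the entry above, additionally returns the variance of the effect size.

```python
from statistics import mean, stdev, variance


def cohen_d(x, y, paired=False):
    """Effect size between two groups; Cohen's d_z if the samples are paired."""
    if paired:
        # d_z: mean of the pairwise differences over their standard deviation
        diffs = [a - b for a, b in zip(x, y)]
        return mean(diffs) / stdev(diffs)
    nx, ny = len(x), len(y)
    # Pooled variance weights each group's sample variance by its degrees of freedom
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / pooled_var ** 0.5
```

Because dz is based on within-pair differences, it can be considerably larger than d for the same data when the pairs are strongly correlated.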
tokenization
- Deprecated mz_to_composition
- mz_to_composition2 is now the new mz_to_composition
- Adapted mz_to_structures, compositions_to_structures, and match_composition_relaxed to work with this change
annotate
- Added create_correlation_network to identify clusters of highly correlated glycans/motifs
- Added count_unique_subgraphs_of_size_k as a helper function within get_k_saccharides
- Refactored get_k_saccharides to be faster and more complete (and be, effectively, a replacement of motif_matrix)
- annotate_dataset now uses get_k_saccharides for mono- and disaccharides, instead of motif_matrix
- Deprecated motif_matrix
- annotate_dataset now also creates relevant ?-containing motifs if ‘terminal’ in feature_set, even if they don’t explicitly occur in the glycan strings
- Big speed-up for annotate_dataset if known=True, as we now cache the precalculated motif graphs
- Added quantify_motifs as a wrapper around annotate_dataset to adequately distribute relative abundances across extracted motifs
- Deprecated estimate_lower_bound as speed-ups make it no longer necessary
analysis
- Renamed make_heatmap to get_heatmap
- Renamed make_volcano to get_volcano
- Deprecated replace_zero_with_random_gaussian (this is now handled by MissForest in .processing within impute_and_normalize)
- Added hotellings_t2 for multivariate comparisons
- Changed multiple-testing correction method from Holm-Sidak to Benjamini-Hochberg
- Added variance_stabilization in get_differential_expression
- Added the option to analyze highly correlated sets of glycans/motifs (via create_correlation_network) within get_differential_expression
- Implemented usage of hotellings_t2 and the Mahalanobis distance (as effect size) for usage if sets are analyzed within get_differential_expression
- get_heatmap and get_differential_expression now scale abundances by the actual counts of motifs per glycan, not just absence/presence
- Added get_meta_analysis to estimate combined effect sizes from the results of multiple studies (both fixed-effects and random-effects models can be estimated)
- Added variance_based_filtering in get_differential_expression
- Effect size variances can now also be retrieved within get_differential_expression via the effect_size_variance keyword argument
- get_differential_expression now also can handle paired samples when paired=True
- get_differential_expression now also tests the homogeneity of variances using Levene’s test in all settings (also multiple-testing controlled)
- Added get_glycanova to use ANOVA-based analyses on glycomics datasets (uses basically all the improvements of get_differential_expression, including analysis on the motif level)
- Added get_pca to plot glycomics data (also has the motif interface)
- Added get_pval_distribution to plot the distribution of p-values
- Added get_ma to plot a Bland-Altman plot
- Added get_glycan_change_over_time to detect significant changes in time-course data via OLS fitting
- Added get_time_series as a wrapper around get_glycan_change_over_time to do time series analyses, with all the motif & normalization functionality
- Added get_coverage to visualize glycan expression across samples (ordered by average intensity) in a coverage plot
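The switch to Benjamini-Hochberg correction mentioned above refers to a standard false-discovery-rate procedure. The following standalone sketch shows the textbook step-up adjustment; it is not the code path used inside glycowork, and the function name benjamini_hochberg is chosen here for illustration.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    # Indices of the p-values, sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Unlike Holm-Sidak, which controls the family-wise error rate, this procedure controls the expected fraction of false discoveries, which is usually less conservative when many glycans or motifs are tested at once.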
draw
- Added import warning if draw dependencies are not installed
- Removed pycairo from dependencies
- Modified annotate_figure to be compatible with .svg files from older Matplotlib versions
- Changed “output” to “filepath” in GlycoDraw
- If there are “?” in the provided filepath for GlycoDraw, they will now be automatically replaced with “_” to avoid saving errors
graph
- Sped-up glycan_to_graph / glycan_to_nxGraph (and all downstream functions, which are a lot)
- Also improved the runtime of downstream functions, such as subgraph_isomorphism, independent of these advances
- subgraph_isomorphism now also accepts precalculated motif graphs as inputs (in addition to the already supported precalculated glycan graphs)
ml
- Rephrased import warnings to reflect optional install strategy for extra dependencies
model_training
- Sped-up train_ml_model
network
biosynthesis
- create_neighbors no longer uses the libr keyword
v0.7.0
Change Log
For Version 0.7.0
- Removed support for Python 3.7; as we use the walrus operator in some of the re-worked functions, Python 3.8+ is now required to use glycowork
- Added optional installs for specialized glycowork usage (‘all’, ‘ml’, and ‘draw’; for now), which install additional dependencies for these usages; more details in docs
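For context on the Python 3.8+ requirement, the walrus operator (:=) assigns a value inside an expression. The snippet below is a generic illustration and not taken from glycowork; the helper first_linkage and its regular expression are hypothetical.

```python
import re


def first_linkage(glycan):
    """Return the first linkage such as 'b1-4' in a glycan string, or None."""
    # The walrus operator binds the match object and tests it in one expression;
    # this line is a SyntaxError on Python 3.7 and earlier
    if (match := re.search(r"\(([ab?]\d-[\d?])\)", glycan)) is not None:
        return match.group(1)
    return None
```

On older interpreters the same logic needs a separate assignment line before the if statement, which is why code using := cannot be imported under Python 3.7.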
glycan_data
- Updated datasets, models, lib to be bigger & better; removed many sequence duplicates with differently written branch orderings
loader
- Added multireplace helper function, to map a dictionary of changes to a string
- Made build_custom_df faster
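A helper that maps a dictionary of changes onto a string can be as small as the sketch below. The signature and the sequential application of replacements are assumptions; glycowork's multireplace may differ in both.

```python
def multireplace(string, replacements):
    """Apply every old -> new pair in the replacements dict to string."""
    # Replacements run in dict insertion order, so earlier pairs can
    # affect what later pairs see
    for old, new in replacements.items():
        string = string.replace(old, new)
    return string
```

A typical use in this context would be normalizing notation in one pass, for example multireplace("Neu5Acα2-3Galβ1-4Glc", {"α": "a", "β": "b"}).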
motif
draw
- Added draw as a new submodule of .motif
- Added GlycoDraw to draw glycans in SNFG style and save them as .svg/.pdf
- Added annotate_figure to replace glycan text with glycan images in .svg figures (heatmaps, volcano plots, etc.)
- Added text_to_glycan, which replaces glycan strings in figures with glycan images
- Added scale_in_range to normalize a list of numbers within a range
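Normalizing a list of numbers into a range is commonly done with min-max scaling, sketched below. Whether glycowork's scale_in_range uses exactly this formula, this signature, or this handling of constant input is an assumption.

```python
def scale_in_range(numbers, lower, upper):
    """Linearly rescale numbers so the smallest maps to lower and the largest to upper."""
    smallest, largest = min(numbers), max(numbers)
    if largest == smallest:
        # All values identical: avoid dividing by zero (assumed behavior)
        return [float(lower) for _ in numbers]
    span = largest - smallest
    return [lower + (n - smallest) * (upper - lower) / span for n in numbers]
```

In a drawing context, this kind of rescaling is useful for mapping arbitrary values (such as abundances) onto a fixed visual range like line widths or opacities.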
tokenization
- Sped up glycan_to_composition by 1000x (avoiding explicit stemification and just doing stemification of the building blocks); also speeds up all functions using glycan_to_composition
- Sped up composition_to_mass (independent of the above)
- glycan_to_composition (and downstream functions) now can handle more post-biosynthetic modifications: Ac, PCho, PEtN
- Renamed calculate_theoretical_mass to glycan_to_mass
- Sped up mz_to_composition2 by (i) filtering out duplicate compositions and (ii) selecting compositions from a chosen taxonomic kingdom
- Reprioritized mz_to_composition2 by first searching for native compositions and only then looking for compositions + adducts and only then searching for doubly-charged compositions
- canonicalize_iupac now also handles floating substituents and can handle many more typos / inconsistencies / IUPAC dialects (such as CFG-coded glycans), including improvements made by Kathryn Klarich
- Moved canonicalize_iupac into motif.processing
- Expanded get_core (and downstream functions) with HexA, HexNAc, dHex
- Expanded map_to_basic to (some) post-biosynthetic modifications
- mz_to_structures no longer outright fails if no m/z value can be matched
- Deprecated structures_to_motifs; annotate_dataset can do the same
processing
- Fixed bug in processing glycans with floating substituents in small_motif_find
- Deprecated seed_wildcard
- choose_correct_isoform has been updated to keep up with the improved find_isomorphs
- Added more informative error message to IUPAC_to_SMILES
- get_lib is now slightly faster
graph
- Sped up compare_glycans with string inputs, by avoiding graph operations when the two glycans do not have the same composition
- Added support for enabling modification wildcards in compare_glycans and subgraph_isomorphism (for instance matching GalOS and Gal6S) by setting wildcards_ptm = True
- Sped up glycan_to_nxGraph_int by optimizing node label/attribute assignments
- Refactored graph_to_string to be a lot more robust, streamlined, and faster. Its new integration with canonicalize_iupac may also result in string improvement upon back-translation (e.g., branch order canonicalization)
- ensure_graph now has **kwargs that get passed to glycan_to_nxGraph
- get_possible_topologies now supports internal additions as well, with the keyword argument ‘exhaustive’
- possible_topology_check now supports wildcard matching via **kwargs passed on to compare_glycans
- Made changes to make glycowork compatible with NetworkX 3.0
- Moved bracket_removal to motif.processing
- Fixed a small inconsistency in handling floating substituents in glycan_to_nxGraph_int that could have caused issues with custom libs
- override_reducing_end is no longer needed in glycan_to_nxGraph to delineate linkage-ending glycans (e.g., Fuc(a1-2)); this is auto-inferred within glycan_to_nxGraph now
annotate
- Deprecated convert_to_counts_glycoletter and glycoletter_count_matrix; motif_matrix can do both
- Refactored motif_matrix to be substantially faster and more condensed in its output (also speeds up annotate_dataset with the ‘exhaustive’ option in the feature_set argument)
- Expanded motif_matrix to implicitly test for subsumption enrichment (e.g., previously we only explicitly looked for “Gal(b1-?)GlcNAc”; now we also count “Gal(b1-4)GlcNAc” toward the former)
- annotate_glycan is now dual-compatible with string and networkx graph input
- Expanded feature_set in annotate_dataset by the option ‘terminal’, which calls get_terminal_structures
- This usage of get_terminal_structures in annotate_dataset now also does the same implicit test for subsumption enrichment as described for motif_matrix above
- annotate_dataset now creates its own lib, based on the motif list and the provided glycans
- Expanded find_isomorphs to also be able to re-shuffle (some) branched branches
- Moved find_isomorphs into motif.processing
- Linkages-only are no longer considered by motif_matrix / annotate_dataset
analysis
- All functions with the feature_set keyword argument now can also use the ‘terminal’ keyword for analyzing non-reducing end motifs exclusively
- Added get_differential_expression to compare glycomics data, including data cleaning and imputation
- get_pvals_motifs and make_heatmap no longer have the lib keyword argument, as annotate_dataset will generate a suitable lib internally
- Fixed relative abundance summation in motif-mode for make_heatmap
- Added the clean_up_heatmap helper function to remove redundant (i.e., identical) rows in heatmaps, with a prioritization of named motifs and longer motifs containing redundant shorter motifs
- Added make_volcano, to generate a volcano plot from internally calculated differential expression using the get_differential_expression function
- Moved cohen_d into motif.processing
ml
model_training
- train_ml_model no longer has the lib keyword argument, as annotate_dataset will generate a suitable lib internally
network
biosynthesis
- Refactored construct_network pipeline to be faster and more memory-efficient
- reducing_end has been deprecated and is being handled internally
- Added infer_roots to auto-infer permitted_roots (also does not need to be specified any longer in construct_network)
- Implemented distance limit, to prevent combinatorial explosion when outlier glycans are present
- Deprecated subgraph_to_string and make_network_from_edges
- Deprecated fill_with_virtuals and make_network_directed
- Minor speed-up of process_ptm, by pre-calculating stem_lib once instead of for every glycan in network
v0.6.0
Change Log
For Version 0.6.0
- Updated nbdev1 to nbdev2
- Updated documentation notebooks
- Expanded documentation examples for (i) networks and (ii) deep learning models
glycan_data
- Updated v7_sugarbase and associated files + models
- Improved Cellosaurus ID prefixes
- Added glycan composition as a new column to sugarbase
- Exchanged ‘z’ with ‘?’ as a linkage uncertainty indicator
- Added protein column to glycan_binding, indicating the protein name whose sequence is in the target column
loader
- Added “Ins” and “Galf” to Hex list
- Added stringify_dict utils function to convert a dictionary into a string
motif
- Changed functions to use “?” as a linkage uncertainty indicator rather than “z”
processing
- Added enforce_class to check whether glycan is from desired glycan class
- Added IUPAC_to_SMILES to convert glycans from IUPAC-condensed into SMILES via GlyLES
graph
- glycan_to_nxGraph can now use glycan strings with floating substituents, such as “{Neu5Ac(a2-3)}Gal(b1-4)GlcNAc(b1-6)[Gal(b1-3)]GalNAc”
- added get_possible_topologies and possible_topology_check to probe whether glycans (could) match a glycan with floating substituents
- added ensure_graph to allow functions to be dual-compatible for string & graph inputs
- generate_graph_features, largest_subgraph, get_possible_topologies, and possible_topology_check are now dual-compatible with string & graph inputs
tokenization
- Refactor match_composition_relaxed to be slightly faster & a much smaller function, that uses glycan_to_composition for matching
- Deprecated match_composition accordingly
- mz_to_composition is now up to 100x faster, based on much better defaults / assumptions
- added support for free oligosaccharides to mz_to_composition
- added mz_to_composition2 as an alternative way of composition matching; better scaling and “more physiological” as it’s constrained by class-specific existing compositions within sugarbase
- glycan_to_composition can now also handle post-biosynthetic modifications such as sulfation
- added composition_to_mass
- Improve linkage uncertainty handling in canonicalize_iupac
- canonicalize_iupac now can handle sulfation and phosphorylation
- updated stemify_glycan & structure_to_basic to correctly handle glycans of length 1
- updated stemify_glycan to terminate the while loop if it would result in infinite loops
- updated glycan_to_composition to support floating substituents
- get_core now also handles “Ins” correctly
- calculate_theoretical_mass now can also handle methylation modifications correctly
- improved reducing end calculation for modified glycans in calculate_theoretical_mass
- added speed-up option to calculate_theoretical_mass & glycan_to_composition for non-exotic glycans
- refactored calculate_theoretical_mass to use composition_to_mass
annotate
- add get_terminal_structures to extract monosaccharide+linkage from all non-reducing ends of glycan
- improved runtime and completeness for get_k_saccharides
- get_terminal_structures & get_k_saccharides are now also both dual-compatible with string & graph inputs
- added get_molecular_properties to obtain chemical features of glycans via SMILES
- ‘chemical’ is a new option in feature_set of annotate_dataset, using get_molecular_properties
- small style fix in motif_matrix to avoid warning
- link_find (and downstream annotation findings) now also support floating substituents
analysis
- add cohen_d to calculate effect size between two comparison groups
- ‘chemical’ is a new option in feature_set of get_pvals_motifs and make_heatmap, using get_molecular_properties
ml
model_training
- added the option to use GSAM instead of SAM for the optimizer by specifying alpha in training_setup
models
- streamlined SweetNet architecture (credit to David Alexander) used in SweetNet and LectinOracle, for faster training and clearer code
network
biosynthesis
- added a dictionary of pre-calculated glycan graphs to construct_network and underlying functions, for a ~2x speed-up and better scaling
- various other performance improvements to network construction functions further increase speed
- improved pruning of virtual root nodes in construct_network
- modified export_network to allow for custom node attribute extraction
- generalized find_diamonds to allow for extraction of diamonds, hexagons, etc with a custom parameter nb_intermediates (default: 2, for diamonds)
- generalized choose_path to compute path probabilities for non-diamond shape motifs
evolution
- small fix in calculate_distance_matrix
v0.5.0
Change Log
For Version 0.5.0
- added more in-line documentation to all functions/modules
glycan_data
- df_species is now being generated internally from df_glycan and is no longer a separate file
- added build_custom_df to generate df_species, df_tissue, and df_disease from sugarbase/df_glycan
- We are retiring ‘bond’. Instead, the default for full linkage uncertainty is now z1-z / z2-z. Replace z with ? for full compatibility with IUPAC-condensed
- The ethanolamine modification (previously Etn) is now EtN for consistency with the style of other modifications
- tissue associations now have either Uberon IDs (tissues etc.) or Cellosaurus IDs (cell lines)
- disease associations now have a Disease Ontology ID
- tissue and disease associations now also have a species designation (in tissue_species and disease_species, respectively)
- the internal lib is now a .pkl file instead of being calculated each time the package is loaded
- shifted glycan_representations_species.pkl into .motif, where it will be loaded upon calling .motif.analysis.plot_embeddings
- shifted df_glysum into .alignment, where it will be loaded upon calling .alignment.glysum.pairwiseAlign
- it should be noted that we may deviate more and more from the provided GlyTouCan IDs, as we strive towards removing unnecessary uncertainty (e.g., specifying the core Fuc as alpha, regardless of whether it has been denoted as alpha in the official GlyTouCan entry)
- updated positional information in motif_list to account for new graph generation output
loader
- Deprecated load_file
motif
tokenization
- added mz_to_composition to match m/z values from glycomics with possible monosaccharide compositions
- added mz_to_structures wrapper to directly go from m/z values to matching glycan sequences
- changed some required arguments to optional arguments in compositions_to_structures and mz_to_structures (the default is now human glycans with no additional relative intensities)
- fixed an issue in compositions_to_structures in which an error was returned if none of the provided compositions had any structure matches
- update stemify_glycan to the z-nomenclature for linkage uncertainty
- compositions_to_structures now allows for input of custom Hex, HexNAc, and dHex lists
- condense_composition_matching is updated to the z-linkage uncertainty nomenclature
- sped up composition matching by only considering glycans with correct number of monosaccharides
- added canonicalize_iupac to allow for conversion of other IUPAC “flavors” into the version of IUPAC-condensed nomenclature optimized for glycowork
- added structure_to_basic, glycan_to_composition, and calculate_theoretical_mass utility functions to convert glycan sequences into topologies, compositions, and their theoretical mass, respectively
processing
- added choose_correct_isoform to distinguish glycan branch isomers
- deprecated process_glycans and motif_find
- refactored get_lib to use min_process_glycans
- condensed small_motif_find
- moved check_nomenclature into .motif.tokenization + integrated canonicalize_iupac into it
analysis
- updated characterize_monosaccharide to work with seaborn 0.11.2+
graph
- overhauled graph generation (glycan_to_graph, glycan_to_nxGraph, graph_to_string) to be more robust, modular, and simpler / easier to maintain
- combined fast_compare_glycans and compare_glycans into compare_glycans (which internally detects whether strings or precomputed graphs were provided)
- compare_glycans (and its dependencies) is also 2-3x faster now
- subgraph_isomorphism also should be 2-3x as fast as before
- updated graph_to_string to the z-nomenclature for linkage uncertainty
- fixed a bug in the counting mode of subgraph_isomorphism, in which the graph was modified in-place if precomputed graphs were provided and the function was called multiple times
- glycan_to_nxGraph received a new optional argument to enable generating graphs of glycans ending in a linkage but note that this output will not work for all downstream functions
- correspondingly subgraph_isomorphism can now use motifs ending in a linkage as input
- wildcard matching for compare_glycans etc now uses the string labels instead of the regular lib index labels to define the wildcards
query
- dramatically sped up get_insight by first checking for string identity before doing graph isomorphisms
annotate
- fix scipy import for compatibility with scipy 1.8.0
- improved get_k_saccharides to be (i) compatible with the new graph generation approach and (ii) be a lot more robust and exhaustive
ml
- modified GPU utilization to allow CPU usage of all functions (in theory)
models
- the trained model file for LectinOracle_flex is now contained within the package instead of being loaded externally
- deprecated functions for loading external LectinOracle_flex model
processing
- refactored dataset_to_graphs to directly import from NetworkX graphs
train_test_split
- renamed taxonomic_multilabel to prepare_multilabel, as it now also works for preparing training datasets for tissue and disease associations
model_training
- SAM will now only be loaded by training_setup in case of multiclass or multilabel classification (for performance reasons)
network
- functions working with biosynthetic networks can now use dictionaries of pre-computed networks as inputs; the default option is the pre-computed milk glycan biosynthetic networks stored within glycowork
biosynthesis
- added trace_diamonds to automatically extract diamond-shaped motifs from networks and leverage evolutionary information to return likelihoods for real paths
- replaced infuse_network with highlight_network, which allows you to highlight motifs, species-specific glycans, abundances, and degree of conservation in a network
- added prune_network to cut away virtual paths that are unlikely to impossible (depending on threshold)
- added evoprune_network as a wrapper for trace_diamonds, highlight_network, prune_network
- fixed an issue in choose_path returning an error if a path doesn’t occur in any other species; now it returns an empty dictionary
- fixed an issue in propagate_virtuals that prevented proper deorphanization for O-glycans
- fixed a suffix issue in PTM detection for non-milk networks
- made get_virtual_nodes and construct_network more robust toward unusual branch ordering
- improved construct_network to prune virtual leaf nodes with degree > 1
- functions requiring a filepath now require a species : network dictionary as function input
evolution
- added check_conservation to assess the evolutionary conservation of glycans and glycan motifs via biosynthetic networks
- added get_communities to use the Louvain community detection algorithm, e.g., in biosynthetic networks
- refactored distance matrix calculation as separate function, calculate_distance_matrix
alignment
- retired alignment until significant improvements can be made
v0.4.0
Change Log
For Version 0.4.0
ml
models
- added NSequonPred (for predicting whether N-linked sequons will be glycosylated) as a trained model
- added LectinOracle_flex as a trained model (doing the same thing as LectinOracle but able to use raw protein sequences as input rather than ESM-1b representations; with comparable performance)
- modified prep_model to allow for NSequonPred and LectinOracle_flex selection
- added more model initialization options and adjusted their defaults in prep_model
model_training
- changed default optimizer from AdamW to AdamW+SAM (Sharpness-Aware Minimization from https://arxiv.org/abs/2010.01412); typically increases model performance on test set by ~2%
- implemented support for training models for multilabel classification
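The two-step update behind Sharpness-Aware Minimization can be sketched in plain Python on a toy quadratic loss. This is not glycowork's training code: the names sam_step, rho, and lr are hypothetical, and plain gradient descent stands in for AdamW as the base optimizer to keep the example short.

```python
import math

def loss(w):
    # Toy quadratic loss with its minimum at w = (1, -2)
    return (w[0] - 1.0) ** 2 + 3.0 * (w[1] + 2.0) ** 2

def grad(w):
    # Analytical gradient of the toy loss
    return [2.0 * (w[0] - 1.0), 6.0 * (w[1] + 2.0)]

def sam_step(w, lr=0.05, rho=0.05):
    """One SAM step, with plain gradient descent as the base optimizer."""
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # avoid division by zero
    # Step 1: move to the approximate worst-case point within an L2 ball of radius rho
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Step 2: take the gradient there, but apply it to the original weights
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w = [4.0, 3.0]
initial_loss = loss(w)
for _ in range(200):
    w = sam_step(w)
final_loss = loss(w)
print(initial_loss, final_loss)
```

Each step needs two gradient evaluations, which is the main cost of wrapping an optimizer with SAM.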
train_test_split
- added taxonomic_multilabel to prepare taxonomic glycan data for multilabel classification
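Preparing data for multilabel classification amounts to collapsing glycan-label associations into one multi-hot vector per glycan. The sketch below shows that idea under simple assumptions: the input is a list of (glycan, label) pairs, and the function name to_multilabel is hypothetical and does not reflect the signature of taxonomic_multilabel.

```python
def to_multilabel(pairs):
    """Collapse (glycan, label) pairs into one multi-hot vector per glycan."""
    classes = sorted({label for _, label in pairs})
    index = {label: i for i, label in enumerate(classes)}
    vectors = {}
    for glycan, label in pairs:
        # Each glycan gets a zero vector once, then a 1 for every observed label
        vec = vectors.setdefault(glycan, [0] * len(classes))
        vec[index[label]] = 1
    return classes, vectors

# Example associations chosen for illustration only
pairs = [
    ("Gal(b1-4)Glc", "Homo_sapiens"),
    ("Gal(b1-4)Glc", "Bos_taurus"),
    ("Fuc(a1-2)Gal(b1-4)Glc", "Homo_sapiens"),
]
classes, vectors = to_multilabel(pairs)
print(classes)
print(vectors)
```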
inference
- added get_Nsequon_preds to use NSequonPred for inference
- modified get_lectin_preds to allow for LectinOracle_flex usage
motif
graph
- modified subgraph_isomorphism to use both string and precalculated graph inputs
- modified subgraph_isomorphism to be able to count the number of occurring subgraphs
- glycan_to_nxGraph now also records the actual monosaccharide/linkage strings as “string_labels” in the node labels
- glycan_to_nxGraph and graph_to_string can now also operate on monosaccharides (glycans of length 1)
- added largest_subgraph to identify the largest common subgraph between two glycans
annotate
- annotate_glycan now passes a precalculated graph to subgraph_isomorphism, making motif annotation ~3x faster (this also applies to many heatmap applications, etc.)
- annotate_glycan & annotate_dataset now also return the number of known/named motifs per glycan
- replaced get_trisaccharides with get_k_saccharides that allows for motif recognition of user-defined size
- bug fixes
tokenization
- added constrain_prot and prot_to_coded to process protein sequences for LectinOracle_flex
- added mask_rare_glycoletters to mask rare monosaccharides and linkages in glycan sequences
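Masking rare monosaccharides and linkages can be sketched as a frequency filter over tokens. The example assumes glycans that are already split into token lists; the function name mask_rare_tokens, the min_count threshold, and the "<rare>" placeholder are hypothetical and do not describe how mask_rare_glycoletters is implemented.

```python
from collections import Counter

def mask_rare_tokens(sequences, min_count=2, mask="<rare>"):
    """Replace tokens seen fewer than `min_count` times across the dataset with `mask`."""
    counts = Counter(token for seq in sequences for token in seq)
    return [
        [tok if counts[tok] >= min_count else mask for tok in seq]
        for seq in sequences
    ]

# Glycans given as already-tokenized lists of monosaccharides and linkages
tokenized = [
    ["Gal", "b1-4", "Glc"],
    ["Gal", "b1-4", "GlcNAc"],
    ["Gal3S", "b1-4", "Glc"],
]
masked = mask_rare_tokens(tokenized)
print(masked)
```

Tokens that occur only once in this toy dataset (GlcNAc and Gal3S) are replaced, while the sequence lengths stay unchanged.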
processing
- check_nomenclature now returns True if no red flag is raised
glycan_data
- replaced influenza_binding with the superset glycan_binding (564,647 protein-glycan interactions from 1,392 lectins)
loader
- added a reindex utility function
- updated linkages list
data_entry
- check_presence now ensures correct glycan nomenclature
network
biosynthesis
- added functions to consider post-translational glycan modifications when constructing biosynthetic networks (either via the process_ptm wrapper or as an option in construct_network)
- added functionality to convert biosynthesis networks into directed graphs (either via the make_network_directed wrapper or as an option in construct_network)
- added update_network to add new information to an already constructed biosynthetic network
- improved construct_network to enable finding paths for all nodes that can be connected to the biosynthetic root nodes
- added infuse_network to allow for visualizing glycomics abundance data together with biosynthetic networks
- added choose_path to leverage biosynthetic networks from other species to determine which path is taken in diamond shapes (A->B, A->C, B->D, C->D) where both paths are virtual/not observed
- various improvements to ensure that the code functionality also works for classes other than milk glycans, such as O-linked glycans
- better network layouts with pydot2
- added edge types (monosaccharide, monosaccharide+linkage, biosynthetic enzyme), which can be infused with differential gene expression information
- bug fixes & smaller improvements (e.g., pruning of virtual leaves, exporting of networks, user choice of edge type, etc.)
evolution
- added functions to calculate a distance matrix from glycan embeddings and use this to calculate dendrograms / evolutionary networks
- added distance_from_metric to calculate the distance between networks, e.g., via Jaccard distance
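As a rough illustration of comparing networks via Jaccard distance, the sketch below treats each network as a set of edges and builds a pairwise distance matrix. The data layout and the function names jaccard_distance and distance_matrix are assumptions made for this example, not the interface of distance_from_metric.

```python
def jaccard_distance(edges_a, edges_b):
    """Jaccard distance between two edge sets: 1 - |intersection| / |union|."""
    union = edges_a | edges_b
    if not union:
        return 0.0  # two empty networks are treated as identical
    return 1.0 - len(edges_a & edges_b) / len(union)

def distance_matrix(networks):
    """Symmetric matrix of pairwise Jaccard distances, in the order of `networks`."""
    names = list(networks)
    return [
        [jaccard_distance(networks[a], networks[b]) for b in names]
        for a in names
    ]

# Placeholder species and edges, for illustration only
networks = {
    "species_1": {("A", "B"), ("B", "C"), ("B", "D")},
    "species_2": {("A", "B"), ("B", "C")},
    "species_3": {("X", "Y")},
}
matrix = distance_matrix(networks)
print(matrix)
```

A matrix of this form is the kind of input that hierarchical clustering routines expect when building a dendrogram.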
v0.3.0
Change Log
For Version 0.3.0
ml
models
- added LectinOracle as option for prep_model & modified prep_model to allow for loading trained models
model_training
- train_ml_model now allows for additional (optional) input features
- changed default optimizer from Adam to AdamW
- changed default learning rate scheduler from cosine-decay to ReduceLROnPlateau
processing
- split_data_to_train now allows for additional (optional) input features
- label_type is now also an optional argument for split_data_to_train and all lower-level functions
model_training
- modified train_model to allow for LectinOracle training
representation/inference
- renamed the “representation” module to “inference”
- added get_lectin_preds to use LectinOracle for inferring binding specificity of lectins
- added get_esm1b_representation to retrieve ESM1b representations for new lectins, to use them as input for LectinOracle
motif
query
- added tissue expression and disease association to get_insight
- glytoucan_to_glycan now more robust in dealing with missing GlyTouCan IDs
tokenization
- added condense_composition_matching to find the minimum number of glycans to characterize matching compositions
- added compositions_to_structures wrapper function that will take a list of compositions, find possible matches, condense them into the minimum number of structures, and match them with values, such as provided relative intensities
- added structures_to_motifs function to convert datasets of relative intensities of glycan structures to relative intensities of the corresponding glycan motifs
- changed default mode of match_composition_relaxed to “exact”
- modified match_composition_relaxed to allow for filtering possible matches based on reducing end monosaccharide (e.g., O-linked glycans)
- fixed issue in match_composition_relaxed that prevented the addition of additional monosaccharide types to the composition
- moved motif_matrix and dependencies over to motif.annotate
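Converting relative intensities of structures into relative intensities of motifs can be pictured as a weighted sum. The sketch below assumes that each structure contributes its intensity once per occurrence of a motif; that weighting, the input dictionaries, and the function name structures_to_motif_intensities are assumptions for illustration and may differ from what structures_to_motifs does.

```python
def structures_to_motif_intensities(intensities, motif_counts):
    """Sum each structure's intensity into every motif it contains, weighted by occurrence count."""
    totals = {}
    for structure, intensity in intensities.items():
        for motif, count in motif_counts.get(structure, {}).items():
            totals[motif] = totals.get(motif, 0.0) + intensity * count
    return totals

# Placeholder structures, motifs, and intensities, for illustration only
intensities = {"structure_1": 0.5, "structure_2": 0.3, "structure_3": 0.2}
motif_counts = {
    "structure_1": {"motif_a": 1},
    "structure_2": {"motif_a": 2, "motif_b": 1},
    "structure_3": {"motif_b": 1},
}
motif_intensities = structures_to_motif_intensities(intensities, motif_counts)
print(motif_intensities)
```

With occurrence weighting, motif intensities need not sum to 1, so a normalization step may be needed depending on the downstream analysis.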
glycan_data
- replaced glyco_targets_species_seq_all_V4 (~23,000 species-specific glycans) and v4_sugarbase (~47,000 unique glycans) with glyco_targets_species_seq_all_V5 (~31,500 species-specific glycans) and v5_sugarbase (~50,500 unique glycans)
- added directed disease associations (currently 533 associations) and tissue expression (currently 2,815 associations) for glycans in v5_sugarbase
- changed nomenclature of glycolipids (mostly receive a “1Cer” at their reducing end, for instance “Glc1Cer”) and free oligosaccharides (receive an “-ol” at their reducing end, for instance “Glc-ol”)
- made Lewis motifs in motif_list more general
- correspondingly updated glycan ML models, representations, and substitution matrix
v0.2.0
motif
tokenization
- added functions for stemifying glycans (by removing rare modifications)
- added match_composition & match_composition_relaxed for finding glycan structures in stored or provided databases that match a provided composition. Can be narrowed down to, e.g., a species of interest.
graph
- added function to translate glycan graph back to IUPAC-condensed string
- added try_string_conversion function to check whether glycan graph describes valid glycan
- modified generate_graph_features to also work with networks
analysis
- updated plot_embeddings to use representation dataframes as inputs in addition to dictionaries
- swapped subplots in characterize_monosaccharide and modified labelling to enhance clarity
- get_pvals_motifs now allows for a custom motif_list via the optional motifs argument
- plot_embeddings now allows for a custom color palette
query
- added glytoucan_to_glycan function to interconvert GlyTouCan IDs and glycans
- get_insight now also yields the GlyTouCan ID of a glycan (if available) + the predicted taxonomy if no taxonomy is recorded in our database
annotate
- added get_trisaccharides to retrieve a subset of the trisaccharides occurring in a glycan
- added estimate_lower_bound to give make_heatmap + get_pvals_motifs a speedup option with estimate_speedup = True (warning: estimate_lower_bound is an estimate and might in theory lead to missed motifs in the motif annotation); typically results in a 3x speed-up
network
- beta version of completely new module that is still in active development
biosynthesis
- added functions to find neighbors in biosynthesis space (one reaction removed)
- added functions to plot biosynthetic network for a set of glycans
- added functions to combine/align biosynthetic networks
glycan_data
- replaced glyco_targets_species_seq_all_V3 (~13,000 species-specific glycans) and v3_sugarbase (~20,000 unique glycans) with glyco_targets_species_seq_all_V4 (~23,000 species-specific glycans) and v4_sugarbase (~47,000 unique glycans)
- correspondingly updated glycan ML models, representations, and substitution matrix
- in addition to all the new glycans, many pre-existing glycans are now better specified (e.g., Gal3S instead of GalOS, wherever the location of the modification is known)
- GlyTouCan IDs were added whenever possible
- motif_list was expanded by two new motifs (difucosylated N-glycan core & extended core fucose)
ml
train_test_split
- modified hierarchy_filter to ignore glycans with ‘undetermined’ taxonomy label