Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dump the glypy.Glycan to linear code #20

Open
bobaoai opened this issue Oct 30, 2019 · 5 comments
Open

Dump the glypy.Glycan to linear code #20

bobaoai opened this issue Oct 30, 2019 · 5 comments

Comments

@bobaoai
Copy link

bobaoai commented Oct 30, 2019

Hey Joshua,

I am afraid that the linearcode.dumps() function might have inconsistency with the linear code rule.
For example, if we have a glycan as below:

RES
1b:x-dglc-HEX-1:5
2s:n-acetyl
3b:b-dglc-HEX-1:5
4s:n-acetyl
5b:b-dman-HEX-1:5
6b:a-dman-HEX-1:5
7b:a-dman-HEX-1:5
8b:a-dman-HEX-1:5
9b:a-dman-HEX-1:5
LIN
1:1d(2+1)2n
2:1o(4+1)3d
3:3d(2+1)4n
4:3o(4+1)5d
5:5o(3+1)6d
6:5o(6+1)7d
7:7o(3+1)8d
8:7o(6+1)9d

We should get 'Ma3(Ma3(Ma6)Ma6)Mb4GNb4GN?'
But the dumps() returns 'Ma6(Ma3)Ma6(Ma3)Mb4GNb4GN?'.
I believe it only needs a slight modification. When the code traverses the glycan, it need sort the glycan.index descendingly to make the monosaccharide with the highest linkage-index go first.

@mobiusklein
Copy link
Owner

This looks like a breakdown in how linear_code.glycan_to_linear_code is calling linear_code.priority. Right now, branches are selected in "priority" order, as defined in Table 1 of [1]. At some point a breaking change caused priority to return -1 for any "real" monosaccharide. Returning the correct priority should fix this, but I have to be a bit more careful about how I do that, because of the degenerate way in which LinearCode encodes monosaccharides.

Note that LinearCode derived structures are necessarily not canonicalized w.r.t. to the same structure as parsed from GlycoCT or WURCS, and if you intend to mix the two formats, you should explicitly call glycan.canonicalize().

[1] Banin, E., Neuberger, Y., Altshuler, Y., Halevi, A., Inbar, O., Nir, D., & Dukler, A. (2002). A Novel Linear Code Nomenclature for Complex Carbohydrates. Trends in Glycoscience and Glycotechnology, 14(77), 127–137. https://doi.org/10.4052/tigg.14.127

@bobaoai
Copy link
Author

bobaoai commented Nov 3, 2019

Thanks!

I totally agreed that LinearCode derived structures might not be canonicalized and we should be careful when using it.

Since my data analysis is only dealing with the glycan with common monosaccharides, the extreme case doesn't both me. The reason the LinearCode is used is that in my case if two glycans are the same, their linearcodes are the same. In this case, the str1==str2 will be faster to check the similarity among a set of glycans. Do you have a faster way to compare if two glycans have same structure? Thanks!

@mobiusklein
Copy link
Owner

The same uniqueness is applied to any comparison of canonicalized structures and formats. The GlycoCT serializer is 5 times faster and enforces the GlycoCT canonicalization sorting on the structure as it is converted to a string, so it should be better all around.

I'm not sure I'll have time to fix the LinearCode serialization issue this week.

@bobaoai
Copy link
Author

bobaoai commented Nov 3, 2019

Do you mean when we get the GlycoCT from the str(Glycan), it is already canonicalized? So do you imply the glypy guarantees that every Glycan object with the same structure will have the same str(Glycan)? I mean currently, I only deal with the glycans with clearly specified topology and linkages.

It's totally okay. There is no push to fix the LinearCode serialization. It all depends on your schedule.

@mobiusklein
Copy link
Owner

Yes, GlycoCTWriter traverses the glycan by traveling edges in the order specified by the publication [1] (impl). Of course, as we've discussed before, glypy doesn't write out UND sections at the moment, but will canonicalize the current configuration of an under-determined structure.

The glycoct module is well over 2k LOC, so it needs some serious refactoring before anyone else will be able to read it and retain anything about its organization.

[1] Herget, S., Ranzinger, R., Maass, K., & Lieth, C.-W. V. D. (2008). GlycoCT-a unifying sequence format for carbohydrates. Carbohydrate Research, 343(12), 2162–2171. https://doi.org/10.1016/j.carres.2008.03.011

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants