Documenting requirements #12

Open
pnrobinson opened this issue Mar 27, 2019 · 7 comments

Comments

@pnrobinson

The Phenopacket group is planning to use three categories to denote whether a particular field is required, recommended, or optional. Here is an example: https://phenopackets-schema.readthedocs.io/en/latest/variant.html
I am wondering if this needs to be coordinated with SchemaBlocks or if there are any recommendations?

@mbaudis
Contributor

mbaudis commented Mar 27, 2019

@pnrobinson I like that idea a lot; however I don't know how much something like that could even be specified as a specific schema element (AFAIK it isn't part of proto). So, again a Q for @Relequestual. (One could have a separate object for that, but maybe there is a JSON way?)

@pnrobinson
Author

There is no way of specifying these three categories within protobuf, but we are supplying a Java validation library to implement it.
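To make the idea concrete, here is a minimal sketch of how a validation layer outside the schema could enforce the three categories. It is written in Python for brevity and is not the Java validation library mentioned above; the `Requirement` enum, the `VARIANT_FIELDS` table, and its field names are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Requirement(Enum):
    REQUIRED = "required"
    RECOMMENDED = "recommended"
    OPTIONAL = "optional"


# Hypothetical requirement table; the field names are illustrative only
# and are not taken from the Phenopacket schema.
VARIANT_FIELDS = {
    "id": Requirement.REQUIRED,
    "genome_assembly": Requirement.RECOMMENDED,
    "zygosity": Requirement.OPTIONAL,
}


def validate(message, levels):
    """Return (errors, warnings) for a dict-like message.

    A missing REQUIRED field is an error, a missing RECOMMENDED field
    is a warning, and a missing OPTIONAL field is ignored.
    """
    errors, warnings = [], []
    for field, level in levels.items():
        value = message.get(field)
        if value is not None and value != "":
            continue  # field is present, nothing to report
        if level is Requirement.REQUIRED:
            errors.append(f"missing required field: {field}")
        elif level is Requirement.RECOMMENDED:
            warnings.append(f"missing recommended field: {field}")
    return errors, warnings
```

Keeping the requirement levels in a table separate from the message definition is one way to work around the schema language not being able to express them.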

@mbaudis
Contributor

mbaudis commented Mar 27, 2019

Thought so; I'm definitely pro adopting this systematically.

Re. allele example: Would be good if we could use this also for establishing / using a "GA4GH allele" type (or a specific PXF, Beacon ...) variant; we have the one lifted over & modified from the GA4GH schema, which has some options for structural variants. Would be good to have this moved to an agreed upon standard (modifications and all), as explicit "VCF inherited & documented variant storage & transfer standard"; we need this e.g. for Beacon (responding with matched variants to wildcard queries) & have to move soon on it.

@pnrobinson
Author

It seems that the variant class from the GA4GH schema has gone a little overboard, and has too many fields that reflect the bioinformatics processing, e.g., mate_name. I would suggest that if a user needs that much detail, then they probably just want to have the FASTQ files and do everything themselves, rather than start from some summary message. But that is just one opinion and it might be good to start off by defining what we think the typical use cases are and what the requirements are?

@mbaudis
Contributor

mbaudis commented Mar 27, 2019

MateName is related to the MateID of VCF structural variants; essential for translocations. Part of next Beacon point release. Easy porting - and querying - of cytogenetic annotation data.
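As a rough illustration of why the mate link matters for translocations, the sketch below pairs VCF breakend records through the `MATEID` key in the INFO column. It assumes tab-separated records with `SVTYPE=BND`; the function names and the sample records in the usage note are illustrative, and this is not how Beacon or the GA4GH schema implements `mate_name`.

```python
def parse_info(info):
    """Split a VCF INFO string such as 'SVTYPE=BND;MATEID=x' into a dict."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out


def pair_breakends(lines):
    """Return a list of ((chrom, pos), (mate_chrom, mate_pos)) tuples.

    Only records with SVTYPE=BND whose MATEID points at another record
    in the same input are paired; each pair is reported once.
    """
    records = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        chrom, pos, rec_id, _ref, _alt, _qual, _filt, info = line.split("\t")[:8]
        fields = parse_info(info)
        if fields.get("SVTYPE") != "BND":
            continue
        records[rec_id] = (chrom, int(pos), fields.get("MATEID"))

    pairs, seen = [], set()
    for rec_id, (chrom, pos, mate_id) in records.items():
        if rec_id in seen or mate_id not in records:
            continue  # already paired, or mate missing from the input
        mate_chrom, mate_pos, _ = records[mate_id]
        pairs.append(((chrom, pos), (mate_chrom, mate_pos)))
        seen.update({rec_id, mate_id})
    return pairs
```

Given two records for a chromosome 2 / chromosome 17 junction that name each other as mates, `pair_breakends` returns a single pair describing both sides of the junction, which is the information a translocation query needs.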

There are more relevant structural changes than SNPs... (not sure about this statement; depending on context... :-) )

@pnrobinson
Author

Yes, but that is not to say that this is the best way of representing them in these formats. It seems it would be better to abstract away from the VCF format, especially since there is little acceptance of this format in the community for SVs yet (different programs have a range of ways of representing SVs and translocations).

@mbaudis
Contributor

mbaudis commented Mar 27, 2019

Well, I'm a (nearly pure...) SV person; and there is no good format (besides traditional ISCN banding annotations - so my primary method is to abstract from that, obviously accommodating for more resolution ...).

I really don't care about some of the VCF "features" (assuming a static dataset w/ callsets in columns????); but somehow they have put a lot of thought into representing all (?) crazy types of variants. This is inspirational, regarding some of the representations (e.g. using a concept of fuzzy start, end for SVs, though this could be done more elegantly; acknowledging the need for fusion mapping etc.); but then VCF is a) limited through the static file structure, and b) overly permissive through headers/options (look e.g. at the 1k genomes SV files - custom mess).
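One possible reading of "fuzzy start, end", sketched in Python: VCF expresses the uncertainty as confidence intervals around the position and end (`CIPOS`, `CIEND`), which can be flattened into four explicit bounds. The `FuzzyInterval` class, its field names, and the two overlap helpers are assumptions made for this sketch, not part of any agreed schema, and coordinates are treated as inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuzzyInterval:
    """SV extent with uncertain boundaries (hypothetical field names)."""
    start_min: int
    start_max: int
    end_min: int
    end_max: int


def from_vcf_style(pos, end, cipos=(0, 0), ciend=(0, 0)):
    """Build explicit bounds from a position, an end, and CIPOS/CIEND-style offsets."""
    return FuzzyInterval(
        start_min=pos + cipos[0],
        start_max=pos + cipos[1],
        end_min=end + ciend[0],
        end_max=end + ciend[1],
    )


def may_overlap(variant, q_start, q_end):
    """True if the query range touches the outermost possible extent."""
    return variant.start_min <= q_end and variant.end_max >= q_start


def surely_overlaps(variant, q_start, q_end):
    """True if the query range touches the innermost (certain) extent."""
    return variant.start_max <= q_end and variant.end_min >= q_start
```

With a variant at 1000 to 2000 and offsets of (-50, 50) and (-100, 100), a query for 2050 to 3000 may overlap the variant but does not surely overlap it; a range query against SVs has to decide which of the two answers it wants.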

But IMO better as a template than HGVS; we do not want to discuss transcript ID etc. based ways to annotate variants for data exchange. Map them or lose them, reference genome or bus (for cross-resource data exchange).

So this is about a robust, reference genome mapping based, SV supporting schema. Which - beyond this here & the related Beacon allele request format (also based on VCF & GA4GH schema) - IMO doesn't exist (well, ISCN 2016 etc., but that is still based on "Human: Deparse that string!").

So w/ respect to having a separate variant format from the limited ones you list in PXF - Yes, definitely; otherwise it wouldn't have been drafted (& used with >100k samples behind Beacons). But PXF can/should obviously offer different ways to represent variants.

But - Well, up for changes, additions, any time!


2 participants