-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify purpose of the vcf_header
attribute and/or refine?
#15
Comments
I agree. Originally we added the With the VCF output work we added the ability to generate a VCF header from INFO, FORMAT, and CONTIG fields stored in Zarr - and also use other fields from the In fact, sgkit can now handle the case where there's no |
I think it's simplest to remove entirely, and plot out some potential ways we can incorporate the missing information more systematically. I think the main thing we're losing is the provenance and reference information, which would be straightforward to add as optional attrs later on. Otherwise we have to define what the header is for, and what should take precedence in terms of fields that are present in the dataset vs the header. |
It's not clear what the
vcf_header
attribute is for, and how complete it is expected to be, and what people are supposed to do with it.Some fields in the header are clearly redundant (the INFO and FORMAT field definitions, as well as CONTIG) and can/should be auto generated by conversion programs producing VCF from vcf-zarr (an important task)
So, we're actually making it harder for downstream programs to output valid VCF headers of subsets of the data by requiring that the entire thing is stored in the input.
I think it would be better to try and capture the non-redundant stuff in the header that is defined in the spec like
source
,reference
etc as separate attributesThe text was updated successfully, but these errors were encountered: