Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify purpose of the vcf_header attribute and/or refine? #15

Open
jeromekelleher opened this issue Feb 26, 2024 · 3 comments
Open

Clarify purpose of the vcf_header attribute and/or refine? #15

jeromekelleher opened this issue Feb 26, 2024 · 3 comments

Comments

@jeromekelleher
Copy link
Contributor

It's not clear what the vcf_header attribute is for, and how complete it is expected to be, and what people are supposed to do with it.

Some fields in the header are clearly redundant (the INFO and FORMAT field definitions, as well as CONTIG) and can/should be auto generated by conversion programs producing VCF from vcf-zarr (an important task)

So, we're actually making it harder for downstream programs to output valid VCF headers of subsets of the data by requiring that the entire thing is stored in the input.

I think it would be better to try and capture the non-redundant stuff in the header that is defined in the spec like source, reference etc as separate attributes

@tomwhite
Copy link
Collaborator

I agree.

Originally we added the vcf_header attribute to make it possible to losslessly round trip VCF -> Zarr -> VCF.

With the VCF output work we added the ability to generate a VCF header from INFO, FORMAT, and CONTIG fields stored in Zarr - and also use other fields from the vcf_header attribute, if present. (See https://github.com/pystatgen/sgkit/blob/d32b8714e026b5e0ab49812a87174edbd829b26a/sgkit/io/vcf/vcf_writer.py#L412-L559)

In fact, sgkit can now handle the case where there's no vcf_header attribute. So perhaps we can just mark it as optional in the spec (or remove entirely)?

@jeromekelleher
Copy link
Contributor Author

I think it's simplest to remove entirely, and plot out some potential ways we can incorporate the missing information more systematically. I think the main thing we're losing is the provenance and reference information, which would be straightforward to add as optional attrs later on.

Otherwise we have to define what the header is for, and what should take precedence in terms of fields that are present in the dataset vs the header.

@jeromekelleher
Copy link
Contributor Author

I think #28, #29 and #30 would be sufficient to remove the need for storing the full header.

More specialised handling of, e.g., ALT header information could follow later on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants