Clarify purpose of the `vcf_header` attribute and/or refine? #15

jeromekelleher · 2024-02-26T10:20:38Z

It's not clear what the vcf_header attribute is for, and how complete it is expected to be, and what people are supposed to do with it.

Some fields in the header are clearly redundant (the INFO and FORMAT field definitions, as well as CONTIG) and can/should be auto-generated by conversion programs producing VCF from vcf-zarr (an important task).

So, we're actually making it harder for downstream programs to output valid VCF headers of subsets of the data by requiring that the entire thing is stored in the input.

I think it would be better to try and capture the non-redundant stuff in the header that is defined in the spec (source, reference, etc.) as separate attributes.

tomwhite · 2024-02-26T10:57:56Z

I agree.

Originally we added the vcf_header attribute to make it possible to losslessly round trip VCF -> Zarr -> VCF.

With the VCF output work we added the ability to generate a VCF header from INFO, FORMAT, and CONTIG fields stored in Zarr - and also use other fields from the vcf_header attribute, if present. (See https://github.com/pystatgen/sgkit/blob/d32b8714e026b5e0ab49812a87174edbd829b26a/sgkit/io/vcf/vcf_writer.py#L412-L559)
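As a rough illustration of the idea of generating a header from stored field definitions, here is a minimal sketch in plain Python. It is not the sgkit implementation linked above: the function `build_vcf_header`, its parameters, and the dict keys (`id`, `number`, `type`, `description`) are hypothetical names chosen for this example, and the `extra_meta` argument simply stands in for any non-redundant values (such as source or reference) that might be stored separately.

```python
def _field_line(kind, field):
    """Format one ##INFO or ##FORMAT meta-information line.

    `field` is a hypothetical dict with keys id, number, type, description.
    """
    return (
        f"##{kind}=<ID={field['id']},Number={field['number']},"
        f"Type={field['type']},Description=\"{field['description']}\">"
    )


def build_vcf_header(info_fields, format_fields, contigs, samples, extra_meta=None):
    """Assemble a VCF header from field definitions rather than a stored header.

    contigs is a list of (name, length) pairs; length may be None.
    extra_meta is an optional dict of simple key/value lines, e.g. source.
    """
    lines = ["##fileformat=VCFv4.3"]
    # Non-redundant information, if it was captured separately.
    for key, value in (extra_meta or {}).items():
        lines.append(f"##{key}={value}")
    # Redundant definitions are regenerated from the fields actually present.
    lines.extend(_field_line("INFO", f) for f in info_fields)
    lines.extend(_field_line("FORMAT", f) for f in format_fields)
    for name, length in contigs:
        if length is None:
            lines.append(f"##contig=<ID={name}>")
        else:
            lines.append(f"##contig=<ID={name},length={length}>")
    columns = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]
    if samples:
        columns += ["FORMAT", *samples]
    lines.append("\t".join(columns))
    return "\n".join(lines) + "\n"


header = build_vcf_header(
    info_fields=[{"id": "DP", "number": "1", "type": "Integer", "description": "Total depth"}],
    format_fields=[{"id": "GT", "number": "1", "type": "String", "description": "Genotype"}],
    contigs=[("20", 62435964)],
    samples=["NA00001"],
    extra_meta={"source": "example"},
)
print(header, end="")
```

Because the header is rebuilt from whatever fields and samples are passed in, a subset of the data yields a header that describes only that subset, which is the property a stored full header makes harder to guarantee.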

In fact, sgkit can now handle the case where there's no vcf_header attribute. So perhaps we can just mark it as optional in the spec (or remove entirely)?

jeromekelleher · 2024-02-26T13:54:57Z

I think it's simplest to remove entirely, and plot out some potential ways we can incorporate the missing information more systematically. I think the main thing we're losing is the provenance and reference information, which would be straightforward to add as optional attrs later on.

Otherwise we have to define what the header is for, and what should take precedence in terms of fields that are present in the dataset vs the header.

jeromekelleher · 2024-08-06T10:29:10Z

I think #28, #29 and #30 would be sufficient to remove the need for storing the full header.

More specialised handling of, e.g., ALT header information could follow later on.

jeromekelleher mentioned this issue Aug 6, 2024

Remove vcf_header option from write_vcf sgkit-dev/vcztools#47

Closed
