bcf_hdr_remove leaves hash keys behind #1533
For now I am replacing the code:
with:
though it requires to redefine the
while the hash tables should be opaque to the end users |
The bcf_hdr_remove() call can create gaps in tid blocks which fail assertion in bcf_hdr_seqnames(). This problem was encountered in samtools#1533, but is only a partial fix of the problem
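To make the "gap" concrete, here is a minimal sketch in plain C. It does not use htslib, and every name in it (`toy_ctg`, `toy_seqnames_strict`, `toy_seqnames_tolerant`, `toy_demo_gap`) is hypothetical. It assumes a contig dictionary whose entries keep fixed ids: a builder that expects the ids to be exactly 0..n-1 rejects the table once an entry in the middle has been removed, while a builder that tolerates holes simply skips them.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy model of a contig dictionary entry: a name and its fixed id. */
typedef struct {
    const char *name;
    int id;
} toy_ctg;

/* Strict builder: stores each name at names[id] and requires the ids to be
 * exactly 0..n-1. Returns NULL if an id is out of range or duplicated,
 * which is what happens when a removal has left a gap. */
const char **toy_seqnames_strict(const toy_ctg *d, int n)
{
    const char **names = calloc(n > 0 ? n : 1, sizeof(*names));
    if (!names) return NULL;
    for (int i = 0; i < n; i++) {
        if (d[i].id < 0 || d[i].id >= n || names[d[i].id]) {
            free(names);
            return NULL;
        }
        names[d[i].id] = d[i].name;
    }
    return names;
}

/* Tolerant builder: sizes the array by the largest id, fills it, then
 * compacts it so that holes are skipped. *n_out receives the name count. */
const char **toy_seqnames_tolerant(const toy_ctg *d, int n, int *n_out)
{
    assert(n_out != NULL);
    int max_id = -1;
    for (int i = 0; i < n; i++)
        if (d[i].id > max_id) max_id = d[i].id;
    const char **names = calloc(max_id + 2, sizeof(*names));
    if (!names) return NULL;
    for (int i = 0; i < n; i++)
        if (d[i].id >= 0) names[d[i].id] = d[i].name;
    int k = 0;
    for (int i = 0; i <= max_id; i++)
        if (names[i]) names[k++] = names[i];
    *n_out = k;
    return names;
}

/* Demo: ids 0 and 2 remain after the contig with id 1 was removed.
 * Returns 1 if the strict builder fails and the tolerant one succeeds. */
int toy_demo_gap(void)
{
    const toy_ctg d[] = { {"chr1", 0}, {"chr3", 2} };
    int n = 0;
    const char **strict = toy_seqnames_strict(d, 2);
    const char **names = toy_seqnames_tolerant(d, 2, &n);
    int ok = strict == NULL && names != NULL && n == 2 &&
             strcmp(names[1], "chr3") == 0;
    free(strict);
    free(names);
    return ok;
}
```

This is only a model of the failure mode described in the commit message, not a copy of the htslib implementation.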
This is trickier than it seems. The So one should not be accessing the However, the |
I think I understand the problem a bit better now. A simpler way to see what is going on is also to run:
Or again to run:
The
If the intended behavior of |
I am not sure what you are after exactly. The header and the body must stay consistent; the IDX field was introduced specifically for cases like this. It should be considered a bug that htslib is not handling it properly.
In my specific case I am writing a BCFtools plugin, so I have one |
Tags and contigs in BCF body are identified by their id which is defined implicitly by the order of their definitions in the header. This brings a problem: if, say, the first tag is removed from the header, ids of the remaining tags change by -1, and the entire BCF has to be recoded. The IDX field was introduced to preserve tag ids even when some are removed or reordered. This is part of the BCF specification and all readers must support it, you will not gain anything by doing this extra work. |
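The difference between order-derived ids and stored ids can be sketched with a small toy header in plain C. Nothing here is htslib API: `toy_hdr`, `toy_hdr_add`, `toy_hdr_remove`, `toy_id_explicit` and `toy_id_implicit` are hypothetical names, and the model ignores details of real headers such as reserved entries.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_TAGS 8

/* Hypothetical toy header: each tag keeps the id it was given when it was
 * defined, which is the role an explicit IDX value plays. */
typedef struct {
    const char *name[TOY_MAX_TAGS];
    int idx[TOY_MAX_TAGS];
    int n;
} toy_hdr;

/* Append a tag; its id is one more than the largest id in use.
 * Returns the new id, or -1 if the table is full. */
int toy_hdr_add(toy_hdr *h, const char *name)
{
    assert(h != NULL && name != NULL);
    if (h->n >= TOY_MAX_TAGS) return -1;
    int next = 0;
    for (int i = 0; i < h->n; i++)
        if (h->idx[i] >= next) next = h->idx[i] + 1;
    h->name[h->n] = name;
    h->idx[h->n] = next;
    h->n++;
    return next;
}

/* Remove a tag by name; the remaining tags keep their stored ids.
 * Returns 0 on success, -1 if the tag is not present. */
int toy_hdr_remove(toy_hdr *h, const char *name)
{
    assert(h != NULL && name != NULL);
    for (int i = 0; i < h->n; i++) {
        if (strcmp(h->name[i], name) == 0) {
            for (int j = i + 1; j < h->n; j++) {
                h->name[j - 1] = h->name[j];
                h->idx[j - 1] = h->idx[j];
            }
            h->n--;
            return 0;
        }
    }
    return -1;
}

/* Id taken from the stored value: stable across removals. */
int toy_id_explicit(const toy_hdr *h, const char *name)
{
    for (int i = 0; i < h->n; i++)
        if (strcmp(h->name[i], name) == 0) return h->idx[i];
    return -1;
}

/* Id derived only from the position in the header: shifts after a removal. */
int toy_id_implicit(const toy_hdr *h, const char *name)
{
    for (int i = 0; i < h->n; i++)
        if (strcmp(h->name[i], name) == 0) return i;
    return -1;
}
```

After adding three tags and removing the first, `toy_id_explicit` still reports the original ids of the other two, whereas `toy_id_implicit` reports ids one lower, which is the shift that would force the body to be recoded.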
It seems to me that you are thinking about protecting end users by preserving the contig table in the header. But what are end users expected to do with a modified header whose records are now missing? The old contigs' rids can be interpreted while reading the VCF, but if the modified header is the one written to the output, the old rids should not be output: the corresponding table entries would be missing from the printed header and would be discarded from memory once the executable exits. To be specific, my use case is a plugin that lifts over a VCF from one reference to another. The
and then filled with a new contig table from an index FASTA structure. This approach effectively resets the IDX field and solves the issue on my side. What would be a use case for using I am mostly curious. I am not necessarily advocating for one way or another and I was trying to understand if there was a canonical way to remove both the dictionary and the contig table from a header structure. |
It is not about protecting end users, but about preserving the integrity of the BCF, about programming convenience, and about the speed of processing. It is also not only about contigs, but also about FILTER, INFO, and FORMAT tags. For FILTER, INFO, and FORMAT this has been used for quite a while and it works; for contigs you happened to test a combination of steps that revealed a bug. Say the VCF header looks like this
then the BCF body encodes the tags like this (shown here in a simplified form)
If we remove TAG1 from the header and from the body, the modified BCF looks like this
If we accept your suggestion and discard the indexes, we'd have to manually recode all subsequent tags (IDX -> IDX-1) in the entire BCF and make it look like this
Even if we decided we wanted to do it this way, at this point it is too late: this would be a major breaking change. Luckily, there is no need for that, we can just fix |
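The cost of recoding can be sketched in plain C with a toy body in which each entry refers to a header tag by integer id. This is not htslib code; `toy_body_drop` and `toy_body_recode` are hypothetical names and the ids are arbitrary illustrative values.

```c
#include <assert.h>
#include <stddef.h>

/* Drop the entries that use removed_id and leave every other id unchanged.
 * This is all that is needed when ids are preserved through IDX.
 * Returns the new number of entries. */
size_t toy_body_drop(int *ids, size_t n, int removed_id)
{
    assert(ids != NULL || n == 0);
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (ids[i] != removed_id) ids[k++] = ids[i];
    return k;
}

/* Extra pass needed when ids are implied by header order: every id above
 * removed_id shifts down by one. Returns the number of entries rewritten. */
size_t toy_body_recode(int *ids, size_t n, int removed_id)
{
    assert(ids != NULL || n == 0);
    size_t changed = 0;
    for (size_t i = 0; i < n; i++) {
        if (ids[i] > removed_id) {
            ids[i]--;
            changed++;
        }
    }
    return changed;
}
```

With preserved ids only `toy_body_drop` has to run. Without them, `toy_body_recode` must also visit every remaining entry that refers to a later tag, which in a real file means rewriting the whole BCF body.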
Another way to put what I was trying to say is that you would not use |
Fixed by #1535 |
Somehow this issue #842 has not been fully addressed. The following code shows the problem:
Then compiling:
Create a toy VCF file:
Then running a first test:
Shows that the number of contigs is increasing even though the code removed all contigs with the bcf_hdr_remove() function.
The number of contigs increases at each iteration:
Unless the VCF is converted to a text file in between:
This must not be the intended behavior as it leads to unwanted scenarios:
While it works fine if converted to a text file:
Notice that in htslib/vcf.h the function bcf_hdr_remove() is explained as follows:
Originally posted by @freeseek in #842 (comment)