Coverage stop coordinate switch to exclusive #204
Hi @wasade, thanks for discussing the issue with me and suggesting the change. I think there are two things to consider:

1) Whether the right coordinate is inclusive or exclusive. SAM is inclusive whereas BED is exclusive. This is what the current PR tackles.
2) Whether the coordinates are 0-based or 1-based. To my knowledge, SAM is 1-based whereas BED is 0-based. This hasn't been addressed.

Therefore, even with the current PR, the result may not be precisely identical to that of BED. To precisely mimic the BEDtools' expectation, the code should be

Changing how SAM is processed can affect some downstream applications, such as the coordinate-based functional annotation algorithm in Woltka. I tend to suggest that instead of changing SAM processing (which should follow the SAM standard), we can add an intermediate step before feeding into the

Quotes from the SAM standard:

Note: NCBI is also 1-based and inclusive.
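The two conventions discussed above differ only in the start coordinate when an interval is converted as a whole. As a minimal sketch (the function names `sam_to_bed` and `bed_to_sam` are hypothetical and are not part of Woltka or micov), a 1-based inclusive interval becomes a 0-based exclusive one by subtracting 1 from the start and leaving the end unchanged:

```python
from typing import Tuple


def sam_to_bed(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based, inclusive interval (SAM-style) to a
    0-based, exclusive-end interval (BED-style).

    Hypothetical helper for illustration only.
    """
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based inclusive interval: ({start}, {end})")
    # Shifting the start by one converts the base; the inclusive end
    # already equals the exclusive end in 0-based coordinates.
    return start - 1, end


def bed_to_sam(start: int, end: int) -> Tuple[int, int]:
    """Inverse conversion: 0-based, exclusive end to 1-based, inclusive."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid 0-based half-open interval: ({start}, {end})")
    return start + 1, end


# Bases 1..10 in SAM terms cover the same 10 positions as [0, 10) in BED terms.
bed_start, bed_end = sam_to_bed(1, 10)
print(bed_start, bed_end, bed_end - bed_start)
```

The interval length is preserved: `end - start + 1` in the inclusive form equals `end - start` in the half-open form.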
Thanks, @qiyunzhu! Bedtools, on at least some commands, allows you to use either 0- or 1-based indexing. That shouldn't matter, though, as long as both start and stop are shifted, but the code here was only shifting a single coordinate. SAM doesn't encode the stop; you have to calculate it from the CIGAR string, right? With these changes, if you run the same ranges through
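For reference, the stop coordinate of a SAM alignment can be derived from the 1-based `POS` field and the CIGAR string by summing the lengths of the operations that consume the reference (`M`, `D`, `N`, `=`, `X`, per the SAM specification). The following is a minimal sketch; `reference_end` is a hypothetical name and this is not the parser used by Woltka or micov:

```python
import re

# One CIGAR operation: a length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference sequence.
_REF_CONSUMING = frozenset("MDN=X")


def reference_end(pos: int, cigar: str) -> int:
    """Return the 1-based, inclusive end of an alignment on the reference.

    pos   -- 1-based leftmost mapping position (SAM POS field)
    cigar -- CIGAR string such as "5M2I3M"
    """
    if re.fullmatch(r"(?:\d+[MIDNSHP=X])+", cigar) is None:
        raise ValueError(f"unsupported or malformed CIGAR: {cigar!r}")
    ref_len = sum(
        int(length)
        for length, op in _CIGAR_OP.findall(cigar)
        if op in _REF_CONSUMING
    )
    # Inclusive end: the last reference base covered by the alignment.
    return pos + ref_len - 1


# Insertions and soft clips do not advance the reference; deletions do.
for cigar in ("10M", "5M2I3M", "5M2D3M", "3S7M"):
    print(cigar, reference_end(100, cigar))
```

Using `pos + ref_len` instead of `pos + ref_len - 1` would give the exclusive end, which is the variant the PR title refers to.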
As an example, using

Note, micov outputs a header within BED3, which is actually against spec (?), but I would prefer to err on the side of having a header to simplify interpretation in general.

$ xzcat 109506_S311_L001_R1_001.trimmed.fastq.gz.sam.xz | micov compress --disable-compression > test.bed
$ grep -v genome_id test.bed | bedtools sort | bedtools merge | md5sum
88ca582b339d7e8b53191cc96d3b8568 -
$ woltka classify -i 109506_S311_L001_R1_001.trimmed.fastq.gz.sam.xz -o testtest --no-demux --rank none --outcov foobar/
Input alignment file: 109506_S311_L001_R1_001.trimmed.fastq.gz.sam.xz.
Demultiplexing: off.
Classification will operate on these ranks: none.
Parsing alignment file 109506_S311_L001_R1_001.trimmed.fastq.gz.sam.xz . Done.
Number of sequences classified: 75.
Calculating per sample coverage... Done.
Classification completed.
Format of output feature table(s): TSV.
Writing output profiles in TSV format...
Rank: none, samples: 1, features: 1.
Profiles written.
Task completed.
$ md5sum foobar/109506_S311_L001_R1_001.trimmed.fastq.gz.cov
88ca582b339d7e8b53191cc96d3b8568 foobar/109506_S311_L001_R1_001.trimmed.fastq.gz.cov
@wasade Thanks for the clarification and the example! You are correct that SAM doesn't encode the stop and one needs to calculate it from CIGAR. Let me think a bit about the implementation. Currently Woltka supports SAM, PAF and BLAST formats, of which SAM and BLAST are 1-based, inclusive and PAF is 0-based, exclusive (like BED). I envision that the solution would consist of 1) modifying the parsers or the algorithms (
This PR addresses two bugs originating from Zebra.

bedtools merge. Specifically, the spans [(2, 5), (6, 10)] should not be considered connected:

The lack of connection is consistent with bedtools merge:
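To illustrate the merging rule for these spans, here is a minimal sketch that assumes 0-based, exclusive-end intervals and follows the default behavior of bedtools merge, which joins overlapping or book-ended intervals. `merge_spans` is a hypothetical helper written for this illustration and is not the code in this PR:

```python
from typing import Iterable, List, Tuple


def merge_spans(spans: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge half-open intervals that overlap or are book-ended.

    With an exclusive end, (2, 5) covers positions 2, 3, 4 and (6, 10)
    starts at 6, so position 5 is uncovered and the two stay separate.
    """
    merged: List[List[int]] = []
    for start, end in sorted(spans):
        # Connect only if this span starts at or before the previous end
        # (overlap, or book-ended as in (2, 5) followed by (5, 10)).
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


print(merge_spans([(2, 5), (6, 10)]))   # one-base gap: stays as two spans
print(merge_spans([(2, 5), (5, 10)]))   # book-ended: merged into one span
```

Under an inclusive-end interpretation the same pair `(2, 5), (6, 10)` would be adjacent with no gap, which is why the choice of convention changes the merged result.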