Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coverage stop coordinate switch to exclusive #204

Closed
wants to merge 8 commits into from

Conversation

wasade
Copy link
Contributor

@wasade wasade commented Apr 18, 2024

This PR addresses two bugs originating from Zebra.

  1. The range calculation from Zebra performed a left shift on the start coordinate being processed. We believe this was done on the assumption the coordinate should be considered inclusive, however standard convention for coverage (see e.g. bedtools and BED3 format) is to consider it exclusive. As a consequence, regions which were adjacent but not connected would merge, yielding a different result than bedtools merge. Specifically, the spans [(2, 5), (6, 10)] should not be considered connected:
1234567890
 XXX XXXX

The lack of connection is consistent with bedtools merge:

$ cat example.bed
test	2	5
test	6	10
$ bedtools merge -i example.bed 
test	2	5
test	6	10
  1. The representation of the stop coordinate in parsing SAM/BAM, with the addition of the CIGAR length, was adjusted to convert the stop coordinate to be inclusive. As noted in (1), this is a deviation from the expectations of programs such as bedtools, so this behavior has also been adjusted.

@wasade wasade changed the title Coverage stop Coverage stop coordinate switch to exclusive Apr 18, 2024
@qiyunzhu
Copy link
Owner

Hi @wasade Thanks for discussing with me the issue and suggesting the change.

I think there are two things to consider: 1) Whether the right coordinate is inclusive or exclusive. SAM is inclusive whereas BED is exclusive. This is what the current PR tackles. 2) Whether the coordinates are 0-based or 1-based. To my knowledge, SAM is 1-based whereas BED is 0-based. This hasn't been addressed. Therefore even with the current PR, the result may not be precisely identical to that of BED. To precisely mimic the BEDtools' expectation, the code should be pos - 1, pos + offset - 1, I guess, but it needs to be tested.

Changing how SAM is processed can affect some downstream applications, such as the coordinate-based functional annotation algorithm in Woltka. I tend to suggest that instead of changing SAM processing (which should follow SAM standard), we can add an intermediate step before feeding into the merge_range function or BEDtools.

Quotes from the SAM standard:

1-based coordinate system A coordinate system where the first base of a sequence is one. In this co-
ordinate system, a region is specified by a closed interval. For example, the region between the 3rd
and the 7th bases inclusive is [3, 7]. The SAM, VCF, GFF and Wiggle formats are using the 1-based
coordinate system.

0-based coordinate system A coordinate system where the first base of a sequence is zero. In this
coordinate system, a region is specified by a half-closed-half-open interval. For example, the region
between the 3rd and the 7th bases inclusive is [2, 7). The BAM, BCFv2, BED, and PSL formats are
using the 0-based coordinate system.

  1. POS: 1-based leftmost mapping POSition of the first CIGAR operation that “consumes” a reference
    base (see table below). The first base in a reference sequence has coordinate 1. POS is set as 0 for
    an unmapped read without coordinate. If POS is 0, no assumptions can be made about RNAME and
    CIGAR.

Note: NCBI is also 1-based and inclusive.

@wasade
Copy link
Contributor Author

wasade commented Apr 18, 2024

Thanks, @qiyunzhu! Bedtools on at least some commands allow you to do either 0 or 1-based indexing. That shouldn't matter though as long as both start and stop are shifted, but the code here was only shifting a single coordinate.

With .sam, it doesn't encode the stop, you have to calculate it off CIGAR right?

With these changes, if you run the same ranges through bedtools merge you get the same result

@wasade
Copy link
Contributor Author

wasade commented Apr 18, 2024

As an example, using micov to extract the ranges (which uses a minor adaption of the cigar_to_lens function from woltka). The extraction is just dumping subject ID, start, and start + length from CIGAR.

Note, the micov outputs a header within BED3 which is actually against spec (?) but I would prefer to error on the side of having a header to in general simplify interpretation.

$ xzcat 109506_S311_L001_R1_001.trimmed.fastq.gz.sam.xz | micov compress --disable-compression > test.bed
$ grep -v genome_id test.bed | bedtools sort | bedtools merge | md5sum
88ca582b339d7e8b53191cc96d3b8568  -
$ woltka classify -i 109506_S311_L001_R1_001.trimmed.fastq.gz.sam.xz -o testtest --no-demux --rank none --outcov foobar/
Input alignment file: 109506_S311_L001_R1_001.trimmed.fastq.gz.sam.xz.
Demultiplexing: off.
Classification will operate on these ranks: none.
Parsing alignment file 109506_S311_L001_R1_001.trimmed.fastq.gz.sam.xz . Done.
  Number of sequences classified: 75.
Calculating per sample coverage... Done.
Classification completed.
Format of output feature table(s): TSV.
Writing output profiles in TSV format...
  Rank: none, samples: 1, features: 1.
Profiles written.
Task completed.
$ md5sum foobar/109506_S311_L001_R1_001.trimmed.fastq.gz.cov 
88ca582b339d7e8b53191cc96d3b8568  foobar/109506_S311_L001_R1_001.trimmed.fastq.gz.cov

@qiyunzhu
Copy link
Owner

@wasade Thanks for the clarification and the example!

You are correct that SAM doesn't encode for stop and one needs to calculate it from CIGAR.

Let me think a bit about the implementation. Currently Woltka supports SAM, PAF and BLAST formats, in which SAM and BLAST are 1-based, inclusive and PAF is 0-based, exclusive (like BED). I envision that the solution would consist of 1) modify the parsers or the algorithms (merge_ranges or the coordinate matching algorithm), 2) have a new option to control how output coverage files are formatted. Maybe --outcov-fmt, the value can be 0i, 0e (default; like BED), 1i, 1e.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants