Coverage stop coordinate switch to exclusive #204

wasade · 2024-04-18T05:37:38Z

This PR addresses two bugs originating from Zebra.

The range calculation from Zebra performed a left shift on the start coordinate being processed. We believe this was done on the assumption the coordinate should be considered inclusive, however standard convention for coverage (see e.g. bedtools and BED3 format) is to consider it exclusive. As a consequence, regions which were adjacent but not connected would merge, yielding a different result than bedtools merge. Specifically, the spans [(2, 5), (6, 10)] should not be considered connected:

1234567890
 XXX XXXX

The lack of connection is consistent with bedtools merge:

$ cat example.bed
test	2	5
test	6	10
$ bedtools merge -i example.bed 
test	2	5
test	6	10

The representation of the stop coordinate in parsing SAM/BAM, with the addition of the CIGAR length, was adjusted to convert the stop coordinate to be inclusive. As noted in (1), this is a deviation from the expectations of programs such as bedtools, so this behavior has also been adjusted.

qiyunzhu · 2024-04-18T15:30:33Z

Hi @wasade Thanks for discussing with me the issue and suggesting the change.

I think there are two things to consider: 1) Whether the right coordinate is inclusive or exclusive. SAM is inclusive whereas BED is exclusive. This is what the current PR tackles. 2) Whether the coordinates are 0-based or 1-based. To my knowledge, SAM is 1-based whereas BED is 0-based. This hasn't been addressed. Therefore even with the current PR, the result may not be precisely identical to that of BED. To precisely mimic the BEDtools' expectation, the code should be pos - 1, pos + offset - 1, I guess, but it needs to be tested.

Changing how SAM is processed can affect some downstream applications, such as the coordinate-based functional annotation algorithm in Woltka. I tend to suggest that instead of changing SAM processing (which should follow SAM standard), we can add an intermediate step before feeding into the merge_range function or BEDtools.

Quotes from the SAM standard:

1-based coordinate system A coordinate system where the first base of a sequence is one. In this co-
ordinate system, a region is specified by a closed interval. For example, the region between the 3rd
and the 7th bases inclusive is [3, 7]. The SAM, VCF, GFF and Wiggle formats are using the 1-based
coordinate system.

0-based coordinate system A coordinate system where the first base of a sequence is zero. In this
coordinate system, a region is specified by a half-closed-half-open interval. For example, the region
between the 3rd and the 7th bases inclusive is [2, 7). The BAM, BCFv2, BED, and PSL formats are
using the 0-based coordinate system.

POS: 1-based leftmost mapping POSition of the first CIGAR operation that “consumes” a reference
base (see table below). The first base in a reference sequence has coordinate 1. POS is set as 0 for
an unmapped read without coordinate. If POS is 0, no assumptions can be made about RNAME and
CIGAR.

Note: NCBI is also 1-based and inclusive.

wasade · 2024-04-18T15:34:43Z

Thanks, @qiyunzhu! Bedtools on at least some commands allow you to do either 0 or 1-based indexing. That shouldn't matter though as long as both start and stop are shifted, but the code here was only shifting a single coordinate.

With .sam, it doesn't encode the stop, you have to calculate it off CIGAR right?

With these changes, if you run the same ranges through bedtools merge you get the same result

wasade · 2024-04-18T16:04:48Z

As an example, using micov to extract the ranges (which uses a minor adaption of the cigar_to_lens function from woltka). The extraction is just dumping subject ID, start, and start + length from CIGAR.

Note, the micov outputs a header within BED3 which is actually against spec (?) but I would prefer to error on the side of having a header to in general simplify interpretation.

$ xzcat 109506_S311_L001_R1_001.trimmed.fastq.gz.sam.xz | micov compress --disable-compression > test.bed
$ grep -v genome_id test.bed | bedtools sort | bedtools merge | md5sum
88ca582b339d7e8b53191cc96d3b8568  -
$ woltka classify -i 109506_S311_L001_R1_001.trimmed.fastq.gz.sam.xz -o testtest --no-demux --rank none --outcov foobar/
Input alignment file: 109506_S311_L001_R1_001.trimmed.fastq.gz.sam.xz.
Demultiplexing: off.
Classification will operate on these ranks: none.
Parsing alignment file 109506_S311_L001_R1_001.trimmed.fastq.gz.sam.xz . Done.
  Number of sequences classified: 75.
Calculating per sample coverage... Done.
Classification completed.
Format of output feature table(s): TSV.
Writing output profiles in TSV format...
  Rank: none, samples: 1, features: 1.
Profiles written.
Task completed.
$ md5sum foobar/109506_S311_L001_R1_001.trimmed.fastq.gz.cov 
88ca582b339d7e8b53191cc96d3b8568  foobar/109506_S311_L001_R1_001.trimmed.fastq.gz.cov

qiyunzhu · 2024-04-18T16:41:43Z

@wasade Thanks for the clarification and the example!

You are correct that SAM doesn't encode for stop and one needs to calculate it from CIGAR.

Let me think a bit about the implementation. Currently Woltka supports SAM, PAF and BLAST formats, in which SAM and BLAST are 1-based, inclusive and PAF is 0-based, exclusive (like BED). I envision that the solution would consist of 1) modify the parsers or the algorithms (merge_ranges or the coordinate matching algorithm), 2) have a new option to control how output coverage files are formatted. Maybe --outcov-fmt, the value can be 0i, 0e (default; like BED), 1i, 1e.

wasade added 6 commits April 17, 2024 22:15

TST: stop coordinate to exclusive

56dc1a8

BUG: switch stop coordinate to exclusive to be consistent with bedtools

3ec85e3

TST: remove left shift

7556756

BUG: remove left shift in range calculation

2e91f97

DOC: changelog mention

81b0877

Missed an add

036948c

wasade changed the title ~~Coverage stop~~ Coverage stop coordinate switch to exclusive Apr 18, 2024

wasade added 2 commits April 17, 2024 22:57

Reflect adjusted stop coordinate

e591c96

Reflect adjusted stop coordinate

b7ca015

qiyunzhu mentioned this pull request Apr 29, 2024

Improved handling of range coordinates #205

Merged

qiyunzhu closed this Sep 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Coverage stop coordinate switch to exclusive #204

Coverage stop coordinate switch to exclusive #204

wasade commented Apr 18, 2024

qiyunzhu commented Apr 18, 2024

wasade commented Apr 18, 2024

wasade commented Apr 18, 2024

qiyunzhu commented Apr 18, 2024

Coverage stop coordinate switch to exclusive #204

Coverage stop coordinate switch to exclusive #204

Conversation

wasade commented Apr 18, 2024

qiyunzhu commented Apr 18, 2024

wasade commented Apr 18, 2024

wasade commented Apr 18, 2024

qiyunzhu commented Apr 18, 2024