Problem with insertion plots (files longer than they should be) #123

irilenia · 2020-07-23T23:20:33Z

Hi,
I wanted to raise another issue with insertion plots and suggest a possible solution for it. In issue #86 (closed), several users mentioned that their insertion plots would not open in Artemis. Although this was not considered at the time to be a problem with the biotradis code, I believe it is. As pointed out by others, these insertion plots are occasionally longer than the reference genome. I believe the problem originates with how chimeric alignments of reads covering both the end and the beginning of the circular genome are treated. I think these reads are not seen as being chimeric and instead, padding is added (to the beginning of the genome in my case) to cover for positions that are not matched. This creates extra positions and results in insertion plots that are longer than the reference sequence.

To give you a more specific example, we are looking at a genome with length 4349904 bases. You'd expect the insertion plot to have that many rows but instead it has 4349956 rows. If you examine the end region of the genome and the reads that are aligned there, you will see that nearly all align all the way to the end but in reality they have been chopped up: 52 bases align at the end of the genome and 64 bases align at the start. When your module InsertSite.pm adds padding, it seems to incorrectly assume that the bits of the reads that align at the start should have a padding of 52 bases following them, when in reality, these parts are already aligned and present at the end of the genome.
Here is an example read that's been split between the start and the end of the genome in the bam file (note the SA:Z:... tag indicating the chimeric alignment) - first is the bit aligned at the start, following by the bit aligned at the end of the genome:

Here are what the reads look at the end of the genome (the insertion plot at the top is from biotradis but I have chopped off the first 52 nucleotides to allow it to show on Artemis - as you see, once this padding is removed, the insertions align well with the reads from the bam file); the read in the red rectangle is the same read shown above:

When fixing this, you may want to consider as well how to fix the insertion at the start of the genome as the call of a site there is wrong. All reads mapping right at the start actually are split alignments carrying over from the end of the genome so no insertion site should be called.

Thanks again for your help.
Irilenia

lbarquist · 2020-07-24T15:06:56Z

Hi Irilenia,

Thanks for this -- could you pass on a mini fastq file containing some of these problematic reads so we can use them for debugging?

-Lars

irilenia · 2020-07-24T15:15:37Z

Hi Lars, Sure, I’ll ask permission from the people who produced the data and once I have that, I’ll send you the file. Irilenia

…

On 24 Jul 2020, at 17:07, lbarquist ***@***.***> wrote: Hi Irilenia, Thanks for this -- could you pass on a mini fastq file containing some of these problematic reads so we can use them for debugging? -Lars — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#123 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AJZP7XSBZYJX2QVWQTWE6RTR5GPSBANCNFSM4PGGMK7Q>.

irilenia · 2020-07-25T12:52:48Z

Hi Lars, For an example where the insertion plot does not match the reference, you can take sample SRR4113440 from the DeJesus et al. libraries (available in SRA). In this case the padding adds 23 rows more than the genome (Mtb H37Rv) positions and you can see that the first few reads have 23 nucleotides soft-clipped. This can’t be a coincidence… In our own example presented in the previous message, the genome is Mbovis (fasta and annotation attached) and a handful of reads mapping to both the end and the start are given in the fastq file attached. Hope this helps, Irilenia

…

On 24 Jul 2020, at 17:07, lbarquist ***@***.***> wrote: Hi Irilenia, Thanks for this -- could you pass on a mini fastq file containing some of these problematic reads so we can use them for debugging? -Lars — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#123 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AJZP7XSBZYJX2QVWQTWE6RTR5GPSBANCNFSM4PGGMK7Q>.

LCrossman · 2020-09-17T09:54:38Z

Hello, we are also seeing a set of sample files of different lengths longer than they should be and this seems to be the likely cause. Thanks for looking into it,
Best wishes,
Lisa

martinclott · 2020-09-21T08:50:01Z

We see the same problem, looking through the code it appears that the CIGAR.pm file which parses the CIGAR reported by the mapping software is incorrect. My understanding is that the original parser was incorrect, it was then changed a few months ago to fix the soft clipping issue but that causes the software to be incorrect for other CIGAR characters since they were erroneously grouped with the soft clipping character in the same 'if' statement.

According to page 8 of the SAM tools manual https://samtools.github.io/hts-specs/SAMv1.pdf, only characters {M,D,N,=,X} consume the reference and therefore we should only change the coordinate upon encountering those characters.

I'm working on the software for the Quadram Institute, I have changed the If statements in the CIGAR parser to:-
if($action eq 'M' || $action eq 'D' || $action eq 'N' || $action eq '=' || $action eq 'X' ){
$results{start} = $current_coordinate - 1 if($results{start} == -1);
$current_coordinate += $number;
$results{end} = $current_coordinate -1;
}

There are other changes that we are making and this correction should be pushed in alongside other changes in due course.

lbarquist · 2020-09-23T13:26:30Z

Hi Martin,

Do you mean that you want to remove any handling of soft-clipped bases? If so, I think this would just reintroduce the bug I fixed in issue #120, basically that soft-clipped bases at the beginning and ends of reads need to consume bases in the reference, as the design of TraDIS primers are such that the first base corresponds to the insert site -- if there's a base calling error early in the read this will lead to soft-clipping, and a miscalled insertion site as @irilenia showed in that issue report. I have similar data that shows this is a common problem, and shouldn't be ignored. It seems to be a bigger problem with the switch to bwa, as the default smalt parameters were such that reads with more than one or two mismatches were generally tossed.

I think to solve the current issue properly, we would need to track the end position of the chromosome, and forbid insertion sites that extend beyond that -- I haven't looked to see how difficult this would be.

-Lars

martinclott · 2020-09-24T07:49:45Z

Hi Lars My understanding is that the CIGAR parser has never been correct and fixing it for soft-clipping characters has broken it for other characters which are erroneously grouped in the same 'if' statement. Working from the format specification at https://samtools.github.io/hts-specs/SAMv1.pdf, on page 8 they specify 5 characters which 'consume' the reference {M,D,N,+,X} so those 5 characters affect the coordinate in the reference sequence and others do not. Therefore, I suggest changing the code to a single 'if' statement which looks for those 5 characters and ignores others, at the moment I'm using:- # Check page 8 of https://samtools.github.io/hts-specs/SAMv1.pdf to see which symbols consume the reference # If it consumes the reference then most the coordinate position if($action eq 'M' || $action eq 'D' || $action eq 'N' || $action eq '=' || $action eq 'X' ){ $results{start} = $current_coordinate - 1 if($results{start} == -1); $current_coordinate += $number; $results{end} = $current_coordinate -1; } # Do not count soft clipping bases, the coordinate ignores then Best wishes, Dr Martin Lott Researcher Quadram Institute Bioscience, Norwich Research Park, NR4 7UQ T: +44 (0)1603 255245 www.quadram.ac.uk<http://www.quadram.ac.uk> [cid:image001.png@01D424C2.EEFB29C0]

lbarquist · 2020-09-24T09:25:56Z

Hi Martin,

I agree the X and D probably should not be grouped with soft-clipping in terms of extending the coordinates at the ends, but I don't believe this is the problem either here or with issue #120.

If you read #120, you'll see soft-clipping has to be considered to get accurate insertion sites. Basically extra bases from the soft clipping need to be appended to the edge of read first when calculating the start site, otherwise mismatches near the 5' end of the read will lead to a shift from the true insert site at the read start. Once this has been done, S needs to consume bases in the reference, as you'll end up with an incorrect alignment stop site otherwise.

Some of the current issue may be related to the current handling of soft-clipping, but I don't think this was entirely created by my updated handling of soft-clipping and I suspect it's really an issue with the padding in InsertSite.pm -- for example this problem seems to predate my fix for soft-clipping, see issue #86. My guess is that the start and end of the genome aren't being tracked properly, and this probably interacts poorly with anything that modifies the alignment coordinates, particularly when you've got split alignments.

-Lars

martinclott · 2020-10-22T10:01:01Z

Hi Lars Soft clipped bases should not consume the reference. The 'if' statement which looks for the character 'S' appears to have been erroneously grouped with characters 'D' and 'N' by Andrew Page into a single 'if' statement. Please see page 8 of https://samtools.github.io/hts-specs/SAMv1.pdf to see which symbols consume the reference. Characters D and N do consume the reference, character S does not. In changing the if statement, you fix it for S characters and introduce a new problem for D and N characters. I suggest re-writing this so that you group all characters which consume the reference, as described in the sam tools manual, together and only change the coordinate upon encountering those:- if( $cigar_item =~ /(\d+)([MIDNSHP=X])/) { my $number = $1; my $action = $2; # Check page 8 of https://samtools.github.io/hts-specs/SAMv1.pdf to see which symbols consume the reference # If it consumes the reference then most the coordinate position if($action eq 'M' || $action eq 'D' || $action eq 'N' || $action eq '=' || $action eq 'X' ){ $results{start} = $current_coordinate if($results{start} == 0); $current_coordinate += $number; $results{end} = $current_coordinate -1 if($results{end} < $current_coordinate); } # Do not count soft clipping bases, the coordinate ignores then } For some datasets that we look at, we find that there are no insertions reported near the start of the genome, this issue could not be caused by padding. If find that using the above in the CIGAR parser resolves the issues we have been getting in our sequencing data. Looking at the results from the mapping software, the numbers appear to be correct and it's an issue with how those results are parsed. Finally, I checked this with Andrew Page and it appears that the current version of the software does not pass the unit tests https://travis-ci.org/github/sanger-pathogens/Bio-Tradis/builds/712118356, perhaps fixing this will mean that the unit tests now pass? Best wishes, Martin P.S. If possible then please reply to my work email martin.lott@quadram.ac.uk

…

On Wed, 23 Sep 2020 at 14:26, lbarquist ***@***.***> wrote: Hi Martin, Do you mean that you want to remove any handling of soft-clipped bases? If so, I think this would just reintroduce the bug I fixed in issue #120 <#120>, basically that soft-clipped bases at the beginning and ends of reads need to consume bases in the reference, as the design of TraDIS primers are such that the first base corresponds to the insert site -- if there's a base calling error early in the read this will lead to soft-clipping, and a miscalled insertion site as @irilenia <https://github.com/irilenia> showed in that issue report. I have similar data that shows this is a common problem, and shouldn't be ignored. It seems to be a bigger problem with the switch to bwa, as the default smalt parameters were such that reads with more than one or two mismatches were generally tossed. I think to solve the current issue properly, we would need to track the end position of the chromosome, and forbid insertion sites that extend beyond that -- I haven't looked to see how difficult this would be. -Lars — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#123 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ALJXXP7A5OOCKG5YSMPD5F3SHHZRPANCNFSM4PGGMK7Q> .

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problem with insertion plots (files longer than they should be) #123

Problem with insertion plots (files longer than they should be) #123

irilenia commented Jul 23, 2020

lbarquist commented Jul 24, 2020

irilenia commented Jul 24, 2020 via email

irilenia commented Jul 25, 2020 via email

LCrossman commented Sep 17, 2020

martinclott commented Sep 21, 2020

lbarquist commented Sep 23, 2020

martinclott commented Sep 24, 2020 via email

lbarquist commented Sep 24, 2020

martinclott commented Oct 22, 2020 via email

Problem with insertion plots (files longer than they should be) #123

Problem with insertion plots (files longer than they should be) #123

Comments

irilenia commented Jul 23, 2020

lbarquist commented Jul 24, 2020

irilenia commented Jul 24, 2020 via email

irilenia commented Jul 25, 2020 via email

LCrossman commented Sep 17, 2020

martinclott commented Sep 21, 2020

lbarquist commented Sep 23, 2020

martinclott commented Sep 24, 2020 via email

lbarquist commented Sep 24, 2020

martinclott commented Oct 22, 2020 via email