Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with insertion plots (files longer than they should be) #123

Open
irilenia opened this issue Jul 23, 2020 · 9 comments
Open

Problem with insertion plots (files longer than they should be) #123

irilenia opened this issue Jul 23, 2020 · 9 comments

Comments

@irilenia
Copy link

Hi,
I wanted to raise another issue with insertion plots and suggest a possible solution for it. In issue #86 (closed), several users mentioned that their insertion plots would not open in Artemis. Although this was not considered at the time to be a problem with the biotradis code, I believe it is. As pointed out by others, these insertion plots are occasionally longer than the reference genome. I believe the problem originates with how chimeric alignments of reads covering both the end and the beginning of the circular genome are treated. I think these reads are not seen as being chimeric and instead, padding is added (to the beginning of the genome in my case) to cover for positions that are not matched. This creates extra positions and results in insertion plots that are longer than the reference sequence.

To give you a more specific example, we are looking at a genome with length 4349904 bases. You'd expect the insertion plot to have that many rows but instead it has 4349956 rows. If you examine the end region of the genome and the reads that are aligned there, you will see that nearly all align all the way to the end but in reality they have been chopped up: 52 bases align at the end of the genome and 64 bases align at the start. When your module InsertSite.pm adds padding, it seems to incorrectly assume that the bits of the reads that align at the start should have a padding of 52 bases following them, when in reality, these parts are already aligned and present at the end of the genome.
Here is an example read that's been split between the start and the end of the genome in the bam file (note the SA:Z:... tag indicating the chimeric alignment) - first is the bit aligned at the start, following by the bit aligned at the end of the genome:
Screen Shot 2020-07-24 at 01 12 23

Here are what the reads look at the end of the genome (the insertion plot at the top is from biotradis but I have chopped off the first 52 nucleotides to allow it to show on Artemis - as you see, once this padding is removed, the insertions align well with the reads from the bam file); the read in the red rectangle is the same read shown above:
Screen Shot 2020-07-24 at 01 16 00

When fixing this, you may want to consider as well how to fix the insertion at the start of the genome as the call of a site there is wrong. All reads mapping right at the start actually are split alignments carrying over from the end of the genome so no insertion site should be called.

Thanks again for your help.
Irilenia

@lbarquist
Copy link
Contributor

Hi Irilenia,

Thanks for this -- could you pass on a mini fastq file containing some of these problematic reads so we can use them for debugging?

-Lars

@irilenia
Copy link
Author

irilenia commented Jul 24, 2020 via email

@irilenia
Copy link
Author

irilenia commented Jul 25, 2020 via email

@LCrossman
Copy link

Hello, we are also seeing a set of sample files of different lengths longer than they should be and this seems to be the likely cause. Thanks for looking into it,
Best wishes,
Lisa

@martinclott
Copy link

We see the same problem, looking through the code it appears that the CIGAR.pm file which parses the CIGAR reported by the mapping software is incorrect. My understanding is that the original parser was incorrect, it was then changed a few months ago to fix the soft clipping issue but that causes the software to be incorrect for other CIGAR characters since they were erroneously grouped with the soft clipping character in the same 'if' statement.

According to page 8 of the SAM tools manual https://samtools.github.io/hts-specs/SAMv1.pdf, only characters {M,D,N,=,X} consume the reference and therefore we should only change the coordinate upon encountering those characters.

I'm working on the software for the Quadram Institute, I have changed the If statements in the CIGAR parser to:-
if($action eq 'M' || $action eq 'D' || $action eq 'N' || $action eq '=' || $action eq 'X' ){
$results{start} = $current_coordinate - 1 if($results{start} == -1);
$current_coordinate += $number;
$results{end} = $current_coordinate -1;
}

There are other changes that we are making and this correction should be pushed in alongside other changes in due course.

@lbarquist
Copy link
Contributor

Hi Martin,

Do you mean that you want to remove any handling of soft-clipped bases? If so, I think this would just reintroduce the bug I fixed in issue #120, basically that soft-clipped bases at the beginning and ends of reads need to consume bases in the reference, as the design of TraDIS primers are such that the first base corresponds to the insert site -- if there's a base calling error early in the read this will lead to soft-clipping, and a miscalled insertion site as @irilenia showed in that issue report. I have similar data that shows this is a common problem, and shouldn't be ignored. It seems to be a bigger problem with the switch to bwa, as the default smalt parameters were such that reads with more than one or two mismatches were generally tossed.

I think to solve the current issue properly, we would need to track the end position of the chromosome, and forbid insertion sites that extend beyond that -- I haven't looked to see how difficult this would be.

-Lars

@martinclott
Copy link

martinclott commented Sep 24, 2020 via email

@lbarquist
Copy link
Contributor

Hi Martin,

I agree the X and D probably should not be grouped with soft-clipping in terms of extending the coordinates at the ends, but I don't believe this is the problem either here or with issue #120.

If you read #120, you'll see soft-clipping has to be considered to get accurate insertion sites. Basically extra bases from the soft clipping need to be appended to the edge of read first when calculating the start site, otherwise mismatches near the 5' end of the read will lead to a shift from the true insert site at the read start. Once this has been done, S needs to consume bases in the reference, as you'll end up with an incorrect alignment stop site otherwise.

Some of the current issue may be related to the current handling of soft-clipping, but I don't think this was entirely created by my updated handling of soft-clipping and I suspect it's really an issue with the padding in InsertSite.pm -- for example this problem seems to predate my fix for soft-clipping, see issue #86. My guess is that the start and end of the genome aren't being tracked properly, and this probably interacts poorly with anything that modifies the alignment coordinates, particularly when you've got split alignments.

-Lars

@martinclott
Copy link

martinclott commented Oct 22, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants