-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Problem with insertion plots (files longer than they should be) #123
Comments
Hi Irilenia, Thanks for this -- could you pass on a mini fastq file containing some of these problematic reads so we can use them for debugging? -Lars |
Hi Lars,
Sure, I’ll ask permission from the people who produced the data and once I have that, I’ll send you the file.
Irilenia
… On 24 Jul 2020, at 17:07, lbarquist ***@***.***> wrote:
Hi Irilenia,
Thanks for this -- could you pass on a mini fastq file containing some of these problematic reads so we can use them for debugging?
-Lars
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub <#123 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AJZP7XSBZYJX2QVWQTWE6RTR5GPSBANCNFSM4PGGMK7Q>.
|
Hi Lars,
For an example where the insertion plot does not match the reference, you can take sample SRR4113440 from the DeJesus et al. libraries (available in SRA). In this case the padding adds 23 rows more than the genome (Mtb H37Rv) positions and you can see that the first few reads have 23 nucleotides soft-clipped. This can’t be a coincidence…
In our own example presented in the previous message, the genome is Mbovis (fasta and annotation attached) and a handful of reads mapping to both the end and the start are given in the fastq file attached.
Hope this helps,
Irilenia
… On 24 Jul 2020, at 17:07, lbarquist ***@***.***> wrote:
Hi Irilenia,
Thanks for this -- could you pass on a mini fastq file containing some of these problematic reads so we can use them for debugging?
-Lars
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub <#123 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AJZP7XSBZYJX2QVWQTWE6RTR5GPSBANCNFSM4PGGMK7Q>.
|
Hello, we are also seeing a set of sample files of different lengths longer than they should be and this seems to be the likely cause. Thanks for looking into it, |
We see the same problem, looking through the code it appears that the CIGAR.pm file which parses the CIGAR reported by the mapping software is incorrect. My understanding is that the original parser was incorrect, it was then changed a few months ago to fix the soft clipping issue but that causes the software to be incorrect for other CIGAR characters since they were erroneously grouped with the soft clipping character in the same 'if' statement. According to page 8 of the SAM tools manual https://samtools.github.io/hts-specs/SAMv1.pdf, only characters {M,D,N,=,X} consume the reference and therefore we should only change the coordinate upon encountering those characters. I'm working on the software for the Quadram Institute, I have changed the If statements in the CIGAR parser to:- There are other changes that we are making and this correction should be pushed in alongside other changes in due course. |
Hi Martin, Do you mean that you want to remove any handling of soft-clipped bases? If so, I think this would just reintroduce the bug I fixed in issue #120, basically that soft-clipped bases at the beginning and ends of reads need to consume bases in the reference, as the design of TraDIS primers are such that the first base corresponds to the insert site -- if there's a base calling error early in the read this will lead to soft-clipping, and a miscalled insertion site as @irilenia showed in that issue report. I have similar data that shows this is a common problem, and shouldn't be ignored. It seems to be a bigger problem with the switch to bwa, as the default smalt parameters were such that reads with more than one or two mismatches were generally tossed. I think to solve the current issue properly, we would need to track the end position of the chromosome, and forbid insertion sites that extend beyond that -- I haven't looked to see how difficult this would be. -Lars |
Hi Lars
My understanding is that the CIGAR parser has never been correct and fixing it for soft-clipping characters has broken it for other characters which are erroneously grouped in the same 'if' statement.
Working from the format specification at https://samtools.github.io/hts-specs/SAMv1.pdf, on page 8 they specify 5 characters which 'consume' the reference {M,D,N,+,X} so those 5 characters affect the coordinate in the reference sequence and others do not.
Therefore, I suggest changing the code to a single 'if' statement which looks for those 5 characters and ignores others, at the moment I'm using:-
# Check page 8 of https://samtools.github.io/hts-specs/SAMv1.pdf to see which symbols consume the reference
# If it consumes the reference then most the coordinate position
if($action eq 'M' || $action eq 'D' || $action eq 'N' || $action eq '=' || $action eq 'X' ){
$results{start} = $current_coordinate - 1 if($results{start} == -1);
$current_coordinate += $number;
$results{end} = $current_coordinate -1;
}
# Do not count soft clipping bases, the coordinate ignores then
Best wishes,
Dr Martin Lott
Researcher
Quadram Institute Bioscience,
Norwich Research Park, NR4 7UQ
T: +44 (0)1603 255245
www.quadram.ac.uk<http://www.quadram.ac.uk>
[cid:image001.png@01D424C2.EEFB29C0]
|
Hi Martin, I agree the X and D probably should not be grouped with soft-clipping in terms of extending the coordinates at the ends, but I don't believe this is the problem either here or with issue #120. If you read #120, you'll see soft-clipping has to be considered to get accurate insertion sites. Basically extra bases from the soft clipping need to be appended to the edge of read first when calculating the start site, otherwise mismatches near the 5' end of the read will lead to a shift from the true insert site at the read start. Once this has been done, S needs to consume bases in the reference, as you'll end up with an incorrect alignment stop site otherwise. Some of the current issue may be related to the current handling of soft-clipping, but I don't think this was entirely created by my updated handling of soft-clipping and I suspect it's really an issue with the padding in InsertSite.pm -- for example this problem seems to predate my fix for soft-clipping, see issue #86. My guess is that the start and end of the genome aren't being tracked properly, and this probably interacts poorly with anything that modifies the alignment coordinates, particularly when you've got split alignments. -Lars |
Hi Lars
Soft clipped bases should not consume the reference. The 'if' statement
which looks for the character 'S' appears to have been erroneously grouped
with characters 'D' and 'N' by Andrew Page into a single 'if' statement.
Please see page 8 of https://samtools.github.io/hts-specs/SAMv1.pdf to see
which symbols consume the reference. Characters D and N do consume the
reference, character S does not.
In changing the if statement, you fix it for S characters and introduce a
new problem for D and N characters.
I suggest re-writing this so that you group all characters which consume
the reference, as described in the sam tools manual, together and only
change the coordinate upon encountering those:-
if( $cigar_item =~ /(\d+)([MIDNSHP=X])/)
{
my $number = $1;
my $action = $2;
# Check page 8 of https://samtools.github.io/hts-specs/SAMv1.pdf to see
which symbols consume the reference
# If it consumes the reference then most the coordinate position
if($action eq 'M' || $action eq 'D' || $action eq 'N' || $action eq '=' ||
$action eq 'X' ){
$results{start} = $current_coordinate if($results{start} == 0);
$current_coordinate += $number;
$results{end} = $current_coordinate -1 if($results{end} <
$current_coordinate);
}
# Do not count soft clipping bases, the coordinate ignores then
}
For some datasets that we look at, we find that there are no insertions
reported near the start of the genome, this issue could not be caused by
padding. If find that using the above in the CIGAR parser resolves the
issues we have been getting in our sequencing data. Looking at the results
from the mapping software, the numbers appear to be correct and it's an
issue with how those results are parsed.
Finally, I checked this with Andrew Page and it appears that the current
version of the software does not pass the unit tests
https://travis-ci.org/github/sanger-pathogens/Bio-Tradis/builds/712118356,
perhaps fixing this will mean that the unit tests now pass?
Best wishes,
Martin
P.S. If possible then please reply to my work email
martin.lott@quadram.ac.uk
…On Wed, 23 Sep 2020 at 14:26, lbarquist ***@***.***> wrote:
Hi Martin,
Do you mean that you want to remove any handling of soft-clipped bases? If
so, I think this would just reintroduce the bug I fixed in issue #120
<#120>, basically
that soft-clipped bases at the beginning and ends of reads need to consume
bases in the reference, as the design of TraDIS primers are such that the
first base corresponds to the insert site -- if there's a base calling
error early in the read this will lead to soft-clipping, and a miscalled
insertion site as @irilenia <https://github.com/irilenia> showed in that
issue report. I have similar data that shows this is a common problem, and
shouldn't be ignored. It seems to be a bigger problem with the switch to
bwa, as the default smalt parameters were such that reads with more than
one or two mismatches were generally tossed.
I think to solve the current issue properly, we would need to track the
end position of the chromosome, and forbid insertion sites that extend
beyond that -- I haven't looked to see how difficult this would be.
-Lars
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#123 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ALJXXP7A5OOCKG5YSMPD5F3SHHZRPANCNFSM4PGGMK7Q>
.
|
Hi,
I wanted to raise another issue with insertion plots and suggest a possible solution for it. In issue #86 (closed), several users mentioned that their insertion plots would not open in Artemis. Although this was not considered at the time to be a problem with the biotradis code, I believe it is. As pointed out by others, these insertion plots are occasionally longer than the reference genome. I believe the problem originates with how chimeric alignments of reads covering both the end and the beginning of the circular genome are treated. I think these reads are not seen as being chimeric and instead, padding is added (to the beginning of the genome in my case) to cover for positions that are not matched. This creates extra positions and results in insertion plots that are longer than the reference sequence.
To give you a more specific example, we are looking at a genome with length 4349904 bases. You'd expect the insertion plot to have that many rows but instead it has 4349956 rows. If you examine the end region of the genome and the reads that are aligned there, you will see that nearly all align all the way to the end but in reality they have been chopped up: 52 bases align at the end of the genome and 64 bases align at the start. When your module InsertSite.pm adds padding, it seems to incorrectly assume that the bits of the reads that align at the start should have a padding of 52 bases following them, when in reality, these parts are already aligned and present at the end of the genome.
Here is an example read that's been split between the start and the end of the genome in the bam file (note the SA:Z:... tag indicating the chimeric alignment) - first is the bit aligned at the start, following by the bit aligned at the end of the genome:
Here are what the reads look at the end of the genome (the insertion plot at the top is from biotradis but I have chopped off the first 52 nucleotides to allow it to show on Artemis - as you see, once this padding is removed, the insertions align well with the reads from the bam file); the read in the red rectangle is the same read shown above:
When fixing this, you may want to consider as well how to fix the insertion at the start of the genome as the call of a site there is wrong. All reads mapping right at the start actually are split alignments carrying over from the end of the genome so no insertion site should be called.
Thanks again for your help.
Irilenia
The text was updated successfully, but these errors were encountered: