-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Identifying indels in flanking region for PowerSeq sequences #70
Conversation
While this PR addresses the major deletions observed in PowerSeq data, there are several other indels observed that are likely errors- however, since these called alleles has been observed before (e.g. a 9.1 or 10.1 in D7), we aren't comfortable changing these sequences. We will look into increasing the AT for these loci in order to account for the errors while attempting to keep true alleles of this length. |
This is ready for review @standage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's pretty clear from your description and from the code what you're doing here. No concerns from an accuracy perspective.
From a design perspective, I think you may consider a slightly different approach. Instead of manipulating the sequence string in a standalone subroutine and creating a new STRMarkerObject
from the modified string, is there a way to handle this in the objects themselves (e.g. STRMarker_PentaD
and so on)? If so, I think that would be a much cleaner approach and more in line with the design we have for locus-specific sequence handling.
lusSTR/wrappers/convert.py
Outdated
if ( | ||
locus == "PENTA D" | ||
and kit == "powerseq" | ||
and marker.indel_flag == "Possible indel or partial sequence" | ||
): | ||
marker = check_pentad(marker, sequence, software) | ||
indel_flag = "Possible indel or partial sequence" | ||
elif ( | ||
locus == "D7S820" | ||
and kit == "powerseq" | ||
and marker.indel_flag == "Possible indel or partial sequence" | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking for the indel flag in each of these conditionals and then assigning it to an accessory variable in each block is redundant and unnecessary. You should check, but I think the following code should accomplish the same thing.
"Please check STRait Razor version!!"
)
print(msg)
- if (
- locus == "PENTA D"
- and kit == "powerseq"
- and marker.indel_flag == "Possible indel or partial sequence"
- ):
- marker = check_pentad(marker, sequence, software)
- indel_flag = "Possible indel or partial sequence"
- elif (
- locus == "D7S820"
- and kit == "powerseq"
- and marker.indel_flag == "Possible indel or partial sequence"
- ):
- marker = check_D7(marker, sequence, software)
- indel_flag = "Possible indel or partial sequence"
- elif (
- locus == "VWA"
- and kit == "powerseq"
- and marker.indel_flag == "Possible indel or partial sequence"
- ):
- marker = check_vwa(marker, sequence, software)
- indel_flag = "Possible indel or partial sequence"
- else:
- indel_flag = marker.indel_flag
+ indel_flag = marker.indel_flag
+ if indel_flag == "Possible indel or partial sequence":
+ if locus == "PENTA D" and kit == "powerseq":
+ marker = check_pentad(marker, sequence, software)
+ elif locus == "D7S820" and kit == "powerseq":
+ marker = check_D7(marker, sequence, software)
+ elif locus == "VWA" and kit == "powerseq":
+ marker = check_vwa(marker, sequence, software)
summary = [sampleid, project, analysis, locus] + marker.summary + [reads]
list_of_lists.append(summary)
if software != "uas":
lusSTR/wrappers/convert.py
Outdated
"Please check STRait Razor version!!" | ||
) | ||
print(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Messages like this are really intended for stderr, with print(..., file=sys.stderr)
or with warn()
(from warnings import warn
).
I thought about this in the beginning but since you need to know the CE allele and indel flag before running the code, it seemed complicated to include it in the STRMarker object. |
It could be. Whether it's messier or more complicated than what you have now depends on when and where the CE alleles and indel flags are computed. Are these operations done by the STRMarker object? |
I've been thinking about this and I think leaving it how it is now is the least messiest. I'm not entirely sure how to run the different functions within the STRMarker class multiple times (i.e. first calculate the canonical allele, then filter the sequence if the indel flag is reporting the sequence as a possible indel, then re-run the entire STRMarker object, including re-calculate the canonical allele, reconvert all the flanking sequences and UAS region sequences to the bracketed form, etc.). Thoughts? |
As we discussed, let's stick to the approach you've implemented for now. |
Three loci, D7, vWA and PentaD, were identified to consistently have indels, specifically deletions, within the flanking regions. When run through lusSTR, lusSTR is identifying the flanking region by length alone and therefore removing too much/too little sequence when indels are present. At D7 and PentaD, these deletions are common enough to produce sequences that are above the AT and stutter thresholds and therefore are called real alleles. However, examining these sequences, it is clear they are residing within the flanking region. The vWA locus consistently sees the same deletion, and even though it falls below the AT, will also be accounted for.
The D7 locus routinely sees deletions in the A stretch (from the end of the 5' sequence flank to the beginning of the UAS region):
AGAATTGCACCAAATATTGGTAATTAAATGTTTACTATAGACTATTTAGTGAGATAAAAAAAAACTATCAATCTGTCTATCTATCTA...
AGAATTGCACCAAATATTGGTAATTAAATGTTTACTATAGACTATTTAGTGAGATAAAAAAAACTATCAATCTGTCTATCTATCTA...
The PentaD locus routinely sees deletions in the A stretch (from the end of the 5' sequence flank to the beginning of the UAS region):
GAGCCATGATCACACCACTACACTCCAGCCTAGGTGACAGAGCAAGACACCATCTCAAGAAAGAAAAAAAAGAAAGAAAAGA...
GAGCCATGATCACACCACTACACTCCAGCCTAGGTGACAGAGCAAGACACCATCTCAAGAAAGAAAAAAAGAAAGAAAAGA...
Deletions within the 3' sequence flank have also been observed.
The vWA locus routinely sees a deletion of the first base in the sequence:
GGATAGATGGATAGATAGATAGATAG...
GATAGATGGATAGATAGATAGATAG...
This PR will account for these deletions in order to output the correct CE allele and bracketed UAS sequence region.