Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identifying indels in flanking region for PowerSeq sequences #70

Merged
merged 3 commits into from
Feb 6, 2024

Conversation

rnmitchell
Copy link
Contributor

Three loci, D7, vWA and PentaD, were identified to consistently have indels, specifically deletions, within the flanking regions. When run through lusSTR, lusSTR is identifying the flanking region by length alone and therefore removing too much/too little sequence when indels are present. At D7 and PentaD, these deletions are common enough to produce sequences that are above the AT and stutter thresholds and therefore are called real alleles. However, examining these sequences, it is clear they are residing within the flanking region. The vWA locus consistently sees the same deletion, and even though it falls below the AT, will also be accounted for.

The D7 locus routinely sees deletions in the A stretch (from the end of the 5' sequence flank to the beginning of the UAS region):
AGAATTGCACCAAATATTGGTAATTAAATGTTTACTATAGACTATTTAGTGAGATAAAAAAAAACTATCAATCTGTCTATCTATCTA...
AGAATTGCACCAAATATTGGTAATTAAATGTTTACTATAGACTATTTAGTGAGATAAAAAAAACTATCAATCTGTCTATCTATCTA...
Screen Shot 2024-01-12 at 1 34 05 PM

The PentaD locus routinely sees deletions in the A stretch (from the end of the 5' sequence flank to the beginning of the UAS region):
GAGCCATGATCACACCACTACACTCCAGCCTAGGTGACAGAGCAAGACACCATCTCAAGAAAGAAAAAAAAGAAAGAAAAGA...
GAGCCATGATCACACCACTACACTCCAGCCTAGGTGACAGAGCAAGACACCATCTCAAGAAAGAAAAAAAGAAAGAAAAGA...
Deletions within the 3' sequence flank have also been observed.
Screen Shot 2024-01-12 at 1 35 15 PM

The vWA locus routinely sees a deletion of the first base in the sequence:
GGATAGATGGATAGATAGATAGATAG...
GATAGATGGATAGATAGATAGATAG...

Screen Shot 2024-01-12 at 1 35 29 PM

This PR will account for these deletions in order to output the correct CE allele and bracketed UAS sequence region.

@rnmitchell rnmitchell marked this pull request as ready for review January 16, 2024 14:38
@rnmitchell
Copy link
Contributor Author

While this PR addresses the major deletions observed in PowerSeq data, there are several other indels observed that are likely errors- however, since these called alleles has been observed before (e.g. a 9.1 or 10.1 in D7), we aren't comfortable changing these sequences. We will look into increasing the AT for these loci in order to account for the errors while attempting to keep true alleles of this length.

@rnmitchell
Copy link
Contributor Author

This is ready for review @standage

Copy link
Member

@standage standage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's pretty clear from your description and from the code what you're doing here. No concerns from an accuracy perspective.

From a design perspective, I think you may consider a slightly different approach. Instead of manipulating the sequence string in a standalone subroutine and creating a new STRMarkerObject from the modified string, is there a way to handle this in the objects themselves (e.g. STRMarker_PentaD and so on)? If so, I think that would be a much cleaner approach and more in line with the design we have for locus-specific sequence handling.

Comment on lines 111 to 122
if (
locus == "PENTA D"
and kit == "powerseq"
and marker.indel_flag == "Possible indel or partial sequence"
):
marker = check_pentad(marker, sequence, software)
indel_flag = "Possible indel or partial sequence"
elif (
locus == "D7S820"
and kit == "powerseq"
and marker.indel_flag == "Possible indel or partial sequence"
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking for the indel flag in each of these conditionals and then assigning it to an accessory variable in each block is redundant and unnecessary. You should check, but I think the following code should accomplish the same thing.

                         "Please check STRait Razor version!!"
                     )
                     print(msg)
-        if (
-            locus == "PENTA D"
-            and kit == "powerseq"
-            and marker.indel_flag == "Possible indel or partial sequence"
-        ):
-            marker = check_pentad(marker, sequence, software)
-            indel_flag = "Possible indel or partial sequence"
-        elif (
-            locus == "D7S820"
-            and kit == "powerseq"
-            and marker.indel_flag == "Possible indel or partial sequence"
-        ):
-            marker = check_D7(marker, sequence, software)
-            indel_flag = "Possible indel or partial sequence"
-        elif (
-            locus == "VWA"
-            and kit == "powerseq"
-            and marker.indel_flag == "Possible indel or partial sequence"
-        ):
-            marker = check_vwa(marker, sequence, software)
-            indel_flag = "Possible indel or partial sequence"
-        else:
-            indel_flag = marker.indel_flag
+        indel_flag = marker.indel_flag
+        if indel_flag == "Possible indel or partial sequence":
+            if locus == "PENTA D" and kit == "powerseq":
+                marker = check_pentad(marker, sequence, software)
+            elif locus == "D7S820" and kit == "powerseq":
+                marker = check_D7(marker, sequence, software)
+            elif locus == "VWA" and kit == "powerseq":
+                marker = check_vwa(marker, sequence, software)
         summary = [sampleid, project, analysis, locus] + marker.summary + [reads]
         list_of_lists.append(summary)
         if software != "uas":

Comment on lines 108 to 110
"Please check STRait Razor version!!"
)
print(msg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Messages like this are really intended for stderr, with print(..., file=sys.stderr) or with warn() (from warnings import warn).

@rnmitchell
Copy link
Contributor Author

It's pretty clear from your description and from the code what you're doing here. No concerns from an accuracy perspective.

From a design perspective, I think you may consider a slightly different approach. Instead of manipulating the sequence string in a standalone subroutine and creating a new STRMarkerObject from the modified string, is there a way to handle this in the objects themselves (e.g. STRMarker_PentaD and so on)? If so, I think that would be a much cleaner approach and more in line with the design we have for locus-specific sequence handling.

I thought about this in the beginning but since you need to know the CE allele and indel flag before running the code, it seemed complicated to include it in the STRMarker object.

@standage
Copy link
Member

I thought about this in the beginning but since you need to know the CE allele and indel flag before running the code, it seemed complicated to include it in the STRMarker object.

It could be. Whether it's messier or more complicated than what you have now depends on when and where the CE alleles and indel flags are computed. Are these operations done by the STRMarker object?

@rnmitchell
Copy link
Contributor Author

rnmitchell commented Feb 1, 2024

I thought about this in the beginning but since you need to know the CE allele and indel flag before running the code, it seemed complicated to include it in the STRMarker object.

It could be. Whether it's messier or more complicated than what you have now depends on when and where the CE alleles and indel flags are computed. Are these operations done by the STRMarker object?

I've been thinking about this and I think leaving it how it is now is the least messiest. I'm not entirely sure how to run the different functions within the STRMarker class multiple times (i.e. first calculate the canonical allele, then filter the sequence if the indel flag is reporting the sequence as a possible indel, then re-run the entire STRMarker object, including re-calculate the canonical allele, reconvert all the flanking sequences and UAS region sequences to the bracketed form, etc.). Thoughts?

@standage
Copy link
Member

standage commented Feb 6, 2024

As we discussed, let's stick to the approach you've implemented for now.

@standage standage merged commit d55a2b2 into master Feb 6, 2024
2 checks passed
@standage standage deleted the indels branch February 6, 2024 15:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants