Identifying indels in flanking region for PowerSeq sequences #70

rnmitchell · 2024-01-12T19:20:01Z

Three loci (D7, vWA and PentaD) consistently show indels, specifically deletions, within the flanking regions. lusSTR identifies the flanking region by length alone, so it removes too much or too little sequence when an indel is present. At D7 and PentaD, these deletions are common enough to produce sequences that exceed the analytical threshold (AT) and stutter thresholds and are therefore called as real alleles. However, examining these sequences makes it clear that the deletions reside within the flanking region. The vWA locus consistently shows the same deletion; even though it falls below the AT, it will also be accounted for.
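
As a rough illustration of the problem (a toy sketch, not lusSTR's actual code; the `trim_by_length` helper and the sequences are made up for this example), cutting flanks off by a fixed length shifts the reported repeat region when a single base is deleted from the 5' flank:

```python
def trim_by_length(read: str, flank_5p_len: int, flank_3p_len: int) -> str:
    """Return the repeat region by cutting fixed-length flanks off each end."""
    return read[flank_5p_len : len(read) - flank_3p_len]


# Toy sequences (assumptions for illustration, not real PowerSeq flanks).
flank_5p = "GAGAT" + "A" * 9 + "CTATCAATCTG"          # 5' flank with an A stretch
flank_5p_deleted = "GAGAT" + "A" * 8 + "CTATCAATCTG"  # same flank, one A deleted
repeats = "TCTA" * 4
flank_3p = "GTTAG"

full_read = flank_5p + repeats + flank_3p
deleted_read = flank_5p_deleted + repeats + flank_3p

# Both reads are trimmed with the flank lengths expected for an intact read.
intact = trim_by_length(full_read, len(flank_5p), len(flank_3p))
shifted = trim_by_length(deleted_read, len(flank_5p), len(flank_3p))

print(intact)   # full repeat region
print(shifted)  # repeat region missing its first base
```

In the second case the deletion sits in the flank, but the length-based cut removes the first base of the repeat region instead.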

The D7 locus routinely sees deletions in the A stretch (from the end of the 5' sequence flank to the beginning of the UAS region):
AGAATTGCACCAAATATTGGTAATTAAATGTTTACTATAGACTATTTAGTGAGATAAAAAAAAACTATCAATCTGTCTATCTATCTA...
AGAATTGCACCAAATATTGGTAATTAAATGTTTACTATAGACTATTTAGTGAGATAAAAAAAACTATCAATCTGTCTATCTATCTA...

The PentaD locus routinely sees deletions in the A stretch (from the end of the 5' sequence flank to the beginning of the UAS region):
GAGCCATGATCACACCACTACACTCCAGCCTAGGTGACAGAGCAAGACACCATCTCAAGAAAGAAAAAAAAGAAAGAAAAGA...
GAGCCATGATCACACCACTACACTCCAGCCTAGGTGACAGAGCAAGACACCATCTCAAGAAAGAAAAAAAGAAAGAAAAGA...
Deletions within the 3' sequence flank have also been observed.
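
One way to spot this kind of A-stretch deletion, sketched here under assumptions (the `longest_run` helper and the toy flanks are hypothetical and not taken from lusSTR), is to compare the longest A run in an observed flank with the expected one:

```python
import re


def longest_run(seq: str, base: str = "A") -> int:
    """Length of the longest uninterrupted run of `base` in `seq` (0 if absent)."""
    runs = re.findall(f"{re.escape(base)}+", seq)
    return max((len(run) for run in runs), default=0)


# Toy 5' flanks (assumptions, not the real flanks): the second has one A deleted.
expected_flank = "GAGAT" + "A" * 9 + "CTATCAATCTG"
observed_flank = "GAGAT" + "A" * 8 + "CTATCAATCTG"

# A positive difference suggests bases are missing from the homopolymer stretch.
missing_bases = longest_run(expected_flank) - longest_run(observed_flank)
print(missing_bases)
```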

The vWA locus routinely sees a deletion of the first base in the sequence:
GGATAGATGGATAGATAGATAGATAG...
GATAGATGGATAGATAGATAGATAG...
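
A minimal sketch of how a missing leading base could be detected, assuming the expected start of the sequence is known (the `missing_leading_bases` helper and its `max_missing` parameter are hypothetical; this is not the check implemented in the PR):

```python
from typing import Optional


def missing_leading_bases(
    observed: str, expected_start: str, max_missing: int = 2
) -> Optional[int]:
    """Return how many leading bases of `expected_start` are absent from `observed`.

    Returns 0 for an intact start, and None if no match is found within
    `max_missing` dropped bases.
    """
    for n in range(max_missing + 1):
        if observed.startswith(expected_start[n:]):
            return n
    return None


expected_start = "GGATAGATGG"  # first ten bases of the intact example above
intact_read = "GGATAGATGGATAGATAGATAGATAG"
deleted_read = "GATAGATGGATAGATAGATAGATAG"

print(missing_leading_bases(intact_read, expected_start))
print(missing_leading_bases(deleted_read, expected_start))
```

Capping the search at `max_missing` avoids spurious matches against very short suffixes of the expected start.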

This PR will account for these deletions in order to output the correct CE allele and bracketed UAS sequence region.

rnmitchell · 2024-01-16T14:56:15Z

While this PR addresses the major deletions observed in PowerSeq data, several other observed indels are likely errors. However, since these called alleles have been observed before (e.g. a 9.1 or 10.1 in D7), we aren't comfortable changing these sequences. We will look into increasing the AT for these loci to account for the errors while attempting to keep true alleles of this length.

rnmitchell · 2024-01-16T14:56:29Z

This is ready for review @standage

standage

It's pretty clear from your description and from the code what you're doing here. No concerns from an accuracy perspective.

From a design perspective, I think you may consider a slightly different approach. Instead of manipulating the sequence string in a standalone subroutine and creating a new STRMarkerObject from the modified string, is there a way to handle this in the objects themselves (e.g. STRMarker_PentaD and so on)? If so, I think that would be a much cleaner approach and more in line with the design we have for locus-specific sequence handling.

standage · 2024-01-16T18:49:16Z

lusSTR/wrappers/convert.py

+        if (
+            locus == "PENTA D"
+            and kit == "powerseq"
+            and marker.indel_flag == "Possible indel or partial sequence"
+        ):
+            marker = check_pentad(marker, sequence, software)
+            indel_flag = "Possible indel or partial sequence"
+        elif (
+            locus == "D7S820"
+            and kit == "powerseq"
+            and marker.indel_flag == "Possible indel or partial sequence"
+        ):


Checking for the indel flag in each of these conditionals and then assigning it to an accessory variable in each block is redundant and unnecessary. You should check, but I think the following code should accomplish the same thing.

                        "Please check STRait Razor version!!"
                    )
                    print(msg)
-        if (
-            locus == "PENTA D"
-            and kit == "powerseq"
-            and marker.indel_flag == "Possible indel or partial sequence"
-        ):
-            marker = check_pentad(marker, sequence, software)
-            indel_flag = "Possible indel or partial sequence"
-        elif (
-            locus == "D7S820"
-            and kit == "powerseq"
-            and marker.indel_flag == "Possible indel or partial sequence"
-        ):
-            marker = check_D7(marker, sequence, software)
-            indel_flag = "Possible indel or partial sequence"
-        elif (
-            locus == "VWA"
-            and kit == "powerseq"
-            and marker.indel_flag == "Possible indel or partial sequence"
-        ):
-            marker = check_vwa(marker, sequence, software)
-            indel_flag = "Possible indel or partial sequence"
-        else:
-            indel_flag = marker.indel_flag
+        indel_flag = marker.indel_flag
+        if indel_flag == "Possible indel or partial sequence":
+            if locus == "PENTA D" and kit == "powerseq":
+                marker = check_pentad(marker, sequence, software)
+            elif locus == "D7S820" and kit == "powerseq":
+                marker = check_D7(marker, sequence, software)
+            elif locus == "VWA" and kit == "powerseq":
+                marker = check_vwa(marker, sequence, software)
         summary = [sampleid, project, analysis, locus] + marker.summary + [reads]
         list_of_lists.append(summary)
         if software != "uas":

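One way to check that the two versions of this conditional behave the same is to run both over every combination of inputs with stubbed-out check functions. This is only a sketch: the stubs stand in for the real `check_pentad`, `check_D7` and `check_vwa`, and `make_check`, `fresh_marker`, the extra locus and the second kit name are assumptions added for the comparison.

```python
from itertools import product
from types import SimpleNamespace

FLAG = "Possible indel or partial sequence"


def make_check(name):
    """Build a stub that returns a new marker tagged with the check that ran."""
    def check(marker, sequence, software):
        return SimpleNamespace(indel_flag=marker.indel_flag, checked_by=name)
    return check


# Hypothetical stand-ins for the real check_pentad, check_D7 and check_vwa.
check_pentad, check_D7, check_vwa = (make_check(n) for n in ("pentad", "D7", "vwa"))


def original(marker, locus, kit, sequence=None, software=None):
    """Control flow as written in the PR."""
    if locus == "PENTA D" and kit == "powerseq" and marker.indel_flag == FLAG:
        marker = check_pentad(marker, sequence, software)
        indel_flag = FLAG
    elif locus == "D7S820" and kit == "powerseq" and marker.indel_flag == FLAG:
        marker = check_D7(marker, sequence, software)
        indel_flag = FLAG
    elif locus == "VWA" and kit == "powerseq" and marker.indel_flag == FLAG:
        marker = check_vwa(marker, sequence, software)
        indel_flag = FLAG
    else:
        indel_flag = marker.indel_flag
    return marker.checked_by, indel_flag


def suggested(marker, locus, kit, sequence=None, software=None):
    """Control flow from the review suggestion."""
    indel_flag = marker.indel_flag
    if indel_flag == FLAG:
        if locus == "PENTA D" and kit == "powerseq":
            marker = check_pentad(marker, sequence, software)
        elif locus == "D7S820" and kit == "powerseq":
            marker = check_D7(marker, sequence, software)
        elif locus == "VWA" and kit == "powerseq":
            marker = check_vwa(marker, sequence, software)
    return marker.checked_by, indel_flag


def fresh_marker(flag):
    """Untouched marker stub carrying only the indel flag."""
    return SimpleNamespace(indel_flag=flag, checked_by=None)


loci = ["PENTA D", "D7S820", "VWA", "TH01"]
kits = ["powerseq", "forenseq"]
flags = [FLAG, ""]

# Collect every input combination where the two versions disagree.
mismatches = [
    (locus, kit, flag)
    for locus, kit, flag in product(loci, kits, flags)
    if original(fresh_marker(flag), locus, kit) != suggested(fresh_marker(flag), locus, kit)
]
print(mismatches)
```

An empty list means both versions pick the same check function and report the same indel flag for every combination tried.
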
standage · 2024-01-16T18:50:21Z

lusSTR/wrappers/convert.py

                        "Please check STRait Razor version!!"
                    )
                    print(msg)


Messages like this are really intended for stderr, with print(..., file=sys.stderr) or with warn() (from warnings import warn).
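
For reference, a small sketch of both options. The helper names `notify_stderr` and `notify_warning` are made up for illustration, and the message text below is only the fragment visible in the diff:

```python
import sys
import warnings


def notify_stderr(msg, stream=None):
    """Write a diagnostic message to stderr (or to `stream`, if given)."""
    print(msg, file=stream if stream is not None else sys.stderr)


def notify_warning(msg):
    """Emit the message as a UserWarning attributed to the caller."""
    warnings.warn(msg, UserWarning, stacklevel=2)


msg = "Please check STRait Razor version!!"  # fragment of the message shown in the diff
notify_stderr(msg)
notify_warning(msg)
```

Either way the message stays out of stdout. With `warn()`, callers can also filter the message or turn it into an error through the `warnings` module.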

rnmitchell · 2024-01-17T12:45:41Z

> It's pretty clear from your description and from the code what you're doing here. No concerns from an accuracy perspective.
>
> From a design perspective, I think you may consider a slightly different approach. Instead of manipulating the sequence string in a standalone subroutine and creating a new STRMarkerObject from the modified string, is there a way to handle this in the objects themselves (e.g. STRMarker_PentaD and so on)? If so, I think that would be a much cleaner approach and more in line with the design we have for locus-specific sequence handling.

I thought about this in the beginning but since you need to know the CE allele and indel flag before running the code, it seemed complicated to include it in the STRMarker object.

standage · 2024-01-19T15:33:36Z

> I thought about this in the beginning but since you need to know the CE allele and indel flag before running the code, it seemed complicated to include it in the STRMarker object.

It could be. Whether it's messier or more complicated than what you have now depends on when and where the CE alleles and indel flags are computed. Are these operations done by the STRMarker object?

rnmitchell · 2024-02-01T14:20:26Z

> > I thought about this in the beginning but since you need to know the CE allele and indel flag before running the code, it seemed complicated to include it in the STRMarker object.
>
> It could be. Whether it's messier or more complicated than what you have now depends on when and where the CE alleles and indel flags are computed. Are these operations done by the STRMarker object?

I've been thinking about this, and I think leaving it as it is now is the least messy option. I'm not entirely sure how to run the different functions within the STRMarker class multiple times (i.e. first calculate the canonical allele, then filter the sequence if the indel flag reports it as a possible indel, then re-run the entire STRMarker object, including recalculating the canonical allele, reconverting all the flanking sequences and UAS region sequences to the bracketed form, etc.). Thoughts?

standage · 2024-02-06T15:21:04Z

As we discussed, let's stick to the approach you've implemented for now.

rnmitchell added 2 commits January 12, 2024 13:29

fixing indels for pentaD, vWA and D7 [skip ci]

12e5ce9

added test for deletions

fc928b2

rnmitchell marked this pull request as ready for review January 16, 2024 14:38

rnmitchell requested a review from standage January 16, 2024 14:38

standage requested changes Jan 16, 2024

updated covert code

d3b359f

standage approved these changes Feb 6, 2024

standage merged commit d55a2b2 into master Feb 6, 2024
2 checks passed

standage deleted the indels branch February 6, 2024 15:21
