-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Linear assembly producing misannotated sequences #136
Comments
Hi, and thanks for the positive review. Thank you for taking the time to report this and making pydna better. |
Hi @BjornFJohansson I think I had mentioned this somewhere, but I could not find where. This keeps happenning, but I won't fix it here since the new assembly implementation does not have that problem. I was preparing a minimal example for #255 and I came across this error, and a similar one in the PCR module (#262). In the example below, I amplify a fragment from pombe's genome containing the gene ase1, and clone it into an expression vector using Gibson assembly: Files to reproduce: reproduce.zip from pydna.design import assembly_fragments, primer_design
from pydna.parsers import parse
from pydna.dseqrecord import Dseqrecord
from pydna.amplicon import Amplicon
from pydna.amplify import pcr
from pydna.assembly import Assembly
from pydna.common_sub_strings import terminal_overlap
# Read the plasmid
vector: Dseqrecord = parse('pREP42-MCS+.gb')[0]
# Read the gene we want to clone
genomic_dna: Dseqrecord = parse('ase1.gb')[0]
# Select the ase1 CDS from the feature
ase1_feature = next(
f for f in genomic_dna.features if (f.type == 'CDS' and 'gene' in f.qualifiers and 'ase1' in f.qualifiers['gene'])
)
# Select the entire plasmid, shifted so that the end of the promoter is at the origin
promoter_feature = next(
f
for f in vector.features
if f.type == 'promoter' and 'label' in f.qualifiers and 'nmt1 P41 promoter' in f.qualifiers['label']
)
vector_shifted = vector.shifted(promoter_feature.location.end)
# Design primers (without overhangs)
vector_amplicon_pre = primer_design(vector_shifted)
# here we use indexing instead of feature extract to keep the features
ase1_amplicon_pre = primer_design(genomic_dna[ase1_feature.location.start : ase1_feature.location.end])
# Design assembly primers (same as previous, but with overhangs for Gibson assembly)
# We include the vector twice to create a circular recombination
asm: list[Amplicon] = assembly_fragments([vector_amplicon_pre, ase1_amplicon_pre, vector_amplicon_pre])
# For the vector, we use the forward primer of the last amplicon and the reverse primer of the first amplicon
print('vector fwd:', asm[2].primers()[0].seq)
print('vector rev:', asm[0].primers()[1].seq)
print()
vector_amplicon = pcr(asm[2].primers()[0], asm[0].primers()[1], vector_shifted)
# This should give the same result but it doesn't, probably a bug
# in PCR
# vector_amplicon = pcr(asm[2].primers()[0], asm[0].primers()[1], vector)
# For the ase1, we keep it as is
print('ase1 fwd:', asm[1].primers()[0].seq)
print('ase1 rev:', asm[1].primers()[1].seq)
print()
ase1_amplicon = asm[1]
# Do the Gibson assembly
asm = Assembly([vector_amplicon, ase1_amplicon], algorithm=terminal_overlap)
# The ase1 CDS feature is misanotated here, see issue
product = asm.assemble_circular()[0]
print(product) Several features become misannotated:
You can see it in the image: And this is how the pnmt1 and primer annotations look:
|
As a workaround for my initial problem I used this to get around the issue. I had three fragments assembling in a linear fashion, and I imagine it's exclusively useful for that, but maybe it'll be informative anyway
|
I've noticed in certain cases the Assembly.assemble_linear() method can produce misannotated sequences, specifically it seems to shift the annotations along the sequence, so they have the same length, same name, but correspond to different basepairs.
The test case below should reproduce the error, its based on a modified version of the original assemble_linear() test case you have in pydna/tests.py
The difference in test is that a) annotations are added, and b) an extra base is added to Dseqrecord c
If the first G is removed from Dseqrecord c, the test passes. Introducing extra bases to 5' or 3' ends of Dseqrecords a & b does not produce an error.
I've also implemented it as a colab notebook with the original test case, and a case where the assembled sequence is annotated correctly
https://colab.research.google.com/drive/1akdSdrVGu7w5mD2jd7HJ-hD17J4zApJj?usp=sharing
Thank you for creating such a brilliant (and beautifully written) package
The text was updated successfully, but these errors were encountered: