Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linear assembly producing misannotated sequences #136

Open
JamesBagley opened this issue Aug 22, 2023 · 3 comments
Open

Linear assembly producing misannotated sequences #136

JamesBagley opened this issue Aug 22, 2023 · 3 comments

Comments

@JamesBagley
Copy link

JamesBagley commented Aug 22, 2023

I've noticed in certain cases the Assembly.assemble_linear() method can produce misannotated sequences, specifically it seems to shift the annotations along the sequence, so they have the same length, same name, but correspond to different basepairs.

The test case below should reproduce the error, its based on a modified version of the original assemble_linear() test case you have in pydna/tests.py

The difference in test is that a) annotations are added, and b) an extra base is added to Dseqrecord c

If the first G is removed from Dseqrecord c, the test passes. Introducing extra bases to 5' or 3' ends of Dseqrecords a & b does not produce an error.

def test_linear_with_annotations2(monkeypatch):
    from pydna._pretty import pretty_str
    from pydna.assembly import Assembly
    from pydna.dseqrecord import Dseqrecord

    a = Dseqrecord("acgatgctatactgtgCCNCCtgtgctgtgctcta")
    a.add_feature(0,10,label='a_feat')
    a_feat_seq = a.features[0].extract(a)
    # 12345678901234
    b = Dseqrecord("tgtgctgtgctctaTTTTTTTtattctggctgtatcCCCCCC")
    b.add_feature(0,10,label='b_feat')
    b_feat_seq = b.features[0].extract(b)

    # 123456789012345
    c = Dseqrecord("GtattctggctgtatcGGGGGtacgatgctatactgtg")
    c.add_feature(0,10,label='c_feat')
    c_feat_seq = c.features[0].extract(c)

    feature_sequences = {'a_feat':a_feat_seq,
                         'b_feat':b_feat_seq,
                         'c_feat':c_feat_seq}

    a.name = "aaa"  # 1234567890123456
    b.name = "bbb"
    c.name = "ccc"
    asm = Assembly((a, b, c), limit=14)
    x = asm.assemble_linear()[0]
    #print(x.features)
    #print(x)
    answer = 'aaa|14\n    \\/\n    /\\\n    14|bbb|15\n           \\/\n           /\\\n           15|ccc'

    assert x.figure() == answer.strip()
    answer = 'acgatgctatactgtgCCNCCtgtgctgtgctcta\n                     TGTGCTGTGCTCTA\n                     tgtgctgtgctctaTTTTTTTtattctggctgtatc\n                                          TATTCTGGCTGTATC\n                                          tattctggctgtatcGGGGGtacgatgctatactgtg\n'
    assert x.detailed_figure()
    for feat in x.features:

        try:
            assert feat.extract(x).seq == feature_sequences[feat.qualifiers['label']].seq
        except(AssertionError):
            print(feat.qualifiers['label'])
            print(feat.extract(x).seq, 'extracted feat')
            print(feature_sequences[feat.qualifiers['label']].seq, 'original feat')
            assert feat.extract(x).seq == feature_sequences[feat.qualifiers['label']].seq
test_linear_with_annotations2('')

I've also implemented it as a colab notebook with the original test case, and a case where the assembled sequence is annotated correctly
https://colab.research.google.com/drive/1akdSdrVGu7w5mD2jd7HJ-hD17J4zApJj?usp=sharing

Thank you for creating such a brilliant (and beautifully written) package

@BjornFJohansson
Copy link
Owner

Hi, and thanks for the positive review. Thank you for taking the time to report this and making pydna better.
I think I have solved this bug and a new alpha version will be available shortly.

@manulera
Copy link
Collaborator

manulera commented Sep 13, 2024

Hi @BjornFJohansson I think I had mentioned this somewhere, but I could not find where. This keeps happenning, but I won't fix it here since the new assembly implementation does not have that problem. I was preparing a minimal example for #255 and I came across this error, and a similar one in the PCR module (#262).

In the example below, I amplify a fragment from pombe's genome containing the gene ase1, and clone it into an expression vector using Gibson assembly:

Files to reproduce: reproduce.zip

from pydna.design import assembly_fragments, primer_design
from pydna.parsers import parse
from pydna.dseqrecord import Dseqrecord
from pydna.amplicon import Amplicon
from pydna.amplify import pcr
from pydna.assembly import Assembly
from pydna.common_sub_strings import terminal_overlap

# Read the plasmid
vector: Dseqrecord = parse('pREP42-MCS+.gb')[0]

# Read the gene we want to clone
genomic_dna: Dseqrecord = parse('ase1.gb')[0]

# Select the ase1 CDS from the feature
ase1_feature = next(
    f for f in genomic_dna.features if (f.type == 'CDS' and 'gene' in f.qualifiers and 'ase1' in f.qualifiers['gene'])
)

# Select the entire plasmid, shifted so that the end of the promoter is at the origin
promoter_feature = next(
    f
    for f in vector.features
    if f.type == 'promoter' and 'label' in f.qualifiers and 'nmt1 P41 promoter' in f.qualifiers['label']
)
vector_shifted = vector.shifted(promoter_feature.location.end)

# Design primers (without overhangs)
vector_amplicon_pre = primer_design(vector_shifted)
# here we use indexing instead of feature extract to keep the features
ase1_amplicon_pre = primer_design(genomic_dna[ase1_feature.location.start : ase1_feature.location.end])

# Design assembly primers (same as previous, but with overhangs for Gibson assembly)
# We include the vector twice to create a circular recombination
asm: list[Amplicon] = assembly_fragments([vector_amplicon_pre, ase1_amplicon_pre, vector_amplicon_pre])


# For the vector, we use the forward primer of the last amplicon and the reverse primer of the first amplicon
print('vector fwd:', asm[2].primers()[0].seq)
print('vector rev:', asm[0].primers()[1].seq)
print()
vector_amplicon = pcr(asm[2].primers()[0], asm[0].primers()[1], vector_shifted)

# This should give the same result but it doesn't, probably a bug
# in PCR
# vector_amplicon = pcr(asm[2].primers()[0], asm[0].primers()[1], vector)

# For the ase1, we keep it as is
print('ase1 fwd:', asm[1].primers()[0].seq)
print('ase1 rev:', asm[1].primers()[1].seq)
print()
ase1_amplicon = asm[1]

# Do the Gibson assembly
asm = Assembly([vector_amplicon, ase1_amplicon], algorithm=terminal_overlap)

# The ase1 CDS feature is misanotated here, see issue
product = asm.assemble_circular()[0]

print(product)

Several features become misannotated:

  • The CDS of ase1 (shifted by 18, the homology region of the Gibson assembly)
  • The promoter (it ends at position 1, but should end just before the origin)
  • A primer (it ends at position 1, but should end just before the origin)

You can see it in the image:

Screenshot 2024-09-13 at 13 37 20

And this is how the pnmt1 and primer annotations look:

     promoter        join(7162..8315,1)
                     /label="nmt1 P41 promoter"
                     /note="mutant nmt1 promoter from Schizosaccharomyces pombe,
                     conferring medium strength thiamine-repressible expression"
     primer_bind     complement(join(8294..8315,1))
                     /label="r8315"
                     /PCR_conditions="primer
                     sequence:CATCATTACTGTTTGCATTGATTTAACAAAGCGACTATAAG"
                     /ApEinfo_fwdcolor="#baffa3"
                     /ApEinfo_revcolor="#ffbaba"

@JamesBagley
Copy link
Author

As a workaround for my initial problem I used this to get around the issue. I had three fragments assembling in a linear fashion, and I imagine it's exclusively useful for that, but maybe it'll be informative anyway

repaired_fragment = Assembly(fragments).assemble_linear()[0]
switches = list(map(SeqRecord, list(repaired_fragment.graph.edges.items())[1][0]))
      UP_trimmed = UP[:UP.find(switches[0])+len(switches[0])]
      DW_trimmed = DW[DW.find(switches[1]):]
      donor_trimmed = donor[donor.find(switches[0]):donor.find(switches[1])+len(switches[1])]
      fragments = [UP_trimmed] + [donor_trimmed] + [DW_trimmed]
      repaired_fragment = Assembly(fragments).assemble_linear()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants