Linear assembly producing misannotated sequences #136

JamesBagley · 2023-08-22T21:26:08Z

I've noticed in certain cases the Assembly.assemble_linear() method can produce misannotated sequences, specifically it seems to shift the annotations along the sequence, so they have the same length, same name, but correspond to different basepairs.

The test case below should reproduce the error, its based on a modified version of the original assemble_linear() test case you have in pydna/tests.py

The difference in test is that a) annotations are added, and b) an extra base is added to Dseqrecord c

If the first G is removed from Dseqrecord c, the test passes. Introducing extra bases to 5' or 3' ends of Dseqrecords a & b does not produce an error.

def test_linear_with_annotations2(monkeypatch):
    from pydna._pretty import pretty_str
    from pydna.assembly import Assembly
    from pydna.dseqrecord import Dseqrecord

    a = Dseqrecord("acgatgctatactgtgCCNCCtgtgctgtgctcta")
    a.add_feature(0,10,label='a_feat')
    a_feat_seq = a.features[0].extract(a)
    # 12345678901234
    b = Dseqrecord("tgtgctgtgctctaTTTTTTTtattctggctgtatcCCCCCC")
    b.add_feature(0,10,label='b_feat')
    b_feat_seq = b.features[0].extract(b)

    # 123456789012345
    c = Dseqrecord("GtattctggctgtatcGGGGGtacgatgctatactgtg")
    c.add_feature(0,10,label='c_feat')
    c_feat_seq = c.features[0].extract(c)

    feature_sequences = {'a_feat':a_feat_seq,
                         'b_feat':b_feat_seq,
                         'c_feat':c_feat_seq}

    a.name = "aaa"  # 1234567890123456
    b.name = "bbb"
    c.name = "ccc"
    asm = Assembly((a, b, c), limit=14)
    x = asm.assemble_linear()[0]
    #print(x.features)
    #print(x)
    answer = 'aaa|14\n    \\/\n    /\\\n    14|bbb|15\n           \\/\n           /\\\n           15|ccc'

    assert x.figure() == answer.strip()
    answer = 'acgatgctatactgtgCCNCCtgtgctgtgctcta\n                     TGTGCTGTGCTCTA\n                     tgtgctgtgctctaTTTTTTTtattctggctgtatc\n                                          TATTCTGGCTGTATC\n                                          tattctggctgtatcGGGGGtacgatgctatactgtg\n'
    assert x.detailed_figure()
    for feat in x.features:

        try:
            assert feat.extract(x).seq == feature_sequences[feat.qualifiers['label']].seq
        except(AssertionError):
            print(feat.qualifiers['label'])
            print(feat.extract(x).seq, 'extracted feat')
            print(feature_sequences[feat.qualifiers['label']].seq, 'original feat')
            assert feat.extract(x).seq == feature_sequences[feat.qualifiers['label']].seq
test_linear_with_annotations2('')

I've also implemented it as a colab notebook with the original test case, and a case where the assembled sequence is annotated correctly
https://colab.research.google.com/drive/1akdSdrVGu7w5mD2jd7HJ-hD17J4zApJj?usp=sharing

Thank you for creating such a brilliant (and beautifully written) package

BjornFJohansson · 2023-10-17T16:48:30Z

Hi, and thanks for the positive review. Thank you for taking the time to report this and making pydna better.
I think I have solved this bug and a new alpha version will be available shortly.

manulera · 2024-09-13T11:38:47Z

Hi @BjornFJohansson I think I had mentioned this somewhere, but I could not find where. This keeps happenning, but I won't fix it here since the new assembly implementation does not have that problem. I was preparing a minimal example for #255 and I came across this error, and a similar one in the PCR module (#262).

In the example below, I amplify a fragment from pombe's genome containing the gene ase1, and clone it into an expression vector using Gibson assembly:

Files to reproduce: reproduce.zip

from pydna.design import assembly_fragments, primer_design
from pydna.parsers import parse
from pydna.dseqrecord import Dseqrecord
from pydna.amplicon import Amplicon
from pydna.amplify import pcr
from pydna.assembly import Assembly
from pydna.common_sub_strings import terminal_overlap

# Read the plasmid
vector: Dseqrecord = parse('pREP42-MCS+.gb')[0]

# Read the gene we want to clone
genomic_dna: Dseqrecord = parse('ase1.gb')[0]

# Select the ase1 CDS from the feature
ase1_feature = next(
    f for f in genomic_dna.features if (f.type == 'CDS' and 'gene' in f.qualifiers and 'ase1' in f.qualifiers['gene'])
)

# Select the entire plasmid, shifted so that the end of the promoter is at the origin
promoter_feature = next(
    f
    for f in vector.features
    if f.type == 'promoter' and 'label' in f.qualifiers and 'nmt1 P41 promoter' in f.qualifiers['label']
)
vector_shifted = vector.shifted(promoter_feature.location.end)

# Design primers (without overhangs)
vector_amplicon_pre = primer_design(vector_shifted)
# here we use indexing instead of feature extract to keep the features
ase1_amplicon_pre = primer_design(genomic_dna[ase1_feature.location.start : ase1_feature.location.end])

# Design assembly primers (same as previous, but with overhangs for Gibson assembly)
# We include the vector twice to create a circular recombination
asm: list[Amplicon] = assembly_fragments([vector_amplicon_pre, ase1_amplicon_pre, vector_amplicon_pre])


# For the vector, we use the forward primer of the last amplicon and the reverse primer of the first amplicon
print('vector fwd:', asm[2].primers()[0].seq)
print('vector rev:', asm[0].primers()[1].seq)
print()
vector_amplicon = pcr(asm[2].primers()[0], asm[0].primers()[1], vector_shifted)

# This should give the same result but it doesn't, probably a bug
# in PCR
# vector_amplicon = pcr(asm[2].primers()[0], asm[0].primers()[1], vector)

# For the ase1, we keep it as is
print('ase1 fwd:', asm[1].primers()[0].seq)
print('ase1 rev:', asm[1].primers()[1].seq)
print()
ase1_amplicon = asm[1]

# Do the Gibson assembly
asm = Assembly([vector_amplicon, ase1_amplicon], algorithm=terminal_overlap)

# The ase1 CDS feature is misanotated here, see issue
product = asm.assemble_circular()[0]

print(product)

Several features become misannotated:

The CDS of ase1 (shifted by 18, the homology region of the Gibson assembly)
The promoter (it ends at position 1, but should end just before the origin)
A primer (it ends at position 1, but should end just before the origin)

You can see it in the image:

And this is how the pnmt1 and primer annotations look:

     promoter        join(7162..8315,1)
                     /label="nmt1 P41 promoter"
                     /note="mutant nmt1 promoter from Schizosaccharomyces pombe,
                     conferring medium strength thiamine-repressible expression"
     primer_bind     complement(join(8294..8315,1))
                     /label="r8315"
                     /PCR_conditions="primer
                     sequence:CATCATTACTGTTTGCATTGATTTAACAAAGCGACTATAAG"
                     /ApEinfo_fwdcolor="#baffa3"
                     /ApEinfo_revcolor="#ffbaba"

JamesBagley · 2024-09-13T12:11:29Z

As a workaround for my initial problem I used this to get around the issue. I had three fragments assembling in a linear fashion, and I imagine it's exclusively useful for that, but maybe it'll be informative anyway

repaired_fragment = Assembly(fragments).assemble_linear()[0]
switches = list(map(SeqRecord, list(repaired_fragment.graph.edges.items())[1][0]))
      UP_trimmed = UP[:UP.find(switches[0])+len(switches[0])]
      DW_trimmed = DW[DW.find(switches[1]):]
      donor_trimmed = donor[donor.find(switches[0]):donor.find(switches[1])+len(switches[1])]
      fragments = [UP_trimmed] + [donor_trimmed] + [DW_trimmed]
      repaired_fragment = Assembly(fragments).assemble_linear()

manulera added the fixed-with-new-assembly-model label Sep 5, 2024

manulera added the bug label Sep 13, 2024

This was referenced Sep 13, 2024

Annotation lost in pcr with circular template #262

Closed

Pydna documentation Gibson Assembly Literature Example #255

Closed

manulera mentioned this issue Sep 27, 2024

Circular option in assembly_fragments #270

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Linear assembly producing misannotated sequences #136

Linear assembly producing misannotated sequences #136

JamesBagley commented Aug 22, 2023 •

edited by manulera

Loading

BjornFJohansson commented Oct 17, 2023

manulera commented Sep 13, 2024 •

edited

Loading

JamesBagley commented Sep 13, 2024

Linear assembly producing misannotated sequences #136

Linear assembly producing misannotated sequences #136

Comments

JamesBagley commented Aug 22, 2023 • edited by manulera Loading

BjornFJohansson commented Oct 17, 2023

manulera commented Sep 13, 2024 • edited Loading

JamesBagley commented Sep 13, 2024

JamesBagley commented Aug 22, 2023 •

edited by manulera

Loading

manulera commented Sep 13, 2024 •

edited

Loading