Hi @BjornFJohansson, here is an alternative implementation of the cutting functionality that follows what was mentioned in #157.
It's a big one, so no problem if you take a while to review it. It significantly reduces the amount of code: the total line count goes up, but I have only removed code and added quite a few comments and docstrings. The new implementation uses more of the built-in biopython functionality, which currently supports searching for cutsites in circular molecules.
The function `cut` in `Dseq` is now split into three functions:

`get_cutsites` finds cutsites in the sequence, returned as a list of `tuple[tuple[int,int], _RestrictionType]`, sorted by where they cut on the 5' strand. For a given cutsite, e.g. `[(3, 7), EcoRI]`: `enzyme.search() - 1` in biopython). This is a convenient representation, and you can see why in the function `apply_cut`, where two such cuts are passed as inputs (ignore the two `if xxxx is not None` for now).

`get_cutsite_pairs` pairs the cutsites two by two to give the edges of the resulting fragments.
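
To illustrate the pairing step, here is a minimal sketch and not the code in this PR. It assumes enzymes are plain strings instead of biopython restriction types, that `None` marks the ends of a linear molecule, and that on a circular molecule the last cut is paired with the first one. The name `pair_cutsites` is hypothetical.

```python
from typing import List, Optional, Tuple

# Stand-in for the cutsite representation described above; the enzyme is a
# plain string here rather than a biopython restriction type.
CutSite = Tuple[Tuple[int, int], str]
CutSitePair = Tuple[Optional[CutSite], Optional[CutSite]]


def pair_cutsites(cutsites: List[CutSite], circular: bool) -> List[CutSitePair]:
    """Pair consecutive cutsites so that each pair delimits one fragment."""
    if not cutsites:
        return []
    if circular:
        # Assumption: on a circular molecule the last cut pairs with the first.
        padded = [*cutsites, cutsites[0]]
    else:
        # None marks the two ends of a linear molecule.
        padded = [None, *cutsites, None]
    # zip(padded, padded[1:]) yields the same pairs as itertools.pairwise
    # (Python 3.10+), without needing an extra dependency.
    return list(zip(padded, padded[1:]))


# Made-up cut positions, for illustration only.
cuts = [((3, 7), "EcoRI"), ((20, 24), "EcoRI")]
linear_pairs = pair_cutsites(cuts, circular=False)
circular_pairs = pair_cutsites(cuts, circular=True)
```

With two cuts, the linear case gives three pairs (three fragments) and the circular case gives two.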

`apply_cut` extracts a fragment from a sequence based on a pair of cuts. The code is above, and you can now see the reason for the case where the enzyme is set to `None` (a special case for the edges of a linear molecule).

Extra things / thoughts
- The `pos` property of `Dseq` could now be removed.
- `CutSite` could be made into a class with properties `cut_watson`, `cut_crick`, `enzyme` (would probably give more clarity).
- The pairs at the edges could simply be `(None, ((3, 7), EcoRI))` instead of `(((0, 0), None), ((3, 7), EcoRI))`, perhaps clearer. I ended up doing this, which I think is a bit clearer, with 0edcd20.
- The methods can be renamed, perhaps `cut_fragment` would be clearer?
- I have added the dependency of `more_itertools`, for the function `pairwise`. This comes with the standard `itertools` in Python 3.10, but not in older versions. Not anymore (bf8aee2).

Back compatibility
The only problem is that the cuts are returned in the same order regardless of the order of the input enzymes. I think this is the preferable behaviour, but I could make it backward-compatible. I have modified some tests so that they test for the new behaviour: see `test_module_dseqrecord.py`, the line that says `@pytest.mark.xfail(reason="issue #78")`, and the lines in the test files that start with `# TODO:`, which you can easily find in the diff page of the PR.
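
To show why the output order no longer depends on the order of the input enzymes, here is a small sketch under stated assumptions. It is not the code in this PR: `CutSite` is a hypothetical named tuple along the lines suggested in the extra thoughts, enzymes are plain strings, and `find_cutsites` reads made-up positions from a dictionary instead of searching a sequence.

```python
from typing import Dict, List, NamedTuple, Tuple


class CutSite(NamedTuple):
    """Hypothetical named version of the ((cut_watson, cut_crick), enzyme) tuple."""

    cut_watson: int
    cut_crick: int
    enzyme: str  # plain string standing in for a biopython restriction type


def find_cutsites(
    sites_by_enzyme: Dict[str, List[Tuple[int, int]]], enzymes: List[str]
) -> List[CutSite]:
    """Collect the cutsites of the given enzymes, sorted by watson cut position."""
    found = [
        CutSite(watson, crick, name)
        for name in enzymes
        for (watson, crick) in sites_by_enzyme[name]
    ]
    # Sorting by position makes the result independent of the enzyme order.
    return sorted(found, key=lambda site: site.cut_watson)


# Made-up cut positions, for illustration only.
sites = {"EcoRI": [(3, 7)], "BamHI": [(20, 24)]}
order_a = find_cutsites(sites, ["EcoRI", "BamHI"])
order_b = find_cutsites(sites, ["BamHI", "EcoRI"])
```

Both calls return the same list, because the cuts are sorted by position after being collected.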