Issue 157 #163

manulera · 2023-11-24T18:13:15Z

Hi @BjornFJohansson here is an alternative implementation of the cutting functionality that follows what was mentioned in #157.

It's a big one, so no problem if you take a while to review it. It decreases significantly the lines of code. The difference in total lines is a net positive, but I have removed only code, and added quite a bit of comments / docstrings. The current implementation uses more of the built-in biopython functionality, which currently supports searching for cutsites in circular molecules.

The function cut in Dseq is now splitted into three functions:

cutsites = self.get_cutsites(*enzymes)
cutsite_pairs = self.get_cutsite_pairs(cutsites)
return tuple(self.apply_cut(*cs) for cs in cutsite_pairs)

get_cutsites

get_cutsites finds cutsites in the sequence, returned as a list of tuple[tuple[int,int], _RestrictionType], sorted by where they cut on the 5' strand.

For a given cutsite, e.g. [(3, 7), EcoRI]:

The first position in the tuple (3) is the position where the enzyme will cut in the watson strand(same meaning as enzyme.search() - 1 in biopython)
The second position in the tuple (7) is the position where the enzyme will cut in the crick strand.
EcoRI is the enzyme python object

>>> from Bio.Restriction import EcoRI
>>> from pydna.dseq import Dseq
>>> seq = Dseq('AAGAATTCAAGAATTC')
>>> seq.get_cutsites(EcoRI)
[((3, 7), EcoRI), ((11, 15), EcoRI)]

This is a convenient representation, and you can see why in the function apply_cut, where two such cuts are passed as inputs (ignore the twoif xxxx is not None for now).

def apply_cut(self, left_cut, right_cut):

        left_watson, left_crick = left_cut[0] if left_cut is not None else ((self.ovhg, 0) if self.ovhg > 0 else (0, -self.ovhg))
        ovhg = self.ovhg if left_cut is None else left_cut[1].ovhg
        right_watson, right_crick = right_cut[0] if right_cut is not None else (len(self.watson), len(self.crick))
        return Dseq(
                    str(self[left_watson:right_watson]),
                    # The line below could be easier to understand as _rc(str(self[left_crick:right_crick])), but it does not preserve the case
                    str(self.reverse_complement()[len(self) - right_crick:len(self) - left_crick]),
                    ovhg=ovhg,
                )

get_cutsite_pairs

This pairs the cutsites 2 by 2 to render the edges of the resulting fragments.

def get_cutsite_pairs(self, cutsites):
        """ Pairs the cutsites 2 by 2 to render the edges of the resulting fragments.

        Special cases:
        - Single cutsite on circular sequence: returns a pair where both cutsites are the same
        - Linear sequence:
            - creates a new left_cut on the first pair equal to `None` to represent the left edge of the sequence as it is.
            - creates a new right_cut on the last pair equal to `None` to represent the right edge of the sequence as it is.
            - In both new cuts, the enzyme is set to None to indicate that the cut is not made by an enzyme.

        Parameters
        ----------
        cutsites : list[tuple[tuple[int,int], _RestrictionType]]

        Returns
        -------
        list[tuple[tuple[tuple[int,int], _RestrictionType]|None],tuple[tuple[int,int], _RestrictionType]|None]

        Examples
        --------

        >>> from Bio.Restriction import EcoRI
        >>> from pydna.dseq import Dseq
        >>> seq = Dseq('AAGAATTCAAGAATTC')
        >>> seq.get_cutsite_pairs(seq.get_cutsites(EcoRI))
        [(None, ((3, 7), EcoRI)), (((3, 7), EcoRI), ((11, 15), EcoRI)), (((11, 15), EcoRI), None)]
        >>> seq = Dseq('AAGAATTCAAGAATTC', circular=True)
        >>> seq.get_cutsite_pairs(seq.get_cutsites(EcoRI))
        [(((3, 7), EcoRI), ((11, 15), EcoRI)), (((11, 15), EcoRI), ((3, 7), EcoRI))]
        >>> seq = Dseq('AAGAATTCAA', circular=True)
        >>> seq.get_cutsite_pairs(seq.get_cutsites(EcoRI))
        [(((3, 7), EcoRI), ((3, 7), EcoRI))]
        """
        if len(cutsites) == 0:
            return []
        if not self.circular:
            cutsites = [None, *cutsites, None]
        else:
            # Add the first cutsite at the end, for circular cuts
            cutsites.append(cutsites[0])

        return list(zip(cutsites, cutsites[1:]))

apply_cut

Extracts a fragment from a sequence based on a pair of cuts, the code is above, and you can see now the case for when the enzyme is set to None (special case for the edges of a linear molecule).

Extra things / thoughts

The pos property of Dseq could be now removed
CutSite could be made into a class with properties cut_watson, cut_crick, enzyme (would probably give more clarity).
~~The pairs at the edges could simply be (None, ((3, 7), EcoRI)) instead of (((0, 0), None), ((3, 7), EcoRI)), perhaps clearer.~~ I ended up doing this, which I think is a bit clearer. With 0edcd20
The methods can be renamed, perhaps cut_fragment would be more clear?
~~I have added the dependency of more_itertools, for the function pairwise. This come with normal itertools in python 3.10, but not older versions.~~ Not anymore (bf8aee2)

Back compatibility

The only problem is that the cuts are returned in the same order regardless of the order of the input enzymes. I think this is a preferable behaviour, but I could make it back-compatible. I have modified some tests so that they test for the new behaviour, see test_module_dseqrecord.py, the line that says @pytest.mark.xfail(reason="issue #78"), and the lines in the test files that start with # TODO:, you can easily find them in the page of the diff of the PR.

codecov · 2023-11-24T18:14:17Z

Codecov Report

Merging #163 (31f36de) into develop (5e7c5dc) will increase coverage by 0.01%.
The diff coverage is 95.23%.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #163      +/-   ##
===========================================
+ Coverage    93.94%   93.95%   +0.01%     
===========================================
  Files           36       36              
  Lines         3617     3575      -42     
  Branches       560      543      -17     
===========================================
- Hits          3398     3359      -39     
  Misses         183      183              
+ Partials        36       33       -3

Files	Coverage Δ
src/pydna/dseq.py	`96.20% <100.00%> (+0.86%)`	⬆️
src/pydna/dseqrecord.py	`95.59% <88.88%> (-0.51%)`	⬇️

manulera added 6 commits November 23, 2023 09:58

working for Dseq

765e026

WIP

2aac24e

Fixed itertools issue and modified tests , see #78

73ea44f

more WIP

3385265

closes #160

7ce1881

working

c3863c7

manulera changed the base branch from master to develop November 24, 2023 18:13

manulera added 5 commits November 24, 2023 18:16

remove old cut2

1bd4e18

extra docs

8560727

removed more_itertools deps

bf8aee2

switch to alternative representation

0edcd20

update docs

31f36de

BjornFJohansson changed the base branch from develop to cutsite_pairs December 8, 2023 10:32

BjornFJohansson merged commit ffc55af into cutsite_pairs Dec 8, 2023
26 of 32 checks passed

manulera deleted the issue_157 branch February 6, 2024 08:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue 157 #163

Issue 157 #163

manulera commented Nov 24, 2023 •

edited

Loading

codecov bot commented Nov 24, 2023 •

edited

Loading

Issue 157 #163

Issue 157 #163

Conversation

manulera commented Nov 24, 2023 • edited Loading

get_cutsites

get_cutsite_pairs

apply_cut

Extra things / thoughts

Back compatibility

codecov bot commented Nov 24, 2023 • edited Loading

Codecov Report

manulera commented Nov 24, 2023 •

edited

Loading

codecov bot commented Nov 24, 2023 •

edited

Loading