Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 157 #163

Merged
merged 11 commits into from
Dec 8, 2023
Merged

Issue 157 #163

merged 11 commits into from
Dec 8, 2023

Conversation

manulera
Copy link
Collaborator

@manulera manulera commented Nov 24, 2023

Hi @BjornFJohansson here is an alternative implementation of the cutting functionality that follows what was mentioned in #157.

It's a big one, so no problem if you take a while to review it. It decreases significantly the lines of code. The difference in total lines is a net positive, but I have removed only code, and added quite a bit of comments / docstrings. The current implementation uses more of the built-in biopython functionality, which currently supports searching for cutsites in circular molecules.

The function cut in Dseq is now splitted into three functions:

cutsites = self.get_cutsites(*enzymes)
cutsite_pairs = self.get_cutsite_pairs(cutsites)
return tuple(self.apply_cut(*cs) for cs in cutsite_pairs)

get_cutsites

get_cutsites finds cutsites in the sequence, returned as a list of tuple[tuple[int,int], _RestrictionType], sorted by where they cut on the 5' strand.

For a given cutsite, e.g. [(3, 7), EcoRI]:

  • The first position in the tuple (3) is the position where the enzyme will cut in the watson strand(same meaning as enzyme.search() - 1 in biopython)
  • The second position in the tuple (7) is the position where the enzyme will cut in the crick strand.
  • EcoRI is the enzyme python object
>>> from Bio.Restriction import EcoRI
>>> from pydna.dseq import Dseq
>>> seq = Dseq('AAGAATTCAAGAATTC')
>>> seq.get_cutsites(EcoRI)
[((3, 7), EcoRI), ((11, 15), EcoRI)]

This is a convenient representation, and you can see why in the function apply_cut, where two such cuts are passed as inputs (ignore the twoif xxxx is not None for now).

def apply_cut(self, left_cut, right_cut):

        left_watson, left_crick = left_cut[0] if left_cut is not None else ((self.ovhg, 0) if self.ovhg > 0 else (0, -self.ovhg))
        ovhg = self.ovhg if left_cut is None else left_cut[1].ovhg
        right_watson, right_crick = right_cut[0] if right_cut is not None else (len(self.watson), len(self.crick))
        return Dseq(
                    str(self[left_watson:right_watson]),
                    # The line below could be easier to understand as _rc(str(self[left_crick:right_crick])), but it does not preserve the case
                    str(self.reverse_complement()[len(self) - right_crick:len(self) - left_crick]),
                    ovhg=ovhg,
                )

get_cutsite_pairs

This pairs the cutsites 2 by 2 to render the edges of the resulting fragments.

def get_cutsite_pairs(self, cutsites):
        """ Pairs the cutsites 2 by 2 to render the edges of the resulting fragments.

        Special cases:
        - Single cutsite on circular sequence: returns a pair where both cutsites are the same
        - Linear sequence:
            - creates a new left_cut on the first pair equal to `None` to represent the left edge of the sequence as it is.
            - creates a new right_cut on the last pair equal to `None` to represent the right edge of the sequence as it is.
            - In both new cuts, the enzyme is set to None to indicate that the cut is not made by an enzyme.

        Parameters
        ----------
        cutsites : list[tuple[tuple[int,int], _RestrictionType]]

        Returns
        -------
        list[tuple[tuple[tuple[int,int], _RestrictionType]|None],tuple[tuple[int,int], _RestrictionType]|None]

        Examples
        --------

        >>> from Bio.Restriction import EcoRI
        >>> from pydna.dseq import Dseq
        >>> seq = Dseq('AAGAATTCAAGAATTC')
        >>> seq.get_cutsite_pairs(seq.get_cutsites(EcoRI))
        [(None, ((3, 7), EcoRI)), (((3, 7), EcoRI), ((11, 15), EcoRI)), (((11, 15), EcoRI), None)]
        >>> seq = Dseq('AAGAATTCAAGAATTC', circular=True)
        >>> seq.get_cutsite_pairs(seq.get_cutsites(EcoRI))
        [(((3, 7), EcoRI), ((11, 15), EcoRI)), (((11, 15), EcoRI), ((3, 7), EcoRI))]
        >>> seq = Dseq('AAGAATTCAA', circular=True)
        >>> seq.get_cutsite_pairs(seq.get_cutsites(EcoRI))
        [(((3, 7), EcoRI), ((3, 7), EcoRI))]
        """
        if len(cutsites) == 0:
            return []
        if not self.circular:
            cutsites = [None, *cutsites, None]
        else:
            # Add the first cutsite at the end, for circular cuts
            cutsites.append(cutsites[0])

        return list(zip(cutsites, cutsites[1:]))

apply_cut

Extracts a fragment from a sequence based on a pair of cuts, the code is above, and you can see now the case for when the enzyme is set to None (special case for the edges of a linear molecule).

Extra things / thoughts

  • The pos property of Dseq could be now removed

  • CutSite could be made into a class with properties cut_watson, cut_crick, enzyme (would probably give more clarity).

  • The pairs at the edges could simply be (None, ((3, 7), EcoRI)) instead of (((0, 0), None), ((3, 7), EcoRI)), perhaps clearer. I ended up doing this, which I think is a bit clearer. With 0edcd20

  • The methods can be renamed, perhaps cut_fragment would be more clear?

  • I have added the dependency of more_itertools, for the function pairwise. This come with normal itertools in python 3.10, but not older versions. Not anymore (bf8aee2)

Back compatibility

The only problem is that the cuts are returned in the same order regardless of the order of the input enzymes. I think this is a preferable behaviour, but I could make it back-compatible. I have modified some tests so that they test for the new behaviour, see test_module_dseqrecord.py, the line that says @pytest.mark.xfail(reason="issue #78"), and the lines in the test files that start with # TODO:, you can easily find them in the page of the diff of the PR.

@manulera manulera changed the base branch from master to develop November 24, 2023 18:13
Copy link

codecov bot commented Nov 24, 2023

Codecov Report

Merging #163 (31f36de) into develop (5e7c5dc) will increase coverage by 0.01%.
The diff coverage is 95.23%.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #163      +/-   ##
===========================================
+ Coverage    93.94%   93.95%   +0.01%     
===========================================
  Files           36       36              
  Lines         3617     3575      -42     
  Branches       560      543      -17     
===========================================
- Hits          3398     3359      -39     
  Misses         183      183              
+ Partials        36       33       -3     
Files Coverage Δ
src/pydna/dseq.py 96.20% <100.00%> (+0.86%) ⬆️
src/pydna/dseqrecord.py 95.59% <88.88%> (-0.51%) ⬇️

@BjornFJohansson BjornFJohansson changed the base branch from develop to cutsite_pairs December 8, 2023 10:32
@BjornFJohansson BjornFJohansson merged commit ffc55af into cutsite_pairs Dec 8, 2023
26 of 32 checks passed
@manulera manulera deleted the issue_157 branch February 6, 2024 08:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants