Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/dssp #4304

Merged
merged 86 commits into from
Jun 18, 2024
Merged

Feature/dssp #4304

merged 86 commits into from
Jun 18, 2024

Conversation

marinegor
Copy link
Contributor

@marinegor marinegor commented Sep 29, 2023

Fixes #1612

Changes made in this Pull Request:

  • introduces MDAnalysis.analysis.dssp.DSSP class for secondary structure analysis, using code implemented in pydssp package available for secondary structure annotation
  • adds tests from pydssp package

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

Developers certificate of origin


📚 Documentation preview 📚: https://mdanalysis--4304.org.readthedocs.build/en/4304/

@pep8speaks
Copy link

pep8speaks commented Sep 29, 2023

Hello @marinegor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 2:80: E501 line too long (90 > 79 characters)
Line 14:80: E501 line too long (87 > 79 characters)
Line 16:80: E501 line too long (115 > 79 characters)
Line 27:80: E501 line too long (83 > 79 characters)
Line 46:80: E501 line too long (80 > 79 characters)
Line 62:80: E501 line too long (80 > 79 characters)
Line 71:80: E501 line too long (84 > 79 characters)
Line 78:80: E501 line too long (83 > 79 characters)
Line 80:80: E501 line too long (84 > 79 characters)
Line 82:80: E501 line too long (80 > 79 characters)
Line 115:1: W293 blank line contains whitespace
Line 116:59: W291 trailing whitespace
Line 117:80: E501 line too long (85 > 79 characters)
Line 118:80: E501 line too long (81 > 79 characters)
Line 118:82: W291 trailing whitespace
Line 119:80: E501 line too long (84 > 79 characters)
Line 120:80: E501 line too long (83 > 79 characters)
Line 123:1: W293 blank line contains whitespace
Line 125:80: E501 line too long (86 > 79 characters)
Line 125:87: W291 trailing whitespace
Line 127:80: E501 line too long (84 > 79 characters)
Line 127:85: W291 trailing whitespace
Line 128:80: E501 line too long (80 > 79 characters)
Line 128:81: W291 trailing whitespace
Line 132:1: W293 blank line contains whitespace
Line 135:1: W293 blank line contains whitespace
Line 136:80: E501 line too long (85 > 79 characters)
Line 137:80: E501 line too long (81 > 79 characters)
Line 199:75: W291 trailing whitespace
Line 228:80: E501 line too long (80 > 79 characters)
Line 232:78: W291 trailing whitespace
Line 233:80: W291 trailing whitespace
Line 234:80: E501 line too long (83 > 79 characters)
Line 266:72: W291 trailing whitespace
Line 273:80: E501 line too long (83 > 79 characters)
Line 299:80: E501 line too long (86 > 79 characters)
Line 305:80: E501 line too long (82 > 79 characters)
Line 312:80: E501 line too long (81 > 79 characters)
Line 312:82: W291 trailing whitespace
Line 313:80: E501 line too long (81 > 79 characters)
Line 322:80: E501 line too long (91 > 79 characters)
Line 323:80: E501 line too long (87 > 79 characters)
Line 332:80: E501 line too long (91 > 79 characters)
Line 361:80: E501 line too long (81 > 79 characters)
Line 392:80: E501 line too long (87 > 79 characters)

Line 68:80: E501 line too long (85 > 79 characters)
Line 157:80: E501 line too long (84 > 79 characters)
Line 179:80: E501 line too long (83 > 79 characters)
Line 206:80: E501 line too long (80 > 79 characters)
Line 207:80: E501 line too long (81 > 79 characters)

Line 10:80: E501 line too long (81 > 79 characters)
Line 11:80: E501 line too long (85 > 79 characters)
Line 12:80: E501 line too long (82 > 79 characters)
Line 30:80: E501 line too long (80 > 79 characters)
Line 42:80: E501 line too long (80 > 79 characters)
Line 52:80: E501 line too long (80 > 79 characters)
Line 55:80: E501 line too long (82 > 79 characters)

Comment last updated at 2024-06-13 05:15:32 UTC

@github-actions
Copy link

github-actions bot commented Sep 29, 2023

Linter Bot Results:

Hi @marinegor! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location Outcome
main package ⚠️ Possible failure
testsuite ⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/9494112403/job/26163914658


Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

@marinegor
Copy link
Contributor Author

marinegor commented Sep 29, 2023

There's a problem I don't know the workaround for yet: prolines.

In the original pydssp following happens:

  • atoms get parsed from pdb regardless of residue type
  • hydrogens get guilt automatically
  • based on that, secondary structure is assigned

However, in my implementation all prolines (and N-terminal residue) are assigned '-' (=unstructured loop).

This is likely less correct than original implementation, but I'm not sure how to fix that.
My suggestion would be to:

  • if guess_hydrogens=False, use positions of existing hydrogens and fill in fake proline hydrogens with pydssp
  • if guess_hydrogens=True, default to pydssp behaviour with fully automatic hydrogens

@marinegor
Copy link
Contributor Author

The prolines problem is now dealt with, exactly as I described above -- if guess_hydrogens=False, it would take hydrogens from atoms that have them, and otherwise (=prolines & N-term) guess their positions. Otherwise, would guess all the positions.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks already very promising.

For an initial quick review please see inline comments.

Additional requests

  • Fix the tests: AttributeError: module 'MDAnalysisTests.datafiles' has no attribute 'DSSP'
  • We also need a test with a trajectory.
  • Compress the PDF files with bz2 or gz.
  • Update CHANGELOG.
  • Add a .. versionadded:: 2.7.0 to the docs.
  • Fix PEP8 complaints.

Initially distinguishing H,E,- is good but eventually it would be good to distinguish other secondary structures, too, in particular 310 and π helices.

package/MDAnalysis/analysis/dssp.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/dssp.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/dssp.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/dssp.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/dssp.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/dssp.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/dssp.py Outdated Show resolved Hide resolved
Comment on lines 152 to 153
h3 = h3 * ~np.roll(helix4, -1, 1) * ~helix4 # helix4 is higher prioritized
h5 = h5 * ~np.roll(helix4, -1, 1) * ~helix4 # helix4 is higher prioritized
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get 3_10 and π helix, too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain, but seems like yes (used this paper's Fig 1):

from MDAnalysis.analysis.dssp import DSSP
import MDAnalysis as mda

u = mda.Universe('./1FUR.pdb')
r = DSSP(u).run()

r.results.resids[151:158]
# array([155, 156, 157, 158, 159, 160, 161])

''.join(r.results.dssp[0])[151:158]
# 'HHHH-HH'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can detect 3-10 and pi helixes indeed -- I generated idealized structures with mmtbx from here and ran DSSP on them:

dssp/o_beta_seq.pdb           --------------------
dssp/o_helix310_seq.pdb    -HHHHHHHHHHHHHHHHHH-
dssp/o_helix_seq.pdb          -HHHHHHHHHHHHHHHHHH-
o_helix_pi_seq.pdb              -HHHHHHHHHHHHHHHHHH-

package/MDAnalysis/analysis/dssp.py Outdated Show resolved Hide resolved
testsuite/MDAnalysisTests/analysis/test_dssp.py Outdated Show resolved Hide resolved
@orbeckst orbeckst self-assigned this Oct 10, 2023
@orbeckst orbeckst linked an issue Oct 10, 2023 that may be closed by this pull request
@orbeckst orbeckst added the hackathon part of a MDAnalysis coding event label Oct 10, 2023
test_dssp.py Outdated Show resolved Hide resolved
manual test a top level, should not be in source
@orbeckst
Copy link
Member

orbeckst commented May 3, 2024

@IAlibay would you mind taking a quick final glance at the DSSP PR?

I think your comments were addressed but would prefer you confirm instead of me dismissing a totally valid review.

Would be great if we could finally get that one in.

@marinegor
Copy link
Contributor Author

@IAlibay any comments on this one?

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the long review delay.

I just have the one docstring confusion, once addressed then it's good to go.

Tagging @orbeckst here - please do dismiss my review if this gets addressed and I don't get a chance to re-review.

package/MDAnalysis/analysis/dssp/dssp.py Outdated Show resolved Hide resolved
@marinegor
Copy link
Contributor Author

Hi @IAlibay -- couldn't follow your link, maybe you explain in more detail? I updated class's documentation slightly to match the description of the run method, but not sure if it solves your confusion

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marinegor and @IAlibay I think I understand the point of your discussion: As far as I can tell, I agree with @IAlibay that the current ValueError check does not exactly check that there's exactly one HN. If we could change the code then that would be safer, please see comments.

I think I also found a potential issue with how _heavy_atoms and _hydrogens are constructed. Please check.

package/MDAnalysis/analysis/dssp/dssp.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/dssp/dssp.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/dssp/dssp.py Outdated Show resolved Hide resolved
Comment on lines +278 to +289
self._heavy_atoms: dict[str, "AtomGroup"] = {
t: ag.atoms[
np.isin(
ag.names, t.split()
) # need split() since `np.isin` takes an iterable as second argument
# and "N".split() -> ["N"]
]
for t in heavyatom_names
}
self._hydrogens: list["AtomGroup"] = [
res.atoms.select_atoms(f"name {hydrogen_name}") for res in ag.residues
]
Copy link
Member

@orbeckst orbeckst May 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the following code I see that there's the assumption that the order of atoms in _heavy_atoms["CA"] is the same as in _hydrogens.

However, that may not be true: _hydrogens is created from a select_atoms() which always sorts the atoms by index in increasing order whereas ag's in _heavy_atoms are created by look up and slicing. If the original input ag is NOT ORDERED then _hydrogens will be ORDERED whereas _heavy_atoms remain in the CUSTOM ORDER of the original ag. At least that's what it looks like to me.

It may be safer to just sort the original ag first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be safer to just sort the original ag first.

but there's line 274 that says:

ag: AtomGroup = atoms.select_atoms("protein")

and then all operations are on this ag. I even think that you suggested this, but I'm not entirely sure :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right: ag is guaranteed to be ordered.

Maybe add a comment somewhere that the code is guaranteed to work because ag is ordered. Just in case someone later wants to make it more general (eg so that one doesn't have to rely on MDAnalysis's definition of "protein").

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment on Dr. Richard Hipp's obsession with code comments https://discord.com/channels/807348386012987462/807712498249236520/1237819983833337917

@orbeckst
Copy link
Member

@marinegor do you have any question regarding my last review? Please ping me if you need feedback or want to discuss anything.

Egor Marin and others added 2 commits June 12, 2024 21:46
@marinegor
Copy link
Contributor Author

@orbeckst sorry for some silence -- I don't have any questions, and also have added few lines that improve the guess-hydrogens check, following your suggestions.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy! Great work, @marinegor !

Comment on lines +278 to +289
self._heavy_atoms: dict[str, "AtomGroup"] = {
t: ag.atoms[
np.isin(
ag.names, t.split()
) # need split() since `np.isin` takes an iterable as second argument
# and "N".split() -> ["N"]
]
for t in heavyatom_names
}
self._hydrogens: list["AtomGroup"] = [
res.atoms.select_atoms(f"name {hydrogen_name}") for res in ag.residues
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right: ag is guaranteed to be ordered.

Maybe add a comment somewhere that the code is guaranteed to work because ag is ordered. Just in case someone later wants to make it more general (eg so that one doesn't have to rely on MDAnalysis's definition of "protein").

Comment on lines +278 to +289
self._heavy_atoms: dict[str, "AtomGroup"] = {
t: ag.atoms[
np.isin(
ag.names, t.split()
) # need split() since `np.isin` takes an iterable as second argument
# and "N".split() -> ["N"]
]
for t in heavyatom_names
}
self._hydrogens: list["AtomGroup"] = [
res.atoms.select_atoms(f"name {hydrogen_name}") for res in ag.residues
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment on Dr. Richard Hipp's obsession with code comments https://discord.com/channels/807348386012987462/807712498249236520/1237819983833337917

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(just adding doc fixes) still ✅

package/MDAnalysis/analysis/dssp/dssp.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/dssp/dssp.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/dssp/dssp.py Outdated Show resolved Hide resolved
@orbeckst
Copy link
Member

@IAlibay I resolved most of the comments that you had because @marinegor had addressed them. There are only three or so for you to quickly look at. Would be great to get it shipped once you are satisfied. Cheers!

orbeckst added 3 commits June 12, 2024 23:34
- corrected docs for DSSP.results.dssp_ndarray and results.dssp
- added docs for DSSP.results.resids
- more reST fixes and updates
@orbeckst orbeckst merged commit d2d9d27 into MDAnalysis:develop Jun 18, 2024
22 of 23 checks passed
@orbeckst
Copy link
Member

Congratulations @marinegor , a big effort merged and one of the oldest outstanding issues closed! Well done. Thank you for all your work!

@marinegor
Copy link
Contributor Author

marinegor commented Jun 18, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component-Analysis enhancement hackathon part of a MDAnalysis coding event
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for secondary structure identification
6 participants