Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/dssp #4304

Merged
merged 86 commits into from
Jun 18, 2024
Merged
Show file tree
Hide file tree
Changes from 12 commits
Commits
Show all changes
86 commits
Select commit Hold shift + click to select a range
5065e2d
Add DSSP test files
marinegor Sep 29, 2023
1dfdb29
Add DSSP test files
marinegor Sep 29, 2023
f73fcda
First implementation of DSSP class
marinegor Sep 29, 2023
fe66c30
Remove pydssp dependency
marinegor Sep 29, 2023
13bc86a
Add documentation
marinegor Sep 29, 2023
cf79a89
Fix hydrogen selection issue
marinegor Sep 29, 2023
7230845
Remove unnecessary print
marinegor Sep 29, 2023
929d011
Fix problem with prolines and add example with average secondary stru…
marinegor Oct 2, 2023
7c966d5
Fix pep8 formatting issues
marinegor Oct 2, 2023
a5a4891
Fix pep8 formatting issues
marinegor Oct 2, 2023
c2a333c
Replace ndarray in results with list, and remove unnecessary code
marinegor Oct 4, 2023
bd5ba75
Add documentation for translate
marinegor Oct 4, 2023
0d6f734
Added a reference to Kabsch 983 DSSP paper
marinegor Oct 11, 2023
8840d9b
Update documentation and formatting
marinegor Nov 24, 2023
c6fca72
Make numpy docstrings
marinegor Jan 14, 2024
20e672f
Compress dssp pdb files
marinegor Jan 14, 2024
47c4123
Update documentation and add trajectory tests
marinegor Jan 14, 2024
c4cf7c9
Fix pep8 in docstrings and add pydssp license info
marinegor Jan 14, 2024
de2133d
Add versionadded
marinegor Jan 14, 2024
60e9f13
Merge remote-tracking branch 'upstream/develop' into feature/dssp
marinegor Jan 14, 2024
603d209
Update changelog
marinegor Jan 14, 2024
6044372
Fix datafiles DSSP error
marinegor Jan 14, 2024
78b5a4f
Add einops as dependency to analysis
marinegor Jan 14, 2024
2bd0af5
Move einops as required dependency
marinegor Jan 14, 2024
21a9403
Move einops to install_requires
marinegor Jan 14, 2024
a15ed16
Move einops to analysis dependencies
marinegor Jan 15, 2024
8161c55
Add skipped tests if einops not present
marinegor Jan 15, 2024
b0093fe
Fix dssp test without einops
marinegor Jan 15, 2024
f947c04
Remove _check_input function
marinegor Feb 6, 2024
3493443
Rewrite get_hbond_map function
marinegor Feb 6, 2024
9d34e49
Correct helix detection but bug in loops
marinegor Feb 6, 2024
30c958e
Remove unnecessary print
marinegor Feb 6, 2024
4c6c2a8
Fix big with loops
marinegor Feb 6, 2024
3eb3058
Implement no-einops dssp
marinegor Feb 6, 2024
a3c386a
Moved dssp into separate module and separated original pydssp and my …
marinegor Feb 6, 2024
1a4d4b8
Add pydssp as analysis dependency
marinegor Feb 6, 2024
82334c7
Update package/pyproject.toml
Feb 6, 2024
edcb713
Add suggestions from code review
marinegor Feb 6, 2024
04e33ae
Remove pydssp from pyproject.toml and add section in documentation
marinegor Feb 7, 2024
f7e4ac0
Update documentation
marinegor Feb 7, 2024
1baf1e2
Merge with develop
marinegor Feb 7, 2024
f31d6f5
Add @due for DSSP by Kabsch
marinegor Feb 7, 2024
742e19d
Update package/MDAnalysis/analysis/dssp/dssp.py
Feb 23, 2024
47ef5ad
Update package/MDAnalysis/analysis/dssp/dssp.py
Feb 23, 2024
530ca0c
Fix documentation
marinegor Feb 23, 2024
c0fdd99
Fix documentation issues
marinegor Feb 23, 2024
118be27
Fix code formatting and add coverage exceptions
marinegor Feb 23, 2024
8c33e83
Add dssp entry for documentation
marinegor Feb 23, 2024
25dae6f
Add one more no cover pragma
marinegor Feb 23, 2024
ac6f063
Remove unnecessary get_hbond_map import
marinegor Feb 23, 2024
42392f2
Move shape check to _prepare and add respective test
marinegor Feb 23, 2024
8de3630
Fix DSSP documentation issues
marinegor Feb 24, 2024
5a21c9a
Fix test for uneven number of atoms in universe
marinegor Feb 24, 2024
864a43e
Add entry for pydssp in LICENSE
marinegor Feb 24, 2024
001ce41
Merge remote-tracking branch 'upstream/develop' into feature/dssp
marinegor Feb 26, 2024
5b04387
Add DSSP entry to testuite/pyproject.toml
marinegor Feb 26, 2024
1484c5d
Merge branch 'develop' into feature/dssp
orbeckst Mar 25, 2024
9c28640
Apply suggestions from code review
Apr 8, 2024
7b30cdd
Add AtomGroup to DSSP
marinegor Apr 8, 2024
2d9efec
Fix documentation in pydssp_numpy
marinegor Apr 8, 2024
ebaf90a
Add test for AtomGroup in DSSP
marinegor Apr 8, 2024
5b783b3
Move functions after DSSP class
marinegor Apr 8, 2024
e203ab8
Fix docs in DSSP.translate
marinegor Apr 8, 2024
579a5f0
Fix citations
marinegor Apr 8, 2024
81112d4
Fix merge conflicts
marinegor Apr 8, 2024
38d6d1f
Update references.bib
yuxuanzhuang Apr 9, 2024
5c2fddf
add reference
yuxuanzhuang Apr 9, 2024
428e892
move functions below classes
yuxuanzhuang Apr 9, 2024
15ab7f5
Improve selection syntax and add configurable selection names
marinegor Apr 18, 2024
dcf22fe
Remove unnecessary nocover pragmas
marinegor Apr 18, 2024
eb10f00
Apply suggestions from codereview
marinegor Apr 19, 2024
858bef0
Make sure I add duecredit test
marinegor Apr 19, 2024
99b2164
Update `hydrogen_name` description with a Note
marinegor Apr 19, 2024
58c24c1
Add print line in dssp documentation
marinegor Apr 19, 2024
610b39f
Merge branch 'develop' into feature/dssp
orbeckst May 3, 2024
f50f5c3
Delete test_dssp.py
orbeckst May 3, 2024
848060e
Update testsuite/MDAnalysisTests/analysis/test_dssp.py
orbeckst May 3, 2024
84dd012
Update class documentation with "hydrogen_atom" argument explanation
marinegor May 29, 2024
ebcab2f
Add check for exactly 1 hydrogen on the atom
Jun 12, 2024
f4daa24
Apply suggestions from code review
marinegor Jun 12, 2024
2fef313
Generalize hydrogen checks
Jun 12, 2024
daf6529
Merge remote-tracking branch 'origin/feature/dssp' into feature/dssp
Jun 12, 2024
884630d
minor dssp reST fixes & updates
orbeckst Jun 13, 2024
8d503a2
fix dssp doc
orbeckst Jun 13, 2024
5438877
fixed docs for DSSP.results
orbeckst Jun 13, 2024
d3d2cf2
Update dssp.py
orbeckst Jun 13, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
279 changes: 279 additions & 0 deletions package/MDAnalysis/analysis/dssp.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,279 @@
from .base import AnalysisBase
orbeckst marked this conversation as resolved.
Show resolved Hide resolved
import numpy as np
from einops import repeat, rearrange
orbeckst marked this conversation as resolved.
Show resolved Hide resolved
from .. import Universe

CONST_Q1Q2 = 0.084
CONST_F = 332
DEFAULT_CUTOFF = -0.5
DEFAULT_MARGIN = 1.0
orbeckst marked this conversation as resolved.
Show resolved Hide resolved

# code of the off-class functions
# is borrowed from https://github.com/ShintaroMinami/PyDSSP
# in accordance with the MIT License


def _unfold(a: np.ndarray, window: int, axis: int):
"""Unfolds an ndarray if it was batched (pydssp legacy)

Args:
a (np.ndarray): input array
window (int): window
axis (int): axis to unfold across

Returns:
np.ndarray: unfolded array
orbeckst marked this conversation as resolved.
Show resolved Hide resolved
"""
idx = np.arange(window)[:, None] + \
np.arange(a.shape[axis] - window + 1)[None, :]
unfolded = np.take(a, idx, axis=axis)
return np.moveaxis(unfolded, axis - 1, -1)


def _check_input(coord):
"""Check if input coordinates have appropriate shape

Args:
coord (np.ndarray): input coordinates

Returns:
(np.ndarray, tuple): coordinates and appropriate shape
orbeckst marked this conversation as resolved.
Show resolved Hide resolved
"""
org_shape = coord.shape
assert (len(org_shape) == 3) or (
len(org_shape) == 4
), "Shape of input tensor should be [batch, L, atom, xyz] or [L, atom, xyz]"
coord = repeat(coord, "... -> b ...",
b=1) if len(org_shape) == 3 else coord
return coord, org_shape


def _get_hydrogen_atom_position(coord: np.ndarray) -> np.ndarray:
"""Fills in hydrogen atoms positions is they're abscent

Args:
coord (np.ndarray): input coordinates, shape (n_atoms, 4, 3)

Returns:
np.ndarray: coordinates of additional hydrogens, shape (n_atoms, 3)
orbeckst marked this conversation as resolved.
Show resolved Hide resolved
"""
# A little bit lazy (but should be OK) definition of H position here.
vec_cn = coord[:, 1:, 0] - coord[:, :-1, 2]
vec_cn = vec_cn / np.linalg.norm(vec_cn, axis=-1, keepdims=True)
vec_can = coord[:, 1:, 0] - coord[:, 1:, 1]
vec_can = vec_can / np.linalg.norm(vec_can, axis=-1, keepdims=True)
vec_nh = vec_cn + vec_can
vec_nh = vec_nh / np.linalg.norm(vec_nh, axis=-1, keepdims=True)
return coord[:, 1:, 0] + 1.01 * vec_nh


def get_hbond_map(
coord: np.ndarray,
cutoff: float = DEFAULT_CUTOFF,
margin: float = DEFAULT_MARGIN,
return_e: bool = False,
) -> np.ndarray:
"""Returns hydrogen bond map

Args:
coord (np.ndarray): input coordinates in either (n, 4, 3) or (n, 5, 3) shape (without or with hydrogens)
cutoff (float, optional): Defaults to DEFAULT_CUTOFF.
margin (float, optional): margin. Defaults to DEFAULT_MARGIN.
return_e (bool, optional): if to return energy instead of hbond map. Defaults to False.

Returns:
np.ndarray: output hbond map or energy depending on return_e param
orbeckst marked this conversation as resolved.
Show resolved Hide resolved
"""
# check input
coord, org_shape = _check_input(coord)
b, l, a, _ = coord.shape
# add pseudo-H atom if not available
assert (a == 4) or (
a == 5
), "Number of atoms should be 4 (N,CA,C,O) or 5 (N,CA,C,O,H)"
h = coord[:, 1:, 4] if a == 5 else _get_hydrogen_atom_position(coord)
# distance matrix
nmap = repeat(coord[:, 1:, 0], "... m c -> ... m n c", n=l - 1)
hmap = repeat(h, "... m c -> ... m n c", n=l - 1)
cmap = repeat(coord[:, 0:-1, 2], "... n c -> ... m n c", m=l - 1)
omap = repeat(coord[:, 0:-1, 3], "... n c -> ... m n c", m=l - 1)
d_on = np.linalg.norm(omap - nmap, axis=-1)
d_ch = np.linalg.norm(cmap - hmap, axis=-1)
d_oh = np.linalg.norm(omap - hmap, axis=-1)
d_cn = np.linalg.norm(cmap - nmap, axis=-1)
# electrostatic interaction energy
e = np.pad(
CONST_Q1Q2 * (1.0 / d_on + 1.0 / d_ch - 1.0 /
d_oh - 1.0 / d_cn) * CONST_F,
[[0, 0], [1, 0], [0, 1]],
)
if return_e:
return e
# mask for local pairs (i,i), (i,i+1), (i,i+2)
local_mask = ~np.eye(l, dtype=bool)
local_mask *= ~np.diag(np.ones(l - 1, dtype=bool), k=-1)
local_mask *= ~np.diag(np.ones(l - 2, dtype=bool), k=-2)
# hydrogen bond map (continuous value extension of original definition)
hbond_map = np.clip(cutoff - margin - e, a_min=-margin, a_max=margin)
hbond_map = (np.sin(hbond_map / margin * np.pi / 2) + 1.0) / 2
hbond_map = hbond_map * repeat(local_mask, "l1 l2 -> b l1 l2", b=b)
# return h-bond map
hbond_map = np.squeeze(hbond_map, axis=0) if len(
org_shape) == 3 else hbond_map
return hbond_map


def assign(coord: np.ndarray) -> np.ndarray:
"""Assigns secondary structure for a given coordinate array

Args:
coord (np.ndarray): input coordinates in either (n, 4, 3) or (n, 5, 3) shape -- without or with hydrogens

Returns:
np.ndarray: output (n,) array with labels in C3 notation ('-', 'H', 'E')
orbeckst marked this conversation as resolved.
Show resolved Hide resolved
"""
# check input
coord, org_shape = _check_input(coord)
# get hydrogen bond map
hbmap = get_hbond_map(coord)
hbmap = rearrange(
hbmap, "... l1 l2 -> ... l2 l1"
) # convert into "i:C=O, j:N-H" form
# identify turn 3, 4, 5
turn3 = np.diagonal(hbmap, axis1=-2, axis2=-1, offset=3) > 0.0
turn4 = np.diagonal(hbmap, axis1=-2, axis2=-1, offset=4) > 0.0
turn5 = np.diagonal(hbmap, axis1=-2, axis2=-1, offset=5) > 0.0
# assignment of helical sses
h3 = np.pad(turn3[:, :-1] * turn3[:, 1:], [[0, 0], [1, 3]])
h4 = np.pad(turn4[:, :-1] * turn4[:, 1:], [[0, 0], [1, 4]])
h5 = np.pad(turn5[:, :-1] * turn5[:, 1:], [[0, 0], [1, 5]])
# helix4 first
helix4 = h4 + np.roll(h4, 1, 1) + np.roll(h4, 2, 1) + np.roll(h4, 3, 1)
h3 = h3 * ~np.roll(helix4, -1, 1) * ~helix4 # helix4 is higher prioritized
h5 = h5 * ~np.roll(helix4, -1, 1) * ~helix4 # helix4 is higher prioritized
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get 3_10 and π helix, too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain, but seems like yes (used this paper's Fig 1):

from MDAnalysis.analysis.dssp import DSSP
import MDAnalysis as mda

u = mda.Universe('./1FUR.pdb')
r = DSSP(u).run()

r.results.resids[151:158]
# array([155, 156, 157, 158, 159, 160, 161])

''.join(r.results.dssp[0])[151:158]
# 'HHHH-HH'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can detect 3-10 and pi helixes indeed -- I generated idealized structures with mmtbx from here and ran DSSP on them:

dssp/o_beta_seq.pdb           --------------------
dssp/o_helix310_seq.pdb    -HHHHHHHHHHHHHHHHHH-
dssp/o_helix_seq.pdb          -HHHHHHHHHHHHHHHHHH-
o_helix_pi_seq.pdb              -HHHHHHHHHHHHHHHHHH-

helix3 = h3 + np.roll(h3, 1, 1) + np.roll(h3, 2, 1)
helix5 = (
h5
+ np.roll(h5, 1, 1)
+ np.roll(h5, 2, 1)
+ np.roll(h5, 3, 1)
+ np.roll(h5, 4, 1)
)
# identify bridge
unfoldmap = _unfold(_unfold(hbmap, 3, -2), 3, -2) > 0.0
unfoldmap_rev = rearrange(unfoldmap, "b l1 l2 ... -> b l2 l1 ...")
p_bridge = (unfoldmap[:, :, :, 0, 1] * unfoldmap_rev[:, :, :, 1, 2]) + (
unfoldmap_rev[:, :, :, 0, 1] * unfoldmap[:, :, :, 1, 2]
)
p_bridge = np.pad(p_bridge, [[0, 0], [1, 1], [1, 1]])
a_bridge = (unfoldmap[:, :, :, 1, 1] * unfoldmap_rev[:, :, :, 1, 1]) + (
unfoldmap[:, :, :, 0, 2] * unfoldmap_rev[:, :, :, 0, 2]
)
a_bridge = np.pad(a_bridge, [[0, 0], [1, 1], [1, 1]])
# ladder
ladder = (p_bridge + a_bridge).sum(-1) > 0
# H, E, L of C3
helix = (helix3 + helix4 + helix5) > 0
strand = ladder
loop = ~helix * ~strand
onehot = np.stack([loop, helix, strand], axis=-1)
onehot = rearrange(
onehot, "1 ... -> ...") if len(org_shape) == 3 else onehot
return onehot


def translate(onehot: np.ndarray) -> np.ndarray:
"""Translate a one-hot encoding ndarray into char-based secondary structure assignment.
One-hot encoding corresponds to '-', 'H', 'E' (in that order) -- loop, helix and sheet respectively.
Input array must have its last axis of shape 3: (n_residues, 3) or (n_frames, n_residues, 3), etc.

Example:

>>> from MDAnalysis.analysis.dssp import translate
>>> import numpy as np
>>> # encoding 'HE-'
>>> onehot = np.array([[False, True, False], [False, False, True], [True, False, False]]])
>>> ''.join(translate(onehot))
'HE-'

Args:
onehot (np.ndarray): _description_

Returns:
np.ndarray: _description_
"""
C3_ALPHABET = np.array(["-", "H", "E"])
index = np.argmax(onehot, axis=-1)
return C3_ALPHABET[index]


class DSSP(AnalysisBase):
"""Assign secondary structure using DSSP algorithm as implemented by pydssp package (v. 0.9.0)
orbeckst marked this conversation as resolved.
Show resolved Hide resolved
Here:
- 'H' represents a 4-helix (alpha-helix)
- 'E' represents 'extended strand', participating in beta-ladder
- '-' represents unordered part

Example:
>>> from MDAnalysis.analysis.dssp import DSSP
>>> from MDAnalysisTests.datafiles import PDB
>>> import MDAnalysis as mda
>>> u = mda.Universe(PDB)
>>> run = DSSP(u).run()
>>> print("".join(run.results.dssp[0]))
'--EEEEE-----HHHHHHHHHHHH--EEE-HHHHHHHHHHH--HHHHHHHHHHHH-----HHHHHHHHHHH---HHH---EEEE-----HHHHHHHHHH-----EEEEEE--HHHHHHHH--EE--------EE---E------E------E----HHH-HHHHHHHHHHHHHHHHHHHHHHHHHHHH---EEEEEE----HHHHHHHHHHHH-'

Also, per-frame dssp assignment allows you to build average secondary structure -- `DSSP.results.dssp_ndarray`
holds (n_frames, n_residues, 3) shape ndarray with one-hot encoding of loop, helix and sheet, respectively:

>>> from MDAnalysis.analysis.dssp import translate
>>> u = mda.Universe(...)
>>> long_run = DSSP(u).run()
>>> mean_secondary_structure = translate(long_run.results.dssp_ndarray.mean(axis=0))
>>> ''.join(mean_secondary_structure)
---HHHHHHHH--------------------------HHHHHH-------HHHH---HHHHHHHHHHHHHH------------HHHHH------------------HHHHHHHH---
orbeckst marked this conversation as resolved.
Show resolved Hide resolved
"""

def __init__(self, u: Universe, guess_hydrogens: bool = True):
super().__init__(u.trajectory)
self._guess_hydrogens = guess_hydrogens

# define necessary selections
# O1 is for C-terminal residue
heavyatom_names = ("N", "CA", "C", "O O1")
self._heavy_atoms: dict[str, "AtomGroup"] = {
t: u.select_atoms(f"protein and name {t}") for t in heavyatom_names
}
self._hydrogens: list["AtomGroup"] = [
res.atoms.select_atoms("name H")
for res in u.select_atoms("protein").residues
] # can't do it the other way because I need missing values to exist so that I could fill them in later

def _prepare(self):
self.results.dssp_ndarray = []

def _single_frame(self):
coords = np.array(
[group.positions for group in self._heavy_atoms.values()])

if not self._guess_hydrogens:
guessed_h_coords = _get_hydrogen_atom_position(
coords.swapaxes(0, 1))
h_coords = np.array(
[
group.positions[0] if group else guessed_h_coords[idx]
for idx, group in enumerate(self._hydrogens)
]
)
h_coords = np.expand_dims(h_coords, axis=0)
coords = np.vstack([coords, h_coords])

coords = coords.swapaxes(0, 1)
dssp = assign(coords)
self.results.dssp_ndarray.append(dssp)

def _conclude(self):
self.results.dssp = translate(np.array(self.results.dssp_ndarray))
self.results.dssp_ndarray = np.array(self.results.dssp_ndarray)
self.results.resids = np.array(
[at.resid for at in self._heavy_atoms["CA"]])
24 changes: 24 additions & 0 deletions testsuite/MDAnalysisTests/analysis/test_dssp.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
import pytest
import glob
import MDAnalysis as mda

from MDAnalysis.analysis.dssp import DSSP
from MDAnalysisTests.datafiles import DSSP_FOLDER
from MDAnalysisTests.datafiles import XTC_multi_frame
orbeckst marked this conversation as resolved.
Show resolved Hide resolved


@pytest.mark.parametrize("pdb_filename", glob.glob(f"{DSSP_FOLDER}/*.pdb"))
def test_file_guess_hydrogens(pdb_filename):
u = mda.Universe(pdb_filename)
with open(f"{pdb_filename}.dssp", "r") as fin:
correct_answ = fin.read().strip().split()[0]

run = DSSP(u, guess_hydrogens=True).run()
answ = "".join(run.results.dssp[0])
assert answ == correct_answ


@pytest.mark.parametrize("pdb_filename", glob.glob(f"{DSSP_FOLDER}/*.pdb"))
def test_trajectory_without_guess_hydrogens(pdb_filename):
u = mda.Universe(pdb_filename)
DSSP(u, guess_hydrogens=False).run()
Loading
Loading