Predicting pairwise protein-protein interactions and structures from multiple sequence alignments.
- Installation
- Credit
- Command-line usage
- Generating multiple-sequence alignments
- Calculating contact maps
- Predicting protein complex structures
- Command-line tools
- Python API
- Scaling up
- Issues, problems, suggestions
- Further help
yunta provides several implementations of protein-protein interaction evaluation. In increasing computational cost:
- GPU-accelerated direct coupling analysis (DCA) (in TensorFlow and PyTorch)
- RoseTTAFold-2track via the rf2t-micro package
- AlphaFold2 for protein-protein structure prediction
yunta has streamlined installation, a command-line interface, a Python API, and some resilience to GPU out-of-memory errors (through CPU fallback). It takes as input unpaired multiple-sequence alignments in A3M format (as generated by tools like hhblits), and outputs a matrix of inter-residue contacts.
Rough timings for a pair of ~200 amino-acid proteins (S. cerevisiae DHFR and WW domain-containing protein) on CPU:
- DCA: 5 seconds
- RoseTTAFold-2track: 10 seconds
- AlphaFold2: 1 hour
Note that these times will increase quadratically with the total length of the proteins.
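As a rough illustration of what quadratic scaling implies, the sketch below extrapolates from the timings listed above. It assumes that the benchmark pair has a combined length of about 400 residues and that the scaling is exactly quadratic. `estimate_seconds` and the baseline table are hypothetical helpers for this illustration and are not part of yunta.

```python
# Baseline CPU timings quoted above, in seconds, for a pair of ~200-residue proteins.
# ASSUMPTION: the combined length of the benchmark pair is taken to be 400 residues.
BASELINE_TOTAL_LENGTH = 400
BASELINE_SECONDS = {"dca": 5.0, "rf2t": 10.0, "af2": 3600.0}


def estimate_seconds(method: str, len_a: int, len_b: int) -> float:
    """Extrapolate runtime, assuming purely quadratic scaling in total length."""
    scale = (len_a + len_b) / BASELINE_TOTAL_LENGTH
    return BASELINE_SECONDS[method] * scale ** 2


# Doubling the total length (two 400-residue proteins) quadruples each estimate.
for method in BASELINE_SECONDS:
    print(f"{method}: ~{estimate_seconds(method, 400, 400):.0f} s")
```

Real timings will also depend on MSA depth and hardware, so treat these numbers as order-of-magnitude guides only.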
Obtaining and setting up yunta is easy.
$ pip install yunta
If you want to enable GPU, use
$ pip install yunta[cuda12]
If you want to use a local CUDA installation instead, use
$ pip install yunta[cuda12_local]
Using the embedded models requires the RoseTTAFold-2track and AlphaFold2 weights. These are downloaded automatically, but by using yunta you agree that the trained weights for RoseTTAFold are made available for non-commercial use only under the terms of the Rosetta-DL Software license, and that AlphaFold2's pretrained parameters fall under the CC BY 4.0 license.
yunta is a fork of SpeedPPI, which is itself inspired by FoldDock. This method used AlphaFold2 to evaluate 65,484 protein-protein interactions from the human proteome in Towards a structurally resolved human protein interaction network.
The idea of using DCA, RoseTTAFold-2track, and AlphaFold2 in a cascade of increasingly expensive and specific PPI detection methods has been explored in a series of papers from David Baker's lab:
- Cong et al., Protein interaction networks revealed by proteome coevolution. Science, 2019
- Humphreys et al., Computed structures of core eukaryotic protein complexes. Science, 2021
- Humphreys et al., Protein interactions in human pathogens revealed through deep learning. Nature Microbiology, 2024
yunta puts these algorithms in one place with easy installation, a command-line interface, and a Python API.
You can always get more help by running
$ yunta --help
usage: yunta [-h] {dca-single,dca-many,rf2t-single,af2-single,af2-many} ...
Screening protein-protein interactions using DCA and AlphaFold2.
options:
-h, --help show this help message and exit
Sub-commands:
{dca-single,dca-many,rf2t-single,af2-single,af2-many}
Use these commands to specify the tool you want to use.
dca-single Calculate DCA for one protein-protein interaction.
dca-many Calculate DCA between two sets of proteins, or all pairs in one set of proteins.
rf2t-single Calculate RF-2track contacts for between one protein and a series of others.
af2-single Model one protein-protein interaction.
af2-many Model all interactions between two sets of proteins, or all pairs in one set of proteins.
All the algorithms depend on pre-computed multiple-sequence alignments (MSAs) between a protein of interest and as many other proteins as possible. This allows computations to be sped up by separating out this phase of the calculation. You can generate MSAs using a dedicated tool like hhblits, which speeds up the process by using pre-clustered databases like UniClust. We typically use a command like:
hhblits -e 0.01 -v 3 -d /path/to/UniClust-database -i input.fasta -oa3m output-msa.a3m -o /dev/null -cov 60 -n 3 -realign -realign_max 10000
In our experience, this can take 1-40 min depending on the complexity of the query. Check the hhsuite documentation for more details.
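To sanity-check an alignment before passing it on, it can help to read it yourself. The sketch below is a minimal standalone A3M reader, not yunta's own parser. It relies on the A3M convention that lowercase letters mark insertions relative to the query, so dropping them leaves every row at the query length. `read_a3m` and `_strip_insertions` are hypothetical names used only for this example.

```python
from io import StringIO
from typing import Iterator, TextIO, Tuple


def _strip_insertions(sequence: str) -> str:
    """Drop lowercase insertion states so every row matches the query length."""
    return "".join(c for c in sequence if not c.islower())


def read_a3m(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, aligned_sequence) pairs from an open A3M file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if header is not None:
                yield header, _strip_insertions("".join(chunks))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, _strip_insertions("".join(chunks))


# Toy alignment: the second row carries a two-residue insertion ("aa").
example = StringIO(">query\nMKVGLA\n>hit1\nMRVaaGLA\n>hit2\n-KV-LS\n")
rows = list(read_a3m(example))
depth = len(rows)                       # number of sequences in the MSA
widths = {len(seq) for _, seq in rows}  # a single value if rows are consistent
print(depth, widths)
```

If `widths` contains more than one value, the file is probably not a valid A3M alignment.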
Once you have your MSAs, you can use the tools in yunta to calculate contact maps and predict structures of protein complexes with AlphaFold2.
Given two MSAs, yunta will calculate the contact map using DCA, RF2t, or AlphaFold2, and produce a summary table for each pair provided as input.
Using DCA or RF2t will produce a table like this:
$ yunta dca-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/dca-single.tsv --apc
ID | uniprot_id_1 | uniprot_id_2 | seq_len | chain_a_len | chain_b_len | msa1_depth | msa2_depth | msa_depth | n_eff | apc | mean | median | maximum | minimum |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
O13297-D6VTK4 | O13297 | D6VTK4 | 980 | 549 | 431 | 14246 | 1546 | 670 | 2 | False | 0.01830857 | 0.014683756 | 0.07428725 | 2.284808e-06 |
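When many pairs are screened, the summary table is easiest to work with programmatically. The sketch below is a minimal example using only the Python standard library. It assumes the output file is tab-separated with the column names shown above, and it ranks pairs by the `maximum` column, which is an arbitrary choice for illustration. `load_summary` and `rank_pairs` are hypothetical helpers, not part of yunta.

```python
import csv
from io import StringIO

# Stand-in for the contents of a summary file. Values are copied from the
# example row above, restricted to a few columns for brevity.
# ASSUMPTION: the real file is tab-separated.
sample = (
    "ID\tuniprot_id_1\tuniprot_id_2\tn_eff\tmean\tmaximum\n"
    "O13297-D6VTK4\tO13297\tD6VTK4\t2\t0.01830857\t0.07428725\n"
)


def load_summary(handle):
    """Read a tab-separated summary table into a list of dicts with numeric scores."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in ("n_eff", "mean", "maximum"):
            row[key] = float(row[key])
        rows.append(row)
    return rows


def rank_pairs(rows, key="maximum"):
    """Sort pairs in descending order of the chosen score column."""
    return sorted(rows, key=lambda r: r[key], reverse=True)


ranked = rank_pairs(load_summary(StringIO(sample)))
print(ranked[0]["ID"], ranked[0]["maximum"])
```

For a real file, replace `StringIO(sample)` with `open("test/outputs/dca-single.tsv", newline="")`.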
If you also give the --plot option, then the contact maps for the entire complex and for only the inter-chain contacts will be saved, along with CSV files containing the numerical values as a matrix. For example,
$ yunta dca-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/dca-single.tsv --apc --plot test/outputs/DYR_YEAST-CAPZA_YEAST
0 | 1 | 2 | ... | 420 | 421 |
---|---|---|---|---|---|
-0.0 | 0.0009014737 | 0.0010275221 | ... | 0.0005961701 | -1.9190367e-05 |
... |
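The relationship between the full-complex map and the inter-chain map can be sketched with NumPy. This is an illustration only. It assumes the full map is a square matrix with the residues of chain A first and those of chain B second. Whether yunta lays out its matrices this way, or computes its summary statistics over exactly this block, is not confirmed by the text above. `inter_chain_block` is a hypothetical helper.

```python
import numpy as np


def inter_chain_block(contact_map: np.ndarray, chain_a_len: int) -> np.ndarray:
    """Return rows for chain A against columns for chain B."""
    if contact_map.ndim != 2 or contact_map.shape[0] != contact_map.shape[1]:
        raise ValueError("expected a square contact map")
    return contact_map[:chain_a_len, chain_a_len:]


# Toy example: chain A has 2 residues, chain B has 3.
rng = np.random.default_rng(0)
full = rng.random((5, 5))
full = (full + full.T) / 2  # symmetrize the toy map

block = inter_chain_block(full, chain_a_len=2)

# Summary statistics of the same kind as the columns in the table above.
summary = {
    "mean": float(block.mean()),
    "median": float(np.median(block)),
    "maximum": float(block.max()),
    "minimum": float(block.min()),
}
print(block.shape, summary)
```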
yunta can also feed your MSAs into the AlphaFold2 model to predict structures of binary protein complexes.
$ yunta af2-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/af2-single
This will also generate a summary table. Using --plot will generate the contact maps as with the other commands.
$ yunta af2-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/af2-single --plot test/outputs/af2-single-plot
You can run 1-vs-many with the *-single commands. For example:
$ yunta dca-single --help
usage: yunta dca-single [-h] [--msa2 [MSA2 ...]] [--list-file] [--output [OUTPUT]] [--plot PLOT] [--apc] [msa1]
positional arguments:
msa1 MSA file. Default: STDIN.
options:
-h, --help show this help message and exit
--msa2 [MSA2 ...], -2 [MSA2 ...]
Second MSA file(s). Default: if not provided, all pairwise from msa1.
--list-file, -l Treat inputs as plain-text list of MSA files, rather than MSA filenames. Default: treat as MSA filenames.
--output [OUTPUT], -o [OUTPUT]
Output filename. Default: STDOUT.
--plot PLOT, -p PLOT Directory for saving plots. Default: don't plot.
--apc, -a Whether to use APC correction in DCA. Default: don't apply correction.
If one MSA is provided, then homodimeric interactions are probed. For convenience, you can use the --list-file option to provide a single file containing a list of MSA files (one per line).
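A list file of this kind is just a plain-text file with one path per line, so it can be generated from a directory of alignments. The sketch below is one possible way to do this with the standard library. `write_msa_list`, the `*.a3m` pattern, and the file name `msas.txt` are assumptions for illustration.

```python
from pathlib import Path
import tempfile


def write_msa_list(msa_dir: Path, list_path: Path, pattern: str = "*.a3m") -> int:
    """Write the sorted paths of all matching MSA files, one per line."""
    paths = sorted(msa_dir.glob(pattern))
    list_path.write_text("".join(f"{p}\n" for p in paths))
    return len(paths)


# Demonstration in a temporary directory with two toy MSAs and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for name in ("P1.a3m", "P2.a3m", "notes.txt"):
        (tmp_dir / name).write_text(">query\nMKV\n")
    n_written = write_msa_list(tmp_dir, tmp_dir / "msas.txt")
    listed = (tmp_dir / "msas.txt").read_text().splitlines()

print(n_written, [Path(p).name for p in listed])
```

Going by the help text above, the resulting file would be given in place of an MSA filename together with --list-file.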
You can run many-vs-many with the *-many commands. For example:
$ yunta af2-many --help
usage: yunta af2-many [-h] [--msa2 [MSA2 ...]] [--list-file] --output OUTPUT [--params PARAMS] [--recycles RECYCLES] [--plot PLOT] [msa1 ...]
positional arguments:
msa1 MSA file(s). Default: "<_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>".
options:
-h, --help show this help message and exit
--msa2 [MSA2 ...], -2 [MSA2 ...]
Second MSA file(s). Default: if not provided, all pairwise from msa1.
--list-file, -l Treat inputs as plain-text list of MSA files, rather than MSA filenames. Default: treat as MSA filenames.
--output OUTPUT, -o OUTPUT
Output directory. Required.
--params PARAMS, -w PARAMS
Path to AlphaFold2 params file (.npz).
--recycles RECYCLES, -x RECYCLES
Maximum number of recyles through the model. Default: "10".
--plot PLOT, -p PLOT Directory for saving plots. Default: don't plot.
We provide an API for using MSAs in your own programs.
>>> from yunta.structs.msa import MSA
>>> msa = MSA.from_file("my-msa-file.a3m")
>>> msa.neff()
6
We also provide a reusable GPU-accelerated TensorFlow implementation of DCA (adapted from Humphreys et al., Science, 2021).
>>> from yunta.dca import calculate_dca
>>> from yunta.structs.msa import PairedMSA
>>> paired_msa = PairedMSA.from_file("my-msa-file1.a3m", "my-msa-file2.a3m")
>>> calculate_dca(paired_msa, apc=True, gpu=False)
If you prefer, you can also import a PyTorch implementation (which anecdotally is faster on both CPU and GPU).
>>> from yunta.dca_torch import calculate_dca
>>> from yunta.structs.msa import PairedMSA
>>> paired_msa = PairedMSA.from_file("my-msa-file1.a3m", "my-msa-file2.a3m")
>>> calculate_dca(paired_msa, apc=True, gpu=False)
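The `apc` flag refers to the average product correction commonly applied to coevolution scores. The sketch below shows the textbook form of that correction in NumPy: each score has the product of its row and column means, divided by the overall mean, subtracted from it. It is an illustration of the general technique and not yunta's implementation, which may differ (for example, in how the diagonal is handled). `apply_apc` is a hypothetical name.

```python
import numpy as np


def apply_apc(raw: np.ndarray) -> np.ndarray:
    """Subtract the average product correction from a square score matrix."""
    row_mean = raw.mean(axis=1, keepdims=True)  # shape (L, 1)
    col_mean = raw.mean(axis=0, keepdims=True)  # shape (1, L)
    overall_mean = raw.mean()
    return raw - (row_mean * col_mean) / overall_mean


# Toy scores: a rank-one background plus one genuine coupling
# between positions 0 and 3.
background = np.array([1.0, 2.0, 3.0, 4.0])
raw = np.outer(background, background)
raw[0, 3] += 0.5
raw[3, 0] += 0.5

corrected = apply_apc(raw)
print(np.round(corrected, 3))
```

A pure rank-one background is removed entirely by the correction, which is why the coupled pair stands out in `corrected` even though it is not the largest raw score.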
(More documentation coming soon!)
While the *-many commands can process multiple possible protein-protein interactions, if you want to screen more than a few and have access to an HPC cluster, then our nf-ggi Nextflow pipeline will be more efficient.
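To see why large screens quickly outgrow a single machine, it helps to count the candidate pairs. The sketch below enumerates pairs for the two modes described for the *-many commands: all pairs within one set, or every member of one set against every member of another. Excluding self-pairs in the one-set case is an assumption of this sketch, and `enumerate_pairs` is a hypothetical helper, not part of yunta.

```python
from itertools import combinations, product
from typing import List, Optional, Sequence, Tuple


def enumerate_pairs(
    set_a: Sequence[str], set_b: Optional[Sequence[str]] = None
) -> List[Tuple[str, str]]:
    """List candidate pairs: all-vs-all within one set, or set A against set B."""
    if set_b is None:
        # ASSUMPTION: self-pairs (homodimers) are excluded here.
        return list(combinations(set_a, 2))
    return list(product(set_a, set_b))


proteins = [f"P{i}.a3m" for i in range(100)]
n_pairs = len(enumerate_pairs(proteins))
print(n_pairs)
```

For N proteins the all-vs-all count is N(N-1)/2, so 100 proteins already give several thousand pairs. At the rough AlphaFold2 timing quoted earlier (about an hour per pair on CPU), this is where distributing the work across a cluster pays off.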
Report issues, problems, and suggestions on the issue tracker.