gapTrick is a tool based on monomeric AlphaFold2 models that can identify critical residue-residue interactions in low-accuracy models of protein complexes. The approach can aid in the interpretation of challenging cryo-EM and MX structures and the computational identification of protein-protein interactions.
Chojnowski bioRxiv (2025) 01.31.635911
The gapTrick code depends on AlphaFold2 and MMseqs2, remember to cite them as well!
You can run gapTrick on Colab server without having to satisfy its hardware (GPU) and software dependencies on your computer. Click the link to start a new Colab session: gapTrick.ipynb
The code requires only a standard AlphaFold2 installation to run. Check AlphaFold2 installation instructions at the official webpage or follow the instructions below. I use them to install the code on Colab. It should work smoothly on most Linux distributions.
If you already have a working copy of AlphaFold2 go directly to gapTrick installation instructions. You can also run gapTrick directly from a cloned repository.
gapTrick has the same hardware requirements as AlphaFold2. In most of the cases a standard GPU (T4 on colab, 3090) will be enough as it can handle predictions up to roughly 1,200 residues. For larger targets, you will need to either use cards with more memory (e.g. A100 that can handle up to roughly 3,000 residues), or split your target into smaller fragments.
First, install the NVIDIA CUDA Toolkit on the system level.
Create a target directory, build base conda environment, and install dependencies
mkdir gapTrick
cd gapTrick
curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b -p conda
source conda/bin/activate
conda install -qy -c conda-forge -c bioconda python=3.10 openmm=8.0.0 matplotlib kalign2 hhsuite pdbfixer pyopenssl==22.0.0 git
Once you installed all dependencies install AlphaFold2 from an official repository
git clone --branch main https://github.com/deepmind/alphafold alphafold
pip install -r alphafold/requirements.txt
pip install --no-dependencies ./alphafold
pip install --upgrade "jax[cuda12]"==0.4.26
pip install --upgrade tensorflow==2.16.1
mkdir -p conda/lib/python3.10/site-packages/alphafold/common/
curl -o conda/lib/python3.10/site-packages/alphafold/common/stereo_chemical_props.txt https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
To finalize installatin you need to download the AlphaFold2 neural network (NN) parameters and place them in a directory that gapTrick can find (you don't need to download sequence databases)
If you installed gapTrick from scratch, use the following to place the parameters where they can be easily found
mkdir --parents conda/lib/python3.10/site-packages/alphafold/data/params
curl -o alphafold_params_2021-07-14.tar https://storage.googleapis.com/alphafold/alphafold_params_2021-07-14.tar
tar --extract --verbose --file=alphafold_params_2021-07-14.tar --directory=conda/lib/python3.10/site-packages/alphafold/data/params --preserve-permissions
rm alphafold_params_2021-07-14.tar
If you use an existing installation of AlphaFold2, or prefer to place the parameters in a custom directory (e.g. /a/directory/with/parameters
), run the following
mkdir --parents /a/directory/with/parameters/params
curl -o /a/directory/with/parameters/params/alphafold_params_2021-07-14.tar https://storage.googleapis.com/alphafold/alphafold_params_2021-07-14.tar
tar --extract --verbose --file=/a/directory/with/parameters/params/alphafold_params_2021-07-14.tar --directory=/a/directory/with/parameters/params --preserve-permissions
rm /a/directory/with/parameters/params/alphafold_params_2021-07-14.tar
and use gapTrick with a flag --data_dir=/a/directory/with/parameters/
The gapTrick itself can be installed with a command:
pip install git+https://github.com/gchojnowski/gapTrick
gapTrick --help
but you can also use it directly from a cloned repository, e.g. with a shared copy of AlphaFold2 on a cluster. If you use your own installation, remember to specify directory with AF2 parameters using --data_dir
keyword
git clone --recursive https://github.com/gchojnowski/gapTrick.git
python gapTrick/gapTrick --help
The prediction of protein-protein complexes in gapTrick is based on AlphaFold2 neural network (NN) models trained on single-chain proteins. The method uses only two out of five AF2 NN models that allow for the use of input templates. To allow prediction of multimeric structural models, all input protein chains are merged into a single protein chain interspersed with 200 amino acid gaps. The only input required from user is a PDB/mmCIF template and a FASTA file containing all target sequences in arbitrary order (it doesn't make sense to run it without a template). The structure prediction is performed fully automatically.
The most important gapTrick outputs are a model and precicted contacts. Two residues are predicted to be in contact if the probability that their CB atoms (CA for glycine) are within 8Å radius is greater than 80%. The contact predictions are very precise. If you see one of them on the complex interface, it's a strong evidence that your complex prediction is correct.
The most important keywords are
--seqin
- input sequences in FASTA format (number of chains must match the target, e.g. for a homodimer same sequence must be repeated twice)--template
- input template in PDB/mmCIF format--jobnmame
- name of a target job directory. Existing directories will not be overwritten
To see a full keywords list run
gapTrick --help
First, check your installation with a most basic run based on files provided in a examples directory. It's a small homo-dimer, part of a larger complex (1bjp) that would be difficult to predict without a template
gapTrick --seqin examples/piaq.fasta --template examples/piaq.pdb --jobname piaq_test --max_seq 5000 --relax
this will automatically download MSAs from MMseqs2 API and run prediction. Remember about fair use ot the API server.
gapTrick by default uses MMseqs2 API to generate MSAs. MMseqs2 API is a shared resource that has been generously provided by the developers and as such should be used in a fair manner (in simple words; not TOO extensively). Otherwise your IP will be blocked for a while. To reduce the number of requests for predictions repeated for the same target use --msa_dir
keyword. It will store MSA files in a local directory and reuse in subsequent jobs. For example, the following will use a directory local_msas
:
gapTrick --seqin examples/piaq.fasta --template examples/piaq.pdb --jobname piaq_test --max_seq 5000 --relax --msa_dir local_msas
now, whenever you rerun the job above, gapTrick will check the local_msas
directory for MSAs matching your target sequences.
Alternatively, you can pre-calculate target MSAs, store *.a3m
files in a directory and reuse with --msa_dir
keyword. A script colabfold_search
distributed with ColabFold is a pretty handy solution here. All gapTrick benchmarks were based on MSAs generated with the following command (check ColabFold website for more details)
colabfold_search --db1 uniref30_2202_db --mmseqs /path/to/bin/mmseqs --use-env 0 --filter 0 --db-load-mode 2 input_sequences.fasta /path/to/db_folder msas
After a job finishes the output directory will contain the following files and directories
msas/
- MSAs downloaded from MMseqs2 API. They can be used in subsequent jobs with the--msa_dir
keyword (unless you haven't used it already).input/ranked_0.pdb
- top-ranked prediciton in PDB format.input/ranked_0_pae.json
- PAE matrix for top-ranked prediciton. You can use it for generating self-restrains in ISOLDE.figures/
- PAE, pLDDT, and distogram plots in png and svg format.contacts.txt
- list of all residue pairs predicted to be at most 8Å apart with corresponding probabilitiesn (above 0.8). Leading * marks inter-chain intercations. If you have them, the complex prediction is very likely to be correct.pymol_interchain_contacts.pml
- a pymol script for displaying inter-chain contacts. First, open ranked_0.pdb and then use File->Run Script option to run the scirpt.pymol_interchain_saltbridges.pml
- this pymol script will ald salt-bridges only to your model. These may be the most crucial intercations!chimerax_interchain_contacts.cxc
- a script for displaying contacts in ChimeraX. Open top prediction and run a commandrun [path to scirpt]/chimerax_interchain_contacts.cxc [model id]
chimerax_interchain_saltbridges.cxc
- a script for displaying inter-chain salt-bridges in ChimeraX. Open top prediction and run a commandrun [path to scirpt]/chimerax_interchain_saltbridges.cxc [model id]
- if running gapTrick produces models identical to those given on input (and there are no contact predictions), it often means that the template is far from a native conformation
gapTrick can be very useful in interpreting your cryo-EM (and MX!) maps. Single-chain predictions fittend into maps and rebuilt with gapTrick will have very good stereochemical properties (easy to refine) and provide contact predictions (powerful validation score), which will help in functinal analysis. Here, in a few simple steps, I will show how I use it
- Fit your chains (or complexes, if available) into the cryo-EM map. I like to use Molrep in Doppio.
- Refine the models in real space with self-restraints (COOT or ISOLDE). It needs to fit the map as good as possible.
- Run gapTrtick on the initial model. If it was close to a native complex, you will see many inter-chain contact predictions. This is a very strong evidence that your model is correct (see the paper). If not, you may need to refine the model further or check your priors (are there any known deleterious mutants at the interface?).
- Once you are happy with a prediction (you can run gapTrick iteratively), you can proceed with automatic refinement. I like to use Servalcat with half-maps and jelly body restraints (a bit stronger than default at lower resolutions). The gapTrick predictions usually have very good geometry, which should not deteriorate during refinement (use DOPPIO validation tools). Also check the final B-factor distribution. It it's not smooth (should resemble inverse-gamma distribution) try increasing the initial B-factor in Servalcat. B-factors can be very high in some regions, but that's OK if the map resolution is heterogenous.
(c) 2025 Grzegorz Chojnowski