
Fast structure embedding search tool for Merizo


andymlau/merizo_search

 
 


Merizo-search

Merizo-search is a method that builds on the original Merizo (Lau et al., 2023) by combining state-of-the-art domain segmentation with fast embedding-based searching. Specifically, Merizo-search makes use of an EGNN-based method called Foldclass, which embeds a structure and its sequence into a fixed-size vector of length 128. This vector is then searched against a pre-encoded library of domains, and the top-k matches by cosine similarity are passed to confirmatory TM-align runs to validate the search. Merizo-search also supports searching larger-than-memory databases of embeddings using the Faiss library.
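To make the ranking step concrete, here is a minimal sketch of a top-k cosine similarity search over a small in-memory library. It is an illustration only, not the Foldclass or Faiss implementation: the function names, the random toy vectors and the library layout (a dict of name to vector) are all assumptions.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, library, k):
    """Return the k (name, score) pairs with the highest cosine similarity."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: random vectors of length 128 standing in for real embeddings.
rng = random.Random(0)
library = {f"domain_{i}": [rng.gauss(0, 1) for _ in range(128)] for i in range(50)}
query = list(library["domain_7"])  # a query identical to one library entry

hits = top_k(query, library, k=5)
for name, score in hits:
    print(name, round(score, 4))
```

In a real run, only these top-k candidates would go on to the slower TM-align confirmation step described above.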

Installation

Using conda with GPU support (Recommended):

cd /path/to/merizo_search
conda create -n merizo_search python=3.9
conda activate merizo_search
pip install -r merizo_search/programs/Merizo/requirements.txt
conda install -c pytorch -c nvidia faiss-gpu

For the CPU-only version of Faiss, replace the last step with conda install faiss-cpu. A GPU provides only minor speedups for searching with Faiss, but is beneficial when segmenting and/or embedding many structures.

We recommend using conda as there is no official Faiss package on PyPI at the time of writing. Unofficial packages are available; use these at your own risk.

Ansible Installation

First ensure that Ansible is installed on your system, then clone the GitHub repo.

pip install ansible
git clone https://github.com/psipred/merizo_search.git
cd merizo_search/ansible_installer

Next, edit config_vars.yml to reflect where you would like Merizo Search and its underlying data to be installed.

You can now run Ansible with:

ansible-playbook -i hosts install.yml

You can edit the hosts file to install Merizo Search on one or more machines. This ansible installation creates a python virtualenv called merizosearch_env which the program needs to run. You can activate this with

source [app path]/merizosearch_env/bin/activate

If you're using a virtualenv to install Torch, you may find you need to add the paths to the virtualenv versions of cudnn/lib/ and nccl/lib/ to your LD_LIBRARY_PATH.

By default, we do not download the Merizo-search databases as they are nearly 1TB in size. You can download them manually (see below), or open install.yml and uncomment the "- dataset" line.

Databases

We provide pre-built Foldclass databases for domains in CATH 4.3 and all 365 million domains from TED. They can be obtained from here. We recommend using our convenience script in this repository (download_dbs.sh) to download them. If using the URL above, please make sure you download the individual files in each directory, rather than downloading each directory as a whole.

Metadata format

Our pre-built databases (including the ones in the example/ directory in this repo) include metadata for each domain in the database. Metadata is organised in JSON key-value format, and the exact fields in the db are allowed to vary. For the CATH databases, we currently include the CATH assignment numbers up to H-level, and the resolution of the structure, where applicable. For the TED databases, we supply a subset of the fields available in the master TSV file. Here is an example, reformatted over multiple lines for clarity and annotated:

{
  'ted': 'AF-Q9UKA2-F1-model_v4_TED01',  # TED consensus domain ID.
  'cnsl': 'high',                        # TED consensus level; this is either 'high' or 'medium'.
  'rr': '50-229',                        # TED consensus residue range in the AFDB model, sometimes called the 'chopping'.
  'plddt': '93.735',                     # Average plDDT of the domain residues.
  'cath': '2.60.120.260',                # Putative CATH label. This is formatted as C.A.T.H, or C.A.T, or '-' where a label could not be assigned.
  'cl': 'H',                             # The level in the CATH hierarchy up to which the label was assigned. This is either 'H', 'T', or '-'.
  'cm': 'foldseek',                      # The method used to assign the CATH label. This is either 'foldseek', 'foldclass', or '-'.
  'dens': '11.6',                        # The packing density for this domain.
  'rg': '0.297',                         # The radius of gyration for the domain.
  'taxid': '9606',                       # The NCBI TaxID associated with this protein.
  'taxsci': 'Homo_sapiens'               # The short taxonomic name for the TaxID.
}
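As a rough sketch of how such a record might be consumed, the snippet below parses a metadata string with the standard json module and converts a few fields. The raw string is a hand-written stand-in using a subset of the fields above, and parse_residue_range is a hypothetical helper that only handles a single contiguous range like the one shown.

```python
import json


def parse_residue_range(rr):
    """Convert a range string such as '50-229' into (start, end) integers."""
    start, end = rr.split("-")
    return int(start), int(end)


# Hand-written stand-in for one metadata record (subset of the TED fields above).
raw = (
    '{"ted": "AF-Q9UKA2-F1-model_v4_TED01", "cnsl": "high", "rr": "50-229", '
    '"plddt": "93.735", "cath": "2.60.120.260", "cl": "H"}'
)

meta = json.loads(raw)
start, end = parse_residue_range(meta["rr"])
plddt = float(meta["plddt"])  # values are stored as strings in the example
cath_levels = meta["cath"].split(".") if meta["cath"] != "-" else []

print(meta["ted"], start, end, plddt, cath_levels)
```

Because the exact fields are allowed to vary between databases, code like this should use meta.get(...) for any field that may be absent.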

We will soon release scripts that will allow you to add JSON-formatted metadata to a custom database created by the createdb module (see below).

Usage

Merizo-search supports the functionalities listed below. The -h flag can be used to show all options for each mode:

segment         Runs standard Merizo segment on a multidomain target.
search          Runs Foldclass search on a single input PDB against a Foldclass database.
easy-search     Runs Merizo to segment a query into domains and then searches against a Foldclass database.
createdb        Creates a Foldclass database given a directory of PDB files. 

segment

The segment module of Merizo can be used to segment a multidomain protein into domains and can be run using:

python merizo.py segment <input.pdb> <output_prefix> <options>

# Example:
python merizo.py segment ../examples/*.pdb results --iterate

The input can be a single PDB file or several, including a glob such as /dir/*.pdb. Results are written to a file named output_prefix followed by _segment.tsv.

The --iterate option can sometimes be used to generate a better segmentation result on longer models, e.g. AlphaFold models.

The --pdb_chain option lets you select which PDB chain will be analysed. If not provided, chain A is assumed. If supplying multiple structures as queries, you can supply either a single chain ID to be used for all queries, or a comma-separated list of chain IDs, e.g. A,A,B,D,A.
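When scripting many runs, it can help to assemble the command line programmatically. The helper below is a hypothetical convenience wrapper, not part of Merizo-search; it only uses the segment positional arguments and the --iterate and --pdb_chain options described above, and it checks that the chain list is either a single ID or one ID per query.

```python
import shlex


def build_segment_command(pdb_files, output_prefix, chains=None, iterate=False):
    """Build an argument list for `merizo.py segment` (hypothetical helper)."""
    cmd = ["python", "merizo.py", "segment", *pdb_files, output_prefix]
    if iterate:
        cmd.append("--iterate")
    if chains:
        # Either one chain ID for all queries, or one chain ID per query.
        if len(chains) not in (1, len(pdb_files)):
            raise ValueError("give one chain ID, or one per input structure")
        cmd += ["--pdb_chain", ",".join(chains)]
    return cmd


cmd = build_segment_command(
    ["a.pdb", "b.pdb"], "results", chains=["A", "B"], iterate=True
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(...) to execute it
```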

A run prints log output such as:

2024-03-10 19:43:00,945 | INFO | Starting merizo segment with command:

merizo_search/merizo.py segment examples/AF-Q96HM7-F1-model_v4.pdb examples/AF-Q96PD2-F1-model_v4.pdb results --iterate

2024-03-10 19:44:11,318 | INFO | Finished merizo segment in 70.37289953231812 seconds.

Results will be written to results_segment.tsv:

filename        nres    nres_dom        nres_ndr        ndom    pIoU    runtime result
AF-Q96PD2-F1-model_v4   775     383     392     3       0.4942  0.7174  71-189,190-290,291-453
AF-Q96HM7-F1-model_v4   432     267     165     1       0.6343  0.3958  1-267
3w5h    272     272     0       2       1.0000  0.2517  1001-1117,1118-1272
M0      31      0       31      0       0.0000  0.0225 
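The result column can be turned into residue ranges with a few lines of Python. This sketch assumes the file is tab-separated with the header shown above, and that each domain is a single start-end range; the sample text reuses two rows from the table, and parse_chopping is a hypothetical helper.

```python
import csv
import io


def parse_chopping(chopping):
    """Turn '71-189,190-290' into [(71, 189), (190, 290)]; empty input gives []."""
    ranges = []
    for part in chopping.split(","):
        part = part.strip()
        if part:
            start, end = part.split("-")
            ranges.append((int(start), int(end)))
    return ranges


# Two rows copied from the table above, assumed to be tab-separated on disk.
sample = (
    "filename\tnres\tnres_dom\tnres_ndr\tndom\tpIoU\truntime\tresult\n"
    "AF-Q96PD2-F1-model_v4\t775\t383\t392\t3\t0.4942\t0.7174\t71-189,190-290,291-453\n"
    "M0\t31\t0\t31\t0\t0.0000\t0.0225\t\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
# `or ""` guards against a missing final column when no domain was found.
domains = {row["filename"]: parse_chopping(row.get("result") or "") for row in rows}
print(domains)
```

To read a real file, replace io.StringIO(sample) with an open file handle on results_segment.tsv.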

search

The search module of Merizo-search will call Foldclass to search queries (as they are, without segment) against a pre-compiled database (created using createdb). This is useful when queries are already domains.

The search module is called using:

python merizo.py search <input.pdb> <database_name> <output_prefix> <tmp> <options>

Again, the -h option will print all options that can be given to the program. The database_name argument is the prefix of a Foldclass database. A Foldclass database can be created using createdb.

For default Foldclass databases, database_name should be the basename of the database without .pt or .index. For example:

python merizo.py search ../examples/AF-Q96HM7-F1-model_v4.pdb ../examples/database/cath results tmp

For Faiss databases, use the basename of the .json file without extension:

python merizo.py search ../examples/AF-Q96HM7-F1-model_v4.pdb ../examples/database/ted100 results tmp
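Since the two database types are distinguished by the files that share the prefix, a quick check before launching a search can catch a mistyped database_name. The function below is a hypothetical helper based only on the file extensions mentioned above (.pt and .index for default databases, .json for Faiss databases); it does not inspect file contents.

```python
from pathlib import Path


def detect_db_kind(prefix):
    """Guess the database type from the files present next to `prefix`."""
    prefix = str(prefix)
    # String concatenation is used instead of with_suffix() so that
    # prefixes containing dots are left intact.
    if Path(prefix + ".json").is_file():
        return "faiss"
    if Path(prefix + ".pt").is_file() and Path(prefix + ".index").is_file():
        return "default"
    return None


kind = detect_db_kind("../examples/database/cath")
print(kind if kind else "no database found with that prefix")
```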

Results will be written to results_search.tsv:

query   topk_rank   target  cosine_similarity   q_len   t_len   len_ali seq_id  q_tm    t_tm    max_tm  rmsd
AF-Q96HM7-F1-model_v4	0	3.40.50.10540__SSG5__1_1	0.8204	432	304	169	0.1120	0.2646	0.3470	0.3470	6.27

Output fields are configurable using the --format flag, which allows the selection of different fields, specified as a comma-separated list. The default is to output all fields: query,chopping,conf,plddt,emb_rank,target,emb_score,q_len,t_len,ali_len,seq_id,q_tm,t_tm,max_tm,rmsd,metadata.
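A common follow-up is to keep only hits above a TM-score cutoff. The sketch below assumes a tab-separated file with a header row containing max_tm, as in the example output above; filter_hits is a hypothetical helper and the 0.5 default is a commonly used rule of thumb for shared fold, not a Merizo-search setting.

```python
import csv
import io


def filter_hits(tsv_text, min_tm=0.5):
    """Return rows whose max_tm is at least `min_tm`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row["max_tm"]) >= min_tm]


# Header and row copied from the example output above.
sample = (
    "query\ttopk_rank\ttarget\tcosine_similarity\tq_len\tt_len\tlen_ali\t"
    "seq_id\tq_tm\tt_tm\tmax_tm\trmsd\n"
    "AF-Q96HM7-F1-model_v4\t0\t3.40.50.10540__SSG5__1_1\t0.8204\t432\t304\t169\t"
    "0.1120\t0.2646\t0.3470\t0.3470\t6.27\n"
)

strict = filter_hits(sample)               # default cutoff of 0.5
relaxed = filter_hits(sample, min_tm=0.3)  # looser cutoff
print(len(strict), len(relaxed))
```

If headers are not written in your run, or a custom --format is used, the column names passed to DictReader need to be adjusted to match.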

easy-search

easy-search combines segment and search into a single workflow. A multidomain query is parsed using segment, and the resultant domains are searched against a database using search. This can be called using:

python merizo.py easy-search <input.pdb> <database_name> <output_prefix> <tmp> <options>

# Example:
python merizo.py easy-search ../examples/AF-Q96HM7-F1-model_v4.pdb ../examples/database/cath results tmp --iterate 

As with search, the -h option will print all options that can be given to the program. The database_name argument is the prefix of a Foldclass database, as above. A Foldclass database can be created using createdb.

The results in the _search.tsv file will be different to that of search and will show extra information about the domain parse:

query_dom   chopping    conf    plddt   topk_rank   target  cosine_similarity   q_len   t_len   len_ali seq_id  q_tm    t_tm    max_tm  rmsd
AF-Q96HM7-F1-model_v4_merizo_01	1-267	1.0000	91.9215	0	3.40.50.720__SSG5__79_12	0.8583	267	178	147	0.0680	0.3811	0.5180	0.5180	4.95

As with segment, the _segment.tsv file will show the results of segment:

query   nres    nres_domain nres_non_domain num_domains conf    time_sec    chopping
AF-Q96HM7-F1-model_v4	432	267	165	1	0.6343	22.7448	1-267

Output fields are configurable using the --format flag, which allows the selection of different fields: query, target, conf, plddt, chopping, emb_rank, emb_score, q_len, t_len, ali_len, seq_id, q_tm, t_tm, max_tm, rmsd.

createdb

createdb can be used to create a standard Foldclass database given a directory of PDB structures (anything with the extension .pdb will be read automatically). This can be run using:

python merizo.py createdb <directory_containing_pdbs> <output_database_prefix>

# Example:
python merizo_search/merizo.py createdb examples/database/cath_pdb examples/database/cath

The argument given to output_database_prefix will be appended with .pt and .index, with the two files constituting a Foldclass database.

The .pt file is a PyTorch tensor containing the embedding representation of the PDB files. The .index file contains the PDB names, CA coordinates and the sequences of the input PDBs.

Multi-domain searching

Both search and easy-search support searching for database entries that match all domains in a query chain. In the case of search, all supplied query structures are considered as domains originating from a single chain and searched against the database. In the case of easy-search, segmentation and multi-domain search operate on a per-query-chain basis, that is, only domains segmented from individual query chains are searched together as a set.

To enable multi-domain searching, add the option --multi_domain_search to a search or easy-search command.

A few important things to note:

  • In multi-domain searches, -k still controls the maximum number of per-domain hits retrieved using vector search. We recommend setting it to around 100.
  • We only keep hits where all domains in each query chain are matched at least once in a hit chain. We don't return hits containing fewer domains than the query domain set. You can, however, manually supply a subset of pre-segmented domains to the search command with --multi_domain_search enabled.
  • The accuracy of multi-domain easy-search runs is dependent on the accuracy of the initial Merizo segmentation. If you're not getting many meaningful hits, we recommend checking the output from the implicit segment step from your run. Merizo is fairly robust, but you may wish to manually segment your query chain and then re-run multi-domain search using the search module.
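The "all query domains must be matched" rule in the notes above can be sketched as a set-coverage check over the per-domain hit lists. This is a simplified illustration, not the Merizo-search implementation: it assumes CATH-style hit IDs whose last two characters are the domain number (so '1amoA02' belongs to chain '1amoA'), it does not check that distinct query domains map to distinct hit domains, and the hit '1xyzA01' is a made-up entry included to show a chain being rejected.

```python
from collections import defaultdict


def chain_of(domain_id):
    """Assumption: IDs such as '1amoA02' end in a two-digit domain number."""
    return domain_id[:-2]


def chains_matching_all(per_domain_hits):
    """Keep hit chains that contain at least one hit for every query domain.

    per_domain_hits maps each query domain to its list of top-k hit domain IDs.
    """
    covered = defaultdict(set)
    for query_domain, hit_domains in per_domain_hits.items():
        for hit in hit_domains:
            covered[chain_of(hit)].add(query_domain)
    wanted = set(per_domain_hits)
    return sorted(chain for chain, qdoms in covered.items() if qdoms == wanted)


per_domain_hits = {
    "3w5h_merizo_01": ["1amoA02", "1b2rA01", "1xyzA01"],  # '1xyzA01' is made up
    "3w5h_merizo_02": ["1amoA04", "1b2rA02"],
}
kept = chains_matching_all(per_domain_hits)
print(kept)
```

The sketch also shows why a small -k can hurt: a chain is only kept if every query domain retrieves one of its domains within the top-k list.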

Multi-domain search output

When --multi_domain_search is supplied, multi-domain search results are output in a file with the suffix _search_multi_dom.tsv. Each line of this file describes a match between a query chain and a hit chain. This is different from the outputs from search, in which each line describes a domain-level match.

The format of this file is not configurable (though headers can be enabled with the --output_headers option), and has the following format:

query_chain	nqd	hit_chain	nhd	match_category	match_info	hit_metadata
3w5h	2	1amoA	4	1	3w5h_merizo_01:1amoA02:0.70881,3w5h_merizo_02:1amoA04:0.71	[{"cath": "2.40.30.10", "res": "2.600"},{"cath": "3.40.50.80", "res": "2.600"}]
3w5h	2	1amoB	4	1	3w5h_merizo_01:1amoB02:0.70881,3w5h_merizo_02:1amoB04:0.71	[{"cath": "2.40.30.10", "res": "2.600"},{"cath": "3.40.50.80", "res": "2.600"}]
3w5h	2	1b2rA	2	3	3w5h_merizo_01:1b2rA01:0.73567,3w5h_merizo_02:1b2rA02:0.70819	[{"cath": "2.40.30.10", "res": "1.800"},{"cath": "3.40.50.80", "res": "1.800"}]
3w5h	2	1bjkA	2	3	3w5h_merizo_01:1bjkA01:0.7425,3w5h_merizo_02:1bjkA02:0.708	[{"cath": "2.40.30.10", "res": "2.300"},{"cath": "3.40.50.80", "res": "2.300"}]

Multi-domain hits are assigned to one of four categories, given in the match_category field of the output, representing the type of multi-domain match. Each can be seen as a subset of the last:

match_category value | Category name | Meaning
0 | Unordered domain match | All query domains present in hit chain, but in different sequential order to query chain. Domains may be inserted relative to the query chain at any position.
1 | Discontiguous domain match | All query domains matched in sequential order, but hit chain has at least one extra domain in an interstitial position.
2 | Contiguous domain match | All query domains matched in sequential order. Hit chain has extra domains at one or both ends, but not in interstitial positions.
3 | Exact multi-domain architecture (MDA) match | Query chain and hit chain correspond at domain level without domain rearrangement or insertions.
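One way to read the four categories is as a decision procedure over the positions of the matched domains in the hit chain. The function below is an interpretation of the table, not the code used by Merizo-search; it assumes that hit domains are numbered by their sequential position in the hit chain and that positions are listed in query-domain order.

```python
def classify_match(hit_positions, n_hit_domains):
    """Classify a multi-domain match following the category table.

    hit_positions: 1-based positions in the hit chain of the domains matched
                   to query domains 1, 2, ... (in query order).
    n_hit_domains: total number of domains in the hit chain.
    """
    pairs = list(zip(hit_positions, hit_positions[1:]))
    if not all(a < b for a, b in pairs):
        return 0  # unordered: sequential order differs from the query
    if not all(b - a == 1 for a, b in pairs):
        return 1  # discontiguous: extra domain(s) between matched domains
    if n_hit_domains > len(hit_positions):
        return 2  # contiguous: extra domain(s) only at one or both ends
    return 3      # exact multi-domain architecture match


print(classify_match([2, 4], 4))  # cf. 3w5h vs 1amoA in the example output
print(classify_match([1, 2], 2))  # cf. 3w5h vs 1b2rA in the example output
```

Under these assumptions the two calls reproduce the categories shown in the example output for 1amoA (1) and 1b2rA (3).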

It is possible for the same hit chain to be listed more than once for the same query chain, as multiple query domain-hit domain mappings may be possible (e.g. in the case of repeats of domains). In such cases, Merizo-search will list all such pairings, one per line.
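A line of this file can be unpacked with the standard library alone. The sketch below assumes tab-separated fields in the order of the header shown earlier, with match_info as comma-separated query:hit:score triples and hit_metadata as a JSON list; the meaning of the third value in each triple is not specified here, so it is simply parsed as a float. FIELDS and parse_multi_dom_line are hypothetical names.

```python
import json

FIELDS = [
    "query_chain", "nqd", "hit_chain", "nhd",
    "match_category", "match_info", "hit_metadata",
]


def parse_multi_dom_line(line):
    """Parse one line of a _search_multi_dom.tsv file into a dict."""
    record = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    for key in ("nqd", "nhd", "match_category"):
        record[key] = int(record[key])
    pairs = []
    for item in record["match_info"].split(","):
        query_dom, hit_dom, score = item.split(":")
        pairs.append((query_dom, hit_dom, float(score)))
    record["match_info"] = pairs
    record["hit_metadata"] = json.loads(record["hit_metadata"])
    return record


# First data row of the example output above.
line = (
    "3w5h\t2\t1amoA\t4\t1\t"
    "3w5h_merizo_01:1amoA02:0.70881,3w5h_merizo_02:1amoA04:0.71\t"
    '[{"cath": "2.40.30.10", "res": "2.600"},{"cath": "3.40.50.80", "res": "2.600"}]'
)
record = parse_multi_dom_line(line)
print(record["hit_chain"], record["match_category"], record["match_info"])
```

Grouping the parsed records by (query_chain, hit_chain) makes the repeated pairings mentioned above easy to spot.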

Other outputs

The segment module used in segment and easy-search produces a number of different output files that can be turned on using various flags:

--save_domains  Save the domains as individual PDBs.
--save_pdb      Save a single PDB with the occupancy column replaced with domain IDs. (Visualise in PyMOL using the `spectrum q` command).
--save_pdf      Save a PDF output showing the domain map.
--save_fasta    Save the sequence of the input file.
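Outside PyMOL, the domain assignment written by --save_pdb can be read back from the fixed-width occupancy field (columns 55-60 of an ATOM record in the PDB format). The sketch below is an assumption-laden illustration: it presumes the domain ID is stored as a number in that field and takes one value per residue from the CA atom; how Merizo encodes non-domain residues is not stated here. The sample line is synthetic and built from a format string to guarantee correct column positions.

```python
def domain_ids_from_pdb(lines):
    """Map (chain, residue number) to the domain ID stored in the occupancy column."""
    assignments = {}
    for line in lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]                 # column 22: chain identifier
            resseq = int(line[22:26])        # columns 23-26: residue number
            occupancy = float(line[54:60])   # columns 55-60: occupancy
            assignments[(chain, resseq)] = int(round(occupancy))
    return assignments


def make_ca_line(serial, resseq, occupancy):
    """Build a synthetic, correctly aligned CA ATOM record (for illustration)."""
    return (
        "ATOM  {:5d} {:<4}{:1}{:3} {:1}{:4d}{:1}   {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
    ).format(serial, " CA", "", "ALA", "A", resseq, "", 0.0, 0.0, 0.0, occupancy, 50.0)


sample_lines = [make_ca_line(1, 10, 1.0), make_ca_line(2, 11, 2.0)]
assignments = domain_ids_from_pdb(sample_lines)
print(assignments)
```

For a real file, pass an open file handle (which iterates over lines) instead of sample_lines.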

By default, all output files will be saved alongside the original input query PDB, but they can be saved into a folder given by --merizo_output.
