Proteogram: an image embedding-based search approach to protein structure similarity

This repository contains the source code for the Proteogram project and its accompanying paper.

Abstract

Motivation: Huge strides have been made in the last decade in the detection of protein similarity across evolutionarily remote distances using protein structure alignment for known structures (sequence-order dependent/independent, sequence-alignment dependent/independent). These methods accurately identify similar structures according to various measures such as TM-score, RMSD, and cosine similarity. The main remaining challenges are computational efficiency, gains in accuracy, and enhancements to the structural data representations, such as the incorporation of physicochemical properties and evolutionary information. We have developed Proteogram, a new approach to protein structure search that leverages computer vision models to create embedding vectors for a cosine-similarity-based search across a database of Proteograms, wherein each Proteogram represents a single protein monomer as image data. The Proteogram is a novel data type consisting of three categories of residue-level information: alpha-carbon backbone distances, hydrophobicity similarities, and charge similarities. These three channels of data are stacked to form an NxN 3-channel data structure (where N is the number of residues), which can consequently be stored on disk as an RGB image and is inherently sequence-alignment independent. Proteograms represent not only spatial similarity through distograms (the pair-wise residue alpha-carbon distances), but also physicochemical properties of residues through the integration of hydrophobicity values and charge states. Thus, this work outlines a novel and efficient approach to the current challenges in protein structure similarity, alignment, and search.

Results: Two state-of-the-art methods in structure alignment and similarity search, USalign and GTalign, serve as points of comparison to the method introduced in our study. Two evaluation datasets are used to assess search performance: one from the GTalign study and a new one introduced in this paper. On the GTalign evaluation dataset, superfamily and family performance could not be discerned easily; however, on a more representative dataset, the Proteogram approach scores highest in Precision@K for superfamily and family-level search results. Additionally, for the search step itself, the Proteogram approach has the highest throughput (measured in pair-wise alignments per minute) of the three methods (1.63 times faster than GTalign and 89.4 times faster than USalign). This unique and promising technique shows the gains that can be achieved by accounting for physicochemical information in addition to spatial information with a data structure compatible with advanced computer vision models.
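For readers unfamiliar with the Precision@K metric mentioned above, here is a minimal sketch of how it is commonly computed for a single query. The function name `precision_at_k`, the label strings, and the choice to divide by K even when fewer than K results are returned are assumptions made for illustration. This is not taken from the project's evaluation code.

```python
def precision_at_k(ranked_labels, query_label, k):
    """Fraction of the top-k ranked results sharing the query's label.

    ranked_labels: labels (e.g. family or superfamily IDs) of the search
                   results, ordered from most to least similar.
    query_label:   label of the query domain.
    k:             number of top results to consider (must be positive).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top = ranked_labels[:k]
    hits = sum(1 for label in top if label == query_label)
    # Assumption: always divide by k, even if fewer than k results exist.
    return hits / k


# Hypothetical labels: two of the top three results match the query.
ranked = ["fam_A", "fam_A", "fam_B", "fam_A"]
score = precision_at_k(ranked, "fam_A", 3)
print(score)
```

Averaging this value over all queries gives one summary number per method and per classification level.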

Getting started

This repo uses Python 3.8+.

Installing the package

This project uses Python Poetry to manage packages. With poetry==1.8.3, the following commands may be useful.

To install all packages from the pyproject.toml:

poetry install

Setting up an environment

A virtual environment is highly encouraged. One way to create a virtual environment is with the built-in tool venv:

python3 -m venv env

The virtual environment must be activated before use. To activate it on Unix systems:

source env/bin/activate

To activate the virtual environment on Windows:

env\Scripts\activate

See Create and Use Virtual Environments for more on venv.

Set up configuration

Fill out a config.yml file (based on config.example.yml) and place it in the scripts folder, from which all scripts should be run.

Creating proteograms

To create proteograms of your domains, run the following from the scripts folder:

python create_proteograms.py
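To make the data structure described in the abstract concrete, the sketch below builds an NxN 3-channel array from alpha-carbon coordinates and a sequence. It is an illustration under stated assumptions, not the logic of create_proteograms.py. The function name `build_proteogram`, the use of the Kyte-Doolittle hydropathy scale, the simple integer charge table, the 40 Å distance cap, and the "1 minus scaled absolute difference" similarity are all hypothetical choices.

```python
import numpy as np

# Assumption: Kyte-Doolittle hydropathy scale (the project may use another).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}
# Assumption: side-chain charge simplified to -1, 0, or +1.
CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0}


def build_proteogram(ca_coords, sequence, max_dist=40.0):
    """Return an (N, N, 3) uint8 array: distance, hydrophobicity, charge."""
    coords = np.asarray(ca_coords, dtype=float)
    if coords.shape != (len(sequence), 3):
        raise ValueError("need one (x, y, z) CA coordinate per residue")

    # Channel 0: pair-wise CA distances, clipped at max_dist, scaled to [0, 1].
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.clip(np.linalg.norm(diff, axis=-1) / max_dist, 0.0, 1.0)

    # Channel 1: hydrophobicity similarity (scale spans -4.5 to 4.5, range 9).
    hyd = np.array([KD[aa] for aa in sequence])
    hyd_sim = 1.0 - np.abs(hyd[:, None] - hyd[None, :]) / 9.0

    # Channel 2: charge similarity (charges span -1 to +1, range 2).
    chg = np.array([CHARGE.get(aa, 0.0) for aa in sequence])
    chg_sim = 1.0 - np.abs(chg[:, None] - chg[None, :]) / 2.0

    stacked = np.stack([dist, hyd_sim, chg_sim], axis=-1)
    return np.round(stacked * 255).astype(np.uint8)


# Toy example: four residues spaced 3.8 Å apart along the x-axis.
coords = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [11.4, 0.0, 0.0]]
img = build_proteogram(coords, "AKDE")
print(img.shape)
```

Because every channel is a symmetric NxN matrix indexed by residue position within a single protein, the resulting array can be saved as an RGB image without reference to any alignment against another structure.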

Measure similarity of a single domain to DB of proteograms

To compare a new structure against a DB (directory) of proteograms, run the following from the scripts folder (first modify the values in the script for your particular structure; see the script for details):

python measure_similarity_single_domain.py
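The search step ranks database entries by the cosine similarity between embedding vectors. The sketch below shows that ranking in isolation, with toy 2-dimensional vectors standing in for the embeddings a computer vision model would produce. The function name `cosine_search` and its parameters are hypothetical and do not reflect the interface of measure_similarity_single_domain.py.

```python
import numpy as np


def cosine_search(query, database, top_k=5):
    """Rank database rows by cosine similarity to the query vector.

    query:    1-D embedding vector of the query proteogram.
    database: 2-D array with one embedding vector per row.
    Returns a list of (row_index, similarity) pairs, best match first.
    """
    q = np.asarray(query, dtype=float)
    db = np.asarray(database, dtype=float)

    # Normalize to unit length so the dot product equals cosine similarity.
    q_unit = q / np.linalg.norm(q)
    db_unit = db / np.linalg.norm(db, axis=1, keepdims=True)

    scores = db_unit @ q_unit
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Toy embeddings (assumption: real embeddings have many more dimensions).
db = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
hits = cosine_search([2.0, 0.0], db, top_k=2)
print(hits)
```

Normalizing the database vectors once and reusing them across queries reduces each search to a single matrix-vector product.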

Example resulting search image (scores and proteogram files are also output):

Additional notes

To add a package dependency to the environment (and add to the main dependencies section of the pyproject.toml):

poetry add <packagename>

To add a package dependency to the environment (and add to the dev dependencies section of the pyproject.toml):

poetry add <packagename> --dev
