Taxonomy-specific-RPS-BLAST-part-I

Proteins are usually made up of one or more domains. Conserved domains can be described by local multiple sequence alignments spanning a variety of organisms, which reveal sequence regions that contain the same, or similar, patterns of amino acids. Although it is easy to retrieve the taxonomic distribution of a protein, that information is not available at the domain level.

This 2019 NCBI Codeathon project will develop a pipeline to assign a lowest common taxid to a conserved protein domain (defined by a Position-Specific Score Matrix, PSSM). The taxid represents the taxon to which the domain is specific at a given threshold. The project is intended to be incorporated into a future version of RPS-BLAST, which uses the query sequence to search a database of pre-calculated PSSMs and reports significant hits, so that CD-Search (Conserved Domain Search) results can include taxonomic information.

Dependencies:

Python 3

BioPython

NCBI Conserved Domain Architecture Retrieval Tool (CDART)

taxidlineage.dmp, extracted from new_taxdump.zip, which is available on the NCBI Taxonomy FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/)
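For orientation, here is a minimal sketch of how taxidlineage.dmp could be loaded into memory. It assumes the layout described in the new_taxdump readme: two fields per row (tax_id and a space-separated lineage of ancestor taxids, most distant first), delimited by `\t|\t` and terminated by `\t|`. The function name `parse_taxid_lineage` and the sample rows are illustrative only and are not taken from dtrt.py.

```python
import io


def parse_taxid_lineage(handle):
    """Map each taxid to the list of its ancestor taxids (most distant first)."""
    lineages = {}
    for line in handle:
        # Splitting on "|" and stripping whitespace handles the "\t|\t" delimiters.
        fields = [field.strip() for field in line.rstrip("\n").split("|")]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed rows
        taxid = int(fields[0])
        lineages[taxid] = [int(t) for t in fields[1].split()]
    return lineages


# Hand-written sample rows in the assumed format (not copied from the real file).
sample = io.StringIO(
    "1\t|\t\t|\n"
    "2\t|\t131567 \t|\n"
    "1224\t|\t131567 2 \t|\n"
)
lineages = parse_taxid_lineage(sample)
print(lineages[1224])
```

With the real file, the same function would be called as `parse_taxid_lineage(open("taxidlineage.dmp"))`; the full table is large, so a dictionary of lists is only one of several reasonable in-memory layouts.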

Workflow:

Input:

A table generated by an SQL query against the NCBI internal CDART database (example file)

Command:

./dtrt.py <data/*.tsv-file-name> [-threshold <number-between-0-and-1> ]

options:

-threshold: Threshold to report taxonomy node. Default value: 0.95
-show_names: Display taxonomic names instead of just taxids. Default value: false
-show_tree: Display taxonomy tree for model. Default value: false
-shake: Experimental: 'shakes' the tree to remove nodes that contribute less than 1% to the parent's weight. Default value: false
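The `-shake` description can be read as a pruning pass over the weighted taxonomy tree. The sketch below is one possible interpretation, not the code in dtrt.py: it assumes each node's weight is the cumulative number of sequences at or below it, and the names `shake`, `children`, `weight`, and `min_fraction` are hypothetical.

```python
def shake(children, weight, root, min_fraction=0.01):
    """Return a pruned child map that drops children contributing
    less than `min_fraction` of their parent's weight."""
    pruned = {}
    stack = [root]
    while stack:
        node = stack.pop()
        kept = [
            child
            for child in children.get(node, [])
            if weight[node] > 0 and weight[child] / weight[node] >= min_fraction
        ]
        pruned[node] = kept
        stack.extend(kept)  # dropped children and their subtrees are never visited
    return pruned


# Toy tree with made-up taxids: node 30 holds only 0.5% of the root's weight.
children = {1: [10, 30], 10: [20, 21]}
weight = {1: 1000, 10: 995, 30: 5, 20: 600, 21: 395}
pruned = shake(children, weight, root=1)
print(pruned)
```

In this toy example node 30 is removed because 5/1000 is below the 1% cutoff, while both children of node 10 are kept.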

Output:

A taxonomy tree showing the lowest common taxid that meets the threshold for the domain
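To make the role of the threshold concrete, the following sketch shows one way a lowest taxid could be selected. It is an assumption about the approach, not the implementation in dtrt.py: every sequence counts once toward each node on its lineage, and the deepest node covering at least `threshold` of all sequences is reported. The function name and the toy taxids are hypothetical.

```python
from collections import Counter


def lowest_taxid_at_threshold(seq_taxids, lineages, threshold=0.95):
    """Return the deepest taxid whose subtree covers at least `threshold`
    of the sequences, or None if there are no sequences."""
    counts = Counter()
    depth = {}
    for taxid in seq_taxids:
        # Path from the most distant ancestor down to the sequence's own taxid.
        path = lineages.get(taxid, []) + [taxid]
        for level, node in enumerate(path):
            counts[node] += 1
            depth[node] = level
    total = len(seq_taxids)
    candidates = [node for node, n in counts.items() if n / total >= threshold]
    if not candidates:
        return None
    # Prefer the deepest node; break ties by the larger sequence count.
    return max(candidates, key=lambda node: (depth[node], counts[node]))


# Toy lineages with made-up taxids: 20 and 21 sit under 10, which sits under 1.
toy_lineages = {10: [1], 20: [1, 10], 21: [1, 10], 30: [1]}
toy_hits = [20] * 60 + [21] * 37 + [30] * 3
print(lowest_taxid_at_threshold(toy_hits, toy_lineages))        # 10
print(lowest_taxid_at_threshold(toy_hits, toy_lineages, 0.5))   # 20
```

For thresholds above 0.5 the qualifying nodes always lie on a single root-to-leaf path, so the deepest one is unambiguous; at lower thresholds several branches can qualify, which is why the sketch includes a tie-break.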

Validation:

PSSM-Id: 129695,200311,334026,334050,337780,335786,274086,308214,315456,338615,287328,313550,313551,274263

Future Work

Refine the weights by taxonomic origin: sequences from some sources (e.g., metagenomic, plasmid, synthetic/artificial) may count for less
Use model-specific thresholds to get better matches
Refine the weights by giving more weight to sequences with better scores (lower E-value, higher bit score)
Instead of relying on IPG (Identical Protein Groups), go to the individual sequences and extract the taxid from each
Suggested by Ryan Connor: weight sequences by the variation they bring into the model

People/Team

Marc Gwadz, IEB/NCBI/NIH

Christiam Camacho, IEB/NCBI/NIH

Jianli Dai, IEB/NCBI/NIH

Hanguan Liu, IEB/NCBI/NIH

Mingzhang Yang, IEB/NCBI/NIH
