peptide-matcher-data

This repository contains the workflow used to generate FASTA datasets compatible with Peptide matcher and, potentially, with other software capable of parsing the format. To download the datasets, go to the repository's Releases page.

Workflow description

The workflow takes AlphaFold structures and the corresponding UniProt records, extracts structural information from both, and combines it into a single dataset file that can be used to analyze structural features of proteins at single-residue resolution.

Dependencies

The workflow is implemented in Snakemake and depends on DSSP, Python >= 3.6 and wget. It also requires the Python packages pandas and Biopython.

Format description

The format of the datasets is an extension of fasta:

>P0A8G3 Uronate isomerase confidence:49545c5f5e606062626263616262636363636363636363636262636263636362636262626162626263626262626262616261616161605e5c52505b5f6161626262626262626262616162605f5d60626262626262626363636363636363636362626262626363626263626262626262626262615e6062626263626363626263636362636262626262625f60626262626263636262626263626263636363636363636363636363636262626161615f5f606062626262636363636363626262626261605e60626262636362636363636363626262626162636363636363636363636363636363626262616262626363636262616262626363626262616263636363636362626262626162616262626363636363636363636363636363636363636262626062636363636363626262605f626160616060605f5f5e5e5d5e5f616262636262616262636363636363636363636262605e5f61626262636363636363636362636363636363636363626262625a58585a5e616263636363636363626263636363636363636363636363636363626262636262636363636362615f5e6062626262626363636363636363636363636363636363626262626262626362626363636363636363636363626262606062615e5b595949 secstruct:3-2T1-4T2-1S5H6I2T2-3E1-1S4-5H2T3-2S5H3T1-9H2T2-3G2T3S1-11H4G2T1S11H3T3-1S3-1S3T12H1T1S3G1S7H2T8E1-2T3-8H1-2T2-1S6E2-4H1-2T1S2T12H1T4-1S17H2T2-6E2S7-12H2T4-22H1T1-6E1-2E4-7H1-4S2-2E5-12H3T3-6E2S3G7H4G2-2T1S2T1S3E1-1S2-3G1-1S14H1S1-3G8-1S1-4T19H2T2S3-9H5I6H1T4- 
accessibility:6c473b250c18291f080e032701310a001a26000116350d00243d0f0e0301000000001f011b0e2a1400052f1d112404430b0008070600012b1304130c01051f0b010f4621420f29152e160117224204251c1a2705001e0a00001f01001e3b053f50240f020a150200040001010514090333033559152a000e35460e022f2e00092c1e00133e100031283d220201000705000030240c1f0226000000010f06090012212902351503122b04444a1b4b3b05310414050e0101000f000c2a00000e002329453001133b04033531000c4621054b0c34083b290510100019280000172408002a16002242151c0221020000000405181506100d0734182c31362504122701002716091d3c312b510b23383410050607000802100002140c00000909001633182603000000000001050122151b281f35440d1f453b1c142d2e0d0510021a2008342e4804203918001c1d00002716031b384441110203080000000104000c09280704251700002d05022e3808251f5b5521340e0901020100040805081405031314290100112201001c11021a41231a3a0520290002000404020d082e060101010101010209020104000000000404010027170b3b323f4110313f250c3631070026100016180106060e08021f3105002b05481556 transmembrane:470-
MTPFMTEDFLLDTEFARRLYHDYAKDQPIFDYHCHLPPQQIAEDYRFKNLYDIWLKGDHY
KWRAMRTNGVAERLCTGDASDREKFDAWAATVPHTIGNPLYHWTHLELRRPFGITGKLLS
PSTADEIWNECNELLAQDNFSARGIMQQMNVKMVGTTDDPIDSLEHHAEIAKDGSFTIKV
LPSWRPDKAFNIEQATFNDYMAKLGEVSDTDIRRFADLQTALTKRLDHFAAHGCKVSDHA
LDVVMFAEANEAELDSILARRLAGETLSEHEVAQFKTAVLVFLGAEYARRGWVQQYHIGA
LRNNNLRQFKLLGPDVGFDSINDRPMAEELSKLLSKQNEENLLPKTILYCLNPRDNEVLG
TMIGNFQGEGMPGKMQFGSGWWFNDQKDGMERQMTQLAQLGLLSRFVGMLTDSRSFLSYT
RHEYFRRILCQMIGRWVEAGEAPADINLLGEMVKNICFNNARDYFAIELN

Besides the record identifier and description, the definition line carries per-residue data for the protein in the following fields:

| Field | Description | Source | Values | Encoding |
| --- | --- | --- | --- | --- |
| `confidence` | pLDDT confidence scores | AlphaFold | Integers 1-100 | Hexadecimal numbers |
| `secstruct` | Secondary structure | DSSP | DSSP codes | CIGAR-like compression |
| `accessibility` | Relative solvent accessibility | DSSP | `ASA/max(ASA)`, where the maximum values for absolute solvent accessibility are the theoretical maxima from Tien et al. (2013) | Hexadecimal numbers |
| `transmembrane` | Transmembrane regions | UniProt | `T`: transmembrane, `S`: signal peptide, `-`: otherwise | CIGAR-like compression |
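As an illustration of how the fields in the table above could be pulled out of a definition line, here is a minimal sketch. It assumes that the whole definition is on one line and that each field appears as a space-separated `name:value` token after the description, as in the example record. The function name `parse_defline` and the shortened header used below are hypothetical and are not part of the workflow.

```python
import re

# Field names taken from the table above.
FIELDS = ('confidence', 'secstruct', 'accessibility', 'transmembrane')
# Assumption: every field is a whitespace-separated "name:value" token.
FIELD_RE = re.compile(r'\s(%s):(\S*)' % '|'.join(FIELDS))


def parse_defline(line):
    """Split an extended definition line into id, description and raw fields."""
    line = line.rstrip()
    body = line[1:] if line.startswith('>') else line
    # Everything before the first recognised field is "<id> <description>".
    first = FIELD_RE.search(body)
    head = body[:first.start()] if first else body
    identifier, _, description = head.strip().partition(' ')
    fields = {name: value for name, value in FIELD_RE.findall(body)}
    return identifier, description.strip(), fields


# Shortened, made-up header for illustration only (not a real record).
example = '>P0A8G3 Uronate isomerase confidence:49545c secstruct:2-1T transmembrane:3-'
ident, desc, fields = parse_defline(example)
print(ident, '|', desc, '|', fields)
```

The values returned in `fields` are still encoded strings; they need to be decoded according to the "Encoding" column before use.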

Two encodings are used to compress the data into short strings:

- 2-digit hexadecimal numbers concatenated into a single string. The string is obtained as `encoded = ''.join('%02x' % x for x in scores)` and can be decoded as `scores = [int(encoded[i:i+2], 16) for i in range(0, len(encoded), 2)]`.
- CIGAR-like compression. This is a simple run-length format of the form `<int><char><int><char>...`, where each integer gives the number of times the following character appears consecutively. The string is obtained as `encoded = ''.join(str(sum(1 for x in g)) + k for k, g in itertools.groupby(chars))` and can be decoded as `chars = list(itertools.chain(*[[k] * int(n) for n, k in re.findall(r'(\d+)(.)', encoded)]))`.
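The one-liners above can be wrapped into small helpers to check that both encodings round-trip. This is only a sketch: the function names (`hex_encode`, `hex_decode`, `rle_encode`, `rle_decode`) are made up for this example, and the inputs are the first few values of the `confidence` and `secstruct` fields from the example record.

```python
import itertools
import re


def hex_encode(scores):
    """Concatenate integers (0..255) as 2-digit hexadecimal numbers."""
    return ''.join('%02x' % x for x in scores)


def hex_decode(encoded):
    """Read the string back two hex digits at a time."""
    return [int(encoded[i:i + 2], 16) for i in range(0, len(encoded), 2)]


def rle_encode(chars):
    """CIGAR-like run-length encoding: <count><char> for each run."""
    return ''.join(str(sum(1 for _ in g)) + k
                   for k, g in itertools.groupby(chars))


def rle_decode(encoded):
    """Expand every <count><char> pair back into a list of characters."""
    return list(itertools.chain.from_iterable(
        [k] * int(n) for n, k in re.findall(r'(\d+)(.)', encoded)))


print(hex_decode('49545c5f'))                 # [73, 84, 92, 95]
print(''.join(rle_decode('3-2T1-4T2-1S5H')))  # ---TT-TTTT--SHHHHH
```

After decoding, each field should contain one value per residue, so its length can be compared with the sequence length as a sanity check.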

How to run

Run the workflow as follows: `snakemake -c$CPUS --resources unisave=5 --config organism=$ORGANISM`, where `$ORGANISM` is the organism code in the form `UP000000589_10090_MOUSE` (UniProt proteome, taxid, short name) as it appears on the AlphaFold website, and `$CPUS` is the number of threads to allocate for parallel execution.

The default configuration file is `workflow/config.yml`. The most relevant value to change there is the AlphaFold version.
