Glycopeptide Sequence Finder

A tool for identifying and analyzing in silico generated digested glycopeptide sequences from protein FASTA files.

Check out the wiki!

Quick Start Guide

Project Structure

The project follows a standard Python package structure:

Glycopeptide_Sequence_Finder/
├── data/                           # Data directory containing all input/output files
│   ├── digested_peptide_library/  # Output peptide library
│   ├── digested_glycopeptide_library/  # Output glycopeptide library
│   ├── glycan_mass_library/       # Glycan mass library files
│   ├── glycosite_library/         # Output glycosite library
│   ├── logs/                      # Log files
│   ├── mock_mass_spectra/         # Mock mass spectrometry data
│   ├── glycan_library/            # Glycan library files
│   └── test_proteomes/            # Input FASTA files
├── src/                           # Source code directory
│   └── glycopeptide_sequence_finder_cmd.py  # Main script
├── scripts/                       # Utility scripts
│   └── batch_glycopeptide_sequence_finder.sh  # Batch processing script
├── Dockerfile                     # Container configuration
└── requirements.txt               # Python dependencies

Installation

Clone the repository:

git clone https://github.com/RichardDShipman/Glycopeptide_Sequence_Finder.git
cd Glycopeptide_Sequence_Finder

Install dependencies:

pip install -r requirements.txt

Usage

Command Line Interface

Run the script from the project root directory:

python src/glycopeptide_sequence_finder_cmd.py -i data/test_proteomes/your_input.fasta -o data/digested_glycopeptide_library/output.csv

Batch Processing

Run the batch script to process multiple FASTA files:

./scripts/batch_glycopeptide_sequence_finder.sh

Docker

Build the Docker image:

docker build -t glycopeptide-sequence-finder .

Run the container:

docker run -v "$(pwd)/data/test_proteomes:/app/data/test_proteomes" \
           -v "$(pwd)/data/digested_glycopeptide_library:/app/data/digested_glycopeptide_library" \
           glycopeptide-sequence-finder -i /app/data/test_proteomes/phosvitin_uniprotkb_2025_03_20.fasta \
           -o /app/data/digested_glycopeptide_library/phosvitin_output.csv -v

Overview

Welcome to the Glycopeptide Sequence Finder!

This tool aims to facilitate the theoretical in silico study of digested glycopeptide sequences found in various species across the tree of life. It can be used with publicly available curated data from protein sequence databases or by uploading your own protein FASTA sequences files.

Example above: Calculated mass spectrum for glycopeptide Kininogen-1 (P01042) KNG1_HUMAN - ITYSIVQTNCSK - 205 - G62765YT - HexNAc(2)Hex(8), created with plot_mock_mass_spectra.py. Note: This shows basic N-glycan fragmentation, assuming minimal sequential fragmentation from Hex(8) down to the HexNAc(2) core N-glycan.

Glycopeptide Sequence Finder is a Python script that processes protein sequences from a FASTA file to find peptide sequences that may carry glycosylation, the post-translational attachment of glycans (polysaccharides) to proteins. It uses user-specified proteases to digest the protein sequences in silico into peptides. The script then identifies candidate glycopeptides using glycosylation sequons (motifs): the N-sequon "N[^P][STC]" (NX[STC], where X is not P), the O-sequon "[ST]", or the C-sequon "W..[WCF]". It calculates properties of these glycopeptides, including mass, hydrophobicity, and glycosylation sites. The script also gathers information from the input FASTA file to create a predicted digested glycopeptide (peptide sequence backbone) library. The output is written to a CSV file, making it easy to integrate into downstream analyses.

In addition, general proteomics information is written to a directory named digested_peptide_library for other use cases. This may be expanded in the future. The files are too large for GitHub, so run the tool to generate them.

I am currently developing a basic calculation tool for fragment ions of high-mannose N-glycans to simulate the fragmentation behavior of N-glycopeptides under HCD (higher-energy collisional dissociation) conditions. The mock mass spectrum image referenced above was produced with it. This tool is designed to provide a fundamental outline of N-glycopeptide fragmentation, focusing on the generation of fragment ions typically observed in mass spectrometry experiments.

As part of the project, I am also integrating a plotting utility that visualizes the calculated mock fragment ion values, allowing for the comparison of the theoretical results against the mass spectrum data. This enables a clearer understanding of how glycopeptides and their glycans fragment during analysis.

For ease of experimentation and reproducibility, all data used in the calculations and plots, including example glycopeptides and their corresponding glycan compositions, are provided in organized folders. This setup allows for quick reference and testing of different theoretical models and fragmentation pathways.
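As a rough illustration of the Y-ion ladder described above, the sketch below adds glycan residues one at a time onto the protonated peptide. It is not the project's implementation: the function name y_ion_ladder, the residue order, and the rounded monoisotopic masses are assumptions chosen for illustration. The peptide mass is taken from the APOM_HUMAN example row shown later in this README.

```python
# Monoisotopic masses in Da, rounded to 4 decimals (assumed values for this sketch).
PROTON = 1.007276
RESIDUE_MASS = {"HexNAc": 203.0794, "Hex": 162.0528}


def y_ion_ladder(peptide_mass, residues):
    """Return singly charged Y-ion m/z values, from the bare peptide (Y0) upward.

    `residues` lists the glycan residues in the order they are added back
    onto the peptide, starting from the HexNAc attached to the peptide.
    """
    mz = peptide_mass + PROTON  # Y0: intact peptide with no glycan attached
    ladder = {"Y0": round(mz, 4)}
    for i, residue in enumerate(residues, start=1):
        mz += RESIDUE_MASS[residue]
        ladder[f"Y{i}"] = round(mz, 4)
    return ladder


# HexNAc(2)Hex(3) core, built up residue by residue (hypothetical ordering).
core = ["HexNAc", "HexNAc", "Hex", "Hex", "Hex"]
ladder = y_ion_ladder(2474.120595, core)
for label, mz in ladder.items():
    print(label, mz)
```

Only singly charged Y ions are modelled here; B ions and oxonium ions would need their own rules.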

Note: This project is for fun and more of an exploration of glycoproteomic space in silico. Who knows where this may lead.

Protease-Specific Cleavage:
- Supports several commonly used proteases, including:
  - Trypsin: Cleaves after K or R, except if followed by P.
  - Chymotrypsin: Cleaves after F, W, or Y, except if followed by P.
  - Glu-C: Cleaves after E.
  - Lys-C: Cleaves after K.
  - Arg-C: Cleaves after R.
  - Asp-N: Cleaves before D.
  - Pepsin: Cleaves after F, L, W, or Y.
  - Proteinase K: Cleaves after A, F, I, L, V, W, or Y.
  - All: Runs all proteases above.
Missed Cleavages:
- Allows specifying the number of missed cleavages to simulate incomplete digestion.
Peptide Length Control:
- Minimum peptide length filter (default: 5 amino acids)
- Maximum peptide length filter (default: 50 amino acids)
Glycosylation Type:
- Select from N-linked (N), O-linked (O), or C-linked (C) glycopeptides. (Adjust or add sequon)
  - N-linked: N-sequon "N[^P][STC]"
  - O-linked: O-sequon "[ST]"
  - C-linked: C-sequon "W..[WCF]"
  - Others in the works.
Peptide Property Calculation:
- Calculates peptide mass, hydrophobicity, isoelectric point (pI), m/z values for each charge state, and an N-glycan ion fragmentation series (experimental).
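The hydrophobicity values in the example output of this README are consistent with a GRAVY score (the mean Kyte-Doolittle hydropathy of the residues). The sketch below assumes that is what the script reports; the function name gravy is hypothetical, and Biopython's ProteinAnalysis.gravy() computes the same quantity.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(peptide):
    """Return the mean hydropathy of a peptide (GRAVY score)."""
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide) / len(peptide)


# Peptide taken from the APOM_HUMAN example row.
print(round(gravy("TELFSSSCPGGIMLNETGQGYQR"), 5))
```

Positive scores indicate a more hydrophobic peptide, negative scores a more hydrophilic one.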

Requirements

Python 3.7 or later
Libraries:
- biopython
- pandas
- argparse, csv, re (part of the Python standard library; no installation needed)

Installation

Install the required Python libraries using pip:

pip install biopython pandas

Usage

Run the script from the project root with the following arguments:

python src/glycopeptide_sequence_finder_cmd.py -i <input_fasta> [-o <output_csv>] [-p <protease>] [-g <glycosylation>] [-c <missed_cleavages>] [-l <log.txt>] [-v] [-y <glycan_csv>] [-z <max_charge>] [-m <max_peptide_length>] [--min-length <min_length>]

Arguments

-i, --input (required): Path to the input FASTA file.
-o, --output (optional): Path to the output CSV file. If omitted, a default name is generated.
-p, --protease (optional): Protease to use for cleavage. Default is trypsin.
-g, --glycosylation (optional): Glycosylation type whose sequon is searched for in peptides (N, O, or C). Default is N. O and C are experimental.
-c, --missed_cleavages (optional): Number of missed cleavages allowed. Default is 0.
-l, --log (optional): Path to the log file, e.g. log.txt. If omitted, logging is disabled.
-v, --verbose (optional): Enable verbose output. Default is False.
-y, --glycan (optional): Path to the glycan file (CSV format). The default file contains 4 glycans.
-z, --charge (optional): Maximum charge state to compute. Default is 5.
-m, --max_peptide_length (optional): Maximum peptide length after digestion. Default is 50.
--min-length (optional): Minimum peptide length after digestion. Default is 5.

Example

python src/glycopeptide_sequence_finder_cmd.py -i data/test_proteomes/human_uniprotkb_proteome_UP000005640_AND_revi_2025_01_17.fasta -p trypsin -g N -c 0 -z 3 --min-length 6

The output file will be dynamically named:

example_predicted_trypsin_glycopeptides.csv

Example CSV Content

Ion series data can be generated with the --ion-series argument flag; see the example below. This is a prototype and a work in progress, and it is being moved to a separate function.

ProteinID,Site,GlyToucan_AC,Composition,ShorthandGlycan,Peptide,Start,End,Length,Sequon,GlycopeptideMass,PeptideMass,GlycanMass,Hydrophobicity,pI,z2,Charge,IonSeries
sp|O95445|APOM_HUMAN,135.0,G22768VO,HexNAc(2)Hex(3),N2H3,TELFSSSCPGGIMLNETGQGYQR,121.0,143.0,23.0,NET,3690.543457999999,2474.1205949999994,1216.422863,-0.47826,4.26,1846.2790049999996,2,"{'b': [102.055, 231.0975, 344.1816, 491.25, 578.282, 665.3141, 752.3461, 855.3553, 952.4081, 1009.4295, 1066.451, 1179.535, 1310.5755, 1423.6596, 1537.7025, 1666.7451, 1767.7928, 1824.8142, 1952.8728, 2009.8943, 2172.9576, 2301.0162], 'y': [175.119, 303.1775, 466.2409, 523.2623, 651.3209, 708.3424, 809.39, 938.4326, 1052.4756, 1165.5596, 1296.6001, 1409.6842, 1466.7056, 1523.7271, 1620.7799, 1723.789, 1810.8211, 1897.8531, 1984.8851, 2131.9535, 2245.0376, 2374.0802], 'c': [119.0815, 248.124, 361.2081, 508.2765, 595.3085, 682.3406, 769.3726, 872.3818, 969.4346, 1026.456, 1083.4775, 1196.5615, 1327.602, 1440.6861, 1554.729, 1683.7716, 1784.8193, 1841.8407, 1969.8993, 2026.9208, 2189.9841, 2318.0427], 'z': [140.0819, 268.1405, 431.2038, 488.2253, 616.2838, 673.3053, 774.353, 903.3956, 1017.4385, 1130.5226, 1261.563, 1374.6471, 1431.6686, 1488.69, 1585.7428, 1688.752, 1775.784, 1862.816, 1949.8481, 2096.9165, 2210.0005, 2339.0431], 'Y': {'Y0': 2475.1279, 'Y1': 2678.2073, 'Y2': 2881.2867, 'Y3': 3043.3395, 'Y4': 3205.3923, 'Y5': 3367.4451}, '2Y': {'2Y0': 1237.5639, '2Y1': 1339.1036, '2Y2': 1440.6433, '2Y3': 1521.6697, '2Y4': 1602.6961, '2Y5': 1683.7225}, 'B': {'B_HexNAc_1': 204.0867, 'B_HexNAc_2': 407.1661, 'B_Hex_1': 569.2189, 'B_Hex_2': 731.2717, 'B_Hex_3': 893.3245}, 'oxonium': {'ox_HexNAc': 204.0867, 'ox_Hex': 163.0601}}"
sp|P00450|CERU_HUMAN,138.0,G22768VO,HexNAc(2)Hex(3),N2H3,EHEGAIYPDNTTDFQR,129.0,144.0,16.0,NTT,3108.2565080000004,1891.8336450000002,1216.422863,-1.51875,3.95,1555.1355300000002,2,"{'b': [130.0499, 267.1088, 396.1514, 453.1728, 524.2099, 637.294, 800.3573, 897.4101, 1012.437, 1126.48, 1227.5276, 1328.5753, 1443.6023, 1590.6707, 1718.7292], 'y': [175.119, 303.1775, 450.2459, 565.2729, 666.3206, 767.3682, 881.4112, 996.4381, 1093.4909, 1256.5542, 1369.6383, 1440.6754, 1497.6968, 1626.7394, 1763.7983], 'c': [147.0764, 284.1353, 413.1779, 470.1993, 541.2364, 654.3205, 817.3838, 914.4366, 1029.4635, 1143.5065, 1244.5541, 1345.6018, 1460.6288, 1607.6972, 1735.7557], 'z': [140.0819, 268.1405, 415.2089, 530.2358, 631.2835, 732.3312, 846.3741, 961.401, 1058.4538, 1221.5171, 1334.6012, 1405.6383, 1462.6598, 1591.7024, 1728.7613], 'Y': {'Y0': 1892.8409, 'Y1': 2095.9203, 'Y2': 2298.9997, 'Y3': 2461.0525, 'Y4': 2623.1053, 'Y5': 2785.1581}, '2Y': {'2Y0': 946.4205, '2Y1': 1047.9602, '2Y2': 1149.4999, '2Y3': 1230.5263, '2Y4': 1311.5527, '2Y5': 1392.5791}, 'B': {'B_HexNAc_1': 204.0867, 'B_HexNAc_2': 407.1661, 'B_Hex_1': 569.2189, 'B_Hex_2': 731.2717, 'B_Hex_3': 893.3245}, 'oxonium': {'ox_HexNAc': 204.0867, 'ox_Hex': 163.0601}}"
sp|P00450|CERU_HUMAN,227.0,G22768VO,HexNAc(2)Hex(3),N2H3,EFVVMFSVVDENFSWYLEDNIK,216.0,237.0,22.0,NFS,3925.6900780000005,2709.2672150000003,1216.422863,0.14545,3.39,1963.8523150000003,2,"{'b': [130.0499, 277.1183, 376.1867, 475.2551, 606.2956, 753.364, 840.396, 939.4644, 1038.5328, 1153.5598, 1282.6024, 1396.6453, 1543.7137, 1630.7457, 1816.8251, 1979.8884, 2092.9724, 2222.015, 2337.042, 2451.0849, 2564.169], 'y': [147.1128, 260.1969, 374.2398, 489.2667, 618.3093, 731.3934, 894.4567, 1080.536, 1167.5681, 1314.6365, 1428.6794, 1557.722, 1672.7489, 1771.8173, 1870.8857, 1957.9178, 2104.9862, 2236.0267, 2335.0951, 2434.1635, 2581.2319], 'c': [147.0764, 294.1448, 393.2132, 492.2816, 623.3221, 770.3905, 857.4225, 956.4909, 1055.5593, 1170.5863, 1299.6289, 1413.6718, 1560.7402, 1647.7722, 1833.8516, 1996.9149, 2109.9989, 2239.0415, 2354.0685, 2468.1114, 2581.1955], 'z': [112.0757, 225.1598, 339.2027, 454.2297, 583.2723, 696.3563, 859.4196, 1045.499, 1132.531, 1279.5994, 1393.6423, 1522.6849, 1637.7119, 1736.7803, 1835.8487, 1922.8807, 2069.9491, 2200.9896, 2300.058, 2399.1264, 2546.1948], 'Y': {'Y0': 2710.2745, 'Y1': 2913.3539, 'Y2': 3116.4333, 'Y3': 3278.4861, 'Y4': 3440.5389, 'Y5': 3602.5917}, '2Y': {'2Y0': 1355.1372, '2Y1': 1456.6769, '2Y2': 1558.2166, '2Y3': 1639.243, '2Y4': 1720.2694, '2Y5': 1801.2958}, 'B': {'B_HexNAc_1': 204.0867, 'B_HexNAc_2': 407.1661, 'B_Hex_1': 569.2189, 'B_Hex_2': 731.2717, 'B_Hex_3': 893.3245}, 'oxonium': {'ox_HexNAc': 204.0867, 'ox_Hex': 163.0601}}"
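The z2 column in the rows above is consistent with the usual conversion from neutral mass to m/z, (M + z * proton) / z. The sketch below assumes that formula and a monoisotopic proton mass of 1.007276 Da; the function name mz is hypothetical.

```python
PROTON = 1.007276  # monoisotopic proton mass in Da (assumed value)


def mz(neutral_mass, charge):
    """Return the m/z of an ion carrying `charge` protons."""
    return (neutral_mass + charge * PROTON) / charge


# GlycopeptideMass of the APOM_HUMAN row, evaluated for charge states 1 to 3.
glycopeptide_mass = 3690.543458
for z in range(1, 4):
    print(z, round(mz(glycopeptide_mass, z), 6))
```

For z = 2 this gives the value reported in the z2 column of that row.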

Protease Rules

The following proteases are supported:

Protease	Cleavage Rule
Trypsin	After K or R, unless followed by P
Chymotrypsin	After F, W, or Y, unless followed by P
Glu-C	After E
Lys-C	After K
Arg-C	After R
Asp-N	Before D
Pepsin	After F, L, W, or Y
Proteinase K	After A, F, I, L, V, W, or Y
All	Runs all proteases above
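The cleavage rules in the table above can be written as zero-width regular expressions. The sketch below is one possible way to do that, not the script's actual code: the names RULES and digest, the dictionary keys, and the 1-based inclusive positions are assumptions. It also shows how missed cleavages and the length filter could be applied.

```python
import re

# Each rule matches the empty string at a cleavage position.
RULES = {
    "trypsin": r"(?<=[KR])(?!P)",
    "chymotrypsin": r"(?<=[FWY])(?!P)",
    "glu-c": r"(?<=E)",
    "lys-c": r"(?<=K)",
    "arg-c": r"(?<=R)",
    "asp-n": r"(?=D)",
    "pepsin": r"(?<=[FLWY])",
    "proteinase k": r"(?<=[AFILVWY])",
}


def digest(sequence, protease="trypsin", missed_cleavages=0,
           min_length=5, max_length=50):
    """Yield (start, end, peptide) tuples with 1-based inclusive positions."""
    cuts = [m.start() for m in re.finditer(RULES[protease], sequence)]
    # Fragment boundaries: sequence start, every cut site, sequence end.
    bounds = sorted({0, len(sequence), *cuts})
    for i in range(len(bounds) - 1):
        # Joining k + 1 neighbouring fragments corresponds to k missed cleavages.
        last = min(i + 2 + missed_cleavages, len(bounds))
        for j in range(i + 1, last):
            peptide = sequence[bounds[i]:bounds[j]]
            if min_length <= len(peptide) <= max_length:
                yield bounds[i] + 1, bounds[j], peptide


# Toy sequence: the R at position 11 is followed by P, so trypsin skips it.
toy = "GASGKLLNVTRPEEEEKAAAAA"
for start, end, peptide in digest(toy, "trypsin"):
    print(start, end, peptide)
```

Raising missed_cleavages to 1 additionally yields the two peptides that span a single uncut site.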

Glycosylation Type Rules

The following glycosylation types and their sequons (motifs) are supported:

Glycosylation Type	Sequon Pattern
N-linked	N[^P][STC]
O-linked	[ST]
C-linked	W..[WCF]
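The sequon patterns above are ordinary regular expressions, so candidate sites can be located with Python's re module. The sketch below is a minimal example under assumptions: the names SEQUONS and find_sites are hypothetical, and the site is reported as a 1-based protein position derived from the peptide's start position.

```python
import re

SEQUONS = {"N": r"N[^P][STC]", "O": r"[ST]", "C": r"W..[WCF]"}


def find_sites(peptide, peptide_start, glycosylation="N"):
    """Return (protein_position, matched_sequon) pairs for one peptide."""
    # Wrapping the pattern in a lookahead keeps overlapping sequons,
    # such as the two in "NNTS", from being skipped.
    pattern = re.compile(f"(?=({SEQUONS[glycosylation]}))")
    return [(peptide_start + m.start(), m.group(1))
            for m in pattern.finditer(peptide)]


# Peptide and start position taken from the APOM_HUMAN example row.
print(find_sites("TELFSSSCPGGIMLNETGQGYQR", 121))
```

The peptide's Start value plus the offset of the match gives the Site column of the example output.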

Glycan Library

The default glycan mass library is defined as a DataFrame containing a set of glycans with their respective compositions and masses. This library is used to calculate the properties of glycopeptides. Edit this definition in the script to change the glycan mass library.

default_glycan_library = pd.DataFrame([
    #{"glytoucan_ac": "G80920RR", "byonic": "HexNAc(2)Hex(9) % 1864.634157", "composition": "HexNAc(2)Hex(9)", "mass": 1864.634157}, # N2H9
    {"glytoucan_ac": "G62765YT", "byonic": "HexNAc(2)Hex(8) % 1702.581333", "composition": "HexNAc(2)Hex(8)", "mass": 1702.581333}, # N2H8
    #{"glytoucan_ac": "G31852PQ", "byonic": "HexNAc(2)Hex(7) % 1540.528510", "composition": "HexNAc(2)Hex(7)", "mass": 1540.528510}, # N2H7
    #{"glytoucan_ac": "G41247ZX", "byonic": "HexNAc(2)Hex(6) % 1378.475686", "composition": "HexNAc(2)Hex(6)", "mass": 1378.475686}, # N2H6
])

Other libraries are included in the file. Feel free to expand them to meet your needs.

This DataFrame includes the following columns:

glytoucan_ac: The GlyTouCan accession number of the glycan.
byonic: The glycan composition and mass in Byonic format (composition % mass).
composition: The glycan composition.
mass: The mass of the glycan.

The default glycan mass library can be expanded or customized as needed for specific analyses.

Example glycan mass library data

This data is stored in the glycan_mass_library directory.

glytoucan_ac,byonic,composition,converted_glycan,glycan_composition_sequence
G62765YT,HexNAc(2)Hex(8) % 1702.581333,HexNAc(2)Hex(8),N2H8,NNHHHHHHHH
G31852PQ,HexNAc(2)Hex(7) % 1540.528510,HexNAc(2)Hex(7),N2H7,NNHHHHHHH
G41247ZX,HexNAc(2)Hex(6) % 1378.475686,HexNAc(2)Hex(6),N2H6,NNHHHHHH

Plot Mock Mass Spectrum

plot_mock_mass_spectra.py

Usage Guide

This script generates mock mass spectrum plots from glycopeptide ion series stored in a CSV file. It processes ion series data, assigns colors to different ion types, computes ion numbers, and generates labeled mass spectrum plots.

Ion labels are automatically assigned based on their types.
If a peptide sequence exceeds 50 characters, it will be skipped.
The script ensures unique filenames for each output image.

Arguments:

-i, --input (required): Path to the input CSV file containing glycopeptide ion series data.
-o, --output (optional): Directory to save the generated plots (default: mock_mass_spectra directory).

Input CSV Format

The CSV file must contain the following required columns:

IonSeries: Dictionary-like string containing ion data (b, y, Y, B, oxonium).
ProteinID: Identifier for the protein.
Peptide: Peptide sequence.
Composition: Glycan composition.
GlyToucan_AC: GlyToucan accession number.

python plot_mock_mass_spectra.py -i digested_glycopeptide_library/pig_uniprotkb_proteome_UP000008227_AND_revi_2025_02_01_trypsin_digested_mc0_z2_N-glycopeptides.csv

This will process the given CSV file and save the mass spectrum plots in the default mock_mass_spectra directory.
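Because the IonSeries column holds a dictionary-like string, it has to be parsed before plotting. The sketch below shows one way to do that with ast.literal_eval; the function name flatten_ions and the label scheme are assumptions, and the sample string is a shortened version of the example rows in this README.

```python
import ast

# Shortened IonSeries value; real rows contain many more ions per series.
ion_series_text = (
    "{'b': [102.055, 231.0975], 'y': [175.119, 303.1775], "
    "'Y': {'Y0': 2475.1279, 'Y1': 2678.2073}, "
    "'oxonium': {'ox_HexNAc': 204.0867}}"
)


def flatten_ions(text):
    """Turn an IonSeries string into a sorted list of (m/z, label) pairs."""
    # literal_eval only accepts Python literals, so it is safer than eval.
    series = ast.literal_eval(text)
    peaks = []
    for ion_type, values in series.items():
        if isinstance(values, dict):
            # Glycan series (Y, B, oxonium) already carry their own labels.
            peaks.extend((mz, label) for label, mz in values.items())
        else:
            # Peptide series (b, y, c, z) are numbered from 1 by position.
            peaks.extend((mz, f"{ion_type}{i}")
                         for i, mz in enumerate(values, start=1))
    return sorted(peaks)


peaks = flatten_ions(ion_series_text)
for mz, label in peaks:
    print(f"{label:>10}  {mz}")
```

The resulting pairs could then be passed to any plotting library as stem positions and annotations.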

Glycan Composition to Sequence Converter

This Python script converts glycan composition data into a glycan sequence. It takes an input CSV file containing a column with glycan composition (e.g., N2H3F1) and generates a glycan sequence, expanding the monosaccharides according to their counts. The results are saved to a new CSV file.

Usage

python glycan_converter.py -i <input_file> -o <output_file> -g <glycan_column>

-i, --input: Path to the input CSV file.
-o, --output: Path to save the output CSV file (default: input filename with "_composition_sequences.csv" suffix).
-g, --glycan: The name of the column containing glycan composition data.

Requirements

•	pandas
•	re (built-in Python module)

Example

python glycan_converter.py -i glycans.csv -g glycan_composition
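The expansion itself is a small string operation. The sketch below shows the idea for shorthand compositions such as N2H8; the function name composition_to_sequence is hypothetical, and it assumes single-letter monosaccharide codes each followed by a count.

```python
import re


def composition_to_sequence(shorthand):
    """Expand a shorthand composition, e.g. 'N2H8' -> 'NNHHHHHHHH'."""
    pairs = re.findall(r"([A-Za-z])(\d+)", shorthand)
    return "".join(letter * int(count) for letter, count in pairs)


for shorthand in ("N2H8", "N2H3F1"):
    print(shorthand, composition_to_sequence(shorthand))
```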

Create Glycan Mass Library

create_glycan_library.py

This script processes glycan data from a CSV file and splits it into columns based on specific formatting rules.

Overview

The script takes an input CSV file with glycan data formatted in two columns:

glytoucan_ac: The GlyTouCan accession number of the glycan.
byonic / byonic_sequence: A composite column containing two pieces of information separated by a '%' character:
- The glycan composition.
- A numerical value representing mass.

The script splits the byonic column into its constituent parts, so the output file contains three columns:

glytoucan_ac: The GlyTouCan accession number.
composition: The glycan composition.
mass: The numerical mass value.

Key Features

Input File Format

The input CSV file must provide the GlyTouCan accession in the first column and the Byonic entry in the second, formatted as <composition> % <mass>.

Example:

"glytoucan_ac","sequence_byonic","name_source"
"G00002CF","Hex(2)NeuGc(2) % 956.29687423","G93218EI"
"G00009BX","HexNAc(2)Hex(2)dHex(1) % 894.33286578","G93579XB"
"G00012MO","HexNAc(1)Hex(3) % 707.248407805","G08590QR"

Output File Format

The output file will have three columns:

The first column (glytoucan_ac) remains unchanged.
The second column contains the glycan composition from byonic.
The third column contains the numerical mass value.

Example:

glytoucan_ac,byonic,composition,converted_glycan,glycan_composition_sequence,composition,mass
G62765YT,HexNAc(2)Hex(8) % 1702.581333,HexNAc(2)Hex(8),N2H8,NNHHHHHHHH,HexNAc(2)Hex(8),1702.581333
G31852PQ,HexNAc(2)Hex(7) % 1540.528510,HexNAc(2)Hex(7),N2H7,NNHHHHHHH,HexNAc(2)Hex(7),1540.528510
G41247ZX,HexNAc(2)Hex(6) % 1378.475686,HexNAc(2)Hex(6),N2H6,NNHHHHHH,HexNAc(2)Hex(6),1378.475686
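The split on '%' can be sketched with the standard library alone. This is an illustration rather than the code in create_glycan_library.py: the function name split_byonic is hypothetical, and the column name sequence_byonic is taken from the input example in this README.

```python
import csv
import io


def split_byonic(entry):
    """Split 'HexNAc(2)Hex(8) % 1702.581333' into (composition, mass)."""
    composition, _, mass = entry.partition("%")
    return composition.strip(), float(mass)


# Two rows from the input example, read from memory instead of a file.
sample = '''"glytoucan_ac","sequence_byonic","name_source"
"G00002CF","Hex(2)NeuGc(2) % 956.29687423","G93218EI"
"G00012MO","HexNAc(1)Hex(3) % 707.248407805","G08590QR"
'''

rows = []
for record in csv.DictReader(io.StringIO(sample)):
    composition, mass = split_byonic(record["sequence_byonic"])
    rows.append({
        "glytoucan_ac": record["glytoucan_ac"],
        "composition": composition,
        "mass": mass,
    })

for row in rows:
    print(row)
```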

The output CSV will contain glycans ranked by their weighted adjusted HF score.

Batch Processing Scripts

Shell scripts for batch processing.

Batch Run for FASTA Processing

batch_glycopeptide_sequence_finder.sh

To process multiple FASTA files in parallel using all proteases, run the following command:

./batch_glycopeptide_sequence_finder.sh

Parameters can be adjusted in the shell script.

Parameters

ls test_proteomes/*.fasta: Lists all FASTA files in the test_proteomes directory.
xargs -I {} -P 4: Executes the command in parallel with up to 4 processes. The {} is a placeholder for each file name.
python glycopeptide_finder_cmd.py: The script to run for each FASTA file.
-i "{}": Specifies the input FASTA file, where {} is replaced by each file name.
-p all: Uses all proteases for cleavage.
-g N: Searches for N-linked glycosylation sequons.
-c 0: Allows 0 missed cleavages.
-v: Enables verbose output.

This command allows you to efficiently process multiple FASTA files in parallel, reducing the overall processing time.

Merging CSV Files

The script includes a function to merge all CSV files from a specified directory into a single CSV file. This can be useful for consolidating the results of multiple digestions into one file for easier analysis.

python merge_digested_glycopeptide_library.py
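For orientation, merging a directory of CSV files can be done roughly as follows. This is a standard-library sketch, not the contents of merge_digested_glycopeptide_library.py; the function name merge_csv_files is hypothetical, and it assumes every input file has the same columns.

```python
import csv
from pathlib import Path


def merge_csv_files(input_dir, output_file):
    """Concatenate every CSV in input_dir into output_file with one header.

    Returns the number of data rows written.
    """
    output_file = Path(output_file)
    # Collect inputs first and skip the output file in case it lives
    # in the directory being merged.
    paths = sorted(p for p in Path(input_dir).glob("*.csv")
                   if p.resolve() != output_file.resolve())
    rows_written = 0
    with output_file.open("w", newline="") as out:
        writer = None
        for path in paths:
            with path.open(newline="") as handle:
                reader = csv.DictReader(handle)
                if reader.fieldnames is None:  # empty file, nothing to copy
                    continue
                if writer is None:
                    # The first non-empty file defines the header.
                    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                    writer.writeheader()
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Rows are streamed one at a time, so large digests do not need to fit in memory.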

Machine Learning (experimental)

Explore machine learning space with the glycopeptide data above. Experimental, in development.

Glycopeptide One-Hot Encoding Script

This script encodes glycopeptide data into one-hot encoded feature vectors. It handles peptide sequences, glycan compositions, and charge states, generating a feature vector for each glycopeptide in the input data. This will be utilized for future Y class labels in glycoproteomics machine learning.

Features

Peptide Encoding: Encodes the 20 standard amino acids into a 50x20 matrix.
Glycan Encoding: Encodes a simplified set of monosaccharides (N, H, F, A) into a 30x4 matrix.
Charge State Encoding: Encodes charge states (1-10) into a one-hot vector.
CSV Output: Generates an encoded dataset and an encoding definition CSV file.
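The matrix sizes listed above imply a flattened feature vector of 50*20 + 30*4 + 10 = 1130 values. The sketch below illustrates that layout with plain Python lists so it runs without numpy; the alphabet order, the zero padding, and the function names one_hot and encode are assumptions and may differ from encode_glycopeptides.py.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids (assumed order)
MONOSACCHARIDES = "NHFA"              # simplified monosaccharide alphabet


def one_hot(sequence, alphabet, max_length):
    """Return a max_length x len(alphabet) matrix of 0/1, zero-padded."""
    if len(sequence) > max_length:
        raise ValueError(f"sequence longer than {max_length}")
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in range(max_length)]
    for row, symbol in enumerate(sequence):
        matrix[row][index[symbol]] = 1
    return matrix


def encode(peptide, glycan_sequence, charge):
    """Flatten peptide (50x20), glycan (30x4), and charge (10) into one vector."""
    peptide_bits = [b for row in one_hot(peptide, AMINO_ACIDS, 50) for b in row]
    glycan_bits = [b for row in one_hot(glycan_sequence, MONOSACCHARIDES, 30)
                   for b in row]
    charge_bits = [1 if z == charge else 0 for z in range(1, 11)]
    return peptide_bits + glycan_bits + charge_bits


vector = encode("ITYSIVQTNCSK", "NNHHHHHHHH", 2)
print(len(vector), sum(vector))
```

Each residue, each monosaccharide, and the charge state contribute exactly one set bit, which gives a quick sanity check on the encoded vector.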

Usage

python encode_glycopeptides.py -i input.csv -o output.csv -d encoding_definition.csv

-i : Input CSV file with peptide, glycan, and charge data.
-o : Output CSV file for encoded data.
-d : Output CSV for encoding definitions.

Dependencies

numpy
pandas
argparse
re

Example

python glycopeptide_finder_cmd.py -i test_proteomes/human_uniprotkb_proteome_UP000005640_AND_revi_2025_01_17.fasta -p trypsin -g N -c 0 -z 3

The output file will be dynamically named:

example_predicted_trypsin_glycopeptides.csv

Example CSV Content

ProteinID,Site,GlyToucan_AC,Composition,ShorthandGlycan,Peptide,Start,End,Length,Sequon,GlycopeptideMass,PeptideMass,GlycanMass,Hydrophobicity,pI,z2,Charge,IonSeries,Glycan_Composition_Sequence,One_Hot_Encoding

Glycan Library

The default glycan mass library is defined as a DataFrame containing a set of glycans with their respective compositions and masses. This library is used to calculate the properties of glycopeptides. Alter if you wish to change the glycan mass library in the script

default_glycan_library = pd.DataFrame([
    #{"glytoucan_ac": "G80920RR", "byonic": "HexNAc(2)Hex(9) % 1864.634157", "composition": "HexNAc(2)Hex(9)", "mass": 1864.634157}, # N2H9
    {"glytoucan_ac": "G62765YT", "byonic": "HexNAc(2)Hex(8) % 1702.581333", "composition": "HexNAc(2)Hex(8)", "mass": 1702.581333}, # N2H8
    #{"glytoucan_ac": "G31852PQ", "byonic": "HexNAc(2)Hex(7) % 1540.528510", "composition": "HexNAc(2)Hex(7)", "mass": 1540.528510}, # N2H7
    #{"glytoucan_ac": "G41247ZX", "byonic": "HexNAc(2)Hex(6) % 1378.475686", "composition": "HexNAc(2)Hex(6)", "mass": 1378.475686}, # N2H6
])

Other libraries in file. Feel free to expand to meet needs of user.

This DataFrame includes the following columns:

glytoucan_ac: The glycosylation context identifier.
byonic: The peptide sequence and mass in Byonic format.
composition: The glycan composition.
mass: The mass of the glycan.

The default glycan mass library can be expanded or customized as needed for specific analyses.

Example glycan mass library data

This data is stored in the glycan_mass_library directory.

glytoucan_ac,byonic,composition,converted_glycan,glycan_composition_sequence
G62765YT,HexNAc(2)Hex(8) % 1702.581333,HexNAc(2)Hex(8),N2H8,NNHHHHHHHH
G31852PQ,HexNAc(2)Hex(7) % 1540.528510,HexNAc(2)Hex(7),N2H7,NNHHHHHHH
G41247ZX,HexNAc(2)Hex(6) % 1378.475686,HexNAc(2)Hex(6),N2H6,NNHHHHHH

Plot Mock Mass Spectrum

plot_mock_mass_spectra.py

Usage Guide

This script generates mock mass spectrum plots from glycopeptide ion series stored in a CSV file. It processes ion series data, assigns colors to different ion types, computes ion numbers, and generates labeled mass spectrum plots.

Ion labels are automatically assigned based on their types.
If a peptide sequence exceeds 50 characters, it will be skipped.
The script ensures unique filenames for each output image.

Arguments:

-i, --input (required): Path to the input CSV file containing glycopeptide ion series data.
-o, --output (optional): Directory to save the generated plots (default: mock_mass_spectra directory).

Input CSV Format

The CSV file must contain the following required columns:

IonSeries: Dictionary-like string containing ion data (b, y, Y, B, oxonium).
ProteinID: Identifier for the protein.
Peptide: Peptide sequence.
Composition: Glycan composition.
GlyToucan_AC: GlyToucan accession number.

python plot_mock_mass_spectra.py -i data/digested_glycopeptide_library/pig_uniprotkb_proteome_UP000008227_AND_revi_2025_02_01_trypsin_digested_mc0_z2_N-glycopeptides.csv

This will process example_data.csv and save the mass spectrum plots in the results/ directory.

Glycan Composition to Sequence Converter

This Python script converts glycan composition data into a glycan sequence. It takes an input CSV file containing a column with glycan composition (e.g., N2H3F1) and generates a glycan sequence, expanding the monosaccharides according to their counts. The results are saved to a new CSV file.

Usage

python glycan_converter.py -i <input_file> -o <output_file> -g <glycan_column>

-i, --input: Path to the input CSV file.
-o, --output: Path to save the output CSV file (default: input filename with “_composition_sequences.csv” suffix).
-g, --glycan: The name of the column containing glycan composition data.

Requirements

•	pandas
•	re (built-in Python module)

Example

python glycan_converter.py -i glycans.csv -g glycan_composition

Create Glycan Mass Library

create_glycan_library.py

This script processes glycan data from a CSV file and splits it into columns based on specific formatting rules.

Overview

The script takes an input CSV file with glyc肽 data formatted in two columns:

glytoucan_ac: A string representing the glycosylation context.
byonic / byonic_sequence: A composite column containing two pieces of information separated by a '%' character:
- The peptide sequence (composition).
- A numerical value representing mass.

The script splits the byonic column into its constituent parts, creating three new columns in the output file:

glytoucan_ac: The glycosylation context.
composition: The peptide sequence.
mass: The numerical mass value.

Key Features

Input File Format

The input CSV file must have exactly two columns per row, with the second column formatted as <peptide> % <mass>.

Example:

"glytoucan_ac","sequence_byonic","name_source"
"G00002CF","Hex(2)NeuGc(2) % 956.29687423","G93218EI"
"G00009BX","HexNAc(2)Hex(2)dHex(1) % 894.33286578","G93579XB"
"G00012MO","HexNAc(1)Hex(3) % 707.248407805","G08590QR"

Output File Format

The output file will have three columns:

The first column (glytoucan_ac) remains unchanged.
The second column contains the peptide sequence from byonic.
The third column contains the numerical mass value.

Example:

glytoucan_ac,byonic,composition,converted_glycan,glycan_composition_sequence,composition,mass
G62765YT,HexNAc(2)Hex(8) % 1702.581333,HexNAc(2)Hex(8),N2H8,NNHHHHHHHH,HexNAc(2)Hex(8),1702.581333
G31852PQ,HexNAc(2)Hex(7) % 1540.528510,HexNAc(2)Hex(7),N2H7,NNHHHHHHH,HexNAc(2)Hex(7),1540.528510
G41247ZX,HexNAc(2)Hex(6) % 1378.475686,HexNAc(2)Hex(6),N2H6,NNHHHHHH,HexNAc(2)Hex(6),1378.475686

The output CSV will contain glycans ranked by their weighted adjusted HF score.

Batch Processing Scripts

Shell scripts for batch processing.

Batch Run for FASTA Processing

batch_glycopeptide_sequence_finder.sh

To process multiple FASTA files in parallel using all proteases, run the following command:

./batch_glycopeptide_sequence_finder.sh

Parameters can be adjusted in the shell script.

Parameters

ls test_proteomes/*.fasta: Lists all FASTA files in the test_proteomes directory.
xargs -I {} -P 4: Executes the command in parallel with up to 4 processes. The {} is a placeholder for each file name.
python glycopeptide_finder_cmd.py: The script to run for each FASTA file.
-i "{}": Specifies the input FASTA file, where {} is replaced by each file name.
-p all: Uses all proteases for cleavage.
-g N: Searches for N-linked glycosylation sequons.
-c 0: Allows 0 missed cleavages.
-v: Enables verbose output.

This command allows you to efficiently process multiple FASTA files in parallel, reducing the overall processing time.
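
For environments without xargs, the same fan-out can be approximated in Python. This is a sketch under assumptions: build_command and run_batch are hypothetical helper names, the flags are the ones listed under Parameters, and the script name and FASTA directory are taken from that list and may need adjusting to your checkout.

```python
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(fasta_path, script="glycopeptide_finder_cmd.py"):
    """Return the argument list for one FASTA file, using the flags listed above."""
    return ["python", script, "-i", fasta_path, "-p", "all", "-g", "N", "-c", "0", "-v"]


def run_batch(pattern="test_proteomes/*.fasta", workers=4):
    """Run the finder on every matching FASTA file, at most `workers` at a time."""
    commands = [build_command(path) for path in sorted(glob.glob(pattern))]
    # Threads are sufficient here because the real work happens in child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda cmd: subprocess.run(cmd, check=False), commands)
        return [result.returncode for result in results]


# Show the command that would be run for a single (hypothetical) file.
print(build_command("test_proteomes/example.fasta"))
```

Calling run_batch() from the directory that contains the script returns one exit code per FASTA file, so failed digestions can be spotted without reading every log.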

Merging CSV Files

The script includes a function to merge all CSV files from a specified directory into a single CSV file. This can be useful for consolidating the results of multiple digestions into one file for easier analysis.

python merge_digested_glycopeptide_library.py
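
The merge step can be pictured with the sketch below. It is not the code in merge_digested_glycopeptide_library.py; merge_csv_files is a hypothetical helper, and it assumes every CSV file in the directory shares the same header row.

```python
import csv
import glob
import os


def merge_csv_files(input_dir, output_path):
    """Concatenate every *.csv in input_dir into output_path; return data rows written."""
    paths = sorted(glob.glob(os.path.join(input_dir, "*.csv")))
    rows_written = 0
    header = None
    with open(output_path, "w", newline="") as out_handle:
        writer = csv.writer(out_handle)
        for path in paths:
            if os.path.abspath(path) == os.path.abspath(output_path):
                continue  # never read the file that is being written
            with open(path, newline="") as in_handle:
                reader = csv.reader(in_handle)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)  # write the header once
                elif file_header != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Raising on a header mismatch is a deliberate choice in this sketch: silently concatenating files from different runs would misalign columns in the merged library.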

Dockerfile

Docker Setup for Glycopeptide Sequence Finder

This section explains how to build and run the Docker container for the Glycopeptide Sequence Finder.

Build the Docker Image

To create the Docker image, run the following command in the directory containing your Dockerfile and requirements.txt:

docker build -t gsf .

This will:

Use the official Python 3.10-slim image.
Set /app as the working directory.
Install dependencies from requirements.txt.
Copy the glycopeptide_sequence_finder_cmd.py script into the container.
Set the entrypoint so that the script can be executed with arguments.

Run the Docker Container

To execute the script with test data, use:

docker run --rm \
    -v "$(pwd)/test_proteomes:/app/test_proteomes" \
    -v "$(pwd)/output:/app/digested_glycopeptide_library" \
    gsf \
    -i test_proteomes/apple_uniprotkb_proteome_UP000290289_AND_revi_2025_02_04.fasta \
    -g N \
    -o digested_glycopeptide_library/test.csv \
    -p chymotrypsin \
    -c 0 \
    -v

Explanation of Flags:

--rm → Removes the container after execution.
-v "$(pwd)/test_proteomes:/app/test_proteomes" → Mounts the input FASTA files.
-v "$(pwd)/output:/app/digested_glycopeptide_library" → Mounts the output directory.
gsf → Runs the built image.
-i → Specifies the input FASTA file.
-g → Sets the glycosylation type (default: N).
-o → Defines the output file.
-p → Specifies the protease (e.g., chymotrypsin).
-c → Defines the missed cleavages.
-v → Enables verbose mode.

Access the Output

The output files will be saved in the mounted directory on your local machine:

ls output/

Because the local output directory is mounted at /app/digested_glycopeptide_library, your results should be in output/test.csv.

Machine Learning (experimental)

Explore machine learning space with the glycopeptide data above. Experimental, in development.

Glycopeptide One-Hot Encoding Script

This script encodes glycopeptide data into one-hot encoded feature vectors. It handles peptide sequences, glycan compositions, and charge states, generating one feature vector for each glycopeptide in the input data. The encoded vectors are intended for use as Y class labels in future glycoproteomics machine learning work.

Features

Peptide Encoding: Encodes up to 50 residue positions over the 20 standard amino acids as a 50x20 matrix.
Glycan Encoding: Encodes up to 30 positions over a simplified set of monosaccharides (N, H, F, A) as a 30x4 matrix.
Charge State Encoding: Encodes charge states (1-10) as a 10-element one-hot vector.
CSV Output: Generates an encoded dataset and an encoding definition CSV file.
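
The three encodings listed above can be sketched without any dependencies. This is an illustration rather than the code in encode_glycopeptides.py: the function names are hypothetical, and the flattening order (peptide block, then glycan block, then charge block, each position-major) and the rejection of over-long sequences are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, indices 0-19
MONOSACCHARIDES = "NHFA"              # simplified glycan alphabet, indices 0-3
MAX_PEPTIDE_LEN = 50
MAX_GLYCAN_LEN = 30
MAX_CHARGE = 10


def one_hot_sequence(sequence, alphabet, max_len):
    """Flatten a max_len x len(alphabet) one-hot matrix, zero-padded on the right."""
    if len(sequence) > max_len:
        raise ValueError(f"sequence is longer than {max_len} positions")
    vector = [0.0] * (max_len * len(alphabet))
    for position, symbol in enumerate(sequence):
        # alphabet.index raises ValueError for symbols outside the alphabet.
        vector[position * len(alphabet) + alphabet.index(symbol)] = 1.0
    return vector


def encode_glycopeptide(peptide, glycan_sequence, charge):
    """Concatenate the peptide, glycan and charge encodings into one flat vector."""
    if not 1 <= charge <= MAX_CHARGE:
        raise ValueError(f"charge must be between 1 and {MAX_CHARGE}")
    charge_vector = [0.0] * MAX_CHARGE
    charge_vector[charge - 1] = 1.0
    return (
        one_hot_sequence(peptide, AMINO_ACIDS, MAX_PEPTIDE_LEN)
        + one_hot_sequence(glycan_sequence, MONOSACCHARIDES, MAX_GLYCAN_LEN)
        + charge_vector
    )


vector = encode_glycopeptide("TELFSSSCPGGIMLNETGQGYQR", "NNHHH", 2)
print(len(vector), sum(vector))  # 1130 29.0
```

The vector length is 50 x 20 + 30 x 4 + 10 = 1130, and the number of ones equals the peptide length plus the glycan length plus one for the charge state.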

Usage

python encode_glycopeptides.py -i input.csv -o output.csv -d encoding_definition.csv

-i : Input CSV file with peptide, glycan, and charge data.
-o : Output CSV file for encoded data.
-d : Output CSV for encoding definitions.

Dependencies

numpy
pandas
argparse
re

Output

ProteinID,Site,GlyToucan_AC,Composition,ShorthandGlycan,Peptide,Start,End,Length,Sequon,GlycopeptideMass,PeptideMass,GlycanMass,Hydrophobicity,pI,z2,Charge,IonSeries,Glycan_Composition_Sequence,One_Hot_Encoding
sp|O95445|APOM_HUMAN,135.0,G22768VO,HexNAc(2)Hex(3),N2H3,TELFSSSCPGGIMLNETGQGYQR,121.0,143.0,23.0,NET,3690.543457999999,2474.1205949999994,1216.422863,-0.47826,4.26,1846.2790049999996,2,"{'b': [102.055, 231.0975, 344.1816, 491.25, 578.282, 665.3141, 752.3461, 855.3553, 952.4081, 1009.4295, 1066.451, 1179.535, 1310.5755, 1423.6596, 1537.7025, 1666.7451, 1767.7928, 1824.8142, 1952.8728, 2009.8943, 2172.9576, 2301.0162], 'y': [175.119, 303.1775, 466.2409, 523.2623, 651.3209, 708.3424, 809.39, 938.4326, 1052.4756, 1165.5596, 1296.6001, 1409.6842, 1466.7056, 1523.7271, 1620.7799, 1723.789, 1810.8211, 1897.8531, 1984.8851, 2131.9535, 2245.0376, 2374.0802], 'c': [119.0815, 248.124, 361.2081, 508.2765, 595.3085, 682.3406, 769.3726, 872.3818, 969.4346, 1026.456, 1083.4775, 1196.5615, 1327.602, 1440.6861, 1554.729, 1683.7716, 1784.8193, 1841.8407, 1969.8993, 2026.9208, 2189.9841, 2318.0427], 'z': [140.0819, 268.1405, 431.2038, 488.2253, 616.2838, 673.3053, 774.353, 903.3956, 1017.4385, 1130.5226, 1261.563, 1374.6471, 1431.6686, 1488.69, 1585.7428, 1688.752, 1775.784, 1862.816, 1949.8481, 2096.9165, 2210.0005, 2339.0431], 'Y': {'Y0': 2475.1279, 'Y1': 2678.2073, 'Y2': 2881.2867, 'Y3': 3043.3395, 'Y4': 3205.3923, 'Y5': 3367.4451}, '2Y': {'2Y0': 1237.5639, '2Y1': 1339.1036, '2Y2': 1440.6433, '2Y3': 1521.6697, '2Y4': 1602.6961, '2Y5': 1683.7225}, 'B': {'B_HexNAc_1': 204.0867, 'B_HexNAc_2': 407.1661, 'B_Hex_1': 569.2189, 'B_Hex_2': 731.2717, 'B_Hex_3': 893.3245}, 'oxonium': {'ox_HexNAc': 204.0867, 'ox_Hex': 163.0601}}",NNHHH,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]"

Encoding Definitions

The following is found in the encoding_definition.csv file, produced using the -d flag.

Type,Position,Feature,Index,Size
Peptide,1-50,A,0,20
Peptide,1-50,C,1,20
Peptide,1-50,D,2,20
Peptide,1-50,E,3,20
Peptide,1-50,F,4,20
Peptide,1-50,G,5,20
Peptide,1-50,H,6,20
Peptide,1-50,I,7,20
Peptide,1-50,K,8,20
Peptide,1-50,L,9,20
Peptide,1-50,M,10,20
Peptide,1-50,N,11,20
Peptide,1-50,P,12,20
Peptide,1-50,Q,13,20
Peptide,1-50,R,14,20
Peptide,1-50,S,15,20
Peptide,1-50,T,16,20
Peptide,1-50,V,17,20
Peptide,1-50,W,18,20
Peptide,1-50,Y,19,20
Glycan,1-30,N,0,4
Glycan,1-30,H,1,4
Glycan,1-30,F,2,4
Glycan,1-30,A,3,4
Charge,1,Charge_1,0,10
Charge,1,Charge_2,1,10
Charge,1,Charge_3,2,10
Charge,1,Charge_4,3,10
Charge,1,Charge_5,4,10
Charge,1,Charge_6,5,10
Charge,1,Charge_7,6,10
Charge,1,Charge_8,7,10
Charge,1,Charge_9,8,10
Charge,1,Charge_10,9,10
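
To read an encoded vector back, a flat index can be mapped to its entry in the definitions above. The sketch below assumes the blocks are concatenated in the order Peptide, Glycan, Charge, and that each block is laid out position by position; the start of the example row in the Output section appears consistent with that, but the layout is an assumption and describe_index is a hypothetical helper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MONOSACCHARIDES = "NHFA"

# (type, number of positions, features per position), mirroring the definition rows.
BLOCKS = [
    ("Peptide", 50, list(AMINO_ACIDS)),
    ("Glycan", 30, list(MONOSACCHARIDES)),
    ("Charge", 1, [f"Charge_{z}" for z in range(1, 11)]),
]


def describe_index(flat_index):
    """Map a flat vector index to (type, 1-based position, feature)."""
    if flat_index < 0:
        raise IndexError("index must be non-negative")
    offset = 0
    for block_type, positions, features in BLOCKS:
        block_size = positions * len(features)
        if flat_index < offset + block_size:
            position, feature_index = divmod(flat_index - offset, len(features))
            return block_type, position + 1, features[feature_index]
        offset += block_size
    raise IndexError("index is beyond the end of the encoded vector")


print(describe_index(16))    # ('Peptide', 1, 'T')
print(describe_index(1000))  # ('Glycan', 1, 'N')
print(describe_index(1121))  # ('Charge', 1, 'Charge_2')
```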

License

This script is released under the MIT License.

Acknowledgments

The Hitchhiker’s Guide to Glycoproteomics.

Oliveira, Tiago, Morten Thaysen-Andersen, Nicolle Packer, and Daniel Kolarich. “The Hitchhiker’s Guide to Glycoproteomics.” Biochemical Society Transactions 49 (July 20, 2021). https://doi.org/10.1042/BST20200879.

In Silico Platform for Prediction of N-, O- and C-Glycosites in Eukaryotic Protein Sequences.

Chauhan, Jagat Singh, Alka Rao, and Gajendra P. S. Raghava. “In Silico Platform for Prediction of N-, O- and C-Glycosites in Eukaryotic Protein Sequences.” PLoS ONE 8, no. 6 (June 28, 2013): e67008. https://doi.org/10.1371/journal.pone.0067008.

Large-Scale Identification of N-Linked Intact Glycopeptides in Human Serum Using HILIC Enrichment and Spectral Library Search.

Shu, Qingbo, Mengjie Li, Lian Shu, Zhiwu An, Jifeng Wang, Hao Lv, Ming Yang, et al. “Large-Scale Identification of N-Linked Intact Glycopeptides in Human Serum Using HILIC Enrichment and Spectral Library Search.” Molecular & Cellular Proteomics : MCP 19, no. 4 (April 2020): 672–89. https://doi.org/10.1074/mcp.RA119.001791.

Assessing the Hydrophobicity of Glycopeptides Using Reversed-Phase Liquid Chromatography and Tandem Mass Spectrometry.

Wang, Junyao, Aiying Yu, Byeong Gwan Cho, and Yehia Mechref. “Assessing the Hydrophobicity of Glycopeptides Using Reversed-Phase Liquid Chromatography and Tandem Mass Spectrometry.” Journal of Chromatography. A 1706 (September 13, 2023): 464237. https://doi.org/10.1016/j.chroma.2023.464237.

Molecular Basis of C-Mannosylation – a Structural Perspective.

Crine, Samuel L., and K. Ravi Acharya. “Molecular Basis of C-Mannosylation – a Structural Perspective.” The FEBS Journal 289, no. 24 (2022): 7670–87. https://doi.org/10.1111/febs.16265.

BioPython for handling FASTA files.

Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., … others. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422–1423.

Pyteomics for accurate peptide mass calculations.

Goloborodko, A.A.; Levitsky, L.I.; Ivanov, M.V.; and Gorshkov, M.V. (2013) “Pyteomics - a Python Framework for Exploratory Data Analysis and Rapid Software Prototyping in Proteomics”, Journal of The American Society for Mass Spectrometry, 24(2), 301–304. DOI: 10.1007/s13361-012-0516-6

Levitsky, L.I.; Klein, J.; Ivanov, M.V.; and Gorshkov, M.V. (2018) “Pyteomics 4.0: five years of development of a Python proteomics framework”, Journal of Proteome Research. DOI: 10.1021/acs.jproteome.8b00717

Multicenter Longitudinal Quality Assessment of MS-Based Proteomics in Plasma and Serum.

Kardell, Oliver, Thomas Gronauer, Christine von Toerne, Juliane Merl-Pham, Ann-Christine König, Teresa K. Barth, Julia Mergner, et al. “Multicenter Longitudinal Quality Assessment of MS-Based Proteomics in Plasma and Serum.” Journal of Proteome Research, February 7, 2025. https://doi.org/10.1021/acs.jproteome.4c00644.

GlyGen: Computational and Informatics Resources for Glycoscience.

York WS, Mazumder R, Ranzinger R, Edwards N, Kahsay R, Aoki-Kinoshita KF, Campbell MP, Cummings RD, Feizi T, Martin M, Natale DA, Packer NH, Woods RJ, Agarwal G, Arpinar S, Bhat S, Blake J, Castro LJG, Fochtman B, Gildersleeve J, Goldman R, Holmes X, Jain V, Kulkarni S, Mahadik R, Mehta A, Mousavi R, Nakarakommula S, Navelkar R, Pattabiraman N, Pierce MJ, Ross K, Vasudev P, Vora J, Williamson T, Zhang W. GlyGen: Computational and Informatics Resources for Glycoscience. Glycobiology. 2020 Jan 28;30(2):72-73. doi: 10.1093/glycob/cwz080. PMID: 31616925; PMCID: PMC7335483.

Glycosylation of Viral Proteins: Implication in Virus–Host Interaction and Virulence.

Feng, Tingting, Jinyu Zhang, Zhiqian Chen, Wen Pan, Zhengrong Chen, Yongdong Yan, and Jianfeng Dai. “Glycosylation of Viral Proteins: Implication in Virus–Host Interaction and Virulence.” Virulence 13, no. 1 (n.d.): 670–83. https://doi.org/10.1080/21505594.2022.2060464.

Role of Protein Glycosylation in Interactions of Medically Relevant Fungi with the Host.

Gómez-Gaviria, Manuela, Ana P. Vargas-Macías, Laura C. García-Carnero, Iván Martínez-Duncker, and Héctor M. Mora-Montes. “Role of Protein Glycosylation in Interactions of Medically Relevant Fungi with the Host.” Journal of Fungi 7, no. 10 (October 18, 2021): 875. https://doi.org/10.3390/jof7100875.

XMAn v2—a Database of Homo Sapiens Mutated Peptides.

Flores, Marcela Aguilera, and Iulia M Lazar. “XMAn v2—a Database of Homo Sapiens Mutated Peptides.” Bioinformatics 36, no. 4 (February 15, 2020): 1311–13. https://doi.org/10.1093/bioinformatics/btz693.

Appendix

Additional information, logging runs, and references.

Log File

The log file provides detailed information about the processing steps, including the number of peptides found after cleavage and the identified N-glycopeptides.

python glycopeptide_finder_cmd.py -i test_proteomes/human_uniprotkb_proteome_UP000005640_AND_revi_2025_01_17.fasta -p trypsin -c 0 -l log.txt

The script generates a log file that records the processing details of each protein sequence. Logging to a text file can be activated with the -l log.txt flag. Below are some example log entries:

2025-02-04 00:57:21,942 - INFO - Processing sp|A0A087X1C5|CP2D7_HUMAN with 515 amino acids.
2025-02-04 00:57:21,942 - INFO - Found 50 peptides after trypsin cleavage. The peptides were: ['MGLEALVPLAMIVAIFLLLVDLMHR', 'HQR', 'WAAR', 'YPPGPLPLPGLGNLLHVDFQNTPYCFDQLR', 'R', 'R', 'FGDVFSLQLAWTPVVVLNGLAAVR', 'EAMVTR', 'GEDTADRPPAPIYQVLGFGPR', 'SQGVILSR', 'YGPAWR', 'EQR', 'R', 'FSVSTLR', 'NLGLGK', 'K', 'SLEQWVTEEAACLCAAFADQAGRPFRPNGLLDK', 'AVSNVIASLTCGR', 'R', 'FEYDDPR', 'FLR', 'LLDLAQEGLK', 'EESGFLR', 'EVLNAVPVLPHIPALAGK', 'VLR', 'FQK', 'AFLTQLDELLTEHR', 'MTWDPAQPPR', 'DLTEAFLAK', 'K', 'EK', 'AK', 'GSPESSFNDENLR', 'IVVGNLFLAGMVTTSTTLAWGLLLMILHLDVQR', 'GR', 'R', 'VSPGCPIVGTHVCPVR', 'VQQEIDDVIGQVR', 'RPEMGDQAHMPCTTAVIHEVQHFGDIVPLGVTHMTSR', 'DIEVQGFR', 'IPK', 'GTTLITNLSSVLK', 'DEAVWK', 'KPFR', 'FHPEHFLDAQGHFVKPEAFLPFSAGR', 'R', 'ACLGEPLAR', 'MELFLFFTSLLQHFSFSVAAGQPRPSHSR', 'VVSFLVTPSPYELCAVPR', '']
2025-02-04 00:57:21,943 - INFO - Found 1 N-glycopeptides. The glycopeptides were: [('GTTLITNLSSVLK', 416)]
2025-02-04 00:57:21,943 - INFO - Processing sp|A0A0B4J2F0|PIOS1_HUMAN with 54 amino acids.
2025-02-04 00:57:21,945 - INFO - Found 9 peptides after trypsin cleavage. The peptides were: ['MFR', 'R', 'LTFAQLLFATVLGIAGGVYIFQPVFEQYAK', 'DQK', 'ELK', 'EK', 'MQLVQESEEK', 'K', 'S']
2025-02-04 00:57:21,945 - INFO - Found 0 N-glycopeptides. The glycopeptides were: []
2025-02-04 00:57:21,945 - INFO - Processing sp|A0A0C5B5G6|MOTSC_HUMAN with 16 amino acids.
2025-02-04 00:57:21,945 - INFO - Found 5 peptides after trypsin cleavage. The peptides were: ['MR', 'WQEMGYIFYPR', 'K', 'LR', '']
2025-02-04 00:57:21,945 - INFO - Found 0 N-glycopeptides. The glycopeptides were: []
2025-02-04 00:57:21,945 - INFO - Processing sp|A0A0K2S4Q6|CD3CH_HUMAN with 201 amino acids.
2025-02-04 00:57:21,945 - INFO - Found 13 peptides after trypsin cleavage. The peptides were: ['MTQR', 'AGAAMLPSALLLLCVPGCLTVSGPSTVMGAVGESLSVQCR', 'YEEK', 'YK', 'TFNK', 'YWCR', 'QPCLPIWHEMVETGGSEGVVR', 'SDQVIITDHPGDLTFTVTLENLTADDAGK', 'YR', 'CGIATILQEDGLSGFLPDPFFQVQVLVSSASSTENSVK', 'TPASPTRPSQCQGSLPSSTCFLLLPLLK', 'VPLLLSILGAILWVNRPWR', 'TPWTES']
2025-02-04 00:57:21,945 - INFO - Found 1 N-glycopeptides. The glycopeptides were: [('SDQVIITDHPGDLTFTVTLENLTADDAGK', 100)]
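
Log entries in this format can be summarized with a short parser. The sketch below is not part of the tool: parse_log and the two regular expressions are hypothetical and only assume the message wording shown in the entries above.

```python
import ast
import re

PROTEIN_RE = re.compile(r"Processing (\S+) with (\d+) amino acids\.")
GLYCO_RE = re.compile(r"Found (\d+) N-glycopeptides\. The glycopeptides were: (\[.*\])")


def parse_log(lines):
    """Return {protein_id: [(peptide, position), ...]} from log lines."""
    results = {}
    current = None
    for line in lines:
        match = PROTEIN_RE.search(line)
        if match:
            current = match.group(1)
            results[current] = []
            continue
        match = GLYCO_RE.search(line)
        if match and current is not None:
            # The list is logged as a Python literal, so literal_eval parses it safely.
            results[current] = ast.literal_eval(match.group(2))
    return results


LOG_LINES = [
    "2025-02-04 00:57:21,942 - INFO - Processing sp|A0A087X1C5|CP2D7_HUMAN with 515 amino acids.",
    "2025-02-04 00:57:21,943 - INFO - Found 1 N-glycopeptides. The glycopeptides were: [('GTTLITNLSSVLK', 416)]",
    "2025-02-04 00:57:21,943 - INFO - Processing sp|A0A0B4J2F0|PIOS1_HUMAN with 54 amino acids.",
    "2025-02-04 00:57:21,945 - INFO - Found 0 N-glycopeptides. The glycopeptides were: []",
]
parsed = parse_log(LOG_LINES)
for protein, glycopeptides in parsed.items():
    print(protein, glycopeptides)
```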

Notes

The script assumes well-formatted FASTA input files.
Only N-linked glycosylation sequons are detected (no O-linked or other modifications).
FASTA protein files wrap sequences across lines with newlines and/or carriage returns. Remember this when searching the original FASTA file for a peptide, because a peptide may span a line break.
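
The line-wrapping caveat can be handled by joining the sequence lines before searching. The sketch below uses hypothetical helpers (read_fasta, locate_peptide) and a made-up record; it assumes a plain FASTA layout with one header line per record.

```python
def read_fasta(text):
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records = {}
    header = None
    for line in text.splitlines():  # splitlines handles \n, \r\n and \r
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def locate_peptide(sequence, peptide):
    """Return the 1-based start of peptide in sequence, or None if absent."""
    index = sequence.find(peptide)
    return None if index < 0 else index + 1


# Made-up record whose sequence is wrapped after five residues.
fasta_text = ">sp|X|TEST\r\nMKTAN\r\nLSGTK\r\n"
sequences = read_fasta(fasta_text)
print("ANLS" in fasta_text)                            # False: the line break splits the peptide
print(locate_peptide(sequences["sp|X|TEST"], "ANLS"))  # 4
```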

To list all common names used in the test_proteomes folder:

for file in ./test_proteomes/*; do
  filename=$(basename "$file")
  part_before_underscore="${filename%%_*}"
  echo "$part_before_underscore"
done

Test Proteomes

Test proteome files from UniProt are available in the test_proteomes folder. Below is a list of species gathered. Only Swiss-Prot reviewed proteins were downloaded, and not every sequence available for a species is included.

I used these test proteomes to generate a zoo of glycopeptides under constrained conditions so that the results fit into a GitHub repo. To build the full zoo, remove the constraints in the batch processing script.

Common Name	Scientific Name	Taxon ID
Alpaca	Vicugna pacos	30538
Amoeba	Naegleria gruberi	5762
Anemone	Nematostella vectensis	45351
Ant	Camponotus floridanus	104421
Apple	Malus domestica	3750
Arabidopsis	Arabidopsis thaliana	3702
Aspergillus fumigatus	Aspergillus fumigatus (strain ATCC MYA-4609 / CBS 101355 / FGSC A1100 / Af293)	330879
Aspergillus nidulans	Emericella nidulans (strain FGSC A4 / ATCC 38163 / CBS 112.46 / NRRL 194 / M139)	227321
Avocado	Persea americana	3435
Banana	Musa acuminata	4641
Barley	Hordeum vulgare	4513
Bat	Myotis lucifugus	59463
Black Cherry	Prunus serotina	23207
Black Truffle	Tuber melanosporum (strain Mel28)	656061
Blood Fluke	Schistosoma mansoni	6183
Brine Shrimp	Artemia franciscana	6661
Brown Alga	Ectocarpus siliculosus	2880
Bushbaby	Otolemur garnettii	30611
Camel	Camelus bactrianus	9837
Candida albicans (Yeast, human pathogen)	Candida albicans (strain SC5314 / ATCC MYA-2876)	237561
Cat	Felis catus	9685
C. elegans	Caenorhabditis elegans	6239
Chameleon	Anolis carolinensis	28377
Charcoal Rot	Macrophomina phaseolina (strain MS6)	1126212
Chicken	Gallus gallus	9031
Chimpanzee	Pan troglodytes	9598
Chinchilla	Chinchilla lanigera	34839
C. jejuni	Campylobacter jejuni	1951
Coffee	Coffea arabica	13443
Cow	Bos taurus	9913
Crocodile	Crocodylus porosus	8502
Cryptococcus	Cryptococcus neoformans var. neoformans serotype D (strain JEC21 / ATCC MYA-565)	214684
Cytomegalovirus	Human cytomegalovirus (strain Merlin)	295027
Corn Smut	Mycosarcoma maydis	5270
Date Palm	Phoenix dactylifera	42345
Debaryomyces hansenii (yeast)	Debaryomyces hansenii (strain ATCC 36239 / CBS 767 / BCRC 21394 / JCM 1990 / NBRC 0083 / IGC 2968)	284592
Deer Tick	Ixodes scapularis	6945
Diatom	Thalassiosira pseudonana	35128
Dictyostelium	Dictyostelium discoideum	44689
Dog	Canis lupus familiaris	9615
Donkey	Equus asinus	9793
Duck	Cairina moschata	8855
Dugbe Virus	Dugbe virus (isolate ArD44313)	766194
Ebola	Zaire ebolavirus (strain Mayinga-76)	128952
Elephant	Loxodonta africana (African Elephant)	9785
Fall Armyworm	Spodoptera frugiperda (Fall Armyworm)	7108
Ferret	Mustela putorius furo	9669
Fission Yeast	Schizosaccharomyces japonicus (strain yFS275 / FY16936)	402676
Frog	Xenopus laevis	8355
Fruit Fly	Drosophila melanogaster	7227
Goat	Capra hircus	9925
Gorilla	Gorilla gorilla gorilla	9595
Grape	Vitis vinifera	29760
Green Alga	Chlamydomonas reinhardtii	3055
Guinea Pig	Cavia porcellus	10141
Hamster	Mesocricetus auratus	10036
Hemp	Cannabis sativa	3483
HHV-1	Human herpesvirus 1 (strain 17)	10299
HIV-1	Human immunodeficiency virus type 1 group N (isolate YBF30)	388818
HIV-2	Human immunodeficiency virus type 2 subtype A (isolate BEN)	11714
Honeybee	Apis mellifera	7460
Horse	Equus caballus	9796
HRSV S-2	Human respiratory syncytial virus A (strain S-2)	410078
Human	Homo sapiens	9606
Influenza B	Influenza B virus (strain B/Lee/1940)	518987
Influenza C	Influenza C virus (strain C/Ann Arbor/1/1950)	11553
JEV	Japanese encephalitis virus (strain M28)	2555554
Kidney Bean	Phaseolus vulgaris	3885
Kluyveromyces lactis (lactate processing yeast)	Kluyveromyces lactis (strain ATCC 8585 / CBS 2359 / DSM 70799 / NBRC 1267 / NRRL Y-1140 / WM37)	284590
LASV	Lassa virus (strain Mouse/Sierra Leone/Josiah/1976)	11622
LCMV	Lymphocytic choriomeningitis virus (strain Armstrong)	11624
Lemur	Microcebus murinus	30608
Macaque (Rhesus monkey)	Macaca mulatta	9544
Maize	Zea mays	4577
Measles virus	Measles virus (strain Ichinose-B95a)	645098
Monkey (cynomolgus, crab-eating)	Macaca fascicularis	9541
Mosquito (African malaria)	Anopheles gambiae	7165
Mouse	Mus musculus	10090
Naked Mole Rat	Heterocephalus glaber	10181
Nematode (roundworm)	Caenorhabditis briggsae	6238
Norovirus	Norovirus (strain Human/NoV/United States/Norwalk/1968/GI)	524364
Octopus	Octopus vulgaris	6645
Olive	Olea europaea	4146
Opossum	Monodelphis domestica	13616
Orange	Citrus sinensis	2711
Orangutan	Pongo abelii	9601
Oyster	Magallana gigas	29159
Paramecium	Paramecium tetraurelia	5888
Peach	Prunus persica	3760
Penicillium	Penicillium rubens (strain ATCC 28089 / DSM 1075 / NRRL 1951 / Wisconsin 54-1255)	500485
Pig (Domestic)	Sus scrofa domesticus	9823
Platypus	Ornithorhynchus anatinus	9258
Poplar Leaf Rust Fungus	Melampsora larici-populina (strain 98AG31 / pathotype 3-4-7)	747676
Potato	Solanum tuberosum	4113
Psilocybe mushroom	Psilocybe cubensis	181762
Pufferfish	Takifugu rubripes	31033
Rabbit	Oryctolagus cuniculus	9986
Rat	Rattus norvegicus	10116
Red Alga	Cyanidioschyzon merolae (strain NIES-3377 / 10D)	280699
Rice	Oryza sativa subsp. japonica	39947
Rice Blast Fungus	Pyricularia oryzae (strain 70-15 / ATCC MYA-4617 / FGSC 8958)	242507
Rice Fish (Japanese)	Oryzias latipes	8090
RVA	Rotavirus A (isolate RVA/Monkey/South Africa/SA11-H96/1958/G3P5B[2])	450149
RVB	Rotavirus B (isolate RVB/Human/China/ADRV/1982)	10942
RVC	Rotavirus C (isolate RVC/Human/United Kingdom/Bristol/1989)	31567
SARS-CoV	SARS-CoV (Severe Acute Respiratory Syndrome Coronavirus)	694009
SFTSV	SFTS phlebovirus (isolate SFTSV/Human/China/HB29/2010)	992212
Shark	Callorhinchus milii	7868
Sheep	Ovis aries	9940
Silk Moth	Bombyx mori	7091
Silveira (Coccidioides Silveira strain)	Coccidioides posadasii (strain RMSCC 757 / Silveira)	443226
Snake (Brown Eastern)	Pseudonaja textilis	8673
Softshell Turtle	Pelodiscus sinensis	13735
Spike Moss (lycophyte)	Selaginella moellendorffii	88036
Sponge	Amphimedon queenslandica	400682
Sorghum	Sorghum bicolor	4558
Squirrel	Ictidomys tridecemlineatus	43179
Starfish	Patiria pectinifera	7594
Strawberry	Fragaria ananassa	3747
Sugarcane	Saccharum officinarum	4547
Sunflower	Helianthus annuus	4232
Sycamore	Platanus occidentalis	4403
Tea plant	Camellia sinensis	4442
Tobacco	Nicotiana tabacum	4097
Tilapia	Oreochromis niloticus	8128
Tomato	Solanum lycopersicum	4081
Trout (Brown)	Salmo trutta	8032
Turkey	Meleagris gallopavo	9103
Urchin	Strongylocentrotus purpuratus	7668
VZV	Varicella-zoster virus (strain Dumas)	10338
Wasp (parasitoid)	Nasonia vitripennis	7425
Watermelon	Citrullus lanatus	3654
Wheat	Triticum aestivum	4565
Whisk fern	Psilotum nudum	3240
Wild Rice (North America)	Oryza nivara	4536
WNV	West Nile virus	11082
XMAn v2 Missense	Homo sapiens - Unknown Mutation Analysis (Human missense peptide library) Download at: https://github.com/lazarlab/XMAn-v2	9606
XMAn v2 Nonsense	Homo sapiens - Unknown Mutation Analysis (Human nonsense peptide library) Download at: https://github.com/lazarlab/XMAn-v2	9606
Yak	Bos mutus grunniens	30521
Yeast (Budding, Baker's)	Saccharomyces cerevisiae (strain ATCC 204508 / S288c)	559292
Yeast (Fission)	Schizosaccharomyces pombe (strain 972 / ATCC 24843)	284812
Zebra Finch	Taeniopygia guttata	59729
Zebrafish	Danio rerio	7955
Zebu	Bos indicus	9915
Zika	Zika virus	64320

Glycan Mass Library

Glycan mass libraries were gathered from GlyGen. The following data was processed and used in this tool. Glycan libraries are plug and play, so swap them to meet your needs.

It is advisable to create a targeted glycan library along with a list of peptides to compute an intact glycopeptide library. The size of the library can grow rapidly, so it is important to manage it effectively.

File Name	Glycan Count
glycan_database	44686
glycan_type_n_linked_byonic.csv	369
test_glycan_library.csv	3
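
To see how quickly an intact library grows, the glycan counts above can be multiplied by the number of glycopeptide sequences. The sketch below assumes every glycan is paired with every sequence (a full cross product), which may overstate what the tool actually writes; intact_library_size and the figure of 1,000 sequences are hypothetical.

```python
def intact_library_size(n_glycopeptides, n_glycans, n_charge_states=1):
    """Number of entries when every sequence is paired with every glycan and charge."""
    return n_glycopeptides * n_glycans * n_charge_states


# Glycan counts taken from the table above; 1,000 sequences is a made-up figure.
for library, n_glycans in (
    ("test_glycan_library.csv", 3),
    ("glycan_type_n_linked_byonic.csv", 369),
    ("glycan_database", 44686),
):
    print(library, intact_library_size(1000, n_glycans))
```

With 1,000 sequences the three libraries give 3,000, 369,000 and 44,686,000 entries, which is why a targeted glycan library is usually the practical choice.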
