- Simpler workflow to process Roche-NimbleGen arrays.
- This does not complete the additional processing if the phyical coordinates of the array is present.
- Use pepMeld for better preprocessing if the physical coordinates of the array are exposed
- This python script will:
- Open the data file,
- Take the log base2 of each binding affinity value
- Subtract the associate control with each value
- Take the median of the repeated peptide sequences of each array, by MHC type.
- Open a "meta data" file to
- rename sequences to a human readable format.
- Apply a qualitative positive/negative "Threshold" for each MHC-Type based on the distribution.
- Open the peptide sequence to protein accession number lookup
- Choose a unique accession number / peptide sequence
- Join the sequences to the peptide data, for each MHC type.
- Output a file ready to upload into iedb.org epitope databases.
- Open the data file,
argument | single_character | type | default | help | required |
---|---|---|---|---|---|
meta_filepath | m | string | nan | filepath of the meta data | t |
meta_sep | e | string | \t | filepath of the meta data | f |
pep_seq_protein_lookup_path | p | string | nan | Filepath of the protein lookup file | t |
lookup_sep | l | string | \t | separator used in the pep_seq_protein_lookup file [ , \t ] | f |
in_dir | i | string | nan | Location of the input data files | t |
in_data_ext | x | string | Mamu.txt.gz | File names endswith to search for in data files of the in_dir | f |
in_data_sep | d | string | \t | separator used in the in_data file(s) [ , \t ] | nan |
sequence_lengths | s | string | 8,9,10 | comma delimited list of lengths to include [ 8,9,10 ] | f |
out_dir | o | string | nan | output directory | t |
out_prefix | f | string | mhc_ppma | prefix to use in the output files. | f |
python3 ~/github/pepMeld-lite/main.py \
--meta_filepath='/Volumes/mhc_dataset/mhc_meta.tsv' \
--meta_sep='\t' \
--pep_seq_protein_lookup_path=/Volumes/mhc_dataset/SIV_SHIV_PEP_QX12_correspondence_key.txt.gz \
--lookup_sep='\t' \
--in_dir='/Volumes/mhc_dataset/input_data' \
--in_data_ext='Mamu.txt.gz' \
--in_data_sep='\t' \
--sequence_lengths='8,9,10' \
--out_dir='/Volumes/mhc_dataset/output_data' \
--out_prefix='mhc_ppma'
Required columns:
Column Name | Details |
---|---|
PROBE_SEQUENCE | Peptide sequence (all caps) |
REPL | The corresponding Position of a Replicate (used to match control position) |
SEQ_ID | Either MER_REDUNDANT or _ |
Required columns:
Column Name | Details |
---|---|
VENDOR_NAME | Data column name from the vendor provided data table |
MHC Allele Name | This is the MHC Allele Name that will be uploaded to IEDB.org |
CONTROL_SUBTRACT | This is the column to subtract the values by Peptide_Sequence & Replicate position |
EXCLUDE | To exclude the column from analysis |
PASSING_THRESHOLD | Control Subtracted threshold that is considered passing |
Required columns:
Column Name | Details |
---|---|
PEPTIDE_SEQUENCE | Peptide sequence (all caps) |
ORIGINAL_SEQ_ID | Associated Virus Accession Number _ Protein Accession Number |