- General Description
- Dependencies
- Installation via Docker
- Installation from source
- Usage
- Input Files
- Setting parameters
- Output Files
- Contact
- Algorithmic Flow Chart
pTuneos is a state-of-the-art computational pipeline for identifying personalized tumor neoantigens from next-generation sequencing data. From raw whole-exome sequencing data and/or RNA-seq data, pTuneos calculates five important immunogenicity features to construct a machine learning-based classifier (Pre&RecNeo) that predicts and prioritizes neoantigens recognized by T cells, followed by an efficient scoring scheme (RefinedNeo) that evaluates the probability that a predicted neoepitope is naturally processed, MHC presented and T cell recognized.
pTuneos is currently tested on x86_64 on Ubuntu 16.04.
Note: if you want to call neoantigens from raw sequencing data (WES mode), please ensure that your server has enough CPU cores and RAM (we tested on a machine with 2 CPUs, 88 cores and 256 GB RAM).
- Python 2.7
- R 3.2.3
- NetMHCpan 4.0
- Variant Effect Predictor (VEP)
- BWA
- samtools
- OptiType
- PyClone
- GATK 3.8
- Picard tools
- Java 8
- kallisto
- trimmomatic
- vcftools
- blast
- tabix
- gawk
- yaml
- XGBoost
- biopython
- scikit-learn==0.19.1
- pandas
- numpy
- imblearn
- Pyomo
- tables
- pysam
- PypeR
- multiprocessing
- subprocess
- math
- matplotlib
- collections
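Before a first run it can help to confirm that the command-line tools are discoverable. The following is a minimal sketch and not part of pTuneos. The executable names in `TOOLS` are assumptions (actual binary names depend on how each tool was installed), and the helper is written for Python 3, separately from the pipeline itself, which requires Python 2.7.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
TOOLS = ["bwa", "samtools", "kallisto", "vcftools", "tabix", "gawk", "java"]


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

A tool reported as missing may still be usable if its full path is given in the configuration file.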
Docker image of pTuneos is at https://cloud.docker.com/u/bm2lab/repository/docker/bm2lab/ptuneos.
- Install Docker on your computer and make sure it works.
- Call `docker pull bm2lab/ptuneos:v2.1`, which will download the Docker image.
- Run the image in interactive mode with your dataset:
  `docker run -it -v /your/path/to/dataset/:/root/data bm2lab/ptuneos:v2.1 /bin/bash`
- Change into the pTuneos directory inside the container:
  `cd /root/pTuneos`
- Download the reference data for your genome version (hg19 or hg38):
  `bash data_download_hg19.sh`
  or
  `bash data_download_hg38.sh`
- Edit `config_WES.yaml` or `config_VCF.yaml` and fill in the proper paths of the input files.
- Run the program with one of the following commands:
  `python pTuneos.py WES -i config_WES.yaml`
  or
  `python pTuneos.py VCF -i config_VCF.yaml`
- Install all software, Python packages and R packages listed above, and make sure each one works on your system.
- Install multiprocessing and other packages with the `pip` command:
  `pip install -U multiprocessing`
  `pip install -U pyper`
  ...
- Install the R package `copynumber` and its dependencies:
  `source("http://bioconductor.org/biocLite.R")`
  `biocLite("copynumber")`
  Install the R packages `squash` and `sequenza`:
  `install.packages('squash')`
  `install.packages('sequenza')`
- Download or clone the pTuneos repository to your local system:
  `git clone https://github.com/bm2-lab/pTuneos.git`
- Reference data, including the genome FASTA, cDNA and peptide files (GRCh38 build), can be downloaded and processed through our script. Be aware that the version of the VEP library you use should match the references used (peptide and cDNA). E.g. if you install VEP release-89, then you should set `VEP_release` to `release-89`. Then you can run:
  `bash data_download.sh`
  The reference data are placed in the folder `database` and processed by a custom script in order to run the pipeline, including:
  - [Fasta] This folder contains the reference FASTA file, its BWA index and some other files derived from `human.fasta`: `human.fasta`, `human.fasta.amb`, `human.fasta.ann`, etc.
  - [VCF_annotation] This folder contains the VCF files and their index files used to run the GATK best practice: `dbsnp_138.hg38.vcf.gz`, `1000G_phase1.snps.high_confidence.hg38.vcf.gz`, `Mills_and_1000G_gold_standard.indels.hg38.vcf.gz`
  - [Protein] This folder contains the reference cDNA and protein sequences of human: `human.cdna.all.fa`, `human.pep.all.fa`
- Among the required software listed above, BWA, GATK 3.8, kallisto, picard, samtools, tabix, trimmomatic-0.36, blast and VarScan.v2.4.2 are provided in the software directory. Other software should be installed by the user because of its complexity; please refer to the software links above.
- Fill in the `config_WES.yaml` file with your local paths, and make sure you have installed all the above software and downloaded the reference data. Be aware that the version of the VEP library you use should match the references used (peptide and cDNA); e.g. the example above used release 89 of GRCh38.
pTuneos has two modes: `WES` mode and `VCF` mode.
`WES` mode accepts WES and RNA-seq sequencing data as input. It conducts sequencing quality control, mutation calling, HLA typing, expression profiling, and neoepitope prediction, filtering and annotation.
`VCF` mode accepts a mutation VCF file, expression profile, copy number profile and tumor cellularity as input. It performs neoepitope prediction, filtering and annotation directly on the input files.
You can use these two modes by:
python pTuneos.py WES -i config_WES.yaml
python pTuneos.py VCF -i config_VCF.yaml
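When pTuneos is launched from another script (for example, to process several samples), the two invocations above can be wrapped as follows. This is only a sketch: `build_command` and `run_ptuneos` are hypothetical helper names, and it assumes that `python` on your PATH is the Python 2.7 interpreter pTuneos needs and that the working directory contains `pTuneos.py`.

```python
import subprocess

VALID_MODES = ("WES", "VCF")


def build_command(mode, config_path, script="pTuneos.py"):
    """Build the argument list for one pTuneos run."""
    if mode not in VALID_MODES:
        raise ValueError("mode must be 'WES' or 'VCF', got %r" % (mode,))
    return ["python", script, mode, "-i", config_path]


def run_ptuneos(mode, config_path):
    """Run pTuneos; raises CalledProcessError on a non-zero exit code."""
    subprocess.check_call(build_command(mode, config_path))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a configuration path contains spaces.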
Paired-end matched tumor-normal whole exome sequencing files should be provided for basic neoepitope identification. An expression profile file or raw RNA sequencing file (paired-end or single-end) is optional, and is needed only if you want to obtain expressed neoepitopes. The sequencing files can be in `.fastq.gz` or `.fastq` format.
You should specify the right paths to the sequencing files in `config_WES.yaml`, like:
#your path to first tumor fastq file
tumor_fastq_path_first: ~/ncbi/dbGaP-14145/sra/SRR2770550_1.fastq.gz
#your path to second tumor fastq file
tumor_fastq_path_second: ~/ncbi/dbGaP-14145/sra/SRR2770550_2.fastq.gz
#your path to first normal fastq file
normal_fastq_path_first: ~/ncbi/dbGaP-14145/sra/SRR2669057_1.fastq.gz
#your path to second normal fastq file
normal_fastq_path_second: ~/ncbi/dbGaP-14145/sra/SRR2669057_2.fastq.gz
#your path to first RNA-seq fastq file
tumor_rna_fastq_1: ~/ncbi/dbGaP-14145/sra/SRR2673065_1.fastq.gz
#your path to second RNA-seq fastq file
tumor_rna_fastq_2: ~/ncbi/dbGaP-14145/sra/SRR2673065_2.fastq.gz
We provide a download script for WES+RNA-seq test data in the folder `WES_example_data/`. You can use `bash test_data_download.sh` to download these data for testing. (Note: if your RNA-seq data is single-end, just set `tumor_rna_fastq_2` to `None`. In addition, if you know the `fragment length` and `Standard deviation of fragment length` of your single-end RNA-seq, replace the defaults with your values; otherwise, just leave them unchanged.)
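The FASTQ path settings can be sanity-checked before starting a long run. The sketch below is not part of pTuneos; it works on a plain dictionary holding the key names shown in the `config_WES.yaml` excerpt, and `check_fastq_config` is a hypothetical helper. It treats both Python `None` and the string `"None"` as "not provided", on the assumption that the file is read with a loader such as PyYAML, which reads a bare `None` as a string.

```python
REQUIRED_KEYS = [
    "tumor_fastq_path_first",
    "tumor_fastq_path_second",
    "normal_fastq_path_first",
    "normal_fastq_path_second",
]
OPTIONAL_KEYS = ["tumor_rna_fastq_1", "tumor_rna_fastq_2"]


def is_fastq(path):
    """True if the path ends with one of the two accepted extensions."""
    return path.endswith(".fastq") or path.endswith(".fastq.gz")


def is_unset(value):
    """True for a missing value or the literal string 'None'."""
    return value is None or value == "None"


def check_fastq_config(config):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = config.get(key)
        if is_unset(value):
            problems.append("%s is required" % key)
        elif not is_fastq(value):
            problems.append("%s must end in .fastq or .fastq.gz" % key)
    for key in OPTIONAL_KEYS:
        value = config.get(key)
        # RNA-seq input is optional; single-end data leaves the second file unset.
        if not is_unset(value) and not is_fastq(value):
            problems.append("%s must end in .fastq or .fastq.gz" % key)
    return problems
```

The check looks only at key presence and file extensions; it does not verify that the files exist or are readable.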
Input files for `VCF` mode are:
- mutation file in VCF format from MuTect2.
- expression profile in the same format as mentioned in WES mode (we recommend obtaining it from `kallisto`).
- copy number profile (we recommend obtaining it from `sequenza`).
- tumor cellularity (between 0 and 1) (we also recommend obtaining it from `sequenza`).
We provide example data for these files in the folder `VCF_example_data/`.
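Because tumor cellularity must lie between 0 and 1, a value copied from another tool's report (for instance a percentage) is worth checking before it goes into the configuration. A minimal sketch, with `parse_cellularity` as a hypothetical helper name:

```python
def parse_cellularity(value):
    """Convert a value to float and require it to lie in [0, 1]."""
    cellularity = float(value)
    if not 0.0 <= cellularity <= 1.0:
        raise ValueError(
            "tumor cellularity must be between 0 and 1, got %r" % (value,)
        )
    return cellularity
```

For example, `parse_cellularity("0.62")` returns `0.62`, while `parse_cellularity(62)` raises `ValueError`.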
The following references are required for pTuneos to run:
- Reference DNA sequence and its annotation files. These files are used in the somatic variant calling process.
  [Genome reference] `human.fasta`, `dbsnp_138.hg38.vcf.gz`, `1000G_phase1.snps.high_confidence.hg38.vcf.gz`, `Mills_and_1000G_gold_standard.indels.hg38.vcf.gz`, `CosmicCodingMuts_chr_M_sorted.vcf.gz`
- Peptide and cDNA: the peptide and cDNA references are FASTA files containing all peptide and all cDNA sequences of the human proteome.
  [cDNA and protein] `Homo_sapiens.GRCh38.cdna.all.fa`, `Homo_sapiens.GRCh38.pep.all.fa`
- EnsemblVEP: VEP cache database. It is very important that the references and VEP match in release version (e.g. release-89).
  [EnsemblVEP] `homo_sapiens_vep_89_GRCh38.tar.gz`
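Since a release mismatch between the VEP cache and the references is easy to miss, the check can be automated. The sketch below assumes the cache archive follows the naming pattern of the example above (`homo_sapiens_vep_<release>_<assembly>.tar.gz`) and that the release is written as `release-<number>`; the function names are hypothetical.

```python
import re


def cache_release(filename):
    """Extract the release number from a VEP cache archive name, or None."""
    match = re.search(r"_vep_(\d+)_", filename)
    return int(match.group(1)) if match else None


def label_release(label):
    """Extract the number from a label such as 'release-89', or None."""
    match = re.fullmatch(r"release-(\d+)", label)
    return int(match.group(1)) if match else None


def releases_match(cache_filename, release_label):
    """True only if both values parse and name the same release."""
    cache = cache_release(cache_filename)
    return cache is not None and cache == label_release(release_label)
```

For example, `releases_match("homo_sapiens_vep_89_GRCh38.tar.gz", "release-89")` is `True`, while the same file checked against `"release-90"` gives `False`.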
Users should set all the parameters in the configuration file `config_WES.yaml` or `config_VCF.yaml`. The configuration file contains three groups of parameters:
- Input data parameters, including the paths of DNA/RNA sequencing data, output folder, run name, HLA alleles, expression file and thread number (for WES mode). If your RNA-seq data is single-end, just set `tumor_rna_fastq_2` to `None`. In addition, if you know the `fragment length` and `Standard deviation of fragment length` of your single-end RNA-seq, replace the defaults with your values; otherwise, just leave them unchanged. (Note: users can specify HLA alleles through `hla_str`; otherwise set it to `None` and the pipeline will predict them from the sequencing data. If RNA sequencing data is provided, please also set the expression file to `None`.)
- Filter parameters, including mutation sequencing depth, mutation variant allele fraction (VAF), binding affinity rank and expression FPKM.
- Executable paths of OptiType, VEP, NetMHCpan and PyClone.
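To illustrate how the four filter parameters interact, the sketch below applies them to one candidate. It is not the pipeline's implementation: the threshold names, the placeholder values, the candidate field names and the direction of each comparison (minimum depth, VAF and FPKM; maximum %rank) are all assumptions made for illustration. The real names and defaults are those in the configuration files.

```python
# Hypothetical threshold names with placeholder values, NOT pTuneos defaults.
EXAMPLE_THRESHOLDS = {
    "min_depth": 10,          # minimum sequencing depth at the mutation site
    "min_vaf": 0.05,          # minimum variant allele fraction
    "max_binding_rank": 2.0,  # maximum NetMHCpan %rank (lower binds better)
    "min_fpkm": 1.0,          # minimum expression level
}


def passes_filters(candidate, thresholds):
    """True if a candidate (a dict of measurements) meets every threshold."""
    return (
        candidate["depth"] >= thresholds["min_depth"]
        and candidate["vaf"] >= thresholds["min_vaf"]
        and candidate["binding_rank"] <= thresholds["max_binding_rank"]
        and candidate["fpkm"] >= thresholds["min_fpkm"]
    )
```

A candidate is kept only if all four conditions hold, so tightening any single threshold can only shrink the result set.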
pTuneos outputs four result files containing information on the identified neoepitopes derived from nonsynonymous point mutations and INDEL mutations.
The output files are the following: final_neo_model.tsv
This is a TSV file with the extracted mutated peptides derived from nonsynonymous point mutations and INDEL mutations, together with a model-based
score that measures the immunogenicity of each neoepitope.
The prediction output (final_neo_model.tsv) for each peptide pair consists of the following columns:
Column Name | Description |
---|---|
Position | Mutation position in genome. |
HLA_type | HLA allele name. |
Gene | HUGO symbol of the mutated gene. |
WT_pep | The extracted normal peptide. |
WT_Binding_EL | %Rank of prediction score for normal peptides using NetMHCpan 4.0 (default model). |
WT_Binding_Rank | %Rank of prediction score for normal peptides using NetMHCpan 4.0 (-ba model). |
MT_pep | The extracted mutant peptide. |
MT_Binding_EL | %Rank of prediction score for mutant peptides using NetMHCpan 4.0 (default model). |
MT_Binding_Rank | %Rank of prediction score for mutant peptides using NetMHCpan 4.0 (-ba model). |
Transcript_name | Ensembl transcript ID. |
Mutation | Nucleotide change of the mutated gene. |
AA_change | Amino acid change annotated in the VEP file. |
Variant_allele_frequency | Genomic allele frequency detected by MuTect2. |
DriverGene_Lable | TRUE if the HUGO symbol is in the COSMIC reference list, FALSE if it is not. |
MT_Binding_level_des | Binding level description of the mutant peptide. |
WT_Binding_level_des | Binding level description of the normal peptide. |
Homolog_pep | The extracted homologous peptide of the neo-peptide in human protein. |
Homolog_Binding_EL | %Rank of prediction score for homologous peptides using NetMHCpan 4.0 (default model). |
Recognition_score | T cell recognition score calculated based on TCR cross-reactivity. |
Hydrophobicity_score | Neo-peptide immunogenicity measurement based on amino acid hydrophobicity. |
Self_sequence_similarity | Sequence similarity between the mutant peptide and the normal (homologous) peptide. The larger value is selected as the final score. |
Model_pro | Model prediction score (probability) that a neoepitope is recognized by T cells, depending on Recognition_score, Hydrophobicity_score, Self_sequence_similarity, WT_Binding_EL and MT_Binding_EL (calculated by Pre&RecNeo). |
Immuno_effect_score | Refined immunogenicity score for neoepitopes (calculated by RefinedNeo). |
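The result table can be loaded and ranked with the Python standard library. The sketch below assumes that `final_neo_model.tsv` is tab-separated with a header row using the column names listed above, and that a higher `Immuno_effect_score` indicates a stronger candidate; both are assumptions to verify against your own output.

```python
import csv


def load_neoepitopes(handle):
    """Read a tab-separated file object into a list of dict rows."""
    return list(csv.DictReader(handle, delimiter="\t"))


def rank_by_score(rows, column="Immuno_effect_score"):
    """Sort rows by a numeric column, highest value first."""
    return sorted(rows, key=lambda row: float(row[column]), reverse=True)


if __name__ == "__main__":
    import sys

    # Usage: python rank_neoepitopes.py final_neo_model.tsv
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            for row in rank_by_score(load_neoepitopes(handle))[:10]:
                print(row["MT_pep"], row["HLA_type"], row["Immuno_effect_score"])
```

Sorting on `Model_pro` instead only requires passing `column="Model_pro"`.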
1410782Chiz@tongji.edu.cn or qiliu@tongji.edu.cn
Biological and Medical Big data Mining Lab
Tongji University