Ma, K., N. Xu, A. He, and Y. Bai, “SNPAAMapper-Python: A highly efficient genome-wide SNP variant analysis pipeline for next-generation sequencing data.” Poster Presentation at the 18th Annual Conference for the Mid-South Computational Biology and Bioinformatics Society (MCBIOS 2022)
SNPAAMapper is a downstream variant annotation program that can effectively classify variants by region (e.g. exon, intron, etc.), predict amino acid change type (e.g. synonymous, non-synonymous mutation, etc.), and prioritize mutation effects (e.g. CDS versus 5'UTR, etc.).
- The pipeline accepts a tab-delimited VCF input file and processes all variant cases it contains (G5, lowFreq, and novel)
- The variant mapping step allows users to select whether they want to report the base pair distance between each identified intron variant and its nearby exon
- The pipeline is compatible with VCF files called by different SAMtools versions (0.1.18 and older) and/or generated using SAMtools with two or three samples
- The spreadsheet result file contains full protein sequences for both reference and alternative alleles, which makes it easier for downstream protein structure/function analysis tools to use
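The intron-distance feature listed above can be illustrated with a minimal sketch. This is not the pipeline's own code: the function name `classify_position`, the `(start, end)` exon tuples, the `report_distance` flag, and the "intergenic" label are all hypothetical and chosen only for illustration. It assumes the exons of a single transcript are sorted and non-overlapping, with 1-based inclusive coordinates.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical representation: (start, end), 1-based, inclusive.
Exon = Tuple[int, int]


def classify_position(
    pos: int, exons: List[Exon], report_distance: bool = True
) -> Tuple[str, Optional[int]]:
    """Label a position as exon, intron, or intergenic for one transcript.

    For intron positions, optionally return the distance in bp to the
    nearest exon boundary. `exons` must be sorted and non-overlapping.
    """
    # Anything outside the first-exon start .. last-exon end span.
    if not exons or pos < exons[0][0] or pos > exons[-1][1]:
        return ("intergenic", None)

    starts = [start for start, _ in exons]
    i = bisect_right(starts, pos) - 1  # last exon starting at or before pos
    start, end = exons[i]
    if start <= pos <= end:
        return ("exon", None)

    # pos lies in the intron between exons[i] and exons[i + 1].
    if not report_distance:
        return ("intron", None)
    distance = min(pos - end, exons[i + 1][0] - pos)
    return ("intron", distance)
```

With `exons = [(100, 200), (300, 400)]`, calling `classify_position(210, exons)` returns `("intron", 10)`, because position 210 is 10 bp past the end of the first exon.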
- Python 3.x
- sys
- csv
- re
- shutil
- Git LFS
If you have not already done so, initialize Git LFS by running
git lfs install
Clone this repo as follows
git clone https://github.com/nicolexxuu/SNPAAMapper-Python
cd ./SNPAAMapper-Python
Then download hg19_CDSIntronWithSign.txt.out to your local repository.
Next, type
./run_SNPAAMapper-Python.sh config.txt
OR run the following steps in sequential order (Note: the first two steps were compiled for the human hg19 genome and output files have already been generated):
- Process exon annotation files and generate feature start and gene mapping files:
python3 Algorithm_preprocessing_exon_annotation_RR.py ChrAll_knownGene.txt.exon
- Classify variants by regions (CDS, Upstream, Downstream, Intron, UTRs...):
python3 Algorithm_mapping_variants_reporting_class_intronLocation_updown.py ChrAll_knownGene.txt.exon VCF_input_file_in_tab_delimited_format.vcf
OR
python3 Algorithm_mapping_variants_reporting_class_intronLocation_updown.py ChrAll_knownGene.txt.exon VCF_input_file_in_tab_delimited_format.vcf IntronExon_boundary_in_bp
- Predict amino acid change type:
python3 Algorithm_predicting_full_AA_change_samtools_updown.py VCF_input_file_in_tab_delimited_format.vcf.append kgXref.txt hg19_CDSIntronWithSign.txt.out ChrAll_knownGene.txt >VCF_input_file_in_tab_delimited_format.vcf.out.txt
- Prioritize mutation effects:
python3 Algorithm_prioritizing_mutation_headerTop_updown.py VCF_input_file_in_tab_delimited_format.vcf.append.out.txt
The final output file is *.append.out.txt.prioritzed_out.
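To make the amino acid change prediction step more concrete, here is a minimal sketch of how a single-base substitution in a coding sequence can be labeled as synonymous or non-synonymous using the standard genetic code. It is an illustration under stated assumptions, not the logic of `Algorithm_predicting_full_AA_change_samtools_updown.py`: the names `translate` and `classify_snp`, the 0-based coding-strand `offset`, and the output labels are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence; unknown codons (e.g. with N) become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def classify_snp(cds: str, offset: int, alt: str) -> dict:
    """Classify a substitution at 0-based `offset` of `cds` (coding strand)."""
    cds = cds.upper()
    if not 0 <= offset < len(cds) - len(cds) % 3:
        raise ValueError("offset must fall inside a complete codon")
    alt_cds = cds[:offset] + alt.upper() + cds[offset + 1:]

    codon_start = offset - offset % 3
    ref_aa = CODON_TABLE.get(cds[codon_start:codon_start + 3], "X")
    alt_aa = CODON_TABLE.get(alt_cds[codon_start:codon_start + 3], "X")

    if ref_aa == alt_aa:
        change = "synonymous"
    elif alt_aa == "*":
        change = "non-synonymous (nonsense)"
    else:
        change = "non-synonymous (missense)"

    # Full protein sequences for both alleles, as in the spreadsheet output.
    return {
        "change": change,
        "ref_aa": ref_aa,
        "alt_aa": alt_aa,
        "ref_protein": translate(cds),
        "alt_protein": translate(alt_cds),
    }
```

For example, `classify_snp("ATGGCCTAA", 5, "T")` changes the codon GCC to GCT, both of which encode alanine, so the `change` field is `"synonymous"`. A real pipeline would also need to handle strand orientation and map genomic coordinates to CDS offsets, which this sketch leaves out.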
MIT