
Variant Annotation

evanbiederstedt edited this page Apr 30, 2019 · 1 revision

GRCh37

SNVs and indels

Basic annotation of the merged VCF file produced from the individual variant callers is carried out in two steps. First, the combined VCF is annotated with information from RepeatMasker and the ENCODE consortium. The annotation files are retrieved from the UCSC Genome Browser, stripped of the "chr" prefix on chromosome names, and indexed as follows:

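# Download the hg19 RepeatMasker table from UCSC
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz
gunzip rmsk.txt.gz
# Keep chromosome, start, end and repeat class (columns 6-8 and 12), retain only
# low-complexity and simple repeats, and strip the "chr" prefix
cut -f6-8,12 rmsk.txt | \
    grep -e "Low_complexity" -e "Simple_repeat" | \
    sed 's/^chr//' > rmsk_mod.bed
# Compress and index the BED file for region lookups
bgzip rmsk_mod.bed
tabix --preset bed rmsk_mod.bed.gz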
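
To make the column selection concrete, here is a minimal sketch that runs the same cut/grep/sed steps on two made-up rows laid out like the 17-column rmsk.txt table (fields 6-8 are chromosome, start and end; field 12 is the repeat class). The sample rows and the `rmsk_filter` helper name are illustrative assumptions, not part of the pipeline.

```shell
# Hypothetical helper wrapping the same cut/grep/sed steps used above.
rmsk_filter() {
    cut -f6-8,12 "$1" | \
        grep -e "Low_complexity" -e "Simple_repeat" | \
        sed 's/^chr//'
}

# Two made-up rows in the rmsk.txt column layout (toy values, not real data).
toy=$(mktemp)
printf '585\t100\t0\t0\t0\tchr1\t100\t200\t-1000\t+\t(CA)n\tSimple_repeat\tSimple_repeat\t1\t100\t0\t1\n' > "$toy"
printf '585\t100\t0\t0\t0\tchr1\t300\t400\t-800\t+\tL1\tLINE\tL1\t1\t100\t0\t2\n' >> "$toy"

# Only the Simple_repeat row should survive, with "chr1" rewritten to "1".
filtered=$(rmsk_filter "$toy")
printf '%s\n' "$filtered"
rm -f "$toy"
```

The LINE row is dropped by the grep step, and the surviving row is reduced to four tab-separated fields, which is the shape tabix expects with the bed preset.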

# Download the ENCODE DAC consensus excludable regions for hg19
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz
gunzip wgEncodeDacMapabilityConsensusExcludable.bed.gz
# Strip the "chr" prefix from chromosome names
sed -i 's/^chr//' wgEncodeDacMapabilityConsensusExcludable.bed
# Compress and index the BED file for region lookups
bgzip wgEncodeDacMapabilityConsensusExcludable.bed
tabix --preset bed wgEncodeDacMapabilityConsensusExcludable.bed.gz

Subsequently, vcf2maf is used to annotate the functional effects of mutations, along with other metadata, using VEP. The --custom-enst argument to vcf2maf takes a list of preferred transcript isoforms onto which mutations are mapped. We supply a consensus of the isoform_overrides_at_mskcc and isoform_overrides_uniprot lists, in which the MSKCC entry takes precedence for any gene present in both, generated as follows:

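library(dplyr)  # provides %>%, filter() and bind_rows()

t1 = readr::read_tsv('isoform_overrides_at_mskcc')
t2 = readr::read_tsv('isoform_overrides_uniprot')
# Keep UniProt overrides only for genes without an MSKCC override,
# then append all MSKCC overrides
t2 %>%
    dplyr::filter(!(gene_name %in% t1$gene_name)) %>%
    dplyr::bind_rows(., t1) %>%
    readr::write_tsv('isoforms')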
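
For readers who want to check the merge logic without R, the same precedence rule can be sketched in shell with awk. This is only an illustration under stated assumptions: both files are tab-separated, share the same column layout, and have a header row containing a `gene_name` column. The `merge_isoforms` helper and the toy identifiers are hypothetical.

```shell
# Hypothetical helper: print the header, then rows of the second file whose
# gene_name is absent from the first file, then all rows of the first file.
# Assumes tab-separated input with a header row containing "gene_name".
merge_isoforms() {
    awk -F'\t' '
        FNR == 1 {                      # header line of each file
            for (i = 1; i <= NF; i++) if ($i == "gene_name") col = i
            if (NR == 1) print          # emit the header only once
            next
        }
        NR == FNR { seen[$col] = 1; keep[++n] = $0; next }   # first file: remember rows
        !($col in seen) { print }                             # second file: new genes only
        END { for (i = 1; i <= n; i++) print keep[i] }        # append first-file rows
    ' "$1" "$2"
}

# Toy inputs with placeholder identifiers (not real transcript IDs).
a=$(mktemp); b=$(mktemp)
printf 'enst_id\tgene_name\nENST_A1\tGENE1\n' > "$a"
printf 'enst_id\tgene_name\nENST_B1\tGENE1\nENST_B2\tGENE2\n' > "$b"

merged=$(merge_isoforms "$a" "$b")
printf '%s\n' "$merged"
rm -f "$a" "$b"
```

In the toy run, GENE1 appears in both inputs, so only the row from the first file is kept for it, while GENE2 is taken from the second file. This mirrors the row order of the R snippet, which writes the filtered second table followed by the first.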