maftools
provides various functions to perform most commonly used analyses in cancer genomics. This package attempts to summarize, analyze, annotate and visualize MAF files in an efficient manner from either The Cancer Genome Atlas (TCGA) sources or any in-house studies as long as the data is in MAF format.
Such cohort-based large-scale characterizations often produce large amounts of data in the form of somatic variants containing single-nucleotide variants (SNV) and small insertion/deletions (indels). Somatic variants provide baseline data for many analyses, such as Single Nucleotide Polimorphism (SNP) types, driver gene detection, mutation frequencies, somatic interactions, visualization, and estimation of tumor heterogeneity, applied by the package employment.
Thanks to Genome Resarch. PMID: 30341162 and PoisonAlien/maftools
Install from BiocManager
package
BiocManager::install("maftools")
library(maftools)
-
MAF files - can be obtained from TCGA or gz compressed.
-
here we downloaded the MAF file of LUAD cohort from TCGA (GDC portal)
library(TCGAbiolinks)
library(DT)
library(dplyr)
luad_maf <- GDCquery(project = "TCGA-LUAD",
data.category = "Simple Nucleotide Variation",
data.type = "Masked Somatic Mutation",
access = "open")
GDCdownload(luad_maf,method = "client")
luadmaf <- GDCprepare(luad_maf)
Now, we have all genes with mutation types in LUAD cohort along with features data. SNP mutations are detectable in the output table, which can be separated based on Variant_Classification column in luadmaf table resulted from GDCprepare
function.
head (luadmaf)
A tibble: 6 Ă— 140
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification
<chr> <int> <chr> <chr> <chr> <int> <int> <chr> <chr>
1 ELAPOR1 57535 BI GRCh38 chr1 109197542 109197542 + Silent
2 HIPK1 204851 BI GRCh38 chr1 113962403 113962403 + Missense_Mutation
3 FDPS 2224 BI GRCh38 chr1 155319833 155319833 + Missense_Mutation
4 YY1AP1 55249 BI GRCh38 chr1 155668660 155668660 + Silent
5 FCRL3 115352 BI GRCh38 chr1 157697402 157697402 + Silent
6 OR6F1 343169 BI GRCh38 chr1 247712205 247712205 + Missense_Mutation
We also access to altered alleles and substitutions' position in each gene, indication nucleotides and amino acid exchanges in columns: HGVSc, HGVSp, and HGVSp_Short.
HGVSc HGVSp HGVSp_Short
<chr> <chr> <chr>
1 c.2190C>T p.Phe730= p.F730=
2 c.2068C>G p.Gln690Glu p.Q690E
3 c.964G>C p.Gly322Arg p.G322R
4 c.984C>T p.Leu328= p.L328=
5 c.582G>T p.Leu194= p.L194=
6 c.551G>T p.Trp184Leu p.W184L
-
read.maf
function reads and summarizes MAF files in various ways, and stores as a MAF object. -
It is recommended to provide annotations associated with samples in MAF.
-
Here I read MAF file of LUAD cohort based on an object resulted from
GDCprepare
function because there are sample barcodes annotation in it.
LUAD <- read.maf(maf = luadmaf)
#Validating
-Silent variants: 47915
-Summarizing
--Possible FLAGS among top ten genes:
TTN
MUC16
USH2A
FLG
-Processing clinical data
-Processing clinical data
-Finished in 30.9s elapsed (25.6s cpu)
#Typing laml shows basic summary of MAF file.
LUAD
An object of class MAF
ID summary Mean Median
<char> <char> <num> <num>
1: NCBI_Build GRCh38 NA NA
2: Center BI NA NA
3: Samples 616 NA NA
4: nGenes 16642 NA NA
5: Frame_Shift_Del 4256 6.909 4.0
6: Frame_Shift_Ins 1290 2.094 1.0
7: In_Frame_Del 406 0.659 0.0
8: In_Frame_Ins 49 0.080 0.0
9: Missense_Mutation 126254 204.958 137.5
10: Nonsense_Mutation 10454 16.971 10.0
11: Nonstop_Mutation 175 0.284 0.0
12: Splice_Site 3719 6.037 3.0
13: Translation_Start_Site 211 0.343 0.0
14: total 146814 238.334 158.5
sample summary
sample <- getSampleSummary(LUAD)
Gene summary
gene <- getGeneSummary(LUAD)
Recommended clinical data associated with each sample/Tumor_Sample_Barcode in MAF.
- Here I get the linicalData from MAF file resulted from
read.maf
function. Another way is to obtain clinical data frame from GDC portal usingGDCquery
function.
clin <- getClinicalData(LUAD)
Interaction between pair genes using somaticInteractions
function based on:
-
Co-occurance- co-occurance frequency - both genes are often mutated together.
-
Mutually exlusive - One gene is mutated, the other gene is not.
Useful for identification of relationships between genes based on their mutation patterns across a population of samples. Highlights potential functional connections between the genes.
- Significance customization -- P.Value < 0.05 or 0.01
somaticinter <- somaticInteractions(maf = LUAD, top = 20, pvalue = 0.01)
Binary feature for identifying mutated or non-mutated based on 0 and 1
The result shows pair genes interacted together. The significant interactions are identified by pAdj column and the interaction type is described by Event column.
gene1 gene2 pValue oddsRatio 00 01 11 10 pAdj Event pair
<char> <char> <num> <num> <int> <int> <int> <int> <num> <char> <char>
1: LRP1B MUC16 1.122434e-18 4.882350 296 120 133 67 2.040789e-17 Co_Occurence LRP1B, MUC16
2: CSMD3 TTN 3.465805e-18 4.433788 267 107 155 87 5.776341e-17 Co_Occurence CSMD3, TTN
3: USH2A CSMD3 1.326806e-17 4.868679 313 124 118 61 2.041241e-16 Co_Occurence CSMD3, USH2A
4: TTN MUC16 6.281772e-17 4.131432 259 95 158 104 8.973961e-16 Co_Occurence MUC16, TTN
5: APOB MUC16 1.693377e-16 6.421572 337 169 84 26 2.257836e-15 Co_Occurence APOB, MUC16
6: TTN USH2A 1.019462e-15 4.367813 296 58 121 141 1.274328e-14 Co_Occurence TTN, USH2A
maftools
has a function oncodrive
which identifies cancer genes (driver) from a given MAF. Oncodriver - driver genes - Concept is based on the fact that most of the variants in cancer causing genes are enriched at few specific loci (hot-spots).
driver <- oncodrive(maf = LUAD, minMut = 5, pvalMethod = "zscore")
Estimating background scores from synonymous variants..
Assuming protein change information are stored under column HGVSp_Short. Use argument AACol to override if necessary.
head(luadsig)
Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins Missense_Mutation
<char> <int> <int> <int> <int> <int>
1: CDC42EP2 0 0 0 0 5
2: DDX49 0 0 0 0 8
3: DEFB121 0 0 0 0 6
4: KRTCAP3 0 0 0 0 6
5: LCAT 0 0 0 0 4
6: OXT 0 0 0 0 6
Nonsense_Mutation Nonstop_Mutation Splice_Site Translation_Start_Site total MutatedSamples
<int> <int> <int> <int> <num> <int>
1: 0 0 0 0 5 5
2: 0 0 0 0 8 8
3: 0 0 0 0 6 6
4: 0 0 0 0 6 6
5: 1 0 0 0 5 4
6: 0 0 0 0 6 6
AlteredSamples clusters muts_in_clusters clusterScores protLen zscore pval fdr
<int> <int> <int> <num> <int> <num> <num> <num>
1: 5 1 5 1 210 5.546154 1.46011e-08 6.363157e-06
2: 8 2 8 1 483 5.546154 1.46011e-08 6.363157e-06
3: 6 1 6 1 76 5.546154 1.46011e-08 6.363157e-06
4: 6 1 6 1 240 5.546154 1.46011e-08 6.363157e-06
5: 4 2 5 1 440 5.546154 1.46011e-08 6.363157e-06
6: 6 1 6 1 125 5.546154 1.46011e-08 6.363157e-06
fract_muts_in_clusters
<num>
1: 1
2: 1
3: 1
4: 1
5: 1
6: 1
maftools
comes with the function pfamDomains
, which adds pfam domain information to the amino acid changes. Adding pfam domain changes for detection of amino acid changes in protein.pfamDomain
also summarizes amino acid changes according to the domains that are affected.
pfam <- pfamDomains(maf = LUAD, top = 10)
Get summary of protein and amino acid
prSum <- pfam$proteinSummary
Dsum <- pfam$domainSummary
head(prSum)
HGNC AAPos Variant_Classification N total fraction DomainLabel pfam
<char> <num> <fctr> <int> <num> <num> <char> <char>
1: KRAS 12 Missense_Mutation 146 163 0.8957055 COG1100 <NA>
2: EGFR 858 Missense_Mutation 23 104 0.2211538 PTKc_EGFR cd05108
3: EGFR 746 In_Frame_Del 21 104 0.2019231 PTKc_EGFR cd05108
4: BRAF 640 Missense_Mutation 13 54 0.2407407 Pkinase pfam00069
5: ALDH1B1 75 Missense_Mutation 12 22 0.5454545 PLN02466 <NA>
6: CD163 987 Missense_Mutation 12 52 0.2307692 SR smart00202
Description
<char>
1: GTPase SAR1 and related small G proteins [General function prediction only]
2: Catalytic domain of the Protein Tyrosine Kinase, Epidermal Growth Factor Receptor
3: Catalytic domain of the Protein Tyrosine Kinase, Epidermal Growth Factor Receptor
4: Protein kinase domain
5: aldehyde dehydrogenase family 2 member
6: Scavenger receptor Cys-rich
Survival analysis is an essential part of cohort anaysis.
Required clinical input => Tumor_sample_Barcode -> matches to those in MAF file.
binary event -> 1,0 => Status (vital_status): Alive = 0, Dead = 1
time to event -> last followup => days
We can load clinical data in various ways such as TCGA or othe clinical files including annotation of samples, specifically Tumor_sample_Barcode column as in MAF file.
- Here I load all data of LUAD combined with clinical data
cohort <- tcgaLoad(study = "LUAD")
releasing clinicalData
clin <- getClinicalData(cohort)
removing NA values and setting 0 and 1 for Alive and DEAD status respectively because it is important to have binary feature.
clin$days_to_last_followup[clin$days_to_last_followup=="[Not Available]"] <- NA
clin$vital_status[clin$vital_status=="Alive"] <- 0
clin$vital_status[clin$vital_status=="Dead"] <- 1
obtaining all data
data <- cohort@data
luad <- read.maf(data, clinicalData = clin)
Now, we have a full data with clinical eature added.
Function mafSurvive
performs survival analysis and draws kaplan meier curve by grouping samples based on mutation status
mafSurvival(maf = luad, genes = "DNMT3A",
time = "days_to_last_followup", Status = "vital_status")
Group medianTime N
<char> <num> <int>
1: Mutant 259.5 18
2: WT 626.0 375
- Identify set of selected significant gene pairs which affects survival ratio
mafSurvGroup(maf = luad, geneSet = c("KRAS","TP53"),
time = "days_to_last_followup", Status = "vital_status")
- Summary of variant classification
plotmafSummary(maf = luad, rmOutlier = T, addStat = "median", dashboard = T, titvRaw = F)
- Oncoplot
oncoplot(maf = luad)
- Transition and Transversion
luad.titv <- titv(maf = luad,plot = F, useSyn = T)
- Lollipop plot
Indicating the mutation frequencies positions on a gene based on protien domain.
lollipopPlot(maf = luad, gene = "TP53", AACol = "HGVSp_Short",
showMutationRate = T, showDomainLabel = T)