-
Notifications
You must be signed in to change notification settings - Fork 0
Discover TF binding motifs affected by QTLs
This tutorial is about using VCF file as input to discover which motifs are associated with the variants provided in the file. In this example, we will replicate a piece of Figure 3C of our paper, in which we test the DNase QTLs from Degner et al., 2012 for chromatin accessibility in the lymphoblastoid cell line.
Here is the total analysis. All the scripts used in this tutorial can be found in the bin folder. We will go through each step in the subsequent sections.
#Download relevant data
wget http://homer.ucsd.edu/zeyang/maggie/QTL/dsQTL.vcf #QTLs
#Split variants based on genomic regions
python splitVariants.py \
dsQTL.vcf \
hg18 \
--overlap 100 \
-o .
#Run MAGGIE
python maggie_vcf_input.py \
dsQTL.vcf \
hg18 \
--size 100 \
--effect 6 \
-o ./maggie_output/ \
-p 8
Download the DNase QTLs in the format of VCF.
wget http://homer.ucsd.edu/zeyang/maggie/QTL/dsQTL.vcf
Besides the default columns for VCF file such as variant position and reference and alternative allele, MAGGIE requires an extra column that indicates the effect sizes of each variant ("Beta" column in the file dsQTL.vcf). Effect size is necessary for MAGGIE to associate the direction of the change of motif induced by a certain variant with the direction of the change of histone function. This is also what distinguishes MAGGIE from conventional motif enrichment tools.
If you would like to test on other types of QTLs (histone modification QTLs) listed in Figure 3C from Grubert et al., 2015, here are the links to download those data.
#H3K27ac QTLs
wget http://homer.ucsd.edu/zeyang/maggie/QTL/H3K27ac.vcf
#H3K4me1 QTLs
wget http://homer.ucsd.edu/zeyang/maggie/QTL/H3K4me1.vcf
#H3K4me3 QTLs
wget http://homer.ucsd.edu/zeyang/maggie/QTL/H3K4me3.vcf
If there could be a chance that the functional motifs for chromatin accessibility (or other epigenomic features) vary by genomic regions (e.g., distal enhancers vs. promoters), you would probably like to split the variants before running MAGGIE. You can always test on the whole set and then refine your analysis by looking at subsets.
python splitVariants.py \
dsQTL.vcf \
hg18 \
--overlap 100 \
-o .
Specify the genome coordinate that the positions in VCF are based on. Those DNase QTLs were identified using the hg18 genome. If you try other types of QTLs mentioned before, replace it with hg19
. The software will look for GFF3 file for the specified genome in /path/to/maggie/data/genomes/. In this case, if "hg18.gff3" does not exist, it will automatically download a GFF3 file into /path/to/maggie/data/genomes/. If you have the annotation file stored in a different path, you can directly specify /path/to/your/FASTA/file.gff3.
The range extended from variants to overlay with genomic annotations. The rule of thumb is to make it the same as the size you would like to account for motifs in the next step (i.e., --size
flag in the next step).
Specify a folder to store output files, which include a summary file of annotations (allAnnotation.txt
) and subsets of variants split into different categories, including intergenic (variants_Intergenic.vcf
), intronic (variants_intron.vcf
), and near TSS (variants_TSS.vcf
). Each one of these VCF files can be put into MAGGIE in step 3.
Next, run maggie_vcf_input.py for VCF file as the input format.
python maggie_vcf_input.py \
dsQTL.vcf \
hg18 \
--size 100 \
--effect 6 \
-o ./maggie_output/ \
-p 8
The input VCF file requires at least 6 columns to indicate chromosome (1st), position (2nd), variant ID (3rd), reference allele (4th), alternative allele (5th), and effect size (determined by -e
flag). This format works best for summary data like QTLs and allele-specific sites.
Specify the reference genome. Again, for the histone modification QTLs mentioned before, replace it with hg19
. As a default, the software will look for FASTA file for the specified genome in /path/to/maggie/data/genomes/. In this case, if "hg18.fa" does not exist, it will automatically download a FASTA file into /path/to/maggie/data/genomes/. If you have the genome file stored in a different path, you can directly specify /path/to/your/FASTA/file.fa.
The range of expanded region around variants to conduct analysis. The default is 100 bp, which focuses on the proximal area of any given variant within the open chromatin region.
The e
th column in VCF file represents effect size. Here the 6th column with header "Beta" includes the effect size of those DNase QTLs on DNase signal.
Output folder that stores several result files from MAGGIE.
The number of processors to run MAGGIE in parallel for motifs. It is recommended to use multiple processors, which can greatly reduce computing time. Using 2 cores will take approximately half the time of running on a single core!
Results will be organized in a folder specified by -o
when running MAGGIE. An example can be found here.
It contains the summary statistics for all the motifs. Each line represents a motif. There are thirteen columns:
- Column 1: motif ID obtained from the JASPAR-format motif file, the label right after">".
- Column 2: motif description obtained from the motif file, the text separated by TAB from motif ID.
- Column 3, 4, 5: statistics from Wilcoxon signed-rank test at 5%, 50% (median), and 95% percentile. A confidence interval is computed based on 1,000 iterations of bootstrapping.
- Column 6, 7, 8: p-values from Wilcoxon signed-rank test at 5%, 50% (median), and 95% percentile.
- Column 9, 10, 11: total number of samples with any mutations (change of motif score), positive mutations (increase in score), negative mutations (decrease in score) on a given motif.
- Column 12, 13: median and mean motif score difference among all the samples with mutations
The aggregated results from similar motifs based on a high correlation (Pearson > 0.6) of motif score differences. Each line represents a merged set of motifs. Statistics and p-values are averaged, while the count of mutated samples is done by taking the union. Column 1 of this file indicates the motifs with the naming rule as below: {motif 1 description}${motif 1 ID}|{motif 2 description}${motif 2 ID}|...
.
Include a subset of maggie_output_merged.tsv
file that are significant based on FDR < 0.05.
Include the motif logos of significant merged motifs from maggie_output_mergedSignificant.tsv
file.
Include the distributions or box plots of significant merged motifs from maggie_output_mergedSignificant.tsv
file.
Visualize the significant merged lists of motifs in maggie_output_mergedSignificant.tsv
file. See an example here. It additionally displays motif logos and distributions of motif score differences, which are respectively stored in folder logos/
and distributions/
.