zeyang-shen edited this page Jul 20, 2020 · 8 revisions

General

  • Why should I use MAGGIE?

If traditional tools give you a long list of enriched motifs and you would like to narrow it down to those most likely to be functional, you should consider MAGGIE.

  • How is MAGGIE different from motif enrichment/discovery tools (e.g., HOMER, MEME)?

HOMER and MEME find motifs that are enriched among input sequences relative to predefined backgrounds. In general, these tools focus on the frequency of motif occurrence, while MAGGIE focuses on changes in motifs. To move from this static metric to a more dynamic view, MAGGIE requires additional knowledge about the effects of genetic variation. For example, if you only have measurements from a homozygous sample, MAGGIE cannot show its advantage. However, if your sample is heterozygous, or you have additional measurements from a genetically different sample, you can use MAGGIE to look for motifs that are changed by the genetic variation and whose changes are associated with the differences in your measurements. Because of this more dynamic view of motifs, we often see that MAGGIE can assign functions to the tested motifs and is more sensitive in distinguishing motifs with different functions.

  • Can I use MAGGIE on two different cell types or the same cell type under different conditions?

Not if you are looking for a one-stop tool that returns a list of differential motifs that are significant in one cell type but not the other. The best way to use MAGGIE is to apply it to each cell type or condition separately and then compare the results from the different scenarios.

Input

  • What kinds of inputs does MAGGIE support now?

MAGGIE currently supports two kinds of input: 1) a VCF file containing the variants and the effect size of each variant on your epigenomic feature of interest, and 2) FASTA files in which your positive sequences (associated with the epigenomic feature) and negative sequences (not associated, or associated with a lower level of the epigenomic feature) are clearly labeled and placed in two separate files. See the documentation for VCF file and FASTA file as inputs, or the real biological examples for VCF and FASTA.

  • How long were the sequences you used, and where did they start and end relative to the peak boundaries/summits?

The rule of thumb is to choose a sequence length that covers all the potential TF binding motifs, roughly the width of ATAC-seq or TF ChIP-seq peaks: 200-300 bp. For a more focused investigation of motif mutations (e.g., QTLs), we used 100-bp sequences centered on the genetic variants of interest in order to gain a little statistical power.
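As a rough illustration of the 100-bp setup described above, the sketch below cuts a fixed-length window centered on a variant out of a longer sequence. The function name, the 0-based coordinate convention, and the handling of windows that run off either end of the sequence are assumptions made for this example, not part of MAGGIE.

```python
def centered_window(sequence: str, variant_pos: int, length: int = 100) -> str:
    """Return a window of `length` bases around a 0-based variant position.

    Hypothetical helper: if the window would run off either end of the
    sequence, it is shifted inward so the returned length stays constant
    whenever the sequence is long enough.
    """
    half = length // 2
    start = max(0, variant_pos - half)
    end = min(len(sequence), start + length)
    # If the right edge was clipped, shift the left edge back to keep the length.
    start = max(0, end - length)
    return sequence[start:end]


if __name__ == "__main__":
    chrom = "ACGT" * 100  # toy 400-bp sequence
    window = centered_window(chrom, variant_pos=200)
    print(len(window))
```

The same window should be extracted around each variant for both alleles, so that the positive and negative sequences differ only by the variant itself.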

  • Why do I get the following error message?
    ERROR: unequal # sequences in input files! Make sure your inputs come as pairs

This can happen when you use FASTA files as inputs. As the message says, inputs must come as pairs, so the positive and negative files need to contain the same number of sequences. Make sure you separate your positive and negative sequences into different files. Sequences should be classified as positive or negative by their association with your epigenomic feature of interest, not by their source (individual, animal strain, allele), cell type, etc., which are irrelevant for MAGGIE analysis.
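One quick way to catch this before running MAGGIE is to count the records in both FASTA files. The sketch below is a standalone check, not part of MAGGIE; the function names and file paths are placeholders chosen for this example.

```python
def count_fasta_records(path: str) -> int:
    """Count sequences in a FASTA file by counting header lines ('>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def check_paired(pos_path: str, neg_path: str) -> int:
    """Raise ValueError if the two files hold different numbers of sequences."""
    n_pos = count_fasta_records(pos_path)
    n_neg = count_fasta_records(neg_path)
    if n_pos != n_neg:
        raise ValueError(
            f"unequal number of sequences: {n_pos} positive vs {n_neg} negative"
        )
    return n_pos


if __name__ == "__main__":
    import sys

    # Usage (placeholder file names): python check_pairs.py positive.fa negative.fa
    print(check_paired(sys.argv[1], sys.argv[2]), "sequence pairs")
```

Equal counts are necessary but not sufficient: this check does not verify that the sequences on matching lines actually belong to the same pair.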

Output

  • How should I interpret the output from MAGGIE?

The best way to view the output from MAGGIE is to open the HTML file generated by the software. That file displays all the predicted functional motifs that are significantly associated with changes in the epigenomic feature of interest. However, the motifs in that file reflect statistical significance, not necessarily functional significance. Be careful when interpreting any specific motif, since only 1 or 2 transcription factors in a merged set of 10 motifs might actually be functional in your specific setting. Further evaluation of gene expression levels and experimental validation are needed to determine the functionality of a specific TF.

Computation and algorithm

  • What is the motif score that MAGGIE uses for computation?

PWM score. By default, MAGGIE takes the best motif match (i.e., the highest PWM score) in each input sequence as the representative motif score, then computes score differences and statistically tests them for mutation bias.
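To make the idea concrete, here is a minimal sketch of a "best PWM score" and the score difference between a positive and a negative sequence. It is an illustration only, not MAGGIE's implementation: the log-odds formula with a uniform 0.25 background, the small pseudocount, forward-strand-only scanning, and all function names are assumptions made for this example.

```python
import math

BASES = "ACGT"


def log_odds_pwm(prob_rows, background=0.25, pseudo=1e-3):
    """Turn per-position [A, C, G, T] probabilities into log-odds scores."""
    return [
        {base: math.log2((p + pseudo) / background) for base, p in zip(BASES, row)}
        for row in prob_rows
    ]


def best_pwm_score(seq, pwm):
    """Highest PWM score over all forward-strand windows of the sequence."""
    seq = seq.upper()
    width = len(pwm)
    scores = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if all(base in BASES for base in window):  # skip windows with N, etc.
            scores.append(sum(pwm[i][base] for i, base in enumerate(window)))
    return max(scores) if scores else float("-inf")


def score_difference(pos_seq, neg_seq, pwm):
    """Best score in the positive sequence minus best score in the negative one."""
    return best_pwm_score(pos_seq, pwm) - best_pwm_score(neg_seq, pwm)


if __name__ == "__main__":
    # Toy 3-bp motif that strongly prefers "ACG".
    toy_pwm = log_odds_pwm([
        [0.97, 0.01, 0.01, 0.01],
        [0.01, 0.97, 0.01, 0.01],
        [0.01, 0.01, 0.97, 0.01],
    ])
    print(score_difference("TTACGTT", "TTATGTT", toy_pwm))
```

A positive difference means the variant weakens the best match in the negative sequence. Collected over many sequence pairs, such differences are the quantity on which a test for mutation bias would operate.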

  • Could MAGGIE perform better if you pre-learned motifs with HOMER using the de novo mode, and then interrogate with those instead of using predefined motifs from JASPAR?

Absolutely possible! However, for the MAGGIE software to work, you would need to convert the motifs from HOMER format to JASPAR format.
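A conversion along these lines could look like the sketch below. It assumes the usual HOMER motif layout (a tab-separated ">" header with the consensus, the motif name and a score threshold, followed by one row of A, C, G, T probabilities per position) and writes JASPAR-style count rows. The scale of 100 used to turn probabilities into counts, the reuse of the HOMER name as both matrix ID and name, and the function names are assumptions for illustration; it has not been checked against MAGGIE's own motif parser.

```python
def parse_homer(homer_text):
    """Yield (name, rows); each row is [A, C, G, T] probabilities for one position."""
    name, rows = None, []
    for line in homer_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, rows
            fields = line[1:].split("\t")
            # HOMER header fields: consensus, motif name, score threshold, ...
            raw_name = fields[1] if len(fields) > 1 else fields[0]
            name = "_".join(raw_name.split())  # avoid spaces in the identifier
            rows = []
        else:
            rows.append([float(value) for value in line.split()])
    if name is not None:
        yield name, rows


def homer_to_jaspar(homer_text, scale=100):
    """Convert HOMER motif text into JASPAR-style text with integer counts."""
    blocks = []
    for name, rows in parse_homer(homer_text):
        lines = [f">{name} {name}"]
        for idx, base in enumerate("ACGT"):
            counts = " ".join(str(round(row[idx] * scale)) for row in rows)
            lines.append(f"{base} [ {counts} ]")
        blocks.append("\n".join(lines))
    return "\n".join(blocks) + "\n"


if __name__ == "__main__":
    example = (
        ">ACG\tToyMotif\t6.0\n"
        "0.97\t0.01\t0.01\t0.01\n"
        "0.01\t0.97\t0.01\t0.01\n"
        "0.01\t0.01\t0.97\t0.01\n"
    )
    print(homer_to_jaspar(example), end="")
```

Rounding to counts loses a little precision, so a larger scale may be preferable for motifs with many low-probability entries.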

  • How much time/memory does it take to run MAGGIE?

It takes about 1.5 hours and 100 MB of RAM on a single core of my 2.8 GHz MacBook Pro to run MAGGIE on 2,000 pairs of 200-bp input sequences using ~1,000 PWMs from the 2020 JASPAR core vertebrates collection. Time and memory both scale linearly with the sample size (both the number and the length of sequences) and with the number of motifs. Most of the running time is spent on motif score computation, so it is highly recommended to compute motif scores in parallel using the -p #core flag.
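For planning purposes, the reference numbers above can be scaled to another dataset. The sketch below is a back-of-the-envelope estimate only: it assumes the linear scaling stated above applies multiplicatively across pairs, sequence length and motif count, that parallel speedup is ideal, and that your hardware is comparable to the reference machine. The function and constant names are made up for this example.

```python
# Reference run quoted above: 1.5 h for 2,000 pairs x 200 bp x ~1,000 PWMs on one core.
REF_HOURS = 1.5
REF_PAIRS = 2000
REF_LENGTH = 200
REF_MOTIFS = 1000


def estimate_hours(n_pairs, seq_length, n_motifs, n_cores=1):
    """Rough runtime estimate, assuming linear scaling and ideal parallel speedup."""
    scale = (n_pairs / REF_PAIRS) * (seq_length / REF_LENGTH) * (n_motifs / REF_MOTIFS)
    return REF_HOURS * scale / n_cores


if __name__ == "__main__":
    # Example: 10,000 pairs of 100-bp sequences, 1,000 motifs, 4 cores.
    print(f"{estimate_hours(10_000, 100, 1000, n_cores=4):.2f} hours")
```

Actual speedup from parallelization is usually below ideal, so treat the result as a lower bound.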