Skip to content

Custom analysis with python package

zeyang-shen edited this page Jun 26, 2020 · 5 revisions

Overview

We also provide python APIs to conduct every step of the motif analysis in MAGGIE separately. It will be extremely useful if you would like to 1) customize your statistical testing methods, or 2) obtain intermediate results (e.g., motif scores) instead of going straight to summary statistics.

Import packages

import sys
sys.path.append('/path/to/your/cloned/maggie/folder/')
from maggie import score, utils

Read positive and negative sequences

pos_seq_file = './data/AlleleSpecificBinding/CTCF_binding_alleles.fa'
neg_seq_file = './data/AlleleSpecificBinding/CTCF_nonbinding_alleles.fa'
pos_seq_dict = utils.read_fasta(pos_seq_file)
neg_seq_dict = utils.read_fasta(neg_seq_file)

Positive and negative sequences are required to form pairs and have the same order after sorting by their sequence IDs in FASTA files. To make sure that the analysis will be done on the correct sequence pairs, it is recommended to give identical IDs to your sequences belonging to the same pair.

Load motif PWMs

motif_dict = score.load_motifs('./data/JASPAR2020_CORE_vertebrates_motifs/')

Currently, this module only supports motif files in JASPAR format and requires separate files for different motifs with all the files stored in one folder. Go here to take a look at the format of motif files.

Run MAGGIE to test a specific motif while obtaining motif score differences

This way gives you motif score differences for custom statistical testing.

name, ID, stats, pvals, score_diffs = score.test_one_motif(motif_dict['CTCF$MA0139.1'], pos_seq_dict, neg_seq_dict)

stats, pvals

This module will output a list of 1,000 statistics and 1,000 p-values from the Wilcoxon signed-rank test. The first values represent the original results from the given samples. The following 999 values are based on bootstrapping by randomly sampling the inputs with replacement and re-doing the statistical testing. These values are used to generate the confidence interval displayed in the final outputs of MAGGIE using executable scripts.

score_diffs

This includes all the motif score differences of the input samples for the testing motif. By default, MAGGIE filters out zero differences and conducts statistical tests only on non-zero differences. These score differences are not automatically saved by using executable scripts since they could take up a lot of space depending on your sample size.

Another way to obtain motif score differences

We also provide modules to compute motif scores. After computing the motif scores of positive and negative sequences, you can obtain score differences by substracting motif scores in pairs.

#compute scores for positive sequences
pos_scores = score.compute_scores(motif_dict['CTCF$MA0139.1'], pos_seq_dict, top_site=1)

#compute scores for negative sequences
neg_scores = score.compute_scores(motif_dict['CTCF$MA0139.1'], neg_seq_dict, top_site=1)

#compute score differences
import numpy as np
score_diffs = np.array(pos_scores) - np.array(neg_scores)

score.compute_scores

This module will output a list of motif scores in the order of input sequences after being sorted by their sequence IDs. To assure the score difference is computed for the correct sequence pair, we suggest giving the same ID to the sequences of the same pair when you create the FASTA file.

top_site

This indicates the number of top motifs to consider as representative for any input sample. The default (=1) is to consider the best motif, which is shown to well reflect the actual transcription factor binding activity (Figure 1C) and is considered most generalizable. If this parameter is set larger than 1, the output score will be the additive log-likelihood score: log2(\sum{2**score}).

If you are interested in a more sophisticated selection of motifs to involve in MAGGIE analysis or other analyses, we provide another module to output motif score together with its position and strand:

score_list, loc_list, strand_list = score.find_motif(motif_dict['CTCF$MA0139.1'], pos_seq_dict, top_site=1)

If you set top_site here to larger than 1, this module will output a list of scores, positions, and strands for each sample (instead of an aggregated score as in compute_scores module) for your convenience of additional filtering.

Detailed usage

Look into the documentation of python modules for more information on parameter and usage.

help(score.compute_scores) #check the documentation of any module