Skip to content

Lab 2: BLAST Tutorial

Ricky Woo edited this page Nov 30, 2017 · 39 revisions

Sequence alignment is the core of bioinformatics, and Smith-Waterman algorithm is the fundamental techniques in pairwise sequence alignment. However, when it comes to huge amount of data, the computational cost of this exact method can be overwhelming. BLAST, a heuristic realization of the algorithm, can balance between the sensitivity and the computatinal time.

BLAST algorithm

This is simply called the seed-and-extend strategy.

  1. Decompose the query sequence of length $m$ into a list of $m-w+1$ word of size $w$;
  2. Seeding: Find the exact matches or nearly-exact matches (hits) in database of size $n$ with the threshold score of $T$
  3. Extension: For each hit, extend the alignment in two directions to find alignment that score greater than the threshold $S$.
  4. Output the alignments and also the statistics.

Karlin-Altschul Statistics

E-value

E-value is the expect number of alignments that can reach a score of at least $S$ by chance under the current search space and scoring system: $$ E = Kmne^{-\lambda S} $$ where

  • $S$ is the alignment raw score;
  • $K$ is the Karlin-Altschul parameter that can be computed theoretically depending only on the scoring system;
  • $m$ is the query length;
  • $n$ is the effective target database size;
  • $\lambda$ is another Karlin-Altschul parameter that depends on the scoring system by solving the equation in ungapped alignment: $$ \sum_{i,j} p_i p_j e^{-\lambda S_{ij}} = 1 $$ where $p_i$ and $p_j$ are the background probability of residue $i$ and $j$.

The equation is inferred from how we obtain the scoring matrix: $$ S_{ij} = \log \frac{q_{ij}}{p_i p_j} $$ and we know that $\sum_{i,j} q_{ij} = 1$ where $q_{ij}$ is the observed frequency for the residue pair $i$ and $j$ in the high-scoring pairs (HSPs) for computing the scoring matrix.

  • The two parameters, $\lambda$ and $K$ for gapped alignments, can be priorly computed from simulation.

Bit score

The bit score, or the normalized raw score, $S'$, can be computed as: $$ S' = \frac{\lambda S - \ln K}{\ln 2} $$ Thus: $$ E = mn2^{-S'} $$

Example

Comparing a protein sequence of length 250 residues to a database containing 1e9 residues, how many local alignments with normalized score $\ge$ 35 bits can one expect to find by chance? The search space size is approximately $2^8 \times 2^{30} = 2^{38}$, so $E \approx 2^{38}/2^{35} = 2^3 = 8$. And the number of alignments with score $\ge 45$ bits one can expect to find by chance is $E = 2^{38}/2^{45} = 2^{-7} = 0.008$.

$p$-value

The probability of finding at least one such HSP with raw score $S$ by chance, $p$-value can be computed using: $$ p = K e^{-\lambda S} = 2^{-S'} $$ Therefore: $$ E = mn p $$

Q&A

  1. Why do we use the effective length instead of the raw length for the database size, $n$?
  2. What is the difference between raw score, $S$, and bit score, $S'$?
  3. How to interpret the $E$-value and $P$-value, as well as their relationship?
  4. Can we compare the BLAST $E$-values of a same query against different databases?
  5. Are the $E$-values comparable for the same query and database, but with different scoring systems?
  6. How can I obtain the Karlin-Altschul parameters $K$ and $\lambda$ for any arbitrary scoring system?
  • Answer: src/algo/blast/core/blast_stat.c
    • Compute $\lambda$
static double NlmKarlinLambdaNR(double* probs, Int4 d, Int4 low, Int4 high, double lambda0,
                  double tolx, Int4 itmax, Int4 maxNewton, Int4 * itn );
double Blast_KarlinLambdaNR(Blast_ScoreFreq* sfp, double initialLambdaGuess);
  • Compute the relative entropy, $H$:
static double BlastKarlinLtoH(Blast_ScoreFreq* sfp, double lambda);
  • Compute $K$:
static double BlastKarlinLHtoK(Blast_ScoreFreq* sfp, double lambda, double H)
  1. For a protein-coding nucleotide sequence, should we use BLASTN or BLASTX with 6 different reading frames? Why?
  2. Do we have the rule-of-thumb for choosing the appropriate scoring matrix?
  3. What's the difference between identity, similarity and homology?

Theoretical Exercise

In a database with a size of 100,000 amino acids, we use BLASTP to identify a 100% alignment between the query peptide sequence MVAETN and a database entry.

MVAETN
|||||| 
MVAETN
  • What is the probability that this alignment is evolutionarily related, compared with the probability that this alignment occurs by chance?

We can use two different scoring sytems:

  • BLOSUM62: $K=0.13, \lambda=0.34$  
  • PAM250: $K=0.13, \lambda=0.80$

Running a blast program

  • Create the index for the target database using makeblastdb;
  • Choose the task program: blastn, blastp, blastx, tblatx, psiblast or deltablast;
  • Set the configuration for match, mismatch, gap-open penalty, gap-extension penalty or scoring matrix;
  • Set the word size;
  • Set the E-value threshold;
  • Set the output format and the number of output results;

$\S$Standalone BLAST Exercises

  1. Download the uniprot protein database.
  2. Use makeblastdb to build the index.
  3. Download the human protein encoding by the gene human beta globin
  4. Run blastp and psiblast against the uniprot database using default parameters
  5. Change the scoring matrix, record the changes in the alignment results and interpret the results.

$\S$Programming Exercises

  1. Write Python program to conduct the classical Smith-Waterman algorithm and illustrate using an example of pairwise alignment between two DNA sequence files in FASTA format. If you change the setting for match, mismatch scores, and also the gap penalty, see what will happen to the alignment results.
  2. Extend your algorithms to affine penalty. Have a look at the changes of the alignment results for different gap-open and gap-extension penalty scores.
  3. BLASTP Exercise
  • Download the protein sequence database of an arbitrary strain of E.coli.
  • Use makeblastdb to build the index database for these proteins.
  • Generate at random 100 amino acid sequences of length=60, based on the zero-order Markov model of the above proteome.
  • Use blastp to conduct the queries against the database.
  • You can change the scoring matrix, the gap penalty, and also the other BLASTP parameters.
  1. Use one of the BLASTP result to illustrate the statistics behind the queries.

REFERENCE