Skip to content

Lab 2: BLAST Tutorial

Ricky Woo edited this page Dec 7, 2017 · 39 revisions

Sequence alignment is the core of bioinformatics, and Smith-Waterman algorithm is the fundamental techniques in pairwise sequence alignment. However, when it comes to huge amount of data, the computational cost of this exact method can be overwhelming. BLAST, a heuristic realization of the algorithm, can balance between the sensitivity and the computatinal time.

BLAST algorithm

This is simply called the seed-and-extend strategy.

  1. Decompose the query sequence of length $m$ into a list of $m-w+1$ word of size $w$;
  2. Seeding: Find the exact matches or nearly-exact matches (hits) in database of size $n$ with the threshold score of $T$
  3. Extension: For each hit, extend the alignment in two directions to find alignment that score greater than the threshold $S$.
  4. Output the alignments and also the statistics.

Karlin-Altschul Statistics

E-value

E-value is the expect number of alignments that can reach a score of at least $S$ by chance under the current search space and scoring system: $$ E = Kmne^{-\lambda S} $$ where

  • $S$ is the alignment raw score;
  • $K$ is the Karlin-Altschul parameter that can be computed theoretically depending only on the scoring system;
  • $m$ is the query length;
  • $n$ is the effective target database size;
  • $\lambda$ is another Karlin-Altschul parameter that depends on the scoring system by solving the equation in ungapped alignment: $$ \sum_{i,j} p_i p_j e^{-\lambda S_{ij}} = 1 $$ where $p_i$ and $p_j$ are the background probability of residue $i$ and $j$.

The equation is inferred from how we obtain the scoring matrix: $$ S_{ij} = \log \frac{q_{ij}}{p_i p_j} $$ and we know that $\sum_{i,j} q_{ij} = 1$ where $q_{ij}$ is the observed frequency for the residue pair $i$ and $j$ in the high-scoring pairs (HSPs) for computing the scoring matrix.

  • The two parameters, $\lambda$ and $K$ for gapped alignments, can be priorly computed from simulation.

Relative entropy

The relative entropy ($H$) of a scoring matrix can conveniently capture the general behavior of the matrix. $H$ is the average number of bits (or nats) per position in an alignment: $$ H = - \sum_{i=1}^{20} q_{ij} \lambda S_{ij} $$

Sequence length correction

In BLAST, we shoud use the effective length other than the raw length, since extension can not be conducted at either end of the sequences. Karlin-Altschul statistics provides a way to calculate just how long a sequence must be before it can produce an alignment with a significant Expect. This minimum length, l, is usually referred to as the expected HSP length: $$ l = \ln(Kmn) / H $$ and then the effective length for the query: $$ m' = m - l $$ and the database: $$ n' = n - l * \text{number of sequences in the database} $$

Sequence composition correction

Since both the BLOSUM and PAM matrices are inferred from a set of empirical sequences, which may have very different amino-acid composition with the query and the database under investigation. Thus the scoring matrix shoulld be corrected for the composition of the sequences.

Bit score

The bit score, or the normalized raw score, $S'$, can be computed as: $$ S' = \frac{\lambda S - \ln K}{\ln 2} $$ Thus: $$ E = mn2^{-S'} $$

Example

Comparing a protein sequence of length 250 residues to a database containing 1e9 residues, how many local alignments with normalized score $\ge$ 35 bits can one expect to find by chance? The search space size is approximately $2^8 \times 2^{30} = 2^{38}$, so $E \approx 2^{38}/2^{35} = 2^3 = 8$. And the number of alignments with score $\ge 45$ bits one can expect to find by chance is $E = 2^{38}/2^{45} = 2^{-7} = 0.008$.

$p$-value

The probability of finding at least one such HSP with raw score $S$ by chance, $p$-value can be computed using: $$ p = K e^{-\lambda S} = 2^{-S'} $$ Therefore: $$ E = mn p $$

Scoring system

Subsitution matrix

  • For nucleotide sequences,
  • For amino acid sequences,

PAM (Point-accepted mutations)

  • Inferred from global alignment by taking all the mutations in the sequences;
  • Based on evolution theory.
  • All the other PAM are extrapolated from PAM1 using the assumed Markov chain.
How to compute PAM1
  • Compute $p_i$ all every amino acid, $i$;
  • Compute $p(i \to j)$ for every pair of $i$ and $j$;
  • $M_{ij} = p(i \to j)$
  • Scale $M$ such that $\sum_i p_i (1 - M_{ii}) = 0.01$
  • $S_{ij} = \text{round}(10\log_{10} \frac{M_{ij}}{p_j})$
  • $PAMn = PAM1^n$

BLOSUM (BLOcks SUbstitution Matrix)

  • Based on comparisons of Blocks of sequences derived from the Blocks database
  • The Blocks database contains multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins (local alignment)
  • BLOSUM matrices are derived from blocks whose alignment corresponds to the BLOSUM-matrix number (e.g. BLOSUM 62 is derived from Blocks containing >62% identity in ungapped sequence alignment)
  • BLOSUM 62 is the default matrix for the standard protein BLAST program
How to compute BLOSUM
  • Compute the background $p_i$ for each $i$;
  • Compute the probability $q_{ij}$ of each pair $i$ and $j$;
  • Compute the expected probability for each pair of $i$ and $j$: $p_{ii} = p_i^2$ and $p_{ij} = 2p_i p_j (i \neq j)$
  • $S_{ij} = \log_2 \frac{q_{ij}}{p_{ij}}/\lambda$

Gap penalties

  • Linear gap penalty: The penalty of contiguous gaps can be computed using $\ell \times \rho$, where $\ell$ is the length of the gap.
  • Affine gap penalty: The penalty of contiguous gaps can be computed using $\rho + \ell \times \phi$ where $\rho$ is the gap-open penalty and $\phi$ is the gap-extension penalty.

Q&A

  1. Why do we use the effective length instead of the raw length for the database size, $n$?
  2. What is the difference between raw score, $S$, and bit score, $S'$?
  3. How to interpret the $E$-value and $P$-value, as well as their relationship?
  4. Can we compare the BLAST $E$-values of a same query against different databases?
  5. Are the $E$-values comparable for the same query and database, but with different scoring systems?
  6. How can I obtain the Karlin-Altschul parameters $K$ and $\lambda$ for any arbitrary scoring system?
  • Answer: src/algo/blast/core/blast_stat.c
    • Compute $\lambda$
static double NlmKarlinLambdaNR(double* probs, Int4 d, Int4 low, Int4 high, double lambda0,
                  double tolx, Int4 itmax, Int4 maxNewton, Int4 * itn );
double Blast_KarlinLambdaNR(Blast_ScoreFreq* sfp, double initialLambdaGuess);
  • Compute the relative entropy, $H$:
static double BlastKarlinLtoH(Blast_ScoreFreq* sfp, double lambda);
  • Compute $K$:
static double BlastKarlinLHtoK(Blast_ScoreFreq* sfp, double lambda, double H)
  1. For a protein-coding nucleotide sequence, should we use BLASTN or BLASTX with 6 different reading frames? Why?
  2. Do we have the rule-of-thumb for choosing the appropriate scoring matrix?
  3. What's the difference between identity, similarity and homology?

$\S$1 Theoretical Exercise

1.1 Dynamic programming exercise

You are running a cold-drink sales corporation, and you want to advertise on the web. There are two choices of ads you can run, and you’ve noticed that Style-C works best on cold days and Type-W works best on warm days. The anticipated amount of increased profit depends on the whether and the type of your choice of ad that day:

Cold Warm
Type-C +45 +75
Type-W +30 +100

You have committed to running an ad every day. The cost of placing either ad is 10 per day, and the web charges you 25 every time you change which ad you are running the day before.

You are given a perfectly correct weather prediction for the next $n$ days. Design a dynamic programming algorithm to select which ad to run on each of the next $n$ days to maximize your profit.

Here is an example if the predicted weather in the following 3 days are WCW:

Weather WCW WCW
Ad WCW WWW
Profit 165 200

1.2 Kalin-Altschul statistical exercise

In a database with a size of 100,000 amino acids, we use BLASTP to identify a 100% alignment between the query peptide sequence MVAETN and a database entry.

MVAETN
|||||| 
MVAETN
  • What is the probability that this alignment is evolutionarily related, compared with the probability that this alignment occurs by chance?

We can use two different scoring sytems:

  • BLOSUM62: $K=0.13, \lambda=0.34$  
  • PAM250: $K=0.13, \lambda=0.80$

$\S$2 Web-based BLAST search

2.1 BLASTN example

TTCGAGGGGCATGTTTGTTGCTATGAATGATAATAAAACAATGCTTTTTATTCCGGGGGCAACCAATTAAGTAATTC

2.2 BLASTP Query

  • Do a BLASTP on NCBI website with the following protein against nr, but limit the organism to cetartiodactyla using default parameters:
>query1
MASGPGGWLGPAFALRLLLAAVLQPVSAFRAEFSSESCRELGFSSNLLCSSCDLLGQFSL
LQLDPDCRGCCQEEAQFETKKYVRGSDPVLKLLDDNGNIAEELSILKWNTDSVEEFLSEK
LERI
  • This protein is from a camel. Can you describe and explain the results? Have a look at the MSA output.
  • Change the Compositional adjustments parameter to No adjustment, what happens to the results? Why? You can read the help information after this checkbox.
  • Have a look at the multiple sequence alignment, can you explain the results?
  • Do a similar blastp vs UniProtKB (UniProt) without post filtering. Select and align all proteins. Can you explain the differences? What do you think of the Bos taurus sequence (A8YXY3) and the pig sequence (A1Z623)?

Running a standalone blast program

  • Create the index for the target database using makeblastdb;
  • Choose the task program: blastn, blastp, blastx, tblatx, psiblast or deltablast;
  • Set the configuration for match, mismatch, gap-open penalty, gap-extension penalty or scoring matrix;
  • Set the word size;
  • Set the E-value threshold;
  • Set the output format and the number of output results

$\S$3 Standalone BLAST Exercises

  • Download the uniprot protein database.
  • Use makeblastdb to build the index.
  • Download the human protein encoding by the gene human beta globin
  • Run blastp and psiblast against the uniprot database using default parameters
  • Change the scoring matrix, record the changes in the alignment results and interpret the results.

$\S$4 Programming Exercises

4.1 Smith-Waterman implementation

Write Python program to conduct the classical Smith-Waterman algorithm and illustrate using an example of pairwise alignment between two DNA sequence files in FASTA format. If you change the setting for match, mismatch scores, and also the gap penalty, see what will happen to the alignment results.

4.2 Affine-gap penalty

Extend your algorithms to affine penalty. Have a look at the changes of the alignment results for different gap-open and gap-extension penalty scores.

4.3 BLASTP Exercise

  • Download the protein sequence database of an arbitrary strain of E.coli.
curl -L -o ncbi-ecoli.faa.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_protein.faa.gz
  • Use makeblastdb to build the index database for these proteins.
  • Generate at random 100 amino acid sequences of length=10,20,30,60,100, respectively, based on the zero-order Markov model of the above proteome.
  • Use blastp to conduct the queries against the database.
  • You can change the scoring matrix, the gap penalty, and also the other BLASTP parameters.

4.4 Statistical interpretation

Use one of the BLASTP result to illustrate the statistics behind the queries.

4.5 Compute the BLOSUM matrix

One investigator has identified a form of extraterrestrial life which is of only 4 new amino acids, denoted ${a,b,c,d}$, and then obtain and sequence a sample of so-called ezProtein:

d	d	d	d	c	b	c	d	c	a	b	c
d	d	d	d	c	c	b	a	c	a	a	c
d	d	d	d	b	b	c	d	b	d	a	c
d	d	d	d	c	a	c	d	c	b	b	b
d	d	d	d	c	a	c	d	b	a	b	c
d	d	d	d	c	b	b	d	c	d	b	c
d	d	d	d	c	c	c	d	a	a	b	c
d	d	d	d	b	a	c	d	c	b	b	c
d	d	d	d	c	a	c	a	a	a	a	c
d	d	d	d	c	b	c	d	c	d	b	c
d	d	d	d	c	a	c	d	c	a	b	b
d	d	d	d	c	a	c	a	c	b	b	c

Your task is to compute an ezBLOSUM scoring matrix from this sample.

4.6 Extra

Read the source code, answer the following questions:

  • How to compute the effective length of query and target database?
  • How to conduct the adjustment of the scoring matrix due to the composition of the sequences?

REFERENCE