-
Notifications
You must be signed in to change notification settings - Fork 10
Lab 2: BLAST Tutorial
- Understanding the
heuristic
mechanism of the BLAST algorithm. - Grasping the statistical background of the assessement of the alignment.
- Knowing the tricks in running BLAST (either Web-based, standalone).
Sequence alignment is the core of bioinformatics, and Smith-Waterman algorithm is the fundamental techniques in pairwise sequence alignment. However, when it comes to huge amount of data, the time cost of this exact method can be overwhelming. BLAST, a heuristic approximation of the algorithm, can balance between the sensitivity and the computational speed.
This is simply called the seed-and-extend strategy.
- Decompose the query sequence of length
$m$ into a list of$m-w+1$ words of size$w$ ; -
Seeding: Find the exact matches or nearly-exact matches (hits) in database of size
$n$ with the threshold score of$T$ ; -
Extending: For each hit, extend the alignment in two directions to find alignment that score greater than the threshold
$S$ (high-score segment pairs (HSPs)). - Assess the statistical significance of the score (
E
-value andP
-value) and output the alignments (as well as theidentity
andsimilarity
).
E-value is the expect number of alignments that can reach a score of at least
-
$S$ is the alignment raw score; -
$K$ is the Karlin-Altschul parameter that can be computed theoretically depending only on the scoring system; -
$m$ is the query length; -
$n$ is the effective target database size; -
$\lambda$ is another Karlin-Altschul parameter that depends on the scoring system by solving the equation in ungapped alignment: $$ \sum_{i,j} p_i p_j e^{-\lambda S_{ij}} = 1 $$ where$p_i$ and$p_j$ are the background probability of residue$i$ and$j$ .
The equation is inferred from how we obtain the scoring matrix:
$$
S_{ij} = \log \frac{q_{ij}}{p_i p_j}
$$
and we know that
- The two parameters,
$\lambda$ and$K$ for gapped alignments, can be priorly computed from simulation.
The relative entropy (
In BLAST, we shoud use the effective length other than the raw length, since extension can not be conducted at either end of the sequences. Karlin-Altschul statistics provides a way to calculate just how long a sequence must be before it can produce an alignment with a significant Expect. This minimum length, l, is usually referred to as the expected HSP length: $$ l = \ln(Kmn) / H $$ and then the effective length for the query: $$ m' = m - l $$ and the database: $$ n' = n - l * \text{number of sequences in the database} $$
Since both the BLOSUM and PAM matrices are inferred from a set of empirical sequences, which may have very different amino-acid composition with the query and the database under investigation. Thus the scoring matrix shoulld be corrected for the composition of the sequences.
The bit score, or the normalized raw score,
Comparing a protein sequence of length 250 residues to a database containing 1e9 residues, how many local alignments with normalized score
The probability of finding at least one such HSP with raw score
- For nucleotide sequences,
- For amino acid sequences,
- Inferred from global alignment by taking all the mutations in the sequences;
- Based on evolution theory.
- All the other PAM are extrapolated from PAM1 using the assumed Markov chain.
- Compute
$p_i$ all every amino acid,$i$ ; - Compute
$p(i \to j)$ for every pair of$i$ and$j$ ; $M_{ij} = p(i \to j)$ - Scale
$M$ such that$\sum_i p_i (1 - M_{ii}) = 0.01$ $S_{ij} = \text{round}(10\log_{10} \frac{M_{ij}}{p_j})$ $PAMn = PAM1^n$
- Based on comparisons of Blocks of sequences derived from the Blocks database
- The Blocks database contains multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins (local alignment)
- BLOSUM matrices are derived from blocks whose alignment corresponds to the BLOSUM-matrix number (e.g. BLOSUM 62 is derived from Blocks containing >62% identity in ungapped sequence alignment)
- BLOSUM 62 is the default matrix for the standard protein BLAST program
- Compute the background
$p_i$ for each$i$ ; - Compute the probability
$q_{ij}$ of each pair$i$ and$j$ ; - Compute the expected probability for each pair of
$i$ and$j$ :$p_{ii} = p_i^2$ and$p_{ij} = 2p_i p_j (i \neq j)$ $S_{ij} = \log_2 \frac{q_{ij}}{p_{ij}}/\lambda$
- Linear gap penalty: The penalty of contiguous gaps can be computed using
$\ell \times \rho$ , where$\ell$ is the length of the gap. - Affine gap penalty: The penalty of contiguous gaps can be computed using
$\rho + \ell \times \phi$ where$\rho$ is the gap-open penalty and$\phi$ is the gap-extension penalty.
- Why do we use the effective length instead of the raw length for the database size,
$n$ ? - What is the difference between raw score,
$S$ , and bit score,$S'$ ? - How to interpret the
$E$ -value and$P$ -value, as well as their relationship? - Can we compare the BLAST
$E$ -values of a same query against different databases? - Are the
$E$ -values comparable for the same query and database, but with different scoring systems? - How can I obtain the Karlin-Altschul parameters
$K$ and$\lambda$ for any arbitrary scoring system?
- Answer:
src/algo/blast/core/blast_stat.c
- Compute
$\lambda$
- Compute
static double NlmKarlinLambdaNR(double* probs, Int4 d, Int4 low, Int4 high, double lambda0,
double tolx, Int4 itmax, Int4 maxNewton, Int4 * itn );
double Blast_KarlinLambdaNR(Blast_ScoreFreq* sfp, double initialLambdaGuess);
- Compute the relative entropy,
$H$ :
static double BlastKarlinLtoH(Blast_ScoreFreq* sfp, double lambda);
- Compute
$K$ :
static double BlastKarlinLHtoK(Blast_ScoreFreq* sfp, double lambda, double H)
- For a protein-coding nucleotide sequence, should we use BLASTN or BLASTX with 6 different reading frames? Why?
- Do we have the rule-of-thumb for choosing the appropriate scoring matrix?
- What's the difference between identity, similarity and homology?
You are running a cold-drink sales corporation, and you want to advertise on the web. There are two choices of ads you can run, and you’ve noticed that Style-C works best on cold days and Type-W works best on warm days. The anticipated amount of increased profit depends on the whether and the type of your choice of ad that day:
Cold | Warm | |
---|---|---|
Type-C | +45 | +75 |
Type-W | +30 | +100 |
You have committed to running an ad every day. The cost of placing either ad is 10 per day, and the web charges you 25 every time you change which ad you are running the day before.
You are given a perfectly correct weather prediction for the next
Here is an example if the predicted weather in the following 3 days are WCW
:
Weather | WCW | WCW |
---|---|---|
Ad | WCW | WWW |
Profit | 165 | 200 |
In a database with a size of 100,000 amino acids, we use BLASTP to identify a 100% alignment between the query peptide sequence MVAETN and a database entry.
MVAETN
||||||
MVAETN
- What is the probability that this alignment is evolutionarily related, compared with the probability that this alignment occurs by chance?
We can use two different scoring sytems:
- BLOSUM62:
$K=0.13, \lambda=0.34$ - PAM250:
$K=0.13, \lambda=0.80$
Run blastn
against the nt
database.
TTCGAGGGGCATGTTTGTTGCTATGAATGATAATAAAACAATGCTTTTTATTCCGGGGGCAACCAATTAAGTAATTC
- Do a
BLASTP
on NCBI website with the following protein againstnr
, but limit the organism tocetartiodactyla
using default parameters:
>query1
MASGPGGWLGPAFALRLLLAAVLQPVSAFRAEFSSESCRELGFSSNLLCSSCDLLGQFSL
LQLDPDCRGCCQEEAQFETKKYVRGSDPVLKLLDDNGNIAEELSILKWNTDSVEEFLSEK
LERI
- This protein is from a camel. Can you describe and explain the results? Have a look at the MSA output.
- Change the
Compositional adjustments
parameter toNo adjustment
, what happens to the results? Why? You can read the help information after this checkbox. - Have a look at the multiple sequence alignment, can you explain the results?
- Do a similar blastp vs
UniProtKB (UniProt)
without post filtering. Select and align all proteins. Can you explain the differences? What do you think of the Bos taurus sequence (A8YXY3) and the pig sequence (A1Z623)?
- Create the index for the target database using
makeblastdb
; - Choose the task program:
blastn
,blastp
,blastx
,tblatx
,psiblast
ordeltablast
; - Set the configuration for
match
,mismatch
,gap-open penalty
,gap-extension penalty
orscoring matrix
; - Set the
word
size; - Set the
E-value threshold
; - Set the
output format
and the number of output results
- Download the
uniprot
protein database. - Use
makeblastdb
to build the index. - Download the human protein encoding by the gene
human beta globin
- Run
blastp
andpsiblast
against theuniprot
database using default parameters - Change the scoring matrix, record the changes in the alignment results and interpret the results.
Write Python program to conduct the classical Smith-Waterman algorithm and illustrate using an example of pairwise alignment between two DNA sequence files in FASTA format. If you change the setting for match, mismatch scores, and also the gap penalty, see what will happen to the alignment results.
Extend your algorithms to affine penalty. Have a look at the changes of the alignment results for different gap-open and gap-extension penalty scores.
- Download the protein sequence database of an arbitrary strain of E.coli.
curl -L -o ncbi-ecoli.faa.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_protein.faa.gz
- Use
makeblastdb
to build the index database for these proteins. - Generate at random 100 amino acid sequences of length=10,20,30,60,100, respectively, based on the zero-order Markov model of the above proteome.
- Use
blastp
to conduct the queries against the database. - You can change the scoring matrix, the gap penalty, and also the other BLASTP parameters.
Use one of the BLASTP result to illustrate the statistics behind the queries.
One investigator has identified a form of extraterrestrial life which is of only 4 new amino acids, denoted
d d d d c b c d c a b c
d d d d c c b a c a a c
d d d d b b c d b d a c
d d d d c a c d c b b b
d d d d c a c d b a b c
d d d d c b b d c d b c
d d d d c c c d a a b c
d d d d b a c d c b b c
d d d d c a c a a a a c
d d d d c b c d c d b c
d d d d c a c d c a b b
d d d d c a c a c b b c
Your task is to compute an ezBLOSUM scoring matrix from this sample.
Read the source code, answer the following questions:
- How to compute the effective length of query and target database?
- How to conduct the adjustment of the scoring matrix due to the composition of the sequences?
On the way to the garden of bioinformatics.
A bioinformatics wiki for the course BI462.