-
Notifications
You must be signed in to change notification settings - Fork 10
Lab 2: BLAST Tutorial
Sequence alignment is the core of bioinformatics, and Smith-Waterman algorithm is the fundamental techniques in pairwise sequence alignment. However, when it comes to huge amount of data, the computational cost of this exact method can be overwhelming. BLAST, a heuristic realization of the algorithm, can balance between the sensitivity and the computatinal time.
This is simply called the seed-and-extend strategy.
- Decompose the query sequence of length
$m$ into a list of$m-w+1$ word of size$w$ ; -
Seeding: Find the exact matches or nearly-exact matches (hits) in database of size
$n$ with the threshold score of$T$ ; -
Extension: For each hit, extend the alignment in two directions to find alignment that score greater than the threshold
$S$ . - Output the alignments and also the statistics.
E-value is the expect number of alignments that can reach a score of at least
-
$S$ is the alignment raw score; -
$K$ is the Karlin-Altschul parameter that can be computed theoretically depending only on the scoring system; -
$m$ is the query length; -
$n$ is the effective target database size; -
$\lambda$ is another Karlin-Altschul parameter that depends on the scoring system by solving the equation in ungapped alignment: $$ \sum_{i,j} p_i p_j e^{-\lambda S_{ij}} = 1 $$ where$p_i$ and$p_j$ are the background probability of residue$i$ and$j$ .
The equation is inferred from how we obtain the scoring matrix:
$$
S_{ij} = \log \frac{q_{ij}}{p_i p_j}
$$
and we know that
- The two parameters,
$\lambda$ and$K$ for gapped alignments, can be priorly computed from simulation.
The bit score, or the normalized raw score,
Comparing a protein sequence of length 250 residues to a database containing 1e9 residues, how many local alignments with normalized score
The probability of finding at least one such HSP with raw score
- For nucleotide sequences,
- For amino acid sequences,
- Inferred from global alignment by taking all the mutations in the sequences;
- Based on evolution theory.
- All the other PAM are extrapolated from PAM1 using the assumed Markov chain.
- Compute
$p_i$ all every amino acid,$i$ ; - Compute
$p(i \to j)$ for every pair of$i$ and$j$ ; $M_{ij} = p(i \to j)$ - Scale
$M$ such that$\sum_i p_i M_{ii} = 0.01$ $S_{ij} = \text{round}(10\log_{10} \frac{M_{ij}}{p_j})$ $PAMn = PAM1^n$
- Based on comparisons of Blocks of sequences derived from the Blocks database
- The Blocks database contains multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins (local alignment)
- BLOSUM matrices are derived from blocks whose alignment corresponds to the BLOSUM-matrix number (e.g. BLOSUM 62 is derived from Blocks containing >62% identity in ungapped sequence alignment)
- BLOSUM 62 is the default matrix for the standard protein BLAST program
- Compute the background
$p_i$ for each$i$ ; - Compute the probability
$q_{ij}$ of each pair$i$ and$j$ ; - Compute the expected probability for each pair of
$i$ and$j$ :$p_{ii} = p_i^2$ and$p_{ij} = 2p_i p_j (i \neq j)$ $S_{ij} = \log_2 \frac{q_{ij}}{p_{ij}}/\lambda$
- Linear gap penalty: The penalty of contiguous gaps can be computed using
$\ell \times \rho$ , where$\ell$ is the length of the gap. - Affine gap penalty: The penalty of contiguous gaps can be computed using
$\rho + \ell \times \phi$ where$\rho$ is the gap-open penalty and$\phi$ is the gap-extension penalty.
- Why do we use the effective length instead of the raw length for the database size,
$n$ ? - What is the difference between raw score,
$S$ , and bit score,$S'$ ? - How to interpret the
$E$ -value and$P$ -value, as well as their relationship? - Can we compare the BLAST
$E$ -values of a same query against different databases? - Are the
$E$ -values comparable for the same query and database, but with different scoring systems? - How can I obtain the Karlin-Altschul parameters
$K$ and$\lambda$ for any arbitrary scoring system?
- Answer:
src/algo/blast/core/blast_stat.c
- Compute
$\lambda$
- Compute
static double NlmKarlinLambdaNR(double* probs, Int4 d, Int4 low, Int4 high, double lambda0,
double tolx, Int4 itmax, Int4 maxNewton, Int4 * itn );
double Blast_KarlinLambdaNR(Blast_ScoreFreq* sfp, double initialLambdaGuess);
- Compute the relative entropy,
$H$ :
static double BlastKarlinLtoH(Blast_ScoreFreq* sfp, double lambda);
- Compute
$K$ :
static double BlastKarlinLHtoK(Blast_ScoreFreq* sfp, double lambda, double H)
- For a protein-coding nucleotide sequence, should we use BLASTN or BLASTX with 6 different reading frames? Why?
- Do we have the rule-of-thumb for choosing the appropriate scoring matrix?
- What's the difference between identity, similarity and homology?
You are running a cold-drink sales corporation, and you want to advertise on the web. There are two choices of ads you can run, and you’ve noticed that Style-C works best on cold days and Type-W works best on warm days. The anticipated amount of increased profit depends on the whether and the type of your choice of ad that day:
Cold | Warm | |
---|---|---|
Type-C | +45 | +75 |
Type-W | +30 | +100 |
You have committed to running an ad every day. The cost of placing either ad is 10 per day, and the web charges you 25 every time you change which ad you are running the day before.
You are given a perfectly correct weather prediction for the next
Here is an example if the predicted weather in the following 3 days are WCW
:
Weather | WCW | WCW |
---|---|---|
Ad | WCW | WWW |
Profit | 165 | 200 |
In a database with a size of 100,000 amino acids, we use BLASTP to identify a 100% alignment between the query peptide sequence MVAETN and a database entry.
MVAETN
||||||
MVAETN
- What is the probability that this alignment is evolutionarily related, compared with the probability that this alignment occurs by chance?
We can use two different scoring sytems:
- BLOSUM62:
$K=0.13, \lambda=0.34$ - PAM250:
$K=0.13, \lambda=0.80$
TTCGAGGGGCATGTTTGTTGCTATGAATGATAATAAAACAATGCTTTTTATTCCGGGGGCAACCAATTAAGTAATTC
- Do a
BLASTP
on NCBI website with the following protein againstnr
, but limit the organism tocetartiodactyla
using default parameters:
>query1
MASGPGGWLGPAFALRLLLAAVLQPVSAFRAEFSSESCRELGFSSNLLCSSCDLLGQFSL
LQLDPDCRGCCQEEAQFETKKYVRGSDPVLKLLDDNGNIAEELSILKWNTDSVEEFLSEK
LERI
- This protein is from a camel. Can you describe and explain the results? Have a look at the MSA output.
- Change the
Compositional adjustments
parameter toNo adjustment
, what happens to the results? Why? You can read the help information after this checkbox. - Have a look at the multiple sequence alignment, can you explain the results?
- Do a similar blastp vs
UniProtKB (UniProt)
with post filtering with Taxonomy againcetartiodactyla
. Select and align all 8 proteins. Can you explain the differences? What do you think of the Bos mutus (Yak) sequence (L8IZ46_9CETA) and the second pig sequence (I3LH65_PIG)? Follow the ENSEMBL link in the UniProt entry of this pig sequence to view the genomic region and try to understand the problem.
- Create the index for the target database using
makeblastdb
; - Choose the task program:
blastn
,blastp
,blastx
,tblatx
,psiblast
ordeltablast
; - Set the configuration for
match
,mismatch
,gap-open penalty
,gap-extension penalty
orscoring matrix
; - Set the
word
size; - Set the
E-value threshold
; - Set the
output format
and the number of output results
- Download the
uniprot
protein database. - Use
makeblastdb
to build the index. - Download the human protein encoding by the gene
human beta globin
- Run
blastp
andpsiblast
against theuniprot
database using default parameters - Change the scoring matrix, record the changes in the alignment results and interpret the results.
Write Python program to conduct the classical Smith-Waterman algorithm and illustrate using an example of pairwise alignment between two DNA sequence files in FASTA format. If you change the setting for match, mismatch scores, and also the gap penalty, see what will happen to the alignment results.
Extend your algorithms to affine penalty. Have a look at the changes of the alignment results for different gap-open and gap-extension penalty scores.
- Download the protein sequence database of an arbitrary strain of E.coli.
- Use
makeblastdb
to build the index database for these proteins. - Generate at random 100 amino acid sequences of length=10,20,30,60,100, respectively, based on the zero-order Markov model of the above proteome.
- Use
blastp
to conduct the queries against the database. - You can change the scoring matrix, the gap penalty, and also the other BLASTP parameters.
Use one of the BLASTP result to illustrate the statistics behind the queries.
One investigator has identified a form of extraterrestrial life which is of only 4 new amino acids, denoted
d d d d c b c d c a b c
d d d d c c b a c a a c
d d d d b b c d b d a c
d d d d c a c d c b b b
d d d d c a c d b a b c
d d d d c b b d c d b c
d d d d c c c d a a b c
d d d d b a c d c b b c
d d d d c a c a a a a c
d d d d c b c d c d b c
d d d d c a c d c a b b
d d d d c a c a c b b c
Your task is to compute an ezBLOSUM scoring matrix from this sample.
On the way to the garden of bioinformatics.
A bioinformatics wiki for the course BI462.