The goal of the project was to implement global alignment algorithm. We scored each nucleotide in a given sequence by how many times it alings within the dataset. We experiment with different scoring functions and alingments. We also compared difference in alingment between aminoacid and nuqleotid sequence.
The dataset consist of genome of 3 different nematodes (round worms) - japanica, brenneri and remanei. It is common for this group of nematodes that their genome contains a vast ammount of redundancy. This means that not whole genes are translated into proteins. Parts that end up in protein are called exons nad parts that do not are called introns. Part of genes given in dataset contains at least two exons. Compressed file of the dataset is accesbile on this website [accesed in december 2016].
Provides reading and parsing of the dataset. In addition to loading a whole genome we can access individual gene. Furthermore we can access one or all exons belonging to gene. For instructions and usage examples see wiki
Accepts two strings and scores their alignment. Return score and alignment of strings. Local, global and multiple alignment are supported. Mutliple alignment is deprecated - it works only on short strings, in our case we have genes with more than 700 characters, which is computationaly unacceptable, so we did not use it in our project.
Visualization of differences between alignment of all genes between species. Distance: higher alignment score means lower distance, so we compute distance as (1 - ratio between score and highest score), where highest score is multiplied by 1.1 to avoid distance zero.
In the folowing section we present the experiments and their results. We test our implementation of gene alignment with different scoring matrices, with junk DNA and without etc.
In the first experiment we perform global alignment on species pair by pair. For each two genes we extract exons and perform alignment on them. The sum of score is our metric to compare alignments of species pair by pair. After alignment using nucleotides we also experiment with aminoacid sequence.
The resulting dendrogram is shown below:
As we can see better results were obtained using nucleotides (note that scores were inverted into distances - higher score translate into lower distance). From dendrogram we can also conclude that remanei and japonica have higher similarity compared to brenner, no matter which sequence type was used.
Species pair | Nucleotide | Amino-acid |
---|---|---|
rem_jap | 13848 | -2736 |
pb_jap | 9112 | -4108 |
rem_pb | 12144 | -3562 |
As we see from table above, the scores obtained using amino acid sequences are significantly lower.
Species pair | Nucleotide gene similarities (%) |
---|---|
rem_jap | 44.83 |
pb_jap | 31.03 |
rem_pb | 24.14 |
From table above, we see which species pair is the most similar to each other, according to scores of each gene seperately. We see that rem_jap is the most similar pair.
We also took a look at which genes have the highest and the lowest alignment score. The results are shown below, where 5 genes with lowest (first table) and highest score(second table) for each pair are presented with their name and alignment score.
rem_jap | pb_jap | rem_pb |
---|---|---|
g13065 -1241 | g8079 -663 | g8079 -1620 |
g3039 -866 | g5838 -653 | g3149 -1020 |
g1158 -524 | g12334 -592 | g5838 -463 |
g14459 -423 | g13065 -492 | g5059 -319 |
g10223 -319 | g10223 -344 | g992 -289 |
As we can see, some genes are appearing more than once (g13065, g5838, g10223). We can conclude that similar genes are causing lower alignment scores in different species pairs.
rem_jap | pb_jap | rem_pb |
---|---|---|
g5889 1214 | g5889 1675 | g13065 1378 |
g832 1165 | g13779 1323 | g10223 1356 |
g8079 1103 | g5621 1076 | g5889 1094 |
g14404 1022 | g6414 1014 | g13779 867 |
g13198 986 | g14404 938 | g12334 864 |
Similar conclusion can be drawn here. Genes that has the best aligment are common in more than one pair.
The goal here was to discover if different scoring matrices produce same results ie. if alignment between remanei and japonica will still be the highest scored alignment. We ran the alignment with the folowing matrices: blosum62, pam250, pam30, and rao. The results for nucleotide sequences are avaliable in the table below.
Species pair | blosum62 | pam250 | pam30 | rao |
---|---|---|---|---|
rem_jap | 14655 | 13848 | 16267 | 114918 |
pb_jap | 10082 | 9112 | 12533 | 104259 |
rem_pb | 13630 | 12144 | 15967 | 117715 |
From the obtained results we see that remanei and japonica still have the best alignment score, except when using rao scoring matrix, where remanei and brenneri are scored better. In all cases the lowest scored alignment is between brenneri and japonica. We can conclude that even though choosing appropriate scoring matrix is important it rarely affects the results to such an extent that the order of species pairs is different.
In this experiment we perform global alignment on species pair by pair. For each two genes we take a whole sequence and perform alignment on them. The sum of score is our metric to compare alignments of species pair by pair. After alignment using nucleotides we also experiment with aminoacid sequence.
The resulting dendrogram is shown below:
We can see results are different as in the first experiment. Now remanei and brenneri have higher similarity compared to japonica, no matter which sequence type was used. And also distances between them are slightly larger.
Species pair | Nucleotide | Amino-acid |
---|---|---|
rem_jap | 14249 | -3584 |
pb_jap | 13593 | -3812 |
rem_pb | 20000 | -2548 |
As in the first experiment, the scores obtained using amino acid sequences are significantly lower.
Species pair | Nucleotide gene similarities (%) |
---|---|
rem_jap | 31.03 |
pb_jap | 27.59 |
rem_pb | 41.38 |
From table above, we see which species pair is the most similar to each other, according to scores of each gene seperately (percentages of dendrograms, which give most similar results). We get different results, saying pair rem_pb is the most similar pair, unlike in first experiment, where rem_jap was the most similar.
From experiments performed on the dataset we can conclude that global alignment is largley dependent on the variables used, like included/excluded introns and different scoring matrices. The method is fast enough when aligning pairs, but our experiment with multiple alignment failed since the dataset was too large.