This tool calculates a BLOSUM matrix (log-odds ratios) from arbitrary sequences (any alphabet, not only DNA or amino acids) by elimination.
To run this application, go to the folder where `main.py` is located and open the command prompt in it. Then run the application by typing:
python3 main.py
[TBA] You can also run this application without a GUI by typing:
python3 main.py --nogui --path path_to_sequence_file --degree [0,100]
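Because the headless mode is still marked [TBA], the following is only a sketch of how the three flags shown above could be parsed with `argparse`. The helper name `parse_cli`, the integer type of `--degree`, and the rule that `--nogui` needs both other flags are assumptions for illustration, not the tool's confirmed behavior.

```python
import argparse


def parse_cli(argv=None):
    """Parse the flags from the command line above (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="BLOSUM matrix calculator")
    parser.add_argument("--nogui", action="store_true",
                        help="run without the GUI")
    parser.add_argument("--path", help="path to the sequence file")
    # Assumption: the degree is an integer percentage between 0 and 100.
    parser.add_argument("--degree", type=int,
                        help="required identity in percent, 0 to 100")
    args = parser.parse_args(argv)

    if args.degree is not None and not 0 <= args.degree <= 100:
        parser.error("--degree must be between 0 and 100")
    # Assumption: headless mode cannot ask the user for missing values.
    if args.nogui and (args.path is None or args.degree is None):
        parser.error("--nogui requires both --path and --degree")
    return args
```

Calling `parse_cli()` without arguments reads `sys.argv`, so it could be used at the top of `main.py` to decide whether the GUI should be started.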
To execute this application, `python3` is required.
The following Python packages are also required:
package | version |
---|---|
numpy | >= 1.18 |
tkinter | >= 8.6 |
Note that `tkinter` is installed by default on Windows 10, but not on Linux.
The calculation rests on the following mathematical foundation:
- Sequences with at least p% identity to each other are clustered, and the other sequences are eliminated (the degree decides how similar they must be).
- The remaining sequences are compared column by column, and the sequence letters (e.g. DNA bases) are counted according to their frequency. Consider this example:
ATGTACGT
TAGCTAGA
GTACGACC
Each column k is read from top to bottom, so the first column (k = 1) is ATG, and so on. Computing the C values for all letter pairs yields a matrix. Note that this matrix is symmetric.
- The entries of the matrix sum to the normalization factor Z, which depends on L, the sequence length (the number of columns, e.g. L = 8 for ATGTACGT), and N, the number of sequences.
- From the normalized values, the probability of occurrence of each sequence letter i is obtained.
- Finally, the log-odds ratios are computed. Every entry of the result is rounded to an integer.
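The steps above correspond to the usual BLOSUM construction. Written out in standard textbook notation, they would look roughly as follows. The symbols $n_i^k$ (count of letter $i$ in column $k$) and $q_{ij}$ (normalized pair frequency), the convention of counting each unordered letter pair once, and the factor 2 with a base-2 logarithm are assumptions taken from the common formulation, so the tool's exact conventions may differ.

```latex
% n_i^k: number of occurrences of letter i in column k
\[
c_{ij} =
\begin{cases}
  \sum_{k=1}^{L} \binom{n_i^k}{2} & i = j \\
  \sum_{k=1}^{L} n_i^k \, n_j^k   & i \neq j
\end{cases}
\]
% each unordered letter pair is counted once (i >= j)
\[
Z = \sum_{i \ge j} c_{ij} = \frac{L \, N (N - 1)}{2},
\qquad
q_{ij} = \frac{c_{ij}}{Z}
\]
\[
p_i = q_{ii} + \sum_{j \neq i} \frac{q_{ij}}{2}
\]
\[
\mathrm{BLOSUM}_{ij} =
\begin{cases}
  2 \log_2 \dfrac{q_{ii}}{p_i^2}         & i = j \\
  2 \log_2 \dfrac{q_{ij}}{2 \, p_i p_j}  & i \neq j
\end{cases}
\]
```

As a quick check of the normalization: every column contributes $\binom{N}{2}$ letter pairs, so the three example sequences of length 8 give $Z = 8 \cdot 3 \cdot 2 / 2 = 24$.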
This calculation is based on Dr. Sepp Hochreiter's script Bioinformatics I.
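The counting and log-odds steps can also be sketched in plain Python. This is a minimal illustration under stated assumptions, not the tool's actual implementation: `blosum_sketch` is a hypothetical helper, it expects sequences that have already passed the identity-based elimination step, it uses the common scaling of 2 times the base-2 logarithm, and it returns `None` for letter pairs that never occur together, since their log-odds value is undefined.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import log2


def blosum_sketch(sequences):
    """Return {(a, b): score} for all letter pairs with a <= b."""
    n_seqs = len(sequences)
    if n_seqs < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")

    letters = sorted(set("".join(sequences)))

    # Count letter pairs column by column, each unordered pair once.
    pair_counts = Counter()
    for column in zip(*sequences):
        n = Counter(column)
        for a, b in combinations_with_replacement(letters, 2):
            if a == b:
                pair_counts[a, b] += n[a] * (n[a] - 1) // 2
            else:
                pair_counts[a, b] += n[a] * n[b]

    # Normalization factor Z = L * N * (N - 1) / 2.
    z = length * n_seqs * (n_seqs - 1) / 2
    q = {pair: count / z for pair, count in pair_counts.items()}

    # Probability of each letter: diagonal entry plus half of each
    # off-diagonal entry that involves the letter.
    p = {}
    for a in letters:
        off_diag = sum(v for (x, y), v in q.items() if x != y and a in (x, y))
        p[a] = q[a, a] + off_diag / 2

    # Log-odds ratio of observed versus expected pair frequency.
    scores = {}
    for (a, b), q_ab in q.items():
        if q_ab == 0:
            scores[a, b] = None  # assumption: unobserved pairs get no score
            continue
        expected = p[a] ** 2 if a == b else 2 * p[a] * p[b]
        scores[a, b] = round(2 * log2(q_ab / expected))
    return scores
```

For the example above, `blosum_sketch(["ATGTACGT", "TAGCTAGA", "GTACGACC"])` returns one entry per unordered pair of the letters A, C, G and T. A dictionary keyed by letter pairs is used here only to keep the sketch free of dependencies; with `numpy`, a square array indexed by letter position would be the more natural layout.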
The input file can have any extension. The sequences in the input file should fulfill the following properties:
- All sequences have the same length.
- Every sequence is separated by a newline.
- At least two sequences are given.
- Any input line starting with `-` is ignored.
Examples:
Valid ✔️
-This is a valid input file, this line is ignored.
TACGTAGCTAGC
TGCATGCTAGCC
TGCTGCTGCCCA
TGTGTACACCCC
-This line is also ignored.
Not Valid ❌
-This is an invalid input file, because the sequences differ in length.
TACGTAGCTAGC
TGCATGCT
TGCTGCTGCCCA
TGTGTACACCC
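The input rules above can be checked with a few lines of Python. This is a sketch, not the tool's own parser: `read_sequences` is a hypothetical helper, and skipping blank lines is an added assumption that the rules do not mention.

```python
def read_sequences(lines):
    """Return the sequences found in an iterable of text lines.

    Raises ValueError if the input breaks one of the rules above.
    """
    sequences = []
    for line in lines:
        line = line.rstrip()  # drop the trailing newline
        # Lines starting with "-" are ignored; blank lines are skipped too
        # (the latter is an assumption).
        if not line or line.startswith("-"):
            continue
        sequences.append(line)

    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("all sequences must have the same length")
    return sequences
```

Since an open file is itself an iterable of lines, the helper could be used as `with open(path) as handle: sequences = read_sequences(handle)`. The valid example above yields four sequences of length 12, while the invalid one raises a `ValueError` because of the differing lengths.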