🌼 PYSUM 🌼

This tool calculates a BLOSUM matrix (log-odds ratios) from arbitrary sequences by elimination. The sequences can use any alphabet, not only DNA bases or amino acids.

Usage

To run this application, open a command prompt in the folder that contains main.py and type: python3 main.py

[TBA] You can also run this application without a GUI by typing: python3 main.py --nogui --path path_to_sequence_file --degree [0,100]
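Since the command-line mode is still marked as TBA, the following is only a sketch of how the three flags named above could be parsed with argparse. The function name build_parser, the file name sequences.txt and the help texts are hypothetical and are not taken from main.py.

```python
import argparse


def build_parser():
    """Hypothetical parser for --nogui, --path and --degree (not from main.py)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--nogui", action="store_true",
                        help="run without the GUI")
    parser.add_argument("--path",
                        help="path to the sequence file")
    # Assumption: the degree is an integer percentage from 0 to 100.
    parser.add_argument("--degree", type=int, choices=range(0, 101),
                        metavar="PERCENT",
                        help="identity threshold in percent (0 to 100)")
    return parser


# Parse an explicit argument list instead of sys.argv, for illustration.
args = build_parser().parse_args(
    ["--nogui", "--path", "sequences.txt", "--degree", "62"]
)
print(args.nogui, args.path, args.degree)
```

Restricting --degree with choices makes argparse reject values outside the stated range before any computation starts.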

Requirements

Python 3 is required to run this application.
The following Python packages are also required:

package	version
numpy	>= 1.18
tkinter	>= 8.6

Note that tkinter ships with the standard Python installer on Windows 10, but on Linux it usually has to be installed separately.

Mathematical Foundation

The matrix is computed in the following steps:

Sequences with at least p% identity to each other are clustered. The other sequences are eliminated. The degree p decides how similar the sequences must be.
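The identity check can be sketched as follows. This is one possible reading of the elimination step, not necessarily what the tool does: a sequence is kept only if it is less than degree percent identical to every sequence kept so far, so each group of similar sequences is represented once. The function names and the fourth sequence in the example are made up for illustration.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def eliminate_similar(sequences, degree):
    """Greedy sketch: drop a sequence if it reaches `degree` percent
    identity with any sequence that has already been kept."""
    kept = []
    for seq in sequences:
        if all(percent_identity(seq, other) < degree for other in kept):
            kept.append(seq)
    return kept


seqs = ["ATGTACGT", "ATGTACGA", "TAGCTAGA", "GTACGACC"]
print(percent_identity(seqs[0], seqs[1]))  # 87.5 (7 of 8 positions match)
print(eliminate_similar(seqs, 80))         # the second sequence is dropped
```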
The sequences are now compared to each other by counting how frequently the sequence letters (e.g. DNA bases) are paired. Consider this example:
ATGTACGT
TAGCTAGA
GTACGACC
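The columns k are considered one at a time; in the example, the first column (k = 1) is A, T, G, and so on. Let n_i^k be the number of times letter i occurs in column k. Computing the pair counts c_ij and summing them over all columns yields the matrix C:

$$c_{ij}^{k} = \begin{cases} \binom{n_i^{k}}{2} & i = j \\ n_i^{k}\, n_j^{k} & i \neq j \end{cases} \qquad c_{ij} = \sum_{k=1}^{L} c_{ij}^{k}$$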
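The column view of the three example sequences can be reproduced in a few lines. The sketch assumes the usual BLOSUM pair count: a letter occurring n_i times in a column contributes n_i(n_i - 1)/2 pairs with itself and n_i * n_j pairs with a different letter j. The helper name column_pair_counts is hypothetical.

```python
from collections import Counter

sequences = ["ATGTACGT", "TAGCTAGA", "GTACGACC"]

# zip(*sequences) yields one tuple per column k.
columns = list(zip(*sequences))
print(columns[0])  # ('A', 'T', 'G')


def column_pair_counts(column):
    """Pair counts for a single column, keyed by the sorted letter pair."""
    n = Counter(column)
    letters = sorted(n)
    counts = {}
    for idx, i in enumerate(letters):
        counts[(i, i)] = n[i] * (n[i] - 1) // 2   # pairs of identical letters
        for j in letters[idx + 1:]:
            counts[(i, j)] = n[i] * n[j]          # pairs of different letters
    return counts


# The second column is T, A, T: one T-T pair and two A-T pairs.
print(column_pair_counts(columns[1]))
# {('A', 'A'): 0, ('A', 'T'): 2, ('T', 'T'): 1}
```

Each column of N = 3 sequences contains N(N - 1)/2 = 3 pairs in total, which matches the counts above.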

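Note that this matrix is symmetric.
The sum of all entries c_ij of the matrix with i ≤ j equals the normalization factor Z, which is given by:

$$Z = \sum_{i \le j} c_{ij} = \frac{L\,N\,(N-1)}{2}$$

where L is the sequence length (number of columns, i.e. for ATGTACGT: L = 8) and N is the number of sequences.
Then C is normalized to obtain the Q-matrix:

$$q_{ij} = \frac{c_{ij}}{Z}$$

To obtain the probability of occurrence of a single sequence letter i, use:

$$p_i = q_{ii} + \sum_{j \neq i} \frac{q_{ij}}{2}$$

Finally, the log-odds ratios are computed with (the factor 2 expresses the scores in half-bit units, the usual BLOSUM convention):

$$s_{ij} = \begin{cases} 2\log_2 \dfrac{q_{ii}}{p_i^{2}} & i = j \\ 2\log_2 \dfrac{q_{ij}}{2\,p_i\,p_j} & i \neq j \end{cases}$$

Every entry of the result is rounded to the nearest integer.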
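The whole calculation can be sketched with the standard library alone. The code assumes the standard BLOSUM construction (pair counts per column, Z = L * N * (N - 1) / 2, half-bit log-odds scores) and skips the elimination step; the function name blosum and the choice to return None for letter pairs that never occur are illustrative assumptions, not the tool's behaviour.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement


def blosum(sequences):
    """Rounded half-bit log-odds scores for equal-length sequences (sketch)."""
    n_seq, length = len(sequences), len(sequences[0])
    alphabet = sorted(set("".join(sequences)))

    # c[(i, j)] with i <= j: pair counts summed over all columns.
    c = Counter()
    for column in zip(*sequences):
        n = Counter(column)
        for i, j in combinations_with_replacement(sorted(n), 2):
            c[(i, j)] += n[i] * (n[i] - 1) // 2 if i == j else n[i] * n[j]

    z = length * n_seq * (n_seq - 1) // 2      # total number of letter pairs
    q = {pair: count / z for pair, count in c.items()}

    # p_i = q_ii + (sum of q_ij over j != i) / 2
    p = {}
    for i in alphabet:
        off_diagonal = sum(v for (a, b), v in q.items() if a != b and i in (a, b))
        p[i] = q.get((i, i), 0.0) + off_diagonal / 2

    scores = {}
    for i, j in combinations_with_replacement(alphabet, 2):
        expected = p[i] ** 2 if i == j else 2 * p[i] * p[j]
        observed = q.get((i, j), 0.0)
        # A pair that never occurs has no finite log-odds; None marks it here.
        scores[(i, j)] = (
            round(2 * math.log2(observed / expected)) if observed > 0 else None
        )
    return scores


scores = blosum(["ATGTACGT", "TAGCTAGA", "GTACGACC"])
print(scores[("A", "T")])  # 1
print(scores[("A", "A")])  # -2
```

Only the entries with i ≤ j are stored, since the matrix is symmetric. For the three example sequences Z = 8 * 3 * 2 / 2 = 24, and the A-T pair occurs 5 times, which is more often than expected from the letter frequencies, hence the positive score.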

This calculation is based on Dr. Sepp Hochreiter's script Bioinformatics I.

Input Files

The input file can have any extension. The sequences in the input file should fulfill the following properties:

All sequences have the same length.
Every sequence is separated by a newline.
At least two sequences are given.
Any input line starting with - is ignored.

Examples: Valid ✔️

-This is a valid input file, this line is ignored.
TACGTAGCTAGC
TGCATGCTAGCC
TGCTGCTGCCCA
TGTGTACACCCC
-This line is also ignored.

Not Valid ❌

-This is an invalid input file, because sequences differ in length.
TACGTAGCTAGC
TGCATGCT
TGCTGCTGCCCA
TGTGTACACCC
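The input rules can be checked with a short parser. This is a minimal sketch, not the tool's actual loader: the function name parse_sequences is hypothetical, and skipping blank lines is an added assumption.

```python
def parse_sequences(text):
    """Return the sequences in `text`, enforcing the input rules (sketch)."""
    sequences = [
        line.strip()
        for line in text.splitlines()
        # Lines starting with "-" are ignored; blank lines are skipped
        # as well, which is an assumption of this sketch.
        if line.strip() and not line.startswith("-")
    ]
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("all sequences must have the same length")
    return sequences


valid = """-This is a valid input file, this line is ignored.
TACGTAGCTAGC
TGCATGCTAGCC
TGCTGCTGCCCA
TGTGTACACCCC
-This line is also ignored.
"""
print(parse_sequences(valid))  # the four sequences, comment lines removed

invalid = """TACGTAGCTAGC
TGCATGCT
TGCTGCTGCCCA
TGTGTACACCC
"""
try:
    parse_sequences(invalid)
except ValueError as error:
    print(error)  # all sequences must have the same length
```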
