Binary-Nucleotide-Encoding

A method for encoding data from FASTA files (or other Nucleotide raw data) into a more size efficient file

FASTA FILE ENCODING - ASCII: In ASCII encoding, each character is represented by a single byte, which is a sequence of 8 bits. Since there are only 128 characters in the ASCII character set, each character can be represented by a unique combination of 7 bits (2^7 = 128).

For example, the ASCII encoding of the letter "A" is the binary sequence 01000001. This corresponds to the decimal value 65, which is the ASCII code for the letter "A". Similarly, the ASCII encoding of the letter "B" is the binary sequence 01000010, which corresponds to the decimal value 66.

Since each character is represented by a single byte in ASCII encoding, the size of an ASCII-encoded file is directly proportional to the number of characters in the file. For example, a text file containing 1000 ASCII characters will have a file size of 1000 bytes.

ENCODING SCHEMES: DEGENERATE ENCODING (Tetrad 4-bit per nucleotide)

ASCII Binary (8-bit)	Binary Tetrad (4-bit)	Representative Character	Description
01011111	0000	-	Dash
00100000	0000		Blank
01010100	0001	T	Thymine
01010101	0001	U	Uracil
01000001	0010	A	Adenosine
01000011	0100	C	Cytosine
01000111	1000	G	Guanine
01010111	0011	W	A/T
01010011	1100	S	C/G
01001101	0110	M	A/C
01001011	1001	K	G/T
01010010	1010	R	A/G
01011001	0101	Y	C/T
01000010	1101	B	Not A
01000100	1011	D	Not C
01001000	0111	H	Not G
01010110	1100	V	Not T
01001110	1111	N	Any

CONFIDENCE ENCODING (Triad 3-bit per nucleotide)

ASCII Binary (8-bit)	Binary Triad (3-bit)	Representative Character	Description
01010100	000	T	Thymine
	100	t	(Thymine)
01010101	000	U	Uracil
	100	u	(Uracil)
01000001	001	A	Adenosine
	101	a	(Adenosine)
01000011	010	C	Cytosine
	110	c	(Cytosine)
01000111	011	G	Guanine
	111	g	(Guanine)

SIMPLE ENCODING (Diad 2-bit per nucleotide)

ASCII Binary (8-bit)	Binary Diad (2-bit)	Representative Character	Description
01010100	00	T	Thymine
01010101	00	U	Uracil
01000001	01	A	Adenosine
01000011	10	C	Cytosine
01000111	11	G	Guanine

Sequence of "TACG" would normally get encoded in ASCII as "01010100010000010100001101000111" (32 bits)

With the proposed Binary Tetrad encoding the same sequence would be at largest "0001001001001000" (16-bits), at the smallest "00011011" (8-bits)

Furthermore there are applications of each encoding structure that can be applied to provide deeper insight or alternative workflows as well. For the Degenerate Encoding we can then consider that each nucleotide binary tetrad consists of four values T => [0,0,0,1], which is perfectly size to be representative of pixel information (r,g,b,a). So we can then encode a single row of pixels where each value in the rgba-array holds the nucleotide information. We can then use an already small image format like PNG to reduce the file size significantly.

GRAPHICAL SIZE COMPARISONS:

Notice the file size redction is only to 2/3 the original file. This is due to the Description information taking up more data than the Sequence.

As the Sequence length increases; file size reduction can easily break 1/2 the original file size!

The longer the sequences get the closer towards compression level file sizes can be achieved.

AREAS OF EXPANSION (WIP): .fastq (Illumina), .fasta with ascii codes phred32/phred64 base calling meta-quality-data with spacer lines, a 4 line file 1) Header, 2) Quality, 3) Bases, 4) Spacer Line .fast5 (Nanopore), .fastq with pore data There are 3 main branches of data stored in the fast

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
Resources		Resources
LICENSE		LICENSE
Nucleotide_Encoder.py		Nucleotide_Encoder.py
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Binary-Nucleotide-Encoding

About

Releases

Packages

Languages

License

iPsychonaut/Binary-Nucleotide-Encoding

Folders and files

Latest commit

History

Repository files navigation

Binary-Nucleotide-Encoding

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages