This hackathon project aims to use a Convolutional Neural Network (CNN) to accurately identify promoter classes for DNA sequencing, leveraging spatial pattern recognition to enhance genetic analysis.

sean-atkinson/classifying_dna_sequences


Classifying DNA Sequences

Table of Contents


I know absolutely nothing about DNA
What the heck is a promoter?
And what the heck are we going to do with a promoter?
K-mers
In conclusion, I still know next to nothing about DNA but...
Datasets and libraries used

I know absolutely nothing about DNA

I know absolutely nothing about DNA.

Actually, that’s a lie.

I’m vaguely familiar with what DNA is and what a DNA structure looks like.

(Image: DNA sequence)

Yeah, that thingy.

The point remains though, I’m anything but a DNA expert.

This is why I felt a not-insignificant amount of trepidation (if we’re being kind) or sheer terror (if we’re being honest) when I recently took part in a hackathon to build a model that classifies DNA sequences as promoter or non-promoter.

What the heck is a promoter?

(Back to table of contents)

As the volume of DNA sequence data grows rapidly, maintaining and annotating that data has become critically important.

A promoter is a region of DNA that is required for transcription to begin. If we know where the promoter sits in a sequence, we can locate the start of the transcribed region, which is later translated into a protein sequence.

If we know which part of a DNA sequence is a promoter, we can also use it to control how actively the DNA downstream of it is expressed, that is, the rate at which it is ultimately turned into protein.

In other words, identifying the promoter is essential for DNA sequence analysis.

Many methods and tools have been developed for this purpose, and many have achieved high accuracy.

The goal of this hackathon was to create a model of my own to classify whether a given DNA sequence is a promoter sequence or not.

And what the heck are we going to do with a promoter?

(Back to table of contents)

Good question!

I had no earthly idea how I was going to do that either.

Luckily, we weren’t left completely on our own for the hackathon.

We received a selection of papers to read from for ideas on how we might proceed.

Two papers of note caught my attention:

While I didn’t understand the ins-and-outs of the DNA parts, I most definitely was able to understand the machine learning/NLP parts, and that is how I came across the concept of k-mers.

K-mers

(Back to table of contents)

In Bioinformatics, k-mers are subsequences of length k contained within a biological sequence.

K-mers let you turn DNA into a language of sorts by breaking a sequence into “words” of length k.

An example of a DNA sequence and k-mer sequences where k = 3 Source
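A minimal sketch of the definition above: every subsequence of length k, taken with a sliding window that moves one base at a time. The function name `kmers` and the example sequence are made up for illustration; they are not taken from the hackathon notebook.

```python
def kmers(sequence: str, k: int = 3) -> list[str]:
    """Return every overlapping subsequence of length k, left to right."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length n contains n - k + 1 k-mers (none if n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


example = "ATGCATGC"  # made-up sequence, not from the hackathon data
words = kmers(example, k=3)
print(words)  # ['ATG', 'TGC', 'GCA', 'CAT', 'ATG', 'TGC']
```

Note that the same word can appear more than once, just like in a sentence.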

Once you choose your k, you simply divide up the DNA sequence into chunks of that length.

Again, for k = 3:

Dividing a DNA sequence into chunks of 3 Source
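Here is a rough sketch of that chunking step, assuming the chunks do not overlap and that any leftover bases at the end (fewer than k) are simply dropped. Both choices are assumptions for the example; the helper name `chunks` is hypothetical.

```python
def chunks(sequence: str, k: int = 3) -> list[str]:
    """Split a sequence into consecutive, non-overlapping pieces of length k.

    A trailing piece shorter than k is dropped (an assumption; padding it
    would be an equally reasonable choice).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    usable = len(sequence) - len(sequence) % k  # largest multiple of k that fits
    return [sequence[i:i + k] for i in range(0, usable, k)]


example = "ATGCATGCA"  # made-up sequence, not from the hackathon data
print(chunks(example, k=3))            # ['ATG', 'CAT', 'GCA']
print(" ".join(chunks(example, k=3)))  # ATG CAT GCA
```

Joining the chunks with spaces gives you something that looks a lot like a sentence, which is what makes NLP-style tooling usable on DNA.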

Generally speaking, decomposing a sequence into k-size chunks allows for fast and easy string manipulation.
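To give a flavour of that string manipulation, here is one common NLP-style way to turn k-mer words into numbers a model can use: count how often each possible word occurs. This is only a sketch under my own assumptions (a fixed A/C/G/T alphabet, overlapping windows, and a hypothetical `kmer_counts` helper), and not necessarily the encoding used in this project.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence: str, k: int = 3) -> list[int]:
    """Count each possible k-mer over the alphabet ACGT.

    The result has 4**k entries, ordered AAA, AAC, AAG, ... for k = 3.
    """
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Counter returns 0 for words that never occur in the sequence.
    return [counts[word] for word in vocabulary]


vector = kmer_counts("ATGCATGC", k=3)  # made-up sequence
print(len(vector))  # 64
print(sum(vector))  # 6
```

Every sequence ends up as a vector of the same length, no matter how long the original DNA string was.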

Once you figure that out, it simply becomes a matter of picking a model type (or two or three) and playing around with the parameters to see what gives you the best results.

For me, that was a Convolutional Neural Network that peaked at a validation accuracy score of 90.04%. It did a little worse on the unseen test data that we were given to predict for the hackathon, but that’s nothing a little parameter tuning can’t fix.
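For intuition about why a CNN suits this kind of data, here is a tiny pure-Python sketch of what a single convolutional filter followed by max pooling computes on a one-hot encoded sequence. It is not the model from this project: the filter weights are written by hand rather than learned, there is no bias or activation, and the motif `TATA` and the names `one_hot` and `conv1d_max` are assumptions made for the example.

```python
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot(sequence: str) -> list[tuple[int, ...]]:
    """Encode each base as a 4-element indicator vector."""
    return [ONE_HOT[base] for base in sequence]


def conv1d_max(sequence: str, filter_weights: list[tuple[int, ...]]) -> int:
    """Slide a filter along the sequence and return its strongest response."""
    encoded = one_hot(sequence)
    width = len(filter_weights)
    if len(encoded) < width:
        raise ValueError("sequence is shorter than the filter")
    responses = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        # Dot product between the filter and the current window.
        score = sum(
            w * x
            for row_w, row_x in zip(filter_weights, window)
            for w, x in zip(row_w, row_x)
        )
        responses.append(score)
    return max(responses)  # global max pooling over all positions


# Hand-written filter that responds to the hypothetical motif "TATA".
motif_filter = one_hot("TATA")
print(conv1d_max("GCTATAGC", motif_filter))  # 4
print(conv1d_max("GCGCGCGC", motif_filter))  # 0
```

A real CNN learns many such filters from the training data instead of having them written by hand, which is how it can pick up on patterns without anyone telling it what a promoter looks like.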

In conclusion, I still know next to nothing about DNA but...

(Back to table of contents)

Thankfully, that doesn’t prevent me from creating a model that can differentiate between promoter and non-promoter classes. And with a high degree of accuracy too.

Though I can’t help but wonder how much better my models would perform if I knew more about DNA...

Datasets and libraries used

(Back to table of contents)

Dataset:

  • Custom Hackathon dataset

Libraries: keras, numpy, pandas, sklearn, and tensorflow.
