Implementation, Testing, and Performance Analysis of Core DNA Sequence Algorithms

Overview

This repository contains a modular and highly optimized Python library for performing fundamental operations and statistical analysis on DNA sequences.

Developed for CCA3: DNA Fundamentals and Basic Tools, this project emphasizes:

- Correctness
- Code quality (PEP 8 compliance)
- Algorithmic efficiency, particularly O(N) and O(1) performance, demonstrated through the optimization of the Reverse Complement function

The project structure separates concerns into distinct modules for data structures, core algorithms, and utility functions.

Features

1. Core Data Structures — `dna_data_structures.py`

DNA Class:
An object-oriented representation of a DNA sequence.

O(1) Complexity:
Nucleotide counts, GC content, and AT content are precomputed when the object is initialized (a one-time O(N) cost), so methods such as calculate_GC_content() return in O(1) (constant) time.

Robust Validation:
On instantiation, checks that the sequence contains only the bases A, T, C, and G. IUPAC degenerate codes are not accepted by this class.
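
The precompute-then-look-up design described above can be sketched as follows. This is a minimal illustration, not the code in `dna_data_structures.py`: the names DNA, calculate_GC_content, and InvalidNucleotideError come from this README, while the attribute names, the upper-casing of input, and the choice to return 0.0 for an empty sequence are assumptions.

```python
from collections import Counter


class InvalidNucleotideError(ValueError):
    """Raised when a sequence contains a character other than A, T, C, or G."""


class DNA:
    """DNA sequence whose base counts are computed once, at construction."""

    VALID_BASES = frozenset("ATCG")

    def __init__(self, sequence):
        sequence = sequence.upper()  # assumption: input is treated case-insensitively
        invalid = set(sequence) - self.VALID_BASES
        if invalid:
            raise InvalidNucleotideError(f"invalid bases: {sorted(invalid)}")
        self.sequence = sequence
        self._counts = Counter(sequence)  # the one-time O(N) pass
        self._length = len(sequence)

    def calculate_GC_content(self):
        """Return GC content as a percentage, using only the cached counts."""
        if self._length == 0:
            return 0.0  # assumption: an empty sequence reports 0% rather than raising
        return 100.0 * (self._counts["G"] + self._counts["C"]) / self._length
```

Because the counts are stored on the instance, repeated calls to calculate_GC_content() never rescan the sequence.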

2. Essential Algorithms — Transcription & Reverse Complement

DNA to RNA Transcription (`dna_transcription.py`)

Handles T → U substitution.
Supports transcription from both coding and template strands (Question 4).
Includes batch processing capabilities for multiple sequences.
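
As a rough sketch of the two transcription modes, the example below uses hypothetical function names (transcribe, transcribe_batch) and a hypothetical `strand` parameter. None of these are taken from `dna_transcription.py`. It also assumes the template strand is supplied in 5′→3′ orientation, so the result is reversed to report the RNA 5′→3′.

```python
# Coding strand: the RNA has the same sequence, with U in place of T.
_CODING_TO_RNA = str.maketrans("T", "U")
# Template strand: the RNA is complementary to it (A->U, T->A, C->G, G->C).
_TEMPLATE_TO_RNA = str.maketrans("ATCG", "UAGC")


def transcribe(sequence, strand="coding"):
    """Return the RNA transcript (5'->3') for a coding or template strand."""
    sequence = sequence.upper()
    if strand == "coding":
        return sequence.translate(_CODING_TO_RNA)
    if strand == "template":
        # Assumption: the template is written 5'->3', so the complement is
        # reversed to give the transcript in 5'->3' orientation.
        return sequence.translate(_TEMPLATE_TO_RNA)[::-1]
    raise ValueError("strand must be 'coding' or 'template'")


def transcribe_batch(sequences, strand="coding"):
    """Transcribe several sequences with the same strand setting."""
    return [transcribe(sequence, strand) for sequence in sequences]
```

Under these assumptions, a coding strand and its reverse-complement template yield the same transcript: transcribe("ATGC") and transcribe("GCAT", strand="template") both return "AUGC".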

Reverse Complement Generation (`reverse_complement_generation.py`)

Implements a comprehensive complement map that includes IUPAC Degenerate Codes (R, Y, S, W, K, M, etc.) (Question 5).
Supports output in both orientations:
- 5′→3′ (reverse complement)
- 3′→5′ (complement)
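
A complement map covering the standard IUPAC nucleotide codes might look like the sketch below. The function name generate_reverse_complement appears in this README, but the `reverse` flag used here to select the orientation is a hypothetical parameter, and the actual signature in `reverse_complement_generation.py` may differ.

```python
# Each IUPAC code maps to the code for the complementary set of bases:
# R (A/G) <-> Y (C/T), K (G/T) <-> M (A/C), B <-> V, D <-> H;
# S (C/G), W (A/T), and N are their own complements.
_IUPAC_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVN",
    "TGCAYRSWMKVHDBN",
)


def generate_reverse_complement(sequence, reverse=True):
    """Complement a sequence; reverse it too unless reverse=False.

    reverse=True  -> reverse complement, read 5'->3'
    reverse=False -> plain complement, read 3'->5'
    """
    complement = sequence.upper().translate(_IUPAC_COMPLEMENT)
    return complement[::-1] if reverse else complement
```

For example, "AARY" complements to "TTYR", which reads "RYTT" once reversed.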

3. Utility Functions — `dna_string_manipulation.py`, `dna_analysis.py`

Sequence case conversion (to_uppercase, to_lowercase).
Non-nucleotide character removal.
Sequence splitting into codons based on customizable reading frames.
Detailed nucleotide analysis and reporting (counts, percentages, GC/AT content).
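
Codon splitting with a selectable reading frame can be illustrated as below. The function name split_into_codons, the 1-based `frame` parameter, and the decision to drop a trailing incomplete codon are all assumptions made for this sketch. The behavior of `dna_string_manipulation.py` may differ.

```python
def split_into_codons(sequence, frame=1):
    """Split a sequence into 3-base codons starting at reading frame 1, 2, or 3.

    Assumption: bases left over at the end that do not form a full codon
    are discarded.
    """
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    start = frame - 1
    # Stop two bases before the end so every slice is exactly three bases long.
    return [sequence[i:i + 3] for i in range(start, len(sequence) - 2, 3)]
```

With the eight-base input "ATGCCGTA", frame 1 gives ["ATG", "CCG"], frame 2 gives ["TGC", "CGT"], and frame 3 gives ["GCC", "GTA"].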

Performance and Optimization (Question 6)

A central focus of this project was optimizing Reverse Complement Generation to achieve high efficiency.
Three versions were implemented and benchmarked using timeit and cProfile (algorithm_optimisation.py, run_benchmarks.py).

Optimization Strategy: C-Level Translation

The final optimized function (generate_reverse_complement) does its work in C-level string operations to minimize Python interpreter overhead:

V3 (Optimized): Uses str.maketrans() and sequence.translate() to complement the whole sequence in a single C-level operation.
Reversal: Handled via [::-1] slicing, which is also a C-level operation.

Both steps are O(N), so the function runs in linear time with a small constant factor, which makes it suitable for large DNA datasets.
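
To see the effect of moving the work into C-level string methods, one could compare a per-base Python loop against the translate-and-slice approach with timeit. The loop baseline and the benchmark helper below are illustrative assumptions. They are not the V1/V2 implementations from `algorithm_optimisation.py`, and no timings are claimed here.

```python
import timeit

_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
_TABLE = str.maketrans("ATCG", "TAGC")


def reverse_complement_loop(sequence):
    """Baseline: one dictionary lookup per base, driven by the interpreter."""
    return "".join(_COMPLEMENT[base] for base in reversed(sequence))


def reverse_complement_translate(sequence):
    """Complement with str.translate, then reverse with a slice."""
    return sequence.translate(_TABLE)[::-1]


def benchmark(length=100_000, repeats=5):
    """Return the best wall-clock time in seconds for each implementation."""
    sequence = "ATCG" * (length // 4)
    return {
        func.__name__: min(
            timeit.repeat(lambda: func(sequence), number=1, repeat=repeats)
        )
        for func in (reverse_complement_loop, reverse_complement_translate)
    }
```

Calling benchmark() returns a dictionary keyed by function name. Both implementations are O(N), so any difference reflects constant-factor overhead rather than asymptotic behavior.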

Testing Suite (Question 7)

Unit tests check correctness and robustness across all components, including behavior on long inputs.

Framework

Implemented using Python’s unittest framework (unittest_*.py files).

Test Coverage Includes:

Invalid Input Handling:
Tests for InvalidNucleotideError across all core functions (test_dna_analysis.py, unittest_dna_transcription.py).
Edge Cases:
Empty sequences, single-base sequences, and long sequences (up to 100,000 bases).
Robustness:
Tests for correct handling of IUPAC degenerate codes and accurate orientation output in the Reverse Complement function.
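
The categories above could be expressed as a unittest test case along the following lines. To keep the example self-contained, it defines a stand-in generate_reverse_complement and InvalidNucleotideError. Both are simplified assumptions, and the test names are hypothetical rather than copied from the repository's test files.

```python
import unittest


class InvalidNucleotideError(ValueError):
    """Stand-in for the library's exception (assumed to be a ValueError)."""


_ALLOWED = frozenset("ACGTRYSWKMBDHVN")
_TABLE = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def generate_reverse_complement(sequence):
    """Simplified stand-in for the function under test."""
    invalid = set(sequence) - _ALLOWED
    if invalid:
        raise InvalidNucleotideError(f"invalid bases: {sorted(invalid)}")
    return sequence.translate(_TABLE)[::-1]


class ReverseComplementTests(unittest.TestCase):
    def test_invalid_input_raises(self):
        with self.assertRaises(InvalidNucleotideError):
            generate_reverse_complement("ATXG")

    def test_empty_sequence(self):
        self.assertEqual(generate_reverse_complement(""), "")

    def test_single_base(self):
        self.assertEqual(generate_reverse_complement("A"), "T")

    def test_long_sequence(self):
        sequence = "ATCG" * 25_000  # 100,000 bases
        result = generate_reverse_complement(sequence)
        self.assertEqual(result, "CGAT" * 25_000)
        # Applying the operation twice must return the original sequence.
        self.assertEqual(generate_reverse_complement(result), sequence)

    def test_iupac_degenerate_codes(self):
        self.assertEqual(generate_reverse_complement("AARY"), "RYTT")
```

Saved as a module, a test case like this can be run with `python -m unittest`.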

Repository: siyaagarwal2005/Algorithms-In-Bioinformatics-CCA-3 (4 commits)

Files:
- README.md
- algorithm_optimisation.py
- dna_analysis.py
- dna_data_structures.py
- dna_string_manipulation.py
- dna_transcription.py
- reverse_complement_generation.py
- run_benchmarks.py
- test_dna_analysis.py
- test_dna_data_structures.py
- test_dna_string_manipulation.py
- test_dna_transcription.py
- test_reverse_complement_generation.py
- test_utilities.py
- unittest_dna_transcription.py
- unittest_reverse_complement.py
