This repository contains a modular and highly optimized Python library for performing fundamental operations and statistical analysis on DNA sequences.
Developed for CCA3: DNA Fundamentals and Basic Tools, this project emphasizes:
- Correctness
- Code Quality (PEP 8 Compliance)
- Algorithmic Efficiency — particularly O(N) and O(1) performance, demonstrated through the optimization of the Reverse Complement function.
The project structure separates concerns into distinct modules for data structures, core algorithms, and utility functions.
DNA Class:
An object-oriented representation of a DNA sequence.
- O(1) Complexity: Nucleotide counts, GC content, and AT content are pre-calculated during object initialization (an O(N) one-time cost), allowing for O(1) (constant time) lookup via methods like `calculate_GC_content()`.
- Robust Validation: Ensures sequences contain only valid A, T, C, G bases (non-IUPAC) upon instantiation.
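The precompute-then-lookup design described above can be sketched as follows. This is a minimal illustration, not the library's actual class: the attribute names, the `count()` helper, the definition of `InvalidNucleotideError`, and the choice to report 0.0 for an empty sequence are all assumptions. Only the names `calculate_GC_content()` and `InvalidNucleotideError` appear in this README.

```python
from collections import Counter


class InvalidNucleotideError(ValueError):
    """Hypothetical definition: raised for any base outside A, T, C, G."""


class DNA:
    """Illustrative DNA sequence whose statistics are cached at construction."""

    VALID_BASES = frozenset("ATCG")

    def __init__(self, sequence: str) -> None:
        # Validation: reject anything that is not a plain A/T/C/G base.
        invalid = set(sequence) - self.VALID_BASES
        if invalid:
            raise InvalidNucleotideError(f"Invalid bases: {sorted(invalid)}")
        self.sequence = sequence

        # O(N) work happens once, here.
        self._counts = Counter(sequence)
        length = len(sequence)
        gc = self._counts["G"] + self._counts["C"]
        # Assumption: an empty sequence reports 0.0 rather than raising.
        self._gc_content = 100.0 * gc / length if length else 0.0
        self._at_content = 100.0 * (length - gc) / length if length else 0.0

    def calculate_GC_content(self) -> float:
        """O(1): returns the value cached in __init__."""
        return self._gc_content

    def calculate_AT_content(self) -> float:
        """O(1): returns the value cached in __init__."""
        return self._at_content

    def count(self, base: str) -> int:
        """O(1): dictionary lookup in the cached counts."""
        return self._counts[base]
```

Because every statistic is stored on the instance, repeated calls cost the same regardless of sequence length. The trade-off is that the object must be treated as immutable once it has been created.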
- Handles T → U substitution.
- Supports transcription from both coding and template strands (Question 4).
- Includes batch processing capabilities for multiple sequences.
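A short sketch of how transcription from either strand might look. The function names, the `strand` parameter, and the convention that every input is written 5′ to 3′ are assumptions made for this example, not the library's confirmed API.

```python
from typing import Iterable, List

# Maps each DNA base to the RNA base that pairs with it.
_DNA_TO_RNA_COMPLEMENT = str.maketrans("ATCG", "UAGC")


def transcribe(sequence: str, strand: str = "coding") -> str:
    """Return the mRNA (5' to 3') for a DNA strand written 5' to 3'."""
    if strand == "coding":
        # The coding strand matches the mRNA except for T -> U.
        return sequence.replace("T", "U")
    if strand == "template":
        # The mRNA is complementary and antiparallel to the template,
        # so complement into the RNA alphabet and then reverse.
        return sequence.translate(_DNA_TO_RNA_COMPLEMENT)[::-1]
    raise ValueError(f"Unknown strand: {strand!r}")


def transcribe_batch(sequences: Iterable[str], strand: str = "coding") -> List[str]:
    """Transcribe several sequences with the same strand setting."""
    return [transcribe(seq, strand) for seq in sequences]
```

For example, `transcribe("ATGC")` and `transcribe("GCAT", strand="template")` describe the same double-stranded region from opposite strands, so both yield the same mRNA.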
- Implements a comprehensive complement map that includes IUPAC Degenerate Codes (R, Y, S, W, K, M, etc.) (Question 5).
- Supports output in both orientations:
  - 5′-3′ (Reverse Complement)
  - 3′-5′ (Complement)
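One way to build a complement map that covers the IUPAC degenerate codes is shown below. The names `IUPAC_COMPLEMENT` and `complement_sequence`, the `reverse` flag, and the uppercase-only input are assumptions for illustration. The pairings themselves follow the standard IUPAC nucleotide code.

```python
# Standard IUPAC pairings: R<->Y, K<->M, B<->V, D<->H; S, W and N map to themselves.
IUPAC_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVN",
    "TGCAYRSWMKVHDBN",
)


def complement_sequence(sequence: str, reverse: bool = True) -> str:
    """Complement an uppercase sequence written 5' to 3'.

    reverse=True  -> reverse complement, read 5' to 3'
    reverse=False -> plain complement, read 3' to 5'
    """
    complemented = sequence.translate(IUPAC_COMPLEMENT)
    return complemented[::-1] if reverse else complemented
```

Applying the reverse complement twice returns the original sequence, which makes a convenient sanity check for the table.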
- Sequence case conversion (`to_uppercase`, `to_lowercase`).
- Non-nucleotide character removal.
- Sequence splitting into codons based on customizable reading frames.
- Detailed nucleotide analysis and reporting (counts, percentages, GC/AT content).
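The cleaning and codon-splitting utilities could look roughly like this. The function names, the 0-based `frame` parameter, and the decision to drop an incomplete trailing codon are assumptions made for this sketch.

```python
from typing import List


def remove_non_nucleotides(sequence: str, alphabet: str = "ATCG") -> str:
    """Drop every character that is not in the alphabet (either case)."""
    allowed = set(alphabet.upper()) | set(alphabet.lower())
    return "".join(ch for ch in sequence if ch in allowed)


def split_into_codons(sequence: str, frame: int = 0) -> List[str]:
    """Split into triplets starting at offset `frame` (0, 1 or 2).

    Assumption: trailing bases that do not fill a codon are discarded.
    """
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    # Last index that still leaves room for a complete codon.
    usable_end = len(sequence) - (len(sequence) - frame) % 3
    return [sequence[i:i + 3] for i in range(frame, usable_end, 3)]
```

A typical pipeline would clean the raw text first and then split it:

```python
cleaned = remove_non_nucleotides("ATG-CCG TA\n").upper()
codons = split_into_codons(cleaned, frame=0)
```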
A central focus of this project was optimizing Reverse Complement Generation to achieve high efficiency.
Three versions were implemented and benchmarked using timeit and cProfile (algorithm_optimisation.py, run_benchmarks.py).
The final optimized function (`generate_reverse_complement`) delegates the per-base work to built-in string operations that run in C under CPython, which minimizes Python interpreter overhead:
- V3 (Optimized): Uses `str.maketrans()` and `sequence.translate()` for fast complementation in a single C-level operation.
- Reversal: Handled via `[::-1]` slicing, also a C-level operation.
This combination results in a highly efficient O(N) runtime, making it suitable for large-scale DNA datasets.
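The kind of comparison described above can be sketched with `timeit`. The baseline `reverse_complement_loop` is a hypothetical stand-in for an unoptimized version and is not taken from `algorithm_optimisation.py`. The body of `generate_reverse_complement` is likewise an illustration of the translate-and-slice approach, not the repository's exact code.

```python
import random
import timeit

_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}
_COMPLEMENT_TABLE = str.maketrans("ATCG", "TAGC")


def reverse_complement_loop(sequence: str) -> str:
    """Hypothetical baseline: one dictionary lookup per base, in Python."""
    return "".join(_PAIRS[base] for base in reversed(sequence))


def generate_reverse_complement(sequence: str) -> str:
    """Translate-and-slice approach: both steps run inside the interpreter's C code."""
    return sequence.translate(_COMPLEMENT_TABLE)[::-1]


def benchmark(n_bases: int = 100_000, repeats: int = 20) -> dict:
    """Time both versions on the same random sequence; returns seconds per version."""
    rng = random.Random(0)  # fixed seed so runs are comparable
    sequence = "".join(rng.choice("ATCG") for _ in range(n_bases))
    # Both versions must agree before their timings mean anything.
    assert reverse_complement_loop(sequence) == generate_reverse_complement(sequence)
    return {
        func.__name__: timeit.timeit(lambda: func(sequence), number=repeats)
        for func in (reverse_complement_loop, generate_reverse_complement)
    }


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name}: {seconds:.4f} s")
```

Both functions are O(N). Any difference between them comes from the constant factor: how much of the per-base work runs in interpreted Python as opposed to compiled code.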
Comprehensive unit testing ensures correctness, robustness, and performance across all components.
Implemented using Python’s unittest framework (unittest_*.py files).
- Invalid Input Handling: Tests for `InvalidNucleotideError` across all core functions (`test_dna_analysis.py`, `unittest_dna_transcription.py`).
- Edge Cases: Empty sequences, single-base sequences, and long sequences (up to 100,000 bases).
- Robustness: Tests for correct handling of IUPAC degenerate codes and accurate orientation output in the Reverse Complement function.
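A compact example of what such tests might look like with `unittest`. The function under test and the exception class are defined inline so that the snippet runs on its own. They are simplified stand-ins, and the test names are illustrative rather than copied from the `unittest_*.py` files.

```python
import unittest


class InvalidNucleotideError(ValueError):
    """Stand-in for the library's exception of the same name."""


_TABLE = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")
_VALID = frozenset("ACGTRYSWKMBDHVN")


def reverse_complement(sequence: str) -> str:
    """Simplified function under test: validate, complement, reverse."""
    invalid = set(sequence) - _VALID
    if invalid:
        raise InvalidNucleotideError(f"Invalid bases: {sorted(invalid)}")
    return sequence.translate(_TABLE)[::-1]


class TestReverseComplement(unittest.TestCase):
    def test_invalid_input_raises(self):
        with self.assertRaises(InvalidNucleotideError):
            reverse_complement("ATXG")

    def test_empty_sequence(self):
        self.assertEqual(reverse_complement(""), "")

    def test_single_base(self):
        self.assertEqual(reverse_complement("A"), "T")

    def test_long_sequence_round_trip(self):
        sequence = "ATCG" * 25_000  # 100,000 bases
        self.assertEqual(reverse_complement(reverse_complement(sequence)), sequence)

    def test_iupac_codes_and_orientation(self):
        # R (A/G) pairs with Y (C/T); the output is read 5' to 3'.
        self.assertEqual(reverse_complement("ATGR"), "YCAT")
```

Saved as a module, tests like these can be run with `python -m unittest`.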