siyaagarwal2005/Algorithms-In-Bioinformatics-CCA-3

Implementation, Testing, and Performance Analysis of Core DNA Sequence Algorithms

Overview

This repository contains a modular Python library for fundamental operations and statistical analysis on DNA sequences, with an emphasis on linear-time processing and constant-time lookups.

Developed for CCA3: DNA Fundamentals and Basic Tools, this project emphasizes:

  • Correctness
  • Code Quality (PEP 8 Compliance)
  • Algorithmic Efficiency, particularly O(N) processing and O(1) lookups, with the Reverse Complement function as the main optimization case study.

The project structure separates concerns into distinct modules for data structures, core algorithms, and utility functions.


Features

1. Core Data Structures — dna_data_structures.py

DNA Class:
An object-oriented representation of a DNA sequence.

  • O(1) Lookups:
    Nucleotide counts, GC content, and AT content are computed once during object initialization (a one-time O(N) cost), so methods such as calculate_GC_content() return in O(1) (constant) time.
  • Robust Validation:
    On instantiation, the class verifies that the sequence contains only the four unambiguous bases A, T, C, and G; IUPAC ambiguity codes are not accepted here.
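
The precompute-then-look-up pattern described above might be sketched as follows. This is a minimal illustration, not the code in dna_data_structures.py: only the names DNA, InvalidNucleotideError, and calculate_GC_content() come from this README. The count() helper, the uppercase normalization, the ValueError base class, and the choice to report GC content as a percentage are all assumptions.

```python
from collections import Counter


class InvalidNucleotideError(ValueError):
    """Raised for any character other than A, T, C, or G (base class assumed)."""


class DNA:
    """DNA sequence whose composition statistics are computed once, up front."""

    VALID_BASES = frozenset("ATCG")

    def __init__(self, sequence: str) -> None:
        # Assumption: input is normalized to uppercase before validation.
        sequence = sequence.upper()
        invalid = set(sequence) - self.VALID_BASES
        if invalid:
            raise InvalidNucleotideError(
                f"Invalid nucleotide(s): {sorted(invalid)}"
            )
        self.sequence = sequence
        # One O(N) pass; every later query is a dictionary or attribute lookup.
        self._counts = Counter(sequence)
        total = len(sequence)
        gc = self._counts["G"] + self._counts["C"]
        # Guard against division by zero for the empty sequence.
        self._gc_content = 100.0 * gc / total if total else 0.0

    def count(self, base: str) -> int:
        """Return the cached count of one base (hypothetical helper, O(1))."""
        return self._counts[base.upper()]

    def calculate_GC_content(self) -> float:
        """Return the cached GC percentage (O(1))."""
        return self._gc_content
```

Under these assumptions, DNA("ATGC").calculate_GC_content() returns 50.0, and DNA("ATXG") raises InvalidNucleotideError.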

2. Essential Algorithms — Transcription & Reverse Complement

DNA to RNA Transcription (dna_transcription.py)

  • Handles T → U substitution.
  • Supports transcription from both coding and template strands (Question 4).
  • Includes batch processing capabilities for multiple sequences.
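
As a rough sketch of the two transcription modes, the example below uses hypothetical function names (transcribe, transcribe_batch) rather than the actual API of dna_transcription.py. It assumes every input strand is written 5′→3′, so transcribing a template strand means complementing it into RNA bases and then reversing the result.

```python
_VALID_BASES = frozenset("ATCG")
# Template base -> RNA base (A pairs with U in RNA).
_TEMPLATE_TO_RNA = str.maketrans("ATCG", "UAGC")


def transcribe(sequence: str, strand: str = "coding") -> str:
    """Return the mRNA (5'->3') for a DNA strand that is also given 5'->3'."""
    seq = sequence.upper()
    invalid = set(seq) - _VALID_BASES
    if invalid:
        raise ValueError(f"Invalid nucleotide(s): {sorted(invalid)}")
    if strand == "coding":
        # The mRNA matches the coding strand, with U in place of T.
        return seq.replace("T", "U")
    if strand == "template":
        # The mRNA is complementary and antiparallel to the template strand.
        return seq.translate(_TEMPLATE_TO_RNA)[::-1]
    raise ValueError("strand must be 'coding' or 'template'")


def transcribe_batch(sequences, strand: str = "coding"):
    """Transcribe several sequences, preserving their order."""
    return [transcribe(seq, strand) for seq in sequences]
```

With these definitions, transcribe("ATGGCC") and transcribe("GGCCAT", strand="template") both give "AUGGCC", because GGCCAT is the reverse complement of ATGGCC.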

Reverse Complement Generation (reverse_complement_generation.py)

  • Implements a comprehensive complement map that includes IUPAC Degenerate Codes (R, Y, S, W, K, M, etc.) (Question 5).
  • Supports output in both orientations:
    • 5′→3′ (reverse complement)
    • 3′→5′ (complement)
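
One way to picture an IUPAC-aware complement map with a selectable orientation is shown below. The function name complement_strand and the "5-3" / "3-5" orientation labels are illustrative assumptions, not the interface of reverse_complement_generation.py. The pairings themselves follow the standard IUPAC ambiguity codes.

```python
# Each ambiguity code maps to the code for the complementary base set,
# e.g. R (A or G) pairs with Y (T or C).
IUPAC_COMPLEMENT = {
    "A": "T", "T": "A", "C": "G", "G": "C",
    "R": "Y", "Y": "R", "S": "S", "W": "W",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}


def complement_strand(sequence: str, orientation: str = "5-3") -> str:
    """Complement a 5'->3' sequence and return it in the requested orientation."""
    try:
        complement = "".join(IUPAC_COMPLEMENT[base] for base in sequence.upper())
    except KeyError as exc:
        raise ValueError(f"Unsupported symbol: {exc.args[0]!r}") from None
    if orientation == "5-3":
        # Reading the complementary strand 5'->3' means reversing it.
        return complement[::-1]
    if orientation == "3-5":
        # Position-by-position complement, aligned under the input.
        return complement
    raise ValueError("orientation must be '5-3' or '3-5'")
```

For the input "ATGRYN", the "3-5" orientation yields "TACYRN" and the "5-3" orientation yields "NRYCAT".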

3. Utility Functions — dna_string_manipulation.py, dna_analysis.py

  • Sequence case conversion (to_uppercase, to_lowercase).
  • Non-nucleotide character removal.
  • Sequence splitting into codons based on customizable reading frames.
  • Detailed nucleotide analysis and reporting (counts, percentages, GC/AT content).
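
The cleaning and codon-splitting utilities could look roughly like this. The function names remove_non_nucleotides and split_into_codons are hypothetical, and the sketch assumes reading frames are numbered 1 to 3 and that an incomplete trailing codon is dropped; the repository may make different choices.

```python
def remove_non_nucleotides(sequence: str, alphabet: str = "ATCG") -> str:
    """Uppercase the sequence and keep only characters found in `alphabet`."""
    allowed = set(alphabet.upper())
    return "".join(ch for ch in sequence.upper() if ch in allowed)


def split_into_codons(sequence: str, frame: int = 1) -> list:
    """Split into triplets starting at reading frame 1, 2, or 3."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    start = frame - 1
    # Stop before any trailing fragment shorter than three bases.
    end = len(sequence) - (len(sequence) - start) % 3
    return [sequence[i:i + 3] for i in range(start, end, 3)]
```

For example, split_into_codons("ATGCGTAAC", frame=2) returns ["TGC", "GTA"], and remove_non_nucleotides("atg-cgt nn1") returns "ATGCGT".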

Performance and Optimization (Question 6)

A central focus of this project was reducing the runtime of Reverse Complement Generation.
Three versions were implemented and benchmarked using timeit and cProfile (algorithm_optimisation.py, run_benchmarks.py).

Optimization Strategy: C-Level Translation

The final optimized function (generate_reverse_complement) relies on built-in operations that execute in C, which minimizes Python interpreter overhead:

  • V3 (Optimized): Uses str.maketrans() and sequence.translate() for fast complementation in a single C-level operation.
  • Reversal: Handled via [::-1] slicing, also a C-level operation.

Both steps run in O(N) time in the sequence length and avoid a per-base Python loop, making the function suitable for large-scale DNA datasets.
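
To make the comparison concrete, the sketch below times a translate-and-slice version against a plain Python loop using timeit. The loop is a hypothetical baseline written for this illustration; it is not necessarily the repository's V1 or V2. The function names and the benchmark() helper are likewise assumptions rather than the contents of algorithm_optimisation.py or run_benchmarks.py.

```python
import random
import timeit

_COMPLEMENT = str.maketrans("ATCG", "TAGC")
_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement_loop(sequence: str) -> str:
    """Baseline: one dictionary lookup per base in a Python-level loop."""
    return "".join(_PAIRS[base] for base in reversed(sequence))


def reverse_complement_translate(sequence: str) -> str:
    """Complement with str.translate, then reverse with slicing."""
    return sequence.translate(_COMPLEMENT)[::-1]


def benchmark(n_bases: int = 100_000, repeats: int = 5) -> dict:
    """Return the best-of-`repeats` wall time in seconds for each version."""
    rng = random.Random(0)  # fixed seed so both versions see the same input
    seq = "".join(rng.choice("ATCG") for _ in range(n_bases))
    # Sanity check: both versions must agree before timing them.
    assert reverse_complement_loop(seq) == reverse_complement_translate(seq)
    return {
        fn.__name__: min(timeit.repeat(lambda: fn(seq), number=1, repeat=repeats))
        for fn in (reverse_complement_loop, reverse_complement_translate)
    }
```

Calling benchmark() and printing the returned dictionary gives one timing per version. Absolute numbers depend on the machine and Python build, so none are quoted here.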


Testing Suite (Question 7)

Unit tests check correctness, robustness, and behavior on large inputs across all components.

Framework

Implemented using Python’s unittest framework (unittest_*.py files).

Test Coverage Includes:

  • Invalid Input Handling:
    Tests for InvalidNucleotideError across all core functions (test_dna_analysis.py, unittest_dna_transcription.py).
  • Edge Cases:
    Empty sequences, single-base sequences, and long sequences (up to 100,000 bases).
  • Robustness:
    Tests for correct handling of IUPAC degenerate codes and accurate orientation output in the Reverse Complement function.
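
The kinds of cases listed above might be written with unittest along these lines. The generate_reverse_complement defined here is a self-contained stand-in so the example runs on its own, and the ValueError base class of InvalidNucleotideError is an assumption. The repository's real tests import the project's own modules and may be organized differently.

```python
import unittest

_SYMBOLS = "ATCGRYSWKMBDHVN"
_COMPLEMENT = str.maketrans(_SYMBOLS, "TAGCYRSWMKVHDBN")


class InvalidNucleotideError(ValueError):
    """Stand-in for the project's exception (base class assumed)."""


def generate_reverse_complement(sequence: str) -> str:
    """Stand-in implementation used only to make this test file runnable."""
    if set(sequence) - set(_SYMBOLS):
        raise InvalidNucleotideError("sequence contains unsupported symbols")
    return sequence.translate(_COMPLEMENT)[::-1]


class TestReverseComplement(unittest.TestCase):
    def test_invalid_symbol_raises(self):
        with self.assertRaises(InvalidNucleotideError):
            generate_reverse_complement("ATXG")

    def test_empty_sequence(self):
        self.assertEqual(generate_reverse_complement(""), "")

    def test_single_base(self):
        self.assertEqual(generate_reverse_complement("A"), "T")

    def test_iupac_degenerate_codes(self):
        self.assertEqual(generate_reverse_complement("ARYK"), "MRYT")

    def test_long_sequence_round_trip(self):
        seq = "ATGC" * 25_000  # 100,000 bases
        result = generate_reverse_complement(seq)
        self.assertEqual(len(result), len(seq))
        # Applying the operation twice must restore the original sequence.
        self.assertEqual(generate_reverse_complement(result), seq)
```

Saved as a module, these tests can be run with python -m unittest.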
