Skip to content

Latest commit

 

History

History
116 lines (107 loc) · 4.3 KB

tasks.md

File metadata and controls

116 lines (107 loc) · 4.3 KB

Implementation Tasks

1. Project Setup

  • Create project structure and virtual environment using uv
  • Set up basic CLI framework using argparse
    • Test: Basic CLI argument parsing
  • Create pyproject.toml with dependencies
  • Set up testing framework (pytest)
  • Create initial README.md

2. Data Source Ingestion

  • Implement file format detection (CSV/JSONL)
    • Test: Format detection for CSV and JSONL files
  • Create CSV reader with configurable delimiter
    • Test: Reading CSV with different delimiters (test_scenarios.md #7)
  • Create JSONL reader
    • Test: Reading JSONL files (test_scenarios.md #2)
  • Implement chunked reading for large files
    • Test: Processing large files (test_scenarios.md #3)
  • Add error handling for file access and parsing
    • Test: Invalid file formats and access errors
  • Create data source abstraction layer
    • Test: Common interface for different file types

3. Column Mapping

  • Implement automatic column mapping
    • Header name similarity matching
      • Test: Basic column name matching (test_scenarios.md #1)
    • Data content similarity analysis
      • Test: Content-based mapping accuracy
  • Create configuration file parser (JSON/YAML)
    • Test: Config file parsing and validation
  • Implement manual column mapping via config
    • Test: Custom mapping configurations
  • Add validation for mapping configuration
    • Test: Invalid mapping scenarios

4. Unique Identifier Handling

  • Implement automatic ID column detection
    • Column name analysis ("id", "key", etc.)
      • Test: ID column detection (test_scenarios.md #3)
    • Data uniqueness analysis
      • Test: Uniqueness validation
  • Add manual ID column specification
    • Test: Custom ID column configuration
  • Implement ID column validation
    • Test: Invalid ID columns
  • Add duplicate ID detection
    • Test: Duplicate ID handling

5. Comparison Engine

  • Implement row-level comparison
    • Find rows unique to source 1
      • Test: Unique row detection (test_scenarios.md #1, #2)
    • Find rows unique to source 2
      • Test: Unique row detection (test_scenarios.md #1, #2)
    • Detect rows with matching IDs but different values
      • Test: Value difference detection (test_scenarios.md #2)
  • Implement column-level comparison
    • Calculate matching value percentages
      • Test: Similarity calculations
    • Identify columns with highest/lowest similarity
      • Test: Column similarity ranking
  • Add support for case-insensitive comparison
    • Test: Case sensitivity handling (test_scenarios.md #6)
  • Add support for string trimming
    • Test: String trimming functionality
  • Implement column selection/exclusion
    • Test: Column filtering

6. Output Generation

  • Create summary report generator
    • Row count differences
      • Test: Summary statistics accuracy
    • Column similarity statistics
      • Test: Statistical calculations
  • Implement detailed diff generation
    • Colorized console output
      • Test: Console formatting
    • Side-by-side comparison
      • Test: Comparison display format
  • Add output format handlers
    • Console output formatter
      • Test: Console output formatting
    • CSV output formatter
      • Test: CSV output generation
    • JSON output formatter
      • Test: JSON output generation
  • Implement drill-down queries
    • Show unique rows by source
      • Test: Row filtering
    • Show differences for specific IDs
      • Test: ID-based filtering

7. Performance Optimization

  • Implement memory-efficient processing
    • Test: Memory usage with large datasets
  • Add progress indicators for large files
    • Test: Progress reporting accuracy
  • Optimize comparison algorithms
    • Test: Performance benchmarks
  • Add performance benchmarking
    • Test: Benchmark suite execution

8. Documentation

  • Write detailed API documentation
  • Create user guide with examples
  • Add command-line help text
  • Document configuration file format
  • Add contributing guidelines

9. Final Steps

  • Code cleanup and refactoring
  • Error message improvements
  • Final performance tuning
  • Release preparation