Data tools cleanup: consolidate read methods + chunking logic #47

@aleenaharoldpeter

Hey @Vaibhav2154,

I’ve attached the refactored modules here:
cli.py
data_converter.py
data_processor.py
data_reader.py
data_writer.py

My initial refactor in #26 (PR #30) followed the existing structure fairly closely, just modernizing it with pandas. After revisiting it, I realized we can simplify further:

Chunked reading was previously handled differently in DataConverter and DataProcessor — it's now unified under auto_read, giving consistent behavior and making it easier to maintain.

auto_read checks the file extension and calls the appropriate explicit read function for that format.
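To illustrate the idea, here's a minimal sketch of how such an extension-based dispatcher could look (the reader table and the `chunksize` parameter are my assumptions for illustration, not the actual data_reader.py implementation):

```python
import os
import pandas as pd

# Hypothetical mapping from extension to an explicit reader function.
# The real data_reader.py may support more formats than shown here.
_READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
}

def auto_read(path, chunksize=None):
    """Dispatch to the reader for `path` based on its extension.

    If `chunksize` is given for a CSV, return an iterator of DataFrame
    chunks instead of a single DataFrame (pandas' built-in chunking).
    """
    ext = os.path.splitext(path)[1].lower()
    try:
        reader = _READERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported file format: {ext}")
    if chunksize is not None and ext == ".csv":
        # pd.read_csv returns a TextFileReader (iterator of DataFrames)
        # when chunksize is set, so callers can stream large files.
        return reader(path, chunksize=chunksize)
    return reader(path)
```

With this shape, DataConverter and DataProcessor both call `auto_read` and get identical chunking semantics, instead of each rolling its own.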

This is a small follow-up aimed at clarity, consistency, and reduced duplication, so no extra work should be needed.

I’ve organized data_tools into four modules:
DataReader → only read functions (with auto_read as the main entry point)
DataWriter → only write functions
DataProcessor → cleaning/enhancement functions
DataConverter → format converters

Quick thought — since we’re centralizing read methods, do you think it might make sense to extend this approach to all data ingestion (PDF/DOCX/Image extraction)? That way, everything that reads or parses data lives in data_tools, and scripts/ML code can focus on main logic. Would love your take on whether that’s a good idea or overkill 🙂

Also, just a heads-up — if these changes are applied, they could affect the read functionality from issue #44, which is now merged and closed. I wanted to give you the option to close this ticket if you think this refactor would be too much or unnecessary.
