Data tools cleanup: consolidate read methods + chunking logic #47

@aleenaharoldpeter

Hey @Vaibhav2154,

I’ve attached the refactored modules here:
cli.py
data_converter.py
data_processor.py
data_reader.py
data_writer.py

My initial refactor in #26 (PR #30) followed the existing structure fairly closely, just modernizing it with pandas. After revisiting it, I realized we can simplify further:

Chunked reading was previously handled differently in DataConverter and DataProcessor — it's now unified under auto_read, giving consistent behavior and making it easier to maintain.

auto_read checks the file extension and calls the appropriate explicit read function for that format.
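To illustrate the idea, here's a minimal sketch of how such an extension-based dispatcher could look (the reader table and the `chunksize` parameter are my assumptions for illustration, not the actual data_reader.py implementation):

```python
import os
import pandas as pd

# Hypothetical mapping from extension to an explicit reader function.
# The real data_reader.py may support more formats than shown here.
_READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
}

def auto_read(path, chunksize=None):
    """Dispatch to the reader for `path` based on its extension.

    If `chunksize` is given for a CSV, return an iterator of DataFrame
    chunks instead of a single DataFrame (pandas' built-in chunking).
    """
    ext = os.path.splitext(path)[1].lower()
    try:
        reader = _READERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported file format: {ext}")
    if chunksize is not None and ext == ".csv":
        # pd.read_csv returns a TextFileReader (iterator of DataFrames)
        # when chunksize is set, so callers can stream large files.
        return reader(path, chunksize=chunksize)
    return reader(path)
```

With this shape, DataConverter and DataProcessor both call `auto_read` and get identical chunking semantics, instead of each rolling its own.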

This is a small follow-up aimed at clarity, consistency, and reduced duplication, so no extra work should be needed.

I’ve organized data_tools into four modules:
DataReader → only read functions (with auto_read as the main entry point)
DataWriter → only write functions
DataProcessor → cleaning/enhancement functions
DataConverter → format converters

Quick thought — since we’re centralizing read methods, do you think it might make sense to extend this approach to all data ingestion (PDF/DOCX/Image extraction)? That way, everything that reads or parses data lives in data_tools, and scripts/ML code can focus on main logic. Would love your take on whether that’s a good idea or overkill 🙂

Also, just a heads-up — if these changes are applied, they could affect the read functionality from issue #44, which is now merged and closed. I wanted to give you the option to close this ticket if you think this refactor would be too much or unnecessary.
