Hey @Vaibhav2154,
I’ve attached the refactored modules here:
cli.py
data_converter.py
data_processor.py
data_reader.py
data_writer.py
My initial refactor in #26 (PR #30) followed the existing structure pretty closely, just modernizing it with pandas. After revisiting it, I realized we can simplify further:
Chunked reading was handled differently in DataConverter and DataProcessor; now it’s unified under auto_read for consistent behavior and easier maintenance.
auto_read will check the file extension and call the appropriate explicit function for that format.
This is a small follow-up aimed at clarity, consistency, and reduced duplication, so no extra work should be needed.
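For reference, here’s a rough sketch of how the extension-based dispatch in auto_read could look. The explicit reader names (read_csv, read_json, read_excel), the _READERS mapping, and the chunksize pass-through are assumptions for illustration, not necessarily what the attached data_reader.py does.

```python
from pathlib import Path

import pandas as pd


# Hypothetical explicit readers; the real names in data_reader.py may differ.
def read_csv(path, chunksize=None, **kwargs):
    # With chunksize set, pandas returns an iterator of DataFrames
    # instead of a single DataFrame.
    return pd.read_csv(path, chunksize=chunksize, **kwargs)


def read_json(path, **kwargs):
    return pd.read_json(path, **kwargs)


def read_excel(path, **kwargs):
    return pd.read_excel(path, **kwargs)


# Assumed mapping from file extension to explicit reader.
_READERS = {
    ".csv": read_csv,
    ".json": read_json,
    ".xlsx": read_excel,
}


def auto_read(path, **kwargs):
    """Pick a reader based on the file extension and delegate to it."""
    suffix = Path(path).suffix.lower()
    try:
        reader = _READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None
    return reader(path, **kwargs)
```

With something like this, callers only need `auto_read("data.csv")` or `auto_read("data.csv", chunksize=10_000)`, and chunking lives in one place instead of being reimplemented in DataConverter and DataProcessor.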
I’ve organized data_tools into four modules:
DataReader → only read functions (with auto_read as the main entry point)
DataWriter → only write functions
DataProcessor → cleaning/enhancement functions
DataConverter → format converters
Quick thought: since we’re centralizing read methods, do you think it would make sense to extend this approach to all data ingestion (PDF/DOCX/image extraction)? That way, everything that reads or parses data lives in data_tools, and scripts/ML code can focus on the main logic. Would love your take on whether that’s a good idea or overkill 🙂
Also, just a heads-up — if these changes were applied, they could affect the read functionality in issue #44, which is now merged and closed. I wanted to give you the choice to close this ticket if you think this refactor would be too much or unnecessary.