Changes in code to aggregate single files together#2

Open
dyutivartak wants to merge 1 commit into main from dyuti-develop


Description

Updated 02-aggregate_data.py to correctly aggregate CSV files from the nested output/ directory structure. The script now recursively searches through all date-stamped subdirectories (us-census-data-YYYYMMDD/imports/ and us-census-data-YYYYMMDD/exports/) and combines all CSV files into two consolidated datasets.

Key Changes:

  • Replaced flat directory reading with recursive file discovery via glob.glob()
  • Updated input paths from Raw_Data/imports_all_files/ and Raw_Data/exports_all_files/ to output/*/imports and output/*/exports
  • Changed output paths to output/imports_combined.csv and output/exports_combined.csv
  • Added progress indicators showing file processing status (every 100 files)
  • Added error handling to skip invalid or empty files gracefully
  • Added filtering to exclude request_log.csv files from aggregation
  • Improved console output with formatted sections and record counts
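For reviewers, the skip, error-handling, and progress behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code in 02-aggregate_data.py: the function name `combine_csv_files`, its parameters, and the assumption that every input shares the header of the first valid file are all hypothetical.

```python
import csv
import os


def combine_csv_files(paths, out_path, progress_every=100):
    """Append the data rows of every readable CSV in `paths` to `out_path`.

    Hypothetical helper: assumes all inputs share the header of the first
    valid file. Returns (files_combined, records_written).
    """
    header = None
    files_combined = 0
    records = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for index, path in enumerate(paths, start=1):
            # Progress indicator, printed once every `progress_every` files.
            if index % progress_every == 0:
                print(f"Progress: {index} files visited")
            # Request logs are bookkeeping, not data, so leave them out.
            if os.path.basename(path) == "request_log.csv":
                continue
            try:
                with open(path, newline="", encoding="utf-8") as src:
                    rows = list(csv.reader(src))
            except (OSError, UnicodeDecodeError, csv.Error) as exc:
                print(f"Skipping {path}: {exc}")
                continue
            # Skip files that are empty or contain only a header row.
            if len(rows) < 2:
                continue
            if header is None:
                header = rows[0]
                writer.writerow(header)
            writer.writerows(rows[1:])
            files_combined += 1
            records += len(rows) - 1
    return files_combined, records
```

Under these assumptions, passing the list of discovered import files and output/imports_combined.csv as `out_path` would return the number of files combined and the total record count, which is what the console summary reports.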

Motivation and Context (link issue)

The previous implementation expected CSV files in flat directories (Raw_Data/imports_all_files/ and Raw_Data/exports_all_files/), but 01-request_data.py actually stores its files in a nested hierarchy:
output/
└── us-census-data-YYYYMMDD/
    ├── imports/
    │   └── YYYY/
    │       └── .csv
    └── exports/
        └── YYYY/
            └── .csv

This change ensures the aggregation script correctly processes all data files regardless of their location in the nested directory structure.
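As an illustration of how a recursive glob pattern covers this layout, here is a small sketch. The helper name `find_data_files` and its arguments are hypothetical and may not match the script; the only library behavior relied on is that `**` matches zero or more directory levels when `recursive=True` is passed to `glob.glob()`.

```python
import glob
import os


def find_data_files(root, kind):
    """Return sorted CSV paths under root/<run-dir>/<kind>/ at any depth.

    Hypothetical helper: `kind` is expected to be "imports" or "exports".
    """
    # "*" matches each date-stamped run directory; "**" descends through
    # the per-year subdirectories (or none at all) below imports/exports.
    pattern = os.path.join(root, "*", kind, "**", "*.csv")
    return sorted(glob.glob(pattern, recursive=True))
```

Because the pattern requires a run directory between the root and imports/ or exports/, combined files written directly into output/ are not picked up again on a later run.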

How Has This Been Tested?

Test Results:

  • Successfully processed 7,769 import CSV files into output/imports_combined.csv (1,470,643 records, 230 MB)
  • Successfully processed 2,726 export CSV files into output/exports_combined.csv (1,376,889 records, 212 MB)
  • Verified that request_log.csv files are correctly excluded
  • Confirmed progress indicators display correctly during processing
  • Verified output files are created with proper formatting and record counts

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
