feat: add glob-based file inclusion and exclusion filtering with dry run support#41
feat: add glob-based file inclusion and exclusion filtering with dry run support#41kulnor wants to merge 8 commits intoMIT-LCP:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds glob-based file filtering to Croissant Maker’s discovery pipeline and exposes it via the CLI, including a dry-run mode to preview selected files.
Changes:
- Extend
discover_files()to supportinclude_patterns/exclude_patternsfiltering. - Wire include/exclude options through CLI →
MetadataGenerator→ discovery. - Add
--dry-runmode and document file filtering behavior; add unit tests for filtering.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/croissant_maker/files.py |
Adds glob-based include/exclude filtering on discovered relative paths. |
src/croissant_maker/__main__.py |
Introduces --include/--exclude options and --dry-run early-exit behavior; passes patterns into generation. |
src/croissant_maker/metadata_generator.py |
Stores include/exclude patterns and applies them during discovery. |
tests/test_files.py |
Adds unit tests validating include/exclude filtering behavior. |
docs/file_filtering.md |
New documentation describing filtering and dry-run usage. |
Comments suppressed due to low confidence (1)
docs/file_filtering.md:40
- Docs refer to an early-exit flag
--list-files, but the CLI option implemented in this PR is--dry-run. Update the documentation to use the correct flag name (and ensure wording matches the CLI help) to avoid confusion.
### CLI (`croissant_maker.__main__`)
The CLI uses `typer` to handle multi-value options for `--include` and `--exclude`. It also implements the early-exit logic for `--list-files` to provide a fast "dry-run" experience.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…e dry run logic, and bump version to 0.1.0
|
Patches issues #43 |
rafiattrach
left a comment
There was a problem hiding this comment.
Really nice addition, glob filtering and dry-run are genuinely useful features, thank you!
Two small things before merging listed below:
|
|
||
| if not output: | ||
| output = _get_default_output_name(input) | ||
| typer.echo(f"Auto-generated output filename: {output}") |
There was a problem hiding this comment.
This message prints even when --dry-run is passed: no file is actually created so it's misleading. The if not output: block (lines 120–122) should be guarded with and not dry_run for example so it's skipped during dry runs.
There was a problem hiding this comment.
Good catch. Just pushed the fix.
docs/file_filtering.md
Outdated
| The `MetadataGenerator` class stores the include/exclude patterns and passes them to the discovery utility. This ensures that the generated `distribution` (FileObjects) and `recordSet` (RecordSets) only contain the filtered subset. | ||
|
|
||
| ### CLI (`croissant_maker.__main__`) | ||
| The CLI uses `typer` to handle multi-value options for `--include` and `--exclude`. It also implements the early-exit logic for `--list-files` to provide a fast "dry-run" experience. |
There was a problem hiding this comment.
Small nit: --list-files on last line was never an actual CLI flag, it leaked from the original internal variable name list_files. More broadly, per-feature .md
files tend to be really hard to maintain without a full docs pipeline to validate them — this PR is actually a great example of how quickly they can drift! I'd suggest
holding off on this pattern until we have a full docs site with proper CI validation. Really appreciate the addition though!
There was a problem hiding this comment.
I deleted the file for now, and this is for sure an interesting point. I usually ask the agents to update the documentation as needed. But it is easy to miss one. Maybe should have this integrated in my precommit habits.
|
@rafiattrach let me know if you need anything else to approve this PR so I can bring my fork up to date. |
Thanks for the print fix! however for the docs, did you mean to remove |
|
@rafiattrach Oops, I did indeed delete the wrong file... Just fixed this (restored the |
rafiattrach
left a comment
There was a problem hiding this comment.
Thanks @kulnor! Looks great! Could you just rebase and possibly remove inline import if it already exists at the top? (small comment below for the exact line)
|
|
||
| # If just listing files, output and exit | ||
| if dry_run: | ||
| from croissant_maker.handlers.registry import ( |
There was a problem hiding this comment.
one more thing I noticed: I believe this is already imported at the top of the file, do we need this again as inline import?
There was a problem hiding this comment.
I just accepted the rebase pull request. I don't see these imports at the top, so I didn't make any other changes.
Now it looks like we have conflicts, and this is where my git skills are starting to fall apart (merge conflict in PR). Is this something you can help with or provide guidance?
…ponents--croissant-maker chore(main): release croissant-maker 0.2.0
Processing all supported data files in the directory tree is often not desirable.
To provide greater control over which data files are processed, I added repeatable
--includeand--excludeoptions that use glob patterns for filtering. A --dry-run option can be used to preview the selected files.See the docs/file_filtering.md for details