[Feature] Add iFDO header metadata support with context-aware generation and auto-deduplication #32
Summary
This PR implements iFDO header metadata support, allowing pipelines to control header fields at the dataset, pipeline, and collection levels while auto-deduplicating common fields. A new get_metadata_header() method in BasePipeline returns generic metadata that is interpreted by the schema classes (iFDOMetadata, GenericMetadata). Fields shared by every image are automatically promoted to the header, reducing metadata repetition. Smart defaults generate hierarchical names when none are provided. A priority system ensures user metadata > smart defaults > auto-deduplicated fields. The feature is fully backward compatible with existing pipelines.

Problem
Previously, all iFDO files used the same image-set-name regardless of granularity level (dataset/pipeline/collection). Common fields were repeated in every image item, causing file bloat. Pipelines could not provide context-specific metadata, and no automatic optimization existed to reduce metadata file sizes when images shared identical field values.

Solution
Added a get_metadata_header(context, collection_config) method to BasePipeline that returns generic metadata dictionaries. Enhanced iFDOMetadata and GenericMetadata with deduplication algorithms that identify common fields and promote them to the header. Smart defaults generate hierarchical names when custom names are not provided. Updated DatasetWrapper and ProjectWrapper to pass pipeline instances and collection configs through the metadata pipeline.

Design
Clean separation: BasePipeline provides schema-agnostic metadata; the schema classes interpret it. Three-tier priority: user metadata > smart defaults > auto-deduplication. A single-pass deduplication algorithm identifies common fields efficiently. Backward compatibility is preserved via a default get_metadata_header() that returns an empty dict. DatasetWrapper parses collection names to determine context and looks up the appropriate pipeline instance; for dataset-level metadata, it finds pipelines with overridden methods. Comprehensive debug logging throughout.

Impact
Reduces iFDO file sizes by up to 90% for datasets with common metadata. Enables semantically correct names for different metadata file levels. Existing pipelines work unchanged while benefiting from improved defaults. New pipelines can optionally customize metadata for richer context.
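As a sketch of how a pipeline might opt in: the method name get_metadata_header and its (context, collection_config) parameters come from this PR, but the class below, the context values, and the returned keys are illustrative assumptions, not the actual Marimba API surface.

```python
# Hypothetical pipeline override; only get_metadata_header(context, collection_config)
# is from the PR -- the context values and returned keys here are illustrative.
class SurveyPipeline:
    """Stand-in for a BasePipeline subclass."""

    def get_metadata_header(self, context=None, collection_config=None):
        # Generic (schema-agnostic) metadata; the schema classes map these keys.
        header = {"abstract": "Benthic imagery survey"}
        if context == "collection" and collection_config:
            # Context-specific name, e.g. one per deployment/site.
            header["name"] = f"SURVEY-{collection_config.get('site', 'unknown')}"
        return header
```

Values returned here sit at the top of the priority order, overriding both smart defaults and auto-deduplicated fields.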
Testing
Added 20 unit tests in tests/core/schemas/test_ifdo_header.py covering deduplication edge cases, header building priorities, smart defaults, and collection name parsing. Added 2 tests in tests/core/test_pipeline.py for default/override behavior. Updated tests in test_generic.py, test_ifdo.py, test_base.py, and test_darwin.py for the new structure. All 901 tests pass. Validated with a demo dataset showing correct context-specific names and deduplication for both the iFDO and Generic schemas.

Documentation
Updated docs/pipeline.md with 103 lines documenting the get_metadata_header() method, context parameter usage, example implementations, and the generic-to-iFDO key mapping. Added comprehensive docstrings to all new methods. Included inline comments explaining the priority order and context detection logic.

Breaking Changes
None. The feature is designed to be fully backward compatible. New parameters have default values, and the default get_metadata_header() implementation returns an empty dictionary, ensuring existing pipelines work without modification.

Added Files
tests/core/schemas/test_ifdo_header.py: Unit test suite with 20 tests covering deduplication, header building, smart defaults, and collection name parsing.

Modified Files
marimba/core/pipeline.py: Added get_metadata_header() method with default implementation.
marimba/core/schemas/base.py: Updated create_dataset_metadata() signature with pipeline_instance, context, and collection_config parameters.
marimba/core/schemas/ifdo.py: Added IFDO_VERSION constant; implemented header building with the priority system, deduplication algorithm, smart defaults, and generic-to-iFDO mapping.
marimba/core/schemas/generic.py: Implemented deduplication, header building with priorities, and item deduplication.
marimba/core/wrappers/dataset.py: Added pipeline_instances and collection_configs parameters, smart lookup logic, and context detection.
marimba/core/wrappers/project.py: Built pipeline_instances and collection_configs mappings and passed them to DatasetWrapper.
docs/pipeline.md: Added 103 lines documenting the new method with examples and mapping tables.
tests/core/schemas/test_base.py: Updated fixtures for the new interface.
tests/core/schemas/test_darwin.py: Updated fixtures for the new interface.
tests/core/schemas/test_generic.py: Updated assertions for the header/items structure and deduplication.
tests/core/schemas/test_ifdo.py: Updated test to use two images with different altitudes; imported IFDO_VERSION.
tests/core/test_pipeline.py: Added tests for the default implementation and override behavior.

Additional Notes
Dataset-level metadata intelligently selects pipelines with overridden get_metadata_header() methods. The deduplication algorithm has no hardcoded exclusions: any field that is identical across all images is deduplicated. Validated end-to-end with real demo data. All pre-commit checks pass (Ruff, Black, Mypy, Bandit).

Notes to Reviewer
Review the smart lookup logic in DatasetWrapper._create_metadata_for_type() for context detection. Verify the deduplication algorithms handle edge cases (empty datasets, video frames). Check that the priority system ensures user metadata overrides defaults. Examine the demo output in temp/mritc-demo/datasets/IN2018_V06/ for correct names and deduplication. Confirm backward compatibility with the default implementation.
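For reviewers checking the edge cases above, the single-pass deduplication this PR describes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Marimba implementation; the function and variable names are hypothetical, and the empty-dataset edge case mentioned above falls out of the initial guard.

```python
from typing import Any


def deduplicate_fields(
    items: dict[str, dict[str, Any]],
) -> tuple[dict[str, Any], dict[str, dict[str, Any]]]:
    """Promote fields identical across all items to a shared header.

    Single pass over the items: the candidate set only shrinks, and an
    empty dataset yields an empty header (one of the edge cases above).
    """
    it = iter(items.values())
    first = next(it, None)
    if first is None:
        return {}, {}
    common = dict(first)  # candidate header fields, seeded from the first item
    for fields in it:
        # Keep only candidates present with an identical value in this item.
        common = {k: v for k, v in common.items() if k in fields and fields[k] == v}
    stripped = {
        name: {k: v for k, v in fields.items() if k not in common}
        for name, fields in items.items()
    }
    return common, stripped
```

With no hardcoded exclusions, any field that happens to match everywhere is promoted, which is the behavior the Additional Notes section calls out.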