
Conversation

@cjackett
Contributor

Summary

This PR implements iFDO header metadata support allowing pipelines to control header fields at dataset, pipeline, and collection levels while auto-deduplicating common fields. A new get_metadata_header() method in BasePipeline returns generic metadata interpreted by schema classes (iFDOMetadata, GenericMetadata). Common fields across images are automatically promoted to headers, reducing metadata repetition. Smart defaults generate hierarchical names when not provided. A priority system ensures user metadata > smart defaults > auto-deduplicated fields. Fully backward compatible with existing pipelines.
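The three-tier priority order can be sketched with plain dict merging, where later sources override earlier ones. This is an illustrative sketch of the idea, not the actual Marimba implementation; the function name and arguments are hypothetical.

```python
from typing import Any


def build_header(
    deduped: dict[str, Any],
    defaults: dict[str, Any],
    user: dict[str, Any],
) -> dict[str, Any]:
    # Later dicts win: user metadata > smart defaults > auto-deduplicated fields.
    return {**deduped, **defaults, **user}
```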

Problem

All iFDO files used the same image-set-name regardless of granularity level (dataset/pipeline/collection). Common fields were repeated in every image item, bloating files. Pipelines couldn't provide context-specific metadata, and no automatic optimization existed to reduce metadata file sizes when images shared identical field values.

Solution

Added get_metadata_header(context, collection_config) method to BasePipeline returning generic metadata dictionaries. Enhanced iFDOMetadata and GenericMetadata with deduplication algorithms that identify common fields and promote them to headers. Smart defaults generate hierarchical names when custom names aren't provided. Updated DatasetWrapper and ProjectWrapper to pass pipeline instances and collection configs through the metadata pipeline.

Design

Clean separation: BasePipeline provides schema-agnostic metadata, and schema classes interpret it. Three-tier priority: user metadata > smart defaults > auto-deduplication. A single-pass deduplication algorithm identifies common fields efficiently. Backward compatible via a default get_metadata_header() that returns an empty dict. DatasetWrapper parses collection names to determine context and looks up the appropriate pipeline instance; for dataset-level metadata, it selects pipelines that override get_metadata_header(). Comprehensive debug logging throughout.
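The single-pass deduplication idea can be sketched as follows. This is a simplified illustration under the assumption that items are flat dicts keyed by filename; the real implementation in the schema classes may differ.

```python
from typing import Any


def deduplicate(
    items: dict[str, dict[str, Any]],
) -> tuple[dict[str, Any], dict[str, dict[str, Any]]]:
    """Promote fields identical across every item into a shared header."""
    iterator = iter(items.values())
    try:
        common = dict(next(iterator))
    except StopIteration:
        return {}, items  # Empty dataset: nothing to promote.
    for item in iterator:
        # Keep only keys present with the same value in every item so far.
        common = {k: v for k, v in common.items() if item.get(k) == v}
    # Strip the promoted fields from each item to reduce repetition.
    slimmed = {
        name: {k: v for k, v in item.items() if k not in common}
        for name, item in items.items()
    }
    return common, slimmed
```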

Impact

Reduces iFDO file sizes by up to 90% for datasets with common metadata. Enables semantically correct names for different metadata file levels. Existing pipelines work unchanged while benefiting from improved defaults. New pipelines can optionally customize metadata for richer context.

Testing

Added 20 unit tests in tests/core/schemas/test_ifdo_header.py covering deduplication edge cases, header building priorities, smart defaults, and collection name parsing. Added 2 tests in tests/core/test_pipeline.py for default/override behavior. Updated tests in test_generic.py, test_ifdo.py, test_base.py, and test_darwin.py for new structure. All 901 tests pass. Validated with demo dataset showing correct context-specific names and deduplication for both iFDO and Generic schemas.

Documentation

Updated docs/pipeline.md with 103 lines documenting get_metadata_header() method, context parameter usage, example implementations, and generic-to-iFDO key mapping. Added comprehensive docstrings to all new methods. Included inline comments explaining priority order and context detection logic.
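The generic-to-iFDO key mapping mentioned above might look roughly like this. The mapping table lives in docs/pipeline.md; the pairs and function name below are illustrative assumptions, not the documented mapping.

```python
from typing import Any

# Hypothetical subset of a generic-to-iFDO key mapping.
GENERIC_TO_IFDO = {
    "name": "image-set-name",
    "creator": "image-creators",
    "license": "image-license",
}


def map_to_ifdo_header(user_metadata: dict[str, Any]) -> dict[str, Any]:
    # Translate recognized generic keys; unknown keys are dropped here
    # (the real implementation may handle them differently).
    return {
        GENERIC_TO_IFDO[k]: v
        for k, v in user_metadata.items()
        if k in GENERIC_TO_IFDO
    }
```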

Breaking Changes

None. The feature is designed to be fully backward compatible. New parameters have default values, and the default get_metadata_header() implementation returns an empty dictionary, ensuring existing pipelines work without modification.

Added Files

  • tests/core/schemas/test_ifdo_header.py: Unit test suite with 20 tests covering deduplication, header building, smart defaults, and collection name parsing.

Modified Files

  • marimba/core/pipeline.py: Added get_metadata_header() method with default implementation.
  • marimba/core/schemas/base.py: Updated create_dataset_metadata() signature with pipeline_instance, context, collection_config parameters.
  • marimba/core/schemas/ifdo.py: Added IFDO_VERSION constant, implemented header building with priority system, deduplication algorithm, smart defaults, and generic-to-iFDO mapping.
  • marimba/core/schemas/generic.py: Implemented deduplication, header building with priorities, and item deduplication.
  • marimba/core/wrappers/dataset.py: Added pipeline_instances and collection_configs parameters, smart lookup logic, context detection.
  • marimba/core/wrappers/project.py: Built pipeline_instances and collection_configs mappings, passed to DatasetWrapper.
  • docs/pipeline.md: Added 103 lines documenting new method with examples and mapping tables.
  • tests/core/schemas/test_base.py: Updated fixtures for new interface.
  • tests/core/schemas/test_darwin.py: Updated fixtures for new interface.
  • tests/core/schemas/test_generic.py: Updated assertions for header/items structure and deduplication.
  • tests/core/schemas/test_ifdo.py: Updated test to use two images with different altitudes, imported IFDO_VERSION.
  • tests/core/test_pipeline.py: Added tests for default implementation and override behavior.

Additional Notes

Dataset-level metadata intelligently selects pipelines with an overridden get_metadata_header() method. The deduplication algorithm has no hardcoded exclusions: any field that is identical across all items is deduplicated. Validated end-to-end with real demo data. All pre-commit checks pass (Ruff, Black, Mypy, Bandit).

Notes to Reviewer

Review smart lookup logic in DatasetWrapper._create_metadata_for_type() for context detection. Verify deduplication algorithms handle edge cases (empty datasets, video frames). Check priority system ensures user metadata overrides defaults. Examine demo output in temp/mritc-demo/datasets/IN2018_V06/ for correct names and deduplication. Confirm backward compatibility with default implementation.

…to-deduplication

This commit implements a comprehensive iFDO header metadata feature that allows pipelines to control header fields at dataset, pipeline, and collection levels while automatically deduplicating common image fields to reduce file size.
@GermanHydrogen
Collaborator

I do not understand your approach of developing this feature to be metadata-standard agnostic (or your critique of PR #31), as pipelines are already tightly coupled to a metadata standard by returning a metadata-standard-specific implementation of the BaseMetadata class.

The compromise of using an opinionated universal metadata standard restricts _map_user_metadata_to_header to a small subset of fields from the perspective of an iFDO user; it currently does not cover all required fields in the iFDO standard.
It also couples Marimba's version directly to the iFDO version. Since iFDOMetadata was just a thin wrapper around the ImageData class, the Marimba version was fairly independent of the version of the ifdo-py package. This weak coupling could have enabled a feature in which a pipeline or user requests a specific iFDO version, better supporting the constantly evolving schema.

Accepting these points, the main problem with your solution is the return type of get_metadata_header. The weak typing of dict[str, Any] does not catch user error in the pipeline implementation, as the keys can have typos and the values can have the wrong type. These mistakes only cause exceptions late in packaging, or not at all, which could make debugging a bit annoying for users.
The risk of typos could be mitigated by using an Enum instead of raw strings: typos would be caught by a type checker, and IDEs can offer autocomplete. This would also improve documentation, as the allowed values are defined directly by the Enum, eliminating the need to document them in multiple places in the repository.
The weak typing of the values could only be fixed by introducing a dataclass instead.
