Contributor

@AlbertvanHouten AlbertvanHouten commented Aug 27, 2025

This pull request introduces significant updates to the OTX data pipeline, focusing on integrating the new experimental Datumaro Dataset and related APIs for classification tasks. The changes add new base and classification dataset classes, update the dataset factory and sampler logic to support both legacy and experimental Datumaro datasets, and refactor sample entities for better compatibility.

Experimental Datumaro integration and new dataset classes:

  • Added OTXDataset and OTXMulticlassClsDataset classes in base_new.py and classification_new.py, respectively, to support the new Datumaro experimental Dataset API for classification tasks. These classes include new batching and sample handling logic. [1] [2]
  • Introduced new sample entity classes (OTXSample, ClassificationSample) in sample.py to standardize sample structure and conversion from Datumaro items for classification.

Factory and sampler logic updates:

  • Refactored OTXDatasetFactory in factory.py to instantiate datasets using either legacy or new Datumaro datasets, including conversion logic and label category extraction for the experimental API. [1] [2] [3] [4]
  • Updated BalancedSampler in balanced_sampler.py to support both legacy and new dataset types, with logic to extract class indices appropriately. [1] [2] [3]

Compatibility and bug fixes:

  • Updated _dispatch_label_info in models/base.py to handle both legacy and new LabelCategories types for label info dispatching. [1] [2]
  • Added support for the mps device type in GPU memory monitoring callbacks for broader hardware compatibility.

Testing and minor fixes:

  • Updated classification dataset tests to use the new sample entity (ClassificationSample) and dataset class (OTXMulticlassClsDataset). [1] [2] [3]

Other minor changes:

  • Changed the datumaro dependency in pyproject.toml to the develop branch for OTX integration.
  • Minor update to image conversion utility and class statistics logic for improved robustness. [1] [2]

Summary

How to test

Checklist

  • I have added unit tests to cover my changes.
  • I have added integration tests to cover my changes.
  • I have run e2e tests and there are no issues.
  • I have added the description of my changes into CHANGELOG in my target branch (e.g., CHANGELOG in develop).
  • I have updated the documentation in my target branch accordingly (e.g., documentation in develop).
  • I have linked related issues.

License

  • I submit my code changes under the same Apache License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

@github-actions github-actions bot added the TEST label Aug 27, 2025

github-actions bot commented Aug 27, 2025

Docker Image Sizes

Image                            Size
geti-tune-backend-pr-4606        1.2G
geti-tune-backend-sha-3dfe574    1.2G
geti-tune-ui-pr-4606             50M
geti-tune-ui-sha-3dfe574         50M

@github-actions github-actions bot added the DEPENDENCY and BUILD labels Aug 28, 2025
@github-actions github-actions bot removed the DEPENDENCY, TEST, and BUILD labels Sep 1, 2025
@github-actions github-actions bot added the TEST label Sep 2, 2025
@AlbertvanHouten AlbertvanHouten changed the title [WIP] Classification Dataset refactor Classification Dataset refactor Sep 2, 2025
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
…omments.

Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
@AlbertvanHouten AlbertvanHouten changed the base branch from develop to feature/datumaro September 3, 2025 08:13
@AlbertvanHouten AlbertvanHouten marked this pull request as ready for review September 3, 2025 08:13
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
@AlbertvanHouten AlbertvanHouten removed the TEST label Sep 3, 2025
gdlg previously approved these changes Sep 3, 2025
Contributor

@gdlg gdlg left a comment

LGTM, thanks Albert!

  ]
  dependencies = [
-     "datumaro==1.10.0",
+     "datumaro @ git+https://github.com/open-edge-platform/datumaro.git@develop",
Contributor

Note: before the feature branch (feature/datumaro) can be merged to develop, this needs to be updated to point to an official datumaro release.

    @property
    def masks(self) -> Mask | None:
        """Get masks for the sample."""
        return None
Contributor

What are the semantics of these properties (masks, bboxes, ...)? Why do they return None?

Contributor Author

A large portion of the OTX codebase depends on the OTXDataItem. For example, all the transforms in https://github.com/open-edge-platform/training_extensions/blob/8418221b4e065d14b6f2c223f1ee297b80fd64c8/lib/src/otx/data/transform_libs/torchvision.py.

For now, we will mimic that functionality with the OTXSample to not break any logic. In the future, OTXDataItem should be replaced with OTXSample but this will require a larger refactor, including updating all the transforms to not take OTXDataItems as input.
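
The compatibility approach described above can be pictured with a minimal sketch. It is not the PR's actual code: `DemoClassificationSample` and `describe` are hypothetical names, and plain Python values stand in for the real image, mask, and box types. The only point is that the new sample type exposes the attribute names legacy consumers read, returning None for annotations that a classification sample does not carry.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class DemoClassificationSample:
    """Hypothetical stand-in for a classification sample (not the real OTXSample)."""

    image: Any
    label: int

    # Legacy consumers read these attributes on every item, so the sample
    # exposes them and returns None where the task has no such annotation.
    @property
    def masks(self) -> Any | None:
        return None

    @property
    def bboxes(self) -> Any | None:
        return None


def describe(item: Any) -> str:
    """Hypothetical legacy-style consumer that only touches the shared attribute names."""
    parts = [f"label={item.label}"]
    if item.masks is not None:
        parts.append("has masks")
    if item.bboxes is not None:
        parts.append("has bboxes")
    return ", ".join(parts)


sample = DemoClassificationSample(image=[[0, 0], [0, 0]], label=3)
summary = describe(sample)
```

Because the consumer checks for None before using an annotation, the same code path can serve both item types until the transforms are refactored.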

Comment on lines +73 to +74
else:
    return None
Contributor

In what situations is this property expected to return None? Wouldn't it be better to raise an exception if it's not possible to extract image information from the sample?

Contributor Author

It is very hard to track down all usages of OTXDataItem.img_info and to tell whether they would work with None. For now, it is better to maintain the existing logic until we start to factor out OTXDataItem.

"""OTXDataItemSample is a base class for OTX data items."""

image: np.ndarray | tv_tensors.Image = image_field(dtype=pl.UInt8)
label: torch.Tensor = label_field(pl.Int32())
Contributor

If the label is always a torch type, why can image be a numpy type? The double type np.ndarray | tv_tensors.Image may be problematic for the code that consumes this class, because it needs to handle both cases (image as np.ndarray and image as tv_tensors.Image).

Contributor Author

@AlbertvanHouten AlbertvanHouten Sep 3, 2025

This is again existing OTX logic from OTXDataItem. The OTX codebase is already taking the different types into account.

        return OTXDataItem.collate_fn

    def get_idx_list_per_classes(self, use_string_label: bool = False) -> dict[int | str, list[int]]:
        """Compute class statistics."""
Contributor

Please add a better docstring to explain what the method does and what the structure of the returned dictionary is.

Contributor Author

This is copied and pasted from https://github.com/open-edge-platform/training_extensions/pull/4606/files#diff-f29e3ef832fa9ab1f32358061431d30bf261265823142f6093bc460f3effe2a8

I'll update this function doc, but I am not really planning to address all these previously existing issues in the OTX codebase within this feature branch. We should create a task to update the linter ruleset and enforce this throughout the codebase.

        kwargs["sample_type"] = ClassificationSample
        super().__init__(**kwargs)

    def get_idx_list_per_classes(self, use_string_label: bool = False) -> dict[int, list[int]]:
Contributor

This method is already implemented in the parent class OTXDataset. Why does this class need to override it?

Contributor Author

@AlbertvanHouten AlbertvanHouten Sep 3, 2025

I believe the implementation of this method will depend on the sample type. If it turns out to be uniform across sample types, it can be moved back to the new OTXDataset. The new OTXDataset does not implement this method at the moment.

)


class OTXDataset(TorchDataset):
Contributor

If the old and new OTXDataset have the same API, I suggest adding some black-box functional tests that instantiate both classes, apply the same sequence of operations to each, and then assert that the respective results are identical.
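
A parity test of this kind could look roughly like the sketch below. `LegacyDemoDataset`, `NewDemoDataset`, and `run_operations` are hypothetical stand-ins invented for illustration; a real test would instantiate the legacy and new OTX dataset classes from the same source data instead.

```python
from __future__ import annotations

from collections import defaultdict


class LegacyDemoDataset:
    """Hypothetical stand-in for the legacy dataset class."""

    def __init__(self, labels: list[int]) -> None:
        self._labels = list(labels)

    def __len__(self) -> int:
        return len(self._labels)

    def __getitem__(self, index: int) -> dict[str, int]:
        return {"label": self._labels[index]}

    def get_idx_list_per_classes(self) -> dict[int, list[int]]:
        stats: dict[int, list[int]] = defaultdict(list)
        for idx, label in enumerate(self._labels):
            stats[label].append(idx)
        return dict(stats)


class NewDemoDataset(LegacyDemoDataset):
    """Hypothetical stand-in for the new dataset class, with its own implementation."""

    def get_idx_list_per_classes(self) -> dict[int, list[int]]:
        classes = sorted(set(self._labels))
        return {c: [i for i, lbl in enumerate(self._labels) if lbl == c] for c in classes}


def run_operations(dataset) -> dict:
    """Apply the same sequence of public-API calls and collect the results."""
    return {
        "length": len(dataset),
        "items": [dataset[i] for i in range(len(dataset))],
        "idx_per_class": dataset.get_idx_list_per_classes(),
    }


def test_old_and_new_datasets_agree() -> None:
    labels = [0, 1, 0, 2, 1]
    assert run_operations(LegacyDemoDataset(labels)) == run_operations(NewDemoDataset(labels))


test_old_and_new_datasets_agree()
```

Collecting the results through one shared helper keeps the test black-box: it only relies on the public API that both classes are expected to share.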

Co-authored-by: Leonardo Lai <leonardo.lai@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
Contributor

@eugene123tw eugene123tw left a comment

Thanks, Albert, for the refactoring! I just have a few comments, but overall it looks great.

        return OTXDataItem.collate_fn

    def get_idx_list_per_classes(self, use_string_label: bool = False) -> dict[int | str, list[int]]:
        """Get a dictionary with class labels (string/int) as keys and lists of sample indices as values."""
Contributor

What’s the intended use case for use_string_label? Could you also add a Google-style docstring explaining it? For example:

"""Get a dictionary mapping class labels (string or int) to lists of sample indices.

Args:
    use_string_label (bool): If True, use string class labels as keys.  
        If False, use integer indices as keys.
"""
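
To make the effect of the flag concrete, here is a minimal standalone sketch of the mapping that docstring describes. It is an assumption-laden illustration, not the method from this PR: the free function and its `sample_labels` and `label_names` parameters are hypothetical, whereas the real method would read labels from the dataset itself.

```python
from __future__ import annotations

from collections import defaultdict


def get_idx_list_per_classes(
    sample_labels: list[int],
    label_names: list[str],
    use_string_label: bool = False,
) -> dict[int | str, list[int]]:
    """Map each class to the indices of the samples that carry it.

    Args:
        sample_labels: Integer class index of each sample, in dataset order.
        label_names: Class names, indexed by class index.
        use_string_label: If True, key the result by class name;
            if False, key it by integer class index.
    """
    stats: dict[int | str, list[int]] = defaultdict(list)
    for sample_idx, class_idx in enumerate(sample_labels):
        key = label_names[class_idx] if use_string_label else class_idx
        stats[key].append(sample_idx)
    return dict(stats)


by_index = get_idx_list_per_classes([0, 1, 0], ["cat", "dog"])
by_name = get_idx_list_per_classes([0, 1, 0], ["cat", "dog"], use_string_label=True)
```

The two calls return the same index lists and differ only in their key type, which is the distinction the docstring needs to spell out for callers such as a balanced sampler.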

eugene123tw previously approved these changes Sep 3, 2025
…fo step for the new categories class.

Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
@AlbertvanHouten AlbertvanHouten merged commit 9bd3fa7 into feature/datumaro Sep 4, 2025
12 of 15 checks passed
@AlbertvanHouten AlbertvanHouten deleted the albert/new-dataset-support branch September 4, 2025 08:10
4 participants