Classification Dataset refactor #4606

AlbertvanHouten · 2025-08-27T07:50:03Z

This pull request introduces significant updates to the OTX data pipeline, focusing on integrating the new experimental Datumaro Dataset and related APIs for classification tasks. The changes add new base and classification dataset classes, update the dataset factory and sampler logic to support both legacy and experimental Datumaro datasets, and refactor sample entities for better compatibility.

Experimental Datumaro integration and new dataset classes:

Added OTXDataset and OTXMulticlassClsDataset classes in base_new.py and classification_new.py, respectively, to support the new Datumaro experimental Dataset API for classification tasks. These classes include new batching and sample handling logic. [1] [2]
Introduced new sample entity classes (OTXSample, ClassificationSample) in sample.py to standardize sample structure and conversion from Datumaro items for classification.

Factory and sampler logic updates:

Refactored OTXDatasetFactory in factory.py to instantiate datasets using either legacy or new Datumaro datasets, including conversion logic and label category extraction for the experimental API. [1] [2] [3] [4]
Updated BalancedSampler in balanced_sampler.py to support both legacy and new dataset types, with logic to extract class indices appropriately. [1] [2] [3]
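
As a rough illustration of the sampler change, the sketch below shows one way a sampler could extract class indices from either dataset type. `LegacyDatasetSketch`, `NewDatasetSketch`, `item_labels`, and `class_indices` are hypothetical stand-ins invented for this example; only `get_idx_list_per_classes` is a name taken from this PR, and its real implementation may differ.

```python
from __future__ import annotations

from collections import defaultdict


class LegacyDatasetSketch:
    """Hypothetical legacy dataset: labels are read item by item."""

    def __init__(self, labels: list[int]) -> None:
        self.item_labels = labels


class NewDatasetSketch:
    """Hypothetical new dataset: class statistics come from the dataset itself."""

    def __init__(self, labels: list[int]) -> None:
        self._labels = labels

    def get_idx_list_per_classes(self) -> dict[int, list[int]]:
        stats: dict[int, list[int]] = defaultdict(list)
        for idx, label in enumerate(self._labels):
            stats[label].append(idx)
        return dict(stats)


def class_indices(dataset: LegacyDatasetSketch | NewDatasetSketch) -> dict[int, list[int]]:
    """Return {class index: [sample indices]} for either dataset type."""
    if isinstance(dataset, NewDatasetSketch):
        # New-style dataset: delegate to the dataset's own statistics method.
        return dataset.get_idx_list_per_classes()
    if isinstance(dataset, LegacyDatasetSketch):
        # Legacy dataset: walk the items and group sample indices by label.
        stats: dict[int, list[int]] = defaultdict(list)
        for idx, label in enumerate(dataset.item_labels):
            stats[label].append(idx)
        return dict(stats)
    # Anything else is rejected instead of being silently mis-sampled.
    msg = f"Unsupported dataset type: {type(dataset).__name__}"
    raise TypeError(msg)
```

Both branches return the same structure, so the rest of the sampler does not need to know which dataset type it received.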

Compatibility and bug fixes:

Updated _dispatch_label_info in models/base.py to handle both legacy and new LabelCategories types for label info dispatching. [1] [2]
Added support for the mps device type in GPU memory monitoring callbacks for broader hardware compatibility.

Testing and minor fixes:

Updated classification dataset tests to use the new sample entity (ClassificationSample) and dataset class (OTXMulticlassClsDataset). [1] [2] [3]

Other minor changes:

Changed the datumaro dependency in pyproject.toml to the develop branch for OTX integration.
Minor update to image conversion utility and class statistics logic for improved robustness. [1] [2]

Summary

How to test

Checklist

I have added unit tests to cover my changes.
I have added integration tests to cover my changes.
I have run e2e tests and there are no issues.
I have added the description of my changes into CHANGELOG in my target branch (e.g., CHANGELOG in develop).
I have updated the documentation in my target branch accordingly (e.g., documentation in develop).
I have linked related issues.

License

I submit my code changes under the same Apache License that covers the project.
Feel free to contact the maintainers if that's a concern.
I have updated the license header for each file (see an example below).

# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

github-actions · 2025-08-27T08:01:10Z

Docker Image Sizes

Image	Size
geti-tune-backend-pr-4606	1.2G
geti-tune-backend-sha-3dfe574	1.2G
geti-tune-ui-pr-4606	50M
geti-tune-ui-sha-3dfe574	50M


lib/src/otx/data/utils/utils.py

lib/src/otx/data/samplers/balanced_sampler.py

lib/src/otx/data/factory.py

lib/src/otx/backend/native/models/base.py

lib/tests/unit/data/conftest.py


gdlg

LGTM, thanks Albert!

leoll2 · 2025-09-03T08:28:46Z

lib/pyproject.toml

 ]
 dependencies = [
-    "datumaro==1.10.0",
+    "datumaro @ git+https://github.com/open-edge-platform/datumaro.git@develop",


Note: before the feature branch (feature/datumaro) can be merged to develop, this needs to be updated to point to an official datumaro release.

leoll2 · 2025-09-03T08:34:38Z

lib/src/otx/data/entity/sample.py

+    @property
+    def masks(self) -> Mask | None:
+        """Get masks for the sample."""
+        return None


What are the semantics of these properties (masks, bboxes, ...)? Why do they return None?

A large portion of the OTX codebase depends on the OTXDataItem. For example, all the transforms in https://github.com/open-edge-platform/training_extensions/blob/8418221b4e065d14b6f2c223f1ee297b80fd64c8/lib/src/otx/data/transform_libs/torchvision.py.

For now, we will mimic that functionality with the OTXSample to not break any logic. In the future, OTXDataItem should be replaced with OTXSample but this will require a larger refactor, including updating all the transforms to not take OTXDataItems as input.
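
The compatibility approach described here can be pictured with a small sketch: a classification sample exposes the same attribute names as the legacy item, and the ones that do not apply simply return None. `SketchClassificationSample` and `describe` are hypothetical names for illustration only, not the actual OTX classes or transforms.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class SketchClassificationSample:
    """Hypothetical sample that mimics the attribute surface of a legacy data item."""

    image: Any
    label: int

    @property
    def masks(self) -> None:
        """Classification samples carry no masks."""
        return None

    @property
    def bboxes(self) -> None:
        """Classification samples carry no bounding boxes."""
        return None


def describe(item: Any) -> str:
    """Stand-in for a transform written against the legacy item interface."""
    parts = ["image"]
    if item.bboxes is not None:
        parts.append("bboxes")
    if item.masks is not None:
        parts.append("masks")
    return "+".join(parts)
```

Code written against the legacy interface keeps working because the attribute lookups succeed and the None checks skip the fields that do not apply.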

leoll2 · 2025-09-03T08:41:00Z

lib/src/otx/data/entity/sample.py

+            else:
+                return None


In what situations is this property expected to return None? Wouldn't it be better to raise an exception if it's not possible to extract image information from the sample?

It is very hard to understand all usages of OTXDataItem.img_info and if they would work with None or not. For now, it is better to maintain the existing logic until we start to factor out OTXDataItem.
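
One way to reconcile the two positions, sketched under assumptions: the property keeps returning None as before, while callers that cannot proceed without image information go through a strict helper that raises. `SketchSample`, `SketchImageInfo`, and `require_img_info` are hypothetical names and are not part of this PR.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SketchImageInfo:
    height: int
    width: int


@dataclass
class SketchSample:
    """Hypothetical sample; image_shape is None when no image is attached."""

    image_shape: tuple[int, int] | None = None

    @property
    def img_info(self) -> SketchImageInfo | None:
        # Existing behaviour is preserved: missing image data yields None.
        if self.image_shape is None:
            return None
        height, width = self.image_shape
        return SketchImageInfo(height=height, width=width)


def require_img_info(sample: SketchSample) -> SketchImageInfo:
    """Strict accessor for callers that cannot work without image information."""
    info = sample.img_info
    if info is None:
        msg = "sample has no image information"
        raise ValueError(msg)
    return info
```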

leoll2 · 2025-09-03T08:49:26Z

lib/src/otx/data/entity/sample.py

+    """OTXDataItemSample is a base class for OTX data items."""
+
+    image: np.ndarray | tv_tensors.Image = image_field(dtype=pl.UInt8)
+    label: torch.Tensor = label_field(pl.Int32())


If the label is always a torch type, why can image be a numpy type? The double type np.ndarray | tv_tensors.Image may be problematic for the code that consumes this class, because it needs to handle both cases (image as np.ndarray and image as tv_tensors.Image).

This is again existing OTX logic from OTXDataItem. The OTX codebase is already taking the different types into account.

tests/unit/data/entity/test_sample.py

lib/src/otx/data/samplers/class_incremental_sampler.py

lib/src/otx/data/dataset/base_new.py

leoll2 · 2025-09-03T08:55:40Z

lib/src/otx/data/dataset/base.py

        return OTXDataItem.collate_fn
+
+    def get_idx_list_per_classes(self, use_string_label: bool = False) -> dict[int | str, list[int]]:
+        """Compute class statistics."""


Please add a better docstring that explains what the method does and how the returned dictionary is structured.

This is copied and pasted from https://github.com/open-edge-platform/training_extensions/pull/4606/files#diff-f29e3ef832fa9ab1f32358061431d30bf261265823142f6093bc460f3effe2a8

I'll update this function doc, but I am not really planning to address all these previously existing issues in the OTX codebase within this feature branch. We should create a task to update the linter ruleset and enforce this throughout the codebase.

leoll2 · 2025-09-03T08:57:34Z

lib/src/otx/data/dataset/classification_new.py

+        kwargs["sample_type"] = ClassificationSample
+        super().__init__(**kwargs)
+
+    def get_idx_list_per_classes(self, use_string_label: bool = False) -> dict[int, list[int]]:


This method is already implemented in the parent class OTXDataset. Why does this class need to override it?

I believe the implementation of this method will depend on the sample type. If it will be uniform across samples, it can be moved back to the new OTXDataset. The new OTXDataset does not implement this method at the moment.

leoll2 · 2025-09-03T09:01:26Z

lib/src/otx/data/dataset/base_new.py

+    )
+
+
+class OTXDataset(TorchDataset):


If the old and new OTXDataset have the same API, I suggest adding some black-box functional tests that instantiate both classes, apply a sequence of operations to each, then finally assert that the respective results are identical.
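
A minimal sketch of such a parity test, assuming both datasets expose `__len__` and `__getitem__`. `OldDatasetSketch`, `NewDatasetSketch`, `run_operations`, and `check_parity` are hypothetical stand-ins; a real test would instantiate the two OTXDataset classes on the same data.

```python
from __future__ import annotations


class OldDatasetSketch:
    """Hypothetical legacy dataset backed by a plain list of labels."""

    def __init__(self, labels: list[int]) -> None:
        self._labels = list(labels)

    def __len__(self) -> int:
        return len(self._labels)

    def __getitem__(self, idx: int) -> dict[str, int]:
        return {"index": idx, "label": self._labels[idx]}


class NewDatasetSketch:
    """Hypothetical new dataset with a different internal layout."""

    def __init__(self, labels: list[int]) -> None:
        self._rows = tuple(enumerate(labels))

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, idx: int) -> dict[str, int]:
        index, label = self._rows[idx]
        return {"index": index, "label": label}


def run_operations(dataset) -> list:
    """Apply the same sequence of public operations to any dataset."""
    results: list = [len(dataset)]
    results.extend(dataset[i] for i in range(len(dataset)))
    return results


def check_parity(labels: list[int]) -> bool:
    """Black-box check: only public behaviour is compared, not internals."""
    return run_operations(OldDatasetSketch(labels)) == run_operations(NewDatasetSketch(labels))
```

Because the comparison goes only through the public interface, the test stays valid if either implementation changes internally.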


eugene123tw

Thanks, Albert, for the refactoring! I just have a few comments, but overall it looks great.

eugene123tw · 2025-09-03T15:44:11Z

lib/src/otx/data/dataset/base.py

        return OTXDataItem.collate_fn
+
+    def get_idx_list_per_classes(self, use_string_label: bool = False) -> dict[int | str, list[int]]:
+        """Get a dictionary with class labels (string/int) as keys and lists of sample indices as values."""


What’s the intended use case for use_string_label? Could you also add a Google-style docstring explaining it? For example:

    """Get a dictionary mapping class labels (string or int) to lists of samples.

    Args:
        use_string_label (bool): If True, use string class labels as keys.
            If False, use integer indices as keys.
    """

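To make the effect of `use_string_label` concrete, here is a standalone sketch of the mapping the docstring describes. It is written as a free function over plain lists for illustration; the `sample_labels` and `label_names` parameters are assumptions, and the actual method in OTX reads its data from the dataset instead.

```python
from __future__ import annotations

from collections import defaultdict


def get_idx_list_per_classes(
    sample_labels: list[list[int]],
    label_names: list[str],
    use_string_label: bool = False,
) -> dict[int | str, list[int]]:
    """Map each class to the indices of the samples that carry it.

    Args:
        sample_labels: For each sample, the list of its integer label indices.
        label_names: Class names, indexed by integer label.
        use_string_label: If True, keys are class names; otherwise integer indices.
    """
    stats: dict[int | str, list[int]] = defaultdict(list)
    for sample_idx, labels in enumerate(sample_labels):
        for label in labels:
            key = label_names[label] if use_string_label else label
            stats[key].append(sample_idx)
    return dict(stats)


labels = [[0], [1], [0]]
names = ["cat", "dog"]
print(get_idx_list_per_classes(labels, names))        # {0: [0, 2], 1: [1]}
print(get_idx_list_per_classes(labels, names, True))  # {'cat': [0, 2], 'dog': [1]}
```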
lib/src/otx/data/factory.py


[WIP] Classification Dataset refactor

3bc16d3

github-actions bot added the TEST Any changes in tests label Aug 27, 2025

AlbertvanHouten added 3 commits August 27, 2025 15:40

Fix image conversions

5ea1908

add batching logic

87c5a31

Merge branch 'develop' of https://github.com/open-edge-platform/train…

3df9e59

…ing_extensions into albert/new-dataset-support

github-actions bot added DEPENDENCY Any changes in any dependencies (new dep or its version) should be produced via Change Request on PM BUILD labels Aug 28, 2025

AlbertvanHouten added 4 commits August 28, 2025 13:28

Switch logic back to reuse OTXDataBatch which is initialized with dat…

1d8f332

…a from Sample

Fix datumaro LabelGroup in _dispatch_label_info method

966ae56

Several fixes to use new Datumaro Dataset class for classification tr…

53afc10

…aining

Merge branch 'develop' of https://github.com/open-edge-platform/train…

e9338d2

…ing_extensions into albert/new-dataset-support # Conflicts: # lib/src/otx/data/dataset/base_new.py # lib/src/otx/data/dataset/classification_new.py # lib/src/otx/data/entity/sample.py

github-actions bot removed DEPENDENCY Any changes in any dependencies (new dep or its version) should be produced via Change Request on PM TEST Any changes in tests BUILD labels Sep 1, 2025

AlbertvanHouten added 2 commits September 2, 2025 09:42

add base OTX Sample class

3ae6745

Add test cases

742f399

github-actions bot added the TEST Any changes in tests label Sep 2, 2025

AlbertvanHouten changed the title ~~[WIP] Classification Dataset refactor~~ Classification Dataset refactor Sep 2, 2025

AlbertvanHouten added 2 commits September 2, 2025 16:35

Address ruff/mypy issues

856d54e

Signed-off-by: Albert van Houten <albert.van.houten@intel.com>

Pin datumaro develop branch

785f5e6

Signed-off-by: Albert van Houten <albert.van.houten@intel.com>

gdlg reviewed Sep 2, 2025

Move get_idx_list_per_classes to dataset class and address other PR c…

df3c776

…omments. Signed-off-by: Albert van Houten <albert.van.houten@intel.com>

AlbertvanHouten changed the base branch from develop to feature/datumaro September 3, 2025 08:13

AlbertvanHouten marked this pull request as ready for review September 3, 2025 08:13

AlbertvanHouten requested review from A-Artemis, itallix, leoll2, samet-akcay and warrkan as code owners September 3, 2025 08:13

AlbertvanHouten requested review from Daankrol, ashwinvaidya17, eugene123tw, kprokofi, rajeshgangireddy and sovrasov as code owners September 3, 2025 08:13

Raise error if dataset isn't of correct type.

121bba7

Signed-off-by: Albert van Houten <albert.van.houten@intel.com>

AlbertvanHouten removed the TEST Any changes in tests label Sep 3, 2025

gdlg previously approved these changes Sep 3, 2025

leoll2 reviewed Sep 3, 2025

Apply suggestion from @leoll2

6fb07d8

Co-authored-by: Leonardo Lai <leonardo.lai@intel.com>

AlbertvanHouten dismissed gdlg’s stale review via 6fb07d8 September 3, 2025 09:35

AlbertvanHouten added 4 commits September 3, 2025 11:49

Update Copyright headers and add function doc

069474f

Signed-off-by: Albert van Houten <albert.van.houten@intel.com>

Merge remote-tracking branch 'origin/albert/new-dataset-support' into…

aaddb60

… albert/new-dataset-support

Pin numpy version

9ab6b7a

Signed-off-by: Albert van Houten <albert.van.houten@intel.com>

Fix failing test

543963e

Signed-off-by: Albert van Houten <albert.van.houten@intel.com>

eugene123tw reviewed Sep 3, 2025

eugene123tw previously approved these changes Sep 3, 2025

Create LabelInfo class in the dataset and remove dispatching label in…

86e4acd

…fo step for the new categories class. Signed-off-by: Albert van Houten <albert.van.houten@intel.com>

AlbertvanHouten dismissed eugene123tw’s stale review via 86e4acd September 4, 2025 06:49

AlbertvanHouten added 2 commits September 4, 2025 09:00

Update function doc

d089db9

Signed-off-by: Albert van Houten <albert.van.houten@intel.com>

Fix failing tests after LabelInfo change

b22fcc3

Signed-off-by: Albert van Houten <albert.van.houten@intel.com>

gdlg approved these changes Sep 4, 2025

AlbertvanHouten merged commit 9bd3fa7 into feature/datumaro Sep 4, 2025
12 of 15 checks passed

AlbertvanHouten deleted the albert/new-dataset-support branch September 4, 2025 08:10


4 participants
