-
Notifications
You must be signed in to change notification settings - Fork 462
Classification Dataset refactor #4606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Docker Image Sizes
|
…ing_extensions into albert/new-dataset-support
…ing_extensions into albert/new-dataset-support # Conflicts: # lib/src/otx/data/dataset/base_new.py # lib/src/otx/data/dataset/classification_new.py # lib/src/otx/data/entity/sample.py
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
…omments. Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
gdlg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks Albert!
| ] | ||
| dependencies = [ | ||
| "datumaro==1.10.0", | ||
| "datumaro @ git+https://github.com/open-edge-platform/datumaro.git@develop", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: before the feature branch (feature/datumaro) can be merged to develop, this needs to be updated to point to an official datumaro release.
| @property | ||
| def masks(self) -> Mask | None: | ||
| """Get masks for the sample.""" | ||
| return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the semantics of these properties masks, bboxes, ... ? Why do they return None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A large portion of the OTX codebase depends on the OTXDataItem. For example, all the transforms in https://github.com/open-edge-platform/training_extensions/blob/8418221b4e065d14b6f2c223f1ee297b80fd64c8/lib/src/otx/data/transform_libs/torchvision.py.
For now, we will mimic that functionality with the OTXSample to not break any logic. In the future, OTXDataItem should be replaced with OTXSample but this will require a larger refactor, including updating all the transforms to not take OTXDataItems as input.
| else: | ||
| return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what situations is this property expected to return None? Wouldn't it be better to raise an exception if it's not possible to extract image information from the sample?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is very hard to understand all usages of OTXDataItem.img_info and if they would work with None or not. For now, it is better to maintain the existing logic until we start to factor out OTXDataItem.
| """OTXDataItemSample is a base class for OTX data items.""" | ||
|
|
||
| image: np.ndarray | tv_tensors.Image = image_field(dtype=pl.UInt8) | ||
| label: torch.Tensor = label_field(pl.Int32()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the label is always a torch type, why can image be a numpy type? The double type np.ndarray | tv_tensors.Image may be problematic for the code that consumes this class, because it needs to handle both cases (image as np.ndarray and image as tv_tensors.Image).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is again existing OTX logic from OTXDataItem. The OTX codebase is already taking the different types into account.
lib/src/otx/data/dataset/base.py
Outdated
| return OTXDataItem.collate_fn | ||
|
|
||
| def get_idx_list_per_classes(self, use_string_label: bool = False) -> dict[int | str, list[int]]: | ||
| """Compute class statistics.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a better docstring to explain what the method does and what's the structure of the returned dictionary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is copied and pasted from https://github.com/open-edge-platform/training_extensions/pull/4606/files#diff-f29e3ef832fa9ab1f32358061431d30bf261265823142f6093bc460f3effe2a8
I'll update this function doc, but I am not really planning to address all these previously existing issues in the OTX codebase within this feature branch. We should create a task to update the linter ruleset and enforce this throughout the codebase.
| kwargs["sample_type"] = ClassificationSample | ||
| super().__init__(**kwargs) | ||
|
|
||
| def get_idx_list_per_classes(self, use_string_label: bool = False) -> dict[int, list[int]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is already implemented in the parent class OTXDataset. Why does this class need to override it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the implementation of this method will depend on the sample type. If it will be uniform across samples, it can be moved back to the new OTXDataset. The new OTXDataset does not implement this method at the moment.
| ) | ||
|
|
||
|
|
||
| class OTXDataset(TorchDataset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the old and new OTXDataset have the same API, I suggest to add some black-box functional tests that instantiate both classes, apply a sequence of operations to each, then finally assert that the respective results are identical.
Co-authored-by: Leonardo Lai <leonardo.lai@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
… albert/new-dataset-support
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
eugene123tw
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Albert, for the refactoring! I just have a few comments, but overall it looks great.
lib/src/otx/data/dataset/base.py
Outdated
| return OTXDataItem.collate_fn | ||
|
|
||
| def get_idx_list_per_classes(self, use_string_label: bool = False) -> dict[int | str, list[int]]: | ||
| """Get a dictionary with class labels (string/int) as keys and lists of sample indices as values.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What’s the intended use case for use_string_label? Could you also add a Google-style docstring explaining it? For example:
"""Get a dictionary mapping class labels (string or int) to lists of samples.
Args:
use_string_label (bool): If True, use string class labels as keys.
If False, use integer indices as keys.
"""…fo step for the new categories class. Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
This pull request introduces significant updates to the OTX data pipeline, focusing on integrating the new experimental Datumaro
Datasetand related APIs for classification tasks. The changes add new base and classification dataset classes, update the dataset factory and sampler logic to support both legacy and experimental Datumaro datasets, and refactor sample entities for better compatibility.Experimental Datumaro integration and new dataset classes:
OTXDatasetandOTXMulticlassClsDatasetclasses inbase_new.pyandclassification_new.py, respectively, to support the new Datumaro experimentalDatasetAPI for classification tasks. These classes include new batching and sample handling logic. [1] [2]OTXSample,ClassificationSample) insample.pyto standardize sample structure and conversion from Datumaro items for classification.Factory and sampler logic updates:
OTXDatasetFactoryinfactory.pyto instantiate datasets using either legacy or new Datumaro datasets, including conversion logic and label category extraction for the experimental API. [1] [2] [3] [4]BalancedSamplerinbalanced_sampler.pyto support both legacy and new dataset types, with logic to extract class indices appropriately. [1] [2] [3]Compatibility and bug fixes:
_dispatch_label_infoinmodels/base.pyto handle both legacy and newLabelCategoriestypes for label info dispatching. [1] [2]mpsdevice type in GPU memory monitoring callbacks for broader hardware compatibility.Testing and minor fixes:
ClassificationSample) and dataset class (OTXMulticlassClsDataset). [1] [2] [3]Other minor changes:
datumarodependency inpyproject.tomlto the develop branch for OTX integration.Summary
How to test
Checklist
License
Feel free to contact the maintainers if that's a concern.