-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#1 Implement Dataset config for loading HuggingFace datasets #31
#1 Implement Dataset config for loading HuggingFace datasets #31
Conversation
what about adding some integration tests ? |
huggingface_pipelines/dataset.py
Outdated
@dataclass | ||
class TextDatasetConfig(DatasetConfig): | ||
""" | ||
Configuration for text datasets. | ||
|
||
This class inherits from BaseDatasetConfig and can be used for | ||
text-specific dataset configurations. | ||
""" | ||
|
||
|
||
@dataclass | ||
class AudioDatasetConfig(DatasetConfig): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they should live in their corresponding modules, but let's move them once it's merged !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved with some nits.
@botirk38 Please make sure to clean up linter / isort complaints before landing
assert self.world_size >= 1, f"Invalid world_size: {self.world_size}. It should be >= 1." | ||
assert 0 <= self.rank < self.world_size, f"Invalid rank: {self.rank}. It should be between 0 and {self.world_size - 1}." | ||
|
||
def with_overwrites(self, overwrites: DatasetOverwrites): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we use fairseq2, we can also just try fairseq2.utils.dataclass.update_dataclass
here https://github.com/facebookresearch/fairseq2/blob/main/src/fairseq2/utils/dataclass.py#L36 (in combination with deepcopy())
"pylint>=2.8.0", | ||
] | ||
|
||
hg = ["transformers>=4.44.0", "datasets>=2.20.0"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to constraint the version of transformers that much ?
huggingface_pipelines/dataset.py
Outdated
Raises: | ||
AssertionError: If world_size or rank are invalid. | ||
""" | ||
assert self.world_size >= 1, f"Invalid world_size: {self.world_size}. It should be >= 1." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit comment, but this might help you in the future:
in standard SWE, assert
is normally to assert low-level, internal variables that should act in certain way in normal cases, to define the points of fail fast in the program.
When it comes to validating user inputs, it is normally better to use "if" condition and raise ValueError
Why?
We need to implement this feature to standardize and simplify the process of loading and sharding Hugging Face datasets in SONAR pipelines. The primary use cases are:
This implementation will improve code maintainability, enhance reproducibility of experiments, and facilitate easier scaling of training processes across multiple GPUs or nodes.
How?
Key technical decisions made in this implementation:
DatasetConfig
).DatasetOverwrites
) to specify allowed configuration overwrites, ensuring type safety.datasets
library for actual dataset loading, leveraging its robust features.with_overwrites
method for easy parameter overriding without modifying the original configuration.Test plan
To test these changes, we will: