
#1 Implement Dataset config for loading HuggingFace datasets #31

Merged (14 commits, Sep 3, 2024)

Conversation

botirk38 (Collaborator)

Why?

We need to implement this feature to standardize and simplify the process of loading and sharding Hugging Face datasets in SONAR pipelines. The primary use cases are:

  1. To provide a consistent and reusable configuration system for dataset loading across different SONAR pipelines.
  2. To enable efficient distributed training by incorporating dataset sharding capabilities.
  3. To allow flexible configuration overrides, making it easier to experiment with different dataset parameters without modifying the core code.

This implementation will improve code maintainability, enhance reproducibility of experiments, and facilitate easier scaling of training processes across multiple GPUs or nodes.

How?

Key technical decisions made in this implementation (a rough sketch follows the list):

  1. Utilized Python's dataclasses for a clean, type-hinted configuration system (DatasetConfig).
  2. Implemented a TypedDict (DatasetOverwrites) to specify allowed configuration overwrites, ensuring type safety.
  3. Integrated with the Hugging Face datasets library for actual dataset loading, leveraging its robust features.
  4. Incorporated dataset sharding functionality to support distributed training scenarios.
  5. Added a UUID field to uniquely identify each configuration instance.
  6. Implemented a with_overwrites method for easy parameter overriding without modifying the original configuration.
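
For illustration, here is a minimal sketch of the shape these decisions describe. Only DatasetConfig, DatasetOverwrites, and with_overwrites are named in this PR; every other field and method name below is an assumption, not the actual code.

```python
# Illustrative sketch only: field names other than DatasetConfig,
# DatasetOverwrites, and with_overwrites are assumptions, not this PR's code.
import uuid
from dataclasses import dataclass, field, replace
from typing import TypedDict

from datasets import Dataset, load_dataset


class DatasetOverwrites(TypedDict, total=False):
    """Keys that are allowed to be overwritten on an existing config."""
    dataset_name: str
    split: str
    world_size: int
    rank: int


@dataclass
class DatasetConfig:
    dataset_name: str
    split: str = "train"
    world_size: int = 1  # total number of shards for distributed training
    rank: int = 0        # index of this worker's shard
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))

    def with_overwrites(self, overwrites: DatasetOverwrites) -> "DatasetConfig":
        # dataclasses.replace returns a new instance; the original stays untouched.
        return replace(self, **overwrites)

    def load(self) -> Dataset:
        ds = load_dataset(self.dataset_name, split=self.split)
        if self.world_size > 1:
            # Hugging Face datasets supports deterministic sharding.
            ds = ds.shard(num_shards=self.world_size, index=self.rank)
        return ds
```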

Test plan

To test these changes, we will (see the test sketch after this list):

  1. Create unit tests to cover basic dataset loading, sharding, and configuration overwriting.
  2. Implement integration tests within the SONAR pipeline to ensure compatibility with existing systems.
  3. Perform stress tests with large datasets and multiple shards to verify distributed training capabilities.
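
A pytest-style sketch of what the unit tests could look like, assuming the DatasetConfig sketch above; the import path and the "ag_news" dataset are placeholders, not part of this PR:

```python
# Hypothetical pytest sketch; the import path and "ag_news" are placeholders
# chosen for illustration, not part of this PR.
from sonar.dataset_config import DatasetConfig  # assumed module path


def test_with_overwrites_returns_new_config():
    cfg = DatasetConfig(dataset_name="ag_news", split="train")
    new_cfg = cfg.with_overwrites({"split": "test"})
    assert new_cfg.split == "test"
    assert cfg.split == "train"  # the original config is left unchanged


def test_sharding_partitions_the_dataset():
    full = DatasetConfig(dataset_name="ag_news", split="test").load()
    shard0 = DatasetConfig(dataset_name="ag_news", split="test",
                           world_size=2, rank=0).load()
    shard1 = DatasetConfig(dataset_name="ag_news", split="test",
                           world_size=2, rank=1).load()
    # The two shards together should cover the full split exactly once.
    assert len(shard0) + len(shard1) == len(full)
```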

@facebook-github-bot added the "CLA Signed" label on Jul 19, 2024 (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed).
@artemru (Contributor) commented Jul 30, 2024

What about adding some integration tests?

@botirk38 changed the title from "Implement Dataset config for loading HuggingFace datasets" to "#1 Implement Dataset config for loading HuggingFace datasets" on Aug 2, 2024
Comment on lines 157 to 168
@dataclass
class TextDatasetConfig(DatasetConfig):
"""
Configuration for text datasets.

This class inherits from DatasetConfig and can be used for
text-specific dataset configurations.
"""


@dataclass
class AudioDatasetConfig(DatasetConfig):
Contributor
They should live in their corresponding modules, but let's move them once it's merged!

pyproject.toml (outdated, resolved)
@antoine-tran (Contributor) left a comment

Approved with some nits.

@botirk38 Please make sure to clean up linter / isort complaints before landing

assert self.world_size >= 1, f"Invalid world_size: {self.world_size}. It should be >= 1."
assert 0 <= self.rank < self.world_size, f"Invalid rank: {self.rank}. It should be between 0 and {self.world_size - 1}."

def with_overwrites(self, overwrites: DatasetOverwrites):
@antoine-tran Aug 14, 2024

Since we use fairseq2, we can also just try fairseq2.utils.dataclass.update_dataclass here: https://github.com/facebookresearch/fairseq2/blob/main/src/fairseq2/utils/dataclass.py#L36 (in combination with deepcopy()).
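
A rough sketch of that suggestion; the exact signature of update_dataclass (dataclass instance plus a dict of overrides, updated in place) is an assumption here, so check the linked fairseq2 source before adopting it:

```python
# Sketch of the suggested approach; update_dataclass's signature
# (instance + dict of overrides, updated in place) is assumed, not verified.
from copy import deepcopy

from fairseq2.utils.dataclass import update_dataclass


def with_overwrites(self, overwrites: DatasetOverwrites) -> "DatasetConfig":
    new_config = deepcopy(self)                     # keep the original intact
    update_dataclass(new_config, dict(overwrites))  # assumed in-place update
    return new_config
```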

"pylint>=2.8.0",
]

hg = ["transformers>=4.44.0", "datasets>=2.20.0"]
Contributor
Do we need to constrain the version of transformers that much?

Raises:
AssertionError: If world_size or rank are invalid.
"""
assert self.world_size >= 1, f"Invalid world_size: {self.world_size}. It should be >= 1."
Contributor
Nit comment, but this might help you in the future:

In standard SWE, assert is normally used to check low-level, internal variables that should behave in a certain way in normal cases, defining the fail-fast points in the program.

When it comes to validating user inputs, it is normally better to use an "if" condition and raise a ValueError, as in the sketch below.
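
For example, the same validation written with explicit exceptions, using __post_init__ as one natural place for it in a dataclass:

```python
# The same checks expressed with explicit exceptions instead of asserts.
def __post_init__(self) -> None:
    if self.world_size < 1:
        raise ValueError(
            f"Invalid world_size: {self.world_size}. It should be >= 1."
        )
    if not 0 <= self.rank < self.world_size:
        raise ValueError(
            f"Invalid rank: {self.rank}. "
            f"It should be between 0 and {self.world_size - 1}."
        )
```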

@botirk38 merged commit cbda414 into facebookresearch:main on Sep 3, 2024
5 checks passed