
#1 Implement Dataset config for loading HuggingFace datasets #31

Merged (14 commits, Sep 3, 2024)

Conversation

botirk38 (Collaborator)

Why?

We need to implement this feature to standardize and simplify the process of loading and sharding Hugging Face datasets in SONAR pipelines. The primary use cases are:

  1. To provide a consistent and reusable configuration system for dataset loading across different SONAR pipelines.
  2. To enable efficient distributed training by incorporating dataset sharding capabilities.
  3. To allow flexible configuration overrides, making it easier to experiment with different dataset parameters without modifying the core code.

This implementation will improve code maintainability, enhance reproducibility of experiments, and facilitate easier scaling of training processes across multiple GPUs or nodes.

How?

Key technical decisions made in this implementation (a rough sketch follows the list):

  1. Utilized Python's dataclasses for a clean, type-hinted configuration system (DatasetConfig).
  2. Implemented a TypedDict (DatasetOverwrites) to specify allowed configuration overwrites, ensuring type safety.
  3. Integrated with the Hugging Face datasets library for actual dataset loading, leveraging its robust features.
  4. Incorporated dataset sharding functionality to support distributed training scenarios.
  5. Added a UUID field to uniquely identify each configuration instance.
  6. Implemented a with_overwrites method for easy parameter overriding without modifying the original configuration.
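
For illustration, here is a minimal sketch of the shape these decisions describe. Only DatasetConfig, DatasetOverwrites, and with_overwrites are named in this PR; every other field and method name below is an assumption, not the actual code.

```python
# Illustrative sketch only: field names other than DatasetConfig,
# DatasetOverwrites, and with_overwrites are assumptions, not this PR's code.
import uuid
from dataclasses import dataclass, field, replace
from typing import TypedDict

from datasets import Dataset, load_dataset


class DatasetOverwrites(TypedDict, total=False):
    """Keys that are allowed to be overwritten on an existing config."""
    dataset_name: str
    split: str
    world_size: int
    rank: int


@dataclass
class DatasetConfig:
    dataset_name: str
    split: str = "train"
    world_size: int = 1  # total number of shards for distributed training
    rank: int = 0        # index of this worker's shard
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))

    def with_overwrites(self, overwrites: DatasetOverwrites) -> "DatasetConfig":
        # dataclasses.replace returns a new instance; the original stays untouched.
        return replace(self, **overwrites)

    def load(self) -> Dataset:
        ds = load_dataset(self.dataset_name, split=self.split)
        if self.world_size > 1:
            # Hugging Face datasets supports deterministic sharding.
            ds = ds.shard(num_shards=self.world_size, index=self.rank)
        return ds
```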

Test plan

To test these changes, we will (see the test sketch after this list):

  1. Create unit tests to cover basic dataset loading, sharding, and configuration overwriting.
  2. Implement integration tests within the SONAR pipeline to ensure compatibility with existing systems.
  3. Perform stress tests with large datasets and multiple shards to verify distributed training capabilities.
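
A pytest-style sketch of what the unit tests could look like, assuming the DatasetConfig sketch above; the import path and the "ag_news" dataset are placeholders, not part of this PR:

```python
# Hypothetical pytest sketch; the import path and "ag_news" are placeholders
# chosen for illustration, not part of this PR.
from sonar.dataset_config import DatasetConfig  # assumed module path


def test_with_overwrites_returns_new_config():
    cfg = DatasetConfig(dataset_name="ag_news", split="train")
    new_cfg = cfg.with_overwrites({"split": "test"})
    assert new_cfg.split == "test"
    assert cfg.split == "train"  # the original config is left unchanged


def test_sharding_partitions_the_dataset():
    full = DatasetConfig(dataset_name="ag_news", split="test").load()
    shard0 = DatasetConfig(dataset_name="ag_news", split="test",
                           world_size=2, rank=0).load()
    shard1 = DatasetConfig(dataset_name="ag_news", split="test",
                           world_size=2, rank=1).load()
    # The two shards together should cover the full split exactly once.
    assert len(shard0) + len(shard1) == len(full)
```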

@facebook-github-bot added the "CLA Signed" label on Jul 19, 2024 (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed).
@artemru (Contributor) commented Jul 30, 2024

What about adding some integration tests?

@botirk38 changed the title from "Implement Dataset config for loading HuggingFace datasets" to "#1 Implement Dataset config for loading HuggingFace datasets" on Aug 2, 2024
Comment on lines 157 to 168
@dataclass
class TextDatasetConfig(DatasetConfig):
"""
Configuration for text datasets.

This class inherits from DatasetConfig and can be used for
text-specific dataset configurations.
"""


@dataclass
class AudioDatasetConfig(DatasetConfig):
Contributor
They should live in their corresponding modules, but let's move them once it's merged!

pyproject.toml (outdated, resolved)
@antoine-tran (Contributor) left a comment

Approved with some nits.

@botirk38 Please make sure to clean up linter / isort complaints before landing

assert self.world_size >= 1, f"Invalid world_size: {self.world_size}. It should be >= 1."
assert 0 <= self.rank < self.world_size, f"Invalid rank: {self.rank}. It should be between 0 and {self.world_size - 1}."

def with_overwrites(self, overwrites: DatasetOverwrites):
@antoine-tran Aug 14, 2024

Since we use fairseq2, we can also just try fairseq2.utils.dataclass.update_dataclass here: https://github.com/facebookresearch/fairseq2/blob/main/src/fairseq2/utils/dataclass.py#L36 (in combination with deepcopy()).
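
A rough sketch of that suggestion; the exact signature of update_dataclass (dataclass instance plus a dict of overrides, updated in place) is an assumption here, so check the linked fairseq2 source before adopting it:

```python
# Sketch of the suggested approach; update_dataclass's signature
# (instance + dict of overrides, updated in place) is assumed, not verified.
from copy import deepcopy

from fairseq2.utils.dataclass import update_dataclass


def with_overwrites(self, overwrites: DatasetOverwrites) -> "DatasetConfig":
    new_config = deepcopy(self)                     # keep the original intact
    update_dataclass(new_config, dict(overwrites))  # assumed in-place update
    return new_config
```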

"pylint>=2.8.0",
]

hg = ["transformers>=4.44.0", "datasets>=2.20.0"]
Contributor
Do we need to constrain the version of transformers that much?

Raises:
AssertionError: If world_size or rank are invalid.
"""
assert self.world_size >= 1, f"Invalid world_size: {self.world_size}. It should be >= 1."
Contributor
Nit comment, but this might help you in the future:

In standard SWE, assert is normally used to check low-level, internal variables that should behave in a certain way in normal cases, defining the fail-fast points in the program.

When it comes to validating user inputs, it is normally better to use an "if" condition and raise a ValueError, as in the sketch below.
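
For example, the same validation written with explicit exceptions, using __post_init__ as one natural place for it in a dataclass:

```python
# The same checks expressed with explicit exceptions instead of asserts.
def __post_init__(self) -> None:
    if self.world_size < 1:
        raise ValueError(
            f"Invalid world_size: {self.world_size}. It should be >= 1."
        )
    if not 0 <= self.rank < self.world_size:
        raise ValueError(
            f"Invalid rank: {self.rank}. "
            f"It should be between 0 and {self.world_size - 1}."
        )
```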

@botirk38 merged commit cbda414 into facebookresearch:main on Sep 3, 2024
5 checks passed