
Support cloud storage in load_dataset via fsspec #5580

Merged (10 commits, Mar 11, 2023)

Conversation

Contributor

@dwyatte dwyatte commented Feb 27, 2023

Closes #5281

This PR uses fsspec to support datasets on cloud storage (tested manually with GCS). ETags are currently unsupported for cloud storage. In general, a much larger refactor could be done to just use fsspec for all schemes (ftp, http/s, s3, gcs) to unify the interfaces here, but I ultimately opted to leave that out of this PR.

I didn't create a GCS filesystem class in datasets.filesystems, since the S3 one appears to be a wrapper around s3fs.S3FileSystem and is mainly used to generate docs.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Feb 27, 2023

The documentation is not available anymore as the PR was closed or merged.

Member

@lhoestq lhoestq left a comment

Nice, thank you!

I added a few comments.

Regarding the tests, I think it should be possible to use the mockfs fixture; it lets you work with a dummy fsspec FileSystem under the "mock://" protocol.

However, it requires some storage_options to be passed. Maybe they can be added to DownloadConfig, which is passed to cached_path, so that fsspec_get and fsspec_head can use the user's storage_options?

setup.py (review thread resolved)
def fsspec_get(url, temp_file, timeout=10.0):
    _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
    try:
        fsspec.filesystem(urlparse(url).scheme).get(url, temp_file, timeout=timeout)
Member

Would be cool to have a tqdm bar, as in http_get.
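The tqdm suggestion could look something like the following sketch (the helper name is hypothetical, not the merged code): fsspec ships a built-in TqdmCallback, and get_file on any AbstractFileSystem accepts a callback that receives size and progress updates during the copy, much like the bar in http_get.

```python
import fsspec
from fsspec.callbacks import TqdmCallback


def fsspec_get_with_progress(url, temp_file):
    # resolve the filesystem and normalize the path for any registered protocol
    fs, _, paths = fsspec.get_fs_token_paths(url)
    # TqdmCallback receives set_size/relative_update calls during the copy
    # and renders a tqdm progress bar
    fs.get_file(paths[0], temp_file, callback=TqdmCallback(tqdm_kwargs={"desc": url}))
```

This requires tqdm to be installed; the same callback mechanism works for uploads via put_file.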

Member

Also, you may need to use fsspec.get_fs_token_paths first to instantiate the filesystem, in case some filesystem kwargs can be parsed from the URL (this concerns all filesystems that implement _get_kwargs_from_urls, including gcsfs).
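As a sketch of this suggestion (the signature is assumed and may differ from what was merged), fsspec_get could instantiate the filesystem through fsspec.get_fs_token_paths, which merges explicit storage_options with any kwargs parsed out of the URL itself:

```python
import fsspec


def fsspec_get(url, temp_file, storage_options=None):
    # get_fs_token_paths instantiates the filesystem for the URL's protocol,
    # honoring kwargs embedded in the URL (for filesystems that implement
    # _get_kwargs_from_urls, e.g. gcsfs) plus any explicit storage_options
    fs, _, paths = fsspec.get_fs_token_paths(url, storage_options=storage_options)
    fs.get_file(paths[0], temp_file)
```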

def fsspec_head(url, timeout=10.0):
    _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
    try:
        fsspec.filesystem(urlparse(url).scheme).info(url, timeout=timeout)
Member

What about using a hash of the file info as a pseudo-ETag? We could use it like a normal ETag to invalidate the cache if the remote file changed.

Member

Same here: you may need to use fsspec.get_fs_token_paths.

Contributor Author

@dwyatte dwyatte Feb 28, 2023

What about using a hash of the file info as a pseudo ETag ? We can use it as a normal ETag to invalidate the cache if the remote file changed

Interesting idea. This actually returns quite a bit of info, including an ETag on GCS, so as long as it's deterministic, I think we could. In the worst case, if a response includes a UUID or similar, we would always invalidate the cache, but maybe that's the safer thing to do.
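A minimal sketch of the pseudo-ETag idea (a hypothetical helper, not the merged implementation): hash a stable serialization of the fs.info() dict, so that any change in file metadata invalidates the cache. As discussed, non-deterministic fields in the info dict would only cause extra invalidation, never stale cache hits.

```python
import hashlib
import json


def pseudo_etag(info: dict) -> str:
    # sort_keys gives a stable serialization regardless of key order;
    # default=str covers non-JSON-native values such as datetimes
    payload = json.dumps(info, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```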

src/datasets/utils/file_utils.py (review thread resolved)
src/datasets/utils/file_utils.py (review thread resolved)
setup.py (review thread resolved)
@dwyatte dwyatte force-pushed the fsspec branch 2 times, most recently from 0d041d4 to 50f2f64, on March 2, 2023 04:01
Contributor Author

dwyatte commented Mar 2, 2023

Regarding the tests I think it should be possible to use the mockfs fixture, it allows to play with a dummy fsspec FileSystem with the "mock://" protocol.

However it requires some storage_options to be passed. Maybe it can be added to DownloadConfig which is passed to cached_path, so that fsspec_get and fsspec_head can use the user's storage_options ?

@lhoestq I went ahead and tested this with a patch so that I could assign the mockfs as a return value. Let me know if I'm missing something, though, and we need to pass storage_options down.

@dwyatte dwyatte requested a review from lhoestq March 2, 2023 04:11
Member

@lhoestq lhoestq left a comment

Instead of patching, I think it would be better to have a new filesystem TmpDirFileSystem (tmpfs) that doesn't need storage_options for the tests, and that is based on a temporary directory created just for the fixture. Maybe something like this?

class TmpDirFileSystem(MockFileSystem):
    protocol = "tmp"
    tmp_dir = None

    def __init__(self):
        assert self.tmp_dir is not None, "TmpDirFileSystem.tmp_dir is not set"
        super().__init__(local_root_dir=self.tmp_dir, auto_mkdir=True)


@pytest.fixture
def mock_fsspec():
    original_registry = fsspec.registry.copy()
    fsspec.register_implementation("mock", MockFileSystem)
    fsspec.register_implementation("tmp", TmpDirFileSystem)
    yield
    fsspec.registry = original_registry


@pytest.fixture
def tmpfs(tmp_path_factory, mock_fsspec):
    tmp_fs_dir = tmp_path_factory.mktemp("tmpfs")
    with patch.object(TmpDirFileSystem, "tmp_dir", tmp_fs_dir):
        yield TmpDirFileSystem()
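The fixture above builds on datasets' internal MockFileSystem test helper. A rough, self-contained equivalent of the same idea (all names assumed; built on fsspec's DirFileSystem instead of the test helper) confines a throwaway "tmp://" protocol to one temporary directory, so no storage_options are needed:

```python
import fsspec
from fsspec.implementations.dirfs import DirFileSystem
from fsspec.implementations.local import LocalFileSystem


class TmpDirFileSystem(DirFileSystem):
    """A 'tmp://' filesystem confined to a single temporary directory."""

    protocol = "tmp"
    tmp_dir = None  # set by the test fixture before instantiation

    def __init__(self, **kwargs):
        assert self.tmp_dir is not None, "TmpDirFileSystem.tmp_dir is not set"
        # auto_mkdir lets writes create intermediate directories on demand
        super().__init__(path=self.tmp_dir, fs=LocalFileSystem(auto_mkdir=True), **kwargs)
```

Registering it with fsspec.register_implementation("tmp", TmpDirFileSystem) makes fsspec.filesystem("tmp") resolve to a sandboxed local directory.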

tests/test_file_utils.py (review thread resolved)
def mockfs_file(mockfs):
    with open(os.path.join(mockfs.local_root_dir, FILE_PATH), "w") as f:
        f.write(FILE_CONTENT)
    return mockfs
Member

Since the fixture is named mockfs_file, I'd expect it to return the file path inside the mock filesystem?

Suggested change:
-    return mockfs
+    return FILE_PATH

Contributor Author

@dwyatte dwyatte Mar 2, 2023

In this case, we need to return the fs itself (which has been seeded with the file) to patch the fs in file_utils.fsspec_get so we can test get_from_cache.

Maybe mockfs_with_file is a better fixture name, but let me also explore the tmpfs solution above.

Contributor Author

The tmpfs solution feels pretty clean, thanks for the recommendation!

Comment on lines 43 to 45
def get_file(self, rpath, lpath, *args, **kwargs):
    rpath = posixpath.join(self.local_root_dir, self._strip_protocol(rpath))
    return self._fs.get_file(rpath, lpath, *args, **kwargs)
Member

Why is this needed? IIRC it's already implemented as part of AbstractFileSystem and uses self.open() under the hood.

Contributor Author

Ah, you're right. I'll remove it in the next commit.

Contributor Author

dwyatte commented Mar 5, 2023

Instead of patching, I think it would be better to have a new filesystem TmpDirFileSystem (tmpfs) that doesn't need storage_options for the tests, and that is based on a temporary directory created just for the fixture. Maybe something like this?

Thanks for the recommendation, this works great.

Member

lhoestq commented Mar 10, 2023

Feel free to merge main into your PR to fix the CI :)

Member

@lhoestq lhoestq left a comment

Awesome, thanks! I added a few suggestions, and then we can merge.

src/datasets/utils/file_utils.py (review thread resolved)
src/datasets/utils/file_utils.py (review thread resolved)
dwyatte and others added 9 commits March 10, 2023 14:18
Co-authored-by: Alvaro Bartolome <alvarobartt@yahoo.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Contributor Author

dwyatte commented Mar 10, 2023

Should be good to go. Thanks!

Member

@lhoestq lhoestq left a comment

Thanks!

@lhoestq lhoestq merged commit 3e62699 into huggingface:main Mar 11, 2023
@github-actions

Show benchmarks

Benchmark results were posted for PyArrow==8.0.0 and PyArrow==latest, covering benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, and benchmark_map_filter.json (new / old timings per metric; full tables omitted here).

@Xe

Xe commented Nov 14, 2024

@dwyatte

(tested manually with GCS)

Can you please paste the code you used to test this with? It's not clear how one would go about actually using this to access datasets in Google Cloud Storage or S3.
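For anyone landing here with the same question, a hedged sketch of what this PR enables (the bucket names below are made up, gcsfs or s3fs must be installed for the corresponding scheme, and exact parameter support such as storage_options depends on your datasets version):

```python
from datasets import load_dataset

# Hypothetical GCS path; requires gcsfs and valid GCS credentials.
dataset = load_dataset("csv", data_files="gs://my-bucket/path/data.csv")

# For S3, with s3fs installed (again, a hypothetical path):
# dataset = load_dataset("csv", data_files="s3://my-bucket/path/data.csv")
```

Credentials are resolved by the underlying fsspec filesystem (e.g. gcsfs picks up Google application-default credentials); for private buckets, recent datasets versions also accept a storage_options argument.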

Successfully merging this pull request may close these issues: Support cloud storage in load_dataset.

5 participants