Support cloud storage in load_dataset via fsspec #5580
Conversation
Force-pushed from b7f309e to 2fceb77
Nice, thank you! I added a few comments.

Regarding the tests, I think it should be possible to use the mockfs fixture, which lets you play with a dummy fsspec FileSystem under the "mock://" protocol. However, it requires some storage_options to be passed. Maybe they can be added to DownloadConfig, which is passed to cached_path, so that fsspec_get and fsspec_head can use the user's storage_options?
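For illustration, a minimal sketch of what that plumbing could look like; the storage_options field and the fsspec_get signature below are assumptions, not this PR's actual diff:

from dataclasses import dataclass, field
from typing import Dict, Optional
from urllib.parse import urlparse

import fsspec


@dataclass
class DownloadConfig:
    # ...existing fields elided...
    storage_options: Dict = field(default_factory=dict)  # hypothetical new field


def fsspec_get(url, temp_file, timeout=10.0, storage_options: Optional[Dict] = None):
    # instantiate the filesystem with the user's options (e.g. credentials)
    fs = fsspec.filesystem(urlparse(url).scheme, **(storage_options or {}))
    fs.get(url, temp_file, timeout=timeout)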
src/datasets/utils/file_utils.py
Outdated
def fsspec_get(url, temp_file, timeout=10.0):
    _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
    try:
        fsspec.filesystem(urlparse(url).scheme).get(url, temp_file, timeout=timeout)
Would be cool to have a tqdm bar as in http_get
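As an aside, a hedged sketch of one way to wire that in: fsspec's callback hooks (e.g. fsspec.callbacks.TqdmCallback) can be passed to fs.get, though whether the PR adopted this approach isn't shown here:

from urllib.parse import urlparse

import fsspec
from fsspec.callbacks import TqdmCallback


def fsspec_get_with_progress(url, temp_file):
    fs = fsspec.filesystem(urlparse(url).scheme)
    # TqdmCallback renders a progress bar as blocks are transferred
    fs.get(url, temp_file, callback=TqdmCallback())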
Also, you may need to use fsspec.get_fs_token_paths first to instantiate the filesystem, in case some filesystem kwargs can be parsed from the URL (this concerns all the filesystems that implement _get_kwargs_from_urls, including gcsfs).
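Roughly, the suggested pattern (the URL below is illustrative):

import fsspec

# get_fs_token_paths extracts filesystem kwargs from the URL itself
# (via _get_kwargs_from_urls) before instantiating the filesystem
fs, _, paths = fsspec.get_fs_token_paths("gcs://my-bucket/data.csv")
fs.get(paths[0], "/tmp/data.csv")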
src/datasets/utils/file_utils.py
Outdated
def fsspec_head(url, timeout=10.0):
    _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
    try:
        fsspec.filesystem(urlparse(url).scheme).info(url, timeout=timeout)
What about using a hash of the file info as a pseudo ETag? We can use it as a normal ETag to invalidate the cache if the remote file changed.
Same here, you may need to use fsspec.get_fs_token_paths first.
> What about using a hash of the file info as a pseudo ETag? We can use it as a normal ETag to invalidate the cache if the remote file changed.

Interesting idea. This actually returns quite a bit of info, including an ETag on GCS, so as long as it's deterministic, I think we could. In the worst case, if a response has a uuid or similar, we always invalidate the cache, but maybe that's the safer thing to do.
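To make the idea concrete, a rough sketch of hashing the info dict into a pseudo ETag (an assumption about the shape, not the merged implementation):

import json
from hashlib import sha256

import fsspec


def pseudo_etag(url: str) -> str:
    fs, _, paths = fsspec.get_fs_token_paths(url)
    info = fs.info(paths[0])
    # sort_keys keeps the hash deterministic; default=str covers
    # non-JSON-serializable values such as datetimes in the info dict
    return sha256(json.dumps(info, sort_keys=True, default=str).encode()).hexdigest()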
Force-pushed from 0d041d4 to 50f2f64
@lhoestq I went ahead and tested this with a patch so that I could assign the mockfs as a return value. Let me know if I'm missing something, though, and we need to pass storage_options down.
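Roughly, the patch-based approach described here might look like the following (the patch target and test name are assumptions):

from unittest.mock import patch

from datasets.utils.file_utils import get_from_cache


def test_get_from_cache_fsspec(mockfs_file, tmp_path):
    # make fsspec.filesystem return the seeded mock filesystem so that
    # fsspec_get/fsspec_head operate on it instead of a real remote
    with patch("fsspec.filesystem", return_value=mockfs_file):
        output = get_from_cache("mock://huggingface.co", cache_dir=tmp_path)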
Instead of patching, I think it would be better to have a new filesystem TmpDirFileSystem (tmpfs) that doesn't need storage_options for the tests, and that is based on a temporary directory created just for the fixture. Maybe something like this?
class TmpDirFileSystem(MockFileSystem):
    protocol = "tmp"
    tmp_dir = None

    def __init__(self):
        assert self.tmp_dir is not None, "TmpDirFileSystem.tmp_dir is not set"
        super().__init__(local_root_dir=self.tmp_dir, auto_mkdir=True)


@pytest.fixture
def mock_fsspec():
    original_registry = fsspec.registry.copy()
    fsspec.register_implementation("mock", MockFileSystem)
    fsspec.register_implementation("tmp", TmpDirFileSystem)
    yield
    fsspec.registry = original_registry


@pytest.fixture
def tmpfs(tmp_path_factory, mock_fsspec):
    tmp_fs_dir = tmp_path_factory.mktemp("tmpfs")
    with patch.object(TmpDirFileSystem, "tmp_dir", tmp_fs_dir):
        yield TmpDirFileSystem()
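For context, a hypothetical test built on this fixture could look like this (the file name and contents are illustrative):

import fsspec


def test_tmpfs_roundtrip(tmpfs):
    # seed a file into the fixture's temporary filesystem
    with tmpfs.open("file.txt", "w") as f:
        f.write("hello")
    # read it back through the registered "tmp://" protocol
    with fsspec.open("tmp://file.txt", "r") as f:
        assert f.read() == "hello"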
tests/test_file_utils.py
Outdated
def mockfs_file(mockfs):
    with open(os.path.join(mockfs.local_root_dir, FILE_PATH), "w") as f:
        f.write(FILE_CONTENT)
    return mockfs
Since the fixture is named mockfs_file, I'd expect it to return the file path inside the mock filesystem?
-    return mockfs
+    return FILE_PATH
In this case, we need to return the fs itself (which has been seeded with the file) to patch the fs in file_utils.fsspec_get so we can test get_from_cache. Maybe mockfs_with_file is a better fixture name, but let me also explore the tmpfs solution above.
The tmpfs solution feels pretty clean, thanks for the recommendation!
tests/fixtures/fsspec.py
Outdated
def get_file(self, rpath, lpath, *args, **kwargs):
    rpath = posixpath.join(self.local_root_dir, self._strip_protocol(rpath))
    return self._fs.get_file(rpath, lpath, *args, **kwargs)
Why is it needed? IIRC it's already implemented as part of the AbstractFileSystem and uses self.open() under the hood.
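For reference, the inherited behavior is roughly equivalent to this simplified approximation (not fsspec's actual source, which copies in blocks with callbacks):

def get_file(self, rpath, lpath, **kwargs):
    # AbstractFileSystem's default reads the remote file via self.open()
    # and writes it out to the local path
    with self.open(rpath, "rb") as source, open(lpath, "wb") as dest:
        dest.write(source.read())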
Ah, you're right. Will remove in the next commit.
Thanks for the recommendation, this works great.
Feel free to merge
Awesome, thanks! I added a few suggestions and we can merge.
…tes pseudo etag from head response
Co-authored-by: Alvaro Bartolome <alvarobartt@yahoo.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Should be good to go. Thanks!
Thanks!
Can you please paste the code you used to test this with? It's not clear how one would go about actually using this to access datasets in Google Cloud Storage or S3.
Closes #5281
This PR uses fsspec to support datasets on cloud storage (tested manually with GCS). ETags are currently unsupported for cloud storage. In general, a much larger refactor could be done to just use fsspec for all schemes (ftp, http/s, s3, gcs) to unify the interfaces here, but I ultimately opted to leave that out of this PR.
I didn't create a GCS filesystem class in datasets.filesystems, since the S3 one appears to be a wrapper around s3fs.S3FileSystem and is mainly used to generate docs.
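Responding to the usage question above, a hedged sketch of the intended call pattern (the bucket path is illustrative, and depending on the filesystem, credentials may need to be configured in the environment):

from datasets import load_dataset

# a cloud-storage URL can be passed wherever a local path or
# http(s) URL was previously accepted
dataset = load_dataset("csv", data_files="gs://my-bucket/train.csv")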