
Missing state.json when creating a cloud dataset using a dataset_builder #5402

Open
danielfleischer opened this issue Jan 3, 2023 · 3 comments

Comments

@danielfleischer

Describe the bug

I use load_dataset_builder to create a builder and run download_and_prepare to upload the dataset to S3. However, when trying to load it, the state.json files are missing. Complete example:

from aiobotocore.session import AioSession as Session
from datasets import load_from_disk, load_dataset, load_dataset_builder
import s3fs

storage_options = {"session": Session()}
fs = s3fs.S3FileSystem(**storage_options)

output_dir = "s3://bucket/imdb"
builder = load_dataset_builder("imdb")
builder.download_and_prepare(output_dir, storage_options=storage_options)

load_from_disk(output_dir, fs=fs)  # ERROR
# [Errno 2] No such file or directory: '/tmp/tmpy22yys8o/bucket/imdb/state.json'

As a comparison, if you use the non-lazy load_dataset, it works, and the S3 folder has a different structure plus the state.json files. Example:

from aiobotocore.session import AioSession as Session
from datasets import load_from_disk, load_dataset, load_dataset_builder
import s3fs

storage_options = {"session": Session()}
fs = s3fs.S3FileSystem(**storage_options)

output_dir = "s3://bucket/imdb"
dataset = load_dataset("imdb")
dataset.save_to_disk(output_dir, fs=fs)

load_from_disk(output_dir, fs=fs)  # WORKS

You'd still want the first option for the laziness and the Parquet conversion. Thanks!

Steps to reproduce the bug

from aiobotocore.session import AioSession as Session
from datasets import load_from_disk, load_dataset, load_dataset_builder
import s3fs

storage_options = {"session": Session()}
fs = s3fs.S3FileSystem(**storage_options)

output_dir = "s3://bucket/imdb"
builder = load_dataset_builder("imdb")
builder.download_and_prepare(output_dir, storage_options=storage_options)

load_from_disk(output_dir, fs=fs)  # ERROR
# [Errno 2] No such file or directory: '/tmp/tmpy22yys8o/bucket/imdb/state.json'

BTW, you need the AioSession because s3fs is now based on aiobotocore; see fsspec/s3fs#385.

Expected behavior

Expected to be able to load the dataset from S3.

Environment info

s3fs               2022.11.0
s3transfer         0.6.0
datasets           2.8.0
aiobotocore        2.4.2
boto3              1.24.59
botocore           1.27.59

Python 3.7.15

@lhoestq
Member

lhoestq commented Jan 4, 2023

load_from_disk must be used on datasets saved using save_to_disk: they correspond to fully serialized datasets including their state.

On the other hand, download_and_prepare just downloads the raw data and converts it to Arrow (or Parquet if you want). We are working on allowing you to reload a dataset saved on S3 with download_and_prepare using load_dataset in #5281.
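
For reference, a minimal sketch of the Parquet variant via the file_format argument of download_and_prepare, reusing the bucket/session setup from the report above (the output path is illustrative):

from aiobotocore.session import AioSession as Session
from datasets import load_dataset_builder

storage_options = {"session": Session()}
output_dir = "s3://bucket/imdb"

builder = load_dataset_builder("imdb")
# Write Parquet shards instead of the default Arrow files
builder.download_and_prepare(
    output_dir,
    storage_options=storage_options,
    file_format="parquet",
)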

For now I'd encourage you to keep using save_to_disk.

@danielfleischer
Author

Thanks, I'll follow that issue.

I was following the cloud storage docs section, and perhaps I'm missing some part of the flow; it starts with load_dataset_builder + download_and_prepare. You say I need an explicit save_to_disk, but what object needs to be saved? The builder? Is that related to the other issue?

@lhoestq
Member

lhoestq commented Jan 4, 2023

Right now, load_dataset_builder + download_and_prepare is meant to be used with tools like Dask or Spark, but load_dataset will soon support private cloud storage as well, so you'll be able to reload the dataset with datasets.
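
For instance, here is a minimal sketch of reading the prepared shards with Dask, assuming download_and_prepare was run with file_format="parquet"; the imdb-train-*.parquet glob is an assumption, so check the actual filenames in the bucket:

from aiobotocore.session import AioSession as Session
import dask.dataframe as dd

# Assumes Parquet shards were written by download_and_prepare;
# the glob below is a guess at the shard naming, not a guarantee.
df = dd.read_parquet(
    "s3://bucket/imdb/imdb-train-*.parquet",
    storage_options={"session": Session()},
)
print(df.head())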

Right now, the only function that can load a dataset from cloud storage is load_from_disk, which must be used with a dataset serialized with save_to_disk.
