
Missing state.json when creating a cloud dataset using a dataset_builder #5402

Open
danielfleischer opened this issue Jan 3, 2023 · 3 comments

Comments

@danielfleischer

Describe the bug

I use load_dataset_builder to create a builder and run download_and_prepare to upload the dataset to S3. However, when trying to load it, the state.json files are missing. Complete example:

from aiobotocore.session import AioSession as Session
from datasets import load_from_disk, load_dataset, load_dataset_builder
import s3fs

storage_options = {"session": Session()}
fs = s3fs.S3FileSystem(**storage_options)

output_dir = "s3://bucket/imdb"
builder = load_dataset_builder("imdb")
builder.download_and_prepare(output_dir, storage_options=storage_options)

load_from_disk(output_dir, fs=fs)  # ERROR
# [Errno 2] No such file or directory: '/tmp/tmpy22yys8o/bucket/imdb/state.json'

As a comparison, if you use the non-lazy load_dataset, it works, and the S3 folder has a different structure plus the state.json files. Example:

from aiobotocore.session import AioSession as Session
from datasets import load_from_disk, load_dataset, load_dataset_builder
import s3fs

storage_options = {"session": Session()}
fs = s3fs.S3FileSystem(**storage_options)

output_dir = "s3://bucket/imdb"
dataset = load_dataset("imdb")
dataset.save_to_disk(output_dir, fs=fs)

load_from_disk(output_dir, fs=fs)  # WORKS

You'd still want the first option for the laziness and the Parquet conversion. Thanks!

Steps to reproduce the bug

from aiobotocore.session import AioSession as Session
from datasets import load_from_disk, load_dataset, load_dataset_builder
import s3fs

storage_options = {"session": Session()}
fs = s3fs.S3FileSystem(**storage_options)

output_dir = "s3://bucket/imdb"
builder = load_dataset_builder("imdb")
builder.download_and_prepare(output_dir, storage_options=storage_options)

load_from_disk(output_dir, fs=fs)  # ERROR
# [Errno 2] No such file or directory: '/tmp/tmpy22yys8o/bucket/imdb/state.json'

BTW, you need the AioSession because s3fs is now based on aiobotocore; see fsspec/s3fs#385.

Expected behavior

Expected to be able to load the dataset from S3.

Environment info

s3fs               2022.11.0
s3transfer         0.6.0
datasets           2.8.0
aiobotocore        2.4.2
boto3              1.24.59
botocore           1.27.59

Python 3.7.15

@lhoestq
Member

lhoestq commented Jan 4, 2023

load_from_disk must be used on datasets saved using save_to_disk: they correspond to fully serialized datasets including their state.

On the other hand, download_and_prepare just downloads the raw data and converts it to Arrow (or Parquet if you want). We are working on allowing you to reload a dataset saved on S3 with download_and_prepare using load_dataset in #5281.
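
For reference, a minimal sketch of the Parquet variant via the file_format argument of download_and_prepare, reusing the bucket/session setup from the report above (the output path is illustrative):

from aiobotocore.session import AioSession as Session
from datasets import load_dataset_builder

storage_options = {"session": Session()}
output_dir = "s3://bucket/imdb"

builder = load_dataset_builder("imdb")
# Write Parquet shards instead of the default Arrow files
builder.download_and_prepare(
    output_dir,
    storage_options=storage_options,
    file_format="parquet",
)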

For now I'd encourage you to keep using save_to_disk.

@danielfleischer
Author

Thanks, I'll follow that issue.

I was following the cloud storage docs section, and perhaps I'm missing some part of the flow; it starts with load_dataset_builder + download_and_prepare. You say I need an explicit save_to_disk, but what object needs to be saved? The builder? Is that related to the other issue?

@lhoestq
Member

lhoestq commented Jan 4, 2023

Right now, load_dataset_builder + download_and_prepare is meant to be used with tools like Dask or Spark, but load_dataset will soon support private cloud storage as well, so you'll be able to reload the dataset with datasets.
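
For instance, here is a minimal sketch of reading the prepared shards with Dask, assuming download_and_prepare was run with file_format="parquet"; the imdb-train-*.parquet glob is an assumption, so check the actual filenames in the bucket:

from aiobotocore.session import AioSession as Session
import dask.dataframe as dd

# Assumes Parquet shards were written by download_and_prepare;
# the glob below is a guess at the shard naming, not a guarantee.
df = dd.read_parquet(
    "s3://bucket/imdb/imdb-train-*.parquet",
    storage_options={"session": Session()},
)
print(df.head())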

Right now, the only function that can load a dataset from cloud storage is load_from_disk, which must be used with a dataset serialized with save_to_disk.
