Using load_dataset_builder to create a builder, I run download_and_prepare to upload it to S3. However, when trying to load it back, the state.json files are missing. Complete example:
```python
from aiobotocore.session import AioSession as Session
from datasets import load_from_disk, load_dataset, load_dataset_builder
import s3fs

storage_options = {"session": Session()}
fs = s3fs.S3FileSystem(**storage_options)

output_dir = "s3://bucket/imdb"
builder = load_dataset_builder("imdb")
builder.download_and_prepare(output_dir, storage_options=storage_options)

load_from_disk(output_dir, fs=fs)  # ERROR
# [Errno 2] No such file or directory: '/tmp/tmpy22yys8o/bucket/imdb/state.json'
```
As a comparison, if you use the non-lazy load_dataset, it works and the S3 folder has a different structure plus the state.json files. Example:
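A minimal sketch of that non-lazy flow, reusing the hypothetical bucket and session from the example above (this is a reconstruction, not the original example):

```python
import s3fs
from aiobotocore.session import AioSession as Session
from datasets import load_dataset, load_from_disk

storage_options = {"session": Session()}
fs = s3fs.S3FileSystem(**storage_options)
output_dir = "s3://bucket/imdb"  # hypothetical bucket, as above

# Non-lazy: download and process the whole dataset locally first...
dataset = load_dataset("imdb")

# ...then fully serialize it (including the state.json files) to S3.
dataset.save_to_disk(output_dir, fs=fs)

# Reloading works because the directory was written by save_to_disk.
reloaded = load_from_disk(output_dir, fs=fs)
```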
load_from_disk must be used on datasets saved using save_to_disk: they correspond to fully serialized datasets including their state.
On the other hand, download_and_prepare just downloads the raw data and converts it to Arrow (or Parquet if you want). We are working on letting you reload a dataset written to S3 by download_and_prepare using load_dataset in #5281.
For now I'd encourage you to keep using save_to_disk.
I was following the cloud storage docs section and perhaps I'm missing some part of the flow; it starts with load_dataset_builder + download_and_prepare. You say I need an explicit save_to_disk, but what object needs to be saved? The builder? Is that related to the other issue?
Right now load_dataset_builder + download_and_prepare is meant to be used with tools like Dask or Spark, but load_dataset will support private cloud storage soon as well, so you'll be able to reload the dataset with datasets.
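For example, a minimal sketch of the Dask flow (assuming the same hypothetical bucket as above, Parquet output via download_and_prepare's file_format argument, and a shard naming pattern that may differ in practice):

```python
import dask.dataframe as dd
from aiobotocore.session import AioSession as Session
from datasets import load_dataset_builder

storage_options = {"session": Session()}
output_dir = "s3://bucket/imdb"  # hypothetical bucket, as above

# Prepare the raw data as Parquet shards directly on S3.
builder = load_dataset_builder("imdb")
builder.download_and_prepare(
    output_dir, storage_options=storage_options, file_format="parquet"
)

# Read the prepared shards with Dask instead of load_from_disk.
df = dd.read_parquet(
    f"{output_dir}/imdb-train-*.parquet",  # assumed shard pattern
    storage_options=storage_options,
)
```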
Right now the only function that can load a dataset from cloud storage is load_from_disk, which must be used with a dataset serialized with save_to_disk.
Describe the bug
See the complete example at the top of this issue. I'd still want the 1st option for the laziness and the Parquet conversion. Thanks!
Steps to reproduce the bug
See the complete example above to reproduce. BTW, you need the AioSession as s3fs is now based on aiobotocore; see fsspec/s3fs#385.
Expected behavior
Expected to be able to load the dataset from S3.
Environment info
python 3.7.15.