FileNotFound even though exists #5825

Closed
Muennighoff opened this issue May 5, 2023 · 4 comments

@Muennighoff
Contributor

Describe the bug

I'm trying to download https://huggingface.co/datasets/bigscience/xP3/resolve/main/ur/xp3_facebook_flores_spa_Latn-urd_Arab_devtest_ab-spa_Latn-urd_Arab.jsonl, which works fine in my web browser but not with datasets. Am I doing something wrong?

Downloading builder script: 100%
2.82k/2.82k [00:00<00:00, 64.2kB/s]
Downloading readme: 100%
12.6k/12.6k [00:00<00:00, 585kB/s]
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-2-4b45446a91d5> in <cell line: 4>()
      2 lang = "ur"
      3 fname = "xp3_facebook_flores_spa_Latn-urd_Arab_devtest_ab-spa_Latn-urd_Arab.jsonl"
----> 4 dataset = load_dataset("bigscience/xP3", data_files=f"{lang}/{fname}")

6 frames
/usr/local/lib/python3.10/dist-packages/datasets/data_files.py in _resolve_single_pattern_locally(base_path, pattern, allowed_extensions)
    291         if allowed_extensions is not None:
    292             error_msg += f" with any supported extension {list(allowed_extensions)}"
--> 293         raise FileNotFoundError(error_msg)
    294     return sorted(out)
    295 

FileNotFoundError: Unable to find 'https://huggingface.co/datasets/bigscience/xP3/resolve/main/ur/xp3_facebook_flores_spa_Latn-urd_Arab_devtest_ab-spa_Latn-urd_Arab.jsonl' at /content/https:/huggingface.co/datasets/bigscience/xP3/resolve/main

Steps to reproduce the bug

!pip install -q datasets
from datasets import load_dataset
lang = "ur"
fname = "xp3_facebook_flores_spa_Latn-urd_Arab_devtest_ab-spa_Latn-urd_Arab.jsonl"
dataset = load_dataset("bigscience/xP3", data_files=f"{lang}/{fname}")

Expected behavior

The file downloads correctly.

Environment info

latest versions

@mariosasko
Collaborator

Hi!

Passing data_files like this would only work if bigscience/xP3 were a no-code dataset, but it isn't (it has a Python builder script).

But this should work:

load_dataset("json", data_files="https://huggingface.co/datasets/bigscience/xP3/resolve/main/ur/xp3_facebook_flores_spa_Latn-urd_Arab_devtest_ab-spa_Latn-urd_Arab.jsonl")
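
The odd location in the traceback (`/content/https:/huggingface.co/...`) is consistent with the URL having been treated as a local path. The sketch below reproduces that symptom with `pathlib` alone. It is an illustration under assumptions, not the actual `datasets` code, and `/content` is assumed to be the notebook's working directory.

```python
from pathlib import PurePosixPath

# Assumption: the notebook's working directory, as suggested by the traceback.
cwd = PurePosixPath("/content")
base_url = "https://huggingface.co/datasets/bigscience/xP3/resolve/main"

# Joining a URL onto a directory treats it as a relative path;
# pathlib also collapses the "//" after the scheme into a single "/".
as_local_path = str(cwd / base_url)
print(as_local_path)
```

The printed value has the same shape as the location reported in the `FileNotFoundError` message above.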

@Muennighoff
Contributor Author

I see, so it's not compatible with regex, right?
For example:
load_dataset("json", data_files="https://huggingface.co/datasets/bigscience/xP3/resolve/main/ur/*")

@mariosasko
Collaborator

mariosasko commented May 7, 2023

It should work for patterns that "reference" the local filesystem, but to make this work with the Hub, we must implement #5281 first.

In the meantime, you can expand the glob with HfFileSystem and pass the matching files as a list to load_dataset:

from datasets import load_dataset
from huggingface_hub import HfFileSystem, hf_hub_url # `HfFileSystem` requires the latest version of `huggingface_hub`

fs = HfFileSystem()
glob_files = fs.glob("datasets/bigscience/xP3/ur/*")
# Resolve each fsspec path into its repo type, repo ID, and path within the repo
resolved_paths = [fs.resolve_path(file) for file in glob_files]
# Convert the resolved paths to HTTP URLs
data_files = [
    hf_hub_url(resolved_path.repo_id, resolved_path.path_in_repo, repo_type=resolved_path.repo_type)
    for resolved_path in resolved_paths
]

ds = load_dataset("json", data_files=data_files)
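
For readers who want to see what the conversion step produces, here is a rough offline sketch. The helper `fsspec_path_to_http_url` is hypothetical and not part of `huggingface_hub`. It assumes a dataset repo with a namespaced ID (`org/name`) and the `main` revision. In practice `fs.resolve_path` and `hf_hub_url` should be preferred, since they handle the general cases.

```python
from urllib.parse import quote


def fsspec_path_to_http_url(path: str, revision: str = "main") -> str:
    """Hypothetical helper: map 'datasets/<org>/<name>/<file>' to a Hub 'resolve' URL."""
    # Assumption: the repo ID always has the form '<org>/<name>'.
    repo_type, org, name, path_in_repo = path.split("/", 3)
    if repo_type != "datasets":
        raise ValueError(f"expected a path starting with 'datasets/', got {path!r}")
    # Percent-encode the revision and file path; '/' is kept inside the file path.
    return (
        f"https://huggingface.co/{repo_type}/{org}/{name}"
        f"/resolve/{quote(revision, safe='')}/{quote(path_in_repo)}"
    )


example = fsspec_path_to_http_url(
    "datasets/bigscience/xP3/ur/"
    "xp3_facebook_flores_spa_Latn-urd_Arab_devtest_ab-spa_Latn-urd_Arab.jsonl"
)
print(example)
```

For the file from the original report, this yields the same URL that opens in the browser.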

@lhoestq
Member

lhoestq commented Aug 16, 2023

This now works using load_dataset("json", data_files="hf://datasets/bigscience/xP3/ur/*"), so I'm closing this issue.
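
A small sketch of how the `hf://` form might be wrapped for loading one language folder at a time. The helper names `hub_glob_pattern` and `load_language` are hypothetical. It assumes a version of `datasets` recent enough to resolve `hf://` glob patterns, as the comment above indicates; the thread does not state the exact minimum version.

```python
def hub_glob_pattern(repo_id: str, subdir: str, pattern: str = "*") -> str:
    """Hypothetical helper: build an hf:// glob pattern for a dataset repo."""
    return f"hf://datasets/{repo_id}/{subdir.strip('/')}/{pattern}"


def load_language(lang: str):
    """Load every file in one language folder of bigscience/xP3 with the json builder."""
    # Imported lazily so the pattern helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("json", data_files=hub_glob_pattern("bigscience/xP3", lang))


print(hub_glob_pattern("bigscience/xP3", "ur"))
```

Calling `load_language("ur")` would then download and load the Urdu files; it needs network access, so it is not invoked here.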

@lhoestq lhoestq closed this as completed Aug 16, 2023