Uploading a dataloader script to the Hub

Upload your dataloader script only after your PR has been accepted. From this point on, the script should not need any further changes.

1. Make an account on the Hub

Please do the following before getting started:

  • Make an account on the 🤗 Hub and log in. Choose a good password, as you'll need it to authenticate.

  • Make a GitHub account; you can follow instructions to install git here.

Note: your permissions are set to READ by default. Ask an admin in your dataset's GitHub issue to grant you WRITE access; this should happen after your PR is accepted.

2. Activate the Hugging Face Hub

You can find the official instructions here. We will provide what you need for the seacrowd-datasets hackathon environment.

With your active seacrowd environment, use the following command:

huggingface-cli login

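Log in with your 🤗 Hub account credentials. Recent versions of huggingface-cli prompt for a User Access Token (created in your Hub account settings) instead of a username and password.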
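To confirm that the login worked, you can ask the Hub which account the stored token belongs to. The following is a minimal sketch, assuming the `huggingface_hub` package (which provides `huggingface-cli`) is installed; the helper name `current_hub_user` and its optional `api` argument are hypothetical and not part of the SEACrowd tooling.

```python
def current_hub_user(api=None):
    """Return the username tied to the stored Hub token, or None on any failure."""
    try:
        if api is None:
            # Imported lazily so the helper can be defined without huggingface_hub.
            from huggingface_hub import HfApi

            api = HfApi()
        # whoami() raises if no valid token is stored or the Hub is unreachable.
        return api.whoami()["name"]
    except Exception:
        # Covers a missing package, a missing token, and network errors alike.
        return None
```

Calling `current_hub_user()` with no arguments after `huggingface-cli login` should return your username; `None` means the login needs to be repeated.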

3. Create a dataset repository

Make a repository via the 🤗 Hub here with the following details.

  • Set Owner: SEACrowd
  • Set Dataset name: the name of the dataset
  • Set License: the license that applies to this dataset
  • Select Private
  • Click Create dataset

Please name your dataloading script with the same name as the dataset. For example, if your dataset loader script is called absa_prosa.py, then your dataset name should be absa_prosa.

If there is no appropriate license available in the provided options (for example for datasets with specific data user agreements) you should select "other".
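The naming convention in this step can be sketched as a small helper that derives the repository id from the script's file name. This is an illustration only: the function name `repo_id_for_script` is hypothetical, and the organisation name `SEACrowd` is assumed from the clone URL and `load_dataset` calls used later in this guide.

```python
from pathlib import Path

# Assumption: the organisation name used in this guide's clone URL.
HUB_ORG = "SEACrowd"


def repo_id_for_script(script_path):
    """Derive the Hub dataset repo id from a dataloader script path."""
    path = Path(script_path)
    if path.suffix != ".py":
        raise ValueError(f"expected a .py dataloader script, got {path.name!r}")
    # The dataset name is the script's file name without the extension.
    return f"{HUB_ORG}/{path.stem}"


print(repo_id_for_script("absa_prosa.py"))  # prints SEACrowd/absa_prosa
```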

4. Clone the dataset repository

In a terminal, choose a location for your local copy of the Hub dataset repository. In this location, use the following command:

git clone https://huggingface.co/datasets/SEACrowd/<your_dataset_name>

5. Commit your changes

Copy your dataloader script into the cloned repository, then run the following commands from inside it to add and push your work.

git add <your_file_name.py>  # add the dataset
git commit -m "Adds <your_dataset_name>"
git push origin
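If you prefer to script this step, the same three commands can be driven from Python. This is a sketch under assumptions: the function names `upload_commands` and `run_upload` are hypothetical, and `repo_dir` is assumed to be the directory created by `git clone` with the script already copied into it.

```python
import subprocess


def upload_commands(script_name, dataset_name):
    """Build the git commands from this step as argument lists."""
    return [
        ["git", "add", script_name],
        ["git", "commit", "-m", f"Adds {dataset_name}"],
        ["git", "push", "origin"],
    ]


def run_upload(repo_dir, script_name, dataset_name):
    """Run the commands inside the cloned repository, stopping at the first failure."""
    for cmd in upload_commands(script_name, dataset_name):
        # check=True raises CalledProcessError if git exits with a non-zero status.
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Keeping the command list separate from the runner makes it easy to print and review the commands before anything is pushed.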

6. Test your data-loader

Run the following code in a folder that does not contain your data-loading script, so that the loader is fetched from the Hub rather than picked up locally. Test both the original dataset schema/config and the seacrowd schema/config.

Public Dataset

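from datasets import load_dataset

# use_auth_token=True reuses the token stored by `huggingface-cli login`;
# newer releases of `datasets` name this parameter `token` instead.
dataset_orig = load_dataset("SEACrowd/<your_dataset_name>", name="source", use_auth_token=True)
dataset_SEACrowd = load_dataset("SEACrowd/<your_dataset_name>", name="SEACrowd", use_auth_token=True)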

Private Dataset

from datasets import load_dataset

# data_dir points at the locally stored files of the private dataset.
dataset_orig = load_dataset(
    "SEACrowd/<your_dataset_name>",
    name="source",
    data_dir="/local/path/to/data/files",
    use_auth_token=True)

# Same config name as in the public example above.
dataset_SEACrowd = load_dataset(
    "SEACrowd/<your_dataset_name>",
    name="SEACrowd",
    data_dir="/local/path/to/data/files",
    use_auth_token=True)
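Beyond checking that the calls return without errors, it can help to confirm that every config yields non-empty splits. The following is a minimal sketch rather than an official SEACrowd test: the helper name `check_configs` and its `load_fn` argument are hypothetical, and it assumes `load_dataset` returns a mapping from split names to datasets, as it does when no `split` argument is given.

```python
def check_configs(repo_id, config_names, load_fn=None, **load_kwargs):
    """Load each config and return {config_name: sorted list of split names}."""
    if load_fn is None:
        # Imported lazily so the helper can be defined without `datasets` installed.
        from datasets import load_dataset

        load_fn = load_dataset
    report = {}
    for name in config_names:
        dataset = load_fn(repo_id, name=name, **load_kwargs)
        # Fail loudly if any split of this config contains no examples.
        for split, data in dataset.items():
            if len(data) == 0:
                raise ValueError(f"config {name!r}: split {split!r} is empty")
        report[name] = sorted(dataset.keys())
    return report
```

For example, `check_configs("SEACrowd/<your_dataset_name>", ["source", "SEACrowd"], use_auth_token=True)` loads both configs from this step; for a private dataset, pass `data_dir` as an extra keyword argument.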

And with that, you have successfully contributed a data-loading script!