This repository contains Python scripts for downloading and tokenizing datasets from the Hugging Face Datasets library.
To use these scripts, you will need:
- Python 3.x
- The Hugging Face `datasets` library (installable via pip)
- A Conda environment (recommended, but not required)
- The `HF_DATASETS_CACHE` environment variable, pointed at a directory where the dataset cache will be stored so that it does not fill up the default disk
To install the `datasets` library via pip, run:

```bash
pip install datasets
```
Before using the scripts, set the `HF_DATASETS_CACHE` environment variable to indicate where the cache will be stored. For example, to point it at a directory called `hf_datasets_cache` under a directory of your choice, run:

```bash
export HF_DATASETS_CACHE="$YOURDIR/hf_datasets_cache"
```
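If you want to confirm that the variable is picked up, the `datasets` library exposes the resolved cache path in its `config` module; a quick check (assuming a recent `datasets` version) looks like:

```python
# Quick sanity check that datasets picked up the cache location.
import datasets.config

print(datasets.config.HF_DATASETS_CACHE)  # should print $YOURDIR/hf_datasets_cache
```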
To download a dataset, run the `download_dataset.py` script. By default, it downloads the `c4` dataset to a subdirectory of the `downloaded_dataset/hf_datasets_cache` directory.

```bash
python download_dataset.py
```
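For reference, the core of such a download script is just a `load_dataset` call. The following is a minimal sketch assuming the default `c4`/`en`/`train` settings mentioned above; the variable names mirror those in `download_dataset.py`, but the actual script may differ:

```python
# Minimal sketch of a download script; the actual download_dataset.py may differ.
from datasets import load_dataset

dataset_name = "c4"    # Hugging Face dataset to download
subset_name = "en"     # dataset configuration / subset
split_name = "train"   # split to download

# load_dataset stores files under HF_DATASETS_CACHE unless cache_dir overrides it.
dataset = load_dataset(
    dataset_name,
    subset_name,
    split=split_name,
    cache_dir="downloaded_dataset/hf_datasets_cache",
)
print(dataset)
```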
To tokenize a downloaded dataset, run the `tokenize_dataset.py` script with the desired tokenization method as an argument. By default, it tokenizes the `c4` dataset using the GPT-2 tokenizer from the Hugging Face `transformers` library and saves the resulting tokenized dataset to a binary file called `c4_en_train_gpt2` in the `tokenized_bin` directory.

```bash
python tokenize_dataset.py
```
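As a rough picture of what this step does, the sketch below tokenizes the downloaded split with the GPT-2 tokenizer and appends the token ids to a single flat binary file. The exact output format of `tokenize_dataset.py` (here assumed to be raw `uint16` ids) may differ:

```python
# Minimal sketch of a tokenization script; the actual tokenize_dataset.py may differ.
import os

import numpy as np
from datasets import load_dataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
dataset = load_dataset("c4", "en", split="train",
                       cache_dir="downloaded_dataset/hf_datasets_cache")

# Append each document's token ids to one flat binary file of uint16 values
# (GPT-2's vocabulary of 50257 tokens fits in 16 bits).
os.makedirs("tokenized_bin", exist_ok=True)
with open("tokenized_bin/c4_en_train_gpt2", "wb") as f:
    for example in dataset:
        ids = tokenizer(example["text"])["input_ids"]
        np.asarray(ids, dtype=np.uint16).tofile(f)
```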
Note that you can specify a different dataset by modifying the `dataset_name`, `split_name`, and `subset_name` variables in the `download_dataset.py` script (as in the sketch below), and you can specify a different tokenization method by editing the corresponding variable in the `tokenize_dataset.py` script.
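For example, switching the download to `openwebtext` would amount to changing those variables along these lines (a hypothetical edit; `openwebtext` has no subset/configuration, so that variable is set to `None`):

```python
# Hypothetical edit inside download_dataset.py to target openwebtext instead of c4.
dataset_name = "openwebtext"
subset_name = None       # openwebtext has no configuration / subset
split_name = "train"     # openwebtext ships only a train split
```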
To see example usages (e.g., downloading and tokenizing openwebtext), check the code in the `example` folder.