Releases: NirantK/hindi2vec

Dataset Release: BBC Hindi v0.1

10 Apr 06:21
545dd0f

This release consists of 4335 Hindi documents with tags from the BBC Hindi News website. The compressed file bbc-hindiv01.tar.gz is what you want to download. You can ignore the source code.

Data Organization

Files

hindi-train.csv and hindi-test.csv - for text classification in Hindi.
We follow the emerging tag, text convention from fastText and fastai.text. This means the first column is the tag/category and the second column is the text content, with the two separated by a tab (\t) character. Each line is one record/example.
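As a rough illustration of the layout described above, here is a minimal sketch that splits each line into a (tag, text) pair using only the standard library. The function name parse_records and the two sample lines are made up for this example and are not taken from the dataset.

```python
from typing import Iterable, List, Tuple


def parse_records(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split 'tag<TAB>text' lines into (tag, text) pairs."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the text are preserved.
        tag, text = line.split("\t", 1)
        records.append((tag, text))
    return records


# Hypothetical sample lines, only to show the expected shape of a record.
sample = ["india\tयह एक उदाहरण है\n", "sport\tयह दूसरा उदाहरण है\n"]
records = parse_records(sample)
```

With the real files, the same function could be fed an open file handle, e.g. open('hindi-train.csv', encoding='utf-8').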

bbc-hindi-news.json contains extra information for each record such as url, heading and intro. The heading attribute can be used as a text summary of the entire piece.
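A small sketch of how the heading could be pulled out as a summary. It assumes the JSON file is a single array of objects with url, heading and intro keys; the exact structure is not described in this release, so please check the file before relying on it. The function name heading_summaries and the sample record are hypothetical.

```python
import json
from typing import List, Optional, Tuple


def heading_summaries(raw_json: str) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (url, heading) pairs from a JSON array of records.

    Assumption: the top-level JSON value is a list of objects.
    """
    records = json.loads(raw_json)
    # .get() returns None instead of raising if a key is missing.
    return [(rec.get("url"), rec.get("heading")) for rec in records]


# Hypothetical record, only to illustrate the assumed shape.
sample = json.dumps(
    [{"url": "https://example.com/a", "heading": "शीर्षक", "intro": "परिचय"}],
    ensure_ascii=False,
)
pairs = heading_summaries(sample)
```

For the real file, the text could be read with open('bbc-hindi-news.json', encoding='utf-8').read() and passed to the function.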

Tags

There are 14 unique categories. Each document has exactly one tag associated with it. These are the tags: india, pakistan, news, international, entertainment, sport, science, china, learningenglish, social, southasia, business, institutional, multimedia
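If you want to sanity-check the tags after loading, something like the sketch below may help. The tag set is copied from the list above; the function name tag_report and the sample input are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# The 14 categories listed in this release.
KNOWN_TAGS = {
    "india", "pakistan", "news", "international", "entertainment",
    "sport", "science", "china", "learningenglish", "social",
    "southasia", "business", "institutional", "multimedia",
}


def tag_report(tags: Iterable[str]) -> Tuple[Counter, List[str]]:
    """Count documents per tag and list any tag not in KNOWN_TAGS."""
    counts = Counter(tags)
    unknown = sorted(set(counts) - KNOWN_TAGS)
    return counts, unknown


# Hypothetical input; "weather" is deliberately not a valid tag.
counts, unknown = tag_report(["india", "sport", "india", "weather"])
```

An empty unknown list on the real data would suggest the tag column was parsed correctly.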

Encoding Issues

Please add encoding='utf-8' when loading these files. The files above have been tested and load without encoding/decoding issues. If you still face difficulty, please raise an issue. I am more than happy to help.

E.g. when reading the hindi-train.csv:

import pandas as pd
df_train = pd.read_csv('hindi-train.csv', sep="\t", encoding='utf-8', header=None)