Releases: NirantK/hindi2vec
Dataset Release: BBC Hindi v0.1
This release consists of 4335 Hindi documents with tags from the BBC Hindi News website. The compressed file bbc-hindiv01.tar.gz is what you want to download. You can ignore the source code.
Data Organization
Files
hindi-train.csv and hindi-test.csv - for text classification in Hindi.
We follow the emerging tag, text convention from fastText and fastai.text. This means the first column is the tag/category and the second column is the text content, with the two separated by a tab (\t) character. Each line is one record/example.
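To make the layout concrete, here is a minimal sketch of splitting such lines into (tag, text) pairs using only the standard library. The helper name parse_records and the two sample lines are made up for illustration and are not part of the release. The sketch assumes plain tab-separated lines with no quoting.

```python
import io


def parse_records(lines):
    """Yield (tag, text) pairs from tab-separated lines (hypothetical helper)."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the text are preserved.
        tag, sep, text = line.partition("\t")
        if not sep:
            raise ValueError(f"no tab separator found in line: {line!r}")
        yield tag, text


# Made-up sample records, not taken from the dataset.
sample = io.StringIO("india\tयह एक उदाहरण है\nsport\tयह दूसरा उदाहरण है\n")
records = list(parse_records(sample))
```

For the real files, the same function could be fed an open file handle, for example one opened with open('hindi-train.csv', encoding='utf-8').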
bbc-hindi-news.json contains extra information for each record, such as url, heading and intro. The heading attribute can be used as a text summary of the entire piece.
Tags
There are 14 unique categories. Each document has exactly one tag associated with it. These are the tags: india, pakistan, news, international, entertainment, sport, science, china, learningenglish, social, southasia, business, institutional, multimedia
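As a quick sanity check after loading, the tag column can be compared against the 14 categories listed above. The sketch below is one possible way to do this. The names EXPECTED_TAGS and summarize_tags are hypothetical, and the input list is a made-up example rather than real data.

```python
from collections import Counter

# The 14 categories listed in this release.
EXPECTED_TAGS = {
    "india", "pakistan", "news", "international", "entertainment",
    "sport", "science", "china", "learningenglish", "social",
    "southasia", "business", "institutional", "multimedia",
}


def summarize_tags(tags):
    """Return per-tag counts and a sorted list of tags not in EXPECTED_TAGS."""
    counts = Counter(tags)
    unexpected = sorted(set(counts) - EXPECTED_TAGS)
    return counts, unexpected


# Made-up input; "cricket" is deliberately not one of the listed tags.
counts, unexpected = summarize_tags(["india", "sport", "india", "cricket"])
```

An empty unexpected list for the tag column of the real files would suggest that the file was parsed with the right separator.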
Encoding Issues
Please add encoding='utf-8' when loading these files. The files above have been tested and load without encoding/decoding issues. In case you still face difficulty, please raise an issue. I am more than happy to help.
E.g. when reading hindi-train.csv:
import pandas as pd
df_train = pd.read_csv('hindi-train.csv', sep="\t", encoding='utf-8', header=None)