diff --git a/notebook/bnlp_colab_training.ipynb b/notebook/bnlp_colab_training.ipynb
index 79684f4..af5ff14 100644
--- a/notebook/bnlp_colab_training.ipynb
+++ b/notebook/bnlp_colab_training.ipynb
@@ -1,6 +1,6 @@
{
"nbformat": 4,
- "nbformat_minor": 0,
+ "nbformat_minor": 2,
"metadata": {
"colab": {
"name": "bnlp_colab_training.ipynb",
@@ -18,53 +18,46 @@
"cells": [
{
"cell_type": "markdown",
+ "source": [
+ ""
+ ],
"metadata": {
"id": "view-in-github",
"colab_type": "text"
- },
- "source": [
- ""
- ]
+ }
},
{
"cell_type": "markdown",
- "metadata": {
- "id": "0SQ0x9bh9QsL"
- },
"source": [
"# BNLP\n",
"\n",
"BNLP is a natural language processing toolkit for Bengali Language. This tool will help you to tokenize Bengali text, Embedding Bengali words, Bengali POS Tagging, Construct Neural Model for Bengali NLP purposes.\n",
"\n",
"Here we are prodiving training approach of different model using **BNLP**"
- ]
+ ],
+ "metadata": {
+ "id": "0SQ0x9bh9QsL"
+ }
},
{
"cell_type": "markdown",
- "metadata": {
- "id": "MuT4uyIf5-Gy"
- },
"source": [
"## Installation"
- ]
+ ],
+ "metadata": {
+ "id": "MuT4uyIf5-Gy"
+ }
},
{
"cell_type": "code",
- "metadata": {
- "id": "KJN642aj5nVc",
- "outputId": "20f88496-2e42-47e1-b70d-e4dca8037351",
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 462
- }
- },
+ "execution_count": 1,
"source": [
"!pip install -U bnlp_toolkit"
],
- "execution_count": 1,
"outputs": [
{
"output_type": "stream",
+ "name": "stdout",
"text": [
"Collecting bnlp_toolkit\n",
" Downloading https://files.pythonhosted.org/packages/16/be/44d78b55ad8121cce1ca0bdbc7cf1db8d3f585006bacb08bd53ec8653957/bnlp_toolkit-3.0.0-py3-none-any.whl\n",
@@ -91,30 +84,30 @@
"Requirement already satisfied, skipping upgrade: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim->bnlp_toolkit) (1.24.3)\n",
"Installing collected packages: python-crfsuite, sklearn-crfsuite, sentencepiece, bnlp-toolkit\n",
"Successfully installed bnlp-toolkit-3.0.0 python-crfsuite-0.9.7 sentencepiece-0.1.91 sklearn-crfsuite-0.3.6\n"
- ],
- "name": "stdout"
+ ]
}
- ]
+ ],
+ "metadata": {
+ "id": "KJN642aj5nVc",
+ "outputId": "20f88496-2e42-47e1-b70d-e4dca8037351",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 462
+ }
+ }
},
{
"cell_type": "markdown",
- "metadata": {
- "id": "IWy0qUdy6BY3"
- },
"source": [
"## Downloading Bengali Processed Wikipedia Data "
- ]
+ ],
+ "metadata": {
+ "id": "IWy0qUdy6BY3"
+ }
},
{
"cell_type": "code",
- "metadata": {
- "id": "AcwFE8le5yTF",
- "outputId": "69cad5d1-3917-4376-bc81-ea1340cfd240",
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 51
- }
- },
+ "execution_count": 2,
"source": [
"#drive data download code\n",
"!pip install -U -q PyDrive\n",
@@ -132,36 +125,40 @@
"!unzip bn_wiki_data.txt.zip\n",
"!rm -rf bn_wiki_data.txt.zip"
],
- "execution_count": 2,
"outputs": [
{
"output_type": "stream",
+ "name": "stdout",
"text": [
"Archive: bn_wiki_data.txt.zip\n",
" inflating: bn_wiki_data.txt \n"
- ],
- "name": "stdout"
+ ]
+ }
+ ],
+ "metadata": {
+ "id": "AcwFE8le5yTF",
+ "outputId": "69cad5d1-3917-4376-bc81-ea1340cfd240",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 51
}
- ]
+ }
},
{
"cell_type": "markdown",
- "metadata": {
- "id": "350KPo4D6Z4o"
- },
"source": [
"## Training\n",
"\n",
"Here we present `bengali sentencepiece`, `bengali word2vec`, `bengali fasttext` training on `bengali wikipedia data`\n",
"\n",
"Training time will depend on data size."
- ]
+ ],
+ "metadata": {
+ "id": "350KPo4D6Z4o"
+ }
},
{
"cell_type": "markdown",
- "metadata": {
- "id": "I_wHJFOW6dlo"
- },
"source": [
"### Training Bengali Sentencepice Model\n",
"\n",
@@ -169,18 +166,14 @@
"\n",
"* `wiki_sp.model` \n",
"* `wiki_sp.vecab`"
- ]
+ ],
+ "metadata": {
+ "id": "I_wHJFOW6dlo"
+ }
},
{
"cell_type": "code",
- "metadata": {
- "id": "8l7DUWI66MD4",
- "outputId": "d7710e45-6981-432e-96fb-9ec2b9c159ab",
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 85
- }
- },
+ "execution_count": 3,
"source": [
"from bnlp import SentencepieceTokenizer\n",
"\n",
@@ -190,25 +183,29 @@
"vocab_size = 30000\n",
"bsp.train(data, model_prefix, vocab_size) "
],
- "execution_count": 3,
"outputs": [
{
"output_type": "stream",
+ "name": "stdout",
"text": [
"punkt not found. downloading...\n",
"[nltk_data] Downloading package punkt to /root/nltk_data...\n",
"[nltk_data] Unzipping tokenizers/punkt.zip.\n",
"wiki_sp.model and wiki_sp.vocab is saved on your current directory\n"
- ],
- "name": "stdout"
+ ]
+ }
+ ],
+ "metadata": {
+ "id": "8l7DUWI66MD4",
+ "outputId": "d7710e45-6981-432e-96fb-9ec2b9c159ab",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 85
}
- ]
+ }
},
{
"cell_type": "markdown",
- "metadata": {
- "id": "k-k4Dszo61v2"
- },
"source": [
"### Training Bengali Word2Vec Model\n",
"\n",
@@ -218,18 +215,14 @@
"* `wiki_word2vec.vector`\n",
"* `wiki_word2vec.model.trainables.syn1neg.npy`\n",
"* `wiki_word2vec..model.wv.vectors.npy`\n"
- ]
+ ],
+ "metadata": {
+ "id": "k-k4Dszo61v2"
+ }
},
{
"cell_type": "code",
- "metadata": {
- "id": "OphHV5Yp60KW",
- "outputId": "7ce2a259-6339-494d-e023-5ffbe787c774",
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 88
- }
- },
+ "execution_count": 4,
"source": [
"from bnlp import BengaliWord2Vec\n",
"bwv = BengaliWord2Vec()\n",
@@ -238,37 +231,42 @@
"vector_name = \"wiki_word2vec.vector\"\n",
"bwv.train(data_file, model_name, vector_name)"
],
- "execution_count": 4,
"outputs": [
{
"output_type": "stream",
+ "name": "stderr",
"text": [
"/usr/local/lib/python3.6/dist-packages/smart_open/smart_open_lib.py:252: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function\n",
" 'See the migration notes for details: %s' % _MIGRATION_NOTES_URL\n"
- ],
- "name": "stderr"
+ ]
},
{
"output_type": "stream",
+ "name": "stdout",
"text": [
"wiki_word2vec.model and wiki_word2vec.vector saved in your current directory.\n"
- ],
- "name": "stdout"
+ ]
+ }
+ ],
+ "metadata": {
+ "id": "OphHV5Yp60KW",
+ "outputId": "7ce2a259-6339-494d-e023-5ffbe787c774",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 88
}
- ]
+ }
},
{
+ "cell_type": "markdown",
"source": [
"### Pre-training or resume Bengali word2vec training"
],
- "cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
- "outputs": [],
"source": [
"from bnlp import BengaliWord2Vec\n",
"bwv = BengaliWord2Vec()\n",
@@ -278,38 +276,33 @@
"model_name = \"test_model.model\"\n",
"vector_name = \"test_vector.vector\"\n",
"bwv.pretrain(trained_model_path, data_file, model_name, vector_name, epochs=5)"
- ]
+ ],
+ "outputs": [],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "metadata": {
- "id": "TAMgr4WT8x2a"
- },
"source": [
"### Training Bengali Fasttext Model\n",
"First of all install `fasttext` using `pip install fasttext` and restart runtime.\n",
"\n",
"After successfully training it will produce: \n",
"* `wiki_fasttext.bin` "
- ]
+ ],
+ "metadata": {
+ "id": "TAMgr4WT8x2a"
+ }
},
{
"cell_type": "code",
- "metadata": {
- "id": "JXptOhxg4s6r",
- "outputId": "a9386ef0-032c-437e-c416-e34bce2b792e",
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 258
- }
- },
+ "execution_count": 5,
"source": [
"!pip install fasttext"
],
- "execution_count": 5,
"outputs": [
{
"output_type": "stream",
+ "name": "stdout",
"text": [
"Collecting fasttext\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/f8/85/e2b368ab6d3528827b147fdb814f8189acc981a4bc2f99ab894650e05c40/fasttext-0.9.2.tar.gz (68kB)\n",
@@ -324,16 +317,21 @@
"Successfully built fasttext\n",
"Installing collected packages: fasttext\n",
"Successfully installed fasttext-0.9.2\n"
- ],
- "name": "stdout"
+ ]
}
- ]
+ ],
+ "metadata": {
+ "id": "JXptOhxg4s6r",
+ "outputId": "a9386ef0-032c-437e-c416-e34bce2b792e",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 258
+ }
+ }
},
{
"cell_type": "code",
- "metadata": {
- "id": "F67Yzdu08xBd"
- },
+ "execution_count": 1,
"source": [
"from bnlp.embedding.fasttext import BengaliFasttext\n",
"\n",
@@ -343,44 +341,40 @@
"epoch = 1\n",
"bft.train(data, model_name, epoch)"
],
- "execution_count": 1,
- "outputs": []
+ "outputs": [],
+ "metadata": {
+ "id": "F67Yzdu08xBd"
+ }
},
{
"cell_type": "markdown",
- "metadata": {
- "id": "ZtsLVmOs9lgG"
- },
"source": [
"### Training Bengali POS TAGGING CRF model\n",
"\n",
"After successfully training it will produce a trained model with accuracy on evaluation data: \n",
"\n",
"* `pos_model.pkl`"
- ]
+ ],
+ "metadata": {
+ "id": "ZtsLVmOs9lgG"
+ }
},
{
"cell_type": "code",
- "metadata": {
- "id": "VUKhbkaBE-CV",
- "outputId": "e3cd7857-9fec-42dc-d2c2-f037c3ab55f5",
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 170
- }
- },
+ "execution_count": 2,
"source": [
"from bnlp import POS\n",
"bn_pos = POS()\n",
"model_name = \"pos_model.pkl\"\n",
- "tagged_sentences = [[('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'),('কার্পেট', 'NC'), ('৷', 'PU')], [('মাটি', 'NC'), ('থেকে', 'PP'), ('বড়জোর', 'JQ'), ('চার', 'JQ'), ('পাঁচ', 'JQ'), ('ফুট', 'CCL'), ('উঁচু', 'JJ'), ('হবে', 'VM'), ('৷', 'PU')]]\n",
+ "train_data = [[('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'),('কার্পেট', 'NC'), ('৷', 'PU')], [('মাটি', 'NC'), ('থেকে', 'PP'), ('বড়জোর', 'JQ'), ('চার', 'JQ'), ('পাঁচ', 'JQ'), ('ফুট', 'CCL'), ('উঁচু', 'JJ'), ('হবে', 'VM'), ('৷', 'PU')]]\n",
+ "test_data = [[('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'),('কার্পেট', 'NC'), ('৷', 'PU')], [('মাটি', 'NC'), ('থেকে', 'PP'), ('বড়জোর', 'JQ'), ('চার', 'JQ'), ('পাঁচ', 'JQ'), ('ফুট', 'CCL'), ('উঁচু', 'JJ'), ('হবে', 'VM'), ('৷', 'PU')]]\n",
"\n",
- "bn_pos.train(model_name, tagged_sentences)"
+ "bn_pos.train(model_name, train_data, test_data)"
],
- "execution_count": 2,
"outputs": [
{
"output_type": "stream",
+ "name": "stdout",
"text": [
"1\n",
"1\n",
@@ -391,45 +385,46 @@
"Accuracy is: \n",
"0.1111111111111111\n",
"Model Saved!\n"
- ],
- "name": "stdout"
+ ]
}
- ]
+ ],
+ "metadata": {
+ "id": "VUKhbkaBE-CV",
+ "outputId": "e3cd7857-9fec-42dc-d2c2-f037c3ab55f5",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 170
+ }
+ }
},
{
"cell_type": "markdown",
- "metadata": {
- "id": "dPB7SBrKuSna"
- },
"source": [
"## Training Bengali NER model\n",
"After successfully training it will produce a trained model with accuracy on evaluation data:\n",
"\n",
"* `ner_model.pkl` "
- ]
+ ],
+ "metadata": {
+ "id": "dPB7SBrKuSna"
+ }
},
{
"cell_type": "code",
- "metadata": {
- "id": "of_1lkdW917n",
- "outputId": "b3d54074-cda1-46d9-b4f9-800c50e4ef18",
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 170
- }
- },
+ "execution_count": 3,
"source": [
"from bnlp import NER\n",
"bn_ner = NER()\n",
"model_name = \"ner_model.pkl\"\n",
- "tagged_sentences = [[('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')], [('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')], [('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')]]\n",
+ "train_data = [[('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')], [('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')], [('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')]]\n",
+ "test_data = [[('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')], [('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')], [('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')]]\n",
"\n",
- "bn_ner.train(model_name, tagged_sentences)"
+ "bn_ner.train(model_name, train_data, test_data)"
],
- "execution_count": 3,
"outputs": [
{
"output_type": "stream",
+ "name": "stdout",
"text": [
"2\n",
"1\n",
@@ -440,19 +435,26 @@
"Accuracy is: \n",
"1.0\n",
"Model Saved!\n"
- ],
- "name": "stdout"
+ ]
}
- ]
+ ],
+ "metadata": {
+ "id": "of_1lkdW917n",
+ "outputId": "b3d54074-cda1-46d9-b4f9-800c50e4ef18",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 170
+ }
+ }
},
{
"cell_type": "code",
+ "execution_count": null,
+ "source": [],
+ "outputs": [],
"metadata": {
"id": "qVrYxT5DulwP"
- },
- "source": [],
- "execution_count": null,
- "outputs": []
+ }
}
]
}
\ No newline at end of file