Language-modelling-from-scratch

This project is about language modelling on the speeches of former Indian Prime Minister Mr. Manmohan Singh.

Note - This is my first project, done just after I completed my NLP course. I wanted to learn how to create an entire project from scratch. I made many mistakes along the way, but it was a great learning experience. Wherever possible, the notebooks point out the mistakes and explain the better alternatives.

The repository consists of 4 notebooks, which demonstrate the following tasks:

Preparing the dataset by scraping the web: This notebook demonstrates how the dataset is prepared by scraping the archive of PM speeches using Scrapy. The scraped speeches form the training dataset.
Cleaning the dataset: This notebook demonstrates how to clean the data using regular expressions in order to maximize the share of the vocabulary covered by GloVe word vectors.
Language modelling using PyTorch: This notebook demonstrates how to create a language model using PyTorch and how to generate better text during inference using Top-K sampling.
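
As a rough illustration of the Top-K sampling idea mentioned above, the sketch below keeps only the k highest-scoring tokens, applies a softmax over those, and samples from the result. It is written in plain Python so it runs without dependencies; the notebook's PyTorch implementation may differ. The function name `top_k_sample` and the example scores are assumptions made for illustration.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring entries of `logits`."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Indices of the k largest logits, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax numerators over the kept logits only
    # (subtracting the max keeps exp() numerically stable).
    max_logit = logits[top[0]]
    weights = [math.exp(logits[i] - max_logit) for i in top]
    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]


# Hypothetical scores over a 5-token vocabulary (not taken from the notebook).
logits = [2.0, 0.5, -1.0, 3.0, 0.1]
samples = {top_k_sample(logits, k=2) for _ in range(200)}
greedy = top_k_sample(logits, k=1)
```

With k=2 only the two best-scoring tokens can ever be drawn, and k=1 reduces to greedy decoding. Larger values of k trade coherence for variety in the generated text.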

Additional: Preparation of the test dataset: When the speeches archive was scraped earlier, some speeches were missed. Those missing speeches are scraped in this notebook and used as the test set during inference.
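
The vocabulary-coverage goal of the cleaning notebook can be checked with a small helper like the one below. This is a minimal sketch, not the notebook's code: the regex rules, the `clean` and `coverage` names, and the tiny `embedding_vocab` set (a stand-in for the words present in the GloVe vectors) are all assumptions for illustration.

```python
import re
from collections import Counter


def clean(text):
    """Lowercase the text and replace punctuation with spaces (illustrative rules)."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s']", " ", text)  # keep letters, digits, apostrophes
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace


def coverage(tokens, vocab):
    """Return (share of distinct words, share of all tokens) found in `vocab`."""
    counts = Counter(tokens)
    if not counts:
        return 0.0, 0.0
    known_types = sum(1 for word in counts if word in vocab)
    known_tokens = sum(c for word, c in counts.items() if word in vocab)
    return known_types / len(counts), known_tokens / sum(counts.values())


# Stand-in for the set of words that have a pretrained vector (hypothetical).
embedding_vocab = {"the", "economy", "is", "growing", "india"}

raw = "The economy is growing; India's economy is resilient!"
tokens = clean(raw).split()
type_cov, token_cov = coverage(tokens, embedding_vocab)
```

Words that remain uncovered after cleaning (here `india's` and `resilient`) show where further cleaning rules, such as splitting possessives, might raise coverage.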
