Scraping an archive to create a dataset of speeches by former Indian PM Mr. Manmohan Singh, and building a language model on it from scratch

chittiman/Language-modelling-from-scratch

Language-modelling-from-scratch

This project builds a language model on the speeches of former Indian PM Mr. Manmohan Singh.

Note: This is my first project, done just after I completed my NLP course. I wanted to learn how to create an entire project from scratch. I made many mistakes along the way, but it was a great learning experience. Wherever possible, the notebooks point out the mistakes and explain the better alternatives.

The repository consists of 4 notebooks, which demonstrate the following tasks:

  1. Preparing the dataset by scraping the web: This notebook demonstrates how the dataset is prepared by scraping the archives of PM speeches using Scrapy. This forms our training dataset.
  2. Cleaning the dataset: This notebook demonstrates how to clean the data using regex in order to maximize the vocabulary coverage of the GloVe word vectors.
  3. Language modelling using PyTorch: This notebook demonstrates how to create a language model using PyTorch and how to generate better text during inference using Top-K sampling.
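
The cleaning step in the second notebook can be illustrated with a minimal sketch. The regex rules, the function names `clean_text` and `vocabulary_coverage`, and the tiny `glove_vocab` set below are assumptions made for illustration; they are not taken from the notebook, which works against the real GloVe vocabulary. The idea is to normalize the text so that more of its tokens match entries in the pretrained vocabulary, and then measure how much is covered.

```python
import re
from collections import Counter


def clean_text(text):
    """Lowercase the text and split punctuation off from words."""
    text = text.lower()
    # Put spaces around punctuation so "growing," becomes "growing ,"
    text = re.sub(r'([.,!?;:()"])', r" \1 ", text)
    # Collapse runs of whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()


def vocabulary_coverage(tokens, known_words):
    """Return (type coverage, token coverage) of tokens against known_words."""
    counts = Counter(tokens)
    if not counts:
        return 0.0, 0.0
    covered_types = sum(1 for word in counts if word in known_words)
    covered_tokens = sum(n for word, n in counts.items() if word in known_words)
    return covered_types / len(counts), covered_tokens / sum(counts.values())


# Hypothetical stand-in for the GloVe vocabulary (the real one has far more entries)
glove_vocab = {"the", "economy", "is", "growing", ",", "."}

raw = "The economy is growing, steadily."
tokens = clean_text(raw).split()
type_cov, token_cov = vocabulary_coverage(tokens, glove_vocab)
print(tokens)
print(f"type coverage: {type_cov:.2f}, token coverage: {token_cov:.2f}")
```

Without the punctuation rule, `growing,` and `steadily.` would both count as out-of-vocabulary words, which is the kind of loss this cleaning is meant to reduce.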

Additional: Preparation of the test dataset: Some speeches were missed when the speeches archive was first scraped. This notebook scrapes those missing speeches, which are used as the test set during inference.
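
The Top-K sampling used at inference time in the language modelling notebook can be sketched as follows. This is a dependency-free illustration in plain Python rather than the PyTorch code from the notebook; the function name `top_k_sample` and the example logits are assumptions. At each step, only the `k` highest-scoring tokens are kept, their scores are turned into probabilities with a softmax, and the next token is drawn from that reduced set.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring entries of logits."""
    # Indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits (shifted by the max for stability)
    max_logit = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - max_logit) for i in top]
    # random.choices normalizes the weights internally
    return rng.choices(top, weights=weights, k=1)[0]


# Hypothetical scores for a 5-token vocabulary
logits = [2.0, 0.5, -1.0, 3.0, 0.1]
rng = random.Random(0)
samples = [top_k_sample(logits, 2, rng) for _ in range(200)]
print(sorted(set(samples)))
```

With `k=1` this reduces to greedy decoding, which tends to produce repetitive text; a larger `k` adds variety while still excluding the low-probability tail.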
