Language-modelling-from-scratch

This project builds a language model on the speeches of former Indian Prime Minister Mr. Manmohan Singh.

Note - This is my first project, done just after I completed my NLP course. I wanted to learn how to create an entire project from scratch. I made many mistakes along the way, but it was a great learning experience. Wherever possible, the notebooks point out these mistakes and explain better alternatives.

The repository consists of 4 notebooks, which demonstrate the following tasks:

Preparing the dataset by scraping the web: This notebook demonstrates how the dataset is prepared by scraping the archive of PM speeches using Scrapy. This forms the training dataset.
Cleaning the dataset: This notebook demonstrates how to clean the data using regex in order to maximize the share of the vocabulary covered by GloVe word vectors.
Language modelling using PyTorch: This notebook demonstrates how to create a language model using PyTorch and how to generate better text during inference using Top-K sampling.
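The cleaning step above can be illustrated with a minimal sketch. It is not the notebook's actual code: the regex rules, the helper names `clean_text` and `vocab_coverage`, and the tiny `glove_vocab` set are all assumptions made for illustration, and it assumes a lowercase embedding vocabulary. The idea is to normalize the text so that more of its tokens match entries in the pretrained vocabulary, and then measure how much is covered.

```python
import re
from collections import Counter


def clean_text(text):
    """Lowercase, split punctuation into separate tokens, collapse whitespace."""
    text = text.lower()
    # Pad common punctuation with spaces so each mark becomes its own token
    text = re.sub(r'([.,!?;:()"])', r" \1 ", text)
    # Collapse runs of whitespace left over from scraping and padding
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def vocab_coverage(tokens, embedding_vocab):
    """Return (type coverage, token coverage) of tokens by embedding_vocab."""
    if not tokens:
        return 0.0, 0.0
    counts = Counter(tokens)
    covered_types = sum(1 for word in counts if word in embedding_vocab)
    covered_tokens = sum(n for word, n in counts.items() if word in embedding_vocab)
    return covered_types / len(counts), covered_tokens / len(tokens)


# Hypothetical stand-in for the set of words found in a GloVe vectors file
glove_vocab = {"india", "is", "a", "great", "nation", ".", ","}

raw = "India is a great nation.  India, truly!"
tokens = clean_text(raw).split()
type_cov, token_cov = vocab_coverage(tokens, glove_vocab)
print(tokens)
print(f"type coverage: {type_cov:.2f}, token coverage: {token_cov:.2f}")
```

Tracking both numbers is useful: token coverage shows how much of the running text has a pretrained vector, while type coverage shows how many distinct words still need a cleaning rule or a fallback embedding.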

Additional: Preparation of the test dataset: When the speeches archive was scraped earlier, some speeches were missed. This notebook scrapes those missing speeches, which are used as the test set during inference.
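The Top-K sampling mentioned for the language modelling notebook can be sketched without any framework. This is a plain-Python illustration, not the notebook's PyTorch code: the function name `top_k_sample`, the toy `vocab`, and the `logits` values are assumptions. A PyTorch version would typically rely on `torch.topk` and `torch.multinomial` instead.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample an index from the k highest-scoring entries of logits."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Indices of the k largest logits, highest first
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (subtract the max for stability)
    max_logit = logits[top[0]]
    weights = [math.exp(logits[i] - max_logit) for i in top]
    # Draw one index in proportion to those weights
    return rng.choices(top, weights=weights, k=1)[0]


# Toy next-word scores; a real model would emit one logit per vocabulary word
vocab = ["the", "nation", "economy", "growth", "zebra"]
logits = [2.0, 1.5, 1.0, 0.5, -3.0]

rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
samples = [vocab[top_k_sample(logits, k=3, rng=rng)] for _ in range(20)]
print(samples)
```

With `k=3`, only the three highest-scoring words can ever be drawn, so unlikely words such as "zebra" are excluded. Setting `k=1` reduces to greedy decoding, and `k=len(logits)` reduces to sampling from the full distribution.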

Latest commit: chittiman, "Update README.md", bbb53bf, Mar 19, 2021 (5 commits)

Name	Last commit message	Last commit date
.gitignore	Initial commit	Mar 17, 2021
Dataset Cleaning through Regex.ipynb	Added jupyter notebook	Mar 19, 2021
Dataset Preparation through web scraping.ipynb	Added jupyter notebook	Mar 19, 2021
LICENSE	Initial commit	Mar 17, 2021
Language Modelling through Pytorch.ipynb	Added jupyter notebook	Mar 19, 2021
README.md	Update README.md	Mar 19, 2021
Test Dataset Preparation.ipynb	Added jupyter notebook	Mar 19, 2021

chittiman/Language-modelling-from-scratch