I analyzed the transcripts of 117,937 television episodes from 4,667 different television shows using Latent Dirichlet Allocation (LDA) to find clusters of common language across shows, and then used those similarities to build a content-based recommender for television shows.
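The sketch below illustrates the general idea with gensim: train an LDA model on tokenized transcripts, represent each show by its topic distribution, and recommend the shows whose distributions are most similar. It is a minimal toy example, not the project's final pipeline; the tiny `episode_tokens` dict stands in for the real preprocessed transcript data produced in `final_code`.

```python
# Minimal sketch of the LDA + similarity approach, assuming `episode_tokens`
# maps a show name to its preprocessed transcript tokens (placeholder data).
from gensim import corpora, models, similarities

episode_tokens = {
    "Show A": ["detective", "case", "murder", "precinct"],
    "Show B": ["hospital", "surgery", "patient", "doctor"],
    "Show C": ["crime", "detective", "evidence", "trial"],
}

shows = list(episode_tokens)
dictionary = corpora.Dictionary(episode_tokens.values())
corpus = [dictionary.doc2bow(tokens) for tokens in episode_tokens.values()]

# Train LDA; the real model was fit on ~118k episode transcripts with many more topics.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=42)

# Index each show by its topic distribution, then rank other shows by cosine similarity.
index = similarities.MatrixSimilarity(lda[corpus], num_features=lda.num_topics)

def recommend(show, top_n=2):
    sims = index[lda[corpus[shows.index(show)]]]
    ranked = sorted(zip(shows, sims), key=lambda pair: -pair[1])
    return [(s, float(score)) for s, score in ranked if s != show][:top_n]

print(recommend("Show A"))
```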
- Python==3.7.3
- gensim==3.8.1
- Flask==1.1.1
- nltk==3.4.5
- pandas==0.25.2
- matplotlib==3.1.1
- numpy==1.17.2
- spacy==2.2.1
- spacy-langdetect==0.1.2
- beautifulsoup4==4.8.0
For the Google Cloud virtual instance (see the storage sketch after this list):
- a virtual machine with at least 104 GB of RAM
- google-api-core==1.14.3
- google-auth==1.7.1
- google-auth-oauthlib==0.4.1
- google-cloud==0.34.0
- google-cloud-core==1.0.3
- google-cloud-storage==1.23.0
- google-pasta==0.1.8
- google-resumable-media==0.5.0
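As a rough illustration of how these packages are used, the snippet below downloads transcript data from a Cloud Storage bucket onto the VM. The bucket and blob names are placeholders, not the project's actual paths.

```python
# Hedged sketch: fetching a transcript file from Cloud Storage (placeholder names).
from google.cloud import storage

client = storage.Client()  # picks up credentials from GOOGLE_APPLICATION_CREDENTIALS
bucket = client.bucket("my-transcripts-bucket")
blob = bucket.blob("transcripts/episodes.csv")
blob.download_to_filename("episodes.csv")
```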
All final production code is in the final_code folder, while the development_code folder contains other pieces of code written during the project that were not used to create the final result. The notebooks and Python scripts are listed in chronological order. None of my final data is posted because of its size (2.6 GB), but please contact me if you would like a copy!