neerajvashistha/lda_blog_topic_prediction

blog topic prediction

System Requirements:

  • 16 GB RAM
  • 12 GB RAM GPU 1080 Ti
  • i7-8700 @ 3.20 GHz

Dependencies:

  • Custom NER (from my bachelor's-year project)
  • Google word2vec
  • gensim
  • Mallet LDA
  • spacy
  • nltk

The task is to group/classify 114 blog posts (for 3 of them the blog URL returned no data) into the following topics:

  • Marketing
  • Branding
  • Growth marketing
  • Growth strategies
  • Product Management
  • Product discovery
  • Product Growth
  • Product Management Fundamentals
  • Agile principles
  • Company Culture
  • Company Growth
  • People Management
  • Startup Fundamentals
  • Interpersonal skills
  • Business Fundamentals
  • Business Growth
  • Sales Growth
  • Investment cycle

Steps followed in the machine learning pipeline:

  • Gather data in raw_blog_content.csv using gather_data.ipynb
  • Clean the data
  • Build features
  • Create the model
  • Predict topics
  • Map the predicted topics onto the given topics
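
The first two steps (gathering and cleaning) can be illustrated with a minimal sketch. It is not the code from gather_data.ipynb: it uses the standard-library html.parser as a stand-in for requests and BeautifulSoup4 so that it runs without a network connection, and the sample HTML, the stop-word list, and the helper names are all assumptions made up for the example.

```python
import re
from html.parser import HTMLParser

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "is", "in", "for", "your"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def clean_tokens(text):
    """Lowercase, keep alphabetic tokens, drop stop words and 1-letter tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]


# Made-up page standing in for a downloaded blog post.
sample_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Growth Marketing</h1>"
    "<p>Branding is the key to your growth.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = html_to_text(sample_html)
tokens = clean_tokens(text)
print(tokens)
```

In a real run the HTML would come from the blog URL, and a fuller stop-word list (for example the one shipped with nltk) would replace the toy set.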

To gather the blog content, requests and BeautifulSoup4 were used, together with some simple preprocessing, in gather_data.ipynb. The remaining preprocessing, feature extraction, and model creation are done in lda_topic_modeling.py. Three models were trained and compared on term-document frequency features:

  • lda
  • ldamulticore
  • lda mallet
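
To make "term-document frequency features" concrete, here is a small pure-Python sketch. The documents and function names are hypothetical; the (token id, count) output mirrors the bag-of-words format that gensim's Dictionary.doc2bow produces, but this is not the code from lda_topic_modeling.py.

```python
from collections import Counter

# Toy, already-tokenised documents (hypothetical data).
docs = [
    ["growth", "marketing", "branding", "growth"],
    ["product", "discovery", "agile", "product"],
    ["sales", "growth", "investment"],
]


def build_vocabulary(documents):
    """Map each distinct token to an integer id, assigned in sorted order."""
    vocab = sorted({tok for doc in documents for tok in doc})
    return {tok: idx for idx, tok in enumerate(vocab)}


def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


token2id = build_vocabulary(docs)
corpus = [doc_to_bow(doc, token2id) for doc in docs]

for bow in corpus:
    print(bow)
```

Each document becomes a sparse vector of term counts, which is the input an LDA model is trained on.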

The coherence and perplexity scores of each model were checked, and the best model was picked to predict the topic of a given blog. In this case lda mallet showed the best coherence, around 0.42. Due to time constraints, this metric could not be improved further.
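
As an illustration of how a coherence score can rank models, the sketch below implements the UMass coherence measure in plain Python. This README does not say which coherence measure was used, so UMass is an assumption chosen only because it is simple to write down. UMass scores are usually negative, so they are not comparable to the 0.42 reported above. The documents, topics, and model names are made up.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of one topic: sum over ordered word pairs of
    log((D(w_m, w_l) + 1) / D(w_l)), where D counts documents containing
    the given word(s) and w_l is ranked above w_m in the topic."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                raise ValueError(f"{top_words[l]!r} never occurs in the corpus")
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / denom)
    return score


# Toy corpus (hypothetical data).
docs = [
    ["growth", "marketing", "branding"],
    ["growth", "marketing", "sales"],
    ["product", "agile", "discovery"],
    ["product", "agile", "growth"],
]

# One top-word list per hypothetical model.
topics_by_model = {
    "model_a": ["growth", "marketing", "branding"],   # words that co-occur
    "model_b": ["marketing", "agile", "discovery"],   # words that rarely co-occur
}

scores = {name: umass_coherence(words, docs) for name, words in topics_by_model.items()}
best_model = max(scores, key=scores.get)
print(scores, best_model)
```

A higher (less negative) score means the topic's top words appear together in documents more often, which is the sense in which the best model was picked.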

Lastly, for each blog the prominent topics were calculated and mapped to the given topics using word2vec. I computed the vectors of the predicted topic phrase and of each given topic, and used Word Mover's Distance [https://github.com/mkusner/wmd/] to calculate the document distance between them. The results are written back to a JSON file, articles_topic.json.
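
The mapping step can be sketched as follows. Exact Word Mover's Distance requires solving an optimal-transport problem (gensim exposes it as KeyedVectors.wmdistance), so this example swaps in the simpler relaxed WMD lower bound, in which every word moves all of its weight to its nearest word in the other document. The 2-D embeddings, topic word lists, and record keys are hypothetical stand-ins, not the Google word2vec vectors or the actual layout of articles_topic.json.

```python
import json
import math

# Hypothetical 2-D "embeddings"; real word2vec vectors have hundreds of dimensions.
EMBEDDINGS = {
    "brand": (1.0, 0.0),
    "branding": (0.9, 0.1),
    "marketing": (0.8, 0.3),
    "sales": (0.1, 1.0),
    "revenue": (0.0, 0.9),
    "growth": (0.4, 0.6),
}


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(src, dst):
    """Average distance from each src token to its nearest dst token."""
    return sum(
        min(euclidean(EMBEDDINGS[s], EMBEDDINGS[d]) for d in dst) for s in src
    ) / len(src)


def relaxed_wmd(doc_a, doc_b):
    """Relaxed WMD: the larger of the two one-sided nearest-word costs.
    Tokens without an embedding are ignored."""
    a = [w for w in doc_a if w in EMBEDDINGS]
    b = [w for w in doc_b if w in EMBEDDINGS]
    if not a or not b:
        return math.inf
    return max(one_sided_cost(a, b), one_sided_cost(b, a))


def closest_topic(predicted_words, given_topics):
    """Return the given topic whose words are nearest to the predicted words."""
    return min(given_topics, key=lambda t: relaxed_wmd(predicted_words, given_topics[t]))


# Hypothetical tokenised topic names and one predicted topic phrase.
given_topics = {"Branding": ["branding"], "Sales Growth": ["sales", "growth"]}
predicted = ["brand", "marketing"]

mapped = closest_topic(predicted, given_topics)
record = json.dumps({"predicted_topic": predicted, "mapped_topic": mapped})
print(record)
```

The given topic with the smallest distance is taken as the label for the blog; with real embeddings the same selection logic applies, only the distance function changes.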