Thomson Reuters Data Challenge 2018 - Reuters News Classification

VXU1230/Thomson-Reuters-Data-Challenge

News Data Classification using TF-IDF and Topic Modeling

The Thomson Reuters GHC Machine Learning/Natural Language Challenge (Title Classification): predict the news category from the news content. Challenge repository: https://github.com/thomsonreuters/TR-DataChallenge1

Code

News_Classification.ipynb

  1. Feature Engineering
    A. Tokenization
    B. Punctuation & Stopwords Removal
    C. Lemmatization
  2. Text to Feature
    A. TF-IDF
    B. LDA Topic Modeling
    C. Word Embedding (Word2Vec/GloVe)
    D. Ensemble: TF-IDF + LDA
  3. Training and Hyperparameter Tuning (Ranked by GridSearchCV Best Accuracy Score)
    A. SVM: 0.8947833775419982
    B. Stochastic Gradient Descent: 0.8890994063407857
    C. Logistic Regression: 0.8880889225716811
    D. Naive Bayes: 0.8769736011115321
    E. XGBoost: 0.8676266262473159
    F. KNN: 0.8556271314892004
    G. Random Forest: 0.8505747126436781
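
The first two stages of the outline above (cleaning the text, then turning it into TF-IDF weights) can be sketched in plain Python. This is a minimal illustration and not the notebook's code: the stopword list, the `preprocess` and `tfidf` helpers, and the sample headlines are all hypothetical, lemmatization is left out because it needs an external library, and the smoothed IDF formula is one common variant chosen here as an assumption.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; the notebook's list is not shown).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is", "for"}


def preprocess(text):
    """Lowercase, tokenize on runs of letters (dropping punctuation), remove stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = ln((1 + N) / (1 + df)) + 1   (smoothed variant, assumed here)
    """
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors


# Hypothetical sample headlines, not taken from the challenge data.
docs = [
    "Oil prices rise on supply fears.",
    "Central bank holds interest rates.",
    "Oil supply cut lifts energy stocks.",
]
vectors = tfidf(docs)
```

In the first headline, "prices" appears in only one document while "oil" appears in two, so "prices" receives the larger weight. That is the property that makes TF-IDF useful for separating news categories.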

WordEmbedding.py

A module that creates word embeddings for news data.
Pretrained sources: word2vec-google-news-300; glove-wiki-gigaword-300

Usage

To replace the bag-of-words (BoW) features with word embeddings, import the module and create a WordEmbedding object.
Three options are available for combining the word vectors of a document:
	1. Mean
	2. Sum
	3. IDF Weighted Mean
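
The three options above differ only in how the per-word vectors are weighted and normalized. The sketch below illustrates that arithmetic with a toy two-dimensional lookup table. It is not the interface of the WordEmbedding class, which this README does not show: the `aggregate` function, its parameters, and the toy vectors and IDF weights are all hypothetical.

```python
def aggregate(tokens, vectors, mode="mean", idf=None):
    """Combine per-word vectors into a single document vector.

    tokens  : list of str
    vectors : dict mapping word -> list of floats (hypothetical lookup table)
    mode    : "mean", "sum", or "idf_mean"
    idf     : dict mapping word -> IDF weight, required for "idf_mean"
    """
    known = [t for t in tokens if t in vectors]  # skip out-of-vocabulary words
    if not known:
        return None
    if mode == "idf_mean":
        if idf is None:
            raise ValueError("idf weights are required for mode='idf_mean'")
        weights = [idf.get(t, 1.0) for t in known]
    elif mode in ("mean", "sum"):
        weights = [1.0] * len(known)
    else:
        raise ValueError("unknown mode: " + mode)

    # Weighted sum of the word vectors, one dimension at a time.
    dim = len(vectors[known[0]])
    total = [0.0] * dim
    for token, weight in zip(known, weights):
        for i, value in enumerate(vectors[token]):
            total[i] += weight * value

    if mode == "sum":
        return total
    norm = sum(weights)
    if norm == 0:
        return total  # all weights zero: nothing to normalize by
    return [value / norm for value in total]


# Toy 2-dimensional vectors and IDF weights (illustrative values only).
toy_vectors = {"oil": [1.0, 0.0], "prices": [0.0, 1.0]}
toy_idf = {"oil": 1.0, "prices": 3.0}
tokens = ["oil", "prices", "unknownword"]

mean_vec = aggregate(tokens, toy_vectors, mode="mean")
sum_vec = aggregate(tokens, toy_vectors, mode="sum")
idf_vec = aggregate(tokens, toy_vectors, mode="idf_mean", idf=toy_idf)
```

With these toy values the IDF-weighted mean leans toward the rarer word "prices", whereas the plain mean treats both words equally. With real pretrained vectors the lookup table would have 300 dimensions per word.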
