The Thomson Reuters GHC Machine Learning/Natural Language Challenge (Title Classification): predict the news category from the news content. https://github.com/thomsonreuters/TR-DataChallenge1
- Feature Engineering
A. Tokenization
B. Punctuation & Stopwords Removal
C. Lemmatization
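A minimal sketch of these three preprocessing steps, assuming NLTK; the repo may use a different tokenizer or lemmatizer, and `preprocess` is an illustrative helper name, not necessarily the project's function:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources
# ("punkt_tab" is only needed on newer NLTK releases).
for pkg in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()


def preprocess(text):
    """Tokenize a title, drop punctuation and stopwords, then lemmatize."""
    tokens = word_tokenize(text.lower())                                   # A. tokenization
    tokens = [t for t in tokens
              if t not in string.punctuation and t not in STOPWORDS]       # B. removal
    return [LEMMATIZER.lemmatize(t) for t in tokens]                       # C. lemmatization


print(preprocess("Stocks rallied after the central banks cut rates."))
# -> ['stock', 'rallied', 'central', 'bank', 'cut', 'rate']
```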
- Text to Feature
A. TF-IDF
B. LDA Topic Modeling
C. Word Embedding (Word2Vec/GloVe)
D. Ensemble: TF-IDF + LDA
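As an illustration of option D (TF-IDF + LDA), one way to build the combined feature matrix with scikit-learn is to concatenate the TF-IDF matrix with the LDA document-topic matrix. The toy documents, vectorizer settings, and topic count below are placeholders, not the repo's actual configuration:

```python
from scipy.sparse import hstack
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-in for the preprocessed news titles.
docs = [
    "fed raise interest rate",
    "central bank cut rate",
    "new smartphone release today",
    "tech giant unveil phone",
]

# A. TF-IDF features.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

# B. LDA topic distributions (LDA expects raw term counts, not TF-IDF weights).
counts = CountVectorizer()
X_counts = counts.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
X_topics = lda.fit_transform(X_counts)          # shape: (n_docs, n_topics)

# D. Ensemble: concatenate the sparse TF-IDF matrix with the dense topic matrix.
X_ensemble = hstack([X_tfidf, X_topics])
print(X_ensemble.shape)                         # (4, n_tfidf_terms + 2)
```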
- Training and Hyperparameter Tuning (Ranked by GridSearchCV Best Accuracy Score)
A. SVM: 0.8947833775419982
B. Stochastic Gradient Descent: 0.8890994063407857
C. Logistic Regression: 0.8880889225716811
D. Naive Bayes: 0.8769736011115321
E. XGBoost: 0.8676266262473159
F. KNN: 0.8556271314892004
G. Random Forest: 0.8505747126436781
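A hedged sketch of the tuning setup for the top-scoring model (SVM). The pipeline, parameter grid, toy data, and cross-validation settings here are illustrative assumptions, not the configuration that produced the scores above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the preprocessed titles and their category labels.
texts = [
    "fed raise interest rate", "central bank cut rate",
    "stock market rally today", "bond yield fall sharply",
    "new smartphone release today", "tech giant unveil phone",
    "chip maker report earnings", "software update fix bug",
]
labels = ["finance"] * 4 + ["tech"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])

# Illustrative grid; the actual search space used for the scores above may differ.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "svm__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=2, n_jobs=-1)
search.fit(texts, labels)
print(search.best_score_, search.best_params_)
```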
A module to create word embeddings for the news data.
Pre-trained sources: word2vec-google-news-300; glove-wiki-gigaword-300
To replace the bag-of-words (BoW) features with word embeddings, simply import the module and create a WordEmbedding object.
Three options for aggregating the word vectors into a single document vector (see the sketch after this list):
1. Mean
2. Sum
3. IDF Weighted Mean
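A minimal sketch of what such a WordEmbedding wrapper could look like, built on gensim's downloader API for the two pre-trained sources listed above. The class name matches the description, but the method names and the IDF plumbing are assumptions about the interface, not the module's actual API:

```python
import gensim.downloader as api
import numpy as np


class WordEmbedding:
    """Collapse a tokenized news title into a single dense vector."""

    def __init__(self, source="word2vec-google-news-300"):
        # "glove-wiki-gigaword-300" is the other supported source.
        # Note: the first call downloads a large (GB-scale) model.
        self.kv = api.load(source)
        self.dim = self.kv.vector_size

    def _vectors(self, tokens):
        return [self.kv[t] for t in tokens if t in self.kv]

    # 1. Mean of the in-vocabulary word vectors.
    def mean(self, tokens):
        vecs = self._vectors(tokens)
        return np.mean(vecs, axis=0) if vecs else np.zeros(self.dim)

    # 2. Sum of the in-vocabulary word vectors.
    def sum(self, tokens):
        vecs = self._vectors(tokens)
        return np.sum(vecs, axis=0) if vecs else np.zeros(self.dim)

    # 3. Mean weighted by each token's IDF, e.g. from a fitted TfidfVectorizer:
    #    idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
    def idf_weighted_mean(self, tokens, idf):
        pairs = [(self.kv[t], idf.get(t, 1.0)) for t in tokens if t in self.kv]
        if not pairs:
            return np.zeros(self.dim)
        vecs, weights = zip(*pairs)
        return np.average(vecs, axis=0, weights=weights)


# Usage sketch (tokenized_titles and idf come from the preprocessing steps above):
# we = WordEmbedding("glove-wiki-gigaword-300")
# X = np.vstack([we.idf_weighted_mean(tokens, idf) for tokens in tokenized_titles])
```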