The issue of “fake news” has recently arisen as a potential threat to high-quality journalism and well-informed public discourse. In this project we perform stance detection -- i.e. identifying whether a particular news headline *agrees with*, *disagrees with*, *discusses*, or is *unrelated to* a particular news article -- in order to allow journalists and others to more easily find and investigate possible instances of fake news.
The problem is one of “stance detection”: comparing a headline with the body text of a news article to determine what relationship (if any) exists between the two. There are 4 possible classifications:
- The article text *agrees* with the headline.
- The article text *disagrees* with the headline.
- The article text is a *discussion* of the headline, without taking a position on it.
- The article text is *unrelated* to the headline (i.e. it doesn’t address the same topic).
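For reference, here is a minimal sketch of how these four stance labels might be encoded; the integer mapping below is purely illustrative and not taken from the project:

```python
# Illustrative encoding of the four stance classes described above.
STANCE_LABELS = {
    "agree": 0,      # article body agrees with the headline
    "disagree": 1,   # article body disagrees with the headline
    "discuss": 2,    # article body discusses the headline without taking a position
    "unrelated": 3,  # article body does not address the headline's topic
}
```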
The above problem is taken from the Fake News Challenge and is based on this research paper.
Download the required dataset from the Fake News Challenge DataSet and put it in the data folder located in the home folder.
Download the pretrained GloVe word embeddings here: Glove Word embeddings.
Add the downloaded glove word50.txt file to the data folder.
To learn more about the dataset used, go here.
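As a rough sketch of the expected setup, assuming the FNC-1 CSV files are named `train_bodies.csv` and `train_stances.csv` and live under `data/` (adjust the names and paths to match your download), the headlines and article bodies could be loaded and joined like this:

```python
import pandas as pd

# Assumed file names and column layout; adjust to match your copy of the dataset.
bodies = pd.read_csv("data/train_bodies.csv")    # expected columns: Body ID, articleBody
stances = pd.read_csv("data/train_stances.csv")  # expected columns: Headline, Body ID, Stance

# Join each headline/stance pair with its article body on Body ID.
train = stances.merge(bodies, on="Body ID")
print(train[["Headline", "Stance"]].head())
```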
Our goal in approaching the stance detection problem was to experiment with a wide range of machine learning algorithms, namely:
* Linear regression
* Logistic regression
* K-Nearest Neighbours
* Support Vector Machine
* Quadratic Discriminant Analysis
* Random Forest
* AdaBoost
* SGD Classifier
* Decision Tree
* XGBoost
* Linear Discriminant Analysis
* Gaussian Naive Bayes
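As an illustration of this experiment-with-everything approach, here is a minimal sketch that trains and scores most of the scikit-learn models above on the same train/test split. It assumes a dense feature matrix `X` and label vector `y` have already been built as described below, and it leaves out linear regression (not a classifier in scikit-learn) and XGBoost (which needs the separate `xgboost` package):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB


def compare_classifiers(X, y):
    """Train each candidate model on the same split and report test accuracy.

    X is assumed to be a dense NumPy feature matrix, y the stance labels.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "K-Nearest Neighbours": KNeighborsClassifier(),
        "Support Vector Machine": SVC(),
        "Quadratic Discriminant Analysis": QuadraticDiscriminantAnalysis(),
        "Random Forest": RandomForestClassifier(),
        "AdaBoost": AdaBoostClassifier(),
        "SGD Classifier": SGDClassifier(),
        "Decision Tree": DecisionTreeClassifier(),
        "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
        "Gaussian Naive Bayes": GaussianNB(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(f"{name}: {model.score(X_test, y_test):.3f}")
```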
- First, after loading the data from the CSV files, we clean it: we remove stop words, apply stemming, and strip unwanted symbols from the text.
- Then we tokenize the text, splitting it into the set of words (tokens) that together represent each sentence (see the first sketch after this list).
- We preprocess the data so that it can be used by the different word-embedding techniques.
- We perform feature engineering on the preprocessed data and extract three features from it (see the second sketch after this list), namely:
  - Word vectorization, to find the similarity between the two pieces of text.
  - KL divergence, to measure how far the word distributions of the two texts diverge from each other.
  - N-gram overlap, which counts how often words (and short word sequences) occur in both texts so as to determine the weight of each word.
- We pass the tokenized text through the pretrained GloVe word embeddings, which were trained on a massive corpus drawn from Wikipedia, and obtain the corresponding numerical vector for each word (also shown in the second sketch).
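The first sketch below illustrates the cleaning and tokenization steps, assuming NLTK is used for stop words, stemming, and tokenization (the actual implementation may rely on different libraries):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Requires: nltk.download("stopwords") and nltk.download("punkt")
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()


def clean_and_tokenize(text):
    """Lowercase, strip unwanted symbols, remove stop words, and stem each token."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # drop punctuation, digits, symbols
    tokens = word_tokenize(text)
    return [STEMMER.stem(tok) for tok in tokens if tok not in STOP_WORDS]


print(clean_and_tokenize("Banks are reportedly refusing to honour the new contract!"))
```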
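The second sketch gives rough, illustrative versions of the three features and of the GloVe lookup; the function names, the smoothing constant, and the embedding file path are assumptions for this sketch, not the project's actual API:

```python
import numpy as np


def cosine_similarity_feature(headline_tokens, body_tokens):
    """Similarity of the two texts using simple term-count vectors."""
    vocab = sorted(set(headline_tokens) | set(body_tokens))
    h = np.array([headline_tokens.count(w) for w in vocab], dtype=float)
    b = np.array([body_tokens.count(w) for w in vocab], dtype=float)
    denom = np.linalg.norm(h) * np.linalg.norm(b)
    return float(h @ b / denom) if denom else 0.0


def kl_divergence_feature(headline_tokens, body_tokens, eps=1e-6):
    """KL divergence between the (smoothed) word distributions of headline and body."""
    vocab = sorted(set(headline_tokens) | set(body_tokens))
    h = np.array([headline_tokens.count(w) for w in vocab], dtype=float) + eps
    b = np.array([body_tokens.count(w) for w in vocab], dtype=float) + eps
    h, b = h / h.sum(), b / b.sum()
    return float(np.sum(h * np.log(h / b)))


def ngram_overlap_feature(headline_tokens, body_tokens, n=2):
    """Fraction of headline n-grams that also occur in the body."""
    def ngrams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    h_grams = ngrams(headline_tokens)
    return len(h_grams & ngrams(body_tokens)) / len(h_grams) if h_grams else 0.0


def load_glove(path="data/glove.6B.50d.txt"):
    """Load GloVe vectors into a dict: word -> 50-d array (path is an assumption)."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            embeddings[word] = np.array(values, dtype=float)
    return embeddings


def mean_glove_vector(tokens, embeddings, dim=50):
    """Average the GloVe vectors of the tokens (zeros if none are in the vocabulary)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```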
Please find the implementation here.
We obtained the best accuracy of 88.8% with the Support Vector Machine classifier.
We are currently working on different RNN architectures and other algorithms for feature extraction.
Please see the link for two more approaches used to solve the above problem.