gsoc DSC-IEM

Winter of Code Final Work Product

Selecting a Proper Dataset

  • Getting data that suits our requirements is important for training the models.
  • I looked at many datasets, including Twitter analysis data, and finally settled on the Amazon Fine Food Reviews dataset.
  • I chose this dataset because every individual review carries a rating score from 1 to 5.
  • The data spans a period of more than 10 years, including all ~500,000 reviews up to October 2012.
  • Its large size is another reason I chose this dataset for my project.
  • images

Data Preprocessing on the Dataset

  • Before training the model, we have to preprocess the dataset so that unwanted data is removed.
  • The dataset has many columns, but for this project I kept only the score, id and review text columns.
  • Each review has a score from 1 to 5, which I divided into 3 categories: negative (score<3), positive (score>3) and neutral (score==3).
  • I then checked for duplicate entries, found that there were some, and removed all of them.
  • Then I removed HTML tags and special characters, and tokenized the reviews into word tokens.
  • After that, I checked each word against the stop words list: stop words are removed, and the remaining words are joined back together.
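The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code. The names `STOP_WORDS`, `score_to_sentiment`, `clean_review` and `preprocess` are hypothetical, the stop-word list is deliberately tiny, and treating two rows as duplicates when they share the same score and text is an assumption.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption);
# a real run would use a much fuller list.
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "and", "of", "to", "i", "was"}


def score_to_sentiment(score):
    """Map a review score to one of three sentiment categories."""
    if score > 3:
        return "positive"
    if score < 3:
        return "negative"
    return "neutral"


def clean_review(text):
    """Remove HTML tags and special characters, tokenize, drop stop words, rejoin."""
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # keep letters and whitespace only
    tokens = text.lower().split()             # simple whitespace tokenization
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def preprocess(rows):
    """Turn (id, score, text) rows into (id, sentiment, cleaned_text) rows.

    Rows whose (score, text) pair was already seen are dropped as duplicates
    (the duplicate criterion is an assumption).
    """
    seen = set()
    result = []
    for review_id, score, text in rows:
        key = (score, text)
        if key in seen:
            continue
        seen.add(key)
        result.append((review_id, score_to_sentiment(score), clean_review(text)))
    return result


# Toy rows standing in for the id, score and review text columns.
sample_rows = [
    (1, 5, "Great <b>taste</b>!"),
    (2, 5, "Great <b>taste</b>!"),  # duplicate of row 1
    (3, 3, "It was okay."),
    (4, 1, "Awful, stale chips"),
]
cleaned_rows = preprocess(sample_rows)
```

Rejoining the tokens into one string keeps each cleaned review in a form that a text vectorizer can consume directly.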

Featurisation, TF-IDF

  • Next I split the cleaned dataset into train and test sets to work on and build a model.
  • Then I featurised the dataset with a TF-IDF vectorizer, fitting it as tfidf_model.fit(reviews_train,sentiment_train).
  • Then I transformed the train reviews as reviews_train_tfidf=tfidf_model.transform(reviews_train).
  • Then I used WordCloud to visualise the top 10 words.
  • images
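The split and TF-IDF steps above might look like the sketch below, assuming `tfidf_model` is scikit-learn's `TfidfVectorizer` (the report does not name the library). The toy reviews, the 75/25 split and `random_state=0` are illustrative assumptions. The labels passed to `fit` are ignored by `TfidfVectorizer`, so they are left out here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned reviews and their sentiment labels (assumption).
reviews = [
    "great taste fresh",
    "awful stale chips",
    "okay nothing special",
    "love coffee great aroma",
    "terrible flavor never again",
    "average snack okay price",
    "delicious cookies great gift",
    "bad packaging stale product",
]
sentiments = [
    "positive", "negative", "neutral", "positive",
    "negative", "neutral", "positive", "negative",
]

# Split ratio and seed are illustrative choices, not taken from the report.
reviews_train, reviews_test, sentiment_train, sentiment_test = train_test_split(
    reviews, sentiments, test_size=0.25, random_state=0
)

# Learn the vocabulary and IDF weights from the training reviews only.
tfidf_model = TfidfVectorizer()
tfidf_model.fit(reviews_train)

# Turn both splits into TF-IDF matrices using the same fitted vocabulary.
reviews_train_tfidf = tfidf_model.transform(reviews_train)
reviews_test_tfidf = tfidf_model.transform(reviews_test)
```

Fitting on the training split alone keeps vocabulary and IDF statistics from leaking out of the test set; the test reviews are only transformed.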

Model Selection

  • Model selection is a key step in getting the best accuracy and precision out of the project.
  • After applying EDA on the dataset, I tried three algorithms to train my model for better prediction.
  • The 3 algorithms are: Logistic Regression, Naive Bayes and Decision Tree.
  • From these three I had to figure out the one algorithm that fits my model best.
  • On the basis of parameters and accuracy, I chose Naive Bayes to train my model.
  • images
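A comparison of the three algorithms could be set up along these lines. This is a sketch under assumptions, not the project's code: it uses scikit-learn, picks `MultinomialNB` as the Naive Bayes variant (the report does not say which one was used), and runs on a handful of toy reviews, so the accuracies it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Toy train/test data standing in for the cleaned reviews (assumption).
train_reviews = [
    "great taste fresh", "love coffee great aroma", "delicious cookies great gift",
    "awful stale chips", "terrible flavor never again", "bad packaging stale product",
    "okay nothing special", "average snack okay price",
]
train_labels = [
    "positive", "positive", "positive",
    "negative", "negative", "negative",
    "neutral", "neutral",
]
test_reviews = ["great fresh coffee", "stale awful product", "okay average"]
test_labels = ["positive", "negative", "neutral"]

# Featurise with TF-IDF, fitting on the training reviews only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_reviews)
X_test = vectorizer.transform(test_reviews)

# The three candidate algorithms; hyperparameters are illustrative defaults.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Train each candidate and record its accuracy on the held-out reviews.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, train_labels)
    scores[name] = accuracy_score(test_labels, model.predict(X_test))

for name, accuracy in scores.items():
    print(f"{name}: accuracy = {accuracy:.2f}")
```

Scoring every candidate on the same held-out split keeps the comparison fair; on the full dataset, precision per class would be worth reporting alongside accuracy.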

Model Deployment

  • To pre-check the frontend and backend of my model, I deployed it on my local server, where it works efficiently and precisely.
  • I have added the following features: sentiment prediction, keyword extraction and display, polarity and subjectivity, and a summary. These appear when we put some text into the frontend, which shows the output after the backend has processed it.
  • images of frontend