Skip to content

Determined the potential spammers in preprocessed Amazon Food Reviews dataset based on reviews and ratings of reviewers using certain heuristics in Pig and applied Naïve Bayes Algorithm on it using PySpark to identify the actual spammers

Notifications You must be signed in to change notification settings

Nandita-93/Opinion-Spam-Detection

Repository files navigation

Opinion-Spam-Detection

Team of 5 The dataset was downloaded from: http://snap.stanford.edu/data/web-FineFoods.html

The dataset should be cleaned for special characters, formatted (timestamp to date) and properly delimited to fit our pig scripts (to extract the spam, ham data based on our heuristics) run in HDFS

The following are the cleaning python files and they should be executed on the actual dataset:

dateCleanUp.py timeConversion.py

The outcome of this would be collected in foods.txt (the clean, formatted and delimited file) - of size around the actual dataset size with same number of records as in original dataset

The following pig scripts should be executed to extract the ham and spam review data based on heuristics (per product & per day average):

spamperday.pig hamperday.pig spamperproduct.pig hamperproduct.pig

These four pig scripts would generate

timesuspect.txt timeham.txt avgesupsect.txt hamavgnew.txt

folders respectively

These HDFS Reducer output folders will have the files that are to be converted to .txt files to get the text versions of output

We saved these reducer output files as Spam_Time.txt, Ham_Time.txt, Spam_Average.txt, Ham_Average.txt respectively

These four .txt files are then combined into a single HamSpam.txt file using a simple java code (with two columns: firs column being ham/spam classification and second column as the corresponding review data)

reviewText.txt is the actual dataset with only the last column (review text column) that needs to be classified as ham or spam (based on the Naive Bayes Classification using the HamSpam.txt that has the extracted data based on heuristics)

The following .py file will execute (with HamSpam.txt and reviewText.txt in it's path) to produce the actual output:

ham_spam.py

tesPrediction.txt is the FINAL output file that would have classified the actual dataset (in this case reviewText.txt) as ham or spam reviews

Additionally, output.txt will display the accuracy of our Machine Learning Model

About

Determined the potential spammers in preprocessed Amazon Food Reviews dataset based on reviews and ratings of reviewers using certain heuristics in Pig and applied Naïve Bayes Algorithm on it using PySpark to identify the actual spammers

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published