Opinion-Spam-Detection

Team of 5 The dataset was downloaded from: http://snap.stanford.edu/data/web-FineFoods.html

The dataset should be cleaned for special characters, formatted (timestamp to date) and properly delimited to fit our pig scripts (to extract the spam, ham data based on our heuristics) run in HDFS

The following are the cleaning python files and they should be executed on the actual dataset:

dateCleanUp.py timeConversion.py

The outcome of this would be collected in foods.txt (the clean, formatted and delimited file) - of size around the actual dataset size with same number of records as in original dataset

The following pig scripts should be executed to extract the ham and spam review data based on heuristics (per product & per day average):

spamperday.pig hamperday.pig spamperproduct.pig hamperproduct.pig

These four pig scripts would generate

timesuspect.txt timeham.txt avgesupsect.txt hamavgnew.txt

folders respectively

These HDFS Reducer output folders will have the files that are to be converted to .txt files to get the text versions of output

We saved these reducer output files as Spam_Time.txt, Ham_Time.txt, Spam_Average.txt, Ham_Average.txt respectively

These four .txt files are then combined into a single HamSpam.txt file using a simple java code (with two columns: firs column being ham/spam classification and second column as the corresponding review data)

reviewText.txt is the actual dataset with only the last column (review text column) that needs to be classified as ham or spam (based on the Naive Bayes Classification using the HamSpam.txt that has the extracted data based on heuristics)

The following .py file will execute (with HamSpam.txt and reviewText.txt in it's path) to produce the actual output:

ham_spam.py

tesPrediction.txt is the FINAL output file that would have classified the actual dataset (in this case reviewText.txt) as ham or spam reviews

Additionally, output.txt will display the accuracy of our Machine Learning Model

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Opinion-Spam-Detection

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
README.md		README.md
dateCleanUp.py		dateCleanUp.py
ham_spam.py		ham_spam.py
hamperday.pig		hamperday.pig
hamperproduct.pig		hamperproduct.pig
spamperday.pig		spamperday.pig
spamperproduct.pig		spamperproduct.pig
timeConversion.py		timeConversion.py

Nandita-93/Opinion-Spam-Detection

Folders and files

Latest commit

History

Repository files navigation

Opinion-Spam-Detection

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages