Evaluation of a simple score-based Natural Language Processing (NLP) algorithm

This study evaluates a rudimentary Natural Language Processing algorithm for rating product reviews. Each positive word increases a review's score, and each negative word decreases it. While this method may not be well suited to predicting a review's numerical rating on a scale of 1 to 5, it categorizes reviews into positive and negative classes with acceptable accuracy.
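
The scoring idea can be illustrated with a minimal sketch. The word sets below are a hypothetical mini-lexicon chosen for illustration (the experiment itself relies on the NRC lexicon), and the function names and the choice to treat a score of 0 as positive are assumptions, not the notebooks' actual code.

```python
import re

# Hypothetical mini-lexicon for illustration only; the experiment uses the NRC lexicon.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "broken"}


def score_review(text, positive=POSITIVE, negative=NEGATIVE):
    """Return (number of positive words) - (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    # Each positive word adds 1, each negative word subtracts 1.
    return sum((w in positive) - (w in negative) for w in words)


def categorize(score):
    """Map a score to a class; counting 0 as positive is an arbitrary assumption."""
    return "positive" if score >= 0 else "negative"


print(score_review("Great phone, I love it"))                 # 2
print(categorize(score_review("Bad battery, poor screen")))   # negative
```

Only the sign of the score is needed for the positive/negative split, which may be why the category prediction is more robust than the 1 to 5 rating.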

Folder structure

code contains the Jupyter notebooks to run the experiment
data/input contains the external datasets used as input files
data/output contains files generated while running the experiment
- Intermediary result
- Result
- Rating Confusion Matrix
- Category Confusion Matrix
documentation contains the architecture of the pipeline and pipeline metadata

Data Sources

Amazon Customer Review Data

"Amazon Customer Review Data for sentiment analysis"
Authors: Akash Shashikant Vaykar, Abhishek Kaushik (ORCID)
Publication: November 21, 2019
License: Creative Commons Attribution 4.0 International

Google Play Store Data

Mobile app stores such as Google's and Apple's offer a wide range of applications to meet customers' needs on digital platforms. Customer feedback and ratings have always been among the major metrics used to review an app's performance and to provide suitable recommendations for enhancing its functionality. This dataset contains customer feedback on apps used in the app store.
Authors: Abhishek Kaushik (ORCID), Swathi Venkatakrishnan
Publication: May 15, 2019
License: Creative Commons Attribution 4.0 International

NRC Word-Emotion Association Lexicon

The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The annotations were manually done by crowdsourcing.
http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
Author: Saif M. Mohammad
Publication: July 10, 2011
License: No license specified; however, the dataset can be used freely for non-commercial research and educational purposes.
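
As a rough sketch of how the two sentiment labels could be pulled out of the lexicon, the snippet below assumes a word-level file with one tab-separated row per word and label (word, label, 0/1 flag). That layout and the function name `load_sentiment_words` are assumptions; check them against the file you download.

```python
import io
from collections import defaultdict


def load_sentiment_words(lines):
    """Collect words flagged as positive or negative.

    Assumes rows of the form word<TAB>label<TAB>flag, where flag is "1"
    when the word is associated with the label.
    """
    words = defaultdict(set)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        word, label, flag = parts
        # Keep only the two sentiment labels; the eight emotions are ignored.
        if label in ("positive", "negative") and flag == "1":
            words[label].add(word)
    return words["positive"], words["negative"]


# In-memory sample standing in for the downloaded file.
sample = io.StringIO(
    "abandon\tfear\t1\n"
    "abandon\tnegative\t1\n"
    "abandon\tpositive\t0\n"
    "ability\tpositive\t1\n"
)
positive, negative = load_sentiment_words(sample)
print(positive, negative)
```

With a real file, the same function could be called on an open file handle instead of the `io.StringIO` sample.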

Run the experiment

You can either run this experiment on your host or in a Docker container. We recommend using Docker.
The Amazon and Google Play reviews are already included in this repository.
Due to licensing issues, we are not allowed to distribute the NRC dataset. Thus, it is necessary to download it manually:
- Download the NRC dataset (http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm), extract it and place the content of the extracted folder into data/input/nrc.

Host

Make sure that Python 3.0 or higher is installed on your system. If it is not already, follow https://www.python.org/downloads/ to install it
Run pip install -r requirements.txt in the project root directory
Run jupyter notebook
Open http://localhost:8888
The code directory contains the Jupyter notebooks
Run the notebooks in the following order: 01_merge_preprocess.ipynb, 02_score_reviews.ipynb, 03_ visualize.ipynb

Docker

Make sure that Docker is installed on your device. If it is not already, follow https://docs.docker.com/get-docker/ to install it
Run docker build . -t simple-nlp to build the Docker image
Run docker run -p 8888:8888 simple-nlp
Open localhost:8888
The code directory contains the Jupyter notebooks
Run the notebooks in the following order: 01_merge_preprocess.ipynb, 02_score_reviews.ipynb, 03_ visualize.ipynb

Jupyter notebooks

`01_merge_preprocess.ipynb`

This notebook merges the Google Play Store review and Amazon review datasets and filters stopwords (e.g. "this") out of the reviews. It produces the file data/output/[ddmmyyy]_merged_preprocessed.csv.
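
The merge and stopword-filtering step might look roughly like the sketch below. The stopword list, the field names (`source`, `rating`, `review`), and both function names are hypothetical; the notebook's actual columns and stopword source are not documented here.

```python
# Hypothetical stopword list; the notebook's actual list is not shown in this README.
STOPWORDS = {"this", "the", "a", "is", "it", "and"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop stopwords from a review, comparing case-insensitively."""
    return " ".join(w for w in text.split() if w.lower() not in stopwords)


def merge_reviews(sources):
    """Concatenate reviews from several sources into one list of uniform rows."""
    merged = []
    for name, rows in sources.items():
        for row in rows:
            merged.append({
                "source": name,                       # which dataset the row came from
                "rating": int(row["rating"]),         # normalize ratings to integers
                "review": remove_stopwords(row["review"]),
            })
    return merged


# Tiny made-up rows standing in for the two input datasets.
amazon = [{"rating": "5", "review": "This is a great phone"}]
google_play = [{"rating": "1", "review": "The app is broken"}]

merged = merge_reviews({"amazon": amazon, "google_play": google_play})
for row in merged:
    print(row)
```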

`02_score_reviews.ipynb`

This notebook calculates the predicted rating of each review using the simple NLP algorithm. It produces the output file data/output/[ddmmyyy]_predicted_rating.csv.
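
This README does not state how a word-count score is turned into a rating from 1 to 5. One possible mapping, shown purely as an assumption, shifts the score around a neutral rating of 3 and clamps the result to the scale; the function name and its parameters are hypothetical.

```python
def score_to_rating(score, neutral=3, lowest=1, highest=5):
    """Shift the score around a neutral rating and clamp it to the rating scale.

    This mapping is an illustrative assumption, not the notebook's documented rule.
    """
    return max(lowest, min(highest, neutral + score))


for s in (-4, -1, 0, 1, 6):
    print(s, "->", score_to_rating(s))
```

A mapping like this saturates quickly for long reviews with many sentiment words, which is one plausible reason the numerical rating is harder to predict than the category.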

`03_ visualize.ipynb`

Visualizes the data. It creates confusion matrices for the predicted rating (data/output/rating_confusion.pdf) and the predicted category (data/output/category_confusion.pdf).
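
For readers unfamiliar with confusion matrices, the sketch below counts actual versus predicted categories using only the standard library. The labels and sample data are made up, and the notebook itself may use a plotting or machine-learning library to produce the PDFs.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return rows of counts: rows are actual labels, columns are predicted labels."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


labels = ["negative", "positive"]
# Made-up example data, not results from the experiment.
actual = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "negative", "positive", "positive"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>8}", row)

# Correct predictions lie on the diagonal of the matrix.
correct = sum(matrix[i][i] for i in range(len(labels)))
print("accuracy:", correct / len(actual))
```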
