Executive Summary
Sentiment Analysis (or Opinion Mining) is the task of identifying the opinion that a piece of text expresses about its subject. It often takes the form of an annotation task in which a portion of text is labeled as positive, negative, or neutral.
In this Sentiment Analysis (SA) task, the presented deep learning regression model predicts the score that a user assigned to a product in a review.
This work was carried out to participate in ATE_ABSITA at EVALITA 2020: http://www.di.uniba.it/~swap/ate_absita/task.html#
Feature set information
The dataset contains 4364 real-life user reviews, written in Italian, about 23 products. The training, dev and test sets are randomly generated in the following proportions: 70% training, 2.5% dev, 27.5% test. This means that the test set is not out-of-domain. The data format is NDJSON (http://ndjson.org/) with UTF-8 encoding and newline as the delimiter. Note that some reviews may not contain any aspect, but the final review score is always available.
- TRAINING SET: 3054 reviews - ate_absita_training.ndjson - 1.1 MB
- DEV SET: 109 reviews - ate_absita_dev.ndjson - 37 KB
- TEST SET: 1200 reviews - ate_absita_test.ndjson - 322 KB - NOT YET RELEASED
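A random split in the proportions listed above could be sketched as follows. This is only an illustration: the function name split_reviews, the seed, and the rounding rule are assumptions, not the procedure the organizers used to produce the released files.

```python
import random

def split_reviews(reviews, train_frac=0.70, dev_frac=0.025, seed=0):
    """Shuffle the reviews and cut them into train/dev/test partitions."""
    shuffled = list(reviews)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle (seed is an assumption)
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # whatever remains, about 27.5%
    return train, dev, test

# Toy input: 1000 integer ids standing in for reviews.
train, dev, test = split_reviews(range(1000))
print(len(train), len(dev), len(test))
```

Because every review comes from the same pool of products, a split of this kind keeps the test set in-domain.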
Feature example
{
  "sentence": "Ottimo rasoio dal semplice utilizzo. Rade molto bene e in qualsiasi direzione. Pratico e facile da pulire",
  "score": 5
}
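Since NDJSON stores one JSON object per line, the records can be read with the standard library alone. The sketch below assumes only the two fields shown in the example (sentence and score). The helper name read_ndjson is hypothetical, and the second sample record is made up for illustration.

```python
import io
import json

def read_ndjson(handle):
    """Yield one dict per non-empty line of an NDJSON stream."""
    for line in handle:
        line = line.strip()
        if line:                      # skip blank lines
            yield json.loads(line)

# In-memory stand-in for a file; the second record is invented sample data.
sample = io.StringIO(
    '{"sentence": "Ottimo rasoio dal semplice utilizzo.", "score": 5}\n'
    '{"sentence": "Pratico e facile da pulire", "score": 4}\n'
)
records = list(read_ndjson(sample))
scores = [record["score"] for record in records]
print(len(records), scores)
```

To read a real file, pass a handle such as open("ate_absita_training.ndjson", encoding="utf-8") instead of the in-memory sample.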
Data augmentation
The original dataset has been enriched with a limited number of additional product reviews in order to beat the baseline. These reviews are available in the additional_scraped_reviews folder.
How this works
The dataset is processed in the following steps:
1. dataframe_pipeline.py converts the NDJSON input file into a pandas dataframe, which is then saved in joblib format in the joblib_not_processed_dataframe folder;
2. additional_features_preprocessing.py processes the product reviews that were added later to this dataset and are stored in the additional_scraped_reviews folder;
3. preprocessing.py loads the dataset created in step 1 and applies the first cleaning layer to the data by removing i) punctuation, ii) numbers, iii) single characters, iv) multiple spaces and v) stopwords. Once the cleaning layer is complete, the output is saved, again in joblib format, in the joblib_processed_features folder;
4. train.py performs the last part of the preprocessing, including applying word embeddings to create the feature matrices. The model is then saved in the models folder. To run this step, download the word embeddings from https://fasttext.cc/docs/en/crawl-vectors.html, create an embeddings folder in the main directory of the project and place the downloaded file in it. The expected directory structure is:
embeddings
--cc.it.300.vec
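The cleaning layer and the embedding lookup described in the steps above can be sketched as follows. This is not the code in preprocessing.py or train.py. The function names, the tiny stopword set, the lowercasing step, and the toy 3-dimensional vectors are all assumptions made for illustration. The only format detail relied on is that a fastText .vec file starts with a "word count, dimension" header line followed by one word and its values per line.

```python
import io
import re
import string

# Hypothetical stopword subset; a real run would use a full Italian list.
STOPWORDS = {"il", "la", "di", "da", "in", "dal", "molto"}

def clean_text(text, stopwords=STOPWORDS):
    """Remove punctuation, numbers, single characters, extra spaces, stopwords."""
    text = text.lower()  # assumption: lowercasing is not stated in the README
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)          # i) punctuation becomes spaces
    text = re.sub(r"\d+", " ", text)      # ii) numbers
    tokens = text.split()                 # iv) split() collapses repeated spaces
    tokens = [t for t in tokens if len(t) > 1]           # iii) single characters
    tokens = [t for t in tokens if t not in stopwords]   # v) stopwords
    return " ".join(tokens)

def load_vec(handle, vocabulary):
    """Read a fastText .vec stream, keeping only the words in `vocabulary`."""
    header = handle.readline().split()    # first line: "<word count> <dimension>"
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in vocabulary and len(values) == dim:
            vectors[word] = [float(v) for v in values]
    return dim, vectors

cleaned = clean_text(
    "Ottimo rasoio dal semplice utilizzo. Rade molto bene e in qualsiasi direzione."
)

# Toy stand-in for embeddings/cc.it.300.vec (the real vectors have 300 dimensions).
toy_vec = io.StringIO("3 3\nottimo 0.1 0.2 0.3\nrasoio 0.4 0.5 0.6\ncasa 0.7 0.8 0.9\n")
dim, vectors = load_vec(toy_vec, set(cleaned.split()))
print(cleaned)
print(dim, sorted(vectors))
```

Filtering by vocabulary while reading keeps memory use low, since only the words that occur in the reviews are stored. For the real file, pass open("embeddings/cc.it.300.vec", encoding="utf-8") as the handle.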
Model used and metrics
The root-mean-square error (RMSE) on the dev set with the current model and data augmentation is 0.9978965872313215 (competition baseline: 1.004).
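For reference, the root-mean-square error between true and predicted review scores can be computed as in the minimal sketch below. The function name rmse and the sample values are illustrative only and are not taken from the project's evaluation code or results.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long score sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up scores: true review scores versus hypothetical model predictions.
example = rmse([5, 4, 1], [4.5, 4.0, 2.0])
print(example)
```

A lower value is better, so a model beats the baseline when its RMSE is below 1.004.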
License
All material used and produced by the organizers and the researcher for this evaluation task is released for non-commercial research purposes only. In this regard, no tools are provided to link the reviews released as datasets, to specific subjects on the web, or to trademarks and third parties. Furthermore, any use for statistical, propagandistic or advertising purposes of any kind is prohibited. It is not possible to modify, alter or enrich the data provided for the purposes of redistribution.
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License: http://creativecommons.org/licenses/by-nc-nd/4.0/
EVALITA Credits
@inproceedings{demattei2020overview,
title={{Overview of the EVALITA 2020 ATE\_ABSITA: Aspect Term Extraction and Aspect-based Sentiment Analysis task}},
author={de Mattei, Lorenzo and De Martino, Graziella and Iovine, Andrea and Miaschi, Alessio and Polignano, Marco},
booktitle={EVALITA 2020-Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian},
year={2020},
organization={CEUR}
}