Exploring cross-domain sentiment analysis.
IT University of Copenhagen | Spring 2022
This repository contains the work for a three-phase project in Natural Language Processing.
- Phase 1 was the development of an NLP model for sentiment classification of Amazon music reviews.
- Phase 2 involved creating edge cases to challenge the model from Phase 1, as well as predicting labels for these hard cases.
- Phase 3 is a novel study of sentiment analysis in a cross-domain setting.
- code: contains all notebooks and scripts used in this project.
  - models: contains all trained models
  - archive: scripts and notebooks not needed in the end
  - workdir: a folder necessary for running the similarity script
- data: all of the data required for this project.
  - dissimilar: selected dissimilar training data
  - interim: processed review data
  - predictions: predicted classes based on the probabilities files
  - probabilities: probabilities of the positive and negative classes for the test sets
  - random: randomly selected training data
  - raw: raw Amazon review data
- docs: doc and txt files created for this project
- report: contains the final report itself, and:
  - figures: all figures produced during data analysis and visualization
  - metrics: all metrics produced
- sabrina: Sabrina's folder
  - archive: ?
  - src: ?
The language used in this project is Python 3.8.10
All required libraries and their versions can be found in requirements.txt.
All of the data were sourced from the Jianmo Ni (2018) Amazon review data, using the complete review data. The initial model was trained on the source domain, Digital Music. The target domains were Video Games and Arts, Crafts and Sewing.
The most up-to-date version of these instructions can be found in The Big How to Reproduce Our Findings Guide.txt.
Run all commands from the repository's root directory.
The interim datasets were created by loading the JSON files and outputting CSVs containing only the review text and its corresponding label. Required files:
code/data_prep.ipynb
This is a Jupyter notebook. When run, it will output music_train.csv, music_dev.csv, and music_test.csv to the data/interim folder. The raw Digital Music data was already split into train, dev, and test sets for us by our professor. If recreating this project, you will have to create your own train, dev, and test splits. (ADD DETAILS FOR COUNTS IN EACH CATEGORY)
code/corpus_load.py
Run via:
python3 code/corpus_load.py Arts_Crafts_and_Sewing.json.gz sew
python3 code/corpus_load.py Video_Games.json.gz games
This will output three CSV files for each dataset to data/interim: train, dev, and test.
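For reference, below is a minimal sketch of what this preprocessing could look like, assuming the raw files follow the Jianmo Ni review format (one JSON object per line with reviewText and overall star-rating fields). The rating-to-label mapping, the handling of 3-star reviews, and the split sizes here are assumptions; the actual logic lives in code/data_prep.ipynb and code/corpus_load.py.

# Sketch: read a gzipped JSON-lines review file and write text/label CSVs.
# Assumes the reviewText/overall fields; skipping 3-star reviews is an assumption.
import gzip
import json
import pandas as pd
from sklearn.model_selection import train_test_split

rows = []
with gzip.open('data/raw/Video_Games.json.gz', 'rt') as f:
    for line in f:
        review = json.loads(line)
        text = review.get('reviewText')
        stars = review.get('overall')
        if not text or stars is None or stars == 3:  # drop empty or neutral reviews
            continue
        rows.append({'text': text, 'label': 1 if stars > 3 else 0})

df = pd.DataFrame(rows)
train, rest = train_test_split(df, test_size=0.2, random_state=42)
dev, test = train_test_split(rest, test_size=0.5, random_state=42)
train.to_csv('data/interim/games_train.csv', index=False)
dev.to_csv('data/interim/games_dev.csv', index=False)
test.to_csv('data/interim/games_test.csv', index=False)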
Required files:
- code/kl_divergence.py
- data/interim/music_train.csv
- data/interim/sew_train.csv
- data/interim/games_train.csv
In code/kl_divergence.py you may need to uncomment these two lines (lines 12-13):
#import nltk
#nltk.download('punkt')
Run via:
python3 code/kl_divergence.py music_train games_train
python3 code/kl_divergence.py music_train sew_train
This will print the KL divergence between the given corpora. Please note the order of the arguments, as this is an asymmetric measure: KL(music --> games) =/= KL(games --> music).
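As a rough illustration of what is being measured, the KL divergence between two corpora can be computed from their smoothed unigram distributions over the shared vocabulary. A minimal sketch, assuming the interim CSVs have a text column; the actual tokenisation and smoothing choices are in code/kl_divergence.py.

# Sketch: KL divergence between the unigram distributions of two corpora,
# with add-one smoothing over the joint vocabulary (an illustrative simplification).
import math
from collections import Counter
import pandas as pd
import nltk

nltk.download('punkt')

def unigram_counts(csv_path):
    texts = pd.read_csv(csv_path)['text'].astype(str)
    counts = Counter()
    for text in texts:
        counts.update(nltk.word_tokenize(text.lower()))
    return counts

def kl_divergence(p_counts, q_counts):
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + len(vocab)
    q_total = sum(q_counts.values()) + len(vocab)
    kl = 0.0
    for w in vocab:
        p = (p_counts[w] + 1) / p_total
        q = (q_counts[w] + 1) / q_total
        kl += p * math.log(p / q)
    return kl

music = unigram_counts('data/interim/music_train.csv')
games = unigram_counts('data/interim/games_train.csv')
print(kl_divergence(music, games))  # KL(music --> games)
print(kl_divergence(games, music))  # KL(games --> music), generally different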
Required files:
- code/cosine.py
- data/interim/music_train.csv
- data/interim/sew_train.csv
- data/interim/games_train.csv
- code/workdir
Run via:
python3 code/cosine.py 'music_train' 'games_train' 'games' 10000
python3 code/cosine.py 'music_train' 'sew_train' 'sew' 10000
This will output the four different-sized training sets to data/dissimilar, along with another file containing all of the similarity scores.
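As an illustration of the selection step, here is a minimal sketch that ranks target-domain examples by their cosine similarity to the source domain and keeps the least similar ones. The TF-IDF representation, the source-centroid comparison, and the scores file name are assumptions of this sketch only; the actual representation and scoring are in code/cosine.py (which also needs code/workdir).

# Sketch: keep the target-domain training examples least similar to the source domain.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

source = pd.read_csv('data/interim/music_train.csv')
target = pd.read_csv('data/interim/games_train.csv')

vectorizer = TfidfVectorizer(min_df=2)
source_vecs = vectorizer.fit_transform(source['text'].astype(str))
target_vecs = vectorizer.transform(target['text'].astype(str))

# Similarity of each target review to the mean source vector.
centroid = np.asarray(source_vecs.mean(axis=0))
scores = cosine_similarity(target_vecs, centroid).ravel()

ranked = target.assign(score=scores).sort_values('score')  # lowest similarity first
for n in (10, 100, 1000, 10000):
    ranked.head(n).drop(columns='score').to_csv(f'data/dissimilar/games{n}.csv', index=False)
ranked.to_csv('data/dissimilar/games_scores.csv', index=False)  # hypothetical name for the scores file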
Required Files:
- code/random_select.ipynb
- data/interim/music_train.csv
- data/interim/sew_train.csv
- data/interim/games_train.csv
When the Jupyter notebook is run, it will take the training sets and create training subsets of four sizes in the data/random folder for each of the following:
games random: 10, 100, 1000, 10000
games balanced random: 10, 100, 1000, 10000
sewing random: 10, 100, 1000, 10000
sewing balanced random: 10, 100, 1000, 10000
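A minimal sketch of how the random and class-balanced random subsets can be drawn; the output file names match the pattern used in the commands further below, while the label column name and sampling details are assumptions (the actual logic is in code/random_select.ipynb).

# Sketch: random and class-balanced random subsets of a target-domain training set.
import pandas as pd

games = pd.read_csv('data/interim/games_train.csv')

for n in (10, 100, 1000, 10000):
    # Plain random sample.
    games.sample(n=n, random_state=42).to_csv(f'data/random/games_{n:05d}.csv', index=False)

    # Balanced random sample: equal numbers of positive and negative reviews.
    balanced = (games.groupby('label', group_keys=False)
                     .apply(lambda g: g.sample(n=n // 2, random_state=42)))
    balanced.to_csv(f'data/random/games_res_{n:05d}.csv', index=False)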
Baseline Model:
Required files:
- code/baseline.py
- data/interim/music_train.csv
- data/interim/music_dev.csv
Run via:
python3 code/baseline.py 1 'data/interim/music_train.csv' 'data/interim/music_dev.csv' None 'base'
This will output a pickled model which can be found at: code/models/model_base.pkl
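The actual baseline architecture is defined in code/baseline.py; purely as an illustration of the training-and-pickling step, here is a sketch assuming a TF-IDF + logistic regression pipeline and a text/label column layout.

# Sketch: train and pickle a simple sentiment baseline.
# TF-IDF + logistic regression is an assumption; see code/baseline.py for the real model.
import pickle
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train = pd.read_csv('data/interim/music_train.csv')
dev = pd.read_csv('data/interim/music_dev.csv')

model = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ('clf', LogisticRegression(max_iter=1000)),
])
model.fit(train['text'].astype(str), train['label'])
print('dev accuracy:', model.score(dev['text'].astype(str), dev['label']))

with open('code/models/model_base.pkl', 'wb') as f:
    pickle.dump(model, f)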
Required files:
- code/baseline.py
- data/dissimilar/games*.csv (4 files)
- data/dissimilar/sew*.csv (4 files)
- data/random/games_*.csv (4 files)
- data/random/sew_*.csv (4 files)
- data/random/games_res_*.csv (4 files)
- data/random/sew_res_*.csv (4 files)
Run via:
Selected:
python3 code/baseline.py 0 'data/dissimilar/games10.csv' 'data/interim/games_val.csv' code/models/model_base.pkl 'games_00010cp'
python3 code/baseline.py 0 'data/dissimilar/games100.csv' 'data/interim/games_val.csv' code/models/model_base.pkl 'games_00100cp'
python3 code/baseline.py 0 'data/dissimilar/games1000.csv' 'data/interim/games_val.csv' code/models/model_base.pkl 'games_01000cp'
python3 code/baseline.py 0 'data/dissimilar/games10000.csv' 'data/interim/games_val.csv' code/models/model_base.pkl 'games_10000cp'
python3 code/baseline.py 0 'data/dissimilar/sew10.csv' 'data/interim/sew_val.csv' code/models/model_base.pkl 'sew_00010cp'
python3 code/baseline.py 0 'data/dissimilar/sew100.csv' 'data/interim/sew_val.csv' code/models/model_base.pkl 'sew_00100cp'
python3 code/baseline.py 0 'data/dissimilar/sew1000.csv' 'data/interim/sew_val.csv' code/models/model_base.pkl 'sew_01000cp'
python3 code/baseline.py 0 'data/dissimilar/sew10000.csv' 'data/interim/sew_val.csv' code/models/model_base.pkl 'sew_10000cp'
Randomised:
python3 code/baseline.py 0 'data/random/games_00010.csv' 'data/interim/games_val.csv' code/models/model_base.pkl 'games_00010ra'
python3 code/baseline.py 0 'data/random/games_00100.csv' 'data/interim/games_val.csv' code/models/model_base.pkl 'games_00100ra'
python3 code/baseline.py 0 'data/random/games_01000.csv' 'data/interim/games_val.csv' code/models/model_base.pkl 'games_01000ra'
python3 code/baseline.py 0 'data/random/games_10000.csv' 'data/interim/games_val.csv' code/models/model_base.pkl 'games_10000ra'
python3 code/baseline.py 0 'data/random/sew_00010.csv' 'data/interim/sew_val.csv' code/models/model_base.pkl 'sew_00010ra'
python3 code/baseline.py 0 'data/random/sew_00100.csv' 'data/interim/sew_val.csv' code/models/model_base.pkl 'sew_00100ra'
python3 code/baseline.py 0 'data/random/sew_01000.csv' 'data/interim/sew_val.csv' code/models/model_base.pkl 'sew_01000ra'
python3 code/baseline.py 0 'data/random/sew_10000.csv' 'data/interim/sew_val.csv' code/models/model_base.pkl 'sew_10000ra'
Balanced Randomised:
python3 code/baseline.py 0 'data/random/games_res_00010.csv' 'data/interim/games_val.csv' code/models/model_base.pkl 'games_00010es'
python3 code/baseline.py 0 'data/random/games_res_00100.csv' 'data/interim/games_val.csv' code/models/model_base.pkl 'games_00100es'
python3 code/baseline.py 0 'data/random/games_res_01000.csv' 'data/interim/games_val.csv' code/models/model_base.pkl 'games_01000es'
python3 code/baseline.py 0 'data/random/games_res_10000.csv' 'data/interim/games_val.csv' code/models/model_base.pkl 'games_10000es'
python3 code/baseline.py 0 'data/random/sew_res_00010.csv' 'data/interim/sew_val.csv' code/models/model_base.pkl 'sew_00010es'
python3 code/baseline.py 0 'data/random/sew_res_00100.csv' 'data/interim/sew_val.csv' code/models/model_base.pkl 'sew_00100es'
python3 code/baseline.py 0 'data/random/sew_res_01000.csv' 'data/interim/sew_val.csv' code/models/model_base.pkl 'sew_01000es'
python3 code/baseline.py 0 'data/random/sew_res_10000.csv' 'data/interim/sew_val.csv' code/models/model_base.pkl 'sew_10000es'
Each line will output a pickled model into the code/models folder.
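What baseline.py does with its first argument (0 vs. 1) and the model path is defined in the script itself. Purely as a sketch of the experimental idea (adapting the source-domain model with a small amount of target-domain data), here is one plausible reading; the combination strategy, column names, and output model name are all assumptions.

# Sketch: one plausible adaptation step (assumptions throughout; see code/baseline.py for the real logic).
import pickle
import pandas as pd

with open('code/models/model_base.pkl', 'rb') as f:
    model = pickle.load(f)

source = pd.read_csv('data/interim/music_train.csv')
added = pd.read_csv('data/dissimilar/games10.csv')
val = pd.read_csv('data/interim/games_val.csv')

# Refit the loaded pipeline on the source data plus the added target examples,
# then evaluate on the target-domain validation set.
combined = pd.concat([source, added], ignore_index=True)
model.fit(combined['text'].astype(str), combined['label'])
print('target dev accuracy:', model.score(val['text'].astype(str), val['label']))

with open('code/models/model_games_00010cp.pkl', 'wb') as f:  # hypothetical output name
    pickle.dump(model, f)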
Required files:
- code/test.py
- code/model_run_v2.py
- code/predict_max_v2.py
This requires that model_base.pkl from above exists, as well as a 10, 100, 1000, and 10000 model for each domain being tested.
Run via:
python3 code/test.py sew cp
python3 code/test.py games cp
python3 code/test.py sew ra
python3 code/test.py games ra
python3 code/test.py sew es
python3 code/test.py games es
These will run the test datasets against each model within each category, and output for each a probabilities file and a predictions file, in data/probabilities and data/predictions respectively.
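The exact output formats are defined in code/model_run_v2.py and code/predict_max_v2.py. As a sketch of the general idea (class probabilities first, then predictions taken as the higher-probability class), assuming the pickled models expose predict_proba; the column order and file names here are hypothetical.

# Sketch: write class probabilities for a test set, then derive predictions
# as the higher-probability class (assumes labels 0 = negative, 1 = positive).
import pickle
import pandas as pd

with open('code/models/model_games_00010cp.pkl', 'rb') as f:
    model = pickle.load(f)

test = pd.read_csv('data/interim/games_test.csv')
probs = model.predict_proba(test['text'].astype(str))

prob_df = pd.DataFrame(probs, columns=['neg_prob', 'pos_prob'])
prob_df.to_csv('data/probabilities/games_00010cp.csv', index=False)

preds = prob_df.values.argmax(axis=1)  # index of the higher-probability class
pd.DataFrame({'prediction': preds}).to_csv('data/predictions/games_00010cp.csv', index=False)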
Required files:
- code/metrics.py
- data/predictions/: all of the prediction files from above
Run Via:
python3 code/metrics.py games 00000ba 00010cp 00100cp 01000cp 10000cp 00010ra 00100ra 01000ra 10000ra 00010es 00100es 01000es 10000es > report/metrics/games_mixed_metrics.txt
python3 code/metrics.py sew 00000ba 00010cp 00100cp 01000cp 10000cp 00010ra 00100ra 01000ra 10000ra 00010es 00100es 01000es 10000es > report/metrics/sew_mixed_metrics.txt
It is not required to pipe the output to a .txt file (it will print to the terminal otherwise), but the file is more human-readable if you just want a quick look.
It will also output the same information into a CSV file in the report/metrics folder, with the headers:
['domain', 'trial_type', 'add_data', 'correctly_predicted', 'incorrectly_predicted', 'total_predicted_positives', 'ground_truth_positives', 'TP', 'TN', 'FP', 'FN', 'accuracy', 'precision', 'recall', 'f1']
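For reference, the headline metrics follow from the confusion-matrix counts in the usual way. A minimal sketch for a single run, with the column names and file paths as assumptions (the real logic is in code/metrics.py):

# Sketch: confusion-matrix counts and derived metrics for one set of predictions.
import pandas as pd

preds = pd.read_csv('data/predictions/games_00010cp.csv')['prediction']
truth = pd.read_csv('data/interim/games_test.csv')['label']

tp = int(((preds == 1) & (truth == 1)).sum())
tn = int(((preds == 0) & (truth == 0)).sum())
fp = int(((preds == 1) & (truth == 0)).sum())
fn = int(((preds == 0) & (truth == 1)).sum())

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
print(f'accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')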
Group 12
- Danielle Dequin dmdequin@gmail.com
- Chrisanna Cornish ccor@itu.dk
- Sabrina Pereira sabf@itu.dk
See also the list of contributors who participated in this project.
- This tutorial
- PurpleBooth for the ReadMe template.