This is a (partial) implementation of the paper *I Know What You Did on Venmo: Discovering Privacy Leaks in Mobile Social Payments*. SENMO is short for SENsitive content on venMO.
The SENMO pipeline consists of three scripts, run in order:
- Preprocessing: clean the raw, free-form Venmo notes into plain text.
- Train: fine-tune BERT on the cleaned text inputs and their labels (Trainset).
- Test: evaluate the trained model on Testset and report scores.
To run our code, please install the dependency packages with the following command:

```
pip install -r requirements.txt
```
NOTE: We use a Conda environment with Python 3.9.10 to run the experiments. This code is tested, and requirements.txt is generated, on the Mac M1 architecture. All packages can be installed from requirements.txt except tensorflow (version 2.8.0). On Mac M1, please follow the Apple instructions here to install tensorflow; on other platforms, follow the Google instructions here.
We store the dataset used for training and testing the model in `./data/`. Specifically, `./data/` contains `./data/train_orig.csv` (Trainset), used to fine-tune BERT, and `./data/test_orig.csv` (Testset), used to evaluate the trained model.
Use the following command to preprocess the data:

```
python preprocessing.py @preprocessing.config
```
`preprocessing.py` contains several cleaning functions: remove-stopword, remove-special-characters, remove-extra-whitespaces, and spell-corrector. `preprocessing.config` defines the command-line arguments that are passed to `preprocessing.py`. The format and arguments are illustrated below.
```
-i
./data/train_orig.csv
-o
./data/train_clean.csv
-c
regex
```
The format is simple: each command-line argument appears on one line, and its value on the next line.
Note: we use this format in all `.config` files to pass command-line arguments and their values.
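This flag-on-one-line, value-on-the-next format matches what Python's argparse reads natively via `fromfile_prefix_chars`. Below is a minimal sketch of how `preprocessing.py` might consume the @-file; the parser setup here is our assumption, not necessarily the repo's exact code:

```python
import argparse

# fromfile_prefix_chars="@" makes argparse expand "@preprocessing.config"
# by reading the file, one argument token per line -- exactly the
# flag-on-one-line, value-on-the-next format shown above.
parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
parser.add_argument("-i", required=True, help="input .csv of labeled notes")
parser.add_argument("-o", required=True, help="output path for the cleaned .csv")
parser.add_argument("-c", default="regex",
                    choices=["regex", "blob", "spellchecker", "autocorrect"],
                    help="spell-corrector to apply")
args = parser.parse_args()  # e.g. python preprocessing.py @preprocessing.config
```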
We further explain the arguments specified above:
- `-i`: path to the labeled data in .csv format (please refer to the format of the labeled data in `./data/train_orig.csv` and `./data/test_orig.csv`).
- `-o`: path to save the preprocessed data.
- `-c`: choice of spell-corrector function: "regex", "blob", "spellchecker", or "autocorrect". We use "regex" in all the experiments presented in the paper. For more details about data preprocessing, please refer to Section 5.1 of the paper. A sketch of the cleaning steps follows this list.
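To make the preprocessing steps concrete, here is a minimal sketch of what the cleaning functions could look like; the function names, the regexes, and the "regex" spell-corrector heuristic are illustrative assumptions, not the repo's actual implementation:

```python
import re

try:
    from nltk.corpus import stopwords
    STOPWORDS = set(stopwords.words("english"))
except (ImportError, LookupError):
    STOPWORDS = set()  # degrade gracefully if NLTK data is unavailable

def remove_special_characters(text):
    # keep letters, digits, and spaces only
    return re.sub(r"[^A-Za-z0-9 ]+", " ", text)

def remove_extra_whitespaces(text):
    return re.sub(r"\s+", " ", text).strip()

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

def regex_spell_corrector(text):
    # illustrative heuristic: collapse 3+ repeated letters ("soooo" -> "soo")
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

def clean_note(note):
    for step in (remove_special_characters, remove_extra_whitespaces,
                 remove_stopwords, regex_spell_corrector):
        note = step(note)
    return note

# exact output depends on the available stopword list
print(clean_note("Soooo much $$$ for pizzaaa!!!"))
```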
Note: `preprocessing.py` must be run at least twice: once for `./data/train_orig.csv` and once for `./data/test_orig.csv`.
We fine-tune the pre-trained language model BERT on the preprocessed Trainset from the previous step. To train the model, use the following command:

```
python train.py @train.config
```
The command-line arguments specified in `train.config` (shown below) are passed to `train.py`.
```
-i
./data/train_clean.csv
-o
./model/
-m
30
-b
32
-l
2e-5
-e
6
```
We further explain these arguments:
- `-i`: path to the preprocessed Trainset saved in the previous step.
- `-o`: path to a directory where the model weights will be saved.
- `-m`: max length, i.e., the maximum number of tokens/words per text input. In the paper, we set it to 30.
- `-b`: batch size. In the paper, we set it to 32.
- `-l`: learning rate. In the paper, we set it to 2e-5.
- `-e`: number of epochs. In the paper, we set it to 6.
For more details, please refer to Section 5.2 of the paper.
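For orientation, here is a minimal sketch of what the fine-tuning in `train.py` might look like, assuming the Hugging Face transformers TF API and hypothetical CSV column names `note` and `label`; the repo's actual data loading and model construction may differ:

```python
import pandas as pd
import tensorflow as tf
from transformers import BertTokenizerFast, TFBertForSequenceClassification

MAX_LEN, BATCH_SIZE, LR, EPOCHS = 30, 32, 2e-5, 6  # values from train.config
NUM_CLASSES = 2  # placeholder; set to the number of sensitive-content classes

df = pd.read_csv("./data/train_clean.csv")
texts, labels = df["note"].tolist(), df["label"].tolist()  # assumed column names

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tokenizer(texts, max_length=MAX_LEN, padding="max_length",
                truncation=True, return_tensors="tf")
dataset = tf.data.Dataset.from_tensor_slices((dict(enc), labels)).batch(BATCH_SIZE)

model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_CLASSES)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=LR),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(dataset, epochs=EPOCHS)
model.save_pretrained("./model/")  # weights land in the -o directory
```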
We evaluate the fine-tuned model from the previous step on the separate (preprocessed) Testset. To preprocess Testset, run `preprocessing.py` with `-i` set to `./data/test_orig.csv` and `-o` set to `./data/test_clean.csv` (or a path of your choice).
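For example, a Testset preprocessing config in the same flag-per-line format could look like this (the output filename is your choice):

```
-i
./data/test_orig.csv
-o
./data/test_clean.csv
-c
regex
```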
Once the (preprocessed) Testset is ready, run the following command to generate the Testset predictions and evaluate the results:

```
python test.py @test.config
```
The command-line arguments specified in `test.config` (shown below) are passed to `test.py`.
```
-t
./data/test_clean.csv
-i
./model/
-o
./prediction/
-m
30
-b
32
```
We further explain these arguments:
- `-t`: path to the preprocessed Testset.
- `-i`: path to the directory where the model weights were saved in the previous step.
- `-o`: path to a directory where the Testset predictions and evaluation results will be stored.
- `-m`: max length, i.e., the maximum number of tokens/words per text input (should be the same as specified in `train.config`).
- `-b`: batch size (should be the same as specified in `train.config`).
`test.py` will generate two output files, `pred.csv` and `score.txt`, saved in the directory specified by `-o` in `test.config`. `pred.csv` contains the model predictions in the same format as Testset. `score.txt` contains several evaluation scores: we report accuracy, true positives, false positives, and per-note accuracy for every class. For more details, please refer to `metric.py`.
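For illustration, the per-class true positives, false positives, and a per-note accuracy could be derived from a confusion matrix as sketched below; this is our interpretation of the reported scores, not necessarily what `metric.py` actually computes:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_scores(y_true, y_pred, num_classes):
    # rows are true classes, columns are predicted classes
    cm = confusion_matrix(y_true, y_pred, labels=list(range(num_classes)))
    scores = {"accuracy": np.trace(cm) / cm.sum()}
    for c in range(num_classes):
        tp = cm[c, c]                 # notes of class c predicted as c
        fp = cm[:, c].sum() - tp      # notes of other classes predicted as c
        support = cm[c, :].sum()      # total true notes of class c
        scores[c] = {"tp": int(tp), "fp": int(fp),
                     "per_note_acc": tp / support if support else 0.0}
    return scores
```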
If you have any questions related to the code (e.g., you run into problems while setting up dependencies or training/testing the model), feel free to email me, Pithayuth (charnset@usc.edu).