Repository of the paper "Investigating Human Values in Online Communities"
Still a work in progress.
- The list of subreddits we analysed can be found in `outputs/subreddits.txt`.
- Subreddit Schwartz values can be found in `outputs/subreddit_schwartz_values.csv`.
- Figures from the paper can be found in `outputs/`.
To use the code, clone it to a new folder, `cd` into it, and run `conda env create -f environment.yml`, which will create a fresh conda environment with all the needed dependencies.
First, download the posts data from pushshift.io (the `RS_*.zst` files). In the paper, we analysed the files RS_2022-01.zst, ..., RS_2022-09.zst, but any set of `RS_*.zst` files will do. These files are very large (the compressed size can surpass 15GB, and the uncompressed size can surpass 200GB), so save them on a system with enough storage.
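The dumps are newline-delimited JSON compressed with zstd, so they can be processed as a stream without ever decompressing them to disk. A minimal sketch (the `iter_posts` and `open_zst` names are ours, not from the repository, and `open_zst` assumes the third-party `zstandard` package):

```python
import io
import json


def iter_posts(stream):
    """Yield post dicts from a newline-delimited JSON text stream (one post per line)."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def open_zst(path):
    """Open a pushshift RS_*.zst dump as a text stream.

    Requires the third-party `zstandard` package (pip install zstandard).
    The large max_window_size is needed because pushshift dumps are
    compressed with long zstd windows.
    """
    import zstandard  # third-party; imported lazily so iter_posts works without it
    fh = open(path, "rb")
    reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
    return io.TextIOWrapper(reader, encoding="utf-8")
```

With this, `iter_posts(open_zst("RS_2022-01.zst"))` walks the dump one post at a time in constant memory.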
Next, run the script

```
python collect_reddit_data_to_label.py [PATH TO PUSHSHIFT] [SAVE PATH]
```

It will collect up to 1000 random posts from each subreddit in `outputs/subreddits.txt` and store them in `[SAVE PATH]`.
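Conceptually, collecting up to 1000 uniformly random posts per subreddit can be done in a single pass with reservoir sampling; a sketch of that idea (the names are illustrative, not the script's actual internals):

```python
import random
from collections import defaultdict


def sample_posts(posts, subreddits, k=1000, seed=0):
    """One-pass reservoir sampling: keep up to k uniformly random posts
    per subreddit of interest."""
    rng = random.Random(seed)
    wanted = set(subreddits)
    reservoirs = defaultdict(list)  # subreddit -> sampled posts
    seen = defaultdict(int)         # subreddit -> posts encountered so far
    for post in posts:
        sub = post.get("subreddit")
        if sub not in wanted:
            continue
        seen[sub] += 1
        if len(reservoirs[sub]) < k:
            # Reservoir not full yet: keep the post unconditionally.
            reservoirs[sub].append(post)
        else:
            # Replace a random slot with probability k / seen[sub],
            # which keeps the sample uniform over all posts seen.
            j = rng.randrange(seen[sub])
            if j < k:
                reservoirs[sub][j] = post
    return reservoirs
```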
We use the training code from https://github.com/m0re4u/value-disagreement, the repository of the paper "Do Differences in Values Influence Disagreements in Online Discussions?". Either use their repository to train the model, or follow their instructions while using ours, which contains a simplified, slim version of their code. The training script is `scripts/training_moral_values/train_paper.py`. To use it, run

```
python scripts/training_moral_values/train_paper.py both --use_model microsoft/deberta-v3-base --n_runs 1
```

Do not forget to download the ValueEval and ValueNet datasets and store them in the `data` directory.
To label the posts collected by `collect_reddit_data_to_label.py` with Schwartz values, run

```
python label_subreddits.py [POSTS PATH] [MODEL CHECKPOINT] [SAVE DIR]
```

where `[POSTS PATH]` is the same as `[SAVE PATH]` from the data collection step, and `[MODEL CHECKPOINT]` is the checkpoint of the trained model from the training step. The code also supports basic parallelism -- you can include the arguments `--subset [SUBSET ID]` and `--total [N SUBSETS]` to process only subset number `[SUBSET ID]` out of `[N SUBSETS]`.
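The subset arguments amount to a deterministic, non-overlapping partition of the work, so several workers can run in parallel without coordination. A sketch of the idea (the function name is ours, not the script's):

```python
def split_subset(items, subset_id, n_subsets):
    """Deterministically take subset `subset_id` (0-based) of `n_subsets`
    by round-robin assignment over the item index, so parallel workers
    never process the same item and together cover all of them."""
    if not 0 <= subset_id < n_subsets:
        raise ValueError("subset_id must be in [0, n_subsets)")
    return [x for i, x in enumerate(items) if i % n_subsets == subset_id]
```

For example, launching three processes with `--subset 0/1/2 --total 3` corresponds to `split_subset(items, 0, 3)`, `split_subset(items, 1, 3)`, and `split_subset(items, 2, 3)`.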
TBD. The notebook `analyse_subreddits.ipynb` contains some reference code.
To run the evaluation, we first generate a synthetic dataset of Reddit posts exhibiting a certain value and then perform the evaluation:

```
python scripts/synthetic_eval.py [MODEL CHECKPOINT] [GENERATION FLAG] [GENERATION CUT-OFF] [POSTS PATH]
```

where `[GENERATION FLAG]` determines whether to run the synthetic data generation.
If you use our code or dataset, kindly cite it using

```
@misc{borenstein2024investigatinghumanvaluesonline,
      title={Investigating Human Values in Online Communities},
      author={Nadav Borenstein and Arnav Arora and Lucie-Aimée Kaffee and Isabelle Augenstein},
      year={2024},
      eprint={2402.14177},
      archivePrefix={arXiv},
      primaryClass={cs.SI},
      url={https://arxiv.org/abs/2402.14177},
}
```