Reddit Depression Detection

Project Objective

The objective of this project is to train random forest classifiers that predict symptoms of depression from real Reddit text data. We create linguistic features in two ways: fitting an LDA topic model and generating embeddings with (Distil)RoBERTa. To do so, we reimplement most of the paper "Detecting Symptoms of Depression on Reddit", including dataset generation and preprocessing.

Dataset Generation

Due to memory limitations, we use a reduced version of the paper's dataset. You can access our reduced dataset here.

control dataset

We create the control dataset by collecting non-mental-health posts from the same authors, written at least 180 days before their first post in a depression-related subreddit.
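The 180-day rule above can be sketched as a small filter. This is a minimal illustration, not the project's actual code: the post fields (`author`, `subreddit`, `created`) and the function name `select_control_posts` are assumptions made for the example, and the subreddit lists are passed in by the caller.

```python
from datetime import datetime, timedelta

# Minimum gap between a control post and the author's first depression-related post.
CONTROL_GAP = timedelta(days=180)


def select_control_posts(posts, depression_subreddits, mental_health_subreddits):
    """Return posts that qualify as controls under the 180-day rule.

    `posts` is a list of dicts with hypothetical keys "author", "subreddit"
    and "created" (a datetime). These field names are assumptions.
    """
    depression = set(depression_subreddits)
    excluded = depression | set(mental_health_subreddits)

    # First pass: earliest depression-related post per author.
    first_depression_post = {}
    for post in posts:
        if post["subreddit"] in depression:
            author = post["author"]
            seen = first_depression_post.get(author)
            if seen is None or post["created"] < seen:
                first_depression_post[author] = post["created"]

    # Second pass: keep non-mental-health posts that are old enough.
    controls = []
    for post in posts:
        first = first_depression_post.get(post["author"])
        if first is None:
            continue  # author never posted in a depression-related subreddit
        if post["subreddit"] in excluded:
            continue  # mental health posts cannot be controls
        if post["created"] <= first - CONTROL_GAP:
            controls.append(post)
    return controls
```

Authors with no depression-related post are skipped entirely here, since the rule is defined relative to that first post.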

symptom dataset

For each symptom, we create a dataset by collecting posts from their respective subreddits, as shown in the table below.

Preprocessing and Feature Extraction

We used two methods to extract features:

1. LDA (Latent Dirichlet Allocation)

Tokenization: The dataset was tokenized using the happiestfuntokenizing library.
Stopword Removal: The 100 most frequent words in the dataset were removed as stopwords to reduce noise.
LDA Implementation:
- The paper uses MALLET's LDA implementation.
- We instead use Gensim's LdaMulticore model, which trains much faster than scikit-learn's implementation.
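The LDA steps above could look roughly like the sketch below. It assumes the posts are already tokenized into lists of strings; the function names, `num_topics=50`, `workers=2` and the seed are placeholders chosen for illustration, not values taken from this project or the paper. Gensim is imported inside the function so the stopword helper can be used on its own.

```python
from collections import Counter


def remove_top_words(tokenized_docs, top_n=100):
    """Drop the `top_n` most frequent tokens across the whole corpus."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    stopwords = {word for word, _ in counts.most_common(top_n)}
    return [[t for t in doc if t not in stopwords] for doc in tokenized_docs]


def fit_lda_features(tokenized_docs, num_topics=50, workers=2, seed=0):
    """Fit LdaMulticore and return one dense topic vector per document.

    num_topics, workers and seed are illustrative placeholders.
    """
    # Imported lazily: only this function needs Gensim.
    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaMulticore(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        workers=workers,
        random_state=seed,
    )

    features = []
    for bow in corpus:
        # minimum_probability=0.0 asks Gensim to report every topic.
        dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
        features.append([float(dist.get(k, 0.0)) for k in range(num_topics)])
    return features
```

Because LdaMulticore uses multiprocessing, calling `fit_lda_features` from a script is safest under an `if __name__ == "__main__":` guard.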

2. DistilRoBERTa

Tokenization: The dataset was tokenized with the RoBERTa tokenizer, loaded through Hugging Face's AutoTokenizer.
Embedding Extraction:
- The paper utilized the full RoBERTa model with 12 transformer blocks, extracting contextual embeddings from the 10th layer for downstream classification.
- For this project, we used DistilRoBERTa, a distilled version of RoBERTa with only 6 transformer blocks, enabling faster computation.
- We extracted embeddings from the 5th layer for downstream classification tasks.
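As a rough sketch of the embedding step, the code below pulls the 5th-layer hidden states from DistilRoBERTa with Hugging Face Transformers. The checkpoint name `distilroberta-base`, the mean pooling over non-padding tokens, and the `max_length` value are assumptions for illustration; the text above does not state how token embeddings are pooled into one vector per post.

```python
MODEL_NAME = "distilroberta-base"  # assumed checkpoint name
LAYER = 5  # transformer block whose output we use


def hidden_state_index(layer, num_layers=6):
    """Map a 1-based transformer block number to an index into hidden_states.

    hidden_states has num_layers + 1 entries: index 0 is the embedding
    output, index k is the output of block k.
    """
    if not 1 <= layer <= num_layers:
        raise ValueError(f"layer must be in 1..{num_layers}, got {layer}")
    return layer


def embed_posts(texts, layer=LAYER, max_length=512):
    """Return one embedding per text, shape (len(texts), hidden_size)."""
    # Imported lazily so the index helper works without torch installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME)
    model.eval()

    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**batch, output_hidden_states=True)

    index = hidden_state_index(layer, model.config.num_hidden_layers)
    hidden = outputs.hidden_states[index]  # (batch, seq_len, hidden_size)

    # Assumed pooling: average over real tokens, ignoring padding.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

For a large dataset the texts would normally be fed through `embed_posts` in batches, with the tokenizer and model loaded once outside the loop.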

Evaluation

We train 13 binary classifiers, each comparing a single symptom against control posts. Each symptom classifier is evaluated using:

5-fold cross-validation
ROC-AUC scoring
Balanced classes (equal samples from control and symptom)

Note: The original paper also evaluates symptom vs. control+other symptoms, which we omit.
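The evaluation protocol for one symptom could be sketched as follows. This is an illustration under assumptions, not the contents of evaluation.py: the function names, the downsampling strategy used to balance classes, the shuffled stratified folds, and the default random forest settings are all choices made for the example.

```python
import random


def balance_classes(symptom_rows, control_rows, seed=0):
    """Downsample the larger class so both classes have equal size.

    Returns (X, y) with symptom rows labeled 1 and control rows labeled 0.
    """
    rng = random.Random(seed)
    n = min(len(symptom_rows), len(control_rows))
    symptom = rng.sample(list(symptom_rows), n)
    control = rng.sample(list(control_rows), n)
    return symptom + control, [1] * n + [0] * n


def evaluate_symptom(symptom_rows, control_rows, seed=0):
    """Mean ROC-AUC of a random forest over 5-fold cross-validation."""
    # Imported lazily so balance_classes works without scikit-learn.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    X, y = balance_classes(symptom_rows, control_rows, seed=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    clf = RandomForestClassifier(random_state=seed)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
    return float(scores.mean())
```

Running `evaluate_symptom` once per symptom, with that symptom's feature rows and the control feature rows, gives one score per binary classifier.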

Ethical Considerations

Benefits

NLP can identify linguistic patterns associated with depression before clinical symptoms become severe, enabling earlier intervention and support.
Since discussing mental health problems remains taboo in many societies, some individuals may feel more comfortable expressing their mental health struggles online rather than in person.

Drawbacks/Potential Harms

As mentioned in the paper, many social media users have not explicitly consented to their data being used for mental health research.
NLP applications may produce false positives, misidentifying depression signals and triggering unnecessary interventions.
NLP applications cannot fully replace human clinicians, as automated systems may miss important contextual nuances that trained professionals would recognize.
NLP models may perform differently across demographic groups and cultures, potentially leading to biased or inaccurate assessments.

Future Plans

Training the random forest classifiers currently takes about 40 minutes. We plan to reduce this runtime.
