The objective of this project is to train a random forest classifier to predict symptoms of depression from real Reddit text data. We use two methods to create linguistic features: topic distributions from an LDA model and embeddings from (Distil)RoBERTa. To achieve this, we reimplement most parts of the paper "Detecting Symptoms of Depression on Reddit", including dataset generation and preprocessing.
Due to memory limitations, we reduced the size of the paper's dataset. You can access our reduced dataset here.
- We create the control dataset by collecting non-mental-health posts from the same authors, written at least 180 days before their first post in a depression-related subreddit.
- For each symptom, we create a dataset by collecting posts from their respective subreddits, as shown in the table below.
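
The control-set rule above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the post fields (`author`, `subreddit`, `created`), the function name, and the subreddit sets are hypothetical, and it assumes "same authors" means authors with at least one post in a depression-related subreddit.

```python
from datetime import datetime, timedelta


def build_control_posts(posts, depression_subs, mental_health_subs, min_gap_days=180):
    """Select control posts; field names and arguments are hypothetical."""
    # Earliest depression-related post per author.
    first_dep = {}
    for p in posts:
        if p["subreddit"] in depression_subs:
            prev = first_dep.get(p["author"])
            if prev is None or p["created"] < prev:
                first_dep[p["author"]] = p["created"]

    gap = timedelta(days=min_gap_days)
    controls = []
    for p in posts:
        first = first_dep.get(p["author"])
        if first is None:
            continue  # author never posted in a depression-related subreddit
        if p["subreddit"] in mental_health_subs or p["subreddit"] in depression_subs:
            continue  # keep only non-mental-health posts
        if p["created"] <= first - gap:
            controls.append(p)
    return controls


# Toy data for illustration only.
sample_posts = [
    {"author": "a", "subreddit": "cooking", "created": datetime(2020, 1, 1)},
    {"author": "a", "subreddit": "cooking", "created": datetime(2020, 11, 1)},
    {"author": "a", "subreddit": "Anxiety", "created": datetime(2019, 1, 1)},
    {"author": "a", "subreddit": "depression", "created": datetime(2020, 12, 1)},
    {"author": "b", "subreddit": "cooking", "created": datetime(2019, 1, 1)},
]
controls = build_control_posts(sample_posts, {"depression"}, {"depression", "Anxiety"})
```

In this toy example only author `a`'s January 2020 cooking post qualifies: the November post falls inside the 180-day window, the `Anxiety` post is mental-health related, and author `b` has no depression-related post.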
We used two methods to extract features:
- Tokenization: The dataset was tokenized using the `happiestfuntokenizing` library.
- Stopword Removal: The top 100 most frequent words were removed from the dataset to reduce noise.
- LDA Implementation:
  - The paper uses MALLET for LDA implementation.
  - We opted for Gensim’s `LdaMulticore` model as it trains much faster than Scikit-learn’s implementation.
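
A rough sketch of the stopword-removal and LDA steps, assuming the posts are already tokenized into lists of strings. The function names and the `num_topics`, `workers`, and `seed` defaults are placeholders, not the values used in this project or in the paper. Gensim is imported inside `lda_features` so the stopword helper can be used on its own.

```python
from collections import Counter


def remove_top_words(tokenized_docs, top_n=100):
    """Drop the top_n most frequent tokens across the whole corpus."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    stop = {word for word, _ in counts.most_common(top_n)}
    filtered = [[tok for tok in doc if tok not in stop] for doc in tokenized_docs]
    return filtered, stop


def lda_features(tokenized_docs, num_topics=50, workers=2, seed=0):
    """Fit LdaMulticore and return one dense topic vector per document.

    num_topics, workers, and seed are illustrative placeholders.
    """
    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaMulticore(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        workers=workers,
        random_state=seed,
    )
    features = []
    for bow in corpus:
        # minimum_probability=0.0 keeps low-probability topics in the output.
        dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
        features.append([float(dist.get(k, 0.0)) for k in range(num_topics)])
    return features
```

Because `LdaMulticore` uses multiprocessing, calling `lda_features` from a script is safest under an `if __name__ == "__main__":` guard.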
- Tokenization: The dataset was tokenized with the RoBERTa tokenizer, loaded through Hugging Face's `AutoTokenizer`.
- Embedding Extraction:
  - The paper used the full RoBERTa model with 12 transformer blocks, extracting contextual embeddings from the 10th layer for downstream classification.
  - For this project, we used DistilRoBERTa, a distilled version of RoBERTa with only 6 transformer blocks, enabling faster computation.
  - We extracted embeddings from the 5th layer for downstream classification tasks.
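
A hedged sketch of the embedding step with Hugging Face Transformers. Two things here are assumptions rather than details from this project: the checkpoint name `distilroberta-base`, and the mean pooling over non-padding tokens used to turn token embeddings into one vector per post. The function name and `max_length` are also illustrative.

```python
LAYER = 5  # 5th transformer block, as described above
MODEL_NAME = "distilroberta-base"  # assumed checkpoint name


def embed_posts(texts, layer=LAYER, model_name=MODEL_NAME, max_length=512):
    """Return one embedding per text from the chosen hidden layer."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()

    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**batch)

    # hidden_states[0] is the embedding-layer output, so index `layer`
    # is the output of the layer-th transformer block.
    hidden = outputs.hidden_states[layer]  # (batch, seq_len, dim)

    # Assumption: mean-pool over real (non-padding) tokens.
    mask = batch["attention_mask"].unsqueeze(-1).type_as(hidden)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.numpy()
```

For a 6-block model, `hidden_states` has 7 entries, so index 5 is valid and index 6 would be the final layer.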
We train 13 binary classifiers, each comparing a single symptom against control posts. Each symptom classifier is evaluated using:
- 5-fold cross-validation
- ROC-AUC scoring
- Balanced classes (equal samples from control and symptom)
Note: The original paper also evaluates symptom vs. control+other symptoms, which we omit.
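
The evaluation setup above could look roughly like the following with scikit-learn. This is a sketch under assumptions: classes are balanced by randomly downsampling the larger class, labels are 1 for symptom and 0 for control, and the forest uses scikit-learn's default hyperparameters. The function names and the seed are hypothetical.

```python
import random


def balanced_indices(labels, seed=0):
    """Downsample the larger class so both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    return sorted(rng.sample(pos, n) + rng.sample(neg, n))


def evaluate_symptom(X, y, seed=0):
    """Mean ROC-AUC over 5-fold CV for one symptom-vs-control task."""
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    idx = balanced_indices(list(y), seed)
    X_bal, y_bal = np.asarray(X)[idx], np.asarray(y)[idx]

    # n_jobs=-1 builds trees in parallel on all available cores.
    clf = RandomForestClassifier(n_jobs=-1, random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(clf, X_bal, y_bal, cv=cv, scoring="roc_auc")
    return scores.mean()
```

Calling `evaluate_symptom` once per symptom, with that symptom's features stacked on top of the control features, would give the 13 scores.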
- NLP can identify linguistic patterns associated with depression before clinical symptoms become severe, enabling earlier intervention and support.
- Since discussing mental health problems remains taboo in many societies, some individuals may feel more comfortable expressing their mental health struggles online rather than in person.
- As mentioned in the paper, many social media users have not explicitly consented to their data being used for mental health research.
- NLP applications may misidentify depression signals, producing false positives that lead to unnecessary interventions.
- NLP applications cannot fully replace human clinicians, as automated systems may miss important contextual nuances that trained professionals would recognize.
- NLP models may perform differently across demographic groups and cultures, potentially leading to biased or inaccurate assessments.
Training the random forest classifiers currently takes about 40 minutes. I plan to reduce this runtime.