An Application of Random Forest!
-
Objective: Project for my internship at Research Center VERA, Ca' Foscari University of Venice.
-
Abstract: 2,045,322 cryptocurrency-related Tweets (~287 MB) are retrieved using the StockTwits API. The messages were posted between 28/11/2014 and 25/07/2020. Nearly half of them are labeled with a sentiment (i.e. Bullish/Bearish). A Random Forest model is then trained on the labeled dataset to classify the sentiment of Tweets about cryptocurrencies, reaching a 74.75% prediction accuracy on the test set.
-
Status: Completed.
-
Methods:
- Text processing, inspired by Renault (2017) and Chen et al. (2019).
- TF-IDF (for text vectorization).
- Truncated SVD (for dimensionality reduction).
- Random Forest.
-
Requirements:
- Python 3
- numpy==1.18.5
- pandas==1.0.5
- scikit-learn==0.23.2
- requests==2.24.0
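The three modelling steps listed above (TF-IDF, truncated SVD, Random Forest) can be chained into a single scikit-learn estimator. The following is only a minimal sketch, not the code in this repo: the function name `build_classifier`, the toy messages, and the number of SVD components are assumptions made for illustration. The forest settings mirror the model parameters reported in the results, assuming that `ntree` corresponds to scikit-learn's `n_estimators`.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def build_classifier(n_components=100, n_estimators=500, random_state=0):
    """Chain TF-IDF -> truncated SVD -> Random Forest (hypothetical helper)."""
    return make_pipeline(
        # Turn each message into a sparse TF-IDF vector.
        TfidfVectorizer(),
        # Project the sparse vectors onto a few dense components.
        # n_components is an assumption; the README does not state it.
        TruncatedSVD(n_components=n_components, random_state=random_state),
        # max_depth and max_samples are taken from the reported parameters.
        RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=20,
            max_samples=0.75,
            random_state=random_state,
        ),
    )


# Toy messages invented for this example; they are not from the dataset.
texts = [
    "bitcoin looks strong buying more",
    "great breakout going to the moon",
    "holding long this will pump",
    "bullish trend keeps going up",
    "selling everything this will crash",
    "weak chart expecting a dump",
    "bearish trend keeps going down",
    "short it now price is falling",
]
labels = ["Bullish"] * 4 + ["Bearish"] * 4

# Small settings so the toy example runs instantly.
clf = build_classifier(n_components=2, n_estimators=10)
clf.fit(texts, labels)
predictions = clf.predict(["price is going up buying more"])
```

Wrapping the steps in one pipeline means the vectorizer and the SVD are fitted on the training messages only and then reused unchanged at prediction time.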
-
Clone this repo:
git clone https://github.com/dang-trung/stocktwits-sentiment-classifier
-
Create your environment (virtualenv):
virtualenv -p python3 venv
source venv/bin/activate (bash), or venv\Scripts\activate (Windows)
(venv) cd stocktwits-sentiment-classifier
(venv) pip install -e .
Or (conda):
conda env create -f environment.yml
conda activate stocktwits-sentiment-classifier
-
Run in terminal:
python -m sentiment_classifier
Note that, due to API limits, it will take several days to download all 2M+ cryptocurrency-related Tweets posted on StockTwits from 2014 to 2020.
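To illustrate why the download is slow, here is a rough sketch of a paginated, rate-limited download loop. It is not the repo's downloader. The response shape (`messages`, plus a `cursor` with `more` and `max`), the page size of 30, and the limit of 200 requests per hour are assumptions that should be checked against the current StockTwits API documentation. `fetch_page` is a hypothetical caller-supplied function, so the sketch runs without network access.

```python
import math
import time


def download_all(fetch_page, pause_seconds=18.0, sleep=time.sleep):
    """Collect messages page by page, walking backwards in time.

    fetch_page(max_id) is a hypothetical callable returning one parsed JSON
    page, assumed to look like
    {"messages": [...], "cursor": {"more": bool, "max": int}}.
    """
    messages = []
    max_id = None
    while True:
        page = fetch_page(max_id)
        batch = page.get("messages", [])
        if not batch:
            break
        messages.extend(batch)
        cursor = page.get("cursor") or {}
        if not cursor.get("more"):
            break
        max_id = cursor.get("max")
        # Wait between requests to stay under the (assumed) hourly limit.
        sleep(pause_seconds)
    return messages


def estimated_days(total_messages, per_request=30, requests_per_hour=200):
    """Back-of-the-envelope download time; both defaults are assumptions."""
    requests_needed = math.ceil(total_messages / per_request)
    return requests_needed / requests_per_hour / 24


# Offline demonstration with two fake pages instead of real API calls.
_FAKE_PAGES = {
    None: {
        "messages": [{"id": 3, "body": "c"}, {"id": 2, "body": "b"}],
        "cursor": {"more": True, "max": 2},
    },
    2: {
        "messages": [{"id": 1, "body": "a"}],
        "cursor": {"more": False, "max": 1},
    },
}


def fake_fetch(max_id):
    return _FAKE_PAGES[max_id]


demo = download_all(fake_fetch, sleep=lambda seconds: None)
days = estimated_days(2_045_322)
```

Under these assumed figures the estimate for 2,045,322 messages comes out in the multi-day range, which is why the full download cannot be done in one sitting.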
- Downloaded messages are stored in data/01_raw.
- Processed messages (keeping only the information relevant to sentiment) are stored in data/02_processed.
- Vectorized text messages are stored in data/03_vectorized (since this file is small compared to the files generated by steps 1 and 2, it is already included in the repo).
- External files (crypto symbols and text-processing rules) are stored in data/04_external.
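As an illustration of the processing step between data/01_raw and data/02_processed, the sketch below normalizes a single message. The actual rules live in data/04_external and are not reproduced here. The function name `normalize` and the placeholder tokens (`cashtag`, `linktag`, `usertag`) are assumptions, loosely in the spirit of the text processing described by Renault (2017).

```python
import re

# Hypothetical patterns; the repo's real rules are stored in data/04_external.
URL = re.compile(r"https?://\S+")
CASHTAG = re.compile(r"\$[a-z]+(?:\.[a-z]+)?")  # e.g. $btc.x after lowercasing
USER = re.compile(r"@\w+")
WHITESPACE = re.compile(r"\s+")


def normalize(message):
    """Lowercase a message and replace links, tickers and user mentions."""
    text = message.lower()
    # Replace links first so "@" or "$" inside a URL is not matched later.
    text = URL.sub(" linktag ", text)
    text = CASHTAG.sub(" cashtag ", text)
    text = USER.sub(" usertag ", text)
    # Collapse the extra spaces introduced by the substitutions.
    return WHITESPACE.sub(" ", text).strip()


cleaned = normalize("$BTC.X to the moon! @trader https://example.com/x")
# cleaned == "cashtag to the moon! usertag linktag"
```

Replacing tickers, links and user names with shared placeholders keeps the vocabulary small, so the TF-IDF step focuses on the words that carry sentiment.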
-
Results:
- Model parameters: ntree=500, max_depth=20, max_samples=0.75
- Confusion matrix of training set
Predicted Class | Actual: Bearish | Actual: Bullish |
---|---|---|
Bearish | 82,208 | 8,426 |
Bullish | 5,269 | 85,365 |
- Confusion matrix of test set (~74.75% accuracy)
Predicted Class | Actual: Bearish | Actual: Bullish |
---|---|---|
Bearish | 59,888 | 30,747 |
Bullish | 175,937 | 551,880 |
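The accuracy figures can be recomputed from the two confusion matrices above. This is a small check added for illustration, with the counts copied from the tables. The helper name `accuracy` is an assumption. Accuracy is the sum of the diagonal divided by the total count, so it does not depend on whether rows hold predicted or actual classes.

```python
def accuracy(matrix):
    """Share of correctly classified messages: diagonal sum over total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Counts copied from the tables above (rows and columns: Bearish, Bullish).
train_matrix = [[82208, 8426], [5269, 85365]]
test_matrix = [[59888, 30747], [175937, 551880]]

train_accuracy = accuracy(train_matrix)
test_accuracy = accuracy(test_matrix)

print(f"train: {train_accuracy:.2%}, test: {test_accuracy:.2%}")
# prints "train: 92.44%, test: 74.75%"
```

The test figure matches the 74.75% reported above. The training figure is noticeably higher, so the gap between the two gives a sense of how much the model overfits the training set.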
For a better understanding of the project, please read the report.