StockTwits Sentiment Classifier

An Application of Random Forest!

MIT License GitHub LinkedIn

Project Description

Introduction

  • Objective: Project for my internship at Research Center VERA, Ca' Foscari University of Venice.

  • Abstract: 2,045,322 cryptocurrency-related Tweets (~287 MB), posted between 28/11/2014 and 25/07/2020, were retrieved using the StockTwits API. Nearly half of these messages are labelled with a sentiment (i.e. Bullish/Bearish). A Random Forest model was then trained on the labelled dataset to classify the sentiment of Tweets about cryptocurrencies, reaching 74.75% prediction accuracy on the test set.

  • Status: Completed.

Methods Used

Dependencies

  • Python 3
  • numpy==1.18.5
  • pandas==1.0.5
  • scikit-learn==0.23.2
  • requests==2.24.0

Getting Started

How to Run

  1. Clone this repo: git clone https://github.com/dang-trung/stocktwits-sentiment-classifier

  2. Create your environment (virtualenv):
    virtualenv -p python3 venv
    source venv/bin/activate (bash) or venv\Scripts\activate (Windows)
    (venv) cd stocktwits-sentiment-classifier
    (venv) pip install -e .

    Or (conda):
    conda env create -f environment.yml
    conda activate stocktwits-sentiment-classifier

  3. Run in terminal:
    python -m sentiment_classifier
    Note that, due to API limits, fully downloading all 2 million+ cryptocurrency-related Tweets posted on StockTwits from 2014 to 2020 takes several days.
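To illustrate why the download is slow, here is a minimal sketch of a rate-limited, paginated download loop. It is not the code shipped in this repo: the endpoint URL, the `max` paging parameter, the `messages` key in the response and the 200 requests/hour budget are all assumptions for illustration, and it uses the standard library's `urllib` rather than `requests` so that it runs on its own.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed endpoint shape; check the StockTwits API documentation before use.
BASE_URL = "https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json"


def fetch_page(symbol, max_id=None):
    """Fetch one page of messages for `symbol`, optionally capped at `max_id`."""
    url = BASE_URL.format(symbol=urllib.parse.quote(symbol))
    if max_id is not None:
        url += "?" + urllib.parse.urlencode({"max": max_id})
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def download_messages(symbol, fetch=fetch_page, requests_per_hour=200,
                      sleep=time.sleep, max_pages=None):
    """Walk backwards through a symbol's stream, pausing between requests.

    `requests_per_hour` is a placeholder budget (an assumption), not a
    documented limit. `fetch` and `sleep` are injectable for testing.
    """
    delay = 3600.0 / requests_per_hour  # seconds to wait between requests
    messages, max_id, pages = [], None, 0
    while max_pages is None or pages < max_pages:
        page = fetch(symbol, max_id)
        batch = page.get("messages", [])
        if not batch:
            break  # no older messages left
        messages.extend(batch)
        # Next, ask only for messages strictly older than the oldest one seen.
        max_id = min(m["id"] for m in batch) - 1
        pages += 1
        sleep(delay)
    return messages
```

Calling `download_messages("BTC.X", max_pages=2)` would fetch at most two pages for the hypothetical symbol `BTC.X`. Since each request returns only a small batch and the loop waits between requests, collecting millions of messages takes a long time.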

Data Storage

  1. Downloaded messages will be stored in data/01_raw.
  2. Processed messages (reduced to only the information relevant to sentiment) will be stored in data/02_processed.
  3. Vectorized text messages are stored in data/03_vectorized (since this file is small compared to the files generated in steps 1 and 2, it is already included in the repo).
  4. External files (symbols of cryptos & rules for text-processing) are stored in data/04_external.
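As a rough illustration of step 2, the sketch below reduces a raw message to the fields relevant to sentiment and separates labelled from unlabelled messages. The field names (`id`, `created_at`, `body`, `entities.sentiment.basic`) are assumptions about the raw JSON layout, and both helper functions are hypothetical. The repo's actual processing, including the text-processing rules in data/04_external, may differ.

```python
LABELS = ("Bullish", "Bearish")


def extract_sentiment_fields(message):
    """Keep only the fields needed for sentiment classification.

    Assumes the label, when present, sits at
    message["entities"]["sentiment"]["basic"]; anything else yields None.
    """
    entities = message.get("entities") or {}
    sentiment = entities.get("sentiment") or {}
    return {
        "id": message.get("id"),
        "created_at": message.get("created_at"),
        "body": message.get("body", ""),
        "sentiment": sentiment.get("basic"),
    }


def split_labelled(messages):
    """Return (labelled, unlabelled) lists of reduced messages."""
    rows = [extract_sentiment_fields(m) for m in messages]
    labelled = [r for r in rows if r["sentiment"] in LABELS]
    unlabelled = [r for r in rows if r["sentiment"] not in LABELS]
    return labelled, unlabelled
```

Only the labelled half would then be used for training and testing the classifier.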

Results

  • Model parameters: ntree=500, max_depth=20, max_samples=0.75
  • Confusion matrix of training set

                          Actual Bearish   Actual Bullish
    Predicted Bearish         82,208            8,426
    Predicted Bullish          5,269           85,365

  • Confusion matrix of test set (~74.75% accuracy)

                          Actual Bearish   Actual Bullish
    Predicted Bearish         59,888           30,747
    Predicted Bullish        175,937          551,880
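
The reported accuracy can be checked directly from the confusion matrices above: accuracy is the sum of the diagonal (correct predictions) divided by the total number of messages. The short sketch below does this in plain Python; the `accuracy` helper is written for illustration and is not part of the repo.

```python
def accuracy(matrix):
    """Share of correct predictions: diagonal sum over the grand total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Counts copied from the confusion matrices above
# (rows: Bearish, Bullish; columns: Bearish, Bullish).
train_matrix = [[82_208, 8_426], [5_269, 85_365]]
test_matrix = [[59_888, 30_747], [175_937, 551_880]]

print(f"training accuracy: {accuracy(train_matrix):.2%}")  # 92.44%
print(f"test accuracy:     {accuracy(test_matrix):.2%}")   # 74.75%
```

For the test set this gives (59,888 + 551,880) / 818,452 ≈ 74.75%, matching the figure quoted above.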

Read More

For a better understanding of the project, please read the report.