A machine learning project that uses regression techniques to analyze the impact of Twitter sentiment on cryptocurrency prices.
Methods and Tools:
- Tweets Scraping: twint library
- Cryptocurrency Price Scraping: CoinGecko API
- Sentiment Analysis: tweet-preprocessor, ftfy and VADER
- Regression algorithms: Machine Learning algorithms, Neural Network
Cryptocurrency has been growing in popularity and relevance, and Twitter can be credited as one of the most active media for crypto-enthusiasts to communicate. Our goal for this project is to examine the relationship between tweets, which reflect public opinion, and the daily prices of popular e-coins, applying machine learning techniques to predict price fluctuations.
We combined the tweets dataset with the price dataset to build a comprehensive dataset for the analysis. Each dataset is described below:
- coingecko.csv: includes coins' daily prices
- tweets_sentiment.csv: includes tweets scraped from Twitter, cleaned using tweet-preprocessor, with sentiment scores calculated using VADER.
- cleaned_data.csv: a combined dataset from the previous 2 datasets, grouping daily prices with aggregated daily tweets' sentiment scores.
The coingecko.csv dataset was generated by two scripts: coingecko_scraper.py, which scrapes the daily prices, and coingecko_get_missing_dates.py, which fills in missing dates. It contains each coin's daily prices and related information, with 12817 rows and 7 columns.
Variable Name | Description | Data Type |
---|---|---|
index | unique index for daily price | int |
id | unique identifier for each coin given by CoinGecko | object |
name | unique name for each coin given by CoinGecko | object |
price_usd | daily price in USD | float |
market_cap_usd | total value of all the coins that have been mined in USD | float |
total_vol_usd | total amount of coins traded in the last 24 hours | float |
date | date | object |
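
The missing-dates step can be illustrated with a short sketch. This is not the code from coingecko_get_missing_dates.py; it only shows one way gaps in a daily price series could be detected, assuming dates are stored as ISO strings (`YYYY-MM-DD`). The function name `find_missing_dates` is hypothetical.

```python
from datetime import date, timedelta


def find_missing_dates(present, start, end):
    """Return every calendar day in [start, end] that is absent from `present`.

    `present` is an iterable of ISO date strings (assumed format: YYYY-MM-DD);
    `start` and `end` are datetime.date objects, both inclusive.
    """
    have = {date.fromisoformat(d) for d in present}
    missing = []
    day = start
    while day <= end:
        if day not in have:
            missing.append(day.isoformat())
        day += timedelta(days=1)
    return missing
```

The returned dates could then be requested from the price API one by one and appended to coingecko.csv.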
The tweets_sentiment.csv dataset was generated using two scripts: one for tweet scraping (twint_scraper.py) and one for text processing with sentiment analysis (vader_sentiment_analysis.py).
It contains relevant tweets (selected by keywords and hashtags) along with all the additional details returned by the twint library; only the columns needed for this project are described here. The dataset has 823346 rows and 11 columns.
The tweet contents were cleaned using tweet-preprocessor and ftfy before the sentiment scores were calculated with VADER.
Variable Name | Description | Data Type |
---|---|---|
date | date posted | object |
time | time posted | object |
replies_count | number of replies | int |
retweets_count | number of retweets | int |
likes_count | number of likes | int |
search | name of the coin | object |
decoded_tweet | actual tweet content | object |
neg | negative sentiment score calculated by VADER | float |
neu | neutral sentiment score calculated by VADER | float |
pos | positive sentiment score calculated by VADER | float |
compound | compound sentiment score calculated by VADER | float |
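
As background on the `compound` column: VADER derives it by summing the valence of the words in a text and squashing that sum into the range [-1, 1], while `neg`, `neu` and `pos` are proportions of the text falling into each category. The sketch below reproduces that normalization step in plain Python for illustration. The function name `vader_normalize` is ours, and the library performs this calculation internally; `alpha = 15` is VADER's default.

```python
import math


def vader_normalize(valence_sum: float, alpha: float = 15.0) -> float:
    """Map a raw summed valence onto [-1, 1], as VADER does for `compound`."""
    norm = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clamp to guard against floating-point overshoot at the extremes.
    return max(-1.0, min(1.0, norm))
```

A raw sum of 0 stays at 0, and increasingly large positive or negative sums approach 1 or -1 without reaching beyond them.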
The cleaned_data.csv dataset was generated by combining the two datasets above: tweets are grouped by date, and a sentiment score is aggregated for each day, taking each tweet's popularity (number of likes, replies, and retweets) into account.
Variable Name | Description | Data Type |
---|---|---|
total_vol | total amount of coins traded in the last 24 hours | float |
date | date in ordinal value | int |
price | daily price in USD | float |
positive | positive sentiment score calculated by VADER | float |
negative | negative sentiment score calculated by VADER | float |
neutral | neutral sentiment score calculated by VADER | float |
total_tweets | count value of total tweets posted each day | int |
bitcoin | identify coin type (0, 1) | int |
litecoin | identify coin type (0, 1) | int |
yearn-finance | identify coin type (0, 1) | int |
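
The daily aggregation can be sketched as follows. The exact weighting used in the project is not documented here, so this example assumes a simple scheme: each tweet counts once, plus once per like, reply and retweet. The function name `aggregate_daily`, the weight formula and the ISO date format are all assumptions; the column names follow the tables above.

```python
from collections import defaultdict
from datetime import date

# Coin indicator columns listed in the cleaned_data.csv table.
COINS = ["bitcoin", "litecoin", "yearn-finance"]


def aggregate_daily(tweets):
    """Group tweets by (date, coin) and return popularity-weighted mean scores."""
    sums = defaultdict(lambda: {"w": 0.0, "pos": 0.0, "neg": 0.0, "neu": 0.0, "n": 0})
    for t in tweets:
        # Assumed weight: 1 for the tweet itself plus its engagement counts.
        w = 1 + t["likes_count"] + t["replies_count"] + t["retweets_count"]
        acc = sums[(t["date"], t["search"])]
        acc["w"] += w
        acc["n"] += 1
        for key in ("pos", "neg", "neu"):
            acc[key] += w * t[key]

    rows = []
    for (day, coin), acc in sorted(sums.items()):
        row = {
            "date": date.fromisoformat(day).toordinal(),  # date as ordinal int
            "positive": acc["pos"] / acc["w"],
            "negative": acc["neg"] / acc["w"],
            "neutral": acc["neu"] / acc["w"],
            "total_tweets": acc["n"],
        }
        # One-hot flags: 1 for the row's coin, 0 for the others.
        row.update({c: int(c == coin) for c in COINS})
        rows.append(row)
    return rows
```

Each resulting row can then be joined with that day's price and volume from coingecko.csv to form one record of the combined dataset.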
| - crypto_price_analysis
| -- project_presentation Includes the presentation of the project
| --- Cryptocurrency Analysis - Project Presentation.pdf
| -- scraping_and_preprocessing Includes Python scripts with source codes to scrape and preprocess data
| --- coingecko_get_missing_dates.py Script used to get missing dates' prices, update coingecko.csv
| --- coingecko_scraper.py Script used to scrape daily prices of interested coins, create coingecko.csv
| --- twint_scraper.py Script used to scrape daily tweets from Twitter
| --- vader_sentiment_analysis.py Script used to perform text processing and calculate sentiment scores, create
| -- src Includes the datasets and the analysis code
| --- data Includes all datasets
| ---- coingecko Includes the coins' prices scraped from CoinGecko API
| ----- coingecko.csv
| ---- tweets Includes the tweets scraped from Twitter with sentiment scores
| ----- Link_to_dataset.md
| ---- cleaned_data.csv The combined clean dataset for analysis
| --- tree.dot Tree structure to generate diagram for illustration
| --- tree.png An illustrated tree from Random Forest Regressor
| -- .gitignore gitignore files
| -- LICENSE MIT License
| -- README.md Project Overview
The CoinGecko API only allows free public access to daily coin prices, which makes it difficult to apply machine learning algorithms to individual coins because of the limited amount of data available per coin. It was therefore necessary to combine different currencies to generate sufficient data for the ML models, even though BTC and YFI prices are significantly higher than the others. We understand that this skew in price levels may reduce our models' predictive capability.
To further improve predictions for practical use, we believe that access to more granular price data would be beneficial, as it would provide enough data points to run a separate model for each coin.