Machine Learning project using Regression techniques to analyze impacts of Twitter sentiment in cryptocurrency prices.

fabioturazzi/crypto_price_analysis

Cryptocurrency Price Analysis

Project Overview

A Machine Learning project using Regression techniques to analyze impacts of Twitter sentiment on cryptocurrency prices.

Methods and Tools:

Project Motivations

Cryptocurrency has been growing in popularity and relevance, and Twitter is arguably one of the most active media through which crypto enthusiasts communicate. Our goal for this project is to examine the relationship between tweets, as a reflection of public opinion, and the daily prices of popular cryptocurrencies, applying Machine Learning techniques to predict price fluctuations.

Datasets:

We combined the tweets dataset with the price dataset to build a comprehensive dataset for the analysis. The details of each dataset are described below:

  • coingecko.csv: includes coins' daily prices
  • tweets_sentiment.csv: includes tweets scraped from Twitter, cleaned using tweet-preprocessor, with sentiment scores calculated using VADER.
  • cleaned_data.csv: a combined dataset from the previous 2 datasets, grouping daily prices with aggregated daily tweets' sentiment scores.

coingecko.csv

This dataset was generated by two scripts: the CoinGecko scraper, and a second script that fetches prices for dates missing from the first pass. It contains each coin's daily price and related information, with 12817 rows and 7 columns.

| Variable Name | Description | Data Type |
| --- | --- | --- |
| index | unique index for daily price | int |
| id | unique identifier for each coin given by CoinGecko | object |
| name | unique name for each coin given by CoinGecko | object |
| price_usd | daily price in USD | float |
| market_cap_usd | total value of all the coins that have been mined in USD | float |
| total_vol_usd | total amount of coins traded in the last 24 hours | float |
| date | date | object |
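
The role of the missing-dates script can be illustrated with a minimal sketch that detects gaps in a daily price series. This is not the repository's code: the function name `find_missing_dates` and the sample dates are hypothetical, and the sketch assumes one row per coin per calendar day is expected.

```python
from datetime import date, timedelta


def find_missing_dates(observed, start, end):
    """Return every calendar day in [start, end] that is absent from `observed`."""
    seen = set(observed)
    total_days = (end - start).days + 1
    all_days = (start + timedelta(days=offset) for offset in range(total_days))
    return [day for day in all_days if day not in seen]


# Made-up sample: prices were scraped for four of six consecutive days.
scraped_days = [date(2021, 1, 1), date(2021, 1, 2), date(2021, 1, 4), date(2021, 1, 6)]
missing = find_missing_dates(scraped_days, date(2021, 1, 1), date(2021, 1, 6))
print(missing)
```

The resulting list could then be passed to whatever request logic fills the gaps in coingecko.csv.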

tweets_sentiment.csv

This dataset was generated using two scripts: one for tweet scraping and one for text processing with sentiment analysis. It contains relevant tweets (selected by keywords and hashtags) together with the additional details returned by the twint library; only the columns needed for this project are described here. The dataset has 823346 rows and 11 columns. The tweet contents were cleaned using tweet-preprocessor and ftfy before the sentiment scores were calculated with VADER.

| Variable Name | Description | Data Type |
| --- | --- | --- |
| date | date posted | object |
| time | time posted | object |
| replies_count | number of replies | int |
| retweets_count | number of retweets | int |
| likes_count | number of likes | int |
| search | name of the coin | object |
| decoded_tweet | actual tweet content | object |
| neg | negative sentiment score calculated by VADER | float |
| neu | neutral sentiment score calculated by VADER | float |
| pos | positive sentiment score calculated by VADER | float |
| compound | compound sentiment score calculated by VADER | float |
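
To give a feel for the `compound` column, the sketch below reproduces the normalization step that, to the best of our understanding, the reference VADER implementation applies to the summed word valences of a text: x / sqrt(x² + alpha) with alpha = 15, which maps any sum into the range [-1, 1]. The function name `normalize_compound` and the sample sums are illustrative assumptions; the real library also applies lexicon lookups and heuristics that are not shown here.

```python
import math


def normalize_compound(valence_sum, alpha=15.0):
    """Squash a raw valence sum into [-1, 1] (assumed VADER-style normalization)."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clamp defensively so floating-point noise can never leave the range.
    return max(-1.0, min(1.0, score))


# Made-up valence sums: strongly negative, neutral, mildly and strongly positive.
for raw in (-8.0, 0.0, 1.5, 8.0):
    print(raw, round(normalize_compound(raw), 4))
```

Because the function saturates, a tweet with many emotive words approaches but never exceeds ±1, which keeps scores comparable across tweets of different lengths.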

cleaned_data.csv

This dataset was generated by combining the two datasets above: tweets are grouped by date, and a sentiment score is aggregated for each day, taking each tweet's popularity into account based on its number of likes, replies, and retweets.
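
One way such a popularity-aware aggregation could work is sketched below. The exact weighting used in the project is not documented here, so the weight `1 + likes + replies + retweets` is purely an assumption, as are the function name `aggregate_daily_sentiment` and the sample tweets. Input keys follow the tweets_sentiment.csv columns, and output keys follow cleaned_data.csv.

```python
from collections import defaultdict


def aggregate_daily_sentiment(tweets):
    """Group tweets by (date, coin) and return engagement-weighted mean scores."""
    totals = defaultdict(
        lambda: {"weight": 0.0, "pos": 0.0, "neg": 0.0, "neu": 0.0, "count": 0}
    )
    for tweet in tweets:
        # Hypothetical weight: each tweet counts once, plus once per interaction.
        weight = 1 + tweet["likes_count"] + tweet["replies_count"] + tweet["retweets_count"]
        bucket = totals[(tweet["date"], tweet["search"])]
        bucket["weight"] += weight
        bucket["count"] += 1
        for key in ("pos", "neg", "neu"):
            bucket[key] += weight * tweet[key]
    return {
        group: {
            "positive": b["pos"] / b["weight"],
            "negative": b["neg"] / b["weight"],
            "neutral": b["neu"] / b["weight"],
            "total_tweets": b["count"],
        }
        for group, b in totals.items()
    }


# Made-up sample: two tweets about the same coin on the same day.
sample_tweets = [
    {"date": "2021-01-01", "search": "bitcoin", "likes_count": 3, "replies_count": 0,
     "retweets_count": 0, "pos": 0.5, "neg": 0.0, "neu": 0.5},
    {"date": "2021-01-01", "search": "bitcoin", "likes_count": 0, "replies_count": 0,
     "retweets_count": 0, "pos": 0.0, "neg": 0.25, "neu": 0.75},
]
daily = aggregate_daily_sentiment(sample_tweets)
print(daily[("2021-01-01", "bitcoin")])
```

With this weighting, the liked tweet contributes four times as much as the unliked one, so the daily scores lean toward the tweet that received more engagement.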

| Variable Name | Description | Data Type |
| --- | --- | --- |
| total_vol | total amount of coins traded in the last 24 hours | float |
| date | date in ordinal value | int |
| price | daily price in USD | float |
| positive | positive sentiment score calculated by VADER | float |
| negative | negative sentiment score calculated by VADER | float |
| neutral | neutral sentiment score calculated by VADER | float |
| total_tweets | count value of total tweets posted each day | int |
| bitcoin | identify coin type (0, 1) | int |
| litecoin | identify coin type (0, 1) | int |
| yearn-finance | identify coin type (0, 1) | int |
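
The `date` and coin-indicator columns above can be illustrated with a short sketch that converts a calendar date to its ordinal value and one-hot encodes the coin. It assumes that only the three indicator columns listed in the table exist; the helper `encode_row` and the price and volume figures are made up for illustration.

```python
from datetime import date

# Assumption: these are the only coin indicator columns, as listed in the table.
COINS = ("bitcoin", "litecoin", "yearn-finance")


def encode_row(day, coin, price, total_vol):
    """Build one model-ready row: ordinal date plus 0/1 coin indicators."""
    row = {"total_vol": total_vol, "date": day.toordinal(), "price": price}
    for name in COINS:
        row[name] = 1 if coin == name else 0
    return row


# Made-up price and volume values, for illustration only.
row = encode_row(date(2021, 1, 1), "litecoin", 126.0, 7.1e9)
print(row)
```

Storing the date as an integer lets a regressor treat time as a numeric feature, and `date.fromordinal(row["date"])` recovers the original calendar date.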

Project Directory

| - crypto_price_analysis                                       
|   -- project_presentation                                         Includes the presentation of the project
|     --- Cryptocurrency Analysis - Project Presentation.pdf
|   -- scraping_and_preprocessing                                   Includes Python scripts with source codes to scrape and preprocess data
|     --- coingecko_get_missing_dates.py                            Script used to get missing dates' prices, update coingecko.csv
|     --- coingecko_scraper.py                                      Script used to scrape daily prices of interested coins, create coingecko.csv  
|     --- twint_scraper.py                                          Script used to scrape daily tweets from Twitter
|     --- vader_sentiment_analysis.py                               Script used to perform text processing and calculate sentiment scores, create 
|   -- src                                                          Includes the datasets and the analysis code
|     --- data                                                      Includes all datasets
|         ---- coingecko                                            Includes the coins' prices scraped from CoinGecko API
|             ----- coingecko.csv
|         ---- tweets                                               Includes the tweets scraped from Twitter with sentiment scores
|             ----- Link_to_dataset.md                              
|         ---- cleaned_data.csv                                     The combined clean dataset for analysis
|     --- tree.dot                                                  Tree structure to generate diagram for illustration
|     --- tree.png                                                  An illustrated tree from Random Forest Regressor
|   -- .gitignore                                                   gitignore files
|   -- LICENSE                                                      MIT License
|   -- README.md                                                    Project Overview

Challenges

The CoinGecko API only allows free public access to daily coin prices, which makes it difficult to apply Machine Learning algorithms to predict individual coins because each coin yields too few data points. It was therefore necessary to combine different currencies to generate sufficient data for the ML models, even though BTC and YFI prices are significantly higher than the others. We understand that this discrepancy in price scale skews the combined data and may impact our models’ predictive capability.

Future Improvements

To further improve predictions for practical use, we believe that access to more granular price data would be beneficial, providing enough data points to run separate models for each coin.
