Cryptocurrency Price Analysis

Project Overview

A machine learning project that uses regression techniques to analyze the impact of Twitter sentiment on cryptocurrency prices.

Methods and Tools:

Tweets Scraping: twint library
Cryptocurrency Price Scraping: CoinGecko API
Sentiment Analysis: tweet-preprocessor, ftfy and VADER
Regression algorithms: Machine Learning algorithms, Neural Network

Project Motivations

Cryptocurrency has been growing in popularity and relevance, and Twitter is one of the most active media through which crypto-enthusiasts communicate. Our goal for this project is to examine the relationship between tweets, taken as a reflection of public opinion, and the daily price of popular e-coins, applying Machine Learning techniques to predict price fluctuations.

Datasets:

We combined the tweets dataset with the price dataset to build a comprehensive dataset for the analysis. Each dataset is described below:

coingecko.csv: includes coins' daily prices
tweets_sentiment.csv: includes tweets scraped from Twitter, cleaned using tweet-preprocessor, with sentiment scores calculated using VADER.
cleaned_data.csv: a combined dataset from the previous 2 datasets, grouping daily prices with aggregated daily tweets' sentiment scores.

coingecko.csv

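This dataset was generated by two scripts: coingecko_scraper.py, which scrapes the daily prices, and coingecko_get_missing_dates.py, which fetches prices for dates missing from the first pass. It contains each coin's daily price and related information, with 12817 rows and 7 columns.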

Variable Name	Description	Data Type
index	unique index for daily price	int
id	unique identifier for each coin given by CoinGecko	object
name	unique name for each coin given by CoinGecko	object
price_usd	daily price in USD	float
market_cap_usd	total value of all the coins that have been mined in USD	float
total_vol_usd	total amount of coins traded in the last 24 hours	float
date	date	object
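
The gap-filling step described above starts by finding which days are missing for each coin. The following is a minimal sketch of that check, not the code in coingecko_get_missing_dates.py. It assumes the `date` column holds ISO `YYYY-MM-DD` strings (the source only states the type as object), and the helper name `find_missing_dates` is hypothetical.

```python
from datetime import date, timedelta


def find_missing_dates(rows):
    """Return {coin_id: [missing ISO dates]} for gaps between each coin's first and last day."""
    seen = {}
    for row in rows:
        # Assumption: dates are stored as ISO strings such as "2021-01-05".
        seen.setdefault(row["id"], set()).add(date.fromisoformat(row["date"]))

    missing = {}
    for coin_id, days in seen.items():
        first, last = min(days), max(days)
        span = (last - first).days
        gaps = [
            first + timedelta(days=offset)
            for offset in range(span + 1)
            if first + timedelta(days=offset) not in days
        ]
        if gaps:
            missing[coin_id] = [d.isoformat() for d in gaps]
    return missing


# Illustrative rows only; these are not values from coingecko.csv.
rows = [
    {"id": "bitcoin", "date": "2021-01-01"},
    {"id": "bitcoin", "date": "2021-01-02"},
    {"id": "bitcoin", "date": "2021-01-05"},
    {"id": "litecoin", "date": "2021-01-01"},
    {"id": "litecoin", "date": "2021-01-02"},
]
print(find_missing_dates(rows))
```

The returned dates are the ones a follow-up request to the price API would need to cover before the rows are appended to coingecko.csv.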

tweets_sentiment.csv

This dataset was generated using two scripts: one for tweet scraping and one for text processing with sentiment analysis. It contains relevant tweets (selected by keywords and hashtags) and all the additional details returned by the twint library; only the columns needed for this project are described here. The dataset has 823346 rows and 11 columns. Tweet contents were cleaned using tweet-preprocessor and ftfy before the sentiment scores were calculated with VADER.

Variable Name	Description	Data Type
date	date posted	object
time	time posted	object
replies_count	number of replies	int
retweets_count	number of retweets	int
likes_count	number of likes	int
search	name of the coin	object
decoded_tweet	actual tweet content	object
neg	negative sentiment score calculated by VADER	float
neu	neutral sentiment score calculated by VADER	float
pos	positive sentiment score calculated by VADER	float
compound	compound sentiment score calculated by VADER	float
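
The four score columns above match the keys returned by VADER's `polarity_scores`. The sketch below shows one way the clean-then-score step could be wired together; it is not the code in vader_sentiment_analysis.py. The regex cleaner `basic_clean` is a rough, hypothetical stand-in for tweet-preprocessor and ftfy, and the helper names are assumptions. The analyzer is passed in as an argument so the helpers can be tried without the vaderSentiment package installed.

```python
import re

SCORE_KEYS = ("neg", "neu", "pos", "compound")


def basic_clean(text):
    """Rough stand-in for the cleaning step: drop URLs and @mentions, collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def score_tweet(text, analyzer):
    """Clean one tweet and return only the four sentiment columns kept in the dataset."""
    scores = analyzer.polarity_scores(basic_clean(text))
    return {key: scores[key] for key in SCORE_KEYS}


def make_vader_analyzer():
    """Build a real VADER analyzer (requires the vaderSentiment package)."""
    # Imported lazily so the helpers above work even when the package is absent.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    return SentimentIntensityAnalyzer()
```

With the package installed, `score_tweet("Bitcoin is looking great today", make_vader_analyzer())` returns a dictionary with the `neg`, `neu`, `pos`, and `compound` values for that tweet.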

cleaned_data.csv

This dataset was generated by combining the two datasets above: tweets are grouped by date, and a sentiment score is aggregated for each day, taking each tweet's popularity (number of likes, replies, and retweets) into account.

Variable Name	Description	Data Type
total_vol	total amount of coins traded in the last 24 hours	float
date	date in ordinal value	int
price	daily price in USD	float
positive	positive sentiment score calculated by VADER	float
negative	negative sentiment score calculated by VADER	float
neutral	neutral sentiment score calculated by VADER	float
total_tweets	count value of total tweets posted each day	int
bitcoin	identify coin type (0, 1)	int
litecoin	identify coin type (0, 1)	int
yearn-finance	identify coin type (0, 1)	int
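
The aggregation described above can be sketched as a popularity-weighted daily mean. The exact weighting used in the project is not stated, so the formula here (one per tweet, plus one per like, reply, and retweet) is an assumption, as are the helper name `aggregate_daily`, the ISO date format, and the idea that the `search` column holds the coin identifier. Joining the `price` and `total_vol` columns from coingecko.csv is left out.

```python
from collections import defaultdict
from datetime import date

COINS = ("bitcoin", "litecoin", "yearn-finance")


def aggregate_daily(tweets):
    """Group tweets by (coin, date) and return popularity-weighted mean sentiment rows."""
    sums = defaultdict(lambda: {"w": 0.0, "pos": 0.0, "neg": 0.0, "neu": 0.0, "n": 0})
    for tweet in tweets:
        # Assumed weighting: each tweet counts once, plus once per like, reply, and retweet.
        weight = 1 + tweet["likes_count"] + tweet["replies_count"] + tweet["retweets_count"]
        acc = sums[(tweet["search"], tweet["date"])]
        acc["w"] += weight
        acc["n"] += 1
        for key in ("pos", "neg", "neu"):
            acc[key] += weight * tweet[key]

    rows = []
    for (coin, day), acc in sorted(sums.items()):
        row = {
            "date": date.fromisoformat(day).toordinal(),  # date as an ordinal integer
            "positive": acc["pos"] / acc["w"],
            "negative": acc["neg"] / acc["w"],
            "neutral": acc["neu"] / acc["w"],
            "total_tweets": acc["n"],
        }
        # One 0/1 indicator column per coin.
        row.update({name: int(name == coin) for name in COINS})
        rows.append(row)
    return rows


# Illustrative tweets only; these are not values from tweets_sentiment.csv.
sample = [
    {"search": "bitcoin", "date": "2021-01-01", "likes_count": 3, "replies_count": 0,
     "retweets_count": 0, "pos": 1.0, "neg": 0.0, "neu": 0.0},
    {"search": "bitcoin", "date": "2021-01-01", "likes_count": 0, "replies_count": 0,
     "retweets_count": 0, "pos": 0.0, "neg": 0.0, "neu": 1.0},
]
daily = aggregate_daily(sample)
```

In this sample the liked tweet carries weight 4 and the other weight 1, so the day's positive score is pulled toward the more popular tweet.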

Project Directory

| - crypto_price_analysis                                       
|   -- project_presentation                                         Includes the presentation of the project
|     --- Cryptocurrency Analysis - Project Presentation.pdf
|   -- scraping_and_preprocessing                                   Includes Python scripts with source codes to scrape and preprocess data
|     --- coingecko_get_missing_dates.py                            Script used to get missing dates' prices, update coingecko.csv
|     --- coingecko_scraper.py                                      Script used to scrape daily prices of interested coins, create coingecko.csv  
|     --- twint_scraper.py                                          Script used to scrape daily tweets from Twitter
|     --- vader_sentiment_analysis.py                               Script used to perform text processing and calculate sentiment scores, create 
|   -- src                                                          Includes the datasets and the analysis code
|     --- data                                                      Includes all datasets
|         ---- coingecko                                            Includes the coins' prices scraped from CoinGecko API
|             ----- coingecko.csv
|         ---- tweets                                               Includes the tweets scraped from Twitter with sentiment scores
|             ----- Link_to_dataset.md                              
|         ---- cleaned_data.csv                                     The combined clean dataset for analysis
|     --- tree.dot                                                  Tree structure to generate diagram for illustration
|     --- tree.png                                                  An illustrated tree from Random Forest Regressor
|   -- .gitignore                                                   gitignore files
|   -- LICENSE                                                      MIT License
|   -- README.md                                                    Project Overview

Challenges

The CoinGecko API only allows free public access to daily prices, so each coin yields too few data points to train Machine Learning models on individual coins. It was therefore necessary to combine different currencies to generate sufficient data for the ML models, even though BTC and YFI prices are significantly higher than the others. We understand that this skew may limit our models’ predictive capability.

Future Improvements

To further improve predictions for practical use, we believe access to more granular price data would be beneficial, as it would provide enough data points to run separate models for each coin.


fabioturazzi/crypto_price_analysis