MODULE 18
RESOURCES:
IMAGE: image obtained from PNGwing.com
DATA: crypto_data.csv, crypto_clustering_starter_code.ipynb
SOFTWARE: Anaconda, Jupyter Notebook, Python, VSC
LIBRARIES: Plotly, hvPlot, Scikit-learn, Pandas
OVERVIEW:
This challenge entailed collaboration with Martha, a senior manager for the Advisory Services team at Accountability Accounting, to help prepare an analysis for an investment bank that wishes to enter the cryptocurrency market by offering a cryptocurrency investment portfolio. A report was compiled addressing which cryptocurrencies are currently available on the market and how they can be grouped into a classification system for this investment venture. Because the groups were not known in advance, unsupervised machine learning, in the form of a clustering algorithm with corresponding visualizations, was deemed the most suitable tool for the analysis.
RESULTS:
Deliverable 1: Preprocessing the data for PCA.
For this first Deliverable, the data was preprocessed prior to the Principal Component Analysis (PCA) as follows:
- only the cryptocurrencies being traded were kept
- the 'IsTrading' column was dropped
- rows with null values were removed
- rows for coins that are not being mined were removed
- a DF with the cryptocurrency names was created
- the 'CoinName' column was removed
- variables for 'Algorithm' and 'ProofType' were created and stored in the X DF
- the data was standardized with StandardScaler
FIGURE 1: Non-trading Cryptocurrencies Removed and CoinName Dropped
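The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the six rows are invented toy data standing in for crypto_data.csv, the name coin_names_df is hypothetical, and the use of pd.get_dummies to create the 'Algorithm' and 'ProofType' variables is an assumption about how those variables were built.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy rows invented for illustration; the real input is crypto_data.csv.
crypto_df = pd.DataFrame(
    {
        "CoinName": ["CoinA", "CoinB", "CoinC", "CoinD", "CoinE", "CoinF"],
        "Algorithm": ["SHA-256", "Scrypt", "Scrypt", "X11", "SHA-256", "X11"],
        "IsTrading": [True, True, False, True, True, True],
        "ProofType": ["PoW", "PoW/PoS", "PoW", "PoS", "PoW", "PoS"],
        "TotalCoinsMined": [1.0e6, 5.0e5, 2.0e6, None, 0.0, 3.0e6],
        "TotalCoinSupply": [2.1e6, 1.0e6, 4.0e6, 5.0e5, 1.0e5, 3.0e6],
    },
    index=["A", "B", "C", "D", "E", "F"],
)

# Keep only traded coins, then drop the now-constant 'IsTrading' column.
crypto_df = crypto_df[crypto_df["IsTrading"]].drop(columns="IsTrading")

# Remove rows with null values and rows for coins that are not being mined.
crypto_df = crypto_df.dropna()
crypto_df = crypto_df[crypto_df["TotalCoinsMined"] > 0]

# Keep the coin names in their own DataFrame (hypothetical name), then drop them.
coin_names_df = crypto_df[["CoinName"]]
crypto_df = crypto_df.drop(columns="CoinName")

# Assumption: the text columns are one-hot encoded with get_dummies.
X = pd.get_dummies(crypto_df, columns=["Algorithm", "ProofType"], dtype=float)

# Standardize every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
```

In this toy run only coins A, B and F survive the filters, which makes it easy to check each step by eye before running it on the full file.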
Deliverable 2: Reducing Data Dimensions Using PCA
Deliverable 2 entailed applying the Principal Component Analysis algorithm:
- reducing the X DF to 3 dimensions
- placing the result into a new DF
- creating the pcs_df DF, with the 3 columns PC1, PC2 and PC3 and the index from the crypto_df
FIGURE 2: PCA Algorithm Reducing Dimensions to 3 Principal Components
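As a rough sketch of the reduction described above, the snippet below runs PCA with three components and wraps the result in a DataFrame. The random matrix and the coin_i index labels are placeholders assumed for illustration; in the project the input would be the standardized X data and the index would come from crypto_df.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Placeholder for the standardized feature matrix: 20 hypothetical coins, 6 features.
rng = np.random.default_rng(0)
crypto_index = [f"coin_{i}" for i in range(20)]
X_scaled = rng.normal(size=(20, 6))

# Reduce the features to 3 principal components.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Store the components in a DataFrame that reuses the original index.
pcs_df = pd.DataFrame(X_pca, columns=["PC1", "PC2", "PC3"], index=crypto_index)

# Share of the total variance retained by each component.
explained = pca.explained_variance_ratio_
```

Inspecting explained is a quick way to judge how much information the three components keep.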
Deliverable 3: Clustering Cryptocurrencies Using K-means
This Deliverable included clustering with visualizations:
- an elbow curve was created with hvplot to find the optimal value for K
- the K-means algorithm was implemented to predict the K clusters for the cryptocurrency data
- the crypto_df and pcs_df were concatenated into the new clustered_df
- a CoinName column was added
- a 'Class' column was also added to hold the predictions, giving the clustered_df the following columns: Algorithm, ProofType, TotalCoinsMined, TotalCoinSupply, PC1, PC2, PC3, CoinName, Class
FIGURE 3: Concatenated Clustered Dataframe with added CoinName and Class
FIGURE 4: Elbow Curve Depicting Best K Value for Predictions
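The elbow search and clustering steps above might look something like the sketch below. The three synthetic blobs, the toy crypto_df values and the coin_names_df name are all made up for illustration, and K=3 is used only because the figure captions mention 3 clusters; on real data the value should be read off the elbow curve.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for pcs_df: three well-separated blobs of 10 points each.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [-5.0, 5.0, 0.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(10, 3)) for c in centers])
index = [f"coin_{i}" for i in range(30)]
pcs_df = pd.DataFrame(points, columns=["PC1", "PC2", "PC3"], index=index)

# Toy versions of the original columns and the saved coin names.
crypto_df = pd.DataFrame(
    {
        "Algorithm": ["SHA-256", "Scrypt", "X11"] * 10,
        "ProofType": ["PoW", "PoS", "PoW/PoS"] * 10,
        "TotalCoinsMined": rng.uniform(1e5, 1e7, size=30),
        "TotalCoinSupply": rng.uniform(1e6, 1e8, size=30),
    },
    index=index,
)
coin_names_df = pd.DataFrame({"CoinName": [f"Coin {i}" for i in range(30)]}, index=index)

# Elbow data: fit K-means for K = 1..10 and record the inertia of each fit.
k_values = list(range(1, 11))
inertia = []
for k in k_values:
    km = KMeans(n_clusters=k, random_state=0, n_init=10)
    km.fit(pcs_df)
    inertia.append(km.inertia_)
elbow_df = pd.DataFrame({"k": k_values, "inertia": inertia})
# With hvPlot installed, `import hvplot.pandas` followed by
# elbow_df.hvplot.line(x="k", y="inertia", xticks=k_values) draws the curve.

# Fit the final model (K=3 assumed here) and predict a cluster for each coin.
model = KMeans(n_clusters=3, random_state=0, n_init=10)
predictions = model.fit_predict(pcs_df)

# Combine the original columns with the components, then add names and classes.
clustered_df = pd.concat([crypto_df, pcs_df], axis=1)
clustered_df["CoinName"] = coin_names_df["CoinName"]
clustered_df["Class"] = predictions
```

Fixing random_state keeps the cluster labels reproducible between runs, which helps when the same classes are reused in later plots.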
Deliverable 4: Visualizing Cryptocurrencies Results
In this last Deliverable, the results were transformed into relatable visualizations:
- plotly express/hvplot were utilized to create a 3D scatter plot for visualizing the distinct groups corresponding to the 3 principal components
- the 'CoinName' and 'Algorithm' columns were passed to the hover_name and hover_data parameters so that each data point displays them on hover
- a table displaying the tradable cryptocurrencies with hvplot.table() was created
- the total number of tradable cryptocurrencies was printed
- a DF containing the clustered_df index, the scaled data and the columns 'CoinName' and 'Class' was created
- finally, a scatter plot with X-axis='TotalCoinsMined' and Y-axis='TotalCoinSupply', with the data ordered by 'Class' and the hover point showing 'CoinName', was created
FIGURE 5: 3D Scatterplot with hover name and data for the 3 Clusters
FIGURE 6: hvplot Table Illustrating tradable Cryptocurrencies
FIGURE 7: Dataframe with added CoinName and Class
FIGURE 8: Scatter plot with TotalCoinsMined and TotalCoinSupply, by Class
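A hedged sketch of the visualization steps is shown below. The six-row clustered_df is invented toy data, the helper names make_3d_scatter and make_2d_scatter and the name plot_df are hypothetical, and MinMaxScaler is an assumption, since the text only says the data was scaled. The plotting libraries are imported inside the helpers so the data preparation can run without them installed.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy clustered_df invented for illustration.
clustered_df = pd.DataFrame(
    {
        "Algorithm": ["SHA-256", "Scrypt", "X11", "Scrypt", "SHA-256", "X11"],
        "TotalCoinsMined": [1.0e6, 5.0e5, 3.0e6, 8.0e6, 2.0e5, 4.0e6],
        "TotalCoinSupply": [2.1e6, 1.0e6, 3.0e6, 9.0e6, 4.0e5, 5.0e6],
        "PC1": [0.1, -0.3, 1.2, 2.0, -0.5, 1.1],
        "PC2": [0.4, 0.2, -1.0, 1.5, 0.3, -0.9],
        "PC3": [-0.2, 0.1, 0.8, -1.1, 0.0, 0.7],
        "CoinName": ["Coin A", "Coin B", "Coin C", "Coin D", "Coin E", "Coin F"],
        "Class": [0, 0, 1, 2, 0, 1],
    },
    index=["A", "B", "C", "D", "E", "F"],
)


def make_3d_scatter(df):
    """3D scatter of the principal components, colored by cluster."""
    import plotly.express as px  # imported lazily; requires plotly

    return px.scatter_3d(
        df, x="PC1", y="PC2", z="PC3", color="Class", symbol="Class",
        hover_name="CoinName", hover_data=["Algorithm"],
    )


# Total number of tradable cryptocurrencies.
print(f"There are {len(clustered_df)} tradable cryptocurrencies.")

# Assumption: MinMaxScaler rescales both supply columns to the [0, 1] range.
scale_cols = ["TotalCoinSupply", "TotalCoinsMined"]
scaled = MinMaxScaler().fit_transform(clustered_df[scale_cols])

# New DataFrame with the clustered_df index, the scaled data, names and classes.
plot_df = pd.DataFrame(scaled, columns=scale_cols, index=clustered_df.index)
plot_df["CoinName"] = clustered_df["CoinName"]
plot_df["Class"] = clustered_df["Class"]


def make_2d_scatter(df):
    """2D scatter of mined coins vs. supply, grouped by cluster."""
    import hvplot.pandas  # noqa: F401  registers the .hvplot accessor

    return df.hvplot.scatter(
        x="TotalCoinsMined", y="TotalCoinSupply", by="Class",
        hover_cols=["CoinName"],
    )
```

In a notebook, calling make_3d_scatter(clustered_df).show() or displaying make_2d_scatter(plot_df) as the last expression of a cell would render the figures.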
SUMMARY:
In conclusion, unsupervised machine learning was successful in producing the analysis the bank requires to implement its new investment portfolio. Additionally, visualization libraries were utilized to effectively convey the requested information as follows:
- the elbow curve depicting the best value for K for the K-means algorithm for cryptocurrency cluster predictions
- a 3D scatterplot plotting the 3 clusters, including hover name and data
- the hvplot.table() listing the clustered PCA data in tabular format
- a scatterplot illustrating TotalCoinsMined(x) and TotalCoinSupply(y) by Class
REFERENCES: BCS, Google, Stack Overflow, GitHub