The Spotify Data Analysis Python Project uses Python to analyze music data and
surface the trends, patterns, and correlations within a music dataset. All data
was collected directly from the Spotify API, so the dataset reflects real
listening data rather than a synthetic sample. Data analysis plays a central
role in many domains, including music streaming services like Spotify. This
project is an exploration of data science techniques, focused on extracting
meaningful insights from Spotify data.
Feel free to reach out!
LinkedIn | Cristina Genduso
Tools Used🛠️:
- Programming Language: Python
- Libraries: Pandas, NumPy, Matplotlib, Seaborn
- IDE: Jupyter Notebook
- Dataset: Personal Spotify Dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
- import numpy as np: This imports the NumPy library and aliases it as 'np'. NumPy is used for numerical computations and provides support for arrays and matrices.
- import pandas as pd: This imports the Pandas library and aliases it as 'pd'. Pandas is used for data manipulation and analysis, providing data structures like DataFrames for tabular data.
- import matplotlib.pyplot as plt: This imports the Pyplot module from the Matplotlib library and aliases it as 'plt'. Matplotlib is a popular plotting library in Python, and Pyplot provides a convenient interface to create visualizations.
- import seaborn as sns: This imports the Seaborn library and aliases it as 'sns'. Seaborn is built on top of Matplotlib and offers a higher-level interface for creating attractive statistical visualizations.
The dataset used in this project was meticulously collected directly from the Spotify API, comprising a comprehensive collection of personal liked songs. Leveraging the capabilities of the Spotify API, I gathered a diverse range of music tracks, reflecting my musical preferences and tastes. This hands-on approach ensured the authenticity and relevance of the dataset, as it consists entirely of songs that resonate with me personally.
tracks = pd.read_csv('./inputs/saved_tracks_with_audio_features.csv')
The dataset provides a detailed glimpse into my music library, encompassing various audio features, artist information, genre classifications, and temporal attributes of each track. With this rich dataset at hand, the exploration aims to uncover intriguing patterns, correlations, and insights hidden within the vast realm of my favorite songs on Spotify. Let's delve deeper into the dataset to uncover fascinating insights and trends that illuminate my musical journey.
tracks.head()
NOTE: The screenshot shows only part of the table, because a screenshot cannot capture the full output. To see the complete table, please refer to the Jupyter notebook folder within this repository.
This line of code calls the head() method on the 'tracks' DataFrame. This
method displays the first few (5 by default) rows of the DataFrame, which is
useful for quickly getting an overview of the data.
#checking null in tracks data
pd.isnull(tracks).sum()
This line of code calls the pd.isnull() function on the 'tracks' DataFrame to
create a new boolean DataFrame where each cell contains:
- True if the corresponding cell in the original DataFrame ('tracks') is null;
- False otherwise.
The .sum() method is then used to count the number of True values in each
column, effectively giving the count of missing values in each column.
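As a small illustration of the null check described above, the sketch below runs the same steps on a toy DataFrame. The column names and values are made up for the example and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the tracks DataFrame (hypothetical columns and values).
demo = pd.DataFrame({
    "Track Name": ["A", "B", None],
    "Popularity": [55, np.nan, 70],
    "Acousticness": [0.12, 0.80, 0.45],
})

# Boolean mask: True where a value is missing, False otherwise.
mask = pd.isnull(demo)

# Summing booleans counts the True values, giving missing values per column.
missing_per_column = mask.sum()
print(missing_per_column)
```

Here 'Track Name' and 'Popularity' each report one missing value and 'Acousticness' reports none.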
#checking info in tracks data
tracks.info()
This line of code calls the info() method on the 'tracks' DataFrame. The
info() method provides a concise summary of the DataFrame, including the data
type of each column, the number of non-null values, and memory usage.
- Discovering the Top 10 Popular Songs in the Spotify Dataset
- Descriptive Statistics
- Average popularity of the tracks
- Visualization: Pearson Correlation Heatmap for Two Variables
  - numeric_columns = tracks.select_dtypes(include=['float64', 'int64']).columns: This line of code selects the columns of the tracks DataFrame that have numeric data types ('float64', 'int64').
  - hmap = sns.heatmap(td, annot=True, fmt='.1g', vmin=-1, vmax=1, center=0, cmap='crest', linewidths=0.1, linecolor='black'): This line of code uses Seaborn's heatmap() function to create a heatmap visualization of the correlation matrix td. It displays the correlation values as annotations, uses a color map ('crest') to represent the correlation strength, and fixes the color scale to the range -1 to 1.
- Regression Plot of Popularity vs. Acousticness with Regression Line
  - sns.regplot(data=tracks, x='Acousticness', y='Popularity', color='orange'): This line of code uses Seaborn's regplot() function to create a regression plot. It visualizes the relationship between the 'Popularity' and 'Acousticness' columns of the tracks DataFrame.
  - .set(title='Popularity vs Acousticness Correlation'): This line of code sets the title for the regression plot.
- Top 5 Genres in the Spotify Dataset
  - popular = genre_counts.sort_values(ascending=False).head(5): This line of code extracts the 5 most frequent genres in the dataset.
  - sns.barplot(y=popular.index, x=popular.values, palette='viridis', legend=False): This line of code uses Seaborn's barplot() function to create a horizontal bar plot of those genres.
  - .set(title='Top 5 Genres by Frequency'): This line of code sets the title for the bar plot.
- Monthly additions of favorite tracks
  - monthly_additions = tracks['Month'].value_counts().reindex(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']): This line of code counts the tracks added in each month using the Month column, then reorders the counts into calendar order, from 'Jan' to 'Dec'.
  - sns.barplot(x=monthly_additions.index, y=monthly_additions.values, palette='flare'): This line of code uses Seaborn's barplot() function to create a bar plot of the number of tracks added per month.
- Days with Peaks of Indie Tracks
- Top 5 indie artists in the dataset
most = tracks.sort_values(by='Popularity', ascending=False).head(10)
This line of code sorts the 'tracks' DataFrame by the 'Popularity' column in
descending order (ascending=False). The head(10) call then selects the first
10 rows of the sorted DataFrame, effectively selecting the 10 most popular
tracks.
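The sort-then-head pattern above can be sketched on a toy DataFrame as follows. The data is invented for illustration; nlargest() is shown only as an equivalent pandas shortcut, not as something the notebook uses.

```python
import pandas as pd

# Toy stand-in for the tracks DataFrame (hypothetical values).
demo = pd.DataFrame({
    "Track Name": ["A", "B", "C", "D"],
    "Popularity": [40, 85, 62, 91],
})

# Sort from most to least popular, then keep the first 2 rows.
top_sorted = demo.sort_values(by="Popularity", ascending=False).head(2)

# Equivalent shortcut: pick the 2 rows with the largest Popularity.
top_nlargest = demo.nlargest(2, "Popularity")

print(top_sorted)
```

Both approaches return tracks D and B, in that order.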
# Display summary statistics of the numerical columns in the dataset
tracks.describe().transpose()
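This line of code generates a concise summary of the numerical features in the tracks DataFrame (count, mean, standard deviation, minimum, quartiles, and maximum), providing insights into the central tendency, dispersion, and distribution of the data. The transpose() call flips the table so that each attribute appears as a row.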
average_popularity = tracks['Popularity'].mean()
This line of code calculates the average popularity of all the tracks in the
tracks DataFrame and stores this value in the variable average_popularity.
numeric_columns = tracks.select_dtypes(include=['float64', 'int64']).columns
td = tracks[numeric_columns].corr(method='pearson')
hmap = sns.heatmap(td, annot=True, fmt='.1g', vmin=-1, vmax=1, center=0, cmap='crest', linewidths=0.1, linecolor='black')
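To show what the correlation matrix passed to the heatmap contains, here is a minimal sketch on synthetic data. The column names and values are assumptions made for the example. It also uses include='number', which selects every numeric dtype (for example int32 or float32) and not only 'float64' and 'int64'; that is a variation on the notebook's selection, not the author's code.

```python
import numpy as np
import pandas as pd

# Synthetic data with known relationships (hypothetical columns).
rng = np.random.default_rng(0)
energy = rng.random(50)
demo = pd.DataFrame({
    "Energy": energy,
    "Loudness": 2.0 * energy + 1.0,               # rises exactly with Energy
    "Acousticness": 1.0 - energy,                 # falls exactly with Energy
    "Track Name": [f"t{i}" for i in range(50)],   # non-numeric, gets excluded
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric_columns = demo.select_dtypes(include="number").columns
corr = demo[numeric_columns].corr(method="pearson")

print(corr.round(2))
```

The result is a square table with 1.0 on the diagonal, a value near 1 for Energy vs Loudness, and a value near -1 for Energy vs Acousticness, which is the range the heatmap's vmin and vmax are set to cover.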
sns.set_style('darkgrid')
plt.figure(figsize=(10, 6))
sns.regplot(data=tracks, x='Acousticness', y='Popularity', color='orange').set(title='Popularity vs Acousticness Correlation')
plt.show()
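By default, the line that regplot() draws is an ordinary least-squares fit. The sketch below computes such a fit and the Pearson correlation directly with NumPy, on made-up values, to show the numbers behind the plot. It is an illustration under those assumptions, not part of the original notebook.

```python
import numpy as np

# Made-up values: popularity falls as acousticness rises.
acousticness = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
popularity = np.array([80.0, 70.0, 60.0, 50.0, 40.0])

# Degree-1 least-squares fit: popularity ~ slope * acousticness + intercept.
slope, intercept = np.polyfit(acousticness, popularity, deg=1)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(acousticness, popularity)[0, 1]

print(f"slope={slope:.1f}, intercept={intercept:.1f}, r={r:.2f}")
```

A negative slope together with r close to -1 would indicate that more acoustic tracks tend to be less popular. On real data the relationship is typically much weaker than in this toy example.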
popular = genre_counts.sort_values(ascending=False).head(5)
sns.barplot(y=popular.index, x=popular.values, palette='viridis', legend=False).set(title='Top 5 Genres by Frequency')
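The snippet above relies on a genre_counts Series that is not defined in this README. One way it could be built is sketched below, assuming a hypothetical 'Genres' column that stores comma-separated genre names per track; the real column name and format may differ.

```python
import pandas as pd

# Hypothetical 'Genres' column with comma-separated genres per track.
demo = pd.DataFrame({
    "Genres": [
        "indie pop, indie rock",
        "indie pop",
        "rock, indie rock",
        "indie pop",
    ],
})

# Split each entry into a list, give every genre its own row,
# trim stray spaces, and count how often each genre occurs.
genre_counts = (
    demo["Genres"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

popular = genre_counts.sort_values(ascending=False).head(2)
print(popular)
```

With this toy data, 'indie pop' appears three times and 'indie rock' twice, so those two form the top of the ranking.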
monthly_additions = tracks['Month'].value_counts().reindex(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
sns.barplot(x=monthly_additions.index, y=monthly_additions.values, palette='flare')
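The 'Month' column used above is not created anywhere in this README. A minimal sketch of how it might be derived from an 'Added At' timestamp column follows; the timestamps are invented, and the derivation itself is an assumption. The sketch also passes fill_value=0 to reindex(), so months with no additions show 0 instead of NaN.

```python
import pandas as pd

month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Invented timestamps standing in for the real 'Added At' values.
demo = pd.DataFrame({
    "Added At": [
        "2023-01-15T10:00:00Z",
        "2023-03-02T18:30:00Z",
        "2023-03-20T09:00:00Z",
    ],
})

# Parse the strings into datetimes, then map month numbers (1-12) to names.
demo["Added At"] = pd.to_datetime(demo["Added At"])
demo["Month"] = demo["Added At"].dt.month.map(lambda m: month_order[m - 1])

# Count tracks per month and put the counts in calendar order.
monthly_additions = demo["Month"].value_counts().reindex(month_order, fill_value=0)
print(monthly_additions)
```

Mapping month numbers through a fixed list keeps the labels independent of the system locale.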
daily_indie_additions = indie_tracks['Added At'].dt.date.value_counts().sort_index()
sns.lineplot(x=daily_indie_additions.index, y=daily_indie_additions.values, marker='o', color='purple')
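The indie_tracks DataFrame is also not defined in this README. The sketch below shows one plausible way to obtain it, assuming a hypothetical 'Genres' text column and filtering on the substring 'indie'. Note that the .dt accessor only works on datetime columns, so 'Added At' has to be parsed first (for example with pd.to_datetime) if it was read from CSV as text.

```python
import pandas as pd

# Invented rows; 'Genres' is a hypothetical column name.
demo = pd.DataFrame({
    "Genres": ["indie pop", "rock", "indie rock", "indie folk"],
    "Added At": pd.to_datetime([
        "2023-05-01 09:00",
        "2023-05-01 12:00",
        "2023-05-01 20:00",
        "2023-05-03 08:00",
    ]),
})

# Keep rows whose genre text mentions 'indie' (missing genres count as no match).
indie_tracks = demo[demo["Genres"].str.contains("indie", case=False, na=False)]

# Count indie additions per calendar day, in chronological order.
daily_indie_additions = indie_tracks["Added At"].dt.date.value_counts().sort_index()

# The day with the most indie additions.
peak_day = daily_indie_additions.idxmax()
print(daily_indie_additions)
print(peak_day)
```

In this toy data two indie tracks were added on 2023-05-01 and one on 2023-05-03, so the peak day is 2023-05-01.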
indie_artists = indie_tracks['Artists'].str.split(',').explode().value_counts().head(5)
sns.barplot(y=indie_artists.index, x=indie_artists.values, palette='cubehelix')
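One thing to watch with split(',') is whitespace: if artists are stored as 'Alpha, Beta', the second name comes out as ' Beta' with a leading space and is counted separately from 'Beta'. The sketch below demonstrates this on invented names and shows a str.strip() step as a possible safeguard. Whether the real data needs it depends on how the 'Artists' column is formatted.

```python
import pandas as pd

# Invented artist strings, separated by a comma and a space.
demo = pd.DataFrame({"Artists": ["Alpha, Beta", "Beta", "Gamma, Beta", "Alpha"]})

# Splitting on ',' alone leaves a leading space on later names.
naive = demo["Artists"].str.split(",").explode().value_counts()

# Stripping whitespace after exploding merges ' Beta' and 'Beta'.
cleaned = demo["Artists"].str.split(",").explode().str.strip().value_counts()

print(naive)
print(cleaned)
```

In the naive count, 'Beta' and ' Beta' appear as two different artists; after stripping, 'Beta' is counted three times.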