An exploratory data analysis (EDA) and data visualization project using data from Spotify using Python.

cristinagenduso/Spotify-Music-Data-Analysis

Spotify Music Data Analysis Project🎧

Introduction

The Spotify Data Analysis Python Project explores music data with Python, using data-driven analysis to surface trends, patterns, and correlations in a music dataset. All data was collected directly from the Spotify API. Data analysis plays a central role in many domains, including music streaming services like Spotify, and this project is a data science exercise focused on extracting meaningful insights from Spotify data.
Feel free to reach out! LinkedIn | Cristina Genduso

Tools Used🛠️:

  • Programming Language: Python
  • Libraries: Pandas, NumPy, Matplotlib, Seaborn
  • IDE: Jupyter Notebook
  • Dataset: Personal Spotify Dataset

Import Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
  • import numpy as np: This imports the NumPy library and aliases it as 'np'. NumPy is used for numerical computations and provides support for arrays and matrices.
  • import pandas as pd: This imports the Pandas library and aliases it as 'pd'. Pandas is used for data manipulation and analysis, providing data structures like DataFrames for tabular data.
  • import matplotlib.pyplot as plt: This imports the Pyplot module from the Matplotlib library and aliases it as 'plt'. Matplotlib is a popular plotting library in Python, and Pyplot provides a convenient interface to create visualizations.
  • import seaborn as sns: This imports the Seaborn library and aliases it as 'sns'. Seaborn is built on top of Matplotlib and offers a higher-level interface for creating attractive statistical visualizations.

Exploring the Dataset

Data Collection

The dataset used in this project was meticulously collected directly from the Spotify API, comprising a comprehensive collection of personal liked songs. Leveraging the capabilities of the Spotify API, I gathered a diverse range of music tracks, reflecting my musical preferences and tastes. This hands-on approach ensured the authenticity and relevance of the dataset, as it consists entirely of songs that resonate with me personally.

tracks = pd.read_csv('./inputs/saved_tracks_with_audio_features.csv')
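
Later steps call the .dt accessor on the 'Added At' column, which only works when that column holds datetimes rather than plain strings. The following is a minimal sketch of one way to get that at load time, assuming the CSV stores 'Added At' as timestamp text. The in-memory CSV and its values are made up for illustration and are not the real file.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for saved_tracks_with_audio_features.csv
csv_text = """Name,Popularity,Added At
Song A,71,2023-01-15 10:30:00
Song B,54,2023-02-03 18:05:00
"""

# parse_dates converts the listed columns to datetime while reading
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['Added At'])

print(sample.dtypes)
print(sample['Added At'].dt.month.tolist())
```

If the column was already loaded as text, pd.to_datetime(tracks['Added At']) is the usual way to convert it afterwards.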

Overview

The dataset provides a detailed view of my music library: audio features, artist information, genre classifications, and temporal attributes for each track. The exploration below looks for patterns, correlations, and trends among my favorite songs on Spotify.

tracks.head()

NOTE: The screenshot shows only part of the table, because the full output does not fit in a single screenshot. For the complete table, please refer to the Jupyter notebook folder within this repository.

Output: [image: first rows of the tracks DataFrame]

This line of code calls the head() method on the 'tracks' DataFrame. This method is used to display the first few (5 by default) rows of the DataFrame. This is useful for quickly getting an overview of the data.

Identifying Null Values in the Dataset

#checking null in tracks data
pd.isnull(tracks).sum()

Output: [image: null counts per column]

This line of code uses the pd.isnull() function on the 'tracks' DataFrame to create a new boolean DataFrame where each cell contains:

  • True if the corresponding cell in the original DataFrame ('tracks') is null;
  • False otherwise.
The .sum() function is then used to count the number of True values in each column, effectively giving the count of missing values in each column.
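
A small sketch of the same null check on a made-up DataFrame (the column names mirror the ones used in this README, the values are invented). It also shows the share of missing values per column, which can be easier to read than raw counts on a large library.

```python
import numpy as np
import pandas as pd

# Made-up rows; None/NaN mark missing values
demo = pd.DataFrame({
    'Name': ['Song A', 'Song B', None],
    'Popularity': [71, np.nan, 54],
    'Genres': ['indie pop', None, None],
})

# Same result as pd.isnull(demo).sum(): True counts as 1, False as 0
null_counts = demo.isnull().sum()
print(null_counts)

# Mean of a boolean column is the fraction of True values
null_share = demo.isnull().mean() * 100
print(null_share.round(1))
```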

Dataset Info

#checking info in tracks data
tracks.info()

Output: [image: tracks.info() summary]

This line of code calls the info() method on the 'tracks' DataFrame. The info() method provides a concise summary of the DataFrame, including the data types of each column, the number of non-null values, and memory usage.
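
info() prints its summary rather than returning it. As a hedged sketch on invented data, the summary can be captured as text through the buf parameter, and the non-null counts it reports can be reproduced with count():

```python
import io
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    'Name': ['Song A', 'Song B'],
    'Popularity': [71, 54],
    'Energy': [0.8, None],
})

# Write the summary into a text buffer instead of printing it directly
buffer = io.StringIO()
demo.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# Non-null values per column, the same figures info() lists
non_null = demo.count()
print(non_null)
```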


Extracting Insights from the Dataset through Analysis

  1. Discovering the Top 10 Popular Songs in the Spotify Dataset

    most = tracks.sort_values(by='Popularity', ascending=False).head(10)

    Output: [image: the 10 most popular tracks]

    This line of code sorts the 'tracks' DataFrame by the 'Popularity' column in descending order (ascending=False). head(10) then selects the first 10 rows of the sorted DataFrame, which are the 10 most popular tracks, and the result is stored in 'most'.
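
To make the descending sort concrete, here is a minimal sketch on four invented tracks. It also shows nlargest(), a pandas shortcut that returns the same rows for this data without an explicit sort.

```python
import pandas as pd

# Invented tracks and scores, for illustration only
demo = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Popularity': [40, 90, 65, 78],
})

# Highest popularity first, then keep the first two rows
top_sorted = demo.sort_values(by='Popularity', ascending=False).head(2)

# Equivalent shortcut for "top n by column"
top_nlargest = demo.nlargest(2, 'Popularity')

print(top_sorted)
print(top_nlargest)
```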


  2. Descriptive Statistics

    # Display summary statistics of the numerical columns in the dataset
    tracks.describe().transpose()

    Output: [image: summary statistics table]

    This line of code generates summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numerical columns of the tracks DataFrame. transpose() flips the result so that each feature is a row and each statistic is a column.
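
A short sketch of what describe().transpose() returns, using invented values. Note that the text column is left out of the summary, because describe() only covers numeric columns when a DataFrame mixes numeric and non-numeric ones.

```python
import pandas as pd

# Invented values for illustration only
demo = pd.DataFrame({
    'Popularity': [40, 60, 80],
    'Energy': [0.2, 0.5, 0.8],
    'Name': ['A', 'B', 'C'],   # non-numeric, skipped by describe()
})

stats = demo.describe().transpose()
print(stats)

# After transposing, statistics are columns and features are rows
print(stats.loc['Popularity', 'mean'])
```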


  3. Average popularity of the tracks

    average_popularity = tracks['Popularity'].mean()

    Output: [image: average popularity value]

    This line of code calculates the average popularity of all the tracks in the tracks DataFrame and stores this value in the variable average_popularity.
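
One detail worth knowing about mean(): it skips missing values by default. A minimal sketch with invented scores, which also uses the average to pick out the tracks above it:

```python
import pandas as pd

# Invented scores; one value is missing on purpose
demo = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Popularity': [40, 60, None, 80],
})

# NaN is ignored, so this averages 40, 60 and 80
average_popularity = demo['Popularity'].mean()
print(average_popularity)

# Tracks that are more popular than the average
above_average = demo[demo['Popularity'] > average_popularity]
print(above_average)
```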


  4. Visualization: Pearson Correlation Heatmap of the Numeric Variables

    numeric_columns = tracks.select_dtypes(include=['float64', 'int64']).columns
    td = tracks[numeric_columns].corr(method='pearson')
    hmap = sns.heatmap(td, annot=True, fmt='.1g', vmin=-1, vmax=1, center=0,
                       cmap='crest', linewidths=0.1, linecolor='black')

    Output: [image: correlation heatmap]

    • numeric_columns = tracks.select_dtypes(include=['float64', 'int64']).columns: This line of code selects the columns of the tracks DataFrame that have numeric data types ('float64', 'int64').
    • td = tracks[numeric_columns].corr(method='pearson'): This line of code computes the Pearson correlation coefficient for every pair of numeric columns and stores the resulting correlation matrix in td.
    • hmap = sns.heatmap(td, annot=True, fmt='.1g', vmin=-1, vmax=1, center=0, cmap='crest', linewidths=0.1, linecolor='black'): This line of code uses Seaborn's heatmap() function to create a heatmap visualization of the correlation matrix. It displays the correlation values as annotations, uses a color map ('crest') to represent the correlation strength, and sets the range of correlation values to be between -1 and 1.
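
The heatmap is only a rendering of the correlation matrix, so it can help to look at the matrix itself first. A minimal sketch on invented data, built so that one pair of columns is perfectly positively correlated and another perfectly negatively correlated. It uses include='number', which selects all numeric dtypes rather than only 'float64' and 'int64'.

```python
import pandas as pd

# Invented values chosen to give correlations of exactly +1 and -1
demo = pd.DataFrame({
    'Popularity': [10, 20, 30, 40],
    'Energy': [0.1, 0.2, 0.3, 0.4],         # rises with Popularity
    'Acousticness': [0.9, 0.7, 0.5, 0.3],   # falls as Popularity rises
    'Name': ['A', 'B', 'C', 'D'],           # non-numeric, excluded below
})

numeric = demo.select_dtypes(include='number')
corr = numeric.corr(method='pearson')
print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, which is why the heatmap mirrors itself across the diagonal.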

  5. Regression Plot of Popularity vs. Acousticness with Regression Line

    sns.set_style('darkgrid')
    plt.figure(figsize=(10, 6))
    sns.regplot(data=tracks, x='Acousticness', y='Popularity',
                color='orange').set(title='Popularity vs Acousticness Correlation')
    plt.show()

    Output: [image: regression plot]

    • sns.regplot(data=tracks, x='Acousticness', y='Popularity', color='orange'): This line of code uses Seaborn's regplot() function to create a regression plot. It draws a scatter plot of the 'Acousticness' and 'Popularity' columns from the tracks DataFrame together with a fitted linear regression line.
    • .set(title='Popularity vs Acousticness Correlation'): This line of code sets the title for the regression plot.
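
regplot() draws the fitted line but does not report its coefficients. As a hedged sketch on invented data, an ordinary least-squares line of the same kind can be computed with NumPy's polyfit to read off the slope and intercept, alongside the Pearson correlation:

```python
import numpy as np
import pandas as pd

# Invented values for illustration only
demo = pd.DataFrame({
    'Acousticness': [0.0, 0.25, 0.5, 0.75, 1.0],
    'Popularity': [80, 70, 65, 50, 45],
})

# Degree-1 polynomial fit: Popularity ≈ slope * Acousticness + intercept
slope, intercept = np.polyfit(demo['Acousticness'], demo['Popularity'], deg=1)

# Pearson correlation between the two columns
r = demo['Acousticness'].corr(demo['Popularity'])

print(f"slope={slope:.1f}, intercept={intercept:.1f}, r={r:.2f}")
```

A negative slope means popularity tends to fall as acousticness rises in this sample; it describes an association, not a cause.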

  6. Top 5 Genres in the Spotify Dataset

    # genre_counts (tracks per genre) is not defined in this README
    popular = genre_counts.sort_values(ascending=False).head(5)
    sns.barplot(y=popular.index, x=popular.values, palette='viridis',
                legend=False).set(title='Top 5 Genres by Frequency')

    Output: [image: bar chart of the top 5 genres]

    • popular = genre_counts.sort_values(ascending=False).head(5): This line of code extracts the 5 most frequent genres in the dataset.
    • sns.barplot(y=popular.index, x=popular.values, palette='viridis', legend=False): This line of code uses Seaborn's barplot() function to create a horizontal bar plot, with genres on the y-axis and their counts on the x-axis.
    • .set(title='Top 5 Genres by Frequency'): This line of code sets the title for the bar plot.
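
The snippet above relies on genre_counts, which this README does not define. The following is one possible way to build it, under the assumption (not confirmed by the source) that the data has a 'Genres' column holding comma-separated genre names per track. The column name and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical layout: one comma-separated genre string per track
demo = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Genres': ['indie pop, rock', 'rock', 'indie pop, rock, folk'],
})

genre_counts = (
    demo['Genres']
    .str.split(',')     # string -> list of genres
    .explode()          # one row per (track, genre) pair
    .str.strip()        # drop the space left after each comma
    .value_counts()     # tracks per genre, most frequent first
)

popular = genre_counts.sort_values(ascending=False).head(2)
print(popular)
```

value_counts() already returns counts in descending order, so the extra sort_values() mainly makes the intent explicit.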

  7. Monthly additions of favorite tracks

    monthly_additions = tracks['Month'].value_counts().reindex(
        ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
         'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
    sns.barplot(x=monthly_additions.index, y=monthly_additions.values,
                palette='flare')

    Output: [image: bar chart of monthly additions]

    • monthly_additions = tracks['Month'].value_counts().reindex(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']): This line of code counts how many tracks were added in each month and then reorders the counts into calendar order, since value_counts() sorts by frequency rather than by month.
    • sns.barplot(x=monthly_additions.index, y=monthly_additions.values, palette='flare'): This line of code uses Seaborn's barplot() function to create a bar plot.
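
A minimal sketch of the count-then-reorder step on invented data. It adds fill_value=0, a small variation on the code above: with a plain reindex(), months that have no tracks come back as NaN instead of 0.

```python
import pandas as pd

# Invented data: only three distinct months appear
demo = pd.DataFrame({'Month': ['Mar', 'Jan', 'Mar', 'Dec']})

month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Count per month, then force calendar order; absent months become 0
monthly_additions = demo['Month'].value_counts().reindex(month_order, fill_value=0)
print(monthly_additions)
```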

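  8. Days with Peaks of Indie Tracks

    # indie_tracks (the indie subset of tracks) is not defined in this README
    daily_indie_additions = indie_tracks['Added At'].dt.date.value_counts().sort_index()
    sns.lineplot(x=daily_indie_additions.index, y=daily_indie_additions.values,
                 marker='o', color='purple')

    Output: [image: line chart of daily indie additions]

    • daily_indie_additions = indie_tracks['Added At'].dt.date.value_counts().sort_index(): This line of code counts how many indie tracks were added on each calendar date and sorts the counts chronologically. It requires 'Added At' to be a datetime column.
    • sns.lineplot(x=daily_indie_additions.index, y=daily_indie_additions.values, marker='o', color='purple'): This line of code uses Seaborn's lineplot() function to plot the daily counts over time, so peaks mark the days with the most indie additions.
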
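
The snippet above depends on indie_tracks, which this README does not define. Here is a hedged sketch of one way such a subset could be built, assuming a hypothetical 'Genres' text column in which indie genres contain the word "indie". The rows are invented, and the notebook may filter differently.

```python
import pandas as pd

# Invented rows; 'Genres' is an assumed column name
demo = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Genres': ['indie pop', 'rock', 'indie rock', 'indie folk'],
    'Added At': pd.to_datetime([
        '2023-01-15 10:00', '2023-01-15 12:00',
        '2023-01-15 18:30', '2023-02-01 09:00',
    ]),
})

# Keep rows whose genre text mentions "indie"; na=False treats missing genres as non-matches
indie_tracks = demo[demo['Genres'].str.contains('indie', case=False, na=False)]

# Additions per calendar date, in chronological order
daily_indie_additions = indie_tracks['Added At'].dt.date.value_counts().sort_index()
print(daily_indie_additions)

# The date with the most indie additions
peak_day = daily_indie_additions.idxmax()
print(peak_day)
```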
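  9. Top 5 indie artists in the dataset

    indie_artists = indie_tracks['Artists'].str.split(',').explode().value_counts().head(5)
    sns.barplot(y=indie_artists.index, x=indie_artists.values,
                palette='cubehelix')

    Output: [image: bar chart of the top 5 indie artists]

    • indie_artists = indie_tracks['Artists'].str.split(',').explode().value_counts().head(5): This line of code splits each comma-separated 'Artists' entry into individual names, puts each name on its own row with explode(), counts how often each artist appears, and keeps the 5 most frequent.
    • sns.barplot(y=indie_artists.index, x=indie_artists.values, palette='cubehelix'): This line of code uses Seaborn's barplot() function to create a horizontal bar plot of those counts.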
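
One thing to watch with str.split(','): if names in 'Artists' are separated by a comma and a space (an assumption about the data, not something this README confirms), the split leaves a leading space on every name after the first, so the same artist can be counted under two different labels. A minimal sketch on invented names, comparing the counts with and without str.strip():

```python
import pandas as pd

# Invented artist strings, mixing ", " and "," separators
demo = pd.DataFrame({
    'Artists': ['Band A, Singer B', 'Singer B', 'Band A,Singer C'],
})

# Without stripping, ' Singer B' and 'Singer B' are different labels
raw_counts = demo['Artists'].str.split(',').explode().value_counts()

# Stripping whitespace merges them into one artist
clean_counts = (
    demo['Artists'].str.split(',').explode().str.strip().value_counts()
)

print(raw_counts)
print(clean_counts)
```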
