Retail Product Sales Exploratory Data Analysis in Python

Wholesale-Stationery-Products-Online


Project Overview

This project aims to provide the "Pens and Printers" company with a strategic approach for maximizing sales of its recently launched product line by effectively engaging both existing and new customers, drawing on insights from the company's historical sales data.

Business Background

Pens and Printers was founded in 1984 and provides high-quality office products to large organizations. We are a trusted provider of everything from pens and notebooks to desk chairs and monitors. We don't produce our own products but sell those made by other companies. We have built long-lasting relationships with our customers, and they trust us to provide them with the best products. As the way consumers buy products changes, our sales tactics have to change too. Launching a new product line is expensive, and we need to ensure we use the best techniques to sell the new product effectively. The best approach may vary from product to product, so we need to learn quickly what works and what doesn't.

Business questions:

We need to know:

  • How many customers were there for each approach?
  • What does the spread of the revenue look like overall? And for each method?
  • Was there any difference in revenue over time for each of the methods?
  • Based on the data, which method would you recommend we continue to use? Some of these methods take more time from the team so they may not be the best for us to use if the results are similar.

Product Sales EDA in Python: Executive Summary


Data validation and data cleaning

  • First, import the Python libraries needed for the analysis.
  • Then load the dataset.
  • Then take a first look at the fields.
  • If any cleaning action is needed, we proceed with it; if not, we leave the columns as they are.
# Import all the libraries used in this analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.style as style

# Set a consistent plot style and palette
style.use('ggplot')
sns.set_palette(['gray'])

# Load the dataset
df = pd.read_csv('product_sales_data.csv')

df.info()
df.head()
df.columns

# Count the na/null values in the whole dataset
df_null_count = df.isna().sum().sum()
print("dataset null:", df_null_count)

# Count the na/null values in the revenue column
revenue_null_count = df["revenue"].isna().sum()
print("revenue null:", revenue_null_count)

From the outputs above, we can see that the data contains 15,000 rows and 8 columns, and that the dataset has 1,074 null values, all of them in the revenue column, which holds only 13,926 numeric values. My quick judgement: because revenue is one of the most important columns here, we have two options:

  1. Replace the null values with the revenue mean, median, or mode.
  2. Delete all the null rows.

I would prefer to work with genuine data rather than fabricating figures to fill in the nulls, so I have decided to drop the rows with null revenue.
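For reference, option 1 would have looked something like the sketch below, which fills each missing revenue with the median revenue of its sales method. This is only an illustration of the road not taken (it assumes the df loaded above) and is not used in this analysis.

# Sketch of option 1 (not used here): fill missing revenue with the
# median revenue of the same sales method instead of dropping rows
df_imputed = df.copy()
df_imputed['revenue'] = df_imputed['revenue'].fillna(
    df_imputed.groupby('sales_method')['revenue'].transform('median')
)
print(df_imputed['revenue'].isna().sum())  # should print 0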

# Delete rows where the 'revenue' column is null
df.dropna(subset=['revenue'], inplace=True)

# Check again that the null rows have been dropped
revenue_null_count = df["revenue"].isna().sum()
print("revenue null:", revenue_null_count)

# Print the updated DataFrame info
df.info()

# Print the DataFrame description
df.describe()

df.head()


Validation of all fields, one after the other

After removing the null rows, our dataset now has 13,926 rows and 8 columns. In the next steps, I validate all columns, going through each one to inspect its contents and correct any errors found.

Column 1: week

# Let's check for unique values
unique_count = df['week'].nunique()
print(unique_count)

unique_strings = df['week'].unique()
print(unique_strings)

Week column validation: the week column is numeric, with 6 unique weeks across 13,926 values; no cleaning needed.

Column 2: sales_method

# let's check the unique categories of sales_method
sales_method_type = df['sales_method'].unique()
print(sales_method_type)

# Let's count the number of values in each category
sales_method = df.groupby('sales_method')['sales_method'].count()
print(sales_method)

# Let's clean the sales_method categories. There should be 3 categories
# (Email + Call, Email, Call) instead of 5: correct the 'em + call'
# spelling to match 'Email + Call', and 'email' to match 'Email'.
df['sales_method'] = df['sales_method'].replace({'em + call': 'Email + Call', 'email': 'Email'})

# Count the values in each category again to confirm the errors are corrected
sales_method = df.groupby('sales_method')['sales_method'].count()
print(sales_method)

Sales_method column validation: sales_method is a categorical column with the values Call, Email + Call, and Email. There were spelling errors in the categories, but they have been corrected and cleaned up.
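As a side note, a more defensive variant of the same fix (my own sketch, not part of the original cleaning) normalizes case and whitespace first, so any similar typos are caught in one pass:

# Defensive variant: normalize case/whitespace, then map every known
# variant onto the three valid categories
method_map = {'email': 'Email', 'call': 'Call',
              'em + call': 'Email + Call', 'email + call': 'Email + Call'}
df['sales_method'] = df['sales_method'].str.strip().str.lower().map(method_map)
print(df['sales_method'].value_counts())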

Column 3: customer_id

# let's check the data type of customer_id
customer_id = df['customer_id'].dtype
print(customer_id)

# let's check the count of customer_id
customer_id = df['customer_id'].nunique()
print(customer_id)

duplicates = df['customer_id'].duplicated().sum()
print(duplicates)

Customer_id column validation: customer_id is a unique identifier with no duplicates, so we have 13926 unique customers.

Column 4: nb_sold

# let's check the data type of nb_sold
nb_sold_type = df['nb_sold'].dtype
print(nb_sold_type)

# let's check the count of nb_sold
unique_count = df['nb_sold'].nunique()
print(unique_count)

# let's check the values in nb_sold column
unique_values = df['nb_sold'].unique()
print(unique_values)

# Let's count the number of sales for each nb_sold value
nb_sold = df.groupby('nb_sold')['nb_sold'].count()
print(nb_sold)

nb_sold column validation: the number-sold column is numeric, with 10 unique values ranging from 7 to 16; no cleaning needed.

Column 5: revenue

# let's check the data type of revenue
revenue_type = df['revenue'].dtype
print(revenue_type)

# Let's check the total revenue
revenue_sum = df['revenue'].sum()
print(revenue_sum)

# Let's check again if there's null values in revenue column
revenue_null_count = df["revenue"].isna().sum()
print("revenue null:",revenue_null_count)

Revenue column validation: revenue is a float64 numeric column with two decimal places. Rows with null revenue values were removed at the start; after cleaning we have 13,926 values.

Column 6: years_as_customer

# let's check the data type of years_as_customer
years_as_customer_type = df['years_as_customer'].dtype
print(years_as_customer_type)

# let's check the count of years_as_customer
unique_count = df['years_as_customer'].nunique()
print(unique_count)

# let's check the values in years_as_customer column
unique_values = df['years_as_customer'].unique()
print(unique_values)

#Let's check the oldest customer
years_as_customer_max = df['years_as_customer'].max()
print(years_as_customer_max)

# Let's count the number of customers for each years_as_customer value
years_as_customer = df.groupby('years_as_customer')['years_as_customer'].count()
print(years_as_customer)

# Filter the DataFrame based on the condition 'years_as_customer' > 39
filtered_customers = df[df['years_as_customer'] > 39]
print(filtered_customers)

# View the list of customers with 'years_as_customer' > 39
years_of_customers = filtered_customers['years_as_customer'].tolist()
print(years_of_customers)

# Remove rows where 'years_as_customer' is greater than 39
df = df[df['years_as_customer'] <= 39]

# Reset the index of the DataFrame
df.reset_index(drop=True, inplace=True)

years_as_customer column validation: years_as_customer is numeric and records the number of years of a customer's patronage. Valid values range from 0 years (new customers) to 39 years (the oldest customers). Two outliers, 47 and 63, were removed; the column is now clean.
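The hard-coded 39 can also be derived from the founding year, which makes the filter self-documenting. A small sketch, assuming the data snapshot is from 2023:

# Derive the valid tenure ceiling from the founding year (1984)
# instead of hard-coding 39; assumes a 2023 data snapshot
founded_year = 1984
snapshot_year = 2023
max_tenure = snapshot_year - founded_year  # 39
df = df[df['years_as_customer'] <= max_tenure].reset_index(drop=True)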

Column 7: nb_site_visits

# let's check the data type of nb_site_visits
nb_site_visits_type = df['nb_site_visits'].dtype
print(nb_site_visits_type)

# let's check the count of nb_site_visits
unique_count = df['nb_site_visits'].nunique()
print(unique_count)

# let's check the values in nb_site_visits column
unique_values = df['nb_site_visits'].unique()
print(unique_values)

# Display the modified DataFrame
print(df.head())

nb_site_visits column validation: nb_site_visits is numeric, matching the description; no cleaning needed.

Column 8: state

# let's check the data type of state
state_type = df['state'].dtype
print(state_type)

# let's check the count of state
unique_count = df['state'].nunique()
print(unique_count)

# let's check the values in state column
unique_values = df['state'].unique()
print(unique_values)

states = df.groupby('state')['state'].count()
print(states)

Figure: Cleaned data head

State column validation: state is a categorical column with 50 unique state values, matching the description; no cleaning needed.

Data Validation Summary

The original dataset contained 15,000 rows and 8 columns. I validated all columns against the criteria provided in the dataset description:

week: Data type numeric, 6 unique weeks with no missing values, consistent with the description. No cleaning required.

sales_method: Data type string, with the three categories described (Email, Call, Email + Call) plus two misspelled variants that did not match the description. The typos were corrected to align with the three valid categories.

customer_id: character (string) entries serving as unique identifiers for customers, consistent with the description. After removing rows with null revenue values, 13,926 of the original 15,000 entries remained.

nb_sold: Data type numeric, with 10 unique values indicating the number of products sold. This matched the description, and no cleaning was required.

revenue: Data type numeric, representing sales revenue rounded to 2 decimal places. After dropping 1,074 rows with null revenue values, 13,926 rows remained. The revenue data was cleaned accordingly.

years_as_customer: Data type numeric, with 42 unique values representing the number of years a customer has been purchasing products. However, since the business was founded in 1984 (39 years ago, as of 2023), the records showing customer tenures of 47 and 63 years exceeded the valid range. Values greater than 39 years were removed, and the column now includes only customers with 0 to 39 years, as required.

nb_site_visits: Data type numeric, with 27 unique values and no missing data, matching the description. No cleaning was needed.

state: 50 unique categories representing customer states, consistent with the description. No cleaning was required.

After completing the data validation and cleaning process, the dataset now contains 13,926 rows and 8 columns.
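To make the cleaning repeatable, these outcomes can be codified as assertions that fail loudly if any step is skipped or reordered. A sketch, assuming the cleaned df from above:

# Codify the validation outcomes; a re-run fails loudly if a step is missed
assert df['revenue'].notna().all(), 'revenue still contains nulls'
assert set(df['sales_method'].unique()) == {'Email', 'Call', 'Email + Call'}
assert df['customer_id'].is_unique, 'duplicate customer_id found'
assert df['years_as_customer'].between(0, 39).all(), 'tenure out of range'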

Exploratory Data Analysis & Visualization

Question 1: How many customers were there for each approach?

The single-variable graph below shows the distribution of customers across the sales methods, i.e., how many customers each approach reached over the period covered by the dataset.

# Group the data by 'sales_method' and count the unique 'customer_id' values
grouped = df.groupby('sales_method')['customer_id'].nunique().sort_values(ascending=False)

# Create the figure and axes with the specified figsize
fig, ax = plt.subplots(figsize=(10, 6))

# Plot the bar chart with gray bars
ax.bar(grouped.index, grouped.values, color='gray')

# Label each bar with its customer count
for i, value in enumerate(grouped.values):
    ax.text(i, value, str(value), ha='center', va='bottom')

# Set labels and title
ax.set_xlabel('Sales Method')
ax.set_ylabel('Count of Unique Customers')
ax.set_title('Number of Customers by Sales Method')

# Display the chart
plt.show()

Figure 1: Distribution of Sales Method

Findings

Distribution of Customers by Sales Method: the Email approach reached the most customers, 6,922 (49.70%), followed by the Call approach with 4,781 (34.33%) and the Email + Call approach with 2,223 (15.96%), for a total of 13,926 customers.
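These shares can be reproduced directly from the cleaned data, for example:

# Customer counts and percentage shares per sales method
counts = df['sales_method'].value_counts()
shares = df['sales_method'].value_counts(normalize=True).mul(100).round(2)
print(pd.concat([counts.rename('customers'), shares.rename('share_%')], axis=1))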

Question 2: What does the spread of the revenue look like overall? And for each method?

# Distribution of Revenue
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='revenue')
plt.ylabel('Count')
plt.xlabel('Revenue')
plt.title('Distribution of Revenue')
plt.show()

Figure 2: Distribution of Revenue

The Spread of the Revenue: looking at the distribution, most single sales brought in less than 200 USD, and the largest number of sales fall between roughly 45 and 100 USD. The distribution is right-skewed; sales above roughly 220 USD could be considered outliers.
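A quick numeric companion to the histogram (assuming the cleaned df) summarizes the spread overall and per method:

# Summary statistics of revenue, overall and per sales method
print(df['revenue'].describe().round(2))
print(df.groupby('sales_method')['revenue'].describe().round(2))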

Distribution of Revenue by Sales Method

# Group the data by 'sales_method' and sum the 'revenue' values
grouped = df.groupby('sales_method')['revenue'].sum().sort_values(ascending=False)

# Create the figure and axes with the specified figsize
fig, ax = plt.subplots(figsize=(10, 6))

# Plot the bar chart with gray bars
ax.bar(grouped.index, grouped.values, color='gray')

# Label each bar with its revenue total
for i, value in enumerate(grouped.values):
    ax.text(i, value, f'{value:,.0f}', ha='center', va='bottom')

# Set labels and title
ax.set_xlabel('Sales Method')
ax.set_ylabel('Revenue')
ax.set_title('Distribution of Revenue by Sales Method')

# Display the chart
plt.show()

Figure 2.1: Distribution of Revenue by Sales Method

Distribution of Revenue by Sales Method: the chart above shows the overall spread of revenue for each method. The Email method generated the highest share of revenue with 50.53%, followed by Email + Call with 30.71% and Call with 18.76%.

Correlation of Revenue and Products Sold

# Heatmap of the relationships between the numeric columns

# Create the figure with the specified figsize
plt.figure(figsize=(10, 6))

# Plot pairwise correlations; numeric_only avoids errors on the
# string columns (sales_method, customer_id, state) in recent pandas
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.title('The relationship between the values')

Figure: Heatmap of the relationships between the values

Revenue and the number of products sold are positively correlated.
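The single coefficient behind that heatmap cell can also be printed directly:

# Pairwise correlation between revenue and number of products sold
corr = df['revenue'].corr(df['nb_sold'])
print(f'corr(revenue, nb_sold) = {corr:.2f}')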

Revenue Spread by Sales Method

# Create the figure and axes with the specified figsize
fig, ax = plt.subplots(figsize=(10, 6))

# Box plot of the 'revenue' spread for each 'sales_method'
sns.boxplot(x='sales_method', y='revenue', hue='sales_method', data=df, palette='Set2')
plt.xlabel('Sales Method')
plt.ylabel('Revenue')
plt.title('Revenue Spread for each Sales Method')

# Display the plot
plt.legend(title='Sales Method', facecolor='lightgray')
plt.show()

Figure 3.1: Revenue Spread by Sales Method

Revenue Spread by Sales Method: with just 2,223 customers, Email + Call generated 30.71% of the overall revenue, compared to the other methods, which reached far more customers. Combining Email and Call as sales methods tends to produce quicker and more effective results than Email or Call alone.

Question 3: Was there any difference in revenue over time for each of the methods?

# Create the figure and axes with specified figsize
fig, ax = plt.subplots(figsize=(10, 6))

# Create the line plot with markers
sns.lineplot(x='years_as_customer', y='revenue', hue='sales_method', data=df, marker='o', palette='Set2')

# Set the labels and title of the plot
plt.xlabel('Years as Customer')
plt.ylabel('Revenue')
plt.title('Revenue Differences over Time for each Sales Method')

# Adjust the legend position
plt.legend(title='Sales Method', loc='upper right')

# Display the plot
plt.show()

Figure 3.0: Revenue Differences over Time for each Sales Method

Difference in Revenue over Years as Customer for each Sales Method: the Email + Call approach generated revenue in the range of roughly 140 to 205 USD per sale on some of the top products, across customer tenures of 0 to 39 years. Again, combining Email and Call tends to produce quicker and more effective results than Email or Call alone.
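Since the question is about time, a tabular view by launch week (a sketch using the week column) complements the tenure-based plot above:

# Average revenue per sale by launch week and sales method
weekly = df.groupby(['week', 'sales_method'])['revenue'].mean().unstack().round(2)
print(weekly)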

Question 4: Based on the data, which method would you recommend we continue to use?

# Pivot the data to create a DataFrame with 'week' as index, 'sales_method' as columns, and mean 'revenue' as values
pivot_df = df.pivot_table(index='week', columns='sales_method', values='revenue')

# Create the figure and axes with specified figsize
fig, ax = plt.subplots(figsize=(10, 6))

# Create a grouped bar chart to visualize the revenue difference between sales methods over time
pivot_df.plot(kind='bar', color=['salmon', 'brown', 'olive'], ax=ax)
plt.xlabel('Week sale was made')
plt.ylabel('Average Revenue')
plt.title('Sales Method and Week Sale Was Made since Product Launch')
# Rotate the x-axis labels horizontally
plt.xticks(rotation='horizontal')

# Display the plot
plt.show()

Figure 4: Sales Method and Week Sale Was Made since Product Launch

Sales Method and Week Sale Was Made since Product Launch: the chart above depicts the average revenue per sale for each sales method over the six weeks following the product launch. Email + Call shows consistently strong performance throughout the weeks.

The Business Metric

Since the goal of every business is to sell products that help customers and, in return, increase revenue, I recommend adopting the Email + Call method's share of customers and revenue as our metric. Over the six weeks since launch, roughly 50% of customers came through the Email approach and 34% through the Call approach, yet Email + Call generated 30.71% of total revenue from only about 16% of customers. If the Email + Call share keeps increasing over the next six weeks, it is a strong sign that we will achieve our goal.
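A minimal sketch of how this metric could be computed for each reporting window, assuming the cleaned df:

# Proposed tracking metric: Email + Call's share of total revenue
# (recompute on each new 6-week window going forward)
share = df.loc[df['sales_method'] == 'Email + Call', 'revenue'].sum() / df['revenue'].sum()
print(f'Email + Call revenue share: {share:.1%}')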

Recommendations

For the following 6 weeks, I recommend the company focus on the following steps:

  • Monitor the key metric defined above; it provides insight into the likelihood of sales growth.

  • Implement an email marketing campaign using customer segmentation, automation, and analytics to deliver targeted emails, track engagement metrics, and optimize email performance.

  • Implement the Call approach within a call-center setting for customer follow-ups on product inquiries and interests.

  • Implement a CRM system, if one is not already in place, to build and maintain strong customer relationships and improve the customer experience.

  • Collect additional data for more in-depth analysis:

    • Improve data quality: how much time was spent on each customer?
    • Add new related data: the date a product was sold, the delivery date, and other timestamps.