Predicting Walmart Sales and Performing Exploratory Data Analysis

tgchacko/Walmart-Sales-Forecasting

Walmart Sales Forecasting

Table of Contents

Project Overview

Data Sources

Data Description

Tools

EDA Steps

Data Preprocessing Steps and Inspiration

Forecasting of Weekly Sales

Assumptions

Model Evaluation Metrics

Results

Recommendations

Limitations

Future Possibilities of the Project

References

Project Overview

The objective of this project is to analyze Walmart sales data to extract meaningful insights and develop predictive models to forecast weekly sales for each store. This analysis aims to help Walmart improve inventory management, strategic decision-making, and overall operational efficiency.

Data Sources

Walmart Sales Data: The primary dataset used for this analysis is the Walmart.csv file, containing detailed information about weekly sales across multiple Walmart stores.

Walmart Dataset

Data Description

The dataset, named "Walmart.csv", comprises 6,435 rows and 8 columns describing weekly sales at 45 Walmart stores. The columns are:

  1. Store: Store number (categorical variable ranging from 1 to 45)
  2. Date: The week of sales (temporal dimension spanning from February 5, 2010, to October 26, 2012)
  3. Weekly_Sales: Sales figure for the given store in a particular week
  4. Holiday_Flag: A binary indicator (0 or 1) showing whether a given week includes a holiday
  5. Temperature: The temperature on the day of the sale
  6. Fuel_Price: The regional fuel cost
  7. CPI: Consumer Price Index, indicating the average change in prices paid by consumers
  8. Unemployment: The unemployment rate in the region
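As a quick sanity check on the layout described above, the following sketch validates the column names and reports basic size figures. The helper name `summarize_walmart` and the in-memory sample rows are illustrative assumptions, not part of the original project; with the real file you would pass `pd.read_csv("Walmart.csv")` instead.

```python
import io
import pandas as pd

# Column names as listed in the data description.
EXPECTED_COLUMNS = ["Store", "Date", "Weekly_Sales", "Holiday_Flag",
                    "Temperature", "Fuel_Price", "CPI", "Unemployment"]

def summarize_walmart(df):
    """Check that all expected columns exist and return basic size figures."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "stores": df["Store"].nunique(),
    }

# Made-up rows for illustration only; they are not values from Walmart.csv.
sample_csv = io.StringIO(
    "Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment\n"
    "1,05-02-2010,1000.0,0,40.0,2.5,210.0,8.0\n"
    "1,12-02-2010,1100.0,1,38.0,2.5,210.0,8.0\n"
    "2,05-02-2010,900.0,0,41.0,2.6,211.0,8.2\n"
)
summary = summarize_walmart(pd.read_csv(sample_csv))
print(summary)
```

On the full file, the description above implies 6,435 rows, 8 columns, and 45 stores.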

Picture1

Picture2

Tools

Libraries

Below are the links for details and commands (if required) to install the necessary Python packages:

EDA Steps

EDA involved exploring the Walmart sales data to answer key questions, such as:

  1. What is the overall trend of weekly sales?
  2. How do these trends vary by store, region, and other factors?
  3. What is the impact of holidays on weekly sales?
  4. How do external factors like temperature, fuel price, CPI, and unemployment affect sales?

Data Preprocessing Steps and Inspiration

  1. Data Cleaning:

    • Handling Missing Values: No missing values were found in the dataset.
    • Removing Duplicates: No duplicate rows were found in the dataset.
    • Addressing Outliers: Outliers are left in place, since the actual weekly sales are needed for time series forecasting.

  2. Data Transformation:

    Converting Data Types: The data type for the column ‘Date’ is changed to ‘datetime’ from ‘object’. From this date, we have created new columns by obtaining the year, quarter, month, week, day of week, and day of month.
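The cleaning checks and the date conversion described above might look roughly like the sketch below. The new column names (`Year`, `Quarter`, and so on), the helper name `add_date_features`, and the `%d-%m-%Y` date format are assumptions for illustration; the README does not state them.

```python
import pandas as pd

def add_date_features(df, date_format="%d-%m-%Y"):
    """Convert 'Date' to datetime and derive calendar columns from it."""
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"], format=date_format)
    out["Year"] = out["Date"].dt.year
    out["Quarter"] = out["Date"].dt.quarter
    out["Month"] = out["Date"].dt.month
    out["Week"] = out["Date"].dt.isocalendar().week.astype(int)  # ISO week number
    out["DayOfWeek"] = out["Date"].dt.dayofweek  # Monday=0 ... Sunday=6
    out["DayOfMonth"] = out["Date"].dt.day
    return out

# Made-up rows for illustration only.
raw = pd.DataFrame({
    "Date": ["05-02-2010", "12-02-2010"],
    "Weekly_Sales": [1000.0, 1100.0],
})

# Cleaning checks: count missing cells and fully duplicated rows.
n_missing = int(raw.isna().sum().sum())
n_duplicates = int(raw.duplicated().sum())
print(n_missing, n_duplicates)

features = add_date_features(raw)
print(features.dtypes["Date"])
```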

  3. Exploratory Data Analysis (EDA):

Data Gaps: Before doing the EDA, we observed that the data has no records for January 2010 or for November and December 2012. The absence of these three months skews the distribution of the data and limits the accuracy of yearly, quarterly, and monthly comparisons, so any analysis involving these periods must take the gap into account.
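One way to surface such gaps is to list every calendar month, within the years covered, that has no observation. This is a minimal sketch; the function name `missing_year_months` is hypothetical, and the weekly Friday dates are generated from the date range given in the data description rather than read from the file.

```python
import pandas as pd

def missing_year_months(dates):
    """Return (year, month) pairs with no observations in the years spanned."""
    dates = pd.to_datetime(pd.Series(dates))
    observed = {(int(y), int(m)) for y, m in zip(dates.dt.year, dates.dt.month)}
    first_year = int(dates.dt.year.min())
    last_year = int(dates.dt.year.max())
    return [
        (year, month)
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
        if (year, month) not in observed
    ]

# Weekly (Friday) dates over the range stated in the data description.
weeks = pd.date_range("2010-02-05", "2012-10-26", freq="W-FRI")
gaps = missing_year_months(weeks)
print(len(weeks), gaps)
```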

Distribution of Data:

  1. Across Years

Picture3

  2. Across Quarters

Picture4

  3. Across Holidays/Non-Holidays

Picture5

  4. Distribution of Data - Pie Charts

Picture6

Picture7

Top 5 Performing Stores

Picture8

Picture9

Worst 5 Performing Stores

Picture10

Store No. 20 has the highest sales, whereas store No. 33 has the lowest sales.
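A ranking like this can be produced by summing weekly sales per store. The sketch below assumes that "performance" means total sales over the whole period; the helper name `rank_stores` and the sample figures are made up for illustration.

```python
import pandas as pd

def rank_stores(df, n=5):
    """Return the n stores with the highest and the lowest total sales."""
    totals = df.groupby("Store")["Weekly_Sales"].sum()
    return totals.nlargest(n), totals.nsmallest(n)

# Made-up figures for three stores.
sales = pd.DataFrame({
    "Store": [1, 1, 2, 2, 3, 3],
    "Weekly_Sales": [100.0, 120.0, 300.0, 310.0, 50.0, 60.0],
})
top, bottom = rank_stores(sales, n=2)
print(top)
print(bottom)
```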

Total Yearly Sales

Picture11

Total Monthly Sales

Picture12

Because data is missing for January 2010 and for November and December 2012, raw monthly totals are not directly comparable. We therefore average the sales for each month instead of comparing totals. After this adjustment, December is the best-performing month and February is the worst-performing month.
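The README does not spell out the exact adjustment. One plausible reading, sketched below, is to total the sales for each (year, month) pair and then average those totals over the years in which the month actually appears. The function name and sample data are assumptions for illustration.

```python
import pandas as pd

def average_monthly_sales(df):
    """Average each calendar month's total over the years where it has data."""
    year = df["Date"].dt.year.rename("Year")
    month = df["Date"].dt.month.rename("Month")
    per_year_month = df.groupby([year, month])["Weekly_Sales"].sum()
    return per_year_month.groupby(level="Month").mean()

# Made-up data: January appears in two years, December in only one.
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2010-12-03", "2011-01-07", "2011-01-14", "2012-01-06"]),
    "Weekly_Sales": [500.0, 100.0, 100.0, 300.0],
})
avg = average_monthly_sales(demo)
print(avg)
```

In this toy example the raw totals for January and December are equal (500 each), but the per-year average shows December ahead, because January's total is spread over two years.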

Top 5 months with Highest and Lowest Sales

Picture13

Total Holiday/Non-Holiday Sales

Picture14

If we adjust by dividing total sales by the actual number of working days and holidays, daily sales on a holiday are higher than on a non-holiday.

Average Daily Sales on a Holiday / a Non Holiday

Picture15
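The day counts used for the adjustment above are not given in the README. As a rough stand-in, the sketch below normalizes by the number of store-week records in each group instead of by days, which is an assumption and not necessarily the author's calculation. The helper name and sample values are hypothetical.

```python
import pandas as pd

def average_sales_by_holiday(df):
    """Total, record count, and per-record average sales by Holiday_Flag."""
    grouped = df.groupby("Holiday_Flag")["Weekly_Sales"]
    return pd.DataFrame({
        "total": grouped.sum(),
        "weeks": grouped.count(),
        "average": grouped.mean(),
    })

# Made-up data: three non-holiday weeks and one holiday week.
demo = pd.DataFrame({
    "Holiday_Flag": [0, 0, 0, 1],
    "Weekly_Sales": [100.0, 100.0, 100.0, 150.0],
})
by_flag = average_sales_by_holiday(demo)
print(by_flag)
```

Here the non-holiday total is larger simply because there are more non-holiday weeks, while the per-week average is higher for the holiday week.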

Impact of Unemployment on Weekly Sales

Typically, a higher unemployment rate corresponds to lower sales, and the data shows some decline in spending as unemployment rises. In this dataset, however, the correlation between the unemployment rate and weekly sales is weak, at -0.106.

Picture16

Impact of Temperature on Weekly Sales

Picture17

The observed correlation of -0.063 between temperature and sales in Walmart suggests a weak negative relationship. Several factors could contribute to this low correlation:

  1. Seasonal Variations
  2. Diverse Product Range
  3. Regional Variations
  4. Consumer Behavior
  5. Multifactorial Influence

Impact of CPI on Weekly Sales

The observed low correlation of -0.072 between the Consumer Price Index (CPI) and sales indicates a weak relationship.

Picture18
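Correlations like the ones quoted for unemployment, temperature, and CPI can be computed with pandas. The sketch below assumes Pearson correlation (the pandas default); the README does not say which coefficient was used. The helper name and sample values are made up.

```python
import pandas as pd

def sales_correlations(df, factors=("Temperature", "Fuel_Price", "CPI", "Unemployment")):
    """Pearson correlation of each external factor with Weekly_Sales."""
    columns = ["Weekly_Sales", *factors]
    return df[columns].corr()["Weekly_Sales"].drop("Weekly_Sales")

# Made-up values chosen so two of the correlations are exactly -1 and +1.
demo = pd.DataFrame({
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0],
    "Temperature": [4.0, 3.0, 2.0, 1.0],
    "Fuel_Price": [1.0, 2.0, 3.0, 4.0],
    "CPI": [1.0, 3.0, 2.0, 4.0],
    "Unemployment": [2.0, 1.0, 4.0, 3.0],
})
corr = sales_correlations(demo)
print(corr)
```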

Seasonal Trend of Weekly Sales

Picture19

Seasonal Trend in Weekly Sales: Sales are the highest in December, which can be attributed to several factors:

  1. Holiday Shopping Season
  2. Special Promotions and Discounts
  3. Winter Weather and Seasonal Products
  4. Year-End Clearance Sales
  5. Increased Consumer Spending
  6. Marketing and Advertising Campaigns
  7. Extended Store Hours

Trend Component, Seasonal Component, Residual Component of Weekly Sales

Picture20
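The README does not say which routine produced the decomposition shown above. As an illustration of the idea, here is a hand-rolled classical additive decomposition in NumPy: a centered moving average estimates the trend, the average detrended value at each position in the cycle estimates the seasonal component, and the remainder is the residual. The function name and the toy series are assumptions.

```python
import numpy as np

def classical_decompose(y, period):
    """Additive decomposition: y = trend + seasonal + residual."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Centered moving-average weights; an even period needs half weights at the ends.
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(n, np.nan)  # undefined at the edges of the series
    trend[half:n - half] = np.convolve(y, weights, mode="valid")
    detrended = y - trend
    # Average the detrended values at each position within the cycle.
    seasonal_means = np.array(
        [np.nanmean(detrended[i::period]) for i in range(period)]
    )
    seasonal_means -= seasonal_means.mean()  # force the seasonal effect to sum to zero
    seasonal = np.tile(seasonal_means, n // period + 1)[:n]
    residual = y - trend - seasonal
    return trend, seasonal, residual

# Toy series: linear trend plus a repeating pattern of length 4.
t = np.arange(24)
pattern = np.array([3.0, -1.0, -2.0, 0.0])
y = 0.5 * t + np.tile(pattern, 6)
trend, seasonal, residual = classical_decompose(y, period=4)
print(seasonal[:4])
```

For weekly sales with a yearly cycle, `period=52` would be the natural choice; the series then needs at least two full years of data for the seasonal averages to be meaningful.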

Forecasting of Weekly Sales

Time Series Forecasting Models

  1. ARIMA (AutoRegressive Integrated Moving Average)
    • Captures linear autocorrelation and, through differencing, trends; it has no seasonal terms of its own.
    • Requires a stationary series after differencing (the "Integrated" step).
  2. SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous factors)
    • Extends ARIMA with seasonal terms and external (exogenous) factors.
    • Suitable for data influenced by external variables like holidays.
  3. AutoARIMA
    • Automates the selection of the optimal ARIMA model.
    • User-friendly and efficient.
  4. Prophet
    • Developed by Facebook for handling seasonality, holidays, and special events.
    • Flexible and easy to use.
  5. TBATS (Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend, and Seasonal components)
    • Handles multiple seasonalities and complex patterns.
    • Robust in capturing diverse seasonal patterns.

Assumptions

  1. Stationarity Assumption

    Definition: The statistical properties of the time series data, such as mean and variance, do not change over time. Rationale: Many time series forecasting models, including ARIMA, perform better on stationary data. Ensuring or achieving stationarity enhances model effectiveness.

  2. Linearity Assumption

    Definition: The relationships between variables, including past and future values in the time series, can be adequately represented using linear models. Rationale: Models like ARIMA and SARIMAX are designed based on linear relationships. Assuming linearity simplifies the modeling process.

  3. Independence Assumption

    Definition: The model residuals (forecast errors) are assumed to be independent of one another, that is, free of autocorrelation. The observations themselves are expected to be correlated over time, which is what models like ARIMA exploit. Rationale: Autocorrelated residuals indicate that the model has missed structure in the data, which can lead to biased forecasts and unreliable prediction intervals.

  4. Identifiability Assumption

    Definition: The parameters of the chosen forecasting model can be uniquely determined from the available data. Rationale: Ensuring that the parameters are identifiable is crucial for accurate estimation in models like ARIMA and SARIMAX. This supports the reliability of the model's parameter estimates.
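Two of these assumptions can be probed with very little code. The sketch below shows differencing, a common way to move a trending series toward stationarity, and a lag autocorrelation estimate, which can be applied to residuals to check for leftover dependence. Both helpers are illustrative and are not taken from the project; formal tests (such as unit-root tests) are outside this sketch.

```python
import numpy as np

def difference(y, lag=1):
    """Return y[t] - y[t - lag], which removes a linear trend when lag=1."""
    y = np.asarray(y, dtype=float)
    return y[lag:] - y[:-lag]

def autocorr(x, lag=1):
    """Sample autocorrelation of x at the given lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# A series with a pure linear trend has a time-varying mean ...
trending = 3.0 * np.arange(10) + 5.0
# ... but its first difference is constant.
diffed = difference(trending)

# A strictly alternating series is strongly negatively autocorrelated at lag 1.
alternating = np.array([1.0, -1.0] * 5)
r1 = autocorr(alternating, lag=1)
print(diffed, r1)
```

Residuals from a well-specified model should show autocorrelations close to zero at all lags.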

Model Evaluation Metrics

  1. RMSE (Root Mean Squared Error): The square root of the average squared difference between predicted and observed values; it penalizes large errors more heavily than MAE.
  2. MAE (Mean Absolute Error): The average absolute difference between predicted and observed values.
  3. MAPE (Mean Absolute Percentage Error): The average absolute difference between predicted and observed values, expressed as a percentage of the observed values.
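The three metrics can be written in a few lines of NumPy. This is a minimal sketch; the function names and the sample numbers are illustrative and are not results from the project.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

def mae(actual, predicted):
    """Mean absolute error."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(a - p)))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (undefined if actual contains 0)."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((a - p) / a)) * 100)

# Made-up values for illustration.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
print(rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```

Because RMSE squares the errors, the single 30-unit miss dominates it, which is why RMSE comes out higher than MAE here.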

Results

Considering Store 24

  • The TBATS model achieved the best performance metrics (lowest MAPE, RMSE, and MAE) for Store No. 24.

    Picture21

  • The analysis highlighted significant trends and seasonal patterns in weekly sales.

  • The impact of external factors like unemployment, temperature, and CPI on sales was explored.

Picture22

Recommendations

  • Implement targeted interventions during peak sales periods and holidays to maximize revenue.
  • Continue monitoring external factors to understand their potential impact on sales.
  • Utilize the forecasting model to plan inventory management and resource allocation more effectively.

Limitations

  • Data Quality: Some data points may be inaccurate due to underreporting or delays in reporting.
  • Model Limitations: The models used may not capture all complexities of sales patterns and may need continuous updating.
  • External Factors: Other factors not included in the analysis, such as social behavior and political decisions, can significantly impact sales.

Future Possibilities of the Project

  1. Advanced Predictive Modeling

Investigate advanced forecasting models and toolkits such as:

    a) N-BEATS (Neural Basis Expansion Analysis for Time Series)
    b) N-HiTS (Neural Hierarchical Interpolation for Time Series)
    c) PatchTST (Patch Time Series Transformer)
    d) VARMAX (Vector Autoregressive Moving-Average with eXogenous variables)
    e) VAR (Vector Autoregression)
    f) Kats (Kit to Analyze Time Series)

These models offer potential for enhanced accuracy in sales forecasting.

  2. Store-Specific Analysis

Conduct comprehensive analyses for each of the 45 Walmart stores to uncover unique patterns and optimize forecasting models tailored to individual store characteristics. This approach can help identify specific trends and factors influencing sales at each location.
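A natural first step toward store-level modeling is to split the dataset into one weekly series per store. The sketch below assumes `Date` has already been converted to datetime; the helper name `per_store_series` and the sample rows are hypothetical.

```python
import pandas as pd

def per_store_series(df):
    """Return {store: Weekly_Sales series indexed by Date, sorted in time order}."""
    ordered = df.sort_values("Date")
    return {
        store: group.set_index("Date")["Weekly_Sales"]
        for store, group in ordered.groupby("Store")
    }

# Made-up rows, deliberately out of order.
demo = pd.DataFrame({
    "Store": [2, 1, 1, 2],
    "Date": pd.to_datetime(["2010-02-12", "2010-02-12", "2010-02-05", "2010-02-05"]),
    "Weekly_Sales": [910.0, 1100.0, 1000.0, 900.0],
})
series = per_store_series(demo)
print(series[1])
```

Each series can then be passed to whichever forecasting model is being evaluated for that store.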

  3. External Factors Integration

Consider integrating additional external factors into the forecasting models, such as:

    1. Economic Indicators: GDP, inflation rates, etc.
    2. Social Events: Festivals, public holidays, etc.
    3. Regional Factors: Local economic conditions, demographic changes, etc.

Incorporating these factors can provide a more comprehensive and accurate forecasting approach, capturing the broader context influencing sales.

References

  1. Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: Principles and Practice. OTexts.
  2. Time Series Forecasting in Python
  3. TBATS Time Series Forecasting