🎯 Problem statement
The goal is to conduct a thorough exploratory data analysis (EDA) and hypothesis testing on a dataset describing customers who visit a shopping site to make a purchase.
Feature | Description |
---|---|
Administrative | Number of administrative pages visited (e.g., account, cart, orders). |
Administrative_Duration | Time spent on administrative pages. |
Informational | Number of informational pages visited. |
Informational_Duration | Time spent on informational pages. |
ProductRelated | Number of product-related pages visited. |
ProductRelated_Duration | Time spent on product-related pages. |
BounceRates | % of visitors who exit without further interaction. |
ExitRates | % of pageviews ending on a specific page. |
PageValues | Average value of the page relative to transaction completion. |
SpecialDay | Proximity of browsing date to special days, e.g., holidays. |
Month | Month of the pageview (string format). |
OperatingSystems | Integer representing the user's operating system. |
Browser | Integer representing the user's browser. |
Region | Integer representing the user's location region. |
TrafficType | Integer categorizing the traffic type. |
VisitorType | Visitor status: New, Returning, or Other. |
Weekend | Boolean indicating if the session occurred on a weekend. |
Revenue | Boolean indicating if the user completed a purchase. |
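
The schema above mixes counts, durations, integer codes, strings, and booleans, so a type-cleaning pass is a natural first step. The following is only a minimal sketch with pandas: the rows are made up, the helper name `tidy_types` is hypothetical, and the assumption that booleans may arrive as text or 0/1 is ours, not something stated by the project.

```python
import pandas as pd

# Made-up rows; only the column names come from the feature table.
raw = pd.DataFrame(
    {
        "Month": ["Feb", "Mar", "Mar"],
        "OperatingSystems": [1, 2, 2],
        "VisitorType": ["Returning", "New", "Returning"],
        "Weekend": ["FALSE", "TRUE", "FALSE"],
        "Revenue": [0, 1, 0],
        "PageValues": ["0.0", "12.5", None],
    }
)


def tidy_types(df):
    """Return a copy with categorical, boolean, and numeric columns typed explicitly."""
    out = df.copy()
    # Integer codes and string labels identify groups, not quantities.
    for col in ["Month", "OperatingSystems", "VisitorType"]:
        out[col] = out[col].astype("category")
    # Assumption: booleans may be exported as text or as 0/1.
    out["Weekend"] = out["Weekend"].astype(str).str.upper().eq("TRUE")
    out["Revenue"] = out["Revenue"].astype(bool)
    # Numeric text becomes float; missing or unparseable entries become NaN.
    out["PageValues"] = pd.to_numeric(out["PageValues"], errors="coerce")
    return out


clean = tidy_types(raw)
print(clean.dtypes)
print(clean.isna().sum())  # missing values per column
```

Treating `OperatingSystems`, `Browser`, `Region`, and `TrafficType` as categories rather than numbers keeps them out of correlation matrices, where their integer codes would be meaningless.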

- Data Preprocessing: Handled missing values, formatted data types, and ensured all necessary transformations were made for consistency in the dataset.
- Univariate Analysis: Plotted histograms and box plots for each numerical feature to identify distribution shapes and detect outliers in features like `PageValues`, `BounceRates`, and `ExitRates`.
- Correlation Analysis: Calculated correlations between numerical features to detect potential relationships, focusing on features like `PageValues`, `Revenue`, and `Duration`.
- Visualizations: Created scatter plots, pair plots, and heatmaps to visualize relationships between key numerical variables, such as `PageViews`, `Duration`, and `Revenue`.
- Class Distribution: Examined the distribution of the target variable (`Revenue`) to assess class balance and evaluate potential bias in the dataset.
- Page Category Analysis: Summarized page views, session durations, and bounce/exit rates for different page categories to identify user behavior patterns on each page type.
- SpecialDay Analysis: Investigated the distribution of the `SpecialDay` feature and analyzed its correlation with `Revenue` to understand how special events influence conversions.
- Binary Feature Creation: Generated a binary feature indicating whether a user visited all three page categories (`Informational`, `ProductRelated`, `Administrative`) during their session.
- PageValues and Behavior Analysis: Explored the relationship between `PageValues` and factors like `TrafficType`, `VisitorType`, and `Region`, highlighting engagement and purchase behavior differences.
- User Session Length Impact: Analyzed user session lengths to determine their influence on conversion rates, identifying any trends between longer sessions and higher purchase likelihood.
- User Grouping by Behavior: Grouped users based on `VisitorType`, `OperatingSystems`, and `Region` to identify behavioral differences and their impact on conversion rates.
- Traffic Type Segmentation: Segmented users by `TrafficType` and analyzed engagement patterns, exploring the impact of different traffic sources on purchase probability and session behaviors.

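
The binary-feature and grouping steps listed above could be sketched roughly as follows with pandas. The flag name `VisitedAllCategories`, both helper functions, and the toy rows are assumptions made for illustration; the project's actual names and data are not shown in this document.

```python
import pandas as pd

# Toy sessions; column names follow the feature table, values are made up.
sessions = pd.DataFrame(
    {
        "Administrative": [0, 2, 1, 0, 3, 1],
        "Informational": [0, 1, 0, 0, 2, 1],
        "ProductRelated": [5, 12, 0, 3, 20, 8],
        "VisitorType": ["Returning", "Returning", "New", "New", "Returning", "New"],
        "Revenue": [False, True, False, False, True, True],
    }
)


def add_visited_all_categories(df):
    """Return a copy with a 0/1 flag for sessions that visited all three page types."""
    page_counts = df[["Administrative", "Informational", "ProductRelated"]]
    out = df.copy()
    # The flag name is hypothetical; a session qualifies if every count is > 0.
    out["VisitedAllCategories"] = page_counts.gt(0).all(axis=1).astype(int)
    return out


def conversion_rate_by(df, column):
    """Share of sessions with Revenue == True within each value of `column`."""
    return df.groupby(column)["Revenue"].mean()


flagged = add_visited_all_categories(sessions)
print(flagged["VisitedAllCategories"].tolist())
print(conversion_rate_by(flagged, "VisitorType"))
print(conversion_rate_by(flagged, "VisitedAllCategories"))
```

The same `conversion_rate_by` call works for `TrafficType`, `Region`, or `OperatingSystems`, which is one way to keep the segment comparisons consistent across the grouping steps.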
This comprehensive approach prepared the data for in-depth analysis and built the foundation for actionable insights.
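
As a further illustration, the outlier, class-balance, and `SpecialDay` checks might look something like the sketch below. The 1.5 × IQR rule is one common outlier convention and is an assumption here, as are the helper name `iqr_outliers` and the made-up values; the project may have used a different criterion.

```python
import pandas as pd

# Made-up values; only the column names come from the feature table.
sessions = pd.DataFrame(
    {
        "PageValues": [0.0, 0.0, 1.0, 2.0, 3.0, 60.0],
        "SpecialDay": [0.0, 0.0, 0.2, 0.0, 0.8, 0.0],
        "Revenue": [False, False, False, True, False, True],
    }
)


def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Univariate check: which PageValues entries fall outside the IQR fences?
outlier_mask = iqr_outliers(sessions["PageValues"])
print(sessions.loc[outlier_mask, "PageValues"])

# Class balance: share of sessions that ended in a purchase.
purchase_share = sessions["Revenue"].mean()
print(purchase_share)

# Pearson correlation against a 0/1 target (the point-biserial correlation).
# On six toy rows the value itself carries no meaning.
special_day_corr = sessions["SpecialDay"].corr(sessions["Revenue"].astype(int))
print(special_day_corr)
```

A purchase share far from 0.5 signals class imbalance, which matters when choosing hypothesis tests and when interpreting group-level conversion rates.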