This project focuses on predicting the sales of various items in Big Mart outlets using machine learning techniques. The dataset includes information about the items, outlets, and their sales. The goal is to build a predictive model that can accurately estimate the sales based on different features.
To run this project, you'll need to install the required Python packages. You can do this by running the following commands in your Python environment:
!pip install shapely==1.8.5
!pip install numpy==1.23.5
!pip install scipy==1.10.1
!pip install pandas numpy seaborn matplotlib klib dtale scikit-learn joblib pandas-profiling xgboost
The project begins with loading the dataset and performing exploratory data analysis (EDA) to understand its structure and contents. Key steps include:
- Reading the dataset using Pandas.
- Checking for missing values in the dataset.
- Descriptive statistics of numerical features.
- Handling missing values using mean and mode imputation.
- Reducing dimensionality by dropping unnecessary columns.
- Utilizing libraries like Dtale, Pandas Profiling, and Klib for detailed EDA.
The dataset is then cleaned using Klib library functions. This involves:
- Dropping duplicates and empty rows/columns.
- Adjusting data types for more efficient storage.
- Handling missing values and reducing memory usage.
The data is prepared for model building through preprocessing tasks such as:
- Label encoding for categorical features.
- Splitting the dataset into training and testing sets.
- Standardizing the data for consistency in scale and distribution.
Machine learning models can then be trained using the preprocessed data. Common models include linear regression, decision trees, and ensemble methods. The project uses Scikit-Learn for model implementation.
The model's performance is evaluated using appropriate metrics, and adjustments can be made to improve accuracy.
Feel free to explore the Jupyter Notebook for a step-by-step walkthrough of the project. You can run each code cell to understand the analysis and reproduce the results.