This project implements a machine learning-based pricing model for Walmart products using Jupyter notebooks. The system analyzes various product attributes including cost, stock levels, supplier information, and product characteristics to predict optimal pricing strategies with confidence scores.
- XGBoost Regression Model: Implements gradient boosting for accurate price predictions
- Feature Engineering: Handles missing data with mean imputation and numeric type conversion
- Confidence Scoring: Generates mlScore based on tree-level prediction variance
- Model Persistence: Saves trained models and preprocessing components
- Price Prediction: Generates suggested prices with confidence scores
- Performance Metrics: Comprehensive evaluation with MAE, R², and cross-validation
```
Walmart/
├── train.ipynb                               # Main training notebook
├── Train_Pricing_Model.ipynb                 # Alternative training approach
├── product_cleaned.csv                       # Cleaned product dataset
├── predicted_prices_with_score_cleaned.json  # Model predictions output
├── suggested_price_xgb_model_cleaned.pkl     # Trained XGBoost model
├── imputer_cleaned.pkl                       # Data imputation model
├── confidence_scaler.pkl                     # Confidence score scaler
├── README.md                                 # Project documentation
├── requirements.txt                          # Python dependencies
└── .gitignore                                # Git ignore rules
```
- Python 3.8+
- Jupyter Notebook or JupyterLab
1. Clone the repository

   ```bash
   git clone <repository-url>
   cd Walmart
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

3. Launch Jupyter and run the notebook

   ```bash
   jupyter notebook
   ```

   Then open `train.ipynb` and run all cells.
The model uses the following features from the product dataset:
- `cost`: Product cost
- `currentPrice`: Current selling price
- `originalPrice`: Original product price
- `margin`: Profit margin
- `stock`: Current stock level
- `maxStock`: Maximum stock capacity
- `minStockLevel`: Minimum stock threshold
- `daysUntilExpiry`: Days until product expires
- `isPerishable`: Perishable product flag
- `priceFactors.expirationUrgency`: Expiration urgency factor
- `priceFactors.stockLevel`: Stock level factor
- `priceFactors.timeOfDay`: Time-based pricing factor
- `priceFactors.demandForecast`: Demand prediction factor
- `priceFactors.competitorPrice`: Competitor pricing factor
- `priceFactors.seasonality`: Seasonal pricing factor
- `priceFactors.marketTrend`: Market trend factor
- `clearanceRate`: Product clearance rate
- `wasteReduction`: Waste reduction percentage
- XGBoost Regressor with optimized hyperparameters
- Parameters: n_estimators=100, learning_rate=0.1, max_depth=4
- Missing Data: Mean imputation using SimpleImputer
- Data Types: Automatic conversion to numeric types
- Feature Selection: 18 carefully selected features
- MAE (Mean Absolute Error): ~450.89
- R² Score: ~0.9957
- Cross-Validation: 3-fold CV R² mean ~0.9212
The model generates an mlScore (0.70-0.99) based on:
- Tree-level prediction variance across XGBoost ensemble
- Standard deviation of predictions
- Normalized confidence scaling
- Power transformation for score distribution
```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, r2_score

# Load data
df = pd.read_csv("product_cleaned.csv")
df = df[df['suggestedPrice'].notnull()]

# Prepare features
selected_features = [
    'cost', 'currentPrice', 'originalPrice', 'margin',
    'stock', 'maxStock', 'minStockLevel', 'daysUntilExpiry', 'isPerishable',
    'priceFactors.expirationUrgency', 'priceFactors.stockLevel', 'priceFactors.timeOfDay',
    'priceFactors.demandForecast', 'priceFactors.competitorPrice',
    'priceFactors.seasonality', 'priceFactors.marketTrend',
    'clearanceRate', 'wasteReduction'
]

# Train model
X = df[selected_features]
y = df['suggestedPrice']
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=4, random_state=42)
model.fit(X_imputed, y)
```

```python
# Load saved model
import joblib
model = joblib.load("suggested_price_xgb_model_cleaned.pkl")
imputer = joblib.load("imputer_cleaned.pkl")

# Prepare new data
new_data = pd.DataFrame([{
    'cost': 10.0,
    'currentPrice': 15.0,
    'stock': 100,
    # ... other features
}])

# Make prediction
X_new = imputer.transform(new_data[selected_features])
predicted_price = model.predict(X_new)[0]
print(f"Suggested Price: ${predicted_price:.2f}")
```

```python
model_params = {
    'n_estimators': 100,   # Number of boosting rounds
    'learning_rate': 0.1,  # Learning rate
    'max_depth': 4,        # Maximum tree depth
    'random_state': 42     # Reproducibility
}
```

- Base Score: 0.70 (minimum confidence)
- Max Score: 0.99 (maximum confidence)
- Power Transform: 0.3 (score distribution)
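Putting those constants together, a plausible mapping from per-sample prediction standard deviation to `mlScore` might look like the sketch below. The actual transform is stored in `confidence_scaler.pkl` and may differ in detail:

```python
import numpy as np

BASE_SCORE = 0.70   # minimum confidence
MAX_SCORE = 0.99    # maximum confidence
POWER = 0.3         # power transform shaping the score distribution

def variance_to_mlscore(pred_std: np.ndarray) -> np.ndarray:
    """Map prediction std (lower = more confident) into [0.70, 0.99]."""
    # Normalize std to [0, 1], guarding against the constant-std edge case
    rng = pred_std.max() - pred_std.min()
    norm = (pred_std - pred_std.min()) / rng if rng > 0 else np.zeros_like(pred_std)
    # Invert (high variance -> low confidence) and apply the power transform
    confidence = (1.0 - norm) ** POWER
    return BASE_SCORE + confidence * (MAX_SCORE - BASE_SCORE)

scores = variance_to_mlscore(np.array([0.1, 0.5, 2.0]))
```

The most stable prediction gets the max score (0.99) and the least stable gets the base score (0.70); the power exponent below 1 keeps mid-range samples skewed toward higher confidence.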
```json
[
  {
    "productId": "12345",
    "suggestedPrice_predicted": 12.99,
    "mlScore": 0.85
  }
]
```

- `suggested_price_xgb_model_cleaned.pkl`: Trained XGBoost model
- `imputer_cleaned.pkl`: Data imputation model
- `confidence_scaler.pkl`: Confidence score scaler
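The prediction output in that shape can be written with the standard library; the values here are hypothetical stand-ins for real model output:

```python
import json
import numpy as np

# Hypothetical model output standing in for real predictions and scores
product_ids = ["12345", "67890"]
predictions = np.array([12.99, 8.49])
ml_scores = np.array([0.85, 0.91])

# Build records matching the documented JSON schema
records = [
    {
        "productId": pid,
        "suggestedPrice_predicted": round(float(price), 2),
        "mlScore": round(float(score), 2),
    }
    for pid, price, score in zip(product_ids, predictions, ml_scores)
]

with open("predicted_prices_with_score_cleaned.json", "w") as f:
    json.dump(records, f, indent=2)
```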
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or support, please open an issue in the repository or contact the development team.
Note: This model is designed for educational and research purposes. Always validate predictions in production environments and consider business constraints when implementing pricing strategies.