
Numpy-Native Regression Toolkit




Overview

This repository provides a rigorous, modular implementation of Multivariate Linear Regression, specifically incorporating Ridge (L2) Regularization and Adaptive Early Stopping. The entire core model is built from scratch using only the NumPy library, bypassing high-level ML frameworks.

The project is designed to demonstrate deep algorithmic mastery and core ML Engineering fundamentals. It functions as an extensible first-principles ML module and is validated through a comprehensive evaluation and benchmarking suite.




Key Features

  • Batch Gradient Descent with configurable hyperparameters
  • Ridge (L2) Regularization for better generalization
  • Early Stopping (tolerance + patience)
  • Standardized Train/Validation/Test Split
  • Custom Evaluation Metrics: MSE, MAE, R²
  • Visual Diagnostics:
    • Convergence curves
    • Feature importance bar chart
    • Residuals plot
    • Predicted vs Actual scatter
  • Baselines: Compare against scikit-learn’s LinearRegression and Ridge
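
The heart of the trainer is a single ridge-regularized batch update. The following is a minimal NumPy sketch of that step (names here are illustrative; the actual model.py class additionally tracks loss history and patience-based stopping):

import numpy as np

def ridge_gd_step(X, y, w, b, lr, lam):
    """One batch gradient-descent step on (1/n)*||Xw + b - y||^2 + lam*||w||^2."""
    n = X.shape[0]
    resid = X @ w + b - y                               # prediction error, shape (n,)
    grad_w = (2.0 / n) * (X.T @ resid) + 2.0 * lam * w  # bias is left unpenalized
    grad_b = (2.0 / n) * resid.sum()
    return w - lr * grad_w, b - lr * grad_b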

Project Structure

Linear-Regression-Scratch/
├── linear_scratch/
│   ├── __init__.py
│   ├── model.py
│   ├── metrics.py
│   ├── plotting.py
│   ├── preprocessing.py        
│   ├── evaluation.py           
│   ├── visualization.py        
│   └── benchmarks.py           
│
├── notebooks/
│   ├── Linear_Regression_Scratch.ipynb
│   ├── End_to_End_Pipeline.ipynb        
│   └── images/
│       ├── feature_importance.png
│       ├── residuals.png
│       └── predicted_vs_actual.png
│
├── requirements.txt
├── LICENSE
└── README.md

Dataset & Preprocessing

  • Dataset: California Housing dataset from sklearn.datasets
  • Target: Median house value (log-scaled)
  • Preprocessing Steps:
    • Missing-value handling via SimpleImputer
    • Scaling using StandardScaler or RobustScaler
    • Random Train/Validation/Test split (seeded)
    • Ridge penalty tuning and early stopping based on validation loss

This ensures stable gradient updates and mitigates feature magnitude bias.
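
As a rough sketch of the target transform and seeded split described above (the split ratios here are illustrative, not the project's exact settings):

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data = fetch_california_housing()
X, y = data.data, np.log1p(data.target)  # log-scale the target

# Seeded 60/20/20 train/validation/test split
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)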


Installation

# Clone the repository
git clone https://github.com/AshBeeXD/Linear-Regression-Scratch.git
cd Linear-Regression-Scratch

# (Optional) Create a virtual environment
python -m venv .venv
source .venv/bin/activate  # macOS/Linux
.venv\Scripts\Activate.ps1 # Windows

# Install dependencies
pip install -r requirements.txt


Quick Start

Run the Jupyter notebook to reproduce the full workflow:

jupyter notebook notebooks/Linear_Regression_Scratch.ipynb

Or import and train directly from Python:

from linear_scratch import LinearRegressionScratch
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load and prepare data
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Standardize features
scaler = StandardScaler()
X_train, X_test = scaler.fit_transform(X_train), scaler.transform(X_test)

# Train custom model
model = LinearRegressionScratch(
    learning_rate=0.01, n_iters=1000, lambda_reg=0.1, tol=1e-4, patience=10
)
model.fit(X_train, y_train, verbose=True)

# Evaluate performance
print(model.evaluate(X_test, y_test))

Results & Visualizations

| Metric | Train (Scratch) | Test (Scratch) | Train (SK Ridge) | Test (SK Ridge) |
|--------|-----------------|----------------|------------------|-----------------|
| MSE    | 0.6534          | 0.6625         | 0.6474           | 0.6589          |
| MAE    | 0.6070          | 0.6112         | 0.5987           | 0.6033          |
| R²     | 0.5112          | 0.4945         | 0.5157           | 0.4972          |

Visual Diagnostics

  • Feature Importance: notebooks/images/feature_importance.png
  • Residuals Distribution: notebooks/images/residuals.png
  • Predicted vs Actual: notebooks/images/predicted_vs_actual.png


Interpretation

The model highlights Median Income (MedInc) as the most influential predictor of housing prices, followed by House Age and Average Rooms.
Ridge regularization reduces overfitting by penalizing large coefficients, while early stopping ensures faster and more stable convergence.
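
Concretely, the objective being minimized is the ridge-penalized mean squared error (writing $\lambda$ for lambda_reg):

$$ J(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n}\left(\mathbf{x}_i^\top \mathbf{w} + b - y_i\right)^2 + \lambda\,\lVert \mathbf{w} \rVert_2^2 $$

Larger values of $\lambda$ shrink the learned weights toward zero, trading a little training accuracy for better generalization.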


Benchmarks & Performance Parity

benchmarks.py demonstrates near-parity (≈96–98%) with scikit-learn’s Ridge regression.

Run:

python -m linear_scratch.benchmarks

Benchmark Results:

| Model   | MSE    | MAE    | R²     |
|---------|--------|--------|--------|
| Scratch | 0.5826 | 0.5624 | 0.5554 |
| Ridge   | 0.5559 | 0.5332 | 0.5758 |

Performance parity: 96.46%

These results confirm that the scratch-built model performs comparably to Ridge regression while being trained via gradient descent rather than a closed-form analytical solution.
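
For context, the closed-form solution that underlies scikit-learn's Ridge fits in a few lines of NumPy. This sketch assumes standardized features and a centered target so the intercept can be dropped; note that scaling conventions for $\lambda$ differ between mean- and sum-squared objectives:

import numpy as np

def ridge_closed_form(X, y, lam):
    """Analytical minimizer of (1/n)*||Xw - y||^2 + lam*||w||^2."""
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)   # lam is scaled by n to match the mean objective
    return np.linalg.solve(A, X.T @ y)  # solve() is more stable than an explicit inverse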


Advanced Preprocessing Pipeline

preprocessing.py introduces a structured and reliable data preprocessing system to ensure consistent, reproducible inputs for model training.

Key Features

  • Handles missing values using SimpleImputer
  • Scales features using either StandardScaler or RobustScaler
  • Supports column-wise transformations with ColumnTransformer
  • Integrates seamlessly with the end-to-end pipeline
  • Ensures consistent preprocessing between training and testing sets

This approach improves model stability, prevents data leakage, and ensures reproducibility across experiments.

Usage Example:

from linear_scratch.preprocessing import preprocess

X_train_prep, X_test_prep, transformer = preprocess(X_train, X_test)
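
Internally, such a transformer could be assembled along the following lines (a sketch of the approach, not necessarily the module's exact code):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def make_preprocessor(numeric_cols):
    """Impute then scale numeric columns; fit on train data only to avoid leakage."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    return ColumnTransformer([("num", numeric, numeric_cols)])

transformer = make_preprocessor(numeric_cols=list(range(X_train.shape[1])))
X_train_prep = transformer.fit_transform(X_train)  # fit statistics on train only
X_test_prep = transformer.transform(X_test)        # reuse those statistics on test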

Comprehensive Evaluation Suite

The evaluation.py module introduces a complete and extensible evaluation system for analyzing model performance through both numerical metrics and visual diagnostics.


Metrics Included

  • MSE — Mean Squared Error
  • RMSE — Root Mean Squared Error
  • MAE — Mean Absolute Error
  • MAPE — Mean Absolute Percentage Error
  • R² — Coefficient of Determination

Each metric provides a different perspective on model performance — from absolute error magnitude (MAE) to relative error percentage (MAPE) and overall variance explanation (R²).
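
All of these reduce to a few lines of NumPy. A sketch that mirrors the definitions (not the module's exact signatures):

import numpy as np

def regression_metrics(y_true, y_pred):
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        "MAPE": 100.0 * np.mean(np.abs(err / y_true)),  # assumes y_true contains no zeros
        "R2": 1.0 - ss_res / ss_tot,
    }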

Example Usage:

from linear_scratch.evaluation import evaluate_model, plot_diagnostics

results = evaluate_model(y_test, y_pred)
plot_diagnostics(y_test, y_pred)

Outputs:

  • Tabular summary of all performance metrics

  • Residual distribution plots for model error analysis

  • Predicted vs Actual plots to visualize model fit quality

  • Optional MAPE percentage for interpretability on scaled data

This evaluation suite ensures that both quantitative and visual validation steps are incorporated into the workflow.

Visualization Module for Model Insights

The visualization.py module introduces a comprehensive and unified visualization framework for interpreting model behavior, performance, and convergence patterns.
It centralizes all major diagnostic plots into a single, easy-to-use interface to help users understand how and why their linear model performs the way it does.


Key Capabilities

  • Convergence Curve: Visualizes loss reduction across epochs, confirming whether the model converged smoothly or prematurely.
  • Feature Importance Chart: Displays the contribution of each feature to the prediction, derived from learned weights.
  • Residual Distribution Plot: Highlights bias and variance in model errors — helps identify underfitting or overfitting.
  • Predicted vs Actual Plot: Provides a visual measure of regression accuracy and model alignment with the target variable.

Example Usage

from linear_scratch.visualization import plot_model_insights

plot_model_insights(model, X_test, y_test, feature_names=feature_names)

This single call generates all key diagnostics automatically and saves or displays them as high-resolution figures for reporting or further analysis.


Generated Plots

When the visualization module is executed, it automatically generates and saves several key diagnostic plots that capture the model’s performance and learning behavior.
These visualizations help assess the regression quality, feature influence, and stability of training.


1. Convergence Curve

  • Purpose: Shows how the loss decreases over training epochs.
  • Interpretation:
    • A smooth downward trend indicates stable convergence.
    • A noisy curve may signal a learning rate that is too high, while a curve that flattens very early may mean early stopping triggered prematurely.
  • Insight: Confirms that gradient descent optimization is functioning as expected.
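
If the fitted model records its per-epoch loss (assumed here as a loss_history attribute; the actual attribute name may differ), the curve can be reproduced directly:

import matplotlib.pyplot as plt

plt.plot(model.loss_history)  # assumed list of per-epoch training losses
plt.xlabel("Epoch")
plt.ylabel("Training loss")
plt.title("Convergence Curve")
plt.show()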

2. Feature Importance Chart

  • Purpose: Displays the relative weight or importance of each input feature.
  • Interpretation:
    • Features with higher absolute weights contribute more to predictions.
    • Negative weights indicate an inverse relationship with the target variable.
  • Insight: Helps identify the key drivers of housing prices, e.g., Median Income (MedInc) typically has the strongest positive influence.

3. Residual Distribution Plot

  • Purpose: Shows how prediction errors (residuals) are distributed.
  • Interpretation:
    • A centered and symmetric distribution around zero suggests an unbiased model.
    • Heavy tails or skewness may indicate systematic prediction bias.
  • Insight: Helps evaluate underfitting, overfitting, or model bias.

4. Predicted vs Actual Plot

  • Purpose: Compares the model’s predictions with actual target values.
  • Interpretation:
    • Points close to the diagonal line represent accurate predictions.
    • Dispersion away from the diagonal indicates variance or bias in the model.
  • Insight: A dense cluster along the diagonal demonstrates strong performance parity with the true data distribution.

Benefits of Visualization

  • Enables quick debugging of training behavior through visual cues.
  • Offers interpretability — users can understand how the model arrives at predictions.
  • Ensures reproducibility, as plots are automatically generated and saved.
  • Serves as a communication tool in reports, presentations, or publications.

By combining quantitative metrics with visual analysis, the project bridges the gap between algorithmic performance and human interpretability.


Integration with the Full Pipeline

The visualization module integrates directly into the end-to-end pipeline, running seamlessly after model training and evaluation.

Example Workflow:

from linear_scratch.visualization import plot_model_insights
from linear_scratch.evaluation import evaluate_model

# Evaluate model predictions
results = evaluate_model(y_test, y_pred)

# Generate all visualizations in one call
plot_model_insights(model, X_test, y_test, feature_names=feature_names)

Typical Execution Order:

  1. Train the model with LinearRegressionScratch

  2. Evaluate metrics with evaluation.py

  3. Generate plots using visualization.py

  4. Save all outputs under notebooks/images/

This ensures a unified, reproducible workflow in which numerical validation and visualization are tightly coupled.


Troubleshooting

| Issue | Possible Cause | Fix |
|-------|----------------|-----|
| Diverging loss | Learning rate too high | Lower learning_rate (e.g., 0.001) |
| No improvement in validation loss | Over-regularization | Reduce lambda_reg |
| NaN or inf values | Data not standardized | Apply StandardScaler before training |

⏭️ Future Work and Technical Roadmap

This section outlines planned enhancements to expand the toolkit's functionality, robustness, and performance, ensuring the project remains a cutting-edge demonstration of core ML engineering and algorithmic mastery.

1. Algorithmic Extensions (Deepening Mastery)

| Feature | Technical Goal |
|---------|----------------|
| Implement Mini-Batch Gradient Descent (MBGD) | Refactor the optimizer to utilize data in small batches, improving training speed and resource utilization on larger datasets. |
| Add L1 (Lasso) Regularization | Introduce the L1 penalty term, which requires sub-gradient methods (e.g., ISTA) or coordinate descent; see the sketch after this table. |
| Polynomial Feature Generator | Implement a feature-engineering utility to generate polynomial and interaction terms, allowing the model to capture non-linear relationships. |
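
For the L1 item above, the key building block is the soft-thresholding (proximal) operator used by ISTA. A minimal sketch of the math, not committed code:

import numpy as np

def soft_threshold(w, threshold):
    """Proximal operator of the L1 norm: shrink each weight toward zero.

    ISTA applies this after each plain gradient step,
    with threshold = learning_rate * lambda_l1.
    """
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)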

2. Robustness and Software Engineering

| Feature | Technical Goal |
|---------|----------------|
| CI/CD Integration (GitHub Actions) | Establish an automated workflow to run unit tests and linting (flake8 / black) on every commit or PR. |
| Hyperparameter Tuning Utility | Build a custom grid-search utility to automate finding the optimal learning rate ($\alpha$) and regularization strength ($\lambda$); see the sketch after this table. |
| Comprehensive Unit Testing | Increase test coverage to 90%+ across all modules, focusing on mathematical edge cases and component integration. |
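
The tuning utility could be as simple as an exhaustive loop over candidate ($\alpha$, $\lambda$) pairs scored on the validation split. An illustrative sketch, assuming a validation split (X_val, y_val) is available and evaluate() returns a dict containing "MSE":

from itertools import product

best = None
for lr, lam in product([0.001, 0.01, 0.1], [0.0, 0.01, 0.1, 1.0]):
    m = LinearRegressionScratch(learning_rate=lr, n_iters=1000, lambda_reg=lam)
    m.fit(X_train, y_train)
    score = m.evaluate(X_val, y_val)["MSE"]  # assumed dict-style return value
    if best is None or score < best[0]:
        best = (score, lr, lam)

print("Best (MSE, learning_rate, lambda_reg):", best)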

3. Performance and Scalability

| Feature | Technical Goal |
|---------|----------------|
| JIT Compilation with Numba | Integrate the Numba library to Just-In-Time (JIT) compile the core gradient and prediction loops; see the sketch after this table. |
| Detailed Profiling and Benchmarking | Use Python's cProfile module to identify specific bottlenecks and document the performance gains achieved through vectorization. |
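
As a flavor of the planned Numba integration, the core gradient step could be JIT-compiled roughly as follows (an untested sketch; the real signatures would follow the refactored optimizer):

import numpy as np
from numba import njit

@njit(cache=True)
def grad_step(X, y, w, lr, lam):
    n = X.shape[0]
    resid = X @ w - y                                # predictions minus targets
    grad = (2.0 / n) * (X.T @ resid) + 2.0 * lam * w
    return w - lr * grad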

Contributing

Contributions, bug reports, and improvements are welcome!
Fork the repo, create a feature branch, and submit a pull request.

git checkout -b feature/new-feature
git commit -m "Add new feature"
git push origin feature/new-feature

License

Released under the MIT License. See LICENSE for details.
