A hands-on tutorial project that implements fundamental machine learning and data science algorithms from scratch using pure Python — no scikit-learn, no TensorFlow, no PyTorch for the core algorithms. The goal is to build a deep understanding of how these algorithms actually work under the hood.
**Who this is for**

- Students learning ML/DS who want to understand the math and mechanics behind the algorithms
- Developers who want to go beyond calling `model.fit()` and understand what happens inside
- Anyone preparing for ML interviews where algorithmic understanding is tested
**Prerequisites**

- Python 3.6+
- Basic understanding of linear algebra (vectors, matrices, dot products)
- Familiarity with calculus (derivatives, chain rule)
- Basic probability and statistics knowledge
**Getting started**

- Clone this repository
- Follow the chapters in numerical order (0 through 19)
- Each chapter contains Jupyter notebooks (`.ipynb`) for theory and Python scripts (`.py`) for implementations
- The `X.Kaggle_Practice_Projects/` folder contains end-to-end projects applying each algorithm to real datasets
- Datasets are stored in `Y.Kaggle_Data/`
**Statistics & Data Handling (Ch. 0-3)**

| Chapter | Topic | Key Concepts |
|---|---|---|
| 0. Statistics Supplement | Descriptive & inferential statistics | Mean, median, mode, variance, hypothesis testing, odds & log-odds |
| 1. Finding and Reading Data | Data I/O | CSV parsing, string-to-float conversion |
| 2. Data Preprocessing | Data preparation | Min-max normalization, z-score standardization, feature engineering |
| 3. Resampling Methods | Train/test strategies | Train/test split, k-fold cross-validation |
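For a taste of the pure-Python style used in these chapters, here is a minimal sketch of min-max normalization and z-score standardization — illustrative code only, not the repository's actual implementation:

```python
import math

def minmax_scale(column):
    """Rescale values to [0, 1] using the column's min and max."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

def zscore_standardize(column):
    """Center to mean 0 and scale by the (population) standard deviation."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(var)
    return [(x - mean) / std for x in column]

data = [50.0, 30.0, 70.0, 90.0, 10.0]
print(minmax_scale(data))        # -> [0.5, 0.25, 0.75, 1.0, 0.0]
print(zscore_standardize(data))  # mean ~0, variance ~1
```

The chapters apply the same idea column-by-column to full CSV datasets.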
**Evaluation Metrics (Ch. 4-7)**

| Chapter | Topic | Key Concepts |
|---|---|---|
| 4. Evaluating Accuracy | Classification & advanced metrics | Accuracy, precision, recall, F1-score, ROC curve, AUC |
| 5. Confusion Matrix | Classification evaluation | Multi-class confusion matrix |
| 6. MAE and RMSE | Regression evaluation | Mean Absolute Error, Root Mean Squared Error, R-squared |
| 7. Baseline Models | Benchmarking | Random prediction, ZeroR algorithm |
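A hedged sketch of how these metrics fall out of raw prediction counts (illustrative, not the repo's code; `positive` marks which label counts as the positive class):

```python
def precision_recall_f1(actual, predicted, positive=1):
    """Compute precision, recall, and F1 for the given positive label."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(precision_recall_f1(y_true, y_pred))  # -> (0.75, 0.75, 0.75)
```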
**Linear Models (Ch. 8-10)**

| Chapter | Topic | Key Concepts |
|---|---|---|
| 8. Linear Regression | Regression | OLS, covariance, correlation, regularization (Ridge/Lasso) |
| 9. Stochastic Gradient Descent | Optimization | SGD algorithm, learning rate, convergence |
| 10. Logistic Regression | Binary classification | Sigmoid function, maximum likelihood, regularization |
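The covariance/variance route to simple OLS can be sketched in a few lines of pure Python (an illustration of the idea, not the chapter's exact code):

```python
def simple_linear_regression(xs, ys):
    """Fit y = b0 + b1*x by ordinary least squares on one feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # slope = covariance(x, y) / variance(x), intercept from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    b1 = cov / var
    b0 = mean_y - b1 * mean_x
    return b0, b1

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # exactly y = 1 + 2x
print(simple_linear_regression(xs, ys))  # -> (1.0, 2.0)
```

Chapters 9-10 replace this closed-form fit with iterative SGD updates, which generalize to many features and to logistic regression.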
**Classic ML (Ch. 11-15)**

| Chapter | Topic | Key Concepts |
|---|---|---|
| 11. Perceptron | Linear classifier | Step function, perceptron learning rule, linear separability |
| 12. Decision Trees | Tree-based models | CART, Gini impurity, recursive splitting, pruning |
| 13. Naive Bayes | Probabilistic classifier | Bayes theorem, conditional independence, Gaussian NB |
| 14. K-Nearest Neighbor | Instance-based learning | Euclidean distance, choosing k, lazy learning |
| 15. Learning Vector Quantization | Prototype-based | Codebook vectors, BMU, competitive learning |
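As one example from this group, k-nearest neighbors fits in a dozen lines of standard-library Python (a minimal sketch with a hypothetical `(features, label)` row format, not the repo's exact code):

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training rows.
    Each training row is a (features, label) pair."""
    neighbors = sorted(train, key=lambda row: euclidean(row[0], query))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

train = [([1.0, 1.0], 'a'), ([1.5, 1.2], 'a'), ([0.9, 1.1], 'a'),
         ([5.0, 5.0], 'b'), ([5.2, 4.8], 'b')]
print(knn_predict(train, [1.1, 1.0], k=3))  # -> 'a'
```

This is "lazy learning" in action: there is no training step at all, only a distance computation at prediction time.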
**Neural Networks, Unsupervised & Advanced (Ch. 16-19)**

| Chapter | Topic | Key Concepts |
|---|---|---|
| 16. Neural Networks | Deep learning foundations | Forward/backward propagation, sigmoid, weight updates |
| 17. K-Means Clustering | Unsupervised learning | Centroid initialization, cluster assignment, elbow method, K-Means++ |
| 18. PCA | Dimensionality reduction | Covariance matrix, eigendecomposition, explained variance |
| 19. Support Vector Machine | Maximum margin classifier | Hinge loss, kernel trick, soft margin, SGD-based SVM |
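The assign-then-update loop at the heart of K-Means can be sketched as follows (Lloyd's algorithm with plain random initialization; K-Means++ and the elbow method, covered in the chapter, refine this — illustrative code only):

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=10, seed=0):
    """Lloyd's algorithm: alternate cluster assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive init; K-Means++ improves this
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[idx].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
          [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]
print(sorted(kmeans(points, k=2)))  # two centroids, one near each blob
```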
**Ensemble Methods (PlusPlus)**

| Chapter | Topic | Key Concepts |
|---|---|---|
| Ensemble Algorithms | Model combination | Bootstrap, bagging, random forests, boosting concepts |
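The two building blocks of bagging — bootstrap resampling and majority voting — can be sketched like this (a hedged illustration; the `model(row) -> label` callable interface is hypothetical, not the repo's API):

```python
import random

def bootstrap_sample(dataset, rng):
    """Draw len(dataset) rows with replacement (a bootstrap resample)."""
    return [rng.choice(dataset) for _ in dataset]

def bagged_predict(models, row):
    """Majority vote across an ensemble; each model is any callable row -> label."""
    votes = [model(row) for model in models]
    return max(set(votes), key=votes.count)

rng = random.Random(42)
data = list(range(10))
print(bootstrap_sample(data, rng))  # same length, rows repeated/omitted at random
```

Random forests add one more twist on top of bagging: each tree also considers only a random subset of features at each split.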
**Kaggle Practice Projects**

| Project | Algorithm | Dataset |
|---|---|---|
| case00 | Simple Linear Regression | Insurance costs |
| case01 | Linear Regression via SGD | Wine quality |
| case02 | Logistic Regression | Diabetes prediction |
| case03 | Perceptron | Sonar classification |
| case04 | CART Decision Tree | Banknote authentication |
| case05 | KNN | Abalone age prediction |
| case06 | LVQ | Ionosphere radar signals |
| case07 | Neural Network | Wheat seed classification |
| case08 | Bagging | Sonar classification |
| case09 | Random Forest | Sonar classification |
**Learning path**

```
Statistics & Data Handling (Ch. 0-3)
                 |
                 v
    Evaluation Metrics (Ch. 4-7)
                 |
                 v
      Linear Models (Ch. 8-10)
                 |
            +----+----+
            |         |
            v         v
       Classic ML   Neural Networks
      (Ch. 11-15)     (Ch. 16)
            |         |
            +----+----+
                 |
                 v
 Unsupervised & Advanced (Ch. 17-19)
                 |
                 v
    Ensemble Methods (PlusPlus)
                 |
                 v
Practice Projects (X.Kaggle_Practice_Projects)
```
**Design philosophy**

- No black boxes: Every algorithm is implemented step-by-step so you can see exactly how it works
- Pure Python first: Core algorithms use only Python's standard library (`math`, `random`, `csv`)
- Optional visualization: Some notebooks use `matplotlib`/`seaborn` for plots, but these are optional and wrapped in try/except blocks
- Learn by doing: Each chapter includes working code you can run, modify, and experiment with
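The optional-visualization pattern can be sketched like this (a hypothetical illustration of the try/except guard, not the notebooks' exact code):

```python
try:
    import matplotlib.pyplot as plt  # optional dependency
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

def maybe_plot(xs, ys):
    """Plot if matplotlib is available; otherwise degrade gracefully."""
    if not HAS_MPL:
        print("matplotlib not installed; skipping plot")
        return False
    plt.plot(xs, ys)
    plt.close()  # close the figure so scripts stay non-blocking
    return True

maybe_plot([1, 2, 3], [2, 4, 6])
```

Because the core algorithms never touch `plt`, every chapter runs to completion with or without the plotting libraries installed.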
**Quick start**

```bash
git clone https://github.com/your-username/Pure_Python_for_DS_ML.git
cd Pure_Python_for_DS_ML
pip install -r requirements.txt  # optional, only for visualization
jupyter notebook
```

This project is for educational purposes. Feel free to use and modify for learning.
William