Skip to content

gil-son/machine-learning

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

31 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Repository Purpose

This repository is a comprehensive guide to Machine Learning, designed to bridge theoretical concepts with practical, hands-on implementations. It serves as a learning lab for anyoneβ€”from beginners to practitionersβ€”looking to deepen their understanding of core ML foundations and algorithms.

Goals

  • Demystify Machine Learning through structured explanations and illustrative examples
  • Organize ML algorithms into key paradigms: Supervised, Unsupervised, Semi-Supervised, and Reinforcement Learning
  • Enable experimentation with interactive Jupyter Notebooks for real-world learning
  • Support understanding of mathematical concepts and simplify complex topics like optimization, statistics, and linear algebra

Repository Structure

machine-learning/
β”‚
β”œβ”€β”€ README.md                             # High-level introduction to Machine Learning
β”‚
β”œβ”€β”€ supervised/
β”‚   β”œβ”€β”€ 00.concepts.md                       # Core concepts: labeled data, overfitting, etc.
β”‚   β”œβ”€β”€ 01.linear_regression.md
β”‚   β”œβ”€β”€ 02.logistic_regression.md
β”‚   β”œβ”€β”€ 03.k_nearest_neighbors.md
β”‚   β”œβ”€β”€ 04.naive_bayes.md
β”‚   β”œβ”€β”€ 05.svm.md
β”‚   β”œβ”€β”€ 06.decision_trees.md
β”‚   β”œβ”€β”€ 07.random_forest.md
β”‚   β”œβ”€β”€ 08.gradient_boosting.md
β”‚   β”œβ”€β”€ 09.neural_networks.md
β”‚   β”œβ”€β”€ algorithms/
β”‚   └── notebooks/
β”‚
β”œβ”€β”€ unsupervised/
β”‚   β”œβ”€β”€ 00.concepts.md                       # Key ideas: clustering, dimensionality reduction, etc.
β”‚   β”œβ”€β”€ 01.k_means.md
β”‚   β”œβ”€β”€ 02.dbscan.md
β”‚   β”œβ”€β”€ 03.hierarchical_clustering.md
β”‚   β”œβ”€β”€ 04.pca.md
β”‚   β”œβ”€β”€ 05.tsne.md
β”‚   β”œβ”€β”€ algorithms/
β”‚   └── notebooks/
β”‚
β”œβ”€β”€ reinforcement_learning/
β”‚   β”œβ”€β”€ 00.concepts.md                       # Basics of agents, environments, rewards, etc.
β”‚   β”œβ”€β”€ 01.q_learning.md
β”‚   β”œβ”€β”€ 02.sarsa.md
β”‚   β”œβ”€β”€ 03.deep_q_network.md
β”‚   β”œβ”€β”€ 04.policy_gradient.md
β”‚   β”œβ”€β”€ algorithms/
β”‚   └── notebooks/
β”‚
β”œβ”€β”€ semi_supervised_learning/
β”‚   β”œβ”€β”€ 00.concepts.md                       # Hybrid between supervised and unsupervised
β”‚   β”œβ”€β”€ 01.self_training.md
β”‚   β”œβ”€β”€ 02.label_propagation.md
β”‚   β”œβ”€β”€ algorithms/
β”‚   └── notebooks/
β”‚
└── shared_resources/
    β”œβ”€β”€ datasets/                         # Sample datasets used across topics
    β”œβ”€β”€ utils/                            # Reusable utility functions
    └── references.md                     # Useful academic references and links

What is Machine Learning?

Machine Learning (ML) is a subset of Artificial Intelligence that allows systems to learn from experience (data) and improve their performance on a task without being explicitly programmed with rules. Instead of following hardcoded instructions, the system identifies patterns in data and uses those patterns to make predictions or decisions.

Analogy:

Think of a baby learning to recognize animals. At first, the baby is shown pictures of cats and dogs. Over time, the baby begins to notice patterns β€” cats have pointy ears, dogs often have longer snouts. Eventually, the baby can identify a new picture as a "dog" or "cat" based on what they’ve seen before β€” even without being told the rules. Machine Learning works in a similar way: it learns from examples instead of being told exactly what to do.

Example:

A machine learning model learns to recommend movies based on a user's viewing history and preferences β€” just like how a friend might suggest a movie based on what you’ve enjoyed before.

Types of Machine Learning

Supervised Learning

This is by far the most widely used type of ML in real-world applications.
  • What it is: You train a model on labeled data (i.e., the input and expected output are both known).
  • Use Cases:
    • Email spam detection
    • Credit scoring
    • Medical diagnosis
    • House price prediction

βœ… Popular Algorithms

Linear Regression
  • Concept: Predicts a continuous value (e.g., student test score) based on one or more input features.

  • Essential Math:

    $y = w_1x_1 + w_2x_2 + \cdots + w_nx_n + b$

  • It minimizes the Mean Squared Error (MSE) between predicted and actual values.

  • Use Case: Predicting prices, trends, or scores.

Logistic Regression
  • Concept: Used for binary classification (e.g., pass/fail, spam/ham).

  • Essential Math:

    $P(y = 1 \mid x) = \sigma(w_1x_1 + w_2x_2 + \cdots + w_nx_n + b)$

    Where the sigmoid function is:

    $\sigma(z) = \frac{1}{1 + e^{-z}}$

  • Use Case: Disease prediction, marketing response, fraud detection.

Decision Trees
  • Concept: A flowchart-like structure where each internal node splits the data based on a feature.

  • Essential Math:

    • Gini Impurity:

      $G = 1 - \sum_{i=1}^{C} p_i^2$

    • Entropy (for Information Gain):

      $H = - \sum_{i=1}^{C} p_i \log_2(p_i)$

  • Use Case: Customer segmentation, credit risk modeling.

Random Forest
  • Concept: An ensemble of decision trees trained on random subsets of data and features.

  • Essential Math:

    • For Regression:

      Ε· = (1 / T) Γ— (y₁ + yβ‚‚ + ... + yβ‚œ)

  • For Classification:

    Ε· = majority vote of (y₁, yβ‚‚, ..., yβ‚œ)

  • Use Case: Robust classification and regression tasks, e.g., loan approval, stock prediction.

Support Vector Machines (SVM)
  • Concept:
    • Finds the hyperplane that best separates the data into classes.
  • Essential Math:
    • Decision boundary:

      $w \cdot x + b = 0$

    • Optimization constraint:

      $y_i(w \cdot x_i + b) \geq 1$

    • Margin to maximize:

      $\frac{2}{\lVert w \rVert}$

  • Can use the kernel trick (e.g., RBF kernel) to handle non-linear decision boundaries.
  • Use Case: Text classification, face recognition, bioinformatics.
k-Nearest Neighbors (kNN)
  • Concept: Classifies a sample based on the majority vote (classification) or average (regression) of its k closest neighbors.

  • Essential Math:

    • Euclidean Distance:

      $d(x, x') = \sqrt{ \sum_{i=1}^{n} (x_i - x'_i)^2 }$

  • Other distance metrics can be used, such as Manhattan, Cosine, or Minkowski, depending on the data.

  • Use Case: Recommender systems, image classification, anomaly detection.


Unsupervised Learning

  • What it is: The model tries to find patterns and groupings in the data without labeled outputs.
  • Use Cases:
    • Customer segmentation
    • Market basket analysis
    • Anomaly detection
  • Popular Algorithms:
    • k-Means Clustering
    • DBSCAN
    • PCA (Principal Component Analysis)
  • Python Libraries: scikit-learn, scipy, matplotlib

Reinforcement Learning

  • What it is: An agent learns to make decisions by interacting with an environment and getting feedback (rewards or penalties).
  • Use Cases:
    • Robotics
    • Game playing (e.g., AlphaGo)
    • Self-driving cars
  • Popular Libraries: OpenAI Gym, Stable-Baselines, TensorFlow, PyTorch

Semi-Supervised Learning

  • What it is: Combines a small amount of labeled data with a large amount of unlabeled data to improve learning when labeling is expensive.
  • Use Cases:
    • Web page classification
    • Medical imaging
    • Speech recognition
    • Fraud detection
  • Popular Algorithms:
    • Self-training
    • Label propagation
    • Semi-supervised Support Vector Machines (S3VM)
    • Graph-based methods
  • Python Libraries: scikit-learn, sklearn.semi_supervised, TensorFlow, PyTorch

Machine Learning Techniques

  • Classification
    A supervised learning task where the model learns to categorize data into predefined classes or labels.
    Example: Predicting if an email is spam or not spam.

  • Regression
    A supervised learning task where the goal is to predict a continuous value.
    Example: Predicting the price of a house based on size, location, etc.

  • Clustering
    An unsupervised learning method where the algorithm groups data into clusters based on similarityβ€”without predefined labels.
    Example: Segmenting customers into groups based on their behavior or purchases.

  • Anomaly Detection
    Identifying data points that are unusual or deviate significantly from the majority.
    Example: Detecting fraudulent credit card transactions.

  • Sequence Mining
    Analyzing and identifying patterns in ordered data (sequences), especially over time.
    Example: Finding common sequences in customer purchases or website navigation.

  • Dimension Reduction
    Reducing the number of features (dimensions) in a dataset while keeping important informationβ€”used to simplify models and visualize high-dimensional data.
    Example: Using PCA (Principal Component Analysis) to reduce image data with thousands of pixels into just a few features.

  • Recommendation System
    A system that suggests items (movies, products, etc.) to users based on their preferences or behaviors.
    Example: Netflix recommending movies or shows based on your watch history.


Machine Learning Model Lifecycle

  • Problem Definition
    Clearly define the objective of the machine learning task.
    Example: Predict customer churn or classify product reviews as positive or negative.

  • Data Collection
    Gather relevant and sufficient raw data from various sources like databases, APIs, sensors, or manual input.
    Example: Collecting user behavior logs or survey results.

  • Data Preparation
    Clean, transform, and structure the data for training. This includes handling missing values, encoding categories, and normalizing values.
    Example: Converting text into numeric form or removing outliers.

  • Model Development and Evaluation
    Choose a model type, train it using prepared data, and evaluate its accuracy, precision, recall, or other relevant metrics.
    Example: Training a decision tree and evaluating it using cross-validation.

  • Model Deployment
    Integrate the trained model into a production environment where it can receive real input and make predictions.
    Example: Deploying a fraud detection model via an API to monitor real-time transactions.


About

πŸ“– Comprehensive guide to Machine Learning: theory, algorithms, and hands-on examples.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published