Machine Learning.md

  • In 1959, the term "machine learning" was first introduced by Arthur Samuel. He defined it as the "field of study that gives computers the ability to learn without being explicitly programmed".
  • The learning process improves the machine model over time by using training data.
  • The evolved model is used to make future predictions.

Arthur Samuel, a former IBM engineer and a professor at Stanford, was one of the pioneers in the field of computer gaming and artificial intelligence. He was the first to introduce the term “machine learning”. Machine learning is a field of artificial intelligence. It uses statistical methods to give computers the ability to "learn" from data without being explicitly programmed.

If a computer program can improve how it performs certain tasks based on past experiences, then it has learned. This differs from always performing the task the same way because it has been programmed to do so.

The learning process improves the so-called “model” over time by using different data points (training data). The evolved model is used to make future predictions.

What is a statistical model?

  • A model in a computer is a mathematical function that represents a relationship or mapping between a set of inputs and a set of outputs, for example `f(x) = x^2` or `Violent Crime Incidents per day = Average Temperature * 2`.

  • New data “X” can predict the output “Y”: `Y = b0 * X + b1`

The representation of a model in the computer is in the form of a mathematical function. It is a relationship or mapping between a set of inputs and a set of outputs. For example, f(x)=x^2.

Assume that a system is fed with data indicating that the rates of violent crime are higher when the weather is warmer and more pleasant, even rising sharply during warmer-than-typical winter days. Then, this model can predict the crime rate for this year compared to last year’s rates based on the weather forecast.

Returning to the mathematical representation of the model that can predict crime rate based on temperature, we might propose the following mathematical model:

Violent crime incidents per day = Average Temperature × 2

This is an oversimplified example to explain that machine learning refers to a set of techniques for estimating functions (for example, predicting crime incidents) based on data sets (pairs of the day’s average temperature and the associated number of crime incidents). These models can be used to make predictions for future data.
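As a minimal sketch, the oversimplified crime-rate model can be written as a one-line Python function. The function name and the sample temperature are hypothetical and chosen only for illustration; the factor of 2 is the one proposed in the text.

```python
def predict_crime_incidents(average_temperature: float) -> float:
    """Toy model from the text: incidents per day = average temperature * 2."""
    return average_temperature * 2


# Hypothetical input: a day with an average temperature of 15 degrees.
incidents = predict_crime_incidents(15)
print(incidents)
```

In a real machine learning setting, the factor would be estimated from the (temperature, incidents) pairs rather than fixed by hand.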

Machine Learning Techniques

Supervised Learning

Supervised learning can be separated into two types of problems when data mining: classification and regression:

  • Classification problems use an algorithm to accurately assign test data into specific categories, such as separating apples from oranges. Or, in the real world, supervised learning algorithms can be used to classify spam in a separate folder from your inbox. Linear classifiers, support vector machines, decision trees and random forest are all common types of classification algorithms.
  • Regression is another type of supervised learning method that uses an algorithm to understand the relationship between dependent and independent variables. Regression models are helpful for predicting numerical values based on different data points, such as sales revenue projections for a given business. Some popular regression algorithms are linear regression and polynomial regression. Logistic regression, despite its name, predicts a class probability and is typically used for classification.

Unsupervised Learning

Unsupervised learning uses machine learning algorithms to analyze and cluster unlabeled data sets. These algorithms discover hidden patterns in data without the need for human intervention (hence, they are “unsupervised”).

Unsupervised learning models are used for three main tasks: clustering, association and dimensionality reduction:

  • Clustering is a data mining technique for grouping unlabeled data based on their similarities or differences. For example, K-means clustering algorithms assign similar data points into groups, where the K value is the number of groups and therefore sets the granularity of the grouping. This technique is helpful for market segmentation, image compression, etc.
  • Association is another type of unsupervised learning method that uses different rules to find relationships between variables in a given dataset. These methods are frequently used for market basket analysis and recommendation engines, along the lines of “Customers Who Bought This Item Also Bought” recommendations.
  • Dimensionality reduction is a learning technique used when the number of features (or dimensions) in a given dataset is too high. It reduces the number of data inputs to a manageable size while also preserving the data integrity. Often, this technique is used in the data preprocessing stage, such as when autoencoders remove noise from visual data to improve picture quality.

Reinforcement Learning

Reinforcement learning is a learning paradigm that learns to optimize sequential decisions, which are decisions that are taken recurrently across time steps, for example, daily stock replenishment decisions taken in inventory control.

Reinforcement learning works in a mathematical framework consisting of the following ingredients:

  • A state space (or observation space): All available information and problem features that are useful for taking a decision. This includes fully known or measured variables as well as unmeasured variables for which you might only have a belief or estimate.

  • An action space: Decisions that you can take in each state of the system.

  • A reward signal: A scalar signal that provides the necessary feedback about performance, and, therefore, the opportunity to learn which actions are beneficial in any given state. The learning accounts for both immediate gain and long-term gain, because an action that is taken in any state leads to a future state where another action is taken, and so on. The discounted cumulative reward signal is the optimization objective for reinforcement learning, making it focus on a long-term strategy that yields the best cumulative reward.
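The discounted cumulative reward can be illustrated with a short sketch. The discount factor name `gamma` and the sample rewards are assumptions made for this example; the source does not specify values.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the time steps t = 0, 1, 2, ..."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Hypothetical rewards received at three consecutive time steps.
example_rewards = [1.0, 1.0, 1.0]
example_return = discounted_return(example_rewards, gamma=0.5)  # 1 + 0.5 + 0.25
print(example_return)
```

A discount factor below 1 weights immediate rewards more heavily than distant ones while still counting the long-term outcome.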

Learning Algorithms

Naïve Bayes Classifier

Naïve Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.

  1. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter.
  2. Features: Color, roundness, and diameter.
  3. Assumption: Each of these features contributes independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.

The Naïve Bayes classifier is a powerful and simple supervised machine learning algorithm. It assumes that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter.

Features: Color, roundness, and diameter. A Naïve Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.

Imagine that you have the data set that is shown in the table in this slide. The column with title “Is Apple?” represents the label of the data. Our objective is to make a new prediction for an unknown object. The unknown object has the following features:

•    Color: Red

•    Shape: Round

•    Diameter: 10 cm

Your algorithm basically depends on calculating two probability values:

  • Class probabilities: The probabilities of having each class in the training data set.
  • Conditional probabilities: The probabilities of each input feature given a specific class value.

To do a classification, you must perform the following steps:

1.    Define two classes CY and CN that correspond to Apple = Yes and Apple = No.

2.    Compute the probability of CY given x: p(CY | x) = p(Apple = Yes | Color = Red, Shape = round, Diameter ≥ 10 cm)

3.    Compute the probability of CN given x: p(CN | x) = p(Apple = No | Color = Red, Shape = round, Diameter ≥ 10 cm)

4.    Discover which conditional probability is larger: If p(CY | x) > p(CN | x), then it is an apple.

Naïve Bayes model:

The Naïve Bayes formula follows from Bayes' theorem. Our target is to compute p(CK | x), where K is any class (CY or CN):

p(CK | x) = p(x | CK) × p(CK) / p(x)

5.    Compute the conditional probability of having each feature given that the class is CY:

p(x | CY) = p(Color = Red, Shape = round, Diameter ≥ 10 cm | Apple = Yes).

Because Naïve Bayes assumes that the features of the input data (the object features) are independent, to get the p(x | CY) value, we calculate the conditional probability of each feature, one at a time, given the class CY, and then multiply all the values.

Thus, we can rewrite p(x | CY) as:

= p(Color = Red | Apple = Yes) × p(Shape = round | Apple = Yes) × p(Diameter ≥ 10 cm | Apple = Yes)

We apply the same rule for p(x | CN) by multiplying the conditional probabilities of each input feature given CN:

= p(Color = Red | Apple = No) × p(Shape = round | Apple = No) × p(Diameter ≥ 10 cm | Apple = No)

6. Calculate each conditional probability:

p(Color = Red | Apple = Yes) = 3/5 (Out of five apples, three of them were red.)

p(Color = Red | Apple = No) = 2/5

p(Shape = Round | Apple = Yes) = 4/5

p(Shape = Round | Apple = No) = 2/5

p(Diameter ≥ 10 cm | Apple = Yes) = 2/5

p(Diameter ≥ 10 cm | Apple = No) = 3/5

Let us see how to calculate these conditional probabilities. For example, to calculate p(Color = Red | Apple = Yes), you are asking, “What is the probability of having a red object given that we know that it is an apple?” The table contains five apples. How many of these five occurrences have the color red? You find that there are three occurrences of the red color. Therefore, p(Color = Red | Apple = Yes) = 3/5.

Repeat these steps for the rest of the features.

Now, we have all the values that we need. As mentioned in step 5, we multiply the conditional probabilities as follows:

  • p(Color = Red | Apple = Yes) × p(Shape = round | Apple = Yes) × p(Diameter ≥ 10 cm | Apple = Yes) = (3/5) × (4/5) × (2/5) = 0.192
  • p(Color = Red | Apple = No) × p(Shape = round | Apple = No) × p(Diameter ≥ 10 cm | Apple = No) = (2/5) × (2/5) × (3/5) = 0.096
  • p(Apple = Yes) = 5/10
  • p(Apple = No) = 5/10

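Finally, we compare the values of p(CY | x) versus p(CN | x). Because the denominator p(x) is the same for both classes, it is enough to compare p(x | CY) × p(CY) = 0.192 × 5/10 = 0.096 with p(x | CN) × p(CN) = 0.096 × 5/10 = 0.048. By substituting the values that were calculated in the previous steps, we discover that p(CY | x) > p(CN | x), which means that the object is an apple.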
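The worked example can be checked with a short sketch that uses only the probabilities quoted in the text. The variable and function names are hypothetical; exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# Conditional probabilities p(feature | class) quoted in the text, in the
# order: color = red, shape = round, diameter >= 10 cm.
likelihoods = {
    "yes": [Fraction(3, 5), Fraction(4, 5), Fraction(2, 5)],
    "no": [Fraction(2, 5), Fraction(2, 5), Fraction(3, 5)],
}
# Class probabilities p(Apple = Yes) and p(Apple = No).
priors = {"yes": Fraction(5, 10), "no": Fraction(5, 10)}


def unnormalized_posterior(label):
    """Prior times the product of the per-feature likelihoods."""
    score = priors[label]
    for p in likelihoods[label]:
        score *= p
    return score


scores = {label: unnormalized_posterior(label) for label in priors}
total = sum(scores.values())
# Dividing by the total plays the role of the shared denominator p(x).
posteriors = {label: score / total for label, score in scores.items()}
prediction = max(scores, key=scores.get)
print(scores, posteriors, prediction)
```

Normalizing is optional for the decision itself, because both scores share the same denominator.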

Linear Regression

  • Linear regression analysis is used to predict the value of a variable based on the value of another variable.
  • The target variable is a continuous value.

In simple linear regression, we establish a relationship between the target variable and input variables by fitting a line that is known as the regression line.

There are different applications that benefit from linear regression:

  • Analyze the effect of marketing, pricing, and promotions on the sales of a product.
  • Forecast sales by analyzing the company’s monthly sales for the past few years.
  • Predict how house prices increase with the sizes of houses.
  • Calculate causal relationships between parameters in biological systems.

Example: Assume that we are studying the real estate market. Objective: Predict the price of a house given its size by using previous data.

| Size (m²) | Price ($) |
| --- | --- |
| 30 | 30,000 |
| 70 | 40,000 |
| 90 | 55,000 |
| 110 | 60,000 |
| 130 | 80,000 |
| 150 | 90,000 |
| 180 | 95,000 |
| 190 | 110,000 |

Assume that we are studying the real estate market and our objective is to predict the price of a house given its size by using previous data. The label in this case is the price column.

After plotting the points on the graph, they seem to form a line.

  • Can you guess the best estimate for the price of a 140-square-meter house?

  • Which one is correct?

  • $60,000

  • $95,000

  • $85,000


You want to find the price of a 140-square-meter house. Which of the following choices is correct?

1. $60,000

2. $95,000

3. $85,000

  • Target: A line that is within a "proper" distance from all points.
  • Error: The aggregated distance between data points and the assumed line.
  • Solution: Calculate the error iteratively until you reach the most accurate line with a minimum error value (that is, the minimum distance between the line and all points).

To answer the question “What is the price for a 140-square-meter house?”, we need to draw the line that best fits most of the data points.

How can we find the line that best fits all the data points? We can draw many lines, so which one is the best line?

The best line should have the minimal error value. The error refers to the aggregated distance between data points and the assumed line. Calculate the error iteratively until you reach the most accurate line with a minimum error value.

  • After the learning process, you get the most accurate line, the bias, and the slope to draw your line.
  • Here is our linear regression model representation for this problem:

hp(X) = p0 + p1 * X1

or

Price = 30,000 + 392 * Size

Price = 30,000 + 392 * 140 = 84,880 ≈ 85,000

After the learning process, you get the most accurate line, the bias, and the slope to draw your line. p0 is the bias. It is also called the intercept because it determines where the line intercepts the y axis. p1 is the slope because it defines the slope of the line or how x correlates with a y value before adding the bias. If you have the optimum value of p0 and p1, you can draw the line that best represents the data.
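As a hedged sketch, the bias and slope can also be computed directly with the closed-form least-squares formulas instead of an iterative search. This swaps in a different technique from the iterative procedure described in the text, and the function name `fit_line` is hypothetical. On the eight data points from the table, the least-squares line comes out at roughly Price = 10,027 + 505 × Size, so it does not coincide with the rounded illustrative line Price = 30,000 + 392 × Size; it predicts about 80,700 for a 140-square-meter house, which still makes $85,000 the closest of the three choices.

```python
sizes = [30, 70, 90, 110, 130, 150, 180, 190]
prices = [30_000, 40_000, 55_000, 60_000, 80_000, 90_000, 95_000, 110_000]


def fit_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope


p0, p1 = fit_line(sizes, prices)
predicted_price = p0 + p1 * 140  # estimate for a 140-square-meter house
print(round(p0, 2), round(p1, 2), round(predicted_price, 2))
```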

The squared error function J measures the difference between the predicted values and the actual values. It is calculated as follows:

$$J(p) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_p(x_i) - y_i \right)^2$$

Where:

•    i is the index of a sample or data point within the data set samples.

•    $h_p(x_i)$ is the predicted value for data point i.

•    $y_i$ is the actual value for data point i.

•    m is the count of data set samples or data points.

We can use an optimization technique that is called stochastic gradient descent. The algorithm evaluates and updates the weights on every iteration to minimize the model error. The technique works iteratively. In each iteration, one training instance is exposed to the model. The model makes a prediction and the corresponding error is calculated. The model is updated to reduce the error for the next prediction. The process continues to adjust the model weights to reach the smallest error.

Here we use the gradient descent algorithm to iteratively get the values of p0 and p1 (the intercept and slope of the line are also called weights) by the following update rule:

$$p_j := p_j - \alpha \left( h_p(x_i) - y_i \right) x_{i,j}$$

Where:

j is the feature number.

α is the learning rate.

$x_{i,j}$ is the value of feature j for data point i (for the intercept p0, this value is 1).
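The per-sample update rule can be sketched as follows on the house data. Several choices here are assumptions of this sketch and not part of the source: the inputs are rescaled (size in hundreds of square meters, price in thousands of dollars) so that one learning rate suits both weights, the learning rate is 0.05, the samples are visited in a fixed order, and training runs for 500 passes. With a constant learning rate, the weights settle near the least-squares line rather than exactly on it.

```python
sizes = [30, 70, 90, 110, 130, 150, 180, 190]
prices = [30_000, 40_000, 55_000, 60_000, 80_000, 90_000, 95_000, 110_000]

# Rescaling is an assumption of this sketch, made to keep the updates stable.
xs = [s / 100 for s in sizes]    # size in hundreds of square meters
ys = [p / 1000 for p in prices]  # price in thousands of dollars


def predict(p0, p1, x):
    return p0 + p1 * x


def cost(p0, p1):
    """Squared error J = 1/(2m) * sum of (prediction - actual)^2."""
    m = len(xs)
    return sum((predict(p0, p1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def sgd(alpha=0.05, epochs=500):
    """Update both weights after every single training sample."""
    p0, p1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = predict(p0, p1, x) - y
            p0 -= alpha * error * 1.0  # the feature value for the bias is 1
            p1 -= alpha * error * x
    return p0, p1


initial_cost = cost(0.0, 0.0)
p0, p1 = sgd()
final_cost = cost(p0, p1)
price_140 = predict(p0, p1, 1.4) * 1000  # convert back to dollars
print(initial_cost, final_cost, price_140)
```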

  • In higher dimensions where we have more than one input (X), the line is called a plane or a hyper-plane.
  • The equation can be generalized from simple linear regression to multiple linear regression as follows:

Y(X)=p0+p1*X1+p2*X2+...+pn*Xn

With more features, you do not have a line; instead, you have a plane. In higher dimensions where we have more than one input (X), the line is called a plane or a hyper-plane.


Logistic Regression

  • Logistic regression estimates the probability of an event occurring, such as voted or didn't vote, based on a given dataset of independent variables.
  • Target: The dependent variable (Y) is a discrete category or a class.

Example: Class1 = Cancer, Class2 = No Cancer

  • Logistic regression is named for the function that is used at the core of the algorithm.
  • The logistic function (sigmoid function) is an S-shaped curve that is used to discriminate between classes. It maps any real-valued input to a value between 0 and 1.

Logistic regression is named for the function that is used at the core of the algorithm, which is the logistic function. The logistic function is also known as the sigmoid function. It is an S-shaped curve (as shown in the figure) that maps any real-valued input to a value between 0 and 1, which is then used to separate the data into classes.

During the learning process, the system tries to generate a model (estimate a set of parameters p0, p1, …) that can best predict the probability that Y will fall in class A or B given the input X. The sigmoid function squeezes the input value into the range [0, 1], so if the output is 0.77, it is closer to 1, and the predicted class is 1.

  • Example: Assume that the estimated values of p’s for a certain model that predicts the gender from a person’s height are p0 = -100 and p1 = 0.6.
  • Class 0 represents female and class 1 represents male.
  • To compute the prediction for a height of X = 150, use:

Y = exp(-100 + 0.6X) / (1 + exp(-100 + 0.6X))

Y = exp(-10) / (1 + exp(-10)) = 0.00004539

P(male | height = 150) is approximately 0 in this case, so the predicted class is 0 (female).
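The prediction can be reproduced with a few lines of Python. This sketch assumes the coefficients p0 = -100 and p1 = 0.6, which are the values that yield the quoted result 0.00004539 for a height of 150. The 0.5 decision threshold and the function names are also assumptions of this example.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(height_cm, p0=-100.0, p1=0.6):
    """Estimated probability of class 1 (male) for the given height."""
    return sigmoid(p0 + p1 * height_cm)


probability_male = predict_probability(150)
# Assumed decision rule: predict class 1 when the probability is at least 0.5.
predicted_class = 1 if probability_male >= 0.5 else 0
print(probability_male, predicted_class)
```

The form 1 / (1 + exp(-z)) is algebraically the same as exp(z) / (1 + exp(z)).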

Support Vector Machines

  • The goal is to find a separating hyperplane between positive and negative examples of input data.
  • SVM is a supervised learning model that can be a linear or non-linear classifier.
  • SVM is also called a “large margin classifier” because the algorithm seeks the hyperplane with the largest margin, that is, the largest distance to the nearest sample points.

![[Pasted image 20230623185304.png]]

Assume that a data set lies in a two-dimensional space and that the hyperplane will be a one-dimensional line.

Although many lines (in light blue) do separate all instances correctly, there is only one optimal hyperplane (red line) that maximizes the distance to the closest points (in yellow).

Decision Tree

A decision tree is a popular supervised learning algorithm that can be used for classification and regression problems. Decision trees can explain why a specific prediction was made by traversing the tree.

Decision trees have many business applications. For example, they can predict customers’ willingness to purchase a given product in a given setting, such as online versus in a physical store.

A decision tree includes three main entities: root node, decision nodes, and leaves. The figure shows the graphical representation of these entities.

A decision tree builds the classification or regression model in the form of a tree structure. It resembles a flowchart, and is easy to interpret because it breaks down a data set into smaller and smaller subsets while building the associated decision tree.

The “Play Tennis” example is one of the most popular examples to explain decision trees.

In the data set, the label is represented by “PlayTennis”. The features are the rest of the columns: “Outlook”, “Temperature”, “Humidity”, and “Wind”. Our goal here is to predict, based on some weather conditions, whether a player can play tennis or not.

A decision tree is built by making decisions regarding the following items:

  • Which feature to choose as the root node
  • What conditions to use for splitting
  • When to stop splitting

Entropy and information gain are used to construct a decision tree:

  • Entropy: It is the measure of the amount of uncertainty and randomness in a set of data for the classification task.
  • Information gain: It is used for ranking the attributes or features to split at a given node in the tree.

`Information gain = (entropy of distribution before the split) – (entropy of distribution after the split)`

The Iterative Dichotomiser 3 (ID3) algorithm works by using entropy and information gain to construct a decision tree. Entropy is the measure of the amount of uncertainty and randomness in a set of data for the classification task. Entropy is maximized when all classes are equally likely. If entropy is minimal, it means that the attribute or feature appears close to one class and has a good discriminatory power for classification. An entropy of zero means that there is no randomness for this attribute.

Information gain is a metric that is used for ranking the attributes or features to split at a given node in the tree. It defines how much information a feature provides about a class. The feature with the highest information gain is used for the first split.
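Entropy and information gain can be sketched as below. The class counts are an assumption: they follow the commonly cited version of the “Play Tennis” data set (9 “yes” and 5 “no” days, with “Outlook” splitting them into sunny 2/3, overcast 4/0, and rain 3/2), which is not reproduced in this document. The entropy after the split is taken as the size-weighted average over the branches.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(parent_counts, child_counts_list):
    """Entropy before the split minus the weighted entropy after it."""
    total = sum(parent_counts)
    weighted = sum(
        sum(child) / total * entropy(child) for child in child_counts_list
    )
    return entropy(parent_counts) - weighted


# Assumed counts in the form [yes, no]; see the note above.
parent = [9, 5]
outlook_split = [[2, 3], [4, 0], [3, 2]]  # sunny, overcast, rain
gain_outlook = information_gain(parent, outlook_split)
print(round(entropy(parent), 3), round(gain_outlook, 3))
```

The overcast branch contains a single class, so its entropy is zero and it needs no further splitting.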

K-Means Clustering

  • Unsupervised machine learning algorithm.
  • It groups a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than those in other groups (other clusters).

K-means clustering is an unsupervised machine learning technique. The main goal of the algorithm is to group the data observations into k clusters, where each observation belongs to the cluster with the nearest mean.

A cluster’s center is the centroid. The figure plots the partition of a data set into five clusters, with the cluster centroids shown as crosses.
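The assign-then-update loop behind k-means can be sketched in plain Python. The sample points are made up for illustration, and the centroids are initialized naively from the first k points, which is an assumption of this sketch (practical implementations usually pick starting centroids more carefully).

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iterations=100):
    centroids = [points[i] for i in range(k)]  # naive initialization
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centroids.append(mean_point(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical 2-D observations forming two well-separated groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```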

Examples of applications include:

  • Customer segmentation: Imagine that you are the owner of an electronics store. You want to understand the preferences of your clients to expand your business. It is not possible to look at each client’s purchase details to find a good marketing strategy. But you can group the clients into, for example, five groups based on their purchasing habits. Then, you start building your marketing strategy for each group.
  • Image segmentation and compression: The process of partitioning a digital image into multiple segments (sets of pixels) to simplify and change the representation of an image into something that is more meaningful and easier to analyze. To achieve this task, we need a process that assigns a label to every pixel in an image such that pixels with the same label share certain features. The image in this slide is segmented and compressed into three regions by using k-means clustering. A smaller number of clusters provides more image compression, but at the expense of image quality.
  • Recommendation systems: These systems help you find users with the same preferences to build better recommendation systems.

Neural Networks

  • Neural networks, also known as artificial neural networks (ANNs) or simulated neural networks (SNNs), are a subset of machine learning and are at the heart of deep learning algorithms.
  • Their name and structure are inspired by the human brain, mimicking the way that biological neurons signal to one another.
  • Artificial neural networks (ANNs) are composed of node layers: an input layer, one or more hidden layers, and an output layer.
  • Each node, or artificial neuron, connects to another and has an associated weight and threshold.
  • If the output of any individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network. Otherwise, no data is passed along to the next layer of the network.

![[Pasted image 20230606212910.png]]

Perceptron

![[Pasted image 20230623190158.png]]

  • A perceptron is a single neuron model that was an originator for neural networks.
  • It is similar to a linear model (linear classification or linear regression), where each input has a weight.
  • Each neuron has its own bias and slope (weights).

For example, assume that a neuron has two inputs (X1 and X2), so it requires three weights (P1, P2, and P0). The figure in this slide shows a weight for each input and one for the bias.

How do neural networks work?

Each individual node can be thought of as its own linear regression model, composed of input data, weights, a bias (or threshold), and an output. The formula would look something like this:

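$$\sum_{i} w_i x_i + \text{bias} = w_1 x_1 + w_2 x_2 + w_3 x_3 + \text{bias}$$

$$\text{output} = f(x) = \begin{cases} 1 & \text{if } \sum_{i} w_i x_i + b \geq 0 \\ 0 & \text{if } \sum_{i} w_i x_i + b < 0 \end{cases}$$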

  • Once an input layer is determined, weights are assigned. These weights help determine the importance of any given variable, with larger ones contributing more significantly to the output compared to other inputs.
  • All inputs are then multiplied by their respective weights and then summed.
  • Afterward, the output is passed through an activation function, which determines the output.
  • If that output exceeds a given threshold, it “fires” (or activates) the node, passing data to the next layer in the network.
  • This results in the output of one node becoming the input of the next node. This process of passing data from one layer to the next layer defines this neural network as a feedforward network.

We can apply this concept to a more tangible example, like whether you should go surfing (Yes: 1, No: 0). The decision to go or not to go is our predicted outcome, or y-hat. Let’s assume that there are three factors influencing your decision-making:

  1. Are the waves good? (Yes: 1, No: 0)
  2. Is the line-up empty? (Yes: 1, No: 0)
  3. Has there been a recent shark attack? (Yes: 0, No: 1)

Then, let’s assume the following, giving us the following inputs:

  • X1 = 1, since the waves are pumping
  • X2 = 0, since the crowds are out
  • X3 = 1, since there hasn’t been a recent shark attack

Now, we need to assign some weights to determine importance. Larger weights signify that particular variables are of greater importance to the decision or outcome.

  • W1 = 5, since large swells don’t come around often
  • W2 = 2, since you’re used to the crowds
  • W3 = 4, since you have a fear of sharks

Finally, we’ll also assume a threshold value of 3, which would translate to a bias value of –3. With all the various inputs, we can start to plug in values into the formula to get the desired output.

Y-hat = (1 * 5) + (0 * 2) + (1 * 4) – 3 = 6

If we use the activation function from the beginning of this section, we can determine that the output of this node would be 1, since 6 is greater than 0. In this instance, you would go surfing; but if we adjust the weights or the threshold, we can achieve different outcomes from the model. When we observe one decision, like in the above example, we can see how a neural network could make increasingly complex decisions depending on the output of previous decisions or layers.
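The surfing decision can be reproduced with a small sketch of a single perceptron. The inputs, weights, and bias are the values from the example; the function names are hypothetical.

```python
def step(z):
    """Threshold activation: 1 if the weighted sum is at least 0, else 0."""
    return 1 if z >= 0 else 0


def perceptron(inputs, weights, bias):
    """Return the weighted sum plus bias and the resulting 0/1 decision."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return z, step(z)


# X1 = good waves, X2 = empty line-up, X3 = no recent shark attack.
weighted_sum, decision = perceptron(inputs=[1, 0, 1], weights=[5, 2, 4], bias=-3)
print(weighted_sum, decision)
```

Raising the threshold (a more negative bias) or lowering the weights changes the decision, which is what training adjusts automatically.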

In the example above, we used perceptrons to illustrate some of the mathematics at play here, but neural networks leverage sigmoid neurons, which are distinguished by having values between 0 and 1. Since neural networks behave similarly to decision trees, cascading data from one node to another, having x values between 0 and 1 will reduce the impact of any given change of a single variable on the output of any given node, and subsequently, the output of the neural network.

As we start to think about more practical use cases for neural networks, like image recognition or classification, we’ll leverage supervised learning, or labeled datasets, to train the algorithm. As we train the model, we’ll want to evaluate its accuracy using a cost (or loss) function. This is also commonly referred to as the mean squared error (MSE). In the equation below,

  • i represents the index of the sample,
  • y-hat is the predicted outcome,
  • y is the actual value, and
  • m is the number of samples.

$$\text{Cost Function} = \text{MSE} = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$$

Ultimately, the goal is to minimize our cost function to ensure correctness of fit for any given observation. As the model adjusts its weights and bias, it uses the cost function to reach the point of convergence, or the local minimum. The algorithm adjusts its weights through gradient descent, which allows the model to determine the direction to take to reduce errors (or minimize the cost function). With each training example, the parameters of the model adjust to gradually converge at the minimum.

Most deep neural networks are feedforward, meaning they flow in one direction only, from input to output. However, you can also train your model through backpropagation; that is, move in the opposite direction from output to input. Backpropagation allows us to calculate and attribute the error associated with each neuron, allowing us to adjust and fit the parameters of the model(s) appropriately.

Gradient Descent Diagram

Backpropagation

Backpropagation is an algorithm for training neural networks that have many layers. It works in two phases:

  • First phase: The propagation of inputs through a neural network to the final layer (called feedforward).
  • Second phase: The algorithm computes an error. An error value is calculated by using the wanted output and the actual output for each output neuron in the network. The error value is propagated backward through the weights of the network (adjusting the weights), beginning with the output neurons, through the hidden layer, and to the input layer (as a function of the contribution of the error).


Backpropagation continues to be an important aspect of neural network learning. With faster and cheaper computing resources, it continues to be applied to larger and denser networks.
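The two phases can be sketched on the smallest possible network: one input, one hidden sigmoid neuron, and one output sigmoid neuron. Everything specific here is an assumption of this sketch: the starting weights, the learning rate of 0.5, the single training example, and the squared-error loss. It is meant to show the forward pass and the backward flow of the error, not a production training loop.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(x, target, steps=1000, learning_rate=0.5):
    """Train a 1-1-1 network on one example and return the loss per step."""
    w1, b1, w2, b2 = 0.5, 0.0, -0.5, 0.0  # arbitrary starting weights
    losses = []
    for _ in range(steps):
        # First phase (feedforward): propagate the input to the output.
        h = sigmoid(w1 * x + b1)
        y_hat = sigmoid(w2 * h + b2)
        losses.append(0.5 * (y_hat - target) ** 2)

        # Second phase: compute the output error and propagate it backward.
        delta_out = (y_hat - target) * y_hat * (1.0 - y_hat)
        delta_hidden = delta_out * w2 * h * (1.0 - h)

        # Adjust each weight in proportion to its contribution to the error.
        w2 -= learning_rate * delta_out * h
        b2 -= learning_rate * delta_out
        w1 -= learning_rate * delta_hidden * x
        b1 -= learning_rate * delta_hidden
    return losses


losses = train(x=1.0, target=1.0)
print(losses[0], losses[-1])
```

The hidden-layer error is computed from the output error and the old value of `w2`, before any weight is updated, which mirrors the backward direction of the second phase.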