- Maximal Margin Classifier
- Support Vector Classifier
- Support Vector Machines
- Support Vector Regression
Q. Is support vector machine(SVM) a generalization of maximal margin classifier?
Answer
True. The maximal margin classifier is generalized to the support vector classifier (which allows margin violations), and the support vector classifier is in turn generalized to the support vector machine via kernels.
Q. What is a hyperplane?
Answer
In a $p$-dimensional space, a hyperplane is a flat affine subspace of dimension $p - 1$.
Q. Write generic expression of a p-dimensional hyperplane?
Answer
A $p$-dimensional hyperplane can be written as:

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p = 0$$
Q. How to determine whether a point or a vector lies on a hyperplane?
Answer
We can substitute the given point into the hyperplane equation: if the result is exactly zero, the point lies on the hyperplane; a positive or negative value tells us which side of the hyperplane the point lies on.
Q. How does maximal margin classifier work?
Answer
Maximal margin classifier is a linear classifier that attempts to separate two classes by finding the hyperplane that maximizes the margin between the classes.
It works as follows:
Suppose we are working on a binary classification problem with training data $\{(x_i, y_i)\}_{i=1}^{n}$, where:
- $x_i \in \mathbb{R}^p$ is a $p$-dimensional feature vector
- $y_i \in \{-1, +1\}$ is the class label
With the above setup, the maximal margin classifier tries to find a hyperplane, which can be represented as:

$$f(x) = w^T x + b$$

where:
- $w \in \mathbb{R}^p$ is the weight vector (normal to the hyperplane)
- $b \in \mathbb{R}$ is the offset of the hyperplane from the origin

The decision boundary is defined by:

$$w^T x + b = 0$$
Now, we want to maximize the margin, which is the distance between the decision boundary (hyperplane) and the closest points in the dataset (the support vectors).
Assume the hyperplane is scaled so that the closest points (the support vectors) satisfy $|w^T x_i + b| = 1$; we can always rescale $w$ and $b$ to achieve this.

If we assume the data is perfectly separable:

$$y_i(w^T x_i + b) \geq 1 \quad \text{for all } i$$

Using the above two equations, the margin can be derived as:

$$\text{margin} = \frac{2}{\|w\|}$$

Note that here maximizing $\frac{2}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$.

So, the maximal margin classifier optimizes the margin subject to the constraint that all points are classified correctly:

$$\min_{w, b} \ \frac{1}{2}\|w\|^2$$

subject to:

$$y_i(w^T x_i + b) \geq 1, \quad i = 1, \dots, n$$
- Here we have used $\|w\|^2$ instead of $\|w\|$ to make the objective function convex and differentiable.
- The constraints ensure that all points are on the correct side of the margin.
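As a rough illustration (not part of the original derivation), the hard-margin solution can be approximated with scikit-learn's `SVC` by using a linear kernel and a very large `C`; the toy data below is assumed to be linearly separable:

```python
# Minimal sketch: approximate the maximal margin (hard-margin) classifier
# with scikit-learn's SVC by making C very large. Toy data is hypothetical.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# C -> infinity approximates the hard-margin classifier
clf = SVC(kernel="linear", C=1e10)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]   # hyperplane parameters: w^T x + b = 0
margin = 2.0 / np.linalg.norm(w)         # geometric margin 2 / ||w||
print("w:", w, "b:", b, "margin:", margin)
print("support vectors:\n", clf.support_vectors_)
```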
Q. What is the main limitation of maximal margin classifier?
Answer
Maximal margin classifier has the following limitations:
- It works well only when the data is perfectly linearly separable; for non-linearly separable data it fails to find a separating hyperplane at all.
- It is highly sensitive to outliers that are close to the decision boundary or on the wrong side of it.
- The maximal margin hyperplane can change dramatically with a change in a single observation, which suggests that it may overfit the training data.
Q. How can we overcome limitation of maximal margin classifier?
Answer
We can use a soft margin, i.e., a hyperplane that almost separates the classes. This is also called the support vector classifier.
Q. What is difference between Maximal margin classifier(MMC) and Support vector classifier(SVC)?
Answer
MMC seeks the largest possible margin such that every observation is on the correct side of the hyperplane as well as on the correct side of the margin.
The support vector classifier, sometimes called a soft margin classifier, instead allows some observations to be on the incorrect side of the margin, or even the incorrect side of the hyperplane (the margin is soft because it can be violated by some training observations).
Q. How does the Support vector classifier (SVC) work?
Answer
The support vector classifier classifies a test observation depending on which side of the hyperplane it lies on. The hyperplane is chosen to correctly separate most of the training observations into the two classes, but it may misclassify a few observations.
It is the solution to the optimization problem:

$$\max_{\beta_0, \beta_1, \dots, \beta_p,\ \epsilon_1, \dots, \epsilon_n,\ M} M$$

subject to:

$$\sum_{j=1}^{p} \beta_j^2 = 1$$

$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}) \geq M(1 - \epsilon_i) \quad \text{for all } i$$

$$\epsilon_i \geq 0, \quad \sum_{i=1}^{n} \epsilon_i \leq C$$

where $C$ is a non-negative tuning parameter, $M$ is the width of the margin, and $\epsilon_1, \dots, \epsilon_n$ are slack variables that allow individual observations to be on the wrong side of the margin or of the hyperplane.

Once we have the hyperplane after solving the above optimization problem, we classify a test observation $x^*$ based on the sign of

$$f(x^*) = \beta_0 + \beta_1 x_1^* + \dots + \beta_p x_p^*$$
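The following is a minimal soft-margin sketch, assuming scikit-learn's `SVC` with a linear kernel and synthetic data; it shows that a test observation is classified by the sign of the fitted decision function:

```python
# Minimal sketch: soft-margin support vector classifier on synthetic data.
# A test point is classified by the sign of f(x*) = beta_0 + beta^T x*.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2.5, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0)   # finite C allows some margin violations
clf.fit(X, y)

x_test = np.array([[1.0, 1.5]])
f_x = clf.decision_function(x_test)        # beta_0 + beta^T x*
print("f(x*):", f_x, "predicted class:", np.sign(f_x))
print("support vectors per class:", clf.n_support_)
```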
Q. What kind of information do we get from the slack variables $\epsilon_i$?
Answer
The slack variable $\epsilon_i$ tells us where the $i$th observation lies relative to the margin and the hyperplane:
- If $\epsilon_i = 0$: the $i$th observation is on the correct side of the margin
- If $\epsilon_i > 0$: the $i$th observation is on the wrong side of the margin
- If $\epsilon_i > 1$: the $i$th observation is on the wrong side of the hyperplane
Q. How do we interpret the regularization parameter $C$ in SVC?
Answer
We can think of $C$ as a budget for the amount that the margin can be violated by the $n$ observations. If $C = 0$, there is no budget for violations and we recover the maximal margin classifier (which only exists if the data are separable). As $C$ increases, we become more tolerant of margin violations and the margin widens; as $C$ decreases, the margin narrows.
Q. What are support vectors in SVC?
Answer
Observations that lie directly on the margin or on the wrong side of the margin for their class, are known as support vectors.
Q. Explain bias-variance tradeoff in SVC?
Answer
The regularization parameter $C$ (the budget for margin violations) controls the bias-variance tradeoff:
- C is small: a narrow margin that is rarely violated, giving low bias and high variance
- C is large: a wider margin that permits more violations, giving high bias and low variance
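A small sketch of this tradeoff, assuming scikit-learn's `SVC` on synthetic data. Note that scikit-learn's `C` acts roughly as the inverse of the budget parameter used above, so a large scikit-learn `C` corresponds to a narrow, rarely violated margin:

```python
# Minimal sketch: how the regularization strength changes the margin width
# and the number of support vectors. scikit-learn's C is (roughly) the
# inverse of the "budget" C described above.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1.2, (50, 2)), rng.normal(2, 1.2, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"sklearn C={C:>6}: margin width={margin:.3f}, "
          f"support vectors={clf.support_vectors_.shape[0]}")
```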
Q. Is SVC robust to outliers?
Answer
Mostly, yes. Since the decision boundary is influenced only by the support vectors, outliers that are far from the hyperplane (on the correct side) have little or no impact on the decision boundary.
Q. What is the main limitation of SVC?
Answer
It struggles with the cases having non-linear decision boundary.
Q. Can we use feature space enlarging technique to solve non linear decision boundary problem with SVC?
Answer
Yes, we can address the problem of non-linear boundaries between classes by enlarging the feature space using quadratic, cubic, or even higher-order polynomial functions of the predictors. The issue is that there are many possible ways to enlarge the feature space, so we can end up with a huge number of features, and the computation then becomes unmanageable.
Q. What is support vector machine(SVM)?
Answer
The support vector machine (SVM) is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels.
Q. What do you mean by kernel function in SVM?
Answer
A kernel function quantifies the similarity of two observations. Commonly used kernels include:
- Linear kernel

$$K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j}$$

- Polynomial kernel

$$K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\right)^d$$

where $d > 1$ is the degree of the polynomial.

- Radial kernel

$$K(x_i, x_{i'}) = \exp\left(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\right)$$

where $\gamma > 0$.
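As a quick sketch (hypothetical vectors, plain NumPy), the three kernels above can be computed for a single pair of observations as follows:

```python
# Minimal NumPy sketch of the three kernels above, evaluated for one pair
# of observations x_i, x_i' in R^p. Values of d and gamma are illustrative.
import numpy as np

def linear_kernel(x_i, x_ip):
    return np.dot(x_i, x_ip)                   # sum_j x_ij * x_i'j

def polynomial_kernel(x_i, x_ip, d=3):
    return (1.0 + np.dot(x_i, x_ip)) ** d      # (1 + sum_j x_ij x_i'j)^d

def radial_kernel(x_i, x_ip, gamma=0.5):
    return np.exp(-gamma * np.sum((x_i - x_ip) ** 2))

x_i = np.array([1.0, 2.0, 3.0])
x_ip = np.array([0.5, 1.0, -1.0])
print(linear_kernel(x_i, x_ip), polynomial_kernel(x_i, x_ip), radial_kernel(x_i, x_ip))
```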
Q. What is advantage of using kernel trick over simply enlarging feature space using functions of the original features?
Answer
The kernel trick is a more computationally efficient technique. We only need to compute the kernel $K(x_i, x_{i'})$ for the $\binom{n}{2}$ distinct pairs of training observations, without ever working explicitly in the enlarged feature space, which can be very high-dimensional or, as with the radial kernel, implicit and infinite-dimensional.
Q. What are major drawbacks of kernel based machines?
Answer
- Computational cost: in kernel machines the cost of evaluating the decision function is linear in the number of training examples, i.e. $O(n)$ time complexity.
- Generic kernels struggle to generalize well.
Q. Why is the kernel trick popular in SVM?
Answer
Although kernel machines are computationally expensive, SVMs are inherently efficient because they rely only on a subset of the training data, called support vectors, to define the decision boundary. The number of support vectors is typically much smaller than the total number of data points, which significantly reduces both training and prediction time, making the kernel trick particularly popular in SVM.
$$ f(x) = b + \sum_i \alpha_i k(x, x^{(i)}) $$
In SVMs, the vector of coefficients $\alpha$ is sparse, with non-zero values only for the support vectors and zero for the rest of the training points.
Q. Is SVM a linear model or non-linear model?
Answer
SVM can be linear or non-linear depending on the kernel we use. With a linear kernel the decision boundary is linear, while with non-linear kernels such as the radial basis function (RBF) or sigmoid kernel the decision boundary is non-linear.
Q. What is the kernel trick in Support Vector Machines (SVM), and how does it work?
Answer
The kernel trick in Support Vector Machines (SVM) is a technique used to handle non-linearly separable data. It allows SVM to find a hyperplane that can separate data points in higher-dimensional space without explicitly computing their coordinates in that space.
Instead of transforming the data explicitly:
- The SVM algorithm computes the kernel function directly between pairs of data points.
- This computation effectively simulates the inner products in the higher-dimensional space.
- The optimization problem remains linear in the original space, but the decision boundary can capture complex, non-linear relationships.
The linear function used by SVM can be rewritten as follows:
$$ f(x) = w^{T}x + b = b + \sum_{i}\alpha_{i}k(x, x^{(i)}) $$
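The sketch below, assuming scikit-learn's `SVC` with an RBF kernel on synthetic data, verifies this expansion by reconstructing the decision function from the fitted support vectors, dual coefficients (the $\alpha_i$, signed by the labels), and intercept $b$:

```python
# Minimal sketch: rebuild f(x) = b + sum_i alpha_i k(x, x_i) from a fitted
# SVC's support vectors, dual coefficients and intercept, and compare it
# with decision_function.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

x_new = np.array([1.0, 1.0])
k = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
f_manual = np.dot(clf.dual_coef_[0], k) + clf.intercept_[0]

# the two values agree (up to numerical precision)
print(f_manual, clf.decision_function([x_new])[0])
```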
Q. How can we set up SVMs to work with multi class classification problems?
Answer
We can use the following approaches:

One-versus-one classification

To classify data with SVMs when there are $K > 2$ classes, the one-versus-one (or all-pairs) approach constructs $\binom{K}{2}$ SVMs, each comparing a pair of classes. A test observation is classified by each of the $\binom{K}{2}$ classifiers, and it is assigned to the class to which it was most frequently assigned in these pairwise classifications.

One-Versus-All classification

In the one-versus-all (or one-versus-rest) approach for classifying with SVMs when there are $K > 2$ classes, we fit $K$ SVMs, each time comparing one of the $K$ classes to the remaining $K - 1$ classes. Let $\beta_{0k}, \beta_{1k}, \dots, \beta_{pk}$ denote the parameters that result from fitting an SVM comparing the $k$th class (coded as $+1$) to the others (coded as $-1$).

To classify a new observation $x^*$, we assign it to the class for which $\beta_{0k} + \beta_{1k}x_1^* + \beta_{2k}x_2^* + \dots + \beta_{pk}x_p^*$ is largest, as this amounts to the highest level of confidence that the observation belongs to class $k$.
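A minimal sketch of both strategies, assuming scikit-learn and its built-in iris dataset: plain `SVC` trains one-versus-one classifiers internally, while `OneVsRestClassifier` wraps it to train one-versus-all classifiers:

```python
# Minimal sketch: one-versus-one vs one-versus-all multi-class SVMs.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes

# SVC trains K(K-1)/2 pairwise classifiers internally (one-versus-one)
ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
print("OvO pairwise decision values:", ovo.decision_function(X[:1]).shape)  # (1, 3)

# OneVsRestClassifier trains K one-versus-all classifiers
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
print("OvR estimators (one per class):", len(ovr.estimators_))              # 3
```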
Q. Is the SVM unique in its use of kernels to enlarge the feature space to accommodate non-linear class boundaries?
Answer
No, we could use kernels with other classification methods as well, such as logistic regression. For historical reasons, however, the use of non-linear kernels is closely associated with SVMs.
Q. When should we use SVMs over logistic regression?
Answer
When the classes are well separated, SVMs tend to behave better than logistic regression; in more overlapping regimes, logistic regression is often preferred.
Q. What is the main benefit of using logistic regression over SVM?
Answer
Unlike logistic regression, SVM does not provide class probabilities; it only outputs a class identity/label.
Q. Given a true label $y$ and a predicted score $f(x)$, how do you calculate the hinge loss?
Answer
To calculate the hinge loss for the given true label $y \in \{-1, +1\}$ and predicted score $f(x)$, use the hinge loss formula:

$$L(y, f(x)) = \max(0,\ 1 - y \cdot f(x))$$

Substituting the given values of $y$ and $f(x)$ into this formula gives the hinge loss.
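A tiny sketch with illustrative, hypothetical values for $y$ and $f(x)$:

```python
# Minimal sketch: hinge loss for hypothetical (y, f(x)) pairs.
def hinge_loss(y_true, score):
    return max(0.0, 1.0 - y_true * score)

print(hinge_loss(-1, 0.8))   # 1 - (-1)(0.8) = 1.8
print(hinge_loss(+1, 2.5))   # correct and beyond the margin -> loss 0
```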
Q. SVM.
- What’s linear separation? Why is it desirable when we use SVM?
- How well would vanilla SVM work on this dataset?
- How well would vanilla SVM work on this dataset?
- How well would vanilla SVM work on this dataset?
Answer
Linear separation means the two classes can be perfectly separated by a hyperplane (a straight line in two dimensions). It is desirable for SVM because the vanilla (hard-margin) SVM only has a solution when such a separating hyperplane exists, and a large-margin linear separator tends to generalize well. How well vanilla SVM works on a given dataset therefore depends on how close to linearly separable that dataset is.
Q. What is the main difference between Support Vector Machines (SVM) and Support Vector Regression (SVR)?
Answer
SVM is designed for classification by finding the best decision boundary between classes, while SVR is tailored for regression, focusing on fitting data points within a defined margin of error.
SVM and SVR also differ in the loss function they want to minimize:
- SVM Loss Function (Hinge Loss)

$$\text{Loss} = \max\left(0,\ 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\right)$$

- SVR Loss Function (Epsilon-Insensitive Loss)

$$\text{Loss} = \begin{cases} 0 & \text{if } |y_i - (\mathbf{w} \cdot \mathbf{x}_i + b)| \leq \epsilon \\ |y_i - (\mathbf{w} \cdot \mathbf{x}_i + b)| - \epsilon & \text{otherwise} \end{cases}$$
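A short NumPy sketch of the epsilon-insensitive loss, using hypothetical targets and predictions:

```python
# Minimal sketch: epsilon-insensitive loss; deviations inside the epsilon
# tube incur zero loss. Targets and predictions are hypothetical.
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, eps=0.1):
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

y_true = np.array([1.00, 2.00, 3.00])
y_pred = np.array([1.05, 2.50, 2.80])
print(epsilon_insensitive_loss(y_true, y_pred))   # approximately [0, 0.4, 0.1]
```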
Q. What are the key components of SVR?
Answer
The key components of SVR are:
- ε-tube (ε-insensitive tube)
- Support vectors
- Hyperplane
Q. What is the role of the ε-tube in SVR?
Answer
The ε-tube (epsilon-tube) in Support Vector Regression (SVR) plays a crucial role in defining the margin of error around the regression line within which the model does not penalize errors.
Figure: ε-tube (epsilon-tube) in Support Vector Regression (SVR)
Role of the ε-Tube in SVR:
- Defines Tolerance Margin: The ε-tube sets a tolerance level within which predictions can deviate from the actual target values without incurring a penalty. Points that lie inside this tube are considered acceptable deviations, and no loss is calculated for these points.
- Controls Model Complexity: By adjusting the width of the ε-tube (controlled by the epsilon parameter), you can make the model more or less sensitive to deviations from the actual data points. A wider ε-tube allows for more flexibility and fewer support vectors, leading to a simpler model, while a narrower tube makes the model stricter and more sensitive to the data points.
- Balances Fit and Generalization: The ε-tube helps SVR strike a balance between fitting the data closely and maintaining good generalization. By ignoring small errors within the tube, SVR avoids overfitting to minor variations or noise in the data.
- Determines Support Vectors: Only the data points that lie outside the ε-tube become support vectors, which influence the final regression model. Points inside the tube do not affect the optimization of the regression line, allowing the model to focus only on significant deviations.
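A minimal sketch of these points, assuming scikit-learn's `SVR` on synthetic data: widening the ε-tube leaves more points inside the tube and therefore yields fewer support vectors:

```python
# Minimal sketch: widening the epsilon-tube reduces the number of support
# vectors in scikit-learn's SVR. Data is synthetic.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, (80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

for eps in [0.01, 0.1, 0.5]:
    model = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: support vectors={len(model.support_)}")
```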
Q. How do you tune the hyperparameters in SVR?
Answer
Hyperparameters in SVR include:
- Regularization parameter $C$
- Kernel type
- Kernel-specific parameters (e.g., degree for polynomial kernels, gamma for RBF kernels)
Tuning can be done using techniques like grid search or randomized search cross-validation to find the optimal combination of hyperparameters.
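For example, a minimal grid-search sketch with scikit-learn (synthetic data, an illustrative parameter grid) might look like:

```python
# Minimal sketch: tuning SVR hyperparameters with GridSearchCV on
# synthetic data; the parameter grid is illustrative.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (120, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.3, 120)

param_grid = {
    "C": [0.1, 1, 10],
    "epsilon": [0.05, 0.1, 0.5],
    "kernel": ["rbf", "poly"],
    "gamma": ["scale", 0.1, 1.0],
}
search = GridSearchCV(SVR(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```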
Q. What is the significance of the regularization parameter ($C$) in SVR?
Answer
The parameter $C$ controls the trade-off between keeping the regression function flat (simple) and the degree to which deviations larger than $\epsilon$ are penalized. A large $C$ penalizes points outside the ε-tube heavily, so the model fits the training data more closely and risks overfitting; a small $C$ tolerates more violations and yields a smoother, more regularized fit.
Q. How does SVR handle outliers in the data?
Answer
SVR handles outliers by using an ε-insensitive loss function, a soft margin approach, and a tunable regularization parameter, making it more robust to outliers compared to traditional regression methods. However, extreme outliers can still affect the model, so proper preprocessing, such as scaling and outlier removal, is recommended.
Q. What are the evaluation metrics used to assess the performance of SVR models?
Answer
Common evaluation metrics for SVR include:
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- R-squared ($R^2$), to measure the goodness of fit
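A small sketch computing these metrics with scikit-learn on hypothetical predictions:

```python
# Minimal sketch: common SVR evaluation metrics on hypothetical values.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.1, 7.8])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}, R^2={r2:.3f}")
```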
Q. What are the advantages and disadvantages of SVR compared to other regression techniques?
Answer
Advantages of using SVR:
- Robust to Outliers: SVR uses a margin of tolerance (epsilon-insensitive loss), making it less sensitive to outliers compared to methods like linear regression.
- Non-Linear Relationships: SVR can capture non-linear relationships using kernel tricks (e.g., RBF, polynomial kernels), making it more versatile than simple linear models.
- Generalization: SVR aims to find a hyperplane that maximizes the margin, often resulting in better generalization on unseen data compared to other methods that minimize error without considering margin.
Disadvantages of using SVR:
- Complexity and Computational Cost: SVR can be computationally expensive, especially with large datasets or complex kernels, unlike simpler models like linear or ridge regression.
- Not Scale Invariant: SVR is sensitive to the scale of input features, requiring careful feature scaling, unlike some tree-based methods that are more robust to unscaled data.
- Interpretability Issue: It is less interpretable because it doesn't provide clear coefficients for each feature.
Q. Can SVR be used for time series forecasting?
Answer
Yes, SVR can be applied to time series data by performing feature engineering on time/date attributes. However, it's not the best approach because SVR does not account for the temporal relationships in the data, which are crucial for time series forecasting.
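A minimal sketch of this feature-engineering approach, using a synthetic series and hypothetical lag features with scikit-learn's `SVR`:

```python
# Minimal sketch: one-step-ahead forecasting with SVR by turning a series
# into a supervised problem using lagged values as features. Series and
# lag count are hypothetical.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 200)) + rng.normal(0, 0.1, 200)

n_lags = 5
X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
y = series[n_lags:]

split = 150  # train on the first part, test on the rest (no shuffling)
model = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(X[:split], y[:split])
preds = model.predict(X[split:])
print("test MSE:", np.mean((preds - y[split:]) ** 2))
```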