Basics of machine learning
- Link to Resources - https://docs.google.com/spreadsheets
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
Basic machine learning approaches:
- Supervised learning
- Unsupervised learning
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. We'll be covering a few topics under supervised learning on this page.
Regression analysis is a predictive modelling technique that estimates the relationship between a dependent variable and one or more independent variables.
Some of the major types of regression include:
Linear regression helps in finding the relationship between one or more features (independent variables) and a continuous target variable (dependent variable). It can be represented by the equation Y = a + bX, where Y is the dependent variable, X is the explanatory variable, b is the slope of the line, and a is the intercept (the value of Y when X is 0).
- Advantages: Extremely simple implementation.
- Disadvantages: It can only model relationships between the dependent and independent variables that are linear.
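As a quick illustration, here is a minimal sketch of fitting the equation Y = a + bX with scikit-learn; the data values are made up purely for demonstration.

```python
# A minimal sketch of simple linear regression (Y = a + bX) with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: X is the explanatory variable, y is the continuous target.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])

model = LinearRegression()
model.fit(X, y)

print("intercept a:", model.intercept_)  # value of Y when X is 0
print("slope b:", model.coef_[0])
print("prediction for X = 6:", model.predict([[6.0]])[0])
```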
Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth degree polynomial in x.
- Advantages: It can fit a broad range of functions.
- Disadvantages: Presence of one or two outliers in the data can affect the results.
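A common way to sketch polynomial regression is to expand x into polynomial features and fit an ordinary linear model on them; the degree of 2 and the toy data below are illustrative assumptions.

```python
# A sketch of polynomial regression: expand x into polynomial features,
# then fit a linear model on those features.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data roughly following y = x^2 (values are illustrative only).
X = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0], [3.0]])
y = np.array([4.1, 0.9, 0.1, 1.2, 3.9, 9.2])

# degree=2 is an assumption; in practice the degree is chosen by validation.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

print("prediction for x = 4:", model.predict([[4.0]])[0])
```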
Logistic regression models the probabilities for classification problems with two possible outcomes. It’s an extension of the linear regression model for classification problems.
- Advantages: Its output (class probabilities) is more informative than that of many other classification algorithms.
- Disadvantages: Requires observations to be independent of one another.
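The sketch below shows logistic regression on a two-class toy dataset (the feature values and class labels are invented for illustration); the probability output is what makes it more informative than a bare class label.

```python
# A sketch of binary classification with logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one feature, two classes (0 and 1); values are illustrative.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns a probability for each class.
print("P(class = 1 | x = 2.4):", model.predict_proba([[2.4]])[0][1])
print("predicted class for x = 2.4:", model.predict([[2.4]])[0])
```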
A decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
- Advantages: Nonlinear relationships between parameters do not affect tree performance.
- Disadvantages: Prone to overfitting (the model learns the noise in the training data as well as the signal).
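Here is a minimal sketch of a decision tree classifier on scikit-learn's built-in iris dataset; the max_depth value is an illustrative choice to limit overfitting, not a recommended setting.

```python
# A sketch of a decision tree classifier on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth=3 is an illustrative choice to keep the tree from overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
```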
More information on regression concepts can be found here:
- https://www.youtube.com/watch?v=zPG4NjIkCjc
- https://www.youtube.com/watch?v=Qnt2vBRW8Io
- https://www.youtube.com/watch?v=7qJ7GksOXoA
- https://www.youtube.com/watch?v=DCZ3tsQIoGU
Classification is a technique where we categorize data into a given number of classes; it can be performed on structured or unstructured data. The main goal of a classification problem is to identify the category/class into which new data will fall.
Some major types of classification include:
K-nearest neighbours (KNN) can be used for both classification and regression predictive problems. However, it is more widely used for classification problems in industry.
- Advantages: Makes no assumptions about the underlying data distribution.
- Disadvantages: Choosing the optimal number of neighbours (K) is not straightforward.
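A minimal sketch of KNN classification follows; K = 5 is an arbitrary assumption here and would normally be tuned, for example with cross-validation.

```python
# A sketch of K-nearest neighbours classification on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_neighbors=5 is an arbitrary illustrative choice for K.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("test accuracy:", knn.score(X_test, y_test))
```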
A random forest classifier creates a set of decision trees from randomly selected subsets of the training set. It then aggregates the votes from the different decision trees to decide the final class of the test object. Basic parameters of a random forest classifier include the total number of trees to be generated and decision-tree parameters such as the minimum split, split criteria, etc.
- Advantages: Flexible and high accuracy.
- Disadvantages: Much harder and more time-consuming to construct than single decision trees.
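The sketch below shows a random forest classifier; n_estimators (the total number of trees) and max_depth stand in for the basic parameters mentioned above, and the specific values are illustrative only.

```python
# A sketch of a random forest classifier on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is the total number of trees; max_depth is a per-tree parameter.
# Both values are illustrative, not tuned.
forest = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
```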
More information on classification concepts can be found here:
Clustering is the task of dividing the population or data points into groups such that data points in the same group are more similar to each other than to data points in other groups. In essence, it is a grouping of objects on the basis of their similarity and dissimilarity.
Some major types of clustering are:
The goal of the K-Means algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features provided; data points are clustered based on feature similarity.
- Advantages: K-Means is usually computationally faster than other clustering methods.
- Disadvantages: It is difficult to predict the number of clusters (the K value).
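Here is a minimal sketch of K-Means with K = 3 on synthetic data; the choice of K and the generated blobs are assumptions made for illustration.

```python
# A sketch of K-Means clustering with K = 3 on synthetic blob data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate illustrative 2-D data with three natural groups.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# n_clusters=3 is the assumed K; in practice K must be chosen, e.g. by the elbow method.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("cluster centers:\n", kmeans.cluster_centers_)
print("first ten labels:", labels[:10])
```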
Like K-means clustering, hierarchical clustering also groups together the data points with similar characteristics. In some cases the result of hierarchical and K-Means clustering can be similar.
- Advantages: Easier to decide on the number of clusters.
- Disadvantages: Once the instances have been assigned to a cluster, they can no longer be moved around.
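The sketch below uses agglomerative (bottom-up) hierarchical clustering on the same kind of synthetic data; the number of clusters and the ward linkage are illustrative choices.

```python
# A sketch of agglomerative (bottom-up) hierarchical clustering.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Illustrative 2-D data with three natural groups.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# n_clusters and linkage are illustrative choices, not recommendations.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

print("first ten labels:", labels[:10])
```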
More information on clustering concepts can be found here: