Merge pull request #5 from Jonah-gr/main
some corrections
florian-huber authored Mar 6, 2024
2 parents 201b343 + cc7fa2c commit 16f242e
Showing 3 changed files with 37 additions and 37 deletions.
46 changes: 23 additions & 23 deletions notebooks/live_coding_09_machine_learning_algorithms.ipynb
@@ -16,7 +16,7 @@
"metadata": {},
"source": [
"## k-nearest neighbors (k-NN)\n",
"**$k$-nearest neighbors** is for very good reasons one of the most commonly known machine learning algorithms. It is relatively intuitive and simple, yet still powerful enough to find plenty of use cases even today (despite havine much fancier techniques on the market).\n",
"**$k$-nearest neighbors** is, for very good reasons, one of the most commonly known machine learning algorithms. It is relatively intuitive and simple, yet still powerful enough to find plenty of use cases even today (despite having much fancier techniques on the market).\n",
"\n",
"The algorithm works as follows ({numref}`fig_knn_algorithm`). For any given data point $x$, do the following:\n",
"- Search for the $k$ nearest neighbors within the known data.\n",
@@ -36,8 +36,8 @@
"### Pros, Cons, Caveats\n",
"Conceptually, the k-nearest neighbors algorithm is rather simple and intuitive. However, there are a few important aspects to consider when applying this algorithm.\n",
"\n",
"First of all, k-nearest kneighbors is a distance-based algorithm. This means that we have to ensure that closer really means \"more similar\" which is not as simple as it maybe sounds. We have to decide on a *distance metric* that is the measure (or function) by which we calculate the distance between data points. We can use common metrics like the Euclidean distance, but there are many different options to choose from.\n",
"Even more critical is the proper *scaling* of our features. Just think of an example. We want to predict the shoe size of a person from the person's height (measured in $m$) and weight (measured in $kg$). This means that we have two features here, height and weight. For a prediction on a new person we simply need his/her height and weight. Then k-NN will compare those values to all known (\"learned\") data points in our model and find the closest $k$ other people. If we now use the Euclidean distance, the distance $d$ will simply be\n",
"First of all, k-nearest kneighbors is a distance-based algorithm. This means that we have to ensure that closer really means \"more similar\" which is not as simple as it may sound. We have to decide on a *distance metric*, that is the measure (or function) by which we calculate the distance between data points. We can use common metrics like the Euclidean distance, but there are many different options to choose from.\n",
"Even more critical is the proper *scaling* of our features. Just think of an example. We want to predict the shoe size of a person based on the person's height (measured in $m$) and weight (measured in $kg$). This means that we have two features here: height and weight. For a prediction on a new person, we simply need his/her height and weight. Then k-NN will compare those values to all known (\"learned\") data points in our model and find the closest $k$ other people. If we now use the Euclidean distance, the distance $d$ will simply be\n",
"\n",
"$$\n",
" d = \\sqrt{(w_1 - w_2) ^ 2 + (h_1 - h_2) ^ 2}\n",
@@ -57,7 +57,7 @@
"Ok. The issue here is, that the weights are in kilograms ($kg$), so we are talking about values like 50, 60, 80, 100. The height, however, is measured in meters ($m$) such that values are many times smaller. As a result, having two people differ one meter in height (which is a lot) will count no more than one kilogram difference (which is close to nothing). Clearly not what we intuitively mean by \"nearest neighbors\"!\n",
"\n",
"The solution to this is a proper **scaling** of our data. Often, we will simply apply one of the following two scaling methods:\n",
"1. MinMax Scaling - this means we linearly rescale our data such that the lowest occuring value becomes 0 and the highest value becomes 1.\n",
"1. MinMax Scaling - this means we linearly rescale our data such that the lowest occurring value becomes 0 and the highest value becomes 1.\n",
"2. Standard Scaling - here we rescale our data such that the mean value will be 0 and the standard deviation will be 1.\n",
"\n",
"Both methods might give you values that look awkward at first. Standard scaling, for instance, gives both positive and negative values so that our height values in the example could be -1.04 or +0.27. But don't worry, the scaling is really only meant to be used for the machine learning algorithm itself."
@@ -73,20 +73,20 @@
"But there are still some questions we need to consider.\n",
"\n",
"The obvious one is: What should we use as $k$? \n",
"This is the model's main parameter and we are free to choose any value we like. And there is no simple best choice that always work. In practice the choice of $k$ will depend on the number of data points we have, but also the distribution of data and the number of classes or parameter ranges. We usually want to pick odd values here to avoid draws as much as possible (imagine two nearest neighbors are \"spam\" and two are \"no-spam\"). But whether 3, 5, 7, or 13 is the best choice will depend on our specific task at hand. \n",
"This is the model's main parameter and we are free to choose any value we like. And there is no simple best choice that always works. In practice, the choice of $k$ will depend on the number of data points we have, but also on the distribution of data and the number of classes or parameter ranges. We usually want to pick odd values here to avoid draws as much as possible (imagine two nearest neighbors are \"spam\" and two are \"no-spam\"). But whether 3, 5, 7, or 13 is the best choice will depend on our specific task at hand. \n",
"\n",
"\n",
"In machine learning we call such a thing a **fitting parameter**. This means that we are free to change its value and it might have a considerable impact on the quality of our predictions, or our \"model performance\". Ideally we would compare several different models with different parameters and pick the one that performs best.\n",
"In machine learning, we call such a thing a **fitting parameter**. This means that we are free to change its value, and it might have a considerable impact on the quality of our predictions, or our \"model performance\". Ideally, we would compare several different models with different parameters and pick the one that performed best.\n",
"\n",
"Let's consider a situation as in {numref}`fig_knn_caveats`A. Here we see that a change in $k$ can lead to entirely different predictions for certain data points. In general, kNN predictions can be highly unstable close to border regions, and they also tend to be highly sensitive to the local density of data points. The later can be a problem if we have far more points of one category than for another.\n",
"Let's consider a situation as in {numref}`fig_knn_caveats`A. Here we see that a change in $k$ can lead to entirely different predictions for certain data points. In general, kNN predictions can be highly unstable close to border regions, and they also tend to be highly sensitive to the local density of data points. The later can be a problem if we have far more points in one category than in another.\n",
"\n",
"```{figure} ../images/fig_knn_caveats.png\n",
":name: fig_knn_caveats\n",
"\n",
"k-nearest neighbors has a few important caveats. **A** its predictions can change with changing $k$, and generally are very density sensitive. **B** it suffers (as many machine learning models) from overconfidence, which simply means that it will confidently output predictions even for data points that are entirely different from the training data (or even physically impossible).\n",
"```\n",
"\n",
"Finally, another common problem with kNN -but also many other models- is called **over-confidence** ({numref}`fig_knn_caveats`B). The algorithm described here creates its predictions on the $k$ closest neighbors. But for very unusual inputs or even entirely impossible inputs, the algorithm will still find $k$ closest neighbors and make a prediction. So if you ask for the shoe size of a person of 6.20m and 840 kg your model might confidently answer your question and say: 48 (if nothing bigger occurred in the data). So much for the \"intelligent\" in *artificial intelligence* ..."
"Finally, another common problem with kNN -but also many other models- is called **over-confidence** ({numref}`fig_knn_caveats`B). The algorithm described here creates its predictions on the $k$ closest neighbors. But for very unusual inputs or even entirely impossible inputs, the algorithm will still find $k$ closest neighbors and make a prediction. So if you ask for the shoe size of a person of 6.20m and 840 kg, your model might confidently answer your question and say: 48 (if nothing bigger occurred in the data). So much for the \"intelligent\" in *artificial intelligence* ..."
]
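To make the neighbor-voting idea above concrete, here is a minimal from-scratch sketch of a single k-NN prediction on a toy dataset (illustrative numbers only; the notebook itself relies on Scikit-Learn for this later on):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]                        # indices of the k closest points
    votes = Counter(y_train[nearest])                          # count the labels of those neighbors
    return votes.most_common(1)[0][0]                          # label with the most votes

# Toy data: already-scaled (height, weight) pairs and shoe sizes as labels
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]])
y_train = np.array([38, 38, 46, 46])

print(knn_predict(X_train, y_train, np.array([0.85, 0.85]), k=3))  # -> 46
```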
},
{
@@ -103,7 +103,7 @@
"- Does not make impossible predictions (because it only takes values from the training data)\n",
"\n",
"**Cons** \n",
"- Predictions are sensitive to local density of data points and the choice of $k$\n",
"- Predictions are sensitive to the local density of data points and the choice of $k$\n",
"- Can suffer from over-confidence.\n",
"- Does not scale well for very large datasets (computing all distances can take very long)"
]
@@ -118,7 +118,7 @@
"To get a better sense of how machine learning is done in practice with Python, let us work on a simple example.\n",
"For this, we will use the [`Penguin Dataset`](https://allisonhorst.github.io/palmerpenguins/) {cite}`penguins` that consists of data from three different penguin species.\n",
"\n",
"The machine learning goal in this section will be to create models that can predict the species from the penguins body features.\n",
"The machine learning goal in this section will be to create models that can predict species based on penguin body features.\n",
"This means that the species information will later be our *label*.\n",
"\n",
"We start by importing and inspecting the dataset:"
@@ -532,7 +532,7 @@
"### Data Exploration\n",
"\n",
"We have already seen multiple ways to explore the data and the relationship between the different features in more detail.\n",
"Here, we will focus on the correlations and the actual plots between two features to get a better intuition.\n",
"Here, we will focus on the correlations and the actual plots between two features to get better intuition.\n",
"\n",
"First, we use `.corr(numeric_only=True)` to compute the correlation coefficients between all numeric features."
]
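A sketch of this exploration step, assuming the cleaned DataFrame `df` from the loading sketch above:

```python
import seaborn as sns

# Correlation coefficients between all numeric columns
print(df.corr(numeric_only=True))

# Pairwise scatter plots, colored by species, to judge how well the classes separate
sns.pairplot(df, hue="species")
```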
@@ -623,7 +623,7 @@
"source": [
"Look at this plot and ask yourself the question: Can we predict the species based on the shown body features?\n",
"\n",
"The answer here must be a clear \"yes\", because several plot panels show a clear visual distinction between the penguin species. If this is the case, we can ususally expect that this task is rather simple to learn for a machine learning model."
"The answer here must be a clear \"yes,\" because several plot panels show a clear visual distinction between the penguin species. If this is the case, we can usually expect that this task is rather simple to learn for a machine learning model."
]
},
{
@@ -633,10 +633,10 @@
"source": [
"### Prepare Data for Training\n",
"\n",
"We have inspected, cleaned, explored the data. But it is not yet ready to be used for training a machine learning model.\n",
"The precise process will depend on our goal (here: predicting species), the data, but also the choice of our model.\n",
"We have inspected, cleaned and explored the data. But it is not yet ready to be used for training a machine learning model.\n",
"The precise process will depend on our goal (here: predicting species), the data, and also the choice of our model.\n",
"\n",
"First, we will split the labels from the data. But we should also remove features which rather belong to the labels than to the data. In the present case this would be the column `island` which happens to correlate perfectly with the `species`."
"First, we will split the labels from the data. But we should also remove features that belong more to the labels than to the data. In the present case, this would be the column `island` which happens to correlate perfectly with the `species`."
]
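A possible sketch of this step, again assuming the `df` from above with the seaborn/palmerpenguins column names:

```python
# Labels: the species column is what we want to predict
y = df["species"]

# Data: drop the label column and the island column discussed above
X = df.drop(columns=["species", "island"])

print(X.columns.tolist())
```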
},
{
@@ -756,7 +756,7 @@
"\n",
"Many machine learning models, including k-NN, require numerical inputs. In the case of k-NN this is needed to compute the distances between data points. In our present data, however, we have the column `sex` that contains strings.\n",
"\n",
"We now can choose to either remove the column, or to convert it to numerical values. The later is not always possible, but if entries belong to a limited number of catogeries we can use `one-hot-encoding`. This is the conversion of categorical features to binary features."
"We can now choose to either remove the column or convert it to numerical values. The latter is not always possible, but if entries belong to a limited number of categories we can use `one-hot-encoding`. This is the conversion of categorical features to binary features."
]
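With pandas, such a conversion could look like the sketch below; `pd.get_dummies` turns the `sex` column into binary columns (the exact call used in the notebook may differ):

```python
import pandas as pd

# One-hot-encode the categorical 'sex' column into binary 0/1 columns
X = pd.get_dummies(X, columns=["sex"], dtype=float)

print(X.head())
```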
},
{
@@ -866,9 +866,9 @@
"### Train/Test split\n",
"\n",
"As mentioned before, this split is crucial for all supervised machine learning processes. It will later allow us to evaluate the model.\n",
"Two things are important for this split. We should shuffle the data since it might be ordered (for instance alphabetically or chronologically). But we should make 100% sure that we shuffle both our labels and our data in extactly the same way. If not, the order of our labely $y$ will not match the order of our data $X$ anymore, which is a disaster and a common cause for failing machine learning processes.\n",
"Two things are important for this split. We should shuffle the data, since it might be ordered (for instance, alphabetically or chronologically). But we should make 100% sure that we shuffle both our labels and our data in exactly the same way. If not, the order of our labels $y$ will not match the order of our data $X$ anymore, which is a disaster and a common cause of failing machine learning processes.\n",
"\n",
"This would be relatively easy to implement ourselves. But why reinventing the wheel? From this point on we can luckily rely on a very extensive machine learning library for Python: [`Scikit-Learn`](https://scikit-learn.org/stable/)."
"This would be relatively easy to implement ourselves. But why reinventing the wheel? From this point on, we can luckily rely on a very extensive machine learning library for Python: [`Scikit-Learn`](https://scikit-learn.org/stable/)."
]
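Scikit-Learn's `train_test_split` shuffles the data and keeps `X` and `y` aligned in one call; a sketch with an illustrative 25% test fraction:

```python
from sklearn.model_selection import train_test_split

# Shuffle and split data and labels together, so X and y stay aligned
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(X_train.shape, X_test.shape)
```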
},
{
@@ -913,10 +913,10 @@
"source": [
"### Data Scaling\n",
"\n",
"As discussed above, we need scaled data when we use k-NN models to make sure that each feature is considered equally. \n",
"As discussed above, we need scaled data when we use k-NN models to make sure that each feature is considered equally.  \n",
"This, again, is already implemented in Scikit-Learn.\n",
"\n",
"One important note here: The scaling is adjusted **only** based on the training data, but not on the test data. This is the only way that we can mimick a real-life situation of having known data (X_train) and unknown data (X_test)."
"One important note here: The scaling is adjusted **only** based on the training data, but not on the test data. This is the only way that we can mimic a real-life situation of having known data (X_train) and unknown data (X_test)."
]
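A sketch of this step using Scikit-Learn's `StandardScaler` (any of the scalers discussed above would be used the same way): the scaler is fitted on the training data only and then applied to both splits.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                         # learn mean and standard deviation from the training data only

X_train_scaled = scaler.transform(X_train)  # apply the identical scaling to both splits
X_test_scaled = scaler.transform(X_test)
```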
},
{
@@ -1040,7 +1040,7 @@
"source": [
"### Train a model (and make predictions)\n",
"\n",
"Now we finally get to use a k-nearest neighbors model! As most of the commonly used machine learning models, this is implemented in Scikit-Learn which makes this fairly easy to use for us, see also [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighborsclassifier#sklearn.neighbors.KNeighborsClassifier).\n",
"Now we finally get to use a k-nearest neighbors model! As with most of the commonly used machine learning models, this is implemented in Scikit-Learn which makes it fairly easy to use for us; see also [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighborsclassifier#sklearn.neighbors.KNeighborsClassifier).\n",
"\n",
"The key parameter ($k$) is here called `n_neighbors`."
]
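A minimal sketch of fitting the model and making predictions (the choice `n_neighbors=5` is only illustrative):

```python
from sklearn.neighbors import KNeighborsClassifier

# k is called n_neighbors in Scikit-Learn
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_scaled, y_train)

# Predict the species for the (unseen) test data
y_pred = model.predict(X_test_scaled)
print(y_pred[:10])
```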
@@ -1138,7 +1138,7 @@
"\n",
"We could of course manually compare the predictions we just generated with the actual true values (`y_test`), but I guess it is obvious that this is not the best way to do things. In particular, because we usually work with much larger datasets.\n",
"\n",
"One of the most common and best ways to assess if a classification model performes well is to compute a **confusion matrix**. \n",
"One of the most common and best ways to assess if a classification model performs well is to compute a **confusion matrix**. \n",
"This matrix will compare all predictions to all true values and make a summary."
]
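A sketch of this evaluation step with Scikit-Learn, where rows of the matrix correspond to the true species and columns to the predicted species:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Compare all predictions to the true test labels
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
print(cm)

# The same matrix as a plot
ConfusionMatrixDisplay(cm, display_labels=model.classes_).plot()
plt.show()
```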
},
@@ -1209,7 +1209,7 @@
"id": "0c95a37b-6c3f-4cdb-9515-3d5c5cfa2227",
"metadata": {},
"source": [
"What you should see is a confusion matrix that represents perfect (or near perfect) predictions. In many real world examples our confusion matrix will not look like this, but also show all types of misclassifications."
"What you should see is a confusion matrix that represents perfect (or near-perfect) predictions. In many real-world examples, our confusion matrix will not look like this, but also show all types of misclassifications."
]
},
{
