diff --git a/_freeze/materials/mva-intro/execute-results/html.json b/_freeze/materials/mva-intro/execute-results/html.json index 535d26c..847359c 100644 --- a/_freeze/materials/mva-intro/execute-results/html.json +++ b/_freeze/materials/mva-intro/execute-results/html.json @@ -1,9 +1,11 @@ { - "hash": "099133a87e3ae138c8e615322592c643", + "hash": "9ee4b95d29be3864921e9dfe498be977", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Background\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n## Principal component analysis\n\n## K-means clustering\n\n## Hierarchical clustering\n", - "supporting": [], + "markdown": "---\ntitle: \"Background\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n## Principal component analysis\n\nThis is a statistical technique for reducing the dimensionality of a data set. The technique aims to find a new set of variables for describing the data. These new variables are made from a weighted sum of the old variables. The weighting is chosen so that the new variables can be ranked in terms of importance in that the first new variable is chosen to account for as much variation in the data as possible. Then the second new variable is chosen to account for as much of the remaining variation in the data as possible, and so on until there are as many new variables as old variables.\n\n## K-means clustering\n\nThis is a method for grouping observations into clusters. It groups data based on similarity and is an often-used unsupervised machine learning algorithm.\n\nIt groups the data into a fixed number of clusters ($k$) and the ultimate aim is to discover patterns in the data.\n\n## Hierarchical clustering\n\nA clustering method that tries to create a hierarchy across data, often displayed as a *dendrogram*. This visualises the way the clusters are arranged.\n", + "supporting": [ + "mva-intro_files" + ], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/materials/mva-pca/execute-results/html.json b/_freeze/materials/mva-pca/execute-results/html.json index 7218310..32cebac 100644 --- a/_freeze/materials/mva-pca/execute-results/html.json +++ b/_freeze/materials/mva-pca/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "0181868506a39bb97533519c7e9d4169", + "hash": "a9e3be3e3b15ff3f8c4b0802133dbb5d", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Principal component analysis\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n- Understand when PCAs can be useful\n- Be able to perform a PCA\n- Learn how to plot and interpret a screeplot\n- Plot and interpret the loadings for each PCA\n\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n## Python\n\n### Libraries\n### Functions\n:::\n:::\n\n## Purpose and aim\n\nThis is a statistical technique for reducing the dimensionality of a data set. The technique aims to find a new set of variables for describing the data. These new variables are made from a weighted sum of the old variables. The weighting is chosen so that the new variables can be ranked in terms of importance in that the first new variable is chosen to account for as much variation in the data as possible. 
Then the second new variable is chosen to account for as much of the remaining variation in the data as possible, and so on until there are as many new variables as old variables.\n\n\n## Data\n\nThe example in this section uses the following data set:\n\n`data/finches_hybridisation.csv`\n\nThese data come from an analysis of gene flow across two finch species [@grant2020]. They are slightly adapted here for illustrative purposes (the changes are documented in `materials/data_preparation.R` if you are interested).\n\nLong story short: these data are part of a hybridisation study on finches at the Galapagos Islands. Here, we've only included the non-hybrid observations, but feel free to rerun the analysis with all the hybrids!\n\nFrom many years of studies, going back to Darwin, we now know that the beak shape of the finches are important in their chances for survival. Changes in beak shape have led to new species and this study explored how movement of genes from a rare immigrant species (*Geospiza fuliginosa*) occurred through two other species (*G. fortis* and *G. scandens*). The researchers recorded various morphological traits.\n\n\nIt uses the following abbreviations:\n\n* `F`\t= *G. fortis*\n* `f`\t= *G. fuliginosa*\n* `S`\t= *G. scandens*\n\n## Load and visualise the data\n\nFirst, we load the data:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches_hybrid <- read_csv(\"data/finches_hybridisation.csv\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfinches_hybrid_py = pd.read_csv(\"data/finches_hybridisation.csv\")\n```\n:::\n\n\n:::\n\nNext, we visualise the data:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(finches_hybrid)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 7\n category weight wing tarsus blength bdepth bwidth\n \n1 fortis 15.8 67.1 19.6 10.3 8.95 8.32\n2 fortis 16 67 17.7 10.4 9.3 8.6 \n3 fortis 19.2 72.4 19.4 12.1 10.8 10.1 \n4 fortis 16.1 68 19.1 11.3 10 8.2 \n5 fortis 18.6 67.5 20.0 11.2 9.5 8.9 \n6 fortis 17.3 71.1 19.6 11.8 10.3 9.59\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfinches_hybrid_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n category weight wing tarsus blength bdepth bwidth\n0 fortis 15.84 67.13 19.55 10.30 8.95 8.32\n1 fortis 16.00 67.00 17.70 10.40 9.30 8.60\n2 fortis 19.25 72.38 19.38 12.10 10.85 10.13\n3 fortis 16.10 68.00 19.10 11.30 10.00 8.20\n4 fortis 18.60 67.50 19.95 11.15 9.50 8.90\n```\n\n\n:::\n:::\n\n:::\n\nWe have 7 columns. We're not going to recreate the published analysis exactly, but for the purpose of this section we will be looking if differences in the measured morphological traits can be attributed to specific categories.\n\nWe have 3 different `category` values:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches_hybrid %>% distinct(category)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 1\n category \n \n1 fortis \n2 fuliginosa\n3 scandens \n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfinches_hybrid_py['category'].unique()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\narray(['fortis', 'fuliginosa', 'scandens'], dtype=object)\n```\n\n\n:::\n:::\n\n\n:::\n\nWe have 6 continuous response variables, which we're not going to visualise independently! 
However, you *could* look at some of the potential relationships within the data.\n\nFor example, looking at body weight (`weight`) against beak length (`blength`):\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(finches_hybrid, aes(x = blength, y = weight,\n colour = category)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\nor just `weight` across the different categories:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(finches_hybrid, aes(x = category, y = weight)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(finches_hybrid_py,\n aes(x = \"blength\", y = \"weight\",\n colour = \"category\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-11-1.png){width=614}\n:::\n:::\n\n\nor just `weight` across the different categories:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(finches_hybrid_py,\n aes(x = \"category\", y = \"weight\")) +\n geom_boxplot())\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-12-3.png){width=614}\n:::\n:::\n\n\n\n:::\n\nThe number of combinations are rather large, given the number of variables we have. Hence it's a good idea to see if we can \"condense\" some of variables into new ones.\n\nWhat we're doing with a PCA is trying to capture as much *variance* that exists in the data using a Principal Component (PC). The PC therefore explains some of the variance for each of the individual variables.\n\nI like to compare the PCs to a smoothie: a smoothie might consist of 4 parts apple, 3 parts strawberry, 2 parts blueberry and 1 part grape. Along the lines of that delicious metaphor, one of our Principal components may consist of 4 parts `blength`, 3 parts `weight`, 2 parts `bdepth` and 1 part `wing`. We don't know this yet, so let's go find out.\n\n## Performing the PCA\n\nTo perform a PCA, we need to keep a few things in mind:\n\n1. We can only calculate principal components for numeric data\n2. The numeric data need to be on the same **scale**\n\nThis last point of scaling is very important. Measurements can take place at different scales. You shouldn't compare milimetres and kilogrammes directly, for example. Or seconds and height. That simply does not make sense.\n\nThis issue is even more prevalent when comparing gene expression data, for example. Genes can be active at varying levels, but some genes only need (an indeed *do*) change a tiny amount to elicit an effect, whereas other genes might have to change several fold-changes before something happens.\n\nTo compensate for this, we bring all of our data onto the same scale.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nIn R we can scale our data with the `scale()` function. We perform the PCA using the `prcomp()` function. Here we store the output into an object called `pca_fit`, because we'll be working with the output later on.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit <- finches_hybrid %>% \n # keep only the numeric columns\n select(where(is.numeric)) %>%\n # scale the data\n scale() %>%\n # perform the PCA\n prcomp()\n```\n:::\n\n\nThis is a bit of daunting looking output, but not to worry. 
We'll unpack things along the way!\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nStandard deviations (1, .., p=6):\n[1] 1.9815712 1.1261455 0.5329781 0.4847379 0.4170304 0.3349943\n\nRotation (n x k) = (6 x 6):\n PC1 PC2 PC3 PC4 PC5 PC6\nweight 0.4585415 0.1513150 0.04230031 -0.44900961 0.74497043 -0.09199756\nwing 0.4333116 0.1946944 -0.84096840 0.20843413 -0.12117514 0.09475803\ntarsus 0.4174676 0.3293946 0.46092626 0.70300481 0.08141532 0.06263195\nblength 0.4338157 0.2986440 0.26067134 -0.49214612 -0.64184633 -0.02217340\nbdepth 0.3370530 -0.6256540 0.10181858 -0.03331599 -0.02041369 0.69502424\nbwidth 0.3548249 -0.5916637 0.01460313 0.13195474 -0.10641317 -0.70362223\n```\n\n\n:::\n:::\n\n\n\n## Python\n\nIn Python we can scale our data with the `StandardScaler()` function from `sklearn.preprocessing`. We can only scale numerical data, so we need to get those first.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfrom sklearn.preprocessing import StandardScaler\n\n# select the numeric values\nX = finches_hybrid_py.select_dtypes(include = ['float64', 'int64'])\n\n# define the scaler\nstd_scaler = StandardScaler()\n\n# scale the numeric values\nfinches_scaled = std_scaler.fit_transform(X)\n```\n:::\n\n\nNow that we have the scaled values, we can perform the PCA. We do this using the `PCA()` function from `sklearn.decomposition`.\n\nWe need to tell it how many principal components we want it to return. We set it to 6 here, which is the number of numerical variables.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfrom sklearn.decomposition import PCA\n\n# define the number of principal components\nn_components = 6\n\n# set the number of principal components\npca = PCA(n_components = n_components)\n\n# perform the PCA\nprincipal_components = pca.fit_transform(finches_scaled)\n\n# create a data frame containing the information\n# changing the column names based on the PC\npca_fit_py = pd.DataFrame(data = principal_components, columns=[f'PC{i}' for i in range(1, n_components + 1)])\n```\n:::\n\n\n\n:::\n\n### Visualising the principal components\n\nWe can figure out how much each principal component is contributing to the amount of variance that is being explained. This is called a *screeplot*.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit %>% \n tidy(matrix = \"eigenvalues\") %>% \n ggplot(aes(x = PC,\n y = percent)) +\n geom_point() +\n geom_line()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nFirst, we extract the amount of variance explained by each principal component. 
Next, we convert this to a DataFrame:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nexplained_variance_pc = pca.explained_variance_ratio_\n\npercent = (\n pd.DataFrame({'variance_explained':\n (explained_variance_pc * 100),\n 'PC': [f'PC{i+1}' for i in range(n_components)]})\n )\n \npercent.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n variance_explained PC\n0 65.443738 PC1\n1 21.136728 PC2\n2 4.734428 PC3\n3 3.916180 PC4\n4 2.898573 PC5\n```\n\n\n:::\n:::\n\n\nNext, we can plot this:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(percent,\n aes(x = \"PC\", y = \"variance_explained\")) +\n geom_point() +\n geom_line(group = 1))\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-19-1.png){width=614}\n:::\n:::\n\n\n\n:::\n\nThis means that PC1 is able to explain around 65% of the variance in the data. PC2 is able to explain around 21% of the variance in the data, and so forth.\n\n### Loadings\n\nLet's think back to our smoothy metaphor. Remember how the smoothy was made up of various fruits - just like our PCs are made up of parts of our original variables.\n\nLet's, for the sake of illustrating this, assume the following for PC1:\n\n| parts| variable|\n|:- |:- |\n| 4 | `blength` |\n| 1 | `tarsus` |\n\nEach PC has something called an **eigenvector**, which in simplest terms is a line with a certain direction and length.\n\nIf we want to calculate the length of the eigenvector for PC1, we can employ Pythagoras (well, not directly, just his legacy). This gives:\n\n$eigenvector \\, PC1 = \\sqrt{4^2 + 1^2} = 4.12$\n\nThe **loading scores** for PC1 are the \"parts\" scaled for this length, _i.e._:\n\n| scaled parts| variable|\n|:- |:- |\n| 4 / 4.12 = 0.97 | `blength` |\n| 1 / 4.12 = 0.24 | `tarsus` |\n\nWhat we can do with these values is plot the **loadings** for each of the original variables.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nIt might be helpful to visualise this in context of the original data. We can easily add the original data to the fitted PCs as follows (and plot it):\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit %>%\n # add the original data\n augment(finches_hybrid) %>%\n ggplot(aes(.fittedPC1, .fittedPC2, colour = category)) + \n geom_point(size = 1.5)\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-20-3.png){width=672}\n:::\n:::\n\n\nThis gives us the individual contributions to PC1 and PC2 for each observation.\n\nIf we wanted to know how much each *variable* is contributing to PC1 and PC2 we would have to use the loadings.\n\nWe can extract all the loadings as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit %>% \n tidy(matrix = \"loadings\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 36 × 3\n column PC value\n \n 1 weight 1 0.459 \n 2 weight 2 0.151 \n 3 weight 3 0.0423\n 4 weight 4 -0.449 \n 5 weight 5 0.745 \n 6 weight 6 -0.0920\n 7 wing 1 0.433 \n 8 wing 2 0.195 \n 9 wing 3 -0.841 \n10 wing 4 0.208 \n# ℹ 26 more rows\n```\n\n\n:::\n:::\n\n\nWe'll have to reformat this a little, so that we have the values in separate columns. 
First, we rename some of the columns / values to make things more consistent and clearer:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_loadings <- pca_fit %>% \n tidy(matrix = \"loadings\") %>% \n rename(terms = column,\n component = PC) %>% \n mutate(component = paste0(\"PC\", component))\n\nhead(pca_loadings)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 3\n terms component value\n \n1 weight PC1 0.459 \n2 weight PC2 0.151 \n3 weight PC3 0.0423\n4 weight PC4 -0.449 \n5 weight PC5 0.745 \n6 weight PC6 -0.0920\n```\n\n\n:::\n:::\n\n\nNow we can reformat the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_loadings <- pca_loadings %>% \n pivot_wider(names_from = \"component\",\n values_from = \"value\")\n\npca_loadings\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 7\n terms PC1 PC2 PC3 PC4 PC5 PC6\n \n1 weight 0.459 0.151 0.0423 -0.449 0.745 -0.0920\n2 wing 0.433 0.195 -0.841 0.208 -0.121 0.0948\n3 tarsus 0.417 0.329 0.461 0.703 0.0814 0.0626\n4 blength 0.434 0.299 0.261 -0.492 -0.642 -0.0222\n5 bdepth 0.337 -0.626 0.102 -0.0333 -0.0204 0.695 \n6 bwidth 0.355 -0.592 0.0146 0.132 -0.106 -0.704 \n```\n\n\n:::\n:::\n\n\nWe can then plot this. This is a little bit involved, unfortunately. And not something I'd recommend remembering the code by heart, but we're doing the following:\n\n1. Take the PCA output and add the original data\n2. Plot this\n3. Create a line from the origin (`x = 0`, `y = 0`) for each loading\n4. Make it an arrow\n5. Add the variable name as a label\n\nWe define the arrow as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define arrow style\narrow_style <- arrow(length = unit(2, \"mm\"),\n type = \"closed\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit %>%\n # add the original data\n augment(finches_hybrid) %>%\n ggplot() + \n geom_point(aes(.fittedPC1, .fittedPC2, colour = category), size = 1.5) +\n geom_segment(data = pca_loadings,\n aes(xend = PC1, yend = PC2),\n x = 0, y = 0,\n arrow = arrow_style) +\n geom_text(data = pca_loadings,\n aes(x = PC1, y = PC2, label = terms), \n hjust = 0, \n vjust = 1,\n size = 5) \n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-25-1.png){width=672}\n:::\n:::\n\n\n\n## Python\n\n:::\n\nAfter all that we end up with a rather unclear plot. The 6 variables that contribute to each principal component have very overlapping contributions in PC1. As such, it's difficult to see which variable contributes what!\n\nThe reason why I'm still showing it here is that this kind of plot is very often used in PCA, so at least you can recognise it. If the variables have well-separated contributions across the two principal components, then it can be quite informative.\n\nA better way would be to plot the individual contributions to each principal component as an ordered bar plot. This does require some rejigging of the data again (sorry!).\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nFirst, we convert our loadings data back to a \"long\" format. 
We also add an extra column, `direction`, which indicates if the contribution to the principal component is positive or negative.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_loadings <- pca_loadings %>% \n pivot_longer(cols = -terms,\n names_to = \"component\",\n values_to = \"value\") %>% \n # add a column that indicates direction\n # (positive or negative)\n mutate(direction = ifelse(value < 0, \"positive\", \"negative\"))\n\n# have a look at the data\nhead(pca_loadings)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 4\n terms component value direction\n \n1 weight PC1 0.459 negative \n2 weight PC2 0.151 negative \n3 weight PC3 0.0423 negative \n4 weight PC4 -0.449 positive \n5 weight PC5 0.745 negative \n6 weight PC6 -0.0920 positive \n```\n\n\n:::\n:::\n\n\nWe can now visualise this. Here we are using some additional functionality offered by the `tidytext` library. Make sure to install it, if needed. Then load it.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# we need this library\nlibrary(tidytext)\n\npca_loadings %>% \n mutate(terms = tidytext::reorder_within(terms,\n abs(value),\n component)) %>%\n ggplot(aes(x = abs(value), y = terms, fill = direction)) +\n geom_col() +\n facet_wrap(vars(component), scales = \"free_y\") +\n tidytext::scale_y_reordered()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-27-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n:::\n\n:::{.callout-important}\nIt is important to keep the amount of variance explained by each principal component in mind. For example, PC3 only explains around 4.7 of the variance. So although several variables contribute substantially to PC3, the contribution of PC3 itself remains small.\n:::\n\n## Exercises\n\n### Penguins {#sec-exr_penguins}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nFor this exercise we'll be using the data from `data/penguins.csv`. These data are from the `palmerpenguins` package (for more information, see [the GitHub page](https://github.com/allisonhorst/palmerpenguins)).\n\nI would like you to do the following:\n\n1. Load and visualise the data.\n2. Create a screeplot and see how many PCs would be best.\n3. Calculate the loadings for PC1 and PC2 and visualise them.\n4. Any conclusions you might draw from the analysis.\n\n::: {.callout-answer collapse=\"true\"}\n\n#### Load and visualise the data\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\npenguins <- read_csv(\"data/penguins.csv\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 344 Columns: 8\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (3): species, island, sex\ndbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 8\n species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n \n1 Adelie Torgersen 39.1 18.7 181 3750\n2 Adelie Torgersen 39.5 17.4 186 3800\n3 Adelie Torgersen 40.3 18 195 3250\n4 Adelie Torgersen NA NA NA NA\n5 Adelie Torgersen 36.7 19.3 193 3450\n6 Adelie Torgersen 39.3 20.6 190 3650\n# ℹ 2 more variables: sex , year \n```\n\n\n:::\n:::\n\n\nWe can see that there are different kinds of variables, both categorical and numerical. 
Also, there appear to be some missing data in the data set, so we probably have to deal with that.\n\nLastly, we should be careful with the `year` column: it is recognised as a numerical column (because it contains numbers), but we should view it as a factor, since the years have an ordered, categorical meaning.\n\nTo get a better sense of our data we could plot all the numerical variables against each other, to see if there is any possible correlation between them. Doing that one-by-one in ggplot is tedious, so I'm using the `pairs()` function here. Pretty? No. Effective? Yes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npenguins %>% \n select(where(is.numeric)) %>% \n pairs()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-30-1.png){width=672}\n:::\n:::\n\n\n\n\n## Python\n\n:::\n\nSo we see that there is some possible grouping going on and possibly some linear relationships, too. However, there are multiple groups and closely related measurements, so it is not very surprising that there are possible relationships within the data.\n\n#### Perform the PCA\n\nFirst, we need to do a little bit of data tidying. We convert `year` to a factor and deal with the missing values. We're not dealing with them in a particularly subtle way here, removing all the rows that contain at least one missing value.\n\nIn your own research you may want to be more careful and only remove missing values from variables that you using in the PCA (here we include everything).\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\npenguins <- penguins %>% \n mutate(year = factor(year)) %>% \n drop_na() # remove all rows with missing values\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit <- penguins %>% \n # keep only the numeric columns\n select(where(is.numeric)) %>%\n # scale the data\n scale() %>%\n # perform the PCA\n prcomp()\n```\n:::\n\n\n## Python\n\n:::\n\n#### Visualise the PCs\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe can visualise the principal components:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit %>% \n tidy(matrix = \"eigenvalues\") %>% \n ggplot(aes(x = PC,\n y = percent)) +\n geom_point() +\n geom_line()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-33-1.png){width=672}\n:::\n:::\n\n\n\n## Python\n\n:::\n\nIt looks like using the first two principal components is probably capturing the majority of the variance in the data. 
Combined they account for 90% of the variance.\n\n#### Loadings\n\nNext, we get the loadings: how much is each original variable contributing to the individual principal components?\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe start with some data tidying and wrangling, since we need the data in a \"wide\" format for the next plot to work:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_loadings <- pca_fit %>% \n tidy(matrix = \"loadings\") %>% \n rename(terms = column,\n component = PC) %>% \n mutate(component = paste0(\"PC\", component)) %>% \n pivot_wider(names_from = \"component\",\n values_from = \"value\")\n\nhead(pca_loadings)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 4 × 5\n terms PC1 PC2 PC3 PC4\n \n1 bill_length_mm 0.454 -0.600 -0.642 0.145\n2 bill_depth_mm -0.399 -0.796 0.426 -0.160\n3 flipper_length_mm 0.577 -0.00579 0.236 -0.782\n4 body_mass_g 0.550 -0.0765 0.592 0.585\n```\n\n\n:::\n:::\n\n\nArrow style:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define arrow style\narrow_style <- arrow(length = unit(2, \"mm\"),\n type = \"closed\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit %>%\n # add the original data\n augment(penguins) %>%\n ggplot() + \n geom_point(aes(.fittedPC1, .fittedPC2, colour = species), size = 1.5) +\n geom_segment(data = pca_loadings,\n aes(xend = PC1, yend = PC2),\n x = 0, y = 0,\n arrow = arrow_style) +\n geom_text(data = pca_loadings,\n aes(x = PC1, y = PC2, label = terms), \n hjust = 0, \n vjust = 1,\n size = 5) \n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-36-1.png){width=672}\n:::\n:::\n\n\nWe can also visualise these contributions using a bar plot. We need the data in a slightly different format before we can do this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_loadings <- pca_loadings %>% \n pivot_longer(cols = -terms,\n names_to = \"component\",\n values_to = \"value\") %>% \n # add a column that indicates direction\n # (positive or negative)\n mutate(direction = ifelse(value < 0, \"positive\", \"negative\"))\n\n# have a look at the data\nhead(pca_loadings)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 4\n terms component value direction\n \n1 bill_length_mm PC1 0.454 negative \n2 bill_length_mm PC2 -0.600 positive \n3 bill_length_mm PC3 -0.642 positive \n4 bill_length_mm PC4 0.145 negative \n5 bill_depth_mm PC1 -0.399 positive \n6 bill_depth_mm PC2 -0.796 positive \n```\n\n\n:::\n:::\n\n\nBut after that, we can visualise it as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# we need this library\nlibrary(tidytext)\n\npca_loadings %>% \n mutate(terms = tidytext::reorder_within(terms,\n abs(value),\n component)) %>%\n ggplot(aes(x = abs(value), y = terms, fill = direction)) +\n geom_col() +\n facet_wrap(vars(component), scales = \"free_y\") +\n tidytext::scale_y_reordered()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-38-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n:::\n\n#### Conclusions\n1. Load the data.\n2. Create a screeplot and see how many PCs would be best.\n3. Calculate the loadings for PC1 and PC2 and visualise them.\n4. Any conclusions you might draw from the analysis.\n\nTaken together, we can conclude/comment on a few things:\n\n1. Endlessly looking at pairwise comparisons between continuous variables probably becomes a bit tedious. An alternative would be to calculate correlations between the variables to get some insight into your data. 
In the end it depends on how many variables you have and how much you (want to) know about them.\n2. In this case, I'd say that the first two principal components capture most of the variance in the data.\n3. The largest contributing variables mostly differ between the first two principal components. The variables that make up PC1 are very similar in terms of contribution, whereas two variables more or less make up PC2 entirely.\n\nThe variables `flipper_length_mm` and `body_mass_g` contribute pretty much only to PC1 (they are horizontal in the loadings plot), whereas `bill_length_mm` is contributing a reasonable amount to both PC1 *and* PC2.\n\nFrom the data itself we can see that there is clear separation between the three `species`. The Gentoo penguins separate further once again from the other two species.\n\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- PCA allows you to reduce a large number of variables into fewer principal components.\n- Each PC is made up of a combination of the original variables and captures as much of the variance within the data as possible.\n- The loadings tell you how much each original variable contributes to each PC.\n- A screeplot is a graphical representation of the amount of variance explained by each PC.\n:::\n", + "markdown": "---\ntitle: \"Principal component analysis\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n- Understand when PCAs can be useful\n- Be able to perform a PCA\n- Learn how to plot and interpret a screeplot\n- Plot and interpret the loadings for each PCA\n\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n## Python\n\n### Libraries\n### Functions\n:::\n:::\n\n## Purpose and aim\n\nThis is a statistical technique for reducing the dimensionality of a data set. The technique aims to find a new set of variables for describing the data. These new variables are made from a weighted sum of the old variables. The weighting is chosen so that the new variables can be ranked in terms of importance in that the first new variable is chosen to account for as much variation in the data as possible. Then the second new variable is chosen to account for as much of the remaining variation in the data as possible, and so on until there are as many new variables as old variables.\n\n\n## Data\n\nThe example in this section uses the following data set:\n\n`data/finches_hybridisation.csv`\n\nThese data come from an analysis of gene flow across two finch species [@grant2020]. They are slightly adapted here for illustrative purposes (the changes are documented in `materials/data_preparation.R` if you are interested).\n\nLong story short: these data are part of a hybridisation study on finches at the Galapagos Islands. Here, we've only included the non-hybrid observations, but feel free to rerun the analysis with all the hybrids!\n\nFrom many years of studies, going back to Darwin, we now know that the beak shape of the finches are important in their chances for survival. Changes in beak shape have led to new species and this study explored how movement of genes from a rare immigrant species (*Geospiza fuliginosa*) occurred through two other species (*G. fortis* and *G. scandens*). 
The researchers recorded various morphological traits.\n\n## Load and visualise the data\n\nFirst, we load the data:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches_hybrid <- read_csv(\"data/finches_hybridisation.csv\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfinches_hybrid_py = pd.read_csv(\"data/finches_hybridisation.csv\")\n```\n:::\n\n\n:::\n\nNext, we visualise the data:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(finches_hybrid)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 7\n category weight wing tarsus blength bdepth bwidth\n \n1 fortis 15.8 67.1 19.6 10.3 8.95 8.32\n2 fortis 16 67 17.7 10.4 9.3 8.6 \n3 fortis 19.2 72.4 19.4 12.1 10.8 10.1 \n4 fortis 16.1 68 19.1 11.3 10 8.2 \n5 fortis 18.6 67.5 20.0 11.2 9.5 8.9 \n6 fortis 17.3 71.1 19.6 11.8 10.3 9.59\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfinches_hybrid_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n category weight wing tarsus blength bdepth bwidth\n0 fortis 15.84 67.13 19.55 10.30 8.95 8.32\n1 fortis 16.00 67.00 17.70 10.40 9.30 8.60\n2 fortis 19.25 72.38 19.38 12.10 10.85 10.13\n3 fortis 16.10 68.00 19.10 11.30 10.00 8.20\n4 fortis 18.60 67.50 19.95 11.15 9.50 8.90\n```\n\n\n:::\n:::\n\n:::\n\nWe have 7 columns. We're not going to recreate the published analysis exactly, but for the purpose of this section we will be looking if differences in the measured morphological traits can be attributed to specific categories.\n\nWe have 3 different `category` values:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches_hybrid %>% distinct(category)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 1\n category \n \n1 fortis \n2 fuliginosa\n3 scandens \n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfinches_hybrid_py['category'].unique()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\narray(['fortis', 'fuliginosa', 'scandens'], dtype=object)\n```\n\n\n:::\n:::\n\n\n:::\n\nWe have 6 continuous response variables, which we're not going to visualise independently! 
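\n\n::: {.callout-note collapse=\"true\"}\n## Optional: a quick pairwise overview (added sketch)\n\nIf you would like a single quick look at all the pairwise relationships rather than plotting them one by one, a scatter matrix can help. This is a minimal sketch rather than part of the original analysis; it assumes the `finches_hybrid_py` data frame loaded above and that `pandas` (as `pd`) and `matplotlib` are available.\n\n```python\nimport matplotlib.pyplot as plt\n\n# scatter matrix of the six numeric traits,\n# dropping the categorical `category` column\npd.plotting.scatter_matrix(\n    finches_hybrid_py.drop(columns = \"category\"),\n    figsize = (8, 8))\n\nplt.show()\n```\n:::\n\n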
However, you *could* look at some of the potential relationships within the data.\n\nFor example, looking at body weight (`weight`) against beak length (`blength`):\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(finches_hybrid, aes(x = blength, y = weight,\n colour = category)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\nor just `weight` across the different categories:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(finches_hybrid, aes(x = category, y = weight)) +\n geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(finches_hybrid_py,\n aes(x = \"blength\", y = \"weight\",\n colour = \"category\")) +\n geom_point())\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-11-1.png){width=614}\n:::\n:::\n\n\nor just `weight` across the different categories:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(finches_hybrid_py,\n aes(x = \"category\", y = \"weight\")) +\n geom_boxplot())\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-12-3.png){width=614}\n:::\n:::\n\n\n\n:::\n\nThe number of combinations are rather large, given the number of variables we have. Hence it's a good idea to see if we can \"condense\" some of variables into new ones.\n\nWhat we're doing with a PCA is trying to capture as much *variance* that exists in the data using a Principal Component (PC). The PC therefore explains some of the variance for each of the individual variables.\n\nI like to compare the PCs to a smoothie: a smoothie might consist of 4 parts apple, 3 parts strawberry, 2 parts blueberry and 1 part grape. Along the lines of that delicious metaphor, one of our Principal components may consist of 4 parts `blength`, 3 parts `weight`, 2 parts `bdepth` and 1 part `wing`. We don't know this yet, so let's go find out.\n\n## Performing the PCA\n\nTo perform a PCA, we need to keep a few things in mind:\n\n1. We can only calculate principal components for numeric data\n2. The numeric data need to be on the same **scale**\n\nThis last point of scaling is very important. Measurements can take place at different scales. You shouldn't compare milimetres and kilogrammes directly, for example. Or seconds and height. That simply does not make sense.\n\nThis issue is even more prevalent when comparing gene expression data, for example. Genes can be active at varying levels, but some genes only need (an indeed *do*) change a tiny amount to elicit an effect, whereas other genes might have to change several fold-changes before something happens.\n\nTo compensate for this, we bring all of our data onto the same scale.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nIn R we can scale our data with the `scale()` function. We perform the PCA using the `prcomp()` function. Here we store the output into an object called `pca_fit`, because we'll be working with the output later on.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit <- finches_hybrid %>% \n # keep only the numeric columns\n select(where(is.numeric)) %>%\n # scale the data\n scale() %>%\n # perform the PCA\n prcomp()\n```\n:::\n\n\nThis is a bit of daunting looking output, but not to worry. 
We'll unpack things along the way!\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nStandard deviations (1, .., p=6):\n[1] 1.9815712 1.1261455 0.5329781 0.4847379 0.4170304 0.3349943\n\nRotation (n x k) = (6 x 6):\n PC1 PC2 PC3 PC4 PC5 PC6\nweight 0.4585415 0.1513150 0.04230031 -0.44900961 0.74497043 -0.09199756\nwing 0.4333116 0.1946944 -0.84096840 0.20843413 -0.12117514 0.09475803\ntarsus 0.4174676 0.3293946 0.46092626 0.70300481 0.08141532 0.06263195\nblength 0.4338157 0.2986440 0.26067134 -0.49214612 -0.64184633 -0.02217340\nbdepth 0.3370530 -0.6256540 0.10181858 -0.03331599 -0.02041369 0.69502424\nbwidth 0.3548249 -0.5916637 0.01460313 0.13195474 -0.10641317 -0.70362223\n```\n\n\n:::\n:::\n\n\n\n## Python\n\nIn Python we can scale our data with the `StandardScaler()` function from `sklearn.preprocessing`. We can only scale numerical data, so we need to get those first.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfrom sklearn.preprocessing import StandardScaler\n\n# select the numeric values\nX = finches_hybrid_py.select_dtypes(include = ['float64', 'int64'])\n\n# define the scaler\nstd_scaler = StandardScaler()\n\n# scale the numeric values\nfinches_scaled = std_scaler.fit_transform(X)\n```\n:::\n\n\nNow that we have the scaled values, we can perform the PCA. We do this using the `PCA()` function from `sklearn.decomposition`.\n\nWe need to tell it how many principal components we want it to return. We set it to 6 here, which is the number of numerical variables.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfrom sklearn.decomposition import PCA\n\n# define the number of principal components\nn_components = 6\n\n# set the number of principal components\npca = PCA(n_components = n_components)\n\n# perform the PCA\nprincipal_components = pca.fit_transform(finches_scaled)\n\n# create a data frame containing the information\n# changing the column names based on the PC\npca_fit_py = pd.DataFrame(data = principal_components, columns=[f'PC{i}' for i in range(1, n_components + 1)])\n```\n:::\n\n\n\n:::\n\n### Visualising the principal components\n\nWe can figure out how much each principal component is contributing to the amount of variance that is being explained. This is called a *screeplot*.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit %>% \n tidy(matrix = \"eigenvalues\") %>% \n ggplot(aes(x = PC,\n y = percent)) +\n geom_point() +\n geom_line()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n\n## Python\n\nFirst, we extract the amount of variance explained by each principal component. 
Next, we convert this to a DataFrame:\n\n\n::: {.cell}\n\n```{.python .cell-code}\nexplained_variance_pc = pca.explained_variance_ratio_\n\npercent = (\n pd.DataFrame({'variance_explained':\n (explained_variance_pc * 100),\n 'PC': [f'PC{i+1}' for i in range(n_components)]})\n )\n \npercent.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n variance_explained PC\n0 65.443738 PC1\n1 21.136728 PC2\n2 4.734428 PC3\n3 3.916180 PC4\n4 2.898573 PC5\n```\n\n\n:::\n:::\n\n\nNext, we can plot this:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(percent,\n aes(x = \"PC\", y = \"variance_explained\")) +\n geom_point() +\n geom_line(group = 1))\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-19-1.png){width=614}\n:::\n:::\n\n\n\n:::\n\nThis means that PC1 is able to explain around 65% of the variance in the data. PC2 is able to explain around 21% of the variance in the data, and so forth.\n\n### Loadings\n\nLet's think back to our smoothy metaphor. Remember how the smoothy was made up of various fruits - just like our PCs are made up of parts of our original variables.\n\nLet's, for the sake of illustrating this, assume the following for PC1:\n\n| parts| variable|\n|:- |:- |\n| 4 | `blength` |\n| 1 | `tarsus` |\n\nEach PC has something called an **eigenvector**, which in simplest terms is a line with a certain direction and length.\n\nIf we want to calculate the length of the eigenvector for PC1, we can employ Pythagoras (well, not directly, just his legacy). This gives:\n\n$eigenvector \\, PC1 = \\sqrt{4^2 + 1^2} = 4.12$\n\nThe **loading scores** for PC1 are the \"parts\" scaled for this length, _i.e._:\n\n| scaled parts| variable|\n|:- |:- |\n| 4 / 4.12 = 0.97 | `blength` |\n| 1 / 4.12 = 0.24 | `tarsus` |\n\nWhat we can do with these values is plot the **loadings** for each of the original variables.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nIt might be helpful to visualise this in context of the original data. We can easily add the original data to the fitted PCs as follows (and plot it):\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit %>%\n # add the original data\n augment(finches_hybrid) %>%\n ggplot(aes(.fittedPC1, .fittedPC2, colour = category)) + \n geom_point(size = 1.5)\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-20-3.png){width=672}\n:::\n:::\n\n\nThis gives us the individual contributions to PC1 and PC2 for each observation.\n\nIf we wanted to know how much each *variable* is contributing to PC1 and PC2 we would have to use the loadings.\n\nWe can extract all the loadings as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit %>% \n tidy(matrix = \"loadings\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 36 × 3\n column PC value\n \n 1 weight 1 0.459 \n 2 weight 2 0.151 \n 3 weight 3 0.0423\n 4 weight 4 -0.449 \n 5 weight 5 0.745 \n 6 weight 6 -0.0920\n 7 wing 1 0.433 \n 8 wing 2 0.195 \n 9 wing 3 -0.841 \n10 wing 4 0.208 \n# ℹ 26 more rows\n```\n\n\n:::\n:::\n\n\nWe'll have to reformat this a little, so that we have the values in separate columns. 
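\n\n::: {.callout-note collapse=\"true\"}\n## Optional: extracting the loadings in Python (added sketch)\n\nThe Python tab for the loadings is not filled in, so here is a minimal added sketch rather than the materials' own approach. The equivalent loadings sit in the `components_` attribute of the fitted `pca` object from earlier; this assumes the `pca`, `X` and `n_components` objects created in the Python chunks above. Note that the signs of individual loadings may be flipped relative to the R output, which is expected for PCA.\n\n```python\n# rows of pca.components_ correspond to the principal components,\n# so transpose to get one row per original variable\npca_loadings_py = pd.DataFrame(\n    pca.components_.T,\n    columns = [f'PC{i}' for i in range(1, n_components + 1)],\n    index = X.columns)\n\npca_loadings_py\n```\n:::\n\n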
First, we rename some of the columns / values to make things more consistent and clearer:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_loadings <- pca_fit %>% \n tidy(matrix = \"loadings\") %>% \n rename(terms = column,\n component = PC) %>% \n mutate(component = paste0(\"PC\", component))\n\nhead(pca_loadings)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 3\n terms component value\n \n1 weight PC1 0.459 \n2 weight PC2 0.151 \n3 weight PC3 0.0423\n4 weight PC4 -0.449 \n5 weight PC5 0.745 \n6 weight PC6 -0.0920\n```\n\n\n:::\n:::\n\n\nNow we can reformat the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_loadings <- pca_loadings %>% \n pivot_wider(names_from = \"component\",\n values_from = \"value\")\n\npca_loadings\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 7\n terms PC1 PC2 PC3 PC4 PC5 PC6\n \n1 weight 0.459 0.151 0.0423 -0.449 0.745 -0.0920\n2 wing 0.433 0.195 -0.841 0.208 -0.121 0.0948\n3 tarsus 0.417 0.329 0.461 0.703 0.0814 0.0626\n4 blength 0.434 0.299 0.261 -0.492 -0.642 -0.0222\n5 bdepth 0.337 -0.626 0.102 -0.0333 -0.0204 0.695 \n6 bwidth 0.355 -0.592 0.0146 0.132 -0.106 -0.704 \n```\n\n\n:::\n:::\n\n\nWe can then plot this. This is a little bit involved, unfortunately. And not something I'd recommend remembering the code by heart, but we're doing the following:\n\n1. Take the PCA output and add the original data\n2. Plot this\n3. Create a line from the origin (`x = 0`, `y = 0`) for each loading\n4. Make it an arrow\n5. Add the variable name as a label\n\nWe define the arrow as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define arrow style\narrow_style <- arrow(length = unit(2, \"mm\"),\n type = \"closed\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit %>%\n # add the original data\n augment(finches_hybrid) %>%\n ggplot() + \n geom_point(aes(.fittedPC1, .fittedPC2, colour = category), size = 1.5) +\n geom_segment(data = pca_loadings,\n aes(xend = PC1, yend = PC2),\n x = 0, y = 0,\n arrow = arrow_style) +\n geom_text(data = pca_loadings,\n aes(x = PC1, y = PC2, label = terms), \n hjust = 0, \n vjust = 1,\n size = 5) \n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-25-1.png){width=672}\n:::\n:::\n\n\n\n## Python\n\n:::\n\nAfter all that we end up with a rather unclear plot. The 6 variables that contribute to each principal component have very overlapping contributions in PC1. As such, it's difficult to see which variable contributes what!\n\nThe reason why I'm still showing it here is that this kind of plot is very often used in PCA, so at least you can recognise it. If the variables have well-separated contributions across the two principal components, then it can be quite informative.\n\nA better way would be to plot the individual contributions to each principal component as an ordered bar plot. This does require some rejigging of the data again (sorry!).\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nFirst, we convert our loadings data back to a \"long\" format. 
We also add an extra column, `direction`, which indicates if the contribution to the principal component is positive or negative.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_loadings <- pca_loadings %>% \n pivot_longer(cols = -terms,\n names_to = \"component\",\n values_to = \"value\") %>% \n # add a column that indicates direction\n # (positive or negative)\n mutate(direction = ifelse(value < 0, \"positive\", \"negative\"))\n\n# have a look at the data\nhead(pca_loadings)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 4\n terms component value direction\n \n1 weight PC1 0.459 negative \n2 weight PC2 0.151 negative \n3 weight PC3 0.0423 negative \n4 weight PC4 -0.449 positive \n5 weight PC5 0.745 negative \n6 weight PC6 -0.0920 positive \n```\n\n\n:::\n:::\n\n\nWe can now visualise this. Here we are using some additional functionality offered by the `tidytext` library. Make sure to install it, if needed. Then load it.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# we need this library\nlibrary(tidytext)\n\npca_loadings %>% \n mutate(terms = tidytext::reorder_within(terms,\n abs(value),\n component)) %>%\n ggplot(aes(x = abs(value), y = terms, fill = direction)) +\n geom_col() +\n facet_wrap(vars(component), scales = \"free_y\") +\n tidytext::scale_y_reordered()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-27-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n:::\n\n:::{.callout-important}\nIt is important to keep the amount of variance explained by each principal component in mind. For example, PC3 only explains around 4.7 of the variance. So although several variables contribute substantially to PC3, the contribution of PC3 itself remains small.\n:::\n\n## Exercises\n\n### Penguins {#sec-exr_penguins}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nFor this exercise we'll be using the data from `data/penguins.csv`. These data are from the `palmerpenguins` package (for more information, see [the GitHub page](https://github.com/allisonhorst/palmerpenguins)).\n\nI would like you to do the following:\n\n1. Load and visualise the data.\n2. Create a screeplot and see how many PCs would be best.\n3. Calculate the loadings for PC1 and PC2 and visualise them.\n4. Any conclusions you might draw from the analysis.\n\n::: {.callout-answer collapse=\"true\"}\n\n#### Load and visualise the data\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\npenguins <- read_csv(\"data/penguins.csv\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 344 Columns: 8\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (3): species, island, sex\ndbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 8\n species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n \n1 Adelie Torgersen 39.1 18.7 181 3750\n2 Adelie Torgersen 39.5 17.4 186 3800\n3 Adelie Torgersen 40.3 18 195 3250\n4 Adelie Torgersen NA NA NA NA\n5 Adelie Torgersen 36.7 19.3 193 3450\n6 Adelie Torgersen 39.3 20.6 190 3650\n# ℹ 2 more variables: sex , year \n```\n\n\n:::\n:::\n\n\nWe can see that there are different kinds of variables, both categorical and numerical. 
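\n\n::: {.callout-note collapse=\"true\"}\n## Optional: checking types and missing values in Python (added sketch)\n\nThe Python answer is not written out in these materials, so this is a minimal added sketch. It assumes the same `data/penguins.csv` file and pandas imported as `pd`; the name `penguins_py` is only used for this illustration.\n\n```python\npenguins_py = pd.read_csv(\"data/penguins.csv\")\n\n# variable types: a mix of text (object) and numeric columns\nprint(penguins_py.dtypes)\n\n# number of missing values per column\nprint(penguins_py.isna().sum())\n```\n:::\n\n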
Also, there appear to be some missing data in the data set, so we probably have to deal with that.\n\nLastly, we should be careful with the `year` column: it is recognised as a numerical column (because it contains numbers), but we should view it as a factor, since the years have an ordered, categorical meaning.\n\nTo get a better sense of our data we could plot all the numerical variables against each other, to see if there is any possible correlation between them. Doing that one-by-one in ggplot is tedious, so I'm using the `pairs()` function here. Pretty? No. Effective? Yes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npenguins %>% \n select(where(is.numeric)) %>% \n pairs()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-30-1.png){width=672}\n:::\n:::\n\n\n\n\n## Python\n\n:::\n\nSo we see that there is some possible grouping going on and possibly some linear relationships, too. However, there are multiple groups and closely related measurements, so it is not very surprising that there are possible relationships within the data.\n\n#### Perform the PCA\n\nFirst, we need to do a little bit of data tidying. We convert `year` to a factor and deal with the missing values. We're not dealing with them in a particularly subtle way here, removing all the rows that contain at least one missing value.\n\nIn your own research you may want to be more careful and only remove missing values from variables that you using in the PCA (here we include everything).\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\npenguins <- penguins %>% \n mutate(year = factor(year)) %>% \n drop_na() # remove all rows with missing values\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit <- penguins %>% \n # keep only the numeric columns\n select(where(is.numeric)) %>%\n # scale the data\n scale() %>%\n # perform the PCA\n prcomp()\n```\n:::\n\n\n## Python\n\n:::\n\n#### Visualise the PCs\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe can visualise the principal components:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit %>% \n tidy(matrix = \"eigenvalues\") %>% \n ggplot(aes(x = PC,\n y = percent)) +\n geom_point() +\n geom_line()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-33-1.png){width=672}\n:::\n:::\n\n\n\n## Python\n\n:::\n\nIt looks like using the first two principal components is probably capturing the majority of the variance in the data. 
Combined they account for 90% of the variance.\n\n#### Loadings\n\nNext, we get the loadings: how much is each original variable contributing to the individual principal components?\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe start with some data tidying and wrangling, since we need the data in a \"wide\" format for the next plot to work:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_loadings <- pca_fit %>% \n tidy(matrix = \"loadings\") %>% \n rename(terms = column,\n component = PC) %>% \n mutate(component = paste0(\"PC\", component)) %>% \n pivot_wider(names_from = \"component\",\n values_from = \"value\")\n\nhead(pca_loadings)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 4 × 5\n terms PC1 PC2 PC3 PC4\n \n1 bill_length_mm 0.454 -0.600 -0.642 0.145\n2 bill_depth_mm -0.399 -0.796 0.426 -0.160\n3 flipper_length_mm 0.577 -0.00579 0.236 -0.782\n4 body_mass_g 0.550 -0.0765 0.592 0.585\n```\n\n\n:::\n:::\n\n\nArrow style:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# define arrow style\narrow_style <- arrow(length = unit(2, \"mm\"),\n type = \"closed\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit %>%\n # add the original data\n augment(penguins) %>%\n ggplot() + \n geom_point(aes(.fittedPC1, .fittedPC2, colour = species), size = 1.5) +\n geom_segment(data = pca_loadings,\n aes(xend = PC1, yend = PC2),\n x = 0, y = 0,\n arrow = arrow_style) +\n geom_text(data = pca_loadings,\n aes(x = PC1, y = PC2, label = terms), \n hjust = 0, \n vjust = 1,\n size = 5) \n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-36-1.png){width=672}\n:::\n:::\n\n\nWe can also visualise these contributions using a bar plot. We need the data in a slightly different format before we can do this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_loadings <- pca_loadings %>% \n pivot_longer(cols = -terms,\n names_to = \"component\",\n values_to = \"value\") %>% \n # add a column that indicates direction\n # (positive or negative)\n mutate(direction = ifelse(value < 0, \"positive\", \"negative\"))\n\n# have a look at the data\nhead(pca_loadings)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 4\n terms component value direction\n \n1 bill_length_mm PC1 0.454 negative \n2 bill_length_mm PC2 -0.600 positive \n3 bill_length_mm PC3 -0.642 positive \n4 bill_length_mm PC4 0.145 negative \n5 bill_depth_mm PC1 -0.399 positive \n6 bill_depth_mm PC2 -0.796 positive \n```\n\n\n:::\n:::\n\n\nBut after that, we can visualise it as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# we need this library\nlibrary(tidytext)\n\npca_loadings %>% \n mutate(terms = tidytext::reorder_within(terms,\n abs(value),\n component)) %>%\n ggplot(aes(x = abs(value), y = terms, fill = direction)) +\n geom_col() +\n facet_wrap(vars(component), scales = \"free_y\") +\n tidytext::scale_y_reordered()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-38-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n:::\n\n#### Conclusions\n1. Load the data.\n2. Create a screeplot and see how many PCs would be best.\n3. Calculate the loadings for PC1 and PC2 and visualise them.\n4. Any conclusions you might draw from the analysis.\n\nTaken together, we can conclude/comment on a few things:\n\n1. Endlessly looking at pairwise comparisons between continuous variables probably becomes a bit tedious. An alternative would be to calculate correlations between the variables to get some insight into your data. 
In the end it depends on how many variables you have and how much you (want to) know about them.\n2. In this case, I'd say that the first two principal components capture most of the variance in the data.\n3. The largest contributing variables mostly differ between the first two principal components. The variables that make up PC1 are very similar in terms of contribution, whereas two variables more or less make up PC2 entirely.\n\nThe variables `flipper_length_mm` and `body_mass_g` contribute pretty much only to PC1 (they are horizontal in the loadings plot), whereas `bill_length_mm` is contributing a reasonable amount to both PC1 *and* PC2.\n\nFrom the data itself we can see that there is clear separation between the three `species`. The Gentoo penguins separate further once again from the other two species.\n\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- PCA allows you to reduce a large number of variables into fewer principal components.\n- Each PC is made up of a combination of the original variables and captures as much of the variance within the data as possible.\n- The loadings tell you how much each original variable contributes to each PC.\n- A screeplot is a graphical representation of the amount of variance explained by each PC.\n:::\n", "supporting": [ "mva-pca_files" ], diff --git a/materials/mva-intro.qmd b/materials/mva-intro.qmd index af3b008..29fd1ba 100644 --- a/materials/mva-intro.qmd +++ b/materials/mva-intro.qmd @@ -17,6 +17,14 @@ exec(open('setup_files/setup.py').read()) ## Principal component analysis +This is a statistical technique for reducing the dimensionality of a data set. The technique aims to find a new set of variables for describing the data. These new variables are made from a weighted sum of the old variables. The weighting is chosen so that the new variables can be ranked in terms of importance in that the first new variable is chosen to account for as much variation in the data as possible. Then the second new variable is chosen to account for as much of the remaining variation in the data as possible, and so on until there are as many new variables as old variables. + ## K-means clustering +This is a method for grouping observations into clusters. It groups data based on similarity and is an often-used method unsupervised machine learning algorithm. + +It groups the data into a fixed number of clusters ($k$) and the ultimate aim is to discover patterns in the data. + ## Hierarchical clustering + +A clustering method that tries to create a hierarchy across data, often displayed as a *dendogram*. This visualises the way the clusters are arranged. diff --git a/materials/mva-pca.qmd b/materials/mva-pca.qmd index 8c2c033..851e5c4 100644 --- a/materials/mva-pca.qmd +++ b/materials/mva-pca.qmd @@ -61,13 +61,6 @@ Long story short: these data are part of a hybridisation study on finches at the From many years of studies, going back to Darwin, we now know that the beak shape of the finches are important in their chances for survival. Changes in beak shape have led to new species and this study explored how movement of genes from a rare immigrant species (*Geospiza fuliginosa*) occurred through two other species (*G. fortis* and *G. scandens*). The researchers recorded various morphological traits. - -It uses the following abbreviations: - -* `F` = *G. fortis* -* `f` = *G. fuliginosa* -* `S` = *G. scandens* - ## Load and visualise the data First, we load the data: