0403 pca part 1

cambiotraining · Mar 4, 2024 · 44063fb · 44063fb
1 parent 5df777e
commit 44063fb
Show file tree

Hide file tree

Showing 11 changed files with 1,147 additions and 1,237 deletions.
diff --git a/_freeze/materials/mva-pca/execute-results/html.json b/_freeze/materials/mva-pca/execute-results/html.json
@@ -1,8 +1,8 @@
 {
-  "hash": "73fa1185a0cfaea422e124ed008e7b77",
+  "hash": "1ad4a5cce1301789a1b72561d1efabec",
   "result": {
     "engine": "knitr",
-    "markdown": "---\ntitle: \"Principal component analysis\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n- Understand when PCAs can be useful\n- Be able to perform a PCA\n- Learn how to plot and interpret a screeplot\n- Plot and interpret the loadings for each PCA\n\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n## Python\n\n### Libraries\n### Functions\n:::\n:::\n\n## Purpose and aim\n\nThis is a statistical technique for reducing the dimensionality of a data set. The technique aims to find a new set of variables for describing the data. These new variables are made from a weighted sum of the old variables. The weighting is chosen so that the new variables can be ranked in terms of importance in that the first new variable is chosen to account for as much variation in the data as possible. Then the second new variable is chosen to account for as much of the remaining variation in the data as possible, and so on until there are as many new variables as old variables.\n\n\n## Data and hypotheses\n\nThe example in this section uses the following data set:\n\n`data/finches_hybridisation.csv`\n\nThese data come from an analysis of gene flow across two finch species [@grant2020]. They are slightly adapted here for illustrative purposes.\n\n## Exercises\n\n### Title {#sec-exr_title}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nFor this exercise we'll be using the data from `data/file.csv`.\n\n::: {.callout-answer collapse=\"true\"}\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n## Python\n\n:::\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- PCA allows you to reduce a large number of variables into fewer principal components\n- Each PC is made up of a combination of the original variables and captures as much of the variance within the data as possible\n- The loadings tell you how much each original variable contributes to each PC\n- A screeplot is a graphical representation of the amount of variance explained by each PC\n:::\n",
+    "markdown": "---\ntitle: \"Principal component analysis\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n- Understand when PCAs can be useful\n- Be able to perform a PCA\n- Learn how to plot and interpret a screeplot\n- Plot and interpret the loadings for each PCA\n\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n## Python\n\n### Libraries\n### Functions\n:::\n:::\n\n## Purpose and aim\n\nThis is a statistical technique for reducing the dimensionality of a data set. The technique aims to find a new set of variables for describing the data. These new variables are made from a weighted sum of the old variables. The weighting is chosen so that the new variables can be ranked in terms of importance in that the first new variable is chosen to account for as much variation in the data as possible. Then the second new variable is chosen to account for as much of the remaining variation in the data as possible, and so on until there are as many new variables as old variables.\n\n\n## Data\n\nThe example in this section uses the following data set:\n\n`data/finches_hybridisation.csv`\n\nThese data come from an analysis of gene flow across two finch species [@grant2020]. They are slightly adapted here for illustrative purposes (the changes are documented in `materials/data_preparation.R` if you are interested).\n\nLong story short: these data are part of a hybridisation study on finches at the Galapagos Islands. Here, we've only included the non-hybrid observations, but feel free to rerun the analysis with all the hybrids!\n\nFrom many years of studies, going back to Darwin, we now know that the beak shape of the finches are important in their chances for survival. Changes in beak shape have led to new species and this study explored how movement of genes from a rare immigrant species (*Geospiza fuliginosa*) occurred through two other species (*G. fortis* and *G. scandens*). The researchers recorded various morphological traits.\n\n\nIt uses the following abbreviations:\n\n* `F`\t= *G. fortis*\n* `f`\t= *G. fuliginosa*\n* `S`\t= *G. scandens*\n\n## Load and visualise the data\n\nFirst, we load the data:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches_hybrid <- read_csv(\"data/finches_hybridisation.csv\")\n```\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfinches_hybrid_py = pd.read_csv(\"data/finches_hybridisation.csv\")\n```\n:::\n\n\n:::\n\nNext, we visualise the data:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(finches_hybrid)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 7\n  category weight  wing tarsus blength bdepth bwidth\n  <chr>     <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>\n1 fortis     15.8  67.1   19.6    10.3   8.95   8.32\n2 fortis     16    67     17.7    10.4   9.3    8.6 \n3 fortis     19.2  72.4   19.4    12.1  10.8   10.1 \n4 fortis     16.1  68     19.1    11.3  10      8.2 \n5 fortis     18.6  67.5   20.0    11.2   9.5    8.9 \n6 fortis     17.3  71.1   19.6    11.8  10.3    9.59\n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfinches_hybrid_py.head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n  category  weight   wing  tarsus  blength  bdepth  bwidth\n0   fortis   15.84  67.13   19.55    10.30    8.95    8.32\n1   fortis   16.00  67.00   17.70    10.40    9.30    8.60\n2   fortis   19.25  72.38   19.38    12.10   10.85   10.13\n3   fortis   16.10  68.00   19.10    11.30   10.00    8.20\n4   fortis   18.60  67.50   19.95    11.15    9.50    8.90\n```\n\n\n:::\n:::\n\n:::\n\nWe have 7 columns. We're not going to recreate the published analysis exactly, but for the purpose of this section we will be looking if differences in the measured morphological traits can be attributed to specific categories.\n\nWe have 3 different `category` values:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches_hybrid %>% distinct(category)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 1\n  category  \n  <chr>     \n1 fortis    \n2 fuliginosa\n3 scandens  \n```\n\n\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfinches_hybrid_py['category'].unique()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\narray(['fortis', 'fuliginosa', 'scandens'], dtype=object)\n```\n\n\n:::\n:::\n\n\n:::\n\nWe have 6 continuous response variables, which we're not going to visualise independently! However, you *could* look at some of the potential relationships within the data.\n\nFor example, looking at body weight (`weight`) against beak length (`blength`):\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(finches_hybrid, aes(x = blength, y = weight,\n                           colour = category)) +\n         geom_point()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\nor just `weight` across the different categories:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(finches_hybrid, aes(x = category, y = weight)) +\n  geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(finches_hybrid_py,\n         aes(x = \"blength\", y = \"weight\",\n             colour = \"category\")) +\n     geom_point())\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-11-1.png){width=614}\n:::\n:::\n\n\nor just `weight` across the different categories:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n(ggplot(finches_hybrid_py,\n         aes(x = \"category\", y = \"weight\")) +\n     geom_boxplot())\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-12-3.png){width=614}\n:::\n:::\n\n\n\n:::\n\nThe number of combinations are rather large, given the number of variables we have. Hence it's a good idea to see if we can \"condense\" some of variables into new ones.\n\nWhat we're doing with a PCA is trying to capture as much *variance* that exists in the data using a Principal Component (PC). The PC therefore explains some of the variance for each of the individual variables.\n\nI like to compare the PCs to a smoothie: a smoothie might consist of 4 parts apple, 3 parts strawberry, 2 parts blueberry and 1 part grape. Along the lines of that delicious metaphor, one of our Principal components may consist of 4 parts `blength`, 3 parts `weight`, 2 parts `bdepth` and 1 part `wing`. We don't know this yet, so let's go find out.\n\n## Performing the PCA\n\nTo perform a PCA, we need to keep a few things in mind:\n\n1. We can only calculate principal components for numeric data\n2. The numeric data need to be on the same **scale**\n\nThis last point of scaling is very important. Measurements can take place at different scales. You shouldn't compare milimetres and kilogrammes directly, for example. Or seconds and height. That simply does not make sense.\n\nThis issue is even more prevalent when comparing gene expression data, for example. Genes can be active at varying levels, but some genes only need (an indeed *do*) change a tiny amount to elicit an effect, whereas other genes might have to change several fold-changes before something happens.\n\nTo compensate for this, we bring all of our data onto the same scale.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nIn R we can scale our data with the `scale()` function. We perform the PCA using the `prcomp()` function. Here we store the output into an object called `pca_fit`, because we'll be working with the output later on.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit <- finches_hybrid %>% \n  # keep only the numeric columns\n  select(where(is.numeric)) %>%\n  # scale the data\n  scale() %>%\n  # perform the PCA\n  prcomp()\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nStandard deviations (1, .., p=6):\n[1] 1.9815712 1.1261455 0.5329781 0.4847379 0.4170304 0.3349943\n\nRotation (n x k) = (6 x 6):\n              PC1        PC2         PC3         PC4         PC5         PC6\nweight  0.4585415  0.1513150  0.04230031 -0.44900961  0.74497043 -0.09199756\nwing    0.4333116  0.1946944 -0.84096840  0.20843413 -0.12117514  0.09475803\ntarsus  0.4174676  0.3293946  0.46092626  0.70300481  0.08141532  0.06263195\nblength 0.4338157  0.2986440  0.26067134 -0.49214612 -0.64184633 -0.02217340\nbdepth  0.3370530 -0.6256540  0.10181858 -0.03331599 -0.02041369  0.69502424\nbwidth  0.3548249 -0.5916637  0.01460313  0.13195474 -0.10641317 -0.70362223\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit %>%\n  # add the original data\n  augment(finches_hybrid) %>%\n  ggplot(aes(.fittedPC1, .fittedPC2, colour = category)) + \n  geom_point(size = 1.5)\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-15-1.png){width=672}\n:::\n:::\n\n\n\n## Python\n\n:::\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe can figure out how much each principal component is contributing to the amount of variance that is being explained:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npca_fit %>% \n  tidy(matrix = \"eigenvalues\") %>% \n  ggplot(aes(x = PC,\n             y = percent)) +\n  geom_point() +\n  geom_line()\n```\n\n::: {.cell-output-display}\n![](mva-pca_files/figure-html/unnamed-chunk-16-1.png){width=672}\n:::\n:::\n\n\n## Python\n\n:::\n\nThis means that PC1 is able to explain around 65% of the variance in the data.\n\nPC2 is able to explain around 21% of the variance in the data, and so forth.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n## Python\n\n:::\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n## Python\n\n:::\n\n\n\n## Exercises\n\n### Title {#sec-exr_title}\n\n:::{.callout-exercise}\n\n\n{{< level 2 >}}\n\n\n\nFor this exercise we'll be using the data from `data/file.csv`.\n\n::: {.callout-answer collapse=\"true\"}\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n## Python\n\n:::\n:::\n:::\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- PCA allows you to reduce a large number of variables into fewer principal components\n- Each PC is made up of a combination of the original variables and captures as much of the variance within the data as possible\n- The loadings tell you how much each original variable contributes to each PC\n- A screeplot is a graphical representation of the amount of variance explained by each PC\n:::\n",
     "supporting": [
       "mva-pca_files"
     ],

diff --git a/_freeze/materials/mva-pca/figure-html/unnamed-chunk-10-1.png b/_freeze/materials/mva-pca/figure-html/unnamed-chunk-10-1.png
diff --git a/_freeze/materials/mva-pca/figure-html/unnamed-chunk-11-1.png b/_freeze/materials/mva-pca/figure-html/unnamed-chunk-11-1.png
diff --git a/_freeze/materials/mva-pca/figure-html/unnamed-chunk-12-3.png b/_freeze/materials/mva-pca/figure-html/unnamed-chunk-12-3.png
diff --git a/_freeze/materials/mva-pca/figure-html/unnamed-chunk-15-1.png b/_freeze/materials/mva-pca/figure-html/unnamed-chunk-15-1.png
diff --git a/_freeze/materials/mva-pca/figure-html/unnamed-chunk-16-1.png b/_freeze/materials/mva-pca/figure-html/unnamed-chunk-16-1.png
diff --git a/_freeze/materials/mva-pca/figure-html/unnamed-chunk-9-1.png b/_freeze/materials/mva-pca/figure-html/unnamed-chunk-9-1.png