diff --git a/pca/Untitled.ipynb b/pca/Untitled.ipynb deleted file mode 100644 index 7fec5150..00000000 --- a/pca/Untitled.ipynb +++ /dev/null @@ -1,6 +0,0 @@ -{ - "cells": [], - "metadata": {}, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/pca_1/data/rectangle_data.csv b/pca_1/data/rectangle_data.csv new file mode 100644 index 00000000..8a19d206 --- /dev/null +++ b/pca_1/data/rectangle_data.csv @@ -0,0 +1,101 @@ +width,height,area,perimeter +8,6,48,28 +2,4,8,12 +1,3,3,8 +9,3,27,24 +9,8,72,34 +3,1,3,8 +4,2,8,12 +6,5,30,22 +7,1,7,16 +8,2,16,20 +5,5,25,20 +9,5,45,28 +8,4,32,24 +1,2,2,6 +2,9,18,22 +7,8,56,30 +7,5,35,24 +2,4,8,12 +2,5,10,14 +4,3,12,14 +8,1,8,18 +8,5,40,26 +9,8,72,34 +7,3,21,20 +8,3,24,22 +6,6,36,24 +6,4,24,20 +8,4,32,24 +7,2,14,18 +4,4,16,16 +4,1,4,10 +9,7,63,32 +4,2,8,12 +1,2,2,6 +8,1,8,18 +5,4,20,18 +9,5,45,28 +6,9,54,30 +3,7,21,20 +7,3,21,20 +7,7,49,28 +3,3,9,12 +5,5,25,20 +7,3,21,20 +8,2,16,20 +2,6,12,16 +7,1,7,16 +1,2,2,6 +3,5,15,16 +2,6,12,16 +5,6,30,22 +9,1,9,20 +6,3,18,18 +2,6,12,16 +6,8,48,28 +9,1,9,20 +2,8,16,20 +3,7,21,20 +6,5,30,22 +8,6,48,28 +9,2,18,22 +3,9,27,24 +2,5,10,14 +1,4,4,10 +3,3,9,12 +2,9,18,22 +6,4,24,20 +5,2,10,14 +2,6,12,16 +2,5,10,14 +1,3,3,8 +1,3,3,8 +6,7,42,26 +8,6,48,28 +2,7,14,18 +4,6,24,20 +1,1,1,4 +9,8,72,34 +5,5,25,20 +3,9,27,24 +2,8,16,20 +2,2,4,8 +8,3,24,22 +1,8,8,18 +5,1,5,12 +5,7,35,24 +9,5,45,28 +3,1,3,8 +9,7,63,32 +8,5,40,26 +9,8,72,34 +4,5,20,18 +4,5,20,18 +1,5,5,12 +6,6,36,24 +8,5,40,26 +8,7,56,30 +1,4,4,10 +1,6,6,14 +2,6,12,16 diff --git a/pca_1/images/dataset3.png b/pca_1/images/dataset3.png index 6cfcd793..a9ab94fb 100644 Binary files a/pca_1/images/dataset3.png and b/pca_1/images/dataset3.png differ diff --git a/pca_1/images/dataset3_outlier.png b/pca_1/images/dataset3_outlier.png new file mode 100644 index 00000000..a5d8fd04 Binary files /dev/null and b/pca_1/images/dataset3_outlier.png differ diff --git a/pca_1/images/diag_matrix.png b/pca_1/images/diag_matrix.png new file mode 100644 index 00000000..b8e5c815 Binary files /dev/null and b/pca_1/images/diag_matrix.png differ diff --git a/pca_1/images/diff_reductions.png b/pca_1/images/diff_reductions.png new file mode 100644 index 00000000..ebd4c5f8 Binary files /dev/null and b/pca_1/images/diff_reductions.png differ diff --git a/pca_1/images/matmul.png b/pca_1/images/matmul.png index a15e6018..6c56dba8 100644 Binary files a/pca_1/images/matmul.png and b/pca_1/images/matmul.png differ diff --git a/pca_1/images/matmul2.png b/pca_1/images/matmul2.png index c9700f46..6b7c6edb 100644 Binary files a/pca_1/images/matmul2.png and b/pca_1/images/matmul2.png differ diff --git a/pca_1/images/matmul3.png b/pca_1/images/matmul3.png index b5a07743..35bc575d 100644 Binary files a/pca_1/images/matmul3.png and b/pca_1/images/matmul3.png differ diff --git a/pca_1/images/orthonormal.png b/pca_1/images/orthonormal.png new file mode 100644 index 00000000..9256b5c2 Binary files /dev/null and b/pca_1/images/orthonormal.png differ diff --git a/pca_1/images/pca_example.png b/pca_1/images/pca_example.png new file mode 100644 index 00000000..825ca3b6 Binary files /dev/null and b/pca_1/images/pca_example.png differ diff --git a/pca_1/images/sigma.png b/pca_1/images/sigma.png new file mode 100644 index 00000000..2345e3f9 Binary files /dev/null and b/pca_1/images/sigma.png differ diff --git a/pca_1/images/svd.png b/pca_1/images/svd.png index 60f4654a..b66e9d3f 100644 Binary files a/pca_1/images/svd.png and b/pca_1/images/svd.png differ diff --git a/pca_1/images/svd_geo.png b/pca_1/images/svd_geo.png 
new file mode 100644 index 00000000..a57ac566 Binary files /dev/null and b/pca_1/images/svd_geo.png differ diff --git a/pca_1/images/u.png b/pca_1/images/u.png new file mode 100644 index 00000000..e36ce513 Binary files /dev/null and b/pca_1/images/u.png differ diff --git a/pca_1/images/v.png b/pca_1/images/v.png new file mode 100644 index 00000000..7a4ec99a Binary files /dev/null and b/pca_1/images/v.png differ diff --git a/pca_1/pca_1.ipynb b/pca_1/pca_1.ipynb new file mode 100644 index 00000000..13281d10 --- /dev/null +++ b/pca_1/pca_1.ipynb @@ -0,0 +1,905 @@ +{ + "cells": [ + { + "cell_type": "raw", + "metadata": {}, + "source": [ + "---\n", + "title: PCA I\n", + "format:\n", + " html:\n", + " toc: true\n", + " toc-depth: 5\n", + " toc-location: right\n", + " code-fold: false\n", + " theme:\n", + " - cosmo\n", + " - cerulean\n", + " callout-icon: false\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "::: {.callout-note}\n", + "## Learning Outcomes\n", + "- Define and carry out the procedure of PCA\n", + "- Understand the connection between PCA and SVD\n", + ":::\n", + "\n", + "## Dimensionality \n", + "\n", + "So far, we've used visualizations like rugplots or scatterplots to help us identify clusters in our dataset. Since we humans are 3D beings, we can’t visualize beyond three dimensions, but many datasets come with more than three features. In order to visualize multidimensional data, we can reduce the dataset to lower dimensions through **dimensionality reduction**. \n", + "\n", + "Previously, we have been working with data tables with rows and columns. These rows and columns corresponded to observations and attributes about said observations. Now, we have to be a bit more clear with our wording to follow the language of linear algebra. \n", + "\n", + "Suppose we have a dataset of:\n", + "- N observations (data points/rows)\n", + "- d attributes (features/columns).\n", + "\n", + "In Linear Algebra, we think of data as a collection of vectors. Vectors have a *dimension*, meaning they have some number of unique elements. Row and column now denote the direction in which a vector is written (horizontally, like a row, or vertically, like a column):\n", + "\n", + "Linear Algebra views our data as a matrix of:\n", + "- N row vectors in a d-dimensional space, OR\n", + "- d column vectors in an N-dimensional space.\n", + "\n", + "**Dimensionality** of data is a complex topic. Sometimes, it is clear from looking at the number of rows/columns, but other times it is not. \n", + "\n", + "
\n", + "\n", + "For example, the dataset below has 4 columns, but the `Weight (lbs)` column is actually just a linear transformation of the `Weight (kg)` column. Thus, no new information is captured, and the matrix of our dataset has a (column) rank of 3! Despite having 4 columns, we still say that this data is 3-dimensional.\n", + "\n", + "\n", + "
\n", + "\n", + "Plotting the weight columns together reveals the key visual intuition. While the two columns visually span a 2-d space as a line, the data does not deviate at all from that singular line. This means that one of the weight columns are redundant! Even given the option to cover the whole 2d space, the data below does not. It might as well not have this dimension, which is why we still do not consider the data below to span more than 1 dimension.\n", + "\n", + "
\n", + "\n", + "What happens when there are outliers? Below, we've added one outlier point to the dataset above, and just that is enough toi change the rank of the matrix from 1 to 2 dimensions. However, the data is still very much 1-dimensional. \n", + "\n", + "
\n", + "\n", + "As such, dimensionality reduction is generally an *approximation* of the original data that's achieved through projecting the data onto a desired dimension. However, there are many ways to project a dataset onto a lower dimension. How do we choose the best one? \n", + "\n", + "
\n", + "\n", + "In general, we want the projection that is the best approximation for the original data (the graph on the right). In other words, we want the projection that capture the most variance of the original data. In the next section, we'll see how this is calculated." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Matrix as Transformation\n", + "\n", + "Consider the matrix multiplication example below:\n", + "\n", + "
\n", + "\n", + "* For table 1, each row of the fruits matrix represents one bowl of fruit; the first bowl/row has 2 apples, 2 lemons, and 2 melons.\n", + "* For table 2, each column of the dollars matrix represents the cost of fruit at a store; the first store/column charges 2 dollars for an apple, 1 dollar for a lemon, and 4 dollars for a melon.\n", + "* The output is the cost of each bowl at each store\n", + "\n", + "In general, there are two ways to interpret matrix multiplication:\n", + "1. Row dot column to get each datapoint. In this view, we perform multiple linear operations on the data
\n", + "2. Linear transformation of columns
\n", + "\n", + "## Principal Component Analysis (PCA) \n", + "\n", + "In PCA, our goal is to transform observations from high-dimensional data down to low dimensions (often 2, as most visualizations are 2D) through linear transformations. In other words, we want to find a linear transformation that creates a low-dimension representation which captures as much of the original data’s *total variance* as possible.\n", + "\n", + "
\n", + "\n", + "We often perform PCA during the Exploratory Data Analysis (EDA) stage of our data science lifecycle when we don't know what model to use; it helps us with \n", + "\n", + "* Visually identifying clusters of similar observations in high dimensions.\n", + "* Removing irrelevant dimensions if we suspect that the dataset is inherently low rank. For example, if the columns are collinear: there are many attributes but only a few mostly determine the rest through linear associations.\n", + "* Finding a small basis for representing variations in complex things, e.g., images, genes.\n", + "* Reducing the number of dimensions to make some computation cheaper.\n", + "\n", + "
\n", + "\n", + "\n", + "### PCA Procedure (Overview)\n", + "\n", + "1. Center the data matrix by subtracting the mean of each attribute column.\n", + "2. To find the $i$-th principal component $v_i$:\n", + " 1. $v$ is a unit vector that linearly combines the attributes.\n", + " 2. $v$ gives a one-dimensional projection of the data.\n", + " 3. $v$ is chosen to minimize the sum of squared distances between each point and its projection onto $v$.\n", + " 4. Choose $v$ such that it is orthogonal to all previous principal components.\n", + "\n", + "$k$ principal components capture the most variance of any $k$-dimensional reduction of the data matrix.\n", + "\n", + "In practice, however, we don’t carry out the procedures in step 2 because it takes too long to compute. Instead, we use singular value decomposition (SVD) to efficiently find all principal components." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Singular Value Decomposition (SVD)\n", + "\n", + "Singular value decomposition (SVD) is an important concept in linear algebra. Since this class requires a linear algebra course (MATH 54, 56 or EECS 16A) as a pre/co-requisite, we assume you have taken (or are taking) a linear algebra course and will not explain SVD in its entirety. In particular, we will go over:\n", + "\n", + "- Why SVD is a valid decomposition of rectangular matrices\n", + "- Why PCA is an application of SVD.\n", + "\n", + "We will not go much into the theory and details of SVD. Instead, we will only cover what is needed for a data science interpretation.\n", + "\n", + "::: {.callout-tip}\n", + "### [Linear Algebra] Orthonormality\n", + "\n", + "Orthonormal is a combination of two words: orthogonal and normal.\n", + "When we say the columns of a matrix are orthonormal, we say that (1) the columns are all orthogonal to each other (all pairs of columns have a dot product of zero) and (2) all columns are unit vectors (he length of each column vector is 1)\n", + "
\n", + "\n", + "\n", + "Orthonormal matrices have a few important properties\n", + "* **Orthonormal inverse**: If an $m \\times n$ matrix $Q$ has orthonormal columns, $QQ^T= Iₘ$ and $Q^TQ=Iₙ$.\n", + "* **Rotation of coordinates**: the linear transformation represented by an orthonormal matrix is often a rotation (and less often a reflection). We can imagine columns of the matrix as where the unit vectors of the original space will land.\n", + ":::\n", + "\n", + "::: {.callout-tip}\n", + "### [Linear Algebra] Diagnomal Matrices \n", + "\n", + "**Diagonal matrices** are square matrices with non-zero values on the diagonal axis and zero everywhere else.\n", + "\n", + "Right-multiplied diagonal matrices scale each column up or down by a constant factor. Geometrically, this transformation can be viewed as scaling the coordinate system.\n", + "\n", + "
\n", + ":::\n", + "\n", + "Singular value decomposition (SVD) describes a matrix $X$'s decomposition into three matrices:\n", + "$$ X = U \\Sigma V^T $$\n", + "\n", + "Let's break down each of these terms one-by-one\n", + "\n", + "### $U$\n", + "\n", + "* $U$ is an $n$ by $d$ matrix: $U \\in \\mathbb{R}^{n \\times d}$.\n", + "* Its columns are **orthonormal**: $\\bar{u_i}^T\\bar{u_j} = 0$ for all pairs $i, j$, and all vectors $\\bar{u_i}$ are unit vectors with length = 1.\n", + "* Columns of U are called the **left singular vectors**.\n", + "* $UU^T = I_n$ and $U^TU = I_d$.\n", + "* We can think of $U$ as a rotation. \n", + "\n", + "
\n", + "\n", + "### $\\Sigma$\n", + "\n", + "* $\\Sigma$ is a $d$ by $d$ matrix: $\\Sigma \\in \\mathbb{R}^{d \\times d}$.\n", + "* The majority of the matrix is zero\n", + "* It has $r$ **non-zero** **singular values**, and $r$ is the rank of $X$\n", + "* Diagonal values (**singular values** $\\sigma_1, \\sigma_2, ... \\sigma_r$), are ordered from greatest to least $\\sigma_1 > \\sigma_2 > ... > \\sigma_r$\n", + "* We can think of $\\Sigma$ as a rotation. \n", + "\n", + "
\n", + "\n", + "### $V^T$\n", + "\n", + "* $V^T$ is an $d$ by $d$ matrix: $U \\in \\mathbb{R}^{d \\times d}$.\n", + "* Columns of $V$ are orthonormal, so the rows of $V^T$ are orthonormal\n", + "* Columns of $V$ are called the **right singular vectors**. \n", + "* $VV^T = V^TV = I_d$\n", + "* WE can think of $V$ as a rotation. \n", + "\n", + "
\n", + "\n", + "### SVD: Geometric Perspective \n", + "\n", + "
\n", + "\n", + "We’ve seen that $U$ and $V$ represent rotations, and $\\Sigma$ represents scaling. Therefore, SVD says that any matrix can be decomposed into a rotation, then a scaling, and another rotation.\n", + "
\n", + "\n", + "### SVD in `numpy`\n", + "For this demo, we'll continue working with our rectangular dataset from before with $n=100$ rows and $d=4$ columns." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
widthheightareaperimeter
0864828
124812
21338
3932724
4987234
\n", + "
" + ], + "text/plain": [ + " width height area perimeter\n", + "0 8 6 48 28\n", + "1 2 4 8 12\n", + "2 1 3 3 8\n", + "3 9 3 27 24\n", + "4 9 8 72 34" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#| code-fold: false\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "np.random.seed(23) #kallisti\n", + "\n", + "plt.rcParams['figure.figsize'] = (4, 4)\n", + "plt.rcParams['figure.dpi'] = 150\n", + "sns.set()\n", + "\n", + "rectangle = pd.read_csv(\"data/rectangle_data.csv\")\n", + "rectangle.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In `numpy`, the SVD algorith is already written and can be called with `np.linalg.svd` ([documentation](https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html)). There are multiple versions of SVD; to get the version that we will follow, we need to set the `full_matrices` parameter to `False`." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "U, S, Vt = np.linalg.svd(rectangle, full_matrices = False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, let's examine `U`. As we can see, it's dimensions are $n \\times d$" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(100, 4)" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "U.shape\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The first 5 rows of `U` are shown below" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0123
0-0.1551510.064830-0.0299350.383675
1-0.038370-0.0891550.062019-0.544421
2-0.020357-0.0811380.0589970.053418
3-0.101519-0.076203-0.148160-0.122119
4-0.2189730.2064230.007274-0.058722
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3\n", + "0 -0.155151 0.064830 -0.029935 0.383675\n", + "1 -0.038370 -0.089155 0.062019 -0.544421\n", + "2 -0.020357 -0.081138 0.058997 0.053418\n", + "3 -0.101519 -0.076203 -0.148160 -0.122119\n", + "4 -0.218973 0.206423 0.007274 -0.058722" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame(U).head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "$\\Sigma$ is a little different in `NumPy`. Since the only useful values in the diagonal matrix $\\Sigma$ are the singular values on the diagonal axis, only those values are returned and they are stored in an array.\n", + "\n", + "Our `rectangle_data` has a rank of $3$, so we should have 3 non-zero singular values, **sorted from largest to smallest**." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([3.62932568e+02, 6.29904732e+01, 2.56544651e+01, 2.86529477e-15])" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "S" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "While none of the values are 0, notice that the last value is so small ($10^{-15}$) that it's practically $0$, and we can round the values. " + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([363., 63., 26., 0.])" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.round(S)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To get `S` in matrix format, we use `np.diag`." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[3.62932568e+02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],\n", + " [0.00000000e+00, 6.29904732e+01, 0.00000000e+00, 0.00000000e+00],\n", + " [0.00000000e+00, 0.00000000e+00, 2.56544651e+01, 0.00000000e+00],\n", + " [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 2.86529477e-15]])" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "Sm = np.diag(S)\n", + "Sm" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we can see that `Vt` is indeed a $d \\times d$ matrix" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(4, 4)" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "Vt.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0123
0-0.146436-0.129942-8.100201e-01-0.552756
1-0.192736-0.1891285.863482e-01-0.763727
2-0.7049570.7091557.951614e-030.008396
30.6666670.666667-5.551115e-17-0.333333
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3\n", + "0 -0.146436 -0.129942 -8.100201e-01 -0.552756\n", + "1 -0.192736 -0.189128 5.863482e-01 -0.763727\n", + "2 -0.704957 0.709155 7.951614e-03 0.008396\n", + "3 0.666667 0.666667 -5.551115e-17 -0.333333" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame(Vt)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To check that this SVD is a valid decomposition, we can reverse it and see if it matches our original table (it does yay!)." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0123
08.06.048.028.0
12.04.08.012.0
21.03.03.08.0
39.03.027.024.0
49.08.072.034.0
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3\n", + "0 8.0 6.0 48.0 28.0\n", + "1 2.0 4.0 8.0 12.0\n", + "2 1.0 3.0 3.0 8.0\n", + "3 9.0 3.0 27.0 24.0\n", + "4 9.0 8.0 72.0 34.0" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame(U @ Sm @ Vt).head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## PCA with SVD\n", + "\n", + "Principle Component Analysis PCA and Singular Value Decomposition can be easily mixed up, especially when you have to keep track of so many acronyms. Here is a quick summary:\n", + "\n", + "PCA: a data science procedure used for dimensionality reduction that uses SVD as one of the steps.\n", + "\n", + "SVD: a linear algebra algorithm that splits a matrix into 3 component parts." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Limited by Rank\n", + "\n", + "One key fact to remember is that the decomposition are not arbitrary. The *rank* of a matrix limits how small our inner dimensions can be if we want to perfectly recreate our matrix. The proof for this is out of scope.\n", + "\n", + "### Automatic Factorization\n", + "\n", + "Even if we know we have to factorize our matrix using an inner dimension of R, that still leaves a large space of solutions to traverse. What if we have a procedure to automatically factorize a rank R matrix into an R dimensional representation with some transformation matrix?\n", + "\n", + "- Lower dimensional representation avoids redundant features.\n", + "\n", + "- Imagine a 1000 dimensional dataset: If the rank is only 5, it’s much easier to do EDA after this mystery procedure.\n", + "\n", + "What if we wanted a 2-D representation? Its valuable to compress all of the data that is relevant onto as few dimensions as possible in order to plot it efficiently. Some 2D matrices yield better approximations than others. How well can we do?\n", + "\n", + "### Capturing Total Variance\n", + "\n", + "Variability of the data is the most important aspect that we want to maintain. It is the bulk of what we care about when we conduct EDA and modeling. Thus, oversimplifying the data may lead to incorrect analyses. \n", + "\n", + "Variance for a matrix is defined as the sum of the variances of the columns of a matrix. \n", + "\n", + "\n", + "\n", + "One approach is to simply keep the columns with the most variability. However, this leads to capturing of a small amount of variability present in our original dataset. Can we do better than just taking two columns from the original matrix?\n", + "\n", + "\n", + "\n", + "It turns out that PCA does a better job at preserving our variances. This is one of the best techniques for dimensionality reductions when it comes to preserving variances. Let's dig into how it works!\n", + "\n", + "\n", + "\n", + "\n", + "### PCA vs. SVD\n", + "\n", + "Principle Component Analysis PCA and Singular Value Decomposition can be easily mixed up, especially when you have to keep track of so many acronyms. Here is a quick summary:\n", + "\n", + "PCA: a data science procedure used for dimensionality reduction that uses SVD as one of the steps.\n", + "\n", + "SVD: a linear algebra algorithm that splits a matrix into 3 component parts.\n", + "\n", + "### PCA Procedure\n", + "\n", + "1. 
Center the data matrix by subtracting the mean of each attribute column.\n", + "\n", + "2. Use SVD to find all principal components $v_i$, $i \\in \\{1, ..., k\\}$, which fulfill the following criteria:\n", + " \n", + " - $v_i$ is a unit vector that linearly combines the attributes.\n", + " - $v_i$ gives a one-dimensional projection of the data.\n", + " - $v_i$ is chosen to minimize the sum of squared distances between each point and its projection onto $v_i$.\n", + " - $v_i$ is orthogonal to all previous principal components.\n", + "\n", + "References for SVD: [EECS 16B Note 14](https://eecs16b.org/notes/sp23/note14.pdf), [EECS 16B Note 15](https://eecs16b.org/notes/sp23/note15.pdf)\n", + "\n", + "### SVD\n", + "\n", + "Singular value decomposition (SVD) is an important concept in linear algebra.\n", + "\n", + "- We assume you have taken (or are taking) a linear algebra course.\n", + "- We will not explain SVD in its entirety—in particular, we will not prove:\n", + " - Why SVD is a valid decomposition of rectangular matrices\n", + " - Why PCA is an application of SVD.\n", + "- We know SVD is not covered in EECS 16A, nor Math 54 this semester.\n", + "\n", + "We will not go much into the theory and details of SVD. Instead, we will only cover what is needed for a data science interpretation.\n", + "\n", + "In NumPy, this algorithm is already written and can be called with `np.linalg.svd`.\n", + "\n", + "## Data Variance and Centering\n", + "\n", + "Formally, the $i$-th singular value tells us the **component score**, i.e., how much of the data variance is captured by the $i$-th principal component. Supposing the number of data points is $n$:\n", + "\n", + "$$\\text{i-th component score} = \\frac{(\\text{i-th singular value}^2)}{n}$$\n", + "\n", + "Summing up the component scores is equivalent to computing the total variance.\n", + "\n", + "**Data Centering**: PCA includes a data centering step that precedes the singular value decomposition; it is this centering that makes the component score formula above hold.\n", + "\n", + "## Proof of component score (out of scope)\n", + "\n", + "The proof defining the component score is out of scope for this class, but it is included below for your convenience.\n", + "\n", + "*Setup*: Consider the design matrix $X \\in \\mathbb{R}^{n \\times d}$, where the $j$-th column (corresponding to the $j$-th feature) is $x_j \\in \\mathbb{R}^n$ and the element in row $i$, column $j$ is $x_{ij}$. Further define $\\tilde{X}$ as the **centered** design matrix. The $j$-th column is $\\tilde{x}_j \\in \\mathbb{R}^n$ and the element in row $i$, column $j$ is $\\tilde{x}_{ij} = x_{ij} - \\bar{x_j}$, where $\\bar{x_j}$ is the mean of the $x_j$ column vector from the original $X$. \n", + "\n", + "*Variance*: Construct the **covariance matrix**: $\\frac{1}{n} \\tilde{X}^T \\tilde{X} \\in \\mathbb{R}^{d \\times d}$. 
The $j$-th element along the diagonal is the **variance** of the $j$-th column of the original design matrix $X$:\n", + "\n", + "$$\\left( \\frac{1}{n} \\tilde{X}^T \\tilde{X} \\right)_{jj} = \\frac{1}{n} \\tilde{x}_j ^T \\tilde{x}_j = \\frac{1}{n} \\sum_{i=1}^n (\\tilde{x}_{ij} )^2 = \\frac{1}{n} \\sum_{i=1}^n (x_{ij} - \\bar{x_j})^2$$\n", + "\n", + "\n", + "*SVD*: Suppose singular value decomposition of the *centered* design matrix $\\tilde{X}$ yields $\\tilde{X} = U \\Sigma V^T$, where $U \\in \\mathbb{R}^{n \\times d}$ and $V \\in \\mathbb{R}^{d \\times d}$ are matrices with orthonormal columns, and $\\Sigma \\in \\mathbb{R}^{d \\times d}$ is a diagonal matrix with the singular values of $\\tilde{X}$.\n", + "\n", + "$$\\begin{aligned}\n", + "\\tilde{X}^T \\tilde{X} &= (U \\Sigma V^T )^T (U \\Sigma V^T) \\\\\n", + "&= V \\Sigma U^T U \\Sigma V^T & (\\Sigma^T = \\Sigma) \\\\\n", + "&= V \\Sigma^2 V^T & (U^T U = I) \\\\\n", + "\\frac{1}{n} \\tilde{X}^T \\tilde{X} &= \\frac{1}{n} V \\Sigma^2 V^T = V \\left( \\frac{1}{n} \\Sigma^2 \\right) V^T \\\\\n", + "\\frac{1}{n} \\tilde{X}^T \\tilde{X} V &= V \\left( \\frac{1}{n} \\Sigma^2 \\right) V^T V = V \\left( \\frac{1}{n} \\Sigma^2 \\right) & \\text{(right multiply by }V \\rightarrow V^T V = I \\text{)} \\\\\n", + "V^T \\frac{1}{n} \\tilde{X}^T \\tilde{X} V &= V^T V \\left( \\frac{1}{n} \\Sigma^2 \\right) = \\frac{1}{n} \\Sigma^2 & \\text{(left multiply by }V^T \\rightarrow V^T V = I \\text{)} \\\\\n", + "\\left( \\frac{1}{n} \\tilde{X}^T \\tilde{X} \\right)_{jj} &= \\frac{1}{n}\\sigma_j^2 & \\text{(define }\\sigma_j\\text{ as the }j\\text{-th singular value)} \\\\\n", + "\\frac{1}{n} \\sigma_j^2 &= \\frac{1}{n} \\sum_{i=1}^n (x_{ij} - \\bar{x_j})^2\n", + "\\end{aligned}$$\n", + "\n", + "The last line defines the $j$-th component score.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/pca_2/pca_2.ipynb b/pca_2/pca_2.ipynb new file mode 100644 index 00000000..b450d3ca --- /dev/null +++ b/pca_2/pca_2.ipynb @@ -0,0 +1,167 @@ +{ + "cells": [ + { + "cell_type": "raw", + "metadata": {}, + "source": [ + "---\n", + "title: PCA II\n", + "format:\n", + " html:\n", + " toc: true\n", + " toc-depth: 5\n", + " toc-location: right\n", + " code-fold: false\n", + " theme:\n", + " - cosmo\n", + " - cerulean\n", + " callout-icon: false\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "::: {.callout-note collapse=\"true\"}\n", + "## Learning Outcomes\n", + "\n", + "- Develop a deeper understanding of how to interpret Principal Components Analysis (PCA).\n", + "- See applications of PCA to some real-world contexts.\n", + ":::\n", + "\n", + "## PCA Review\n", + "\n", + "### Using Principal Components\n", + "\n", + "Steps to obtain Principal Components via SVD:\n", + "\n", + "1. Center the data matrix by subtracting the mean of each attribute column elementwise from the column. \n", + "\n", + "2. 
To find the p **principal components**:\n", + " - Compute the SVD of the data matrix ($X = U{\\Sigma}V^{T}$)\n", + " - The first $p$ columns of $U{\\Sigma}$ (or equivalently, $XV$) contain the $p$ principal components of $X$.\n", + " \n", + "The principal components are a low-dimension representation that captures as much of the original data's total variance as possible. \n", + "\n", + "The **component scores** sum to the total variance if we center our data:\n", + "$$\\text{component score}_i = \\frac{\\sigma_i^{2}}{N}$$\n", + "\n", + "We can also use the SVD to get a rank-p approximation of $X$, $X_p$:\n", + "\n", + "$$X_p = \\sum_{j = 1}^{p} \\sigma_ju_jv_j^{T} $$\n", + "\n", + "where $\\sigma_j$ is the $j$-th singular value of $X$, $u_j$ is the $j$-th column of $U$, and $v_j$ is the $j$-th column of $V$.\n", + "\n", + "\n", + "## Case Study: House of Representatives Voting\n", + "\n", + "Let's examine how the House of Representatives (of the 116th Congress, 1st session) voted in the month of September 2019.\n", + "\n", + "Specifically, we’ll look at the records of roll call votes. From the U.S. Senate ([link](https://www.senate.gov/reference/Index/Votes.htm)):\n", + "\n", + "- Roll call votes occur when a representative or senator votes \"yea\" or \"nay,\" so that the names of members voting on each side are recorded. A voice vote is a vote in which those in favor or against a measure say \"yea\" or \"nay,\" respectively, without the names or tallies of members voting on each side being recorded.\n", + "\n", + "\n", + "**Do legislators' roll call votes show a relationship with their political party?**\n", + "\n", + "Please visit this [link](https://data100.datahub.berkeley.edu/hub/login?next=%2Fhub%2Fuser-redirect%2Fgit-pull%3Frepo%3Dhttps%253A%252F%252Fgithub.com%252FDS-100%252Fsp23%26branch%3Dmain%26urlpath%3Dlab%252Ftree%252Fsp23%252Flecture%252Flec26%252Flec26-votes.ipynb) to see the full Jupyter notebook demo. \n", + "\n", + "\n", + "As shown in the demo, the primary goal of PCA is to transform observations from high-dimensional data down to low dimensions through linear transformations. \n", + "\n", + "A related goal of PCA is for the low-dimension representation of the data to capture the variability of the original data. For example, if the first two singular values are large and the others are relatively small, then two dimensions are probably enough to describe most of what distinguishes one observation from another. However, if this is not the case, then a PCA scatter plot is probably omitting lots of information. \n", + "\n", + "We can use the following formulas to quantify the amount each principal component contributes to the total variance:\n", + "\n", + "$$\\text{component score} = \\frac{\\sigma_i^{2}}{N}$$\n", + "\n", + "$$\\text{total variance} = \\text{sum(component scores)}$$\n", + "\n", + "$$\\text{variance ratio of principal component i} = \\frac{\\text{component score i}}{\\text{total variance}}$$\n", + "\n", + "\n", + "## Interpreting PCA\n", + "\n", + "### Scree Plots\n", + "\n", + "A **scree plot** shows the size of the diagonal values of $\\Sigma^{2}$, largest first.\n", + "\n", + "Scree plots help us visually determine the number of dimensions needed to describe the data reasonably completely. The singular values that fall in the region of the plot after a large drop-off correspond to principal components that are **not** needed to describe the data, since they explain a relatively low proportion of the total variance of the data. 
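+ "\n",
+ "A minimal sketch of these computations in `numpy` (reusing the rectangle data from the PCA I notes; the file path is an assumption about where that CSV lives):\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "X = pd.read_csv(\"data/rectangle_data.csv\").to_numpy()  # any (n x d) numeric data matrix\n",
+ "n = X.shape[0]\n",
+ "\n",
+ "X_centered = X - X.mean(axis=0)          # step 1: center each column\n",
+ "U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)\n",
+ "\n",
+ "pcs = U * S                              # columns of U @ diag(S), i.e. all principal components\n",
+ "first_two = pcs[:, :2]                   # a 2-D representation of the data\n",
+ "\n",
+ "component_scores = S**2 / n              # variance captured by each principal component\n",
+ "variance_ratios = component_scores / component_scores.sum()\n",
+ "print(variance_ratios)                   # look for the scree-style drop-off here\n",
+ "```\n",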
\n", + "\n", + "### PC with SVD\n", + "\n", + "After finding the SVD of $X$:\n", + "\n", + "slide15\n", + "\n", + "\n", + "We can derive the principal components of the data. Specifically, the first $n$ rows of $V^{T}$ are directions for the n principal components.\n", + "\n", + "### Columns of V are the Directions\n", + "\n", + "slide16\n", + "\n", + "The elements of each column of $V$ (row of $V^{T}$) rotate the original feature vectors into a principal component. \n", + "\n", + "The first column of V indicates how each feature contributes (e.g. positive, negative, etc.) to principal component 1. \n", + "\n", + "### Biplots\n", + "\n", + "Biplots superimpose the directions onto the plot of principal component 2 vs. principal component 1. \n", + "\n", + "Vector $j$ corresponds to the direction for feature $j$ (e.g. $v_1j, v_2j$).\n", + " - There are several ways to scale biplots vectors; in this course we plot the direction itself. \n", + " - For other scalings, which can lead to more interpretable directions/loadings, see [SAS biplots](https://blogs.sas.com/content/iml/2019/11/06/what-are-biplots.html)\n", + " \n", + "Through biplots, we can interpret how features correlate with the principal components shown: positively, negatively, or not much at all. \n", + "\n", + "\n", + "slide17_2\n", + "\n", + "The directions of the arrow are (v1, v2) where v1 and v2 are how that specific feature column contributes to PC1 and PC2, respectively. v1 and v2 are elements of V's first and second columns, respectively (i.e., V^T's first two rows).\n", + "\n", + "Say we were considering feature 3, and say that was the green arrow here (pointing bottom right).\n", + "\n", + "* v1 and v2 are the third elements of the respective columns in V. They are what scale feature 3's column vector in the linear transformation to PC1 and PC2, respectively.\n", + "\n", + "* Here we would infer that v1 (in the x/pc1-direction) is positive, meaning that a linear increase in feature 3 would correspond to a linear increase of PC1, meaning feature 3 and PC1 are positively correlated.\n", + "\n", + "* v2 (in the y/pc2-direction) is negative, meaning a linear increase in feature 3 would result correspond to a linear decrease in PC2, meaning feature 3 and PC2 are negatively correlated.\n", + "\n", + "## Applications of PCA\n", + "\n", + "### PCA in Biology\n", + "\n", + "PCA is commonly used in biomedical contexts, which have many named variables!\n", + "\n", + "1. To cluster data ([Paper 1](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2680-1), [Paper 2](https://www.science.org/doi/10.1126/scirobotics.abk2378))\n", + "\n", + "2. To identify correlated variables ([interpret](https://docs.google.com/presentation/d/1-aDu0ILCkPx3iCcJGB3YXci-L4g90Q6AarXU6wffLB8/edit#slide=id.g62cb86badb_0_1128) rows of $V^{T}$ as linear coefficients) ([Paper 3](https://www.nature.com/articles/s41598-017-05714-1)). 
Uses [biplots](https://www.google.com/url?q=https://www.geo.fu-berlin.de/en/v/soga/Geodata-analysis/Principal-Component-Analysis/principal-components-basics/Interpretation-and-visualization/index.html%23:~:text%3DThe%2520biplot%2520is%2520a%2520very,in%2520a%2520single%2520biplot%2520display.%26text%3DThe%2520plot%2520shows%2520the%2520observations,principal%2520components%2520(synthetic%2520variables).&sa=D&source=editors&ust=1682131633152964&usg=AOvVaw2H9SOeMP5kUS890Fkhfthx).\n", + "\n", + "### Image Classification\n", + "\n", + "In machine learning, PCA is often used as a preprocessing step prior to training a supervised model. \n", + "\n", + "See the following [demo](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Fsp23&branch=main&urlpath=lab%2Ftree%2Fsp23%2Flecture%2Flec26%2Flec26-fashion-mnist.ipynb) to see how PCA is useful for building an image classification model based on the MNIST-Fashion dataset.\n", + "\n", + "### Why PCA, then Model?\n", + "\n", + "1. Reduces dimensionality\n", + " - speeds up training, reduces number of features, etc.)\n", + "2. Avoids multicollinearity in new features (i.e. principal components)\n", + "\n", + "slide21\n" + ] + } + ], + "metadata": { + "kernelspec": { + "name": "python3", + "language": "python", + "display_name": "Python 3 (ipykernel)" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file