diff --git a/logistic_regression_2/images/linear_seperability_1D.png b/logistic_regression_2/images/linear_separability_1D.png similarity index 100% rename from logistic_regression_2/images/linear_seperability_1D.png rename to logistic_regression_2/images/linear_separability_1D.png diff --git a/logistic_regression_2/images/linear_seperability_2D.png b/logistic_regression_2/images/linear_separability_2D.png similarity index 100% rename from logistic_regression_2/images/linear_seperability_2D.png rename to logistic_regression_2/images/linear_separability_2D.png diff --git a/logistic_regression_2/logistic_reg_2.ipynb b/logistic_regression_2/logistic_reg_2.ipynb index d364ef4b..2a1dccd7 100644 --- a/logistic_regression_2/logistic_reg_2.ipynb +++ b/logistic_regression_2/logistic_reg_2.ipynb @@ -30,7 +30,7 @@ "* Introduce new metrics for model performance\n", "::: \n", "\n", - "Today, we will continue studying the Logistic Regression model. We'll discussion decision boundaries that help inform the classification of a particular prediction. Then, we'll pick up from last lecture's discussion of cross-entropy loss, study a few of its pitfalls, and learn potential remedies. We will also provide an implementation of `sklearn`'s logistic regression model. Lastly, we'll return to decision rules and discus metrics that allow us to determine our model's performance in different scenarios. \n", + "Today, we will continue studying the Logistic Regression model. We'll discuss decision boundaries that help inform the classification of a particular prediction. Then, we'll pick up from last lecture's discussion of cross-entropy loss, study a few of its pitfalls, and learn potential remedies. We will also provide an implementation of `sklearn`'s logistic regression model. Lastly, we'll return to decision rules and discus metrics that allow us to determine our model's performance in different scenarios. \n", "\n", "This will introduce us to the process of **thresholding** -- a technique used to *classify* data from our model's predicted probabilities, or $P(Y=1|x)$. In doing so, we'll focus on how these thresholding decisions affect the behavior of our model. We will learn various evaluation metrics useful for binary classification, and apply them to our study of logistic regression.\n", "\n", @@ -55,14 +55,14 @@ " \n", "The threshold is often set to $T = 0.5$, but *not always*. We'll discuss why we might want to use other thresholds $T \\neq 0.5$ later in this lecture.\n", "\n", - "Using our decision rule, we can define a **decision boundary** as the “line” the splits the data into classes based on its features. For logistic regression, the decision boundary is a **hyperplane** -- a linear combination of the features in p-dimensions -- and we can recover it from the final logistic regression model. For example, if we have a model with 2 features (2D), we have $\\theta = [\\theta_0 \\theta_1 \\theta_2]$ (including the intercept term), and we can solve for the decision boundary like so: \n", + "Using our decision rule, we can define a **decision boundary** as the “line” the splits the data into classes based on its features. For logistic regression, the decision boundary is a **hyperplane** -- a linear combination of the features in p-dimensions -- and we can recover it from the final logistic regression model. 
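Before working through that recovery for a model with two features, here is a minimal sketch of the decision rule itself in code. It uses `sklearn`'s `LogisticRegression` on a small, hypothetical one-feature dataset; `predict_proba` returns $P(Y=1|x)$ in its second column (for labels coded as 0 and 1), and we compare that probability against a threshold `T`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, binary labels.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

def classify(model, X, T=0.5):
    """Decision rule: predict class 1 when P(Y = 1 | x) >= T, else class 0."""
    p = model.predict_proba(X)[:, 1]  # column 1 holds P(Y = 1 | x) for labels {0, 1}
    return (p >= T).astype(int)

print(classify(model, X, T=0.5))  # the usual threshold
print(classify(model, X, T=0.9))  # a stricter threshold can flip some 1s back to 0
```

Note that `model.predict` effectively applies $T = 0.5$; to use any other threshold, we work with `predict_proba` directly, as above. The decision boundary itself can then be read off from the fitted coefficients.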
For example, if we have a model with 2 features (2D), we have $\\theta = [\\theta_0, \\theta_1, \\theta_2]$ (including the intercept term), and we can solve for the decision boundary like so: \n", "\n", "$$\n", "\\begin{align}\n", "T &= \\frac{1}{1 + e^{\\theta_0 + \\theta_1 * \\text{feature1} + \\theta_2 * \\text{feature2}}} \\\\\n", - "1 + e^{\\theta_0 + \\theta_1 * \\text{feature1} + \\theta_2 * \\text{feature2}} &= \\frac{1}{T} \\\\\n", - "e^{\\theta_0 + \\theta_1 * \\text{feature1} + \\theta_2 * \\text{feature2}} &= \\frac{1}{T} - 1 \\\\\n", - "\\theta_0 + \\theta_1 * \\text{feature1} + \\theta_2 * \\text{feature2} &= log(\\frac{1}{T} - 1)\n", + "1 + e^{\\theta_0 + \\theta_1 \\cdot \\text{feature1} + \\theta_2 \\cdot \\text{feature2}} &= \\frac{1}{T} \\\\\n", + "e^{\\theta_0 + \\theta_1 \\cdot \\text{feature1} + \\theta_2 \\cdot \\text{feature2}} &= \\frac{1}{T} - 1 \\\\\n", + "\\theta_0 + \\theta_1 \\cdot \\text{feature1} + \\theta_2 \\cdot \\text{feature2} &= \\log(\\frac{1}{T} - 1)\n", "\\end{align} \n", "$$\n", "\n", @@ -81,21 +81,21 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Linear Seperability and Regularization\n", + "## Linear Separability and Regularization\n", "\n", "A classification dataset is said to be **linearly separable** if there exists a hyperplane among input features $x$ that separates the two classes $y$. \n", "\n", - "Linear seperability in 1D can be found with a rugplot of a single feature. For example, notice how the plot on the bottom left is linearly seperable along the vertical line $x=0$. However, no such line perfectly seperates the two classes on the bottom right.\n", + "Linear separability in 1D can be found with a rugplot of a single feature. For example, notice how the plot on the bottom left is linearly separable along the vertical line $x=0$. However, no such line perfectly separates the two classes on the bottom right.\n", "\n", - "
linear_seperability_1D
\n", + "
linear_separability_1D
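In code, checking 1D separability amounts to asking whether some vertical line has all of one class on its left and all of the other class on its right. Below is a minimal sketch, assuming hypothetical arrays `x` (the feature) and `y` (0/1 labels); the two classes are separable exactly when their ranges do not overlap.

```python
import numpy as np

def separable_1d(x, y):
    """Return True if some vertical line x = c perfectly separates the two classes."""
    x0, x1 = x[y == 0], x[y == 1]
    # Separable in 1D exactly when the two classes' ranges do not overlap.
    return x0.max() < x1.min() or x1.max() < x0.min()

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1])
print(separable_1d(x, y))  # True: any line between -1 and 1, such as x = 0, separates the classes
```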
\n", "\n", - "This same definition holds in higher dimensions. If there are two features, the seperating hyperplane must exist in two dimensions (any line of the form $y=mx+b$). We can visualize this using a scatter plot.\n", + "This same definition holds in higher dimensions. If there are two features, the separating hyperplane must exist in two dimensions (any line of the form $y=mx+b$). We can visualize this using a scatter plot.\n", "\n", - "
linear_seperability_1D
\n", + "
linear_separability_2D
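Once a logistic regression model is fit, we can recover this boundary from its coefficients. The sketch below uses `sklearn` on a hypothetical 2-feature dataset; at the usual threshold $T = 0.5$, the boundary is the set of points where $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$, so we can solve for $x_2$ as a function of $x_1$ and draw the line.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 2-feature data with binary labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)
theta_0 = model.intercept_[0]      # intercept term
theta_1, theta_2 = model.coef_[0]  # one coefficient per feature

# At T = 0.5, the boundary satisfies theta_0 + theta_1 * x1 + theta_2 * x2 = 0,
# so x2 = -(theta_0 + theta_1 * x1) / theta_2 (assuming theta_2 != 0).
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 50)
x2_boundary = -(theta_0 + theta_1 * x1) / theta_2
# To visualize: plt.scatter(X[:, 0], X[:, 1], c=y); plt.plot(x1, x2_boundary)
```

For a threshold other than $0.5$, only the constant on the right-hand side of the boundary equation changes (per the derivation above); the slope of the line stays the same.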
\n", "\n", "This sounds great! When the dataset is linearly separable, a logistic regression classifier can perfectly assign datapoints into classes. However, (unexpected) complications may arise when data is linearly separable. Consider the toy dataset with 2 points and only a single feature $x$:\n", "\n", - "
toy_linear_seperability
\n", + "
toy_linear_separability
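As the discussion below explains, no finite $\theta$ minimizes the loss on this dataset. We can observe the symptom empirically: `sklearn`'s `LogisticRegression` applies $L_2$ regularization by default, with `C` as the *inverse* regularization strength, and as we weaken the regularization the fitted weight keeps growing. This is a minimal sketch on a hypothetical two-point dataset like the one pictured.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# A linearly separable toy dataset: one feature, two points.
X = np.array([[-1.0], [1.0]])
y = np.array([0, 1])

for C in [1, 100, 10_000, 1_000_000]:
    # Larger C means weaker regularization, so the fitted weight is allowed to grow.
    model = LogisticRegression(C=C, max_iter=10_000).fit(X, y)
    print(f"C = {C:>9,}: theta_1 = {model.coef_[0, 0]:.2f}")
```

Keeping some regularization (a moderate `C`) is one standard remedy for this divergence.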
\n", "\n", "The optimal $\\theta$ value that minimizes loss pushes the predicted probabilities of the data points to their true class.\n", "\n", @@ -111,7 +111,7 @@ "\n", "The diverging weights cause the model to be overconfident. For example, consider the new point $(x, y) = (0.5, 1)$. Following the behavior above, our model will incorrectly predict $p=0$, and a thus, $\\hat y = 0$.\n", "\n", - "
toy_linear_seperability
\n", + "
toy_linear_separability
\n", "\n", "The loss incurred by this misclassified point is infinite.\n", "\n", @@ -169,7 +169,7 @@ "## Performance Metrics\n", "You might be thinking, if we've already introduced cross-entropy loss, why do we need additional ways of assessing how well our models perform? In linear regression, we made numerical predictions and used a loss function to determine how “good” these predictions were. In logistic regression, our ultimate goal is to classify data – we are much more concerned with whether or not each datapoint was assigned the correct class using the decision rule. As such, we are interested in the *quality* of classifications, not the predicted probabilities.\n", "\n", - "The most basic evaluation metric is **accuracy** -- the proportion of correctly classified points.\n", + "The most basic evaluation metric is **accuracy**, that is, the proportion of correctly classified points.\n", "\n", "$$\\text{accuracy} = \\frac{\\# \\text{ of points classified correctly}}{\\# \\text{ of total points}}$$\n", "\n", @@ -185,7 +185,7 @@ "- **Model 1**: Our first model classifies every email as non-spam. The model's accuracy is high ($\\frac{95}{100} = 0.95$), but it doesn't detect any spam emails. Despite the high accuracy, this is a bad model.\n", "- **Model 2**: The second model classifies every email as spam. The accuracy is low ($\\frac{5}{100} = 0.05$), but the model correctly labels every spam email. Unfortunately, it also misclassifies every non-spam email.\n", "\n", - "As this example illustrates, accuracy is not always a good metric for classification, particularly when your data have class imbalance (e.g., very few 1’s compared to 0’s).\n", + "As this example illustrates, accuracy is not always a good metric for classification, particularly when your data could exhibit class imbalance (e.g., very few 1’s compared to 0’s).\n", "\n", "### Types of Classification\n", "There are 4 different different classifications that our model might make:\n", @@ -405,7 +405,7 @@ "\n", "\n", "#### [Extra] What is the “worst” AUC and why is it 0.5? \n", - "On the other hand, a terrible model will have an AUC closer to 0.5. Random predictors randomly predicts P(Y = 1 | x) to be uniformly between 0 and 1. This indicates the classifier is not able to distinguish between positive and negative classes, and thus, randomly predicts one of the two.\n", + "On the other hand, a terrible model will have an AUC closer to 0.5. Random predictors randomly predicts $P(Y = 1 | x)$ to be uniformly between 0 and 1. This indicates the classifier is not able to distinguish between positive and negative classes, and thus, randomly predicts one of the two.\n", "\n", "
roc_curve_worst_predictor
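In practice, we rarely trace out the ROC curve by hand. The sketch below uses `sklearn.metrics` with hypothetical arrays `y_true` (0/1 labels) and `p_hat` (predicted probabilities $P(Y=1|x)$): `roc_curve` returns the false positive rates, true positive rates, and the thresholds that produce them, and `roc_auc_score` computes the area under the curve directly.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities P(Y = 1 | x).
y_true = np.array([0, 0, 1, 1, 0, 1])
p_hat = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, p_hat)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, p_hat))  # 1.0 is a perfect ranking; about 0.5 is no better than random
```

Plotting `fpr` against `tpr` (for example with `plt.plot(fpr, tpr)`) reproduces curves like the one pictured above.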
" ] @@ -419,7 +419,7 @@ "$$\n", "t_i = \\phi(x_i)^T \\theta \\\\\n", "p_i = \\sigma(t_i) \\\\\n", - "t_i = log(\\frac{p_i}{1 - p_i}) \\\\\n", + "t_i = \\log(\\frac{p_i}{1 - p_i}) \\\\\n", "1 - \\sigma(t_i) = \\sigma(-t_i) \\\\\n", "\\frac{d}{dt} \\sigma(t) = \\sigma(t) \\sigma(-t)\n", "$$\n", @@ -427,21 +427,21 @@ "Now, we can simplify the cross-entropy loss\n", "$$\n", "\\begin{align}\n", - "y_i log(p_i) + (1 - y_i)log(1 - p_i) &= y_i log(\\frac{p_i}{1 - p_i}) + log(1 - p_i) \\\\\n", - "&= y_i \\phi(x_i)^T + log(\\sigma(-\\phi(x_i)^T \\theta))\n", + "y_i \\log(p_i) + (1 - y_i) \\log(1 - p_i) &= y_i \\log(\\frac{p_i}{1 - p_i}) + \\log(1 - p_i) \\\\\n", + "&= y_i \\phi(x_i)^T + \\log(\\sigma(-\\phi(x_i)^T \\theta))\n", "\\end{align}\n", "$$\n", "\n", "Hence, the optimal $\\hat{\\theta}$ is \n", - "$$\\argmin_{\\theta} - \\frac{1}{n} \\sum_{i=1}^n (y_i \\phi(x_i)^T + log(\\sigma(-\\phi(x_i)^T \\theta)))$$ \n", + "$$\\argmin_{\\theta} - \\frac{1}{n} \\sum_{i=1}^n (y_i \\phi(x_i)^T + \\log(\\sigma(-\\phi(x_i)^T \\theta)))$$ \n", "\n", - "We want to minimize $$L(\\theta) = - \\frac{1}{n} \\sum_{i=1}^n (y_i \\phi(x_i)^T + log(\\sigma(-\\phi(x_i)^T \\theta)))$$\n", + "We want to minimize $$L(\\theta) = - \\frac{1}{n} \\sum_{i=1}^n (y_i \\phi(x_i)^T + \\log(\\sigma(-\\phi(x_i)^T \\theta)))$$\n", "\n", "So we take the derivative \n", "$$ \n", "\\begin{align}\n", - "\\triangledown_{\\theta} L(\\theta) &= - \\frac{1}{n} \\sum_{i=1}^n \\triangledown_{\\theta} y_i \\phi(x_i)^T + \\triangledown_{\\theta} log(\\sigma(-\\phi(x_i)^T \\theta)) \\\\\n", - "&= - \\frac{1}{n} \\sum_{i=1}^n y_i \\phi(x_i) + \\triangledown_{\\theta} log(\\sigma(-\\phi(x_i)^T \\theta)) \\\\\n", + "\\triangledown_{\\theta} L(\\theta) &= - \\frac{1}{n} \\sum_{i=1}^n \\triangledown_{\\theta} y_i \\phi(x_i)^T + \\triangledown_{\\theta} \\log(\\sigma(-\\phi(x_i)^T \\theta)) \\\\\n", + "&= - \\frac{1}{n} \\sum_{i=1}^n y_i \\phi(x_i) + \\triangledown_{\\theta} \\log(\\sigma(-\\phi(x_i)^T \\theta)) \\\\\n", "&= - \\frac{1}{n} \\sum_{i=1}^n y_i \\phi(x_i) + \\frac{1}{\\sigma(-\\phi(x_i)^T \\theta)} \\triangledown_{\\theta} \\sigma(-\\phi(x_i)^T \\theta) \\\\\n", "&= - \\frac{1}{n} \\sum_{i=1}^n y_i \\phi(x_i) + \\frac{\\sigma(-\\phi(x_i)^T \\theta)}{\\sigma(-\\phi(x_i)^T \\theta)} \\sigma(\\phi(x_i)^T \\theta)\\triangledown_{\\theta} \\sigma(-\\phi(x_i)^T \\theta) \\\\\n", "&= - \\frac{1}{n} \\sum_{i=1}^n (y_i - \\sigma(\\phi(x_i)^T \\theta)\\phi(x_i))\n", @@ -459,7 +459,7 @@ "### Stochastic Gradient Descent Update Rule\n", "$$\\theta^{(0)} \\leftarrow \\text{initial vector (random, zeros, ...)} $$\n", "\n", - "For $\\tau$ from 0 to convergence, let $B ~ \\text{Random subset of indices}$. \n", + "For $\\tau$ from 0 to convergence, let $B$ ~ $\\text{Random subset of indices}$. 
\n", "$$ \\theta^{(\\tau + 1)} \\leftarrow \\theta^{(\\tau)} + \\rho(\\tau)\\left( \\frac{1}{|B|} \\sum_{i \\in B} \\triangledown_{\\theta} L_i(\\theta) \\mid_{\\theta = \\theta^{(\\tau)}}\\right) $$" ] } diff --git a/sql_I/sql_I.ipynb b/sql_I/sql_I.ipynb new file mode 100644 index 00000000..c0af8812 --- /dev/null +++ b/sql_I/sql_I.ipynb @@ -0,0 +1,564 @@ +{ + "cells": [ + { + "cell_type": "raw", + "metadata": {}, + "source": [ + "---\n", + "title: SQL I\n", + "execute:\n", + " echo: true\n", + "format:\n", + " html:\n", + " code-fold: false\n", + " code-tools: true\n", + " toc: true\n", + " toc-title: SQL I\n", + " page-layout: full\n", + " theme:\n", + " - cosmo\n", + " - cerulean\n", + " callout-icon: false\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "::: {.callout-note collapse=\"false\"}\n", + "## Learning Outcomes\n", + "* Identify situations where a database may be preferred over a CSV file\n", + "* Write basic SQL queries using `SELECT`, `FROM`, `WHERE`, `ORDER BY`, `LIMIT`, and `OFFSET`\n", + "* Perform aggregations using `GROUP BY`\n", + ":::\n", + "\n", + "So far in the course, we have made our way through the entire data science lifecycle: we learned how to load and explore a dataset, formulate questions, and use the tools of prediction and inference to come up with answers. For the remaining weeks of the semester, we are going to make a second pass through the lifecycle, this time doing so with a different set of tools, ideas, and abstractions. \n", + "\n", + "## Databases\n", + "With this goal in mind, let's go back to the very beginning of the lifecycle. We first started our work in data analysis by looking at the `pandas` library, which offered us powerful tools to manipulate tabular data stored in (primarily) CSV files. CSVs work well when analyzing relatively small datasets that don't need to be shared across many users. In research and industry, however, data scientists often need to access enormous bodies of data that cannot be easily stored in a CSV format. Collaborating with others when working with CSVs can also be tricky – a real-world data scientist may run into problems when multiple users try to make modifications, or, even worse, security issues with who should and should not have access to the data. \n", + "\n", + "A **database** is a large, organized collection of data. Databases are administered by **Database Management Systems (DBMS)**, which are software systems that store, manage, and facilitate access to one or more databases. Databases help mitigate many of the issues that come with using CSVs for data storage: they provide reliable storage that can survive system crashes or disk failures, are optimized to compute on data that does not fit into memory, and contain special data structures to improve performance. Using databases rather than CSVs offers further benefits from the standpoint of data management. A DBMS can apply settings that configure how data is organized, block certain data anomalies (for example, enforcing non-negative weights or ages), and determine who is allowed access to the data. It can also ensure safe concurrent operations where multiple users reading and writing to the database will not lead to fatal errors.\n", + "\n", + "As you may have guessed, we can't use our usual `pandas` methods to work with data in a database. 
Instead, we'll turn to Structured Query Language.\n", + "\n", + "## Structured Query Language and Database Schema\n", + "\n", + "**Structured Query Language**, or **SQL** (commonly pronounced \"sequel,\" though this is the subject of [fierce debate](https://patorjk.com/blog/2012/01/26/pronouncing-sql-s-q-l-or-sequel/)), is a special programming language designed to communicate with databases. You may have encountered it in classes like CS 61A or Data C88C before. Unlike Python, it is a **declarative programming language** – this means that rather than writing the exact logic needed to complete a task, a piece of SQL code \"declares\" what the desired final output should be and leaves the program to determine what logic should be implemented. \n", + "\n", + "It is important to reiterate that SQL is an entirely different language from Python. However, Python *does* have special engines that allow us to run SQL code in a Jupyter notebook. While this is typically not how SQL is used outside of an educational setting, we will be using this workflow to illustrate how SQL queries are constructed using the tools we've already worked with this semester. You will learn more about how to run SQL queries in Jupyter in Lab 10.\n", + "\n", + "The syntax below will seem unfamiliar to you; for now, just focus on understanding the output displayed. We will clarify the SQL code in a bit.\n", + "\n", + "To start, we'll look at a database called `basic_examples.db`." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# Load the SQL Alchemy Python library\n", + "import sqlalchemy\n", + "import pandas as pd" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# load %%sql cell magic\n", + "%load_ext sql" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Connect to the SQLite database `basic_examples.db`." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "sqlite:///data/basic_examples.db " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT * \n", + "FROM sqlite_master\n", + "WHERE type=\"table\"" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The summary above displays information about the database. The database contains four tables, named `sqlite_sequence`, `Dragon`, `Dish`, and `Scene`. The rightmost column above lists the command that was used to construct each table. \n", + "\n", + "Let's look more closely at the command used to create the `Dragon` table (the second entry above). \n", + "\n", + " CREATE TABLE Dragon (name TEXT PRIMARY KEY,\n", + " year INTEGER CHECK (year >= 2000),\n", + " cute INTEGER)\n", + "\n", + "The statement `CREATE TABLE` is used to specify the **schema** of the table – a description of what logic is used to organize the table. Schema follows a set format:\n", + "\n", + "* `ColName`: the name of a column\n", + "* `DataType`: the type of data to be stored in a column. Some of the most common SQL data types are `INT` (integers), `FLOAT` (floating point numbers), `TEXT` (strings), `BLOB` (arbitrary data, such as audio/video files), and `DATETIME` (a date and time).\n", + "* `Constraint`: some restriction on the data to be stored in the column. 
Common constraints are `CHECK` (data must obey a certain condition), `PRIMARY KEY` (designate a column as the table's primary key), `NOT NULL` (data cannot be null), and `DEFAULT` (a default fill value if no specific entry is given).\n", + "\n", + "We see that `Dragon` contains five columns. The first of these, `\"name\"`, contains text data. It is designated as the **primary key** of the table; that is, the data contained in `\"name\"` uniquely identifies each entry in the table. Because `\"name\"` is the primary key of the table, no two entries in the table can have the same name – a given value of `\"name\"` is unique to each dragon. The `\"year\"` column contains integer data, with the constraint that year values must be greater than or equal to 2000. The final column, `\"cute\"`, contains integer data with no restrictions on allowable values. \n", + "\n", + "We can verify this by viewing `Dragon` itself." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT *\n", + "FROM Dragon" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Database tables (also referred to as **relations**) are structured much like `DataFrame`s in `pandas`. Each row, sometimes called a **tuple**, represents a single record in the dataset. Each column, sometimes called an **attribute** or **field**, describes some feature of the record. \n", + "\n", + "## `SELECT`ing From Tables\n", + "\n", + "To extract and manipulate data stored in a SQL table, we will need to familiarize ourselves with the syntax to write pieces of SQL code, which we call **queries**. \n", + "\n", + "The basic unit of a SQL query is the `SELECT` statement. `SELECT` specifies what columns we would like to extract from a given table. We use `FROM` to tell SQL the table from which we want to `SELECT` our data. " + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT *\n", + "FROM Dragon" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In SQL, `*` means \"everything.\" The query above grabs *all* the columns in `Dragon` and displays them in the outputted table. We can also specify a specific subset of columns to be `SELECT`ed. Notice that the outputted columns appear in the order that they were `SELECT`ed." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT cute, year\n", + "FROM Dragon" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And just like that, we've already written two SQL queries. There are a few points of note in the queries above. Firstly, notice that every \"verb\" is written in uppercase. It is convention to write SQL operations in capital letters, but your code will run just fine even if you choose to keep things in lowercase. Second, the query above separates each statement with a new line. SQL queries are not impacted by whitespace within the query; this means that SQL code is typically written with a new line after each statement to make things more readable. The semicolon (`;`) indicates the end of a query. There are some \"flavors\" of SQL in which a query will not run if no semicolon is present; however, in Data 100, the SQL version we will use works with or without an ending semicolon. 
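As an aside, the same queries can also be run without the `%%sql` magic. Below is a minimal sketch that uses Python's built-in `sqlite3` module together with `pandas`, assuming the `data/basic_examples.db` file from above is available; `pd.read_sql` sends the query string to the database and returns the result as a `DataFrame`.

```python
import sqlite3
import pandas as pd

# Connect to the same SQLite database used in this lecture.
conn = sqlite3.connect("data/basic_examples.db")

query = """
SELECT cute, year
FROM Dragon;
"""
df = pd.read_sql(query, conn)
print(df)
```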
Queries in these notes will end with semicolons to build up good habits.\n", + "\n", + "The `AS` keyword allows us to give a column a new name (called an **alias**) after it has been `SELECT`ed. The general syntax is:\n", + "\n", + " SELECT column_name_in_database_table AS new_name_in_output_table" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT cute AS cuteness, year AS birth\n", + "FROM Dragon" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To `SELECT` only the *unique* values in a column, we use the `DISTINCT` keyword. This will cause any any duplicate entries in a column to be removed. If we want to find only the unique years in `Dragon`, without any repeats, we would write:" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT DISTINCT year\n", + "FROM Dragon" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Every** SQL query must include both a `SELECT` and `FROM` statement. Intuitively, this makes sense – we know that we'll want to extract some piece of information from the table; to do so, we also need to indicate what table we want to consider. \n", + "\n", + "It is important to note that SQL enforces a strict \"order of operations\" – SQL clauses must *always* follow the same sequence. For example, the `SELECT` statement must always precede `FROM`. This means that any SQL query will follow the same structure. \n", + "\n", + " SELECT \n", + " FROM \n", + " [additional clauses]\n", + " \n", + "The additional clauses that we use depend on the specific task trying to be achieved. We may refine our query to filter on a certain condition, aggregate a particular column, or join several tables together. We will spend the rest of this lecture outlining some useful clauses to build up our understanding of the order of operations.\n", + "\n", + "## Applying `WHERE` Conditions\n", + "\n", + "The `WHERE` keyword is used to select only some rows of a table, filtered on a given Boolean condition. " + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT name, year\n", + "FROM Dragon\n", + "WHERE cute > 0" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can add complexity to the `WHERE` condition using the keywords `AND`, `OR`, and `NOT`, much like we would in Python." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT name, year\n", + "FROM Dragon\n", + "WHERE cute > 0 OR year > 2013" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To spare ourselves needing to write complicated logical expressions by combining several conditions, we can also filter for entries that are `IN` a specified list of values. This is similar to the use of `in` or `.isin` in Python." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT name, year\n", + "FROM Dragon\n", + "WHERE name IN (\"hiccup\", \"puff\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You may have noticed earlier that our table actually has a missing value. In SQL, missing data is given the special value `NULL`. `NULL` behaves in a fundamentally different way to other data types. 
We can't use the typical operators (=, >, and <) on `NULL` values (in fact, `NULL == NULL` returns `False`!); instead, we check to see if a value `IS` or `IS NOT` `NULL`." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT *\n", + "FROM Dragon\n", + "WHERE cute IS NOT NULL" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Sorting and Restricting Output\n", + " \n", + "What if we want the output table to appear in a certain order? The `ORDER BY` keyword behaves similarly to `.sort_values()` in `pandas`. " + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT *\n", + "FROM Dragon\n", + "ORDER BY cute" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By default, `ORDER BY` will display results in ascending order (with the lowest values at the top of the table). To sort in descending order, we use the `DESC` keyword after specifying the column to be used for ordering." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT *\n", + "FROM Dragon\n", + "ORDER BY cute DESC" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also tell SQL to `ORDER BY` two columns at once. This will sort the table by the first listed column, then use the values in the second listed column to break any ties." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT *\n", + "FROM Dragon\n", + "ORDER BY name, cute" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In many instances, we are only concerned with a certain number of rows in the output table (for example, wanting to find the first two dragons in the table). The `LIMIT` keyword restricts the output to a specified number of rows. It serves a function similar to that of `.head()` in `pandas`." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT *\n", + "FROM Dragon\n", + "LIMIT 2" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `OFFSET` keyword indicates the index at which `LIMIT` should start. In other words, we can use `OFFSET` to shift where the `LIMIT`ing begins by a specified number of rows. For example, we might care about the dragons that are at positions #2 and #3 in the table. " + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT *\n", + "FROM Dragon\n", + "LIMIT 2\n", + "OFFSET 1" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's summarize what we've learned so far. We know that `SELECT` and `FROM` are the fundamental building blocks of any SQL query. We can augment these two keywords with additional clauses to refine the data in our output table. \n", + "\n", + "Any clauses that we include must follow a strict ordering within the query:\n", + "\n", + " SELECT \n", + " FROM
\n", + " [WHERE ]\n", + " [ORDER BY ]\n", + " [LIMIT ]\n", + " [OFFSET ]\n", + " \n", + "Here, any clause contained in square brackets `[ ]` is optional – we only need to use the keyword if it is relevant to the table operation we want to perform. Also note that by convention, we use all caps for keywords in SQL statements and use newlines to make code more readable.\n", + "\n", + "## Aggregating with `GROUP BY`\n", + "\n", + "At this point, we've seen that SQL offers much of the same functionality that was given to us by `pandas`. We can extract data from a table, filter it, and reorder it to suit our needs.\n", + "\n", + "In `pandas`, much of our analysis work relied heavily on being able to use `.groupby()` to aggregate across the rows of our dataset. SQL's answer to this task is the (very conveniently named) `GROUP BY` clause. While the outputs of `GROUP BY` are similar to those of `.groupby()` – in both cases, we obtain an output table where some column has been used for grouping – the syntax and logic used to group data in SQL are fairly different to the `pandas` implementation.\n", + "\n", + "To illustrate `GROUP BY`, we will consider the `Dish` table from the `basic_examples.db` database." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT * \n", + "FROM Dish" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Say we wanted to find the total costs of dishes of a certain `type`. To accomplish this, we would write the following code." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT type, SUM(cost)\n", + "FROM Dish\n", + "GROUP BY type" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What is going on here? The statement `GROUP BY type` tells SQL to group the data based on the value contained in the `type` column (whether a record is an appetizer, entree, or dessert). `SUM(cost)` sums up the costs of dishes in each `type` and displays the result in the output table.\n", + "\n", + "You may be wondering: why does `SUM(cost)` come before the command to `GROUP BY type`? Don't we need to form groups before we can count the number of entries in each?\n", + "\n", + "Remember that SQL is a *declarative* programming language – a SQL programmer simply states what end result they would like to see, and leaves the task of figuring out *how* to obtain this result to SQL itself. This means that SQL queries sometimes don't follow what a reader sees as a \"logical\" sequence of thought. Instead, SQL requires that we follow its set order of operations when constructing queries. So long as we follow this ordering, SQL will handle the underlying logic.\n", + "\n", + "In practical terms: our goal with this query was to output the total `cost`s of each `type`. To communicate this to SQL, we say that we want to `SELECT` the `SUM`med `cost` values for each `type` group. \n", + "\n", + "There are many aggregation functions that can be used to aggregate the data contained in each group. 
Some common examples are:\n", + "\n", + "* `COUNT`: count the number of rows associated with each group\n", + "* `MIN`: find the minimum value of each group\n", + "* `MAX`: find the maximum value of each group\n", + "* `SUM`: sum across all records in each group\n", + "* `AVG`: find the average value of each group\n", + "\n", + "We can easily compute multiple aggregations, all at once (a task that was very tricky in `pandas`)." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT type, SUM(cost), MIN(cost), MAX(name)\n", + "FROM Dish\n", + "GROUP BY type" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To count the number of rows associated with each group, we use the `COUNT` keyword. Calling `COUNT(*)` will compute the total number of rows in each group, including rows with null values. Its `pandas` equivalent is `.groupby().size()`." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT type, COUNT(*)\n", + "FROM Dish\n", + "GROUP BY type" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To exclude `NULL` values when counting the rows in each group, we explicitly call `COUNT` on a column in the table. This is similar to calling `.groupby().count()` in `pandas`." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "%%sql\n", + "SELECT year, COUNT(cute)\n", + "FROM Dragon\n", + "GROUP BY year" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With this definition of `GROUP BY` in hand, let's update our SQL order of operations. Remember: *every* SQL query must list clauses in this order. \n", + "\n", + " SELECT \n", + " FROM
\n", + " [WHERE ]\n", + " [GROUP BY ]\n", + " [ORDER BY ]\n", + " [LIMIT ]\n", + " [OFFSET ];\n", + "\n", + "Note that we can use the `AS` keyword to rename columns during the selection process and that column expressions may include aggregation functions (`MAX`, `MIN`, etc.).\n" + ] + } + ], + "metadata": { + "kernelspec": { + "name": "python3", + "language": "python", + "display_name": "Python 3 (ipykernel)" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file