diff --git a/03-data-applications-old.qmd b/03-data-applications-old.qmd new file mode 100644 index 00000000..d216e40c --- /dev/null +++ b/03-data-applications-old.qmd @@ -0,0 +1,350 @@ +# Applications: Data {#sec-data-applications} + +```{r} +#| include: false +source("_common.R") +``` + +## Case study: Passwords {#case-study-passwords} + +Stop for a second and think about how many passwords you've used so far today. +You've probably used one to unlock your phone, one to check email, and probably at least one to log on to a social media account. +Made a debit purchase? +You've probably entered a password there too. + +If you're reading this book, and particularly if you're reading it online, chances are you have had to create a password once or twice in your life. +And if you are diligent about your safety and privacy, you've probably chosen passwords that would be hard for others to guess, or *crack*. + +In this case study we introduce a dataset on passwords. +The goal of the case study is to walk you through what a data scientist does when they first get a hold of a dataset as well as to provide some "foreshadowing" of concepts and techniques we'll introduce in the next few chapters on exploratory data analysis. + +::: {.data data-latex=""} +The [`passwords`](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-14/readme.md) data can be found in the [**tidytuesdayR**](https://thebioengineer.github.io/tidytuesdayR/) R package. +::: + +@tbl-passwords-df-head shows the first ten rows from the dataset, which are the ten most common passwords. +Perhaps unsurprisingly, "password" tops the list, followed by "123456". + +```{r} +#| label: tbl-passwords-df-head +#| tbl-cap: Top ten rows of the `passwords` dataset. +# https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-14/passwords.csv +passwords <- readr::read_csv("data/passwords.csv") +passwords <- passwords |> + select(-font_size, -rank_alt) |> + filter(!is.na(category)) |> + mutate(time_unit = fct_relevel(time_unit, "seconds", "minutes", "hours", "days", "weeks", "months", "years")) + +passwords |> + slice_head(n = 10) |> + kbl(linesep = "", booktabs = TRUE, + row.names = FALSE) |> + kable_styling(bootstrap_options = c("striped", "condensed"), + latex_options = c("striped", "hold_position")) +``` + +When you encounter a new dataset, taking a peek at the first few rows as we did in @tbl-passwords-df-head is almost instinctual. +It can often be helpful to look at the last few rows of the data as well to get a sense of the size of the data as well as potentially discover any characteristics that may not be apparent in the top few rows. +@tbl-passwords-df-tail shows the bottom ten rows of the passwords dataset, which reveals that we are looking at a dataset of 500 passwords. + +```{r} +#| label: tbl-passwords-df-tail +#| tbl-cap: Bottom ten rows of the `passwords` dataset. +passwords |> + slice_tail(n = 10) |> + kbl(linesep = "", booktabs = TRUE, row.names = FALSE) |> + kable_styling(bootstrap_options = c("striped", "condensed"), + latex_options = c("striped", "hold_position")) +``` + +At this stage it's also useful to think about how these data were collected, as that will inform the scope of any inference you can make based on your analysis of the data. + +::: {.guidedpractice data-latex=""} +Do these data come from an observational study or an experiment?[^03-data-applications-1] +::: + +[^03-data-applications-1]: This is an observational study. 
+    Researchers collected data on existing passwords in use and identified the most common ones to put together this dataset.
+
+::: {.guidedpractice data-latex=""}
+There are `r nrow(passwords)` rows and `r ncol(passwords)` columns in the dataset.
+What does each row and each column represent?[^03-data-applications-2]
+:::
+
+[^03-data-applications-2]: Each row represents a password and each column represents a variable which contains information on each password.
+
+Once you've identified the rows and columns, it's useful to review the data dictionary to learn about what each column in the dataset represents.
+This is provided in @tbl-passwords-var-def.
+
+```{r}
+#| label: tbl-passwords-var-def
+#| tbl-cap: Variables and their descriptions for the `passwords` dataset.
+passwords_var_def <- tribble(
+  ~variable, ~description,
+  "rank", "Popularity in the database of released passwords.",
+  "password", "Actual text of the password.",
+  "category", "Category password falls into.",
+  "value", "Time to crack by online guessing.",
+  "time_unit", "Time unit to match with value.",
+  "offline_crack_sec", "Time to crack offline in seconds.",
+  "strength", "Strength of password, relative only to passwords in this dataset. Lower values indicate weaker passwords."
+)
+
+passwords_var_def |>
+  kbl(linesep = "", booktabs = TRUE,
+      col.names = c("Variable", "Description")) |>
+  kable_styling(bootstrap_options = c("striped", "condensed"),
+                latex_options = c("striped", "hold_position"), full_width = TRUE) |>
+  column_spec(1, monospace = TRUE) |>
+  column_spec(2, width = "30em")
+```
+
+We now have a better sense of what each column represents, but we do not yet know much about the characteristics of each of the variables.
+
+::: {.workedexample data-latex=""}
+Determine whether each variable in the passwords dataset is numerical or categorical.
+For numerical variables, further classify them as continuous or discrete.
+For categorical variables, determine if the variable is ordinal.
+
+------------------------------------------------------------------------
+
+The numerical variables in the dataset are `rank` (discrete), `value` (continuous), and `offline_crack_sec` (continuous).
+The categorical variables are `password`, `category`, and `time_unit`.
+The strength variable is trickier to classify -- we can think of it as discrete numerical or as an ordinal variable; it takes on numerical values, however it's used to categorize the passwords on an ordinal scale.
+One way of approaching this is thinking about whether the values the variable takes vary linearly, e.g., is the difference in strength between passwords with strength levels 8 and 9 the same as the difference between those with strength levels 9 and 10.
+If this is not necessarily the case, we would classify the variable as ordinal.
+Determining the classification of this variable requires an understanding of how `strength` values were determined, which is a very typical workflow for working with data.
+Sometimes the data dictionary (presented in @tbl-passwords-var-def) isn't sufficient, and we need to go back to the data source and try to understand the data better before we can proceed with the analysis meaningfully.
+:::
+
+Next, let's try to get to know each variable a little bit better.
+For categorical variables, this involves figuring out what their levels are and how commonly represented they are in the data.
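+
+A quick way to tabulate the levels of a categorical variable is `dplyr::count()`.
+Here is a minimal sketch, using the `passwords` data frame loaded above:
+
+```{r}
+#| eval: false
+# Tabulate the levels of `category`, sorted from most to least common
+passwords |>
+  count(category, sort = TRUE)
+```
+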
+@fig-passwords-cat shows the distributions of the categorical variables in this dataset.
+We can see that password strengths of 0-10 are more common than higher values.
+The most common password category is name (e.g., michael, jennifer, jordan, etc.) and the least common is food (e.g., pepper, cheese, coffee, etc.).
+Many passwords can be cracked in a matter of days by online guessing, with some taking as little as seconds and some as long as years to break.
+Each of these visualizations is a bar plot, which you will learn more about in @sec-explore-categorical.
+
+```{r}
+#| label: fig-passwords-cat
+#| fig-cap: Distributions of the categorical variables in the `passwords` dataset.
+#| Plot A shows the distribution of password strengths, Plot B password
+#| categories, and Plot C length of time it takes to crack the passwords by
+#| online guessing.
+#| fig-asp: 1.0
+#| out.width: 100%
+p_category <- passwords |>
+  count(category, sort = TRUE) |>
+  mutate(category = fct_reorder(category, n)) |>
+  ggplot(aes(y = category, x = n, fill = fct_rev(category))) +
+  geom_col(show.legend = FALSE) +
+  scale_fill_openintro() +
+  labs(
+    x = "Count",
+    y = NULL,
+    title = "Categories"
+  ) +
+  theme(plot.title.position = "plot")
+
+p_time_unit <- passwords |>
+  count(time_unit) |>
+  ggplot(aes(y = time_unit, x = n)) +
+  geom_col(show.legend = FALSE) +
+  labs(
+    x = "Count",
+    y = NULL,
+    title = "Length of time to crack",
+    subtitle = "By online guessing"
+  ) +
+  theme(plot.title.position = "plot")
+
+p_strength <- passwords |>
+  ggplot(aes(y = strength)) +
+  geom_histogram(binwidth = 1, show.legend = FALSE) +
+  scale_y_continuous(breaks = seq(0, 50, 5), trans = "reverse") +
+  labs(
+    x = "Count",
+    y = NULL,
+    title = "Strengths"
+  ) +
+  theme(plot.title.position = "plot")
+
+patchwork <- p_strength | (p_category / p_time_unit)
+
+patchwork +
+  plot_annotation(
+    title = "Strengths, categories, and cracking time\nof 500 most common passwords",
+    tag_levels = "A"
+  ) &
+  theme(plot.tag = element_text(size = 12, color = "darkgray"))
+```
+
+Similarly, we can examine the distributions of the numerical variables as well.
+We already know that rank ranges between 1 and 500 in this dataset, based on @tbl-passwords-df-head and @tbl-passwords-df-tail.
+The value variable is slightly more complicated to consider since the numerical values in that column are meaningless without the time unit that accompanies them.
+@tbl-passwords-online-crack-summary shows the minimum and maximum amount of time it takes to crack a password by online guessing.
+For example, there are 11 passwords in the dataset that can be broken in a matter of seconds, and each of them takes 11.11 seconds to break, since the minimum and the maximum of observations in this group are exactly equal to this value.
+And there are 65 passwords that take years to break, ranging from 2.56 years to 92.27 years.
+
+```{r}
+#| label: tbl-passwords-online-crack-summary
+#| tbl-cap: Minimum and maximum amount of time it takes to crack a password by
+#| online guessing as well as the number of observations that fall into each
+#| time unit category.
+passwords |>
+  group_by(time_unit) |>
+  summarise(
+    n = n(),
+    min = min(value),
+    max = max(value)
+  ) |>
+  kbl(linesep = "", booktabs = TRUE) |>
+  kable_styling(bootstrap_options = c("striped", "condensed"),
+                latex_options = c("striped", "hold_position"))
+```
+
+Even though passwords that take a large number of years to crack can seem like good options (see @tbl-passwords-long-crack for a list of them), now that you've seen them here (and given that they appear in a dataset of the 500 most common passwords), you should not use them as secure passwords!
+
+```{r}
+#| label: tbl-passwords-long-crack
+#| tbl-cap: Passwords that take the longest amount of time to crack by online
+#| guessing.
+passwords |>
+  filter(value == 92.27) |>
+  kbl(linesep = "", booktabs = TRUE, row.names = FALSE) |>
+  kable_styling(bootstrap_options = c("striped", "condensed"),
+                latex_options = c("striped", "hold_position"))
+```
+
+\clearpage
+
+The last numerical variable in the dataset is `offline_crack_sec`.
+@fig-password-offline-crack-hist shows the distribution of this variable, which reveals that all of these passwords can be cracked offline in under 30 seconds, with a large number of them being crackable in just a few seconds.
+
+```{r}
+#| label: fig-password-offline-crack-hist
+#| fig-cap: Histogram of the length of time it takes to crack passwords offline.
+ggplot(passwords, aes(x = offline_crack_sec)) +
+  geom_histogram(binwidth = 1) +
+  labs(
+    x = "Length of time (seconds)",
+    y = "Count",
+    title = "Length of time to crack passwords offline"
+  )
+```
+
+So far we examined the distributions of each individual variable, but it would be more interesting to explore relationships between multiple variables.
+@fig-password-strength-rank-category shows the relationship between rank and strength of passwords by category, where more common passwords (those ranked closer to 1) are plotted higher on the y-axis than those that are less common in this dataset.
+The stronger the password, the larger the text used to display it on the plot.
+While this visualization reveals some passwords that are less common and stronger than others, we should reiterate that you should not use any of these passwords.
+And if you already do, it's time to go change it!
+
+```{r}
+#| label: fig-password-strength-rank-category
+#| fig-cap: Rank vs. strength of 500 most common passwords by category.
+#| fig-asp: 1.2
+#| out.width: 100%
+passwords |>
+  mutate(category = fct_relevel(category, "name", "cool-macho", "simple-alphanumeric", "fluffy", "sport", "nerdy-pop", "animal", "password-related", "rebellious-rude", "food")) |>
+  ggplot(aes(x = strength, y = rank, color = category)) +
+  geom_text(aes(label = password, size = strength),
+            check_overlap = TRUE, show.legend = FALSE) +
+  facet_wrap(vars(category), ncol = 3) +
+  coord_cartesian(ylim = c(525, -10)) +
+  scale_y_continuous(breaks = c(1, 100, 200, 300, 400, 500), minor_breaks = NULL, trans = "reverse") +
+  scale_color_openintro() +
+  labs(
+    x = "Strength of password",
+    y = "Rank of popularity",
+    title = "500 most common passwords by category",
+    caption = "Data: Information is beautiful, via TidyTuesday"
+  )
+```
+
+In this case study, we introduced you to the very first steps a data scientist takes when they start working with a new dataset.
+In the next few chapters, we will introduce exploratory data analysis and you'll learn more about the various types of data visualizations and summary statistics you can make to get to know your data better.
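+
+As a small preview of such summaries, here is a sketch of one way you might compare password strength across categories numerically:
+
+```{r}
+#| eval: false
+# Average password strength within each category, strongest first
+passwords |>
+  group_by(category) |>
+  summarise(mean_strength = mean(strength)) |>
+  arrange(desc(mean_strength))
+```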
+ +Before you move on, we encourage you to think about whether the following questions can be answered with this dataset, and if yes, how you might go about answering them. +It's okay if your answer is "I'm not sure", we simply want to get your exploratory juices flowing to prime you for what's to come! + +1. What characteristics are associated with a strong vs. a weak password? +2. Do more popular passwords take shorter or longer to crack compared to less popular passwords? +3. Are passwords that start with letters or numbers more common among the list of top 500 most common passwords? + +\clearpage + +## Interactive R tutorials {#data-tutorials} + +Navigate the concepts you've learned in this chapter in R using the following self-paced tutorials. +All you need is your browser to get started! + +::: {.alltutorials data-latex=""} +[Tutorial 1: Introduction to data](https://openintrostat.github.io/ims-tutorials/01-data/) + +::: {.content-hidden unless-format="pdf"} +https://openintrostat.github.io/ims-tutorials/01-data +::: +::: + +::: {.singletutorial data-latex=""} +[Tutorial 1 - Lesson 1: Language of data](https://openintro.shinyapps.io/ims-01-data-01/) + +::: {.content-hidden unless-format="pdf"} +https://openintro.shinyapps.io/ims-01-data-01 +::: +::: + +::: {.singletutorial data-latex=""} +[Tutorial 1 - Lesson 2: Types of studies](https://openintro.shinyapps.io/ims-01-data-02/) + +::: {.content-hidden unless-format="pdf"} +https://openintro.shinyapps.io/ims-01-data-02 +::: +::: + +::: {.singletutorial data-latex=""} +[Tutorial 1 - Lesson 3: Sampling strategies and experimental design](https://openintro.shinyapps.io/ims-01-data-03/) + +::: {.content-hidden unless-format="pdf"} +https://openintro.shinyapps.io/ims-01-data-03 +::: +::: + +::: {.singletutorial data-latex=""} +[Tutorial 1 - Lesson 4: Case study](https://openintro.shinyapps.io/ims-01-data-04/) + +::: {.content-hidden unless-format="pdf"} +https://openintro.shinyapps.io/ims-01-data-04 +::: +::: + +::: {.content-hidden unless-format="pdf"} +You can also access the full list of tutorials supporting this book at\ +. +::: + +::: {.content-visible when-format="html"} +You can also access the full list of tutorials supporting this book [here](https://openintrostat.github.io/ims-tutorials). +::: + +## R labs {#data-labs} + +Further apply the concepts you've learned in this part in R with computational labs that walk you through a data analysis case study. + +::: {.singlelab data-latex=""} +[Intro to R - Birth rates](https://www.openintro.org/go?id=ims-r-lab-intro-to-r) + +::: {.content-hidden unless-format="pdf"} +https://www.openintro.org/go?i +d=ims-r-lab-intro-to-r +::: +::: + +::: {.content-hidden unless-format="pdf"} +You can also access the full list of labs supporting this book at\ +. +::: + +::: {.content-visible when-format="html"} +You can also access the full list of labs supporting this book [here](https://www.openintro.org/go?id=ims-r-labs). +::: diff --git a/03-data-applications.qmd b/03-data-applications.qmd index 4c3d137d..30ed54ad 100644 --- a/03-data-applications.qmd +++ b/03-data-applications.qmd @@ -3,94 +3,93 @@ ```{r} #| include: false source("_common.R") +library(openintro) +data("paralympic_1500") ``` -## Case study: Passwords {#case-study-passwords} +## Case study: Olympic 1500m {#case-study-paralympics} -Stop for a second and think about how many passwords you've used so far today. 
-You've probably used one to unlock your phone, one to check email, and probably at least one to log on to a social media account.
-Made a debit purchase?
-You've probably entered a password there too.
+While many of you may be glued to the Olympic Games every four years (or every two years if you fancy both summer and winter sports), the Paralympic Games are less popular, even though they hold the same competitive thrills.
 
-If you're reading this book, and particularly if you're reading it online, chances are you have had to create a password once or twice in your life.
-And if you are diligent about your safety and privacy, you've probably chosen passwords that would be hard for others to guess, or *crack*.
+The Paralympic Games began as a way to help soldiers who had been wounded in World War II rehabilitate.
+The first Paralympic Games were held in Rome, Italy, in 1960.
+Since 1988 (Seoul, South Korea), the Paralympic Games have been held a few weeks after the Olympic Games in the same city, in both the summer and winter.
 
-In this case study we introduce a dataset on passwords.
-The goal of the case study is to walk you through what a data scientist does when they first get a hold of a dataset as well as to provide some "foreshadowing" of concepts and techniques we'll introduce in the next few chapters on exploratory data analysis.
+In this case study we introduce a dataset comparing Olympic and Paralympic gold medal finishers in the 1500m running competition (the Olympic "mile", if a bit shorter than a full mile).
+The goal of the case study is to walk you through what a data scientist does when they first get a hold of a dataset.
+We also provide some "foreshadowing" of concepts and techniques we'll introduce in the next few chapters on exploratory data analysis.
+Last, we introduce [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) and discuss the importance of understanding the impact of multiple variables in an analysis.
 
 ::: {.data data-latex=""}
-The [`passwords`](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-14/readme.md) data can be found in the [**tidytuesdayR**](https://thebioengineer.github.io/tidytuesdayR/) R package.
+The [`paralympic_1500`](http://openintrostat.github.io/openintro/reference/paralympic_1500.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro/) R package.
 :::
 
-@tbl-passwords-df-head shows the first ten rows from the dataset, which are the ten most common passwords.
-Perhaps unsurprisingly, "password" tops the list, followed by "123456".
+@tbl-paralympic-df-tail shows the last ten rows from the dataset, which are the ten most recent 1500m races.
+Notice that there are racers from both the Men's and Women's divisions as well as from race types of varying visual impairment (T11, T12, T13, and Olympic).
+The T11 athletes have almost complete visual impairment, run with a black-out blindfold, and are allowed to run with a guide-runner.
+T12 and T13 athletes have some visual impairment, and the visual acuity of Olympic runners is not determined.
 
 ```{r}
-#| label: tbl-passwords-df-head
-#| tbl-cap: Top ten rows of the `passwords` dataset.
-# https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-14/passwords.csv
-passwords <- readr::read_csv("data/passwords.csv")
-passwords <- passwords |>
-  select(-font_size, -rank_alt) |>
-  filter(!is.na(category)) |>
-  mutate(time_unit = fct_relevel(time_unit, "seconds", "minutes", "hours", "days", "weeks", "months", "years"))
-
-passwords |>
-  slice_head(n = 10) |>
+#| label: tbl-paralympic-df-tail
+#| tbl-cap: Last ten rows of the `paralympic_1500` dataset.
+
+paralympic_1500 |>
+  slice_tail(n = 10) |>
   kbl(linesep = "", booktabs = TRUE,
       row.names = FALSE) |>
   kable_styling(bootstrap_options = c("striped", "condensed"),
                 latex_options = c("striped", "hold_position"))
 ```
 
-When you encounter a new dataset, taking a peek at the first few rows as we did in @tbl-passwords-df-head is almost instinctual.
-It can often be helpful to look at the last few rows of the data as well to get a sense of the size of the data as well as potentially discover any characteristics that may not be apparent in the top few rows.
-@tbl-passwords-df-tail shows the bottom ten rows of the passwords dataset, which reveals that we are looking at a dataset of 500 passwords.
+When you encounter a new dataset, taking a peek at the last few rows as we did in @tbl-paralympic-df-tail should be almost instinctual.
+It can often be helpful to look at the first few rows of the data as well to get a sense of other aspects of the data which may not be apparent in the last few rows.
+@tbl-paralympic-df-head shows the top ten rows of the `paralympic_1500` dataset, which reveals that for at least the first 10 Olympiads, there were no runners in the Women's division or in the Paralympics.
 
 ```{r}
-#| label: tbl-passwords-df-tail
-#| tbl-cap: Bottom ten rows of the `passwords` dataset.
-passwords |>
-  slice_tail(n = 10) |>
+#| label: tbl-paralympic-df-head
+#| tbl-cap: First ten rows of the `paralympic_1500` dataset.
+paralympic_1500 |>
+  slice_head(n = 10) |>
   kbl(linesep = "", booktabs = TRUE, row.names = FALSE) |>
   kable_styling(bootstrap_options = c("striped", "condensed"),
                 latex_options = c("striped", "hold_position"))
 ```
 
-At this stage it's also useful to think about how these data were collected, as that will inform the scope of any inference you can make based on your analysis of the data.
+At this stage it's also useful to think about how the data were collected, as that will inform the scope of any inference you can make based on your analysis of the data.
 
 ::: {.guidedpractice data-latex=""}
 Do these data come from an observational study or an experiment?[^03-data-applications-1]
 :::
 
 [^03-data-applications-1]: This is an observational study.
-    Researchers collected data on existing passwords in use and identified most common ones to put together this dataset.
+    Researchers collected data on past gold medal race times in both Olympic and Paralympic Games.
 
 ::: {.guidedpractice data-latex=""}
-There are `r nrow(passwords)` rows and `r ncol(passwords)` columns in the dataset.
+There are `r nrow(paralympic_1500)` rows and `r ncol(paralympic_1500)` columns in the dataset.
 What does each row and each column represent?[^03-data-applications-2]
 :::
 
-[^03-data-applications-2]: Each row represents a password and each column represents a variable which contains information on each password.
+[^03-data-applications-2]: Each row represents a 1500m gold medal race and each column represents a variable containing information on each race.
 Once you've identified the rows and columns, it's useful to review the data dictionary to learn about what each column in the dataset represents.
-This is provided in @tbl-passwords-var-def.
+The data dictionary is provided in @tbl-paralympic-var-def.
 
 ```{r}
-#| label: tbl-passwords-var-def
-#| tbl-cap: Variables and their descriptions for the `passwords` dataset.
-passwords_var_def <- tribble(
+#| label: tbl-paralympic-var-def
+#| tbl-cap: Variables and their descriptions for the `paralympic_1500` dataset.
+paralympic_var_def <- tribble(
   ~variable, ~description,
-  "rank", "Popularity in the database of released passwords.",
-  "password", "Actual text of the password.",
-  "category", "Category password falls into.",
-  "value", "Time to crack by online guessing.",
-  "time_unit", "Time unit to match with value.",
-  "offline_crack_sec", "Time to crack offline in seconds.",
-  "strength", "Strength of password, relative only to passwords in this dataset. Lower values indicate weaker passwords."
+  "year", "Year the Games took place.",
+  "city", "City of the Games.",
+  "country_of_games", "Country of the Games.",
+  "division", "Division: `Men` or `Women`.",
+  "type", "Type of race: `Olympic`, `T11`, `T12`, or `T13`.",
+  "name", "Name of the athlete.",
+  "country_of_athlete", "Country of athlete.",
+  "time", "Time of gold medal race, in m:s.",
+  "time_min", "Time of gold medal race, in decimal minutes (min + sec/60)."
 )
 
-passwords_var_def |>
+paralympic_var_def |>
   kbl(linesep = "", booktabs = TRUE,
      col.names = c("Variable", "Description")) |>
   kable_styling(bootstrap_options = c("striped", "condensed"),
@@ -102,176 +101,232 @@ passwords_var_def |>
                 latex_options = c("striped", "hold_position"), full_width = TRUE) |>
   column_spec(1, monospace = TRUE) |>
   column_spec(2, width = "30em")
```
 
 We now have a better sense of what each column represents, but we do not yet know much about the characteristics of each of the variables.
 
 ::: {.workedexample data-latex=""}
-Determine whether each variable in the passwords dataset is numerical or categorical.
+Determine whether each variable in the `paralympic_1500` dataset is numerical or categorical.
 For numerical variables, further classify them as continuous or discrete.
 For categorical variables, determine if the variable is ordinal.
 
 ------------------------------------------------------------------------
 
-The numerical variables in the dataset are `rank` (discrete), `value` (continuous), and `offline_crack_sec` (continuous).
-The categorical variables are `password`, `time_unit`.
-The strength variable is trickier to classify -- we can think of it as discrete numerical or as an ordinal variable as it takes on numerical values, however it's used to categorize the passwords on an ordinal scale.
-One way of approaching this is thinking about whether the values the variable takes vary linearly, e.g., is the difference in strength between passwords with strength levels 8 and 9 the same as the difference with those with strength levels 9 and 10.
-If this is not necessarily the case, we would classify the variable as ordinal.
-Determining the classification of this variable requires understanding of how `strength` values were determined, which is a very typical workflow for working with data.
-Sometimes the data dictionary (presented in @tbl-passwords-var-def) isn't sufficient, and we need to go back to the data source and try to understand the data better before we can proceed with the analysis meaningfully.
+The numerical variables in the dataset are `year` (discrete) and `time_min` (continuous).
+The categorical variables are `city`, `country_of_games`, `division`, `type`, `name`, and `country_of_athlete`.
+The `time` variable is trickier to classify -- we can think of it as numerical, but it is classified as categorical.
+The categorical classification is due to the colon `:`, which separates the minutes from the seconds.
+Sometimes the data dictionary (presented in @tbl-paralympic-var-def) isn't sufficient for a complete analysis, and we need to go back to the data source and try to understand the data better before we can proceed with the analysis meaningfully.
 :::
 
 Next, let's try to get to know each variable a little bit better.
 For categorical variables, this involves figuring out what their levels are and how commonly represented they are in the data.
-@fig-passwords-cat shows the distributions of the categorical variables in this dataset.
-We can see that password strengths of 0-10 are more common than higher values.
-The most common password category is name (e.g. michael, jennifer, jordan, etc.) and the least common is food (e.g., pepper, cheese, coffee, etc.).
-Many passwords can be cracked in the matter of days by online cracking with some taking as little as seconds and some as long as years to break.
-Each of these visualizations is a bar plot, which you will learn more about in @sec-explore-categorical.
+@fig-paralympic-cat shows the distributions of two of the categorical variables in this dataset.
+We can see that the United States has hosted the Games most often, but runners from Great Britain and Kenya have won the 1500m most often.
+There are a large number of countries that have had a single gold medal winner of the 1500m.
+Similarly, there are a large number of countries that have hosted the Games only once.
+Over the last century, the name describing the country for athletes from one particular region has changed and includes Russian Federation, Unified Team, and Russian Paralympic Committee.
+Both of the visualizations are bar plots, which you will learn more about in @sec-explore-categorical.
 
 ```{r}
-#| label: fig-passwords-cat
-#| fig-cap: Distributions of the categorical variables in the `passwords` dataset.
-#| Plot A shows the distribution of password strengths, Plot B password
-#| categories, and Plot C length of time it takes to crack the passwords by
-#| online guessing.
-#| fig-asp: 1.0
-#| out.width: 100%
-p_category <- passwords |>
-  count(category, sort = TRUE) |>
-  mutate(category = fct_reorder(category, n)) |>
-  ggplot(aes(y = category, x = n, fill = fct_rev(category))) +
+#| label: fig-paralympic-cat
+#| fig-cap: Distributions of categorical variables in the `paralympic_1500` dataset.
+#| Plot A shows the distribution of the country of origin of the athlete; Plot B
+#| shows the distribution of the country in which the Games took place.
+#| fig-alt: Two separate bar plots. The left panel shows a bar plot counting the number of gold medal athletes from each country. Great Britain has had 8 top finishers, Kenya has had 7 top finishers, and Tunisia and Algeria have both had 5. The right panel shows a bar plot counting the number of games which have happened in each country. The USA has hosted 4 games, the UK has hosted 3 games, and each of Japan, Greece, Germany, France, and Australia have hosted the games twice.
+#| fig-asp: 1.4
+#| out-width: 100%
+p_country_games <- paralympic_1500 |>
+  group_by(country_of_games, year) |>
+  sample_n(size = 1) |>
+  ungroup() |>
+  group_by(country_of_games) |>
+  count(country_of_games, sort = TRUE) |>
+  ungroup() |>
+  mutate(country_of_games = fct_reorder(country_of_games, n)) |>
+  ggplot(aes(y = country_of_games, x = n,
+             fill = fct_rev(country_of_games))) +
   geom_col(show.legend = FALSE) +
   scale_fill_openintro() +
   labs(
     x = "Count",
     y = NULL,
-    title = "Categories"
+    title = "Country of Games"
   ) +
   theme(plot.title.position = "plot")
 
-p_time_unit <- passwords |>
-  count(time_unit) |>
-  ggplot(aes(y = time_unit, x = n)) +
+p_country_athlete <- paralympic_1500 |>
+  group_by(country_of_athlete, year) |>
+  sample_n(size = 1) |>
+  ungroup() |>
+  group_by(country_of_athlete) |>
+  count(country_of_athlete, sort = TRUE) |>
+  ungroup() |>
+  mutate(country_of_athlete = fct_reorder(country_of_athlete, n)) |>
+  ggplot(aes(y = country_of_athlete, x = n)) +
   geom_col(show.legend = FALSE) +
-  labs(
-    x = "Count",
-    y = NULL,
-    title = "Length of time to crack",
-    subtitle = "By online guessing"
-  ) +
-  theme(plot.title.position = "plot")
-
-p_strength <- passwords |>
-  ggplot(aes(y = strength)) +
-  geom_histogram(binwidth = 1, show.legend = FALSE) +
-  scale_y_continuous(breaks = seq(0, 50, 5), trans = "reverse") +
   labs(
     x = "Count",
     y = NULL,
-    title = "Strengths"
+    title = "Country of athlete"
  ) +
   theme(plot.title.position = "plot")
 
-patchwork <- p_strength | (p_category / p_time_unit)
+patchwork <- p_country_athlete | p_country_games
 
 patchwork +
   plot_annotation(
-    title = "Strengths, categories, and cracking time\nof 500 most common passwords",
+    title = "Olympic and Paralympic Games, Men's division",
     tag_levels = "A"
   ) &
   theme(plot.tag = element_text(size = 12, color = "darkgray"))
```
 
 Similarly, we can examine the distributions of the numerical variables as well.
-We already know that rank ranges between 1 and 500 in this dataset, based on @tbl-passwords-df-head and @tbl-passwords-df-tail.
-The value variable is slightly more complicated to consider since the numerical values in that column are meaningless without the time unit that accompanies them.
-@tbl-passwords-online-crack-summary shows the minimum and maximum amount of time it takes to crack a password by online guessing.
-For example, there are 11 passwords in the dataset that can be broken in a matter of seconds, and each of them take 11.11 seconds to break, since the minimum and the maximum of observations in this group are exactly equal to this value.
-And there are 65 passwords that take years to break, ranging from 2.56 years to 92.27 years.
+We already know that the 1500m times are mostly between 3.5 and 4.5 minutes, based on @tbl-paralympic-df-tail and @tbl-paralympic-df-head.
+We can break down the 1500m time by division and type of race.
+@tbl-paralympic-summary shows the mean, minimum, and maximum 1500m times broken down by division and race type.
+Recall that the Men's Olympic division has taken place since 1896, whereas the Men's Paralympic division has happened only since 1960.
+The maximum race time, therefore, should be interpreted in the context of the year of the Games.
 
 ```{r}
-#| label: tbl-passwords-online-crack-summary
-#| tbl-cap: Minimum and maximum amount of time it takes to crack a password by
-#| online guessing as well as the number of observations that fall into each
-#| time unit category.
-passwords |>
-  group_by(time_unit) |>
+#| label: tbl-paralympic-summary
+#| tbl-cap: Mean, minimum, and maximum of the gold medal times for the 1500m race
+#| broken down by division and type of race.
+paralympic_1500 |>
+  group_by(division, type) |>
   summarise(
-    n = n(),
-    min = min(value),
-    max = max(value)
+    mean = round(mean(time_min), 3),
+    min = round(min(time_min), 3),
+    max = round(max(time_min), 3)
   ) |>
   kbl(linesep = "", booktabs = TRUE) |>
   kable_styling(bootstrap_options = c("striped", "condensed"),
                 latex_options = c("striped", "hold_position"))
```
 
-Even though passwords that take a large number of years to crack can seem like good options (see @tbl-passwords-long-crack for a list of them), now that you've seen them here (and the fact that they are in a dataset of 500 most common passwords), you should not use them as secure passwords!
+
+### Fun fact {-}
+
+Sometimes playing around with the dataset will uncover interesting elements about the context in which the data were collected.
+A scatterplot of the Men's 1500m broken down by race type shows that, in each given year, the Olympic runner is substantially faster than the Paralympic runners, with one exception.
+In the Rio de Janeiro 2016 Games, the [T13 gold medal athlete ran faster (3:48.29) than the Olympic gold medal athlete (3:50.00)](https://www.paralympic.org/news/remarkable-finish-1500m-rio-2016) (see @fig-paralympic-rio).
+In fact, some internet sleuthing tells you that the **top four** T13 finishers all finished the 1500m under 3:50.00!
 
 ```{r}
-#| label: tbl-passwords-long-crack
-#| tbl-cap: Passwords that take the longest amount of time to crack by online
-#| guessing.
-passwords |>
-  filter(value == 92.27) |>
-  kbl(linesep = "", booktabs = TRUE, row.names = FALSE) |>
-  kable_styling(bootstrap_options = c("striped", "condensed"),
-                latex_options = c("striped", "hold_position"))
+#| label: fig-paralympic-rio
+#| fig-cap: 1500m race time for Men's Olympic and Paralympic athletes. Dashed grey line represents the Rio games in 2016.
+#| fig-alt: A scatterplot with year on the x-axis and gold medal 1500m time on the y-axis. The points are colored by which group the athlete is in - T11, T12, T13, or Olympic. A vertical line at 2016 shows that in the Rio games the T13 gold medal athlete was faster than the Olympic gold medal athlete.
+#| fig-asp: 1.2
+#| out-width: 100%
+
+paralympic_1500 |>
+  filter(division == "Men") |>
+  filter(year > 1950) |>
+  ggplot(aes(x = year, y = time_min, group = type, color = type)) +
+  geom_point() +
+  geom_vline(xintercept = 2016, color = "darkgrey", lty = 2) +
+  scale_color_openintro() +
+  labs(
+    x = "Year",
+    y = NULL,
+    color = "Race type",
+    title = "1500m race time, in minutes"
+  ) +
+  theme(plot.title.position = "plot")
```
 
-\clearpage
 
-The last numerical variable in the dataset is `offline_crack_sec`.
-@fig-password-offline-crack-hist shows the distribution of this variable, which reveals that all of these passwords can be cracked offline in under 30 seconds, with a large number of them being crackable in just a few seconds.
+So far, we have examined aspects of some of the individual variables, and we have broken down the 1500m race times in terms of division and race type.
+You might have already wondered how the race times vary across years.
+The `paralympic_1500` dataset provides us with an opportunity to explore an important statistical concept, Simpson's paradox.
+
+## Simpson's paradox
+
+Simpson's paradox \index{Simpson's paradox} describes a relationship among three (or more) variables.
+The paradox happens when a third variable reverses the relationship between the first two variables.
+
+Let's start by considering how the 1500m gold medal race times have changed over the years.
+@fig-paralympic-ungrouped shows a scatterplot describing 1500m race times and year for Men's Olympic and Paralympic (T11) athletes with a line of best fit (to the entire dataset) superimposed (see @sec-model-slr, where we present fitting a line to a scatterplot).
+Notice that the line of best fit shows a **positive** relationship between race time and year.
+That is, for later years, the predicted gold medal time is higher than in earlier years.
 
 ```{r}
-#| label: fig-password-offline-crack-hist
-#| fig-cap: Histogram of the length of time it takes to crack passwords offline.
-ggplot(passwords, aes(x = offline_crack_sec)) +
-  geom_histogram(binwidth = 1) +
+#| label: fig-paralympic-ungrouped
+#| fig-cap: 1500m race time for Men's Olympic and Paralympic (T11) athletes. The line represents a line of best fit to the entire dataset.
+#| fig-alt: A scatterplot with year on the x-axis and gold medal 1500m time on the y-axis. A line of best fit is drawn over the points.
+#| fig-asp: 1.2
+#| out-width: 100%
+
+paralympic_1500 |>
+  filter(division == "Men", type == "Olympic" | type == "T11") |>
+  filter(year > 1950) |>
+  ggplot(aes(x = year, y = time_min)) +
+  geom_point() +
+  geom_smooth(method = "lm", se = FALSE) +
   labs(
-    x = "Length of time (seconds)",
-    y = "Count",
-    title = "Length of time to crack passwords offline"
-  )
+    x = "Year",
+    y = NULL,
+    title = "1500m race time, in minutes"
+  ) +
+  theme(plot.title.position = "plot")
```
 
-So far we examined the distributions of each individual variable, but it would be more interesting to explore relationships between multiple variables.
-@fig-password-strength-rank-category shows the relationship between rank and strength of passwords by category, where more common passwords (those with higher rank) are plotted higher on the y-axis than those that are less common in this dataset.
-The stronger the password, the larger text it's represented with on the plot.
-While this visualization reveals some passwords that are less common, and stronger than others, we should reiterate that you should not use any of these passwords.
-And if you already do, it's time to go change it!
+Of course, both your eye and your intuition are likely telling you that it wouldn't make any sense to try to model all of the athletes together.
+Instead, a separate model should be run for each of the two types of Games: Olympic and Paralympic (T11).
+@fig-paralympic-grouped shows a scatterplot describing 1500m race times and year for Men's Olympic and Paralympic (T11) athletes with a line of best fit superimposed separately for each of the two types of races.
+Notice that within each type of race, the relationship between 1500m race time and year is now **negative**.
 
 ```{r}
-#| label: fig-password-strength-rank-category
-#| fig-cap: Rank vs. strength of 500 most common passwords by category.
+#| label: fig-paralympic-grouped
+#| fig-cap: 1500m race time for Men's Olympic and Paralympic (T11) athletes. The best fit line is now fit separately to the Olympic and Paralympic athletes.
+#| fig-alt: A scatterplot with year on the x-axis and gold medal 1500m time on the y-axis. The points are colored by the type of athlete - T11 or Olympic. Lines of best fit are drawn separately for the two groups (T11 and Olympic).
 #| fig-asp: 1.2
-#| out.width: 100%
-passwords |>
-  mutate(category = fct_relevel(category, "name", "cool-macho", "simple-alphanumeric", "fluffy", "sport", "nerdy-pop", "animal", "password-related", "rebellious-rude", "food")) |>
-  ggplot(aes(x = strength, y = rank, color = category)) +
-  geom_text(aes(label = password, size = strength),
-            check_overlap = TRUE, show.legend = FALSE) +
-  facet_wrap(vars(category), ncol = 3) +
-  coord_cartesian(ylim = c(525, -10)) +
-  scale_y_continuous(breaks = c(1, 100, 200, 300, 400, 500), minor_breaks = NULL, trans = "reverse") +
+#| out-width: 100%
+
+paralympic_1500 |>
+  filter(division == "Men", type == "Olympic" | type == "T11") |>
+  filter(year > 1950) |>
+  ggplot(aes(x = year, y = time_min, group = type, color = type)) +
+  geom_point() +
+  geom_smooth(method = "lm", se = FALSE) +
   scale_color_openintro() +
   labs(
-    x = "Strength of password",
-    y = "Rank of popularity",
-    title = "500 most common passwords by category",
-    caption = "Data: Information is beautiful, via TidyTuesday"
-  )
+    x = "Year",
+    y = NULL,
+    color = "Race type",
+    title = "1500m race time, in minutes"
+  ) +
+  theme(plot.title.position = "plot")
```
 
+::: {.important data-latex=""}
+**Simpson's paradox.**
+
+Simpson's paradox happens when an association or relationship between two variables in one direction (e.g., positive) reverses (e.g., becomes negative) when a third variable is considered.
+:::
+
+```{r}
+#| include: false
+terms_chp_3 <- c("Simpson's paradox")
+```
+
+Simpson's paradox was seen in the 1500m race data because the aggregate data showed a positive relationship (positive slope) between year and race time but a negative relationship (negative slope) between year and race time when broken down by the type of race.
+
+Simpson's paradox can be observed with both categorical and numerical data.
+Often the paradox happens because the third variable (here, race type) is imbalanced.
+There are either more observations in one group, or the observations happen at different intervals across the two groups.
+In the 1500m data, we saw that the T11 runners had fewer observations and their times were both generally slower and more recent than the Olympic runners.
+
+In the 1500m analysis, it would be most prudent to report the trends separately for the Olympic and the T11 athletes.
+However, in other situations, it might be better to aggregate the data and report the overall trend.
+Many additional examples of Simpson's paradox and further exploration are given in @Witmer:2021.
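+
+To quantify the reversal in the 1500m data, one could also compare fitted slopes directly; the chunk below is a sketch, using the same filtering as in the figures above:
+
+```{r}
+#| eval: false
+men <- paralympic_1500 |>
+  filter(division == "Men", type %in% c("Olympic", "T11"), year > 1950)
+
+# Slope of a single line fit to the aggregated data (positive)
+coef(lm(time_min ~ year, data = men))["year"]
+
+# Slopes fit separately within each race type (negative)
+men |>
+  group_by(type) |>
+  summarise(slope = coef(lm(time_min ~ year))["year"])
+```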
-1. What characteristics are associated with a strong vs. a weak password? -2. Do more popular passwords take shorter or longer to crack compared to less popular passwords? -3. Are passwords that start with letters or numbers more common among the list of top 500 most common passwords? +1. Has there every been a year when a visually impaired paralympic gold medal athlete beat the Olympic gold medal athlete? +2. When comparing the paralympic and Olympic 1500m gold medal athletes, does Simpson's paradox hold in the Women's division? +3. Is there a biological boundary which establishes a time under which no human could run 1500m? -\clearpage ## Interactive R tutorials {#data-tutorials} diff --git a/_freeze/03-data-applications/execute-results/html.json b/_freeze/03-data-applications/execute-results/html.json index bf381bc4..a494c95b 100644 --- a/_freeze/03-data-applications/execute-results/html.json +++ b/_freeze/03-data-applications/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "7f24153ca24c463ad296189337712f5c", + "hash": "f3578d82bbdcf85c82b70c2ef4b3db17", "result": { - "markdown": "# Applications: Data {#sec-data-applications}\n\n\n\n\n\n## Case study: Passwords {#case-study-passwords}\n\nStop for a second and think about how many passwords you've used so far today.\nYou've probably used one to unlock your phone, one to check email, and probably at least one to log on to a social media account.\nMade a debit purchase?\nYou've probably entered a password there too.\n\nIf you're reading this book, and particularly if you're reading it online, chances are you have had to create a password once or twice in your life.\nAnd if you are diligent about your safety and privacy, you've probably chosen passwords that would be hard for others to guess, or *crack*.\n\nIn this case study we introduce a dataset on passwords.\nThe goal of the case study is to walk you through what a data scientist does when they first get a hold of a dataset as well as to provide some \"foreshadowing\" of concepts and techniques we'll introduce in the next few chapters on exploratory data analysis.\n\n::: {.data data-latex=\"\"}\nThe [`passwords`](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-14/readme.md) data can be found in the [**tidytuesdayR**](https://thebioengineer.github.io/tidytuesdayR/) R package.\n:::\n\n@tbl-passwords-df-head shows the first ten rows from the dataset, which are the ten most common passwords.\nPerhaps unsurprisingly, \"password\" tops the list, followed by \"123456\".\n\n\n::: {#tbl-passwords-df-head .cell tbl-cap='Top ten rows of the `passwords` dataset.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
rank password category value time_unit offline_crack_sec strength
1 password password-related 6.91 years 2.170 8
2 123456 simple-alphanumeric 18.52 minutes 0.000 4
3 12345678 simple-alphanumeric 1.29 days 0.001 4
4 1234 simple-alphanumeric 11.11 seconds 0.000 4
5 qwerty simple-alphanumeric 3.72 days 0.003 8
6 12345 simple-alphanumeric 1.85 minutes 0.000 4
7 dragon animal 3.72 days 0.003 8
8 baseball sport 6.91 years 2.170 4
9 football sport 6.91 years 2.170 7
10 letmein password-related 3.19 months 0.084 8
\n\n`````\n:::\n:::\n\n\nWhen you encounter a new dataset, taking a peek at the first few rows as we did in @tbl-passwords-df-head is almost instinctual.\nIt can often be helpful to look at the last few rows of the data as well to get a sense of the size of the data as well as potentially discover any characteristics that may not be apparent in the top few rows.\n@tbl-passwords-df-tail shows the bottom ten rows of the passwords dataset, which reveals that we are looking at a dataset of 500 passwords.\n\n\n::: {#tbl-passwords-df-tail .cell tbl-cap='Bottom ten rows of the `passwords` dataset.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
rank password category value time_unit offline_crack_sec strength
491 natasha name 3.19 months 0.084 7
492 sniper cool-macho 3.72 days 0.003 8
493 chance name 3.72 days 0.003 7
494 genesis nerdy-pop 3.19 months 0.084 7
495 hotrod cool-macho 3.72 days 0.003 7
496 reddog cool-macho 3.72 days 0.003 6
497 alexande name 6.91 years 2.170 9
498 college nerdy-pop 3.19 months 0.084 7
499 jester name 3.72 days 0.003 7
500 passw0rd password-related 92.27 years 29.020 28
\n\n`````\n:::\n:::\n\n\nAt this stage it's also useful to think about how these data were collected, as that will inform the scope of any inference you can make based on your analysis of the data.\n\n::: {.guidedpractice data-latex=\"\"}\nDo these data come from an observational study or an experiment?[^03-data-applications-1]\n:::\n\n[^03-data-applications-1]: This is an observational study.\n Researchers collected data on existing passwords in use and identified most common ones to put together this dataset.\n\n::: {.guidedpractice data-latex=\"\"}\nThere are 500 rows and 7 columns in the dataset.\nWhat does each row and each column represent?[^03-data-applications-2]\n:::\n\n[^03-data-applications-2]: Each row represents a password and each column represents a variable which contains information on each password.\n\nOnce you've identified the rows and columns, it's useful to review the data dictionary to learn about what each column in the dataset represents.\nThis is provided in @tbl-passwords-var-def.\n\n\n::: {#tbl-passwords-var-def .cell tbl-cap='Variables and their descriptions for the `passwords` dataset.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Variable Description
rank Popularity in the database of released passwords.
password Actual text of the password.
category Category password falls into.
value Time to crack by online guessing.
time_unit Time unit to match with value.
offline_crack_sec Time to crack offline in seconds.
strength Strength of password, relative only to passwords in this dataset. Lower values indicate weaker passwords.
\n\n`````\n:::\n:::\n\n\nWe now have a better sense of what each column represents, but we do not yet know much about the characteristics of each of the variables.\n\n::: {.workedexample data-latex=\"\"}\nDetermine whether each variable in the passwords dataset is numerical or categorical.\nFor numerical variables, further classify them as continuous or discrete.\nFor categorical variables, determine if the variable is ordinal.\n\n------------------------------------------------------------------------\n\nThe numerical variables in the dataset are `rank` (discrete), `value` (continuous), and `offline_crack_sec` (continuous).\nThe categorical variables are `password`, `time_unit`.\nThe strength variable is trickier to classify -- we can think of it as discrete numerical or as an ordinal variable as it takes on numerical values, however it's used to categorize the passwords on an ordinal scale.\nOne way of approaching this is thinking about whether the values the variable takes vary linearly, e.g., is the difference in strength between passwords with strength levels 8 and 9 the same as the difference with those with strength levels 9 and 10.\nIf this is not necessarily the case, we would classify the variable as ordinal.\nDetermining the classification of this variable requires understanding of how `strength` values were determined, which is a very typical workflow for working with data.\nSometimes the data dictionary (presented in @tbl-passwords-var-def) isn't sufficient, and we need to go back to the data source and try to understand the data better before we can proceed with the analysis meaningfully.\n:::\n\nNext, let's try to get to know each variable a little bit better.\nFor categorical variables, this involves figuring out what their levels are and how commonly represented they are in the data.\n@fig-passwords-cat shows the distributions of the categorical variables in this dataset.\nWe can see that password strengths of 0-10 are more common than higher values.\nThe most common password category is name (e.g. michael, jennifer, jordan, etc.) and the least common is food (e.g., pepper, cheese, coffee, etc.).\nMany passwords can be cracked in the matter of days by online cracking with some taking as little as seconds and some as long as years to break.\nEach of these visualizations is a bar plot, which you will learn more about in @sec-explore-categorical.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Distributions of the categorical variables in the `passwords` dataset. 
Plot A shows the distribution of password strengths, Plot B password categories, and Plot C length of time it takes to crack the passwords by online guessing.](03-data-applications_files/figure-html/fig-passwords-cat-1.png){#fig-passwords-cat width=100%}\n:::\n:::\n\n\nSimilarly, we can examine the distributions of the numerical variables as well.\nWe already know that rank ranges between 1 and 500 in this dataset, based on @tbl-passwords-df-head and @tbl-passwords-df-tail.\nThe value variable is slightly more complicated to consider since the numerical values in that column are meaningless without the time unit that accompanies them.\n@tbl-passwords-online-crack-summary shows the minimum and maximum amount of time it takes to crack a password by online guessing.\nFor example, there are 11 passwords in the dataset that can be broken in a matter of seconds, and each of them take 11.11 seconds to break, since the minimum and the maximum of observations in this group are exactly equal to this value.\nAnd there are 65 passwords that take years to break, ranging from 2.56 years to 92.27 years.\n\n\n::: {#tbl-passwords-online-crack-summary .cell tbl-cap='Minimum and maximum amount of time it takes to crack a password by online guessing as well as the number of observations that fall into each time unit category.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
time_unit n min max
seconds 11 11.11 11.11
minutes 51 1.85 18.52
hours 43 3.09 17.28
days 238 1.29 3.72
weeks 5 1.84 3.70
months 87 3.19 3.19
years 65 2.56 92.27
\n\n`````\n:::\n:::\n\n\nEven though passwords that take a large number of years to crack can seem like good options (see @tbl-passwords-long-crack for a list of them), now that you've seen them here (and the fact that they are in a dataset of 500 most common passwords), you should not use them as secure passwords!\n\n\n::: {#tbl-passwords-long-crack .cell tbl-cap='Passwords that take the longest amount of time to crack by online guessing.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
rank password category value time_unit offline_crack_sec strength
26 trustno1 simple-alphanumeric 92.3 years 29.0 25
336 rush2112 nerdy-pop 92.3 years 29.0 48
406 jordan23 sport 92.3 years 29.3 34
500 passw0rd password-related 92.3 years 29.0 28
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\nThe last numerical variable in the dataset is `offline_crack_sec`.\n@fig-password-offline-crack-hist shows the distribution of this variable, which reveals that all of these passwords can be cracked offline in under 30 seconds, with a large number of them being crackable in just a few seconds.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Histogram of the length of time it takes to crack passwords offline.](03-data-applications_files/figure-html/fig-password-offline-crack-hist-1.png){#fig-password-offline-crack-hist width=90%}\n:::\n:::\n\n\nSo far we examined the distributions of each individual variable, but it would be more interesting to explore relationships between multiple variables.\n@fig-password-strength-rank-category shows the relationship between rank and strength of passwords by category, where more common passwords (those with higher rank) are plotted higher on the y-axis than those that are less common in this dataset.\nThe stronger the password, the larger text it's represented with on the plot.\nWhile this visualization reveals some passwords that are less common, and stronger than others, we should reiterate that you should not use any of these passwords.\nAnd if you already do, it's time to go change it!\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Rank vs. strength of 500 most common passwords by category.](03-data-applications_files/figure-html/fig-password-strength-rank-category-1.png){#fig-password-strength-rank-category width=100%}\n:::\n:::\n\n\nIn this case study, we introduced you to the very first steps a data scientist takes when they start working with a new dataset.\nIn the next few chapters, we will introduce exploratory data analysis and you'll learn more about the various types of data visualizations and summary statistics you can make to get to know your data better.\n\nBefore you move on, we encourage you to think about whether the following questions can be answered with this dataset, and if yes, how you might go about answering them.\nIt's okay if your answer is \"I'm not sure\", we simply want to get your exploratory juices flowing to prime you for what's to come!\n\n1. What characteristics are associated with a strong vs. a weak password?\n2. Do more popular passwords take shorter or longer to crack compared to less popular passwords?\n3. 
\clearpage

## Interactive R tutorials {#data-tutorials}

Navigate the concepts you've learned in this chapter in R using the following self-paced tutorials.
All you need is your browser to get started!

::: {.alltutorials data-latex=""}
[Tutorial 1: Introduction to data](https://openintrostat.github.io/ims-tutorials/01-data/)

::: {.content-hidden unless-format="pdf"}
:::
:::

::: {.singletutorial data-latex=""}
[Tutorial 1 - Lesson 1: Language of data](https://openintro.shinyapps.io/ims-01-data-01/)

::: {.content-hidden unless-format="pdf"}
:::
:::

::: {.singletutorial data-latex=""}
[Tutorial 1 - Lesson 2: Types of studies](https://openintro.shinyapps.io/ims-01-data-02/)

::: {.content-hidden unless-format="pdf"}
:::
:::

::: {.singletutorial data-latex=""}
[Tutorial 1 - Lesson 3: Sampling strategies and experimental design](https://openintro.shinyapps.io/ims-01-data-03/)

::: {.content-hidden unless-format="pdf"}
:::
:::

::: {.singletutorial data-latex=""}
[Tutorial 1 - Lesson 4: Case study](https://openintro.shinyapps.io/ims-01-data-04/)

::: {.content-hidden unless-format="pdf"}
:::
:::

::: {.content-hidden unless-format="pdf"}
You can also access the full list of tutorials supporting this book at\
<https://openintrostat.github.io/ims-tutorials>.
:::

::: {.content-visible when-format="html"}
You can also access the full list of tutorials supporting this book [here](https://openintrostat.github.io/ims-tutorials).
:::

## R labs {#data-labs}

Further apply the concepts you've learned in this part in R with computational labs that walk you through a data analysis case study.

::: {.singlelab data-latex=""}
[Intro to R - Birth rates](https://www.openintro.org/go?id=ims-r-lab-intro-to-r)

::: {.content-hidden unless-format="pdf"}
:::
:::

::: {.content-hidden unless-format="pdf"}
You can also access the full list of labs supporting this book at\
<https://www.openintro.org/go?id=ims-r-labs>.
:::

::: {.content-visible when-format="html"}
You can also access the full list of labs supporting this book [here](https://www.openintro.org/go?id=ims-r-labs).
:::

# Applications: Data {#sec-data-applications}

## Case study: Olympic 1500m {#case-study-paralympics}

While many of you may be glued to the Olympic Games every four years (or every two years if you fancy both summer and winter sports), the Paralympic Games are less widely followed than the Olympic Games, even though they hold the same competitive thrills.

The Paralympic Games began as a way to help soldiers who had been wounded in World War II rehabilitate.
The first Paralympic Games were held in Rome, Italy in 1960.
Since 1988 (Seoul, South Korea), the Paralympic Games have been held in the same city as the Olympic Games, a few weeks later, in both the summer and the winter.

In this case study we introduce a dataset comparing Olympic and Paralympic gold medal finishers in the 1500m running competition (the Olympic "mile", if a bit shorter than a full mile).
The goal of the case study is to walk you through what a data scientist does when they first get a hold of a dataset.
We also provide some "foreshadowing" of concepts and techniques we'll introduce in the next few chapters on exploratory data analysis.
Last, we introduce [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) and discuss the importance of understanding the
impact of multiple variables in an analysis.

::: {.data data-latex=""}
The [`paralympic_1500`](http://openintrostat.github.io/openintro/reference/paralympic_1500.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro/) R package.
:::

@tbl-paralympic-df-tail shows the last ten rows from the dataset, which are the ten most recent 1500m races.
Notice that there are racers from both the Men's and Women's divisions, as well as races of varying type (T11, T12, T13, and Olympic), which reflects the degree of visual impairment.
The T11 athletes have almost complete visual impairment, run with a black-out blindfold, and are allowed to run with a guide-runner.
T12 and T13 athletes have some visual impairment, and the visual acuity of Olympic runners is not determined.
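If you would like to follow along, here is a minimal sketch for reproducing this view of the data (assuming the **openintro** and **dplyr** packages are installed):

```r
library(openintro)
library(dplyr)

# The last ten rows of the dataset, i.e., the ten most recent
# gold medal races
paralympic_1500 |>
  slice_tail(n = 10)
```

The analogous `slice_head(n = 10)` call produces the first ten rows shown in @tbl-paralympic-df-head below.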
::: {#tbl-paralympic-df-tail .cell tbl-cap='Last ten rows of the `paralympic_1500` dataset.'}
::: {.cell-output-display}

| year | city           | country_of_games | division | type    | name                             | country_of_athlete           | time    | time_min |
|-----:|:---------------|:-----------------|:---------|:--------|:---------------------------------|:-----------------------------|:--------|---------:|
| 2016 | Rio de Janeiro | Brazil           | Men      | T13     | Abdellatif Baka                  | Algeria                      | 3:48.29 |     3.81 |
| 2016 | Rio de Janeiro | Brazil           | Women    | Olympic | Faith Chepngetich Kipyegon       | Kenya                        | 4:8.92  |     4.15 |
| 2016 | Rio de Janeiro | Brazil           | Women    | T11     | Jin Zheng                        | China                        | 4:38.92 |     4.65 |
| 2016 | Rio de Janeiro | Brazil           | Women    | T13     | Somaya Bousaid                   | Tunisia                      | 4:21.45 |     4.36 |
| 2020 | Tokyo          | Japan            | Men      | Olympic | Jakob Ingebrigtsen               | Norway                       | 3:28.32 |     3.47 |
| 2020 | Tokyo          | Japan            | Men      | T11     | Yeltsin Jacques                  | Brazil                       | 3:57.6  |     3.96 |
| 2020 | Tokyo          | Japan            | Men      | T13     | Anton Kuliatin                   | Russian Paralympic Committee | 3:54.04 |     3.90 |
| 2020 | Tokyo          | Japan            | Women    | Olympic | Faith Chepngetich Kipyegon       | Kenya                        | 3:53.11 |     3.88 |
| 2020 | Tokyo          | Japan            | Women    | T11     | Monica Olivia Rodriguez Saavedra | Mexico                       | 4:37.4  |     4.62 |
| 2020 | Tokyo          | Japan            | Women    | T13     | Tigist Gezahagn Menigstu         | Ethiopia                     | 4:23.24 |     4.39 |

:::
:::
When you encounter a new dataset, taking a peek at the last few rows as we did in @tbl-paralympic-df-tail should be almost instinctual.
It can often be helpful to look at the first few rows of the data as well to get a sense of other aspects of the data which may not be apparent in the last few rows.
@tbl-paralympic-df-head shows the top ten rows of the `paralympic_1500` dataset, which reveals that for at least the first 10 Olympiads, there were no runners in the Women's division or in the Paralympics.
::: {#tbl-paralympic-df-head .cell tbl-cap='First ten rows of the `paralympic_1500` dataset.'}
::: {.cell-output-display}

| year | city        | country_of_games | division | type    | name            | country_of_athlete | time   | time_min |
|-----:|:------------|:-----------------|:---------|:--------|:----------------|:-------------------|:-------|---------:|
| 1896 | Athens      | Greece           | Men      | Olympic | Edwin Flack     | Australia          | 4:33.2 |     4.55 |
| 1900 | Paris       | France           | Men      | Olympic | Charles Bennett | Great Britain      | 4:6.2  |     4.10 |
| 1904 | St Louis    | USA              | Men      | Olympic | Jim Lightbody   | USA                | 4:5.4  |     4.09 |
| 1908 | London      | United Kingdom   | Men      | Olympic | Mel Sheppard    | USA                | 4:3.4  |     4.06 |
| 1912 | Stockholm   | Sweden           | Men      | Olympic | Arnold Jackson  | Great Britain      | 3:56.8 |     3.95 |
| 1920 | Antwerp     | Belgium          | Men      | Olympic | Albert Hill     | Great Britain      | 4:1.8  |     4.03 |
| 1924 | Paris       | France           | Men      | Olympic | Paavo Nurmi     | Finland            | 3:53.6 |     3.89 |
| 1928 | Amsterdam   | Netherlands      | Men      | Olympic | Harri Larva     | Finland            | 3:53.2 |     3.89 |
| 1932 | Los Angeles | USA              | Men      | Olympic | Luigi Beccali   | Italy              | 3:51.2 |     3.85 |
| 1936 | Berlin      | Germany          | Men      | Olympic | Jack Lovelock   | New Zealand        | 3:47.8 |     3.80 |

:::
:::
At this stage it's also useful to think about how the data were collected, as that will inform the scope of any inference you can make based on your analysis of the data.

::: {.guidedpractice data-latex=""}
Do these data come from an observational study or an experiment?[^03-data-applications-1]
:::

[^03-data-applications-1]: This is an observational study.
    Researchers collected data on past gold medal race times in both Olympic and Paralympic Games.

::: {.guidedpractice data-latex=""}
There are 82 rows and 9 columns in the dataset.
What does each row and each column represent?[^03-data-applications-2]
:::

[^03-data-applications-2]: Each row represents a 1500m gold medal race and each column represents a variable containing information on each race.

Once you've identified the rows and columns, it's useful to review the data dictionary to learn about what each column in the dataset represents.
The data dictionary is provided in @tbl-paralympic-var-def.
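You can verify those dimensions directly; a quick sketch (output shown as a comment):

```r
# 82 gold medal races (rows) and 9 variables (columns)
dim(paralympic_1500)
#> [1] 82  9
```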
::: {#tbl-paralympic-var-def .cell tbl-cap='Variables and their descriptions for the `paralympic_1500` dataset.'}
::: {.cell-output-display}

| Variable             | Description                                                 |
|:---------------------|:------------------------------------------------------------|
| `year`               | Year the Games took place.                                  |
| `city`               | City of the Games.                                          |
| `country_of_games`   | Country of the Games.                                       |
| `division`           | Division: `Men` or `Women`.                                 |
| `type`               | Type of race: `Olympic`, `T11`, `T12`, or `T13`.            |
| `name`               | Name of the athlete.                                        |
| `country_of_athlete` | Country of athlete.                                         |
| `time`               | Time of gold medal race, in m:s.                            |
| `time_min`           | Time of gold medal race, in decimal minutes (min + sec/60). |

:::
:::
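As the dictionary notes, `time_min` is simply `time` re-expressed in decimal minutes. A sketch of that conversion (the `time_min_check` column name is made up for illustration):

```r
library(dplyr)
library(tidyr)

# Split the "m:s" text into its pieces and recompute decimal minutes,
# e.g., "3:28.32" becomes 3 + 28.32/60 = 3.47
paralympic_1500 |>
  separate(time, into = c("min", "sec"), sep = ":",
           convert = TRUE, remove = FALSE) |>
  mutate(time_min_check = min + sec / 60)
```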
We now have a better sense of what each column represents, but we do not yet know much about the characteristics of each of the variables.

::: {.workedexample data-latex=""}
Determine whether each variable in the `paralympic_1500` dataset is numerical or categorical.
For numerical variables, further classify them as continuous or discrete.
For categorical variables, determine if the variable is ordinal.

------------------------------------------------------------------------

The numerical variables in the dataset are `year` (discrete) and `time_min` (continuous).
The categorical variables are `city`, `country_of_games`, `division`, `type`, `name`, and `country_of_athlete`.
The `time` variable is trickier to classify -- we can think of it as numerical, but it is stored as categorical.
The categorical classification is due to the colon `:` which separates the minutes from the seconds.
Sometimes the data dictionary (presented in @tbl-paralympic-var-def) isn't sufficient for a complete analysis, and we need to go back to the data source and try to understand the data better before we can proceed with the analysis meaningfully.
:::

Next, let's try to get to know each variable a little bit better.
For categorical variables, this involves figuring out what their levels are and how commonly represented they are in the data.
@fig-paralympic-cat shows the distributions of two of the categorical variables in this dataset.
We can see that the United States has hosted the Games most often, but runners from Great Britain and Kenya have won the 1500m most often.
A large number of countries have had a single gold medal winner of the 1500m, and, similarly, a large number of countries have hosted the Games only once.
Over the last century, the name describing the country for athletes from one particular region has changed and includes Russian Federation, Unified Team, and Russian Paralympic Committee.
Both of the visualizations are bar plots, which you will learn more about in @sec-explore-categorical.

::: {.cell}
::: {.cell-output-display}
![Distributions of categorical variables in the `paralympic_1500` dataset. Plot A shows the distribution of the country of origin of the athlete; Plot B shows the distribution of the country in which the Games took place.](03-data-applications_files/figure-html/fig-paralympic-cat-1.png){#fig-paralympic-cat fig-alt='Two separate bar plots. The left panel shows a bar plot counting the number of gold medal athletes from each country. Great Britain has had 8 top finishers, Kenya has had 7 top finishers, and Tunisia and Algeria have both had 5. The right panel shows a bar plot counting the number of games which have happened in each country. The USA has hosted 4 games, the UK has hosted 3 games, and each of Japan, Greece, Germany, France, and Australia have hosted the games twice.' width=100%}
:::
:::
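The counts behind bar plots like these can be tabulated with `count()`; a sketch (the `distinct()` step is an assumption about how hosting should be counted, since each Games contributes several races to the dataset):

```r
library(dplyr)

# Gold medals by country of athlete, most frequent first
paralympic_1500 |>
  count(country_of_athlete, sort = TRUE)

# Games hosted by country: reduce to one row per Games first,
# since each Games appears once per race
paralympic_1500 |>
  distinct(year, country_of_games) |>
  count(country_of_games, sort = TRUE)
```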
Similarly, we can examine the distributions of the numerical variables.
We already know that the 1500m times are mostly between 3.5min and 4.5min, based on @tbl-paralympic-df-tail and @tbl-paralympic-df-head.
We can also break down the 1500m time by division and type of race.
@tbl-paralympic-summary shows the mean, minimum, and maximum 1500m times broken down by division and race type.
Recall that the Men's Olympic division has taken place since 1896, whereas the Men's Paralympic division has happened only since 1960.
The maximum race time, therefore, should be taken into context in terms of the year of the Games.
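A summary like @tbl-paralympic-summary could come from a grouped summary; a **dplyr** sketch:

```r
library(dplyr)

# Mean, minimum, and maximum gold medal time (in decimal minutes)
# within each division and race type
paralympic_1500 |>
  group_by(division, type) |>
  summarize(
    mean = mean(time_min),
    min  = min(time_min),
    max  = max(time_min),
    .groups = "drop"
  )
```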
::: {#tbl-paralympic-summary .cell tbl-cap='Mean, minimum, and maximum of the gold medal times for the 1500m race broken down by division and type of race.'}
::: {.cell-output-display}

| division | type    | mean |  min |  max |
|:---------|:--------|-----:|-----:|-----:|
| Men      | Olympic | 3.76 | 3.47 | 4.55 |
| Men      | T11     | 4.14 | 3.96 | 4.31 |
| Men      | T12     | 4.11 | 3.94 | 4.25 |
| Men      | T13     | 3.98 | 3.81 | 4.24 |
| Women    | Olympic | 4.02 | 3.88 | 4.18 |
| Women    | T11     | 5.05 | 4.62 | 5.63 |
| Women    | T12     | 4.88 | 4.61 | 5.57 |
| Women    | T13     | 4.55 | 4.23 | 5.24 |

:::
:::
### Fun fact {-}

Sometimes playing around with the dataset will uncover interesting elements about the context in which the data were collected.
A scatterplot of the Men's 1500m broken down by race type shows that, in any given year, the Olympic runner is substantially faster than the Paralympic runners, with one exception.
In the Rio de Janeiro 2016 games, the [T13 gold medal athlete ran faster (3:48.29) than the Olympic gold medal athlete (3:50.00)](https://www.paralympic.org/news/remarkable-finish-1500m-rio-2016) (see @fig-paralympic-rio).
In fact, some internet sleuthing tells you that the **top four** T13 finishers all finished the 1500m under 3:50.00!

::: {.cell}
::: {.cell-output-display}
![1500m race time for Men's Olympic and Paralympic athletes. Dashed grey line represents the Rio games in 2016.](03-data-applications_files/figure-html/fig-paralympic-rio-1.png){#fig-paralympic-rio fig-alt='A scatterplot with year on the x-axis and gold medal 1500m time on the y-axis. The points are colored by which group the athlete is in - T11, T12, T13, or Olympic. A vertical line at 2016 shows that in the Rio games the T13 gold medal athlete was faster than the Olympic gold medal athlete.' width=100%}
:::
:::

So far we have examined aspects of some of the individual variables, and we have broken down the 1500m race times in terms of division and race type.
You might have already wondered how the race times vary across years.
The `paralympic_1500` dataset provides us with an opportunity to explore an important statistical concept, Simpson's paradox.

## Simpson's paradox

Simpson's paradox \index{Simpson's paradox} describes a relationship among three (or more) variables.
The paradox happens when a third variable reverses the relationship between the first two variables.

Let's start by considering how the 1500m gold medal race times have changed over the years.
@fig-paralympic-ungrouped shows a scatterplot describing 1500m race times and year for Men's Olympic and Paralympic (T11) athletes with a line of best fit (to the entire dataset) superimposed (see @sec-model-slr where we will present fitting a line to a scatterplot).
Notice that the line of best fit shows a **positive** relationship between race time and year.
That is, for later years, the predicted gold medal time is higher than in earlier years.

::: {.cell}
::: {.cell-output-display}
![1500m race time for Men's Olympic and Paralympic (T11) athletes. The line represents a line of best fit to the entire dataset.](03-data-applications_files/figure-html/fig-paralympic-ungrouped-1.png){#fig-paralympic-ungrouped fig-alt='A scatterplot with year on the x-axis and gold medal 1500m time on the y-axis. A line of best fit is drawn over the points.' width=100%}
:::
:::

Of course, both your eye and your intuition are likely telling you that it wouldn't make any sense to try to model all of the athletes together.
Instead, a separate model should be run for each of the two types of Games: Olympic and Paralympic (T11).
@fig-paralympic-grouped shows a scatterplot describing 1500m race times and year for Men's Olympic and Paralympic (T11) athletes with a line of best fit superimposed separately for each of the two types of races.
Notice that within each type of race, the relationship between 1500m race time and year is now **negative**.

::: {.cell}
::: {.cell-output-display}
![1500m race time for Men's Olympic and Paralympic (T11) athletes. The best fit line is now fit separately to the Olympic and Paralympic athletes.](03-data-applications_files/figure-html/fig-paralympic-grouped-1.png){#fig-paralympic-grouped fig-alt='A scatterplot with year on the x-axis and gold medal 1500m time on the y-axis. The points are colored by the type of athlete - T11 or Olympic. Lines of best fit are drawn separately for the two groups (T11 and Olympic).' width=100%}
:::
:::
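The reversal is easy to check numerically as well. Here is a minimal sketch, assuming the **dplyr** package and restricting attention to the Men's Olympic and T11 races (the object name `men_1500` is made up):

```r
library(dplyr)

men_1500 <- paralympic_1500 |>
  filter(division == "Men", type %in% c("Olympic", "T11"))

# Slope of a single line fit to all of these races together: positive
coef(lm(time_min ~ year, data = men_1500))["year"]

# Slopes of separate lines within each race type: negative
men_1500 |>
  group_by(type) |>
  summarize(slope = coef(lm(time_min ~ year))["year"])
```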
::: {.important data-latex=""}
**Simpson's paradox.**

Simpson's paradox happens when an association or relationship between two variables in one direction (e.g., positive) reverses (e.g., becomes negative) when a third variable is considered.
:::

Simpson's paradox was seen in the 1500m race data because the aggregate data showed a positive relationship (positive slope) between year and race time but a negative relationship (negative slope) between year and race time when broken down by the type of race.

Simpson's paradox is observed with categorical data and with numeric data.
Often the paradox happens because the third variable (here, race type) is imbalanced.
There are either more observations in one group or the observations happen at different intervals across the two groups.
In the 1500m data, we saw that the T11 runners had fewer observations, and their times were both generally slower and concentrated in more recent years than those of the Olympic runners.

In the 1500m analysis, it would be most prudent to report the trends separately for the Olympic and the T11 athletes.
However, in other situations, it might be better to aggregate the data and report the overall trend.
Many additional examples of Simpson's paradox, and a further exploration of the paradox, are given in @Witmer:2021.

In this case study, we introduced you to the very first steps a data scientist takes when they start working with a new dataset.
In the next few chapters, we will introduce exploratory data analysis, and you'll learn more about the various types of data visualizations and summary statistics you can make to get to know your data better.

Before you move on, we encourage you to think about whether the following questions can be answered with this dataset, and if yes, how you might go about answering them.
It's okay if your answer is "I'm not sure", we simply want to get your exploratory juices flowing to prime you for what's to come!

1. Has there ever been a year when a visually impaired Paralympic gold medal athlete beat the Olympic gold medal athlete?
2. When comparing the Paralympic and Olympic 1500m gold medal athletes, does Simpson's paradox hold in the Women's division?
3. Is there a biological boundary which establishes a time under which no human could run 1500m?

## Interactive R tutorials {#data-tutorials}

Navigate the concepts you've learned in this chapter in R using the following self-paced tutorials.
All you need is your browser to get started!

::: {.alltutorials data-latex=""}
[Tutorial 1: Introduction to data](https://openintrostat.github.io/ims-tutorials/01-data/)

::: {.content-hidden unless-format="pdf"}
https://openintrostat.github.io/ims-tutorials/01-data
:::
:::

::: {.singletutorial data-latex=""}
[Tutorial 1 - Lesson 1: Language of data](https://openintro.shinyapps.io/ims-01-data-01/)

::: {.content-hidden unless-format="pdf"}
https://openintro.shinyapps.io/ims-01-data-01
:::
:::

::: {.singletutorial data-latex=""}
[Tutorial 1 - Lesson 2: Types of studies](https://openintro.shinyapps.io/ims-01-data-02/)

::: {.content-hidden unless-format="pdf"}
https://openintro.shinyapps.io/ims-01-data-02
:::
:::

::: {.singletutorial data-latex=""}
[Tutorial 1 - Lesson 3: Sampling strategies and experimental design](https://openintro.shinyapps.io/ims-01-data-03/)

::: {.content-hidden unless-format="pdf"}
https://openintro.shinyapps.io/ims-01-data-03
:::
:::

::: {.singletutorial data-latex=""}
[Tutorial 1 - Lesson 4: Case study](https://openintro.shinyapps.io/ims-01-data-04/)

::: {.content-hidden unless-format="pdf"}
https://openintro.shinyapps.io/ims-01-data-04
:::
:::

::: {.content-hidden unless-format="pdf"}
You can also access the full list of tutorials supporting this book at\
<https://openintrostat.github.io/ims-tutorials>.
:::

::: {.content-visible when-format="html"}
You can also access the full list of tutorials supporting this book [here](https://openintrostat.github.io/ims-tutorials).
:::

## R labs {#data-labs}

Further apply the concepts you've learned in this part in R with computational labs that walk you through a data analysis case study.

::: {.singlelab data-latex=""}
[Intro to R - Birth rates](https://www.openintro.org/go?id=ims-r-lab-intro-to-r)

::: {.content-hidden unless-format="pdf"}
https://www.openintro.org/go?id=ims-r-lab-intro-to-r
:::
:::

::: {.content-hidden unless-format="pdf"}
You can also access the full list of labs supporting this book at\
<https://www.openintro.org/go?id=ims-r-labs>.
:::

::: {.content-visible when-format="html"}
You can also access the full list of labs supporting this book [here](https://www.openintro.org/go?id=ims-r-labs).
:::

diff --git a/_freeze/03-data-applications/figure-html/fig-paralympic-cat-1.png b/_freeze/03-data-applications/figure-html/fig-paralympic-cat-1.png
new file mode 100644
index 00000000..a5300f11
Binary files /dev/null and b/_freeze/03-data-applications/figure-html/fig-paralympic-cat-1.png differ
diff --git a/_freeze/03-data-applications/figure-html/fig-paralympic-grouped-1.png b/_freeze/03-data-applications/figure-html/fig-paralympic-grouped-1.png
new file mode 100644
index 00000000..22133ea7
Binary files /dev/null and b/_freeze/03-data-applications/figure-html/fig-paralympic-grouped-1.png differ
diff --git a/_freeze/03-data-applications/figure-html/fig-paralympic-rio-1.png b/_freeze/03-data-applications/figure-html/fig-paralympic-rio-1.png
new file mode 100644
index 00000000..9aa5f691
Binary files /dev/null and b/_freeze/03-data-applications/figure-html/fig-paralympic-rio-1.png differ
diff --git a/_freeze/03-data-applications/figure-html/fig-paralympic-ungrouped-1.png b/_freeze/03-data-applications/figure-html/fig-paralympic-ungrouped-1.png
new file mode 100644
index 00000000..2d699be0
Binary files /dev/null and b/_freeze/03-data-applications/figure-html/fig-paralympic-ungrouped-1.png differ
diff --git a/book.bib b/book.bib
index 616cdbd6..ed57e8c0 100644
--- a/book.bib
+++ b/book.bib
@@ -196,4 +196,16 @@ @article{Hesterbeg:2015
   year={2015},
   publisher={Taylor \& Francis},
   doi={10.1080/00031305.2015.1089789}
+}
+
+@article{Witmer:2021,
+  author = {Jeff Witmer},
+  title = {Simpson's Paradox, Visual Displays, and Causal Diagrams},
+  journal = {The American Mathematical Monthly},
+  volume = {128},
+  number = {7},
+  pages = {598-610},
+  year = {2021},
+  publisher = {Taylor \& Francis},
+  doi = {10.1080/00029890.2021.1932237}
 }
\ No newline at end of file
diff --git a/introduction-to-data.qmd b/introduction-to-data.qmd
index 73e4f5b5..7033d626 100644
--- a/introduction-to-data.qmd
+++ b/introduction-to-data.qmd
@@ -8,7 +8,7 @@ Different data and settings lead to different **types** of conclusions, so you'l
 
 - In @sec-data-design the focus is on study design. In particular, the critical distinction between random sampling and randomization is made.
 
-- @sec-data-applications includes an application on the Passwords case study where the topics from this part of the book are fully developed.
+- @sec-data-applications includes an application on the Paralympics case study where the topics from this part of the book are fully developed.
 
 We recommend you come back to this part to review after you cover a new part in the textbook. In particular, it is worthwhile to consider @fig-randsampValloc in all of the inferential settings you cover.