diff --git a/11-foundations-randomization.qmd b/11-foundations-randomization.qmd index 5f5137dd..211b5142 100644 --- a/11-foundations-randomization.qmd +++ b/11-foundations-randomization.qmd @@ -42,10 +42,10 @@ We would probably observe a small difference due to *chance*. ::: ::: {.guidedpractice data-latex=""} -If we do not think the side of the room a person sits on in class is related to whether they prefer to read books on screen, what assumption are we making about the relationship between these two variables?[^11-foundations-randomization-1] +If we do not think the side of the room a person sits on in class is related to whether they prefer to read books on screen, what assumption are we making about the relationship between these two variables?[^1] ::: -[^11-foundations-randomization-1]: We would be assuming that these two variables are **independent**\index{independent}. +[^1]: We would be assuming that these two variables are **independent**\index{independent}. ```{r} #| include: false @@ -57,7 +57,7 @@ Throughout this chapter, and those that follow, we provide three different appro Using the methods provided in this chapter, we will be able to draw conclusions beyond the dataset at hand to research questions about larger populations that the samples come from. The first type of variability we will explore comes from experiments where the explanatory variable (or treatment) is randomly assigned to the observational units. -As you learned in Chapter \@ref(data-hello), a randomized experiment can be used to assess whether one variable (the explanatory variable) causes changes in a second variable (the response variable). +As you learned in [Chapter -@sec-data-hello], a randomized experiment can be used to assess whether one variable (the explanatory variable) causes changes in a second variable (the response variable). Every dataset has some variability in it, so to decide whether the variability in the data is due to (1) the causal mechanism (the randomized explanatory variable in the experiment) or instead (2) natural variability inherent to the data, we set up a sham randomized experiment as a comparison. That is, we assume that each observational unit would have gotten the exact same response value regardless of the treatment level. By reassigning the treatments many many times, we can compare the actual experiment to the sham experiment. @@ -69,7 +69,7 @@ Using a few different case studies, let's look more carefully at this idea of a terms_chp_11 <- c(terms_chp_11, "randomization test") ``` -## Sex discrimination case study {#caseStudySexDiscrimination} +## Sex discrimination case study {#sec-caseStudySexDiscrimination} We consider a study investigating sex discrimination in the 1970s, which is set in the context of personnel decisions within a bank. The research question we hope to answer is, "Are individuals who identify as female discriminated against in promotion decisions made by their managers who identify as male?" [@Rosen:1974] @@ -90,26 +90,26 @@ These files were randomly assigned to the bank managers. ::: {.guidedpractice data-latex=""} Is this an observational study or an experiment? -How does the type of study impact what can be inferred from the results?[^11-foundations-randomization-2] +How does the type of study impact what can be inferred from the results?[^2] ::: -[^11-foundations-randomization-2]: The study is an experiment, as subjects were randomly assigned a "male" file or a "female" file (remember, all the files were actually identical in content). 
+[^2]: The study is an experiment, as subjects were randomly assigned a "male" file or a "female" file (remember, all the files were actually identical in content). Since this is an experiment, the results can be used to evaluate a causal relationship between the sex of a candidate and the promotion decision. ```{r} #| label: sex-discrimination-obs-p -sex_discrimination_props <- sex_discrimination %>% - rename(sex = sex) %>% - count(sex, decision) %>% - group_by(sex) %>% +sex_discrimination_props <- sex_discrimination |> + rename(sex = sex) |> + count(sex, decision) |> + group_by(sex) |> mutate(p = n / sum(n)) -p_male <- sex_discrimination_props %>% - filter(sex == "male", decision == "promoted") %>% +p_male <- sex_discrimination_props |> + filter(sex == "male", decision == "promoted") |> pull(p) -p_female <- sex_discrimination_props %>% - filter(sex == "female", decision == "promoted") %>% +p_female <- sex_discrimination_props |> + filter(sex == "female", decision == "promoted") |> pull(p) p_diff <- p_male - p_female @@ -120,34 +120,39 @@ perc_diff <- label_percent(accuracy = 0.1)(p_diff) ``` For each supervisor both the sex associated with the assigned file and the promotion decision were recorded. -Using the results of the study summarized in Table \@ref(tab:sex-discrimination-obs), we would like to evaluate if individuals who identify as female are unfairly discriminated against in promotion decisions. +Using the results of the study summarized in @tbl-sex-discrimination-obs, we would like to evaluate if individuals who identify as female are unfairly discriminated against in promotion decisions. In this study, a smaller proportion of female identifying applications were promoted than males (`r p_female` versus `r p_male`), but it is unclear whether the difference provides *convincing evidence* that individuals who identify as female are unfairly discriminated against. ```{r} -#| label: sex-discrimination-obs -sex_discrimination %>% - count(decision, sex) %>% - pivot_wider(names_from = decision, values_from = n) %>% - adorn_totals(where = c("col", "row")) %>% - kbl(linesep = "", booktabs = TRUE, caption = "Summary results for the sex discrimination study.") %>% +#| label: tbl-sex-discrimination-obs +#| tbl-cap: | +#| Summary results for the sex discrimination study. + +sex_discrimination |> + count(decision, sex) |> + pivot_wider(names_from = decision, values_from = n) |> + adorn_totals(where = c("col", "row")) |> + kbl(linesep = "", booktabs = TRUE) |> kable_styling(bootstrap_options = c("striped", "condensed"), - latex_options = c("striped", "hold_position"), full_width = FALSE) %>% - add_header_above(c(" " = 1, "decision" = 2, " " = 1)) %>% + latex_options = c("striped", "hold_position"), full_width = FALSE) |> + add_header_above(c(" " = 1, "decision" = 2, " " = 1)) |> column_spec(1:4, width = "7em") ``` -The data are visualized in Figure \@ref(fig:sex-rand-obs) as a set of cards. +The data are visualized in @fig-sex-rand-obs as a set of cards. Note that each card denotes a personnel file (an observation from our dataset) and the colors indicate the decision: red for promoted and white for not promoted. Additionally, the observations are broken up into groups of male and female identifying groups. ```{r} -#| label: sex-rand-obs -#| out.width: 40% -#| fig.cap: The sex discrimination study can be thought of as 48 red and white cards. -#| fig.alt: 48 cards are laid out; 24 indicating male files, 24 indicated female files. 
Of -#| the 24 male files 3 of the cards are colored white, and 21 of the cards are colored -#| red. Of the female files, 10 of the cards are colored white, and 14 of the cards -#| are colored red. +#| label: fig-sex-rand-obs +#| fig-cap: The sex discrimination study can be thought of as 48 red and white cards. +#| fig-alt: | +#| 48 cards are laid out; 24 indicating male files, 24 indicated female files. +#| Of the 24 male files 3 of the cards are colored white, and 21 of the cards +#| are colored red. Of the female files, 10 of the cards are colored white, +#| and 14 of the cards are colored red. +#| out-width: 40% + knitr::include_graphics("images/sex-rand-01-obs.png") ``` @@ -163,7 +168,7 @@ Since we wouldn't expect the sample proportions to be *exactly* equal, even if t ::: The previous example is a reminder that the observed outcomes in the sample may not perfectly reflect the true relationships between variables in the underlying population. -Table \@ref(tab:sex-discrimination-obs) shows there were 7 fewer promotions for female identifying personnel than for the male personnel, a difference in promotion rates of `r perc_diff` $\left( \frac{21}{24} - \frac{14}{24} = 0.292 \right).$ This observed difference is what we call a **point estimate**\index{point estimate} of the true difference. +@tbl-sex-discrimination-obs shows there were 7 fewer promotions for female identifying personnel than for the male personnel, a difference in promotion rates of `r perc_diff` $\left( \frac{21}{24} - \frac{14}{24} = 0.292 \right).$ This observed difference is what we call a **point estimate**\index{point estimate} of the true difference. The point estimate of the difference in promotion rate is large, but the sample size for the study is small, making it unclear if this observed difference represents discrimination or whether it is simply due to chance when there is no discrimination occurring. Chance can be thought of as the claim due to natural variability; discrimination can be thought of as the claim the researchers set out to demonstrate. We label these two competing claims, $H_0$ and $H_A:$ @@ -217,12 +222,12 @@ If data and the null claim seem to be at odds with one another, and the data see ### Variability of the statistic -Table \@ref(tab:sex-discrimination-obs) shows that 35 bank supervisors recommended promotion and 13 did not. +@tbl-sex-discrimination-obs shows that 35 bank supervisors recommended promotion and 13 did not. Now, suppose the bankers' decisions were independent of the sex of the candidate. Then, if we conducted the experiment again with a different random assignment of sex to the files, differences in promotion rates would be based only on random fluctuation in promotion decisions. -We can actually perform this **randomization**, which simulates what would have happened if the bankers' decisions had been independent of `sex` but we had distributed the file sexes differently.[^11-foundations-randomization-3] +We can actually perform this **randomization**, which simulates what would have happened if the bankers' decisions had been independent of `sex` but we had distributed the file sexes differently.[^3] -[^11-foundations-randomization-3]: The test procedure we employ in this section is sometimes referred to as a **randomization test**. +[^3]: The test procedure we employ in this section is sometimes referred to as a **randomization test**. 
If the explanatory variable had not been randomly assigned, as in an observational study, the procedure would be referred to as a **permutation test**. Permutation tests are used for observational studies, where the explanatory variable was not randomly assigned.\index{permutation test}. @@ -235,15 +240,18 @@ In the **simulation**\index{simulation}, we thoroughly shuffle the 48 personnel Note that by keeping 35 promoted and 13 not promoted, we are assuming that 35 of the bank managers would have promoted the individual whose content is contained in the file **independent** of the sex indicated on their file. We will deal 24 files into the first stack, which will represent the 24 "female" files. The second stack will also have 24 files, and it will represent the 24 "male" files. -Figure \@ref(fig:sex-rand-shuffle-1) highlights both the shuffle and the reallocation to the sham sex groups. +@fig-sex-rand-shuffle-1 highlights both the shuffle and the reallocation to the sham sex groups. ```{r} -#| label: sex-rand-shuffle-1 -#| out.width: 80% -#| fig.cap: The sex discrimination data is shuffled and reallocated to new groups of +#| label: fig-sex-rand-shuffle-1 +#| fig-cap: | +#| The sex discrimination data is shuffled and reallocated to new groups of #| male and female files. -#| fig.alt: The 48 red and white cards which denote the original data are shuffled and +#| fig-alt: | +#| The 48 red and white cards which denote the original data are shuffled and #| reassigned, 24 to each group indicating 24 male files and 24 female files. +#| out.width: 80% + knitr::include_graphics("images/sex-rand-02-shuffle-1.png") ``` @@ -255,40 +263,44 @@ terms_chp_11 <- c(terms_chp_11, "simulation") ``` Since the randomization of files in this simulation is independent of the promotion decisions, any difference in promotion rates is due to chance. -Table \@ref(tab:sex-discrimination-rand-1) show the results of one such simulation. +@tbl-sex-discrimination-rand-1 show the results of one such simulation. ```{r} -#| label: sex-discrimination-rand-1 +#| label: tbl-sex-discrimination-rand-1 +#| tbl-cap: | +#| Simulation results, where the difference in promotion rates between male +#| and female is purely due to random chance. 
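# A minimal sketch of how one shuffled reallocation like the one tabulated
# below could be generated programmatically: permuting the sex labels keeps
# the 35 promoted and 13 not promoted decisions fixed while breaking any link
# to sex. The object name and seed here are illustrative, not from the
# original code.
set.seed(1234)
sex_discrimination_shuffled <- sex_discrimination |>
  mutate(sex = sample(sex))
# count(sex_discrimination_shuffled, sex, decision) would tabulate this shuffle.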
+ sex_discrimination_rand_1 <- tibble( sex = c(rep("male", 24), rep("female", 24)), decision = c(rep("promoted", 18), rep("not promoted", 6), rep("promoted", 17), rep("not promoted", 7)) -) %>% +) |> mutate( sex = fct_relevel(sex, "male", "female"), decision = fct_relevel(decision, "promoted", "not promoted") ) -sex_discrimination_rand_1 %>% - count(decision, sex) %>% - pivot_wider(names_from = decision, values_from = n) %>% - adorn_totals(where = c("col", "row")) %>% - kbl(linesep = "", booktabs = TRUE, caption = "Simulation results, where the difference in promotion rates between male and female is purely due to random chance.") %>% +sex_discrimination_rand_1 |> + count(decision, sex) |> + pivot_wider(names_from = decision, values_from = n) |> + adorn_totals(where = c("col", "row")) |> + kbl(linesep = "", booktabs = TRUE) |> kable_styling(bootstrap_options = c("striped", "condensed"), - latex_options = c("striped", "hold_position"), full_width = FALSE) %>% - add_header_above(c(" " = 1, "decision" = 2, " " = 1)) %>% + latex_options = c("striped", "hold_position"), full_width = FALSE) |> + add_header_above(c(" " = 1, "decision" = 2, " " = 1)) |> column_spec(1:4, width = "7em") ``` ::: {.guidedpractice data-latex=""} -What is the difference in promotion rates between the two simulated groups in Table \@ref(tab:sex-discrimination-rand-1) ? -How does this compare to the observed difference 29.2% from the actual study?[^11-foundations-randomization-4] +What is the difference in promotion rates between the two simulated groups in @tbl-sex-discrimination-rand-1 ? +How does this compare to the observed difference 29.2% from the actual study?[^4] ::: -[^11-foundations-randomization-4]: $18/24 - 17/24=0.042$ or about 4.2% in favor of the male personnel. +[^4]: $18/24 - 17/24=0.042$ or about 4.2% in favor of the male personnel. This difference due to chance is much smaller than the difference observed in the actual groups. -Figure \@ref(fig:sex-rand-shuffle-1-sort) shows that the difference in promotion rates is much larger in the original data than it is in the simulated groups (0.292 \> 0.042). +@fig-sex-rand-shuffle-1-sort shows that the difference in promotion rates is much larger in the original data than it is in the simulated groups (0.292 \> 0.042). The quantity of interest throughout this case study has been the difference in promotion rates. We call the summary value the **statistic** of interest (or often the **test statistic**). When we encounter different data structures, the statistic is likely to change (e.g., we might calculate an average instead of a proportion), but we will always want to understand how the statistic varies from sample to sample. @@ -299,18 +311,21 @@ terms_chp_11 <- c(terms_chp_11, "statistic", "test statistic") ``` ```{r} -#| label: sex-rand-shuffle-1-sort -#| out.width: 100% -#| fig.cap: We summarize the randomized data to produce one estimate of the difference +#| label: fig-sex-rand-shuffle-1-sort +#| out-width: 100% +#| fig-cap: | +#| We summarize the randomized data to produce one estimate of the difference #| in proportions given no sex discrimination. Note that the sort step is only used #| to make it easier to visually calculate the simulated sample proportions. -#| fig.alt: The 48 red and white cards are show in three panels. The first panel represents +#| fig-alt: | +#| The 48 red and white cards are show in three panels. 
The first panel represents #| the original data and original allocation of the male and female files (in the original #| data there are 3 white cards in the male group and 10 white cards in the female #| group). The second panel represents the shuffled red and white cards that are randomly #| assigned as male and female files. The third panel has the cards sorted according #| to the random assignment of female or male. In the third panel there are 6 white #| cards in the male group and 7 white cards in the female group. + knitr::include_graphics("images/sex-rand-03-shuffle-1-sort.png") ``` @@ -321,19 +336,25 @@ While in this first simulation, we physically dealt out files, it is much more e Repeating the simulation on a computer, we get another difference due to chance under the same assumption: -0.042. And another: 0.208. And so on until we repeat the simulation enough times that we have a good idea of the shape of the *distribution of differences* under the null hypothesis. -Figure \@ref(fig:sex-rand-dot-plot) shows a plot of the differences found from 100 simulations, where each dot represents a simulated difference between the proportions of male and female files recommended for promotion. +@fig-sex-rand-dot-plot shows a plot of the differences found from 100 simulations, where each dot represents a simulated difference between the proportions of male and female files recommended for promotion. ```{r} -#| label: sex-rand-dot-plot -#| fig.cap: (ref:sex-rand-dot-plot-cap) -#| out.width: 100% +#| label: fig-sex-rand-dot-plot +#| fig-cap: | +#| A stacked dot plot of differences from 100 simulations produced under +#| the null hypothesis, $H_0,$ where the simulated sex and decision are +#| independent. Two of the 100 simulations had a difference of at least +#| 29.2%, the difference observed in the study, and are shown as solid +#| blue dots. +#| out-width: 100% + set.seed(37) -sex_discrimination %>% - specify(decision ~ sex, success = "promoted") %>% - hypothesize(null = "independence") %>% - generate(reps = 100, type = "permute") %>% - calculate(stat = "diff in props", order = c("male", "female")) %>% - mutate(stat = round(stat, 3)) %>% +sex_discrimination |> + specify(decision ~ sex, success = "promoted") |> + hypothesize(null = "independence") |> + generate(reps = 100, type = "permute") |> + calculate(stat = "diff in props", order = c("male", "female")) |> + mutate(stat = round(stat, 3)) |> ggplot(aes(x = stat)) + geom_dotplot(binwidth = 0.01) + gghighlight(stat >= 0.292) + @@ -347,19 +368,17 @@ sex_discrimination %>% ) ``` -(ref:sex-rand-dot-plot-cap) A stacked dot plot of differences from 100 simulations produced under the null hypothesis, $H_0,$ where the simulated sex and decision are independent. Two of the 100 simulations had a difference of at least 29.2%, the difference observed in the study, and are shown as solid blue dots. - Note that the distribution of these simulated differences in proportions is centered around 0. Under the null hypothesis our simulations made no distinction between male and female personnel files. Thus, a center of 0 makes sense: we should expect differences from chance alone to fall around zero with some random fluctuation for each simulation. ::: {.workedexample data-latex=""} -How often would you observe a difference of at least `r perc_diff` (`r p_diff`) according to Figure \@ref(fig:sex-rand-dot-plot)? +How often would you observe a difference of at least `r perc_diff` (`r p_diff`) according to @fig-sex-rand-dot-plot? 
Often, sometimes, rarely, or never? ------------------------------------------------------------------------ -It appears that a difference of at least `r perc_diff` under the null hypothesis would only happen about 2% of the time according to Figure \@ref(fig:sex-rand-dot-plot). +It appears that a difference of at least `r perc_diff` under the null hypothesis would only happen about 2% of the time according to @fig-sex-rand-dot-plot. Such a low probability indicates that observing such a large difference from chance alone is rare. ::: @@ -369,11 +388,11 @@ The difference of 29.2% is a rare event if there really is no impact from listin - If $H_A,$ the **Alternative hypothesis** is true: Sex has an effect on promotion decision, and what we observed was actually due to equally qualified female candidates being discriminated against in promotion decisions, which explains the large difference of 29.2%. -When we conduct formal studies, we reject a null position (the idea that the data are a result of chance only) if the data strongly conflict with that null position.[^11-foundations-randomization-5] +When we conduct formal studies, we reject a null position (the idea that the data are a result of chance only) if the data strongly conflict with that null position.[^5] In our analysis, we determined that there was only a $\approx$ 2% probability of obtaining a sample where $\geq$ 29.2% more male candidates than female candidates get promoted under the null hypothesis, so we conclude that the data provide strong evidence of sex discrimination against female candidates by the male supervisors. In this case, we reject the null hypothesis in favor of the alternative. -[^11-foundations-randomization-5]: This reasoning does not generally extend to anecdotal observations. +[^5]: This reasoning does not generally extend to anecdotal observations. Each of us observes incredibly rare events every day, events we could not possibly hope to predict. However, in the non-rigorous setting of anecdotal evidence, almost anything may appear to be a rare event, so the idea of looking for rare events in day-to-day activities is treacherous. For example, we might look at the lottery: there was only a 1 in 176 million chance that the Mega Millions numbers for the largest jackpot in history (October 23, 2018) would be (5, 28, 62, 65, 70) with a Mega ball of (5), but nonetheless those numbers came up! @@ -392,7 +411,7 @@ Before getting into the nuances of hypothesis testing, let's work through anothe terms_chp_11 <- c(terms_chp_11, "statistical inference") ``` -## Opportunity cost case study {#caseStudyOpportunityCost} +## Opportunity cost case study {#sec-caseStudyOpportunityCost} How rational and consistent is the behavior of the typical American college student? In this section, we'll explore whether college student consumers always consider the following: money not spent now can be spent later. @@ -411,9 +430,9 @@ In this section, we'll explore an experiment conducted by researchers that inves One-hundred and fifty students were recruited for the study, and each was given the following statement: -> *Imagine that you have been saving some extra money on the side to make some purchases, and on your most recent visit to the video store you come across a special sale on a new video. This video is one with your favorite actor or actress, and your favorite type of movie (such as a comedy, drama, thriller, etc.). This particular video that you are considering is one you have been thinking about buying for a long time. 
It is available for a special sale price of \$14.99. What would you do in this situation? Please circle one of the options below.*[^11-foundations-randomization-6] +> *Imagine that you have been saving some extra money on the side to make some purchases, and on your most recent visit to the video store you come across a special sale on a new video. This video is one with your favorite actor or actress, and your favorite type of movie (such as a comedy, drama, thriller, etc.). This particular video that you are considering is one you have been thinking about buying for a long time. It is available for a special sale price of \$14.99. What would you do in this situation? Please circle one of the options below.*[^6] -[^11-foundations-randomization-6]: This context might feel strange if physical video stores predate you. +[^6]: This context might feel strange if physical video stores predate you. If you're curious about what those were like, look up "Blockbuster". Half of the 150 students were randomized into a control group and were given the following two options: @@ -429,33 +448,36 @@ The remaining 75 students were placed in the treatment group, and they saw a sli > (B) Not buy this entertaining video. Keep the \$14.99 for other purchases. Would the extra statement reminding students of an obvious fact impact the purchasing decision? -Table \@ref(tab:opportunity-cost-obs) summarizes the study results. +@tbl-opportunity-cost-obs summarizes the study results. ::: {.data data-latex=""} The [`opportunity_cost`](http://openintrostat.github.io/openintro/reference/opportunity_cost.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package. ::: ```{r} -#| label: opportunity-cost-obs -opportunity_cost %>% - count(group, decision) %>% - pivot_wider(names_from = decision, values_from = n) %>% - adorn_totals(where = c("col", "row")) %>% - kbl(linesep = "", booktabs = TRUE, caption = "Summary results of the opportunity cost study.") %>% +#| label: tbl-opportunity-cost-obs +#| tbl-cap: Summary results of the opportunity cost study. + +opportunity_cost |> + count(group, decision) |> + pivot_wider(names_from = decision, values_from = n) |> + adorn_totals(where = c("col", "row")) |> + kbl(linesep = "", booktabs = TRUE) |> kable_styling(bootstrap_options = c("striped", "condensed"), - latex_options = c("striped", "hold_position"), full_width = FALSE) %>% - add_header_above(c(" " = 1, "decision" = 2, " " = 1)) %>% + latex_options = c("striped", "hold_position"), full_width = FALSE) |> + add_header_above(c(" " = 1, "decision" = 2, " " = 1)) |> column_spec(1:4, width = "7em") ``` It might be a little easier to review the results using a visualization. -Figure \@ref(fig:opportunity-cost-obs-bar) shows that a higher proportion of students in the treatment group chose not to buy the video compared to those in the control group. +@fig-opportunity-cost-obs-bar shows that a higher proportion of students in the treatment group chose not to buy the video compared to those in the control group. ```{r} -#| label: opportunity-cost-obs-bar -#| fig.cap: Stacked bar plot of results of the opportunity cost study. -#| out.width: 100% +#| label: fig-opportunity-cost-obs-bar +#| fig-cap: Stacked bar plot of results of the opportunity cost study. 
+#| out-width: 100% #| fig-asp: 0.3 + ggplot(opportunity_cost, aes(y = fct_rev(group), fill = fct_rev(decision))) + geom_bar(position = "fill") + scale_fill_openintro("two") + @@ -467,29 +489,34 @@ ggplot(opportunity_cost, aes(y = fct_rev(group), fill = fct_rev(decision))) + ) ``` -Another useful way to review the results from Table \@ref(tab:opportunity-cost-obs) is using row proportions, specifically considering the proportion of participants in each group who said they would buy or not buy the video. -These summaries are given in Table \@ref(tab:opportunity-cost-obs-row-prop). +Another useful way to review the results from @tbl-opportunity-cost-obs is using row proportions, specifically considering the proportion of participants in each group who said they would buy or not buy the video. +These summaries are given in @tbl-opportunity-cost-obs-row-prop. ```{r} -#| label: opportunity-cost-obs-row-prop -opportunity_cost %>% - count(group, decision) %>% - pivot_wider(names_from = decision, values_from = n) %>% - adorn_percentages(denominator = "row") %>% - adorn_totals(where = "col") %>% - kbl(linesep = "", booktabs = TRUE, caption = "The opportunity cost data are summarized using row proportions. Row proportions are particularly useful here since we can view the proportion of *buy* and *not buy* decisions in each group.") %>% +#| label: tbl-opportunity-cost-obs-row-prop +#| tbl-cap: | +#| The opportunity cost data are summarized using row proportions. Row +#| proportions are particularly useful here since we can view the proportion +#| of *buy* and *not buy* decisions in each group. + +opportunity_cost |> + count(group, decision) |> + pivot_wider(names_from = decision, values_from = n) |> + adorn_percentages(denominator = "row") |> + adorn_totals(where = "col") |> + kbl(linesep = "", booktabs = TRUE) |> kable_styling(bootstrap_options = c("striped", "condensed"), - latex_options = c("striped", "hold_position"), full_width = FALSE) %>% - add_header_above(c(" " = 1, "decision" = 2, " " = 1)) %>% + latex_options = c("striped", "hold_position"), full_width = FALSE) |> + add_header_above(c(" " = 1, "decision" = 2, " " = 1)) |> column_spec(1:4, width = "7em") ``` -We will define a **success**\index{success} in this study as a student who chooses not to buy the video.[^11-foundations-randomization-7] +We will define a **success**\index{success} in this study as a student who chooses not to buy the video.[^7] Then, the value of interest is the change in video purchase rates that results by reminding students that not spending money now means they can spend the money later. -[^11-foundations-randomization-7]: Success is often defined in a study as the outcome of interest, and a "success" may or may not actually be a positive outcome. +[^7]: Success is often defined in a study as the outcome of interest, and a "success" may or may not actually be a positive outcome. For example, researchers working on a study on COVID prevalence might define a "success" in the statistical sense as a patient who has COVID-19. - A more complete discussion of the term **success** will be given in Chapter \@ref(inference-one-prop). + A more complete discussion of the term **success** will be given in [Chapter -@sec-inference-one-prop]. 
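As a quick check on this point estimate, the difference in "not buy" proportions can be computed directly from the data.
The following chunk is a minimal sketch rather than part of the original analysis code; it assumes, as in the chunks above, that the **openintro** (for `opportunity_cost`) and **tidyverse** packages are loaded, and the chunk label and object name are placeholders.

```{r}
#| label: opportunity-cost-obs-diff-sketch

# Observed proportion of "not buy video" decisions in each group, and their
# difference (treatment minus control); this is the roughly 20% gap discussed
# in the text.
obs_not_buy <- opportunity_cost |>
  count(group, decision) |>
  group_by(group) |>
  mutate(p = n / sum(n)) |>
  ungroup() |>
  filter(decision == "not buy video") |>
  select(group, p) |>
  pivot_wider(names_from = group, values_from = p)

obs_not_buy$treatment - obs_not_buy$control
```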
```{r} #| include: false @@ -506,7 +533,7 @@ Is this 20% difference between the two groups so prominent that it is unlikely t ### Variability of the statistic The primary goal in this data analysis is to understand what sort of differences we might see if the null hypothesis were true, i.e., the treatment had no effect on students. -Because this is an experiment, we'll use the same procedure we applied in Section \@ref(caseStudySexDiscrimination): randomization. +Because this is an experiment, we'll use the same procedure we applied in @sec-caseStudySexDiscrimination: randomization. Let's think about the data in the context of the hypotheses. If the null hypothesis $(H_0)$ was true and the treatment had no impact on student decisions, then the observed difference between the two groups of 20% could be attributed entirely to random chance. @@ -533,30 +560,35 @@ Since the simulated groups are of equal size, we would expect $53 / 2 = 26.5,$ i However, due to random chance, we might also expect to sometimes observe a number a little above or below 26 and 27. ::: -The results of a single randomization is shown in Table \@ref(tab:opportunity-cost-obs-simulated). +The results of a single randomization is shown in @tbl-opportunity-cost-obs-simulated. ```{r} -#| label: opportunity-cost-obs-simulated +#| label: tbl-opportunity-cost-obs-simulated +#| tbl-cap: | +#| Summary of student choices against their simulated groups. The group +#| assignment had no connection to the student decisions, so any difference +#| between the two groups is due to chance. + opportunity_cost_rand_1 <- tibble( group = c(rep("control", 75), rep("treatment", 75)), decision = c( rep("buy video", 46), rep("not buy video", 29), rep("buy video", 51), rep("not buy video", 24) ) -) %>% +) |> mutate( group = as.factor(group), decision = as.factor(decision) ) -opportunity_cost_rand_1 %>% - count(group, decision) %>% - pivot_wider(names_from = decision, values_from = n) %>% - adorn_totals(where = c("col", "row")) %>% - kbl(linesep = "", booktabs = TRUE, caption = "Summary of student choices against their simulated groups. 
The group assignment had no connection to the student decisions, so any difference between the two groups is due to chance.") %>% +opportunity_cost_rand_1 |> + count(group, decision) |> + pivot_wider(names_from = decision, values_from = n) |> + adorn_totals(where = c("col", "row")) |> + kbl(linesep = "", booktabs = TRUE) |> kable_styling(bootstrap_options = c("striped", "condensed"), - latex_options = c("striped", "hold_position"), full_width = FALSE) %>% - add_header_above(c(" " = 1, "decision" = 2, " " = 1)) %>% + latex_options = c("striped", "hold_position"), full_width = FALSE) |> + add_header_above(c(" " = 1, "decision" = 2, " " = 1)) |> column_spec(1:4, width = "7em") ``` @@ -569,11 +601,11 @@ Just one simulation will not be enough to get a sense of what sorts of differenc ```{r} #| label: opportunity-cost-rand-dist set.seed(25) -opportunity_cost_rand_dist <- opportunity_cost %>% - specify(decision ~ group, success = "not buy video") %>% - hypothesize(null = "independence") %>% - generate(reps = 1000, type = "permute") %>% - calculate(stat = "diff in props", order = c("treatment", "control")) %>% +opportunity_cost_rand_dist <- opportunity_cost |> + specify(decision ~ group, success = "not buy video") |> + hypothesize(null = "independence") |> + generate(reps = 1000, type = "permute") |> + calculate(stat = "diff in props", order = c("treatment", "control")) |> mutate(stat = round(stat, 3)) ``` @@ -585,11 +617,15 @@ And again: `r opportunity_cost_rand_dist$stat[3]`. We'll do this 1,000 times. -The results are summarized in a dot plot in Figure \@ref(fig:opportunity-cost-rand-dot-plot), where each point represents the difference from one randomization. +The results are summarized in a dot plot in @fig-opportunity-cost-rand-dot-plot, where each point represents the difference from one randomization. ```{r} -#| label: opportunity-cost-rand-dot-plot -#| fig.cap: (ref:opportunity-cost-rand-dot-plot-cap) +#| label: fig-opportunity-cost-rand-dot-plot +#| fig-cap: | +#| A stacked dot plot of 1,000 simulated (null) differences produced under +#| the null hypothesis, $H_0.$ Six of the 1,000 simulations had a difference +#| of at least 20%, which was the difference observed in the study. + ggplot(opportunity_cost_rand_dist, aes(x = stat)) + geom_dotplot(binwidth = 0.01, dotsize = 0.165) + gghighlight(stat >= 0.20) + @@ -604,15 +640,15 @@ ggplot(opportunity_cost_rand_dist, aes(x = stat)) + ) ``` -(ref:opportunity-cost-rand-dot-plot-cap) A stacked dot plot of 1,000 simulated (null) differences produced under the null hypothesis, $H_0.$ Six of the 1,000 simulations had a difference of at least 20% , which was the difference observed in the study. - -Since there are so many points and it is difficult to discern one point from the other, it is more convenient to summarize the results in a histogram such as the one in Figure \@ref(fig:opportunity-cost-rand-hist), where the height of each histogram bar represents the number of simulations resulting in an outcome of that magnitude. +Since there are so many points and it is difficult to discern one point from the other, it is more convenient to summarize the results in a histogram such as the one in @fig-opportunity-cost-rand-hist, where the height of each histogram bar represents the number of simulations resulting in an outcome of that magnitude. ```{r} -#| label: opportunity-cost-rand-hist -#| fig.cap: A histogram of 1,000 chance differences produced under the null hypothesis. 
-#| Histograms like this one are a convenient representation of data or results when -#| there are a large number of simulations. +#| label: fig-opportunity-cost-rand-hist +#| fig-cap: | +#| A histogram of 1,000 chance differences produced under the null hypothesis. +#| Histograms like this one are a convenient representation of data or results +#| when there are a large number of simulations. + ggplot(opportunity_cost_rand_dist, aes(x = stat)) + geom_histogram(binwidth = 0.04) + gghighlight(stat >= 0.20) + @@ -674,11 +710,11 @@ They are simply not convinced of the alternative, that the person is guilty. This is also the case with hypothesis testing: *even if we fail to reject the null hypothesis, we do not accept the null hypothesis as truth*. Failing to find evidence in favor of the alternative hypothesis is not equivalent to finding evidence that the null hypothesis is true. -We will see this idea in greater detail in Section \@ref(decerr). +We will see this idea in greater detail in [Chapter -@sec-decerr]. ### p-value and statistical discernibility -In Section \@ref(caseStudySexDiscrimination) we encountered a study from the 1970's that explored whether there was strong evidence that female candidates were less likely to be promoted than male candidates. +In @caseStudySexDiscrimination we encountered a study from the 1970's that explored whether there was strong evidence that female candidates were less likely to be promoted than male candidates. The research question -- are female candidates discriminated against in promotion decisions? -- was framed in the context of hypotheses: @@ -708,7 +744,7 @@ terms_chp_11 <- c(terms_chp_11, "p-value", "test statistic") ::: {.workedexample data-latex=""} In the sex discrimination study, the difference in discrimination rates was our test statistic. -What was the test statistic in the opportunity cost study covered in Section \@ref(caseStudyOpportunityCost)? +What was the test statistic in the opportunity cost study covered in @sec-caseStudyOpportunityCost)? ------------------------------------------------------------------------ @@ -717,17 +753,17 @@ In each of these examples, the **point estimate** of the difference in proportio ::: When the p-value is small, i.e., less than a previously set threshold, we say the results are **statistically discernible**\index{statistically significant}\index{statistically discernible}. -This means the data provide such strong evidence against $H_0$ that we reject the null hypothesis in favor of the alternative hypothesis.[^11-foundations-randomization-8] +This means the data provide such strong evidence against $H_0$ that we reject the null hypothesis in favor of the alternative hypothesis.[^8] The threshold is called the **discernibility level**\index{hypothesis testing!discernibility level}\index{significance level}\index{discernibility level} and often represented by $\alpha$ (the Greek letter *alpha*). -[^11-foundations-randomization-9] The value of $\alpha$ represents how rare an event needs to be in order for the null hypothesis to be rejected +[^9] The value of $\alpha$ represents how rare an event needs to be in order for the null hypothesis to be rejected . Historically, many fields have set $\alpha = 0.05,$ meaning that the results need to occur less than 5% of the time, if the null hypothesis is to be rejected . The value of $\alpha$ can vary depending on the the field or the application . 
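In code, this decision rule boils down to computing the share of simulated differences that are at least as extreme as the observed one and comparing that share to $\alpha.$
The following chunk is a minimal sketch rather than part of the original text; it reuses the `opportunity_cost_rand_dist` object created earlier in the chapter, and the chunk label and object name are placeholders.

```{r}
#| label: opportunity-cost-p-value-sketch

# Simulation-based p-value for the opportunity cost study: the proportion of
# the 1,000 shuffled differences that are at least as large as the observed
# 20% difference (about 6 in 1,000 here).
p_value_sim <- mean(opportunity_cost_rand_dist$stat >= 0.20)
p_value_sim

# Compare against a discernibility level of alpha = 0.05.
p_value_sim < 0.05
```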
-[^11-foundations-randomization-8]: Many texts use the phrase "statistically significant" instead of "statistically discernible". +[^8]: Many texts use the phrase "statistically significant" instead of "statistically discernible". We have chosen to use "discernible" to indicate that a precise statistical event has happened, as opposed to a notable effect which may or may not fit the statistical definition of discernible or significant. -[^11-foundations-randomization-9]: Here, too, we have chosen "discernibility level" instead of "significance level" which you will see in some texts. +[^9]: Here, too, we have chosen "discernibility level" instead of "significance level" which you will see in some texts. ```{r} #| include: false @@ -747,7 +783,7 @@ We say that the data provide **statistically discernible**\index{hypothesis test ::: ::: {.workedexample data-latex=""} -In the opportunity cost study in Section \@ref(caseStudyOpportunityCost), we analyzed an experiment where study participants had a 20% drop in likelihood of continuing with a video purchase if they were reminded that the money, if not spent on the video, could be used for other purchases in the future. +In the opportunity cost study in @sec-caseStudyOpportunityCost, we analyzed an experiment where study participants had a 20% drop in likelihood of continuing with a video purchase if they were reminded that the money, if not spent on the video, could be used for other purchases in the future. We determined that such a large difference would only occur 6-in-1,000 times if the reminder actually had no influence on student decision-making. What is the p-value in this study? Would you classify the result as "statistically discernible"? @@ -770,7 +806,7 @@ We've made a video to help clarify *why 0.05*: Sometimes it's also a good idea to deviate from the standard. -We'll discuss when to choose a threshold different than 0.05 in Section \@ref(decerr). +We'll discuss when to choose a threshold different than 0.05 in [Chapter -@sec-decerr]. ::: \clearpage @@ -779,7 +815,7 @@ We'll discuss when to choose a threshold different than 0.05 in Section \@ref(de ### Summary -Figure \@ref(fig:fullrand) provides a visual summary of the randomization testing procedure. +@fig-fullrand provides a visual summary of the randomization testing procedure. \index{randomization test} @@ -789,18 +825,21 @@ terms_chp_11 <- c(terms_chp_11, "randomization test") ``` ```{r} -#| label: fullrand -#| out.width: 100% -#| fig.cap: An example of one simulation of the full randomization procedure from a hypothetical +#| label: fig-fullrand +#| out-width: 100% +#| fig-cap: | +#| An example of one simulation of the full randomization procedure from a hypothetical #| dataset as visualized in the first panel. We repeat the steps hundreds or thousands #| of times. -#| fig.alt: 48 red and white cards are show in three panels. The first panel represents +#| fig-alt: | +#| 48 red and white cards are show in three panels. The first panel represents #| original data and original allocation of Group 1 and Group 2 (in the original data #| there are 7 white cards in Group 1 and 10 white cards in Group 2). The second panel #| represents the shuffled red and white cards that are randomly assigned as Group #| 1 and Group 2. The third panel has the cards sorted according to the random assignment #| of Group 1 and Group 2. In the third panel there are 8 white cards in the Group #| 1 and 9 white cards in Group 2. 
+ knitr::include_graphics("images/fullrand.png") ``` @@ -812,17 +851,18 @@ We can summarize the randomization test procedure as follows: - **Analyze the data.** Choose an analysis technique appropriate for the data and identify the p-value. So far, we have only seen one analysis technique: randomization. Throughout the rest of this textbook, we'll encounter several new methods suitable for many other contexts. - **Form a conclusion.** Using the p-value from the analysis, determine whether the data provide evidence against the null hypothesis. Also, be sure to write the conclusion in plain language so casual readers can understand the results. -Table \@ref(tab:chp11-summary) is another look at the randomization test summary. +@tbl-chp11-summary is another look at the randomization test summary. ```{r} -#| label: chp11-summary -inference_method_summary_table %>% - filter(question != "What are the technical conditions?") %>% - select(question, randomization) %>% - kbl(linesep = "\\addlinespace", booktabs = TRUE, caption = "Summary of randomization as an inferential statistical method.", - col.names = c("Question", "Answer")) %>% +#| label: tbl-chp11-summary +#| tbl-cap: Summary of randomization as an inferential statistical method. + +inference_method_summary_table |> + filter(question != "What are the technical conditions?") |> + select(question, randomization) |> + kbl(linesep = "\\addlinespace", booktabs = TRUE, col.names = c("Question", "Answer")) |> kable_styling(bootstrap_options = c("striped", "condensed"), - latex_options = c("striped", "hold_position"), full_width = TRUE) %>% + latex_options = c("striped", "hold_position"), full_width = TRUE) |> column_spec(1, width = "15em") ``` diff --git a/14-foundations-errors.qmd b/14-foundations-errors.qmd index b9e5f9a1..a1df9e02 100644 --- a/14-foundations-errors.qmd +++ b/14-foundations-errors.qmd @@ -1,4 +1,4 @@ -# Decision Errors {#decerr} +# Decision Errors {#sec-decerr} ```{r} #| include: false diff --git a/16-inference-one-prop.qmd b/16-inference-one-prop.qmd index 74027f1a..63c8e127 100644 --- a/16-inference-one-prop.qmd +++ b/16-inference-one-prop.qmd @@ -4,7 +4,7 @@ source("_common.R") ``` -# Inference for a single proportion {#inference-one-prop} +# Inference for a single proportion {#sec-inference-one-prop} ::: {.chapterintro data-latex=""} Focusing now on statistical inference for categorical data, we will revisit many of the foundational aspects of hypothesis testing from Chapter \@ref(foundations-randomization). 
diff --git a/_freeze/01-data-hello/execute-results/html.json b/_freeze/01-data-hello/execute-results/html.json index 3e4dc4ed..600407d5 100644 --- a/_freeze/01-data-hello/execute-results/html.json +++ b/_freeze/01-data-hello/execute-results/html.json @@ -1,7 +1,8 @@ { - "hash": "2c11b67a4ada86aef92c743c92fdc28a", + "hash": "2d9d6471b03d92e97a32bf3de34fcd03", "result": { - "markdown": "# Hello data {#sec-data-hello}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nScientists seek to answer questions using rigorous methods and careful observations.\nThese observations -- collected from the likes of field notes, surveys, and experiments -- form the backbone of a statistical investigation and are called **data**.\nStatistics is the study of how best to collect, analyze, and draw conclusions from data.\nIn this first chapter, we focus on both the properties of data and on the collection of data.\n:::\n\n\n\n\n\n## Case study: Using stents to prevent strokes {#01-data-hello-sec-case-study-stents-strokes}\n\nIn this section we introduce a classic challenge in statistics: evaluating the efficacy of a medical treatment.\nTerms in this section, and indeed much of this chapter, will all be revisited later in the text.\nThe plan for now is simply to get a sense of the role statistics can play in practice.\n\nAn experiment is designed to study the effectiveness of stents in treating patients at risk of stroke [@chimowitz2011stenting].\nStents are small mesh tubes that are placed inside narrow or weak arteries to assist in patient recovery after cardiac events and reduce the risk of an additional heart attack or death.\n\nMany doctors have hoped that there would be similar benefits for patients at risk of stroke.\nWe start by writing the principal question the researchers hope to answer:\n\n> Does the use of stents reduce the risk of stroke?\n\nThe researchers who asked this question conducted an experiment with 451 at-risk patients.\nEach volunteer patient was randomly assigned to one of two groups:\n\n- **Treatment group**. Patients in the treatment group received a stent and medical management. The medical management included medications, management of risk factors, and help in lifestyle modification.\n- **Control group**. Patients in the control group received the same medical management as the treatment group, but they did not receive stents.\n\nResearchers randomly assigned 224 patients to the treatment group and 227 to the control group.\nIn this study, the control group provides a reference point against which we can measure the medical impact of stents in the treatment group.\n\n\\clearpage\n\nResearchers studied the effect of stents at two time points: 30 days after enrollment and 365 days after enrollment.\nThe results of 5 patients are summarized in @tbl-stentStudyResultsDF.\nPatient outcomes are recorded as `stroke` or `no event`, representing whether the patient had a stroke during that time period.\n\n::: {.data data-latex=\"\"}\nThe [`stent30`](http://openintrostat.github.io/openintro/reference/stent30.html) data and [`stent365`](http://openintrostat.github.io/openintro/reference/stent365.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n\n::: {#tbl-stentStudyResultsDF .cell tbl-cap='Results for five patients from the stent study.' fig.asp='0.618'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| patient | group     | 30 days  | 365 days |
|---------|-----------|----------|----------|
| 1       | treatment | no event | no event |
| 2       | treatment | stroke   | stroke   |
| 3       | treatment | no event | no event |
| 4       | treatment | no event | no event |
| 5       | control   | no event | no event |
\n\n`````\n:::\n:::\n\n\nIt would be difficult to answer a question on the impact of stents on the occurrence of strokes for **all** study patients using these *individual* observations.\nThis question is better addressed by performing a statistical data analysis of *all* observations.\n@tbl-stentStudyResultsDFsummary summarizes the raw data in a more helpful way.\nIn this table, we can quickly see what happened over the entire study.\nFor instance, to identify the number of patients in the treatment group who had a stroke within 30 days after the treatment, we look in the leftmost column (30 days), at the intersection of treatment and stroke: 33.\nTo identify the number of control patients who did not have a stroke after 365 days after receiving treatment, we look at the rightmost column (365 days), at the intersection of control and no event: 199.\n\n\n::: {#tbl-stentStudyResultsDFsummary .cell tbl-cap='Descriptive statistics for the stent study.' fig.asp='0.618'}\n::: {.cell-output-display}\n`````{=html}\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| Group     | 30 days: Stroke | 30 days: No event | 365 days: Stroke | 365 days: No event |
|-----------|-----------------|-------------------|------------------|--------------------|
| Control   | 13              | 214               | 28               | 199                |
| Treatment | 33              | 191               | 45               | 179                |
| Total     | 46              | 405               | 73               | 378                |
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nOf the 224 patients in the treatment group, 45 had a stroke by the end of the first year.\nUsing these two numbers, compute the proportion of patients in the treatment group who had a stroke by the end of their first year.\n(Note: answers to all Guided Practice exercises are provided in footnotes!)[^01-data-hello-1]\n:::\n\n[^01-data-hello-1]: The proportion of the 224 patients who had a stroke within 365 days: $45/224 = 0.20.$\n\nWe can compute summary statistics from the table to give us a better idea of how the impact of the stent treatment differed between the two groups.\nA **summary statistic** is a single number summarizing data from a sample.\nFor instance, the primary results of the study after 1 year could be described by two summary statistics: the proportion of people who had a stroke in the treatment and control groups.\n\n\n\n\n\n- Proportion who had a stroke in the treatment (stent) group: $45/224 = 0.20 = 20\\%.$\n- Proportion who had a stroke in the control group: $28/227 = 0.12 = 12\\%.$\n\nThese two summary statistics are useful in looking for differences in the groups, and we are in for a surprise: an additional 8% of patients in the treatment group had a stroke!\nThis is important for two reasons.\nFirst, it is contrary to what doctors expected, which was that stents would *reduce* the rate of strokes.\nSecond, it leads to a statistical question: do the data show a \"real\" difference between the groups?\n\nThis second question is subtle.\nSuppose you flip a coin 100 times.\nWhile the chance a coin lands heads in any given coin flip is 50%, we probably won't observe exactly 50 heads.\nThis type of variation is part of almost any type of data generating process.\nIt is possible that the 8% difference in the stent study is due to this natural variation.\nHowever, the larger the difference we observe (for a particular sample size), the less believable it is that the difference is due to chance.\nSo, what we are really asking is the following: if in fact stents have no effect, how likely is it that we observe such a large difference?\n\nWhile we do not yet have statistical tools to fully address this question on our own, we can comprehend the conclusions of the published analysis: there was compelling evidence of harm by stents in this study of stroke patients.\n\n**Be careful:** Do not generalize the results of this study to all patients and all stents.\nThis study looked at patients with very specific characteristics who volunteered to be a part of this study and who may not be representative of all stroke patients.\nIn addition, there are many types of stents, and this study only considered the self-expanding Wingspan stent (Boston Scientific).\nHowever, this study does leave us with an important lesson: we should keep our eyes open for surprises.\n\n## Data basics {#01-data-hello-sec-data-basics}\n\nEffective presentation and description of data is a first step in most analyses.\nThis section introduces one structure for organizing data as well as some terminology that will be used throughout this book.\n\n### Observations, variables, and data matrices\n\n@tbl-loan50-df displays six rows of a dataset for 50 randomly sampled loans offered through Lending Club, which is a peer-to-peer lending company.\nThis dataset will be referred to as `loan50`.\n\n::: {.data data-latex=\"\"}\nThe [`loan50`](http://openintrostat.github.io/openintro/reference/loans_full_schema.html) data can be found in the 
[**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nEach row in the table represents a single loan.\nThe formal name for a row is a \\index{case}**case** or \\index{unit of observation}**observational unit**.\nThe columns represent characteristics of each loan, where each column is referred to as a \\index{variable}**variable**.\nFor example, the first row represents a loan of \\$22,000 with an interest rate of 10.90%, where the borrower is based in New Jersey (NJ) and has an income of \\$59,000.\n\n\n\n\n\n::: {.guidedpractice data-latex=\"\"}\nWhat is the grade of the first loan in @tbl-loan50-df?\nAnd what is the home ownership status of the borrower for that first loan?\nReminder: for these Guided Practice questions, you can check your answer in the footnote.[^01-data-hello-2]\n:::\n\n[^01-data-hello-2]: The loan's grade is B, and the borrower rents their residence.\n\nIn practice, it is especially important to ask clarifying questions to ensure important aspects of the data are understood.\nFor instance, it is always important to be sure we know what each variable means and its units of measurement.\nDescriptions of the variables in the `loan50` dataset are given in @tbl-loan-50-variables.\n\n\n::: {#tbl-loan50-df .cell tbl-cap='Six observations from the `loan50` dataset.' fig.asp='0.618'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
|   | loan_amount | interest_rate | term | grade | state | total_income | homeownership |
|---|-------------|---------------|------|-------|-------|--------------|---------------|
| 1 | 22,000      | 10.90         | 60   | B     | NJ    | 59,000       | rent          |
| 2 | 6,000       | 9.92          | 36   | B     | CA    | 60,000       | rent          |
| 3 | 25,000      | 26.30         | 36   | E     | SC    | 75,000       | mortgage      |
| 4 | 6,000       | 9.92          | 36   | B     | CA    | 75,000       | rent          |
| 5 | 25,000      | 9.43          | 60   | B     | OH    | 254,000      | mortgage      |
| 6 | 6,400       | 9.92          | 36   | B     | IN    | 67,000       | mortgage      |
\n\n`````\n:::\n:::\n\n::: {#tbl-loan-50-variables .cell tbl-cap='Variables and their descriptions for the `loan50` dataset.' fig.asp='0.618'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| Variable | Description |
|----------|-------------|
| loan_amount | Amount of the loan received, in US dollars. |
| interest_rate | Interest rate on the loan, in an annual percentage. |
| term | The length of the loan, which is always set as a whole number of months. |
| grade | Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid. |
| state | US state where the borrower resides. |
| total_income | Borrower's total income, including any second income, in US dollars. |
| homeownership | Indicates whether the person owns, owns but has a mortgage, or rents. |
\n\n`````\n:::\n:::\n\n\nThe data in @tbl-loan50-df represent a \\index{data frame}**data frame**, which is a convenient and common way to organize data, especially if collecting data in a spreadsheet.\nA data frame where each row is a unique case (observational unit), each column is a variable, and each cell is a single value is commonly referred to as \\index{tidy data}**tidy data** @wickham2014.\n\n\n\n\n\nWhen recording data, use a tidy data frame unless you have a very good reason to use a different structure.\nThis structure allows new cases to be added as rows or new variables as new columns and facilitates visualization, summarization, and other statistical analyses.\n\n::: {.guidedpractice data-latex=\"\"}\nThe grades for assignments, quizzes, and exams in a course are often recorded in a gradebook that takes the form of a data frame.\nHow might you organize a course's grade data using a data frame?\nDescribe the observational units and variables.[^01-data-hello-3]\n:::\n\n[^01-data-hello-3]: There are multiple strategies that can be followed.\n One common strategy is to have each student represented by a row, and then add a column for each assignment, quiz, or exam.\n Under this setup, it is easy to review a single line to understand the grade history of a student.\n There should also be columns to include student information, such as one column to list student names.\n\n::: {.guidedpractice data-latex=\"\"}\nWe consider data for 3,142 counties in the United States, which includes the name of each county, the state where it resides, its population in 2017, the population change from 2010 to 2017, poverty rate, and nine additional characteristics.\nHow might these data be organized in a data frame?[^01-data-hello-4]\n:::\n\n[^01-data-hello-4]: Each county may be viewed as a case, and there are eleven pieces of information recorded for each case.\n A table with 3,142 rows and 14 columns could hold these data, where each row represents a county and each column represents a particular piece of information.\n\n\\clearpage\n\nThe data described in the Guided Practice above represents the `county` dataset, which is shown as a data frame in @tbl-county-df.\nThe variables as well as the variables in the dataset that did not fit in @tbl-county-df are described in @tbl-county-variables.\n\n\n::: {#tbl-county-df .cell tbl-cap='Six observations and six variables from the `county` dataset.' fig.asp='0.618'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
name state pop2017 pop_change unemployment_rate median_edu
Autauga County Alabama 55,504 1.48 3.86 some_college
Baldwin County Alabama 212,628 9.19 3.99 some_college
Barbour County Alabama 25,270 -6.22 5.90 hs_diploma
Bibb County Alabama 22,668 0.73 4.39 hs_diploma
Blount County Alabama 58,013 0.68 4.02 hs_diploma
Bullock County Alabama 10,309 -2.28 4.93 hs_diploma
\n\n`````\n:::\n:::\n\n::: {#tbl-county-variables .cell tbl-cap='Variables and their descriptions for the `county` dataset.' fig.asp='0.618'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Variable Description
name Name of county.
state Name of state.
pop2000 Population in 2000.
pop2010 Population in 2010.
pop2017 Population in 2017.
pop_change Population change from 2010 to 2017 (in percent).
poverty Percent of population in poverty in 2017.
homeownership Homeownership rate, 2006-2010.
multi_unit Multi-unit rate: percent of housing units that are in multi-unit structures, 2006-2010.
unemployment_rate Unemployment rate in 2017.
metro Whether the county contains a metropolitan area, taking one of the values yes or no.
median_edu Median education level (2013-2017), taking one of the values below_hs, hs_diploma, some_college, or bachelors.
per_capita_income Per capita (per person) income (2013-2017).
median_hh_income Median household income.
smoking_ban Describes the type of county-level smoking ban in place in 2010, taking one of the values none, partial, or comprehensive.
\n\n`````\n:::\n:::\n\n\n::: {.data data-latex=\"\"}\nThe [`county`](http://openintrostat.github.io/usdata/reference/county.html) data can be found in the [**usdata**](http://openintrostat.github.io/usdata) R package.\n:::\n\n### Types of variables {#01-data-hello-variable-types}\n\nExamine the `unemployment_rate`, `pop2017`, `state`, and `median_edu` variables in the `county` dataset.\nEach of these variables is inherently different from the other three, yet some share certain characteristics.\n\nFirst consider `unemployment_rate`, which is said to be a \\index{numerical variable}**numerical** variable since it can take a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values.\nOn the other hand, we would not classify a variable reporting telephone area codes as numerical since the average, sum, and difference of area codes does not have any clear meaning.\nInstead, we would consider area codes as a categorical variable.\n\n\n\n\n\nThe `pop2017` variable is also numerical, although it seems to be a little different than `unemployment_rate`.\nThis variable of the population count can only take whole non-negative numbers (0, 1, 2, ...).\nFor this reason, the population variable is said to be **discrete** since it can only take numerical values with jumps.\nOn the other hand, the unemployment rate variable is said to be **continuous**.\n\n\n\n\n\nThe variable `state` can take up to 51 values after accounting for Washington, DC: Alabama, Alaska, ..., and Wyoming.\nBecause the responses themselves are categories, `state` is called a **categorical** variable, and the possible values (states) are called the variable's **levels** (e.g., District of Columbia, Alabama, Alaska, etc.) .\n\n\n\n\n\nFinally, consider the `median_edu` variable, which describes the median education level of county residents and takes values `below_hs`, `hs_diploma`, `some_college`, or `bachelors` in each county.\nThis variable seems to be a hybrid: it is a categorical variable, but the levels have a natural ordering.\nA variable with these properties is called an **ordinal** variable, while a regular categorical variable without this type of special ordering is called a **nominal** variable.\nTo simplify analyses, any categorical variable in this book will be treated as a nominal (unordered) categorical variable.\n\n\n\n\n::: {.cell fig.asp='0.5'}\n::: {.cell-output-display}\n![Breakdown of variables into their respective types.](01-data-hello_files/figure-html/variables-1.png){fig-alt='Types of variables are broken down into numerical (which can be discrete or continuous) and categorical (which can be ordinal or nominal).' 
width=90%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nData were collected about students in a statistics course.\nThree variables were recorded for each student: number of siblings, student height, and whether the student had previously taken a statistics course.\nClassify each of the variables as continuous numerical, discrete numerical, or categorical.\n\n------------------------------------------------------------------------\n\nThe number of siblings and student height represent numerical variables.\nBecause the number of siblings is a count, it is discrete.\nHeight varies continuously, so it is a continuous numerical variable.\nThe last variable classifies students into two categories -- those who have and those who have not taken a statistics course -- which makes this variable categorical.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nAn experiment is evaluating the effectiveness of a new drug in treating migraines.\nA `group` variable is used to indicate the experiment group for each patient: treatment or control.\nThe `num_migraines` variable represents the number of migraines the patient experienced during a 3-month period.\nClassify each variable as either numerical or categorical?[^01-data-hello-5]\n:::\n\n[^01-data-hello-5]: The `group` variable can take just one of two group names, making it categorical.\n The `num_migraines` variable describes a count of the number of migraines, which is an outcome where basic arithmetic is sensible, which means this is a numerical outcome; more specifically, since it represents a count, `num_migraines` is a discrete numerical variable.\n\n### Relationships between variables {#01-data-hello-variable-relations}\n\nMany analyses are motivated by a researcher looking for a relationship between two or more variables.\nA social scientist may like to answer some of the following questions:\n\n> Does a higher-than-average increase in county population tend to correspond to counties with higher or lower median household incomes?\n\n> If homeownership in one county is lower than the national average, will the percent of housing units that are in multi-unit structures in that county tend to be above or below the national average?\n\n> How much can the median education level explain the median household income for counties in the US?\n\nTo answer these questions, data must be collected, such as the `county` dataset shown in @tbl-county-df.\nExamining \\index{summary statistic}**summary statistics** can provide numerical insights about the specifics of each of these questions.\nAlternatively, graphs can be used to visually explore the data, potentially providing more insight than a summary statistic.\n\n\\index{scatterplot}**Scatterplots** are one type of graph used to study the relationship between two numerical variables.\n@fig-county-multi-unit-homeownership displays the relationship between the variables `homeownership` and `multi_unit`, which is the percent of housing units that are in multi-unit structures (e.g., apartments, condos).\nEach point on the plot represents a single county.\nFor instance, the highlighted dot corresponds to County 413 in the `county` dataset: Chattahoochee County, Georgia, which has 39.4% of housing units that are in multi-unit structures and a homeownership rate of 31.3%.\nThe scatterplot suggests a relationship between the two variables: counties with a higher rate of housing units that are in multi-unit structures tend to have lower homeownership rates.\nWe might brainstorm as to why this relationship 
exists and investigate each idea to determine which are the most reasonable explanations.\n\n\n::: {.cell fig.asp='0.618'}\n::: {.cell-output-display}\n![A scatterplot of homeownership versus the percent of housing units that are in multi-unit structures for US counties. The highlighted dot represents Chattahoochee County, Georgia, which has a multi-unit rate of 39.4\\% and a homeownership rate of 31.3\\%.](01-data-hello_files/figure-html/fig-county-multi-unit-homeownership-1.png){#fig-county-multi-unit-homeownership width=90%}\n:::\n:::\n\n\nThe multi-unit and homeownership rates are said to be associated because the plot shows a discernible pattern.\nWhen two variables show some connection with one another, they are called **associated** variables.\n\n\n\n\n\n::: {.guidedpractice data-latex=\"\"}\nExamine the variables in the `loan50` dataset, which are described in @tbl-loan-50-variables.\nCreate two questions about possible relationships between variables in `loan50` that are of interest to you.[^01-data-hello-6]\n:::\n\n[^01-data-hello-6]: Two example questions: (1) What is the relationship between loan amount and total income?\n (2) If someone's income is above the average, will their interest rate tend to be above or below the average?\n\n::: {.workedexample data-latex=\"\"}\nThis example examines the relationship between the percent change in population from 2010 to 2017 and median household income for counties, which is visualized as a scatterplot in @fig-county-pop-change-med-hh-income.\nAre these variables associated?\n\n------------------------------------------------------------------------\n\nThe larger the median household income for a county, the higher the population growth observed for the county.\nWhile it isn't true that every county with a higher median household income has a higher population growth, the trend in the plot is evident.\nSince there is some relationship between the variables, they are associated.\n:::\n\n\n::: {.cell fig.asp='0.618'}\n::: {.cell-output-display}\n![A scatterplot showing population change against median household income. 
Owsley County of Kentucky is highlighted, which lost 3.63\\% of its population from 2010 to 2017 and had median household income of \\$22,736.](01-data-hello_files/figure-html/fig-county-pop-change-med-hh-income-1.png){#fig-county-pop-change-med-hh-income width=90%}\n:::\n:::\n\n\nBecause there is a downward trend in @fig-county-multi-unit-homeownership -- counties with more housing units that are in multi-unit structures are associated with lower homeownership -- these variables are said to be **negatively associated**.\nA **positive association** is shown in the relationship between the `median_hh_income` and `pop_change` variables in @fig-county-pop-change-med-hh-income, where counties with higher median household income tend to have higher rates of population growth.\n\n\n\n\n\nIf two variables are not associated, then they are said to be **independent**.\nThat is, two variables are independent if there is no evident relationship between the two.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Associated or independent, not both.**\n\nA pair of variables are either related in some way (associated) or not (independent).\nNo pair of variables is both associated and independent.\n:::\n\n### Explanatory and response variables\n\nWhen we ask questions about the relationship between two variables, we sometimes also want to determine if the change in one variable causes a change in the other.\nConsider the following rephrasing of an earlier question about the `county` dataset:\n\n> If there is an increase in the median household income in a county, does this drive an increase in its population?\n\nIn this question, we are asking whether one variable affects another.\nIf this is our underlying belief, then *median household income* is the **explanatory variable**, and the *population change* is the **response variable** in the hypothesized relationship.[^01-data-hello-7]\n\n[^01-data-hello-7]: In some disciplines, it's customary to refer to the explanatory variable as the **independent variable** and the response variable as the **dependent variable**.\n However, this becomes confusing since a *pair* of variables might be independent or dependent, so we avoid this language.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Explanatory and response variables.**\n\nWhen we suspect one variable might causally affect another, we label the first variable the explanatory variable and the second the response variable.\nWe also use the terms **explanatory** and **response** to describe variables where the **response** might be predicted using the **explanatory** even if there is no causal relationship.\n\n
explanatory variable $\\rightarrow$ *might affect* $\\rightarrow$ response variable
\n\n
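In R, this convention shows up directly in code: the explanatory variable is typically mapped to the x-axis of a plot (or placed on the right-hand side of a model formula) and the response to the y-axis (or the left-hand side). The sketch below is only an illustration of that convention, using the `county` data from the **usdata** package introduced above together with the **ggplot2** plotting package; it is not part of the chapter's analysis.\n\n```r\nlibrary(ggplot2)\nlibrary(usdata)\n\n# Hypothesized relationship: median household income (explanatory)\n# might affect population change (response).\nggplot(county, aes(x = median_hh_income, y = pop_change)) +\n  geom_point()\n\n# The same convention in formula syntax: response ~ explanatory\n# lm(pop_change ~ median_hh_income, data = county)\n```\n\n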
For many pairs of variables, there is no hypothesized relationship, and these labels would not be applied to either variable in such cases.\n:::\n\nBear in mind that the act of labeling the variables in this way does nothing to guarantee that a causal relationship exists.\nA formal evaluation to check whether one variable causes a change in another requires an experiment.\n\n### Observational studies and experiments\n\nThere are two primary types of data collection: experiments and observational studies.\n\nWhen researchers want to evaluate the effect of particular traits, treatments, or conditions, they conduct an **experiment**.\nFor instance, we may suspect drinking a high-calorie energy drink will improve performance in a race.\nTo check if there really is a causal relationship between the explanatory variable (whether the runner drank an energy drink or not) and the response variable (the race time), researchers identify a sample of individuals and split them into groups.\nThe individuals in each group are *assigned* a treatment.\nWhen individuals are randomly assigned to a group, the experiment is called a **randomized experiment**.\nRandom assignment organizes the participants in a study into groups that are roughly equal on all aspects, thus allowing us to control for any confounding variables that might affect the outcome (e.g., fitness level, racing experience, etc.).\nFor example, each runner in the experiment could be randomly assigned, perhaps by flipping a coin, into one of two groups: the first group receives a **placebo** (fake treatment, in this case a no-calorie drink) and the second group receives the high-calorie energy drink.\nSee the case study in @sec-case-study-stents-strokes for another example of an experiment, though that study did not employ a placebo.\n\n\n\n\n\nResearchers perform an **observational study** when they collect data in a way that does not directly interfere with how the data arise.\nFor instance, researchers may collect information via surveys, review medical or company records, or follow a **cohort** of many similar individuals to form hypotheses about why certain diseases might develop.\nIn each of these situations, researchers merely observe the data that arise.\nIn general, observational studies can provide evidence of a naturally occurring association between variables, but they cannot by themselves show a causal connection as they do not offer a mechanism for controlling for confounding variables.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Association** $\\neq$ **Causation.**\n\nIn general, association does not imply causation.\nAn advantage of a randomized experiment is that it is easier to establish causal relationships with such a study.\nThe main reason for this is that observational studies do not control for confounding variables, and hence establishing causal relationships with observational studies requires advanced statistical methods (that are beyond the scope of this book).\nWe will revisit this idea when we discuss experiments later in the book.\n:::\n\n\\vspace{10mm}\n\n## Chapter review {#01-data-hello-chp1-review}\n\n### Summary\n\nThis chapter introduced you to the world of data.\nData can be organized in many ways but tidy data, where each row represents an observation and each column represents a variable, lends itself most easily to statistical analysis.\nMany of the ideas from this chapter will be seen as we move on to doing full data analyses.\nIn the next chapter you're going to learn about how we can design 
studies to collect the data we need to make conclusions with the desired scope of inference.\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell fig.asp='0.618'}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
associated experiment ordinal
case explanatory variable placebo
categorical independent positive association
cohort level randomized experiment
continuous negative association response variable
data nominal summary statistic
data frame numerical tidy data
dependent observational study variable
discrete observational unit
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#01-data-hello-chp1-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-01].\n\n::: {.exercises data-latex=\"\"}\n1. **Marvel Cinematic Universe films.** The data frame below contains information on Marvel Cinematic Universe films through the Infinity saga (a movie storyline spanning from Ironman in 2008 to Endgame in 2019).\n Box office totals are given in millions of US Dollars.\n How many observations and how many variables does this data frame have?[^_01-ex-data-hello-1]\n\n ::: {.cell fig.asp='0.618'}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Title Length (Hrs) Length (Mins) Release Date Opening Wknd US Gross US Gross World Gross
1 Iron Man 2 6 5/2/2008 98.62 319.03 585.8
2 The Incredible Hulk 1 52 6/12/2008 55.41 134.81 264.77
3 Iron Man 2 2 4 5/7/2010 128.12 312.43 623.93
4 Thor 1 55 5/6/2011 65.72 181.03 449.33
5 Captain America: The First Avenger 2 4 7/22/2011 65.06 176.65 370.57
... ... ... ... ... ... ... ...
23 Spiderman: Far from Home 2 9 7/2/2019 92.58 390.53 1131.93
\n \n `````\n :::\n :::\n\n2. **Cherry Blossom Run.** The data frame below contains information on runners in the 2017 Cherry Blossom Run, which is an annual road race that takes place in Washington, DC. Most runners participate in a 10-mile run while a smaller fraction take part in a 5k run or walk.\n How many observations and how many variables does this data frame have?[^_01-ex-data-hello-2]\n\n ::: {.cell fig.asp='0.618'}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Bib Name Sex Age City / Country Net Time Clock Time Pace Event
1 6 Hiwot G. F 21 Ethiopia 3217 3217 321 10 Mile
2 22 Buze D. F 22 Ethiopia 3232 3232 323 10 Mile
3 16 Gladys K. F 31 Kenya 3276 3276 327 10 Mile
4 4 Mamitu D. F 33 Ethiopia 3285 3285 328 10 Mile
5 20 Karolina N. F 35 Poland 3288 3288 328 10 Mile
... ... ... ... ... ... ... ... ... ...
19961 25153 Andres E. M 33 Woodbridge, VA 5287 5334 1700 5K
\n \n `````\n :::\n :::\n\n3. **Air pollution and birth outcomes, study components.** Researchers collected data to examine the relationship between air pollutants and preterm births in Southern California.\n During the study air pollution levels were measured by air quality monitoring stations.\n Specifically, levels of carbon monoxide were recorded in parts per million, nitrogen dioxide and ozone in parts per hundred million, and coarse particulate matter (PM$_{10}$) in $\\mu g/m^3$.\n Length of gestation data were collected on 143,196 births between the years 1989 and 1993, and air pollution exposure during gestation was calculated for each birth.\n The analysis suggested that increased ambient PM$_{10}$ and, to a lesser degree, CO concentrations may be associated with the occurrence of preterm births.\n [@Ritz+Yu+Chapa+Fruin:2000]\n\n a. Identify the main research question of the study.\n\n b. Who are the subjects in this study, and how many are included?\n\n c. What are the variables in the study?\n Identify each variable as numerical or categorical.\n If numerical, state whether the variable is discrete or continuous.\n If categorical, state whether the variable is ordinal.\n\n4. **Cheaters, study components.** Researchers studying the relationship between honesty, age and self-control conducted an experiment on 160 children between the ages of 5 and 15.\n Participants reported their age, sex, and whether they were an only child or not.\n The researchers asked each child to toss a fair coin in private and to record the outcome (white or black) on a paper sheet and said they would only reward children who report white.\n [@Bucciol:2011]\n\n a. Identify the main research question of the study.\n\n b. Who are the subjects in this study, and how many are included?\n\n c. The study's findings can be summarized as follows: *\"Half the students were explicitly told not to cheat, and the others were not given any explicit instructions. In the no instruction group probability of cheating was found to be uniform across groups based on child's characteristics. In the group that was explicitly told to not cheat, girls were less likely to cheat, and while rate of cheating didn't vary by age for boys, it decreased with age for girls.\"* How many variables were recorded for each subject in the study in order to conclude these findings?\n State the variables and their types.\n\n5. 
**Gamification and statistics, study components.** Gamification is the application of game-design elements and game principles in non-game contexts.\n In educational settings, gamification is often implemented as educational activities to solve problems by using characteristics of game elements.\n Researchers investigating the effects of gamification on learning statistics conducted a study where they split college students in a statistics class into four groups: (1) no reading exercises and no gamification, (2) reading exercises but no gamification, (3) gamification but no reading exercises, and (4) gamification and reading exercises.\n Students in all groups also attended lectures.\n Students in the class were from two majors: Electrical and Computer Engineering (n = 279) and Business Administration (n = 86).\n After their assigned learning experience, each student took a final evaluation comprised of 30 multiple choice question and their score was measured as the number of questions they answered correctly.\n The researchers considered students' gender, level of studies (first through fourth year) and academic major.\n Other variables considered were expertise in the English language and use of personal computers and games, both of which were measured on a scale of 1 (beginner) to 5 (proficient).\n The study found that gamification had a positive effect on student learning compared to traditional teaching methods involving lectures and reading exercises.\n They also found that the effect was larger for females and Engineering students.\n [@Legaki:2020]\n\n a. Identify the main research question of the study.\n\n b. Who were the subjects in this study, and how many were included?\n\n c. What are the variables in the study?\n Identify each variable as numerical or categorical.\n If numerical, state whether the variable is discrete or continuous.\n If categorical, state whether the variable is ordinal.\n\n6. **Stealers, study components.** In a study of the relationship between socio-economic class and unethical behavior, 129 University of California undergraduates at Berkeley were asked to identify themselves as having low or high social class by comparing themselves to others with the most (least) money, most (least) education, and most (least) respected jobs.\n They were also presented with a jar of individually wrapped candies and informed that the candies were for children in a nearby laboratory, but that they could take some if they wanted.\n After completing some unrelated tasks, participants reported the number of candies they had taken.\n [@Piff:2012]\n\n a. Identify the main research question of the study.\n\n b. Who were the subjects in this study, and how many were included?\n\n c. The study found that students who were identified as upper-class took more candy than others.\n How many variables were recorded for each subject in the study in order to conclude these findings?\n State the variables and their types.\n\n \\clearpage\n\n7. 
**Migraine and acupuncture.** A migraine is a particularly painful type of headache, which patients sometimes wish to treat with acupuncture.\n    To determine whether acupuncture relieves migraine pain, researchers conducted a randomized controlled study where 89 individuals who identified as female and were diagnosed with migraine headaches were randomly assigned to one of two groups: treatment or control.\n    Forty-three (43) patients in the treatment group received acupuncture that is specifically designed to treat migraines.\n    Forty-six (46) patients in the control group received placebo acupuncture (needle insertion at non-acupoint locations).\n    Twenty-four (24) hours after patients received acupuncture, they were asked if they were pain free.\n    Results are summarized in the contingency table below.\n    Also provided is a figure from the original paper displaying the appropriate area (M) versus the inappropriate area (S) used in the treatment of migraine attacks.\n    [^_01-ex-data-hello-3] [@Allais:2011]\n\n    ::: {.cell fig.asp='0.618'}\n    ::: {.cell-output-display}\n    `````{=html}\n    \n    
Group Pain free: No Pain free: Yes
Control 44 2
Treatment 33 10
\n \n `````\n :::\n :::\n\n a. What percent of patients in the treatment group were pain free 24 hours after receiving acupuncture?\n\n b. What percent were pain free in the control group?\n\n c. In which group did a higher percent of patients become pain free 24 hours after receiving acupuncture?\n\n d. Your findings so far might suggest that acupuncture is an effective treatment for migraines for all people who suffer from migraines.\n However, this is not the only possible conclusion.\n What is one other possible explanation for the observed difference between the percentages of patients that are pain free 24 hours after receiving acupuncture in the two groups?\n\n e. What are the explanatory and response variables in this study?\n\n8. **Sinusitis and antibiotics.** Researchers studying the effect of antibiotic treatment for acute sinusitis compared to symptomatic treatments randomly assigned 166 adults diagnosed with acute sinusitis to one of two groups: treatment or control.\n Study participants received either a 10-day course of amoxicillin (an antibiotic) or a placebo similar in appearance and taste.\n The placebo consisted of symptomatic treatments such as acetaminophen, nasal decongestants, etc.\n At the end of the 10-day period, patients were asked if they experienced improvement in symptoms.\n The distribution of responses is summarized below.[^_01-ex-data-hello-4]\n [@Garbutt:2012]\n\n ::: {.cell fig.asp='0.618'}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Group Improvement: No Improvement: Yes
Control 16 65
Treatment 19 66
\n \n `````\n :::\n :::\n\n a. What percent of patients in the treatment group experienced improvement in symptoms?\n\n b. What percent experienced improvement in symptoms in the control group?\n\n c. In which group did a higher percentage of patients experience improvement in symptoms?\n\n d. Your findings so far might suggest a real difference in the effectiveness of antibiotic and placebo treatments for improving symptoms of sinusitis.\n However, this is not the only possible conclusion.\n What is one other possible explanation for the observed difference between the percentages patients who experienced improvement in symptoms?\n\n e. What are the explanatory and response variables in this study?\n\n9. **Daycare fines, study components.** Researchers tested the deterrence hypothesis which predicts that the introduction of a penalty will reduce the occurrence of the behavior subject to the fine, with the condition that the fine leaves everything else unchanged by instituting a fine for late pickup at daycare centers.\n For this study, they worked with 10 volunteer daycare centers that did not originally impose a fine to parents for picking up their kids late.\n They randomly selected 6 of these daycare centers and instituted a monetary fine (of a considerable amount) for picking up children late and then removed it.\n In the remaining 4 daycare centers no fine was introduced.\n The study period was divided into four: before the fine (weeks 1--4), the first 4 weeks with the fine (weeks 5-8), the last 8 weeks with fine (weeks 9--16), and the after fine period (weeks 17-20).\n Throughout the study, the number of kids who were picked up late was recorded each week for each daycare.\n The study found that the number of late-coming parents increased discernibly when the fine was introduced, and no reduction occurred after the fine was removed.[^_01-ex-data-hello-5]\n [@Gneezy:2000]\n\n ::: {.cell fig.asp='0.618'}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
center week group late_pickups study_period
1 1 test 8 before fine
1 2 test 8 before fine
1 3 test 7 before fine
1 4 test 6 before fine
1 5 test 8 first 4 weeks with fine
... ... ... ... ...
10 20 control 13 after fine
\n \n `````\n :::\n :::\n\n a. Is this an observational study or an experiment?\n Explain your reasoning.\n\n b. What are the cases in this study and how many are included?\n\n c. What is the response variable in the study and what type of variable is it?\n\n d. What are the explanatory variables in the study and what types of variables are they?\n\n \\vspace{5mm}\n\n10. **Efficacy of COVID-19 vaccine on adolescents, study components.** Results of a Phase 3 trial announced in March 2021 show that the Pfizer-BioNTech COVID-19 vaccine demonstrated 100% efficacy and robust antibody responses on 12 to 15 years old adolescents with or without prior evidence of SARS-CoV-2 infection.\n In this trial 2,260 adolescents were randomly assigned to two groups: one group got the vaccine (n = 1,131) and the other got a placebo (n = 1,129).\n While 18 cases of COVID-19 were observed in the placebo group, none were observed in the vaccine group.[^_01-ex-data-hello-6]\n [@Pfizer:2021]\n\n a. Is this an observational study or an experiment?\n Explain your reasoning.\n\n b. What are the cases in this study and how many are included?\n\n c. What is the response variable in the study and what type of variable is it?\n\n d. What are the explanatory variables in the study and what types of variables are they?\n\n \\clearpage\n\n11. **Palmer penguins.** Data were collected on 344 penguins living on three islands (Torgersen, Biscoe, and Dream) in the Palmer Archipelago, Antarctica.\n In addition to which island each penguin lives on, the data contains information on the species of the penguin (*Adelie*, *Chinstrap*, or *Gentoo*), its bill length, bill depth, and flipper length (measured in millimeters), its body mass (measured in grams), and the sex of the penguin (female or male).[^_01-ex-data-hello-7]\n Bill length and depth are measured as shown in the image.\n [^_01-ex-data-hello-8] [@palmerpenguins]\n\n ::: {.cell fig.asp='0.618'}\n ::: {.cell-output-display}\n ![](exercises/images/culmen_depth.png){fig-alt='Bill length and depth marked on an illustration of a penguin head.' width=40%}\n :::\n :::\n\n a. How many cases were included in the data?\n b. How many numerical variables are included in the data? Indicate what they are, and if they are continuous or discrete.\n c. How many categorical variables are included in the data, and what are they? List the corresponding levels (categories) for each.\n\n \\vspace{5mm}\n\n12. **Smoking habits of UK residents.** A survey was conducted to study the smoking habits of 1,691 UK residents.\n Below is a data frame displaying a portion of the data collected in this survey.\n A blank cell indicates that data for that variable was not available for a given respondent.[^_01-ex-data-hello-9]\n\n ::: {.cell fig.asp='0.618'}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
sex age marital_status gross_income smoke amount (weekend) amount (weekday)
1 Female 61 Married 2,600 to 5,200 No
2 Female 61 Divorced 10,400 to 15,600 Yes 5 4
3 Female 69 Widowed 5,200 to 10,400 No
4 Female 50 Married 5,200 to 10,400 No
5 Male 31 Single 10,400 to 15,600 Yes 10 20
... ... ... ... ... ...
1691 Male 49 Divorced Above 36,400 Yes 15 10
\n \n `````\n :::\n :::\n\n a. What does each row of the data frame represent?\n\n b. How many participants were included in the survey?\n\n c. Indicate whether each variable in the study is numerical or categorical.\n If numerical, identify as continuous or discrete.\n If categorical, indicate if the variable is ordinal.\n\n \\clearpage\n\n13. **US Airports.** The visualization below shows the geographical distribution of airports in the contiguous United States and Washington, DC. This visualization was constructed based on a dataset where each observation is an airport.[^_01-ex-data-hello-10]\n\n ::: {.cell fig.asp='0.618'}\n ::: {.cell-output-display}\n ![](01-data-hello_files/figure-html/unnamed-chunk-33-1.png){width=90%}\n :::\n :::\n\n a. List the variables you believe were necessary to create this visualization.\n\n b. Indicate whether each variable in the study is numerical or categorical.\n If numerical, identify as continuous or discrete.\n If categorical, indicate if the variable is ordinal.\n\n \\vspace{5mm}\n\n14. **UN Votes.** The visualization below shows voting patterns in the United States, Canada, and Mexico in the United Nations General Assembly on a variety of issues.\n Specifically, for a given year between 1946 and 2019, it displays the percentage of roll calls in which the country voted yes for each issue.\n This visualization was constructed based on a dataset where each observation is a country/year pair.[^_01-ex-data-hello-11]\n\n ::: {.cell fig.asp='0.8'}\n ::: {.cell-output-display}\n ![](01-data-hello_files/figure-html/unnamed-chunk-34-1.png){width=90%}\n :::\n :::\n\n a. List the variables used in creating this visualization.\n\n b. Indicate whether each variable in the study is numerical or categorical.\n If numerical, identify as continuous or discrete.\n If categorical, indicate if the variable is ordinal.\n\n15. **UK baby names.** The visualization below shows the number of baby girls born in the United Kingdom (comprised of England & Wales, Northern Ireland, and Scotland) who were given the name \"Fiona\" over the years.[^_01-ex-data-hello-12]\n\n ::: {.cell fig.asp='0.618'}\n ::: {.cell-output-display}\n ![](01-data-hello_files/figure-html/unnamed-chunk-35-1.png){width=90%}\n :::\n :::\n\n a. List the variables you believe were necessary to create this visualization.\n\n b. Indicate whether each variable in the study is numerical or categorical.\n If numerical, identify as continuous or discrete.\n If categorical, indicate if the variable is ordinal.\n\n \\vspace{5mm}\n\n16. **Shows on Netflix.** The visualization below shows the distribution of ratings of TV shows on Netflix (a streaming entertainment service) based on the decade they were released in and the country they were produced in.\n In the dataset, each observation is a TV show.[^_01-ex-data-hello-13]\n\n ::: {.cell fig.asp='0.618'}\n ::: {.cell-output-display}\n ![](01-data-hello_files/figure-html/unnamed-chunk-36-1.png){width=90%}\n :::\n :::\n\n a. List the variables you believe were necessary to create this visualization.\n\n b. Indicate whether each variable in the study is numerical or categorical.\n If numerical, identify as continuous or discrete.\n If categorical, indicate if the variable is ordinal.\n\n \\clearpage\n\n17. 
**Stanford Open Policing.** The Stanford Open Policing project gathers, analyzes, and releases records from traffic stops by law enforcement agencies across the United States.\n Their goal is to help researchers, journalists, and policy makers investigate and improve interactions between police and the public.\n The following is an excerpt from a summary table created based off the data collected as part of this project.\n [@pierson2020large]\n\n ::: {.cell fig.asp='0.618'}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
County State Driver race / ethnicity Driver arrest rate Stops / year Car search rate
Apache County AZ Black 0.016 266 0.077
Apache County AZ Hispanic 0.018 1008 0.053
Apache County AZ White 0.006 6322 0.017
Cochise County AZ Black 0.015 1169 0.047
Cochise County AZ Hispanic 0.01 9453 0.037
Cochise County AZ White 0.008 10826 0.024
... ... ... ... ... ...
Wood County WI Black 0.098 16 0.244
Wood County WI Hispanic 0.029 27 0.036
Wood County WI White 0.029 1157 0.033
\n \n `````\n :::\n :::\n\n a. What variables were collected on each individual traffic stop in order to create the summary table above?\n\n b. State whether each variable is numerical or categorical.\n If numerical, state whether it is continuous or discrete.\n If categorical, state whether it is ordinal or not.\n\n c. Suppose we wanted to evaluate whether vehicle search rates are different for drivers of different races.\n In this analysis, which variable would be the response variable and which variable would be the explanatory variable?\n\n \\vspace{5mm}\n\n18. **Space launches.** The following summary table shows the number of space launches in the US by the type of launching agency and the outcome of the launch (success or failure).[^_01-ex-data-hello-14]\n\n ::: {.cell fig.asp='0.618'}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Agency type 1957-1999: Failure 1957-1999: Success 2000-2018: Failure 2000-2018: Success
Private 13 295 10 562
State 281 3751 33 711
Startup 0 0 5 65
\n \n `````\n :::\n :::\n\n a. What variables were collected on each launch in order to create to the summary table above?\n\n b. State whether each variable is numerical or categorical.\n If numerical, state whether it is continuous or discrete.\n If categorical, state whether it is ordinal or not.\n\n c. Suppose we wanted to study how the success rate of launches vary between launching agencies and over time.\n In this analysis, which variable would be the response variable and which variable would be the explanatory variable?\n\n \\clearpage\n\n19. **Pet names.** The city of Seattle, WA has an open data portal that includes pets registered in the city.\n For each registered pet, we have information on the pet's name and species.\n The following visualization plots the proportion of dogs with a given name versus the proportion of cats with the same name.\n The 20 most common cat and dog names are displayed.\n The diagonal line on the plot is the $x = y$ line; if a name appeared on this line, the name's popularity would be exactly the same for dogs and cats.[^_01-ex-data-hello-15]\n\n ::: {.cell fig.asp='0.618'}\n ::: {.cell-output-display}\n ![](01-data-hello_files/figure-html/unnamed-chunk-39-1.png){width=90%}\n :::\n :::\n\n a. Are these data collected as part of an experiment or an observational study?\n\n b. What is the most common dog name?\n What is the most common cat name?\n\n c. What names are more common for cats than dogs?\n\n d. Is the relationship between the two variables positive or negative?\n What does this mean in context of the data?\n\n \\vspace{5mm}\n\n20. **Stressed out in an elevator.** In a study evaluating the relationship between stress and muscle cramps, half the subjects are randomly assigned to be exposed to increased stress by being placed into an elevator that falls rapidly and stops abruptly and the other half are left at no or baseline stress.\n\n a. What type of study is this?\n\n b. 
Can this study be used to conclude a causal relationship between increased stress and muscle cramps?\n\n[^_01-ex-data-hello-1]: The [`mcu_films`](http://openintrostat.github.io/openintro/reference/mcu_films.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_01-ex-data-hello-2]: The [`run17`](http://openintrostat.github.io/openintro/reference/run17.html) data used in this exercise can be found in the [**cherryblossom**](http://openintrostat.github.io/cherryblossom) R package.\n\n[^_01-ex-data-hello-3]: The [`migraine`](http://openintrostat.github.io/openintro/reference/migraine.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_01-ex-data-hello-4]: The [`sinusitis`](http://openintrostat.github.io/openintro/reference/sinusitis.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_01-ex-data-hello-5]: The [`daycare_fines`](http://openintrostat.github.io/openintro/reference/daycare_fines.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_01-ex-data-hello-6]: The [`biontech_adolescents`](http://openintrostat.github.io/openintro/reference/biontech_adolescents.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_01-ex-data-hello-7]: The [`penguins`](https://allisonhorst.github.io/palmerpenguins/reference/penguins.html) data used in this exercise can be found in the [**palmerpenguins**](https://allisonhorst.github.io/palmerpenguins/) R package.\n\n[^_01-ex-data-hello-8]: Artwork by [Allison Horst](https://twitter.com/allison_horst).\n\n[^_01-ex-data-hello-9]: The [`smoking`](http://openintrostat.github.io/openintro/reference/smoking.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_01-ex-data-hello-10]: The [`usairports`](http://openintrostat.github.io/airports/reference/usairports.html) data used in this exercise can be found in the [**airports**](http://openintrostat.github.io/airports/) R package.\n\n[^_01-ex-data-hello-11]: The data used in this exercise can be found in the [**unvotes**](https://cran.r-project.org/web/packages/unvotes/index.html) R package.\n\n[^_01-ex-data-hello-12]: The [`ukbabynames`](https://mine-cetinkaya-rundel.github.io/ukbabynames/reference/ukbabynames.html) data used in this exercise can be found in the [**ukbabynames**](https://mine-cetinkaya-rundel.github.io/ukbabynames/) R package.\n\n[^_01-ex-data-hello-13]: The [`netflix_titles`](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-04-20/readme.md) data used in this exercise can be found in the [**tidytuesdayR**](https://cran.r-project.org/web/packages/tidytuesdayR/index.html) R package.\n\n[^_01-ex-data-hello-14]: The data used in this exercise comes from the [JSR Launch Vehicle Database, 2019 Feb 10 Edition](https://www.openintro.org/go?id=textbook-space-launches-data&referrer=ims0_html).\n\n[^_01-ex-data-hello-15]: The [`seattlepets`](http://openintrostat.github.io/openintro/reference/seattlepets.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n\n:::\n", + "engine": "knitr", + "markdown": "# Hello data {#sec-data-hello}\n\n\n\n\n\n::: {.chapterintro 
data-latex=\"\"}\nScientists seek to answer questions using rigorous methods and careful observations.\nThese observations -- collected from the likes of field notes, surveys, and experiments -- form the backbone of a statistical investigation and are called **data**.\nStatistics is the study of how best to collect, analyze, and draw conclusions from data.\nIn this first chapter, we focus on both the properties of data and on the collection of data.\n:::\n\n\n\n\n\n## Case study: Using stents to prevent strokes {sec-case-study-stents-strokes}\n\nIn this section we introduce a classic challenge in statistics: evaluating the efficacy of a medical treatment.\nTerms in this section, and indeed much of this chapter, will all be revisited later in the text.\nThe plan for now is simply to get a sense of the role statistics can play in practice.\n\nAn experiment is designed to study the effectiveness of stents in treating patients at risk of stroke [@chimowitz2011stenting].\nStents are small mesh tubes that are placed inside narrow or weak arteries to assist in patient recovery after cardiac events and reduce the risk of an additional heart attack or death.\n\nMany doctors have hoped that there would be similar benefits for patients at risk of stroke.\nWe start by writing the principal question the researchers hope to answer:\n\n> Does the use of stents reduce the risk of stroke?\n\nThe researchers who asked this question conducted an experiment with 451 at-risk patients.\nEach volunteer patient was randomly assigned to one of two groups:\n\n- **Treatment group**. Patients in the treatment group received a stent and medical management. The medical management included medications, management of risk factors, and help in lifestyle modification.\n- **Control group**. Patients in the control group received the same medical management as the treatment group, but they did not receive stents.\n\nResearchers randomly assigned 224 patients to the treatment group and 227 to the control group.\nIn this study, the control group provides a reference point against which we can measure the medical impact of stents in the treatment group.\n\n\\clearpage\n\nResearchers studied the effect of stents at two time points: 30 days after enrollment and 365 days after enrollment.\nThe results of 5 patients are summarized in @tbl-stentStudyResultsDF.\nPatient outcomes are recorded as `stroke` or `no event`, representing whether the patient had a stroke during that time period.\n\n::: {.data data-latex=\"\"}\nThe [`stent30`](http://openintrostat.github.io/openintro/reference/stent30.html) data and [`stent365`](http://openintrostat.github.io/openintro/reference/stent365.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n\n::: {#tbl-stentStudyResultsDF .cell tbl-cap='Results for five patients from the stent study.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
patient group 30 days 365 days
1 treatment no event no event
2 treatment stroke stroke
3 treatment no event no event
4 treatment no event no event
5 control no event no event
\n\n`````\n:::\n:::\n\n\nIt would be difficult to answer a question on the impact of stents on the occurrence of strokes for **all** study patients using these *individual* observations.\nThis question is better addressed by performing a statistical data analysis of *all* observations.\n@tbl-stentStudyResultsDFsummary summarizes the raw data in a more helpful way.\nIn this table, we can quickly see what happened over the entire study.\nFor instance, to identify the number of patients in the treatment group who had a stroke within 30 days after the treatment, we look in the leftmost column (30 days), at the intersection of treatment and stroke: 33.\nTo identify the number of control patients who did not have a stroke after 365 days after receiving treatment, we look at the rightmost column (365 days), at the intersection of control and no event: 199.\n\n\n::: {#tbl-stentStudyResultsDFsummary .cell tbl-cap='Descriptive statistics for the stent study.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Group 30 days: Stroke 30 days: No event 365 days: Stroke 365 days: No event
Control 13 214 28 199
Treatment 33 191 45 179
Total 46 405 73 378
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nOf the 224 patients in the treatment group, 45 had a stroke by the end of the first year.\nUsing these two numbers, compute the proportion of patients in the treatment group who had a stroke by the end of their first year.\n(Note: answers to all Guided Practice exercises are provided in footnotes!)[^01-data-hello-1]\n:::\n\n[^01-data-hello-1]: The proportion of the 224 patients who had a stroke within 365 days: $45/224 = 0.20.$\n\nWe can compute summary statistics from the table to give us a better idea of how the impact of the stent treatment differed between the two groups.\nA **summary statistic** is a single number summarizing data from a sample.\nFor instance, the primary results of the study after 1 year could be described by two summary statistics: the proportion of people who had a stroke in the treatment and control groups.\n\n\n\n\n\n- Proportion who had a stroke in the treatment (stent) group: $45/224 = 0.20 = 20\\%.$\n- Proportion who had a stroke in the control group: $28/227 = 0.12 = 12\\%.$\n\nThese two summary statistics are useful in looking for differences in the groups, and we are in for a surprise: an additional 8% of patients in the treatment group had a stroke!\nThis is important for two reasons.\nFirst, it is contrary to what doctors expected, which was that stents would *reduce* the rate of strokes.\nSecond, it leads to a statistical question: do the data show a \"real\" difference between the groups?\n\nThis second question is subtle.\nSuppose you flip a coin 100 times.\nWhile the chance a coin lands heads in any given coin flip is 50%, we probably won't observe exactly 50 heads.\nThis type of variation is part of almost any type of data generating process.\nIt is possible that the 8% difference in the stent study is due to this natural variation.\nHowever, the larger the difference we observe (for a particular sample size), the less believable it is that the difference is due to chance.\nSo, what we are really asking is the following: if in fact stents have no effect, how likely is it that we observe such a large difference?\n\nWhile we do not yet have statistical tools to fully address this question on our own, we can comprehend the conclusions of the published analysis: there was compelling evidence of harm by stents in this study of stroke patients.\n\n**Be careful:** Do not generalize the results of this study to all patients and all stents.\nThis study looked at patients with very specific characteristics who volunteered to be a part of this study and who may not be representative of all stroke patients.\nIn addition, there are many types of stents, and this study only considered the self-expanding Wingspan stent (Boston Scientific).\nHowever, this study does leave us with an important lesson: we should keep our eyes open for surprises.\n\n## Data basics {sec-data-basics}\n\nEffective presentation and description of data is a first step in most analyses.\nThis section introduces one structure for organizing data as well as some terminology that will be used throughout this book.\n\n### Observations, variables, and data matrices\n\n@tbl-loan50-df displays six rows of a dataset for 50 randomly sampled loans offered through Lending Club, which is a peer-to-peer lending company.\nThis dataset will be referred to as `loan50`.\n\n::: {.data data-latex=\"\"}\nThe [`loan50`](http://openintrostat.github.io/openintro/reference/loans_full_schema.html) data can be found in the 
[**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nEach row in the table represents a single loan.\nThe formal name for a row is a \\index{case}**case** or \\index{unit of observation}**observational unit**.\nThe columns represent characteristics of each loan, where each column is referred to as a \\index{variable}**variable**.\nFor example, the first row represents a loan of \\$22,000 with an interest rate of 10.90%, where the borrower is based in New Jersey (NJ) and has an income of \\$59,000.\n\n\n\n\n\n::: {.guidedpractice data-latex=\"\"}\nWhat is the grade of the first loan in @tbl-loan50-df?\nAnd what is the home ownership status of the borrower for that first loan?\nReminder: for these Guided Practice questions, you can check your answer in the footnote.[^01-data-hello-2]\n:::\n\n[^01-data-hello-2]: The loan's grade is B, and the borrower rents their residence.\n\nIn practice, it is especially important to ask clarifying questions to ensure important aspects of the data are understood.\nFor instance, it is always important to be sure we know what each variable means and its units of measurement.\nDescriptions of the variables in the `loan50` dataset are given in @tbl-loan-50-variables.\n\n\n::: {#tbl-loan50-df .cell tbl-cap='Six observations from the `loan50` dataset.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
loan_amount interest_rate term grade state total_income homeownership
1 22,000 10.90 60 B NJ 59,000 rent
2 6,000 9.92 36 B CA 60,000 rent
3 25,000 26.30 36 E SC 75,000 mortgage
4 6,000 9.92 36 B CA 75,000 rent
5 25,000 9.43 60 B OH 254,000 mortgage
6 6,400 9.92 36 B IN 67,000 mortgage
\n\n`````\n:::\n:::\n\n::: {#tbl-loan-50-variables .cell tbl-cap='Variables and their descriptions for the `loan50` dataset.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Variable Description
loan_amount Amount of the loan received, in US dollars.
interest_rate Interest rate on the loan, in an annual percentage.
term The length of the loan, which is always set as a whole number of months.
grade Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid.
state US state where the borrower resides.
total_income Borrower's total income, including any second income, in US dollars.
homeownership Indicates whether the person owns, owns but has a mortgage, or rents.
\n\n`````\n:::\n:::\n\n\nThe data in @tbl-loan50-df represent a \\index{data frame}**data frame**, which is a convenient and common way to organize data, especially if collecting data in a spreadsheet.\nA data frame where each row is a unique case (observational unit), each column is a variable, and each cell is a single value is commonly referred to as \\index{tidy data}**tidy data** [@wickham2014].\n\nWhen recording data, use a tidy data frame unless you have a very good reason to use a different structure.\nThis structure allows new cases to be added as rows or new variables as new columns and facilitates visualization, summarization, and other statistical analyses.\n\n::: {.guidedpractice data-latex=\"\"}\nThe grades for assignments, quizzes, and exams in a course are often recorded in a gradebook that takes the form of a data frame.\nHow might you organize a course's grade data using a data frame?\nDescribe the observational units and variables.[^01-data-hello-3]\n:::\n\n[^01-data-hello-3]: There are multiple strategies that can be followed.\n    One common strategy is to have each student represented by a row, and then add a column for each assignment, quiz, or exam.\n    Under this setup, it is easy to review a single line to understand the grade history of a student.\n    There should also be columns to include student information, such as one column to list student names.\n\n::: {.guidedpractice data-latex=\"\"}\nWe consider data for 3,142 counties in the United States, which includes the name of each county, the state where it resides, its population in 2017, the population change from 2010 to 2017, poverty rate, and nine additional characteristics.\nHow might these data be organized in a data frame?[^01-data-hello-4]\n:::\n\n[^01-data-hello-4]: Each county may be viewed as a case, and there are fourteen pieces of information recorded for each case.\n    A table with 3,142 rows and 14 columns could hold these data, where each row represents a county and each column represents a particular piece of information.\n\n\\clearpage\n\nThe data described in the Guided Practice above represent the `county` dataset, which is shown as a data frame in @tbl-county-df.\nThe variables displayed in @tbl-county-df, as well as the variables in the dataset that did not fit, are described in @tbl-county-variables.\n\n\n::: {#tbl-county-df .cell tbl-cap='Six observations and six variables from the `county` dataset.'}\n::: {.cell-output-display}\n`````{=html}\n\n
name state pop2017 pop_change unemployment_rate median_edu
Autauga County Alabama 55,504 1.48 3.86 some_college
Baldwin County Alabama 212,628 9.19 3.99 some_college
Barbour County Alabama 25,270 -6.22 5.90 hs_diploma
Bibb County Alabama 22,668 0.73 4.39 hs_diploma
Blount County Alabama 58,013 0.68 4.02 hs_diploma
Bullock County Alabama 10,309 -2.28 4.93 hs_diploma
\n\n`````\n:::\n:::\n\n::: {#tbl-county-variables .cell tbl-cap='Variables and their descriptions for the `county` dataset.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Variable Description
name Name of county.
state Name of state.
pop2000 Population in 2000.
pop2010 Population in 2010.
pop2017 Population in 2017.
pop_change Population change from 2010 to 2017 (in percent).
poverty Percent of population in poverty in 2017.
homeownership Homeownership rate, 2006-2010.
multi_unit Multi-unit rate: percent of housing units that are in multi-unit structures, 2006-2010.
unemployment_rate Unemployment rate in 2017.
metro Whether the county contains a metropolitan area, taking one of the values yes or no.
median_edu Median education level (2013-2017), taking one of the values below_hs, hs_diploma, some_college, or bachelors.
per_capita_income Per capita (per person) income (2013-2017).
median_hh_income Median household income.
smoking_ban Describes the type of county-level smoking ban in place in 2010, taking one of the values none, partial, or comprehensive.
\n\n`````\n:::\n:::\n\n\n::: {.data data-latex=\"\"}\nThe [`county`](http://openintrostat.github.io/usdata/reference/county.html) data can be found in the [**usdata**](http://openintrostat.github.io/usdata) R package.\n:::\n\n### Types of variables {variable-types}\n\nExamine the `unemployment_rate`, `pop2017`, `state`, and `median_edu` variables in the `county` dataset.\nEach of these variables is inherently different from the other three, yet some share certain characteristics.\n\nFirst consider `unemployment_rate`, which is said to be a \\index{numerical variable}**numerical** variable since it can take a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values.\nOn the other hand, we would not classify a variable reporting telephone area codes as numerical since the average, sum, and difference of area codes does not have any clear meaning.\nInstead, we would consider area codes as a categorical variable.\n\n\n\n\n\nThe `pop2017` variable is also numerical, although it seems to be a little different than `unemployment_rate`.\nThis variable of the population count can only take whole non-negative numbers (0, 1, 2, ...).\nFor this reason, the population variable is said to be **discrete** since it can only take numerical values with jumps.\nOn the other hand, the unemployment rate variable is said to be **continuous**.\n\n\n\n\n\nThe variable `state` can take up to 51 values after accounting for Washington, DC: Alabama, Alaska, ..., and Wyoming.\nBecause the responses themselves are categories, `state` is called a **categorical** variable, and the possible values (states) are called the variable's **levels** (e.g., District of Columbia, Alabama, Alaska, etc.) .\n\n\n\n\n\nFinally, consider the `median_edu` variable, which describes the median education level of county residents and takes values `below_hs`, `hs_diploma`, `some_college`, or `bachelors` in each county.\nThis variable seems to be a hybrid: it is a categorical variable, but the levels have a natural ordering.\nA variable with these properties is called an **ordinal** variable, while a regular categorical variable without this type of special ordering is called a **nominal** variable.\nTo simplify analyses, any categorical variable in this book will be treated as a nominal (unordered) categorical variable.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Breakdown of variables into their respective types.](01-data-hello_files/figure-html/variables-1.png){fig-alt='Types of variables are broken down into numerical (which can be discrete or continuous) and categorical (which can be ordinal or nominal).' 
width=90%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nData were collected about students in a statistics course.\nThree variables were recorded for each student: number of siblings, student height, and whether the student had previously taken a statistics course.\nClassify each of the variables as continuous numerical, discrete numerical, or categorical.\n\n------------------------------------------------------------------------\n\nThe number of siblings and student height represent numerical variables.\nBecause the number of siblings is a count, it is discrete.\nHeight varies continuously, so it is a continuous numerical variable.\nThe last variable classifies students into two categories -- those who have and those who have not taken a statistics course -- which makes this variable categorical.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nAn experiment is evaluating the effectiveness of a new drug in treating migraines.\nA `group` variable is used to indicate the experiment group for each patient: treatment or control.\nThe `num_migraines` variable represents the number of migraines the patient experienced during a 3-month period.\nClassify each variable as either numerical or categorical?[^01-data-hello-5]\n:::\n\n[^01-data-hello-5]: The `group` variable can take just one of two group names, making it categorical.\n The `num_migraines` variable describes a count of the number of migraines, which is an outcome where basic arithmetic is sensible, which means this is a numerical outcome; more specifically, since it represents a count, `num_migraines` is a discrete numerical variable.\n\n### Relationships between variables {variable-relations}\n\nMany analyses are motivated by a researcher looking for a relationship between two or more variables.\nA social scientist may like to answer some of the following questions:\n\n> Does a higher-than-average increase in county population tend to correspond to counties with higher or lower median household incomes?\n\n> If homeownership in one county is lower than the national average, will the percent of housing units that are in multi-unit structures in that county tend to be above or below the national average?\n\n> How much can the median education level explain the median household income for counties in the US?\n\nTo answer these questions, data must be collected, such as the `county` dataset shown in @tbl-county-df.\nExamining \\index{summary statistic}**summary statistics** can provide numerical insights about the specifics of each of these questions.\nAlternatively, graphs can be used to visually explore the data, potentially providing more insight than a summary statistic.\n\n\\index{scatterplot}**Scatterplots** are one type of graph used to study the relationship between two numerical variables.\n@fig-county-multi-unit-homeownership displays the relationship between the variables `homeownership` and `multi_unit`, which is the percent of housing units that are in multi-unit structures (e.g., apartments, condos).\nEach point on the plot represents a single county.\nFor instance, the highlighted dot corresponds to County 413 in the `county` dataset: Chattahoochee County, Georgia, which has 39.4% of housing units that are in multi-unit structures and a homeownership rate of 31.3%.\nThe scatterplot suggests a relationship between the two variables: counties with a higher rate of housing units that are in multi-unit structures tend to have lower homeownership rates.\nWe might brainstorm as to why this relationship exists and 
investigate each idea to determine which are the most reasonable explanations.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A scatterplot of homeownership versus the percent of housing units that are in multi-unit structures for US counties. The highlighted dot represents Chattahoochee County, Georgia, which has a multi-unit rate of 39.4\\% and a homeownership rate of 31.3\\%.](01-data-hello_files/figure-html/fig-county-multi-unit-homeownership-1.png){#fig-county-multi-unit-homeownership width=90%}\n:::\n:::\n\n\nThe multi-unit and homeownership rates are said to be associated because the plot shows a discernible pattern.\nWhen two variables show some connection with one another, they are called **associated** variables.\n\n\n\n\n\n::: {.guidedpractice data-latex=\"\"}\nExamine the variables in the `loan50` dataset, which are described in @tbl-loan-50-variables.\nCreate two questions about possible relationships between variables in `loan50` that are of interest to you.[^01-data-hello-6]\n:::\n\n[^01-data-hello-6]: Two example questions: (1) What is the relationship between loan amount and total income?\n (2) If someone's income is above the average, will their interest rate tend to be above or below the average?\n\n::: {.workedexample data-latex=\"\"}\nThis example examines the relationship between the percent change in population from 2010 to 2017 and median household income for counties, which is visualized as a scatterplot in @fig-county-pop-change-med-hh-income.\nAre these variables associated?\n\n------------------------------------------------------------------------\n\nThe larger the median household income for a county, the higher the population growth observed for the county.\nWhile it isn't true that every county with a higher median household income has a higher population growth, the trend in the plot is evident.\nSince there is some relationship between the variables, they are associated.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A scatterplot showing population change against median household income. 
Owsley County of Kentucky is highlighted, which lost 3.63\\% of its population from 2010 to 2017 and had median household income of \\$22,736.](01-data-hello_files/figure-html/fig-county-pop-change-med-hh-income-1.png){#fig-county-pop-change-med-hh-income width=90%}\n:::\n:::\n\n\nBecause there is a downward trend in @fig-county-multi-unit-homeownership -- counties with more housing units that are in multi-unit structures are associated with lower homeownership -- these variables are said to be **negatively associated**.\nA **positive association** is shown in the relationship between the `median_hh_income` and `pop_change` variables in @fig-county-pop-change-med-hh-income, where counties with higher median household income tend to have higher rates of population growth.\n\n\n\n\n\nIf two variables are not associated, then they are said to be **independent**.\nThat is, two variables are independent if there is no evident relationship between the two.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Associated or independent, not both.**\n\nA pair of variables are either related in some way (associated) or not (independent).\nNo pair of variables is both associated and independent.\n:::\n\n### Explanatory and response variables\n\nWhen we ask questions about the relationship between two variables, we sometimes also want to determine if the change in one variable causes a change in the other.\nConsider the following rephrasing of an earlier question about the `county` dataset:\n\n> If there is an increase in the median household income in a county, does this drive an increase in its population?\n\nIn this question, we are asking whether one variable affects another.\nIf this is our underlying belief, then *median household income* is the **explanatory variable**, and the *population change* is the **response variable** in the hypothesized relationship.[^01-data-hello-7]\n\n[^01-data-hello-7]: In some disciplines, it's customary to refer to the explanatory variable as the **independent variable** and the response variable as the **dependent variable**.\n However, this becomes confusing since a *pair* of variables might be independent or dependent, so we avoid this language.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Explanatory and response variables.**\n\nWhen we suspect one variable might causally affect another, we label the first variable the explanatory variable and the second the response variable.\nWe also use the terms **explanatory** and **response** to describe variables where the **response** might be predicted using the **explanatory** even if there is no causal relationship.\n\n
explanatory variable $\\rightarrow$ *might affect* $\\rightarrow$ response variable
\n\n
For many pairs of variables, there is no hypothesized relationship, and these labels would not be applied to either variable in such cases.\n:::\n\nBear in mind that the act of labeling the variables in this way does nothing to guarantee that a causal relationship exists.\nA formal evaluation to check whether one variable causes a change in another requires an experiment.\n\n### Observational studies and experiments\n\nThere are two primary types of data collection: experiments and observational studies.\n\nWhen researchers want to evaluate the effect of particular traits, treatments, or conditions, they conduct an **experiment**.\nFor instance, we may suspect drinking a high-calorie energy drink will improve performance in a race.\nTo check if there really is a causal relationship between the explanatory variable (whether the runner drank an energy drink or not) and the response variable (the race time), researchers identify a sample of individuals and split them into groups.\nThe individuals in each group are *assigned* a treatment.\nWhen individuals are randomly assigned to a group, the experiment is called a **randomized experiment**.\nRandom assignment organizes the participants in a study into groups that are roughly equal on all aspects, thus allowing us to control for any confounding variables that might affect the outcome (e.g., fitness level, racing experience, etc.).\nFor example, each runner in the experiment could be randomly assigned, perhaps by flipping a coin, into one of two groups: the first group receives a **placebo** (fake treatment, in this case a no-calorie drink) and the second group receives the high-calorie energy drink.\nSee the case study in @sec-case-study-stents-strokes for another example of an experiment, though that study did not employ a placebo.\n\n\n\n\n\nResearchers perform an **observational study** when they collect data in a way that does not directly interfere with how the data arise.\nFor instance, researchers may collect information via surveys, review medical or company records, or follow a **cohort** of many similar individuals to form hypotheses about why certain diseases might develop.\nIn each of these situations, researchers merely observe the data that arise.\nIn general, observational studies can provide evidence of a naturally occurring association between variables, but they cannot by themselves show a causal connection as they do not offer a mechanism for controlling for confounding variables.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Association** $\\neq$ **Causation.**\n\nIn general, association does not imply causation.\nAn advantage of a randomized experiment is that it is easier to establish causal relationships with such a study.\nThe main reason for this is that observational studies do not control for confounding variables, and hence establishing causal relationships with observational studies requires advanced statistical methods (that are beyond the scope of this book).\nWe will revisit this idea when we discuss experiments later in the book.\n:::\n\n\\vspace{10mm}\n\n## Chapter review {chp1-review}\n\n### Summary\n\nThis chapter introduced you to the world of data.\nData can be organized in many ways but tidy data, where each row represents an observation and each column represents a variable, lends itself most easily to statistical analysis.\nMany of the ideas from this chapter will be seen as we move on to doing full data analyses.\nIn the next chapter you're going to learn about how we can design studies to collect 
the data we need to make conclusions with the desired scope of inference.\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
associated experiment ordinal
case explanatory variable placebo
categorical independent positive association
cohort level randomized experiment
continuous negative association response variable
data nominal summary statistic
data frame numerical tidy data
dependent observational study variable
discrete observational unit
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {chp1-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-01].\n\n::: {.exercises data-latex=\"\"}\n1. **Marvel Cinematic Universe films.** The data frame below contains information on Marvel Cinematic Universe films through the Infinity saga (a movie storyline spanning from Ironman in 2008 to Endgame in 2019).\n Box office totals are given in millions of US Dollars.\n How many observations and how many variables does this data frame have?[^_01-ex-data-hello-1]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Title Length: Hrs Length: Mins Release Date Gross: Opening Wknd US Gross: US Gross: World
1 Iron Man 2 6 5/2/2008 98.62 319.03 585.8
2 The Incredible Hulk 1 52 6/12/2008 55.41 134.81 264.77
3 Iron Man 2 2 4 5/7/2010 128.12 312.43 623.93
4 Thor 1 55 5/6/2011 65.72 181.03 449.33
5 Captain America: The First Avenger 2 4 7/22/2011 65.06 176.65 370.57
... ... ... ... ... ... ... ...
23 Spiderman: Far from Home 2 9 7/2/2019 92.58 390.53 1131.93
\n \n `````\n :::\n :::\n\n2. **Cherry Blossom Run.** The data frame below contains information on runners in the 2017 Cherry Blossom Run, which is an annual road race that takes place in Washington, DC. Most runners participate in a 10-mile run while a smaller fraction take part in a 5k run or walk.\n How many observations and how many variables does this data frame have?[^_01-ex-data-hello-2]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Bib Name Sex Age City / Country Net Time Clock Time Pace Event
1 6 Hiwot G. F 21 Ethiopia 3217 3217 321 10 Mile
2 22 Buze D. F 22 Ethiopia 3232 3232 323 10 Mile
3 16 Gladys K. F 31 Kenya 3276 3276 327 10 Mile
4 4 Mamitu D. F 33 Ethiopia 3285 3285 328 10 Mile
5 20 Karolina N. F 35 Poland 3288 3288 328 10 Mile
... ... ... ... ... ... ... ... ... ...
19961 25153 Andres E. M 33 Woodbridge, VA 5287 5334 1700 5K
\n \n `````\n :::\n :::\n\n3. **Air pollution and birth outcomes, study components.** Researchers collected data to examine the relationship between air pollutants and preterm births in Southern California.\n During the study air pollution levels were measured by air quality monitoring stations.\n Specifically, levels of carbon monoxide were recorded in parts per million, nitrogen dioxide and ozone in parts per hundred million, and coarse particulate matter (PM$_{10}$) in $\\mu g/m^3$.\n Length of gestation data were collected on 143,196 births between the years 1989 and 1993, and air pollution exposure during gestation was calculated for each birth.\n The analysis suggested that increased ambient PM$_{10}$ and, to a lesser degree, CO concentrations may be associated with the occurrence of preterm births.\n [@Ritz+Yu+Chapa+Fruin:2000]\n\n a. Identify the main research question of the study.\n\n b. Who are the subjects in this study, and how many are included?\n\n c. What are the variables in the study?\n Identify each variable as numerical or categorical.\n If numerical, state whether the variable is discrete or continuous.\n If categorical, state whether the variable is ordinal.\n\n4. **Cheaters, study components.** Researchers studying the relationship between honesty, age and self-control conducted an experiment on 160 children between the ages of 5 and 15.\n Participants reported their age, sex, and whether they were an only child or not.\n The researchers asked each child to toss a fair coin in private and to record the outcome (white or black) on a paper sheet and said they would only reward children who report white.\n [@Bucciol:2011]\n\n a. Identify the main research question of the study.\n\n b. Who are the subjects in this study, and how many are included?\n\n c. The study's findings can be summarized as follows: *\"Half the students were explicitly told not to cheat, and the others were not given any explicit instructions. In the no instruction group probability of cheating was found to be uniform across groups based on child's characteristics. In the group that was explicitly told to not cheat, girls were less likely to cheat, and while rate of cheating didn't vary by age for boys, it decreased with age for girls.\"* How many variables were recorded for each subject in the study in order to conclude these findings?\n State the variables and their types.\n\n5. 
**Gamification and statistics, study components.** Gamification is the application of game-design elements and game principles in non-game contexts.\n In educational settings, gamification is often implemented as educational activities to solve problems by using characteristics of game elements.\n Researchers investigating the effects of gamification on learning statistics conducted a study where they split college students in a statistics class into four groups: (1) no reading exercises and no gamification, (2) reading exercises but no gamification, (3) gamification but no reading exercises, and (4) gamification and reading exercises.\n Students in all groups also attended lectures.\n Students in the class were from two majors: Electrical and Computer Engineering (n = 279) and Business Administration (n = 86).\n After their assigned learning experience, each student took a final evaluation comprised of 30 multiple choice question and their score was measured as the number of questions they answered correctly.\n The researchers considered students' gender, level of studies (first through fourth year) and academic major.\n Other variables considered were expertise in the English language and use of personal computers and games, both of which were measured on a scale of 1 (beginner) to 5 (proficient).\n The study found that gamification had a positive effect on student learning compared to traditional teaching methods involving lectures and reading exercises.\n They also found that the effect was larger for females and Engineering students.\n [@Legaki:2020]\n\n a. Identify the main research question of the study.\n\n b. Who were the subjects in this study, and how many were included?\n\n c. What are the variables in the study?\n Identify each variable as numerical or categorical.\n If numerical, state whether the variable is discrete or continuous.\n If categorical, state whether the variable is ordinal.\n\n6. **Stealers, study components.** In a study of the relationship between socio-economic class and unethical behavior, 129 University of California undergraduates at Berkeley were asked to identify themselves as having low or high social class by comparing themselves to others with the most (least) money, most (least) education, and most (least) respected jobs.\n They were also presented with a jar of individually wrapped candies and informed that the candies were for children in a nearby laboratory, but that they could take some if they wanted.\n After completing some unrelated tasks, participants reported the number of candies they had taken.\n [@Piff:2012]\n\n a. Identify the main research question of the study.\n\n b. Who were the subjects in this study, and how many were included?\n\n c. The study found that students who were identified as upper-class took more candy than others.\n How many variables were recorded for each subject in the study in order to conclude these findings?\n State the variables and their types.\n\n \\clearpage\n\n7. 
\"Figure **Migraine and acupuncture.** A migraine is a particularly painful type of headache, which patients sometimes wish to treat with acupuncture.\n To determine whether acupuncture relieves migraine pain, researchers conducted a randomized controlled study where 89 individuals who identified as female diagnosed with migraine headaches were randomly assigned to one of two groups: treatment or control.\n Forty-three (43) patients in the treatment group received acupuncture that is specifically designed to treat migraines.\n Forty-six (46) patients in the control group received placebo acupuncture (needle insertion at non-acupoint locations).\n Twenty-four (24) hours after patients received acupuncture, they were asked if they were pain free.\n Results are summarized in the contingency table below.\n Also provided is a figure from the original paper displaying the appropriate area (M) versus the inappropriate area (S) used in the treatment of migraine attacks.\n [^_01-ex-data-hello-3] [@Allais:2011]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Group Pain free: No Pain free: Yes
Control 44 2
Treatment 33 10
\n \n `````\n :::\n :::\n\n a. What percent of patients in the treatment group were pain free 24 hours after receiving acupuncture?\n\n b. What percent were pain free in the control group?\n\n c. In which group did a higher percent of patients become pain free 24 hours after receiving acupuncture?\n\n d. Your findings so far might suggest that acupuncture is an effective treatment for migraines for all people who suffer from migraines.\n However, this is not the only possible conclusion.\n What is one other possible explanation for the observed difference between the percentages of patients that are pain free 24 hours after receiving acupuncture in the two groups?\n\n e. What are the explanatory and response variables in this study?\n\n8. **Sinusitis and antibiotics.** Researchers studying the effect of antibiotic treatment for acute sinusitis compared to symptomatic treatments randomly assigned 166 adults diagnosed with acute sinusitis to one of two groups: treatment or control.\n Study participants received either a 10-day course of amoxicillin (an antibiotic) or a placebo similar in appearance and taste.\n The placebo consisted of symptomatic treatments such as acetaminophen, nasal decongestants, etc.\n At the end of the 10-day period, patients were asked if they experienced improvement in symptoms.\n The distribution of responses is summarized below.[^_01-ex-data-hello-4]\n [@Garbutt:2012]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Group Improvement: No Improvement: Yes
Control 16 65
Treatment 19 66
\n \n `````\n :::\n :::\n\n a. What percent of patients in the treatment group experienced improvement in symptoms?\n\n b. What percent experienced improvement in symptoms in the control group?\n\n c. In which group did a higher percentage of patients experience improvement in symptoms?\n\n d. Your findings so far might suggest a real difference in the effectiveness of antibiotic and placebo treatments for improving symptoms of sinusitis.\n However, this is not the only possible conclusion.\n What is one other possible explanation for the observed difference between the percentages patients who experienced improvement in symptoms?\n\n e. What are the explanatory and response variables in this study?\n\n9. **Daycare fines, study components.** Researchers tested the deterrence hypothesis which predicts that the introduction of a penalty will reduce the occurrence of the behavior subject to the fine, with the condition that the fine leaves everything else unchanged by instituting a fine for late pickup at daycare centers.\n For this study, they worked with 10 volunteer daycare centers that did not originally impose a fine to parents for picking up their kids late.\n They randomly selected 6 of these daycare centers and instituted a monetary fine (of a considerable amount) for picking up children late and then removed it.\n In the remaining 4 daycare centers no fine was introduced.\n The study period was divided into four: before the fine (weeks 1--4), the first 4 weeks with the fine (weeks 5-8), the last 8 weeks with fine (weeks 9--16), and the after fine period (weeks 17-20).\n Throughout the study, the number of kids who were picked up late was recorded each week for each daycare.\n The study found that the number of late-coming parents increased discernibly when the fine was introduced, and no reduction occurred after the fine was removed.[^_01-ex-data-hello-5]\n [@Gneezy:2000]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
center week group late_pickups study_period
1 1 test 8 before fine
1 2 test 8 before fine
1 3 test 7 before fine
1 4 test 6 before fine
1 5 test 8 first 4 weeks with fine
... ... ... ... ...
10 20 control 13 after fine
\n \n `````\n :::\n :::\n\n a. Is this an observational study or an experiment?\n Explain your reasoning.\n\n b. What are the cases in this study and how many are included?\n\n c. What is the response variable in the study and what type of variable is it?\n\n d. What are the explanatory variables in the study and what types of variables are they?\n\n \\vspace{5mm}\n\n10. **Efficacy of COVID-19 vaccine on adolescents, study components.** Results of a Phase 3 trial announced in March 2021 show that the Pfizer-BioNTech COVID-19 vaccine demonstrated 100% efficacy and robust antibody responses on 12 to 15 years old adolescents with or without prior evidence of SARS-CoV-2 infection.\n In this trial 2,260 adolescents were randomly assigned to two groups: one group got the vaccine (n = 1,131) and the other got a placebo (n = 1,129).\n While 18 cases of COVID-19 were observed in the placebo group, none were observed in the vaccine group.[^_01-ex-data-hello-6]\n [@Pfizer:2021]\n\n a. Is this an observational study or an experiment?\n Explain your reasoning.\n\n b. What are the cases in this study and how many are included?\n\n c. What is the response variable in the study and what type of variable is it?\n\n d. What are the explanatory variables in the study and what types of variables are they?\n\n \\clearpage\n\n11. **Palmer penguins.** Data were collected on 344 penguins living on three islands (Torgersen, Biscoe, and Dream) in the Palmer Archipelago, Antarctica.\n In addition to which island each penguin lives on, the data contains information on the species of the penguin (*Adelie*, *Chinstrap*, or *Gentoo*), its bill length, bill depth, and flipper length (measured in millimeters), its body mass (measured in grams), and the sex of the penguin (female or male).[^_01-ex-data-hello-7]\n Bill length and depth are measured as shown in the image.\n [^_01-ex-data-hello-8] [@palmerpenguins]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](exercises/images/culmen_depth.png){fig-alt='Bill length and depth marked on an illustration of a penguin head.' width=40%}\n :::\n :::\n\n a. How many cases were included in the data?\n b. How many numerical variables are included in the data? Indicate what they are, and if they are continuous or discrete.\n c. How many categorical variables are included in the data, and what are they? List the corresponding levels (categories) for each.\n\n \\vspace{5mm}\n\n12. **Smoking habits of UK residents.** A survey was conducted to study the smoking habits of 1,691 UK residents.\n Below is a data frame displaying a portion of the data collected in this survey.\n A blank cell indicates that data for that variable was not available for a given respondent.[^_01-ex-data-hello-9]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
sex age marital_status gross_income smoke amount: weekend amount: weekday
1 Female 61 Married 2,600 to 5,200 No
2 Female 61 Divorced 10,400 to 15,600 Yes 5 4
3 Female 69 Widowed 5,200 to 10,400 No
4 Female 50 Married 5,200 to 10,400 No
5 Male 31 Single 10,400 to 15,600 Yes 10 20
... ... ... ... ... ...
1691 Male 49 Divorced Above 36,400 Yes 15 10
\n \n `````\n :::\n :::\n\n a. What does each row of the data frame represent?\n\n b. How many participants were included in the survey?\n\n c. Indicate whether each variable in the study is numerical or categorical.\n If numerical, identify as continuous or discrete.\n If categorical, indicate if the variable is ordinal.\n\n \\clearpage\n\n13. **US Airports.** The visualization below shows the geographical distribution of airports in the contiguous United States and Washington, DC. This visualization was constructed based on a dataset where each observation is an airport.[^_01-ex-data-hello-10]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](01-data-hello_files/figure-html/unnamed-chunk-33-1.png){width=90%}\n :::\n :::\n\n a. List the variables you believe were necessary to create this visualization.\n\n b. Indicate whether each variable in the study is numerical or categorical.\n If numerical, identify as continuous or discrete.\n If categorical, indicate if the variable is ordinal.\n\n \\vspace{5mm}\n\n14. **UN Votes.** The visualization below shows voting patterns in the United States, Canada, and Mexico in the United Nations General Assembly on a variety of issues.\n Specifically, for a given year between 1946 and 2019, it displays the percentage of roll calls in which the country voted yes for each issue.\n This visualization was constructed based on a dataset where each observation is a country/year pair.[^_01-ex-data-hello-11]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](01-data-hello_files/figure-html/unnamed-chunk-34-1.png){width=90%}\n :::\n :::\n\n a. List the variables used in creating this visualization.\n\n b. Indicate whether each variable in the study is numerical or categorical.\n If numerical, identify as continuous or discrete.\n If categorical, indicate if the variable is ordinal.\n\n15. **UK baby names.** The visualization below shows the number of baby girls born in the United Kingdom (comprised of England & Wales, Northern Ireland, and Scotland) who were given the name \"Fiona\" over the years.[^_01-ex-data-hello-12]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](01-data-hello_files/figure-html/unnamed-chunk-35-1.png){width=90%}\n :::\n :::\n\n a. List the variables you believe were necessary to create this visualization.\n\n b. Indicate whether each variable in the study is numerical or categorical.\n If numerical, identify as continuous or discrete.\n If categorical, indicate if the variable is ordinal.\n\n \\vspace{5mm}\n\n16. **Shows on Netflix.** The visualization below shows the distribution of ratings of TV shows on Netflix (a streaming entertainment service) based on the decade they were released in and the country they were produced in.\n In the dataset, each observation is a TV show.[^_01-ex-data-hello-13]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](01-data-hello_files/figure-html/unnamed-chunk-36-1.png){width=90%}\n :::\n :::\n\n a. List the variables you believe were necessary to create this visualization.\n\n b. Indicate whether each variable in the study is numerical or categorical.\n If numerical, identify as continuous or discrete.\n If categorical, indicate if the variable is ordinal.\n\n \\clearpage\n\n17. 
**Stanford Open Policing.** The Stanford Open Policing project gathers, analyzes, and releases records from traffic stops by law enforcement agencies across the United States.\n Their goal is to help researchers, journalists, and policy makers investigate and improve interactions between police and the public.\n The following is an excerpt from a summary table created based off the data collected as part of this project.\n [@pierson2020large]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
County State Driver: Race / Ethnicity Arrest rate Stops / year Car: Search rate
Apache County AZ Black 0.016 266 0.077
Apache County AZ Hispanic 0.018 1008 0.053
Apache County AZ White 0.006 6322 0.017
Cochise County AZ Black 0.015 1169 0.047
Cochise County AZ Hispanic 0.01 9453 0.037
Cochise County AZ White 0.008 10826 0.024
... ... ... ... ... ...
Wood County WI Black 0.098 16 0.244
Wood County WI Hispanic 0.029 27 0.036
Wood County WI White 0.029 1157 0.033
\n \n `````\n :::\n :::\n\n a. What variables were collected on each individual traffic stop in order to create the summary table above?\n\n b. State whether each variable is numerical or categorical.\n If numerical, state whether it is continuous or discrete.\n If categorical, state whether it is ordinal or not.\n\n c. Suppose we wanted to evaluate whether vehicle search rates are different for drivers of different races.\n In this analysis, which variable would be the response variable and which variable would be the explanatory variable?\n\n \\vspace{5mm}\n\n18. **Space launches.** The following summary table shows the number of space launches in the US by the type of launching agency and the outcome of the launch (success or failure).[^_01-ex-data-hello-14]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
1957-1999: Failure 1957-1999: Success 2000-2018: Failure 2000-2018: Success
Private 13 295 10 562
State 281 3751 33 711
Startup 0 0 5 65
\n \n `````\n :::\n :::\n\n a. What variables were collected on each launch in order to create to the summary table above?\n\n b. State whether each variable is numerical or categorical.\n If numerical, state whether it is continuous or discrete.\n If categorical, state whether it is ordinal or not.\n\n c. Suppose we wanted to study how the success rate of launches vary between launching agencies and over time.\n In this analysis, which variable would be the response variable and which variable would be the explanatory variable?\n\n \\clearpage\n\n19. **Pet names.** The city of Seattle, WA has an open data portal that includes pets registered in the city.\n For each registered pet, we have information on the pet's name and species.\n The following visualization plots the proportion of dogs with a given name versus the proportion of cats with the same name.\n The 20 most common cat and dog names are displayed.\n The diagonal line on the plot is the $x = y$ line; if a name appeared on this line, the name's popularity would be exactly the same for dogs and cats.[^_01-ex-data-hello-15]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](01-data-hello_files/figure-html/unnamed-chunk-39-1.png){width=90%}\n :::\n :::\n\n a. Are these data collected as part of an experiment or an observational study?\n\n b. What is the most common dog name?\n What is the most common cat name?\n\n c. What names are more common for cats than dogs?\n\n d. Is the relationship between the two variables positive or negative?\n What does this mean in context of the data?\n\n \\vspace{5mm}\n\n20. **Stressed out in an elevator.** In a study evaluating the relationship between stress and muscle cramps, half the subjects are randomly assigned to be exposed to increased stress by being placed into an elevator that falls rapidly and stops abruptly and the other half are left at no or baseline stress.\n\n a. What type of study is this?\n\n b. 
Can this study be used to conclude a causal relationship between increased stress and muscle cramps?\n\n[^_01-ex-data-hello-1]: The [`mcu_films`](http://openintrostat.github.io/openintro/reference/mcu_films.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_01-ex-data-hello-2]: The [`run17`](http://openintrostat.github.io/openintro/reference/run17.html) data used in this exercise can be found in the [**cherryblossom**](http://openintrostat.github.io/cherryblossom) R package.\n\n[^_01-ex-data-hello-3]: The [`migraine`](http://openintrostat.github.io/openintro/reference/migraine.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_01-ex-data-hello-4]: The [`sinusitis`](http://openintrostat.github.io/openintro/reference/sinusitis.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_01-ex-data-hello-5]: The [`daycare_fines`](http://openintrostat.github.io/openintro/reference/daycare_fines.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_01-ex-data-hello-6]: The [`biontech_adolescents`](http://openintrostat.github.io/openintro/reference/biontech_adolescents.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_01-ex-data-hello-7]: The [`penguins`](https://allisonhorst.github.io/palmerpenguins/reference/penguins.html) data used in this exercise can be found in the [**palmerpenguins**](https://allisonhorst.github.io/palmerpenguins/) R package.\n\n[^_01-ex-data-hello-8]: Artwork by [Allison Horst](https://twitter.com/allison_horst).\n\n[^_01-ex-data-hello-9]: The [`smoking`](http://openintrostat.github.io/openintro/reference/smoking.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_01-ex-data-hello-10]: The [`usairports`](http://openintrostat.github.io/airports/reference/usairports.html) data used in this exercise can be found in the [**airports**](http://openintrostat.github.io/airports/) R package.\n\n[^_01-ex-data-hello-11]: The data used in this exercise can be found in the [**unvotes**](https://cran.r-project.org/web/packages/unvotes/index.html) R package.\n\n[^_01-ex-data-hello-12]: The [`ukbabynames`](https://mine-cetinkaya-rundel.github.io/ukbabynames/reference/ukbabynames.html) data used in this exercise can be found in the [**ukbabynames**](https://mine-cetinkaya-rundel.github.io/ukbabynames/) R package.\n\n[^_01-ex-data-hello-13]: The [`netflix_titles`](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-04-20/readme.md) data used in this exercise can be found in the [**tidytuesdayR**](https://cran.r-project.org/web/packages/tidytuesdayR/index.html) R package.\n\n[^_01-ex-data-hello-14]: The data used in this exercise comes from the [JSR Launch Vehicle Database, 2019 Feb 10 Edition](https://www.openintro.org/go?id=textbook-space-launches-data&referrer=ims0_html).\n\n[^_01-ex-data-hello-15]: The [`seattlepets`](http://openintrostat.github.io/openintro/reference/seattlepets.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n\n:::\n", "supporting": [ "01-data-hello_files" ], diff --git 
a/_freeze/11-foundations-randomization/execute-results/html.json b/_freeze/11-foundations-randomization/execute-results/html.json index ba877ef8..830eca9c 100644 --- a/_freeze/11-foundations-randomization/execute-results/html.json +++ b/_freeze/11-foundations-randomization/execute-results/html.json @@ -1,7 +1,8 @@ { - "hash": "5b5610b02848b52c2a44ee65f19d3f4c", + "hash": "2fa89c2b893c4bd92de15405b28fea3f", "result": { - "markdown": "\n\n\n# Hypothesis testing with randomization {#sec-foundations-randomization}\n\n::: {.chapterintro data-latex=\"\"}\nStatistical inference is primarily concerned with understanding and quantifying the uncertainty of parameter estimates.\nWhile the equations and details change depending on the setting, the foundations for inference are the same throughout all of statistics.\n\nWe start with two case studies designed to motivate the process of making decisions about research claims.\nWe formalize the process through the introduction of the **hypothesis testing framework**\\index{hypothesis test}, which allows us to formally evaluate claims about the population.\n:::\n\n\n\n\n\nThroughout the book so far, you have worked with data in a variety of contexts.\nYou have learned how to summarize and visualize the data as well as how to model multiple variables at the same time.\nSometimes the dataset at hand represents the entire research question.\nBut more often than not, the data have been collected to answer a research question about a larger group of which the data are a (hopefully) representative subset.\n\nYou may agree that there is almost always variability in data -- one dataset will not be identical to a second dataset even if they are both collected from the same population using the same methods.\nHowever, quantifying the variability in the data is neither obvious nor easy to do, i.e., answering the question \"*how* different is one dataset from another?\" is not trivial.\n\nFirst, a note on notation.\nWe generally use $p$ to denote a population proportion and $\\hat{p}$ to a sample proportion.\nSimilarly, we generally use $\\mu$ to denote a population mean and $\\bar{x}$ to denote a sample mean.\n\n::: {.workedexample data-latex=\"\"}\nSuppose your professor splits the students in your class into two groups: students who sit on the left side of the classroom and students who sit on the right side of the classroom.\nIf $\\hat{p}_{L}$ represents the proportion of students who prefer to read books on screen who sit on the left side of the classroom and $\\hat{p}_{R}$ represents the proportion of students who prefer to read books on screen who sit on the right side of the classroom, would you be surprised if $\\hat{p}_{L}$ did not *exactly* equal $\\hat{p}_{R}$?\n\n------------------------------------------------------------------------\n\nWhile the proportions $\\hat{p}_{L}$ and $\\hat{p}_{R}$ would probably be close to each other, it would be unusual for them to be exactly the same.\nWe would probably observe a small difference due to *chance*.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nIf we do not think the side of the room a person sits on in class is related to whether they prefer to read books on screen, what assumption are we making about the relationship between these two variables?[^11-foundations-randomization-1]\n:::\n\n[^11-foundations-randomization-1]: We would be assuming that these two variables are **independent**\\index{independent}.\n\n\n\n\n\nStudying randomness of this form is a key focus of statistics.\nThroughout this chapter, and 
those that follow, we provide three different approaches for quantifying the variability inherent in data: randomization, bootstrapping, and mathematical models.\nUsing the methods provided in this chapter, we will be able to draw conclusions beyond the dataset at hand to research questions about larger populations that the samples come from.\n\nThe first type of variability we will explore comes from experiments where the explanatory variable (or treatment) is randomly assigned to the observational units.\nAs you learned in Chapter \\@ref(data-hello), a randomized experiment can be used to assess whether one variable (the explanatory variable) causes changes in a second variable (the response variable).\nEvery dataset has some variability in it, so to decide whether the variability in the data is due to (1) the causal mechanism (the randomized explanatory variable in the experiment) or instead (2) natural variability inherent to the data, we set up a sham randomized experiment as a comparison.\nThat is, we assume that each observational unit would have gotten the exact same response value regardless of the treatment level.\nBy reassigning the treatments many many times, we can compare the actual experiment to the sham experiment.\nIf the actual experiment has more extreme results than any of the sham experiments, we are led to believe that it is the explanatory variable which is causing the result and not just variability inherent to the data.\nUsing a few different case studies, let's look more carefully at this idea of a **randomization test**\\index{randomization test}.\n\n\n\n\n\n## Sex discrimination case study {#caseStudySexDiscrimination}\n\nWe consider a study investigating sex discrimination in the 1970s, which is set in the context of personnel decisions within a bank.\nThe research question we hope to answer is, \"Are individuals who identify as female discriminated against in promotion decisions made by their managers who identify as male?\" [@Rosen:1974]\n\n::: {.data data-latex=\"\"}\nThe [`sex_discrimination`](http://openintrostat.github.io/openintro/reference/sex_discrimination.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nThis study considered sex roles, and only allowed for options of \"male\" and \"female\".\nWe should note that the identities being considered are not gender identities and that the study allowed only for a binary classification of sex.\n\n### Observed data\n\nThe participants in this study were 48 bank supervisors who identified as male, attending a management institute at the University of North Carolina in 1972.\nThey were asked to assume the role of the personnel director of a bank and were given a personnel file to judge whether the person should be promoted to a branch manager position.\nThe files given to the participants were identical, except that half of them indicated the candidate identified as male and the other half indicated the candidate identified as female.\nThese files were randomly assigned to the bank managers.\n\n::: {.guidedpractice data-latex=\"\"}\nIs this an observational study or an experiment?\nHow does the type of study impact what can be inferred from the results?[^11-foundations-randomization-2]\n:::\n\n[^11-foundations-randomization-2]: The study is an experiment, as subjects were randomly assigned a \"male\" file or a \"female\" file (remember, all the files were actually identical in content).\n Since this is an experiment, the results can be used to evaluate 
a causal relationship between the sex of a candidate and the promotion decision.\n\n\n::: {.cell}\n\n:::\n\n\nFor each supervisor both the sex associated with the assigned file and the promotion decision were recorded.\nUsing the results of the study summarized in Table \\@ref(tab:sex-discrimination-obs), we would like to evaluate if individuals who identify as female are unfairly discriminated against in promotion decisions.\nIn this study, a smaller proportion of female identifying applications were promoted than males (0.583 versus 0.875), but it is unclear whether the difference provides *convincing evidence* that individuals who identify as female are unfairly discriminated against.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary results for the sex discrimination study.
sex decision: promoted decision: not promoted Total
male 21 3 24
female 14 10 24
Total 35 13 48
\n\n`````\n:::\n:::\n\n\nThe data are visualized in Figure \\@ref(fig:sex-rand-obs) as a set of cards.\nNote that each card denotes a personnel file (an observation from our dataset) and the colors indicate the decision: red for promoted and white for not promoted.\nAdditionally, the observations are broken up into groups of male and female identifying groups.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The sex discrimination study can be thought of as 48 red and white cards.](images/sex-rand-01-obs.png){fig-alt='48 cards are laid out; 24 indicating male files, 24 indicated female files. Of the 24 male files 3 of the cards are colored white, and 21 of the cards are colored red. Of the female files, 10 of the cards are colored white, and 14 of the cards are colored red.' width=40%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nStatisticians are sometimes called upon to evaluate the strength of evidence.\nWhen looking at the rates of promotion in this study, why might we be tempted to immediately conclude that individuals identifying as female are being discriminated against?\n\n------------------------------------------------------------------------\n\nThe large difference in promotion rates (58.3% for female personnel versus 87.5% for male personnel) suggest there might be discrimination against women in promotion decisions.\nHowever, we cannot yet be sure if the observed difference represents discrimination or is just due to random chance when there is no discrimination occurring.\nSince we wouldn't expect the sample proportions to be *exactly* equal, even if the truth was that the promotion decisions were independent of sex, we can't rule out random chance as a possible explanation when simply comparing the sample proportions.\n:::\n\nThe previous example is a reminder that the observed outcomes in the sample may not perfectly reflect the true relationships between variables in the underlying population.\nTable \\@ref(tab:sex-discrimination-obs) shows there were 7 fewer promotions for female identifying personnel than for the male personnel, a difference in promotion rates of 29.2% $\\left( \\frac{21}{24} - \\frac{14}{24} = 0.292 \\right).$ This observed difference is what we call a **point estimate**\\index{point estimate} of the true difference.\nThe point estimate of the difference in promotion rate is large, but the sample size for the study is small, making it unclear if this observed difference represents discrimination or whether it is simply due to chance when there is no discrimination occurring.\nChance can be thought of as the claim due to natural variability; discrimination can be thought of as the claim the researchers set out to demonstrate.\nWe label these two competing claims, $H_0$ and $H_A:$\n\n\n\n\n\n\\vspace{-2mm}\n\n- $H_0:$ **Null hypothesis**\\index{null hypothesis}. The variables `sex` and `decision` are independent. They have no relationship, and the observed difference between the proportion of males and females who were promoted, 29.2%, was due to the natural variability inherent in the population.\n- $H_A:$ **Alternative hypothesis**\\index{alternative hypothesis}. The variables `sex` and `decision` are *not* independent. 
The difference in promotion rates of 29.2% was not due to natural variability, and equally qualified female personnel are less likely to be promoted than male personnel.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Hypothesis testing.**\n\nThese hypotheses are part of what is called a **hypothesis test**\\index{hypothesis test}.\nA hypothesis test is a statistical technique used to evaluate competing claims using data.\nOften times, the null hypothesis takes a stance of *no difference* or *no effect*.\nThis hypothesis assumes that any differences seen are due to the variability inherent in the population and could have occurred by random chance.\n\nIf the null hypothesis and the data notably disagree, then we will reject the null hypothesis in favor of the alternative hypothesis.\n\nThere are many nuances to hypothesis testing, so do not worry if you aren't a master of hypothesis testing at the end of this section.\nWe'll discuss these ideas and details many times in this chapter as well as in the chapters that follow.\n:::\n\n\n\n\n\nWhat would it mean if the null hypothesis, which says the variables `sex` and `decision` are unrelated, was true?\nIt would mean each banker would decide whether to promote the candidate without regard to the sex indicated on the personnel file.\nThat is, the difference in the promotion percentages would be due to the natural variability in how the files were randomly allocated to different bankers, and this randomization just happened to give rise to a relatively large difference of 29.2%.\n\nConsider the alternative hypothesis: bankers were influenced by which sex was listed on the personnel file.\nIf this was true, and especially if this influence was substantial, we would expect to see some difference in the promotion rates of male and female candidates.\nIf this sex bias was against female candidates, we would expect a smaller fraction of promotion recommendations for female personnel relative to the male personnel.\n\nWe will choose between the two competing claims by assessing if the data conflict so much with $H_0$ that the null hypothesis cannot be deemed reasonable.\nIf data and the null claim seem to be at odds with one another, and the data seem to support $H_A,$ then we will reject the notion of independence and conclude that the data provide evidence of discrimination.\n\n\\vspace{-2mm}\n\n### Variability of the statistic\n\nTable \\@ref(tab:sex-discrimination-obs) shows that 35 bank supervisors recommended promotion and 13 did not.\nNow, suppose the bankers' decisions were independent of the sex of the candidate.\nThen, if we conducted the experiment again with a different random assignment of sex to the files, differences in promotion rates would be based only on random fluctuation in promotion decisions.\nWe can actually perform this **randomization**, which simulates what would have happened if the bankers' decisions had been independent of `sex` but we had distributed the file sexes differently.[^11-foundations-randomization-3]\n\n[^11-foundations-randomization-3]: The test procedure we employ in this section is sometimes referred to as a **randomization test**.\n If the explanatory variable had not been randomly assigned, as in an observational study, the procedure would be referred to as a **permutation test**.\n Permutation tests are used for observational studies, where the explanatory variable was not randomly assigned.\\index{permutation test}.\n\n\n\n\n\nIn the **simulation**\\index{simulation}, we thoroughly shuffle the 48 
personnel files, 35 labelled `promoted` and 13 labelled `not promoted`, together and we deal files into two new stacks.\nNote that by keeping 35 promoted and 13 not promoted, we are assuming that 35 of the bank managers would have promoted the individual whose content is contained in the file **independent** of the sex indicated on their file.\nWe will deal 24 files into the first stack, which will represent the 24 \"female\" files.\nThe second stack will also have 24 files, and it will represent the 24 \"male\" files.\nFigure \\@ref(fig:sex-rand-shuffle-1) highlights both the shuffle and the reallocation to the sham sex groups.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The sex discrimination data is shuffled and reallocated to new groups of male and female files.](images/sex-rand-02-shuffle-1.png){fig-alt='The 48 red and white cards which denote the original data are shuffled and reassigned, 24 to each group indicating 24 male files and 24 female files.' width=80%}\n:::\n:::\n\n\nThen, as we did with the original data, we tabulate the results and determine the fraction of personnel files designated as \"male\" and \"female\" who were promoted.\n\n\n\n\n\nSince the randomization of files in this simulation is independent of the promotion decisions, any difference in promotion rates is due to chance.\nTable \\@ref(tab:sex-discrimination-rand-1) show the results of one such simulation.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Table: Simulation results, where the difference in promotion rates between male and female is purely due to random chance.

| sex    | promoted | not promoted | Total |
|--------|----------|--------------|-------|
| male   | 18       | 6            | 24    |
| female | 17       | 7            | 24    |
| Total  | 35       | 13           | 48    |
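In practice, this shuffling is done with software rather than with physical cards. The following is a minimal sketch of how a single shuffle could be carried out in R, assuming the `sex_discrimination` data frame from the [**openintro**](http://openintrostat.github.io/openintro) R package (columns `sex` and `decision`, as above); the seed, and therefore the particular sham table produced, is arbitrary.

```r
# One randomization (sham experiment): shuffle the sex labels so that any
# association with the promotion decision is due to chance alone.
# Assumes the sex_discrimination data (columns: sex, decision) from openintro.
library(openintro)
library(dplyr)

set.seed(25)

sex_discrimination |>
  mutate(shuffled_sex = sample(sex)) |>   # random reallocation of "male"/"female"
  count(shuffled_sex, decision)           # tabulate the simulated results
```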
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nWhat is the difference in promotion rates between the two simulated groups in Table \\@ref(tab:sex-discrimination-rand-1) ?\nHow does this compare to the observed difference 29.2% from the actual study?[^11-foundations-randomization-4]\n:::\n\n[^11-foundations-randomization-4]: $18/24 - 17/24=0.042$ or about 4.2% in favor of the male personnel.\n This difference due to chance is much smaller than the difference observed in the actual groups.\n\nFigure \\@ref(fig:sex-rand-shuffle-1-sort) shows that the difference in promotion rates is much larger in the original data than it is in the simulated groups (0.292 \\> 0.042).\nThe quantity of interest throughout this case study has been the difference in promotion rates.\nWe call the summary value the **statistic** of interest (or often the **test statistic**).\nWhen we encounter different data structures, the statistic is likely to change (e.g., we might calculate an average instead of a proportion), but we will always want to understand how the statistic varies from sample to sample.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![We summarize the randomized data to produce one estimate of the difference in proportions given no sex discrimination. Note that the sort step is only used to make it easier to visually calculate the simulated sample proportions.](images/sex-rand-03-shuffle-1-sort.png){fig-alt='The 48 red and white cards are show in three panels. The first panel represents the original data and original allocation of the male and female files (in the original data there are 3 white cards in the male group and 10 white cards in the female group). The second panel represents the shuffled red and white cards that are randomly assigned as male and female files. The third panel has the cards sorted according to the random assignment of female or male. In the third panel there are 6 white cards in the male group and 7 white cards in the female group.' width=100%}\n:::\n:::\n\n\n### Observed statistic vs. null statistics\n\nWe computed one possible difference under the null hypothesis in Guided Practice, which represents one difference due to chance when the null hypothesis is assumed to be true.\nWhile in this first simulation, we physically dealt out files, it is much more efficient to perform this simulation using a computer.\nRepeating the simulation on a computer, we get another difference due to chance under the same assumption: -0.042.\nAnd another: 0.208.\nAnd so on until we repeat the simulation enough times that we have a good idea of the shape of the *distribution of differences* under the null hypothesis.\nFigure \\@ref(fig:sex-rand-dot-plot) shows a plot of the differences found from 100 simulations, where each dot represents a simulated difference between the proportions of male and female files recommended for promotion.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:sex-rand-dot-plot-cap)](11-foundations-randomization_files/figure-html/sex-rand-dot-plot-1.png){width=100%}\n:::\n:::\n\n\n(ref:sex-rand-dot-plot-cap) A stacked dot plot of differences from 100 simulations produced under the null hypothesis, $H_0,$ where the simulated sex and decision are independent. 
Two of the 100 simulations had a difference of at least 29.2%, the difference observed in the study, and are shown as solid blue dots.\n\nNote that the distribution of these simulated differences in proportions is centered around 0.\nUnder the null hypothesis our simulations made no distinction between male and female personnel files.\nThus, a center of 0 makes sense: we should expect differences from chance alone to fall around zero with some random fluctuation for each simulation.\n\n::: {.workedexample data-latex=\"\"}\nHow often would you observe a difference of at least 29.2% (0.292) according to Figure \\@ref(fig:sex-rand-dot-plot)?\nOften, sometimes, rarely, or never?\n\n------------------------------------------------------------------------\n\nIt appears that a difference of at least 29.2% under the null hypothesis would only happen about 2% of the time according to Figure \\@ref(fig:sex-rand-dot-plot).\nSuch a low probability indicates that observing such a large difference from chance alone is rare.\n:::\n\nThe difference of 29.2% is a rare event if there really is no impact from listing sex in the candidates' files, which provides us with two possible interpretations of the study results:\n\n- If $H_0,$ the **Null hypothesis** is true: Sex has no effect on promotion decision, and we observed a difference that is so large that it would only happen rarely.\n\n- If $H_A,$ the **Alternative hypothesis** is true: Sex has an effect on promotion decision, and what we observed was actually due to equally qualified female candidates being discriminated against in promotion decisions, which explains the large difference of 29.2%.\n\nWhen we conduct formal studies, we reject a null position (the idea that the data are a result of chance only) if the data strongly conflict with that null position.[^11-foundations-randomization-5]\nIn our analysis, we determined that there was only a $\\approx$ 2% probability of obtaining a sample where $\\geq$ 29.2% more male candidates than female candidates get promoted under the null hypothesis, so we conclude that the data provide strong evidence of sex discrimination against female candidates by the male supervisors.\nIn this case, we reject the null hypothesis in favor of the alternative.\n\n[^11-foundations-randomization-5]: This reasoning does not generally extend to anecdotal observations.\n Each of us observes incredibly rare events every day, events we could not possibly hope to predict.\n However, in the non-rigorous setting of anecdotal evidence, almost anything may appear to be a rare event, so the idea of looking for rare events in day-to-day activities is treacherous.\n For example, we might look at the lottery: there was only a 1 in 176 million chance that the Mega Millions numbers for the largest jackpot in history (October 23, 2018) would be (5, 28, 62, 65, 70) with a Mega ball of (5), but nonetheless those numbers came up!\n However, no matter what numbers had turned up, they would have had the same incredibly rare odds.\n That is, *any set of numbers we could have observed would ultimately be incredibly rare*.\n This type of situation is typical of our daily lives: each possible event in itself seems incredibly rare, but if we consider every alternative, those outcomes are also incredibly rare.\n We should be cautious not to misinterpret such anecdotal evidence.\n\n**Statistical inference** is the practice of making decisions and conclusions from data in the context of uncertainty.\nErrors do occur, just like rare events, and the 
dataset at hand might lead us to the wrong conclusion.\nWhile a given dataset may not always lead us to a correct conclusion, statistical inference gives us tools to control and evaluate how often these errors occur.\nBefore getting into the nuances of hypothesis testing, let's work through another case study.\n\n\n\n\n\n## Opportunity cost case study {#caseStudyOpportunityCost}\n\nHow rational and consistent is the behavior of the typical American college student?\nIn this section, we'll explore whether college student consumers always consider the following: money not spent now can be spent later.\n\nIn particular, we are interested in whether reminding students about this well-known fact about money causes them to be a little thriftier.\nA skeptic might think that such a reminder would have no impact.\nWe can summarize the two different perspectives using the null and alternative hypothesis framework.\n\n- $H_0:$ **Null hypothesis**. Reminding students that they can save money for later purchases will not have any impact on students' spending decisions.\n- $H_A:$ **Alternative hypothesis**. Reminding students that they can save money for later purchases will reduce the chance they will continue with a purchase.\n\nIn this section, we'll explore an experiment conducted by researchers that investigates this very question for students at a university in the southwestern United States.\n[@Frederick:2009]\n\n### Observed data\n\nOne-hundred and fifty students were recruited for the study, and each was given the following statement:\n\n> *Imagine that you have been saving some extra money on the side to make some purchases, and on your most recent visit to the video store you come across a special sale on a new video. This video is one with your favorite actor or actress, and your favorite type of movie (such as a comedy, drama, thriller, etc.). This particular video that you are considering is one you have been thinking about buying for a long time. It is available for a special sale price of \\$14.99. What would you do in this situation? Please circle one of the options below.*[^11-foundations-randomization-6]\n\n[^11-foundations-randomization-6]: This context might feel strange if physical video stores predate you.\n If you're curious about what those were like, look up \"Blockbuster\".\n\nHalf of the 150 students were randomized into a control group and were given the following two options:\n\n> (A) Buy this entertaining video.\n\n> (B) Not buy this entertaining video.\n\nThe remaining 75 students were placed in the treatment group, and they saw a slightly modified option (B):\n\n> (A) Buy this entertaining video.\n\n> (B) Not buy this entertaining video. Keep the \\$14.99 for other purchases.\n\nWould the extra statement reminding students of an obvious fact impact the purchasing decision?\nTable \\@ref(tab:opportunity-cost-obs) summarizes the study results.\n\n::: {.data data-latex=\"\"}\nThe [`opportunity_cost`](http://openintrostat.github.io/openintro/reference/opportunity_cost.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Table: Summary results of the opportunity cost study.

| group     | buy video | not buy video | Total |
|-----------|-----------|---------------|-------|
| control   | 56        | 19            | 75    |
| treatment | 41        | 34            | 75    |
| Total     | 97        | 53            | 150   |
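A summary table like the one above can be produced with a few lines of R. The sketch below assumes the `opportunity_cost` data frame from the [**openintro**](http://openintrostat.github.io/openintro) R package (columns `group` and `decision`) and uses the same `pivot_wider()` and `adorn_totals()` helpers that appear in this chapter's other tabulations.

```r
# Tabulating the opportunity cost results, assuming the opportunity_cost
# data frame (columns: group, decision) from the openintro package.
library(openintro)
library(dplyr)
library(tidyr)
library(janitor)

opportunity_cost |>
  count(group, decision) |>
  pivot_wider(names_from = decision, values_from = n) |>
  adorn_totals(where = c("row", "col"))
```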
\n\n`````\n:::\n:::\n\n\nIt might be a little easier to review the results using a visualization.\nFigure \\@ref(fig:opportunity-cost-obs-bar) shows that a higher proportion of students in the treatment group chose not to buy the video compared to those in the control group.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Stacked bar plot of results of the opportunity cost study.](11-foundations-randomization_files/figure-html/opportunity-cost-obs-bar-1.png){width=100%}\n:::\n:::\n\n\nAnother useful way to review the results from Table \\@ref(tab:opportunity-cost-obs) is using row proportions, specifically considering the proportion of participants in each group who said they would buy or not buy the video.\nThese summaries are given in Table \\@ref(tab:opportunity-cost-obs-row-prop).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n\n
Table: The opportunity cost data are summarized using row proportions. Row proportions are particularly useful here since we can view the proportion of *buy* and *not buy* decisions in each group.

| group     | buy video | not buy video | Total |
|-----------|-----------|---------------|-------|
| control   | 0.747     | 0.253         | 1     |
| treatment | 0.547     | 0.453         | 1     |
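These row proportions, and the difference between them that the following paragraphs focus on, could be computed directly. A minimal sketch, again assuming the `opportunity_cost` data from the [**openintro**](http://openintrostat.github.io/openintro) R package:

```r
# Row proportions of "not buy video" by group and their observed difference
# (treatment minus control), assuming the opportunity_cost data from openintro.
library(openintro)
library(dplyr)

props <- opportunity_cost |>
  group_by(group) |>
  summarize(prop_not_buy = mean(decision == "not buy video"))

props
diff(props$prop_not_buy)   # roughly 0.453 - 0.253 = 0.200
```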
\n\n`````\n:::\n:::\n\n\nWe will define a **success**\\index{success} in this study as a student who chooses not to buy the video.[^11-foundations-randomization-7]\nThen, the value of interest is the change in video purchase rates that results by reminding students that not spending money now means they can spend the money later.\n\n[^11-foundations-randomization-7]: Success is often defined in a study as the outcome of interest, and a \"success\" may or may not actually be a positive outcome.\n For example, researchers working on a study on COVID prevalence might define a \"success\" in the statistical sense as a patient who has COVID-19.\n A more complete discussion of the term **success** will be given in Chapter \\@ref(inference-one-prop).\n\n\n\n\n\nWe can construct a point estimate for this difference as ($T$ for treatment and $C$ for control):\n\n$$\\hat{p}_{T} - \\hat{p}_{C} = \\frac{34}{75} - \\frac{19}{75} = 0.453 - 0.253 = 0.200$$\n\nThe proportion of students who chose not to buy the video was 20 percentage points higher in the treatment group than the control group.\nIs this 20% difference between the two groups so prominent that it is unlikely to have occurred from chance alone, if there is no difference between the spending habits of the two groups?\n\n### Variability of the statistic\n\nThe primary goal in this data analysis is to understand what sort of differences we might see if the null hypothesis were true, i.e., the treatment had no effect on students.\nBecause this is an experiment, we'll use the same procedure we applied in Section \\@ref(caseStudySexDiscrimination): randomization.\n\nLet's think about the data in the context of the hypotheses.\nIf the null hypothesis $(H_0)$ was true and the treatment had no impact on student decisions, then the observed difference between the two groups of 20% could be attributed entirely to random chance.\nIf, on the other hand, the alternative hypothesis $(H_A)$ is true, then the difference indicates that reminding students about saving for later purchases actually impacts their buying decisions.\n\n### Observed statistic vs. 
null statistics\n\nJust like with the sex discrimination study, we can perform a statistical analysis.\nUsing the same randomization technique from the last section, let's see what happens when we simulate the experiment under the scenario where there is no effect from the treatment.\n\nWhile we would in reality do this simulation on a computer, it might be useful to think about how we would go about carrying out the simulation without a computer.\nWe start with 150 index cards and label each card to indicate the distribution of our response variable: `decision`.\nThat is, 53 cards will be labeled \"not buy video\" to represent the 53 students who opted not to buy, and 97 will be labeled \"buy video\" for the other 97 students.\nThen we shuffle these cards thoroughly and divide them into two stacks of size 75, representing the simulated treatment and control groups.\nBecause we have shuffled the cards from both groups together, assuming no difference in their purchasing behavior, any observed difference between the proportions of \"not buy video\" cards (what we earlier defined as *success*) can be attributed entirely to chance.\n\n::: {.workedexample data-latex=\"\"}\nIf we are randomly assigning the cards into the simulated treatment and control groups, how many \"not buy video\" cards would we expect to end up in each simulated group?\nWhat would be the expected difference between the proportions of \"not buy video\" cards in each group?\n\n------------------------------------------------------------------------\n\nSince the simulated groups are of equal size, we would expect $53 / 2 = 26.5,$ i.e., 26 or 27, \"not buy video\" cards in each simulated group, yielding a simulated point estimate of the difference in proportions of 0%.\nHowever, due to random chance, we might sometimes observe a count slightly above or below 26 or 27.\n:::\n\nThe results of a single randomization are shown in Table \\@ref(tab:opportunity-cost-obs-simulated).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n
Table: Summary of student choices against their simulated groups. The group assignment had no connection to the student decisions, so any difference between the two groups is due to chance.

| group     | buy video | not buy video | Total |
|-----------|-----------|---------------|-------|
| control   | 46        | 29            | 75    |
| treatment | 51        | 24            | 75    |
| Total     | 97        | 53            | 150   |
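Repeating this shuffling many times, as described in the paragraphs that follow, builds up a null distribution of differences. One way this could be sketched in R, assuming the `opportunity_cost` data from the [**openintro**](http://openintrostat.github.io/openintro) R package; the helper function name and the seed are illustrative only.

```r
# Building a null distribution by repeating the shuffle many times,
# assuming the opportunity_cost data (columns: group, decision) from openintro.
library(openintro)
library(dplyr)

set.seed(47)

one_shuffled_diff <- function() {
  opportunity_cost |>
    mutate(shuffled_group = sample(group)) |>          # sham treatment assignment
    group_by(shuffled_group) |>
    summarize(prop_not_buy = mean(decision == "not buy video")) |>
    pull(prop_not_buy) |>
    diff()                                             # treatment minus control
}

null_diffs <- replicate(1000, one_shuffled_diff())

# Approximate p-value: proportion of shuffles at least as extreme as +0.20
mean(null_diffs >= 0.20)
```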
\n\n`````\n:::\n:::\n\n\nFrom this table, we can compute a difference that occurred from the first shuffle of the data (i.e., from chance alone):\n\n$$\\hat{p}_{T, shfl1} - \\hat{p}_{C, shfl1} = \\frac{24}{75} - \\frac{29}{75} = 0.32 - 0.387 = - 0.067$$\n\nJust one simulation will not be enough to get a sense of what sorts of differences would happen from chance alone.\n\n\n::: {.cell}\n\n:::\n\n\nWe'll simulate another set of simulated groups and compute the new difference: 0.04.\n\nAnd again: 0.12.\n\nAnd again: -0.013.\n\nWe'll do this 1,000 times.\n\nThe results are summarized in a dot plot in Figure \\@ref(fig:opportunity-cost-rand-dot-plot), where each point represents the difference from one randomization.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:opportunity-cost-rand-dot-plot-cap)](11-foundations-randomization_files/figure-html/opportunity-cost-rand-dot-plot-1.png){width=90%}\n:::\n:::\n\n\n(ref:opportunity-cost-rand-dot-plot-cap) A stacked dot plot of 1,000 simulated (null) differences produced under the null hypothesis, $H_0.$ Six of the 1,000 simulations had a difference of at least 20% , which was the difference observed in the study.\n\nSince there are so many points and it is difficult to discern one point from the other, it is more convenient to summarize the results in a histogram such as the one in Figure \\@ref(fig:opportunity-cost-rand-hist), where the height of each histogram bar represents the number of simulations resulting in an outcome of that magnitude.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A histogram of 1,000 chance differences produced under the null hypothesis. Histograms like this one are a convenient representation of data or results when there are a large number of simulations.](11-foundations-randomization_files/figure-html/opportunity-cost-rand-hist-1.png){width=90%}\n:::\n:::\n\n\nUnder the null hypothesis (no treatment effect), we would observe a difference of at least +20% about 0.6% of the time.\nThat is really rare!\nInstead, we will conclude the data provide strong evidence there is a treatment effect: reminding students before a purchase that they could instead spend the money later on something else lowers the chance that they will continue with the purchase.\nNotice that we are able to make a causal statement for this study since the study is an experiment, although we do not know why the reminder induces a lower purchase rate.\n\n## Hypothesis testing {#HypothesisTesting}\n\nIn the last two sections, we utilized a **hypothesis test**\\index{hypothesis test}, which is a formal technique for evaluating two competing possibilities.\nIn each scenario, we described a **null hypothesis**\\index{null hypothesis}, which represented either a skeptical perspective or a perspective of no difference.\nWe also laid out an **alternative hypothesis**\\index{alternative hypothesis}, which represented a new perspective such as the possibility of a relationship between two variables or a treatment effect in an experiment.\nThe alternative hypothesis is usually the reason the scientists set out to do the research in the first place.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Null and alternative hypotheses.**\n\nThe **null hypothesis** $(H_0)$ often represents either a skeptical perspective or a claim of \"no difference\" to be tested.\n\nThe **alternative hypothesis** $(H_A)$ represents an alternative claim under consideration and is often represented by a range of possible values for the value of interest.\n:::\n\nIf a person makes a 
somewhat unbelievable claim, we are initially skeptical.\nHowever, if there is sufficient evidence that supports the claim, we set aside our skepticism.\nThe hallmarks of hypothesis testing are also found in the US court system.\n\n### The US court system\n\nIn the US court system, jurors evaluate the evidence to see whether it convincingly shows a defendant is guilty.\nDefendants are considered to be innocent until proven otherwise.\n\n::: {.workedexample data-latex=\"\"}\nThe US court considers two possible claims about a defendant: they are either innocent or guilty.\n\nIf we set these claims up in a hypothesis framework, which would be the null hypothesis and which the alternative?\n\n------------------------------------------------------------------------\n\nThe jury considers whether the evidence is so convincing (strong) that there is no reasonable doubt regarding the person's guilt.\nThat is, the skeptical perspective (null hypothesis) is that the person is innocent until evidence is presented that convinces the jury that the person is guilty (alternative hypothesis).\n:::\n\nJurors examine the evidence to see whether it convincingly shows a defendant is guilty.\nNotice that if a jury finds a defendant *not guilty*, this does not necessarily mean the jury is confident in the person's innocence.\nThey are simply not convinced of the alternative, that the person is guilty.\nThis is also the case with hypothesis testing: *even if we fail to reject the null hypothesis, we do not accept the null hypothesis as truth*.\n\nFailing to find evidence in favor of the alternative hypothesis is not equivalent to finding evidence that the null hypothesis is true.\nWe will see this idea in greater detail in Section \\@ref(decerr).\n\n### p-value and statistical discernibility\n\nIn Section \\@ref(caseStudySexDiscrimination) we encountered a study from the 1970s that explored whether there was strong evidence that female candidates were less likely to be promoted than male candidates.\nThe research question -- are female candidates discriminated against in promotion decisions? -- was framed in the context of hypotheses:\n\n- $H_0:$ Sex has no effect on promotion decisions.\n\n- $H_A:$ Female candidates are discriminated against in promotion decisions.\n\nThe null hypothesis $(H_0)$ was a perspective of no difference in promotion.\nThe data on sex discrimination provided a point estimate of a 29.2% difference in recommended promotion rates between male and female candidates.\nWe determined that such a difference from chance alone, assuming the null hypothesis was true, would be rare: it would only happen about 2 in 100 times.\nWhen results like these are inconsistent with $H_0,$ we reject $H_0$ in favor of $H_A.$ Here, we concluded there was discrimination against female candidates.\n\nThe 2-in-100 chance is what we call a **p-value**, which is a probability quantifying the strength of the evidence against the null hypothesis, given the observed data.\n\n::: {.important data-latex=\"\"}\n**p-value.**\n\nThe **p-value**\\index{hypothesis testing!p-value} is the probability of observing data at least as favorable to the alternative hypothesis as our current dataset, if the null hypothesis were true.\nWe typically use a summary statistic of the data, such as a difference in proportions, to help compute the p-value and evaluate the hypotheses.\nThis summary value that is used to compute the p-value is often called the **test statistic**\\index{test statistic}.\n:::\n\n\n\n\n\n::: {.workedexample 
data-latex=\"\"}\nIn the sex discrimination study, the difference in promotion rates was our test statistic.\nWhat was the test statistic in the opportunity cost study covered in Section \\@ref(caseStudyOpportunityCost)?\n\n------------------------------------------------------------------------\n\nThe test statistic in the opportunity cost study was the difference in the proportion of students who decided against the video purchase in the treatment and control groups.\nIn each of these examples, the **point estimate** of the difference in proportions was used as the test statistic.\n:::\n\nWhen the p-value is small, i.e., less than a previously set threshold, we say the results are **statistically discernible**\\index{statistically significant}\\index{statistically discernible}.\nThis means the data provide such strong evidence against $H_0$ that we reject the null hypothesis in favor of the alternative hypothesis.[^11-foundations-randomization-8]\nThe threshold is called the **discernibility level**\\index{hypothesis testing!discernibility level}\\index{significance level}\\index{discernibility level} and is often represented by $\\alpha$ (the Greek letter *alpha*).[^11-foundations-randomization-9]\nThe value of $\\alpha$ represents how rare an event needs to be in order for the null hypothesis to be rejected.\nHistorically, many fields have set $\\alpha = 0.05,$ meaning that the results need to occur less than 5% of the time, if the null hypothesis is to be rejected.\nThe value of $\\alpha$ can vary depending on the field or the application.\n\n[^11-foundations-randomization-8]: Many texts use the phrase \"statistically significant\" instead of \"statistically discernible\".\n We have chosen to use \"discernible\" to indicate that a precise statistical event has happened, as opposed to a notable effect which may or may not fit the statistical definition of discernible or significant.\n\n[^11-foundations-randomization-9]: Here, too, we have chosen \"discernibility level\" instead of \"significance level\" which you will see in some texts.\n\n\n\n\n\nNote that you may have heard the phrase \"statistically significant\" as a way to describe \"statistically discernible.\" Although in everyday language \"significant\" would indicate that a difference is large or meaningful, that is not necessarily the case here.\nThe term \"statistically discernible\" indicates that the p-value from a study fell below the chosen discernibility level.\nFor example, in the sex discrimination study, the p-value was found to be approximately 0.02.\nUsing a discernibility level of $\\alpha = 0.05,$ we would say that the data provided statistically discernible evidence against the null hypothesis.\nHowever, this conclusion gives us no information regarding the size of the difference in promotion rates!\n\n::: {.important data-latex=\"\"}\n**Statistical discernibility.**\n\nWe say that the data provide **statistically discernible**\\index{hypothesis testing!statistically discernible.} evidence against the null hypothesis if the p-value is less than some predetermined threshold (e.g., 0.01, 0.05, 0.1).\n:::\n\n::: {.workedexample data-latex=\"\"}\nIn the opportunity cost study in Section \\@ref(caseStudyOpportunityCost), we analyzed an experiment where study participants had a 20% drop in likelihood of continuing with a video purchase if they were reminded that the money, if not spent on the video, could be used for other purchases in the future.\nWe determined that such a large difference would only 
occur 6-in-1,000 times if the reminder actually had no influence on student decision-making.\nWhat is the p-value in this study?\nWould you classify the result as \"statistically discernible\"?\n\n------------------------------------------------------------------------\n\nThe p-value was 0.006.\nSince the p-value is less than 0.05, the data provide statistically discernible evidence that US college students were actually influenced by the reminder.\n:::\n\n::: {.important data-latex=\"\"}\n**What's so special about 0.05?**\n\nWe often use a threshold of 0.05 to determine whether a result is statistically discernible.\nBut why 0.05?\nMaybe we should use a bigger number, or maybe a smaller number.\nIf you're a little puzzled, that probably means you're reading with a critical eye -- good job!\nWe've made a video to help clarify *why 0.05*:\n\n\n\nSometimes it's also a good idea to deviate from the standard.\nWe'll discuss when to choose a threshold different from 0.05 in Section \\@ref(decerr).\n:::\n\n\\clearpage\n\n## Chapter review {#chp11-review}\n\n### Summary\n\nFigure \\@ref(fig:fullrand) provides a visual summary of the randomization testing procedure.\n\n\\index{randomization test}\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![An example of one simulation of the full randomization procedure from a hypothetical dataset as visualized in the first panel. We repeat the steps hundreds or thousands of times.](images/fullrand.png){fig-alt='48 red and white cards are shown in three panels. The first panel represents original data and original allocation of Group 1 and Group 2 (in the original data there are 7 white cards in Group 1 and 10 white cards in Group 2). The second panel represents the shuffled red and white cards that are randomly assigned as Group 1 and Group 2. The third panel has the cards sorted according to the random assignment of Group 1 and Group 2. In the third panel there are 8 white cards in Group 1 and 9 white cards in Group 2.' width=100%}\n:::\n:::\n\n\nWe can summarize the randomization test procedure as follows:\n\n- **Frame the research question in terms of hypotheses.** Hypothesis tests are appropriate for research questions that can be summarized in two competing hypotheses. The null hypothesis $(H_0)$ usually represents a skeptical perspective or a perspective of no relationship between the variables. The alternative hypothesis $(H_A)$ usually represents a new view or the existence of a relationship between the variables.\n- **Collect data with an observational study or experiment.** If a research question can be formed into two hypotheses, we can collect data to run a hypothesis test. If the research question focuses on associations between variables but does not concern causation, we would use an observational study. If the research question seeks a causal connection between two or more variables, then an experiment should be used.\n- **Model the randomness that would occur if the null hypothesis was true.** In the examples above, the variability has been modeled as if the treatment (e.g., sexual identity, opportunity) allocation was independent of the outcome of the study. The computer-generated null distribution is the result of many different randomizations and quantifies the variability that would be expected if the null hypothesis was true.\n- **Analyze the data.** Choose an analysis technique appropriate for the data and identify the p-value. So far, we have only seen one analysis technique: randomization. 
Throughout the rest of this textbook, we'll encounter several new methods suitable for many other contexts.\n- **Form a conclusion.** Using the p-value from the analysis, determine whether the data provide evidence against the null hypothesis. Also, be sure to write the conclusion in plain language so casual readers can understand the results.\n\nTable \\@ref(tab:chp11-summary) is another look at the randomization test summary.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Table: Summary of randomization as an inferential statistical method.

| Question | Answer |
|----------|--------|
| What does it do? | Shuffles the explanatory variable to mimic the natural variability found in a randomized experiment |
| What is the random process described? | Randomized experiment |
| What other random processes can be approximated? | Can also be used to describe random sampling in an observational model |
| What is it best for? | Hypothesis testing (can also be used for confidence intervals, but not covered in this text). |
| What physical object represents the simulation process? | Shuffling cards |
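For readers who want to see the whole procedure in one place, here is one possible tidy implementation of the opportunity cost randomization test. It assumes the **infer** package, which is not introduced in this chapter, together with the `opportunity_cost` data from the [**openintro**](http://openintrostat.github.io/openintro) R package; the function names follow infer's documented interface.

```r
# One way to run the full randomization test, assuming the infer package.
library(openintro)
library(infer)

# Observed statistic: difference in "not buy video" proportions (treatment - control)
obs_diff <- opportunity_cost |>
  specify(decision ~ group, success = "not buy video") |>
  calculate(stat = "diff in props", order = c("treatment", "control"))

# Null distribution generated by permuting (shuffling) the group labels
null_dist <- opportunity_cost |>
  specify(decision ~ group, success = "not buy video") |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in props", order = c("treatment", "control"))

get_p_value(null_dist, obs_stat = obs_diff, direction = "greater")
```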
\n\n`````\n:::\n:::\n\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
|                        |                     |                            |
|------------------------|---------------------|----------------------------|
| alternative hypothesis | p-value             | statistic                  |
| confidence interval    | permutation test    | statistical inference      |
| discernibility level   | point estimate      | statistically discernible  |
| hypothesis test        | randomization test  | statistically significant  |
| independent            | significance level  | success                    |
| null hypothesis        | simulation          | test statistic             |
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp11-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-11].\n\n::: {.exercises data-latex=\"\"}\n1. **Identify the parameter, I**\nFor each of the following situations, state whether the parameter of interest is a mean or a proportion. It may be helpful to examine whether individual responses are numerical or categorical.\n\n a. In a survey, one hundred college students are asked how many hours per week they spend on the Internet.\n\n b. In a survey, one hundred college students are asked: \"What percentage of the time you spend on the Internet is part of your course work?\"\n\n c. In a survey, one hundred college students are asked whether they cited information from Wikipedia in their papers.\n\n d. In a survey, one hundred college students are asked what percentage of their total weekly spending is on alcoholic beverages.\n\n e. In a sample of one hundred recent college graduates, it is found that 85 percent expect to get a job within one year of their graduation date.\n\n1. **Identify the parameter, II.**\nFor each of the following situations, state whether the parameter of interest is a mean or a proportion.\n\n a. A poll shows that 64% of Americans personally worry a great deal about federal spending and the budget deficit.\n\n b. A survey reports that local TV news has shown a 17% increase in revenue within a two year period while newspaper revenues decreased by 6.4% during this time period.\n\n c. In a survey, high school and college students are asked whether they use geolocation services on their smart phones.\n\n d. In a survey, smart phone users are asked whether they use a web-based taxi service.\n\n e. In a survey, smart phone users are asked how many times they used a web-based taxi service over the last year.\n\n1. **Hypotheses.**\nFor each of the research statements below, note whether it represents a null hypothesis claim or an alternative hypothesis claim.\n\n a. The number of hours that grade-school children spend doing homework predicts their future success on standardized tests.\n \n b. King cheetahs on average run the same speed as standard spotted cheetahs.\n \n c. For a particular student, the probability of correctly answering a 5-option multiple choice test is larger than 0.2 (i.e., better than guessing).\n \n d. The mean length of African elephant tusks has changed over the last 100 years.\n \n e. The risk of facial clefts is equal for babies born to mothers who take folic acid supplements compared with those from mothers who do not.\n \n f. Caffeine intake during pregnancy affects mean birth weight.\n \n g. The probability of getting in a car accident is the same if using a cell phone than if not using a cell phone.\n \n \\clearpage\n\n1. **True null hypothesis.**\nUnbeknownst to you, let's say that the null hypothesis is actually true in the population. You plan to run a study anyway.\n\n a. If the level of discernibility you choose (i.e., the cutoff for your p-value) is 0.05, how likely is it that you will mistakenly reject the null hypothesis?\n \n b. If the level of discernibility you choose (i.e., the cutoff for your p-value) is 0.01, how likely is it that you will mistakenly reject the null hypothesis?\n \n c. If the level of discernibility you choose (i.e., the cutoff for your p-value) is 0.10, how likely is it that you will mistakenly reject the null hypothesis?\n\n1. 
**Identify hypotheses, I.**\nWrite the null and alternative hypotheses in words and then symbols for each of the following situations.\n\n a. New York is known as \"the city that never sleeps\". A random sample of 25 New Yorkers were asked how much sleep they get per night. Do these data provide convincing evidence that New Yorkers on average sleep less than 8 hours a night?\n\n b. Employers at a firm are worried about the effect of March Madness, a basketball championship held each spring in the US, on employee productivity. They estimate that on a regular business day employees spend on average 15 minutes of company time checking personal email, making personal phone calls, etc. They also collect data on how much company time employees spend on such non- business activities during March Madness. They want to determine if these data provide convincing evidence that employee productivity decreases during March Madness.\n\n1. **Identify hypotheses, II.**\nWrite the null and alternative hypotheses in words and using symbols for each of the following situations.\n\n a. Since 2008, chain restaurants in California have been required to display calorie counts of each menu item. Prior to menus displaying calorie counts, the average calorie intake of diners at a restaurant was 1100 calories. After calorie counts started to be displayed on menus, a nutritionist collected data on the number of calories consumed at this restaurant from a random sample of diners. Do these data provide convincing evidence of a difference in the average calorie intake of a diners at this restaurant?\n\n b. Based on the performance of those who took the GRE exam between July 1, 2004 and June 30, 2007, the average Verbal Reasoning score was calculated to be 462. In 2021 the average verbal score was slightly higher. Do these data provide convincing evidence that the average GRE Verbal Reasoning score has changed since 2021?\n \n \\clearpage\n\n1. **Side effects of Avandia.** \nRosiglitazone is the active ingredient in the controversial type 2 diabetes medicine Avandia and has been linked to an increased risk of serious cardiovascular problems such as stroke, heart failure, and death. A common alternative treatment is Pioglitazone, the active ingredient in a diabetes medicine called Actos. In a nationwide retrospective observational study of 227,571 Medicare beneficiaries aged 65 years or older, it was found that 2,593 of the 67,593 patients using Rosiglitazone and 5,386 of the 159,978 using Pioglitazone had serious cardiovascular problems. These data are summarized in the contingency table below.^[The [`avandia`](http://openintrostat.github.io/openintro/reference/avandia.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Graham:2010]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
| treatment     | No cardiovascular problems | Cardiovascular problems | Total   |
|---------------|----------------------------|-------------------------|---------|
| Pioglitazone  | 154,592                    | 5,386                   | 159,978 |
| Rosiglitazone | 65,000                     | 2,593                   | 67,593  |
| Total         | 219,592                    | 7,979                   | 227,571 |
\n \n `````\n :::\n :::\n\n a. Determine if each of the following statements is true or false. If false, explain why. *Be careful:* The reasoning may be wrong even if the statement's conclusion is correct. In such cases, the statement should be considered false.\n \n i. Since more patients on Pioglitazone had cardiovascular problems (5,386 vs. 2,593), we can conclude that the rate of cardiovascular problems for those on a Pioglitazone treatment is higher.\n \n ii. The data suggest that diabetic patients who are taking Rosiglitazone are more likely to have cardiovascular problems since the rate of incidence was (2,593 / 67,593 = 0.038) 3.8\\% for patients on this treatment, while it was only (5,386 / 159,978 = 0.034) 3.4\\% for patients on Pioglitazone. \n \n iii. The fact that the rate of incidence is higher for the Rosiglitazone group proves that Rosiglitazone causes serious cardiovascular problems. \n \n iv. Based on the information provided so far, we cannot tell if the difference between the rates of incidences is due to a relationship between the two variables or due to chance.\n \n b. What proportion of all patients had cardiovascular problems?\n\n c. If the type of treatment and having cardiovascular problems were independent, about how many patients in the Rosiglitazone group would we expect to have had cardiovascular problems?\n \n ::: {.content-hidden unless-format=\"pdf\"}\n *See next page for part d.*\n :::\n \n \\clearpage\n\n d. We can investigate the relationship between outcome and treatment in this study using a randomization technique. While in reality we would carry out the simulations required for randomization using statistical software, suppose we actually simulate using index cards. In order to simulate from the independence model, which states that the outcomes were independent of the treatment, we write whether each patient had a cardiovascular problem on cards, shuffled all the cards together, then deal them into two groups of size 67,593 and 159,978. We repeat this simulation 100 times and each time record the difference between the proportions of cards that say \"Yes\" in the Rosiglitazone and Pioglitazone groups. Use the histogram of these differences in proportions to answer the following questions.\n \n i. What are the claims being tested? \n \n ii. Compared to the number calculated in part (b), which would provide more support for the alternative hypothesis, *higher* or *lower* proportion of patients with cardiovascular problems in the Rosiglitazone group? \n \n iii. What do the simulation results suggest about the relationship between taking Rosiglitazone and having cardiovascular problems in diabetic patients?\n \n ::: {.cell}\n ::: {.cell-output-display}\n ![](11-foundations-randomization_files/figure-html/unnamed-chunk-35-1.png){width=90%}\n :::\n :::\n \n \\clearpage\n\n1. **Heart transplants.** \nThe Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was designated an official heart transplant candidate, meaning that they were gravely ill and would most likely benefit from a new heart. Some patients got a transplant and some did not. The variable `transplant` indicates which group the patients were in; patients in the treatment group got a transplant and those in the control group did not. Of the 34 patients in the control group, 30 died. Of the 69 people in the treatment group, 45 died. 
Another variable called `survived` was used to indicate whether the patient was alive at the end of the study.^[The [`heart_transplant`](http://openintrostat.github.io/openintro/reference/heart_transplant.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Turnbull+Brown+Hu:1974]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](11-foundations-randomization_files/figure-html/unnamed-chunk-36-1.png){width=90%}\n :::\n :::\n\n a. Does the stacked bar plot indicate that survival is independent of whether the patient got a transplant? Explain your reasoning.\n\n b. What do the box plots above suggest about the efficacy (effectiveness) of the heart transplant treatment.\n\n c. What proportion of patients in the treatment group and what proportion of patients in the control group died?\n \n ::: {.content-hidden unless-format=\"pdf\"}\n *See next page for part d.*\n :::\n \n \\clearpage\n\n d. One approach for investigating whether the treatment is effective is to use a randomization technique.\n \n i. What are the claims being tested?\n \n ii. The paragraph below describes the set up for such approach, if we were to do it without using statistical software. Fill in the blanks with a number or phrase, whichever is appropriate.\n \n > We write *alive* on $\\rule{2cm}{0.5pt}$ cards representing patients who were alive at the end of the study, and *deceased* on $\\rule{2cm}{0.5pt}$ cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size $\\rule{2cm}{0.5pt}$ representing treatment, and another group of size $\\rule{2cm}{0.5pt}$ representing control. We calculate the difference between the proportion of \\textit{deceased} cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at $\\rule{2cm}{0.5pt}$. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are $\\rule{2cm}{0.5pt}$. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.\n\n iii. 
What do the simulation results shown below suggest about the effectiveness of the transplant program?\n \n ::: {.cell}\n ::: {.cell-output-display}\n ![](11-foundations-randomization_files/figure-html/unnamed-chunk-37-1.png){width=90%}\n :::\n :::\n\n\n:::\n", + "engine": "knitr", + "markdown": "\n\n\n# Hypothesis testing with randomization {#sec-foundations-randomization}\n\n::: {.chapterintro data-latex=\"\"}\nStatistical inference is primarily concerned with understanding and quantifying the uncertainty of parameter estimates.\nWhile the equations and details change depending on the setting, the foundations for inference are the same throughout all of statistics.\n\nWe start with two case studies designed to motivate the process of making decisions about research claims.\nWe formalize the process through the introduction of the **hypothesis testing framework**\\index{hypothesis test}, which allows us to formally evaluate claims about the population.\n:::\n\n\n\n\n\nThroughout the book so far, you have worked with data in a variety of contexts.\nYou have learned how to summarize and visualize the data as well as how to model multiple variables at the same time.\nSometimes the dataset at hand represents the entire research question.\nBut more often than not, the data have been collected to answer a research question about a larger group of which the data are a (hopefully) representative subset.\n\nYou may agree that there is almost always variability in data -- one dataset will not be identical to a second dataset even if they are both collected from the same population using the same methods.\nHowever, quantifying the variability in the data is neither obvious nor easy to do, i.e., answering the question \"*how* different is one dataset from another?\" is not trivial.\n\nFirst, a note on notation.\nWe generally use $p$ to denote a population proportion and $\\hat{p}$ to a sample proportion.\nSimilarly, we generally use $\\mu$ to denote a population mean and $\\bar{x}$ to denote a sample mean.\n\n::: {.workedexample data-latex=\"\"}\nSuppose your professor splits the students in your class into two groups: students who sit on the left side of the classroom and students who sit on the right side of the classroom.\nIf $\\hat{p}_{L}$ represents the proportion of students who prefer to read books on screen who sit on the left side of the classroom and $\\hat{p}_{R}$ represents the proportion of students who prefer to read books on screen who sit on the right side of the classroom, would you be surprised if $\\hat{p}_{L}$ did not *exactly* equal $\\hat{p}_{R}$?\n\n------------------------------------------------------------------------\n\nWhile the proportions $\\hat{p}_{L}$ and $\\hat{p}_{R}$ would probably be close to each other, it would be unusual for them to be exactly the same.\nWe would probably observe a small difference due to *chance*.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nIf we do not think the side of the room a person sits on in class is related to whether they prefer to read books on screen, what assumption are we making about the relationship between these two variables?[^1]\n:::\n\n[^1]: We would be assuming that these two variables are **independent**\\index{independent}.\n\n\n\n\n\nStudying randomness of this form is a key focus of statistics.\nThroughout this chapter, and those that follow, we provide three different approaches for quantifying the variability inherent in data: randomization, bootstrapping, and mathematical models.\nUsing the methods provided in this 
chapter, we will be able to draw conclusions beyond the dataset at hand to research questions about larger populations that the samples come from.\n\nThe first type of variability we will explore comes from experiments where the explanatory variable (or treatment) is randomly assigned to the observational units.\nAs you learned in [Chapter -@sec-data-hello], a randomized experiment can be used to assess whether one variable (the explanatory variable) causes changes in a second variable (the response variable).\nEvery dataset has some variability in it, so to decide whether the variability in the data is due to (1) the causal mechanism (the randomized explanatory variable in the experiment) or instead (2) natural variability inherent to the data, we set up a sham randomized experiment as a comparison.\nThat is, we assume that each observational unit would have gotten the exact same response value regardless of the treatment level.\nBy reassigning the treatments many many times, we can compare the actual experiment to the sham experiment.\nIf the actual experiment has more extreme results than any of the sham experiments, we are led to believe that it is the explanatory variable which is causing the result and not just variability inherent to the data.\nUsing a few different case studies, let's look more carefully at this idea of a **randomization test**\\index{randomization test}.\n\n\n\n\n\n## Sex discrimination case study {#sec-caseStudySexDiscrimination}\n\nWe consider a study investigating sex discrimination in the 1970s, which is set in the context of personnel decisions within a bank.\nThe research question we hope to answer is, \"Are individuals who identify as female discriminated against in promotion decisions made by their managers who identify as male?\" [@Rosen:1974]\n\n::: {.data data-latex=\"\"}\nThe [`sex_discrimination`](http://openintrostat.github.io/openintro/reference/sex_discrimination.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nThis study considered sex roles, and only allowed for options of \"male\" and \"female\".\nWe should note that the identities being considered are not gender identities and that the study allowed only for a binary classification of sex.\n\n### Observed data\n\nThe participants in this study were 48 bank supervisors who identified as male, attending a management institute at the University of North Carolina in 1972.\nThey were asked to assume the role of the personnel director of a bank and were given a personnel file to judge whether the person should be promoted to a branch manager position.\nThe files given to the participants were identical, except that half of them indicated the candidate identified as male and the other half indicated the candidate identified as female.\nThese files were randomly assigned to the bank managers.\n\n::: {.guidedpractice data-latex=\"\"}\nIs this an observational study or an experiment?\nHow does the type of study impact what can be inferred from the results?[^2]\n:::\n\n[^2]: The study is an experiment, as subjects were randomly assigned a \"male\" file or a \"female\" file (remember, all the files were actually identical in content).\n Since this is an experiment, the results can be used to evaluate a causal relationship between the sex of a candidate and the promotion decision.\n\n\n::: {.cell}\n\n:::\n\n\nFor each supervisor both the sex associated with the assigned file and the promotion decision were recorded.\nUsing the results of the study 
summarized in @tbl-sex-discrimination-obs, we would like to evaluate if individuals who identify as female are unfairly discriminated against in promotion decisions.\nIn this study, a smaller proportion of female identifying applications were promoted than males (0.583 versus 0.875), but it is unclear whether the difference provides *convincing evidence* that individuals who identify as female are unfairly discriminated against.\n\n\n::: {#tbl-sex-discrimination-obs .cell tbl-cap='Summary results for the sex discrimination study.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| sex    | promoted | not promoted | Total |
|--------|----------|--------------|-------|
| male   | 21       | 3            | 24    |
| female | 14       | 10           | 24    |
| Total  | 35       | 13           | 48    |
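The observed promotion rates, and the point estimate of their difference, can be computed directly from the data. A minimal sketch, assuming the `sex_discrimination` data frame from the [**openintro**](http://openintrostat.github.io/openintro) R package:

```r
# Observed promotion rates by sex and the point estimate of their difference,
# assuming the sex_discrimination data (columns: sex, decision) from openintro.
library(openintro)
library(dplyr)

sex_discrimination |>
  group_by(sex) |>
  summarize(prop_promoted = mean(decision == "promoted"))
# male: 21/24 = 0.875, female: 14/24 = 0.583, a difference of about 0.292
```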
\n\n`````\n:::\n:::\n\n\nThe data are visualized in @fig-sex-rand-obs as a set of cards.\nNote that each card denotes a personnel file (an observation from our dataset) and the colors indicate the decision: red for promoted and white for not promoted.\nAdditionally, the observations are broken up into groups of male and female identifying groups.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The sex discrimination study can be thought of as 48 red and white cards.](images/sex-rand-01-obs.png){#fig-sex-rand-obs fig-alt='48 cards are laid out; 24 indicating male files, 24 indicated female files.\nOf the 24 male files 3 of the cards are colored white, and 21 of the cards\nare colored red. Of the female files, 10 of the cards are colored white,\nand 14 of the cards are colored red.\n' width=40%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nStatisticians are sometimes called upon to evaluate the strength of evidence.\nWhen looking at the rates of promotion in this study, why might we be tempted to immediately conclude that individuals identifying as female are being discriminated against?\n\n------------------------------------------------------------------------\n\nThe large difference in promotion rates (58.3% for female personnel versus 87.5% for male personnel) suggest there might be discrimination against women in promotion decisions.\nHowever, we cannot yet be sure if the observed difference represents discrimination or is just due to random chance when there is no discrimination occurring.\nSince we wouldn't expect the sample proportions to be *exactly* equal, even if the truth was that the promotion decisions were independent of sex, we can't rule out random chance as a possible explanation when simply comparing the sample proportions.\n:::\n\nThe previous example is a reminder that the observed outcomes in the sample may not perfectly reflect the true relationships between variables in the underlying population.\n@tbl-sex-discrimination-obs shows there were 7 fewer promotions for female identifying personnel than for the male personnel, a difference in promotion rates of 29.2% $\\left( \\frac{21}{24} - \\frac{14}{24} = 0.292 \\right).$ This observed difference is what we call a **point estimate**\\index{point estimate} of the true difference.\nThe point estimate of the difference in promotion rate is large, but the sample size for the study is small, making it unclear if this observed difference represents discrimination or whether it is simply due to chance when there is no discrimination occurring.\nChance can be thought of as the claim due to natural variability; discrimination can be thought of as the claim the researchers set out to demonstrate.\nWe label these two competing claims, $H_0$ and $H_A:$\n\n\n\n\n\n\\vspace{-2mm}\n\n- $H_0:$ **Null hypothesis**\\index{null hypothesis}. The variables `sex` and `decision` are independent. They have no relationship, and the observed difference between the proportion of males and females who were promoted, 29.2%, was due to the natural variability inherent in the population.\n- $H_A:$ **Alternative hypothesis**\\index{alternative hypothesis}. The variables `sex` and `decision` are *not* independent. 
The difference in promotion rates of 29.2% was not due to natural variability, and equally qualified female personnel are less likely to be promoted than male personnel.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Hypothesis testing.**\n\nThese hypotheses are part of what is called a **hypothesis test**\\index{hypothesis test}.\nA hypothesis test is a statistical technique used to evaluate competing claims using data.\nOften times, the null hypothesis takes a stance of *no difference* or *no effect*.\nThis hypothesis assumes that any differences seen are due to the variability inherent in the population and could have occurred by random chance.\n\nIf the null hypothesis and the data notably disagree, then we will reject the null hypothesis in favor of the alternative hypothesis.\n\nThere are many nuances to hypothesis testing, so do not worry if you aren't a master of hypothesis testing at the end of this section.\nWe'll discuss these ideas and details many times in this chapter as well as in the chapters that follow.\n:::\n\n\n\n\n\nWhat would it mean if the null hypothesis, which says the variables `sex` and `decision` are unrelated, was true?\nIt would mean each banker would decide whether to promote the candidate without regard to the sex indicated on the personnel file.\nThat is, the difference in the promotion percentages would be due to the natural variability in how the files were randomly allocated to different bankers, and this randomization just happened to give rise to a relatively large difference of 29.2%.\n\nConsider the alternative hypothesis: bankers were influenced by which sex was listed on the personnel file.\nIf this was true, and especially if this influence was substantial, we would expect to see some difference in the promotion rates of male and female candidates.\nIf this sex bias was against female candidates, we would expect a smaller fraction of promotion recommendations for female personnel relative to the male personnel.\n\nWe will choose between the two competing claims by assessing if the data conflict so much with $H_0$ that the null hypothesis cannot be deemed reasonable.\nIf data and the null claim seem to be at odds with one another, and the data seem to support $H_A,$ then we will reject the notion of independence and conclude that the data provide evidence of discrimination.\n\n\\vspace{-2mm}\n\n### Variability of the statistic\n\n@tbl-sex-discrimination-obs shows that 35 bank supervisors recommended promotion and 13 did not.\nNow, suppose the bankers' decisions were independent of the sex of the candidate.\nThen, if we conducted the experiment again with a different random assignment of sex to the files, differences in promotion rates would be based only on random fluctuation in promotion decisions.\nWe can actually perform this **randomization**, which simulates what would have happened if the bankers' decisions had been independent of `sex` but we had distributed the file sexes differently.[^3]\n\n[^3]: The test procedure we employ in this section is sometimes referred to as a **randomization test**.\n If the explanatory variable had not been randomly assigned, as in an observational study, the procedure would be referred to as a **permutation test**.\n Permutation tests are used for observational studies, where the explanatory variable was not randomly assigned.\\index{permutation test}.\n\n\n\n\n\nIn the **simulation**\\index{simulation}, we thoroughly shuffle the 48 personnel files, 35 labelled `promoted` and 13 labelled `not promoted`, 
together and we deal files into two new stacks.\nNote that by keeping 35 promoted and 13 not promoted, we are assuming that 35 of the bank managers would have promoted the individual whose content is contained in the file **independent** of the sex indicated on their file.\nWe will deal 24 files into the first stack, which will represent the 24 \"female\" files.\nThe second stack will also have 24 files, and it will represent the 24 \"male\" files.\n@fig-sex-rand-shuffle-1 highlights both the shuffle and the reallocation to the sham sex groups.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The sex discrimination data are shuffled and reallocated to new groups of\nmale and female files.\n](images/sex-rand-02-shuffle-1.png){#fig-sex-rand-shuffle-1 fig-alt='The 48 red and white cards which denote the original data are shuffled and\nreassigned, 24 to each group indicating 24 male files and 24 female files.\n' width=80%}\n:::\n:::\n\n\nThen, as we did with the original data, we tabulate the results and determine the fraction of personnel files designated as \"male\" and \"female\" who were promoted.\n\n\n\n\n\nSince the randomization of files in this simulation is independent of the promotion decisions, any difference in promotion rates is due to chance.\n@tbl-sex-discrimination-rand-1 shows the results of one such simulation.\n\n\n::: {#tbl-sex-discrimination-rand-1 .cell tbl-cap='Simulation results, where the difference in promotion rates between male\nand female is purely due to random chance.'}\n::: {.cell-output-display}\n`````{=html}
| sex    | promoted | not promoted | Total |
|--------|----------|--------------|-------|
| male   | 18       | 6            | 24    |
| female | 17       | 7            | 24    |
| Total  | 35       | 13           | 48    |
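A table like the simulated one above can be produced with a short script; a minimal sketch in base R, assuming only the fixed margins of the study (35 promoted, 13 not promoted, and 24 files per sham sex group). Because the shuffle is random, a given run will typically show slightly different counts than this particular table.

```r
# One sham randomization: keep the 48 promotion decisions fixed and shuffle
# which 24 files are labelled "male" and which 24 are labelled "female".
set.seed(25)  # arbitrary seed so the sketch is reproducible
decision <- c(rep("promoted", 35), rep("not promoted", 13))
sham_sex <- sample(c(rep("male", 24), rep("female", 24)))

# Cross-tabulate the sham assignment against the fixed decisions
table(sham_sex, decision)

# Difference in promotion rates under the sham assignment (male - female)
mean(decision[sham_sex == "male"] == "promoted") -
  mean(decision[sham_sex == "female"] == "promoted")
```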
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nWhat is the difference in promotion rates between the two simulated groups in @tbl-sex-discrimination-rand-1 ?\nHow does this compare to the observed difference 29.2% from the actual study?[^4]\n:::\n\n[^4]: $18/24 - 17/24=0.042$ or about 4.2% in favor of the male personnel.\n This difference due to chance is much smaller than the difference observed in the actual groups.\n\n@fig-sex-rand-shuffle-1-sort shows that the difference in promotion rates is much larger in the original data than it is in the simulated groups (0.292 \\> 0.042).\nThe quantity of interest throughout this case study has been the difference in promotion rates.\nWe call the summary value the **statistic** of interest (or often the **test statistic**).\nWhen we encounter different data structures, the statistic is likely to change (e.g., we might calculate an average instead of a proportion), but we will always want to understand how the statistic varies from sample to sample.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![We summarize the randomized data to produce one estimate of the difference\nin proportions given no sex discrimination. Note that the sort step is only used\nto make it easier to visually calculate the simulated sample proportions.\n](images/sex-rand-03-shuffle-1-sort.png){#fig-sex-rand-shuffle-1-sort fig-alt='The 48 red and white cards are show in three panels. The first panel represents\nthe original data and original allocation of the male and female files (in the original\ndata there are 3 white cards in the male group and 10 white cards in the female\ngroup). The second panel represents the shuffled red and white cards that are randomly\nassigned as male and female files. The third panel has the cards sorted according\nto the random assignment of female or male. In the third panel there are 6 white\ncards in the male group and 7 white cards in the female group.' width=100%}\n:::\n:::\n\n\n### Observed statistic vs. null statistics\n\nWe computed one possible difference under the null hypothesis in Guided Practice, which represents one difference due to chance when the null hypothesis is assumed to be true.\nWhile in this first simulation, we physically dealt out files, it is much more efficient to perform this simulation using a computer.\nRepeating the simulation on a computer, we get another difference due to chance under the same assumption: -0.042.\nAnd another: 0.208.\nAnd so on until we repeat the simulation enough times that we have a good idea of the shape of the *distribution of differences* under the null hypothesis.\n@fig-sex-rand-dot-plot shows a plot of the differences found from 100 simulations, where each dot represents a simulated difference between the proportions of male and female files recommended for promotion.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A stacked dot plot of differences from 100 simulations produced under\nthe null hypothesis, $H_0,$ where the simulated sex and decision are\nindependent. 
Two of the 100 simulations had a difference of at least\n29.2%, the difference observed in the study, and are shown as solid\nblue dots.\n](11-foundations-randomization_files/figure-html/fig-sex-rand-dot-plot-1.png){#fig-sex-rand-dot-plot width=100%}\n:::\n:::\n\n\nNote that the distribution of these simulated differences in proportions is centered around 0.\nUnder the null hypothesis our simulations made no distinction between male and female personnel files.\nThus, a center of 0 makes sense: we should expect differences from chance alone to fall around zero with some random fluctuation for each simulation.\n\n::: {.workedexample data-latex=\"\"}\nHow often would you observe a difference of at least 29.2% (0.292) according to @fig-sex-rand-dot-plot?\nOften, sometimes, rarely, or never?\n\n------------------------------------------------------------------------\n\nIt appears that a difference of at least 29.2% under the null hypothesis would only happen about 2% of the time according to @fig-sex-rand-dot-plot.\nSuch a low probability indicates that observing such a large difference from chance alone is rare.\n:::\n\nThe difference of 29.2% is a rare event if there really is no impact from listing sex in the candidates' files, which provides us with two possible interpretations of the study results:\n\n- If $H_0,$ the **Null hypothesis** is true: Sex has no effect on promotion decision, and we observed a difference that is so large that it would only happen rarely.\n\n- If $H_A,$ the **Alternative hypothesis** is true: Sex has an effect on promotion decision, and what we observed was actually due to equally qualified female candidates being discriminated against in promotion decisions, which explains the large difference of 29.2%.\n\nWhen we conduct formal studies, we reject a null position (the idea that the data are a result of chance only) if the data strongly conflict with that null position.[^5]\nIn our analysis, we determined that there was only a $\\approx$ 2% probability of obtaining a sample where $\\geq$ 29.2% more male candidates than female candidates get promoted under the null hypothesis, so we conclude that the data provide strong evidence of sex discrimination against female candidates by the male supervisors.\nIn this case, we reject the null hypothesis in favor of the alternative.\n\n[^5]: This reasoning does not generally extend to anecdotal observations.\n Each of us observes incredibly rare events every day, events we could not possibly hope to predict.\n However, in the non-rigorous setting of anecdotal evidence, almost anything may appear to be a rare event, so the idea of looking for rare events in day-to-day activities is treacherous.\n For example, we might look at the lottery: there was only a 1 in 176 million chance that the Mega Millions numbers for the largest jackpot in history (October 23, 2018) would be (5, 28, 62, 65, 70) with a Mega ball of (5), but nonetheless those numbers came up!\n However, no matter what numbers had turned up, they would have had the same incredibly rare odds.\n That is, *any set of numbers we could have observed would ultimately be incredibly rare*.\n This type of situation is typical of our daily lives: each possible event in itself seems incredibly rare, but if we consider every alternative, those outcomes are also incredibly rare.\n We should be cautious not to misinterpret such anecdotal evidence.\n\n**Statistical inference** is the practice of making decisions and conclusions from data in the context of uncertainty.\nErrors do 
occur, just like rare events, and the dataset at hand might lead us to the wrong conclusion.\nWhile a given dataset may not always lead us to a correct conclusion, statistical inference gives us tools to control and evaluate how often these errors occur.\nBefore getting into the nuances of hypothesis testing, let's work through another case study.\n\n\n\n\n\n## Opportunity cost case study {#sec-caseStudyOpportunityCost}\n\nHow rational and consistent is the behavior of the typical American college student?\nIn this section, we'll explore whether college student consumers always consider the following: money not spent now can be spent later.\n\nIn particular, we are interested in whether reminding students about this well-known fact about money causes them to be a little thriftier.\nA skeptic might think that such a reminder would have no impact.\nWe can summarize the two different perspectives using the null and alternative hypothesis framework.\n\n- $H_0:$ **Null hypothesis**. Reminding students that they can save money for later purchases will not have any impact on students' spending decisions.\n- $H_A:$ **Alternative hypothesis**. Reminding students that they can save money for later purchases will reduce the chance they will continue with a purchase.\n\nIn this section, we'll explore an experiment conducted by researchers that investigates this very question for students at a university in the southwestern United States.\n[@Frederick:2009]\n\n### Observed data\n\nOne-hundred and fifty students were recruited for the study, and each was given the following statement:\n\n> *Imagine that you have been saving some extra money on the side to make some purchases, and on your most recent visit to the video store you come across a special sale on a new video. This video is one with your favorite actor or actress, and your favorite type of movie (such as a comedy, drama, thriller, etc.). This particular video that you are considering is one you have been thinking about buying for a long time. It is available for a special sale price of \\$14.99. What would you do in this situation? Please circle one of the options below.*[^6]\n\n[^6]: This context might feel strange if physical video stores predate you.\n If you're curious about what those were like, look up \"Blockbuster\".\n\nHalf of the 150 students were randomized into a control group and were given the following two options:\n\n> (A) Buy this entertaining video.\n\n> (B) Not buy this entertaining video.\n\nThe remaining 75 students were placed in the treatment group, and they saw a slightly modified option (B):\n\n> (A) Buy this entertaining video.\n\n> (B) Not buy this entertaining video. Keep the \\$14.99 for other purchases.\n\nWould the extra statement reminding students of an obvious fact impact the purchasing decision?\n@tbl-opportunity-cost-obs summarizes the study results.\n\n::: {.data data-latex=\"\"}\nThe [`opportunity_cost`](http://openintrostat.github.io/openintro/reference/opportunity_cost.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n\n::: {#tbl-opportunity-cost-obs .cell tbl-cap='Summary results of the opportunity cost study.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| group     | buy video | not buy video | Total |
|-----------|-----------|---------------|-------|
| control   | 56        | 19            | 75    |
| treatment | 41        | 34            | 75    |
| Total     | 97        | 53            | 150   |
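The counts above can be tabulated directly from the data; a minimal sketch using the tidyverse-style tools used elsewhere in this book, assuming the `opportunity_cost` data frame mentioned above has variables named `group` and `decision` with the levels shown in the table:

```r
library(openintro)  # provides the opportunity_cost data
library(dplyr)
library(tidyr)

# Count the decisions within each experimental group and spread the
# decision levels into columns, mirroring the summary table above.
opportunity_cost |>
  count(group, decision) |>
  pivot_wider(names_from = decision, values_from = n)
```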
\n\n`````\n:::\n:::\n\n\nIt might be a little easier to review the results using a visualization.\n@fig-opportunity-cost-obs-bar shows that a higher proportion of students in the treatment group chose not to buy the video compared to those in the control group.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Stacked bar plot of results of the opportunity cost study.](11-foundations-randomization_files/figure-html/fig-opportunity-cost-obs-bar-1.png){#fig-opportunity-cost-obs-bar width=100%}\n:::\n:::\n\n\nAnother useful way to review the results from @tbl-opportunity-cost-obs is using row proportions, specifically considering the proportion of participants in each group who said they would buy or not buy the video.\nThese summaries are given in @tbl-opportunity-cost-obs-row-prop.\n\n\n::: {#tbl-opportunity-cost-obs-row-prop .cell tbl-cap='The opportunity cost data are summarized using row proportions. Row\nproportions are particularly useful here since we can view the proportion\nof *buy* and *not buy* decisions in each group.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n\n
| group     | buy video | not buy video | Total |
|-----------|-----------|---------------|-------|
| control   | 0.747     | 0.253         | 1     |
| treatment | 0.547     | 0.453         | 1     |
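These row proportions lead directly to the comparison made just below; as a quick arithmetic check, the difference in the "not buy video" proportions between the two groups can be computed from the counts:

```r
# "Not buy video" rates from the summary tables above
p_not_buy_treatment <- 34 / 75   # 0.453
p_not_buy_control   <- 19 / 75   # 0.253
p_not_buy_treatment - p_not_buy_control   # 0.2, i.e., 20 percentage points
```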
\n\n`````\n:::\n:::\n\n\nWe will define a **success**\\index{success} in this study as a student who chooses not to buy the video.[^7]\nThen, the value of interest is the change in video purchase rates that results by reminding students that not spending money now means they can spend the money later.\n\n[^7]: Success is often defined in a study as the outcome of interest, and a \"success\" may or may not actually be a positive outcome.\n For example, researchers working on a study on COVID prevalence might define a \"success\" in the statistical sense as a patient who has COVID-19.\n A more complete discussion of the term **success** will be given in [Chapter -@sec-inference-one-prop].\n\n\n\n\n\nWe can construct a point estimate for this difference as ($T$ for treatment and $C$ for control):\n\n$$\\hat{p}_{T} - \\hat{p}_{C} = \\frac{34}{75} - \\frac{19}{75} = 0.453 - 0.253 = 0.200$$\n\nThe proportion of students who chose not to buy the video was 20 percentage points higher in the treatment group than the control group.\nIs this 20% difference between the two groups so prominent that it is unlikely to have occurred from chance alone, if there is no difference between the spending habits of the two groups?\n\n### Variability of the statistic\n\nThe primary goal in this data analysis is to understand what sort of differences we might see if the null hypothesis were true, i.e., the treatment had no effect on students.\nBecause this is an experiment, we'll use the same procedure we applied in @sec-caseStudySexDiscrimination: randomization.\n\nLet's think about the data in the context of the hypotheses.\nIf the null hypothesis $(H_0)$ was true and the treatment had no impact on student decisions, then the observed difference between the two groups of 20% could be attributed entirely to random chance.\nIf, on the other hand, the alternative hypothesis $(H_A)$ is true, then the difference indicates that reminding students about saving for later purchases actually impacts their buying decisions.\n\n### Observed statistic vs. 
null statistics\n\nJust like with the sex discrimination study, we can perform a statistical analysis.\nUsing the same randomization technique from the last section, let's see what happens when we simulate the experiment under the scenario where there is no effect from the treatment.\n\nWhile we would in reality do this simulation on a computer, it might be useful to think about how we would go about carrying out the simulation without a computer.\nWe start with 150 index cards and label each card to indicate the distribution of our response variable: `decision`.\nThat is, 53 cards will be labeled \"not buy video\" to represent the 53 students who opted not to buy, and 97 will be labeled \"buy video\" for the other 97 students.\nThen we shuffle these cards thoroughly and divide them into two stacks of size 75, representing the simulated treatment and control groups.\nBecause we have shuffled the cards from both groups together, assuming no difference in their purchasing behavior, any observed difference between the proportions of \"not buy video\" cards (what we earlier defined as *success*) can be attributed entirely to chance.\n\n::: {.workedexample data-latex=\"\"}\nIf we are randomly assigning the cards into the simulated treatment and control groups, how many \"not buy video\" cards would we expect to end up in each simulated group?\nWhat would be the expected difference between the proportions of \"not buy video\" cards in each group?\n\n------------------------------------------------------------------------\n\nSince the simulated groups are of equal size, we would expect $53 / 2 = 26.5,$ i.e., 26 or 27, \"not buy video\" cards in each simulated group, yielding a simulated point estimate of the difference in proportions of 0%.\nHowever, due to random chance, we might also expect to sometimes observe a count a little above or below 26 or 27.\n:::\n\nThe results of a single randomization are shown in @tbl-opportunity-cost-obs-simulated.\n\n\n::: {#tbl-opportunity-cost-obs-simulated .cell tbl-cap='Summary of student choices against their simulated groups. The group\nassignment had no connection to the student decisions, so any difference\nbetween the two groups is due to chance.'}\n::: {.cell-output-display}\n`````{=html}
| group     | buy video | not buy video | Total |
|-----------|-----------|---------------|-------|
| control   | 46        | 29            | 75    |
| treatment | 51        | 24            | 75    |
| Total     | 97        | 53            | 150   |
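The shuffling step above can be repeated many times on a computer, as the text does next; a minimal sketch in base R, assuming only the observed totals (97 "buy video" and 53 "not buy video" decisions split into two groups of 75):

```r
set.seed(47)  # arbitrary seed so the sketch is reproducible
decision <- c(rep("not buy video", 53), rep("buy video", 97))

# Each replicate shuffles the group labels and records the chance difference
# in "not buy video" proportions (treatment minus control).
null_diffs <- replicate(1000, {
  sham_group <- sample(c(rep("treatment", 75), rep("control", 75)))
  mean(decision[sham_group == "treatment"] == "not buy video") -
    mean(decision[sham_group == "control"] == "not buy video")
})

hist(null_diffs)           # shape of the null distribution of differences
mean(null_diffs >= 0.20)   # share of shuffles at least as large as the observed 20%
```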
\n\n`````\n:::\n:::\n\n\nFrom this table, we can compute a difference that occurred from the first shuffle of the data (i.e., from chance alone):\n\n$$\\hat{p}_{T, shfl1} - \\hat{p}_{C, shfl1} = \\frac{24}{75} - \\frac{29}{75} = 0.32 - 0.387 = - 0.067$$\n\nJust one simulation will not be enough to get a sense of what sorts of differences would happen from chance alone.\n\n\n::: {.cell}\n\n:::\n\n\nWe'll simulate another set of simulated groups and compute the new difference: 0.04.\n\nAnd again: 0.12.\n\nAnd again: -0.013.\n\nWe'll do this 1,000 times.\n\nThe results are summarized in a dot plot in @fig-opportunity-cost-rand-dot-plot, where each point represents the difference from one randomization.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A stacked dot plot of 1,000 simulated (null) differences produced under\nthe null hypothesis, $H_0.$ Six of the 1,000 simulations had a difference\nof at least 20%, which was the difference observed in the study.](11-foundations-randomization_files/figure-html/fig-opportunity-cost-rand-dot-plot-1.png){#fig-opportunity-cost-rand-dot-plot width=90%}\n:::\n:::\n\n\nSince there are so many points and it is difficult to discern one point from the other, it is more convenient to summarize the results in a histogram such as the one in @fig-opportunity-cost-rand-hist, where the height of each histogram bar represents the number of simulations resulting in an outcome of that magnitude.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A histogram of 1,000 chance differences produced under the null hypothesis.\nHistograms like this one are a convenient representation of data or results\nwhen there are a large number of simulations.](11-foundations-randomization_files/figure-html/fig-opportunity-cost-rand-hist-1.png){#fig-opportunity-cost-rand-hist width=90%}\n:::\n:::\n\n\nUnder the null hypothesis (no treatment effect), we would observe a difference of at least +20% about 0.6% of the time.\nThat is really rare!\nInstead, we will conclude the data provide strong evidence there is a treatment effect: reminding students before a purchase that they could instead spend the money later on something else lowers the chance that they will continue with the purchase.\nNotice that we are able to make a causal statement for this study since the study is an experiment, although we do not know why the reminder induces a lower purchase rate.\n\n## Hypothesis testing {#HypothesisTesting}\n\nIn the last two sections, we utilized a **hypothesis test**\\index{hypothesis test}, which is a formal technique for evaluating two competing possibilities.\nIn each scenario, we described a **null hypothesis**\\index{null hypothesis}, which represented either a skeptical perspective or a perspective of no difference.\nWe also laid out an **alternative hypothesis**\\index{alternative hypothesis}, which represented a new perspective such as the possibility of a relationship between two variables or a treatment effect in an experiment.\nThe alternative hypothesis is usually the reason the scientists set out to do the research in the first place.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Null and alternative hypotheses.**\n\nThe **null hypothesis** $(H_0)$ often represents either a skeptical perspective or a claim of \"no difference\" to be tested.\n\nThe **alternative hypothesis** $(H_A)$ represents an alternative claim under consideration and is often represented by a range of possible values for the value of interest.\n:::\n\nIf a person makes a somewhat unbelievable claim, we are 
initially skeptical.\nHowever, if there is sufficient evidence that supports the claim, we set aside our skepticism.\nThe hallmarks of hypothesis testing are also found in the US court system.\n\n### The US court system\n\nIn the US court system, jurors evaluate the evidence to see whether it convincingly shows a defendant is guilty.\nDefendants are considered to be innocent until proven otherwise.\n\n::: {.workedexample data-latex=\"\"}\nThe US court considers two possible claims about a defendant: they are either innocent or guilty.\n\nIf we set these claims up in a hypothesis framework, which would be the null hypothesis and which the alternative?\n\n------------------------------------------------------------------------\n\nThe jury considers whether the evidence is so convincing (strong) that there is no reasonable doubt regarding the person's guilt.\nThat is, the skeptical perspective (null hypothesis) is that the person is innocent until evidence is presented that convinces the jury that the person is guilty (alternative hypothesis).\n:::\n\nJurors examine the evidence to see whether it convincingly shows a defendant is guilty.\nNotice that if a jury finds a defendant *not guilty*, this does not necessarily mean the jury is confident in the person's innocence.\nThey are simply not convinced of the alternative, that the person is guilty.\nThis is also the case with hypothesis testing: *even if we fail to reject the null hypothesis, we do not accept the null hypothesis as truth*.\n\nFailing to find evidence in favor of the alternative hypothesis is not equivalent to finding evidence that the null hypothesis is true.\nWe will see this idea in greater detail in [Chapter -@sec-decerr].\n\n### p-value and statistical discernibility\n\nIn @sec-caseStudySexDiscrimination, we encountered a study from the 1970s that explored whether there was strong evidence that female candidates were less likely to be promoted than male candidates.\nThe research question -- are female candidates discriminated against in promotion decisions? -- was framed in the context of hypotheses:\n\n- $H_0:$ Sex has no effect on promotion decisions.\n\n- $H_A:$ Female candidates are discriminated against in promotion decisions.\n\nThe null hypothesis $(H_0)$ was a perspective of no difference in promotion.\nThe data on sex discrimination provided a point estimate of a 29.2% difference in recommended promotion rates between male and female candidates.\nWe determined that such a difference from chance alone, assuming the null hypothesis was true, would be rare: it would only happen about 2 in 100 times.\nWhen results like these are inconsistent with $H_0,$ we reject $H_0$ in favor of $H_A.$ Here, we concluded there was discrimination against female candidates.\n\nThe 2-in-100 chance is what we call a **p-value**, which is a probability quantifying the strength of the evidence against the null hypothesis, given the observed data.\n\n::: {.important data-latex=\"\"}\n**p-value.**\n\nThe **p-value**\\index{hypothesis testing!p-value} is the probability of observing data at least as favorable to the alternative hypothesis as our current dataset, if the null hypothesis were true.\nWe typically use a summary statistic of the data, such as a difference in proportions, to help compute the p-value and evaluate the hypotheses.\nThis summary value that is used to compute the p-value is often called the **test statistic**\\index{test statistic}.\n:::\n\n\n\n\n\n::: {.workedexample data-latex=\"\"}\nIn the sex discrimination study, the 
difference in discrimination rates was our test statistic.\nWhat was the test statistic in the opportunity cost study covered in @sec-caseStudyOpportunityCost?\n\n------------------------------------------------------------------------\n\nThe test statistic in the opportunity cost study was the difference in the proportion of students who decided against the video purchase in the treatment and control groups.\nIn each of these examples, the **point estimate** of the difference in proportions was used as the test statistic.\n:::\n\nWhen the p-value is small, i.e., less than a previously set threshold, we say the results are **statistically discernible**\\index{statistically significant}\\index{statistically discernible}.\nThis means the data provide such strong evidence against $H_0$ that we reject the null hypothesis in favor of the alternative hypothesis.[^8]\nThe threshold is called the **discernibility level**\\index{hypothesis testing!discernibility level}\\index{significance level}\\index{discernibility level} and is often represented by $\\alpha$ (the Greek letter *alpha*).[^9]\nThe value of $\\alpha$ represents how rare an event needs to be in order for the null hypothesis to be rejected.\nHistorically, many fields have set $\\alpha = 0.05,$ meaning that the results need to occur less than 5% of the time, if the null hypothesis is to be rejected.\nThe value of $\\alpha$ can vary depending on the field or the application.\n\n[^8]: Many texts use the phrase \"statistically significant\" instead of \"statistically discernible\".\n    We have chosen to use \"discernible\" to indicate that a precise statistical event has happened, as opposed to a notable effect which may or may not fit the statistical definition of discernible or significant.\n\n[^9]: Here, too, we have chosen \"discernibility level\" instead of \"significance level\" which you will see in some texts.\n\n\n\n\n\nNote that you may have heard the phrase \"statistically significant\" as a way to describe \"statistically discernible.\" Although in everyday language \"significant\" would indicate that a difference is large or meaningful, that is not necessarily the case here.\nThe term \"statistically discernible\" indicates that the p-value from a study fell below the chosen discernibility level.\nFor example, in the sex discrimination study, the p-value was found to be approximately 0.02.\nUsing a discernibility level of $\\alpha = 0.05,$ we would say that the data provided statistically discernible evidence against the null hypothesis.\nHowever, this conclusion gives us no information regarding the size of the difference in promotion rates!\n\n::: {.important data-latex=\"\"}\n**Statistical discernibility.**\n\nWe say that the data provide **statistically discernible**\\index{hypothesis testing!statistically discernible.} evidence against the null hypothesis if the p-value is less than some predetermined threshold (e.g., 0.01, 0.05, 0.1).\n:::\n\n::: {.workedexample data-latex=\"\"}\nIn the opportunity cost study in @sec-caseStudyOpportunityCost, we analyzed an experiment where study participants had a 20% drop in likelihood of continuing with a video purchase if they were reminded that the money, if not spent on the video, could be used for other purchases in the future.\nWe determined that such a large difference would only occur 6-in-1,000 times if the reminder actually had no influence on student decision-making.\nWhat is the p-value in this study?\nWould you classify the result as \"statistically 
discernible\"?\n\n------------------------------------------------------------------------\n\nThe p-value was 0.006.\nSince the p-value is less than 0.05, the data provide statistically discernible evidence that US college students were actually influenced by the reminder.\n:::\n\n::: {.important data-latex=\"\"}\n**What's so special about 0.05?**\n\nWe often use a threshold of 0.05 to determine whether a result is statistically discernible.\nBut why 0.05?\nMaybe we should use a bigger number, or maybe a smaller number.\nIf you're a little puzzled, that probably means you're reading with a critical eye -- good job!\nWe've made a video to help clarify *why 0.05*:\n\n\n\nSometimes it's also a good idea to deviate from the standard.\nWe'll discuss when to choose a threshold different than 0.05 in [Chapter -@sec-decerr].\n:::\n\n\\clearpage\n\n## Chapter review {#chp11-review}\n\n### Summary\n\n@fig-fullrand provides a visual summary of the randomization testing procedure.\n\n\\index{randomization test}\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![An example of one simulation of the full randomization procedure from a hypothetical\ndataset as visualized in the first panel. We repeat the steps hundreds or thousands\nof times.\n](images/fullrand.png){#fig-fullrand fig-alt='48 red and white cards are shown in three panels. The first panel represents\noriginal data and original allocation of Group 1 and Group 2 (in the original data\nthere are 7 white cards in Group 1 and 10 white cards in Group 2). The second panel\nrepresents the shuffled red and white cards that are randomly assigned as Group\n1 and Group 2. The third panel has the cards sorted according to the random assignment\nof Group 1 and Group 2. In the third panel there are 8 white cards in Group\n1 and 9 white cards in Group 2.' width=100%}\n:::\n:::\n\n\nWe can summarize the randomization test procedure as follows:\n\n- **Frame the research question in terms of hypotheses.** Hypothesis tests are appropriate for research questions that can be summarized in two competing hypotheses. The null hypothesis $(H_0)$ usually represents a skeptical perspective or a perspective of no relationship between the variables. The alternative hypothesis $(H_A)$ usually represents a new view or the existence of a relationship between the variables.\n- **Collect data with an observational study or experiment.** If a research question can be formed into two hypotheses, we can collect data to run a hypothesis test. If the research question focuses on associations between variables but does not concern causation, we would use an observational study. If the research question seeks a causal connection between two or more variables, then an experiment should be used.\n- **Model the randomness that would occur if the null hypothesis was true.** In the examples above, the variability has been modeled as if the treatment (e.g., sexual identity, opportunity) allocation was independent of the outcome of the study. The computer-generated null distribution is the result of many different randomizations and quantifies the variability that would be expected if the null hypothesis was true.\n- **Analyze the data.** Choose an analysis technique appropriate for the data and identify the p-value. So far, we have only seen one analysis technique: randomization. 
Throughout the rest of this textbook, we'll encounter several new methods suitable for many other contexts.\n- **Form a conclusion.** Using the p-value from the analysis, determine whether the data provide evidence against the null hypothesis. Also, be sure to write the conclusion in plain language so casual readers can understand the results.\n\n@tbl-chp11-summary is another look at the randomization test summary.\n\n\n::: {#tbl-chp11-summary .cell tbl-cap='Summary of randomization as an inferential statistical method.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| Question | Answer |
|----------|--------|
| What does it do? | Shuffles the explanatory variable to mimic the natural variability found in a randomized experiment |
| What is the random process described? | Randomized experiment |
| What other random processes can be approximated? | Can also be used to describe random sampling in an observational model |
| What is it best for? | Hypothesis testing (can also be used for confidence intervals, but not covered in this text). |
| What physical object represents the simulation process? | Shuffling cards |
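For readers who want to carry out the whole procedure summarized above with a few lines of code, one option is the **infer** package, which is not used in this chapter's text; the sketch below assumes the `sex_discrimination` data from the **openintro** package with variables named `sex` and `decision`.

```r
library(openintro)  # assumed source of the sex_discrimination data
library(infer)

# Observed statistic: difference in promotion proportions (male - female)
obs_diff <- sex_discrimination |>
  specify(decision ~ sex, success = "promoted") |>
  calculate(stat = "diff in props", order = c("male", "female"))

# Null distribution built by permuting (shuffling) the explanatory variable
null_dist <- sex_discrimination |>
  specify(decision ~ sex, success = "promoted") |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in props", order = c("male", "female"))

# p-value: proportion of null differences at least as large as the observed one
get_p_value(null_dist, obs_stat = obs_diff, direction = "greater")
```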
\n\n`````\n:::\n:::\n\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
alternative hypothesis p-value statistic
confidence interval permutation test statistical inference
discernibility level point estimate statistically discernible
hypothesis test randomization test statistically significant
independent significance level success
null hypothesis simulation test statistic
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp11-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-11].\n\n::: {.exercises data-latex=\"\"}\n1. **Identify the parameter, I** For each of the following situations, state whether the parameter of interest is a mean or a proportion.\n It may be helpful to examine whether individual responses are numerical or categorical.\n\n a. In a survey, one hundred college students are asked how many hours per week they spend on the Internet.\n\n b. In a survey, one hundred college students are asked: \"What percentage of the time you spend on the Internet is part of your course work?\"\n\n c. In a survey, one hundred college students are asked whether they cited information from Wikipedia in their papers.\n\n d. In a survey, one hundred college students are asked what percentage of their total weekly spending is on alcoholic beverages.\n\n e. In a sample of one hundred recent college graduates, it is found that 85 percent expect to get a job within one year of their graduation date.\n\n2. **Identify the parameter, II.** For each of the following situations, state whether the parameter of interest is a mean or a proportion.\n\n a. A poll shows that 64% of Americans personally worry a great deal about federal spending and the budget deficit.\n\n b. A survey reports that local TV news has shown a 17% increase in revenue within a two year period while newspaper revenues decreased by 6.4% during this time period.\n\n c. In a survey, high school and college students are asked whether they use geolocation services on their smart phones.\n\n d. In a survey, smart phone users are asked whether they use a web-based taxi service.\n\n e. In a survey, smart phone users are asked how many times they used a web-based taxi service over the last year.\n\n3. **Hypotheses.** For each of the research statements below, note whether it represents a null hypothesis claim or an alternative hypothesis claim.\n\n a. The number of hours that grade-school children spend doing homework predicts their future success on standardized tests.\n\n b. King cheetahs on average run the same speed as standard spotted cheetahs.\n\n c. For a particular student, the probability of correctly answering a 5-option multiple choice test is larger than 0.2 (i.e., better than guessing).\n\n d. The mean length of African elephant tusks has changed over the last 100 years.\n\n e. The risk of facial clefts is equal for babies born to mothers who take folic acid supplements compared with those from mothers who do not.\n\n f. Caffeine intake during pregnancy affects mean birth weight.\n\n g. The probability of getting in a car accident is the same if using a cell phone than if not using a cell phone.\n\n \\clearpage\n\n4. **True null hypothesis.** Unbeknownst to you, let's say that the null hypothesis is actually true in the population.\n You plan to run a study anyway.\n\n a. If the level of discernibility you choose (i.e., the cutoff for your p-value) is 0.05, how likely is it that you will mistakenly reject the null hypothesis?\n\n b. If the level of discernibility you choose (i.e., the cutoff for your p-value) is 0.01, how likely is it that you will mistakenly reject the null hypothesis?\n\n c. If the level of discernibility you choose (i.e., the cutoff for your p-value) is 0.10, how likely is it that you will mistakenly reject the null hypothesis?\n\n5. 
**Identify hypotheses, I.** Write the null and alternative hypotheses in words and then symbols for each of the following situations.\n\n a. New York is known as \"the city that never sleeps\".\n A random sample of 25 New Yorkers were asked how much sleep they get per night.\n Do these data provide convincing evidence that New Yorkers on average sleep less than 8 hours a night?\n\n b. Employers at a firm are worried about the effect of March Madness, a basketball championship held each spring in the US, on employee productivity.\n They estimate that on a regular business day employees spend on average 15 minutes of company time checking personal email, making personal phone calls, etc.\n They also collect data on how much company time employees spend on such non- business activities during March Madness.\n They want to determine if these data provide convincing evidence that employee productivity decreases during March Madness.\n\n6. **Identify hypotheses, II.** Write the null and alternative hypotheses in words and using symbols for each of the following situations.\n\n a. Since 2008, chain restaurants in California have been required to display calorie counts of each menu item.\n Prior to menus displaying calorie counts, the average calorie intake of diners at a restaurant was 1100 calories.\n After calorie counts started to be displayed on menus, a nutritionist collected data on the number of calories consumed at this restaurant from a random sample of diners.\n Do these data provide convincing evidence of a difference in the average calorie intake of a diners at this restaurant?\n\n b. Based on the performance of those who took the GRE exam between July 1, 2004 and June 30, 2007, the average Verbal Reasoning score was calculated to be 462.\n In 2021 the average verbal score was slightly higher.\n Do these data provide convincing evidence that the average GRE Verbal Reasoning score has changed since 2021?\n\n \\clearpage\n\n7. **Side effects of Avandia.** Rosiglitazone is the active ingredient in the controversial type 2 diabetes medicine Avandia and has been linked to an increased risk of serious cardiovascular problems such as stroke, heart failure, and death.\n A common alternative treatment is Pioglitazone, the active ingredient in a diabetes medicine called Actos.\n In a nationwide retrospective observational study of 227,571 Medicare beneficiaries aged 65 years or older, it was found that 2,593 of the 67,593 patients using Rosiglitazone and 5,386 of the 159,978 using Pioglitazone had serious cardiovascular problems.\n These data are summarized in the contingency table below.[^_11-ex-foundations-randomization-1]\n [@Graham:2010]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
    | treatment     | no cardiovascular problems | cardiovascular problems | Total   |
    |---------------|----------------------------|-------------------------|---------|
    | Pioglitazone  | 154,592                    | 5,386                   | 159,978 |
    | Rosiglitazone | 65,000                     | 2,593                   | 67,593  |
    | Total         | 219,592                    | 7,979                   | 227,571 |
\n \n `````\n :::\n :::\n\n a. Determine if each of the following statements is true or false.\n If false, explain why.\n *Be careful:* The reasoning may be wrong even if the statement's conclusion is correct.\n In such cases, the statement should be considered false.\n\n i. Since more patients on Pioglitazone had cardiovascular problems (5,386 vs. 2,593), we can conclude that the rate of cardiovascular problems for those on a Pioglitazone treatment is higher.\n\n ii. The data suggest that diabetic patients who are taking Rosiglitazone are more likely to have cardiovascular problems since the rate of incidence was (2,593 / 67,593 = 0.038) 3.8% for patients on this treatment, while it was only (5,386 / 159,978 = 0.034) 3.4% for patients on Pioglitazone.\n\n iii. The fact that the rate of incidence is higher for the Rosiglitazone group proves that Rosiglitazone causes serious cardiovascular problems.\n\n iv. Based on the information provided so far, we cannot tell if the difference between the rates of incidences is due to a relationship between the two variables or due to chance.\n\n b. What proportion of all patients had cardiovascular problems?\n\n c. If the type of treatment and having cardiovascular problems were independent, about how many patients in the Rosiglitazone group would we expect to have had cardiovascular problems?\n\n ::: {.content-hidden unless-format=\"pdf\"} *See next page for part d.* :::\n\n \\clearpage\n\n d. We can investigate the relationship between outcome and treatment in this study using a randomization technique.\n While in reality we would carry out the simulations required for randomization using statistical software, suppose we actually simulate using index cards.\n In order to simulate from the independence model, which states that the outcomes were independent of the treatment, we write whether each patient had a cardiovascular problem on cards, shuffled all the cards together, then deal them into two groups of size 67,593 and 159,978.\n We repeat this simulation 100 times and each time record the difference between the proportions of cards that say \"Yes\" in the Rosiglitazone and Pioglitazone groups.\n Use the histogram of these differences in proportions to answer the following questions.\n\n i. What are the claims being tested?\n\n ii. Compared to the number calculated in part (b), which would provide more support for the alternative hypothesis, *higher* or *lower* proportion of patients with cardiovascular problems in the Rosiglitazone group?\n\n iii. What do the simulation results suggest about the relationship between taking Rosiglitazone and having cardiovascular problems in diabetic patients?\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](11-foundations-randomization_files/figure-html/unnamed-chunk-35-1.png){width=90%}\n :::\n :::\n\n \\clearpage\n\n8. 
**Heart transplants.** The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan.\n Each patient entering the program was designated an official heart transplant candidate, meaning that they were gravely ill and would most likely benefit from a new heart.\n Some patients got a transplant and some did not.\n The variable `transplant` indicates which group the patients were in; patients in the treatment group got a transplant and those in the control group did not.\n Of the 34 patients in the control group, 30 died.\n Of the 69 people in the treatment group, 45 died.\n Another variable called `survived` was used to indicate whether the patient was alive at the end of the study.[^_11-ex-foundations-randomization-2]\n [@Turnbull+Brown+Hu:1974]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](11-foundations-randomization_files/figure-html/unnamed-chunk-36-1.png){width=90%}\n :::\n :::\n\n a. Does the stacked bar plot indicate that survival is independent of whether the patient got a transplant?\n Explain your reasoning.\n\n b. What do the box plots above suggest about the efficacy (effectiveness) of the heart transplant treatment.\n\n c. What proportion of patients in the treatment group and what proportion of patients in the control group died?\n\n ::: {.content-hidden unless-format=\"pdf\"}\n *See next page for part d.*\n :::\n\n \\clearpage\n\n d. One approach for investigating whether the treatment is effective is to use a randomization technique.\n\n i. What are the claims being tested?\n\n ii. The paragraph below describes the set up for such approach, if we were to do it without using statistical software.\n Fill in the blanks with a number or phrase, whichever is appropriate.\n\n > We write *alive* on $\\rule{2cm}{0.5pt}$ cards representing patients who were alive at the end of the study, and *deceased* on $\\rule{2cm}{0.5pt}$ cards representing patients who were not.\n > Then, we shuffle these cards and split them into two groups: one group of size $\\rule{2cm}{0.5pt}$ representing treatment, and another group of size $\\rule{2cm}{0.5pt}$ representing control.\n > We calculate the difference between the proportion of \\textit{deceased} cards in the treatment and control groups (treatment - control) and record this value.\n > We repeat this 100 times to build a distribution centered at $\\rule{2cm}{0.5pt}$.\n > Lastly, we calculate the fraction of simulations where the simulated differences in proportions are $\\rule{2cm}{0.5pt}$.\n > If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.\n\n iii. 
What do the simulation results shown below suggest about the effectiveness of the transplant program?\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](11-foundations-randomization_files/figure-html/unnamed-chunk-37-1.png){width=90%}\n :::\n :::\n\n[^_11-ex-foundations-randomization-1]: The [`avandia`](http://openintrostat.github.io/openintro/reference/avandia.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_11-ex-foundations-randomization-2]: The [`heart_transplant`](http://openintrostat.github.io/openintro/reference/heart_transplant.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n\n:::\n", "supporting": [ "11-foundations-randomization_files" ], diff --git a/_freeze/11-foundations-randomization/figure-html/fig-opportunity-cost-obs-bar-1.png b/_freeze/11-foundations-randomization/figure-html/fig-opportunity-cost-obs-bar-1.png new file mode 100644 index 00000000..3e5e2bee Binary files /dev/null and b/_freeze/11-foundations-randomization/figure-html/fig-opportunity-cost-obs-bar-1.png differ diff --git a/_freeze/11-foundations-randomization/figure-html/fig-opportunity-cost-rand-dot-plot-1.png b/_freeze/11-foundations-randomization/figure-html/fig-opportunity-cost-rand-dot-plot-1.png new file mode 100644 index 00000000..0ad16378 Binary files /dev/null and b/_freeze/11-foundations-randomization/figure-html/fig-opportunity-cost-rand-dot-plot-1.png differ diff --git a/_freeze/11-foundations-randomization/figure-html/fig-opportunity-cost-rand-hist-1.png b/_freeze/11-foundations-randomization/figure-html/fig-opportunity-cost-rand-hist-1.png new file mode 100644 index 00000000..6d49eab5 Binary files /dev/null and b/_freeze/11-foundations-randomization/figure-html/fig-opportunity-cost-rand-hist-1.png differ diff --git a/_freeze/11-foundations-randomization/figure-html/fig-sex-rand-dot-plot-1.png b/_freeze/11-foundations-randomization/figure-html/fig-sex-rand-dot-plot-1.png new file mode 100644 index 00000000..71d3e1cf Binary files /dev/null and b/_freeze/11-foundations-randomization/figure-html/fig-sex-rand-dot-plot-1.png differ diff --git a/exercises/_11-ex-foundations-randomization.qmd b/exercises/_11-ex-foundations-randomization.qmd index 07638726..2073bd6a 100644 --- a/exercises/_11-ex-foundations-randomization.qmd +++ b/exercises/_11-ex-foundations-randomization.qmd @@ -1,75 +1,84 @@ -1. **Identify the parameter, I** -For each of the following situations, state whether the parameter of interest is a mean or a proportion. It may be helpful to examine whether individual responses are numerical or categorical. +1. **Identify the parameter, I** For each of the following situations, state whether the parameter of interest is a mean or a proportion. + It may be helpful to examine whether individual responses are numerical or categorical. - a. In a survey, one hundred college students are asked how many hours per week they spend on the Internet. + a. In a survey, one hundred college students are asked how many hours per week they spend on the Internet. - b. In a survey, one hundred college students are asked: "What percentage of the time you spend on the Internet is part of your course work?" + b. In a survey, one hundred college students are asked: "What percentage of the time you spend on the Internet is part of your course work?" - c. 
In a survey, one hundred college students are asked whether they cited information from Wikipedia in their papers.
+ c. In a survey, one hundred college students are asked whether they cited information from Wikipedia in their papers.
- d. In a survey, one hundred college students are asked what percentage of their total weekly spending is on alcoholic beverages.
+ d. In a survey, one hundred college students are asked what percentage of their total weekly spending is on alcoholic beverages.
- e. In a sample of one hundred recent college graduates, it is found that 85 percent expect to get a job within one year of their graduation date.
+ e. In a sample of one hundred recent college graduates, it is found that 85 percent expect to get a job within one year of their graduation date.
-1. **Identify the parameter, II.**
-For each of the following situations, state whether the parameter of interest is a mean or a proportion.
+2. **Identify the parameter, II.** For each of the following situations, state whether the parameter of interest is a mean or a proportion.
- a. A poll shows that 64% of Americans personally worry a great deal about federal spending and the budget deficit.
+ a. A poll shows that 64% of Americans personally worry a great deal about federal spending and the budget deficit.
- b. A survey reports that local TV news has shown a 17% increase in revenue within a two year period while newspaper revenues decreased by 6.4% during this time period.
+ b. A survey reports that local TV news has shown a 17% increase in revenue within a two year period while newspaper revenues decreased by 6.4% during this time period.
- c. In a survey, high school and college students are asked whether they use geolocation services on their smart phones.
+ c. In a survey, high school and college students are asked whether they use geolocation services on their smart phones.
- d. In a survey, smart phone users are asked whether they use a web-based taxi service.
+ d. In a survey, smart phone users are asked whether they use a web-based taxi service.
- e. In a survey, smart phone users are asked how many times they used a web-based taxi service over the last year.
+ e. In a survey, smart phone users are asked how many times they used a web-based taxi service over the last year.
-1. **Hypotheses.**
-For each of the research statements below, note whether it represents a null hypothesis claim or an alternative hypothesis claim.
+3. **Hypotheses.** For each of the research statements below, note whether it represents a null hypothesis claim or an alternative hypothesis claim.
+
+ a. The number of hours that grade-school children spend doing homework predicts their future success on standardized tests.
+
+ b. King cheetahs on average run the same speed as standard spotted cheetahs.
+
+ c. For a particular student, the probability of correctly answering a 5-option multiple choice test is larger than 0.2 (i.e., better than guessing).
+
+ d. The mean length of African elephant tusks has changed over the last 100 years.
+
+ e. The risk of facial clefts is equal for babies born to mothers who take folic acid supplements compared with those from mothers who do not.
+
+ f. Caffeine intake during pregnancy affects mean birth weight.
+
+ g. The probability of getting in a car accident is the same if using a cell phone as if not using a cell phone.
- a. The number of hours that grade-school children spend doing homework predicts their future success on standardized tests.
-
- b. King cheetahs on average run the same speed as standard spotted cheetahs.
-
- c. For a particular student, the probability of correctly answering a 5-option multiple choice test is larger than 0.2 (i.e., better than guessing).
-
- d. The mean length of African elephant tusks has changed over the last 100 years.
-
- e. The risk of facial clefts is equal for babies born to mothers who take folic acid supplements compared with those from mothers who do not.
-
- f. Caffeine intake during pregnancy affects mean birth weight.
-
- g. The probability of getting in a car accident is the same if using a cell phone than if not using a cell phone.
-
 \clearpage
-1. **True null hypothesis.**
-Unbeknownst to you, let's say that the null hypothesis is actually true in the population. You plan to run a study anyway.
+4. **True null hypothesis.** Unbeknownst to you, let's say that the null hypothesis is actually true in the population.
+ You plan to run a study anyway.
+
+ a. If the level of discernibility you choose (i.e., the cutoff for your p-value) is 0.05, how likely is it that you will mistakenly reject the null hypothesis?
+
+ b. If the level of discernibility you choose (i.e., the cutoff for your p-value) is 0.01, how likely is it that you will mistakenly reject the null hypothesis?
+
+ c. If the level of discernibility you choose (i.e., the cutoff for your p-value) is 0.10, how likely is it that you will mistakenly reject the null hypothesis?
- a. If the level of discernibility you choose (i.e., the cutoff for your p-value) is 0.05, how likely is it that you will mistakenly reject the null hypothesis?
-
- b. If the level of discernibility you choose (i.e., the cutoff for your p-value) is 0.01, how likely is it that you will mistakenly reject the null hypothesis?
-
- c. If the level of discernibility you choose (i.e., the cutoff for your p-value) is 0.10, how likely is it that you will mistakenly reject the null hypothesis?
+5. **Identify hypotheses, I.** Write the null and alternative hypotheses in words and then symbols for each of the following situations.
-1. **Identify hypotheses, I.**
-Write the null and alternative hypotheses in words and then symbols for each of the following situations.
+ a. New York is known as "the city that never sleeps".
+ A random sample of 25 New Yorkers were asked how much sleep they get per night.
+ Do these data provide convincing evidence that New Yorkers on average sleep less than 8 hours a night?
- a. New York is known as "the city that never sleeps". A random sample of 25 New Yorkers were asked how much sleep they get per night. Do these data provide convincing evidence that New Yorkers on average sleep less than 8 hours a night?
+ b. Employers at a firm are worried about the effect of March Madness, a basketball championship held each spring in the US, on employee productivity.
+ They estimate that on a regular business day employees spend on average 15 minutes of company time checking personal email, making personal phone calls, etc.
+ They also collect data on how much company time employees spend on such non-business activities during March Madness.
+ They want to determine if these data provide convincing evidence that employee productivity decreases during March Madness.
- b. Employers at a firm are worried about the effect of March Madness, a basketball championship held each spring in the US, on employee productivity. They estimate that on a regular business day employees spend on average 15 minutes of company time checking personal email, making personal phone calls, etc. They also collect data on how much company time employees spend on such non- business activities during March Madness. They want to determine if these data provide convincing evidence that employee productivity decreases during March Madness.
+6. **Identify hypotheses, II.** Write the null and alternative hypotheses in words and using symbols for each of the following situations.
-1. **Identify hypotheses, II.**
-Write the null and alternative hypotheses in words and using symbols for each of the following situations.
+ a. Since 2008, chain restaurants in California have been required to display calorie counts of each menu item.
+ Prior to menus displaying calorie counts, the average calorie intake of diners at a restaurant was 1100 calories.
+ After calorie counts started to be displayed on menus, a nutritionist collected data on the number of calories consumed at this restaurant from a random sample of diners.
+ Do these data provide convincing evidence of a difference in the average calorie intake of diners at this restaurant?
- a. Since 2008, chain restaurants in California have been required to display calorie counts of each menu item. Prior to menus displaying calorie counts, the average calorie intake of diners at a restaurant was 1100 calories. After calorie counts started to be displayed on menus, a nutritionist collected data on the number of calories consumed at this restaurant from a random sample of diners. Do these data provide convincing evidence of a difference in the average calorie intake of a diners at this restaurant?
+ b. Based on the performance of those who took the GRE exam between July 1, 2004 and June 30, 2007, the average Verbal Reasoning score was calculated to be 462.
+ In 2021 the average verbal score was slightly higher.
+ Do these data provide convincing evidence that the average GRE Verbal Reasoning score has changed since the 2004-2007 period?
- b. Based on the performance of those who took the GRE exam between July 1, 2004 and June 30, 2007, the average Verbal Reasoning score was calculated to be 462. In 2021 the average verbal score was slightly higher. Do these data provide convincing evidence that the average GRE Verbal Reasoning score has changed since 2021?
-
 \clearpage
-1. **Side effects of Avandia.**
-Rosiglitazone is the active ingredient in the controversial type 2 diabetes medicine Avandia and has been linked to an increased risk of serious cardiovascular problems such as stroke, heart failure, and death. A common alternative treatment is Pioglitazone, the active ingredient in a diabetes medicine called Actos. In a nationwide retrospective observational study of 227,571 Medicare beneficiaries aged 65 years or older, it was found that 2,593 of the 67,593 patients using Rosiglitazone and 5,386 of the 159,978 using Pioglitazone had serious cardiovascular problems. These data are summarized in the contingency table below.^[The [`avandia`](http://openintrostat.github.io/openintro/reference/avandia.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Graham:2010]
+7. **Side effects of Avandia.** Rosiglitazone is the active ingredient in the controversial type 2 diabetes medicine Avandia and has been linked to an increased risk of serious cardiovascular problems such as stroke, heart failure, and death.
+ A common alternative treatment is Pioglitazone, the active ingredient in a diabetes medicine called Actos.
+ In a nationwide retrospective observational study of 227,571 Medicare beneficiaries aged 65 years or older, it was found that 2,593 of the 67,593 patients using Rosiglitazone and 5,386 of the 159,978 using Pioglitazone had serious cardiovascular problems.
+ These data are summarized in the contingency table below.[^_11-ex-foundations-randomization-1]
+ [@Graham:2010]
 ```{r}
 library(tools)
@@ -78,55 +87,60 @@ Rosiglitazone is the active ingredient in the controversial type 2 diabetes med
 library(janitor)
 library(knitr)
 library(kableExtra)
-
- avandia %>%
- count(treatment, cardiovascular_problems) %>%
- mutate(cardiovascular_problems = toTitleCase(as.character(cardiovascular_problems))) %>%
- pivot_wider(names_from = cardiovascular_problems, values_from = n) %>%
- adorn_totals(where = c("row", "col")) %>%
- kbl(linesep = "", booktabs = TRUE, format.args = list(big.mark = ",")) %>%
+
+ avandia |>
+ count(treatment, cardiovascular_problems) |>
+ mutate(cardiovascular_problems = toTitleCase(as.character(cardiovascular_problems))) |>
+ pivot_wider(names_from = cardiovascular_problems, values_from = n) |>
+ adorn_totals(where = c("row", "col")) |>
+ kbl(linesep = "", booktabs = TRUE, format.args = list(big.mark = ",")) |>
 kable_styling(bootstrap_options = c("striped", "condensed"),
 latex_options = "HOLD_position",
- full_width = FALSE) %>%
+ full_width = FALSE) |>
 column_spec(1:4, width = "7em")
 ```
- a. Determine if each of the following statements is true or false. If false, explain why. *Be careful:* The reasoning may be wrong even if the statement's conclusion is correct. In such cases, the statement should be considered false.
-
- i. Since more patients on Pioglitazone had cardiovascular problems (5,386 vs. 2,593), we can conclude that the rate of cardiovascular problems for those on a Pioglitazone treatment is higher.
-
- ii. The data suggest that diabetic patients who are taking Rosiglitazone are more likely to have cardiovascular problems since the rate of incidence was (2,593 / 67,593 = 0.038) 3.8\% for patients on this treatment, while it was only (5,386 / 159,978 = 0.034) 3.4\% for patients on Pioglitazone.
-
- iii. The fact that the rate of incidence is higher for the Rosiglitazone group proves that Rosiglitazone causes serious cardiovascular problems.
-
+ a. Determine if each of the following statements is true or false.
+ If false, explain why.
+ *Be careful:* The reasoning may be wrong even if the statement's conclusion is correct.
+ In such cases, the statement should be considered false.
+
+ i. Since more patients on Pioglitazone had cardiovascular problems (5,386 vs. 2,593), we can conclude that the rate of cardiovascular problems for those on a Pioglitazone treatment is higher.
+
+ ii. The data suggest that diabetic patients who are taking Rosiglitazone are more likely to have cardiovascular problems since the rate of incidence was (2,593 / 67,593 = 0.038) 3.8% for patients on this treatment, while it was only (5,386 / 159,978 = 0.034) 3.4% for patients on Pioglitazone.
+
+ iii. The fact that the rate of incidence is higher for the Rosiglitazone group proves that Rosiglitazone causes serious cardiovascular problems.
+
 iv. Based on the information provided so far, we cannot tell if the difference between the rates of incidences is due to a relationship between the two variables or due to chance.
+
 b. What proportion of all patients had cardiovascular problems?
 c. If the type of treatment and having cardiovascular problems were independent, about how many patients in the Rosiglitazone group would we expect to have had cardiovascular problems?
-
- ::: {.content-hidden unless-format="pdf"}
- *See next page for part d.*
- :::
-
+
+ ::: {.content-hidden unless-format="pdf"} *See next page for part d.* :::
+
 \clearpage
- d. We can investigate the relationship between outcome and treatment in this study using a randomization technique. While in reality we would carry out the simulations required for randomization using statistical software, suppose we actually simulate using index cards. In order to simulate from the independence model, which states that the outcomes were independent of the treatment, we write whether each patient had a cardiovascular problem on cards, shuffled all the cards together, then deal them into two groups of size 67,593 and 159,978. We repeat this simulation 100 times and each time record the difference between the proportions of cards that say "Yes" in the Rosiglitazone and Pioglitazone groups. Use the histogram of these differences in proportions to answer the following questions.
-
- i. What are the claims being tested?
-
- ii. Compared to the number calculated in part (b), which would provide more support for the alternative hypothesis, *higher* or *lower* proportion of patients with cardiovascular problems in the Rosiglitazone group?
-
+ d. We can investigate the relationship between outcome and treatment in this study using a randomization technique.
+ While in reality we would carry out the simulations required for randomization using statistical software, suppose we actually simulate using index cards.
+ In order to simulate from the independence model, which states that the outcomes were independent of the treatment, we write whether each patient had a cardiovascular problem on cards, shuffle all the cards together, then deal them into two groups of size 67,593 and 159,978.
+ We repeat this simulation 100 times and each time record the difference between the proportions of cards that say "Yes" in the Rosiglitazone and Pioglitazone groups.
+ Use the histogram of these differences in proportions to answer the following questions.
+
+ i. What are the claims being tested?
+
+ ii. Compared to the number calculated in part (b), which would provide more support for the alternative hypothesis, a *higher* or *lower* proportion of patients with cardiovascular problems in the Rosiglitazone group?
+
 iii. What do the simulation results suggest about the relationship between taking Rosiglitazone and having cardiovascular problems in diabetic patients?
+
 ```{r}
 library(infer)
 set.seed(25)
- avandia %>%
- specify(response = cardiovascular_problems, explanatory = treatment, success = "yes") %>%
- hypothesize(null = "independence") %>%
- generate(reps = 100, type = "permute") %>%
- calculate(stat = "diff in props", order = c("Rosiglitazone", "Pioglitazone")) %>%
+ avandia |>
+ specify(response = cardiovascular_problems, explanatory = treatment, success = "yes") |>
+ hypothesize(null = "independence") |>
+ generate(reps = 100, type = "permute") |>
+ calculate(stat = "diff in props", order = c("Rosiglitazone", "Pioglitazone")) |>
 ggplot(aes(x = stat)) +
 geom_histogram(binwidth = 0.001/4, fill = IMSCOL["green", "full"]) +
 labs(
@@ -136,63 +150,76 @@ Rosiglitazone is the active ingredient in the controversial type 2 diabetes med
 ) +
 scale_y_continuous(breaks = seq(0, 16, 2))
 ```
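+
+ A supplemental sketch (not part of the original exercise, and marked `eval: false` so it is not run with the book): one possible way to follow up on the histogram above is to compute the observed difference in proportions from the same `avandia` data and compare it with the permuted differences. The object names `obs_diff` and `null_dist` are illustrative only.
+
+ ```{r}
+ #| eval: false
+ library(openintro)  # avandia data
+ library(infer)
+
+ # Observed difference in the proportion of patients with cardiovascular
+ # problems (Rosiglitazone minus Pioglitazone).
+ obs_diff <- avandia |>
+   specify(response = cardiovascular_problems, explanatory = treatment, success = "yes") |>
+   calculate(stat = "diff in props", order = c("Rosiglitazone", "Pioglitazone"))
+
+ # Permutation distribution under the independence model; 100 reps mirrors the
+ # histogram above, though more reps would give a smoother approximation.
+ set.seed(25)
+ null_dist <- avandia |>
+   specify(response = cardiovascular_problems, explanatory = treatment, success = "yes") |>
+   hypothesize(null = "independence") |>
+   generate(reps = 100, type = "permute") |>
+   calculate(stat = "diff in props", order = c("Rosiglitazone", "Pioglitazone"))
+
+ # Fraction of simulated differences at least as large as the observed one.
+ null_dist |>
+   get_p_value(obs_stat = obs_diff, direction = "greater")
+ ```
+
+ If the observed difference lands far in the right tail of the permuted differences, the approximate p-value will be close to zero.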
-
+
 \clearpage
-1. **Heart transplants.**
-The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was designated an official heart transplant candidate, meaning that they were gravely ill and would most likely benefit from a new heart. Some patients got a transplant and some did not. The variable `transplant` indicates which group the patients were in; patients in the treatment group got a transplant and those in the control group did not. Of the 34 patients in the control group, 30 died. Of the 69 people in the treatment group, 45 died. Another variable called `survived` was used to indicate whether the patient was alive at the end of the study.^[The [`heart_transplant`](http://openintrostat.github.io/openintro/reference/heart_transplant.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Turnbull+Brown+Hu:1974]
+8. **Heart transplants.** The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan.
+ Each patient entering the program was designated an official heart transplant candidate, meaning that they were gravely ill and would most likely benefit from a new heart.
+ Some patients got a transplant and some did not.
+ The variable `transplant` indicates which group the patients were in; patients in the treatment group got a transplant and those in the control group did not.
+ Of the 34 patients in the control group, 30 died.
+ Of the 69 people in the treatment group, 45 died.
+ Another variable called `survived` was used to indicate whether the patient was alive at the end of the study.[^_11-ex-foundations-randomization-2]
+ [@Turnbull+Brown+Hu:1974]
 ```{r}
 #| fig-asp: 0.5
 library(tidyverse)
 library(openintro)
 library(patchwork)
-
- heart_transplant <- heart_transplant %>%
+
+ heart_transplant <- heart_transplant |>
 mutate(survived_better_wording = if_else(survived == "dead", "deceased", as.character(survived)))
-
+
 p_bar <- ggplot(heart_transplant, aes(x = transplant, fill = survived_better_wording)) +
 geom_bar(position = "fill") +
 scale_fill_openintro("two") +
 labs(x = NULL, y = NULL, fill = "Outcome")
-
+
 p_box <- ggplot(heart_transplant, aes(x = transplant, y = survtime)) +
 geom_boxplot() +
 labs(x = NULL, y = "Survival time (days)")
-
+
 p_bar + p_box
 ```
- a. Does the stacked bar plot indicate that survival is independent of whether the patient got a transplant? Explain your reasoning.
+ a. Does the stacked bar plot indicate that survival is independent of whether the patient got a transplant?
+ Explain your reasoning.
 b. What do the box plots above suggest about the efficacy (effectiveness) of the heart transplant treatment?
 c. What proportion of patients in the treatment group and what proportion of patients in the control group died?
-
+
 ::: {.content-hidden unless-format="pdf"}
 *See next page for part d.*
 :::
+
 \clearpage
 d. One approach for investigating whether the treatment is effective is to use a randomization technique.
-
- i. What are the claims being tested?
-
- ii. The paragraph below describes the set up for such approach, if we were to do it without using statistical software. Fill in the blanks with a number or phrase, whichever is appropriate.
-
- > We write *alive* on $\rule{2cm}{0.5pt}$ cards representing patients who were alive at the end of the study, and *deceased* on $\rule{2cm}{0.5pt}$ cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size $\rule{2cm}{0.5pt}$ representing treatment, and another group of size $\rule{2cm}{0.5pt}$ representing control. We calculate the difference between the proportion of \textit{deceased} cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at $\rule{2cm}{0.5pt}$. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are $\rule{2cm}{0.5pt}$. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
-
- iii. What do the simulation results shown below suggest about the effectiveness of the transplant program?
-
+
+ i. What are the claims being tested?
+
+ ii. The paragraph below describes the setup for such an approach, if we were to do it without using statistical software.
+ Fill in the blanks with a number or phrase, whichever is appropriate.
+
+ > We write *alive* on $\rule{2cm}{0.5pt}$ cards representing patients who were alive at the end of the study, and *deceased* on $\rule{2cm}{0.5pt}$ cards representing patients who were not.
+ > Then, we shuffle these cards and split them into two groups: one group of size $\rule{2cm}{0.5pt}$ representing treatment, and another group of size $\rule{2cm}{0.5pt}$ representing control.
+ > We calculate the difference between the proportion of \textit{deceased} cards in the treatment and control groups (treatment - control) and record this value.
+ > We repeat this 100 times to build a distribution centered at $\rule{2cm}{0.5pt}$.
+ > Lastly, we calculate the fraction of simulations where the simulated differences in proportions are $\rule{2cm}{0.5pt}$.
+ > If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
+
+ iii. What do the simulation results shown below suggest about the effectiveness of the transplant program?
+
 ```{r}
 library(infer)
 set.seed(40)
- heart_transplant %>%
- specify(response = survived_better_wording, explanatory = transplant, success = "deceased") %>%
- hypothesize(null = "independence") %>%
- generate(reps = 100, type = "permute") %>%
- calculate(stat = "diff in props", order = c("treatment", "control")) %>%
+ heart_transplant |>
+ specify(response = survived_better_wording, explanatory = transplant, success = "deceased") |>
+ hypothesize(null = "independence") |>
+ generate(reps = 100, type = "permute") |>
+ calculate(stat = "diff in props", order = c("treatment", "control")) |>
 ggplot(aes(x = stat)) +
 geom_histogram(binwidth = 0.05, fill = IMSCOL["green", "full"]) +
 labs(
@@ -202,3 +229,7 @@ The Stanford University Heart Transplant Study was conducted to determine whethe
 ) +
 scale_y_continuous(breaks = seq(0, 24, 4), minor_breaks = seq(0, 24, 2))
 ```
+
+[^_11-ex-foundations-randomization-1]: The [`avandia`](http://openintrostat.github.io/openintro/reference/avandia.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
+
+[^_11-ex-foundations-randomization-2]: The [`heart_transplant`](http://openintrostat.github.io/openintro/reference/heart_transplant.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
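+
+A supplemental sketch for the heart transplant exercise above (not part of the original exercises, and marked `eval: false` so it is not run with the book): using the same recoding of `survived` as in that exercise's chunk, the observed difference in the proportion of deceased patients (treatment minus control) can be compared with the permuted differences to approximate a p-value. The object names `obs_diff` and `null_dist` are illustrative only.
+
+```{r}
+#| eval: false
+library(tidyverse)  # mutate(), if_else()
+library(openintro)  # heart_transplant data
+library(infer)
+
+# Recode the outcome as in the exercise chunk above.
+heart_transplant <- heart_transplant |>
+  mutate(survived_better_wording = if_else(survived == "dead", "deceased", as.character(survived)))
+
+# Observed difference in the proportion deceased (treatment minus control).
+obs_diff <- heart_transplant |>
+  specify(response = survived_better_wording, explanatory = transplant, success = "deceased") |>
+  calculate(stat = "diff in props", order = c("treatment", "control"))
+
+# Permutation distribution under the independence model; 100 reps mirrors the
+# exercise's histogram, though more reps would give a smoother approximation.
+set.seed(40)
+null_dist <- heart_transplant |>
+  specify(response = survived_better_wording, explanatory = transplant, success = "deceased") |>
+  hypothesize(null = "independence") |>
+  generate(reps = 100, type = "permute") |>
+  calculate(stat = "diff in props", order = c("treatment", "control"))
+
+# Fraction of simulations with a difference at least as far below zero as the
+# observed one (a lower death rate in the treatment group supports efficacy).
+null_dist |>
+  get_p_value(obs_stat = obs_diff, direction = "less")
+```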