generated from rstudio/bookdown-demo
-
Notifications
You must be signed in to change notification settings - Fork 0
/
10-notes.Rmd
107 lines (84 loc) · 6.81 KB
/
10-notes.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
---
output:
pdf_document: default
html_document: default
---
# Notes
## 1. The @jung2014female US hurricane data {#hurricanes .unnumbered}
Figures \@ref(fig:hurricanes1) and \@ref(fig:hurricanes2) checked for a difference between the fitted male and female line or curve. Jung et al's approach was, instead, to examine whether numbers of deaths varied with the "femaleness" of name, as judged by students in 2014.
As a check on how the popularity of a name for each of females and males may have changed with time, Table \@ref(tab:changetab) uses US social security administration data to show numbers for names where the range of variation over 1950 to 2012 in the relative number of females was 0.08 or more.
```{r cap1, echo=FALSE}
cap1 <- "The minimum and maximum value of the relative proportion
of female names, and the difference, are shown for the eight names
that showed the greatest change over the years 1950 to 2012."
```
```{r changetab, out.width="90%", fig.align='center', echo=FALSE}
mat <- match(babynames::babynames$name, DAAG::hurricNamed$Name, nomatch=0)
matnames <- babynames::babynames[mat>0,]
mf <-merge(subset(matnames,sex=='F')[, c(1,3:4)],
subset(matnames,sex=='M')[,c(1,3:4)],all=TRUE,
by=c("year",'name'),suffixes=c('f','m'))
mf <- within(mf, {nf[is.na(nf)]<-0; nm[is.na(nm)]<-0})
mf$fracf <- with(mf, nf/(nf+nm))
## table(mf$name[mf$fracf>0.075&mf$fracf<0.925])
mfsub <- subset(mf, year%in%(1950:2012))
ran <- with(mfsub, sapply(split(fracf,name),range))
d <- apply(ran, 2, diff)
ord <- round(order(abs(d), decreasing=T))
stats <- round(rbind(ran[,ord[1:8]],d[ord[1:8]]),2)
rownames(stats) <- c('Minimum','Maximum','Difference')
z <- match(DAAG::hurricNamed$Name, mfsub$name, nomatch=0)
knitr::kable(stats, digits=2, caption=cap1)
## |>
## kableExtra::row_spec(3, bold=TRUE) |>
## kableExtra::kable_styling(font_size=10)
```
In other cases, relative number of females were always either in the range 0 to 0.07 (i.e., mostly female), or in the range 0.93 to 1 (i.e., mostly female).
Figure \@ref(fig:plotchanges) shows how the numbers of each sex changed over time, for the first six of the names in Table \@ref(tab:changetab).
```{r cap2, echo=FALSE}
suppressPackageStartupMessages(library(latticeExtra))
trellis.par.set(theme)
cap2 <- "Change in numbers of names given to males and females
over the years 1950 to 2012, for the six names where the
maximum difference in relative frequency was more than 0.1.
The maximum change is shown against each name."
```
```{r plotchanges, echo=FALSE, eval=TRUE, fig.width=5.5, fig.height=3.6, out.width="90%", fig.cap=cap2, fig.align='center'}
lookat <- subset(mfsub, name%in%c('Charley','Cleo',
'Fran','Sandy','Erin','Inez'))
lookat$name <- factor(lookat$name)
dnam <- round(abs(d[c('Charley','Cleo','Fran','Sandy','Erin','Inez')]),2)
levels(lookat$name) <- paste0(levels(lookat$name), ": ",dnam)
atx <- c(10,100,1000, 10000)
xyplot(nm+nf~year|name, data=lookat, xlab="Year",
ylab="Number of babies", par.settings=simpleTheme(pch=c(3,2), cex=0.8),
auto.key=list(text=c("Males","Females"),columns=2),
scales=list(y=list(log='e', at=atx, labels=paste(atx)))) +
latticeExtra::layer(panel.axis('left', at=atx*sqrt(10), labels=F, tck=0.5))
```
The names where the choices of parents at the time are likely to most different from that for students in 2014 are those where there has been greatest change over time (i.e., especially, Charley, Cleo and Fran).
\enlargethispage{21pt}
Other differences from the analyses on which Figures \@ref(fig:hurricanes1) and \@ref(fig:hurricanes2) were based are
- As the primary measure of the risk posed by the hurricanes, the authors used a 2013 US\$ estimate of damage, intended for insurance purposes, for a comparable hurricane in 2013. Figure \@ref(fig:hurricanes1) uses what is surely the more relevant measure, namely `NDAM2014` --- this converts the estimate of damage caused at the time to 2014 US\$.[^10-notes-1]
- Jung et al allowed for minor effects from barometric pressure at landfall, and interactions. Again, these do not affect the conclusions reached.
[^10-notes-1]: Fortuitously, this change makes no difference of consequence to the graph or to the conclusions reached.
```{r missed, echo=FALSE, eval=FALSE}
missed <- DAAG::hurricNamed$Name[z==0]
mfOther <- subset(mf, name%in%missed)
ranOther <- with(mfOther, sapply(split(fracf,name),range))
dOther <- apply(ranOther, 2, diff)
```
```{r changes, echo=FALSE}
stats <- round(rbind(ran[,ord[1:8]],d[ord[1:8]]),2)
x <- knitr::kable(stats)
```
## 2. $^*$What does a $p$-value tell the experimenter? {#pval .unnumbered}
$P$-values are widely used to indicate whether a difference. What follows introduces technicalities that have been avoided in the preceding chapters.
Consider an experiment that compares a treatment with a control. For example, does taking a drug of interest reduce sleeplessness? The starting point for a $p$-value calculation is the NULL hypothesis assumption of no difference in effect between treatment and control. A $p$-value measures the probability that a difference in measured effect as large as that observed, or larger would, assuming the NULL hypothesis, occur by chance. The definition says nothing about the $p$-value that can be expected if there is a difference. It does not give, as is sometimes thought, the probability that there really is no difference.
Issues for understanding the meaning of a $p$-value are:
- Finding $p \leq 0.05$ is all very well. What one really needs to know is that there is an alternative that is substantially more likely.
- $p = 0.05$, or close to 0.05, is at the upper end of the range of $p$-values that occur with a probability of 0.05. Treating $p = 0.05$ as an event that occurs with probability 0.05 under the NULL exaggerates the evidence that it provides against the NULL.
- In fact, under the NULL, all values between 0 and 1 are equally likely!
- Prior probabilities matter. If for example the data is from one of a set of drug trials where 99 out of 100 can be expected to show no effect, the 99 out of 100 instances where there is a 1 in 20 change of $p$ \<= 0.05 have to be set again 1 out of 100 instances where there is a potential to detect an effect. A $p$ \<= 0.05 is more than 99/20 = 4.95 times as likely to come from a drug with no effect as from one that has a real effect.
- What size of effect is, based on whatever is known from past comparable investigations, likely? A reasonable default is to center the distribution on zero, with values that become increasingly less likely as the distance from zero increases. Under plausible assumptions of this general type, it can be shown that with $p$=0.05, the prior relative likelihood should be multiplied by, at most, around 2.5.[^10-notes-2]
[^10-notes-2]: See <https://statisticsbyjim.com/hypothesis-testing/interpreting-p-values/>