generated from rstudio/bookdown-demo
-
Notifications
You must be signed in to change notification settings - Fork 0
/
06-weighting.Rmd
393 lines (315 loc) · 20.6 KB
/
06-weighting.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
# Weighting effects that skew statistics {#sec:yule1}
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, comment=NA, fig.align='center', fig.width=5.5,fig.height=5.5)
```
```{r dataInput, echo=FALSE}
options(scipen=10)
load('data/UScovid.RData')
```
## Covid-19 deaths --- comparing countries
```{r caps}
cap0 <- "US data for proportions who died from Covid-19, and in total,
for the 13 months up to 31 January 2021. Vaccination of a substantial
part of the population, and the emergence of new variants, will,
six months or more later, have substantially changed the pattern
of relative risks by age."
cap_props <- paste0("United States reported deaths per 100 from Covid-19
(and, for comparison, deaths per 100), for the time period from ",
wkStartEnd[1], " to ", wkStartEnd[2], ". The total number
reported was ", sum(USdeadpop1yr$covdeaths),".")
fromTo <- format(UScausesByCats[1,2:3],"%b %d %Y")
```
```{r riskByAge, fig.width=4.5, fig.asp=0.68, fig.pos='H', out.width='60%', echo=FALSE, fig.cap=cap0}
suppressPackageStartupMessages(library(mgcv))
rcd2.gam <- gam(sqrt(relcovdead)~s(age, k=15), data=USdeadpop1yr[-86,])
rd2.gam <- gam(sqrt(reldead)~s(age, k=15), data=USdeadpop1yr[-86,])
ylab <- 10^seq(from=-4, to =2,by=1)
predC <- log(100)+c(2*log(predict(rcd2.gam)),NA, rep(log(USdeadpop1yr[86,'relcovdead']),2))
pred <- log(100)+c(2*log(predict(rd2.gam)),NA, rep(log(USdeadpop1yr[86,'reldead']),2))
atx <- seq(from=0, to=80, by=10)
library(lattice)
trellis.par.set(theme)
gph <- xyplot(predC+pred~I(c(0:84,NA,88,92)), type='l',
xlab="Age", ylab="Deaths per 100",
scales=list(y=list(at=log(ylab),labels=paste(ylab)),
x=list(at=c(atx,90), labels=c(paste(atx),"85+"))))
update(gph,auto.key=list(columns=2, lines=TRUE, points=FALSE, text=c('from Covid-19','All causes')))
```
```{r rates, echo=FALSE}
ltge65 <- with(USdeadpop1yr, c("<65"=sum(covdeaths[age<65])/sum(pop[age<65]),
">=65"=sum(covdeaths[age>=65])/sum(pop[age>=65])))
```
Figure \@ref(fig:riskByAge) shows stark differences by age in US Covid-19 death rates. The US death rates per 1000 for under age 65 as opposed to age $>=$ 65, were `r round(1000*ltge65,2)` US totals for infection rates and for hospital admissions were also impacted by age structure, but to a lesser extent. More recent Covid-19 variants would present a somewhat different picture.
Countries with a lower proportion of their population aged \>=65 would, if the death rates for each of the two groups are similar to US rates, have lower overall death rates. The following compares the overall deaths rates for the US with what might, if these figures carried across, be expected for Kenya and for Italy.
```{r ov65, echo=FALSE}
us65plus <- with(USdeadpop1yr, sum(pop[age>=65])/sum(pop))
it65plus <- with(italyPop5s, sum((M+F)[Age>=65])/sum(M+F))
ke65plus <- with(kenyaPop5s, sum((M+F)[Age>=65])/sum(M+F))
deathsPer100K <- rbind("Percentage 65 or more"= 100*c(US=us65plus, Italy=it65plus,Kenya=ke65plus),
"Expected deaths per 100,000"=100000*c(ltge65[1]*(1-c(us65plus, it65plus, ke65plus))+
ltge65[2]*c(us65plus, it65plus, ke65plus)),
"Reported deaths per 100,000"=100000*cf3countries[,3]/cf3countries[,2])
deathsPer100K[3,1] <- 100000*with(USdeadpop1yr, sum(covdeaths)/sum(pop))
round(deathsPer100K,1)
```
Between country comparisons are hazardous. The dependence of reported case numbers on testing rates and reporting protocols makes it likely that they will be substantial undercounts, to an extent that varies from country to country.
## University admissions data --- Simpson's paradox {#sec:Y-S}
Admissions data for University of California Berkeley in 1973 showed a curious anomaly. Overall admission rates strongly favoured males, while in individual departments the rates mostly favoured females. The table shows percent admission rates, with number applying shown in brackets underneath.
```{r admitrates, comment=NA}
## Tabulate by Admit and Gender
byGender <- margin.table(UCBAdmissions,
margin=c("Admit","Gender"))
admitRate <- round(100*prop.table(byGender,
margin="Gender")["Admitted", ], 1)
admit <- cbind(OVERALL=admitRate,
round(100*prop.table(UCBAdmissions,
margin=c("Gender","Dept"))["Admitted", , ], 1))
tots <- cbind(apply(byGender,2,sum),
margin.table(UCBAdmissions,
margin=c("Gender", "Dept")))
totals <- matrix(paste0('(',tots,')'),nrow=2)
admit <- rbind(admit, totals)[c(1,3,2,4),]
## tab <- xtable::xtable(admit, digits=1)
## print(tab, type='latex', comment=FALSE, floating=FALSE, scalebox=0.9)
library(kableExtra)
x <- knitr::kable(admit, booktabs=TRUE, align="r")
x <- column_spec(x, 2, width = "5em", bold=TRUE)
column_spec(x, 3:8, width = "2.5em")
## print(admit, quote=F, right=T)
```
Figure \@ref(fig:UCBgph1) provides a graphical summary.
```{r cap6, echo=FALSE}
cap6 <- "UCB admossion data for 1973, for males and females, by
department. Department labels range from A to E. Notice that
the largest differences in admission rates are for departments
A and B, in both cases favouring females."
```
```{r UCBgph1, fig.width=6, fig.asp=0.6, fig.pos='H', out.width="72%", fig.cap=cap6}
oldpar <- par(mar=c(3.1,3.1,2.6,0.6), mgp=c(2,0.5,0))
applied <- margin.table(UCBAdmissions, margin=2:3)
pcAdmit <- 100*prop.table(UCBAdmissions, margin=2:3)["Admitted", , ]
byGender <- 100*prop.table(margin.table(UCBAdmissions,
margin=1:2), margin=2)
col2 <- c("black","gray60")
dimnam <- dimnames(UCBAdmissions)
mfStats <- data.frame(Admit=c(pcAdmit[1,],pcAdmit[2,]),
Applicants=c(applied[1,], applied[2,]),
mf=factor(rep(dimnam[['Gender']],c(6,6)),
levels=dimnam[['Gender']]),
Department=rep(dimnam[["Dept"]],2))
xlim <- c(0, max(mfStats$Admit)*1.025)
ylim <- c(0, max(mfStats$Applicants)*1.075)
plot(Applicants~Admit, data=mfStats, xlim=xlim, ylim=ylim,
fg="gray", cex.lab=1.25, xaxs="i",
col=rep(col2,rep(6,2)),type="h",lwd=3,
xlab="UCB Admission rates (%), 1973",
ylab="Number of applicants")
pcA <- rbind(pcAdmit[1,], apply(pcAdmit,2, mean)+2, pcAdmit[2,], rep(NA,6))
pcA[2,3] <- pcA[2,3]+1
appA <- rbind(applied[1,], apply(applied,2, mean)+80,
applied[2,], rep(NA,6))
deptNam <- dimnam[[3]]
for(j in 1:ncol(appA))
lines(pcA[,j], appA[,j], type="c", col="gray")
text(pcA[2,],appA[2,],deptNam)
##
par(xpd=TRUE)
text(byGender[1,1:2], rep(par()$usr[4],2)+0.5*strheight("^"), labels=c("^","^"),
col=col2,cex=1.25,srt=180)
text(byGender[1,], par()$usr[4]+1.4*strheight("A"),
labels=paste(round(byGender[1,],1)),cex=0.85)
text(byGender[1,1:2]+c(-3.5,3.5), rep(par()$usr[4],2)+2.65*strheight("A"),
labels=c("All males","All females"), pos=c(4,2), cex=1.25)
par(xpd=FALSE)
abline(h=200*(0:4),col="lightgray",lty="dotted")
abline(v=20*(0:4),col="lightgray",lty="dotted")
legend("topleft", col=col2,lty=c(1,1), lwd=0.75, cex=0.9,
y.intersp=0.65, legend=c("Males","Females"),bty="n")
par(oldpar)
```
There are thus three different broad admission patterns. The big differences in admission rates were in departments A (82.4%-62.1%=20.3%) and B (68%-63%=5%), in both cases favouring females. In the other four departments, differences were 3.8% or less, and split very nearly equally between slightly favouring females and slightly favouring males. Also, the relative numbers of males and females applying did not show the same big differences as in departments A and B.
```{r admitrates123, comment=NA}
cap123 <- "Comparion of admission rates (percent) and numbers of males and
females applying, for departments A, B, and CDEF which combines
numbers for departments other than A and B."
library(kableExtra)
UCB123 <- aperm(as.table(apply(UCBAdmissions,1:2,
function(x)c(x[1],x[2], CDEF=sum(x[3:6])))), c(2,3,1))
names(dimnames(UCB123))[3] <- "Dept"
byGender <- margin.table(UCB123,
margin=c("Admit","Gender"))
admitRate <- round(100*prop.table(byGender,
margin="Gender")["Admitted", ], 1)
admit <- cbind(TOTALS=admitRate, round(100*prop.table(UCB123,
margin=c("Gender","Dept"))["Admitted",,], 1))
tots <- cbind(apply(byGender,2,sum),
margin.table(UCB123, margin=c("Gender", "Dept")))
totals <- matrix(paste0('(',tots,')'),nrow=2)
admit <- rbind(admit, totals)[c(1,3,2,4),]
## tab <- xtable::xtable(admit, digits=1)
## print(tab, type='latex', comment=FALSE, floating=FALSE, scalebox=0.9)
x <- knitr::kable(admit, booktabs=TRUE, align="r",
caption=cap123, label=NA)
x <- kable_styling(x, latex_options = "HOLD_position")
x <- column_spec(x, 2, width = "5em", bold=TRUE)
column_spec(x, 3:5, width = "2.25em")
```
- The overall male rates are weighted (825+560):(325+(417+191+373 = 1385:1306 or 1.06:1, between an overall AB rate of 62.5%, and the CDEF rate of 25.5%.\
- Overall female rates are weighted 133:1702 = 1:15.8 for departments A and B, as against departments CDEF. Overall female rates are, accordingly, strongly weighted towards the 26.5% rate for other departments.
### UCB Admissions Data -- Another perspective {.unnumbered}
```{r cap7, echo=FALSE}
cap7 <- "UCB admossion data for 1973 --- another perspective."
```
```{r UCBdiffs, fig.width=6, fig.asp=0.625, fig.pos='H', out.width="72%", fig.cap=cap7}
par(mar=c(3.1,3.1,2.1,0.6), mgp=c(2,0.5,0))
diffAdmit <- t(pcAdmit)
diffAdmit <- data.frame(dept=rownames(diffAdmit), stringsAsFactors=FALSE,
Male=diffAdmit[,1],Female=diffAdmit[,2],
diff=diffAdmit[,2]-diffAdmit[,1])
diffAdmit$mfratio <- signif(apply(applied,2, function(x)x[1]/x[2]),2)
ord <- order(diffAdmit$diff)
plot(diff ~ I(1:6), data=diffAdmit[ord,], pch=dept, cex=1.25,
ylim=c(-5.25,21.5), xlab="Order of difference", fg="gray",
ylab="Female - Male difference (%)")
with(diffAdmit[ord,], text(diff~I(1:6), labels=paste(mfratio),
pos=c(rep(4,5),2)), cex=0.8)
leg <- c("Male:Female ratios of numbers",
"who applied are shown alongside",
"each letter")
legend("topleft", legend=leg, box.col="gray")
```
In order to understand how the overall imbalance between numbers of males and females arose, it was necessary to break the data down by department. The major driver of the overall imbalance was the low numbers of females, relative to males, applying to department A (and, to a lesser extent, B) where admission rates were highest.
See <https://www.youtube.com/watch?v=ZDinnCwP3dg> for an animated video that explains the Yule-Simpson paradox.
### A note on Lord's paradox {.unnumbered}
The same sorts of paradoxical effects can be found in regression. The Yule-Simpson paradox may be regarded as a special case of Lord's paradox, described in @lord1967paradox.[^06-weighting-1] Any attempt to attach meaning to regression coefficients can be highly misleading, unless it is clear that effects of all relevant variables are properly accounted for. It is rarely easy, with observational data, to be sure that this has been done effectively, a point that will be taken up in Chapter \@ref(sec:reg).
[^06-weighting-1]: See also @tu2011statistical, pp.60-71
## Comparing unvaccinated with vaccinated
Once high enough vaccination rates have been achieved, a greater number of Covid-19 infections will be found among the vaccinated than among the vaccinated. Suppose that, for a particular age group and for a particular vaccine, the risk rate for severe Covid-19, is 5% for unvaccinated versus 1% for vaccinated, i.e. a relative risk of 5:1. Consider the following scenarios. The final column has the relative numbers, for unvaccinated versus vaccinated with Covid-19:
| Unvaccinated | Vaccinated | Relative numbers with severe Covid-19 |
|--------------|------------|---------------------------------------|
| 200 | 800 | 10 unvaccinated : 8 among vaccinated |
| 100 | 900 | 5 : 9 |
| 0 | 1000 | 0 : 10 |
At a certain point, provided the risk ratio is greater than 1.0, the higher number will be from the vaccinated. Of course, the 100% vaccination rate scenario in the final row is unlikely to be realized.
### Report found almost 60% of cases were vaccinated {.unnumbered}
As of August 15 2021, Israeli data showed that 301 of those hospitalized with severe Covid-19 were fully vaccinated, while 214 were not. We need to compare, not the counts, but the proportions of vaccinated and unvaccinated, i.e., look at the percentages that are shown in brackets following the counts.[^06-weighting-2] Numbers given are of those aged 12 or more.
[^06-weighting-2]: Data are from <https://bit.ly/3h1mwXQ>, on the website <https://www.covid-datascience.com/>.
| | Unvaccinated | Fully vaccinated | |
|---------------------|--------------|------------------|-----|
| Number (aged \>=12) | 1,302,912 | 5,634,634 | |
| Severe cases | 214 (16.4%) | 301 (5.3%) | |
The relative risk is 16.4/5.3 $\simeq$ 3.1. Because no account has been taken of age differences, this still underestimates the effectiveness of vaccination. A fairer picture is:
| | Unvaccinated | Fully vaccinated | |
|--------------|--------------|------------------|-----|
| Number \<50 | 1,116,834 | 3,501,118 | |
| Severe cases | 43 (3.9%) | 11 (0.3%) | |
| ---- | ---- | ---- | |
| Number \>=50 | 186,078 | 2,133,518 | |
| Severe cases | 171 (91.9%) | 290 (13.6%) | |
The relative risk is 3.9/0.3 $\simeq$ 13 for those under 50, and 91.9/13.6 $\simeq$ 6.8 for those over 50, i.e., in both cases much greater than the 3.1 ratio obtained when the split between under 50s and 50 or more is ignored. Even this does not do justice to the data, which needs to be split into smaller age ranges.
<!-- Here, "variable" is used as generic name for either continuous -->
<!-- variables, or categorical variables, otherwise called factors. -->
## Further illustrative examples
### Does Baclofen help in reducing pain? {.unnumbered}
<!-- ```{r baclofen, out.width="100%"} -->
<!-- knitr::include_graphics("./pictures/baclofen.png") -->
<!-- ``` -->
```{r cap9}
cap9 <- "Data are pain reduction scores. Subgroup numbers, shown
below each point, weight the overall average (''ALL'') for
baclofen towards the high female average, and for no baclofen
slightly towards the low male average."
```
```{r painkiller, fig.cap=cap9, fig.width=7, fig.asp=0.4, fig.pos='H', eval=TRUE}
library(lattice)
parset <- simpleTheme(cex=1.35, pch=16:17,
col=c("darkblue","turquoise"))
gabalong <- data.frame(values=unlist(DAAG::gaba["30",])[-1],
sex=rep(c("male", "female", "ALL"), rep(2,3)),
trt=rep(c("Baclofen","No baclofen"),3))
gph <- stripplot(sex~values, groups=trt, data=gabalong,
par.settings=parset,
xlab=list("Average reduction: 30 min vs 0 min",
cex=1.0),
scales=list(cex=1.0),
panel=function(x,y,...){
panel.stripplot(x,y,...)
ltext(x,y,paste(c(3,9,15,7,22,12)), pos=1,
cex=0.8)
}, auto.key=list(columns=2, points=TRUE, cex=1.0))
gph + latticeExtra::layer(panel.abline(h=1.35, col="gray"))
```
In work reported in @cohen1996, researchers were comparing two analgesic treatments, without and with baclofen. When the paper was first submitted for publication, an alert reviewer spotted that some of the treatment groups contained more women than men, and asked whether this might account for the results.
For a fair overall comparison:
- Calculate means for each subgroup separately.
- Overall treatment effect is average of subgroup differences.
The effect of baclofen (reduction in pain score from time 0) is then:
- Females: 3.479 - 4.151 = -0.672 (-ve, therefore an increase)
- Males: 1.311 - 1.647 = -0.336
- Average, male & female = -0.5 $\times$ (0.672+0.336) = -0.504
### Web page revenue per click {.unnumbered}
@smith-sd (p.111) describes an experiment where a US Internet company collected data that compared the effectiveness of two strategies. In the 1-click strategy, an advert appeared on the website's first page. The 2-click strategy required the user to click on a keyword which then led to a page with the advert. The response (total dollar value of purchases) was compared between the two.
The following, with numbers changed to make the comparison relatively simple, shows a scenario that is not unlike that given by Smith:
| 1-click | | | | 2-click | | | |
|---------|-------|---------|-----|---------|-------|---------|-----|
| Revenue | Users | RP1000 | | Revenue | Users | RP1000 | |
| \$2500 | 200 | \$12.50 | | \$3000 | 300 | \$15.00 | |
The 1-click strategy appears to give a better return.
<!-- Now break the data down, looking separately at US and International users. -->
<!-- \begin{tabular}{l r r r r r r r} -->
<!-- & \multicolumn{3}{l}{1-click} & & \multicolumn{3}{l}{2-click}\\ -->
<!-- Location & Revenue & Users & RP1000 && Revenue & Users & RP1000\\ -->
<!-- US & \$1.8 & \textcolor{red}{70} &\$27.71 && \$1.2 & \textcolor{red}{50} & \$24.00\\ -->
<!-- Int. & \$1.1 & \textcolor{red}{180} &\$6.11 && \$0.50 & \textcolor{red}{90} & \$5.56\\ -->
<!-- & \multicolumn{3}{c}{$\frac{180}{70}$=\textcolor{red}{2.57}} & & \multicolumn{3}{c}{$\frac{90}{50}$=\textcolor{red}{1.8}}\\ -->
<!-- \end{tabular} -->
Now, see how the numbers divide up when a split is made between US and international visitors to the site. The numbers (thousands) for the two strategies divide up thus, with the revenue per thousand users given, in each case, in brackets:
| | 1-click | 2-click |
|-----|---------|---------|
| US | 50 | 80 |
| | (\$32) | (\$30) |
| Int | 150 | 120 |
| | (\$6) | (\$5) |
The 1-click strategy clearly gives a better return, both for US and for international users.
The overall figure is dominated by the result for the 150,000 international users (as opposed to 50,000 US) in the 1-click sample. This compares with the much weaker weighting towards international users (120,000 as opposed to 80,000) in the 2-click sample.
In the case that Smith reported, the type of analysis shown was followed up with a randomized experiment (an A/B test), where the probability of assignment to 1-click, as opposed to 2-click, was the same for both classes of user.
## Cricket Bowling Averages
### Runs (R), wickets (W) and runs per wicket ({\em RPW}) {.unnumbered}
\vspace*{-8pt}
```{=tex}
\begin{center}
\begin{tabular}{lrrr||rrr||rrr}
\hline
& \multicolumn{3}{c}{1st innings} & \multicolumn{3}{c}{2nd innings} &
\multicolumn{3}{c}{Overall} \\
\cline{2-4} \cline{5-7} \cline{8-10}
& R & W & {\em RPW} & R & W & {\em RPW} & R & W
& {\em RPW}\\
Bowler A & 40 & 4 & {\em 10.0} & 240 & 6 & {\em 40.0} & 280 & 10 &
{\em 28.0}\\
Bowler B & 70 & 5 & {\em 14.0} & 50 & 1 & {\em 50.0} & 120 & 6 & {\em 20.0} \\
\hline
\end{tabular}
\end{center}
```
### Fair comparison: Compare runs per wicket ({\em RPW}) {.unnumbered}
\vspace*{-5pt}
```{=tex}
\begin{center}
\begin{tabular}{lrr||rr||rr}
\hline
& \multicolumn{2}{c}{1st innings} & \multicolumn{2}{c}{2nd innings} &
\multicolumn{2}{c}{Overall} \\
\cline{2-3} \cline{4-5} \cline{6-7}
& RPW & {\em W} & RPW & {\em W} &
RPW & {\em W} \\[4pt]
Bowler A & 10.0 & (4) & 40.0 & (6) &
$\frac{10+40}{2} = 25$ & (10)\\[4pt]
Bowler B & 14.0 & (5) & 50.0 & (1) & $\frac{50+14}{2} = 32$ & (6)\\[4pt]
\cline{2-3} \cline{4-5} \cline{6-7}
\end{tabular}
\end{center}
```
## Epistatic effects in genetic studies
In population genetics, Simpson's paradox type effects are known as epistasis. Most human societies are genetically heterogeneous. In San Francisco, any gene that is different between the European and Chinese populations will be found to be associated with the use of chopsticks! If a disease differs in frequency between the European and Chinese populations, then a naive analysis will find an association between that disease and any gene that differs in frequency between the European and Chinese populations.
Such effects are major issues for gene/disease population association studies. It is now common to collect genetic fingerprinting data that should identify major heterogeneity. Providing such differences are accounted for, large effects that show up in large studies are likely to be real. Small effects may well be epistatic.