generated from rstudio/bookdown-demo
-
Notifications
You must be signed in to change notification settings - Fork 0
/
04-medical.Rmd
297 lines (188 loc) · 27.5 KB
/
04-medical.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
---
output:
pdf_document: default
html_document: default
---
# Medicine and health
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, comment=NA, fig.align='center', fig.width=5.5,fig.height=5.5)
```
A focus on medicine and public health is used as a context in which to introduce ideas and issues that are more generally relevant. There is wide acceptance that the evidence provided by randomized controlled trials (RCTs) that are conducted to high standards is at the top of a hierarchy of evidence.
Web sources are noted that, because they base their advice on careful and transparent evaluations of the available evidence, can be trusted.
## Useful sources of advice and information
Here are noted web resources that, for many issues of major interest, provide carefully collated assessments that are based on a critical examination of the whole range of evidence that the authors could identify. There may be many published studies, not all of equal quality, that bear on a medical issue. Or there may be one, or a small number, of large-scale high quality trials that are used as a basis for judgment. It is important that assessments are updated as new evidence emerges.
Resources that will be noted are
- [Harding Center for Risk Literacy](https://www.hardingcenter.de/en)
- The [Cochrane Center](https://www.cochrane.org/)
- [Winton Centre for Risk and Evidence Communication](https://wintoncentre.maths.cam.ac.uk/)
[Note also the US Consumer health information site](https://medlineplus.gov/)
### The Harding Center for Risk Literacy {.unnumbered}
There is extensive informative content on the [Harding Center for Risk Literacy site.](https://www.hardingcenter.de/en)[^04-medical-1] Note in particular
[^04-medical-1]: <https://www.hardingcenter.de/en>
- Medical Fact Boxes. Look under **Transfer and Impact \| Fact Boxes**
- Covid-19. Look under **Transfer and Impact \| Corona pandemic**
- Consumer Empowerment. **Transfer and Impact \| Consumer Empowerment**
- Risk and Evidence Communication. Look under **Research \| Risk and Evidence Communication**
- VisRisk - Visualization and Communication of Complex Evidence in Risk Assessment. Look under **Research \| VisRisk**
- Horizon2020 Project FORECEE (on balancing female cancer risk against the risk arising from false-positive alarms and overdiagnosis). Look under **Research \| FORECEE Project**
- Drone Risks Project. Look under **Research \| Drone Risks Project**
- Publications. Look under **The Harding Center \| Publications**
- In the Media. Look under **The Harding Center \| In the Media**
### Harding Center Medical Fact Boxes {.unnumbered}
[Medical fact boxes](https://www.hardingcenter.de/en/transfer-and-impact/fact-boxes)[^04-medical-2] provide visual and tabular summaries of the current "best" evidence, from randomized controlled trials. The comparison may be with a placebo, or with an alternative that is known to be effective. Detailed references are given. Where available, reliance is on Cochrane studies. Available fact boxes, as of March 2024, appear under the headings:
[^04-medical-2]: <https://www.hardingcenter.de/en/transfer-and-impact/fact-boxes>
- Vaccines
- Low back pain
- Antibiotics
- Early detection of cancer: breast, colon, prostate, ovarian
- Cardiovascular Diseases
- Osteoarthritis of the knee
- Tonsil surgery
- Pregnancy and childbirth
Scroll down the home page[^04-medical-3] to find the headings **Risk Literate** (with a link to a risk quiz), and **Quick Risk Test** (with a link to a test that is targeted at medical students and medical professionals. On the development of this latter test, see @cokely2012measuring.
[^04-medical-3]: <https://www.hardingcenter.de/en>
### The Cochrane Center[^04-medical-4] {.unnumbered}
[^04-medical-4]: <https://www.cochrane.org/>
The Cochrane Center's mission is "to promote evidence-informed health decision-making by producing high-quality, relevant, accessible systematic reviews and other synthesized research evidence." They rely heavily on meta-analyses, looking for the balance of evidence across all relevant studies.
## Randomized Controlled Trials vs other study types {#ss:rct}
Two types of study are widely used in medical and other contexts --- randomized controlled trials (RCTs), and population-based studies. These can, in both cases, be broken down into further sub-types. There may be elements of both these types of studies.
### Randomized Controlled Trials --- the gold standard? {.unnumbered}
For simplicity, it will be assumed that there are two alternatives to be compared, in what is known as a two-armed trial. In medical trials the two arms may be a treatment and a placebo, with the placebo (something harmless that has no effect) made to look as similar as possible to the treatment.
An important distinction is between pre-clinical trials, often with animals such as mice, and clinical trials with human subjects. Pre-clinical trials are likely to be conducted with a small number of mice or other animals, and are intended to check whether a drug or other treatment warrants further investigation. Evidence will be presented in Section \@ref(sec:lab) that suggests that these trials are commonly not achieving their intended purpose.
Clinical trials typically are conducted in three phases. Phase I applies the treatment to a small number of subjects and checks that the treatment appears safe. Phase II, perhaps with several hundred subjects, looks for evidence of an effect, and how this might relate to dose level. Phase III trials are conducted with large numbers of subjects, typically spread over multiple treatment centres, and are designed to check whether the treatment is effective, what side effects there may be, and whether there are issues with particular subgroups.
Important issues are
- Use a random mechanism to assign to assign to treatment as against control --- in a medical screening study to "screen" or "not screen"
- The aim is to ensure that apples are compared with apples
- Treatment and control must otherwise be treated in the same way.
- There must be strict adherence to a protocol
- Minor departures that may, e.g., allow unconscious bias in the way that results from the different groups of participants are measured, can invalidate results.
- In clinical trials an ideal, not always possible, is the double blind trial where neither the individual nor the clinician involved knows which drug (or other treatment) the individual has received.
- Results apply, strictly, to those who meet the trial entry criteria
- This may limit relevance to the general population
Especially in medical trials, think carefully about the outcome measure. It is not enough to show that a screening program will pick up otherwise undiagnosed cancers.
- In a screening trial, e.g., for prostate cancer, there are risks both for those who test positive, and for those who test negative.
- The process used to check for cancer may itself bring a smaller or larger element of risk.
- Positives may be false positives, leading to more invasive checks which may themselves carry a risk. Thus, for prostate cancer, a positive PSA test is likely to lead to a biopsy that itself has been estimated to carry a 5% risk of serious side effects, with a much higher proportion of less serious effects [@levitin_2015, p.245]
- Some slow growing cancers may be better left untreated, rather than exposing the patient to a treatment that may itself do serious damage.
For a helpful animated summary of some of the key issues, see:\
<https://www.youtube.com/watch?v=Wy7qpJeozec>. @pashayan2020personalized provide an overview of progress towards the personalized early detection and prevention of breast cancer, noting priority areas for action.
### A note in passing: HiPPO decisions vs A/B testing {.unnumbered}
Randomized studies are widely used outside of medicine. Randomization is a key component of the way that Google and others test out, e.g., the effect of different web page layouts.
- HiPPO = "Highest paid person in the Office."
- The term "A/B testing" is sometimes used to refer to randomized testing of alternatives.
A/B testing helped propel Obama into office! An experiment was conducted that involved 15 million people, or about 25%, from its email list. The signup forms had one of nine different combinations of images with words on which recipients were invited to click, thus:
```{=tex}
\begin{tabular}{lccc}
& Learn more & Join us & Sign up now \\
Obama photo & \ding{56} & \ding{56} & \ding{56} \\
BW photo of Obama family & \ding{52} & \ding{56} & \ding{56} \\
Obama speaking & \ding{56} & \ding{56} & \ding{56} \\
\end{tabular}
```
The black and white photo of the Obama family, with the words "Learn more", generated the most clicks.
@young2014improving gives an account of A/B testing as it might be used for improving library user experience.
### Population studies --- groups must be broadly comparable {.unnumbered}
- Adjust prunes to look like apples (is it possible?)
- Can one ever be sure that the adjustments do the job?
- Potential for biases is greater than for randomized controlled trials.
Where a treatment is compared with a control group, the idea is to use a regression type approach to adjust for differences in such variables or factors as age, sex, socioeconomic status, and co-morbidities. "Propensity score" approaches try to summarize such group differences in a single variable (or, in principle, two or more variables) that measure the propensity to belong to the treatment as opposed to the control group. While their effectiveness for this purpose may be doubted, they can be used to provide insightful graphs that check the extent to which the groups are broadly comparable on the variables and/or factors used to adjust for differences.
The generally negative view of observational studies that is presented in @soni2019comparison (studies in oncology) contrasts with the more positive view offered in @anglemyer2014healthcare (health care), for observational studies that have been conducted with high methodological rigour. The strongest evidence comes, as with the link between smoking and lung cancer, from multiple studies, with different likely biases, that all point strongly in the same direction.
### Issues for all types of study {.unnumbered}
What are the relevant outcome measures?
- e.g., cancer -- malignancies found & removed, or deaths
- deaths from cancer, or from all causes (for some individuals, the treatment may be more damaging in it medium to long term effect than the cancer)
Care is required to deal with survivor, as well as other, biases.
### False Positives
In contexts where the number of false positives is likely to be high relative to the number of true positives, screening programs may have serious downsides that outweigh the benefits.
Excess iron syndrome, known as "haemochromatosis", affects around 1 in 200 in the New Zealand population. Consider a test that has an 80% accuracy, both for detecting the syndrome among those who have it, and for not detecting among those who do not.[^04-medical-5]
[^04-medical-5]: These are known as the "sensitivity", and the "specificity".
Excess iron syndrome, known as "haemochromatosis", affects around 1 in 200 in the New Zealand population. Consider a test that has an 80% accuracy, both for detecting the syndrome among those who have it, and for not detecting among those who do not.[^04-medical-6]\
Among 2000 tested (10 with and 1990 without)
[^04-medical-6]: These are known as the "sensitivity" (true positive rate"), and the "specificity" (true negative rate).
- there will on average be **8** out of 10 true positives
- the 1990 without the syndrome will split up as detected +ve to detected -ve in an 20%:80% ratio. Thus there will be, on average, .2 $\times$ 1990 = **398** false positives.
Overall, those detected as positive split up in a true:false ratio of 8:398, i.e., 8/406 or just under 2% of the positives are false positives. If all positives were detected as positive, the 8/406 would change to 10/406.
A test with this kind of accuracy becomes much more useful in a subset of the population already known to be at high risk, perhaps as identified by a genetic test, or perhaps because of a medical condition commonly associated with the syndrome.
## Hierarchies of evidence
There is broad agreement among medical researchers on the hierarchy of evidence that is set down in the [@us1989guide] guide:
- Level I: Evidence obtained from at least one properly designed randomized controlled trial.
- Level II-1: Evidence obtained from well-designed controlled trials without randomization.
- Level II-2: Evidence obtained from well-designed cohort studies or case-control studies, preferably from more than one centre or research group.
- Level II-3: Evidence obtained from multiple time series designs with or without the intervention. Dramatic results in uncontrolled trials might also be regarded as this type of evidence.
- Level III: Opinions of respected authorities, based on clinical experience, descriptive studies, or reports of expert committees.
The CONSORT 2010 statement [@Schulzc332] sets out detailed criteria for assessing randomized controlled trials (RCTs). For Level II studies, the STROBE guidelines [@erik2007strengthening] set out reporting standards.
Use of such criteria is essential when evidence that is available from multiple randomized controlled trials, perhaps supplemented by evidence from studies at lower levels of the hierarchy, is brought together in a systematic review. Evidence at level II, and especially at level II-3, should ideally be checked by conducting an RCT. This is not always possible, for ethical or practical reasons.
Evidence from one type of study may complement evidence from another. A paper entitled "Resolved: there is sufficient scientific evidence that decreasing sugar-sweetened beverage consumption will reduce the prevalence of obesity and obesity-related diseases" [@hu2013resolved] provides an example. Hu brings evidence from short-term randomized controlled trials together with evidence from long term cohort studies (4 or 8 years) to make a convincing case.
Clinical trials have their own problems and issues. Using evidence from published review sources, @chalmers2009avoidable found issues with the choice of research questions; the quality of research design and methods, and the adequacy of publication practices. They reported that 50% of studies were designed without reference to systematic reviews of existing evidence, and that 50% were never published in full.
Planning and execution failures are set in stone by the time that a research report is sent for review. Pre-registration, involving the depositing a research question and study design with a registration service or journal before starting an investigation, allows peer review feedback that can elicit suggestions for improvement and detect any potential flaws before the study begins.[^04-medical-7] Even more than for industrial quality control, processes are needed that prevent defects from appearing in the first place, with screening for defects at the end of the production line used as a check that those processes have done the job asked of them.
[^04-medical-7]: See <https://plos.org/open-science/preregistration/>
Chapter \@ref(ch:obs) will discuss issues for using observational data as a basis for inferences.
## Avoid, or expose infants to peanuts?
Clinical practice guidelines introduced in or around the year 2000 had "recommended the exclusion of allergenic foods from the diets of infants at high risk for allergy, and from the diets of their mothers during pregnancy and lactation."
It was then a surprise to find that the prevalence of peanut allergy has substantially increased in the recent past, doubling in Europe between 2005 and 2015, suggesting that advice given to parents of young children to avoid foods containing peanuts may have been counterproductive. This reassessment was supported, at least for infants who at four months had either severe eczema or food allergy or both, and thus were at high risk of developing a peanut allergy, by the LEAPS study reported in @du2015randomized.
As noted, the LEAPS study was limited to infants who at four months had either severe eczema or food allergy or both. Infants were stratified into two groups following a skin-prick test, with each group then randomized between those exposed to peanut extract, and those not exposed.
Among 530 infants in the population who initially had negative results on the skin-prick test, the prevalence of peanut allergy at 60 months of age was 13.7% (37/270) in the avoidance group and 1.8% (5/272) in the consumption group.[^04-medical-8] Among the 98 participants who initially had positive test results, the prevalence of peanut allergy was 35.3% (18/51) in the avoidance group and 10.6% (5/47) in the consumption group. There was no between-group difference of consequence in the incidence of serious adverse events.
[^04-medical-8]: There were twelve further infants in this group whose results could not be evaluated.
In both groups, numbers and percentages are for those who were assigned to the group and whose results could be evaluated, whether or not they followed the treatment protocol to which they were assigned. In technical terms, these are results from an "intention to treat" analysis. Such an analysis is designed to mirror what can be expected in practice --- not everyone who starts off in one group will stick to it. It answers questions about what to do with subjects who did not fully follow the treatment to which they were assigned.
The results were followed, in 2016, by changes to guidelines that recommended introduction of peanut and other allergenic foods before 12 months. The assumption that avoiding early exposure to peanuts would reduce risk of later development of peanut allergy was, it was judged, likely wrong for all infants.
## The effectiveness of surgery -- RCTs are challenging
The blurb on the back cover of @harris2016book states that
> For many complaints and conditions, the benefits from surgery are lower, and the risks higher, than you or your surgeon think.
Humans are very prone to the *post hoc, ergo propter hoc* fallacy: "it followed, therefore it was because of" fallacy. Harris argues that unless the benefits of a surgical procedure are clear, the only ethical way forward is to do a randomized trial where the procedure is compared with a sham procedure. Such trials are not easy to design and execute. Nonetheless, there are a number of important cases where such comparisons have been made.
Bloodletting is a prime example of a surgical procedure that has faded away due to evidence, not just of a lack of effectiveness, but of serious harm.[^04-medical-9] The practice attracted widespread debate in the 19th century and into the early 20th century, with its defenders making such claims as
[^04-medical-9]: For the history, see for example @seigworth1980bloodletting
> "blood-letting is a remedy which, when judiciously employed, it is hardly possible to estimate too highly"
A comment in @watkins2000conviction is apt
> Medical evidence is trusted, and we must retain that situation and ensure that it is not abused. It is possible to be an extremely good doctor without being numerate, and not every eminent clinician is best placed to give epidemiological evidence. Doctors should not use techniques before they have acquainted themselves with the principles underlying them.
What are the implications for medical practice?
## Screening for cancer --- how relevant is historical evidence
Screening for cancer is an area where, if the interest is in risk of death, it is necessary to wait for perhaps several decades before there is a high enough number of deaths that results can be usefully evaluated. Changes that can be expected in the interim include:
- Screening may become more sensitive, perhaps picking up a higher proportion of relatively benign cancers that are unlikely to ever have serious effects.\
- There may be an increased ability to distinguish between benign and more aggressive cancers.\
- More effective and/or less invasive treatments may become available, and earlier treatments finessed.
All cause death rates are a more relevant measure than cancer specific death rates, as treatments may themselves have harmful effects. Whether or not results from clinical trials in past decades remain relevant to current circumstances, their results do highlight important questions.
### PSA Screening for Prostate Cancer, & more {.unnumbered}
Numbers (rounded) in the following table are from a Harding Center fact box. They are for men 50 years or older who either did or did not participate in prostate cancer screening, using the PSA test, for 16 years.[^04-medical-10]
[^04-medical-10]: <https://www.hardingcenter.de/en/early-detection-of-cancer/early-detection-of-prostate-cancer-with-psa-testing>
| | 1000 men, No screening | 1000 men, Screening |
|--------------------------|------------------------|---------------------|
| Deaths (prostate cancer) | 12 | 10 |
| --- | --- | --- |
| Biopsy & false alarm | 0 | 155 |
| Unnecessary treatment | 0 | 51 |
About 10 out of every 1,000 men with screening, and 12 out of every 1,000 men without screening, died from prostate cancer within 16 years. This means that 2 out of every 1,000 people could be saved from death from prostate cancer by early detection using PSA testing. Deaths from any cause were around 322 in both groups.
Numbers for benefits are based on four studies with about 77,000 participants (progressive cancer), four studies with about 472,000 participants (overall mortality), and eleven studies with about 619,000 participants (prostate cancer specific mortality). The numbers for harms are based on seven studies with approximately 128,000 participants (false-positive results within three to six participations in PSA testing for early detection) and nine studies with approximately 274,000 participants (over-diagnosis and over-treatment). See the web site for references to the studies.
Unlike the biopsies that may follow a positive PSA test, the PSA test has no direct potential to cause physical harm. Harm results from an undue readiness to use the test result as a reason for further potentially harmful testing and treatment. "Wait and watch" is often the preferred strategy.
See @martin2018effect, @levitin_2015 \[ch. 6\], @fung2020cancer [pp. 278-281], the web page [How Patients Think, and How They Should](https://www.nytimes.com/2011/10/09/books/review/your-medical-mind-by-jerome-groopman-and-pamela-hartzband-book-review.html)[^04-medical-11], and regularly updated summary of the evidence can be found at [PDQ Cancer Information Summaries](https://www.ncbi.nlm.nih.gov/books/NBK65906/).
[^04-medical-11]: <https://www.nytimes.com/2011/10/09/books/review/your-medical-mind-by-jerome-groopman-and-pamela-hartzband-book-review.html>
### Breast cancer screening {.unnumbered}
The @raichand2017conclusions review starts with the comment:
> The recent controversy about using mammography to screen for breast cancer based on randomized controlled trials over 3 decades in Western countries has not only eclipsed the paradigm of evidence-based medicine, but also puts health decision-makers in countries where breast cancer screening is still being considered in a dilemma to adopt or abandon such a well-established screening modality.
The short summary, last updated in October 2019, from the [Harding Center Fact Box for Mammography Screening](https://www.hardingcenter.de/en/early-detection-breast-cancer-mammography-screening), referring to women 50 years (a few trials looked at women aged \$\geq\$40) and older who either did or did not participate in mammography screening for approximately 11 years reads
> Mammography reduced the number of women who died from breast cancer by 1 out of every 1000 women. It had no effect on the number of women who died from any type of cancer. Among all women taking part in screening, some women with non-progressive cancer were over-diagnosed and received unnecessary treatment.[^04-medical-12]
[^04-medical-12]: <https://www.hardingcenter.de/en/early-detection-breast-cancer-mammography-screening>
The chief English source of evidence for the fact box is the Cochrane review @gotzsche2013screening. The eight eligible trials included more than 600,000 women aged between 39 and 74, all reported between 1963 and 1991. One trial was excluded because the randomization had not produced comparable groups. Four trials had inadequate randomization. The three trials with adequate randomization did not find an effect of screening on total cancer mortality.
@loberg2015benefits provide a slightly more detailed breakdown of the evidence, as applied to women who were screened for 20 years, starting at age 50, with mortality assessed at ages 56 to 75 in the UK. Figure \@ref(fig:c-screen) is a visual summary. Interval cancers are cancers that are detected in between regular screens. A prior normal screen may give a false assurance and lead to a delay in seeking help when symptoms appear. See also the regularly updated summary of the evidence at[PDQ Cancer Information Summaries](https://www.ncbi.nlm.nih.gov/books/NBK65906/).
```{r cap2, echo=FALSE}
cap2 <- "Estimates of benefits and harms of screening, as applied to the observed incidence of invasive breast cancer (women aged 50 to 69 years) and mortality (women aged 55 to 74 years) in the UK in 2007."
```
```{r c-screen, echo=FALSE, message=FALSE, warning=FALSE, fig.width=6, fig.height=3.75, fig.pos='H', out.width="60%", fig.cap=cap2}
dset <- data.frame(num = c(200,30,3,15,2,0),
outcome = c("False positive","False positive\nwith biopsy",
"Interval cancer","Overdiagnosed",
"Prevented breast\ncancer death",
"Prevented\noverall death"))
lattice::barchart(outcome~num, data=dset, horiz=T, xlim=c(0,210),
col=rep(c("gray90","white"), c(4,2)),
xlab="Number out of 1000",
panel=function(x,y,...){
ylim <- current.panel.limits()$ylim
lrect(0,ylim[1],210,4.55,
border='white', col='gray95');
panel.barchart(x,y,...);
panel.text(x,y, labels=paste(x), offset=0.3,
pos=c(2,2,4,2,4,4))}
)
```
The area by area phasing of the introduction of a mammography screening program in Ireland over 1994-2011 had the character of a natural experiment, allowing checks on what the before/after difference of each area as it was phased in, as against areas where the rollout occurred earlier or later. @MORAN2022115073 looked at data on the ten-year follow-up of 33,722 breast cancer cases. The conclusion was that "while invitation to screening increased detection, it did not significantly decrease the average risk of dying from breast cancer in the population." The authors did however note that "screening may have helped to reduce socioeconomic disparities in late stage breast cancer incidence."
The historical data does, irrespective of questions regarding its current direct relevance, emphasize the importance of tuning breast cancer checks to the risk profile, of finding ways to distinguish progressive from non-progressive cancer, and of avoiding over-treatment. Early diagnosis may allow the use of less invasive forms of treatment. As argued in @essermanwisdom, women want "better, not more screening".