## Tutorial 3: Probability Distributions and Control Structures
## Clinic on Meaningful Modeling of Epidemiological Data
## International Clinics on Infectious Disease Dynamics and Data Program
## African Institute for Mathematical Sciences, Muizenberg, Cape Town, RSA
## David M. Goehring 2004
## Juliet R.C. Pulliam 2008, 2009
## Steve Bellan 2010, 2012
##
## Last updated by Juliet R.C. Pulliam, May 2019
## Some Rights Reserved
## CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/)
######################################################################
## By the end of this tutorial you should...
## * Feel comfortable working with probability distributions
## * Know how to use conditional structures
## * Be warned about the dangers of loop structures in R
## * Have tried your hands at synthesizing all the new material
######################################################################
## SECTION A. Probability Distributions in R
######################################################################
####################
## Probability distributions
####################
## So far the calculations we have been performing have been doable on
## a modern calculator, so my assertion that R will make things easier
## for you has not seen any confirmation in experience.
## In terms of solving probability problems, R does not provide many
## tools that extend too far beyond your intuition. If you want to
## know the probability of rolling two sixes on two dice you do not
## need R to tell you that 1/6 times 1/6 is 1/36. The two exceptions
## might be the functions factorial() and choose(), which only appear
## on a minority of modern calculators and which can solve a problem
## for you once you know to reach for them. These functions are
## sufficiently self-explanatory with their help files to render a
## treatment of them here unnecessary. A third exception is
## simulation, which you will try your hand at in the Benchmark
## Questions, and is an important component of mathematical modelling.
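## (In case you do want to try them: factorial(4) is 4 * 3 * 2 * 1, and
## choose(5, 2) is the number of ways to pick 2 items out of 5.)
factorial(4)  # 24
choose(5, 2)  # 10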
## Enter probability distributions. R is able to do in a few
## milliseconds what took hours of hunting in tables not 30 years
## ago. If you understand what a probability distribution is, R will
## provide four different utilities for (nearly) any such distribution, as
## follows:
## * Generating individual density values or a plot of the density function
## * Generating random numbers drawn from the distribution
## * Providing cumulative probability for points along the distribution
## * Providing points along the distribution corresponding to a cumulative probability
## Each of these is an important part of the toolkit for working with
## probability distributions.
## Making the functions meet your needs
## The different distributions have different corresponding functions
## in R, but aside from the particulars of their arguments, they
## consist of four functions of the same general form.
## We will use the binomial distribution (which is a
## discrete distribution) as an example for each of our four methods,
## and you should be able to extend that experience to any
## distribution you like.
## As we have seen hints of before, R can generate random numbers from
## any distribution. For discrete distributions, creating a histogram
## of a large number of these values can give a quick picture of the
## distribution. To generate random numbers from the binomial
## distribution, use rbinom():
hist(rbinom(10000, 10, 1/3), freq=FALSE, breaks=-1:10)
## The arguments of rbinom() tell it how many numbers you want to
## generate, how many trials are used to generate the distribution,
## and what the probability of a success per trial is,
## respectively. The extra arguments of hist() tell it that we want it
## to give proportions rather than counts and then how to divide up
## the data into bars.
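## The same call can also be written with rbinom()'s argument names (n,
## size, and prob), which may make the roles of the three inputs easier
## to remember:
hist(rbinom(n=10000, size=10, prob=1/3), freq=FALSE, breaks=-1:10)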
## This is an empirical estimation of the distribution. The R function
## that calculates the probability density function of the binomial
## distribution is dbinom(), and that will give the precise
## theoretical values we may need. What is the probability of having 7
## successes on 10 trials with a probability of success of 1/3?
dbinom(7, 10, 1/3)
## That is, we expect 7 successes about 1.6% of the time. By extension,
## we can generate the complete distribution by passing a vector of
## values to dbinom(), as in
dbinom(0:10, 10, 1/3)
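## As a quick sanity check, the probabilities of all possible outcomes
## (0 through 10 successes) should sum to 1:
sum(dbinom(0:10, 10, 1/3))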
## A useful application of rbinom() and dbinom() is to plot a curve of
## the density function over the histogram of observed values. For a
## discrete distribution such as this, the “curve” will be slightly
## misleading, but we will know what it means. Generating fewer random
## numbers this time,
hist(rbinom(500,10,1/3), freq=FALSE, breaks=-1:10)
curve(dbinom(ceiling(x), 10, 1/3), add=TRUE, col="red")
## The argument ceiling(x) resolves the fact that the x-values on the
## plot are continuous, not discrete. The correction is not needed for
## continuous distributions.
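## As a small aside, you could instead (or in addition) mark the
## theoretical probabilities as points at the centres of the histogram
## bars; with breaks=-1:10, the bar for the value k is centred at k - 0.5:
points(0:10 - 0.5, dbinom(0:10, 10, 1/3), col="blue", pch=19)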
## Alternatively, you may use a paired histogram for comparing
## simulations to predicted densities for discrete distributions. The
## visualization used here is less elegant for discrete distributions
## but is more easily extended to continuous distributions.
## Using dbinom() to calculate cumulative probabilities for discrete
## distributions would be tedious, and we cannot calculate cumulative
## probabilities from the equivalent density values of a continuous
## distribution at all (because that requires integration). The
## solution in R is to use a function, such as pbinom() for the
## binomial distribution, that calculates the cumulative distribution
## function (c.d.f.), that is, the function F (which here takes
## integer values).
# F(x) = P(X <= x)
## As a quick example, what is the probability of getting 2 or fewer
## successes with our chosen binomial distribution?
pbinom(2, 10, 1/3)
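## For a discrete distribution, the c.d.f. is just a running sum of the
## density values, so the following should give the same answer:
sum(dbinom(0:2, 10, 1/3))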
## The last function we need in our probability-distribution toolkit
## is qbinom(), the R function that calculates the inverse cumulative
## distribution function, F^-1, of the binomial distribution. The
## c.d.f. and the inverse c.d.f. are related by
# p = F(x)
# x = F^-1(p)
## As a quick example, where is the 80% threshold in the cumulative
## probability distribution?
qbinom(.8, 10, 1/3)
## The c.d.f. and inverse c.d.f. are particularly useful when
## performing statistical tests.
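## You can also check how pbinom() and qbinom() relate to one another.
## For a discrete distribution the inversion is not exact, because the
## c.d.f. jumps between integer values; qbinom(p, ...) returns the
## smallest value whose cumulative probability is at least p:
qbinom(0.8, 10, 1/3)
pbinom(qbinom(0.8, 10, 1/3), 10, 1/3)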
## I do not know why the R language chose such counter-intuitive
## naming practices for its probability distribution functions. But
## let’s have some mnemonics:
# Function                         Prefix   Mnemonic
#
# random numbers from dist.        r        random
# density of distribution          d        density
# cumulative probability           p        probability (is output)
# inverse cumulative probability   q        quantile (or "qumulative", a "c" turned around)
## Feel free to remember them in any way you like, of course.
####################
## Probability distributions galore
####################
## You may not be familiar with all of these distributions, but this
## list should prove helpful for future reference. Keep in mind that
## some of these distributions are continuous and some are discrete –
## this will affect the allowed inputs of the functions.
# Distribution         c.d.f.      inverse c.d.f.   density     random numbers
# Beta                 pbeta()     qbeta()          dbeta()     rbeta()
# Binomial             pbinom()    qbinom()         dbinom()    rbinom()
# Cauchy               pcauchy()   qcauchy()        dcauchy()   rcauchy()
# Chi-square           pchisq()    qchisq()         dchisq()    rchisq()
# Exponential          pexp()      qexp()           dexp()      rexp()
# F                    pf()        qf()             df()        rf()
# Gamma                pgamma()    qgamma()         dgamma()    rgamma()
# Geometric            pgeom()     qgeom()          dgeom()     rgeom()
# Hypergeometric       phyper()    qhyper()         dhyper()    rhyper()
# Logistic             plogis()    qlogis()         dlogis()    rlogis()
# Log Normal           plnorm()    qlnorm()         dlnorm()    rlnorm()
# Negative Binomial    pnbinom()   qnbinom()        dnbinom()   rnbinom()
# Normal               pnorm()     qnorm()          dnorm()     rnorm()
# Poisson              ppois()     qpois()          dpois()     rpois()
# Student's t          pt()        qt()             dt()        rt()
# Uniform              punif()     qunif()          dunif()     runif()
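## As a quick illustration that the same four-function pattern carries
## over to other distributions, here are the standard normal equivalents
## of the binomial examples above (the normal is continuous, so no
## ceiling() correction is needed when plotting its density):
curve(dnorm(x), from=-4, to=4)           # density function
pnorm(1.96)                              # P(X <= 1.96), about 0.975
qnorm(0.975)                             # about 1.96
hist(rnorm(1000), freq=FALSE, breaks=20) # random draws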
######################################################################
## SECTION B. Control structures
######################################################################
## Now that you have gotten your proverbial feet wet by composing your
## own functions, you are ready to learn how to make your functions
## (or your scripts) more powerful by using control structures. The
## two main varieties of control structures are conditional structures
## and loop structures, but more important than what they are called
## is what they do...
## The semicolon, ";", allows you to put two or more commands
## together on a single line. The commands will be executed
## sequentially, just as if they had been written on separate lines.
## The best use of the semicolon is probably to combine lists of
## assignments into a single line, such as,
x <- 8; y <- 4; z <- 3
c(x, y, z)
## Using the semicolon in this way will condense your code
## and make it look cleaner. In most situations, though, clarity is
## more important than compactness, so in general it is best not to
## string long lists of commands together on a single line using
## semicolons. Remember that the amount of whitespace (such as spaces,
## indentation, and empty lines) in your code will not affect its
## efficiency and that adding whitespace (and comments!) will often
## make your code easier to follow.
## Conditional structures
## “If you are under the age of 6, you cannot ride this ride.”
## “If you are available Friday night, please give me your number.”
## Needless to say, we are all familiar with conditional structures in
## casual speech, and conditional structures in R are not much
## different.
## The command if initiates a conditional structure. (Like function, it
## looks like a function but is not one.) What comes in parentheses
## after if is the condition that must be met for the remaining code to
## be executed. That is, it is a logical expression (meaning it should
## evaluate to TRUE or FALSE) which, if TRUE, tells R to execute the
## next line (or block) of code.
## A silly, but fully functional, example:
age <- 19
if(age >= 6){
  can.ride <- TRUE
}
can.ride
## Here we obviously know the result before we start, but in a
## (just-slightly-less silly) function, we would not:
ride.test <- function(age){
  can.ride <- FALSE
  if(age >= 6){
    can.ride <- TRUE
  }
  return(can.ride)
}
ride.test(5)
ride.test(65)
## Note that the curly brackets, {}, play the same role for
## conditional statements as they do for the function construct: they
## allow multi-line procedures. They can also make your code easier to
## follow, even for single-line procedures.
## As an aside, the above function can (and perhaps should) be written
## more simply as
ride.test <- function(age){
  return(age >= 6)
}
## The conditional structure becomes important when you want different
## functions or operations to be performed depending on some
## condition. This is something that you will often want.
## As an example, consider a function that measures central
## tendency. While you may want to avoid the influence of extremely
## large values by using the median, you may want to forgo it on very small
## samples, replacing the median with the mean. Let us say that if we
## have fewer than 10 measurements in our sample, we will resort to the mean:
central.tend <- function(x){
  my.answer <- median(x)
  if(length(x) < 10){my.answer <- mean(x)}
  return(my.answer)
}
central.tend(c(1,1,3))
## This last example may have raised a red flag about efficiency – I
## determined the median of the data regardless of whether this value
## will ultimately be used in my output. Here, the cost is
## exceptionally low, because that calculation will only be irrelevant
## if there are fewer than 10 items in the vector, in which case it
## took trivially long to execute. But, you can see where there might
## be a problem.
## One solution is to create separate if statements for each
## condition, such as:
# if(x < 0)              # followed by what should happen if x < 0
# if(x == 0)             # followed by what should happen if x equals 0
# if((x > 0) & (x < 1))  # followed by what should happen if x is between 0 and 1, exclusive
# if(x >= 1)             # followed by what should happen if x >= 1
## Note that defining conditions which are not mutually exclusive may
## make your code hard to follow and therefore render your results
## unpredictable without detailed examination.
## Another solution, particularly useful for binary (either-or) cases,
## is to use the command else, which follows the if procedure and
## gives what to do if your conditional expression evaluates to
## FALSE. If you are careful about your use of brackets, you can nest
## if and else commands to create more complicated conditional
## structures.
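## For example, here is a small sketch of how nested if and else
## statements could handle the four mutually exclusive cases listed
## above (the function name describe.x is made up for illustration, and
## x is assumed to be a single number):
describe.x <- function(x){
  if(x < 0){
    answer <- "negative"
  }else if(x == 0){
    answer <- "zero"
  }else if(x < 1){
    answer <- "between 0 and 1"
  }else{
    answer <- "at least 1"
  }
  return(answer)
}
describe.x(0.5)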
## Here are some additional options for writing the central tendency
## function above. Using if and else, as described above:
central.tend2 <- function(x){
  if(length(x) >= 10){
    my.answer <- median(x)
  }else{
    my.answer <- mean(x)
  }
  return(my.answer)
}
# and using the ifelse() function, which is similar but more compact:
central.tend3 <- function(x){
  my.answer <- ifelse(length(x) >= 10, median(x), mean(x))
  return(my.answer)
}
## There are two potential big errors that may occur in implementing a
## conditional structure. One occurs if the conditional expression
## evaluates to NA or NaN – try this out yourself:
if(Inf - Inf > 0){ x <- 5 }
## To make sure that your if statements don’t abort like this, you will
## sometimes want to use the is.na() function, which returns TRUE if the
## value being evaluated is NA or NaN.
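## For example, a minimal sketch of guarding a condition with is.na():
x <- Inf - Inf            # evaluates to NaN
if(!is.na(x) & x > 0){    # the condition is now FALSE rather than NA
  y <- 5
}else{
  y <- NA
}
y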
## The second big mistake that you can make is assuming that R will act
## on the elements of a conditional expression that is a vector. Try this:
if(c(T, F, F)){ 1 + 1 }
## Depending on your version of R, this either generates a warning
## (older versions use only the first element of the vector) or an error
## (in R 4.2 and later, a condition of length greater than one is not
## allowed). Either way, R will not act element-by-element on the
## vector. On reflection, this makes sense because R wouldn’t know what
## to do with a vector of logical values – the if statement has no
## inputs or outputs, just a conditional procedure. In other words, you
## shouldn’t expect the if statement to behave like a function because
## it isn’t one.
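## If what you actually mean is "do this if ANY element is TRUE" or "do
## this if ALL elements are TRUE", say so explicitly with any() or all(),
## which collapse a logical vector to a single value:
any(c(T, F, F))  # TRUE
all(c(T, F, F))  # FALSE
if(any(c(T, F, F))){ 1 + 1 }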
####################
## Loop structures
####################
## R employs a number of common loop structures familiar to those of
## you with some programming experience. However, those of you without
## programming experience may actually have an edge here.
## The reason is that, in general, in mathematical programming it is
## much more efficient to perform what traditionally has been done
## with programming loops using vector calculations. You have already
## seen one of the basic methods used to replace loop structures, the
## apply() function.
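## As a small illustration, here are three ways to square the numbers 1
## through 5; the vectorized version is the most concise and, for large
## vectors, usually the fastest:
squares <- numeric(5)             # using a for loop
for(ii in 1:5){
  squares[ii] <- ii^2
}
squares
sapply(1:5, function(ii) ii^2)    # using sapply(), a relative of apply()
(1:5)^2                           # using vectorized arithmetic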
######################################################################
######################################################################
## This concludes Tutorial III. The benchmark questions should help
## solidify your understanding of the material.
## The probability-distribution section of this tutorial was aided by
## information available at
## http://www.stat.umn.edu/geyer/5102/examp/rlook.html.
## If you are unfamiliar with or rusty on your understanding of the
## binomial distribution, you may also want to work through the introductory
## Binomial Distribution tutorial, available here:
## https://github.com/ICI3D/RTutorials/blob/master/binomialDistribution.R?raw=true
######################################################################
## SECTION C. Benchmark Questions
######################################################################
####################
## 1. The standard medication used to treat some disease is known to
## produce a severe side effect in 20% of the patients treated. A
## pharmaceutical company develops a new drug to treat the disease and
## tests it on 9 patients. There are no side effects observed in these
## patients.
## a. What is the probability of treating 9 patients without side
## effects using the standard medication?
## b. Generate a plot of the probability of observing no side effects
## in patients receiving treatment, as a function of the number of
## patients in the study, assuming the new drug causes the severe side
## effect with the same probability as the old.
## c. Add a horizontal red line at p = 0.025, visually depicting the
## number of patients minimally required for a study to show that the
## new medication produces side effects in fewer patients than the
## standard treatment (traditional statistical significance).
####################
####################
## 2. Re-do the suite of calculations and plots found in the Tutorial
## using the Poisson distribution (with lambda = 3) rather than the
## binomial distribution. (Hint: Make sure the histogram has an
## appropriate range with breaks.)
####################