---
title: "The Order of the Statistical Jedi: <div style='font-size: .8em;'>Responsibilities, Routines, and Rituals</div>"
author: "Dustin Fife"
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
output:
  bookdown::gitbook:
    number_sections: false
documentclass: book
bibliography: [book.bib, packages.bib, book_citations.bib, references.bib]
biblio-style: apalike
link-citations: yes
description: "This is a minimal example of using the bookdown package to write a book. The output format for this example is bookdown::gitbook."
---
# Preface
<hr>
<img src="drawings/coverjedi.jpg" alt="drawing" class="cover"/>
<!--chapter:end:index.Rmd-->
```{r, include=FALSE}
knitr::opts_chunk$set(
  comment = "#>", echo = FALSE, message = FALSE, note = FALSE, warning = FALSE, cache = TRUE
)
```
# Introduction {#intro}
I was an undergraduate once. And...let’s just say the statistics force wasn’t strong with this one. At least initially. Given that my innate intellectual prowess was somewhat lacking, I knew I needed a boost to get into graduate school. So, I figured one of the best things I could do, aside from getting a statistics minor (more on that later), was to write an honors thesis.
And so I did.
In retrospect, it was kind of a dumb idea. I wasn’t terribly interested in the project. And I hate inconveniencing people. Yet here I was, barging into professors’ offices, begging for them to let me inconvenience them *and* their students to conduct a study.
I still have nightmares about my data collection endeavors.
Well, I finally collected the data I needed and went to my mentor for analysis. Cuz I had no idea what to do with that shiz.
Well, it turns out, neither did my mentor. Unbeknownst to me, the data I collected and the research question I asked required an extremely sophisticated analysis. Neither my mentor nor I had any idea how to even begin to start answering the question.
So, I did what any daunted student would do...I went to the [Statistics Jedis](https://statistics.byu.edu/) themselves. You see, my friends, at my alma mater, we were graced with the presence of a statistics jedi training facility (translation...a department of statistics). And since I was a padawan in training there anyhow, I went to see one of my professors.
<img src="drawings/schaalje.jpg" alt="drawing" style="width:500px; float:right;"/>
Dr. Bruce Schaalje. With silver-streaked hair, a continually cocked head, and eyebrows that remained fixed on a distant problem, the guy was brilliant, yet humble. He wore jeans and brown leather boots, always boots. In fact, outside his earshot, we actually called him "Boots."
As I approached his office, I imagined what it would look like to see raw statistical power manifest. Would he, perhaps, close his eyes and hum gently? Would he sit namaste-style as insights seeped like mist into his flannel shirt? Or perhaps his power would manifest by shouting brilliant insights as he pitter-pattered on a keyboard like a hacker in a movie. Or maybe it would be like Neo in the Matrix...Dr. Schaalje could just look at the data spreadsheet and see the hidden patterns in the data.
But no.
The man graphed the data.
Yeah. He graphed it.
And that, my friends, is where his brilliance really showed. The solution to the problem was, as it turned out, extremely simple. One only needed to plot the data.
Yet at the same time, I thought, "That’s it? I could have done that!"
I have thought over the years about my initial observation of a statistician in his native environment. The absolute mysticism I felt said more about me than about statistics. It was only mystical because I didn’t understand statistics.
Let me say that again, but more tweet-worthy:
```{block, type='rmdtweet', echo = TRUE}
Those who treat statistics with reverence or a measure of mysticism only reveal that they don't really understand statistics.
```
There’s nothing mystical about statistics. That’s both good and bad. Good for you. Bad for me. It’s bad for me because I have lofty and excessive delusions of grandeur. I would love to see myself as a statistical Jedi with superpowers that exceed my own ambitions. Alas, that is not the case. There’s nothing magical or mystical about statistics. It’s actually surprisingly simple.
That really sucks when you want to believe you’re special.
And it wasn’t entirely my fault, I believe. Textbook writers of old have been guilty of making it way more complicated than necessary. Sums of squares, degrees of freedom, rejecting null hypotheses in infinite universes of repeated experiments?...come on, guys! Do we really need to invent or perpetuate convoluted concepts to make ourselves feel smarter than we actually are?
Allow me to burn the veil before you.
Statistics isn’t that complicated.
And that’s where the good news comes in. You, my young Padawan, need not be daunted by the journey before you. Whatever PTSD you suffer from past statistical experience...let it go. Statistics as you knew it is dead to you. Prepare to march upon the enlightened path; one where statistics is not only understandable, but intuitive.
But it does take repetition.
## The power of repetition (and my...umm...*complicated* history with statistics)
My first year at [BYU in Provo, UT](https://www.byu.edu) I took the undergraduate statistics class in the psychology department. The class was taught by Dr. Kim, a Korean man with a very thick accent. It took me six weeks before I finally realized "shima" was "sigma."
But I can handle accents. The bigger problem was the class took place at 8AM, and the guy's voice was deep and soothing. The guy could have recorded meditation voiceovers and made millions. The moment his mouth opened, my eyelids grew heavier. And heavier.
And I understood *nothing.* I was a really bad student. I was so confused it hurt to even try. I didn't even know what I didn't know. I didn't even know how to get help. I was *helpless*.
The professor felt bad for me, I think. Somehow, I ended up with a passing grade, having not learned a lick of statistics.
*Finally*, I thought, *that nightmare is over!*
Oh no, good friends. The nightmare had just begun.
About the time I was applying to graduate school, I had a painful realization: I was a horrible candidate. It didn't make sense to waste hundreds of dollars to apply to graduate school when I likely wasn't going to be accepted anywhere but Trump University.
So I decided to take another year.
But what could I do in a year and a half that would make a difference?
The thought came before I could dismiss it. And that thought brought cold, dark fear to the depths of my bones.
I could minor in statistics.
Frightening, that.
But, I knew it would be a nice addition to my application. So I did it.
The next year and a half was quite painful. I took a graduate-level introductory stats class, followed by a second graduate stats course. I then took the introductory stats class in the statistics department, along with matrix algebra and statistical computing.
Then something miraculous happened. I was in my experimental design class, which would have been my *seventh* stats class. I sat near the front, posture sunken, dreading the next concepts, knowing I'd have to struggle for hours to make sense of what the professor said.
But something odd happened. I *knew* what he was saying. And it wasn't even a struggle to understand him. It was as if I had suddenly awoken from a dream and could speak a foreign language. I *knew* the language of statistics.
And since then, it hasn't always been easy, but it's been exciting.
And I think it's important to realize that statistics is a language. It's a different way of thinking about how the world works, one that requires training of the mind.
## But there's a better way
Lemme take a trip into the deep recesses of your thoughts. I assume you are saying to yourself, "Really?? WTF? I don't have time to take SEVEN stats classes to get it!"
I know. I'm surprised I did.
But, let me assure you it won't take seven classes. Why? Because you're way smarter than me. How do I know? I'm a statistician. I'm playing the odds on this one.
More importantly though, there's a troubling flaw in how statistics has been taught in the past. My first indication was a conversation with [Dr. Robert Terry](https://www.ou.edu/cas/psychology/people/faculty/robert-terry) of the [University of Oklahoma](https://www.ou.edu). The man sported a gray-haired goatee and a relaxed demeanor--the kind of "wait 'til the cows come home" demeanor only a southerner can manage.
"There may be a problem with how we teach statistics," he said.
He grinned, raising an eyebrow, waiting for the question.
"And what problem is that?" I asked.
He turned toward his computer, and opened an image:
[image of Robert's findings]
"What am I seeing?" I asked, cocking my head. Lines and circles splattered across the image in a haphazard pattern. I couldn't make any sense of it.
"These are visual representations of cognitive maps. You can think of it like a graph, showing how the mind processes statistical information. This image," he pointed to the left graph, "is how experts think about basic statistical concepts. The right image," he pointed to the other figure, "is how some of our *best* students conceptualize statistics."
He leaned back in his chair, grinning smugly. "So, what do you make of it?"
"They're different."
"Yes!" He leaned forward, eyes aglow. "They're not only different. We would expect them to be different. But they share *no* resemblance with one another."
"But how?" I asked. "In what ways are they different?"
"Experts think of statistical concepts as interconnected. Students, on the other hand, see these concepts as quite distinct."
That was in 2008.
For *years* I have thought about what Robert showed me. Experts. Students. Mind maps. Very *different* mind maps.
Does this imply we're teaching statistics incorrectly? Are we misleading students somehow to see things as distinct when, in fact, they're quite connected?
Why yes, yes we are.
But more on that in a minute.
## The Curriculum Hasn't Changed in 50 Years!
Joseph Rodgers, another of my mentors and eventual collaborators, once gave an address to Division 5 of the American Psychological Association. (This is the stats-nerds/qualitative methods division). In his address, he showed his syllabus from his graduate statistics class from the 1970s.
If one were to review the topics in his syllabus, they would soon discover the standard curriculum hasn't changed in 50 years.
Let that sink in for a minute.
Although statistics and our understanding of statistics have rapidly evolved over the last 50 years, the way we teach students has *not*.
Does anyone see a problem with this?
Well, I sure do.
So, the standard curriculum hasn't changed *and* students' mind maps are very qualitatively different from the mind maps of experts.
I think it's time for a change. How about you?
## The General Linear Model Approach
The standard stats curriculum does indeed teach statistical concepts as very distinct: t-test. ANOVA. Regression. Correlation. Chi-square. Distinct concepts! Ne'er dare to confuse them because they are as different as apples and dirt!
Actually, no. They're not.
It turns out t-tests, ANOVAs, regressions, and correlations are all the same thing. So too are factorial ANOVAs, ANCOVAs, and multiple regressions.
They're the same freaking thing. They are all what we call the General Linear Model.
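Don't take my word for it. Here's a minimal sketch with made-up data (the numbers below are invented purely for illustration): run a two-group t-test, then run a general linear model with the group entered as a dummy-coded predictor. Same t statistic (up to sign), same p-value.
```{r glm-ttest-demo, echo=TRUE}
# Illustrative data only: two groups, 20 scores each
set.seed(1234)
d <- data.frame(
  group   = rep(c("control", "treatment"), each = 20),
  outcome = c(rnorm(20, mean = 50, sd = 10), rnorm(20, mean = 55, sd = 10))
)

# The "t-test" (pooled variances)...
t.test(outcome ~ group, data = d, var.equal = TRUE)

# ...and the "regression": a general linear model with a dummy-coded predictor.
# The group coefficient has the same t statistic (up to sign) and the same p-value.
summary(lm(outcome ~ group, data = d))
```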
Why then do people teach them as distinct?
Beats me!
Okay, well that's not entirely true. See the fascinating historical note box below....
```{block, type='rmdnote'}
In the early history of statistics, there were two major powerhouses: Karl Pearson and Ronald Fisher. Karl, the older of the two, was the one who popularized regression and correlation and developed a bunch of really cool procedures. Ronald Fisher, the young whippersnapper, had the gall to write a note to a journal where Karl had recently published and publicly *correct* him for a mistake in calculating something called degrees of freedom.
The nerve of Ronny.
That started a long, hateful relationship between the two. Ronald Fisher, who used statistics primarily for agricultural experiments, realized you could perform a regression with a mathematical shortcut that he called the Analysis of Variance (ANOVA). He knew (as did Karl) that ANOVA was just a different way of computing regression. But, being a spiteful academic, Ron decided to use very different terms for many of the same concepts Karl used (e.g., instead of calling variances, well, 'variances,' he called them mean squared errors). After all, who would want to give credit to their mortal enemy?
So Ronny, again, an experimental researcher, promoted his own names and mathematical shortcuts for experiments, while Karl continued to use standard terminology. Over time, people began to develop some weird misconceptions about ANOVAs, t-tests, and regression: somehow, ANOVAs could demonstrate causation while regression could not, regression could only be used for numeric variables, and ANOVAs were somehow more appropriate and informative for experiments.
Yeah, no.
They really are the same thing, and it's been quite a tragedy to see people misunderstand them.
```
That is, perhaps, the most controversial and important distinction between my approach and the standard approach. I believe it's borderline criminal to teach students these are distinct procedures when they are, in fact, the same thing.
My approach offers several advantages:
* It is easy for students to transition to experts; there's no need to reshuffle their mind maps.
* When students are taught they're different, they must invest enormous intellectual resources *deciding* which analysis to do. By the time they get to interpreting the results, they're out of resources, so they make poor decisions. When taught using my method, on the other hand, they hardly have to spend *any* effort deciding which analysis to do. Instead, they invest that mental energy in interpreting results.
* Teaching from a GLM perspective makes the transition to advanced statistical procedures intuitive. A mixed model is just a GLM, but with different slopes/intercepts for each cluster. General*ized* linear models are gener*al* linear models with different distributions of residuals (and different link functions). Structural Equation Modeling is just a bunch of GLMs slapped into a simultaneous equation. Most students taught the traditional way, again, have to make a very dramatic mental shift before they learn these advanced procedures. When taught my way, these advanced methods are minor extensions.
* GLMs do more than test hypotheses. T-*tests* are called tests for a reason; they are designed to test for statistical significance. Great. What if you're interested in estimation? Or visualization? Or prediction? Or model comparison? Sorry, standard statistics approach. You can't handle it! (Or at least, you're agnostic about how to handle it). The GLM, on the other hand, handles these situations with aplomb.
* Students can use the same analytic framework for *all* statistical analyses they perform. With a GLM perspective, they merely identify their predictor(s) and their outcome, then visualize, then study estimates, then study probability estimates. With the other approaches, on the other hand, not only do they have to click on different menus (t-test or ANOVA or regression or chi-square), but the type of information they must interpret differs from one analysis to another. (Although p-values, the most easily misunderstood statistic of all time, are a notable exception).
I'm sure there's more, but my pizza just arrived, so I'm going to wrap up this chapter.
My approach isn't a mere modification, based on personal taste, of the existing statistics curriculum. It is an explicit abandonment of the existing curriculum. Burn that curricular bridge down, I say. Let's start anew.
This is the "anew."
My approach is characterized by visualization and a general linear model approach. The whole purpose of my curriculum is to teach students to *really* understand what their data are trying to say. But, for too long, the standard approach to statistics has stuffed the mouth of our data with cotton balls; it couldn't speak, couldn't say your interpretation was wrong, couldn't reveal interesting patterns you missed.
It could barely breathe.
And now, my friends, we are drowning in a replication crisis--a crisis caused by suffocating our data.
This curriculum is designed to give voice to the data. If you adopt my approach, you will gain insights you would have missed using the standard approach. It will keep you from publishing embarrassing errors and yield novel insights.
In short, my approach is about data literacy.
Y'all ready to begin?
<!--chapter:end:01-intro.Rmd-->
# Ethics
You know what's kinda funny.
I have *never* seen an ethics chapter in statistics textbooks.
"Well," you might say, "I don't see a need. You don't teach ethics in physics, do you? In logic? In math? Math is math, Math doesn't care what ethical standards we hold."
True. Math doesn't have feelings.
Neither does statistics.
But we (humans, that is...if you're not a human and you're reading this, let me be the first to welcome you to our planet) do!
And here's the thing: it is actually quite easy to lie using statistics. We can lie to others and we can lie to ourselves. It is very possible, if not likely, that if two statisticians analyze the same dataset, they will arrive at different conclusions.
Sometimes those conclusions are similar. Sometimes they're not.
And, with enough searching, we can almost always find *something* in our dataset that tickles our intellect. Problem is, we never really know if that intellect-tickling insight is real or spurious.
This has always been the case, by the way. But we didn't really realize it until 2011.
What's so magical about 2011? Well, I had a birthday in 2011. So there's that.
But there's oh so much more.
## History of the Replication Crisis
Prior to 2011, research in psychology (as well as biology, sociology, medicine, exercise science, etc.) was business as usual. Scientists were pumping out scientific truths faster than technicians could stock the printer. We were quite proud of that, patting ourselves on the back and feeling quite right about the truths we had revealed.
Then 2011 happened.
### Diederik Stapel
It started with a fellow named Diederik Stapel. Stapel was a Dutch social psychologist. The man was a rising star, earning high-impact publications and awards. That was, until one of his graduate students grew suspicious. You see, Stapel *always* performed his own statistical analyses. None of his students ever saw the data.
Odd, that.
So one of his students reported their suspicion to the university. The university conducted an investigation and discovered that, for many years, Stapel had been outright fabricating his data.
Swaths of publications had to be retracted.
Suddenly, scientists started to worry about what they could trust.
Oh, but there was more to come.
### Daryl Bem
So, Stapel was a crook. (Although, he seems to have rehabilitated since. Good for him!). So, as long as most scientists aren't crooks, science can be believed....right?
Well, no, unfortunately.
Daryl Bem, also a social psychologist (this one at Cornell), was likewise a luminary in his field. In 2011, he published an article in the Journal of Personality and Social Psychology that "proved" humans are capable of precognition.
Yeah.
What was odd about this incident is that Daryl hadn't fabricated his data. Instead, he had used *standard statistical procedures* to justify his conclusions.
Apparently, the reviewers of the article, despite their skepticism of the conclusions, trusted the methods enough to let the publication pass.
Others were not so trusting. Once again, scientists began to feel uneasy. Daryl Bem used the *same* statistical procedures the vast majority of scientists used, and yet he concluded something so outlandish.
But there was one more incident in 2011 that would solidify our unease.
### The "P-Hacking" Article
This one is actually quite funny. A trio of researchers (Joe Simmons, Leif Nelson, and Uri Simonsohn) published a paper in which they "proved" that listening to the song "When I'm Sixty-Four" made people *younger* than a control group (who listened to the song "Kalimba").
What?
Yes, apparently, listening to a song reversed the flow of time.
Nice.
Except, this is absolutely ridiculous. And that was *exactly* the point of their article.
What they showed is that researchers could engage in practices they called "researcher degrees of freedom" to essentially find support for any conclusion they want, even ridiculous conclusions.
This paper was a pretty big deal. Why?
As researchers read this, they realized that many of the "researcher degrees of freedom" the authors cautioned against were activities in which they themselves routinely engaged.
Uh oh.
This article was later dubbed the "p-hacking" article.
What is p-hacking?
Glad you asked.
## P-hacking
Before I talk about what p-hacking is, let me give you a brief overview of how researchers use probability to make decisions about data. When a researcher collects data, they use statistics to summarize what they found in the data.
Well, it turns out statistics are *designed* to detect patterns. That's good, right?
Yes and no.
The problem is data *always* show patterns. For example, you may collect data on your quality of sleep and notice that you tend to sleep better whenever polar bears migrate closer to the arctic circle.
Nice!
Perhaps you ought to set up an automatic feeder within the arctic circle so you can always sleep well.
Are you seeing the problem?
There's no reason to suspect polar bear migrations have anything to do with your sleep patterns. This is what we call a "spurious relationship." A spurious relationship occurs when two variables appear to be related to one another, but in reality any association between them is nothing more than chance.
So, think about that: statistics is designed to detect patterns. Some patterns are spurious. So...maybe that pattern you discovered is spurious?
That's always the risk when doing data analysis. And, unfortunately, you never know whether that thing you detected is real or spurious.
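Here's a tiny simulation to drive the point home (mine, not real polar bear data): even when two variables are generated completely independently, small samples will regularly cough up correlations that *look* like a pattern.
```{r spurious-demo, echo=TRUE}
# Simulate 1,000 "studies," each correlating two completely unrelated variables
# (15 observations apiece). Any apparent relationship is pure chance.
set.seed(42)
chance_correlations <- replicate(1000, cor(rnorm(15), rnorm(15)))

summary(abs(chance_correlations))   # chance alone produces plenty of sizable correlations
max(abs(chance_correlations))       # the most "impressive" spurious finding of the bunch
```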
But, there are safeguards we can put in place. Often these safeguards utilize probability; we compute the probability of obtaining our results. ^[To those Bayesians, I *know* your objection and I'll get to that in my [Bayesian versus Frequentist](#bayesprobability) chapter. Hold your horses! In fact, I'm very much going to liberally abuse misconceptions about p-values and probability in this section for the sake of simplicity. I'll cover the nuances in later chapters.]
That's all well and good, but the Achilles heel of probability is what we call multiplicity.
Say I want to win a game of chance. To do so, all I have to do is roll a six on a die. What is my probability of rolling a six? 1/6.
What if, instead of rolling once, I roll a hundred times? What's my probability of rolling a six now? It ain't 1/6! It's much higher. Why is it higher?
Because of multiplicity.
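To put a number on it (a quick back-of-the-envelope calculation, nothing more): the chance of at least one six is one minus the chance of rolling *zero* sixes.
```{r multiplicity-dice, echo=TRUE}
# Probability of at least one six = 1 - P(no sixes)
1 - (5/6)^1     # one roll: about 0.17 (i.e., 1/6)
1 - (5/6)^100   # one hundred rolls: essentially 1
```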
Likewise, when we collect data, we'll generally compute some sort of probability of obtaining our results. What we'd like is to find a high probability our data support our hypothesis. If there's a 99% chance our hypothesis is true, that's good...right? (BTW, this is a very poor representation of the statistics we'd actually compute when doing data analysis, but you get the idea).
That probability (99%) can only be believed if there's no multiplicity. Just like our dice-rolling example, researchers too can engage in multiplicity. What does that look like?
Well, maybe a researcher analyzes how treatment and control groups differ on a memory task. Darn. There's a very small difference between the two groups, and it happens to be in the opposite direction the researcher hypothesized. And their probability estimate isn't very favorable.
Undeterred, the researcher decides the last four questions on the memory task should be thrown out. Why? I don't know. Maybe they think participants got too tired.
Again, they compute the difference between the two groups. Now, the difference between the two is in the right direction, but it's still small. Once again, they compute some probabilities and find the estimates aren't that favorable. Maybe the probability of their hypothesis being true is only 50%. Well, you can't win big in science if your probability's only 50%.
You know....those 20 or so people who participated looked a little lethargic. Let's go ahead and delete their data.
Okay, well, that helped a little. The difference between the two groups is larger and the probability rises to 75%.
So let's now "control" for intelligence. (We'll talk more about what this means in our [conditioning](#multivariate-glms-conditioning-effects) chapter).
Then let's delete that outlier.
Then let's delete that guy's scores because...well, I don't know. I've got a gut feeling. And, besides, he wears Old Spice. Everyone knows anybody who wears Old Spice can't be trusted.
In the end, the researcher may obtain a very impressive probability estimate, but not because he discovered some amazing truth. It's only because of multiplicity.
AKA p-hacking. P-hacking is short for "probability-value hacking," which means to keep trying a bunch of different analyses until one's probability estimate is favorable.
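If you want to see multiplicity in action, here's a rough sketch of a simulation (the analysis "variants" and cutoffs below are hypothetical choices of mine, not anything from a real study): the groups truly don't differ, but because the researcher gets to try several versions of the analysis and keep the best one, a "favorable" result shows up noticeably more often than the nominal 5% of the time.
```{r phacking-demo, echo=TRUE}
# No true effect exists in these simulated data; we only "find" one by shopping around.
set.seed(1)

one_phacked_study <- function(n = 40) {
  d <- data.frame(
    group   = rep(c("control", "treatment"), each = n / 2),
    outcome = rnorm(n)                                     # no real group difference
  )
  p1 <- t.test(outcome ~ group, data = d)$p.value                        # analysis 1: everyone
  p2 <- t.test(outcome ~ group, data = d[1:(n - 10), ])$p.value          # analysis 2: drop the "lethargic" folks
  p3 <- t.test(outcome ~ group, data = d[abs(d$outcome) < 2, ])$p.value  # analysis 3: drop "outliers"
  min(p1, p2, p3) < .05                                    # did *any* version come up "significant"?
}

mean(replicate(2000, one_phacked_study()))  # the false-positive rate climbs above the advertised 5%
```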
```{block, type="rmdnote"}
It may be a bit misleading to say nobody realized multiplicity was an issue. People knew multiplicity was a problem. However, it was usually understood in terms of testing a bunch of different hypotheses. People hadn't really realized that testing the *same* hypothesis, but in *different ways* also constitutes multiplicity.
```
P-hacking is what Simmons, Nelson, and Simonsohn were criticizing. And, nearly *everybody* was practicing multiplicity [@John2012].
Big oops.
A few years later, in 2015, a gentleman by the name of Brian Nosek led a research team in a massive effort to replicate some of psychology's most recent prestigious findings. To do so, they found 100 studies from the top journals in the field to replicate. Unfortunately, only 36% reached the standard threshold for publishability upon replication.
Double big oops.
Since 2011 (and especially since 2015), psychology has been undergoing a "replication crisis" [@Pashler2012a]. That sucks, but it's also good because a lot of good things are coming out of it. One of those very good things is the Open Science Movement.
What is the Open Science Movement?
To understand this movement, it's important to see what this movement was a response to.
## The Scientific Method Movement
Chances are, you are a recipient of the scientific method movement pedagogy. This movement began to emerge in the early 1900s. Back then, scientists started to consider *how* one goes about finding truth. Do we meditate? Do we ask questions of a magic 8-ball? Do we sit in an empty room and think about tacos and beach waves?
We could try these things, but how do we know if we're actually divining truth?
Alas, you can never really know.
Quite by chance, scientists began to believe truth was independent of the person seeking it. It doesn't matter whether the scientist believes in black holes, fairies, or unicorns; these things either exist or they do not.
So, if we accept truth is truth, regardless of our own beliefs, how do we uncover the truth?
The answer for these earlier scientists was objectivity. If the scientist can somehow put aside their beliefs, values, biases, and expectations, they might more easily uncover truth. And it makes sense; we all know that our biases can get in the way of seeing evidence. Just think of flat-earthers! The evidence is overwhelming, yet their biases seem to always find a way to dismiss even the most convincing of evidence.
Easier said than done, am I right? It's easy to say objectivity is the answer, but if objectivity is the answer, shouldn't we have some *objective* method of deriving truth? And shouldn't it be the case that *anybody* who applies this method will arrive at the same answer?
Why yes. (At least, that's what we came to believe in the early 1900s).
This "method" became known as the scientific method. The scientific method is a rough overview of how scientists might investigate a research question (and by so doing, it is an overview of how they might uncover truth).
In reality, there's no "the" scientific method. But usually, people would agree it consists of at least the following procedures:
1. Observe a phenomenon
2. Formulate a hypothesis to explain the phenomenon
3. Design an experiment to test the hypothesis
4. Objectively measure the outcome
5. Refine and repeat
Simple, right? It seems to meet the criteria: the steps are objective and seemingly easy to follow. And, it seems to have had success in the past. Supposedly, when we apply the scientific method, we are guarding against our own subjectivity and the results gleaned are near-perfect representations of truth.
Right?
Well, no. Alas, the scientific method only has the *illusion* of objectivity. But humans are still humans. It's impossible for us to be objective. And by mindlessly following the scientific method procedure, we may trick ourselves into believing we're objective when we're not.
Also, information gleaned from the scientific method is anything but certain.
It's quite a shame, actually. When I was taught about science as a kid, I had this impression of science as this codified set of facts that were certain and could be trusted. Likewise, I assumed all scientists were quintessential objectivists.
Ha. HAHAHAHAHAHAHAHAHAHAHAHAHAHA!
Scientists are very much humans. Sometimes they're really stupid, or stubborn, or ignorant, or mean, or blind. Sometimes scientists promote their theories for no other reason than to protect their pride. Sometimes scientists are petty and suppress information that contradicts their self-interests.
That really shouldn't surprise you. Scientists are people too.
Alas, subscribing to the scientific method actually makes it more likely that, when you *are* being detrimentally subjective, you will (falsely) feel self-assured you're being a good scientist and uncovering truths.
For scientists, the replication crisis served as a wake-up call. It was nearly undeniable proof we had fooled ourselves into believing in our methods.
## Values versus Ethics
The old-school values (the Scientific Method values) relegated discussion of ethics and values to philosophers. To these scientists, "ethics" were nothing more than intrusions on their scientific freedom.
"What? I can't inject poison in my participants? But, it's for science!!!!!"
Yeah, not cool, man.
Remember, the scientists of older years valued objectivity. What better way to be ethical than to embed ethics into "rules" that clearly delineate what is and what is not ethical behavior? And that has largely been the approach to ethics in science: a series of "thou shalt nots" that separate what's cool from what's uncool.
"Thou shalt not harm participants."
"Thou shalt use deception only when the benefits outweight the risks."
"Thou shalt compensate participants."
"Thou shalt gain permission from people before they participate."
"Thou shalt not fabricate your data."
The problem with this approach is plain to any lawyer (or anybody, for that matter). There's always a loophole. There's always room for the gratuitous exercise of shenanigans.
"Oh, I must compensate participants? Well, I'll compensate them with good advice."
"Oh, I can't fabricate my data? Well, it's not fabrication if I merely modify my existing data."
So, shenanigans breed more shenanigans, we add more rules, tighten the noose, annoy well-intentioned researchers with ever-longer training modules, and so on.
This rule-based approach focuses on *restrictions* to research, which really only seem to annoy people.
But there's a better way--values. There's little doubt one's values can be a powerful motivator. Consider the following quote from Karl Maeser, an American educator:
> I have been asked what I mean by ‘word of honor.’ I will tell you. Place me behind prison walls–walls of stone ever so high, ever so thick, reaching ever so far into the ground–there is a possibility that in some way or another I may escape; but stand me on the floor and draw a chalk line around me and have me give my word of honor never to cross it. Can I get out of the circle? No. Never! I’d die first!
Daaaaammmmmnnnnn, dude!
If Mr. Maeser's words are to be believed (and, by all accounts, the fellow was an upstanding individual), is there any doubt he would be ethical in his scientific pursuits?
No doubt.
Maeser doesn't need rules. The man has values to guide his decision-making.
What's my point? My point is that the objective, rule-based approach to ethics has serious limitations. Rules are easy to circumvent, they're overly complicated, and they can stifle creativity. But, for one who is motivated by *values*, their values become the *motivation* behind their research.
As I once said in a paper I wrote:
> Rules invite exploitation, whereas values motivate exploration. Rules limit freedom while values instill purpose. Rules are limitations. Values invite possibilities. Values exist from idea inception, to study design, to data collection, to data analysis, to publication, to post-publication, and they are the guiding force behind the research process itself. Once ethics shift away from rules and boundaries (extrinsic motivation), and toward values (intrinsic motivation), researchers can more readily govern themselves.
This is not to say we shouldn't have rules. In an ideal world, every scientist would have strong values guiding their research and would always act in perfect alignment with those values. (Ha!)
But, let's be reasonable. Some people will always be jerks.
I'm not advocating we abandon rules. Rather, I'm advocating we shift the focus away from teaching ethics as a set of ever-expanding rules, and instead teach emerging scientists to espouse the values of good scientists.
For those who teach or will ever teach emerging scientists, remember this: as a teacher, you have enormous influence over the values to which your students will subscribe. Think of Mr. Miyagi versus xxxx--it's no wonder "Daniel-son" embodied the values Mr. Miyagi taught (inner peace and martial arts as a defense only), while xxxx embodied the values of his sensei (no mercy and win at all costs).
Be a Mr. Miyagi.
The key to change is to embed values in our students. By so doing, the culture itself changes.
But what are these values?
## The Open Science Values
Btw, this section is a much more condensed and irreverent version of a paper I published, available at:
The scientific method movement emphasized objectivity and valued (near) certainty. In many ways, the open science movement is a rejection of these ideas. Rather than promoting objectivity and certainty, the open science movement promotes a completely different set of values [@Fife]:
1. Protecting humanity
2. Seeking truth
3. Openness and transparency
4. Humility and skepticism
5. Dissemination
Why these values? Because the open science movement recognizes it's *impossible* to be objective. So, rather than pretending we're not human, instead we should (a) leverage our strengths as humans, and (b) put safeguards in place that minimize the damage we can do to our pursuit of truth.
And how do we do that?
### 1. Protecting humanity
You know what would advance science quite quickly? Forcing people to do what we want. We could, for example, estimate exactly how damaging cigarettes are by simply forcing healthy people to smoke and seeing the damage it does. But that would be mean. And it wouldn't be worth it. Sure, we'd learn a lot about smoking, but we'd lose our humanity. Not cool.
This first value recognizes that *no* scientific pursuit is worth sacrificing the well-being of humanity. When we keep that in mind, it makes it much easier to make ethical decisions.
### 2. Seeking truth
I get it. We all want to get a job, acquire tenure, make a name for ourselves, be rich, attractive, popular, etc. etc. These motivations will certainly play a role in motivating us as scientists. Right now, I'm wanting you to like my book, and maybe I'm hoping you'll make a donation so I can buy a new bandsaw, or subscribe to my YouTube channel. I wouldn't mind if my wonky approach to teaching statistics becomes the standard and they interview me on live television and ask me what my inspiration was for my innovative teaching approach. And, I'm not going to lie, that motivation is one of the reasons I'm sitting here, during Covid lockdown, simultaneously helping my kids do their homeschool while trying to write jokes that keep your interest.
You don't get that kind of detail unless it's true.
Sometimes, these motivations conflict with our desire for truth. Back in 2018 I wrote (and received) a grant to develop statistical software. It was brilliant, I tell ya. I promised to develop point-and-click software that focused on visualization, estimation, Bayesian analysis, and was built atop R so that *any* R package developer could easily build a point-and-click interface for their R packages. Yes, it was brilliant.
Except that software already existed (mostly). I didn't know it until after I received the grant. Both JASP and Jamovi already did 70% of what I wanted.
Suddenly, my vision of fame and notoriety started to fade. That was *my* idea! They stole it!
What do you do? Here I was with grant money to develop something someone else had already developed.
I considered going forward with my original plan. To hell with JASP and Jamovi. I'll make software, the likes of which the world has never seen. I'll make them regret they ever thought of an idea before I did. Because I was going to make it better. Faster. More truthy. More....
What the hell was I thinking?
I know what I was thinking. My desires came in conflict with one another. Sure, I wanted notoriety, prestige, credit, ....
But I also genuinely wanted to make a difference in science. I wanted to promote sound statistical practice.
So then I asked myself....
> *If I really cared about truth, what would I do?*
Once I asked myself that question, the answer was clear: I needed to join forces with JASP and Jamovi.
So I did.
Now, years later, I have accomplished my original goal: I created software that focuses on estimation and visualization, but I didn't have to create software built atop R, nor did I have to build Bayesian-focused software. JASP and Jamovi did that for me.
Maybe I won't get as much credit as I'd originally hoped, but my original purpose (advocating for sound statistical practices) is much further along than if I had tried building my own software from scratch.
When we seek truth above our own personal ambitions, it always seems to work out better, both for science as a whole, and for ourselves personally.
### 3. Openness and transparency
Remember how I said the scientific method advocated for objectivity? Also remember how I said we can't possibly be objective?
I hope you remember. I just said it a few sections ago. Maybe you're getting tired. Go take a nap, then come back. This part's important.
I hope you had a good nap.
Anyway, where were we? Ah, yes. Objectivity.
Yeah, we're human. We can't be objective. Confirmation bias threatens everything we do.
How do we combat confirmation bias?
Openness and transparency.
Somewhere out there, your scholarly enemies await, looking for the moment to pounce upon a frailty in your study. Maybe said enemy is Reviewer 2, who insists your paper doesn't adequately cite their research or undermines a finding they published 30 years ago. Or maybe said enemy is some scientist across the nation who stumbles upon your paper.
Fortunately, for you and me, there's no shortage of arrogant scholars waiting to pounce on a weakness.
That's actually a very good thing. This batch of misfit scholars is what allows science to be as self-correcting as it is.
But, we can thwart the self-corrective mechanisms built-in to science by masking information. We can, for example, refuse to make our data publicly available, or we can hide the fact we auditioned 30 different analyses before finding something "statistically significant," or we can remove any mention of failed hypotheses and only report those that tell a sexy story.
In short, to circumvent the self-corrective mechanisms in science, we only need to hide our weaknesses.
When we hide things, there are certainly short-term gains. Maybe our paper speeds through the publication pipeline or our unambiguously positive report that supports our finding is highly cited.
Long term, however, nobody benefits. Eventually, the weaknesses of our findings will come to light, but only after researchers across the globe waste thousands of hours and dollars attempting to replicate something that never should have survived in the first place. And when that happens, progress grinds to a halt and our reputations suffer.
To prevent such backpedaling, and to ensure science truly is self-correcting, it's best to be open and transparent from the get-go. After all, if there is a glaring weakness in our data analysis or research design, don't we want others to discover it?
### 4. Humility and skepticism
As scientists, we need to be skeptical of claims we hear. Skepticism is, perhaps, our best tool against being deceived.
By the way, that includes being skeptical of our *own* findings. To be skeptical of our own findings requires a great deal of humility.
So, skepticism and humility are really two sides of the same coin.
Let me give you an example. A few years ago, Nosek and Motyl performed a study where they measured participants' political ideology, then subsequently showed them words in various shades of gray. The participants then had to select a grayscale shade that matched the shade of the word. What they found was that participants with more extreme political ideologies tended to pick more extreme shades to match the words they saw. In other words, those at the political extremes *literally* saw colors as more black and white.
That there would be a TED talk-worthy finding.
But, Nosek and Motyl were skeptical of their own finding. So, they attempted to replicate the results in a new sample.
And found nothing.
What humility that required. And it cost them a publication. But science is better for it.
It's not easy for scientists to be humble, especially when we're so freaking smart. And, it's doubly hard when the findings we've discovered and the theories we've developed are called into question. We often tie our personal identities to our science. The temptation to double down against a challenger is great, but science will be better when we choose to be humble.
### 5. Dissemination
Hey, you remember back when people believed in science? Remember? Back before the whole autism and vaccinations debacle, before flat-earthers, before global warming denialism, and before people got super offended when you asked them to wear a Covid-preventing mask?
Yeah. Those were the good old days.
But, we live in a different time.
Why are people so dumb? I'm sure there's lots of causes: the dominance of social media, change in diets, the advances of medicine (and thus the decline of natural selection's power in weeding out morons), UFO kidnappings.
But, I suspect, part of the reason things are different now is the fault of scientists. Scientists can really suck at communicating. And that problem is exacerbated by Pulitzer-chasing journalists looking for catchy headlines for their articles.
"Eating Chocolate Is Healthy, Says a World-Reknown Food Scientist!"
"Chocolate Will Kill You, Says a World-Reknown Food Scientist!"
"Gravity Exists, Says Newton!"
"Gravity Doesn't Exist, Says Einstein!"
If you're not comfortable with today's scientific headlines, just wait six months. It'll change.
Why?
The problem is, as you'll soon learn, we deal with what we call "noisy" data. When data are noisy (i.e., when it's hard to pick out the good stuff), conclusions are ambiguous. But, nobody wants to publish a paper saying a conclusion is ambiguous. So, scientists wrangle the data (i.e., p-hack....we'll get to that later) until they get an unambiguous conclusion.
Then, months or years later, somebody else comes along with a different research agenda. They might take similar data (i.e., *noisy* data) and wrangle it a different way. And they come to a different conclusion.
What does this have to do with dissemination? Good question. I hope scientists are becoming increasingly careful and cautious with their conclusions. But, they might not be the best at communicating that. Alas, a scientist's care and caution might not translate to journalists who really want to report catchy headlines.
What else tends to happen is scientists get used to speaking a particular language. I call that language nerd-speak. It's the sort of language nobody understands but fellow scientists. It's like a letterman jacket....but for nerds. Only those who've earned their letter get to wear such fancy pantsy language.
Want an example?
(You probably said no, but I'm going to pretend you said yes). Here are some article titles from a journal I follow:
"A Square-Root Second-Order Extended Kalman Filtering Approach for Estimating Smoothly Time-Varying Parameters"
"Finite Mixtures of Hidden Markov Models for Longitudinal Responses Subject to Drop out"
"Model Selection of Nested and Non-Nested Item Response Models Using Vuong Tests"
Most of you are probably saying, "I know what 'the' means!"
Yeah, I get it. And, in some sense, it's kinda necessary to have a discipline-specific vernacular. Saying, "Model Selection of Nested and Non-Nested Item Response Models Using Vuong Tests," is way easier than saying, "Alright, so this paper's about comparing two statistical models, one of which is more complicated than the other. Oh, and Item Response Models are used for educational testing. So, yeah, this paper's about using two different models for fitting data from educational testing. And Vuong was some guy who invented a way to compare two models. It's pretty cool."
But, we have to be able to communicate our research to the public. Why is that important?
Let me say it again:
```{block, type="rmdtweet"}
Hey, you remember back when people believed in science? Remember? Back before the whole autism and vaccinations debacle, before flat-earthers, before global warming denialism, and before people got super offended when you asked them to wear a Covid-preventing mask?
```
That's why! We have become such an "elite" group of people. We have used our fancy-pantsy language and flip-flopped on so many "truths," non-scientists don't trust us.
And, it's kinda scary.
So, yeah, learn how to communicate. We're pretty blessed to be able to do what we do. Let's pay it forward.
How, you ask?
Good question. I don't know. I'm pretty good at it, but just because I know how to communicate, doesn't mean I can teach it to others. And, besides, this is a statistics class, not a communication class.
So why talk about it in the first place?
Again, to remind you (and me) that communicating to the public is part of our professional responsibility. More than our pride is on the line.
The fate of humanity is. (Seriously).
So, yeah, practice communicating with non-scientists. You might save the world doing it!
## Making Change
Unfortunately, right now it's not all that "cool" to practice open science. The status quo still rewards those who write catchy titles, p-hack, hide important information, and refuse to acknowledge their wrongs.
That's pretty sucky.
But I want a better science. I hope you want a better science too.
How do we make that happen?
We push. That's all! No one person's going to change the status quo. Rather, it's going to be thousands of scientists, chipping away at the wall of the scientific method.
So how do you chip away? Maybe you preregister your hypotheses. Or maybe you report *all* analyses you did, rather than just the ones that worked. Or maybe you make your data publicly available. Or maybe you explicitly state in your papers you're uncertain about your results.
If *all* of us push against the status quo, pretty soon, these sorts of things move from being "weird," to being common, to being the norm. Then, those who refuse to make their data publicly available (for example) start looking like the one guy at a nude beach who's walking around in a tuxedo and top-hat. (Btw, I'm not at all recommending you walk outside your house naked. That would be illegal. And awkward. It was just a metaphor).
Throughout this course, I'm going to teach data analysis from *this* perspective. If you read any other textbook, it's going to be written as if you can divine truth from your data. I will not mislead you into believing that. So, the very fact you're reading this (and presumably taking a course with this book as your text) means you're already pushing against the boundaries. High five!
Because this is *my* ethical framework, I'm going to, throughout the text, tell you ways you can push against the boundaries. I might, for example, talk about having an external website that contains *all* your plots, or I might suggest you report Bayes factors instead of p-values, or I might suggest you tell reviewers to f&$% off when they say you should run a t-test instead of a general linear model.
(Not really. You should probably be polite).
Anyway. I'm just rambling now. On to the next section.
## Further data analysis ethics.
I've only really scratched the veneer of ethics in science. I haven't even talked about exploration versus confirmation, nor p-hacking, data mining, or HARKing. Alas, to understand the nuances of these, you really need to have a foundation in probability.
But, we'll get there. Once we do, I'm going to revisit the idea of ethics, and more specifically, the ethics of data analysis. Until then, peace out.
<!--chapter:end:02-Ethics.Rmd-->
```{r}
knitr::opts_chunk$set(message=FALSE, warning=FALSE, note=FALSE, cache=TRUE)
```
# Measurement
When I was a lad (7th grade, that is), I took algebra. My teacher was trying to sell math as an important life skill.
Alas, I was ~~lizdexic~~ dyslexic, so Math wasn't my favorite subject. But, for some reason, I've always had a fascination with Albert Einstein. So, when Ms. Miller said his name, I lifted my head from between my folded arms.
"Einstein was a mathemetician, you know..." she said.
I scoffed. "No, Einstein was a scientist."
"Scientists use math."
That truly puzzled me. Scientists use *math*??? But...they're scientists?
I really couldn't figure out why in the world a scientist would ever use math. To me, a scientist was someone in a lab coat who boiled acids in petri dishes or launched rockets into space.
Why in the world would a scientist use math?
In retrospect, the answer was obvious: scientists *measure* things, like the temperature of a liquid or the speed of a rocket.
Okay...so they measure things. Aaaaand?????
Think about what it means to *measure* something: we take an abstract concept (like how fast something is traveling) and convert it to a *number* (like kilometers per hour).
And, once you have converted something in the world (e.g., speed, temperature, depression, age) into a *number*....
You can use math.
So, scientists study how things work. To study how things work, they have to measure how these things behave. Measuring converts "things" to numbers. Once they're represented as numbers, we can use math.
Ironically, it wasn't until college I saw that connection.
```{block, type="rmdnote"}
Technically, most people don't really use math in science. They use statistics.
And what's the difference between statistics and math?
Statistics = Math + Error.
That's it! Just look at the two plots below. The first plot is a Cartesian plot of $X$ and $Y$. We have a bunch of numbers for $X$ and $Y$ and a line. The second plot looks similar to the first, but the dots don't all fall on the line. Or, in other words, the *line* is the math part. Because not all dots fall on the line, we have "error" in predicting these $Y$ values. So, Statistics = Math + Error.
```
```{r mathvsstats, echo=FALSE}
# Left panel: "math" -- y is an exact function of x, so every point falls on the line.
# Right panel: "statistics" -- the same line plus random error, so the points scatter around it.
set.seed(1)  # fix the random seed so the simulated "error" is reproducible
x = 1:15
y = 2*x + 4
y2 = 2*x + 4 + rnorm(15, 0, 5)
d = data.frame(x=x, y=y, y2=y2)
require(flexplot)
require(ggplot2)
a = flexplot(y~x, data=d, method="lm") + coord_cartesian(ylim=c(0, 40), xlim=c(2,14))
b = flexplot(y2~x, data=d, method="lm", se=F) + labs(y="y") + coord_cartesian(ylim=c(0, 40), xlim=c(2,14))
cowplot::plot_grid(a, b)
```
## Why am I talking about measurement?
I'm glad you asked. Or, I'm glad I asked. This whole chapter, I've been mad at myself. I hate when information is presented without context. "Here's a bunch of stuff people need to know about statistics! But don't ask me why!"
Alas, this chapter is largely theoretical, without much practical information.
So why am I telling you this?
Two reasons. First, I can't yet be practical (and show you *how* to compute important statistics related to measurement) without you knowing what a correlation is, how to compute it, and how to visualize it. I could, of course, put the measurement chapter *after* I teach you all that stuff, but that would be weird; we need to *measure* variables before we can compute statistics on them. So, this chapter has to be theoretical.
Second, I put measurement front-and-center to emphasize that, without good measures, there's no point in doing statistics.
So, just because I'm not giving you the "how," that doesn't mean this chapter isn't important. (In fact, aside from Ethics, this is probably the most important chapter).
## Constructs
So, scientists study *things*. (I know, that's outrageously specific). We study things like rocks (geology), bodies (biology), chemicals (chemistry), motion (physics), etc. Sometimes, the things we study can be observed. We can see rocks. We can see bodies (or even the insides of bodies).
Other times, we can't see these things. We can't see gravity. But we can see gravity's influence on other things.
For us psychologists, we got the short end of the stick. *Most* of what we study can't be observed. We can't see depression, or stress, or attention, or schizophrenia. Like gravity or atoms, we can only infer their influence on other things. For example, you can't see anger, but you can observe its influence when your mom hurls pancakes, fresh off the skillet, at your older brother's face. (True story, btw. My older brother was screaming for pancakes and my mom got fed up...I'm only a *little* satisfied every time I think of that story. Sorry, Jordan.)
[comic of mom hurling pancakes]
These "things" we study, that cannot be observed, but only inferred by their influence on other things, are called construct. Stress is a construct. Anger is a construct. Sex drive is a construct.
Before we science the sh%& out of constructs, we have to convert them to numbers. To do that, we have to use "operational definitions."
## Operational Definitions
So, we have a construct, or a thing that we want to study that cannot be directly observed. Remember, though we can't observe them, we know they exist because of the things we *can* observe.
An operational definition is how we make that conversion from unobserved to observed.
We should probably make an official definition, huh? Okay....
**Operational definitions are the objective criteria we use to quantify our constructs.**
Let's do an example. Let's say I want to measure anger. How do you know anger exists? You see manifestations of it. One manifestation is punching people in the face.
So, perhaps, you decide to "operationalize" anger by counting the number of times a participant punched someone in the face over the course of an experiment. That would work as an operational definition! (It's a bad definition, btw, but it's technically an operational definition. We'll get into good versus bad in the next few sections).
What makes this operational definition work is that it fits three criteria:
* *Our operational definition is observable*. A punch to someone's face can be observed. Magnetic chakra fields cannot. So, check.
* *Our operational definition is objective*. In other words, if two people were to observe the behavior of interest (a participant punching someone in the face), we expect the two would agree on the value. It would be quite rare for two to disagree. ("I don't know, it looked more like a yawn to me.") So, a punch to the face is objective. One's subjective evaluation of someone's body language is not. (There are ways we can make subjective opinions more objective, but I'm not going to get into that).
* *The measure is specific*. Notice we qualified our criteria by saying it had to be a punch (not hit, caress, brush) in the face (not shoulder, stomach, kidney, etc.), during a particular duration. When operational definitions are specific, they're more likely to be objective. For example, if we said our OD of anger was the number of times someone expressed anger, that's not very specific! What counts as an expression of anger? Furrowed eyebrows? Heavy breathing? Punching holes in drywall?
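Just so you can see where this is headed, here's what our punch-counting operational definition might look like once it's sitting in the computer. (A quick disclaimer: the tiny dataset below is completely made up, and the chunk is just a sketch; the point is only that our construct has become a column of numbers we can do math on.)

```{r odsketch}
# a made-up dataset: anger operationalized as the number of observed
# face-punches during the experiment
anger_data = data.frame(
  participant = 1:6,
  punches     = c(0, 2, 0, 1, 0, 7)
)
anger_data

# because the construct is now a number, we can do math with it
mean(anger_data$punches)
```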
But, of course, our OD might meet all the criteria, but really suck. Maybe our OD of anger is the number of times the participant's eyebrows lower more than 2mm from their resting face within the first 60 minutes after the experimenter states a pre-determined sequence of "your mom" insults. That there is specific! But it's a bad definition for anger. Maybe they lowered their eyebrows because they were confused, or concentrating as they tried to figure out why they were being insulted, or maybe they just have a resting butch face.
So, a good OD doesn't necessarily mean we're actually measuring what we want to measure. To determine that, we need to understand what *validity* is.
## Validity
Validity means "truth." If someone makes an invalid argument, you can't trust their conclusions to be true.
When speaking of measurement, **validity means we have actually measured what we think we measured.**
A valid measurement of anger measures anger. No more, no less.
A valid measurement of gravity measures gravity. No more, no less.
A valid measure of blood pressure measures blood pressure. No more, no less.
Let's consider some *invalid* measures. Maybe let's use a table, eh?
| Construct | Operational Definition | What's wrong with it? |
| ----- | ----- | ----- |
| The flu | # of times someone sneezes in an hour | People who sneeze might just have allergies, and some people with the flu don't sneeze at all |
| Intelligence among monkeys | Their score on a written intelligence test | Monkeys can't read. Need I say more? |
| A house's hauntedness | Readings from an electromagnetic field (EMF) meter | Why would ghosts give off electromagnetic signals? Also, other things give off electromagnetic signals (like, you know, power lines) |
So, you feeling good about what validity is?
The next question, of course, is how in the world you determine whether your measure is valid.
Actually, that's a bad question. Measures don't divide nicely into valid and invalid measures. Instead, our measures have varying *degrees* of validity.
But, again, how do you determine the degree to which your measures are valid?
### Evaluating Validity
It turns out, evaluating validity isn't easy (at least in psychology). And, it's subjective. (But remember from the ethics chapter that subjectivity isn't necessarily a bad thing). To evaluate validity, we eventually *make a subjective judgment call where we weigh the evidence.*
What sorts of evidence are we looking for? Usually, we consider three forms of evidence:
1. *Content validity*. This one is the hardest to grasp, methinks. Essentially, content validity says our operational definitions measure *all* aspects of the construct, and do not measure things they should not. For example, depression is generally considered to be made up of sadness, hopelessness, and a lack of motivation. If our measure only captures sadness (and not hopelessness or lack of motivation), we have failed at content validity. Likewise, if our measure of depression counts how many times people frown, we might accidentally be measuring concentration!
The best way to evaluate content validity is to have experts in the area evaluate your measure. If you get a bunch of depression experts to assess your measure, and they all agree you seem to have hit all the important aspects of depression (and no more), you probably have good content validity.
2. *Face validity*. Face validity refers to the degree to which our operational definitions *appear* to measure what we're trying to measure. If we're measuring blood pressure, but our operational definition uses a breathalyzer, that's pretty poor face validity! Likewise, if we're measuring stress and we ask people how frequently they yodel, again we have poor face validity.
Generally we want high face validity. However, in psychology we tend to use a *lot* of self-report stuff. In these situations, face validity might be bad. Why? Well, let me illustrate with a story. Starting in high school, I began to suspect I was dyslexic. When I entered college, I struggled to complete my tests on time. I knew if I had a dyslexia diagnosis, I'd get more time! So, when the examiners began asking me questions, I was *highly* motivated to perform poorly. It was an honest struggle not to miss questions on purpose so I could get that diagnosis. If I had more questionable ethics, I might have scored poorly, not because I had a learning disability, but because I was motivated to *look* like I had a learning disability. When measures have high face validity, they're easier to fake.
But, I think generally people agree it's worth the cost. It's very hard (and quite rare) to have self-report questions with low face validity and high content validity.
3. *Criterion validity.* This is kinda sorta a tangential way to gather evidence of validity. Let's say we want to develop a measure of nerdiness. Also suppose there's some theory out there that suggests nerds are more successful in their careers. We have two constructs (nerdiness and success in careers). If that is indeed the case, then our measure of nerdiness should be correlated with measures of success.
This may be the easiest way to gather evidence of validity. There are mountains of statistics we could use to compute associations between two measures.
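If you want a tiny preview of what criterion validity evidence looks like in R, here's a sketch. (The data are simulated and the `nerdiness` and `success` scores are invented; `cor()` is R's built-in correlation function, which we'll cover properly later.)

```{r criterionsketch}
# simulated scores on a (made-up) nerdiness scale and a (made-up)
# career-success scale for 100 people
set.seed(1212)
nerdiness = rnorm(100, mean = 50, sd = 10)
success   = 0.5 * nerdiness + rnorm(100, mean = 0, sd = 8)

# if our nerdiness measure is valid (and the theory is right),
# the two sets of scores should be correlated
cor(nerdiness, success)
```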
Alas, my discussion of validity is painfully brief. There are many of my type (i.e., quantitative psychologists) who believe validity is the most important part of statistics. I don't think I disagree. You can have amazing visuals, robust statistics, and beautiful statistical models that say amazing things about the world. But if your instruments don't actually measure what you think they're measuring, you're screwed!
Despite its importance, this book isn't about measurement. It's about statistics. And, I'm a poor teacher of measurement issues. So, I will end this section with an embarrassing admission: validity deserves far more attention than I'm giving it, but I won't cover it further and will just pretend we have valid measures for the remainder of the book.
Yikes. I can practically hear my colleagues throwing this textbook into a crackling fire and dismissing me as a fraud. (If you are using a digital text, that's a very expensive act of frustration).
## Reliability
So, yeah, it's kinda hard to do validity. Reliability, on the other hand, is easy! (Or easier).
What is reliability, you ask?
**Reliability refers to the consistency of a measure**.
Here's the basic idea. Let's say you're measuring the age of a mummy using carbon dating. To do so, you take samples from 10 different sections of the same mummy. Maybe your first sample says the mummy is 2,000 years old. And maybe the second says the mummy is 10,000 years old. Then the third says it's ten minutes old. Then the next says it was actually born in the future, but has returned via a wormhole in your mother's garbage disposal.
In other words, our measure is *inconsistent*, or it varies from one occasion to the next.
That sucks.
Why does it suck? Because we can never be sure if the numbers we use are even close to their "true" value. When this happens, our statistical models have a real hard time picking out any patterns.
### Evaluating reliability
Luckily for us, measuring reliability is much easier than measuring validity. (There will be nuances I'm going to ignore because, again, this is a book about statistics, not measurement.) All it requires of us is that we have *at least* two different measures of the same people. Why two? Well, you can't very well measure consistency with only one score! How consistent is an athlete who only plays a single game? You don't know! You can't know until you measure them more than once.
Generally, we have three ways of measuring reliability. Each of these ways gives us two (or more) measures of the same person (or thing, if you're not measuring people). The first (test-retest) gives us 2+ scores by measuring people 2+ times. The second (internal consistency) only measures people once, but divides each person's answers into 2+ splits. The third (interrater) uses two or more different raters of the same event (e.g., how many times a child throws a tantrum).
1. *Test-retest reliability*. We can assess the consistency of a measure by measuring the same people (or things, if you study things...) multiple times. So maybe I measure your depression today, then measure you again in a few weeks. I can then correlate your scores. (We'll talk more about how to do this later. I'll even show you examples, including a little sketch right after this list!). Now, I have an objective measure of the reliability of the test. Easy peasy! (Kinda...it's actually a bit tedious to measure people more than once. If you're measuring rocks or lethal bacteria, consider yourself lucky!...unless, of course, you get attacked by said lethal bacteria. Or rocks. In which case, my apologies for being insensitive).
2. *Internal consistency reliability*. It's tedious to measure the same person more than once. So, internal consistency is the lazy little brother of test-retest reliability. Let's say we have a questionnaire measuring motivation, and let's say that questionnaire has 10 questions. The basic idea behind internal consistency is that we split the test in half (maybe the first five questions versus the second five). Then we compute the sums for each half and, now that we have a pair of scores for each person, we can correlate their two scores. (Again, we'll talk more about what it means to "correlate" two things later). In reality, there's some fancy-pantsy statistics that allow us to (metaphorically) split the test more than once, but the idea is the same.
3. *Interrater reliability*. The first two measures work quite well if you're using self-report measures. Not so well when we measure using *raters*. Let's say we have five people watching my measurement YouTube video. We can ask each of them to rate the amazingness of my video (say, on a scale of 1-10). Once again, we now have multiple measurements of the same person, which means we can assess consistency. And, once again, we can easily compute some statistics that measure how consistent our raters are.
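Here's the sketch I promised, using simulated data for the first two approaches. (Everything below is invented, and the code is just a preview; interrater reliability works the same way, except the two columns you correlate come from two different raters instead of two occasions or two halves.)

```{r reliabilitysketch}
# simulated data: 50 people answer a 10-item questionnaire on two occasions
set.seed(4242)
true_score = rnorm(50)                                    # each person's "true" standing
time1 = sapply(1:10, function(i) true_score + rnorm(50))  # 50 x 10 matrix of item scores
time2 = sapply(1:10, function(i) true_score + rnorm(50))  # same people, a few weeks later

# test-retest reliability: correlate total scores across the two occasions
cor(rowSums(time1), rowSums(time2))

# (split-half) internal consistency: correlate the first five items
# with the last five items from a single occasion
cor(rowSums(time1[, 1:5]), rowSums(time1[, 6:10]))
```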
As with validity, there are no (good) criteria for "reliable" versus "unreliable". There are some folks out there who try to establish "benchmarks" for what good reliability numbers are, but I favor a subjective judgment that weighs the various sources of evidence.
### Increasing Reliability
So, let's say you develop your measure of the construct of interest and you're super excited to make world-changing discoveries.
Theeeen you compute reliability. And it sucks.
Oops.
What now?
Fortunately, it's pretty easy to increase a measure's reliability. All you do is make the measure longer. Maybe your measure of depression is 10 items. Make it 20. Maybe you measure the age of a fossil with 5 samples. Make it 10. Maybe you measure acceleration by dropping 10 bowling balls off of a skyscraper. Drop 20.
Why does that work?
Let me put it this way. Say you're going on a date with a fella for the first time. You, obviously, want to know if this guy is a creep. After one date, perhaps you start to think this guy's quite swell; he makes you laugh, he's complimentary, and he smells like daisies. But, you can't assess his reliability yet. Is he always this way? Or is he just putting on a good show?
Now, on the second date, he's not as funny, doesn't compliment you nearly as much, and he smells like Old Spice instead of daisies. Lame. So, maybe he's no Prince Charming, but he's worth a third date.
And a fourth.
And a fifth.
Now, 50 years and thousands of dates later, that dude's your husband. After so many dates, you have a pretty good idea of whether this person's swell. Why? Because you have "measured" that person for decades.
It's the same with our measures of constructs. Say you compute someone's depression score using a measure with only five items. You'd be far less confident in their score than if you had 1,000 items. (Although having people complete 1,000 items is cruel. And it might induce depression.) If you measure the age of a fossil 50 times (and average the age estimate), you can be pretty confident that average score is much more reliable than if you only measured 5 times.
In short, when we have lots of items (or lots of measurements), we can become increasingly certain about our scores.
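If you'd like to see that logic in action, here's a small simulation using the fossil example. (The numbers are invented; the point is just that averages based on more measurements bounce around less from one study to the next.)

```{r morereliable}
# simulated fossil dating: each sample = the true age plus a lot of error
set.seed(77)
true_age = 5000
estimate_age = function(n_samples) mean(true_age + rnorm(n_samples, mean = 0, sd = 1000))

# repeat the whole study 1,000 times, once with 5 samples and once with 50
few_samples  = replicate(1000, estimate_age(5))
many_samples = replicate(1000, estimate_age(50))

sd(few_samples)   # estimates based on 5 samples wander quite a bit
sd(many_samples)  # estimates based on 50 samples are far more consistent
```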
## Variable types
### Predictor versus Outcome Variables
One critical skill you need to develop is the ability to distinguish predictor from outcome variables. Why? Because eventually, we will have to *explicitly* tell our statistics software which variable (or variables) is the predictor, and which is the outcome.
The outcome variable is the variable that is influenced by the predictor variable. The predictor variable is the variable that is either (1) manipulated directly by the experimenter (when this happens, we often call it an "independent variable" and the outcome a "dependent variable"), or (2) posited by the researcher to have an influence on the outcome variable.
Another way to think of it is that the outcome variable is the *effect* while the predictor variable is the *cause*. (Some statisticians are quite ready to pounce on that statement. "Correlation is not causation!" they'll say. Just chill, y'all, then read the note below). Or, the *predictor* comes *before* the *outcome*.
When trying to decide which is which, consider these questions:
* Are you trying to predict a future value? If so, the variable you're trying to predict is the outcome. The variable(s) you use to make that prediction are the predictors.
* Are you trying to see what happens to a variable after you introduce a change in the environment? The change in the environment is the predictor, the "what happens" is the outcome.
* Does one variable come before the other? If so, the variable that comes first is the predictor and the variable that comes after is the outcome.
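For what it's worth, R makes this distinction explicit in its formula notation: the outcome goes on the *left* of the `~` and the predictor(s) go on the *right*. Here's a tiny made-up example (the `sleep` and `mood` variables are invented, and the plot uses `flexplot` the same way the earlier chunk did).

```{r predictoroutcome}
# made-up example: does hours of sleep (predictor) influence mood (outcome)?
set.seed(99)
sleep = rnorm(30, mean = 7, sd = 1)
mood  = 2 * sleep + rnorm(30, mean = 0, sd = 2)
d = data.frame(sleep, mood)

# outcome on the left of the ~, predictor on the right
require(flexplot)
flexplot(mood ~ sleep, data = d)
```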