# Correlation and Regression Analysis
[book](pdf/book12.pdf){target="_blank"}
[eStat YouTube Channel](https://www.youtube.com/channel/UCw2Rzl9A4rXMcT8ue8GH3IA){target="_blank"}
**CHAPTER OBJECTIVES**
From Chapter 7 to Chapter 10, we discussed estimation and
hypothesis testing for parameters of a single variable, such as the
population mean and variance.
This chapter describes correlation analysis for two or more variables.
If the variables are related to each other, regression analysis is
used to describe how this association can be put to use. Simple linear
regression analysis and multiple regression analysis are discussed.
:::
:::
## Correlation Analysis
::: presentation-video-link
[presentation](pdf/1201.pdf){.presentation-link target="_blank"}
[video](https://youtu.be/dPCZ1w59Vm8){.video-link target="_blank"}
:::
::: mainTable
The easiest way to observe the relation between two variables is to draw a
scatter plot with one variable on the X axis and the other on the Y axis. If the two
variables are related, the data points cluster in a certain pattern;
if they are not related, the points are scattered with no pattern. Correlation
analysis is a method of analyzing the degree of linear relationship
between two variables: it investigates how linearly one
variable increases or decreases as the other variable increases.
:::
::: mainTableGrey
**Example 12.1.1** A survey of advertising costs and sales
for 10 companies that make the same product produced the
data shown in [Table 12.1.1]{.table-ref}. Using 『eStat』, draw a scatter plot of these
data and investigate the relation between the two variables.
::: textLeft
Table 12.1.1 Advertising costs and sales (unit: 1 million USD)
:::
Company   Advertise (X)   Sales (Y)
--------- --------------- -----------
1         4               39
2         6               42
3         6               45
4         8               47
5         8               50
6         9               50
7         9               52
8         10              55
9         12              57
10        12              60
::: textLeft
Ex ⇨ eBook ⇨ EX120101_SalesByAdvertise.csv.
:::
**Answer**
Using 『eStat』, enter the data as shown in [Figure 12.1.1]{.figure-ref}. Click the
scatter plot icon on the main menu and, in the variable selection box
that appears, select Sales as 'Y Var' and Advertise as 'by X Var';
the scatter plot will then appear as shown in
[Figure 12.1.2]{.figure-ref}. As expected, the scatter plot shows that sales
increase as investment in advertising increases and, moreover, that
the increase follows a linear pattern.
<div>
<div>
<input class="qrBtn" onclick="window.open(addrStr[37])" src="QR/EX120101.svg" type="image"/>
</div>
<div>
![](Figure/Fig120101.png){.imgFig600400}
::: figText
[Figure 12.1.1]{.figure-ref} Data input in 『eStat』
:::
</div>
</div>
![](Figure/Fig120102.svg)
::: figText
[Figure 12.1.2]{.figure-ref} Scatter plot of sales by advertise
:::
:::
::: mainTable
The relation between two variables can be roughly investigated using a
scatter plot like this. A numerical measure of the strength of the relation,
used together with the plot, provides a more accurate and objective view of
the relation between the two variables. One such measure
is the covariance. The population covariance of the
two variables $X$ and $Y$ is denoted as $Cov(X,Y)$. When the random
samples of two variables are given as
$(X_1 , Y_1 ) , (X_2 , Y_2 ), ... , (X_n , Y_n )$, the estimate of the
population covariance using samples, which is called the sample
covariance, $s_{XY}$, is defined as follows: $$
\begin{align}
s_{XY} &= \frac{1}{n-1} \sum_{i=1}^{n} ( X_i - \overline X )( Y_i - \overline Y ) \\
&= \frac{1}{n-1} ( \sum_{i=1}^{n} X_i Y_i - n {\overline X}{\overline Y} )
\end{align}
$$ In the above equation, $\overline X$ and $\overline Y$
represent the sample means of $X$ and $Y$ respectively.
To understand the meaning of covariance, consider the case in which
$Y$ tends to increase as $X$ increases. If the value of $X$ is larger than
$\overline X$ and the value of $Y$ is larger than $\overline Y$, then
$(X - \overline X)(Y- \overline Y)$ has a positive value. Likewise,
if the value of $X$ is smaller than $\overline X$ and the value of $Y$
is smaller than $\overline Y$, then $(X - \overline X)(Y- \overline Y)$
again has a positive value. Since most observations fall into one of these
two cases, the average of these products, which is the
covariance, tends to be positive. Conversely, if the
covariance is negative, one variable tends to decrease as the
other increases. Hence, the sign of the covariance shows the direction
of the relation between two variables: positive correlation (i.e.,
an increase in one variable is accompanied by an increase in the
other) or negative correlation (i.e., an increase in one variable is
accompanied by a decrease in the other).
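
The sign behaviour described above can be checked numerically. The following is a minimal sketch in Python (not part of 『eStat』); the function names are chosen here for illustration only. It computes the sample covariance of the data in Table 12.1.1 with both forms of the formula, and then pairs the same advertising costs with the sales values in reverse order, an artificial pairing used only to produce a decreasing pattern.

```python
def sample_covariance(xs, ys):
    """Sample covariance using the definition (divisor n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_covariance_shortcut(xs, ys):
    """Same quantity using the computational form: sum of XY minus n * means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / (n - 1)


# Advertising costs (X) and sales (Y) from Table 12.1.1
advertise = [4, 6, 6, 8, 8, 9, 9, 10, 12, 12]
sales = [39, 42, 45, 47, 50, 50, 52, 55, 57, 60]

cov_up = sample_covariance(advertise, sales)
cov_up_shortcut = sample_covariance_shortcut(advertise, sales)

# Artificial pairing: reversing Y makes Y fall as X rises
cov_down = sample_covariance(advertise, sales[::-1])

print("covariance, original pairing:", round(cov_up, 2))
print("covariance, shortcut formula:", round(cov_up_shortcut, 2))
print("covariance, reversed pairing is negative:", cov_down < 0)
```

Both forms give the same value for the original data, and the reversed pairing yields a negative covariance, in line with the interpretation of the sign.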
Covariance itself is a useful measure, but since it depends
on the units of $X$ and $Y$, its magnitude is difficult to interpret
and inconvenient to compare across data sets. Dividing the
covariance by the standard deviations of $X$ and $Y$, $\sigma_{X}$ and
$\sigma_{Y}$, gives a standardized measure that does not depend on the
type of variable or the specific unit. This standardized covariance is
called the population correlation coefficient and is
denoted as $\rho$.
$$
\text{Population Correlation Coefficient: } \rho = \frac{Cov (X, Y)} { \sigma_X \sigma_Y }
$$
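
The effect of standardizing can be illustrated with a short sketch. It uses the sample analogue of the formula above (sample covariance divided by the two sample standard deviations) on the data of Table 12.1.1, and the conversion from millions to thousands of USD is an assumption made here purely to change the units. The covariance changes with the units, while the correlation coefficient does not.

```python
import math


def covariance(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Table 12.1.1, in millions of USD
advertise = [4, 6, 6, 8, 8, 9, 9, 10, 12, 12]
sales = [39, 42, 45, 47, 50, 50, 52, 55, 57, 60]

# The same data expressed in thousands of USD (illustrative unit change)
advertise_k = [1000 * x for x in advertise]
sales_k = [1000 * y for y in sales]

cov_millions = covariance(advertise, sales)
cov_thousands = covariance(advertise_k, sales_k)
r_millions = correlation(advertise, sales)
r_thousands = correlation(advertise_k, sales_k)

print("covariance ratio after unit change:", round(cov_thousands / cov_millions))
print("correlation unchanged:", math.isclose(r_millions, r_thousands))
```

The same reasoning explains why multiplying a variable by a positive constant or adding a constant to it leaves the correlation coefficient unchanged.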
[Figure 12.1.3]{.figure-ref} shows several scatter plots together with their
correlation coefficients.
![](Figure/Fig120103.png){.imgFig600400}
::: figText
[Figure 12.1.3]{.figure-ref} Different scatter plots and their correlation
coefficients.
:::
The correlation coefficient $\rho$ is interpreted as follows:
::: textL20M20
1\) $\rho$ has a value between -1 and +1. A $\rho$ value closer to +1
indicates a strong positive linear relation and a $\rho$ value closer to
-1 indicates a strong negative linear relation. The linear relationship
weakens as the value of $\rho$ approaches 0.
:::
::: textL20M20
2\) If all the corresponding values of $X$ and $Y$ are located on a
straight line, the value of $\rho$ is either +1 (if the slope of the
straight line is positive) or -1 (if the slope of the straight line is
negative).
:::
::: textL20M20
3\) The correlation coefficient $\rho$ is only a measure of linear
relationship between two variables. Therefore, in the case of $\rho = 0$,
there is no linear relationship between the two variables, but there
may be a different relationship (see the scatter plot (f) in
[Figure 12.1.3]{.figure-ref}).
:::
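
The three properties above can be verified on small constructed data sets. The sketch below applies the correlation formula to sample values (the sample version $r$ is defined later in this section); the helper `pearson_r` and the three data sets are illustrative choices, not taken from 『eStat』.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient computed from sums of squares and cross products."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = list(range(-4, 5))                           # -4, -3, ..., 4

r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # exact line, positive slope
r_down = pearson_r(xs, [-2 * x + 1 for x in xs])  # exact line, negative slope
r_curve = pearson_r(xs, [x ** 2 for x in xs])     # exact quadratic relation

print("line with positive slope:", round(r_up, 6))
print("line with negative slope:", round(r_down, 6))
print("symmetric quadratic curve:", round(abs(r_curve), 6))
```

The quadratic case shows that a coefficient of zero rules out only a linear relationship: here $Y$ is completely determined by $X$, yet the correlation is zero.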
『eStatU』 provides a simulation of scatter plot shapes for different
correlations as in [Figure 12.1.4]{.figure-ref}.
<div>
<div>
<input class="qrBtn" onclick="window.open(addrStr[79])" src="QR/eStatU310_Correlation.svg" type="image"/>
</div>
<div>
![](Figure/Fig120104.png){.imgFig600400}
::: figText
[Figure 12.1.4]{.figure-ref} Simulation of correlation coefficient at 『eStatU』
:::
</div>
</div>
An estimate of the population correlation coefficient using samples of
two variables is called the sample correlation coefficient and denoted
as $r$. The formula for the sample correlation coefficient $r$ can be
obtained by replacing each parameter with the estimates in the formula
for the population correlation coefficient. $$
r = \frac {s_{XY}} { s_X s_Y }
$$ where $s_{XY}$ is the sample covariance and $s_{X}$, $s_{Y}$
are the sample standard deviations of $X$ and $Y$ as follows: $$
\begin{align}
s_{XY} &= \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline X )(Y_i - \overline Y ) \\
s_X^2 &= \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline X )^{2} \\
s_Y^2 &= \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \overline Y )^{2} \\
\end{align}
$$ Therefore, the formula for $r$ can be written as follows: $$
\begin{align}
r &= \frac {\sum_{i=1}^{n} (X_i - \overline X )(Y_i - \overline Y )} { \sqrt{\sum_{i=1}^{n} (X_i - \overline X )^{2} \sum_{i=1}^{n} (Y_i - \overline Y )^{2} } } \\
&= \frac {\sum_{i=1}^{n} X_i Y_i - n \overline X \overline Y } { \sqrt{\left (\sum_{i=1}^{n} X_{i}^{2} - n {\overline X}^2 \right) \left( \sum_{i=1}^{n} Y_{i}^{2} - n {\overline Y}^{2} \right) } }
\end{align}
$$
:::
::: mainTableGrey
**Example 12.1.2** Find the sample covariance and correlation
coefficient for the advertising costs and sales of [Example 12.1.1]{.example-ref}.
**Answer**
To calculate the sample covariance and correlation coefficient, it is
convenient to make the following table. This table can also be used for
calculations in regression analysis.
::: textLeft
Table 12.1.2 A table for calculating the covariance
:::
Number   $X$   $Y$    $X^2$   $Y^2$   $XY$
-------- ----- ------ ------- ------- ------
1        4     39     16      1521    156
2        6     42     36      1764    252
3        6     45     36      2025    270
4        8     47     64      2209    376
5        8     50     64      2500    400
6        9     50     81      2500    450
7        9     52     81      2704    468
8        10    55     100     3025    550
9        12    57     144     3249    684
10       12    60     144     3600    720
Sum      84    497    766     25097   4326
Mean     8.4   49.7
Terms which are necessary to calculate the covariance and correlation
coefficient are as follows:
$\small \quad SXX = \sum_{i=1}^{n} (X_i - \overline X )^{2} = \sum_{i=1}^{n} X_{i}^2 - n{\overline X}^{2} = 766 - 10×8.4^2 = 60.4$\
$\small \quad SYY = \sum_{i=1}^{n} (Y_i - \overline Y )^{2} = \sum_{i=1}^{n} Y_{i}^2 - n{\overline Y}^{2} = 25097 - 10×49.7^2 = 396.1$\
$\small \quad SXY = \sum_{i=1}^{n} (X_i - \overline X )(Y_i - \overline Y ) = \sum_{i=1}^{n} X_{i}Y_{i} - n{\overline X}{\overline Y} = 4326 - 10×8.4×49.7 = 151.2$
Here $\small SXX$, $\small SYY$, and $\small SXY$ represent the sum of squares of $\small X$, the
sum of squares of $\small Y$, and the sum of cross products of $\small X$ and $\small Y$, respectively. Hence,
the covariance and correlation coefficient are as follows:
$\small \quad s_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline X )(Y_i - \overline Y ) = \frac{151.2}{10-1} = 16.8$
$\small \quad r = \frac {\sum_{i=1}^{n} (X_i - \overline X )(Y_i - \overline Y )} { \sqrt{\sum_{i=1}^{n} (X_i - \overline X )^{2} \sum_{i=1}^{n} (Y_i - \overline Y )^{2} } } = \frac{151.2} { \sqrt{ 60.4 × 396.1 } } = 0.978$
This value of the correlation coefficient is consistent with the scatter
plot which shows a strong positive correlation of the two variables.
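
The hand calculation above can be reproduced with a few lines of code. This is a minimal sketch in Python using the computational forms of $SXX$, $SYY$ and $SXY$; the variable names are chosen here for illustration and the script is independent of 『eStat』.

```python
import math

# Advertising costs (X) and sales (Y) from Table 12.1.1
x = [4, 6, 6, 8, 8, 9, 9, 10, 12, 12]
y = [39, 42, 45, 47, 50, 50, 52, 55, 57, 60]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross products (computational forms)
sxx = sum(v * v for v in x) - n * x_bar ** 2
syy = sum(v * v for v in y) - n * y_bar ** 2
sxy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar

s_xy = sxy / (n - 1)              # sample covariance
r = sxy / math.sqrt(sxx * syy)    # sample correlation coefficient

print(f"means: {x_bar:.1f}, {y_bar:.1f}")
print(f"SXX = {sxx:.1f}, SYY = {syy:.1f}, SXY = {sxy:.1f}")
print(f"s_XY = {s_xy:.1f}, r = {r:.3f}")
```

The printed values can be compared line by line with the terms computed in the answer.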
:::
::: mainTable
The sample correlation coefficient $r$ can be used to test hypotheses about
the population correlation coefficient. The hypothesis of main interest
is $H_0 : \rho = 0$, which tests whether a
linear correlation exists. This test can be done using the $t$ distribution as
follows:
:::
::: mainTableYellow
**Testing the population correlation coefficient $\rho$:**
Null Hypothesis: $H_0 : \rho = 0$
Test Statistic: $\qquad t_0 = \sqrt{n-2} \frac{r}{\sqrt{1 - r^2 }}$,
$\quad$ where, under $H_0$, $t_0$ follows the $t$ distribution with $n-2$ degrees of freedom
Rejection Region of $H_0$:\
$\qquad 1)\; H_1 : \rho < 0 , \;\;$ Reject if $\; t_0 < -t_{n-2; \alpha}$\
$\qquad 2)\; H_1 : \rho > 0 , \;\;$ Reject if $\; t_0 > t_{n-2; \alpha}$\
$\qquad 3)\; H_1 : \rho \ne 0 , \;\;$ Reject if
$\; |t_0 | > t_{n-2; \alpha/2}$
:::
::: mainTableGrey
**Example 12.1.3** In [Example 12.1.2]{.example-ref}, test the hypothesis that the
population correlation coefficient between advertising cost and
sales amount is zero at the significance level of 0.05. (Since the
sample correlation coefficient is 0.978, which is close to 1, this test
would rarely be needed in practice.)
**Answer**
The value of the test statistic $t$ is as follows:
$\qquad \small t_0 = \sqrt{10-2} \frac{0.978}{\sqrt{1 - 0.978^2 }}$ =
13.26
Since it is greater than $t_{8; 0.025}$ = 2.306, $\small H_0 : \rho = 0$
should be rejected.
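
The calculation of the test statistic can be sketched in code as follows. The function name `correlation_t_statistic` is chosen here for illustration, and the critical value 2.306 is copied from the text rather than computed. The sketch also evaluates the statistic with $r$ carried at full precision from $SXX$, $SYY$ and $SXY$ of Example 12.1.2, which suggests how much the result can depend on the number of decimal places kept.

```python
import math


def correlation_t_statistic(r, n):
    """t_0 = sqrt(n - 2) * r / sqrt(1 - r^2), for testing H0: rho = 0."""
    return math.sqrt(n - 2) * r / math.sqrt(1 - r * r)


n = 10

# r rounded to three decimal places, as in the hand calculation
t_rounded = correlation_t_statistic(0.978, n)

# r carried at full precision from the sums of squares of Example 12.1.2
r_full = 151.2 / math.sqrt(60.4 * 396.1)
t_full = correlation_t_statistic(r_full, n)

critical = 2.306   # t_{8; 0.025}, taken from the text
reject = abs(t_rounded) > critical and abs(t_full) > critical

print("t_0 with rounded r:", round(t_rounded, 2))
print("t_0 with full-precision r:", round(t_full, 2))
print("reject H0 at the 0.05 level:", reject)
```

Either way the statistic is far beyond the critical value, so the conclusion does not depend on the rounding.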
With the variables selected in 『eStat』 as in [Figure 12.1.1]{.figure-ref}, click the
regression icon on the main menu; a scatter plot with a
regression line will appear. Clicking the \[Correlation and Regression\]
button below this graph shows the output in [Figure 12.1.5]{.figure-ref} in the
Log Area, together with the result of the regression analysis. The values in this
output differ slightly from those in the textbook because of rounding
error, that is, the number of decimal places carried in the calculation. The
conclusion is the same: the p-value reported for the correlation test is
0.0001, which is less than the significance level of 0.05, and therefore the null
hypothesis is rejected.
![](Figure/Fig120105.png){.imgFig600400}
::: figText
[Figure 12.1.5]{.figure-ref} Testing hypothesis of correlation using 『eStat』
:::
:::
::: mainTablePink
<div>
<div>
<input class="qrBtn" onclick="window.open(addrStr[74])" src="QR/PR120101.svg" type="image"/>
</div>
<div>
**Practice 12.1.1** A professor of statistics argues that a student's
final exam score can be predicted from his/her mid-term score. Ten students
were randomly selected, and their mid-term and final exam scores are as
follows:
id   Mid-term X   Final Y
---- ------------ ---------
1    92           87
2    65           71
3    75           75
4    83           84
5    95           93
6    87           82
7    96           98
8    53           42
9    77           82
10   68           60
::: textLeft
Ex ⇨ eBook ⇨ PR120101_MidtermFinal.csv.
:::
::: textL20M20
1\) Draw a scatter plot of this data with the mid-term score on X axis
and final score on Y axis. What do you think is the relationship between
mid-term and final scores?
:::
::: textL20M20
2\) Find the sample correlation coefficient and test the hypothesis that
the population correlation coefficient is zero with the significance
level of 0.05.
:::
</div>
</div>
:::
::: mainTable
If there are three or more variables in the analysis, the relationships
can be viewed using scatter plots for each pair of
variables, and the sample correlation coefficient can be obtained for each pair.
To make the relationships among the
variables easier to see, the correlations can be arranged in a
matrix format, which is called a correlation matrix. 『eStat』 shows the
correlation matrix together with the significance test for each
coefficient; the test result gives the t value and the p-value.
:::
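
The layout of a correlation matrix can be sketched as follows. The three short data columns are hypothetical values invented only to show the structure, and the helper names are illustrative; the sketch does not reproduce the output format of 『eStat』 and omits the significance tests.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient of two equally long lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """columns: dict mapping a variable name to its list of observations."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy data, used only to illustrate the matrix layout
data = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 4, 3, 6, 5],
    "y": [9, 7, 8, 4, 3, 1],
}

matrix = correlation_matrix(data)
for a in data:
    print(a, [round(matrix[a][b], 3) for b in data])
```

Each diagonal entry is 1 because a variable is perfectly correlated with itself, and the matrix is symmetric because the correlation of $X$ with $Y$ equals that of $Y$ with $X$.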
::: mainTableGrey
**Example 12.1.4** Draw a scatter plot matrix and correlation
coefficient matrix using four variables of the iris data saved in the
following location of 『eStat』.
::: textLeft
Ex ⇨ eBook ⇨ EX120104_Iris.csv
:::
The variables are Sepal.Length, Sepal.Width, Petal.Length, and
Petal.Width. Test the hypotheses that the correlation coefficients
are equal to zero.
**Answer**
In 『eStat』, load the data and click the 'Regression' icon. When
the variable selection box appears, select the four variables
Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width; the
scatter plot matrix will be shown as in [Figure 12.1.6]{.figure-ref}.
The plots suggest that Sepal.Length and Petal.Length, and
Petal.Length and Petal.Width, are related.
<div>
<div>
<input class="qrBtn" onclick="window.open(addrStr[38])" src="QR/EX120104.svg" type="image"/>
</div>
<div>
![](Figure/Fig120106.svg){.imgFig600400}
::: figText
[Figure 12.1.6]{.figure-ref} Scatter plot matrix using 『eStat』
:::
</div>
</div>
Clicking the \[Regression Analysis\] button among the options below the
graph displays the basic statistics and the correlation coefficient matrix, as in
[Figure 12.1.7]{.figure-ref}, in the Log Area together with the test results. It can be
seen that all correlations are significant except the correlation
between Sepal.Length and Sepal.Width.
![](Figure/Fig1201071.png){.imgFig600400}
![](Figure/Fig1201072.png){.imgFig600400}
::: figText
[Figure 12.1.7]{.figure-ref} Descriptive statistics and correlation matrix using
『eStat』
:::
:::
::: mainTablePink
<div>
<div>
<input class="qrBtn" onclick="window.open(addrStr[75])" src="QR/PR120102.svg" type="image"/>
</div>
<div>
**Practice 12.1.2** A health scientist randomly selected 20 people to
determine the effects of smoking and obesity on their physical strength
and examined the average daily smoking rate ($x_1$, number/day), the ratio of
weight by height ($x_2$, kg/m), and the time to exercise with a certain
intensity ($y$, in hours). Draw a scatter plot matrix and test whether there
is a correlation among smoking, obesity, and exercising time with a
certain intensity.
smoking rate $x_1$ ratio of weight by height $x_2$ time to exercise $y$
------------------ ------------------------------- --------------------
24                 53                              11
0                  47                              22
25                 50                              7
0                  52                              26
5                  40                              22
18                 44                              15
20                 46                              9
0                  45                              23
15                 56                              15
6                  40                              24
0                  45                              27
15                 47                              14
18                 41                              13
5                  38                              21
10                 51                              20
0                  43                              24
12                 38                              15
0                  36                              24
15                 43                              12
12                 45                              16
::: textLeft
Ex ⇨ eBook ⇨ PR120102_SmokingObesityExercis.csv.
:::
</div>
</div>
:::
::: mainTablePink
### Multiple Choice Exercise
Choose one answer and click the Submit button.
::: textL30M30
12.1 The variables X and Y have a strong quadratic relationship
($Y = X^2$) as shown in the following table. What is their sample
correlation coefficient?
:::
X      Y
------ ------
\...   \...
-3     9
-2     4
-1     1
0      0
1      1
2      4
3      9
\...   \...
<form name="Q1">
<label><input name="item" type="radio" value="1"/> 1</label><br/>
<label><input name="item" type="radio" value="2"/> 0</label><br/>
<label><input name="item" type="radio" value="3"/> -1</label><br/>
<label><input name="item" type="radio" value="4"/> 1/2</label><br/>
<p>
<input onclick="radio(12,1,Q1)" type="button" value="Submit"/>
<input id="ansQ1" size="15" type="text"/>
</p></form>
::: textL30M30
12.2 Which of the following is an incorrect description of the sample
correlation coefficient $r$?
:::
<form name="Q2">
<label><input name="item" type="radio" value="1"/> \(-1 \le r \le 1\)</label><br/>
<label><input name="item" type="radio" value="2"/> if \( r = 1 \), perfect negative correlation</label><br/>
<label><input name="item" type="radio" value="3"/> if \( r = 0 \), no linear correlation</label><br/>
<label><input name="item" type="radio" value="4"/> if \( r < 0 \), negative correlation</label><br/>
<p>
<input onclick="radio(12,2,Q2)" type="button" value="Submit"/>
<input id="ansQ2" size="15" type="text"/>
</p></form>
::: textL30M30
12.3 Which of the following is a correct description of the sample
correlation coefficient $r$?
<form name="Q3">
<label><input name="item" type="radio" value="1"/> if \( r > 1 \), there is strong positive correlation between \(x\) and \(y\).</label><br/>
<label><input name="item" type="radio" value="2"/> if \( | r | \) is close to 0, there exists a weak linear correlation between \(x\) and \(y\).</label><br/>
<label><input name="item" type="radio" value="3"/> If \( r \) is negative, then \(y\) is increasing when \(x\) increases. </label><br/>
<label><input name="item" type="radio" value="4"/> If \( r \) is near -1, there exists a weak linear correlation between \(x\) and \(y\).</label><br/>
<p>
<input onclick="radio(12,3,Q3)" type="button" value="Submit"/>
<input id="ansQ3" size="15" type="text"/>
</p></form>
::: textL30M30
12.4 If the sample correlation coefficient between $x$ and $y$ is $r$,
what is the sample correlation coefficient between $2x$ and $3y +1$?
:::
<form name="Q4">
<label><input name="item" type="radio" value="1"/> \(r\)</label><br/>
<label><input name="item" type="radio" value="2"/> \(2r\)</label><br/>
<label><input name="item" type="radio" value="3"/> \(3r\)</label><br/>
<label><input name="item" type="radio" value="4"/> \(3r+1\)</label><br/>
<p>
<input onclick="radio(12,4,Q4)" type="button" value="Submit"/>
<input id="ansQ4" size="15" type="text"/>
</p></form>
::: textL30M30
12.5 Find the sample correlation coefficient between $x$ and $y$ of the
following data.
:::
$x$   $y$
----- -----
10    2
20    4
30    6
40    8
<form name="Q5">
<label><input name="item" type="radio" value="1"/> 1</label><br/>
<label><input name="item" type="radio" value="2"/> 0.3</label><br/>
<label><input name="item" type="radio" value="3"/> 0.4</label><br/>
<label><input name="item" type="radio" value="4"/> 0.5</label><br/>
<p>
<input onclick="radio(12,5,Q5)" type="button" value="Submit"/>
<input id="ansQ5" size="15" type="text"/>
</p></form>
::: textL30M30
12.6 If the correlation coefficient of two variables $x, y$ is 0, which
description is correct?
:::
<form name="Q6">
<label><input name="item" type="radio" value="1"/> There is no linear relationship between two variables \(x, y\).</label><br/>
<label><input name="item" type="radio" value="2"/> There is a linear relationship between two variables \(x, y\)</label><br/>
<label><input name="item" type="radio" value="3"/> Two variables \(x, y\) have a strong relationship.</label><br/>
<label><input name="item" type="radio" value="4"/> Two variables \(x, y\) have a strong linear relationship.</label><br/>
<p>
<input onclick="radio(12,6,Q6)" type="button" value="Submit"/>
<input id="ansQ6" size="15" type="text"/>
</p></form>
::: textL30M30
12.7 Which one of the following descriptions on the sample correlation
coefficient $r$ is not right?
:::
<form name="Q7">
<label><input name="item" type="radio" value="1"/> \(r\) is a random variable.</label><br/>
<label><input name="item" type="radio" value="2"/> \(-1 \le r \le 1\)</label><br/>
<label><input name="item" type="radio" value="3"/> \(r\) is a measure of linear relationship between two variables.</label><br/>
<label><input name="item" type="radio" value="4"/> Distribution of \(r\) is a normal distribution.</label><br/>
<p>
<input onclick="radio(12,7,Q7)" type="button" value="Submit"/>
<input id="ansQ7" size="15" type="text"/>
</p></form>
::: textL30M30
12.8 Find the sample correlation coefficient between $x$ and $y$ of the
following data.
:::
$x$   $y$
----- -----
1     5
2     4
3     3
4     2
5     1
<form name="Q8">
<label><input name="item" type="radio" value="1"/> -1</label><br/>
<label><input name="item" type="radio" value="2"/> -1/2</label><br/>
<label><input name="item" type="radio" value="3"/> 0</label><br/>
<label><input name="item" type="radio" value="4"/> 1/2</label><br/>
<p>
<input onclick="radio(12,8,Q8)" type="button" value="Submit"/>
<input id="ansQ8" size="15" type="text"/>
</p></form>
::: textL30M30
12.9 If $X$ and $Y$ are independent, what is the population correlation
coefficient $\rho$?
:::
<form name="Q9">
<label><input name="item" type="radio" value="1"/> 1</label><br/>
<label><input name="item" type="radio" value="2"/> 1/2</label><br/>
<label><input name="item" type="radio" value="3"/> 0</label><br/>
<label><input name="item" type="radio" value="4"/> -1/2</label><br/>
<p>
<input onclick="radio(12,9,Q9)" type="button" value="Submit"/>
<input id="ansQ9" size="15" type="text"/>
</p></form>
::: textL30M30
12.10 Which one of the following is not a correct description of the
sample correlation coefficient $r$ between $X$ and $Y$?
:::
<form name="Q10">
<label><input name="item" type="radio" value="1"/> \(-1 \le r \le 1\)</label><br/>
<label><input name="item" type="radio" value="2"/> Distribution of \(r\) is a normal distribution.</label><br/>
<label><input name="item" type="radio" value="3"/> \(r\) is a random variable.</label><br/>
<label><input name="item" type="radio" value="4"/> The formula to calculate \(r\) is \(r = \frac{\sum (x_i -\overline x)(y_i - \overline y)}{\sqrt{{\sum(x_i - \overline x)^2}{\sum(y_i - \overline y)^2}}}\)</label><br/>
<p>
<input onclick="radio(12,10,Q10)" type="button" value="Submit"/>
<input id="ansQ10" size="15" type="text"/>
</p></form>
::: textL30M30
12.11 Which one of the following pairs has a positive correlation?
:::
<form name="Q11">
<label><input name="item" type="radio" value="1"/> height of mountain and pressure</label><br/>
<label><input name="item" type="radio" value="2"/> weight and height</label><br/>
<label><input name="item" type="radio" value="3"/> monthly income and Engel's coefficient</label><br/>
<label><input name="item" type="radio" value="4"/> amount of production and price</label><br/>
<p>
<input onclick="radio(12,11,Q11)" type="button" value="Submit"/>
<input id="ansQ11" size="15" type="text"/>
</p></form>
::: textL30M30
12.12 Find the sample covariance between $x$ and $y$ of the following
data.
:::
$x$   $y$
----- -----
1     5
2     5
3     5
4     5
<form name="Q12">
<label><input name="item" type="radio" value="1"/> 1</label><br/>
<label><input name="item" type="radio" value="2"/> 0</label><br/>
<label><input name="item" type="radio" value="3"/> 0.5</label><br/>
<label><input name="item" type="radio" value="4"/> -1</label><br/>
<p>
<input onclick="radio(12,12,Q12)" type="button" value="Submit"/>
<input id="ansQ12" size="15" type="text"/>
</p></form>
::: textL30M30
12.13 Find the sample covariance between $x$ and $y$ of the following
data.
#####
$x$   $y$
----- -----
1     6
2     8
3     10
4     12
5     14
<form name="Q13">
<label><input name="item" type="radio" value="1"/> 3</label><br/>
<label><input name="item" type="radio" value="2"/> 4</label><br/>
<label><input name="item" type="radio" value="3"/> 10</label><br/>
<label><input name="item" type="radio" value="4"/> 20</label><br/>
<p>
<input onclick="radio(12,13,Q13)" type="button" value="Submit"/>
<input id="ansQ13" size="15" type="text"/>
</p></form>
::: textL30M30
12.14 If the standard deviations of the $X$ and $Y$ variables are 4.06
and 2.65 respectively, and the covariance is 10.50, what is the sample
correlation coefficient $r$?
:::
<form name="Q14">
<label><input name="item" type="radio" value="1"/> 10.759</label><br/>
<label><input name="item" type="radio" value="2"/> 0.532</label><br/>
<label><input name="item" type="radio" value="3"/> 1.025</label><br/>
<label><input name="item" type="radio" value="4"/> 0.976</label><br/>
<p>
<input onclick="radio(12,14,Q14)" type="button" value="Submit"/>
<input id="ansQ14" size="15" type="text"/>
</p></form>
:::
:::
:::
:::
:::
## Simple Linear Regression Analysis
::: presentation-video-link
[presentation](pdf/1202.pdf){.presentation-link target="_blank"}
[video](https://youtu.be/wn0Dl3dLgko){.video-link target="_blank"}
:::
::: mainTable
Regression analysis is a statistical method that first establishes a
reasonable mathematical model of the relationships between variables,
estimates the model using measured values of the variables, and then
uses the estimated model to describe the relationship between the
variables or to apply it to analyses such as forecasting. For
example, a mathematical model of the relationship between sales ($Y$)
and advertising costs ($X$) would not only explain the relationship
between sales and advertising costs, but would also make it possible to predict
the amount of sales expected from a given advertising investment.
:::
::: mainTableYellow
**Regression Analysis**
Regression analysis is a statistical method that first establishes a
reasonable mathematical model of the relationships between variables,
estimates the model using measured values of the variables, and then
uses the estimated model to describe the relationship between the
variables or to apply it to analyses such as forecasting.
:::
::: mainTable
As such, the regression analysis is intended to investigate and predict
the degree of relation between variables and the shape of the relation.
In regression analysis, a mathematical model of the relation between
variables is called a **regression equation**, and the variable affected
by other related variables is called a **dependent variable**. The
dependent variable is the variable we would like to describe which is
usually observed in response to other variables, so it is also called a
**response variable**. In addition, variables that affect the dependent
variable are called **independent variables**. The independent variable
is also referred to as the **explanatory variable**, because it is used
to describe the dependent variable. In the previous example, if the
objective is to analyse the change in sales amounts resulting from
increases and decreases in advertising costs, the sales amount is a
dependent variable and the advertising cost is an independent variable.
If the number of independent variables included in the regression
equation is one, it is called a **simple linear regression**. If the
number of independent variables is two or more, it is called a
**multiple linear regression**.
:::
### Simple Linear Regression Model
::: mainTable
Simple linear regression analysis has only one independent variable, and
the regression equation is written as follows: $$
Y = f(X,\alpha,\beta) = \alpha + \beta X
$$ In other words, the regression equation is a linear function of
the independent variable, and $\alpha$ and $\beta$ are unknown
parameters that represent the intercept and the slope, respectively.
The parameters $\alpha$ and $\beta$ are called the **regression
coefficients**. The above equation represents an unknown linear
relationship between $Y$ and $X$ in the population and is therefore
referred to as the population regression equation.
In order to estimate the regression coefficients $\alpha$ and $\beta$,
observations of the dependent and independent variables, i.e., a
sample, are required. In general, these observations do not all lie on a
single line. Even if $Y$ and $X$ have an exact linear relation, there
may be measurement errors in the observations, or the relationship
between $Y$ and $X$ may not be exactly linear. Therefore, the regression
equation can be written with an error term that accounts for these
deviations as follows: $$
Y_i = \alpha + \beta X_i + \epsilon_{i}, \quad i=1,2,...,n
$$ where $i$ is the subscript representing the $i^{th}$
observation, and $\epsilon_i$ is a random variable representing the
error. The errors are assumed to be independent of each other, with a
mean of zero and a variance of $\sigma^2$. The error $\epsilon_i$
indicates how far the observation $Y_i$ is from the population
regression equation. The above equation includes the unknown population
parameters $\alpha$, $\beta$ and $\sigma^2$ and is therefore referred
to as the population regression model.
If $a$ and $b$ are the regression coefficients estimated from a sample,
the fitted regression equation, referred to as the sample regression
equation, can be written as follows: $$
{\hat Y}_i = a + b X_i
$$ In this expression, ${\hat Y}_i$ represents the value of $Y$ at
$X=X_i$ predicted by the fitted regression equation. These predicted
values generally do not match the observed values of $Y$ exactly, and
the differences between the two are called residuals, denoted by
$e_i$. $$
\text{Residuals} \qquad e_i = Y_i - {\hat Y}_i , \quad i=1,2,...,n
$$ Regression analysis makes some assumptions about the
unobservable error $\epsilon_i$. Since the residuals $e_i$ calculated
from the sample values have characteristics similar to those of
$\epsilon_i$, they are used to investigate the validity of these
assumptions. (Refer to Section 12.2.6 for residual analysis.)
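
As a minimal sketch of the notation above, the following Python snippet simulates observations from a population regression model and computes the residuals for a given sample regression line. The parameter values ($\alpha = 2$, $\beta = 0.5$, $\sigma = 1$), the trial coefficients `a` and `b`, and all variable names are assumptions chosen for illustration; they are not taken from the example data of this chapter.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Assumed population parameters (illustrative values only)
alpha, beta, sigma = 2.0, 0.5, 1.0

n = 10
X = np.arange(1, n + 1, dtype=float)

# Errors: independent, mean 0, variance sigma^2
eps = rng.normal(loc=0.0, scale=sigma, size=n)

# Population regression model: Y_i = alpha + beta * X_i + eps_i
Y = alpha + beta * X + eps

# Hypothetical estimated coefficients (not computed from the data here)
a, b = 1.8, 0.55

Y_hat = a + b * X   # sample regression equation
e = Y - Y_hat       # residuals e_i = Y_i - Y_hat_i

print(np.round(e, 3))
```

In this sketch the errors `eps` are measured from the population line, which is unknown in practice, whereas the residuals `e` are measured from the fitted line and can always be computed from the sample.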
:::
### Estimation of Regression Coefficient
::: mainTable
When sample data, $(X_1 , Y_1 ) , (X_2 , Y_2 ) , ... , (X_n , Y_n )$,
are given, a straight line representing them can be drawn in many ways.
Since one of the main objectives of regression analysis is prediction,
we would like to use the estimated regression line that makes the
residuals, i.e., the errors that occur when predicting the value of $Y$,
as small as possible. However, it is not possible to minimize the
residuals at all points simultaneously, so the line should be chosen to
make the residuals small overall. The most widely used method for this
minimizes the sum of squared residuals and is called the method
of least squares regression.
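
To make the idea of residuals that are small overall concrete, the sketch below evaluates the sum of squared residuals for a few candidate lines on a small data set. The data values, the candidate $(a, b)$ pairs, and the helper name `sse` are hypothetical and introduced only for illustration.

```python
import numpy as np

# Hypothetical sample data (not from the chapter's examples)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])


def sse(a, b, X, Y):
    """Sum of squared residuals for the line Y_hat = a + b * X."""
    residuals = Y - (a + b * X)
    return float(np.sum(residuals ** 2))


# A few candidate lines, each given as (intercept, slope)
candidates = [(1.0, 1.0), (2.0, 0.7), (2.2, 0.6), (3.0, 0.3)]

for a, b in candidates:
    print(f"a = {a:.1f}, b = {b:.1f}, SSE = {sse(a, b, X, Y):.2f}")
```

Among these candidates, the line with $a = 2.2$ and $b = 0.6$ gives the smallest sum of squared residuals (2.40), and no other straight line does better on these data.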
:::
::: mainTableYellow
**Method of Least Squares Regression**
A method of estimating the regression coefficients so that the sum of
the squared errors over all observations is minimized, i.e.,
$\quad$ Find $\alpha$ and $\beta$ which minimize
$$
\sum_{i=1}^{n} \epsilon_{i}^2 = \sum_{i=1}^{n} ( Y_i - \alpha - \beta X_i )^2
$$
:::
::: mainTable
To obtain the values of $\alpha$ and $\beta$ by the least squares
method, the sum of squares above is differentiated partially with
respect to $\alpha$ and $\beta$, and each partial derivative is set
equal to zero. If the solutions for $\alpha$ and $\beta$ of these
equations are denoted by $a$ and $b$, the equations can be written as
follows: $$
\begin{align}
a \cdot n + b \sum_{i=1}^{n} X_i &= \sum_{i=1}^{n} Y_i \\
a \sum_{i=1}^{n} X_i + b \sum_{i=1}^{n} X_i^2 &= \sum_{i=1}^{n} X_i Y_i \\
\end{align}
$$
The above equations are called the **normal equations**. The solutions
$a$ and $b$ of the normal equations are called the **least squares
estimators** of $\alpha$ and $\beta$ and are given as follows:
:::
::: mainTableYellow
**Least Squares Estimator of $\alpha$ and $\beta$**
$\small \quad b = \frac {\sum_{i=1}^{n} (X_i - \overline X ) (Y_i - \overline Y )} { \sum_{i=1}^{n} (X_i - \overline X )^2 }$
$\small \quad a = \overline Y - b \overline X$
:::
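::: mainTable
The least squares estimator formulas can be checked numerically. The sketch below, written under the assumption of a small hypothetical data set, computes $a$ and $b$ from the formulas and compares them with the solution of the two normal equations obtained with `numpy.linalg.solve`. The data and all variable names are introduced only for illustration.

```python
import numpy as np

# Hypothetical sample data (not from the chapter's examples)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(X)

# Least squares estimators from the closed-form formulas
x_bar, y_bar = X.mean(), Y.mean()
b = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
a = y_bar - b * x_bar

# Normal equations written in matrix form:
#   a * n      + b * sum(X)   = sum(Y)
#   a * sum(X) + b * sum(X^2) = sum(X * Y)
A = np.array([[n, X.sum()],
              [X.sum(), np.sum(X ** 2)]])
rhs = np.array([Y.sum(), np.sum(X * Y)])
a_ne, b_ne = np.linalg.solve(A, rhs)

print(f"formulas:         a = {a:.4f}, b = {b:.4f}")
print(f"normal equations: a = {a_ne:.4f}, b = {b_ne:.4f}")
```

Both routes give the same coefficients for these data, $a = 2.2$ and $b = 0.6$, up to floating-point rounding.
:::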
::: mainTable
If we divide both the numerator and the denominator of $b$ by $n-1$, $b$
can be written as $b = \frac{s_{XY}}{s_{X}^2}$, the sample covariance of
$X$ and $Y$ divided by the sample variance of $X$. Since the correlation
coefficient is $r = \frac{s_{XY}}{s_X s_Y}$, the slope $b$
can also be calculated from the correlation coefficient as follows:
$$
b = \frac{s_{XY}}{s_X ^2} = \frac{ r s_X s_Y } {s_X ^2 } = r \frac{s_Y}{s_X}