2025 rewrite of paired t with lm

3mmaRand · Feb 11, 2025 · 7baa0b3 · 7baa0b3
1 parent 983b6d9
commit 7baa0b3
Show file tree

Hide file tree

Showing 20 changed files with 99 additions and 135 deletions.
diff --git a/adipocytes.png b/adipocytes.png
diff --git a/docs/import_to_report_files/figure-html/unnamed-chunk-11-1.png b/docs/import_to_report_files/figure-html/unnamed-chunk-11-1.png
diff --git a/docs/import_to_report_files/figure-html/unnamed-chunk-13-1.png b/docs/import_to_report_files/figure-html/unnamed-chunk-13-1.png
diff --git a/docs/import_to_report_files/figure-html/unnamed-chunk-14-1.png b/docs/import_to_report_files/figure-html/unnamed-chunk-14-1.png
diff --git a/docs/import_to_report_files/figure-html/unnamed-chunk-15-1.png b/docs/import_to_report_files/figure-html/unnamed-chunk-15-1.png
diff --git a/docs/import_to_report_files/figure-html/unnamed-chunk-16-1.png b/docs/import_to_report_files/figure-html/unnamed-chunk-16-1.png
diff --git a/docs/import_to_report_files/figure-html/unnamed-chunk-17-1.png b/docs/import_to_report_files/figure-html/unnamed-chunk-17-1.png
diff --git a/docs/import_to_report_files/figure-html/unnamed-chunk-18-1.png b/docs/import_to_report_files/figure-html/unnamed-chunk-18-1.png
diff --git a/docs/index.html b/docs/index.html
@@ -314,7 +314,7 @@ <h1 class="title">Computational Analysis for Bioscientists</h1>
     <div>
     <div class="quarto-title-meta-heading">Published</div>
     <div class="quarto-title-meta-contents">
-      <p class="date">10 February, 2025</p>
+      <p class="date">11 February, 2025</p>
     </div>
   </div>
 

diff --git a/docs/one_way_anova_and_kw_files/figure-html/unnamed-chunk-16-1.png b/docs/one_way_anova_and_kw_files/figure-html/unnamed-chunk-16-1.png
diff --git a/docs/search.json b/docs/search.json
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
@@ -66,7 +66,7 @@
   </url>
   <url>
     <loc>https://3mmarand.github.io/comp4biosci/two_sample_tests.html</loc>
-    <lastmod>2025-02-03T15:26:44.876Z</lastmod>
+    <lastmod>2025-02-11T17:45:18.210Z</lastmod>
   </url>
   <url>
     <loc>https://3mmarand.github.io/comp4biosci/one_way_anova_and_kw.html</loc>

diff --git a/docs/two_sample_tests.html b/docs/two_sample_tests.html
diff --git a/docs/two_sample_tests_files/figure-html/unnamed-chunk-13-1.png b/docs/two_sample_tests_files/figure-html/unnamed-chunk-13-1.png
diff --git a/docs/two_sample_tests_files/figure-html/unnamed-chunk-25-1.png b/docs/two_sample_tests_files/figure-html/unnamed-chunk-25-1.png
diff --git a/docs/two_sample_tests_files/figure-html/unnamed-chunk-27-1.png b/docs/two_sample_tests_files/figure-html/unnamed-chunk-27-1.png
diff --git a/docs/two_sample_tests_files/figure-html/unnamed-chunk-29-1.png b/docs/two_sample_tests_files/figure-html/unnamed-chunk-29-1.png
diff --git a/docs/two_way_anova_files/figure-html/unnamed-chunk-16-1.png b/docs/two_way_anova_files/figure-html/unnamed-chunk-16-1.png
diff --git a/docs/workflow_rstudio.html b/docs/workflow_rstudio.html
@@ -480,8 +480,8 @@ <h1 class="title">
 <div class="sourceCode" id="cb6"><pre class="downlit sourceCode r code-with-copy"><code class="sourceCode R"><span><span class="co"># apply a log-square root transformation</span></span>
 <span><span class="va">tnums</span> <span class="op">&lt;-</span> <span class="fu"><a href="https://rdrr.io/r/base/Log.html">log</a></span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/MathFun.html">sqrt</a></span><span class="op">(</span><span class="va">nums</span><span class="op">)</span><span class="op">)</span></span>
 <span><span class="va">tnums</span></span>
-<span><span class="co">##  [1] 1.7631803 1.7169936 0.6931472 2.1587441 2.2769384 2.0036666 1.6836479</span></span>
-<span><span class="co">##  [8] 1.7482538 2.0948274 0.0000000</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
+<span><span class="co">##  [1] 2.2554298 2.2213256 2.0871936 1.9560115 0.3465736 2.1383331 2.1242476</span></span>
+<span><span class="co">##  [8] 1.8187931 1.3862944 1.7631803</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </div>
 <p>The first function to be applied is innermost. When we are using just two functions, the level of nesting does not cause too much difficulty in reading the code. However, you can image this gets more unreadable as the number of functions applied increases. It also makes it harder to debug and find out where an error might be. One solution is to create intermediate variables so the commands a given in order:</p>
 <div class="cell">

diff --git a/two_sample_tests.qmd b/two_sample_tests.qmd
@@ -518,7 +518,7 @@ The data in [marks.csv](data-raw/marks.csv) give the marks for ten
 students in two subjects: Data Analysis and Biology. These data are
 paired because we have two marks from one student so that a mark in one
 group has a closer relationship with one of the marks in the other group
-than with any of the the other values. We want to know if students to equally
+than with any of the other values. We want to know if students do equally
 well in both subjects.
 
 ### Import and explore
@@ -533,8 +533,7 @@ Since these data are paired, it makes sense to highlight how the marks
 differ for each student. One way of doing that is to draw a line linking
 their marks in each subject. This is known as a spaghetti plot. We can
 use two geoms: `geom_point()` and `geom_line()`. To join a student's
-marks we need to set the `group` aesthetic to 
-`student.`[^two_sample_tests-1]
+marks, we need to set the `group` aesthetic to `student`.[^two_sample_tests-1]
 
 [^two_sample_tests-1]: You might like to try removing
     `aes(group = student)` to see what ggplot does when the lines are
@@ -544,106 +543,74 @@ marks we need to set the `group` aesthetic to
 ggplot(data = marks, aes(x = subject, y = mark)) +
   geom_point() +
   geom_line(aes(group = student))
-
 ```
 
 Summarise the data so that we can use the means in plots later:
 
 ```{r}
-marks_summary <- marks |> 
-  group_by(subject) |> 
+marks_summary <- marks |>
+  group_by(subject) |>
   summarise(mean = mean(mark))
 ```
 
-A paired test requires us to test whether the *difference* in marks
-between the two subjects is zero on average. One handy way to achieve
-this is to organise our groups into two columns. The `pivot_wider()`
-function will do this for us. We need to tell it what column gives the
-identifiers (*i.e.* what column matches the pairs). This is the student 
-number in this example. We also need to say which variable contains 
-the values that will become the column names and which contains the values.
-
-Pivot the data so there is a column for each subject:
-
-```{r}
-marks_wide <- marks |> 
-  pivot_wider(id_cols = student,
-              names_from = subject,
-              values_from = mark)
-```
-
-We can summarise the difference between subject,
-`DataAnalysis - Biology` as follows:
-
-```{r}
-marks_wide |> 
-  summarise(mean_diff = mean(DataAnalysis - Biology),
-            sd_diff = sd(DataAnalysis - Biology),
-            n_diff = length(DataAnalysis - Biology),
-            se_diff = sd_diff/sqrt(n_diff))
-```
-
-The mean difference is positive meaning students have higher marks in
-Data Analysis on average.
+A paired test requires us to into account the variation between 
+students.
 
 ### Apply `lm()`
 
 We can create a paired-sample model with the `lm()`
 function[^two_sample_tests-2] like this:
 
 [^two_sample_tests-2]: This is not the only way to apply a paired test.
-    When there are only two groups and no other explanatory variables we
+    When there are only two groups and no other explanatory variables, we
     can use `t.test(data = marks, mark ~ subject, paired = TRUE)`. A
-    more general method which works when you have two or more
+    more general method that works when you have two or more
     non-independent values (e.g., more than two subjects) or additional
-    explanatory variables, is to create a "linear mixed model" with
-    `lmer()`
+    explanatory variables is to create a "linear mixed model" with
+    `lmer()`.
 
 ```{r}
-mod <- lm(data = marks_wide, DataAnalysis - Biology ~ 1)
+mod <- lm(mark ~ subject + factor(student), data = marks)
 ```
 
-The response here is the differences `DataAnalysis - Biology` and the
-`~ 1` indicates we only have one group so we are only interested in
-testing a single mean against a value of zero. Another way of saying
-this is we have an "intercept only model" meaning were are estimating
-only $\beta_0$. There is no $\beta_1$.
+- `mark` is the dependent variable (response).
+- `subject` is the independent variable (explanatory factor).
+- `factor(student)` accounts for the pairing by treating student as
+   another explanatory variable. We have used factor because the values
+   in students are the numbers 1 to 10 and we want `student` to be
+   treated as a category not a number
 
 ```{r}
 summary(mod)
 ```
 
-The Estimates in the Coefficients table gives the `(Intercept)` which is
-the mean of `DataAnalysis - Biology`. The average difference between
-subjects is 7.0 marks. The *p*-value of 0.0355 means 7.0 does differ
-significantly from zero. Individual students get significantly higher 
-marks in Data Analysis than in Biology.
+The coefficient for `(Intercept)` gives the mean Biology mark and that
+for `subjectDataAnalysis` is amount that the Data Analysis mark are 
+above Biology marks in general. The *p*-value tests whether this 
+difference is significantly different from zero. The rest of the output
+considers how students differ. You can ignore this here.
 
 ### Check assumptions
 
-We might expect marks, and thus the difference between marks to be
-normally distributed. However, this is a very small sample and choosing a
-non-parametric test instead would be sensible. We will continue with
-this example to demonstrate how to interpret and report on the result of 
-a parametric paired-samples test (paired-samples *t*-test).
-
+We might expect marks to be normally distributed. However, this is a 
+very small sample, and choosing a non-parametric test instead would be 
+reasonable. However, we will continue with this example to demonstrate 
+how to interpret and report on the result of a parametric paired-samples 
+test (paired-samples *t*-test).
 
-We do not need to plot the residuals against the fitted values
-(`plot(mod, which = 1)`) for a paired test. The purpose of a residuals
-vs fitted plot is to check the variance is same for all fitted values. 
-For a paired test, we only have one fitted value, the mean difference.
-
-The normality of the residuals should be checked.
+A plot the residuals against the fitted values (`plot(mod, which = 1)`) 
+is not useful for a paired test. The normality of the residuals should 
+be checked.
 
 ```{r}
-ggplot(mapping = aes(x = mod$residuals)) + 
+ggplot(mapping = aes(x = mod$residuals)) +
   geom_histogram(bins = 3)
 ```
 
-We only have 10 values so the distribution is never going to look
-smooth. We can't draw strong conclusions from this but we do at least
-have a peak at 0. Similarly a normality test is likely to be
-non-significant because of the small sample size which means the test 
+We only have 10 values, so the distribution is never going to look
+smooth. We can't draw strong conclusions from this, but we do at least
+have a peak at 0. Similarly, a normality test is likely to be
+non-significant because of the small sample size, meaning the test
 is not very powerful. This means a non-significant result
 is not strong evidence of the residuals following a normal distribution:
 
@@ -659,35 +626,34 @@ difference of 6.5%. See @fig-marks
 
 ::: {#fig-marks}
 
-
 ```{r}
 #| code-fold: true
 
 ggplot(data = marks, aes(x = subject, y = mark)) +
-  geom_point(pch = 1, size = 3) + 
+  geom_point(pch = 1, size = 3) +
   geom_line(aes(group = student), linetype = 3) +
-  geom_point(data = marks_summary, 
-             aes(x = subject, y = mean), 
+  geom_point(data = marks_summary,
+             aes(x = subject, y = mean),
              size = 3) +
   scale_x_discrete(name = "") +
   scale_y_continuous(name = "Mark",
                      expand = c(0, 0),
                      limits = c(0, 110)) +
-  annotate("segment", x = 1, xend = 2, 
+  annotate("segment", x = 1, xend = 2,
            y = 105, yend = 105,
            colour = "black") +
-  annotate("text", x = 1.5,  y = 108, 
+  annotate("text", x = 1.5,  y = 108,
            label = expression(italic(p)~"= 0.0355")) +
   theme_classic()
-
 ```
 
 **Students score higher in Data Analysis than in Biology**. Open circles indicate an individual student's marks in each subject with dashed lines joining their marks in each subject. The filled circles indicate the mean mark for each subject. Individual students score significantly higher in Data Analysis than in
-Biology (*t* = 2.47; *d.f.* = 9; *p* = 0.0355) with an average. difference of 6.5%. Data analysis was conducted in R [@R-core] with tidyverse packages [@tidyverse].
+Biology (*t* = 2.47; *d.f.* = 9; *p* = 0.0355) with an average difference of 6.5%. Data analysis was conducted in R [@R-core] with tidyverse packages [@tidyverse].
 
 :::
 
 
+
 ## Two paired-samples, non-parametric
 
 We have the marks for just 10 students. This sample is too small for us to judge whether the marks are normally distributed. We will use a non-parametric test instead. The "Wilcoxon signed-rank" test is the non-parametric equivalent of the paired-samples *t*-test. This is often referred to as the paired-sample