Skip to content

Commit

Permalink
2025 rewrite of paired t with lm
Browse files Browse the repository at this point in the history
  • Loading branch information
3mmaRand committed Feb 11, 2025
1 parent 983b6d9 commit 7baa0b3
Show file tree
Hide file tree
Showing 20 changed files with 99 additions and 135 deletions.
Binary file modified adipocytes.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified docs/import_to_report_files/figure-html/unnamed-chunk-11-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified docs/import_to_report_files/figure-html/unnamed-chunk-13-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified docs/import_to_report_files/figure-html/unnamed-chunk-14-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified docs/import_to_report_files/figure-html/unnamed-chunk-15-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified docs/import_to_report_files/figure-html/unnamed-chunk-16-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified docs/import_to_report_files/figure-html/unnamed-chunk-17-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified docs/import_to_report_files/figure-html/unnamed-chunk-18-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
2 changes: 1 addition & 1 deletion docs/index.html
Original file line number Diff line number Diff line change
Expand Up @@ -314,7 +314,7 @@ <h1 class="title">Computational Analysis for Bioscientists</h1>
<div>
<div class="quarto-title-meta-heading">Published</div>
<div class="quarto-title-meta-contents">
<p class="date">10 February, 2025</p>
<p class="date">11 February, 2025</p>
</div>
</div>

Expand Down
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
6 changes: 3 additions & 3 deletions docs/search.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/sitemap.xml
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@
</url>
<url>
<loc>https://3mmarand.github.io/comp4biosci/two_sample_tests.html</loc>
<lastmod>2025-02-03T15:26:44.876Z</lastmod>
<lastmod>2025-02-11T17:45:18.210Z</lastmod>
</url>
<url>
<loc>https://3mmarand.github.io/comp4biosci/one_way_anova_and_kw.html</loc>
Expand Down
102 changes: 50 additions & 52 deletions docs/two_sample_tests.html

Large diffs are not rendered by default.

Binary file modified docs/two_sample_tests_files/figure-html/unnamed-chunk-13-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified docs/two_sample_tests_files/figure-html/unnamed-chunk-27-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file not shown.
Binary file modified docs/two_way_anova_files/figure-html/unnamed-chunk-16-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
4 changes: 2 additions & 2 deletions docs/workflow_rstudio.html
Original file line number Diff line number Diff line change
Expand Up @@ -480,8 +480,8 @@ <h1 class="title">
<div class="sourceCode" id="cb6"><pre class="downlit sourceCode r code-with-copy"><code class="sourceCode R"><span><span class="co"># apply a log-square root transformation</span></span>
<span><span class="va">tnums</span> <span class="op">&lt;-</span> <span class="fu"><a href="https://rdrr.io/r/base/Log.html">log</a></span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/MathFun.html">sqrt</a></span><span class="op">(</span><span class="va">nums</span><span class="op">)</span><span class="op">)</span></span>
<span><span class="va">tnums</span></span>
<span><span class="co">## [1] 1.7631803 1.7169936 0.6931472 2.1587441 2.2769384 2.0036666 1.6836479</span></span>
<span><span class="co">## [8] 1.7482538 2.0948274 0.0000000</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<span><span class="co">## [1] 2.2554298 2.2213256 2.0871936 1.9560115 0.3465736 2.1383331 2.1242476</span></span>
<span><span class="co">## [8] 1.8187931 1.3862944 1.7631803</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
<p>The first function to be applied is innermost. When we are using just two functions, the level of nesting does not cause too much difficulty in reading the code. However, you can image this gets more unreadable as the number of functions applied increases. It also makes it harder to debug and find out where an error might be. One solution is to create intermediate variables so the commands a given in order:</p>
<div class="cell">
Expand Down
118 changes: 42 additions & 76 deletions two_sample_tests.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -518,7 +518,7 @@ The data in [marks.csv](data-raw/marks.csv) give the marks for ten
students in two subjects: Data Analysis and Biology. These data are
paired because we have two marks from one student so that a mark in one
group has a closer relationship with one of the marks in the other group
than with any of the the other values. We want to know if students to equally
than with any of the other values. We want to know if students do equally
well in both subjects.

### Import and explore
Expand All @@ -533,8 +533,7 @@ Since these data are paired, it makes sense to highlight how the marks
differ for each student. One way of doing that is to draw a line linking
their marks in each subject. This is known as a spaghetti plot. We can
use two geoms: `geom_point()` and `geom_line()`. To join a student's
marks we need to set the `group` aesthetic to
`student.`[^two_sample_tests-1]
marks, we need to set the `group` aesthetic to `student`.[^two_sample_tests-1]

[^two_sample_tests-1]: You might like to try removing
`aes(group = student)` to see what ggplot does when the lines are
Expand All @@ -544,106 +543,74 @@ marks we need to set the `group` aesthetic to
ggplot(data = marks, aes(x = subject, y = mark)) +
geom_point() +
geom_line(aes(group = student))
```

Summarise the data so that we can use the means in plots later:

```{r}
marks_summary <- marks |>
group_by(subject) |>
marks_summary <- marks |>
group_by(subject) |>
summarise(mean = mean(mark))
```

A paired test requires us to test whether the *difference* in marks
between the two subjects is zero on average. One handy way to achieve
this is to organise our groups into two columns. The `pivot_wider()`
function will do this for us. We need to tell it what column gives the
identifiers (*i.e.* what column matches the pairs). This is the student
number in this example. We also need to say which variable contains
the values that will become the column names and which contains the values.

Pivot the data so there is a column for each subject:

```{r}
marks_wide <- marks |>
pivot_wider(id_cols = student,
names_from = subject,
values_from = mark)
```

We can summarise the difference between subject,
`DataAnalysis - Biology` as follows:

```{r}
marks_wide |>
summarise(mean_diff = mean(DataAnalysis - Biology),
sd_diff = sd(DataAnalysis - Biology),
n_diff = length(DataAnalysis - Biology),
se_diff = sd_diff/sqrt(n_diff))
```

The mean difference is positive meaning students have higher marks in
Data Analysis on average.
A paired test requires us to into account the variation between
students.

### Apply `lm()`

We can create a paired-sample model with the `lm()`
function[^two_sample_tests-2] like this:

[^two_sample_tests-2]: This is not the only way to apply a paired test.
When there are only two groups and no other explanatory variables we
When there are only two groups and no other explanatory variables, we
can use `t.test(data = marks, mark ~ subject, paired = TRUE)`. A
more general method which works when you have two or more
more general method that works when you have two or more
non-independent values (e.g., more than two subjects) or additional
explanatory variables, is to create a "linear mixed model" with
`lmer()`
explanatory variables is to create a "linear mixed model" with
`lmer()`.

```{r}
mod <- lm(data = marks_wide, DataAnalysis - Biology ~ 1)
mod <- lm(mark ~ subject + factor(student), data = marks)
```

The response here is the differences `DataAnalysis - Biology` and the
`~ 1` indicates we only have one group so we are only interested in
testing a single mean against a value of zero. Another way of saying
this is we have an "intercept only model" meaning were are estimating
only $\beta_0$. There is no $\beta_1$.
- `mark` is the dependent variable (response).
- `subject` is the independent variable (explanatory factor).
- `factor(student)` accounts for the pairing by treating student as
another explanatory variable. We have used factor because the values
in students are the numbers 1 to 10 and we want `student` to be
treated as a category not a number

```{r}
summary(mod)
```

The Estimates in the Coefficients table gives the `(Intercept)` which is
the mean of `DataAnalysis - Biology`. The average difference between
subjects is 7.0 marks. The *p*-value of 0.0355 means 7.0 does differ
significantly from zero. Individual students get significantly higher
marks in Data Analysis than in Biology.
The coefficient for `(Intercept)` gives the mean Biology mark and that
for `subjectDataAnalysis` is amount that the Data Analysis mark are
above Biology marks in general. The *p*-value tests whether this
difference is significantly different from zero. The rest of the output
considers how students differ. You can ignore this here.

### Check assumptions

We might expect marks, and thus the difference between marks to be
normally distributed. However, this is a very small sample and choosing a
non-parametric test instead would be sensible. We will continue with
this example to demonstrate how to interpret and report on the result of
a parametric paired-samples test (paired-samples *t*-test).

We might expect marks to be normally distributed. However, this is a
very small sample, and choosing a non-parametric test instead would be
reasonable. However, we will continue with this example to demonstrate
how to interpret and report on the result of a parametric paired-samples
test (paired-samples *t*-test).

We do not need to plot the residuals against the fitted values
(`plot(mod, which = 1)`) for a paired test. The purpose of a residuals
vs fitted plot is to check the variance is same for all fitted values.
For a paired test, we only have one fitted value, the mean difference.

The normality of the residuals should be checked.
A plot the residuals against the fitted values (`plot(mod, which = 1)`)
is not useful for a paired test. The normality of the residuals should
be checked.

```{r}
ggplot(mapping = aes(x = mod$residuals)) +
ggplot(mapping = aes(x = mod$residuals)) +
geom_histogram(bins = 3)
```

We only have 10 values so the distribution is never going to look
smooth. We can't draw strong conclusions from this but we do at least
have a peak at 0. Similarly a normality test is likely to be
non-significant because of the small sample size which means the test
We only have 10 values, so the distribution is never going to look
smooth. We can't draw strong conclusions from this, but we do at least
have a peak at 0. Similarly, a normality test is likely to be
non-significant because of the small sample size, meaning the test
is not very powerful. This means a non-significant result
is not strong evidence of the residuals following a normal distribution:

Expand All @@ -659,35 +626,34 @@ difference of 6.5%. See @fig-marks

::: {#fig-marks}


```{r}
#| code-fold: true
ggplot(data = marks, aes(x = subject, y = mark)) +
geom_point(pch = 1, size = 3) +
geom_point(pch = 1, size = 3) +
geom_line(aes(group = student), linetype = 3) +
geom_point(data = marks_summary,
aes(x = subject, y = mean),
geom_point(data = marks_summary,
aes(x = subject, y = mean),
size = 3) +
scale_x_discrete(name = "") +
scale_y_continuous(name = "Mark",
expand = c(0, 0),
limits = c(0, 110)) +
annotate("segment", x = 1, xend = 2,
annotate("segment", x = 1, xend = 2,
y = 105, yend = 105,
colour = "black") +
annotate("text", x = 1.5, y = 108,
annotate("text", x = 1.5, y = 108,
label = expression(italic(p)~"= 0.0355")) +
theme_classic()
```

**Students score higher in Data Analysis than in Biology**. Open circles indicate an individual student's marks in each subject with dashed lines joining their marks in each subject. The filled circles indicate the mean mark for each subject. Individual students score significantly higher in Data Analysis than in
Biology (*t* = 2.47; *d.f.* = 9; *p* = 0.0355) with an average. difference of 6.5%. Data analysis was conducted in R [@R-core] with tidyverse packages [@tidyverse].
Biology (*t* = 2.47; *d.f.* = 9; *p* = 0.0355) with an average difference of 6.5%. Data analysis was conducted in R [@R-core] with tidyverse packages [@tidyverse].

:::



## Two paired-samples, non-parametric

We have the marks for just 10 students. This sample is too small for us to judge whether the marks are normally distributed. We will use a non-parametric test instead. The "Wilcoxon signed-rank" test is the non-parametric equivalent of the paired-samples *t*-test. This is often referred to as the paired-sample
Expand Down

0 comments on commit 7baa0b3

Please sign in to comment.