summary and legends for one-way anova. improved summaries for two-sample and regression
3mmaRand committed Feb 23, 2024
1 parent bad873b commit 2cc4f49
Showing 23 changed files with 340 additions and 112 deletions.
Binary file modified adipocytes.png
2 changes: 1 addition & 1 deletion docs/ideas_about_data.html
@@ -404,7 +404,7 @@ <h1 class="title">
<ul>
<li><p>the range: the difference between the maximum value and the minimum value in a variable</p></li>
<li><p>the interquartile range: two values, the first quartile and the third quartile. The first quartile is halfway between the median value and the lowest value when the values are arranged in order and the third quartile is halfway between the median value and the highest value</p></li>
<li><p>the variance: the average of the squared differences between each value and the variable’s mean, <span class="math inline">\(\bar{x} = \frac{(\sum{x - \bar{x})^2}}{n - 1}\)</span></p></li>
<li><p>the variance: the average of the squared differences between each value and the variable’s mean, <span class="math inline">\(s^2 = \frac{\sum{(x - \bar{x})^2}}{n - 1}\)</span></p></li>
<li><p>the standard deviation: the square root of the variance.</p></li>
</ul></section><section id="discrete-data" class="level2" data-number="5.4"><h2 data-number="5.4" class="anchored" data-anchor-id="discrete-data">
<span class="header-section-number">5.4</span> Discrete data</h2>
Binary file modified docs/import_to_report_files/figure-html/unnamed-chunk-11-1.png
Binary file modified docs/import_to_report_files/figure-html/unnamed-chunk-13-1.png
Binary file modified docs/import_to_report_files/figure-html/unnamed-chunk-14-1.png
Binary file modified docs/import_to_report_files/figure-html/unnamed-chunk-15-1.png
Binary file modified docs/import_to_report_files/figure-html/unnamed-chunk-16-1.png
Binary file modified docs/import_to_report_files/figure-html/unnamed-chunk-17-1.png
Binary file modified docs/import_to_report_files/figure-html/unnamed-chunk-18-1.png
2 changes: 1 addition & 1 deletion docs/index.html
@@ -302,7 +302,7 @@ <h1 class="title">Computational Analysis for Bioscientists</h1>
<div>
<div class="quarto-title-meta-heading">Published</div>
<div class="quarto-title-meta-contents">
<p class="date">22 February, 2024</p>
<p class="date">23 February, 2024</p>
</div>
</div>

65 changes: 44 additions & 21 deletions docs/one_way_anova_and_kw.html

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions docs/search.json

Large diffs are not rendered by default.

10 changes: 10 additions & 0 deletions docs/single_linear_regression.html
@@ -648,6 +648,16 @@ <h1 class="title">
</div>
</section></section><section id="summary" class="level2" data-number="12.4"><h2 data-number="12.4" class="anchored" data-anchor-id="summary">
<span class="header-section-number">12.4</span> Summary</h2>
<ol type="1">
<li><p>Single linear regression is appropriate when you have one continuous explanatory variable and one continuous response and the relationship between the two is linear.</p></li>
<li><p>Applying a single linear regression to data means putting a line of best fit through it. We estimate the <strong>coefficients</strong> (also called the <strong>parameters</strong>) of the model. These are the intercept, <span class="math inline">\(\beta_0\)</span>, and the slope, <span class="math inline">\(\beta_1\)</span>. We test whether the parameters differ significantly from zero.</p></li>
<li><p>We can use <code><a href="https://rdrr.io/r/stats/lm.html">lm()</a></code> to fit a linear regression, as sketched after this list.</p></li>
<li><p>In the output of <code><a href="https://rdrr.io/r/stats/lm.html">lm()</a></code> the coefficients are listed in a table in the Estimates column. The <em>p</em>-value for each coefficient is from the test of whether it differs from zero. At the bottom of the output there is a test of the model <em>overall</em>. In a single linear regression this is exactly the same as the test of <span class="math inline">\(\beta_1\)</span> and the <em>p</em>-values are identical. The R-squared value is the proportion of the variance in the response variable that is explained by the model.</p></li>
<li><p>The assumptions of the general linear model are that the residuals are normally distributed and have homogeneity of variance. A residual is the difference between the predicted value and the observed value.</p></li>
<li><p>We examine a histogram of the residuals and use the Shapiro-Wilk normality test to check the normality assumption. We check the variance of the residuals is the same for all fitted values with a residuals vs fitted plot.</p></li>
<li><p>If the assumptions are not met, we might need to transform the data or use a different type of model.</p></li>
<li><p>When reporting the results of a regression we give the significance, direction and size of the effect. Often we give the equation of the best fitting line. A figure should show the data and the line of best fit.</p></li>
</ol>
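<p>As a minimal sketch of the workflow summarised above (the data frame <code>dat</code> and its variables are hypothetical names used only for illustration):</p>
<pre class="sourceCode r"><code># fit a single linear regression of the response on the explanatory variable
mod &lt;- lm(response ~ explanatory, data = dat)

# coefficients with their p-values, the overall F test and R-squared
summary(mod)

# check the normality of the residuals
hist(residuals(mod))
shapiro.test(residuals(mod))

# check homogeneity of variance with a residuals vs fitted plot
plot(mod, which = 1)</code></pre>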


<div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0" role="list" style="display: none">
8 changes: 4 additions & 4 deletions docs/sitemap.xml
@@ -30,7 +30,7 @@
</url>
<url>
<loc>https://3mmarand.github.io/comp4biosci/ideas_about_data.html</loc>
<lastmod>2023-10-17T14:17:55.511Z</lastmod>
<lastmod>2024-02-23T11:50:54.420Z</lastmod>
</url>
<url>
<loc>https://3mmarand.github.io/comp4biosci/first_steps_rstudio.html</loc>
@@ -62,15 +62,15 @@
</url>
<url>
<loc>https://3mmarand.github.io/comp4biosci/single_linear_regression.html</loc>
<lastmod>2024-02-22T13:48:27.629Z</lastmod>
<lastmod>2024-02-23T10:52:10.379Z</lastmod>
</url>
<url>
<loc>https://3mmarand.github.io/comp4biosci/two_sample_tests.html</loc>
<lastmod>2024-02-22T13:57:51.528Z</lastmod>
<lastmod>2024-02-23T10:52:20.092Z</lastmod>
</url>
<url>
<loc>https://3mmarand.github.io/comp4biosci/one_way_anova_and_kw.html</loc>
<lastmod>2024-02-18T11:41:34.000Z</lastmod>
<lastmod>2024-02-23T14:05:48.867Z</lastmod>
</url>
<url>
<loc>https://3mmarand.github.io/comp4biosci/two_way_anova.html</loc>
10 changes: 6 additions & 4 deletions docs/two_sample_tests.html
@@ -1108,14 +1108,16 @@ <h1 class="title">
<span class="header-section-number">13.7</span> Summary</h2>
<ol type="1">
<li><p>A linear model with one explanatory variable with two groups and one continuous response is “a two-sample test”.</p></li>
<li><p>If pairs of observations in the groups have something in common that make them more similar to each other, than to other observations, then those observations are not independent</p></li>
<li><p>A paired-samples test is used when the observations are not independent.</p></li>
<li><p>If pairs of observations in the groups have something in common that makes them more similar to each other than to other observations, then those observations are not independent. A <strong>paired-samples test</strong> is used when the observations are not independent.</p></li>
<li><p>A linear model with one explanatory variable with two groups and one continuous response is also known as a <strong>two-sample <em>t</em>-test</strong> when the samples are independent and as a <strong>paired-samples <em>t</em>-test</strong> when they are not.</p></li>
<li><p>We can use <code><a href="https://rdrr.io/r/stats/lm.html">lm()</a></code> to do two-sample and paired-samples tests. We can also use <code><a href="https://rdrr.io/r/stats/t.test.html">t.test()</a></code> for these but using <code><a href="https://rdrr.io/r/stats/lm.html">lm()</a></code> helps us understand tests with more groups and/or more variables where we will have to use <code><a href="https://rdrr.io/r/stats/lm.html">lm()</a></code>. The output of <code><a href="https://rdrr.io/r/stats/lm.html">lm()</a></code> is also more typical of the output of statistical functions in R. A minimal sketch is given after this list.</p></li>
<li><p>We estimate the <strong>coefficients</strong> (also called the <strong>parameters</strong>) of the model. For a two-sample test these are the mean of the first group, <span class="math inline">\(\beta_0\)</span> (which might also be called the intercept) and the difference between the means of the first and second groups, <span class="math inline">\(\beta_1\)</span> (which might also be called the slope). For a paired-sample test there is just one parameter, the mean difference between pairs of values, <span class="math inline">\(\beta_0\)</span> (which might also be called the intercept). We test whether the parameters differ significantly from zero.</p></li>
<li><p>We can use <code><a href="https://rdrr.io/r/stats/lm.html">lm()</a></code> to fit a linear regression.</p></li>
<li><p>In the output of <code><a href="https://rdrr.io/r/stats/lm.html">lm()</a></code> the coefficients are listed in a table in the Estimates column. The <em>p</em>-value for each coefficient is from the test of whether it differs from zero. At the bottom of the output there is a test of the model <em>overall</em>. In this case, this is exactly the same as the test of <span class="math inline">\(\beta_1\)</span> and the <em>p</em>-values are identical. The R-squared value is the proportion of the variance in the response variable that is explained by the model.</p></li>
<li><p>The assumptions of the general linear model are that the residuals are normally distributed and have homogeneity of variance. A residual is the difference between the predicted value and the observed value.</p></li>
<li><p>We examine a histogram of the residuals and use the Shapiro-Wilk normality tests to check the normality assumption. We check the variance of the residuals is the same for all fitted values with a residuals vs fitted plot.</p></li>
<li><p>We examine a histogram of the residuals and use the Shapiro-Wilk normality test to check the normality assumption. We check the variance of the residuals is the same for all fitted values with a residuals vs fitted plot.</p></li>
<li><p>If the assumptions are not met, we can use alternatives known as non-parametric tests. These are applied with <code><a href="https://rdrr.io/r/stats/wilcox.test.html">wilcox.test()</a></code> in R.</p></li>
<li><p>When reporting the results of a test we give the significance, direction and size of the effect. Our figures and the values we give should reflect the type of test we have used. We use means and standard errors for parametric tests and medians and interquartile ranges for non-parametric tests. We also give the test statistic, the degrees of freedom (parametric) or sample size (non-parametric) and the p-value.</p></li>
<li><p>When reporting the results of a test we give the significance, direction and size of the effect. Our figures and the values we give should reflect the type of test we have used. We use means and standard errors for parametric tests and medians and interquartile ranges for non-parametric tests. We also give the test statistic, the degrees of freedom (parametric) or sample size (non-parametric) and the p-value. We annotate our figures with the p-value, making clear which comparison it applies to.</p></li>
</ol>
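<p>As a minimal sketch of the two-sample workflow summarised above (the data frame <code>dat</code>, the response <code>mass</code> and the two-group factor <code>treatment</code> are hypothetical names used only for illustration):</p>
<pre class="sourceCode r"><code># two-sample test as a linear model: beta_0 is the mean of the first group,
# beta_1 is the difference between the two group means
mod &lt;- lm(mass ~ treatment, data = dat)
summary(mod)

# check the assumptions on the residuals
shapiro.test(residuals(mod))
plot(mod, which = 1)

# non-parametric alternative if the assumptions are not met
wilcox.test(mass ~ treatment, data = dat)</code></pre>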


Binary file modified docs/two_sample_tests_files/figure-html/unnamed-chunk-13-1.png
Binary file modified docs/two_way_anova_files/figure-html/fig-para-1.png
4 changes: 2 additions & 2 deletions docs/workflow_rstudio.html
@@ -468,8 +468,8 @@ <h1 class="title">
<div class="sourceCode" id="cb6"><pre class="downlit sourceCode r code-with-copy"><code class="sourceCode R"><span><span class="co"># apply a log-square root transformation</span></span>
<span><span class="va">tnums</span> <span class="op">&lt;-</span> <span class="fu"><a href="https://rdrr.io/r/base/Log.html">log</a></span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/MathFun.html">sqrt</a></span><span class="op">(</span><span class="va">nums</span><span class="op">)</span><span class="op">)</span></span>
<span><span class="va">tnums</span></span>
<span><span class="co">## [1] 2.087194 2.297560 1.386294 0.000000 1.242453 2.124248 1.151293 2.055437</span></span>
<span><span class="co">## [9] 2.109754 1.748254</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<span><span class="co">## [1] 1.7917595 0.5493061 1.4166067 2.0715674 1.5222612 0.9729551 2.2213256</span></span>
<span><span class="co">## [8] 2.2924837 2.2154084 1.6836479</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
<p>The first function to be applied is innermost. When we are using just two functions, the level of nesting does not cause too much difficulty in reading the code. However, you can imagine this gets more unreadable as the number of functions applied increases. It also makes it harder to debug and find out where an error might be. One solution is to create intermediate variables so the commands are given in order:</p>
<div class="cell">
110 changes: 97 additions & 13 deletions one_way_anova_and_kw.qmd
@@ -5,7 +5,7 @@
#| echo: false
source("_common.R")
status("polishing")
status("complete")
```

## Overview
@@ -134,14 +134,15 @@ Import the data:
culture <- read_csv("data-raw/culture.csv")
```


```{r}
#| echo: false
knitr::kable(culture) |>
kableExtra::kable_styling() |>
kableExtra::scroll_box(height = "200px")
```


The Response variable is colony diameters in millimetres and we would
expect it to be continuous. The Explanatory variable is type of media
and is categorical with 3 groups. It is known as “one-way ANOVA” or
@@ -420,19 +421,20 @@

### Report

There is a significant effect of media on the diameter of bacterial
colonies (*F* = 6.11; *d.f.* = 2, 27; *p* = 0.006) with colonies growing
significantly better when both sugar and amino acids are added to the
medium. Post-hoc testing with Tukey's Honestly Significant Difference
test [@tukey1949] revealed the colony diameters were significantly
larger when grown with both sugar and amino acids
($\bar{x} \pm s.e$: 11.4 $\pm$ 0.37 mm) than with neither
There was a significant effect of media on the diameter of bacterial
colonies (*F* = 6.11; *d.f.* = 2, 27; *p* = 0.006). Post-hoc testing
with Tukey's Honestly Significant Difference test [@tukey1949] revealed
the colony diameters were significantly larger when grown with both
sugar and amino acids ($\bar{x} \pm s.e$: 11.4 $\pm$ 0.37 mm) than with
neither
(10.2 $\pm$ 0.26 mm; *p* = 0.0092) or just sugar (10.1 $\pm$ 0.23 mm;
*p* = 0.0244). See @fig-culture.
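
A minimal sketch of how post-hoc comparisons like these can be obtained
with the **`emmeans`** package; the model object name `mod_culture` is an
assumption made here only for illustration:

```{r}
#| eval: false
# Tukey-adjusted pairwise comparisons of the group means, assuming the
# model was fitted with mod_culture <- lm(diameter ~ medium, data = culture)
library(emmeans)
emmeans(mod_culture, ~ medium) |> pairs()
```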



::: {#fig-culture}
```{r}
#| label: fig-culture
#| fig-cap: "Diameters of bacterial colonies grown on three types of media: control, with sugar added and with both sugar and amino acids added. Errors bars are ± 1 s.e."
#| code-fold: true
ggplot() +
geom_point(data = culture, aes(x = medium, y = diameter),
@@ -462,6 +464,18 @@ ggplot() +
theme_classic()
```

**Medium affects bacterial colony diameter**. Ten replicate colonies
were grown on three types of media: control, with sugar added and with
both sugar and amino acids added. Error bars show the mean $\pm$ 1 standard
error. There was a significant effect of media on the diameter of
bacterial colonies (*F* = 6.11; *d.f.* = 2, 27; *p* = 0.006). Post-hoc
testing with Tukey's Honestly Significant Difference test [@tukey1949]
revealed the colony diameters were significantly larger when grown with
both sugar and amino acids than with neither or just sugar. Data
analysis was conducted in R [@R-core] with tidyverse packages [@tidyverse].

:::

## Kruskal-Wallis

Our examination of the assumptions revealed a possible violation of the
@@ -552,9 +566,10 @@ grown with both sugar and amino acids ($median = 11.3 mm$) than with
neither ($median = 10.2 mm$; *p* = 0.031) or just sugar
($median = 10.2 mm$; *p* = 0.038). See @fig-culture-kw.
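
A minimal sketch of these non-parametric tests (chunk not evaluated here);
`culture`, `diameter` and `medium` are the names used earlier in this chapter:

```{r}
#| eval: false
# Kruskal-Wallis test of colony diameter across the three media
kruskal.test(diameter ~ medium, data = culture)

# post-hoc pairwise comparisons with the Dunn test from the FSA package
FSA::dunnTest(diameter ~ medium, data = culture)
```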


::: {#fig-culture-kw}
```{r}
#| label: fig-culture-kw
#| fig-cap: "Diameters of bacterial colonies grown on three types of media: control, with sugar added and with both sugar and amino acids added. Heavy line indicate the median, boxes the interquartile range and whiskers the range."
#| code-fold: true
ggplot(data = culture, aes(x = medium, y = diameter)) +
geom_boxplot() +
@@ -575,3 +590,72 @@ ggplot(data = culture, aes(x = medium, y = diameter)) +
label = expression(italic(p)~"= 0.031")) +
theme_classic()
```


**Medium affects bacterial colony diameter**. Ten replicate colonies
were grown on three types of media: control, with sugar added and with
both sugar and amino acids added. The heavy lines
indicate median diameter, boxes indicate the interquartile range
and whiskers the range. There was a significant effect of media on the
diameter of bacterial colonies (Kruskal-Wallis: *chi-squared* = 6.34,
*df* = 2, *p*-value = 0.042). Post-hoc testing with the Dunn test
[@dunn1964] revealed the colony diameters were significantly larger when
grown with both sugar and amino acids than with neither or just sugar.
Data analysis was conducted in R [@R-core] with
tidyverse packages [@tidyverse].

:::

## Summary

1. A linear model with one explanatory variable with two or more groups
is also known as a **one-way ANOVA**.

2. We estimate the **coefficients** (also called the **parameters**) of
the model. For a one-way ANOVA with three groups these are the mean
of the first group, $\beta_0$, the difference between the means of
the first and second groups, $\beta_1$, and the difference between
the means of the first and third groups, $\beta_2$. We test whether the
parameters differ significantly from zero.

3. We can use `lm()` to carry out a one-way ANOVA in R, as sketched after this list.

4. When we get a significant effect of our explanatory variable, it only
tells us that at least two of the means differ. To find out which
means differ, we need a **post-hoc** test. Here we use Tukey’s HSD
applied with the `emmeans()` and `pairs()` functions from the
**`emmeans`** package. Post-hoc tests make adjustments to the
*p*-values to account for the fact that we are doing multiple tests.

5. In the output of `lm()` the coefficients are listed in a table in the
Estimates column. The *p*-value for each coefficient is in the test
of whether it differs from zero. At the bottom of the output there
is a test of the model *overall*. Now we have more than two
parameters, this is different from the test on any one parameter. The
R-squared value is the proportion of the variance in the response
variable that is explained by the model. It tells us is the
explanatory variable is useful in predicting the response variable
overall.

6. The assumptions of the general linear model are that the residuals
are normally distributed and have homogeneity of variance. A residual
is the difference between the predicted value and the observed value.

7. We examine a histogram of the residuals and use the Shapiro-Wilk
normality test to check the normality assumption. We check the
variance of the residuals is the same for all fitted values with
a residuals vs fitted plot.

8. If the assumptions are not met, we can use the Kruskal-Wallis test
applied with `kruskal.test()` in R and follow it with the Dunn test
applied with `dunnTest()` in the package **`FSA`**.

9. When reporting the results of a test we give the significance,
direction and size of the effect. Our figures and the values we give
should reflect the type of test we have used. We use means and
standard errors for parametric tests and medians and interquartile
ranges for non-parametric tests. We also give the test statistic, the
degrees of freedom (parametric) or sample size (non-parametric) and
the p-value. We annotate our figures with the p-value, making clear
which comparison it applies to.
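
A compact sketch pulling these steps together (chunk not evaluated; the
model object name `mod` is chosen only for illustration, while `culture`,
`diameter` and `medium` are the names used earlier in this chapter):

```{r}
#| eval: false
# fit the one-way ANOVA as a linear model
mod <- lm(diameter ~ medium, data = culture)
summary(mod)   # coefficients, overall F test and R-squared

# post-hoc Tukey comparisons with emmeans
library(emmeans)
emmeans(mod, ~ medium) |> pairs()

# check the assumptions on the residuals
hist(residuals(mod))
shapiro.test(residuals(mod))
plot(mod, which = 1)   # residuals vs fitted plot
```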
