diff --git a/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-10-1.png b/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-10-1.png index 8cca4c27..d1828a1d 100644 Binary files a/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-10-1.png and b/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-10-1.png differ diff --git a/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-11-1.png b/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-11-1.png index e78bac20..19c37bf7 100644 Binary files a/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-11-1.png and b/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-11-1.png differ diff --git a/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-12-1.png b/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-12-1.png index 014b2629..63191d42 100644 Binary files a/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-12-1.png and b/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-12-1.png differ diff --git a/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-5-1.png b/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-5-1.png index 4a86cd24..9ca7fb36 100644 Binary files a/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-5-1.png and b/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-5-1.png differ diff --git a/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-7-1.png b/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-7-1.png index 6339e0ca..9f2ec9cd 100644 Binary files a/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-7-1.png and b/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-7-1.png differ diff --git a/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-8-1.png b/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-8-1.png index fb1b0a4a..4a481d4c 100644 Binary files a/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-8-1.png and b/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-8-1.png differ diff --git a/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-9-1.png b/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-9-1.png index e53d251c..2c8a4977 100644 Binary files a/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-9-1.png and b/pgt52m/week-4/workshop_files/figure-html/unnamed-chunk-9-1.png differ diff --git a/pgt52m/week-7/workshop_files/figure-html/unnamed-chunk-11-1.png b/pgt52m/week-7/workshop_files/figure-html/unnamed-chunk-11-1.png index ccf65c0d..021df5a3 100644 Binary files a/pgt52m/week-7/workshop_files/figure-html/unnamed-chunk-11-1.png and b/pgt52m/week-7/workshop_files/figure-html/unnamed-chunk-11-1.png differ diff --git a/pgt52m/week-7/workshop_files/figure-html/unnamed-chunk-12-1.png b/pgt52m/week-7/workshop_files/figure-html/unnamed-chunk-12-1.png index 765fd4be..5d98bf07 100644 Binary files a/pgt52m/week-7/workshop_files/figure-html/unnamed-chunk-12-1.png and b/pgt52m/week-7/workshop_files/figure-html/unnamed-chunk-12-1.png differ diff --git a/pgt52m/week-8/study_after_workshop_files/figure-html/unnamed-chunk-14-1.png b/pgt52m/week-8/study_after_workshop_files/figure-html/unnamed-chunk-14-1.png index 63c0f7c3..ad277c96 100644 Binary files a/pgt52m/week-8/study_after_workshop_files/figure-html/unnamed-chunk-14-1.png and b/pgt52m/week-8/study_after_workshop_files/figure-html/unnamed-chunk-14-1.png differ diff --git a/pgt52m/week-8/workshop_files/figure-html/unnamed-chunk-13-1.png b/pgt52m/week-8/workshop_files/figure-html/unnamed-chunk-13-1.png index 948754f7..e1920761 100644 Binary files 
a/pgt52m/week-8/workshop_files/figure-html/unnamed-chunk-13-1.png and b/pgt52m/week-8/workshop_files/figure-html/unnamed-chunk-13-1.png differ diff --git a/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-10-1.png b/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-10-1.png index 64ca3861..b81eb1fa 100644 Binary files a/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-10-1.png and b/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-10-1.png differ diff --git a/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-11-1.png b/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-11-1.png index 791eb4fd..2b9a3a48 100644 Binary files a/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-11-1.png and b/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-11-1.png differ diff --git a/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-12-1.png b/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-12-1.png index 9e6a4baa..07be4086 100644 Binary files a/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-12-1.png and b/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-12-1.png differ diff --git a/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-5-1.png b/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-5-1.png index 96386af0..e123f308 100644 Binary files a/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-5-1.png and b/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-5-1.png differ diff --git a/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-7-1.png b/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-7-1.png index 263e0ded..4d02891b 100644 Binary files a/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-7-1.png and b/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-7-1.png differ diff --git a/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-8-1.png b/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-8-1.png index 5986918a..cb69855b 100644 Binary files a/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-8-1.png and b/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-8-1.png differ diff --git a/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-9-1.png b/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-9-1.png index 4515a0d3..8a2cb8c3 100644 Binary files a/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-9-1.png and b/r4babs1/week-9/workshop_files/figure-html/unnamed-chunk-9-1.png differ diff --git a/r4babs2/week-2/data-raw/plant.xlsx b/r4babs2/week-2/data-raw/plant.xlsx new file mode 100644 index 00000000..954f190e Binary files /dev/null and b/r4babs2/week-2/data-raw/plant.xlsx differ diff --git a/r4babs2/week-2/workshop.html b/r4babs2/week-2/workshop.html index fa3efdb9..fa8e7796 100644 --- a/r4babs2/week-2/workshop.html +++ b/r4babs2/week-2/workshop.html @@ -479,8 +479,8 @@

Workshop

Add a comment to the script: # Introduction to statistical models: Single linear regression and load the tidyverse (Wickham et al. 2019) package

Exercises

Linear Regression

-

The data in plant.xlsx is a set of observations of plant growth over two months. The researchers planted the seeds and harvested, dried and weighed a plant each day from day 10, so all the data points are independent of each other.

-

Save a copy of plant.xlsx to your data-raw folder and import it.

+

The data in plant.xlsx is a set of observations of plant growth over two months. The researchers planted the seeds and harvested, dried and weighed a plant each day from day 10, so all the data points are independent of each other.

+

Save a copy of plant.xlsx to your data-raw folder and import it.
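One possible way to import it, assuming plant.xlsx keeps its data on the first sheet and that the readxl package (which is not loaded by library(tidyverse)) is available:

library(tidyverse)
library(readxl)

# import the plant growth data from the data-raw folder
# read_excel() reads the first sheet unless a sheet is named
plant <- read_excel("data-raw/plant.xlsx")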

What type of variables do you have? Which is the response and which is the explanatory variable? What is the null hypothesis?
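As a sketch of the single linear regression that follows, assuming the imported columns are named day and mass (check the real names with glimpse(plant) and adjust accordingly):

# inspect the variable names and types
glimpse(plant)

# single linear regression of dry mass on day (column names assumed)
mod <- lm(mass ~ day, data = plant)

# estimated intercept and slope with their tests against zero
summary(mod)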

diff --git a/r4babs2/week-3/workshop_files/figure-html/unnamed-chunk-11-1.png b/r4babs2/week-3/workshop_files/figure-html/unnamed-chunk-11-1.png index 07e2a957..83fbf6cb 100644 Binary files a/r4babs2/week-3/workshop_files/figure-html/unnamed-chunk-11-1.png and b/r4babs2/week-3/workshop_files/figure-html/unnamed-chunk-11-1.png differ diff --git a/r4babs2/week-3/workshop_files/figure-html/unnamed-chunk-12-1.png b/r4babs2/week-3/workshop_files/figure-html/unnamed-chunk-12-1.png index 40bcbce6..cfb2389f 100644 Binary files a/r4babs2/week-3/workshop_files/figure-html/unnamed-chunk-12-1.png and b/r4babs2/week-3/workshop_files/figure-html/unnamed-chunk-12-1.png differ diff --git a/r4babs2/week-4/study_after_workshop_files/figure-html/unnamed-chunk-14-1.png b/r4babs2/week-4/study_after_workshop_files/figure-html/unnamed-chunk-14-1.png index dee353ef..272870c7 100644 Binary files a/r4babs2/week-4/study_after_workshop_files/figure-html/unnamed-chunk-14-1.png and b/r4babs2/week-4/study_after_workshop_files/figure-html/unnamed-chunk-14-1.png differ diff --git a/r4babs2/week-4/workshop_files/figure-html/unnamed-chunk-13-1.png b/r4babs2/week-4/workshop_files/figure-html/unnamed-chunk-13-1.png index 977c1dec..a7faf610 100644 Binary files a/r4babs2/week-4/workshop_files/figure-html/unnamed-chunk-13-1.png and b/r4babs2/week-4/workshop_files/figure-html/unnamed-chunk-13-1.png differ diff --git a/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-16-1.png b/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-16-1.png index 731f7a4c..e80a6879 100644 Binary files a/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-16-1.png and b/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-16-1.png differ diff --git a/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-17-1.png b/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-17-1.png index 6ba99940..ad2693ce 100644 Binary files a/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-17-1.png and b/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-17-1.png differ diff --git a/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-18-1.png b/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-18-1.png index fe80f4d7..2c906dc3 100644 Binary files a/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-18-1.png and b/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-18-1.png differ diff --git a/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-19-1.png b/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-19-1.png index 453bddc1..67fcbe34 100644 Binary files a/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-19-1.png and b/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-19-1.png differ diff --git a/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-20-1.png b/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-20-1.png index 91e153ed..2fcde0cc 100644 Binary files a/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-20-1.png and b/r4babs2/week-5/workshop_files/figure-html/unnamed-chunk-20-1.png differ diff --git a/search.json b/search.json index 446c5af8..4381ef23 100644 --- a/search.json +++ b/search.json @@ -1,59 +1,31 @@ [ { - "objectID": "r4babs2/r4babs2.html", - "href": "r4babs2/r4babs2.html", - "title": "Data Analysis in R for BABS 2", + "objectID": "r4babs2/week-3/overview.html", + "href": "r4babs2/week-3/overview.html", + "title": "Overview", "section": "", - "text": "This is the second of the four BABS modules. 
Over six weeks you will learn about the logic of hypothesis testing, confidence intervals, what is meant by a statistical model, two-sample tests and one- and two-way analysis of variance (ANOVA).\n\n\nThe BABS2 Module Learning outcomes that relate to the Data Analysis in R content are:\n\nThink creatively to address a Grand Challenge by designing investigations with testable hypotheses and rigorous controls\nAppropriately select classical univariate statistical tests and some non-parametric equivalents to a given scenario and recognise when these are not suitable\nUse R to perform these analyses, reproducibly, on data in a variety of formats and present the results graphically\nCommunicate research in scientific reports and via oral presentation." + "text": "This week you will how to use and interpret the general linear model when the x variable is categorical and has two groups. Just as with single linear regression, the model puts a line of best through data and the model parameters, the intercept and the slope, have the same in interpretation The intercept is one of the group means and the slope is the difference between that, mean and the other group mean. You will also learn about the non-parametric equivalents - the tests we use when the assumptions of the general linear model are not met.\n\nLearning objectives\nThe successful student will be able to:\n\nunderstand the principles of two-sample tests\nappreciate that two-sample tests with lm() are based on the normal distribution and thus have assumptions\nappropriately select parametric and non-parametric two-sample tests\nappropriately select paired and and unpaired two-sample tests\napply and interpret lm()and wilcox.test()\nevaluate whether the assumptions of lm() are met\nscientifically report a two-sample test result including appropriate figures\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read Two-Sample tests\n\nWorkshop\n\nšŸ’» Parametric two-sample test\nšŸ’» Non-parametric two-sample test\nšŸ’» Parametric paired-sample test\n\nConsolidate\n\nšŸ’» Appropriately test whether a genetic modification was successful in increasing omega 3 fatty acids in Cannabis sativa.\nšŸ’» ā€¦." }, { - "objectID": "r4babs2/r4babs2.html#module-learning-objectives", - "href": "r4babs2/r4babs2.html#module-learning-objectives", - "title": "Data Analysis in R for BABS 2", + "objectID": "r4babs2/week-3/study_before_workshop.html", + "href": "r4babs2/week-3/study_before_workshop.html", + "title": "Independent Study to prepare for workshop", "section": "", - "text": "The BABS2 Module Learning outcomes that relate to the Data Analysis in R content are:\n\nThink creatively to address a Grand Challenge by designing investigations with testable hypotheses and rigorous controls\nAppropriately select classical univariate statistical tests and some non-parametric equivalents to a given scenario and recognise when these are not suitable\nUse R to perform these analyses, reproducibly, on data in a variety of formats and present the results graphically\nCommunicate research in scientific reports and via oral presentation." 
- }, - { - "objectID": "r4babs2/r4babs2.html#the-logic-of-hypothesis-testing-and-cis", - "href": "r4babs2/r4babs2.html#the-logic-of-hypothesis-testing-and-cis", - "title": "Data Analysis in R for BABS 2", - "section": "The logic of hypothesis testing and CIs", - "text": "The logic of hypothesis testing and CIs" - }, - { - "objectID": "r4babs2/r4babs2.html#introduction-to-statistical-models-single-regression", - "href": "r4babs2/r4babs2.html#introduction-to-statistical-models-single-regression", - "title": "Data Analysis in R for BABS 2", - "section": "Introduction to statistical models: Single regression", - "text": "Introduction to statistical models: Single regression" - }, - { - "objectID": "r4babs2/r4babs2.html#two-sample-tests", - "href": "r4babs2/r4babs2.html#two-sample-tests", - "title": "Data Analysis in R for BABS 2", - "section": "Two-sample tests", - "text": "Two-sample tests" - }, - { - "objectID": "r4babs2/r4babs2.html#one-way-anova-and-kruskal-wallis", - "href": "r4babs2/r4babs2.html#one-way-anova-and-kruskal-wallis", - "title": "Data Analysis in R for BABS 2", - "section": "One-way ANOVA and Kruskal-Wallis", - "text": "One-way ANOVA and Kruskal-Wallis" + "text": "Prepare\n\nšŸ“– Read Two-Sample tests" }, { - "objectID": "r4babs2/r4babs2.html#two-way-anova", - "href": "r4babs2/r4babs2.html#two-way-anova", - "title": "Data Analysis in R for BABS 2", - "section": "Two-way ANOVA", - "text": "Two-way ANOVA" + "objectID": "r4babs2/week-4/overview.html", + "href": "r4babs2/week-4/overview.html", + "title": "Overview", + "section": "", + "text": "Last week you learnt how to use and interpret the general linear model when the x variable was categorical with two groups. You will now extend that to situations when there are more than two groups. This is often known as the one-way ANOVA (analysis of variance). 
You will also learn about the Kruskal- Wallis test which can be used when the assumptions of the general linear model are not met.\n\nLearning objectives\nThe successful student will be able to:\n\nexplain the rationale behind ANOVA understand the meaning of the F values\nselect, appropriately, one-way ANOVA and Kruskal-Wallis\nknow what functions are used in R to run these tests and how to interpret them\nevaluate whether the assumptions of lm() are met\nscientifically report the results of these tests including appropriate figures\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read One-way ANOVA and Kruskal-Wallis\n\nWorkshop\n\nšŸ’» One-way ANOVA\nšŸ’» Kruskal-Wallis\n\nConsolidate\n\nšŸ’» Appropriately test if fitness and acclimation effect the sodium content of sweat\nšŸ’» Appropriately test if insecticides vary in their effectiveness" }, { - "objectID": "r4babs2/r4babs2.html#chi-squared-tests-and-correlation", - "href": "r4babs2/r4babs2.html#chi-squared-tests-and-correlation", - "title": "Data Analysis in R for BABS 2", - "section": "Chi-squared tests and correlation", - "text": "Chi-squared tests and correlation" + "objectID": "r4babs2/week-4/study_before_workshop.html", + "href": "r4babs2/week-4/study_before_workshop.html", + "title": "Independent Study to prepare for workshop", + "section": "", + "text": "Prepare\n\nšŸ“– Read One-way ANOVA and Kruskal-Wallis" }, { "objectID": "r4babs2/week-1/overview.html", @@ -69,6 +41,20 @@ "section": "", "text": "šŸ“– Read The logic of hyothesis testing\nšŸ“– Read Confidence Intervals" }, + { + "objectID": "r4babs2/week-5/overview.html", + "href": "r4babs2/week-5/overview.html", + "title": "Overview", + "section": "", + "text": "This week we will extend of our understanding by learning how to include two categorical explanatory variables in a general linear model. This model is often known as the two-way ANOVA. It has three null hypotheses\n\nLearning objectives\nThe successful student will be able to:\n\ncombine dataframes of the same structure\nselect, appropriately, two-way ANOVA\napply and interpret lm() for a two-way ANOVA\nevaluate whether the assumptions of lm() are met\nunderstand the meaning of the interaction term\nscientifically report a two-way ANOVA result including appropriate figures\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read Two-way ANOVA\n\nWorkshop\n\nšŸ’» Two-way ANOVA Choline deficiency on neuron size\nšŸ’» What to do when the assumptions are not met\n\nConsolidate\n\nšŸ’» Appropriately test if the addition of nitrogen and potassium to a crop influences yield and whether they act independently." 
+ }, + { + "objectID": "r4babs2/week-5/study_before_workshop.html", + "href": "r4babs2/week-5/study_before_workshop.html", + "title": "Independent Study to prepare for workshop", + "section": "", + "text": "Prepare\n\nšŸ“– Read Two-way ANOVA" + }, { "objectID": "r4babs2/week-2/overview.html", "href": "r4babs2/week-2/overview.html", @@ -84,137 +70,305 @@ "text": "šŸ“– Read What is a statistical model\nšŸ“– Read Single linear regression" }, { - "objectID": "r4babs2/week-6/overview.html", - "href": "r4babs2/week-6/overview.html", - "title": "Overview", + "objectID": "r4babs2/week-6/workshop.html", + "href": "r4babs2/week-6/workshop.html", + "title": "Workshop", "section": "", - "text": "This week you will\n\nLearning objectives\nThe successful student will be able to:\n\n\n\n\n\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read the book OR šŸ“¹ Watch two videos\n\nWorkshop\ni.šŸ’»\nConsolidate\n\nšŸ’»\nšŸ“– Read" + "text": "Artwork by Horst (2023):\n\n\nIn this session you will\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." }, { - "objectID": "r4babs2/week-6/study_before_workshop.html", - "href": "r4babs2/week-6/study_before_workshop.html", - "title": "Independent Study to prepare for workshop", + "objectID": "r4babs2/week-6/workshop.html#session-overview", + "href": "r4babs2/week-6/workshop.html#session-overview", + "title": "Workshop", "section": "", - "text": "Prepare\n\nEither šŸ“– Read xxxxx in OR šŸ“¹ Watch" + "text": "In this session you will" }, { - "objectID": "r4babs2/week-5/overview.html", - "href": "r4babs2/week-5/overview.html", - "title": "Overview", + "objectID": "r4babs2/week-6/workshop.html#philosophy", + "href": "r4babs2/week-6/workshop.html#philosophy", + "title": "Workshop", "section": "", - "text": "This week we will extend of our understanding by learning how to include two categorical explanatory variables in a general linear model. This model is often known as the two-way ANOVA. 
It has three null hypotheses\n\nLearning objectives\nThe successful student will be able to:\n\ncombine dataframes of the same structure\nselect, appropriately, two-way ANOVA\napply and interpret lm() for a two-way ANOVA\nevaluate whether the assumptions of lm() are met\nunderstand the meaning of the interaction term\nscientifically report a two-way ANOVA result including appropriate figures\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read Two-way ANOVA\n\nWorkshop\n\nšŸ’» Two-way ANOVA Choline deficiency on neuron size\nšŸ’» What to do when the assumptions are not met\n\nConsolidate\n\nšŸ’» Appropriately test if the addition of nitrogen and potassium to a crop influences yield and whether they act independently." + "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." }, { - "objectID": "r4babs2/week-5/study_before_workshop.html", - "href": "r4babs2/week-5/study_before_workshop.html", - "title": "Independent Study to prepare for workshop", + "objectID": "r4babs2/week-6/study_after_workshop.html", + "href": "r4babs2/week-6/study_after_workshop.html", + "title": "Independent Study to consolidate this week", "section": "", - "text": "Prepare\n\nšŸ“– Read Two-way ANOVA" + "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’»\n\n\nšŸ“– Read xxx" }, { - "objectID": "r4babs2/week-4/overview.html", - "href": "r4babs2/week-4/overview.html", - "title": "Overview", + "objectID": "pgt52m/week-3/workshop.html", + "href": "pgt52m/week-3/workshop.html", + "title": "Workshop", "section": "", - "text": "Last week you learnt how to use and interpret the general linear model when the x variable was categorical with two groups. You will now extend that to situations when there are more than two groups. This is often known as the one-way ANOVA (analysis of variance). 
You will also learn about the Kruskal- Wallis test which can be used when the assumptions of the general linear model are not met.\n\nLearning objectives\nThe successful student will be able to:\n\nexplain the rationale behind ANOVA understand the meaning of the F values\nselect, appropriately, one-way ANOVA and Kruskal-Wallis\nknow what functions are used in R to run these tests and how to interpret them\nevaluate whether the assumptions of lm() are met\nscientifically report the results of these tests including appropriate figures\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read One-way ANOVA and Kruskal-Wallis\n\nWorkshop\n\nšŸ’» One-way ANOVA\nšŸ’» Kruskal-Wallis\n\nConsolidate\n\nšŸ’» Appropriately test if fitness and acclimation effect the sodium content of sweat\nšŸ’» Appropriately test if insecticides vary in their effectiveness" + "text": "Artwork by Horst (2023): Continuous and Discrete\n\n\nIn this workshop you will learn how to import data from files and create summaries and plots for it. You will also get more practice with working directories, formatting figures and the pipe.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier workshops\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." }, { - "objectID": "r4babs2/week-4/study_before_workshop.html", - "href": "r4babs2/week-4/study_before_workshop.html", - "title": "Independent Study to prepare for workshop", + "objectID": "pgt52m/week-3/workshop.html#session-overview", + "href": "pgt52m/week-3/workshop.html#session-overview", + "title": "Workshop", "section": "", - "text": "Prepare\n\nšŸ“– Read One-way ANOVA and Kruskal-Wallis" + "text": "In this workshop you will learn how to import data from files and create summaries and plots for it. You will also get more practice with working directories, formatting figures and the pipe." }, { - "objectID": "r4babs2/week-3/overview.html", - "href": "r4babs2/week-3/overview.html", - "title": "Overview", + "objectID": "pgt52m/week-3/workshop.html#philosophy", + "href": "pgt52m/week-3/workshop.html#philosophy", + "title": "Workshop", "section": "", - "text": "This week you will how to use and interpret the general linear model when the x variable is categorical and has two groups. 
Just as with single linear regression, the model puts a line of best through data and the model parameters, the intercept and the slope, have the same in interpretation The intercept is one of the group means and the slope is the difference between that, mean and the other group mean. You will also learn about the non-parametric equivalents - the tests we use when the assumptions of the general linear model are not met.\n\nLearning objectives\nThe successful student will be able to:\n\nunderstand the principles of two-sample tests\nappreciate that two-sample tests with lm() are based on the normal distribution and thus have assumptions\nappropriately select parametric and non-parametric two-sample tests\nappropriately select paired and and unpaired two-sample tests\napply and interpret lm()and wilcox.test()\nevaluate whether the assumptions of lm() are met\nscientifically report a two-sample test result including appropriate figures\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read Two-Sample tests\n\nWorkshop\n\nšŸ’» Parametric two-sample test\nšŸ’» Non-parametric two-sample test\nšŸ’» Parametric paired-sample test\n\nConsolidate\n\nšŸ’» Appropriately test whether a genetic modification was successful in increasing omega 3 fatty acids in Cannabis sativa.\nšŸ’» ā€¦." + "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier workshops\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." }, { - "objectID": "r4babs2/week-3/study_before_workshop.html", - "href": "r4babs2/week-3/study_before_workshop.html", - "title": "Independent Study to prepare for workshop", + "objectID": "pgt52m/week-3/workshop.html#importing-data-from-files", + "href": "pgt52m/week-3/workshop.html#importing-data-from-files", + "title": "Workshop", + "section": "Importing data from files", + "text": "Importing data from files\nLast week we created data by typing the values in to R. This is not practical when you have added a lot of data to a spreadsheet, or you are using data file that has been supplied to you by a person or a machine. Far more commonly, we import data from a file into R. 
This requires you know two pieces of information.\n\n\nWhat format the data are in\nThe format of the data determines what function you will use to import it and the file extension often indicates format.\n\n\n.txt a plain text file1, where the columns are often separated by a space but might also be separated by a tab, a backslash or forward slash, or some other character\n\n.csv a plain text file where the columns are separated by commas\n\n.xlsx an Excel file\n\n\n\nWhere the file is relative to your working directory\nR can only read in a file if you say where it is, i.e., you give its relative path. If you follow the advice in this course, your data will be in a folder, data-raw which is inside your Project folder (and working directory).\n\n\nWe will save the four files for this workshop to our Project folder (week-8) and read them in. We will then create a new folder inside our Project folder called data-raw and move the data files to there before modifying the file paths as required. This is demonstrate how the relative path to the file will change after we move it.\n Save these four files in to your week-8 folder\n\nThe coat colour and mass of 62 cats: cat-coats.csv\n\nThe relative size of over 5000 cells measure by forward scatter (FSC) in flow cytometry: cell-size.txt\n\nThe number of sternopleural bristles on 96 female Drosophila: bristles.txt\n\nThe number of sternopleural bristles on 96 female Drosophila (with technical replicates): bristles-mean.xlsx\n\n\nThe first three files can be read in with core tidyverse Wickham et al. (2019) functions and the last can be read in with the readxl Wickham and Bryan (2023) package.\n Load the two packages\n\nlibrary(tidyverse)\nlibrary(readxl)\n\nWe will first read in cat-coats.csv. A .csv. extension suggests this is plain text file with comma separated columns. However, before we attempt to read it it, when should take a look at it. We can do this from RStudio\n Go to the Files pane (bottom right), click on the cat-coats.csv file and choose View File2\n\n\nRStudio Files Pane\n\nAny plain text file will open in the top left pane (Excel files will launch Excel).\n Is the file csv?\n\n\n What kind of variables does the file contain?\n\n\n Read in the csv file with:\n\ncats <- read_csv(\"cat-coats.csv\")\n\nThe data from the file a read into a dataframe called cats and you will be able to see it in the Environment.\n Click on each of the remaining files and choose View File.\n In each case, say what the format is and what types of variables it contains.\n\n\n\n\n\n\n\n\nWe use the read_table()3 command to read in plain text files of single columns or where the columns are separated by spacesā€¦\n ā€¦so in cell-size.txt can be read into a dataframe called cells like this:\n\ncells <- read_table(\"cell-size.txt\")\n\n Now you try reading bristles.txt in to a dataframe called fly_bristles\nThe readxl package we loaded earlier has two useful functions for working with Excel files: excel_sheets(\"filename.xlsx\") will list the sheets in an Excel workbook; read_excel(\"filename.xlsx\") will read in to top sheet or a specified sheet with a small modification read_excel(\"filename.xlsx\", sheet = \"Sheet1\").\n List the the names of the sheets and read in the sheet with the data like this:\n\nexcel_sheets(\"bristles-mean.xlsx\")\nfly_bristles_means <- read_excel(\"bristles-mean.xlsx\", sheet = \"means\")\n\nWell done! 
You can now read read in from files in your working directory.\nTo help you understand relative file paths, we will now move the data files.\n First remove the dataframes you just created to make it easier to see whether you can successfully read in the files from a different place:\n\nrm(cat_coats, fly_bristles, cells, flies_bristles_means)\n\n Now make a new folder called data-raw. You can do this on the Files Pane by clicking New Folder and typing into the box that appears.\n Check the boxes next to the file names and choose More | Moveā€¦ and select the data-raw folder.\n The files will move. To import data from files in the data-raw folder, you need to give the relative path to the file from the working directory. The working directory is the Project folder, week-8 so the relative path is data-raw/cat-coats.csv\n Import the cat-coats.csv data like this:\n\ncats <- read_csv(\"data-raw/cat-coats.csv\")\n\n Now you do the other files.\nFrom this point forward in the course, we will always create a data-raw folder each time we make a new Project.\nThese are the most common forms of data file you will encounter at first. However, data can certainly come to you in other formats particularly when they have come from particular software. Usually, there is an R package specially for that format.\nIn the rest of the workshop we will take each dataset in turn and create summaries and plots appropriate for the data types. Data is summarised using the group_by() and summarise() functions" + }, + { + "objectID": "pgt52m/week-3/workshop.html#summarising-discrete-data-cat-coat", + "href": "pgt52m/week-3/workshop.html#summarising-discrete-data-cat-coat", + "title": "Workshop", + "section": "Summarising discrete data: Cat coat", + "text": "Summarising discrete data: Cat coat\nThe most appropriate way to summarise nominal data like the colour of cat coats is to tabulate the number of cats with each colour.\n Summarise the cats dataframe by counting the number of cats in each category\n\ncats |> \n group_by(coat) |> \n count()\n\n# A tibble: 6 Ɨ 2\n# Groups: coat [6]\n coat n\n <chr> <int>\n1 black 23\n2 calico 1\n3 ginger 10\n4 tabby 8\n5 tortoiseshell 5\n6 white 15\n\n\n|> is the pipe and can be produced with Ctrl+Shift+M\nThis sort of data might be represented with a barchart. You have two options for producing that barchart:\n\nplot the summary table using geom_col()\nplot the raw data using geom_bar()\n\nWe did the first of these last week. The geom_col() function uses the numbers in a second column to determine how high the bars are. However, the geom_bar() function will do the tabulating for you.\n Plot the coat data using geom_bar:\n\nggplot(cats, aes(x = coat)) +\n geom_bar()\n\n\n\n\nThe gaps that R put automatically between the bars reflects that the coat colours are discrete categories." 
+ }, + { + "objectID": "pgt52m/week-3/workshop.html#summarising-counts-bristles", + "href": "pgt52m/week-3/workshop.html#summarising-counts-bristles", + "title": "Workshop", + "section": "Summarising Counts: Bristles", + "text": "Summarising Counts: Bristles\nCounts are discrete and can be thought of a categories with an order (ordinal).\n Summarise the fly_bristles dataframe by counting the number of flies in each category of bristle number\nSince counts are numbers, we might also want to calculate some summary statistics such as the median and interquartile range.\n Summarise the fly_bristles dataframe by calculate the median and interquartile range\n\nfly_bristles |> \n summarise(median(number),\n IQR(number))\n\n# A tibble: 1 Ɨ 2\n `median(number)` `IQR(number)`\n <dbl> <dbl>\n1 6 4\n\n\nAs the interquartile is 4 and the median is 6 then 25% flies have 4 bristles or fewer and 25% have 8 or more.\nThe distribution of counts4 is not symmetrical for lower counts so the mean is not usually a good way to summarise count data.\n If you want to save the table you created and give the columns better names you can make two adjustments:\n\nfly_bristles_summary <- fly_bristles |> \n summarise(med = median(number),\n interquartile = IQR(number))\n\n Plot the bristles data using geom_bar:\nIf counts have a a high mean and big range, like number of hairs on a personā€™s head, then you can often treat them as continuous. This means you can use statistics like the mean and standard deviation to summarise them, histograms to plot them and use some standard statistical tests on them." + }, + { + "objectID": "pgt52m/week-3/workshop.html#summarising-continuous-data", + "href": "pgt52m/week-3/workshop.html#summarising-continuous-data", + "title": "Workshop", + "section": "Summarising continuous data", + "text": "Summarising continuous data\nCat mass\nThe variable mass in the cats dataframe is continuous. Very many continuous variables have a normal distribution. e normal distribution is also known as the bell-shaped curve. If we had the mass of all the cats in the world, we would find many cats were near the mean and fewer would be away from the mean, either much lighter or much heavier. In fact 68% would be within one standard deviation of the mean and about 96% would be within two standard deviations.\n\n\n\n\n\n We can find the mean mass with:\n\ncats |> \n summarise(mean = mean(mass))\n\n# A tibble: 1 Ɨ 1\n mean\n <dbl>\n1 4.51\n\n\nWe can add any sort of summary by placing it inside the the summarise parentheses. Each one is separated by a comma. We did this to find the median and the interquatrile range for fly bristles.\n For example, another way to calculate the number of values is to use the length() function:\n\ncats |> \n summarise(mean = mean(mass),\n n = length(mass))\n\n# A tibble: 1 Ɨ 2\n mean n\n <dbl> <int>\n1 4.51 62\n\n\n Adapt the code to calculate the mean, the sample size and the standard deviation (sd())\nA single continuous variable can be plotted using a histogram to show the shape of the distribution.\n Plots a histogram of cats mass:\n\nggplot(cats, aes(x = mass)) +\n geom_histogram(bins = 15, colour = \"black\") \n\n\n\n\nNotice that there are no gaps between the bars which reflects that mass is continuous. bins determines how many groups the variable is divided up into (i.e., the number of bars) and colour sets the colour for the outline of the bars. 
A sample of 62 is a relatively small number of values for plotting a distribution and the number of bins used determines how smooth or normally distributed the values look.\n Experiment with the number of bins. Does the number of bins affect how you view the distribution.\nNext week we will practice summarise and plotting data files with several variables but just to give you a taste, we will find summary statistics about mass for each of the coat types. The group_by() function is used before the summarise() to do calculations for each of the coats:\n\ncats |> \n group_by(coat) |> \n summarise(mean = mean(mass),\n standard_dev = sd(mass))\n\n# A tibble: 6 Ɨ 3\n coat mean standard_dev\n <chr> <dbl> <dbl>\n1 black 4.63 1.33 \n2 calico 2.19 NA \n3 ginger 4.46 1.12 \n4 tabby 4.86 0.444\n5 tortoiseshell 4.50 0.929\n6 white 4.34 1.34 \n\n\nYou can read this as:\n\ntake cats and then group by coat and then summarise by finding the mean of mass and the standard deviation of mass\n\n Why do we get an NA for the standard deviation of the calico cats?\n\n\n\nCells\n Summarise the cells dataframe by calculating the mean, median, sample size and standard deviation of FSC.\n Add a column for the standard error which is given by \\(\\frac{s.d.}{\\sqrt{n}}\\)\nMeans of counts\nMany things are quite difficult to measure or count and in these cases we often do technical replicates. A technical replicate allows us the measure the exact same thing to check how variable the measurement process is. For example, Drosophila are small and counting their sternopleural bristles is tricky. In addition, where a bristle is short (young) or broken scientists might vary in whether they count it. Or people or machines might vary in measuring the concentration of the same solution.\nWhen we do technical replicates we calculate their mean and use that as the measure. This is what is in our fly_bristles_means dataframe - the bristles of each of the 96 flies was counted by 5 people and the data are those means. These has an impact on how we plot and summarise the dataset because the distribution of mean counts is continuous! We can use means, standard deviations and histograms. This will be an exercise in Consolidate." + }, + { + "objectID": "pgt52m/week-3/workshop.html#look-after-future-you", + "href": "pgt52m/week-3/workshop.html#look-after-future-you", + "title": "Workshop", + "section": "Look after future you!", + "text": "Look after future you!\nFuture you is going to summarise and plot data from the ā€œRiver practicalsā€. You can make this much easier by documenting what you have done now. At the moment all of your code from this workshop is in a single file, probably called analysis.R. I recommend making a new script for each of nominal, continuous and count data and copying the code which imports, summarises and plots it. This will make it easier for future you to find the code you need. Here is an example: nominal_data.R. You may wish to comment your version much more.\nYouā€™re finished!" + }, + { + "objectID": "pgt52m/week-3/workshop.html#footnotes", + "href": "pgt52m/week-3/workshop.html#footnotes", + "title": "Workshop", + "section": "Footnotes", + "text": "Footnotes\n\nPlain text files can be opened in notepad or other similar editor and still be readable.ā†©ļøŽ\nDo not be tempted to import data this way. 
Unless you are careful, your data import will not be scripted or will not be scripted correctly.ā†©ļøŽ\nnote read_csv() and read_table() are the same functions with some different settings.ā†©ļøŽ\nCount data are usually ā€œPoissonā€ distributed.ā†©ļøŽ" + }, + { + "objectID": "pgt52m/week-3/study_after_workshop.html", + "href": "pgt52m/week-3/study_after_workshop.html", + "title": "Independent Study to consolidate this week", "section": "", - "text": "Prepare\n\nšŸ“– Read Two-Sample tests" + "text": "Set up\nIf you have just opened RStudio you will want to load the packages and import the data.\n\nlibrary(tidyverse)\nlibrary(readxl)\n\n\nfly_bristles_means <- read_excel(\"data-raw/bristles-mean.xlsx\")\ncats <- read_csv(\"data-raw/cat-coats.csv\")\n\nExercises\n\nšŸ’» Summarise the fly_bristles_means dataframe by calculating the mean, median, sample size, standard deviation and standard error of the mean_count variable.\n\n\nCodefly_bristles_means_summary <- fly_bristles_means |> \n summarise(mean = mean(mean_count),\n median = median(mean_count),\n n = length(mean_count),\n standard_dev = sd(mean_count),\n standard_error = standard_dev / sqrt(n))\n\n\n\nšŸ’» Create an appropriate plot to show the distribution of mean_count in fly_bristles_means\n\n\n\nCodeggplot(fly_bristles_means, aes(x = mean_count)) +\n geom_histogram(bins = 10)\n\n\n\nšŸ’» Can you format the plot 2. by removing the grey background, giving the bars a black outline and the fill colour of your choice and improving the axis format and labelling? You may want to refer to last weekā€™s workshop.\n\n\nCodeggplot(fly_bristles_means, aes(x = mean_count)) +\n geom_histogram(bins = 10, \n colour = \"black\",\n fill = \"skyblue\") +\n scale_x_continuous(name = \"Number of bristles\",\n expand = c(0, 0)) +\n scale_y_continuous(name = \"Frequency\",\n expand = c(0, 0),\n limits = c(0, 35)) +\n theme_classic()\n\n\n\nšŸ’» Amend this code to change the order of the bars by the average mass of each coat colour? Changing the order of bars was covered last week. You may also want to practice formatting the graph nicely.\n\n\nggplot(cats, aes(x = coat, y = mass)) +\n geom_boxplot()\n\n\n\n\n\nCodeggplot(cats, \n aes(x = reorder(coat, mass), y = mass)) +\n geom_boxplot(fill = \"darkcyan\") +\n scale_x_discrete(name = \"Coat colour\") +\n scale_y_continuous(name = \"Mass (kg)\", \n expand = c(0, 0),\n limits = c(0, 8)) +\n theme_classic()\n\n\n\nšŸ“– Read Understanding the pipe |>" + }, + { + "objectID": "pgt52m/week-4/workshop.html", + "href": "pgt52m/week-4/workshop.html", + "title": "Workshop", + "section": "", + "text": "Data data Artwork from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst\n\n\nIn this workshop you will learn to summarise and plot datasets with more than one variable and how to write figures to files. You will also get more practice with working directories, importing data, formatting figures and the pipe.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. 
Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier workshops\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + }, + { + "objectID": "pgt52m/week-4/workshop.html#session-overview", + "href": "pgt52m/week-4/workshop.html#session-overview", + "title": "Workshop", + "section": "", + "text": "In this workshop you will learn to summarise and plot datasets with more than one variable and how to write figures to files. You will also get more practice with working directories, importing data, formatting figures and the pipe." + }, + { + "objectID": "pgt52m/week-4/workshop.html#philosophy", + "href": "pgt52m/week-4/workshop.html#philosophy", + "title": "Workshop", + "section": "", + "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier workshops\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + }, + { + "objectID": "pgt52m/week-4/workshop.html#myoglobin-in-seal-muscle", + "href": "pgt52m/week-4/workshop.html#myoglobin-in-seal-muscle", + "title": "Workshop", + "section": "Myoglobin in seal muscle", + "text": "Myoglobin in seal muscle\nThe myoglobin concentration of skeletal muscle of three species of seal in grams per kilogram of muscle was determined and the data are given in seal.csv. Each row represents an individual seal. The first column gives the myoglobin concentration and the second column indicates species.\nImport\n Save seal.csv to your data-raw folder\n Read the data into a dataframe called seal. . You might want to look up data import from last week.\n What types of variables do you have in the seal dataframe? 
What role would you expect them to play in analysis?\n\n\n\n\nThe key point here is that the fundamental structure of:\n\none continuous response and one nominal explanatory variable with two groups (adipocytes), and\none continuous response and one nominal explanatory variable with three groups (seals)\n\nis the same! The only thing that differs is the number of groups (the number of values in the nominal variable). This means the code for summarising and plotting is identical except for the variable names!\n\n\n\n\n\n\nTip\n\n\n\nWhen two datasets have the same number of columns and the response variable and the explanatory variables have the same data types then the code you need is the same.\n\n\nSummarise\nSummarising the data for each species is the next sensible step. The most useful summary statistics for a continuous variable like myoglobin are the means, standard deviations, sample sizes and standard errors. You might remember from last week that we use the group_by() and summarise() functions along with the functions that do the calculations.\n Create a data frame called seal_summary that contains the means, standard deviations, sample sizes and standard errors for the control and nicotinic acid treated samples.\n\nseal_summary <- seal %>%\n group_by(species) %>%\n summarise(mean = mean(myoglobin),\n std = sd(myoglobin),\n n = length(myoglobin),\n se = std/sqrt(n))\n\nYou should get the following numbers:\n\n\n\n\nspecies\nmean\nstd\nn\nse\n\n\n\nBladdernose Seal\n42.31600\n8.020634\n30\n1.464361\n\n\nHarbour Seal\n49.01033\n8.252004\n30\n1.506603\n\n\nWeddell Seal\n44.66033\n7.849816\n30\n1.433174\n\n\n\n\n\nVisualise\nMost commonly, we put the explanatory variable on the x axis and the response variable on the y axis. A continuous response, particularly one that follows the normal distribution, is best summarised with the mean and the standard error. In my opinion, you should also show all the raw data points if possible.\nWe are going to create a figure like this:\n\n\n\n\n\nIn this figure, we have the data points themselves which are in seal dataframe and the means and standard errors which are in the seal_summary dataframe. That is, we have two dataframes we want to plot.\nHere you will learn that dataframes and aesthetics can be specified within a geom_xxxx (rather than in the ggplot()). This is very useful if the geom only applies to some of the data you want to plot.\n\n\n\n\n\n\nTip: ggplot()\n\n\n\nYou put the data argument and aes() inside ggplot() if you want all the geoms to use that dataframe and variables. If you want a different dataframe for a geom, put the data argument and aes() inside the geom_xxxx()\n\n\nI will build the plot up in small steps but you should edit your existing ggplot() command as we go.\n Plot the data points first.\n\nggplot() +\n geom_point(data = seal, \n aes(x = species, y = myoglobin))\n\n\n\n\nNotice how we have given the data argument and the aesthetics inside the geom. 
The variables species and myoglobin are in the seal dataframe\n So the data points donā€™t overlap, we can add some random jitter in the x direction (edit your existing code):\n\nggplot() +\n geom_point(data = seal, \n aes(x = species, y = myoglobin),\n position = position_jitter(width = 0.1, height = 0))\n\n\n\n\nNote that position = position_jitter(width = 0.1, height = 0) is inside the geom_point() parentheses, after the aes() and a comma.\nWeā€™ve set the vertical jitter to 0 because, in contrast to the categorical x-axis, movement on the y-axis has meaning (the myoglobin levels).\n Letā€™s make the points a light grey (edit your existing code):\n\nggplot() +\n geom_point(data = seal, \n aes(x = species, y = myoglobin),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"grey50\")\n\n\n\n\nNow to add the errorbars. These go from one standard error below the mean to one standard error above the mean.\n Add a geom_errorbar() for errorbars (edit your existing code):\n\nggplot() +\n geom_point(data = seal, aes(x = species, y = myoglobin),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"grey50\") +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean - se, ymax = mean + se),\n width = 0.3) \n\n\n\n\nWe have specified the seal_summary dataframe and the variables species, mean and se are in that.\nThere are several ways you could add the mean. You could use geom_point() but I like to use geom_errorbar() again with the ymin and ymax both set to the mean.\n Add a geom_errorbar() for the mean (edit your existing code):\n\nggplot() +\n geom_point(data = seal, aes(x = species, y = myoglobin),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"grey50\") +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean, ymax = mean),\n width = 0.2)\n\n\n\n\n Alter the axis labels and limits using scale_y_continuous() and scale_x_discrete() (edit your existing code):\n\nggplot() +\n geom_point(data = seal, aes(x = species, y = myoglobin),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"grey50\") +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_y_continuous(name = \"Myoglobin (g/kg)\", \n limits = c(0, 80), \n expand = c(0, 0)) +\n scale_x_discrete(name = \"Species\")\n\n\n\n\nYou only need to use scale_y_continuous() and scale_x_discrete() to use labels that are different from those in the dataset. 
Often this is to use proper terminology and captialisation.\n Format the figure in a way that is more suitable for including in a report using theme_classic() (edit your existing code):\n\nggplot() +\n geom_point(data = seal, aes(x = species, y = myoglobin),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"grey50\") +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_y_continuous(name = \"Myoglobin (g/kg)\", \n limits = c(0, 80), \n expand = c(0, 0)) +\n scale_x_discrete(name = \"Species\") +\n theme_classic()\n\n\n\n\nWriting figures to file\n Make a new folder called figures.\n Edit you ggplot code so that you assign the figure to a variable.\n\nsealfig <- ggplot() +\n geom_point(data = seal, aes(x = species, y = myoglobin),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"grey50\") +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_y_continuous(name = \"Myoglobin (g/kg)\", \n limits = c(0, 80), \n expand = c(0, 0)) +\n scale_x_discrete(name = \"Species\") +\n theme_classic()\n\nThe figure wonā€™t be shown in the Plots tab - the output has gone into sealfig rather than to the Plots tab. To make it appear in the Plots tab type sealfig\n The ggsave() command will write a ggplot figure to a file:\n\nggsave(\"figures/seal-muscle.png\",\n plot = sealfig,\n device = \"png\",\n width = 4,\n height = 3,\n units = \"in\",\n dpi = 300)\n\nfiguresseal-muscle.png is the name of the file, including the relative path.\n Look up ggsave() in the manual to understand the arguments. You can do this by putting your cursor on the command and pressing F1" + }, + { + "objectID": "pgt52m/week-4/workshop.html#pigeons", + "href": "pgt52m/week-4/workshop.html#pigeons", + "title": "Workshop", + "section": "Pigeons", + "text": "Pigeons\nThe data in pigeon.txt are 40 measurements of interorbital width (in mm) for two populations of domestic pigeons measured to the nearest 0.1mm\n\n\nInterorbital width is the distance between the eyes\n\nImport\n Save pigeon.txt to your data-raw folder\n Read the data into a dataframe called pigeons.\n What variables are there in the pigeons dataframe?\n\n\n\n\nHummmm, these data are not organised like the other data sets we have used. The population is given as the column names and the interorbital distances for one population are given in a different column than those for the other population. 
The first row has data from two pigeons which have nothing in common, they just happen to be the first individual recorded in each population.\n\n\n\n\n\nA\nB\n\n\n\n12.4\n12.6\n\n\n11.2\n11.3\n\n\n11.6\n12.1\n\n\n12.3\n12.2\n\n\n11.8\n11.8\n\n\n10.7\n11.5\n\n\n11.3\n11.2\n\n\n11.6\n11.9\n\n\n12.3\n11.2\n\n\n10.5\n12.1\n\n\n12.1\n11.9\n\n\n10.4\n10.7\n\n\n10.8\n11.0\n\n\n11.9\n12.2\n\n\n10.9\n12.6\n\n\n10.8\n11.6\n\n\n10.4\n10.7\n\n\n12.0\n12.4\n\n\n11.7\n11.8\n\n\n11.3\n11.1\n\n\n11.5\n12.9\n\n\n11.8\n11.9\n\n\n10.3\n11.1\n\n\n10.3\n12.2\n\n\n11.5\n11.8\n\n\n10.7\n11.5\n\n\n11.3\n11.2\n\n\n11.6\n11.9\n\n\n13.3\n11.2\n\n\n10.7\n11.1\n\n\n12.1\n11.6\n\n\n10.2\n12.7\n\n\n10.8\n11.0\n\n\n11.4\n12.2\n\n\n10.9\n11.3\n\n\n10.3\n11.6\n\n\n10.4\n12.2\n\n\n10.0\n12.4\n\n\n11.2\n11.3\n\n\n11.3\n11.1\n\n\n\n\n\n\n\nThis data is not in ā€˜tidyā€™ format (Wickham 2014).\nTidy format has variables in column and observations in rows. All of the distance measurements should be in one column and a second column should give the population.\n\n\n\n\n\npopulation\ndistance\n\n\n\nA\n12.4\n\n\nB\n12.6\n\n\nA\n11.2\n\n\nB\n11.3\n\n\nA\n11.6\n\n\nB\n12.1\n\n\nA\n12.3\n\n\nB\n12.2\n\n\nA\n11.8\n\n\nB\n11.8\n\n\nA\n10.7\n\n\nB\n11.5\n\n\nA\n11.3\n\n\nB\n11.2\n\n\nA\n11.6\n\n\nB\n11.9\n\n\nA\n12.3\n\n\nB\n11.2\n\n\nA\n10.5\n\n\nB\n12.1\n\n\nA\n12.1\n\n\nB\n11.9\n\n\nA\n10.4\n\n\nB\n10.7\n\n\nA\n10.8\n\n\nB\n11.0\n\n\nA\n11.9\n\n\nB\n12.2\n\n\nA\n10.9\n\n\nB\n12.6\n\n\nA\n10.8\n\n\nB\n11.6\n\n\nA\n10.4\n\n\nB\n10.7\n\n\nA\n12.0\n\n\nB\n12.4\n\n\nA\n11.7\n\n\nB\n11.8\n\n\nA\n11.3\n\n\nB\n11.1\n\n\nA\n11.5\n\n\nB\n12.9\n\n\nA\n11.8\n\n\nB\n11.9\n\n\nA\n10.3\n\n\nB\n11.1\n\n\nA\n10.3\n\n\nB\n12.2\n\n\nA\n11.5\n\n\nB\n11.8\n\n\nA\n10.7\n\n\nB\n11.5\n\n\nA\n11.3\n\n\nB\n11.2\n\n\nA\n11.6\n\n\nB\n11.9\n\n\nA\n13.3\n\n\nB\n11.2\n\n\nA\n10.7\n\n\nB\n11.1\n\n\nA\n12.1\n\n\nB\n11.6\n\n\nA\n10.2\n\n\nB\n12.7\n\n\nA\n10.8\n\n\nB\n11.0\n\n\nA\n11.4\n\n\nB\n12.2\n\n\nA\n10.9\n\n\nB\n11.3\n\n\nA\n10.3\n\n\nB\n11.6\n\n\nA\n10.4\n\n\nB\n12.2\n\n\nA\n10.0\n\n\nB\n12.4\n\n\nA\n11.2\n\n\nB\n11.3\n\n\nA\n11.3\n\n\nB\n11.1\n\n\n\n\n\n\n\nData which is in tidy format is easier to summarise, analyses and plot because the organisation matches the conceptual structure of the data:\n\nit is more obvious what the variables are because they columns are named with them - in the untidy format, that the measures are distances is not clear and what A and B are isnā€™t clear\nit is more obvious that there is no relationship between any of the pigeons except for population\nfunctions are designed to work with variables in columns\nTidying data\nWe can put this data in such a format with the pivot_longer() function from the tidyverse:\npivot_longer() collects the values from specified columns (cols) into a single column (values_to) and creates a column to indicate the group (names_to).\n Put the data in tidy format:\n\npigeons <- pivot_longer(data = pigeons, \n cols = everything(), \n names_to = \"population\", \n values_to = \"distance\")\n\nWe have overwritten the original dataframe. If you wanted to keep the original you would need to give a new name on the left side of the assignment <- Note: the data in the file are unchanged." 
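A quick way to confirm the reshape has worked is to look at the structure of the new dataframe. This is a minimal sketch which assumes the pivot_longer() code above has just been run: pigeons should now have 80 rows and two columns, population and distance.

# check the structure of the tidied dataframe (assumes pivot_longer() has been run)
str(pigeons)

# count the observations in each population - there should be 40 in each
table(pigeons$population)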
+ }, + { + "objectID": "pgt52m/week-4/workshop.html#ulna-and-height", + "href": "pgt52m/week-4/workshop.html#ulna-and-height", + "title": "Workshop", + "section": "Ulna and height", + "text": "Ulna and height\nThe datasets we have used up to this point, have had a continuous variable and a categorical variable where it makes sense to summarise the response for each of the different groups in the categorical variable and plot the response on the y-axis. We will now summarise a dataset with two continuous variables. The data in height.txt are the ulna length (cm) and height (m) of 30 people. In this case, it is more appropriate to summarise both of thee variables and to plot them as a scatter plot.\nWe will use summarise() again but we do not need the group_by() function this time. We will also need to use each of the summary functions, such as mean(), twice, once for each variable.\nImport\n Save height.txt to your data-raw folder\n Read the data into a dataframe called ulna_heights.\nSummarise\n Create a data frame called ulna_heights_summary that contains the sample size and means, standard deviations and standard errors for both variables.\n\nulna_heights_summary <- ulna_heights %>%\n summarise(n = length(ulna),\n mean_ulna = mean(ulna),\n std_ulna = sd(ulna),\n se_ulna = std_ulna/sqrt(n),\n mean_height = mean(height),\n std_height = sd(height),\n se_height = std_height/sqrt(n))\n\nYou should get the following numbers:\n\n\n\n\nn\nmean_ulna\nstd_ulna\nse_ulna\nmean_height\nstd_height\nse_height\n\n\n30\n24.72\n4.137332\n0.75537\n1.494\n0.2404823\n0.0439059\n\n\n\n\nVisualise\nTo plot make a scatter plot we need to use geom_point() again but without any scatter. In this case, it does not really matter which variable is on the x-axis and which is on the y-axis.\n Make a simple scatter plot\n\nggplot(data = ulna_heights, aes(x = ulna, y = height)) +\n geom_point()\n\n\n\n\nIf you have time, you may want to format the figure more appropriately.\n\n\nYouā€™re finished!" + }, + { + "objectID": "pgt52m/week-4/study_after_workshop.html", + "href": "pgt52m/week-4/study_after_workshop.html", + "title": "Independent Study to consolidate this week", + "section": "", + "text": "Set up\nIf you have just opened RStudio you will want to load the packages and import the data.\n\nlibrary(tidyverse)\nlibrary(readxl)\n\n\nšŸ’» Summarise and plot the pigeons dataframe appropriately.\n\n\nCode# import\npigeons <- read_table(\"data-raw/pigeon.txt\")\n\n# reformat to tidy\npigeons <- pivot_longer(data = pigeons, \n cols = everything(), \n names_to = \"population\", \n values_to = \"distance\")\n\n# sumnmarise\npigeons_summary <- pigeons %>%\n group_by(population) %>%\n summarise(mean = mean(distance),\n std = sd(distance),\n n = length(distance),\n se = std/sqrt(n))\n# plot\nggplot() +\n geom_point(data = pigeons, aes(x = population, y = distance),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"gray50\") +\n geom_errorbar(data = pigeons_summary, \n aes(x = population, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = pigeons_summary, \n aes(x = population, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_y_continuous(name = \"Interorbital distance (mm)\", \n limits = c(0, 14), \n expand = c(0, 0)) +\n scale_x_discrete(name = \"Population\") +\n theme_classic()\n\n\n\n\nšŸ’» The data in blood.csv are measurements of several blood parameters from fifty people with Crohnā€™s disease, a lifelong condition where parts of the digestive system become inflamed. 
Twenty-five of the people are in the early stages of diagnosis and 25 have started treatment. The variables in the dataset are:\n\nsodium - Sodium concentration in umol/L, the average of 5 technical replicates\npotassium - Potassium concentration in umol/L, the average of 5 technical replicates\nB12 Vitamin - B12 in pmol/L, the average of 5 technical replicates\nwbc - White blood cell count in 10^9 /L, the average of 5 technical replicates\nrbc count - Red blood cell count in 10^12 /L, the average of 5 technical replicates\nplatlet count - Platelet count in 10^9 /L, the average of 5 technical replicates\ninflammation marker - the presence or absence of a marker of inflammation, either 0 or 1\nstatus - whether the individual is before or after treatment.\n\nYour task is to summarise and plot these data in any suitable way. Create a complete RStudio Project for an analysis of these data. You will need to:\n\nMake a new project\nMake folders for data and for figures\nImport the data\nSummarise and plot variables of your choice. It doesn't matter what you choose - the goal is to practise the project workflow and to select appropriate plotting and summarising methods for particular data sets." + }, + { + "objectID": "pgt52m/week-1/workshop.html", + "href": "pgt52m/week-1/workshop.html", + "title": "Workshop", + "section": "", + "text": "There is no formal workshop this week but you might want to install R and RStudio on your own machine. This is optional because University computers already have R and RStudio installed.\nInstall R and RStudio.\nNote you need a computer - not a tablet." + }, + { + "objectID": "pgt52m/week-1/study_after_workshop.html", + "href": "pgt52m/week-1/study_after_workshop.html", + "title": "Independent Study to consolidate this week", + "section": "", + "text": "There is no additional study this week but you may want to look ahead to next week." + }, + { + "objectID": "pgt52m/week-5/workshop.html", + "href": "pgt52m/week-5/workshop.html", + "title": "Workshop", + "section": "", + "text": "Artwork by Horst (2023): “love this class”\n\n\nIn this session you will remind yourself how to import files, and calculate confidence intervals on large and small samples.\n\nWorkshops are not a test. It is expected that you often don't know how to start, make a lot of mistakes and need help. It is expected that you are familiar with the independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndon't worry about making mistakes\ndon't let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference."
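As a reminder of the import pattern used throughout these workshops, here is a minimal sketch. It assumes the beewing.txt file used in the next section has already been saved to your data-raw folder.

# load the tidyverse, which provides read_table()
library(tidyverse)

# read a whitespace-separated text file from data-raw (assumes the file is there)
bee <- read_table("data-raw/beewing.txt")

# check the structure of the resulting dataframe
str(bee)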
+ }, + { + "objectID": "pgt52m/week-5/workshop.html#session-overview", + "href": "pgt52m/week-5/workshop.html#session-overview", + "title": "Workshop", + "section": "", + "text": "In this session you will remind yourself how to import files, and calculate confidence intervals on large and small samples." + }, + { + "objectID": "pgt52m/week-5/workshop.html#philosophy", + "href": "pgt52m/week-5/workshop.html#philosophy", + "title": "Workshop", + "section": "", + "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + }, + { + "objectID": "pgt52m/week-5/workshop.html#remind-yourself-how-to-import-files", + "href": "pgt52m/week-5/workshop.html#remind-yourself-how-to-import-files", + "title": "Workshop", + "section": "Remind yourself how to import files!", + "text": "Remind yourself how to import files!\nImporting data from files was covered in a previous workshop (Rand 2023) if you need to remind yourself." + }, + { + "objectID": "pgt52m/week-5/workshop.html#confidence-intervals-large-samples", + "href": "pgt52m/week-5/workshop.html#confidence-intervals-large-samples", + "title": "Workshop", + "section": "Confidence intervals (large samples)", + "text": "Confidence intervals (large samples)\nThe data in beewing.txt are left wing widths of 100 honey bees (mm). The confidence interval for large samples is given by:\n\\(\\bar{x} \\pm 1.96 \\times s.e.\\)\nWhere 1.96 is the quantile for 95% confidence.\n Save beewing.txt to your data-raw folder.\n Read in the data and check the structure of the resulting dataframe.\n Calculate and assign to variables: the mean, standard deviation and standard error:\n\n# mean\nm <- mean(bee$wing)\n\n# standard deviation\nsd <- sd(bee$wing)\n\n# sample size (needed for the se)\nn <- length(bee$wing)\n\n# standard error\nse <- sd / sqrt(n)\n\n To calculate the 95% confidence interval we need to look up the quantile (multiplier) using qnorm()\n\nq <- qnorm(0.975)\n\nThis should be about 1.96.\n Now we can use it in our confidence interval calculation\n\nlcl <- m - q * se\nucl <- m + q * se\n\n Print the values\n\nlcl\n\n[1] 4.473176\n\nucl\n\n[1] 4.626824\n\n\nThis means we are 95% confident the population mean lies between 4.47 mm and 4.63 mm. 
The usual way of expressing this is that the mean is 4.55 +/- 0.07 mm\n Between what values would you be 99% confident of the population mean being?" + }, + { + "objectID": "pgt52m/week-5/workshop.html#confidence-intervals-small-samples", + "href": "pgt52m/week-5/workshop.html#confidence-intervals-small-samples", + "title": "Workshop", + "section": "Confidence intervals (small samples)", + "text": "Confidence intervals (small samples)\nThe confidence interval for small samples is given by:\n\\(\\bar{x} \\pm \\sf t_{[d.f]} \\times s.e.\\)\nThe only difference between the calculation for small and large sample is the multiple. For large samples we use the ā€œthe standard normal distributionā€ accessed with qnorm(); for small samples we use the ā€œt distributionā€ assessed with qt().The value returned by q(t) is larger than that returned by qnorm() which reflects the greater uncertainty we have on estimations of population means based on small samples.\nThe fatty acid Docosahexaenoic acid (DHA) is a major component of membrane phospholipids in nerve cells and deficiency leads to many behavioural and functional deficits. The cross sectional area of neurons in the CA 1 region of the hippocampus of normal rats is 155 \\(\\mu m^2\\). A DHA deficient diet was fed to 8 animals and the cross sectional area (csa) of neurons is given in neuron.txt\n Save neuron.txt to your data-raw folder\n Read in the data and check the structure of the resulting dataframe\n Assign the mean to m.\n Calculate and assign the standard error to se.\nTo work out the confidence interval for our sample mean we need to use the t distribution because it is a small sample. This means we need to determine the degrees of freedom (the number in the sample minus one).\n We can assign this to a variable, df, using:\n\ndf <- length(neur$csa) - 1\n\n The t value is found by:\n\nt <- qt(0.975, df = df)\n\nNote that we are using qt() rather than qnorm() but that the probability, 0.975, used is the same. Finally, we need to put our mean, standard error and t value in the equation. \\(\\bar{x} \\pm \\sf t_{[d.f]} \\times s.e.\\).\n The upper confidence limit is:\n\n(m + t * se) |> round(2)\n\n[1] 151.95\n\n\nThe first part of the command, (m + t * se) calculates the upper limit. This is ā€˜pipedā€™ in to the round() function to round the result to two decimal places.\n Calculate the lower confidence limit:\n Given the upper and lower confidence values for the estimate of the population mean, what do you think about the effect of the DHA deficient diet?\n\n\n\n\nYouā€™re finished!" + }, + { + "objectID": "pgt52m/week-5/study_after_workshop.html", + "href": "pgt52m/week-5/study_after_workshop.html", + "title": "Independent Study to consolidate this week", + "section": "", + "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’» Adiponectin is exclusively secreted from adipose tissue and modulates a number of metabolic processes. Nicotinic acid can affect adiponectin secretion. 3T3-L1 adipocytes were treated with nicotinic acid or with a control treatment and adiponectin concentration (pg/mL) measured. The data are in adipocytes.txt. Each row represents an independent sample of adipocytes and the first column gives the concentration adiponectin and the second column indicates whether they were treated with nicotinic acid or not. 
Estimate the mean Adiponectin concentration in each group - this means calculate the sample mean and construct a confidence interval around it for each group. This exercise forces you to bring together ideas from this workshop and from previous workshops\n\n\nHow to calculate a confidence intervals (this workshop)\n\nHow to summarise variables in more than one group (previous workshop)\n\n\nCode# data import\nadip <- read_table(\"data-raw/adipocytes.txt\")\n\n# examine the structure\nstr(adip)\n\n# summarise\nadip_summary <- adip %>% \n group_by(treatment) %>% \n summarise(mean = mean(adiponectin),\n sd = sd(adiponectin),\n n = length(adiponectin),\n se = sd/sqrt(n),\n dif = qt(0.975, df = n - 1) * se,\n lower_ci = mean - dif,\n uppp_ci = mean + dif)\n\n\n# we conclude we're 95% certain the mean for the control group is \n# between 4.73 and 6.36 and the mean for the nicotinic group is \n# between 6.52 and 8.50. More usually we might put is like this:\n# the mean for the control group is 5.55 +/- 0.82 and that for the nicotinic group is 7.51 +/- 0.99" + }, + { + "objectID": "pgt52m/pgt52m.html", + "href": "pgt52m/pgt52m.html", + "title": "52M Data Analysis in R", + "section": "", + "text": "This module introduces you to data analysis in R. The first 4 weeks covers core concepts about scientific computing, types of variable, the role of variables in analysis and how to use RStudio to organise analysis and import, summarise and plot data. In weeks 5 to 8, you will learn about the logic of hypothesis testing, confidence intervals, what is meant by a statistical model, two-sample tests and one-way analysis of variance (ANOVA). You will learn how to write reproducible reports in Quarto in weeks 9 and 10. Finally, there will be a drop-in for your questions in week 11.\nThis module complement the work you will do in BIO00070M Research, Professional and Team Skills where you will you will learn how to organise reproducible data analyses using a project-oriented workflow and analyses RNA sequence data. It will be important to use the skills and tools you learn in 52M and apply them in 70M.\n\n\nThe Module Learning outcomes are:\n\nExplain the purpose of data analysis and the rationale for scripting analysis in the biosciences\nRecognise when statistics such as t-tests, one-way ANOVA, correlation and regression can be applied, and use R to perform these analyses on data in a variety of formats\nSummarise data in single or multiple groups, recognise tidy data formats, and carry out some typical data tidying tasks\nUse markdown (through Quarto) to produce reproducible analyses, figures and reports" }, { - "objectID": "pgt52m/week-7/study_after_workshop.html", - "href": "pgt52m/week-7/study_after_workshop.html", - "title": "Independent Study to consolidate this week", + "objectID": "pgt52m/pgt52m.html#module-learning-objectives", + "href": "pgt52m/pgt52m.html#module-learning-objectives", + "title": "52M Data Analysis in R", "section": "", - "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’» Plant Biotech. Some plant biotechnologists are trying to increase the quantity of omega 3 fatty acids in Cannabis sativa. They have developed a genetically modified line using genes from Linum usitatissimum (linseed). They grow 50 wild type and fifty modified plants to maturity, collect the seeds and determine the amount of omega 3 fatty acids. The data are in csativa.txt. 
Do you think their modification has been successful?\n\n\nCodecsativa <- read_table(\"data-raw/csativa.txt\")\nstr(csativa)\n\n# First realise that this is a two sample test. You have two independent samples\n# - there are a total of 100 different plants and the values in one \n# group have no relationship to the values in the other.\n\n\n\nCode# create a rough plot of the data \nggplot(data = csativa, aes(x = plant, y = omega)) +\n geom_violin()\nCode# note the modified plants seem to have lower omega!\n\n\n\nCode# create a summary of the data\ncsativa_summary <- csativa %>%\n group_by(plant) %>%\n summarise(mean = mean(omega),\n std = sd(omega),\n n = length(omega),\n se = std/sqrt(n))\n\n\n\nCode# The data seem to be continuous so it is likely that a parametric test will be fine\n# we will check the other assumptions after we have run the lm\n\n# build the statistical model\nmod <- lm(data = csativa, omega ~ plant)\n\n\n# examine it\nsummary(mod)\n# So there is a significant difference but you need to make sure you know the direction!\n# Wild plants have a significantly higher omega 3 content (mean +/- s.e = 56.41 +/- 1.11) \n# than modified plants (49.46 +/- 0.82)(t = 5.03; d.f. = 98; p < 0.0001).\n\n\n\nCode# let's check the assumptions\nplot(mod, which = 1) \nCode# we're looking for the variance in the residuals to be the same in both groups.\n# This looks OK. Maybe a bit higher in the wild plants (with the higher mean)\n \nhist(mod$residuals)\nCodeshapiro.test(mod$residuals)\n# On balance the use of lm() is probably justifiable The variance isn't quite equal \n# and the histogram looks a bit off normal but the normality test is NS and the \n# effect (in the figure) is clear.\n\n\n\nCode# A figure \nfig1 <- ggplot() +\n geom_point(data = csativa, aes(x = plant, y = omega),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"gray50\") +\n geom_errorbar(data = csativa_summary, \n aes(x = plant, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = csativa_summary, \n aes(x = plant, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_x_discrete(name = \"Plant type\", labels = c(\"GMO\", \"WT\")) +\n scale_y_continuous(name = \"Amount of Omega 3 (units)\",\n expand = c(0, 0),\n limits = c(0, 90)) +\n annotate(\"segment\", x = 1, xend = 2, \n y = 80, yend = 80,\n colour = \"black\") +\n annotate(\"text\", x = 1.5, y = 85, \n label = expression(italic(p)~\"< 0.001\")) +\n theme_classic()\n\n# save figure to figures/csativa.png\nggsave(\"figures/csativa.png\",\n plot = fig1,\n width = 3.5,\n height = 3.5,\n units = \"in\",\n dpi = 300)\n\n\n\nšŸ’» another example" + "text": "The Module Learning outcomes are:\n\nExplain the purpose of data analysis and the rationale for scripting analysis in the biosciences\nRecognise when statistics such as t-tests, one-way ANOVA, correlation and regression can be applied, and use R to perform these analyses on data in a variety of formats\nSummarise data in single or multiple groups, recognise tidy data formats, and carry out some typical data tidying tasks\nUse markdown (through Quarto) to produce reproducible analyses, figures and reports" }, { - "objectID": "pgt52m/week-7/workshop.html", - "href": "pgt52m/week-7/workshop.html", - "title": "Workshop", - "section": "", - "text": "Artwork by Horst (2023): ā€œHow much I think I know about Rā€\n\n\nIn this workshop you will get practice in choosing between, performing, and presenting the results of, two-sample tests and their non-parametric equivalents in 
R.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + "objectID": "pgt52m/pgt52m.html#week-1-understanding-file-systems", + "href": "pgt52m/pgt52m.html#week-1-understanding-file-systems", + "title": "52M Data Analysis in R", + "section": "Week 1: Understanding file systems", + "text": "Week 1: Understanding file systems\nYou will learn about operating systems, files and file systems, working directories, absolute and relative paths, what R and RStudio are" }, { - "objectID": "pgt52m/week-7/workshop.html#session-overview", - "href": "pgt52m/week-7/workshop.html#session-overview", - "title": "Workshop", - "section": "", - "text": "In this workshop you will get practice in choosing between, performing, and presenting the results of, two-sample tests and their non-parametric equivalents in R." + "objectID": "pgt52m/pgt52m.html#week-2-introduction-to-r-and-project-organisation", + "href": "pgt52m/pgt52m.html#week-2-introduction-to-r-and-project-organisation", + "title": "52M Data Analysis in R", + "section": "Week 2: Introduction to R and project organisation", + "text": "Week 2: Introduction to R and project organisation\nYou will start writing R code in RStudio and will create your first graph! You will learn about data types such as ā€œnumericsā€ and ā€œcharactersā€ and some of the different types of objects in R such as ā€œvectorsā€ and ā€œdataframesā€. These are the building blocks for the rest of your R journey. You will also learn a workflow and about the layout of RStudio and using RStudio Projects." }, { - "objectID": "pgt52m/week-7/workshop.html#philosophy", - "href": "pgt52m/week-7/workshop.html#philosophy", - "title": "Workshop", - "section": "", - "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. 
Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + "objectID": "pgt52m/pgt52m.html#week-3-types-of-variable-summarising-and-plotting-data", + "href": "pgt52m/pgt52m.html#week-3-types-of-variable-summarising-and-plotting-data", + "title": "52M Data Analysis in R", + "section": "Week 3: Types of variable, summarising and plotting data", + "text": "Week 3: Types of variable, summarising and plotting data\nThe type of values our data can take is important in how we analyse and visualise it. This week you will learn the difference between continuous and discrete values and how we summarise and visualise them. The focus will be on plotting and summarising single variables. You will also learn how to read in data in to RStudio from plain text files and Excel files." }, { - "objectID": "pgt52m/week-7/workshop.html#adiponectin-secretion", - "href": "pgt52m/week-7/workshop.html#adiponectin-secretion", - "title": "Workshop", - "section": "Adiponectin secretion", - "text": "Adiponectin secretion\nAdiponectin is exclusively secreted from adipose tissue and modulates a number of metabolic processes. Nicotinic acid can affect adiponectin secretion. 3T3-L1 adipocytes were treated with nicotinic acid or with a control treatment and adiponectin concentration (pg/mL) measured. The data are in adipocytes.txt. Each row represents an independent sample of adipocytes and the first column gives the concentration adiponectin and the second column indicates whether they were treated with nicotinic acid or not.\n Save a copy of adipocytes.txt to data-raw\n Read in the data and check the structure. I used the name adip for the dataframe/tibble.\nWe have a tibble containing two variables: adiponectin is the response and is continuous and treatment is explanatory. treatment is categorical with two levels (groups). The first task is visualise the data to get an overview. For continuous response variables with categorical explanatory variables you could use geom_point(), geom_boxplot() or a variety of other geoms. I often use geom_violin() which allows us to see the distribution - the violin is fatter where there are more data points.\n Do a quick plot of the data:\n\nggplot(data = adip, aes(x = treatment, y = adiponectin)) +\n geom_violin()\n\n\n\n\nSummarising the data\nSummarising the data for each treatment group is the next sensible step. The most useful summary statistics are the means, standard deviations, sample sizes and standard errors.\n Create a data frame called adip_summary that contains the means, standard deviations, sample sizes and standard errors for the control and nicotinic acid treated samples. 
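One way to build that summary is sketched below; it assumes adip has been imported as above and uses the same group_by() and summarise() pattern as the Week 4 workshop.

# summarise adiponectin by treatment group (assumes adip has been imported)
adip_summary <- adip |>
  group_by(treatment) |>
  summarise(mean = mean(adiponectin),
            std = sd(adiponectin),
            n = length(adiponectin),
            se = std/sqrt(n))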
You may need to the Summarise from the Week 4 workshop\nYou should get the following numbers:\n\n\n\n\ntreatment\nmean\nstd\nn\nse\n\n\n\ncontrol\n5.546000\n1.475247\n15\n0.3809072\n\n\nnicotinic\n7.508667\n1.793898\n15\n0.4631824\n\n\n\n\n\nSelecting a test\n Do you think this is a paired-sample test or two-sample test?\n\n\n\n\nApplying, interpreting and reporting\n Create a two-sample model like this:\n\nmod <- lm(data = adip,\n adiponectin ~ treatment)\n\n Examine the model with:\n\nsummary(mod)\n\n\nCall:\nlm(formula = adiponectin ~ treatment, data = adip)\n\nResiduals:\n Min 1Q Median 3Q Max \n-4.3787 -1.0967 0.1927 1.0245 3.1113 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 5.5460 0.4240 13.079 1.9e-13 ***\ntreatmentnicotinic 1.9627 0.5997 3.273 0.00283 ** \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 1.642 on 28 degrees of freedom\nMultiple R-squared: 0.2767, Adjusted R-squared: 0.2509 \nF-statistic: 10.71 on 1 and 28 DF, p-value: 0.00283\n\n\n What do you conclude from the test? Write your conclusion in a form suitable for a report.\n\n\n\n\nCheck assumptions\nThe assumptions of the general linear model are that the residuals ā€“ the difference between predicted value (i.e., the group mean) and observed values - are normally distributed and have homogeneous variance. To check these we can examine the mod$residuals variable. You may want to refer to Checking assumptions in the ā€œSingle regressionā€ workshop.\n Plot the model residuals against the fitted values.\n What to you conclude?\n\n\n\nTo examine normality of the model residuals we can plot them as a histogram and do a normality test on them.\n Plot a histogram of the residuals.\n Use the shapiro.test() to test the normality of the model residuals\n What to you conclude?\n\n\n\n\nIllustrating\n Create a figure like the one below. You may need to refer to Visualise from the ā€œSummarising data with several variablesā€ workshop (Rand 2023)\n\n\n\n\n\nWe now need to annotate the figure with the results from the statistical test. This most commonly done with a line linking the means being compared and the p-value. The annotate() function can be used to draw the line and then to add the value. The line is a segment and the p-value is a text.\n Add annotation to the figure by adding:\n...... +\n annotate(\"segment\", x = 1, xend = 2, \n y = 11.3, yend = 11.3,\n colour = \"black\") +\n annotate(\"text\", x = 1.5, y = 11.7, \n label = expression(italic(p)~\"= 0.003\")) +\n theme_classic()\n\n\n\n\n\nFor the segment, annotate() needs the x and y coordinates for the start and the finish of the line.\nThe use of expression() allows you to specify formatting or special characters. expression() takes strings or LaTeX formatting. Each string or piece of LaTeX is separated by a * or a ~. The * concatenates the strings without a space, ~ does so with a space. It will generate a warning message ā€œIn is.na(x) : is.na() applied to non-(list or vector) of type ā€˜expressionā€™ā€ which can be ignored.\n Save your figure to your figures folder." + "objectID": "pgt52m/pgt52m.html#week-4-summarising-data-with-several-variables", + "href": "pgt52m/pgt52m.html#week-4-summarising-data-with-several-variables", + "title": "52M Data Analysis in R", + "section": "Week 4: Summarising data with several variables", + "text": "Week 4: Summarising data with several variables\nThis week you will start plotting data sets with more than one variable. 
This means you need to be able determine which variable is the response and which is the explanatory. You will find out what is meant by ā€œtidyā€ data and how to perform a simple data tidying task. Finally you will discover how to save your figures and place them in documents." }, { - "objectID": "pgt52m/week-7/workshop.html#grouse-parasites", - "href": "pgt52m/week-7/workshop.html#grouse-parasites", - "title": "Workshop", - "section": "Grouse Parasites", - "text": "Grouse Parasites\nGrouse livers were dissected and the number of individuals of a parasitic nematode were counted for two estates ā€˜Gordonā€™ and ā€˜Mossā€™. We want to know if the two estates have different infection rates. The data are in grouse.csv\n Save a copy of grouse.csv to data-raw\n Read in the data and check the structure. I used the name grouse for the dataframe/tibble.\nSelecting\n Using your common sense, do these data look normally distributed?\n\n\n\n What test do you suggest?\n\n\nApplying, interpreting and reporting\n Summarise the data by finding the median of each group:\n Carry out a two-sample Wilcoxon test (also known as a Mann-Whitney):\n\nwilcox.test(data = grouse, nematodes ~ estate)\n\n\n Wilcoxon rank sum exact test\n\ndata: nematodes by estate\nW = 78, p-value = 0.03546\nalternative hypothesis: true location shift is not equal to 0\n\n\n What do you conclude from the test? Write your conclusion in a form suitable for a report.\n\n\n\nIllustrating\nA box plot is a usually good choice for illustrating a two-sample Wilcoxon test because it shows the median and interquartile range.\n We can create a simple boxplot with:\n\nggplot(data = grouse, aes(x = estate, y = nematodes) ) +\n geom_boxplot() \n\n\n\n\n Annotate and format the figure so it is more suitable for a report and save it to your figures folder." + "objectID": "pgt52m/pgt52m.html#week-5-the-logic-of-hypothesis-testing-and-ci", + "href": "pgt52m/pgt52m.html#week-5-the-logic-of-hypothesis-testing-and-ci", + "title": "52M Data Analysis in R", + "section": "Week 5: The logic of hypothesis testing and CI", + "text": "Week 5: The logic of hypothesis testing and CI\nThis week we will cover the logic of consider the logic of hypothesis testing and type 1 and type 2 errors. We will also find out what the sampling distribution of the mean and the standard error are, and how to calculate confidence intervals." }, { - "objectID": "pgt52m/week-7/workshop.html#gene-expression", - "href": "pgt52m/week-7/workshop.html#gene-expression", - "title": "Workshop", - "section": "Gene Expression", - "text": "Gene Expression\nBambara groundnut (Vigna subterranea) is an African legume with good nutritional value which can be influenced by low temperature stress. Researchers are interested in the expression levels of a particular set of 35 genes (probe_id) in response to temperature stress. They measure the expression of the genes at 23 and 18 degrees C (high and low temperature). These samples are not independent because we have two measure from one gene. The data are in expr.xlxs.\nSelecting\n What is the null hypothesis?\n\n\n\n Save a copy of expr.xlxs and import the data. I named the dataframe bambara\n What is the appropriate parametric test?\n\n\nApplying, interpreting and reporting\nA paired test requires us to test whether the difference in expression between high and low temperatures is zero on average. One handy way to achieve this is to organise our groups into two columns. The pivot_wider() function will do this for us. 
We need to tell it what column gives the identifiers (i.e., matches the the pairs) - the probe_ids in this case. We also need to say which variable contains what will become the column names and which contains the values.\n Pivot the data so there is a column for each temperature:\n\nbambara <- bambara |> \n pivot_wider(names_from = temperature, \n values_from = expression, \n id_cols = probe_id)\n\n Click on the bambara dataframe in the environment to open a view of it so that you understand what pivot_wider() has done.\n Create a paired-sample model like this:\n\nmod <- lm(data = bambara, \n highert - lowert ~ 1)\n\nSince we have done highert - lowert, the ā€œ(Intercept) Estimateā€ will be the average of the higher temperature expression minus the lower temperature expression for each gene.\n Examine the model with:\n\nsummary(mod)\n\n\nCall:\nlm(formula = highert - lowert ~ 1, data = bambara)\n\nResiduals:\n Min 1Q Median 3Q Max \n-1.05478 -0.46058 0.09682 0.33342 1.06892 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.30728 0.09591 3.204 0.00294 **\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.5674 on 34 degrees of freedom\n\n\n State your conclusion from the test in a form suitable for including in a report. Make sure you give the direction of any significant effect." + "objectID": "pgt52m/pgt52m.html#week-6-introduction-to-statistical-models-single-regression", + "href": "pgt52m/pgt52m.html#week-6-introduction-to-statistical-models-single-regression", + "title": "52M Data Analysis in R", + "section": "Week 6: Introduction to statistical models: Single regression", + "text": "Week 6: Introduction to statistical models: Single regression\nThis week you will be introduced to the idea of a statistical ā€œmodelā€ in general and to general linear model in particular. Our first general linear model will be single linear regression which puts a line of best fit through data so the response can be predicted from the explanatory variable. We will consider the two ā€œparametersā€ estimated by the model (the slope and the intercept) and whether these differ from zero" }, { - "objectID": "pgt52m/week-7/workshop.html#look-after-future-you", - "href": "pgt52m/week-7/workshop.html#look-after-future-you", - "title": "Workshop", - "section": "Look after future you!", - "text": "Look after future you!\nThe code required to summarise, test, and plot data for any two-sample test AND for any for any one-way ANOVA is exactly the same except for the names of the dataframe, variables and the axis labels and limits. Take some time to comment it your code so that you can make use of it next week.\n\nYouā€™re finished!" + "objectID": "pgt52m/pgt52m.html#week-7-two-sample-tests", + "href": "pgt52m/pgt52m.html#week-7-two-sample-tests", + "title": "52M Data Analysis in R", + "section": "Week 7: Two-sample tests", + "text": "Week 7: Two-sample tests\nThis week you will how to use and interpret the general linear model when the x variable is categorical and has two groups. Just as with single linear regression, the model puts a line of best through data and the model parameters, the intercept and the slope, have the same in interpretation The intercept is one of the group means and the slope is the difference between that, mean and the other group mean. You will also learn about the non-parametric equivalents - the tests we use when the assumptions of the general linear model are not met." 
}, { - "objectID": "pgt52m/week-1/study_after_workshop.html", - "href": "pgt52m/week-1/study_after_workshop.html", - "title": "Independent Study to consolidate this week", - "section": "", - "text": "There is no additional study this week but you may want to look ahead to next week." + "objectID": "pgt52m/pgt52m.html#week-8-one-way-anova-and-kruskal-wallis", + "href": "pgt52m/pgt52m.html#week-8-one-way-anova-and-kruskal-wallis", + "title": "52M Data Analysis in R", + "section": "Week 8: One-way ANOVA and Kruskal-Wallis", + "text": "Week 8: One-way ANOVA and Kruskal-Wallis\nLast week you learnt how to use and interpret the general linear model when the x variable was categorical with two groups. You will now extend that to situations when there are more than two groups. This is often known as the one-way ANOVA (analysis of variance). You will also learn about the Kruskal- Wallis test which can be used when the assumptions of the general linear model are not met." }, { - "objectID": "pgt52m/week-1/workshop.html", - "href": "pgt52m/week-1/workshop.html", - "title": "Workshop", - "section": "", - "text": "There is no formal workshop this week but you might want to install R and RStudio on your own machine. This is optional because University computers already have R and RStudio installed.\nInstall R and RStudio.\nNote you need a computer - not a tablet." + "objectID": "pgt52m/pgt52m.html#week-9-assessment-intro", + "href": "pgt52m/pgt52m.html#week-9-assessment-intro", + "title": "52M Data Analysis in R", + "section": "Week 9: Assessment intro", + "text": "Week 9: Assessment intro\nReproducible analysis of some relevant data." }, { - "objectID": "pgt52m/week-2/study_after_workshop.html", - "href": "pgt52m/week-2/study_after_workshop.html", - "title": "Independent Study to consolidate this week", - "section": "", - "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’» In a maternity hospital, the total numbers of births induced on each day of the week over a six week period were recorded (see table below). Create a plot of these data with the days of week in order.\n\n\n\n\nNumber of inductions for each day of the week over six weeks.\n\nDay\nNo. 
inductions\n\n\n\nMonday\n43\n\n\nTuesday\n36\n\n\nWednesday\n35\n\n\nThursday\n38\n\n\nFriday\n48\n\n\nSaturday\n26\n\n\nSunday\n24\n\n\n\n\n\n\nCode# create a dataframe for the data\nday <- c(\"Monday\", \n \"Tuesday\", \n \"Wednesday\",\n \"Thursday\",\n \"Friday\",\n \"Saturday\",\n \"Sunday\")\nfreq <- c(43, 36, 35, 38, 48, 26, 24) \ninductions <- data.frame(day, freq)\n\n# make the order of the days correct rather than alphabetical\ninductions <- inductions |> \n mutate(day = fct_relevel(day, c(\"Monday\",\n \"Tuesday\",\n \"Wednesday\",\n \"Thursday\",\n \"Friday\",\n \"Saturday\",\n \"Sunday\")))\n\n# plot the data as a barplot with the bars in\nggplot(data = inductions, \n aes(x = day, y = freq)) +\n geom_col(colour = \"black\",\n fill = \"lightseagreen\") +\n scale_x_discrete(expand = c(0, 0),\n name = \"Day of the week\") + \n scale_y_continuous(expand = c(0, 0),\n name = \"Number of inductions\",\n limits = c(0, 55)) +\n theme_classic()\n\n\n\nšŸ“– Read Workflow in RStudio" + "objectID": "pgt52m/pgt52m.html#week-10-reproducible-reporting", + "href": "pgt52m/pgt52m.html#week-10-reproducible-reporting", + "title": "52M Data Analysis in R", + "section": "Week 10: Reproducible Reporting", + "text": "Week 10: Reproducible Reporting\nUsing Quarto" + }, + { + "objectID": "pgt52m/pgt52m.html#week-11-drop-in", + "href": "pgt52m/pgt52m.html#week-11-drop-in", + "title": "52M Data Analysis in R", + "section": "Week 11: Drop-in", + "text": "Week 11: Drop-in" }, { "objectID": "pgt52m/week-2/workshop.html", @@ -294,81 +448,151 @@ "text": "Footnotes\n\nThere are also scale_x_continous() and scale_y_discrete() functions when you have those types of variableā†©ļøŽ\nModify components of a themeā†©ļøŽ" }, { - "objectID": "pgt52m/week-2/study_before_workshop.html", - "href": "pgt52m/week-2/study_before_workshop.html", - "title": "Independent Study to prepare for workshop", + "objectID": "pgt52m/week-2/study_after_workshop.html", + "href": "pgt52m/week-2/study_after_workshop.html", + "title": "Independent Study to consolidate this week", "section": "", - "text": "Either\n\nšŸ“– Read First Steps in RStudio in\n\nOR\n\nšŸ“¹ Watch" + "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’» In a maternity hospital, the total numbers of births induced on each day of the week over a six week period were recorded (see table below). Create a plot of these data with the days of week in order.\n\n\n\n\nNumber of inductions for each day of the week over six weeks.\n\nDay\nNo. 
inductions\n\n\n\nMonday\n43\n\n\nTuesday\n36\n\n\nWednesday\n35\n\n\nThursday\n38\n\n\nFriday\n48\n\n\nSaturday\n26\n\n\nSunday\n24\n\n\n\n\n\n\nCode# create a dataframe for the data\nday <- c(\"Monday\", \n \"Tuesday\", \n \"Wednesday\",\n \"Thursday\",\n \"Friday\",\n \"Saturday\",\n \"Sunday\")\nfreq <- c(43, 36, 35, 38, 48, 26, 24) \ninductions <- data.frame(day, freq)\n\n# make the order of the days correct rather than alphabetical\ninductions <- inductions |> \n mutate(day = fct_relevel(day, c(\"Monday\",\n \"Tuesday\",\n \"Wednesday\",\n \"Thursday\",\n \"Friday\",\n \"Saturday\",\n \"Sunday\")))\n\n# plot the data as a barplot with the bars in\nggplot(data = inductions, \n aes(x = day, y = freq)) +\n geom_col(colour = \"black\",\n fill = \"lightseagreen\") +\n scale_x_discrete(expand = c(0, 0),\n name = \"Day of the week\") + \n scale_y_continuous(expand = c(0, 0),\n name = \"Number of inductions\",\n limits = c(0, 55)) +\n theme_classic()\n\n\n\nšŸ“– Read Workflow in RStudio" }, { - "objectID": "pgt52m/week-6/overview.html", - "href": "pgt52m/week-6/overview.html", - "title": "Overview", + "objectID": "pgt52m/week-8/workshop.html", + "href": "pgt52m/week-8/workshop.html", + "title": "Workshop", "section": "", - "text": "This week you will be introduced to the idea of a statistical ā€œmodelā€ in general and to general linear model in particular. Our first general linear model will be single linear regression which puts a line of best fit through data so the response can be predicted from the explanatory variable. We will consider the two ā€œparametersā€ estimated by the model (the slope and the intercept) and whether these differ from zero\n\nLearning objectives\nThe successful student will be able to:\n\nexplain what is meant by a statistical model and fitting a model\nknow what the general linear model is and how it relates to regression\nexplain the principle of regression and know when it can be applied\napply and interpret a simple linear regression in R\nevaluate whether the assumptions of regression are met\nscientifically report a regression result including appropriate figures\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read What is a statistical model\nšŸ“– Read Single linear regression\n\nWorkshop\ni.šŸ’» Carry out a single linear regression\nConsolidate\n\nšŸ’» Appropriately analyse the relationsip between juvenile hormone and mandible size in stage beetles\nšŸ’» Appropriately analyse the relationsip between anxiety and performance" + "text": "Artwork by Horst (2023): ā€œDebugging and feelingsā€\n\n\nIn this session you will get practice in choosing between, performing, and presenting the results of, one-way ANOVA and Kruskal-Wallis in R.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. 
It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." }, { - "objectID": "pgt52m/week-6/study_before_workshop.html", - "href": "pgt52m/week-6/study_before_workshop.html", - "title": "Independent Study to prepare for workshop", + "objectID": "pgt52m/week-8/workshop.html#session-overview", + "href": "pgt52m/week-8/workshop.html#session-overview", + "title": "Workshop", "section": "", - "text": "šŸ“– Read What is a statistical model\nšŸ“– Read Single linear regression" + "text": "In this session you will get practice in choosing between, performing, and presenting the results of, one-way ANOVA and Kruskal-Wallis in R." }, { - "objectID": "pgt52m/week-5/overview.html", - "href": "pgt52m/week-5/overview.html", - "title": "Overview", + "objectID": "pgt52m/week-8/workshop.html#philosophy", + "href": "pgt52m/week-8/workshop.html#philosophy", + "title": "Workshop", "section": "", - "text": "This week we will cover the logic of consider the logic of hypothesis testing and type 1 and type 2 errors. We will also find out what the sampling distribution of the mean and the standard error are, and how to calculate confidence intervals.\n\n\n\nArtwork by Horst (2023): ā€œtype 1 errorā€\n\n\n\n\n\nArtwork by Horst (2023): ā€œtype 2 errorā€\n\n\n\nLearning objectives\nThe successful student will be able to:\n\ndemonstrate the process of hypothesis testing with an example\nexplain type 1 and type 2 errors\ndefine the sampling distribution of the mean and the standard error\nexplain what a confidence interval is\ncalculate confidence intervals for large and small samples\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read The logic of hyothesis testing\nšŸ“– Read Confidence Intervals\n\nWorkshop\n\nšŸ’» Remind yourself how to import files\nšŸ’» Calculate confidence intervals on large\nšŸ’» Calculate confidence intervals on small samples.\n\nConsolidate\n\nšŸ’» Calculate confidence intervals for each group in a data set\n\n\n\n\n\n\n\nReferences\n\nHorst, Allison. 2023. ā€œData Science Illustrations.ā€ https://allisonhorst.com/allison-horst." + "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. 
It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." }, { - "objectID": "pgt52m/week-5/study_before_workshop.html", - "href": "pgt52m/week-5/study_before_workshop.html", - "title": "Independent Study to prepare for workshop", + "objectID": "pgt52m/week-8/workshop.html#myoglobin-in-seal-muscle", + "href": "pgt52m/week-8/workshop.html#myoglobin-in-seal-muscle", + "title": "Workshop", + "section": "Myoglobin in seal muscle", + "text": "Myoglobin in seal muscle\nThe myoglobin concentration of skeletal muscle of three species of seal in grams per kilogram of muscle was determined and the data are given in seal.csv. We want to know if there is a difference between species. Each row represents an individual seal. The first column gives the myoglobin concentration and the second column indicates species.\n Save a copy of the data file seal.csv to data-raw\n Read in the data and check the structure. I used the name seal for the dataframe/tibble.\n What kind of variables do you have?\n\n\n\nExploring\n Do a quick plot of the data. You may need to refer to a previous workshop\nSummarising the data\nDo you remember Look after future you!\n If you followed that tip youā€™ll be able to open that script and whizz through summarising,testing and plotting.\n Create a data frame called seal_summary that contains the means, standard deviations, sample sizes and standard errors for each species.\nYou should get the following numbers:\n\n\n\n\nspecies\nmean\nstd\nn\nse\n\n\n\nBladdernose Seal\n42.31600\n8.020634\n30\n1.464361\n\n\nHarbour Seal\n49.01033\n8.252004\n30\n1.506603\n\n\nWeddell Seal\n44.66033\n7.849816\n30\n1.433174\n\n\n\n\n\nApplying, interpreting and reporting\nWe can now carry out a one-way ANOVA using the same lm() function we used for two-sample tests.\n Carry out an ANOVA and examine the results with:\n\nmod <- lm(data = seal, myoglobin ~ species)\nsummary(mod)\n\n\nCall:\nlm(formula = myoglobin ~ species, data = seal)\n\nResiduals:\n Min 1Q Median 3Q Max \n-16.306 -5.578 -0.036 5.240 18.250 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 42.316 1.468 28.819 < 2e-16 ***\nspeciesHarbour Seal 6.694 2.077 3.224 0.00178 ** \nspeciesWeddell Seal 2.344 2.077 1.129 0.26202 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 8.043 on 87 degrees of freedom\nMultiple R-squared: 0.1096, Adjusted R-squared: 0.08908 \nF-statistic: 5.352 on 2 and 87 DF, p-value: 0.006427\n\n\nRemember: the tilde (~) means test the values in myoglobin when grouped by the values in species. Or explain myoglobin with species\n What do you conclude so far from the test? Write your conclusion in a form suitable for a report.\n\n\n\n Can you relate the values under Estimate to the means?\n\n\n\n\n\n\n\nThe ANOVA is significant but this only tells us that species matters, meaning at least two of the means differ. To find out which means differ, we need a post-hoc test. A post-hoc (ā€œafter thisā€) test is done after a significant ANOVA test. 
There are several possible post-hoc tests and we will be using Tukeyā€™s HSD (honestly significant difference) test (Tukey 1949) implemented in the emmeans (Lenth 2023) package.\n Load the package\n\nlibrary(emmeans)\n\n Carry out the post-hoc test\n\nemmeans(mod, ~ species) |> pairs()\n\n contrast estimate SE df t.ratio p.value\n Bladdernose Seal - Harbour Seal -6.69 2.08 87 -3.224 0.0050\n Bladdernose Seal - Weddell Seal -2.34 2.08 87 -1.129 0.4990\n Harbour Seal - Weddell Seal 4.35 2.08 87 2.095 0.0968\n\nP value adjustment: tukey method for comparing a family of 3 estimates \n\n\nEach row is a comparison between the two means in the ā€˜contrastā€™ column. The ā€˜estimateā€™ column is the difference between those means and the ā€˜p.valueā€™ indicates whether that difference is significant.\nA plot can be used to visualise the result of the post-hoc which can be especially useful when there are very many comparisons.\n Plot the results of the post-hoc test:\n\nemmeans(mod, ~ species) |> plot()\n\n\n\n\nWhere the purple bars overlap, there is no significant difference.\n What do you conclude from the test?\n\n\n\nCheck assumptions\nThe assumptions of the general linear model are that the residuals ā€“ the difference between predicted value (i.e., the group mean) and observed values - are normally distributed and have homogeneous variance. To check these we can examine the mod$residuals variable. You may want to refer to Checking assumptions in the ā€œSingle regressionā€ workshop.\n Plot the model residuals against the fitted values.\n What to you conclude?\n\n\n\nTo examine normality of the model residuals we can plot them as a histogram and do a normality test on them.\n Plot a histogram of the residuals.\n Use the shapiro.test() to test the normality of the model residuals\n What to you conclude?\n\n\n\n\nIllustrating\n Create a figure like the one below. You may need to refer to Visualise from the ā€œSummarising data with several variablesā€ workshop (Rand 2023)\nWe will again use both our seal and seal_summary dataframes.\n Create the plot:\n\n\n\n\n\n Save your figure to your figures folder." + }, + { + "objectID": "pgt52m/week-8/workshop.html#leafminers-on-birch", + "href": "pgt52m/week-8/workshop.html#leafminers-on-birch", + "title": "Workshop", + "section": "Leafminers on Birch", + "text": "Leafminers on Birch\nLarvae of the Ambermarked birch leafminer, Profenusa thomsoni, feed on the interior leaf tissues of Birch (Betula) species. They do not normally kill the tree but can weaken it making it susceptible to attack from other species. Researchers are interested in whether there is a difference in the rates at which white, grey and yellow birch are attacked. They introduce adult female P.thomsoni to a green house containing 30 young trees (ten of each type) and later count the egg laying events on each tree. The data are in leaf.txt.\nExploring\n Read in the data and check the structure. 
I used the name leaf for the dataframe/tibble.\n What kind of variables do we have?\n\n\n\n Do a quick plot of the data.\n Using your common sense, do these data look normally distributed?\n\n\n Why is a Kruskal-Wallis appropriate in this case?\n\n\n\n\n\n Calculate the medians, means and sample sizes.\nApplying, interpreting and reporting\n Carry out a Kruskal-Wallis:\n\nkruskal.test(data = leaf, eggs ~ birch)\n\n\n Kruskal-Wallis rank sum test\n\ndata: eggs by birch\nKruskal-Wallis chi-squared = 6.3393, df = 2, p-value = 0.04202\n\n\n What do you conclude from the test?\n\n\n\nA significant Kruskal-Wallis tells us at least two of the groups differ but where do the differences lie? The Dunn test is a post-hoc multiple comparison test for a significant Kruskal-Wallis. It is available in the package FSA.\n Load the package using:\n\nlibrary(FSA)\n\n Run the post-hoc test with:\n\ndunnTest(data = leaf, eggs ~ birch)\n\n Comparison Z P.unadj P.adj\n1 Grey - White 1.296845 0.19468465 0.38936930\n2 Grey - Yellow -1.220560 0.22225279 0.22225279\n3 White - Yellow -2.517404 0.01182231 0.03546692\n\n\nThe P.adj column gives the p-value for the comparison listed in the first column. Z is the test statistic.\n What do you conclude from the test?\n\n\n\n Write up the result in a form suitable for a report.\n\n\n\n\n\n\nIllustrating\n A box plot is an appropriate choice for illustrating a Kruskal-Wallis. Can you produce a figure like this?\n\n\n\n\n\nYou’re finished!" + }, + { + "objectID": "pgt52m/week-8/study_after_workshop.html", + "href": "pgt52m/week-8/study_after_workshop.html", + "title": "Independent Study to consolidate this week", "section": "", - "text": "šŸ“– Read The logic of hyothesis testing\nšŸ“– Read Confidence Intervals" + "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\n💻 Sports scientists were investigating the effects of fitness and heat acclimatisation on the sodium content of sweat. They measured the sodium content of the sweat (μmol l^-1) of three groups of individuals: unfit and unacclimatised (UU); fit and unacclimatised (FU); and fit and acclimatised (FA). The data are in sweat.txt. 
Is there a difference between the groups in the sodium content of their sweat?\n\n\nCode# read in the data and look at structure\nsweat <- read_table(\"data-raw/sweat.txt\")\nstr(sweat)\n\n\n\nCode# quick plot of the data\nggplot(data = sweat, aes(x = gp, y = na)) +\n geom_boxplot()\nCode# Since the sample sizes are small and not the same in each group and the \n# variance in the FA gp looks a bit lower, I'm leaning towards a non-parametric test, the K-W.\n# However, don't panic if you decided to do an ANOVA\n\n\n\nCode# calculate some summary stats \nsweat_summary <- sweat %>% \n group_by(gp) %>% \n summarise(mean = mean(na),\n n = length(na),\n median = median(na))\n\n\n\nCode# Kruskal-Wallis\nkruskal.test(data = sweat, na ~ gp)\n# We can say there is a difference between the groups in the sodium \n# content of their sweat (chi-squared = 11.9802, df = 2, p-value = 0.002503).\n# Unfit and unacclimatised people have the most salty sweat, \n# Fit and acclimatised people the least salty.\n\n\n\nCode# a post-hoc test to see where the sig differences lie:\nlibrary(FSA)\ndunnTest(data = sweat, na ~ gp)\n# Fit and acclimatised people (median = 49.5 μmol l^-1) have significantly less sodium in their\n# sweat than the unfit and unacclimatised people (70 μmol l^-1) \n# (Kruskal-Wallis multiple comparison p-values adjusted with the Holm method: p = 0.0026).\n# Fit and unacclimatised people (54 μmol l^-1) also have significantly less sodium in their\n# sweat than unfit and unacclimatised people (p = 0.033). \n# There was no difference between the Fit and unacclimatised and the Fit and acclimatised. See figure 1.\n\n\n\nCodeggplot(sweat, aes(x = gp, y = na) ) +\n geom_boxplot() +\n scale_x_discrete(labels = c(\"Fit Acclimatised\", \n \"Fit Unacclimatised\", \n \"Unfit Unacclimatised\"), \n name = \"Group\") +\n scale_y_continuous(limits = c(0, 110), \n expand = c(0, 0),\n name = expression(\"Sodium\"~mu*\"mol\"*l^{-1})) +\n annotate(\"segment\", x = 1, xend = 3, \n y = 100, yend = 100,\n colour = \"black\") +\n annotate(\"text\", x = 2, y = 103, \n label = expression(italic(p)~\"= 0.0026\")) +\n annotate(\"segment\", x = 2, xend = 3, \n y = 90, yend = 90,\n colour = \"black\") +\n annotate(\"text\", x = 2.5, y = 93, \n label = expression(italic(p)~\"= 0.0340\")) +\n theme_classic()\nCode#Figure 1. Sodium content of sweat for three groups: Fit and acclimatised\n#(FA), Fit and unacclimatised (FU) and Unfit and unacclimatised (UU). Heavy lines\n#indicate the median, boxes the interquartile range and whiskers the range. \n\n\n\n💻 The data given in biomass.txt are taken from an experiment in which the insect pest biomass (g) was measured on plots sprayed with water (control) or one of five different insecticides. Do the insecticides vary in their effectiveness? What advice would you give to a person: - currently using insecticide E? - trying to choose between A and D? - trying to choose between C and B?\n\n\nCodebiom <- read_table(\"data-raw/biomass.txt\")\n# The data are organised with an insecticide treatment group in\n# each column.\n\n\n\nCode#Put the data into tidy format.\n\nbiom <- biom |> \n pivot_longer(cols = everything(),\n names_to = \"spray\",\n values_to = \"biomass\")\n\n\n\nCode# quick plot of the data\nggplot(data = biom, aes(x = spray, y = biomass)) +\n geom_boxplot()\nCode# Looks like there is a difference between sprays. 
E doesn't look very effective.\n\n\n\nCode# summary statistics\nbiom_summary <- biom %>% \n group_by(spray) %>% \n summarise(mean = mean(biomass),\n median = median(biomass),\n sd = sd(biomass),\n n = length(biomass),\n se = sd / sqrt(n))\n# thoughts so far: the sample sizes are equal, 10 is a smallish but\n# reasonable sample size\n# the means and medians are similar to each other (expected for\n# normally distributed data), A has a smaller variance \n\n# We have one explanatory variable, \"spray\" comprising 6 levels\n# Biomass has decimal places and we would expect such data to be \n# normally distributed therefore one-way ANOVA is the desired test\n# - we will check the assumptions after building the model\n\n\n\nCode# arry out an ANOVA and examine the results \nmod <- lm(data = biom, biomass ~ spray)\nsummary(mod)\n# spray type does have an effect F-statistic: 26.46 on 5 and 54 DF, p-value: 2.081e-13\n\n\n\nCode# Carry out the post-hoc test\nlibrary(emmeans)\n\nemmeans(mod, ~ spray) |> pairs()\n\n# the signifcant comparisons are:\n# contrast estimate SE df t.ratio p.value\n# A - D -76.50 21.9 54 -3.489 0.0119\n# A - E -175.51 21.9 54 -8.005 <.0001\n# A - WaterControl -175.91 21.9 54 -8.024 <.0001\n# B - E -154.32 21.9 54 -7.039 <.0001\n# B - WaterControl -154.72 21.9 54 -7.057 <.0001\n# C - E -155.71 21.9 54 -7.102 <.0001\n# C - WaterControl -156.11 21.9 54 -7.120 <.0001\n# D - E -99.01 21.9 54 -4.516 0.0005\n# D - WaterControl -99.41 21.9 54 -4.534 0.0004\n# All sprays are better than the water control except E. \n# This is probably the most important result.\n# What advice would you give to a person currently using insecticide E?\n# Don't bother!! It's no better than water. Switch to any of \n# the other sprays\n# What advice would you give to a person currently\n# + trying to choose between A and D? Choose A because A has sig lower\n# insect biomass than D \n# + trying to choose between C and B? It doesn't matter because there is \n# no difference in insect biomass. Use other criteria to chose (e.g., price)\n# We might report this like:\n# There is a very highly significant effect of spray type on pest \n# biomass (F = 26.5; d.f., 5, 54; p < 0.001). Post-hoc testing \n# showed E was no more effective than the control; A, C and B were \n# all better than the control but could be equally as good as each\n# other; D would be a better choice than the control or E but \n# worse than A. 
See figure 1\n\n\n\nCode# I reordered the bars to make is easier for me to annotate with\n# I also used * to indicate significance\n\nggplot() +\n geom_point(data = biom, aes(x = reorder(spray, biomass), y = biomass),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"gray50\") +\n geom_errorbar(data = biom_summary, \n aes(x = spray, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = biom_summary, \n aes(x = spray, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_y_continuous(name = \"Pest Biomass (units)\",\n limits = c(0, 540),\n expand = c(0, 0)) +\n scale_x_discrete(\"Spray treatment\") +\n # E and control are one group\n annotate(\"segment\", x = 4.5, xend = 6.5, \n y = 397, yend = 397,\n colour = \"black\", linewidth = 1) +\n annotate(\"text\", x = 5.5, y = 385, \n label = \"N.S\", size = 4) +\n # WaterControl-D and E-D ***\n annotate(\"segment\", x = 4, xend = 5.5, \n y = 410, yend = 410,\n colour = \"black\") +\n annotate(\"text\", x = 4.5, y = 420, \n label = \"***\", size = 5) +\n # WaterControl-B ***\n annotate(\"segment\", x = 3, xend = 5.5, \n y = 440, yend = 440,\n colour = \"black\") +\n annotate(\"text\", x = 4, y = 450,\n label = \"***\", size = 5) +\n # WaterControl-C ***\n annotate(\"segment\", x = 2, xend = 5.5, \n y = 475, yend = 475,\n colour = \"black\") +\n annotate(\"text\", x = 3.5, y = 485, \n label = \"***\", size = 5) +\n # WaterControl-A ***\n annotate(\"segment\", x = 1, xend = 5.5, \n y = 510, yend = 510,\n colour = \"black\") +\n annotate(\"text\", x = 3.5, y = 520, \n label = \"***\", size = 5) + \n# A-D ***\n annotate(\"segment\", x = 1, xend = 4, \n y = 330, yend = 330,\n colour = \"black\") +\n annotate(\"text\", x = 2.5, y = 335, \n label = \"*\", size = 5) +\n theme_classic()\nCode# Figure 1. The mean pest biomass following various insecticide treatments.\n# Error bars are +/- 1 S.E. Significant comparisons are indicated: * is p < 0.05, ** p < 0.01 and *** is p < 0.001" }, { - "objectID": "pgt52m/week-4/overview.html", - "href": "pgt52m/week-4/overview.html", - "title": "Overview", + "objectID": "pgt52m/week-6/workshop.html", + "href": "pgt52m/week-6/workshop.html", + "title": "Workshop", "section": "", - "text": "Last week you summarised and plotted single variables. This week you will start plotting data sets with more than one variable. This means you need to be able determine which variable is the response and which is the explanatory. You will find out what is meant by ā€œtidyā€ data and how to perform a simple data tidying task. 
Finally you will discover how to save your figures and place them in documents.\n\nLearning objectives\n\nsummarise and plot appropriately datasets with more than one variable\nrecognise that variables can be categorised by their role in analysis\nexplain what is meant by ā€˜tidyā€™ data and be able to perform some data tidying tasks.\nsave figures to file\ncreate neat reports which include text and figures\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– From importing to reporting\n\nWorkshop\n\nšŸ’» Summarise and plot datasets with more than one variable.\nšŸ’» Practice with working directories, importing data, formatting figures and the pipe\nšŸ’» Lay out text, figures and figure legends in documents\n\nConsolidate\n\nšŸ’» Summarise and plot a dataframe from the workshop\nšŸ’» Practice the complete RStudio Project worklfow for a new dataset" + "text": "In this workshop you will get practice in applying, interpreting and reporting single linear regression.\n\n\nArtwork by Horst (2023): ā€œlinear regression dragonsā€\n\n\nIn this session you will carry out, interpret and report on a single linear regression.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." }, { - "objectID": "pgt52m/week-4/study_before_workshop.html", - "href": "pgt52m/week-4/study_before_workshop.html", - "title": "Independent Study to prepare for workshop", + "objectID": "pgt52m/week-6/workshop.html#session-overview", + "href": "pgt52m/week-6/workshop.html#session-overview", + "title": "Workshop", "section": "", - "text": "šŸ“– Read From importing to reporting. The first part of this chapter is about data import which we covered in the last workshop. You may be able to skip that part or you may find it useful to revise. The section on Summarising data will be mainly new." + "text": "In this session you will carry out, interpret and report on a single linear regression." }, { - "objectID": "pgt52m/week-8/overview.html", - "href": "pgt52m/week-8/overview.html", - "title": "Overview", + "objectID": "pgt52m/week-6/workshop.html#philosophy", + "href": "pgt52m/week-6/workshop.html#philosophy", + "title": "Workshop", "section": "", - "text": "Last week you learnt how to use and interpret the general linear model when the x variable was categorical with two groups. You will now extend that to situations when there are more than two groups. 
This is often known as the one-way ANOVA (analysis of variance). You will also learn about the Kruskal- Wallis test which can be used when the assumptions of the general linear model are not met.\n\nLearning objectives\nThe successful student will be able to:\n\nexplain the rationale behind ANOVA understand the meaning of the F values\nselect, appropriately, one-way ANOVA and Kruskal-Wallis\nknow what functions are used in R to run these tests and how to interpret them\nevaluate whether the assumptions of lm() are met\nscientifically report the results of these tests including appropriate figures\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read One-way ANOVA and Kruskal-Wallis\n\nWorkshop\n\nšŸ’» One-way ANOVA\nšŸ’» Kruskal-Wallis\n\nConsolidate\n\nšŸ’» Appropriately test if fitness and acclimation effect the sodium content of sweat\nšŸ’» Appropriately test if insecticides vary in their effectiveness" + "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." }, { - "objectID": "pgt52m/week-8/study_before_workshop.html", - "href": "pgt52m/week-8/study_before_workshop.html", - "title": "Independent Study to prepare for workshop", + "objectID": "pgt52m/week-6/workshop.html#linear-regression", + "href": "pgt52m/week-6/workshop.html#linear-regression", + "title": "Workshop", + "section": "Linear Regression", + "text": "Linear Regression\nThe data in plant.xlsx is a set of observations of plant growth over two months. The researchers planted the seeds and harvested, dried and weighed a plant each day from day 10 so all the data points are independent of each other.\n Save a copy of plant.xlsx to your data-raw folder and import it.\n What type of variables do you have? Which is the response and which is the explanatory? What is the null hypothesis?\n\n\n\n\n\n\nExploring\n Do a quick plot of the data:\n\nggplot(plant, aes(x = day, y = mass)) +\n geom_point()\n\n\n\n\n What are the assumptions of linear regression? 
Do these seem to be met?\n\n\n\n\n\n\n\n\n\n\nApplying, interpreting and reporting\n We now carry out a regression, assigning the result of the lm() procedure to a variable and examining it with summary().\n\nmod <- lm(data = plant, mass ~ day)\nsummary(mod)\n\n\nCall:\nlm(formula = mass ~ day, data = plant)\n\nResiduals:\n Min 1Q Median 3Q Max \n-32.810 -11.253 -0.408 9.075 48.869 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -8.6834 6.4729 -1.342 0.186 \nday 1.6026 0.1705 9.401 1.5e-12 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 17.92 on 49 degrees of freedom\nMultiple R-squared: 0.6433, Adjusted R-squared: 0.636 \nF-statistic: 88.37 on 1 and 49 DF, p-value: 1.503e-12\n\n\nThe Estimates in the Coefficients table give the intercept (first line) and the slope (second line) of the best fitting straight line. The p-values on the same line are tests of whether that coefficient is different from zero.\nThe F value and p-value in the last line are a test of whether the model as a whole explains a significant amount of variation in the dependent variable. For a single linear regression this is exactly equivalent to the test of the slope against zero.\n What is the equation of the line? What do you conclude from the analysis?\n\n\n\n\n\n Does the line go through (0,0)?\n\n\n\n What percentage of variation is explained by the line?\n\n\nIt might be useful to assign the slope and the intercept to variables in case we need them later. They can be accessed in the mod$coefficients variable:\n\nmod$coefficients\n\n(Intercept) day \n -8.683379 1.602606 \n\n\n Assign mod$coefficients[1] to b0 and mod$coefficients[2] to b1:\n\nb0 <- mod$coefficients[1] |> round(2)\nb1 <- mod$coefficients[2] |> round(2)\n\nI also rounded the values to two decimal places.\nChecking assumptions\nWe need to examine the residuals. Very conveniently, the object which is created by lm() contains a variable called $residuals. Also conveniently, R’s plot() function can be used on the output objects of lm(). The assumptions demand that each y is drawn from a normal distribution for each x and these normal distributions have the same variance. Therefore we plot the residuals against the fitted values to see if the variance is the same for all the values of x. The fitted – or predicted – values are the values on the line of best fit. Each residual is the difference between the fitted value and the observed value.\n Plot the model residuals against the fitted values like this:\n\nplot(mod, which = 1)\n\n\n\n\n What do you conclude?\n\n\n\nTo examine normality of the model residuals we can plot them as a histogram and do a normality test on them.\n Plot a histogram of the residuals:\n\nggplot(mapping = aes(x = mod$residuals)) + \n geom_histogram(bins = 10)\n\n\n\n\n Use the shapiro.test() to test the normality of the model residuals\n\nshapiro.test(mod$residuals)\n\n\n Shapiro-Wilk normality test\n\ndata: mod$residuals\nW = 0.96377, p-value = 0.1208\n\n\nUsually, when we are doing statistical tests we would like the test to be significant because it means we have evidence of a biological effect. However, when doing normality tests we hope it will not be significant. 
A non-significant result means that there is no significant difference between the distribution of the residuals and a normal distribution and that indicates the assumptions are met.\n What do you conclude?\n\n\n\n\nIllustrating\nWe want a figure with the points and the statistical model, i.e., the best fitting straight line.\n Create a scatter plot using geom_point()\n\nggplot(plant, aes(x = day, y = mass)) +\n geom_point() + \n theme_classic()\n\n\n\n\n The geom_smooth() function will add a variety of fitted lines to a plot. We want a straight line so we need to specify method = \"lm\":\n\nggplot(plant, aes(x = day, y = mass)) +\n geom_point() + \n geom_smooth(method = lm, \n se = FALSE, \n colour = \"black\") +\n theme_classic()\n\n\n\n\n What do the se and colour arguments do? Try changing them.\n Let’s add the equation of the line to the figure using annotate():\n\nggplot(plant, aes(x = day, y = mass)) +\n geom_point() +\n geom_smooth(method = lm, \n se = FALSE, \n colour = \"black\") +\n annotate(\"text\", x = 20, y = 110, \n label = \"mass = 1.61 * day - 8.68\") +\n theme_classic()\n\n\n\n\nWe have to tell annotate() what type of geom we want – text in this case – where to put it, and the text we want to appear.\n Improve the axes. You may need to refer back to Changing the axes from the Week 2 workshop\n Save your figure to your figures folder." + }, + { + "objectID": "pgt52m/week-6/workshop.html#look-after-future-you", + "href": "pgt52m/week-6/workshop.html#look-after-future-you", + "title": "Workshop", + "section": "Look after future you!", + "text": "Look after future you!\nYou’re finished!" + }, + { + "objectID": "pgt52m/week-6/study_after_workshop.html", + "href": "pgt52m/week-6/study_after_workshop.html", + "title": "Independent Study to consolidate this week", "section": "", + "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\n💻 Effect of anxiety status on sporting performance. The data in sprint.txt are from an investigation of the effect of anxiety status on sporting performance. A group of 40 100m sprinters undertook a psychometric test to measure their anxiety shortly before competing. The data are their anxiety scores and the 100m times achieved. What do you conclude from these data?\n\n\nCode# this example is designed to emphasise the importance of plotting your data first\nsprint <- read_table(\"data-raw/sprint.txt\")\n# Anxiety is discrete but ranges from 16 to 402, meaning the gap between possible measures is small and \n# the variable could be treated as continuous if needed. Time is a continuous measure that has decimal places and which we would expect to follow a normal distribution \n\n# explore with a plot\nggplot(sprint, aes(x = anxiety, y = time) ) +\n geom_point()\nCode# A scatterplot of the data clearly reveals that these data are not linear. There is a good relationship between the two variables but since it is not linear, single linear regression is not appropriate.\n\n\n\n💻 Juvenile hormone in stag beetles. The concentration of juvenile hormone in stag beetles is known to influence mandible growth. Groups of stag beetles were injected with different concentrations of juvenile hormone (arbitrary units) and their average mandible size (mm) determined. The experimenters planned to analyse their data with regression. 
The data are in stag.txt.\n\n\n\nCode# read the data in and check the structure\nstag <- read_table(\"data-raw/stag.txt\")\nstr(stag)\n\n# jh is discrete but ordered and has been chosen by the experimenter - it is the explanatory variable. \n# the response is mandible size which has decimal places and is something we would expect to be \n# normally distributed. So far, common sense suggests the assumptions of regression are met.\n\n\n\nCode# exploratory plot\nggplot(stag, aes(x = jh, y = mand)) +\n geom_point()\nCode# looks linear-ish on the scatter\n# regression still seems appropriate\n# we will check the other assumptions after we have run the lm\n\n\n\nCode# build the statistical model\nmod <- lm(data = stag, mand ~ jh)\n\n# examine it\nsummary(mod)\n# mand = 0.032*jh + 0.419\n# the slope of the line is significantly different from zero / the jh explains a significant amount of the variation in mand (ANOVA: F = 16.63; d.f. = 1, 14; p = 0.00113).\n# the intercept is 0.419 and differs significantly from zero \n\n\n\nCode# checking the assumption\nplot(mod, which = 1) \nCode# we're looking for the variance in the residuals to be equal along the x axis.\n# with a small data set there is some apparent heterogeneity but it doesn't look too bad.\n# \nhist(mod$residuals)\nCode# We have some skew which again might be partly a result of a small sample size.\nshapiro.test(mod$residuals) # the test is also not sig diff from normal\n\n# On balance the use of regression is probably justifiable but it is borderline;\n# ideally the experiment would be better if multiple individuals were measured at\n# each of the chosen juvenile hormone levels.\n\n\n\nCode# a better plot\nggplot(stag, aes(x = jh, y = mand) ) +\n geom_point() +\n geom_smooth(method = lm, se = FALSE, colour = \"black\") +\n scale_x_continuous(name = \"Juvenile hormone (arbitrary units)\",\n expand = c(0, 0),\n limits = c(0, 32)) +\n scale_y_continuous(name = \"Mandible size (mm)\",\n expand = c(0, 0),\n limits = c(0, 2)) +\n theme_classic()" + }, + { + "objectID": "pgt52m/week-7/workshop.html", + "href": "pgt52m/week-7/workshop.html", + "title": "Workshop", "section": "", + "text": "Artwork by Horst (2023): “How much I think I know about R”\n\n\nIn this workshop you will get practice in choosing between, performing, and presenting the results of, two-sample tests and their non-parametric equivalents in R.\n\nWorkshops are not a test. It is expected that you often don’t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndon’t worry about making mistakes\ndon’t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. 
It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + }, + { + "objectID": "pgt52m/week-7/workshop.html#session-overview", + "href": "pgt52m/week-7/workshop.html#session-overview", + "title": "Workshop", + "section": "", + "text": "In this workshop you will get practice in choosing between, performing, and presenting the results of, two-sample tests and their non-parametric equivalents in R." + }, + { + "objectID": "pgt52m/week-7/workshop.html#philosophy", + "href": "pgt52m/week-7/workshop.html#philosophy", + "title": "Workshop", + "section": "", + "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + }, + { + "objectID": "pgt52m/week-7/workshop.html#adiponectin-secretion", + "href": "pgt52m/week-7/workshop.html#adiponectin-secretion", + "title": "Workshop", + "section": "Adiponectin secretion", + "text": "Adiponectin secretion\nAdiponectin is exclusively secreted from adipose tissue and modulates a number of metabolic processes. Nicotinic acid can affect adiponectin secretion. 3T3-L1 adipocytes were treated with nicotinic acid or with a control treatment and adiponectin concentration (pg/mL) measured. The data are in adipocytes.txt. Each row represents an independent sample of adipocytes and the first column gives the concentration adiponectin and the second column indicates whether they were treated with nicotinic acid or not.\n Save a copy of adipocytes.txt to data-raw\n Read in the data and check the structure. I used the name adip for the dataframe/tibble.\nWe have a tibble containing two variables: adiponectin is the response and is continuous and treatment is explanatory. treatment is categorical with two levels (groups). The first task is visualise the data to get an overview. For continuous response variables with categorical explanatory variables you could use geom_point(), geom_boxplot() or a variety of other geoms. I often use geom_violin() which allows us to see the distribution - the violin is fatter where there are more data points.\n Do a quick plot of the data:\n\nggplot(data = adip, aes(x = treatment, y = adiponectin)) +\n geom_violin()\n\n\n\n\nSummarising the data\nSummarising the data for each treatment group is the next sensible step. 
The most useful summary statistics are the means, standard deviations, sample sizes and standard errors.\n Create a data frame called adip_summary that contains the means, standard deviations, sample sizes and standard errors for the control and nicotinic acid treated samples. You may need to refer to Summarise from the Week 4 workshop (one possible approach is also sketched at the end of this section).\nYou should get the following numbers:\n\n\n\n\ntreatment\nmean\nstd\nn\nse\n\n\n\ncontrol\n5.546000\n1.475247\n15\n0.3809072\n\n\nnicotinic\n7.508667\n1.793898\n15\n0.4631824\n\n\n\n\n\nSelecting a test\n Do you think this is a paired-sample test or a two-sample test?\n\n\n\n\nApplying, interpreting and reporting\n Create a two-sample model like this:\n\nmod <- lm(data = adip,\n adiponectin ~ treatment)\n\n Examine the model with:\n\nsummary(mod)\n\n\nCall:\nlm(formula = adiponectin ~ treatment, data = adip)\n\nResiduals:\n Min 1Q Median 3Q Max \n-4.3787 -1.0967 0.1927 1.0245 3.1113 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 5.5460 0.4240 13.079 1.9e-13 ***\ntreatmentnicotinic 1.9627 0.5997 3.273 0.00283 ** \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 1.642 on 28 degrees of freedom\nMultiple R-squared: 0.2767, Adjusted R-squared: 0.2509 \nF-statistic: 10.71 on 1 and 28 DF, p-value: 0.00283\n\n\n What do you conclude from the test? Write your conclusion in a form suitable for a report.\n\n\n\n\nCheck assumptions\nThe assumptions of the general linear model are that the residuals – the difference between the predicted values (i.e., the group means) and the observed values – are normally distributed and have homogeneous variance. To check these we can examine the mod$residuals variable. You may want to refer to Checking assumptions in the “Single regression” workshop.\n Plot the model residuals against the fitted values.\n What do you conclude?\n\n\n\nTo examine normality of the model residuals we can plot them as a histogram and do a normality test on them.\n Plot a histogram of the residuals.\n Use the shapiro.test() to test the normality of the model residuals\n What do you conclude?\n\n\n\n\nIllustrating\n Create a figure like the one below. You may need to refer to Visualise from the “Summarising data with several variables” workshop (Rand 2023)\n\n\n\n\n\nWe now need to annotate the figure with the results from the statistical test. This is most commonly done with a line linking the means being compared and the p-value. The annotate() function can be used to draw the line and then to add the value. The line is a segment and the p-value is a text.\n Add annotation to the figure by adding:\n...... +\n annotate(\"segment\", x = 1, xend = 2, \n y = 11.3, yend = 11.3,\n colour = \"black\") +\n annotate(\"text\", x = 1.5, y = 11.7, \n label = expression(italic(p)~\"= 0.003\")) +\n theme_classic()\n\n\n\n\n\nFor the segment, annotate() needs the x and y coordinates for the start and the finish of the line.\nThe use of expression() allows you to specify formatting or special characters. expression() takes strings or LaTeX formatting. Each string or piece of LaTeX is separated by a * or a ~. The * concatenates the strings without a space, ~ does so with a space. It will generate a warning message “In is.na(x) : is.na() applied to non-(list or vector) of type ‘expression’” which can be ignored.\n Save your figure to your figures folder."
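As flagged above, a minimal sketch of one way to produce the adip_summary table and to run the assumption checks asked for in this section, assuming the data are in a tibble called adip with columns adiponectin and treatment, that mod is the model created above, and that the tidyverse is loaded:
# summary statistics for each treatment group
adip_summary <- adip |>
  group_by(treatment) |>
  summarise(mean = mean(adiponectin),
            std = sd(adiponectin),
            n = length(adiponectin),
            se = std / sqrt(n))
# residuals against fitted values to check homogeneity of variance
plot(mod, which = 1)
# histogram of the residuals to examine their distribution
ggplot(mapping = aes(x = mod$residuals)) +
  geom_histogram(bins = 10)
# normality test on the residuals
shapiro.test(mod$residuals)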
+ }, + { + "objectID": "pgt52m/week-7/workshop.html#grouse-parasites", + "href": "pgt52m/week-7/workshop.html#grouse-parasites", + "title": "Workshop", + "section": "Grouse Parasites", + "text": "Grouse Parasites\nGrouse livers were dissected and the number of individuals of a parasitic nematode was counted for two estates, ‘Gordon’ and ‘Moss’. We want to know if the two estates have different infection rates. The data are in grouse.csv.\n Save a copy of grouse.csv to data-raw\n Read in the data and check the structure. I used the name grouse for the dataframe/tibble.\nSelecting\n Using your common sense, do these data look normally distributed?\n\n\n\n What test do you suggest?\n\n\nApplying, interpreting and reporting\n Summarise the data by finding the median of each group:\n Carry out a two-sample Wilcoxon test (also known as a Mann-Whitney):\n\nwilcox.test(data = grouse, nematodes ~ estate)\n\n\n Wilcoxon rank sum exact test\n\ndata: nematodes by estate\nW = 78, p-value = 0.03546\nalternative hypothesis: true location shift is not equal to 0\n\n\n What do you conclude from the test? Write your conclusion in a form suitable for a report.\n\n\n\nIllustrating\nA box plot is usually a good choice for illustrating a two-sample Wilcoxon test because it shows the median and interquartile range.\n We can create a simple boxplot with:\n\nggplot(data = grouse, aes(x = estate, y = nematodes) ) +\n geom_boxplot() \n\n\n\n\n Annotate and format the figure so it is more suitable for a report and save it to your figures folder." + }, + { + "objectID": "pgt52m/week-7/workshop.html#gene-expression", + "href": "pgt52m/week-7/workshop.html#gene-expression", + "title": "Workshop", + "section": "Gene Expression", + "text": "Gene Expression\nBambara groundnut (Vigna subterranea) is an African legume with good nutritional value which can be influenced by low temperature stress. Researchers are interested in the expression levels of a particular set of 35 genes (probe_id) in response to temperature stress. They measure the expression of the genes at 23 and 18 degrees C (high and low temperature). These samples are not independent because we have two measures from each gene. The data are in expr.xlsx.\nSelecting\n What is the null hypothesis?\n\n\n\n Save a copy of expr.xlsx and import the data. I named the dataframe bambara.\n What is the appropriate parametric test?\n\n\nApplying, interpreting and reporting\nA paired test requires us to test whether the difference in expression between high and low temperatures is zero on average. One handy way to achieve this is to organise our groups into two columns. The pivot_wider() function will do this for us. We need to tell it what column gives the identifiers (i.e., matches the pairs) - the probe_ids in this case. 
We also need to say which variable contains what will become the column names and which contains the values.\n Pivot the data so there is a column for each temperature:\n\nbambara <- bambara |> \n pivot_wider(names_from = temperature, \n values_from = expression, \n id_cols = probe_id)\n\n Click on the bambara dataframe in the environment to open a view of it so that you understand what pivot_wider() has done.\n Create a paired-sample model like this:\n\nmod <- lm(data = bambara, \n highert - lowert ~ 1)\n\nSince we have done highert - lowert, the ā€œ(Intercept) Estimateā€ will be the average of the higher temperature expression minus the lower temperature expression for each gene.\n Examine the model with:\n\nsummary(mod)\n\n\nCall:\nlm(formula = highert - lowert ~ 1, data = bambara)\n\nResiduals:\n Min 1Q Median 3Q Max \n-1.05478 -0.46058 0.09682 0.33342 1.06892 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.30728 0.09591 3.204 0.00294 **\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.5674 on 34 degrees of freedom\n\n\n State your conclusion from the test in a form suitable for including in a report. Make sure you give the direction of any significant effect." }, { - "objectID": "pgt52m/week-3/overview.html", - "href": "pgt52m/week-3/overview.html", - "title": "Overview", - "section": "", - "text": "The type of values our data can take is important in how we analyse and visualise it. This week you will learn the difference between continuous and discrete values and how we summarise and visualise them. You will also learn about the ā€œnormal distributionā€ which is the most important continuous distribution.\n\n\n\nDiscrete variable\n\n\n\nLearning objectives\nThe successful student will be able to:\n\ndistinguish between continuous, discrete, nominal and ordinal variable\nread in data in to RStudio from a plain text file and Excel files\nsummarise and plot variables appropriately for the data type\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read: Ideas about data\n\nWorkshop\n\nšŸ’» Importing data\nšŸ’» Summarising discrete data\nšŸ’» Summarising count data\nšŸ’» Summarising continuous data\n\nConsolidate\n\nšŸ’» Summarise some data\nšŸ’» Plot some data\nšŸ’» Format a plot (1)\nšŸ’» Format a plot (2)\nšŸ“– Read Understanding the pipe |>" + "objectID": "pgt52m/week-7/workshop.html#look-after-future-you", + "href": "pgt52m/week-7/workshop.html#look-after-future-you", + "title": "Workshop", + "section": "Look after future you!", + "text": "Look after future you!\nThe code required to summarise, test, and plot data for any two-sample test AND for any for any one-way ANOVA is exactly the same except for the names of the dataframe, variables and the axis labels and limits. Take some time to comment it your code so that you can make use of it next week.\n\nYouā€™re finished!" }, { - "objectID": "pgt52m/week-3/study_before_workshop.html", - "href": "pgt52m/week-3/study_before_workshop.html", - "title": "Independent Study to prepare for workshop", + "objectID": "pgt52m/week-7/study_after_workshop.html", + "href": "pgt52m/week-7/study_after_workshop.html", + "title": "Independent Study to consolidate this week", "section": "", - "text": "šŸ“– Read Ideas about data" + "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’» Plant Biotech. Some plant biotechnologists are trying to increase the quantity of omega 3 fatty acids in Cannabis sativa. 
They have developed a genetically modified line using genes from Linum usitatissimum (linseed). They grow 50 wild type and fifty modified plants to maturity, collect the seeds and determine the amount of omega 3 fatty acids. The data are in csativa.txt. Do you think their modification has been successful?\n\n\nCodecsativa <- read_table(\"data-raw/csativa.txt\")\nstr(csativa)\n\n# First realise that this is a two sample test. You have two independent samples\n# - there are a total of 100 different plants and the values in one \n# group have no relationship to the values in the other.\n\n\n\nCode# create a rough plot of the data \nggplot(data = csativa, aes(x = plant, y = omega)) +\n geom_violin()\nCode# note the modified plants seem to have lower omega!\n\n\n\nCode# create a summary of the data\ncsativa_summary <- csativa %>%\n group_by(plant) %>%\n summarise(mean = mean(omega),\n std = sd(omega),\n n = length(omega),\n se = std/sqrt(n))\n\n\n\nCode# The data seem to be continuous so it is likely that a parametric test will be fine\n# we will check the other assumptions after we have run the lm\n\n# build the statistical model\nmod <- lm(data = csativa, omega ~ plant)\n\n\n# examine it\nsummary(mod)\n# So there is a significant difference but you need to make sure you know the direction!\n# Wild plants have a significantly higher omega 3 content (mean +/- s.e = 56.41 +/- 1.11) \n# than modified plants (49.46 +/- 0.82)(t = 5.03; d.f. = 98; p < 0.0001).\n\n\n\nCode# let's check the assumptions\nplot(mod, which = 1) \nCode# we're looking for the variance in the residuals to be the same in both groups.\n# This looks OK. Maybe a bit higher in the wild plants (with the higher mean)\n \nhist(mod$residuals)\nCodeshapiro.test(mod$residuals)\n# On balance the use of lm() is probably justifiable The variance isn't quite equal \n# and the histogram looks a bit off normal but the normality test is NS and the \n# effect (in the figure) is clear.\n\n\n\nCode# A figure \nfig1 <- ggplot() +\n geom_point(data = csativa, aes(x = plant, y = omega),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"gray50\") +\n geom_errorbar(data = csativa_summary, \n aes(x = plant, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = csativa_summary, \n aes(x = plant, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_x_discrete(name = \"Plant type\", labels = c(\"GMO\", \"WT\")) +\n scale_y_continuous(name = \"Amount of Omega 3 (units)\",\n expand = c(0, 0),\n limits = c(0, 90)) +\n annotate(\"segment\", x = 1, xend = 2, \n y = 80, yend = 80,\n colour = \"black\") +\n annotate(\"text\", x = 1.5, y = 85, \n label = expression(italic(p)~\"< 0.001\")) +\n theme_classic()\n\n# save figure to figures/csativa.png\nggsave(\"figures/csativa.png\",\n plot = fig1,\n width = 3.5,\n height = 3.5,\n units = \"in\",\n dpi = 300)\n\n\n\nšŸ’» another example" }, { "objectID": "index.html", @@ -405,132 +629,6 @@ "section": "", "text": "All the Data Analysis in R teaching is on the VLE so why is this site useful? Well, perhaps more than any other material, you will want to refer back when applying your skills throughout your degree and this site collects everything together in a searchable way. The search icon is on the top right." }, - { - "objectID": "r4babs1/week-7/overview.html", - "href": "r4babs1/week-7/overview.html", - "title": "Overview", - "section": "", - "text": "This week you will start writing R code in RStudio and will create your first graph! 
You will learn about data types such as ā€œnumericsā€ and ā€œcharactersā€ and some of the different types of objects in R such as ā€œvectorsā€ and ā€œdataframesā€. These are the building blocks for the rest of your R journey. You will also learn a workflow and about the layout of RStudio and using RStudio Projects.\n\n\n\nArtwork by Horst (2023): ā€œbless this workflowā€\n\n\n\nLearning objectives\nThe successful student will be able to:\n\nuse the R command line as a calculator and to assign variables\ncreate and use the basic data types in R\nfind their way around the RStudio windows\nuse an RStudio Project to organise work\nuse a script to run R commands\ncreate and customise a barplot\nsearch and understand manual pages\n\n\n\nInstructions\n\nPrepare\n\nFirst Steps in RStudio: Either šŸ“– Read the book OR šŸ“¹ Watch two videos\n\nWorkshop\ni.šŸ’» šŸˆ Coat colour of cats. Type in some data, perform calculations on, and plot it.\nConsolidate\n\nšŸ’» Create a plot\nšŸ“– Read Workflow in RStudio\n\n\n\n\n\n\n\nReferences\n\nHorst, Allison. 2023. ā€œData Science Illustrations.ā€ https://allisonhorst.com/allison-horst." - }, - { - "objectID": "r4babs1/week-7/rstudio-projects.html#outline", - "href": "r4babs1/week-7/rstudio-projects.html#outline", - "title": "RStudio ProjectsWho, what, why?", - "section": "Outline", - "text": "Outline\n\nWho\nA One-line what\nThe high-level why\n\n\nMight be enough!\n\n\n\nMore detailed why\nMore detailed what" - }, - { - "objectID": "r4babs1/week-7/rstudio-projects.html#audience", - "href": "r4babs1/week-7/rstudio-projects.html#audience", - "title": "RStudio ProjectsWho, what, why?", - "section": "Audience", - "text": "Audience\n\nYou teach using R directly\n\nBecoming a Bioscientist 1 - 4\nIM group project\nPGT\n\nYou teach or supervise students using R\n\nfield courses, practical work\nprojects\n\nYou use R" - }, - { - "objectID": "r4babs1/week-7/rstudio-projects.html#an-rstudio-project", - "href": "r4babs1/week-7/rstudio-projects.html#an-rstudio-project", - "title": "RStudio ProjectsWho, what, why?", - "section": "šŸ“ An RStudio Project", - "text": "šŸ“ An RStudio Project\n\nis a folder!\n\n\n\nhave been part of the stage 1 and IM stage 3 for > 5 years\n\n\n\nStage 1\n\nUse an RStudio project containing the script you used to analyse and plot the data for your report, your figures and and the data itself. The Project should be structured and the script should be well-commented, well-organised and follow good practice in the use of spacing, indentation, and variable naming. It should include all the code required to reproduce data import and formatting as well as the summary information, analyses, and figures in your report." 
- }, - { - "objectID": "r4babs1/week-7/rstudio-projects.html#y12345678", - "href": "r4babs1/week-7/rstudio-projects.html#y12345678", - "title": "RStudio ProjectsWho, what, why?", - "section": "Y12345678", - "text": "Y12345678\ndemo" - }, - { - "objectID": "r4babs1/week-7/rstudio-projects.html#babs-1-4-lo-progression", - "href": "r4babs1/week-7/rstudio-projects.html#babs-1-4-lo-progression", - "title": "RStudio ProjectsWho, what, why?", - "section": "BABS 1-4 LO progression", - "text": "BABS 1-4 LO progression\nBABS 1-5 LO progression" - }, - { - "objectID": "r4babs1/week-7/rstudio-projects.html#why-use-rstudio-projects", - "href": "r4babs1/week-7/rstudio-projects.html#why-use-rstudio-projects", - "title": "RStudio ProjectsWho, what, why?", - "section": "Why use RStudio Projects", - "text": "Why use RStudio Projects\n\nthe same reason we keep lab books: reproducibility and validation\n\n\nItā€™s science!\n\n\n\nvia GIPHY" - }, - { - "objectID": "r4babs1/week-7/rstudio-projects.html#why-use-rstudio-projects-1", - "href": "r4babs1/week-7/rstudio-projects.html#why-use-rstudio-projects-1", - "title": "RStudio ProjectsWho, what, why?", - "section": "Why use RStudio Projects", - "text": "Why use RStudio Projects\n\nTransferable: explicit training in organising work" - }, - { - "objectID": "r4babs1/week-7/rstudio-projects.html#why-use-rstudio-projects-2", - "href": "r4babs1/week-7/rstudio-projects.html#why-use-rstudio-projects-2", - "title": "RStudio ProjectsWho, what, why?", - "section": "Why use RStudio Projects", - "text": "Why use RStudio Projects\n\n\n\nhelp you to work with your most important collaborator\n\n\n\n\n\nfutureself, CC-BY-NC, by Julen Colomb" - }, - { - "objectID": "r4babs1/week-7/rstudio-projects.html#section", - "href": "r4babs1/week-7/rstudio-projects.html#section", - "title": "RStudio ProjectsWho, what, why?", - "section": "", - "text": "via GIPHY" - }, - { - "objectID": "r4babs1/week-7/rstudio-projects.html#working-directories-and-paths", - "href": "r4babs1/week-7/rstudio-projects.html#working-directories-and-paths", - "title": "RStudio ProjectsWho, what, why?", - "section": "Working directories and Paths", - "text": "Working directories and Paths\n\ndirectory means folder\nimportant concepts when you interact with computers without clicking\n\n\nAllison Horst cartoon ā€œcode gets the blameā€" - }, - { - "objectID": "r4babs1/week-7/rstudio-projects.html#working-directories", - "href": "r4babs1/week-7/rstudio-projects.html#working-directories", - "title": "RStudio ProjectsWho, what, why?", - "section": "Working directories", - "text": "Working directories\n\nDefault folder a program will read and write to.\nYou will have some understanding\n\nWord demo" - }, - { - "objectID": "r4babs1/week-7/rstudio-projects.html#paths", - "href": "r4babs1/week-7/rstudio-projects.html#paths", - "title": "RStudio ProjectsWho, what, why?", - "section": "Paths", - "text": "Paths\n\nlocation of a file/folder\nappear in the address bar of explorer/finder and browsers\n\ndemo\n\n\nwhen you canā€™t click, you need the path\n\n\nchaffinch <- read_table(\"chaff.txt\")" - }, - { - "objectID": "r4babs1/week-7/rstudio-projects.html#absolute-path", - "href": "r4babs1/week-7/rstudio-projects.html#absolute-path", - "title": "RStudio ProjectsWho, what, why?", - "section": "Absolute path", - "text": "Absolute path\n\nchaffinch <- read_table(\"C:/Users/er13/OneDrive - University of York/Desktop/Desktop/undergrad-teaching-york/BIO00017C/BIO00017C-Data-Analysis-in-R-2020/data/chaff.txt\")\n\n\nOnly exists on 
my computer!" - }, - { - "objectID": "r4babs1/week-7/rstudio-projects.html#relative-paths", - "href": "r4babs1/week-7/rstudio-projects.html#relative-paths", - "title": "RStudio ProjectsWho, what, why?", - "section": "Relative paths", - "text": "Relative paths\n\nlocation of a file/folder relative to the working directory\nIf my working directory is BIO00017C-Data-Analysis-in-R-2020:\n\n\nchaffinch <- read_table(\"data/chaff.txt\")" - }, - { - "objectID": "r4babs1/week-7/rstudio-projects.html#rstudio-projects", - "href": "r4babs1/week-7/rstudio-projects.html#rstudio-projects", - "title": "RStudio ProjectsWho, what, why?", - "section": "RStudio Projects", - "text": "RStudio Projects\n\nSets the working directory to be the project folder\nCode is portable: you send someone the folder and everything just works!" - }, - { - "objectID": "r4babs1/week-7/rstudio-projects.html#demo", - "href": "r4babs1/week-7/rstudio-projects.html#demo", - "title": "RStudio ProjectsWho, what, why?", - "section": "demo", - "text": "demo" - }, - { - "objectID": "r4babs1/week-9/study_after_workshop.html", - "href": "r4babs1/week-9/study_after_workshop.html", - "title": "Independent Study to consolidate this week", - "section": "", - "text": "Set up\nIf you have just opened RStudio you will want to load the packages and import the data.\n\nlibrary(tidyverse)\nlibrary(readxl)\n\n\nšŸ’» Summarise and plot the pigeons dataframe appropriately.\n\n\nCode# import\npigeons <- read_table(\"data-raw/pigeon.txt\")\n\n# reformat to tidy\npigeons <- pivot_longer(data = pigeons, \n cols = everything(), \n names_to = \"population\", \n values_to = \"distance\")\n\n# sumnmarise\npigeons_summary <- pigeons %>%\n group_by(population) %>%\n summarise(mean = mean(distance),\n std = sd(distance),\n n = length(distance),\n se = std/sqrt(n))\n# plot\nggplot() +\n geom_point(data = pigeons, aes(x = population, y = distance),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"gray50\") +\n geom_errorbar(data = pigeons_summary, \n aes(x = population, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = pigeons_summary, \n aes(x = population, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_y_continuous(name = \"Interorbital distance (mm)\", \n limits = c(0, 14), \n expand = c(0, 0)) +\n scale_x_discrete(name = \"Population\") +\n theme_classic()\n\n\n\n\nšŸ’» The data in blood.csv are measurements of several blood parameters from fifty people with Crohnā€™s disease, a lifelong condition where parts of the digestive system become inflamed. Twenty-five of people are in the early stages of diagnosis and 25 have started treatment. The variables in the dataset are:\n\nsodium - Sodium concentration in umol/L, the average of 5 technical replicates\npotassium - Potassium concentration in umol/L, the average of 5 technical replicates\nB12 Vitamin - B12 in pmol/L, the average of 5 technical replicates\nwbc - White blood cell count in 10^9 /L, the average of 5 technical replicates\nrbc count - Red blood cell count in 10^12 /L, the average of 5 technical replicates\nplatlet count - platlet count in 10^9 /L, the average of 5 technical replicates\ninflammation marker - the presence or absence of a marker of inflammation, either 0 or 1\nstatus - whether the individual is before or after treatment.\n\nYour task is to summarise and plot these data in any suitable way. Create a complete RStudio Project for an analysis of these data. 
You will need to:\n\nMake a new project\nMake folders for data and for figures\nImport the data\nSummarise and plot variables of your choice. It doesnā€™t matter what you chose - the goal is the practice the project workflow and selecting appropriate plotting and summarising methods for particular data sets." - }, { "objectID": "r4babs1/week-9/workshop.html", "href": "r4babs1/week-9/workshop.html", @@ -574,88 +672,11 @@ "text": "Ulna and height\nThe datasets we have used up to this point, have had a continuous variable and a categorical variable where it makes sense to summarise the response for each of the different groups in the categorical variable and plot the response on the y-axis. We will now summarise a dataset with two continuous variables. The data in height.txt are the ulna length (cm) and height (m) of 30 people. In this case, it is more appropriate to summarise both of thee variables and to plot them as a scatter plot.\nWe will use summarise() again but we do not need the group_by() function this time. We will also need to use each of the summary functions, such as mean(), twice, once for each variable.\nImport\n Save height.txt to your data-raw folder\n Read the data into a dataframe called ulna_heights.\nSummarise\n Create a data frame called ulna_heights_summary that contains the sample size and means, standard deviations and standard errors for both variables.\n\nulna_heights_summary <- ulna_heights %>%\n summarise(n = length(ulna),\n mean_ulna = mean(ulna),\n std_ulna = sd(ulna),\n se_ulna = std_ulna/sqrt(n),\n mean_height = mean(height),\n std_height = sd(height),\n se_height = std_height/sqrt(n))\n\nYou should get the following numbers:\n\n\n\n\nn\nmean_ulna\nstd_ulna\nse_ulna\nmean_height\nstd_height\nse_height\n\n\n30\n24.72\n4.137332\n0.75537\n1.494\n0.2404823\n0.0439059\n\n\n\n\nVisualise\nTo plot make a scatter plot we need to use geom_point() again but without any scatter. In this case, it does not really matter which variable is on the x-axis and which is on the y-axis.\n Make a simple scatter plot\n\nggplot(data = ulna_heights, aes(x = ulna, y = height)) +\n geom_point()\n\n\n\n\nIf you have time, you may want to format the figure more appropriately.\n\n\nYouā€™re finished!" }, { - "objectID": "r4babs1/week-6/study_after_workshop.html", - "href": "r4babs1/week-6/study_after_workshop.html", - "title": "Independent Study to consolidate this week", - "section": "", - "text": "There is no additional study this week but you may want to look ahead to next week." - }, - { - "objectID": "r4babs1/week-6/workshop.html", - "href": "r4babs1/week-6/workshop.html", - "title": "Workshop", - "section": "", - "text": "There is no formal workshop this week but you might want to install R and RStudio on your own machine. This is optional because University computers already have R and RStudio installed.\nInstall R and RStudio.\nNote you need a computer - not a tablet." 
- }, - { - "objectID": "r4babs1/week-8/study_after_workshop.html", - "href": "r4babs1/week-8/study_after_workshop.html", + "objectID": "r4babs1/week-9/study_after_workshop.html", + "href": "r4babs1/week-9/study_after_workshop.html", "title": "Independent Study to consolidate this week", "section": "", - "text": "Set up\nIf you have just opened RStudio you will want to load the packages and import the data.\n\nlibrary(tidyverse)\nlibrary(readxl)\n\n\nfly_bristles_means <- read_excel(\"data-raw/bristles-mean.xlsx\")\ncats <- read_csv(\"data-raw/cat-coats.csv\")\n\nExercises\n\nšŸ’» Summarise the fly_bristles_means dataframe by calculating the mean, median, sample size, standard deviation and standard error of the mean_count variable.\n\n\nCodefly_bristles_means_summary <- fly_bristles_means |> \n summarise(mean = mean(mean_count),\n median = median(mean_count),\n n = length(mean_count),\n standard_dev = sd(mean_count),\n standard_error = standard_dev / sqrt(n))\n\n\n\nšŸ’» Create an appropriate plot to show the distribution of mean_count in fly_bristles_means\n\n\n\nCodeggplot(fly_bristles_means, aes(x = mean_count)) +\n geom_histogram(bins = 10)\n\n\n\nšŸ’» Can you format the plot 2. by removing the grey background, giving the bars a black outline and the fill colour of your choice and improving the axis format and labelling? You may want to refer to last weekā€™s workshop.\n\n\nCodeggplot(fly_bristles_means, aes(x = mean_count)) +\n geom_histogram(bins = 10, \n colour = \"black\",\n fill = \"skyblue\") +\n scale_x_continuous(name = \"Number of bristles\",\n expand = c(0, 0)) +\n scale_y_continuous(name = \"Frequency\",\n expand = c(0, 0),\n limits = c(0, 35)) +\n theme_classic()\n\n\n\nšŸ’» Amend this code to change the order of the bars by the average mass of each coat colour? Changing the order of bars was covered last week. You may also want to practice formatting the graph nicely.\n\n\nggplot(cats, aes(x = coat, y = mass)) +\n geom_boxplot()\n\n\n\n\n\nCodeggplot(cats, \n aes(x = reorder(coat, mass), y = mass)) +\n geom_boxplot(fill = \"darkcyan\") +\n scale_x_discrete(name = \"Coat colour\") +\n scale_y_continuous(name = \"Mass (kg)\", \n expand = c(0, 0),\n limits = c(0, 8)) +\n theme_classic()\n\n\n\nšŸ“– Read Understanding the pipe |>" - }, - { - "objectID": "r4babs1/week-8/workshop.html", - "href": "r4babs1/week-8/workshop.html", - "title": "Workshop", - "section": "", - "text": "Artwork by Horst (2023): Continuous and Discrete\n\n\nIn this workshop you will learn how to import data from files and create summaries and plots for it. You will also get more practice with working directories, formatting figures and the pipe.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier workshops\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. 
It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." - }, - { - "objectID": "r4babs1/week-8/workshop.html#session-overview", - "href": "r4babs1/week-8/workshop.html#session-overview", - "title": "Workshop", - "section": "", - "text": "In this workshop you will learn how to import data from files and create summaries and plots for it. You will also get more practice with working directories, formatting figures and the pipe." - }, - { - "objectID": "r4babs1/week-8/workshop.html#philosophy", - "href": "r4babs1/week-8/workshop.html#philosophy", - "title": "Workshop", - "section": "", - "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier workshops\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." - }, - { - "objectID": "r4babs1/week-8/workshop.html#importing-data-from-files", - "href": "r4babs1/week-8/workshop.html#importing-data-from-files", - "title": "Workshop", - "section": "Importing data from files", - "text": "Importing data from files\nLast week we created data by typing the values in to R. This is not practical when you have added a lot of data to a spreadsheet, or you are using data file that has been supplied to you by a person or a machine. Far more commonly, we import data from a file into R. This requires you know two pieces of information.\n\n\nWhat format the data are in\nThe format of the data determines what function you will use to import it and the file extension often indicates format.\n\n\n.txt a plain text file1, where the columns are often separated by a space but might also be separated by a tab, a backslash or forward slash, or some other character\n\n.csv a plain text file where the columns are separated by commas\n\n.xlsx an Excel file\n\n\n\nWhere the file is relative to your working directory\nR can only read in a file if you say where it is, i.e., you give its relative path. 
If you follow the advice in this course, your data will be in a folder, data-raw which is inside your Project folder (and working directory).\n\n\nWe will save the four files for this workshop to our Project folder (week-8) and read them in. We will then create a new folder inside our Project folder called data-raw and move the data files to there before modifying the file paths as required. This is demonstrate how the relative path to the file will change after we move it.\n Save these four files in to your week-8 folder\n\nThe coat colour and mass of 62 cats: cat-coats.csv\n\nThe relative size of over 5000 cells measure by forward scatter (FSC) in flow cytometry: cell-size.txt\n\nThe number of sternopleural bristles on 96 female Drosophila: bristles.txt\n\nThe number of sternopleural bristles on 96 female Drosophila (with technical replicates): bristles-mean.xlsx\n\n\nThe first three files can be read in with core tidyverse Wickham et al. (2019) functions and the last can be read in with the readxl Wickham and Bryan (2023) package.\n Load the two packages\n\nlibrary(tidyverse)\nlibrary(readxl)\n\nWe will first read in cat-coats.csv. A .csv. extension suggests this is plain text file with comma separated columns. However, before we attempt to read it it, when should take a look at it. We can do this from RStudio\n Go to the Files pane (bottom right), click on the cat-coats.csv file and choose View File2\n\n\nRStudio Files Pane\n\nAny plain text file will open in the top left pane (Excel files will launch Excel).\n Is the file csv?\n\n\n What kind of variables does the file contain?\n\n\n Read in the csv file with:\n\ncats <- read_csv(\"cat-coats.csv\")\n\nThe data from the file a read into a dataframe called cats and you will be able to see it in the Environment.\n Click on each of the remaining files and choose View File.\n In each case, say what the format is and what types of variables it contains.\n\n\n\n\n\n\n\n\nWe use the read_table()3 command to read in plain text files of single columns or where the columns are separated by spacesā€¦\n ā€¦so in cell-size.txt can be read into a dataframe called cells like this:\n\ncells <- read_table(\"cell-size.txt\")\n\n Now you try reading bristles.txt in to a dataframe called fly_bristles\nThe readxl package we loaded earlier has two useful functions for working with Excel files: excel_sheets(\"filename.xlsx\") will list the sheets in an Excel workbook; read_excel(\"filename.xlsx\") will read in to top sheet or a specified sheet with a small modification read_excel(\"filename.xlsx\", sheet = \"Sheet1\").\n List the the names of the sheets and read in the sheet with the data like this:\n\nexcel_sheets(\"bristles-mean.xlsx\")\nfly_bristles_means <- read_excel(\"bristles-mean.xlsx\", sheet = \"means\")\n\nWell done! You can now read read in from files in your working directory.\nTo help you understand relative file paths, we will now move the data files.\n First remove the dataframes you just created to make it easier to see whether you can successfully read in the files from a different place:\n\nrm(cat_coats, fly_bristles, cells, flies_bristles_means)\n\n Now make a new folder called data-raw. You can do this on the Files Pane by clicking New Folder and typing into the box that appears.\n Check the boxes next to the file names and choose More | Moveā€¦ and select the data-raw folder.\n The files will move. To import data from files in the data-raw folder, you need to give the relative path to the file from the working directory. 
The working directory is the Project folder, week-8 so the relative path is data-raw/cat-coats.csv\n Import the cat-coats.csv data like this:\n\ncats <- read_csv(\"data-raw/cat-coats.csv\")\n\n Now you do the other files.\nFrom this point forward in the course, we will always create a data-raw folder each time we make a new Project.\nThese are the most common forms of data file you will encounter at first. However, data can certainly come to you in other formats particularly when they have come from particular software. Usually, there is an R package specially for that format.\nIn the rest of the workshop we will take each dataset in turn and create summaries and plots appropriate for the data types. Data is summarised using the group_by() and summarise() functions" - }, - { - "objectID": "r4babs1/week-8/workshop.html#summarising-discrete-data-cat-coat", - "href": "r4babs1/week-8/workshop.html#summarising-discrete-data-cat-coat", - "title": "Workshop", - "section": "Summarising discrete data: Cat coat", - "text": "Summarising discrete data: Cat coat\nThe most appropriate way to summarise nominal data like the colour of cat coats is to tabulate the number of cats with each colour.\n Summarise the cats dataframe by counting the number of cats in each category\n\ncats |> \n group_by(coat) |> \n count()\n\n# A tibble: 6 Ɨ 2\n# Groups: coat [6]\n coat n\n <chr> <int>\n1 black 23\n2 calico 1\n3 ginger 10\n4 tabby 8\n5 tortoiseshell 5\n6 white 15\n\n\n|> is the pipe and can be produced with Ctrl+Shift+M\nThis sort of data might be represented with a barchart. You have two options for producing that barchart:\n\nplot the summary table using geom_col()\nplot the raw data using geom_bar()\n\nWe did the first of these last week. The geom_col() function uses the numbers in a second column to determine how high the bars are. However, the geom_bar() function will do the tabulating for you.\n Plot the coat data using geom_bar:\n\nggplot(cats, aes(x = coat)) +\n geom_bar()\n\n\n\n\nThe gaps that R put automatically between the bars reflects that the coat colours are discrete categories." - }, - { - "objectID": "r4babs1/week-8/workshop.html#summarising-counts-bristles", - "href": "r4babs1/week-8/workshop.html#summarising-counts-bristles", - "title": "Workshop", - "section": "Summarising Counts: Bristles", - "text": "Summarising Counts: Bristles\nCounts are discrete and can be thought of a categories with an order (ordinal).\n Summarise the fly_bristles dataframe by counting the number of flies in each category of bristle number\nSince counts are numbers, we might also want to calculate some summary statistics such as the median and interquartile range.\n Summarise the fly_bristles dataframe by calculate the median and interquartile range\n\nfly_bristles |> \n summarise(median(number),\n IQR(number))\n\n# A tibble: 1 Ɨ 2\n `median(number)` `IQR(number)`\n <dbl> <dbl>\n1 6 4\n\n\nAs the interquartile is 4 and the median is 6 then 25% flies have 4 bristles or fewer and 25% have 8 or more.\nThe distribution of counts4 is not symmetrical for lower counts so the mean is not usually a good way to summarise count data.\n If you want to save the table you created and give the columns better names you can make two adjustments:\n\nfly_bristles_summary <- fly_bristles |> \n summarise(med = median(number),\n interquartile = IQR(number))\n\n Plot the bristles data using geom_bar:\nIf counts have a a high mean and big range, like number of hairs on a personā€™s head, then you can often treat them as continuous. 
This means you can use statistics like the mean and standard deviation to summarise them, histograms to plot them and use some standard statistical tests on them." - }, - { - "objectID": "r4babs1/week-8/workshop.html#summarising-continuous-data", - "href": "r4babs1/week-8/workshop.html#summarising-continuous-data", - "title": "Workshop", - "section": "Summarising continuous data", - "text": "Summarising continuous data\nCat mass\nThe variable mass in the cats dataframe is continuous. Very many continuous variables have a normal distribution. e normal distribution is also known as the bell-shaped curve. If we had the mass of all the cats in the world, we would find many cats were near the mean and fewer would be away from the mean, either much lighter or much heavier. In fact 68% would be within one standard deviation of the mean and about 96% would be within two standard deviations.\n\n\n\n\n\n We can find the mean mass with:\n\ncats |> \n summarise(mean = mean(mass))\n\n# A tibble: 1 Ɨ 1\n mean\n <dbl>\n1 4.51\n\n\nWe can add any sort of summary by placing it inside the the summarise parentheses. Each one is separated by a comma. We did this to find the median and the interquatrile range for fly bristles.\n For example, another way to calculate the number of values is to use the length() function:\n\ncats |> \n summarise(mean = mean(mass),\n n = length(mass))\n\n# A tibble: 1 Ɨ 2\n mean n\n <dbl> <int>\n1 4.51 62\n\n\n Adapt the code to calculate the mean, the sample size and the standard deviation (sd())\nA single continuous variable can be plotted using a histogram to show the shape of the distribution.\n Plots a histogram of cats mass:\n\nggplot(cats, aes(x = mass)) +\n geom_histogram(bins = 15, colour = \"black\") \n\n\n\n\nNotice that there are no gaps between the bars which reflects that mass is continuous. bins determines how many groups the variable is divided up into (i.e., the number of bars) and colour sets the colour for the outline of the bars. A sample of 62 is a relatively small number of values for plotting a distribution and the number of bins used determines how smooth or normally distributed the values look.\n Experiment with the number of bins. Does the number of bins affect how you view the distribution.\nNext week we will practice summarise and plotting data files with several variables but just to give you a taste, we will find summary statistics about mass for each of the coat types. The group_by() function is used before the summarise() to do calculations for each of the coats:\n\ncats |> \n group_by(coat) |> \n summarise(mean = mean(mass),\n standard_dev = sd(mass))\n\n# A tibble: 6 Ɨ 3\n coat mean standard_dev\n <chr> <dbl> <dbl>\n1 black 4.63 1.33 \n2 calico 2.19 NA \n3 ginger 4.46 1.12 \n4 tabby 4.86 0.444\n5 tortoiseshell 4.50 0.929\n6 white 4.34 1.34 \n\n\nYou can read this as:\n\ntake cats and then group by coat and then summarise by finding the mean of mass and the standard deviation of mass\n\n Why do we get an NA for the standard deviation of the calico cats?\n\n\n\nCells\n Summarise the cells dataframe by calculating the mean, median, sample size and standard deviation of FSC.\n Add a column for the standard error which is given by \\(\\frac{s.d.}{\\sqrt{n}}\\)\nMeans of counts\nMany things are quite difficult to measure or count and in these cases we often do technical replicates. A technical replicate allows us the measure the exact same thing to check how variable the measurement process is. 
For example, Drosophila are small and counting their sternopleural bristles is tricky. In addition, where a bristle is short (young) or broken scientists might vary in whether they count it. Or people or machines might vary in measuring the concentration of the same solution.\nWhen we do technical replicates we calculate their mean and use that as the measure. This is what is in our fly_bristles_means dataframe - the bristles of each of the 96 flies was counted by 5 people and the data are those means. These has an impact on how we plot and summarise the dataset because the distribution of mean counts is continuous! We can use means, standard deviations and histograms. This will be an exercise in Consolidate." - }, - { - "objectID": "r4babs1/week-8/workshop.html#look-after-future-you", - "href": "r4babs1/week-8/workshop.html#look-after-future-you", - "title": "Workshop", - "section": "Look after future you!", - "text": "Look after future you!\nFuture you is going to summarise and plot data from the ā€œRiver practicalsā€. You can make this much easier by documenting what you have done now. At the moment all of your code from this workshop is in a single file, probably called analysis.R. I recommend making a new script for each of nominal, continuous and count data and copying the code which imports, summarises and plots it. This will make it easier for future you to find the code you need. Here is an example: nominal_data.R. You may wish to comment your version much more.\nYouā€™re finished!" - }, - { - "objectID": "r4babs1/week-8/workshop.html#footnotes", - "href": "r4babs1/week-8/workshop.html#footnotes", - "title": "Workshop", - "section": "Footnotes", - "text": "Footnotes\n\nPlain text files can be opened in notepad or other similar editor and still be readable.ā†©ļøŽ\nDo not be tempted to import data this way. Unless you are careful, your data import will not be scripted or will not be scripted correctly.ā†©ļøŽ\nnote read_csv() and read_table() are the same functions with some different settings.ā†©ļøŽ\nCount data are usually ā€œPoissonā€ distributed.ā†©ļøŽ" + "text": "Set up\nIf you have just opened RStudio you will want to load the packages and import the data.\n\nlibrary(tidyverse)\nlibrary(readxl)\n\n\nšŸ’» Summarise and plot the pigeons dataframe appropriately.\n\n\nCode# import\npigeons <- read_table(\"data-raw/pigeon.txt\")\n\n# reformat to tidy\npigeons <- pivot_longer(data = pigeons, \n cols = everything(), \n names_to = \"population\", \n values_to = \"distance\")\n\n# sumnmarise\npigeons_summary <- pigeons %>%\n group_by(population) %>%\n summarise(mean = mean(distance),\n std = sd(distance),\n n = length(distance),\n se = std/sqrt(n))\n# plot\nggplot() +\n geom_point(data = pigeons, aes(x = population, y = distance),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"gray50\") +\n geom_errorbar(data = pigeons_summary, \n aes(x = population, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = pigeons_summary, \n aes(x = population, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_y_continuous(name = \"Interorbital distance (mm)\", \n limits = c(0, 14), \n expand = c(0, 0)) +\n scale_x_discrete(name = \"Population\") +\n theme_classic()\n\n\n\n\nšŸ’» The data in blood.csv are measurements of several blood parameters from fifty people with Crohnā€™s disease, a lifelong condition where parts of the digestive system become inflamed. Twenty-five of people are in the early stages of diagnosis and 25 have started treatment. 
The variables in the dataset are:\n\nsodium - Sodium concentration in umol/L, the average of 5 technical replicates\npotassium - Potassium concentration in umol/L, the average of 5 technical replicates\nB12 Vitamin - B12 in pmol/L, the average of 5 technical replicates\nwbc - White blood cell count in 10^9 /L, the average of 5 technical replicates\nrbc count - Red blood cell count in 10^12 /L, the average of 5 technical replicates\nplatlet count - platlet count in 10^9 /L, the average of 5 technical replicates\ninflammation marker - the presence or absence of a marker of inflammation, either 0 or 1\nstatus - whether the individual is before or after treatment.\n\nYour task is to summarise and plot these data in any suitable way. Create a complete RStudio Project for an analysis of these data. You will need to:\n\nMake a new project\nMake folders for data and for figures\nImport the data\nSummarise and plot variables of your choice. It doesnā€™t matter what you chose - the goal is the practice the project workflow and selecting appropriate plotting and summarising methods for particular data sets." }, { "objectID": "r4babs1/r4babs1.html", @@ -699,13 +720,6 @@ "section": "Summarising data with several variables", "text": "Summarising data with several variables\nThis week you will start plotting data sets with more than one variable. This means you need to be able determine which variable is the response and which is the explanatory. You will find out what is meant by ā€œtidyā€ data and how to perform a simple data tidying task. Finally you will discover how to save your figures and place them in documents." }, - { - "objectID": "r4babs1/week-8/study_before_workshop.html", - "href": "r4babs1/week-8/study_before_workshop.html", - "title": "Independent Study to prepare for workshop", - "section": "", - "text": "šŸ“– Read Ideas about data" - }, { "objectID": "r4babs1/week-8/overview.html", "href": "r4babs1/week-8/overview.html", @@ -714,39 +728,25 @@ "text": "The type of values our data can take is important in how we analyse and visualise it. This week you will learn the difference between continuous and discrete values and how we summarise and visualise them. You will also learn about the ā€œnormal distributionā€ which is the most important continuous distribution.\n\n\n\nDiscrete variable\n\n\n\nLearning objectives\nThe successful student will be able to:\n\ndistinguish between continuous, discrete, nominal and ordinal variable\nread in data in to RStudio from a plain text file and Excel files\nsummarise and plot variables appropriately for the data type\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read: Ideas about data\n\nWorkshop\n\nšŸ’» Importing data\nšŸ’» Summarising discrete data\nšŸ’» Summarising count data\nšŸ’» Summarising continuous data\n\nConsolidate\n\nšŸ’» Summarise some data\nšŸ’» Plot some data\nšŸ’» Format a plot (1)\nšŸ’» Format a plot (2)\nšŸ“– Read Understanding the pipe |>" }, { - "objectID": "r4babs1/week-6/study_before_workshop.html", - "href": "r4babs1/week-6/study_before_workshop.html", + "objectID": "r4babs1/week-8/study_before_workshop.html", + "href": "r4babs1/week-8/study_before_workshop.html", "title": "Independent Study to prepare for workshop", "section": "", - "text": "Join the video conference Intro: Data Handling - BIO00027C-A (Lecture) on your timetable\nRead What they forgot to teach you about computers in Computational Analysis for Bioscientists\nRead What are R and Rstudio?. 
You only need to read this section, you do not need to read the rest of the chapter (yet!)" }, { "objectID": "r4babs1/week-6/overview.html", "href": "r4babs1/week-6/overview.html", "title": "Overview", "section": "", "text": "This week you will carry out some independent study to ensure you have some understanding of computer file systems. 
We will introduce you to the concepts of paths and working directories.\n\n\n\nArtwork by Horst (2023): ā€œcode gets the blameā€\n\n\n\nLearning objectives\nThe parentheses after each learning objective indicate where the content covers that objective.\nThe successful student will be able to:\n\nexplain what an operating system is\nexplain the organisation of files and directories in a file systems\nexplain what a file is and give some common files types\nexplain what is meant by a plain text file\nexplain the relationship between the file extensions, the file format and associations with programs\nuse a file manager\nexplain root, home and working directories\nexplain absolute and relative file paths\nknow what R and RStudio are\nknow how to organise their work\n\n\n\nInstructions\n\nPrepare\n\nJoin the video conference Intro: Data Handling - BIO00027C-A (Lecture) on your timetable\nRead What they forgot to teach you about computers\nRead What are R and Rstudio?\n\nWorkshop\n\nOptional: Install R and RStudio\n\nConsolidate\n\n\n\n\n\n\nReferences\n\nHorst, Allison. 2023. ā€œData Science Illustrations.ā€ https://allisonhorst.com/allison-horst." }, { - "objectID": "r4babs1/week-7/study_before_workshop.html", - "href": "r4babs1/week-7/study_before_workshop.html", + "objectID": "r4babs1/week-6/study_before_workshop.html", + "href": "r4babs1/week-6/study_before_workshop.html", "title": "Independent Study to prepare for workshop", "section": "", - "text": "Either\n\nšŸ“– Read First Steps in RStudio in\n\nOR\n\nšŸ“¹ Watch" + "text": "Join the video conference Intro: Data Handling - BIO00027C-A (Lecture) on your timetable\nRead What they forgot to teach you about computers in Computational Analysis for Bioscientists\nRead What are R and Rstudio?. You only need to read this section, you do not need to the read the rest of the chapter (yet!)" }, { "objectID": "r4babs1/week-7/workshop.html", @@ -832,27 +832,6 @@ "section": "", "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’» In a maternity hospital, the total numbers of births induced on each day of the week over a six week period were recorded (see table below). Create a plot of these data with the days of week in order.\n\n\n\n\nNumber of inductions for each day of the week over six weeks.\n\nDay\nNo. 
inductions\n\n\n\nMonday\n43\n\n\nTuesday\n36\n\n\nWednesday\n35\n\n\nThursday\n38\n\n\nFriday\n48\n\n\nSaturday\n26\n\n\nSunday\n24\n\n\n\n\n\n\nCode# create a dataframe for the data\nday <- c(\"Monday\", \n \"Tuesday\", \n \"Wednesday\",\n \"Thursday\",\n \"Friday\",\n \"Saturday\",\n \"Sunday\")\nfreq <- c(43, 36, 35, 38, 48, 26, 24) \ninductions <- data.frame(day, freq)\n\n# make the order of the days correct rather than alphabetical\ninductions <- inductions |> \n mutate(day = fct_relevel(day, c(\"Monday\",\n \"Tuesday\",\n \"Wednesday\",\n \"Thursday\",\n \"Friday\",\n \"Saturday\",\n \"Sunday\")))\n\n# plot the data as a barplot with the bars in\nggplot(data = inductions, \n aes(x = day, y = freq)) +\n geom_col(colour = \"black\",\n fill = \"lightseagreen\") +\n scale_x_discrete(expand = c(0, 0),\n name = \"Day of the week\") + \n scale_y_continuous(expand = c(0, 0),\n name = \"Number of inductions\",\n limits = c(0, 55)) +\n theme_classic()\n\n\n\nšŸ“– Read Workflow in RStudio" }, - { - "objectID": "r4babs4/r4babs4.html", - "href": "r4babs4/r4babs4.html", - "title": "Data Analysis in R for BABS 4", - "section": "", - "text": "This is the last of the four BABS modules.\ncore strand specific: immunology\n\n\nThe BABS4 Module Learning outcomes that relate to the Data Analysis in R content are:" - }, - { - "objectID": "r4babs4/r4babs4.html#module-learning-objectives", - "href": "r4babs4/r4babs4.html#module-learning-objectives", - "title": "Data Analysis in R for BABS 4", - "section": "", - "text": "The BABS4 Module Learning outcomes that relate to the Data Analysis in R content are:" - }, - { - "objectID": "r4babs4/r4babs4.html#all-strands", - "href": "r4babs4/r4babs4.html#all-strands", - "title": "Data Analysis in R for BABS 4", - "section": "All Strands", - "text": "All Strands\nmany var and obs, examples, PCA, log fc transformation, normalisation, QC gating, missing values, excluding proteins idā€™d from fewer thaan two peptides, FDR, data structures" - }, { "objectID": "about.html", "href": "about.html", @@ -861,256 +840,305 @@ "text": "About this site\nAbout me" }, { - "objectID": "pgt52m/week-3/workshop.html", - "href": "pgt52m/week-3/workshop.html", - "title": "Workshop", + "objectID": "r4babs1/week-7/study_before_workshop.html", + "href": "r4babs1/week-7/study_before_workshop.html", + "title": "Independent Study to prepare for workshop", "section": "", - "text": "Artwork by Horst (2023): Continuous and Discrete\n\n\nIn this workshop you will learn how to import data from files and create summaries and plots for it. You will also get more practice with working directories, formatting figures and the pipe.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier workshops\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. 
It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + "text": "Either\n\nšŸ“– Read First Steps in RStudio in\n\nOR\n\nšŸ“¹ Watch" }, { - "objectID": "pgt52m/week-3/workshop.html#session-overview", - "href": "pgt52m/week-3/workshop.html#session-overview", - "title": "Workshop", + "objectID": "r4babs1/week-7/overview.html", + "href": "r4babs1/week-7/overview.html", + "title": "Overview", "section": "", - "text": "In this workshop you will learn how to import data from files and create summaries and plots for it. You will also get more practice with working directories, formatting figures and the pipe." + "text": "This week you will start writing R code in RStudio and will create your first graph! You will learn about data types such as ā€œnumericsā€ and ā€œcharactersā€ and some of the different types of objects in R such as ā€œvectorsā€ and ā€œdataframesā€. These are the building blocks for the rest of your R journey. You will also learn a workflow and about the layout of RStudio and using RStudio Projects.\n\n\n\nArtwork by Horst (2023): ā€œbless this workflowā€\n\n\n\nLearning objectives\nThe successful student will be able to:\n\nuse the R command line as a calculator and to assign variables\ncreate and use the basic data types in R\nfind their way around the RStudio windows\nuse an RStudio Project to organise work\nuse a script to run R commands\ncreate and customise a barplot\nsearch and understand manual pages\n\n\n\nInstructions\n\nPrepare\n\nFirst Steps in RStudio: Either šŸ“– Read the book OR šŸ“¹ Watch two videos\n\nWorkshop\ni.šŸ’» šŸˆ Coat colour of cats. Type in some data, perform calculations on, and plot it.\nConsolidate\n\nšŸ’» Create a plot\nšŸ“– Read Workflow in RStudio\n\n\n\n\n\n\n\nReferences\n\nHorst, Allison. 2023. ā€œData Science Illustrations.ā€ https://allisonhorst.com/allison-horst." }, { - "objectID": "pgt52m/week-3/workshop.html#philosophy", - "href": "pgt52m/week-3/workshop.html#philosophy", - "title": "Workshop", - "section": "", - "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier workshops\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. 
It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + "objectID": "r4babs1/week-7/rstudio-projects.html#outline", + "href": "r4babs1/week-7/rstudio-projects.html#outline", + "title": "RStudio ProjectsWho, what, why?", + "section": "Outline", + "text": "Outline\n\nWho\nA One-line what\nThe high-level why\n\n\nMight be enough!\n\n\n\nMore detailed why\nMore detailed what" }, { - "objectID": "pgt52m/week-3/workshop.html#importing-data-from-files", - "href": "pgt52m/week-3/workshop.html#importing-data-from-files", - "title": "Workshop", - "section": "Importing data from files", - "text": "Importing data from files\nLast week we created data by typing the values in to R. This is not practical when you have added a lot of data to a spreadsheet, or you are using data file that has been supplied to you by a person or a machine. Far more commonly, we import data from a file into R. This requires you know two pieces of information.\n\n\nWhat format the data are in\nThe format of the data determines what function you will use to import it and the file extension often indicates format.\n\n\n.txt a plain text file1, where the columns are often separated by a space but might also be separated by a tab, a backslash or forward slash, or some other character\n\n.csv a plain text file where the columns are separated by commas\n\n.xlsx an Excel file\n\n\n\nWhere the file is relative to your working directory\nR can only read in a file if you say where it is, i.e., you give its relative path. If you follow the advice in this course, your data will be in a folder, data-raw which is inside your Project folder (and working directory).\n\n\nWe will save the four files for this workshop to our Project folder (week-8) and read them in. We will then create a new folder inside our Project folder called data-raw and move the data files to there before modifying the file paths as required. This is demonstrate how the relative path to the file will change after we move it.\n Save these four files in to your week-8 folder\n\nThe coat colour and mass of 62 cats: cat-coats.csv\n\nThe relative size of over 5000 cells measure by forward scatter (FSC) in flow cytometry: cell-size.txt\n\nThe number of sternopleural bristles on 96 female Drosophila: bristles.txt\n\nThe number of sternopleural bristles on 96 female Drosophila (with technical replicates): bristles-mean.xlsx\n\n\nThe first three files can be read in with core tidyverse Wickham et al. (2019) functions and the last can be read in with the readxl Wickham and Bryan (2023) package.\n Load the two packages\n\nlibrary(tidyverse)\nlibrary(readxl)\n\nWe will first read in cat-coats.csv. A .csv. extension suggests this is plain text file with comma separated columns. However, before we attempt to read it it, when should take a look at it. 
We can do this from RStudio\n Go to the Files pane (bottom right), click on the cat-coats.csv file and choose View File2\n\n\nRStudio Files Pane\n\nAny plain text file will open in the top left pane (Excel files will launch Excel).\n Is the file csv?\n\n\n What kind of variables does the file contain?\n\n\n Read in the csv file with:\n\ncats <- read_csv(\"cat-coats.csv\")\n\nThe data from the file a read into a dataframe called cats and you will be able to see it in the Environment.\n Click on each of the remaining files and choose View File.\n In each case, say what the format is and what types of variables it contains.\n\n\n\n\n\n\n\n\nWe use the read_table()3 command to read in plain text files of single columns or where the columns are separated by spacesā€¦\n ā€¦so in cell-size.txt can be read into a dataframe called cells like this:\n\ncells <- read_table(\"cell-size.txt\")\n\n Now you try reading bristles.txt in to a dataframe called fly_bristles\nThe readxl package we loaded earlier has two useful functions for working with Excel files: excel_sheets(\"filename.xlsx\") will list the sheets in an Excel workbook; read_excel(\"filename.xlsx\") will read in to top sheet or a specified sheet with a small modification read_excel(\"filename.xlsx\", sheet = \"Sheet1\").\n List the the names of the sheets and read in the sheet with the data like this:\n\nexcel_sheets(\"bristles-mean.xlsx\")\nfly_bristles_means <- read_excel(\"bristles-mean.xlsx\", sheet = \"means\")\n\nWell done! You can now read read in from files in your working directory.\nTo help you understand relative file paths, we will now move the data files.\n First remove the dataframes you just created to make it easier to see whether you can successfully read in the files from a different place:\n\nrm(cat_coats, fly_bristles, cells, flies_bristles_means)\n\n Now make a new folder called data-raw. You can do this on the Files Pane by clicking New Folder and typing into the box that appears.\n Check the boxes next to the file names and choose More | Moveā€¦ and select the data-raw folder.\n The files will move. To import data from files in the data-raw folder, you need to give the relative path to the file from the working directory. The working directory is the Project folder, week-8 so the relative path is data-raw/cat-coats.csv\n Import the cat-coats.csv data like this:\n\ncats <- read_csv(\"data-raw/cat-coats.csv\")\n\n Now you do the other files.\nFrom this point forward in the course, we will always create a data-raw folder each time we make a new Project.\nThese are the most common forms of data file you will encounter at first. However, data can certainly come to you in other formats particularly when they have come from particular software. Usually, there is an R package specially for that format.\nIn the rest of the workshop we will take each dataset in turn and create summaries and plots appropriate for the data types. 
Data is summarised using the group_by() and summarise() functions" + "objectID": "r4babs1/week-7/rstudio-projects.html#audience", + "href": "r4babs1/week-7/rstudio-projects.html#audience", + "title": "RStudio ProjectsWho, what, why?", + "section": "Audience", + "text": "Audience\n\nYou teach using R directly\n\nBecoming a Bioscientist 1 - 4\nIM group project\nPGT\n\nYou teach or supervise students using R\n\nfield courses, practical work\nprojects\n\nYou use R" }, { - "objectID": "pgt52m/week-3/workshop.html#summarising-discrete-data-cat-coat", - "href": "pgt52m/week-3/workshop.html#summarising-discrete-data-cat-coat", - "title": "Workshop", - "section": "Summarising discrete data: Cat coat", - "text": "Summarising discrete data: Cat coat\nThe most appropriate way to summarise nominal data like the colour of cat coats is to tabulate the number of cats with each colour.\n Summarise the cats dataframe by counting the number of cats in each category\n\ncats |> \n group_by(coat) |> \n count()\n\n# A tibble: 6 Ɨ 2\n# Groups: coat [6]\n coat n\n <chr> <int>\n1 black 23\n2 calico 1\n3 ginger 10\n4 tabby 8\n5 tortoiseshell 5\n6 white 15\n\n\n|> is the pipe and can be produced with Ctrl+Shift+M\nThis sort of data might be represented with a barchart. You have two options for producing that barchart:\n\nplot the summary table using geom_col()\nplot the raw data using geom_bar()\n\nWe did the first of these last week. The geom_col() function uses the numbers in a second column to determine how high the bars are. However, the geom_bar() function will do the tabulating for you.\n Plot the coat data using geom_bar:\n\nggplot(cats, aes(x = coat)) +\n geom_bar()\n\n\n\n\nThe gaps that R put automatically between the bars reflects that the coat colours are discrete categories." + "objectID": "r4babs1/week-7/rstudio-projects.html#an-rstudio-project", + "href": "r4babs1/week-7/rstudio-projects.html#an-rstudio-project", + "title": "RStudio ProjectsWho, what, why?", + "section": "šŸ“ An RStudio Project", + "text": "šŸ“ An RStudio Project\n\nis a folder!\n\n\n\nhave been part of the stage 1 and IM stage 3 for > 5 years\n\n\n\nStage 1\n\nUse an RStudio project containing the script you used to analyse and plot the data for your report, your figures and and the data itself. The Project should be structured and the script should be well-commented, well-organised and follow good practice in the use of spacing, indentation, and variable naming. It should include all the code required to reproduce data import and formatting as well as the summary information, analyses, and figures in your report." 
}, { - "objectID": "pgt52m/week-3/workshop.html#summarising-counts-bristles", - "href": "pgt52m/week-3/workshop.html#summarising-counts-bristles", - "title": "Workshop", - "section": "Summarising Counts: Bristles", - "text": "Summarising Counts: Bristles\nCounts are discrete and can be thought of a categories with an order (ordinal).\n Summarise the fly_bristles dataframe by counting the number of flies in each category of bristle number\nSince counts are numbers, we might also want to calculate some summary statistics such as the median and interquartile range.\n Summarise the fly_bristles dataframe by calculate the median and interquartile range\n\nfly_bristles |> \n summarise(median(number),\n IQR(number))\n\n# A tibble: 1 Ɨ 2\n `median(number)` `IQR(number)`\n <dbl> <dbl>\n1 6 4\n\n\nAs the interquartile is 4 and the median is 6 then 25% flies have 4 bristles or fewer and 25% have 8 or more.\nThe distribution of counts4 is not symmetrical for lower counts so the mean is not usually a good way to summarise count data.\n If you want to save the table you created and give the columns better names you can make two adjustments:\n\nfly_bristles_summary <- fly_bristles |> \n summarise(med = median(number),\n interquartile = IQR(number))\n\n Plot the bristles data using geom_bar:\nIf counts have a a high mean and big range, like number of hairs on a personā€™s head, then you can often treat them as continuous. This means you can use statistics like the mean and standard deviation to summarise them, histograms to plot them and use some standard statistical tests on them." + "objectID": "r4babs1/week-7/rstudio-projects.html#y12345678", + "href": "r4babs1/week-7/rstudio-projects.html#y12345678", + "title": "RStudio ProjectsWho, what, why?", + "section": "Y12345678", + "text": "Y12345678\ndemo" }, { - "objectID": "pgt52m/week-3/workshop.html#summarising-continuous-data", - "href": "pgt52m/week-3/workshop.html#summarising-continuous-data", - "title": "Workshop", - "section": "Summarising continuous data", - "text": "Summarising continuous data\nCat mass\nThe variable mass in the cats dataframe is continuous. Very many continuous variables have a normal distribution. e normal distribution is also known as the bell-shaped curve. If we had the mass of all the cats in the world, we would find many cats were near the mean and fewer would be away from the mean, either much lighter or much heavier. In fact 68% would be within one standard deviation of the mean and about 96% would be within two standard deviations.\n\n\n\n\n\n We can find the mean mass with:\n\ncats |> \n summarise(mean = mean(mass))\n\n# A tibble: 1 Ɨ 1\n mean\n <dbl>\n1 4.51\n\n\nWe can add any sort of summary by placing it inside the the summarise parentheses. Each one is separated by a comma. We did this to find the median and the interquatrile range for fly bristles.\n For example, another way to calculate the number of values is to use the length() function:\n\ncats |> \n summarise(mean = mean(mass),\n n = length(mass))\n\n# A tibble: 1 Ɨ 2\n mean n\n <dbl> <int>\n1 4.51 62\n\n\n Adapt the code to calculate the mean, the sample size and the standard deviation (sd())\nA single continuous variable can be plotted using a histogram to show the shape of the distribution.\n Plots a histogram of cats mass:\n\nggplot(cats, aes(x = mass)) +\n geom_histogram(bins = 15, colour = \"black\") \n\n\n\n\nNotice that there are no gaps between the bars which reflects that mass is continuous. 
bins determines how many groups the variable is divided up into (i.e., the number of bars) and colour sets the colour for the outline of the bars. A sample of 62 is a relatively small number of values for plotting a distribution and the number of bins used determines how smooth or normally distributed the values look.\n Experiment with the number of bins. Does the number of bins affect how you view the distribution.\nNext week we will practice summarise and plotting data files with several variables but just to give you a taste, we will find summary statistics about mass for each of the coat types. The group_by() function is used before the summarise() to do calculations for each of the coats:\n\ncats |> \n group_by(coat) |> \n summarise(mean = mean(mass),\n standard_dev = sd(mass))\n\n# A tibble: 6 Ɨ 3\n coat mean standard_dev\n <chr> <dbl> <dbl>\n1 black 4.63 1.33 \n2 calico 2.19 NA \n3 ginger 4.46 1.12 \n4 tabby 4.86 0.444\n5 tortoiseshell 4.50 0.929\n6 white 4.34 1.34 \n\n\nYou can read this as:\n\ntake cats and then group by coat and then summarise by finding the mean of mass and the standard deviation of mass\n\n Why do we get an NA for the standard deviation of the calico cats?\n\n\n\nCells\n Summarise the cells dataframe by calculating the mean, median, sample size and standard deviation of FSC.\n Add a column for the standard error which is given by \\(\\frac{s.d.}{\\sqrt{n}}\\)\nMeans of counts\nMany things are quite difficult to measure or count and in these cases we often do technical replicates. A technical replicate allows us the measure the exact same thing to check how variable the measurement process is. For example, Drosophila are small and counting their sternopleural bristles is tricky. In addition, where a bristle is short (young) or broken scientists might vary in whether they count it. Or people or machines might vary in measuring the concentration of the same solution.\nWhen we do technical replicates we calculate their mean and use that as the measure. This is what is in our fly_bristles_means dataframe - the bristles of each of the 96 flies was counted by 5 people and the data are those means. These has an impact on how we plot and summarise the dataset because the distribution of mean counts is continuous! We can use means, standard deviations and histograms. This will be an exercise in Consolidate." + "objectID": "r4babs1/week-7/rstudio-projects.html#babs-1-4-lo-progression", + "href": "r4babs1/week-7/rstudio-projects.html#babs-1-4-lo-progression", + "title": "RStudio ProjectsWho, what, why?", + "section": "BABS 1-4 LO progression", + "text": "BABS 1-4 LO progression\nBABS 1-5 LO progression" }, { - "objectID": "pgt52m/week-3/workshop.html#look-after-future-you", - "href": "pgt52m/week-3/workshop.html#look-after-future-you", - "title": "Workshop", - "section": "Look after future you!", - "text": "Look after future you!\nFuture you is going to summarise and plot data from the ā€œRiver practicalsā€. You can make this much easier by documenting what you have done now. At the moment all of your code from this workshop is in a single file, probably called analysis.R. I recommend making a new script for each of nominal, continuous and count data and copying the code which imports, summarises and plots it. This will make it easier for future you to find the code you need. Here is an example: nominal_data.R. You may wish to comment your version much more.\nYouā€™re finished!" 
+ "objectID": "r4babs1/week-7/rstudio-projects.html#why-use-rstudio-projects", + "href": "r4babs1/week-7/rstudio-projects.html#why-use-rstudio-projects", + "title": "RStudio ProjectsWho, what, why?", + "section": "Why use RStudio Projects", + "text": "Why use RStudio Projects\n\nthe same reason we keep lab books: reproducibility and validation\n\n\nItā€™s science!\n\n\n\nvia GIPHY" }, { - "objectID": "pgt52m/week-3/workshop.html#footnotes", - "href": "pgt52m/week-3/workshop.html#footnotes", - "title": "Workshop", - "section": "Footnotes", - "text": "Footnotes\n\nPlain text files can be opened in notepad or other similar editor and still be readable.ā†©ļøŽ\nDo not be tempted to import data this way. Unless you are careful, your data import will not be scripted or will not be scripted correctly.ā†©ļøŽ\nnote read_csv() and read_table() are the same functions with some different settings.ā†©ļøŽ\nCount data are usually ā€œPoissonā€ distributed.ā†©ļøŽ" + "objectID": "r4babs1/week-7/rstudio-projects.html#why-use-rstudio-projects-1", + "href": "r4babs1/week-7/rstudio-projects.html#why-use-rstudio-projects-1", + "title": "RStudio ProjectsWho, what, why?", + "section": "Why use RStudio Projects", + "text": "Why use RStudio Projects\n\nTransferable: explicit training in organising work" }, { - "objectID": "pgt52m/week-3/study_after_workshop.html", - "href": "pgt52m/week-3/study_after_workshop.html", - "title": "Independent Study to consolidate this week", - "section": "", - "text": "Set up\nIf you have just opened RStudio you will want to load the packages and import the data.\n\nlibrary(tidyverse)\nlibrary(readxl)\n\n\nfly_bristles_means <- read_excel(\"data-raw/bristles-mean.xlsx\")\ncats <- read_csv(\"data-raw/cat-coats.csv\")\n\nExercises\n\nšŸ’» Summarise the fly_bristles_means dataframe by calculating the mean, median, sample size, standard deviation and standard error of the mean_count variable.\n\n\nCodefly_bristles_means_summary <- fly_bristles_means |> \n summarise(mean = mean(mean_count),\n median = median(mean_count),\n n = length(mean_count),\n standard_dev = sd(mean_count),\n standard_error = standard_dev / sqrt(n))\n\n\n\nšŸ’» Create an appropriate plot to show the distribution of mean_count in fly_bristles_means\n\n\n\nCodeggplot(fly_bristles_means, aes(x = mean_count)) +\n geom_histogram(bins = 10)\n\n\n\nšŸ’» Can you format the plot 2. by removing the grey background, giving the bars a black outline and the fill colour of your choice and improving the axis format and labelling? You may want to refer to last weekā€™s workshop.\n\n\nCodeggplot(fly_bristles_means, aes(x = mean_count)) +\n geom_histogram(bins = 10, \n colour = \"black\",\n fill = \"skyblue\") +\n scale_x_continuous(name = \"Number of bristles\",\n expand = c(0, 0)) +\n scale_y_continuous(name = \"Frequency\",\n expand = c(0, 0),\n limits = c(0, 35)) +\n theme_classic()\n\n\n\nšŸ’» Amend this code to change the order of the bars by the average mass of each coat colour? Changing the order of bars was covered last week. 
You may also want to practice formatting the graph nicely.\n\n\nggplot(cats, aes(x = coat, y = mass)) +\n geom_boxplot()\n\n\n\n\n\nCodeggplot(cats, \n aes(x = reorder(coat, mass), y = mass)) +\n geom_boxplot(fill = \"darkcyan\") +\n scale_x_discrete(name = \"Coat colour\") +\n scale_y_continuous(name = \"Mass (kg)\", \n expand = c(0, 0),\n limits = c(0, 8)) +\n theme_classic()\n\n\n\nšŸ“– Read Understanding the pipe |>" + "objectID": "r4babs1/week-7/rstudio-projects.html#why-use-rstudio-projects-2", + "href": "r4babs1/week-7/rstudio-projects.html#why-use-rstudio-projects-2", + "title": "RStudio ProjectsWho, what, why?", + "section": "Why use RStudio Projects", + "text": "Why use RStudio Projects\n\n\n\nhelp you to work with your most important collaborator\n\n\n\n\n\nfutureself, CC-BY-NC, by Julen Colomb" }, { - "objectID": "pgt52m/week-8/workshop.html", - "href": "pgt52m/week-8/workshop.html", - "title": "Workshop", + "objectID": "r4babs1/week-7/rstudio-projects.html#section", + "href": "r4babs1/week-7/rstudio-projects.html#section", + "title": "RStudio ProjectsWho, what, why?", "section": "", - "text": "Artwork by Horst (2023): ā€œDebugging and feelingsā€\n\n\nIn this session you will get practice in choosing between, performing, and presenting the results of, one-way ANOVA and Kruskal-Wallis in R.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + "text": "via GIPHY" }, { - "objectID": "pgt52m/week-8/workshop.html#session-overview", - "href": "pgt52m/week-8/workshop.html#session-overview", - "title": "Workshop", - "section": "", - "text": "In this session you will get practice in choosing between, performing, and presenting the results of, one-way ANOVA and Kruskal-Wallis in R." 
+ "objectID": "r4babs1/week-7/rstudio-projects.html#working-directories-and-paths", + "href": "r4babs1/week-7/rstudio-projects.html#working-directories-and-paths", + "title": "RStudio ProjectsWho, what, why?", + "section": "Working directories and Paths", + "text": "Working directories and Paths\n\ndirectory means folder\nimportant concepts when you interact with computers without clicking\n\n\nAllison Horst cartoon ā€œcode gets the blameā€" }, { - "objectID": "pgt52m/week-8/workshop.html#philosophy", - "href": "pgt52m/week-8/workshop.html#philosophy", - "title": "Workshop", - "section": "", - "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + "objectID": "r4babs1/week-7/rstudio-projects.html#working-directories", + "href": "r4babs1/week-7/rstudio-projects.html#working-directories", + "title": "RStudio ProjectsWho, what, why?", + "section": "Working directories", + "text": "Working directories\n\nDefault folder a program will read and write to.\nYou will have some understanding\n\nWord demo" }, { - "objectID": "pgt52m/week-8/workshop.html#myoglobin-in-seal-muscle", - "href": "pgt52m/week-8/workshop.html#myoglobin-in-seal-muscle", - "title": "Workshop", - "section": "Myoglobin in seal muscle", - "text": "Myoglobin in seal muscle\nThe myoglobin concentration of skeletal muscle of three species of seal in grams per kilogram of muscle was determined and the data are given in seal.csv. We want to know if there is a difference between species. Each row represents an individual seal. The first column gives the myoglobin concentration and the second column indicates species.\n Save a copy of the data file seal.csv to data-raw\n Read in the data and check the structure. I used the name seal for the dataframe/tibble.\n What kind of variables do you have?\n\n\n\nExploring\n Do a quick plot of the data. 
You may need to refer to a previous workshop\nSummarising the data\nDo you remember Look after future you!\n If you followed that tip youā€™ll be able to open that script and whizz through summarising,testing and plotting.\n Create a data frame called seal_summary that contains the means, standard deviations, sample sizes and standard errors for each species.\nYou should get the following numbers:\n\n\n\n\nspecies\nmean\nstd\nn\nse\n\n\n\nBladdernose Seal\n42.31600\n8.020634\n30\n1.464361\n\n\nHarbour Seal\n49.01033\n8.252004\n30\n1.506603\n\n\nWeddell Seal\n44.66033\n7.849816\n30\n1.433174\n\n\n\n\n\nApplying, interpreting and reporting\nWe can now carry out a one-way ANOVA using the same lm() function we used for two-sample tests.\n Carry out an ANOVA and examine the results with:\n\nmod <- lm(data = seal, myoglobin ~ species)\nsummary(mod)\n\n\nCall:\nlm(formula = myoglobin ~ species, data = seal)\n\nResiduals:\n Min 1Q Median 3Q Max \n-16.306 -5.578 -0.036 5.240 18.250 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 42.316 1.468 28.819 < 2e-16 ***\nspeciesHarbour Seal 6.694 2.077 3.224 0.00178 ** \nspeciesWeddell Seal 2.344 2.077 1.129 0.26202 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 8.043 on 87 degrees of freedom\nMultiple R-squared: 0.1096, Adjusted R-squared: 0.08908 \nF-statistic: 5.352 on 2 and 87 DF, p-value: 0.006427\n\n\nRemember: the tilde (~) means test the values in myoglobin when grouped by the values in species. Or explain myoglobin with species\n What do you conclude so far from the test? Write your conclusion in a form suitable for a report.\n\n\n\n Can you relate the values under Estimate to the means?\n\n\n\n\n\n\n\nThe ANOVA is significant but this only tells us that species matters, meaning at least two of the means differ. To find out which means differ, we need a post-hoc test. A post-hoc (ā€œafter thisā€) test is done after a significant ANOVA test. There are several possible post-hoc tests and we will be using Tukeyā€™s HSD (honestly significant difference) test (Tukey 1949) implemented in the emmeans (Lenth 2023) package.\n Load the package\n\nlibrary(emmeans)\n\n Carry out the post-hoc test\n\nemmeans(mod, ~ species) |> pairs()\n\n contrast estimate SE df t.ratio p.value\n Bladdernose Seal - Harbour Seal -6.69 2.08 87 -3.224 0.0050\n Bladdernose Seal - Weddell Seal -2.34 2.08 87 -1.129 0.4990\n Harbour Seal - Weddell Seal 4.35 2.08 87 2.095 0.0968\n\nP value adjustment: tukey method for comparing a family of 3 estimates \n\n\nEach row is a comparison between the two means in the ā€˜contrastā€™ column. The ā€˜estimateā€™ column is the difference between those means and the ā€˜p.valueā€™ indicates whether that difference is significant.\nA plot can be used to visualise the result of the post-hoc which can be especially useful when there are very many comparisons.\n Plot the results of the post-hoc test:\n\nemmeans(mod, ~ species) |> plot()\n\n\n\n\nWhere the purple bars overlap, there is no significant difference.\n What do you conclude from the test?\n\n\n\nCheck assumptions\nThe assumptions of the general linear model are that the residuals ā€“ the difference between predicted value (i.e., the group mean) and observed values - are normally distributed and have homogeneous variance. To check these we can examine the mod$residuals variable. 
You may want to refer to Checking assumptions in the ā€œSingle regressionā€ workshop.\n Plot the model residuals against the fitted values.\n What to you conclude?\n\n\n\nTo examine normality of the model residuals we can plot them as a histogram and do a normality test on them.\n Plot a histogram of the residuals.\n Use the shapiro.test() to test the normality of the model residuals\n What to you conclude?\n\n\n\n\nIllustrating\n Create a figure like the one below. You may need to refer to Visualise from the ā€œSummarising data with several variablesā€ workshop (Rand 2023)\nWe will again use both our seal and seal_summary dataframes.\n Create the plot:\n\n\n\n\n\n Save your figure to your figures folder." + "objectID": "r4babs1/week-7/rstudio-projects.html#paths", + "href": "r4babs1/week-7/rstudio-projects.html#paths", + "title": "RStudio ProjectsWho, what, why?", + "section": "Paths", + "text": "Paths\n\nlocation of a file/folder\nappear in the address bar of explorer/finder and browsers\n\ndemo\n\n\nwhen you canā€™t click, you need the path\n\n\nchaffinch <- read_table(\"chaff.txt\")" }, { - "objectID": "pgt52m/week-8/workshop.html#leafminers-on-birch", - "href": "pgt52m/week-8/workshop.html#leafminers-on-birch", + "objectID": "r4babs1/week-7/rstudio-projects.html#absolute-path", + "href": "r4babs1/week-7/rstudio-projects.html#absolute-path", + "title": "RStudio ProjectsWho, what, why?", + "section": "Absolute path", + "text": "Absolute path\n\nchaffinch <- read_table(\"C:/Users/er13/OneDrive - University of York/Desktop/Desktop/undergrad-teaching-york/BIO00017C/BIO00017C-Data-Analysis-in-R-2020/data/chaff.txt\")\n\n\nOnly exists on my computer!" + }, + { + "objectID": "r4babs1/week-7/rstudio-projects.html#relative-paths", + "href": "r4babs1/week-7/rstudio-projects.html#relative-paths", + "title": "RStudio ProjectsWho, what, why?", + "section": "Relative paths", + "text": "Relative paths\n\nlocation of a file/folder relative to the working directory\nIf my working directory is BIO00017C-Data-Analysis-in-R-2020:\n\n\nchaffinch <- read_table(\"data/chaff.txt\")" + }, + { + "objectID": "r4babs1/week-7/rstudio-projects.html#rstudio-projects", + "href": "r4babs1/week-7/rstudio-projects.html#rstudio-projects", + "title": "RStudio ProjectsWho, what, why?", + "section": "RStudio Projects", + "text": "RStudio Projects\n\nSets the working directory to be the project folder\nCode is portable: you send someone the folder and everything just works!" + }, + { + "objectID": "r4babs1/week-7/rstudio-projects.html#demo", + "href": "r4babs1/week-7/rstudio-projects.html#demo", + "title": "RStudio ProjectsWho, what, why?", + "section": "demo", + "text": "demo" + }, + { + "objectID": "r4babs1/week-6/study_after_workshop.html", + "href": "r4babs1/week-6/study_after_workshop.html", + "title": "Independent Study to consolidate this week", + "section": "", + "text": "There is no additional study this week but you may want to look ahead to next week." + }, + { + "objectID": "r4babs1/week-6/workshop.html", + "href": "r4babs1/week-6/workshop.html", "title": "Workshop", - "section": "Leafminers on Birch", - "text": "Leafminers on Birch\nLarvae of the Ambermarked birch leafminer, Profenusa thomsoni, feed on the interior leaf tissues of Birch (Betula) species. They do not normally kill the tree but can weaken it making it susceptible to attack from other species. Researchers are interested in whether there is a difference in the rates at which white, grey and yellow birch are attacked. 
They introduce adult female P.thomsoni to a green house containing 30 young trees (ten of each type) and later count the egg laying events on each tree. The data are in leaf.txt.\nExploring\n Read in the data and check the structure. I used the name leaf for the dataframe/tibble.\n What kind of variables do we have?\n\n\n\n Do a quick plot of the data.\n Using your common sense, do these data look normally distributed?\n\n\n Why is a Kruskal-Wallis appropriate in this case?\n\n\n\n\n\n Calculate the medians, means and sample sizes.\nApplying, interpreting and reporting\n Carry out a Kruskal-Wallis:\n\nkruskal.test(data = leaf, eggs ~ birch)\n\n\n Kruskal-Wallis rank sum test\n\ndata: eggs by birch\nKruskal-Wallis chi-squared = 6.3393, df = 2, p-value = 0.04202\n\n\n What do you conclude from the test?\n\n\n\nA significant Kruskal-Wallis tells us at least two of the groups differ but where do the differences lie? The Dunn test is a post-hoc multiple comparison test for a significant Kruskal-Wallis. It is available in the package FSA\n Load the package using:\n\nlibrary(FSA)\n\n Run the post-hoc test with:\n\ndunnTest(data = leaf, eggs ~ birch)\n\n Comparison Z P.unadj P.adj\n1 Grey - White 1.296845 0.19468465 0.38936930\n2 Grey - Yellow -1.220560 0.22225279 0.22225279\n3 White - Yellow -2.517404 0.01182231 0.03546692\n\n\nThe P.adj column gives p-value for the comparison listed in the first column. Z is the test statistic.\n What do you conclude from the test?\n\n\n\n Write up the result is a form suitable for a report.\n\n\n\n\n\n\nIllustrating\n A box plot is an appropriate choice for illustrating a Kruskal-Wallis. Can you produce a figure like this?\n\n\n\n\n\nYouā€™re finished!" + "section": "", + "text": "There is no formal workshop this week but you might want to install R and RStudio on your own machine. This is optional because University computers already have R and RStudio installed.\nInstall R and RStudio.\nNote you need a computer - not a tablet." }, { - "objectID": "pgt52m/week-8/study_after_workshop.html", - "href": "pgt52m/week-8/study_after_workshop.html", + "objectID": "r4babs1/week-8/study_after_workshop.html", + "href": "r4babs1/week-8/study_after_workshop.html", "title": "Independent Study to consolidate this week", "section": "", - "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’» Sports scientists were investigating the effects of fitness and heat acclimatisation on the sodium content of sweat. They measured the sodium content of the sweat (Ī¼moll^āˆ’1) of three groups of individuals: unfit and unacclimatised (UU); fit and unacclimatised(FU); and fit and acclimatised (FA). The are in sweat.txt. 
Is there a difference between the groups in the sodium content of their sweat?\n\n\nCode# read in the data and look at structure\nsweat <- read_table(\"data-raw/sweat.txt\")\nstr(sweat)\n\n\n\nCode# quick plot of the data\nggplot(data = sweat, aes(x = gp, y = na)) +\n geom_boxplot()\nCode# Since the sample sizes are small and not the same in each group and the \n# variance in the FA gp looks a bit lower, I'm leaning to a non-parametric test K-W.\n# However, don't panic if you decided to do an anova\n\n\n\nCode# calculate some summary stats \nsweat_summary <- sweat %>% \n group_by(gp) %>% \n summarise(mean = mean(na),\n n = length(na),\n median = median(na))\n\n\n\nCode# Kruskal-Wallis\nkruskal.test(data = sweat, na ~ gp)\n# We can say there is a difference between the groups in the sodium \n# content of their sweat (chi-squared = 11.9802, df = 2, p-value = 0.002503).\n# Unfit and unacclimatised people have most salty sweat, \n# Fit and acclimatised people the least salty.\n\n\n\nCode# a post-hoc test to see where the sig differences lie:\nlibrary(FSA)\ndunnTest(data = sweat, na ~ gp)\n# Fit and acclimatised people (median = 49.5 Ī¼moll^āˆ’1) have significantly less sodium in their\n# sweat than the unfit and unacclimatised people (70 Ī¼moll^āˆ’1) \n# (Kruskal-Wallis multiple comparison p-values adjusted with the Holm method: p = 0.0026).\n# Fit and unacclimatised (54 Ī¼moll^āˆ’1) also have significantly less sodium in their\n# people have sodium concentrations than unfit and unacclimatised people (p = 0.033). \n# There was no difference between the Fit and unacclimatised and the Fit and acclimatised. See figure 1.\n\n\n\nCodeggplot(sweat, aes(x = gp, y = na) ) +\n geom_boxplot() +\n scale_x_discrete(labels = c(\"Fit Acclimatised\", \n \"Fit Unacclimatised\", \n \"Unfit Unacclimatised\"), \n name = \"Group\") +\n scale_y_continuous(limits = c(0, 110), \n expand = c(0, 0),\n name = expression(\"Sodium\"~mu*\"mol\"*l^{-1})) +\n annotate(\"segment\", x = 1, xend = 3, \n y = 100, yend = 100,\n colour = \"black\") +\n annotate(\"text\", x = 2, y = 103, \n label = expression(italic(p)~\"= 0.0026\")) +\n annotate(\"segment\", x = 2, xend = 3, \n y = 90, yend = 90,\n colour = \"black\") +\n annotate(\"text\", x = 2.5, y = 93, \n label = expression(italic(p)~\"= 0.0340\")) +\n theme_classic()\nCode#Figure 1. Sodium content of sweat for three groups: Fit and acclimatised\n#(FA), Fit and unacclimatised (FU) and Unfit and unacclimatised (UU). Heavy lines\n#indicate the median, boxes the interquartile range and whiskers the range. \n\n\n\nšŸ’» The data are given in biomass.txt are taken from an experiment in which the insect pest biomass (g) was measured on plots sprayed with water (control) or one of five different insecticides. Do the insecticides vary in their effectiveness? What advice would you give to a person: - currently using insecticide E? - trying to choose between A and D? - trying to choose between C and B?\n\n\nCodebiom <- read_table(\"data-raw/biomass.txt\")\n# The data are organised with an insecticide treatment group in\n# each column.\n\n\n\nCode#Put the data into tidy format.\n\nbiom <- biom |> \n pivot_longer(cols = everything(),\n names_to = \"spray\",\n values_to = \"biomass\")\n\n\n\nCode# quick plot of the data\nggplot(data = biom, aes(x = spray, y = biomass)) +\n geom_boxplot()\nCode# Looks like there is a difference between sprays. 
E doesn't look very effective.\n\n\n\nCode# summary statistics\nbiom_summary <- biom %>% \n group_by(spray) %>% \n summarise(mean = mean(biomass),\n median = median(biomass),\n sd = sd(biomass),\n n = length(biomass),\n se = sd / sqrt(n))\n# thoughts so far: the sample sizes are equal, 10 is a smallish but\n# reasonable sample size\n# the means and medians are similar to each other (expected for\n# normally distributed data), A has a smaller variance \n\n# We have one explanatory variable, \"spray\" comprising 6 levels\n# Biomass has decimal places and we would expect such data to be \n# normally distributed therefore one-way ANOVA is the desired test\n# - we will check the assumptions after building the model\n\n\n\nCode# arry out an ANOVA and examine the results \nmod <- lm(data = biom, biomass ~ spray)\nsummary(mod)\n# spray type does have an effect F-statistic: 26.46 on 5 and 54 DF, p-value: 2.081e-13\n\n\n\nCode# Carry out the post-hoc test\nlibrary(emmeans)\n\nemmeans(mod, ~ spray) |> pairs()\n\n# the signifcant comparisons are:\n# contrast estimate SE df t.ratio p.value\n# A - D -76.50 21.9 54 -3.489 0.0119\n# A - E -175.51 21.9 54 -8.005 <.0001\n# A - WaterControl -175.91 21.9 54 -8.024 <.0001\n# B - E -154.32 21.9 54 -7.039 <.0001\n# B - WaterControl -154.72 21.9 54 -7.057 <.0001\n# C - E -155.71 21.9 54 -7.102 <.0001\n# C - WaterControl -156.11 21.9 54 -7.120 <.0001\n# D - E -99.01 21.9 54 -4.516 0.0005\n# D - WaterControl -99.41 21.9 54 -4.534 0.0004\n# All sprays are better than the water control except E. \n# This is probably the most important result.\n# What advice would you give to a person currently using insecticide E?\n# Don't bother!! It's no better than water. Switch to any of \n# the other sprays\n# What advice would you give to a person currently\n# + trying to choose between A and D? Choose A because A has sig lower\n# insect biomass than D \n# + trying to choose between C and B? It doesn't matter because there is \n# no difference in insect biomass. Use other criteria to chose (e.g., price)\n# We might report this like:\n# There is a very highly significant effect of spray type on pest \n# biomass (F = 26.5; d.f., 5, 54; p < 0.001). Post-hoc testing \n# showed E was no more effective than the control; A, C and B were \n# all better than the control but could be equally as good as each\n# other; D would be a better choice than the control or E but \n# worse than A. 
See figure 1\n\n\n\nCode# I reordered the bars to make is easier for me to annotate with\n# I also used * to indicate significance\n\nggplot() +\n geom_point(data = biom, aes(x = reorder(spray, biomass), y = biomass),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"gray50\") +\n geom_errorbar(data = biom_summary, \n aes(x = spray, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = biom_summary, \n aes(x = spray, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_y_continuous(name = \"Pest Biomass (units)\",\n limits = c(0, 540),\n expand = c(0, 0)) +\n scale_x_discrete(\"Spray treatment\") +\n # E and control are one group\n annotate(\"segment\", x = 4.5, xend = 6.5, \n y = 397, yend = 397,\n colour = \"black\", linewidth = 1) +\n annotate(\"text\", x = 5.5, y = 385, \n label = \"N.S\", size = 4) +\n # WaterControl-D and E-D ***\n annotate(\"segment\", x = 4, xend = 5.5, \n y = 410, yend = 410,\n colour = \"black\") +\n annotate(\"text\", x = 4.5, y = 420, \n label = \"***\", size = 5) +\n # WaterControl-B ***\n annotate(\"segment\", x = 3, xend = 5.5, \n y = 440, yend = 440,\n colour = \"black\") +\n annotate(\"text\", x = 4, y = 450,\n label = \"***\", size = 5) +\n # WaterControl-C ***\n annotate(\"segment\", x = 2, xend = 5.5, \n y = 475, yend = 475,\n colour = \"black\") +\n annotate(\"text\", x = 3.5, y = 485, \n label = \"***\", size = 5) +\n # WaterControl-A ***\n annotate(\"segment\", x = 1, xend = 5.5, \n y = 510, yend = 510,\n colour = \"black\") +\n annotate(\"text\", x = 3.5, y = 520, \n label = \"***\", size = 5) + \n# A-D ***\n annotate(\"segment\", x = 1, xend = 4, \n y = 330, yend = 330,\n colour = \"black\") +\n annotate(\"text\", x = 2.5, y = 335, \n label = \"*\", size = 5) +\n theme_classic()\nCode# Figure 1. The mean pest biomass following various insecticide treatments.\n# Error bars are +/- 1 S.E. Significant comparisons are indicated: * is p < 0.05, ** p < 0.01 and *** is p < 0.001" + "text": "Set up\nIf you have just opened RStudio you will want to load the packages and import the data.\n\nlibrary(tidyverse)\nlibrary(readxl)\n\n\nfly_bristles_means <- read_excel(\"data-raw/bristles-mean.xlsx\")\ncats <- read_csv(\"data-raw/cat-coats.csv\")\n\nExercises\n\nšŸ’» Summarise the fly_bristles_means dataframe by calculating the mean, median, sample size, standard deviation and standard error of the mean_count variable.\n\n\nCodefly_bristles_means_summary <- fly_bristles_means |> \n summarise(mean = mean(mean_count),\n median = median(mean_count),\n n = length(mean_count),\n standard_dev = sd(mean_count),\n standard_error = standard_dev / sqrt(n))\n\n\n\nšŸ’» Create an appropriate plot to show the distribution of mean_count in fly_bristles_means\n\n\n\nCodeggplot(fly_bristles_means, aes(x = mean_count)) +\n geom_histogram(bins = 10)\n\n\n\nšŸ’» Can you format the plot 2. by removing the grey background, giving the bars a black outline and the fill colour of your choice and improving the axis format and labelling? You may want to refer to last weekā€™s workshop.\n\n\nCodeggplot(fly_bristles_means, aes(x = mean_count)) +\n geom_histogram(bins = 10, \n colour = \"black\",\n fill = \"skyblue\") +\n scale_x_continuous(name = \"Number of bristles\",\n expand = c(0, 0)) +\n scale_y_continuous(name = \"Frequency\",\n expand = c(0, 0),\n limits = c(0, 35)) +\n theme_classic()\n\n\n\nšŸ’» Amend this code to change the order of the bars by the average mass of each coat colour? Changing the order of bars was covered last week. 
You may also want to practice formatting the graph nicely.\n\n\nggplot(cats, aes(x = coat, y = mass)) +\n geom_boxplot()\n\n\n\n\n\nCodeggplot(cats, \n aes(x = reorder(coat, mass), y = mass)) +\n geom_boxplot(fill = \"darkcyan\") +\n scale_x_discrete(name = \"Coat colour\") +\n scale_y_continuous(name = \"Mass (kg)\", \n expand = c(0, 0),\n limits = c(0, 8)) +\n theme_classic()\n\n\n\nšŸ“– Read Understanding the pipe |>" }, { - "objectID": "pgt52m/week-4/workshop.html", - "href": "pgt52m/week-4/workshop.html", + "objectID": "r4babs1/week-8/workshop.html", + "href": "r4babs1/week-8/workshop.html", "title": "Workshop", "section": "", - "text": "Data data Artwork from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst\n\n\nIn this workshop you will learn to summarise and plot datasets with more than one variable and how to write figures to files. You will also get more practice with working directories, importing data, formatting figures and the pipe.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier workshops\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + "text": "Artwork by Horst (2023): Continuous and Discrete\n\n\nIn this workshop you will learn how to import data from files and create summaries and plots for it. You will also get more practice with working directories, formatting figures and the pipe.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier workshops\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. 
It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." }, { - "objectID": "pgt52m/week-4/workshop.html#session-overview", - "href": "pgt52m/week-4/workshop.html#session-overview", + "objectID": "r4babs1/week-8/workshop.html#session-overview", + "href": "r4babs1/week-8/workshop.html#session-overview", "title": "Workshop", "section": "", - "text": "In this workshop you will learn to summarise and plot datasets with more than one variable and how to write figures to files. You will also get more practice with working directories, importing data, formatting figures and the pipe." + "text": "In this workshop you will learn how to import data from files and create summaries and plots for it. You will also get more practice with working directories, formatting figures and the pipe." }, { - "objectID": "pgt52m/week-4/workshop.html#philosophy", - "href": "pgt52m/week-4/workshop.html#philosophy", + "objectID": "r4babs1/week-8/workshop.html#philosophy", + "href": "r4babs1/week-8/workshop.html#philosophy", "title": "Workshop", "section": "", "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier workshops\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." }, { - "objectID": "pgt52m/week-4/workshop.html#myoglobin-in-seal-muscle", - "href": "pgt52m/week-4/workshop.html#myoglobin-in-seal-muscle", + "objectID": "r4babs1/week-8/workshop.html#importing-data-from-files", + "href": "r4babs1/week-8/workshop.html#importing-data-from-files", "title": "Workshop", - "section": "Myoglobin in seal muscle", - "text": "Myoglobin in seal muscle\nThe myoglobin concentration of skeletal muscle of three species of seal in grams per kilogram of muscle was determined and the data are given in seal.csv. Each row represents an individual seal. The first column gives the myoglobin concentration and the second column indicates species.\nImport\n Save seal.csv to your data-raw folder\n Read the data into a dataframe called seal. . You might want to look up data import from last week.\n What types of variables do you have in the seal dataframe? 
What role would you expect them to play in analysis?\n\n\n\n\nThe key point here is that the fundamental structure of:\n\none continuous response and one nominal explanatory variable with two groups (adipocytes), and\none continuous response and one nominal explanatory variable with three groups (seals)\n\nis the same! The only thing that differs is the number of groups (the number of values in the nominal variable). This means the code for summarising and plotting is identical except for the variable names!\n\n\n\n\n\n\nTip\n\n\n\nWhen two datasets have the same number of columns and the response variable and the explanatory variables have the same data types then the code you need is the same.\n\n\nSummarise\nSummarising the data for each species is the next sensible step. The most useful summary statistics for a continuous variable like myoglobin are the means, standard deviations, sample sizes and standard errors. You might remember from last week that we use the group_by() and summarise() functions along with the functions that do the calculations.\n Create a data frame called seal_summary that contains the means, standard deviations, sample sizes and standard errors for the control and nicotinic acid treated samples.\n\nseal_summary <- seal %>%\n group_by(species) %>%\n summarise(mean = mean(myoglobin),\n std = sd(myoglobin),\n n = length(myoglobin),\n se = std/sqrt(n))\n\nYou should get the following numbers:\n\n\n\n\nspecies\nmean\nstd\nn\nse\n\n\n\nBladdernose Seal\n42.31600\n8.020634\n30\n1.464361\n\n\nHarbour Seal\n49.01033\n8.252004\n30\n1.506603\n\n\nWeddell Seal\n44.66033\n7.849816\n30\n1.433174\n\n\n\n\n\nVisualise\nMost commonly, we put the explanatory variable on the x axis and the response variable on the y axis. A continuous response, particularly one that follows the normal distribution, is best summarised with the mean and the standard error. In my opinion, you should also show all the raw data points if possible.\nWe are going to create a figure like this:\n\n\n\n\n\nIn this figure, we have the data points themselves which are in seal dataframe and the means and standard errors which are in the seal_summary dataframe. That is, we have two dataframes we want to plot.\nHere you will learn that dataframes and aesthetics can be specified within a geom_xxxx (rather than in the ggplot()). This is very useful if the geom only applies to some of the data you want to plot.\n\n\n\n\n\n\nTip: ggplot()\n\n\n\nYou put the data argument and aes() inside ggplot() if you want all the geoms to use that dataframe and variables. If you want a different dataframe for a geom, put the data argument and aes() inside the geom_xxxx()\n\n\nI will build the plot up in small steps but you should edit your existing ggplot() command as we go.\n Plot the data points first.\n\nggplot() +\n geom_point(data = seal, \n aes(x = species, y = myoglobin))\n\n\n\n\nNotice how we have given the data argument and the aesthetics inside the geom. 
The variables species and myoglobin are in the seal dataframe\n So the data points donā€™t overlap, we can add some random jitter in the x direction (edit your existing code):\n\nggplot() +\n geom_point(data = seal, \n aes(x = species, y = myoglobin),\n position = position_jitter(width = 0.1, height = 0))\n\n\n\n\nNote that position = position_jitter(width = 0.1, height = 0) is inside the geom_point() parentheses, after the aes() and a comma.\nWeā€™ve set the vertical jitter to 0 because, in contrast to the categorical x-axis, movement on the y-axis has meaning (the myoglobin levels).\n Letā€™s make the points a light grey (edit your existing code):\n\nggplot() +\n geom_point(data = seal, \n aes(x = species, y = myoglobin),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"grey50\")\n\n\n\n\nNow to add the errorbars. These go from one standard error below the mean to one standard error above the mean.\n Add a geom_errorbar() for errorbars (edit your existing code):\n\nggplot() +\n geom_point(data = seal, aes(x = species, y = myoglobin),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"grey50\") +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean - se, ymax = mean + se),\n width = 0.3) \n\n\n\n\nWe have specified the seal_summary dataframe and the variables species, mean and se are in that.\nThere are several ways you could add the mean. You could use geom_point() but I like to use geom_errorbar() again with the ymin and ymax both set to the mean.\n Add a geom_errorbar() for the mean (edit your existing code):\n\nggplot() +\n geom_point(data = seal, aes(x = species, y = myoglobin),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"grey50\") +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean, ymax = mean),\n width = 0.2)\n\n\n\n\n Alter the axis labels and limits using scale_y_continuous() and scale_x_discrete() (edit your existing code):\n\nggplot() +\n geom_point(data = seal, aes(x = species, y = myoglobin),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"grey50\") +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_y_continuous(name = \"Myoglobin (g/kg)\", \n limits = c(0, 80), \n expand = c(0, 0)) +\n scale_x_discrete(name = \"Species\")\n\n\n\n\nYou only need to use scale_y_continuous() and scale_x_discrete() to use labels that are different from those in the dataset. 
Often this is to use proper terminology and captialisation.\n Format the figure in a way that is more suitable for including in a report using theme_classic() (edit your existing code):\n\nggplot() +\n geom_point(data = seal, aes(x = species, y = myoglobin),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"grey50\") +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_y_continuous(name = \"Myoglobin (g/kg)\", \n limits = c(0, 80), \n expand = c(0, 0)) +\n scale_x_discrete(name = \"Species\") +\n theme_classic()\n\n\n\n\nWriting figures to file\n Make a new folder called figures.\n Edit you ggplot code so that you assign the figure to a variable.\n\nsealfig <- ggplot() +\n geom_point(data = seal, aes(x = species, y = myoglobin),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"grey50\") +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = seal_summary, \n aes(x = species, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_y_continuous(name = \"Myoglobin (g/kg)\", \n limits = c(0, 80), \n expand = c(0, 0)) +\n scale_x_discrete(name = \"Species\") +\n theme_classic()\n\nThe figure wonā€™t be shown in the Plots tab - the output has gone into sealfig rather than to the Plots tab. To make it appear in the Plots tab type sealfig\n The ggsave() command will write a ggplot figure to a file:\n\nggsave(\"figures/seal-muscle.png\",\n plot = sealfig,\n device = \"png\",\n width = 4,\n height = 3,\n units = \"in\",\n dpi = 300)\n\nfiguresseal-muscle.png is the name of the file, including the relative path.\n Look up ggsave() in the manual to understand the arguments. You can do this by putting your cursor on the command and pressing F1" + "section": "Importing data from files", + "text": "Importing data from files\nLast week we created data by typing the values in to R. This is not practical when you have added a lot of data to a spreadsheet, or you are using data file that has been supplied to you by a person or a machine. Far more commonly, we import data from a file into R. This requires you know two pieces of information.\n\n\nWhat format the data are in\nThe format of the data determines what function you will use to import it and the file extension often indicates format.\n\n\n.txt a plain text file1, where the columns are often separated by a space but might also be separated by a tab, a backslash or forward slash, or some other character\n\n.csv a plain text file where the columns are separated by commas\n\n.xlsx an Excel file\n\n\n\nWhere the file is relative to your working directory\nR can only read in a file if you say where it is, i.e., you give its relative path. If you follow the advice in this course, your data will be in a folder, data-raw which is inside your Project folder (and working directory).\n\n\nWe will save the four files for this workshop to our Project folder (week-8) and read them in. We will then create a new folder inside our Project folder called data-raw and move the data files to there before modifying the file paths as required. 
This is to demonstrate how the relative path to the file will change after we move it.\n Save these four files into your week-8 folder\n\nThe coat colour and mass of 62 cats: cat-coats.csv\n\nThe relative size of over 5000 cells measured by forward scatter (FSC) in flow cytometry: cell-size.txt\n\nThe number of sternopleural bristles on 96 female Drosophila: bristles.txt\n\nThe number of sternopleural bristles on 96 female Drosophila (with technical replicates): bristles-mean.xlsx\n\n\nThe first three files can be read in with core tidyverse Wickham et al. (2019) functions and the last can be read in with the readxl Wickham and Bryan (2023) package.\n Load the two packages\n\nlibrary(tidyverse)\nlibrary(readxl)\n\nWe will first read in cat-coats.csv. A .csv extension suggests this is a plain text file with comma-separated columns. However, before we attempt to read it in, we should take a look at it. We can do this from RStudio.\n Go to the Files pane (bottom right), click on the cat-coats.csv file and choose View File2\n\n\nRStudio Files Pane\n\nAny plain text file will open in the top left pane (Excel files will launch Excel).\n Is the file csv?\n\n\n What kind of variables does the file contain?\n\n\n Read in the csv file with:\n\ncats <- read_csv(\"cat-coats.csv\")\n\nThe data from the file are read into a dataframe called cats and you will be able to see it in the Environment.\n Click on each of the remaining files and choose View File.\n In each case, say what the format is and what types of variables it contains.\n\n\n\n\n\n\n\nWe use the read_table()3 command to read in plain text files of single columns or where the columns are separated by spaces…\n …so cell-size.txt can be read into a dataframe called cells like this:\n\ncells <- read_table(\"cell-size.txt\")\n\n Now you try reading bristles.txt into a dataframe called fly_bristles\nThe readxl package we loaded earlier has two useful functions for working with Excel files: excel_sheets(\"filename.xlsx\") will list the sheets in an Excel workbook; read_excel(\"filename.xlsx\") will read in the top sheet, or a specified sheet with a small modification: read_excel(\"filename.xlsx\", sheet = \"Sheet1\").\n List the names of the sheets and read in the sheet with the data like this:\n\nexcel_sheets(\"bristles-mean.xlsx\")\nfly_bristles_means <- read_excel(\"bristles-mean.xlsx\", sheet = \"means\")\n\nWell done! You can now read in data from files in your working directory.\nTo help you understand relative file paths, we will now move the data files.\n First remove the dataframes you just created to make it easier to see whether you can successfully read in the files from a different place:\n\nrm(cats, fly_bristles, cells, fly_bristles_means)\n\n Now make a new folder called data-raw. You can do this on the Files Pane by clicking New Folder and typing into the box that appears.\n Check the boxes next to the file names and choose More | Move… and select the data-raw folder.\n The files will move. To import data from files in the data-raw folder, you need to give the relative path to the file from the working directory. The working directory is the Project folder, week-8, so the relative path is data-raw/cat-coats.csv\n Import the cat-coats.csv data like this:\n\ncats <- read_csv(\"data-raw/cat-coats.csv\")\n\n Now you do the other files.\nFrom this point forward in the course, we will always create a data-raw folder each time we make a new Project.\nThese are the most common forms of data file you will encounter at first. 
However, data can certainly come to you in other formats particularly when they have come from particular software. Usually, there is an R package specially for that format.\nIn the rest of the workshop we will take each dataset in turn and create summaries and plots appropriate for the data types. Data is summarised using the group_by() and summarise() functions" }, { - "objectID": "pgt52m/week-4/workshop.html#pigeons", - "href": "pgt52m/week-4/workshop.html#pigeons", + "objectID": "r4babs1/week-8/workshop.html#summarising-discrete-data-cat-coat", + "href": "r4babs1/week-8/workshop.html#summarising-discrete-data-cat-coat", "title": "Workshop", - "section": "Pigeons", - "text": "Pigeons\nThe data in pigeon.txt are 40 measurements of interorbital width (in mm) for two populations of domestic pigeons measured to the nearest 0.1mm\n\n\nInterorbital width is the distance between the eyes\n\nImport\n Save pigeon.txt to your data-raw folder\n Read the data into a dataframe called pigeons.\n What variables are there in the pigeons dataframe?\n\n\n\n\nHummmm, these data are not organised like the other data sets we have used. The population is given as the column names and the interorbital distances for one population are given in a different column than those for the other population. The first row has data from two pigeons which have nothing in common, they just happen to be the first individual recorded in each population.\n\n\n\n\n\nA\nB\n\n\n\n12.4\n12.6\n\n\n11.2\n11.3\n\n\n11.6\n12.1\n\n\n12.3\n12.2\n\n\n11.8\n11.8\n\n\n10.7\n11.5\n\n\n11.3\n11.2\n\n\n11.6\n11.9\n\n\n12.3\n11.2\n\n\n10.5\n12.1\n\n\n12.1\n11.9\n\n\n10.4\n10.7\n\n\n10.8\n11.0\n\n\n11.9\n12.2\n\n\n10.9\n12.6\n\n\n10.8\n11.6\n\n\n10.4\n10.7\n\n\n12.0\n12.4\n\n\n11.7\n11.8\n\n\n11.3\n11.1\n\n\n11.5\n12.9\n\n\n11.8\n11.9\n\n\n10.3\n11.1\n\n\n10.3\n12.2\n\n\n11.5\n11.8\n\n\n10.7\n11.5\n\n\n11.3\n11.2\n\n\n11.6\n11.9\n\n\n13.3\n11.2\n\n\n10.7\n11.1\n\n\n12.1\n11.6\n\n\n10.2\n12.7\n\n\n10.8\n11.0\n\n\n11.4\n12.2\n\n\n10.9\n11.3\n\n\n10.3\n11.6\n\n\n10.4\n12.2\n\n\n10.0\n12.4\n\n\n11.2\n11.3\n\n\n11.3\n11.1\n\n\n\n\n\n\n\nThis data is not in ā€˜tidyā€™ format (Wickham 2014).\nTidy format has variables in column and observations in rows. 
All of the distance measurements should be in one column and a second column should give the population.\n\n\n\n\n\npopulation\ndistance\n\n\n\nA\n12.4\n\n\nB\n12.6\n\n\nA\n11.2\n\n\nB\n11.3\n\n\nA\n11.6\n\n\nB\n12.1\n\n\nA\n12.3\n\n\nB\n12.2\n\n\nA\n11.8\n\n\nB\n11.8\n\n\nA\n10.7\n\n\nB\n11.5\n\n\nA\n11.3\n\n\nB\n11.2\n\n\nA\n11.6\n\n\nB\n11.9\n\n\nA\n12.3\n\n\nB\n11.2\n\n\nA\n10.5\n\n\nB\n12.1\n\n\nA\n12.1\n\n\nB\n11.9\n\n\nA\n10.4\n\n\nB\n10.7\n\n\nA\n10.8\n\n\nB\n11.0\n\n\nA\n11.9\n\n\nB\n12.2\n\n\nA\n10.9\n\n\nB\n12.6\n\n\nA\n10.8\n\n\nB\n11.6\n\n\nA\n10.4\n\n\nB\n10.7\n\n\nA\n12.0\n\n\nB\n12.4\n\n\nA\n11.7\n\n\nB\n11.8\n\n\nA\n11.3\n\n\nB\n11.1\n\n\nA\n11.5\n\n\nB\n12.9\n\n\nA\n11.8\n\n\nB\n11.9\n\n\nA\n10.3\n\n\nB\n11.1\n\n\nA\n10.3\n\n\nB\n12.2\n\n\nA\n11.5\n\n\nB\n11.8\n\n\nA\n10.7\n\n\nB\n11.5\n\n\nA\n11.3\n\n\nB\n11.2\n\n\nA\n11.6\n\n\nB\n11.9\n\n\nA\n13.3\n\n\nB\n11.2\n\n\nA\n10.7\n\n\nB\n11.1\n\n\nA\n12.1\n\n\nB\n11.6\n\n\nA\n10.2\n\n\nB\n12.7\n\n\nA\n10.8\n\n\nB\n11.0\n\n\nA\n11.4\n\n\nB\n12.2\n\n\nA\n10.9\n\n\nB\n11.3\n\n\nA\n10.3\n\n\nB\n11.6\n\n\nA\n10.4\n\n\nB\n12.2\n\n\nA\n10.0\n\n\nB\n12.4\n\n\nA\n11.2\n\n\nB\n11.3\n\n\nA\n11.3\n\n\nB\n11.1\n\n\n\n\n\n\n\nData which is in tidy format is easier to summarise, analyses and plot because the organisation matches the conceptual structure of the data:\n\nit is more obvious what the variables are because they columns are named with them - in the untidy format, that the measures are distances is not clear and what A and B are isnā€™t clear\nit is more obvious that there is no relationship between any of the pigeons except for population\nfunctions are designed to work with variables in columns\nTidying data\nWe can put this data in such a format with the pivot_longer() function from the tidyverse:\npivot_longer() collects the values from specified columns (cols) into a single column (values_to) and creates a column to indicate the group (names_to).\n Put the data in tidy format:\n\npigeons <- pivot_longer(data = pigeons, \n cols = everything(), \n names_to = \"population\", \n values_to = \"distance\")\n\nWe have overwritten the original dataframe. If you wanted to keep the original you would need to give a new name on the left side of the assignment <- Note: the data in the file are unchanged." + "section": "Summarising discrete data: Cat coat", + "text": "Summarising discrete data: Cat coat\nThe most appropriate way to summarise nominal data like the colour of cat coats is to tabulate the number of cats with each colour.\n Summarise the cats dataframe by counting the number of cats in each category\n\ncats |> \n group_by(coat) |> \n count()\n\n# A tibble: 6 Ɨ 2\n# Groups: coat [6]\n coat n\n <chr> <int>\n1 black 23\n2 calico 1\n3 ginger 10\n4 tabby 8\n5 tortoiseshell 5\n6 white 15\n\n\n|> is the pipe and can be produced with Ctrl+Shift+M\nThis sort of data might be represented with a barchart. You have two options for producing that barchart:\n\nplot the summary table using geom_col()\nplot the raw data using geom_bar()\n\nWe did the first of these last week. The geom_col() function uses the numbers in a second column to determine how high the bars are. However, the geom_bar() function will do the tabulating for you.\n Plot the coat data using geom_bar:\n\nggplot(cats, aes(x = coat)) +\n geom_bar()\n\n\n\n\nThe gaps that R put automatically between the bars reflects that the coat colours are discrete categories." 
+ }, { + "objectID": "r4babs1/week-8/workshop.html#summarising-counts-bristles", + "href": "r4babs1/week-8/workshop.html#summarising-counts-bristles", + "title": "Workshop", + "section": "Summarising Counts: Bristles", + "text": "Summarising Counts: Bristles\nCounts are discrete and can be thought of as categories with an order (ordinal).\n Summarise the fly_bristles dataframe by counting the number of flies in each category of bristle number\nSince counts are numbers, we might also want to calculate some summary statistics such as the median and interquartile range.\n Summarise the fly_bristles dataframe by calculating the median and interquartile range\n\nfly_bristles |> \n summarise(median(number),\n IQR(number))\n\n# A tibble: 1 × 2\n `median(number)` `IQR(number)`\n <dbl> <dbl>\n1 6 4\n\n\nAs the interquartile range is 4 and the median is 6, 25% of flies have 4 bristles or fewer and 25% have 8 or more.\nThe distribution of counts4 is not symmetrical for lower counts so the mean is not usually a good way to summarise count data.\n If you want to save the table you created and give the columns better names you can make two adjustments:\n\nfly_bristles_summary <- fly_bristles |> \n summarise(med = median(number),\n interquartile = IQR(number))\n\n Plot the bristles data using geom_bar:\nIf counts have a high mean and big range, like the number of hairs on a person’s head, then you can often treat them as continuous. This means you can use statistics like the mean and standard deviation to summarise them, histograms to plot them and use some standard statistical tests on them." }, { "objectID": "r4babs1/week-8/workshop.html#summarising-continuous-data", "href": "r4babs1/week-8/workshop.html#summarising-continuous-data", "title": "Workshop", "section": "Summarising continuous data", "text": "Summarising continuous data\nCat mass\nThe variable mass in the cats dataframe is continuous. Very many continuous variables have a normal distribution. The normal distribution is also known as the bell-shaped curve. If we had the mass of all the cats in the world, we would find many cats were near the mean and fewer would be away from the mean, either much lighter or much heavier. In fact 68% would be within one standard deviation of the mean and about 96% would be within two standard deviations.\n\n\n\n\n\n We can find the mean mass with:\n\ncats |> \n summarise(mean = mean(mass))\n\n# A tibble: 1 × 1\n mean\n <dbl>\n1 4.51\n\n\nWe can add any sort of summary by placing it inside the summarise parentheses. Each one is separated by a comma. We did this to find the median and the interquartile range for fly bristles.\n For example, another way to calculate the number of values is to use the length() function:\n\ncats |> \n summarise(mean = mean(mass),\n n = length(mass))\n\n# A tibble: 1 × 2\n mean n\n <dbl> <int>\n1 4.51 62\n\n\n Adapt the code to calculate the mean, the sample size and the standard deviation (sd())\nA single continuous variable can be plotted using a histogram to show the shape of the distribution.\n Plot a histogram of cat mass:\n\nggplot(cats, aes(x = mass)) +\n geom_histogram(bins = 15, colour = \"black\") \n\n\n\n\nNotice that there are no gaps between the bars, which reflects that mass is continuous. bins determines how many groups the variable is divided up into (i.e., the number of bars) and colour sets the colour for the outline of the bars. 
A sample of 62 is a relatively small number of values for plotting a distribution and the number of bins used determines how smooth or normally distributed the values look.\n Experiment with the number of bins. Does the number of bins affect how you view the distribution.\nNext week we will practice summarise and plotting data files with several variables but just to give you a taste, we will find summary statistics about mass for each of the coat types. The group_by() function is used before the summarise() to do calculations for each of the coats:\n\ncats |> \n group_by(coat) |> \n summarise(mean = mean(mass),\n standard_dev = sd(mass))\n\n# A tibble: 6 Ɨ 3\n coat mean standard_dev\n <chr> <dbl> <dbl>\n1 black 4.63 1.33 \n2 calico 2.19 NA \n3 ginger 4.46 1.12 \n4 tabby 4.86 0.444\n5 tortoiseshell 4.50 0.929\n6 white 4.34 1.34 \n\n\nYou can read this as:\n\ntake cats and then group by coat and then summarise by finding the mean of mass and the standard deviation of mass\n\n Why do we get an NA for the standard deviation of the calico cats?\n\n\n\nCells\n Summarise the cells dataframe by calculating the mean, median, sample size and standard deviation of FSC.\n Add a column for the standard error which is given by \\(\\frac{s.d.}{\\sqrt{n}}\\)\nMeans of counts\nMany things are quite difficult to measure or count and in these cases we often do technical replicates. A technical replicate allows us the measure the exact same thing to check how variable the measurement process is. For example, Drosophila are small and counting their sternopleural bristles is tricky. In addition, where a bristle is short (young) or broken scientists might vary in whether they count it. Or people or machines might vary in measuring the concentration of the same solution.\nWhen we do technical replicates we calculate their mean and use that as the measure. This is what is in our fly_bristles_means dataframe - the bristles of each of the 96 flies was counted by 5 people and the data are those means. These has an impact on how we plot and summarise the dataset because the distribution of mean counts is continuous! We can use means, standard deviations and histograms. This will be an exercise in Consolidate." }, { - "objectID": "pgt52m/week-4/workshop.html#ulna-and-height", - "href": "pgt52m/week-4/workshop.html#ulna-and-height", + "objectID": "r4babs1/week-8/workshop.html#look-after-future-you", + "href": "r4babs1/week-8/workshop.html#look-after-future-you", "title": "Workshop", - "section": "Ulna and height", - "text": "Ulna and height\nThe datasets we have used up to this point, have had a continuous variable and a categorical variable where it makes sense to summarise the response for each of the different groups in the categorical variable and plot the response on the y-axis. We will now summarise a dataset with two continuous variables. The data in height.txt are the ulna length (cm) and height (m) of 30 people. In this case, it is more appropriate to summarise both of thee variables and to plot them as a scatter plot.\nWe will use summarise() again but we do not need the group_by() function this time. 
We will also need to use each of the summary functions, such as mean(), twice, once for each variable.\nImport\n Save height.txt to your data-raw folder\n Read the data into a dataframe called ulna_heights.\nSummarise\n Create a data frame called ulna_heights_summary that contains the sample size and means, standard deviations and standard errors for both variables.\n\nulna_heights_summary <- ulna_heights %>%\n summarise(n = length(ulna),\n mean_ulna = mean(ulna),\n std_ulna = sd(ulna),\n se_ulna = std_ulna/sqrt(n),\n mean_height = mean(height),\n std_height = sd(height),\n se_height = std_height/sqrt(n))\n\nYou should get the following numbers:\n\n\n\n\nn\nmean_ulna\nstd_ulna\nse_ulna\nmean_height\nstd_height\nse_height\n\n\n30\n24.72\n4.137332\n0.75537\n1.494\n0.2404823\n0.0439059\n\n\n\n\nVisualise\nTo plot make a scatter plot we need to use geom_point() again but without any scatter. In this case, it does not really matter which variable is on the x-axis and which is on the y-axis.\n Make a simple scatter plot\n\nggplot(data = ulna_heights, aes(x = ulna, y = height)) +\n geom_point()\n\n\n\n\nIf you have time, you may want to format the figure more appropriately.\n\n\nYouā€™re finished!" + "section": "Look after future you!", + "text": "Look after future you!\nFuture you is going to summarise and plot data from the ā€œRiver practicalsā€. You can make this much easier by documenting what you have done now. At the moment all of your code from this workshop is in a single file, probably called analysis.R. I recommend making a new script for each of nominal, continuous and count data and copying the code which imports, summarises and plots it. This will make it easier for future you to find the code you need. Here is an example: nominal_data.R. You may wish to comment your version much more.\nYouā€™re finished!" }, { - "objectID": "pgt52m/week-4/study_after_workshop.html", - "href": "pgt52m/week-4/study_after_workshop.html", - "title": "Independent Study to consolidate this week", - "section": "", - "text": "Set up\nIf you have just opened RStudio you will want to load the packages and import the data.\n\nlibrary(tidyverse)\nlibrary(readxl)\n\n\nšŸ’» Summarise and plot the pigeons dataframe appropriately.\n\n\nCode# import\npigeons <- read_table(\"data-raw/pigeon.txt\")\n\n# reformat to tidy\npigeons <- pivot_longer(data = pigeons, \n cols = everything(), \n names_to = \"population\", \n values_to = \"distance\")\n\n# sumnmarise\npigeons_summary <- pigeons %>%\n group_by(population) %>%\n summarise(mean = mean(distance),\n std = sd(distance),\n n = length(distance),\n se = std/sqrt(n))\n# plot\nggplot() +\n geom_point(data = pigeons, aes(x = population, y = distance),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"gray50\") +\n geom_errorbar(data = pigeons_summary, \n aes(x = population, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = pigeons_summary, \n aes(x = population, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_y_continuous(name = \"Interorbital distance (mm)\", \n limits = c(0, 14), \n expand = c(0, 0)) +\n scale_x_discrete(name = \"Population\") +\n theme_classic()\n\n\n\n\nšŸ’» The data in blood.csv are measurements of several blood parameters from fifty people with Crohnā€™s disease, a lifelong condition where parts of the digestive system become inflamed. Twenty-five of people are in the early stages of diagnosis and 25 have started treatment. 
The variables in the dataset are:\n\nsodium - Sodium concentration in umol/L, the average of 5 technical replicates\npotassium - Potassium concentration in umol/L, the average of 5 technical replicates\nB12 Vitamin - B12 in pmol/L, the average of 5 technical replicates\nwbc - White blood cell count in 10^9 /L, the average of 5 technical replicates\nrbc count - Red blood cell count in 10^12 /L, the average of 5 technical replicates\nplatlet count - platlet count in 10^9 /L, the average of 5 technical replicates\ninflammation marker - the presence or absence of a marker of inflammation, either 0 or 1\nstatus - whether the individual is before or after treatment.\n\nYour task is to summarise and plot these data in any suitable way. Create a complete RStudio Project for an analysis of these data. You will need to:\n\nMake a new project\nMake folders for data and for figures\nImport the data\nSummarise and plot variables of your choice. It doesnā€™t matter what you chose - the goal is the practice the project workflow and selecting appropriate plotting and summarising methods for particular data sets." + "objectID": "r4babs1/week-8/workshop.html#footnotes", + "href": "r4babs1/week-8/workshop.html#footnotes", + "title": "Workshop", + "section": "Footnotes", + "text": "Footnotes\n\nPlain text files can be opened in notepad or other similar editor and still be readable.ā†©ļøŽ\nDo not be tempted to import data this way. Unless you are careful, your data import will not be scripted or will not be scripted correctly.ā†©ļøŽ\nnote read_csv() and read_table() are the same functions with some different settings.ā†©ļøŽ\nCount data are usually ā€œPoissonā€ distributed.ā†©ļøŽ" }, { - "objectID": "pgt52m/week-5/workshop.html", - "href": "pgt52m/week-5/workshop.html", - "title": "Workshop", + "objectID": "r4babs1/week-9/study_before_workshop.html", + "href": "r4babs1/week-9/study_before_workshop.html", + "title": "Independent Study to prepare for workshop", "section": "", - "text": "Artwork by Horst (2023): ā€œlove this classā€\n\n\nIn this session you will remind yourself how to import files, and calculate confidence intervals on large and small samples.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + "text": "šŸ“– Read From importing to reporting. 
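For the blood.csv task above, a minimal sketch of the suggested project workflow, assuming the file is saved in data-raw and that the columns include sodium and status as described (check names(blood) after importing because the spellings used in the file may differ):

library(tidyverse)

# import
blood <- read_csv("data-raw/blood.csv")

# summarise one variable for each treatment status
blood_summary <- blood |>
  group_by(status) |>
  summarise(mean = mean(sodium),
            std = sd(sodium),
            n = length(sodium),
            se = std / sqrt(n))

# plot the same variable
ggplot(data = blood, aes(x = status, y = sodium)) +
  geom_violin() +
  theme_classic()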
The first part of this chapter is about data import which we covered in the last workshop. You may be able to skip that part or you may find it useful to revise. The section on Summarising data will be mainly new." }, { - "objectID": "pgt52m/week-5/workshop.html#session-overview", - "href": "pgt52m/week-5/workshop.html#session-overview", - "title": "Workshop", + "objectID": "r4babs1/week-9/overview.html", + "href": "r4babs1/week-9/overview.html", + "title": "Overview", "section": "", - "text": "In this session you will remind yourself how to import files, and calculate confidence intervals on large and small samples." + "text": "Last week you summarised and plotted single variables. This week you will start plotting data sets with more than one variable. This means you need to be able determine which variable is the response and which is the explanatory. You will find out what is meant by ā€œtidyā€ data and how to perform a simple data tidying task. Finally you will discover how to save your figures and place them in documents.\n\nLearning objectives\n\nsummarise and plot appropriately datasets with more than one variable\nrecognise that variables can be categorised by their role in analysis\nexplain what is meant by ā€˜tidyā€™ data and be able to perform some data tidying tasks.\nsave figures to file\ncreate neat reports which include text and figures\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– From importing to reporting\n\nWorkshop\n\nšŸ’» Summarise and plot datasets with more than one variable.\nšŸ’» Practice with working directories, importing data, formatting figures and the pipe\nšŸ’» Lay out text, figures and figure legends in documents\n\nConsolidate\n\nšŸ’» Summarise and plot a dataframe from the workshop\nšŸ’» Practice the complete RStudio Project worklfow for a new dataset" }, { - "objectID": "pgt52m/week-5/workshop.html#philosophy", - "href": "pgt52m/week-5/workshop.html#philosophy", - "title": "Workshop", + "objectID": "r4babs4/r4babs4.html", + "href": "r4babs4/r4babs4.html", + "title": "Data Analysis in R for BABS 4", "section": "", - "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." 
+ "text": "This is the last of the four BABS modules.\ncore strand specific: immunology\n\n\nThe BABS4 Module Learning outcomes that relate to the Data Analysis in R content are:" }, { - "objectID": "pgt52m/week-5/workshop.html#remind-yourself-how-to-import-files", - "href": "pgt52m/week-5/workshop.html#remind-yourself-how-to-import-files", - "title": "Workshop", - "section": "Remind yourself how to import files!", - "text": "Remind yourself how to import files!\nImporting data from files was covered in a previous workshop (Rand 2023) if you need to remind yourself." + "objectID": "r4babs4/r4babs4.html#module-learning-objectives", + "href": "r4babs4/r4babs4.html#module-learning-objectives", + "title": "Data Analysis in R for BABS 4", + "section": "", + "text": "The BABS4 Module Learning outcomes that relate to the Data Analysis in R content are:" }, { - "objectID": "pgt52m/week-5/workshop.html#confidence-intervals-large-samples", - "href": "pgt52m/week-5/workshop.html#confidence-intervals-large-samples", - "title": "Workshop", - "section": "Confidence intervals (large samples)", - "text": "Confidence intervals (large samples)\nThe data in beewing.txt are left wing widths of 100 honey bees (mm). The confidence interval for large samples is given by:\n\\(\\bar{x} \\pm 1.96 \\times s.e.\\)\nWhere 1.96 is the quantile for 95% confidence.\n Save beewing.txt to your data-raw folder.\n Read in the data and check the structure of the resulting dataframe.\n Calculate and assign to variables: the mean, standard deviation and standard error:\n\n# mean\nm <- mean(bee$wing)\n\n# standard deviation\nsd <- sd(bee$wing)\n\n# sample size (needed for the se)\nn <- length(bee$wing)\n\n# standard error\nse <- sd / sqrt(n)\n\n To calculate the 95% confidence interval we need to look up the quantile (multiplier) using qnorm()\n\nq <- qnorm(0.975)\n\nThis should be about 1.96.\n Now we can use it in our confidence interval calculation\n\nlcl <- m - q * se\nucl <- m + q * se\n\n Print the values\n\nlcl\n\n[1] 4.473176\n\nucl\n\n[1] 4.626824\n\n\nThis means we are 95% confident the population mean lies between 4.47 mm and 4.63 mm. The usual way of expressing this is that the mean is 4.55 +/- 0.07 mm\n Between what values would you be 99% confident of the population mean being?" + "objectID": "r4babs4/r4babs4.html#all-strands", + "href": "r4babs4/r4babs4.html#all-strands", + "title": "Data Analysis in R for BABS 4", + "section": "All Strands", + "text": "All Strands\nmany var and obs, examples, PCA, log fc transformation, normalisation, QC gating, missing values, excluding proteins idā€™d from fewer thaan two peptides, FDR, data structures" }, { - "objectID": "pgt52m/week-5/workshop.html#confidence-intervals-small-samples", - "href": "pgt52m/week-5/workshop.html#confidence-intervals-small-samples", - "title": "Workshop", - "section": "Confidence intervals (small samples)", - "text": "Confidence intervals (small samples)\nThe confidence interval for small samples is given by:\n\\(\\bar{x} \\pm \\sf t_{[d.f]} \\times s.e.\\)\nThe only difference between the calculation for small and large sample is the multiple. 
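A minimal sketch of that difference in the multiplier, assuming 95% confidence and, purely for illustration, a sample of 10 values (9 degrees of freedom):

qnorm(0.975)       # large-sample multiplier, approximately 1.96
qt(0.975, df = 9)  # small-sample multiplier, approximately 2.26 - larger, reflecting the extra uncertainty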
For large samples we use the ā€œthe standard normal distributionā€ accessed with qnorm(); for small samples we use the ā€œt distributionā€ assessed with qt().The value returned by q(t) is larger than that returned by qnorm() which reflects the greater uncertainty we have on estimations of population means based on small samples.\nThe fatty acid Docosahexaenoic acid (DHA) is a major component of membrane phospholipids in nerve cells and deficiency leads to many behavioural and functional deficits. The cross sectional area of neurons in the CA 1 region of the hippocampus of normal rats is 155 \\(\\mu m^2\\). A DHA deficient diet was fed to 8 animals and the cross sectional area (csa) of neurons is given in neuron.txt\n Save neuron.txt to your data-raw folder\n Read in the data and check the structure of the resulting dataframe\n Assign the mean to m.\n Calculate and assign the standard error to se.\nTo work out the confidence interval for our sample mean we need to use the t distribution because it is a small sample. This means we need to determine the degrees of freedom (the number in the sample minus one).\n We can assign this to a variable, df, using:\n\ndf <- length(neur$csa) - 1\n\n The t value is found by:\n\nt <- qt(0.975, df = df)\n\nNote that we are using qt() rather than qnorm() but that the probability, 0.975, used is the same. Finally, we need to put our mean, standard error and t value in the equation. \\(\\bar{x} \\pm \\sf t_{[d.f]} \\times s.e.\\).\n The upper confidence limit is:\n\n(m + t * se) |> round(2)\n\n[1] 151.95\n\n\nThe first part of the command, (m + t * se) calculates the upper limit. This is ā€˜pipedā€™ in to the round() function to round the result to two decimal places.\n Calculate the lower confidence limit:\n Given the upper and lower confidence values for the estimate of the population mean, what do you think about the effect of the DHA deficient diet?\n\n\n\n\nYouā€™re finished!" + "objectID": "pgt52m/week-7/study_before_workshop.html", + "href": "pgt52m/week-7/study_before_workshop.html", + "title": "Independent Study to prepare for workshop", + "section": "", + "text": "Prepare\n\nšŸ“– Read Two-Sample tests" }, { - "objectID": "pgt52m/week-5/study_after_workshop.html", - "href": "pgt52m/week-5/study_after_workshop.html", - "title": "Independent Study to consolidate this week", + "objectID": "pgt52m/week-7/overview.html", + "href": "pgt52m/week-7/overview.html", + "title": "Overview", "section": "", - "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’» Adiponectin is exclusively secreted from adipose tissue and modulates a number of metabolic processes. Nicotinic acid can affect adiponectin secretion. 3T3-L1 adipocytes were treated with nicotinic acid or with a control treatment and adiponectin concentration (pg/mL) measured. The data are in adipocytes.txt. Each row represents an independent sample of adipocytes and the first column gives the concentration adiponectin and the second column indicates whether they were treated with nicotinic acid or not. Estimate the mean Adiponectin concentration in each group - this means calculate the sample mean and construct a confidence interval around it for each group. 
This exercise forces you to bring together ideas from this workshop and from previous workshops\n\n\nHow to calculate a confidence intervals (this workshop)\n\nHow to summarise variables in more than one group (previous workshop)\n\n\nCode# data import\nadip <- read_table(\"data-raw/adipocytes.txt\")\n\n# examine the structure\nstr(adip)\n\n# summarise\nadip_summary <- adip %>% \n group_by(treatment) %>% \n summarise(mean = mean(adiponectin),\n sd = sd(adiponectin),\n n = length(adiponectin),\n se = sd/sqrt(n),\n dif = qt(0.975, df = n - 1) * se,\n lower_ci = mean - dif,\n uppp_ci = mean + dif)\n\n\n# we conclude we're 95% certain the mean for the control group is \n# between 4.73 and 6.36 and the mean for the nicotinic group is \n# between 6.52 and 8.50. More usually we might put is like this:\n# the mean for the control group is 5.55 +/- 0.82 and that for the nicotinic group is 7.51 +/- 0.99" + "text": "This week you will how to use and interpret the general linear model when the x variable is categorical and has two groups. Just as with single linear regression, the model puts a line of best through data and the model parameters, the intercept and the slope, have the same in interpretation The intercept is one of the group means and the slope is the difference between that, mean and the other group mean. You will also learn about the non-parametric equivalents - the tests we use when the assumptions of the general linear model are not met.\n\nLearning objectives\nThe successful student will be able to:\n\nunderstand the principles of two-sample tests\nappreciate that two-sample tests with lm() are based on the normal distribution and thus have assumptions\nappropriately select parametric and non-parametric two-sample tests\nappropriately select paired and and unpaired two-sample tests\napply and interpret lm()and wilcox.test()\nevaluate whether the assumptions of lm() are met\nscientifically report a two-sample test result including appropriate figures\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read Two-Sample tests\n\nWorkshop\n\nšŸ’» Parametric two-sample test\nšŸ’» Non-parametric two-sample test\nšŸ’» Parametric paired-sample test\n\nConsolidate\n\nšŸ’» Appropriately test whether a genetic modification was successful in increasing omega 3 fatty acids in Cannabis sativa.\nšŸ’» ā€¦." }, { - "objectID": "pgt52m/week-6/workshop.html", - "href": "pgt52m/week-6/workshop.html", - "title": "Workshop", + "objectID": "pgt52m/week-6/study_before_workshop.html", + "href": "pgt52m/week-6/study_before_workshop.html", + "title": "Independent Study to prepare for workshop", "section": "", - "text": "In this workshop you will get practice in applying, interpreting and reporting single linear regression.\n\n\nArtwork by Horst (2023): ā€œlinear regression dragonsā€\n\n\nIn this session you will carry out, interpret and report on a single linear regression.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. 
Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + "text": "šŸ“– Read What is a statistical model\nšŸ“– Read Single linear regression" }, { - "objectID": "pgt52m/week-6/workshop.html#session-overview", - "href": "pgt52m/week-6/workshop.html#session-overview", - "title": "Workshop", + "objectID": "pgt52m/week-6/overview.html", + "href": "pgt52m/week-6/overview.html", + "title": "Overview", "section": "", - "text": "In this session you will carry out, interpret and report on a single linear regression." + "text": "This week you will be introduced to the idea of a statistical ā€œmodelā€ in general and to general linear model in particular. Our first general linear model will be single linear regression which puts a line of best fit through data so the response can be predicted from the explanatory variable. We will consider the two ā€œparametersā€ estimated by the model (the slope and the intercept) and whether these differ from zero\n\nLearning objectives\nThe successful student will be able to:\n\nexplain what is meant by a statistical model and fitting a model\nknow what the general linear model is and how it relates to regression\nexplain the principle of regression and know when it can be applied\napply and interpret a simple linear regression in R\nevaluate whether the assumptions of regression are met\nscientifically report a regression result including appropriate figures\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read What is a statistical model\nšŸ“– Read Single linear regression\n\nWorkshop\ni.šŸ’» Carry out a single linear regression\nConsolidate\n\nšŸ’» Appropriately analyse the relationsip between juvenile hormone and mandible size in stage beetles\nšŸ’» Appropriately analyse the relationsip between anxiety and performance" }, { - "objectID": "pgt52m/week-6/workshop.html#philosophy", - "href": "pgt52m/week-6/workshop.html#philosophy", - "title": "Workshop", + "objectID": "pgt52m/week-8/study_before_workshop.html", + "href": "pgt52m/week-8/study_before_workshop.html", + "title": "Independent Study to prepare for workshop", "section": "", - "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. 
Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + "text": "Prepare\n\nšŸ“– Read One-way ANOVA and Kruskal-Wallis" }, { - "objectID": "pgt52m/week-6/workshop.html#linear-regression", - "href": "pgt52m/week-6/workshop.html#linear-regression", - "title": "Workshop", - "section": "Linear Regression", - "text": "Linear Regression\nThe data in plant.xlsx is a set of observations of plant growth over two months. The researchers planted the seeds and harvested, dried and weighed a plant each day from day 10 so all the data points are independent of each other.\n Save a copy of plant.xlsx to your data-raw folder and import it.\n What type of variables do you have? Which is the response and which is the explanatory? What is the null hypothesis?\n\n\n\n\n\n\nExploring\n Do a quick plot of the data:\n\nggplot(plant, aes(x = day, y = mass)) +\n geom_point()\n\n\n\n\n What are the assumptions of linear regression? Do these seem to be met?\n\n\n\n\n\n\n\n\n\n\nApplying, interpreting and reporting\n We now carry out a regression assigning the result of the lm() procedure to a variable and examining it with summary().\n\nmod <- lm(data = plant, mass ~ day)\nsummary(mod)\n\n\nCall:\nlm(formula = mass ~ day, data = plant)\n\nResiduals:\n Min 1Q Median 3Q Max \n-32.810 -11.253 -0.408 9.075 48.869 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -8.6834 6.4729 -1.342 0.186 \nday 1.6026 0.1705 9.401 1.5e-12 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 17.92 on 49 degrees of freedom\nMultiple R-squared: 0.6433, Adjusted R-squared: 0.636 \nF-statistic: 88.37 on 1 and 49 DF, p-value: 1.503e-12\n\n\nThe Estimates in the Coefficients table give the intercept (first line) and the slope (second line) of the best fitting straight line. The p-values on the same line are tests of whether that coefficient is different from zero.\nThe F value and p-value in the last line are a test of whether the model as a whole explains a significant amount of variation in the dependent variable. For a single linear regression this is exactly equivalent to the test of the slope against zero.\n What is the equation of the line? What do you conclude from the analysis?\n\n\n\n\n\n Does the line go through (0,0)?\n\n\n\n What percentage of variation is explained by the line?\n\n\nIt might be useful to assign the slope and the intercept to variables in case we need them later. 
They can be accessed in the mod$coefficients variable:\n\nmod$coefficients\n\n(Intercept) day \n -8.683379 1.602606 \n\n\n Assign mod$coefficients[1] to b0 and mod$coefficients[2] to b1:\n\nb0 <- mod$coefficients[1] |> round(2)\nb1 <- mod$coefficients[2] |> round(2)\n\nI also rounded the values to two decimal places.\nChecking assumptions\nWe need to examine the residuals. Very conveniently, the object which is created by lm() contains a variable called $residuals. Also conveniently, R's plot() function can be used on the output objects of lm(). The assumptions demand that each y is drawn from a normal distribution for each x and these normal distributions have the same variance. Therefore we plot the residuals against the fitted values to see if the variance is the same for all the values of x. The fitted - predicted - values are the values on the line of best fit. Each residual is the difference between the fitted value and the observed value.\n Plot the model residuals against the fitted values like this:\n\nplot(mod, which = 1)\n\n\n\n\n What do you conclude?\n\n\n\nTo examine normality of the model residuals we can plot them as a histogram and do a normality test on them.\n Plot a histogram of the residuals:\n\nggplot(mapping = aes(x = mod$residuals)) + \n geom_histogram(bins = 10)\n\n\n\n\n Use the shapiro.test() to test the normality of the model residuals\n\nshapiro.test(mod$residuals)\n\n\n Shapiro-Wilk normality test\n\ndata: mod$residuals\nW = 0.96377, p-value = 0.1208\n\n\nUsually, when we are doing statistical tests we would like the test to be significant because it means we have evidence of a biological effect. However, when doing normality tests we hope it will not be significant. A non-significant result means that there is no significant difference between the distribution of the residuals and a normal distribution and that indicates the assumptions are met.\n What do you conclude?\n\n\n\n\nIllustrating\nWe want a figure with the points and the statistical model, i.e., the best fitting straight line.\n Create a scatter plot using geom_point()\n\nggplot(plant, aes(x = day, y = mass)) +\n geom_point() + \n theme_classic()\n\n\n\n\n The geom_smooth() function will add a variety of fitted lines to a plot. We want a straight line so we need to specify method = \"lm\":\n\nggplot(plant, aes(x = day, y = mass)) +\n geom_point() + \n geom_smooth(method = lm, \n se = FALSE, \n colour = \"black\") +\n theme_classic()\n\n\n\n\n What do the se and colour arguments do? Try changing them.\n Let's add the equation of the line to the figure using annotate():\n\nggplot(plant, aes(x = day, y = mass)) +\n geom_point() +\n geom_smooth(method = lm, \n se = FALSE, \n colour = \"black\") +\n annotate(\"text\", x = 20, y = 110, \n label = \"mass = 1.61 * day - 8.68\") +\n theme_classic()\n\n\n\n\nWe have to tell annotate() what type of geom we want - text in this case - where to put it, and the text we want to appear.\n Improve the axes. You may need to refer back to Changing the axes from the Week 2 workshop\n Save your figure to your figures folder." + "objectID": "pgt52m/week-8/overview.html", + "href": "pgt52m/week-8/overview.html", + "title": "Overview", + "section": "", + "text": "Last week you learnt how to use and interpret the general linear model when the x variable was categorical with two groups. You will now extend that to situations when there are more than two groups. This is often known as the one-way ANOVA (analysis of variance). 
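A minimal sketch of how such a model is fitted and tested in R, using made-up data with three groups (the names dat, group and response are invented for illustration):

# made-up data: three groups of ten observations
dat <- data.frame(group = rep(c("a", "b", "c"), each = 10),
                  response = c(rnorm(10, mean = 10),
                               rnorm(10, mean = 12),
                               rnorm(10, mean = 15)))

# fit the general linear model and get the one-way ANOVA table
mod <- lm(data = dat, response ~ group)
anova(mod)    # F value and p-value for the overall effect of group
summary(mod)  # the intercept is one group mean; the other estimates are differences from it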
You will also learn about the Kruskal-Wallis test which can be used when the assumptions of the general linear model are not met.\n\nLearning objectives\nThe successful student will be able to:\n\nexplain the rationale behind ANOVA and understand the meaning of the F value\nselect, appropriately, one-way ANOVA and Kruskal-Wallis\nknow what functions are used in R to run these tests and how to interpret them\nevaluate whether the assumptions of lm() are met\nscientifically report the results of these tests including appropriate figures\n\n\n\nInstructions\n\nPrepare\n\n📖 Read One-way ANOVA and Kruskal-Wallis\n\nWorkshop\n\n💻 One-way ANOVA\n💻 Kruskal-Wallis\n\nConsolidate\n\n💻 Appropriately test if fitness and acclimation affect the sodium content of sweat\n💻 Appropriately test if insecticides vary in their effectiveness" }, { - "objectID": "pgt52m/week-6/workshop.html#look-after-future-you", - "href": "pgt52m/week-6/workshop.html#look-after-future-you", - "title": "Workshop", - "section": "Look after future you!", - "text": "Look after future you!\nYou're finished!" + "objectID": "pgt52m/week-2/study_before_workshop.html", + "href": "pgt52m/week-2/study_before_workshop.html", + "title": "Independent Study to prepare for workshop", + "section": "", + "text": "Either\n\n📖 Read First Steps in RStudio in\n\nOR\n\n📹 Watch" }, { - "objectID": "pgt52m/week-6/study_after_workshop.html", - "href": "pgt52m/week-6/study_after_workshop.html", - "title": "Independent Study to consolidate this week", + "objectID": "pgt52m/week-2/overview.html", + "href": "pgt52m/week-2/overview.html", + "title": "Overview", "section": "", - "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\n💻 Effect of anxiety status on sporting performance. The data in sprint.txt are from an investigation of the effect of anxiety status on sporting performance. A group of 40 100m sprinters undertook a psychometric test to measure their anxiety shortly before competing. The data are their anxiety scores and the 100m times achieved. What do you conclude from these data?\n\n\nCode# this example is designed to emphasise the importance of plotting your data first\nsprint <- read_table(\"data-raw/sprint.txt\")\n# Anxiety is discrete but ranges from 16 to 402 meaning the gap between possible measures is small and \n# the variable could be treated as continuous if needed. Time is a continuous measure that has decimal places and which we would expect to follow a normal distribution \n\n# explore with a plot\nggplot(sprint, aes(x = anxiety, y = time) ) +\n geom_point()\nCode# A scatterplot of the data clearly reveals that these data are not linear. There is a good relationship between the two variables but since it is not linear, single linear regression is not appropriate.\n\n\n\n💻 Juvenile hormone in stag beetles. The concentration of juvenile hormone in stag beetles is known to influence mandible growth. Groups of stag beetles were injected with different concentrations of juvenile hormone (arbitrary units) and their average mandible size (mm) determined. The experimenters planned to analyse their data with regression. The data are in stag.txt\n\n\n\nCode# read the data in and check the structure\nstag <- read_table(\"data-raw/stag.txt\")\nstr(stag)\n\n# jh is discrete but ordered and has been chosen by the experimenter - it is the explanatory variable. 
\n# the response is mandible size which has decimal places and is something we would expect to be \n# normally distributed. So far, common sense suggests the assumptions of regression are met.\n\n\n\nCode# exploratory plot\nggplot(stag, aes(x = jh, y = mand)) +\n geom_point()\nCode# looks linear-ish on the scatter\n# regression still seems appropriate\n# we will check the other assumptions after we have run the lm\n\n\n\nCode# build the statistical model\nmod <- lm(data = stag, mand ~ jh)\n\n# examine it\nsummary(mod)\n# mand = 0.032*jh + 0.419\n# the slope of the line is significantly different from zero / the jh explains a significant amount of the variation in mand (ANOVA: F = 16.63; d.f. = 1,14; p = 0.00113).\n# the intercept is 0.419 and differs significantly from zero \n\n\n\nCode# checking the assumption\nplot(mod, which = 1) \nCode# we're looking for the variance in the residuals to be equal along the x axis.\n# with a small data set there is some apparent heterogeneity but it doesn't look too.\n# \nhist(mod$residuals)\nCode# We have some skew which again might be partly a result of a small sample size.\nshapiro.test(mod$residuals) # the also test not sig diff from normal\n\n# On balance the use of regression is probably justifiable but it is borderline\n# but ideally the experiment would be better if multiple individuals were measure at\n# each of the chosen juvenile hormone levels.\n\n\n\nCode# a better plot\nggplot(stag, aes(x = jh, y = mand) ) +\n geom_point() +\n geom_smooth(method = lm, se = FALSE, colour = \"black\") +\n scale_x_continuous(name = \"Juvenile hormone (arbitrary units)\",\n expand = c(0, 0),\n limits = c(0, 32)) +\n scale_y_continuous(name = \"Mandible size (mm)\",\n expand = c(0, 0),\n limits = c(0, 2)) +\n theme_classic()" + "text": "This week you will start writing R code in RStudio and will create your first graph! You will learn about data types such as ā€œnumericsā€ and ā€œcharactersā€ and some of the different types of objects in R such as ā€œvectorsā€ and ā€œdataframesā€. These are the building blocks for the rest of your R journey. You will also learn a workflow and about the layout of RStudio and using RStudio Projects.\n\n\n\nArtwork by Horst (2023): ā€œbless this workflowā€\n\n\n\nLearning objectives\nThe successful student will be able to:\n\nuse the R command line as a calculator and to assign variables\ncreate and use the basic data types in R\nfind their way around the RStudio windows\nuse an RStudio Project to organise work\nuse a script to run R commands\ncreate and customise a barplot\nsearch and understand manual pages\n\n\n\nInstructions\n\nPrepare\n\nFirst Steps in RStudio: Either šŸ“– Read the book OR šŸ“¹ Watch two videos\n\nWorkshop\ni.šŸ’» šŸˆ Coat colour of cats. Type in some data, perform calculations on, and plot it.\nConsolidate\n\nšŸ’» Create a plot\nšŸ“– Read Workflow in RStudio\n\n\n\n\n\n\n\nReferences\n\nHorst, Allison. 2023. ā€œData Science Illustrations.ā€ https://allisonhorst.com/allison-horst." 
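A minimal sketch of those first steps - typing in data, combining it into a data frame and plotting it - using made-up numbers of cats of each coat colour:

library(tidyverse)

# made-up data typed in at the command line
coat <- c("black", "white", "tabby", "ginger")
number <- c(23, 15, 8, 10)
cats <- data.frame(coat, number)

# a simple barplot
ggplot(data = cats, aes(x = coat, y = number)) +
  geom_col()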
}, { "objectID": "pgt52m/week-2/rstudio-projects.html#outline", @@ -1225,11 +1253,18 @@ "text": "demo" }, { - "objectID": "pgt52m/week-2/overview.html", - "href": "pgt52m/week-2/overview.html", + "objectID": "pgt52m/week-5/study_before_workshop.html", + "href": "pgt52m/week-5/study_before_workshop.html", + "title": "Independent Study to prepare for workshop", + "section": "", + "text": "šŸ“– Read The logic of hyothesis testing\nšŸ“– Read Confidence Intervals" + }, + { + "objectID": "pgt52m/week-5/overview.html", + "href": "pgt52m/week-5/overview.html", "title": "Overview", "section": "", - "text": "This week you will start writing R code in RStudio and will create your first graph! You will learn about data types such as ā€œnumericsā€ and ā€œcharactersā€ and some of the different types of objects in R such as ā€œvectorsā€ and ā€œdataframesā€. These are the building blocks for the rest of your R journey. You will also learn a workflow and about the layout of RStudio and using RStudio Projects.\n\n\n\nArtwork by Horst (2023): ā€œbless this workflowā€\n\n\n\nLearning objectives\nThe successful student will be able to:\n\nuse the R command line as a calculator and to assign variables\ncreate and use the basic data types in R\nfind their way around the RStudio windows\nuse an RStudio Project to organise work\nuse a script to run R commands\ncreate and customise a barplot\nsearch and understand manual pages\n\n\n\nInstructions\n\nPrepare\n\nFirst Steps in RStudio: Either šŸ“– Read the book OR šŸ“¹ Watch two videos\n\nWorkshop\ni.šŸ’» šŸˆ Coat colour of cats. Type in some data, perform calculations on, and plot it.\nConsolidate\n\nšŸ’» Create a plot\nšŸ“– Read Workflow in RStudio\n\n\n\n\n\n\n\nReferences\n\nHorst, Allison. 2023. ā€œData Science Illustrations.ā€ https://allisonhorst.com/allison-horst." + "text": "This week we will cover the logic of consider the logic of hypothesis testing and type 1 and type 2 errors. We will also find out what the sampling distribution of the mean and the standard error are, and how to calculate confidence intervals.\n\n\n\nArtwork by Horst (2023): ā€œtype 1 errorā€\n\n\n\n\n\nArtwork by Horst (2023): ā€œtype 2 errorā€\n\n\n\nLearning objectives\nThe successful student will be able to:\n\ndemonstrate the process of hypothesis testing with an example\nexplain type 1 and type 2 errors\ndefine the sampling distribution of the mean and the standard error\nexplain what a confidence interval is\ncalculate confidence intervals for large and small samples\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read The logic of hyothesis testing\nšŸ“– Read Confidence Intervals\n\nWorkshop\n\nšŸ’» Remind yourself how to import files\nšŸ’» Calculate confidence intervals on large\nšŸ’» Calculate confidence intervals on small samples.\n\nConsolidate\n\nšŸ’» Calculate confidence intervals for each group in a data set\n\n\n\n\n\n\n\nReferences\n\nHorst, Allison. 2023. ā€œData Science Illustrations.ā€ https://allisonhorst.com/allison-horst." }, { "objectID": "pgt52m/week-1/study_before_workshop.html", @@ -1246,207 +1281,151 @@ "text": "This week you will carry out some independent study to ensure you have some understanding of computer file systems. 
We will introduce you to the concepts of paths and working directories.\n\n\n\nArtwork by Horst (2023): ā€œcode gets the blameā€\n\n\n\nLearning objectives\nThe parentheses after each learning objective indicate where the content covers that objective.\nThe successful student will be able to:\n\nexplain what an operating system is\nexplain the organisation of files and directories in a file systems\nexplain what a file is and give some common files types\nexplain what is meant by a plain text file\nexplain the relationship between the file extensions, the file format and associations with programs\nuse a file manager\nexplain root, home and working directories\nexplain absolute and relative file paths\nknow what R and RStudio are\nknow how to organise their work\n\n\n\nInstructions\n\nPrepare\n\nWatch an Introduction to Data Analysis in R for BABS 1 - 4\nRead What they forgot to teach you about computers\nRead What are R and Rstudio?\n\nWorkshop\n\nOptional: Install R and RStudio\n\nConsolidate\n\n\n\n\n\n\nReferences\n\nHorst, Allison. 2023. ā€œData Science Illustrations.ā€ https://allisonhorst.com/allison-horst." }, { - "objectID": "pgt52m/week-7/study_before_workshop.html", - "href": "pgt52m/week-7/study_before_workshop.html", + "objectID": "pgt52m/week-4/study_before_workshop.html", + "href": "pgt52m/week-4/study_before_workshop.html", "title": "Independent Study to prepare for workshop", "section": "", - "text": "Prepare\n\nšŸ“– Read Two-Sample tests" + "text": "šŸ“– Read From importing to reporting. The first part of this chapter is about data import which we covered in the last workshop. You may be able to skip that part or you may find it useful to revise. The section on Summarising data will be mainly new." }, { - "objectID": "pgt52m/week-7/overview.html", - "href": "pgt52m/week-7/overview.html", + "objectID": "pgt52m/week-4/overview.html", + "href": "pgt52m/week-4/overview.html", "title": "Overview", "section": "", - "text": "This week you will how to use and interpret the general linear model when the x variable is categorical and has two groups. Just as with single linear regression, the model puts a line of best through data and the model parameters, the intercept and the slope, have the same in interpretation The intercept is one of the group means and the slope is the difference between that, mean and the other group mean. You will also learn about the non-parametric equivalents - the tests we use when the assumptions of the general linear model are not met.\n\nLearning objectives\nThe successful student will be able to:\n\nunderstand the principles of two-sample tests\nappreciate that two-sample tests with lm() are based on the normal distribution and thus have assumptions\nappropriately select parametric and non-parametric two-sample tests\nappropriately select paired and and unpaired two-sample tests\napply and interpret lm()and wilcox.test()\nevaluate whether the assumptions of lm() are met\nscientifically report a two-sample test result including appropriate figures\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read Two-Sample tests\n\nWorkshop\n\nšŸ’» Parametric two-sample test\nšŸ’» Non-parametric two-sample test\nšŸ’» Parametric paired-sample test\n\nConsolidate\n\nšŸ’» Appropriately test whether a genetic modification was successful in increasing omega 3 fatty acids in Cannabis sativa.\nšŸ’» ā€¦." + "text": "Last week you summarised and plotted single variables. This week you will start plotting data sets with more than one variable. 
This means you need to be able determine which variable is the response and which is the explanatory. You will find out what is meant by ā€œtidyā€ data and how to perform a simple data tidying task. Finally you will discover how to save your figures and place them in documents.\n\nLearning objectives\n\nsummarise and plot appropriately datasets with more than one variable\nrecognise that variables can be categorised by their role in analysis\nexplain what is meant by ā€˜tidyā€™ data and be able to perform some data tidying tasks.\nsave figures to file\ncreate neat reports which include text and figures\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– From importing to reporting\n\nWorkshop\n\nšŸ’» Summarise and plot datasets with more than one variable.\nšŸ’» Practice with working directories, importing data, formatting figures and the pipe\nšŸ’» Lay out text, figures and figure legends in documents\n\nConsolidate\n\nšŸ’» Summarise and plot a dataframe from the workshop\nšŸ’» Practice the complete RStudio Project worklfow for a new dataset" }, { - "objectID": "pgt52m/pgt52m.html", - "href": "pgt52m/pgt52m.html", - "title": "52M Data Analysis in R", + "objectID": "pgt52m/week-3/study_before_workshop.html", + "href": "pgt52m/week-3/study_before_workshop.html", + "title": "Independent Study to prepare for workshop", "section": "", - "text": "This module introduces you to data analysis in R. The first 4 weeks covers core concepts about scientific computing, types of variable, the role of variables in analysis and how to use RStudio to organise analysis and import, summarise and plot data. In weeks 5 to 8, you will learn about the logic of hypothesis testing, confidence intervals, what is meant by a statistical model, two-sample tests and one-way analysis of variance (ANOVA). You will learn how to write reproducible reports in Quarto in weeks 9 and 10. Finally, there will be a drop-in for your questions in week 11.\nThis module complement the work you will do in BIO00070M Research, Professional and Team Skills where you will you will learn how to organise reproducible data analyses using a project-oriented workflow and analyses RNA sequence data. 
It will be important to use the skills and tools you learn in 52M and apply them in 70M.\n\n\nThe Module Learning outcomes are:\n\nExplain the purpose of data analysis and the rationale for scripting analysis in the biosciences\nRecognise when statistics such as t-tests, one-way ANOVA, correlation and regression can be applied, and use R to perform these analyses on data in a variety of formats\nSummarise data in single or multiple groups, recognise tidy data formats, and carry out some typical data tidying tasks\nUse markdown (through Quarto) to produce reproducible analyses, figures and reports" + "text": "šŸ“– Read Ideas about data" }, { - "objectID": "pgt52m/pgt52m.html#module-learning-objectives", - "href": "pgt52m/pgt52m.html#module-learning-objectives", - "title": "52M Data Analysis in R", + "objectID": "pgt52m/week-3/overview.html", + "href": "pgt52m/week-3/overview.html", + "title": "Overview", "section": "", - "text": "The Module Learning outcomes are:\n\nExplain the purpose of data analysis and the rationale for scripting analysis in the biosciences\nRecognise when statistics such as t-tests, one-way ANOVA, correlation and regression can be applied, and use R to perform these analyses on data in a variety of formats\nSummarise data in single or multiple groups, recognise tidy data formats, and carry out some typical data tidying tasks\nUse markdown (through Quarto) to produce reproducible analyses, figures and reports" - }, - { - "objectID": "pgt52m/pgt52m.html#week-1-understanding-file-systems", - "href": "pgt52m/pgt52m.html#week-1-understanding-file-systems", - "title": "52M Data Analysis in R", - "section": "Week 1: Understanding file systems", - "text": "Week 1: Understanding file systems\nYou will learn about operating systems, files and file systems, working directories, absolute and relative paths, what R and RStudio are" - }, - { - "objectID": "pgt52m/pgt52m.html#week-2-introduction-to-r-and-project-organisation", - "href": "pgt52m/pgt52m.html#week-2-introduction-to-r-and-project-organisation", - "title": "52M Data Analysis in R", - "section": "Week 2: Introduction to R and project organisation", - "text": "Week 2: Introduction to R and project organisation\nYou will start writing R code in RStudio and will create your first graph! You will learn about data types such as ā€œnumericsā€ and ā€œcharactersā€ and some of the different types of objects in R such as ā€œvectorsā€ and ā€œdataframesā€. These are the building blocks for the rest of your R journey. You will also learn a workflow and about the layout of RStudio and using RStudio Projects." - }, - { - "objectID": "pgt52m/pgt52m.html#week-3-types-of-variable-summarising-and-plotting-data", - "href": "pgt52m/pgt52m.html#week-3-types-of-variable-summarising-and-plotting-data", - "title": "52M Data Analysis in R", - "section": "Week 3: Types of variable, summarising and plotting data", - "text": "Week 3: Types of variable, summarising and plotting data\nThe type of values our data can take is important in how we analyse and visualise it. This week you will learn the difference between continuous and discrete values and how we summarise and visualise them. The focus will be on plotting and summarising single variables. You will also learn how to read in data in to RStudio from plain text files and Excel files." 
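A minimal sketch of those imports; the file names here are only examples, and the files are assumed to be saved in a data-raw folder:

library(tidyverse)  # provides read_csv() and read_table()
library(readxl)     # provides read_excel()

cells  <- read_csv("data-raw/cells.csv")      # comma-separated plain text
bees   <- read_table("data-raw/beewing.txt")  # whitespace-separated plain text
plants <- read_excel("data-raw/plant.xlsx")   # first sheet of an Excel workbook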
- }, - { - "objectID": "pgt52m/pgt52m.html#week-4-summarising-data-with-several-variables", - "href": "pgt52m/pgt52m.html#week-4-summarising-data-with-several-variables", - "title": "52M Data Analysis in R", - "section": "Week 4: Summarising data with several variables", - "text": "Week 4: Summarising data with several variables\nThis week you will start plotting data sets with more than one variable. This means you need to be able determine which variable is the response and which is the explanatory. You will find out what is meant by ā€œtidyā€ data and how to perform a simple data tidying task. Finally you will discover how to save your figures and place them in documents." - }, - { - "objectID": "pgt52m/pgt52m.html#week-5-the-logic-of-hypothesis-testing-and-ci", - "href": "pgt52m/pgt52m.html#week-5-the-logic-of-hypothesis-testing-and-ci", - "title": "52M Data Analysis in R", - "section": "Week 5: The logic of hypothesis testing and CI", - "text": "Week 5: The logic of hypothesis testing and CI\nThis week we will cover the logic of consider the logic of hypothesis testing and type 1 and type 2 errors. We will also find out what the sampling distribution of the mean and the standard error are, and how to calculate confidence intervals." - }, - { - "objectID": "pgt52m/pgt52m.html#week-6-introduction-to-statistical-models-single-regression", - "href": "pgt52m/pgt52m.html#week-6-introduction-to-statistical-models-single-regression", - "title": "52M Data Analysis in R", - "section": "Week 6: Introduction to statistical models: Single regression", - "text": "Week 6: Introduction to statistical models: Single regression\nThis week you will be introduced to the idea of a statistical ā€œmodelā€ in general and to general linear model in particular. Our first general linear model will be single linear regression which puts a line of best fit through data so the response can be predicted from the explanatory variable. We will consider the two ā€œparametersā€ estimated by the model (the slope and the intercept) and whether these differ from zero" - }, - { - "objectID": "pgt52m/pgt52m.html#week-7-two-sample-tests", - "href": "pgt52m/pgt52m.html#week-7-two-sample-tests", - "title": "52M Data Analysis in R", - "section": "Week 7: Two-sample tests", - "text": "Week 7: Two-sample tests\nThis week you will how to use and interpret the general linear model when the x variable is categorical and has two groups. Just as with single linear regression, the model puts a line of best through data and the model parameters, the intercept and the slope, have the same in interpretation The intercept is one of the group means and the slope is the difference between that, mean and the other group mean. You will also learn about the non-parametric equivalents - the tests we use when the assumptions of the general linear model are not met." - }, - { - "objectID": "pgt52m/pgt52m.html#week-8-one-way-anova-and-kruskal-wallis", - "href": "pgt52m/pgt52m.html#week-8-one-way-anova-and-kruskal-wallis", - "title": "52M Data Analysis in R", - "section": "Week 8: One-way ANOVA and Kruskal-Wallis", - "text": "Week 8: One-way ANOVA and Kruskal-Wallis\nLast week you learnt how to use and interpret the general linear model when the x variable was categorical with two groups. You will now extend that to situations when there are more than two groups. This is often known as the one-way ANOVA (analysis of variance). 
You will also learn about the Kruskal- Wallis test which can be used when the assumptions of the general linear model are not met." - }, - { - "objectID": "pgt52m/pgt52m.html#week-9-assessment-intro", - "href": "pgt52m/pgt52m.html#week-9-assessment-intro", - "title": "52M Data Analysis in R", - "section": "Week 9: Assessment intro", - "text": "Week 9: Assessment intro\nReproducible analysis of some relevant data." + "text": "The type of values our data can take is important in how we analyse and visualise it. This week you will learn the difference between continuous and discrete values and how we summarise and visualise them. You will also learn about the ā€œnormal distributionā€ which is the most important continuous distribution.\n\n\n\nDiscrete variable\n\n\n\nLearning objectives\nThe successful student will be able to:\n\ndistinguish between continuous, discrete, nominal and ordinal variable\nread in data in to RStudio from a plain text file and Excel files\nsummarise and plot variables appropriately for the data type\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read: Ideas about data\n\nWorkshop\n\nšŸ’» Importing data\nšŸ’» Summarising discrete data\nšŸ’» Summarising count data\nšŸ’» Summarising continuous data\n\nConsolidate\n\nšŸ’» Summarise some data\nšŸ’» Plot some data\nšŸ’» Format a plot (1)\nšŸ’» Format a plot (2)\nšŸ“– Read Understanding the pipe |>" }, { - "objectID": "pgt52m/pgt52m.html#week-10-reproducible-reporting", - "href": "pgt52m/pgt52m.html#week-10-reproducible-reporting", - "title": "52M Data Analysis in R", - "section": "Week 10: Reproducible Reporting", - "text": "Week 10: Reproducible Reporting\nUsing Quarto" + "objectID": "r4babs2/week-6/study_before_workshop.html", + "href": "r4babs2/week-6/study_before_workshop.html", + "title": "Independent Study to prepare for workshop", + "section": "", + "text": "Prepare\n\nEither šŸ“– Read xxxxx in OR šŸ“¹ Watch" }, { - "objectID": "pgt52m/pgt52m.html#week-11-drop-in", - "href": "pgt52m/pgt52m.html#week-11-drop-in", - "title": "52M Data Analysis in R", - "section": "Week 11: Drop-in", - "text": "Week 11: Drop-in" + "objectID": "r4babs2/week-6/overview.html", + "href": "r4babs2/week-6/overview.html", + "title": "Overview", + "section": "", + "text": "This week you will\n\nLearning objectives\nThe successful student will be able to:\n\n\n\n\n\n\n\n\nInstructions\n\nPrepare\n\nšŸ“– Read the book OR šŸ“¹ Watch two videos\n\nWorkshop\ni.šŸ’»\nConsolidate\n\nšŸ’»\nšŸ“– Read" }, { - "objectID": "r4babs2/week-3/workshop.html", - "href": "r4babs2/week-3/workshop.html", - "title": "Workshop", + "objectID": "r4babs2/r4babs2.html", + "href": "r4babs2/r4babs2.html", + "title": "Data Analysis in R for BABS 2", "section": "", - "text": "Artwork by Horst (2023): ā€œHow much I think I know about Rā€\n\n\nIn this workshop you will get practice in choosing between, performing, and presenting the results of, two-sample tests and their non-parametric equivalents in R.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. 
Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + "text": "This is the second of the four BABS modules. Over six weeks you will learn about the logic of hypothesis testing, confidence intervals, what is meant by a statistical model, two-sample tests and one- and two-way analysis of variance (ANOVA).\n\n\nThe BABS2 Module Learning outcomes that relate to the Data Analysis in R content are:\n\nThink creatively to address a Grand Challenge by designing investigations with testable hypotheses and rigorous controls\nAppropriately select classical univariate statistical tests and some non-parametric equivalents to a given scenario and recognise when these are not suitable\nUse R to perform these analyses, reproducibly, on data in a variety of formats and present the results graphically\nCommunicate research in scientific reports and via oral presentation." }, { - "objectID": "r4babs2/week-3/workshop.html#session-overview", - "href": "r4babs2/week-3/workshop.html#session-overview", - "title": "Workshop", + "objectID": "r4babs2/r4babs2.html#module-learning-objectives", + "href": "r4babs2/r4babs2.html#module-learning-objectives", + "title": "Data Analysis in R for BABS 2", "section": "", - "text": "In this workshop you will get practice in choosing between, performing, and presenting the results of, two-sample tests and their non-parametric equivalents in R." + "text": "The BABS2 Module Learning outcomes that relate to the Data Analysis in R content are:\n\nThink creatively to address a Grand Challenge by designing investigations with testable hypotheses and rigorous controls\nAppropriately select classical univariate statistical tests and some non-parametric equivalents to a given scenario and recognise when these are not suitable\nUse R to perform these analyses, reproducibly, on data in a variety of formats and present the results graphically\nCommunicate research in scientific reports and via oral presentation." }, { - "objectID": "r4babs2/week-3/workshop.html#philosophy", - "href": "r4babs2/week-3/workshop.html#philosophy", - "title": "Workshop", - "section": "", - "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. 
Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + "objectID": "r4babs2/r4babs2.html#the-logic-of-hypothesis-testing-and-cis", + "href": "r4babs2/r4babs2.html#the-logic-of-hypothesis-testing-and-cis", + "title": "Data Analysis in R for BABS 2", + "section": "The logic of hypothesis testing and CIs", + "text": "The logic of hypothesis testing and CIs" }, { - "objectID": "r4babs2/week-3/workshop.html#adiponectin-secretion", - "href": "r4babs2/week-3/workshop.html#adiponectin-secretion", - "title": "Workshop", - "section": "Adiponectin secretion", - "text": "Adiponectin secretion\nAdiponectin is exclusively secreted from adipose tissue and modulates a number of metabolic processes. Nicotinic acid can affect adiponectin secretion. 3T3-L1 adipocytes were treated with nicotinic acid or with a control treatment and adiponectin concentration (pg/mL) measured. The data are in adipocytes.txt. Each row represents an independent sample of adipocytes and the first column gives the concentration adiponectin and the second column indicates whether they were treated with nicotinic acid or not.\n Save a copy of adipocytes.txt to data-raw\n Read in the data and check the structure. I used the name adip for the dataframe/tibble.\nWe have a tibble containing two variables: adiponectin is the response and is continuous and treatment is explanatory. treatment is categorical with two levels (groups). The first task is visualise the data to get an overview. For continuous response variables with categorical explanatory variables you could use geom_point(), geom_boxplot() or a variety of other geoms. I often use geom_violin() which allows us to see the distribution - the violin is fatter where there are more data points.\n Do a quick plot of the data:\n\nggplot(data = adip, aes(x = treatment, y = adiponectin)) +\n geom_violin()\n\n\n\n\nSummarising the data\nSummarising the data for each treatment group is the next sensible step. The most useful summary statistics are the means, standard deviations, sample sizes and standard errors.\n Create a data frame called adip_summary that contains the means, standard deviations, sample sizes and standard errors for the control and nicotinic acid treated samples. 
You may need to refer to Summarise from the Week 9 workshop of BABS1 (Rand 2023)\nYou should get the following numbers:\n\n\n\n\ntreatment\nmean\nstd\nn\nse\n\n\n\ncontrol\n5.546000\n1.475247\n15\n0.3809072\n\n\nnicotinic\n7.508667\n1.793898\n15\n0.4631824\n\n\n\n\n\nSelecting a test\n Do you think this is a paired-sample test or a two-sample test?\n\n\n\n\nApplying, interpreting and reporting\n Create a two-sample model like this:\n\nmod <- lm(data = adip,\n adiponectin ~ treatment)\n\n Examine the model with:\n\nsummary(mod)\n\n\nCall:\nlm(formula = adiponectin ~ treatment, data = adip)\n\nResiduals:\n Min 1Q Median 3Q Max \n-4.3787 -1.0967 0.1927 1.0245 3.1113 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 5.5460 0.4240 13.079 1.9e-13 ***\ntreatmentnicotinic 1.9627 0.5997 3.273 0.00283 ** \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 1.642 on 28 degrees of freedom\nMultiple R-squared: 0.2767, Adjusted R-squared: 0.2509 \nF-statistic: 10.71 on 1 and 28 DF, p-value: 0.00283\n\n\n What do you conclude from the test? Write your conclusion in a form suitable for a report.\n\n\n\n\nCheck assumptions\nThe assumptions of the general linear model are that the residuals – the differences between the predicted values (i.e., the group means) and the observed values – are normally distributed and have homogeneous variance. To check these we can examine the mod$residuals variable. You may want to refer to Checking assumptions in the "Single regression" workshop.\n Plot the model residuals against the fitted values (a minimal sketch of these checks is given at the end of this section).\n What do you conclude?\n\n\n\nTo examine normality of the model residuals we can plot them as a histogram and do a normality test on them.\n Plot a histogram of the residuals.\n Use the shapiro.test() to test the normality of the model residuals.\n What do you conclude?\n\n\n\n\nIllustrating\n Create a figure like the one below. You may need to refer to Visualise from the "Summarising data with several variables" workshop (Rand 2023)\n\n\n\n\n\nWe now need to annotate the figure with the results from the statistical test. This is most commonly done with a line linking the means being compared and the p-value. The annotate() function can be used to draw the line and then to add the value. The line is a segment and the p-value is a text.\n Add annotation to the figure by adding:\n...... +\n annotate(\"segment\", x = 1, xend = 2, \n y = 11.3, yend = 11.3,\n colour = \"black\") +\n annotate(\"text\", x = 1.5, y = 11.7, \n label = expression(italic(p)~\"= 0.003\")) +\n theme_classic()\n\n\n\n\n\nFor the segment, annotate() needs the x and y coordinates for the start and the finish of the line.\nThe use of expression() allows you to specify formatting or special characters. expression() takes strings or LaTeX formatting. Each string or piece of LaTeX is separated by a * or a ~. The * concatenates the strings without a space, ~ does so with a space. It will generate a warning message "In is.na(x) : is.na() applied to non-(list or vector) of type 'expression'" which can be ignored.\n Save your figure to your figures folder." 
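The residual checks asked for above are not shown in this section. A minimal sketch, assuming the mod object fitted above and using the same functions shown in the "Single regression" workshop, might be:

# residuals against fitted values: look for a similar spread in the two groups
plot(mod, which = 1)

# distribution of the residuals
ggplot(mapping = aes(x = mod$residuals)) +
  geom_histogram(bins = 10)

# normality test on the residuals; a non-significant result is
# consistent with the normality assumption
shapiro.test(mod$residuals)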
+ "objectID": "r4babs2/r4babs2.html#introduction-to-statistical-models-single-regression", + "href": "r4babs2/r4babs2.html#introduction-to-statistical-models-single-regression", + "title": "Data Analysis in R for BABS 2", + "section": "Introduction to statistical models: Single regression", + "text": "Introduction to statistical models: Single regression" }, { - "objectID": "r4babs2/week-3/workshop.html#grouse-parasites", - "href": "r4babs2/week-3/workshop.html#grouse-parasites", - "title": "Workshop", - "section": "Grouse Parasites", - "text": "Grouse Parasites\nGrouse livers were dissected and the number of individuals of a parasitic nematode were counted for two estates ā€˜Gordonā€™ and ā€˜Mossā€™. We want to know if the two estates have different infection rates. The data are in grouse.csv\n Save a copy of grouse.csv to data-raw\n Read in the data and check the structure. I used the name grouse for the dataframe/tibble.\nSelecting\n Using your common sense, do these data look normally distributed?\n\n\n\n What test do you suggest?\n\n\nApplying, interpreting and reporting\n Summarise the data by finding the median of each group:\n Carry out a two-sample Wilcoxon test (also known as a Mann-Whitney):\n\nwilcox.test(data = grouse, nematodes ~ estate)\n\n\n Wilcoxon rank sum exact test\n\ndata: nematodes by estate\nW = 78, p-value = 0.03546\nalternative hypothesis: true location shift is not equal to 0\n\n\n What do you conclude from the test? Write your conclusion in a form suitable for a report.\n\n\n\nIllustrating\nA box plot is a usually good choice for illustrating a two-sample Wilcoxon test because it shows the median and interquartile range.\n We can create a simple boxplot with:\n\nggplot(data = grouse, aes(x = estate, y = nematodes) ) +\n geom_boxplot() \n\n\n\n\n Annotate and format the figure so it is more suitable for a report and save it to your figures folder." + "objectID": "r4babs2/r4babs2.html#two-sample-tests", + "href": "r4babs2/r4babs2.html#two-sample-tests", + "title": "Data Analysis in R for BABS 2", + "section": "Two-sample tests", + "text": "Two-sample tests" }, { - "objectID": "r4babs2/week-3/workshop.html#gene-expression", - "href": "r4babs2/week-3/workshop.html#gene-expression", - "title": "Workshop", - "section": "Gene Expression", - "text": "Gene Expression\nBambara groundnut (Vigna subterranea) is an African legume with good nutritional value which can be influenced by low temperature stress. Researchers are interested in the expression levels of a particular set of 35 genes (probe_id) in response to temperature stress. They measure the expression of the genes at 23 and 18 degrees C (high and low temperature). These samples are not independent because we have two measure from one gene. The data are in expr.xlxs.\nSelecting\n What is the null hypothesis?\n\n\n\n Save a copy of expr.xlxs and import the data. I named the dataframe bambara\n What is the appropriate parametric test?\n\n\nApplying, interpreting and reporting\nA paired test requires us to test whether the difference in expression between high and low temperatures is zero on average. One handy way to achieve this is to organise our groups into two columns. The pivot_wider() function will do this for us. We need to tell it what column gives the identifiers (i.e., matches the the pairs) - the probe_ids in this case. 
We also need to say which variable contains what will become the column names and which contains the values.\n Pivot the data so there is a column for each temperature:\n\nbambara <- bambara |> \n pivot_wider(names_from = temperature, \n values_from = expression, \n id_cols = probe_id)\n\n Click on the bambara dataframe in the environment to open a view of it so that you understand what pivot_wider() has done.\n Create a paired-sample model like this:\n\nmod <- lm(data = bambara, \n highert - lowert ~ 1)\n\nSince we have done highert - lowert, the "(Intercept) Estimate" will be the average of the higher temperature expression minus the lower temperature expression for each gene.\n Examine the model with:\n\nsummary(mod)\n\n\nCall:\nlm(formula = highert - lowert ~ 1, data = bambara)\n\nResiduals:\n Min 1Q Median 3Q Max \n-1.05478 -0.46058 0.09682 0.33342 1.06892 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.30728 0.09591 3.204 0.00294 **\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.5674 on 34 degrees of freedom\n\n\n State your conclusion from the test in a form suitable for including in a report. Make sure you give the direction of any significant effect."
  },
  {
    "objectID": "r4babs2/week-3/workshop.html#look-after-future-you",
    "href": "r4babs2/week-3/workshop.html#look-after-future-you",
    "title": "Workshop",
    "section": "Look after future you!",
    "text": "Look after future you!\nThe code required to summarise, test, and plot data for any two-sample test AND for any one-way ANOVA is exactly the same except for the names of the dataframe, variables and the axis labels and limits. Take some time to comment your code so that you can make use of it next week.\n\nYou're finished!"
  },
  {
    "objectID": "r4babs2/week-3/study_after_workshop.html",
    "href": "r4babs2/week-3/study_after_workshop.html",
    "title": "Independent Study to consolidate this week",
    "section": "",
    "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\n💻 Plant Biotech. Some plant biotechnologists are trying to increase the quantity of omega 3 fatty acids in Cannabis sativa. They have developed a genetically modified line using genes from Linum usitatissimum (linseed). They grow 50 wild type and 50 modified plants to maturity, collect the seeds and determine the amount of omega 3 fatty acids. The data are in csativa.txt. 
Do you think their modification has been successful?\n\n\nCodecsativa <- read_table(\"data-raw/csativa.txt\")\nstr(csativa)\n\n# First realise that this is a two sample test. You have two independent samples\n# - there are a total of 100 different plants and the values in one \n# group have no relationship to the values in the other.\n\n\n\nCode# create a rough plot of the data \nggplot(data = csativa, aes(x = plant, y = omega)) +\n geom_violin()\nCode# note the modified plants seem to have lower omega!\n\n\n\nCode# create a summary of the data\ncsativa_summary <- csativa %>%\n group_by(plant) %>%\n summarise(mean = mean(omega),\n std = sd(omega),\n n = length(omega),\n se = std/sqrt(n))\n\n\n\nCode# The data seem to be continuous so it is likely that a parametric test will be fine\n# we will check the other assumptions after we have run the lm\n\n# build the statistical model\nmod <- lm(data = csativa, omega ~ plant)\n\n\n# examine it\nsummary(mod)\n# So there is a significant difference but you need to make sure you know the direction!\n# Wild plants have a significantly higher omega 3 content (mean +/- s.e = 56.41 +/- 1.11) \n# than modified plants (49.46 +/- 0.82)(t = 5.03; d.f. = 98; p < 0.0001).\n\n\n\nCode# let's check the assumptions\nplot(mod, which = 1) \nCode# we're looking for the variance in the residuals to be the same in both groups.\n# This looks OK. Maybe a bit higher in the wild plants (with the higher mean)\n \nhist(mod$residuals)\nCodeshapiro.test(mod$residuals)\n# On balance the use of lm() is probably justifiable The variance isn't quite equal \n# and the histogram looks a bit off normal but the normality test is NS and the \n# effect (in the figure) is clear.\n\n\n\nCode# A figure \nfig1 <- ggplot() +\n geom_point(data = csativa, aes(x = plant, y = omega),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"gray50\") +\n geom_errorbar(data = csativa_summary, \n aes(x = plant, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = csativa_summary, \n aes(x = plant, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_x_discrete(name = \"Plant type\", labels = c(\"GMO\", \"WT\")) +\n scale_y_continuous(name = \"Amount of Omega 3 (units)\",\n expand = c(0, 0),\n limits = c(0, 90)) +\n annotate(\"segment\", x = 1, xend = 2, \n y = 80, yend = 80,\n colour = \"black\") +\n annotate(\"text\", x = 1.5, y = 85, \n label = expression(italic(p)~\"< 0.001\")) +\n theme_classic()\n\n# save figure to figures/csativa.png\nggsave(\"figures/csativa.png\",\n plot = fig1,\n width = 3.5,\n height = 3.5,\n units = \"in\",\n dpi = 300)\n\n\n\nšŸ’» another example" + "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’» Effect of anxiety status and sporting performance. The data in sprint.txt are from an investigation of the effect of anxiety status and sporting performance. A group of 40 100m sprinters undertook a psychometric test to measure their anxiety shortly before competing. The data are their anxiety scores and the 100m times achieved. What you do conclude from these data?\n\n\nCode# this example is designed to emphasise the importance of plotting your data first\nsprint <- read_table(\"data-raw/sprint.txt\")\n# Anxiety is discrete but ranges from 16 to 402 meaning the gap between possible measures is small and \n# the variable could be treated as continuous if needed. 
Time is a continuous measure that has decimal places and which we would expect to follow a normal distribution \n\n# explore with a plot\nggplot(sprint, aes(x = anxiety, y = time) ) +\n geom_point()\nCode# A scatterplot of the data clearly reveals that these data are not linear. There is a good relationship between the two variables but since it is not linear, single linear regression is not appropriate.\n\n\n\nšŸ’» Juvenile hormone in stag beetles. The concentration of juvenile hormone in stag beetles is known to influence mandible growth. Groups of stag beetles were injected with different concentrations of juvenile hormone (arbitrary units) and their average mandible size (mm) determined. The experimenters planned to analyse their data with regression. The data are in stag.txt\n\n\n\nCode# read the data in and check the structure\nstag <- read_table(\"data-raw/stag.txt\")\nstr(stag)\n\n# jh is discrete but ordered and has been chosen by the experimenter - it is the explanatory variable. \n# the response is mandible size which has decimal places and is something we would expect to be \n# normally distributed. So far, common sense suggests the assumptions of regression are met.\n\n\n\nCode# exploratory plot\nggplot(stag, aes(x = jh, y = mand)) +\n geom_point()\nCode# looks linear-ish on the scatter\n# regression still seems appropriate\n# we will check the other assumptions after we have run the lm\n\n\n\nCode# build the statistical model\nmod <- lm(data = stag, mand ~ jh)\n\n# examine it\nsummary(mod)\n# mand = 0.032*jh + 0.419\n# the slope of the line is significantly different from zero / the jh explains a significant amount of the variation in mand (ANOVA: F = 16.63; d.f. = 1,14; p = 0.00113).\n# the intercept is 0.419 and differs significantly from zero \n\n\n\nCode# checking the assumption\nplot(mod, which = 1) \nCode# we're looking for the variance in the residuals to be equal along the x axis.\n# with a small data set there is some apparent heterogeneity but it doesn't look too.\n# \nhist(mod$residuals)\nCode# We have some skew which again might be partly a result of a small sample size.\nshapiro.test(mod$residuals) # the also test not sig diff from normal\n\n# On balance the use of regression is probably justifiable but it is borderline\n# but ideally the experiment would be better if multiple individuals were measure at\n# each of the chosen juvenile hormone levels.\n\n\n\nCode# a better plot\nggplot(stag, aes(x = jh, y = mand) ) +\n geom_point() +\n geom_smooth(method = lm, se = FALSE, colour = \"black\") +\n scale_x_continuous(name = \"Juvenile hormone (arbitrary units)\",\n expand = c(0, 0),\n limits = c(0, 32)) +\n scale_y_continuous(name = \"Mandible size (mm)\",\n expand = c(0, 0),\n limits = c(0, 2)) +\n theme_classic()" }, { - "objectID": "r4babs2/week-4/workshop.html", - "href": "r4babs2/week-4/workshop.html", + "objectID": "r4babs2/week-2/workshop.html", + "href": "r4babs2/week-2/workshop.html", "title": "Workshop", "section": "", - "text": "Artwork by Horst (2023): ā€œDebugging and feelingsā€\n\n\nIn this session you will get practice in choosing between, performing, and presenting the results of, one-way ANOVA and Kruskal-Wallis in R.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. 
Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + "text": "In this workshop you will get practice in applying, interpreting and reporting single linear regression.\n\n\nArtwork by Horst (2023): ā€œlinear regression dragonsā€\n\n\nIn this session you will carry out, interpret and report on a single linear regression.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." }, { - "objectID": "r4babs2/week-4/workshop.html#session-overview", - "href": "r4babs2/week-4/workshop.html#session-overview", + "objectID": "r4babs2/week-2/workshop.html#session-overview", + "href": "r4babs2/week-2/workshop.html#session-overview", "title": "Workshop", "section": "", - "text": "In this session you will get practice in choosing between, performing, and presenting the results of, one-way ANOVA and Kruskal-Wallis in R." + "text": "In this session you will carry out, interpret and report on a single linear regression." }, { - "objectID": "r4babs2/week-4/workshop.html#philosophy", - "href": "r4babs2/week-4/workshop.html#philosophy", + "objectID": "r4babs2/week-2/workshop.html#philosophy", + "href": "r4babs2/week-2/workshop.html#philosophy", "title": "Workshop", "section": "", "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. 
However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." }, { - "objectID": "r4babs2/week-4/workshop.html#myoglobin-in-seal-muscle", - "href": "r4babs2/week-4/workshop.html#myoglobin-in-seal-muscle", + "objectID": "r4babs2/week-2/workshop.html#linear-regression", + "href": "r4babs2/week-2/workshop.html#linear-regression", "title": "Workshop", - "section": "Myoglobin in seal muscle", - "text": "Myoglobin in seal muscle\nThe myoglobin concentration of skeletal muscle of three species of seal in grams per kilogram of muscle was determined and the data are given in seal.csv. We want to know if there is a difference between species. Each row represents an individual seal. The first column gives the myoglobin concentration and the second column indicates species.\n Save a copy of the data file seal.csv to data-raw\n Read in the data and check the structure. I used the name seal for the dataframe/tibble.\n What kind of variables do you have?\n\n\n\nExploring\n Do a quick plot of the data. You may need to refer to a previous workshop\nSummarising the data\nDo you remember Look after future you!\n If you followed that tip youā€™ll be able to open that script and whizz through summarising,testing and plotting.\n Create a data frame called seal_summary that contains the means, standard deviations, sample sizes and standard errors for each species.\nYou should get the following numbers:\n\n\n\n\nspecies\nmean\nstd\nn\nse\n\n\n\nBladdernose Seal\n42.31600\n8.020634\n30\n1.464361\n\n\nHarbour Seal\n49.01033\n8.252004\n30\n1.506603\n\n\nWeddell Seal\n44.66033\n7.849816\n30\n1.433174\n\n\n\n\n\nApplying, interpreting and reporting\nWe can now carry out a one-way ANOVA using the same lm() function we used for two-sample tests.\n Carry out an ANOVA and examine the results with:\n\nmod <- lm(data = seal, myoglobin ~ species)\nsummary(mod)\n\n\nCall:\nlm(formula = myoglobin ~ species, data = seal)\n\nResiduals:\n Min 1Q Median 3Q Max \n-16.306 -5.578 -0.036 5.240 18.250 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 42.316 1.468 28.819 < 2e-16 ***\nspeciesHarbour Seal 6.694 2.077 3.224 0.00178 ** \nspeciesWeddell Seal 2.344 2.077 1.129 0.26202 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 8.043 on 87 degrees of freedom\nMultiple R-squared: 0.1096, Adjusted R-squared: 0.08908 \nF-statistic: 5.352 on 2 and 87 DF, p-value: 0.006427\n\n\nRemember: the tilde (~) means test the values in myoglobin when grouped by the values in species. 
Or explain myoglobin with species\n What do you conclude so far from the test? Write your conclusion in a form suitable for a report.\n\n\n\n Can you relate the values under Estimate to the means?\n\n\n\n\n\n\n\nThe ANOVA is significant but this only tells us that species matters, meaning at least two of the means differ. To find out which means differ, we need a post-hoc test. A post-hoc (ā€œafter thisā€) test is done after a significant ANOVA test. There are several possible post-hoc tests and we will be using Tukeyā€™s HSD (honestly significant difference) test (Tukey 1949) implemented in the emmeans (Lenth 2023) package.\n Load the package\n\nlibrary(emmeans)\n\n Carry out the post-hoc test\n\nemmeans(mod, ~ species) |> pairs()\n\n contrast estimate SE df t.ratio p.value\n Bladdernose Seal - Harbour Seal -6.69 2.08 87 -3.224 0.0050\n Bladdernose Seal - Weddell Seal -2.34 2.08 87 -1.129 0.4990\n Harbour Seal - Weddell Seal 4.35 2.08 87 2.095 0.0968\n\nP value adjustment: tukey method for comparing a family of 3 estimates \n\n\nEach row is a comparison between the two means in the ā€˜contrastā€™ column. The ā€˜estimateā€™ column is the difference between those means and the ā€˜p.valueā€™ indicates whether that difference is significant.\nA plot can be used to visualise the result of the post-hoc which can be especially useful when there are very many comparisons.\n Plot the results of the post-hoc test:\n\nemmeans(mod, ~ species) |> plot()\n\n\n\n\nWhere the purple bars overlap, there is no significant difference.\n What do you conclude from the test?\n\n\n\nCheck assumptions\nThe assumptions of the general linear model are that the residuals ā€“ the difference between predicted value (i.e., the group mean) and observed values - are normally distributed and have homogeneous variance. To check these we can examine the mod$residuals variable. You may want to refer to Checking assumptions in the ā€œSingle regressionā€ workshop.\n Plot the model residuals against the fitted values.\n What to you conclude?\n\n\n\nTo examine normality of the model residuals we can plot them as a histogram and do a normality test on them.\n Plot a histogram of the residuals.\n Use the shapiro.test() to test the normality of the model residuals\n What to you conclude?\n\n\n\n\nIllustrating\n Create a figure like the one below. You may need to refer to Visualise from the ā€œSummarising data with several variablesā€ workshop (Rand 2023)\nWe will again use both our seal and seal_summary dataframes.\n Create the plot:\n\n\n\n\n\n Save your figure to your figures folder." + "section": "Linear Regression", + "text": "Linear Regression\nThe data in plant.xlsx is a set of observations of plant growth over two months. The researchers planted the seeds and harvested, dried and weighed a plant each day from day 10 so all the data points are independent of each other.\n Save a copy of plant.xlsx to your data-raw folder and import it.\n What type of variables do you have? Which is the response and which is the explanatory? What is the null hypothesis?\n\n\n\n\n\n\nExploring\n Do a quick plot of the data:\n\nggplot(plant, aes(x = day, y = mass)) +\n geom_point()\n\n\n\n\n What are the assumptions of linear regression? 
Do these seem to be met?\n\n\n\n\n\n\n\n\n\n\nApplying, interpreting and reporting\n We now carry out a regression assigning the result of the lm() procedure to a variable and examining it with summary().\n\nmod <- lm(data = plant, mass ~ day)\nsummary(mod)\n\n\nCall:\nlm(formula = mass ~ day, data = plant)\n\nResiduals:\n Min 1Q Median 3Q Max \n-32.810 -11.253 -0.408 9.075 48.869 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -8.6834 6.4729 -1.342 0.186 \nday 1.6026 0.1705 9.401 1.5e-12 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 17.92 on 49 degrees of freedom\nMultiple R-squared: 0.6433, Adjusted R-squared: 0.636 \nF-statistic: 88.37 on 1 and 49 DF, p-value: 1.503e-12\n\n\nThe Estimates in the Coefficients table give the intercept (first line) and the slope (second line) of the best fitting straight line. The p-values on the same line are tests of whether that coefficient is different from zero.\nThe F value and p-value in the last line are a test of whether the model as a whole explains a significant amount of variation in the dependent variable. For a single linear regression this is exactly equivalent to the test of the slope against zero.\n What is the equation of the line? What do you conclude from the analysis?\n\n\n\n\n\n Does the line go through (0,0)?\n\n\n\n What percentage of variation is explained by the line?\n\n\nIt might be useful to assign the slope and the intercept to variables in case we need them later. They can be accessed in the mod$coefficients variable:\n\nmod$coefficients\n\n(Intercept) day \n -8.683379 1.602606 \n\n\n Assign mod$coefficients[1] to b0 and mod$coefficients[2] to b1:\n\nb0 <- mod$coefficients[1] |> round(2)\nb1 <- mod$coefficients[2] |> round(2)\n\nI also rounded the values to two decimal places.\nChecking assumptions\nWe need to examine the residuals. Very conveniently, the object which is created by lm() contains a variable called $residuals. Also conveniently, R's plot() function can be used on the output objects of lm(). The assumptions demand that each y is drawn from a normal distribution for each x and these normal distributions have the same variance. Therefore we plot the residuals against the fitted values to see if the variance is the same for all the values of x. The fitted (predicted) values are the values on the line of best fit. Each residual is the difference between the fitted value and the observed value.\n Plot the model residuals against the fitted values like this:\n\nplot(mod, which = 1)\n\n\n\n\n What do you conclude?\n\n\n\nTo examine normality of the model residuals we can plot them as a histogram and do a normality test on them.\n Plot a histogram of the residuals:\n\nggplot(mapping = aes(x = mod$residuals)) + \n geom_histogram(bins = 10)\n\n\n\n\n Use the shapiro.test() to test the normality of the model residuals\n\nshapiro.test(mod$residuals)\n\n\n Shapiro-Wilk normality test\n\ndata: mod$residuals\nW = 0.96377, p-value = 0.1208\n\n\nUsually, when we are doing statistical tests we would like the test to be significant because it means we have evidence of a biological effect. However, when doing normality tests we hope it will not be significant. 
A non-significant result means that there is no significant difference between the distribution of the residuals and a normal distribution and that indicates the assumptions are met.\n What to you conclude?\n\n\n\n\nIllustrating\nWe want a figure with the points and the statistical model, i.e., the best fitting straight line.\n Create a scatter plot using geom_point()\n\nggplot(plant, aes(x = day, y = mass)) +\n geom_point() + \n theme_classic()\n\n\n\n\n The geom_smooth() function will had a variety of fitted lines to a plot. We want a line so we need to specify method = \"lm\":\n\nggplot(plant, aes(x = day, y = mass)) +\n geom_point() + \n geom_smooth(method = lm, \n se = FALSE, \n colour = \"black\") +\n theme_classic()\n\n\n\n\n What do the se and colour arguments do? Try changing them.\n Letā€™s add the equation of the line to the figure using annotate():\n\nggplot(plant, aes(x = day, y = mass)) +\n geom_point() +\n geom_smooth(method = lm, \n se = FALSE, \n colour = \"black\") +\n annotate(\"text\", x = 20, y = 110, \n label = \"mass = 1.61 * day - 8.68\") +\n theme_classic()\n\n\n\n\nWe have to tell annotate() what type of geom we want - text in this case, - where to put it, and the text we want to appear.\n Improve the axes. You may need to refer back Changing the axes from the Week 7 workshop in BABS1 (Rand 2023)\n Save your figure to your figures folder." }, { - "objectID": "r4babs2/week-4/workshop.html#leafminers-on-birch", - "href": "r4babs2/week-4/workshop.html#leafminers-on-birch", + "objectID": "r4babs2/week-2/workshop.html#look-after-future-you", + "href": "r4babs2/week-2/workshop.html#look-after-future-you", "title": "Workshop", - "section": "Leafminers on Birch", - "text": "Leafminers on Birch\nLarvae of the Ambermarked birch leafminer, Profenusa thomsoni, feed on the interior leaf tissues of Birch (Betula) species. They do not normally kill the tree but can weaken it making it susceptible to attack from other species. Researchers are interested in whether there is a difference in the rates at which white, grey and yellow birch are attacked. They introduce adult female P.thomsoni to a green house containing 30 young trees (ten of each type) and later count the egg laying events on each tree. The data are in leaf.txt.\nExploring\n Read in the data and check the structure. I used the name leaf for the dataframe/tibble.\n What kind of variables do we have?\n\n\n\n Do a quick plot of the data.\n Using your common sense, do these data look normally distributed?\n\n\n Why is a Kruskal-Wallis appropriate in this case?\n\n\n\n\n\n Calculate the medians, means and sample sizes.\nApplying, interpreting and reporting\n Carry out a Kruskal-Wallis:\n\nkruskal.test(data = leaf, eggs ~ birch)\n\n\n Kruskal-Wallis rank sum test\n\ndata: eggs by birch\nKruskal-Wallis chi-squared = 6.3393, df = 2, p-value = 0.04202\n\n\n What do you conclude from the test?\n\n\n\nA significant Kruskal-Wallis tells us at least two of the groups differ but where do the differences lie? The Dunn test is a post-hoc multiple comparison test for a significant Kruskal-Wallis. It is available in the package FSA\n Load the package using:\n\nlibrary(FSA)\n\n Run the post-hoc test with:\n\ndunnTest(data = leaf, eggs ~ birch)\n\n Comparison Z P.unadj P.adj\n1 Grey - White 1.296845 0.19468465 0.38936930\n2 Grey - Yellow -1.220560 0.22225279 0.22225279\n3 White - Yellow -2.517404 0.01182231 0.03546692\n\n\nThe P.adj column gives p-value for the comparison listed in the first column. 
Z is the test statistic.\n What do you conclude from the test?\n\n\n\n Write up the result is a form suitable for a report.\n\n\n\n\n\n\nIllustrating\n A box plot is an appropriate choice for illustrating a Kruskal-Wallis. Can you produce a figure like this?\n\n\n\n\n\nYouā€™re finished!" + "section": "Look after future you!", + "text": "Look after future you!\nYouā€™re finished!" }, { - "objectID": "r4babs2/week-4/study_after_workshop.html", - "href": "r4babs2/week-4/study_after_workshop.html", + "objectID": "r4babs2/week-5/study_after_workshop.html", + "href": "r4babs2/week-5/study_after_workshop.html", "title": "Independent Study to consolidate this week", "section": "", - "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’» Sports scientists were investigating the effects of fitness and heat acclimatisation on the sodium content of sweat. They measured the sodium content of the sweat (Ī¼moll^āˆ’1) of three groups of individuals: unfit and unacclimatised (UU); fit and unacclimatised(FU); and fit and acclimatised (FA). The are in sweat.txt. Is there a difference between the groups in the sodium content of their sweat?\n\n\nCode# read in the data and look at structure\nsweat <- read_table(\"data-raw/sweat.txt\")\nstr(sweat)\n\n\n\nCode# quick plot of the data\nggplot(data = sweat, aes(x = gp, y = na)) +\n geom_boxplot()\nCode# Since the sample sizes are small and not the same in each group and the \n# variance in the FA gp looks a bit lower, I'm leaning to a non-parametric test K-W.\n# However, don't panic if you decided to do an anova\n\n\n\nCode# calculate some summary stats \nsweat_summary <- sweat %>% \n group_by(gp) %>% \n summarise(mean = mean(na),\n n = length(na),\n median = median(na))\n\n\n\nCode# Kruskal-Wallis\nkruskal.test(data = sweat, na ~ gp)\n# We can say there is a difference between the groups in the sodium \n# content of their sweat (chi-squared = 11.9802, df = 2, p-value = 0.002503).\n# Unfit and unacclimatised people have most salty sweat, \n# Fit and acclimatised people the least salty.\n\n\n\nCode# a post-hoc test to see where the sig differences lie:\nlibrary(FSA)\ndunnTest(data = sweat, na ~ gp)\n# Fit and acclimatised people (median = 49.5 Ī¼moll^āˆ’1) have significantly less sodium in their\n# sweat than the unfit and unacclimatised people (70 Ī¼moll^āˆ’1) \n# (Kruskal-Wallis multiple comparison p-values adjusted with the Holm method: p = 0.0026).\n# Fit and unacclimatised (54 Ī¼moll^āˆ’1) also have significantly less sodium in their\n# people have sodium concentrations than unfit and unacclimatised people (p = 0.033). \n# There was no difference between the Fit and unacclimatised and the Fit and acclimatised. See figure 1.\n\n\n\nCodeggplot(sweat, aes(x = gp, y = na) ) +\n geom_boxplot() +\n scale_x_discrete(labels = c(\"Fit Acclimatised\", \n \"Fit Unacclimatised\", \n \"Unfit Unacclimatised\"), \n name = \"Group\") +\n scale_y_continuous(limits = c(0, 110), \n expand = c(0, 0),\n name = expression(\"Sodium\"~mu*\"mol\"*l^{-1})) +\n annotate(\"segment\", x = 1, xend = 3, \n y = 100, yend = 100,\n colour = \"black\") +\n annotate(\"text\", x = 2, y = 103, \n label = expression(italic(p)~\"= 0.0026\")) +\n annotate(\"segment\", x = 2, xend = 3, \n y = 90, yend = 90,\n colour = \"black\") +\n annotate(\"text\", x = 2.5, y = 93, \n label = expression(italic(p)~\"= 0.0340\")) +\n theme_classic()\nCode#Figure 1. 
Sodium content of sweat for three groups: Fit and acclimatised\n#(FA), Fit and unacclimatised (FU) and Unfit and unacclimatised (UU). Heavy lines\n#indicate the median, boxes the interquartile range and whiskers the range. \n\n\n\nšŸ’» The data are given in biomass.txt are taken from an experiment in which the insect pest biomass (g) was measured on plots sprayed with water (control) or one of five different insecticides. Do the insecticides vary in their effectiveness? What advice would you give to a person: - currently using insecticide E? - trying to choose between A and D? - trying to choose between C and B?\n\n\nCodebiom <- read_table(\"data-raw/biomass.txt\")\n# The data are organised with an insecticide treatment group in\n# each column.\n\n\n\nCode#Put the data into tidy format.\n\nbiom <- biom |> \n pivot_longer(cols = everything(),\n names_to = \"spray\",\n values_to = \"biomass\")\n\n\n\nCode# quick plot of the data\nggplot(data = biom, aes(x = spray, y = biomass)) +\n geom_boxplot()\nCode# Looks like there is a difference between sprays. E doesn't look very effective.\n\n\n\nCode# summary statistics\nbiom_summary <- biom %>% \n group_by(spray) %>% \n summarise(mean = mean(biomass),\n median = median(biomass),\n sd = sd(biomass),\n n = length(biomass),\n se = sd / sqrt(n))\n# thoughts so far: the sample sizes are equal, 10 is a smallish but\n# reasonable sample size\n# the means and medians are similar to each other (expected for\n# normally distributed data), A has a smaller variance \n\n# We have one explanatory variable, \"spray\" comprising 6 levels\n# Biomass has decimal places and we would expect such data to be \n# normally distributed therefore one-way ANOVA is the desired test\n# - we will check the assumptions after building the model\n\n\n\nCode# arry out an ANOVA and examine the results \nmod <- lm(data = biom, biomass ~ spray)\nsummary(mod)\n# spray type does have an effect F-statistic: 26.46 on 5 and 54 DF, p-value: 2.081e-13\n\n\n\nCode# Carry out the post-hoc test\nlibrary(emmeans)\n\nemmeans(mod, ~ spray) |> pairs()\n\n# the signifcant comparisons are:\n# contrast estimate SE df t.ratio p.value\n# A - D -76.50 21.9 54 -3.489 0.0119\n# A - E -175.51 21.9 54 -8.005 <.0001\n# A - WaterControl -175.91 21.9 54 -8.024 <.0001\n# B - E -154.32 21.9 54 -7.039 <.0001\n# B - WaterControl -154.72 21.9 54 -7.057 <.0001\n# C - E -155.71 21.9 54 -7.102 <.0001\n# C - WaterControl -156.11 21.9 54 -7.120 <.0001\n# D - E -99.01 21.9 54 -4.516 0.0005\n# D - WaterControl -99.41 21.9 54 -4.534 0.0004\n# All sprays are better than the water control except E. \n# This is probably the most important result.\n# What advice would you give to a person currently using insecticide E?\n# Don't bother!! It's no better than water. Switch to any of \n# the other sprays\n# What advice would you give to a person currently\n# + trying to choose between A and D? Choose A because A has sig lower\n# insect biomass than D \n# + trying to choose between C and B? It doesn't matter because there is \n# no difference in insect biomass. Use other criteria to chose (e.g., price)\n# We might report this like:\n# There is a very highly significant effect of spray type on pest \n# biomass (F = 26.5; d.f., 5, 54; p < 0.001). Post-hoc testing \n# showed E was no more effective than the control; A, C and B were \n# all better than the control but could be equally as good as each\n# other; D would be a better choice than the control or E but \n# worse than A. 
See figure 1\n\n\n\nCode# I reordered the bars to make is easier for me to annotate with\n# I also used * to indicate significance\n\nggplot() +\n geom_point(data = biom, aes(x = reorder(spray, biomass), y = biomass),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"gray50\") +\n geom_errorbar(data = biom_summary, \n aes(x = spray, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = biom_summary, \n aes(x = spray, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_y_continuous(name = \"Pest Biomass (units)\",\n limits = c(0, 540),\n expand = c(0, 0)) +\n scale_x_discrete(\"Spray treatment\") +\n # E and control are one group\n annotate(\"segment\", x = 4.5, xend = 6.5, \n y = 397, yend = 397,\n colour = \"black\", linewidth = 1) +\n annotate(\"text\", x = 5.5, y = 385, \n label = \"N.S\", size = 4) +\n # WaterControl-D and E-D ***\n annotate(\"segment\", x = 4, xend = 5.5, \n y = 410, yend = 410,\n colour = \"black\") +\n annotate(\"text\", x = 4.5, y = 420, \n label = \"***\", size = 5) +\n # WaterControl-B ***\n annotate(\"segment\", x = 3, xend = 5.5, \n y = 440, yend = 440,\n colour = \"black\") +\n annotate(\"text\", x = 4, y = 450,\n label = \"***\", size = 5) +\n # WaterControl-C ***\n annotate(\"segment\", x = 2, xend = 5.5, \n y = 475, yend = 475,\n colour = \"black\") +\n annotate(\"text\", x = 3.5, y = 485, \n label = \"***\", size = 5) +\n # WaterControl-A ***\n annotate(\"segment\", x = 1, xend = 5.5, \n y = 510, yend = 510,\n colour = \"black\") +\n annotate(\"text\", x = 3.5, y = 520, \n label = \"***\", size = 5) + \n# A-D ***\n annotate(\"segment\", x = 1, xend = 4, \n y = 330, yend = 330,\n colour = \"black\") +\n annotate(\"text\", x = 2.5, y = 335, \n label = \"*\", size = 5) +\n theme_classic()\nCode# Figure 1. The mean pest biomass following various insecticide treatments.\n# Error bars are +/- 1 S.E. Significant comparisons are indicated: * is p < 0.05, ** p < 0.01 and *** is p < 0.001" + "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’»\n\n\nšŸ“– Read xxx" }, { "objectID": "r4babs2/week-5/workshop.html", @@ -1477,129 +1456,150 @@ "text": "Effect of brain region and choline deficiency on neuron size\nCognitive performance is influenced by the choline intake in utero. To better understand this phenomenon, pregnant mice were fed a control or choline-deficient diet and their offspring examined. The cross sectional area (CSA) of cholinergic neurons was determined in two brain regions, the MSN and the DB. The data are given in neuron-csa.xlsx\n Save a copy of the data file neuron-csa.xlsx to data-raw\nYou have previously read data from an excel file.\n List the the names of the work sheets in the excel workbook.\nThese data are organised into two worksheets, one for each brain region\n Read in each sheet. I used the names db and msn for the two dataframes/tibble.\n We have the top half and the bottom half of a data set and can combine these togther with bind_rows()\n\nneuron <- bind_rows(db, msn)\n\nYou might want to click on neuron in the environment to open the spreadsheet-like view to check it looks how you expect.\n What kind of variables do you have?\n\n\n\n\n\nExploring\nWhen we have a single explanatory variable, it always goes on the x-axis. Here we have two explanatory variables: brain region and diet. 
We can map one of the explanatory variables to the x-axis and the other to a aesthetic like colour, shape or fill.\n Do a quick plot of the data:\n\nggplot(data = neuron, aes(x = BrainRegion, y = CSA, fill = Diet)) +\n geom_violin()\n\n\n\n\nWhether we map BrainRegion to the x-axis or the fill does not really matter. It looks as though the cross sectional area of neurons is higher for the control diet than the deficient diet (the average of the read bars is grater than the average of the blue bars). It also looks like there might be a significant interaction between the effects of diet and brain region because the effect of diet seems to be greater in the DB region.\nSummarising the data\nJust as we needed to incorporate the second explanatory variable in the rough plot, we need to incorporate it into our summary. We do this by adding it to the group_by().\n Create a data frame called neuron_summary that contains the means, standard deviations, sample sizes and standard errors for each group:\n\nneuron_summary <- neuron %>%\n group_by(BrainRegion, Diet) %>%\n summarise(mean = mean(CSA),\n std = sd(CSA),\n n = length(CSA),\n se = std/sqrt(n))\n\nYou will get a message that you donā€™t need to worry about summarise()has grouped output by 'BrainRegion'. You can override using the.groupsargument.>\nYou should get the following numbers:\n\n\n\n\nBrainRegion\nDiet\nmean\nstd\nn\nse\n\n\n\nDB\nControl\n26.6645\n3.633975\n10\n1.1491638\n\n\nDB\nDeficient\n21.2245\n4.213968\n10\n1.3325736\n\n\nMSN\nControl\n20.9695\n2.779860\n10\n0.8790688\n\n\nMSN\nDeficient\n19.9325\n2.560446\n10\n0.8096842\n\n\n\n\n\nApplying, interpreting and reporting\nWe can now carry out a two-way ANOVA using the same lm() function we used for two-sample tests and one-way ANOVA.\n Carry out an ANOVA and examine the results with:\n\nmod <- lm(data = neuron, CSA ~ BrainRegion * Diet)\nsummary(mod)\n\n\nCall:\nlm(formula = CSA ~ BrainRegion * Diet, data = neuron)\n\nResiduals:\n Min 1Q Median 3Q Max \n-6.6045 -2.6308 0.0765 2.4820 5.5505 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 26.664 1.064 25.071 < 2e-16 ***\nBrainRegionMSN -5.695 1.504 -3.786 0.000560 ***\nDietDeficient -5.440 1.504 -3.617 0.000907 ***\nBrainRegionMSN:DietDeficient 4.403 2.127 2.070 0.045692 * \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 3.363 on 36 degrees of freedom\nMultiple R-squared: 0.4034, Adjusted R-squared: 0.3537 \nF-statistic: 8.115 on 3 and 36 DF, p-value: 0.0002949\n\n\nRemember: the tilde (~) means test the values in CSA when grouped by the values in BrainRegion and Diet Or explain CSA with BrainRegion and Diet\n Can you relate the values under Estimate to the means?\n\n\n\n\n\n\n\n\n\nThe model of brain region and diet overall explains a significant amount of the variation in the cross sectional area of neurons (p-value: 0.0002949). To see which of the three effects are significant we can use the anova() function on our model.\n Determine which effects are significant:\n\nanova(mod)\n\nAnalysis of Variance Table\n\nResponse: CSA\n Df Sum Sq Mean Sq F value Pr(>F) \nBrainRegion 1 122.05 122.045 10.7893 0.002280 **\nDiet 1 104.88 104.879 9.2717 0.004334 **\nBrainRegion:Diet 1 48.47 48.466 4.2846 0.045692 * \nResiduals 36 407.22 11.312 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n\nThere is a significant effect of brain region (F = 10.8; d.f. = 1, 36; p = 0.002) and diet (F = 9.3; d.f. 
= 1, 36; p = 0.004) on CSA and these effects interact (F = 4.3; d.f. = 1, 36; p = 0.046)\nWe need a post-hoc test to see which comparisons are significant and can again use then emmeans (Lenth 2023) package.\n Load the package\n\nlibrary(emmeans)\n\n Carry out the post-hoc test\n\nemmeans(mod, ~ BrainRegion * Diet) |> pairs()\n\n contrast estimate SE df t.ratio p.value\n DB Control - MSN Control 5.695 1.5 36 3.786 0.0030\n DB Control - DB Deficient 5.440 1.5 36 3.617 0.0048\n DB Control - MSN Deficient 6.732 1.5 36 4.476 0.0004\n MSN Control - DB Deficient -0.255 1.5 36 -0.170 0.9982\n MSN Control - MSN Deficient 1.037 1.5 36 0.689 0.9005\n DB Deficient - MSN Deficient 1.292 1.5 36 0.859 0.8257\n\nP value adjustment: tukey method for comparing a family of 4 estimates \n\n\nEach row is a comparison between the two means in the ā€˜contrastā€™ column. The ā€˜estimateā€™ column is the difference between those means and the ā€˜p.valueā€™ indicates whether that difference is significant.\nA plot can be used to visualise the result of the post hoc which can be especially useful when there are very many comparisons.\n Plot the results of the post-hoc test:\n\nemmeans(mod, ~ BrainRegion * Diet) |> plot()\n\n\n\n\n What do you conclude from the test?\n\n\n\n\n\n\nWe might report this result as:\nA choline-deficient diet in pregnant mice significantly decreases the cross sectional area of cholinergic neurons in the DB region of their offspring (t = 3.62; d.f. = 36; p = 0.0048). The cross sectional area of cholinergic neurons in the MSN region are also significantly smaller than those in the DB region (t = 3.79; d.f. = 36; p = 0.0030) but are not reduces by maternal choline-deficiency.\nCheck assumptions\nThe assumptions of the general linear model are that the residuals ā€“ the difference between predicted value (i.e., the group mean) and observed values - are normally distributed and have homogeneous variance. To check these we can examine the mod$residuals variable. 
You may want to refer to Checking assumptions in the ā€œSingle regressionā€ workshop.\n Plot the model residuals against the fitted values.\n What to you conclude?\n\n\n\nTo examine normality of the model residuals we can plot them as a histogram and do a normality test on them.\n Plot a histogram of the residuals.\n Use the shapiro.test() to test the normality of the model residuals\n What to you conclude?\n\n\n\n\nIllustrating\nWe are going to create a figure like this:\n\n\n\n\n\nWe will again use both our neuron and neuron_summary dataframes.\n Try emulating what you did for one-way ANOVA based on Visualise from the ā€œSummarising data with several variablesā€ workshop (Rand 2023).\n\nggplot() +\n geom_point(data = neuron, \n aes(x = BrainRegion, y = CSA),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"gray50\",\n size = 3) +\n geom_errorbar(data = neuron_summary, \n aes(x = BrainRegion, \n ymin = mean - se, \n ymax = mean + se),\n width = 0.4) +\n geom_errorbar(data = neuron_summary, \n aes(x = BrainRegion, \n ymin = mean,\n ymax = mean),\n width = 0.3, \n linewidth = 1) +\n scale_y_continuous(name = \"CSA\",\n expand = c(0, 0),\n limits = c(0, 45)) +\n scale_x_discrete(name = \"BrainRegion\") +\n theme_classic() \n\n\n\n\nHow can we show the two diets separately?\n We can map the Diet variable to the shape aesthetic!\n\nggplot() +\n geom_point(data = neuron, \n aes(x = BrainRegion, y = CSA, shape = Diet),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"gray50\",\n size = 3) +\n geom_errorbar(data = neuron_summary, \n aes(x = BrainRegion, \n ymin = mean - se, \n ymax = mean + se),\n width = 0.4) +\n geom_errorbar(data = neuron_summary, \n aes(x = BrainRegion, \n ymin = mean,\n ymax = mean),\n width = 0.3, \n linewidth = 1) +\n scale_y_continuous(name = \"CSA\",\n expand = c(0, 0),\n limits = c(0, 45)) +\n scale_x_discrete(name = \"BrainRegion\") +\n theme_classic() \n\n\n\n\nOh, that isnā€™t quite what we want! We want the two diets side-by-side, not on top of each other.\n We can achieve that by using setting the position argument to position_jitterdodge() in the geom_point() and to position_dodge() in the two geom_errorbar(). We also have to specify that the error bars are grouped by Diet since they are not otherwise mapped to a shape, colour or fill.\n\nggplot() +\n geom_point(data = neuron, \n aes(x = BrainRegion, y = CSA, shape = Diet),\n position = position_jitterdodge(dodge.width = 1,\n jitter.width = 0.3,\n jitter.height = 0),\n colour = \"gray50\",\n size = 3) +\n geom_errorbar(data = neuron_summary, \n aes(x = BrainRegion, \n ymin = mean - se, \n ymax = mean + se,\n group = Diet),\n width = 0.4,\n position = position_dodge(width = 1)) +\n geom_errorbar(data = neuron_summary, \n aes(x = BrainRegion, \n ymin = mean,\n ymax = mean,\n group = Diet),\n width = 0.3, \n linewidth = 1,\n position = position_dodge(width = 1)) +\n scale_y_continuous(name = \"CSA\",\n expand = c(0, 0),\n limits = c(0, 45)) +\n scale_x_discrete(name = \"BrainRegion\") +\n theme_classic() \n\n\n\n\n Add the annotation of the statistical results\n Finally, we can move the legend to a space on the plot area which helps you minimise the width needed like this:\n\n...... +\n theme(legend.position = c(0.15, 0.15),\n legend.background = element_rect(colour = \"black\"))\n\n Save your figure to your figures folder.\nYouā€™re finished!" 
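The annotation step above is left for you to do. As a rough sketch only: the p-values below are taken from the post-hoc output shown earlier, but the x positions assume position_dodge(width = 1) (so the dodged groups sit at roughly 0.75, 1.25, 1.75 and 2.25) and the y positions are guesses to adjust by eye against the y-axis limit of 45. Extending the dodged plot in the same way as earlier figures:

...... +
  annotate("segment", x = 0.75, xend = 1.25,
           y = 38, yend = 38,
           colour = "black") +
  annotate("text", x = 1, y = 39.5,
           label = expression(italic(p)~"= 0.0048")) +
  annotate("segment", x = 0.75, xend = 1.75,
           y = 41.5, yend = 41.5,
           colour = "black") +
  annotate("text", x = 1.25, y = 43,
           label = expression(italic(p)~"= 0.0030"))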
}, { - "objectID": "r4babs2/week-5/study_after_workshop.html", - "href": "r4babs2/week-5/study_after_workshop.html", + "objectID": "r4babs2/week-1/study_after_workshop.html", + "href": "r4babs2/week-1/study_after_workshop.html", "title": "Independent Study to consolidate this week", "section": "", - "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’»\n\n\nšŸ“– Read xxx" + "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’» Adiponectin is exclusively secreted from adipose tissue and modulates a number of metabolic processes. Nicotinic acid can affect adiponectin secretion. 3T3-L1 adipocytes were treated with nicotinic acid or with a control treatment and adiponectin concentration (pg/mL) measured. The data are in adipocytes.txt. Each row represents an independent sample of adipocytes and the first column gives the concentration adiponectin and the second column indicates whether they were treated with nicotinic acid or not. Estimate the mean Adiponectin concentration in each group - this means calculate the sample mean and construct a confidence interval around it for each group. This exercise forces you to bring together ideas from this workshop and from previous workshops\n\n\nHow to calculate a confidence intervals (this workshop)\n\nHow to summarise variables in more than one group (previous workshop)\n\n\nCode# data import\nadip <- read_table(\"data-raw/adipocytes.txt\")\n\n# examine the structure\nstr(adip)\n\n# summarise\nadip_summary <- adip %>% \n group_by(treatment) %>% \n summarise(mean = mean(adiponectin),\n sd = sd(adiponectin),\n n = length(adiponectin),\n se = sd/sqrt(n),\n dif = qt(0.975, df = n - 1) * se,\n lower_ci = mean - dif,\n uppp_ci = mean + dif)\n\n\n# we conclude we're 95% certain the mean for the control group is \n# between 4.73 and 6.36 and the mean for the nicotinic group is \n# between 6.52 and 8.50. More usually we might put is like this:\n# the mean for the control group is 5.55 +/- 0.82 and that for the nicotinic group is 7.51 +/- 0.99" }, { - "objectID": "r4babs2/week-6/workshop.html", - "href": "r4babs2/week-6/workshop.html", + "objectID": "r4babs2/week-1/workshop.html", + "href": "r4babs2/week-1/workshop.html", "title": "Workshop", "section": "", - "text": "Artwork by Horst (2023):\n\n\nIn this session you will\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. 
It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + "text": "Artwork by Horst (2023): ā€œlove this classā€\n\n\nIn this session you will remind yourself how to import files, and calculate confidence intervals on large and small samples.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." }, { - "objectID": "r4babs2/week-6/workshop.html#session-overview", - "href": "r4babs2/week-6/workshop.html#session-overview", + "objectID": "r4babs2/week-1/workshop.html#session-overview", + "href": "r4babs2/week-1/workshop.html#session-overview", "title": "Workshop", "section": "", - "text": "In this session you will" + "text": "In this session you will remind yourself how to import files, and calculate confidence intervals on large and small samples." }, { - "objectID": "r4babs2/week-6/workshop.html#philosophy", - "href": "r4babs2/week-6/workshop.html#philosophy", + "objectID": "r4babs2/week-1/workshop.html#philosophy", + "href": "r4babs2/week-1/workshop.html#philosophy", "title": "Workshop", "section": "", "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. 
Record your answers in your script for future reference." }, { - "objectID": "r4babs2/week-6/study_after_workshop.html", - "href": "r4babs2/week-6/study_after_workshop.html", + "objectID": "r4babs2/week-1/workshop.html#remind-yourself-how-to-import-files", + "href": "r4babs2/week-1/workshop.html#remind-yourself-how-to-import-files", + "title": "Workshop", + "section": "Remind yourself how to import files!", + "text": "Remind yourself how to import files!\nImporting data from files was covered in BABS 1 (Rand 2023) if you need to remind yourself." + }, + { + "objectID": "r4babs2/week-1/workshop.html#confidence-intervals-large-samples", + "href": "r4babs2/week-1/workshop.html#confidence-intervals-large-samples", + "title": "Workshop", + "section": "Confidence intervals (large samples)", + "text": "Confidence intervals (large samples)\nThe data in beewing.txt are left wing widths of 100 honey bees (mm). The confidence interval for large samples is given by:\n\\(\\bar{x} \\pm 1.96 \\times s.e.\\)\nWhere 1.96 is the quantile for 95% confidence.\n Save beewing.txt to your data-raw folder.\n Read in the data and check the structure of the resulting dataframe.\n Calculate and assign to variables: the mean, standard deviation and standard error:\n\n# mean\nm <- mean(bee$wing)\n\n# standard deviation\nsd <- sd(bee$wing)\n\n# sample size (needed for the se)\nn <- length(bee$wing)\n\n# standard error\nse <- sd / sqrt(n)\n\n To calculate the 95% confidence interval we need to look up the quantile (multiplier) using qnorm()\n\nq <- qnorm(0.975)\n\nThis should be about 1.96.\n Now we can use it in our confidence interval calculation\n\nlcl <- m - q * se\nucl <- m + q * se\n\n Print the values\n\nlcl\n\n[1] 4.473176\n\nucl\n\n[1] 4.626824\n\n\nThis means we are 95% confident the population mean lies between 4.47 mm and 4.63 mm. The usual way of expressing this is that the mean is 4.55 +/- 0.07 mm\n Between what values would you be 99% confident of the population mean being?" + }, + { + "objectID": "r4babs2/week-1/workshop.html#confidence-intervals-small-samples", + "href": "r4babs2/week-1/workshop.html#confidence-intervals-small-samples", + "title": "Workshop", + "section": "Confidence intervals (small samples)", + "text": "Confidence intervals (small samples)\nThe confidence interval for small samples is given by:\n\\(\\bar{x} \\pm \\sf t_{[d.f]} \\times s.e.\\)\nThe only difference between the calculation for small and large sample is the multiple. For large samples we use the ā€œthe standard normal distributionā€ accessed with qnorm(); for small samples we use the ā€œt distributionā€ assessed with qt().The value returned by q(t) is larger than that returned by qnorm() which reflects the greater uncertainty we have on estimations of population means based on small samples.\nThe fatty acid Docosahexaenoic acid (DHA) is a major component of membrane phospholipids in nerve cells and deficiency leads to many behavioural and functional deficits. The cross sectional area of neurons in the CA 1 region of the hippocampus of normal rats is 155 \\(\\mu m^2\\). A DHA deficient diet was fed to 8 animals and the cross sectional area (csa) of neurons is given in neuron.txt\n Save neuron.txt to your data-raw folder\n Read in the data and check the structure of the resulting dataframe\n Assign the mean to m.\n Calculate and assign the standard error to se.\nTo work out the confidence interval for our sample mean we need to use the t distribution because it is a small sample. 
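(A hedged aside on the 99% question above: only the multiplier changes. A minimal sketch, assuming m and se have already been calculated from beewing.txt as in the steps above:

q99 <- qnorm(0.995)   # 99% confidence leaves 0.5% in each tail, roughly 2.58

m - q99 * se
m + q99 * se

The interval is wider than the 95% one because we are asking for more confidence.)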
This means we need to determine the degrees of freedom (the number in the sample minus one).\n We can assign this to a variable, df, using:\n\ndf <- length(neur$csa) - 1\n\n The t value is found by:\n\nt <- qt(0.975, df = df)\n\nNote that we are using qt() rather than qnorm() but that the probability, 0.975, used is the same. Finally, we need to put our mean, standard error and t value in the equation. \\(\\bar{x} \\pm \\sf t_{[d.f]} \\times s.e.\\).\n The upper confidence limit is:\n\n(m + t * se) |> round(2)\n\n[1] 151.95\n\n\nThe first part of the command, (m + t * se) calculates the upper limit. This is ā€˜pipedā€™ in to the round() function to round the result to two decimal places.\n Calculate the lower confidence limit:\n Given the upper and lower confidence values for the estimate of the population mean, what do you think about the effect of the DHA deficient diet?\n\n\n\n\nYouā€™re finished!" + }, + { + "objectID": "r4babs2/week-4/study_after_workshop.html", + "href": "r4babs2/week-4/study_after_workshop.html", "title": "Independent Study to consolidate this week", "section": "", - "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’»\n\n\nšŸ“– Read xxx" + "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’» Sports scientists were investigating the effects of fitness and heat acclimatisation on the sodium content of sweat. They measured the sodium content of the sweat (Ī¼moll^āˆ’1) of three groups of individuals: unfit and unacclimatised (UU); fit and unacclimatised(FU); and fit and acclimatised (FA). The are in sweat.txt. Is there a difference between the groups in the sodium content of their sweat?\n\n\nCode# read in the data and look at structure\nsweat <- read_table(\"data-raw/sweat.txt\")\nstr(sweat)\n\n\n\nCode# quick plot of the data\nggplot(data = sweat, aes(x = gp, y = na)) +\n geom_boxplot()\nCode# Since the sample sizes are small and not the same in each group and the \n# variance in the FA gp looks a bit lower, I'm leaning to a non-parametric test K-W.\n# However, don't panic if you decided to do an anova\n\n\n\nCode# calculate some summary stats \nsweat_summary <- sweat %>% \n group_by(gp) %>% \n summarise(mean = mean(na),\n n = length(na),\n median = median(na))\n\n\n\nCode# Kruskal-Wallis\nkruskal.test(data = sweat, na ~ gp)\n# We can say there is a difference between the groups in the sodium \n# content of their sweat (chi-squared = 11.9802, df = 2, p-value = 0.002503).\n# Unfit and unacclimatised people have most salty sweat, \n# Fit and acclimatised people the least salty.\n\n\n\nCode# a post-hoc test to see where the sig differences lie:\nlibrary(FSA)\ndunnTest(data = sweat, na ~ gp)\n# Fit and acclimatised people (median = 49.5 Ī¼moll^āˆ’1) have significantly less sodium in their\n# sweat than the unfit and unacclimatised people (70 Ī¼moll^āˆ’1) \n# (Kruskal-Wallis multiple comparison p-values adjusted with the Holm method: p = 0.0026).\n# Fit and unacclimatised (54 Ī¼moll^āˆ’1) also have significantly less sodium in their\n# people have sodium concentrations than unfit and unacclimatised people (p = 0.033). \n# There was no difference between the Fit and unacclimatised and the Fit and acclimatised. 
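# (Aside, not part of the original answer: if you want the Dunn test results as a
# data frame for reporting, the object returned by dunnTest() should - assuming
# the FSA object stores its results table in $res - give them with:
# dunnTest(data = sweat, na ~ gp)$res )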
See figure 1.\n\n\n\nCodeggplot(sweat, aes(x = gp, y = na) ) +\n geom_boxplot() +\n scale_x_discrete(labels = c(\"Fit Acclimatised\", \n \"Fit Unacclimatised\", \n \"Unfit Unacclimatised\"), \n name = \"Group\") +\n scale_y_continuous(limits = c(0, 110), \n expand = c(0, 0),\n name = expression(\"Sodium\"~mu*\"mol\"*l^{-1})) +\n annotate(\"segment\", x = 1, xend = 3, \n y = 100, yend = 100,\n colour = \"black\") +\n annotate(\"text\", x = 2, y = 103, \n label = expression(italic(p)~\"= 0.0026\")) +\n annotate(\"segment\", x = 2, xend = 3, \n y = 90, yend = 90,\n colour = \"black\") +\n annotate(\"text\", x = 2.5, y = 93, \n label = expression(italic(p)~\"= 0.0340\")) +\n theme_classic()\nCode#Figure 1. Sodium content of sweat for three groups: Fit and acclimatised\n#(FA), Fit and unacclimatised (FU) and Unfit and unacclimatised (UU). Heavy lines\n#indicate the median, boxes the interquartile range and whiskers the range. \n\n\n\nšŸ’» The data are given in biomass.txt are taken from an experiment in which the insect pest biomass (g) was measured on plots sprayed with water (control) or one of five different insecticides. Do the insecticides vary in their effectiveness? What advice would you give to a person: - currently using insecticide E? - trying to choose between A and D? - trying to choose between C and B?\n\n\nCodebiom <- read_table(\"data-raw/biomass.txt\")\n# The data are organised with an insecticide treatment group in\n# each column.\n\n\n\nCode#Put the data into tidy format.\n\nbiom <- biom |> \n pivot_longer(cols = everything(),\n names_to = \"spray\",\n values_to = \"biomass\")\n\n\n\nCode# quick plot of the data\nggplot(data = biom, aes(x = spray, y = biomass)) +\n geom_boxplot()\nCode# Looks like there is a difference between sprays. E doesn't look very effective.\n\n\n\nCode# summary statistics\nbiom_summary <- biom %>% \n group_by(spray) %>% \n summarise(mean = mean(biomass),\n median = median(biomass),\n sd = sd(biomass),\n n = length(biomass),\n se = sd / sqrt(n))\n# thoughts so far: the sample sizes are equal, 10 is a smallish but\n# reasonable sample size\n# the means and medians are similar to each other (expected for\n# normally distributed data), A has a smaller variance \n\n# We have one explanatory variable, \"spray\" comprising 6 levels\n# Biomass has decimal places and we would expect such data to be \n# normally distributed therefore one-way ANOVA is the desired test\n# - we will check the assumptions after building the model\n\n\n\nCode# arry out an ANOVA and examine the results \nmod <- lm(data = biom, biomass ~ spray)\nsummary(mod)\n# spray type does have an effect F-statistic: 26.46 on 5 and 54 DF, p-value: 2.081e-13\n\n\n\nCode# Carry out the post-hoc test\nlibrary(emmeans)\n\nemmeans(mod, ~ spray) |> pairs()\n\n# the signifcant comparisons are:\n# contrast estimate SE df t.ratio p.value\n# A - D -76.50 21.9 54 -3.489 0.0119\n# A - E -175.51 21.9 54 -8.005 <.0001\n# A - WaterControl -175.91 21.9 54 -8.024 <.0001\n# B - E -154.32 21.9 54 -7.039 <.0001\n# B - WaterControl -154.72 21.9 54 -7.057 <.0001\n# C - E -155.71 21.9 54 -7.102 <.0001\n# C - WaterControl -156.11 21.9 54 -7.120 <.0001\n# D - E -99.01 21.9 54 -4.516 0.0005\n# D - WaterControl -99.41 21.9 54 -4.534 0.0004\n# All sprays are better than the water control except E. \n# This is probably the most important result.\n# What advice would you give to a person currently using insecticide E?\n# Don't bother!! It's no better than water. 
Switch to any of \n# the other sprays\n# What advice would you give to a person currently\n# + trying to choose between A and D? Choose A because A has sig lower\n# insect biomass than D \n# + trying to choose between C and B? It doesn't matter because there is \n# no difference in insect biomass. Use other criteria to chose (e.g., price)\n# We might report this like:\n# There is a very highly significant effect of spray type on pest \n# biomass (F = 26.5; d.f., 5, 54; p < 0.001). Post-hoc testing \n# showed E was no more effective than the control; A, C and B were \n# all better than the control but could be equally as good as each\n# other; D would be a better choice than the control or E but \n# worse than A. See figure 1\n\n\n\nCode# I reordered the bars to make is easier for me to annotate with\n# I also used * to indicate significance\n\nggplot() +\n geom_point(data = biom, aes(x = reorder(spray, biomass), y = biomass),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"gray50\") +\n geom_errorbar(data = biom_summary, \n aes(x = spray, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = biom_summary, \n aes(x = spray, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_y_continuous(name = \"Pest Biomass (units)\",\n limits = c(0, 540),\n expand = c(0, 0)) +\n scale_x_discrete(\"Spray treatment\") +\n # E and control are one group\n annotate(\"segment\", x = 4.5, xend = 6.5, \n y = 397, yend = 397,\n colour = \"black\", linewidth = 1) +\n annotate(\"text\", x = 5.5, y = 385, \n label = \"N.S\", size = 4) +\n # WaterControl-D and E-D ***\n annotate(\"segment\", x = 4, xend = 5.5, \n y = 410, yend = 410,\n colour = \"black\") +\n annotate(\"text\", x = 4.5, y = 420, \n label = \"***\", size = 5) +\n # WaterControl-B ***\n annotate(\"segment\", x = 3, xend = 5.5, \n y = 440, yend = 440,\n colour = \"black\") +\n annotate(\"text\", x = 4, y = 450,\n label = \"***\", size = 5) +\n # WaterControl-C ***\n annotate(\"segment\", x = 2, xend = 5.5, \n y = 475, yend = 475,\n colour = \"black\") +\n annotate(\"text\", x = 3.5, y = 485, \n label = \"***\", size = 5) +\n # WaterControl-A ***\n annotate(\"segment\", x = 1, xend = 5.5, \n y = 510, yend = 510,\n colour = \"black\") +\n annotate(\"text\", x = 3.5, y = 520, \n label = \"***\", size = 5) + \n# A-D ***\n annotate(\"segment\", x = 1, xend = 4, \n y = 330, yend = 330,\n colour = \"black\") +\n annotate(\"text\", x = 2.5, y = 335, \n label = \"*\", size = 5) +\n theme_classic()\nCode# Figure 1. The mean pest biomass following various insecticide treatments.\n# Error bars are +/- 1 S.E. Significant comparisons are indicated: * is p < 0.05, ** p < 0.01 and *** is p < 0.001" }, { - "objectID": "r4babs2/week-2/workshop.html", - "href": "r4babs2/week-2/workshop.html", + "objectID": "r4babs2/week-4/workshop.html", + "href": "r4babs2/week-4/workshop.html", "title": "Workshop", "section": "", - "text": "In this workshop you will get practice in applying, interpreting and reporting single linear regression.\n\n\nArtwork by Horst (2023): ā€œlinear regression dragonsā€\n\n\nIn this session you will carry out, interpret and report on a single linear regression.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. 
Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + "text": "Artwork by Horst (2023): ā€œDebugging and feelingsā€\n\n\nIn this session you will get practice in choosing between, performing, and presenting the results of, one-way ANOVA and Kruskal-Wallis in R.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." }, { - "objectID": "r4babs2/week-2/workshop.html#session-overview", - "href": "r4babs2/week-2/workshop.html#session-overview", + "objectID": "r4babs2/week-4/workshop.html#session-overview", + "href": "r4babs2/week-4/workshop.html#session-overview", "title": "Workshop", "section": "", - "text": "In this session you will carry out, interpret and report on a single linear regression." + "text": "In this session you will get practice in choosing between, performing, and presenting the results of, one-way ANOVA and Kruskal-Wallis in R." }, { - "objectID": "r4babs2/week-2/workshop.html#philosophy", - "href": "r4babs2/week-2/workshop.html#philosophy", + "objectID": "r4babs2/week-4/workshop.html#philosophy", + "href": "r4babs2/week-4/workshop.html#philosophy", "title": "Workshop", "section": "", "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. 
Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." }, { - "objectID": "r4babs2/week-2/workshop.html#linear-regression", - "href": "r4babs2/week-2/workshop.html#linear-regression", + "objectID": "r4babs2/week-4/workshop.html#myoglobin-in-seal-muscle", + "href": "r4babs2/week-4/workshop.html#myoglobin-in-seal-muscle", "title": "Workshop", - "section": "Linear Regression", - "text": "Linear Regression\nThe data in plant.xlsx is a set of observations of plant growth over two months. The researchers planted the seeds and harvested, dried and weighed a plant each day from day 10 so all the data points are independent of each other.\n Save a copy of plant.xlsx to your data-raw folder and import it.\n What type of variables do you have? Which is the response and which is the explanatory? What is the null hypothesis?\n\n\n\n\n\n\nExploring\n Do a quick plot of the data:\n\nggplot(plant, aes(x = day, y = mass)) +\n geom_point()\n\n\n\n\n What are the assumptions of linear regression? Do these seem to be met?\n\n\n\n\n\n\n\n\n\n\nApplying, interpreting and reporting\n We now carry out a regression assigning the result of the lm() procedure to a variable and examining it with summary().\n\nmod <- lm(data = plant, mass ~ day)\nsummary(mod)\n\n\nCall:\nlm(formula = mass ~ day, data = plant)\n\nResiduals:\n Min 1Q Median 3Q Max \n-32.810 -11.253 -0.408 9.075 48.869 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -8.6834 6.4729 -1.342 0.186 \nday 1.6026 0.1705 9.401 1.5e-12 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 17.92 on 49 degrees of freedom\nMultiple R-squared: 0.6433, Adjusted R-squared: 0.636 \nF-statistic: 88.37 on 1 and 49 DF, p-value: 1.503e-12\n\n\nThe Estimates in the Coefficients table give the intercept (first line) and the slope (second line) of the best fitting straight line. The p-values on the same line are tests of whether that coefficient is different from zero.\nThe F value and p-value in the last line are a test of whether the model as a whole explains a significant amount of variation in the dependent variable. For a single linear regression this is exactly equivalent to the test of the slope against zero.\n What is the equation of the line? What do you conclude from the analysis?\n\n\n\n\n\n Does the line go through (0,0)?\n\n\n\n What percentage of variation is explained by the line?\n\n\nIt might be useful to assign the slope and the intercept to variables in case we need them later. 
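One hedged use for the slope and intercept, assuming the mod fitted above: predicting mass from the line. For example, the predicted dry mass on day 30 is roughly

-8.68 + 1.60 * 30

which is about 39 (predict(mod, newdata = data.frame(day = 30)) gives the same answer without rounding the coefficients).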
The can be accessed in the mod$coefficients variable:\n\nmod$coefficients\n\n(Intercept) day \n -8.683379 1.602606 \n\n\n Assign mod$coefficients[1] to b0 and mod$coefficients[1] to b1:\n\nb0 <- mod$coefficients[1] |> round(2)\nb1 <- mod$coefficients[2] |> round(2)\n\nI also rounded the values to two decimal places.\nChecking assumptions\nWe need to examine the residuals. Very conveniently, the object which is created by lm() contains a variable called $residuals. Also conveniently, the Rā€™s plot() function can used on the output objects of lm(). The assumptions demand that each y is drawn from a normal distribution for each x and these normal distributions have the same variance. Therefore we plot the residuals against the fitted values to see if the variance is the same for all the values of x. The fitted - predicted - values are the values on the line of best fit. Each residual is the difference between the fitted values and the observed value.\n Plot the model residuals against the fitted values like this:\n\nplot(mod, which = 1)\n\n\n\n\n What to you conclude?\n\n\n\nTo examine normality of the model residuals we can plot them as a histogram and do a normality test on them.\n Plot a histogram of the residuals:\n\nggplot(mapping = aes(x = mod$residuals)) + \n geom_histogram(bins = 10)\n\n\n\n\n Use the shapiro.test() to test the normality of the model residuals\n\nshapiro.test(mod$residuals)\n\n\n Shapiro-Wilk normality test\n\ndata: mod$residuals\nW = 0.96377, p-value = 0.1208\n\n\nUsually, when we are doing statistical tests we would like the the test to be significant because it means we have evidence of a biological effect. However, when doing normality tests we hope it will not be significant. A non-significant result means that there is no significant difference between the distribution of the residuals and a normal distribution and that indicates the assumptions are met.\n What to you conclude?\n\n\n\n\nIllustrating\nWe want a figure with the points and the statistical model, i.e., the best fitting straight line.\n Create a scatter plot using geom_point()\n\nggplot(plant, aes(x = day, y = mass)) +\n geom_point() + \n theme_classic()\n\n\n\n\n The geom_smooth() function will had a variety of fitted lines to a plot. We want a line so we need to specify method = \"lm\":\n\nggplot(plant, aes(x = day, y = mass)) +\n geom_point() + \n geom_smooth(method = lm, \n se = FALSE, \n colour = \"black\") +\n theme_classic()\n\n\n\n\n What do the se and colour arguments do? Try changing them.\n Letā€™s add the equation of the line to the figure using annotate():\n\nggplot(plant, aes(x = day, y = mass)) +\n geom_point() +\n geom_smooth(method = lm, \n se = FALSE, \n colour = \"black\") +\n annotate(\"text\", x = 20, y = 110, \n label = \"mass = 1.61 * day - 8.68\") +\n theme_classic()\n\n\n\n\nWe have to tell annotate() what type of geom we want - text in this case, - where to put it, and the text we want to appear.\n Improve the axes. You may need to refer back Changing the axes from the Week 7 workshop in BABS1 (Rand 2023)\n Save your figure to your figures folder." + "section": "Myoglobin in seal muscle", + "text": "Myoglobin in seal muscle\nThe myoglobin concentration of skeletal muscle of three species of seal in grams per kilogram of muscle was determined and the data are given in seal.csv. We want to know if there is a difference between species. Each row represents an individual seal. 
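A minimal sketch of that first import step (assuming, as in the instructions below, that the file has been saved to data-raw, the tidyverse is loaded and the dataframe is called seal):

seal <- read_csv(\"data-raw/seal.csv\")
str(seal)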
The first column gives the myoglobin concentration and the second column indicates species.\n Save a copy of the data file seal.csv to data-raw\n Read in the data and check the structure. I used the name seal for the dataframe/tibble.\n What kind of variables do you have?\n\n\n\nExploring\n Do a quick plot of the data. You may need to refer to a previous workshop\nSummarising the data\nDo you remember Look after future you!\n If you followed that tip youā€™ll be able to open that script and whizz through summarising,testing and plotting.\n Create a data frame called seal_summary that contains the means, standard deviations, sample sizes and standard errors for each species.\nYou should get the following numbers:\n\n\n\n\nspecies\nmean\nstd\nn\nse\n\n\n\nBladdernose Seal\n42.31600\n8.020634\n30\n1.464361\n\n\nHarbour Seal\n49.01033\n8.252004\n30\n1.506603\n\n\nWeddell Seal\n44.66033\n7.849816\n30\n1.433174\n\n\n\n\n\nApplying, interpreting and reporting\nWe can now carry out a one-way ANOVA using the same lm() function we used for two-sample tests.\n Carry out an ANOVA and examine the results with:\n\nmod <- lm(data = seal, myoglobin ~ species)\nsummary(mod)\n\n\nCall:\nlm(formula = myoglobin ~ species, data = seal)\n\nResiduals:\n Min 1Q Median 3Q Max \n-16.306 -5.578 -0.036 5.240 18.250 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 42.316 1.468 28.819 < 2e-16 ***\nspeciesHarbour Seal 6.694 2.077 3.224 0.00178 ** \nspeciesWeddell Seal 2.344 2.077 1.129 0.26202 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 8.043 on 87 degrees of freedom\nMultiple R-squared: 0.1096, Adjusted R-squared: 0.08908 \nF-statistic: 5.352 on 2 and 87 DF, p-value: 0.006427\n\n\nRemember: the tilde (~) means test the values in myoglobin when grouped by the values in species. Or explain myoglobin with species\n What do you conclude so far from the test? Write your conclusion in a form suitable for a report.\n\n\n\n Can you relate the values under Estimate to the means?\n\n\n\n\n\n\n\nThe ANOVA is significant but this only tells us that species matters, meaning at least two of the means differ. To find out which means differ, we need a post-hoc test. A post-hoc (ā€œafter thisā€) test is done after a significant ANOVA test. There are several possible post-hoc tests and we will be using Tukeyā€™s HSD (honestly significant difference) test (Tukey 1949) implemented in the emmeans (Lenth 2023) package.\n Load the package\n\nlibrary(emmeans)\n\n Carry out the post-hoc test\n\nemmeans(mod, ~ species) |> pairs()\n\n contrast estimate SE df t.ratio p.value\n Bladdernose Seal - Harbour Seal -6.69 2.08 87 -3.224 0.0050\n Bladdernose Seal - Weddell Seal -2.34 2.08 87 -1.129 0.4990\n Harbour Seal - Weddell Seal 4.35 2.08 87 2.095 0.0968\n\nP value adjustment: tukey method for comparing a family of 3 estimates \n\n\nEach row is a comparison between the two means in the ā€˜contrastā€™ column. 
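For example, the first row compares the Bladdernose and Harbour Seal means from the summary table; a quick hedged check, assuming seal_summary was built as described above:

seal_summary$mean[seal_summary$species == \"Bladdernose Seal\"] -
  seal_summary$mean[seal_summary$species == \"Harbour Seal\"]

This gives about -6.69, matching the first estimate.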
The ā€˜estimateā€™ column is the difference between those means and the ā€˜p.valueā€™ indicates whether that difference is significant.\nA plot can be used to visualise the result of the post-hoc which can be especially useful when there are very many comparisons.\n Plot the results of the post-hoc test:\n\nemmeans(mod, ~ species) |> plot()\n\n\n\n\nWhere the purple bars overlap, there is no significant difference.\n What do you conclude from the test?\n\n\n\nCheck assumptions\nThe assumptions of the general linear model are that the residuals ā€“ the difference between predicted value (i.e., the group mean) and observed values - are normally distributed and have homogeneous variance. To check these we can examine the mod$residuals variable. You may want to refer to Checking assumptions in the ā€œSingle regressionā€ workshop.\n Plot the model residuals against the fitted values.\n What to you conclude?\n\n\n\nTo examine normality of the model residuals we can plot them as a histogram and do a normality test on them.\n Plot a histogram of the residuals.\n Use the shapiro.test() to test the normality of the model residuals\n What to you conclude?\n\n\n\n\nIllustrating\n Create a figure like the one below. You may need to refer to Visualise from the ā€œSummarising data with several variablesā€ workshop (Rand 2023)\nWe will again use both our seal and seal_summary dataframes.\n Create the plot:\n\n\n\n\n\n Save your figure to your figures folder." }, { - "objectID": "r4babs2/week-2/workshop.html#look-after-future-you", - "href": "r4babs2/week-2/workshop.html#look-after-future-you", + "objectID": "r4babs2/week-4/workshop.html#leafminers-on-birch", + "href": "r4babs2/week-4/workshop.html#leafminers-on-birch", "title": "Workshop", - "section": "Look after future you!", - "text": "Look after future you!\nYouā€™re finished!" + "section": "Leafminers on Birch", + "text": "Leafminers on Birch\nLarvae of the Ambermarked birch leafminer, Profenusa thomsoni, feed on the interior leaf tissues of Birch (Betula) species. They do not normally kill the tree but can weaken it making it susceptible to attack from other species. Researchers are interested in whether there is a difference in the rates at which white, grey and yellow birch are attacked. They introduce adult female P.thomsoni to a green house containing 30 young trees (ten of each type) and later count the egg laying events on each tree. The data are in leaf.txt.\nExploring\n Read in the data and check the structure. I used the name leaf for the dataframe/tibble.\n What kind of variables do we have?\n\n\n\n Do a quick plot of the data.\n Using your common sense, do these data look normally distributed?\n\n\n Why is a Kruskal-Wallis appropriate in this case?\n\n\n\n\n\n Calculate the medians, means and sample sizes.\nApplying, interpreting and reporting\n Carry out a Kruskal-Wallis:\n\nkruskal.test(data = leaf, eggs ~ birch)\n\n\n Kruskal-Wallis rank sum test\n\ndata: eggs by birch\nKruskal-Wallis chi-squared = 6.3393, df = 2, p-value = 0.04202\n\n\n What do you conclude from the test?\n\n\n\nA significant Kruskal-Wallis tells us at least two of the groups differ but where do the differences lie? The Dunn test is a post-hoc multiple comparison test for a significant Kruskal-Wallis. 
It is available in the package FSA\n Load the package using:\n\nlibrary(FSA)\n\n Run the post-hoc test with:\n\ndunnTest(data = leaf, eggs ~ birch)\n\n Comparison Z P.unadj P.adj\n1 Grey - White 1.296845 0.19468465 0.38936930\n2 Grey - Yellow -1.220560 0.22225279 0.22225279\n3 White - Yellow -2.517404 0.01182231 0.03546692\n\n\nThe P.adj column gives p-value for the comparison listed in the first column. Z is the test statistic.\n What do you conclude from the test?\n\n\n\n Write up the result is a form suitable for a report.\n\n\n\n\n\n\nIllustrating\n A box plot is an appropriate choice for illustrating a Kruskal-Wallis. Can you produce a figure like this?\n\n\n\n\n\nYouā€™re finished!" }, { - "objectID": "r4babs2/week-2/study_after_workshop.html", - "href": "r4babs2/week-2/study_after_workshop.html", + "objectID": "r4babs2/week-3/study_after_workshop.html", + "href": "r4babs2/week-3/study_after_workshop.html", "title": "Independent Study to consolidate this week", "section": "", - "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’» Effect of anxiety status and sporting performance. The data in sprint.txt are from an investigation of the effect of anxiety status and sporting performance. A group of 40 100m sprinters undertook a psychometric test to measure their anxiety shortly before competing. The data are their anxiety scores and the 100m times achieved. What you do conclude from these data?\n\n\nCode# this example is designed to emphasise the importance of plotting your data first\nsprint <- read_table(\"data-raw/sprint.txt\")\n# Anxiety is discrete but ranges from 16 to 402 meaning the gap between possible measures is small and \n# the variable could be treated as continuous if needed. Time is a continuous measure that has decimal places and which we would expect to follow a normal distribution \n\n# explore with a plot\nggplot(sprint, aes(x = anxiety, y = time) ) +\n geom_point()\nCode# A scatterplot of the data clearly reveals that these data are not linear. There is a good relationship between the two variables but since it is not linear, single linear regression is not appropriate.\n\n\n\nšŸ’» Juvenile hormone in stag beetles. The concentration of juvenile hormone in stag beetles is known to influence mandible growth. Groups of stag beetles were injected with different concentrations of juvenile hormone (arbitrary units) and their average mandible size (mm) determined. The experimenters planned to analyse their data with regression. The data are in stag.txt\n\n\n\nCode# read the data in and check the structure\nstag <- read_table(\"data-raw/stag.txt\")\nstr(stag)\n\n# jh is discrete but ordered and has been chosen by the experimenter - it is the explanatory variable. \n# the response is mandible size which has decimal places and is something we would expect to be \n# normally distributed. So far, common sense suggests the assumptions of regression are met.\n\n\n\nCode# exploratory plot\nggplot(stag, aes(x = jh, y = mand)) +\n geom_point()\nCode# looks linear-ish on the scatter\n# regression still seems appropriate\n# we will check the other assumptions after we have run the lm\n\n\n\nCode# build the statistical model\nmod <- lm(data = stag, mand ~ jh)\n\n# examine it\nsummary(mod)\n# mand = 0.032*jh + 0.419\n# the slope of the line is significantly different from zero / the jh explains a significant amount of the variation in mand (ANOVA: F = 16.63; d.f. 
= 1,14; p = 0.00113).\n# the intercept is 0.419 and differs significantly from zero \n\n\n\nCode# checking the assumption\nplot(mod, which = 1) \nCode# we're looking for the variance in the residuals to be equal along the x axis.\n# with a small data set there is some apparent heterogeneity but it doesn't look too.\n# \nhist(mod$residuals)\nCode# We have some skew which again might be partly a result of a small sample size.\nshapiro.test(mod$residuals) # the also test not sig diff from normal\n\n# On balance the use of regression is probably justifiable but it is borderline\n# but ideally the experiment would be better if multiple individuals were measure at\n# each of the chosen juvenile hormone levels.\n\n\n\nCode# a better plot\nggplot(stag, aes(x = jh, y = mand) ) +\n geom_point() +\n geom_smooth(method = lm, se = FALSE, colour = \"black\") +\n scale_x_continuous(name = \"Juvenile hormone (arbitrary units)\",\n expand = c(0, 0),\n limits = c(0, 32)) +\n scale_y_continuous(name = \"Mandible size (mm)\",\n expand = c(0, 0),\n limits = c(0, 2)) +\n theme_classic()" + "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’» Plant Biotech. Some plant biotechnologists are trying to increase the quantity of omega 3 fatty acids in Cannabis sativa. They have developed a genetically modified line using genes from Linum usitatissimum (linseed). They grow 50 wild type and fifty modified plants to maturity, collect the seeds and determine the amount of omega 3 fatty acids. The data are in csativa.txt. Do you think their modification has been successful?\n\n\nCodecsativa <- read_table(\"data-raw/csativa.txt\")\nstr(csativa)\n\n# First realise that this is a two sample test. You have two independent samples\n# - there are a total of 100 different plants and the values in one \n# group have no relationship to the values in the other.\n\n\n\nCode# create a rough plot of the data \nggplot(data = csativa, aes(x = plant, y = omega)) +\n geom_violin()\nCode# note the modified plants seem to have lower omega!\n\n\n\nCode# create a summary of the data\ncsativa_summary <- csativa %>%\n group_by(plant) %>%\n summarise(mean = mean(omega),\n std = sd(omega),\n n = length(omega),\n se = std/sqrt(n))\n\n\n\nCode# The data seem to be continuous so it is likely that a parametric test will be fine\n# we will check the other assumptions after we have run the lm\n\n# build the statistical model\nmod <- lm(data = csativa, omega ~ plant)\n\n\n# examine it\nsummary(mod)\n# So there is a significant difference but you need to make sure you know the direction!\n# Wild plants have a significantly higher omega 3 content (mean +/- s.e = 56.41 +/- 1.11) \n# than modified plants (49.46 +/- 0.82)(t = 5.03; d.f. = 98; p < 0.0001).\n\n\n\nCode# let's check the assumptions\nplot(mod, which = 1) \nCode# we're looking for the variance in the residuals to be the same in both groups.\n# This looks OK. 
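# (Aside, not part of the original answer: had this check suggested clearly
# unequal variances, Welch's two-sample t-test, which does not assume equal
# variances, would be a reasonable fallback, e.g. t.test(omega ~ plant, data = csativa).)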
Maybe a bit higher in the wild plants (with the higher mean)\n \nhist(mod$residuals)\nCodeshapiro.test(mod$residuals)\n# On balance the use of lm() is probably justifiable The variance isn't quite equal \n# and the histogram looks a bit off normal but the normality test is NS and the \n# effect (in the figure) is clear.\n\n\n\nCode# A figure \nfig1 <- ggplot() +\n geom_point(data = csativa, aes(x = plant, y = omega),\n position = position_jitter(width = 0.1, height = 0),\n colour = \"gray50\") +\n geom_errorbar(data = csativa_summary, \n aes(x = plant, ymin = mean - se, ymax = mean + se),\n width = 0.3) +\n geom_errorbar(data = csativa_summary, \n aes(x = plant, ymin = mean, ymax = mean),\n width = 0.2) +\n scale_x_discrete(name = \"Plant type\", labels = c(\"GMO\", \"WT\")) +\n scale_y_continuous(name = \"Amount of Omega 3 (units)\",\n expand = c(0, 0),\n limits = c(0, 90)) +\n annotate(\"segment\", x = 1, xend = 2, \n y = 80, yend = 80,\n colour = \"black\") +\n annotate(\"text\", x = 1.5, y = 85, \n label = expression(italic(p)~\"< 0.001\")) +\n theme_classic()\n\n# save figure to figures/csativa.png\nggsave(\"figures/csativa.png\",\n plot = fig1,\n width = 3.5,\n height = 3.5,\n units = \"in\",\n dpi = 300)\n\n\n\nšŸ’» another example" }, { - "objectID": "r4babs2/week-1/workshop.html", - "href": "r4babs2/week-1/workshop.html", + "objectID": "r4babs2/week-3/workshop.html", + "href": "r4babs2/week-3/workshop.html", "title": "Workshop", "section": "", - "text": "Artwork by Horst (2023): ā€œlove this classā€\n\n\nIn this session you will remind yourself how to import files, and calculate confidence intervals on large and small samples.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." + "text": "Artwork by Horst (2023): ā€œHow much I think I know about Rā€\n\n\nIn this workshop you will get practice in choosing between, performing, and presenting the results of, two-sample tests and their non-parametric equivalents in R.\n\nWorkshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. 
Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." }, { - "objectID": "r4babs2/week-1/workshop.html#session-overview", - "href": "r4babs2/week-1/workshop.html#session-overview", + "objectID": "r4babs2/week-3/workshop.html#session-overview", + "href": "r4babs2/week-3/workshop.html#session-overview", "title": "Workshop", "section": "", - "text": "In this session you will remind yourself how to import files, and calculate confidence intervals on large and small samples." + "text": "In this workshop you will get practice in choosing between, performing, and presenting the results of, two-sample tests and their non-parametric equivalents in R." }, { - "objectID": "r4babs2/week-1/workshop.html#philosophy", - "href": "r4babs2/week-1/workshop.html#philosophy", + "objectID": "r4babs2/week-3/workshop.html#philosophy", + "href": "r4babs2/week-3/workshop.html#philosophy", "title": "Workshop", "section": "", "text": "Workshops are not a test. It is expected that you often donā€™t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips\n\ndonā€™t worry about making mistakes\ndonā€™t let what you can not do interfere with what you can do\ndiscussing code with your neighbours will help\nlook things up in the independent study material\nlook things up in your own code from earlier\nthere are no stupid questions\n\n\n\n\n\n\n\nKey\n\n\n\nThese four symbols are used at the beginning of each instruction so you know where to carry out the instruction.\n Something you need to do on your computer. It may be opening programs or documents or locating a file.\n Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.\n Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.\n A question for you to think about and answer. Record your answers in your script for future reference." 
}, { - "objectID": "r4babs2/week-1/workshop.html#remind-yourself-how-to-import-files", - "href": "r4babs2/week-1/workshop.html#remind-yourself-how-to-import-files", + "objectID": "r4babs2/week-3/workshop.html#adiponectin-secretion", + "href": "r4babs2/week-3/workshop.html#adiponectin-secretion", "title": "Workshop", - "section": "Remind yourself how to import files!", - "text": "Remind yourself how to import files!\nImporting data from files was covered in BABS 1 (Rand 2023) if you need to remind yourself." + "section": "Adiponectin secretion", + "text": "Adiponectin secretion\nAdiponectin is exclusively secreted from adipose tissue and modulates a number of metabolic processes. Nicotinic acid can affect adiponectin secretion. 3T3-L1 adipocytes were treated with nicotinic acid or with a control treatment and adiponectin concentration (pg/mL) measured. The data are in adipocytes.txt. Each row represents an independent sample of adipocytes and the first column gives the concentration adiponectin and the second column indicates whether they were treated with nicotinic acid or not.\n Save a copy of adipocytes.txt to data-raw\n Read in the data and check the structure. I used the name adip for the dataframe/tibble.\nWe have a tibble containing two variables: adiponectin is the response and is continuous and treatment is explanatory. treatment is categorical with two levels (groups). The first task is visualise the data to get an overview. For continuous response variables with categorical explanatory variables you could use geom_point(), geom_boxplot() or a variety of other geoms. I often use geom_violin() which allows us to see the distribution - the violin is fatter where there are more data points.\n Do a quick plot of the data:\n\nggplot(data = adip, aes(x = treatment, y = adiponectin)) +\n geom_violin()\n\n\n\n\nSummarising the data\nSummarising the data for each treatment group is the next sensible step. The most useful summary statistics are the means, standard deviations, sample sizes and standard errors.\n Create a data frame called adip_summary that contains the means, standard deviations, sample sizes and standard errors for the control and nicotinic acid treated samples. You may need to the Summarise from the Week 9 workshop of BABS1 (Rand 2023)\nYou should get the following numbers:\n\n\n\n\ntreatment\nmean\nstd\nn\nse\n\n\n\ncontrol\n5.546000\n1.475247\n15\n0.3809072\n\n\nnicotinic\n7.508667\n1.793898\n15\n0.4631824\n\n\n\n\n\nSelecting a test\n Do you think this is a paired-sample test or two-sample test?\n\n\n\n\nApplying, interpreting and reporting\n Create a two-sample model like this:\n\nmod <- lm(data = adip,\n adiponectin ~ treatment)\n\n Examine the model with:\n\nsummary(mod)\n\n\nCall:\nlm(formula = adiponectin ~ treatment, data = adip)\n\nResiduals:\n Min 1Q Median 3Q Max \n-4.3787 -1.0967 0.1927 1.0245 3.1113 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 5.5460 0.4240 13.079 1.9e-13 ***\ntreatmentnicotinic 1.9627 0.5997 3.273 0.00283 ** \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 1.642 on 28 degrees of freedom\nMultiple R-squared: 0.2767, Adjusted R-squared: 0.2509 \nF-statistic: 10.71 on 1 and 28 DF, p-value: 0.00283\n\n\n What do you conclude from the test? 
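One hedged way in to that question, assuming mod was fitted as above: in a two-group model the (Intercept) estimate is the mean of the reference (control) group and the treatmentnicotinic estimate is the difference between the group means, so the two means can be recovered from the model:

coef(mod)[1]                  # control mean, about 5.55
coef(mod)[1] + coef(mod)[2]   # nicotinic acid mean, about 7.51

These match adip_summary, and the p-value on the treatmentnicotinic line (0.00283) tests whether that difference is zero.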
Write your conclusion in a form suitable for a report.\n\n\n\n\nCheck assumptions\nThe assumptions of the general linear model are that the residuals ā€“ the difference between predicted value (i.e., the group mean) and observed values - are normally distributed and have homogeneous variance. To check these we can examine the mod$residuals variable. You may want to refer to Checking assumptions in the ā€œSingle regressionā€ workshop.\n Plot the model residuals against the fitted values.\n What to you conclude?\n\n\n\nTo examine normality of the model residuals we can plot them as a histogram and do a normality test on them.\n Plot a histogram of the residuals.\n Use the shapiro.test() to test the normality of the model residuals\n What to you conclude?\n\n\n\n\nIllustrating\n Create a figure like the one below. You may need to refer to Visualise from the ā€œSummarising data with several variablesā€ workshop (Rand 2023)\n\n\n\n\n\nWe now need to annotate the figure with the results from the statistical test. This most commonly done with a line linking the means being compared and the p-value. The annotate() function can be used to draw the line and then to add the value. The line is a segment and the p-value is a text.\n Add annotation to the figure by adding:\n...... +\n annotate(\"segment\", x = 1, xend = 2, \n y = 11.3, yend = 11.3,\n colour = \"black\") +\n annotate(\"text\", x = 1.5, y = 11.7, \n label = expression(italic(p)~\"= 0.003\")) +\n theme_classic()\n\n\n\n\n\nFor the segment, annotate() needs the x and y coordinates for the start and the finish of the line.\nThe use of expression() allows you to specify formatting or special characters. expression() takes strings or LaTeX formatting. Each string or piece of LaTeX is separated by a * or a ~. The * concatenates the strings without a space, ~ does so with a space. It will generate a warning message ā€œIn is.na(x) : is.na() applied to non-(list or vector) of type ā€˜expressionā€™ā€ which can be ignored.\n Save your figure to your figures folder." }, { - "objectID": "r4babs2/week-1/workshop.html#confidence-intervals-large-samples", - "href": "r4babs2/week-1/workshop.html#confidence-intervals-large-samples", + "objectID": "r4babs2/week-3/workshop.html#grouse-parasites", + "href": "r4babs2/week-3/workshop.html#grouse-parasites", "title": "Workshop", - "section": "Confidence intervals (large samples)", - "text": "Confidence intervals (large samples)\nThe data in beewing.txt are left wing widths of 100 honey bees (mm). The confidence interval for large samples is given by:\n\\(\\bar{x} \\pm 1.96 \\times s.e.\\)\nWhere 1.96 is the quantile for 95% confidence.\n Save beewing.txt to your data-raw folder.\n Read in the data and check the structure of the resulting dataframe.\n Calculate and assign to variables: the mean, standard deviation and standard error:\n\n# mean\nm <- mean(bee$wing)\n\n# standard deviation\nsd <- sd(bee$wing)\n\n# sample size (needed for the se)\nn <- length(bee$wing)\n\n# standard error\nse <- sd / sqrt(n)\n\n To calculate the 95% confidence interval we need to look up the quantile (multiplier) using qnorm()\n\nq <- qnorm(0.975)\n\nThis should be about 1.96.\n Now we can use it in our confidence interval calculation\n\nlcl <- m - q * se\nucl <- m + q * se\n\n Print the values\n\nlcl\n\n[1] 4.473176\n\nucl\n\n[1] 4.626824\n\n\nThis means we are 95% confident the population mean lies between 4.47 mm and 4.63 mm. 
The usual way of expressing this is that the mean is 4.55 +/- 0.07 mm\n Between what values would you be 99% confident of the population mean being?" + "section": "Grouse Parasites", + "text": "Grouse Parasites\nGrouse livers were dissected and the number of individuals of a parasitic nematode were counted for two estates ā€˜Gordonā€™ and ā€˜Mossā€™. We want to know if the two estates have different infection rates. The data are in grouse.csv\n Save a copy of grouse.csv to data-raw\n Read in the data and check the structure. I used the name grouse for the dataframe/tibble.\nSelecting\n Using your common sense, do these data look normally distributed?\n\n\n\n What test do you suggest?\n\n\nApplying, interpreting and reporting\n Summarise the data by finding the median of each group:\n Carry out a two-sample Wilcoxon test (also known as a Mann-Whitney):\n\nwilcox.test(data = grouse, nematodes ~ estate)\n\n\n Wilcoxon rank sum exact test\n\ndata: nematodes by estate\nW = 78, p-value = 0.03546\nalternative hypothesis: true location shift is not equal to 0\n\n\n What do you conclude from the test? Write your conclusion in a form suitable for a report.\n\n\n\nIllustrating\nA box plot is a usually good choice for illustrating a two-sample Wilcoxon test because it shows the median and interquartile range.\n We can create a simple boxplot with:\n\nggplot(data = grouse, aes(x = estate, y = nematodes) ) +\n geom_boxplot() \n\n\n\n\n Annotate and format the figure so it is more suitable for a report and save it to your figures folder." }, { - "objectID": "r4babs2/week-1/workshop.html#confidence-intervals-small-samples", - "href": "r4babs2/week-1/workshop.html#confidence-intervals-small-samples", + "objectID": "r4babs2/week-3/workshop.html#gene-expression", + "href": "r4babs2/week-3/workshop.html#gene-expression", "title": "Workshop", - "section": "Confidence intervals (small samples)", - "text": "Confidence intervals (small samples)\nThe confidence interval for small samples is given by:\n\\(\\bar{x} \\pm \\sf t_{[d.f]} \\times s.e.\\)\nThe only difference between the calculation for small and large sample is the multiple. For large samples we use the ā€œthe standard normal distributionā€ accessed with qnorm(); for small samples we use the ā€œt distributionā€ assessed with qt().The value returned by q(t) is larger than that returned by qnorm() which reflects the greater uncertainty we have on estimations of population means based on small samples.\nThe fatty acid Docosahexaenoic acid (DHA) is a major component of membrane phospholipids in nerve cells and deficiency leads to many behavioural and functional deficits. The cross sectional area of neurons in the CA 1 region of the hippocampus of normal rats is 155 \\(\\mu m^2\\). A DHA deficient diet was fed to 8 animals and the cross sectional area (csa) of neurons is given in neuron.txt\n Save neuron.txt to your data-raw folder\n Read in the data and check the structure of the resulting dataframe\n Assign the mean to m.\n Calculate and assign the standard error to se.\nTo work out the confidence interval for our sample mean we need to use the t distribution because it is a small sample. This means we need to determine the degrees of freedom (the number in the sample minus one).\n We can assign this to a variable, df, using:\n\ndf <- length(neur$csa) - 1\n\n The t value is found by:\n\nt <- qt(0.975, df = df)\n\nNote that we are using qt() rather than qnorm() but that the probability, 0.975, used is the same. 
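A quick hedged illustration of how much bigger the small-sample multiplier is, using the 7 degrees of freedom for these 8 neurons:

qnorm(0.975)        # about 1.96
qt(0.975, df = 7)   # about 2.36

The t multiplier shrinks towards 1.96 as the sample size, and therefore df, increases.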
Finally, we need to put our mean, standard error and t value in the equation. \\(\\bar{x} \\pm \\sf t_{[d.f]} \\times s.e.\\).\n The upper confidence limit is:\n\n(m + t * se) |> round(2)\n\n[1] 151.95\n\n\nThe first part of the command, (m + t * se) calculates the upper limit. This is ā€˜pipedā€™ in to the round() function to round the result to two decimal places.\n Calculate the lower confidence limit:\n Given the upper and lower confidence values for the estimate of the population mean, what do you think about the effect of the DHA deficient diet?\n\n\n\n\nYouā€™re finished!" + "section": "Gene Expression", + "text": "Gene Expression\nBambara groundnut (Vigna subterranea) is an African legume with good nutritional value which can be influenced by low temperature stress. Researchers are interested in the expression levels of a particular set of 35 genes (probe_id) in response to temperature stress. They measure the expression of the genes at 23 and 18 degrees C (high and low temperature). These samples are not independent because we have two measure from one gene. The data are in expr.xlxs.\nSelecting\n What is the null hypothesis?\n\n\n\n Save a copy of expr.xlxs and import the data. I named the dataframe bambara\n What is the appropriate parametric test?\n\n\nApplying, interpreting and reporting\nA paired test requires us to test whether the difference in expression between high and low temperatures is zero on average. One handy way to achieve this is to organise our groups into two columns. The pivot_wider() function will do this for us. We need to tell it what column gives the identifiers (i.e., matches the the pairs) - the probe_ids in this case. We also need to say which variable contains what will become the column names and which contains the values.\n Pivot the data so there is a column for each temperature:\n\nbambara <- bambara |> \n pivot_wider(names_from = temperature, \n values_from = expression, \n id_cols = probe_id)\n\n Click on the bambara dataframe in the environment to open a view of it so that you understand what pivot_wider() has done.\n Create a paired-sample model like this:\n\nmod <- lm(data = bambara, \n highert - lowert ~ 1)\n\nSince we have done highert - lowert, the ā€œ(Intercept) Estimateā€ will be the average of the higher temperature expression minus the lower temperature expression for each gene.\n Examine the model with:\n\nsummary(mod)\n\n\nCall:\nlm(formula = highert - lowert ~ 1, data = bambara)\n\nResiduals:\n Min 1Q Median 3Q Max \n-1.05478 -0.46058 0.09682 0.33342 1.06892 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.30728 0.09591 3.204 0.00294 **\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.5674 on 34 degrees of freedom\n\n\n State your conclusion from the test in a form suitable for including in a report. Make sure you give the direction of any significant effect." }, { - "objectID": "r4babs2/week-1/study_after_workshop.html", - "href": "r4babs2/week-1/study_after_workshop.html", - "title": "Independent Study to consolidate this week", - "section": "", - "text": "Set up\nIf you have just opened RStudio you will want to load the tidyverse package\n\nlibrary(tidyverse)\n\nExercises\n\nšŸ’» Adiponectin is exclusively secreted from adipose tissue and modulates a number of metabolic processes. Nicotinic acid can affect adiponectin secretion. 3T3-L1 adipocytes were treated with nicotinic acid or with a control treatment and adiponectin concentration (pg/mL) measured. 
The data are in adipocytes.txt. Each row represents an independent sample of adipocytes and the first column gives the concentration of adiponectin and the second column indicates whether they were treated with nicotinic acid or not. Estimate the mean adiponectin concentration in each group - this means calculate the sample mean and construct a confidence interval around it for each group. This exercise forces you to bring together ideas from this workshop and from previous workshops:\n\n\nHow to calculate a confidence interval (this workshop)\n\nHow to summarise variables in more than one group (previous workshop)\n\n\nCode# data import\nadip <- read_table(\"data-raw/adipocytes.txt\")\n\n# examine the structure\nstr(adip)\n\n# summarise\nadip_summary <- adip %>% \n group_by(treatment) %>% \n summarise(mean = mean(adiponectin),\n sd = sd(adiponectin),\n n = length(adiponectin),\n se = sd/sqrt(n),\n dif = qt(0.975, df = n - 1) * se,\n lower_ci = mean - dif,\n upper_ci = mean + dif)\n\n\n# we conclude we're 95% certain the mean for the control group is \n# between 4.73 and 6.36 and the mean for the nicotinic group is \n# between 6.52 and 8.50. More usually we might put it like this:\n# the mean for the control group is 5.55 +/- 0.82 and that for the nicotinic group is 7.51 +/- 0.99" }, { "objectID": "r4babs2/week-3/workshop.html#look-after-future-you", "href": "r4babs2/week-3/workshop.html#look-after-future-you", "title": "Workshop", "section": "Look after future you!", "text": "Look after future you!\nThe code required to summarise, test, and plot data for any two-sample test AND for any one-way ANOVA is exactly the same except for the names of the dataframe, variables and the axis labels and limits. Take some time to comment your code so that you can make use of it next week.\n\nYou're finished!" } ] \ No newline at end of file