INR_day2_June2024.Rmd

---
title: "INR_day2"
author: "Frances"
output: html_document
---

# Review with Datasaurus Dozen 

The Datasaurus Dozen dataset is a handful of data points that complement the `dplyr` package. Aside from functions, packages can also import objects. 

```{r}

```

```{r}

```

There are 13 different datasets in this one object. We will use tidyverse functions to take an overview look at the object, grouped by the datasets. 

```{r}

```

All of the datasets have roughly the same mean and standard deviation along both the x and y axis. 

Let's take a look the data graphically. We will use `filter` to extract the rows belonging to one dataset and then pipe that directly into a ggplot. Both dplyr and ggplot are developed within "the Tidyverse" and can use pipes, but you may not be able to pipe in base R functions or functions from different packages. 

```{r}

```

Tidyverse's data wranging packages use the pipe ` %>% ` to move the previous output to the next line, where as ggplot uses the plus sign `+`

Try editing the code above to display different datasets. Notice how different groups of data points can all give similar statistical summaries - so it's always a good choice to visualize your data rather than relying on just numbers. 

If we wanted to take a look at all of the datasets at once, we can also use the `facet_wrap()` function. 

```{r}

```

## Genearlizable code 

A major strength of programming is the ability to automate repetitive tasks. As a general rule of thumb, if you need to do a task more than three times (ex. analyzing multiple PCR plates or integrating clinical data from multiple days), it is worth it to invest time to write generalizable code or a custom function. 

Now that we're getting comfortable writing code, we will spend some time revisiting code that we wrote to make them generalizable and even better! Generalizable means that the code is flexible and can be applied to multiple similar objects. For example, if we're running a clinical study and we have patient demographic data from multiple sites, we want to check that the mean patient demographic is comparable between sites by creating similar plots of each hospital site to compare. If we write code for one location and then copy and paste it into another code chunk to apply to the next location, the code may require some modification before it works. 

Generalizable code begins at data collection. Depending on your workflow, you may or may not be able to influence this stage of the analysis. If possible, it is best practice to keep the column names and entries for categorical variables consistent. For example, when recording the age of patients, "6", "6 ", "six", and "Six" are all considered different levels of the factor so you will need to either make sure data collection is consistent or check and correct these inconsistencies in your code. Get into a habit of checking your work. Whether it is code you've written yourself, code you you've been sent by a collaborator, or published data from a biobank - never assume the data is as you predicted. 

# Generalizable plots

Remember when we edited our code to test out multiple datasets in the datasaurus dozen object? Perhaps you copy and pasted the code several time and changed the column name? This is not optimal because if you need to change the code in one instance (for example changing the x-axis label), you'll need to revisit ever instance that you copy and pasted to code to. This approach leads you vulnerable to errors when copy and pasting. 

One way to make your code robust is to bring all the factors that need editing to the start of the data. This may seem cumbersome for such a simple example where we are only changing the dataset name, but we'll return to this concept later with more complicated examples. 

Let's grab the code we used to make one plot earlier and modify it to be more generalizable

```{r}

```

```{r}

```

Once we have converted our code to a generalized format, we can convert it into a more versatile custom function! 

Curly brackets are used for inputting multiple lines of code. It is generally attached to the function that proceeds it. 

```{r}

```

## Exercise 

We've now encountered round brackets `()`, square brackets `[]`, and curly brackets`{}` - each have their own distinct functions! Take a few moments to chat with your neighbors and outline cases in which we've used each bracket and what is their role in R syntax. 

```{r}

```


## Dataset - Heart Stroke Preduction 

The dataset we will be working with for today's workshop contains clinical data collected with the aim of predicting whether a patient is likely to suffer a stroke. 

The dataset can be found: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

Here is some more information about the columns in this dataset: 

1) id: unique identifier
2) gender: "Male", "Female" or "Other"
3) age: age of the patient
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6) ever_married: "No" or "Yes"
7) work_type: "children", "Govt_jov", "Never_worked", "Private" or "Self-employed"
8) Residence_type: "Rural" or "Urban"
9) avg_glucose_level: average glucose level in blood
10) bmi: body mass index
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12) stroke: 1 if the patient had a stroke or 0 if not
*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

Let's get started! 

## Exercise

Reading in the dataset can be an intimidating step when you're just starting out with programming. Since this is a csv file, we can use the appropriately named `read.csv()` function. In cases when you have other file types such as .txt or .tab files that are tab deliminated, there is also a `read.table()` function that is more universal (but requires more parameters to let R know how your data is stored). 

Read in the `healthcare-dataset-stroke-data.csv` into an object called `heart`. Check your object using head and/or summary functions. Toggle a parameter called `stringsAsFactors` to `TRUE` in order to automatically import character values as factors rather than characters 

Hint: make sure the dataset is in the same directory or folder as this .Rmd file for ease of import 

```{r}

```

Before we dive into this dataset, we get a very limited indication that the object is read in correctly by checking the Environment panel - you'll notice the new `heart` appears under Data and it lets us know that there are 5110 observations (number of patients) and 12 variables (number of clinical features with entries). You can also double click the name of the object here to open up a view of the whole dataset - caution that this can cause your machine to stall if the dataset is exceptionally large and/or your machine is running on minimal memory. 

## Exercise 

Explore the dataset! Take a look at the columns and identify some potential issues with this dataset that either warrant further investigation or correction. 

There is no universally right or wrong way to do this. Perhaps the only truly incorrect way of doing this is going through the dataset which is thousands of observations line by line. 

```{r}

```

# Factors

The `stringsAsFactors` parameter takes care of the character values but we still have some integer values that should be interpreted as factors. 

When deciding on whether a number is a factor or should be kept numeric, consider if decimals/numbers-in-between make sense. The first two entries for `avg_glucose_level` are 229 and 202 - a glucose level in between would be reasonable. In contrast, the first to entries for `heart_disease` are 1 and 0 - as these are coding for having or not having the disorder, an entry of 1.2 does not make sense. 

Recall from the introduction to the dataset from above: 
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
12) stroke: 1 if the patient had a stroke or 0 if not

```{r}

```


## Incomplete data  

Firstly, we have to make a decision on how to handle missing values. We can either accept that some of the columns are incomplete or eliminate rows that do not have full data. Let's evaluate which columns this affects 

If you ever encounter missing data when you are entering data, use `NA`. 

```{r}

```

Let's use tidyverse to remove the rows in the `smoking_status` column with a value of `Unknown`. 

```{r}

```

After double checking, we can see that the smoking status has an empty level, we'll clean this up before moving on

```{r}

```

Great! Now let's tackle the typos in the gender column. 

```{r}

```

We need to fix the typos that happened during data entry and the single observation of `Other` will not be enough data for us to draw any statistical conclusion so we'll remove this row while we're at it. 

We can use `str_replace_all()` as a search and replace tool 

```{r}

```
Remember that this only displays the output, it does not replace the columns in the dataset.

Before we apply it globally, we can set up a quick double check to make sure that the right values are changed. 

```{r}

```

There's an issue with the "male" search grabbing from the "female" word!! Good thing we made a checkpoint earlier. Let's bring this back to our main heart object - it's a good habit to make some objects to checkpoint your work. In case you have not had this set up, you can always back track in your code and re-read in the object and re-run some earlier code to catch up. 

```{r}

```

Let's try this again: 

```{r}

```

The `^` is a special character that indicates the start of a word. 

Lastly, we'll the one `Other` entry 

```{r}

```

Overall, the data is looking much cleaner than when we started!

## Exercise 

Investigate the `work_type` column and correct the data entry problems! Also, remove any "N/A" entries under `bmi`

```{r}

```

## Loops

"For loops" allow R to apply highly automated tasks. It will cycle through a range of inputs and "for" each of them, it will carry out your custom task. 

Here's a very simple example to show you the structure of for loops

```{r}

```
The `for()` function accepts firstly the name of a temporary object that exists only within the curly brackets of the for loop, the `in` is a special R syntax specification, and `fruit` is the object that we are applying the for loop on. 

The for loop will take each entry of the fruits object, store it in the temporary object `x`, and apply the code written written within the curly bracket before repeating with the next entry. 

I recommend testing your code outside the `for()` loop before moving it into the loop to make sure it is robust and if you are applying the loop to a long vector or large dataset, consider trying it on a truncated version first as a proof of principle. 

```{r}
levels(heart$smoking_status)

heart %>% 
  filter(smoking_status == "never smoked") %>% 
  ggplot(aes(x=avg_glucose_level, y=bmi)) +
  geom_point() + 
  ggtitle("never smoked") +
  theme_classic()
```

If we wanted to save a pdf of every category of `smoking_status`, we can convert our code into a loop. When working on a larger section of code, it is helpful to sketch out the steps you need to do with `#` comments to keep you focused. 

```{r}

```


## Exercise

Copy and paste the loop from  above and modify it so that the range of values on the x and y axis are the same for all plots. 

Hint: the limits of the x axis can be specified by adding a layer called `xlim(lower, upper)` where it takes two numbers - the lower limit followed by the upper limit. These numbers can be stored in objects or inputted directly . Similarly, there is a parallel function called `ylim()` which also takes the same two parameters 

When working through the code, you can temporarily remove the code removing the axis labels by commenting out the lines with a hashtag. 

```{r}

```
Great! Now the y-axis does not change between the plots and they are directly comparable.

## Conditional for loops 

Boolean statements can be used to write conditional statements. If we do not want the loop to be applied to every item, we can add a condition. 

```{r}

```

This will only output if the condition is met. We can also modify this statement to do something in case the condition is not met. 

```{r}

```
Using `if else` statements will allow more customizability in our code. Let's use this to add a new column called `ever_smoked` based on the value in the `smoking_status` column. 

```{r}

```

## Exercise

Expected fasting blood glucose concentrations defined by the WHO are between 70 - 100 mg/dL. Create a new column called `glucose_WHO` in which: 

- `avg_glucose_levels` less than 70 are annotated as `followup`
- `avg_glucose_levels` between 70-100 are annotated as `average`
- `avg_glucose_levels` over 70 are annotated as `followup`

Hint: conditions can be combined using the `&` for `and` where as `|` is used for `or` statements. 

```{r}

```


## Countinous variables 

In the heart dataset, we have three continuous variables. In R, continuous variables will be `numeric` values. Continuous variable have a wide range of ordered values. For example, the `age` values have a range of 10 to 82 - any value in between is possible. 

```{r}

```

Staticians commonly prefer working with normally distributed data because this is a heavily studied and predictable distribution. Confirming that the variable is normally distributed opens up options for robust statistical approaches to be applied. 

Is this heart dataset normally distributed? 

QQ plots, or quantile-quantile plots, are unique scatterplots that help us determine the distribution. Rather than black and white diagnostic tool, this is a visualization tool for inform our analysis. The `qqnorm()` sorts the values in the vector and compares it to a theoretical normal distribution (the `norm` part of `qqnorm`). 

```{r}

```

For a normal distribution, we ideally want a straight diagonal line. Notice the slight curve on the right end, we can see the tail is also on the right side of this histogram. Curves indicate deviation away from normality. This looks faily normal, we can check to see if a transformation improves the distribution. 

```{r}

```

The tail is exaggerated. Since all transformations add some artificial noise, we avoid applying them when it does not significantly improve the shape of our data. 

We will proceed with the non-transformed data. 

## Linear models 

Linear regression models allow us to investigate the relationship between two continuous variables. For simple linear models, we have one independent and one dependent variable. The independent variable is the one that is being controlled or manipulated in the experiment, and the dependent variable will change respectively. 

For example, if we are investigating if a high fat diet affects sleep quality, the diet is the independent variable (changing or is different between participants) while the sleep quality is the dependent variable (depending on the diet, the sleep quality will change). 

Here, we are investigating the relation between bmi and avg_glucose_level

We'll first visualize the two variables

```{r}

```

Notice here the aes is specified in the `geom_point()` call rather than the parent `ggplot()` call. This is helpful if your plots have multiple layers and you want the aes to apply only to one layer. Parameters in the `ggplot()` call will apply to all layers in the plot where as parameters specified in the `geom_point()` will only affect this specific layer. 

For this plot with only one layer, this has no functional impact on the plot made, but this will be important if you make more complex and layered plots. 

```{r}

```

This can be read as `avg_glucose_level` as a function of `bmi`

Use the function summary() on fit1 object to obtain more details of the model.

```{r}

```
This overall looks like a good model. The p-value is very low and statistically significant. However, the Multiple R-squared values is small and the slope of bmi is low. 

From the results, we could conclude that changes in bmi are associated to the average glucose level as the p-value is less than 0.05. We can also state that as bmi increases, there will be an increase in the average glucose level as the slope is weakly positive 0.11.

Now that our model has given us the intercept and slope, we can use this information to build a formula in the format of 

> dependent = (m)(independent) + b 

> avg_glucose_level = (0.11150)(bmi) + 88.68377

We can use our knowledge of writing functions to calculate the predict the glucose level from the patient's bmi

```{r}

```


Next, let's visualize this using ggplot 

```{r}

```


## Exercise 

Hmm, this looks like there are two densities of data in this image. Let's try to investigate if we can figure it out.  

I've started you off by creating two new objects, an object called `heart_stroke` that contains only patents who experienced a stroke (stroke == 1) and a second object called `heart_nostroke` that contains only patents who have not experienced a stroke (stroke == 0). 

Next, create two separate linear models called `fit_stroke` and `fit_nostroke` - are they different? How will you visualize the data?

```{r}
heart_stroke <- filter(heart, stroke == 1)
table(heart_stroke$stroke)

heart_nostroke <- filter(heart, stroke == 0)
table(heart_nostroke$stroke)
```

```{r}

```

## Day 2 project

For this mini guided project, we will be working with a dataset that contains the prices and other attributes of almost 54,000 diamonds and is publicly available at: https://www.kaggle.com/datasets/shivam2503/diamonds

Here is some more information about each column: 

price price in US dollars (\$326--\$18,823)
carat weight of the diamond (0.2--5.01)
cut quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color diamond colour, from J (worst) to D (best)
clarity a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x length in mm (0--10.74)
y width in mm (0--58.9)
z depth in mm (0--31.8)
depth total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
table width of top of diamond relative to widest point (43--95)

Insert a code chunk underneath each step to carry out the instruction. 

1. Read in the "diamonds.csv" data into an object called data. Use the `strongAsFactors` parameter to automatically import the character values as a factor. Check the object you created by printing out the first 10 rows and applying the summary function.

2. We'll start investigating the relationship between different variables with price. Create a boxplot using ggplot with the cut on the x axis and the price on the y axis. 

3. Using the code in the previous step as foundation, create an object called `diamonds_categorical` that contains the name of all the categorical columns. Then, write a loop to print out a separate plot with each of the different categorical variables on the x axis and price on the y-axis. 

Hint: remember how aes can be specified in both the `ggplot()` or `geom_xx()` layer? You will need to use this because we are string the name of a column as a variable in our loop and in order for R to know that it is looking for a column name rather than an object, you will need to use the `aes_string()` parameter for just the `geom_boxplot()` layer to specify the changing x axis. The y axis can remain with `aes()` in the parent `ggplot()` layer. 

4. Write a function called `diamond_continous` that allows you to make a scatterplot that accepts one variable to plot on the x-axis as well as another variable to color the plot by as the inputs. Price will remain on the y-axis. 

Hint: Start out by making one plot, make it generalized, and then convert this into a function. 

5. Based on your plots from the previous step, pick a continuous variable to compare with price and create a linear model. Make sure the price is the dependent variable. Add a layer to your plot generated by the function to include the equation of the line. 

Is the relationship significant?