Subsetting.Rmd

# Subsetting

```{r Subsetting-1, include = FALSE}
source("common.R")
```

Attaching the needed libraries:

```{r Subsetting-2, warning=FALSE, message=FALSE}
library(tibble)
```

## Selecting multiple elements (Exercises 4.2.6)

**Q1.** Fix each of the following common data frame subsetting errors:

```{r Subsetting-3, eval = FALSE}
# styler: off
mtcars[mtcars$cyl = 4, ]
mtcars[-1:4, ]
mtcars[mtcars$cyl <= 5]
mtcars[mtcars$cyl == 4 | 6, ]
# styler: on
```

**A1.** Fixed versions of these commands:

```{r Subsetting-4, eval=FALSE}
# `==` instead of `=`
mtcars[mtcars$cyl == 4, ]

# `-(1:4)` instead of `-1:4`
mtcars[-(1:4), ]

# `,` was missing
mtcars[mtcars$cyl <= 5, ]

# correct subsetting syntax
mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, ]
mtcars[mtcars$cyl %in% c(4, 6), ]
```

**Q2.** Why does the following code yield five missing values?

```{r Subsetting-5}
x <- 1:5
x[NA]
```

**A2.** This is because of two reasons:

- The default type of `NA` in R is of `logical` type.

```{r Subsetting-6}
typeof(NA)
```

- R recycles indexes to match the length of the vector.

```{r Subsetting-7}
x <- 1:5
x[c(TRUE, FALSE)] # recycled to c(TRUE, FALSE, TRUE, FALSE, TRUE)
```

**Q3.** What does `upper.tri()` return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?

```{r Subsetting-8, eval = FALSE}
x <- outer(1:5, 1:5, FUN = "*")
x[upper.tri(x)]
```

**A3.** The documentation for `upper.tri()` states-

> Returns a matrix of logicals the same size of a given matrix with entries `TRUE` in the **upper triangle**

```{r Subsetting-9}
(x <- outer(1:5, 1:5, FUN = "*"))

upper.tri(x)
```

When used with a matrix for subsetting, elements corresponding to `TRUE` in the subsetting matrix are selected. But, instead of a matrix, this returns a vector:

```{r Subsetting-10}
x[upper.tri(x)]
```

**Q4.** Why does `mtcars[1:20]` return an error? How does it differ from the similar `mtcars[1:20, ]`?

**A4.** When indexed like a list, data frame columns at given indices will be selected.

```{r Subsetting-11}
head(mtcars[1:2])
```

`mtcars[1:20]` doesn't work because there are only `r length(mtcars)` columns in `mtcars` dataset.

On the other hand, `mtcars[1:20, ]` indexes a dataframe like a matrix, and because there are indeed 20 rows in `mtcars`, all columns with these rows are selected.

```{r Subsetting-12}
nrow(mtcars[1:20, ])
```

**Q5.** Implement your own function that extracts the diagonal entries from a matrix (it should behave like `diag(x)` where `x` is a matrix).

**A5.** We can combine the existing functions to our advantage:

```{r Subsetting-13}
x[!upper.tri(x) & !lower.tri(x)]

diag(x)
```

**Q6.** What does `df[is.na(df)] <- 0` do? How does it work?

**A6.** This expression replaces every instance of `NA` in `df` with `0`.

`is.na(df)` produces a matrix of logical values, which provides a way of subsetting.

```{r Subsetting-14}
(df <- tibble(x = c(1, 2, NA), y = c(NA, 5, NA)))

is.na(df)

class(is.na(df))
```

## Selecting a single element (Exercises 4.3.5)

**Q1.** Brainstorm as many ways as possible to extract the third value from the `cyl` variable in the `mtcars` dataset.

**A1.** Possible ways to to extract the third value from the `cyl` variable in the `mtcars` dataset:

```{r Subsetting-15}
mtcars[["cyl"]][[3]]
mtcars[[c(2, 3)]]
mtcars[3, ][["cyl"]]
mtcars[3, ]$cyl
mtcars[3, "cyl"]
mtcars[, "cyl"][[3]]
mtcars[3, 2]
mtcars$cyl[[3]]
```

**Q2.** Given a linear model, e.g., `mod <- lm(mpg ~ wt, data = mtcars)`, extract the residual degrees of freedom. Then extract the R squared from the model summary (`summary(mod)`)

**A2.** Given that objects of class `lm` are lists, we can use subsetting operators to extract elements we want.

```{r Subsetting-16}
mod <- lm(mpg ~ wt, data = mtcars)
class(mod)
typeof(mod)
```

- extracting the residual degrees of freedom

```{r Subsetting-17}
mod$df.residual 
mod[["df.residual"]]
```

- extracting the R squared from the model summary

```{r Subsetting-18}
summary(mod)$r.squared
summary(mod)[["r.squared"]]
```

## Applications (Exercises 4.5.9)

**Q1.**  How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?

**A1.** Let's create a small data frame to work with.

```{r Subsetting-19}
df <- head(mtcars)

# original
df
```

To randomly permute the columns of a data frame, we can combine `[` and `sample()` as follows:

- randomly permute columns

```{r Subsetting-20}
df[sample.int(ncol(df))]
```

- randomly permute rows

```{r Subsetting-21}
df[sample.int(nrow(df)), ]
```

- randomly permute columns and rows

```{r Subsetting-22}
df[sample.int(nrow(df)), sample.int(ncol(df))]
```

**Q2.** How would you select a random sample of `m` rows from a data frame?  What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?

**A2.**  Let's create a small data frame to work with.

```{r Subsetting-23}
df <- head(mtcars)

# original
df

# number of rows to sample
m <- 2L
```

To  select a random sample of `m` rows from a data frame, we can combine `[` and `sample()` as follows:

- random and non-contiguous sample of `m` rows from a data frame

```{r Subsetting-24}
df[sample(nrow(df), m), ]
```

- random and contiguous sample of `m` rows from a data frame

```{r Subsetting-25}
# select a random starting position from available number of rows
start_row <- sample(nrow(df) - m + 1, size = 1)

# adjust ending position while avoiding off-by-one error
end_row <- start_row + m - 1

df[start_row:end_row, ]
```

**Q3.** How could you put the columns in a data frame in alphabetical order?

**A3.** we can sort columns in a data frame in the alphabetical order using `[` with `order()`:

```{r Subsetting-26}
# columns in original order
names(mtcars)

# columns in alphabetical order
names(mtcars[order(names(mtcars))])
```

## Session information

```{r Subsetting-27}
sessioninfo::session_info(include_base = TRUE)
```