A collection of example datasets for teaching purposes.
Add tags to each dataset for easy indexing later. Include links to the data and a description of the data if available.
Optional: include a code chunk for cleaning messy data.
- biological - biological examples
- non-biological - non-biological examples
- messy - data that requires cleaning
- clean - data ready to use
Huge dataset tracking human disease over time. Unsure whether license agreement allows rehosting.
Large Excel table with messy formatting; could be good for tidying-data examples.
library(readxl)
library(magrittr)
link <- "https://ucr.fbi.gov/crime-in-the-u.s/2015/crime-in-the-u.s.-2015/tables/table-9/table_9_offenses_known_to_law_enforcement_by_state_by_university_and_college_2015.xls"
file <- "crime.xls"
download.file(link, file, mode = "wb") # binary mode so the .xls is not corrupted on Windows
df <- read_xls(file, skip = 3) # skip header
# drop annotations at bottom of data table
drop <- nrow(df) - 8
df <- df[1:drop,]
# drop extra columns read in because of annotations at bottom
# (older readxl versions name them X__1, X__2, ...; newer versions use ...1, ...2)
df %<>%
  dplyr::select(-dplyr::matches("^X_|^\\.\\.\\."))
# clean up col names
names(df) %<>%
  gsub("\n", "_", .) %>%
  gsub(" ", "_", .) %>%
  gsub("-", "", .) %>%
  gsub("/", ".", .) %>%
  gsub("\\d", "", .) %>%
  gsub("[()]", "", .) %>%
  tolower()
# fill in missing values caused by using merged cells in excel
df %<>%
  tidyr::fill(state) %>%
  tidyr::fill(university.college)
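
The last two cleaning steps can be tried without downloading the file. The sketch below is a minimal illustration, not part of the original cleaning script: the `toy` data frame and the `raw_names` strings are made up to mimic merged cells and multi-line headers, and are not taken from the real table.

```r
library(magrittr)
library(tidyr)

# Hypothetical toy table mimicking merged cells in Excel: the state appears
# only on the first row of each group and is read in as NA elsewhere.
toy <- data.frame(
  state  = c("ALABAMA", NA, NA, "ALASKA", NA),
  campus = c("Huntsville", "Montgomery", "Tuscaloosa", "Anchorage", "Fairbanks"),
  stringsAsFactors = FALSE
)

# fill() carries the last non-missing value downward (its default direction)
filled <- fill(toy, state)
print(filled)

# Made-up header strings with the same kinds of problems as the real ones:
# line breaks, slashes, parentheses, and trailing footnote digits
raw_names <- c("University/College", "Student\nenrollment1", "Rape\n(revised\ndefinition)2")

# Same substitutions as in the cleaning code above
clean_names <- raw_names %>%
  gsub("\n", "_", .) %>%
  gsub(" ", "_", .) %>%
  gsub("-", "", .) %>%
  gsub("/", ".", .) %>%
  gsub("\\d", "", .) %>%
  gsub("[()]", "", .) %>%
  tolower()
print(clean_names)
```

Note that `gsub("\\d", "", .)` strips every digit, not only footnote markers, so this approach assumes no column name contains a meaningful number.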