README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

# Download the files?
do_dl <- FALSE
save_load_file <- "~/ukbschemas-test-data/ukbschemas_db_test.sqlite"

```
# ukbschemas

<!-- badges: start -->
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)
[![Build Status](https://travis-ci.com/bjcairns/ukbschemas.svg?token=tA2cYTLpigx5VuTgcHFd&branch=master)](https://travis-ci.com/bjcairns/ukbschemas)
<!-- badges: end -->

This R package can be used to create and/or load a database containing the [UK Biobank Data Showcase schemas](http://biobank.ctsu.ox.ac.uk/crystal/schema.cgi), which are data dictionaries describing the structure of the UK Biobank main dataset.

## Installation

You can install the current version of ukbschemas from [GitHub](https://github.com/) with:

```{r install, eval=FALSE}
# install.packages("devtools")
devtools::install_github("bjcairns/ukbschemas")

library(ukbschemas)
```

## Examples

```{r prelim, echo=FALSE, include=FALSE}
try(rm(list=c("db","sch")))
library(ukbschemas)
```

The package supports two workflows. 

#### Save-Load workflow (recommended)

The recommended approach is to use `ukbschemas_db()` to download the schema tables and save them to an SQLite database, then use `load_db()` to load the tables from the database and store them as tibbles in a named list:

```{r create, eval=do_dl}
db <- ukbschemas_db(path = tempdir())
sch <- load_db(db = db)
```

```{r info, echo=FALSE, include=FALSE}
if (do_dl) {
  file <- paste0(tempdir(), "\\ukb-schemas-", Sys.Date(), ".sqlite")
  file.copy(file, save_load_file)
}

finfo <- file.info(save_load_file)
fsize <- round(finfo$size/1e6, 1)
fmtime <- finfo$mtime
```

By default, the database is named `ukb-schemas-YYYY-MM-DD.sqlite` (where `YYYY-MM-DD` is the current date) and placed in the current working directory. (`path = tempdir()` in the above example puts it in the current temporary directory instead.) At the most recent compilation of the database (`r format(fmtime, "%d %B %Y")`), the size of the .sqlite database file produced by `ukbschemas_db()` was approximately `r fsize`MB.

Note that without further arguments, `ukbschemas_db()` tidies up the database to give it a more consistent relational structure (the changes are summarised in the output of the first example, above). Alternatively the raw data can be loaded with the `as_is` argument:

```{r create_as_is, eval=FALSE}
db <- ukbschemas_db(path = tempdir(), overwrite = TRUE, as_is = TRUE)
```

The `overwrite` option allows the database file to be overwritten (if `TRUE`), or prevents this (`FALSE`), or if not specified and the session is interactive (`interactive() == TRUE`) then the user is prompted to decide.

**Note:** If you have created a schemas database with an earlier version of ukbschemas, it should be possible to load that database with the latest version of `load_db()`, which (currently) should load any SQLite database, regardless of contents.

#### Load-Save workflow

The second approach is to download the schemas and store them in memory in a list, and save them to a database only as requried. 

This is **not** recommended, because it is better (for everyone) not to download the schema files every time they are needed, and because the database assumes a certain structure that should be guaranteed when the database is saved. If you still want to take this approach, use:

```{r inmemory, eval=FALSE}
sch <- ukbschemas()
db <- save_db(sch, path = tempdir())
```

## Why R?

This package was originally written in bash (a Unix shell scripting language). However, R is more accessible and all dependencies are loaded when you install the package; there is no need to install any secondary software (not even SQLite).

## Notes

#### Design notes

* All the encoding value tables (`esimpint`, `esimpstring`, `esimpreal`, `esimpdate`, `ehierint`, `ehierstring`) have been harmonised and combined into a single table `encvalues`. The `value` column in `encvalues` has type `TEXT`, but a `type` column has been added in case the value is not clear from context. The original type-specific tables have been deleted.
* To avoid redunancy, category parent-child relationships have been moved to table `categories`, as column `parent_id`, from table `catbrowse` (which has been deleted).
* Reference to the category to which a field belongs is in the `main_category` column in the `fields` schema, but has been renamed to `category_id` for consistency with the `categories` schema.
* Details of several of the field properties (`value_type`, `stability`, `item_type`, `strata` and `sexed`) are available elsewhere on the Data Showcase. These have been added manually to tables `valuetypes`, `stability`, `itemtypes`, `strata` and `sexed`, and appropriate ID references have been renamed with the `_id` suffix in tables `fields` and `encodings`.
* There are several columns in the tables which are not well-documented (e.g. `base_type` in fields, `availability` in `encodings` and `categories`, and others). Additional tables documenting these encoded values may be included in future versions (and suggestions are welcome).

#### Known code issues

* The UK Biobank data schemas are regularly updated as new data are added to the system. ukbschemas does not currently include a facility for updating the database; it is necessary to create a new database. 
* Because `readr::read_csv()` reads whole numbers as type `double`, not `integer` (allowing 64-bit integers without loss of information), column types in schemas loaded in R will differ depending on whether the schemas are loaded directly to R or first saved to a database. This should make little or no difference for most applications.
* Any [other issues](https://github.com/bjcairns/ukbschemas/issues).