---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
# Download the files?
do_dl <- FALSE
save_load_file <- "~/ukbschemas-test-data/ukbschemas_db_test.sqlite"
```
# ukbschemas
<!-- badges: start -->
[Lifecycle: experimental](https://www.tidyverse.org/lifecycle/#experimental)
[Build Status](https://travis-ci.com/bjcairns/ukbschemas)
<!-- badges: end -->
This R package can be used to create and/or load a database containing the [UK Biobank Data Showcase schemas](http://biobank.ctsu.ox.ac.uk/crystal/schema.cgi), which are data dictionaries describing the structure of the UK Biobank main dataset.
## Installation
You can install the current version of ukbschemas from [GitHub](https://github.com/) with:
```{r install, eval=FALSE}
# install.packages("devtools")
devtools::install_github("bjcairns/ukbschemas")
library(ukbschemas)
```
## Examples
```{r prelim, echo=FALSE, include=FALSE}
try(rm(list=c("db","sch")))
library(ukbschemas)
```
The package supports two workflows.
#### Save-Load workflow (recommended)
The recommended approach is to use `ukbschemas_db()` to download the schema tables and save them to an SQLite database, then use `load_db()` to load the tables from the database and store them as tibbles in a named list:
```{r create, eval=do_dl}
db <- ukbschemas_db(path = tempdir())
sch <- load_db(db = db)
```
```{r info, echo=FALSE, include=FALSE}
if (do_dl) {
  # file.path() builds the path portably instead of hard-coding "\\"
  file <- file.path(tempdir(), paste0("ukb-schemas-", Sys.Date(), ".sqlite"))
  file.copy(file, save_load_file)
}
finfo <- file.info(save_load_file)
fsize <- round(finfo$size/1e6, 1)
fmtime <- finfo$mtime
```
By default, the database is named `ukb-schemas-YYYY-MM-DD.sqlite` (where `YYYY-MM-DD` is the current date) and placed in the current working directory. (In the example above, `path = tempdir()` puts it in the session's temporary directory instead.) At the most recent compilation of the database (`r format(fmtime, "%d %B %Y")`), the size of the .sqlite database file produced by `ukbschemas_db()` was approximately `r fsize` MB.
Note that, without further arguments, `ukbschemas_db()` tidies up the database to give it a more consistent relational structure (the changes are summarised in the output of the first example, above). Alternatively, the raw data can be loaded unchanged by setting the `as_is` argument:
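If you want to check for an existing database before downloading again, the default naming scheme can be reproduced in base R. This is only a sketch: `default_db_path()` is a hypothetical helper, not a function exported by ukbschemas.

```r
# Hypothetical helper (not part of ukbschemas): builds a path that follows
# the default naming scheme, ukb-schemas-YYYY-MM-DD.sqlite.
default_db_path <- function(dir = getwd(), date = Sys.Date()) {
  file.path(dir, paste0("ukb-schemas-", format(date, "%Y-%m-%d"), ".sqlite"))
}

db_file <- default_db_path(tempdir())
basename(db_file)

# TRUE only if a database with today's date is already in that directory
file.exists(db_file)
```

A check like `file.exists(db_file)` could then be used to decide whether a fresh download is needed.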
```{r create_as_is, eval=FALSE}
db <- ukbschemas_db(path = tempdir(), overwrite = TRUE, as_is = TRUE)
```
The `overwrite` argument controls whether an existing database file is replaced: `TRUE` overwrites it and `FALSE` prevents overwriting. If `overwrite` is not specified and the session is interactive (`interactive() == TRUE`), the user is prompted to decide.
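The three-way behaviour of `overwrite` can be sketched in base R as follows. This is an illustration only, not the package's own code: `decide_overwrite()` is a hypothetical name, and the fallback for non-interactive sessions (keep the existing file) is an assumption, since it is not described here.

```r
# Hypothetical sketch of the overwrite decision (not the ukbschemas source).
decide_overwrite <- function(path, overwrite = NULL) {
  if (!file.exists(path)) return(TRUE)   # nothing to overwrite
  if (isTRUE(overwrite)) return(TRUE)    # explicit permission
  if (isFALSE(overwrite)) return(FALSE)  # explicit refusal
  if (interactive()) {
    answer <- readline(paste0("Overwrite ", path, "? [y/N] "))
    return(tolower(trimws(answer)) %in% c("y", "yes"))
  }
  # Assumption: with no explicit choice and nobody to ask, keep the file.
  FALSE
}

existing <- tempfile(fileext = ".sqlite")
file.create(existing)
decide_overwrite(existing, overwrite = TRUE)
decide_overwrite(existing, overwrite = FALSE)
```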
**Note:** If you have created a schemas database with an earlier version of ukbschemas, it should be possible to load that database with the latest version of `load_db()`, which (currently) should load any SQLite database, regardless of contents.
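To illustrate what loading "any SQLite database" into a named list might involve, here is a minimal sketch that uses DBI and RSQLite directly. It assumes both packages are installed, and `read_all_tables()` is a hypothetical stand-in; it is not how `load_db()` is necessarily implemented.

```r
library(DBI)

# Build a tiny throwaway SQLite file so the example is self-contained.
path <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), path)
dbWriteTable(con, "toy_table", data.frame(id = 1:2, label = c("a", "b")))
dbDisconnect(con)

# Hypothetical loader: read every table in the file into a named list.
read_all_tables <- function(path) {
  con <- dbConnect(RSQLite::SQLite(), path)
  on.exit(dbDisconnect(con), add = TRUE)  # always close the connection
  tbl_names <- dbListTables(con)
  tables <- lapply(tbl_names, function(nm) dbReadTable(con, nm))
  names(tables) <- tbl_names
  tables
}

tables <- read_all_tables(path)
names(tables)
```

The sketch returns plain data frames; wrapping each one in `tibble::as_tibble()` would give tibbles instead.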
#### Load-Save workflow
The second approach is to download the schemas and store them in memory in a list, and save them to a database only as required.
This is **not** recommended, because it is better (for everyone) not to download the schema files every time they are needed, and because the database assumes a certain structure that should be guaranteed when the database is saved. If you still want to take this approach, use:
```{r inmemory, eval=FALSE}
sch <- ukbschemas()
db <- save_db(sch, path = tempdir())
```
## Why R?
This package was originally written in bash (a Unix shell scripting language). However, R is more accessible and all dependencies are installed along with the package; there is no need to install any secondary software (not even SQLite).
## Notes
#### Design notes
* All the encoding value tables (`esimpint`, `esimpstring`, `esimpreal`, `esimpdate`, `ehierint`, `ehierstring`) have been harmonised and combined into a single table `encvalues`. The `value` column in `encvalues` has type `TEXT`, but a `type` column has been added in case the type of a value is not clear from context. The original type-specific tables have been deleted.
* To avoid redundancy, category parent-child relationships have been moved to table `categories`, as column `parent_id`, from table `catbrowse` (which has been deleted).
* The category to which a field belongs is recorded in the `main_category` column of the `fields` schema; this column has been renamed to `category_id` for consistency with the `categories` schema.
* Details of several of the field properties (`value_type`, `stability`, `item_type`, `strata` and `sexed`) are available elsewhere on the Data Showcase. These have been added manually to tables `valuetypes`, `stability`, `itemtypes`, `strata` and `sexed`, and appropriate ID references have been renamed with the `_id` suffix in tables `fields` and `encodings`.
* There are several columns in the tables which are not well-documented (e.g. `base_type` in `fields`, `availability` in `encodings` and `categories`, and others). Additional tables documenting these encoded values may be included in future versions (and suggestions are welcome).
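The harmonisation of the encoding value tables described in the first note above can be sketched with toy data in base R. The rows, the column names other than `value` and `type`, the labels stored in `type`, and the `harmonise()` helper are all made up for illustration; only the idea (coerce `value` to text, record the original type, stack the tables) comes from the note.

```r
# Toy stand-ins for two of the type-specific tables (contents are made up).
esimpint <- data.frame(
  encoding_id = c(1L, 1L), value = c(0L, 1L), meaning = c("No", "Yes"),
  stringsAsFactors = FALSE
)
esimpstring <- data.frame(
  encoding_id = c(2L, 2L), value = c("A", "B"),
  meaning = c("Group A", "Group B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: store `value` as text and keep its original type.
harmonise <- function(tbl, type) {
  tbl$value <- as.character(tbl$value)
  tbl$type <- type
  tbl
}

encvalues <- rbind(
  harmonise(esimpint, "integer"),
  harmonise(esimpstring, "string")
)
str(encvalues)

# The `type` column makes it possible to recover typed values when needed.
as.integer(encvalues$value[encvalues$type == "integer"])
```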
#### Known code issues
* The UK Biobank data schemas are regularly updated as new data are added to the system. ukbschemas does not currently include a facility for updating the database; it is necessary to create a new database.
* Because `readr::read_csv()` reads whole numbers as type `double`, not `integer` (which avoids overflow for values beyond R's 32-bit `integer` range), column types in schemas loaded in R will differ depending on whether the schemas are loaded directly to R or first saved to a database. This should make little or no difference for most applications.
* Any [other issues](https://github.com/bjcairns/ukbschemas/issues).
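
If the `double`/`integer` difference noted above does matter for your application, whole-number columns can be converted after loading. The following base R sketch uses a hypothetical helper, `as_integer_if_safe()`, which is not part of ukbschemas, and a made-up toy vector:

```r
# Toy column of whole numbers stored as double (values are made up).
ids <- c(1, 2, 3)
typeof(ids)

# Hypothetical helper: convert a double vector to integer only when every
# non-missing value is whole and fits in R's 32-bit integer range.
as_integer_if_safe <- function(x) {
  ok <- is.double(x) &&
    all(is.na(x) | (x == trunc(x) & abs(x) <= .Machine$integer.max))
  if (ok) as.integer(x) else x
}

typeof(as_integer_if_safe(ids))        # whole and small: converted
typeof(as_integer_if_safe(c(1.5, 2)))  # not whole: left as double
typeof(as_integer_if_safe(2^40))       # too large for integer: left as double
```

Applying the helper column by column, for example with `lapply()` over a data frame, leaves non-numeric columns untouched.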