---
title: "Contributing"
output: distill::distill_article
description: "How to create an R Targetopia package"
---

```{r, include = FALSE}
knitr::opts_chunk$set(eval = FALSE, echo = TRUE)
```

The R Targetopia has the potential to cover multiple fields of statistics and data science, and community contributions are extremely valuable. The following guide explains how to create your own R Targetopia package.

## Before you begin

### Prerequisites

1. Domain expertise in a subfield of data science.
1. Familiarity with [`targets`](https://docs.ropensci.org/targets). (Resources [linked here](https://docs.ropensci.org/targets).)
1. Experience with [R package development](https://r-pkgs.org/), including [documentation](https://r-pkgs.org/man.html) and [testing](https://r-pkgs.org/tests.html). The [rOpenSci development guide](https://devguide.ropensci.org/) is super helpful.

### Scope

R Targetopia packages are highly specialized, and each is tailored to an existing implementation of the underlying methodology. For example, [`stantargets`](https://docs.ropensci.org/stantargets/) builds on [`cmdstanr`](https://mc-stan.org/cmdstanr/), and the former inherits interface patterns and documentation from the latter. [`brms`](https://paul-buerkner.github.io/brms/) compatibility is out of scope and would need to be implemented in its own R Targetopia package ([discussion here](ropensci/stantargets#12)).

## Implementation

### Target factories

R Targetopia packages leverage target factories to make pipeline construction easier. A target factory is a function that accepts simple inputs, calls [`tar_target_raw()`](https://docs.ropensci.org/targets/reference/tar_target_raw.html), and produces a list of target objects. Sketch:

```{r}
# R/factory.R
#' @title Example target factory.
#' @description Define 3 targets:
#' 1. Track the user-supplied data file.
#' 2. Read the data using `read_data()` (defined elsewhere).
#' 3. Fit a model to the data using `run_model()` (defined elsewhere).
#' @return A list of target objects.
#' @export
#' @param file Character, data file path.
target_factory <- function(file) {
  list(
    tar_target_raw("file", file, format = "file", deployment = "main"),
    tar_target_raw("data", quote(read_data(file)), format = "fst_tbl", deployment = "main"),
    tar_target_raw("model", quote(run_model(data)), format = "qs")
  )
}
```

In `_targets.R`, the user writes one call to the factory instead of multiple calls to [`tar_target()`](https://docs.ropensci.org/targets/reference/tar_target.html).^[Users can still write their own downstream [`tar_target()`](https://docs.ropensci.org/targets/reference/tar_target.html) calls in the pipeline for custom postprocessing.] This shorthand makes user-side code simpler and more concise, and it abstracts away low-level configuration settings like `format = "file"` and `deployment = "main"`.

```{r}
# _targets.R
library(targets)
library(yourExamplePackage)
target_factory("data.csv") # End with a list of targets.
```

```{r}
# R console
tar_manifest(fields = command)
#> # A tibble: 3 x 2
#>   name  command          
#>   <chr> <chr>            
#> 1 file  "\"data.csv\""   
#> 2 data  "read_data(file)"           
#> 3 model "run_model(data)"
```

### Metaprogramming

Target factories invoke the [`tar_target_raw()`](https://docs.ropensci.org/targets/reference/tar_target_raw.html) function. Whereas [`tar_target()`](https://docs.ropensci.org/targets/reference/tar_target.html) is for end users, [`tar_target_raw()`](https://docs.ropensci.org/targets/reference/tar_target_raw.html) is for developers. [`tar_target_raw()`](https://docs.ropensci.org/targets/reference/tar_target_raw.html) expects a character string for the `name` argument and expression objects for arguments `command` and `pattern`. Functions `deparse()`, [`substitute()`](http://adv-r.had.co.nz/Computing-on-the-language.html#substitute), [`tar_sub()`](https://docs.ropensci.org/tarchetypes/reference/tar_sub.html), and [`tar_eval()`](https://docs.ropensci.org/tarchetypes/reference/tar_eval.html) can help you create these arguments.^[For more information about metaprogramming in base R, see the ["Computing on the Language" chapter of the Advanced R book](http://adv-r.had.co.nz/Computing-on-the-language.html#capturing-expressions).]

The `quote()` function captures arbitrary expressions.

```{r, eval = TRUE}
quote(f(x + y))

str(quote(f(x + y)))
```

The `deparse()` function turns expressions into characters.

```{r, eval = TRUE}
deparse(quote(f(x + y)))
```

The [`substitute()`](http://adv-r.had.co.nz/Computing-on-the-language.html#substitute) function quotes code like `quote()` does, and it also replaces symbols in the expression with values from `env`.

```{r, eval = TRUE}
substitute(f(arg = arg), env = list(arg = quote(x + y)))
```
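
The distinction between symbols and character strings matters when building commands. As a small base R illustration (the target name `custom_file` and the function `read_data()` are placeholders), substituting a symbol produces a reference to an upstream target, while substituting a string produces a string literal.

```{r}
name_file <- "custom_file" # Placeholder target name.

# Substituting a symbol yields a reference to the upstream target.
with_symbol <- substitute(
  read_data(file),
  env = list(file = as.symbol(name_file))
)
deparse(with_symbol)
#> [1] "read_data(custom_file)"

# Substituting the raw string yields a string literal instead.
with_string <- substitute(
  read_data(file),
  env = list(file = name_file)
)
deparse(with_string)
#> [1] "read_data(\"custom_file\")"
```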

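If you call `substitute()` from inside a function (or other non-global environment), then `env` defaults to the current evaluation environment, so function arguments are replaced with the unevaluated expressions the caller supplied.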

```{r, eval = TRUE}
f <- function(arg) substitute(f(arg = arg))
f(arg = f(x + y))
```
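
A common use of this behavior is to let the user type an unquoted name and then convert it to a character string with `deparse()`. A minimal base R sketch, where `arg_name()` is a hypothetical helper and `custom` is an arbitrary user-chosen name:

```{r}
# Hypothetical helper: capture an unquoted argument as a character string.
arg_name <- function(name) {
  deparse(substitute(name))
}

arg_name(custom)
#> [1] "custom"
```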

Together, `quote()`, `deparse()`, and `substitute()` help you create factories that accept friendly user inputs and supply safe arguments to [`tar_target_raw()`](https://docs.ropensci.org/targets/reference/tar_target_raw.html).

```{r}
# R/factory.R
#' @title Example target factory.
#' @description Define 3 targets:
#' 1. Track the user-supplied data file.
#' 2. Read the data using `read_data()` (defined elsewhere).
#' 3. Fit a model to the data using `run_model()` (defined elsewhere).
#' @return A list of target objects.
#' @export
#' @param name Symbol, name for the collection of targets.
#' @param file Character, data file path.
target_factory <- function(name, file) {
  name_model <- deparse(substitute(name))
  name_file <- paste0(name_model, "_file")
  name_data <- paste0(name_model, "_data")
  sym_file <- as.symbol(name_file)
  sym_data <- as.symbol(name_data)
  command_data <- substitute(read_data(file), env = list(file = sym_file))
  command_model <- substitute(run_model(data), env = list(data = sym_data))
  list(
    tar_target_raw(name_file, file, format = "file", deployment = "main"),
    tar_target_raw(name_data, command_data, format = "fst_tbl", deployment = "main"),
    tar_target_raw(name_model, command_model, format = "qs")
  )
}
```
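
With this version of the factory, the user supplies an unquoted name along with the file path. A `_targets.R` sketch, assuming the factory is exported from the hypothetical `yourExamplePackage` used earlier and that the user chooses the name `custom`:

```{r}
# _targets.R
library(targets)
library(yourExamplePackage) # Hypothetical package that exports target_factory().
target_factory(custom, "data.csv") # End with a list of targets.
```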

```{r}
# R console
tar_manifest(fields = command)
#> # A tibble: 3 x 2
#>   name        command                  
#>   <chr>       <chr>                    
#> 1 custom_file "\"data.csv\""           
#> 2 custom_data "read_data(custom_file)"
#> 3 custom      "run_model(custom_data)"
```

### Settings

Situational knowledge helps us supply optimal arguments to [`tar_target_raw()`](https://docs.ropensci.org/targets/reference/tar_target_raw.html) that the user should not need to bother with. We have four such examples in `target_factory()` above.

1. `deployment = "main"`: the data file lives on the user's local machine or login node, so remote workers in [high-performance computing](https://books.ropensci.org/targets/hpc.html) scenarios may not be able to access it. Targets like these should not run on remote compute nodes.
1. `format = "file"`: track the input data file and invalidate the appropriate targets when the contents of the file change.
1. `format = "fst_tbl"`: a specialized format that efficiently stores and retrieves data frames (tibbles).
1. `format = "qs"`: efficient general-purpose storage format for R objects.

Many of the remaining arguments to [`tar_target_raw()`](https://docs.ropensci.org/targets/reference/tar_target_raw.html) should be exposed as arguments to the factory (omitted from our example `target_factory()` for brevity) with default values from [`tar_option_get()`](https://docs.ropensci.org/targets/reference/tar_option_get.html). Examples may include `priority` and `cue` because users may have good reasons to set these. However, arguments like `command`, `pattern`, `deps`, and `string` are low level and should not be supported.
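
As a rough sketch of this idea, the variant below exposes two settings and forwards them to every target. The choice of `error` and `cue` is only for illustration, and `read_data()` and `run_model()` remain placeholders defined elsewhere.

```{r}
# R/factory.R
# Hypothetical variant of target_factory() that exposes `error` and `cue`.
# Defaults come from tar_option_get(), so tar_option_set() still applies.
target_factory <- function(
  file,
  error = targets::tar_option_get("error"),
  cue = targets::tar_option_get("cue")
) {
  list(
    targets::tar_target_raw(
      "file", file,
      format = "file", deployment = "main", error = error, cue = cue
    ),
    targets::tar_target_raw(
      "data", quote(read_data(file)),
      format = "fst_tbl", deployment = "main", error = error, cue = cue
    ),
    targets::tar_target_raw(
      "model", quote(run_model(data)),
      format = "qs", error = error, cue = cue
    )
  )
}
```

A user could then write something like `target_factory("data.csv", cue = tar_cue(mode = "always"))` in `_targets.R`, while `command`, `pattern`, `deps`, and `string` stay hidden.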

### Branching

[Dynamic branching](https://books.ropensci.org/targets/dynamic.html) and [static branching](https://books.ropensci.org/targets/static.html) are difficult for most end users, so the mechanics of branching should happen behind the scenes. Simplification and guardrails are critical.

#### Static branching

[Static branching](https://books.ropensci.org/targets/static.html) works best with a small number of potentially heterogeneous tasks. Functions [`tar_map()`](https://docs.ropensci.org/tarchetypes/reference/tar_map.html), [`tar_combine_raw()`](https://docs.ropensci.org/tarchetypes/reference/tar_combine_raw.html), [`tar_sub()`](https://docs.ropensci.org/tarchetypes/reference/tar_sub.html), and [`tar_eval()`](https://docs.ropensci.org/tarchetypes/reference/tar_eval.html) can help with the implementation internally. User-side inputs should be as simple as possible. For example, the [`stantargets::tar_stan_mcmc()`](https://docs.ropensci.org/stantargets/reference/tar_stan_mcmc.html) factory accepts a character vector of Stan model files and internally calls [`tar_map()`](https://docs.ropensci.org/tarchetypes/reference/tar_map.html) to create a group of targets for each model.
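
The following is a simplified sketch of that pattern, not the actual `stantargets` implementation. The factory name `model_factory()`, the target name `fit`, and the function `run_model()` are all hypothetical.

```{r}
# R/factory_models.R
# Hypothetical factory: one model target per file, built with tar_map().
model_factory <- function(files) {
  values <- list(
    # Short labels derived from the file names, used as target name suffixes.
    model_name = tools::file_path_sans_ext(basename(files)),
    # tar_map() substitutes each value for the symbol `model_file` below.
    model_file = files
  )
  tarchetypes::tar_map(
    values = values,
    targets::tar_target_raw("fit", quote(run_model(model_file))),
    names = "model_name",
    unlist = TRUE # Return a flat list of target objects.
  )
}
```

The user only writes `model_factory(c("a.stan", "b.stan"))`, which should yield one `fit_*` target per model file without exposing `tar_map()` or its `values` argument.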

#### Dynamic branching

[Dynamic branching](https://books.ropensci.org/targets/dynamic.html) is best suited to larger collections of homogeneous tasks whose inputs are not necessarily known in advance. A factory with dynamic branching should create the `pattern` argument of [`tar_target_raw()`](https://docs.ropensci.org/targets/reference/tar_target_raw.html) with behind-the-scenes metaprogramming, and it should support [batching](https://books.ropensci.org/targets/dynamic.html#batching) to sensibly partition the work. Users should control the number of batches and reps per batch, but they should not be able to control the `pattern` argument. Examples of batching include [`tar_rep_raw()`](https://docs.ropensci.org/tarchetypes/reference/tar_rep_raw.html), [`tar_stan_mcmc_rep_summary()`](https://docs.ropensci.org/stantargets/reference/tar_stan_mcmc_rep_summary.html), and the [`targets-stan`](https://github.com/wlandau/targets-stan/) workflow.
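
A minimal sketch of a batched factory, assuming a design with one upstream target of batch indices and one downstream target that maps over it. The names `batch_pattern()` and `batched_factory()` are hypothetical, and [`tar_rep_raw()`](https://docs.ropensci.org/tarchetypes/reference/tar_rep_raw.html) is a more complete implementation of the same idea.

```{r}
# R/factory_rep.R
# Hypothetical helper: build the `pattern` expression map(<batch target>).
batch_pattern <- function(name_batch) {
  substitute(map(x), env = list(x = as.symbol(name_batch)))
}

# Hypothetical factory: run `command` `reps` times in each of `batches`
# dynamic branches. Users never see or set `pattern`.
batched_factory <- function(name, command, batches = 1, reps = 1) {
  name <- deparse(substitute(name))
  command <- substitute(command)
  name_batch <- paste0(name, "_batch")
  command_batch <- substitute(seq_len(n), env = list(n = batches))
  command_reps <- substitute(
    replicate(reps, command, simplify = FALSE),
    env = list(reps = reps, command = command)
  )
  list(
    # Upstream target: a vector of batch indices to branch over.
    targets::tar_target_raw(name_batch, command_batch, deployment = "main"),
    # Downstream target: one dynamic branch per batch.
    targets::tar_target_raw(
      name,
      command_reps,
      pattern = batch_pattern(name_batch)
    )
  )
}
```

In `_targets.R`, a call such as `batched_factory(sim, simulate_once(), batches = 4, reps = 25)` (with a user-defined `simulate_once()`) would then request 4 branches of 25 reps each.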

## Documentation

### Examples

The `@examples` field of the [`roxygen2`](https://roxygen2.r-lib.org/) docstring should run quickly and avoid creating non-temporary files, which is why the examples in [`stantargets`](https://github.com/ropensci/stantargets) are mostly just sketches of pipelines. If you want to actually run a pipeline in an example, consider enclosing it inside [`tar_dir()`](https://docs.ropensci.org/targets/reference/tar_dir.html) to run the code in a temporary directory.

### README.Rmd

Feel free to include a README badge to let others know your package is part of the R Targetopia.

```md
[![R Targetopia](https://img.shields.io/badge/R_Targetopia-member-blue?style=flat&labelColor=gray)](https://wlandau.github.io/targetopia/)
```

[![R Targetopia](https://img.shields.io/badge/R_Targetopia-member-blue?style=flat&labelColor=gray)](https://wlandau.github.io/targetopia/)


## Testing

### What to test

1. Results: write a pipeline with [`tar_script()`](https://docs.ropensci.org/targets/reference/tar_script.html), run it with [`tar_make()`](https://docs.ropensci.org/targets/reference/tar_make.html), and inspect the output with [`tar_read()`](https://docs.ropensci.org/targets/reference/tar_read.html).
2. Manifest: use [`tar_manifest()`](https://docs.ropensci.org/targets/reference/tar_manifest.html) to check that the pipeline has the correct number of targets with the correct commands and configuration settings.
3. Dependencies: use the graph edges from [`tar_network()`](https://docs.ropensci.org/targets/reference/tar_network.html) to check the dependency relationships among the targets. For example, in our target factory from earlier, there should be a directed edge from the input file target to the data target.

### Speed

[Unit tests](https://testthat.r-lib.org/) should run quickly if possible. To increase testing speed, you may wish to set `callr_function = NULL` in functions like [`tar_make()`](https://docs.ropensci.org/targets/reference/tar_make.html), but be warned that the result will be sensitive to functions you define in the testing environment. [CRAN](https://cran.r-project.org/) has strict policies about total check time, and [`testthat::skip_on_cran()`](https://testthat.r-lib.org/reference/skip.html) can help.

### Environment

Tests should avoid creating non-temporary files, and they should avoid permanently changing [target-specific options](https://docs.ropensci.org/targets/reference/tar_option_set.html) that could affect other tests. [`tar_test()`](https://docs.ropensci.org/targets/reference/tar_test.html) is a drop-in replacement for [`test_that()`](https://testthat.r-lib.org/reference/test_that.html) that solves these problems. It runs the test in a temporary directory, and it automatically calls [`tar_option_reset()`](https://docs.ropensci.org/targets/reference/tar_option_reset.html) when the test is over. Tests using [`tar_test()`](https://docs.ropensci.org/targets/reference/tar_test.html) can freely create local files and set target options.
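
The manifest and dependency checks from the list above might be combined in one test file, sketched below. It assumes the first version of `target_factory()` is exported from the hypothetical `yourExamplePackage`, and it omits the results check because `read_data()` and `run_model()` are placeholders.

```{r}
# tests/testthat/test-factory.R
targets::tar_test("target_factory() creates the expected pipeline", {
  # Write a temporary _targets.R file that calls the factory.
  targets::tar_script({
    library(yourExamplePackage) # Hypothetical package.
    target_factory("data.csv")
  })
  # Manifest: check target names and a configuration setting.
  manifest <- targets::tar_manifest(
    fields = c("command", "format"),
    callr_function = NULL
  )
  testthat::expect_equal(sort(manifest$name), c("data", "file", "model"))
  testthat::expect_equal(manifest$format[manifest$name == "file"], "file")
  # Dependencies: the data target should depend on the file target.
  edges <- targets::tar_network(
    targets_only = TRUE,
    callr_function = NULL
  )$edges
  testthat::expect_true(any(edges$from == "file" & edges$to == "data"))
})
```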

## rOpenSci

R Targetopia packages support [workflow automation](https://devguide.ropensci.org/policies.html#package-categories), making them excellent candidates for [rOpenSci software review](https://github.com/ropensci/software-review). The review process is a valuable source of feedback, and the [rOpenSci](https://ropensci.org/) community is welcoming and supportive. More details are [available here](https://devguide.ropensci.org/softwarereviewintro.html).

## Contact

If you have a package idea or are actively working on one, please feel free to reach out.

<ul>
<li>`r fontawesome::fa("github")` [`@wlandau`](https://github.com/wlandau)</li>
<li>`r fontawesome::fa("linkedin")` [`@wlandau`](https://linkedin.com/in/wlandau)</li>
<li>`r fontawesome::fa("twitter")` [`@wmlandau`](https://twitter.com/wmlandau)</li>
</ul>
