Commit 44e3a31

Programmatically generate version-specific specification documents.
This is arguably easier to read if we want to understand any given version of the spec, as we don't have to mentally ignore the parts of the spec related to other versions. We use knitr to generate one document per version, only keeping the clauses relevant to that version via conditional chunks.
1 parent b1d39e6 commit 44e3a31
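
The version gating behind these conditional chunks can be sketched in base R. This is only an illustration, not code from the commit: `clause_applies` and its `since`/`before` arguments are hypothetical names, and only the `package_version()` comparisons are taken from the diff below.

```r
# Hypothetical helper: does a clause apply to the version being compiled?
# 'since' is inclusive and 'before' is exclusive; both are optional.
clause_applies <- function(version, since = NULL, before = NULL) {
    version <- package_version(version)
    if (!is.null(since) && version < package_version(since)) {
        return(FALSE)
    }
    if (!is.null(before) && version >= package_version(before)) {
        return(FALSE)
    }
    TRUE
}

# A clause introduced in 1.1 and superseded in 1.3:
for (v in c("1.0", "1.1", "1.2", "1.3")) {
    cat(v, ":", clause_applies(v, since = "1.1", before = "1.3"), "\n")
}
```

Inside a chunk with `results="asis"`, a call such as `if (clause_applies(.version, since = "1.1")) cat(...)` would then keep a clause only in the documents where it is relevant.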

File tree

7 files changed

+221
-85
lines changed

.github/workflows/doxygenate.yaml

Lines changed: 26 additions & 0 deletions
@@ -6,8 +6,28 @@ on:
 name: Build documentation

 jobs:
+  build-spec:
+    runs-on: ubuntu-latest
+    container: bioconductor/bioconductor_docker:devel
+
+    steps:
+    - name: Checkout repo
+      uses: actions/checkout@v3
+
+    - name: Compile markdown
+      run: |
+        cd docs/specifications
+        R -f build.R
+
+    - name: Upload markdown
+      uses: actions/upload-artifact@v3
+      with:
+        name: built-spec
+        path: docs/specifications/compiled
+
   docs:
     runs-on: ubuntu-latest
+    needs: build-spec
     steps:
     - uses: actions/checkout@v3

@@ -16,6 +36,12 @@ jobs:
       with:
         args: -O docs/doxygen-awesome.css https://raw.githubusercontent.com/jothepro/doxygen-awesome-css/main/doxygen-awesome.css

+    - name: Download markdown
+      uses: actions/download-artifact@v3
+      with:
+        name: built-spec
+        path: docs/specifications/compiled
+
     - name: Doxygen Action
       uses: mattnotmitt/doxygen-action@v1
       with:

docs/Doxyfile

Lines changed: 2 additions & 1 deletion
@@ -794,7 +794,8 @@ INPUT = ../include/uzuki2/parse_json.hpp \
 ../include/uzuki2/parse_hdf5.hpp \
 ../include/uzuki2/interfaces.hpp \
 ../include/uzuki2/uzuki2.hpp \
-../README.md
+../README.md \
+specifications/compiled

 # This tag can be used to specify the character encoding of the source files
 # that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses

docs/specifications/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+compiled/

docs/specifications/build.R

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+library(knitr)
+dir.create("compiled", showWarnings=FALSE)
+
+for (v in c("1.0", "1.1", "1.2", "1.3")) {
+    .version <- package_version(v)
+    knitr::knit("hdf5.Rmd", output=file.path("compiled", paste0("hdf5-", v, ".md")))
+}
+
+for (v in c("1.0", "1.1", "1.2")) {
+    .version <- package_version(v)
+    knitr::knit("json.Rmd", output=file.path("compiled", paste0("json-", v, ".md")))
+}
+
+file.copy("misc.md", file.path("compiled", "misc.md"))
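
To make the naming scheme of the build script concrete, the following sketch lists the files it is expected to leave in `compiled/`. The version vectors are copied from the two loops above; `spec_versions` and `compiled_paths` are hypothetical names introduced only for this illustration, and nothing here calls knitr.

```r
# Versions per format, copied from the loops in build.R.
spec_versions <- list(
    hdf5 = c("1.0", "1.1", "1.2", "1.3"),
    json = c("1.0", "1.1", "1.2")
)

# Hypothetical helper: expected output paths, following build.R's naming scheme.
compiled_paths <- function(versions = spec_versions, dir = "compiled") {
    per_format <- lapply(names(versions), function(fmt) {
        file.path(dir, paste0(fmt, "-", versions[[fmt]], ".md"))
    })
    # misc.md is copied verbatim rather than knitted.
    c(unlist(per_format), file.path(dir, "misc.md"))
}

print(compiled_paths())
```

After running the build, a check such as `all(file.exists(compiled_paths()))` could confirm that every version-specific document was produced before the Doxygen step consumes the directory.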
Lines changed: 114 additions & 52 deletions
@@ -1,18 +1,51 @@
1-
# HDF5 Specification
1+
```{r, results="hide", echo=FALSE}
2+
knitr::opts_chunk$set(error=FALSE)
3+
if (!exists(".version")) {
4+
.version <- package_version("1.3")
5+
}
6+
```
27

3-
## General comments
8+
```{r, results="asis", echo=FALSE}
9+
cat("# HDF5 Specification (", as.character(.version), ")", sep="")
10+
```
411

5-
We use `**/` to represent a variable name of the group representing any of the supported R objects.
6-
It is assumed that `**/` will be replaced by the actual name of the group in implementations,
7-
as defined by users (for the top-level group) or by the specification (e.g., as a nested child of a list).
12+
## Comments
813

9-
All objects should be nested inside an R list.
14+
### General
15+
16+
Every R object is represented by a HDF5 group.
17+
In the descriptions below, we use `**/` as a placeholder for the name of the group.
18+
19+
All R objects should be nested inside an R list.
20+
In other words, the top-level HDF5 group should represent an R list.
1021

1122
The top-level group may have a `uzuki_version` attribute, describing the version of the **uzuki2** specification that it uses.
1223
This should be a scalar string dataset of the form `X.Y` for non-negative integers `X` and `Y`.
13-
The latest version of this specification is **1.3**; if not provided, it is assumed to be **1.0**.
24+
The latest version of this specification is **1.3**; if not provided, it is assumed to be **1.0** for back-compatibility purposes.
25+
26+
```{r, echo=FALSE, results="asis"}
27+
if (.version >= package_version("1.3")) {
28+
cat("### Datatypes
29+
30+
The HDF5 datatype specification used by each R object is based on the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0).
31+
This aims to provide readers with a guaranteed type for faithfully representing the data in memory.
32+
The draft also describes the use of placeholders to represent missing values within HDF5 datasets.")
33+
}
34+
```
35+
36+
### Names
37+
38+
Some R objects may have a `**/names` dataset in their HDF5 group.
39+
If `**/names` is supplied, the contents should always be non-missing, so any `missing-value-placeholder` will not be respected.
40+
Each name is allowed to be any string, including an empty string.
1441

15-
## Lists
42+
It is technically permitted to provide duplicate names in `**/names`, consistent with how R itself supports duplicate names in its lists and vectors.
43+
However, this is not recommended as other frameworks may wish to use representations that assume unique names, e.g., using Python dictionaries to represent named lists.
44+
By providing unique names, users can improve interoperability with native data structures in other frameworks.
45+
46+
## Object types
47+
48+
### Lists
1649

1750
An R list is represented as a HDF5 group (`**/`) with the following attributes:
1851

@@ -24,15 +57,19 @@ One subgroup should be present for each integer in `[0, N)`, given a list of len
2457
Each list element may be any of the objects described in this specification, including further nested lists.
2558

2659
If the list is named, there will additionally be a 1-dimensional `**/names` string dataset of length equal to the number of elements in `**/data`.
27-
See also the [comments on names](misc.md#comments-on-names).
2860

29-
## Atomic vectors
61+
### Atomic vectors
3062

3163
An atomic vector is represented as a HDF5 group (`**/`) with the following attributes:
3264

3365
- `uzuki_object`, a scalar string dataset containing the value `"vector"`.
34-
- `uzuki_type`, a scalar string dataset containing one of `"integer"`, `"boolean"`, `"number"` or `"string"`.
35-
- **(for version 1.0)** this may also be `"date"` or `"date-time"`.
66+
```{r, echo=FALSE, results="asis"}
67+
if (.version == package_version("1.0")) {
68+
cat('- `uzuki_type`, a scalar string dataset containing one of `"integer"`, `"boolean"`, `"number"`, `"string"`, `"date"` or `"date-time"`.')
69+
} else {
70+
cat('- `uzuki_type`, a scalar string dataset containing one of `"integer"`, `"boolean"`, `"number"` or `"string"`.')
71+
}
72+
```
3673

3774
The group should contain an 1-dimensional dataset at `**/data`.
3875
Vectors of length 1 may also be represented as a scalar dataset.
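
As an illustration of how the allowed `uzuki_type` values for atomic vectors depend on the version, a reader might encode the rule from the chunk above roughly as follows. `allowed_vector_types` is a hypothetical name, and this is a sketch rather than how any uzuki2 implementation performs validation.

```r
# Hypothetical helper: allowed 'uzuki_type' values for an atomic vector.
allowed_vector_types <- function(version) {
    types <- c("integer", "boolean", "number", "string")
    # Only version 1.0 treats dates and date-times as separate types;
    # later versions express them through the string type's format.
    if (package_version(version) == package_version("1.0")) {
        types <- c(types, "date", "date-time")
    }
    types
}

"date" %in% allowed_vector_types("1.0")
"date" %in% allowed_vector_types("1.3")
```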
@@ -41,72 +78,93 @@ The allowed HDF5 datatype depends on `uzuki_type`:
4178

4279
- `"integer"`, `"boolean"`: any type of `H5T_INTEGER` that can be represented by a 32-bit signed integer.
4380
Note that the converse is not required, i.e., the storage type does not need to be 32-bit if no such values are present in the dataset.
44-
- **(for version < 1.3)** `"number"`: any type of `H5T_FLOAT` that can be represented by a double-precision float.
45-
- **(for version >= 1.3)** `"number"`: any type of `H5T_FLOAT` or `H5T_INTEGER` that can be represented exactly by a double-precision (64-bit) float.
46-
This implies a limit of 32 bits for any integer datatype.
47-
See also the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0) for more details.
81+
```{r, echo=FALSE, results="asis"}
82+
if (.version < package_version("1.3")) {
83+
cat('- `"number"`: any type of `H5T_FLOAT` that can be represented by a double-precision float.')
84+
} else {
85+
cat('- `"number"`: any type of `H5T_FLOAT` or `H5T_INTEGER` that can be represented exactly by a double-precision (64-bit) float.')
86+
}
87+
```
4888
- `"string"`: any type of `H5T_STRING` that can be represented by a UTF-8 encoded string.
49-
- **(for version 1.0)** `"date"`: any type of `H5T_STRING` where the srings are in the `YYYY-MM-DD` format, or are equal to a missing placeholder value.
50-
- **(for version 1.0)** `"date-time"`: any type of `H5T_STRING` where the srings are Internet Date/Time format, or are equal to a missing placeholder value.
51-
52-
For `boolean` type, values in `**/data` should be one of 0 (false) or 1 (true).
53-
54-
**(for versions >= 1.1)**
55-
For the `string` type, the group may optionally contain the `**/format` dataset.
89+
```{r, echo=FALSE, results="asis"}
90+
if (.version == package_version("1.0")) {
91+
cat('- `"date"`: any type of `H5T_STRING` where the strings are in the `YYYY-MM-DD` format, or are equal to a missing placeholder value.
92+
- `"date-time"`: any type of `H5T_STRING` where the strings are in the Internet Date/Time format, or are equal to a missing placeholder value.')
93+
}
94+
```
95+
96+
For `boolean` type, values in `**/data` should be one of 0 (false) or non-zero (true).
97+
98+
```{r, echo=FALSE, results="asis"}
99+
if (.version >= package_version("1.1")) {
100+
cat('For the `string` type, the group may optionally contain the `**/format` dataset.
56101
This should be a scalar string dataset that specifies constraints to the format of the values in `**/data`:
57102
58103
- `"date"`: strings should be `YYYY-MM-DD` dates or the placeholder value.
59-
- `"date-time"`: strings should be in the Internet Date/Time format ([RFC 3339, Section 5.6](https://www.rfc-editor.org/rfc/rfc3339#section-5.6)) or the placeholder value.
104+
- `"date-time"`: strings should be in the Internet Date/Time format ([RFC 3339, Section 5.6](https://www.rfc-editor.org/rfc/rfc3339#section-5.6)) or the placeholder value.')
105+
}
106+
```
60107

61108
The atomic vector's group may also contain `**/names`, a 1-dimensional string dataset of length equal to that of `**/data`.
62109
If `**/data` is a scalar, `**/names` should have length 1.
63-
See also the [comments on names](misc.md#comments-on-names).
64110

65111
### Representing missing values
66112

67-
**(for version >= 1.1)**
68-
Each `**/data` dataset may optionally contain a `missing-value-placeholder` attribute.
113+
```{r, echo=FALSE, results="asis"}
114+
if (.version >= package_version("1.1")) {
115+
cat('Each `**/data` dataset may optionally contain a `missing-value-placeholder` attribute.
69116
If present, this should be a scalar dataset that specifies the placeholder for missing values.
70117
Any value of `**/data` that is equal to this placeholder should be treated as missing.
71-
If no such attribute is present, it can be assumed that there are no missing values.
118+
If no such attribute is present, it can be assumed that there are no missing values.\n\n')
119+
}
72120
73-
**(for version >= 1.2)**
74-
The data type of the placeholder attribute should be exactly the same as that of `**/data`, so as to avoid unexpected results upon casting.
121+
if (.version >= package_version("1.2")) {
122+
cat('The data type of the placeholder attribute should be exactly the same as that of `**/data`, so as to avoid unexpected results upon casting.
75123
The only exception is when `**/data` is a string, in which case the placeholder type may be of any string type;
76-
it is expected that any comparison between the placeholder and strings in `**/data` will be performed bytewise in the same manner as `strcmp`.
124+
it is expected that any comparison between the placeholder and strings in `**/data` will be performed bytewise in the same manner as `strcmp`.\n\n')
125+
}
77126
78-
**(for version == 1.1)**
79-
The data type of the placeholder attribute should have the same data type class as `**/data`.
127+
if (.version == package_version("1.1")) {
128+
cat('The data type of the placeholder attribute should have the same data type class as `**/data`.\n\n')
129+
}
80130
81-
**(for version >= 1.3)**
82-
Floating-point missingness should be identified using the equality operator when both the placeholder and data values are loaded into memory as IEEE754-compliant `double`s.
131+
if (.version >= package_version("1.3")) {
132+
cat('Floating-point missingness should be identified using the equality operator when both the placeholder and data values are loaded into memory as IEEE754-compliant `double`s.
83133
No casting should be performed to a lower-precision type, as this may cause a non-missing value to become equal to the placeholder.
84-
If the placeholder is NaN, all NaNs in the dataset should be considered missing, regardless of the exact bit representation in the NaN payload.
85-
See the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0) for more details.
134+
If the placeholder is NaN, all NaNs in the dataset should be considered missing, regardless of the exact bit representation in the NaN payload.\n\n')
135+
}
86136
87-
**(for version >= 1.1, < 1.3)**
88-
Floating-point missingness may be encoded in the payload of an NaN, which distinguishes it from a non-missing "not-a-number" value.
89-
Comparisons on NaN placeholders should be performed in a bytewise manner (e.g., with `memcmp`) to ensure that the payload is taken into account.
137+
if (.version >= package_version("1.1") && .version < package_version("1.3")) {
138+
cat('Floating-point missingness may be encoded in the payload of an NaN, which distinguishes it from a non-missing "not-a-number" value.
139+
Comparisons on NaN placeholders should be performed in a bytewise manner (e.g., with `memcmp`) to ensure that the payload is taken into account.\n\n')
140+
}
90141
91-
**(for version 1.0)**
92-
Integer or boolean values of -2147483648 are treated as missing.
142+
if (.version == package_version("1.0")) {
143+
cat("Integer or boolean values of -2147483648 are treated as missing.
93144
Missing floats are represented by [R's NA representation](https://github.com/wch/r-source/blob/869e0f734dc4971c420cf417f5e0d18c0974a5af/src/main/arithmetic.c#L90-L98).
94145
For strings, each `**/data` dataset may contain a `missing-value-placeholder` attribute.
95146
If present, this should be a scalar string dataset that specifies the placeholder for missing values.
96147
Any value of `**/data` that is equal to this placeholder should be treated as missing.
97-
If no such attribute is present, it can be assumed that there are no missing values.
148+
If no such attribute is present, it can be assumed that there are no missing values.")
149+
}
150+
```
98151

99-
## Factors
152+
### Factors
100153

101154
A factor is represented as a HDF5 group (`**/`) with the following attributes:
102155

103156
- `uzuki_object`, a scalar string dataset containing the value `"vector"`.
104-
- `uzuki_type`, a scalar string dataset containing `"factor"`.
105-
- **(for version 1.0)** `uzuki_type` could also be set to `"ordered"`.
106-
This is the same as `uzuki_type` of `"factor"` with the `**/ordered` dataset set to a truthy value.
157+
```{r, echo=FALSE, results="asis"}
158+
if (.version == package_version("1.0")) {
159+
cat('- `uzuki_type`, a scalar string dataset containing `"factor"` or `"ordered"`.')
160+
} else {
161+
cat('- `uzuki_type`, a scalar string dataset containing `"factor"`.')
162+
}
163+
```
107164

108165
The group should contain an 1-dimensional dataset at `**/data`, containing 0-based indices into the levels.
109166
This should be type of `H5T_INTEGER` that can be represented by a 32-bit signed integer.
167+
(Admittedly, this should have been an unsigned integer, but we started with a signed integer and we'll just keep it so for back-compatibility.)
110168
Missing values are represented as described above for atomic vectors.
111169

112170
The group should also contain `**/levels`, a 1-dimensional string dataset that contains the levels for the indices in `**/data`.
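
The floating-point placeholder rule for version 1.3 and later, described under "Representing missing values" above, can be sketched as follows. This assumes that `values` and `placeholder` have already been read from the file as doubles; `is_missing_float` is a hypothetical name. It does not cover versions 1.1 and 1.2, which call for a bytewise comparison of NaN payloads.

```r
# Sketch of the version >= 1.3 rule for floating-point placeholders.
# 'values' and 'placeholder' are assumed to be doubles already loaded in memory.
is_missing_float <- function(values, placeholder) {
    if (is.na(placeholder)) {
        # is.na() is TRUE for every NaN (including R's NA), whatever the payload.
        is.na(values)
    } else {
        # Plain equality; NaN values never match a non-NaN placeholder.
        !is.na(values) & values == placeholder
    }
}

is_missing_float(c(1.5, NaN, -1), placeholder = NaN)
is_missing_float(c(1.5, NaN, -1), placeholder = -1)
```

`is.na()` is used instead of `is.nan()` because R's own `NA_real_` is itself a NaN with a specific payload, for which `is.nan()` returns `FALSE`.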
@@ -118,16 +176,20 @@ beyond that count, the levels cannot be indexed by elements of `**/data`.
118176
The group may also contain `**/names`, a 1-dimensional string dataset of length equal to `data`.
119177
See also the [comments on names](misc.md#comments-on-names).
120178

121-
**(for version >= 1.1)** The group may optionally contain `**/ordered`, a scalar integer dataset.
122-
This should be interpreted as a boolean where a non-zero value specifies that we should assume that the levels are ordered.
179+
```{r, echo=FALSE, results="asis"}
180+
if (.version >= package_version("1.1")) {
181+
cat('The group may optionally contain `**/ordered`, a scalar integer dataset.
182+
This should be interpreted as a boolean where a non-zero value specifies that we should assume that the levels are ordered.')
183+
}
184+
```
123185

124-
## Nothing
186+
### Nothing
125187

126188
A "nothing" (a.k.a., "null", "none") value is represented as a HDF5 group with the following attributes:
127189

128190
- `uzuki_object`, a scalar string dataset containing the value `"nothing"`.
129191

130-
## External object
192+
### External object
131193

132194
Each external object is represented as a HDF5 group (`**/`) with the following attributes:
133195

@@ -136,5 +198,5 @@ Each external object is represented as a HDF5 group (`**/`) with the following a
136198
This should contain an `**/index` scalar dataset, containing an index that identifies this external object uniquely within the entire list.
137199
`**/index` should start at zero and be incremented whenever an external object is encountered.
138200

139-
By indexing this external metadata, we can restore the object in its appropriate location in the list.
201+
By indexing some external metadata with the value of `**/index`, we can restore the external object in its appropriate location in the R list.
140202
The exact mechanism by which this restoration occurs is implementation-defined.
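
One consequence of the `**/index` rule is that, across the whole list, the indices of the external objects should form the sequence 0, 1, ..., n-1 with no gaps or repeats. A minimal check under that reading could look like the sketch below; `valid_external_indices` is a hypothetical name, and collecting the indices while traversing the file is left to the implementation.

```r
# Hypothetical check: 'indices' holds the `**/index` value of every external
# object encountered while traversing the list.
valid_external_indices <- function(indices) {
    n <- length(indices)
    # Valid when empty, or when the values are exactly 0, 1, ..., n-1 in some order.
    n == 0L || (!anyDuplicated(indices) && all(sort(indices) == seq_len(n) - 1L))
}

valid_external_indices(c(0, 1, 2))
valid_external_indices(c(0, 2))
```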
