This is arguably easier to read if we want to understand any given version of
the spec, as we don't have to mentally ignore the parts of the spec related to
other versions. We use knitr to generate one document per version, only
keeping the clauses relevant to that version via conditional chunks.
We use `**/` to represent a variable name of the group representing any of the supported R objects.
6
-
It is assumed that `**/` will be replaced by the actual name of the group in implementations,
7
-
as defined by users (for the top-level group) or by the specification (e.g., as a nested child of a list).
12
+
## Comments
8
13
9
-
All objects should be nested inside an R list.
14
+
### General
15
+
16
+
Every R object is represented by a HDF5 group.
17
+
In the descriptions below, we use `**/` as a placeholder for the name of the group.
18
+
19
+
All R objects should be nested inside an R list.
20
+
In other words, the top-level HDF5 group should represent an R list.
10
21
11
22
The top-level group may have a `uzuki_version` attribute, describing the version of the **uzuki2** specification that it uses.
12
23
This should be a scalar string dataset of the form `X.Y` for non-negative integers `X` and `Y`.
13
-
The latest version of this specification is **1.3**; if not provided, it is assumed to be **1.0**.
24
+
The latest version of this specification is **1.3**; if not provided, it is assumed to be **1.0** for back-compatibility purposes.
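As an illustration of the versioning rule above, a reader might resolve the version as follows. This is a minimal sketch, not part of the specification: `parse_uzuki_version` is a hypothetical helper, and its argument stands in for the contents of the `uzuki_version` attribute (`None` meaning that the attribute is absent).

```python
import re


def parse_uzuki_version(attr_value=None):
    """Turn an 'X.Y' string into a (major, minor) tuple of integers.

    `attr_value` is a stand-in for the `uzuki_version` attribute;
    None means the attribute was not provided.
    """
    if attr_value is None:
        # Absent attribute: assume 1.0 for back-compatibility.
        return (1, 0)
    match = re.fullmatch(r"([0-9]+)\.([0-9]+)", attr_value)
    if match is None:
        raise ValueError(f"expected an 'X.Y' version string, got {attr_value!r}")
    return (int(match.group(1)), int(match.group(2)))
```

Comparing the resulting tuples (rather than the raw strings) keeps the ordering correct if a minor version ever reaches two digits.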
25
+
26
+
```{r, echo=FALSE, results="asis"}
27
+
if (.version >= package_version("1.3")) {
28
+
cat("### Datatypes
29
+
30
+
The HDF5 datatype specification used by each R object is based on the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0).
31
+
This aims to provide readers with a guaranteed type for faithfully representing the data in memory.
32
+
The draft also describes the use of placeholders to represent missing values within HDF5 datasets.")
33
+
}
34
+
```
35
+
36
+
### Names
37
+
38
+
Some R objects may have a `**/names` dataset in their HDF5 group.
39
+
If `**/names` is supplied, the contents should always be non-missing, so any `missing-value-placeholder` will not be respected.
40
+
Each name is allowed to be any string, including an empty string.
14
41
15
-
## Lists
42
+
It is technically permitted to provide duplicate names in `**/names`, consistent with how R itself supports duplicate names in its lists and vectors.
43
+
However, this is not recommended as other frameworks may wish to use representations that assume unique names, e.g., using Python dictionaries to represent named lists.
44
+
By providing unique names, users can improve interoperability with native data structures in other frameworks.
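To make the interoperability concern concrete, the sketch below shows how a Python reader might handle names. It is only an illustration under our own assumptions: `to_named_collection` is a hypothetical helper, and the fallback to a list of pairs is one possible design choice, not something mandated here.

```python
def to_named_collection(names, values):
    """Return a dict when names are unique, else a list of (name, value) pairs.

    `names` and `values` stand in for the contents of `**/names` and the
    already-loaded elements of the object, respectively.
    """
    if len(names) != len(values):
        raise ValueError("names and values should have the same length")
    if len(set(names)) == len(names):
        # Unique names map cleanly onto a native dictionary.
        return dict(zip(names, values))
    # Duplicates are permitted, so keep every entry rather than
    # silently dropping the earlier ones.
    return list(zip(names, values))
```

With unique names the caller gets a dictionary; with duplicates, no element is lost but the convenient lookup by name is no longer available.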
45
+
46
+
## Object types
47
+
48
+
### Lists
16
49
17
50
An R list is represented as a HDF5 group (`**/`) with the following attributes:
18
51
@@ -24,15 +57,19 @@ One subgroup should be present for each integer in `[0, N)`, given a list of len
24
57
Each list element may be any of the objects described in this specification, including further nested lists.
25
58
26
59
If the list is named, there will additionally be a 1-dimensional `**/names` string dataset of length equal to the number of elements in `**/data`.
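The layout of a list can be sketched with plain Python dictionaries standing in for HDF5 groups. Everything here is an assumption made for illustration: `load_list` and `load_child` are hypothetical names, and we assume that each subgroup of `**/data` is named by the decimal string of its position.

```python
def load_list(group, load_child):
    """Rebuild the elements (and optional names) of a list.

    `group` is a dict standing in for the list's HDF5 group, with a "data"
    entry (a dict of children) and an optional "names" entry.
    `load_child` is a caller-supplied function that loads one child object.
    """
    data = group["data"]
    n = len(data)
    elements = []
    for i in range(n):
        key = str(i)  # assumed naming scheme for the subgroups
        if key not in data:
            raise KeyError(f"missing list element {key!r}")
        elements.append(load_child(data[key]))
    names = group.get("names")
    if names is not None and len(names) != n:
        raise ValueError("names should have one entry per list element")
    return elements, names
```

Iterating over `range(n)` rather than over the keys themselves both restores the element order and catches gaps in the indices.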
27
-
See also the [comments on names](misc.md#comments-on-names).
28
60
29
-
## Atomic vectors
61
+
### Atomic vectors
30
62
31
63
An atomic vector is represented as a HDF5 group (`**/`) with the following attributes:
32
64
33
65
- `uzuki_object`, a scalar string dataset containing the value `"vector"`.
34
-
-`uzuki_type`, a scalar string dataset containing one of `"integer"`, `"boolean"`, `"number"` or `"string"`.
35
-
-**(for version 1.0)** this may also be `"date"` or `"date-time"`.
66
+
```{r, echo=FALSE, results="asis"}
67
+
if (.version == package_version("1.0")) {
68
+
cat('- `uzuki_type`, a scalar string dataset containing one of `"integer"`, `"boolean"`, `"number"`, `"string"`, `"date"` or `"date-time"`.')
69
+
} else {
70
+
cat('- `uzuki_type`, a scalar string dataset containing one of `"integer"`, `"boolean"`, `"number"` or `"string"`.')
71
+
}
72
+
```
36
73
37
74
The group should contain a 1-dimensional dataset at `**/data`.
38
75
Vectors of length 1 may also be represented as a scalar dataset.
@@ -41,72 +78,93 @@ The allowed HDF5 datatype depends on `uzuki_type`:
41
78
42
79
- `"integer"`, `"boolean"`: any type of `H5T_INTEGER` that can be represented by a 32-bit signed integer.
43
80
Note that the converse is not required, i.e., the storage type does not need to be 32-bit if no such values are present in the dataset.
44
-
-**(for version < 1.3)**`"number"`: any type of `H5T_FLOAT` that can be represented by a double-precision float.
45
-
-**(for version >= 1.3)**`"number"`: any type of `H5T_FLOAT` or `H5T_INTEGER` that can be represented exactly by a double-precision (64-bit) float.
46
-
This implies a limit of 32 bits for any integer datatype.
47
-
See also the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0) for more details.
81
+
```{r, echo=FALSE, results="asis"}
82
+
if (.version < package_version("1.3")) {
83
+
cat('- `"number"`: any type of `H5T_FLOAT` that can be represented by a double-precision float.')
84
+
} else {
85
+
cat('- `"number"`: any type of `H5T_FLOAT` or `H5T_INTEGER` that can be represented exactly by a double-precision (64-bit) float.')
86
+
}
87
+
```
48
88
- `"string"`: any type of `H5T_STRING` that can be represented by a UTF-8 encoded string.
49
-
-**(for version 1.0)**`"date"`: any type of `H5T_STRING` where the srings are in the `YYYY-MM-DD` format, or are equal to a missing placeholder value.
50
-
-**(for version 1.0)**`"date-time"`: any type of `H5T_STRING` where the srings are Internet Date/Time format, or are equal to a missing placeholder value.
51
-
52
-
For `boolean` type, values in `**/data` should be one of 0 (false) or 1 (true).
53
-
54
-
**(for versions >= 1.1)**
55
-
For the `string` type, the group may optionally contain the `**/format` dataset.
89
+
```{r, echo=FALSE, results="asis"}
90
+
if (.version == package_version("1.0")) {
91
+
cat('- `"date"`: any type of `H5T_STRING` where the strings are in the `YYYY-MM-DD` format, or are equal to a missing placeholder value.
92
+
- `"date-time"`: any type of `H5T_STRING` where the strings are in the Internet Date/Time format, or are equal to a missing placeholder value.')
93
+
}
94
+
```
95
+
96
+
For `boolean` type, values in `**/data` should be one of 0 (false) or non-zero (true).
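The type constraints above can be checked from the width and signedness of an integer datatype alone. The following is a sketch under our own assumptions (the function names are hypothetical, and a real implementation would query these properties from the HDF5 library); it relies on the fact that a double-precision float represents every integer of magnitude up to 2^53 exactly.

```python
INT32_MIN, INT32_MAX = -(2 ** 31), 2 ** 31 - 1


def int_type_range(bits, signed):
    """Smallest and largest value of an integer type with the given width."""
    if signed:
        return (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return (0, 2 ** bits - 1)


def type_fits_int32(bits, signed):
    """Can every value of the type be held in a 32-bit signed integer?"""
    lo, hi = int_type_range(bits, signed)
    return INT32_MIN <= lo and hi <= INT32_MAX


def type_fits_double_exactly(bits, signed):
    """Can every value of the type be represented exactly by a double?"""
    lo, hi = int_type_range(bits, signed)
    return -(2 ** 53) <= lo and hi <= 2 ** 53


def as_boolean(value):
    """Interpret a stored integer as a boolean: zero is false, non-zero is true."""
    return value != 0
```

For example, an unsigned 16-bit type is acceptable for `"integer"`, while an unsigned 32-bit type is not; for the usual power-of-two widths, 64-bit integers are the first to fail the double-precision check.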
97
+
98
+
```{r, echo=FALSE, results="asis"}
99
+
if (.version >= package_version("1.1")) {
100
+
cat('For the `string` type, the group may optionally contain the `**/format` dataset.
56
101
This should be a scalar string dataset that specifies constraints to the format of the values in `**/data`:
57
102
58
103
- `"date"`: strings should be `YYYY-MM-DD` dates or the placeholder value.
59
-
-`"date-time"`: strings should be in the Internet Date/Time format ([RFC 3339, Section 5.6](https://www.rfc-editor.org/rfc/rfc3339#section-5.6)) or the placeholder value.
104
+
- `"date-time"`: strings should be in the Internet Date/Time format ([RFC 3339, Section 5.6](https://www.rfc-editor.org/rfc/rfc3339#section-5.6)) or the placeholder value.')
105
+
}
106
+
```
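A reader that wishes to enforce the `**/format` constraint might validate each string along these lines. This is a simplified sketch, not a complete RFC 3339 validator: `check_format` is a hypothetical helper, leap seconds are only loosely handled by allowing a seconds value of 60, and year `0000` is rejected because Python's `datetime.date` does not support it.

```python
import datetime
import re

_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")
_DATE_TIME = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})[Tt]"
    r"([0-9]{2}):([0-9]{2}):([0-9]{2})(\.[0-9]+)?"
    r"([Zz]|[+-][0-9]{2}:[0-9]{2})"
)


def _valid_calendar_date(year, month, day):
    """Check that the digits form a real calendar date."""
    try:
        datetime.date(int(year), int(month), int(day))
    except ValueError:
        return False
    return True


def check_format(value, fmt, placeholder=None):
    """Check one string against a `format` of "date" or "date-time"."""
    if placeholder is not None and value == placeholder:
        return True  # the placeholder is always allowed
    if fmt == "date":
        match = _DATE.fullmatch(value)
        return match is not None and _valid_calendar_date(*match.groups())
    if fmt == "date-time":
        match = _DATE_TIME.fullmatch(value)
        if match is None:
            return False
        year, month, day, hour, minute, second = match.groups()[:6]
        return (
            _valid_calendar_date(year, month, day)
            and int(hour) <= 23
            and int(minute) <= 59
            and int(second) <= 60
        )
    raise ValueError(f"unknown format {fmt!r}")
```

Checking the placeholder first means that a missing value never has to look like a date.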
60
107
61
108
The atomic vector's group may also contain `**/names`, a 1-dimensional string dataset of length equal to that of `**/data`.
62
109
If `**/data` is a scalar, `**/names` should have length 1.
63
-
See also the [comments on names](misc.md#comments-on-names).
64
110
65
111
### Representing missing values
66
112
67
-
**(for version >= 1.1)**
68
-
Each `**/data` dataset may optionally contain a `missing-value-placeholder` attribute.
113
+
```{r, echo=FALSE, results="asis"}
114
+
if (.version >= package_version("1.1")) {
115
+
cat('Each `**/data` dataset may optionally contain a `missing-value-placeholder` attribute.
69
116
If present, this should be a scalar dataset that specifies the placeholder for missing values.
70
117
Any value of `**/data` that is equal to this placeholder should be treated as missing.
71
-
If no such attribute is present, it can be assumed that there are no missing values.
118
+
If no such attribute is present, it can be assumed that there are no missing values.')
119
+
}
72
120
73
-
**(for version >= 1.2)**
74
-
The data type of the placeholder attribute should be exactly the same as that of `**/data`, so as to avoid unexpected results upon casting.
121
+
if (.version >= package_version("1.2")) {
122
+
cat('The data type of the placeholder attribute should be exactly the same as that of `**/data`, so as to avoid unexpected results upon casting.
75
123
The only exception is when `**/data` is a string, in which case the placeholder type may be of any string type;
76
-
it is expected that any comparison between the placeholder and strings in `**/data` will be performed bytewise in the same manner as `strcmp`.
124
+
it is expected that any comparison between the placeholder and strings in `**/data` will be performed bytewise in the same manner as `strcmp`.')
125
+
}
77
126
78
-
**(for version == 1.1)**
79
-
The data type of the placeholder attribute should have the same data type class as `**/data`.
127
+
if (.version == package_version("1.1")) {
128
+
cat('The data type of the placeholder attribute should have the same data type class as `**/data`.')
129
+
}
80
130
81
-
**(for version >= 1.3)**
82
-
Floating-point missingness should be identified using the equality operator when both the placeholder and data values are loaded into memory as IEEE754-compliant `double`s.
131
+
if (.version >= package_version("1.3")) {
132
+
cat('Floating-point missingness should be identified using the equality operator when both the placeholder and data values are loaded into memory as IEEE754-compliant `double`s.
83
133
No casting should be performed to a lower-precision type, as this may cause a non-missing value to become equal to the placeholder.
84
-
If the placeholder is NaN, all NaNs in the dataset should be considered missing, regardless of the exact bit representation in the NaN payload.
85
-
See the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0) for more details.
134
+
If the placeholder is NaN, all NaNs in the dataset should be considered missing, regardless of the exact bit representation in the NaN payload.')
135
+
}
86
136
87
-
**(for version >= 1.1, < 1.3)**
88
-
Floating-point missingness may be encoded in the payload of an NaN, which distinguishes it from a non-missing "not-a-number" value.
89
-
Comparisons on NaN placeholders should be performed in a bytewise manner (e.g., with `memcmp`) to ensure that the payload is taken into account.
137
+
if (.version >= package_version("1.1") && .version < package_version("1.3")) {
138
+
cat('Floating-point missingness may be encoded in the payload of an NaN, which distinguishes it from a non-missing "not-a-number" value.
139
+
Comparisons on NaN placeholders should be performed in a bytewise manner (e.g., with `memcmp`) to ensure that the payload is taken into account.')
140
+
}
90
141
91
-
**(for version 1.0)**
92
-
Integer or boolean values of -2147483648 are treated as missing.
142
+
if (.version == package_version("1.0")) {
143
+
cat("Integer or boolean values of -2147483648 are treated as missing.
93
144
Missing floats are represented by [R's NA representation](https://github.com/wch/r-source/blob/869e0f734dc4971c420cf417f5e0d18c0974a5af/src/main/arithmetic.c#L90-L98).
94
145
For strings, each `**/data` dataset may contain a `missing-value-placeholder` attribute.
95
146
If present, this should be a scalar string dataset that specifies the placeholder for missing values.
96
147
Any value of `**/data` that is equal to this placeholder should be treated as missing.
97
-
If no such attribute is present, it can be assumed that there are no missing values.
148
+
If no such attribute is present, it can be assumed that there are no missing values.")
149
+
}
150
+
```
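The difference between the two floating-point rules can be sketched as follows. This is an illustration under our own assumptions: the function names are hypothetical, the NaN payload is arbitrary, and we assume a platform where packing a double preserves its NaN bits.

```python
import math
import struct


def double_from_bits(bits):
    """Build a double from its 64-bit pattern."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def is_missing_by_equality(value, placeholder):
    """Version >= 1.3: equality on doubles; a NaN placeholder matches any NaN."""
    if math.isnan(placeholder):
        return math.isnan(value)
    return value == placeholder


def is_missing_bytewise(value, placeholder):
    """Versions 1.1 and 1.2: NaN placeholders are compared bytewise."""
    if math.isnan(placeholder):
        # Equivalent to a memcmp on the 8 bytes, so the payload matters.
        return struct.pack("<d", value) == struct.pack("<d", placeholder)
    return value == placeholder


PLAIN_NAN = double_from_bits(0x7FF8000000000000)
PAYLOAD_NAN = double_from_bits(0x7FF80000000007A2)  # arbitrary payload
```

With `PAYLOAD_NAN` as the placeholder, `PLAIN_NAN` is non-missing under the bytewise rule but missing under the equality rule.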
98
151
99
-
## Factors
152
+
### Factors
100
153
101
154
A factor is represented as a HDF5 group (`**/`) with the following attributes:
102
155
103
156
- `uzuki_object`, a scalar string dataset containing the value `"vector"`.
104
-
-`uzuki_type`, a scalar string dataset containing `"factor"`.
105
-
-**(for version 1.0)**`uzuki_type` could also be set to `"ordered"`.
106
-
This is the same as `uzuki_type` of `"factor"` with the `**/ordered` dataset set to a truthy value.
157
+
```{r, echo=FALSE, results="asis"}
158
+
if (.version == package_version("1.0")) {
159
+
cat('- `uzuki_type`, a scalar string dataset containing `"factor"` or `"ordered"`.')
160
+
} else {
161
+
cat('- `uzuki_type`, a scalar string dataset containing `"factor"`.')
162
+
}
163
+
```
107
164
108
165
The group should contain a 1-dimensional dataset at `**/data`, containing 0-based indices into the levels.
109
166
This should be any type of `H5T_INTEGER` that can be represented by a 32-bit signed integer.
167
+
(Admittedly, this should have been an unsigned integer, but we started with a signed integer and we'll just keep it so for back-compatibility.)
110
168
Missing values are represented as described above for atomic vectors.
111
169
112
170
The group should also contain `**/levels`, a 1-dimensional string dataset that contains the levels for the indices in `**/data`.
@@ -118,16 +176,20 @@ beyond that count, the levels cannot be indexed by elements of `**/data`.
118
176
The group may also contain `**/names`, a 1-dimensional string dataset of length equal to that of `**/data`.
119
177
See also the [comments on names](misc.md#comments-on-names).
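Decoding a factor then amounts to looking up each index in the levels. The sketch below is only an illustration: `decode_factor` is a hypothetical helper, `placeholder` stands in for the value of the `missing-value-placeholder` attribute (if any), and missing entries are returned as `None`.

```python
def decode_factor(codes, levels, placeholder=None):
    """Map 0-based codes from `**/data` onto the strings in `**/levels`."""
    decoded = []
    for code in codes:
        if placeholder is not None and code == placeholder:
            decoded.append(None)  # missing value
        elif 0 <= code < len(levels):
            decoded.append(levels[code])
        else:
            raise IndexError(f"code {code} is out of range for {len(levels)} levels")
    return decoded
```

The placeholder is checked before the range test, so a placeholder such as -1 is never mistaken for an invalid index.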
120
178
121
-
**(for version >= 1.1)** The group may optionally contain `**/ordered`, a scalar integer dataset.
122
-
This should be interpreted as a boolean where a non-zero value specifies that we should assume that the levels are ordered.
179
+
```{r, echo=FALSE, results="asis"}
180
+
if (.version >= package_version("1.1")) {
181
+
cat('The group may optionally contain `**/ordered`, a scalar integer dataset.
182
+
This should be interpreted as a boolean where a non-zero value specifies that we should assume that the levels are ordered.')
183
+
}
184
+
```
123
185
124
-
## Nothing
186
+
### Nothing
125
187
126
188
A "nothing" (a.k.a., "null", "none") value is represented as a HDF5 group with the following attributes:
127
189
128
190
- `uzuki_object`, a scalar string dataset containing the value `"nothing"`.
129
191
130
-
## External object
192
+
### External object
131
193
132
194
Each external object is represented as a HDF5 group (`**/`) with the following attributes:
133
195
@@ -136,5 +198,5 @@ Each external object is represented as a HDF5 group (`**/`) with the following a
136
198
This should contain an `**/index` scalar dataset, containing an index that identifies this external object uniquely within the entire list.
137
199
`**/index` should start at zero and be incremented whenever an external object is encountered.
138
200
139
-
By indexing this external metadata, we can restore the object in its appropriate location in the list.
201
+
By indexing some external metadata with the value of `**/index`, we can restore the external object in its appropriate location in the R list.
140
202
The exact mechanism by which this restoration occurs is implementation-defined.
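One possible restoration mechanism is sketched below. Since the mechanism is implementation-defined, all of this is an assumption for illustration: `ExternalRef` and `restore_externals` are hypothetical names, the loaded R list is modelled as nested Python lists, and `externals` is whatever collection the implementation keeps outside of the HDF5 file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalRef:
    """Stand-in for an external object group; `index` holds `**/index`."""
    index: int


def restore_externals(node, externals):
    """Replace every ExternalRef in a nested list with the object it indexes."""
    if isinstance(node, ExternalRef):
        if not 0 <= node.index < len(externals):
            raise IndexError(f"no external object with index {node.index}")
        return externals[node.index]
    if isinstance(node, list):
        return [restore_externals(child, externals) for child in node]
    return node  # any other value is left untouched
```

For instance, restoring `[1, ExternalRef(0), [ExternalRef(1)]]` with the externals `["first", "second"]` puts each external object back at the position where its reference was found.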