Commit 45dbccc

Clarified the differences between specification versions. (#12)
Also mentioned the relationship between specification and library versions.
1 parent 99ebc7b commit 45dbccc

1 file changed

+38
-47
lines changed

README.md

Lines changed: 38 additions & 47 deletions
Original file line number | Diff line number | Diff line change
@@ -26,7 +26,7 @@ All objects should be nested inside an R list.
2626

2727
The top-level group may have a `uzuki_version` attribute, describing the version of the **uzuki2** specification that it uses.
2828
This should be a scalar string dataset of the form `X.Y` for non-negative integers `X` and `Y`.
29-
If not provided, it is assumed to be "1.0".
29+
The latest version of this specification is **1.2**; if not provided, it is assumed to be **1.0**.
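The version rule above can be sketched as follows. This is a minimal illustration, not part of the **uzuki2** API: `SpecVersion`, `parse_spec_version` and `at_least` are hypothetical names, and the attribute is assumed to have already been read from the HDF5 file into an optional string.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical holder for the X.Y specification version.
struct SpecVersion {
    int major;
    int minor;
};

// Parse an "X.Y" string; an absent attribute is assumed to mean 1.0.
// Overflow of very long digit strings is not handled in this sketch.
inline SpecVersion parse_spec_version(const std::optional<std::string>& attr) {
    if (!attr.has_value()) {
        return SpecVersion{1, 0};
    }
    const std::string& s = *attr;
    std::size_t dot = s.find('.');
    if (dot == std::string::npos || dot == 0 || dot + 1 == s.size()) {
        throw std::invalid_argument("version should have the form X.Y");
    }
    auto to_int = [](const std::string& digits) -> int {
        int out = 0;
        for (char c : digits) {
            // Rejects signs, spaces and any second '.' in the string.
            if (!std::isdigit(static_cast<unsigned char>(c))) {
                throw std::invalid_argument("version should have the form X.Y");
            }
            out = out * 10 + (c - '0');
        }
        return out;
    };
    return SpecVersion{to_int(s.substr(0, dot)), to_int(s.substr(dot + 1))};
}

// True if `v` is the same as or later than major.minor.
inline bool at_least(const SpecVersion& v, int major, int minor) {
    return v.major > major || (v.major == major && v.minor >= minor);
}
```

A reader could then gate version-specific behaviour on checks such as `at_least(v, 1, 1)`.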
3030

3131
### Lists
3232

@@ -47,6 +47,7 @@ An atomic vector is represented as a HDF5 group (`**/`) with the following attri
4747

4848
- `uzuki_object`, a scalar string dataset containing the value `"vector"`.
4949
- `uzuki_type`, a scalar string dataset containing one of `"integer"`, `"boolean"`, `"number"` or `"string"`.
50+
- **(for version 1.0)** this may also be `"date"` or `"date-time"`.
5051

5152
The group should contain a 1-dimensional dataset at `**/data`.
5253
Vectors of length 1 may also be represented as a scalar dataset.
@@ -57,10 +58,13 @@ The allowed HDF5 datatype depends on `uzuki_type`:
5758
Note that the converse is not required, i.e., the storage type does not need to be 32-bit if no such values are present in the dataset.
5859
- `"number"`: any type of `H5T_FLOAT` that can be represented by a double-precision float.
5960
- `"string"`: any type of `H5T_STRING` that can be represented by a UTF-8 encoded string.
61+
- **(for version 1.0)** `"date"`: any type of `H5T_STRING` where the strings are in the `YYYY-MM-DD` format, or are equal to a missing placeholder value.
62+
- **(for version 1.0)** `"date-time"`: any type of `H5T_STRING` where the strings are in the Internet Date/Time format, or are equal to a missing placeholder value.
6063

6164
For `boolean` type, values in `**/data` should be one of 0 (false) or 1 (true).
6265

63-
For `string` type, the group may optionally contain the `**/format` dataset.
66+
**(for versions >= 1.1)**
67+
For the `string` type, the group may optionally contain the `**/format` dataset.
6468
This should be a scalar string dataset that specifies constraints to the format of the values in `**/data`:
6569

6670
- `"date"`: strings should be `YYYY-MM-DD` dates or the placeholder value.
@@ -69,45 +73,42 @@ This should be a scalar string dataset that specifies constraints to the format
6973
The atomic vector's group may also contain `**/names`, a 1-dimensional string dataset of length equal to that of `**/data`.
7074
If `**/data` is a scalar, `**/names` should have length 1.
7175

72-
<details>
73-
<summary>Changes from previous versions</summary>
74-
75-
In version 1.0, it was possible to have `uzuki_type` set to `"date"` or `"date-time"`.
76-
This is the same as `uzuki_type` of `"string"` with `**/format` set to `"date"` or `"date-time"`.
77-
</details>
78-
7976
#### Representing missing values
8077

78+
**(for version >= 1.1)**
8179
Each `**/data` dataset may optionally contain a `missing-value-placeholder` attribute.
8280
If present, this should be a scalar dataset that specifies the placeholder for missing values.
8381
Any value of `**/data` that is equal to this placeholder should be treated as missing.
8482
If no such attribute is present, it can be assumed that there are no missing values.
8583

84+
**(for version >= 1.2)**
8685
The data type of the placeholder attribute should be exactly the same as that of `**/data`, so as to avoid unexpected results upon casting.
8786
The only exception is when `**/data` is a string, in which case the placeholder type may be of any string type;
8887
it is expected that any comparison between the placeholder and strings in `**/data` will be performed bytewise in the same manner as `strcmp`.
8988

89+
**(for version == 1.1)**
90+
The data type of the placeholder attribute should have the same data type class as `**/data`.
91+
92+
**(for version >= 1.1)**
9093
Floating point missingness may be encoded in the payload of an NaN, which distinguishes it from a non-missing "not-a-number" value.
9194
Comparisons on NaN placeholders should be performed in a bytewise manner (e.g., with `memcmp`) to ensure that the payload is taken into account.
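As a rough illustration of why the comparison has to be bytewise, the sketch below builds a NaN with a payload and compares it with `std::memcmp`. It assumes 64-bit IEEE 754 doubles; `nan_with_payload` and `matches_placeholder` are hypothetical helpers rather than **uzuki2** functions, and the payload value used in any test is arbitrary.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "this sketch assumes 64-bit doubles");

// Build a quiet NaN that carries `payload` in its low mantissa bits.
inline double nan_with_payload(std::uint64_t payload) {
    std::uint64_t bits =
        0x7FF8000000000000ULL | (payload & 0x0007FFFFFFFFFFFFULL);
    double out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Bytewise comparison, so that NaN payloads are taken into account.
// Using `value == placeholder` would always be false for NaNs.
inline bool matches_placeholder(const double& value, const double& placeholder) {
    return std::memcmp(&value, &placeholder, sizeof(double)) == 0;
}
```

With this approach, a placeholder NaN matches itself, while an ordinary NaN with a different payload is still treated as a non-missing "not-a-number" value.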
9295

93-
<details>
94-
<summary>Changes from previous versions</summary>
95-
96-
**Version 1.1**
97-
The missing value placeholder only needed to be of the same type class as `**/data`.
98-
99-
**Version 1.0**
96+
**(for version 1.0)**
10097
Integer or boolean values of -2147483648 were treated as missing.
101-
10298
Missing floats were represented by [R's NA representation](https://github.com/wch/r-source/blob/869e0f734dc4971c420cf417f5e0d18c0974a5af/src/main/arithmetic.c#L90-L98).
103-
</details>
99+
For strings, each `**/data` dataset may contain a `missing-value-placeholder` attribute.
100+
If present, this should be a scalar string dataset that specifies the placeholder for missing values.
101+
Any value of `**/data` that is equal to this placeholder should be treated as missing.
102+
If no such attribute is present, it can be assumed that there are no missing values.
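The version-dependent rules for integers could be checked along these lines. This is a sketch only: `is_missing_integer` is a hypothetical helper, the major version is assumed to be 1 so that only the minor version `Y` is passed, and the placeholder is assumed to have been read from the `missing-value-placeholder` attribute beforehand.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Decide whether an integer value is missing under specification 1.<spec_minor>.
inline bool is_missing_integer(std::int32_t value,
                               int spec_minor,
                               const std::optional<std::int32_t>& placeholder) {
    if (spec_minor == 0) {
        // Version 1.0: -2147483648 is always treated as missing.
        return value == std::numeric_limits<std::int32_t>::min();
    }
    // Versions >= 1.1: only the placeholder attribute marks missingness;
    // with no attribute, there are no missing values.
    return placeholder.has_value() && value == *placeholder;
}
```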
104103

105104
### Factors
106105

107106
A factor is represented as a HDF5 group (`**/`) with the following attributes:
108107

109108
- `uzuki_object`, a scalar string dataset containing the value `"vector"`.
110109
- `uzuki_type`, a scalar string dataset containing `"factor"`.
110+
- **(for version 1.0)** `uzuki_type` could also be set to `"ordered"`.
111+
This is the same as `uzuki_type` of `"factor"` with the `**/ordered` dataset set to a truthy value.
111112

112113
The group should contain a 1-dimensional dataset at `**/data`, containing 0-based indices into the levels.
113114
This should be any type of `H5T_INTEGER` that can be represented by a 32-bit signed integer.
@@ -121,16 +122,9 @@ beyond that count, the levels cannot be indexed by elements of `**/data`.
121122

122123
The group may also contain `**/names`, a 1-dimensional string dataset of length equal to `data`.
123124

124-
The group may optionally contain `**/ordered`, a scalar integer dataset.
125+
**(for version >= 1.1)** The group may optionally contain `**/ordered`, a scalar integer dataset.
125126
This should be interpreted as a boolean where a non-zero value specifies that we should assume that the levels are ordered.
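A reader that supports both conventions might resolve orderedness roughly as follows. `factor_is_ordered` is a hypothetical helper, the major version is assumed to be 1, and treating an absent `**/ordered` dataset as unordered is an assumption of this sketch.

```cpp
#include <optional>
#include <string>

// Decide whether a factor's levels are ordered under specification 1.<spec_minor>.
inline bool factor_is_ordered(const std::string& uzuki_type,
                              int spec_minor,
                              const std::optional<int>& ordered_dataset) {
    if (spec_minor == 0) {
        // Version 1.0: orderedness is encoded in `uzuki_type` itself.
        return uzuki_type == "ordered";
    }
    // Versions >= 1.1: any non-zero value of `**/ordered` means ordered.
    return ordered_dataset.has_value() && *ordered_dataset != 0;
}
```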
126127

127-
<details>
128-
<summary>Changes from previous versions</summary>
129-
130-
In version 1.0, it was possible to have `uzuki_type` set to `"ordered"`.
131-
This is the same as `uzuki_type` of `"factor"` with the `**/ordered` dataset set to a truthy value.
132-
</details>
133-
134128
### Nothing
135129

136130
A "nothing" (a.k.a., "null", "none") value is represented as a HDF5 group with the following attributes:
@@ -155,7 +149,7 @@ All R objects are represented by JSON objects with a `type` property.
155149
Every R object should be nested inside an R list.
156150

157151
The top-level object may have a `version` property that contains the **uzuki2** specification version as a `"X.Y"` string for non-negative integers `X` and `Y`.
158-
If missing, the version can be assumed to be "1.0".
152+
The latest version of this specification is **1.2**; if missing, the version can be assumed to be **1.0**.
159153

160154
### Lists
161155

@@ -171,6 +165,8 @@ An R list is represented as a JSON object with the following properties:
171165
An atomic vector is represented as a JSON object with the following properties:
172166

173167
- `type`, set to one of `"integer"`, `"boolean"`, `"number"`, `"string"`.
168+
- **(for version 1.0)** `type` could also be set to `"date"` or `"date-time"`.
169+
This specifies strings in the date or Internet Date/Time format.
174170
- `values`, an array of values for the vector (see below).
175171
This may also be a scalar of the same type as the array contents.
176172
- (optional) `"names"`, an array of length equal to `values`, containing the names of the list elements.
@@ -183,10 +179,12 @@ The contents of `values` is subject to some constraints:
183179
IEEE special values can be represented by strings, i.e., `NaN`, `Inf`, `-Inf`.
184180
- `"integer"`: values should be JSON numbers that can be represented by a 32-bit signed integer.
185181
Missing values may be represented by `null`.
182+
- **(for version 1.0)** missing integers could also be represented by the special value -2147483648.
186183
- `"boolean"`: values should be JSON booleans or `null` (for missing values).
187184
- `"string"`: values should be JSON strings.
188185
`null` is also allowed and represents a missing value.
189186

187+
**(for version >= 1.1)**
190188
For `type` of `"string"`, the object may optionally have a `format` property that constrains the `values`:
191189

192190
- `"date"`: values should be JSON strings following a `YYYY-MM-DD` format.
@@ -197,37 +195,21 @@ For `type` of `"string"`, the object may optionally have a `format` property tha
197195
Vectors of length 1 may also be represented as scalars of the appropriate type.
198196
While R makes no distinction between scalars and length-1 vectors, this may be useful for other frameworks where this difference is relevant.
199197

200-
<details>
201-
<summary>Changes from previous versions</summary>
202-
203-
In version 1.0, it was possible to have `type` set to `"date"` or `"date-time"`.
204-
This is the same as `"type": "string"` with `format` set to `"date"` or `"date-time"`.
205-
206-
In version 1.0, missing integers could also be represented by the special value -2147483648.
207-
</details>
208-
209198
### Factors
210199

211200
A factor is represented as a JSON object with the following properties:
212201

213202
- `type`, set to `"factor"`.
203+
- **(for version 1.0)** `type` can also be set to `"ordered"` for ordered levels.
214204
- `values`, an array of 0-based integer indices for the factor.
215205
These should be non-negative JSON numbers that can fit into a 32-bit signed integer.
216206
They should also be less than the length of `levels`.
217207
Missing values are represented by `null`.
208+
- **(for version 1.0)** missing values could also be represented by the special value -2147483648.
218209
- `levels`, an array of unique strings containing the levels for the indices in `values`.
219-
- (optional) `ordered`, a boolean indicating whether to assume that the levels are ordered.
220-
If absent, levels are assumed to be non-ordered.
221210
- (optional) `"names"`, an array of length equal to `values`, containing the names of the list elements.
222-
223-
<details>
224-
<summary>Changes from previous versions</summary>
225-
226-
In version 1.0, it was possible to have `"type": "ordered"`.
227-
This is the same as `"type": "factor"` with `"ordered": true`.
228-
229-
In version 1.0, missing values could also be represented by the special value -2147483648.
230-
</details>
211+
- **(for version >= 1.1)** (optional) `ordered`, a boolean indicating whether to assume that the levels are ordered.
212+
If absent, levels are assumed to be non-ordered.
231213

232214
### Nothing
233215

@@ -285,6 +267,15 @@ DefaultExternals ext(nexpected);
285267
auto ptr = uzuki2::hdf5::parse<DefaultProvisioner>(file_path, group_name, ext);
286268
```
287269
270+
The parser supports multiple specification versions,
271+
though note that the version number of the specification has no direct relationship to the version number of the **uzuki2** library.
272+
273+
|Library version|HDF5 version|JSON version|
274+
|---------------|------------|------------|
275+
| 1.0.x| 1.0| 1.0|
276+
| 1.1.x| 1.0 - 1.1| 1.0 - 1.1|
277+
| 1.2.x| 1.0 - 1.2| 1.0 - 1.2|
278+
288279
Also see the [reference documentation](https://artifactdb.github.io/uzuki2) for more details.
289280
290281
### Building projects
