README.md: 38 additions & 47 deletions
@@ -26,7 +26,7 @@ All objects should be nested inside an R list.
26
26
27
27
The top-level group may have a `uzuki_version` attribute, describing the version of the **uzuki2** specification that it uses.
28
28
This should be a scalar string dataset of the form `X.Y` for non-negative integers `X` and `Y`.
29
-
If not provided, it is assumed to be "1.0".
29
+
The latest version of this specification is **1.2**; if not provided, it is assumed to be **1.0**.
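A reader might resolve the version along these lines. This is a minimal sketch and not part of the specification; the function name `parse_uzuki_version` and both constants are hypothetical.

```python
import re

LATEST_VERSION = (1, 2)   # latest version named in this specification
DEFAULT_VERSION = (1, 0)  # assumed when no version is recorded


def parse_uzuki_version(raw=None):
    """Return (major, minor) for an `X.Y` string, or the default if absent."""
    if raw is None:
        return DEFAULT_VERSION
    match = re.fullmatch(r"([0-9]+)\.([0-9]+)", raw)
    if match is None:
        raise ValueError(f"version should be of the form 'X.Y', got {raw!r}")
    return int(match.group(1)), int(match.group(2))
```

Returning a tuple of integers rather than the raw string means that checks such as `version >= (1, 1)` compare numerically, so `"1.10"` sorts after `"1.9"`.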
30
30
31
31
### Lists
32
32
@@ -47,6 +47,7 @@ An atomic vector is represented as a HDF5 group (`**/`) with the following attri
47
47
48
48
- `uzuki_object`, a scalar string dataset containing the value `"vector"`.
49
49
- `uzuki_type`, a scalar string dataset containing one of `"integer"`, `"boolean"`, `"number"` or `"string"`.
50
+
- **(for version 1.0)** this may also be `"date"` or `"date-time"`.
50
51
51
52
The group should contain a 1-dimensional dataset at `**/data`.
52
53
Vectors of length 1 may also be represented as a scalar dataset.
@@ -57,10 +58,13 @@ The allowed HDF5 datatype depends on `uzuki_type`:
57
58
Note that the converse is not required, i.e., the storage type does not need to be 32-bit if no such values are present in the dataset.
58
59
- `"number"`: any type of `H5T_FLOAT` that can be represented by a double-precision float.
59
60
- `"string"`: any type of `H5T_STRING` that can be represented by a UTF-8 encoded string.
61
+
- **(for version 1.0)** `"date"`: any type of `H5T_STRING` where the strings are in the `YYYY-MM-DD` format, or are equal to a missing placeholder value.
62
+
- **(for version 1.0)** `"date-time"`: any type of `H5T_STRING` where the strings are in the Internet Date/Time format, or are equal to a missing placeholder value.
60
63
61
64
For `boolean` type, values in `**/data` should be one of 0 (false) or 1 (true).
62
65
63
-
For `string` type, the group may optionally contain the `**/format` dataset.
66
+
**(for version >= 1.1)**
67
+
For the `string` type, the group may optionally contain the `**/format` dataset.
64
68
This should be a scalar string dataset that specifies constraints on the format of the values in `**/data`:
65
69
66
70
- `"date"`: strings should be `YYYY-MM-DD` dates or the placeholder value.
@@ -69,45 +73,42 @@ This should be a scalar string dataset that specifies constraints to the format
69
73
The atomic vector's group may also contain `**/names`, a 1-dimensional string dataset of length equal to that of `**/data`.
70
74
If `**/data` is a scalar, `**/names` should have length 1.
71
75
72
-
<details>
73
-
<summary>Changes from previous versions</summary>
74
-
75
-
In version 1.0, it was possible to have `uzuki_type` set to `"date"` or `"date-time"`.
76
-
This is the same as `uzuki_type` of `"string"` with `**/format` set to `"date"` or `"date-time"`.
77
-
</details>
78
-
79
76
#### Representing missing values
80
77
78
+
**(for version >= 1.1)**
81
79
Each `**/data` dataset may optionally contain a `missing-value-placeholder` attribute.
82
80
If present, this should be a scalar dataset that specifies the placeholder for missing values.
83
81
Any value of `**/data` that is equal to this placeholder should be treated as missing.
84
82
If no such attribute is present, it can be assumed that there are no missing values.
85
83
84
+
**(for version >= 1.2)**
86
85
The data type of the placeholder attribute should be exactly the same as that of `**/data`, so as to avoid unexpected results upon casting.
87
86
The only exception is when `**/data` is a string, in which case the placeholder type may be of any string type;
88
87
it is expected that any comparison between the placeholder and strings in `**/data` will be performed bytewise in the same manner as `strcmp`.
89
88
89
+
**(for version == 1.1)**
90
+
The data type of the placeholder attribute should have the same data type class as `**/data`.
91
+
92
+
**(for version >= 1.1)**
90
93
Floating-point missingness may be encoded in the payload of a NaN, which distinguishes it from a non-missing "not-a-number" value.
91
94
Comparisons on NaN placeholders should be performed in a bytewise manner (e.g., with `memcmp`) to ensure that the payload is taken into account.
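The bytewise comparison can be sketched as follows. The placeholder bit pattern here is purely illustrative (a quiet NaN with an arbitrary non-zero payload), and `same_bits` and `mask_missing` are hypothetical helper names rather than part of any uzuki2 implementation.

```python
import struct


def same_bits(x, y):
    """Compare two doubles by their 8-byte encodings, analogous to memcmp."""
    return struct.pack("<d", x) == struct.pack("<d", y)


# Hypothetical placeholder: a quiet NaN carrying an arbitrary payload.
PLACEHOLDER = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000 | 1954))[0]


def mask_missing(values, placeholder=PLACEHOLDER):
    """Replace values whose bits match the placeholder with None."""
    return [None if same_bits(v, placeholder) else v for v in values]
```

Ordinary equality cannot be used here because a NaN never compares equal to anything, including itself; comparing the encoded bytes is what allows a missing value to be told apart from a plain NaN.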
92
95
93
-
<details>
94
-
<summary>Changes from previous versions</summary>
95
-
96
-
**Version 1.1**
97
-
The missing value placeholder only needed to be of the same type class as `**/data`.
98
-
99
-
**Version 1.0**
96
+
**(for version 1.0)**
100
97
Integer or boolean values of -2147483648 were treated as missing.
101
-
102
98
Missing floats were represented by [R's NA representation](https://github.com/wch/r-source/blob/869e0f734dc4971c420cf417f5e0d18c0974a5af/src/main/arithmetic.c#L90-L98).
103
-
</details>
99
+
For strings, each `**/data` dataset may contain a `missing-value-placeholder` attribute.
100
+
If present, this should be a scalar string dataset that specifies the placeholder for missing values.
101
+
Any value of `**/data` that is equal to this placeholder should be treated as missing.
102
+
If no such attribute is present, it can be assumed that there are no missing values.
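The version-dependent rules for integer data could be applied roughly as below. This is a sketch under stated assumptions: `version` is a `(major, minor)` tuple, `placeholder` is the value of the `missing-value-placeholder` attribute (or `None` if absent), and the names `R_NA_INTEGER` and `mask_missing_integers` are hypothetical.

```python
R_NA_INTEGER = -2147483648  # treated as missing in version 1.0


def mask_missing_integers(values, version, placeholder=None):
    """Return a copy of `values` with missing entries replaced by None."""
    if version < (1, 1):
        # Version 1.0: a fixed sentinel marks missing integers and booleans.
        sentinel = R_NA_INTEGER
    elif placeholder is None:
        # Version >= 1.1 with no attribute: no values are missing.
        return list(values)
    else:
        # Version >= 1.1: the attribute supplies the placeholder.
        sentinel = placeholder
    return [None if v == sentinel else v for v in values]
```

For example, `mask_missing_integers([1, -2147483648], (1, 0))` yields `[1, None]`, whereas the same data read as version 1.2 without a placeholder attribute is returned unchanged.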
104
103
105
104
### Factors
106
105
107
106
A factor is represented as a HDF5 group (`**/`) with the following attributes:
108
107
109
108
- `uzuki_object`, a scalar string dataset containing the value `"vector"`.
110
109
- `uzuki_type`, a scalar string dataset containing `"factor"`.
110
+
- **(for version 1.0)** `uzuki_type` could also be set to `"ordered"`.
111
+
This is the same as `uzuki_type` of `"factor"` with the `**/ordered` dataset set to a truthy value.
111
112
112
113
The group should contain a 1-dimensional dataset at `**/data`, containing 0-based indices into the levels.
113
114
This should be any type of `H5T_INTEGER` that can be represented by a 32-bit signed integer.
@@ -121,16 +122,9 @@ beyond that count, the levels cannot be indexed by elements of `**/data`.
121
122
122
123
The group may also contain `**/names`, a 1-dimensional string dataset of length equal to that of `**/data`.
123
124
124
-
The group may optionally contain `**/ordered`, a scalar integer dataset.
125
+
**(for version >= 1.1)** The group may optionally contain `**/ordered`, a scalar integer dataset.
125
126
This should be interpreted as a boolean where a non-zero value specifies that we should assume that the levels are ordered.
126
127
127
-
<details>
128
-
<summary>Changes from previous versions</summary>
129
-
130
-
In version 1.0, it was possible to have `uzuki_type` set to `"ordered"`.
131
-
This is the same as `uzuki_type` of `"factor"` with the `**/ordered` dataset set to a truthy value.
132
-
</details>
133
-
134
128
### Nothing
135
129
136
130
A "nothing" (a.k.a., "null", "none") value is represented as a HDF5 group with the following attributes:
@@ -155,7 +149,7 @@ All R objects are represented by JSON objects with a `type` property.
155
149
Every R object should be nested inside an R list.
156
150
157
151
The top-level object may have a `version` property that contains the **uzuki2** specification version as a `"X.Y"` string for non-negative integers `X` and `Y`.
158
-
If missing, the version can be assumed to be "1.0".
152
+
The latest version of this specification is **1.2**; if missing, the version can be assumed to be **1.0**.
159
153
160
154
### Lists
161
155
@@ -171,6 +165,8 @@ An R list is represented as a JSON object with the following properties:
171
165
An atomic vector is represented as a JSON object with the following properties:
172
166
173
167
- `type`, set to one of `"integer"`, `"boolean"`, `"number"`, `"string"`.
168
+
- **(for version 1.0)** `type` could also be set to `"date"` or `"date-time"`.
169
+
This specifies strings in the date or Internet Date/Time format.
174
170
-`values`, an array of values for the vector (see below).
175
171
This may also be a scalar of the same type as the array contents.
176
172
- (optional) `"names"`, an array of length equal to `values`, containing the names of the vector elements.
@@ -183,10 +179,12 @@ The contents of `values` is subject to some constraints:
183
179
IEEE special values can be represented by strings, i.e., `NaN`, `Inf`, `-Inf`.
184
180
- `"integer"`: values should be JSON numbers that can be represented by a 32-bit signed integer.
185
181
Missing values may be represented by `null`.
182
+
- **(for version 1.0)** missing integers could also be represented by the special value -2147483648.
186
183
- `"boolean"`: values should be JSON booleans or `null` (for missing values).
187
184
- `"string"`: values should be JSON strings.
188
185
`null` is also allowed and represents a missing value.
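As an illustration of these constraints, the sketch below decodes the `values` of a `"number"` vector, mapping the special strings to IEEE values. Treating `null` as missing for numbers is an assumption carried over from the other types listed above, and `decode_numbers` and `SPECIAL` are hypothetical names.

```python
import json
import math

# Strings that stand in for IEEE special values.
SPECIAL = {"NaN": math.nan, "Inf": math.inf, "-Inf": -math.inf}


def decode_numbers(values):
    """Convert JSON `values` of a "number" vector into floats or None."""
    out = []
    for v in values:
        if v is None:
            out.append(None)  # assumed to denote a missing value
        elif isinstance(v, bool):
            raise TypeError("booleans are not valid in a number vector")
        elif isinstance(v, str):
            out.append(SPECIAL[v])  # KeyError for any other string
        else:
            out.append(float(v))
    return out


doc = json.loads('{"type": "number", "values": [1.5, "Inf", "-Inf", "NaN", null]}')
decoded = decode_numbers(doc["values"])
```

The explicit boolean check is there because Python treats `True` and `False` as integers, which would otherwise be silently converted to `1.0` and `0.0`.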
189
186
187
+
**(for version >= 1.1)**
190
188
For `type` of `"string"`, the object may optionally have a `format` property that constrains the `values`:
191
189
192
190
- `"date"`: values should be JSON strings following a `YYYY-MM-DD` format.
@@ -197,37 +195,21 @@ For `type` of `"string"`, the object may optionally have a `format` property tha
197
195
Vectors of length 1 may also be represented as scalars of the appropriate type.
198
196
While R makes no distinction between scalars and length-1 vectors, this may be useful for other frameworks where this difference is relevant.
199
197
200
-
<details>
201
-
<summary>Changes from previous versions</summary>
202
-
203
-
In version 1.0, it was possible to have `type` set to `"date"` or `"date-time"`.
204
-
This is the same as `"type": "string"` with `format` set to `"date"` or `"date-time"`.
205
-
206
-
In version 1.0, missing integers could also be represented by the special value -2147483648.
207
-
</details>
208
-
209
198
### Factors
210
199
211
200
A factor is represented as a JSON object with the following properties:
212
201
213
202
- `type`, set to `"factor"`.
203
+
- **(for version 1.0)** `type` can also be set to `"ordered"` for ordered levels.
214
204
- `values`, an array of 0-based integer indices for the factor.
215
205
These should be non-negative JSON numbers that can fit into a 32-bit signed integer.
216
206
They should also be less than the length of `levels`.
217
207
Missing values are represented by `null`.
208
+
- **(for version 1.0)** missing values could also be represented by the special value -2147483648.
218
209
- `levels`, an array of unique strings containing the levels for the indices in `values`.
219
-
- (optional) `ordered`, a boolean indicating whether to assume that the levels are ordered.
220
-
If absent, levels are assumed to be non-ordered.
221
210
- (optional) `"names"`, an array of length equal to `values`, containing the names of the factor elements.
222
-
223
-
<details>
224
-
<summary>Changes from previous versions</summary>
225
-
226
-
In version 1.0, it was possible to have `"type": "ordered"`.
227
-
This is the same as `"type": "factor"` with `"ordered": true`.
228
-
229
-
In version 1.0, missing values could also be represented by the special value -2147483648.
230
-
</details>
211
+
- **(for version >= 1.1)** (optional) `ordered`, a boolean indicating whether to assume that the levels are ordered.
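The factor properties above might be decoded as in the following sketch. It assumes that `version` is a `(major, minor)` tuple and that an absent `ordered` property means non-ordered levels; `decode_factor` is a hypothetical name.

```python
def decode_factor(obj, version=(1, 2)):
    """Return (labels, ordered) for a JSON factor object parsed into a dict."""
    levels = obj["levels"]
    if len(set(levels)) != len(levels):
        raise ValueError("levels should be unique")

    legacy = version < (1, 1)
    labels = []
    for i in obj["values"]:
        if i is None or (legacy and i == -2147483648):
            labels.append(None)  # missing value
        elif 0 <= i < len(levels):
            labels.append(levels[i])
        else:
            raise ValueError(f"index {i} is out of range for the levels")

    if legacy:
        # Version 1.0 signals ordering through the type itself.
        ordered = obj["type"] == "ordered"
    else:
        # Assumption: an absent `ordered` property means non-ordered.
        ordered = bool(obj.get("ordered", False))
    return labels, ordered
```

For instance, `decode_factor({"type": "factor", "values": [1, 0, None], "levels": ["a", "b"], "ordered": True})` returns `(["b", "a", None], True)`.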