Clarified the differences between specification versions. (#12)

LTLA · web-flow · commit 45dbcccb8441 · 2023-10-21T13:01:04.000-07:00
Also mentioned the relationship between specification and library versions.
diff --git a/README.md b/README.md
@@ -26,7 +26,7 @@ All objects should be nested inside an R list.
 
 The top-level group may have a `uzuki_version` attribute, describing the version of the **uzuki2** specification that it uses.
 This should be a scalar string dataset of the form `X.Y` for non-negative integers `X` and `Y`.
-If not provided, it is assumed to be "1.0".
+The latest version of this specification is **1.2**; if the attribute is not provided, the version is assumed to be **1.0**.
 
 ### Lists
 
@@ -47,6 +47,7 @@ An atomic vector is represented as a HDF5 group (`**/`) with the following attri
 
 - `uzuki_object`, a scalar string dataset containing the value `"vector"`.
 - `uzuki_type`, a scalar string dataset containing one of `"integer"`, `"boolean"`, `"number"` or `"string"`.
+   - **(for version 1.0)** this may also be `"date"` or `"date-time"`.
 
 The group should contain an 1-dimensional dataset at `**/data`.
 Vectors of length 1 may also be represented as a scalar dataset.
@@ -57,10 +58,13 @@ The allowed HDF5 datatype depends on `uzuki_type`:
   Note that the converse is not required, i.e., the storage type does not need to be 32-bit if no such values are present in the dataset.
 - `"number"`: any type of `H5T_FLOAT` that can be represented by a double-precision float.
 - `"string"`: any type of `H5T_STRING` that can be represented by a UTF-8 encoded string.
+- **(for version 1.0)** `"date"`: any type of `H5T_STRING` where the strings are in the `YYYY-MM-DD` format, or are equal to a missing placeholder value.
+- **(for version 1.0)** `"date-time"`: any type of `H5T_STRING` where the strings are in the Internet Date/Time format, or are equal to a missing placeholder value.
 
 For `boolean` type, values in `**/data` should be one of 0 (false) or 1 (true).
 
-For `string` type, the group may optionally contain the `**/format` dataset.
+**(for version >= 1.1)**
+For the `string` type, the group may optionally contain the `**/format` dataset.
 This should be a scalar string dataset that specifies constraints to the format of the values in `**/data`:
 
 - `"date"`: strings should be `YYYY-MM-DD` dates or the placeholder value.
@@ -69,45 +73,42 @@ This should be a scalar string dataset that specifies constraints to the format
 The atomic vector's group may also contain `**/names`, a 1-dimensional string dataset of length equal to that of `**/data`.
 If `**/data` is a scalar, `**/names` should have length 1.
 
-<details>
-<summary>Changes from previous versions</summary>
-
-In version 1.0, it was possible to have `uzuki_type` set to `"date"` or `"date-time"`.
-This is the same as `uzuki_type` of `"string"` with `**/format` set to `"date"` or `"date-time"`.
-</details>
-
 #### Representing missing values
 
+**(for version >= 1.1)** 
 Each `**/data` dataset may optionally contain a `missing-value-placeholder` attribute.
 If present, this should be a scalar dataset that specifies the placeholder for missing values.
 Any value of `**/data` that is equal to this placeholder should be treated as missing.
 If no such attribute is present, it can be assumed that there are no missing values. 
 
+**(for version >= 1.2)** 
 The data type of the placeholder attribute should be exactly the same as that of `**/data`, so as to avoid unexpected results upon casting.
 The only exception is when `**/data` is a string, in which case the placeholder type may be of any string type;
 it is expected that any comparison between the placeholder and strings in `**/data` will be performed bytewise in the same manner as `strcmp`.
 
+**(for version == 1.1)** 
+The placeholder attribute should have the same datatype class as `**/data`.
+
+**(for version >= 1.1)** 
 Floating point missingness may be encoded in the payload of an NaN, which distinguishes it from a non-missing "not-a-number" value.
 Comparisons on NaN placeholders should be performed in a bytewise manner (e.g., with `memcmp`) to ensure that the payload is taken into account.
 
-<details>
-<summary>Changes from previous versions</summary>
-
-**Version 1.1**
-The missing value placeholder only needed to be of the same type class as `**/data`.
-
-**Version 1.0**
+**(for version 1.0)** 
 Integer or boolean values of -2147483648 were treated as missing.
-
 Missing floats were represented by [R's NA representation](https://github.com/wch/r-source/blob/869e0f734dc4971c420cf417f5e0d18c0974a5af/src/main/arithmetic.c#L90-L98).
-</details>
+For strings, each `**/data` dataset may contain a `missing-value-placeholder` attribute.
+If present, this should be a scalar string dataset that specifies the placeholder for missing values.
+Any value of `**/data` that is equal to this placeholder should be treated as missing.
+If no such attribute is present, it can be assumed that there are no missing values. 
 
 ### Factors
 
 A factor is represented as a HDF5 group (`**/`) with the following attributes:
 
 - `uzuki_object`, a scalar string dataset containing the value `"vector"`.
 - `uzuki_type`, a scalar string dataset containing `"factor"`.
+  - **(for version 1.0)** `uzuki_type` could also be set to `"ordered"`.
+    This is the same as `uzuki_type` of `"factor"` with the `**/ordered` dataset set to a truthy value.
 
 The group should contain an 1-dimensional dataset at `**/data`, containing 0-based indices into the levels.
 This should be type of `H5T_INTEGER` that can be represented by a 32-bit signed integer.
@@ -121,16 +122,9 @@ beyond that count, the levels cannot be indexed by elements of `**/data`.
 
 The group may also contain `**/names`, a 1-dimensional string dataset of length equal to `data`.
 
-The group may optionally contain `**/ordered`, a scalar integer dataset.
+**(for version >= 1.1)** The group may optionally contain `**/ordered`, a scalar integer dataset.
 This should be interpreted as a boolean where a non-zero value specifies that we should assume that the levels are ordered.
 
-<details>
-<summary>Changes from previous versions</summary>
-
-In version 1.0, it was possible to have `uzuki_type` set to `"ordered"`.
-This is the same as `uzuki_type` of `"factor"` with the `**/ordered` dataset set to a truthy value.
-</details>
-
 ### Nothing
 
 A "nothing" (a.k.a., "null", "none") value is represented as a HDF5 group with the following attributes:
@@ -155,7 +149,7 @@ All R objects are represented by JSON objects with a `type` property.
 Every R object should be nested inside an R list.
 
 The top-level object may have a `version` property that contains the **uzuki2** specification version as a `"X.Y"` string for non-negative integers `X` and `Y`.
-If missing, the version can be assumed to be "1.0".
+The latest version of this specification is **1.2**; if the property is missing, the version can be assumed to be **1.0**.
 
 ### Lists
 
@@ -171,6 +165,8 @@ An R list is represented as a JSON object with the following properties:
 An atomic vector is represented as a JSON object with the following properties:
 
 - `type`, set to one of `"integer"`, `"boolean"`, `"number"`, `"string"`.
+  - **(for version 1.0)** `type` could also be set to `"date"` or `"date-time"`.
+    This specifies strings in the `YYYY-MM-DD` date format or the Internet Date/Time format, respectively.
 - `values`, an array of values for the vector (see below).
   This may also be a scalar of the same type as the array contents.
 - (optional) `"names"`, an array of length equal to `values`, containing the names of the list elements.
@@ -183,10 +179,12 @@ The contents of `values` is subject to some constraints:
   IEEE special values can be represented by strings, i.e., `NaN`, `Inf`, `-Inf`.
 - `"integer"`: values should be JSON numbers that can be represented by a 32-bit signed integer.
   Missing values may be represented by `null`.
+  - **(for version 1.0)** missing integers could also be represented by the special value -2147483648.
 - `"boolean"`: values should be JSON booleans or `null` (for missing values).
 - `string`: values should be JSON strings.
   `null` is also allowed and represents a missing value.
 
+**(for version >= 1.1)** 
 For `type` of `"string"`, the object may optionally have a `format` property that constrains the `values`:
   
 - `"date"`: values should be JSON strings following a `YYYY-MM-DD` format.
@@ -197,37 +195,21 @@ For `type` of `"string"`, the object may optionally have a `format` property tha
 Vectors of length 1 may also be represented as scalars of the appropriate type.
 While R makes no distinction between scalars and length-1 vectors, this may be useful for other frameworks where this difference is relevant.
 
-<details>
-<summary>Changes from previous versions</summary>
-
-In version 1.0, it was possible to have `type` set to `"date"` or `"date-time"`.
-This is the same as `"type": "string"` with `format` set to `"date"` or `"date-time"`.
-
-In version 1.0, missing integers could also be represented by the special value -2147483648.
-</details>
-
 ### Factors
 
 A factor is represented as a JSON object with the following properties:
 
 - `type`, set to `"factor"`. 
+  - **(for version 1.0)** `type` could also be set to `"ordered"` for ordered levels.
 - `values`, an array of 0-based integer indices for the factor.
   These should be non-negative JSON numbers that can fit into a 32-bit signed integer.
   They should also be less than the length of `levels`.
   Missing values are represented by `null`.
+  - **(for version 1.0)** missing values could also be represented by the special value -2147483648.
 - `levels`, an array of unique strings containing the levels for the indices in `values`.
-- (optional) `ordered`, a boolean indicating whether to assume that the levels are ordered.
-  If absent, levels are assumed to be non-ordered.
 - (optional) `"names"`, an array of length equal to `values`, containing the names of the list elements.
-
-<details>
-<summary>Changes from previous versions</summary>
-
-In version 1.0, it was possible to have `"type": "ordered"`.
-This is the same as `"type": "factor"` with `"ordered": true`. 
-
-In version 1.0, missing values could also be represented by the special value -2147483648.
-</details>
+- **(for version >= 1.1)** (optional) `ordered`, a boolean indicating whether to assume that the levels are ordered.
+  If absent, levels are assumed to be non-ordered.
 
 ### Nothing
 
@@ -285,6 +267,15 @@ DefaultExternals ext(nexpected);
 auto ptr = uzuki2::hdf5::parse<DefaultProvisioner>(file_path, group_name, ext);
 ```
 
+The parser supports multiple specification versions.
+Note that the version number of the specification has no direct relationship to the version number of the **uzuki2** library; each library version supports the following specification versions:
+
+|Library version|HDF5 specification|JSON specification|
+|---------------|------------------|------------------|
+|          1.0.x|               1.0|               1.0|
+|          1.1.x|         1.0 - 1.1|         1.0 - 1.1|
+|          1.2.x|         1.0 - 1.2|         1.0 - 1.2|
+
 Also see the [reference documentation](https://artifactdb.github.io/uzuki2) for more details.
 
 ### Building projects