From 5f8d3f02219cf3c3b7fa4e173d295b457a6c3bf3 Mon Sep 17 00:00:00 2001 From: LTLA Date: Sat, 18 Nov 2023 00:38:56 -0800 Subject: [PATCH] Split up the README into separate spec files. --- README.md | 252 ++---------------------------------- docs/specifications/hdf5.md | 140 ++++++++++++++++++++ docs/specifications/json.md | 87 +++++++++++++ docs/specifications/misc.md | 25 ++++ 4 files changed, 261 insertions(+), 243 deletions(-) create mode 100644 docs/specifications/hdf5.md create mode 100644 docs/specifications/json.md create mode 100644 docs/specifications/misc.md diff --git a/README.md b/README.md index 59f9364..c21e820 100644 --- a/README.md +++ b/README.md @@ -11,239 +11,16 @@ List elements may be atomic vectors, `NULL`, or nested lists of such objects. It also supports missing values in the vectors and per-element names on the vectors or lists. A mechanism is also provided to handle external references to more complex objects (e.g., S4 classes) that cannot be directly saved into the format. -We support serialization in either [HDF5](https://www.hdfgroup.org/) or (possibly Gzip-compressed) [JSON](https://json.org). +## Specifications + +We support serialization in either HDF5 or (possibly Gzip-compressed) JSON. Both of these are widely used formats and have complementary strengths for list representation. HDF5 supports random access into list components, which can provide optimization opportunities when the list is large and/or contains large atomic vectors. In contrast, JSON is easier to parse and has less storage overhead per list element. -## HDF5 Specification - -We use `**/` to represent a variable name of the group representing any of the supported R objects. -It is assumed that `**/` will be replaced by the actual name of the group in implementations, -as defined by users (for the top-level group) or by the specification (e.g., as a nested child of a list). - -All objects should be nested inside an R list. 
- -The top-level group may have a `uzuki_version` attribute, describing the version of the **uzuki2** specification that it uses. -This should be a scalar string dataset of the form `X.Y` for non-negative integers `X` and `Y`. -The latest version of this specification is **1.3**; if not provided, it is assumed to be **1.0**. - -### Lists - -An R list is represented as a HDF5 group (`**/`) with the following attributes: - -- `uzuki_object`, a scalar string dataset containing the value `"list"`. - -This group should contain a subgroup `**/data` that contains the list elements. -Each list element is itself represented by a subgroup that is named after its 0-based position in the list, e.g., `**/data/0` for the first list element. -One subgroup should be present for each integer in `[0, N)`, given a list of length `N`. -Each list element may be any of the objects described in this specification, including further nested lists. - -If the list is named, there will additionally be a 1-dimensional `**/names` string dataset of length equal to the number of elements in `**/data`. - -### Atomic vectors - -An atomic vector is represented as a HDF5 group (`**/`) with the following attributes: - -- `uzuki_object`, a scalar string dataset containing the value `"vector"`. -- `uzuki_type`, a scalar string dataset containing one of `"integer"`, `"boolean"`, `"number"` or `"string"`. - - **(for version 1.0)** this may also be `"date"` or `"date-time"`. - -The group should contain an 1-dimensional dataset at `**/data`. -Vectors of length 1 may also be represented as a scalar dataset. -(While R makes no distinction between scalars and length-1 vectors, this may be useful for other frameworks where this difference is relevant.) -The allowed HDF5 datatype depends on `uzuki_type`: - -- `"integer"`, `"boolean"`: any type of `H5T_INTEGER` that can be represented by a 32-bit signed integer. 
- Note that the converse is not required, i.e., the storage type does not need to be 32-bit if no such values are present in the dataset. -- **(for version < 1.3)** `"number"`: any type of `H5T_FLOAT` that can be represented by a double-precision float. -- **(for version >= 1.3)** `"number"`: any type of `H5T_FLOAT` or `H5T_INTEGER` that can be represented exactly by a double-precision (64-bit) float. - This implies a limit of 32 bits for any integer datatype. - See also the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0) for more details. -- `"string"`: any type of `H5T_STRING` that can be represented by a UTF-8 encoded string. -- **(for version 1.0)** `"date"`: any type of `H5T_STRING` where the srings are in the `YYYY-MM-DD` format, or are equal to a missing placeholder value. -- **(for version 1.0)** `"date-time"`: any type of `H5T_STRING` where the srings are Internet Date/Time format, or are equal to a missing placeholder value. - -For `boolean` type, values in `**/data` should be one of 0 (false) or 1 (true). - -**(for versions >= 1.1)** -For the `string` type, the group may optionally contain the `**/format` dataset. -This should be a scalar string dataset that specifies constraints to the format of the values in `**/data`: - -- `"date"`: strings should be `YYYY-MM-DD` dates or the placeholder value. -- `"date-time"`: strings should be in the Internet Date/Time format ([RFC 3339, Section 5.6](https://www.rfc-editor.org/rfc/rfc3339#section-5.6)) or the placeholder value. - -The atomic vector's group may also contain `**/names`, a 1-dimensional string dataset of length equal to that of `**/data`. -If `**/data` is a scalar, `**/names` should have length 1. - -#### Representing missing values - -**(for version >= 1.1)** -Each `**/data` dataset may optionally contain a `missing-value-placeholder` attribute. -If present, this should be a scalar dataset that specifies the placeholder for missing values. 
-Any value of `**/data` that is equal to this placeholder should be treated as missing. -If no such attribute is present, it can be assumed that there are no missing values. - -**(for version >= 1.2)** -The data type of the placeholder attribute should be exactly the same as that of `**/data`, so as to avoid unexpected results upon casting. -The only exception is when `**/data` is a string, in which case the placeholder type may be of any string type; -it is expected that any comparison between the placeholder and strings in `**/data` will be performed bytewise in the same manner as `strcmp`. - -**(for version == 1.1)** -The data type of the placeholder attribute should have the same data type class as `**/data`. - -**(for version >= 1.3)** -Floating-point missingness should be identified using the equality operator when both the placeholder and data values are loaded into memory as IEEE754-compliant `double`s. -No casting should be performed to a lower-precision type, as this may cause a non-missing value to become equal to the placeholder. -If the placeholder is NaN, all NaNs in the dataset should be considered missing, regardless of the exact bit representation in the NaN payload. -See the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0) for more details. - -**(for version >= 1.1, < 1.3)** -Floating-point missingness may be encoded in the payload of an NaN, which distinguishes it from a non-missing "not-a-number" value. -Comparisons on NaN placeholders should be performed in a bytewise manner (e.g., with `memcmp`) to ensure that the payload is taken into account. - -**(for version 1.0)** -Integer or boolean values of -2147483648 are treated as missing. -Missing floats are represented by [R's NA representation](https://github.com/wch/r-source/blob/869e0f734dc4971c420cf417f5e0d18c0974a5af/src/main/arithmetic.c#L90-L98). -For strings, each `**/data` dataset may contain a `missing-value-placeholder` attribute. 
-If present, this should be a scalar string dataset that specifies the placeholder for missing values. -Any value of `**/data` that is equal to this placeholder should be treated as missing. -If no such attribute is present, it can be assumed that there are no missing values. - -### Factors - -A factor is represented as a HDF5 group (`**/`) with the following attributes: - -- `uzuki_object`, a scalar string dataset containing the value `"vector"`. -- `uzuki_type`, a scalar string dataset containing `"factor"`. - - **(for version 1.0)** `uzuki_type` could also be set to `"ordered"`. - This is the same as `uzuki_type` of `"factor"` with the `**/ordered` dataset set to a truthy value. - -The group should contain an 1-dimensional dataset at `**/data`, containing 0-based indices into the levels. -This should be type of `H5T_INTEGER` that can be represented by a 32-bit signed integer. -Missing values are represented as described above for atomic vectors. - -The group should also contain `**/levels`, a 1-dimensional string dataset that contains the levels for the indices in `**/data`. -Values in `**/levels` should be unique. -Values in `**/data` should be non-negative (missing values excepted) and less than the length of `**/levels`. -Note that the type constraints on `**/data` suggest that there should not be more than 2147483647 levels; -beyond that count, the levels cannot be indexed by elements of `**/data`. - -The group may also contain `**/names`, a 1-dimensional string dataset of length equal to `data`. - -**(for version >= 1.1)** The group may optionally contain `**/ordered`, a scalar integer dataset. -This should be interpreted as a boolean where a non-zero value specifies that we should assume that the levels are ordered. - -### Nothing +The full HDF5 specification is provided [here](docs/specifications/hdf5.md). 
-A "nothing" (a.k.a., "null", "none") value is represented as a HDF5 group with the following attributes: - -- `uzuki_object`, a scalar string dataset containing the value `"nothing"`. - -### External object - -Each external object is represented as a HDF5 group (`**/`) with the following attributes: - -- `uzuki_object`, a scalar string dataset containing the value `"external"`. - -This should contain an `**/index` scalar dataset, containing an index that identifies this external object uniquely within the entire list. -`**/index` should start at zero and be incremented whenever an external object is encountered. - -By indexing this external metadata, we can restore the object in its appropriate location in the list. -The exact mechanism by which this restoration occurs is implementation-defined. - -## JSON Specification - -All R objects are represented by JSON objects with a `type` property. -Every R object should be nested inside an R list. - -The top-level object may have a `version` property that contains the **uzuki2** specification version as a `"X.Y"` string for non-negative integers `X` and `Y`. -The latest version of this specification is **1.2**; if missing, the version can be assumed to be **1.0**. - -### Lists - -An R list is represented as a JSON object with the following properties: - -- `type`, set to `"list"`. -- `values`, an array of JSON objects corresponding to nested R objects. - Each JSON object may follow any of the formats described in this specification. -- (optional) `"names"`, an array of length equal to `values`, containing the names of the list elements. - -### Atomic vectors - -An atomic vector is represented as a JSON object with the following properties: - -- `type`, set to one of `"integer"`, `"boolean"`, `"number"`, `"string"`. - - **(for version 1.0)** `type` could also be set to `"date"` or `"date-time"`. - This specifies strings in the date or Internet Date/Time format. -- `values`, an array of values for the vector (see below). 
- This may also be a scalar of the same type as the array contents. -- (optional) `"names"`, an array of length equal to `values`, containing the names of the list elements. - If `values` is a scalar, `names` should have length 1. - -The contents of `values` is subject to some constraints: - -- `"number"`: values should be JSON numbers. - Missing values are represented by `null`. - IEEE special values can be represented by strings, i.e., `NaN`, `Inf`, `-Inf`. -- `"integer"`: values should be JSON numbers that can be represented by a 32-bit signed integer. - Missing values may be represented by `null`. - - **(for version 1.0)** missing integers could also be represented by the special value -2147483648. -- `"boolean"`: values should be JSON booleans or `null` (for missing values). -- `string`: values should be JSON strings. - `null` is also allowed and represents a missing value. - -**(for version >= 1.1)** -For `type` of `"string"`, the object may optionally have a `format` property that constrains the `values`: - -- `"date"`: values should be JSON strings following a `YYYY-MM-DD` format. - `null` is also allowed and represents a missing value. -- `"date-time"`: values should be JSON strings following the Internet Date/Time format. - `null` is also allowed and represents a missing value. - -Vectors of length 1 may also be represented as scalars of the appropriate type. -While R makes no distinction between scalars and length-1 vectors, this may be useful for other frameworks where this difference is relevant. - -### Factors - -A factor is represented as a JSON object with the following properties: - -- `type`, set to `"factor"`. - - **(for version 1.0)** `type` can also be set to `"ordered"` for ordered levels. -- `values`, an array of 0-based integer indices for the factor. - These should be non-negative JSON numbers that can fit into a 32-bit signed integer. - They should also be less than the length of `levels`. - Missing values are represented by `null`. 
- - **(for version 1.0)** missing values could also be represented by the special value -2147483648. -- `levels`, an array of unique strings containing the levels for the indices in `values`. -- (optional) `"names"`, an array of length equal to `values`, containing the names of the list elements. -- **(for version >= 1.1)** (optional) `ordered`, a boolean indicating whether to assume that the levels are ordered. - If absent, levels are assumed to be non-ordered. - -### Nothing - -A "nothing" (a.k.a., "null", "none") value is represented as a JSON object with the following properties: - -- `type`, set to `"nothing"`. - -### External object - -Each external object is represented as a JSON object with the following properties: - -- `type`, set to `"index"`. -- `index`, a non-negative JSON number that can fit into a 32-bit signed integer. - This identifies this external object uniquely within the entire list. - See the equivalent in the HDF5 specification for more details. - -## Comments on names - -Both HDF5 and JSON support naming of the vector elements, typically via the `names` group/property. -If `names` are supplied, their contents should always be non-missing (e.g., not `null` in JSON, no `missing-value-placeholder` in HDF5). -Each name is allowed to be any string, including an empty string. - -It is technically permitted to provide duplicate names in `names`, consistent with how R itself supports duplicate names in its lists and vectors. -However, this is not recommended as other frameworks may wish to use representations that assume unique names, e.g., using Python dictionaries to represent named lists. -By providing unique names, users can improve interoperability with native data structures in other frameworks. +The full JSON specification is provided [here](docs/specifications/json.md). 
## Validation @@ -284,6 +61,7 @@ though note the version number of the specification has no direct relationship t | 1.0.x| 1.0| 1.0| | 1.1.x| 1.0 - 1.1| 1.0 - 1.1| | 1.2.x| 1.0 - 1.2| 1.0 - 1.2| +| 1.3.x| 1.0 - 1.3| 1.0 - 1.2| Also see the [reference documentation](https://artifactdb.github.io/uzuki2) for more details. @@ -339,22 +117,10 @@ either directly or with Git submodules - and include their path during compilati You will also need to link to the dependencies listed in the [`extern/CMakeLists.txt`](extern/CMakeLists.txt) directory, along with the HDF5 and Zlib libraries. -## Comparison to version 1 - -**uzuki2** involves some major changes from the original [**uzuki**](https://github.com/LTLA/uzuki) library. -Most obviously, we added support for HDF5 alongside the JSON format. -The latter supports random access without loading the entire list contents into memory, -which provides some optimization opportunities for parsers when large vectors are present. - -Arrays and data frames are no longer supported in **uzuki2**. -Such objects should instead be represented by external references, -under the assumption that any serialization framework using **uzuki2** would already have a separate mechanism for representing arrays and data frames. -For example, the [**alabaster**](https://github.com/ArtifactDB/alabaster.base) framework has its own staging methods for these objects. +## Further comments -In the JSON format, **uzuki2** is also more explicit with its serialization of lists. -These now have their own dedicated `"type": "list"`, rather than relying on the implicit interpretation of arrays as unnamed lists and JSON objects as named lists. -In particular, treating JSON objects as named lists led to ambiguities when a list element was named `"type"`; it also failed to preserve the ordering of list elements. +See [here](docs/specifications/misc.md#comparison-to-version-1) for a list of changes from the original [**uzuki**](https://github.com/LTLA/uzuki) library. 
-Just like the original **uzuki** library, we're just re-using the reference to [Uzuki Shimamura](https://myanimelist.net/character/70883/Uzuki_Shimamura) for the name: +Just like the original **uzuki**, we're just re-using the reference to [Uzuki Shimamura](https://myanimelist.net/character/70883/Uzuki_Shimamura) for the name: ![Uzuki Shimamura](https://media1.giphy.com/media/7Oy2FDqWV5mak/giphy.gif) diff --git a/docs/specifications/hdf5.md b/docs/specifications/hdf5.md new file mode 100644 index 0000000..8b52aad --- /dev/null +++ b/docs/specifications/hdf5.md @@ -0,0 +1,140 @@ +# HDF5 Specification + +## General comments + +We use `**/` to represent a variable name of the group representing any of the supported R objects. +It is assumed that `**/` will be replaced by the actual name of the group in implementations, +as defined by users (for the top-level group) or by the specification (e.g., as a nested child of a list). + +All objects should be nested inside an R list. + +The top-level group may have a `uzuki_version` attribute, describing the version of the **uzuki2** specification that it uses. +This should be a scalar string dataset of the form `X.Y` for non-negative integers `X` and `Y`. +The latest version of this specification is **1.3**; if not provided, it is assumed to be **1.0**. + +## Lists + +An R list is represented as a HDF5 group (`**/`) with the following attributes: + +- `uzuki_object`, a scalar string dataset containing the value `"list"`. + +This group should contain a subgroup `**/data` that contains the list elements. +Each list element is itself represented by a subgroup that is named after its 0-based position in the list, e.g., `**/data/0` for the first list element. +One subgroup should be present for each integer in `[0, N)`, given a list of length `N`. +Each list element may be any of the objects described in this specification, including further nested lists. 
+ +If the list is named, there will additionally be a 1-dimensional `**/names` string dataset of length equal to the number of elements in `**/data`. +See also the [comments on names](misc.md#comments-on-names). + +## Atomic vectors + +An atomic vector is represented as a HDF5 group (`**/`) with the following attributes: + +- `uzuki_object`, a scalar string dataset containing the value `"vector"`. +- `uzuki_type`, a scalar string dataset containing one of `"integer"`, `"boolean"`, `"number"` or `"string"`. + - **(for version 1.0)** this may also be `"date"` or `"date-time"`. + +The group should contain a 1-dimensional dataset at `**/data`. +Vectors of length 1 may also be represented as a scalar dataset. +(While R makes no distinction between scalars and length-1 vectors, this may be useful for other frameworks where this difference is relevant.) +The allowed HDF5 datatype depends on `uzuki_type`: + +- `"integer"`, `"boolean"`: any type of `H5T_INTEGER` that can be represented by a 32-bit signed integer. + Note that the converse is not required, i.e., the storage type does not need to be 32-bit if no such values are present in the dataset. +- **(for version < 1.3)** `"number"`: any type of `H5T_FLOAT` that can be represented by a double-precision float. +- **(for version >= 1.3)** `"number"`: any type of `H5T_FLOAT` or `H5T_INTEGER` that can be represented exactly by a double-precision (64-bit) float. + This implies a limit of 32 bits for any integer datatype. + See also the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0) for more details. +- `"string"`: any type of `H5T_STRING` that can be represented by a UTF-8 encoded string. +- **(for version 1.0)** `"date"`: any type of `H5T_STRING` where the strings are in the `YYYY-MM-DD` format, or are equal to a missing placeholder value. 
+- **(for version 1.0)** `"date-time"`: any type of `H5T_STRING` where the strings are in the Internet Date/Time format, or are equal to a missing placeholder value. + +For `boolean` type, values in `**/data` should be one of 0 (false) or 1 (true). + +**(for versions >= 1.1)** +For the `string` type, the group may optionally contain the `**/format` dataset. +This should be a scalar string dataset that specifies constraints on the format of the values in `**/data`: + +- `"date"`: strings should be `YYYY-MM-DD` dates or the placeholder value. +- `"date-time"`: strings should be in the Internet Date/Time format ([RFC 3339, Section 5.6](https://www.rfc-editor.org/rfc/rfc3339#section-5.6)) or the placeholder value. + +The atomic vector's group may also contain `**/names`, a 1-dimensional string dataset of length equal to that of `**/data`. +If `**/data` is a scalar, `**/names` should have length 1. +See also the [comments on names](misc.md#comments-on-names). + +### Representing missing values + +**(for version >= 1.1)** +Each `**/data` dataset may optionally contain a `missing-value-placeholder` attribute. +If present, this should be a scalar dataset that specifies the placeholder for missing values. +Any value of `**/data` that is equal to this placeholder should be treated as missing. +If no such attribute is present, it can be assumed that there are no missing values. + +**(for version >= 1.2)** +The data type of the placeholder attribute should be exactly the same as that of `**/data`, so as to avoid unexpected results upon casting. +The only exception is when `**/data` is a string, in which case the placeholder type may be of any string type; +it is expected that any comparison between the placeholder and strings in `**/data` will be performed bytewise in the same manner as `strcmp`. + +**(for version == 1.1)** +The data type of the placeholder attribute should have the same data type class as `**/data`. 
+ +**(for version >= 1.3)** +Floating-point missingness should be identified using the equality operator when both the placeholder and data values are loaded into memory as IEEE754-compliant `double`s. +No casting should be performed to a lower-precision type, as this may cause a non-missing value to become equal to the placeholder. +If the placeholder is NaN, all NaNs in the dataset should be considered missing, regardless of the exact bit representation in the NaN payload. +See the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0) for more details. + +**(for version >= 1.1, < 1.3)** +Floating-point missingness may be encoded in the payload of an NaN, which distinguishes it from a non-missing "not-a-number" value. +Comparisons on NaN placeholders should be performed in a bytewise manner (e.g., with `memcmp`) to ensure that the payload is taken into account. + +**(for version 1.0)** +Integer or boolean values of -2147483648 are treated as missing. +Missing floats are represented by [R's NA representation](https://github.com/wch/r-source/blob/869e0f734dc4971c420cf417f5e0d18c0974a5af/src/main/arithmetic.c#L90-L98). +For strings, each `**/data` dataset may contain a `missing-value-placeholder` attribute. +If present, this should be a scalar string dataset that specifies the placeholder for missing values. +Any value of `**/data` that is equal to this placeholder should be treated as missing. +If no such attribute is present, it can be assumed that there are no missing values. + +## Factors + +A factor is represented as a HDF5 group (`**/`) with the following attributes: + +- `uzuki_object`, a scalar string dataset containing the value `"vector"`. +- `uzuki_type`, a scalar string dataset containing `"factor"`. + - **(for version 1.0)** `uzuki_type` could also be set to `"ordered"`. + This is the same as `uzuki_type` of `"factor"` with the `**/ordered` dataset set to a truthy value. 
+ +The group should contain a 1-dimensional dataset at `**/data`, containing 0-based indices into the levels. +This should be any type of `H5T_INTEGER` that can be represented by a 32-bit signed integer. +Missing values are represented as described above for atomic vectors. + +The group should also contain `**/levels`, a 1-dimensional string dataset that contains the levels for the indices in `**/data`. +Values in `**/levels` should be unique. +Values in `**/data` should be non-negative (missing values excepted) and less than the length of `**/levels`. +Note that the type constraints on `**/data` suggest that there should not be more than 2147483647 levels; +beyond that count, the levels cannot be indexed by elements of `**/data`. + +The group may also contain `**/names`, a 1-dimensional string dataset of length equal to that of `**/data`. +See also the [comments on names](misc.md#comments-on-names). + +**(for version >= 1.1)** The group may optionally contain `**/ordered`, a scalar integer dataset. +This should be interpreted as a boolean where a non-zero value specifies that we should assume that the levels are ordered. + +## Nothing + +A "nothing" (a.k.a., "null", "none") value is represented as a HDF5 group with the following attributes: + +- `uzuki_object`, a scalar string dataset containing the value `"nothing"`. + +## External object + +Each external object is represented as a HDF5 group (`**/`) with the following attributes: + +- `uzuki_object`, a scalar string dataset containing the value `"external"`. + +This should contain an `**/index` scalar dataset, containing an index that identifies this external object uniquely within the entire list. +`**/index` should start at zero and be incremented whenever an external object is encountered. + +By indexing this external metadata, we can restore the object in its appropriate location in the list. +The exact mechanism by which this restoration occurs is implementation-defined. 
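The floating-point rules under "Representing missing values" in the HDF5 specification above differ between versions, and a small sketch may make the distinction concrete. The following is only an illustration in plain Python, not the library's implementation: the names `is_missing` and `nan_with_payload` are hypothetical, values are assumed to be already loaded as 64-bit doubles, and version 1.0 is deliberately left out because it relies on R's NA representation rather than a placeholder attribute.

```python
import math
import struct


def nan_with_payload(payload: int) -> float:
    """Build a quiet NaN whose low mantissa bits carry `payload` (hypothetical helper)."""
    bits = 0x7FF8000000000000 | payload
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def is_missing(value: float, placeholder: float, version: tuple) -> bool:
    """Decide whether a double `value` matches the missing-value placeholder.

    `version` is a (major, minor) tuple such as (1, 3).
    """
    if version < (1, 1):
        # Version 1.0 does not use a placeholder attribute for floats.
        raise ValueError("this sketch only covers versions >= 1.1")
    if version >= (1, 3):
        # Version >= 1.3: a NaN placeholder matches every NaN, whatever its
        # payload; any other placeholder is compared with ordinary equality.
        if math.isnan(placeholder):
            return math.isnan(value)
        return value == placeholder
    # Versions 1.1 and 1.2: bytewise comparison (in the spirit of memcmp),
    # so the NaN payload is taken into account.
    return struct.pack("<d", value) == struct.pack("<d", placeholder)


if __name__ == "__main__":
    tagged = nan_with_payload(1)   # placeholder NaN with a payload
    plain = nan_with_payload(0)    # ordinary quiet NaN
    for version in [(1, 2), (1, 3)]:
        print(version, is_missing(plain, tagged, version))
```

Under these assumptions, the same file contents can yield different missingness calls depending on the declared `uzuki_version`, which is why a reader would need to consult the version before choosing a comparison strategy.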
diff --git a/docs/specifications/json.md b/docs/specifications/json.md new file mode 100644 index 0000000..b30ee3d --- /dev/null +++ b/docs/specifications/json.md @@ -0,0 +1,87 @@ +# JSON Specification + +## General comments + +All R objects are represented by JSON objects with a `type` property. +Every R object should be nested inside an R list. + +The top-level object may have a `version` property that contains the **uzuki2** specification version as a `"X.Y"` string for non-negative integers `X` and `Y`. +The latest version of this specification is **1.2**; if missing, the version can be assumed to be **1.0**. + +## Lists + +An R list is represented as a JSON object with the following properties: + +- `type`, set to `"list"`. +- `values`, an array of JSON objects corresponding to nested R objects. + Each JSON object may follow any of the formats described in this specification. +- (optional) `"names"`, an array of length equal to `values`, containing the names of the list elements. + See also the [comments on names](misc.md#comments-on-names). + +## Atomic vectors + +An atomic vector is represented as a JSON object with the following properties: + +- `type`, set to one of `"integer"`, `"boolean"`, `"number"`, `"string"`. + - **(for version 1.0)** `type` could also be set to `"date"` or `"date-time"`. + This specifies strings in the date or Internet Date/Time format. +- `values`, an array of values for the vector (see below). + This may also be a scalar of the same type as the array contents. +- (optional) `"names"`, an array of length equal to `values`, containing the names of the list elements. + If `values` is a scalar, `names` should have length 1. + See also the [comments on names](misc.md#comments-on-names). + +The contents of `values` is subject to some constraints: + +- `"number"`: values should be JSON numbers. + Missing values are represented by `null`. + IEEE special values can be represented by strings, i.e., `NaN`, `Inf`, `-Inf`. 
+- `"integer"`: values should be JSON numbers that can be represented by a 32-bit signed integer. + Missing values may be represented by `null`. + - **(for version 1.0)** missing integers could also be represented by the special value -2147483648. +- `"boolean"`: values should be JSON booleans or `null` (for missing values). +- `string`: values should be JSON strings. + `null` is also allowed and represents a missing value. + +**(for version >= 1.1)** +For `type` of `"string"`, the object may optionally have a `format` property that constrains the `values`: + +- `"date"`: values should be JSON strings following a `YYYY-MM-DD` format. + `null` is also allowed and represents a missing value. +- `"date-time"`: values should be JSON strings following the Internet Date/Time format. + `null` is also allowed and represents a missing value. + +Vectors of length 1 may also be represented as scalars of the appropriate type. +While R makes no distinction between scalars and length-1 vectors, this may be useful for other frameworks where this difference is relevant. + +## Factors + +A factor is represented as a JSON object with the following properties: + +- `type`, set to `"factor"`. + - **(for version 1.0)** `type` can also be set to `"ordered"` for ordered levels. +- `values`, an array of 0-based integer indices for the factor. + These should be non-negative JSON numbers that can fit into a 32-bit signed integer. + They should also be less than the length of `levels`. + Missing values are represented by `null`. + - **(for version 1.0)** missing values could also be represented by the special value -2147483648. +- `levels`, an array of unique strings containing the levels for the indices in `values`. +- (optional) `"names"`, an array of length equal to `values`, containing the names of the list elements. + See also the [comments on names](misc.md#comments-on-names). 
+- **(for version >= 1.1)** (optional) `ordered`, a boolean indicating whether to assume that the levels are ordered. + If absent, levels are assumed to be non-ordered. + +## Nothing + +A "nothing" (a.k.a., "null", "none") value is represented as a JSON object with the following properties: + +- `type`, set to `"nothing"`. + +## External object + +Each external object is represented as a JSON object with the following properties: + +- `type`, set to `"index"`. +- `index`, a non-negative JSON number that can fit into a 32-bit signed integer. + This identifies this external object uniquely within the entire list. + See the equivalent in the HDF5 specification for more details. diff --git a/docs/specifications/misc.md b/docs/specifications/misc.md new file mode 100644 index 0000000..2599d3d --- /dev/null +++ b/docs/specifications/misc.md @@ -0,0 +1,25 @@ +# Comments on names + +Both HDF5 and JSON support naming of the vector elements, typically via the `names` group/property. +If `names` are supplied, their contents should always be non-missing (e.g., not `null` in JSON, no `missing-value-placeholder` in HDF5). +Each name is allowed to be any string, including an empty string. + +It is technically permitted to provide duplicate names in `names`, consistent with how R itself supports duplicate names in its lists and vectors. +However, this is not recommended as other frameworks may wish to use representations that assume unique names, e.g., using Python dictionaries to represent named lists. +By providing unique names, users can improve interoperability with native data structures in other frameworks. + +# Comparison to version 1 + +**uzuki2** involves some major changes from the original [**uzuki**](https://github.com/LTLA/uzuki) library. +Most obviously, we added support for HDF5 alongside the JSON format. 
+HDF5 supports random access without loading the entire list contents into memory, +which provides some optimization opportunities for parsers when large vectors are present. + +Arrays and data frames are no longer supported in **uzuki2**. +Such objects should instead be represented by external references, +under the assumption that any serialization framework using **uzuki2** would already have a separate mechanism for representing arrays and data frames. +For example, the [**alabaster**](https://github.com/ArtifactDB/alabaster.base) framework has its own staging methods for these objects. + +In the JSON format, **uzuki2** is also more explicit with its serialization of lists. +These now have their own dedicated `"type": "list"`, rather than relying on the implicit interpretation of arrays as unnamed lists and JSON objects as named lists. +In particular, treating JSON objects as named lists led to ambiguities when a list element was named `"type"`; it also failed to preserve the ordering of list elements.
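To illustrate the JSON layout described in the JSON specification above, here is a minimal sketch of a named list holding an integer vector, a date-formatted string vector, a factor and a "nothing" value, together with a toy checker for a few of the stated rules. This is not the **uzuki2** validator: the function names `check` and `is_int32` are hypothetical, only a subset of the types and constraints is covered, and the checker is stricter than the specification in that it only accepts Python `int` values for integers.

```python
import json

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def is_int32(x) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(x, int) and not isinstance(x, bool) and INT32_MIN <= x <= INT32_MAX


def check(obj: dict) -> None:
    """Check a subset of the JSON rules; raise ValueError on the first violation."""
    kind = obj.get("type")
    if kind == "nothing":
        return
    values = obj.get("values")
    if kind in ("integer", "string") and not isinstance(values, list):
        values = [values]  # atomic vectors may store a scalar instead of an array
    if not isinstance(values, list):
        raise ValueError(f"'values' should be an array for type {kind!r}")

    if kind == "list":
        for child in values:
            check(child)
    elif kind == "integer":
        if not all(v is None or is_int32(v) for v in values):
            raise ValueError("integers should fit in a 32-bit signed integer or be null")
    elif kind == "string":
        if not all(v is None or isinstance(v, str) for v in values):
            raise ValueError("strings should be JSON strings or null")
    elif kind == "factor":
        levels = obj.get("levels")
        if not isinstance(levels, list) or not all(isinstance(x, str) for x in levels):
            raise ValueError("'levels' should be an array of strings")
        if len(set(levels)) != len(levels):
            raise ValueError("'levels' should be unique")
        if not all(v is None or (is_int32(v) and 0 <= v < len(levels)) for v in values):
            raise ValueError("factor indices should lie in [0, number of levels) or be null")
    else:
        raise ValueError(f"type {kind!r} is not handled by this sketch")

    names = obj.get("names")
    if names is not None:
        # Names must match the length of 'values' and may not be missing.
        if len(names) != len(values) or not all(isinstance(n, str) for n in names):
            raise ValueError("'names' should be non-missing strings, one per element")


doc = {
    "type": "list",
    "version": "1.2",
    "values": [
        {"type": "integer", "values": [1, None, 3], "names": ["a", "b", "c"]},
        {"type": "string", "format": "date", "values": ["2023-11-18", None]},
        {"type": "factor", "values": [0, 1, None, 0], "levels": ["low", "high"], "ordered": True},
        {"type": "nothing"},
    ],
    "names": ["counts", "dates", "grade", "empty"],
}

# Round-trip through a JSON string to mimic reading a file from disk.
check(json.loads(json.dumps(doc)))
```

Because every object carries its own `type`, the checker can dispatch on that property alone, which reflects the explicit `"type": "list"` design discussed in the comparison to version 1.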