This repository was archived by the owner on Mar 15, 2021. It is now read-only.
277 changes: 277 additions & 0 deletions content/rsf-spec/index.md
@@ -0,0 +1,277 @@
---
rfc:
start_date: 2018-04-04
pr: openregisters/registers-rfc#12
status: draft
---

# The Register Serialisation Format

## Summary

This RFC collects the current implementation of RSF in one place so that it
can be added to the specification and evolved through new RFCs when
required.

The Register Serialisation Format, from now on RSF, is an event log describing
the evolution of the Register data and metadata.

## Motivation

TODO

## Explanation

### RSF Grammar

RSF is a line-based textual format in which each line holds positional fields
separated by tabs. Each line defines a command to apply to a Register state to
obtain the next state.

This specification uses the Augmented Backus-Naur Form (ABNF) as defined by
[RFC5234](https://tools.ietf.org/html/rfc5234) and refined by
[RFC7405](https://tools.ietf.org/html/rfc7405). It assumes the following
definitions:

* RFC5234: `ALPHA` (letters), `CRLF` (carriage return, line feed), `DIGIT`
(decimal digits), `HEXDIG` (hexadecimal digits) and `HTAB` (horizontal tab).
* Registers specification: [`CANONREP`][canon-rep] (canonical representation).
Note that, in turn, it depends on [RFC8259](https://tools.ietf.org/html/rfc8259).

```abnf
log = command *(CRLF command) [CRLF]
command = add-item / append-entry / assert-root-hash

assert-root-hash = %s"assert-root-hash" HTAB hash

Does %s mean case sensitive? If so, why do we only use it for "add-item" etc and not for things like "user"? Not that I think we've ever specified whether these things are case sensitive or not.

Contributor Author

Good point. Yes %s means case sensitive and all tokens that should be strictly in lower case should be prepended with that. I'll amend them 👍

Contributor Author

Fixed in a new commit.


add-item = %s"add-item" HTAB CANONREP

append-entry = %s"append-entry" HTAB type HTAB key HTAB timestamp HTAB hash-list
type = %s"user" / %s"system"
key = alphanum *(alphanum / %x2D / %x5F)

I think this depends on #22
Also: what about forward slashes?

Contributor Author

It does indeed. I need to amend that once #22 is accepted.

hash-list = hash *(list-separator hash)

In the RSF I've looked at, this has just been a single item hash. When would this be a list?

Contributor Author

The situation is theoretical and it is when you have an index. It is experimental and in review to assess if benefits outweigh (perceived) complexity.

hash = %s"sha-256:" 64(HEXDIG) ; sha-256
list-separator = ";" ; hash list separator

alphanum = ALPHA / DIGIT

; timestamp
timestamp = date %s"T" time
date = century year DSEP month DSEP day ; date YYYY-MM-DD
time = hour TSEP minute TSEP second TZ ; time HH:MM:SSZ

; date
century = 2DIGIT ; 00-99
year = 2DIGIT ; 00-99
month = 2DIGIT ; 01-12
day = 2DIGIT ; 01-28, 01-29, 01-30, 01-31 based on month/year
DSEP = "-" ; date separator

; time
hour = 2DIGIT ; 00-24
minute = 2DIGIT ; 00-59
second = 2DIGIT ; 00-58, 00-59, 00-60 based on leap-second rules
TSEP = ":" ; time separator
TZ = %s"Z" ; timezone
```
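
The grammar can be exercised with a small parser. The following is a minimal Python sketch, not the reference implementation: the names `parse_command`, `APPEND_ENTRY` and `ASSERT_ROOT_HASH` are hypothetical, the `CANONREP` argument is passed through without validation, and calendar ranges (for example month 01-12) are not checked. It follows the `key` rule literally, so a key containing `:` is rejected.

```python
import re

# Regular expressions mirroring the ABNF rules above.
HASH = r"sha-256:[0-9A-Fa-f]{64}"
KEY = r"[A-Za-z0-9][A-Za-z0-9_-]*"
TIMESTAMP = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"

APPEND_ENTRY = re.compile(
    rf"append-entry\t(user|system)\t({KEY})\t({TIMESTAMP})\t({HASH}(?:;{HASH})*)"
)
ASSERT_ROOT_HASH = re.compile(rf"assert-root-hash\t({HASH})")


def parse_command(line: str) -> tuple:
    """Parse one RSF line (without its CRLF) into a (command, arguments...) tuple."""
    if line.startswith("add-item\t"):
        # CANONREP is not validated here; that would need a JSON canonicaliser.
        return ("add-item", line[len("add-item\t"):])
    match = APPEND_ENTRY.fullmatch(line)
    if match:
        entry_type, key, timestamp, hashes = match.groups()
        return ("append-entry", entry_type, key, timestamp, hashes.split(";"))
    match = ASSERT_ROOT_HASH.fullmatch(line)
    if match:
        return ("assert-root-hash", match.group(1))
    raise ValueError(f"not a valid RSF command: {line!r}")
```

A whole log could then be parsed with `[parse_command(line) for line in text.splitlines()]`.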

### Media type

The current media type is `application/uk-gov-rsf`. It should change to

As this is specific to ORJ, can we say that the media type of RSF is application/vnd.rsf but application/uk-gov-rsf may also be used for legacy reasons? Then we can fix that at any time without raising another RFC.

Contributor Author

Sounds reasonable 👍

`application/vnd.rsf` to align with [RFC6838](https://tools.ietf.org/html/rfc6838).

### REST API

TODO

The current implementation uses `GET /download-rsf`. The main issue with that
is that it diverges from the rest of the API, where serialisation is expressed
either via suffix or via media type. The problem with using the same approach,
say `GET /register.rsf`, is that it would not provide the same information as
`GET /register.json`.

Does this mean that the endpoint GET /download-rsf doesn't provide the same information as GET /register.rsf, or that the user has to provide different information?

Contributor Author

Well, /register.json right now offers some sort of summary-metadata of the register. RSF by nature is the full register. So both things are at odds.

I didn't know about that endpoint 😬. It goes without saying that I think it should have a different name, e.g. /summary.json.

Contributor Author

At least it's clearer in intention 👍


What is a good name for a resource that represents the whole raw database?

I think "register" is a good name for a register ;)

GET /register.rsf

There is another endpoint called /download-register that downloads a ZIP file containing "the whole database". We should also consider whether we keep that and/or rename it in line with these thoughts.

Also note there are other RSF endpoints:

  • /download-rsf/n gets RSF for the register after entry-number n, i.e. it returns the whole register from entry number n+1 to the end of the register.
  • /download-rsf/n/m gets RSF for entries n+1 to m.

Contributor Author

Added the description in a new commit.

Contributor Author

I'm thinking that the /download-register endpoint should be renamed to /archive. With that change, I think having /archive.rsf or better /archive -H 'Accept: application/vnd.rsf' would help consolidate resources that are conceptually the same.

To accommodate the filtering that happens in /download-rsf/n we could do something on the lines of /archive.rsf?from=n&to=m.

Thoughts?

/cc @MatMoore

Contributor Author

To be clear, /archive (zip file) would be normative, rsf a non-normative extension.

That sounds like a good idea - I like the name archive better than register for this purpose, and it makes sense for the zip & rsf to be different representations of the same thing.


```
# RSF

GET /{db resource}.rsf

GET /{db resource}
Accept: application/vnd.rsf


# JSON Lines — Hypothetical

GET /{db resource}.jsonl

GET /{db resource}.jsonl
Accept: application/x-ndjson
```

It is also possible to get a register patch in RSF:

* `GET /download-rsf/{n}`. Returns the RSF patch from entry number `n`
(exclusive) to the most recent entry number.
* `GET /download-rsf/{n}/{m}`. Returns the RSF patch from entry number `n`
(exclusive) to entry number `m` (inclusive).
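
The range semantics of the two patch endpoints can be illustrated with a short Python sketch. The helper names `patch_path` and `entries_in_patch` are hypothetical, and the bounds check (`0 <= n <= m`) is an assumption rather than something the current implementation is known to enforce.

```python
from typing import Optional


def patch_path(n: int, m: Optional[int] = None) -> str:
    """Build the request path for the RSF patch after entry `n`, up to `m` if given."""
    # Assumption: entry numbers are non-negative and the range is not reversed.
    if n < 0 or (m is not None and m < n):
        raise ValueError("expected 0 <= n <= m")
    return f"/download-rsf/{n}" if m is None else f"/download-rsf/{n}/{m}"


def entries_in_patch(n: int, m: int) -> range:
    """Entry numbers covered by the patch: `n` is excluded, `m` is included."""
    return range(n + 1, m + 1)
```

For instance, `patch_path(10, 12)` gives `/download-rsf/10/12`, which covers entries 11 and 12.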

### Commands

#### <a id="assert-root-hash-command">`assert-root-hash` command</a>

Asserts that the provided root hash is the same as the one computed from the
current user entry log as defined in the [Digital Proofs][digital-proofs]
specification.

Note that the system entries are not part of the root hash computation and are
not asserted in any way.

##### Arguments

1. The `hash` of the root of the tree.

For example, the empty root hash:

```
assert-root-hash sha-256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
```
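
The empty root hash above equals the SHA-256 digest of empty input. That is how RFC 6962 defines the hash of an empty Merkle tree; whether the Digital Proofs specification follows RFC 6962 here is an assumption. A quick Python check, with `EMPTY_ROOT_HASH` as an illustrative name:

```python
import hashlib

# SHA-256 of zero bytes, prefixed the way the `hash` rule in the grammar expects.
EMPTY_ROOT_HASH = "sha-256:" + hashlib.sha256(b"").hexdigest()
print(EMPTY_ROOT_HASH)
# sha-256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
```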

#### <a id="add-item-command">`add-item` command</a>

Adds a new [Item resource][item-res] to the register. There must be at least
one [`append-entry` command](#append-entry-command) referencing the item's hash
later on to make the RSF patch [valid](#validation-rules).

##### Arguments

1. The [canonical representation][canon-rep] of the item.

For example:

```
add-item {"country":"GB","name":"United Kingdom","official-name":"The United Kingdom of Great Britain and Northern Ireland"}
```

#### <a id="append-entry-command">`append-entry` command</a>

Appends a new [Entry resource][entry-res] to the register.

##### Arguments

1. The `type` of the entry determines if the entry belongs to the data log
(`user`) or to the metadata log (`system`).

This implies that the data log and the metadata log are separate things. Is this intentional? I know they are kind of separate currently (e.g. system entries are ignored in root-hashes) but they do all appear in the same "log" in the RSF. I guess this kind of does correctly explain how things are now (even if we want to change them).

Contributor Author

I think this RFC should document how things are right now. When we change it, we will have another RFC that explains the change and can refer to the original RFC as its starting point.


The logs are intertwined because a data item must conform to the current schema derived from the previous system entries, so even though system entries aren't in the root hashes, the RSF is still invalid if they are reordered.

Contributor Author

You are right, there is a level of checking that guarantees consistency cross log 👍

2. The `key` of the entry. The primary key field is the field with the same
name as the register.
3. The `timestamp` of the entry. This is the time at which the entry was
appended to the register.

I believe this is the first time we have defined what the timestamp means. Are we sure?

Contributor Author

I think it's the first time we have something that clear yes. Based on usage I'd say yes, the timestamp is the consequence of minting an item so it's the recording time for the entry. It mimics git's behaviour.

It makes me nervous because that's only the way we've used it. I haven't thought about the consequences of timestamps being out of sequence. @michaelabenyohai am I being paranoid?

Contributor Author

what would be the problem of having timestamps out of sequence? The order of the log is dictated by the entry number.

Contributor Author
@arnau, Apr 20, 2018

It could mess with tooling using the timestamp to infer something related to time outside "time of recording" but again, nothing you wouldn't see in git or similar.

I think it's up to the tooling to be zealous about timestamps to the extent a tool can be.

Okay I'm satisfied 🙂

Contributor Author

For reference, RFC0003 covers this topic (#16)

4. The `hash` of the item for which the entry was appended. This is the
[sha-256 hash of the item][canon-rep].

For example:

```
append-entry user GB 2010-11-12T13:14:15Z sha-256:08bef0039a4f0fb52f3a5ce4b97d7927bf159bc254b8881c45d95945617237f6
```
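
As a rough illustration of how the four arguments fit together, the Python sketch below derives the item hash and assembles an `append-entry` line. It assumes that the item hash is the SHA-256 digest of the UTF-8 bytes of the canonical representation; `item_hash` and `append_entry_line` are hypothetical names, and the caller is responsible for supplying an item that is already in canonical form.

```python
import hashlib


def item_hash(canonical_item: str) -> str:
    """Hash an item that is already in canonical form (assumed UTF-8 encoded)."""
    digest = hashlib.sha256(canonical_item.encode("utf-8")).hexdigest()
    return "sha-256:" + digest


def append_entry_line(entry_type: str, key: str, timestamp: str, canonical_item: str) -> str:
    """Assemble a tab-separated `append-entry` command for a single item."""
    return "\t".join(["append-entry", entry_type, key, timestamp, item_hash(canonical_item)])
```

A call such as `append_entry_line("user", "GB", "2010-11-12T13:14:15Z", item)` yields one line with five tab-separated fields.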


### <a id="validation-rules">Validation rules</a>

An RSF list of commands is expected to conform to the following rules:

* [Commands](#commands) are executed in order of appearance, top to bottom.
* User entries are numbered in sequence in order of appearance starting with 1
if the register is empty, otherwise incrementing on the latest entry number
found in the register.
* System entries are numbered in sequence in order of appearance starting with
1 if the register is empty, otherwise incrementing on the latest entry
number found in the register.
* An [`append-entry` command](#append-entry-command) must always appear after
the [`add-item` command](#add-item-command) that introduces the item it
references *unless* the item already exists in the register.
* It is illegal to have orphan items. An `add-item` must have at least one
`append-entry` referencing the item.
* It is illegal to have broken references. An `append-entry` must reference an
existing item or an item previously introduced by an `add-item` command.
* It is illegal to have two identical consecutive `append-entry` commands.
* The item in the `add-item` command must always be in the canonical form.
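
The reference rules above (ordering, orphan items, broken references and identical consecutive entries) can be sketched in Python as follows. This is an illustration under assumptions, not the implemented validator: `check_references` is a hypothetical name, commands are given as pre-parsed tuples, and each `add-item` carries the item's hash rather than the item itself.

```python
def check_references(commands, known_hashes=frozenset()):
    """Raise ValueError if a patch breaks the reference rules.

    `commands` holds tuples shaped like ("add-item", item_hash) or
    ("append-entry", type, key, timestamp, [item_hash, ...]).
    `known_hashes` are the items already present in the register.
    """
    available = set(known_hashes)  # items an entry may reference so far
    introduced = set()             # items added by this patch
    referenced = set()             # items referenced by this patch
    previous = None
    for command in commands:
        if command[0] == "add-item":
            available.add(command[1])
            introduced.add(command[1])
        elif command[0] == "append-entry":
            if command == previous:
                raise ValueError("identical consecutive append-entry commands")
            for item in command[4]:
                if item not in available:
                    raise ValueError(f"broken reference: {item}")
            referenced.update(command[4])
        previous = command
    orphans = introduced - referenced
    if orphans:
        raise ValueError(f"orphan items: {sorted(orphans)}")
```

Because `available` only grows as `add-item` commands are read, an entry that appears before the item it references is reported as a broken reference.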


#### Type checking

Although not part of the RSF specification, it is worth mentioning that a
Registers implementation is expected to type check the data according to the
computed schema.

##### Metadata

A metadata item must conform to the metadata schema.

[TODO: This needs definition].

##### Data

A data item must conform to the current schema derived from the previous
system entries. A type checker is expected to verify:

* It has the primary key defined.
* Fieldnames exist in the schema.
* Cardinality is consistent.
* Datatype is consistent.

Given the example “[All commands in use](#all-commands-example)”, a new data
item is valid if:

* It has the primary key, `country`, defined.
* It has at most one `name` field and one `official-name` field.
* The `country` field has cardinality 1.
* The `country` field is a String.
* The `name` field has cardinality 1.
* The `name` field is a String.
* The `official-name` field has cardinality 1.
* The `official-name` field is a String.

Each datatype must be parsed according to the [datatype specification][datatype-spec].
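
The checks listed for a data item can be sketched as a small Python function. Everything here is illustrative: `check_item` and `SCHEMA` are hypothetical names, only the `string` datatype is handled, and the sketch assumes that cardinality `"1"` means a single value while `"n"` means a JSON array.

```python
def check_item(item: dict, schema: dict, primary_key: str) -> list:
    """Return a list of type errors; an empty list means the item passed."""
    errors = []
    if primary_key not in item:
        errors.append(f"missing primary key: {primary_key}")
    for name, value in item.items():
        field = schema.get(name)
        if field is None:
            errors.append(f"unknown field: {name}")
            continue
        is_list = isinstance(value, list)
        # Assumption: cardinality "1" is a single value, "n" is a JSON array.
        if (field["cardinality"] == "n") != is_list:
            errors.append(f"wrong cardinality for field: {name}")
        values = value if is_list else [value]
        # Only the string datatype is checked in this sketch.
        if field["datatype"] == "string" and not all(isinstance(v, str) for v in values):
            errors.append(f"wrong datatype for field: {name}")
    return errors


# Schema as it would be derived from the system entries of the country example.
SCHEMA = {
    "country": {"cardinality": "1", "datatype": "string"},
    "name": {"cardinality": "1", "datatype": "string"},
    "official-name": {"cardinality": "1", "datatype": "string"},
}
```

With this schema, `check_item({"country": "GB", "name": "United Kingdom"}, SCHEMA, "country")` returns an empty list.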

An RSF patch (set of commands) must be treated as a single transaction. If
there is a validation error, the whole patch must be rejected and any changes
to the state rolled back.
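
One simple way to get this all-or-nothing behaviour is to apply the patch to a copy of the state and only keep the copy if every command succeeds. The Python sketch below assumes an in-memory state and a caller-supplied `apply_command` function that raises on a validation error; both, like the name `apply_patch`, are hypothetical.

```python
import copy


def apply_patch(state, commands, apply_command):
    """Return the new state, or raise and leave `state` untouched."""
    candidate = copy.deepcopy(state)  # work on a copy so failures cost nothing
    for command in commands:
        # apply_command mutates `candidate` and raises on a validation error.
        apply_command(candidate, command)
    return candidate
```

The caller swaps in the returned state only when no exception was raised, so a rejected patch never leaves partial changes behind. A real implementation backed by a database would more likely rely on that database's transactions.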


### Examples

Should these examples be valid patches?

Contributor Author

They should, yes. If I haven't messed it up, they are 😱

Haha, I don't think you've messed them up. But I think the spec should require an assert-root-hash in the first and final lines.


#### Simple RSF

```
add-item {"country":"GB","name":"United Kingdom","official-name":"The United Kingdom of Great Britain and Northern Ireland"}
append-entry user GB 2010-11-12T13:14:15Z sha-256:08bef0039a4f0fb52f3a5ce4b97d7927bf159bc254b8881c45d95945617237f6
```

#### Multiple items

```
add-item {"local-authority-eng":"LND","local-authority-type":"NMD","name":"London"}
add-item {"local-authority-eng":"LEI","local-authority-type":"NMD","name":"Leicester"}
add-item {"local-authority-eng":"CHE","local-authority-type":"NMD","name":"Cheshire"}
append-entry user NMD 2016-04-05T13:23:05Z sha-256:490636974f8087e4518d222eba08851dd3e2b85095f2b1427ff6ecd3fa482435;sha-256:8b748c574bf975990e47e69df040b47126d2a0a3895b31dce73988fba2ba27d8;sha-256:eb3ee00e6149cd734a7fa7e1f01a5fbf5fb50e1b38a065fd97d6ad3017750351
```

#### <a id="all-commands-example">All commands in use</a>

```
assert-root-hash sha-256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
add-item {"cardinality":"1","datatype":"string","field":"country","phase":"beta","register":"country","text":"The country's 2-letter ISO 3166-2 alpha2 code."}
add-item {"cardinality":"1","datatype":"string","field":"name","phase":"beta","text":"The commonly-used name of a record."}
add-item {"cardinality":"1","datatype":"string","field":"official-name","phase":"beta","text":"The official or technical name of a record."}
append-entry system field:country 2017-01-10T17:16:07Z sha-256:a303d05bdbeb029440344e0f1148f5524b4a2f9076d1b0f36a95ff7d5eeedb0e
append-entry system field:name 2017-01-10T17:16:07Z sha-256:a7a9f2237dadcb3980f6ff8220279a3450778e9c78b6f0f12febc974d49a4a9f
append-entry system field:official-name 2017-01-10T17:16:07Z sha-256:5c4728f439f6cbc6c7eea42992b858afc78c182962ba35d169f49db2c88e1e41
add-item {"country":"GB","name":"United Kingdom","official-name":"The United Kingdom of Great Britain and Northern Ireland"}
append-entry user GB 2010-11-12T13:14:15Z sha-256:08bef0039a4f0fb52f3a5ce4b97d7927bf159bc254b8881c45d95945617237f6
```


[item-res]: https://openregister.github.io/specification/#item-resource
[entry-res]: https://openregister.github.io/specification/#entry-resource
[canon-rep]: https://openregister.github.io/specification/#sha-256-item-hash
[digital-proofs]: http://openregister.github.io/specification/#digital-proofs
[datatype-spec]: http://openregister.github.io/specification/#datatypes