Skip to content
This repository was archived by the owner on Mar 15, 2021. It is now read-only.

Comments

RFC0006: RSF version 1#12

Open
arnau wants to merge 15 commits intomasterfrom
rsf-spec
Open

RFC0006: RSF version 1#12
arnau wants to merge 15 commits intomasterfrom
rsf-spec

Conversation

@arnau
Copy link
Contributor

@arnau arnau commented Apr 5, 2018

Context

RSF has been documented in multiple places but never in a consistent manner. This is an attempt to describe the current state of things and serve as the starting point to create the first RSF specification.

Note that RSF has no versioning system. This is a good opportunity to discuss if we should add one to allow non backwards-compatible changes without breaking clients.

Guidance to review

Review the current implementation of RSF is accurately described in this RFC and that the explanation is understandable and unequivocal.

@arnau
Copy link
Contributor Author

arnau commented Apr 5, 2018

/cc @gidsg

is that diverges from the rest of the API where serialisation is expressed
either via suffix or via media type. The problem with using the same approach,
say `GET /register.rsf` is that we are not providing the same information when
querying `GET /register.json`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that the endpoint GET /download-rsf doesn't provide the same information as GET /register.rsf, or that the user has to provide different information?

The current implementation uses `GET /download-rsf`. The main issue with that
is that diverges from the rest of the API where serialisation is expressed
either via suffix or via media type. The problem with using the same approach,
say `GET /register.rsf` is that we are not providing the same information when
querying `GET /register.json`.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, /register.json right now offers some sort of summary-metadata of the register. RSF by nature is the full register. So both things are at odds.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about that endpoint 😬. It goes without saying that I think it should have a different name, e.g. /summary.json.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least it's clearer in intention 👍

say `GET /register.rsf` is that we are not providing the same information when
querying `GET /register.json`.

What is a good name for a resource that represents the whole raw database?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "register" is a good name for a register ;)

GET /register.rsf

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is another endpoint called /download-register that downloads a ZIP file containing "the whole database". We should also consider whether we keep that an/or rename it in line with these thoughts.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note there are other RSF endpoints:

  • /download-rsf/n gets RSF for the register after entry-number n, i.e. it returns the whole register from entry number n+1 to the end of the register.
  • /download-rsf/n/m gets RSF for entries n+1 to m'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the description in a new commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking that the /download-register endpoint should be renamed to /archive. With that change, I think having /archive.rsf or better /archive -H 'Accept: application/vnd.rsf would help consolidate resources that are conceptually the same.

To accommodate the filtering that happens in /download-rsf/n we could do something on the lines of /archive.rsf?from=n&to=m.

Thoughts?

/cc @MatMoore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, /archive (zip file) would be normative, rsf a non-normative extension.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a good idea - I like the name archive better than register for this purpose, and it makes sense for the zip & rsf to be different representations of the same thing.


### Commands

#### <a id="assert-root-hash-command">`assert-root-hash` command</a>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are there two assert-root-hash commands per register? How do you use them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert-root-hash can appear as many times as you want in a RSF file. The idea is that every time you find one, you can use it to verify you have got all the previous entries right, it's a checkpoint really.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that answers my later question about whether an RSF file starts at the beginning of a register. I should have checked with echo -n | sha256sum. In that case should this spec require that the the first line of an RSF file be assert-root-hash?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested that in another comment, the potential problem is that currently it is not a requirement so some RSF might be broken if we impose this restriction. Let me assess how big of a deal it is really.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the scope of this review is "Review the current implementation of RSF is accurately described in this RFC", I'll back off and open this in an issue instead.

#### <a id="add-item-command">`add-item` command</a>

Adds a new [Item resource][item-res] to the register. It will require an
[`append-entry` command](#append-entry-command) to make it visible to users.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of "make it visible to users", I would put "associate the item with a record key".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm, I like what you are suggesting but I don't like the idea of using the word "record". @michaelabenyohai ideas?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "There must be a corresponding append-entry command that refers to the item's hash"? Noting the later comment on "It is illegal to have orphan items", which insists on the append-entry for the file to be valid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could work 👍

2. The `key` of the entry. The primary key field is the field with the same
name as the register.
3. The `timestamp` of the entry. This is the time at which the entry was
appended to the register.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is the first time we have defined what the timestamp means. Are we sure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's the first time we have something that clear yes. Based on usage I'd say yes, the timestamp is the consequence of minting an item so it's the recording time for the entry. It mimics git's behaviour.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes me nervous because that's only the way we've used it. I haven't thought about the consequences of timestamps being out of sequence. @michaelabenyohai am I being paranoid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would be the problem of having timestamps out of sequence? The order of the log is dictated by the entry number.

Copy link
Contributor Author

@arnau arnau Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could mess with tooling using the timestamp to infer something related to time outside "time of recording" but again, nothing you wouldn't see in git or similar.

I think it's up to the tooling to be zealous about timestamps to the extent a tool can be.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I'm satisfied 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, RFC0003 covers this topic (#16 )

* [Commands](#commands) are executed in order of appearance, top to bottom.
* Entries are numbered in sequence in order of appearance starting with 1 if
the register is empty, otherwise incrementing on the latest entry number
found in the register.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to tell whether an RSF file represents a complete register, or only appends to a register?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only way as far as I know is via assert-root-hash. In a situation where the RSF has no assertions you wouldn't know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possibility that would require evolving the implementation would be requiring an assert-root-hash as the first command of any RSF file.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case I think it would be useful to require an assert-root-hash in the first line of every file, for the sake of reassembling a register from its patches.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we specify that this currently does not include system entries (though it should do).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By this I mean that entry numbers are computed independently for user and system currently (as they are separate logs on the inside).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it needs to explain the current implementation exactly.

the [`add-item` command](#add-item-command) that introduces the item is
referencing.
* It is illegal to have orphan items. An `add-item` must have at least one
`append-entry` referencing to the item.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the the validity of an add-item depends on a subsequent append-entry, then a file that gets accidentally truncated in transit will be invalid. This seems unnecessarily strict, but I haven't thought hard about the disadvantages of orphan items. Is it to do with treating patches as transactions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the original discussion about tradeoffs of allowing orphan items or not made the team decide to be transactional per file so if something wrong happens in the middle of consuming a delta, it must be rolled back so you can retry again with the same delta without complex gymnastics. There were other considerations around how to handle orphans in the API and the garbage collection of them if allowed.

I think it's a reasonable restriction given the consequences of not having it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it matter whether the add-item command comes before or after the append-entry one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If my memory serves me well, it has to do with exposing the entry to the API users and not having the item available yet (the record couldn't be computed either). This is me reading between lines and guessing a bit here. In all fairness if we think it's ok to expose an entry as soon as it gets parsed, it means we are not behaving transactionally (a rollback could cause a disruption in someone that already consumed an item). I'll try to pull out the rationale for this restriction but, if it's not satisfactory I might reconsider having the restriction in place.

#### Type checking

Although not part of the RSF specification, it is worth mentioning that a
Registers' implementation is expected to type check the data according to the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the apostrophe is needed on "Registers'"

to the state rolled back.


### Examples

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these examples be valid patches?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They should yes, if I haven't mess it up, they are 😱

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, I don't think you've messed them up. But I think the spec should require an assert-root-hash in the first and final lines.

Arnau Siches added 9 commits April 26, 2018 09:14
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>

append-entry = %s"append-entry" HTAB type HTAB key HTAB timestamp HTAB hash-list
type = "user" / "system"
key = alphanum / %x2D / %x5F

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow what alphanum / %x2D / %x5F means. Can you explain?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alphanum is defined a few lines below as alphanum = ALPHA / DIGIT, so the key definition is: "any ASCII character (A-Z, a-z) or a hypen (%x2D, -) or an underscore (%x5F, _)".

alpha and digit are defined in https://tools.ietf.org/html/rfc5234#appendix-B.1
%x2D and %x5F are the hexadecimal codes for hypen and underscore respectively and / defines an OR.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I missed the definition of alphanum so that makes sense now.

Does this imply there is only one character in a key? i.e it's either a single ASCII character, hyphen or underscore. It doesn't appear to say that a key can be made up of multiple of these things but I might be reading it wrong still.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HA, very good point. I have to add multiple chars :)

I'm wondering if the first char has special treatment in the current implementation? (e.g. can a key start with a hypen: -foo? If not, this needs further restriction

log = command *(CRLF command) [CRLF]
command = add-item / append-entry / assert-root-hash

assert-root-hash = %s"assert-root-hash" HTAB hash

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does %s mean case sensitive? If so, why do we only use it for "add-item" etc and not for things like "user"? Not that I think we've ever specified whether these things are case sensitive or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Yes %s means case sensitive and all tokens that should be strictly in lower case should be prepended with that. I'll amend them 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a new commit.


Asserts that the provided root hash is the same as the one computed from the
current entry log as defined in the [Digital Proofs][digital-proofs]
specification.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we specify that although the included in RSF, an assert-root-hash command currently ignores all system entries? It only includes user entries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, absolutely. I will amend this too 👍

##### Arguments

1. The `type` of the entry determines if the entry belongs to the data log
(`user`) or to the metadata log (`system`).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implies that the data log and the metadata log are separate things. Is this intentional? I know they are kind of separate currently (e.g. system entries are ignored in root-hashes) but they do all appear in the same "log" in the RSF. I guess this kind of does correctly explain how things are now (even if we want to change them).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this RFC should document how things are right now. When we change it, we will have another RFC that explains the change and can refer to the original RFC as its starting point.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logs are intertwined because A data item must conform to the current schema derived from the previous system entries, so even though system entries aren't in the root hashes, the RSF is still invalid if they are reordered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, there is a level of checking that guarantees consistency cross log 👍

* It is illegal to have broken references. An `append-entry` must reference an
item previously introduced by an `add-item` command.
* The item in the `add-item` command must always be in the canonical form.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mention anything about identical consecutive append-entry commands currently being illegal?

i.e. this is illegal

append-entry	user	GB	2010-11-12T13:14:15Z	sha-256:08bef0039a4f0fb52f3a5ce4b97d7927bf159bc254b8881c45d95945617237f6
append-entry	user	GB	2010-11-12T13:14:50Z	sha-256:08bef0039a4f0fb52f3a5ce4b97d7927bf159bc254b8881c45d95945617237f6

This is ok

append-entry	user	GB	2010-11-12T13:14:15Z	sha-256:08bef0039a4f0fb52f3a5ce4b97d7927bf159bc254b8881c45d95945617237f6
append-entry	user	GB	2010-11-12T13:14:30Z	sha-256:490636974f8087e4518d222eba08851dd3e2b85095f2b1427ff6ecd3fa482435
append-entry	user	GB	2010-11-12T13:14:50Z	sha-256:08bef0039a4f0fb52f3a5ce4b97d7927bf159bc254b8881c45d95945617237f6

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the recent changes, is this still the situation? I'll need clarification on this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently how it is implemented, yes. ORJ won't let you load RSF that looks like the first example. However, government-service register will currently produce illegal RSF, because we accidentally allowed these duplicate entries into the register before coding the rule.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rule is good hygiene but should it be enforced in the spec?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal of the spec is to allow different implementations to be built independently and be compatible with each other. If we don't enforce the same rules, a valid RSF in one tool could be invalid in another one. To me, this is a spec failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can discuss if it should be a rule that exits parsing (and thus rolls back the entire patch), it should be ignored entirely (skipped) or raise a warning but keep parsing.

Given that the current implementation fails, the initial spec should reflect that exact behaviour.

* It is illegal to have orphan items. An `add-item` must have at least one
`append-entry` referencing to the item.
* It is illegal to have broken references. An `append-entry` must reference an
item previously introduced by an `add-item` command.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current implementation, if a corresponding add-item does not appear in the RSF file then it checks to see if the item already exists in the register. If it does, then the RSF file is deemed valid, otherwise it's rejected. This means that an RSF file that has an "orphan entry" could actually be valid if it is a patch on an existing register.

The inverse is not true for add-item. If an add-item is in an RSF file, the corresponding append-entry must appear in the same RSF file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! I misunderstood this behaviour. I'll amend 👍

Arnau Siches added 6 commits May 3, 2018 16:38
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Explains that the current behaviour uses only data entries to generate
the root hash which leaves metadata entries unasserted.

Also makes explicit that the numbering for data and metadata logs are
independent.

Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Ensures keys have one or more characters where the first is
alphanumeric.

Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Copy link

@MatMoore MatMoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, and I think it's good to specify the grammar and semantics of RSF, but I'd rather the registers spec doesn't require RSF as way to mint or download data. Can we leave out the REST API section entirely, and address it as an ADR in openregisters-java?

It feels like there are already a lot of different ways of downloading data from a register, and I'm wary about adding any more endpoints that do similar things to existing ones, because it makes the API harder to use. At the moment RSF is an internal thing and API users shouldn't need to understand it.

I like @michaelabenyohai's idea of combining with the download-register endpoint as the RSF could just be requested through content negotiation.


append-entry = %s"append-entry" HTAB type HTAB key HTAB timestamp HTAB hash-list
type = %s"user" / %s"system"
key = alphanum *(alphanum / %x2D / %x5F)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this depends on #22
Also: what about forward slashes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does indeed. I need to amend that once #22 is accepted.

append-entry = %s"append-entry" HTAB type HTAB key HTAB timestamp HTAB hash-list
type = %s"user" / %s"system"
key = alphanum *(alphanum / %x2D / %x5F)
hash-list = hash *(list-separator hash)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the RSF I've looked at, this has just been a single item hash. When would this be a list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The situation is theoretical and it is when you have an index. It is experimental and in review to assess if benefits outweigh (perceived) complexity.


### Media type

The current media type is `application/uk-gov-rsf`. It should change to
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is specific to ORJ, can we say that the media type of RSF is application/vnd.rsf but application/uk-gov-rsf may also be used for legacy reasons? Then we can fix that at any time without raising another RFC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable 👍

@arnau arnau changed the title RSF version 1 RFC0006: RSF version 1 Aug 9, 2018
@arnau
Copy link
Contributor Author

arnau commented Aug 10, 2018

@MatMoore although I may agree RSF is not normative, I'd like to have it in the spec as an example of how the reference implementation has approached register serialisation.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants