disk cache: store a data integrity header for non-CAS blobs #186
Conversation
@buchgr: any thoughts on this? In particular:
Is there a good reason to not use protobuf for the header? Did you consider reusing the Digest message from the REAPI?
I think it's a great idea. Thanks!
    if err != nil {
        log.Fatal(err)
    }
    ...
    casDir := filepath.Join(dir, cache.CAS.String(), subDir)
Should we also version the CAS dir? I think it's possible that we'll need to make changes to it eventually as well.
I skipped versioning the CAS dir when I first implemented this, because CAS blobs are already stored in a natural representation on disk. But then I added support for different hash types, and forgot to revisit this choice.
Rethinking it now, CAS blobs should have a header to define the hash type, at the small cost of making it harder to verify blobs with the sha256sum command line utility. So the CAS dir should be versioned too. Maybe we should use the same header as for AC/RAW blobs, to simplify our code.
A simple alternative could be to encode the hash function in the path string, e.g. cas/sha256. Would that be too complex?
Interesting idea: we could then just use a Digest proto as the header, and the blob offset would be determined by the hash type in the path:

    ac.v2/sha256/<key>
    ac.v2/md5/<key>
    cas.v2/sha256/<key>
It turns out that using a Digest like this is not so simple: because SizeBytes is varint-encoded, the message doesn't have a fixed size on disk even when the Hash field has a constant length. So this would still require a prefix to store the length of the proto.
It would work well with a hand-rolled header, though.
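(A quick aside: this varint behaviour is easy to demonstrate with the protowire package from google.golang.org/protobuf. The sketch below is my own illustration, not code from this change.)

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/encoding/protowire"
)

func main() {
	// Proto int64 fields such as Digest.SizeBytes are varint-encoded,
	// so the space they occupy on the wire depends on their value.
	fmt.Println(protowire.SizeVarint(1))       // 1 byte
	fmt.Println(protowire.SizeVarint(1 << 40)) // 6 bytes
}
```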
    addChecksum: addChecksum,
    }
    ...
    if scanDir.version < 2 {
Why is it necessary to mix the migration detection with the loading logic? Shouldn't detecting whether a migration is necessary be as simple as checking whether the version V-1 folder exists and is not empty?

    if hasItems("ac/") {
        migrate();
        rmdir("ac/")
    }
    load_items()
Good idea.
I looked through this code again, and remembered why I mixed the logic:
- So we end up with an accurate order for the LRU index (migrating old-style files first would always place them at the start of the index, even if they're older than new-style files).
- Building the LRU index sequentially is slower than doing it in parallel, and I would prefer cache upgrades to be fast (less chance that the admin thinks something went wrong and hits ctrl-c in the middle of an upgrade).
I didn't want to use a single protobuf for the entire file (header + blob) because that requires reading the whole file into memory in order to unmarshal it, which might be undesirable for large files. Using a protobuf just for the header, with the blob concatenated after it, seems a little complex (but maybe I just know too little about protobuf). We would need to read an unknown (but bounded) number of bytes of the file into a buffer in order to unmarshal the header, and I'm unsure whether it is safe to unmarshal protobuf messages with arbitrary bytes appended.
The Digest message alone is insufficient if we want to allow other hash functions in the future. But this approach would make sense if we figure out a good way to use protobuf for the header.
There is the delimited protobuf format that's recommended in the official docs [1] [2]. It's implemented in the Java protobuf library [3]. It simply prepends the serialized proto with the number of bytes to read. Bazel uses it for the BEP.
[1] https://developers.google.com/protocol-buffers/docs/techniques#streaming
Interesting, I'll try this out. Thanks!
As an alternative to storing data as delimited protobufs there is also riegeli [1], which allows storing metadata in the same file and doesn't require reading the whole file into memory. The kythe project seems to have a Go implementation available [2]. The nice thing about riegeli is that it can also automatically compress the data and save some disk space (although I'm not sure whether the overhead is worth it for non-CAS blobs, even with very fast compression modes like brotli or snappy).
[1] https://github.com/google/riegeli
Thanks @Yannic. The overhead of the header is negligible and there isn't a need for compression. I'd prefer to not pull in another dependency for the header.
I would also prefer not to pull in another dependency for what I hope can be a simple header. But before ruling it out I wonder if riegeli could also give us the ability to compress the data blobs, re #12 / bazelbuild/bazel#4575?
Writing the header proto size, then the message, then the content sounds pretty reasonable to me. For inspiration, similar code is in: https://github.com/kythe/kythe/blob/master/kythe/go/platform/delimited/delimited.go
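For reference, a minimal sketch of that framing in Go. This assumes the google.golang.org/protobuf module and is only an illustration, not code from this PR:

```go
package delimited

import (
	"bufio"
	"encoding/binary"
	"io"

	"google.golang.org/protobuf/proto"
)

// Write marshals msg and writes it to w, prefixed with its size as a uvarint.
func Write(w io.Writer, msg proto.Message) error {
	data, err := proto.Marshal(msg)
	if err != nil {
		return err
	}
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(data)))
	if _, err := w.Write(lenBuf[:n]); err != nil {
		return err
	}
	_, err = w.Write(data)
	return err
}

// Read reads a uvarint size prefix, then exactly that many bytes, and
// unmarshals them into msg. The blob can then follow in the same stream.
func Read(r *bufio.Reader, msg proto.Message) error {
	size, err := binary.ReadUvarint(r)
	if err != nil {
		return err
	}
	data := make([]byte, size)
	if _, err := io.ReadFull(r, data); err != nil {
		return err
	}
	return proto.Unmarshal(data, msg)
}
```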
Issue #12 talks about making compression part of the caching protocol. It was always my understanding that compression is most interesting for CAS blobs, and we do not store protobufs in the CAS. For that, zlib sounds more interesting for compression, and I believe it is shipped as part of the Go standard library. Does any of you feel differently?
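(That's compress/zlib in the standard library. A minimal round-trip sketch, purely for illustration:)

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
	"log"
)

func main() {
	// Compress a blob into an in-memory buffer.
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	if _, err := zw.Write([]byte("some CAS blob contents")); err != nil {
		log.Fatal(err)
	}
	if err := zw.Close(); err != nil {
		log.Fatal(err)
	}

	// And decompress it again.
	zr, err := zlib.NewReader(&buf)
	if err != nil {
		log.Fatal(err)
	}
	defer zr.Close()
	out, err := io.ReadAll(zr)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s\n", out)
}
```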
The header is made up of three fields:
1) Little-endian int32 (4 bytes) representing the REAPIv2 DigestFunction.
2) Little-endian int64 (8 bytes) representing the number of bytes in the blob.
3) The hash bytes from the digest, with the length determined by the particular DigestFunction (32 for SHA256, 20 for SHA1, 16 for MD5). Note that we currently only support SHA256, however.

This header is simple to parse, and does not require buffering the entire blob in memory if you just want the data.

To distinguish blobs with and without this header, we use new directories for the affected blobs: ac.v2/ instead of ac/, and similarly for raw/.

We do not use this header to actually verify data yet, and we still os.File.Sync() after file writes (buchgr#67).

This also includes a slightly refactored version of PR buchgr#123 (load the items from disk concurrently) by @bdittmer.
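To make the layout concrete, here is a sketch of reading such a header in Go. This is my own illustration of the format described above, not the PR's actual code; the enum values are the REAPIv2 DigestFunction values (SHA256 = 1, SHA1 = 2, MD5 = 3):

```go
package casblob

import (
	"encoding/binary"
	"fmt"
	"io"
)

// Hash lengths in bytes, keyed by REAPIv2 DigestFunction enum value.
// Only SHA256 is actually supported so far.
var hashSizes = map[int32]int{
	1: 32, // SHA256
	2: 20, // SHA1
	3: 16, // MD5
}

type header struct {
	digestFunction int32
	blobSize       int64
	hash           []byte
}

// readHeader consumes the fixed-layout header from r, leaving r positioned
// at the first byte of the blob data.
func readHeader(r io.Reader) (*header, error) {
	var h header
	if err := binary.Read(r, binary.LittleEndian, &h.digestFunction); err != nil {
		return nil, err
	}
	if err := binary.Read(r, binary.LittleEndian, &h.blobSize); err != nil {
		return nil, err
	}
	size, ok := hashSizes[h.digestFunction]
	if !ok {
		return nil, fmt.Errorf("unsupported digest function: %d", h.digestFunction)
	}
	h.hash = make([]byte, size)
	if _, err := io.ReadFull(r, h.hash); err != nil {
		return nil, err
	}
	return &h, nil
}
```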
Which cache operations should do the data verification? Put? Get? Contains? Have you benchmarked how performance is affected? Especially for HTTP/1.1-without-SSL use cases, where golang's io.Copy can use the Linux kernel's sendfile(2) to transfer data directly from file to socket, instead of copying data to and from user space. I think that is what today makes bazel-remote on par with heavily optimized HTTP servers like NGINX.
I haven't performed any benchmarks yet, and I haven't figured out when/where to perform the validation. But I suspect that we will validate data on the first Get or Contains call, and cache the result until the file is rewritten (or the server restarts).
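One possible shape for that validation, reusing the hypothetical readHeader sketch from above (again just an illustration, not code from this PR):

```go
package casblob

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// verifyBlob re-hashes the blob portion of a file and compares the result
// against the hash stored in the header. SHA256-only, matching what the
// cache currently supports. The result could be cached until the file is
// rewritten or the server restarts, as suggested above.
func verifyBlob(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	h, err := readHeader(f)
	if err != nil {
		return err
	}

	hasher := sha256.New()
	if _, err := io.Copy(hasher, f); err != nil {
		return err
	}
	if !bytes.Equal(hasher.Sum(nil), h.hash) {
		return fmt.Errorf("checksum mismatch for %s", path)
	}
	return nil
}
```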