Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

disk cache: store a data integrity header for non-CAS blobs #186

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mostynb
Copy link
Collaborator

@mostynb mostynb commented Feb 10, 2020

The header is made up of three fields:

  1. Little-endian int32 (4 bytes) representing the REAPIv2
    DigestFunction.
  2. Little-endian int64 (8 bytes) representing the number
    of bytes in the blob.
  3. The hash bytes from the digest, length determined by
    the particular DigestFunction.
    (32 for SHA256. 20 for SHA1, 16 for MD5).

Note that we currently only support SHA256, however.

This header is simple to parse, and does not require buffering the
entire blob in memory if you just want the data.

To distinguish blobs with and without this header, we use new
directories for the affected blobs: ac.v2/ instead of ac/ and
similarly for raw/.

We do not use this header to actually verify data yet, and we
still os.File.Sync() after file writes (#67).

This also includes a slightly refactored version of PR #123
(load the items from disk concurrently) by @bdittmer.

@mostynb
Copy link
Collaborator Author

mostynb commented Feb 26, 2020

@buchgr: any thoughts on this? In particular:

  • The file format (simple header, manually parsed, room to implement other digests).
  • The new directory structure (version number in the directory names, to make later changes easier).

@mostynb mostynb force-pushed the ac_integrity branch 2 times, most recently from 594719e to 1b97393 Compare February 26, 2020 13:30
@buchgr
Copy link
Owner

buchgr commented Feb 26, 2020

The file format (simple header, manually parsed, room to implement other digests).

Is there a good reason to not use protobuf for the header? Did you consider reusing the Digest message from the REAPI?

The new directory structure (version number in the directory names, to make later changes easier).

I think it's a great idea. Thanks!

if err != nil {
log.Fatal(err)

casDir := filepath.Join(dir, cache.CAS.String(), subDir)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also version the CAS dir? I think it's possible that we'll need to make changes to it eventually as well.

Copy link
Collaborator Author

@mostynb mostynb Feb 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I skipped versioning the CAS dir when I first implemented this because they are already stored in a natural representation on disk. But then I added support for different hash types, and forgot to revisit this choice.

Rethinking it now, CAS blobs should have a header to define the hash type, at the small cost of making it more difficult to verify blobs with the sha256sum command line utility. So the CAS dir should be versioned also. Maybe we should use the same header as for AC/RAW blobs, to simplify our code.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A simple alternative could be to encode the hash function in the path string i.e. cas/sha256. Would that be too complex?

Copy link
Collaborator Author

@mostynb mostynb Feb 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea- we could then just use a Digest proto as the header, and the blob offset is determined by the hash type in the filename.

ac.v2/sha256/<key>
ac.v2/md5/<key>
cas.v2/sha256/<key>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out that using a Digest like this is not so simple, since due to the SizeBytes field it doesn't have a fixed size on disk even if the Hash field has a constant length. So this would still require a prefix to store the length of the proto.

It would work well with a hand-rolled header though.

addChecksum: addChecksum,
}

if scanDir.version < 2 {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it necessary to mix the migration detection with the loading logic? Shouldn't detecting whether a migration is necessary be as simple as checking whether a folder V-1 exists and is not empty?

if hasItems("ac/)) {
  migrate();
  rmdir("ac/")
}

load_items()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked through this code again, and remembered why I mixed the logic:

  1. So we end up with an accurate order for the LRU index (migrating old style files first would always place them at the start of the index even if they're older than new style files).
  2. Building the LRU index is slower when done sequentially than in parallel, and I would prefer cache upgrades to be fast (less chance that the admin might think something went wrong and ctrl-c in the middle of an upgrade).

@mostynb
Copy link
Collaborator Author

mostynb commented Feb 26, 2020

The file format (simple header, manually parsed, room to implement other digests).

Is there a good reason to not use protobuf for the header?

I didn't want to use a single protobuf for the entire file (header + blob) because it requires reading the entire file into memory in order to unmarshal, and that might be undesirable for large files.

Using a protobuf just for the header with the blob concatenated seems a little complex (but maybe I just know too little about protobuf). We would need to read an unknown (but bounded) number of bytes of the file into a buffer in order to unmarshal the header- I'm unsure if it is safe to unmarshal protobuf messages with arbitrary bytes appended.

Did you consider reusing the Digest message from the REAPI?

The Digest message alone is insufficient if we want to allow other hash functions in the future. But this would make sense if we figure out a good way to use protobuf for the header.

@buchgr
Copy link
Owner

buchgr commented Feb 26, 2020

Using a protobuf just for the header with the blob concatenated seems a little complex (but maybe I just know too little about protobuf). We would need to read an unknown (but bounded) number of bytes of the file into a buffer in order to unmarshal the header- I'm unsure if it is safe to unmarshal protobuf messages with arbitrary bytes appended.

There is the delimited protobuf format that's recommended in the official docs [1] [2]. It's implemented in the Java protobuf library [3]. It simply prepends the serialized proto with the number of bytes to read. Bazel uses it for the BEP.

[1] https://developers.google.com/protocol-buffers/docs/techniques#streaming
[2] golang/protobuf#779
[3] https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/Parser.html#parseDelimitedFrom-java.io.InputStream-

@mostynb
Copy link
Collaborator Author

mostynb commented Feb 26, 2020

There is the delimited protobuf format that's supported at least by the Java implementation in protobuf and recommended in the official docs [1] [2]. It simply prepends the serialized proto with the number of bytes to read. Bazel uses it for the BEP.

Interesting, I'll try this out- thanks!

@Yannic
Copy link
Contributor

Yannic commented Feb 26, 2020

There is the delimited protobuf format that's supported at least by the Java implementation in protobuf and recommended in the official docs [1] [2]. It simply prepends the serialized proto with the number of bytes to read. Bazel uses it for the BEP.

Interesting, I'll try this out- thanks!

As an alternative to storing data as delimited Protobufs there is also riegeli [1], which allows storing metadata in the same file and doesn't require reading the whole file into memory. The kythe project seems to have a Go implementation available [2]. The nice thing about riegeli is that it can also automatically compress the data and save some disc space (although I'm not sure whether the overhead is worth it for non-CAS blobs, even with very fast compression modes like brotli or snappy).

[1] https://github.com/google/riegeli
[2] https://pkg.go.dev/kythe.io/kythe/go/util/riegeli?tab=doc

@buchgr
Copy link
Owner

buchgr commented Feb 26, 2020

Thanks @Yannic. The overhead of the header is negligible and there isn't a need for compression. I'd prefer to not pull in another dependency for the header.

@mostynb
Copy link
Collaborator Author

mostynb commented Feb 26, 2020

I would also prefer not to pull in another dependency for what I hope can be a simple header.

But before ruling it out I wonder if riegeli could also give us the ability to compress the data blobs, re #12 / bazelbuild/bazel#4575 ?

@pcj
Copy link

pcj commented Feb 27, 2020

Writing the header proto size, then the message, then the content sounds pretty reasonable to me.

For inspiration, similar code is in: https://github.com/kythe/kythe/blob/master/kythe/go/platform/delimited/delimited.go

@buchgr
Copy link
Owner

buchgr commented Feb 28, 2020

But before ruling it out I wonder if riegeli could also give us the ability to compress the data blobs, re #12 / bazelbuild/bazel#4575 ?

The issue #12 talks about making compression part of the caching protocol. It was always my understanding that compression is most interesting for CAS blobs. We do not store protobufs in the CAS. For that zlib sounds more interesting for compression and I believe it is shipped as part of the Go standard library.

Does any of you feel differently?

The header is made up of three fields:
1) Little-endian int32 (4 bytes) representing the REAPIv2
   DigestFunction.
2) Little-endian int64 (8 bytes) representing the number
   of bytes in the blob.
3) The hash bytes from the digest, length determined by
   the particular DigestFunction.
   (32 for SHA256. 20 for SHA1, 16 for MD5).

Note that we currently only support SHA256, however.

This header is simple to parse, and does not require buffering the
entire blob in memory if you just want the data.

To distinguish blobs with and without this header, we use new
directories for the affected blobs: ac.v2/ instead of ac/ and
similarly for raw/.

We do not use this header to actually verify data yet, and we
still os.File.Sync() after file writes (buchgr#67).

This also includes a slightly refactored version of PR buchgr#123
(load the items from disk concurrently) by @bdittmer.
@ulrfa
Copy link
Contributor

ulrfa commented May 8, 2020

Which cache operations should do the data verification? Put? Get? Contains?

Have you benchmarked how performance is affected? Especially for HTTP/1.1-without-SSL use cases, where golang’s io.Copy can use linux kernel’s sendfile(2) for transfeering data directly from file to socket, instead of copying data to and from user space. I think that is what today makes bazel-remote on par with heavily optimized http servers like NGINX.

@mostynb
Copy link
Collaborator Author

mostynb commented May 8, 2020

I haven't performed any benchmarks yet, and I haven't figured out when/where to perform the validation. But I suspect that we will either validate data on the first Get or Contains call, and cache the result until the file is rewritten (or the server restarts).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants