The caching functionality in Build Accelerator (BuildXL) is provided by a set of components which is packaged separately and can be used independently of the BuildXL Build Engine. This functionality goes beyond the typical key-value construct, having features designed to support build engines that can handle non-deterministic and non-hermetic tasks.
The Build Accelerator Cache [BuildXL Cache, for short], is logically composed of two parts: a Content Store and a Memoization Store.
The Content Store is effectively a CAS (Content Addressable Store). There are methods to store and retrieve content blobs. In this document we use the term "blob" to differentiate cache content from "files on disk" - the term is not used in code much, though. Its functionality is embodied by the IContentStore
interface, with IContentSession
providing methods to store and retrieve content from the CAS.
Each blob in the cache is identified by a hash of its contents. A few different hashing algorithms can be used, as seen in the Hashing directory. At this time such algorithms are not simple MD5 or SHA-* hashes of the content stream, but rather a hash-of-hashes. A blob is divided into segments, each segment hashed independently, and a hash-of-hashes computed as the blob's identifier. There are variants for fixed- and variable-length segmentation, each optimized for use with different cloud storage APIs. These APIs use the segmented nature of this scheme to validate and de-duplicate content in transmission and at rest.
Content Store implementations are responsible for managing storage utilization - some do it based on space quotas, some on time-based retention policies.
In some contexts BuildXL needs to signal to the cache that it intends to use a given blob, without retrieving it from the cache. So, besides the "add" and "get" methods typical of CAS implementations, a Content Store also provides a Pin
method, which which will:
- Check that pinned blob is available and could be retrieved if needed later during a client's session;
- Ensure that the blob is not deleted by garbage collection or space management activities for the duration of the client's session.
The Memoization Store is used to look up a set of blobs, based on an arbitrary lookup fingerprint. This implementation is embodied by the IMemoizationStore
interface, with a IMemoizationSession
providing methods to insert and look up a content-set based on its fingerprint. Fingerprints are domain-dependent, and handled opaquely by the Memoization Store.
The Memoization Store APIs were designed to address BuildXL's unique cache lookup requirements. To understand how, one needs to first review what BuildXL has to work with, in order to determine whether to reuse cached outputs for previously executed build tasks:
-
Before executing a build, BuildXL parses build specs, which include: (a) the tasks involved in the build; (b) dependencies amongst such tasks; and (c) the inputs used by each task. Such inputs are inferrable from build specs and are called static inputs.
-
Unfortunately the list of inputs required by a task may be incomplete. For example, a C++ compilation task may declare the exact set of input .CPP/.CXX source files required, while not declaring the complete set of .HPP/.HXX header files. Determining the set of includes would typically require running the C++ pre-processor, which is undesirable. Of course, when BuildXL does execute the C++ compilation task, it can observe the actual header files accessed by the compiler. Those accesses are recorded as dynamic inputs.
To accommodate these static and dynamic inputs, the Memoization Store's data model goes beyond the typical key->value model.
Consider the following table, which illustrates the Memoization Store's conceptual data model:
WeakFingerprint | CasElement | HashElement | Content List |
---|---|---|---|
|-------------- | StrongFingerprint | --------------| | |
F1 | F1C1 | F1C1H1 | {C1, C2, ... } |
F1 | F1C1 | F1C1H2 | {C3, C4, ... } |
F1 | F1C1 | F1C1H3 | {C5, C6, ... } |
F1 | F1C2 | F1C2H1 | {C7, C8, C9, ... } |
F2 | F2C1 | F2C1H1 | {C10, ... } |
... |
The WeakFingerprint
, the CasElement
and the HashElement
are all used in the lookup process. The Content List is the result of a lookup, an ordered list of content identifiers which can be used to retrieve blobs from the CAS.
-
To look up a given build task, BuildXL computes a
WeakFingerprint
, typically a hash-of-hashes of the files declared as the task's static inputs. Let's say the weak fingerprint is F1, as seen in the table above. A call to GetSelectors( F1 ) would return a set ofSelector
objects associated with that weak fingerprint. EachSelector
is tuple of the form <CasElement, HashElement>. Based on the example the result would be
{ <F1C1, F1C1H1>, <F1C1, F1C1H2>, <F1C1, F1C1H3>, <F1C2, F1C2H1> }. TheCasElement
term is a content identifier for a CAS blob. The the contents of that blob being application dependent.
In the BuildXL case, the build engine uses the blob referenced byCasElement
to store the list of dynamic inputs observed doing a previous execution of the task. In the C++ example above, this would be paths to header files last used by the compiler.
TheHashElement
is an applicationdefined opaque byte-array, used to test "how good of a match" this cache record is to the application's need. -
BuildXL evaluates the returned
Selectors
attempting to find one that matches the file system state for the present build. For eachSelector
, BuildXL retrieves the blob referenced by itsCasElement
and, using the paths listed as dynamic inputs, computes a hash-of-hashes which summarizes the file system state for the relevant files. The hash-of-hashes is then compared to theHashElement
in theSelector
(and possibly to any otherSelectors
with the sameCasElement
). In the example above, let's say that it matches the value F1C1H2.
To retrieve the associated content, BuildXL calls GetContentHashList( F1,F1C1, F1C1H2). Based on the table, the value returned would be { C3, C4, ... }. -
In another scenario let's say that BuildXL calls
GetSelectors
as described above, but receives a result that amounts to "not found", or thatGetSelectors
returns some tuples, but BuildXL is unable to find one with a matchingHashElement
. Both cases are cache misses. As BuildXL executes the task, it observes its dynamic inputs and collects its outputs. It stores the collected output into the CAS and uses the dynamic inputs observations to compute theHashElement
value F1C1Hnew . With that information it calls AddOrGetContentHashList( F1,F1C1, F1C1H2, F1C1Hnew , { Cnew1, Cnew2 } ), which effectively adds the new entry to the cache.
Notes:
-
In reality,
GetContentHashList
is called with aStrongFingerprint
object, effectively a tuple <WeakFingerprint, CasElement, HashElement>). Similarly,AddOrGetContentHashList
also takes aStrongFingerprint
, plus a few other arguments omitted here for brevity. -
The cache can store multiple
HashElements
for the sameCasElement
. In the example this would correspond to different content for one or more C++ headers referenced by the compiler task. -
The cache can also store multiple
CasElements
for the sameWeakFingerprint
. In the example, this would correspond to different sets of C++ headers observed in different executions of this compiler task, which could occur if, say, one of the header files was re-factored into two files. -
The Memoization Store attempts to apply an LRU retention scheme to cached entries. Each of the records depicted in the table above would have a distinct TTL (Time To Live) value, which is extended by calls to
GetContentHashList
(though not by calls toGetSelectors
). -
A client that needs a simple key->value model, can still used these APIs. In this case the
WeakFingerprint
is sufficient as a key. To callGetContentHashList
orAddOrGetContentHashList
, aStrongFingerprint
can be "manufactured" by using thatWeakFingerprint
and hard-coded values forHashElement
andCasElement
(typically these would be hard-coded to an empty byte-array and the content id for the zero-length blob, respectively). No calls toGetSelectors
are needed.
The BuildXL cache implements a layered architecture, which allows for local and distributed caching.
-
At a minimum BuildXL runs with a local cache, which uses configurable directories on disk to retain Memoization Store and Content Store data across different BuildXL invocations. Disk utilization and data retention are managed automatically based on configuration settings. Whenever possible, hard-links are used to "copy" data into the cache, and back out into the build "work directories". On Windows, Access Control Lists [ACLs] are placed on he files to avoid clobbering their contents. On MacOS, the APFS copy-on-write feature is used instead.
-
The local store can be tiered with distributed cache, which stores both the Memoization Store and CAS data into the "cloud", via a set of APIs implemented in the Azure DevOps service.
NOTE: As of this writing, the Azure DevOps APIs are not public, and can only be used for internal Azure DevOps accounts. The API client code can still be found in this repo, in classes using the 'VSTS*' prefix, moniker for 'Visual Studio Team Services', Azure DevOps's earlier name. -
When deployed within CloudBuild, Microsoft's internal build service, there is also a datacenter CAS component, which allows multiple per-node CAS services to track and retrieve each other's content, effectively implementing a "content mesh." Most of this implementation can be found in this repo, though as of this writing we don't expect it to be used outside of the confines of Microsoft.