Caching Proxy: Features

Automatic Wrapping of Service Proxy

When you request the interface of a remote service via dependency injection and the interface contains cached methods, instead of directly getting an instance of the Service Proxy, you will get it wrapped in a Caching Proxy. This means any call to the interface will first be handled by the Caching Proxy, which will return a cached value if one is available; otherwise it will call the Service Proxy and cache the result.

This functionality is provided when loading ServiceProxyModule into Ninject, which is loaded by MicrodotModule, which in turn is loaded by the MicrodotServiceHost and MicrodotOrleansServiceHost base classes that you extend into your own hosting class.
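
For example, a class that depends on a service interface with cached methods receives the Caching Proxy transparently. The following is a minimal sketch; ICustomerService, Customer and CustomerReport are hypothetical and not part of Microdot:

[HttpService(10000)]
public interface ICustomerService
{
    [Cached]
    Task<Customer> GetCustomer(long customerId);
}

public class CustomerReport
{
    private ICustomerService CustomerService { get; }

    // Ninject injects a Caching Proxy here (wrapping the Service Proxy), because the interface has [Cached] methods
    public CustomerReport(ICustomerService customerService)
    {
        CustomerService = customerService;
    }

    public async Task<string> GetCustomerName(long customerId)
    {
        // The first call goes to the remote service; identical subsequent calls are served from the cache
        var customer = await CustomerService.GetCustomer(customerId);
        return $"{customer.FirstName} {customer.LastName}";
    }
}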

Cache Key Generation

The idea behind transparent proxies is that they add a caching layer without requiring code changes. That means they are responsible for generating the cache key using only the intercepted method call. The cache key must be unique for any given method and arguments, so that a change in the method or in any argument produces a different cache key.

The Caching Proxy generates the cache key in the following pattern:

$"{TypeName}.{MethodName}({string.Join(",", ParameterTypes)}#{GetArgumentHash(args)})

Where GetArgumentHash converts the array of arguments to JSON, calculates its SHA1 hash and converts the hash to a Base64 string. So for example, a call to the following method:

Task SaveUser(string name, int age);

With the arguments ["Alice", 42] will produce the following cache key:

"ICustomerService.SaveUser(string,int)#lfqfcjjeyjOD5bMtKjtJ3wKTEIs="

Time-to-live (TTL)

You can define a TTL to evict an item from the cache after a certain period of time has elapsed since it was added. This can prevent stale data from being returned from the cache. If an item is requested from the cache after it has been evicted by TTL, the data source will be used to retrieve an up-to-date version of the item.

In its current version, the Caching Proxy supports only absolute TTLs, not sliding TTLs. Sliding TTLs are effectively implemented by background refreshes (see below).

Using configuration you can change the TTL of the cache (default is 6 hours) at either the service level, which affects all cached methods on that service, or at the method level, which affects only a specific method group (all of its overloads). For more information about how to change the caching TTLs in the configurations, see Service Discovery: Configuration.

Soft Invalidations and Background Refresh

Traditional TTLs have certain downsides. One of them is latency spikes: since there is a very large performance difference between calls that hit the cache and those that miss, when a cache item is invalidated it immediately causes a spike in latency and a drop in throughput, because all requests that were served quickly from the cache now have to wait for a remote request to complete. In addition, invalidation of a cache item hurts the fault tolerance that caching provides, since if the item is not available in the cache, any failure of the remote service is felt immediately.

To avoid those issues, the Caching Proxy implements a soft invalidation mechanism, where an item is only marked as stale rather than actually being evicted. An item is marked as stale after it has been in the cache for a certain period of time, called Refresh Time, which is typically much shorter than the TTL. The default is 1 minute, but it can be changed using configuration (see Service Discovery: Configuration).

When an item is requested from the cache but only a stale item is available, it is returned to the caller, but a background refresh process is started that attempts to get the latest version of the item and replace the stale one in the cache with the freshly retrieved version. The background process contacts the data source, and if an item is returned, it puts it in the cache, overwriting the old item. It also resets the TTL and Refresh Time, which has the effect of a sliding TTL. If for any reason the data source doesn't return an item (usually when an exception was thrown), the stale item will remain in the cache and will continue to be served to callers. A waiting time called Failed Refresh Delay (also configurable, defaults to 1 second) is enforced before additional attempts are made to retrieve the value from the data source, so that it isn't "hammered". There is no maximum number of retries; attempts to refresh the data will continue until the TTL elapses and the item is evicted from the cache.
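
The flow can be summarized by the sketch below. It is illustrative only, not the actual Microdot implementation: CacheEntry and the hard-coded time spans stand in for the real cache entry type and configuration values, and the two methods are members of an illustrative cache class.

// Illustrative cache entry; the real implementation differs.
class CacheEntry<T>
{
    public T Value;
    public DateTime RefreshAt;          // when the entry becomes stale (Refresh Time)
    public DateTime ExpireAt;           // when the entry is evicted (TTL)
    public bool RefreshInProgress;
}

async Task<T> GetOrFetch<T>(ConcurrentDictionary<string, CacheEntry<T>> cache, string key,
                            Func<Task<T>> fetchFromDataSource)
{
    if (!cache.TryGetValue(key, out var entry) || DateTime.UtcNow > entry.ExpireAt)
    {
        // Cache miss (or TTL elapsed): the caller must wait for the data source
        cache[key] = entry = NewEntry(await fetchFromDataSource());
        return entry.Value;
    }

    if (DateTime.UtcNow > entry.RefreshAt && !entry.RefreshInProgress)
    {
        // Stale hit: serve the stale value immediately and refresh in the background
        entry.RefreshInProgress = true;
        _ = Task.Run(async () =>
        {
            try
            {
                // Success: overwrite the stale value and reset both TTL and Refresh Time (sliding-TTL effect)
                cache[key] = NewEntry(await fetchFromDataSource());
            }
            catch
            {
                // Failure: keep serving the stale value; retry only after the Failed Refresh Delay
                await Task.Delay(TimeSpan.FromSeconds(1));
                entry.RefreshInProgress = false;
            }
        });
    }

    return entry.Value;
}

CacheEntry<T> NewEntry<T>(T value) => new CacheEntry<T>
{
    Value = value,
    RefreshAt = DateTime.UtcNow.AddMinutes(1),   // default Refresh Time
    ExpireAt = DateTime.UtcNow.AddHours(6)       // default TTL
};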

This soft invalidation prevents the sudden latency and throughput changes that occur when a TTL expires (a TTL that would have to be much shorter if soft invalidation weren't available), and it allows the cache to protect against unavailability of a service by permitting longer TTLs. With the default TTL of 6 hours and Refresh Time of 1 minute, data in the cache is kept no more than about a minute out of date, yet a service can be down for up to 6 hours before the cache no longer has even stale data to serve. This is a good balance between keeping the data fresh and keeping it available.

You want to set the Refresh Time to the amount of time data can be out of date and still be useful. The shorter the time period, the bigger the load on the system and the lower the benefit of the cache.

You want to set the TTL to the maximum staleness of data you're willing to accept, beyond which you'd rather have requests start failing than use data that stale. Remember that the TTL is your safety net: if a service fails, at least some requests will still be served from the cache, but beyond the TTL you will have a 100% failure rate if the service is still down. Configure it according to your SLA and the amount of time it takes you to detect and recover from a service failure.

Revocation

There are two situations where TTLs and soft invalidations are insufficient:

  • When your maximum tolerable period of stale data is very short (seconds or even milliseconds) and the frequency of accessing the data is considerably lower, so you can't afford the delay of a background refresh that is only triggered after a cache hit has already returned stale data.
  • When you want to minimize repeated background refreshes for data that doesn't change often on one hand, but want it reasonably fresh on the other hand.

In the first case, background refreshes (even with short Refresh Times) will not be sufficient to ensure that data is fresh because a background refresh is only triggered after stale data is returned. For example, if you need your data to be fresh within a 5 second time period, but you only access the data once a minute, it doesn't matter what you set the Refresh Time to be, you will always get one-minute-old data (since the fresh data of a background refresh will only be served next time you request the data, by which time it's stale again). While setting a TTL of 5 seconds will solve the freshness issue, all requests will have to wait for the data source, negating any advantages the cache provides (you could turn off the cache and get the same effect).

In the second case, you need reasonably fresh data (e.g. at most 1 minute old) but the data changes very infrequently (e.g. once a week). This would mean you're performing a lot of background refreshes (once every minute) that return the exact same data already present in the cache, which is wasteful. Using the 1 minute/1 week example, you'd be performing about 10,000 refreshes every week (per client) where only one of them would return different data. If you have 100 caches for this data source (say 10 microservices each having 10 nodes), this ratio climbs to 1,000,000:1, making 99.9999% of your refreshes busywork that contributes nothing to cache freshness. Setting a longer Refresh Time would compromise the freshness of your data and is not an acceptable solution.

The solution is active revocation of data. When a service becomes aware that data which is cached by its clients has changed, it can inform those clients that the relevant cached items are stale and should be invalidated. Such a notification system is faster and in many cases uses fewer resources than timed refreshes. To protect against missed revocation messages causing stale data to be stuck in the cache for extended periods of time, you still want to use TTLs with soft invalidations, but their configured time periods can be very long since they're only used as a safety net.

The Caching Proxy supports revocation using Revoke Key Accumulation. You wrap your returned object in the Revocable<T> class, which allows you to specify revoke keys for that entity. Your cached method should also return a Task<Revocable<T>> instead of your entity directly. For example:

Before

public class User
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

[HttpService(10000)]
public interface IMyService
{
    Task<User> GetUserByEmail(string email);
    Task<User> GetUserByPhoneNumber(string phoneNumber);
}

After

public class User
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

[HttpService(10000)]
public interface IMyService
{
    [Cached]
    Task<Revocable<User>> GetUserByEmail(string email);

    [Cached]
    Task<Revocable<User>> GetUserByPhoneNumber(string phoneNumber);
}

The Revocable<T> class has two properties:

public class Revocable<T> : IRevocable
{
     public T Value { get; set; }
     public IEnumerable<string> RevokeKeys { get; set; }
}

Place your data in the Value property and one or more revoke keys in the RevokeKeys property.
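
For example, the service implementation of GetUserByEmail might populate the Revocable<User> like this. This is a sketch: the repository and the Id property are assumptions and not part of the interface shown above.

public async Task<Revocable<User>> GetUserByEmail(string email)
{
    User user = await _userRepository.GetByEmail(email);   // _userRepository is a hypothetical data access object

    return new Revocable<User>
    {
        Value = user,
        // Assumes the User entity has a stable surrogate key (Id); revoking "Billing.User_<id>"
        // will evict this entry from every client cache that holds it
        RevokeKeys = new[] { $"Billing.User_{user.Id}" }
    };
}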

When the Caching Proxy receives a Revocable<T>, it listens for any revocation messages that arrive for those revoke keys. If a revoke message arrives for at least one of the revoke keys, the item is immediately evicted from the cache (soft invalidation with revocables isn't supported yet).

You can specify more than one revoke key per entity because an entity may be composed of several other entities, and a revocation of any of those should also revoke the composed entity. For example, if you have an OrderConfirmation entity that includes data from both the Customer entity and the Order entity, you will want to include the revocation keys of both the Customer and the Order that were used to compose it, so that if either of them changes, the OrderConfirmation will be revoked as well.

Constructing a good revoke key

Revoke keys are strings created entirely by you. Microdot currently doesn't provide a facility to generate a revoke key, so you must construct them yourself. Revoke keys must be unique, as collisions can unintentionally remove items from a cache leading to poor performance and reduced failure resiliency.

A good revoke key should have the following properties:

  • To prevent collisions, it should be prefixed by a unique string that identifies your entity throughout the entire system. Be as verbose as you need to ensure uniqueness. A revoke key of C35895 for a Customer entity with an ID of 35895 risks colliding with the revoke key for Check. Better keys might be Accounts.Customer_35895 and PaymentProcessing.Check_35895.
  • It should include a unique identifier of the entity (like a primary key from the database), preferably a surrogate key.
  • It shouldn't include any data that can change (for example, a customer's first and last name) as this could cause the revoke message to contain a different revoke key than the one the entity had when it was retrieved from the data source.
  • To aid debugging, it should be human readable. Hashes, base64 encoded data and the likes should be avoided.
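
One way to keep producers and revokers in sync is to build each entity's revoke key in a single place, for example with a small helper like the one below (a hypothetical helper, not part of Microdot; the entity names follow the examples above):

public static class RevokeKeys
{
    // Prefix identifies the entity system-wide; the surrogate key identifies the instance; no mutable data
    public static string Customer(long customerId) => $"Accounts.Customer_{customerId}";
    public static string Check(long checkId)       => $"PaymentProcessing.Check_{checkId}";
}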

Revoking a cache entry

You can revoke a specified key by using the ICacheRevoker.Revoke() method.

public class UserGrain : Grain, IUserGrain
{
    private ICacheRevoker CacheRevoker { get; }
    private User Data { get; set; }

    public UserGrain(ICacheRevoker cacheRevoker)
    {
        CacheRevoker = cacheRevoker;
    }

    public async Task SetFirstName(string firstName)
    {
        Data.FirstName = firstName;
        await CacheRevoker.Revoke($"Billing.User_{this.GetPrimaryKey()}");
    }
}

Note that in order for revokes to work, you must implement and bind two interfaces: ICacheRevoker for sending revoke messages and IRevokeListener for receiving them (the Caching Proxy subscribes to revoke events on its RevokeSource property using TPL Dataflow). At Gigya we use an internal implementation based on RabbitMQ, but you may implement it using any pub-sub or message broadcast infrastructure.
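
For local development or testing, a minimal in-process implementation could look like the sketch below. It assumes ICacheRevoker exposes Task Revoke(string key) and IRevokeListener exposes an ISourceBlock<string> named RevokeSource (check the interfaces in your Microdot version); a production implementation would publish and subscribe through a message broker so that revokes reach every node.

public class InProcessRevokeBus : ICacheRevoker, IRevokeListener
{
    // BroadcastBlock pushes every revoke key to all linked targets; the Caching Proxy links itself to RevokeSource
    private readonly BroadcastBlock<string> _broadcast = new BroadcastBlock<string>(key => key);

    public Task Revoke(string key)
    {
        _broadcast.Post(key);
        return Task.CompletedTask;
    }

    public ISourceBlock<string> RevokeSource => _broadcast;
}

Both interfaces should resolve to the same instance, for example by binding them to a singleton in your Ninject module (e.g. Bind<ICacheRevoker, IRevokeListener>().To<InProcessRevokeBus>().InSingletonScope()).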

Call grouping

Call grouping is a way to merge several identical concurrent requests into a single request. The idea is that if several components need the same data that's not available in the cache, then instead of each of them making its own request, they group together and make a single request, sharing the result between them. This improves performance by reducing redundant calls and simplifies cache management by not having several results competing for the same cache slot.

If you ignore concurrency, when multiple consecutive calls occur, the first one causes a cache miss and the rest are cache hits. All cache misses cause the data source to be contacted, and all cache hits don't.

After a cache miss, the caching proxy starts a request from the data source. If additional requests are made for the same data before the original request to the data source completes, then instead of issuing additional requests to the data source, the Caching Proxy will group those calls into a call group. When the request to the data source completes, all members of the call group will share the result from the data source, and that result will be put into the cache. If an exception was thrown during the request to the data source, all members of the call group will receive that exception.
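
Conceptually, call grouping works like the request-coalescing sketch below (illustrative only, not the Microdot implementation; the field and method are a fragment of an illustrative cache class): concurrent callers asking for the same cache key share a single in-flight Task.

private readonly ConcurrentDictionary<string, Lazy<Task<object>>> _inFlight =
    new ConcurrentDictionary<string, Lazy<Task<object>>>();

public async Task<object> GetGrouped(string cacheKey, Func<Task<object>> callDataSource)
{
    // The first caller creates the request; concurrent callers for the same key join the existing call group
    var groupedCall = _inFlight.GetOrAdd(cacheKey,
        _ => new Lazy<Task<object>>(callDataSource, LazyThreadSafetyMode.ExecutionAndPublication));

    try
    {
        // Every member of the group shares the same result (or the same exception)
        return await groupedCall.Value;
    }
    finally
    {
        // Once the call completes, remove it so later requests hit the cache or start a new group
        _inFlight.TryRemove(cacheKey, out _);
    }
}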

Besides the advantages outlined above, call grouping might cause you to observe one or more of the following (possibly undesirable) phenomena:

  • You make several concurrent requests for a call that you know hasn't been cached yet, but you see only one outgoing call (e.g. in Fiddler).
  • Several of your requests get the same exception, even though there was only one failure. That is because an exception that was returned for a call group is propagated to all members of that group.
  • Distributed tracing doesn't show the clientReq event under the callId of your request. This happens because your request joined an existing call group, and the clientReq event used the callId of the first request (which created the call group) rather than yours. There is currently no way to discover the callId of the request that created the call group you joined, which can make tracing more difficult.

Metrics

The Caching Proxy exposes several metrics to help you understand how it's used in your service. Metrics are provided at three levels:

  • Global (per AppDomain)
  • Per interface
  • Per method

The following metrics are provided at the above three levels:

  • Cache Hits
  • Cache Misses
  • Requests that joined a group (still uses the old term "team")
  • Requests awaiting result from data source
  • Requests that failed at the data source
  • Number of times the cache was forcefully cleared

The following are only provided at the global level:

  • Number of entries in the cache
  • Memory consumption of the cache, in megabytes
  • The configured size limit of the cache, in megabytes
  • The configured size limit of the cache, in percent of available physical memory