Notes from Tail Latency Aware Caching Paper by RobinHood

The problem


When the web service latency increases, the first suggested technique is to cache. The cache is a good solution when your system is a read heavy system.

The common technique is to cache the frequently used objects. The method generally reduces the latency, but doesn’t help much for tail latency (p99). The paper “Tail Latency Aware caching - Dynamically Reallocating from cache rich to cache poor” proposes a novel solution for maintaining low request tail latency.

Latency of backends

In multi-tier architecture, each service gets a cache resource and then depending on the need to optimize the latency or throughput, the cache resources can be increased or decreased. The main drawback of the method, the policy that determines the cache resource is “static”. The static policies are mostly offline and based on estimates that tend to solve 90% of users and not the tail end of the users.

What’s the proposal?

Latency of backends

The paper suggests a novel approach of creating the polices “dynamically” aimed at solving the tail latency. The new policies are made on the fly in a fixed time interval based on the latency contribution of each service in the previous cycle. If service A contributes to 10% to tail latency, the service gets X% of cache resource and in the next cycle if service A contributes to 5% to tail latency, service A can occupy Y% of caching resource.

The key idea behind RobinHood is to identify backend queries that are responsible for high P99 request latency, which they call “cache-poor” backends. RobinHood then shifts cache resources from the other “cache-rich” backends to the cache-poor backends.

What’s new in the proposal?

The design of the system focuses on solving tail latency (p99). Rather than considering cache resource policy as static policy, the cache policies are dynamic adjusted based on statistics.


  • The latency of each system varies over time.
  • Latency is not correlated with specific queries nor with query rate.
  • Latency depends on request structure, which varies greatly

How does it work?


Robinhood collects the response time for each request and filters the response time that falls between P98.5 and P99.5. Next, the system collects each service response time for the request ID in the tail bucket. Then the system counts each service’s contribution to tail latency. This metric is called RBC (Request Blocking count). The systems with the higher RBC values are cache-poor systems.

Once cache poor systems are identified, the system can leverage the available cache resources. The policy of cache is calculated every 5 minutes. The basic RobinHood algorithm assumes that redistributed cache space is filled immediately by each backend’s queries. In reality, some backends are slow to make use of the additional cache space because their hit ratios are already high.

Each request statistics from the application server is forwarded to the RBC server. Then the RBC server generates the new policy every five minutes and updates the cache controller. Each application has one controller. The controller enforces the cache resource resizing.

Implementation details

  • The RobinHood controller is a lightweight Python process.
  • The RBC server and application servers are highly concurrent and implemented in Go.
  • The caching layer is composed of off-the-shelf memcached instances, capable of dynamic resizing via the memcached API. Each application server has a local cache with 32 GB cache capacity.
  • On average, a request to the application server spawns 50 queries. A query is first looked up in the local memcached instance; cache misses are then forwarded to the corresponding backend system.
  • During the experiments, the average query rate of the system is 200,000 queries per second (over 500,000 peak).
  • The experimental test bed consists of 16 application servers and 34 backend servers divided among 20 back- end services. These components are deployed across 50 Microsoft Azure D16 v3 VMs.



The empirical evaluation of RobinHood focuses on five key questions. Throughout this section, the goal is to meet a P99 request latency Service Level Objective (SLO) of 150ms. RobinHood brings SLO violations down to 0.3%, compared to 30% SLO violations under the next best policy. For quickly increasing backend load imbalances, RobinHood maintains SLO violations below 1.5%, compared to 38% SLO violations under the next best policy. RobinHood maintains less than 5% SLO violations, while other policies do significantly worse. The best clairvoyant static allocation re- quires 73% more cache space in order to provide each backend with its maximum allocation under RobinHood. RobinHood introduces negligible overhead on network, CPU, and memory usage.

Six caching system result

The evaluation compares the result to the existing two production systems (OneRF, TAO++) and three research caching systems (Cliffhgr++, FAIR++, LAMA++).


RobinHood algorithm is capable of meeting a 150ms SLO for the OneRF workload even under challeng ing conditions where backends simultaneously become overloaded. Many other systems, Facebook, Google, Amazon, and Wikipedia, use a similar multi-tier architecture where a request depends on many queries. However, these other systems may have different optimization goals, more complex workloads, or slight variations in system architecture compared to OneRF.


RobinHood is also lightweight, scalable, and can be deployed on top of an off-the-shelf software stack. The RobinHood caching system demonstrates how to effectively identify the root cause of P99 request latency in the presence of structured requests.


paper  notes 

See also

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Powered by Buttondown.