Let's Talk Latency

May 21, 2021

Semi-relevant XKCD ;)

Update

Now that I've graduated, I'm finally catching up to some more 'recent' developments in CS beyond university curriculum. Berkeley's EECS department has its (immense) strengths and it really does instill the tools to understand and independently reason about academic papers, industry articles, etc., but it's unfortunate that a distributed systems class isn't available for undergrads, especially considering how prevalent and applicable the concepts are in understanding practically any major website's backend design.

I attended my first 'Papers we love too' meeting today, where Yao Yue, an engineering manager at Twitter who's worked extensively on large-scale distributed systems and caching, talked about 'The Tail at Scale' (Jeffrey Dean, Luiz Andre Barroso) -- a 2013 ACM article about response variability and potential mitigations. I didn't end up asking any questions because I was a little nervous and unsure about the structure of the discussion (note to self: prepare questions for next time), but I ended up digging into the paper more deeply with Yue's comments / disagreements and audience questions, and learning a lot about practical systems challenges, particularly at Twitter.

Scatter/gathering my way through other distributed systems ideas so I can gain a fuller understanding of the field's landscape, I'm going through MIT's 6.824 video lectures (#1 so far, but I highly recommend the playlist already). Also, a random Piazza post about Go vs. Rust in a 2020 Discord blog update led to some enlightening context that I'll mention in a bit. All these seemingly disparate pieces of information tie together, somehow. :)

The Tail at Scale

Paper Takeaways

You can't avoid response-time variability at scale.

The impact of variability increases at scale. Imagine a computer that fails to execute a request in 1% of instances (1 request / day). We expect ~3-4 failures during a year. A large-scale application may depend on the availability of 1000 such computers (and potentially many such requests / day). At this point, ~3000-4000 failures a year implies ~10 failures a day. Scale makes rare individual issues a lot more likely to occur, making latency dependent on fanout.

Underlying this paper is an assumption that isn't visualized fully, according to Yue, but in her presentation, she includes a nice image from a colleague on slide 3 -- the remote procedure call (RPC) dependency tree. In a nutshell, most computers execute procedures, in which a block of code is run locally, with local results. In an RPC, instructions may be sent to other devices over a network, run remotely, with final results being returned to the parent device. This is the building block of a distributed system, in which -- by definition -- you may be distributing tasks over many computers for efficient execution.

So, in the dependency tree, completion of a parent node's task is dependent on the child's result being propagated up. Tree depth is directly proportional to the execution time of the request that spawned the tree e.g. parent node A with child node B (200ms) which has child node C (400ms) should take ~600ms to complete. However, in this discussion of depth, we completely ignore the tree's other dimension -- width -- which includes task fanout at individual nodes. When nodes are running in parallel, what do we expect the impact on latency to be? It's not simply additive like tree depth, although parallel tasks that output a single result may face bottlenecks in the form of the longest-running task.

Causes of variability mentioned in the paper include the following:

shared resources -- local contention between requests or entire applications
daemons -- scheduled background daemons cause small expected interruption times to build up
global resource sharing -- global contention between applications for file system access, network switches, etc.
maintenance activities -- background logging, cleanup, garbage collection, etc.
queuing -- many queue layers can lead to unpredictable variability behavior, more in the realm of network traffic than strictly software engineering
power limits -- throttling devices to prevent overheating
garbage collection -- periodic garbage collection can really affect read latency even if reads overwhelmingly outnumber writes
energy management -- power-saving / sleep modes can take an unreasonable amount of time to snap out of

Yue indicated that the list is not comprehensive, perhaps because the examples used are Google File System-specific and may have had the infrastructure to run natively, ignoring other issues like garbage collection in virtual machines, synchronizing concurrent data structures, etc. She categorized forms of variability as:

resource starvation a.k.a. 'unfair wait' -- Task A is queued up and ready to run, but the device suddenly spins up several hundred threads related to some large map-reduce job. A is unnecessarily slowed down or receives insufficient CPU resources during the time slice in which the job is running before being kicked out.
queue depth a.k.a. 'fair wait' -- Task A is, unfortunately, far back in a queue that contains other (possibly earlier) tasks of higher priority. Scheduling algorithms that assign chunks to computers that currently don't have long queues can suffer from race conditions in which several tasks are assigned to the same device at once, lengthening the queue before updated queue length reaches the root node.
degradation a.k.a. 'moving slowly' -- The computer has been heating up, so fewer tasks are assigned to it / assigned tasks execute more slowly at a lower clock rate as the device cools.

Overall, variability seems to arise from resource sharing (libraries, data structures, file systems, etc.), bursty behavior (e.g. sometimes, background activity can occupy critical computational resources, leading to transient slowness), and physical / hardware limitations. The mention about libraries confused me a little, though. This might be a Twitter-specific phenomenon, because I wasn't able to find any references to library imports affecting response time, but my educated guess would be that Yue was referring to the time it takes to find a module's code on the computer's disk and bundle it with application logic, or time necessary to resolve dependencies. From my memories of CS 164 (Compilers), compiling a program to create bytecode requires resolving dependencies, which might contribute to slowness or intermittent failures.

Yue additionally split up variability mitigation tactics mentioned in the article between component builders (people who manage individual nodes / service endpoints) and systems builders (people who write the services composed of components), adding some of her own tips from experience.

Component builders can:

differentiate service classes -- separating & differently prioritizing tasks related to user interactivity from background batch operations / requests. We might care more about whether a user gets real-time feedback on whether the word they typed into a document was spelled correctly than updating the backend model with new parameters.
reduce head-of-line blocking -- chunk long, time-intensive tasks into smaller ones that require fewer time slices. Smaller tasks give rise to greater flexibility to reprioritize / shuffle / interleave assignments between nodes if necessary. If there's one thing I remember from CS 162 (Operating Systems), it's that no one can really predict how long a particular task will take to complete; the best we can do is update a heuristic.
use synchronized disruption -- instead of letting individual machines handle background activity at their own pace, synchronize them so they all garbage collect, log, etc. at once, leading to temporary scheduled slowdowns of requests during this bursty period. When I read about this mitigation in the paper, I felt a little wary for some inexplicable reason... maybe because I was thinking about the implementation mechanics or that it might not be a great idea to rely on several things happening at once on different machines in an article about variability ;). Yue partially validated my feelings by speaking from practical experience that this can backfire, and might work better with additional redundancy.
minimize sharing -- global concurrent data structure vs. distributed local subsets... especially for write updates, the second wins out
limit bursts -- keeping background activity to a minimum and not allowing its threads to starve those a user is waiting on
latency-optimized hardware configuration -- I need to look into this further though... my knowledge of hardware for distributed systems is pretty limited.
cache -- the paper states that caching won't directly address tail latency, but Yue pointed out that a high hit rate could drastically cut fanout, which contributes to the latency distribution and could reduce the tail

Systems builders can:

use hedged requests -- wait for a reply from a node up until the 95th percentile delay in the overall latency distribution, then ping a different node. Chances are that the long response time is transient and not a systematic issue with the request. Google apparently uses canary servers to detect this... just another example of how intertwined systems and security design considerations are. The other node will likely respond quickly while masking the underlying delay and distributed implementation.
use tied requests -- ping a bunch of servers at once. The first reply will be used. This might not be the best idea due to redundant use of computational resources and the necessity of cross-server communication, which we generally want to keep low because more complex communication design can lead to more hard-to-debug failure patterns.
micro-partition (with selective replication) -- generate more task partitions than machines, so assignments can be dynamically shuffled. Replicating some tasks / data can lead to higher completion rate based on request demand.
detect outliers -- suspend a slow machine, while sending shadow requests to occasionally check in on status
use graceful degradation -- a rack of 20 nodes each receiving a certain amount of power might heat up quickly; it might be better at certain periods to power 15 nodes and allow the others to cool at the cost of temporarily reduced compute ability.

One of the most interesting things Yue spoke about was updates to the hardware considerations / trends put forth in the paper, which is now pretty outdated. Hardware moves fast! :') (and to some extent, apparently they're giving up on regular SWEs keeping up-to-date on these advances)

Offloading certain software functionalities to hardware leads to orders of magnitude of speedup, but most backend code sits on a layer far above hardware and is meant to be architecture-agnostic. I wonder how sustainable the current situation is, with hardware encroaching / intermingling further with software concerns. Is this even a bad thing?

Some of the biggest distributed systems challenges are related to grey failures -- requests that result in timeouts, and when it might be best to retry. According to queuing theory, one slow request can lead to a backup, snowballing into an issue that affects many users. Yue has another talk on the topic -- Lies, Damned Lies, and Timeouts -- and after how clear and descriptive this one was, I'll definitely be checking these slides out!

Important note: Most of the mitigations mentioned in this paper and talk address read requests, or requests that don't alter critical states. Because one of the core goals of a distributed system is failure recovery, replaying state-changing requests can result in unwanted side effects. The paper states that latency isn't an issue with most 'writes' though, and that generally, these requests can be queued and performed more leisurely. I've noticed this firsthand with user settings changes on GitHub. Updating your profile image doesn't result in immediate changes; it takes a couple of minutes but presumably these kinds of requests are occurring less frequently and are lower priority than... say... reading the files in a repo.

Discord

As for how this all ties into Discord...

Discord is a messaging application, which means it must support asynchronous operations. One of its hottest paths is user read state, which tracks read messages / mentions across channels and DMs using counters. These counters must be updated and read thousands of times per second, necessitating low latency. A difference of milliseconds at any stage could be critical for user experience.

The trouble began when the Discord team noticed latency spikes every 2 minutes, like clockwork. Discord's backend is implemented in Go, and the scheduled spikes seemed like they could be blamed on garbage collection, because memory isn't freed immediately. This regular GC is baked into Go's source code and doesn't have an easy workaround. Discord tried the following things:

manually alter GC percent and trigger it more often, thereby spreading out the latency caused by GC background activity. This didn't really work, because GC scans the whole LRU cache, independent of actual size of freed memory.
shrinking cache size, so the GC has less memory to check. It worked, but at what cost? :') Shrink a cache and you shrink its hit rate, increasing latency for a different reason.
hand-tuning different cache capacities to reach a careful balance between tradeoffs. Ok for now, but fragile, and fragile design usually causes problems later.

Noticing Rust's success in other parts of the codebase, Discord decided to port the read state service over to the language as well. Rust doesn't have a GC; its memory ownership functionality ensures that runtime memory bugs are resolved at compile time, and leaks never crop up.

Asynchronous Rust was in beta at the time, but Discord decided to hop on the train, and Rust ended up outperforming Go in nearly every performance metric, before any particular optimizations. Raising cache capacity after this new implementation gave all the speedup benefits, undampened by increased GC activity.

Takeaway: The Discord case study illustrates how scaling up amplifies small delays caused by node background activities, including GC. The codebase is likely smaller than other applications facing similar issues, like Twitter or Facebook, which may not have the ability to rewrite vast chunks in a different language. Sometimes, abstractions meant to make programmers' lives easier -- normally, high-level languages want to make sure you never think about GC -- turn into language-specific limitations. Working on performance improvements might mean getting into the weeds to attribute a slowdown to a part of your infrastructure, even if it might mean that a fundamental assumption about the system might need to be thrown out and rewritten.

Distributed Systems, More Generally

Why even build a distributed system? As websites and their user bases grow ever larger, it becomes unfeasible to run an application's code on one machine. To parallelize computations that can be done at once, enable fault tolerance, deal with an inherently physically distributed computational problem, or facilitate security / isolation, it might be a good idea to distribute. Splitting work or data across multiple machines is an obvious solution, but introduces a new set of challenges, mostly related to concurrency, partial failures, and performance.

The goal of distributed computing is, fundamentally, abstraction. A client using a distributed service should optimally not receive indications that it's distributed, but this is really, really hard. Sometimes, distributed failure mitigations involve prioritizing speed in returning 'good enough' results (e.g. Google search). By throwing more computers at a problem, we expect performance to scale; preferably, 2x the number of nodes will lead to 2x throughput.

We also expect fault tolerance, which includes availability, recoverability, and consistency. Availability and recoverability in the case of failure are pretty interlocked. If a node goes down and becomes unavailable, you also want its data to be recoverable (usually by logging) so it can continue servicing requests correctly when it comes back up. Data usually lives in non-volatile storage (shameless plug for Pure Storage here) e.g. solid-state disks which have the downside of being expensive to update, or in replicas for redundancy. By default, Google Drive stores 3 replicas per file, generating new ones automatically if a node goes down and loses information. Consistency refers to synchronization of shared data structures across nodes. A client interacting with a key-value store with replicas on multiple nodes needs write requests to propagate across all copies of a particular key, so stale values are never returned on reads.

One of the first forays into distributed computing was the development of MapReduce. This lecture goes into the algorithm in far more detail with examples, but the main idea involves dividing an input into chunks, applying a map function in parallel to produce intermediate output, calling a reduce function across all nodes to coalesce data by key, and emitting a dictionary result. MapReduce is a powerful (now open-sourced) idea and was originally used as the basis for Google's big data processing. These days, Google uses something different, as MapReduce also has glaring flaws, including natural incompatibility for iterative usage. New jobs and map / reduce functions need to be created if you want to run something that changes over time, like PageRank weights.

Addendum: A Proof

Variability in the latency distribution of individual components is magnified at the service level; for example, consider a system where each server typically responds in 10ms but with a 99th-percentile latency of one second. If a user request is handled on just one such server, one user request in 100 will be slow (one second)... If a user request must collect response from 100 such servers in parallel, then 63% of user requests will take more than one second... (Dean et al. 2013)

Why 63%? That number seems oddly familiar (throwback to EECS 126: Probability) -- it's $1 - \frac{1}{e}$.

The expected number of user requests per 100 requests that will be slow is 1 i.e. $E[\text{# of slow requests per 100}] = 1$. The paper uses 100 as a concrete example of a very large number $n$.

More generally, $E[\text{fraction of slow requests per $n$}] = \frac{1}{n}$, so $E[\text{fraction of successful requests per $n$}] = 1 - \frac{1}{n}$. If each response must be collected from $n$ servers, we multiply the probabilities associated with the independent events to get an expression for the fraction of successful requests in a distributed system: $(1 - \frac{1}{n})^n$. As $n$ grows larger, we take the limit: $$ \displaylines{ \lim_{n \rightarrow \infty} (1 - \frac{1}{n})^n = \lim_{n \rightarrow \infty} e^{n \log(1 - \frac{1}{n})} \\ = \lim_{n \rightarrow \infty} e^{-1 + \frac{1}{n}}, \text{ because } n\log(1 + \frac{x}{n}) \rightarrow x + \frac{x^2}{n}. \text{ In this case, $x = -1$}. \\ = \lim_{n \rightarrow \infty} e^{-1 + \frac{1}{n}} \\ = \frac{1}{e} }$$ The expected fraction of unsuccessful requests is $1 - \frac{1}{e} \approx 63\%$. In class, I remember using the Taylor series after binomial expansion but this apparently isn't a rigorous enough argument as the number of terms goes to $\infty$. This is one of those problems that looks simple on the surface but is actually pretty difficult to thoroughly prove.

Sources

The Tail at Scale, Jeffrey Dean & Luis Andre Barroso

Tail at Scale: With Personal Commentary, Yao Yue

Why Discord is switching from Go to Rust

MIT 6.824 Distributed Systems (Spring 2020)