Let's Talk Latency

May 21, 2021


Semi-relevant XKCD ;)

Update

Now that I've graduated, I'm finally catching up on some more 'recent' developments in CS beyond the university curriculum. Berkeley's EECS department has its (immense) strengths, and it really does instill the tools to understand and independently reason about academic papers, industry articles, etc., but it's unfortunate that a distributed systems class isn't available to undergrads, especially considering how prevalent and applicable the concepts are to understanding practically any major website's backend design.

I attended my first 'Papers we love too' meeting today, where Yao Yue, an engineering manager at Twitter who's worked extensively on large-scale distributed systems and caching, talked about 'The Tail at Scale' (Jeffrey Dean, Luiz Andre Barroso) -- a 2013 ACM article about response-time variability and potential mitigations. I didn't end up asking any questions because I was a little nervous and unsure about the structure of the discussion (note to self: prepare questions for next time), but afterwards I dug into the paper more deeply with Yue's comments / disagreements and the audience questions in mind, and learned a lot about practical systems challenges, particularly at Twitter.

Scatter/gathering my way through other distributed systems ideas so I can gain a fuller understanding of the field's landscape, I'm going through MIT's 6.824 video lectures (only #1 so far, but I already highly recommend the playlist). Also, a random Piazza post about Go vs. Rust pointed me to a 2020 Discord blog post that added some enlightening context, which I'll get to in a bit. All these seemingly disparate pieces of information tie together, somehow. :)

The Tail at Scale

Paper Takeaways

Underlying this paper is an assumption that isn't visualized fully, according to Yue, but in her presentation she includes a nice image from a colleague on slide 3 -- the remote procedure call (RPC) dependency tree. In a nutshell, an ordinary procedure call runs a block of code locally and produces local results. In an RPC, the call is sent to another machine over a network, runs remotely, and the final result is returned to the caller. This is the building block of a distributed system, in which -- by definition -- tasks may be spread across many computers for efficient execution.
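
To make the RPC idea concrete, here's a minimal sketch using Go's standard net/rpc package. The Arith service, its Multiply method, and the port number are all made up for illustration; the point is just that the client-side call looks like a local function call while the work happens on the other end of a network connection (here, a server goroutine in the same process stands in for a remote machine).

```go
// Minimal RPC sketch using Go's standard net/rpc package.
// The Arith service and the port are invented for illustration.
package main

import (
	"fmt"
	"log"
	"net"
	"net/rpc"
)

type Args struct{ A, B int }

// Arith is a toy service; Multiply runs on the server side ("remotely").
type Arith struct{}

func (*Arith) Multiply(args *Args, reply *int) error {
	*reply = args.A * args.B
	return nil
}

func main() {
	// Server side: register the service and accept connections.
	rpc.Register(new(Arith))
	ln, err := net.Listen("tcp", "127.0.0.1:4321")
	if err != nil {
		log.Fatal(err)
	}
	go rpc.Accept(ln)

	// Client side: looks like a local call, but the work happens
	// across a network connection.
	client, err := rpc.Dial("tcp", "127.0.0.1:4321")
	if err != nil {
		log.Fatal(err)
	}
	var product int
	if err := client.Call("Arith.Multiply", &Args{A: 7, B: 6}, &product); err != nil {
		log.Fatal(err)
	}
	fmt.Println("7 * 6 =", product) // 42, computed by the "remote" server
}
```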

So, in the dependency tree, completion of a parent node's task depends on the child's result being propagated up. Along a chain, latencies add up, so the execution time of the request grows with tree depth: e.g. parent node A with child node B (200ms), which has child node C (400ms), should take ~600ms to complete. However, this depth-only view ignores the tree's other dimension -- width -- which comes from task fan-out at individual nodes. When a node's children run in parallel, the impact on latency isn't additive like depth; the node is gated by its slowest child, so the longest-running parallel task becomes the bottleneck.
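
Here's a toy model (with made-up timings) of those two dimensions: the latency of a sequential chain is the sum of its links, while a parallel fan-out is gated by its slowest child.

```go
// Toy model of an RPC dependency tree: sequential depth adds up,
// parallel fan-out is bottlenecked by the slowest child.
package main

import (
	"fmt"
	"sync"
	"time"
)

// call simulates a child RPC that takes d to respond.
func call(d time.Duration) {
	time.Sleep(d)
}

func main() {
	// Depth: A waits on B (200ms), which waits on C (400ms) -> ~600ms total.
	start := time.Now()
	call(200 * time.Millisecond)
	call(400 * time.Millisecond)
	fmt.Println("sequential chain:", time.Since(start).Round(time.Millisecond))

	// Width: A fans out to three children in parallel -> ~400ms total,
	// the latency of the slowest child, not the sum.
	start = time.Now()
	var wg sync.WaitGroup
	children := []time.Duration{100 * time.Millisecond, 250 * time.Millisecond, 400 * time.Millisecond}
	for _, d := range children {
		wg.Add(1)
		go func(d time.Duration) {
			defer wg.Done()
			call(d)
		}(d)
	}
	wg.Wait()
	fmt.Println("parallel fan-out:", time.Since(start).Round(time.Millisecond))
}
```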

Causes of variability mentioned in the paper include shared local and global resources, background daemons and maintenance activities, queueing at multiple layers, power limits and energy management, and garbage collection (both in managed runtimes and inside SSDs). Yue indicated that the list is not comprehensive, perhaps because the examples are Google File System-specific and Google had the infrastructure to run natively, sidestepping other issues like garbage collection in virtual machines, synchronization of concurrent data structures, etc. She categorized the sources of variability as resource sharing (libraries, data structures, file systems, etc.), bursty behavior (e.g. background activity occasionally occupying critical computational resources and causing transient slowness), and physical / hardware limitations. The mention of libraries confused me a little, though. It might be a Twitter-specific phenomenon, because I wasn't able to find any references to library imports affecting response time, but my educated guess is that Yue was referring to the time it takes to find a module's code on disk and bundle it with application logic, or the time needed to resolve dependencies. From my memories of CS 164 (Compilers), compiling a program to bytecode requires resolving dependencies, which might contribute to slowness or intermittent failures.

Yue additionally split up variability mitigation tactics mentioned in the article between component builders (people who manage individual nodes / service endpoints) and systems builders (people who write the services composed of components), adding some of her own tips from experience.

Component builders can reduce variability within their own service: differentiate service classes so interactive requests aren't stuck behind background work, break large requests into smaller pieces to reduce head-of-line blocking, and throttle or synchronize background activities so they don't land in the middle of user traffic. Systems builders can tolerate whatever variability remains: issue hedged or tied requests to more than one replica, use micro-partitions and selective replication to spread load, put persistently slow machines on latency-induced probation, and return good-enough results rather than wait for stragglers. One of the most interesting things Yue spoke about was updates to the hardware considerations / trends put forth in the paper, whose hardware section is now pretty outdated. Hardware moves fast! :') (And, to some extent, they seem to have given up on regular SWEs keeping up to date with these advances.)

Offloading certain software functionality to hardware can yield orders-of-magnitude speedups, but most backend code sits on a layer far above the hardware and is meant to be architecture-agnostic. I wonder how sustainable the current situation is, with hardware concerns encroaching further into software's territory. Is this even a bad thing?

Some of the biggest distributed systems challenges are related to grey failures -- situations where a component is degraded rather than cleanly dead, so requests drag on until they time out, and it's unclear whether and when it's best to retry. According to queueing theory, one slow request can back up everything queued behind it, snowballing into an issue that affects many users. Yue has another talk on the topic -- Lies, Damned Lies, and Timeouts -- and after how clear and descriptive this one was, I'll definitely be checking those slides out!
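
As a concrete sketch of the timeout-and-retry tension, here's one way to write the paper's hedged (or "backup") request idea: send the call to one replica, and if it hasn't answered within some hedge delay, send a single extra copy to another replica and take whichever response arrives first. The replica names, latencies, and delays below are all invented.

```go
// Sketch of a hedged ("backup") request: if the primary hasn't answered
// within hedgeDelay, fire one extra copy at another replica and take the
// first response. Replica names, delays, and the query itself are made up.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// query pretends to call one replica; most responses are fast,
// an occasional one lands in the slow tail.
func query(ctx context.Context, replica string) (string, error) {
	latency := 10 * time.Millisecond
	if rand.Intn(100) == 0 { // ~1% of calls are slow
		latency = time.Second
	}
	select {
	case <-time.After(latency):
		return "result from " + replica, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func hedgedQuery(ctx context.Context, replicas []string, hedgeDelay time.Duration) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancel whichever copy loses the race

	results := make(chan string, len(replicas))
	launch := func(r string) {
		if res, err := query(ctx, r); err == nil {
			results <- res
		}
	}

	go launch(replicas[0]) // primary
	select {
	case res := <-results:
		return res, nil
	case <-time.After(hedgeDelay):
		go launch(replicas[1]) // backup copy after the hedge delay
	}
	select {
	case res := <-results:
		return res, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	res, err := hedgedQuery(ctx, []string{"replica-a", "replica-b"}, 50*time.Millisecond)
	fmt.Println(res, err)
}
```

The paper's refinement is to set the hedge delay to something like the 95th-percentile latency, so the extra load from backup requests stays small while the worst of the tail gets cut off.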

Important note: most of the mitigations mentioned in this paper and talk address read requests, or requests that don't alter critical state. Many of them boil down to issuing the same request more than once, and because one of the core goals of a distributed system is failure recovery -- which also means retrying or replaying requests -- a replayed state-changing request can have unwanted side effects. The paper notes that latency isn't an issue for most 'writes', though; generally, these requests can be queued and performed more leisurely. I've noticed this firsthand with user settings changes on GitHub. Updating your profile image doesn't take effect immediately; it takes a couple of minutes, but presumably these kinds of requests occur far less frequently and are lower priority than... say... reading the files in a repo.
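
Here's a sketch of that 'queue the write and apply it later' pattern: the handler acknowledges the user as soon as the change is enqueued, and a background worker applies updates at its own pace. The ProfileUpdate type and applyUpdate function are invented for illustration.

```go
// Sketch of deferring low-priority writes: the request handler just enqueues
// the change and returns, while a background worker applies updates later.
// ProfileUpdate and applyUpdate are invented for illustration.
package main

import (
	"fmt"
	"time"
)

type ProfileUpdate struct {
	UserID   int
	ImageURL string
}

// applyUpdate stands in for the slow, durable write (database, blob store, ...).
func applyUpdate(u ProfileUpdate) {
	time.Sleep(100 * time.Millisecond)
	fmt.Printf("applied update for user %d -> %s\n", u.UserID, u.ImageURL)
}

func main() {
	queue := make(chan ProfileUpdate, 1024) // buffered: enqueueing is cheap
	done := make(chan struct{})

	// Background worker drains the queue at its own pace.
	go func() {
		for u := range queue {
			applyUpdate(u)
		}
		close(done)
	}()

	// "Request handlers" return as soon as the write is enqueued;
	// the user sees the change a little later, which is fine for low-priority writes.
	for i := 1; i <= 3; i++ {
		queue <- ProfileUpdate{UserID: i, ImageURL: fmt.Sprintf("avatar-%d.png", i)}
		fmt.Println("ack sent to user", i)
	}

	close(queue) // a real service keeps the queue open; closed here so the demo exits
	<-done
}
```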

Discord

As for how this all ties into Discord...

Discord is a messaging application, which means a lot of what it does has to happen asynchronously and at high volume. One of its hottest paths is user read state, which tracks read messages / mentions across channels and DMs using counters. These counters are updated and read thousands of times per second, so the path needs very low latency; a difference of milliseconds at any stage could be critical for the user experience.
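
A rough sketch of what a read-state structure could look like -- the field names, key layout, and locking strategy here are guesses for illustration, not Discord's actual implementation:

```go
// Sketch of a per-channel read state: an unread-mention counter plus the last
// message a user has acknowledged. Field names and locking are illustrative.
package main

import (
	"fmt"
	"sync"
)

type ReadState struct {
	LastAckedID  uint64 // last message ID the user has read
	MentionCount int    // unread mentions since then
}

type ReadStates struct {
	mu     sync.RWMutex
	states map[[2]uint64]*ReadState // keyed by (userID, channelID)
}

func NewReadStates() *ReadStates {
	return &ReadStates{states: make(map[[2]uint64]*ReadState)}
}

// OnMention is a hot write path: bump the counter for a (user, channel) pair.
func (r *ReadStates) OnMention(userID, channelID uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	key := [2]uint64{userID, channelID}
	s, ok := r.states[key]
	if !ok {
		s = &ReadState{}
		r.states[key] = s
	}
	s.MentionCount++
}

// Ack is called when the user reads a channel: reset the counter.
func (r *ReadStates) Ack(userID, channelID, messageID uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.states[[2]uint64{userID, channelID}] = &ReadState{LastAckedID: messageID}
}

// Mentions is a hot read path.
func (r *ReadStates) Mentions(userID, channelID uint64) int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	if s, ok := r.states[[2]uint64{userID, channelID}]; ok {
		return s.MentionCount
	}
	return 0
}

func main() {
	rs := NewReadStates()
	rs.OnMention(1, 42)
	rs.OnMention(1, 42)
	fmt.Println("unread mentions:", rs.Mentions(1, 42)) // 2
	rs.Ack(1, 42, 1001)
	fmt.Println("after ack:", rs.Mentions(1, 42)) // 0
}
```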

The trouble began when the Discord team noticed latency spikes every 2 minutes, like clockwork. The read state service was implemented in Go, and the scheduled spikes pointed straight at garbage collection: Go forces a collection at least every 2 minutes, and each collection had to scan the service's large in-memory cache even when little garbage had accumulated. That forced-GC interval is baked into Go's runtime and doesn't have an easy workaround. Discord tried tuning the collector and shrinking the cache so there was less memory to scan, which made the spikes smaller but hurt tail latency through extra cache misses. Noticing Rust's success in other parts of the codebase, Discord decided to port the read state service over to the language as well. Rust doesn't have a GC; its ownership model catches memory bugs at compile time, and memory is freed deterministically as soon as it's no longer needed, so there are no collection pauses.
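
This isn't what Discord did, but if you want to watch the collector's behavior in your own Go service, a minimal approach is to poll runtime.ReadMemStats and print the most recent pause (GODEBUG=gctrace=1 prints similar information without any code changes):

```go
// Minimal sketch for watching Go GC pauses: allocate in a loop and
// periodically print the number of collections and the most recent pause.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	var sink [][]byte // keeps some allocations live so the GC has work to do

	go func() {
		for {
			sink = append(sink, make([]byte, 1<<20)) // allocate 1 MiB
			if len(sink) > 256 {
				sink = sink[1:] // let older buffers become garbage
			}
			time.Sleep(time.Millisecond)
		}
	}()

	for i := 0; i < 10; i++ {
		time.Sleep(time.Second)
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		// PauseNs is a circular buffer; the most recent pause is at (NumGC+255)%256.
		last := time.Duration(m.PauseNs[(m.NumGC+255)%256])
		fmt.Printf("GCs: %d, last pause: %v, heap: %d MiB\n",
			m.NumGC, last, m.HeapAlloc>>20)
	}
}
```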

Asynchronous Rust was still in beta at the time, but Discord decided to hop on the train, and the Rust version ended up beating Go on nearly every performance metric before any serious optimization. After the rewrite, they could also raise the cache capacity and get the full benefit of the larger cache, with no increased GC activity to dampen it.

Takeaway: the Discord case study illustrates how scaling up amplifies small delays caused by a node's background activities, GC included. Discord's codebase is likely smaller than those of other companies facing similar issues, like Twitter or Facebook, which may not have the option of rewriting vast chunks in a different language. Sometimes, abstractions meant to make programmers' lives easier -- high-level languages normally want you to never think about GC -- turn into language-specific limitations. Working on performance can mean getting into the weeds to attribute a slowdown to a specific part of your infrastructure, even when the conclusion is that a fundamental assumption about the system has to be thrown out and the component rewritten.

Distributed Systems, More Generally

Why even build a distributed system? As websites and their user bases grow ever larger, it becomes infeasible to run an application's code on one machine. Distributing makes sense when you want to parallelize independent computations, tolerate failures, handle a problem that is inherently physically spread out, or enforce security / isolation boundaries. Splitting work or data across multiple machines is the obvious solution, but it introduces a new set of challenges, mostly related to concurrency, partial failures, and performance.

The goal of distributed computing is, fundamentally, abstraction: ideally, a client using a distributed service shouldn't be able to tell that it's distributed, but this is really, really hard. Sometimes, mitigating distributed failures means prioritizing speed and returning 'good enough' results (e.g. Google search). By throwing more computers at a problem, we expect performance to scale; ideally, 2x the number of nodes yields 2x the throughput.

We also expect fault tolerance, which includes availability, recoverability, and consistency. Availability and recoverability in the face of failure are closely linked: if a node goes down and becomes unavailable, you also want its data to be recoverable (usually via logging) so it can continue servicing requests correctly when it comes back up. Data usually survives failures by living in non-volatile storage (shameless plug for Pure Storage here), e.g. solid-state drives, which have the downside of being expensive to update, and/or in replicas for redundancy. The Google File System, for example, stores 3 replicas of each chunk by default, generating new ones automatically if a node goes down and loses its copy. Consistency refers to keeping shared data synchronized across nodes: a client interacting with a key-value store replicated on multiple nodes needs write requests to propagate to every copy of a key, so that reads never return stale values.
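
Here's a toy sketch of that write-propagation requirement: a write is acknowledged only after every replica has applied it, so a read served by any replica never returns a stale value. Real systems use consensus protocols (e.g. Raft, Paxos) so this still works when replicas fail or messages are lost; this sketch ignores all of that.

```go
// Toy key-value store with synchronous replication: a write is acknowledged
// only after every replica has applied it, so reads from any replica are fresh.
package main

import (
	"fmt"
	"sync"
)

type Replica struct {
	mu   sync.RWMutex
	data map[string]string
}

func NewReplica() *Replica { return &Replica{data: make(map[string]string)} }

func (r *Replica) apply(key, value string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.data[key] = value
}

func (r *Replica) read(key string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	v, ok := r.data[key]
	return v, ok
}

type Store struct{ replicas []*Replica }

// Put propagates the write to all replicas before acknowledging the client.
func (s *Store) Put(key, value string) {
	var wg sync.WaitGroup
	for _, r := range s.replicas {
		wg.Add(1)
		go func(r *Replica) {
			defer wg.Done()
			r.apply(key, value)
		}(r)
	}
	wg.Wait() // don't ack until every copy is updated
}

// Get can be served by any replica, e.g. the least-loaded one.
func (s *Store) Get(i int, key string) (string, bool) {
	return s.replicas[i%len(s.replicas)].read(key)
}

func main() {
	store := &Store{replicas: []*Replica{NewReplica(), NewReplica(), NewReplica()}}
	store.Put("profile-image", "v2.png")
	for i := range store.replicas {
		v, _ := store.Get(i, "profile-image")
		fmt.Printf("replica %d sees %s\n", i, v)
	}
}
```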

One of the formative developments in large-scale distributed data processing was MapReduce. This lecture goes into the algorithm in far more detail with examples, but the main idea is to divide the input into chunks, apply a map function to each chunk in parallel to produce intermediate key/value pairs, group those pairs by key across all nodes, and apply a reduce function per key to coalesce the data into the final output. MapReduce is a powerful idea (with open-source implementations like Hadoop) and was originally the basis for Google's big-data processing. These days Google uses something different, partly because MapReduce has glaring limitations, including awkwardness for iterative workloads: a new job, with new map / reduce functions, has to be launched for each pass over data that changes between iterations, like PageRank weights.
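
The canonical example is word count. Below is a sequential, single-process sketch with made-up Map / Reduce signatures; the real framework would run Map over input chunks on many workers and shuffle the intermediate pairs by key before calling Reduce for each key.

```go
// Sequential sketch of MapReduce word count. In the real framework, Map runs
// on many workers over input chunks, the intermediate pairs are shuffled and
// grouped by key, and Reduce runs per key; here everything runs in one process.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type KeyValue struct{ Key, Value string }

// Map emits ("word", "1") for every word in its chunk of input.
func Map(chunk string) []KeyValue {
	var out []KeyValue
	for _, w := range strings.Fields(chunk) {
		out = append(out, KeyValue{Key: strings.ToLower(w), Value: "1"})
	}
	return out
}

// Reduce collapses all values for one key into a single count.
func Reduce(key string, values []string) string {
	return strconv.Itoa(len(values))
}

func main() {
	chunks := []string{
		"the tail at scale",
		"the tail wags the dog",
	}

	// "Map phase": apply Map to every chunk (in parallel, in a real system).
	var intermediate []KeyValue
	for _, c := range chunks {
		intermediate = append(intermediate, Map(c)...)
	}

	// "Shuffle": group intermediate values by key.
	grouped := make(map[string][]string)
	for _, kv := range intermediate {
		grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
	}

	// "Reduce phase": one call per key.
	for key, values := range grouped {
		fmt.Printf("%s: %s\n", key, Reduce(key, values))
	}
}
```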

Addendum: A Proof

Variability in the latency distribution of individual components is magnified at the service level; for example, consider a system where each server typically responds in 10ms but with a 99th-percentile latency of one second. If a user request is handled on just one such server, one user request in 100 will be slow (one second)... If a user request must collect responses from 100 such servers in parallel, then 63% of user requests will take more than one second... (Dean & Barroso, 2013)
Why 63%? That number seems oddly familiar (throwback to EECS 126: Probability) -- it's \(1 - \frac{1}{e}\).

With a 99th-percentile latency of one second, an individual server's response is slow with probability \(\frac{1}{100}\), i.e. out of 100 single-server requests we expect 1 to be slow. The paper happens to use 100 both as the percentile and as a concrete example of a large fan-out \(n\).

More generally, a single server response is slow with probability \(\frac{1}{n}\) and fast with probability \(1 - \frac{1}{n}\). If a user request must collect responses from \(n\) such servers in parallel, and the servers behave independently, the probability that every response is fast is the product of the individual probabilities: \((1 - \frac{1}{n})^n\). As \(n\) grows larger, we take the limit: $$ \displaylines{ \lim_{n \rightarrow \infty} \left(1 - \frac{1}{n}\right)^n = \lim_{n \rightarrow \infty} e^{n \log\left(1 - \frac{1}{n}\right)} \\ = \lim_{n \rightarrow \infty} e^{-1 - \frac{1}{2n}}, \text{ because } n\log\left(1 + \frac{x}{n}\right) = x - \frac{x^2}{2n} + O\left(\frac{1}{n^2}\right) \text{, here with } x = -1 \\ = \frac{1}{e} }$$ So the fraction of user requests that see at least one slow server response -- and therefore take more than one second -- is \(1 - \frac{1}{e} \approx 63\%\). In class, I remember using the Taylor series after binomial expansion, but apparently that isn't a rigorous enough argument as the number of terms goes to \(\infty\). This is one of those problems that looks simple on the surface but is actually pretty difficult to prove thoroughly.
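
A quick numerical sanity check of that limit, tabulating \(1 - (1 - \frac{1}{n})^n\) for growing \(n\):

```go
// Numerical check that the fraction of fan-out requests hitting at least one
// slow server, 1 - (1 - 1/n)^n, approaches 1 - 1/e as n grows.
package main

import (
	"fmt"
	"math"
)

func main() {
	for _, n := range []float64{10, 100, 1000, 1e6} {
		slow := 1 - math.Pow(1-1/n, n)
		fmt.Printf("n = %7.0f: %.4f\n", n, slow)
	}
	fmt.Printf("1 - 1/e  : %.4f\n", 1-1/math.E)
}
```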

Sources

The Tail at Scale, Jeffrey Dean & Luiz André Barroso

Tail at Scale: With Personal Commentary, Yao Yue

Why Discord is switching from Go to Rust

MIT 6.824 Distributed Systems (Spring 2020)