Latency & Throughput

When people say a system is "fast," they usually mean one of two very different things. One is how long you wait for your request to come back. The other is how much total work the system can churn through in a given window. Confusing the two is one of the most common mistakes in performance discussions — they sound similar, but they are measured differently and improved by different means.

The two measures are latency and throughput, and keeping them straight is the foundation of reasoning about performance.

What latency is

Latency is the time it takes for a single operation to complete, from the moment it starts to the moment the answer comes back. It's measured in units of time — usually milliseconds (ms), sometimes microseconds or seconds. If you click a button and the page responds in 200 ms, that 200 ms is the latency of that one request.

Latency is about one thing happening. Lower is better: it's the wait you personally experience.

What throughput is

Throughput is the rate of completed work — how many operations finish per unit of time. It's measured in things-per-time: requests per second, transactions per minute, messages per hour. If a server handles 10,000 requests every second, that's its throughput.

Throughput is about how many things happen, not how long any single one took. Higher is better: it's the total capacity of the system.
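To make the two definitions concrete, here is a minimal sketch that computes both from the same request log. The Request type, its field names, and the sample timestamps are made up for illustration; a real system would take them from its own logs or tracing data.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical log record: when a request started and finished."""
    start_ms: int
    end_ms: int


def latencies_ms(requests: list[Request]) -> list[int]:
    """Latency: one duration per request, in milliseconds."""
    return [r.end_ms - r.start_ms for r in requests]


def throughput_per_s(requests: list[Request]) -> float:
    """Throughput: completed requests divided by the observation window."""
    window_ms = max(r.end_ms for r in requests) - min(r.start_ms for r in requests)
    return len(requests) * 1000.0 / window_ms


# Four overlapping requests, each taking 200 ms, all finished within 500 ms.
log = [
    Request(0, 200),
    Request(100, 300),
    Request(200, 400),
    Request(300, 500),
]

print(latencies_ms(log))      # [200, 200, 200, 200]
print(throughput_per_s(log))  # 8.0
```

Every request waited 200 ms, yet the system completed 8 requests per second because the requests overlapped. The two numbers come from the same log but answer different questions.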

An everyday analogy: the highway

Picture a highway. Latency is the time it takes for one car to drive from the start of the road to the end — say, ten minutes. Throughput is how many cars pass a given point each minute.

Now add more lanes. A car still takes ten minutes to drive the road, so its latency is unchanged. But far more cars get through every minute, because they travel side by side. Adding lanes raised throughput without making any single car faster. To lower latency, you'd need a different fix entirely: raise the speed limit or shorten the road.
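The highway arithmetic can be sketched in a few lines. The road length, speeds, and per-lane flow below are assumed numbers chosen so that one trip takes ten minutes; only the relationships matter.

```python
def trip_minutes(road_km: float, speed_kmh: float) -> float:
    """Latency: time for one car to drive the whole road."""
    return road_km * 60.0 / speed_kmh


def cars_per_minute(lanes: int, cars_per_lane_per_minute: float) -> float:
    """Throughput: cars passing a fixed point each minute, all lanes combined."""
    return lanes * cars_per_lane_per_minute


# Assumed baseline: a 10 km road at 60 km/h, 2 lanes, 20 cars per lane per minute.
print(trip_minutes(10, 60), cars_per_minute(2, 20))   # 10.0 40

# Doubling the lanes doubles throughput; the trip still takes ten minutes.
print(trip_minutes(10, 60), cars_per_minute(4, 20))   # 10.0 80

# Doubling the speed limit halves the trip time. Lanes are left alone here;
# the effect of speed on per-lane flow is deliberately not modeled.
print(trip_minutes(10, 120))                          # 5.0
```

The lane count never appears in trip_minutes, which is the point of the analogy: it is a different lever.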

[Figure: One request vs. many. Clients send requests to a server. Caption: Latency and throughput measure two different things.]

Note

The highway makes the key point: adding lanes (throughput) and raising the speed limit (latency) are separate levers. More lanes move more cars per minute but don't speed up any one trip. That's why a system can be high-throughput and high-latency at the same time — lots of cars on the road, each still taking ten minutes.

They move independently

The crucial insight is that latency and throughput are not the same dial, and changing one does not automatically change the other. You can have:

High throughput with high latency — a system that batches work or buffers it in a queue processes huge volumes efficiently, but each individual item waits its turn before being handled. Lots of work gets done; any single piece of it takes a while.
Low latency with limited throughput — a system tuned to answer each request the instant it arrives may be very snappy for one user, yet fall over once thousands arrive at once because it has little spare capacity.

Optimizing for one can even hurt the other, so it's worth being explicit about which one actually matters for your use case.

Tail latency: averages lie

A single average latency number hides the worst experiences. If most requests take 50 ms but a slow few take 3 seconds, the average still looks healthy — yet some users are clearly suffering.

That's why teams track percentiles. The p95 latency is the value that 95% of requests come in under; p99 is the value that 99% come in under. These describe the tail: the slowest requests. At scale, the tail matters enormously. A single page view may fan out into dozens of internal calls, and the page is only as fast as the slowest of them, so a response that is rare for any one call becomes common across whole page views and user sessions.
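The fan-out effect is easy to quantify under one simplifying assumption: that the internal calls are independent of each other. Real calls often share causes of slowness, so treat this as a rough sketch rather than a prediction. The function name is illustrative.

```python
def chance_of_slow_call(fan_out: int, percentile: float = 0.99) -> float:
    """Probability that at least one of `fan_out` independent calls
    is slower than the given latency percentile (0.99 means p99)."""
    return 1.0 - percentile ** fan_out


for calls in (1, 10, 50, 100):
    print(calls, round(chance_of_slow_call(calls), 3))
```

With a single call, only 1% of page views see a response slower than p99. At 10 calls that rises to roughly 10%, at 50 calls to roughly 39%, and at 100 calls to roughly 63%.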

Tip

Measure percentiles, not averages. Report p95 and p99 alongside (or instead of) the mean. An average can stay flat while the p99 quietly doubles — and it's the p99 your unluckiest, often most active, users actually feel. If you only watch the average, you'll miss the pain.
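As a small illustration of reporting percentiles next to the mean, the sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions. The sample data is invented: 97 requests at 50 ms and 3 slow ones at 3 seconds.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100.0)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 97 fast requests and 3 slow ones.
latencies_ms = [50.0] * 97 + [3000.0] * 3

print(mean(latencies_ms))            # 138.5
print(percentile(latencies_ms, 50))  # 50.0
print(percentile(latencies_ms, 95))  # 50.0
print(percentile(latencies_ms, 99))  # 3000.0
```

The mean of 138.5 ms matches no request that actually happened, and p50 and p95 both look perfect. Only p99 shows that some users waited three seconds.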

What moves which dial

Most performance techniques target one measure more than the other:

Caching lowers latency — a cache hit returns in microseconds instead of re-running a slow query. By removing load from the slow path, it also lifts throughput, since freed-up capacity can serve more requests.
Load balancing and scaling raise throughput by spreading work across more servers: the highway's extra lanes. They add capacity, but they don't shorten the work a single request needs. Latency improves only to the extent that requests no longer wait in a queue on an overloaded server.
Batching and queues raise throughput by amortizing fixed overhead across many items, at the cost of added latency for each one.
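The caching item in the list above can be sketched with a weighted average. The hit rate, the 0.1 ms hit time, and the 50 ms miss time are assumed figures, and the result is a mean, so it says nothing about the tail: every miss still pays the full slow-path cost.

```python
def mean_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Average latency when a fraction `hit_rate` of requests is served from cache."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


def slow_path_load(requests_per_s: float, hit_rate: float) -> float:
    """Requests per second that still reach the slow query behind the cache."""
    return requests_per_s * (1.0 - hit_rate)


# Assumed figures: a hit takes 0.1 ms, a miss takes 50 ms, 1000 requests/s arrive.
for rate in (0.0, 0.5, 0.9):
    print(rate, mean_latency_ms(rate, 0.1, 50.0), slow_path_load(1000.0, rate))
```

At a 90% hit rate the mean drops from 50 ms to about 5.09 ms, and the slow path sees about 100 requests per second instead of 1000. That freed capacity is where the throughput gain comes from.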

The trade-off in practice

Almost every performance decision is a trade between these two. Batching writes into one big database operation processes far more rows per second (throughput up) but each write waits for the batch to fill (latency up). Answering every request the instant it arrives keeps latency low but caps how many you can handle before the system is overwhelmed.
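The batching trade-off can be put into numbers with a toy cost model. It assumes each database operation costs a fixed overhead plus a per-row cost, that rows arrive at a steady pace, and that a batch is sent only once it is full. All parameter names and values are hypothetical, and queueing behind earlier batches is ignored.

```python
def rows_per_second(batch_size: int, overhead_ms: float, per_row_ms: float) -> float:
    """Throughput ceiling: rows written per second when batches run back to back."""
    batch_ms = overhead_ms + batch_size * per_row_ms
    return batch_size * 1000.0 / batch_ms


def worst_case_latency_ms(batch_size: int, arrival_gap_ms: float,
                          overhead_ms: float, per_row_ms: float) -> float:
    """Latency of the first row in a batch: it waits for the batch to fill,
    then for the whole batch to be written."""
    fill_wait_ms = (batch_size - 1) * arrival_gap_ms
    return fill_wait_ms + overhead_ms + batch_size * per_row_ms


# Assumed costs: 9 ms fixed overhead per operation, 1 ms per row,
# and one new row arriving every 20 ms.
for size in (1, 10, 100):
    print(size,
          round(rows_per_second(size, 9, 1), 1),
          worst_case_latency_ms(size, 20, 9, 1))
```

Under these assumptions, writing one row at a time tops out at 100 rows per second with a 10 ms wait. A batch of 100 raises the ceiling to about 917 rows per second, but the first row in each batch waits 2089 ms.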

There's no universally "correct" answer — it depends on what you're building. A chat app lives or dies on low latency; a nightly analytics job cares only about total throughput. Decide which one matters for your workload, measure it honestly with percentiles, and tune deliberately rather than chasing both at once.