Throttling & Rate Limiting

Every service has a ceiling — some number of requests per second beyond which latency spikes, memory balloons, and things start falling over. The trouble is that traffic rarely arrives politely. A viral moment, a buggy client stuck in a retry loop, or an outright abusive caller can all send a sudden flood your way.

Throttling and rate limiting are how you stay standing through that flood: instead of accepting every request and collapsing under the weight, you accept what you can safely handle and turn the rest away. A service that says "no" to a fraction of traffic is far more useful than one that says nothing at all because it has crashed.

The problem

Without any limit, your service's capacity is handed out first-come, first-served, and that's exactly the problem. When demand suddenly exceeds capacity, requests pile up, queues grow, threads and connections get tied up waiting, and response times climb for everyone. A spike aimed at one feature degrades the entire system.

It's worse on a multi-tenant service. A single runaway client hammering your API in a tight loop can consume all the capacity, starving every other customer. One caller's bad day becomes everybody's outage. You need a way to draw a line: this much, and no more, per caller.

How it works

A throttle sits in front of the service, like a bouncer at a door. For each incoming request it asks a simple question: is this caller within their allowed rate? If yes, the request passes through to the service as normal. If no, the request is rejected — typically with an HTTP 429 Too Many Requests response — or deferred for later, rather than being allowed to add to the overload.

The key idea is that the limiter makes this decision cheaply and early, before the expensive work begins. Excess traffic bounces off the front door instead of consuming the resources behind it. The animation below shows requests streaming toward a service: those within the limit are waved through, while the surplus is turned away at the gate.

[Animation: Cap the rate that reaches the service. Requests flow from Clients through the Rate limiter to the Service. Requests within the limit pass through; the rest are rejected or deferred, protecting the service.]
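
The front-door check described above can be sketched roughly as follows. This is a minimal illustration, not a production limiter: make_gate, allow_two, and expensive_handler are hypothetical names, and the "two requests per caller, ever" policy is a toy stand-in for a real rate rule. The point it shows is only that the limit check runs before the expensive handler is ever invoked.

```python
from typing import Callable


def make_gate(
    allow: Callable[[str], bool],
    handler: Callable[[str], str],
) -> Callable[[str], tuple[int, str]]:
    """Wrap `handler` so the limit check runs before any expensive work."""

    def gated(caller_id: str) -> tuple[int, str]:
        if not allow(caller_id):
            # Rejected at the door: the handler is never invoked.
            return 429, "Too Many Requests"
        return 200, handler(caller_id)

    return gated


# Toy policy (assumption for illustration): at most 2 requests per caller, ever.
counts: dict[str, int] = {}


def allow_two(caller_id: str) -> bool:
    counts[caller_id] = counts.get(caller_id, 0) + 1
    return counts[caller_id] <= 2


handler_calls: list[str] = []


def expensive_handler(caller_id: str) -> str:
    # Stand-in for the costly work the limiter is protecting.
    handler_calls.append(caller_id)
    return f"hello {caller_id}"


gate = make_gate(allow_two, expensive_handler)
statuses = [gate("alice")[0] for _ in range(4)]
```

In this sketch the third and fourth calls come back as 429, and handler_calls records only the two requests that were admitted.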

Tip

Tell clients how to behave. A 429 response should include a Retry-After header (and often RateLimit-* headers describing the limit and remaining quota). Well-built clients read these and back off gracefully — turning a hard rejection into a brief, orderly pause instead of a frantic retry storm.
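
As a rough sketch of what such a response might carry, the helper below builds the status code and headers for a rejected request. The function name too_many_requests and its parameters are hypothetical. The three RateLimit-* names follow one older naming used in the IETF rate-limit headers draft and are an assumption here; check what your gateway or framework actually emits. Retry-After is given as a whole number of seconds, rounded up so the client never retries early.

```python
import math


def too_many_requests(
    limit: int, window_reset_at: float, now: float
) -> tuple[int, dict[str, str]]:
    """Build a 429 status and headers telling the client when to retry."""
    # Retry-After accepts a non-negative integer number of seconds.
    wait = max(0, math.ceil(window_reset_at - now))
    headers = {
        "Retry-After": str(wait),
        # Assumed header names; real deployments vary.
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": "0",
        "RateLimit-Reset": str(wait),
    }
    return 429, headers


# Example: the current window resets 12.7 seconds from now.
status, headers = too_many_requests(limit=100, window_reset_at=1060.0, now=1047.3)
```

Here the client would be told to wait 13 seconds, the 12.7 remaining seconds rounded up.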

The common algorithms

Several algorithms decide whether a request fits within the limit, and they differ mainly in how they treat bursts:

Token bucket: a bucket holds tokens that refill at a steady rate, and each request spends one. A request is allowed only if a token is available. Because tokens accumulate while traffic is quiet, this tolerates short bursts up to the bucket's size, then settles to the refill rate.
Leaky bucket: requests pour into a queue that drains (leaks) at a fixed rate; if the queue overflows, new requests are dropped. This smooths bumpy traffic into a steady, predictable stream rather than allowing bursts.
Fixed window: count requests in each clock-aligned interval (say, per minute) and reject once the count hits the cap. Simple, but a caller can spend a full quota at the very end of one window and another at the start of the next, briefly reaching up to twice the intended rate across the boundary.
Sliding window: track requests over a continuously moving interval instead of fixed blocks, smoothing out that boundary spike at the cost of keeping a little more state.
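
To make the first of these concrete, here is a minimal single-process token bucket sketch. The class name TokenBucket, its parameters, and the FakeClock helper are illustrative assumptions, not a reference implementation; a real deployment would also need locking or shared storage if several workers enforce the same limit. The clock is injectable so the burst-then-settle behavior can be shown deterministically.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full, so an initial burst is allowed
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # each admitted request spends one token
            return True
        return False


class FakeClock:
    """Manually advanced clock, used here only to make the demo repeatable."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()
bucket = TokenBucket(rate=1.0, capacity=3, clock=clock)
burst = [bucket.allow() for _ in range(5)]       # bucket starts with 3 tokens
clock.now += 2.0                                 # two seconds pass: 2 tokens refill
after_wait = [bucket.allow() for _ in range(3)]
```

With these numbers the first three requests of the burst are admitted and the next two are refused; after two quiet seconds, two more requests get through before the bucket is empty again.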

Why it protects availability and fairness

Throttling defends your service in two distinct ways. First, availability: by capping total intake at or below what the system can handle, you keep it inside its safe operating envelope. A throttled service degrades gracefully — turning away the excess — instead of collapsing entirely and serving no one.

Second, fairness: with per-tenant or per-client limits, no single caller can monopolize shared capacity. Each tenant is capped at its own quota, so a noisy neighbor's spike is contained there rather than spilling over onto everyone else, and as long as the quotas together fit within total capacity, every tenant can count on its share. This isolation is what makes throttling essential for any public or multi-tenant API.
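
A small sketch of per-tenant isolation, using a fixed-window counter keyed by tenant. The class name PerTenantFixedWindow and the tenant names are hypothetical, the current time is passed in explicitly to keep the example deterministic, and old windows are never evicted, which a real limiter would need to handle.

```python
from collections import defaultdict


class PerTenantFixedWindow:
    """Count requests per (tenant, window) and reject once a tenant hits its cap."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # Sketch only: counters for past windows are never cleaned up.
        self.counts: dict[tuple[str, int], int] = defaultdict(int)

    def allow(self, tenant: str, now: float) -> bool:
        window = int(now // self.window_seconds)  # clock-aligned window index
        key = (tenant, window)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


limiter = PerTenantFixedWindow(limit=3, window_seconds=60)

# A noisy tenant fires 10 requests in one window; a quiet tenant sends 2.
noisy = [limiter.allow("noisy", now=10.0) for _ in range(10)]
quiet = [limiter.allow("quiet", now=10.0) for _ in range(2)]
```

The noisy tenant gets only its three requests through, while both of the quiet tenant's requests are admitted because its counter is separate.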

Note

Rejecting vs. buffering. Throttling rejects or sheds excess to protect the service right now. Queue-based load leveling instead buffers the excess in a queue so the service can work through it at its own pace. They solve the same overload from opposite ends — drop it or defer it — and are often used together.
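
One way the two can be combined, sketched with Python's standard queue module: buffer work up to a bound, and reject only once the buffer is full. The submit function, the job names, and the buffer size of 3 are assumptions for illustration.

```python
import queue

# Bounded buffer: absorbs a small spike, but cannot grow without limit.
buffer: "queue.Queue[str]" = queue.Queue(maxsize=3)


def submit(job: str) -> str:
    """Defer the job if there is room; shed it if the buffer is full."""
    try:
        buffer.put_nowait(job)
        return "queued"
    except queue.Full:
        return "rejected"


results = [submit(f"job-{i}") for i in range(5)]

# Later, the service drains the buffer at its own pace.
processed = []
while not buffer.empty():
    processed.append(buffer.get_nowait())
```

The first three jobs are deferred and eventually processed in order; the last two are turned away immediately rather than adding to the backlog.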

When to use it

Reach for throttling whenever a service can be overwhelmed by demand it doesn't control: public APIs, multi-tenant platforms, expensive endpoints, and anything fronting a limited downstream resource. It's the standard way to enforce usage tiers, defend against abuse and runaway clients, and keep a shared service fair under load.

It pairs naturally with related patterns. Use queue-based load leveling when the work can wait and you'd rather defer the spike than drop it; use throttling when you must protect capacity immediately. Combine it with a circuit breaker, too: the breaker protects you from a failing downstream dependency, while throttling protects you from excessive inbound demand. Together they guard both ends of the call.