Cloud Native Patterns · intermediate · 6 min

Retry

Many failures in distributed systems are just blips — so before giving up, wait a moment and try again.

Imagine calling a friend and getting a busy signal. You don't conclude they've vanished from the earth — you wait a few seconds and dial again, and the second time it rings through. That small habit of trying again is one of the most useful instincts in distributed systems.

The Retry pattern bakes that instinct into your code: when a call to another service fails, instead of surrendering immediately, you pause briefly and attempt it again a few times before deciding it's truly broken.

The problem

When services talk to each other over a network, a surprising share of failures are transient — they clear up on their own within moments. A packet gets dropped, a connection times out under a brief load spike, or a busy service replies "too many requests, slow down" (throttling). None of these mean the operation is impossible; they just mean not right now.

If your code treats every error as permanent and gives up on the first failure, it throws away requests that would have succeeded on a second try. That's wasteful, and it pushes avoidable errors all the way up to your users.

How it works

The core idea is simple: catch the failure, wait, and try the same call again — up to a fixed number of attempts. The trick is how you wait. A good retry strategy combines three ingredients:

  • Exponential backoff — make each wait longer than the last (say 1s, then 2s, then 4s). A struggling service needs room to recover, and stretching the gaps gives it that room instead of piling on.
  • Jitter — add a small random amount to each wait. Without it, many clients that failed at the same instant would all retry at the same instant — a synchronized thundering herd that knocks the service over again. Randomizing spreads the retries out.
  • A cap on attempts — after a handful of tries, give up and surface the error. Retrying forever just turns a quick failure into a request that hangs indefinitely.
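The three ingredients above can be sketched in a few lines of Python. This is a minimal illustration, not a production client: `TransientError`, `backoff_delay`, and `call_with_retry` are hypothetical names invented for this example, and the 1s base delay, doubling factor, 25% jitter, and four-attempt cap are assumed values chosen to match the numbers in the text.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s, throttling)."""


def backoff_delay(attempt, base=1.0, factor=2.0, max_delay=30.0, rng=random):
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    # Exponential backoff: 1s, 2s, 4s, ... capped at max_delay.
    exponential = min(max_delay, base * factor ** (attempt - 1))
    # Jitter: a small random extra (assumed here to be up to 25% of the wait).
    jitter = rng.uniform(0, exponential * 0.25)
    return exponential + jitter


def call_with_retry(operation, max_attempts=4, sleep=time.sleep):
    """Run `operation`, retrying on TransientError up to `max_attempts` total tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # cap reached: surface the error instead of hanging
            sleep(backoff_delay(attempt))
```

Passing `sleep` as a parameter is a design choice for testability: a test can record the waits instead of actually pausing.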

The animation below shows a single call that fails on its first attempt, waits, retries, and finally succeeds — the request making it through on the second try without anyone upstream ever seeing an error.

[Animation: Try, back off, try again. Labels: Caller, Service, call. Caption: Many failures are temporary, so try again before giving up.]
Tip

Backoff and jitter are a team. Backoff alone keeps one client polite, but if a thousand clients all failed together they'll still retry in lockstep. Jitter breaks that synchronization, smearing the retries across time so the recovering service sees a gentle trickle instead of a wall.
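To make the lockstep effect concrete, here is a small simulation sketch. The numbers are assumptions for illustration only: 1,000 clients, a 1s base wait, up to 1s of jitter, and a 100ms window for counting how many retries arrive together. `first_retry_times` and `busiest_window` are hypothetical helpers, not part of any library.

```python
import random
from collections import Counter


def first_retry_times(clients, base=1.0, jitter=0.0, seed=0):
    """Time (in seconds) at which each client sends its first retry."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [base + rng.uniform(0, jitter) for _ in range(clients)]


def busiest_window(times, window=0.1):
    """Largest number of retries that land in the same `window`-second bucket."""
    buckets = Counter(int(t / window) for t in times)
    return buckets.most_common(1)[0][1]


lockstep = first_retry_times(1000)            # backoff only, no jitter
spread = first_retry_times(1000, jitter=1.0)  # backoff plus up to 1s of jitter

print(busiest_window(lockstep))  # 1000: every client retries at the same moment
print(busiest_window(spread))    # far fewer: retries are smeared across a second
```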

Only retry what's safe to repeat

Here's the catch that trips people up: a retry sends the same request again, and the caller often can't tell whether the first attempt was processed. If the first attempt reached the service and did its work, but the response got lost on the way back, your retry runs the operation a second time. For a read, no harm done. For a payment or an order, you've just charged the customer twice.

This is why you should only retry operations that are idempotent: ones where doing them again leaves the system in the same state as doing them once. Reads are naturally safe; for writes, you typically attach a unique key to the request and reuse that same key on every retry, so the server can recognize a repeat and skip the duplicate work.
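A minimal sketch of the unique-key idea, seen from the server's side. `PaymentService` is a hypothetical in-memory stand-in invented for this example; a real service would persist the keys and expire them, which this sketch does not attempt.

```python
import uuid


class PaymentService:
    """Hypothetical in-memory server that deduplicates by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> response already returned
        self.charges = []   # side effects actually performed

    def charge(self, idempotency_key, customer, amount_cents):
        if idempotency_key in self._results:
            # A repeat of a request already handled: replay the stored
            # response and perform no new charge.
            return self._results[idempotency_key]
        self.charges.append((customer, amount_cents))
        result = {"status": "charged", "charge_id": len(self.charges)}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())  # generated once per logical operation

first = service.charge(key, "alice", 500)
retry = service.charge(key, "alice", 500)  # e.g. the first response was lost
```

The key is created once, before the first attempt, and sent unchanged with every retry. Generating a fresh key per attempt would defeat the purpose, because the server would see each retry as a new operation.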

Watch out

Retries can't fix every error, and shouldn't try. A 400 Bad Request, a validation failure, or an authentication error is not transient: the same request will fail the same way no matter how many times you send it. Retrying these wastes time and adds load. Reserve retries for errors that genuinely might pass on a second attempt, like timeouts, 503s, and throttling responses such as 429 Too Many Requests.
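One way to encode that distinction is a small classifier consulted before each retry. The set of status codes below is an assumption for illustration; which codes are worth retrying depends on the API you are calling. `is_retryable` is a hypothetical helper, not a library function.

```python
# Assumed set of HTTP statuses that usually signal a temporary condition:
# 408 Request Timeout, 429 Too Many Requests, 502 Bad Gateway,
# 503 Service Unavailable, 504 Gateway Timeout.
RETRYABLE_STATUS = frozenset({408, 429, 502, 503, 504})


def is_retryable(status=None, error=None):
    """Return True if the failure might pass on another attempt."""
    # Network-level failures (built-in exception types) are treated as transient.
    if isinstance(error, (TimeoutError, ConnectionError)):
        return True
    # Everything else, including 400, 401, 403, and 422, fails fast.
    return status in RETRYABLE_STATUS
```

A retry loop would call this on each failure and re-raise immediately when it returns `False`.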

When to use it

Reach for retries on any remote call that can fail transiently — calls to other services, databases, message queues, or third-party APIs over an unreliable network. They're a cheap, high-leverage way to absorb the everyday turbulence of distributed systems before it ever reaches a user.

But retries aren't a cure-all, and on their own they can make a real outage worse by adding load. Pair them with a circuit breaker: let retries handle the brief blips, and let the breaker step in to stop retrying once a fault is clearly persistent. Together they give you both resilience to momentary glitches and protection against a dependency that's genuinely down.
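The pairing can be sketched as follows. This is a deliberately simplified breaker, assuming a failure threshold of 3 and a 30s cool-down; `CircuitBreaker`, `CircuitOpenError`, and `call_with_breaker` are hypothetical names for this example. For brevity the retry loop treats every exception as transient and omits jitter, which real code should not do.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = None  # time the breaker tripped, or None if closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked down; failing fast")
            self.opened_at = None  # cool-down elapsed: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        self.failures = 0  # any success resets the count
        return result


def call_with_breaker(breaker, operation, max_attempts=3, sleep=time.sleep):
    """Retry brief blips, but stop at once when the breaker is open."""
    for attempt in range(1, max_attempts + 1):
        try:
            return breaker.call(operation)
        except CircuitOpenError:
            raise  # persistent fault: do not keep retrying
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(2 ** (attempt - 1))  # plain exponential backoff: 1s, 2s, ...
```

Sharing one breaker across all callers of a dependency is what lets it notice a persistent fault: each caller's retries feed the same failure count.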

Key takeaways

  • Many failures in distributed systems are transient — a network blip, a momentary overload, a throttle — so a quick retry often succeeds where the first attempt failed.
  • Use exponential backoff (wait longer after each failure) plus jitter (randomize the wait) so retries don't hammer the service or sync up into a thundering herd.
  • Always cap the number of attempts; retrying forever just turns a brief fault into a stuck request.
  • Only retry operations that are safe to repeat — idempotent ones — or you risk duplicate side effects like double charges.
  • Don't retry non-transient errors such as a 400 Bad Request or a validation failure; the result won't change, so fail fast instead.
