Cloud Native Patterns | intermediate | 6 min

Bulkhead

Wall off your resources into isolated pools so a single failing dependency can't drown the whole application.

A ship's hull isn't one big open hold. It's divided by bulkheads into sealed, watertight compartments, so that if the hull is breached, water floods only that one compartment instead of the entire vessel. The ship stays afloat because the damage is contained.

The bulkhead pattern borrows that idea for software. Instead of letting every request draw from one shared pile of resources, you wall those resources off into separate compartments — so that when one springs a leak, it can't sink the whole application.

The problem

Most services run on a single shared pool of finite resources: a thread pool, a connection pool, a fixed set of instances. As long as everything is healthy, that's efficient — whoever needs capacity grabs it.

The trouble starts when one dependency turns slow or starts failing. Calls to it hang instead of returning, and each hung call keeps holding its thread and connection while it waits. Under steady traffic those resources fill up fast, and once the shared pool is drained there's nothing left to serve any request — even the ones that never touched the broken dependency. A single sick component has caused resource exhaustion and dragged down the entire app in a cascading failure.
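To make the exhaustion concrete, here is a small tick-based model in Python. It is a sketch under invented assumptions, not a measurement: the function name `simulate_shared_pool`, the pool size, and the traffic numbers are all hypothetical, and healthy calls are assumed to finish within the tick in which they arrive.

```python
def simulate_shared_pool(pool_size, ticks, slow_per_tick, healthy_per_tick, hang_ticks):
    """Count healthy calls rejected because hung calls occupy a shared pool.

    Slow calls hold a worker for `hang_ticks` ticks. Healthy calls finish
    within their own tick but still need a free worker to run at all.
    """
    release_at = []  # for each held worker, the tick at which it is freed
    healthy_rejected = 0
    for t in range(ticks):
        # Free the workers whose hung calls have finally returned.
        release_at = [r for r in release_at if r > t]
        # Calls to the slow dependency grab workers and keep holding them.
        for _ in range(slow_per_tick):
            if len(release_at) < pool_size:
                release_at.append(t + hang_ticks)
        # Healthy calls never touch the slow dependency, but they draw
        # from the same pool, so they fail once nothing is free.
        free = pool_size - len(release_at)
        healthy_rejected += max(0, healthy_per_tick - free)
    return healthy_rejected


# Dependency hangs: the pool of 10 fills within a few ticks.
print(simulate_shared_pool(10, 10, slow_per_tick=2, healthy_per_tick=3, hang_ticks=100))  # 19
# Dependency responds quickly: workers are returned every tick.
print(simulate_shared_pool(10, 10, slow_per_tick=2, healthy_per_tick=3, hang_ticks=1))    # 0
```

In this toy run, 19 of the 30 healthy calls are turned away even though none of them used the broken dependency.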

How it works

The fix is to stop sharing one big pool. Partition your resources into separate, isolated pools and give each one a hard cap. A common split is per dependency (the payment API gets its own pool, the recommendations service gets another), but you can also carve pools out per tenant (one noisy customer can't starve the rest) or per criticality (checkout traffic never competes with background reporting).
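One way to sketch per-dependency partitioning in Python is a separate `ThreadPoolExecutor` for each dependency. The dependency names, the pool sizes, and the `submit` helper below are illustrative assumptions, not recommended values.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dependencies and caps; real sizes come from measured traffic.
POOL_SIZES = {"payments": 8, "recommendations": 4}

# One executor per dependency, each with its own hard cap on threads.
pools = {
    name: ThreadPoolExecutor(max_workers=size, thread_name_prefix=f"{name}-pool")
    for name, size in POOL_SIZES.items()
}


def submit(dependency, fn, *args):
    """Run fn on the pool reserved for this dependency, never on another pool's threads."""
    return pools[dependency].submit(fn, *args)
```

The same dictionary could be keyed by tenant or by criticality level instead. Note that `ThreadPoolExecutor` caps the number of threads but queues excess work without limit, so a fuller implementation would usually also bound or reject queued calls.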

Now a failure is confined to its own compartment. If the payment dependency goes slow and saturates its pool, those calls back up within that pool only — and the moment it's full, further payment calls are rejected fast rather than stealing capacity from anyone else. Every other pool keeps running on its own untouched budget, so the blast radius is limited to the one thing that broke.
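The fast-rejection behavior can be sketched with a semaphore per compartment. This is a minimal illustration, assuming a thread-based service; the `Bulkhead` class and `BulkheadFullError` are hypothetical names, not a specific library's API.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(RuntimeError):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: reject immediately rather than queue
        # behind calls that are already hanging.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            yield
        finally:
            # Always return the slot, even if the wrapped call raised.
            self._slots.release()


# Each dependency gets its own compartment with its own cap.
payments = Bulkhead("payments", max_concurrent=2)
search = Bulkhead("search", max_concurrent=2)
```

A caller would wrap each outbound call in `with payments.slot(): ...` and treat `BulkheadFullError` as a signal to fail fast or fall back. Because `search` holds its own semaphore, a saturated `payments` compartment cannot consume its slots.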

The animation below shows resources split into two isolated pools. One pool saturates and its servers turn red, while the second pool keeps serving requests normally — the failure never crosses the wall between them.

[Animation: Failures stay in their compartment. A client sends requests to two isolated pools, Pool A and Pool B. Pool B is exhausted, but Pool A is isolated and keeps serving; the blast radius is contained.]
Note

Isolation is the whole point. The value of a bulkhead isn't that it makes the broken dependency work again — it doesn't. It's that the rest of your service stays alive and responsive while that one compartment is underwater, turning a total outage into a partial, survivable one.

Pairing with other patterns

Bulkheads rarely travel alone. They define where the walls are, and other resilience patterns decide what to do inside each compartment.

A circuit breaker complements a bulkhead by detecting that a compartment's dependency is unhealthy and stopping calls to it altogether, so the pool doesn't even fill up before traffic is shed. Throttling works alongside both by capping how much load any one consumer can push, which keeps a single caller from using up a pool's budget on its own. Together, they keep one bad actor from taking everyone else down.
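As a rough sketch of the circuit breaker half of that pairing, the class below opens after a run of consecutive failures and skips calls until a cool-down has passed. The class name, the thresholds, and the injectable `clock` parameter are assumptions made for illustration; the sketch is not thread-safe and omits the bulkhead itself, which would sit behind the breaker.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Shed the call before it can occupy a slot in the pool.
                raise CircuitOpenError("circuit is open; call skipped")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Checking the breaker before taking a pool slot means an unhealthy dependency is shed without ever consuming its compartment's capacity.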

The trade-offs

Compartments aren't free. The same walls that contain failures also stop pools from sharing spare capacity, so your overall utilization drops — a pool sitting idle can't lend its threads to a pool that's momentarily slammed, the way one big shared pool could.
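A quick worked example of that utilization cost, using made-up demand numbers and a hypothetical helper `rejected_calls`:

```python
def rejected_calls(demand, caps):
    """Calls turned away when each pool can serve only up to its own cap."""
    return sum(max(0, d - c) for d, c in zip(demand, caps))


demand = [14, 2]  # pool A is momentarily slammed, pool B is nearly idle

partitioned = rejected_calls(demand, [10, 10])  # two pools capped at 10 each
shared = rejected_calls([sum(demand)], [20])    # one shared pool of 20

print(partitioned, shared)  # 4 0
```

With the same 20 workers in total, the partitioned layout rejects 4 calls while 8 workers sit idle on the other side of the wall; the shared pool would have absorbed the spike.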

There's also more to tune. Every pool needs a size, and getting it wrong cuts both ways: too small and you reject healthy traffic during normal spikes; too large and the cap stops protecting anything, because a saturated pool can exhaust the host's memory, sockets, or CPU before its limit ever kicks in. The right numbers depend on real traffic patterns and usually need adjusting over time.
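A common starting point for a pool size is Little's law: the number of calls in flight is roughly the arrival rate multiplied by the time each call takes. The helper below is a sketch under that assumption; the function name and the `headroom` factor are hypothetical, and using a high-percentile latency instead of the average is a deliberately conservative choice.

```python
import math


def suggested_pool_size(peak_calls_per_sec, latency_sec, headroom=1.25):
    """Estimate a pool cap from peak arrival rate and per-call latency.

    in-flight calls ~= arrival rate * latency (Little's law),
    padded by a headroom factor to absorb ordinary spikes.
    """
    in_flight = peak_calls_per_sec * latency_sec
    return max(1, math.ceil(in_flight * headroom))


# 40 calls/s at 250 ms each is about 10 in flight; with 25% headroom, 13.
print(suggested_pool_size(40, 0.25))  # 13
```

Treat the result as a first guess to validate against real traffic, not as a final setting.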

Watch out

Don't carve up too finely. Every pool you add reserves capacity that can't be shared, so a flood of tiny single-purpose bulkheads quietly wastes a lot of resources. Group dependencies that share a fate or a criticality level, and reserve dedicated pools for the few that genuinely need isolation.

When to use it

Reach for bulkheads when your service depends on several backends and a problem in one could starve the others — especially when those dependencies have very different reliability or latency profiles, or when one is far more critical than the rest. They're also a strong fit for multi-tenant systems, where you want to guarantee that one heavy customer can't degrade everyone else's experience.

If you only have a single dependency and no shared-resource contention to worry about, the extra pools and tuning may not earn their keep. But the moment a slow backend can cascade into a full outage, partitioning your resources is one of the cheapest ways to turn a system-wide failure into a contained, recoverable one.

Key takeaways

  • The bulkhead pattern partitions resources — threads, connections, instances — into isolated pools so a failure stays trapped in one compartment.
  • Without isolation, a single slow or failing dependency can soak up every shared resource and exhaust the whole app, causing a cascading failure.
  • Common ways to slice the pools are per dependency, per tenant, or per criticality, with each pool given its own capped budget.
  • The benefit is a contained blast radius: when one compartment floods, the others keep serving normally.
  • The cost is lower overall utilization and more tuning, since capped pools can't borrow each other's spare capacity.
