Imagine a single web server handling every request to your app. It works fine on launch day. Then you get popular, traffic doubles, and that one server starts to sweat — requests queue up, latency climbs, and eventually it falls over. You could buy a beefier machine (vertical scaling), but there's always a ceiling, and a single machine is a single point of failure.
Load balancing takes the other path. Instead of one big server, you run several identical ones and put a traffic cop in front of them.
The problem: one server can't do it all
When all traffic funnels into a single server, two things go wrong as you grow:
- Capacity — one machine can only handle so many concurrent requests before it saturates CPU, memory, or network.
- Availability — if that machine restarts or crashes, your entire app is down.
Both problems have the same shape: you've put all your eggs in one basket.
- Client: A user or app sending requests, such as a browser, mobile app, or another service.
- Single Server: One machine handling every request. The bottleneck and single point of failure.
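To put rough numbers on the availability point, here is a back-of-the-envelope sketch. It assumes each server is up 99% of the time and that servers fail independently. Both are illustrative assumptions, not measurements; correlated failures (a shared rack, a bad deploy) make real systems worse than this.

```python
def availability(per_server: float, n: int) -> float:
    """Chance that at least one of n servers is up.

    Assumes failures are independent, which is an idealization.
    """
    return 1 - (1 - per_server) ** n


# 0.99 is an assumed per-server uptime, chosen only for illustration.
for n in (1, 2, 3):
    print(n, f"{availability(0.99, n):.6f}")
```

Under those assumptions, a second server moves you from roughly two nines to four, which is why redundancy matters as much as raw capacity.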
How it works
A load balancer sits between clients and your servers. Clients connect to it — they never address the servers directly. For each incoming request, the balancer picks one server from a pool of identical ones and forwards the request there.
Because the servers are interchangeable, it doesn't matter which one handles any given request. Go from one server to three and you've roughly tripled your capacity, provided nothing shared (such as the database) becomes the new bottleneck. This is horizontal scaling: you grow by adding machines, not by enlarging one.
- Clients: Users or apps sending requests into the system.
- Load Balancer: The single entry point; forwards each request to one server in the pool.
- Server: An interchangeable machine that handles a request. Add more to scale out.
- Database: Shared persistent storage the servers read from and write to.
Identical servers are the key precondition. Load balancing only works cleanly when any server can handle any request. That usually means servers are stateless — they keep no per-user data in local memory. Anything that must persist (sessions, uploads) lives in a shared store like a database or cache.
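Here is a minimal sketch of what "stateless" looks like in practice. `SharedStore` and `Server` are hypothetical names, and the store is an in-memory stand-in for a real cache or database. The point is only that neither server keeps the visit count itself, so either one can pick up where the other left off.

```python
class SharedStore:
    """In-memory stand-in for an external cache or database (hypothetical)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value


class Server:
    def __init__(self, name, store):
        self.name = name
        self.store = store  # per-user state lives here, not on the server

    def handle(self, session_id):
        # Read the visit count from the shared store, bump it, write it back.
        visits = self.store.get(session_id, 0) + 1
        self.store.set(session_id, visits)
        return f"{self.name}: visit {visits}"


store = SharedStore()
a, b = Server("server-a", store), Server("server-b", store)
print(a.handle("user-42"))  # server-a: visit 1
print(b.handle("user-42"))  # server-b: visit 2
```

With a real store shared by concurrent servers, the read-then-write step would need to be atomic (for example, an increment operation provided by the store); this sketch ignores concurrency.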
Routing around failure
The load balancer continuously runs health checks: small periodic requests to each server ("are you alive?"). When a server stops responding, the balancer marks it unhealthy and stops sending it traffic. Requests already in flight to that server, or routed to it before a check fails, may still error out, but new requests flow to the healthy servers and most users never notice.
When the sick server recovers and starts passing health checks again, it's added back to the rotation. This is what turns a pile of servers into a resilient system.
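The check-and-evict loop can be sketched in a few lines. This is a simplified model, not how any particular balancer implements it: `HealthCheckedPool` and the `probe` callback are hypothetical, and the probe here is faked with a set of "down" servers instead of a real network request.

```python
from typing import Callable, Dict, List


class HealthCheckedPool:
    def __init__(self, servers: List[str], probe: Callable[[str], bool]):
        self.probe = probe  # returns True if the server answers its check
        self.healthy: Dict[str, bool] = {s: True for s in servers}

    def run_health_checks(self) -> None:
        """Probe every server, including ones currently marked unhealthy,
        so a recovered server can rejoin the rotation."""
        for server in self.healthy:
            self.healthy[server] = self.probe(server)

    def in_rotation(self) -> List[str]:
        return [s for s, ok in self.healthy.items() if ok]


down = {"server-b"}  # simulated outage
pool = HealthCheckedPool(
    ["server-a", "server-b", "server-c"],
    probe=lambda s: s not in down,
)

pool.run_health_checks()
print(pool.in_rotation())  # ['server-a', 'server-c']

down.clear()  # server-b recovers
pool.run_health_checks()
print(pool.in_rotation())  # ['server-a', 'server-b', 'server-c']
```

This sketch flips a server's status on a single probe. Production balancers typically require several consecutive failures or successes before changing status, so one slow response doesn't eject a healthy server.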
- Clients: Users or apps sending requests into the system.
- Load Balancer: The single entry point that spreads requests across the server pool.
- Server: An interchangeable machine that handles a request. Add more to scale.
- Database: Shared persistent storage every server reads from and writes to.
How does it choose a server?
The routing algorithm decides which server gets each request:
- Round-robin — hand requests out in a cycle: 1, 2, 3, 1, 2, 3… Simple and even when requests cost about the same.
- Least connections — send the next request to whichever server is currently handling the fewest. Better when some requests are much heavier than others.
- Hashing — derive the server from something stable, like the client's IP or a URL. The same input always maps to the same server, which is useful for cache locality.
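The three strategies above are small enough to sketch side by side. The server names, connection counts, and client IP are made up for illustration, and the function names are hypothetical rather than taken from any particular load balancer.

```python
import hashlib
import itertools
from typing import Dict, Iterator, List

SERVERS = ["server-1", "server-2", "server-3"]  # hypothetical pool


def round_robin(servers: List[str]) -> Iterator[str]:
    """Hand out servers in a repeating cycle."""
    return itertools.cycle(servers)


def least_connections(active: Dict[str, int]) -> str:
    """Pick the server with the fewest in-flight requests.

    Ties go to whichever server appears first in the dict.
    """
    return min(active, key=active.get)


def hash_route(key: str, servers: List[str]) -> str:
    """Map a stable key (client IP, URL) to a server.

    Uses SHA-256 rather than Python's built-in hash(), which is
    randomized per process for strings and so is not stable.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


rr = round_robin(SERVERS)
print([next(rr) for _ in range(5)])
# ['server-1', 'server-2', 'server-3', 'server-1', 'server-2']

print(least_connections({"server-1": 4, "server-2": 1, "server-3": 7}))
# server-2

ip = "203.0.113.7"  # documentation-range IP, used as an example key
print(hash_route(ip, SERVERS) == hash_route(ip, SERVERS))  # True
```

One caveat with the modulo approach shown here: when the pool size changes, most keys map to a different server, so any cache locality is lost until the caches warm up again.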
Sticky sessions pin a given user to the same server for their whole session. It's a quick fix when servers do hold local state — but it undermines even distribution and makes failures more disruptive (a dead server takes its users' sessions with it). Prefer stateless servers with a shared session store instead.
When to reach for it
Reach for a load balancer when you need to scale past one machine or you need redundancy so a single failure doesn't take you offline — which, in practice, is almost any production web service. It's one of the most common building blocks in system design, and it pairs naturally with techniques like caching (to cut work per request) and circuit breakers (to handle downstream failures gracefully).