Your app launches on a single server and everything is fine. Then word spreads, traffic climbs, and that one machine starts to struggle — CPU pegs at 100%, memory fills up, requests queue, and latency creeps up until things start timing out. You haven't done anything wrong; you've simply outgrown one box.
The question every growing system eventually faces is: how do we handle more load? There are exactly two answers. You can make the single machine more powerful, or you can add more machines. These are called vertical scaling and horizontal scaling, and knowing when to use each is one of the most fundamental decisions in system design.
The problem: one server runs out of room
A single server has a fixed amount of CPU, memory, disk, and network bandwidth. As traffic grows, you consume more of each until one of them becomes the bottleneck. Once that resource saturates, every extra request just makes the queue longer and the experience worse.
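To make the saturation point concrete, here is a minimal sketch of a server with a fixed capacity. The function name `backlog_over_time` and the rates (80, 100, and 120 requests per second) are illustrative assumptions, not measurements from a real system.

```python
def backlog_over_time(arrival_rate, capacity, seconds):
    """Return the number of queued requests at the end of each second.

    arrival_rate and capacity are both in requests per second.
    The values used below are made up for illustration.
    """
    backlog = 0
    history = []
    for _ in range(seconds):
        backlog += arrival_rate            # new requests arrive
        backlog -= min(backlog, capacity)  # the server drains what it can
        history.append(backlog)
    return history


# Below capacity: the queue stays empty.
print(backlog_over_time(arrival_rate=80, capacity=100, seconds=5))
# -> [0, 0, 0, 0, 0]

# Above capacity: the queue grows every second and never recovers.
print(backlog_over_time(arrival_rate=120, capacity=100, seconds=5))
# -> [20, 40, 60, 80, 100]
```

Once arrivals exceed capacity, the backlog grows without bound, which is why latency climbs until requests start timing out.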
There's a second, quieter problem too: a single server is a single point of failure. If it crashes, reboots, or needs maintenance, your entire app goes dark. So as you scale, you're really chasing two things at once — more capacity and more resilience.
Vertical scaling (scale up)
Vertical scaling — often called scaling up — means giving the one server more power: more CPU cores, more RAM, faster disks. You keep the same single machine and the same architecture; you just make it bigger. In the cloud this can be as easy as stopping the instance and restarting it on a larger size.
The animation below shows this in action. As load rises, the single server grows bigger and more powerful to absorb it — until it reaches the largest size available and can grow no further.
- Clients: Users or apps sending requests.
- Server: One machine, scaled up with more CPU/RAM until it hits a ceiling.
The appeal of scaling up is simplicity. Your code doesn't change, there's nothing to coordinate, and one machine is easy to reason about. That's why it's almost always the right first move.
But it has hard limits. There's a hardware ceiling — at some point you're already on the biggest machine money can buy, and you can't go bigger. The top-end machines are also disproportionately expensive, so the last bit of headroom costs far more than the first. And critically, you still have just one box: it remains a single point of failure, and upgrading it usually means downtime while it restarts on the new size.
Start by scaling up. For most apps, the simplest path to handling more load is to move to a bigger instance. It buys you time and breathing room without forcing you to redesign anything. Reach for horizontal scaling only once you start bumping into the ceiling, the cost, or the single-point-of-failure problem.
Horizontal scaling (scale out)
Horizontal scaling — scaling out — takes the opposite approach. Instead of one big machine, you run several smaller, identical ones and spread the work across them. Clients talk to a load balancer, which sits in front of the pool and hands each incoming request to one of the servers.
The animation below shows the idea: as load rises, new servers spin up behind the load balancer, and the incoming requests get shared out among all of them so no single machine is overwhelmed.
- Clients: Users or apps sending requests.
- Load Balancer: Spreads requests across the pool of machines.
- Server: One of many interchangeable machines. Add more to scale out.
Scaling out has a far higher ceiling than scaling up: when you need more capacity, you add more machines. It's also resilient: because the servers are interchangeable, if one dies the load balancer simply routes around it and the others pick up the slack, so no individual server is a single point of failure. You can even add and remove machines while the system stays online, which means no downtime to grow.
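The routing behavior described above can be sketched as a small round-robin balancer. This is a toy, in-process model under stated assumptions: the class `RoundRobinBalancer`, its method names, and the server names `app-1` to `app-4` are hypothetical, and a real load balancer would detect failures through health checks rather than an explicit `mark_down` call.

```python
class RoundRobinBalancer:
    """Toy load balancer: hands out servers in turn, skipping dead ones."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # position of the next server to try

    def add_server(self, server):
        # Scaling out: a new machine joins the pool while it stays online.
        self.servers.append(server)
        self.healthy.add(server)

    def mark_down(self, server):
        # Stand-in for a failed health check (an assumption of this sketch).
        self.healthy.discard(server)

    def pick(self):
        # Try each server at most once per call, in round-robin order.
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([lb.pick() for _ in range(4)])
# -> ['app-1', 'app-2', 'app-3', 'app-1']

lb.mark_down("app-2")  # one server dies
print([lb.pick() for _ in range(3)])
# -> ['app-3', 'app-1', 'app-3']

lb.add_server("app-4")  # grow the pool without stopping anything
```

Round-robin is only one possible policy; it is used here because it is the simplest way to show requests being shared out and routed around a failed machine.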
The catch is that it demands more of your design. Servers must be stateless — they can't keep per-user data in local memory, because the next request from that user might land on a different machine. Anything that must persist (sessions, uploads) has to move to a shared store like a database or cache. And running many machines adds coordination complexity: deployments, configuration, logging, and debugging all get harder when work is spread across a fleet.
Statelessness is a precondition for scaling out, not an afterthought. If a server stores a user's session or shopping cart in its own memory, you can't freely route that user to any machine — and a crashed server takes its users' data with it. Push shared state into a database or cache before you add servers, or horizontal scaling will quietly break in confusing ways.
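A minimal sketch of the difference, assuming a plain Python dict as a stand-in for the shared database or cache (in a real deployment that store would be a separate networked service). The class names `StatefulServer` and `StatelessServer` are hypothetical.

```python
class StatefulServer:
    """Keeps carts in its own memory, so other servers cannot see them."""

    def __init__(self):
        self.carts = {}

    def add_to_cart(self, user, item):
        self.carts.setdefault(user, []).append(item)

    def get_cart(self, user):
        return self.carts.get(user, [])


class StatelessServer:
    """Keeps nothing locally; every read and write goes to a shared store."""

    def __init__(self, store):
        self.store = store  # stand-in for a shared database or cache

    def add_to_cart(self, user, item):
        self.store.setdefault(user, []).append(item)

    def get_cart(self, user):
        return self.store.get(user, [])


# Stateful: the second request lands on a different machine and the cart is gone.
a, b = StatefulServer(), StatefulServer()
a.add_to_cart("alice", "book")
print(b.get_cart("alice"))  # -> []

# Stateless: any server can answer, because the state lives outside the servers.
shared_store = {}
c, d = StatelessServer(shared_store), StatelessServer(shared_store)
c.add_to_cart("alice", "book")
print(d.get_cart("alice"))  # -> ['book']
```

The same applies to failure: if server `a` crashes, Alice's cart disappears with it, while losing `c` leaves the shared store untouched.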
How this relates to latency and throughput
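Scaling is ultimately about two measures: latency and throughput. Latency is how long a single request takes from start to finish. Throughput is how much total work the system can do per unit of time, such as requests per second. Adding machines horizontally raises throughput directly: as long as requests are independent and a shared dependency such as the database does not become the new bottleneck, two servers can handle roughly twice the requests of one, and ten servers roughly ten times.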
Note what scaling out does not automatically fix: the latency of a single request. One request still runs on one server at that server's speed. Adding machines lets you serve more requests at once, but it doesn't make any individual request faster — for that you'd look at faster hardware, caching, or doing less work per request.
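The arithmetic can be sketched as follows, assuming ideal scaling with no shared bottleneck. The function `fleet_throughput` and the figures (10 concurrent requests per server, 200 ms per request) are illustrative assumptions.

```python
def fleet_throughput(servers, concurrency_per_server, latency_ms):
    """Ideal requests per second for a pool of identical servers.

    Each server works on `concurrency_per_server` requests at a time,
    and each request takes `latency_ms` to complete.
    """
    per_server = concurrency_per_server * 1000 / latency_ms
    return servers * per_server


for n in (1, 2, 10):
    rps = fleet_throughput(n, concurrency_per_server=10, latency_ms=200)
    print(f"{n:>2} server(s): {rps:.0f} req/s, each request still takes 200 ms")
# ->  1 server(s): 50 req/s, each request still takes 200 ms
# ->  2 server(s): 100 req/s, each request still takes 200 ms
# -> 10 server(s): 500 req/s, each request still takes 200 ms
```

The server count multiplies throughput, but `latency_ms` never changes with it: only making the request itself cheaper or the machine faster lowers that number.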
When to choose each
In practice the two approaches are a sequence, not a rivalry. Scale up first, because it's the simplest way to buy capacity and it keeps your architecture unchanged. Most applications can go a surprisingly long way on a single well-sized machine.
Then scale out once you hit the limits of scaling up — when you've reached the hardware ceiling, when bigger machines stop being cost-effective, or when you can no longer tolerate having a single point of failure. Many mature systems end up doing both: each machine in a horizontally scaled pool is itself a reasonably beefy (vertically scaled) box. The art is knowing which lever to pull, and when.
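The sequence above can be written down as a small decision rule. This is only a sketch of the reasoning in this section: the function `next_scaling_step` and its three yes/no inputs are hypothetical, and deciding each answer for a real system is the hard part.

```python
def next_scaling_step(at_hardware_ceiling,
                      bigger_machine_is_cost_effective,
                      can_tolerate_single_point_of_failure):
    """Return the lever to pull next, following the scale-up-first rule."""
    if at_hardware_ceiling:
        return "scale out"  # there is no bigger machine to move to
    if not bigger_machine_is_cost_effective:
        return "scale out"  # the next size up costs more than it is worth
    if not can_tolerate_single_point_of_failure:
        return "scale out"  # one box is no longer acceptable
    return "scale up"       # simplest option, architecture unchanged


# An early-stage app with headroom left on a single machine:
print(next_scaling_step(False, True, True))   # -> scale up

# The same app once an outage of the one box is unacceptable:
print(next_scaling_step(False, True, False))  # -> scale out
```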