Explainstuff.me (beta)
Cloud Native Patterns · intermediate · 6 min

Leader Election

When several identical instances run side by side, pick exactly one to own the jobs that must happen only once.

Picture a relay team where every runner is equally fast and equally ready. The race only works if exactly one of them is holding the baton at any moment. If two grab it, chaos; if nobody does, the team stalls. Someone has to be the runner right now, and if they trip, a teammate must pick up the baton without waiting to be told.

Leader Election gives a fleet of identical service instances that same single-baton rule. They're interchangeable, but for certain jobs exactly one of them must be in charge, and the role must pass on cleanly if that one goes down.

The problem

Running many copies of a service is how we scale and stay available — and we work hard to keep those copies stateless and interchangeable so any of them can handle any request. But some tasks break if more than one instance does them. Think of a nightly aggregation job, advancing a shared workflow, or assigning work from a queue: if every instance runs it, you get duplicated effort, double charges, or corrupted state.

You can't just hard-code "instance #1 does it" — instance #1 will eventually crash, get redeployed, or scale away, and then the job never runs. You need the fleet to agree, on its own and continuously, on which single instance currently owns the special work.

[Figure: Without leader election (split brain). Three instances each declare "I'm leader" and all act at once on the single-owner job, leaving corrupted or duplicated state.]
With nothing to arbitrate ownership, every instance assumes it's the leader and runs the single-owner job at once, producing duplicated and conflicting actions.

How it works

The instances compete to acquire one shared, exclusive token — usually a lease or distributed lock backed by a store that can guarantee only one holder at a time (a database row, a blob lease, a coordination service like ZooKeeper or etcd). Whichever instance grabs it becomes the leader and takes responsibility for the single-owner work; the rest see it's taken and wait as standbys.
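To make the "only one holder at a time" guarantee concrete, here is a minimal sketch in Python. `InMemoryLeaseStore` and `try_acquire` are hypothetical names invented for this illustration; a real deployment would rely on an external store (a database row, a blob lease, ZooKeeper, etcd) rather than a lock inside one process.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class InMemoryLeaseStore:
    """Hypothetical stand-in for a store that allows one holder at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._lease: Optional[Lease] = None

    def try_acquire(self, instance_id: str, ttl: float, now: float) -> bool:
        """Atomically claim the lease if it is free, expired, or already ours."""
        with self._lock:
            current = self._lease
            held_by_other = (
                current is not None
                and current.expires_at > now
                and current.holder != instance_id
            )
            if held_by_other:
                return False  # another instance holds a live lease
            self._lease = Lease(holder=instance_id, expires_at=now + ttl)
            return True

    def holder(self, now: float) -> Optional[str]:
        """Return the current leader, or None if the lease is free or expired."""
        with self._lock:
            if self._lease is None or self._lease.expires_at <= now:
                return None
            return self._lease.holder


store = InMemoryLeaseStore()
# Three identical instances contend at the same instant; only the first wins.
results = {name: store.try_acquire(name, ttl=10.0, now=0.0) for name in ("a", "b", "c")}
print(results)  # {'a': True, 'b': False, 'c': False}
```

The important property is that the check and the write happen as one atomic step. Here a mutex provides that; in a real store it would be something like a conditional update or a compare-and-swap.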

The catch is failure. The leader must keep renewing its lease on a heartbeat, at an interval comfortably shorter than the lease duration. If it crashes or hangs, it stops renewing, the lease expires, and the standbys race to claim the now-free token, electing a fresh leader automatically with no human in the loop. Failover is therefore not instantaneous: the fleet can go without a working leader for up to one lease duration. The diagram below shows three identical instances contending for one lease, the winner leading, and the role passing on when it drops.

[Figure: Leader Election: one baton, many ready hands. Three instances contend for a shared lease; the leader renews the lease, then leads and runs the single-owner job, while two standbys wait.]
Identical instances contend for a single shared lease; the holder leads and runs the single-owner job, and if it stops renewing, a standby takes over.
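The renew-and-expire cycle can be walked through with a simulated clock. This is a sketch under stated assumptions, not a production election: `LEASE_TTL`, `HEARTBEAT`, the `Lease` class, and `tick` are all hypothetical, and the lease lives in local memory instead of a shared store.

```python
from typing import Dict, List, Optional

LEASE_TTL = 10.0  # assumed lease duration, in seconds
HEARTBEAT = 3.0   # assumed renewal interval, shorter than the TTL


class Lease:
    """Single shared lease; a real one would live in an external store."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def acquire_or_renew(self, instance_id: str, now: float) -> bool:
        free = self.holder is None or self.expires_at <= now
        if free or self.holder == instance_id:
            self.holder = instance_id
            self.expires_at = now + LEASE_TTL
            return True
        return False


def tick(lease: Lease, alive: Dict[str, bool], now: float) -> Optional[str]:
    """Every live instance tries once; return who the lease names afterwards."""
    for name, is_up in alive.items():
        if is_up:
            lease.acquire_or_renew(name, now)
    return lease.holder if lease.expires_at > now else None


lease = Lease()
alive = {"a": True, "b": True, "c": True}
history: List[Optional[str]] = []
for step in range(7):
    now = step * HEARTBEAT
    if now >= 9.0:
        alive["a"] = False  # instance "a" crashes after its renewal at t=6
    history.append(tick(lease, alive, now))

print(history)  # ['a', 'a', 'a', 'a', 'a', 'a', 'b']
```

Instance "a" last renews at t=6, so its lease runs until t=16. From t=9 to t=16 the lease still names "a" even though nobody is doing the work, and "b" only takes over at its next attempt, t=18. A shorter lease narrows that gap, at the cost of more renewal traffic and more spurious failovers when a healthy leader is merely slow.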
Tip

Don't roll your own consensus — and make leader work idempotent anyway. Correct distributed election is notoriously subtle; use a battle-tested lease or coordination primitive rather than inventing one. And design the leader's actions to be idempotent, because there's always a sliver of time where an old leader thinks it's still in charge while a new one has taken over.
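One common way to get idempotence is to tag each unit of leader work with a stable identifier and refuse to apply it twice. The sketch below assumes a hypothetical `Ledger` with a `charge_once` method and a made-up run identifier; none of these names come from a real library.

```python
from typing import Dict, Set


class Ledger:
    """Hypothetical store that records which job runs have been applied."""

    def __init__(self) -> None:
        self.applied: Set[str] = set()
        self.balances: Dict[str, int] = {}

    def charge_once(self, run_id: str, account: str, amount: int) -> bool:
        """Apply the charge only if this run_id has not been seen before."""
        if run_id in self.applied:
            return False  # duplicate attempt, e.g. from a stale leader
        self.applied.add(run_id)
        self.balances[account] = self.balances.get(account, 0) + amount
        return True


ledger = Ledger()
# The old leader and the new leader both try to run the same nightly charge.
first = ledger.charge_once("nightly-billing-run-1", "acct-1", 50)
second = ledger.charge_once("nightly-billing-run-1", "acct-1", 50)
print(first, second, ledger.balances)  # True False {'acct-1': 50}
```

In a shared store the "have we seen this run?" check and the write need to be atomic, for example through a unique constraint on the run identifier, otherwise two overlapping leaders could both pass the check.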

When to use it

Reach for leader election when a task in a multi-instance system must be performed by exactly one instance at a time, yet must survive the loss of whichever instance currently holds the role — coordinating a workflow, running a singleton scheduler, or managing shared resources.

Don't use it for work that can run in parallel. If you've got a pile of independent messages to process, competing consumers is faster and simpler — many instances pulling from the same queue, each handling different items. Leader election adds a coordination bottleneck and a single point of activity, so reserve it strictly for the jobs that genuinely demand a single owner.
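For contrast, here is roughly what competing consumers looks like when the items are independent. It is a single-process sketch using threads and Python's standard `queue.Queue` as a stand-in for a real message queue; the `worker` function and the squaring "work" are placeholders chosen for illustration.

```python
import queue
import threading
from typing import List, Tuple


def worker(name: str, tasks: "queue.Queue[int]",
           results: List[Tuple[str, int]], lock: threading.Lock) -> None:
    """Pull items until the queue is empty; each item goes to exactly one worker."""
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return
        with lock:
            results.append((name, item * item))  # placeholder for real work


tasks: "queue.Queue[int]" = queue.Queue()
for n in range(10):
    tasks.put(n)

results: List[Tuple[str, int]] = []
lock = threading.Lock()
threads = [
    threading.Thread(target=worker, args=(f"worker-{i}", tasks, results, lock))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

processed = sorted(value for _, value in results)
print(processed)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

No worker needs to be special here: the queue hands each item to one consumer, so there is no leader to elect and no lease to renew. Which worker handles which item varies from run to run, which is fine precisely because the items are independent.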

Key takeaways

  • Leader election designates one instance among many to coordinate work that must not run in parallel.
  • The other instances stand ready to take over once the failed leader's lease expires.
  • Election usually relies on a shared lease or lock that only one instance can hold at a time.
  • The leader must renew its lease; if it stops, the lease expires and the others elect a new one.
  • It's the right tool only for genuinely single-owner work — for parallelizable jobs, prefer competing consumers.
