Picture booking a holiday that needs a flight, a hotel, and a rental car — three separate companies, any of which might be slow, fail, or quietly drop your request. A good travel agent doesn't just fire off three bookings and hope. They track each one, chase the ones that don't confirm, and if the hotel falls through after the flight is booked, they sort out a fix rather than leaving you stranded.
The Scheduler Agent Supervisor pattern builds that diligent travel agent into your system. It coordinates a sequence of distributed steps, keeps durable track of how each one is going, and — crucially — has a dedicated watcher whose job is to notice when a step has gone wrong and put things right.
The problem
A single business action often spans several remote services, and any of them can fail in messy ways: time out, return an error, succeed but never send back a confirmation, or just hang. In a distributed system these aren't edge cases — they're routine. If you simply call each step in turn with no memory of progress, a crash halfway through leaves the work in an unknown, half-finished state with no one responsible for cleaning it up.
Worse, transient failures and permanent failures look the same in the moment. A naive flow either gives up too early on a blip or retries forever on something that will never succeed. What's missing is something that remembers the intended outcome of every step, compares it to reality, and actively shepherds stalled work toward completion or a clean rollback.
- Caller: Fires each step in turn with no durable memory of progress. If it crashes mid-flow, the work is left in an unknown, half-finished state.
- Hung step: Step 2 times out or never returns a confirmation. With no watchdog, nothing detects the stall and nothing drives a retry or rollback.
- Stranded step: Step 3 never runs because step 2 never completed. The workflow is stuck half-done with no one responsible for cleaning it up.
How it works
The pattern splits responsibilities across three roles. The scheduler starts the workflow, breaks it into steps, and writes the whole plan plus each step's status to a durable state store — so progress survives a crash. Each step is carried out by an agent: an isolated worker that talks to one remote service and reports back, shielding the rest of the system from that service's quirks.
The star is the supervisor. It periodically reads the state store looking for steps that are stuck, timed out, or failed. For a transient failure it asks the scheduler to retry the step; for one that can't be completed, it triggers compensation to undo the steps already done and leave the system consistent. Since the two kinds of failure can look identical in the moment, a common approach is to treat a failure as transient until a retry limit or deadline is exhausted, and only then compensate. Because every status is recorded, the supervisor can pick up after any interruption. The diagram below shows the scheduler launching steps, agents executing them, and the supervisor watching the state store to drive recovery.
- Scheduler: Starts the workflow, dispatches each step to an agent, and records the plan and progress in the state store.
- Agent: An isolated worker that performs one step against a remote service and reports its status back to the state store.
- Supervisor: Watches the state store for stuck or failed steps and asks the scheduler to retry them or compensate.
- State store: Durable record of every step's intended and current state, so progress survives crashes and recovery can resume.
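To make the division of labour concrete, here is a minimal Python sketch of the four roles. Everything in it is an illustrative assumption rather than a reference implementation: the names `Step`, `StateStore`, `Scheduler`, and `Supervisor` are hypothetical, the state store is an in-memory dict standing in for a durable database, agents are plain callables invoked synchronously, and "transient" is approximated by a simple `max_attempts` budget.

```python
import time
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Step:
    name: str
    status: Status = Status.PENDING
    attempts: int = 0
    deadline: float = 0.0  # after this time a RUNNING step counts as stuck


class StateStore:
    """In-memory stand-in (assumption); a real store would be durable."""

    def __init__(self):
        self.steps = {}  # dicts keep insertion order, so this is plan order

    def save(self, step):
        self.steps[step.name] = step

    def all(self):
        return list(self.steps.values())


class Scheduler:
    def __init__(self, store, agents, timeout=30.0):
        self.store = store
        self.agents = agents    # step name -> callable that talks to one service
        self.timeout = timeout

    def start(self, names):
        for name in names:      # record the whole plan before running anything
            self.store.save(Step(name))
        self.resume()

    def resume(self):
        # Run unfinished steps in plan order; stop at the first failure.
        for step in self.store.all():
            if step.status is Status.DONE:
                continue
            step.status = Status.RUNNING
            step.attempts += 1
            step.deadline = time.time() + self.timeout
            self.store.save(step)
            try:
                self.agents[step.name]()
                step.status = Status.DONE
            except Exception:
                step.status = Status.FAILED
            self.store.save(step)
            if step.status is Status.FAILED:
                return


class Supervisor:
    def __init__(self, store, scheduler, max_attempts=3):
        self.store = store
        self.scheduler = scheduler
        self.max_attempts = max_attempts

    def check(self, now=None):
        """One scan of the store; a real supervisor would run this on a timer."""
        now = time.time() if now is None else now
        for step in self.store.all():
            stuck = step.status is Status.RUNNING and now > step.deadline
            if step.status is Status.FAILED or stuck:
                if step.attempts < self.max_attempts:
                    self.scheduler.resume()   # treat as transient: retry
                    return "retried"
                return "compensate"           # budget spent: undo completed steps
        return "ok"
```

In this sketch the supervisor never calls a remote service itself. It only reads recorded state and asks the scheduler to act, which is what lets it resume after a crash of either component.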
Make every step idempotent and re-runnable. The supervisor's recovery only works safely if retrying a step that may have partly succeeded doesn't double-charge or duplicate. Design agents so that running the same step twice lands you in the same place as running it once.
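One common way to get that property is an idempotency key: the agent remembers which keys it has already completed and returns the stored result on a repeat. The sketch below assumes a hypothetical `PaymentAgent` and an injected `charge_remote` callable standing in for the real service; neither comes from a specific library.

```python
class PaymentAgent:
    """Hypothetical agent; `charge_remote` stands in for the real service call."""

    def __init__(self, charge_remote):
        self.charge_remote = charge_remote
        self.completed = {}  # idempotency key -> stored result (assumed durable)

    def run(self, idempotency_key, amount_cents):
        # A repeat of an already-completed step returns the stored result
        # instead of calling the remote service a second time.
        if idempotency_key in self.completed:
            return self.completed[idempotency_key]
        receipt = self.charge_remote(amount_cents)
        self.completed[idempotency_key] = receipt
        return receipt
```

A local record like this only covers retries that arrive after the result was saved. If the agent can crash between the remote call and the save, the remote service also needs to deduplicate on the same key.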
When to use it
Reach for this pattern when a workflow spans multiple unreliable remote services and you need it to either finish completely or unwind cleanly — order fulfilment, provisioning across cloud resources, multi-party financial transactions. It's closely related to the saga: both coordinate distributed steps and lean on compensating transactions to roll back. The distinguishing feature here is the explicit supervisor — a built-in watchdog that turns ordinary retry and recovery into a continuous, self-healing process rather than something you bolt on per call.
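As a rough illustration of what a compensating rollback might look like, the sketch below undoes completed steps newest-first. The reverse ordering is a common convention rather than something the pattern mandates, and the `compensate` function and its `undo` mapping are hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List


def compensate(completed: List[str], undo: Dict[str, Callable[[], None]]) -> List[str]:
    """Undo completed steps newest-first and return the names undone, in order."""
    undone = []
    for name in reversed(completed):
        undo[name]()          # each undo action should itself be idempotent
        undone.append(name)
    return undone
```

For the holiday example, if the flight and hotel were booked before the car rental failed for good, `compensate(["flight", "hotel"], undo)` would cancel the hotel first and then the flight.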
The cost is real complexity: a durable state store, a separate supervisor process, and the discipline of idempotent steps. For a quick local transaction or a workflow where partial failure is harmless, that machinery is overkill. But for long-running, high-stakes processes that absolutely must not be left half-done, the supervisor's tireless watching is exactly what keeps the system honest.