Explainstuff.mebeta
All concepts
Cloud Native Patternsadvanced7 min

Scheduler Agent Supervisor

Coordinate a multi-step distributed action with a supervisor that watches for failed steps and drives them back to a known-good state.

Picture booking a holiday that needs a flight, a hotel, and a rental car — three separate companies, any of which might be slow, fail, or quietly drop your request. A good travel agent doesn't just fire off three bookings and hope. They track each one, chase the ones that don't confirm, and if the hotel falls through after the flight is booked, they sort out a fix rather than leaving you stranded.

The Scheduler Agent Supervisor pattern builds that diligent travel agent into your system. It coordinates a sequence of distributed steps, keeps durable track of how each one is going, and — crucially — has a dedicated watcher whose job is to notice when a step has gone wrong and put things right.

The problem

A single business action often spans several remote services, and any of them can fail in messy ways: time out, return an error, succeed but never send back a confirmation, or just hang. In a distributed system these aren't edge cases — they're routine. If you simply call each step in turn with no memory of progress, a crash halfway through leaves the work in an unknown, half-finished state with no one responsible for cleaning it up.

Worse, transient failures and permanent failures look the same in the moment. A naive flow either gives up too early on a blip or retries forever on something that will never succeed. What's missing is something that remembers the intended outcome of every step, compares it to reality, and actively shepherds stalled work toward completion or a clean rollback.

Without a supervisor — a stuck step strands the workflow
step hangs — no one notices
Caller
Step 1 · ok
Step 2 · hung
Step 3 · never runs
Remote service
The caller runs steps in sequence with no recorded state and no watcher. When step 2 hangs, nothing detects it; step 3 never starts and the multi-step job is left half-finished forever.

How it works

The pattern splits responsibilities across three roles. The scheduler starts the workflow, breaks it into steps, and writes the whole plan plus each step's status to a durable state store — so progress survives a crash. Each step is carried out by an agent: an isolated worker that talks to one remote service and reports back, shielding the rest of the system from that service's quirks.

The star is the supervisor. It periodically reads the state store looking for steps that are stuck, timed out, or failed. For a transient failure it asks the scheduler to retry the step; for one that can't be completed, it triggers compensation to undo the steps already done and leave the system consistent. Because every status is recorded, the supervisor can pick up after any interruption. The diagram below shows the scheduler launching steps, agents executing them, and the supervisor watching the state store to drive recovery.

Scheduler Agent Supervisor — a self-healing distributed workflow
run, record, recover
Scheduler
State store
Agent · step 1
Agent · step 2
Agent · step 3
Supervisor
The scheduler launches steps via agents and records progress; the supervisor reads that durable state, spots the failed step, and drives recovery through the scheduler.
Tip

Make every step idempotent and re-runnable. The supervisor's recovery only works safely if retrying a step that may have partly succeeded doesn't double-charge or duplicate. Design agents so that running the same step twice lands you in the same place as running it once.

When to use it

Reach for this pattern when a workflow spans multiple unreliable remote services and you need it to either finish completely or unwind cleanly — order fulfilment, provisioning across cloud resources, multi-party financial transactions. It's closely related to the saga: both coordinate distributed steps and lean on compensating transactions to roll back. The distinguishing feature here is the explicit supervisor — a built-in watchdog that turns ordinary retry and recovery into a continuous, self-healing process rather than something you bolt on per call.

The cost is real complexity: a durable state store, a separate supervisor process, and the discipline of idempotent steps. For a quick local transaction or a workflow where partial failure is harmless, that machinery is overkill. But for long-running, high-stakes processes that absolutely must not be left half-done, the supervisor's tireless watching is exactly what keeps the system honest.

Key takeaways

  • The scheduler kicks off a workflow and records each step's intended and current state durably.
  • Agents are the workers that actually perform each step against a remote service, isolated behind their own boundaries.
  • The supervisor monitors progress, detects steps that are stuck or failed, and triggers recovery — retry or compensation.
  • Because state is stored, the system can resume or roll back after crashes, timeouts, or partial failures.
  • It's the resilient-orchestration cousin of the saga: a built-in watchdog turns flaky distributed steps into a self-healing flow.

Keep going