Imagine placing an online order. Behind that one button click, several independent services have to cooperate: one records the order, another reserves the item from stock, a third charges your card, and a fourth schedules the shipment. Each of those services owns its own database. So what happens when the payment succeeds but shipping reports the item is actually out of stock? You've taken the customer's money for something you can't deliver, and there's no single "undo" button across four separate systems.
The problem
Inside a single database, this is a solved problem: wrap everything in one ACID transaction and the database guarantees it all commits together or not at all. But a business operation that spans place order → reserve stock → charge payment → ship touches four services, each with a separate database, often on different machines.
In a typical microservices deployment there is no shared transaction manager that can lock all four databases at once and roll them back as a unit. Protocols such as two-phase commit can approximate one, but holding locks across services for the duration of the whole operation would be disastrous for availability: you'd be freezing inventory and payment systems while you wait on the network. We need a way to keep the operation consistent without a single all-or-nothing transaction.
How it works
A saga reframes the operation as a sequence of local transactions, one per service. Each service does its own small piece of work, commits it to its own database, and then triggers the next step. Place the order, then reserve the stock, then charge the payment, then ship — each commit is real and durable on its own.
The trick is the failure path. If a later step fails, the saga walks backward through the steps that already succeeded and runs a compensating transaction for each one: refund the payment, release the reserved stock, cancel the order. These compensations don't magically restore the old database state — they're new transactions that semantically undo the effect of the earlier ones. It's a logical rollback, not a true database rollback. The animation below shows the forward chain of local transactions, a failure partway through, and the compensations running in reverse to unwind the completed steps.
- Service step: each service commits its own local transaction in sequence.
Compensation isn't time travel. A database rollback erases a transaction as if it never happened. A compensation is a brand-new action that counteracts a committed one — a refund offsets a charge, it doesn't delete the original charge. That's why your domain has to define a sensible "undo" for every step; some effects (an email already sent, a missile already launched) can't truly be taken back.
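To make the "offset, don't erase" idea concrete, here is a minimal sketch of an append-only ledger. The `Ledger` class and its method names are hypothetical, invented for illustration; the point is only that a refund is a new entry sitting next to the original charge.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Hypothetical append-only record of money movements for one customer."""
    entries: list[tuple[str, int]] = field(default_factory=list)

    def charge(self, amount_cents: int) -> None:
        self.entries.append(("charge", amount_cents))

    def refund(self, amount_cents: int) -> None:
        # Compensation: a new entry that offsets the charge.
        # The original charge stays in the history.
        self.entries.append(("refund", -amount_cents))

    def balance(self) -> int:
        return sum(amount for _, amount in self.entries)


ledger = Ledger()
ledger.charge(4999)
ledger.refund(4999)
```

After the refund the balance is back to zero, but the history still holds two entries. A database rollback would have left none.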
Two ways to coordinate: choreography vs. orchestration
Sagas come in two coordination styles. In choreography, there is no central brain: each service publishes an event when it finishes its local transaction, and the next service listens for that event and reacts. The order service emits OrderPlaced, inventory reacts and emits StockReserved, payment reacts to that, and so on. This is naturally pub/sub-driven and pairs well with event sourcing, keeping services loosely coupled — but the overall flow is implicit, smeared across many services and hard to see in one place.
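Choreography can be sketched with a tiny in-memory event bus standing in for a real message broker. This is an illustration under assumptions, not a production design: `EventBus` and `wire_services` are hypothetical, delivery is synchronous, and every event name other than OrderPlaced and StockReserved is made up for the example.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """Synchronous in-memory pub/sub (hypothetical stand-in for a broker)."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)
        self.log: list[str] = []  # every event published, in order

    def subscribe(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        self.log.append(event)
        for handler in list(self._handlers[event]):
            handler(payload)


def wire_services(bus: EventBus) -> None:
    # Each service knows only which event it reacts to and which it emits.
    def inventory_on_order_placed(order: dict) -> None:
        bus.publish("StockReserved", order)  # after its local transaction

    def payment_on_stock_reserved(order: dict) -> None:
        outcome = "PaymentCharged" if order["card_ok"] else "PaymentFailed"
        bus.publish(outcome, order)

    def inventory_on_payment_failed(order: dict) -> None:
        bus.publish("StockReleased", order)  # compensation: release stock

    def orders_on_stock_released(order: dict) -> None:
        bus.publish("OrderCancelled", order)  # compensation: cancel order

    bus.subscribe("OrderPlaced", inventory_on_order_placed)
    bus.subscribe("StockReserved", payment_on_stock_reserved)
    bus.subscribe("PaymentFailed", inventory_on_payment_failed)
    bus.subscribe("StockReleased", orders_on_stock_released)


bus = EventBus()
wire_services(bus)
bus.publish("OrderPlaced", {"order_id": "o-1", "card_ok": False})
```

Notice that no single function describes the whole flow. You only recover it by reading `bus.log` or by tracing the subscriptions, which is exactly the visibility cost described above.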
In orchestration, a central coordinator (the orchestrator) owns the workflow. It calls each service in turn, waits for the result, and decides what to do next — including which compensations to fire on failure. The logic lives in one place, which makes the saga far easier to understand, monitor, and modify, at the cost of a component that every step now depends on.
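An orchestrator can be sketched in a few lines, assuming each step is just a pair of callables. `Step`, `run_saga`, and the helpers below are hypothetical names, and a real orchestrator would also persist its progress so it can resume after a crash; this sketch keeps everything in memory.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # local transaction in one service
    compensate: Callable[[], None]  # semantic undo for that transaction


def run_saga(steps: list[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


trace: list[str] = []


def record(message: str) -> Callable[[], None]:
    # Stand-in for a real service call: just note that it happened.
    return lambda: trace.append(message)


def out_of_stock() -> None:
    raise RuntimeError("item out of stock")


steps = [
    Step("order", record("place order"), record("cancel order")),
    Step("stock", record("reserve stock"), record("release stock")),
    Step("payment", record("charge payment"), record("refund payment")),
    Step("shipping", out_of_stock, record("cancel shipment")),
]
ok = run_saga(steps)
```

With shipping failing, `trace` shows the three forward steps followed by refund, release, and cancel, in that order. The whole workflow, including the failure path, is readable in one function.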
Consistency, isolation, and idempotency
Because each step commits independently, a saga gives you eventual consistency, not the instant all-or-nothing consistency of a single ACID transaction. For a window of time the system sits in a partial state — the order exists and the payment is taken, but the shipment hasn't been arranged yet.
That partial state is also visible to other readers: a saga has no isolation, so another request can observe stock that's been reserved but not yet paid for. And because messages get retried and steps can re-run after a crash, every step and every compensation must be idempotent — running it twice must produce the same result as running it once, or you'll double-charge a card or release the same stock twice.
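One common way to get idempotency is to tag each request with a key and remember what you did for it. The sketch below assumes a hypothetical `PaymentService` that keeps its results in memory; in a real service the key lookup and the charge would need to be recorded in the same local transaction.

```python
class PaymentService:
    """Hypothetical payment service that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self.charged_cents = 0
        self._results: dict[str, str] = {}  # idempotency key -> receipt id

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A retry with a key we have already processed returns the stored
        # result instead of charging the card again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charged_cents += amount_cents
        receipt = f"receipt-{len(self._results) + 1}"
        self._results[idempotency_key] = receipt
        return receipt


payments = PaymentService()
first = payments.charge("order-42:charge", 4999)
retry = payments.charge("order-42:charge", 4999)  # redelivered message
```

Both calls return the same receipt and the card is charged once. The same trick applies to compensations: a refund keyed by `"order-42:refund"` can be retried safely.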
The intermediate states are real, and others can see them. With no isolation, a customer might briefly see an order marked "placed" that's about to be cancelled by a compensation, and a concurrent operation can act on stock that's only tentatively reserved. Design every step to tolerate these in-between states — use pending/confirmed status flags rather than assuming each step is the final word.
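Here is one way the pending/confirmed idea might look for stock reservations. The `Inventory` class and `Status` values are hypothetical; the sketch assumes that a pending reservation holds stock just like a confirmed one, so a concurrent order can't claim units that are only tentatively reserved.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"      # reserved, saga still in flight
    CONFIRMED = "confirmed"  # saga finished successfully
    CANCELLED = "cancelled"  # released by a compensation


class Inventory:
    def __init__(self, on_hand: int) -> None:
        self.on_hand = on_hand
        self.reservations: dict[str, tuple[int, Status]] = {}

    def available(self) -> int:
        # Pending and confirmed reservations both hold stock.
        held = sum(qty for qty, status in self.reservations.values()
                   if status is not Status.CANCELLED)
        return self.on_hand - held

    def reserve(self, order_id: str, qty: int) -> bool:
        if qty > self.available():
            return False
        self.reservations[order_id] = (qty, Status.PENDING)
        return True

    def _set_status(self, order_id: str, status: Status) -> None:
        qty, _ = self.reservations[order_id]
        self.reservations[order_id] = (qty, status)

    def confirm(self, order_id: str) -> None:
        self._set_status(order_id, Status.CONFIRMED)

    def cancel(self, order_id: str) -> None:
        # Compensation: safe to run twice, the status just stays CANCELLED.
        self._set_status(order_id, Status.CANCELLED)


inventory = Inventory(on_hand=5)
inventory.reserve("o-1", 3)           # tentatively held while o-1's saga runs
second = inventory.reserve("o-2", 3)  # rejected: only 2 units are free
inventory.cancel("o-1")               # o-1's saga failed and compensated
third = inventory.reserve("o-2", 3)   # now succeeds
inventory.confirm("o-2")
```

Readers that care about the difference can filter on status, for example showing customers only confirmed orders.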
The trade-offs
Sagas buy you cross-service consistency, but they aren't free:
- Complexity — you've replaced one transaction with a distributed workflow, plus a mirror-image set of compensation logic and the messaging or coordination plumbing to drive it. That's a lot more moving parts to build, test, and reason about.
- No isolation — intermediate states leak, so every part of the system has to be written with partial progress in mind.
- Designing compensations is hard — for each step you must define a meaningful undo, handle the case where the compensation itself fails (and must be retried), and accept that some real-world effects simply can't be reversed and need human or manual handling instead.
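The last point, a compensation that itself fails, can be sketched as a retry loop with an escape hatch. `compensate_with_retry` and the manual-review queue are hypothetical; the sketch assumes the compensation is idempotent (so retrying is safe) and that a plain list is enough to represent work handed to a human.

```python
import time
from typing import Callable


def compensate_with_retry(
    compensate: Callable[[], None],
    manual_queue: list[str],
    label: str,
    attempts: int = 3,
    base_delay: float = 0.0,
) -> bool:
    """Retry a compensation; escalate to manual review if it keeps failing."""
    for attempt in range(attempts):
        try:
            compensate()
            return True
        except Exception:
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    manual_queue.append(label)
    return False


calls = {"refund": 0}


def flaky_refund() -> None:
    # Fails twice (say, a gateway timeout), then succeeds.
    calls["refund"] += 1
    if calls["refund"] < 3:
        raise ConnectionError("payment gateway timeout")


def unsend_email() -> None:
    # Some effects cannot be reversed, no matter how often you retry.
    raise RuntimeError("email already delivered")


manual_queue: list[str] = []
refund_ok = compensate_with_retry(flaky_refund, manual_queue, "refund order-42")
email_ok = compensate_with_retry(unsend_email, manual_queue, "follow up with customer, order-42")
```

The refund goes through on the third attempt, while the email case lands in `manual_queue` for someone to handle by hand.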
When to use it
Reach for a saga when a single business operation must span multiple services or databases and you genuinely cannot wrap it in one ACID transaction — the classic microservices situation of distributed data you still need to keep consistent. It's the standard answer for long-running, multi-step workflows like order fulfillment, booking and travel itineraries, or any "do these five things across five teams, and undo them if one fails" process.
If, on the other hand, your operation lives inside a single service and database, don't reach for a saga at all — use a plain local transaction and let ACID do the work. Sagas are the price you pay for distribution, so only pay it when distribution is what forced your hand.
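For contrast, here is what the single-database case looks like, sketched with Python's built-in `sqlite3` module. The table layout and `place_order` function are made up for the example; the `CHECK (qty >= 0)` constraint is an assumption used to force a failure when stock runs out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock (item TEXT PRIMARY KEY, qty INTEGER CHECK (qty >= 0))"
)
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, item TEXT)")
conn.execute("INSERT INTO stock VALUES ('widget', 1)")
conn.commit()


def place_order(order_id: str, item: str) -> bool:
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if the block raises.
        with conn:
            conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, item))
            conn.execute(
                "UPDATE stock SET qty = qty - 1 WHERE item = ?", (item,)
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = place_order("o-1", "widget")   # takes the last unit
second = place_order("o-2", "widget")  # CHECK fails, whole transaction rolls back
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The second order's row disappears along with the failed stock update, with no compensation logic in sight. That is the work ACID does for you when everything lives in one database.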