Sagas and Compensation
The Problem: Distributed Transactions
In a monolith, you wrap multiple operations in a single database transaction. If anything fails, everything rolls back. In a distributed system with multiple services, each owning its own database, there is no single transaction boundary.
Monolith:
BEGIN TRANSACTION
reserve_inventory()
charge_payment()
create_shipment()
COMMIT -- all or nothing
Microservices:
InventoryService (owns its DB) -- reserve stock
PaymentService (owns its DB) -- charge card
ShippingService (owns its DB) -- create shipment
No shared transaction. What happens when payment succeeds but shipping fails?
Two-phase commit (2PC) is the traditional solution for distributed transactions. A coordinator asks all participants to prepare, then tells them to commit. If any participant cannot prepare, all abort.
Coordinator:
Phase 1 (Prepare): "Can you commit?"
InventoryService -> "Yes, prepared"
PaymentService -> "Yes, prepared"
ShippingService -> "Yes, prepared"
Phase 2 (Commit): "Commit now"
All services commit.
If any says "No" in Phase 1:
"Abort" -> All services roll back.
Why 2PC is rarely used across services:
- Blocking. Participants hold locks from the moment they vote "yes" until they learn the outcome. If the coordinator crashes between Phase 1 and Phase 2, prepared participants are stuck holding those locks until it recovers.
- Availability. Every participant must be reachable. One service being down blocks the entire transaction.
- Latency. Every transaction needs at least two network round-trips to each participant, with locks held throughout, which limits throughput.
- Coupling. All services must support the same 2PC protocol, typically through a shared transaction manager like XA.
2PC works well within a single database cluster. It works poorly across independent services with different databases, different teams, and different deployment schedules. The saga pattern is the practical alternative.
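The two phases can be sketched in a few lines of Rust. This is an illustrative, single-process model only: the `Participant` trait, the `Service` struct, and the `service` helper are assumptions made for this example, not part of XA or any real transaction manager, and there is no network, timeout, or crash recovery here.

```rust
/// Final result of one two-phase commit round.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Committed,
    Aborted,
}

/// Hypothetical participant interface, for illustration only.
pub trait Participant {
    /// Phase 1: take locks and vote. `true` means "prepared".
    fn prepare(&mut self) -> bool;
    /// Phase 2: make the prepared changes permanent and release locks.
    fn commit(&mut self);
    /// Phase 2 alternative: discard any prepared changes and release locks.
    fn abort(&mut self);
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Idle,
    Prepared,
    Committed,
    Aborted,
}

/// In-memory stand-in for a service that owns its own database.
pub struct Service {
    pub name: &'static str,
    pub can_prepare: bool,
    pub state: State,
}

/// Helper to build a `Service` in the idle state.
pub fn service(name: &'static str, can_prepare: bool) -> Service {
    Service { name, can_prepare, state: State::Idle }
}

impl Participant for Service {
    fn prepare(&mut self) -> bool {
        if self.can_prepare {
            self.state = State::Prepared;
        }
        self.can_prepare
    }
    fn commit(&mut self) {
        self.state = State::Committed;
    }
    fn abort(&mut self) {
        self.state = State::Aborted;
    }
}

/// Coordinator: commit only if every participant votes "yes" in Phase 1.
pub fn two_phase_commit<P: Participant>(participants: &mut [P]) -> Outcome {
    // Phase 1 (prepare): `all` stops asking at the first "no" vote.
    let all_prepared = participants.iter_mut().all(|p| p.prepare());
    // Phase 2: the same decision goes to everyone. Participants that were
    // never asked to prepare also receive the abort, which must be harmless.
    for p in participants.iter_mut() {
        if all_prepared {
            p.commit();
        } else {
            p.abort();
        }
    }
    if all_prepared { Outcome::Committed } else { Outcome::Aborted }
}
```

Between `prepare` returning `true` and `commit` or `abort` arriving, a participant in this model sits in `State::Prepared`. That window is where the blocking and availability problems listed above come from.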
The Saga Pattern
A saga replaces a single distributed transaction with a sequence of local transactions. Each local transaction updates one service and publishes an event or sends a command to trigger the next step. If a step fails, previously completed steps are undone through compensating actions.
Forward actions: Compensating actions:
1. Reserve inventory -> Release inventory
2. Charge payment -> Refund payment
3. Create shipment -> Cancel shipment
If step 2 (charge payment) fails:
-> Compensate step 1: release inventory
-> Report failure to user
If step 3 (create shipment) fails:
-> Compensate step 2: refund payment
-> Compensate step 1: release inventory
-> Report failure to user
Critical rule: Every forward action that changes state must have a compensating action. Compensation is not the same as "undo" -- you cannot un-send an email or un-charge a credit card. Compensation is a semantic reverse: you send a correction email, you issue a refund.
Choreography vs Orchestration
There are two ways to coordinate a saga's steps.
Choreography
Each service listens for events and decides what to do next. There is no central coordinator.
OrderService publishes OrderPlaced
-> InventoryService hears it, reserves stock, publishes StockReserved
-> PaymentService hears it, charges card, publishes PaymentCompleted
-> ShippingService hears it, creates shipment, publishes ShipmentCreated
-> OrderService hears it, marks order as complete
If PaymentService fails:
PaymentService publishes PaymentFailed
-> InventoryService hears it, releases stock
-> OrderService hears it, marks order as failed
Advantages:
- No single point of coordination. Each service is autonomous.
- Easy to add new consumers without modifying existing services.
- Naturally aligns with event-driven architecture.
Disadvantages:
- The business flow is scattered across services. No single place shows the full workflow.
- Adding a new step requires understanding which services react to which events.
- Cycles and feedback loops can emerge, making the flow hard to reason about.
- Difficult to answer "what step is this order on?" without querying multiple services.
Choreography works well for simple flows with few steps (two or three services). Beyond that, the implicit flow becomes hard to maintain.
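To make the event flow above concrete, here is a minimal in-process sketch. Each service is modelled as a function from one incoming event to the events it publishes, and a queue stands in for the message broker. The `StockReleased`, `OrderCompleted`, and `OrderFailed` events and the `card_ok` flag are assumptions added for the example; a real system would use a broker and separate processes.

```rust
use std::collections::VecDeque;

/// Events exchanged between services. The last three are assumed names.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Event {
    OrderPlaced,
    StockReserved,
    PaymentCompleted,
    PaymentFailed,
    ShipmentCreated,
    StockReleased,
    OrderCompleted,
    OrderFailed,
}

fn inventory_service(e: Event) -> Vec<Event> {
    match e {
        Event::OrderPlaced => vec![Event::StockReserved],
        // Compensation: release stock when payment fails.
        Event::PaymentFailed => vec![Event::StockReleased],
        _ => vec![],
    }
}

fn payment_service(e: Event, card_ok: bool) -> Vec<Event> {
    match e {
        Event::StockReserved if card_ok => vec![Event::PaymentCompleted],
        Event::StockReserved => vec![Event::PaymentFailed],
        _ => vec![],
    }
}

fn shipping_service(e: Event) -> Vec<Event> {
    match e {
        Event::PaymentCompleted => vec![Event::ShipmentCreated],
        _ => vec![],
    }
}

fn order_service(e: Event) -> Vec<Event> {
    match e {
        Event::ShipmentCreated => vec![Event::OrderCompleted],
        Event::PaymentFailed => vec![Event::OrderFailed],
        _ => vec![],
    }
}

/// Delivers each event to every service in publication order and returns
/// the full event log. No function here knows the whole workflow.
pub fn run_choreography(card_ok: bool) -> Vec<Event> {
    let mut queue = VecDeque::from(vec![Event::OrderPlaced]);
    let mut log = Vec::new();
    while let Some(event) = queue.pop_front() {
        log.push(event);
        queue.extend(inventory_service(event));
        queue.extend(payment_service(event, card_ok));
        queue.extend(shipping_service(event));
        queue.extend(order_service(event));
    }
    log
}
```

Note that the only way to see the complete flow is to read the returned log or trace through all four functions, which is the maintainability cost described above.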
Orchestration
A central saga orchestrator explicitly directs each step. It sends commands to services and reacts to their responses.
OrderSaga (orchestrator):
1. Send ReserveStock command to InventoryService
2. Receive StockReserved -> Send ChargePayment to PaymentService
3. Receive PaymentCompleted -> Send CreateShipment to ShippingService
4. Receive ShipmentCreated -> Mark order as complete
If any step fails:
Run compensating actions in reverse order
Advantages:
- The entire business flow is visible in one place -- the orchestrator.
- Easy to add, remove, or reorder steps.
- Easy to answer "what step is this order on?" -- the orchestrator tracks it.
- Complex failure handling logic lives in one place.
Disadvantages:
- The orchestrator knows about all participating services (higher coupling).
- Risk of the orchestrator becoming a "god object" with too much logic.
- The orchestrator must be highly available -- if it goes down, in-flight sagas stall.
Orchestration is the better choice for complex, multi-step business processes where visibility and explicit failure handling matter.
Rust Implementation: Saga Orchestrator
/// Errors a saga step or compensation can return.
#[derive(Debug)]
pub enum SagaError {
    StepFailed { step: String, reason: String },
    CompensationFailed { step: String, reason: String },
}

/// Boxed closure type shared by forward actions and compensations.
type StepFn<Ctx> = Box<dyn Fn(&mut Ctx) -> Result<(), SagaError>>;

/// Each step in a saga: a forward action and its compensating action.
pub struct SagaStep<Ctx> {
    name: String,
    action: StepFn<Ctx>,
    compensation: StepFn<Ctx>,
}

/// A generic saga orchestrator that runs steps in order and compensates on failure.
pub struct SagaOrchestrator<Ctx> {
    steps: Vec<SagaStep<Ctx>>,
}

impl<Ctx> SagaOrchestrator<Ctx> {
    pub fn new() -> Self {
        Self { steps: Vec::new() }
    }

    pub fn add_step<A, C>(&mut self, name: &str, action: A, compensation: C)
    where
        A: Fn(&mut Ctx) -> Result<(), SagaError> + 'static,
        C: Fn(&mut Ctx) -> Result<(), SagaError> + 'static,
    {
        self.steps.push(SagaStep {
            name: name.to_string(),
            action: Box::new(action),
            compensation: Box::new(compensation),
        });
    }

    /// Execute the saga. On failure, compensate all completed steps in reverse order.
    pub fn execute(&self, ctx: &mut Ctx) -> Result<(), SagaError> {
        // Indices of steps whose forward action succeeded.
        let mut completed: Vec<usize> = Vec::new();
        for (i, step) in self.steps.iter().enumerate() {
            println!("[saga] Executing step: {}", step.name);
            match (step.action)(ctx) {
                Ok(()) => completed.push(i),
                Err(err) => {
                    println!("[saga] Step '{}' failed, compensating...", step.name);
                    // Compensate in reverse order
                    for &j in completed.iter().rev() {
                        let done = &self.steps[j];
                        println!("[saga] Compensating step: {}", done.name);
                        if let Err(comp_err) = (done.compensation)(ctx) {
                            // Compensation failure is serious -- log and continue
                            eprintln!("[saga] CRITICAL: {:?}", comp_err);
                        }
                    }
                    // Report the original step failure to the caller.
                    return Err(err);
                }
            }
        }
        println!("[saga] All steps completed successfully");
        Ok(())
    }
}

// --- Application-specific context and usage ---

pub struct OrderContext {
    pub order_id: String, // a UUID, kept as a string here to avoid an external crate
    pub customer_id: String,
    pub amount: f64,
    pub inventory_reserved: bool,
    pub payment_charged: bool,
    pub shipment_created: bool,
}

pub fn build_order_saga() -> SagaOrchestrator<OrderContext> {
    let mut saga = SagaOrchestrator::new();

    saga.add_step(
        "reserve_inventory",
        |ctx: &mut OrderContext| {
            // In production: send command to InventoryService, await response
            println!("  Reserving inventory for order {}", ctx.order_id);
            ctx.inventory_reserved = true;
            Ok(())
        },
        |ctx: &mut OrderContext| {
            println!("  Releasing inventory for order {}", ctx.order_id);
            ctx.inventory_reserved = false;
            Ok(())
        },
    );

    saga.add_step(
        "charge_payment",
        |ctx: &mut OrderContext| {
            println!("  Charging ${:.2} for order {}", ctx.amount, ctx.order_id);
            // Simulate a failure for demonstration:
            // return Err(SagaError::StepFailed {
            //     step: "charge_payment".into(),
            //     reason: "Card declined".into(),
            // });
            ctx.payment_charged = true;
            Ok(())
        },
        |ctx: &mut OrderContext| {
            println!("  Refunding ${:.2} for order {}", ctx.amount, ctx.order_id);
            ctx.payment_charged = false;
            Ok(())
        },
    );

    saga.add_step(
        "create_shipment",
        |ctx: &mut OrderContext| {
            println!("  Creating shipment for order {}", ctx.order_id);
            ctx.shipment_created = true;
            Ok(())
        },
        |ctx: &mut OrderContext| {
            println!("  Cancelling shipment for order {}", ctx.order_id);
            ctx.shipment_created = false;
            Ok(())
        },
    );

    saga
}
The orchestrator is generic over Ctx, so it can be reused for different saga types. In a production system, the actions would send commands over a message broker and the orchestrator would be async, persisting its state between steps so it can survive restarts.
Compensating Actions: Design Guidelines
Not every action has a trivial reverse. Designing compensating actions requires careful thought.
| Forward Action | Compensation | Notes |
|---------------|-------------|-------|
| Reserve inventory | Release inventory | Straightforward reversal |
| Charge credit card | Issue refund | Refund is a new transaction, not a rollback |
| Send email | Send correction email | Cannot un-send; can only follow up |
| Create database record | Mark as cancelled (soft delete) | Do not hard-delete; maintain audit trail |
| Publish event | Publish compensating event | Consumers must handle both |
| Call third-party API | Call cancellation endpoint | Not always possible; may require manual intervention |
Key principles:
- Compensations must be idempotent. A compensation may be retried when the result of the first attempt is unknown, and the retry must not apply the effect a second time. Refunding twice is worse than not refunding.
- Compensations can fail. Have a strategy: retry with backoff, dead-letter queue for manual intervention, or alerting.
- Some actions are not compensatable: sending a physical package, triggering a bank wire, or calling an external API with no cancellation endpoint. Design the saga so non-compensatable steps run last.
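One common way to make a refund idempotent is to attach an idempotency key and remember which keys have already been applied. The sketch below assumes an in-memory `PaymentLedger` invented for this example; a real payment provider would persist the keys, and the key format is an assumption too.

```rust
use std::collections::HashSet;

/// In-memory stand-in for a payment provider (hypothetical, for illustration).
#[derive(Default)]
pub struct PaymentLedger {
    /// Idempotency keys of refunds that were already applied.
    processed: HashSet<String>,
    /// Total refunded so far, in cents.
    pub refunded_cents: u64,
}

impl PaymentLedger {
    /// Applies the refund at most once per key. Returns `true` if money moved
    /// on this call and `false` if the key was seen before (a no-op retry).
    pub fn refund(&mut self, idempotency_key: &str, amount_cents: u64) -> bool {
        // `HashSet::insert` returns false when the value was already present.
        if !self.processed.insert(idempotency_key.to_string()) {
            return false;
        }
        self.refunded_cents += amount_cents;
        true
    }
}
```

With this shape, the orchestrator can retry a refund whose outcome is unknown without risking a double payout, as long as it reuses the same key for the same saga step.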
Failure Scenarios
Happy Path
All steps succeed in order. No compensation needed. The saga completes and the order moves to "confirmed."
Mid-Saga Failure
Payment fails after inventory is reserved. The orchestrator compensates the completed steps in reverse order, which here means releasing the inventory. The order is marked as "failed" with a reason.
Compensation Failure
The most dangerous scenario. A forward step fails, and then a compensating action also fails. For example, payment fails and the inventory release call times out.
Strategies:
- Retry with exponential backoff. Most transient failures resolve on retry.
- Dead-letter queue. After N retries, park the failed compensation for manual review.
- Alerting. Compensation failures need immediate human attention -- the system is in an inconsistent state.
- Reconciliation jobs. Periodic background processes that detect and fix inconsistencies.
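The first two strategies combine naturally: retry a failed compensation with a growing delay, then hand it off for manual review once the attempts run out. The following is a sketch under assumptions: `retry_with_backoff` and `RetryOutcome` are names made up for this example, errors are plain strings, and the `sleep` function is passed in so the logic can be exercised without real waiting.

```rust
use std::time::Duration;

/// Result of running a compensation with retries.
#[derive(Debug, PartialEq)]
pub enum RetryOutcome {
    /// Succeeded on the given attempt (1-based).
    Succeeded { attempts: u32 },
    /// Every attempt failed; the caller should park the work on a
    /// dead-letter queue and raise an alert.
    DeadLettered { last_error: String },
}

/// Calls `compensation` up to `max_attempts` times, doubling the delay after
/// each failure. `sleep` is injected (for example `std::thread::sleep`).
pub fn retry_with_backoff<F, S>(
    mut compensation: F,
    max_attempts: u32,
    base_delay: Duration,
    mut sleep: S,
) -> RetryOutcome
where
    F: FnMut() -> Result<(), String>,
    S: FnMut(Duration),
{
    let mut delay = base_delay;
    let mut last_error = String::from("no attempts made");
    for attempt in 1..=max_attempts {
        match compensation() {
            Ok(()) => return RetryOutcome::Succeeded { attempts: attempt },
            Err(e) => last_error = e,
        }
        // Wait only if another attempt will follow.
        if attempt < max_attempts {
            sleep(delay);
            delay = delay.saturating_mul(2);
        }
    }
    RetryOutcome::DeadLettered { last_error }
}
```

Production code would usually add jitter to the delay and cap it at a maximum. Both are left out here to keep the sketch short.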
Orchestrator Crash
If the orchestrator crashes mid-saga, it must resume when it restarts. This requires persisting the saga state (current step, completed steps, context) to a durable store. On restart, load incomplete sagas and continue from where they left off.
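A minimal sketch of that persist-and-resume loop follows. Everything here is an assumption made for illustration: `SagaState`, `InMemoryStore`, and `run` are invented names, the `HashMap` stands in for a durable table, steps are represented only by their names, and `crash_before` simulates a crash before a given step index.

```rust
use std::collections::HashMap;

/// What the orchestrator persists after every step (field names are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct SagaState {
    pub saga_id: String,
    /// Index of the next step to run; all earlier steps have completed.
    pub next_step: usize,
    pub finished: bool,
}

/// Stand-in for a durable store such as a database table.
#[derive(Default)]
pub struct InMemoryStore {
    rows: HashMap<String, SagaState>,
}

impl InMemoryStore {
    pub fn save(&mut self, state: &SagaState) {
        self.rows.insert(state.saga_id.clone(), state.clone());
    }

    /// Sagas that were started but never reached the end.
    pub fn load_incomplete(&self) -> Vec<SagaState> {
        self.rows.values().filter(|s| !s.finished).cloned().collect()
    }
}

/// Runs steps starting at `state.next_step`, saving progress after each one.
/// Executed step names are appended to `executed` so the caller can inspect them.
pub fn run(
    steps: &[&str],
    mut state: SagaState,
    store: &mut InMemoryStore,
    crash_before: Option<usize>,
    executed: &mut Vec<String>,
) -> SagaState {
    while state.next_step < steps.len() {
        if crash_before == Some(state.next_step) {
            return state; // simulated crash: nothing after this point runs
        }
        executed.push(steps[state.next_step].to_string());
        state.next_step += 1;
        state.finished = state.next_step == steps.len();
        store.save(&state);
    }
    state
}
```

One caveat with this shape: a real crash can land after a step runs but before `save` completes, so the step runs again on resume. Forward actions therefore need to be idempotent, just like compensations.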
Real-World Examples
E-Commerce Order Flow
A customer places an order. The saga coordinates:
1. Validate order -> No compensation needed (read-only)
2. Reserve inventory -> Release inventory
3. Calculate tax -> No compensation needed (read-only)
4. Authorize payment -> Void authorization
5. Confirm order -> Cancel order
6. Send confirmation -> (non-compensatable, runs last)
Notice: read-only steps (validate, calculate tax) do not need compensation. The non-compensatable step (send confirmation) is placed last, so it only runs after all compensatable steps have succeeded.
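The ordering rule can be checked mechanically when a saga is defined. The sketch below classifies each step and rejects any definition where a step is scheduled after a non-compensatable one. `Kind` and `check_step_order` are hypothetical names for this example, and the check is deliberately strict: it also rejects read-only steps that come later, since a failure there would leave the non-compensatable effect in place.

```rust
/// How a step behaves if the saga later fails.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    /// Changes nothing, so needs no compensation.
    ReadOnly,
    /// Changes state and has a compensating action.
    Compensatable,
    /// Changes state and cannot be reversed.
    NonCompensatable,
}

/// Returns an error naming the first step that is scheduled after a
/// non-compensatable step without being non-compensatable itself.
pub fn check_step_order(steps: &[(&str, Kind)]) -> Result<(), String> {
    // Name of the first non-compensatable step seen so far, if any.
    let mut pivot: Option<&str> = None;
    for &(name, kind) in steps {
        if kind == Kind::NonCompensatable {
            if pivot.is_none() {
                pivot = Some(name);
            }
        } else if let Some(earlier) = pivot {
            return Err(format!(
                "step '{}' runs after non-compensatable step '{}'",
                name, earlier
            ));
        }
    }
    Ok(())
}
```

Applied to the order flow above, the six steps pass the check as listed, while moving "send confirmation" ahead of "authorize payment" would be rejected.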
Banking Transfer
Transfer $500 from Account A to Account B.
1. Debit Account A ($500) -> Credit Account A ($500)
2. Credit Account B ($500) -> Debit Account B ($500)
If step 2 fails (Account B is closed), the compensation credits the $500 back to Account A. In practice, such transfers often place a hold on the $500 rather than debiting immediately, and the hold is either finalized or released based on step 2's outcome.
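The hold-based variant can be sketched as follows. This is a toy model under assumptions: the `Account` struct, its method names, and the use of integer cents are all choices made for this example, and both accounts live in one process, which a real inter-service transfer would not.

```rust
/// Toy account with an available balance and an amount on hold, in cents.
#[derive(Debug, Default, PartialEq)]
pub struct Account {
    pub balance: u64,
    pub held: u64,
    pub closed: bool,
}

impl Account {
    /// Step 1 forward action: reserve funds without removing them for good.
    pub fn place_hold(&mut self, amount: u64) -> Result<(), String> {
        if self.balance < amount {
            return Err("insufficient funds".to_string());
        }
        self.balance -= amount;
        self.held += amount;
        Ok(())
    }

    /// Step 2 succeeded: the held funds leave the account.
    pub fn finalize_hold(&mut self, amount: u64) {
        self.held -= amount;
    }

    /// Step 2 failed: compensation returns the held funds to the balance.
    pub fn release_hold(&mut self, amount: u64) {
        self.held -= amount;
        self.balance += amount;
    }

    /// Step 2 forward action on the receiving account.
    pub fn credit(&mut self, amount: u64) -> Result<(), String> {
        if self.closed {
            return Err("account closed".to_string());
        }
        self.balance += amount;
        Ok(())
    }
}

/// Two-step transfer saga: hold on `from`, credit `to`, then finalize or release.
pub fn transfer(from: &mut Account, to: &mut Account, amount: u64) -> Result<(), String> {
    from.place_hold(amount)?;
    match to.credit(amount) {
        Ok(()) => {
            from.finalize_hold(amount);
            Ok(())
        }
        Err(e) => {
            from.release_hold(amount);
            Err(e)
        }
    }
}
```

The advantage of the hold is that the money never appears to leave Account A unless step 2 succeeds, so a failed transfer does not show up as a debit followed by a credit.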
Travel Booking
Book a flight, hotel, and car rental as a package.
1. Reserve flight seat -> Cancel flight reservation
2. Reserve hotel room -> Cancel hotel reservation
3. Reserve rental car -> Cancel car reservation
4. Charge payment -> Refund payment
If the hotel is fully booked (step 2 fails), the flight reservation from step 1 is cancelled. The customer is not charged because step 4 has not run yet. This is why the payment step comes after all reservations -- it minimizes refunds.
Choreography vs Orchestration: When to Use Each
| Factor | Choreography | Orchestration |
|--------|-------------|---------------|
| Number of steps | 2-3 | 4+ |
| Flow complexity | Linear, no branching | Branching, conditional steps |
| Visibility | Distributed across services | Centralized in orchestrator |
| Team structure | Independent teams, loose coordination | Team owns the business process end-to-end |
| Failure handling | Each service handles its own | Centralized failure and compensation logic |
| Adding new steps | Add a new consumer | Modify the orchestrator |
| Debugging | Requires distributed tracing | Read the orchestrator's state |
In practice, many systems use a hybrid: choreography for simple, loosely-coupled event reactions (send email when user registers) and orchestration for complex multi-step business processes (order fulfillment, payment processing).
Key Takeaways
- Distributed transactions (2PC) block participants and reduce availability. They work well within a database cluster but poorly across independent services.
- The saga pattern replaces one distributed transaction with a sequence of local transactions plus compensating actions.
- Every forward action that changes state must have a defined compensating action. Design for the fact that compensation can also fail.
- Choreography distributes the flow across services -- good for simple flows, hard to debug for complex ones.
- Orchestration centralizes the flow in one place -- easier to reason about, but the orchestrator becomes a coordination point.
- Place non-compensatable steps (sending emails, calling irrevocable APIs) at the end of the saga.
- Persist saga state so the orchestrator can resume after a crash. In-memory-only sagas will lose in-flight work.