Delivery Guarantees

Delivery guarantees define how a messaging system handles message loss and duplication. The choice between at-most-once, at-least-once, and exactly-once profoundly impacts system design, performance, and correctness.

At-Most-Once Delivery

The system makes one attempt to deliver each message. If delivery fails, the message is lost. No retries.

At-most-once flow:
  Producer --> Broker --> Consumer
  
  If broker-to-consumer delivery fails:
    Message is gone. No retry. No redelivery.

When At-Most-Once Works

This is appropriate when individual messages have low value and the cost of duplication exceeds the cost of loss. Telemetry and real-time metrics are classic examples.

Use cases for at-most-once:
  - Live video frame delivery (missed frame is better than delayed/duplicate)
  - Real-time game position updates (next update overwrites anyway)
  - High-frequency sensor readings (one missed reading is noise)
  - UDP-based protocols where speed matters more than completeness

Twitch uses at-most-once semantics for chat messages during high-traffic streams. Dropping an occasional message during a peak moment is acceptable; duplicating messages would be more disruptive to the user experience.

Implementation

Producer: fire and forget (no ack from broker)
Broker: deliver once, do not track delivery status
Consumer: no acknowledgment required

Configuration example (Kafka):
  acks=0  (producer does not wait for broker confirmation)

At-Least-Once Delivery

The system guarantees every message is delivered at least once. If delivery status is uncertain, the system retries. Duplicates are possible.

At-least-once flow:
  Producer --> Broker: send message
  Broker --> Producer: ack received
  (if no ack, producer retries --> broker may receive duplicate)

  Broker --> Consumer: deliver message
  Consumer --> Broker: ack processed
  (if no ack, broker redelivers --> consumer may receive duplicate)

Why This Is the Default

At-least-once is the most common guarantee because it prevents data loss without requiring the complexity of exactly-once. Most business operations can tolerate duplicates if consumers are designed to handle them.

At-least-once in practice:
  Amazon SQS: default behavior, messages redelivered if not deleted within visibility timeout
  Kafka: acks=all ensures broker durability, consumer offsets control redelivery
  RabbitMQ: publisher confirms + consumer acks ensure no loss

Shopify processes order events with at-least-once semantics. When an order is placed, the event must reach inventory, payment, and fulfillment services. A duplicate is handled gracefully; a lost order is not.

The Duplicate Problem

Duplicates arise from normal operation, not bugs.

Scenario: network partition after processing
  1. Consumer receives message
  2. Consumer processes message successfully
  3. Consumer sends ack to broker
  4. Network fails, ack is lost
  5. Broker assumes message was not processed
  6. Broker redelivers message
  7. Consumer receives and processes it AGAIN

This is unavoidable in distributed systems. The solution is idempotent consumers.

Exactly-Once Delivery

True exactly-once delivery between independent distributed systems is theoretically impossible. What production systems provide is exactly-once processing semantics: the end result is the same as if each message were processed exactly once, even though the underlying mechanism involves retries and deduplication.

Exactly-Once Within a Closed System

Kafka achieves exactly-once semantics when both input and output are Kafka topics, using three mechanisms together.

Kafka exactly-once components:
  1. Idempotent producer: broker deduplicates based on producer ID + sequence number
  2. Transactional writes: produce to output topic + commit consumer offset atomically
  3. Read-committed isolation: consumers only see committed (complete) transactions

Flow:
  Read from input topic
  Process
  Write to output topic + commit offset  [atomic transaction]
  
  If failure occurs mid-transaction, entire transaction is rolled back
  Consumer reprocesses from last committed offset
  Output topic sees each result exactly once

Exactly-Once Across System Boundaries

When the consumer writes to an external database, the messaging system cannot coordinate atomically with that database. You need application-level strategies.

Strategies for cross-system exactly-once:
  1. Idempotent operations (natural deduplication)
  2. Deduplication table (track processed message IDs)
  3. Transactional outbox (combine business write + message tracking)

Idempotency

An operation is idempotent if performing it multiple times produces the same result as performing it once. Idempotency is the practical foundation for handling duplicates.

Naturally Idempotent Operations

Idempotent:
  SET balance = 100           (same result regardless of repetition)
  DELETE FROM orders WHERE id = 'abc'  (deleting twice has same effect)
  PUT /users/123 { name: "Alice" }     (overwrite is same each time)

NOT idempotent:
  INCREMENT balance BY 10     (each execution adds 10)
  INSERT INTO orders (...)    (each execution creates a new row)
  POST /transfers { amount: 50 }       (each call transfers money)

Making Non-Idempotent Operations Safe

Use an idempotency key: a unique identifier per logical operation. Before processing, check if this key was already handled.

Idempotency key pattern:
  1. Message arrives with idempotencyKey: "pay-abc-123"
  2. Check database: SELECT * FROM processed_messages WHERE key = 'pay-abc-123'
  3. If found: skip processing, return previous result
  4. If not found:
     BEGIN TRANSACTION
       Process the payment
       INSERT INTO processed_messages (key, result, timestamp) VALUES (...)
     COMMIT
  5. Return result

Stripe requires an Idempotency-Key header on all mutating API calls. If a network error causes a retry, Stripe returns the cached result of the first call instead of charging the customer twice.

Deduplication

Deduplication detects and discards duplicate messages before processing. It complements idempotency by filtering duplicates at the infrastructure level.

Broker-Side Deduplication

Some brokers track message IDs and reject duplicates.

SQS FIFO deduplication:
  - Each message includes a MessageDeduplicationId
  - SQS tracks IDs for a 5-minute deduplication window
  - Duplicate IDs within the window are silently dropped

Kafka idempotent producer:
  - Broker tracks (producerId, sequenceNumber) per partition
  - Retried messages with same sequence number are deduplicated
  - Window is the lifetime of the producer session

Consumer-Side Deduplication

When the broker does not deduplicate, consumers must track processed message IDs.

Consumer deduplication approaches:
  
  1. Database table:
     Store messageId in a processed_messages table
     Check before processing, insert after
     
  2. Bloom filter:
     Probabilistic set membership test
     False positives (skip a new message) possible but rare
     Very memory efficient for high-volume streams
     
  3. Redis set:
     SADD messageId to a set with TTL
     Fast lookup, automatic expiration
     Good for deduplication windows of minutes to hours

Deduplication Window

You cannot track every message ID forever. Define a window based on how long duplicates might appear.

Deduplication window considerations:
  - SQS visibility timeout: 30 seconds to 12 hours
  - Kafka consumer restart: duplicates from last committed offset
  - Network partition recovery: duplicates from buffered retries

Typical windows:
  - Payment processing: 24 hours (covers retry storms)
  - Event streaming: 1 hour (covers consumer restarts)
  - Real-time analytics: 5 minutes (short window, duplicates are tolerable)

Outbox Pattern

The outbox pattern solves the dual-write problem: how to update a database and publish a message atomically without distributed transactions.

The Dual-Write Problem

Naive approach (broken):
  1. Update database (order status = "confirmed")
  2. Publish OrderConfirmed event to broker
  
  Failure between step 1 and 2:
    Database updated, but event never published
    Other services never learn about the confirmation
    System is inconsistent

How the Outbox Pattern Works

Write the event to an outbox table in the same database transaction as the business operation. A separate process reads the outbox and publishes to the message broker.

Outbox pattern:
  1. BEGIN TRANSACTION
       UPDATE orders SET status = 'confirmed' WHERE id = 'abc-123'
       INSERT INTO outbox (id, event_type, payload, created_at, published)
         VALUES ('evt-001', 'OrderConfirmed', '{...}', NOW(), false)
     COMMIT
  
  2. Outbox publisher (separate process):
     SELECT * FROM outbox WHERE published = false ORDER BY created_at
     For each row:
       Publish to message broker
       UPDATE outbox SET published = true WHERE id = 'evt-001'

Outbox with Change Data Capture

Instead of polling the outbox table, use change data capture (CDC) to stream inserts from the outbox table directly to the message broker.

CDC-based outbox:
  Application --> writes to outbox table
  Debezium (CDC) --> reads database transaction log
  Debezium --> publishes to Kafka topic
  
  Advantages:
    No polling delay
    No additional load on database from SELECT queries
    Guaranteed ordering (follows transaction log order)

Wepay uses Debezium with the outbox pattern to ensure that every payment state change is reliably published to Kafka without dual-write inconsistencies.

Outbox Cleanup

The outbox table grows continuously. Implement cleanup for published events.

Cleanup strategies:
  - Delete published rows older than retention period (e.g., 7 days)
  - Partition outbox table by date for efficient drops
  - Archive to cold storage before deletion for audit compliance

Choosing the Right Guarantee

| Guarantee      | Data Loss | Duplicates | Complexity | Throughput |
|----------------|-----------|------------|------------|------------|
| At-most-once   | Possible  | None       | Low        | Highest    |
| At-least-once  | None      | Possible   | Medium     | High       |
| Exactly-once   | None      | None       | High       | Lower      |

Most production systems use at-least-once delivery with idempotent consumers. This gives the reliability of exactly-once processing with the throughput and simplicity of at-least-once delivery.

Common Pitfalls

Assuming the broker handles exactly-once for you. Even Kafka's exactly-once only works within the Kafka ecosystem. Cross-system writes need application-level idempotency.
Unbounded deduplication tracking. Storing every message ID forever consumes unbounded storage. Define a deduplication window and prune old entries.
Non-atomic outbox writes. Writing to the outbox outside the business transaction defeats the purpose. The outbox insert must be in the same transaction as the business operation.
Idempotency keys that are too broad or too narrow. A key per request is correct. A key per user is too broad (blocks legitimate operations). A key per message is too narrow (does not catch application-level retries).
Ignoring redelivery during deployments. Rolling deployments cause consumer restarts. Uncommitted offsets lead to reprocessing. Always design for redelivery.
Testing only the happy path. Delivery guarantee failures happen during network partitions, broker failovers, and consumer crashes. Test these failure modes explicitly.

Key Takeaways

At-most-once is fire-and-forget. At-least-once retries until confirmed. Exactly-once requires coordination between producer, broker, and consumer.
Idempotency is the most important design principle for reliable messaging. Make consumers safe to call multiple times with the same input.
The outbox pattern solves the dual-write problem by combining business writes and event publishing in a single database transaction.
Deduplication windows must be bounded. Track message IDs for a defined period, not forever.
Default to at-least-once with idempotent consumers. Only invest in exactly-once semantics when the business cost of duplicates exceeds the engineering cost of preventing them.