Exactly-Once Semantics
In distributed systems, every message delivery comes with a guarantee: at-most-once, at-least-once, or exactly-once. Exactly-once is the holy grail — every event is processed once and only once, with no duplicates and no data loss. It is also the hardest to achieve, and in many cases, "effectively once" through idempotent at-least-once is the pragmatic choice.
The Three Delivery Guarantees
At-Most-Once
Fire and forget. The producer sends a message and does not retry on failure. If the message is lost in transit, it is gone. This is the fastest approach and works for non-critical data like debug logs or metrics where occasional loss is acceptable.
Producer -> [message] -> Broker
                           |
                     Broker crashes
                           |
                      Message lost
                       (no retry)
At-Least-Once
The producer retries until it gets an acknowledgment. If the acknowledgment is lost (the broker received the message but the ack never made it back), the producer retries and the message is delivered twice. No data loss, but duplicates are possible.
Producer -> [message] -> Broker (writes message)
                           |
                   Ack lost in transit
                           |
Producer -> [message] -> Broker (writes message AGAIN)
                           |
                    Two copies exist
This is the default in most systems and the most common guarantee in data engineering pipelines.
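The lost-ack scenario can be reproduced with a toy in-memory broker. This is only a sketch: the `Broker` class, its `lose_first_ack` switch, and `send_at_least_once` are hypothetical names made up for illustration, not part of any real client library.

```python
class Broker:
    """Toy in-memory broker. `lose_first_ack` is a hypothetical fault switch."""

    def __init__(self, lose_first_ack=False):
        self.log = []
        self._lose_next_ack = lose_first_ack

    def write(self, message):
        self.log.append(message)      # the write itself always succeeds
        if self._lose_next_ack:
            self._lose_next_ack = False
            return False              # ack lost in transit
        return True                   # ack delivered


def send_at_least_once(broker, message, max_attempts=5):
    """Retry until an ack arrives; the producer cannot see that a write landed."""
    for _ in range(max_attempts):
        if broker.write(message):
            return True
    return False


broker = Broker(lose_first_ack=True)
send_at_least_once(broker, "order-123")
print(broker.log)  # ['order-123', 'order-123']
```

The producer behaves correctly from its own point of view, yet the log ends up with two copies, which is why duplicates have to be handled somewhere downstream.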
Exactly-Once
Every message is delivered and processed exactly one time. No loss, no duplicates. This requires coordination between the producer, the message broker, and the consumer — all three must agree on what has been processed.
Producer -> [message] -> Broker -> Consumer
                                      |
                           Processed exactly once
                            (even after failures)
Why Exactly-Once Is Hard
Exactly-once semantics are difficult because of a fundamental problem in distributed systems: you cannot distinguish between a failed operation and a slow operation. When a producer sends a message and does not receive an acknowledgment, it has no way to know whether:
- The broker never received the message (should retry).
- The broker received and wrote the message but the ack was lost (should NOT retry).
This is a version of the Two Generals' Problem. No finite number of messages can guarantee that both sides agree on the state of a single message delivery.
The Failure Scenarios
Consider a consumer that reads from Kafka, processes an event, writes to a database, and commits the offset:
Step 1: Read message from Kafka [OK]
Step 2: Process the event [OK]
Step 3: Write result to database [OK]
Step 4: Commit offset to Kafka [CRASH]
The consumer crashes after writing to the database but before committing the offset. On restart, it re-reads the same message (offset was not committed) and writes to the database again. Duplicate.
Now consider the reverse:
Step 1: Read message from Kafka [OK]
Step 2: Process the event [OK]
Step 3: Commit offset to Kafka [OK]
Step 4: Write result to database [CRASH]
The consumer crashes after committing but before writing. On restart, it skips the message (offset was committed) but the database never received the result. Data loss.
The core problem: the offset commit and the database write are two separate operations that cannot be made atomic without additional coordination.
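Both orderings can be simulated in a few lines. The sketch below is illustrative only: it models the Kafka log as a Python list and the database as another list, and the `run_consumer` and `simulate` helpers are hypothetical.

```python
class Crash(Exception):
    """Simulated consumer crash."""


def run_consumer(log, state, order, crash_before=None):
    """Process messages starting at the committed offset.

    `order` lists the two side effects in the order they run,
    e.g. ("write", "commit"). A crash is raised just before `crash_before`.
    """
    while state["offset"] < len(log):
        message = log[state["offset"]]
        for step in order:
            if step == crash_before:
                raise Crash(step)
            if step == "write":
                state["db"].append(message)
            elif step == "commit":
                state["offset"] += 1


def simulate(order):
    log = ["evt-1"]
    state = {"offset": 0, "db": []}
    try:
        # Crash between the first and the second side effect.
        run_consumer(log, state, order, crash_before=order[1])
    except Crash:
        pass
    run_consumer(log, state, order)  # restart from the committed offset
    return state["db"]


print(simulate(("write", "commit")))  # ['evt-1', 'evt-1'] -> duplicate
print(simulate(("commit", "write")))  # []                 -> data loss
```

Whichever side effect runs first, a crash between the two leaves the other one out of step; only the kind of error changes.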
Idempotent Producers
Kafka solves the producer-side duplication problem with idempotent producers. When enabled, Kafka assigns each producer a unique ID and each message a sequence number. If the broker receives a duplicate (same producer ID and sequence number), it silently discards it.
from confluent_kafka import Producer
config = {
    'bootstrap.servers': 'kafka:9092',
    'enable.idempotence': True,  # Enables exactly-once on the producer side
    'acks': 'all',  # Required for idempotence
    'max.in.flight.requests.per.connection': 5,  # Max allowed with idempotence
}
producer = Producer(config)
producer.produce('orders', key='order-123', value='{"amount": 50.00}')
producer.flush()
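With idempotent producers, retries within a producer session never create duplicate messages in a partition. This solves one piece of the puzzle, but only between the producer and Kafka: an application that restarts and re-sends the same record gets a new producer ID and is not deduplicated, and the consumer side is still unsolved.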
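The broker-side check can be pictured roughly as follows. This is a deliberately simplified sketch, not Kafka's implementation: it assumes a single partition and one sequence number per message, whereas Kafka tracks sequences per producer and partition at the batch level. `DedupBroker` is a hypothetical name.

```python
class DedupBroker:
    """Toy single-partition broker that drops already-seen sequence numbers."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest sequence number written

    def append(self, producer_id, seq, value):
        if seq <= self.last_seq.get(producer_id, -1):
            return "duplicate"  # retry of a message that was already written
        self.last_seq[producer_id] = seq
        self.log.append(value)
        return "ok"


broker = DedupBroker()
broker.append(producer_id=7, seq=0, value="order-123")
broker.append(producer_id=7, seq=0, value="order-123")  # retry after a lost ack
print(broker.log)  # ['order-123']
```

A different producer ID with the same sequence number is not treated as a duplicate, which mirrors why a restarted, non-transactional producer can still write the same record twice.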
Transactional Consumers
Kafka's transactional API extends exactly-once to the consume-transform-produce pattern. A consumer reads messages, processes them, produces output messages, and commits offsets — all in a single atomic transaction.
from confluent_kafka import Consumer, Producer

producer_config = {
    'bootstrap.servers': 'kafka:9092',
    'transactional.id': 'order-enrichment-txn-1',
}
producer = Producer(producer_config)
producer.init_transactions()

consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'order-enrichment',
    'enable.auto.commit': False,
    'isolation.level': 'read_committed',  # Only read committed messages
})
consumer.subscribe(['raw-orders'])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue  # Nothing to process (log consumer errors in real code)

    # Begin transaction
    producer.begin_transaction()
    try:
        # Process and produce enriched event (enrich_order is application code)
        enriched = enrich_order(msg.value())
        producer.produce('enriched-orders', value=enriched)

        # Commit offsets and produced messages atomically
        producer.send_offsets_to_transaction(
            consumer.position(consumer.assignment()),
            consumer.consumer_group_metadata(),
        )
        producer.commit_transaction()
    except Exception:
        # Roll back. The consumer should also seek back to the last committed
        # offsets so the message is re-read instead of skipped.
        producer.abort_transaction()
This works within the Kafka ecosystem: consuming from Kafka, producing to Kafka, and committing offsets. But most data engineering pipelines write to external systems (databases, data warehouses, object storage), and Kafka transactions do not extend to those systems.
The Outbox Pattern
The outbox pattern addresses the case where you need to update a database and publish an event atomically. Instead of writing to the database and Kafka separately, you write both the business data and the event to the same database in a single transaction. A separate process reads the outbox table and publishes events to Kafka.
-- Single database transaction
BEGIN;
-- Update the business table
UPDATE orders SET status = 'shipped' WHERE order_id = 12345;
-- Write the event to the outbox table
INSERT INTO outbox (event_type, payload, created_at) VALUES (
'order_shipped',
'{"order_id": 12345, "shipped_at": "2025-03-15T10:00:00Z"}',
NOW()
);
COMMIT;
A separate connector (like Debezium with Kafka Connect) reads the outbox table using change data capture (CDC) and publishes events to Kafka. Because the business update and the event are in the same database transaction, they succeed or fail together.
Application -> Database (orders + outbox in one transaction)
                  |
        CDC Connector (Debezium)
                  |
        Kafka (publishes outbox events)
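To make the flow concrete, here is a minimal sketch using SQLite from the Python standard library. It swaps the CDC connector for a simple polling relay, which is a different (and cruder) technique than Debezium, and it adds a hypothetical `published` column that the schema above does not have. `ship_order` and `relay_once` are illustrative names, and `publish` stands in for a Kafka producer call.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT,
        payload TEXT,
        published INTEGER DEFAULT 0  -- hypothetical flag used by the relay
    );
    INSERT INTO orders VALUES (12345, 'pending');
""")


def ship_order(conn, order_id):
    # Business update and outbox insert commit (or roll back) together.
    with conn:
        conn.execute(
            "UPDATE orders SET status = 'shipped' WHERE order_id = ?", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order_shipped", json.dumps({"order_id": order_id})),
        )


def relay_once(conn, publish):
    """Publish unpublished outbox rows in insertion order, then mark them."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, payload)  # repeated if we crash before the UPDATE
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


published = []
ship_order(conn, 12345)
relay_once(conn, lambda event_type, payload: published.append((event_type, payload)))
```

Note that the relay publishes before it marks a row as published, so a crash between the two steps republishes the event. The pattern removes the dual-write problem, but the publish step is still at-least-once.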
Outbox Trade-offs
- Adds complexity: you need CDC infrastructure.
- The database becomes a bottleneck for event publishing throughput.
- Events are eventually consistent (there is a small delay between the database write and the Kafka publish).
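- But it is correct: no data loss, no orphaned events. The connector typically delivers at-least-once, so downstream consumers should still tolerate duplicates.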
Deduplication Strategies
When exactly-once at the infrastructure level is impractical, you implement it at the application level through deduplication.
Idempotency Keys
Every event carries a unique identifier. Before processing, check if that identifier has already been processed.
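import redis

client = redis.Redis()  # assumes a reachable Redis instance

def process_order(event):
    # f-string avoids a TypeError when order_id is an integer
    idempotency_key = f"{event['order_id']}_{event['event_id']}"
    # Check if already processed
    if client.exists(f"processed:{idempotency_key}"):
        return  # Skip duplicate
    # Process the event
    write_to_warehouse(event)
    # Mark as processed (with TTL matching retention). A crash before this
    # line means the event is reprocessed on retry.
    client.set(f"processed:{idempotency_key}", "1", ex=86400 * 7)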
Database-Level Deduplication
Use unique constraints or upserts to make duplicate writes harmless.
-- PostgreSQL: INSERT or do nothing on conflict
INSERT INTO order_events (event_id, order_id, status, event_time)
VALUES ('evt-abc-123', 12345, 'shipped', '2025-03-15T10:00:00Z')
ON CONFLICT (event_id) DO NOTHING;
-- BigQuery: MERGE for deduplication
MERGE INTO order_events AS target
USING staging_events AS source
ON target.event_id = source.event_id
WHEN NOT MATCHED THEN
INSERT (event_id, order_id, status, event_time)
VALUES (source.event_id, source.order_id, source.status, source.event_time);
Windowed Deduplication
In stream processing, maintain a time-bounded set of seen event IDs. Events outside the window are assumed not to be duplicates (or are handled by downstream batch reconciliation).
Dedup window: 1 hour
Seen IDs: {evt-001, evt-002, evt-003, ...}
New event evt-002 arrives -> already in set -> drop
New event evt-004 arrives -> not in set -> process, add to set
Events older than 1 hour -> evict from set
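A minimal in-memory version of this window might look like the sketch below. It assumes timestamps are passed in explicitly and never go backwards. `WindowedDeduplicator` is a hypothetical class, and a real stream processor would keep this state in its own fault-tolerant store.

```python
from collections import OrderedDict


class WindowedDeduplicator:
    """Remembers event IDs for `window_seconds` after they are first seen."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._seen = OrderedDict()  # event_id -> first-seen time, oldest first

    def is_new(self, event_id, now):
        self._evict(now)
        if event_id in self._seen:
            return False            # duplicate inside the window: drop
        self._seen[event_id] = now
        return True                 # first sighting: process and remember

    def _evict(self, now):
        # IDs are inserted in time order, so the oldest entry is always first.
        while self._seen:
            _, seen_at = next(iter(self._seen.items()))
            if now - seen_at < self.window_seconds:
                break
            self._seen.popitem(last=False)


dedup = WindowedDeduplicator(window_seconds=3600)
print(dedup.is_new("evt-002", now=0))     # True
print(dedup.is_new("evt-002", now=10))    # False (inside the window)
print(dedup.is_new("evt-002", now=4000))  # True (first sighting was evicted)
```

The last call shows the trade-off: memory stays bounded, but a duplicate arriving after the window has passed is treated as new.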
Effectively Once: The Pragmatic Approach
True exactly-once requires tight coordination between all systems in the pipeline. In practice, many teams settle for "effectively once" — at-least-once delivery combined with idempotent processing. The result is the same: each event's effect is applied exactly once, even if the event is delivered multiple times.
When Effectively Once Is Good Enough
Most of the time. If your consumer is idempotent (processing the same event twice produces the same result), you do not need infrastructure-level exactly-once.
# Idempotent consumer: writing the same row with the same primary key
# produces the same result whether you do it once or ten times
def process_event(event):
    db.execute("""
        INSERT INTO daily_revenue (date, product_id, revenue)
        VALUES (%s, %s, %s)
        ON CONFLICT (date, product_id) DO UPDATE SET
            revenue = EXCLUDED.revenue
    """, (event['date'], event['product_id'], event['revenue']))
When You Actually Need Exactly-Once
- Financial transactions: Charging a credit card twice is not acceptable. Use the outbox pattern or a payment provider's idempotency API.
- Inventory management: Decrementing stock twice oversells items.
- Counter-based metrics: If your pipeline increments a counter (revenue += order_amount), duplicates cause overcounting. Rewrite as idempotent (SET revenue = calculated_value) instead.
# NOT idempotent: duplicate processing doubles the count
db.execute("UPDATE metrics SET order_count = order_count + 1 WHERE date = %s", (date,))
# Idempotent: duplicate processing produces the same result
db.execute("""
INSERT INTO order_counts (order_id, date, amount)
VALUES (%s, %s, %s)
ON CONFLICT (order_id) DO NOTHING
""", (order_id, date, amount))
# Then: SELECT COUNT(*), SUM(amount) FROM order_counts WHERE date = ...
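The difference shows up as soon as the same event is delivered twice. The following sketch runs both variants against an in-memory SQLite database (so it uses `?` placeholders rather than `%s`, and needs SQLite 3.24 or newer for `ON CONFLICT`). The handler names and sample values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metrics (date TEXT PRIMARY KEY, order_count INTEGER);
    CREATE TABLE order_counts (order_id INTEGER PRIMARY KEY, date TEXT, amount REAL);
    INSERT INTO metrics VALUES ('2025-03-15', 0);
""")


def handle_increment(order_id, date, amount):
    # Not idempotent: every delivery bumps the counter.
    conn.execute(
        "UPDATE metrics SET order_count = order_count + 1 WHERE date = ?", (date,)
    )


def handle_idempotent(order_id, date, amount):
    # Idempotent: the second delivery hits the primary key and is ignored.
    conn.execute(
        "INSERT INTO order_counts (order_id, date, amount) VALUES (?, ?, ?) "
        "ON CONFLICT (order_id) DO NOTHING",
        (order_id, date, amount),
    )


# At-least-once delivery: the same order arrives twice.
deliveries = [(12345, "2025-03-15", 50.0)] * 2
for event in deliveries:
    handle_increment(*event)
    handle_idempotent(*event)

counter = conn.execute("SELECT order_count FROM metrics").fetchone()[0]
derived = conn.execute("SELECT COUNT(*) FROM order_counts").fetchone()[0]
print(counter, derived)  # 2 1
```

The counter reports two orders for one real order, while the count derived from the keyed table stays at one.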
Choosing Your Strategy
Requirement                      Strategy
-----------                      --------
Logs, metrics (loss OK)          At-most-once
Most data pipelines              At-least-once + idempotent consumer
Kafka-to-Kafka transformations   Kafka transactions
DB + event publishing            Outbox pattern
Financial, inventory             Full exactly-once with deduplication
Common Pitfalls
- Assuming your pipeline is idempotent when it is not. Any operation that increments a counter, appends to a list, or sends a notification is not idempotent. Audit every step.
- Using incrementing operations instead of set operations. revenue += 100 is not idempotent; revenue = 500 is. Rewrite increments as absolute values whenever possible.
- Deduplication with unbounded state. Storing every event ID forever eventually exhausts memory. Use time-bounded dedup windows and accept that very late duplicates might slip through.
- Ignoring the cost of exactly-once. Kafka transactions reduce throughput (10-30% is a rough figure; the actual overhead depends on commit interval and batch size). Outbox patterns add infrastructure complexity. Choose the guarantee level your use case actually requires.
- Conflating delivery guarantees with processing guarantees. Kafka's exactly-once guarantees cover reads and writes inside Kafka. If the consumer writes to an external database, the end-to-end guarantee depends on the consumer's logic, not Kafka's.
- Not testing failure scenarios. Kill your consumer mid-processing. Introduce network partitions. Verify that your deduplication actually works under realistic failure conditions.
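As a starting point for the failure testing mentioned in the last pitfall, the sketch below replays events through a handler while randomly "crashing" before the offset commit, then checks the sink. It is a toy harness, not a substitute for killing real processes. `replay_with_crashes`, the crash rate, and the event shape are all assumptions made for the example.

```python
import random


def replay_with_crashes(events, apply, crash_rate=0.3, seed=0):
    """Deliver events at-least-once.

    After each apply, a simulated crash may prevent the offset commit,
    in which case the same event is delivered again.
    """
    rng = random.Random(seed)
    offset = 0
    deliveries = 0
    while offset < len(events):
        apply(events[offset])
        deliveries += 1
        if rng.random() < crash_rate:
            continue    # crashed before commit: redeliver the same event
        offset += 1     # commit succeeded
    return deliveries


events = [{"event_id": f"evt-{i}", "amount": 10} for i in range(100)]

sink = {}      # idempotent sink: keyed by event_id
raw_log = []   # non-idempotent sink: one entry per delivery


def apply(event):
    raw_log.append(event["amount"])
    sink[event["event_id"]] = event["amount"]


deliveries = replay_with_crashes(events, apply)
print(len(sink), sum(sink.values()))  # 100 1000
```

The keyed sink always ends with 100 events totalling 1000 regardless of how many redeliveries occurred, while `raw_log` grows with every delivery. The same assertion style can be pointed at a real consumer and database.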
Key Takeaways
- At-most-once loses data. At-least-once creates duplicates. Exactly-once prevents both but is expensive and complex.
- True exactly-once requires coordination between producer, broker, and consumer. Kafka transactions achieve this within the Kafka ecosystem.
- For external systems, the outbox pattern provides atomic database-plus-event writes through change data capture.
- Deduplication strategies (idempotency keys, upserts, windowed dedup) make at-least-once behave like exactly-once at the application level.
- "Effectively once" (at-least-once with idempotent consumers) is good enough for most data engineering pipelines and far simpler to implement.
- Design consumers to be idempotent from the start. Use set operations instead of increment operations. Use unique keys and upserts.
- Reserve infrastructure-level exactly-once for financial transactions, inventory systems, and other cases where duplicates have real-world consequences.