Building Idempotent Pipelines

Run the same pipeline twice with the same input. If you get the same result both times, the pipeline is idempotent. If the second run creates duplicates, corrupts data, or fails, you have a problem that will haunt you at 3 AM.

Why Idempotency Matters

Pipelines fail. Networks drop connections mid-transfer. A warehouse runs out of compute mid-query. An upstream API times out after sending half the data. A developer accidentally triggers a backfill that overlaps with production runs.

When a pipeline fails halfway through, you have two options:

Debug the partial state, manually fix it, then restart. This does not scale. It requires human intervention every time something breaks.
Just re-run the pipeline. This only works if the pipeline is idempotent.

Idempotent pipelines let you:

Retry failures without worrying about duplicates or data corruption
Backfill historical data by re-running the pipeline for past dates
Handle late-arriving data by reprocessing affected time windows
Run pipelines in parallel without coordination between runs

The Non-Idempotent Pipeline

Here is the most common anti-pattern:

-- This is NOT idempotent
INSERT INTO analytics.daily_revenue
SELECT
    DATE_TRUNC('day', created_at) AS revenue_date,
    SUM(amount) AS total_revenue
FROM raw.payments
WHERE created_at >= '2025-01-15'
  AND created_at < '2025-01-16'
GROUP BY 1;

Run this once, you get one row for January 15th. Run it again, you get two identical rows. Run it after a failed retry, you might get three. Your revenue dashboard now shows 3x actual revenue, and someone in finance is about to have a very bad meeting.

Strategy 1: Delete-Then-Insert

The simplest idempotent pattern. Delete the target data, then insert fresh results. Wrap it in a transaction so it is atomic.

BEGIN;

DELETE FROM analytics.daily_revenue
WHERE revenue_date = '2025-01-15';

INSERT INTO analytics.daily_revenue
SELECT
    DATE_TRUNC('day', created_at) AS revenue_date,
    SUM(amount) AS total_revenue,
    COUNT(*) AS payment_count
FROM raw.payments
WHERE created_at >= '2025-01-15'
  AND created_at < '2025-01-16'
GROUP BY 1;

COMMIT;

Run this ten times. The result is always the same: one row for January 15th with the correct totals.

When to Use Delete-Then-Insert

Target table has no unique key that maps cleanly to source data
You are replacing entire partitions or date ranges
The dataset is small enough that full replacement is fast

Watch Out For

Without a transaction, a failure between DELETE and INSERT leaves you with missing data
Not all warehouses support transactions across DDL and DML the same way

Strategy 2: Upsert (MERGE)

When you have a natural key, MERGE (or its equivalent) updates existing rows and inserts new ones in a single atomic operation.

MERGE INTO analytics.customer_metrics AS target
USING (
    SELECT
        customer_id,
        COUNT(*) AS total_orders,
        SUM(amount) AS lifetime_value,
        MAX(created_at) AS last_order_at
    FROM raw.orders
    GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
    UPDATE SET
        total_orders = source.total_orders,
        lifetime_value = source.lifetime_value,
        last_order_at = source.last_order_at
WHEN NOT MATCHED THEN
    INSERT (customer_id, total_orders, lifetime_value, last_order_at)
    VALUES (source.customer_id, source.total_orders, source.lifetime_value, source.last_order_at);

When to Use MERGE

You have a clear primary or natural key
You are processing incremental updates, not full replacements
Rows can be updated (e.g., a customer's lifetime value changes)

Platform Variations

Not every warehouse spells it the same way:

Snowflake:      MERGE INTO ... USING ... ON ... WHEN MATCHED/NOT MATCHED
BigQuery:       MERGE INTO ... USING ... ON ... WHEN MATCHED/NOT MATCHED
Redshift:       No native MERGE; use DELETE + INSERT in a transaction
PostgreSQL:     INSERT ... ON CONFLICT DO UPDATE (upsert syntax)

Strategy 3: Partition Replacement

Replace an entire partition atomically. This is the most common pattern for date-partitioned fact tables in warehouses.

-- BigQuery: overwrite a specific partition
INSERT INTO analytics.daily_events
PARTITION BY event_date
SELECT
    event_date,
    event_type,
    user_id,
    event_properties
FROM raw.events
WHERE event_date = '2025-01-15';

In Snowflake, you achieve this by deleting the partition and inserting within a transaction. In BigQuery, WRITE_TRUNCATE disposition on a partition-decorated table handles it natively.

Why Partition Replacement Works Well

It is conceptually simple: the pipeline "owns" a date partition
No key management or merge logic needed
Scales well because each partition is independent
Enables easy parallel backfills (one task per partition)

Timestamps & Watermarks for Incremental Loads

Full table scans are expensive. Incremental loads process only new or changed data. But you need to track what "new" means.

High Watermark Pattern

Track the maximum timestamp from the last successful run. Only process rows after that timestamp.

-- Get the watermark from the last successful run
SET last_watermark = (
    SELECT COALESCE(MAX(loaded_at), '1970-01-01')
    FROM analytics.events
);

-- Process only new data
INSERT INTO analytics.events
SELECT
    event_id,
    user_id,
    event_type,
    created_at,
    CURRENT_TIMESTAMP AS loaded_at
FROM raw.events
WHERE created_at > $last_watermark;

The Late-Arriving Data Problem

Watermarks assume data arrives in order. It often does not. A payment processed on January 15th might not appear in the source system until January 17th. If your watermark already passed January 15th, you miss it.

Solutions:

Overlapping windows. Instead of processing from the watermark forward, go back a few hours or days. This requires idempotent writes (MERGE or partition replacement) so the overlap does not create duplicates.
Change data capture (CDC). Track changes by a system-generated timestamp (e.g., _updated_at) rather than business timestamps.
Full reprocess of recent partitions. Re-run the last N days every time. Simple, idempotent, and catches late arrivals.

-- Overlap strategy: reprocess the last 3 days every run
BEGIN;

DELETE FROM analytics.daily_revenue
WHERE revenue_date >= CURRENT_DATE - INTERVAL '3 days';

INSERT INTO analytics.daily_revenue
SELECT
    DATE_TRUNC('day', created_at) AS revenue_date,
    SUM(amount) AS total_revenue
FROM raw.payments
WHERE created_at >= CURRENT_DATE - INTERVAL '3 days'
GROUP BY 1;

COMMIT;

Making dbt Models Idempotent

dbt provides built-in patterns for idempotency through its materialization strategies.

Table Materialization

-- models/marts/daily_revenue.sql
-- config: materialized='table'
-- This is idempotent by default: dbt drops and recreates the table every run
SELECT
    DATE_TRUNC('day', created_at) AS revenue_date,
    SUM(amount) AS total_revenue
FROM {{ ref('stg_payments') }}
GROUP BY 1

Incremental Materialization

-- models/marts/events.sql
{{
    config(
        materialized='incremental',
        unique_key='event_id',
        incremental_strategy='merge'
    )
}}

SELECT
    event_id,
    user_id,
    event_type,
    created_at
FROM {{ ref('stg_events') }}

{% if is_incremental() %}
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}

The unique_key combined with incremental_strategy='merge' ensures that re-running the model with overlapping data does not create duplicates.

Designing for Backfills

A well-designed pipeline accepts a date range as a parameter. This lets you:

Re-run a single day after a bug fix
Backfill a new column for all historical data
Reprocess a window of data after a source system correction

# Good: parameterized pipeline
python run_pipeline.py --start-date 2025-01-01 --end-date 2025-01-31

# Bad: pipeline only processes "today"
python run_pipeline.py

When each run is scoped to a date range and writes are idempotent, backfilling is just a loop over dates.

Common Pitfalls

Using INSERT without deduplication. The single most common source of data quality issues. Every INSERT should either be preceded by a DELETE, wrapped in a MERGE, or writing to a partition that gets replaced.

Relying on pipeline run time instead of data time. If your pipeline processes "everything since last run," a failure and retry will process different data. Scope pipelines to data timestamps, not pipeline execution timestamps.

Not wrapping delete-then-insert in a transaction. If the pipeline fails between the DELETE and INSERT, you lose data. Always use transactions.

Assuming MERGE is always better than delete-insert. MERGE is great for small incremental updates. For replacing large partitions, delete-insert is often faster because MERGE has to check every row for matches.

Ignoring late-arriving data. A watermark-only approach will silently miss late records. Build in overlap windows or use CDC timestamps to catch them.

Making the incremental window too small. Processing only the last hour means a 2-hour outage causes permanent data loss. Build in buffer. Reprocess the last 3 days if you can afford it.

Key Takeaways

An idempotent pipeline produces the same result regardless of how many times it runs for the same input
The three core strategies are delete-then-insert, MERGE/upsert, and partition replacement
Always wrap delete-then-insert in a transaction to prevent data loss on failure
Use overlapping windows or CDC timestamps to handle late-arriving data, not just high watermarks
Design pipelines to accept date range parameters so backfills are trivial
dbt's incremental materialization with a unique key provides idempotency out of the box