ETL vs ELT

Two acronyms, same three steps, different order. The order matters more than you think because it determines where your transformation logic lives, how much raw data you retain, and how easily you can iterate on business logic after the fact.

The Traditional Approach: ETL

Extract, Transform, Load. Data moves from source systems into a staging area, gets cleaned and reshaped there, then lands in the warehouse already in its final form.

Source -> Extract -> Staging Area -> Transform -> Load -> Warehouse

ETL dominated for decades because warehouse compute was expensive. You paid per query, per storage byte, per CPU cycle. It made sense to do heavy lifting outside the warehouse on cheaper compute, then load only the polished result.

How ETL Typically Works

Extract: Pull data from source systems (databases, APIs, flat files)
Transform: Clean, deduplicate, join, aggregate in a separate processing layer
Load: Insert the transformed data into the warehouse

The transformation layer was usually a dedicated ETL tool like Informatica, Talend, or DataStage. These tools had visual drag-and-drop interfaces for building transformation pipelines.

When ETL Still Makes Sense

ETL is not dead. It is the right choice when:

You are working with sensitive data that must be masked or filtered before it touches the warehouse
Regulatory requirements mandate that raw PII never lands in your analytics environment
Your warehouse genuinely cannot handle the transformation workload (rare today, but real for some on-prem setups)
You need to reduce data volume dramatically before loading (e.g., aggregating billions of IoT events into hourly summaries)

The Modern Approach: ELT

Extract, Load, Transform. Pull data from sources, dump it raw into the warehouse, then transform it there using SQL.

Source -> Extract -> Load -> Warehouse -> Transform -> Warehouse (new tables/views)

Why ELT Won

The shift happened because three things changed simultaneously:

Storage got cheap. Cloud warehouses charge pennies per gigabyte per month for compressed columnar storage. Keeping raw data is no longer a financial burden.

Warehouse compute became elastic. Snowflake spins up a new compute cluster in seconds. BigQuery auto-scales. You are not constrained by a fixed number of CPUs sitting in a rack.

Raw data preservation became valuable. When you transform before loading, the original data is gone. If your business logic changes six months later, you cannot go back and re-derive the data. With ELT, the raw data is always there. You just write a new transformation.

The ELT Workflow in Practice

-- Step 1: Raw data lands in a staging schema
-- (loaded by Fivetran, Airbyte, custom scripts, etc.)
SELECT * FROM raw.stripe_charges LIMIT 5;

-- Step 2: Transform in the warehouse using SQL
CREATE TABLE analytics.monthly_revenue AS
SELECT
    DATE_TRUNC('month', created_at) AS revenue_month,
    currency,
    SUM(amount) / 100.0 AS total_revenue,
    COUNT(*) AS charge_count
FROM raw.stripe_charges
WHERE status = 'succeeded'
GROUP BY 1, 2;

The raw stripe_charges table is untouched. If next quarter the finance team decides refunded charges should be excluded differently, you rewrite the transformation, not the pipeline.

dbt as the ELT Transformation Layer

dbt (data build tool) is the tool that made ELT practical at scale. It turns your warehouse transformations into a proper software engineering workflow.

What dbt Does

dbt handles the "T" in ELT. It does not extract or load data. It takes raw data already in your warehouse and transforms it using SQL SELECT statements.

-- models/staging/stg_stripe_charges.sql
SELECT
    id AS charge_id,
    customer AS customer_id,
    amount / 100.0 AS amount_dollars,
    currency,
    status,
    created AS created_at
FROM {{ source('stripe', 'charges') }}
WHERE status != 'failed'

-- models/marts/monthly_revenue.sql
SELECT
    DATE_TRUNC('month', created_at) AS revenue_month,
    SUM(amount_dollars) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM {{ ref('stg_stripe_charges') }}
WHERE status = 'succeeded'
GROUP BY 1

Why dbt Changed the Game

Version control. Transformations live in Git, not in a proprietary ETL tool. You can review changes in a pull request before they touch production data.
Testing. You write tests that assert your data meets expectations. A model that produces null primary keys fails the build before bad data reaches dashboards.
Documentation. dbt auto-generates a DAG of your models and their dependencies. New team members can see how data flows without reading every SQL file.
Modularity. Models reference other models with {{ ref() }}, creating a clean dependency chain. dbt resolves the execution order automatically.
Incremental builds. Only re-process data that has changed, reducing both runtime and warehouse costs for large tables.

Choosing Between ETL & ELT

Factor	ETL	ELT
Compute cost	Cheaper for heavy transforms on cheap infra	Uses warehouse compute (elastic but not free)
Raw data retention	Lost after transform	Preserved
Iteration speed	Slow (rebuild pipeline)	Fast (rewrite SQL)
Data privacy	Can mask before loading	Raw PII lands in warehouse
Tool ecosystem	Informatica, Talend, SSIS	dbt, Fivetran + dbt, Airbyte + dbt
Debugging	Hard (transforms are opaque)	Easy (query intermediate tables)

A Realistic Architecture

Most modern data teams run something like this:

Sources (Postgres, Stripe, Salesforce, S3)
    |
    v
Ingestion (Fivetran / Airbyte / custom)
    |
    v
Raw Schema (warehouse, untransformed)
    |
    v
Staging Models (dbt: rename, cast, filter)
    |
    v
Intermediate Models (dbt: joins, business logic)
    |
    v
Mart Models (dbt: final tables for dashboards)

The ingestion layer handles EL. dbt handles T. The warehouse handles storage and compute for everything.

Hybrid Approaches

Pure ELT is not always enough. Some transformations are better done before loading:

Data volume reduction. If a source produces 10 billion events per day but you only need hourly aggregates, it is wasteful to load all 10 billion rows
Format conversion. Converting XML or nested JSON to flat relational tables is easier in Python than in SQL
Enrichment from external APIs. Geocoding addresses or resolving company names against a third-party API happens outside the warehouse
PII masking. Hashing or tokenizing sensitive fields before they reach the warehouse satisfies compliance requirements without adding warehouse-side complexity

The pragmatic answer: use ELT as the default, add pre-load transformations only when you have a specific reason.

The Historical Context

Understanding why ETL existed helps you avoid reinventing it accidentally.

In the 1990s and 2000s, warehouse compute was provisioned in fixed capacity. An enterprise might buy a Teradata appliance with 50 nodes and use it for three years. Every query competed for the same fixed pool of CPU and memory. Transforming data before loading was not a preference; it was a necessity because the warehouse literally could not spare the compute.

The shift started around 2012-2015 with the rise of cloud warehouses. Redshift launched in 2012. BigQuery had been internal at Google since 2010 and went GA in 2012. Snowflake launched in 2014. These systems offered elastic compute that could scale up for heavy transformations and scale back down when done.

dbt arrived in 2016 and provided the missing piece: a developer-friendly framework for writing, testing, and version-controlling warehouse transformations. By 2020, the ELT pattern with dbt had become the default for new data teams.

Timeline:
  1990s-2010s: ETL dominates (Informatica, DataStage, Talend)
  2012-2015:   Cloud warehouses launch (Redshift, BigQuery, Snowflake)
  2016:        dbt launches, ELT becomes practical
  2018-2020:   Fivetran + dbt becomes the standard modern data stack
  2020s:       ELT is the default; ETL used selectively

The Real-World Cost Comparison

A concrete example helps illustrate why ELT typically wins on economics.

Suppose you have 50 SaaS data sources feeding into your analytics platform. With ETL, you need transformation logic for each source, running on a dedicated compute layer (EC2 instances, Kubernetes pods, or a managed ETL tool). With ELT, the ingestion tool handles extraction and loading, and all transformation happens in the warehouse.

ETL cost structure:
  Ingestion tool (Fivetran):     $2,000/month
  Transformation compute (EC2):  $1,500/month
  ETL tool licenses (Informatica): $5,000/month
  Warehouse storage + compute:   $3,000/month
  Engineering time to maintain:  2 FTEs
  Total: ~$11,500/month + 2 FTEs

ELT cost structure:
  Ingestion tool (Fivetran):     $2,000/month
  dbt Cloud:                     $500/month
  Warehouse storage + compute:   $5,000/month (more compute, but elastic)
  Engineering time to maintain:  0.5 FTE
  Total: ~$7,500/month + 0.5 FTE

The warehouse bill is higher with ELT because transformations run there. But you eliminate the ETL tool license, the dedicated transformation compute, and most of the engineering overhead. The net savings come from fewer tools to manage and less custom code to maintain.

Common Pitfalls

Loading raw data with no organization. ELT does not mean "dump everything in one schema and figure it out later." You still need a clear schema strategy: raw, staging, intermediate, marts.

Ignoring warehouse costs. ELT shifts compute cost to the warehouse. If you run expensive transformations on every row every hour without incremental processing, the bill adds up fast.

Over-engineering the ingestion layer. If you are writing custom Python scripts to extract and load data from SaaS tools, stop. Fivetran or Airbyte will do it better and maintain it for you.

Treating dbt as the entire pipeline. dbt handles transformations. It does not handle extraction, loading, orchestration, or alerting. You still need tools for those.

Not testing transformations. The ease of writing SQL transformations in dbt makes it tempting to skip tests. Do not skip tests. A bad JOIN condition can silently duplicate or drop millions of rows.

Key Takeaways

ETL transforms data before loading into the warehouse; ELT loads raw data first and transforms inside the warehouse
ELT won because warehouse compute is elastic, storage is cheap, and preserving raw data enables iteration
dbt is the standard tool for the "T" in ELT, bringing version control, testing, and modularity to SQL transformations
Use ELT as the default approach, but add pre-load transformations when data volume, privacy, or format demands it
The real value of ELT is not the order of operations; it is that your business logic lives in version-controlled SQL, not in opaque ETL tool configurations