Infrastructure as Code for Data

The modern data stack should be fully reproducible from a Git repository. Infrastructure defined in Terraform, transformations in dbt, pipelines in version-controlled Python, and everything deployed through CI/CD. If your data platform cannot be rebuilt from scratch by running a few commands, it is fragile. Infrastructure as code turns your data platform from a collection of manually configured services into a versioned, testable, reviewable system.

Why Code, Not Clicks

The ClickOps Problem

When you create a Snowflake warehouse through the UI, grant permissions by hand, and configure Airflow connections through the web interface, you have a data platform that:

Cannot be reproduced if disaster strikes
Cannot be reviewed in a pull request
Cannot be tested before deploying
Cannot be rolled back when something goes wrong
Cannot be audited for compliance

These are the same problems that infrastructure-as-code solved for DevOps a decade ago. Data engineering is catching up.

The IaC Promise

With infrastructure as code, your entire data platform is defined in files:

repo: data-platform/
  terraform/
    snowflake.tf          # Warehouses, databases, roles, grants
    aws.tf                # S3 buckets, IAM roles, networking
    kafka.tf              # Kafka clusters, topics, ACLs
    airflow.tf            # Airflow infrastructure
  dbt/
    models/               # All SQL transformations
    tests/                # Data quality tests
    sources.yml           # Source definitions
  airflow/
    dags/                 # Pipeline definitions
  .github/
    workflows/
      deploy.yml          # CI/CD pipeline

Every change goes through a pull request. Every deployment is automated. Every configuration is versioned.

Terraform for Infrastructure

Terraform manages the cloud resources that your data platform runs on.

Snowflake Resources

# terraform/snowflake.tf

resource "snowflake_warehouse" "transform" {
  name                = "TRANSFORM_WH"
  warehouse_size      = "medium"
  auto_suspend        = 300    # Suspend after 5 minutes of inactivity
  auto_resume         = true
  min_cluster_count   = 1
  max_cluster_count   = 3
  scaling_policy      = "economy"
  
  initially_suspended = true
}

resource "snowflake_database" "analytics" {
  name                = "ANALYTICS"
  data_retention_days = 30
}

resource "snowflake_schema" "staging" {
  database = snowflake_database.analytics.name
  name     = "STAGING"
}

resource "snowflake_role" "analyst" {
  name = "ANALYST_ROLE"
}

resource "snowflake_grant_privileges_to_role" "analyst_read" {
  role_name  = snowflake_role.analyst.name
  privileges = ["SELECT"]
  
  on_schema_object {
    future {
      object_type_plural = "TABLES"
      in_schema          = "${snowflake_database.analytics.name}.${snowflake_schema.staging.name}"
    }
  }
}

S3 Buckets for Data Lake

# terraform/aws.tf

resource "aws_s3_bucket" "data_lake" {
  bucket = "company-data-lake-prod"
}

resource "aws_s3_bucket_lifecycle_configuration" "data_lake_lifecycle" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "move-to-infrequent-access"
    status = "Enabled"

    transition {
      days          = 90
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 365
      storage_class = "GLACIER"
    }
  }
}

resource "aws_s3_bucket_versioning" "data_lake_versioning" {
  bucket = aws_s3_bucket.data_lake.id
  versioning_configuration {
    status = "Enabled"
  }
}

Kafka Topics

# terraform/kafka.tf

resource "kafka_topic" "order_events" {
  name               = "order-events"
  replication_factor = 3
  partitions         = 12

  config = {
    "retention.ms"    = "604800000"  # 7 days
    "cleanup.policy"  = "delete"
    "compression.type" = "zstd"
  }
}

resource "kafka_topic" "user_activity" {
  name               = "user-activity"
  replication_factor = 3
  partitions         = 24

  config = {
    "retention.ms"    = "2592000000"  # 30 days
    "cleanup.policy"  = "delete"
    "compression.type" = "zstd"
  }
}

dbt for Transformations

dbt (data build tool) is the standard for version-controlled SQL transformations. Every model is a SQL file. Dependencies are declared with ref(). Tests are defined alongside models.

Project Structure

dbt/
  models/
    staging/
      stg_orders.sql
      stg_customers.sql
      stg_payments.sql
    intermediate/
      int_daily_revenue.sql
      int_customer_orders.sql
    marts/
      fct_orders.sql
      dim_customers.sql
      mart_revenue.sql
  tests/
    assert_positive_revenue.sql
  sources.yml
  schema.yml
  dbt_project.yml

Models as SQL Files

-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),

renamed AS (
    SELECT
        id AS order_id,
        customer_id,
        CAST(created_at AS TIMESTAMP) AS order_date,
        amount_cents / 100.0 AS amount,
        status
    FROM source
    WHERE _deleted IS FALSE
)

SELECT * FROM renamed

-- models/marts/mart_revenue.sql
WITH daily AS (
    SELECT
        order_date,
        COUNT(*) AS order_count,
        SUM(amount) AS total_revenue,
        AVG(amount) AS avg_order_value
    FROM {{ ref('stg_orders') }}
    WHERE status IN ('paid', 'shipped', 'delivered')
    GROUP BY order_date
)

SELECT
    d.*,
    SUM(d.total_revenue) OVER (
        ORDER BY d.order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7day_revenue
FROM daily d

dbt Tests

# schema.yml
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: amount
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"

CI/CD for Data Pipelines

The Data Pipeline CI/CD Flow

Developer pushes code
  -> CI runs:
    1. Lint SQL (sqlfluff) and Python (ruff)
    2. dbt compile (verify SQL is valid)
    3. dbt run --target dev (run transformations in dev)
    4. dbt test --target dev (run data quality tests in dev)
    5. Terraform plan (show infrastructure changes)
  -> PR review
  -> Merge to main
  -> CD runs:
    1. Terraform apply (deploy infrastructure changes)
    2. dbt run --target prod (run transformations in prod)
    3. dbt test --target prod (validate production data)
    4. Deploy Airflow DAGs

GitHub Actions Example

# .github/workflows/data-pipeline.yml
name: Data Pipeline CI/CD

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install sqlfluff ruff
      - run: sqlfluff lint dbt/models/
      - run: ruff check airflow/dags/

  dbt-ci:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-snowflake
      - run: cd dbt && dbt deps
      - run: cd dbt && dbt compile --target ci
      - run: cd dbt && dbt run --target ci --select state:modified+
      - run: cd dbt && dbt test --target ci --select state:modified+

  deploy:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    needs: [lint]
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-snowflake
      - run: cd dbt && dbt deps
      - run: cd dbt && dbt run --target prod
      - run: cd dbt && dbt test --target prod

Dev / Staging / Prod Environments

Why Environments Matter

Testing a query against production data sounds harmless until you accidentally run DELETE FROM without a WHERE clause. Separate environments isolate risk.

Environment   Purpose                    Data
-----------   -------                    ----
Dev           Individual development      Sample or synthetic data
Staging       Pre-production testing      Copy of production data (anonymized)
Prod          Live data platform          Real production data

dbt Profiles for Environments

# profiles.yml
data_platform:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: company.snowflakecomputing.com
      database: ANALYTICS_DEV
      warehouse: DEV_WH
      schema: "{{ env_var('DBT_SCHEMA', 'dev_' + env_var('USER')) }}"

    staging:
      type: snowflake
      account: company.snowflakecomputing.com
      database: ANALYTICS_STAGING
      warehouse: STAGING_WH
      schema: STAGING

    prod:
      type: snowflake
      account: company.snowflakecomputing.com
      database: ANALYTICS
      warehouse: TRANSFORM_WH
      schema: PRODUCTION

Each developer gets their own schema in dev. Staging mirrors production structure with anonymized data. Production is the real thing.

Testing Transformations Before Deploying

# Local development workflow
1. Create a branch: git checkout -b feature/add-discount-model
2. Write the model: models/marts/mart_discounts.sql
3. Test locally: dbt run --select mart_discounts --target dev
4. Validate: dbt test --select mart_discounts --target dev
5. Verify results: query the dev table manually
6. Push and open PR
7. CI runs tests against staging
8. After review and merge, CD deploys to prod

The Modern Data Stack as Code

Putting it all together, a fully code-defined data platform:

Component          Tool                As Code
---------          ----                -------
Infrastructure     Terraform           .tf files in Git
Ingestion config   Fivetran/Airbyte    API configs or Terraform provider
Transformations    dbt                 SQL files in Git
Orchestration      Airflow/Dagster     Python DAG files in Git
Quality tests      dbt tests/GE        YAML and SQL in Git
Access control     Terraform           Grants and roles in .tf files
Monitoring         Terraform           Alert configs in .tf files
Documentation      dbt docs            Generated from YAML descriptions

The Benefits Compound

When everything is in code:

Disaster recovery: Rebuild the entire platform from Git in hours, not weeks.
Audit trail: Git history shows who changed what, when, and why.
Review process: Every change goes through a PR with peer review.
Consistency: Dev, staging, and prod are structurally identical.
Onboarding: New team members can read the code to understand the platform.

Common Pitfalls

Partial IaC adoption. Half the infrastructure in Terraform, half configured manually. This is worse than all-manual because you do not know which source of truth to trust.
Not testing dbt models in CI. Running dbt run only in production means bugs are discovered by data consumers, not by CI.
Skipping the staging environment. Going straight from dev to prod means untested changes hit production data. A staging environment with realistic data catches most issues.
Storing credentials in code. Terraform state files, dbt profiles, and CI/CD configs must never contain plaintext passwords. Use secret managers (AWS Secrets Manager, HashiCorp Vault, GitHub Secrets).
Over-engineering the CI/CD pipeline. A pipeline that takes 45 minutes to run discourages frequent deployments. Optimize for fast feedback: lint quickly, test only changed models, parallelize where possible.
Not versioning Terraform state. Terraform state must be stored in a remote backend (S3, GCS) with versioning and locking. Local state files get lost and cause drift.
Ignoring drift detection. Someone makes a manual change in the console. Terraform does not know about it. Run terraform plan regularly to detect drift.

Key Takeaways

Define everything in code: infrastructure (Terraform), transformations (dbt), pipelines (Airflow/Dagster), and quality tests. Version control is the single source of truth.
CI/CD for data pipelines follows the same principles as application CI/CD: lint, test, review, deploy.
Separate dev, staging, and prod environments. Test transformations against realistic data before they touch production.
The modern data stack as code enables disaster recovery, audit trails, peer review, and consistent environments.
Store credentials in secret managers, not in code. Store Terraform state in a remote backend with locking.
Start with the highest-value piece (usually dbt for transformations) and expand IaC coverage over time.