Infrastructure as Code for Data
The modern data stack should be fully reproducible from a Git repository. Infrastructure defined in Terraform, transformations in dbt, pipelines in version-controlled Python, and everything deployed through CI/CD. If your data platform cannot be rebuilt from scratch by running a few commands, it is fragile. Infrastructure as code turns your data platform from a collection of manually configured services into a versioned, testable, reviewable system.
Why Code, Not Clicks
The ClickOps Problem
When you create a Snowflake warehouse through the UI, grant permissions by hand, and configure Airflow connections through the web interface, you have a data platform that:
- Cannot be reproduced if disaster strikes
- Cannot be reviewed in a pull request
- Cannot be tested before deploying
- Cannot be rolled back when something goes wrong
- Cannot be audited for compliance
These are the same problems that infrastructure-as-code solved for DevOps a decade ago. Data engineering is catching up.
The IaC Promise
With infrastructure as code, your entire data platform is defined in files:
repo: data-platform/
terraform/
snowflake.tf # Warehouses, databases, roles, grants
aws.tf # S3 buckets, IAM roles, networking
kafka.tf # Kafka clusters, topics, ACLs
airflow.tf # Airflow infrastructure
dbt/
models/ # All SQL transformations
tests/ # Data quality tests
sources.yml # Source definitions
airflow/
dags/ # Pipeline definitions
.github/
workflows/
deploy.yml # CI/CD pipeline
Every change goes through a pull request. Every deployment is automated. Every configuration is versioned.
Terraform for Infrastructure
Terraform manages the cloud resources that your data platform runs on.
Snowflake Resources
# terraform/snowflake.tf
resource "snowflake_warehouse" "transform" {
name = "TRANSFORM_WH"
warehouse_size = "medium"
auto_suspend = 300 # Suspend after 5 minutes of inactivity
auto_resume = true
min_cluster_count = 1
max_cluster_count = 3
scaling_policy = "economy"
initially_suspended = true
}
resource "snowflake_database" "analytics" {
name = "ANALYTICS"
data_retention_days = 30
}
resource "snowflake_schema" "staging" {
database = snowflake_database.analytics.name
name = "STAGING"
}
resource "snowflake_role" "analyst" {
name = "ANALYST_ROLE"
}
resource "snowflake_grant_privileges_to_role" "analyst_read" {
role_name = snowflake_role.analyst.name
privileges = ["SELECT"]
on_schema_object {
future {
object_type_plural = "TABLES"
in_schema = "${snowflake_database.analytics.name}.${snowflake_schema.staging.name}"
}
}
}
S3 Buckets for Data Lake
# terraform/aws.tf
resource "aws_s3_bucket" "data_lake" {
bucket = "company-data-lake-prod"
}
resource "aws_s3_bucket_lifecycle_configuration" "data_lake_lifecycle" {
bucket = aws_s3_bucket.data_lake.id
rule {
id = "move-to-infrequent-access"
status = "Enabled"
transition {
days = 90
storage_class = "STANDARD_IA"
}
transition {
days = 365
storage_class = "GLACIER"
}
}
}
resource "aws_s3_bucket_versioning" "data_lake_versioning" {
bucket = aws_s3_bucket.data_lake.id
versioning_configuration {
status = "Enabled"
}
}
Kafka Topics
# terraform/kafka.tf
resource "kafka_topic" "order_events" {
name = "order-events"
replication_factor = 3
partitions = 12
config = {
"retention.ms" = "604800000" # 7 days
"cleanup.policy" = "delete"
"compression.type" = "zstd"
}
}
resource "kafka_topic" "user_activity" {
name = "user-activity"
replication_factor = 3
partitions = 24
config = {
"retention.ms" = "2592000000" # 30 days
"cleanup.policy" = "delete"
"compression.type" = "zstd"
}
}
dbt for Transformations
dbt (data build tool) is the standard for version-controlled SQL transformations. Every model is a SQL file. Dependencies are declared with ref(). Tests are defined alongside models.
Project Structure
dbt/
models/
staging/
stg_orders.sql
stg_customers.sql
stg_payments.sql
intermediate/
int_daily_revenue.sql
int_customer_orders.sql
marts/
fct_orders.sql
dim_customers.sql
mart_revenue.sql
tests/
assert_positive_revenue.sql
sources.yml
schema.yml
dbt_project.yml
Models as SQL Files
-- models/staging/stg_orders.sql
WITH source AS (
SELECT * FROM {{ source('raw', 'orders') }}
),
renamed AS (
SELECT
id AS order_id,
customer_id,
CAST(created_at AS TIMESTAMP) AS order_date,
amount_cents / 100.0 AS amount,
status
FROM source
WHERE _deleted IS FALSE
)
SELECT * FROM renamed
-- models/marts/mart_revenue.sql
WITH daily AS (
SELECT
order_date,
COUNT(*) AS order_count,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value
FROM {{ ref('stg_orders') }}
WHERE status IN ('paid', 'shipped', 'delivered')
GROUP BY order_date
)
SELECT
d.*,
SUM(d.total_revenue) OVER (
ORDER BY d.order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7day_revenue
FROM daily d
dbt Tests
# schema.yml
models:
- name: fct_orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: amount
tests:
- not_null
- dbt_utils.expression_is_true:
expression: ">= 0"
CI/CD for Data Pipelines
The Data Pipeline CI/CD Flow
Developer pushes code
-> CI runs:
1. Lint SQL (sqlfluff) and Python (ruff)
2. dbt compile (verify SQL is valid)
3. dbt run --target dev (run transformations in dev)
4. dbt test --target dev (run data quality tests in dev)
5. Terraform plan (show infrastructure changes)
-> PR review
-> Merge to main
-> CD runs:
1. Terraform apply (deploy infrastructure changes)
2. dbt run --target prod (run transformations in prod)
3. dbt test --target prod (validate production data)
4. Deploy Airflow DAGs
GitHub Actions Example
# .github/workflows/data-pipeline.yml
name: Data Pipeline CI/CD
on:
pull_request:
branches: [main]
push:
branches: [main]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: pip install sqlfluff ruff
- run: sqlfluff lint dbt/models/
- run: ruff check airflow/dags/
dbt-ci:
runs-on: ubuntu-latest
if: github.event_name == 'pull_request'
steps:
- uses: actions/checkout@v4
- run: pip install dbt-snowflake
- run: cd dbt && dbt deps
- run: cd dbt && dbt compile --target ci
- run: cd dbt && dbt run --target ci --select state:modified+
- run: cd dbt && dbt test --target ci --select state:modified+
deploy:
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/main'
needs: [lint]
steps:
- uses: actions/checkout@v4
- run: pip install dbt-snowflake
- run: cd dbt && dbt deps
- run: cd dbt && dbt run --target prod
- run: cd dbt && dbt test --target prod
Dev / Staging / Prod Environments
Why Environments Matter
Testing a query against production data sounds harmless until you accidentally run DELETE FROM without a WHERE clause. Separate environments isolate risk.
Environment Purpose Data
----------- ------- ----
Dev Individual development Sample or synthetic data
Staging Pre-production testing Copy of production data (anonymized)
Prod Live data platform Real production data
dbt Profiles for Environments
# profiles.yml
data_platform:
target: dev
outputs:
dev:
type: snowflake
account: company.snowflakecomputing.com
database: ANALYTICS_DEV
warehouse: DEV_WH
schema: "{{ env_var('DBT_SCHEMA', 'dev_' + env_var('USER')) }}"
staging:
type: snowflake
account: company.snowflakecomputing.com
database: ANALYTICS_STAGING
warehouse: STAGING_WH
schema: STAGING
prod:
type: snowflake
account: company.snowflakecomputing.com
database: ANALYTICS
warehouse: TRANSFORM_WH
schema: PRODUCTION
Each developer gets their own schema in dev. Staging mirrors production structure with anonymized data. Production is the real thing.
Testing Transformations Before Deploying
# Local development workflow
1. Create a branch: git checkout -b feature/add-discount-model
2. Write the model: models/marts/mart_discounts.sql
3. Test locally: dbt run --select mart_discounts --target dev
4. Validate: dbt test --select mart_discounts --target dev
5. Verify results: query the dev table manually
6. Push and open PR
7. CI runs tests against staging
8. After review and merge, CD deploys to prod
The Modern Data Stack as Code
Putting it all together, a fully code-defined data platform:
Component Tool As Code
--------- ---- -------
Infrastructure Terraform .tf files in Git
Ingestion config Fivetran/Airbyte API configs or Terraform provider
Transformations dbt SQL files in Git
Orchestration Airflow/Dagster Python DAG files in Git
Quality tests dbt tests/GE YAML and SQL in Git
Access control Terraform Grants and roles in .tf files
Monitoring Terraform Alert configs in .tf files
Documentation dbt docs Generated from YAML descriptions
The Benefits Compound
When everything is in code:
- Disaster recovery: Rebuild the entire platform from Git in hours, not weeks.
- Audit trail: Git history shows who changed what, when, and why.
- Review process: Every change goes through a PR with peer review.
- Consistency: Dev, staging, and prod are structurally identical.
- Onboarding: New team members can read the code to understand the platform.
Common Pitfalls
- Partial IaC adoption. Half the infrastructure in Terraform, half configured manually. This is worse than all-manual because you do not know which source of truth to trust.
- Not testing dbt models in CI. Running
dbt runonly in production means bugs are discovered by data consumers, not by CI. - Skipping the staging environment. Going straight from dev to prod means untested changes hit production data. A staging environment with realistic data catches most issues.
- Storing credentials in code. Terraform state files, dbt profiles, and CI/CD configs must never contain plaintext passwords. Use secret managers (AWS Secrets Manager, HashiCorp Vault, GitHub Secrets).
- Over-engineering the CI/CD pipeline. A pipeline that takes 45 minutes to run discourages frequent deployments. Optimize for fast feedback: lint quickly, test only changed models, parallelize where possible.
- Not versioning Terraform state. Terraform state must be stored in a remote backend (S3, GCS) with versioning and locking. Local state files get lost and cause drift.
- Ignoring drift detection. Someone makes a manual change in the console. Terraform does not know about it. Run
terraform planregularly to detect drift.
Key Takeaways
- Define everything in code: infrastructure (Terraform), transformations (dbt), pipelines (Airflow/Dagster), and quality tests. Version control is the single source of truth.
- CI/CD for data pipelines follows the same principles as application CI/CD: lint, test, review, deploy.
- Separate dev, staging, and prod environments. Test transformations against realistic data before they touch production.
- The modern data stack as code enables disaster recovery, audit trails, peer review, and consistent environments.
- Store credentials in secret managers, not in code. Store Terraform state in a remote backend with locking.
- Start with the highest-value piece (usually dbt for transformations) and expand IaC coverage over time.