Advanced SQL Patterns
SQL is the primary language of data engineering. Not Python, not Scala — SQL. You will write more SQL than anything else, and the difference between a productive data engineer and a struggling one is often fluency with a dozen patterns that come up constantly. This is not a SQL tutorial. This is the SQL you will use every day in data pipelines.
CTEs for Readability
Common Table Expressions (CTEs) using the WITH clause are the single most important pattern for writing maintainable SQL. A CTE lets you name a subquery and reference it like a table.
Without CTEs:
SELECT
customer_segment,
AVG(order_total) AS avg_order_value
FROM (
SELECT
c.customer_segment,
o.order_id,
SUM(oi.quantity * oi.unit_price) AS order_total
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2025-01-01'
GROUP BY c.customer_segment, o.order_id
) subq
GROUP BY customer_segment;
With CTEs:
WITH order_totals AS (
SELECT
o.order_id,
o.customer_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) AS order_total
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= '2025-01-01'
GROUP BY o.order_id, o.customer_id, o.order_date
),
customer_orders AS (
SELECT
c.customer_segment,
ot.order_total
FROM order_totals ot
JOIN customers c ON ot.customer_id = c.customer_id
)
SELECT
customer_segment,
AVG(order_total) AS avg_order_value
FROM customer_orders
GROUP BY customer_segment;
The CTE version is longer but dramatically easier to read, debug, and modify. Each CTE is a logical step you can run independently. When a pipeline breaks, you can isolate the problem to a specific CTE.
CTE Guidelines
- Name CTEs descriptively:
active_customers, nottemp1 - Each CTE should do one thing: filter, join, aggregate, or transform
- Stack CTEs sequentially — each builds on the previous
- Use CTEs even when a subquery would work — readability beats brevity
Subqueries vs Joins
Both subqueries and joins combine data from multiple tables, but they serve different purposes.
Use joins when you need columns from both tables in the result:
SELECT
o.order_id,
o.order_date,
c.customer_name,
c.customer_segment
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
Use subqueries for existence checks and filtering:
-- Customers who have placed at least one order this year
SELECT customer_id, customer_name
FROM customers
WHERE customer_id IN (
SELECT DISTINCT customer_id
FROM orders
WHERE order_date >= '2025-01-01'
);
-- More efficient: EXISTS instead of IN
SELECT c.customer_id, c.customer_name
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= '2025-01-01'
);
EXISTS is generally faster than IN for large datasets because it stops scanning as soon as it finds a match. IN with a subquery materializes the full result set first.
Anti-joins find records that do not match:
-- Customers who have NOT placed any order this year
SELECT c.customer_id, c.customer_name
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
AND o.order_date >= '2025-01-01'
WHERE o.order_id IS NULL;
This pattern (LEFT JOIN + WHERE IS NULL) is often faster than NOT IN or NOT EXISTS, though modern query optimizers frequently treat them identically.
CASE Expressions
CASE is how you implement conditional logic in SQL. It replaces what would be if/else statements in other languages.
-- Categorize orders by size
SELECT
order_id,
order_amount,
CASE
WHEN order_amount >= 1000 THEN 'large'
WHEN order_amount >= 100 THEN 'medium'
ELSE 'small'
END AS order_size
FROM orders;
Conditional Aggregation
CASE inside aggregate functions is one of the most useful patterns in data engineering. It lets you pivot data without using a PIVOT operator.
-- Monthly revenue broken down by channel, in a single query
SELECT
DATE_TRUNC('month', order_date) AS order_month,
SUM(CASE WHEN channel = 'web' THEN order_amount ELSE 0 END) AS web_revenue,
SUM(CASE WHEN channel = 'mobile' THEN order_amount ELSE 0 END) AS mobile_revenue,
SUM(CASE WHEN channel = 'store' THEN order_amount ELSE 0 END) AS store_revenue,
SUM(order_amount) AS total_revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY order_month;
Boolean Aggregation
-- Count and percentage of orders that were returned
SELECT
product_category,
COUNT(*) AS total_orders,
SUM(CASE WHEN is_returned THEN 1 ELSE 0 END) AS returned_orders,
ROUND(
100.0 * SUM(CASE WHEN is_returned THEN 1 ELSE 0 END) / COUNT(*),
2
) AS return_rate_pct
FROM orders o
JOIN products p ON o.product_id = p.product_id
GROUP BY product_category;
COALESCE & NULLIF
NULLs are pervasive in real-world data. These two functions handle them cleanly.
COALESCE returns the first non-NULL value from its arguments:
-- Use shipping address if available, otherwise billing address
SELECT
customer_id,
COALESCE(shipping_address, billing_address, 'No address on file') AS address
FROM customers;
-- Fill gaps in a time series with zero
SELECT
d.date,
COALESCE(SUM(o.order_amount), 0) AS daily_revenue
FROM dim_date d
LEFT JOIN orders o ON d.date = DATE(o.order_date)
WHERE d.date BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY d.date;
NULLIF returns NULL if two values are equal. Its primary use is preventing division by zero:
-- Safe division: returns NULL instead of error when denominator is 0
SELECT
campaign_name,
total_clicks,
total_impressions,
ROUND(
100.0 * total_clicks / NULLIF(total_impressions, 0),
2
) AS click_through_rate
FROM campaign_metrics;
Without NULLIF, a campaign with zero impressions would cause a division-by-zero error. With NULLIF, it returns NULL, which you can then handle with COALESCE if needed.
String Functions
Data pipelines spend a surprising amount of time cleaning strings.
-- Common string operations
SELECT
TRIM(raw_name) AS cleaned_name, -- remove whitespace
LOWER(email) AS normalized_email, -- normalize case
SPLIT_PART(email, '@', 2) AS email_domain, -- extract domain
LEFT(phone, 3) AS area_code, -- first N chars
REPLACE(address, '\n', ' ') AS single_line_address, -- remove newlines
LENGTH(description) AS desc_length,
CONCAT(first_name, ' ', last_name) AS full_name
FROM raw_customers;
Pattern Matching
-- LIKE for simple patterns
SELECT * FROM customers WHERE email LIKE '%@gmail.com';
-- SIMILAR TO or regex for complex patterns (PostgreSQL)
SELECT * FROM events WHERE url ~ '^/api/v[0-9]+/users';
-- Extract parts with REGEXP_REPLACE or SUBSTRING
SELECT
SUBSTRING(log_line FROM 'status=(\d+)') AS http_status
FROM raw_logs;
Date Arithmetic
Date manipulation is in almost every data pipeline. The syntax varies by database, but the patterns are universal.
-- Common date operations (PostgreSQL syntax)
SELECT
CURRENT_DATE AS today,
CURRENT_DATE - INTERVAL '7 days' AS one_week_ago,
DATE_TRUNC('month', order_date) AS first_of_month,
DATE_TRUNC('quarter', order_date) AS first_of_quarter,
EXTRACT(DOW FROM order_date) AS day_of_week, -- 0=Sunday
EXTRACT(EPOCH FROM end_time - start_time) AS duration_seconds,
AGE(CURRENT_DATE, signup_date) AS customer_age
FROM orders;
Generating Date Series
Filling gaps in time series data is a common need:
-- Generate a series of dates (PostgreSQL)
SELECT generate_series(
'2025-01-01'::date,
'2025-12-31'::date,
'1 day'::interval
)::date AS date;
-- BigQuery equivalent
SELECT date
FROM UNNEST(
GENERATE_DATE_ARRAY('2025-01-01', '2025-12-31', INTERVAL 1 DAY)
) AS date;
Time Zone Handling
-- Convert UTC to a specific timezone
SELECT
event_timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York' AS et_time
FROM events;
-- Truncate to day in a specific timezone (not UTC)
SELECT
DATE_TRUNC('day',
event_timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York'
) AS event_date_et
FROM events;
Getting time zones wrong is one of the most common sources of off-by-one-day errors in data pipelines. Always be explicit about which timezone you are truncating in.
UNION & UNION ALL
UNION ALL combines result sets from multiple queries, keeping all rows including duplicates. UNION deduplicates, which is slower.
-- Combine events from two source tables
SELECT event_id, user_id, event_type, event_timestamp, 'web' AS source
FROM web_events
WHERE event_timestamp >= '2025-01-01'
UNION ALL
SELECT event_id, user_id, event_type, event_timestamp, 'mobile' AS source
FROM mobile_events
WHERE event_timestamp >= '2025-01-01';
Use UNION ALL by default. Only use UNION (with deduplication) when you specifically need to eliminate duplicates, and even then consider whether deduplication belongs in a separate step where it is more visible.
GROUP BY Patterns
GROUPING SETS for Multiple Aggregation Levels
-- Revenue by category, by region, and overall total in one query
SELECT
COALESCE(product_category, 'ALL') AS category,
COALESCE(region, 'ALL') AS region,
SUM(order_amount) AS revenue
FROM orders o
JOIN products p ON o.product_id = p.product_id
GROUP BY GROUPING SETS (
(product_category, region),
(product_category),
(region),
()
)
ORDER BY category, region;
This replaces running four separate queries and combining them with UNION ALL.
Common Pitfalls
- Nested subqueries instead of CTEs. Three levels of nested subqueries is unreadable. Refactor into CTEs.
- Using SELECT DISTINCT to mask a bad join. If a join produces duplicates, adding DISTINCT hides the problem. Fix the join condition instead.
- Ignoring NULL behavior in aggregations.
COUNT(column)excludes NULLs.COUNT(*)counts all rows.AVG(column)ignores NULLs, which may or may not be what you want. - String comparison without normalization.
'New York'and'new york'and' New York 'are different values. TRIM and LOWER before comparing. - Date truncation in the wrong timezone. Truncating a UTC timestamp to date gives a different result than truncating after converting to the user's timezone. Know which one your business logic requires.
- Using UNION when UNION ALL is sufficient. UNION forces a sort for deduplication. If you know the sets are disjoint (or duplicates are acceptable), use UNION ALL.
Key Takeaways
- CTEs are your primary tool for writing readable, debuggable SQL. Use them aggressively.
- CASE expressions enable conditional logic and pivoting within aggregate queries.
- COALESCE handles NULLs, NULLIF prevents division by zero. Use them together.
- Master date arithmetic and timezone handling — they appear in nearly every pipeline.
- Prefer EXISTS over IN for correlated filtering on large tables.
- Use UNION ALL by default, UNION only when deduplication is explicitly needed.
- Write SQL that the next person can read. Clever SQL is bad SQL if nobody else can maintain it.