Window Functions
Window functions are the most powerful feature in SQL for data engineering work. They let you perform calculations across rows related to the current row without collapsing the result set. Where GROUP BY reduces many rows to one, window functions keep every row and add computed columns. Once you internalize these patterns, you will stop writing self-joins and correlated subqueries entirely.
How Window Functions Work
A window function operates on a set of rows (the "window") defined by the OVER clause. The function computes a value for each row based on the rows in its window.
-- Basic structure
SELECT
column,
FUNCTION() OVER (
PARTITION BY grouping_column
ORDER BY ordering_column
ROWS BETWEEN start AND end
) AS computed_value
FROM table;
PARTITION BY divides rows into groups (like GROUP BY, but without collapsing). Each partition is processed independently.
ORDER BY determines the order of rows within each partition. Required for ranking and running calculations.
Frame clause (ROWS BETWEEN) defines exactly which rows relative to the current row are included in the calculation. Defaults vary by function and database.
Ranking Functions
ROW_NUMBER
Assigns a sequential number to each row within its partition. No ties — if two rows are equal, one arbitrarily gets the lower number.
-- Most recent order per customer
WITH ranked_orders AS (
SELECT
customer_id,
order_id,
order_date,
order_amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date DESC
) AS rn
FROM orders
)
SELECT customer_id, order_id, order_date, order_amount
FROM ranked_orders
WHERE rn = 1;
This is one of the most common patterns in data engineering. "Give me the latest/first/top-N per group" is a ROW_NUMBER problem.
Deduplication with ROW_NUMBER
Raw data often contains duplicates. ROW_NUMBER is the standard way to deduplicate:
-- Remove duplicate events, keeping the earliest occurrence
WITH deduplicated AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY event_id
ORDER BY ingested_at ASC
) AS rn
FROM raw_events
)
SELECT *
FROM deduplicated
WHERE rn = 1;
RANK & DENSE_RANK
Unlike ROW_NUMBER, these handle ties explicitly.
SELECT
student_name,
score,
ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
RANK() OVER (ORDER BY score DESC) AS rank,
DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank
FROM exam_results;
student_name | score | row_num | rank | dense_rank
Alice | 95 | 1 | 1 | 1
Bob | 95 | 2 | 1 | 1
Charlie | 90 | 3 | 3 | 2
Diana | 85 | 4 | 4 | 3
- ROW_NUMBER: Always unique. Alice gets 1, Bob gets 2 (arbitrary).
- RANK: Ties get the same rank. Next rank skips. Alice and Bob both get 1, Charlie gets 3.
- DENSE_RANK: Ties get the same rank. Next rank does not skip. Alice and Bob both get 1, Charlie gets 2.
When to use each:
- ROW_NUMBER: Deduplication, top-N per group, any case where you need exactly one row
- RANK: Leaderboards where ties should be visible and gaps in ranking are expected
- DENSE_RANK: Top-N distinct values (e.g., "top 3 scores" where ties mean more than 3 students qualify)
LAG & LEAD
LAG looks at a previous row. LEAD looks at a following row. They are essential for comparing sequential events.
-- Compare each month's revenue to the previous month
SELECT
order_month,
revenue,
LAG(revenue, 1) OVER (ORDER BY order_month) AS prev_month_revenue,
revenue - LAG(revenue, 1) OVER (ORDER BY order_month) AS month_over_month_change,
ROUND(
100.0 * (revenue - LAG(revenue, 1) OVER (ORDER BY order_month))
/ NULLIF(LAG(revenue, 1) OVER (ORDER BY order_month), 0),
2
) AS pct_change
FROM monthly_revenue;
order_month | revenue | prev_month | mom_change | pct_change
2025-01 | 150000 | NULL | NULL | NULL
2025-02 | 165000 | 150000 | 15000 | 10.00
2025-03 | 142000 | 165000 | -23000 | -13.94
Session Detection
LAG is commonly used to detect session boundaries in event streams:
-- Detect new sessions: a gap of 30+ minutes between events
WITH event_gaps AS (
SELECT
user_id,
event_timestamp,
LAG(event_timestamp) OVER (
PARTITION BY user_id
ORDER BY event_timestamp
) AS prev_event_timestamp,
EXTRACT(EPOCH FROM
event_timestamp - LAG(event_timestamp) OVER (
PARTITION BY user_id
ORDER BY event_timestamp
)
) / 60.0 AS minutes_since_last_event
FROM events
),
session_starts AS (
SELECT
*,
CASE
WHEN minutes_since_last_event IS NULL THEN 1 -- first event
WHEN minutes_since_last_event > 30 THEN 1 -- new session
ELSE 0
END AS is_new_session
FROM event_gaps
)
SELECT
user_id,
event_timestamp,
SUM(is_new_session) OVER (
PARTITION BY user_id
ORDER BY event_timestamp
) AS session_id
FROM session_starts;
This pattern — using LAG to detect gaps, then using a running SUM to assign IDs — is a building block for session analysis, funnel construction, and state-change tracking.
Running Aggregations
Running Total
-- Cumulative revenue over time
SELECT
order_date,
daily_revenue,
SUM(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_revenue
FROM daily_revenue_summary;
Running Average
-- 7-day moving average of daily orders
SELECT
order_date,
daily_orders,
ROUND(
AVG(daily_orders) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
),
2
) AS moving_avg_7d
FROM daily_order_counts;
Note: ROWS BETWEEN 6 PRECEDING AND CURRENT ROW gives a 7-day window (current day plus 6 prior days). This is a common off-by-one source.
Partitioned Running Totals
-- Cumulative revenue per product category
SELECT
order_date,
product_category,
daily_revenue,
SUM(daily_revenue) OVER (
PARTITION BY product_category
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS category_cumulative_revenue
FROM daily_category_revenue;
PARTITION BY resets the running total for each category. Without it, the running total spans all categories.
FIRST_VALUE & LAST_VALUE
-- For each order, show the customer's first and most recent order amounts
SELECT
customer_id,
order_id,
order_date,
order_amount,
FIRST_VALUE(order_amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first_order_amount,
LAST_VALUE(order_amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS latest_order_amount
FROM orders;
LAST_VALUE requires ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Without it, the default frame ends at the current row, and LAST_VALUE just returns the current row's value — which is never what you want.
NTILE
NTILE divides rows into N roughly equal buckets. Useful for percentile analysis and segmentation.
-- Divide customers into spending quartiles
WITH customer_spending AS (
SELECT
customer_id,
SUM(order_amount) AS total_spending
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY customer_id
)
SELECT
customer_id,
total_spending,
NTILE(4) OVER (ORDER BY total_spending DESC) AS spending_quartile
FROM customer_spending;
customer_id | total_spending | spending_quartile
C-1001 | 52000 | 1 (top 25%)
C-1002 | 38000 | 1
C-1003 | 15000 | 2
C-1004 | 12000 | 2
C-1005 | 5000 | 3
C-1006 | 3000 | 3
C-1007 | 1500 | 4
C-1008 | 500 | 4 (bottom 25%)
Frame Clauses
The frame clause controls exactly which rows the window function considers. This is where subtle bugs hide.
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- all rows from start to here
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW -- last 7 rows
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING -- here to end
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING -- all rows in partition
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING -- previous, current, next
ROWS vs RANGE: ROWS counts physical rows. RANGE groups rows with the same ORDER BY value together. For most data engineering work, use ROWS — it is more predictable.
-- ROWS: strictly the previous 2 physical rows
SUM(amount) OVER (ORDER BY order_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
-- RANGE: all rows with order_date within 2 units of the current row
-- If multiple orders share a date, RANGE includes all of them
SUM(amount) OVER (ORDER BY order_date RANGE BETWEEN 2 PRECEDING AND CURRENT ROW)
Combining Window Functions
Real-world queries often use multiple window functions together:
-- Customer order analysis: ranking, comparison, and running total
SELECT
customer_id,
order_id,
order_date,
order_amount,
ROW_NUMBER() OVER w AS order_sequence,
order_amount - LAG(order_amount) OVER w AS change_from_last,
SUM(order_amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS lifetime_spend_so_far,
AVG(order_amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS rolling_3_order_avg
FROM orders
WINDOW w AS (PARTITION BY customer_id ORDER BY order_date);
The WINDOW clause (supported in PostgreSQL, BigQuery, and others) lets you define a named window and reuse it, reducing repetition.
Common Pitfalls
- Forgetting PARTITION BY. Without PARTITION BY, the window function operates across the entire result set. Your "per-customer" running total becomes a global running total.
- Wrong frame clause with LAST_VALUE. The default frame often ends at the current row, making LAST_VALUE return the current row's value. Always specify
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWINGwhen using LAST_VALUE. - Off-by-one in moving windows.
ROWS BETWEEN 6 PRECEDING AND CURRENT ROWis 7 rows, not 6. Count carefully. - Using window functions in WHERE clauses. Window functions cannot be used directly in WHERE. Wrap them in a CTE or subquery and filter in the outer query.
- Performance on large datasets. Window functions that scan entire partitions (UNBOUNDED PRECEDING to UNBOUNDED FOLLOWING) on large partitions can be slow. Partition on high-cardinality columns and consider whether a self-join might be more efficient for very specific lookups.
- Assuming deterministic ROW_NUMBER without a tiebreaker. If two rows have the same ORDER BY value, ROW_NUMBER assigns them arbitrarily. Add a tiebreaker column (like a unique ID) to make the result deterministic.
Key Takeaways
- Window functions compute values across related rows without collapsing the result set. They replace self-joins and correlated subqueries with cleaner, more efficient patterns.
- ROW_NUMBER is your workhorse for deduplication and top-N-per-group queries.
- LAG and LEAD enable row-to-row comparisons for time-series analysis, session detection, and change tracking.
- Running SUM, AVG, and COUNT with frame clauses power cumulative totals and moving averages.
- Always specify the frame clause explicitly. Default frames vary and cause subtle bugs.
- Window functions cannot appear in WHERE — wrap them in a CTE and filter in the outer query.
- Master these patterns and you eliminate entire categories of complex, error-prone SQL.