Window Functions

Window functions are the most powerful feature in SQL for data engineering work. They let you perform calculations across rows related to the current row without collapsing the result set. Where GROUP BY reduces many rows to one, window functions keep every row and add computed columns. Once you internalize these patterns, you will stop writing self-joins and correlated subqueries entirely.

How Window Functions Work

A window function operates on a set of rows (the "window") defined by the OVER clause. The function computes a value for each row based on the rows in its window.

-- Basic structure
SELECT
    column,
    FUNCTION() OVER (
        PARTITION BY grouping_column
        ORDER BY ordering_column
        ROWS BETWEEN start AND end
    ) AS computed_value
FROM table;

PARTITION BY divides rows into groups (like GROUP BY, but without collapsing). Each partition is processed independently.

ORDER BY determines the order of rows within each partition. Required for ranking and running calculations.

Frame clause (ROWS BETWEEN) defines exactly which rows relative to the current row are included in the calculation. Defaults vary by function and database.

Ranking Functions

ROW_NUMBER

Assigns a sequential number to each row within its partition. No ties — if two rows are equal, one arbitrarily gets the lower number.

-- Most recent order per customer
WITH ranked_orders AS (
    SELECT
        customer_id,
        order_id,
        order_date,
        order_amount,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY order_date DESC
        ) AS rn
    FROM orders
)
SELECT customer_id, order_id, order_date, order_amount
FROM ranked_orders
WHERE rn = 1;

This is one of the most common patterns in data engineering. "Give me the latest/first/top-N per group" is a ROW_NUMBER problem.

Deduplication with ROW_NUMBER

Raw data often contains duplicates. ROW_NUMBER is the standard way to deduplicate:

-- Remove duplicate events, keeping the earliest occurrence
WITH deduplicated AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY event_id
            ORDER BY ingested_at ASC
        ) AS rn
    FROM raw_events
)
SELECT *
FROM deduplicated
WHERE rn = 1;

RANK & DENSE_RANK

Unlike ROW_NUMBER, these handle ties explicitly.

SELECT
    student_name,
    score,
    ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
    RANK()       OVER (ORDER BY score DESC) AS rank,
    DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank
FROM exam_results;

student_name | score | row_num | rank | dense_rank
Alice        | 95    | 1       | 1    | 1
Bob          | 95    | 2       | 1    | 1
Charlie      | 90    | 3       | 3    | 2
Diana        | 85    | 4       | 4    | 3

ROW_NUMBER: Always unique. Alice gets 1, Bob gets 2 (arbitrary).
RANK: Ties get the same rank. Next rank skips. Alice and Bob both get 1, Charlie gets 3.
DENSE_RANK: Ties get the same rank. Next rank does not skip. Alice and Bob both get 1, Charlie gets 2.

When to use each:

ROW_NUMBER: Deduplication, top-N per group, any case where you need exactly one row
RANK: Leaderboards where ties should be visible and gaps in ranking are expected
DENSE_RANK: Top-N distinct values (e.g., "top 3 scores" where ties mean more than 3 students qualify)

LAG & LEAD

LAG looks at a previous row. LEAD looks at a following row. They are essential for comparing sequential events.

-- Compare each month's revenue to the previous month
SELECT
    order_month,
    revenue,
    LAG(revenue, 1) OVER (ORDER BY order_month) AS prev_month_revenue,
    revenue - LAG(revenue, 1) OVER (ORDER BY order_month) AS month_over_month_change,
    ROUND(
        100.0 * (revenue - LAG(revenue, 1) OVER (ORDER BY order_month))
        / NULLIF(LAG(revenue, 1) OVER (ORDER BY order_month), 0),
        2
    ) AS pct_change
FROM monthly_revenue;

order_month | revenue  | prev_month | mom_change | pct_change
2025-01     | 150000   | NULL       | NULL       | NULL
2025-02     | 165000   | 150000     | 15000      | 10.00
2025-03     | 142000   | 165000     | -23000     | -13.94

Session Detection

LAG is commonly used to detect session boundaries in event streams:

-- Detect new sessions: a gap of 30+ minutes between events
WITH event_gaps AS (
    SELECT
        user_id,
        event_timestamp,
        LAG(event_timestamp) OVER (
            PARTITION BY user_id
            ORDER BY event_timestamp
        ) AS prev_event_timestamp,
        EXTRACT(EPOCH FROM
            event_timestamp - LAG(event_timestamp) OVER (
                PARTITION BY user_id
                ORDER BY event_timestamp
            )
        ) / 60.0 AS minutes_since_last_event
    FROM events
),

session_starts AS (
    SELECT
        *,
        CASE
            WHEN minutes_since_last_event IS NULL THEN 1  -- first event
            WHEN minutes_since_last_event > 30 THEN 1     -- new session
            ELSE 0
        END AS is_new_session
    FROM event_gaps
)

SELECT
    user_id,
    event_timestamp,
    SUM(is_new_session) OVER (
        PARTITION BY user_id
        ORDER BY event_timestamp
    ) AS session_id
FROM session_starts;

This pattern — using LAG to detect gaps, then using a running SUM to assign IDs — is a building block for session analysis, funnel construction, and state-change tracking.

Running Aggregations

Running Total

-- Cumulative revenue over time
SELECT
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_revenue
FROM daily_revenue_summary;

Running Average

-- 7-day moving average of daily orders
SELECT
    order_date,
    daily_orders,
    ROUND(
        AVG(daily_orders) OVER (
            ORDER BY order_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ),
        2
    ) AS moving_avg_7d
FROM daily_order_counts;

Note: ROWS BETWEEN 6 PRECEDING AND CURRENT ROW gives a 7-day window (current day plus 6 prior days). This is a common off-by-one source.

Partitioned Running Totals

-- Cumulative revenue per product category
SELECT
    order_date,
    product_category,
    daily_revenue,
    SUM(daily_revenue) OVER (
        PARTITION BY product_category
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS category_cumulative_revenue
FROM daily_category_revenue;

PARTITION BY resets the running total for each category. Without it, the running total spans all categories.

FIRST_VALUE & LAST_VALUE

-- For each order, show the customer's first and most recent order amounts
SELECT
    customer_id,
    order_id,
    order_date,
    order_amount,
    FIRST_VALUE(order_amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS first_order_amount,
    LAST_VALUE(order_amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS latest_order_amount
FROM orders;

LAST_VALUE requires ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Without it, the default frame ends at the current row, and LAST_VALUE just returns the current row's value — which is never what you want.

NTILE

NTILE divides rows into N roughly equal buckets. Useful for percentile analysis and segmentation.

-- Divide customers into spending quartiles
WITH customer_spending AS (
    SELECT
        customer_id,
        SUM(order_amount) AS total_spending
    FROM orders
    WHERE order_date >= '2025-01-01'
    GROUP BY customer_id
)

SELECT
    customer_id,
    total_spending,
    NTILE(4) OVER (ORDER BY total_spending DESC) AS spending_quartile
FROM customer_spending;

customer_id | total_spending | spending_quartile
C-1001      | 52000          | 1    (top 25%)
C-1002      | 38000          | 1
C-1003      | 15000          | 2
C-1004      | 12000          | 2
C-1005      | 5000           | 3
C-1006      | 3000           | 3
C-1007      | 1500           | 4
C-1008      | 500            | 4    (bottom 25%)

Frame Clauses

The frame clause controls exactly which rows the window function considers. This is where subtle bugs hide.

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW     -- all rows from start to here
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW             -- last 7 rows
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING      -- here to end
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING  -- all rows in partition
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING              -- previous, current, next

ROWS vs RANGE: ROWS counts physical rows. RANGE groups rows with the same ORDER BY value together. For most data engineering work, use ROWS — it is more predictable.

-- ROWS: strictly the previous 2 physical rows
SUM(amount) OVER (ORDER BY order_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)

-- RANGE: all rows with order_date within 2 units of the current row
-- If multiple orders share a date, RANGE includes all of them
SUM(amount) OVER (ORDER BY order_date RANGE BETWEEN 2 PRECEDING AND CURRENT ROW)

Combining Window Functions

Real-world queries often use multiple window functions together:

-- Customer order analysis: ranking, comparison, and running total
SELECT
    customer_id,
    order_id,
    order_date,
    order_amount,
    ROW_NUMBER() OVER w AS order_sequence,
    order_amount - LAG(order_amount) OVER w AS change_from_last,
    SUM(order_amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS lifetime_spend_so_far,
    AVG(order_amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS rolling_3_order_avg
FROM orders
WINDOW w AS (PARTITION BY customer_id ORDER BY order_date);

The WINDOW clause (supported in PostgreSQL, BigQuery, and others) lets you define a named window and reuse it, reducing repetition.

Common Pitfalls

Forgetting PARTITION BY. Without PARTITION BY, the window function operates across the entire result set. Your "per-customer" running total becomes a global running total.
Wrong frame clause with LAST_VALUE. The default frame often ends at the current row, making LAST_VALUE return the current row's value. Always specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING when using LAST_VALUE.
Off-by-one in moving windows. ROWS BETWEEN 6 PRECEDING AND CURRENT ROW is 7 rows, not 6. Count carefully.
Using window functions in WHERE clauses. Window functions cannot be used directly in WHERE. Wrap them in a CTE or subquery and filter in the outer query.
Performance on large datasets. Window functions that scan entire partitions (UNBOUNDED PRECEDING to UNBOUNDED FOLLOWING) on large partitions can be slow. Partition on high-cardinality columns and consider whether a self-join might be more efficient for very specific lookups.
Assuming deterministic ROW_NUMBER without a tiebreaker. If two rows have the same ORDER BY value, ROW_NUMBER assigns them arbitrarily. Add a tiebreaker column (like a unique ID) to make the result deterministic.

Key Takeaways

Window functions compute values across related rows without collapsing the result set. They replace self-joins and correlated subqueries with cleaner, more efficient patterns.
ROW_NUMBER is your workhorse for deduplication and top-N-per-group queries.
LAG and LEAD enable row-to-row comparisons for time-series analysis, session detection, and change tracking.
Running SUM, AVG, and COUNT with frame clauses power cumulative totals and moving averages.
Always specify the frame clause explicitly. Default frames vary and cause subtle bugs.
Window functions cannot appear in WHERE — wrap them in a CTE and filter in the outer query.
Master these patterns and you eliminate entire categories of complex, error-prone SQL.