4 min read
On this page

SQL Deep Dive

SQL is the most valuable query language a software engineer can know. This document covers the features that separate a junior developer writing basic SELECT statements from a senior developer who solves complex business problems in a single query.


Joins

Joins combine rows from two or more tables based on a related column. Understanding joins is the foundation of relational querying.

INNER JOIN

Returns only rows that have a match in both tables.

-- Get all orders with user names (only users who have orders)
SELECT o.id AS order_id, u.name, o.total, o.created_at
FROM orders o
INNER JOIN users u ON o.user_id = u.id;

LEFT JOIN

Returns all rows from the left table, with NULLs for the right table when there is no match.

-- Get all users with their order count (including users with 0 orders)
SELECT u.name, u.email, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.name, u.email;

Pitfall: Using COUNT(*) instead of COUNT(o.id) with LEFT JOIN. COUNT(*) counts NULL rows, so users with no orders would show order_count = 1. COUNT(o.id) correctly returns 0 because o.id is NULL.

RIGHT JOIN and FULL OUTER JOIN

Right join is a left join with the tables swapped. Full outer join returns all rows from both tables. In practice, left join covers 95% of use cases. If you find yourself writing a right join, swap the table order and use a left join instead.

CROSS JOIN

Produces the cartesian product of two tables. Rarely needed, but useful for generating combinations.

-- Generate a report template: every user x every month
SELECT u.name, m.month
FROM users u
CROSS JOIN generate_series(
    '2024-01-01'::date,
    '2024-12-01'::date,
    '1 month'::interval
) AS m(month);

Self joins

A table joined to itself. Common for hierarchical data.

-- Find employees and their managers
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

Window Functions

Window functions perform a calculation across a set of rows related to the current row, without collapsing the result set. They are among the most powerful features in SQL.

Syntax

function_name() OVER (
    [PARTITION BY column_list]
    [ORDER BY column_list]
    [frame_clause]
)
  • PARTITION BY divides rows into groups (like GROUP BY, but without collapsing)
  • ORDER BY defines the order within each partition
  • Frame clause defines which rows relative to the current row are included

ROW_NUMBER, RANK, DENSE_RANK

-- Rank users by total spending within each region
SELECT
    region,
    user_id,
    total_spent,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_spent DESC) AS row_num,
    RANK()       OVER (PARTITION BY region ORDER BY total_spent DESC) AS rank,
    DENSE_RANK() OVER (PARTITION BY region ORDER BY total_spent DESC) AS dense_rank
FROM user_stats;

| Difference | Behavior on ties | |-----------|------------------| | ROW_NUMBER | Assigns unique numbers; ties get arbitrary order | | RANK | Ties get the same rank; next rank is skipped (1, 1, 3) | | DENSE_RANK | Ties get the same rank; next rank is not skipped (1, 1, 2) |

The "latest per group" pattern

One of the most common real-world uses: get the most recent record per group.

-- Each user's most recent order
SELECT * FROM (
    SELECT
        o.*,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
    FROM orders o
) sub
WHERE rn = 1;

Real-world use case: An admin dashboard showing each customer's last login, last purchase, or last support ticket.

Running totals and moving averages

-- Cumulative daily revenue
SELECT
    date,
    revenue,
    SUM(revenue) OVER (ORDER BY date) AS cumulative_revenue
FROM daily_revenue;

-- 7-day moving average of daily revenue
SELECT
    date,
    revenue,
    AVG(revenue) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM daily_revenue;

LAG and LEAD

Access the previous or next row's value without a self-join.

-- Month-over-month revenue growth
SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month,
    revenue - LAG(revenue) OVER (ORDER BY month) AS growth,
    ROUND(
        (revenue - LAG(revenue) OVER (ORDER BY month))
        / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100,
        1
    ) AS growth_percent
FROM monthly_revenue;

NTILE: percentile buckets

-- Divide users into spending quartiles
SELECT
    user_id,
    total_spent,
    NTILE(4) OVER (ORDER BY total_spent) AS spending_quartile
FROM user_stats;

Common Table Expressions (CTEs)

CTEs let you name intermediate query results, making complex queries readable. They are defined with the WITH keyword.

Basic CTE

WITH active_users AS (
    SELECT id, name, email
    FROM users
    WHERE status = 'active' AND last_login > NOW() - INTERVAL '30 days'
)
SELECT au.name, COUNT(o.id) AS recent_orders
FROM active_users au
LEFT JOIN orders o ON au.id = o.user_id AND o.created_at > NOW() - INTERVAL '30 days'
GROUP BY au.name
ORDER BY recent_orders DESC;

Chained CTEs for complex reporting

-- Monthly revenue report with growth rates and running totals
WITH monthly_revenue AS (
    SELECT
        DATE_TRUNC('month', created_at) AS month,
        SUM(total) AS revenue,
        COUNT(*) AS order_count,
        COUNT(DISTINCT user_id) AS unique_customers
    FROM orders
    WHERE status = 'completed'
    GROUP BY DATE_TRUNC('month', created_at)
),
with_growth AS (
    SELECT
        month,
        revenue,
        order_count,
        unique_customers,
        LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
        revenue - LAG(revenue) OVER (ORDER BY month) AS revenue_growth
    FROM monthly_revenue
),
with_running_total AS (
    SELECT
        *,
        SUM(revenue) OVER (ORDER BY month) AS ytd_revenue,
        ROUND(
            revenue_growth / NULLIF(prev_revenue, 0) * 100, 1
        ) AS growth_percent
    FROM with_growth
)
SELECT * FROM with_running_total ORDER BY month;

Recursive CTEs

Recursive CTEs query hierarchical data like org charts, category trees, or graph structures.

-- Find all subordinates of a manager (org chart traversal)
WITH RECURSIVE subordinates AS (
    -- Base case: the manager
    SELECT id, name, manager_id, 0 AS depth
    FROM employees
    WHERE id = 42

    UNION ALL

    -- Recursive case: employees who report to someone in our result set
    SELECT e.id, e.name, e.manager_id, s.depth + 1
    FROM employees e
    INNER JOIN subordinates s ON e.manager_id = s.id
    WHERE s.depth < 10  -- safety limit to prevent infinite recursion
)
SELECT * FROM subordinates ORDER BY depth, name;

Real-world use case: Displaying a category breadcrumb (Electronics > Phones > Smartphones) by walking up the category tree from a leaf node.


Subqueries

Subqueries are queries nested inside other queries. Use them when a CTE feels too heavyweight for a simple filter.

Scalar subquery (returns one value)

-- Users who have spent more than average
SELECT name, total_spent
FROM user_stats
WHERE total_spent > (SELECT AVG(total_spent) FROM user_stats);

EXISTS (correlated subquery)

-- Users who have at least one order in the last 30 days
SELECT u.name, u.email
FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.user_id = u.id
    AND o.created_at > NOW() - INTERVAL '30 days'
);

Performance note: EXISTS stops scanning as soon as it finds one match. It is generally faster than IN (SELECT ...) for large subquery results.

Lateral joins

A lateral join lets the subquery reference columns from the preceding table. Think of it as a "for each row" subquery.

-- For each user, get their 3 most recent orders
SELECT u.name, recent.id AS order_id, recent.total, recent.created_at
FROM users u
CROSS JOIN LATERAL (
    SELECT o.id, o.total, o.created_at
    FROM orders o
    WHERE o.user_id = u.id
    ORDER BY o.created_at DESC
    LIMIT 3
) recent;

Practical Business Problems

Leaderboard with rank

WITH user_scores AS (
    SELECT
        u.id,
        u.name,
        u.avatar_url,
        SUM(a.points) AS total_points
    FROM users u
    INNER JOIN activities a ON u.id = a.user_id
    WHERE a.created_at > DATE_TRUNC('month', NOW())
    GROUP BY u.id, u.name, u.avatar_url
)
SELECT
    DENSE_RANK() OVER (ORDER BY total_points DESC) AS rank,
    name,
    avatar_url,
    total_points
FROM user_scores
ORDER BY rank
LIMIT 100;

Cohort retention analysis

-- What percentage of users who signed up each month are still active N months later?
WITH user_cohorts AS (
    SELECT
        id AS user_id,
        DATE_TRUNC('month', created_at) AS cohort_month
    FROM users
),
user_activity AS (
    SELECT
        user_id,
        DATE_TRUNC('month', created_at) AS activity_month
    FROM orders
)
SELECT
    c.cohort_month,
    COUNT(DISTINCT c.user_id) AS cohort_size,
    COUNT(DISTINCT CASE WHEN a.activity_month = c.cohort_month + INTERVAL '1 month'
        THEN c.user_id END) AS retained_month_1,
    COUNT(DISTINCT CASE WHEN a.activity_month = c.cohort_month + INTERVAL '2 months'
        THEN c.user_id END) AS retained_month_2,
    COUNT(DISTINCT CASE WHEN a.activity_month = c.cohort_month + INTERVAL '3 months'
        THEN c.user_id END) AS retained_month_3
FROM user_cohorts c
LEFT JOIN user_activity a ON c.user_id = a.user_id
GROUP BY c.cohort_month
ORDER BY c.cohort_month;

Gap detection (finding missing data)

-- Find days with no orders (useful for monitoring/alerting)
WITH date_range AS (
    SELECT generate_series(
        '2024-01-01'::date,
        CURRENT_DATE,
        '1 day'::interval
    )::date AS date
),
daily_orders AS (
    SELECT created_at::date AS date, COUNT(*) AS order_count
    FROM orders
    GROUP BY created_at::date
)
SELECT dr.date
FROM date_range dr
LEFT JOIN daily_orders do ON dr.date = do.date
WHERE do.order_count IS NULL
ORDER BY dr.date;

Pitfalls

  • Overusing subqueries where a JOIN suffices. Correlated subqueries run once per row. A JOIN often produces the same result more efficiently.
  • CTEs as optimization barriers. In PostgreSQL 12+, CTEs can be inlined by the optimizer. In older versions, CTEs are materialized as optimization fences. If performance matters on older PG, test CTEs vs subqueries.
  • Window functions without PARTITION BY. If you forget PARTITION BY, the window function operates over the entire result set, which may not be what you intend.
  • Not handling NULLs in LAG/LEAD. The first row has no LAG value (it is NULL). Use COALESCE or handle NULLs in your application code.
  • Recursive CTEs without depth limits. A cycle in your data (employee A reports to B, B reports to A) causes infinite recursion. Always add a WHERE depth < N guard.