SQL Deep Dive
SQL is the most valuable query language a software engineer can know. This document covers the features that separate a junior developer writing basic SELECT statements from a senior developer who solves complex business problems in a single query.
Joins
Joins combine rows from two or more tables based on a related column. Understanding joins is the foundation of relational querying.
INNER JOIN
Returns only rows that have a match in both tables.
-- Get all orders with user names (only users who have orders)
SELECT o.id AS order_id, u.name, o.total, o.created_at
FROM orders o
INNER JOIN users u ON o.user_id = u.id;
LEFT JOIN
Returns all rows from the left table, with NULLs for the right table when there is no match.
-- Get all users with their order count (including users with 0 orders)
SELECT u.name, u.email, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.name, u.email;
Pitfall: Using COUNT(*) instead of COUNT(o.id) with LEFT JOIN. COUNT(*) counts NULL rows, so users with no orders would show order_count = 1. COUNT(o.id) correctly returns 0 because o.id is NULL.
RIGHT JOIN and FULL OUTER JOIN
Right join is a left join with the tables swapped. Full outer join returns all rows from both tables. In practice, left join covers 95% of use cases. If you find yourself writing a right join, swap the table order and use a left join instead.
CROSS JOIN
Produces the cartesian product of two tables. Rarely needed, but useful for generating combinations.
-- Generate a report template: every user x every month
SELECT u.name, m.month
FROM users u
CROSS JOIN generate_series(
'2024-01-01'::date,
'2024-12-01'::date,
'1 month'::interval
) AS m(month);
Self joins
A table joined to itself. Common for hierarchical data.
-- Find employees and their managers
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
Window Functions
Window functions perform a calculation across a set of rows related to the current row, without collapsing the result set. They are among the most powerful features in SQL.
Syntax
function_name() OVER (
[PARTITION BY column_list]
[ORDER BY column_list]
[frame_clause]
)
- PARTITION BY divides rows into groups (like GROUP BY, but without collapsing)
- ORDER BY defines the order within each partition
- Frame clause defines which rows relative to the current row are included
ROW_NUMBER, RANK, DENSE_RANK
-- Rank users by total spending within each region
SELECT
region,
user_id,
total_spent,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_spent DESC) AS row_num,
RANK() OVER (PARTITION BY region ORDER BY total_spent DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY region ORDER BY total_spent DESC) AS dense_rank
FROM user_stats;
| Difference | Behavior on ties | |-----------|------------------| | ROW_NUMBER | Assigns unique numbers; ties get arbitrary order | | RANK | Ties get the same rank; next rank is skipped (1, 1, 3) | | DENSE_RANK | Ties get the same rank; next rank is not skipped (1, 1, 2) |
The "latest per group" pattern
One of the most common real-world uses: get the most recent record per group.
-- Each user's most recent order
SELECT * FROM (
SELECT
o.*,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
FROM orders o
) sub
WHERE rn = 1;
Real-world use case: An admin dashboard showing each customer's last login, last purchase, or last support ticket.
Running totals and moving averages
-- Cumulative daily revenue
SELECT
date,
revenue,
SUM(revenue) OVER (ORDER BY date) AS cumulative_revenue
FROM daily_revenue;
-- 7-day moving average of daily revenue
SELECT
date,
revenue,
AVG(revenue) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7d
FROM daily_revenue;
LAG and LEAD
Access the previous or next row's value without a self-join.
-- Month-over-month revenue growth
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month,
revenue - LAG(revenue) OVER (ORDER BY month) AS growth,
ROUND(
(revenue - LAG(revenue) OVER (ORDER BY month))
/ NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100,
1
) AS growth_percent
FROM monthly_revenue;
NTILE: percentile buckets
-- Divide users into spending quartiles
SELECT
user_id,
total_spent,
NTILE(4) OVER (ORDER BY total_spent) AS spending_quartile
FROM user_stats;
Common Table Expressions (CTEs)
CTEs let you name intermediate query results, making complex queries readable. They are defined with the WITH keyword.
Basic CTE
WITH active_users AS (
SELECT id, name, email
FROM users
WHERE status = 'active' AND last_login > NOW() - INTERVAL '30 days'
)
SELECT au.name, COUNT(o.id) AS recent_orders
FROM active_users au
LEFT JOIN orders o ON au.id = o.user_id AND o.created_at > NOW() - INTERVAL '30 days'
GROUP BY au.name
ORDER BY recent_orders DESC;
Chained CTEs for complex reporting
-- Monthly revenue report with growth rates and running totals
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', created_at) AS month,
SUM(total) AS revenue,
COUNT(*) AS order_count,
COUNT(DISTINCT user_id) AS unique_customers
FROM orders
WHERE status = 'completed'
GROUP BY DATE_TRUNC('month', created_at)
),
with_growth AS (
SELECT
month,
revenue,
order_count,
unique_customers,
LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
revenue - LAG(revenue) OVER (ORDER BY month) AS revenue_growth
FROM monthly_revenue
),
with_running_total AS (
SELECT
*,
SUM(revenue) OVER (ORDER BY month) AS ytd_revenue,
ROUND(
revenue_growth / NULLIF(prev_revenue, 0) * 100, 1
) AS growth_percent
FROM with_growth
)
SELECT * FROM with_running_total ORDER BY month;
Recursive CTEs
Recursive CTEs query hierarchical data like org charts, category trees, or graph structures.
-- Find all subordinates of a manager (org chart traversal)
WITH RECURSIVE subordinates AS (
-- Base case: the manager
SELECT id, name, manager_id, 0 AS depth
FROM employees
WHERE id = 42
UNION ALL
-- Recursive case: employees who report to someone in our result set
SELECT e.id, e.name, e.manager_id, s.depth + 1
FROM employees e
INNER JOIN subordinates s ON e.manager_id = s.id
WHERE s.depth < 10 -- safety limit to prevent infinite recursion
)
SELECT * FROM subordinates ORDER BY depth, name;
Real-world use case: Displaying a category breadcrumb (Electronics > Phones > Smartphones) by walking up the category tree from a leaf node.
Subqueries
Subqueries are queries nested inside other queries. Use them when a CTE feels too heavyweight for a simple filter.
Scalar subquery (returns one value)
-- Users who have spent more than average
SELECT name, total_spent
FROM user_stats
WHERE total_spent > (SELECT AVG(total_spent) FROM user_stats);
EXISTS (correlated subquery)
-- Users who have at least one order in the last 30 days
SELECT u.name, u.email
FROM users u
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.user_id = u.id
AND o.created_at > NOW() - INTERVAL '30 days'
);
Performance note: EXISTS stops scanning as soon as it finds one match. It is generally faster than IN (SELECT ...) for large subquery results.
Lateral joins
A lateral join lets the subquery reference columns from the preceding table. Think of it as a "for each row" subquery.
-- For each user, get their 3 most recent orders
SELECT u.name, recent.id AS order_id, recent.total, recent.created_at
FROM users u
CROSS JOIN LATERAL (
SELECT o.id, o.total, o.created_at
FROM orders o
WHERE o.user_id = u.id
ORDER BY o.created_at DESC
LIMIT 3
) recent;
Practical Business Problems
Leaderboard with rank
WITH user_scores AS (
SELECT
u.id,
u.name,
u.avatar_url,
SUM(a.points) AS total_points
FROM users u
INNER JOIN activities a ON u.id = a.user_id
WHERE a.created_at > DATE_TRUNC('month', NOW())
GROUP BY u.id, u.name, u.avatar_url
)
SELECT
DENSE_RANK() OVER (ORDER BY total_points DESC) AS rank,
name,
avatar_url,
total_points
FROM user_scores
ORDER BY rank
LIMIT 100;
Cohort retention analysis
-- What percentage of users who signed up each month are still active N months later?
WITH user_cohorts AS (
SELECT
id AS user_id,
DATE_TRUNC('month', created_at) AS cohort_month
FROM users
),
user_activity AS (
SELECT
user_id,
DATE_TRUNC('month', created_at) AS activity_month
FROM orders
)
SELECT
c.cohort_month,
COUNT(DISTINCT c.user_id) AS cohort_size,
COUNT(DISTINCT CASE WHEN a.activity_month = c.cohort_month + INTERVAL '1 month'
THEN c.user_id END) AS retained_month_1,
COUNT(DISTINCT CASE WHEN a.activity_month = c.cohort_month + INTERVAL '2 months'
THEN c.user_id END) AS retained_month_2,
COUNT(DISTINCT CASE WHEN a.activity_month = c.cohort_month + INTERVAL '3 months'
THEN c.user_id END) AS retained_month_3
FROM user_cohorts c
LEFT JOIN user_activity a ON c.user_id = a.user_id
GROUP BY c.cohort_month
ORDER BY c.cohort_month;
Gap detection (finding missing data)
-- Find days with no orders (useful for monitoring/alerting)
WITH date_range AS (
SELECT generate_series(
'2024-01-01'::date,
CURRENT_DATE,
'1 day'::interval
)::date AS date
),
daily_orders AS (
SELECT created_at::date AS date, COUNT(*) AS order_count
FROM orders
GROUP BY created_at::date
)
SELECT dr.date
FROM date_range dr
LEFT JOIN daily_orders do ON dr.date = do.date
WHERE do.order_count IS NULL
ORDER BY dr.date;
Pitfalls
- Overusing subqueries where a JOIN suffices. Correlated subqueries run once per row. A JOIN often produces the same result more efficiently.
- CTEs as optimization barriers. In PostgreSQL 12+, CTEs can be inlined by the optimizer. In older versions, CTEs are materialized as optimization fences. If performance matters on older PG, test CTEs vs subqueries.
- Window functions without PARTITION BY. If you forget
PARTITION BY, the window function operates over the entire result set, which may not be what you intend. - Not handling NULLs in LAG/LEAD. The first row has no
LAGvalue (it is NULL). UseCOALESCEor handle NULLs in your application code. - Recursive CTEs without depth limits. A cycle in your data (employee A reports to B, B reports to A) causes infinite recursion. Always add a
WHERE depth < Nguard.