The SQL roadmap
Learn SQL, concept by concept.
Sixty-six concepts in prerequisite order. Read the lecture, run every example against a real Postgres database in your browser, then practice. No signup to start.
In order
Concepts follow a prerequisite graph, so each one builds on what came before. You never hit something you are not ready for.
Run it free
Every example and problem runs against a real Postgres database in your browser. Nothing to install, no signup to start.
Mastery, measured
An adaptive engine tracks your Mastery Confidence on each concept and brings problems back for review before you forget.
All 66 concepts
Grouped by tier, in the order they build.
Tier 1 · Foundations
The shape of a query: choose columns, filter rows, sort, and limit what comes back.
The SELECT clause controls what columns and expressions a query returns. Its scope is the shape of the output — which columns appear and what they contain — with no influence over which rows are included.
The FROM clause identifies the row source a query operates on. It establishes the full population of rows that all subsequent clauses filter, aggregate, and shape.
The WHERE clause filters the rows produced by FROM, keeping only those for which a condition evaluates to true. Its job is deciding which rows get processed at all.
Every value in a PostgreSQL query has a type, and PostgreSQL uses that type to determine what operations are valid and how expressions are evaluated. Literal values and type casting are the mechanisms by which you introduce and convert typed values within a query.
NULL represents the absence of a known value. Every NULL-related behavior in PostgreSQL follows from that single fact.
The WHERE clause accepts a single boolean expression, and AND, OR, and NOT are the operators for combining multiple conditions into one. Their behavior follows three-valued logic, which means NULL in any condition affects how the combined expression evaluates.
ORDER BY controls the sequence of rows in a query's output. Without it, PostgreSQL makes no guarantee about the order in which rows are returned.
LIMIT restricts the number of rows a query returns. OFFSET skips a specified number of rows before returning the rest. Both operate on the final sorted result set.
A column alias assigns a name to an expression in the SELECT list. That name appears as the column header in the result set and can be referenced by name in specific parts of the same query.
DISTINCT removes duplicate rows from a query's output. It operates on the complete set of output columns, treating two rows as duplicates only when every column value matches.
Arithmetic and comparison expressions are the building blocks of computed values and conditions in SQL. Their behavior is governed entirely by the types of their operands.
BETWEEN, IN, and LIKE are shorthand filter operators that expand on the basic comparison operators available in WHERE. Each tests a column value against a range, a set, or a pattern.
Tier 2 · Core SQL
Aggregation, grouping, joins, and your first subqueries. The everyday analyst toolkit.
Aggregate functions reduce a set of rows to a single computed value. COUNT, SUM, AVG, MIN, and MAX each collapse the row set differently, and each has specific behavior around NULLs, type coercion, and what counts as input.
GROUP BY partitions a query's rows into groups and directs aggregate functions to operate on each group independently. Without it, an aggregate function collapses the entire row set into one value. With it, the same function produces one value per group.
The HAVING clause filters groups after aggregation has occurred. It operates on the output of GROUP BY, applying conditions to aggregate values that do not exist until grouping is complete.
The CASE WHEN expression evaluates a series of conditions and returns a value for the first condition that is true. It can appear anywhere in a query where a value expression is valid: in SELECT, WHERE, ORDER BY, and inside aggregate functions.
INNER JOIN combines rows from two tables by matching them on a condition, returning one output row for every pair of rows — one from each table — where that condition is true. Rows from either table that have no match in the other are excluded from the result.
LEFT JOIN returns all rows from the left table, paired with matching rows from the right table where the join condition is true, and fills unmatched positions with NULL. RIGHT JOIN does the same in the opposite direction, preserving all rows from the right table.
FULL OUTER JOIN returns all rows from both tables, matching them where the join condition is true and filling unmatched positions on either side with NULL. It combines the preservation behavior of LEFT JOIN and RIGHT JOIN into a single operation.
CROSS JOIN combines every row from one table with every row from another, producing a result with no join condition. The output contains one row for every possible pairing between the two tables.
A self-join joins a table to itself, treating two instances of the same table as distinct participants in the join. It follows the same mechanics as any other join; the only requirement is that both instances carry separate aliases so PostgreSQL can distinguish them.
Joining multiple tables extends the two-table join pattern by chaining additional JOIN clauses, each introducing a new table into the working result set. The result of each join becomes the input for the next.
Set operations combine the results of two or more SELECT statements into a single result set. Each operation has distinct rules for which rows survive the combination.
A scalar subquery is a SELECT statement embedded inside another query that returns exactly one column and one row, producing a single value. That value can be used anywhere a literal or column expression is valid.
`IN`, `EXISTS`, `ANY`, and `ALL` extend WHERE filtering to test a row's values against the results of a subquery. Each operator expresses a different membership or comparison relationship between the outer row and the subquery's output.
A subquery in the FROM clause produces a named result set that the outer query treats as a table. The inner query executes completely before the outer query begins, making the inner query's output the sole source the outer query works from.
Conditional aggregation places a CASE WHEN expression inside an aggregate function, letting a single GROUP BY query compute separate aggregate values for different subsets of rows simultaneously.
Tier 3 · Intermediate
CTEs, date and string work, conditional logic, and an introduction to window functions.
`COALESCE` and `NULLIF` are two scalar functions that convert between NULL and non-NULL values. `COALESCE` replaces NULL with a fallback; `NULLIF` produces NULL when a condition is met.
Joins and aggregate functions each interact with NULL in ways that are consistent with NULL semantics but non-obvious in combination. The failures they produce are silent: the query runs, returns results, and the results are wrong.
A Common Table Expression (CTE) defines a named subquery at the top of a statement that the rest of the query can reference by name. Its scope is a single query execution: the name exists only within the statement that defines it.
A chained CTE query defines multiple named subqueries in a single `WITH` clause, each available to the main query and to CTEs that follow it. The chain is a sequence of named intermediate layers, resolved in the order they are written.
PostgreSQL provides four primary types for representing points in time and durations: `DATE`, `TIME`, `TIMESTAMP`, and `TIMESTAMPTZ`. The choice between them determines what gets stored, how PostgreSQL interprets the value, and how comparisons between values behave.
`DATE_TRUNC` and `EXTRACT` are the two primary functions for working with date/time values at a specific calendar granularity. `DATE_TRUNC` rounds a datetime down to a period boundary and returns a datetime. `EXTRACT` pulls a single numeric component out of a datetime and returns a number.
PostgreSQL supports arithmetic between date/time values and a dedicated duration type, `INTERVAL`, which represents a span of time rather than a point in it. The result type of a date/time arithmetic expression depends on the types of both operands.
`LENGTH`, `UPPER`, `LOWER`, `TRIM`, and `SUBSTRING` are scalar string functions that inspect or transform a single text value and return a single result. Each operates on one row at a time, producing one output value per input row.
PostgreSQL provides three tools for combining and formatting strings: the `||` operator, the `CONCAT` and `CONCAT_WS` functions, and the `FORMAT` function. They differ in how they handle NULL inputs and in what kind of output they are designed to produce.
PostgreSQL provides four pattern matching mechanisms for string columns: `LIKE`, `ILIKE`, `SIMILAR TO`, and the POSIX regex operators. Each uses a different pattern language with different expressive power and different performance characteristics.
A window function computes a value for each row using a set of related rows, without collapsing those rows into a single output row. The `OVER` clause defines which rows are related to each other and in what order.
`ROW_NUMBER`, `RANK`, and `DENSE_RANK` are window functions that assign an integer position to each row within a partition based on an `ORDER BY` clause. They produce different values only when two or more rows are tied on the ordering expression.
The aggregate functions `SUM`, `AVG`, `COUNT`, `MIN`, and `MAX` can be used as window functions by attaching an `OVER` clause. Without `ORDER BY` inside `OVER`, they compute a static value across the entire partition. With `ORDER BY` inside `OVER`, they compute a running value that accumulates as it moves through the partition.
`CREATE TABLE AS SELECT` and `CREATE TEMP TABLE` materialize the result of a query into an actual database table. Unlike a CTE, the result persists beyond the query that creates it and can be referenced by subsequent queries in the same session.
Tier 4 · Advanced
The full window-function range, lateral joins, JSON, and recursive queries.
LAG and LEAD are window functions that give each row access to the value of an expression from a different row in the same window partition, without a join or subquery. The offset is positional: a fixed number of rows before (LAG) or after (LEAD) the current row, determined by the partition's ORDER BY.
FIRST_VALUE, LAST_VALUE, and NTH_VALUE return the value of an expression from a specific position within the window: the first row, the last row, or the nth row. Their result for each input row is a positional lookup: the value of the expression at a designated position within the ordered partition, retrieved and attached to every row in the window.
The window frame clause narrows the set of rows a window function operates on for each position in the partition. Where PARTITION BY divides the result set into independent windows and ORDER BY sequences the rows within each window, the frame clause defines a sliding boundary that moves with each row, determining exactly which rows contribute to that row's computation.
NTILE divides a partition's rows into a specified number of ranked buckets and assigns each row a bucket number. The percentile functions PERCENTILE_CONT and PERCENTILE_DISC compute a value at a specified percentile position within a sorted set of values. These are separate mechanisms with separate purposes, but both answer questions about where a value sits in a distribution.
DISTINCT ON returns one row per distinct value of a specified expression, choosing which row to keep based on ORDER BY. It is PostgreSQL-specific syntax that solves the "one row per group, with full row data" problem that standard SQL requires a subquery or window function to handle.
A correlated subquery references a column from its outer query, creating a dependency that forces the subquery to be re-evaluated once for each row the outer query processes. The subquery cannot execute independently; its result changes based on which outer row is currently being examined.
LATERAL allows a subquery in the FROM clause to reference columns from tables that appear earlier in the same FROM clause. Without LATERAL, each subquery in FROM is evaluated independently, with no access to the rows being produced by other FROM items. LATERAL lifts that restriction.
The FILTER clause attaches a row-level condition directly to an aggregate function, restricting which rows that specific aggregate counts or sums, while leaving other aggregates in the same SELECT list unaffected.
STRING_AGG and ARRAY_AGG are aggregate functions that collect multiple row values into a single output value per group. STRING_AGG concatenates values into a delimited string. ARRAY_AGG collects values into a PostgreSQL array. Both collapse a set of rows into one row, like COUNT or SUM, and their output is the collection of contributing values in a structured form.
generate_series() is a PostgreSQL set-returning function that produces a sequence of values from a start point to an end point at a specified step interval. Its primary analytical use is generating a complete, gap-free sequence of dates or integers that can be joined to fact data to ensure every period appears in the result, even periods with no underlying activity.
A recursive CTE defines a query that references itself, allowing it to accumulate rows iteratively until a termination condition is met. Recursive CTEs handle hierarchical and graph-structured data where the depth of traversal is unknown at query-write time and cannot be expressed with a fixed number of joins.
The `->` and `->>` operators extract fields and elements from JSONB columns. `->` returns the extracted value as JSONB, preserving its type structure. `->>` returns the extracted value as plain text, discarding all type information.
`jsonb_agg` and `json_build_object` move in opposite directions across the boundary between relational and JSONB structure. `jsonb_agg` collects relational rows into a JSONB array. `json_build_object` assembles a JSONB object from key-value pairs supplied as arguments. Together they cover the two primary operations analysts need when producing JSONB output from relational data.
A date spine is a complete, gap-free sequence of dates or periods that serves as the structural backbone of a time-series query. Joining a fact table to a date spine with a LEFT JOIN ensures every period in the desired range appears in the result, even periods where no activity occurred in the underlying data.
Period-over-period analysis compares a metric in one time period to the same metric in a prior period of equal length: this week versus last week, this month versus last month, this quarter versus the same quarter last year. The SQL implementation uses LAG to bring a prior period's value into the same row as the current period's value, enabling the comparison as inline arithmetic.
Grouping by date periods means truncating timestamps or dates to a calendar unit and using the truncated value as the GROUP BY key. The result is one row per period (one per day, one per week, one per month) with aggregated measures across all rows that fall within each period.
Tier 5 · Expert
Analytical patterns and query reasoning: running totals, sessionization, reading EXPLAIN.
Multi-CTE query architecture is the practice of decomposing a complex analytical query into a sequence of named intermediate steps, each expressed as a CTE, where each step does one clearly scoped thing and feeds the next. The architecture imposes deliberate structure on queries that would otherwise become a single deeply nested expression that is correct but unreadable and difficult to debug.
Join fanout is a correctness problem that occurs when a join multiplies rows beyond what the query intends, causing subsequent aggregations to compute over inflated row counts and produce wrong results. The query runs without error, returns plausible-looking numbers, and gives no indication anything is wrong.
`EXPLAIN` shows the execution plan PostgreSQL's query planner chose for a query before running it. Reading that plan tells an analyst where the database will spend its time, how many rows it expects to process at each step, and whether the structure of the query is causing the planner to make expensive choices.
Query structure influences how the PostgreSQL planner can execute a query. Two queries that produce identical results can generate very different execution plans depending on how they are written, because the planner's ability to push filters down, choose join strategies, and avoid unnecessary materializations depends on the shape of the SQL it receives.
Subqueries, CTEs, and joins are three different ways to compose multiple relational operations in a single query. Choosing between them is a judgment about readability, correctness, performance, and reuse, applied to the specific problem at hand.
NULL propagates through a multi-step query in ways that compound across operations. A NULL introduced in one layer of a complex query can travel through subsequent joins, window functions, and aggregations, arriving in the final output at a location that gives no indication of where it originated. The challenge in complex queries is tracing how NULL moves across multiple operations that each handle it differently, from the layer where it was introduced to the output column where it appears.
A running total accumulates a metric across an ordered sequence of rows, so that each row carries the sum of all preceding rows plus its own value. Cumulative metrics extend this pattern: cumulative revenue, cumulative user signups, cumulative events. Both are produced by combining period-grouped aggregation with a window function that uses an expanding frame over the ordered result.
Sessionization groups a stream of timestamped events into discrete sessions by detecting gaps between events that exceed a defined timeout. Funnel analysis measures how many users complete each step in a defined sequence, and how many drop off between steps. Both patterns combine LAG, CTEs, and ordered event data into multi-stage query structures that transform raw event logs into interpretable behavioral summaries.
Analyst debugging is the practice of systematically locating the layer in a complex query where a result diverges from the expected output. The process applies to two distinct classes of problem: wrong results, where the query returns a plausible but incorrect value, and unexpected NULLs or missing rows, where the output is incomplete in ways that are not immediately traceable to the query's structure.
Learn SQL the way it actually works.
Read a concept, run the examples, and practice until the engine confirms you have it. Start with SELECT, no signup required.
Start free