Premium Categories by Average Price — CTEs in SQL

The problem

Brightlane's catalog team is reviewing pricing for properly categorized products. Products awaiting classification should not influence the analysis.

Write a query to return the category ID and average price for every assigned category whose average product price exceeds $500.

Assumptions:

Products awaiting classification have a missing category_id and should not contribute to the analysis.
A category's average price is the average of every price value among products linked to that category_id.
Only categories whose average price exceeds $500 should appear.

Output:

One row per qualifying category, with columns category_id and avg_price.

Schema · ecommerce 5 tables

The shape

Two filters in two different places. The WHERE category_id IS NOT NULL lives inside the WITH layer and removes the uncategorized products before the grouping ever sees them. The WHERE avg_price > 500 lives in the main query and runs against the layer's aggregate output. Each filter belongs where the column it references is in scope.

Clause by clause

The WITH clause defines category_prices:

WITH category_prices AS (
  SELECT category_id, AVG(price) AS avg_price
  FROM products
  WHERE category_id IS NOT NULL
  GROUP BY category_id
)

WHERE category_id IS NOT NULL drops the products awaiting classification; only categorized products reach the grouping. GROUP BY category_id then partitions the survivors per category, and AVG(price) collapses each partition to one row. Categories 5, 6, and 7 end up with averages of 782.33, 1459, and 882.33 respectively.

SELECT category_id, avg_price FROM category_prices WHERE avg_price > 500 is the main query. It reads the named layer and keeps only the rows whose average exceeds 500. All three of the result-set categories clear the threshold; the rest were either filtered out of the layer or fall below 500 in the layer's output.

Why the `IS NOT NULL` filter goes inside the layer and not in the main query

category_id exists as a column on products but does not exist on the layer's output except as the grouping key. The uncategorized rows, the ones with a missing category_id, would collapse into a single NULL group inside the layer and produce a NULL-keyed average that the prompt explicitly excludes. Filtering them in the main query after the grouping is too late: by then they have already contaminated the aggregation. The fix is to remove them upstream, while category_id is still a row-level column and the filter can act on the raw rows.

The trap

The two thresholds look symmetric but are not. The source-level filter is a row condition on products and runs before grouping; the aggregate-level filter is a column condition on category_prices and runs after. Swapping them silently changes the result. Putting avg_price > 500 inside the layer fails because avg_price does not exist as a column until the grouping completes. Putting category_id IS NOT NULL in the main query fails to exclude the uncategorized group's contribution to the layer's aggregation in the first place. Filters that act on grouping inputs go inside the layer; filters that act on grouping outputs go in the main query.

You practiced a two-stage condition: pre-restrict source records inside the WITH layer with a WHERE, then apply a threshold against the layer's aggregate output in the main query.

Return the category ID and average price for every assigned category whose average product price exceeds `$500`

The shape

Clause by clause

Why the `IS NOT NULL` filter goes inside the layer and not in the main query

The trap

Reading explains SQL. Writing it, over and over with instant feedback, is what makes you fluent.

The shape

Clause by clause

Why the IS NOT NULL filter goes inside the layer and not in the main query

The trap

Reading explains SQL. Writing it, over and over with instant feedback, is what makes you fluent.

Why the `IS NOT NULL` filter goes inside the layer and not in the main query