Scenario: Brightlane's data analyst ran EXPLAIN on a query that pulls order line items together with their products to compute per-category item counts, and saw the planner estimating 20 total rows at the nested-loop step. Actual execution processed over 100 rows, driving cost much higher than expected. The analyst suspects certain product categories contribute disproportionately.
Task: Write a query to return each category_id and the total number of line items associated with products in that category, so the analyst can see the actual per-category contribution.
Assumptions:
- One row in the result covers every line item whose product shares the same
category_id.
Output:
- One row per
category_idpresent in the data with at least one line item. - Columns in this order:
category_id,item_count.
Schema · ecommerce 5 tables
Run previews · Check grades
Write a query, then run it to see results here.
Worked solution Try it yourself first
SELECT
p.category_id,
COUNT(oi.id) AS item_count
FROM
order_items oi
JOIN products p ON oi.product_id = p.id
GROUP BY
p.category_id The shape
The planner estimated 20 rows out of the nested-loop step; actual execution processed over 100, and the per-category breakdown shows why — category 1 alone holds 25 line items, category 5 holds 19, category 6 holds 13. The skew across categories is what the planner was missing. The query joins order_items to products, groups by category_id, and counts.
Clause by clause
SELECT p.category_id, COUNT(oi.id) AS item_countreturns the category and the number of line items associated with its products.COUNT(oi.id)counts each joined line-item row once.FROM order_items oireads the line-item records — the larger of the two tables, and the side the nested-loop's outer plan would scan if the planner believes it produces few rows.JOIN products p ON oi.product_id = p.idpairs each line item with its product, bringingcategory_idinto reach. The inner join discards any line item whose product has been removed from the catalog.GROUP BY p.category_idpartitions the joined rows by category, so the count runs once per category.
Why this and not just counting order_items directly
A bare COUNT(*) FROM order_items returns the total line-item count without telling the analyst which categories drive it. The planner's 20-row estimate at the nested-loop step is shaped by its model of the join's cardinality, and that model is built from the joint distribution of order_items.product_id and products.category_id. Breaking the count down per category exposes which categories the planner is mismodeling — category 1 with 25 items is doing the disproportionate work. Without the per-category split, the analyst only knows the total is wrong; with it, they know which category's statistic to fix.
The trap
A nested-loop join's cost scales linearly with the row count of its outer side. When the planner estimates 20 rows on a node that actually produces 100+, it picked nested-loop because it expected the inner-side lookups to be cheap and few. At 100+ outer rows the same plan becomes a hash join's natural territory, but the planner is locked into its choice once execution starts. The fix isn't visible in the query — it's in the statistics that produced the underestimate, which is what running ANALYZE on the underlying tables addresses. Reading the actual per-category counts is what tells the analyst whether the issue is one fat category (a histogram problem) or many small ones the planner under-counted across the board (a row-count problem).
You practiced verifying an estimated nested-loop output size against actual row counts — the gap reveals which join cardinality the planner is mismodeling.