Scenario: Brightlane's data analyst ran EXPLAIN on a per-customer revenue rollup that pulls orders together with customers, and saw the per-customer grouping step estimating only 5 customer groups. Execution was significantly slower than expected, suggesting the real customer count is much higher.
Task: Write a query to return each customer_id and the combined total_amount across all of their orders, so the analyst can see the actual group count and revenue distribution.
Assumptions:
- A customer's total_revenue is the combined total_amount across all of their orders.
- The result covers only customers who have placed at least one order.
Output:
- One row per customer with at least one order on record.
- Columns in this order: customer_id, total_revenue.
Worked solution
SELECT
  o.customer_id,
  SUM(o.total_amount) AS total_revenue
FROM
  orders o
  JOIN customers c ON o.customer_id = c.id
GROUP BY
  o.customer_id;
The shape
The planner expected 5 customer groups at the per-customer grouping step; the real count is 62 — over an order of magnitude off. The query joins orders to customers, groups by customer_id, and sums total_amount per customer. The grouped result shape mirrors what the planner was estimating, so the actual group count lines up against the planner's 5-row estimate directly.
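To put a single number next to the planner's estimate, the grouped query can be wrapped in a count. This is a minimal sketch that reuses the table and column names from the worked solution; the alias per_customer and the output column actual_group_count are illustrative names, not part of the exercise.

```sql
-- Count how many groups the per-customer rollup actually produces.
-- The inner query is the worked solution's join and GROUP BY, minus the SUM.
SELECT COUNT(*) AS actual_group_count
FROM (
  SELECT o.customer_id
  FROM orders o
  JOIN customers c ON o.customer_id = c.id
  GROUP BY o.customer_id
) AS per_customer;
```

If the figures quoted above hold for this dataset, the count should come back as 62, which can be read directly against the 5 groups shown in EXPLAIN.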
Clause by clause
- SELECT o.customer_id, SUM(o.total_amount) AS total_revenue returns each customer's ID and their summed order total. SUM adds the total_amount of every order in that customer's group.
- FROM orders o reads the order records: the side carrying both the customer reference and the amount being summed.
- JOIN customers c ON o.customer_id = c.id matches each order to its customer. The join doesn't bring new columns into the SELECT, but it does restrict the result to orders whose customer still exists in the customers table, which is the prompt's "customers who have placed at least one order" constraint, read from the orders side.
- GROUP BY o.customer_id partitions the joined rows by customer. Each output row is one customer's full order history rolled up to a single number.
Why this and not grouping on c.id
o.customer_id and c.id are equal on every joined row (that's the join condition), so grouping on either column produces the same groups and the same sums. Choosing the orders side keeps the planner's grouping work on the larger table's column, which mirrors the shape EXPLAIN was estimating — that's the surface the 5-group estimate was attached to. The two are interchangeable on this query; the choice is about which side reads more naturally as "one customer per group."
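For comparison, the same rollup grouped on the customers side could look like the sketch below. It assumes the same orders and customers tables as the worked solution and aliases c.id back to customer_id so the output columns keep the required names.

```sql
-- Same join, grouped on the customers side instead of the orders side.
-- Because the join condition forces o.customer_id = c.id on every joined row,
-- the groups and the per-group sums match the worked solution.
SELECT
  c.id AS customer_id,
  SUM(o.total_amount) AS total_revenue
FROM
  orders o
  JOIN customers c ON o.customer_id = c.id
GROUP BY
  c.id;
```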
The trap
The planner's group-count estimate comes from the distinct-value statistic on the grouped column. When that statistic reports 5 but the table actually carries 62, the planner picks a hash-aggregate sized for 5 buckets and the runtime rehashes as the 6th, 7th, ... 62nd customer appears. The cost shows up as elevated actual time on the aggregate node, even though the row counts at the scan level may look fine. The misestimate isn't in the join cardinality — it's in the post-join distinct-value count, which depends on a statistic the planner reads off the column being grouped. ANALYZE on the underlying table refreshes that statistic; until then, every group-by on this column gets the same wrong allocation.
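One way to check the stored statistic and see the effect of a refresh is sketched below. It assumes PostgreSQL, which the exercise does not name explicitly: pg_stats is PostgreSQL's statistics view, and other engines expose this information differently. The table and column names come from the worked solution.

```sql
-- Assumption: PostgreSQL. Inspect the distinct-value statistic for the grouped column.
-- A positive n_distinct is an estimated count of distinct values;
-- a negative value is a fraction of the table's row count, negated.
SELECT n_distinct
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'customer_id';

-- Refresh the statistics for the table that carries the grouped column.
ANALYZE orders;

-- Re-run the rollup with actual row counts shown next to the estimates,
-- then compare estimated and actual rows on the aggregate node.
EXPLAIN ANALYZE
SELECT
  o.customer_id,
  SUM(o.total_amount) AS total_revenue
FROM
  orders o
  JOIN customers c ON o.customer_id = c.id
GROUP BY
  o.customer_id;
```

Running the first query before and after ANALYZE shows whether the stored statistic moved toward the real group count.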
You practiced computing the real per-group count and per-group totals to compare against EXPLAIN's grouping-step estimate — the discrepancy at that step drives statistics refresh decisions.