Revenue per Customer with Join — Reading EXPLAIN Output in SQL

The problem

Scenario: Brightlane's data analyst ran EXPLAIN on a per-customer revenue rollup that pulls orders together with customers, and saw the per-customer grouping step estimating only 5 customer groups. Execution was significantly slower than expected, suggesting the real customer count is much higher.

Task: Write a query to return each customer_id and the combined total_amount across all of their orders, so the analyst can see the actual group count and revenue distribution.

Assumptions:

A customer's total_revenue is the combined total_amount across all of their orders.
The result covers only customers who have placed at least one order.

Output:

One row per customer with at least one order on record.
Columns in this order: customer_id, total_revenue.

Schema · ecommerce 5 tables

The shape

The planner expected 5 customer groups at the per-customer grouping step; the real count is 62, over an order of magnitude off. The query joins orders to customers, groups by customer_id, and sums total_amount per customer. Because the result has one row per group, its row count is the actual group count and can be compared directly with the planner's 5-row estimate.
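Assembled from the clauses walked through in the next section, the query would look roughly like this. The table names, column names, and aliases are the ones the walkthrough uses; the layout and comments are only illustrative.

```sql
-- One row per customer that has at least one order.
SELECT
    o.customer_id,
    SUM(o.total_amount) AS total_revenue   -- combined total_amount per customer
FROM orders o
JOIN customers c
    ON o.customer_id = c.id                -- inner join: keeps only matched orders
GROUP BY o.customer_id;
```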

Clause by clause

SELECT o.customer_id, SUM(o.total_amount) AS total_revenue returns each customer's ID and their summed order total. SUM adds the total_amount of every order in that customer's group.
FROM orders o reads the order records — the side carrying both the customer reference and the amount being summed.
JOIN customers c ON o.customer_id = c.id matches each order to its customer. The join adds no columns to the SELECT, but as an inner join it drops any order whose customer_id has no matching row in the customers table. The prompt's "customers who have placed at least one order" constraint is met because the query reads from the orders side: a customer with no orders contributes no rows and so never forms a group.
GROUP BY o.customer_id partitions the joined rows by customer. Each output row is one customer's full order history rolled up to a single number.
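To see whether the join actually filters anything, one option is to count the orders it would drop. This is a sketch under the assumption that customers.id is never NULL (for example, a primary key); the alias orphaned_orders is made up for illustration.

```sql
-- Orders with no matching customer row. The inner join in the main query
-- discards exactly these rows (including orders whose customer_id is NULL).
SELECT COUNT(*) AS orphaned_orders
FROM orders o
LEFT JOIN customers c
    ON o.customer_id = c.id
WHERE c.id IS NULL;   -- no match found on the customers side
```

If this returns 0, as it would when a foreign-key constraint is enforced, the join changes nothing in the result and the grouped totals come from every order in the table.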

Why this and not grouping on `c.id`

o.customer_id and c.id are equal on every joined row (that's the join condition), so grouping on either column produces the same groups and the same sums. Grouping on the orders side keeps the grouping key on the larger table's column, the same column the 5-group estimate in EXPLAIN was attached to. The two are interchangeable in this query; the choice is about which side reads more naturally as "one customer per group."
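For comparison, the customers-side variant might be written as below. It is a sketch of the alternative discussed above, not a recommended rewrite; c.id is aliased to customer_id so the output columns keep the required names.

```sql
-- Same groups and sums as grouping on o.customer_id, because the join
-- condition guarantees o.customer_id = c.id on every joined row.
SELECT
    c.id AS customer_id,
    SUM(o.total_amount) AS total_revenue
FROM orders o
JOIN customers c
    ON o.customer_id = c.id
GROUP BY c.id;
```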

The trap

The planner's group-count estimate comes from the distinct-value statistic on the grouped column. When that statistic reports 5 but the column actually holds 62 distinct values, the planner picks a hash aggregate sized for 5 buckets, and the hash table has to grow at runtime as the 6th, 7th, ... 62nd customer appears. The cost shows up as elevated actual time on the aggregate node (visible when the plan is run with EXPLAIN ANALYZE), even though the row counts at the scan level may look fine. The misestimate isn't in the join cardinality; it's in the post-join distinct-value count, which depends on a statistic the planner reads from the column being grouped. Running ANALYZE on the underlying table refreshes that statistic; until then, every GROUP BY on this column gets the same undersized allocation.
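The check-and-refresh cycle described above could look like the following. This sketch assumes PostgreSQL, which the text does not name; pg_stats, EXPLAIN ANALYZE, and ANALYZE are PostgreSQL features, and other engines expose statistics differently.

```sql
-- Assumption: PostgreSQL.

-- 1. What the planner currently believes about the grouped column.
--    A positive n_distinct is an estimated count of distinct values;
--    a negative value is the negated ratio of distinct values to rows.
SELECT n_distinct
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'customer_id';

-- 2. Run the query and compare estimated rows with actual rows
--    on the aggregate node.
EXPLAIN ANALYZE
SELECT o.customer_id, SUM(o.total_amount) AS total_revenue
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY o.customer_id;

-- 3. Refresh the statistics, then repeat steps 1 and 2.
ANALYZE orders;
```

Note that EXPLAIN ANALYZE executes the statement, so on a large table step 2 takes as long as the query itself.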

You practiced computing the real per-group count and per-group totals to compare against EXPLAIN's grouping-step estimate — the discrepancy at that step drives statistics refresh decisions.
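If only the group count is needed for the comparison, the rollup can be wrapped in a count. This is a minimal sketch; the names per_customer and customer_groups are made up for illustration.

```sql
-- Actual number of groups the aggregate step produces,
-- to set against the planner's estimate.
SELECT COUNT(*) AS customer_groups
FROM (
    SELECT o.customer_id
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY o.customer_id
) AS per_customer;
```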
