Streamhub's analytics team wants to profile its top-engagement segment: users whose session count exceeds the platform-wide per-user session average. The team needs the segment's size and its average total event count.
Write a query to return the count of top-engagement users and their average total event count as a single row.
Assumptions:
- A user's session count is the number of
sessionsrecords linked to thatuser_id. A user's total events is the combinedevent_countacross those records. Only users with at least one session contribute to the per-user session average. - The per-user session average is the average of those per-user session counts.
- A user belongs to the top-engagement segment when their session count exceeds the per-user session average.
- The output reports the size of that segment and the average total events across users in the segment.
Output:
- A single row with columns
top_user_countandavg_events_per_top_user.
Schema · analytics 5 tables
Run previews · Check grades
Write a query, then run it to see results here.
Worked solution Try it yourself first
WITH
user_stats AS (
SELECT
user_id,
COUNT(*) AS session_count,
SUM(event_count) AS total_events
FROM
sessions
GROUP BY
user_id
),
top_users AS (
SELECT
user_id,
total_events
FROM
user_stats
WHERE
session_count > (
SELECT
AVG(session_count)
FROM
user_stats
)
),
summary AS (
SELECT
COUNT(*) AS top_user_count,
AVG(total_events) AS avg_events_per_top_user
FROM
top_users
)
SELECT
top_user_count,
avg_events_per_top_user
FROM
summary The shape
Three cascading layers, each doing one thing. The first computes per-user statistics. The second reads that layer twice, once row-by-row and once through a scalar subquery that aggregates it to a single comparison value, and keeps the top-engagement users. The third reduces the survivors to a one-row summary. The shape mirrors the prompt's three questions: who are the users, which of them exceed the average, and what does the segment look like in aggregate.
Clause by clause
The first CTE collapses every user's sessions into a single statistics row:
WITH user_stats AS (
SELECT user_id, COUNT(*) AS session_count, SUM(event_count) AS total_events
FROM sessions
GROUP BY user_id
)COUNT(*) and SUM(event_count) are computed in the same pass. Both aggregates are needed downstream: session_count qualifies the user, and total_events is what the summary will average over.
The second CTE keeps users whose session count exceeds the per-user average:
top_users AS (
SELECT user_id, total_events
FROM user_stats
WHERE session_count > (SELECT AVG(session_count) FROM user_stats)
)The scalar subquery (SELECT AVG(session_count) FROM user_stats) reads user_stats again and reduces it to a single number. That number is then compared row-by-row against each user's session_count. The subquery is evaluated once and its result is reused on every row.
The third CTE reduces the surviving rows to a one-row summary:
summary AS (
SELECT COUNT(*) AS top_user_count, AVG(total_events) AS avg_events_per_top_user
FROM top_users
)No GROUP BY is needed because there is no grouping key. Aggregating without GROUP BY collapses the whole source to one row, which is what the prompt asks for. COUNT(*) counts the segment's size; AVG(total_events) averages the carried-forward sum.
SELECT top_user_count, avg_events_per_top_user FROM summaryreturns the single row:9top-engagement users with an average of77.11total events.
The trap
The two averages in the query operate at different levels. AVG(session_count) in the second CTE is the average of per-user session counts, computed across the user table. AVG(total_events) in the third CTE is the average of per-user totals, computed across the qualifying segment. Pulling either one against the raw sessions table instead of the layer above it would produce a different number, weighted differently. The chain is what makes "average across users" mean exactly that, by giving each user one row in the layer the average is taken over.
You practiced cascading three WITH stages where the second references the first twice (row-by-row and through a scalar subquery) and the third reduces the second's row set to a one-row summary.