Streamhub's data team wants a deduplicated list of every user ID that has either started a session or triggered an event on the platform.
Write a query to return each qualifying user ID exactly once.
Assumptions:
- The
sessionstable records each session;user_ididentifies the user. - The
eventstable records each platform event;user_ididentifies the user. - A user with both sessions and events should appear once in the result, not twice.
Output:
- One row per distinct user ID, with a single column
user_id.
Schema · analytics 5 tables
Run previews · Check grades
Write a query, then run it to see results here.
Worked solution Try it yourself first
SELECT
user_id
FROM
sessions
UNION
SELECT
user_id
FROM
events The shape
UNION combines the two queries and then deduplicates the combined result. Every qualifying user_id appears exactly once — whether the user came from sessions, from events, or from both.
Clause by clause
SELECT user_id FROM sessionsis the left input: one row per session record, projecting the user behind each one. A user with multiple sessions contributes multiple rows here.UNIONis the set operator. After PostgreSQL evaluates both queries, it sorts or hashes the entire combined result and keeps each distinct row once. The deduplication is across the full combined set: a user appearing five times insessionsand three times ineventscollapses to a single row.SELECT user_id FROM eventsis the right input. Same column, same type as the left query, which is what lets the two sides stack into a single column.
Why this and not UNION ALL
UNION ALL would skip the deduplication step and return every record from both tables — the same user appearing in both contributes two rows, not one. The prompt asks for each qualifying user once, which is exactly the work UNION does on top of the concatenation. The deduplication has a real cost on large inputs, but here it's the load-bearing behavior, not overhead.
You practiced UNION (with deduplication) over two result sets. The recurring rule: UNION removes duplicates across the combined result — including duplicates within each side and duplicates that span both sides — by sorting or hashing the combined output.