Brightlane's data quality team is building a per-country city audit. For each country the team needs both a clean text list of recorded cities and a parallel array that preserves missing-city placeholders so the team can spot data gaps.
Write a query to return every country, a comma-separated text list of city values in alphabetical order, and an array of city values in alphabetical order.
Assumptions:
- Each
countryvalue with at least one customer should appear once. - For each country, the text list contains the recorded
cityvalues only — customers with a missingcitycontribute no entry and no separator. The array contains every customer'scityvalue, with a missing-value element placed for each customer whosecityis missing. - Within each country, both the text list and the array are arranged alphabetically. In the array, missing-value elements appear after every recorded value.
Output:
- One row per country, with columns
country,cities_text, andcities_array.
Schema · ecommerce 5 tables
Run previews · Check grades
Write a query, then run it to see results here.
Worked solution Try it yourself first
SELECT
country,
STRING_AGG(
city,
', '
ORDER BY
city
) AS cities_text,
ARRAY_AGG(
city
ORDER BY
city
) AS cities_array
FROM
customers
GROUP BY
country The shape
The same column is aggregated two ways in the same query, and the two ways disagree on what to do with the missing city values. STRING_AGG silently drops the rows whose city is NULL and joins only the recorded values. ARRAY_AGG keeps every input row, putting a NULL element into the array wherever the city was missing. The audit team sees the clean list in one column and the gap-aware collection in the next.
Clause by clause
SELECT country, STRING_AGG(city, ', ' ORDER BY city) AS cities_text, ARRAY_AGG(city ORDER BY city) AS cities_arrayreturns the country and both aggregations side by side. Both aggregates read the samecitycolumn from the same grouped rows; the difference is entirely in how each function treats NULL inputs. TheORDER BY cityinside each aggregate fixes the sequence within the value; PostgreSQL sorts NULLs last by default, which is why missing-value elements show up at the end of each array.FROM customersreads the customer rows. Every customer is in scope, including those with a missingcity— those rows have to reach the aggregate soARRAY_AGGcan include them.GROUP BY countrypartitions the rows by country so both aggregates run once per country. One output row per distinctcountry.
The trap
STRING_AGG ignoring NULLs is not a feature the function announces; nothing in the SQL says "skip the missing values." A query that runs cleanly and returns plausible-looking strings can be silently dropping rows the analyst expected to see. For country MX, every customer has a missing city and STRING_AGG returns NULL because there are zero non-NULL values to join — not an empty string, not a placeholder, but a NULL value in the output column. For country BE, the recorded city joins fine and the missing one disappears without a trace. Side-by-side with ARRAY_AGG, the gap becomes visible: the text column says 'Brussels', the array column says ['Brussels', NULL], and the audit team can see which customer rows the text list silently elided. When NULL inputs matter, reach for ARRAY_AGG or wrap STRING_AGG in a COALESCE over the column so the placeholder is explicit.
You practiced parallel STRING_AGG and ARRAY_AGG over a column with missing values — STRING_AGG skips missing inputs entirely; ARRAY_AGG includes them as missing-value array elements, the underlying NULL-handling difference between the two functions.