Cities as Text and Array — STRING_AGG and ARRAY

The problem

Brightlane's data quality team is building a per-country city audit. For each country the team needs both a clean text list of recorded cities and a parallel array that preserves missing-city placeholders so the team can spot data gaps.

Write a query to return every country, a comma-separated text list of city values in alphabetical order, and an array of city values in alphabetical order.

Assumptions:

Each country value with at least one customer should appear once.
For each country, the text list contains the recorded city values only — customers with a missing city contribute no entry and no separator. The array contains every customer's city value, with a missing-value element placed for each customer whose city is missing.
Within each country, both the text list and the array are arranged alphabetically. In the array, missing-value elements appear after every recorded value.

Output:

One row per country, with columns country, cities_text, and cities_array.

Schema · ecommerce 5 tables

The shape

The same column is aggregated two ways in the same query, and the two ways disagree on what to do with the missing city values. STRING_AGG silently drops the rows whose city is NULL and joins only the recorded values. ARRAY_AGG keeps every input row, putting a NULL element into the array wherever the city was missing. The audit team sees the clean list in one column and the gap-aware collection in the next.

Clause by clause

SELECT country, STRING_AGG(city, ', ' ORDER BY city) AS cities_text, ARRAY_AGG(city ORDER BY city) AS cities_array returns the country and both aggregations side by side. Both aggregates read the same city column from the same grouped rows; the difference is entirely in how each function treats NULL inputs. The ORDER BY city inside each aggregate fixes the sequence within the value; PostgreSQL sorts NULLs last by default, which is why missing-value elements show up at the end of each array.
FROM customers reads the customer rows. Every customer is in scope, including those with a missing city — those rows have to reach the aggregate so ARRAY_AGG can include them.
GROUP BY country partitions the rows by country so both aggregates run once per country. One output row per distinct country.

The trap

STRING_AGG ignoring NULLs is not a feature the function announces; nothing in the SQL says "skip the missing values." A query that runs cleanly and returns plausible-looking strings can be silently dropping rows the analyst expected to see. For country MX, every customer has a missing city and STRING_AGG returns NULL because there are zero non-NULL values to join — not an empty string, not a placeholder, but a NULL value in the output column. For country BE, the recorded city joins fine and the missing one disappears without a trace. Side-by-side with ARRAY_AGG, the gap becomes visible: the text column says 'Brussels', the array column says ['Brussels', NULL], and the audit team can see which customer rows the text list silently elided. When NULL inputs matter, reach for ARRAY_AGG or wrap STRING_AGG in a COALESCE over the column so the placeholder is explicit.

You practiced parallel STRING_AGG and ARRAY_AGG over a column with missing values — STRING_AGG skips missing inputs entirely; ARRAY_AGG includes them as missing-value array elements, the underlying NULL-handling difference between the two functions.

Return every country, a comma-separated text list of `city` values in alphabetical order, and an array of `city` values in alphabetical order

The shape

Clause by clause

The trap

Reading explains SQL. Writing it, over and over with instant feedback, is what makes you fluent.