UNION vs UNION ALL in SQL: Key Differences Explained
When working with SQL, combining results from multiple queries is a common task. Two operators, UNION and UNION ALL, are frequently used for this purpose. While they may seem similar at first glance, they have important differences that can impact both the results you get and the performance of your queries.
What is UNION?
The UNION operator combines the result sets of two or more SELECT statements into a single result set. Importantly, it removes duplicate rows from the final output.
SELECT favorite_language FROM developer_survey_2024
UNION
SELECT favorite_language FROM bootcamp_graduates_2024;
In this example, both developer_survey_2024 and bootcamp_graduates_2024 might list programming languages like "Python", "JavaScript", or "Go" as favorites. If "Python" appears in both tables, it will show up only once in the combined result, ensuring each language is listed uniquely.
What is UNION ALL?
The UNION ALL operator also combines results from multiple SELECT statements, but does not remove duplicates. Every row from each query is included in the final result set, even if some rows are identical.
SELECT city FROM customers
UNION ALL
SELECT city FROM suppliers;
In this example, both customers and suppliers tables may have overlapping city names. Using UNION ALL, every occurrence of a city from both tables will appear in the result, including duplicates. For instance, if both tables have "New York" as a city, it will show up twice in the combined output.
Here, if the same value exists in both tables, it will appear multiple times in the result.
Key Differences
| Feature | UNION | UNION ALL |
|---|---|---|
| Duplicates | Removes duplicates | Keeps duplicates |
| Performance | Slower (extra work) | Faster (no deduping) |
| Use case | Unique results needed | All results needed |
In Snowflake, the primary performance difference between UNION and UNION ALL is that UNION requires an additional, often resource-intensive step for duplicate elimination, whereas UNION ALL simply concatenates the data sets.
Key Performance Differences in Snowflake
- Data Processing: While
UNION ALLonly joins the inputs together,UNIONperforms the same concatenation and then must identify and remove any duplicate rows from the final result. - Operational Overhead: Because of the duplicate elimination requirement, using
UNIONwhenUNION ALLis sufficient is considered a common mistake that can lead to slower query performance. - Recursive Operations: In recursive views,
UNION ALLis typically used to iterate down a hierarchy one level at a time.
Identifying Bottlenecks in Query Profile
You can use the Query Profile to determine if a UNION operation is slowing down your query. If you see a UnionAll operator with an extra Aggregate operator positioned directly on top of it, this indicates that Snowflake is performing the duplicate elimination required by a UNION statement.
The Aggregate operator in this context is responsible for grouping the input to ensure all returned rows are unique. If your business logic does not strictly require unique rows (or if you already know the data sets are mutually exclusive), switching to UNION ALL will remove this Aggregate step and improve execution speed.
Tip: If you know your queries won't return duplicates, or if duplicates are acceptable, prefer UNION ALL for better performance.
Using UNION Operators in dbt with Snowflake
When working with Snowflake through dbt, you often need to combine data from multiple sources as part of your data modeling workflow. dbt makes this process easier and more maintainable by providing macros like dbt_utils.union_relations(), which abstracts away some of the manual SQL required for combining tables—especially when their schemas don't perfectly align.
dbt_utils.union_relations()
The dbt_utils.union_relations() macro in dbt is used to combine multiple relations (tables or views) into a single relation. It generates SQL that stacks the rows from each input relation, similar to how UNION ALL works in SQL.
Key Points of dbt_utils.union_relations() vs UNION ALL
dbt_utils.union_relations(): Automates combining multiple tables, handling column mismatches by filling missing columns withNULL. It is especially useful when tables have similar but not identical schemas.UNION ALL: Standard SQL operator that combines result sets from multiple queries, but requires the same columns and order in each query.
Summary:
dbt_utils.union_relations() is a dbt macro that simplifies combining tables with potentially different columns, while UNION ALL is a SQL operator for combining result sets when columns match exactly. Both keep all rows, including duplicates.
Practical Example: Combining Tables with UNION ALL and dbt_utils.union_relations()
Suppose you have two tables, sales_2023 and sales_2024, each with slightly different columns:
sales_2023:order_id,amount,customer_idsales_2024:order_id,amount,customer_id,discount_code
If you want to combine these tables in SQL and keep all records (including duplicates), you can use UNION ALL. However, you must ensure the columns match in number and order:
SELECT order_id, amount, customer_id, NULL AS discount_code FROM sales_2023
UNION ALL
SELECT order_id, amount, customer_id, discount_code FROM sales_2024;
This approach works, but can become cumbersome if you have many tables or frequent schema changes.
With dbt, you can use the dbt_utils.union_relations() macro to simplify this process:
{{ dbt_utils.union_relations(
relations=[
ref('sales_2023'),
ref('sales_2024')
]
) }}
This macro automatically aligns columns, filling in missing ones with NULL as needed, and uses UNION ALL under the hood. It's especially useful for data models that need to combine many sources with evolving schemas.
Understanding the difference between UNION and UNION ALL helps you write more efficient SQL queries and ensures you get the results you expect. Choose the operator that best fits your data requirements and performance needs.
Reference: Snowflake Well Kept Secret: UNION, UNION ALL, UNION BY and More