Using hash_agg for quick table comparison in Snowflake

When working with large datasets in Snowflake, comparing tables row by row can be slow and resource-intensive. The hash_agg function provides a fast way to check if two tables (or subsets of columns) have the same data, without needing to compare every row individually.

What is `hash_agg`?

hash_agg is a Snowflake aggregate function that computes a hash value over a set of rows. If two sets of data produce the same hash, they are likely identical (though hash collisions are theoretically possible, they are extremely rare).

Basic Usage

1. Compare Entire Tables

To check if two tables have the same data (all columns), use:

SELECT hash_agg(*) AS hash_value FROM db.sch.table_a;
SELECT hash_agg(*) AS hash_value FROM db.sch.table_b;

where db is database, sch is schema.

If the hash_value is the same for both queries, the tables are likely identical. For reference, it took around 1 second to calculate the hash_value for 1M records.

SELECT COUNT(*) AS row_count FROM db.sch.table_a;

and it took approximately 35ms to return the row count.

2. Compare Specific Columns

If you only care about certain columns (e.g., col1 and col2):

SELECT hash_agg(col1, col2) AS hash_value FROM db.sch.table_a;
SELECT hash_agg(col1, col2) AS hash_value FROM db.sch.table_b;

This checks if the selected columns have the same values in both tables.

3. Compare with Filters

You can add WHERE clauses to compare subsets:

SELECT hash_agg(col1, col2) AS hash_value FROM db.sch.table_a WHERE status = 'active';
SELECT hash_agg(col1, col2) AS hash_value FROM db.sch.table_b WHERE status = 'active';

4. Compare Grouped Data

To compare data grouped by a key (e.g., id):

SELECT id, hash_agg(col1, col2) AS hash_value
FROM db.sch.table_a
GROUP BY id;

You can then join this result with the same query on table_b to find mismatches per group.

Limitations to Consider

While hash_agg is highly effective, there are a few important caveats:

Row order is ignored, but column order matters: The function treats columns in the order you specify, so ensure consistency between tables.
Hash collisions are extremely rare, but possible: Two different datasets could theoretically produce the same hash, though this is unlikely in practice.
Data types and NULL handling must match: Differences in data types or how NULLs are represented can lead to misleading results. Always verify that both tables are structured identically.

Final Thoughts

The hash_agg function offers a fast, resource-efficient way to compare tables or specific columns in Snowflake. By leveraging this approach, you can quickly identify differences and validate data integrity, making it especially valuable when working with large or complex datasets.

Reference link: Snowflake forum - Table Comparison

What is hash_agg?