Mastering Data Deduplication: Techniques and Best Practices
Duplicate data can lead to inaccurate analytics, wasted storage, and poor decision-making. Identifying and removing duplicates—known as deduplication—is a crucial step in maintaining data quality. This blog explores practical ways to find and deduplicate data using SQL, Python, and data pipeline tools.
Why Deduplication Matters
- Improves data accuracy: Ensures reports and analytics are based on unique records.
- Saves storage: Reduces unnecessary data, lowering costs.
- Enhances performance: Streamlines queries and processing.
Finding Duplicates
Using SQL: GROUP BY + HAVING COUNT(*) > 1
To find duplicates in a table, use the GROUP BY and HAVING clauses:
SELECT email, first_name, last_name, COUNT(*) as duplicates
FROM users
GROUP BY email, first_name, last_name
HAVING COUNT(*) > 1;
This query lists combinations of email, first_name, and last_name that appear more than once, helping you identify duplicate user records based on multiple fields.
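To see this query run end to end, here is a minimal, self-contained sketch using Python's built-in sqlite3 module. The table name and columns match the query above; the sample rows are invented for illustration:

```python
import sqlite3

# In-memory table with one duplicated user (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER, email TEXT, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "a@example.com", "Ann", "Lee"),
        (2, "b@example.com", "Bob", "Ray"),
        (3, "a@example.com", "Ann", "Lee"),  # duplicate of id 1
    ],
)

# Same GROUP BY + HAVING pattern as above.
rows = conn.execute(
    """
    SELECT email, first_name, last_name, COUNT(*) AS duplicates
    FROM users
    GROUP BY email, first_name, last_name
    HAVING COUNT(*) > 1
    """
).fetchall()
print(rows)  # [('a@example.com', 'Ann', 'Lee', 2)]
```

Each returned row is one group of duplicates along with how many times it occurs.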
Another SQL Method: Using ROW_NUMBER() with Multiple Columns
You can use the ROW_NUMBER() window function to identify duplicates based on multiple columns. For example, to list every duplicate beyond the first occurrence of each unique combination of email, first_name, and last_name:
WITH ranked_users AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY email, first_name, last_name
ORDER BY id
) AS rn
FROM users
)
SELECT *
FROM ranked_users
WHERE rn > 1;
This query returns every row after the first occurrence (smallest id) within each group, so you can review exactly which records would be removed. Selecting rn = 1 instead returns only the record to keep from each group.
Using Python (pandas)
With pandas, you can quickly spot duplicates:
import pandas as pd
df = pd.read_csv('data.csv')
duplicates = df[df.duplicated()]
print(duplicates)
This prints all rows that are duplicates of previous rows.
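By default, duplicated() compares entire rows and leaves the first occurrence unflagged. The subset and keep parameters control both behaviors; here is a short sketch with an invented frame where email is assumed to be the uniqueness key:

```python
import pandas as pd

# Hypothetical sample frame; 'email' is the assumed uniqueness key.
df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com"],
    "name": ["Ann", "Bob", "Ann L."],
})

# Compare only the 'email' column; keep=False flags every member of a
# duplicate group, not just the repeats after the first.
mask = df.duplicated(subset=["email"], keep=False)
print(df[mask])
```

keep=False is useful for auditing, since it shows the full duplicate group rather than only the rows slated for removal.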
Using PySpark
If you're working with large datasets, PySpark can efficiently identify duplicates:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('data.csv', header=True, inferSchema=True)
duplicates = df.groupBy(df.columns).count().filter("count > 1")
duplicates.show()
This displays groups of rows that appear more than once in the dataset.
Removing Duplicates (Deduplication)
SQL Deduplication
To keep only unique records:
DELETE FROM users
WHERE (email, last_name, created_at) NOT IN (
SELECT email, last_name, MIN(created_at)
FROM users
GROUP BY email, last_name
);
This keeps only the earliest record (by created_at) for each unique combination of email and last_name. Be aware that multi-column NOT IN comparisons are not supported by every database, and a NULL in any compared column can make NOT IN match nothing; the window-function approach in the next section is generally safer and more portable.
Using ROW_NUMBER() for Deduplication
A more flexible SQL approach uses window functions to assign a unique row number to each record within a group of duplicates:
WITH ranked_users AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY email, first_name, last_name
ORDER BY id
) AS rn
FROM users
)
DELETE FROM users
WHERE id IN (
SELECT id FROM ranked_users WHERE rn > 1
);
This query uses a common table expression (CTE) to assign a row number to each record within groups of duplicates (based on email, first_name, and last_name). It then deletes all rows where the row number is greater than 1, effectively keeping only the first occurrence (smallest id) for each unique combination and removing the rest.
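The pattern can be verified end to end with Python's built-in sqlite3 module (SQLite supports both CTEs on DELETE and window functions in recent versions). The table and sample rows below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, "
    "first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "a@example.com", "Ann", "Lee"),
        (2, "a@example.com", "Ann", "Lee"),  # duplicate of id 1
        (3, "b@example.com", "Bob", "Ray"),
    ],
)

# Same CTE + ROW_NUMBER() deduplication as above.
conn.execute(
    """
    WITH ranked_users AS (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY email, first_name, last_name
                   ORDER BY id
               ) AS rn
        FROM users
    )
    DELETE FROM users
    WHERE id IN (SELECT id FROM ranked_users WHERE rn > 1)
    """
)
remaining = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id")]
print(remaining)  # [1, 3]
```

Only the first occurrence of each duplicate group survives; the later row (id 2) is deleted.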
Python Deduplication
With pandas:
deduped_df = df.drop_duplicates()
deduped_df.to_csv('deduped_data.csv', index=False)
This removes duplicate rows, keeping the first occurrence.
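To mirror the SQL approach of keeping the earliest created_at per (email, last_name), sort before dropping duplicates. This is a minimal sketch with invented sample data:

```python
import pandas as pd

# Hypothetical frame; keep the earliest created_at per (email, last_name).
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com"],
    "last_name": ["Lee", "Lee", "Ray"],
    "created_at": pd.to_datetime(["2024-02-01", "2024-01-15", "2024-03-01"]),
})

deduped = (
    df.sort_values("created_at")  # earliest row comes first in each group
      .drop_duplicates(subset=["email", "last_name"], keep="first")
      .sort_index()               # restore the original row order
)
print(deduped)
```

Because keep="first" retains the first row pandas encounters, sorting by created_at beforehand guarantees the survivor is the earliest record in each group.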
Deduplication in Data Pipelines
Modern data pipelines (e.g., Apache Spark, dbt) offer built-in deduplication:
- Spark: Use the dropDuplicates() method on DataFrames to efficiently remove duplicate rows based on specified columns.
- dbt: Deduplicate data in SQL models using several strategies:
  - DISTINCT keyword: Select only unique records in your models.
  - Incremental models: Configure incremental models to insert only new or updated records, leveraging unique keys to prevent duplicates.
  - Surrogate keys: Create unique surrogate keys (e.g., with ROW_NUMBER() or hash functions) to reliably identify and filter out duplicates.
  - Merge (upsert) strategies: Use SQL MERGE statements to update existing records and insert new ones, ensuring no duplicate entries.
  - Window functions: Apply window functions like ROW_NUMBER() or RANK() to partition data and filter out duplicates within your models.
TL;DR: Best Practices
- Define what makes a record unique (e.g., email, user ID)
- Automate deduplication in ETL pipelines
- Audit regularly to catch new duplicates
- Log and review removed records for traceability
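The last two practices can be combined in pandas by capturing the rows deduplication will discard before removing them. A minimal sketch with invented data (removed_records.csv is a hypothetical audit-log destination):

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com"],
    "name": ["Ann", "Ann", "Bob"],
})

# Rows that deduplication will discard (everything after the first occurrence).
removed = df[df.duplicated(subset=["email"], keep="first")]
removed.to_csv("removed_records.csv", index=False)  # audit trail

deduped = df.drop_duplicates(subset=["email"], keep="first")
print(len(removed), len(deduped))  # 1 2
```

Keeping the removed rows on disk makes it possible to review, and if necessary restore, anything the pipeline dropped.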
Deduplication is essential for reliable data analytics. By using the right tools and strategies, you can efficiently find and remove duplicates, ensuring your data remains clean and trustworthy.