Mastering Data Deduplication: Techniques and Best Practices
Duplicate data can lead to inaccurate analytics, wasted storage, and poor decision-making. Identifying and removing duplicates—known as deduplication—is a crucial step in maintaining data quality. This blog explores practical ways to find and deduplicate data using SQL, Python, and data pipeline tools.
Why Deduplication Matters
- Improves data accuracy: Ensures reports and analytics are based on unique records.
- Saves storage: Reduces unnecessary data, lowering costs.
- Enhances performance: Streamlines queries and processing.
Finding Duplicates
Using SQL: GROUP BY + HAVING COUNT(*) > 1
To find duplicates in a table, use the GROUP BY and HAVING clauses:
SELECT email, first_name, last_name, COUNT(*) as duplicates
FROM users
GROUP BY email, first_name, last_name
HAVING COUNT(*) > 1;
This query lists combinations of email, first_name, and last_name that appear more than once, helping you identify duplicate user records based on multiple fields.
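To see this query run end to end, here is a minimal, self-contained sketch using Python's built-in sqlite3 module. The table name and columns match the query above; the sample rows are invented for illustration:

```python
import sqlite3

# In-memory table with one duplicated user (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER, email TEXT, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "a@example.com", "Ann", "Lee"),
        (2, "b@example.com", "Bob", "Ray"),
        (3, "a@example.com", "Ann", "Lee"),  # duplicate of id 1
    ],
)

# Same GROUP BY + HAVING pattern as above.
rows = conn.execute(
    """
    SELECT email, first_name, last_name, COUNT(*) AS duplicates
    FROM users
    GROUP BY email, first_name, last_name
    HAVING COUNT(*) > 1
    """
).fetchall()
print(rows)  # [('a@example.com', 'Ann', 'Lee', 2)]
```

Each returned row is one group of duplicates along with how many times it occurs.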
Another SQL Method: Using ROW_NUMBER() with Multiple Columns
You can use the ROW_NUMBER() window function to identify duplicates based on multiple columns. For example, to list every duplicate beyond the first occurrence of each unique combination of email, first_name, and last_name:
WITH ranked_users AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY email, first_name, last_name
ORDER BY id
) AS rn
FROM users
)
SELECT *
FROM ranked_users
WHERE rn > 1;
This query returns every row after the first occurrence (smallest id) within each group, so you can review exactly which records would be removed. Selecting rn = 1 instead returns only the record to keep from each group.
Using Python (pandas)
With pandas, you can quickly spot duplicates:
import pandas as pd
df = pd.read_csv('data.csv')
duplicates = df[df.duplicated()]
print(duplicates)
This prints all rows that are duplicates of previous rows.
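By default, duplicated() compares entire rows and leaves the first occurrence unflagged. The subset and keep parameters control both behaviors; here is a short sketch with an invented frame where email is assumed to be the uniqueness key:

```python
import pandas as pd

# Hypothetical sample frame; 'email' is the assumed uniqueness key.
df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com"],
    "name": ["Ann", "Bob", "Ann L."],
})

# Compare only the 'email' column; keep=False flags every member of a
# duplicate group, not just the repeats after the first.
mask = df.duplicated(subset=["email"], keep=False)
print(df[mask])
```

keep=False is useful for auditing, since it shows the full duplicate group rather than only the rows slated for removal.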
Using PySpark
If you're working with large datasets, PySpark can efficiently identify duplicates:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('data.csv', header=True, inferSchema=True)
duplicates = df.groupBy(df.columns).count().filter("count > 1")
duplicates.show()
This displays groups of rows that appear more than once in the dataset.
Removing Duplicates (Deduplication)
SQL Deduplication
To keep only unique records:
DELETE FROM users
WHERE (email, last_name, created_at) NOT IN (
SELECT email, last_name, MIN(created_at)
FROM users
GROUP BY email, last_name
);
This keeps only the earliest record (by created_at) for each unique combination of email and last_name. Be aware that multi-column NOT IN comparisons are not supported by every database, and a NULL in any compared column can make NOT IN match nothing; the window-function approach in the next section is generally safer and more portable.
Using ROW_NUMBER() for Deduplication
A more flexible SQL approach uses window functions to assign a unique row number to each record within a group of duplicates:
WITH ranked_users AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY email, first_name, last_name
ORDER BY id
) AS rn
FROM users
)
DELETE FROM users
WHERE id IN (
SELECT id FROM ranked_users WHERE rn > 1
);
This query uses a common table expression (CTE) to assign a row number to each record within groups of duplicates (based on email, first_name, and last_name). It then deletes all rows where the row number is greater than 1, effectively keeping only the first occurrence (smallest id) for each unique combination and removing the rest.
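The pattern can be verified end to end with Python's built-in sqlite3 module (SQLite supports both CTEs on DELETE and window functions in recent versions). The table and sample rows below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, "
    "first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "a@example.com", "Ann", "Lee"),
        (2, "a@example.com", "Ann", "Lee"),  # duplicate of id 1
        (3, "b@example.com", "Bob", "Ray"),
    ],
)

# Same CTE + ROW_NUMBER() deduplication as above.
conn.execute(
    """
    WITH ranked_users AS (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY email, first_name, last_name
                   ORDER BY id
               ) AS rn
        FROM users
    )
    DELETE FROM users
    WHERE id IN (SELECT id FROM ranked_users WHERE rn > 1)
    """
)
remaining = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id")]
print(remaining)  # [1, 3]
```

Only the first occurrence of each duplicate group survives; the later row (id 2) is deleted.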
Python Deduplication
With pandas:
deduped_df = df.drop_duplicates()
deduped_df.to_csv('deduped_data.csv', index=False)
This removes duplicate rows, keeping the first occurrence.
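To mirror the SQL approach of keeping the earliest created_at per (email, last_name), sort before dropping duplicates. This is a minimal sketch with invented sample data:

```python
import pandas as pd

# Hypothetical frame; keep the earliest created_at per (email, last_name).
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com"],
    "last_name": ["Lee", "Lee", "Ray"],
    "created_at": pd.to_datetime(["2024-02-01", "2024-01-15", "2024-03-01"]),
})

deduped = (
    df.sort_values("created_at")  # earliest row comes first in each group
      .drop_duplicates(subset=["email", "last_name"], keep="first")
      .sort_index()               # restore the original row order
)
print(deduped)
```

Because keep="first" retains the first row pandas encounters, sorting by created_at beforehand guarantees the survivor is the earliest record in each group.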
Deduplication in Data Pipelines
Modern data pipelines (e.g., Apache Spark, dbt) offer built-in deduplication:
- Spark: Use the dropDuplicates() method on DataFrames to efficiently remove duplicate rows based on specified columns.
- dbt: Deduplicate data in SQL models using several strategies:
  - DISTINCT keyword: Select only unique records in your models.
  - Incremental models: Configure incremental models to insert only new or updated records, leveraging unique keys to prevent duplicates.
  - Surrogate keys: Create unique surrogate keys (e.g., with ROW_NUMBER() or hash functions) to reliably identify and filter out duplicates.
  - Merge (upsert) strategies: Use SQL MERGE statements to update existing records and insert new ones, ensuring no duplicate entries.
  - Window functions: Apply window functions like ROW_NUMBER() or RANK() to partition data and filter out duplicates within your models.
TL;DR: Best Practices
- Define what makes a record unique (e.g., email, user ID)
- Automate deduplication in ETL pipelines
- Audit regularly to catch new duplicates
- Log and review removed records for traceability
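The last two practices can be combined in pandas by capturing the rows deduplication will discard before removing them. A minimal sketch with invented data (removed_records.csv is a hypothetical audit-log destination):

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com"],
    "name": ["Ann", "Ann", "Bob"],
})

# Rows that deduplication will discard (everything after the first occurrence).
removed = df[df.duplicated(subset=["email"], keep="first")]
removed.to_csv("removed_records.csv", index=False)  # audit trail

deduped = df.drop_duplicates(subset=["email"], keep="first")
print(len(removed), len(deduped))  # 1 2
```

Keeping the removed rows on disk makes it possible to review, and if necessary restore, anything the pipeline dropped.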
Deduplication is essential for reliable data analytics. By using the right tools and strategies, you can efficiently find and remove duplicates, ensuring your data remains clean and trustworthy.