How to Deduplicate a Product Catalog

Step-by-step playbook to find, score, and merge duplicate SKUs into clean canonical records without losing supplier history or breaking audit trails.

Duplicate SKUs are not just a housekeeping problem. Every duplicate in your catalog silently inflates procurement spend, splits stock levels, breaks AI-powered search, and sends conflicting pricing downstream to your ERP or e-commerce channel. The root causes are almost always the same: supplier feeds arrive in different formats, PIM migrations carry over legacy rows, and years of manual entry produce near-identical records that no automated process ever reconciled. The result is one real product hiding behind three to five SKUs, each with its own price, attributes, and supplier history.

This playbook walks you through how to deduplicate a product catalog end to end: from detecting duplicate records, to scoring candidate matches, to merging them into a single canonical record you can trust. Claro sits at the center of this workflow as a managed identity-resolution layer — it normalizes incoming supplier feeds, scores every candidate pair, flags ambiguous matches for human review, and writes the clean canonical record back into your PIM or ERP with a full audit trail. That means deduplication is not a one-time project you hand to a data team every six months; it runs continuously as new sources arrive.

Run this playbook after a supplier import, a PIM migration, an acquisition, or simply when years of manual entry have caught up with you. Whether you manage MRO line items, CPG units, furniture variants, or industrial spares, the workflow is the same — only the matching attributes change.

What the problem looks like before and after

Before deduplication	After deduplication
Same product appears under 3-5 SKUs	One canonical SKU per real-world product
Conflicting prices and stock levels per duplicate	Single source of truth fed to ERP and e-commerce
Analytics and reports double-count units sold	Accurate counts, clean rollups, reliable reorder signals
AI search returns inconsistent or contradictory answers	One authoritative record AI can cite with confidence
Supplier history scattered across merged-away rows	Provenance retained and linked to the canonical record
Manual reconciliation takes days per catalog import	Continuous, automated resolution on every new feed

Before you start

A full export of the catalog (CSV, JSONL, or a PIM extract) with stable internal IDs on every row.
Agreement on which identifiers are authoritative — GTIN, MPN + brand, or an internal SKU — before any comparison runs.
A writable staging environment. Never deduplicate against live production first; you need a rollback point.
A labeled sample of known duplicates and known non-duplicates for threshold calibration (50-200 pairs is enough to start).

1. Profile the catalog and pick match keys

Count distinct values for each candidate identifier (GTIN, MPN, brand, model, supplier part number) and measure fill rate. A field that is 40% empty cannot anchor matching alone. For an MRO catalog, MPN + brand is usually the strongest key; for CPG, GTIN dominates; for furniture, brand + model + dimensions. Read SKU vs MPN vs GTIN if you are unsure which identifier means what. Claro’s attribute-coverage report surfaces fill rates and identifier quality across all supplier feeds automatically at this step.
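If you want to profile an export by hand, the fill-rate and distinct-count check can be sketched in a few lines. This is a minimal illustration, not Claro's attribute-coverage report; the column names (gtin, mpn, brand) and the sample rows are assumptions made up for the example.

```python
import csv
import io

def profile_fields(rows, fields):
    """Return fill rate and distinct-value count for each candidate identifier."""
    rows = list(rows)
    total = len(rows)
    report = {}
    for field in fields:
        values = [(row.get(field) or "").strip() for row in rows]
        filled = [v for v in values if v]
        report[field] = {
            "fill_rate": len(filled) / total if total else 0.0,
            "distinct": len(set(filled)),
        }
    return report

# Hypothetical four-row export; a real run would open the catalog CSV instead.
sample = io.StringIO(
    "id,gtin,mpn,brand\n"
    "1,00012345678905,HF-2200,Acme\n"
    "2,,HF2200,Acme\n"
    "3,,,Acme\n"
    "4,00012345678905,HF-2200,ACME\n"
)
report = profile_fields(csv.DictReader(sample), ["gtin", "mpn", "brand"])
for field, stats in report.items():
    print(field, stats)
```

In this sample, gtin is only half filled, so it could not anchor matching alone, and brand shows two distinct values ("Acme" and "ACME") that are really one, which is a hint that normalization has to come before comparison.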

2. Normalize before you compare

Most apparent duplicates are the same product written differently. Standardize casing, strip punctuation and leading zeros, expand abbreviations (“ss” to “stainless steel”), and convert units to a common base. Normalize MPNs by removing dashes and spaces: “HF-2200” and “HF2200” must collapse to the same token before comparison. Skipping this step inflates false negatives, because exact comparisons treat cosmetic differences as different products. Read What Is Data Normalization? for the full rule set; the MPN Normalizer handles the most common MPN patterns.
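A minimal sketch of these rules, assuming a tiny abbreviation table and leaving unit conversion out for brevity. The function names and the ABBREVIATIONS dictionary are illustrative, not part of the MPN Normalizer or any Claro API.

```python
import re

# Illustrative abbreviation table; a real one is domain-specific and much longer.
ABBREVIATIONS = {"ss": "stainless steel"}

def normalize_mpn(mpn):
    """Uppercase, drop every non-alphanumeric character, strip leading zeros."""
    token = re.sub(r"[^A-Za-z0-9]", "", mpn).upper()
    # Keep the original token if stripping zeros would leave nothing (e.g. "000").
    return token.lstrip("0") or token

def normalize_name(name):
    """Lowercase, replace punctuation with spaces, expand known abbreviations."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

print(normalize_mpn("HF-2200"), normalize_mpn("hf 2200"))
print(normalize_name("SS Hex Bolt, M8"))
```

Both MPN spellings collapse to the same token, which is the property the comparison step relies on. Stripping leading zeros is safe for many part numbers but not all, so treat that rule as something to confirm against your own data.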

3. Block to reduce comparisons

Comparing every record to every other is O(n²) and breaks at scale: a catalog of one million rows produces roughly 5 × 10¹¹ candidate pairs. Group records into blocks that could plausibly match (by brand, by GTIN prefix, or by the first normalized token of the MPN) and compare only within each block. This keeps an industrial-distribution catalog of millions of rows tractable. If your homegrown scripts slow to a crawl past a few hundred thousand rows, a missing or ineffective blocking step is usually the cause, and the fix is better blocking rather than lower match quality.
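Blocking can be sketched as a dictionary of lists keyed by a cheap blocking key. The key used here (lowercased brand plus the first three characters of an already-normalized MPN) is an assumption chosen for illustration; pick whatever key fits your fill rates.

```python
from collections import defaultdict
from itertools import combinations

def block_key(record):
    # Assumed key: brand plus a short MPN prefix. Records must already be normalized.
    return (record["brand"].lower(), record["mpn"][:3])

def candidate_pairs(records):
    """Yield only the pairs that share a block, instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)

# Hypothetical records, assumed to be normalized upstream.
records = [
    {"id": 1, "brand": "Acme", "mpn": "HF2200"},
    {"id": 2, "brand": "acme", "mpn": "HF2200"},
    {"id": 3, "brand": "Acme", "mpn": "ZX900"},
    {"id": 4, "brand": "Bolt", "mpn": "HF2200"},
]
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(records)]
print(pairs)
```

Four records would give six pairs under all-against-all comparison; with this key only records 1 and 2 are compared. The trade-off is that a true duplicate with a wrong brand lands in a different block and is never scored, so many pipelines run more than one blocking key and take the union of the pairs.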

4. Score candidate pairs

Within each block, score pairs using exact matches on strong identifiers plus fuzzy similarity on names and attributes. Use the Duplicate SKU Finder to surface obvious collisions, and the String Similarity Calculator to tune how lenient string comparison should be. Each pair gets a confidence score between 0 and 1. Read What Is a Confidence Score? to understand how the score is derived and what it means for downstream decisions.
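To make the idea concrete, here is one way to combine exact identifier matches with fuzzy name similarity into a score between 0 and 1. The weights (0.7 and 0.3), the 0.9 value for an MPN + brand match, and the use of Python's difflib are all assumptions for this sketch; it is not how the Duplicate SKU Finder or Claro derive their scores.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Fuzzy similarity of two normalized names, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def score_pair(a, b):
    """Blend identifier evidence with name similarity. Weights are illustrative."""
    if a.get("gtin") and a.get("gtin") == b.get("gtin"):
        id_score = 1.0
    elif a.get("mpn") and a.get("mpn") == b.get("mpn") and a.get("brand") == b.get("brand"):
        id_score = 0.9
    else:
        id_score = 0.0
    return round(0.7 * id_score + 0.3 * name_similarity(a["name"], b["name"]), 3)

# Hypothetical, already-normalized records.
bolt = {"gtin": "4006381333931", "mpn": "HF2200", "brand": "acme",
        "name": "stainless steel hex bolt m8"}
bolt_no_gtin = {"gtin": "", "mpn": "HF2200", "brand": "acme", "name": "ss hex bolt m8"}
valve = {"gtin": "", "mpn": "BV100", "brand": "acme", "name": "brass ball valve 1 inch"}

print(score_pair(bolt, dict(bolt)))
print(score_pair(bolt, bolt_no_gtin))
print(score_pair(bolt, valve))
```

Because the identifier part is capped at 0.7 of the total, a GTIN match with a very different name does not reach a perfect score, which leaves room for a review band to catch reused or mislabeled barcodes.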

5. Set merge thresholds

Decide three bands: auto-merge above a high threshold, send to human review in the middle, and reject below a low threshold. The exact cutoffs depend on the cost of a wrong merge in your domain — merging two different bearings is far more dangerous than merging two near-identical pens. See How to Set Confidence Thresholds for Auto-Merge for how to calibrate these against your labeled sample.
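The three bands reduce to a small routing function. The cutoffs below (0.92 and 0.60) are placeholders, not recommendations; calibrate them against your labeled sample and the cost of a wrong merge in your domain.

```python
# Placeholder cutoffs: assumptions for illustration only.
AUTO_MERGE_THRESHOLD = 0.92
REJECT_THRESHOLD = 0.60

def band(score, auto_merge=AUTO_MERGE_THRESHOLD, reject=REJECT_THRESHOLD):
    """Route a pair to auto-merge, human review, or reject based on its score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= auto_merge:
        return "auto_merge"
    if score < reject:
        return "reject"
    return "review"

for s in (0.97, 0.75, 0.30):
    print(s, band(s))
```

A catalog of bearings might raise the auto-merge cutoff and widen the review band, while a catalog of pens could tolerate a lower one, since the thresholds are parameters rather than constants baked into the logic.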

6. Choose the canonical record and merge

For each confirmed duplicate group, build one canonical record by selecting the best value for each attribute — most complete, most recent, or from the most trusted source — rather than blindly keeping the first row. Record which source won each field. Claro scores source trustworthiness per attribute and populates each field with the highest-confidence value, leaving a provenance tag so you know where each cell came from. See What Is a Canonical Product Record? for the full selection logic.
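Field-by-field survivorship can be sketched as follows. The trust table, the source names, and the tie-break on an ISO-formatted "updated" date are assumptions for this example; Claro's per-attribute trust scoring is not shown here.

```python
# Hypothetical per-source trust weights; in practice these may differ per attribute.
SOURCE_TRUST = {"supplier_a": 0.9, "legacy_pim": 0.6, "manual": 0.4}

def build_canonical(group, fields):
    """Pick the best non-empty value per field and record which source won it."""
    canonical, provenance = {}, {}
    for field in fields:
        candidates = [record for record in group if record.get(field)]
        if not candidates:
            continue  # no source has a value; leave the field unset
        # Highest trust wins; the most recent ISO date breaks ties.
        best = max(candidates,
                   key=lambda r: (SOURCE_TRUST.get(r["source"], 0.0), r["updated"]))
        canonical[field] = best[field]
        provenance[field] = best["source"]
    return canonical, provenance

group = [
    {"source": "legacy_pim", "updated": "2021-03-01",
     "name": "SS Hex Bolt M8", "weight_kg": "0.02", "color": ""},
    {"source": "supplier_a", "updated": "2023-06-10",
     "name": "Stainless Steel Hex Bolt M8", "weight_kg": "", "color": ""},
    {"source": "manual", "updated": "2024-01-05",
     "name": "hex bolt", "weight_kg": "0.021", "color": "silver"},
]
canonical, provenance = build_canonical(group, ["name", "weight_kg", "color"])
print(canonical)
print(provenance)
```

No single row wins outright here: the name comes from the most trusted supplier, the weight from the legacy PIM, and the color from manual entry, with the provenance dictionary recording where each cell came from.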

7. Merge reversibly and write back

Keep the merged-away records linked to the survivor with a timestamp and the rule that fired. If a merge turns out wrong, you must be able to undo it without re-importing the entire feed. Claro writes the canonical record back into your PIM or ERP and maintains a reversibility index so any bad merge can be unwound with a single API call. Reversible Merges: Deduplicating Without Losing History covers the pattern in detail.
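The pattern itself is simple enough to sketch in memory: archive the merged-away record with its survivor, the rule that fired, and a timestamp, so an undo is a lookup rather than a re-import. The MergeLog class below is a hypothetical illustration, not Claro's reversibility index or API, and it does not roll back field values already copied onto the survivor.

```python
from datetime import datetime, timezone

class MergeLog:
    """In-memory sketch of reversible merges; a real system would persist this."""

    def __init__(self, records):
        self.active = dict(records)  # record id -> record
        self.archived = {}           # merged-away id -> archive entry

    def merge(self, survivor_id, loser_id, rule):
        if survivor_id not in self.active:
            raise KeyError(f"unknown survivor {survivor_id}")
        loser = self.active.pop(loser_id)
        self.archived[loser_id] = {
            "record": loser,
            "survivor_id": survivor_id,
            "rule": rule,
            "merged_at": datetime.now(timezone.utc).isoformat(),
        }

    def undo(self, loser_id):
        """Restore a merged-away record and return its archive entry."""
        entry = self.archived.pop(loser_id)
        self.active[loser_id] = entry["record"]
        return entry

log = MergeLog({"SKU-1": {"mpn": "HF2200"}, "SKU-2": {"mpn": "HF-2200"}})
log.merge("SKU-1", "SKU-2", rule="mpn+brand exact")
print(sorted(log.active), log.archived["SKU-2"]["survivor_id"])
log.undo("SKU-2")
print(sorted(log.active))
```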

8. Verify and schedule re-runs

Spot-check a random sample of auto-merges and all human-reviewed ones. Measure precision on the sample, meaning the share of sampled merges that a reviewer confirms as correct; if it falls below your target, tighten the auto-merge threshold and push more pairs to review. Then schedule the workflow to run on every new import, because deduplication is continuous, not a one-time cleanup. Claro’s continuous resolution layer re-runs identity scoring on every incoming supplier feed automatically.
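The spot-check can be sketched as a seeded random sample plus a precision calculation. The is_correct callback stands in for a human reviewer's verdict, and the pair IDs and verdicts below are made up for the example.

```python
import random

def sample_precision(auto_merges, is_correct, sample_size, seed=0):
    """Draw a random sample of auto-merged pairs and return the share judged correct."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    sample = rng.sample(auto_merges, min(sample_size, len(auto_merges)))
    if not sample:
        return 0.0
    return sum(1 for pair in sample if is_correct(pair)) / len(sample)

# Hypothetical auto-merged pairs and reviewer verdicts.
auto_merges = [("SKU-1", "SKU-2"), ("SKU-3", "SKU-4"), ("SKU-5", "SKU-6"), ("SKU-7", "SKU-8")]
wrong_merges = {("SKU-5", "SKU-6")}

precision = sample_precision(auto_merges, lambda pair: pair not in wrong_merges, sample_size=4)
print(precision)
```

With all four pairs sampled and one wrong merge, precision is 0.75. If your target were higher than that, this is the signal to raise the auto-merge threshold and send more pairs to review.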

Common pitfalls

Frequent mistakes include trusting GTIN blindly (reused and mislabeled barcodes are common in long-tail catalogs), deduplicating without a normalization pass first, and running irreversible merges that destroy supplier history you later need for sourcing or compliance. Teams that try to scale manual scripts past a few hundred thousand rows also hit a wall. Why Fuzzy-Match Scripts Break at Scale and Scripts vs. a Matching Platform both explain why blocking and probabilistic scoring are required, not optional, at catalog scale.

Entity resolution is the underlying discipline. If you want to understand the full decision logic — deterministic matching, probabilistic scoring, clustering, and canonical merging — read Fuzzy Matching vs. Entity Resolution before tuning thresholds.

Glossary: What Is Entity Resolution? The discipline behind deciding when two records describe the same product.

Tool: Duplicate SKU Finder. Paste a catalog and surface exact and near-duplicate SKUs instantly.

Playbook: Set Confidence Thresholds for Auto-Merge. Calibrate auto-merge, review, and reject bands against a labeled sample.

Guide: Reversible Merges. Deduplicate without losing supplier history or breaking audit trails.

Glossary: Canonical Product Record. How to choose the surviving golden record field by field.

Comparison: Fuzzy Matching vs. Entity Resolution. When string similarity is enough and when you need the full resolution pipeline.

FAQ

How do I find duplicate products in a catalog?

Start by normalizing identifiers (casing, punctuation, leading zeros, units), then block records into plausible groups and score pairs within each block using exact identifier matches plus fuzzy name similarity. A tool like the Duplicate SKU Finder handles the obvious collisions; fuzzy scoring catches the rest.

What is the difference between deduplication and matching?

Matching decides whether two records refer to the same product. Deduplication is the full workflow that uses those matches to merge duplicates into one canonical record and clean the catalog. Matching is the engine; deduplication is the outcome.

Can I deduplicate a catalog automatically?

Yes, above a high confidence threshold. The safe pattern is three bands: auto-merge clear matches, route ambiguous pairs to human review, and reject weak ones. Fully automatic merging of every pair risks collapsing genuine variants like different sizes or voltages into a single incorrect record.

How do I avoid merging product variants by mistake?

Include discriminating attributes — size, color, voltage, pack quantity, material — in your match logic so that records identical in name but different in spec never auto-merge. Variants should land in the human-review band rather than being merged automatically.
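One way to enforce this is a guard that downgrades an auto-merge to review whenever two records disagree on a discriminating attribute. The attribute names and the band labels ("auto_merge", "review") are assumptions for this sketch.

```python
# Assumed attribute names; use whatever distinguishes variants in your catalog.
DISCRIMINATING = ("size", "color", "voltage", "pack_quantity", "material")

def conflicting_attributes(a, b, fields=DISCRIMINATING):
    """Return the fields where both records have a value and the values differ."""
    return [f for f in fields if a.get(f) and b.get(f) and a[f] != b[f]]

def guard_variants(band, a, b):
    """Never auto-merge records that disagree on a discriminating attribute."""
    if band == "auto_merge" and conflicting_attributes(a, b):
        return "review"
    return band

drill_110 = {"name": "cordless drill x200", "voltage": "110V", "color": "blue"}
drill_230 = {"name": "cordless drill x200", "voltage": "230V", "color": "blue"}

print(conflicting_attributes(drill_110, drill_230))
print(guard_variants("auto_merge", drill_110, drill_230))
```

A missing value is not treated as a conflict here, so a record with no voltage can still auto-merge with one that has it. Whether that is acceptable depends on how costly a wrong merge is in your domain.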

Is GTIN enough to deduplicate a catalog?

Not on its own. GTINs are sometimes reused, mistyped, or missing on long-tail items, and a single product can carry multiple valid GTINs across pack sizes. Use GTIN as a strong signal, but combine it with brand, MPN, and key attributes for reliable results.
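Mistyped GTINs, at least, are cheap to catch before matching, because every GTIN ends in a check digit computed from the other digits with the standard GS1 mod-10 scheme (weights 3 and 1 alternating from the right). The function name below is illustrative. A passing check digit only shows the number is well formed; it says nothing about reuse or mislabeling.

```python
def gtin_check_digit_ok(gtin):
    """Validate the GS1 check digit of a GTIN-8, GTIN-12, GTIN-13, or GTIN-14."""
    if not (gtin.isascii() and gtin.isdigit()) or len(gtin) not in (8, 12, 13, 14):
        return False
    digits = [int(ch) for ch in gtin]
    body, check = digits[:-1], digits[-1]
    # Starting from the digit next to the check digit, weights alternate 3, 1, 3, ...
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check

print(gtin_check_digit_ok("4006381333931"))  # well-formed GTIN-13
print(gtin_check_digit_ok("4006381333932"))  # same number with a wrong last digit
```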

How does Claro help with ongoing catalog deduplication?

Claro runs identity resolution and reversible, provenance-tracked merges as a managed layer on top of your PIM or ERP. Every new supplier feed is resolved against the existing catalog automatically, confidence-scored, and either merged or routed to review — so duplicates do not re-accumulate after each onboarding cycle.