How to Deduplicate a Product Catalog
Step-by-step playbook to find, score, and merge duplicate SKUs into clean canonical records without losing supplier history or breaking audit trails.
Duplicate SKUs are not just a housekeeping problem. Every duplicate in your catalog silently inflates procurement spend, splits stock levels, breaks AI-powered search, and sends conflicting pricing downstream to your ERP or e-commerce channel. The root cause is almost always the same: supplier feeds arrive in different formats, PIM migrations carry over legacy rows, and manual entry over years produces near-identical records that no automated process ever reconciled. The result is one real product hiding behind three to five SKUs — each with its own price, attributes, and supplier history.
This playbook walks you through how to deduplicate a product catalog end to end: from detecting duplicate records, to scoring candidate matches, to merging them into a single canonical record you can trust. Claro sits at the center of this workflow as a managed identity-resolution layer — it normalizes incoming supplier feeds, scores every candidate pair, flags ambiguous matches for human review, and writes the clean canonical record back into your PIM or ERP with a full audit trail. That means deduplication is not a one-time project you hand to a data team every six months; it runs continuously as new sources arrive.
Run this playbook after a supplier import, a PIM migration, an acquisition, or simply when years of manual entry have caught up with you. Whether you manage MRO line items, CPG units, furniture variants, or industrial spares, the workflow is the same — only the matching attributes change.
What the problem looks like before and after
| Before deduplication | After deduplication |
|---|---|
| Same product appears under 3-5 SKUs | One canonical SKU per real-world product |
| Conflicting prices and stock levels per duplicate | Single source of truth fed to ERP and e-commerce |
| Analytics and reports double-count units sold | Accurate counts, clean rollups, reliable reorder signals |
| AI search returns inconsistent or contradictory answers | One authoritative record AI can cite with confidence |
| Supplier history scattered across merged-away rows | Provenance retained and linked to the canonical record |
| Manual reconciliation takes days per catalog import | Continuous, automated resolution on every new feed |
Before you start
- 1. Profile the catalog and pick match keys
Count distinct values for each candidate identifier (GTIN, MPN, brand, model, supplier part number) and measure fill rate. A field that is 40% empty cannot anchor matching alone. For an MRO catalog, MPN + brand is usually the strongest key; for CPG, GTIN dominates; for furniture, brand + model + dimensions. Read SKU vs MPN vs GTIN if you are unsure which identifier means what. Claro’s attribute-coverage report surfaces fill rates and identifier quality across all supplier feeds automatically at this step.
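The profiling pass can be sketched in a few lines of Python. This is a minimal illustration, not Claro's attribute-coverage report: the function name `profile_fields`, the field names, and the sample rows are all assumptions made for the example.

```python
def profile_fields(records, fields):
    """Return fill rate and distinct-value count for each candidate identifier."""
    total = len(records)
    report = {}
    for field in fields:
        # Treat None, empty strings, and whitespace-only values as missing.
        values = [str(r.get(field) or "").strip() for r in records]
        filled = [v for v in values if v]
        report[field] = {
            "fill_rate": len(filled) / total if total else 0.0,
            "distinct": len(set(filled)),
        }
    return report


# Hypothetical sample rows, for illustration only.
sample = [
    {"gtin": "00012345678905", "mpn": "HF-2200", "brand": "Acme"},
    {"gtin": "", "mpn": "HF2200", "brand": "Acme"},
    {"gtin": None, "mpn": "", "brand": "ACME"},
    {"gtin": "00012345678905", "mpn": "HF-2200", "brand": "Acme"},
]
report = profile_fields(sample, ["gtin", "mpn", "brand"])
for field, stats in report.items():
    print(field, stats)
```

A low fill rate on a field suggests it should only be a supporting signal. A distinct count that drops sharply after normalization suggests the field holds many formatting variants of the same value.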
- 2. Normalize before you compare
Most apparent duplicates are the same product written differently. Standardize casing, strip punctuation and leading zeros, expand abbreviations (“ss” to “stainless steel”), and convert units to a common base. Normalize MPNs by removing dashes and spaces: “HF-2200” and “HF2200” must collapse to the same token before comparison. Read What Is Data Normalization? for the full rule set. Skipping this step inflates false negatives, because true duplicates that differ only in formatting never score as matches; the MPN Normalizer handles the most common MPN patterns.
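As a rough sketch of these rules, assuming a one-entry abbreviation table and millimetres as the common base unit (both are placeholders, and this is not the MPN Normalizer's actual logic):

```python
import re

# Illustrative lookup tables; a real rule set would be much larger.
ABBREVIATIONS = {"ss": "stainless steel"}
TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0, "in": 25.4}


def normalize_mpn(mpn: str) -> str:
    """Uppercase, drop every non-alphanumeric character, strip leading zeros."""
    token = re.sub(r"[^A-Za-z0-9]", "", mpn).upper()
    # Keep the token as-is if it consists only of zeros.
    return token.lstrip("0") or token


def normalize_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, expand known abbreviations."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)


def to_mm(value: float, unit: str) -> float:
    """Convert a length to millimetres, the assumed common base unit."""
    return value * TO_MM[unit.lower()]


print(normalize_mpn("HF-2200"), normalize_mpn("hf 2200"))
print(normalize_name("SS Hex Bolt, M8"))
print(to_mm(2, "in"))
```

Stripping leading zeros is safe only if your suppliers do not use them to distinguish parts, so check a sample before applying it catalog-wide.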
- 3. Block to reduce comparisons
Comparing every record to every other is O(n²): n records produce n(n-1)/2 pairs, so a catalog of one million rows means roughly 500 billion comparisons. Group records into blocks that could plausibly match (by brand, by GTIN prefix, or by the first normalized token of the MPN) and compare only within each block. This keeps an industrial-distribution catalog of millions of rows tractable. If your homegrown scripts slow to a crawl past a few hundred thousand rows, missing or overly coarse blocking is usually the cause; fix the blocking rather than lowering match quality.
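A minimal blocking sketch, assuming brand is the block key. The `block_key` and `candidate_pairs` names and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def block_key(record):
    """Assumed block key: the normalized brand. Swap in a GTIN prefix or MPN token as needed."""
    return record["brand"].strip().lower()


def candidate_pairs(records):
    """Yield index pairs only for records that share a block key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)


# Hypothetical rows, for illustration only.
records = [
    {"brand": "Acme", "mpn": "HF2200"},
    {"brand": "acme ", "mpn": "HF-2200"},
    {"brand": "Bolt Co", "mpn": "ZX9"},
    {"brand": "ACME", "mpn": "HF2201"},
]
pairs = list(candidate_pairs(records))
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

A single block key misses duplicates whose key field is wrong or empty, so in practice you may want to run several blocking passes with different keys and take the union of the pairs.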
- 4. Score candidate pairs
Within each block, score pairs using exact matches on strong identifiers plus fuzzy similarity on names and attributes. Use the Duplicate SKU Finder to surface obvious collisions, and the String Similarity Calculator to tune how lenient string comparison should be. Each pair gets a confidence score between 0 and 1. Read What Is a Confidence Score? to understand how the score is derived and what it means for downstream decisions.
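One simple way to combine exact and fuzzy signals is a weighted sum. The sketch below uses the standard library's `difflib.SequenceMatcher` for name similarity. The weights are arbitrary placeholders, not how the Duplicate SKU Finder or Claro derives its score, and the inputs are assumed to be already normalized.

```python
from difflib import SequenceMatcher

# Hypothetical weights; they sum to 1 so the score stays between 0 and 1.
WEIGHTS = {"mpn": 0.5, "brand": 0.2, "name": 0.3}


def score_pair(a, b):
    """Combine exact identifier agreement with fuzzy name similarity."""
    mpn_match = 1.0 if a["mpn"] and a["mpn"] == b["mpn"] else 0.0
    brand_match = 1.0 if a["brand"] and a["brand"] == b["brand"] else 0.0
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    return (
        WEIGHTS["mpn"] * mpn_match
        + WEIGHTS["brand"] * brand_match
        + WEIGHTS["name"] * name_sim
    )


# Hypothetical, already-normalized records.
a = {"mpn": "HF2200", "brand": "acme", "name": "hydraulic filter 10 micron"}
b = {"mpn": "HF2200", "brand": "acme", "name": "hydraulic filter 10micron"}
c = {"mpn": "ZX9", "brand": "bolt co", "name": "zinc washer"}

print(round(score_pair(a, b), 3))
print(round(score_pair(a, c), 3))
```

A weighted sum is easy to reason about but is not a calibrated probability. Treat the output as a ranking signal until you have checked it against labeled pairs.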
- 5. Set merge thresholds
Decide three bands: auto-merge above a high threshold, send to human review in the middle, and reject below a low threshold. The exact cutoffs depend on the cost of a wrong merge in your domain — merging two different bearings is far more dangerous than merging two near-identical pens. See How to Set Confidence Thresholds for Auto-Merge for how to calibrate these against your labeled sample.
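The three bands reduce to a small routing function. The cutoffs below are made-up placeholders; yours should come from calibration against a labeled sample.

```python
# Hypothetical cutoffs, for illustration only.
AUTO_MERGE = 0.92
REJECT = 0.60


def route(score, auto_merge=AUTO_MERGE, reject=REJECT):
    """Map a confidence score to one of the three bands."""
    if score >= auto_merge:
        return "auto_merge"
    if score < reject:
        return "reject"
    return "review"


for s in (0.97, 0.75, 0.41):
    print(s, route(s))
```

For a high-risk category such as bearings you might pass a stricter `auto_merge` value per call, which widens the review band without touching the scoring logic.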
- 6. Choose the canonical record and merge
For each confirmed duplicate group, build one canonical record by selecting the best value for each attribute — most complete, most recent, or from the most trusted source — rather than blindly keeping the first row. Record which source won each field. Claro scores source trustworthiness per attribute and populates each field with the highest-confidence value, leaving a provenance tag so you know where each cell came from. See What Is a Canonical Product Record? for the full selection logic.
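Field-by-field selection might look like the sketch below. It assumes a single trust ranking per source, with recency as the tie-breaker, which is simpler than the per-attribute trust scoring described above. The source names and trust values are invented for the example.

```python
from datetime import date

# Hypothetical trust ranking; higher wins.
SOURCE_TRUST = {"manufacturer_feed": 3, "distributor_feed": 2, "manual_entry": 1}


def build_canonical(group, fields):
    """Pick the best non-empty value per field and record which source supplied it."""
    canonical, provenance = {}, {}
    for field in fields:
        candidates = [r for r in group if r.get(field) not in (None, "")]
        if not candidates:
            continue  # no source has a value for this field; leave it unset
        # Most trusted source wins; the most recent update breaks ties.
        winner = max(
            candidates,
            key=lambda r: (SOURCE_TRUST.get(r["source"], 0), r["updated"]),
        )
        canonical[field] = winner[field]
        provenance[field] = winner["source"]
    return canonical, provenance


# Hypothetical duplicate group.
group = [
    {"source": "manual_entry", "updated": date(2024, 5, 1),
     "name": "SS hex bolt M8", "weight_kg": 0.02, "mpn": ""},
    {"source": "manufacturer_feed", "updated": date(2023, 1, 10),
     "name": "Stainless Steel Hex Bolt M8", "weight_kg": None, "mpn": "HB-M8"},
]
canonical, provenance = build_canonical(group, ["name", "weight_kg", "mpn"])
print(canonical)
print(provenance)
```

Here the manufacturer feed wins the name and MPN, while the weight falls back to the manual entry because it is the only source with a value.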
- 7. Merge reversibly and write back
Keep the merged-away records linked to the survivor with a timestamp and the rule that fired. If a merge turns out wrong, you must be able to undo it without re-importing the entire feed. Claro writes the canonical record back into your PIM or ERP and maintains a reversibility index so any bad merge can be unwound with a single API call. Reversible Merges: Deduplicating Without Losing History covers the pattern in detail.
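The reversibility requirement can be illustrated with an in-memory merge log. This is a toy sketch, not Claro's API or its reversibility index; the `MergeLog` class and its method names are hypothetical, and a real system would persist the log.

```python
from datetime import datetime, timezone


class MergeLog:
    """Toy in-memory store that keeps merged-away records so merges can be undone."""

    def __init__(self):
        self.active = {}  # sku -> record currently live in the catalog
        self.merged = {}  # merged-away sku -> log entry with the full original record

    def add(self, sku, record):
        self.active[sku] = record

    def merge(self, survivor, duplicate, rule):
        """Retire `duplicate` but keep it linked to `survivor` with a timestamp and rule."""
        record = self.active.pop(duplicate)
        self.merged[duplicate] = {
            "survivor": survivor,
            "record": record,
            "rule": rule,
            "merged_at": datetime.now(timezone.utc),
        }

    def undo(self, duplicate):
        """Restore a merged-away record without re-importing anything."""
        entry = self.merged.pop(duplicate)
        self.active[duplicate] = entry["record"]
        return entry


log = MergeLog()
log.add("SKU-1", {"mpn": "HF2200", "supplier": "A"})
log.add("SKU-2", {"mpn": "HF-2200", "supplier": "B"})
log.merge("SKU-1", "SKU-2", rule="mpn+brand exact")
merged_skus = sorted(log.merged)  # state right after the merge
entry = log.undo("SKU-2")
print(merged_skus, entry["rule"], sorted(log.active))
```

Because the merged-away record is stored whole, its supplier history survives the merge and comes back intact on undo.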
- 8. Verify and schedule re-runs
Spot-check a random sample of auto-merges and all human-reviewed ones. Measure precision on the sample; if it falls below your target, tighten the auto-merge threshold and push more pairs to review. Then schedule the workflow to run on every new import — deduplication is continuous, not a one-time cleanup. Claro’s continuous resolution layer re-runs identity scoring on every incoming supplier feed automatically.
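A sketch of the spot-check, assuming reviewers return one boolean verdict per sampled merge. The 0.98 precision target and the function names are placeholders.

```python
import random


def sample_for_review(merges, k, seed=42):
    """Draw a reproducible random sample of auto-merges to spot-check."""
    rng = random.Random(seed)
    return rng.sample(merges, min(k, len(merges)))


def precision(verdicts):
    """Share of sampled merges the reviewer confirmed; None if nothing was reviewed."""
    return sum(verdicts) / len(verdicts) if verdicts else None


def needs_tightening(verdicts, target=0.98):
    """True when measured precision falls below the (hypothetical) target."""
    p = precision(verdicts)
    return p is not None and p < target


# Hypothetical auto-merged pairs and reviewer verdicts.
merges = [(f"A{i}", f"B{i}") for i in range(10)]
to_check = sample_for_review(merges, 4)
verdicts = [True, True, True, False]
print(len(to_check), precision(verdicts), needs_tightening(verdicts))
```

A small sample gives a noisy estimate, so a single wrong merge in four is a signal to sample more before moving the threshold.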
Common pitfalls
Frequent mistakes include trusting GTIN blindly (reused and mislabeled barcodes are common in long-tail catalogs), deduplicating without a normalization pass first, and running irreversible merges that destroy supplier history you later need for sourcing or compliance. Teams that try to scale manual scripts past a few hundred thousand rows also hit a wall: Why Fuzzy-Match Scripts Break at Scale and Scripts vs. a Matching Platform both explain why blocking and probabilistic scoring are required, not optional, at catalog scale.
Entity resolution is the underlying discipline. If you want to understand the full decision logic — deterministic matching, probabilistic scoring, clustering, and canonical merging — read Fuzzy Matching vs. Entity Resolution before tuning thresholds.
Related
Glossary
What Is Entity Resolution?
The discipline behind deciding when two records describe the same product.
Tool
Duplicate SKU Finder
Paste a catalog and surface exact and near-duplicate SKUs instantly.
Playbook
Set Confidence Thresholds for Auto-Merge
Calibrate auto-merge, review, and reject bands against a labeled sample.
Guide
Reversible Merges
Deduplicate without losing supplier history or breaking audit trails.
Glossary
Canonical Product Record
How to choose the surviving golden record field by field.
Comparison
Fuzzy Matching vs. Entity Resolution
When string similarity is enough and when you need the full resolution pipeline.
FAQ
How do I find duplicate products in a catalog?
Start by normalizing identifiers (casing, punctuation, leading zeros, units), then block records into plausible groups and score pairs within each block using exact identifier matches plus fuzzy name similarity. A tool like the Duplicate SKU Finder handles the obvious collisions; fuzzy scoring catches the rest.
What is the difference between deduplication and matching?
Matching decides whether two records refer to the same product. Deduplication is the full workflow that uses those matches to merge duplicates into one canonical record and clean the catalog. Matching is the engine; deduplication is the outcome.
Can I deduplicate a catalog automatically?
Yes, above a high confidence threshold. The safe pattern is three bands: auto-merge clear matches, route ambiguous pairs to human review, and reject weak ones. Fully automatic merging of every pair risks collapsing genuine variants like different sizes or voltages into a single incorrect record.
How do I avoid merging product variants by mistake?
Include discriminating attributes — size, color, voltage, pack quantity, material — in your match logic so that records identical in name but different in spec never auto-merge. Variants should land in the human-review band rather than being merged automatically.
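A variant guard can be sketched as a check that downgrades an auto-merge to review whenever a discriminating attribute conflicts. The attribute keys and the band names are assumptions for the example.

```python
# Hypothetical attribute keys; use whichever discriminators your catalog carries.
DISCRIMINATORS = ("size", "color", "voltage", "pack_qty", "material")


def conflicting_attributes(a, b, keys=DISCRIMINATORS):
    """List discriminating attributes that both records fill in with different values."""
    return [k for k in keys if a.get(k) and b.get(k) and a[k] != b[k]]


def guard(band, a, b):
    """Downgrade an auto-merge to human review when any discriminator conflicts."""
    if band == "auto_merge" and conflicting_attributes(a, b):
        return "review"
    return band


# Same name, different voltage: a variant, not a duplicate.
x = {"name": "led driver 60w", "voltage": "12V"}
y = {"name": "led driver 60w", "voltage": "24V"}
print(conflicting_attributes(x, y), guard("auto_merge", x, y))
```

This version treats a missing attribute as "no conflict". A stricter policy would also send pairs with a missing discriminator to review.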
Is GTIN enough to deduplicate a catalog?
Not on its own. GTINs are sometimes reused, mistyped, or missing on long-tail items, and a single product can carry multiple valid GTINs across pack sizes. Use GTIN as a strong signal, but combine it with brand, MPN, and key attributes for reliable results.
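Many mistyped GTINs can be caught before matching by validating the check digit. The sketch below uses the standard GS1 mod-10 scheme, where weights alternate 3 and 1 starting from the rightmost data digit. It only detects malformed codes; it cannot tell you whether a well-formed GTIN was reused or attached to the wrong product.

```python
def gtin_check_digit(body: str) -> int:
    """GS1 mod-10 check digit for the digits that precede it."""
    total = sum(
        int(d) * (3 if i % 2 == 0 else 1)
        for i, d in enumerate(reversed(body))
    )
    return (10 - total % 10) % 10


def is_valid_gtin(gtin: str) -> bool:
    """True for an 8-, 12-, 13-, or 14-digit code whose last digit matches the check digit."""
    if not (gtin.isascii() and gtin.isdigit()) or len(gtin) not in (8, 12, 13, 14):
        return False
    return gtin_check_digit(gtin[:-1]) == int(gtin[-1])


for code in ("012345678905", "4006381333931", "4006381333932"):
    print(code, is_valid_gtin(code))
```

Codes that fail the check are better routed to review than used as a match key.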
How does Claro help with ongoing catalog deduplication?
Claro runs identity resolution and reversible, provenance-tracked merges as a managed layer on top of your PIM or ERP. Every new supplier feed is resolved against the existing catalog automatically, confidence-scored, and either merged or routed to review — so duplicates do not re-accumulate after each onboarding cycle.