Build a Golden Product Record: Step-by-Step Playbook

Build a golden product record: cluster duplicates, apply survivorship rules, preserve provenance, and write clean data back to PIM/ERP.


Supplier feeds disagree on item descriptions. Your ERP carries three SKUs for the same fastener under different vendor part numbers. Your PIM has 40,000 products but analysts trust fewer than half the attribute values because no one knows which source wins when records conflict. The root cause is the same in every case: no single authoritative version of each product exists.

Building a golden product record — one canonical, trusted record per real-world product — resolves that. Claro automates the pipeline: it clusters records that describe the same item, applies field-level survivorship rules, validates enriched attributes, and writes the clean canonical record back into your existing PIM or ERP. The result is a catalog your pricing engine, AI search, and procurement team can all rely on. This playbook walks through exactly how that process works so you can run it yourself or understand what Claro is doing under the hood.

What makes a record truly canonical

A canonical record is not “pick the newest row and call it done.” It is a deliberate decision for every attribute field: which source has the most accurate value, and what happens when two sources disagree? That decision is encoded in survivorship rules — the field-by-field logic that determines which input wins. A canonical record is also reversible: every surviving value stores the source that supplied it so any bad merge can be unwound without guessing.

The table below shows the difference in practice.

| Messy catalog (before) | Trusted catalog (after) |
| --- | --- |
| Same product in 3-5 rows under different vendor SKUs | One canonical record per real-world product |
| Conflicting net weight, voltage, and unit-of-measure across rows | Single authoritative attribute value per field with source lineage |
| Analytics double-count; procurement orders the same part twice | Accurate rollups and clean spend analysis |
| AI assistants return contradictory specs or hedge on availability | One citable record that AI search can point to confidently |
| Manual effort to reconcile feeds on every catalog refresh | Automated re-clustering and write-back on a defined schedule |

Step-by-step: build a golden product record

  1. Define what 'the same product' means

    Write down the business rule before you touch data. Decide whether a 5-gallon and 1-gallon pack of the same coating are one product or two, whether color variants share a canonical parent, and which identifiers — GTIN, MPN, internal SKU — are authoritative when present. For furniture, a sofa in three fabrics is usually one model with variants; for CPG, each pack size is typically its own sellable item. Capturing this as explicit policy prevents silent over-merging (collapsing genuinely distinct products) and under-merging (leaving true duplicates split) later in the pipeline.
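
    One way to keep that policy explicit is to write it down as data rather than prose. The sketch below is illustrative only: `SameProductPolicy`, `parent_key`, and the field names (`brand`, `model`, `pack_size`, `fabric`) are hypothetical and not part of any Claro API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SameProductPolicy:
    base_fields: tuple          # fields that identify the underlying model
    distinct_on: tuple = ()     # attributes that make a separate sellable product
    variant_on: tuple = ()      # attributes that vary under one canonical parent
                                # (kept for downstream variant modeling, not used in the key)

def parent_key(record, policy):
    """Records with equal keys belong under the same canonical parent."""
    fields = policy.base_fields + policy.distinct_on
    return tuple(str(record.get(f, "")).strip().lower() for f in fields)

# Assumed example policies: furniture treats fabric as a variant,
# CPG treats each pack size as its own sellable item.
FURNITURE = SameProductPolicy(base_fields=("brand", "model"), variant_on=("fabric",))
CPG = SameProductPolicy(base_fields=("brand", "model"), distinct_on=("pack_size",))

linen = {"brand": "Acme", "model": "Loft Sofa", "fabric": "linen"}
velvet = {"brand": "ACME", "model": "Loft Sofa ", "fabric": "velvet"}
print(parent_key(linen, FURNITURE) == parent_key(velvet, FURNITURE))  # True
```

    Under the assumed CPG policy, a 1-gallon and a 5-gallon pack of the same coating would get different keys because `pack_size` is listed in `distinct_on`.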

  2. Cluster candidate duplicates

    Group records that plausibly describe the same item. Start with deterministic keys — normalized MPN, GTIN — for high-confidence ties, then layer fuzzy matching on product names, dimensions, and brand to catch the rest. An industrial distributor reconciling 40 supplier feeds will find that “1/2 in. NPT brass ball valve” and “Brass Ball Valve, 1/2 NPT” belong together despite zero shared identifier. Each cluster becomes the raw input for one canonical record. See entity resolution and record linkage for the underlying techniques.
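
    The two-pass idea can be sketched as follows, assuming records are plain dicts with a `name` and optional `gtin` or `mpn` keys. Token-set Jaccard similarity and the 0.8 cutoff are stand-ins chosen for illustration, not the matcher Claro uses, and the all-pairs loop would need blocking before it could handle a real catalog.

```python
import re
from itertools import combinations

def normalize_id(value):
    """Strip punctuation and case so 'BV-050N' and 'bv050n' compare equal."""
    return re.sub(r"[^A-Z0-9]", "", (value or "").upper())

def name_tokens(name):
    return set(re.findall(r"[a-z0-9/]+", name.lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(records, fuzzy_threshold=0.8):
    """Group record indices with a shared identifier or similar names (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        a, b = records[i], records[j]
        # Pass 1: deterministic tie on a normalized identifier.
        same_id = any(
            normalize_id(a.get(k)) and normalize_id(a.get(k)) == normalize_id(b.get(k))
            for k in ("gtin", "mpn")
        )
        # Pass 2: fuzzy tie on product-name tokens.
        similar = jaccard(name_tokens(a["name"]), name_tokens(b["name"])) >= fuzzy_threshold
        if same_id or similar:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

records = [
    {"name": "1/2 in. NPT brass ball valve"},
    {"name": "Brass Ball Valve, 1/2 NPT"},
    {"name": "Ball valve brass half inch", "mpn": "BV-050N"},
    {"name": "BV 1/2 brs", "mpn": "bv050n"},
    {"name": "Steel gate valve, 3/4 NPT"},
]
print(cluster(records))  # [[0, 1], [2, 3], [4]]
```

    This sketch will still merge two records with similar names even when their identifiers conflict, so a production matcher would want an explicit veto for that case.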

  3. Choose survivorship rules per attribute

    Decide, field by field, which source wins. Survivorship is rarely “newest record takes all.” You might trust the manufacturer feed for dimensions and certifications, your ERP for cost and internal pricing, and the richest source for descriptions and images. Document tie-breakers explicitly — most recent, highest-trust source, or longest non-null value — and apply rules at the attribute level so the golden record can draw the best field from several sources rather than copying one row wholesale.
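
    A minimal sketch of attribute-level survivorship, assuming each candidate value arrives as a dict with `source`, `value`, and an ISO `updated` date. The source names, field names, and trust orderings in `RULES` are hypothetical examples, not recommended defaults.

```python
from datetime import date

def _recency(candidate):
    return date.fromisoformat(candidate["updated"]).toordinal()

def highest_trust(order):
    """Prefer sources earlier in `order`; break ties with the most recent update."""
    rank = {source: i for i, source in enumerate(order)}
    def rule(candidates):
        return min(candidates, key=lambda c: (rank.get(c["source"], len(rank)), -_recency(c)))
    return rule

def longest_value(candidates):
    """Prefer the longest value; break ties with the most recent update."""
    return max(candidates, key=lambda c: (len(str(c["value"])), _recency(c)))

# One rule per attribute (assumed field names and trust orderings).
RULES = {
    "width_in": highest_trust(["manufacturer", "erp", "supplier"]),
    "unit_cost": highest_trust(["erp", "supplier", "manufacturer"]),
    "description": longest_value,
}

def pick(field, candidates):
    """Return the winning candidate for one field, or None if every source is empty."""
    live = [c for c in candidates if c["value"] not in (None, "")]
    return RULES[field](live) if live else None

width = pick("width_in", [
    {"source": "supplier", "value": 2.0, "updated": "2024-06-01"},
    {"source": "manufacturer", "value": 2.5, "updated": "2023-01-15"},
])
print(width["source"], width["value"])  # manufacturer 2.5
```

    Because empty values are dropped before the rule runs, a lower-priority source fills the field whenever the preferred source has nothing to offer.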

  4. Merge and assemble the golden record

    Apply survivorship rules to produce one record per cluster. Normalize units, casing, and enumerations as you go: “in”, “inch”, and the inch symbol collapse to one unit; “each”, “EA”, and “1 EA” collapse to one unit-of-measure code. Fill gaps from lower-priority sources so the canonical record is more complete than any single input. For a CPG catalog, this is the step where fragmented entries finally resolve into one row with a clean brand name, declared net weight, and consistent pack configuration.
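
    The unit collapsing described above can be sketched with plain lookup tables. The synonym lists here are assumptions that cover only the examples in this step; a real catalog would need a much longer, reviewed mapping.

```python
# Assumed synonym tables; extend them from the values actually seen in your feeds.
LENGTH_UNITS = {"in": "in", "inch": "in", "inches": "in", '"': "in", "\u2033": "in"}
UOM_CODES = {"each": "EA", "ea": "EA", "1 ea": "EA"}

def normalize_length_unit(raw):
    """Map a raw length unit to its canonical code, or None if unrecognized."""
    key = raw.strip().lower().rstrip(".")
    return LENGTH_UNITS.get(key)

def normalize_uom(raw):
    """Map a raw unit-of-measure string to its canonical code, or None if unrecognized."""
    key = " ".join(raw.strip().lower().split())  # collapse repeated whitespace
    return UOM_CODES.get(key)

print(normalize_length_unit("Inch"), normalize_length_unit('"'), normalize_uom("1  EA"))
# in in EA
```

    Returning `None` for unknown input, rather than guessing, lets the pipeline flag the value for review instead of silently writing a wrong unit.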

  5. Record provenance and source lineage

    For every surviving value, store which source supplied it and when. Provenance is what lets a buyer ask “why does this say 120V?” and get an answer in seconds, and what lets you reverse a bad merge without reconstructing the decision from memory. Keep the contributing source IDs attached to the canonical record so every merge stays auditable. See data provenance for why this matters at scale.
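
    One possible shape for that lineage is to store each surviving value together with its origin. The class and field names below (`FieldValue`, `GoldenRecord`, `source_record_id`, `observed_at`) are hypothetical, and the IDs in the example are made up.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class FieldValue:
    value: object
    source: str            # which feed or system supplied the value
    source_record_id: str  # the contributing row in that source
    observed_at: str       # ISO date the value was read

@dataclass
class GoldenRecord:
    canonical_id: str
    fields: dict = field(default_factory=dict)
    contributing_ids: set = field(default_factory=set)

    def assign(self, name, fv):
        self.fields[name] = fv
        self.contributing_ids.add(fv.source_record_id)

    def explain(self, name):
        """Answer 'why does this field say X?' from stored lineage."""
        fv = self.fields[name]
        return f"{name}={fv.value!r} from {fv.source} record {fv.source_record_id} on {fv.observed_at}"

    def fields_from(self, source_record_id):
        """List the fields that would need re-deciding if this source row were unmerged."""
        return sorted(n for n, fv in self.fields.items() if fv.source_record_id == source_record_id)

rec = GoldenRecord("CAN-0001")
rec.assign("voltage", FieldValue("120V", "manufacturer", "MFR-881", "2024-03-02"))
rec.assign("unit_cost", FieldValue(4.10, "erp", "ERP-17", "2024-04-01"))
print(rec.explain("voltage"))
# voltage='120V' from manufacturer record MFR-881 on 2024-03-02
```

    `fields_from` is what makes a bad merge reversible in this sketch: it names exactly which values came from the row being pulled back out.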

  6. Set a confidence floor and route exceptions

    Auto-merge only clusters that clear a confidence threshold; send borderline clusters to human review instead of guessing. A conservative floor protects you from collapsing two genuinely different parts that share a similar name or description. Tune the threshold against a labeled sample of known-good and known-bad matches, and revisit it whenever new supplier feeds change the distribution. The confidence-thresholds playbook covers tuning in detail.
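
    As a rough illustration, the floor can be derived from a labeled sample by walking scores from highest to lowest and stopping as soon as precision drops below a target. The 0.99 target, the sample scores, and the function names are assumptions for this sketch, not values from Claro.

```python
def precision_at(labeled, floor):
    """Share of true matches among pairs scoring at or above `floor`."""
    accepted = [is_match for score, is_match in labeled if score >= floor]
    return sum(accepted) / len(accepted) if accepted else 1.0

def pick_floor(labeled, target_precision=0.99):
    """Lowest floor such that it and every higher floor meet the precision target."""
    best = None
    for floor in sorted({score for score, _ in labeled}, reverse=True):
        if precision_at(labeled, floor) < target_precision:
            break
        best = floor
    return best

def route(score, floor):
    """Auto-merge confident clusters; send everything else to human review."""
    return "auto_merge" if floor is not None and score >= floor else "review"

# (match score, reviewer says it is a true match): a made-up labeled sample
labeled = [(0.98, True), (0.95, True), (0.91, True), (0.88, False), (0.85, True), (0.60, False)]
floor = pick_floor(labeled)
print(floor, route(0.93, floor), route(0.88, floor))  # 0.91 auto_merge review
```

    If no floor meets the target, `pick_floor` returns `None` and `route` sends every cluster to review, which is the conservative default.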

  7. Write back and re-run on a schedule

    Publish canonical IDs and clean attribute values back to your PIM, ERP, and storefront so every downstream system points at the same record. New supplier files arrive constantly, so treat this as a recurring job: re-cluster, re-merge, and reconcile on a defined cadence rather than as a one-time cleanup. Claro automates this write-back step, keeping your existing systems of record populated with trusted data without requiring a migration.
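
    On each cycle, the write-back payload can be kept small by sending only what changed, plus a crosswalk from source record IDs to canonical IDs. The sketch below assumes records are flat dicts and leaves the actual PIM or ERP call out, since that API is not described here; `build_crosswalk` and `build_changeset` are hypothetical helper names.

```python
def build_crosswalk(clusters):
    """Map every source record ID to the canonical ID of its cluster."""
    return {src: canonical_id for canonical_id, members in clusters.items() for src in members}

def build_changeset(current, golden):
    """Return only the fields whose golden value differs from what the target system holds."""
    return {name: value for name, value in golden.items() if current.get(name) != value}

clusters = {"CAN-0001": ["SUP-A-778", "SUP-B-1042", "ERP-17"]}
current = {"brand": "acme", "net_weight_g": 500, "uom": "EA"}
golden = {"brand": "Acme", "net_weight_g": 500, "uom": "EA", "voltage": "120V"}

print(build_crosswalk(clusters)["SUP-B-1042"])  # CAN-0001
print(build_changeset(current, golden))         # {'brand': 'Acme', 'voltage': '120V'}
```

    Running the same job twice against an unchanged catalog then produces an empty changeset, which keeps the scheduled re-run cheap and easy to audit.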

Common pitfalls when building golden product records

Frequent failures to avoid:

  • Over-trusting the “newest” source. A supplier who updated their feed last week may have introduced a transcription error, not a spec change. Recency alone is a weak signal; source trust score matters more.
  • Normalizing units inconsistently. If your merge step (Step 4) converts “mm” to “in” on some feeds but not others, dimension comparisons break silently and you generate false non-matches.
  • Treating variants as duplicates. Collapsing a product’s three color finishes into one canonical row destroys variant data. Define your same-product rule (Step 1) so variant relationships are modeled explicitly.
  • Hand-tuned match scripts that degrade at scale. What works at 5,000 SKUs typically breaks at 500,000. Threshold drift, new supplier formats, and schema changes erode match quality without a systematic review process.

FAQ

What is a golden product record?

A golden product record — also called a canonical product record — is the single authoritative version of a product assembled from every source that describes it. It combines the best attribute values across sources using survivorship rules and retains provenance so each value can be traced to its origin.

How is a canonical record different from just deleting duplicates?

Deleting duplicates picks one row and discards the rest, losing data and history. Building a canonical record merges at the attribute level, combining the most trustworthy and complete values from multiple sources, and retains lineage so the merge can be audited or reversed without data loss.

How do I decide which attribute value wins when sources conflict?

Define survivorship rules per field rather than per record. Common rules are highest-trust source, most recent update, or longest non-null value, with explicit tie-breakers documented. Manufacturer feeds typically win on specifications and certifications; your ERP wins on cost and internal SKU assignments.

Should I auto-merge every matched cluster?

No. Auto-merge only clusters that clear a confidence threshold, and route borderline matches to human review. A conservative floor prevents collapsing two genuinely different products that happen to share similar names, dimensions, or descriptions.

How often should canonical records be rebuilt?

Treat it as a recurring job, not a one-time cleanup. New supplier feeds, catalog imports, and ERP changes arrive continuously, so re-cluster and re-merge on a defined schedule and write canonical IDs back to your PIM, ERP, and storefront each cycle.

Can Claro handle the write-back step automatically?

Yes. Claro resolves product identity across supplier feeds, applies survivorship rules to assemble a trusted canonical record, and writes clean data back into your existing PIM or ERP — no new system of record required. Every merged value carries a source lineage so any decision can be explained or reversed.


See where your catalog breaks — free

Claro runs this automatically: resolve identity, fill missing attributes, validate updates, and write clean records back into your PIM/ERP. Upload a sample supplier file for a free catalog audit.
