Classify an Inherited Catalog: A Practical Workflow

How to classify an inherited catalog you didn't build: clean, dedupe, extract attributes, and assign taxonomy codes at scale without baking in old mistakes.


You acquired a competitor, inherited an ERP migration, or absorbed a supplier range, and now tens of thousands of SKUs have landed on your desk with no documentation. Descriptions are abbreviations someone understood in 2014. Half the category column is blank; the other half uses a scheme nobody can explain. The instinct is to start hand-coding rows. That doesn’t scale, and it bakes in the previous owner’s mistakes. The real job is to reconstruct meaning from messy records, assign a consistent taxonomy you can actually maintain, and write clean codes back into the PIM or ERP before those records poison your downstream channels.

Claro resolves identity, extracts attributes from source documents, assigns taxonomy with a confidence score and provenance link on every value, and pushes trusted records back into your existing systems, so an inherited catalog becomes a maintainable one.

Start with what the data tells you, not what it should be

An inherited catalog is an archaeological site. Before mapping anything to a standard, profile what you actually have. Pull a column-level summary: fill rates, distinct values, and the worst offenders for free-text chaos.
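A column-level profile like the one described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the sample rows, column names, and the `profile_columns` helper are all hypothetical, and it assumes the export has already been loaded as a list of dictionaries.

```python
from collections import Counter

def profile_columns(rows):
    """Return fill rate, distinct count, and most common values per column."""
    columns = sorted({key for row in rows for key in row})
    total = len(rows)
    summary = {}
    for col in columns:
        # Treat None and whitespace-only strings as empty.
        values = [str(row.get(col) or "").strip() for row in rows]
        filled = [v for v in values if v]
        summary[col] = {
            "fill_rate": len(filled) / total if total else 0.0,
            "distinct": len(set(filled)),
            "top_values": Counter(filled).most_common(3),
        }
    return summary

# Hypothetical sample rows standing in for an inherited export.
rows = [
    {"sku": "A-100", "name": "MCB 16A C-curve", "category": ""},
    {"sku": "A-101", "name": "mcb 16a c curve", "category": "ELEC"},
    {"sku": "A-102", "name": "Water 12x500ml", "category": None},
    {"sku": "A-103", "name": "Water 12x500ml", "category": "ELEC"},
]

for column, stats in profile_columns(rows).items():
    print(column, stats)
```

A low fill rate points to columns that need enrichment, while a distinct count close to the row count in a column that should be categorical is a quick signal of free-text chaos.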

You will usually find three failure patterns. A CPG catalog might carry pack configurations buried in the product name (“12x500ml”) with no structured size field. An MRO catalog often has manufacturer part numbers smashed together with internal SKUs in one column. A furniture catalog frequently encodes material and dimensions in unstructured marketing copy. Each demands a different extraction approach, so diagnose first.
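As an illustration of the first pattern, a pack configuration such as “12x500ml” can often be pulled out of a product name with a regular expression. The sketch below is an assumption-laden example: the `parse_pack` helper and the list of recognized units are hypothetical, and a real catalog will need a longer unit list and handling for edge cases.

```python
import re

# Matches "<count> x <size><unit>", e.g. "12x500ml" or "6 x 1.5 L".
# The unit list is an illustrative assumption, not an exhaustive set.
PACK_PATTERN = re.compile(
    r"(?P<count>\d+)\s*[x×]\s*(?P<size>\d+(?:\.\d+)?)\s*(?P<unit>ml|cl|l|g|kg)\b",
    re.IGNORECASE,
)

def parse_pack(name):
    """Return (count, size, unit) from a product name, or None if absent."""
    match = PACK_PATTERN.search(name)
    if match is None:
        return None
    return (int(match["count"]), float(match["size"]), match["unit"].lower())

print(parse_pack("Sparkling Water 12x500ml"))
print(parse_pack("Olive Oil 6 x 1.5 L"))
print(parse_pack("Desk lamp, black"))
```

Returning `None` when nothing matches keeps the gap visible, so the record can be enriched later instead of carrying a guessed size.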

Reconstruct the product before you classify it

Classification is only as good as the record underneath it. A miniature circuit breaker labeled “MCB 16A C-curve” cannot be reliably coded until you have resolved that it is one product, not three near-duplicate rows, and until the key attributes (rated current, tripping characteristic, pole count) are extracted into real fields.
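To make the circuit breaker example concrete, here is a rough sketch of turning a description into real fields. The patterns, the field names, and the set of curve letters are assumptions made for illustration; they are not Claro's extraction logic, and production descriptions will be messier than this.

```python
import re

# Illustrative patterns only; real descriptions vary far more than this.
CURRENT_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*A\b", re.IGNORECASE)
CURVE_RE = re.compile(r"\b([BCDKZ])[\s-]?curve\b", re.IGNORECASE)  # assumed letter set
POLES_RE = re.compile(r"\b([1-4])\s*P\b", re.IGNORECASE)

def extract_mcb_attributes(description):
    """Extract rated current, tripping characteristic, and pole count."""
    current = CURRENT_RE.search(description)
    curve = CURVE_RE.search(description)
    poles = POLES_RE.search(description)
    return {
        "rated_current_a": float(current.group(1)) if current else None,
        "tripping_characteristic": curve.group(1).upper() if curve else None,
        # Missing values stay None rather than being guessed.
        "pole_count": int(poles.group(1)) if poles else None,
    }

print(extract_mcb_attributes("MCB 16A C-curve"))
print(extract_mcb_attributes("mcb 16a c curve 3P"))
```

Note that “MCB 16A C-curve” yields no pole count. That missing field is exactly the kind of gap that should lower confidence or trigger enrichment before a code is assigned.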

Work in this order:

  1. Normalize and dedupe

    Standardize units, casing, and identifiers, then collapse obvious duplicates so you classify each real product once. Industrial distribution catalogs in particular hide the same item under several legacy SKUs.

  2. Extract attributes from free text

    Parse names, specs, and any attached PDFs into structured attributes. This is where most of the classifiable signal lives. Claro extracts from source documents automatically, linking each extracted value back to its origin so you can audit or override it.

  3. Map fields to your schema

    Align the inherited columns to your target attribute model before reaching for taxonomy codes. Good schema mapping prevents classifying against the wrong dimension.

  4. Assign taxonomy codes

    Only now apply ETIM, UNSPSC, eCl@ss, or Google Product Category, working from clean attributes rather than raw strings.
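Steps 1 and 3 above can be sketched as follows. This is a deliberately simple illustration: the inherited column names (`ITEM_NO`, `DESC1`, `MFR_PART`), the `FIELD_MAP`, and both helpers are hypothetical, and matching on a normalized identifier is a stand-in for real entity resolution, which usually needs fuzzier matching.

```python
import re
from collections import defaultdict

# Hypothetical mapping from inherited column names to a target schema.
FIELD_MAP = {
    "ITEM_NO": "sku",
    "DESC1": "name",
    "MFR_PART": "mpn",
    "legacy_skus": "legacy_skus",
}

def dedupe_key(row):
    """Normalize the manufacturer part number (else the name) into a match key."""
    raw = row.get("MFR_PART") or row.get("DESC1") or ""
    return re.sub(r"[^a-z0-9]", "", raw.lower())

def dedupe(rows):
    """Collapse rows sharing a key; keep every legacy SKU on the survivor."""
    groups = defaultdict(list)
    unresolved = []
    for row in rows:
        key = dedupe_key(row)
        if key:
            groups[key].append(row)
        else:
            unresolved.append(row)  # nothing to match on, so never merge blindly
    merged = []
    for members in groups.values():
        canonical = dict(members[0])
        canonical["legacy_skus"] = [m["ITEM_NO"] for m in members]
        merged.append(canonical)
    return merged, unresolved

def map_fields(row):
    """Rename inherited columns to the target schema; unmapped columns are dropped."""
    return {FIELD_MAP[k]: v for k, v in row.items() if k in FIELD_MAP}

rows = [
    {"ITEM_NO": "A-100", "DESC1": "MCB 16A C-curve", "MFR_PART": "mcb-16-c"},
    {"ITEM_NO": "OLD-77", "DESC1": "Breaker 16 amp", "MFR_PART": "MCB 16 C"},
    {"ITEM_NO": "B-200", "DESC1": "Water 12x500ml", "MFR_PART": ""},
]

merged, unresolved = dedupe(rows)
for record in merged:
    print(map_fields(record))
```

Keeping the legacy SKUs on the canonical record means nothing is lost when duplicates collapse, which matters later when someone asks where a record came from.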

Before and after: messy vs trusted

Before (inherited, uncleaned) | After (Claro-processed, trusted)
Same product in 3–5 rows with different names and no shared ID | One resolved entity per product with a canonical record
Rated current, pole count, and curve type buried in a name field | Structured attributes extracted and linked to source document
Category column reflects previous owner's merchandising logic | ETIM or UNSPSC code assigned from clean attributes with confidence score
No audit trail; classifiers cannot explain a code | Every code carries a provenance link and the confidence that drove it
Re-classification on standard update requires full manual rework | Provenance lets you re-run assignments with one job when inputs change

Pick a target standard and classify in tiers

Don’t try to code everything to the deepest level on day one. Choose the standard your channels and customers actually consume, then classify in confidence tiers so the catalog becomes usable fast.

Tier | What it covers | How to handle it
High confidence | Clear identifier or attribute match to one class | Auto-assign, spot-check a sample
Medium confidence | Plausible class but ambiguous attributes | Route to reviewer with suggested code
Low confidence | Sparse or contradictory data | Quarantine, enrich first, classify later

Tiering keeps a CPG launch or a marketplace feed from being held hostage by the 8% of records that genuinely need a human. It also gives you a defensible audit trail when a retailer or auditor asks why a product carries the code it does.
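The tier routing can be expressed as a small function. The thresholds below are illustrative assumptions, not recommended values, and the codes are placeholders rather than real taxonomy codes; in practice you would tune the cut-offs against a reviewed sample of your own catalog.

```python
# Thresholds are illustrative assumptions; tune them against a reviewed sample.
HIGH_THRESHOLD = 0.90
MEDIUM_THRESHOLD = 0.60

def route(confidence):
    """Map a confidence score in [0, 1] to a handling tier."""
    if confidence >= HIGH_THRESHOLD:
        return "auto_assign"   # spot-check a sample afterwards
    if confidence >= MEDIUM_THRESHOLD:
        return "review"        # reviewer sees the suggested code
    return "quarantine"        # enrich first, classify later

# Hypothetical suggestions: (sku, placeholder code, confidence).
suggestions = [
    ("A-100", "CODE-A", 0.97),
    ("A-101", "CODE-A", 0.72),
    ("A-102", "CODE-B", 0.31),
]

queues = {"auto_assign": [], "review": [], "quarantine": []}
for sku, code, confidence in suggestions:
    queues[route(confidence)].append((sku, code))

print(queues)
```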

Make every code traceable, not just present

A classification you cannot explain is a liability. For each assigned code, keep the evidence: which attribute or source document drove the decision, what confidence the system had, and who approved it. Provenance is what lets you re-run classification when a standard releases a new version, and what stops a black-box model from silently miscoding a regulated chemical or an electrical component.
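One way to picture a traceable code is as a record that carries its own evidence. The sketch below is a hypothetical data model, not Claro's schema: the `CodeAssignment` fields, the placeholder code, and the version labels are all assumptions chosen to mirror the evidence listed above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class CodeAssignment:
    sku: str
    standard: str                 # e.g. "ETIM" or "UNSPSC"
    standard_version: str         # version the code was assigned against
    code: str
    confidence: float
    evidence: Tuple[str, ...]     # attributes or source documents that drove it
    approved_by: Optional[str] = None

def needs_rerun(assignment, current_version):
    """Flag assignments made against an older version of the standard."""
    return assignment.standard_version != current_version

assignment = CodeAssignment(
    sku="A-100",
    standard="ETIM",
    standard_version="v1",        # placeholder version label
    code="CODE-A",                # placeholder, not a real ETIM class
    confidence=0.97,
    evidence=("rated_current_a", "tripping_characteristic", "datasheet.pdf"),
    approved_by="reviewer@example.com",
)

print(needs_rerun(assignment, "v1"))
print(needs_rerun(assignment, "v2"))
```

Because the version and evidence travel with the code, selecting what to re-classify after a standard update becomes a filter rather than a manual audit.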

Claro treats classification as a governed layer, not a batch job. It writes clean, confidence-scored codes back into your existing PIM or ERP so downstream channels — marketplace feeds, procurement systems, product search — see a catalog that was built to last, not just cleaned once.

FAQ

How do I classify products with no usable descriptions?

Enrich before you classify. Extract attributes from any available source — manufacturer part numbers, datasheets, supplier PDFs, barcodes — and quarantine records that stay sparse. Assigning a code to a record you cannot describe just creates a confident error you will trust later.

Should I keep the inherited categories or replace them?

Keep them as a reference signal, but classify against a clean attribute model and a standard you control. Validate your new codes against the old categories to catch disagreements, then retire the inherited scheme once coverage is solid.
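One simple way to use the old categories as a reference signal is to find the new code most records in each legacy category received, then flag the outliers for review. This is a sketch of that idea under stated assumptions: the record fields, the `find_disagreements` helper, and the codes are hypothetical, and a majority vote is only one of several reasonable checks.

```python
from collections import Counter, defaultdict

def find_disagreements(records):
    """Flag SKUs whose new code differs from the majority code
    assigned within the same legacy category."""
    by_legacy = defaultdict(Counter)
    for r in records:
        if r["legacy_category"]:
            by_legacy[r["legacy_category"]][r["new_code"]] += 1

    flagged = []
    for r in records:
        counts = by_legacy.get(r["legacy_category"])
        if not counts:
            continue  # blank legacy category gives no signal
        majority_code, _ = counts.most_common(1)[0]
        if r["new_code"] != majority_code:
            flagged.append(r["sku"])
    return flagged

# Hypothetical records with placeholder codes.
records = [
    {"sku": "A-100", "legacy_category": "ELEC", "new_code": "CODE-A"},
    {"sku": "A-101", "legacy_category": "ELEC", "new_code": "CODE-A"},
    {"sku": "A-102", "legacy_category": "ELEC", "new_code": "CODE-B"},
    {"sku": "A-103", "legacy_category": "", "new_code": "CODE-C"},
]

print(find_disagreements(records))
```

A flagged record is not necessarily wrong. The legacy scheme may have lumped unrelated products together, so treat the output as a review queue rather than an error list.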

Can I automate classification for an inherited catalog?

Yes, in tiers. Auto-assign high-confidence matches, route ambiguous ones to reviewers with a suggested code, and hold low-confidence records for enrichment. Full automation without confidence tiers and provenance tends to scale errors as fast as it scales coverage.

Which classification standard should I target?

Pick the one your customers and channels consume. Distributors in Europe often need ETIM or eCl@ss; procurement-driven buyers expect UNSPSC; retail and marketplace feeds need Google Product Category. The guide on choosing a standard walks through the trade-offs for each.

How do I keep classifications accurate over time?

Treat classification as ongoing, not one-time. Standards release new versions, suppliers change data, and new SKUs arrive mislabeled. Monitor for classification drift and keep provenance on every code so you can re-run assignments confidently when inputs change.

Claro

Stop maintaining this by hand

Claro keeps product and supplier data trusted as catalogs change — matching, deduplication, enrichment, and validated write-back into the systems you already run.
