What is entity matching?

Entity matching is the task of identifying records that refer to the same real-world entity, across or within datasets, even when those records aren’t identical. It’s also called entity resolution or record linkage.

Why it’s hard

The same entity shows up with typos, inconsistent formatting, and missing fields, so exact joins miss matches and leave duplicates behind. Comparing every pair of records grows quadratically: one million records yield about 500 billion pairs, so brute force breaks down long before production scale. Accurate matching needs an efficient way to narrow which pairs to compare and a model good enough to decide the hard cases.
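To make the quadratic growth concrete, here is a small sketch that counts the unordered pairs among n records using n × (n − 1) / 2. The function name and the sample sizes are illustrative choices, not part of any MadMatcher tool.

```python
def all_pairs_count(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Illustrative dataset sizes (assumed for the example).
for n in (1_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} records -> {all_pairs_count(n):,} pairs")
```

Growing the dataset by a factor of 1,000 grows the number of pairs by roughly a factor of 1,000,000, which is why narrowing the candidate set matters more than speeding up each comparison.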

The two steps

Blocking generates a small set of candidate pairs (the records that plausibly match) so you never compare all pairs. Matching then classifies each candidate pair as a match or a non-match. Blocking caps recall, because a true match that blocking drops never reaches the matcher; the matcher then largely determines precision. When the matcher is learned, a labeling step supplies the training examples it needs. How the steps fit together →
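The two steps can be sketched in a few lines of standard-library Python. This is a toy illustration under stated assumptions, not MadMatcher's implementation: the records are made up, the blocking key (normalized city) and the 0.85 name-similarity threshold are arbitrary choices, and a fixed string-similarity rule stands in for a learned matcher.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records: (id, name, city).
records = [
    (1, "Jon Smith", "Madison"),
    (2, "John Smith", "Madison"),
    (3, "John Smith", "Madsion"),  # city typo: ends up alone in its own block
    (4, "Maria Garcia", "Austin"),
    (5, "Maria Garica", "Austin"),
    (6, "Mark Gordon", "Austin"),
]


def block(recs):
    """Blocking: only pair up records that share a normalized city."""
    blocks = defaultdict(list)
    for rec in recs:
        blocks[rec[2].strip().lower()].append(rec)
    for group in blocks.values():
        yield from combinations(group, 2)


def is_match(a, b, threshold=0.85):
    """Matching: a simple rule standing in for a learned matcher."""
    score = SequenceMatcher(None, a[1].lower(), b[1].lower()).ratio()
    return score >= threshold


candidates = list(block(records))
matches = [(a[0], b[0]) for a, b in candidates if is_match(a, b)]

print(f"{len(candidates)} candidate pairs instead of 15")
print("matches:", matches)
```

Record 3 shows how blocking caps recall: the typo in its city puts it in a block of its own, so it is never compared with records 1 and 2, and no matcher, however accurate, can recover that pair. More forgiving blocking keys trade a larger candidate set for fewer missed matches.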

Where it’s used

Customer 360

Unify customer records across CRM, billing, and support.

Product catalogs

Deduplicate and reconcile listings from many suppliers.

Healthcare records

Link patient records across systems with mismatched IDs.

Compliance screening

Match entities against watchlists despite name variation.

At scale

Matching a few thousand rows is easy. Matching hundreds of millions needs distributed blocking and a matcher that runs in parallel, without hand-labeling a huge training set. That’s what MadMatcher’s tools are built for. Why MadMatcher’s approach →

Common questions

What is entity matching?

Entity matching identifies which records refer to the same real-world entity, across or within datasets, even when the records differ in formatting, spelling, or completeness. It is also called entity resolution or record linkage.

What is the difference between entity matching, entity resolution, and record linkage?

They mean the same thing. "Record linkage" is common in statistics and healthcare, "entity resolution" in computer science, and "entity matching" in data engineering.

Why is entity matching hard?

The same entity appears with different spellings, formats, and missing fields, so exact comparison misses matches. Comparing every pair of records also grows quadratically with dataset size, so practical matching needs efficient blocking plus a model that tolerates noise.

Have a matching problem?

Book a call to scope it with the team, or explore the code on GitHub.