Notes on entity matching.
Guides on how blocking and matching work, and on resolving data at scale.
How many labels does a learned matcher need?
A learned matcher usually needs about 600 labeled pairs, not tens of thousands, because active learning spends your labeling budget on the ambiguous pairs that teach the model the most.
Entity matching at scale
Why resolving hundreds of millions of records is a different problem from resolving a few thousand, and how Spark handles it.
Why run entity matching in your own infrastructure
Running entity matching in your own Spark or single-machine environment keeps sensitive records inside your perimeter, with no data egress and a shorter compliance review.
Active learning for entity matching
Active learning trains a matcher from about 600 labels instead of tens of thousands by asking you to label only the pairs the model is unsure about.
Precision and recall in entity matching: choosing your operating point
Precision is the share of your predicted matches that are real, and recall is the share of real matches you found. Blocking is tuned for recall, and the matcher sets the balance your costs call for.
Blocking vs. matching
Entity matching has two steps: blocking generates candidate pairs, and matching classifies them. The two steps optimize for different things: blocking for recall, matching for the balance of precision and recall.
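A minimal sketch of the two steps, using only the Python standard library. The toy records, the four-letter blocking key, and the 0.7 similarity threshold are all assumptions made for illustration. A learned matcher would replace the string-similarity rule shown here.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records: (id, name).
RECORDS = [
    (1, "Acme Corporation"),
    (2, "ACME Corp."),
    (3, "Acme Holdings"),
    (4, "Globex Inc"),
    (5, "Globex Incorporated"),
    (6, "Initech"),
]


def block_key(name):
    # Assumed blocking rule: the first four characters, lowercased.
    # A loose key like this favors recall over precision.
    return name.lower()[:4]


def candidate_pairs(records):
    # Blocking: group records by key, then pair up records only
    # within the same block instead of across the whole dataset.
    blocks = defaultdict(list)
    for rec_id, name in records:
        blocks[block_key(name)].append((rec_id, name))
    for members in blocks.values():
        yield from combinations(members, 2)


def is_match(a, b, threshold=0.7):
    # Matching: classify one candidate pair. The threshold is the
    # knob that trades precision against recall.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


candidates = list(candidate_pairs(RECORDS))
matches = [(a[0], b[0]) for a, b in candidates if is_match(a[1], b[1])]

all_pairs = len(RECORDS) * (len(RECORDS) - 1) // 2
print(f"{len(candidates)} candidate pairs out of {all_pairs} possible")
print("matched id pairs:", matches)
```

Blocking cuts the work from every possible pair down to the pairs that share a key, and the matcher only ever sees those candidates. A pair that blocking misses can never be recovered by the matcher, which is why blocking is tuned for recall.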