Notes on entity matching.

Guides on how blocking and matching work, and on resolving data at scale.

How many labels does a learned matcher need?

A learned matcher usually needs about 600 labeled pairs, not tens of thousands, because active learning spends your labeling budget on the ambiguous pairs that teach the model the most.

Dev Ahluwalia Jun 4, 2026

Entity matching at scale

Why resolving hundreds of millions of records is a different problem from resolving a few thousand, and how Spark handles it.

Dev Ahluwalia Jun 2, 2026

Why run entity matching in your own infrastructure

Running entity matching in your own Spark or single-machine environment keeps sensitive records inside your perimeter, with no data egress and a shorter compliance review.

Dev Ahluwalia May 28, 2026

Active learning for entity matching

Active learning trains a matcher from about 600 labels instead of tens of thousands, by labeling only the pairs the model is unsure about.

Dev Ahluwalia May 27, 2026

Precision and recall in entity matching: choosing your operating point

Precision is the share of your predicted matches that are real, and recall is the share of real matches you found. Blocking is tuned for recall, and the matcher then sets the balance between precision and recall that your error costs call for.

Dev Ahluwalia May 21, 2026

Blocking vs. matching

Entity matching has two steps: blocking generates candidate pairs, and matching classifies them. The two steps optimize for different things.

Dev Ahluwalia May 20, 2026