Scalable & Accurate Entity Matching
Blocking that beats the state of the art. A matcher you train on your own data.
Resolve the records that point to the same real-world entity across large, messy datasets, inside your own infrastructure.
Applied to demanding, real-world data
Academic benchmarks
Product, music, and citation matching datasets.
Environmental ontologies
Concept and entity matching across environmental schemas.
Sanctions & watchlists
OFAC SDN and UN Security Council consolidated lists.
Government contractor data
Resolving contractor entities across aliases and changing firm affiliations.
Rules and pretrained models miss your matches.
Most record-matching approaches rely on hand-authored rules or a fixed pretrained model. Rules grow into a long list of exceptions you keep extending, and a pretrained model has never seen your data. On messy records, both miss real matches and let false ones through. MadMatcher learns a matcher from your own labeled data, so it fits the way your records actually vary.
Accurate matching you control.
Trained to your domain
MatchFlow learns a matcher from your labeled data with active learning. About 600 labels are enough to fit it to your domain, so you are not relying on a fixed pretrained model.
Runs in your infrastructure
Executes in your own Spark or single-machine environment. Your data never leaves your perimeter.
Scales to your hardest tables
Blocking runs shared-nothing on Apache Spark and handles hundreds of millions of records per table.
Benchmarked blocking
Sparkly’s method is peer-reviewed and outperforms eight state-of-the-art blockers.
Fits your stack
Use one tool or the whole pipeline. It runs inside the stack you already have instead of making you adopt a new platform.
Two clear, tunable parts
Blocking and matching are separate stages you can tune and measure independently.
The open-source toolkit.
Two steps, three tools. Blocking narrows the full pairwise space to a small set of candidate pairs. Matching then labels each candidate pair as a match or a non-match.
Narrow two tables to a small set of candidate pairs.
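To make the two steps concrete, here is a minimal sketch in plain Python. It is a toy illustration only, not the API of Sparkly, MatchFlow, or any other MadMatcher tool: the function names (`block`, `match`, `jaccard`), the sample tables, and the 0.5 threshold are all hypothetical. Shared-token blocking and a fixed Jaccard threshold stand in for the real blocker and the learned matcher.

```python
from collections import defaultdict

def tokens(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(text.lower().split())

def block(table_a, table_b):
    """Blocking (toy version): keep only pairs that share at least one token."""
    index = defaultdict(set)              # token -> ids of table_b records
    for id_b, name in table_b.items():
        for tok in tokens(name):
            index[tok].add(id_b)
    candidates = set()
    for id_a, name in table_a.items():
        for tok in tokens(name):
            for id_b in index.get(tok, ()):
                candidates.add((id_a, id_b))
    return candidates

def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def match(table_a, table_b, candidates, threshold=0.5):
    """Matching (toy version): a fixed threshold stands in for a trained matcher."""
    return {(a, b) for a, b in candidates
            if jaccard(table_a[a], table_b[b]) >= threshold}

# Hypothetical sample data.
table_a = {"a1": "Acme Holdings LLC", "a2": "Globex Corporation"}
table_b = {"b1": "ACME Holdings", "b2": "Initech Inc", "b3": "Globex Corp"}

candidates = block(table_a, table_b)      # 2 candidate pairs out of 6 possible
matches = match(table_a, table_b, candidates)
print(sorted(candidates))
print(sorted(matches))
```

In this toy run, blocking keeps 2 of the 6 possible pairs, and the fixed threshold accepts the Acme pair but rejects "Globex Corporation" versus "Globex Corp". That rejected pair is the kind of miss a hand-written rule produces and a matcher trained on your own labels is meant to avoid.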
Need it in production? MadMatcher-Pro and consulting →
The blocking method behind MadMatcher was introduced and benchmarked in a peer-reviewed paper at VLDB 2023.
Read the paper ↗
Have a matching problem?
Book a call to scope it with the team, or explore the code on GitHub.