Scalable & Accurate Entity Matching

Blocking that beats the state of the art. A matcher you train on your own data.

Resolve the records that point to the same real-world entity across large, messy datasets, inside your own infrastructure.

Book a call | View on GitHub

Matches across two tables

Robert A. Smith Acme Inc.

Bob Smith ACME, Inc.

Jennifer Nguyen Globex Corp.

Jenn Nguyen Globex Corporation

Michael O’Brien Initech

Mike O’Brien Initech LLC

Acme Robotics Austin, TX

Acme Logistics Austin, TX

Applied to demanding, real-world data

Academic benchmarks

Product, music, and citation matching datasets.

Environmental ontologies

Concept and entity matching across environmental schemas.

Sanctions & watchlists

OFAC SDN and UN Security Council consolidated lists.

Government contractor data

Resolving contractor entities across aliases and changing firm affiliations.

Rules and pretrained models miss your matches.

Most record-matching approaches rely on hand-authored rules or a fixed pretrained model. Rules grow into a long list of exceptions that you keep extending, and a pretrained model has never seen your data. On messy records, both miss real matches and let false ones through. MadMatcher learns a matcher from your own labeled data, so it fits the way your records actually vary.

Why MadMatcher trains to your data →

Accurate matching you control.

Trained to your domain

MatchFlow uses active learning to train a matcher on your labeled data. About 600 labels fit it to your domain, so you are not relying on a fixed pretrained model.

Runs in your infrastructure

Executes in your own Spark or single-machine environment. Your data never leaves your perimeter.

Scales to your hardest tables

Blocking runs in a shared-nothing fashion on Apache Spark and handles hundreds of millions of records per table.

Benchmarked blocking

Sparkly’s method is peer-reviewed and outperforms eight state-of-the-art blockers.

Fits your stack

Use one tool or the whole pipeline. It runs inside the stack you already have instead of making you adopt a new platform.

Two clear, tunable parts

Blocking and matching are separate stages you can tune and measure independently.

The open-source toolkit.

Two steps, three tools. Blocking narrows the full pairwise space to a small set of candidate pairs. Matching then labels each candidate pair as a match or a non-match.

1. Blocking

Sparkly: TF/IDF blocking

Delex: Multi-strategy blocking

Narrow two tables to a small set of candidate pairs.
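To make the blocking step concrete, here is a minimal single-machine sketch of top-k TF/IDF blocking in plain Python. It is not Sparkly's API or implementation: the `tokenize` and `block` functions, the record ids, and the parameter `k` are all hypothetical names chosen for illustration, and the weighting scheme is a simple assumed variant of TF/IDF.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def block(table_a, table_b, k=2):
    """For each record in table_b, keep the k best records of table_a
    by TF/IDF cosine similarity. Returns (a_id, b_id, score) tuples."""
    tokens_a = {rid: tokenize(text) for rid, text in table_a.items()}
    n = len(tokens_a)
    # Document frequency and IDF are computed on the indexed table only.
    df = Counter(tok for toks in tokens_a.values() for tok in set(toks))
    idf = {tok: math.log(1 + n / d) for tok, d in df.items()}

    def vector(toks):
        # TF/IDF weights, L2-normalized; tokens unknown to the index are dropped.
        tf = Counter(toks)
        v = {t: c * idf[t] for t, c in tf.items() if t in idf}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        return {t: w / norm for t, w in v.items()}

    # Inverted index: token -> [(a_id, weight)], so only records that
    # share at least one token are ever scored.
    index = defaultdict(list)
    for rid, toks in tokens_a.items():
        for t, w in vector(toks).items():
            index[t].append((rid, w))

    candidates = []
    for bid, text in table_b.items():
        scores = defaultdict(float)
        for t, w in vector(tokenize(text)).items():
            for aid, wa in index.get(t, []):
                scores[aid] += w * wa
        top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        candidates.extend((aid, bid, score) for aid, score in top)
    return candidates

# Hypothetical ids; record text taken from the example pairs on this page.
table_a = {
    "a1": "Robert A. Smith Acme Inc.",
    "a2": "Jennifer Nguyen Globex Corp.",
    "a3": "Michael O'Brien Initech",
}
table_b = {
    "b1": "Bob Smith ACME, Inc.",
    "b2": "Jenn Nguyen Globex Corporation",
    "b3": "Mike O'Brien Initech LLC",
}

candidates = block(table_a, table_b, k=2)
for aid, bid, score in candidates:
    print(aid, bid, round(score, 3))
```

On this toy input the nine possible pairs shrink to three candidates, because a pair is only scored when the two records share a token. The inverted index is what keeps the sketch from comparing every record against every other record.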

2. Matching

MatchFlow: Matching

Classify each candidate pair as a match or not.
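As an illustration of learning a matcher with active learning, the sketch below trains a tiny logistic-regression classifier on candidate pairs and repeatedly asks for a label on the pair it is least sure about. It is not MatchFlow's API or its actual model: the two similarity features, the `active_learn` function, the seed labels, the label budget, and the non-matching pairs are all assumptions made for this example.

```python
import math
import re

def tokens(text):
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}

def features(a, b):
    # Two hypothetical similarity features for a candidate pair.
    sa, sb = tokens(a), tokens(b)
    jaccard = len(sa & sb) / len(sa | sb) if sa | sb else 0.0
    # Fraction of tokens in a that share a prefix relation with a token
    # in b (e.g. "jenn" / "jennifer", "corp" / "corporation").
    hits = sum(1 for x in sa if any(x.startswith(y) or y.startswith(x) for y in sb))
    prefix = hits / len(sa) if sa else 0.0
    return [jaccard, prefix]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train(X, y, lr=0.5, epochs=300):
    # Plain batch gradient descent on the logistic loss.
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict((w, b), xi) - yi
            gw = [g + err * xj for g, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def active_learn(pairs, oracle, seed, budget):
    """Uncertainty sampling: label the pair whose score is closest to 0.5."""
    X = [features(a, b) for a, b in pairs]
    labeled = {i: oracle(i) for i in seed}
    for _ in range(budget):
        model = train([X[i] for i in labeled], [labeled[i] for i in labeled])
        unlabeled = [i for i in range(len(pairs)) if i not in labeled]
        if not unlabeled:
            break
        pick = min(unlabeled, key=lambda i: abs(predict(model, X[i]) - 0.5))
        labeled[pick] = oracle(pick)  # in practice, a person supplies this label
    model = train([X[i] for i in labeled], [labeled[i] for i in labeled])
    return model, labeled

# Candidate pairs: the first three come from this page's examples; the last
# three are cross pairings assumed here to be non-matches.
pairs = [
    ("Robert A. Smith Acme Inc.", "Bob Smith ACME, Inc."),
    ("Jennifer Nguyen Globex Corp.", "Jenn Nguyen Globex Corporation"),
    ("Michael O'Brien Initech", "Mike O'Brien Initech LLC"),
    ("Robert A. Smith Acme Inc.", "Jenn Nguyen Globex Corporation"),
    ("Jennifer Nguyen Globex Corp.", "Mike O'Brien Initech LLC"),
    ("Michael O'Brien Initech", "Bob Smith ACME, Inc."),
]
truth = [1, 1, 1, 0, 0, 0]  # stands in for a human labeler

model, labeled = active_learn(pairs, oracle=lambda i: truth[i], seed=[0, 3], budget=2)
for a, b in pairs:
    print(round(predict(model, features(a, b)), 3), "|", a, "<->", b)
```

The point of the loop is that labeling effort goes to the pairs the current model finds hardest, rather than to a random sample. A real run would use a far larger pool and label budget than this toy example.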

Need it in production? MadMatcher-Pro and consulting →

The blocking method behind MadMatcher was introduced and benchmarked in a peer-reviewed paper at VLDB 2023.

Read the paper ↗

Have a matching problem?

Book a call to scope it with the team, or explore the code on GitHub.
