Block, then match.

Entity matching is two steps. Blocking narrows two tables to candidate pairs; matching decides which of those pairs refer to the same entity. MatchFlow adds a labeling step when you don’t have training data.

1

Block

Sparkly · Delex

Comparing every pair of records is quadratic: two tables of 100,000 records each produce 10 billion pairs. Blocking cuts this to a small set of candidate pairs. Sparkly does this with top-k TF-IDF similarity; Delex combines multiple blocking strategies. Matching can never recover a pair that blocking drops, so blocking is tuned for recall.
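To make the idea concrete, here is a minimal sketch of top-k TF-IDF blocking in plain Python. It is not Sparkly's API or implementation. The function names, the whitespace tokenizer, the smoothed IDF formula, and the sample records are all assumptions made for illustration. The sketch indexes one table, then scores each record of the other table only against records that share a token.

```python
import heapq
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency over the indexed table."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf_vector(text, idf):
    """L2-normalized sparse TF-IDF vector; tokens unseen in the index are dropped."""
    tf = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}


def top_k_block(table_a, table_b, k=2):
    """For each record in table_a, keep its k highest-scoring records in table_b.

    Returns (index_in_a, index_in_b, cosine_score) triples. Records that share
    no token with a query are never scored, so the work is not all-pairs.
    """
    idf = build_idf(table_b)
    # Inverted index: token -> [(record id in table_b, normalized weight)].
    index = defaultdict(list)
    for j, text in enumerate(table_b):
        for tok, weight in tfidf_vector(text, idf).items():
            index[tok].append((j, weight))

    candidates = []
    for i, text in enumerate(table_a):
        scores = defaultdict(float)
        for tok, weight in tfidf_vector(text, idf).items():
            for j, weight_b in index[tok]:
                scores[j] += weight * weight_b
        best = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
        candidates.extend((i, j, score) for j, score in best)
    return candidates


# Made-up sample records, for illustration only.
table_a = ["Apple iPhone 13 128GB", "Sony WH-1000XM4 headphones"]
table_b = [
    "iphone 13 apple 128 gb black",
    "samsung galaxy s21",
    "sony wh-1000xm4 wireless headphones",
    "apple watch series 7",
]

for i, j, score in top_k_block(table_a, table_b, k=2):
    print(f"{table_a[i]!r} ~ {table_b[j]!r} (score {score:.2f})")
```

Raising `k` keeps more candidates per record, which trades a larger candidate set for higher recall. That is the direction blocking is usually tuned in, since a dropped pair is lost for good.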

2

Label

MatchFlow

A learned matcher needs labeled examples. If you have none, MatchFlow’s active learning selects a small, informative sample of candidate pairs (typically about 600) for you to label in a CLI or web labeler. You can also bring your own labels.
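One common way to pick an informative sample is uncertainty sampling: ask for labels on the pairs the current model is least sure about. The text above does not say which strategy MatchFlow uses, so the sketch below is an assumption about the general technique, not a description of MatchFlow. The function name, the pair identifiers, and the probabilities are hypothetical.

```python
def select_for_labeling(scored_pairs, budget):
    """Pick the `budget` pairs whose predicted match probability is closest to 0.5.

    `scored_pairs` is a list of (pair_id, probability) tuples, where the
    probability comes from whatever model has been trained so far.
    """
    ranked = sorted(scored_pairs, key=lambda item: abs(item[1] - 0.5))
    return [pair_id for pair_id, _ in ranked[:budget]]


# Hypothetical candidate pairs with made-up model probabilities.
scored = [
    (("a1", "b7"), 0.97),  # confident match: little to learn from a label
    (("a2", "b3"), 0.52),  # near the decision boundary: informative
    (("a3", "b9"), 0.08),  # confident non-match
    (("a4", "b1"), 0.41),
    (("a5", "b2"), 0.66),
]

to_label = select_for_labeling(scored, budget=2)
```

In a full loop, the newly labeled pairs would be added to the training set, the model retrained, and the remaining candidates rescored before the next batch is selected.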

3

Match

MatchFlow

MatchFlow turns labeled pairs into feature vectors, trains a classifier (from scikit-learn or PySpark MLlib), and applies it to the full blocking output to predict match or no-match. It runs on pandas for small datasets or Spark for large ones.
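The featurize, train, predict flow can be sketched in plain Python. This is not MatchFlow's API. The features (token Jaccard on a name field, exact match on a city field), the hand-rolled logistic regression standing in for a scikit-learn or MLlib classifier, and the sample records are all assumptions chosen to keep the example self-contained.

```python
import math


def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def featurize(rec_a, rec_b):
    """Turn one candidate pair into a feature vector (hypothetical features)."""
    return [
        jaccard(rec_a["name"], rec_b["name"]),
        1.0 if rec_a["city"].lower() == rec_b["city"].lower() else 0.0,
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(model, x):
    """Predicted match probability for one feature vector."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Logistic regression fitted by plain stochastic gradient descent."""
    model = ([0.0] * len(X[0]), 0.0)
    for _ in range(epochs):
        for x, label in zip(X, y):
            weights, bias = model
            err = score(model, x) - label
            model = ([w - lr * err * v for w, v in zip(weights, x)], bias - lr * err)
    return model


def predict(model, x):
    """True for match, False for no-match."""
    return score(model, x) >= 0.5


# Made-up labeled pairs: (record from table A, record from table B, label).
labeled = [
    ({"name": "Blue Bottle Coffee", "city": "Oakland"},
     {"name": "blue bottle coffee co", "city": "oakland"}, 1),
    ({"name": "Blue Bottle Coffee", "city": "Oakland"},
     {"name": "Red Door Cafe", "city": "Oakland"}, 0),
    ({"name": "Joe's Pizza", "city": "New York"},
     {"name": "joe's pizza", "city": "New York"}, 1),
    ({"name": "Joe's Pizza", "city": "New York"},
     {"name": "Pizza Hut", "city": "Chicago"}, 0),
]

X = [featurize(a, b) for a, b, _ in labeled]
y = [label for _, _, label in labeled]
model = train_logistic(X, y)

# Apply the trained model to an unlabeled candidate pair from blocking.
candidate = ({"name": "Blue Bottle Coffee", "city": "Oakland"},
             {"name": "Blue Bottle Coffee Roasters", "city": "Oakland"})
is_match = predict(model, featurize(*candidate))
```

The same three stages apply at scale: only the data frame library and the classifier implementation change, not the shape of the pipeline.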

Have a matching problem?

Book a call to scope it with the team, or explore the code on GitHub.