Block, then match.
Entity matching is two steps. Blocking narrows two tables to candidate pairs; matching decides which pairs are real. MatchFlow adds a labeling step when you don’t have training data.
Block
Sparkly · Delex
Comparing every pair of records is quadratic and infeasible past modest sizes. Blocking narrows two tables to a small set of candidate pairs. Sparkly does this with top-k TF/IDF similarity; Delex combines multiple blocking strategies. What blocking drops, matching can never recover, so blocking is tuned for recall.
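To make the idea of top-k TF/IDF blocking concrete, here is a minimal sketch in plain Python. It is not Sparkly's API or implementation: the function name `top_k_block`, the whitespace tokenizer, the smoothed IDF formula, and the sample records are all assumptions chosen for illustration.

```python
import heapq
import math
from collections import Counter


def top_k_block(table_a, table_b, k=2):
    """Return (a_index, b_index, score) for each table_a record's k best matches."""
    tokens_b = [rec.lower().split() for rec in table_b]
    # Document frequency: in how many table_b records each token appears.
    df = Counter(tok for toks in tokens_b for tok in set(toks))
    n = len(table_b)

    def vectorize(tokens):
        # Smoothed IDF so that rare tokens weigh more than common ones.
        tf = Counter(tokens)
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    vecs_b = [vectorize(toks) for toks in tokens_b]
    pairs = []
    for i, rec in enumerate(table_a):
        query = vectorize(rec.lower().split())
        # Cosine similarity against every table_b record (brute force for clarity).
        scored = [
            (sum(w * vb.get(t, 0.0) for t, w in query.items()), j)
            for j, vb in enumerate(vecs_b)
        ]
        # Keep only the k highest-scoring records that share at least one token.
        for score, j in heapq.nlargest(k, scored):
            if score > 0:
                pairs.append((i, j, score))
    return pairs


# Hypothetical sample data.
table_a = ["apple iphone 13 128gb", "samsung galaxy s21"]
table_b = ["iphone 13 apple 128 gb black", "galaxy s21 by samsung", "sony bravia tv 55 inch"]
candidates = top_k_block(table_a, table_b, k=1)
print([(i, j) for i, j, _ in candidates])
```

The sketch still scores every pair, so it only illustrates the ranking. A production blocker would typically use an inverted index so that records sharing no tokens are never compared. Raising `k` trades a larger candidate set for higher recall.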
Label
MatchFlow
A learned matcher needs labeled examples. If you have none, MatchFlow’s active learning selects a small, informative sample (typically about 600 pairs) from the candidate set for you to label using a CLI or web labeler. You can also bring your own labels.
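One common way to pick an informative sample is uncertainty sampling: repeatedly ask for labels on the pairs the current model is least sure about. The sketch below assumes that strategy for illustration only; the source does not say which selection rule MatchFlow uses. The names `active_learning_loop`, `score`, and `oracle` are hypothetical, the "model" is just a threshold on a single similarity score, and `oracle` stands in for the human labeler.

```python
def active_learning_loop(pairs, score, oracle, rounds=3, batch_size=2):
    """Label the most uncertain pairs each round; return labels and the threshold."""
    labeled = {}
    threshold = 0.5  # initial decision boundary on the similarity score
    for _ in range(rounds):
        unlabeled = [p for p in pairs if p not in labeled]
        if not unlabeled:
            break
        # Uncertainty sampling: the pairs closest to the boundary come first.
        batch = sorted(unlabeled, key=lambda p: abs(score(p) - threshold))[:batch_size]
        for pair in batch:
            labeled[pair] = oracle(pair)  # a person would answer here
        # Refit the toy model: midpoint between the weakest match and strongest non-match.
        pos = [score(p) for p, is_match in labeled.items() if is_match]
        neg = [score(p) for p, is_match in labeled.items() if not is_match]
        if pos and neg:
            threshold = (min(pos) + max(neg)) / 2
    return labeled, threshold


# Hypothetical candidate pairs with precomputed similarity scores and true labels.
scores = {"p1": 0.95, "p2": 0.10, "p3": 0.55, "p4": 0.45, "p5": 0.70, "p6": 0.30}
truth = {"p1": True, "p2": False, "p3": True, "p4": False, "p5": True, "p6": False}

labels, boundary = active_learning_loop(
    list(scores), scores.get, truth.get, rounds=2, batch_size=2
)
print(sorted(labels), boundary)
```

The obvious pairs (`p1`, `p2`) are never sent for labeling in this run, which is the point: labeling effort goes to the borderline cases that move the decision boundary.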
Match
MatchFlow
MatchFlow turns labeled pairs into feature vectors, trains a classifier (from scikit-learn or PySpark MLlib), and applies it to the full blocking output to predict match or no-match. It runs on pandas for small data or Spark for large.
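The featurize, train, and predict flow described above can be sketched end to end in plain Python. This is not MatchFlow's API: the two features, the helper names (`featurize`, `train_logreg`, `predict`), and the sample pairs are assumptions, and a small hand-written logistic regression stands in for a scikit-learn or PySpark MLlib classifier so that the example has no dependencies.

```python
import math


def featurize(a, b):
    """Turn a pair of strings into a feature vector: [token Jaccard, same first letter]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    jaccard = len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    same_first = 1.0 if a[:1].lower() == b[:1].lower() else 0.0
    return [jaccard, same_first]


def train_logreg(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression with batch gradient descent; return (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            err = p - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(model, x):
    """Return True (match) when the predicted probability exceeds 0.5."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b > 0.0


# Hypothetical labeled pairs: (record_a, record_b, is_match).
labeled = [
    ("acme corp", "acme corp inc", 1),
    ("acme corp", "apex llc", 0),
    ("blue sky ltd", "blue sky limited", 1),
    ("blue sky ltd", "red rock ltd", 0),
    ("delta air", "delta air lines", 1),
    ("delta air", "omega sea", 0),
]
model = train_logreg([featurize(a, b) for a, b, _ in labeled], [y for _, _, y in labeled])

# Apply the trained model to unlabeled candidate pairs from blocking.
candidate_pairs = [("globex inc", "globex inc usa"), ("globex inc", "initech co")]
predictions = [predict(model, featurize(a, b)) for a, b in candidate_pairs]
print(predictions)
```

Real feature sets are usually much richer (per-attribute similarities, numeric differences, missing-value indicators), but the shape of the pipeline stays the same: one vector per candidate pair, one prediction per vector.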
Have a matching problem?
Book a call to scope it with the team, or explore the code on GitHub.