MadMatcher in production.

MadMatcher-Pro and consulting take entity matching to production, on top of an open-source core. Pro adds reliability for long-running jobs; consulting brings the team onto your data. Underneath sit three open-source tools you can also run yourself.

Book a call
GitHub org

Take it to production.

MadMatcher-Pro adds the reliability a long-running job needs; consulting brings the team onto your data. The fastest path from a prototype to a production matching pipeline.

MadMatcher-Pro

The same Sparkly, Delex, and MatchFlow, plus the features a production pipeline needs. Crash recovery lets a multi-hour run resume after a failure instead of starting over, live progress tracking shows how far the run has got, and semantic (embedding-based) blocking and matching push accuracy higher on messy, real-world data.

At launch

Crash recovery
Progress tracking
Semantic blocking
Semantic matching feature

Get in touch about Pro
API docs

Consulting

Work directly with the team to build and tune an entity matching pipeline on your own data, from blocking strategy to a trained matcher.

Book a call

The open-source core.

The three tools underneath, free and open source. Sparkly and Delex do blocking, MatchFlow does matching. Use one on its own or chain them into a full pipeline.

TF/IDF blocking

Sparkly

Finds candidate matches using top-k TF/IDF similarity (the BM25 variant, via Lucene). It indexes the smaller table and searches it with the other, scaling to hundreds of millions of tuples per table on Spark.

Top-k TF/IDF candidate generation (BM25 via Lucene)
Indexes the smaller table, searches it with the other on Spark
Scales to hundreds of millions of tuples per table

Paulsen et al., PVLDB 16(6), 2023. In the paper's evaluation, Sparkly outperforms eight state-of-the-art blockers. Read the paper ↗
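To make the idea concrete, here is a minimal, single-machine sketch of top-k BM25 candidate generation in plain Python. It is not Sparkly's API and does not use Lucene or Spark: the function names (`build_index`, `top_k`), the whitespace tokenizer, the toy tables, and the `k1`/`b` defaults are all assumptions chosen for illustration, and the scoring uses the textbook BM25 formula.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Illustrative tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_index(records):
    """Build a tiny inverted index over {record_id: text} (the indexed table)."""
    doc_len = {}
    postings = defaultdict(dict)  # term -> {record_id: term frequency}
    for rid, text in records.items():
        tokens = tokenize(text)
        doc_len[rid] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings[term][rid] = tf
    avg_len = sum(doc_len.values()) / len(doc_len)
    return postings, doc_len, avg_len


def top_k(query, index, k=2, k1=1.2, b=0.75):
    """Return the ids of the k indexed records with the highest BM25 score."""
    postings, doc_len, avg_len = index
    n = len(doc_len)
    scores = defaultdict(float)
    for term in set(tokenize(query)):
        docs = postings.get(term, {})
        if not docs:
            continue
        # Rare terms get a higher inverse document frequency.
        idf = math.log(1 + (n - len(docs) + 0.5) / (len(docs) + 0.5))
        for rid, tf in docs.items():
            # Length normalization: long records are penalized.
            norm = tf + k1 * (1 - b + b * doc_len[rid] / avg_len)
            scores[rid] += idf * tf * (k1 + 1) / norm
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [rid for rid, _ in ranked[:k]]


# Hypothetical toy tables: index the smaller one, search it with the other.
table_a = {
    "a1": "apple iphone 13 pro 128gb",
    "a2": "samsung galaxy s21 ultra",
    "a3": "apple macbook air m1",
}
table_b = {
    "b1": "iphone 13 pro apple",
    "b2": "galaxy s21 samsung phone",
}

index = build_index(table_a)
candidates = {bid: top_k(text, index, k=1) for bid, text in table_b.items()}
```

Each record in `table_b` ends up with at most `k` candidate ids from `table_a`; those candidate pairs are what a matcher would then score.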

GitHub
API docs

Multi-strategy blocking

Delex

Combines several blocking strategies (TF/IDF, dictionary blockers, custom rules) in one declarative program, compiled to a Spark DAG. Use it when one blocking strategy is not enough.

Combine TF/IDF, dictionary blockers, and custom rules
Write blocking as one declarative program
Compiles to an optimized Spark DAG

GitHub
API docs

Matching

MatchFlow

Trains a supervised ML matcher on labeled pairs and applies it to the blocking output. Composable functions for features, labeling, training, and prediction; runs on pandas or Spark.

Trains a classifier (XGBoost, Random Forest, …) on labeled pairs
Active learning picks a small sample for you to label, about 600 pairs
Runs on pandas or Spark (single machine or cluster)
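As a rough picture of what training a matcher on labeled pairs involves, the sketch below turns each candidate pair into similarity features and fits a classifier on the labels. It is not MatchFlow's API. To stay dependency-free it swaps in a hand-rolled logistic regression where MatchFlow would use a model such as XGBoost or Random Forest, and the record fields (`name`, `zip`), the two features, and the toy data are assumptions made up for the example.

```python
import math
import re


def tokens(text):
    # Lowercased alphanumeric tokens, so "Corp." and "corp" compare equal.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def features(left, right):
    # Two illustrative features per pair: name overlap and exact zip match.
    return [
        jaccard(left["name"], right["name"]),
        1.0 if left["zip"] == right["zip"] else 0.0,
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, epochs=500, lr=0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def match_probability(model, left, right):
    w, b = model
    x = features(left, right)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical labeled pairs: (left record, right record, 1 = match, 0 = non-match).
labeled = [
    ({"name": "Acme Corp", "zip": "53703"}, {"name": "ACME Corp.", "zip": "53703"}, 1),
    ({"name": "Acme Corp", "zip": "53703"}, {"name": "Apex Labs", "zip": "10001"}, 0),
    ({"name": "Blue Sky Bakery", "zip": "60601"}, {"name": "Blue Sky Bakery LLC", "zip": "60601"}, 1),
    ({"name": "Blue Sky Bakery", "zip": "60601"}, {"name": "Red Oak Diner", "zip": "60601"}, 0),
    ({"name": "Apex Labs", "zip": "10001"}, {"name": "Blue Sky Bakery", "zip": "60601"}, 0),
]

X = [features(left, right) for left, right, _ in labeled]
y = [label for _, _, label in labeled]
model = train(X, y)

# Apply the trained matcher to unseen candidate pairs from blocking.
p_match = match_probability(
    model,
    {"name": "Harbor Coffee Roasters", "zip": "94107"},
    {"name": "Harbor Coffee Roasters Inc", "zip": "94107"},
)
p_nonmatch = match_probability(
    model,
    {"name": "Harbor Coffee Roasters", "zip": "94107"},
    {"name": "Summit Dental", "zip": "80202"},
)
```

A pair is treated as a match when its probability is above a threshold such as 0.5. In a real pipeline the labeled pairs would come from the active-learning step rather than being written by hand.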

GitHub
API docs

See it on your data.

Book a call to scope your matching problem with the team, or explore the tools and run the quickstart yourself.

Book a call
View on GitHub