MadMatcher in production.

MadMatcher-Pro and consulting take entity matching to production, on top of an open-source core. Pro adds reliability for long-running jobs; consulting brings the team onto your data. Underneath sit three open-source tools you can also run yourself.

Take it to production.

MadMatcher-Pro and consulting are the fastest path from a prototype to a production matching pipeline.

MadMatcher-Pro

Launching soon

The same Sparkly, Delex, and MatchFlow, plus the reliability a long-running production job needs. A multi-hour run survives a failure instead of starting from scratch.
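
Surviving a failure in a long run usually comes down to checkpointing. The sketch below is a generic illustration in plain Python, not MadMatcher-Pro's implementation or API: the `run_with_checkpoints` helper, the JSON checkpoint file, and the per-item granularity are all assumptions made for the example.

```python
import json
import os


def run_with_checkpoints(items, process, checkpoint_path):
    """Process items in order, resuming after the last checkpointed item.

    Hypothetical helper for illustration only. Results must be
    JSON-serializable because the checkpoint is a JSON file.
    """
    done, results = 0, []
    if os.path.exists(checkpoint_path):
        # A previous run left a checkpoint: pick up where it stopped.
        with open(checkpoint_path) as f:
            state = json.load(f)
        done, results = state["done"], state["results"]

    for i in range(done, len(items)):
        results.append(process(items[i]))
        # Write to a temporary file, then rename it over the checkpoint,
        # so a crash in the middle of a write cannot corrupt the checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"done": i + 1, "results": results}, f)
        os.replace(tmp_path, checkpoint_path)
        # Simple progress report.
        print(f"progress: {i + 1}/{len(items)}")
    return results
```

Calling the function again with the same checkpoint path after a crash skips the items that already finished. A real job would more likely checkpoint per batch or partition than per item, since this sketch rewrites the full result list on every step.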

At launch
  • Crash recovery
  • Progress tracking

Get in touch about Pro

Consulting

Available now

Work directly with the team to build and tune an entity matching pipeline on your own data, from blocking strategy to a trained matcher.

Book a call

The open-source core.

The three tools underneath, free and open source. Sparkly and Delex do blocking, MatchFlow does matching. Use one on its own or chain them into a full pipeline.

TF/IDF blocking

Sparkly

Finds candidate matches using top-k TF/IDF similarity (the BM25 variant, via Lucene). It indexes the smaller table and searches it with the other, scaling to hundreds of millions of tuples on Spark.

  • Top-k TF/IDF candidate generation (BM25 via Lucene)
  • Indexes the smaller table, searches it with the other on Spark
  • Scales to hundreds of millions of tuples per table

Paulsen et al., PVLDB 16(6), 2023. Outperforms eight state-of-the-art blockers. Read the paper ↗
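
For a feel of what top-k BM25 blocking does, here is a small sketch in plain Python. It is not Sparkly's API and does not use Lucene or Spark: the function names are hypothetical, the scoring is the standard Okapi BM25 formula with common default parameters (`k1=1.2`, `b=0.75`), and every indexed record is scored by brute force where a real engine would use an inverted index.

```python
import heapq
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_index(records):
    """Index one table, given as {record_id: text}."""
    docs = {rid: Counter(tokenize(text)) for rid, text in records.items()}
    doc_freq = Counter()
    for counts in docs.values():
        doc_freq.update(counts.keys())  # number of records containing each token
    avg_len = sum(sum(c.values()) for c in docs.values()) / len(docs)
    return docs, doc_freq, avg_len


def bm25(query_tokens, counts, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 score of one indexed record against a tokenized query."""
    doc_len = sum(counts.values())
    score = 0.0
    for tok in set(query_tokens):
        tf = counts.get(tok, 0)
        if tf == 0:
            continue
        idf = math.log(1 + (n_docs - doc_freq[tok] + 0.5) / (doc_freq[tok] + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return score


def top_k_candidates(index, query_text, k=2):
    """Return the ids of the k best-scoring indexed records for one query record."""
    docs, doc_freq, avg_len = index
    query = tokenize(query_text)
    scored = (
        (bm25(query, counts, doc_freq, len(docs), avg_len), rid)
        for rid, counts in docs.items()
    )
    # Keep the k highest scores and drop records that share no token at all.
    return [rid for score, rid in heapq.nlargest(k, scored) if score > 0]
```

Build the index once over the smaller table, then call `top_k_candidates` for each record of the other table. The pairs that come back are the candidates handed on to matching.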

Multi-strategy blocking

Delex

Combines several blocking strategies (TF/IDF, dictionary blockers, custom rules) in one declarative program, compiled to a Spark DAG. Use it when one blocking strategy is not enough.
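
The idea of combining strategies can be sketched in plain Python. This is not Delex syntax and nothing here runs on Spark: the two blockers, the `block` function, and the field names (`name`, `zip`, `country`) are hypothetical stand-ins, with a token-overlap check in place of TF/IDF and an exact-key lookup as the second blocker.

```python
def token_overlap_blocker(table_a, table_b, field, min_shared=2):
    """Pair records whose `field` values share at least `min_shared` tokens."""
    pairs = set()
    for a_id, a in table_a.items():
        a_tokens = set(a[field].lower().split())
        for b_id, b in table_b.items():
            b_tokens = set(b[field].lower().split())
            if len(a_tokens & b_tokens) >= min_shared:
                pairs.add((a_id, b_id))
    return pairs


def exact_key_blocker(table_a, table_b, field):
    """Pair records that agree exactly on `field` (a hash-join style blocker)."""
    by_key = {}
    for b_id, b in table_b.items():
        by_key.setdefault(b[field], []).append(b_id)
    return {
        (a_id, b_id)
        for a_id, a in table_a.items()
        for b_id in by_key.get(a[field], [])
    }


def block(table_a, table_b):
    """Union two blockers, then apply a custom rule to the combined candidates."""
    by_name = token_overlap_blocker(table_a, table_b, "name")
    by_zip = exact_key_blocker(table_a, table_b, "zip")
    candidates = by_name | by_zip
    # Custom rule: a pair survives only if both records are in the same country.
    return {
        (a_id, b_id)
        for a_id, b_id in candidates
        if table_a[a_id]["country"] == table_b[b_id]["country"]
    }
```

Each blocker catches pairs the other misses (similar names with different zip codes, or the same zip code with differently spelled names), and the rule prunes the union. Hand-wiring this for every new strategy gets tedious, which is the case a declarative program is meant to cover.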

  • Combine TF/IDF, dictionary blockers, and custom rules
  • Write blocking as one declarative program
  • Compiles to an optimized Spark DAG

Matching

MatchFlow

Trains a supervised ML matcher on labeled pairs and applies it to the blocking output. Composable functions for features, labeling, training, and prediction; runs on pandas or Spark.

  • Trains a classifier (XGBoost, Random Forest, …) on labeled pairs
  • Active learning selects a small sample for you to label, about 600 pairs
  • Runs on pandas or Spark (single machine or cluster)
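
The matching step can be sketched end to end in plain Python: turn each candidate pair into a feature vector, fit a classifier on labeled pairs, then score the blocking output. This is not MatchFlow's API. The function names and the `name` and `city` fields are hypothetical, and a tiny hand-rolled logistic regression stands in for XGBoost or Random Forest so the example has no dependencies.

```python
import math


def jaccard(a, b):
    """Token-set Jaccard similarity of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def features(rec_a, rec_b):
    """Feature vector for one pair: name similarity and exact city agreement."""
    return [
        jaccard(rec_a["name"], rec_b["name"]),
        1.0 if rec_a["city"] == rec_b["city"] else 0.0,
    ]


def _score(model, x):
    """Predicted match probability for feature vector x."""
    weights, bias = model
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(labeled_pairs, epochs=2000, lr=0.5):
    """Fit logistic regression by stochastic gradient descent.

    labeled_pairs: list of ((rec_a, rec_b), label), where label 1 means match.
    """
    data = [(features(a, b), label) for (a, b), label in labeled_pairs]
    weights, bias = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, label in data:
            err = _score((weights, bias), x) - label
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict_matches(model, candidate_pairs, threshold=0.5):
    """Keep the candidate pairs the model scores at or above the threshold."""
    return [
        (a, b)
        for a, b in candidate_pairs
        if _score(model, features(a, b)) >= threshold
    ]
```

Train once on the labeled sample, then pass the candidate pairs from blocking to `predict_matches`. Swapping in a stronger classifier changes only `train` and `_score`; the feature step and the prediction step stay the same.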

See it on your data.

Book a call to scope your matching problem with the team, or explore the tools and run the quickstart yourself.