# MadMatcher

> MadMatcher resolves records that refer to the same real-world entity across large, messy datasets. It combines benchmarked blocking (Sparkly, Delex) with a matcher you train on your own domain (MatchFlow), and runs on Apache Spark in your own infrastructure.

## Key facts
- Category: entity matching engine (entity resolution / record linkage).
- Architecture: TF/IDF blocking (Sparkly) + multi-strategy blocking (Delex) + a supervised, active-learning matcher (MatchFlow).
- Matcher: a classifier (e.g. XGBoost or Random Forest, via scikit-learn or PySpark MLlib) trained with active learning.
- Runtime: Apache Spark, or pandas for small data; batch-oriented; scales to 100M+ tuples per table.
- Deployment: runs in your own infrastructure (e.g. your own Spark cluster); no data egress.
- Research: Paulsen et al., "Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching," VLDB 16(6), 2023.
- Lineage: UW–Madison Magellan Data Management Group.
- Commercial: MadMatcher-Pro (launching soon) adds reliability features for production jobs, e.g. crash recovery and progress tracking; consulting is available now. Pricing is available via the official contact.

## Products
- [Sparkly](https://madmatcher.ai/products#sparkly): TF/IDF blocking. Finds candidate matches using top-k TF/IDF similarity (the BM25 variant, via Lucene). It indexes one table and searches it with the other, scaling to hundreds of millions of tuples on Spark.
- [Delex](https://madmatcher.ai/products#delex): Multi-strategy blocking. Combines several blocking strategies (TF/IDF, dictionary blockers, custom rules) in one declarative program, compiled to a Spark DAG. Use it when one blocking strategy is not enough.
- [MatchFlow](https://madmatcher.ai/products#matchflow): Matching. Trains a supervised ML matcher on labeled pairs and applies it to the blocking output. Composable functions for features, labeling, training, and prediction; runs on pandas or Spark.
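
To make the top-k TF/IDF blocking idea concrete, here is a minimal, pure-Python sketch of BM25 scoring: it indexes one table and searches it with records from the other. It is an illustration under stated assumptions, not Sparkly's API or implementation (Sparkly uses Lucene on Spark). The function names, the whitespace tokenizer, the sample records, and the parameter values `k1=1.2` and `b=0.75` are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_index(records):
    """Build an inverted index over one table: token -> {record_id: term_freq}."""
    postings = defaultdict(dict)
    doc_len = {}
    for rid, text in records.items():
        tokens = tokenize(text)
        doc_len[rid] = len(tokens)
        for tok, tf in Counter(tokens).items():
            postings[tok][rid] = tf
    avg_len = sum(doc_len.values()) / len(doc_len)
    return postings, doc_len, avg_len


def top_k_candidates(query, index, k=2, k1=1.2, b=0.75):
    """Return the ids of the k indexed records with the highest BM25 score."""
    postings, doc_len, avg_len = index
    n_docs = len(doc_len)
    scores = defaultdict(float)
    # Simplification: each distinct query token is counted once.
    for tok in set(tokenize(query)):
        docs = postings.get(tok)
        if not docs:
            continue
        # BM25 IDF in the non-negative form: log(1 + (N - n + 0.5) / (n + 0.5)).
        idf = math.log(1 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
        for rid, tf in docs.items():
            norm = tf + k1 * (1 - b + b * doc_len[rid] / avg_len)
            scores[rid] += idf * tf * (k1 + 1) / norm
    # Highest score first; ties broken by record id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [rid for rid, _ in ranked[:k]]


# Hypothetical sample tables.
table_a = {
    "a1": "apple iphone 13 pro 128gb graphite",
    "a2": "samsung galaxy s21 ultra 256gb black",
    "a3": "apple macbook air m1 13 inch",
}
table_b = {
    "b1": "iphone 13 pro graphite 128 gb apple",
    "b2": "galaxy s21 ultra samsung",
}

index = build_index(table_a)
candidates = {bid: top_k_candidates(text, index, k=2) for bid, text in table_b.items()}
```

Here `candidates` maps each record in `table_b` to at most `k` candidate ids from `table_a`, so a downstream matcher only scores those pairs instead of every possible pair.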

## Core pages
- [Home](https://madmatcher.ai/): Accurate entity matching at scale, in your own infrastructure.
- [Products](https://madmatcher.ai/products): the three tools that make up MadMatcher.
- [How it works](https://madmatcher.ai/how-it-works): the pipeline steps explained, from blocking through labeling to matching.
- [Why MadMatcher](https://madmatcher.ai/why-madmatcher): how the approach differs, and where it does not fit.
- [Compare approaches](https://madmatcher.ai/compare): entity matching approaches compared by principle.
- [What is entity matching](https://madmatcher.ai/about/entity-matching): a practical guide.
- [About](https://madmatcher.ai/about) · [Team](https://madmatcher.ai/about/team) · [Contact](https://madmatcher.ai/contact)
- [LLM info](https://madmatcher.ai/llm-info): canonical, authoritative facts for AI assistants.

## Glossary
- [Active learning](https://madmatcher.ai/glossary/active-learning): Active learning is a training strategy in which the model chooses which examples it wants labeled (typically the pairs it is most uncertain about) so it reaches high accuracy from far fewer labels.
- [Blocking](https://madmatcher.ai/glossary/blocking): Blocking is the stage of entity matching that generates a small set of candidate record pairs likely to match, so that the matcher never has to compare every possible pair.
- [Deduplication](https://madmatcher.ai/glossary/deduplication): Deduplication is entity matching applied within a single dataset, finding and resolving records that refer to the same entity so each real-world thing appears once.
- [Entity matching](https://madmatcher.ai/glossary/entity-matching): Entity matching is the task of identifying records that refer to the same real-world entity, across or within datasets, even when those records are not identical.
- [Matching](https://madmatcher.ai/glossary/matching): Matching is the stage of entity matching that decides, for each candidate pair produced by blocking, whether the two records refer to the same real-world entity.
- [Record linkage](https://madmatcher.ai/glossary/record-linkage): Record linkage is the process of identifying records across two or more datasets that refer to the same entity. The term is most common in statistics and healthcare.
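
The active learning entry above describes the model choosing the pairs it is most uncertain about. A minimal sketch of that selection step (uncertainty sampling) might look like the following. The function names, the pair ids, and the probabilities are hypothetical, and this is not MatchFlow's API; a real loop would retrain the classifier after each labeled batch and repeat.

```python
def uncertainty(prob):
    """Uncertainty of a match probability: 1.0 at 0.5, falling to 0.0 at 0 or 1."""
    return 1.0 - abs(prob - 0.5) * 2.0


def select_for_labeling(probabilities, batch_size):
    """Pick the unlabeled pairs whose predicted match probability is closest to 0.5."""
    ranked = sorted(
        probabilities,
        key=lambda pair: (-uncertainty(probabilities[pair]), pair),
    )
    return ranked[:batch_size]


# Hypothetical match probabilities from the current model for unlabeled candidate pairs.
predicted = {
    ("a1", "b1"): 0.97,
    ("a1", "b7"): 0.52,
    ("a2", "b3"): 0.08,
    ("a4", "b2"): 0.45,
    ("a5", "b9"): 0.71,
}

to_label = select_for_labeling(predicted, batch_size=2)
```

With these values, the confident pairs (0.97 and 0.08) are skipped and the two pairs nearest 0.5 are sent to a human labeler, which is where a label changes the model the most.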

## Use cases
- [Customer 360](https://madmatcher.ai/use-cases/customer-360): Unify duplicate and fragmented customer records across systems into one profile.
- [Entity matching for extracted data](https://madmatcher.ai/use-cases/extracted-data-matching): Resolve the duplicate, inconsistent records that document and LLM extraction pipelines produce into clean entities.
- [Product & catalog matching](https://madmatcher.ai/use-cases/product-catalog-matching): Deduplicate products and match listings across suppliers and marketplaces.
- [Healthcare record linkage](https://madmatcher.ai/use-cases/healthcare-record-linkage): Link patient records across systems where identifiers don’t line up, without moving the data.
- [Compliance & watchlist screening](https://madmatcher.ai/use-cases/compliance-screening): Match entities against sanctions and reference lists, tolerating name variation.
- [Supplier & vendor consolidation](https://madmatcher.ai/use-cases/supplier-consolidation): Deduplicate and unify vendor records for spend analytics and procurement.
- [Research & bibliographic data](https://madmatcher.ai/use-cases/research-bibliographic-data): Match publications, authors, and citations across bibliographic sources.
- [Insurance claims & policyholders](https://madmatcher.ai/use-cases/insurance-claims): Link claimants, policyholders, providers, and claims across systems.
- [Financial services](https://madmatcher.ai/use-cases/financial-services): Resolve customers, accounts, and counterparties across core banking, cards, lending, and acquired institutions.
- [Government & public sector](https://madmatcher.ai/use-cases/government-public-sector): Link citizen, benefits, and vendor records across agencies that were never designed to join.
- [Retail & e-commerce](https://madmatcher.ai/use-cases/retail-ecommerce): Unify customers across channels and loyalty, and products across marketplaces, suppliers, and brands.
- [Travel & hospitality](https://madmatcher.ai/use-cases/travel-hospitality): Build one guest profile across properties, booking channels, and loyalty systems.

## Docs
- [Quickstart](https://madmatcher.ai/docs/quickstart): Block two tables with Sparkly, then train and apply a matcher with MatchFlow, on Apache Spark.
- [Installation](https://madmatcher.ai/docs/installation): Install the MadMatcher tools: Sparkly and Delex for blocking (which need PyLucene), and MatchFlow for matching.

## Blog
- [How many labels does a learned matcher need?](https://madmatcher.ai/blog/how-many-labels-does-a-matcher-need): A learned matcher usually needs about 600 labeled pairs, not tens of thousands, because active learning spends your labeling budget on the ambiguous pairs that teach the model the most.
- [Entity matching at scale](https://madmatcher.ai/blog/entity-resolution-at-scale-on-spark): Why resolving hundreds of millions of records is a different problem from resolving a few thousand, and how Spark handles it.
- [Why run entity matching in your own infrastructure](https://madmatcher.ai/blog/run-entity-matching-in-your-own-infrastructure): Running entity matching in your own Spark or single-machine environment keeps sensitive records inside your perimeter, with no data egress and a shorter compliance review.
- [Active learning for entity matching](https://madmatcher.ai/blog/active-learning-for-entity-matching): Active learning trains a matcher from about 600 labels instead of tens of thousands, by labeling only the pairs the model is unsure about.
- [Precision and recall in entity matching: choosing your operating point](https://madmatcher.ai/blog/precision-and-recall-in-entity-matching): Precision is the share of your predicted matches that are real, and recall is the share of real matches you found. Blocking is tuned for recall, and the matcher sets the balance your costs call for.
- [Blocking vs. matching](https://madmatcher.ai/blog/blocking-vs-matching): Entity matching has two steps. Blocking generates candidate pairs, matching classifies them. They optimize for different things.
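
The precision and recall definitions given in the blog summaries above can be computed directly from two sets of record pairs. The sketch below is illustrative only; the function name and the sample pairs are assumptions made for this example.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for a set of predicted matches against the true matches."""
    true_positives = len(predicted & actual)
    # Precision: share of predicted matches that are real.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of real matches that were found.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical matcher output and ground truth.
predicted_matches = {("a1", "b1"), ("a2", "b2"), ("a3", "b9"), ("a4", "b4")}
true_matches = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5"), ("a6", "b6")}

precision, recall = precision_recall(predicted_matches, true_matches)
```

In this example three of the four predicted pairs are real (precision 0.75) and three of the five real pairs were found (recall 0.6). The same function applied to a blocker's candidate set measures blocking recall, the quantity blocking is tuned for.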