# MadMatcher: full content

> MadMatcher resolves records that refer to the same real-world entity across large, messy datasets. Benchmarked blocking (Sparkly, Delex) and a matcher you train to your own domain (MatchFlow), running on Apache Spark in your own infrastructure.

## What MadMatcher is

MadMatcher resolves records that refer to the same real-world entity across large, messy datasets. It pairs benchmarked blocking with a matcher you train to your own domain. It runs on Apache Spark (or pandas for small data), inside the customer’s own infrastructure.

### Components

- Sparkly (TF/IDF blocking): Finds candidate matches using top-k TF/IDF similarity (the BM25 variant, via Lucene). It indexes one table and searches it with the other, scaling to hundreds of millions of tuples on Spark. Outperforms eight state-of-the-art blocking solutions (VLDB 2023).
- Delex (Multi-strategy blocking): Combines several blocking strategies (TF/IDF, dictionary blockers, custom rules) in one declarative program, compiled to a Spark DAG. Use it when one blocking strategy is not enough. Start with Sparkly; reach for Delex when you need to combine strategies.
- MatchFlow (Matching): Trains a supervised ML matcher on labeled pairs and applies it to the blocking output. Composable functions for features, labeling, training, and prediction; runs on pandas or Spark. Active learning builds training data from about 600 labeled pairs.

### Research

Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching. Paulsen et al., VLDB 16(6), 2023. https://pages.cs.wisc.edu/~anhai/papers1/sparkly-vldb2023.pdf

---

# Glossary

## Active learning

Active learning is a training strategy in which the model chooses which examples it wants labeled (typically the pairs it is most uncertain about) so it reaches high accuracy from far fewer labels.

In entity matching, most record pairs are obvious matches or obvious non-matches, and they teach a model very little.
The informative pairs are the ambiguous ones near the decision boundary. Active learning uses this. Instead of labeling a large random sample, you label only the pairs the model is most unsure about, then retrain and repeat. The payoff is far less labeling, about 600 pairs instead of tens of thousands, for the same accuracy. MadMatcher's [MatchFlow](/products#matchflow) uses an uncertainty-sampling labeler to build training data this way.

[Label less, match better →](/blog/active-learning-for-entity-matching/)

## Blocking

Blocking is the stage of entity matching that generates a small set of candidate record pairs likely to match, so that the matcher never has to compare every possible pair.

Comparing every pair of records is a quadratic operation, infeasible beyond modest sizes. Blocking solves this by grouping or indexing records so that only plausibly-matching pairs become candidates, discarding the rest before matching runs. The key metric for blocking is **recall**: any true match that blocking drops can never be recovered downstream, so blocking sets the ceiling on the whole pipeline's accuracy. Good blocking is high-recall while keeping the candidate set small. MadMatcher's [Sparkly](/products#sparkly) does TF/IDF blocking and, in its VLDB 2023 paper, outperforms eight state-of-the-art blocking solutions.

[Blocking vs. matching, explained →](/blog/blocking-vs-matching/)

## Deduplication

Deduplication is entity matching applied within a single dataset, finding and resolving records that refer to the same entity so each real-world thing appears once.

Deduplication is the special case of [entity matching](/glossary/entity-matching/) where you match a dataset against itself. The goal is a clean set of records where each customer or product appears exactly once, with duplicates linked or merged.
It uses the same machinery as cross-dataset matching, [blocking](/glossary/blocking/) to generate candidate pairs within the table and then [matching](/glossary/matching/) to classify them. It faces the same scaling problem, since the number of within-table pairs grows quadratically with the number of rows.

[See the Customer 360 use case →](/use-cases/customer-360/)

## Entity matching

Entity matching is the task of identifying records that refer to the same real-world entity, across or within datasets, even when those records are not identical.

Entity matching answers one question. Do these two records describe the same thing? Real-world data is full of typos and missing fields, so an exact comparison fails. The same customer shows up in many forms that are not identical. A full entity matching pipeline has two core stages: [blocking](/glossary/blocking/) to generate candidate pairs, and [matching](/glossary/matching/) to classify them. It often adds a labeling stage to train the matcher. The terms **entity resolution** and **record linkage** mean the same thing as entity matching, with slightly different roots in different fields.

[Read the full guide →](/about/entity-matching)

## Matching

Matching is the stage of entity matching that decides, for each candidate pair produced by blocking, whether the two records refer to the same real-world entity.

Matching runs only on the candidate pairs that survived [blocking](/glossary/blocking/), so it can afford to be precise. It classifies each pair as a match or not, using anything from hand-authored rules to supervised machine learning. The key metric for matching is **precision**, alongside recall: of the pairs it labels a match, how many truly are? A learned matcher reaches high accuracy on hard, varied domains, at the cost of needing labeled examples. MadMatcher's [MatchFlow](/products#matchflow) trains a classifier such as XGBoost or Random Forest on your labels.
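
To make the two matching metrics concrete, here is a minimal sketch that scores a matcher's predicted pairs against a hand-checked gold set. The pair ids and the `precision_recall` helper are hypothetical, invented for illustration; they are not part of MatchFlow.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted match pairs against gold match pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # precision: of the pairs labeled a match, how many truly are
    precision = true_positives / len(predicted) if predicted else 0.0
    # recall: of the true matches, how many were found
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# hypothetical (table A id, table B id) pairs
gold_matches = {(1, 10), (2, 20), (3, 30), (4, 40)}
predicted_matches = {(1, 10), (2, 20), (3, 31)}

precision, recall = precision_recall(predicted_matches, gold_matches)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.67 recall=0.50
```

Two of the three predicted pairs are correct, and two of the four true matches were found, which is why the two numbers have to be read together.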
[Why a trainable matcher matters →](/why-madmatcher)

## Record linkage

Record linkage is the process of identifying records across two or more datasets that refer to the same entity. The term is most common in statistics and healthcare.

Record linkage is, for practical purposes, a synonym for [entity matching](/glossary/entity-matching/). It has the same goal: connect records that describe the same real-world entity despite differences in formatting and completeness. The term comes from statistics and is widely used in healthcare and official statistics, where linking individuals across systems is foundational. Classical record linkage often uses probabilistic methods, such as the Fellegi–Sunter model, valued for being explainable. Learned approaches add accuracy on complex domains at the cost of needing labels, which is a trade-off to weigh for your case.

[Healthcare record linkage use case →](/use-cases/healthcare-record-linkage/)

---

# Use cases

## Customer 360

The same customer is in your CRM as Robert Smith Jr. and in billing as Bob Smith. Support has them under an old work email and a phone number missing the country code. None of these records link up, because the systems do not share a customer ID. So you email the same person twice, and they count as three people in your numbers. Getting them back to one customer is an [entity matching](/glossary/entity-matching) problem.

## Why exact joins and SQL fuzzy matching fall short

An exact join only works when both systems carry the same ID, and these do not. One stray character is enough to split a person in two. SQL fuzzy matching does not save you either. Set the cutoff low and it merges people who are not the same. Set it high and the duplicates stay. You end up hand-coding which differences count, and the setting that works on one system breaks on the next one you connect. The damage is quiet.
You pay to reach the same person twice, and a rep on the phone has no idea the caller is one of your biggest accounts.

## How MadMatcher resolves customer records

MadMatcher learns what tells two customers apart, so you stop writing rules by hand. [Blocking](/glossary/blocking) trims the data down to the pairs worth comparing. A matcher you train on about 600 labeled examples from your own records then decides, pair by pair, whether two records are the same person. It picks up from your examples that a shared email and surname are strong evidence and that a shared first name is weak. The labeling stays small because [active learning](/blog/active-learning-for-entity-matching) only puts the unclear pairs in front of you. You end up with [deduplication](/glossary/deduplication) built from your own examples.

## Keeping regulated customer data in your perimeter

Customer data stays in your environment. MadMatcher [runs in your own infrastructure](/blog/run-entity-matching-in-your-own-infrastructure), which matters when those records are regulated. When you add a new source, you label a few pairs from it if its formatting is different, and the same matcher carries over.

[See how matching works →](/how-it-works) · [Talk to us →](/contact)

## Entity matching for extracted data

Document and LLM extraction pipelines are good at one job: turning unstructured sources into structured records. Pull a company out of a single PDF and you get a clean row. Pull the same company out of ten thousand contracts and filings and you get ten thousand rows that almost agree. Extraction structures the text. It does not decide which of those rows are the same real-world entity, and that is where the duplicates pile up. Resolving them is an [entity matching](/glossary/entity-matching) job, not an extraction one.

## Why extraction output can’t be joined clean

The output is noisy by nature, so neither a join nor a cutoff resolves it.
One run calls a company "Acme Corp.", the next calls it "ACME Corporation". A clean date in one record is a free-text blob in the next, and fields go missing wherever the source was vague. Join the extracted text and you drop real matches. Pick a similarity cutoff and you either fuse distinct entities or leave the obvious duplicates apart. What you ship looks structured and still double-counts the things inside it.

## How MadMatcher resolves extracted entities

MadMatcher is the layer between extraction and your store. Once the records exist, [blocking](/glossary/blocking) narrows the pair space, and a model you train on about 600 labeled pairs decides which rows point to the same entity. It learns the signals that separate your entities instead of running a fixed rule, and [the small label set](/blog/how-many-labels-does-a-matcher-need) is usually the part people worry about. This is [deduplication](/glossary/deduplication) of your extracted output, run before it reaches a vector store or knowledge graph, so your index does not count many copies of one thing as many things.

## Where it runs, and where the boundary is

It runs [in your own infrastructure](/blog/run-entity-matching-in-your-own-infrastructure), on Apache Spark for a large corpus or a single machine for a small one, so the extracted records stay inside your perimeter. There is one boundary to be clear about. MadMatcher matches records, it does not extract them. Keep your extractor for the parsing and use MadMatcher for the resolution.

[How matching works →](/how-it-works) · [Compare approaches →](/compare) · [Talk to us →](/contact)

## Product & catalog matching

Two listings read "16oz Stainless Steel Water Bottle, BPA-Free" and "Water Bottle Steel 500ml." Same product, barely a word in common. The field that should identify an item is its title, and titles read like prose. The field that should anchor it is the UPC, and the UPC goes missing or shows up padded with stray zeros about as often as not.
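
A small sketch of why neither field can be compared as-is. It assumes a hypothetical `normalize_gtin` helper that left-pads a code to 14 digits (the GTIN-14 convention) so zero-padded variants of one UPC compare equal, plus a plain token-overlap score for titles. Neither function is part of MadMatcher, and the UPC value is made up.

```python
import re


def normalize_gtin(raw):
    """Return the code as a 14-digit string, or None when it is missing or unusable."""
    digits = re.sub(r"\D", "", raw or "")
    if not digits or len(digits) > 14:
        return None
    return digits.zfill(14)


def token_jaccard(a, b):
    """Share of distinct lowercase tokens that the two titles have in common."""
    tokens_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


# the same (made-up) UPC, once as 12 digits and once with a stray leading zero
same_upc = normalize_gtin("012345678905") == normalize_gtin("0012345678905")

title_score = token_jaccard(
    "16oz Stainless Steel Water Bottle, BPA-Free",
    "Water Bottle Steel 500ml.",
)
print(same_upc, title_score)
# True 0.375
```

The padded codes agree once normalized, but a missing UPC still yields `None`, and the two titles for the same bottle share only three of eight distinct tokens, so the title score alone gives a fixed cutoff little to work with.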
## Why UPC joins and title fuzzy-matching backfire

Joining on UPC is out, and fuzzy-matching the titles backfires. Marketing filler inflates the similarity of unrelated items, while short, accurate listings score low. The distinctions that decide a match tend to be small. Two products share most of their text and differ only on the one attribute, a pack size or a voltage, that makes them separate SKUs. The damage is easy to miss. A duplicate SKU splits demand and breaks reorder math, and search shows the same thing three times while burying the listing the shopper wanted.

## How MadMatcher resolves product listings

MadMatcher learns those distinctions from your catalog. [Blocking](/glossary/blocking) narrows the pairs, and then you label about 600 of the pairs a person actually has to think about. The matcher picks up that material and pack size decide a match while word order and adjectives do not. [That small label set](/blog/how-many-labels-does-a-matcher-need) is enough to make a clean UPC strong evidence when it exists, while items with no usable UPC still get [resolved](/glossary/deduplication). Point it at a new supplier feed later and it carries over what it already learned.

## How it runs at catalog scale

It runs on your own Spark cluster or a single machine, [inside your own environment](/blog/run-entity-matching-in-your-own-infrastructure), on catalogs that reach into the hundreds of millions of items per table. The same trained matcher re-runs as feeds change, so the catalog stays deduplicated without a rebuild.

[How matching works →](/how-it-works) · [Why a trainable matcher →](/why-madmatcher) · [Talk to us →](/contact)

## Healthcare record linkage

One patient shows up in your EHR, your lab system, your billing platform, and an immunization registry, and each one assigned a different ID. The EHR has the legal name and date of birth. The lab has whatever the order form happened to capture. Link these records and you get a full clinical picture.
Link them wrong and you either split one patient's history in two or merge two people who were never the same patient, in a setting where the mistake reaches the bedside.

## Why ID joins and one threshold fail on patient data

The fields that should connect these records do not hold still, so neither an ID join nor a single cutoff is safe. People go by nicknames and by maiden or married names, so "William" and "Bill" are one patient. Birth dates pick up transposed digits. An MRN only helps when every system recorded it and recorded it correctly, which rarely holds across organizations. An exact ID join loses the patient the moment an identifier is missing or mistyped, and one similarity cutoff cannot be right for a name field and a date field at once. In a clinic, a missed match and a false merge both cause harm.

## How MadMatcher links patient records

MadMatcher learns which agreements and disagreements actually tell patients apart. [Blocking](/glossary/blocking) reduces the comparisons. [MatchFlow](/products#matchflow) then trains on about 600 labeled pairs from your own records and produces a model you can [tune for precision or recall](/blog/precision-and-recall-in-entity-matching). For a merge that is hard to undo you lean toward precision. For case-finding you lean toward recall. You move the operating point rather than rewrite a rule. Because the model is measured against held-out pairs, you know its precision and recall instead of trusting a cutoff no one checked.

## Why it runs without moving PHI

PHI never moves to an outside service, because MadMatcher [runs on your own infrastructure](/blog/run-entity-matching-in-your-own-infrastructure) and the work stays inside the governance and audit boundary you already have. The linking happens where the records already live, across your EHR, labs, billing, and registries.
[How in-infrastructure deployment works →](/why-madmatcher) · [How matching works →](/how-it-works) · [Talk to us →](/contact)

## Compliance & watchlist screening

Screening customers and counterparties against sanctions lists is only as good as the [matching](/glossary/matching) underneath it, and the matching is hard because watchlists are almost all names, and names are the worst field to match on. One Arabic or Cyrillic name has many valid Latin spellings, and a company shows up with and without its legal suffix or under a former name. The lists themselves carry aliases, so your internal record rarely matches a list entry character for character.

## Why one edit-distance cutoff can’t screen names

A fixed similarity cutoff fails because the two ways to screen wrong pull in opposite directions and cost very different amounts. Match loosely and analysts drown in false positives, with common names flagged all day and onboarding stalled. Match strictly and a sanctioned party slips through, which is a breach with regulators attached. One edit-distance cutoff in SQL cannot serve both. It flags common names constantly and still misses the transliteration that mattered.

## How MadMatcher screens entities against watchlists

MadMatcher learns the variation instead of hard-coding it. [Blocking](/glossary/blocking) narrows your records down to plausible list candidates. A matcher trained on about 600 labeled pairs from your own screening history then decides which name differences mean the same entity, whether a difference is a transliteration or a known alias. Because the model is trained and measured, you set the precision and recall point your risk posture calls for, and you re-tune it as lists and tolerances change. It is a classifier, not a black box, so you can see which features it relies on, in plain terms, when you need to show a regulator how the screening works.

## Keeping screening data inside your controls

Screening data never leaves your perimeter.
MadMatcher [runs in your own infrastructure](/blog/run-entity-matching-in-your-own-infrastructure), so customer and counterparty records stay inside the audit and recordkeeping boundary you already operate, with nothing handed to an outside service. As watchlists and internal records grow, the same trained matcher re-runs without anyone rebuilding the rules.

[How matching works →](/how-it-works) · [Why benchmarked blocking matters →](/why-madmatcher) · [Talk to us →](/contact)

## Supplier & vendor consolidation

Spend analytics lives on one fact, that two invoices went to the same supplier, and most procurement data cannot tell you. The same vendor gets set up more than once because different buyers created it, and an acquired unit arrives with its own supplier master. "Acme Industrial LLC" and "ACME IND. CORP" each get paid on their own, so the spend that should add up to leverage is scattered across records that never reconcile. Pulling it back together is a [deduplication](/glossary/deduplication) problem.

## Why exact joins can’t consolidate vendors

The fix has to survive the ways company names break, which a plain join and a fixed cutoff do not. Legal suffixes come and go, and tax IDs and addresses are often left blank. An exact join cannot bridge that, and a single similarity cutoff either merges distinct companies that share a name or leaves one supplier's variants split. The cost is quiet. Fragmented vendors hide the volume that would earn better terms, and category spend lands in the wrong buckets.

## How MadMatcher consolidates vendor records

MadMatcher learns the difference from your own data. [Blocking](/glossary/blocking) narrows the candidate pairs, and then you label about 600 of the borderline vendor pairs. It picks up that a dropped suffix or a shared tax ID means one supplier while a similar name alone does not.
[That small label set goes far](/blog/how-many-labels-does-a-matcher-need), and you can teach it to roll subsidiaries up to a parent when the analysis needs that.

## Why consolidation stays clean over time

Consolidation is never a one-time project, so the trained matcher re-runs as new vendors are onboarded and keeps the master clean. It all runs [in your own environment](/blog/run-entity-matching-in-your-own-infrastructure), vendor and spend data included, with nothing sent outside.

[Compare approaches →](/compare) · [Why a trainable matcher →](/why-madmatcher) · [Talk to us →](/contact)

## Research & bibliographic data

This is the kind of data MadMatcher's [blocking](/glossary/blocking) was measured on, and it is a hard case for a reason. One paper turns up in several bibliographic databases and a dozen citation lists, each with a slightly different title and a different rendering of the authors. An author is "J. Smith" in one record and "John Smith" in the next, and shares a name with three other researchers. Tie all of that back to one paper and one person and you have the foundation for any literature analysis. Miss it and the counts and the citation credit come out wrong.

## Why exact joins can’t tie records together

Exact joins will not get you there, because most records share no DOI to join on, and the fields that remain drift in stubborn ways. Titles vary in punctuation and subtitle, and an author appears as initials in one record and a full name in another. A single similarity cutoff cannot cover it. The setting that works for a long, distinctive title is wrong for a short one and useless for telling two "J. Smith"s apart. The same blocking step that makes this tractable [outperforms eight state-of-the-art blockers (VLDB 2023)](/why-madmatcher) on exactly this kind of data.

## How MadMatcher matches publications and authors

MadMatcher treats the two jobs here, publication matching and author disambiguation, as the separate problems they are.
Each learns from about 600 labeled pairs which differences mean the same work or the same person, and [active learning](/blog/active-learning-for-entity-matching) keeps that labeling low. It works directly on the titles and authors rather than waiting for an identifier that usually is not there.

## How it runs over large corpora

It runs on your own Spark cluster or a single machine, [over corpora in the hundreds of millions](/blog/run-entity-matching-in-your-own-infrastructure), without shipping anything out. The same trained models re-run as new sources and citation lists are added.

[Why benchmarked blocking matters →](/why-madmatcher) · [How matching works →](/how-it-works) · [Talk to us →](/contact)

## Insurance claims & policyholders

An insurer's data is full of the same people and companies showing up again under different descriptions. One policyholder's auto and home policies sit in the system as two unrelated customers. A provider bills under three names and two tax IDs. When these records are not [linked](/glossary/record-linkage), you cannot see total exposure, and a repeat claimant slips by because no one tied the new claim to the old ones.

## Why exact joins leave exposure hidden

A plain join causes every one of these misses. A policyholder counted as two customers hides total exposure. A claimant not tied to prior claims is a hole in fraud detection. Policy and claims systems were built or bought separately, so the same person is keyed differently in each. Names pick up initials and married-name changes, and birth dates get transposed. A single similarity cutoff fuses unrelated claimants at one setting and, set tighter, leaves one person scattered across policies.

## How MadMatcher links claims and policyholders

Each of these is a different matching problem, and MadMatcher treats them that way.
After [blocking](/glossary/blocking) trims the pairs, a model trained on about 600 labeled pairs from your own books learns that a married name at a shared address is one policyholder, while a group name with a matching tax ID is one provider. You did not write either rule. [Active learning](/blog/active-learning-for-entity-matching) keeps the labeling small, and tying a new claim back to prior claims is exactly the [link](/glossary/matching) a flat join misses and a trained matcher catches.

## Keeping policyholder data inside your controls

It runs inside your environment, so policyholder and health-related data stays within the controls you already have. As your books and claim volume grow, the same trained matcher [re-runs at scale](/blog/run-entity-matching-in-your-own-infrastructure) on your own cluster.

[How matching works →](/how-it-works) · [Why a trainable matcher →](/why-madmatcher) · [Talk to us →](/contact)

## Financial services

A bank almost never sees a customer as one person. Core banking has the deposit account they opened years ago. Lending has them again as a borrower, keyed off a loan application, and the bank you acquired last year brought a third file under a different spelling. KYC and single-customer-view rules assume those records connect, and across these systems they do not.

## Why scattered records are a regulatory risk

When the records do not connect, the problems are regulatory before they are operational. Sanctions screening that runs on fragments misses a match it should have caught. You cannot meet a single-customer-view obligation when one person reads as several, and credit exposure looks smaller than it is when a borrower's other accounts are not tied to them. A plain join cannot bridge this, and a fixed similarity cutoff cannot either. A name picks up a suffix in one place and a transliteration in another, and a tax ID that is masked in the core system is complete in the loan file.
## Resolving customers and counterparties

MadMatcher learns these patterns from your own books. [Blocking](/glossary/blocking) cuts the data down to the pairs worth comparing. A matcher trained on about 600 labeled pairs then learns that a transliterated name with a matching tax ID is one customer, and [active learning](/blog/active-learning-for-entity-matching) keeps the labeling small. You train a separate model for retail customers and another for business counterparties, since the two are keyed and written differently. A trading name and a registered legal name then resolve to one firm, so aggregate counterparty risk stops counting it twice.

## Keeping customer data inside residency controls

Customer data never leaves your controls. MadMatcher [runs in your own infrastructure](/blog/run-entity-matching-in-your-own-infrastructure), which matters under residency and privacy rules. When you acquire an institution, you match its customer file in place without shipping it anywhere.

[How matching works →](/how-it-works) · [Compare approaches →](/compare) · [Talk to us →](/contact)

## Government & public sector

Government data lives in agencies that grew up separately and were never meant to share a key. A resident is a taxpayer to one agency and a benefits recipient to another, and neither system was built to recognize that they are the same person. Decide benefits eligibility or catch cross-program fraud and you need those records [linked](/glossary/record-linkage), and the systems holding them have nothing exact to connect on.

## Why agency systems have nothing to join on

The gap is structural, not a data-quality slip you can clean away. A tax system and a procurement database were built decades apart for different purposes, so names are entered in different order and a vendor is a "doing business as" name in one place and a registered legal name in another. No exact join survives that, and one fixed cutoff cannot be right for all of it.
The cost shows up in public reports. Eligibility decided on fragmented records pays the same person twice or wrongly denies someone, and contractor spend cannot be aggregated for oversight.

## Linking residents, vendors, and contractors

The link has to be learned rather than looked up in an ID that does not exist, which is what a trained matcher does. [Blocking](/glossary/blocking) narrows the candidate pairs, and a matcher learns from about 600 labeled pairs out of your own records which agreements mean one person and which do not, with [active learning](/blog/active-learning-for-entity-matching) keeping the labeling light. You train a separate model for residents and another for vendors, since each is keyed and written differently.

## Keeping resident data inside your boundary

MadMatcher runs on infrastructure you control, so resident data stays inside the residency and access controls you are accountable to, with nothing copied to an outside service. As programs and records grow, the same trained matcher [re-runs at scale](/blog/run-entity-matching-in-your-own-infrastructure) without anyone rebuilding the rules.

[How matching works →](/how-it-works) · [Why a trainable matcher →](/why-madmatcher) · [Talk to us →](/contact)

## Retail & e-commerce

Retail data comes apart along two seams at once. A shopper buys in store on a loyalty card and online under a different email, and a marketplace order comes in under a name that ties back to neither, so one customer reads as three. The same physical product arrives from a supplier feed and a marketplace listing, each with its own title and attributes, so one product reads as several. Personalization and inventory both depend on pulling those back together with [entity matching](/glossary/entity-matching), and nothing joins cleanly across the feeds.

## Why neither customers nor products join cleanly

Both seams defeat a plain join and a fixed cutoff, for different reasons.
On the customer side the channels share no key, just a nickname on the loyalty card and an email that changed with a new job. On the product side a supplier writes a spec-heavy title and a marketplace writes a keyword-stuffed one, and the GTIN that should anchor the item is missing as often as not. A single similarity cutoff merges distinct items at one setting and leaves duplicates at the next. The cost shows up in revenue and operations. Personalization reaches the same person more than once, and assortment analysis is wrong when an acquired brand's catalog will not merge.

## How MadMatcher unifies customers and products

MadMatcher handles both as the separate problems they are, a model for customers and a model for products, each trained on about 600 labeled pairs from your own data, with [active learning](/blog/active-learning-for-entity-matching) keeping the labeling light. The customer model learns cross-channel identity, tying the in-store card to the web email without a shared key, while the product model learns that a reworded title with a matching GTIN is one item. Add a new brand or supplier feed and the trained matcher takes it on.

## How it runs at retail scale

It all runs [inside your own environment](/blog/run-entity-matching-in-your-own-infrastructure), on Apache Spark for catalogs and customer bases that reach into the hundreds of millions of records per table, so retail data never leaves your perimeter.

[How matching works →](/how-it-works) · [Compare approaches →](/compare) · [Talk to us →](/contact)

## Travel & hospitality

A guest reaches a hotel through a different door every time, and each door keeps its own record. A booking through an online travel agency comes in with a masked email and whatever name the agency stored, while a direct reservation on the brand site has the real details. Loyalty has a member record that may connect to neither.
So one guest becomes several, and the single profile that loyalty and personalization assume is not there. Building it is an [entity matching](/glossary/entity-matching) problem.

## Why exact joins fail across booking channels

The channels make matching harder on purpose, which is why a plain join finds nothing and a fixed cutoff cannot cover it. The agency masks the email, and a booking flow asks for less than a loyalty signup. So an exact join has nothing to land on, and a single similarity cutoff cannot be right for a masked email and a truncated name at once. You feel the result at the front desk. A returning member books through an agency and checks in as a stranger, and lifetime value reads low because one guest is counted as several.

## How MadMatcher unifies guest profiles

MadMatcher learns the patterns instead of guessing a cutoff. [Blocking](/glossary/blocking) narrows the candidate pairs, and a model trains on about 600 labeled pairs from your own bookings. It learns that a masked agency email with a matching name and phone is the same guest, with [active learning](/blog/active-learning-for-entity-matching) keeping the labeling small. It connects the agency booking and the direct reservation to the loyalty profile [without a shared key](/glossary/record-linkage).

## Keeping guest data in your region

It runs inside your own environment, which keeps guest data within your regional privacy obligations, and a new property's history or a new channel feed reuses the same trained matcher [without moving anything outside](/blog/run-entity-matching-in-your-own-infrastructure).

[How matching works →](/how-it-works) · [Why a trainable matcher →](/why-madmatcher) · [Talk to us →](/contact)

---

# Docs

## Quickstart

This is a minimal end-to-end pipeline on Apache Spark: block two tables with Sparkly, then train and apply a matcher with MatchFlow. The code mirrors the example scripts in each repo.
See [sparkly/examples](https://github.com/MadMatcher/sparkly/tree/main/examples) and [MatchFlow/examples](https://github.com/MadMatcher/MatchFlow/tree/main/examples) for the full runnable scripts.

> Prerequisite: install the packages first. Sparkly requires PyLucene, which is built from
> source and cannot be pip installed. Complete the [Installation](/docs/installation/) steps
> (including PyLucene) before running anything below.

Both tables need an `_id` column with unique values. Sparkly's `_id` must be a 32- or 64-bit integer.

## 1. Block with Sparkly

Sparkly indexes one table, then searches it with the other to return the top-k candidate pairs per record. This mirrors [basic_example.py](https://github.com/MadMatcher/sparkly/blob/main/examples/basic_example.py).

```python
from pyspark.sql import SparkSession
from sparkly.index import LuceneIndex
from sparkly.index_config import IndexConfig
from sparkly.search import Searcher
from sparkly.utils import check_tables_manual

# number of candidates returned per record
limit = 50

spark = (
    SparkSession.builder
    .master('local[*]')
    .appName('MadMatcher Quickstart')
    .getOrCreate()
)

table_a = spark.read.parquet('table_a.parquet')  # table to index
table_b = spark.read.parquet('table_b.parquet')  # table to search with

# validate the id columns before any other Sparkly operation
check_tables_manual(table_a, '_id', table_b, '_id')

# index table A on its 'name' field, tokenized into 3-grams
config = IndexConfig(id_col='_id')
config.add_field('name', ['3gram'])
index = LuceneIndex('/tmp/example_index/', config)
index.upsert_docs(table_a)

# query spec that searches all indexed fields, then search with table B
query_spec = index.get_full_query_spec()
searcher = Searcher(index)
candidates = searcher.search(table_b, query_spec, id_col='_id', limit=limit)
```

`candidates` is a Spark DataFrame rolled up per search record: an `id2` (the table B record) and an `id1_list` (the matching table A ids).
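To make the rolled-up format concrete, here is a minimal sketch that flattens `(id2, id1_list)` rows into individual candidate pairs. It uses plain Python dictionaries as a stand-in for the Spark DataFrame, the ids are invented, and `explode_candidates` is a hypothetical helper for illustration, not a Sparkly or MatchFlow function.

```python
# Stand-in for the rolled-up blocking output: one row per table B record.
# The ids below are invented for illustration.
candidate_rows = [
    {"id2": 10, "id1_list": [1, 4, 7]},
    {"id2": 11, "id1_list": [2]},
    {"id2": 12, "id1_list": []},  # a search record with no candidates
]


def explode_candidates(rows):
    """Flatten rolled-up rows into (id1, id2) candidate pairs.

    Hypothetical helper, not part of Sparkly or MatchFlow.
    """
    return [(id1, row["id2"]) for row in rows for id1 in row["id1_list"]]


pairs = explode_candidates(candidate_rows)
print(pairs)
```

Each `id1_list` holds at most `limit` ids, so under the settings above the candidate set is capped at 50 pairs per table B record.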
MatchFlow consumes this `(id2, id1_list)` format directly.

## 2. Match with MatchFlow

MatchFlow builds feature vectors, labels a sample with active learning, trains a classifier, then applies it to the full candidate set. This mirrors [matchflow_spark_local.py](https://github.com/MadMatcher/MatchFlow/blob/main/examples/spark-local-examples/matchflow_spark_local.py).

```python
from xgboost import XGBClassifier
from MatchFlow import (
    create_features, featurize, down_sample, create_seeds,
    label_data, train_matcher, apply_matcher,
    check_tables, check_candidates,
)
from MatchFlow import SKLearnModel, CLILabeler

# keep only the columns MatchFlow needs from the blocking output
candidates = candidates.select('id2', 'id1_list')

# validate inputs before any core MatchFlow function
check_tables(table_a, table_b)
check_candidates(candidates, table_a, table_b)

# create a candidate feature set from the table schemas and data
features = create_features(
    A=table_a,
    B=table_b,
    a_cols=['name'],
    b_cols=['name'],
)

# convert each candidate pair into a feature vector
feature_vectors = featurize(
    features=features,
    A=table_a,
    B=table_b,
    candidates=candidates,
    output_col='feature_vectors',
    fill_na=0.0,
)

# take a sample of the feature vectors for active learning
downsampled_fvs = down_sample(
    fvs=feature_vectors,
    percent=0.3,
    search_id_column='_id',
    score_column='score',
    bucket_size=1000,
)

# label by hand with the CLI labeler (a WebUILabeler is also available)
labeler = CLILabeler(a_df=table_a, b_df=table_b, id_col='_id')

seeds = create_seeds(
    fvs=downsampled_fvs,
    nseeds=50,
    labeler=labeler,
    score_column='score',
)

model = SKLearnModel(
    model=XGBClassifier,
    eval_metric='logloss',
    objective='binary:logistic',
    max_depth=6,
    seed=42,
    nan_fill=0.0,
)

# active learning in batch mode to label more pairs
labeled_data = label_data(
    model=model,
    mode='batch',
    labeler=labeler,
    fvs=downsampled_fvs,
    seeds=seeds,
    batch_size=10,
    max_iter=50,
)

# train the matcher, then apply it to every candidate pair
trained_model = train_matcher(
    model=model,
    labeled_data=labeled_data,
    feature_col='feature_vectors',
    label_col='label',
)

predictions = apply_matcher(
    model=trained_model,
    df=feature_vectors,
    feature_col='feature_vectors',
    prediction_col='prediction',
    confidence_col='confidence',
)
```

If you already have labeled pairs, skip active learning and use passive learning. If all matches are known, the `GoldLabeler` can simulate labeling to measure precision and recall. See the [Spark examples](https://github.com/MadMatcher/MatchFlow/tree/main/examples/spark-local-examples).

The matcher can use any Scikit-Learn or PySpark MLlib classifier (XGBoost above, Random Forest, and so on).

## Next

- Run it on your own tables. MatchFlow ships a `GoldLabeler` for computing precision and recall against known matches.
- Use [Delex](https://github.com/MadMatcher/delex) when one blocking strategy is not enough.
- Read [how it works](/how-it-works) for the reasoning behind each step.

## Installation

MadMatcher's three open-source tools are Python packages: Sparkly and Delex do blocking, and MatchFlow does matching. Sparkly and Delex depend on PyLucene (a Java/JCC build). MatchFlow runs on Spark or pandas without it. The packages are not yet on PyPI, so install them from GitHub.

The order below is deliberate. PyLucene is the one piece that cannot be pip installed, so set it up first, then add the packages that use it.

## Requirements

- Python 3 (the repo install guides test on Python 3.12)
- Java Temurin 17 JDK (Spark needs Java; Sparkly and Delex also need it for PyLucene)
- A C++ compiler (g++ on Linux, the Xcode command line tools on macOS) for PyLucene
- pip

## Step 1. Install PyLucene (required for blocking)

PyLucene is not on PyPI and cannot be installed with pip. It uses JCC to compile Lucene's Java code into a C-extension that Python can call, so the build needs a C++ compiler, a JDK, and JCC. The exact steps are OS-specific. At a high level:

1. Install a C++ compiler (g++ on Linux, the Xcode command line tools on macOS).
2. Install the Java Temurin 17 JDK.
3. Pin `setuptools==70.3.0` before building JCC. Newer setuptools removed the `pkg_resources.extern.packaging` module JCC's shared-mode build relies on, so the PyLucene `make` step otherwise fails with `JCC was not built with --shared mode support`.
4. Download and unpack PyLucene 9.12.0, build and install JCC from its `jcc` subdirectory, then build and install PyLucene with `make`.
5. Verify with `python3 -c "import lucene; print(lucene.VERSION)"`, which should print `9.12.0`.

The repos test PyLucene 9.12.0 with Java Temurin 17 and Python 3.12 (macOS guides on Apple M1). Follow the exact guide for your OS:

- Sparkly: [Java, JCC, and PyLucene](https://github.com/MadMatcher/sparkly/blob/main/doc/install-java-jcc-pylucene.md), [Linux](https://github.com/MadMatcher/sparkly/blob/main/doc/install-single-machine-linux.md), [macOS](https://github.com/MadMatcher/sparkly/blob/main/doc/install-single-machine-macOS.md), [why PyLucene](https://github.com/MadMatcher/sparkly/blob/main/doc/why-pylucene.md)
- Delex: [Linux](https://github.com/MadMatcher/delex/blob/main/doc/installation-guides/install-linux-single-machine.md), [macOS](https://github.com/MadMatcher/delex/blob/main/doc/installation-guides/install-macos-single-machine.md)

## Step 2. Install Sparkly and Delex (blocking)

With PyLucene in place, install the blocking packages from GitHub. The pip step pulls in everything except Java, JCC, and PyLucene, which you installed above. Delex builds on Sparkly, so install Sparkly first.

```bash
pip install git+https://github.com/MadMatcher/sparkly.git@main
pip install git+https://github.com/MadMatcher/delex.git@main
```

Reach for Delex when one blocking strategy is not enough.

## Step 3. Install MatchFlow (matching)

MatchFlow has no PyLucene dependency.
Install it from GitHub:

```bash
pip install git+https://github.com/MadMatcher/MatchFlow.git@main
```

This installs MatchFlow and its dependencies (Flask, Joblib, Numba, Numpy, Pandas, Pyarrow, Py_Stringmatching, PySpark, Scikit-Learn, Scipy, Streamlit, Xgboost, and others). MatchFlow runs on pandas on a single machine, on Spark on a single machine, or on Spark on a cluster. For Spark it needs the Java Temurin 17 JDK. See the [MatchFlow single-machine guide](https://github.com/MadMatcher/MatchFlow/blob/main/docs/installation-guides/install-single-machine.md).

## Verify

```python
import lucene     # from PyLucene; should import without error
import sparkly    # blocking
import MatchFlow  # matching
```

Then run the [Quickstart](/docs/quickstart/).

---

# Blog

## How many labels does a learned matcher need?

A learned matcher usually needs about 600 labeled pairs, not tens of thousands, as long as you label the right pairs. That is the number we suggest planning for. Active learning is what makes it possible, because it asks you to label only the pairs the model is unsure about, so a small budget goes a long way.

## The practical answer: about 600, not tens of thousands

In practice a [supervised matcher](/glossary/matching) reaches a strong model from about 600 labeled pairs.

The instinct from general machine learning is that you need a large labeled set, and if you label a random sample, that instinct is right, because matching data is heavily imbalanced. In any candidate set, obvious non-matches dominate and the easy cases are common. A random sample is almost all easy pairs the model would have gotten right anyway, so you label thousands of examples and learn very little.

The fix is not more labels, it is better-chosen labels. When the labeling is targeted, about 600 pairs is enough to train a matcher that generalizes across the full candidate set, even one with [hundreds of millions of pairs](/blog/entity-resolution-at-scale-on-spark).
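The gap between a random sample and a targeted one can be sketched in a few lines. The example below uses a synthetic candidate set with invented match scores and hypothetical helper names. It illustrates uncertainty sampling in general under those assumptions, not MatchFlow's own labeler.

```python
import random

# Synthetic candidate set (invented numbers): mostly obvious non-matches,
# some obvious matches, and a thin band of ambiguous pairs around 0.5.
scores = (
    [0.01] * 9000                               # obvious non-matches
    + [0.99] * 800                              # obvious matches
    + [0.40 + 0.001 * i for i in range(200)]    # ambiguous: 0.400 to 0.599
)


def ambiguity(score):
    """Distance from the 0.5 decision boundary (smaller means less certain)."""
    return abs(score - 0.5)


def is_ambiguous(score):
    """True for pairs inside the ambiguous band around the boundary."""
    return ambiguity(score) <= 0.1


budget = 100  # number of pairs we are willing to label in this round

# Random labeling: draw pairs uniformly from the whole candidate set.
rng = random.Random(0)
random_batch = rng.sample(scores, budget)

# Targeted labeling: take the pairs closest to the decision boundary.
targeted_batch = sorted(scores, key=ambiguity)[:budget]

random_hits = sum(is_ambiguous(s) for s in random_batch)
targeted_hits = sum(is_ambiguous(s) for s in targeted_batch)

print(f"ambiguous pairs in random batch:   {random_hits} of {budget}")
print(f"ambiguous pairs in targeted batch: {targeted_hits} of {budget}")
```

In this toy setup only 2% of the pairs are ambiguous, so a random batch spends almost all of its budget on easy pairs, while the targeted batch spends all of it near the boundary.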
## Why active learning selects informative pairs [Active learning](/glossary/active-learning) works because the informative pairs are the ones near the model's decision boundary, and it goes after them on purpose. A pair the model is already sure about teaches it almost nothing. A pair it is unsure about sits where the match and non-match cases blur together, and resolving it sharpens the boundary the most. So instead of labeling a random sample, the model picks the pairs it wants labeled, choosing the ones nearest its current decision boundary. Each label is spent where it changes the model most. Those boundary pairs are also the ones that set where [precision and recall trade off](/blog/precision-and-recall-in-entity-matching), so targeted labeling sharpens the model right where your operating point will sit. ## The labeling loop in MatchFlow In MatchFlow the loop is sample, label, train, repeat. The candidate set after [blocking](/glossary/blocking) is far too large to run active learning over directly, so MatchFlow first takes a manageable sample of it with `down_sample`. It then surfaces the most informative pairs from that sample and asks you to label them through a CLI or a web labeler. Each round you label a small batch, the model retrains, and it picks the next batch of unclear pairs. After a few rounds you have about 600 labels and a trained classifier. That classifier is a standard scikit-learn or PySpark MLlib model, which you then apply to the full candidate set. The labeling is an afternoon at a keyboard, not a labeling contract. For the mechanics in more detail, see [active learning for entity matching](/blog/active-learning-for-entity-matching). ## Bring your own labels If you already have labeled pairs, you can hand them straight to the matcher and skip or shorten the active-learning loop. 
Many teams have prior match decisions sitting in their data, like a [customer 360](/use-cases/customer-360) effort with a history of manual merges, or a [compliance](/use-cases/compliance-screening) workflow where analysts have already adjudicated cases. Those decisions are training labels. You can seed the model with them and train directly, or use active learning only to fill in coverage where your existing labels are thin. ## How this compares to other approaches About 600 targeted labels is a different bet than the two extremes of the matching landscape. Label-hungry neural methods can be accurate, but they demand large labeled sets, which puts the labeling cost up front and stalls many projects before they start. At the other end, zero-shot prompting and fixed pretrained models ask for no labels at all, but they do not learn the signals specific to your domain, so on messy real-world data their precision and recall on your records are whatever the general model happens to give you. Active learning sits in the middle by design. A small, cheap labeling effort buys a model tuned to your data and your error costs. You get the domain fit of a supervised approach without the label-hunting bill, and a model you control rather than one you prompt and hope works. If you want to estimate the labeling effort for your data, or you already have labels and want to put them to work, [talk to the team about your data](/contact) or see [how matching works](/how-it-works). ## Entity matching at scale Resolving a few thousand records is a spreadsheet task. Resolving hundreds of millions is a different problem, because the number of possible pairs grows with the square of the number of records. A 10× bigger dataset is a 100× bigger comparison problem. ## What changes at scale Two things stop working: - Comparing all pairs becomes impossible, so aggressive, high-recall blocking is no longer optional. It's the only way the job exists. 
- The candidate set and feature vectors outgrow one machine's memory, so the work has to be split across many. ## Where Spark fits Sparkly does blocking on Spark: it builds a Lucene index over one table and runs top-k search distributed across a cluster, scaling to hundreds of millions of tuples per table. MatchFlow handles the matching side, including the fact that featurizing and labeling hundreds of millions of pairs isn't trivially parallel. For example, active learning runs on a constructed sample of the candidate set rather than the whole thing. MatchFlow runs three ways: pandas on a single machine for small data, Spark on a single machine for testing, and Spark on a cluster for large data (roughly 5M+ tuples per table). You start small and move to a cluster when the data demands it. ## Batch, not real-time This is batch matching: you resolve a table, or an incremental slice, as a job. For most data-integration and deduplication work that's the right shape. If you need sub-second matching per request, that's a different system. [Why MadMatcher runs in your infrastructure →](/why-madmatcher) ## Why run entity matching in your own infrastructure Run entity matching in your own infrastructure when the records you are matching are sensitive. Shipping them to a vendor cloud creates a data-residency and compliance problem you then have to solve. MadMatcher runs inside your own Apache Spark cluster or on a single machine, so the data never leaves your perimeter. ## What "no data egress" actually buys you No data egress means the records you match never cross your network boundary, and that one property removes a whole category of work. The data you feed an [entity matching](/glossary/entity-matching) pipeline is usually your most sensitive, the PII and financial records you already guard most carefully. Send a copy to a third-party service and you have created a new system that stores that data. 
That means a new vendor security review and a new place a breach or a subpoena can reach your customers' records. Keep the computation inside your own environment and most of that does not arise. There is no copy of the data at a vendor to review or breach, and the encryption and access controls you already run apply unchanged, because the job runs where those controls already live. ## How it helps with compliance and security review Running matching in place makes a regulatory or security review much shorter, because the hardest questions are about where data goes, and the answer is nowhere. Rules like HIPAA and GDPR residency requirements turn on the location and movement of regulated data. When the data does not leave your boundary, the matching system inherits the compliance posture of the environment it runs in, instead of forcing a fresh assessment of an outside processor. This matters most in the domains where matching is valuable. [Healthcare record linkage](/use-cases/healthcare-record-linkage) joins patient records under HIPAA, and [compliance screening](/use-cases/compliance-screening) matches against watchlist data under rules that discourage exporting it. In-perimeter execution lets you do the matching without exporting the regulated data to do it. ## Cost and operational control Keeping the data local also removes egress cost and the network round-trips of moving it out and back. Moving hundreds of millions of records to a vendor and back is slow, and on most clouds it is billed by the gigabyte. Doing the work next to where the data already sits means no bulk transfer and no third-party outage sitting between you and your pipeline. You also keep operational control. You choose the cluster size and the schedule, and you can run on hardware you have already paid for. 
For long-running production jobs there is [MadMatcher-Pro](/products), which adds crash recovery and progress tracking so a multi-hour run survives a failure instead of restarting from scratch, all still inside your own environment. ## How it composes into your existing stack MadMatcher is built to drop into the data platform you already run, not to replace it. The core is open source: Sparkly and Delex for [blocking](/glossary/blocking), and [MatchFlow](/products#matchflow) for [matching](/glossary/matching). It runs on [Apache Spark](/blog/entity-resolution-at-scale-on-spark) for large data and on pandas on a single machine for smaller jobs, so it sits on the same compute you use for the rest of your pipeline. It reads your tables and writes match results back as a batch job, fitting alongside your warehouse and orchestration rather than asking you to move data into a new silo. That extends to the matcher itself. MatchFlow trains a standard scikit-learn or PySpark MLlib classifier on your labeled pairs, and [active learning](/glossary/active-learning) means you reach a good model from [about 600 labels](/blog/how-many-labels-does-a-matcher-need) without exporting the data to label it. You bring the data and the labels, and the pipeline stays inside your walls. If your records are too sensitive to ship to a vendor, that is the case MadMatcher is built for. [Talk to the team about your data](/contact), or read [how it works](/how-it-works) to see how the pieces fit your environment. ## Active learning for entity matching A supervised matcher needs labeled examples. Labeling tens of thousands of pairs by hand is slow and expensive, and it's where a lot of matching projects stall. Active learning is how you avoid it. ## What it does Instead of labeling a large random sample, the model picks which pairs it wants labeled. For entity matching, that means the pairs near the decision boundary, the ones it's most unsure about. 
A pair the model is already confident about teaches it little. An ambiguous one teaches it a lot. Most record pairs are easy: obvious matches and obvious non-matches dominate any dataset, and labeling them is wasted effort. The ambiguous pairs are rare but informative. Active learning spends your labeling budget on those. ## In MatchFlow MatchFlow's labeling step uses this. It can't run active learning directly on the full candidate set (often hundreds of millions of pairs), so it first takes a sample with `down_sample`, then iteratively asks you to label informative pairs with a CLI or web labeler. In practice this is about 600 labels rather than tens of thousands. You then train a classifier (XGBoost, Random Forest, or any scikit-learn / PySpark MLlib model) on those labels and apply it to the full candidate set. [How matching works →](/how-it-works) ## Precision and recall in entity matching: choosing your operating point In entity matching, precision is the share of your predicted matches that are real, and recall is the share of real matches you actually found. Every pipeline sits somewhere on the trade-off between the two. Where it should sit depends on what a wrong merge costs you and what a missed match costs you. ## What precision and recall mean for matching Precision and recall count two different mistakes. A pair of records either refers to the same [entity](/glossary/entity-matching) or it does not, and your pipeline either calls it a match or it does not. Precision asks how many of the pairs you called matches were real. Low precision means false merges, where two different people or products collapse into one record. Recall asks how many of the real matches you caught. Low recall means missed links, where one entity stays split across rows. You cannot push both to their maximum at once. Make the matcher call more pairs matches and recall rises while precision falls. Make it stricter and the reverse happens. 
The question is never whether the pipeline is accurate in the abstract. It is where on that curve you want to sit, and that choice is your operating point. ## Why blocking is tuned for recall [Blocking](/glossary/blocking) is tuned almost entirely for recall, because a pair it drops can never come back. Comparing every pair of records is quadratic, so a table with hundreds of millions of rows has far too many possible pairs for a [matcher](/glossary/matching) to score. Blocking produces a smaller set of plausible pairs and throws the rest away. Whatever it throws away is gone, and no amount of matcher tuning downstream recovers a true pair that blocking already dropped. So blocking sets the recall ceiling for the whole pipeline. That is why it aims for high recall and tolerates loose precision. It is fine for the candidate set to hold plenty of non-matches, because the matcher filters those out. Sparkly does this with top-k TF/IDF similarity, and it [outperforms eight state-of-the-art blockers (VLDB 2023)](https://www.vldb.org/pvldb/vol16/p1507-paulsen.pdf) on exactly this recall-at-scale problem. For more on how the two stages split the work, see [blocking vs. matching](/blog/blocking-vs-matching). ## Where the matcher sits on the trade-off The matcher is where the precision and recall balance is actually set. It runs only on the candidate pairs blocking kept, so it can afford to be thorough. In MatchFlow the matcher is a machine-learning model, usually a gradient-boosted classifier (XGBoost), trained on your labeled pairs to predict whether two records are the same entity. It learns that decision from your own data rather than from a fixed similarity cutoff, so it picks up the signals that actually separate your entities, like a matching tax ID counting for more than a shared common name. Because it is trained and measured on held-out pairs, you can read its precision and recall and aim for the balance your costs call for. 
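Reading precision and recall off held-out pairs can be sketched as a sweep over confidence thresholds. This assumes the operating point is set as a cutoff on the matcher's confidence, which is one common way to do it. The held-out labels, confidences, and costs below are invented, and the helper functions are hypothetical rather than part of MatchFlow.

```python
# Invented held-out pairs: (matcher confidence, is it really a match?)
held_out = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, False),
    (0.70, True), (0.60, False), (0.55, True), (0.40, False),
    (0.30, True), (0.20, False), (0.10, False), (0.05, False),
]


def confusion_at(pairs, threshold):
    """Count (true positives, false positives, false negatives) when
    every pair with confidence >= threshold is called a match."""
    tp = sum(1 for conf, match in pairs if conf >= threshold and match)
    fp = sum(1 for conf, match in pairs if conf >= threshold and not match)
    fn = sum(1 for conf, match in pairs if conf < threshold and match)
    return tp, fp, fn


def precision_recall_at(pairs, threshold):
    """Precision and recall at one operating point."""
    tp, fp, fn = confusion_at(pairs, threshold)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def cheapest_threshold(pairs, thresholds, wrong_merge_cost, missed_match_cost):
    """Pick the threshold with the lowest total error cost."""
    def cost(threshold):
        _, fp, fn = confusion_at(pairs, threshold)
        return fp * wrong_merge_cost + fn * missed_match_cost
    return min(thresholds, key=cost)


thresholds = [0.25, 0.50, 0.75]
for t in thresholds:
    p, r = precision_recall_at(held_out, t)
    print(f"threshold {t:.2f}: precision {p:.2f}, recall {r:.2f}")

# Wrong merges are expensive: prefer the strict end of the curve.
strict = cheapest_threshold(held_out, thresholds, wrong_merge_cost=5, missed_match_cost=1)
# Missed matches are expensive: prefer the permissive end.
permissive = cheapest_threshold(held_out, thresholds, wrong_merge_cost=1, missed_match_cost=5)
print(strict, permissive)
```

Lowering the threshold raises recall and lowers precision on this data, and the cheapest threshold moves with the relative cost of the two errors.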
The one limit it cannot cross is the recall ceiling blocking already set. It only ever rules on the pairs blocking kept, so it can never recover a true match that blocking discarded. This is why MatchFlow trains a supervised classifier instead of relying on a fixed rule. A model that learns your domain separates matches from non-matches more cleanly, which lifts the whole precision and recall curve and gives you better operating points to choose from. You still pick the point. The model makes every point better. ## How to choose your operating point Pick the operating point by weighing the cost of a wrong merge against the cost of a missed match. The two costs are rarely equal, and in some domains both are high. [Healthcare record linkage](/use-cases/healthcare-record-linkage) is the hard case. A wrong merge can attach the wrong allergy or medication history to a chart, and a missed link scatters one patient across records and hides earlier care and known allergies. Both directions cause harm, so you set the operating point deliberately and lean by workflow rather than treat either error as free. In [compliance screening](/use-cases/compliance-screening), missing a real match to a watchlist is the expensive failure, so recall dominates and reviewers absorb the extra false positives. Other domains, like a [customer 360](/use-cases/customer-360), sit softer, and you tune by feel. To make this concrete, hold out a labeled set and read the model's precision and recall across its operating points, with the cost of each kind of error in mind. Then choose the point where the expected cost is lowest, rather than accepting a default. Measure recall on blocking and precision and recall on matching separately, since they are different stages with different jobs. ## How active learning relates Active learning improves the curve you are choosing on, not the choice itself. 
A learned matcher needs labeled pairs, and [active learning](/glossary/active-learning) gets you a good model from [about 600 labels](/blog/how-many-labels-does-a-matcher-need) by asking you to label only the pairs near the model's decision boundary. Those boundary pairs are the ones that set where precision and recall trade off, so labeling them sharpens the model right where your operating point will sit. You train on those labels, run the trained model over the full candidate set, and choose your operating point on the result.

Choosing an operating point is a judgment about your data and your costs, and it is one we make with customers regularly. [Talk to the team about your data](/contact) if you want help picking yours, or see [how the pipeline fits together](/how-it-works).

## Blocking vs. matching

Entity matching has two steps, and they solve different problems. Blocking decides which record pairs are worth comparing. Matching decides which of those pairs are real matches.

## Blocking

Comparing every pair of records is quadratic. A million records compared against each other is about half a trillion pairs, so you can't run a matcher on all of them. Blocking produces a smaller candidate set of pairs that might match, and discards the rest.

Blocking is judged on recall: of all true matches, how many made it into the candidate set? A pair that blocking drops can't be recovered later, so blocking sets the recall ceiling for the whole pipeline. Sparkly does blocking with top-k TF/IDF similarity and was benchmarked on this in the [VLDB 2023 paper](https://www.vldb.org/pvldb/vol16/p1507-paulsen.pdf).

## Matching

Matching runs only on the candidate pairs blocking kept, so it can afford a more expensive, more precise comparison per pair. It classifies each pair as a match or not. It's judged on precision and recall: of the pairs it called matches, how many were real, and of the real matches in the candidate set, how many it caught.

MatchFlow does matching with a supervised classifier trained on labeled pairs.
Because it learns from your data, it adapts to the specific signals that separate matches in your domain.

## Why the split matters

If you skip blocking and lean on the matcher, the job doesn't finish: there are too many pairs. If you treat blocking as an afterthought (a couple of hand-written join keys), you silently drop true matches that don't share those keys, and no amount of matcher tuning brings them back.

Build them as two steps, and measure recall on blocking and precision and recall on matching separately.

[How the pipeline fits together →](/how-it-works)
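Measuring the two stages separately can be sketched as follows. The gold matches, candidate pairs, and predictions are invented, and the helper names are hypothetical. The point is only that blocking recall caps what the pipeline as a whole can recover.

```python
# Invented example data: every pair is an (id1, id2) tuple.
gold_matches = {(1, 10), (2, 11), (3, 12), (4, 13)}           # all true matches
candidates = {(1, 10), (2, 11), (3, 12), (5, 10), (6, 11)}    # blocking output
predicted = {(1, 10), (2, 11), (5, 10)}                       # matcher output


def recall(found, truth):
    """Share of the true pairs that appear in `found`."""
    return len(found & truth) / len(truth)


def precision(found, truth):
    """Share of the pairs in `found` that are true."""
    return len(found & truth) / len(found)


# Stage 1: blocking is scored on recall against every true match.
blocking_recall = recall(candidates, gold_matches)

# Stage 2: the matcher is scored on the matches that survived blocking.
reachable = gold_matches & candidates
matcher_precision = precision(predicted, gold_matches)
matcher_recall = recall(predicted, reachable)

# End to end, recall can never exceed what blocking let through.
end_to_end_recall = recall(predicted, gold_matches)

print(f"blocking recall:   {blocking_recall:.2f}")
print(f"matcher precision: {matcher_precision:.2f}")
print(f"matcher recall:    {matcher_recall:.2f}")
print(f"end-to-end recall: {end_to_end_recall:.2f}")
```

Here the pair `(4, 13)` is dropped by blocking, so it is missing from the end-to-end result no matter how the matcher is tuned.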