Installation
Install the MadMatcher tools: Sparkly and Delex for blocking (which need PyLucene), and MatchFlow for matching.
MadMatcher’s three open-source tools are Python packages: Sparkly and Delex do blocking, and MatchFlow does matching. Sparkly and Delex depend on PyLucene (a Java/JCC build). MatchFlow runs on Spark or pandas without it. The packages are not yet on PyPI, so install them from GitHub.
The order below is deliberate. PyLucene is the one piece that cannot be installed with pip, so set it up first, then add the packages that use it.
Requirements
- Python 3 (the repo install guides test on Python 3.12)
- Java Temurin 17 JDK (Spark needs Java; Sparkly and Delex also need it for PyLucene)
- A C++ compiler (g++ on Linux, the Xcode command line tools on macOS) for PyLucene
- pip
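Before starting, it can help to confirm that the command-line prerequisites are on your PATH. The sketch below is an unofficial helper, not part of MadMatcher: the function name find_tools and the tool list are assumptions chosen for illustration, and it only checks that each executable exists, not that its version matches the list above.

```python
import shutil
import sys


def find_tools(names):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Hypothetical tool list: java for the JDK, g++ or clang++ as the
    # C++ compiler, and make for the PyLucene build.
    print("Python", ".".join(map(str, sys.version_info[:3])))
    for name, path in find_tools(["java", "g++", "clang++", "make"]).items():
        print(f"{name}: {path or 'not found'}")
```

Only one of g++ and clang++ needs to be present, depending on your OS.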
Step 1. Install PyLucene (required for blocking)
PyLucene is not on PyPI and cannot be installed with pip. It uses JCC to compile Lucene’s Java code into a C-extension that Python can call, so the build needs a C++ compiler, a JDK, and JCC. The exact steps are OS-specific. At a high level:
- Install a C++ compiler (g++ on Linux, the Xcode command line tools on macOS).
- Install the Java Temurin 17 JDK.
- Pin setuptools==70.3.0 before building JCC. Newer setuptools removed the pkg_resources.extern.packaging module JCC’s shared-mode build relies on, so the PyLucene make step otherwise fails with "JCC was not built with --shared mode support".
- Download and unpack PyLucene 9.12.0, build and install JCC from its jcc subdirectory, then build and install PyLucene with make.
- Verify with python3 -c "import lucene; print(lucene.VERSION)", which should print 9.12.0.
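Since the setuptools pin is the easiest of these steps to miss, a quick check before building JCC may be worthwhile. This is a minimal sketch using the standard library's importlib.metadata. The helper names setuptools_status and installed_setuptools are hypothetical, and the check is an exact string comparison against the pinned version named above.

```python
from importlib import metadata

PINNED_SETUPTOOLS = "70.3.0"  # version named in the steps above


def setuptools_status(installed, pinned=PINNED_SETUPTOOLS):
    """Return a short message comparing an installed version string to the pin."""
    if installed is None:
        return f"setuptools not installed; install setuptools=={pinned}"
    if installed == pinned:
        return f"setuptools {installed} matches the pin"
    return f"setuptools {installed} found; expected {pinned}"


def installed_setuptools():
    """Look up the installed setuptools version, or None if it is absent."""
    try:
        return metadata.version("setuptools")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(setuptools_status(installed_setuptools()))
```

Run it with the same interpreter you will use to build JCC, so that it inspects the right environment.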
The repos test PyLucene 9.12.0 with Java Temurin 17 and Python 3.12 (macOS guides on Apple M1). Follow the exact guide for your OS:
- Sparkly: Java, JCC, and PyLucene; Linux; macOS; why PyLucene
- Delex: Linux, macOS
Step 2. Install Sparkly and Delex (blocking)
With PyLucene in place, install the blocking packages from GitHub. The pip step pulls in everything except Java, JCC, and PyLucene, which you installed above. Delex builds on Sparkly, so install Sparkly first.
pip install git+https://github.com/MadMatcher/sparkly.git@main
pip install git+https://github.com/MadMatcher/delex.git@main
Reach for Delex when one blocking strategy is not enough.
Step 3. Install MatchFlow (matching)
MatchFlow has no PyLucene dependency. Install it from GitHub:
pip install git+https://github.com/MadMatcher/MatchFlow.git@main
This installs MatchFlow and its dependencies (Flask, joblib, Numba, NumPy, pandas, PyArrow, py_stringmatching, PySpark, scikit-learn, SciPy, Streamlit, XGBoost, and others). MatchFlow runs on pandas on a single machine, on Spark on a single machine, or on Spark on a cluster. The Spark modes need the Java Temurin 17 JDK. See the MatchFlow single-machine guide.
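If you plan to use Spark, you may want to confirm which Java version is on your PATH. The following is a sketch under stated assumptions, not an official check: the helper names parse_java_major and java_major are hypothetical, and the parser assumes the usual version "..." banner that java -version prints.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_292").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_major():
    """Run `java -version` and return the major version, or None if java is missing."""
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    # The JVM prints its version banner to stderr.
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    print("Java major version:", java_major())
```

A result of 17 matches the JDK listed in the requirements. The check does not tell you whether the JDK is the Temurin distribution.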
Verify
import lucene # from PyLucene; should import without error
import sparkly # blocking
import MatchFlow # matching
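To see which of the three packages is missing without stopping at the first failed import, a small check along these lines may help. It is a sketch, with missing_modules as a hypothetical helper name. It uses importlib.util.find_spec, which only locates a module and does not import it, so it will not catch a PyLucene build that is present but broken.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Module names taken from the import lines above.
    missing = missing_modules(["lucene", "sparkly", "MatchFlow"])
    print("All packages found" if not missing else f"Missing: {', '.join(missing)}")
```

If nothing is reported missing, the plain import statements above remain the more thorough test.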
Then run the Quickstart.