Sparkly
Generates candidate matches by top-k TF/IDF similarity, using the BM25 variant as implemented in Lucene. It indexes the smaller table and queries that index with the tuples of the other table, scaling to hundreds of millions of tuples per table on Spark.
- Top-k TF/IDF candidate generation (BM25 via Lucene)
- Indexes the smaller table, searches it with the other on Spark
- Scales to hundreds of millions of tuples per table
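To illustrate the idea of top-k BM25 candidate generation, here is a minimal single-machine sketch in plain Python. It is not Sparkly's API and does not use Lucene or Spark: the names `BM25Index`, `top_k`, `block`, and `tokenize` are hypothetical, the whitespace tokenizer is a simplification, and `k1=1.2`, `b=0.75` are commonly used BM25 defaults assumed here for illustration.

```python
import heapq
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Simplifying assumption: lowercase whitespace tokens.
    return text.lower().split()


class BM25Index:
    """Hypothetical in-memory inverted index with BM25 scoring."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.doc_len = {}
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        for doc_id, text in docs.items():
            tokens = tokenize(text)
            self.doc_len[doc_id] = len(tokens)
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        self.n = len(docs)
        self.avg_len = sum(self.doc_len.values()) / max(self.n, 1)

    def idf(self, term):
        # Number of indexed documents containing the term.
        df = len(self.postings.get(term, ()))
        return math.log(1 + (self.n - df + 0.5) / (df + 0.5))

    def top_k(self, query, k):
        # Accumulate BM25 scores only for documents sharing a term with the query.
        scores = defaultdict(float)
        for term in set(tokenize(query)):
            idf = self.idf(term)
            for doc_id, tf in self.postings.get(term, {}).items():
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])


def block(indexed_table, probe_table, k=10):
    """Return candidate pairs (probe_id, indexed_id).

    The caller passes the smaller table as indexed_table.
    """
    index = BM25Index(indexed_table)
    pairs = []
    for probe_id, text in probe_table.items():
        for indexed_id, _score in index.top_k(text, k):
            pairs.append((probe_id, indexed_id))
    return pairs
```

Calling `block(small, large, k=1)` on two dictionaries that map record ids to concatenated attribute text returns, for each record in `large`, its best-scoring record in `small`. Records that share no token with any indexed record produce no candidate pair.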
Paulsen et al., VLDB 16(6), 2023. Outperforms eight state-of-the-art blockers. Read the paper ↗