pl-fuzzy-frame-match

High-performance fuzzy matching for Polars DataFrames that intelligently combines exact fuzzy matching with approximate joins for optimal performance on datasets of any size.

Key Features

Dual-Mode Performance: Combines exact fuzzy matching with approximate joins
Multiple Algorithms: Support for Levenshtein, Jaro, Jaro-Winkler, Hamming, Damerau-Levenshtein, and Indel
Smart Optimization: Automatic query optimization based on data uniqueness and size
Memory Efficient: Chunked processing and intelligent caching for massive datasets
Automatic Strategy Selection: No configuration needed - automatically picks the fastest approach

Quick Example

import polars as pl
from pl_fuzzy_frame_match import fuzzy_match_dfs, FuzzyMapping

# Create sample dataframes
left_df = pl.DataFrame({
    "name": ["John Smith", "Jane Doe", "Bob Johnson"],
    "id": [1, 2, 3]
}).lazy()

right_df = pl.DataFrame({
    "customer": ["Jon Smith", "Jane Does", "Robert Johnson"],
    "customer_id": [101, 102, 103]
}).lazy()

# Define fuzzy matching configuration
fuzzy_maps = [
    FuzzyMapping(
        left_col="name",
        right_col="customer",
        threshold_score=80.0,
        fuzzy_type="levenshtein"
    )
]

# Perform fuzzy matching
result = fuzzy_match_dfs(
    left_df=left_df,
    right_df=right_df,
    fuzzy_maps=fuzzy_maps,
    logger=your_logger
)

Performance

The library automatically selects the best matching strategy based on your data size:

Dataset Size	Cartesian Product	Standard Fuzzy	Automatic Selection	Speedup
15K × 10K	150M	40.82s	1.45s	28x
40K × 30K	1.2B	363.50s	4.75s	76x

Indices and tables