Quick Start

This guide will help you get started with pl-fuzzy-frame-match in just a few minutes.

Basic Example

import polars as pl
from pl_fuzzy_frame_match import fuzzy_match_dfs, FuzzyMapping
import logging

# Set up logger
logger = logging.getLogger(__name__)

# Create sample dataframes
left_df = pl.DataFrame({
    "company_name": ["Apple Inc", "Microsoft Corporation", "Google LLC"],
    "company_id": [1, 2, 3]
}).lazy()

right_df = pl.DataFrame({
    "vendor_name": ["Apple", "Microsoft Corp", "Alphabet/Google"],
    "vendor_code": ["A001", "M001", "G001"]
}).lazy()

# Define fuzzy matching
fuzzy_maps = [
    FuzzyMapping(
        left_col="company_name",
        right_col="vendor_name",
        threshold_score=70.0,  # 70% similarity
        fuzzy_type="jaro_winkler"
    )
]

# Perform matching
result = fuzzy_match_dfs(
    left_df=left_df,
    right_df=right_df,
    fuzzy_maps=fuzzy_maps,
    logger=logger
)

print(result)

Understanding the Results

The output dataframe will contain:

All columns from both input dataframes
A fuzzy score column (e.g., fuzzy_score_0) with similarity scores between 0 and 1
Only matches that meet or exceed your threshold score

Available Algorithms

levenshtein: Edit distance (insertions, deletions, substitutions)
jaro: Good for short strings
jaro_winkler: Enhanced Jaro, excellent for names
hamming: For equal-length strings
damerau_levenshtein: Includes transpositions
indel: Insertion/deletion distance only

Choosing the Right Algorithm

Names: Use jaro_winkler
Addresses: Use levenshtein
Codes/IDs: Use hamming (if same length) or levenshtein
General text: Use levenshtein or damerau_levenshtein

Next Steps

See Examples for more complex use cases
Check the API Reference for detailed function documentation
Read about performance optimization for large datasets