Quick Start
This guide will help you get started with pl-fuzzy-frame-match in just a few minutes.
Basic Example
import polars as pl
from pl_fuzzy_frame_match import fuzzy_match_dfs, FuzzyMapping
import logging
# Set up logger
logger = logging.getLogger(__name__)
# Create sample dataframes
left_df = pl.DataFrame({
"company_name": ["Apple Inc", "Microsoft Corporation", "Google LLC"],
"company_id": [1, 2, 3]
}).lazy()
right_df = pl.DataFrame({
"vendor_name": ["Apple", "Microsoft Corp", "Alphabet/Google"],
"vendor_code": ["A001", "M001", "G001"]
}).lazy()
# Define fuzzy matching
fuzzy_maps = [
FuzzyMapping(
left_col="company_name",
right_col="vendor_name",
threshold_score=70.0, # 70% similarity
fuzzy_type="jaro_winkler"
)
]
# Perform matching
result = fuzzy_match_dfs(
left_df=left_df,
right_df=right_df,
fuzzy_maps=fuzzy_maps,
logger=logger
)
print(result)
Understanding the Results
The output dataframe will contain:
All columns from both input dataframes
A fuzzy score column (e.g.,
fuzzy_score_0
) with similarity scores between 0 and 1Only matches that meet or exceed your threshold score
Available Algorithms
levenshtein: Edit distance (insertions, deletions, substitutions)
jaro: Good for short strings
jaro_winkler: Enhanced Jaro, excellent for names
hamming: For equal-length strings
damerau_levenshtein: Includes transpositions
indel: Insertion/deletion distance only
Choosing the Right Algorithm
Names: Use
jaro_winkler
Addresses: Use
levenshtein
Codes/IDs: Use
hamming
(if same length) orlevenshtein
General text: Use
levenshtein
ordamerau_levenshtein
Next Steps
See Examples for more complex use cases
Check the API Reference for detailed function documentation
Read about performance optimization for large datasets