Examples ======== Simple Matching --------------- Basic fuzzy matching between two dataframes: .. code-block:: python import polars as pl from pl_fuzzy_frame_match import fuzzy_match_dfs, FuzzyMapping import logging logger = logging.getLogger(__name__) # Create sample data companies = pl.DataFrame({ "company_id": [1, 2, 3, 4, 5], "company_name": [ "Apple Inc.", "Microsoft Corporation", "Amazon.com Inc", "Google LLC", "Meta Platforms Inc" ] }).lazy() vendors = pl.DataFrame({ "vendor_id": ["V001", "V002", "V003", "V004", "V005"], "vendor_name": [ "Apple", "Microsoft Corp.", "Amazon", "Alphabet/Google", "Facebook/Meta" ] }).lazy() # Define fuzzy matching fuzzy_maps = [ FuzzyMapping( left_col="company_name", right_col="vendor_name", threshold_score=70.0, fuzzy_type="jaro_winkler" ) ] # Perform matching result = fuzzy_match_dfs( left_df=companies, right_df=vendors, fuzzy_maps=fuzzy_maps, logger=logger ) # Display results print(result.select([ "company_id", "company_name", "vendor_id", "vendor_name", "fuzzy_score_0" ]).sort("fuzzy_score_0", descending=True)) Multi-Column Matching --------------------- Match on multiple columns with different algorithms and thresholds: .. code-block:: python # Customer database customers = pl.DataFrame({ "customer_id": [1, 2, 3, 4], "name": ["John Smith", "Jane Doe", "Bob Johnson", "Alice Brown"], "address": ["123 Main St", "456 Oak Ave", "789 Pine Rd", "321 Elm St"], "city": ["New York", "Los Angeles", "Chicago", "Houston"] }).lazy() # Vendor database with potential matches vendors = pl.DataFrame({ "vendor_id": ["A", "B", "C", "D"], "vendor_name": ["Jon Smith", "Jane Do", "Robert Johnson", "Alicia Brown"], "vendor_address": ["123 Main Street", "456 Oak Avenue", "789 Pine Road", "321 Elm Street"], "vendor_city": ["New York", "Los Angeles", "Chicago", "Houston"] }).lazy() # Multi-column fuzzy matching fuzzy_maps = [ FuzzyMapping( left_col="name", right_col="vendor_name", threshold_score=85.0, fuzzy_type="jaro_winkler" ), FuzzyMapping( left_col="address", right_col="vendor_address", threshold_score=80.0, fuzzy_type="levenshtein" ), FuzzyMapping( left_col="city", right_col="vendor_city", threshold_score=95.0, fuzzy_type="jaro" ) ] # Perform matching result = fuzzy_match_dfs( left_df=customers, right_df=vendors, fuzzy_maps=fuzzy_maps, logger=logger ) # Calculate combined score result = result.with_columns( ( pl.col("fuzzy_score_0") * 0.5 + # Name weight: 50% pl.col("fuzzy_score_1") * 0.3 + # Address weight: 30% pl.col("fuzzy_score_2") * 0.2 # City weight: 20% ).alias("combined_score") ) Large Dataset Optimization -------------------------- Handling large datasets with automatic optimization: .. code-block:: python import time # For large datasets, the library automatically optimizes # Let's simulate with medium-sized data left_df = pl.DataFrame({ "id": range(10000), "text": [f"Company Name {i}" for i in range(10000)] }).lazy() right_df = pl.DataFrame({ "id": range(8000), "text": [f"Company Name {i}" for i in range(8000)] }).lazy() fuzzy_maps = [ FuzzyMapping( left_col="text", right_col="text", threshold_score=90.0, fuzzy_type="levenshtein" ) ] # Time the operation start = time.time() result = fuzzy_match_dfs( left_df=left_df, right_df=right_df, fuzzy_maps=fuzzy_maps, logger=logger ) duration = time.time() - start print(f"Matched {len(result)} records in {duration:.2f} seconds") print(f"Potential matches: {10000 * 8000:,}") Controlling Join Strategy ------------------------- You can explicitly control the join strategy: .. code-block:: python # Force approximate matching (requires polars-simed) result = fuzzy_match_dfs( left_df, right_df, fuzzy_maps, logger, use_appr_nearest_neighbor_for_new_matches=True ) # Force standard cross join result = fuzzy_match_dfs( left_df, right_df, fuzzy_maps, logger, use_appr_nearest_neighbor_for_new_matches=False ) # Let the library decide (default) result = fuzzy_match_dfs( left_df, right_df, fuzzy_maps, logger, use_appr_nearest_neighbor_for_new_matches=None )