API Reference
This section contains the complete API reference for pl-fuzzy-frame-match, automatically generated from the code docstrings.
Main Interface
pl-fuzzy-frame-match: Efficient Fuzzy Matching for Polars DataFrames.
- pl_fuzzy_frame_match.fuzzy_match_dfs(left_df, right_df, fuzzy_maps, logger, use_appr_nearest_neighbor_for_new_matches=None, top_n_for_new_matches=500, cross_over_for_appr_nearest_neighbor=100000000)[source]
Perform fuzzy matching between two dataframes using multiple fuzzy mapping configurations.
This is the main entry point function that orchestrates the entire fuzzy matching process, from pre-processing and indexing to matching and final joining.
- Parameters:
left_df (pl.LazyFrame) – Left dataframe to be matched.
right_df (pl.LazyFrame) – Right dataframe to be matched.
fuzzy_maps (list[FuzzyMapping]) – A list of fuzzy mapping configurations to apply sequentially.
logger (Logger) – Logger instance for tracking progress.
use_appr_nearest_neighbor_for_new_matches (bool | None, optional) – Controls the join strategy for generating initial candidate pairs when no prior matches exist. If True, forces an approximate nearest neighbor join; if False, forces a standard cross join; if None (default), the strategy is selected automatically based on data size. Defaults to None.
top_n_for_new_matches (int, optional) – When generating new matches with the approximate method, this specifies the maximum number of similar items to consider for each record. Defaults to 500.
cross_over_for_appr_nearest_neighbor (int, optional) – The cartesian product size threshold to automatically switch to the approximate join method when use_appr_nearest_neighbor_for_new_matches is None. Defaults to 100,000,000.
- Returns:
- The final matched dataframe containing original data from both
dataframes along with all calculated fuzzy scores.
- Return type:
pl.DataFrame
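The automatic strategy selection described above can be sketched in plain Python. This is a hypothetical helper illustrating the documented rule, not the library's internal code; the function name and exact decision logic are assumptions.

```python
def choose_join_strategy(n_left, n_right, use_appr=None, cross_over=100_000_000):
    """Sketch of the documented join-strategy selection rule."""
    if use_appr is True:
        return "approx_nearest_neighbor"
    if use_appr is False:
        return "cross_join"
    # automatic mode: switch to the approximate join once the
    # cartesian product exceeds the crossover threshold
    if n_left * n_right > cross_over:
        return "approx_nearest_neighbor"
    return "cross_join"
```

With the default crossover of 100,000,000, two 10,000-row frames (10^8 pairs) still use a cross join, while anything larger switches to the approximate method.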
- class pl_fuzzy_frame_match.FuzzyMapping(left_col, right_col=None, threshold_score=80.0, fuzzy_type='levenshtein', perc_unique=0, output_column_name=None, valid=True)[source]
Bases:
JoinMap
Represents the configuration for a fuzzy string match between two columns.
This class defines all the necessary parameters to perform a fuzzy join, including the columns to match, the specific algorithm to use, and the similarity threshold required to consider two strings a match.
It generates a default name for the output score column if one is not provided.
- Parameters:
- threshold_score
The similarity score threshold required for a match, typically on a scale of 0 to 100. Defaults to 80.0.
- Type:
float
- fuzzy_type
The string-matching algorithm to use. Defaults to “levenshtein”.
- Type:
FuzzyTypeLiteral
- perc_unique
A parameter that may be used to assess column uniqueness before performing a costly fuzzy match. Defaults to 0.0.
- Type:
float
- output_column_name
The name for the new column that will contain the calculated fuzzy match score. If None, a name is generated automatically in the format ‘fuzzy_score_{left_col}_{right_col}’.
- Type:
str | None
- valid
A flag to indicate whether this mapping is active and should be used in a join operation. Defaults to True.
- Type:
bool
- reversed_threshold_score
A property that converts the 0-100 threshold score into a 0.0-1.0 distance score, where 0.0 is a perfect match.
- Type:
float
- __init__(left_col, right_col=None, threshold_score=80.0, fuzzy_type='levenshtein', perc_unique=0, output_column_name=None, valid=True)[source]
Initializes the FuzzyMapping configuration.
- Parameters:
left_col (str) – The name of the column in the left dataframe.
right_col (str | None, optional) – The name of the column in the right dataframe. If None, it defaults to the value of left_col.
threshold_score (float, optional) – The similarity threshold for a match (0-100). Defaults to 80.0.
fuzzy_type (FuzzyTypeLiteral, optional) – The fuzzy matching algorithm to use. Defaults to “levenshtein”.
perc_unique (float, optional) – The percentage of unique values. Defaults to 0.
output_column_name (str | None, optional) – Name for the output score column. Defaults to None, which triggers auto-generation.
valid (bool, optional) – Whether the mapping is considered active. Defaults to True.
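The defaulting rules above (right_col falling back to left_col, and the auto-generated score column name) can be illustrated with a minimal stand-in class. This is a sketch for illustration only; the real FuzzyMapping lives in pl_fuzzy_frame_match.models and carries more fields.

```python
from dataclasses import dataclass

@dataclass
class FuzzyMappingSketch:
    """Illustrative re-implementation of the documented defaulting rules."""
    left_col: str
    right_col: str = None
    threshold_score: float = 80.0
    fuzzy_type: str = "levenshtein"
    output_column_name: str = None

    def __post_init__(self):
        if self.right_col is None:
            # right_col defaults to the value of left_col
            self.right_col = self.left_col
        if self.output_column_name is None:
            # documented auto-generated name format
            self.output_column_name = f"fuzzy_score_{self.left_col}_{self.right_col}"
```

So FuzzyMappingSketch("name") matches the column "name" against itself and writes its score to "fuzzy_score_name_name".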
- fuzzy_type: Literal['levenshtein', 'jaro', 'jaro_winkler', 'hamming', 'damerau_levenshtein', 'indel'] = 'levenshtein'
- property reversed_threshold_score: float
Converts similarity score (0-100) to a distance score (1.0-0.0).
For example, a threshold_score of 80 becomes a distance of 0.2. This is useful for libraries that measure string distance rather than similarity.
- Returns:
The converted distance score.
- Return type:
float
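The conversion performed by this property is a one-liner, shown here as a standalone function for clarity:

```python
def reversed_threshold_score(threshold_score):
    # convert a 0-100 similarity threshold into a 0.0-1.0 distance,
    # where 0.0 is a perfect match
    return (100.0 - threshold_score) / 100.0
```

A threshold_score of 80 therefore becomes a maximum allowed distance of 0.2, which is the form distance-based string libraries expect.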
Matcher Module
- pl_fuzzy_frame_match.matcher.ensure_left_is_larger(left_df, right_df, left_col_name, right_col_name)[source]
Optimize join performance by ensuring the larger dataframe is always on the left.
This function swaps dataframes and their corresponding column names if the right dataframe is larger than the left. This optimization improves performance in join operations where the larger dataset benefits from being the primary table.
- Parameters:
- Returns:
- A 4-tuple containing (left_df, right_df, left_col_name, right_col_name)
where the dataframes and column names may have been swapped to ensure the left dataframe is the larger one.
- Return type:
tuple
Notes
Performance optimization technique for asymmetric join operations
Column names are swapped along with dataframes to maintain consistency
Uses row count as the size metric for comparison
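The swap rule can be sketched on plain lists of rows; the library operates on Polars frames and uses the row count as the size metric, so len() stands in for that here.

```python
def ensure_left_is_larger(left_rows, right_rows, left_col, right_col):
    """Sketch: swap frames and column names so the larger side is on the left."""
    if len(right_rows) > len(left_rows):
        # swap both the data and the column names to keep them consistent
        return right_rows, left_rows, right_col, left_col
    return left_rows, right_rows, left_col, right_col
```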
- pl_fuzzy_frame_match.matcher.split_dataframe(df, max_chunk_size=50000)[source]
Split a large Polars DataFrame into smaller chunks for memory-efficient processing.
This function divides a DataFrame into multiple smaller DataFrames to enable batch processing of large datasets that might not fit in memory or would cause performance issues when processed as a single unit.
- Parameters:
df (pl.DataFrame) – The Polars DataFrame to split into chunks.
max_chunk_size (int, optional) – Maximum number of rows per chunk. Defaults to 50,000. Larger chunks use more memory but may be more efficient, while smaller chunks are more memory-friendly.
- Returns:
- A list of DataFrames, each containing at most max_chunk_size rows.
If the input DataFrame has fewer rows than max_chunk_size, returns a list with the original DataFrame as the only element.
- Return type:
list[pl.DataFrame]
Notes
Uses ceiling division to ensure all rows are included in chunks
The last chunk may contain fewer rows than max_chunk_size
Maintains row order across chunks
Memory usage scales with chunk size - adjust based on available system memory
Useful for processing datasets that exceed available RAM
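The ceiling-division chunking described in the notes can be mirrored on a plain list. This is a sketch of the behavior, not the library code, which slices a Polars DataFrame instead.

```python
import math

def split_rows(rows, max_chunk_size=50_000):
    """Split a list into chunks of at most max_chunk_size items."""
    # ceiling division ensures every row lands in some chunk;
    # the last chunk may be smaller than max_chunk_size
    n_chunks = max(1, math.ceil(len(rows) / max_chunk_size))
    return [rows[i * max_chunk_size:(i + 1) * max_chunk_size]
            for i in range(n_chunks)]
```

Row order is preserved across chunks, and an input smaller than the chunk size comes back as a single chunk.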
- pl_fuzzy_frame_match.matcher.cross_join_large_files(left_fuzzy_frame, right_fuzzy_frame, left_col_name, right_col_name, logger, top_n=500)[source]
Perform approximate similarity joins on large datasets using polars-simed.
This function handles fuzzy matching for large datasets by using approximate nearest neighbor techniques to reduce the computational complexity from O(n*m) to something more manageable. It processes data in chunks to manage memory usage.
- Parameters:
left_fuzzy_frame (pl.LazyFrame) – Left dataframe for matching.
right_fuzzy_frame (pl.LazyFrame) – Right dataframe for matching.
left_col_name (str) – Column name from left dataframe to use for similarity matching.
right_col_name (str) – Column name from right dataframe to use for similarity matching.
logger (Logger) – Logger instance for progress tracking and debugging.
top_n (int, optional) – The maximum number of similar items to return per record during pre-filtering. Defaults to 500.
- Returns:
- A LazyFrame containing approximate matches between the datasets.
Returns an empty DataFrame with null schema if no matches found.
- Return type:
pl.LazyFrame
Notes
Requires polars-simed library for approximate matching functionality
Automatically ensures larger dataframe is used as the left frame for optimization
Processes the left dataframe in chunks to manage memory
Combines results from all chunks into a single output
Falls back to empty result if processing fails rather than crashing
- pl_fuzzy_frame_match.matcher.cross_join_small_files(left_df, right_df)[source]
Perform a simple cross join for small datasets.
This function creates a cartesian product of two dataframes, suitable for small datasets where the resulting join size is manageable in memory.
- Parameters:
left_df (pl.LazyFrame) – Left dataframe for cross join.
right_df (pl.LazyFrame) – Right dataframe for cross join.
- Returns:
- The cross-joined result containing all combinations of rows
from both input dataframes.
- Return type:
pl.LazyFrame
Notes
Creates a cartesian product (every row from left × every row from right)
Only suitable for small datasets due to explosive growth in result size
For datasets of size n and m, produces n*m rows in the result
Should be used when approximate matching is not needed or available
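The explosive growth warned about above is easy to see with itertools.product, which computes the same cartesian pairing on plain lists:

```python
from itertools import product

left = ["alice", "bob"]
right = ["alicia", "rob", "carol"]

# every left row paired with every right row: n * m = 2 * 3 = 6 combinations
pairs = list(product(left, right))
```

At realistic sizes this blows up fast: two 100,000-row frames would already produce 10 billion candidate pairs, which is why the approximate method exists for large inputs.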
- pl_fuzzy_frame_match.matcher.cross_join_filter_existing_fuzzy_results(left_df, right_df, existing_matches, left_col_name, right_col_name)[source]
Process and filter fuzzy matching results by joining dataframes using existing match indices.
This function takes previously identified fuzzy matches (existing_matches) and performs a series of operations to create a refined dataset of matches between the left and right dataframes, preserving index relationships.
- Parameters:
left_df (pl.LazyFrame) – The left dataframe containing records to be matched.
right_df (pl.LazyFrame) – The right dataframe containing records to be matched against.
existing_matches (pl.LazyFrame) – A dataframe containing the indices of already identified matches between left_df and right_df, with columns ‘__left_index’ and ‘__right_index’.
left_col_name (str) – The column name from left_df to include in the result.
right_col_name (str) – The column name from right_df to include in the result.
- Returns:
A dataframe containing the unique matches between left_df and right_df, with index information for both dataframes preserved. The resulting dataframe includes the specified columns from both dataframes along with their respective index aggregations.
- Return type:
pl.LazyFrame
Notes
The function performs these operations:
1. Join existing matches with both dataframes using their respective indices.
2. Select only the relevant columns and remove duplicates.
3. Create aggregations that preserve the relationship between values and their indices.
4. Join these aggregations back to create the final result set.
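The value-and-index aggregation idea can be sketched in plain Python: pre-identified index pairs are joined back to their column values, and indices are aggregated per value pair. The function name and dict layout here are illustrative, not the library's API.

```python
from collections import defaultdict

def aggregate_matches(left_vals, right_vals, matches):
    """Sketch: group existing (left_index, right_index) pairs by their
    column values, keeping every index that maps to each value pair."""
    agg = defaultdict(lambda: {"left_idx": [], "right_idx": []})
    for li, ri in matches:
        key = (left_vals[li], right_vals[ri])
        agg[key]["left_idx"].append(li)
        agg[key]["right_idx"].append(ri)
    return dict(agg)
```

Duplicate values on either side collapse into one entry while their row indices are preserved, which is what lets the final join reconstruct every original row pair.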
- pl_fuzzy_frame_match.matcher.cross_join_no_existing_fuzzy_results(left_df, right_df, left_col_name, right_col_name, temp_dir_ref, logger, use_appr_nearest_neighbor=None, top_n=500, cross_over_for_appr_nearest_neighbor=100000000)[source]
Generate fuzzy matching results by performing a cross join between dataframes.
This function processes the input dataframes, determines the appropriate cross join method based on the size of the resulting cartesian product, and returns the cross-joined results for fuzzy matching when no existing matches are provided.
- Parameters:
left_df (pl.LazyFrame) – The left dataframe containing records to be matched.
right_df (pl.LazyFrame) – The right dataframe containing records to be matched against.
left_col_name (str) – The column name from left_df to use for fuzzy matching.
right_col_name (str) – The column name from right_df to use for fuzzy matching.
temp_dir_ref (str) – Reference to a temporary directory where intermediate results can be stored during processing of large dataframes.
use_appr_nearest_neighbor (bool | None, optional) – If True, forces the use of an approximate nearest neighbor join (polars-simed) if available; if False, forces a standard cross join; if None (default), the method is selected automatically based on the cartesian product size.
top_n (int, optional) – When using approximate nearest neighbor (polars-simed), this parameter specifies the maximum number of most similar items to return for each item during the pre-filtering stage. It helps control the size of the candidate set for more detailed fuzzy matching. Defaults to 500.
cross_over_for_appr_nearest_neighbor (int, optional) – Sets the threshold for the cartesian product size at which the function will automatically switch from a standard cross join to an approximate nearest neighbor join. Only active when use_appr_nearest_neighbor is None. The cartesian product is the number of rows in the left dataframe multiplied by the number of rows in the right. Defaults to 100,000,000.
- Returns:
A dataframe containing the cross join results of left_df and right_df, prepared for fuzzy matching operations.
- Return type:
pl.LazyFrame
Notes
The function performs these operations:
1. Processes the input frames using the process_fuzzy_frames helper function.
2. Calculates the size of the cartesian product to determine the processing approach.
3. Uses either cross_join_large_files or cross_join_small_files based on that size: cartesian products > 100M but < 1T (or 10M without polars-sim) use the large-file method; smaller products use the small-file method.
4. Raises an exception if the cartesian product exceeds the maximum allowed size.
Raises
- Exception
If the cartesian product of the two dataframes exceeds the maximum allowed size (1 trillion with polars-sim, 100 million without).
- pl_fuzzy_frame_match.matcher.unique_df_large(_df, cols=None)[source]
Efficiently compute unique rows in large dataframes by partitioning.
This function processes large dataframes by first partitioning them by a selected column, then finding unique combinations within each partition before recombining the results. This approach is more memory-efficient for large datasets than calling .unique() directly.
- Parameters:
_df (pl.DataFrame | pl.LazyFrame) – The input dataframe to process. Can be either a Polars DataFrame or LazyFrame.
cols (Optional[list[str]]) – The list of columns to consider when finding unique rows. If None, all columns are used. The first column in this list is used as the partition column.
- Returns:
A dataframe containing only the unique rows from the input dataframe, based on the specified columns.
- Return type:
pl.DataFrame
Notes
The function performs these operations:
1. Converts the LazyFrame to a DataFrame if necessary.
2. Partitions the dataframe by the first column in cols (or the first column of the dataframe if cols is None).
3. Applies the unique operation to each partition based on the remaining columns.
4. Concatenates the results back into a single dataframe.
5. Frees memory by deleting intermediate objects.
This implementation uses tqdm to provide a progress bar during processing, which is particularly helpful for large datasets where the operation may take time.
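The partition-then-deduplicate pattern can be shown on plain tuples. This sketch keeps only the memory-saving structure (partition by one column, dedupe inside each partition, concatenate); the real function works on Polars partitions.

```python
from collections import defaultdict

def unique_by_partition(rows, partition_idx=0):
    """Sketch: dedupe rows partition-by-partition instead of all at once."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_idx]].append(row)
    result = []
    for key in partitions:
        # only one partition's rows need to be held in the dedupe set at a time
        seen = set()
        for row in partitions[key]:
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result
```

Because each partition is deduplicated independently, the working set for the unique operation is bounded by the largest partition rather than the whole dataset.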
- pl_fuzzy_frame_match.matcher.combine_matches(matching_dfs)[source]
Combine a list of matching LazyFrames into a single LazyFrame.
- Parameters:
matching_dfs (list[LazyFrame])
- Return type:
LazyFrame
- pl_fuzzy_frame_match.matcher.add_index_column(df, column_name, tempdir)[source]
Add a row index column to a dataframe and cache it to temporary storage.
This function adds a sequential row index to track original row positions throughout fuzzy matching operations, then caches the result for efficient reuse.
- Parameters:
df (pl.LazyFrame) – The dataframe to which the index column is added.
column_name (str) – Name of the index column to create.
tempdir (str) – Temporary directory used to cache the result.
- Returns:
A LazyFrame with the added index column, cached to temporary storage.
- Return type:
pl.LazyFrame
Notes
Index column contains sequential integers starting from 0
Caching prevents recomputation during complex multi-step operations
Index columns are essential for tracking row relationships in fuzzy matching
The cached dataframe can be reused multiple times without recalculation
- pl_fuzzy_frame_match.matcher.process_fuzzy_mapping(fuzzy_map, left_df, right_df, existing_matches, local_temp_dir_ref, i, logger, existing_number_of_matches=None, use_appr_nearest_neighbor_for_new_matches=None, top_n=500, cross_over_for_appr_nearest_neighbor=100000000)[source]
Process a single fuzzy mapping to generate matching dataframes.
- Parameters:
fuzzy_map (FuzzyMapping) – The fuzzy mapping configuration containing match columns and thresholds.
left_df (LazyFrame) – Left dataframe with index column.
right_df (LazyFrame) – Right dataframe with index column.
existing_matches (LazyFrame | None) – Previously computed matches (or None). If provided, this function will only calculate scores for these existing pairs.
local_temp_dir_ref (str) – Temporary directory reference for caching interim results.
i (int) – Index of the current fuzzy mapping, used for naming the score column.
logger (Logger) – Logger instance for progress tracking.
existing_number_of_matches (int | None) – Number of existing matches (if available).
use_appr_nearest_neighbor_for_new_matches (bool | None) – Controls join strategy when existing_matches is None. See cross_join_no_existing_fuzzy_results for details.
top_n (int, optional) – When no existing_matches are provided, this value is passed to the approximate nearest neighbor join to specify the max number of similar items to find for each record. Defaults to 500.
cross_over_for_appr_nearest_neighbor (int, optional) – When no existing_matches are provided, this sets the cartesian product size threshold for automatically switching to the approximate join method. Defaults to 100,000,000.
- Returns:
The final matching dataframe and the number of matches.
- Return type:
tuple[pl.LazyFrame, int]
- pl_fuzzy_frame_match.matcher.perform_all_fuzzy_matches(left_df, right_df, fuzzy_maps, logger, local_temp_dir_ref, use_appr_nearest_neighbor_for_new_matches=None, top_n_for_new_matches=500, cross_over_for_appr_nearest_neighbor=100000000)[source]
Iteratively processes a list of fuzzy mapping configurations to find matches between two dataframes.
For each fuzzy mapping, this function computes potential matches. If multiple fuzzy maps are provided, the matches from one mapping can be used as a basis for the next, progressively filtering or expanding the set of matches. This allows for a multi-pass fuzzy matching strategy where different columns or criteria are applied sequentially.
- Parameters:
left_df (pl.LazyFrame) – The left dataframe, prepared with an index column (e.g., ‘__left_index’).
right_df (pl.LazyFrame) – The right dataframe, prepared with an index column (e.g., ‘__right_index’).
fuzzy_maps (list[models.FuzzyMapping]) – A list of fuzzy mapping configurations to apply. Each configuration specifies the columns to compare, the fuzzy matching method, and thresholds.
logger (Logger) – Logger instance for recording progress and debugging information.
local_temp_dir_ref (str) – Path to a temporary directory for caching intermediate dataframes during processing. This helps manage memory for large datasets.
use_appr_nearest_neighbor_for_new_matches (Optional[bool], optional) – Controls the join strategy for generating initial candidate pairs when no existing_matches are provided. If True, forces an approximate nearest neighbor join; if False, forces a standard cross join; if None (default), the strategy is selected automatically based on data size. Defaults to None.
top_n_for_new_matches (int, optional) – When generating new matches using the approximate nearest neighbor method, this specifies the max number of similar items to return for each item. Only applies when no existing matches are being filtered. Defaults to 500.
cross_over_for_appr_nearest_neighbor (int, optional) – Sets the cartesian product size threshold for automatically switching to the approximate nearest neighbor join method when use_appr_nearest_neighbor_for_new_matches is None. Defaults to 100,000,000.
- Returns:
- A list of Polars LazyFrames. Each LazyFrame in the list
represents the matching results after a process_fuzzy_mapping step. The final LazyFrame in the list contains the cumulative matches after all fuzzy maps have been processed. Each frame typically includes index columns (‘__left_index’, ‘__right_index’) and a score column (e.g., ‘fuzzy_score_i’).
- Return type:
list[pl.LazyFrame]
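The multi-pass loop described above can be sketched generically: each pass receives the matches produced by the previous pass (None on the first), and every intermediate result is kept. Here match_one is a stand-in for process_fuzzy_mapping; the real function also threads through temp-dir and logger arguments.

```python
def perform_all(fuzzy_maps, match_one):
    """Sketch of the sequential multi-pass matching strategy."""
    matches = None          # no candidate pairs exist before the first pass
    results = []
    for i, fm in enumerate(fuzzy_maps):
        # each pass refines (or creates) the match set from the previous pass
        matches = match_one(fm, matches, i)
        results.append(matches)
    return results
```

The last element of the returned list holds the cumulative matches after all fuzzy maps have been applied, mirroring the documented return contract.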
- pl_fuzzy_frame_match.matcher.fuzzy_match_dfs(left_df, right_df, fuzzy_maps, logger, use_appr_nearest_neighbor_for_new_matches=None, top_n_for_new_matches=500, cross_over_for_appr_nearest_neighbor=100000000)[source]
Perform fuzzy matching between two dataframes using multiple fuzzy mapping configurations.
This is the main entry point function that orchestrates the entire fuzzy matching process, from pre-processing and indexing to matching and final joining.
- Parameters:
left_df (pl.LazyFrame) – Left dataframe to be matched.
right_df (pl.LazyFrame) – Right dataframe to be matched.
fuzzy_maps (list[FuzzyMapping]) – A list of fuzzy mapping configurations to apply sequentially.
logger (Logger) – Logger instance for tracking progress.
use_appr_nearest_neighbor_for_new_matches (bool | None, optional) – Controls the join strategy for generating initial candidate pairs when no prior matches exist. If True, forces an approximate nearest neighbor join; if False, forces a standard cross join; if None (default), the strategy is selected automatically based on data size. Defaults to None.
top_n_for_new_matches (int, optional) – When generating new matches with the approximate method, this specifies the maximum number of similar items to consider for each record. Defaults to 500.
cross_over_for_appr_nearest_neighbor (int, optional) – The cartesian product size threshold to automatically switch to the approximate join method when use_appr_nearest_neighbor_for_new_matches is None. Defaults to 100,000,000.
- Returns:
- The final matched dataframe containing original data from both
dataframes along with all calculated fuzzy scores.
- Return type:
pl.DataFrame
Models Module
- class pl_fuzzy_frame_match.models.JoinMap(left_col, right_col)[source]
Bases:
object
A simple data structure to hold left and right column names for a join.
- class pl_fuzzy_frame_match.models.FuzzyMapping(left_col, right_col=None, threshold_score=80.0, fuzzy_type='levenshtein', perc_unique=0, output_column_name=None, valid=True)[source]
Bases:
JoinMap
Represents the configuration for a fuzzy string match between two columns.
This class defines all the necessary parameters to perform a fuzzy join, including the columns to match, the specific algorithm to use, and the similarity threshold required to consider two strings a match.
It generates a default name for the output score column if one is not provided.
- Parameters:
- threshold_score
The similarity score threshold required for a match, typically on a scale of 0 to 100. Defaults to 80.0.
- Type:
float
- fuzzy_type
The string-matching algorithm to use. Defaults to “levenshtein”.
- Type:
FuzzyTypeLiteral
- perc_unique
A parameter that may be used to assess column uniqueness before performing a costly fuzzy match. Defaults to 0.0.
- Type:
float
- output_column_name
The name for the new column that will contain the calculated fuzzy match score. If None, a name is generated automatically in the format ‘fuzzy_score_{left_col}_{right_col}’.
- Type:
str | None
- valid
A flag to indicate whether this mapping is active and should be used in a join operation. Defaults to True.
- Type:
bool
- reversed_threshold_score
A property that converts the 0-100 threshold score into a 0.0-1.0 distance score, where 0.0 is a perfect match.
- Type:
float
- __init__(left_col, right_col=None, threshold_score=80.0, fuzzy_type='levenshtein', perc_unique=0, output_column_name=None, valid=True)[source]
Initializes the FuzzyMapping configuration.
- Parameters:
left_col (str) – The name of the column in the left dataframe.
right_col (str | None, optional) – The name of the column in the right dataframe. If None, it defaults to the value of left_col.
threshold_score (float, optional) – The similarity threshold for a match (0-100). Defaults to 80.0.
fuzzy_type (FuzzyTypeLiteral, optional) – The fuzzy matching algorithm to use. Defaults to “levenshtein”.
perc_unique (float, optional) – The percentage of unique values. Defaults to 0.
output_column_name (str | None, optional) – Name for the output score column. Defaults to None, which triggers auto-generation.
valid (bool, optional) – Whether the mapping is considered active. Defaults to True.
- fuzzy_type: Literal['levenshtein', 'jaro', 'jaro_winkler', 'hamming', 'damerau_levenshtein', 'indel'] = 'levenshtein'
- property reversed_threshold_score: float
Converts similarity score (0-100) to a distance score (1.0-0.0).
For example, a threshold_score of 80 becomes a distance of 0.2. This is useful for libraries that measure string distance rather than similarity.
- Returns:
The converted distance score.
- Return type:
float
Processing Functions
Process Module
- pl_fuzzy_frame_match.process.calculate_fuzzy_score(mapping_table, left_col_name, right_col_name, fuzzy_method, th_score)[source]
Calculate fuzzy matching scores between columns in a LazyFrame using specified algorithms.
This function performs string similarity calculations between two columns using various fuzzy matching algorithms. It normalizes strings to lowercase, calculates similarity scores, filters results based on a threshold, and converts distance scores to similarity scores.
- Parameters:
mapping_table (pl.LazyFrame) – The DataFrame containing columns to compare.
left_col_name (str) – Name of the left column for comparison.
right_col_name (str) – Name of the right column for comparison.
fuzzy_method (FuzzyTypeLiteral) – Type of fuzzy matching algorithm to use. Options include ‘levenshtein’, ‘jaro’, ‘jaro_winkler’, ‘hamming’, ‘damerau_levenshtein’, ‘indel’.
th_score (float) – The threshold score for fuzzy matching (0-1 scale, where lower values represent stricter matching criteria).
- Returns:
- A LazyFrame containing the original data plus a similarity score column ‘s’
with values between 0-1, where 1 represents perfect similarity. Only rows meeting the threshold criteria are included.
- Return type:
pl.LazyFrame
Notes
Strings are normalized to lowercase before comparison for case-insensitive matching
Unlike the other algorithms, jaro_winkler does not use the normalized parameter
Distance scores are converted to similarity scores using (1 - distance)
Results are filtered to only include matches above the specified threshold
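The scoring pipeline in the notes (lowercase, normalized distance, similarity = 1 − distance, threshold filter) can be sketched in pure Python. The real function runs as Polars expressions over a LazyFrame; the exact filter rule here, keeping pairs whose normalized distance is at most th_score, is an assumption consistent with the 0-1 distance-style threshold documented above.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_scores(pairs, th_score):
    """Sketch of the documented pipeline on (left, right) string pairs."""
    out = []
    for l, r in pairs:
        l2, r2 = l.lower(), r.lower()        # case-insensitive comparison
        dist = levenshtein(l2, r2) / max(len(l2), len(r2), 1)
        sim = 1.0 - dist                     # distance -> similarity
        if dist <= th_score:                 # lower th_score = stricter
            out.append((l, r, sim))
    return out
```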
- pl_fuzzy_frame_match.process.process_fuzzy_frames(left_df, right_df, left_col_name, right_col_name, temp_dir_ref)[source]
Process and optimize dataframes for fuzzy matching by creating grouped representations.
This function prepares dataframes for fuzzy matching by grouping rows by the matching columns and aggregating their indices. It also optimizes the operation by ensuring the smaller dataset is used as the left frame to minimize computational complexity.
- Parameters:
left_df (pl.LazyFrame) – The left dataframe containing records to be matched.
right_df (pl.LazyFrame) – The right dataframe containing records to be matched against.
left_col_name (str) – Column name from the left dataframe to use for matching.
right_col_name (str) – Column name from the right dataframe to use for matching.
temp_dir_ref (str) – Reference to the temporary directory for caching intermediate results.
- Returns:
- A tuple containing:
left_fuzzy_frame: Processed and cached left dataframe grouped by matching column
right_fuzzy_frame: Processed and cached right dataframe grouped by matching column
left_col_name: Final left column name (may be swapped for optimization)
right_col_name: Final right column name (may be swapped for optimization)
len_left_df: Number of unique values in the left matching column
len_right_df: Number of unique values in the right matching column
- Return type:
tuple
Notes
Groups each dataframe by the matching column and aggregates row indices
Filters out null values to ensure clean matching data
Automatically swaps left/right frames if the right frame is smaller for optimization
Caches intermediate results to temporary storage to avoid recomputation
The returned lengths represent unique value counts, not row counts
- pl_fuzzy_frame_match.process.calculate_and_parse_fuzzy(mapping_table, left_col_name, right_col_name, fuzzy_method, th_score)[source]
Calculate fuzzy similarity scores and explode aggregated results for row-level matching.
This function combines fuzzy score calculation with result parsing to transform grouped/aggregated data back into individual row-level matches. It’s particularly useful when working with pre-grouped data that needs to be expanded back to individual record pairs.
- Parameters:
mapping_table (pl.LazyFrame) – DataFrame containing grouped data with columns to compare and aggregated index lists (__left_index, __right_index).
left_col_name (str) – Name of the left column for fuzzy comparison.
right_col_name (str) – Name of the right column for fuzzy comparison.
fuzzy_method (FuzzyTypeLiteral) – Fuzzy matching algorithm to use for similarity calculation.
th_score (float) – Minimum similarity threshold (0-1 scale, lower = more strict).
- Returns:
- DataFrame with exploded individual matches containing:
’s’: Similarity score (0-1, where 1 is perfect match)
’__left_index’: Individual row index from left dataframe
’__right_index’: Individual row index from right dataframe
- Return type:
pl.LazyFrame
Notes
First calculates fuzzy scores using the specified algorithm and threshold
Then explodes the aggregated index lists to create individual row pairs
Each row in the result represents a potential match between specific records
The explode operations convert list columns back to individual rows
Only matches exceeding the similarity threshold are included
Pre-Process Module
- pl_fuzzy_frame_match.pre_process.get_approx_uniqueness(lf)[source]
Calculate the approximate number of unique values for each column in a LazyFrame.
- pl_fuzzy_frame_match.pre_process.calculate_uniqueness(a, b)[source]
Calculate a combined uniqueness score from two individual uniqueness ratios.
The formula prioritizes columns with high combined uniqueness while accounting for differences between the two input values.
- pl_fuzzy_frame_match.pre_process.calculate_df_len(df)[source]
Calculate the number of rows in a LazyFrame efficiently.
This function provides a simple way to get the row count from a LazyFrame without collecting the entire dataset into memory, making it suitable for large datasets where full materialization would be expensive.
- Parameters:
df (pl.LazyFrame) – Input LazyFrame to count rows for.
- Returns:
Number of rows in the LazyFrame.
- Return type:
int
Notes
Uses lazy evaluation to count rows without materializing the full dataset
More memory-efficient than collecting the entire LazyFrame first
Essential for preprocessing decisions in fuzzy matching operations
- pl_fuzzy_frame_match.pre_process.fill_perc_unique_in_fuzzy_maps(left_df, right_df, fuzzy_maps, logger, left_len, right_len)[source]
Calculate and set uniqueness percentages for all fuzzy mapping columns.
Computes the approximate unique value counts in both dataframes for the columns specified in fuzzy_maps, then calculates a combined uniqueness score for each mapping.
- Parameters:
left_df (pl.LazyFrame) – Left dataframe.
right_df (pl.LazyFrame) – Right dataframe.
fuzzy_maps (list[FuzzyMapping]) – list of fuzzy mappings between left and right columns.
logger (Logger) – Logger for information output.
left_len (int) – Number of rows in the left dataframe.
right_len (int) – Number of rows in the right dataframe.
- Returns:
Updated fuzzy mappings with calculated uniqueness percentages.
- Return type:
list[FuzzyMapping]
- pl_fuzzy_frame_match.pre_process.determine_order_of_fuzzy_maps(fuzzy_maps)[source]
Sort fuzzy mappings by their uniqueness percentages in descending order.
This ensures that columns with higher uniqueness are prioritized in the fuzzy matching process.
- Parameters:
fuzzy_maps (list[FuzzyMapping]) – list of fuzzy mappings between columns.
- Returns:
Sorted list of fuzzy mappings by uniqueness (highest first).
- Return type:
list[FuzzyMapping]
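The ordering rule is a descending sort on the uniqueness percentage. In this sketch plain dictionaries stand in for FuzzyMapping objects:

```python
# dictionaries stand in for FuzzyMapping objects in this sketch
fuzzy_maps = [
    {"left_col": "city",  "perc_unique": 0.30},
    {"left_col": "email", "perc_unique": 0.95},
    {"left_col": "name",  "perc_unique": 0.60},
]

# highest uniqueness first: the most selective column is matched first
ordered = sorted(fuzzy_maps, key=lambda m: m["perc_unique"], reverse=True)
```

Processing the most selective column first shrinks the candidate set early, so later, cheaper passes run over far fewer pairs.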
- pl_fuzzy_frame_match.pre_process.calculate_uniqueness_rate(fuzzy_maps)[source]
Calculate the total uniqueness rate across all fuzzy mappings.
- Parameters:
fuzzy_maps (list[FuzzyMapping]) – list of fuzzy mappings with calculated uniqueness.
- Returns:
Sum of uniqueness percentages across all mappings.
- Return type:
float
- pl_fuzzy_frame_match.pre_process.determine_need_for_aggregation(uniqueness_rate, cartesian_join_number)[source]
Determine if aggregation is needed based on uniqueness and potential join size.
Aggregation helps prevent explosive cartesian joins when matching columns have low uniqueness, which could lead to performance issues.
- Parameters:
uniqueness_rate – Total uniqueness rate across all fuzzy mappings, as computed by calculate_uniqueness_rate.
cartesian_join_number – Estimated size of the potential cartesian join (left rows × right rows).
- Returns:
True if aggregation should be performed, False otherwise.
- Return type:
bool
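The decision rule can be sketched as a simple threshold check; the cutoff values below are illustrative assumptions, not the library's actual constants:

```python
# Assumed thresholds for illustration only.
UNIQUENESS_CUTOFF = 1.0          # below this, matching columns are "low uniqueness"
JOIN_SIZE_CUTOFF = 100_000_000   # assumed cartesian-size limit

def need_aggregation(uniqueness_rate: float, cartesian_join_number: int) -> bool:
    """Aggregate when uniqueness is low and the potential join is large."""
    return (uniqueness_rate < UNIQUENESS_CUTOFF
            and cartesian_join_number > JOIN_SIZE_CUTOFF)

print(need_aggregation(0.5, 500_000_000))  # True
print(need_aggregation(2.0, 500_000_000))  # False
```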
- pl_fuzzy_frame_match.pre_process.aggregate_output(left_df, right_df, fuzzy_maps)[source]
Deduplicate the dataframes based on the fuzzy mapping columns.
This reduces the size of the join by removing duplicate rows when the uniqueness rate is low and the potential join size is large.
- Parameters:
left_df (pl.LazyFrame) – Left dataframe.
right_df (pl.LazyFrame) – Right dataframe.
fuzzy_maps (list[FuzzyMapping]) – list of fuzzy mappings between columns.
- Returns:
Deduplicated left and right dataframes.
- Return type:
tuple[pl.LazyFrame, pl.LazyFrame]
- pl_fuzzy_frame_match.pre_process.report_on_order_of_fuzzy_maps(fuzzy_maps, logger)[source]
Log the ordered list of fuzzy mappings based on their uniqueness scores.
This function provides visibility into how fuzzy mappings are prioritized during the matching process. Higher uniqueness scores indicate columns that are more selective and thus processed first for better performance.
- Parameters:
fuzzy_maps (list[FuzzyMapping]) – list of fuzzy mappings sorted by uniqueness.
logger (Logger) – Logger instance for outputting the mapping order information.
- Return type:
None
Notes
- Mappings are logged in order of processing priority (highest uniqueness first)
- Helps with debugging and understanding the optimization strategy
- Uniqueness scores guide the order of fuzzy matching operations
Higher uniqueness means fewer potential matches and better performance
- pl_fuzzy_frame_match.pre_process.pre_process_for_fuzzy_matching(left_df, right_df, fuzzy_maps, logger)[source]
Preprocess dataframes and fuzzy mappings for optimal fuzzy matching.
This function:
1. Calculates dataframe sizes
2. Calculates uniqueness percentages for each fuzzy mapping
3. Sorts the fuzzy mappings by uniqueness
4. Determines if aggregation is needed to prevent large cartesian joins
5. Performs aggregation if necessary
- Parameters:
left_df (pl.LazyFrame) – Left dataframe.
right_df (pl.LazyFrame) – Right dataframe.
fuzzy_maps (list[FuzzyMapping]) – list of fuzzy mappings between columns.
logger (Logger) – Logger for information output.
- Returns:
Potentially modified left dataframe
Potentially modified right dataframe
Sorted and updated fuzzy mappings
- Return type:
tuple[pl.LazyFrame, pl.LazyFrame, list[FuzzyMapping]]
Utilities
- pl_fuzzy_frame_match._utils.collect_lazy_frame(lf)[source]
Collect a LazyFrame into a DataFrame with automatic engine selection.
Attempts to use the streaming engine first for better memory efficiency, falling back to the auto engine if a PanicException occurs.
- Parameters:
lf (pl.LazyFrame) – The LazyFrame to collect.
- Returns:
The collected DataFrame.
- Return type:
pl.DataFrame
- Raises:
Exception – If both streaming and auto engines fail to collect the LazyFrame.
- pl_fuzzy_frame_match._utils.write_polars_frame(_df, path, estimated_size=0)[source]
Write a Polars DataFrame or LazyFrame to disk in IPC format with memory optimization.
This function intelligently chooses between different writing strategies based on the estimated size of the data to optimize memory usage during serialization.
- Parameters:
_df (pl.DataFrame | pl.LazyFrame) – The dataframe to write to disk.
path (str) – Destination file path for the IPC output.
estimated_size (int, optional) – Estimated size of the data in bytes, used to select the writing strategy. Defaults to 0.
- Returns:
True if the write operation was successful, False otherwise.
- Return type:
bool
Notes
- For small datasets (< 8MB estimated), converts LazyFrame to DataFrame for faster writing
- For large datasets, uses memory-efficient sink_ipc method when possible
- Falls back to standard write_ipc method if sink operations fail
- Uses IPC format for optimal performance and compatibility
- pl_fuzzy_frame_match._utils.cache_polars_frame_to_temp(_df, tempdir=None)[source]
Cache a Polars DataFrame or LazyFrame to a temporary file and return a LazyFrame reference.
This function is useful for materializing intermediate results during complex operations to avoid recomputation and manage memory usage effectively.
- Parameters:
_df (pl.LazyFrame | pl.DataFrame) – The dataframe to cache to disk.
tempdir (str, optional) – Directory path where temporary files should be stored. If None, uses the system default temporary directory.
- Returns:
A LazyFrame that reads from the cached temporary file.
- Return type:
pl.LazyFrame
- Raises:
Exception – If the caching operation fails with message “Could not cache the data”.
Notes
- Creates a unique filename using UUID to avoid conflicts
- Uses IPC format for efficient serialization/deserialization
- The temporary file persists until explicitly deleted or system cleanup
- Caller is responsible for managing the lifecycle of temporary files