LazyFrame¶
The LazyFrame is the central object in pyfloe. It represents a lazy dataframe — operations build a query plan without executing it. Data flows only when you trigger evaluation with .collect(), .to_pylist(), .to_csv(), or similar methods.
LazyFrame
¶
A lazy, composable dataframe.
Operations on a LazyFrame build a query plan without executing it.
Data flows only when you call a materialization method like
.collect(), .to_pylist(), or .to_csv().
Examples:
>>> lf = LazyFrame([{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}])
>>> lf.filter(col("age") > 28).to_pylist()
[{'name': 'Alice', 'age': 30}]
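The deferred execution described above can be pictured with plain Python generators. The sketch below is not pyfloe's implementation: `MiniLazy` and its methods are hypothetical names, used only to show that chaining operations records work and that rows are pulled only when a materialization method runs.

```python
from typing import Callable, Iterable, Iterator


class MiniLazy:
    """Hypothetical stand-in for a lazy frame; not part of pyfloe."""

    def __init__(self, source: Callable[[], Iterable[dict]]):
        # Store a factory, not the rows, so nothing is read up front.
        self._source = source

    def filter(self, predicate: Callable[[dict], bool]) -> "MiniLazy":
        def rows() -> Iterator[dict]:
            return (row for row in self._source() if predicate(row))
        return MiniLazy(rows)

    def select(self, *names: str) -> "MiniLazy":
        def rows() -> Iterator[dict]:
            return ({n: row[n] for n in names} for row in self._source())
        return MiniLazy(rows)

    def to_pylist(self) -> list:
        # Materialization: only here are rows actually pulled.
        return list(self._source())


reads = []  # records every row the source hands out


def scan() -> Iterator[dict]:
    for row in [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]:
        reads.append(row["name"])
        yield row


plan = MiniLazy(scan).filter(lambda r: r["age"] > 28).select("name")
rows_read_before = len(reads)  # building the plan read nothing
result = plan.to_pylist()      # triggers the scan
```

Building `plan` leaves `reads` empty; the source is only consumed inside `to_pylist()`.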
Methods:
-
__init__–Create a LazyFrame from in-memory data.
-
explain–Return a string representation of the query plan tree.
-
print_explain–Print the query plan tree to stdout.
-
select–Select columns by name or expression.
-
filter–Filter rows matching a predicate expression.
-
with_column–Add a computed column to the LazyFrame.
-
with_columns–Add multiple computed columns at once.
-
drop–Remove columns from the LazyFrame.
-
rename–Rename columns.
-
sort–Sort rows by one or more columns.
-
join–Join with another LazyFrame.
-
group_by–Group by one or more columns.
-
explode–Unnest a list column into separate rows.
-
pivot–Pivot (reshape long to wide).
-
unpivot–Unpivot (reshape wide to long). Also available as .melt().
-
union–Stack rows from another LazyFrame below this one.
-
apply–Apply a function to column values.
-
read–Alias for select. Select columns by name.
-
head–Return a lazy view of the first n rows.
-
optimize–Return a new LazyFrame with an optimized query plan.
-
collect–Materialize the query plan and cache the results.
-
count–Return the total number of rows.
-
to_pylist–Materialize and return data as a list of dicts.
-
to_pydict–Materialize and return data as a dict of column lists.
-
to_tuples–Materialize and return data as a list of tuples.
-
to_batches–Materialize and return data in batches of dicts.
-
to_csv–Stream the query plan to a CSV file with constant memory.
-
to_tsv–Stream the query plan to a TSV (tab-separated) file.
-
to_jsonl–Stream the query plan to a JSON Lines file.
-
to_json–Write data as a JSON array.
-
to_parquet–Write data to a Parquet file (requires pyarrow).
-
display–Print a formatted table of the first n rows.
-
typed–Wrap this LazyFrame as a TypedLazyFrame for IDE-friendly typed results.
-
validate–Validate the schema against a TypedDict type.
Attributes:
-
schema(LazySchema) –Output schema of this LazyFrame, computed without touching data.
-
columns(list[str]) –List of column names.
-
dtypes(dict[str, type]) –Mapping of column names to their Python types.
-
is_materialized(bool) –Whether the query plan has been executed and data is cached.
schema
property
¶
columns
property
¶
dtypes
property
¶
is_materialized
property
¶
Whether the query plan has been executed and data is cached.
Returns True after calling collect or to_pylist.
__init__
¶
__init__(raw_data: list[dict] | list | dict | Iterable | None = None, *, name: str | None = None) -> None
Create a LazyFrame from in-memory data.
Parameters:
-
raw_data(list[dict] | list | dict | Iterable | None, default:None) –Input data as a list of dicts, a list of tuples, a list of objects with __dict__, a dict of columns, or an iterable.
-
name(str | None, default:None) –Optional name for the LazyFrame.
Examples:
From a list of dicts:
>>> lf = LazyFrame([{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}])
>>> lf.columns
['name', 'age']
From a dict of columns:
>>> lf = LazyFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
>>> lf.to_pylist()
[{'x': 1, 'y': 'a'}, {'x': 2, 'y': 'b'}, {'x': 3, 'y': 'c'}]
From a list of tuples (auto-named columns):
explain
¶
Return a string representation of the query plan tree.
Parameters:
-
optimized(bool, default:False) –If True, show the plan after optimization (filter pushdown, column pruning).
Examples:
print_explain
¶
Print the query plan tree to stdout.
Shortcut for print(lf.explain()).
Parameters:
-
optimized(bool, default:False) –If True, show the plan after optimization.
select
¶
Select columns by name or expression.
Parameters:
-
*args(str | Expr, default:()) –Column names as strings, or Expr objects.
Returns:
-
LazyFrame–A new LazyFrame with only the selected columns.
Examples:
Select by column name:
>>> lf = LazyFrame([{"a": 1, "b": 2, "c": 3}])
>>> lf.select("a", "c").to_pylist()
[{'a': 1, 'c': 3}]
Select with expressions:
filter
¶
Filter rows matching a predicate expression.
Parameters:
-
predicate_or_col–An Expr that evaluates to a boolean per row, e.g.
col("amount") > 100.
Returns:
-
LazyFrame–A new LazyFrame with only matching rows.
Examples:
>>> orders = LazyFrame([
... {"product": "A", "amount": 250, "region": "EU"},
... {"product": "B", "amount": 75, "region": "US"},
... {"product": "C", "amount": 180, "region": "EU"},
... ])
>>> orders.filter(col("amount") > 100).to_pylist()
[{'product': 'A', 'amount': 250, 'region': 'EU'}, {'product': 'C', 'amount': 180, 'region': 'EU'}]
Compound filters with & and |:
with_column
¶
Add a computed column to the LazyFrame.
Can be called with a name and expression, or with a single
expression whose output name is derived via .alias() or from
the underlying column reference.
Parameters:
-
name_or_expr(str | Expr) –Column name (str) or an expression with an inferrable output name.
-
expr(Expr | None, default:None) –Expression to compute the column values (required when name_or_expr is a string).
Returns:
-
LazyFrame–A new LazyFrame with the additional column.
Examples:
with_columns
¶
Add multiple computed columns at once.
Accepts positional expressions (with names derived from
.alias() or the underlying column) and/or keyword arguments.
Parameters:
-
*args(Expr, default:()) –Expressions with inferrable output names.
-
**kwargs(Expr, default:{}) –Column name to expression mappings.
Returns:
-
LazyFrame–A new LazyFrame with the additional columns.
Examples:
>>> lf = LazyFrame([{"amount": 250, "region": "eu"}])
>>> lf.with_columns(
... (col("amount") * 2).alias("double"),
... col("region").str.upper().alias("upper_region"),
... ).to_pylist()
[{'amount': 250, 'region': 'eu', 'double': 500, 'upper_region': 'EU'}]
>>> lf.with_columns(
... double=col("amount") * 2,
... upper_region=col("region").str.upper(),
... ).to_pylist()
[{'amount': 250, 'region': 'eu', 'double': 500, 'upper_region': 'EU'}]
drop
¶
Remove columns from the LazyFrame.
Parameters:
-
*columns(str, default:()) –Column names to drop.
Returns:
-
LazyFrame–A new LazyFrame without the specified columns.
Examples:
rename
¶
Rename columns.
Parameters:
-
mapping(dict[str, str]) –Old name to new name mapping.
Returns:
-
LazyFrame–A new LazyFrame with renamed columns.
Examples:
sort
¶
Sort rows by one or more columns.
Parameters:
-
*by(str, default:()) –Column names to sort by.
-
ascending(bool | list[bool], default:True) –Sort direction. A single bool applies to all columns; a list specifies per-column direction.
Returns:
-
LazyFrame–A new LazyFrame with sorted rows.
Examples:
>>> lf = LazyFrame([{"name": "C"}, {"name": "A"}, {"name": "B"}])
>>> lf.sort("name").to_pylist()
[{'name': 'A'}, {'name': 'B'}, {'name': 'C'}]
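The `ascending` parameter accepts either one bool or one bool per column. In plain Python, that behavior can be approximated with repeated stable sorts, as in the sketch below. The helper `sort_rows` is hypothetical and says nothing about how pyfloe sorts internally.

```python
def sort_rows(rows, by, ascending=True):
    """Hypothetical helper: multi-column sort with per-column direction."""
    # A single bool applies to every column; a list gives one flag per column.
    if isinstance(ascending, bool):
        flags = [ascending] * len(by)
    else:
        flags = list(ascending)
    out = list(rows)
    # Sort by the last key first. Python's sort is stable, so the
    # earlier keys, applied later, take precedence.
    for name, asc in reversed(list(zip(by, flags))):
        out.sort(key=lambda r: r[name], reverse=not asc)
    return out


rows = [{"g": "a", "n": 1}, {"g": "b", "n": 2}, {"g": "a", "n": 3}]
ordered = sort_rows(rows, by=["g", "n"], ascending=[True, False])
```

Here `g` is sorted ascending and, within each `g`, `n` is sorted descending.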
Descending sort:
join
¶
join(other: LazyFrame, on: str | list[str] | None = None, left_on: str | list[str] | None = None, right_on: str | list[str] | None = None, how: JoinHow = 'inner', sorted: bool = False, left_cols: str | list[str] | None = None, right_cols: str | list[str] | None = None) -> LazyFrame
Join with another LazyFrame.
Parameters:
-
other(LazyFrame) –Right-side LazyFrame to join with.
-
on(str | list[str] | None, default:None) –Column name(s) present in both sides.
-
left_on(str | list[str] | None, default:None) –Column name(s) on the left side.
-
right_on(str | list[str] | None, default:None) –Column name(s) on the right side.
-
how(JoinHow, default:'inner') –Join type: 'inner', 'left', or 'full'.
-
sorted(bool, default:False) –If True, use sort-merge join (O(1) memory for pre-sorted inputs) instead of hash join.
Returns:
-
LazyFrame–A new LazyFrame with columns from both sides.
Examples:
>>> orders = LazyFrame([{"id": 1, "cust": 101}, {"id": 2, "cust": 102}])
>>> customers = LazyFrame([{"cust": 101, "name": "Alice"}])
>>> orders.join(customers, on="cust", how="left").to_pylist()
[{'id': 1, 'cust': 101, 'right_cust': 101, 'name': 'Alice'}, {'id': 2, 'cust': 102, 'right_cust': None, 'name': None}]
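The row-level semantics of the left join above can be spelled out as a plain-Python hash join. This is an illustration, not pyfloe's code: the helper name `hash_left_join` is hypothetical, and the `right_` prefix on the duplicated key column is inferred from the output shown above.

```python
from collections import defaultdict


def hash_left_join(left, right, on):
    """Hypothetical helper: left join two lists of dicts on one key."""
    # Build phase: index the right side by key.
    index = defaultdict(list)
    for row in right:
        index[row[on]].append(row)
    # Column names of the right side, used to null-fill unmatched rows.
    right_cols = list(right[0].keys()) if right else []

    out = []
    # Probe phase: look up each left row in the index.
    for lrow in left:
        matches = index.get(lrow[on]) or [dict.fromkeys(right_cols)]
        for rrow in matches:
            joined = dict(lrow)
            for name, value in rrow.items():
                # Prefix names that collide with a left column
                # (assumed convention, taken from the example output).
                joined[f"right_{name}" if name in lrow else name] = value
            out.append(joined)
    return out


orders = [{"id": 1, "cust": 101}, {"id": 2, "cust": 102}]
customers = [{"cust": 101, "name": "Alice"}]
joined = hash_left_join(orders, customers, on="cust")
```

A hash join has to hold the indexed side in memory, which is the cost the `sorted=True` sort-merge option is meant to avoid for pre-sorted inputs.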
Different key names on each side:
group_by
¶
Group by one or more columns.
Parameters:
-
*columns(str, default:()) –Column names to group by.
-
sorted(bool, default:False) –If True, use streaming sorted aggregation (requires input sorted by group columns).
Returns:
-
LazyGroupBy | LazyFrame–A LazyGroupBy; call .agg() to specify aggregations.
Examples:
>>> orders = LazyFrame([
... {"region": "EU", "amount": 250},
... {"region": "EU", "amount": 180},
... {"region": "US", "amount": 320},
... ])
>>> orders.group_by("region").agg(
... col("amount").sum().alias("total"),
... ).sort("region").to_pylist()
[{'region': 'EU', 'total': 430}, {'region': 'US', 'total': 320}]
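The `sorted=True` option refers to streaming aggregation over input that is already ordered by the group columns. A minimal sketch of that idea with `itertools.groupby` follows. The function name `sorted_sum` is hypothetical, and this is not pyfloe's implementation.

```python
from itertools import groupby
from operator import itemgetter


def sorted_sum(rows, key, value, out_name):
    """Hypothetical helper: sum `value` per `key` over key-sorted rows.

    Only the running total of the current group is held in memory.
    If the input is not sorted by `key`, the same key can appear in
    several output rows, which is why sorted input is required.
    """
    for group_key, group_rows in groupby(rows, key=itemgetter(key)):
        yield {key: group_key, out_name: sum(r[value] for r in group_rows)}


orders = [
    {"region": "EU", "amount": 250},
    {"region": "EU", "amount": 180},
    {"region": "US", "amount": 320},
]
totals = list(sorted_sum(iter(orders), "region", "amount", "total"))
```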
explode
¶
Unnest a list column into separate rows.
Each element in the list becomes its own row, with all other column values duplicated.
Parameters:
-
column(str) –Name of the column containing lists.
Returns:
-
LazyFrame–A new LazyFrame with one row per list element.
Examples:
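A plain-Python sketch of the row duplication described above. The helper `explode_rows` is hypothetical and only mirrors the documented semantics. How pyfloe treats empty lists is not stated here; this sketch simply emits no row for them.

```python
def explode_rows(rows, column):
    """Hypothetical helper: emit one output row per element of row[column]."""
    for row in rows:
        for element in row[column]:
            # Copy the other columns and replace the list with one element.
            yield {**row, column: element}


rows = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": ["c"]}]
exploded = list(explode_rows(rows, "tags"))
```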
pivot
¶
pivot(index: str | list[str], on: str, values: str, agg: AggFunc = 'first', columns: list[str] | None = None) -> LazyFrame
Pivot (reshape long to wide).
Parameters:
-
index(str | list[str]) –Column(s) to keep as row identifiers.
-
on(str) –Column whose unique values become new column headers.
-
values(str) –Column whose values fill the pivoted cells.
-
agg(AggFunc, default:'first') –Aggregation function name ('first', 'sum', etc.).
-
columns(list[str] | None, default:None) –Explicit list of pivot column values (auto-detected if None).
Returns:
-
LazyFrame–A new LazyFrame in wide format.
Examples:
>>> lf = LazyFrame([
... {"name": "Alice", "subject": "math", "score": 90},
... {"name": "Alice", "subject": "english", "score": 85},
... {"name": "Bob", "subject": "math", "score": 78},
... {"name": "Bob", "subject": "english", "score": 92},
... ])
>>> lf.pivot(index="name", on="subject", values="score",
... columns=["math", "english"]).sort("name").to_pylist()
[{'name': 'Alice', 'math': 90, 'english': 85}, {'name': 'Bob', 'math': 78, 'english': 92}]
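To make the reshaping concrete, the following plain-Python sketch reproduces the result above for the default `'first'` aggregation. The helper `pivot_first` is hypothetical and handles a single index column only; it is not how pyfloe implements `pivot`.

```python
def pivot_first(rows, index, on, values, columns):
    """Hypothetical helper: long-to-wide reshape keeping the first value seen."""
    wide = {}  # index value -> output row (dicts keep insertion order)
    for row in rows:
        out = wide.setdefault(
            row[index], {index: row[index], **dict.fromkeys(columns)}
        )
        header = row[on]
        # 'first' aggregation: fill a cell only if it is still empty.
        if header in columns and out[header] is None:
            out[header] = row[values]
    return list(wide.values())


scores = [
    {"name": "Alice", "subject": "math", "score": 90},
    {"name": "Alice", "subject": "english", "score": 85},
    {"name": "Bob", "subject": "math", "score": 78},
    {"name": "Bob", "subject": "english", "score": 92},
]
wide_rows = pivot_first(scores, "name", "subject", "score", ["math", "english"])
```

Passing the pivot values explicitly, as `columns` does, means the output columns are known before any row is read.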
unpivot
¶
unpivot(id_columns: str | list[str], value_columns: str | list[str] | None = None, variable_name: str = 'variable', value_name: str = 'value') -> LazyFrame
Unpivot (reshape wide to long). Also available as .melt().
Parameters:
-
id_columns(str | list[str]) –Column(s) to keep as identifiers.
-
value_columns(str | list[str] | None, default:None) –Column(s) to unpivot. If None, all non-id columns.
-
variable_name(str, default:'variable') –Name for the new column holding original column names.
-
value_name(str, default:'value') –Name for the new column holding the values.
Returns:
-
LazyFrame–A new LazyFrame in long format.
Examples:
>>> lf = LazyFrame([
... {"name": "Alice", "math": 90, "english": 85},
... {"name": "Bob", "math": 78, "english": 92},
... ])
>>> lf.unpivot("name", ["math", "english"]).sort("name", "variable").to_pylist()
[{'name': 'Alice', 'variable': 'english', 'value': 85}, {'name': 'Alice', 'variable': 'math', 'value': 90}, {'name': 'Bob', 'variable': 'english', 'value': 92}, {'name': 'Bob', 'variable': 'math', 'value': 78}]
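The inverse reshape can be sketched the same way. The helper `unpivot_rows` below is hypothetical, supports a single id column, and only mirrors the documented parameters; it is not pyfloe's implementation.

```python
def unpivot_rows(rows, id_column, value_columns,
                 variable_name="variable", value_name="value"):
    """Hypothetical helper: wide-to-long reshape for a single id column."""
    for row in rows:
        for name in value_columns:
            # One output row per (input row, value column) pair.
            yield {
                id_column: row[id_column],
                variable_name: name,
                value_name: row[name],
            }


wide = [
    {"name": "Alice", "math": 90, "english": 85},
    {"name": "Bob", "math": 78, "english": 92},
]
long_rows = sorted(
    unpivot_rows(wide, "name", ["math", "english"]),
    key=lambda r: (r["name"], r["variable"]),
)
```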
union
¶
Stack rows from another LazyFrame below this one.
apply
¶
apply(func: Callable, columns: list[str] | None = None, output_dtype: type | None = None) -> LazyFrame
Apply a function to column values.
Parameters:
-
func(Callable) –Function to apply to each cell value.
-
columns(list[str] | None, default:None) –Columns to apply to. If None, applies to all columns.
-
output_dtype(type | None, default:None) –Expected output type (for schema inference).
Returns:
-
LazyFrame–A new LazyFrame with the function applied.
Examples:
Apply to specific columns:
>>> lf = LazyFrame([{"name": "Alice", "age": 30}])
>>> lf.apply(str, columns=["age"]).to_pylist()
[{'name': 'Alice', 'age': '30'}]
Apply to all columns:
read
¶
Alias for select. Select columns by name.
Parameters:
-
columns(str | list[str]) –Column name or list of column names to select.
Returns:
-
LazyFrame–A new LazyFrame with only the specified columns.
head
¶
Return a lazy view of the first n rows.
The upstream plan is not executed until the result is
materialized (e.g. via collect or to_pylist).
Parameters:
-
n(int, default:5) –Maximum number of rows to return.
Returns:
-
LazyFrame–A new LazyFrame with a Limit node in the plan.
Examples:
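The idea of a lazy limit can be illustrated with `itertools.islice`, which stops pulling from its source once n items have been produced. The sketch below uses only the standard library and a hypothetical endless source; it is an analogy for the behavior described above, not pyfloe's Limit node.

```python
from itertools import islice

pulled = []  # records how far the source was actually read


def numbers():
    """Endless hypothetical source of rows."""
    n = 0
    while True:
        pulled.append(n)
        yield {"n": n}
        n += 1


limited = islice(numbers(), 3)  # lazy: nothing has been read yet
pulled_before = len(pulled)
first_three = list(limited)     # reads three rows, then stops
```

Because the limit is applied while rows stream, even an unbounded source terminates after three reads.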
optimize
¶
Return a new LazyFrame with an optimized query plan.
Applies filter pushdown and column pruning.
Returns:
-
LazyFrame–A new LazyFrame wrapping the optimized plan.
Examples:
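Filter pushdown means moving a filter closer to the data source so that later steps see fewer rows. The toy plan below shows the rewrite on a three-node tree. All names (`Scan`, `Project`, `Filter`, `push_filters_down`, `explain`) and the text layout of the tree are hypothetical; pyfloe's real plan nodes and `explain()` output may look different.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scan:
    rows: list


@dataclass
class Project:
    child: object
    columns: list


@dataclass
class Filter:
    child: object
    column: str
    test: Callable


def push_filters_down(node):
    """Rewrite Filter(Project(x)) into Project(Filter(x)), recursively."""
    if isinstance(node, Filter):
        child = push_filters_down(node.child)
        if isinstance(child, Project) and node.column in child.columns:
            pushed = push_filters_down(Filter(child.child, node.column, node.test))
            return Project(pushed, child.columns)
        return Filter(child, node.column, node.test)
    if isinstance(node, Project):
        return Project(push_filters_down(node.child), node.columns)
    return node


def explain(node, depth=0):
    """Render the plan as an indented tree, one node per line."""
    pad = "  " * depth
    if isinstance(node, Scan):
        return f"{pad}Scan"
    if isinstance(node, Filter):
        label = f"Filter[{node.column}]"
    else:
        label = f"Project{node.columns}"
    return f"{pad}{label}\n{explain(node.child, depth + 1)}"


def execute(node):
    """Run the plan bottom-up and return a list of dict rows."""
    if isinstance(node, Scan):
        return list(node.rows)
    rows = execute(node.child)
    if isinstance(node, Filter):
        return [r for r in rows if node.test(r[node.column])]
    return [{c: r[c] for c in node.columns} for r in rows]


data = [{"a": 1, "b": 10, "c": "x"}, {"a": 5, "b": 20, "c": "y"}]
plan = Filter(Project(Scan(data), ["a", "b"]), "a", lambda v: v > 2)
optimized = push_filters_down(plan)
```

Both plans return the same rows; only the order of the Filter and Project nodes changes.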
collect
¶
Materialize the query plan and cache the results.
After calling collect, subsequent operations use the cached data. Calling collect multiple times is safe and idempotent.
Parameters:
-
optimize(bool, default:True) –If True, run the query optimizer first.
Returns:
-
LazyFrame–Self, with data materialized.
Examples:
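The "materialize once, then reuse" behavior can be sketched in a few lines. `CachedPlan` is a hypothetical class written for this illustration; it mirrors the documented contract (collect returns self, repeated calls are no-ops) without claiming to match pyfloe's internals.

```python
class CachedPlan:
    """Hypothetical sketch of a plan that caches its result."""

    def __init__(self, produce):
        self._produce = produce  # callable that runs the plan
        self._cache = None
        self.runs = 0            # how many times the plan executed

    @property
    def is_materialized(self):
        return self._cache is not None

    def collect(self):
        if self._cache is None:
            self._cache = list(self._produce())
            self.runs += 1
        return self  # returning self allows chaining

    def to_pylist(self):
        # Reuses the cache; returns a copy so callers cannot mutate it.
        return list(self.collect()._cache)


plan = CachedPlan(lambda: ({"x": i} for i in range(3)))
before = plan.is_materialized
plan.collect().collect()  # the second call does no work
```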
count
¶
Return the total number of rows.
Uses fast-path counting when possible (e.g. for in-memory data) without materializing all rows.
Parameters:
-
optimize(bool, default:True) –If True, run the query optimizer first.
Returns:
-
int–The row count as an integer.
Examples:
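to_pylist
¶
Materialize and return data as a list of dicts.
to_pydict
¶
Materialize and return data as a dict of column lists.
to_tuples
¶
Materialize and return data as a list of tuples.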
to_batches
¶
Materialize and return data in batches of dicts.
Parameters:
-
optimize(bool, default:True) –If True, run the query optimizer first.
Yields:
-
list[dict]–Batches of rows, each represented as a list of dicts.
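Batching a row stream is a common pattern that can be written with `itertools.islice`. In the sketch below, the helper `batches` and its `size` argument are hypothetical; the documentation above does not say how pyfloe chooses the batch size.

```python
from itertools import islice


def batches(rows, size):
    """Hypothetical helper: yield lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return  # source exhausted
        yield batch


chunks = list(batches(({"i": i} for i in range(5)), size=2))
```

The last batch may be shorter than `size`, as it is here with five rows and a batch size of two.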
to_csv
¶
Stream the query plan to a CSV file with constant memory.
Data is written row-by-row without buffering the entire dataset, so this works for arbitrarily large pipelines.
Parameters:
-
path(str) –Output file path.
-
delimiter(str, default:',') –Field delimiter character.
-
header(bool, default:True) –Whether to write a header row.
-
encoding(str, default:'utf-8') –File encoding.
Examples:
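Row-by-row writing is what keeps memory use flat. A standard-library sketch of the idea follows, using `csv.DictWriter` and an in-memory buffer in place of a file. The helper `write_csv` is hypothetical and only borrows the `delimiter` and `header` parameter names from the list above.

```python
import csv
import io


def write_csv(rows, out, delimiter=",", header=True):
    """Hypothetical helper: write dict rows one at a time to a text stream."""
    writer = None
    for row in rows:
        if writer is None:
            # Column order is taken from the first row.
            writer = csv.DictWriter(
                out,
                fieldnames=list(row),
                delimiter=delimiter,
                lineterminator="\n",
            )
            if header:
                writer.writeheader()
        writer.writerow(row)  # each row is written as soon as it arrives


buffer = io.StringIO()  # stands in for an open file
write_csv(({"id": i, "sq": i * i} for i in range(3)), buffer)
csv_text = buffer.getvalue()
```

Since the input is a generator, no more than one row exists in memory at any time.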
to_tsv
¶
Stream the query plan to a TSV (tab-separated) file.
Equivalent to lf.to_csv(path, delimiter='\t').
Parameters:
-
path(str) –Output file path.
-
**kwargs(Any, default:{}) –Additional arguments passed to to_csv.
to_jsonl
¶
Stream the query plan to a JSON Lines file.
Parameters:
-
path(str) –Output file path.
-
encoding(str, default:'utf-8') –File encoding.
to_json
¶
Write data as a JSON array.
Parameters:
-
path(str) –Output file path.
-
encoding(str, default:'utf-8') –File encoding.
-
indent(int | None, default:None) –JSON indentation level.
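The difference between the two JSON outputs is easiest to see side by side. The sketch below uses only the standard `json` module with hypothetical helper names. It shows the file formats, not pyfloe's writers, and whether pyfloe buffers rows for `to_json` is not stated here.

```python
import io
import json


def write_jsonl(rows, out):
    """One JSON object per line; rows can be written as they arrive."""
    for row in rows:
        out.write(json.dumps(row) + "\n")


def write_json_array(rows, out, indent=None):
    """A single JSON array; this simple version buffers all rows first."""
    json.dump(list(rows), out, indent=indent)


rows = [{"id": 1, "ok": True}, {"id": 2, "ok": False}]
jsonl_out, json_out = io.StringIO(), io.StringIO()
write_jsonl(iter(rows), jsonl_out)
write_json_array(iter(rows), json_out)
```

JSON Lines can be appended to and read line by line, while a JSON array must be parsed as a whole.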
to_parquet
¶
Write data to a Parquet file (requires pyarrow).
Parameters:
-
path(str) –Output file path.
-
**kwargs(Any, default:{}) –Additional arguments passed to pyarrow.
display
¶
Print a formatted table of the first n rows.
Parameters:
-
n(int, default:20) –Maximum number of rows to display.
-
max_col_width(int, default:30) –Truncate cell values longer than this.
-
optimize(bool, default:True) –If True, run the query optimizer first.
Examples:
typed
¶
Wrap this LazyFrame as a TypedLazyFrame for IDE-friendly typed results.
Operations that preserve the schema (filter, sort, head) return
a TypedLazyFrame, so .to_pylist() returns list[T] in type checkers.
Parameters:
-
row_type(type[T]) –A TypedDict class describing the row schema.
Returns:
-
TypedLazyFrame[T]–A TypedLazyFrame wrapping the same query plan.
Examples:
validate
¶
Validate the schema against a TypedDict type.
Parameters:
-
row_type(type) –A TypedDict class. Each key is checked against the LazyFrame's schema for presence and type compatibility.
Returns:
-
LazyFrame–Self, if validation passes.
Raises:
-
TypeError–If the schema doesn't match the TypedDict.
Examples:
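A presence-and-type check against a TypedDict can be sketched with `typing.get_type_hints`. The function `validate_schema` and the `Order` type below are hypothetical. The sketch handles plain classes only and makes no claim about which compatibility rules pyfloe applies.

```python
from typing import TypedDict, get_type_hints


class Order(TypedDict):
    product: str
    amount: int


def validate_schema(dtypes, row_type):
    """Hypothetical check: every TypedDict key exists with a compatible type."""
    for name, expected in get_type_hints(row_type).items():
        if name not in dtypes:
            raise TypeError(f"missing column {name!r}")
        if not issubclass(dtypes[name], expected):
            raise TypeError(
                f"column {name!r} is {dtypes[name].__name__}, "
                f"expected {expected.__name__}"
            )


# Extra columns are allowed; only the TypedDict keys are checked.
validate_schema({"product": str, "amount": int, "region": str}, Order)

try:
    validate_schema({"product": str, "amount": str}, Order)
    error = None
except TypeError as exc:
    error = str(exc)
```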
TypedLazyFrame¶
The TypedLazyFrame wraps a LazyFrame with a known row type (a TypedDict), enabling static type checkers and IDEs to infer the shape of results from .to_pylist().
TypedLazyFrame
¶
Bases: LazyFrame, Generic[T]
A LazyFrame with a known row type for static type checking.
Created via LazyFrame.typed(MyTypedDict). Operations that preserve
the schema (filter, sort, head) return a TypedLazyFrame, so
.to_pylist() returns list[T] in type checkers.
LazyGroupBy¶
The LazyGroupBy is created by calling LazyFrame.group_by(). Call .agg() on it to specify aggregation expressions and produce the grouped result.
LazyGroupBy
¶
Builder for grouped aggregation operations.
Created by calling LazyFrame.group_by(). Use .agg() to specify
aggregation expressions and produce a result LazyFrame.
Methods:
-
agg–Apply aggregation expressions to each group.
agg
¶
Apply aggregation expressions to each group.
Parameters:
-
*agg_exprs(AggExpr, default:()) –One or more aggregation expressions, e.g.
col("amount").sum().alias("total").
Returns:
-
LazyFrame–A new LazyFrame with one row per group.
Raises:
-
TypeError–If any argument is not an AggExpr.
Examples:
>>> import pyfloe as pf
>>> orders = pf.LazyFrame([
... {"region": "EU", "amount": 250},
... {"region": "EU", "amount": 180},
... {"region": "US", "amount": 320},
... ])
>>> orders.group_by("region").agg(
... pf.col("amount").sum().alias("total"),
... pf.col("amount").count().alias("n"),
... ).sort("region").to_pylist()
[{'region': 'EU', 'total': 430, 'n': 2}, {'region': 'US', 'total': 320, 'n': 1}]