Query Plan Nodes¶
Internal plan node classes that form the query execution tree. Each node implements the volcano / iterator model — data flows upward through the tree as parent nodes pull rows from their children.
These classes are not part of the public API but are documented here for users who want to understand or extend the query engine internals.
PlanNode¶
PlanNode
¶
Abstract base class for all query plan nodes.
Every node in the query plan tree inherits from PlanNode and
implements the volcano / iterator execution model. Data flows
upward through the tree: a parent node pulls rows from its children
by calling execute() or execute_batched().
Methods:
-
schema–Return the output schema of this node.
-
execute–Yield rows one at a time by flattening batched output.
-
execute_batched–Yield rows in batches of up to
_BATCH_SIZE. -
fast_count–Return the exact row count without executing, if known.
-
children–Return the direct child nodes of this plan node.
-
explain–Return a human-readable representation of the plan tree.
schema
¶
Return the output schema of this node.
Returns:
-
LazySchema–The schema describing columns and types produced by this node.
execute
¶
Yield rows one at a time by flattening batched output.
Returns:
-
Iterator[tuple]–An iterator of tuples, each representing one row.
execute_batched
¶
Yield rows in batches of up to _BATCH_SIZE.
Returns:
-
Iterator[list[tuple]]–An iterator of lists, where each list contains up to 1024 row tuples.
fast_count
¶
Return the exact row count without executing, if known.
Returns:
-
int | None–The row count, or
Noneif it cannot be determined cheaply.
children
¶
Return the direct child nodes of this plan node.
Returns:
-
list[PlanNode]–A list of child
PlanNodeinstances (empty for leaf nodes).
explain
¶
Return a human-readable representation of the plan tree.
Parameters:
-
indent(int, default:0) –Current indentation level (used for recursive calls).
Returns:
-
str–A multi-line string showing this node and all descendants.
Data Source Nodes¶
ScanNode¶
ScanNode
¶
Bases: PlanNode
Leaf node that reads from an in-memory list of row tuples.
This is the most common data source node, created when a LazyFrame
is constructed from Python data. It supports fast_count because
the full dataset is already materialised.
Parameters:
-
data(list) –List of row tuples.
-
columns(list[str]) –Column names corresponding to tuple positions.
-
lazy_schema(LazySchema | None, default:None) –Optional pre-computed schema; inferred from data if omitted.
IteratorSourceNode¶
IteratorSourceNode
¶
Bases: PlanNode
Leaf node that reads from a lazily-evaluated iterator factory.
Each call to execute_batched invokes the factory to produce a
fresh iterator, enabling repeatable reads from streaming sources
such as file readers.
Parameters:
-
columns(list[str]) –Column names for the produced rows.
-
lazy_schema(LazySchema) –Schema describing the output columns.
-
iterator_factory(Callable[[], Iterator[tuple]]) –A zero-argument callable that returns an iterator of row tuples.
-
source_label(str, default:'Iterator') –Descriptive label shown in
explainoutput.
Transform Nodes¶
ProjectNode¶
ProjectNode
¶
Bases: PlanNode
Selects or reorders columns, or evaluates computed expressions.
When columns is provided, only those columns are kept (a SQL SELECT).
When exprs is provided, each expression is evaluated to produce a new
set of output columns.
Parameters:
-
child(PlanNode) –Input plan node.
-
columns(list[str] | None, default:None) –Column names to select. Mutually exclusive with exprs.
-
exprs(list[Expr] | None, default:None) –Expressions to evaluate. Mutually exclusive with columns.
FilterNode¶
FilterNode
¶
WithColumnNode¶
WithColumnNode
¶
Bases: PlanNode
Appends a new computed column to the output.
Evaluates expr for every row and adds the result as a new column named name. The output schema is the parent schema plus the new column.
Parameters:
-
child(PlanNode) –Input plan node.
-
name(str) –Name for the new column.
-
expr(Expr) –Expression to evaluate for each row.
ApplyNode¶
ApplyNode
¶
Bases: PlanNode
Applies a scalar function to one or more columns.
When columns is provided, only those columns are transformed; otherwise func is applied to every value in every column.
Parameters:
-
child(PlanNode) –Input plan node.
-
func(Callable[..., Any]) –Callable applied element-wise to each value.
-
columns(list[str] | None, default:None) –Columns to transform. If
None, all columns are transformed. -
output_dtype(type | None, default:None) –Optional output type to record in the schema.
RenameNode¶
RenameNode
¶
SortNode¶
SortNode
¶
Bases: PlanNode
Sorts all rows by one or more columns using Timsort.
Materialises the full input before sorting. Multi-column sorts are composed via sequential stable sorts in reverse priority order.
Parameters:
-
child(PlanNode) –Input plan node.
-
by(list[str]) –Column names to sort by.
-
ascending(list[bool] | None, default:None) –Sort direction per column. Defaults to ascending for all.
Join Nodes¶
JoinNode¶
JoinNode
¶
Bases: PlanNode
Hash-based join of two input plan nodes.
Materialises the right side into a hash table keyed on the join
columns, then streams the left side and probes the table.
Supports inner, left, and full join modes.
Parameters:
-
left(PlanNode) –Left input plan node.
-
right(PlanNode) –Right input plan node.
-
left_on(list[str]) –Join-key column names from the left side.
-
right_on(list[str]) –Join-key column names from the right side.
-
how(JoinHow, default:'inner') –Join type —
'inner','left', or'full'.
SortedMergeJoinNode¶
SortedMergeJoinNode
¶
Bases: PlanNode
Sort-merge join for pre-sorted inputs.
Both inputs must be sorted on their respective join keys. Two cursors advance in lockstep, emitting matches when keys are equal. Memory usage is O(1) for one-to-one joins and O(g) for groups with many-to-many matches.
Parameters:
-
left(PlanNode) –Left input plan node (sorted by left_on).
-
right(PlanNode) –Right input plan node (sorted by right_on).
-
left_on(list[str]) –Join-key column names from the left side.
-
right_on(list[str]) –Join-key column names from the right side.
-
how(JoinHow, default:'inner') –Join type —
'inner','left', or'full'.
Aggregation Nodes¶
AggNode¶
AggNode
¶
Bases: PlanNode
Hash-based group-by aggregation node.
Groups rows by group_by columns using a hash map and maintains running accumulators for each aggregation expression. Memory usage is O(k) where k is the number of distinct groups.
Parameters:
-
child(PlanNode) –Input plan node.
-
group_by(list[str]) –Column names to group by.
-
agg_exprs(list[AggExpr]) –Aggregation expressions to evaluate per group.
SortedAggNode¶
SortedAggNode
¶
Bases: PlanNode
Streaming group-by aggregation for pre-sorted input.
Assumes the child yields rows already sorted by the group_by columns. Groups are detected by key changes, so only one accumulator set is active at a time — O(1) memory per group.
Parameters:
-
child(PlanNode) –Input plan node (must be pre-sorted by group_by).
-
group_by(list[str]) –Column names to group by.
-
agg_exprs(list[AggExpr]) –Aggregation expressions to evaluate per group.
Reshape Nodes¶
ExplodeNode¶
ExplodeNode
¶
UnpivotNode¶
UnpivotNode
¶
Bases: PlanNode
Converts columns into rows (melt / unpivot).
Keeps id_columns fixed and transforms each value_columns entry
into a (variable, value) row pair, similar to pandas melt
or SQL UNPIVOT.
Parameters:
-
child(PlanNode) –Input plan node.
-
id_columns(list[str]) –Columns to keep as identifiers.
-
value_columns(list[str]) –Columns to unpivot into rows.
-
variable_name(str, default:'variable') –Name for the new column holding the original column names.
-
value_name(str, default:'value') –Name for the new column holding the values.
PivotNode¶
PivotNode
¶
Bases: PlanNode
Converts rows into columns (pivot / spread).
Groups rows by index columns and spreads the distinct values in on into new columns, aggregating values with agg_name. Pivot columns are auto-discovered from the data unless columns is explicitly provided.
Parameters:
-
child(PlanNode) –Input plan node.
-
index(list[str]) –Columns to keep as the row index.
-
on(str) –Column whose distinct values become new column headers.
-
values(str) –Column containing the values to fill the pivoted cells.
-
agg_name(AggFunc, default:'first') –Aggregation function applied when multiple values map to the same cell.
-
columns(list[str] | None, default:None) –Explicit list of pivot columns. Auto-discovered if
None.
Combine Nodes¶
UnionNode¶
UnionNode
¶
Window Nodes¶
WindowNode¶
WindowNode
¶
Bases: PlanNode
Evaluates a window function and appends the result as a new column.
Implements the sort-partition-scan pattern: materialises all rows, partitions by key columns, sorts within each partition, then computes the window expression (rank, cumulative, offset, or aggregate).
Parameters:
-
child(PlanNode) –Input plan node.
-
window_expr(WindowExpr) –The window expression to evaluate.
-
output_name(str) –Name for the new column containing window results.
Optimizer¶
Optimizer
¶
Rule-based query optimizer.
Applies two optimization passes to a query plan:
- Filter pushdown — moves filter nodes closer to data sources, reducing the number of rows processed by downstream operations.
- Column pruning — removes unused columns early in the plan, reducing memory usage.
Methods:
-
optimize–Apply all optimization passes to a plan tree.