Hijacking Python's
Operators with Dunder Methods
In Module 1, we built a lazy DataFrame engine with generators and the
Volcano Model. But we ended on a cliffhanger: our FilterNode
relied on a Python lambda — a black box our engine could
never look inside. Time to fix that.
If we want to build a real query engine — one with an optimizer that can
prune unused columns and reorder operations — we need to see inside
the filter. We need col("age") > 28 to return not
True or False, but a readable data
structure our engine can inspect, rearrange, and eventually
evaluate.
This is the exact "magic" that makes Polars and PySpark so expressive. And to build it, we're going to use one of Python's most powerful features: Dunder Methods.
If you already know how __add__ and __gt__ work,
skim to Lesson 2.2 where we use them to build the
expression AST. The key insight isn't the dunder methods themselves — it's
what we do with them.
Lesson 2.1 Dunder Methods — Python's Secret Protocol
Let's remind ourselves of the problem we need to solve. Here's the filter from Module 1:
df.filter(lambda row: row["age"] > 28)
This lambda works perfectly for execution. But an optimizer needs to answer questions like: "Which columns does this filter use?" and "Can I push this filter down before the join?" With a lambda, the answer to both is: I have no idea. To the engine, a lambda is just an opaque, unreadable blob of Python bytecode.
We want to replace it with something like this:
df.filter(col("age") > 28)
Same intent, same readability — but instead of evaluating instantly,
col("age") > 28 should return a blueprint:
a tree-like data structure that records the operation without
executing it. In computer science, this kind of blueprint is called an
Abstract Syntax Tree — or AST. Let's unpack that.
What is an Abstract Syntax Tree?
You already know what a tree is from everyday life: a thing with a root,
branches, and leaves. An AST is the same idea, applied to an expression.
Take a simple math expression: (2 + 3) * 10. You can draw
it as a tree where the operator sits at each branch and the values sit at
the leaves:
To evaluate this tree, you start at the bottom: grab the 2
and the 3, hand them to the + operator
(result: 5), then hand that 5 and
10 to the * operator (result: 50).
The tree describes the computation without doing it — and you
can walk it from the bottom up whenever you're ready.
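To make this concrete, here is a minimal sketch (not pyfloe code) of the (2 + 3) * 10 tree, using nested tuples as a stand-in for real node classes:

```python
# A minimal sketch: the tree for (2 + 3) * 10 as nested tuples.
# Branches are (operator, left, right); leaves are plain numbers.
tree = ("*", ("+", 2, 3), 10)

def evaluate(node):
    """Walk the tree bottom-up: evaluate children first, then apply the operator."""
    if not isinstance(node, tuple):
        return node  # a leaf: just a value
    op, left, right = node
    lv, rv = evaluate(left), evaluate(right)
    return lv + rv if op == "+" else lv * rv

print(evaluate(tree))  # → 50
```

The tuple version is throwaway, but the shape is exactly what we'll build with real classes below.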
Let's break down the name:
Tree — it's a hierarchy of nodes. Operators are branch nodes (they have children). Values are leaf nodes (they don't).
Syntax — it represents the structure of an
expression. The tree captures what (2 + 3) * 10 means
structurally: "multiply the result of adding 2 and 3 by 10."
Abstract — it throws away surface details. It doesn't
care about parentheses, whitespace, or the fact that you typed
* instead of calling operator.mul. It keeps
only the essential meaning.
This is the expression tree — the second of the two trees introduced in the Introduction. It lives inside plan nodes.
Our goal for the rest of this module: make it so that writing
col("price") * col("quantity") > 1000 doesn't evaluate to
a result — it builds one of these trees.
Normally, Python sees > and immediately computes a value.
We need to hijack that instinct:
Normal Python Evaluates instantly
5 * 4 > 1000
Python computes 20, compares to 1000,
and returns False. The expression is gone — only the
boolean remains.
Our Expr system Builds a tree
lit(5) * lit(4) > lit(1000)
Same math, same numbers — but returns a BinaryExpr
tree with three Lit nodes at the leaves. Nothing is
computed. The engine can inspect it and evaluate it later.
And of course in real code, the leaves are column references instead of
hard-coded numbers — col("price") * col("qty") > 1000.
But the mechanism is identical: the operators build a tree instead of
computing a value.
But how do we make Python's * and >
operators build tree nodes instead of computing values? The answer lies
in a set of hidden methods that Python calls behind the scenes.
Operators are just method calls
Every time you write a + b in Python, you are not
calling some hard-coded addition built into the language. You are calling
a method on the object a. Specifically, Python translates
a + b into a.__add__(b).
# What you write: What Python actually calls:
a + b # a.__add__(b)
a > b # a.__gt__(b)
a == b # a.__eq__(b)
a * b # a.__mul__(b)
-a # a.__neg__()
~a # a.__invert__()
These methods — whose names start and end with double underscores — are
called dunder methods (short for "double underscore").
They are Python's protocol for operator overloading:
the ability to define what +, >,
==, and dozens of other operators do for your
custom objects.
For integers and floats, these dunders perform arithmetic. But here's the key insight: you can define these methods on your own classes to make them do anything.
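A toy illustration (the `Loud` class is hypothetical, purely for demonstration): `__add__` doesn't have to return a number at all.

```python
# __add__ can return anything we like; here, a string describing the operation.
class Loud:
    def __init__(self, name):
        self.name = name

    def __add__(self, other):
        # Python translated `a + b` into `a.__add__(b)`; we decide the result.
        return f"{self.name} plus {other!r}"

print(Loud("x") + 5)  # → "x plus 5"
```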
And that's the trick we need: an operator that builds a new object instead of computing a result.
So what if we define __gt__ on our expression class so
that col("age") > 28 doesn't compare anything — it
returns a new AST node that describes the comparison? Not
True or False — a data structure that says
"compare the left thing to the right thing using greater-than." That
data structure is something our optimizer can read.
Lesson 2.2 Hijacking the Operators — Building an AST
Let's build this step by step. We'll follow the same pattern you'll see throughout this course: start with a simplified version to understand the shape, then reveal pyfloe's real code to see what a production implementation adds.
First, we need the simplest possible
foundation: a base class Expr (for "expression") and a
Col class that represents a column reference.
class Expr:
"""Base class for all expressions."""
pass
class Col(Expr):
"""A column reference expression."""
def __init__(self, name: str):
self.name = name
def col(name: str) -> Col:
return Col(name)
Now, if a user types col("age"), they get a Col
object back. Great. But what happens if they type
col("age") > 28?
Python looks at the Col object on the left, looks at the
integer 28 on the right, and panics. It
doesn't know how to compare a Col to an integer. There is
no __gt__ method defined, so Python throws a
TypeError.
>>> col("age") > 28
# TypeError: '>' not supported between instances of 'Col' and 'int'
This is the moment where dunder methods save us. We need to tell Python:
"When someone writes > on an Expr, don't
try to compare anything. Build a new object that
records the comparison."
That new object is called a BinaryExpr — "binary" because
it has two children (a left side and a right side):
import operator as _op
class BinaryExpr(Expr):
"""Two operands combined with a binary operator."""
def __init__(self, left: Expr, right: Expr, op, op_str: str):
self.left = left # left child (an Expr)
self.right = right # right child (an Expr)
self.op = op # the callable (e.g., operator.gt)
self.op_str = op_str # for display: ">", "+", etc.
It stores four things: the left and right children (which are themselves
Expr objects), the operator as a callable Python function,
and a string representation for debugging.
Now comes the hijack. We go back to our Expr base class and
add a __gt__ method that creates a
BinaryExpr instead of doing a comparison:
class Expr:
def __gt__(self, other):
# Don't compare. Build an AST node!
return BinaryExpr(self, _ensure_expr(other), _op.gt, '>')
Now, watch the magic happen. When a user types
col("age") > 28, Python calls
col("age").__gt__(28). That method doesn't compare anything.
It creates a BinaryExpr with Col("age") as
the left child, 28 (wrapped as a Lit) as the
right child, and operator.gt as the operation.
The result is a tree, not a boolean. We have successfully delayed execution and built an AST!
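Here are the pieces so far assembled into one runnable snippet (a simplified sketch of the classes above, in one place):

```python
import operator as _op

class Expr:
    def __gt__(self, other):
        # Hijacked: build an AST node instead of comparing.
        return BinaryExpr(self, _ensure_expr(other), _op.gt, ">")

class Col(Expr):
    def __init__(self, name):
        self.name = name

class Lit(Expr):
    def __init__(self, value):
        self.value = value

class BinaryExpr(Expr):
    def __init__(self, left, right, op, op_str):
        self.left, self.right, self.op, self.op_str = left, right, op, op_str

def _ensure_expr(val):
    return val if isinstance(val, Expr) else Lit(val)

def col(name):
    return Col(name)

expr = col("age") > 28
print(type(expr).__name__)       # → BinaryExpr
print(type(expr.left).__name__)  # → Col
print(expr.right.value)          # → 28
```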
The same hijack extends to __add__,
__eq__, __and__, and more. Polars, PySpark, and SQLAlchemy use
the same approach. Every col("price") * 0.9 in Polars builds
an AST node behind the scenes via a dunder method.
The Lit class and _ensure_expr
You may have noticed the _ensure_expr(other) call. When a
user writes col("age") > 28, the 28 is a plain
Python integer, not an Expr object. But our
BinaryExpr expects both children to be expressions. So we
need a literal value class and a tiny helper to auto-wrap raw values:
class Lit(Expr):
"""A literal value expression — always returns the same constant."""
def __init__(self, value):
self.value = value
def _ensure_expr(val) -> Expr:
"""Wrap raw Python values as Lit expressions."""
if isinstance(val, Expr):
return val # already an expression — leave it alone
return Lit(val) # wrap the raw value as a literal
This is why you can write col("price") * 0.9 without
explicitly calling lit(0.9). The dunder method calls
_ensure_expr(0.9), which wraps it automatically. A small
convenience — but it's what makes the API feel natural.
Polars does the same thing — col("price") * 0.9 auto-wraps
0.9 as a literal via the __mul__ dunder. pyfloe's
_ensure_expr is the same trick.
pyfloe also provides public constructor functions so the API reads cleanly:
def col(name: str) -> Col:
"""Create a column reference expression."""
return Col(name)
def lit(value) -> Lit:
"""Create a literal value expression."""
return Lit(value)
These mirror the functions you'd use with Polars:
pf.col and pf.lit (after import pyfloe as pf) read just like pl.col and pl.lit. That's not a coincidence.
import pyfloe as pf
# When you write this, Python calls __mul__ then __gt__.
# No computation happens — it builds a TREE of objects.
expr = pf.col("price") * pf.col("quantity") > 1000
# The result is a BinaryExpr, not True/False!
print("Expression:", repr(expr))
print("Type:", type(expr).__name__)
print()
# Walk the tree: the root is ">"
print(f"Root operator: {expr.op_str}")
print(f" Left child: {repr(expr.left)} (type: {type(expr.left).__name__})")
print(f" Right child: {repr(expr.right)} (type: {type(expr.right).__name__})")
print()
# The left child is itself a tree: "*"
mul_node = expr.left
print(f"Drilling into the '*' node:")
print(f" Left: {repr(mul_node.left)} → column reference")
print(f" Right: {repr(mul_node.right)} → column reference")
print()
# The tree knows which columns it needs — no execution required
print("Required columns:", expr.required_columns())
print()
# Now evaluate it against an actual row
row = ("Widget", 25.0, 4)
col_map = {"name": 0, "price": 1, "quantity": 2}
result = expr.eval(row, col_map)
print(f"eval({row}) → 25.0 * 4 = 100.0, then 100.0 > 1000 → {result}")
Try it yourself: build pf.col("price") + pf.col("tax") > pf.col("budget").
How many nodes does the tree have? What does
required_columns() return? Can you predict the structure
before running it? Change the data and see if eval()
gives you what you expect.
The full operator table
With __gt__ working, adding every other operator is
mechanical — each one is a one-liner that builds a new
expression node. Sixteen operators, all following the same pattern:
All operator overloads — same pattern, different operators
class Expr:
# Comparisons
def __gt__(self, other): return BinaryExpr(self, _ensure_expr(other), _op.gt, '>')
def __ge__(self, other): return BinaryExpr(self, _ensure_expr(other), _op.ge, '>=')
def __lt__(self, other): return BinaryExpr(self, _ensure_expr(other), _op.lt, '<')
def __le__(self, other): return BinaryExpr(self, _ensure_expr(other), _op.le, '<=')
def __eq__(self, other): return BinaryExpr(self, _ensure_expr(other), _op.eq, '==')
def __ne__(self, other): return BinaryExpr(self, _ensure_expr(other), _op.ne, '!=')
# Arithmetic
def __add__(self, other): return BinaryExpr(self, _ensure_expr(other), _op.add, '+')
def __sub__(self, other): return BinaryExpr(self, _ensure_expr(other), _op.sub, '-')
def __mul__(self, other): return BinaryExpr(self, _ensure_expr(other), _op.mul, '*')
def __truediv__(self, other): return BinaryExpr(self, _ensure_expr(other), _op.truediv, '/')
def __floordiv__(self, other): return BinaryExpr(self, _ensure_expr(other), _op.floordiv, '//')
def __mod__(self, other): return BinaryExpr(self, _ensure_expr(other), _op.mod, '%')
# Unary
def __neg__(self): return UnaryExpr(self, _op.neg, '-')
def __invert__(self): return UnaryExpr(self, _op.not_, '~')
# Logical
def __and__(self, other): return BinaryExpr(self, _ensure_expr(other), lambda a,b: a and b, '&')
def __or__(self, other): return BinaryExpr(self, _ensure_expr(other), lambda a,b: a or b, '|')
The next two edge cases — reverse operators and the __bool__ trap — are completeness details. They're useful if you run into them, but not needed for Module 5.
Edge case: reverse operators
There's a subtle problem. What happens when the expression is written "backwards" — with the plain value on the left?
100 + col("amount")
# Python tries: (100).__add__(col("amount"))
# int doesn't know how to add an Expr → returns NotImplemented
When int.__add__ returns NotImplemented,
Python has a fallback: it tries the right operand's
reverse dunder — col("amount").__radd__(100).
pyfloe defines these so both directions work:
class Expr:
def __radd__(self, other):
return BinaryExpr(_ensure_expr(other), self, _op.add, '+')
def __rsub__(self, other):
return BinaryExpr(_ensure_expr(other), self, _op.sub, '-')
def __rmul__(self, other):
return BinaryExpr(_ensure_expr(other), self, _op.mul, '*')
def __rtruediv__(self, other):
return BinaryExpr(_ensure_expr(other), self, _op.truediv, '/')
Notice the order is flipped: _ensure_expr(other) goes on
the left, self on the right. Now 100 + col("amount")
correctly builds a BinaryExpr with Lit(100) on
the left and Col("amount") on the right.
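A minimal sketch of Python's reverse-operator fallback, using a hypothetical `Amount` class that returns strings so the call path is visible:

```python
class Amount:
    def __init__(self, n):
        self.n = n

    def __add__(self, other):
        return f"Amount({self.n}) + {other}"

    def __radd__(self, other):
        # Called when the LEFT operand's __add__ returned NotImplemented:
        # 100 + Amount(5) → (100).__add__(Amount(5)) → NotImplemented
        #                 → Amount(5).__radd__(100)
        return f"{other} + Amount({self.n})"

print(Amount(5) + 100)  # __add__  → "Amount(5) + 100"
print(100 + Amount(5))  # __radd__ → "100 + Amount(5)"
```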
Edge case: the __bool__ trap
There's one more dangerous trap. Python's and and
or keywords don't call __and__ or
__or__. They call __bool__ and short-circuit.
So if someone writes col("a") > 5 and col("b") < 10
(using the keyword and instead of the operator
&), Python tries to convert the expression to a
boolean — which makes no sense for an AST node.
pyfloe prevents this with a guardrail:
class Expr:
def __bool__(self):
raise TypeError(
'Cannot use an Expr as a Python bool. '
'Use & / | for logical ops, and .collect() to evaluate.'
)
If you've ever seen Polars raise a similar "ambiguous truth value" error, that was its own
__bool__ guard. You used and (which calls
__bool__) instead of & (which calls
__and__). Same mechanism, same fix.
Try it yourself: type col("age") > 28 and col("region") == "EU" in the code
editor. Read the error message. Now replace and with
&. This is the single most common mistake Polars users
make — and now you know exactly why it happens.
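A self-contained demonstration of the trap and the guardrail (simplified stand-in for the real classes):

```python
class Expr:
    def __bool__(self):
        raise TypeError("Cannot use an Expr as a Python bool. Use & / |.")

    def __and__(self, other):
        return "built an AND node"  # stand-in for a real BinaryExpr

a, b = Expr(), Expr()
print(a & b)   # __and__ runs → "built an AND node"
try:
    a and b    # `and` calls bool(a) to decide short-circuiting
except TypeError as e:
    print(e)   # the guardrail fires before anything silently goes wrong
```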
Free bonus: operator precedence
Because pyfloe piggybacks on Python's native operators, it gets
operator precedence for free. Just as *
binds tighter than + in arithmetic
(1 + 2 * 3 means 1 + (2 * 3)), Python's
& (AND) binds tighter than | (OR). This
means compound filters work like you'd expect:
# Arithmetic precedence:
1 + 2 * 3 # means 1 + (2 * 3) = 7, not (1 + 2) * 3 = 9
# Logical precedence — same idea:
a | b & c # means a | (b & c), not (a | b) & c
# In practice:
(col("vip") == True) | (col("region") == "EU") & (col("amount") > 100)
# The & binds first, then the |
# = vip OR (region is EU AND amount > 100)
pyfloe doesn't need to implement any precedence logic itself — Python
handles the order in which the dunder methods are called. When Python
parses a | b & c, it calls b.__and__(c)
first, then passes that result to a.__or__(...). The
expression tree is built in the correct order automatically.
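You can watch the call order with a small tracing class (hypothetical, just for the demo):

```python
# Record the order in which Python calls the dunders for `a | b & c`.
calls = []

class T:
    def __init__(self, name):
        self.name = name

    def __and__(self, other):
        calls.append(f"{self.name}.__and__")
        return T(f"({self.name} & {other.name})")

    def __or__(self, other):
        calls.append(f"{self.name}.__or__")
        return T(f"({self.name} | {other.name})")

a, b, c = T("a"), T("b"), T("c")
result = a | b & c   # & binds tighter than |
print(calls)         # → ['b.__and__', 'a.__or__']
print(result.name)   # → (a | (b & c))
```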
One caveat: in Python's grammar, & and
| have higher precedence than comparison
operators like > and ==. This means
col("age") > 18 & col("active") == True is parsed as
col("age") > (18 & col("active")) == True — completely
wrong. That's why you always need parentheses around each comparison:
(col("age") > 18) & (col("active") == True). This isn't
a pyfloe quirk — it's a Python precedence rule, and Polars has
the exact same requirement.
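The stdlib ast module can confirm the parse. In x > 18 & y, the & grabs 18 & y before the comparison ever happens:

```python
import ast

# Parse the expression and inspect the resulting syntax tree.
tree = ast.parse("x > 18 & y", mode="eval").body

print(type(tree).__name__)              # → Compare  (the > is the root)
right = tree.comparators[0]
print(type(right).__name__)             # → BinOp    (the right side is 18 & y)
print(type(right.op).__name__)          # → BitAnd
```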
The UnaryExpr — for single-operand operations
Not every operation has two sides. Negation (-col("x")) and
logical NOT (~col("active")) have only one operand. pyfloe
handles these with UnaryExpr:
class UnaryExpr(Expr):
"""A unary operator applied to a single operand."""
def __init__(self, operand: Expr, op, op_str: str):
self.operand = operand
self.op = op
self.op_str = op_str
UnaryExpr is also reused for method-style operations like
col("x").is_null() and string methods like
col("name").str.upper(). The operand is the inner
expression, and the op is whatever function should be
applied. One class, many uses.
Production expression systems go much further: string methods
(.str.upper(), .str.contains()), datetime accessors
(.dt.year(), .dt.truncate()), conditional
logic (when(...).otherwise(...)), and window functions
(row_number().over(...)). We won't build all of these —
but they all follow the exact same pattern you've just learned:
a method call returns a new Expr node instead of
computing a value. Once you understand BinaryExpr and
UnaryExpr, the rest is just more of the same.
Lesson 2.3
Giving the AST a Voice — The eval() Method
Building an AST is great. We can represent col("age") > 28
as a tree of objects instead of an opaque lambda. But eventually, our
engine actually needs to process data. It needs to hand
a real row to this tree and get a result back: True or
False, a number, a string.
We do this by adding an eval() method to every node in the
tree. Let's walk through each one.
How Col evaluates: looking up a column
When we evaluate a Col expression, we just need to extract
that column's value from the row. But remember — pyfloe stores rows as
tuples, not dictionaries. So we need a col_map that tells
us which index corresponds to which column name:
class Col(Expr):
def __init__(self, name: str):
self.name = name
def eval(self, row, col_map):
# col_map tells us: "age" → index 1
return row[col_map[self.name]]
Let's trace it. If row = ("Alice", 30, "EU") and
col_map = {"name": 0, "age": 1, "region": 2}, then
Col("age").eval(row, col_map) does:
row[col_map["age"]]
row[1]
# → 30
The col_map dictionary translates column names
to integer indices once, and then every row lookup is just
row[index].
How Lit evaluates: always the same
A Lit expression is even simpler. It ignores the row
entirely and always returns its stored constant:
class Lit(Expr):
def eval(self, row, col_map):
return self.value # always 28, always "EU", always 3.14
How BinaryExpr evaluates: ask the children
Here's where the recursive magic happens. A BinaryExpr
doesn't know what its children are. It doesn't care whether the left
side is a Col, a Lit, or another
BinaryExpr. It just asks each child to evaluate itself,
then applies its operator to the results:
class BinaryExpr(Expr):
def eval(self, row, col_map):
# Ask the children to evaluate themselves recursively
lv = self.left.eval(row, col_map)
rv = self.right.eval(row, col_map)
# Handle NULLs gracefully
if lv is None or rv is None:
return None
# Apply the operator (e.g., operator.gt(30, 28) → True)
return self.op(lv, rv)
That None check is important. In real-world data, values
are frequently missing. pyfloe follows the SQL convention: any
comparison involving None returns None, not
True or False. This is something a naive
lambda-based filter would never handle correctly.
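Here is the NULL behavior in a runnable, self-contained sketch (simplified versions of the classes above):

```python
import operator as _op

class Col:
    def __init__(self, name): self.name = name
    def eval(self, row, col_map): return row[col_map[self.name]]

class Lit:
    def __init__(self, value): self.value = value
    def eval(self, row, col_map): return self.value

class BinaryExpr:
    def __init__(self, left, right, op):
        self.left, self.right, self.op = left, right, op
    def eval(self, row, col_map):
        lv = self.left.eval(row, col_map)
        rv = self.right.eval(row, col_map)
        if lv is None or rv is None:
            return None  # SQL convention: NULL poisons the comparison
        return self.op(lv, rv)

expr = BinaryExpr(Col("age"), Lit(28), _op.gt)
col_map = {"age": 0}
print(expr.eval((30,), col_map))    # → True
print(expr.eval((None,), col_map))  # → None, not False
```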
Tracing a compound expression
Let's trace a complete expression through construction and evaluation. Consider:
expr = (col("price") * col("quantity")) > 1000
Step 1 — Build the tree. col("price")
creates Col("price"). col("quantity") creates
Col("quantity"). The * operator calls
__mul__, producing a BinaryExpr(*, ...). Then
the > operator calls __gt__, wrapping
everything in a second BinaryExpr(>, ...):
BinaryExpr(op=">")
├── left: BinaryExpr(op="*")
│ ├── left: Col("price")
│ └── right: Col("quantity")
└── right: Lit(1000)
No data has been touched. No computation has happened. We've just built a description of what to compute.
Pause and predict: given
row = ("Widget", 25.0, 4) and the expression
(col("price") * col("quantity")) > 1000, what will
the result be? Work it out in your head: what's 25.0 × 4, and
is that greater than 1000? Then read on.
Step 2 — Evaluate it. The engine has a row:
row = ("Widget", 25.0, 4) and
col_map = {"name": 0, "price": 1, "quantity": 2}.
It calls expr.eval(row, col_map) and the tree walks itself:
# Root: BinaryExpr(op=">") asks its children:
# Left child: BinaryExpr(op="*") asks ITS children:
# Col("price").eval(...) → row[1] → 25.0
# Col("quantity").eval(...) → row[2] → 4
# result: 25.0 * 4 = 100.0
# Right child: Lit(1000).eval(...) → 1000
# Root applies its operator: 100.0 > 1000 → False
Each node only knows how to get its children's values and apply its own
operator. The root doesn't know anything about "price" or
"quantity" — it just asks its left child for a number and
its right child for a number. This recursive self-similarity is what
makes ASTs so elegant — and so composable.
Lesson 2.4 The Payoff — Inspectable Expressions
We've built the expression tree. We've given it a voice with
eval(). But we haven't yet delivered on the
original promise from Module 1: the reason we did all this
work instead of just using lambdas.
Remember the question that an optimizer needs to answer: "Which columns does this expression actually use?" With a lambda, we were stuck. With the AST, the answer writes itself.
required_columns: the method lambdas can't provide
We add one more method to each node. The logic is beautifully simple: leaf nodes know what they need, and branch nodes combine their children's needs:
class Col(Expr):
def required_columns(self):
return {self.name} # I need exactly one column
class Lit(Expr):
def required_columns(self):
return set() # I need nothing — I'm a constant
class BinaryExpr(Expr):
def required_columns(self):
# Combine the requirements of both sides
return self.left.required_columns() | self.right.required_columns()
class UnaryExpr(Expr):
def required_columns(self):
return self.operand.required_columns()
Now watch. Our optimizer can see the future:
expr = (col("price") * col("quantity")) > 1000
expr.required_columns()
# BinaryExpr(>).required_columns()
# → BinaryExpr(*).required_columns() | Lit(1000).required_columns()
# → ({"price"} | {"quantity"}) | set()
# → {"price", "quantity"}
Two columns. If the CSV has 50 columns, the optimizer can now tell the scan node: "Only read columns 1 and 2. Skip the other 48."
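You can verify the whole cascade with a self-contained sketch (simplified versions of the classes above):

```python
import operator as _op

class Expr:
    def __mul__(self, other): return BinaryExpr(self, other, _op.mul)
    def __gt__(self, other):
        other = other if isinstance(other, Expr) else Lit(other)
        return BinaryExpr(self, other, _op.gt)

class Col(Expr):
    def __init__(self, name): self.name = name
    def required_columns(self): return {self.name}

class Lit(Expr):
    def __init__(self, value): self.value = value
    def required_columns(self): return set()

class BinaryExpr(Expr):
    def __init__(self, left, right, op):
        self.left, self.right, self.op = left, right, op
    def required_columns(self):
        # Union of both children's requirements
        return self.left.required_columns() | self.right.required_columns()

expr = (Col("price") * Col("quantity")) > 1000
print(sorted(expr.required_columns()))  # → ['price', 'quantity']
```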
We have kept the natural, readable user-facing syntax —
col("price") * col("quantity") > 1000. And at the same
time, we have built a transparent data structure that our future
Query Optimizer (Module 5) can use to aggressively prune columns and
push filters down the execution tree. The same object powers
both the user experience and the optimizer. That's what makes this
architecture so elegant.
Lambda vs. expression — the full picture
Here's the complete comparison — why we did all this work:
Lambda Black box
lambda row: row["price"] * row["qty"] > 1000
The engine can call it, but can't look inside. It doesn't
know which columns are used. It can't rearrange or optimize anything.
Expression Inspectable tree
(col("price") * col("qty")) > 1000
The engine can walk the tree, list the columns ({"price", "qty"}),
push the filter down before a join, and prune unused columns from the scan.
Debugging with __repr__
One more dunder makes development much nicer. Each expression class
implements __repr__ so that printing an expression reveals
its full structure:
class Col(Expr):
def __repr__(self):
return f'col("{self.name}")'
class Lit(Expr):
def __repr__(self):
return repr(self.value)
class BinaryExpr(Expr):
def __repr__(self):
return f'({self.left} {self.op_str} {self.right})'
Because __repr__ is recursive — BinaryExpr's
repr calls repr(self.left) and repr(self.right),
which call their own __repr__ — you get a full round-trip
view of the tree:
>>> (col("price") * col("quantity")) > 1000
((col("price") * col("quantity")) > 1000)
>>> (col("age") > 18) & (col("active") == True)
((col("age") > 18) & (col("active") == True))
Three methods — __gt__, eval, and
__repr__ — all follow the same recursive pattern, each
cascading through the same tree, each producing a different kind of
output: a tree of objects, a boolean, and a string. The AST is
the common backbone.
And all of it exists so users can write
col("a") + col("b") instead of verbose constructor calls like
BinaryExpr(Col("a"), Col("b"), operator.add, "+").
Deep dive — From our version to pyfloe's real code
Throughout this module, we built simplified versions of each class to understand the shape of the expression system. pyfloe's actual source code has the same shape — but adds production polish. Here's what the real code adds that we skipped over:
__slots__ — You may have noticed
__slots__ = ('name',) on pyfloe's Col class
and __slots__ = ('left', 'right', 'op', 'op_str') on
BinaryExpr. Normally, every Python object carries a hidden
__dict__ dictionary to store its attributes. That
dictionary is flexible (you can add any attribute at runtime), but it
costs about 100 extra bytes per object. When you declare
__slots__, Python skips the __dict__ entirely
and stores attributes in a fixed-size internal array — roughly 40% less
memory per instance.
Why does that matter for expressions? Each node in the AST is tiny —
just a few attributes. But a complex query might create thousands of
expression nodes. And when the engine evaluates that expression across
millions of rows, every byte counts. __slots__ is a
one-line declaration that eliminates that overhead for free.
class Col(Expr):
__slots__ = ('name',) # only this attribute — no __dict__
class Lit(Expr):
__slots__ = ('value',) # ~100 bytes saved per instance
class BinaryExpr(Expr):
__slots__ = ('left', 'right', 'op', 'op_str')
import sys
class WithSlots:
__slots__ = ("key", "value")
def __init__(self, key: str, value: int):
self.key = key
self.value = value
class WithoutSlots:
def __init__(self, key: str, value: int):
self.key = key
self.value = value
a = WithSlots("price", 42)
b = WithoutSlots("price", 42)
# Memory comparison
print(f"With __slots__: {sys.getsizeof(a)} bytes")
print(f"Without __slots__: {sys.getsizeof(b)} + {sys.getsizeof(b.__dict__)} bytes (__dict__)")
print()
# __dict__ exists only on the non-slots class
print(f"b.__dict__ = {b.__dict__}")
try:
print(f"a.__dict__ = {a.__dict__}")
except AttributeError as e:
print(f"a.__dict__ → AttributeError: {e}")
print()
# You can add arbitrary attributes without slots...
b.new_attr = "surprise!"
print(f"b.new_attr = {b.new_attr!r} (anything goes)")
# ...but slots locks it down
try:
a.new_attr = "nope"
except AttributeError as e:
print(f"a.new_attr → AttributeError: {e}")
print()
# At scale this matters: 10,000 expression nodes
slots_total = sum(sys.getsizeof(WithSlots("x", i)) for i in range(10_000))
dict_total = sum(sys.getsizeof(WithoutSlots("x", i)) + sys.getsizeof(WithoutSlots("x", i).__dict__) for i in range(10_000))
print(f"10k slots objects: {slots_total:>10,} bytes")
print(f"10k dict objects: {dict_total:>10,} bytes")
print(f"Savings: {dict_total - slots_total:>10,} bytes ({(dict_total - slots_total) / dict_total:.0%})")
compile() — In addition to eval(),
pyfloe has a compile(col_map) method that pre-bakes the
column index lookup into a closure. Instead of doing a dictionary lookup
on every row, the compiled version is lambda row: row[1]
— the index 1 is already baked in. This is significantly
faster for tight loops over millions of rows.
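A minimal sketch of the compile-to-closure idea (not pyfloe's actual code; the function names here are made up for illustration):

```python
import operator as _op

def compile_col(name, col_map):
    i = col_map[name]           # the dictionary lookup happens here, ONCE
    return lambda row: row[i]   # the index is baked into the closure

def compile_gt(left_fn, right_fn):
    # Combine two compiled children into a compiled comparison.
    return lambda row: _op.gt(left_fn(row), right_fn(row))

col_map = {"name": 0, "price": 1, "quantity": 2}
compiled = compile_gt(compile_col("price", col_map), lambda row: 20.0)

print(compiled(("Widget", 25.0, 4)))  # → True  (25.0 > 20.0)
```

Per row, the compiled version does only an index access and a comparison; the name-to-index resolution is paid once at compile time.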
output_dtype() — Each expression can
report what type it will produce (bool for comparisons,
float for division, etc.). This enables type inference:
pyfloe can tell you the schema of a query result before running
any data through it.
In Module 3, FilterNode receives an Expr, compiles
it, and evaluates it against every row flowing through the volcano
pipeline. We'll also tackle a performance problem — calling
eval() once per row is brutally slow in Python — and
solve it with batched execution.
Exercises Check Your Understanding
Before moving to Module 3, make sure these concepts are solid:
Quick check
1. When you write col("age") > 28, what does Python actually call?
2. What does _ensure_expr(42) return?
3. What does (col("a") + col("b")).required_columns() return?
4. Why does pyfloe define __radd__ in addition to __add__?
5. Why does Expr define __bool__ to raise a TypeError?
6. You build two expressions: e1 = col("x") + col("y") and
e2 = col("z") > 10. What does
e1.required_columns() | e2.required_columns() return?
Challenge Build a NegExpr Node
You've built BinaryExpr for two-operand operations. Now build
NegExpr — a unary expression that negates its child.
It should support -col("price") by implementing the
__neg__ dunder method on Expr.
Think about:
- How many children does a unary node have? (Just one.)
- What should
required_columns()return? - What should
eval(row)do?
from __future__ import annotations
from collections.abc import Callable
class Expr:
def required_columns(self) -> set[str]:
return set()
def eval(self, row: dict):
raise NotImplementedError
def __neg__(self):
# TODO: return a NegExpr wrapping self
pass
def __gt__(self, other):
return BinaryExpr(self, other if isinstance(other, Expr) else Lit(other), lambda a, b: a > b, ">")
class Col(Expr):
def __init__(self, name: str):
self.name = name
def required_columns(self) -> set[str]:
return {self.name}
def eval(self, row: dict):
return row[self.name]
class Lit(Expr):
def __init__(self, value: int | float | str):
self.value = value
def eval(self, row: dict):
return self.value
class BinaryExpr(Expr):
def __init__(self, left: Expr, right: Expr, op: Callable, op_str: str):
self.left, self.right, self.op, self.op_str = left, right, op, op_str
def required_columns(self) -> set[str]:
return self.left.required_columns() | self.right.required_columns()
def eval(self, row: dict):
return self.op(self.left.eval(row), self.right.eval(row))
class NegExpr(Expr):
def __init__(self, child: Expr):
self.child = child
def required_columns(self) -> set[str]:
# TODO
pass
def eval(self, row: dict):
# TODO
pass
# Test it:
row = {"price": 42, "discount": 5}
neg_price = -Col("price")
print(f"-col('price') => type: {type(neg_price).__name__}")
print(f"-col('price') => eval: {neg_price.eval(row)}")
print(f"-col('price') => cols: {neg_price.required_columns()}")
# Compose it: -col("price") > -50
expr = -Col("price") > -50
print(f"\n-col('price') > -50 => {expr.eval(row)}")
Check yourself: what does (-col("price")).required_columns() return? What about
(col("a") + col("b")) > (col("c") * col("d"))? Trace the
recursive required_columns() calls. (Answer:
{"price"} and {"a", "b", "c", "d"}.)
Source References
expr.py — Expression AST
Expr— base class for all expressionsCol— column reference expressionBinaryExpr— two operands combined with a binary operatorLit— literal value expressionrequired_columns()— returns the set of columns an expression needs