Build Your Own DataFrame
Module 2 of 5

Hijacking Python's
Operators with Dunder Methods

In Module 1, we built a lazy DataFrame engine with generators and the Volcano Model. But we ended on a cliffhanger: our FilterNode relied on a Python lambda — a black box our engine could never look inside. Time to fix that.

If we want to build a real query engine — one with an optimizer that can prune unused columns and reorder operations — we need to see inside the filter. We need col("age") > 28 to return not True or False, but a readable data structure our engine can inspect, rearrange, and eventually evaluate.

This is the exact "magic" that makes Polars and PySpark so expressive. And to build it, we're going to use one of Python's most powerful features: Dunder Methods.

If you already know how __add__ and __gt__ work, skip ahead to Lesson 2.2 where we use them to build the expression AST. The key insight isn't the dunder methods themselves — it's what we do with them.

Lesson 2.1 Dunder Methods — Python's Secret Protocol

Let's remind ourselves of the problem we need to solve. Here's the filter from Module 1:

The problem we left off with
df.filter(lambda row: row["age"] > 28)

This lambda works perfectly for execution. But an optimizer needs to answer questions like: "Which columns does this filter use?" and "Can I push this filter down before the join?" With a lambda, the answer to both is: I have no idea. To the engine, a lambda is just an opaque, unreadable blob of Python bytecode.
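You can see this opacity for yourself. Python's standard dis module will dump a lambda's bytecode, and that dump is as close to "inside" as you can get — it is not a structure an optimizer can usefully work with:

```python
import dis

f = lambda row: row["age"] > 28

# The engine can execute the lambda...
print(f({"age": 30}))   # → True

# ...but the only "inside view" Python offers is raw bytecode,
# which says nothing an optimizer can use about columns or operators:
dis.dis(f)
```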

We want to replace it with something like this:

The goal — a readable, inspectable expression
df.filter(col("age") > 28)

Same intent, same readability — but instead of evaluating instantly, col("age") > 28 should return a blueprint: a tree-like data structure that records the operation without executing it. In computer science, this kind of blueprint is called an Abstract Syntax Tree — or AST. Let's unpack that.

What is an Abstract Syntax Tree?

You already know what a tree is from everyday life: a thing with a root, branches, and leaves. An AST is the same idea, applied to an expression. Take a simple math expression: (2 + 3) * 10. You can draw it as a tree where the operator sits at each branch and the values sit at the leaves:

        *
       / \
      +   10
     / \
    2   3

To evaluate this tree, you start at the bottom: grab the 2 and the 3, hand them to the + operator (result: 5), then hand that 5 and 10 to the * operator (result: 50). The tree describes the computation without doing it — and you can walk it from the bottom up whenever you're ready.
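That bottom-up walk is short enough to sketch directly. Here's a throwaway illustration (not pyfloe code) that represents the tree as nested (op, left, right) tuples and evaluates it recursively:

```python
import operator

# The tree for (2 + 3) * 10, as nested (op, left, right) tuples.
tree = (operator.mul, (operator.add, 2, 3), 10)

def evaluate(node):
    """Walk the tree bottom-up: plain numbers are leaves."""
    if not isinstance(node, tuple):
        return node                       # leaf: just a value
    op, left, right = node
    return op(evaluate(left), evaluate(right))

print(evaluate(tree))  # → 50
```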

Let's break down the name:

Tree — it's a hierarchy of nodes. Operators are branch nodes (they have children). Values are leaf nodes (they don't).

Syntax — it represents the structure of an expression. The tree captures what (2 + 3) * 10 means structurally: "multiply the result of adding 2 and 3 by 10."

Abstract — it throws away surface details. It doesn't care about parentheses, whitespace, or the fact that you typed * instead of calling operator.mul. It keeps only the essential meaning.

This is the expression tree — the second of the two trees introduced in the Introduction. It lives inside plan nodes.

Our goal for the rest of this module: make it so that writing col("price") * col("quantity") > 1000 doesn't evaluate to a result — it builds one of these trees.

Normally, Python sees > and immediately computes a value. We need to hijack that instinct:

Normal Python: evaluates instantly

5 * 4 > 1000

Python computes 20, compares to 1000, and returns False. The expression is gone — only the boolean remains.

Our Expr system: builds a tree

lit(5) * lit(4) > lit(1000)

Same math, same numbers — but returns a BinaryExpr tree with three Lit nodes at the leaves. Nothing is computed. The engine can inspect it and evaluate it later.

And of course in real code, the leaves are column references instead of hard-coded numbers — col("price") * col("qty") > 1000. But the mechanism is identical: the operators build a tree instead of computing a value.

But how do we make Python's * and > operators build tree nodes instead of computing values? The answer lies in a set of hidden methods that Python calls behind the scenes.

Operators are just method calls

Every time you write a + b in Python, you are not calling some hard-coded addition built into the language. You are calling a method on the object a. Specifically, Python translates a + b into a.__add__(b).

Python translates operators into method calls
# What you write:        What Python actually calls:
a + b                   # a.__add__(b)
a > b                   # a.__gt__(b)
a == b                  # a.__eq__(b)
a * b                   # a.__mul__(b)
-a                      # a.__neg__()
~a                      # a.__invert__()

These methods — whose names start and end with double underscores — are called dunder methods (short for "double underscore"). They are Python's protocol for operator overloading: the ability to define what +, >, ==, and dozens of other operators do for your custom objects.

For integers and floats, these dunders perform arithmetic. But here's the key insight: you can define these methods on your own classes to make them do anything.

And that's the trick we need: an operator that builds a new object instead of computing a result.
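To make that concrete, here's a deliberately silly toy class (not part of pyfloe) whose __add__ doesn't add anything — it just returns a string describing the call. It demonstrates that a dunder method can return any object whatsoever:

```python
class Loud:
    """A toy class whose + operator describes the call instead of adding."""
    def __init__(self, name):
        self.name = name

    def __add__(self, other):
        # Python rewrote `a + b` into `a.__add__(b)` — we can return anything.
        return f"__add__ called: {self.name} + {other!r}"

a = Loud("a")
print(a + 5)   # → __add__ called: a + 5
```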

The core idea
What if we override __gt__ on our expression class so that col("age") > 28 doesn't compare anything — it returns a new AST node that describes the comparison? Not True or False — a data structure that says "compare the left thing to the right thing using greater-than." That data structure is something our optimizer can read.

Lesson 2.2 Hijacking the Operators — Building an AST

Let's build this step by step. We'll follow the same pattern you'll see throughout this course: start with a simplified version to understand the shape, then reveal pyfloe's real code to see what a production implementation adds.

Reading the code labels
Throughout this course, the colored dots on code snippets tell you what you're looking at: coral = the naive or starting approach, teal = an improved version that solves the limitation, purple = pyfloe's actual source code. The progression is always: understand the concept first, then see how the real library implements it.

First, we need the simplest possible foundation: a base class Expr (for "expression") and a Col class that represents a column reference.

Our starting point — bare minimum
class Expr:
    """Base class for all expressions."""
    pass

class Col(Expr):
    """A column reference expression."""
    def __init__(self, name: str):
        self.name = name

def col(name: str) -> Col:
    return Col(name)

Now, if a user types col("age"), they get a Col object back. Great. But what happens if they type col("age") > 28?

Python looks at the Col object on the left, looks at the integer 28 on the right, and panics. It doesn't know how to compare a Col to an integer. There is no __gt__ method defined, so Python throws a TypeError.

Python panics — no __gt__ defined
>>> col("age") > 28
# TypeError: '>' not supported between instances of 'Col' and 'int'

This is the moment where dunder methods save us. We need to tell Python: "When someone writes > on an Expr, don't try to compare anything. Build a new object that records the comparison."

That new object is called a BinaryExpr — "binary" because it has two children (a left side and a right side):

pyfloe/expr.py — the BinaryExpr node
import operator as _op

class BinaryExpr(Expr):
    """Two operands combined with a binary operator."""

    def __init__(self, left: Expr, right: Expr, op, op_str: str):
        self.left   = left      # left child (an Expr)
        self.right  = right     # right child (an Expr)
        self.op     = op        # the callable (e.g., operator.gt)
        self.op_str = op_str   # for display: ">", "+", etc.

It stores four things: the left and right children (which are themselves Expr objects), the operator as a callable Python function, and a string representation for debugging.

Now comes the hijack. We go back to our Expr base class and add a __gt__ method that creates a BinaryExpr instead of doing a comparison:

pyfloe/expr.py — hijacking the > operator
class Expr:
    def __gt__(self, other):
        # Don't compare. Build an AST node!
        return BinaryExpr(self, _ensure_expr(other), _op.gt, '>')

Now, watch the magic happen. When a user types col("age") > 28, Python calls col("age").__gt__(28). That method doesn't compare anything. It creates a BinaryExpr with Col("age") as the left child, 28 (wrapped as a Lit) as the right child, and operator.gt as the operation.

The result is a tree, not a boolean. We have successfully delayed execution and built an AST!

Show the magic
This is the exact pattern that pyfloe uses for all operators — __add__, __eq__, __and__, and more. Polars, PySpark, and SQLAlchemy use the same approach. Every col("price") * 0.9 in Polars builds an AST node behind the scenes via a dunder method.

The Lit class and _ensure_expr

You may have noticed the _ensure_expr(other) call. When a user writes col("age") > 28, the 28 is a plain Python integer, not an Expr object. But our BinaryExpr expects both children to be expressions. So we need a literal value class and a tiny helper to auto-wrap raw values:

pyfloe/expr.py — Lit and _ensure_expr
class Lit(Expr):
    """A literal value expression — always returns the same constant."""
    def __init__(self, value):
        self.value = value

def _ensure_expr(val) -> Expr:
    """Wrap raw Python values as Lit expressions."""
    if isinstance(val, Expr):
        return val       # already an expression — leave it alone
    return Lit(val)      # wrap the raw value as a literal

This is why you can write col("price") * 0.9 without explicitly calling lit(0.9). The dunder method calls _ensure_expr(0.9), which wraps it automatically. A small convenience — but it's what makes the API feel natural.

Polars does the same thing — col("price") * 0.9 auto-wraps 0.9 as a literal via the __mul__ dunder. pyfloe's _ensure_expr is the same trick.
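If you want to try this without pyfloe installed, here are the pieces so far condensed into one self-contained snippet — a simplified sketch (no eval yet), with the same class names as above:

```python
import operator as _op

class Expr:
    def __gt__(self, other):
        # Don't compare — build an AST node.
        return BinaryExpr(self, _ensure_expr(other), _op.gt, '>')

class Col(Expr):
    def __init__(self, name):
        self.name = name

class Lit(Expr):
    def __init__(self, value):
        self.value = value

class BinaryExpr(Expr):
    def __init__(self, left, right, op, op_str):
        self.left, self.right, self.op, self.op_str = left, right, op, op_str

def _ensure_expr(val):
    return val if isinstance(val, Expr) else Lit(val)

def col(name):
    return Col(name)

expr = col("age") > 28
print(type(expr).__name__)         # BinaryExpr — a tree, not a bool
print(type(expr.right).__name__)   # Lit — 28 was auto-wrapped
```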

pyfloe also provides public constructor functions so the API reads cleanly:

pyfloe/expr.py — convenience constructors
def col(name: str) -> Col:
    """Create a column reference expression."""
    return Col(name)

def lit(value) -> Lit:
    """Create a literal value expression."""
    return Lit(value)

These are the exact same functions you'd use with Polars: pf.col, pf.lit (after import pyfloe as pf). That's not a coincidence.

import pyfloe as pf

# When you write this, Python calls __mul__ then __gt__.
# No computation happens — it builds a TREE of objects.
expr = pf.col("price") * pf.col("quantity") > 1000

# The result is a BinaryExpr, not True/False!
print("Expression:", repr(expr))
print("Type:", type(expr).__name__)
print()

# Walk the tree: the root is ">"
print(f"Root operator: {expr.op_str}")
print(f"  Left child:  {repr(expr.left)}  (type: {type(expr.left).__name__})")
print(f"  Right child: {repr(expr.right)}  (type: {type(expr.right).__name__})")
print()

# The left child is itself a tree: "*"
mul_node = expr.left
print(f"Drilling into the '*' node:")
print(f"  Left:  {repr(mul_node.left)}  → column reference")
print(f"  Right: {repr(mul_node.right)}  → column reference")
print()

# The tree knows which columns it needs — no execution required
print("Required columns:", expr.required_columns())
print()

# Now evaluate it against an actual row
row = ("Widget", 25.0, 4)
col_map = {"name": 0, "price": 1, "quantity": 2}
result = expr.eval(row, col_map)
print(f"eval({row}) → 25.0 * 4 = 100.0, then 100.0 > 1000 → {result}")
Try it yourself
Modify the code above. Build your own expression — try pf.col("price") + pf.col("tax") > pf.col("budget"). How many nodes does the tree have? What does required_columns() return? Can you predict the structure before running it? Change the data and see if eval() gives you what you expect.

The full operator table

With __gt__ working, adding the rest of the operators is mechanical. Each is a one-liner that builds a new BinaryExpr or UnaryExpr node, and all of them follow the same pattern:

All operator overloads — same pattern, different operators
pyfloe/expr.py — all operator overloads
class Expr:
    # Comparisons
    def __gt__(self, other):  return BinaryExpr(self, _ensure_expr(other), _op.gt, '>')
    def __ge__(self, other):  return BinaryExpr(self, _ensure_expr(other), _op.ge, '>=')
    def __lt__(self, other):  return BinaryExpr(self, _ensure_expr(other), _op.lt, '<')
    def __le__(self, other):  return BinaryExpr(self, _ensure_expr(other), _op.le, '<=')
    def __eq__(self, other):  return BinaryExpr(self, _ensure_expr(other), _op.eq, '==')
    def __ne__(self, other):  return BinaryExpr(self, _ensure_expr(other), _op.ne, '!=')

    # Arithmetic
    def __add__(self, other):      return BinaryExpr(self, _ensure_expr(other), _op.add, '+')
    def __sub__(self, other):      return BinaryExpr(self, _ensure_expr(other), _op.sub, '-')
    def __mul__(self, other):      return BinaryExpr(self, _ensure_expr(other), _op.mul, '*')
    def __truediv__(self, other):  return BinaryExpr(self, _ensure_expr(other), _op.truediv, '/')
    def __floordiv__(self, other): return BinaryExpr(self, _ensure_expr(other), _op.floordiv, '//')
    def __mod__(self, other):      return BinaryExpr(self, _ensure_expr(other), _op.mod, '%')

    # Unary
    def __neg__(self):             return UnaryExpr(self, _op.neg, '-')
    def __invert__(self):          return UnaryExpr(self, _op.not_, '~')

    # Logical
    def __and__(self, other):     return BinaryExpr(self, _ensure_expr(other), lambda a,b: a and b, '&')
    def __or__(self, other):      return BinaryExpr(self, _ensure_expr(other), lambda a,b: a or b,  '|')
Deep Cut
Reverse operators and the __bool__ trap are completeness details. They're useful if you encounter these edge cases, but not needed for Module 5.

Edge case: reverse operators

There's a subtle problem. What happens when the expression is written "backwards" — with the plain value on the left?

This would fail without reverse operators
100 + col("amount")
# Python tries: (100).__add__(col("amount"))
# int doesn't know how to add an Expr → returns NotImplemented

When int.__add__ returns NotImplemented, Python has a fallback: it tries the right operand's reverse dunder, col("amount").__radd__(100). pyfloe defines these so both directions work:

pyfloe/expr.py — reverse operators
class Expr:
    def __radd__(self, other):
        return BinaryExpr(_ensure_expr(other), self, _op.add, '+')
    def __rsub__(self, other):
        return BinaryExpr(_ensure_expr(other), self, _op.sub, '-')
    def __rmul__(self, other):
        return BinaryExpr(_ensure_expr(other), self, _op.mul, '*')
    def __rtruediv__(self, other):
        return BinaryExpr(_ensure_expr(other), self, _op.truediv, '/')

Notice the order is flipped: _ensure_expr(other) goes on the left, self on the right. Now 100 + col("amount") correctly builds a BinaryExpr with Lit(100) on the left and Col("amount") on the right.
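The fallback is easy to watch in isolation. This toy class (an illustration, not pyfloe code) records which operands land on which side; when the int on the left gives up, Python calls __radd__ on the right operand:

```python
class Wrapper:
    """Toy class demonstrating Python's reverse-operator fallback."""
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return Wrapper(("add", self.value, other))

    def __radd__(self, other):
        # Called when the LEFT operand (e.g. an int) returned NotImplemented.
        # Note the flipped order: `other` was on the left in the source code.
        return Wrapper(("add", other, self.value))

w = Wrapper("amount")
print((w + 100).value)   # → ('add', 'amount', 100)  via __add__
print((100 + w).value)   # → ('add', 100, 'amount')  int gives up, __radd__ runs
```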

Edge case: the __bool__ trap

There's one more dangerous trap. Python's and and or keywords don't call __and__ or __or__. They call __bool__ and short-circuit. So if someone writes col("a") > 5 and col("b") < 10 (using the keyword and instead of the operator &), Python tries to convert the expression to a boolean — which makes no sense for an AST node.

pyfloe prevents this with a guardrail:

pyfloe/expr.py — preventing accidental bool conversion
class Expr:
    def __bool__(self):
        raise TypeError(
            'Cannot use an Expr as a Python bool. '
            'Use & / | for logical ops, and .collect() to evaluate.'
        )
A common Polars mistake
If you've ever used Polars and gotten the error "the truth value of an Expr is ambiguous", now you know exactly why. It's this __bool__ guard. You used and (which calls __bool__) instead of & (which calls __and__). Same mechanism, same fix.
Try it
Run col("age") > 28 and col("region") == "EU" in the code editor. Read the error message. Now replace and with & (and add parentheses around each comparison). This is the single most common mistake Polars users make.
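Here's the guard in action with a minimal stub (illustration only, not pyfloe's full class). The and keyword trips __bool__ and raises, while & goes through __and__ and builds a node:

```python
class Expr:
    """Stub with a pyfloe-style __bool__ guard."""
    def __bool__(self):
        raise TypeError(
            'Cannot use an Expr as a Python bool. '
            'Use & / | for logical ops, and .collect() to evaluate.'
        )
    def __and__(self, other):
        return ("and", self, other)   # stand-in for a BinaryExpr node

left, right = Expr(), Expr()

try:
    left and right        # `and` short-circuits via __bool__ → TypeError
except TypeError as e:
    print("and failed:", e)

print(type(left & right))  # & calls __and__ and builds a node instead
```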

Free bonus: operator precedence

Because pyfloe piggybacks on Python's native operators, it gets operator precedence for free. Just as * binds tighter than + in arithmetic (1 + 2 * 3 means 1 + (2 * 3)), Python's & (AND) binds tighter than | (OR). This means compound filters work like you'd expect:

& binds tighter than | — just like * binds tighter than +
# Arithmetic precedence:
1 + 2 * 3           # means 1 + (2 * 3) = 7, not (1 + 2) * 3 = 9

# Logical precedence — same idea:
a | b & c           # means a | (b & c), not (a | b) & c

# In practice:
(col("vip") == True) | (col("region") == "EU") & (col("amount") > 100)
# The & binds first, then the |
# = vip OR (region is EU AND amount > 100)

pyfloe doesn't need to implement any precedence logic itself — Python handles the order in which the dunder methods are called. When Python parses a | b & c, it calls b.__and__(c) first, then passes that result to a.__or__(...). The expression tree is built in the correct order automatically.
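You can verify the call order with a small recorder class (an illustration, not pyfloe code). Evaluating a | b & c logs __and__ before __or__, exactly as Python's precedence dictates:

```python
calls = []

class Node:
    """Records which dunder Python calls, and in what order."""
    def __init__(self, name):
        self.name = name
    def __and__(self, other):
        calls.append("__and__")
        return Node(f"({self.name} & {other.name})")
    def __or__(self, other):
        calls.append("__or__")
        return Node(f"({self.name} | {other.name})")

a, b, c = Node("a"), Node("b"), Node("c")
result = a | b & c

print(calls)        # → ['__and__', '__or__']  (& first, then |)
print(result.name)  # → (a | (b & c))
```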

Watch out for comparisons
There is one gotcha. Python's bitwise operators & and | have higher precedence than comparison operators like > and ==. This means col("age") > 18 & col("active") == True is parsed as col("age") > (18 & col("active")) == True — completely wrong. That's why you always need parentheses around each comparison: (col("age") > 18) & (col("active") == True). This isn't a pyfloe quirk — it's a Python precedence rule, and Polars has the exact same requirement.

The UnaryExpr — for single-operand operations

Not every operation has two sides. Negation (-col("x")) and logical NOT (~col("active")) have only one operand. pyfloe handles these with UnaryExpr:

pyfloe/expr.py — UnaryExpr
class UnaryExpr(Expr):
    """A unary operator applied to a single operand."""

    def __init__(self, operand: Expr, op, op_str: str):
        self.operand = operand
        self.op = op
        self.op_str = op_str

UnaryExpr is also reused for method-style operations like col("x").is_null() and string methods like col("name").str.upper(). The operand is the inner expression, and the op is whatever function should be applied. One class, many uses.

More than we'll cover
pyfloe's expression system is larger than what we teach in this module. It includes string methods (.str.upper(), .str.contains()), datetime accessors (.dt.year(), .dt.truncate()), conditional logic (when(...).otherwise(...)), and window functions (row_number().over(...)). We won't build all of these — but they all follow the exact same pattern you've just learned: a method call returns a new Expr node instead of computing a value. Once you understand BinaryExpr and UnaryExpr, the rest is just more of the same.

Lesson 2.3 Giving the AST a Voice — The eval() Method

Building an AST is great. We can represent col("age") > 28 as a tree of objects instead of an opaque lambda. But eventually, our engine actually needs to process data. It needs to hand a real row to this tree and get a result back: True or False, a number, a string.

We do this by adding an eval() method to every node in the tree. Let's walk through each one.

How Col evaluates: looking up a column

When we evaluate a Col expression, we just need to extract that column's value from the row. But remember — pyfloe stores rows as tuples, not dictionaries. So we need a col_map that tells us which index corresponds to which column name:

pyfloe/expr.py — Col.eval
class Col(Expr):
    def __init__(self, name: str):
        self.name = name

    def eval(self, row, col_map):
        # col_map tells us: "age" → index 1
        return row[col_map[self.name]]

Let's trace it. If row = ("Alice", 30, "EU") and col_map = {"name": 0, "age": 1, "region": 2}, then Col("age").eval(row, col_map) does:

Tracing Col.eval
row[col_map["age"]]
row[1]
# → 30
Why tuples instead of dicts?
pyfloe stores each row as a Python tuple, not a dictionary. Tuples use about 40% less memory (no per-row key storage), they're faster to create and iterate, and they're hashable — which matters for grouping and joins. The col_map dictionary translates column names to integer indices once, and then every row lookup is just row[index].
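A quick way to see the difference is sys.getsizeof. Exact byte counts vary by Python version, but the dict row is consistently several times larger because it stores its keys per row:

```python
import sys

row_tuple = ("Alice", 30, "EU")
row_dict = {"name": "Alice", "age": 30, "region": "EU"}

# The dict pays for key storage on EVERY row; the tuple doesn't.
print("tuple:", sys.getsizeof(row_tuple), "bytes")
print("dict: ", sys.getsizeof(row_dict), "bytes")

# With tuples, the name → index mapping is paid for once, up front:
col_map = {"name": 0, "age": 1, "region": 2}
print("age:", row_tuple[col_map["age"]])  # → 30
```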

How Lit evaluates: always the same

A Lit expression is even simpler. It ignores the row entirely and always returns its stored constant:

pyfloe/expr.py — Lit.eval
class Lit(Expr):
    def eval(self, row, col_map):
        return self.value   # always 28, always "EU", always 3.14

How BinaryExpr evaluates: ask the children

Here's where the recursive magic happens. A BinaryExpr doesn't know what its children are. It doesn't care whether the left side is a Col, a Lit, or another BinaryExpr. It just asks each child to evaluate itself, then applies its operator to the results:

pyfloe/expr.py — BinaryExpr.eval
class BinaryExpr(Expr):
    def eval(self, row, col_map):
        # Ask the children to evaluate themselves recursively
        lv = self.left.eval(row, col_map)
        rv = self.right.eval(row, col_map)
        # Handle NULLs gracefully
        if lv is None or rv is None:
            return None
        # Apply the operator (e.g., operator.gt(30, 28) → True)
        return self.op(lv, rv)

That None check is important. In real-world data, values are frequently missing. pyfloe follows the SQL convention: any comparison involving None returns None, not True or False. This is something a naive lambda-based filter would never handle correctly.
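Here's that NULL propagation traced end to end, using condensed versions of the classes above (simplified to just what this example needs):

```python
import operator as _op

class Col:
    def __init__(self, name):
        self.name = name
    def eval(self, row, col_map):
        return row[col_map[self.name]]

class Lit:
    def __init__(self, value):
        self.value = value
    def eval(self, row, col_map):
        return self.value

class BinaryExpr:
    def __init__(self, left, right, op):
        self.left, self.right, self.op = left, right, op
    def eval(self, row, col_map):
        lv = self.left.eval(row, col_map)
        rv = self.right.eval(row, col_map)
        if lv is None or rv is None:
            return None            # SQL semantics: NULL poisons the comparison
        return self.op(lv, rv)

col_map = {"age": 0}
expr = BinaryExpr(Col("age"), Lit(28), _op.gt)

print(expr.eval((30,), col_map))    # → True
print(expr.eval((None,), col_map))  # → None, not False!
```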

Tracing a compound expression

Let's trace a complete expression through construction and evaluation. Consider:

A compound expression
expr = (col("price") * col("quantity")) > 1000

Step 1 — Build the tree. col("price") creates Col("price"). col("quantity") creates Col("quantity"). The * operator calls __mul__, producing a BinaryExpr(*, ...). Then the > operator calls __gt__, wrapping everything in a second BinaryExpr(>, ...):

The tree Python built behind the scenes
BinaryExpr(op=">")
├── left: BinaryExpr(op="*")
│   ├── left:  Col("price")
│   └── right: Col("quantity")
└── right: Lit(1000)

No data has been touched. No computation has happened. We've just built a description of what to compute.


Pause and predict: given row = ("Widget", 25.0, 4) and the expression (col("price") * col("quantity")) > 1000, what will the result be? Work it out in your head: what's 25.0 × 4, and is that greater than 1000? Then read on.

Step 2 — Evaluate it. The engine has a row: row = ("Widget", 25.0, 4) and col_map = {"name": 0, "price": 1, "quantity": 2}. It calls expr.eval(row, col_map) and the tree walks itself:

Tracing eval through the tree
# Root: BinaryExpr(op=">") asks its children:

# Left child: BinaryExpr(op="*") asks ITS children:
#   Col("price").eval(...)    → row[1] → 25.0
#   Col("quantity").eval(...) → row[2] → 4
#   result: 25.0 * 4 = 100.0

# Right child: Lit(1000).eval(...) → 1000

# Root applies its operator: 100.0 > 1000 → False

Each node only knows how to get its children's values and apply its own operator. The root doesn't know anything about "price" or "quantity" — it just asks its left child for a number and its right child for a number. This recursive self-similarity is what makes ASTs so elegant — and so composable.


Lesson 2.4 The Payoff — Inspectable Expressions

We've built the expression tree. We've given it a voice with eval(). But we haven't yet delivered on the original promise from Module 1: the reason we did all this work instead of just using lambdas.

Remember the question that an optimizer needs to answer: "Which columns does this expression actually use?" With a lambda, we were stuck. With the AST, the answer writes itself.

required_columns: the method lambdas can't provide

We add one more method to each node. The logic is beautifully simple: leaf nodes know what they need, and branch nodes combine their children's needs:

pyfloe/expr.py — required_columns
class Col(Expr):
    def required_columns(self):
        return {self.name}          # I need exactly one column

class Lit(Expr):
    def required_columns(self):
        return set()               # I need nothing — I'm a constant

class BinaryExpr(Expr):
    def required_columns(self):
        # Combine the requirements of both sides
        return self.left.required_columns() | self.right.required_columns()

class UnaryExpr(Expr):
    def required_columns(self):
        return self.operand.required_columns()

Now watch. Our optimizer can see the future:

The optimizer can now see which columns matter
expr = (col("price") * col("quantity")) > 1000
expr.required_columns()

# BinaryExpr(>).required_columns()
#   → BinaryExpr(*).required_columns()  |  Lit(1000).required_columns()
#   → ({"price"} | {"quantity"})         |  set()
#   → {"price", "quantity"}

Two columns. If the CSV has 50 columns, the optimizer can now tell the scan node: "Only read columns 1 and 2. Skip the other 48."

This is the holy grail
We have given the user a beautiful, highly readable API: col("price") * col("quantity") > 1000. And at the same time, we have built a transparent data structure that our future Query Optimizer (Module 5) can use to aggressively prune columns and push filters down the execution tree. The same object powers both the user experience and the optimizer. That's what makes this architecture so elegant.

Lambda vs. expression — the full picture

Here's the complete comparison — why we did all this work:

Lambda: a black box

lambda row: row["price"] * row["qty"] > 1000

The engine can call it, but can't look inside. It doesn't know which columns are used. It can't rearrange or optimize anything.

Expression: an inspectable tree

(col("price") * col("qty")) > 1000

The engine can walk the tree, list the columns ({"price", "qty"}), push the filter down before a join, and prune unused columns from the scan.

Debugging with __repr__

One more dunder makes development much nicer. Each expression class implements __repr__ so that printing an expression reveals its full structure:

pyfloe/expr.py — __repr__ methods
class Col(Expr):
    def __repr__(self):
        return f'col("{self.name}")'

class Lit(Expr):
    def __repr__(self):
        return repr(self.value)

class BinaryExpr(Expr):
    def __repr__(self):
        return f'({self.left} {self.op_str} {self.right})'

Because __repr__ is recursive — BinaryExpr's repr calls repr(self.left) and repr(self.right), which call their own __repr__ — you get a full round-trip view of the tree:

Expressions print almost exactly as you wrote them
>>> (col("price") * col("quantity")) > 1000
((col("price") * col("quantity")) > 1000)

>>> (col("age") > 18) & (col("active") == True)
((col("age") > 18) & (col("active") == True))

Three dunders — __gt__, eval, __repr__ — all using the same recursive pattern, each cascading through the same tree, each producing a different kind of output. A tree of objects, a boolean, and a string. The AST is the common backbone.

The production version
Polars expressions, PySpark Column objects, SQLAlchemy clause elements — they all overload Python operators to build ASTs. The dunder methods let you write col("a") + col("b") instead of verbose constructor calls like BinaryExpr(Col("a"), Col("b"), operator.add, "+").
Deep dive — From our version to pyfloe's real code

Throughout this module, we built simplified versions of each class to understand the shape of the expression system. pyfloe's actual source code has the same shape — but adds production polish. Here's what the real code adds that we skipped over:

__slots__ — You may have noticed __slots__ = ('name',) on pyfloe's Col class and __slots__ = ('left', 'right', 'op', 'op_str') on BinaryExpr. Normally, every Python object carries a hidden __dict__ dictionary to store its attributes. That dictionary is flexible (you can add any attribute at runtime), but it costs about 100 extra bytes per object. When you declare __slots__, Python skips the __dict__ entirely and stores attributes in a fixed-size internal array — roughly 40% less memory per instance.

Why does that matter for expressions? Each node in the AST is tiny — just a few attributes. But a complex query might create thousands of expression nodes. And when the engine evaluates that expression across millions of rows, every byte counts. __slots__ is a one-line declaration that eliminates that overhead for free.

pyfloe/expr.py — __slots__ on every expression class
class Col(Expr):
    __slots__ = ('name',)          # only this attribute — no __dict__

class Lit(Expr):
    __slots__ = ('value',)         # ~100 bytes saved per instance

class BinaryExpr(Expr):
    __slots__ = ('left', 'right', 'op', 'op_str')
import sys

class WithSlots:
    __slots__ = ("key", "value")
    def __init__(self, key: str, value: int):
        self.key = key
        self.value = value

class WithoutSlots:
    def __init__(self, key: str, value: int):
        self.key = key
        self.value = value

a = WithSlots("price", 42)
b = WithoutSlots("price", 42)

# Memory comparison
print(f"With __slots__:    {sys.getsizeof(a)} bytes")
print(f"Without __slots__: {sys.getsizeof(b)} + {sys.getsizeof(b.__dict__)} bytes (__dict__)")
print()

# __dict__ exists only on the non-slots class
print(f"b.__dict__ = {b.__dict__}")
try:
    print(f"a.__dict__ = {a.__dict__}")
except AttributeError as e:
    print(f"a.__dict__ → AttributeError: {e}")
print()

# You can add arbitrary attributes without slots...
b.new_attr = "surprise!"
print(f"b.new_attr = {b.new_attr!r}  (anything goes)")

# ...but slots locks it down
try:
    a.new_attr = "nope"
except AttributeError as e:
    print(f"a.new_attr → AttributeError: {e}")
print()

# At scale this matters: 10,000 expression nodes
slots_total = sum(sys.getsizeof(WithSlots("x", i)) for i in range(10_000))
dict_total = sum(sys.getsizeof(WithoutSlots("x", i)) + sys.getsizeof(WithoutSlots("x", i).__dict__) for i in range(10_000))
print(f"10k slots objects:  {slots_total:>10,} bytes")
print(f"10k dict objects:   {dict_total:>10,} bytes")
print(f"Savings:            {dict_total - slots_total:>10,} bytes ({(dict_total - slots_total) / dict_total:.0%})")

compile() — In addition to eval(), pyfloe has a compile(col_map) method that pre-bakes the column index lookup into a closure. Instead of doing a dictionary lookup on every row, the compiled version is lambda row: row[1] — the index 1 is already baked in. This is significantly faster for tight loops over millions of rows.
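pyfloe's actual compile() isn't shown in this module, but the closure idea can be sketched like this — a simplified illustration (null handling omitted for brevity), not pyfloe's source:

```python
import operator

class Col:
    def __init__(self, name):
        self.name = name
    def compile(self, col_map):
        idx = col_map[self.name]          # dict lookup happens ONCE, here
        return lambda row: row[idx]       # per-row work is a bare index

class Lit:
    def __init__(self, value):
        self.value = value
    def compile(self, col_map):
        value = self.value
        return lambda row: value

class BinaryExpr:
    def __init__(self, left, right, op):
        self.left, self.right, self.op = left, right, op
    def compile(self, col_map):
        lf = self.left.compile(col_map)   # compile the children once...
        rf = self.right.compile(col_map)
        op = self.op
        return lambda row: op(lf(row), rf(row))  # ...then just call them

col_map = {"name": 0, "price": 1, "quantity": 2}
expr = BinaryExpr(BinaryExpr(Col("price"), Col("quantity"), operator.mul),
                  Lit(1000), operator.gt)
fn = expr.compile(col_map)
print(fn(("Widget", 25.0, 4)))   # → False  (100.0 > 1000)
print(fn(("Gadget", 500.0, 3)))  # → True   (1500.0 > 1000)
```

The tree is walked once at compile time; after that, evaluating a row is nothing but closure calls and tuple indexing, which is why it wins in tight loops.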

output_dtype() — Each expression can report what type it will produce (bool for comparisons, float for division, etc.). This enables type inference: pyfloe can tell you the schema of a query result before running any data through it.

Coming in Module 3
We have expressions — but right now they're inert. They're objects sitting in memory, disconnected from any data. To make them do something, we need to plug them into the plan nodes from Module 1. That's where the two trees we've been building finally meet: the FilterNode receives an Expr, compiles it, and evaluates it against every row flowing through the volcano pipeline. We'll also tackle a performance problem — calling eval() once per row is brutally slow in Python — and solve it with batched execution.

Exercises Check Your Understanding

Before moving to Module 3, make sure these concepts are solid:

Quick check

1. When you write col("age") > 28, what does Python actually call?

2. What does _ensure_expr(42) return?

3. What does (col("a") + col("b")).required_columns() return?

4. Why does pyfloe define __radd__ in addition to __add__?

5. Why does Expr define __bool__ to raise a TypeError?

6. You build two expressions: e1 = col("x") + col("y") and e2 = col("z") > 10. What does e1.required_columns() | e2.required_columns() return?


Challenge Build a NegExpr Node

You've built BinaryExpr for two-operand operations. Now build NegExpr — a unary expression that negates its child. It should support -col("price") by implementing the __neg__ dunder method on Expr.

Think about:

  • How many children does a unary node have? (Just one.)
  • What should required_columns() return?
  • What should eval(row) do?
Starter code — implement NegExpr
from __future__ import annotations
from collections.abc import Callable

class Expr:
    def required_columns(self) -> set[str]:
        return set()
    def eval(self, row: dict):
        raise NotImplementedError
    def __neg__(self):
        # TODO: return a NegExpr wrapping self
        pass
    def __gt__(self, other):
        return BinaryExpr(self, other if isinstance(other, Expr) else Lit(other), lambda a, b: a > b, ">")

class Col(Expr):
    def __init__(self, name: str):
        self.name = name
    def required_columns(self) -> set[str]:
        return {self.name}
    def eval(self, row: dict):
        return row[self.name]

class Lit(Expr):
    def __init__(self, value: int | float | str):
        self.value = value
    def eval(self, row: dict):
        return self.value

class BinaryExpr(Expr):
    def __init__(self, left: Expr, right: Expr, op: Callable, op_str: str):
        self.left, self.right, self.op, self.op_str = left, right, op, op_str
    def required_columns(self) -> set[str]:
        return self.left.required_columns() | self.right.required_columns()
    def eval(self, row: dict):
        return self.op(self.left.eval(row), self.right.eval(row))

class NegExpr(Expr):
    def __init__(self, child: Expr):
        self.child = child
    def required_columns(self) -> set[str]:
        # TODO
        pass
    def eval(self, row: dict):
        # TODO
        pass

# Test it:
row = {"price": 42, "discount": 5}

neg_price = -Col("price")
print(f"-col('price') => type: {type(neg_price).__name__}")
print(f"-col('price') => eval:  {neg_price.eval(row)}")
print(f"-col('price') => cols:  {neg_price.required_columns()}")

# Compose it: -col("price") > -50
expr = -Col("price") > -50
print(f"\n-col('price') > -50 => {expr.eval(row)}")
Thinking question
What does (-col("price")).required_columns() return? What about (col("a") + col("b")) > (col("c") * col("d"))? Trace the recursive required_columns() calls. (Answer: {"price"} and {"a", "b", "c", "d"}.)

Source References

expr.py — Expression AST

  • Expr — base class for all expressions
  • Col — column reference expression
  • BinaryExpr — two operands combined with a binary operator
  • Lit — literal value expression
  • required_columns() — returns the set of columns an expression needs