We Need to Talk
About DataFrames
If you work with data in Python, you use DataFrames. They are the undisputed heavyweight of the data ecosystem. But have you ever stopped to ask why they are so dominant — or how they actually work under the hood?
In this course, we aren't just going to use DataFrames. We are going to build our own — in pure Python, with zero external dependencies. By the end, you'll understand the deep Python magic that powers industry giants like Polars and PySpark.
But first, we need to talk about why DataFrames exist in the first place.
If you've used Polars or PySpark, you've already used a lazy engine — you
just haven't seen inside one. This module covers the Python primitives that
make laziness work: generators and the yield keyword. If you're
already comfortable with generators, skim Lessons 1.2–1.3 and jump to
Lesson 1.4 (The Problem with Lambdas), which sets up
the rest of the course.
Lesson 1.1 The Ultimate Bridge to Excel
Native Python data structures are incredibly flexible, but they are terrible at tabular data. If you have a list of thousands of dictionaries representing user data, answering a simple question like "What is the average age of users in Europe?" requires writing clumsy for loops, managing accumulators, and handling missing keys.
import csv

# Step 1: Read the CSV into a list of dicts
rows = []
with open("orders.csv") as f:
    for row in csv.DictReader(f):
        rows.append(row)

# Step 2: Filter (everything is a string!)
filtered = [r for r in rows if float(r["amount"]) > 100]

# Step 3: Group by region (manual accumulator)
groups = {}
for r in filtered:
    region = r["region"]
    groups.setdefault(region, 0)
    groups[region] += float(r["amount"])

print(groups)
# {'EU': 45230.0, 'US': 89100.5, 'APAC': 23400.0}
It works. But count the problems: everything is a string (you're casting
float() manually, in two different places — miss one and you
get a silent bug), all data sits in memory at once, and each step is
imperative plumbing with zero composability. Want to add a sort? That's
another 5 lines.
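To make the "imperative plumbing" point concrete, here is roughly what that sort looks like in the eager style. The totals are hard-coded from the printed output above so the snippet stands alone:

```python
# Continuing the eager pipeline: sort regions by total amount, descending.
groups = {"EU": 45230.0, "US": 89100.5, "APAC": 23400.0}

# Step 4: Sort (yet another block bolted onto the pipeline)
sorted_regions = sorted(groups.items(), key=lambda kv: kv[1], reverse=True)
for region, total in sorted_regions:
    print(f"{region}: {total}")
# US: 89100.5
# EU: 45230.0
# APAC: 23400.0
```

Every new requirement means another hand-rolled loop or comprehension, and none of the steps compose with each other.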
On the other hand, software like Excel or SQL makes tabular analysis beautifully intuitive. You can filter, group, and aggregate in seconds. But Excel can't easily be automated in a data pipeline, and SQL requires a database engine.
The DataFrame is the bridge. It gives you the intuitive, tabular mental model of Excel, wrapped in the programmatic power and automation capabilities of Python.
A brief history

- Spark popularized lazy pipelines: you describe the computation, and nothing runs until you call .collect(). A paradigm shift — but it requires a JVM cluster.
- Polars brought that laziness to a single machine, with an expression API — col("x") + col("y") — that feels like magic.
For a decade, Pandas was the king of this bridge. But Pandas has a fundamental architectural flaw that the industry is rapidly moving away from: Eager Execution.
Lesson 1.2 The Memory Trap: Eager vs. Lazy Execution
To understand eager vs. lazy execution, let's look at how tabular data is usually represented in pure Python: a list of dictionaries.
raw_data = [
    {"name": "Alice", "age": 30, "region": "EU"},
    {"name": "Bob", "age": 25, "region": "US"},
]
A completely workable approach to building a DataFrame is to wrap this list in a class and write methods that loop through it:
raw_data = [
    {"name": "Alice", "age": 30, "region": "EU"},
    {"name": "Bob", "age": 25, "region": "US"},
    {"name": "Carol", "age": 35, "region": "EU"},
    {"name": "Dave", "age": 22, "region": "APAC"},
]

class EagerFrame:
    def __init__(self, data):
        self.data = data

    def filter(self, condition):
        # Executes immediately! Builds a brand new list in memory.
        return EagerFrame([row for row in self.data if condition(row)])

# Usage:
df = EagerFrame(raw_data)
filtered = df.filter(lambda row: row["age"] > 28)
print("Eager result:", filtered.data)

# Now compare with pyfloe's lazy approach:
import pyfloe as pf

lazy_df = pf.LazyFrame(raw_data)
result = lazy_df.filter(pf.col("age") > 28).collect()
print("Lazy result: ", result.to_pylist())
This list comprehension style is incredibly Pythonic. It works perfectly!
You can pass any callable function, chain .filter() calls
together, and it will give you the right answer.
But there is a catch: this is Eager Execution.
Every time you call .filter(), it iterates through millions of rows and creates a brand new list in memory. If your laptop only has 8GB of RAM, your program crashes.
Modern tools like Polars and PySpark solve this with Lazy Execution. When you tell Polars to read a massive file and filter it, it doesn't run a for loop. It waits until you explicitly ask for the final result.
Eager Pandas style
Each line runs immediately. Data is copied at every step. The file is fully loaded before any filtering. Simple to debug line-by-line, but memory explodes at scale.
Lazy Polars / Spark style
Operations build a plan describing what to do, without actually doing it.
Data flows only when you call .collect(). Memory stays
flat no matter how many operations you chain.
But how does Python actually "wait"? The secret lies in a built-in feature you may have seen but rarely used: Generators.
Lesson 1.3 Unlocking Laziness: Generators and the Volcano Model
When a normal Python function hits a return statement, it
spits out the entire result and shuts down. If it's returning a list of
10 million rows, it builds that entire list in memory first.
But if you replace return with the yield
keyword, the function becomes a Generator. When Python
hits yield, it hands you one item, pauses its exact
state, and waits. It only computes the next item when you
explicitly ask for it.
# Eager — builds entire list in memory
def get_all_rows():
    return [row for row in million_row_dataset]  # all at once

# Lazy — yields one row at a time
def get_rows_lazily():
    for row in million_row_dataset:
        yield row  # one at a time, on demand
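You can watch this pausing happen. Here is a small self-contained demo (the function takes its dataset as a parameter here, purely so the snippet runs on its own):

```python
def get_rows_lazily(dataset):
    for row in dataset:
        print(f"producing {row}")  # side effect so we can see when work happens
        yield row

gen = get_rows_lazily([1, 2, 3])
print("generator created, nothing produced yet")  # no "producing" line so far!
first = next(gen)   # only NOW does the function run, up to its first yield
print("got", first)
second = next(gen)  # resumes exactly where it paused
print("got", second)
# Row 3 is never produced unless we ask for it.
```

Creating the generator does zero work; each `next()` call runs the function just far enough to produce one more item.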
So generators let us produce data lazily. But a bare generator function is a one-shot, standalone thing. If we want to build a pipeline — read data, then filter it, then sort it — we need a way to connect these lazy steps together so each one can pull from the step before it.
The trick is to wrap each step in a class. We'll call each class a
"node" — as in, one link in a chain. Every node follows
the same rule: it has an .execute() method that yields rows.
Some nodes produce data on their own (read a file, hold a list). Other
nodes receive a "child" node and transform whatever the child produces
(filter it, add a column, sort it). Because every node speaks the same
language — call .execute(), get rows — you can snap them
together like building blocks.
In database engineering, this pattern is called the Volcano Execution Model. Think of it like a chain of workers — each one only talks to the person next to them, passing items along. Let's build two nodes to see how it works:
class ScanNode:
    def __init__(self, data):
        self.data = data

    def execute(self):
        # No list is built. Because of `yield`, calling
        # execute() returns a generator of rows.
        for row in self.data:
            yield row
Look closely at the for row in self.data line. It doesn't call
self.data[i] or check len(self.data). It just
iterates. That means self.data doesn't have to be a list —
it can be anything Python can iterate over. A generator that
reads lines from a CSV file. An API client that paginates through
results. A database cursor. This is how a lazy library reads a 10GB
file without loading it into memory: the "data" is a generator that
yields one row at a time from disk.
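Here is that idea as a runnable sketch: the same ScanNode, fed a generator that parses CSV rows on demand. An io.StringIO stands in for a real file on disk:

```python
import csv
import io

class ScanNode:
    def __init__(self, data):
        self.data = data  # any iterable: list, generator, cursor...

    def execute(self):
        for row in self.data:
            yield row

# A generator that parses one CSV row at a time; nothing is read up front.
def read_csv_lazily(file_obj):
    for row in csv.DictReader(file_obj):
        yield row

csv_text = "name,age\nAlice,30\nBob,25\n"
scan = ScanNode(read_csv_lazily(io.StringIO(csv_text)))
for row in scan.execute():
    print(row)
# {'name': 'Alice', 'age': '30'}
# {'name': 'Bob', 'age': '25'}
```

ScanNode never finds out (or cares) that its data came from a "file" rather than a list.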
So now we have a node that can produce rows lazily. But producing rows isn't useful on its own — we need to do something with them. Filter them, transform them, aggregate them.
Here's the key idea: instead of writing a filter() method
that loops through all the data and creates a new list, we create a
second node that wraps the first one. A
FilterNode takes a ScanNode (or any other node)
as its "child." When you ask the FilterNode to execute, it
pulls rows from its child's .execute() and only yields the
ones that pass the condition:
class FilterNode:
    def __init__(self, child, condition):
        self.child = child  # could be a ScanNode, or another FilterNode!
        self.condition = condition

    def execute(self):
        # Pull data upwards from the child node, lazily
        for row in self.child.execute():
            if self.condition(row):
                yield row
Read the execute() method carefully: self.child.execute()
calls the ScanNode's execute(), which yields rows one at a time.
The FilterNode tests each row and only yields the ones that pass. So we
get a chain: ScanNode → produces rows → FilterNode
→ filters them. And because FilterNode also has an .execute()
method, you could wrap it in another FilterNode. Or a
SortNode. Or a JoinNode. Each node only knows about
its immediate child — they all speak the same language.
When .collect() pulls data through this chain:
1. collect() calls FilterNode.execute(), which calls ScanNode.execute()
2. ScanNode yields Alice's row
3. FilterNode checks: is 30 > 28? Yes — yields the row upward
4. FilterNode pulls Bob (25 > 28? No — skipped), then ScanNode is exhausted — done
Notice how each row is processed exactly once, and only one row is ever "in flight" at a time. That's the volcano model in action.
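You can watch this interleaving for yourself by adding a print to each node. A self-contained sketch of the same two classes, instrumented:

```python
class ScanNode:
    def __init__(self, data):
        self.data = data

    def execute(self):
        for row in self.data:
            print(f"  scan: yielding {row['name']}")
            yield row

class FilterNode:
    def __init__(self, child, condition):
        self.child = child
        self.condition = condition

    def execute(self):
        for row in self.child.execute():
            print(f"  filter: checking {row['name']}")
            if self.condition(row):
                yield row

rows = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
plan = FilterNode(ScanNode(rows), lambda r: r["age"] > 28)
for row in plan.execute():
    print(f"collected {row['name']}")
# The prints interleave: scan yields Alice, filter checks Alice, Alice is
# collected, and only THEN does scan yield Bob. One row in flight at a time.
```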
Wiring it up to a LazyFrame
Now we can rewrite our DataFrame to be completely lazy. When you call
.filter(), it does zero data processing. It
just wraps the current plan in a new FilterNode:
class LazyFrame:
    def __init__(self, plan_node):
        self._plan = plan_node

    def filter(self, condition):
        # Instant! No data is processed here.
        new_plan = FilterNode(self._plan, condition)
        return LazyFrame(new_plan)

    def collect(self):
        # THIS is where execution finally happens.
        # We pull data through the entire chain of generators.
        return list(self._plan.execute())
Let's use it:
scan = ScanNode(raw_data)
df = LazyFrame(scan)
# This just builds the plan: LazyFrame -> FilterNode -> ScanNode
query = df.filter(lambda row: row["age"] > 28)
# NOW the generators fire, pulling data through the chain:
result = query.collect()
print(result)
# [{'name': 'Alice', 'age': 30, 'region': 'EU'}]
from __future__ import annotations

from collections.abc import Callable, Iterator

# Part 1: Build it yourself — just generators and classes

class ScanNode:
    def __init__(self, data: list[dict]):
        self.data = data

    def execute(self) -> Iterator[dict]:
        for row in self.data:
            yield row  # one row at a time!

class FilterNode:
    def __init__(self, child: ScanNode | FilterNode, condition: Callable[[dict], bool]):
        self.child = child
        self.condition = condition

    def execute(self) -> Iterator[dict]:
        for row in self.child.execute():  # pull from child
            if self.condition(row):
                yield row  # pass it up

raw_data = [
    {"name": "Alice", "age": 30, "region": "EU"},
    {"name": "Bob", "age": 25, "region": "US"},
    {"name": "Carol", "age": 35, "region": "EU"},
]

scan = ScanNode(raw_data)
filtered = FilterNode(scan, lambda row: row["age"] > 28)
print("Hand-built volcano:", list(filtered.execute()))

# Part 2: pyfloe does the exact same thing internally
import pyfloe as pf

lf = pf.LazyFrame(raw_data)
result = lf.filter(pf.col("age") > 28).collect()
print("pyfloe result: ", result.to_pylist())
That's a working lazy DataFrame in about 30 lines of Python. Because data is processed one element at a time through generators, we can chain 50 filters and transformations together, and our memory usage stays completely flat.
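To see that chaining really is free, here is a sketch that stacks 50 FilterNodes programmatically (the two classes are repeated so the snippet stands alone). Note the `t=threshold` default argument, which pins down the loop variable for each lambda:

```python
class ScanNode:
    def __init__(self, data):
        self.data = data

    def execute(self):
        yield from self.data

class FilterNode:
    def __init__(self, child, condition):
        self.child = child
        self.condition = condition

    def execute(self):
        for row in self.child.execute():
            if self.condition(row):
                yield row

# The data source is itself a generator expression: no list is ever built.
plan = ScanNode({"n": i} for i in range(1_000_000))

# Stack 50 filters. Building this plan touches zero rows.
for threshold in range(50):
    plan = FilterNode(plan, lambda row, t=threshold: row["n"] > t)

# Pull just the first survivor: only rows 0..50 ever flow through the chain.
first = next(plan.execute())
print(first)  # {'n': 50}
```

A row must satisfy n > 0, n > 1, ..., n > 49 to reach the top, so the first survivor is n = 50, and the scan stops after yielding 51 rows out of a million.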
Try it yourself: what happens if you stack another
FilterNode on top — say, filtering for
region == "EU"? Or change the condition to filter on a
different column? The best way to understand generators is to break
them and see what happens. Go ahead — the code is live.
The real power: stopping early
Here's where generators show their true strength. What if you don't need all the results — you just want the first match?
With an eager approach, you'd have no choice: filter the entire dataset, build the full result list, then take element zero. A million rows in, one row out — and you processed all million.
With generators, we can just... stop pulling:
class LazyFrame:
    # ... filter() and collect() from above ...

    def first(self):
        # Pull just ONE row from the generator chain.
        # The rest of the data is never touched.
        return next(self._plan.execute())
That's it. Python's built-in next() pulls one value from a
generator and then stops asking. The FilterNode
pauses. The ScanNode pauses. If you have a billion rows
but the first match is at row 5, you just processed 5 rows — not a
billion.
Pause and predict: in the example below, the data has
three rows and we call .first() with a filter for
age > 28. How many rows will the ScanNode
actually yield before everything stops? Think about it, then check
your answer.
scan = ScanNode([
    {"name": "Bob", "age": 25},
    {"name": "Alice", "age": 30},  # match — stops here!
    {"name": "Carol", "age": 35},  # never even read
    # ... imagine a million more rows ...
])
df = LazyFrame(scan)
df.filter(lambda row: row["age"] > 28).first()
# {'name': 'Alice', 'age': 30}
Answer: two rows — Bob and Alice. An
EagerFrame would have to run the filter on
every single row before you could ask for the first result —
because the filter builds a complete list. Generators don't build
anything. They just yield the next item when you ask, and pause when
you don't. That pause is the entire trick.
Polars and PySpark use this exact architecture. Their implementations are faster (batches instead of single rows, compiled code instead of Python), but the shape is identical: a chain of plan nodes, generators pulling data upward, nothing running until you ask.
Lesson 1.4 The Problem with Lambdas
We've solved the memory trap. Our engine is fully lazy. But there is a
glaring problem with our FilterNode.
Look at the filter we wrote:
query = df.filter(lambda row: row["age"] > 28)
That lambda is a black box. Our engine can
execute it — hand it a row, get True or False back — but it can never
look inside to understand what it's doing.
Why does that matter? Imagine you want to build a smart Query Optimizer — something that looks at your query plan and says:
"This query only ever reads the age column. The CSV file has
20 columns. Let's not bother carrying the other 19 through the rest
of the pipeline."
"There are two filters in a row. The first one only touches columns from the left table, so we can push it down before the join."
But with a lambda, the optimizer can't answer any of these questions. Python cannot easily inspect which columns a lambda function uses.
This is the difference between a toy and a real query engine. To build a
truly expressive, optimizable API — the kind where you write
col("age") > 28 instead of a lambda — we need a way to
represent computations as data that our engine can
inspect.
We need col("age") > 28 to return not a boolean, but an
expression — a small object that says "compare the
age column to 28 using greater-than." On its own, that
expression does nothing. It's just a description. But when a plan node
like FilterNode receives it, it can inspect it — ask which
columns it needs, rearrange it, and eventually evaluate it row by row.
Picture the shape of col("age") > 28: it has a > at the
root, col("age") on the left, and 28 on the
right. That's a tree — an expression tree.
But it's a completely different tree from the chain of plan nodes we just built. The plan describes how data flows (scan → filter → collect). The expression describes what to compute on each row. Expressions live inside plan nodes — a
FilterNode holds an expression that it
evaluates per row.
Two trees, both made of nodes, doing completely different jobs. You'll build both in this course.
To pull this off, we'll need to dive into one of the most powerful
corners of Python: dunder methods and operator overloading.
We need to hijack the > operator so that instead of running
a comparison, it builds an expression object.
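As a tiny preview of the mechanism (the names Col and Comparison here are illustrative, not the API we'll build), here is operator overloading in miniature:

```python
class Comparison:
    """A description of a comparison, not its result."""
    def __init__(self, column, op, value):
        self.column = column
        self.op = op
        self.value = value

    def __repr__(self):
        return f"Comparison({self.column!r} {self.op} {self.value!r})"

class Col:
    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        # Python calls this for `col(...) > other`. Instead of comparing,
        # we build a data structure describing the comparison.
        return Comparison(self.name, ">", other)

def col(name):
    return Col(name)

expr = col("age") > 28
print(expr)         # Comparison('age' > 28)
print(expr.column)  # 'age': an optimizer can now see which column is used
```

The `>` never produces a boolean; it produces an object that an engine can inspect, rearrange, and evaluate later.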
In Module 2, you'll see how
col("age") > 28 creates an expression (not a boolean),
how col("a") + col("b") creates another kind of expression,
and how the entire Polars-style col() API is really just
a collection of Python dunder methods building an expression tree that
the engine can walk and optimize.
Exercises Check Your Understanding
Before moving to Module 2, make sure these concepts are solid:
Quick check
1. In a lazy DataFrame library, when does computation actually happen?
2. In a lazy DataFrame library, what does col("x") > 5 return?
3. Why can't a query optimizer work with a lambda filter?
4. You have a pipeline: ScanNode → FilterNode → FilterNode → .first().
The data has 1 million rows. The first filter passes ~10% of rows, and
the second filter's first match is at position 50 in the filtered stream.
Roughly how many rows does ScanNode yield?
Challenge Build a MapNode
You've seen FilterNode — it pulls rows from its child and
keeps the ones that pass a predicate. Now build a MapNode
that transforms each row using a function. This is the
generator-based equivalent of Python's built-in map().
Think about:
- What does execute() yield? (Hint: the transformed row, not the original.)
- Should it filter anything out, or pass every row through?
from __future__ import annotations

from collections.abc import Callable, Iterator

class PlanNode:
    def execute(self) -> Iterator[dict]:
        raise NotImplementedError

class ScanNode(PlanNode):
    def __init__(self, rows: list[dict]):
        self.rows = rows

    def execute(self) -> Iterator[dict]:
        yield from self.rows

class MapNode(PlanNode):
    def __init__(self, child: PlanNode, func: Callable[[dict], dict]):
        self.child = child
        self.func = func

    def execute(self) -> Iterator[dict]:
        # TODO: yield each row transformed by self.func
        pass

# Test it: add a "senior" field based on age
data = [
    {"name": "Alice", "age": 35},
    {"name": "Bob", "age": 22},
    {"name": "Carol", "age": 45},
]

def add_senior_flag(row):
    return {**row, "senior": row["age"] >= 30}

scan = ScanNode(data)
mapped = MapNode(scan, add_senior_flag)
for row in mapped.execute():
    print(row)
Bonus question: chain ScanNode → FilterNode → MapNode and call
.first(). The ScanNode has 1 million rows, but only the 500th
row passes the filter. How many rows does the ScanNode yield? How many does
the MapNode process? (Answer: 500 and 1, respectively. The generator chain
stops pulling after the first match reaches the top.)