We Need to Talk
About DataFrames
If you work with data in Python, you use DataFrames. They are the undisputed heavyweight of the data ecosystem. But have you ever stopped to ask why they are so dominant — or how they actually work under the hood?
In this course, we aren't just going to use DataFrames. We are going to build our own — in pure Python, with zero external dependencies. By the end, you'll understand the deep Python magic that powers industry giants like Polars and PySpark.
But first, we need to talk about why DataFrames exist in the first place.
If you've used Polars or PySpark, you've already used a lazy engine — you
just haven't seen inside one. This module covers the Python primitives that
make laziness work: generators and the yield keyword. If you're
already comfortable with generators, skim Lessons 1.2–1.3 and jump to
Lesson 1.4 (The Problem with Lambdas), which sets up
the rest of the course.
Lesson 1.1 The Ultimate Bridge to Excel
Native Python data structures are incredibly flexible, but they are terrible at tabular data. If you have a list of thousands of dictionaries representing user data, answering a simple question like "What is the average age of users in Europe?" requires writing clumsy for loops, managing accumulators, and handling missing keys.
import csv

# Step 1: Read the CSV into a list of dicts
rows = []
with open("orders.csv") as f:
    for row in csv.DictReader(f):
        rows.append(row)

# Step 2: Filter (everything is a string!)
filtered = [r for r in rows if float(r["amount"]) > 100]

# Step 3: Group by region (manual accumulator)
groups = {}
for r in filtered:
    region = r["region"]
    groups.setdefault(region, 0)
    groups[region] += float(r["amount"])

print(groups)
# {'EU': 45230.0, 'US': 89100.5, 'APAC': 23400.0}
It works. But count the problems: everything is a string (you're casting
float() manually, in two different places — miss one and you
get a silent bug), all data sits in memory at once, and each step is
imperative plumbing with zero composability. Want to add a sort? That's
another 5 lines.
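To make the "imperative plumbing" point concrete, here is roughly what that sort looks like in the eager style. The totals are hard-coded from the printed output above so the snippet stands alone:

```python
# Continuing the eager pipeline: sort regions by total amount, descending.
groups = {"EU": 45230.0, "US": 89100.5, "APAC": 23400.0}

# Step 4: Sort (yet another block bolted onto the pipeline)
sorted_regions = sorted(groups.items(), key=lambda kv: kv[1], reverse=True)
for region, total in sorted_regions:
    print(f"{region}: {total}")
# US: 89100.5
# EU: 45230.0
# APAC: 23400.0
```

Every new requirement means another hand-rolled loop or comprehension, and none of the steps compose with each other.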
On the other hand, software like Excel or SQL makes tabular analysis beautifully intuitive. You can filter, group, and aggregate in seconds. But Excel can't easily be automated in a data pipeline, and SQL requires a database engine.
The DataFrame is the bridge. It gives you the intuitive, tabular mental model of Excel, wrapped in the programmatic power and automation capabilities of Python.
A brief history

- Spark popularized lazy pipelines: you describe the computation, and nothing runs until you call .collect(). A paradigm shift — but it requires a JVM cluster.
- Polars brought that laziness to a single machine, with an expression API — col("x") + col("y") — that feels like magic.
For a decade, Pandas was the king of this bridge. But Pandas has a fundamental architectural flaw that the industry is rapidly moving away from: Eager Execution.
Lesson 1.2 The Memory Trap: Eager vs. Lazy Execution
To understand eager vs. lazy execution, let's look at how tabular data is usually represented in pure Python: a list of dictionaries.
raw_data = [
    {"name": "Alice", "age": 30, "region": "EU"},
    {"name": "Bob", "age": 25, "region": "US"},
]
A completely workable approach to building a DataFrame is to wrap this list in a class and write methods that loop through it:
raw_data = [
    {"name": "Alice", "age": 30, "region": "EU"},
    {"name": "Bob", "age": 25, "region": "US"},
    {"name": "Carol", "age": 35, "region": "EU"},
    {"name": "Dave", "age": 22, "region": "APAC"},
]

class EagerFrame:
    def __init__(self, data):
        self.data = data

    def filter(self, condition):
        # Executes immediately! Builds a brand new list in memory.
        return EagerFrame([row for row in self.data if condition(row)])

# Usage:
df = EagerFrame(raw_data)
filtered = df.filter(lambda row: row["age"] > 28)
print("Eager result:", filtered.data)

# Now compare with pyfloe's lazy approach:
import pyfloe as pf

lazy_df = pf.LazyFrame(raw_data)
result = lazy_df.filter(pf.col("age") > 28).collect()
print("Lazy result: ", result.to_pylist())
This list comprehension style is incredibly Pythonic. It works perfectly!
You can pass any callable function, chain .filter() calls
together, and it will give you the right answer.
But there is a catch: this is Eager Execution.
Every time you call .filter(), it iterates through millions of rows and creates a brand new list in memory. If your laptop only has 8GB of RAM, your program crashes.
Modern tools like Polars and PySpark solve this with Lazy Execution. When you tell Polars to read a massive file and filter it, it doesn't run a for loop. It waits until you explicitly ask for the final result.
Eager Pandas style
Each line runs immediately. Data is copied at every step. The file is fully loaded before any filtering. Simple to debug line-by-line, but memory explodes at scale.
Lazy Polars / Spark style
Operations build a plan describing what to do, without actually doing it.
Data flows only when you call .collect(). Memory stays
flat no matter how many operations you chain.
But how does Python actually "wait"? The secret lies in a built-in feature you may have seen but rarely used: Generators.
Lesson 1.3 Unlocking Laziness: Generators and the Volcano Model
When a normal Python function hits a return statement, it
spits out the entire result and shuts down. If it's returning a list of
10 million rows, it builds that entire list in memory first.
But if you replace return with the yield
keyword, the function becomes a Generator. When Python
hits yield, it hands you one item, pauses its exact
state, and waits. It only computes the next item when you
explicitly ask for it.
# Eager — builds entire list in memory
def get_all_rows():
    return [row for row in million_row_dataset]  # all at once

# Lazy — yields one row at a time
def get_rows_lazily():
    for row in million_row_dataset:
        yield row  # one at a time, on demand
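You can watch this pausing happen. Here is a small self-contained demo (the function takes its dataset as a parameter here, purely so the snippet runs on its own):

```python
def get_rows_lazily(dataset):
    for row in dataset:
        print(f"producing {row}")  # side effect so we can see when work happens
        yield row

gen = get_rows_lazily([1, 2, 3])
print("generator created, nothing produced yet")  # no "producing" line so far!
first = next(gen)   # only NOW does the function run, up to its first yield
print("got", first)
second = next(gen)  # resumes exactly where it paused
print("got", second)
# Row 3 is never produced unless we ask for it.
```

Creating the generator does zero work; each `next()` call runs the function just far enough to produce one more item.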
So generators let us produce data lazily. But a bare generator function is a one-shot, standalone thing. If we want to build a pipeline — read data, then filter it, then sort it — we need a way to connect these lazy steps together so each one can pull from the step before it.
The trick is to wrap each step in a class. We'll call each class a
"node" — as in, one link in a chain. Every node follows
the same rule: it has an .execute() method that yields rows.
Some nodes produce data on their own (read a file, hold a list). Other
nodes receive a "child" node and transform whatever the child produces
(filter it, add a column, sort it). Because every node speaks the same
language — call .execute(), get rows — you can snap them
together like building blocks.
In database engineering, this pattern is called the Volcano Execution Model. Think of it like a chain of workers — each one only talks to the person next to them, passing items along. Let's build two nodes to see how it works:
class ScanNode:
    def __init__(self, data):
        self.data = data

    def execute(self):
        # No list is built. Because of `yield`, calling
        # execute() returns a generator of rows.
        for row in self.data:
            yield row
Look closely at the for row in self.data line. It doesn't call
self.data[i] or check len(self.data). It just
iterates. That means self.data doesn't have to be a list —
it can be anything Python can iterate over. A generator that
reads lines from a CSV file. An API client that paginates through
results. A database cursor. This is how a lazy library reads a 10GB
file without loading it into memory: the "data" is a generator that
yields one row at a time from disk.
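Here is that idea as a runnable sketch: the same ScanNode, fed a generator that parses CSV rows on demand. An io.StringIO stands in for a real file on disk:

```python
import csv
import io

class ScanNode:
    def __init__(self, data):
        self.data = data  # any iterable: list, generator, cursor...

    def execute(self):
        for row in self.data:
            yield row

# A generator that parses one CSV row at a time; nothing is read up front.
def read_csv_lazily(file_obj):
    for row in csv.DictReader(file_obj):
        yield row

csv_text = "name,age\nAlice,30\nBob,25\n"
scan = ScanNode(read_csv_lazily(io.StringIO(csv_text)))
for row in scan.execute():
    print(row)
# {'name': 'Alice', 'age': '30'}
# {'name': 'Bob', 'age': '25'}
```

ScanNode never finds out (or cares) that its data came from a "file" rather than a list.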
So now we have a node that can produce rows lazily. But producing rows isn't useful on its own — we need to do something with them. Filter them, transform them, aggregate them.
Here's the key idea: instead of writing a filter() method
that loops through all the data and creates a new list, we create a
second node that wraps the first one. A
FilterNode takes a ScanNode (or any other node)
as its "child." When you ask the FilterNode to execute, it
pulls rows from its child's .execute() and only yields the
ones that pass the condition:
class FilterNode:
    def __init__(self, child, condition):
        self.child = child  # could be a ScanNode, or another FilterNode!
        self.condition = condition

    def execute(self):
        # Pull data upwards from the child node, lazily
        for row in self.child.execute():
            if self.condition(row):
                yield row
Read the execute() method carefully: self.child.execute()
calls the ScanNode's execute(), which yields rows one at a time.
The FilterNode tests each row and only yields the ones that pass. So we
get a chain: ScanNode → produces rows → FilterNode
→ filters them. And because FilterNode also has an .execute()
method, you could wrap it in another FilterNode. Or a
SortNode. Or a JoinNode. Each node only knows about
its immediate child — they all speak the same language.
When .collect() pulls data through this chain:
1. collect() calls FilterNode.execute(), which calls ScanNode.execute()
2. ScanNode yields Alice's row
3. FilterNode checks: is 30 > 28? Yes — yields the row upward
4. FilterNode pulls Bob (25 > 28? No — skipped), then ScanNode is exhausted — done
Notice how each row is processed exactly once, and only one row is ever "in flight" at a time. That's the volcano model in action.
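You can watch this interleaving for yourself by adding a print to each node. A self-contained sketch of the same two classes, instrumented:

```python
class ScanNode:
    def __init__(self, data):
        self.data = data

    def execute(self):
        for row in self.data:
            print(f"  scan: yielding {row['name']}")
            yield row

class FilterNode:
    def __init__(self, child, condition):
        self.child = child
        self.condition = condition

    def execute(self):
        for row in self.child.execute():
            print(f"  filter: checking {row['name']}")
            if self.condition(row):
                yield row

rows = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
plan = FilterNode(ScanNode(rows), lambda r: r["age"] > 28)
for row in plan.execute():
    print(f"collected {row['name']}")
# The prints interleave: scan yields Alice, filter checks Alice, Alice is
# collected, and only THEN does scan yield Bob. One row in flight at a time.
```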
Wiring it up to a LazyFrame
Now we can rewrite our DataFrame to be completely lazy. When you call
.filter(), it does zero data processing. It
just wraps the current plan in a new FilterNode:
class LazyFrame:
    def __init__(self, plan_node):
        self._plan = plan_node

    def filter(self, condition):
        # Instant! No data is processed here.
        new_plan = FilterNode(self._plan, condition)
        return LazyFrame(new_plan)

    def collect(self):
        # THIS is where execution finally happens.
        # We pull data through the entire chain of generators.
        return list(self._plan.execute())
Let's use it:
scan = ScanNode(raw_data)
df = LazyFrame(scan)
# This just builds the plan: LazyFrame -> FilterNode -> ScanNode
query = df.filter(lambda row: row["age"] > 28)
# NOW the generators fire, pulling data through the chain:
result = query.collect()
print(result)
# [{'name': 'Alice', 'age': 30, 'region': 'EU'}]
from __future__ import annotations

from collections.abc import Callable, Iterator

# Part 1: Build it yourself — just generators and classes

class ScanNode:
    def __init__(self, data: list[dict]):
        self.data = data

    def execute(self) -> Iterator[dict]:
        for row in self.data:
            yield row  # one row at a time!

class FilterNode:
    def __init__(self, child: ScanNode | FilterNode, condition: Callable[[dict], bool]):
        self.child = child
        self.condition = condition

    def execute(self) -> Iterator[dict]:
        for row in self.child.execute():  # pull from child
            if self.condition(row):
                yield row  # pass it up

raw_data = [
    {"name": "Alice", "age": 30, "region": "EU"},
    {"name": "Bob", "age": 25, "region": "US"},
    {"name": "Carol", "age": 35, "region": "EU"},
]

scan = ScanNode(raw_data)
filtered = FilterNode(scan, lambda row: row["age"] > 28)
print("Hand-built volcano:", list(filtered.execute()))

# Part 2: pyfloe does the exact same thing internally
import pyfloe as pf

lf = pf.LazyFrame(raw_data)
result = lf.filter(pf.col("age") > 28).collect()
print("pyfloe result: ", result.to_pylist())
That's a working lazy DataFrame in about 30 lines of Python. Because data is processed one element at a time through generators, we can chain 50 filters and transformations together, and our memory usage stays completely flat.
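To see that chaining really is free, here is a sketch that stacks 50 FilterNodes programmatically (the two classes are repeated so the snippet stands alone). Note the `t=threshold` default argument, which pins down the loop variable for each lambda:

```python
class ScanNode:
    def __init__(self, data):
        self.data = data

    def execute(self):
        yield from self.data

class FilterNode:
    def __init__(self, child, condition):
        self.child = child
        self.condition = condition

    def execute(self):
        for row in self.child.execute():
            if self.condition(row):
                yield row

# The data source is itself a generator expression: no list is ever built.
plan = ScanNode({"n": i} for i in range(1_000_000))

# Stack 50 filters. Building this plan touches zero rows.
for threshold in range(50):
    plan = FilterNode(plan, lambda row, t=threshold: row["n"] > t)

# Pull just the first survivor: only rows 0..50 ever flow through the chain.
first = next(plan.execute())
print(first)  # {'n': 50}
```

A row must satisfy n > 0, n > 1, ..., n > 49 to reach the top, so the first survivor is n = 50, and the scan stops after yielding 51 rows out of a million.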
Try it yourself: what happens if you stack another
FilterNode on top — say, filtering for
region == "EU"? Or change the condition to filter on a
different column? The best way to understand generators is to break
them and see what happens. Go ahead — the code is live.
The real power: stopping early
Here's where generators show their true strength. What if you don't need all the results — you just want the first match?
With an eager approach, you'd have no choice: filter the entire dataset, build the full result list, then take element zero. A million rows in, one row out — and you processed all million.
With generators, we can just... stop pulling:
class LazyFrame:
    # ... filter() and collect() from above ...

    def first(self):
        # Pull just ONE row from the generator chain.
        # The rest of the data is never touched.
        return next(self._plan.execute())
That's it. Python's built-in next() pulls one value from a
generator and then stops asking. The FilterNode
pauses. The ScanNode pauses. If you have a billion rows
but the first match is at row 5, you just processed 5 rows — not a
billion.
Pause and predict: in the example below, the data has
three rows and we call .first() with a filter for
age > 28. How many rows will the ScanNode
actually yield before everything stops? Think about it, then check
your answer.
scan = ScanNode([
    {"name": "Bob", "age": 25},
    {"name": "Alice", "age": 30},  # match — stops here!
    {"name": "Carol", "age": 35},  # never even read
    # ... imagine a million more rows ...
])
df = LazyFrame(scan)
df.filter(lambda row: row["age"] > 28).first()
# {'name': 'Alice', 'age': 30}
Answer: two rows — Bob and Alice. An
EagerFrame would have to run the filter on
every single row before you could ask for the first result —
because the filter builds a complete list. Generators don't build
anything. They just yield the next item when you ask, and pause when
you don't. That pause is the entire trick.
Polars and PySpark use this exact architecture. Their implementations are faster (batches instead of single rows, compiled code instead of Python), but the shape is identical: a chain of plan nodes, generators pulling data upward, nothing running until you ask.
Lesson 1.4 The Problem with Lambdas
We've solved the memory trap. Our engine is fully lazy. But there is a
glaring problem with our FilterNode.
Look at the filter we wrote:
query = df.filter(lambda row: row["age"] > 28)
That lambda is a black box. Our engine can
execute it — hand it a row, get True or False back — but it can never
look inside to understand what it's doing.
Why does that matter? Imagine you want to build a smart Query Optimizer — something that looks at your query plan and says:
"This query only ever reads the age column. The CSV file has
20 columns. Let's not bother carrying the other 19 through the rest
of the pipeline."
"There are two filters in a row. The first one only touches columns from the left table, so we can push it down before the join."
But with a lambda, the optimizer can't answer any of these questions. Python cannot easily inspect which columns a lambda function uses.
This is the difference between a toy and a real query engine. To build a
truly expressive, optimizable API — the kind where you write
col("age") > 28 instead of a lambda — we need a way to
represent computations as data that our engine can
inspect.
We need col("age") > 28 to return not a boolean, but an
expression — a small object that says "compare the
age column to 28 using greater-than." On its own, that
expression does nothing. It's just a description. But when a plan node
like FilterNode receives it, it can inspect it — ask which
columns it needs, rearrange it, and eventually evaluate it row by row.
Picture the shape of col("age") > 28: it has a > at the
root, col("age") on the left, and 28 on the
right. That's a tree — an expression tree.
But it's a completely different tree from the chain of plan nodes we just built. The plan describes how data flows (scan → filter → collect). The expression describes what to compute on each row. Expressions live inside plan nodes — a
FilterNode holds an expression that it
evaluates per row.
Two trees, both made of nodes, doing completely different jobs. You'll build both in this course.
To pull this off, we'll need to dive into one of the most powerful
corners of Python: dunder methods and operator overloading.
We need to hijack the > operator so that instead of running
a comparison, it builds an expression object.
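As a tiny preview of the mechanism (the names Col and Comparison here are illustrative, not the API we'll build), here is operator overloading in miniature:

```python
class Comparison:
    """A description of a comparison, not its result."""
    def __init__(self, column, op, value):
        self.column = column
        self.op = op
        self.value = value

    def __repr__(self):
        return f"Comparison({self.column!r} {self.op} {self.value!r})"

class Col:
    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        # Python calls this for `col(...) > other`. Instead of comparing,
        # we build a data structure describing the comparison.
        return Comparison(self.name, ">", other)

def col(name):
    return Col(name)

expr = col("age") > 28
print(expr)         # Comparison('age' > 28)
print(expr.column)  # 'age': an optimizer can now see which column is used
```

The `>` never produces a boolean; it produces an object that an engine can inspect, rearrange, and evaluate later.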
In Module 2, you'll see how
col("age") > 28 creates an expression (not a boolean),
how col("a") + col("b") creates another kind of expression,
and how the entire Polars-style col() API is really just
a collection of Python dunder methods building an expression tree that
the engine can walk and optimize.
Exercises Check Your Understanding
Before moving to Module 2, make sure these concepts are solid:
Quick check
1. In a lazy DataFrame library, when does computation actually happen?
2. In a lazy DataFrame library, what does col("x") > 5 return?
3. Why can't a query optimizer work with a lambda filter?
4. You have a pipeline: ScanNode → FilterNode → FilterNode → .first().
The data has 1 million rows. The first filter passes ~10% of rows, and
the second filter's first match is at position 50 in the filtered stream.
Roughly how many rows does ScanNode yield?
Challenge Build a MapNode
You've seen FilterNode — it pulls rows from its child and
keeps the ones that pass a predicate. Now build a MapNode
that transforms each row using a function. This is the
generator-based equivalent of Python's built-in map().
Think about:
- What does execute() yield? (Hint: the transformed row, not the original.)
- Should it filter anything out, or pass every row through?
from __future__ import annotations

from collections.abc import Callable, Iterator

class PlanNode:
    def execute(self) -> Iterator[dict]:
        raise NotImplementedError

class ScanNode(PlanNode):
    def __init__(self, rows: list[dict]):
        self.rows = rows

    def execute(self) -> Iterator[dict]:
        yield from self.rows

class MapNode(PlanNode):
    def __init__(self, child: PlanNode, func: Callable[[dict], dict]):
        self.child = child
        self.func = func

    def execute(self) -> Iterator[dict]:
        # TODO: yield each row transformed by self.func
        pass

# Test it: add a "senior" field based on age
data = [
    {"name": "Alice", "age": 35},
    {"name": "Bob", "age": 22},
    {"name": "Carol", "age": 45},
]

def add_senior_flag(row):
    return {**row, "senior": row["age"] >= 30}

scan = ScanNode(data)
mapped = MapNode(scan, add_senior_flag)
for row in mapped.execute():
    print(row)
Bonus question: chain ScanNode → FilterNode → MapNode and call
.first(). The ScanNode has 1 million rows, but only the 500th
row passes the filter. How many rows does the ScanNode yield? How many does
the MapNode process? (Answer: 500 and 1, respectively. The generator chain
stops pulling after the first match reaches the top.)