Build Your Own DataFrame
Interlude

Why I Think This Is Interesting

Before we start building anything, I want to tell you why this project exists — and why I think it teaches something that goes beyond Python.


It started with a real problem

I built Flowfile, a visual data tool that lets people drag nodes on a canvas and build data pipelines without writing code. Under the hood, those pipelines need an engine. Something that takes a chain of operations — filter this, join that, aggregate here — and actually executes them.

So I built one. From scratch. In pure Python. It's long since been replaced by Polars — which is faster by orders of magnitude — but the architecture survived the swap almost unchanged. That's not a coincidence, and it's part of why this course exists.

The first version ran on itertools and operator.itemgetter. No AI, no autocomplete. Just the standard library, a lot of documentation reading, and a lot of getting things wrong before getting them right. That engine eventually became pyfloe — the library whose source code you'll be studying in this course.

I'm telling you this because what follows isn't a hypothetical exercise. It's not "imagine you were building a query engine." Someone actually built this, for a real product, and hit every wall you'd expect: memory limits, performance cliffs, lambdas that couldn't be inspected, schemas that didn't propagate. The solutions we'll walk through aren't textbook answers. They're the answers I actually arrived at — sometimes on the first try, often not.

In a way, this course is the documentation I wish I'd had when I started.


You're learning two things at once

On the surface, this course teaches Python. Good Python. The kind that shows up in well-maintained libraries: dunder methods, __slots__, iterator protocols, defaultdict, expression compilation. The stuff that separates "I can write Python" from "I understand how Python actually works under the hood."

But underneath that, you're learning something bigger. You're learning how to think about computation as a structure you can inspect and rewrite.

That idea — that you don't just run code, you build a description of what you want and then decide how to run it later — is one of the most important patterns in computer science. It's the idea behind SQL query planners, which have worked this way since the 1970s. It's the idea behind lazy evaluation in Spark and Polars. It's the idea behind TensorFlow's computation graphs and PyTorch's autograd. It's the idea behind every modern compiler's intermediate representation.

The big idea
When you write col("price") * col("qty") > lit(100) and it returns a tree of objects instead of a number, you're doing the same thing all of these systems do: separating the what from the how. Describing intent as data. Making computation inspectable.

That's not a Python trick. That's a way of thinking. And once you see it, you start noticing it everywhere.
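To make that concrete, here is a minimal sketch of what "a tree of objects instead of a number" can look like. The names col and lit come from the text above; the class layout is my simplified stand-in, not pyfloe's actual implementation. The trick is operator overloading: * and > build nodes rather than computing values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Expr:
    # Overloaded operators return tree nodes instead of results.
    def __mul__(self, other): return BinOp("*", self, other)
    def __gt__(self, other): return BinOp(">", self, other)

@dataclass(frozen=True)
class Col(Expr):
    name: str

@dataclass(frozen=True)
class Lit(Expr):
    value: object

@dataclass(frozen=True)
class BinOp(Expr):
    op: str
    left: Expr
    right: Expr

def col(name): return Col(name)
def lit(value): return Lit(value)

# This line runs no arithmetic. It builds a description:
# a ">" node whose left child is a "*" node over two columns.
tree = col("price") * col("qty") > lit(100)
```

Nothing here is Python-specific magic; it is the same pattern Polars and PyTorch use, where user-facing operators construct a graph that an engine inspects and executes later.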


Why building by hand matters

There's a version of this course that doesn't exist, where I just point you to the Polars source code and say "here's how it works." That would be faster. It would also be almost useless.

Understanding doesn't come from reading. It comes from building something, watching it break, figuring out why, and building it again. There's a moment in this course where you'll write a perfectly reasonable FilterNode using a lambda — and then hit the wall. The optimizer can't inspect a lambda. It's a black box. You can't push a filter below a join if you don't know what columns the filter needs. And suddenly expression trees aren't an academic concept. They're the only way forward.
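The wall is easy to demonstrate. The sketch below uses hypothetical class names, not pyfloe's real ones: a lambda predicate can only be called, never questioned, while an expression tree can be walked to answer the one question the optimizer needs, namely which columns the filter reads.

```python
# A lambda is a black box: the only interface is calling it.
pred = lambda row: row["price"] * row["qty"] > 100
# There is no reliable way to ask pred which columns it needs.

# An expression tree answers that question by simple traversal.
class Col:
    def __init__(self, name): self.name = name

class Lit:
    def __init__(self, value): self.value = value

class BinOp:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

def columns_used(expr):
    """Walk the tree and collect every column name it references."""
    if isinstance(expr, Col):
        return {expr.name}
    if isinstance(expr, BinOp):
        return columns_used(expr.left) | columns_used(expr.right)
    return set()  # literals reference no columns

tree = BinOp(">", BinOp("*", Col("price"), Col("qty")), Lit(100))
needed = columns_used(tree)  # the set {'price', 'qty'}
```

With that answer in hand, an optimizer can check whether a join below the filter actually supplies those columns, and if so, push the filter down. With the lambda, it can only shrug.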

That "oh, that's why it exists" feeling is what I chase in my work, my hobbies, and pretty much everything else. Not "here's a clever technique," but "here's the problem that makes the technique feel inevitable."

Information vs. knowledge
And honestly, that's how you turn information into knowledge. Information is knowing that hash joins exist. Knowledge is having built one, hit the edge cases, and understanding when to reach for a sort-merge join instead. The aha moment is the bridge between the two, and there's no shortcut across it.
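For readers who have only ever met hash joins as a term, here is the core of one, a generic inner-join sketch rather than pyfloe's actual code: build a hash table on one side, then stream the other side and probe it.

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Generic inner hash join: index one side, probe with the other."""
    # Build phase: group right-side rows by their key value.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    # Probe phase: stream the left side, emitting merged matches.
    for row in left:
        for match in index.get(row[key], []):
            yield {**row, **match}

orders = [{"id": 1, "user": "ada"}, {"id": 2, "user": "bob"}]
users = [{"user": "ada", "plan": "pro"}]
list(hash_join(orders, users, "user"))
# [{'id': 1, 'user': 'ada', 'plan': 'pro'}]
```

The edge cases start the moment you run it: what if the build side doesn't fit in memory, or one key value matches millions of rows? That is where the case for sort-merge joins begins, and it's exactly the kind of judgment you only develop by building.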

The AI elephant in the room

It's 2026. You can ask an LLM to write you a hash join and it'll produce something reasonable in a few seconds. I'm not going to pretend that's not the reality — I built a visual tool that generates code for people.

But there's a difference between code you generated and code you understand. When Flowfile moved from my original Python engine to Polars, the architecture barely changed. Not because the architecture was brilliant — it was pretty standard, honestly — but because I understood why each piece existed. The decisions weren't arbitrary. They were load-bearing. You can swap an engine when you understand the structure. You can't swap anything when you're just prompting and hoping for the best.

The stuff that becomes obsolete is the syntax. The stuff that lasts is the mental model. This course is about the mental model.


What you're about to build

Over the next five modules you'll build a working lazy dataframe engine from scratch. Expressions, plan nodes, batched execution, hash joins, aggregation, and a query optimizer. The optimizer is the finale — about 80 lines of code that walk the plan tree and automatically eliminate waste. It's the payoff for everything before it, and it's surprisingly small. That's what good architecture does.

pyfloe is roughly 6,000 lines of Python. Zero dependencies. No C extensions. No Rust. Just the standard library and a series of design decisions that I'll try to explain as we go. Where the code is clever, I'll show you why. Where I'd do it differently today, I'll tell you that too.

Let's build something
Module 1 starts with the question every data tool has to answer first: why do DataFrames exist — and what goes wrong when you try to build one the obvious way?


Source References

plan.py — Plan Nodes

expr.py — Expression AST