Build Your Own DataFrame
Introduction

Meet pyfloe

A zero-dependency, lazy dataframe engine written in pure Python — and the reason this course can exist at all.

Most courses that teach you "how libraries work" resort to toy examples. They build a 50-line class, wave their hands, and say "real libraries do something like this, but more complicated." You walk away feeling like you learned something, but you couldn't actually point to a line of production code and say "I understand why that's there."

This course is different. We have a real library — a working, published, pip-installable dataframe engine — and we're going to take it apart until there's nothing left that feels like magic.

That library is pyfloe.


The Library: What Is pyfloe?

pyfloe is a lazy dataframe library written entirely in pure Python. No C extensions. No Cython. No Rust bindings. No dependencies at all — not even NumPy. Install it with pip install pyfloe and you get the whole thing: expression API, query optimizer, window functions, streaming I/O, and type safety.

Here's what it looks like to use:

A real pyfloe pipeline
import pyfloe as pf

result = (
    pf.read_csv("orders.csv")               # lazy — file not read yet
    .filter(pf.col("amount") > 100)         # lazy — no rows scanned
    .with_column("rank", pf.row_number()
        .over(partition_by="region", order_by="amount"))
    .select("order_id", "region", "amount", "rank")
    .sort("region", "rank")
    .collect()                              # NOW it runs
)

If you've used Polars or PySpark, this should look familiar. pyfloe implements the same architectural ideas — lazy evaluation, expression trees, a volcano execution model, query optimization — but in a codebase small enough that you can read every line in an afternoon.

Show the magic
That pf.col("amount") > 100 expression doesn't compare anything when you write it. It builds an expression object — a tiny tree node that says "compare column 'amount' to the value 100." The .filter() call doesn't scan any rows either — it creates a plan node that wraps the expression. Nothing executes until .collect(). This is exactly how Polars works, and by the end of this course, you'll have built every piece of this pipeline yourself.
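The trick behind this is operator overloading. Here's a minimal sketch of the idea; the class names mirror the Col, Lit, and BinaryExpr nodes described later in this introduction, but the code itself is illustrative, not pyfloe's actual source:

```python
class Expr:
    def __gt__(self, other):
        # ">" no longer compares; it records the comparison as a tree node
        right = other if isinstance(other, Expr) else Lit(other)
        return BinaryExpr(">", self, right)

class Col(Expr):
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return f"col({self.name!r})"

class Lit(Expr):
    def __init__(self, value):
        self.value = value
    def __repr__(self):
        return f"lit({self.value!r})"

class BinaryExpr(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def __repr__(self):
        return f"({self.left!r} {self.op} {self.right!r})"

expr = Col("amount") > 100
print(type(expr).__name__)   # BinaryExpr: an object, not a bool
print(expr)                  # (col('amount') > lit(100))
```

Because Python routes `>` through `__gt__`, the library gets to decide what comparison means, and it chooses "build a description" instead of "compute an answer."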

See it for yourself

Before we run a full pipeline, try something smaller. Hit Run and look at what pf.col("price") > 100 actually returns:

Live — what does col("price") > 100 return?
import pyfloe as pf

expr = pf.col("price") > 100

# Not True. Not False. It's an expression object.
print(type(expr))
print(repr(expr))

# It knows which columns it needs:
print(f"Columns used: {expr.required_columns()}")

That's the core insight: > didn't compare anything. It built an object that describes the comparison. The engine can inspect it, optimize around it, and evaluate it later. Now let's see a full pipeline:

Live — editable pyfloe pipeline
import pyfloe as pf

# Create some data inline
orders = pf.LazyFrame([
    {"order_id": 1, "region": "EU",   "amount": 250.0},
    {"order_id": 2, "region": "US",   "amount": 45.0},
    {"order_id": 3, "region": "EU",   "amount": 180.0},
    {"order_id": 4, "region": "US",   "amount": 320.0},
    {"order_id": 5, "region": "APAC", "amount": 90.0},
    {"order_id": 6, "region": "US",   "amount": 150.0},
])

# Build a lazy pipeline — nothing runs yet
result = (
    orders
    .filter(pf.col("amount") > 100)
    .select("order_id", "region", "amount")
    .sort("region", "amount")
    .collect()
)

# Print the results
for row in result.to_pylist():
    print(row)

The Teaching Vehicle: Why pyfloe?

We could have tried to reverse-engineer Polars itself, or PySpark, or DuckDB. But those libraries have a problem as teaching material: they're enormous. Polars is hundreds of thousands of lines of Rust. PySpark delegates to a JVM. DuckDB is C++. You can read their documentation, but you can't read their source code and come away understanding the full picture.

pyfloe was purpose-built to solve this problem. It implements the core architectural patterns of the big libraries, but in pure Python, at a scale where every design decision is visible:

Pure Python

No compiled extensions. Every algorithm — hash joins, expression evaluation, query optimization — is written in Python you can step through in a debugger.

Zero Dependencies

The entire library is self-contained. There's no layer of abstraction hiding behind a numpy or pyarrow import. What you see is what runs.

Small Enough to Read

The entire engine fits in six files. Small enough to read end-to-end, large enough to be real — it has streaming I/O, window functions, and a rule-based optimizer.

Real Architecture

Expression ASTs, the Volcano model, hash aggregation, filter pushdown, column pruning — the same patterns that power Polars, Spark, and every serious query engine.

Important caveat
pyfloe is not trying to compete with Polars on performance. It can't — Python is orders of magnitude slower than Rust for number-crunching. But performance isn't the point. The point is that the architecture is identical. When you understand how pyfloe's FilterNode pulls rows from a ScanNode through a generator chain, you understand how Polars does the same thing with Rust iterators. The concepts transfer directly.
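That generator chain can be sketched in a few lines. The function names here are illustrative stand-ins for pyfloe's node classes, and the lambda predicate is a placeholder for what the real engine stores as an expression tree:

```python
def scan(rows):
    # In the real library this would read a file; here we scan a list
    for row in rows:
        yield row

def filter_rows(source, predicate):
    # Pull rows from the child node on demand, keep only matches
    for row in source:
        if predicate(row):
            yield row

data = [{"amount": 250.0}, {"amount": 45.0}, {"amount": 320.0}]
pipeline = filter_rows(scan(data), lambda r: r["amount"] > 100)

# Nothing has run yet; pipeline is a generator. Iterating pulls
# rows upward through the chain, one at a time.
print(list(pipeline))  # [{'amount': 250.0}, {'amount': 320.0}]
```

Each node pulls from its child only when asked, which is the volcano model in miniature: no intermediate lists, no work done until something downstream asks for a row.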

The Blueprint: Six Files, One Engine

pyfloe's entire codebase fits in six files. Each one is responsible for a distinct layer of the engine, and together they form the full pipeline from "read a CSV" to "return optimized results." Here's the lay of the land:

  • schema.py — type propagation: every node knows its output types before any data flows
  • expr.py — the expression AST: col, lit, when, arithmetic, comparisons, aggregations, windows
  • plan.py — plan nodes and the query optimizer: the volcano execution model, filter pushdown, column pruning
  • core.py — the LazyFrame class: the public API that wraps plan nodes into a fluent interface
  • io.py — file readers and writers: CSV, TSV, JSONL, JSON, fixed-width, with streaming and type inference
  • stream.py — streaming pipelines: from_iter, from_chunks, and the Stream class

You don't need to memorize this. Over the course of five modules, we'll build simplified versions of each layer ourselves, then compare our code to the real pyfloe source. By the end, every file in that list will feel like something you could have written yourself.


The Core Idea: Two Trees Working Together

If there is one architectural insight that makes everything in pyfloe (and Polars, and Spark) click, it's this: the engine is built around two distinct tree structures that work together.

Tree 1: The Plan Tree

The plan tree describes how data flows. Each node is an operation — scan a file, filter rows, join two tables, project columns. The nodes form a tree because operations are nested: a FilterNode pulls from a ScanNode, a JoinNode pulls from two children, and so on. When you call .collect(), the engine walks this tree from the top and pulls data upward through every node.

A plan tree in pseudocode
ProjectNode("order_id", "amount")        # ← collect() starts here
  └── FilterNode(amount > 100)           # pulls from ↓
        └── ScanNode("orders.csv")       # reads the file

Tree 2: The Expression Tree

The expression tree describes what to compute on each row. When you write pf.col("amount") > 100, that's not a comparison — it's a tree of objects: a BinaryExpr with a Col("amount") on the left, a Lit(100) on the right, and the > operator connecting them.

An expression tree for pf.col("amount") > 100
BinaryExpr(op=">")
  ├── left:  Col("amount")
  └── right: Lit(100)

Expressions live inside plan nodes. The FilterNode above doesn't hold a lambda — it holds that expression tree. This is the key architectural decision that makes everything else possible: because the expression is data (not a black-box function), the engine can inspect it, figure out which columns it needs, and make smart optimization decisions.
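To see what "the expression is data" buys you, consider how an engine could answer "which columns does this expression need?" The live example earlier called expr.required_columns(); here's a sketch of how such a method can work, written as a free function over simplified node classes (illustrative shapes, not pyfloe's real ones):

```python
class Col:
    def __init__(self, name):
        self.name = name

class Lit:
    def __init__(self, value):
        self.value = value

class BinaryExpr:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

def required_columns(expr):
    # Walk the tree and collect every column reference it contains
    if isinstance(expr, Col):
        return {expr.name}
    if isinstance(expr, BinaryExpr):
        return required_columns(expr.left) | required_columns(expr.right)
    return set()   # literals need no columns

tree = BinaryExpr(">", Col("amount"), Lit(100))
print(required_columns(tree))  # {'amount'}
```

A lambda like `lambda r: r["amount"] > 100` computes the same predicate but is opaque: there is no way to ask it which columns it reads. A tree of plain objects can be walked, questioned, and rewritten.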

The two trees converge
These two trees appear separately at first. We'll build expressions in Module 2, and plan nodes in Modules 1 and 3. But they come together in Module 3 (when expressions are plugged into plan nodes) and fully converge in Module 5 (the Optimizer), where the engine walks the plan tree and inspects the expression trees inside each node to decide how to rewrite the plan. Keep these two structures in mind — they are the skeleton of the entire engine.

The Roadmap: What We'll Build

The course follows pyfloe's architecture from the bottom up. Each module builds a new layer, and each layer depends on the one before it. Here's the journey:

1. The Context & Foundation
   Why DataFrames exist. Eager vs. lazy execution. Generators and the volcano model. Why lambdas are a dead end for optimization.

2. Python Magic — Dunder Methods & the Expression AST
   Hijacking Python's operators so pf.col("x") + pf.col("y") builds an inspectable tree instead of running a computation. Col, Lit, BinaryExpr, WhenExpr.
   → expr.py

3. The Engine Room — Plan Nodes & Batched Execution
   Plugging expressions into plan nodes. Batched execution for performance. The LazyFrame wrapper. Schema propagation without data.
   → plan.py, core.py, schema.py

4. Advanced Algorithms — Hashing, Grouping & Joins
   Hash joins, join variants (inner/left/full), hash aggregation with running accumulators, and the sorted-stream alternatives that use constant memory.
   → plan.py (JoinNode, AggNode)

5. The Query Optimizer (Capstone)
   The payoff. Filter pushdown and column pruning — the optimizer walks the plan tree, inspects the expression trees inside it, and rewrites the plan to eliminate unnecessary work. Everything converges here.
   → plan.py (Optimizer)

Between Modules 4 and 5, there's a hands-on Interlude where you wire everything together into a complete mini-pipeline and watch it run. After Module 5, there are optional deep dives into streaming I/O and window functions for those who want to keep going.

Your pace, your path
This course is designed to be read front-to-back, but you don't have to treat every section equally. If you're already comfortable with generators and lazy evaluation, you can skim Modules 1–2 and focus on the interactive code blocks. If you're here mainly for the optimizer payoff, skim the algorithm details in Modules 3–4 and jump to Module 5. The code blocks are self-contained — you'll get value from running them even if you read the surrounding text quickly.

Honest Boundaries: What We Won't Cover

pyfloe has more features than we can teach in five modules, and that's fine. The goal of this course isn't exhaustive documentation — it's deep understanding of the architecture. Here's what we're choosing to skip:

Streaming I/O — pyfloe's file readers (io.py) use factory-based generators to stream CSV, TSV, and JSONL files with constant memory. We include a deep-dive appendix on this, but it's not part of the main five modules.

Window functions — row_number(), rank(), lag(), lead(), and more. These follow the same expression pattern you'll learn in Module 2, but their execution involves sort-partition-scan logic that deserves its own treatment. We cover these in a bonus topic.

Accessor methods — pyfloe supports .str (string methods like upper, contains, replace) and .dt (datetime extraction like year, month, truncate). These are straightforward once you understand the expression AST — they're just new node types.

Pivot, unpivot, and explode — reshape operations that are useful but orthogonal to the core architecture.
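As a small taste of the streaming idea mentioned above, a factory-based reader is just a function that hands back a fresh, lazy row stream on each call. This is a sketch of the pattern under that assumption, not pyfloe's actual io.py:

```python
import csv

def stream_csv(path):
    # A generator factory: each call to rows() returns a new lazy stream.
    # The file is read line by line, so memory use stays constant.
    def rows():
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                yield row
    return rows

# reader = stream_csv("orders.csv")   # nothing read yet
# for row in reader():                # rows stream one at a time
#     print(row)
```

Returning a factory rather than a bare generator matters: a generator can be consumed only once, while a factory lets the engine restart the scan, which a join or a retry may need.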

The good news
Every one of these features is built on the same foundation you will learn: expressions, plan nodes, and the volcano model. Once you understand the architecture, reading the code for window functions or streaming I/O is a matter of following patterns you already know.

One Last Thing: How This Started

pyfloe wasn't some grand plan. I needed an engine for Flowfile — a visual data tool I was building — so I sat down and wrote one. Pure Python, itertools, operator.itemgetter, a lot of trial and error. Python is my language for everything — still my default for anything I build — so naturally the engine was Python too. It worked. Flowfile eventually moved to Polars, and the original engine went into a drawer.

Years later, I looked back at that old code and saw something I hadn't appreciated at the time: there were some really fun patterns in there. Hash joins, expression trees, a little optimizer — all the same ideas that power the big libraries, just written in plain Python by someone who was figuring it out as he went. So I cleaned it up, restructured it a bit, and turned it into something I could use for teaching — to myself first, and now to others. That's pyfloe.

This course is not going to teach you how to build pyfloe. You can clone the repo or pip install the thing — it's right there. But if you want to understand how something like this works well enough to build your own version, shaped the way that makes sense to you, this is the place to be.

I always think I can design things better — and once you understand how something works, you will too.


Source References

plan.py — Plan Nodes

  • FilterNode — filters rows using an expression predicate
  • ScanNode — reads data from a source
  • JoinNode — joins two plan trees together

core.py — LazyFrame API

  • LazyFrame — the public API that wraps plan nodes into a fluent interface

expr.py — Expression AST

  • BinaryExpr — two operands combined with a binary operator
  • Col — column reference expression
  • Lit — literal value expression
  • col — creates a column reference expression
  • lit — creates a literal value expression
  • when — conditional expression builder

stream.py — Streaming

  • Stream — streaming pipeline class