Meet pyfloe
A zero-dependency, lazy dataframe engine written in pure Python — and the reason this course can exist at all.
Most courses that teach you "how libraries work" resort to toy examples. They build a 50-line class, wave their hands, and say "real libraries do something like this, but more complicated." You walk away feeling like you learned something, but you couldn't actually point to a line of production code and say "I understand why that's there."
This course is different. We have a real library — a working, published, pip-installable dataframe engine — and we're going to take it apart until there's nothing left that feels like magic.
That library is pyfloe.
The Library: What Is pyfloe?
pyfloe is a lazy dataframe library written entirely in pure Python.
No C extensions. No Cython. No Rust bindings. No dependencies at all —
not even NumPy. Install it with pip install pyfloe and you get
the whole thing: expression API, query optimizer, window functions, streaming
I/O, and type safety.
Here's what it looks like to use:
import pyfloe as pf

result = (
    pf.read_csv("orders.csv")            # lazy — file not read yet
    .filter(pf.col("amount") > 100)      # lazy — no rows scanned
    .with_column("rank", pf.row_number()
                 .over(partition_by="region", order_by="amount"))
    .select("order_id", "region", "amount", "rank")
    .sort("region", "rank")
    .collect()                           # NOW it runs
)
If you've used Polars or PySpark, this should look familiar. pyfloe implements the same architectural ideas — lazy evaluation, expression trees, a volcano execution model, query optimization — but in a codebase small enough that you can read every line in an afternoon.
The pf.col("amount") > 100 expression doesn't compare anything
when you write it. It builds an expression object — a tiny tree
node that says "compare column 'amount' to the value 100." The
.filter() call doesn't scan any rows either — it creates a
plan node that wraps the expression. Nothing executes until
.collect(). This is exactly how Polars works, and by the end
of this course, you'll have built every piece of this pipeline yourself.
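The mechanism behind this is plain Python operator overloading. Here's a minimal sketch of the idea, with hypothetical classes that are not pyfloe's actual source: comparing a column object returns a tree node instead of a boolean.

```python
# Minimal sketch (not pyfloe's real classes): `>` builds a node, not a bool.
class Lit:
    def __init__(self, value):
        self.value = value

class BinaryExpr:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

class Col:
    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        # Python calls __gt__ for `col > value`; we return a description
        # of the comparison instead of performing it.
        return BinaryExpr(">", self, Lit(other))

expr = Col("amount") > 100
print(type(expr).__name__)  # BinaryExpr — a description, not a result
```

Because `>` now returns an object, the comparison can be stored inside a plan node and evaluated later, once per row, at collect time.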
See it for yourself
Before we run a full pipeline, try something smaller. Hit Run and look
at what pf.col("price") > 100 actually returns:
import pyfloe as pf
expr = pf.col("price") > 100
# Not True. Not False. It's an expression object.
print(type(expr))
print(repr(expr))
# It knows which columns it needs:
print(f"Columns used: {expr.required_columns()}")
That's the core insight: > didn't compare anything. It built
an object that describes the comparison. The engine can inspect it,
optimize around it, and evaluate it later. Now let's see a full pipeline:
import pyfloe as pf

# Create some data inline
orders = pf.LazyFrame([
    {"order_id": 1, "region": "EU", "amount": 250.0},
    {"order_id": 2, "region": "US", "amount": 45.0},
    {"order_id": 3, "region": "EU", "amount": 180.0},
    {"order_id": 4, "region": "US", "amount": 320.0},
    {"order_id": 5, "region": "APAC", "amount": 90.0},
    {"order_id": 6, "region": "US", "amount": 150.0},
])

# Build a lazy pipeline — nothing runs yet
result = (
    orders
    .filter(pf.col("amount") > 100)
    .select("order_id", "region", "amount")
    .sort("region", "amount")
    .collect()
)

# Print the results
for row in result.to_pylist():
    print(row)
The Teaching Vehicle: Why pyfloe?
We could have tried to reverse-engineer Polars itself, or PySpark, or DuckDB. But those libraries have a problem as teaching material: they're enormous. Polars is hundreds of thousands of lines of Rust. PySpark delegates to a JVM. DuckDB is C++. You can read their documentation, but you can't read their source code and come away understanding the full picture.
pyfloe was purpose-built to solve this problem. It implements the core architectural patterns of the big libraries, but in pure Python, at a scale where every design decision is visible:
Pure Python
No compiled extensions. Every algorithm — hash joins, expression evaluation, query optimization — is written in Python you can step through in a debugger.
Zero Dependencies
The entire library is self-contained. There's no layer of abstraction
hiding behind a numpy or pyarrow import.
What you see is what runs.
Small Enough to Read
The entire engine fits in six files. Small enough to read end-to-end, large enough to be real — it has streaming I/O, window functions, and a rule-based optimizer.
Real Architecture
Expression ASTs, the Volcano model, hash aggregation, filter pushdown, column pruning — the same patterns that power Polars, Spark, and every serious query engine.
Once you've seen how a FilterNode pulls
rows from a ScanNode through a generator chain, you understand
how Polars does the same thing with Rust iterators. The concepts transfer
directly.
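One of those patterns, filter pushdown, can be sketched in a few lines. This is a hypothetical, simplified rewrite rule, not pyfloe's actual optimizer: if a filter sits above a projection that still carries the filtered column, swap them so fewer rows flow through the projection.

```python
# Hypothetical sketch of one rule in a rule-based optimizer:
# Filter(Project(x)) -> Project(Filter(x)) when the projection keeps
# the column the filter needs. Node names follow the article.
class ScanNode:
    def __init__(self, source):
        self.source = source

class ProjectNode:
    def __init__(self, child, columns):
        self.child, self.columns = child, columns

class FilterNode:
    def __init__(self, child, column, predicate):
        self.child, self.column, self.predicate = child, column, predicate

def push_filters_down(node):
    # Apply the single rewrite rule at the root of the plan tree.
    if isinstance(node, FilterNode) and isinstance(node.child, ProjectNode):
        project = node.child
        if node.column in project.columns:
            pushed = FilterNode(project.child, node.column, node.predicate)
            return ProjectNode(pushed, project.columns)
    return node

plan = FilterNode(
    ProjectNode(ScanNode("orders.csv"), ["region", "amount"]),
    "amount", lambda v: v > 100,
)
optimized = push_filters_down(plan)
print(type(optimized).__name__)  # ProjectNode — the filter moved below it
```

A real optimizer applies a set of such rules repeatedly over the whole tree until nothing changes; the sketch shows a single pass at the root.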
The Blueprint: Six Files, One Engine
pyfloe's entire codebase fits in six files. Each one is responsible for a distinct layer of the engine, and together they form the full pipeline from "read a CSV" to "return optimized results." Here's the lay of the land:
core.py — the LazyFrame class, the public API that wraps plan nodes into a fluent interface
stream.py — from_iter, from_chunks, and the Stream class
You don't need to memorize this. Over the course of five modules, we'll build simplified versions of each layer ourselves, then compare our code to the real pyfloe source. By the end, every file in this diagram will feel like something you could have written yourself.
The Core Idea: Two Trees Working Together
If there is one architectural insight that makes everything in pyfloe (and Polars, and Spark) click, it's this: the engine is built around two distinct tree structures that work together.
Tree 1: The Plan Tree
The plan tree describes how data flows. Each node is an operation —
scan a file, filter rows, join two tables, project columns. The nodes
form a tree because operations are nested: a FilterNode pulls
from a ScanNode, a JoinNode pulls from two children,
and so on. When you call .collect(), the engine walks this tree
from the top and pulls data upward through every node.
ProjectNode("order_id", "amount")        # ← collect() starts here
└── FilterNode(amount > 100)             # pulls from ↓
    └── ScanNode("orders.csv")           # reads the file
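The pull-based flow above (the Volcano model) maps naturally onto Python generators. Here's a minimal sketch, assuming simplified plan nodes as plain functions rather than pyfloe's actual classes:

```python
# Volcano-model sketch: each node is a generator that pulls rows from its
# child one at a time. Nothing runs until the top is consumed.
def scan_node(rows):
    for row in rows:          # in a real engine this would read a file lazily
        yield row

def filter_node(child, predicate):
    for row in child:         # pull from the child, pass matches upward
        if predicate(row):
            yield row

def project_node(child, columns):
    for row in child:         # keep only the requested columns
        yield {c: row[c] for c in columns}

rows = [{"order_id": 1, "amount": 250.0},
        {"order_id": 2, "amount": 45.0}]

plan = project_node(
    filter_node(scan_node(rows), lambda r: r["amount"] > 100),
    ["order_id"],
)

result = list(plan)  # collect(): only now do rows flow through the chain
print(result)        # [{'order_id': 1}]
```

Building `plan` does no work at all; `list(plan)` is what drives rows up through the chain, one at a time, with constant memory per node.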
Tree 2: The Expression Tree
The expression tree describes what to compute on each row. When
you write pf.col("amount") > 100, that's not a comparison — it's
a tree of objects: a BinaryExpr with a Col("amount")
on the left, a Lit(100) on the right, and the >
operator connecting them.
BinaryExpr(op=">")
├── left: Col("amount")
└── right: Lit(100)
Expressions live inside plan nodes. The FilterNode
above doesn't hold a lambda — it holds that expression tree. This is the
key architectural decision that makes everything else possible: because
the expression is data (not a black-box function), the engine can
inspect it, figure out which columns it needs, and make smart
optimization decisions.
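For instance, finding the columns an expression needs is just a tree walk. A sketch, using hypothetical node classes in the spirit of the ones above (not pyfloe's actual implementation):

```python
# Sketch: because the expression is data, the engine can walk the tree
# and collect every column it references.
class Col:
    def __init__(self, name):
        self.name = name

class Lit:
    def __init__(self, value):
        self.value = value

class BinaryExpr:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

def required_columns(expr):
    if isinstance(expr, Col):
        return {expr.name}
    if isinstance(expr, BinaryExpr):
        return required_columns(expr.left) | required_columns(expr.right)
    return set()  # literals need no columns

expr = BinaryExpr(">", Col("amount"), Lit(100))
print(required_columns(expr))  # {'amount'} — enough info for column pruning
```

A lambda could never answer this question; an expression tree answers it in a few recursive lines, which is what makes column pruning and filter pushdown possible.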
The Roadmap: What We'll Build
The course follows pyfloe's architecture from the bottom up. Each module builds a new layer, and each layer depends on the one before it. Here's the journey:
Expressions: how pf.col("x") + pf.col("y") builds
an inspectable tree instead of running a computation, via Col,
Lit, BinaryExpr, and WhenExpr.
The lazy API: the LazyFrame wrapper, and schema propagation
without data.
Between Modules 4 and 5, there's a hands-on Interlude where you wire everything together into a complete mini-pipeline and watch it run. After Module 5, there are optional deep dives into streaming I/O and window functions for those who want to keep going.
Honest Boundaries: What We Won't Cover
pyfloe has more features than we can teach in five modules, and that's fine. The goal of this course isn't exhaustive documentation — it's deep understanding of the architecture. Here's what we're choosing to skip:
Streaming I/O — pyfloe's file readers (io.py)
use factory-based generators to stream CSV, TSV, and JSONL files
with constant memory. We include a deep-dive appendix on this, but it's not
part of the main five modules.
Window functions — row_number(),
rank(), lag(), lead(), and more.
These follow the same expression pattern you'll learn in Module 2, but their
execution involves sort-partition-scan logic that deserves its own treatment.
We cover these in a bonus topic.
Accessor methods — pyfloe supports .str
(string methods like upper, contains,
replace) and .dt (datetime extraction like
year, month, truncate). These are
straightforward once you understand the expression AST — they're just new
node types.
Pivot, unpivot, and explode — reshape operations that are useful but orthogonal to the core architecture.
One Last Thing: How This Started
pyfloe wasn't some grand plan. I needed an engine for
Flowfile — a visual data tool I was
building — so I sat down and wrote one. Pure Python,
itertools, operator.itemgetter, a lot of trial and
error. Python is my language for everything — it's still my default choice
for anything I build — so naturally the engine was Python too. It worked.
Flowfile eventually moved to Polars, and the original engine went into a
drawer.
Years later, I looked back at that old code and saw something I hadn't appreciated at the time: there were some really fun patterns in there. Hash joins, expression trees, a little optimizer — all the same ideas that power the big libraries, just written in plain Python by someone who was figuring it out as he went along. So I cleaned it up, restructured it a bit, and turned it into something I could use for teaching — to myself first, and now to others. That's pyfloe.
This course is not going to teach you how to build pyfloe. You can clone
the repo or pip install the thing — it's right there. But if
you want to understand how something like this works well enough to build
your own version, shaped the way that makes sense to you,
this is the place to be.
I always think I can design things better — and once you understand how something works, you will too.
Source References
plan.py — Plan Nodes
FilterNode — filters rows using an expression predicate
ScanNode — reads data from a source
JoinNode — joins two plan trees together
core.py — LazyFrame API
LazyFrame — the public API that wraps plan nodes into a fluent interface
expr.py — Expression AST
BinaryExpr — two operands combined with a binary operator
Col — column reference expression
Lit — literal value expression
col — creates a column reference expression
lit — creates a literal value expression
when — conditional expression builder
stream.py — Streaming
Stream — streaming pipeline class