Where Do You Go from Here?
You came here to learn how libraries like Polars and PySpark work under the hood. You leave with something more useful: the vocabulary to reason about systems you haven't built yet.
What You Learned
Let's take stock. Over five modules, an interlude, and a deep dive, you traced the full architecture of a lazy dataframe engine — from first principles to a working optimizer. Not a surface tour. You went deep enough to understand why every piece exists and how they fit together.
Module 1
Generators, the Volcano Model, and the fundamental insight that laziness isn't avoidance — it's architecture.
Module 2
Dunder methods, operator overloading, and the expression AST that
turns col("a") + 1 from syntax into a tree you can inspect.
Module 3
Plan nodes, batched execution, schema propagation — the engine room where expressions meet data and tuples start flowing.
Module 4
Hash joins, hash aggregation, running accumulators. The algorithms
that make group_by and join work in O(n).
Interlude
Everything wired together. A complete pipeline, end to end — and the uncomfortable realization that it's doing too much work.
Module 5
The optimizer. Filter pushdown, column pruning, and the payoff for every design decision that came before it.
You now understand plan trees and expression trees — the two data structures that underpin every serious query system. You understand how they compose, how data flows through them, and how an optimizer rewrites them before a single row is read.
Questions We Left Unanswered
On purpose. These aren't gaps in the course — they're the frontier. Each one is a real engineering problem with real tradeoffs, and you can now reason about them because you understand the architecture they sit on.
Can you parallelize the engine?
Python's multiprocessing or concurrent.futures
could process batches across CPU cores. But the GIL, serialization
overhead, and batch coordination make this non-trivial. Polars uses
Rust's rayon for work-stealing parallelism — that's a fundamentally
different approach. What parts of the plan tree are embarrassingly
parallel, and what parts aren't?
Starting point: FilterNode and ProjectNode process rows independently — they're embarrassingly parallel. AggNode and SortNode need all their input. What if you partitioned data by hash key and processed each partition on a different core?
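To make that concrete, here is a minimal sketch of hash-partitioned aggregation: partition rows by key so each group lands wholly in one partition, aggregate partitions independently, then merge. Everything here — `partial_sum`, `parallel_group_sum`, the two-field row shape — is hypothetical illustration, not pyfloe API:

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

def partial_sum(rows):
    # Aggregate one partition: running sum of the value per key.
    acc = defaultdict(float)
    for key, value in rows:
        acc[key] += value
    return dict(acc)

def parallel_group_sum(rows, n_parts=4, par_map=map):
    # Hash-partition by key so every occurrence of a key lands in the
    # same partition; aggregate each partition independently; merge.
    # Pass ProcessPoolExecutor().map as par_map to spread partitions
    # across cores -- the default runs them sequentially.
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[0]) % n_parts].append(row)
    merged = {}
    for partial in par_map(partial_sum, parts):
        merged.update(partial)  # keys never collide across partitions
    return merged
```

Because the partitioning guarantees no key spans two partitions, the merge is a plain dict update — no second aggregation pass needed. That guarantee is exactly what makes group-by hash-partitionable while sort is not.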
Can you make it truly streaming?
Our engine materializes for sorts, joins, and group-by. A true streaming engine — like Flink or Kafka Streams — processes infinite, unbounded data one event at a time. What would have to change in the plan tree to support that? Which nodes survive, and which ones need entirely new semantics?
Starting point: the key constraint is unbounded aggregation. What if
group_by emitted partial results every N rows instead of
waiting for the stream to end? How would the downstream nodes handle
updates to previously emitted groups?
What if you rewrote the hot loops in Rust or C?
The volcano model and expression AST are architecture. They don't care what language evaluates each node. PyO3 lets you write Rust extensions callable from Python — you could keep the plan tree in Python and drop to Rust for the inner loops. Polars goes further: the entire engine is Rust, from the optimizer to the execution loops. Python is just a thin binding layer on top. But the concepts are the same ones you learned here.
Starting point: what if just Col.compile() and
BinaryExpr.compile() returned Rust-backed closures via
PyO3? The plan tree stays in Python; only the inner eval loop crosses
the language boundary. How much speedup would that give you?
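To see exactly what would cross the boundary, here is a pure-Python sketch of what such compiled closures might look like. `compile_col` and `compile_binary` are hypothetical stand-ins, not pyfloe's actual `compile()` implementations — the point is that each is a plain `row -> value` callable, which is the one interface a Rust replacement would have to honor:

```python
import operator

def compile_col(index):
    # A column reference compiles to a positional lookup.
    return lambda row: row[index]

def compile_binary(op, left_fn, right_fn):
    # A binary expression compiles to a closure over its children's
    # closures; `op` is a plain function such as operator.add.
    return lambda row: op(left_fn(row), right_fn(row))

# col("a") + 1, assuming "a" sits at index 0:
expr_fn = compile_binary(operator.add, compile_col(0), lambda row: 1)
```

If only these closures became Rust-backed callables, the AST, the plan tree, and the optimizer would never notice — they hold opaque callables either way.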
What about a cost-based optimizer?
Our optimizer uses simple rules: push filters down, prune columns. A cost-based optimizer would estimate the size of intermediate results and choose between a hash join and a sort-merge join based on which is cheaper. That requires statistics — row counts, cardinality estimates, histograms. The plan tree is the same; the decision logic is orders of magnitude more sophisticated.
Starting point: add a row_count() method to each plan
node and teach the optimizer to pick the smaller side as the hash-table
build side.
ScanNode
knows its size; FilterNode can estimate output rows from a
selectivity factor. That's your first cost-based rule.
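A toy version of that first rule, with deliberately simplified stand-ins for the real plan nodes — the classes below and the fixed 10% selectivity are assumptions for illustration, not pyfloe's implementation:

```python
class ScanNode:
    def __init__(self, rows):
        self.rows = rows
    def row_count(self):
        return len(self.rows)  # a scan over in-memory data knows its size

class FilterNode:
    SELECTIVITY = 0.1  # assumed fraction of rows passing the predicate
    def __init__(self, child):
        self.child = child
    def row_count(self):
        # Estimate, not truth: child's count scaled by selectivity.
        return int(self.child.row_count() * self.SELECTIVITY)

def choose_build_side(left, right):
    # First cost-based rule: build the hash table on the smaller input,
    # probe with the larger one.
    if left.row_count() <= right.row_count():
        return ("left", "right")
    return ("right", "left")
```

Notice that the estimates can be wrong — a real optimizer's selectivity comes from statistics, and being wrong here only costs memory, not correctness. That asymmetry is why cost-based optimizers can afford rough estimates.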
What about indexes?
Databases use B-trees and hash indexes to skip scanning entirely. Could
you add an index to ScanNode so that
filter(col("id") == 42) finds the row in O(log n) instead
of O(n)? What would change in the optimizer to exploit that?
Starting point: the optimizer already inspects filter predicates
via required_columns(). What if it also checked whether
the predicate is an equality check on an indexed column? It could
replace ScanNode → FilterNode with an
IndexLookupNode.
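The lookup half of that rewrite fits in a few lines — a hash index here rather than a B-tree, and both helpers are hypothetical sketches of what an `IndexLookupNode` would do internally:

```python
def build_index(rows, col_idx):
    # Hash index: column value -> list of row positions. Built once,
    # amortized over every equality lookup that follows.
    index = {}
    for pos, row in enumerate(rows):
        index.setdefault(row[col_idx], []).append(pos)
    return index

def index_lookup(rows, index, value):
    # O(1) average-case lookup instead of an O(n) scan-and-filter.
    return [rows[pos] for pos in index.get(value, [])]
```

The optimizer's job is then pattern matching: when the filter predicate is `col == constant` and that column has an index, swap the scan-plus-filter pair for a single lookup node.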
What about memory-mapped I/O and Arrow?
Apache Arrow defines a columnar memory format that eliminates
serialization. Polars, DuckDB, and DataFusion all use it. What if
ScanNode held Arrow arrays instead of Python tuples?
The plan tree stays the same. The storage layer beneath it transforms
completely.
Starting point: pip install pyarrow and try
converting a pyfloe result to an Arrow table. The plan tree and
expression AST are independent of the storage format — that's
the whole point of the architecture you just learned.
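Even before installing anything, you can see the core idea: Arrow stores columns, not rows. This dependency-free sketch pivots row tuples into a dict of columns — the shape that `pyarrow.table()` accepts directly — with `rows_to_columns` as a hypothetical helper, not a pyfloe function:

```python
def rows_to_columns(rows, names):
    # Pivot row tuples into name -> column-list. Passing the result to
    # pyarrow.table() would give you a real Arrow table; the pivot
    # itself is the row-oriented/columnar boundary.
    columns = {name: [] for name in names}
    for row in rows:
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns
```

The pivot is the entire storage-layer change: the plan tree and expression AST above it would keep emitting the same logical results either way.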
The Source: pyfloe
Every line of "real code" in this course came from pyfloe — a zero-dependency, lazy dataframe library written in pure Python. It's not a teaching toy. It's a working library with expression trees, a volcano engine, hash joins, window functions, streaming I/O, and a rule-based optimizer — all in roughly 6,000 lines of readable Python.
The codebase is open. You've already seen the most important parts. Now you can see the rest — the string methods, datetime accessors, pivot and unpivot, sorted merge joins, and everything else we acknowledged but chose not to cover. You have the vocabulary to read it all.
pyfloe Documentation
Full API reference, getting-started guide, and architectural
documentation. Start here if you want to use pyfloe.
pip install pyfloe — zero dependencies.
pyfloe on GitHub
The source code. Read it, fork it, break it, improve it. Every
module in this course maps to a specific file: expr.py,
plan.py, optimize.py, io.py.
You already know your way around.
A Final Thought
Libraries change. Polars will ship new features next month. Some new Rust-based engine will appear on Hacker News and everyone will argue about benchmarks. The architecture underneath them won't change — plan trees, expression ASTs, volcano iteration, hash partitioning, rule-based optimization. These aren't trends. They're decades old, and they'll outlast whatever's hot this year.
The next time you see a query plan or an optimizer trace — in Polars, in Spark, in DuckDB — you'll recognize every piece. That's worth more than any API.
Go build something.