Build Your Own DataFrame
Epilogue

Where Do You Go from Here?

You came here to learn how libraries like Polars and PySpark work under the hood. You leave with something more useful: the vocabulary to reason about systems you haven't built yet.

What You Learned

Let's take stock. Over five modules, an interlude, and a deep dive, you traced the full architecture of a lazy dataframe engine — from first principles to a working optimizer. Not a surface tour. You went deep enough to understand why every piece exists and how they fit together.

Module 1

Generators, the Volcano Model, and the fundamental insight that laziness isn't avoidance — it's architecture.

Module 2

Dunder methods, operator overloading, and the expression AST that turns col("a") + 1 from syntax into a tree you can inspect.

Module 3

Plan nodes, batched execution, schema propagation — the engine room where expressions meet data and tuples start flowing.

Module 4

Hash joins, hash aggregation, running accumulators. The algorithms that make group_by and join work in O(n).

Interlude

Everything wired together. A complete pipeline, end to end — and the uncomfortable realization that it's doing too much work.

Module 5

The optimizer. Filter pushdown, column pruning, and the payoff for every design decision that came before it.

You now understand plan trees and expression trees — the two data structures that underpin every serious query system. You understand how they compose, how data flows through them, and how an optimizer rewrites them before a single row is read.


Questions We Left Unanswered

On purpose. These aren't gaps in the course — they're the frontier. Each one is a real engineering problem with real tradeoffs, and you can now reason about them because you understand the architecture they sit on.

Can you parallelize the engine?

Python's multiprocessing or concurrent.futures could process batches across CPU cores. But the GIL, serialization overhead, and batch coordination make this non-trivial. Polars uses Rust's rayon for work-stealing parallelism — that's a fundamentally different approach. What parts of the plan tree are embarrassingly parallel, and what parts aren't?

Starting point: FilterNode and ProjectNode process rows independently — they're embarrassingly parallel. AggNode and SortNode need all their input. What if you partitioned data by hash key and processed each partition on a different core?

Can you make it truly streaming?

Our engine materializes for sorts, joins, and group-by. A true streaming engine — like Flink or Kafka Streams — processes infinite, unbounded data one event at a time. What would have to change in the plan tree to support that? Which nodes survive, and which ones need entirely new semantics?

Starting point: the key constraint is unbounded aggregation. What if group_by emitted partial results every N rows instead of waiting for the stream to end? How would the downstream nodes handle updates to previously emitted groups?
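One way to feel out that question is a generator that snapshots its running sums every N rows. The name `partial_sums_every` and the snapshot-as-dict shape are illustrative assumptions, not pyfloe or Flink semantics — the point is that each emission is an update to earlier groups, which is the contract change streaming forces on downstream nodes.

```python
# A sketch of windowed partial aggregation over (key, value) tuples.
# `partial_sums_every` is an illustrative name, not a pyfloe API.
from collections import defaultdict

def partial_sums_every(rows, n):
    """Yield a snapshot of running per-key sums after every n rows.

    Downstream nodes must treat each snapshot as an *update* to
    previously emitted groups, not a fresh result -- that's the
    semantic shift a streaming engine imposes on the plan tree.
    """
    acc = defaultdict(int)
    i = 0
    for i, (key, value) in enumerate(rows, start=1):
        acc[key] += value
        if i % n == 0:
            yield dict(acc)  # copy: the accumulator keeps mutating
    if acc and i % n != 0:
        yield dict(acc)  # flush the final, partially filled window

stream = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
for snapshot in partial_sums_every(stream, n=2):
    print(snapshot)
# {'a': 1, 'b': 2}
# {'a': 4, 'b': 6}
# {'a': 9, 'b': 6}
```

Notice that the accumulator never shrinks — with truly unbounded keys it grows forever, which is exactly why streaming engines add watermarks and state eviction.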

What if you rewrote the hot loops in Rust or C?

The volcano model and expression AST are architecture. They don't care what language evaluates each node. PyO3 lets you write Rust extensions callable from Python — you could keep the plan tree in Python and drop to Rust for the inner loops. Polars goes further: the entire engine is Rust, from the optimizer to the execution loops. Python is just a thin binding layer on top. But the concepts are the same ones you learned here.

Starting point: what if just Col.compile() and BinaryExpr.compile() returned Rust-backed closures via PyO3? The plan tree stays in Python; only the inner eval loop crosses the language boundary. How much speedup would that give you?
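To see exactly where that language boundary would sit, here is the compile-to-closure pattern in simplified pure Python. `Col`, `BinaryExpr`, and `Lit` are stand-ins, not pyfloe's real classes — but the closures they return are precisely the objects a PyO3 extension could swap for Rust-backed callables without touching the plan tree.

```python
# A pure-Python sketch of the compile-to-closure pattern. These classes are
# simplified stand-ins for the expression AST, not pyfloe's implementation.
import operator

class Col:
    def __init__(self, name):
        self.name = name

    def compile(self, schema):
        """Return a closure mapping a row tuple to one value."""
        idx = schema.index(self.name)  # resolve the column once, not per row
        return lambda row: row[idx]

class Lit:
    def __init__(self, value):
        self.value = value

    def compile(self, schema):
        return lambda row: self.value

class BinaryExpr:
    def __init__(self, left, op, right):
        self.left, self.op, self.right = left, op, right

    def compile(self, schema):
        lf = self.left.compile(schema)
        rf = self.right.compile(schema)
        op = self.op
        # This composed closure is the hot inner loop. Replacing it with a
        # Rust closure via PyO3 would leave everything above it unchanged.
        return lambda row: op(lf(row), rf(row))

# col("a") + 1, hand-assembled:
expr = BinaryExpr(Col("a"), operator.add, Lit(1))
fn = expr.compile(schema=["a", "b"])
print(fn((10, 20)))  # 11
```

The design choice worth noticing: `compile()` does all name resolution up front, so the per-row closure is nothing but indexing and arithmetic — the cheapest possible thing to hand off to native code.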

What about a cost-based optimizer?

Our optimizer uses simple rules: push filters down, prune columns. A cost-based optimizer would estimate the size of intermediate results and choose between a hash join and a sort-merge join based on which is cheaper. That requires statistics — row counts, cardinality estimates, histograms. The plan tree is the same; the decision logic is orders of magnitude more sophisticated.

Starting point: add a row_count() method to each plan node and teach the optimizer to pick the smaller side as the hash-table build side. ScanNode knows its size; FilterNode can estimate output rows from a selectivity factor. That's your first cost-based rule.
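That first rule fits in a screenful. The node classes below are simplified stand-ins, `row_count()` is the hypothetical method the paragraph proposes, and the 0.3 selectivity is a made-up placeholder for the statistics a real optimizer would collect.

```python
# A sketch of the first cost-based rule: build the hash table on the smaller
# join input. Node classes and `row_count` are illustrative, not pyfloe's.
class ScanNode:
    def __init__(self, rows):
        self.rows = rows

    def row_count(self):
        return len(self.rows)  # a scan knows its exact size

class FilterNode:
    # Assumed selectivity: with no statistics, guess that a fixed fraction
    # of input rows survive the predicate. Real optimizers use histograms.
    SELECTIVITY = 0.3

    def __init__(self, child):
        self.child = child

    def row_count(self):
        return int(self.child.row_count() * self.SELECTIVITY)

def choose_build_side(left, right):
    """Name the join input to build the hash table on."""
    return "left" if left.row_count() <= right.row_count() else "right"

big = ScanNode(rows=[(i,) for i in range(1000)])
small = FilterNode(ScanNode(rows=[(i,) for i in range(100)]))  # estimate: 30
print(choose_build_side(big, small))  # right
```

Estimates compose: a filter on top of a filter multiplies selectivities, which is how errors snowball in real cost models — a known hard problem, not a detail this sketch solves.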

What about indexes?

Databases use B-trees and hash indexes to skip scanning entirely. Could you add an index to ScanNode so that filter(col("id") == 42) finds the row in O(log n) instead of O(n)? What would change in the optimizer to exploit that?

Starting point: the optimizer already inspects filter predicates via required_columns(). What if it also checked whether the predicate is an equality check on an indexed column? It could replace ScanNode → FilterNode with an IndexLookupNode.
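The rewrite itself is a small pattern match on the plan tree. Everything below — the node classes, `EqPredicate`, `rewrite` — is an illustrative stand-in, not pyfloe's optimizer, but it has the same shape as the rule-based rewrites from Module 5.

```python
# A sketch of an index-aware rewrite rule. All names here are illustrative
# stand-ins, not pyfloe's real classes.
from dataclasses import dataclass, field

@dataclass
class ScanNode:
    rows: list
    indexed_columns: set = field(default_factory=set)  # e.g. {"id"}

@dataclass
class EqPredicate:          # stands in for col("id") == 42
    column: str
    value: object

@dataclass
class FilterNode:
    child: ScanNode
    predicate: EqPredicate

@dataclass
class IndexLookupNode:      # O(log n) or O(1) lookup instead of a full scan
    child: ScanNode
    column: str
    value: object

def rewrite(node):
    """Replace Filter(Scan) with IndexLookup when the predicate is an
    equality check on an indexed column; otherwise leave the node alone."""
    if (isinstance(node, FilterNode)
            and isinstance(node.child, ScanNode)
            and node.predicate.column in node.child.indexed_columns):
        return IndexLookupNode(node.child, node.predicate.column,
                               node.predicate.value)
    return node

scan = ScanNode(rows=[], indexed_columns={"id"})
plan = FilterNode(scan, EqPredicate("id", 42))
print(type(rewrite(plan)).__name__)  # IndexLookupNode
```

Note the guard conditions: the rule fires only on an exact Filter-over-Scan shape with an equality predicate on an indexed column. Range predicates, OR-ed conditions, and filters above other nodes all fall through untouched — which is the honest default for a rewrite rule.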

What about memory-mapped I/O and Arrow?

Apache Arrow defines a columnar memory format that eliminates serialization. Polars, DuckDB, and DataFusion all use it. What if ScanNode held Arrow arrays instead of Python tuples? The plan tree stays the same. The storage layer beneath it transforms completely.

Starting point: pip install pyarrow and try converting a pyfloe result to an Arrow table. The plan tree and expression AST are independent of the storage format — that's the whole point of the architecture you just learned.
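To see what "columnar instead of tuples" means concretely, here is the row-to-column pivot in pure Python. Arrow stores each column as one contiguous, typed buffer; the plain lists below are a stand-in for those buffers, and `rows_to_columns` is an illustrative helper, not pyfloe or pyarrow API.

```python
# A pure-Python sketch of the row-to-columnar transformation that handing
# ScanNode Arrow arrays would imply. Lists stand in for Arrow's typed buffers.
def rows_to_columns(schema, rows):
    """Pivot a list of row tuples into a dict of column lists."""
    columns = {name: [] for name in schema}
    for row in rows:
        for name, value in zip(schema, row):
            columns[name].append(value)
    return columns

rows = [(1, "x"), (2, "y"), (3, "z")]
cols = rows_to_columns(["id", "label"], rows)
print(cols)  # {'id': [1, 2, 3], 'label': ['x', 'y', 'z']}

# With pyarrow installed, the same dict becomes an Arrow table via
# pyarrow.table(cols) -- and the plan tree above it wouldn't change.
```

The payoff of the columnar shape is that a projection becomes a dict lookup and a filter touches only the columns its predicate names — nothing else is ever read.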

The Point

None of these questions are rhetorical. They're the actual problems that database engineers, query engine authors, and library maintainers work on every day. You can now engage with them — read the papers, understand the blog posts, follow the GitHub discussions — because you know what a plan tree is, what an expression AST is, how a volcano iterator pulls data, how a hash join partitions rows, and how an optimizer rewrites a plan before execution begins. That's the foundation. Everything else is iteration.

The Source: pyfloe

Every line of "real code" in this course came from pyfloe — a zero-dependency, lazy dataframe library written in pure Python. It's not a teaching toy. It's a working library with expression trees, a volcano engine, hash joins, window functions, streaming I/O, and a rule-based optimizer — all in roughly 6,000 lines of readable Python.

The codebase is open. You've already seen the most important parts. Now you can see the rest — the string methods, datetime accessors, pivot and unpivot, sorted merge joins, and everything else we acknowledged but chose not to cover. You have the vocabulary to read it all.

📖 pyfloe Documentation

Full API reference, getting-started guide, and architectural documentation. Start here if you want to use pyfloe. pip install pyfloe — zero dependencies.

🔧 pyfloe on GitHub

The source code. Read it, fork it, break it, improve it. Every module in this course maps to a specific file: expr.py, plan.py, optimize.py, io.py. You already know your way around.


A Final Thought

Libraries change. Polars will ship new features next month. Some new Rust-based engine will appear on Hacker News and everyone will argue about benchmarks. The architecture underneath them won't change — plan trees, expression ASTs, volcano iteration, hash partitioning, rule-based optimization. These aren't trends. They're decades old, and they'll outlast whatever's hot this year.

The next time you see a query plan or an optimizer trace — in Polars, in Spark, in DuckDB — you'll recognize every piece. That's worth more than any API.

Go build something.

Source References

plan.py — Plan Nodes

core.py — LazyFrame API

expr.py — Expression AST

◆   ◆   ◆
End of course