Skip to content

Getting Started

pyfloe is a zero-dependency, lazy dataframe library for Python. It provides an expression-based API, query optimization, window functions, datetime handling, and type safety — all in pure Python.

Installation

pip install pyfloe

Quick example

import pyfloe as pf

result = (
    pf.read_csv("orders.csv")                            # lazy — file not read yet
    .filter(pf.col("amount") > 100)                      # lazy — no rows scanned
    .with_column("rank", pf.row_number()
        .over(partition_by="region", order_by="amount"))
    .select("order_id", "region", "amount", "rank")
    .sort("region", "rank")
    .collect()                                            # NOW it runs
)

result.schema   # Schema was known before collect() — no data touched
result[0]       # {'order_id': 4, 'region': 'EU', 'amount': 180.0, 'rank': 1}

Core concepts

Everything is lazy

Operations build a query plan — a tree of nodes describing what to do, without doing it. Data flows only when you trigger evaluation:

import pyfloe as pf

pipeline = (
    pf.read_csv("big_file.csv")
    .filter(pf.col("status") == "active")
    .join(pf.read_csv("users.csv"), on="user_id")
    .with_column("score", pf.col("points") * 1.5)
    .select("user_id", "name", "score")
    .sort("score", ascending=False)
)

pipeline.is_materialized  # False
pipeline.schema           # Known instantly — no data touched
pipeline.explain()        # Print the plan tree

Materialization triggers (the plan runs when you ask for data):

Method What happens
.collect() Materialize and cache. Returns self.
.to_pylist() Returns List[dict]
.to_pydict() Returns Dict[str, List]
.to_csv(path) Streams to file — constant memory
.to_jsonl(path) Streams to file — constant memory
len(ff) Counts all rows
ff[0] Indexes into data

Safe in debuggers: repr() on an unmaterialized LazyFrame shows LazyFrame [? rows × 5 cols] (lazy) without triggering evaluation.

Schemas propagate without data

Every plan node knows its output schema. You get final types instantly:

import pyfloe as pf

pipeline = (
    orders
    .filter(pf.col("amount") > 100)
    .with_column("tax", pf.col("amount") * 0.2)
    .join(customers, on="customer_id")
    .rename({"amount": "subtotal"})
    .select("order_id", "subtotal", "tax")
)

pipeline.schema
# Schema(
#   order_id: int
#   subtotal: float
#   tax: float
# )

pipeline.is_materialized  # False

Who is it for?

  • Library authors who can't force users to install numpy/pyarrow
  • Serverless / Lambda where package size and cold-start matter
  • Embedded ETL — CLI tools, data pipelines, config processors
  • Education — a readable query engine you can study end-to-end
  • Type safety enthusiasts — catch column errors before runtime