Getting Started

pyfloe is a zero-dependency, lazy dataframe library for Python. It provides an expression-based API, query optimization, window functions, datetime handling, and type safety — all in pure Python.

Installation

pip install pyfloe

Quick example

import pyfloe as pf

result = (
    pf.read_csv("orders.csv")                            # lazy — file not read yet
    .filter(pf.col("amount") > 100)                      # lazy — no rows scanned
    .with_column("rank", pf.row_number()
        .over(partition_by="region", order_by="amount"))
    .select("order_id", "region", "amount", "rank")
    .sort("region", "rank")
    .collect()                                            # NOW it runs
)

result.schema   # Schema was known before collect() — no data touched
result[0]       # {'order_id': 4, 'region': 'EU', 'amount': 180.0, 'rank': 1}
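
To make the window expression concrete, the following plain-Python sketch shows what a row number partitioned by region and ordered by amount computes. It does not use pyfloe: the helper `row_number_over` and the sample rows are hypothetical, and ascending order is an assumption, since the example above does not state a sort direction.

```python
from itertools import groupby
from operator import itemgetter


def row_number_over(rows, partition_by, order_by, name="rank"):
    """Return new dicts with a 1-based row number per partition.

    Hypothetical helper for illustration, not pyfloe's implementation.
    """
    # Sorting by (partition key, order key) makes each partition contiguous.
    ordered = sorted(rows, key=itemgetter(partition_by, order_by))
    out = []
    for _, group in groupby(ordered, key=itemgetter(partition_by)):
        # Numbering restarts at 1 for every partition.
        for number, row in enumerate(group, start=1):
            out.append({**row, name: number})
    return out


# Made-up sample data
orders = [
    {"order_id": 1, "region": "US", "amount": 250.0},
    {"order_id": 2, "region": "EU", "amount": 320.0},
    {"order_id": 3, "region": "US", "amount": 120.0},
    {"order_id": 4, "region": "EU", "amount": 180.0},
]

ranked = row_number_over(orders, partition_by="region", order_by="amount")
for row in ranked:
    print(row)
```

The input rows are copied rather than mutated, so the original list stays usable for other computations.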

Core concepts

Everything is lazy

Operations build a query plan — a tree of nodes describing what to do, without doing it. Data flows only when you trigger evaluation:

import pyfloe as pf

pipeline = (
    pf.read_csv("big_file.csv")
    .filter(pf.col("status") == "active")
    .join(pf.read_csv("users.csv"), on="user_id")
    .with_column("score", pf.col("points") * 1.5)
    .select("user_id", "name", "score")
    .sort("score", ascending=False)
)

pipeline.is_materialized  # False
pipeline.schema           # Known instantly — no data touched
pipeline.explain()        # Print the plan tree
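
As a rough mental model of a query plan, the sketch below builds a two-node tree in plain Python and shows that no predicate runs until rows are pulled. The classes `Scan` and `Filter` and the function `explain` are hypothetical names chosen for this illustration; they are not pyfloe's internals.

```python
class Scan:
    """Leaf node: yields rows from an in-memory list (stand-in for a file)."""

    def __init__(self, data):
        self.data = data
        self.source = None
        self.label = "Scan"

    def rows(self):
        yield from self.data


class Filter:
    """Keeps rows for which predicate(row) is true, evaluated on demand."""

    def __init__(self, source, predicate, label):
        self.source = source
        self.predicate = predicate
        self.label = f"Filter[{label}]"

    def rows(self):
        return (row for row in self.source.rows() if self.predicate(row))


def explain(node, depth=0):
    """Render the plan tree as indented lines without touching any data."""
    lines = ["  " * depth + node.label]
    if node.source is not None:
        lines += explain(node.source, depth + 1)
    return lines


calls = []  # records every predicate evaluation


def is_active(row):
    calls.append(row["user_id"])
    return row["status"] == "active"


users = [
    {"user_id": 1, "status": "active"},
    {"user_id": 2, "status": "inactive"},
    {"user_id": 3, "status": "active"},
]

plan = Filter(Scan(users), is_active, "status == 'active'")
print("\n".join(explain(plan)))
calls_before = len(calls)      # building and explaining ran no predicate

result = list(plan.rows())     # iterating is what pulls rows through the plan
print(result)
```

Because each node only wraps its source in a generator, building the plan costs nothing per row, which is the property the pipeline above relies on.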

Materialization triggers: the plan runs only when you ask for data. In the table, `ff` stands for any LazyFrame.

Method	What happens
`.collect()`	Materialize and cache. Returns self.
`.to_pylist()`	Returns `List[dict]`
`.to_pydict()`	Returns `Dict[str, List]`
`.to_csv(path)`	Streams to file — constant memory
`.to_jsonl(path)`	Streams to file — constant memory
`len(ff)`	Counts all rows
`ff[0]`	Indexes into data

Safe in debuggers: repr() on an unmaterialized LazyFrame shows LazyFrame [? rows × 5 cols] (lazy) without triggering evaluation.
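
One way to get this behaviour is to have `repr()` read only a cache while `len()` and indexing fill it. The sketch below illustrates that pattern; `TinyLazyFrame` is a hypothetical stand-in written for this example, not pyfloe's class, and its repr text only imitates the format quoted above.

```python
class TinyLazyFrame:
    """Hypothetical stand-in for a lazy frame; not pyfloe's class."""

    def __init__(self, make_rows, columns):
        self._make_rows = make_rows   # zero-argument callable returning rows
        self._columns = list(columns)
        self._cache = None

    @property
    def is_materialized(self):
        return self._cache is not None

    def collect(self):
        # Run the plan once, keep the rows, and return self for chaining.
        if self._cache is None:
            self._cache = list(self._make_rows())
        return self

    def __len__(self):
        return len(self.collect()._cache)

    def __getitem__(self, index):
        return self.collect()._cache[index]

    def __repr__(self):
        # Reads only the cache, so printing in a debugger never runs the plan.
        n_rows = len(self._cache) if self.is_materialized else "?"
        state = "materialized" if self.is_materialized else "lazy"
        return f"TinyLazyFrame [{n_rows} rows × {len(self._columns)} cols] ({state})"


runs = []  # one entry per time the rows are actually produced


def produce():
    runs.append(1)
    return [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]


frame = TinyLazyFrame(produce, ["id", "name"])
before = repr(frame)           # does not call produce()
runs_after_repr = len(runs)
n = len(frame)                 # triggers evaluation and fills the cache
first = frame[0]               # served from the cache, no second run
after = repr(frame)
print(before, after, sep="\n")
```

Caching in `collect()` means repeated `len()` or indexing calls pay for evaluation only once.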

Schemas propagate without data

Every plan node knows its output schema. You get final types instantly:

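import pyfloe as pf

# Assumed inputs so the example runs on its own; the file names are placeholders.
orders = pf.read_csv("orders.csv")
customers = pf.read_csv("customers.csv")

pipeline = (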
    orders
    .filter(pf.col("amount") > 100)
    .with_column("tax", pf.col("amount") * 0.2)
    .join(customers, on="customer_id")
    .rename({"amount": "subtotal"})
    .select("order_id", "subtotal", "tax")
)

pipeline.schema
# Schema(
#   order_id: int
#   subtotal: float
#   tax: float
# )

pipeline.is_materialized  # False
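
The idea can be sketched without any data at all: each operation derives its output schema from its input schema, and a reference to a missing column fails while the plan is being built. The `Plan` class and `SchemaError` below are hypothetical, and the new column's type is passed in explicitly here, whereas a real engine would infer it from the expression.

```python
class SchemaError(Exception):
    """Raised at plan-build time when a referenced column does not exist."""


class Plan:
    """Hypothetical plan node that carries only its output schema."""

    def __init__(self, schema):
        self.schema = dict(schema)   # column name -> Python type

    def _require(self, columns):
        missing = [c for c in columns if c not in self.schema]
        if missing:
            raise SchemaError(f"unknown columns: {missing}")

    def with_column(self, name, dtype):
        # Assumption: the caller supplies the type instead of an expression.
        return Plan({**self.schema, name: dtype})

    def rename(self, mapping):
        self._require(mapping)
        return Plan({mapping.get(c, c): t for c, t in self.schema.items()})

    def select(self, *columns):
        self._require(columns)
        return Plan({c: self.schema[c] for c in columns})


orders = Plan({"order_id": int, "amount": float, "customer_id": int})
final = (
    orders
    .with_column("tax", float)
    .rename({"amount": "subtotal"})
    .select("order_id", "subtotal", "tax")
)
print(final.schema)

error = None
try:
    final.select("amount")       # "amount" was renamed, so this fails early
except SchemaError as exc:
    error = str(exc)
print(error)
```

No rows exist anywhere in this sketch, which is the point: both the final types and the column error are available before any data is read.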

Who is it for?

Library authors who can't force users to install numpy/pyarrow
Serverless / Lambda where package size and cold-start matter
Embedded ETL — CLI tools, data pipelines, config processors
Education — a readable query engine you can study end-to-end
Type safety enthusiasts — catch column errors before runtime