File I/O

Functions for reading and writing data files. All readers return a LazyFrame; data is not loaded until evaluation is triggered. Note that read_json loads the entire file into memory at once rather than streaming it.
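To make the idea of deferred loading concrete, here is a minimal stdlib-only sketch. It is not pyfloe's implementation: the `LazyRows` class and its `scans` counter are hypothetical, introduced only to show that building a plan and evaluating it are separate steps.

```python
import csv
import os
import tempfile


class LazyRows:
    """Hypothetical stand-in for a lazy frame: it stores a recipe, not rows."""

    def __init__(self, path, predicate=None):
        self.path = path
        self.predicate = predicate
        self.scans = 0  # how many times the file has actually been read

    def filter(self, predicate):
        # Building a new plan does not touch the file.
        return LazyRows(self.path, predicate)

    def to_pylist(self):
        # Evaluation happens here: open the file and apply the stored predicate.
        self.scans += 1
        with open(self.path, newline="", encoding="utf-8") as fh:
            rows = csv.DictReader(fh)
            return [r for r in rows if self.predicate is None or self.predicate(r)]


# Demo with a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("order_id,amount\n1,250\n2,90\n")

plan = LazyRows(tmp.name).filter(lambda r: float(r["amount"]) > 200)
scans_before = plan.scans  # still 0: nothing has been read yet
rows = plan.to_pylist()    # the file is read now
os.remove(tmp.name)
```

The sketch reads every value as a string; type inference is a separate concern handled by the individual readers.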

Readers

read_csv

read_csv(path: str, *, has_header: bool = True, columns: list[str] | None = None, encoding: str = 'utf-8', delimiter: str = ',', schema_sample_size: int = 100, skip_rows: int = 0, quotechar: str = '"', cast_types: bool = True) -> LazyFrame

Read a CSV file as a LazyFrame.

Types are inferred from a sample and datetime columns are auto-detected. Data is not loaded until you trigger evaluation.

Parameters:

path (str) – Path to the CSV file.
has_header (bool, default: True) – Whether the first row is a header.
columns (list[str] | None, default: None) – Override column names.
encoding (str, default: 'utf-8') – File encoding.
delimiter (str, default: ',') – Field delimiter character.
schema_sample_size (int, default: 100) – Number of rows to sample for type inference.
skip_rows (int, default: 0) – Number of initial rows to skip.
quotechar (str, default: '"') – Quote character for CSV fields.
cast_types (bool, default: True) – If True, cast string values to inferred types.

Returns:

LazyFrame – A LazyFrame. Data streams from disk when evaluated.

Examples:

>>> import pyfloe as pf
>>> lf = pf.read_csv("orders.csv")
>>> lf.filter(pf.col("amount") > 200).select("order_id", "amount").to_pylist()
[{'order_id': 1, 'amount': 250.0}, {'order_id': 6, 'amount': 310.0}]
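Sample-based type inference, as described for schema_sample_size and cast_types, can be pictured with a small stdlib-only sketch. This is not pyfloe's implementation: `infer_type` and `read_csv_sketch` are hypothetical helpers, and the candidate order (int, float, ISO datetime, then str) is an assumption.

```python
import csv
import io
from datetime import datetime

MISSING = ("", None)


def infer_type(values):
    """Return the first parser that accepts every non-missing sampled value."""
    present = [v for v in values if v not in MISSING]
    if not present:
        return str
    for parse in (int, float, datetime.fromisoformat):
        try:
            for v in present:
                parse(v)
        except ValueError:
            continue  # this candidate rejected a value, try the next one
        return parse
    return str


def read_csv_sketch(text, schema_sample_size=100, cast_types=True):
    """Parse CSV text, inferring column types from the first rows only."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not cast_types or not rows:
        return rows  # values stay as strings
    sample = rows[:schema_sample_size]
    schema = {name: infer_type([r[name] for r in sample]) for name in rows[0]}
    return [
        {k: None if v in MISSING else schema[k](v) for k, v in r.items()}
        for r in rows
    ]


sample_text = "order_id,amount,placed\n1,250.0,2024-01-05\n6,310.0,2024-02-11\n"
typed_rows = read_csv_sketch(sample_text)  # order_id: int, amount: float, placed: datetime
```

In this sketch a value beyond the sample that does not match the inferred type would raise ValueError during casting. How pyfloe handles that case is not stated above, so a larger schema_sample_size is the cautious choice for files whose early rows are not representative.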

read_tsv

read_tsv(path: str, **kwargs: Any) -> LazyFrame

Read a TSV (tab-separated) file as a LazyFrame.

Equivalent to read_csv(path, delimiter='\t', ...).

Parameters:

path (str) – Path to the TSV file.
**kwargs (Any, default: {}) – Additional arguments passed to the CSV reader.

Returns:

LazyFrame – A LazyFrame.

Examples:

>>> import pyfloe as pf
>>> lf = pf.read_tsv("students.tsv")
>>> lf.select("name", "score").head(2).to_pylist()
[{'name': 'Alice', 'score': 95}, {'name': 'Bob', 'score': 82}]
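The equivalence with read_csv suggests a thin wrapper that fixes the delimiter and forwards everything else. The sketch below illustrates that pattern with the standard csv module; `read_csv_rows` and `read_tsv_rows` are hypothetical stand-ins, not pyfloe functions, and they take text rather than a path to stay self-contained.

```python
import csv
import io


def read_csv_rows(text, *, delimiter=",", quotechar='"', skip_rows=0):
    """Stand-in CSV reader: returns a list of dicts with string values."""
    lines = io.StringIO(text, newline="")
    for _ in range(skip_rows):
        next(lines, None)  # discard leading lines before the header
    return list(csv.DictReader(lines, delimiter=delimiter, quotechar=quotechar))


def read_tsv_rows(text, **kwargs):
    """Forward to the CSV reader with the delimiter fixed to a tab."""
    return read_csv_rows(text, delimiter="\t", **kwargs)


tsv_text = "name\tscore\nAlice\t95\nBob\t82\n"
students = read_tsv_rows(tsv_text)
```

With a wrapper written this way, passing delimiter again through the keyword arguments raises TypeError because the argument is supplied twice. Whether pyfloe's read_tsv behaves the same way is not stated above.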

read_jsonl

read_jsonl(path: str, *, encoding: str = 'utf-8', schema_sample_size: int = 100, columns: list[str] | None = None) -> LazyFrame

Read a JSON Lines file as a LazyFrame.

Each line is parsed as a JSON object. Schema is inferred from a sample.

Parameters:

path (str) – Path to the JSONL file.
encoding (str, default: 'utf-8') – File encoding.
schema_sample_size (int, default: 100) – Number of lines to sample for schema inference.
columns (list[str] | None, default: None) – Subset of columns to read.

Returns:

LazyFrame – A LazyFrame.

Examples:

>>> import pyfloe as pf
>>> lf = pf.read_jsonl("events.jsonl")
>>> lf.filter(pf.col("event") == "click").select("event", "user_id").to_pylist()
[{'event': 'click', 'user_id': 1}, {'event': 'click', 'user_id': 1}]

Read only specific columns:

>>> lf = pf.read_jsonl("events.jsonl", columns=["event", "user_id"])
>>> lf.head(2).to_pylist()
[{'event': 'click', 'user_id': 1}, {'event': 'view', 'user_id': 2}]
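Line-by-line parsing and sampled schema inference can be sketched with the standard json module. The helpers `iter_jsonl` and `sample_fields` are hypothetical, and two behaviours are assumptions made for the sketch: blank lines are skipped, and a requested column missing from a record becomes None.

```python
import io
import json
from itertools import islice


def iter_jsonl(lines, columns=None):
    """Yield one dict per non-blank line, optionally keeping only `columns`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        record = json.loads(line)
        if columns is not None:
            record = {name: record.get(name) for name in columns}
        yield record


def sample_fields(lines, schema_sample_size=100):
    """Collect field names and Python type names from the first records only."""
    fields = {}
    for record in islice(iter_jsonl(lines), schema_sample_size):
        for name, value in record.items():
            fields.setdefault(name, type(value).__name__)
    return fields


jsonl_text = (
    '{"event": "click", "user_id": 1}\n'
    '{"event": "view", "user_id": 2, "extra": true}\n'
)
events = list(iter_jsonl(io.StringIO(jsonl_text), columns=["event", "user_id"]))
```

Because `iter_jsonl` is a generator, only the lines actually consumed are parsed. A field that first appears after the sampled lines, such as `extra` here with a sample size of 1, would be absent from the inferred schema.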

read_json

read_json(path: str, *, encoding: str = 'utf-8', columns: list[str] | None = None) -> LazyFrame

Read a JSON file containing a top-level array.

Unlike read_jsonl, this loads the entire file into memory at once.

Parameters:

path (str) – Path to the JSON file.
encoding (str, default: 'utf-8') – File encoding.
columns (list[str] | None, default: None) – Subset of columns to read.

Returns:

LazyFrame – A LazyFrame containing the parsed data.

Raises:

ValueError – If the JSON file does not contain a top-level array.

Examples:

>>> import pyfloe as pf
>>> lf = pf.read_json("cities.json")
>>> lf.to_pylist()
[{'city': 'NYC', 'pop': 8336817}, {'city': 'LA', 'pop': 3979576}, {'city': 'Chicago', 'pop': 2693976}]
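The top-level array requirement can be illustrated with the standard json module. `load_json_array` is a hypothetical helper, not part of pyfloe, and the error message wording is an assumption. It parses the whole document in one call, which is why memory use grows with file size.

```python
import json


def load_json_array(text, columns=None):
    """Parse a whole JSON document and insist that it is an array of objects."""
    data = json.loads(text)  # the entire document is parsed in one step
    if not isinstance(data, list):
        raise ValueError(
            f"expected a top-level JSON array, got {type(data).__name__}"
        )
    if columns is None:
        return data
    # Keep only the requested keys; missing keys become None (an assumption).
    return [{name: row.get(name) for name in columns} for row in data]


cities = load_json_array('[{"city": "NYC", "pop": 8336817}, {"city": "LA", "pop": 3979576}]')
```

For data too large to hold in memory, the line-oriented JSONL format avoids this single parse step.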

read_fixed_width

read_fixed_width(path: str, *, widths: list[int], columns: list[str] | None = None, has_header: bool = False, encoding: str = 'utf-8', schema_sample_size: int = 100, strip: bool = True, cast_types: bool = True) -> LazyFrame

Read a fixed-width file as a LazyFrame.

Parameters:

path (str) – Path to the file.
widths (list[int]) – List of column widths in characters.
columns (list[str] | None, default: None) – Override column names.
has_header (bool, default: False) – Whether the first row is a header.
encoding (str, default: 'utf-8') – File encoding.
schema_sample_size (int, default: 100) – Number of rows to sample for type inference.
strip (bool, default: True) – Whether to strip whitespace from field values.
cast_types (bool, default: True) – If True, cast string values to inferred types.

Returns:

LazyFrame – A LazyFrame.

Examples:

>>> import pyfloe as pf
>>> lf = pf.read_fixed_width("people.txt", widths=[10, 4, 14], has_header=True)
>>> lf.head(2).to_pylist()
[{'NAME': 'Alice', 'AGE': 30, 'CITY': 'New York'}, {'NAME': 'Bob', 'AGE': 25, 'CITY': 'Los Angeles'}]
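How widths maps onto a line can be shown with plain string slicing. `split_fixed_width` is a hypothetical helper written for illustration, not pyfloe's parser: each width is turned into a start and end offset, and strip controls whether padding is removed.

```python
from itertools import accumulate


def split_fixed_width(line, widths, strip=True):
    """Slice one line into fields using consecutive character widths."""
    ends = list(accumulate(widths))   # [10, 4, 14] -> [10, 14, 28]
    starts = [0] + ends[:-1]          # -> [0, 10, 14]
    fields = [line[start:end] for start, end in zip(starts, ends)]
    return [field.strip() for field in fields] if strip else fields


# Build a sample line padded to the widths used in the example above.
line = f"{'Alice':<10}{'30':<4}{'New York':<14}"
fields = split_fixed_width(line, [10, 4, 14])
```

The resulting fields are still strings; casting `'30'` to an integer would be a separate step, as the cast_types parameter suggests.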

read_parquet

read_parquet(path: str, *, columns: list[str] | None = None, batch_size: int = 10000) -> LazyFrame

Read a Parquet file as a LazyFrame (requires pyarrow).

Schema is read from the Parquet file metadata without loading data.

Parameters:

path (str) – Path to the Parquet file.
columns (list[str] | None, default: None) – Subset of columns to read.
batch_size (int, default: 10000) – Number of rows to read per batch.

Returns:

LazyFrame – A LazyFrame.

Raises:

ImportError – If pyarrow is not installed.

Examples:

>>> import pyfloe as pf
>>> lf = pf.read_parquet("data.parquet")
>>> lf.filter(pf.col("score") > 85).select("name", "score").to_pylist()
[{'name': 'Alice', 'score': 95.5}, {'name': 'Diana', 'score': 91.2}, {'name': 'Eve', 'score': 88.7}]
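Reading the schema from metadata and streaming rows in batches can both be done with pyarrow directly. The sketch below uses `pyarrow.parquet.read_schema` and `ParquetFile.iter_batches`; the wrappers `parquet_schema` and `iter_parquet_rows` are hypothetical, and whether pyfloe relies on these exact calls is an assumption.

```python
def _require_pyarrow():
    """Import pyarrow.parquet lazily so the dependency stays optional."""
    try:
        import pyarrow.parquet as pq
    except ImportError as exc:
        raise ImportError("pyarrow is required to read Parquet files") from exc
    return pq


def parquet_schema(path):
    """Return {column name: type string} from file metadata, without row data."""
    schema = _require_pyarrow().read_schema(path)
    return {name: str(dtype) for name, dtype in zip(schema.names, schema.types)}


def iter_parquet_rows(path, columns=None, batch_size=10_000):
    """Yield rows as dicts, reading at most `batch_size` rows at a time."""
    parquet_file = _require_pyarrow().ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=columns):
        yield from batch.to_pylist()
```

Since `iter_parquet_rows` is a generator, a missing pyarrow installation surfaces as ImportError when the first row is requested rather than when the function is called. A smaller batch_size lowers peak memory at the cost of more read calls.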