File I/O

Functions for reading and writing data files. All readers return a LazyFrame: data is not loaded until evaluation is triggered.


Readers

read_csv

read_csv(path: str, *, has_header: bool = True, columns: list[str] | None = None, encoding: str = 'utf-8', delimiter: str = ',', schema_sample_size: int = 100, skip_rows: int = 0, quotechar: str = '"', cast_types: bool = True) -> LazyFrame

Read a CSV file as a LazyFrame.

Types are inferred from a sample of rows (see schema_sample_size), and datetime columns are auto-detected. Data is not loaded until you trigger evaluation.

Parameters:

  • path (str) –

    Path to the CSV file.

  • has_header (bool, default: True ) –

    Whether the first row is a header.

  • columns (list[str] | None, default: None ) –

    Override column names.

  • encoding (str, default: 'utf-8' ) –

    File encoding.

  • delimiter (str, default: ',' ) –

    Field delimiter character.

  • schema_sample_size (int, default: 100 ) –

    Number of rows to sample for type inference.

  • skip_rows (int, default: 0 ) –

    Number of initial rows to skip.

  • quotechar (str, default: '"' ) –

    Quote character for CSV fields.

  • cast_types (bool, default: True ) –

    If True, cast string values to inferred types.

Returns:

  • LazyFrame

    A LazyFrame. Data streams from disk when evaluated.

Examples:

>>> import pyfloe as pf
>>> lf = pf.read_csv("orders.csv")
>>> lf.filter(pf.col("amount") > 200).select("order_id", "amount").to_pylist()
[{'order_id': 1, 'amount': 250.0}, {'order_id': 6, 'amount': 310.0}]
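As a rough illustration of what inferring types from a sample can look like, here is a minimal standard-library sketch. It is not pyfloe's implementation: the helper names infer_type and infer_schema, the order in which types are tried, and the use of ISO 8601 parsing for datetime detection are all assumptions made for this example.

```python
import csv
import io
from datetime import datetime
from itertools import islice


def infer_type(values):
    """Return the first of int, float, datetime, str that parses every non-empty value."""
    def all_parse(parse):
        try:
            for value in values:
                if value != "":
                    parse(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return int
    if all_parse(float):
        return float
    if all_parse(datetime.fromisoformat):  # assumption: ISO 8601 dates only
        return datetime
    return str


def infer_schema(lines, schema_sample_size=100, delimiter=","):
    """Infer {column: type} from at most `schema_sample_size` data rows."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    sample = list(islice(reader, schema_sample_size))
    return {
        name: infer_type([row[i] for row in sample])
        for i, name in enumerate(header)
    }


data = io.StringIO("order_id,amount,placed\n1,250.0,2024-01-05\n6,310.0,2024-02-11\n")
schema = infer_schema(data)
print(schema)
```

With any sampling approach, a value beyond the sampled rows that does not fit the inferred type can only be discovered later, which is one reason a larger schema_sample_size may be worth its cost on irregular files.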

read_tsv

read_tsv(path: str, **kwargs: Any) -> LazyFrame

Read a TSV (tab-separated) file as a LazyFrame.

Equivalent to read_csv(path, delimiter='\t', ...).

Parameters:

  • path (str) –

    Path to the TSV file.

  • **kwargs (Any, default: {} ) –

    Additional arguments passed to the CSV reader.

Returns:

  • LazyFrame

    A LazyFrame. Data streams from disk when evaluated.

Examples:

>>> import pyfloe as pf
>>> lf = pf.read_tsv("students.tsv")
>>> lf.select("name", "score").head(2).to_pylist()
[{'name': 'Alice', 'score': 95}, {'name': 'Bob', 'score': 82}]
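The equivalence with read_csv described above is the usual "thin wrapper" pattern: fix one keyword argument and forward the rest. The sketch below shows that pattern with the standard csv module. The functions read_delimited and read_tab_separated are hypothetical stand-ins, not pyfloe functions, and no type casting is performed, so values stay as strings.

```python
import csv
import io


def read_delimited(lines, *, delimiter=",", quotechar='"'):
    """Yield each row as a dict keyed by the header row."""
    yield from csv.DictReader(lines, delimiter=delimiter, quotechar=quotechar)


def read_tab_separated(lines, **kwargs):
    """Forward everything to read_delimited with the delimiter fixed to a tab."""
    return read_delimited(lines, delimiter="\t", **kwargs)


rows = list(read_tab_separated(io.StringIO("name\tscore\nAlice\t95\nBob\t82\n")))
print(rows)
```

In this sketch, passing delimiter to the wrapper raises a TypeError because the argument would be supplied twice. Whether pyfloe's read_tsv behaves the same way is not stated in this reference.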

read_jsonl

read_jsonl(path: str, *, encoding: str = 'utf-8', schema_sample_size: int = 100, columns: list[str] | None = None) -> LazyFrame

Read a JSON Lines file as a LazyFrame.

Each line is parsed as a JSON object. Schema is inferred from a sample.

Parameters:

  • path (str) –

    Path to the JSONL file.

  • encoding (str, default: 'utf-8' ) –

    File encoding.

  • schema_sample_size (int, default: 100 ) –

    Number of lines to sample for schema inference.

  • columns (list[str] | None, default: None ) –

    Subset of columns to read.

Returns:

  • LazyFrame

    A LazyFrame.

Examples:

>>> import pyfloe as pf
>>> lf = pf.read_jsonl("events.jsonl")
>>> lf.filter(pf.col("event") == "click").select("event", "user_id").to_pylist()
[{'event': 'click', 'user_id': 1}, {'event': 'click', 'user_id': 1}]

Read only specific columns:

>>> lf = pf.read_jsonl("events.jsonl", columns=["event", "user_id"])
>>> lf.head(2).to_pylist()
[{'event': 'click', 'user_id': 1}, {'event': 'view', 'user_id': 2}]
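Because each line of a JSON Lines file is an independent JSON object, the format lends itself to lazy, line-at-a-time reading. The following standard-library sketch illustrates the idea with a generator. The name iter_jsonl and the choice to fill missing columns with None are assumptions for this example, not documented pyfloe behavior.

```python
import io
import json


def iter_jsonl(lines, columns=None):
    """Lazily yield one dict per non-blank line, optionally keeping only `columns`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if columns is not None:
            # assumption: a column missing from a record becomes None
            record = {name: record.get(name) for name in columns}
        yield record


source = io.StringIO(
    '{"event": "click", "user_id": 1, "ts": 10}\n'
    '{"event": "view", "user_id": 2, "ts": 11}\n'
)
records = iter_jsonl(source, columns=["event", "user_id"])  # nothing parsed yet
first = next(records)  # parses only the first line
print(first)
```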

read_json

read_json(path: str, *, encoding: str = 'utf-8', columns: list[str] | None = None) -> LazyFrame

Read a JSON file containing a top-level array.

Unlike read_jsonl, this loads the entire file into memory at once.

Parameters:

  • path (str) –

    Path to the JSON file.

  • encoding (str, default: 'utf-8' ) –

    File encoding.

  • columns (list[str] | None, default: None ) –

    Subset of columns to read.

Returns:

  • LazyFrame

    A LazyFrame containing the parsed data.

Raises:

  • ValueError

    If the JSON file does not contain a top-level array.

Examples:

>>> import pyfloe as pf
>>> lf = pf.read_json("cities.json")
>>> lf.to_pylist()
[{'city': 'NYC', 'pop': 8336817}, {'city': 'LA', 'pop': 3979576}, {'city': 'Chicago', 'pop': 2693976}]
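The top-level array requirement and the ValueError described above can be pictured with a small standard-library sketch. The function load_json_array and its error message are hypothetical. The sketch only shows why a whole-document parse cannot stream the way a line-based format can, and where a non-array document would be rejected.

```python
import json


def load_json_array(text, columns=None):
    """Parse a whole JSON document and require a top-level array of objects."""
    data = json.loads(text)  # the entire document is parsed in one go
    if not isinstance(data, list):
        raise ValueError(
            f"expected a top-level JSON array, got {type(data).__name__}"
        )
    if columns is not None:
        data = [{name: row.get(name) for name in columns} for row in data]
    return data


rows = load_json_array(
    '[{"city": "NYC", "pop": 8336817}, {"city": "LA", "pop": 3979576}]'
)

try:
    load_json_array('{"city": "NYC"}')  # an object, not an array
except ValueError as exc:
    error = str(exc)
    print(error)
```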

read_fixed_width

read_fixed_width(path: str, *, widths: list[int], columns: list[str] | None = None, has_header: bool = False, encoding: str = 'utf-8', schema_sample_size: int = 100, strip: bool = True, cast_types: bool = True) -> LazyFrame

Read a fixed-width file as a LazyFrame.

Parameters:

  • path (str) –

    Path to the file.

  • widths (list[int]) –

    List of column widths in characters.

  • columns (list[str] | None, default: None ) –

    Override column names.

  • has_header (bool, default: False ) –

    Whether the first row is a header.

  • encoding (str, default: 'utf-8' ) –

    File encoding.

  • schema_sample_size (int, default: 100 ) –

    Number of rows to sample for type inference.

  • strip (bool, default: True ) –

    Whether to strip whitespace from field values.

  • cast_types (bool, default: True ) –

    If True, cast string values to inferred types.

Returns:

  • LazyFrame

    A LazyFrame.

Examples:

>>> import pyfloe as pf
>>> lf = pf.read_fixed_width("people.txt", widths=[10, 4, 14], has_header=True)
>>> lf.head(2).to_pylist()
[{'NAME': 'Alice', 'AGE': 30, 'CITY': 'New York'}, {'NAME': 'Bob', 'AGE': 25, 'CITY': 'Los Angeles'}]
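To make the role of widths and strip concrete, here is a minimal standard-library sketch of fixed-width parsing: each line is cut at the cumulative character offsets given by widths. The helpers split_fixed_width and iter_fixed_width and the positional fallback names (col_0, col_1, ...) are assumptions for illustration. No type casting is done here, so AGE stays a string.

```python
import io
from itertools import accumulate


def split_fixed_width(line, widths, strip=True):
    """Cut one line into fields at the character offsets implied by `widths`."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    fields = [line[start:end] for start, end in zip(starts, ends)]
    return [field.strip() for field in fields] if strip else fields


def iter_fixed_width(lines, widths, has_header=False, columns=None, strip=True):
    """Yield one dict per line; names come from `columns`, the header row, or positions."""
    lines = (line.rstrip("\n") for line in lines)
    names = columns
    if has_header:
        header = split_fixed_width(next(lines, ""), widths, strip=True)
        names = names or header
    # assumption: positional names when neither columns nor a header is given
    names = names or [f"col_{i}" for i in range(len(widths))]
    for line in lines:
        yield dict(zip(names, split_fixed_width(line, widths, strip)))


sample = io.StringIO(
    f"{'NAME':<10}{'AGE':<4}{'CITY':<14}\n"
    f"{'Alice':<10}{'30':<4}{'New York':<14}\n"
    f"{'Bob':<10}{'25':<4}{'Los Angeles':<14}\n"
)
rows = list(iter_fixed_width(sample, widths=[10, 4, 14], has_header=True))
print(rows)
```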

read_parquet

read_parquet(path: str, *, columns: list[str] | None = None, batch_size: int = 10000) -> LazyFrame

Read a Parquet file as a LazyFrame (requires pyarrow).

Schema is read from the Parquet file metadata without loading data.

Parameters:

  • path (str) –

    Path to the Parquet file.

  • columns (list[str] | None, default: None ) –

    Subset of columns to read.

  • batch_size (int, default: 10000 ) –

    Number of rows to read per batch.

Returns:

  • LazyFrame

    A LazyFrame.

Raises:

  • ImportError

    If pyarrow is not installed.

Examples:

>>> import pyfloe as pf
>>> lf = pf.read_parquet("data.parquet")
>>> lf.filter(pf.col("score") > 85).select("name", "score").to_pylist()
[{'name': 'Alice', 'score': 95.5}, {'name': 'Diana', 'score': 91.2}, {'name': 'Eve', 'score': 88.7}]
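For readers curious how batched Parquet reading and metadata-only schema access look in pyarrow itself, the sketch below uses pyarrow.parquet.ParquetFile directly. It is an illustration under assumptions, not pyfloe's code: the helper names iter_parquet_rows and parquet_column_names and the ImportError message are made up for this example.

```python
def iter_parquet_rows(path, columns=None, batch_size=10_000):
    """Yield rows as dicts, reading `batch_size` rows at a time."""
    try:
        import pyarrow.parquet as pq
    except ImportError as exc:
        raise ImportError("pyarrow is required to read Parquet files") from exc

    parquet_file = pq.ParquetFile(path)
    # iter_batches reads record batches one at a time instead of the whole file
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=columns):
        yield from batch.to_pylist()


def parquet_column_names(path):
    """Read column names from the file metadata without reading row data."""
    import pyarrow.parquet as pq

    return pq.ParquetFile(path).schema_arrow.names
```

Because iter_parquet_rows is a generator, the import check in this sketch runs on the first iteration rather than at call time. A reader that should fail immediately would perform the import before returning the generator.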