File I/O
Functions for reading and writing data files. All readers return a LazyFrame; data is not loaded until evaluation is triggered.
Readers
read_csv
read_csv(path: str, *, has_header: bool = True, columns: list[str] | None = None, encoding: str = 'utf-8', delimiter: str = ',', schema_sample_size: int = 100, skip_rows: int = 0, quotechar: str = '"', cast_types: bool = True) -> LazyFrame
Read a CSV file as a LazyFrame.
Column types are inferred from a sample of schema_sample_size rows, and datetime columns are detected automatically. Data is not loaded until you trigger evaluation.
Parameters:
- path (str) – Path to the CSV file.
- has_header (bool, default: True) – Whether the first row is a header.
- columns (list[str] | None, default: None) – Override column names.
- encoding (str, default: 'utf-8') – File encoding.
- delimiter (str, default: ',') – Field delimiter character.
- schema_sample_size (int, default: 100) – Number of rows to sample for type inference.
- skip_rows (int, default: 0) – Number of initial rows to skip.
- quotechar (str, default: '"') – Quote character for CSV fields.
- cast_types (bool, default: True) – If True, cast string values to inferred types.
Returns:
- LazyFrame – A LazyFrame. Data streams from disk when evaluated.
Examples:
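The sample-based inference and deferred loading described above can be illustrated with the standard library alone. This is a minimal sketch, not pyfloe's implementation: the helper names infer_caster and lazy_csv_rows are hypothetical, the input is assumed to have a header row and rectangular data, and only int, float and str are considered (datetime detection is left out).

```python
import csv
from itertools import chain, islice


def infer_caster(values):
    """Return int, float, or str: the first type that parses every sampled value."""
    for caster in (int, float):
        try:
            for value in values:
                caster(value)
            return caster
        except ValueError:
            continue
    return str


def lazy_csv_rows(lines, schema_sample_size=100, delimiter=","):
    """Yield typed row dicts; as a generator, it reads nothing until iterated."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield
    # Buffer only the sample rows to decide one caster per column.
    sample = list(islice(reader, schema_sample_size))
    casters = [infer_caster([row[i] for row in sample]) for i in range(len(header))]
    # Replay the buffered sample, then continue streaming the remaining rows.
    for row in chain(sample, reader):
        yield {name: cast(value) for name, cast, value in zip(header, casters, row)}


rows = lazy_csv_rows(["id,price,name", "1,2.5,a", "2,3,b"])
first_row = next(rows)  # parsing starts only here
```

In this sketch, a value outside the sample that does not fit the inferred type raises ValueError during iteration; how pyfloe handles that case is not stated here. A larger schema_sample_size reduces the chance of such a mismatch at the cost of buffering more rows up front.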
read_tsv
Read a TSV (tab-separated) file as a LazyFrame.
Equivalent to read_csv(path, delimiter='\t', ...).
Parameters:
- path (str) – Path to the TSV file.
- **kwargs (Any, default: {}) – Additional arguments passed to the CSV reader.
Returns:
- LazyFrame – A LazyFrame.
Examples:
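The "equivalent to read_csv with a tab delimiter" relationship is a plain keyword-forwarding wrapper. The sketch below shows that pattern with hypothetical stand-ins (read_delimited, read_tab_separated) built on the standard csv module; it is an illustration of the pattern, not pyfloe's code, and it returns a list of dicts rather than a LazyFrame.

```python
import csv


def read_delimited(lines, *, delimiter=",", has_header=True):
    """Parse delimited text lines into a list of dicts (eager, for brevity)."""
    rows = list(csv.reader(lines, delimiter=delimiter))
    if not rows:
        return []
    if has_header:
        header, data = rows[0], rows[1:]
    else:
        # No header row: generate placeholder column names.
        header = [f"column_{i}" for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(header, row)) for row in data]


def read_tab_separated(lines, **kwargs):
    """Fix the delimiter to a tab and forward everything else unchanged."""
    return read_delimited(lines, delimiter="\t", **kwargs)


records = read_tab_separated(["a\tb", "1\t2"])
```

With this wrapper shape, passing delimiter again through **kwargs raises a TypeError in Python, because the keyword would be supplied twice. Whether pyfloe behaves the same way is not stated in this reference.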
read_jsonl
read_jsonl(path: str, *, encoding: str = 'utf-8', schema_sample_size: int = 100, columns: list[str] | None = None) -> LazyFrame
Read a JSON Lines file as a LazyFrame.
Each line is parsed as a JSON object. Schema is inferred from a sample.
Parameters:
- path (str) – Path to the JSONL file.
- encoding (str, default: 'utf-8') – File encoding.
- schema_sample_size (int, default: 100) – Number of lines to sample for schema inference.
- columns (list[str] | None, default: None) – Subset of columns to read.
Returns:
- LazyFrame – A LazyFrame.
Examples:
>>> import pyfloe as pf
>>> lf = pf.read_jsonl("events.jsonl")
>>> lf.filter(pf.col("event") == "click").select("event", "user_id").to_pylist()
[{'event': 'click', 'user_id': 1}, {'event': 'click', 'user_id': 1}]
Read only specific columns:
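As a stdlib-only illustration of the idea (not pyfloe's implementation), a line-by-line reader can keep just the requested keys from each JSON object. The helper name iter_jsonl is hypothetical, and filling a missing key with None is an assumption of this sketch.

```python
import json


def iter_jsonl(lines, columns=None):
    """Yield one dict per non-empty line, optionally keeping only `columns`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if columns is not None:
            # Assumption of this sketch: keys absent from a record become None.
            record = {name: record.get(name) for name in columns}
        yield record


lines = [
    '{"event": "click", "user_id": 1, "ts": 10}',
    '{"event": "view", "user_id": 2}',
]
subset = list(iter_jsonl(lines, columns=["event", "user_id"]))
```

Because iter_jsonl is a generator, each line is parsed only when the consumer asks for the next record, which mirrors the deferred evaluation described for the readers on this page.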
read_json
Read a JSON file containing a top-level array.
Unlike read_jsonl, this loads the entire file into memory at once.
Parameters:
- path (str) – Path to the JSON file.
- encoding (str, default: 'utf-8') – File encoding.
- columns (list[str] | None, default: None) – Subset of columns to read.
Returns:
- LazyFrame – A LazyFrame containing the parsed data.
Raises:
- ValueError – If the JSON file does not contain a top-level array.
Examples:
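The top-level-array requirement and the resulting ValueError can be sketched with the standard json module. The function load_json_array is a hypothetical stand-in, and the error message wording is an assumption, not pyfloe's actual message.

```python
import json


def load_json_array(text):
    """Parse a whole JSON document and require that its top level is an array."""
    data = json.loads(text)  # the full document is parsed in one step
    if not isinstance(data, list):
        raise ValueError(
            f"expected a top-level JSON array, got {type(data).__name__}"
        )
    return data


records = load_json_array('[{"id": 1}, {"id": 2}]')
```

Parsing the whole document at once is what separates this from the line-by-line JSONL case: a JSON array has to be read to its closing bracket before it is known to be valid.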
read_fixed_width
read_fixed_width(path: str, *, widths: list[int], columns: list[str] | None = None, has_header: bool = False, encoding: str = 'utf-8', schema_sample_size: int = 100, strip: bool = True, cast_types: bool = True) -> LazyFrame
Read a fixed-width file as a LazyFrame.
Parameters:
- path (str) – Path to the file.
- widths (list[int]) – List of column widths in characters.
- columns (list[str] | None, default: None) – Override column names.
- has_header (bool, default: False) – Whether the first row is a header.
- encoding (str, default: 'utf-8') – File encoding.
- schema_sample_size (int, default: 100) – Number of rows to sample for type inference.
- strip (bool, default: True) – Whether to strip whitespace from field values.
- cast_types (bool, default: True) – If True, cast string values to inferred types.
Returns:
- LazyFrame – A LazyFrame.
Examples:
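How widths and strip interact is easiest to see on a single line. The sketch below turns a list of column widths into slice boundaries; split_fixed_width is a hypothetical helper written for illustration and is not part of pyfloe.

```python
from itertools import accumulate


def split_fixed_width(line, widths, strip=True):
    """Cut `line` into consecutive fields whose lengths are given by `widths`."""
    ends = list(accumulate(widths))  # e.g. [10, 2, 3] -> [10, 12, 15]
    starts = [0] + ends[:-1]         # each field starts where the previous one ended
    fields = [line[start:end] for start, end in zip(starts, ends)]
    return [field.strip() for field in fields] if strip else fields


# Build a sample record: a 10-character name, a 2-character age, a 3-character city.
line = "Alice".ljust(10) + "30" + "NYC"
fields = split_fixed_width(line, [10, 2, 3])
```

With strip=False the padding is kept, so the first field here would stay ten characters long. Characters beyond the sum of the widths are ignored by this sketch.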
read_parquet
Read a Parquet file as a LazyFrame (requires pyarrow).
Schema is read from the Parquet file metadata without loading data.
Parameters:
- path (str) – Path to the Parquet file.
- columns (list[str] | None, default: None) – Subset of columns to read.
- batch_size (int, default: 10000) – Number of rows to read per batch.
Returns:
- LazyFrame – A LazyFrame.
Raises:
- ImportError – If pyarrow is not installed.
Examples:
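The ImportError listed above is the usual outcome of an optional-dependency guard: the heavy package is imported only when the feature is used. A minimal sketch of that pattern follows; require_optional is a hypothetical helper, and the message text is an assumption rather than pyfloe's wording.

```python
import importlib


def require_optional(module_name, feature):
    """Import an optional dependency on demand, with a clearer error if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a message that names the feature needing the package.
        raise ImportError(
            f"{feature} requires the optional dependency '{module_name}'"
        ) from exc


# The json module stands in for an installed optional package.
json_module = require_optional("json", "demo feature")
```

A Parquet reader built this way would call the guard with "pyarrow" at the start of the function, so users who never read Parquet files do not need the package installed.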