Transform Nodes

Transform nodes modify and shape your data. These nodes handle everything from basic operations like filtering and sorting to more complex transformations like custom formulas and text manipulation.

Some transform nodes are not in Flowfile Lite

The browser-only Flowfile Lite build includes Select, Filter, Sort, Take Sample, Drop Duplicates, and Polars Code. Add Record ID, Formula, Text to Rows, and Python Script are not available — use the Polars Code node for Formula-style column logic.

Node Details

Add Record ID

The Add Record ID transformation generates a unique identifier for each record in your dataset. You can create a simple sequential ID or generate grouped IDs based on one or more columns.

Usage:

Add the Add Record ID node to your flow.
Configure the settings:
Define the output column name.
Set an optional offset for ID numbering.
(Optional) Enable grouping and specify grouping columns.
Apply the transformation.

Configuration Options

Parameter	Description
Output Column Name	Name of the new column where the record ID will be stored. Default is `"record_id"`.
Offset	Starting value for the record ID. Default is `1`.
Group By	If `true`, record IDs are assigned within groups instead of sequentially across all records. Default is `false`.
Group By Columns	List of columns to group by when assigning record IDs. Only applies when Group By is enabled.

Behavior

Sequential Record ID (Default)
A new column is added with a simple incremental ID starting from the defined offset.
Grouped Record ID
When grouping is enabled, the record ID resets within each group based on the specified columns.

This transformation helps in creating unique keys, tracking row order, or structuring data for downstream processing.

Formula

The Formula node creates a new column — or replaces an existing one — by evaluating a formula for every row. Formulas are written in the Flowfile formula language: reference columns as [column], call any of the 95 built-in functions, and use if ... then ... else ... endif for conditional logic.

Try formulas in your browser

The interactive formula playground lets you experiment with the full language against sample data — nothing to install.

Key Features

Create new columns or replace existing ones
Math, string, date/time, and type-conversion functions
Conditional logic with if ... then ... elseif ... else ... endif
Autocomplete for column names and functions, with inline function documentation, in the formula editor
Compiles to a native Polars expression — no row-by-row Python overhead

Usage

Drag the Formula node onto your canvas.
Connect input data.
Set the output column name.
Write your formula, e.g. round([price] * (1 - [discount]), 2).
Preview the results and optionally set a data type.

Configuration Options

Parameter	Description
Column Name	The name of the new or modified column.
Formula	The formula used to compute values for the column.
Data Type	The data type of the resulting column. Defaults to `Auto`, which infers the type from the formula; set it explicitly to force a cast.

Behavior

If the column name is new, the column is added to the dataset.
If the column name already exists, its values are replaced with the formula result.

This transformation is useful for feature engineering, data cleaning, and enriching datasets with computed values.

Select Data

The Select Data node allows you to choose which columns to keep, rename, and reorder. This transformation is useful for refining datasets, dropping unnecessary fields, and ensuring a structured column layout.

Key Features

Select specific columns to retain in the dataset
Reorder columns by dragging them into the desired order or by ordering them alphabetically
Rename columns to provide meaningful names
Automatically remove missing fields

Configuration Options

Parameter	Description
Column Selection	Choose which columns to keep in the dataset.
Reordering	Drag and drop to change the column order.
Rename Column	Assign a new name to any selected column.
Keep Missing Fields	If enabled, columns that are missing from input data are retained in the selection list.

Behavior

If a selected column is missing from the input, it is marked as unavailable.
Columns can be renamed without affecting their original data.
Changing the order affects how the columns appear in downstream processing.

This transformation ensures that datasets are structured efficiently before further analysis or processing.

Filter Data

The Filter Data node keeps only rows that match a specified condition.

Key Features

Basic mode: pick a column, an operator, and a value — no syntax needed
Advanced mode: write a formula that evaluates to true or false; only true rows remain
Split mode: route matching rows to one output and non-matching rows to a second output, instead of dropping them
Support for string, numeric, and date filtering

Usage

Drag the Filter Data node onto your canvas.
Connect input data.
Choose basic mode and fill in the condition, or advanced mode and enter a filter formula (e.g., [City] = 'Amsterdam').
(Optional) Enable split mode to keep the non-matching rows on a second output.

Example Filters (advanced mode)

Expression	Description
`[City] = 'Amsterdam'`	Keep rows where `City` is "Amsterdam".
`[Age] > 30`	Keep rows where `Age` is greater than 30.
`[Country] = 'USA' and [Sales] > 100`	Keep rows where `Country` is "USA" and `Sales` is over 100.
`is_not_empty([email])`	Keep rows that have an email address.

Advanced filters use the Flowfile formula language — any formula that returns true/false works, including functions like contains(), between(), or is_empty().

Sort Data

The Sort Data node orders your data based on one or more columns.

Key Features

Sort by multiple columns
Choose ascending or descending order

Usage

Select columns to sort by.
Choose sort direction (ascending or descending) for each column.

Configuration Options

Parameter	Description
Sort Columns	Columns used to sort the dataset.
Sort Order	Set ascending (`asc`) or descending (`desc`).

This node ensures structured and ordered data for better analysis.

Take Sample

The Take Sample node lets you work with a subset of your data.

Drop Duplicates

The Drop Duplicates node removes duplicate rows based on selected columns. Only the first occurrence is kept by default.

Key Features

Remove duplicate rows
Select columns to check for duplicates

Usage

Select columns to check for duplicates.
Choose whether to keep the first or last occurrence.

Configuration Options

Parameter	Description
Columns	Columns used to check for duplicates.

This node ensures a clean dataset by eliminating redundant rows.

Text to Rows

The Text to Rows node splits text from a selected column into multiple rows based on a delimiter.

Key Features

Split a column into multiple rows
Use a fixed delimiter (e.g., ,, ;, |)
Split using values from another column

Usage

Select the column to split.
Choose a delimiter or use another column for splitting.
(Optional) Set an output column name.

Configuration Options

Parameter	Description
Column to Split	The column containing text to be split.
Output Column Name	Name of the new column after splitting (defaults to the original column).
Split by Fixed Value	If `true`, use a fixed delimiter (default: `,`).
Delimiter	The character used to split text (e.g., `,`, `\|`, `;`).
Split by Column	Instead of a fixed delimiter, use values from another column.

This transformation helps normalize datasets by converting text lists into structured rows.

Window Functions

The Window Functions node adds rolling, cumulative, rank, or tile columns calculated over ordered — and optionally partitioned — rows. Each configured function produces one new column using Polars over(...) semantics, without collapsing rows the way a Group By does.

Key Features

Partition calculations so they restart per group (like .over(...))
Order rows within each partition
Rolling, cumulative, ranking, and tile functions
Add multiple functions in one node — each writes its own output column

Configuration Options

Parameter	Description
Partition by	(Optional) Columns that reset each calculation per group. Leave empty to compute over the whole table.
Order by	Column(s) and direction (ascending/descending) defining row order within each partition. Required for rolling and tile functions.
Window functions	One or more operations. Each takes a function, a source column, an output column name, and any function-specific parameters.

Available Functions

Function	Group	Parameters	Output
Rolling sum / mean / min / max / std	Rolling	Window size (rows) and how to handle incomplete windows	Aggregate over a sliding window of rows
Cumulative sum / count / min / max	Cumulative	—	Running total / count / min / max up to each row
Rank	Ranking	Tie-breaking method: `ordinal`, `dense`, `min`, `max`, or `average`	Rank of each row
Tile	Ranking	Number of groups	Splits the ordered rows into N equal-sized groups

For rolling functions, incomplete windows (the first rows, before the window is full) can be left empty (null, the default), computed from the partial window, or filled with 0.

Behavior

Rolling and tile functions require at least one Order by column; cumulative functions don't require ordering but usually want it.
Each window function must have a unique output column name.
Existing columns are preserved — every function adds a new column.

This node is useful for time-series features such as moving averages, running totals, ranking within groups, and bucketing rows into quantiles.

Polars Code

The Polars Code node allows you to write custom Polars DataFrame transformations directly in your workflow.

Key Features

Write custom Polars expressions
Apply advanced transformations not covered by standard nodes
Filter, aggregate, or modify data using Polars API

Usage

Write a single-line or multi-line Polars expression.
Use input_df as the DataFrame reference.
Assign results to output_df for multi-line operations.

Example Code

Single-line transformation

input_df.filter(pl.col('Age') > 30)

Multi-line transformation

result = input_df.select(['Name', 'City'])
filtered = result.filter(pl.col('City') == 'Amsterdam')
output_df = filtered.with_columns(pl.col('Name').alias('Customer_Name')) # this will be the output of the node

SQL Query

The SQL Query node runs a SQL SELECT query across one or more connected inputs using the Polars SQL dialect. Use it to filter, aggregate, join, and reshape data with familiar SQL.

Key Features

Query connected inputs as tables named input_1, input_2, … (in connection order)
Join, filter, group, and compute across up to 10 inputs
Chain SQL Query nodes together like any other transform
Read-only by design — only SELECT / WITH queries are allowed

Usage

Connect one or more inputs to the SQL Query node.
Write a query that references each input as input_1, input_2, etc.
Run the flow to materialize the query result.

Configuration Options

Parameter	Description
SQL Query	The `SELECT` statement to execute. Connected inputs are available as `input_1`, `input_2`, …

Example Queries

Filter and select

SELECT name, city FROM input_1 WHERE city = 'Amsterdam'

Aggregate

SELECT city, COUNT(*) AS cnt FROM input_1 GROUP BY city

Join two inputs

SELECT c.name, SUM(o.amount) AS total
FROM input_1 c
JOIN input_2 o ON c.id = o.customer_id
GROUP BY c.name

Behavior

Queries must start with SELECT or WITH (for common table expressions). Statements that modify data or schema — such as INSERT, UPDATE, DELETE, DROP, CREATE, ALTER, or TRUNCATE — are rejected.
Invalid or unsafe SQL surfaces as a node error before the flow runs.

Querying catalog tables

To run SQL across registered catalog tables instead of connected nodes, use the Catalog Reader's SQL mode.

Python Script

The Python Script node executes custom Python code in an isolated Docker kernel container. It supports multiple named inputs and outputs, persistent namespaces, and the full flowfile API.

Key Features:

Jupyter-style notebook editor with multiple code cells
Named inputs and outputs for connecting to multiple data sources
Access to the flowfile API for data I/O, artifacts, display, and logging
Persistent variables across executions within the same flow

Settings:

Parameter	Description
Kernel	Select a running kernel from the dropdown
Code	Python code written in the notebook editor
Output Names	Configure named outputs (default: `main`)

For full documentation, see Kernel Execution.

← Read data | Next: Combine data →