# Machine Learning Nodes
The ML nodes let you split a dataset, fit a model, score new rows with it, and read off quality metrics — all from the visual canvas. Compute runs in Polars (via `polars-ds`), so you do not need scikit-learn or any extra Python environment.
## The pipeline shape
A typical ML flow chains five nodes end-to-end:
```mermaid
flowchart LR
    Read[Read] --> Split[Random Split]
    Split -- train --> Train[Train Model]
    Split -- test --> Wait[Wait For]
    Train --> Wait
    Wait --> Apply[Apply Model]
    Apply --> Eval[Evaluate Model]
```
- **Random Split** carves the input into a `train` and `test` slice.
- **Train Model** fits an algorithm on the `train` slice and writes the model artifact to a flow-scoped cache.
- **Wait For** is a one-line synchronisation node: it lets the `test` slice flow through unchanged but blocks until the trainer has finished writing.
- **Apply Model** reads the model the upstream Train Model wrote and adds a `prediction` column to the `test` slice.
- **Evaluate Model** compares the actual and predicted columns and emits a `(metric, value)` table.
> **Skipping Wait For:** You can skip Wait For if Apply Model is reading from the catalog instead of an upstream node. The in-flow path shown above is the simplest train→apply chain and is what the starter templates use.
## Random Split
Randomly partitions rows into named output handles. The default is two outputs, `train` (80%) and `test` (20%), but you can configure up to ten splits with arbitrary names.
### Configuration
| Parameter | Description |
|---|---|
| Splits | List of name + percentage. Names must start with a letter and percentages must sum to 100. |
| Seed | Optional integer seed for reproducible splits. Leave empty for non-deterministic. |
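The two configuration rules can be sketched in a few lines of plain Python (illustrative only; the node performs the equivalent checks in its config drawer):

```python
def validate_splits(splits):
    """Check the two Random Split rules: every name starts with a letter,
    and the percentages sum to exactly 100."""
    for name, _ in splits:
        if not (name and name[0].isalpha()):
            raise ValueError(f"split name must start with a letter: {name!r}")
    if sum(pct for _, pct in splits) != 100:
        raise ValueError("split percentages must sum to 100")
    return True

validate_splits([("train", 80), ("test", 20)])  # passes
```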
### Behaviour
- One output handle per split, in the order you defined them.
- Each row lands in exactly one split. The split is purely random — there is no stratification by target class.
- Setting a seed (e.g. `42`) makes the split deterministic across runs, which is what you want for reproducible model evaluation.
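The seeded behaviour can be pictured with a plain-Python sketch (an illustration of the semantics, not the engine's Polars implementation):

```python
import random

def random_split(rows, splits, seed=None):
    """Assign each row independently to one named split, weighted by percentage."""
    rng = random.Random(seed)  # a fixed seed makes the assignment repeatable
    names = [name for name, _ in splits]
    weights = [pct for _, pct in splits]
    out = {name: [] for name in names}
    for row in rows:
        # each row lands in exactly one split, chosen at random
        out[rng.choices(names, weights=weights)[0]].append(row)
    return out

parts = random_split(range(1000), [("train", 80), ("test", 20)], seed=42)
```

Running this twice with the same seed yields identical partitions; with `seed=None` the assignment changes on every run.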
## Train Model
Fits a model on the input rows and stores the artifact for downstream use. The data passes through unchanged on the output, so you can chain other nodes off the same Train Model output if you want.
### Configuration
| Parameter | Description |
|---|---|
| Target column | Column you are trying to predict. |
| Feature columns | One or more numeric columns used as predictors. Must not include the target. |
| Model type | Algorithm to fit. The drawer pulls the live list from `GET /ml/algorithms`. |
| Hyperparameters | Algorithm-specific. The form is generated from the algorithm spec — what shows up depends on the model type. |
| Publish to catalog | Off by default. When on, also stores the model in the global catalog under Model name. |
| Model name / tags | Required if publishing. Re-running with the same name auto-bumps the version. |
### Algorithms
| Model type | Task | Output dtype | Notes |
|---|---|---|---|
| `linear_regression` | regression | `Float64` | Ordinary least squares. |
| `ridge_regression` | regression | `Float64` | L2-penalised. Useful when features are correlated. |
| `lasso_regression` | regression | `Float64` | L1-penalised. Drives some coefficients to zero (feature selection). |
| `logistic_regression` | classification | `Int64` | Binary classifier. Target column must contain 0/1 integer labels. |
| `knn_classifier` | classification | `Int64` | Binary KNN with kd-tree lookup. Stores the training set in the artifact — fine for demos, heavy for huge data. |
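The note in the last row is easiest to see in a toy brute-force sketch (the real node uses a kd-tree, but the memory trade-off is the same: prediction needs the entire training set at hand):

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x.
    The whole training set must travel with the model artifact."""
    dists = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_X = [(0, 0), (0, 1), (5, 5), (6, 5)]
train_y = [0, 0, 1, 1]
knn_predict(train_X, train_y, (5, 6), k=3)  # -> 1
```

By contrast, the linear models only need their coefficient vector, which is why their artifacts stay small regardless of training-set size.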
### Behaviour
- Hyperparameters are validated up front. An invalid combination fails on the Train Model node, not later inside the worker.
- The trained model is always cached at a flow-scoped path keyed off the node id, so an Apply Model further down the same flow can read it without publishing to the catalog.
- Publishing is opt-in. Toggle it on if you want to reuse the model from another flow or pin a specific version.
## Wait For
Pass-through node with two inputs: the left input flows through unchanged, the right input only enforces ordering. Once both inputs have finished, the node emits the left input's data on its single output.
Use it whenever a downstream node depends on a side effect of a sibling branch — typically Apply Model needing Train Model's artifact to exist before it runs.
### Configuration
There are no settings. Wire the data branch into the left input and the dependency branch into the right input.
## Apply Model
Adds a prediction column to the input. The model is loaded either from an upstream Train Model in the same flow, or by name/version from the catalog.
### Configuration
| Parameter | Description |
|---|---|
| Source | Upstream (default) reads the model from a Train Model node in this flow. Catalog looks the model up by name/version. |
| Upstream training node | Picker populated from a `/ml/upstream-train-models` walk of the flow graph — only Train Model nodes that can actually reach this Apply Model show up. |
| Model name / version | Used when the source is Catalog. Empty version means "latest active". |
| Output column | Name of the prediction column. Defaults to `prediction`. |
### Behaviour
- The input must contain every feature column the model was trained on. A missing feature raises a clear error before any work runs.
- Linear/ridge/lasso/logistic models apply as a pure Polars expression — no Python loop, no worker round-trip. KNN collects once because it needs the full training set in memory to run a kd-tree query.
- The output dtype is the algorithm's `output_dtype` (`Float64` for regression, `Int64` for classification). For binary logistic regression, the prediction is the argmax of the sigmoid (i.e. `1` if the linear combination is positive, otherwise `0`).
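The sigmoid detail in the last bullet is worth a tiny worked sketch: because the sigmoid is monotone, the hard label depends only on the sign of the linear combination (the weights below are made up for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Binary logistic prediction: 1 iff P(class=1) > 0.5."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    # sigmoid(z) > 0.5 exactly when z > 0, so the probability
    # never needs to be computed to get the hard label
    return 1 if z > 0 else 0

predict([0.8, -0.4], 0.1, [1.0, 2.0])  # z = 0.8 - 0.8 + 0.1 = 0.1 > 0, label 1
```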
## Evaluate Model
Compares an actual column against a predicted column and emits a `(metric, value)` table. It does not care which Train/Apply pair produced the prediction — point it at any frame that has both columns.
### Configuration
| Parameter | Description |
|---|---|
| Actual column | Ground-truth column. |
| Predicted column | Prediction column added by Apply Model. Defaults to `prediction`. |
| Task type | `auto` (default), `regression`, or `classification`. `auto` resolves the task from the upstream Train Model node when one is set, else `regression`. |
| Upstream training node | Optional pointer to a Train Model node in this flow, used only to resolve `task_type=auto`. |
### Metrics
For regression:

| Metric | Meaning |
|---|---|
| `mae` | Mean absolute error. |
| `mse` | Mean squared error. |
| `rmse` | Root mean squared error. |
| `r2` | Coefficient of determination. 1.0 is perfect, 0.0 is the mean baseline. |
| `mape` | Mean absolute percentage error. Rows where actual is 0 are dropped from this metric only. |
| `n` | Row count after dropping nulls. |
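A minimal sketch of how these regression metrics fall out of the two columns (illustrative only, not the polars-ds implementation; note the MAPE-specific zero filter):

```python
import math

def regression_metrics(actual, predicted):
    # null rows are dropped for every metric
    pairs = [(a, p) for a, p in zip(actual, predicted)
             if a is not None and p is not None]
    n = len(pairs)
    errors = [a - p for a, p in pairs]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_a = sum(a for a, _ in pairs) / n
    ss_tot = sum((a - mean_a) ** 2 for a, _ in pairs)
    r2 = 1.0 - (mse * n) / ss_tot
    # rows where actual == 0 are dropped for MAPE only
    mape_pairs = [(a, p) for a, p in pairs if a != 0]
    mape = sum(abs((a - p) / a) for a, p in mape_pairs) / len(mape_pairs)
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse),
            "r2": r2, "mape": mape, "n": n}
```

For example, `regression_metrics([1, 2, 3, 4], [1, 2, 3, 5])` gives an MAE of 0.25 and an R² of 0.8: three exact predictions plus one unit of error against a total variance of 5.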
For classification:

| Metric | Meaning |
|---|---|
| `accuracy` | Correct predictions / total. |
| `precision` | Macro-averaged across classes. |
| `recall` | Macro-averaged across classes. |
| `f1` | Macro-averaged F1. |
| `n_correct` | Raw count of correct predictions. |
| `n_total` | Total rows after dropping nulls. |
### Behaviour
- Rows where either column is null are dropped before any metric is computed, so a single missing prediction doesn't poison the rest.
- For classification, both columns are cast to string before grouping, so integer `0`/`1` labels and string labels share one path.
- Macro-averaging weights every class equally, which is the right default for imbalanced data. There is no per-class breakdown in v1; the long-form output is meant to be filtered, joined, or charted downstream.
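The bullets above can be sketched in plain Python. This is one common macro-averaging convention (F1 computed from macro precision and recall; some implementations average per-class F1 instead), not necessarily the engine's exact formula:

```python
def macro_scores(actual, predicted):
    """Per-class precision and recall, then an unweighted mean across classes."""
    actual = [str(a) for a in actual]        # cast both sides to string, so
    predicted = [str(p) for p in predicted]  # int and str labels share one path
    classes = sorted(set(actual) | set(predicted))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for a, p in zip(actual, predicted) if a == c and p == c)
        predicted_c = sum(1 for p in predicted if p == c)
        actual_c = sum(1 for a in actual if a == c)
        precisions.append(tp / predicted_c if predicted_c else 0.0)
        recalls.append(tp / actual_c if actual_c else 0.0)
    # every class contributes equally, regardless of its row count
    precision = sum(precisions) / len(classes)
    recall = sum(recalls) / len(classes)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Note that mixed inputs such as `macro_scores([0, 0, 1, 1], ["0", "1", "1", "1"])` work because of the string cast on both sides.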
## Starter templates
Two beginner templates are shipped under Templates → Beginner:
- **Customer Churn Classification** — logistic regression on a synthetic churn dataset. Five features, binary `churned` target. Walks through the full split → train → wait → apply → evaluate chain.
- **Customer Churn (KNN)** — same dataset and shape, but uses the KNN classifier so you can compare a parametric and a neighbour-based model on the same hold-out split.
Both load `data/templates/customer_churn.csv` and produce an Evaluate Model output you can inspect with the Explore Data node downstream.

The regression sibling, **House Price Regression**, uses `linear_regression` on `data/templates/house_prices.csv` and is the template to mirror when building your own regression flow.
## Python API
The same nodes are available as `FlowFrame` methods. The wiring is identical — `train_model` returns a frame whose backing node is a Train Model, and you pass that frame to `apply_model(upstream=...)` to chain them.
```python
import flowfile as ff

raw = ff.read_csv("customer_churn.csv")

train, test = raw.random_split([("train", 80), ("test", 20)], seed=42)

trained = train.train_model(
    target="churned",
    features=["tenure_months", "monthly_charges", "support_calls", "has_contract"],
    model_type="logistic_regression",
    params={"l2_reg": 0.1},
)

scored = (
    test.wait_for(trained)  # block apply until train is done
    .apply_model(
        upstream=trained,
        output_column="predicted_churn",
    )
)

metrics = scored.evaluate_model(
    actual="churned",
    predicted="predicted_churn",
    task_type="auto",
    upstream_train=trained,
)

ff.open_graph_in_editor(metrics.flow_graph)  # see the canvas
```
The catalog path is also supported — pass `model_name=` (and optionally `version=`) to `apply_model` instead of `upstream=`.
## For contributors
Adding a new algorithm to the registry is a backend-only change. See Extending ML Nodes for the recipe.