Machine Learning Nodes

The ML nodes let you split a dataset, fit a model, score new rows with it, and read off quality metrics — all from the visual canvas. Compute runs in Polars (via polars-ds), so you do not need scikit-learn or any extra Python environment.

The pipeline shape

A typical ML flow chains five nodes end-to-end:

```mermaid
flowchart LR
    Read[Read] --> Split[Random Split]
    Split -- train --> Train[Train Model]
    Split -- test --> Wait[Wait For]
    Train --> Wait
    Wait --> Apply[Apply Model]
    Apply --> Eval[Evaluate Model]
```

  • Random Split carves the input into a train and test slice.
  • Train Model fits an algorithm on the train slice and writes the model artifact to a flow-scoped cache.
  • Wait For is a pure synchronisation node: it lets the test slice flow through unchanged but blocks until the trainer has finished writing the model artifact.
  • Apply Model reads the model the upstream Train Model wrote and adds a prediction column to the test slice.
  • Evaluate Model compares the actual and predicted columns and emits a (metric, value) table.

Skipping Wait For

You can skip Wait For if Apply Model is reading from the catalog instead of an upstream node. The in-flow path shown above is the simplest train→apply chain and what the starter templates use.


Random Split

Randomly partitions rows into named output handles. Default is two outputs, train (80%) and test (20%), but you can configure up to ten splits with arbitrary names.

Configuration

| Parameter | Description |
| --- | --- |
| Splits | List of name + percentage pairs. Names must start with a letter and percentages must sum to 100. |
| Seed | Optional integer seed for reproducible splits. Leave empty for a non-deterministic split. |

Behaviour

  • One output handle per split, in the order you defined them.
  • Each row lands in exactly one split. The split is purely random — there is no stratification by target class.
  • Setting a seed (e.g. 42) makes the split deterministic across runs, which is what you want for reproducible model evaluation.
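The mechanics can be sketched in plain Python: draw one uniform number per row from a seeded RNG and bucket it by cumulative percentage. This is an illustration of the semantics (one split per row, deterministic under a seed), not Flowfile's actual implementation, which runs in Polars:

```python
import random

def random_split(rows, splits, seed=None):
    """Assign each row to exactly one named split by cumulative percentage."""
    assert sum(pct for _, pct in splits) == 100, "percentages must sum to 100"
    rng = random.Random(seed)
    # Cumulative thresholds, e.g. [("train", 0.8), ("test", 1.0)] for 80/20.
    thresholds, cum = [], 0.0
    for name, pct in splits:
        cum += pct / 100.0
        thresholds.append((name, cum))
    out = {name: [] for name, _ in splits}
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for name, limit in thresholds:
            if draw < limit:
                out[name].append(row)
                break
    return out

parts = random_split(range(1000), [("train", 80), ("test", 20)], seed=42)
```

Re-running with the same seed reproduces the exact same partition, which is what makes the downstream evaluation comparable across runs.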

Train Model

Fits a model on the input rows and stores the artifact for downstream use. The data passes through unchanged on the output, so you can chain other nodes off the same Train Model output if you want.

Configuration

| Parameter | Description |
| --- | --- |
| Target column | Column you are trying to predict. |
| Feature columns | One or more numeric columns used as predictors. Must not include the target. |
| Model type | Algorithm to fit. The drawer pulls the live list from GET /ml/algorithms. |
| Hyperparameters | Algorithm-specific. The form is generated from the algorithm spec; what shows up depends on the model type. |
| Publish to catalog | Off by default. When on, also stores the model in the global catalog under Model name. |
| Model name / tags | Required if publishing. Re-running with the same name auto-bumps the version. |

Algorithms

| Model type | Task | Output dtype | Notes |
| --- | --- | --- | --- |
| linear_regression | regression | Float64 | Ordinary least squares. |
| ridge_regression | regression | Float64 | L2-penalised. Useful when features are correlated. |
| lasso_regression | regression | Float64 | L1-penalised. Drives some coefficients to zero (feature selection). |
| logistic_regression | classification | Int64 | Binary classifier. Target column must contain 0/1 integer labels. |
| knn_classifier | classification | Int64 | Binary KNN with kd-tree lookup. Stores the training set in the artifact; fine for demos, heavy for huge data. |
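The note on knn_classifier is worth unpacking: unlike the linear models, KNN has no compact set of coefficients, so the artifact must carry the full training set. A brute-force sketch shows why (the real node uses a kd-tree for the lookup, but the semantics are the same):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Brute-force binary KNN: majority vote among the k nearest training rows.

    The key property: scoring needs *all* training rows at predict time,
    which is why the model artifact stores the training set.
    """
    dists = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

X = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8)]
y = [0, 0, 1, 1]
pred = knn_predict(X, y, (0.95, 0.9), k=3)  # two of the three nearest rows are class 1
```

Because every query touches the whole training set, the artifact size and scoring cost both grow with the data, which is the "fine for demos, heavy for huge data" trade-off in the table.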

Behaviour

  • Hyperparameters are validated up front. An invalid combination fails on the Train Model node, not later inside the worker.
  • The trained model is always cached at a flow-scoped path keyed off the node id, so an Apply Model further down the same flow can read it without publishing to the catalog.
  • Publishing is opt-in. Toggle it on if you want to reuse the model from another flow or pin a specific version.

Wait For

Pass-through node with two inputs: the left input flows through unchanged, the right input only enforces ordering. Once both inputs have finished, the node emits the left input's data on its single output.

Use it whenever a downstream node depends on a side effect of a sibling branch — typically Apply Model needing Train Model's artifact to exist before it runs.

Configuration

There are no settings. Wire the data branch into the left input and the dependency branch into the right input.


Apply Model

Adds a prediction column to the input. The model is loaded either from an upstream Train Model in the same flow, or by name/version from the catalog.

Configuration

| Parameter | Description |
| --- | --- |
| Source | Upstream (default) reads the model from a Train Model node in this flow. Catalog looks the model up by name/version. |
| Upstream training node | Picker populated from a /ml/upstream-train-models walk of the flow graph; only Train Model nodes that can actually reach this Apply Model show up. |
| Model name / version | Used when the source is Catalog. An empty version means "latest active". |
| Output column | Name of the prediction column. Defaults to prediction. |

Behaviour

  • The input must contain every feature column the model was trained on. A missing feature raises a clear error before any work runs.
  • Linear/ridge/lasso/logistic models apply as a pure Polars expression — no Python loop, no worker round-trip. KNN collects once because it needs the full training set in memory to run a kd-tree query.
  • The output dtype is the algorithm's output_dtype (Float64 for regression, Int64 for classification). For binary logistic regression, the prediction is the argmax of the sigmoid (i.e. 1 if the linear combination is positive, otherwise 0).
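The argmax-of-the-sigmoid rule in the last bullet reduces to a sign test, because sigmoid(z) > 0.5 exactly when z > 0. A small sketch (illustrative weights, not part of the product):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Binary logistic prediction as an argmax over {0, 1}.

    sigmoid(z) > 0.5 if and only if z > 0, so thresholding the raw
    linear combination gives the same label without evaluating the sigmoid.
    """
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0

# z = 0 sits exactly on the decision boundary, where sigmoid(0) == 0.5
label = predict([1.0, -2.0], 0.5, [1.0, 0.0])  # z = 1.5 > 0, so label 1
```

This is also why the whole prediction can run as a pure Polars expression for the linear family: it is a weighted sum plus a comparison, with no per-row Python.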

Evaluate Model

Compares an actual column against a predicted column and emits a (metric, value) table. It does not care which Train/Apply pair produced the prediction — point it at any frame that has both columns.

Configuration

| Parameter | Description |
| --- | --- |
| Actual column | Ground-truth column. |
| Predicted column | Prediction column added by Apply Model. Defaults to prediction. |
| Task type | auto (default), regression, or classification. auto resolves the task from the upstream Train Model node when one is set, otherwise falls back to regression. |
| Upstream training node | Optional pointer to a Train Model node in this flow, used only to resolve task_type=auto. |

Metrics

For regression:

| Metric | Meaning |
| --- | --- |
| mae | Mean absolute error. |
| mse | Mean squared error. |
| rmse | Root mean squared error. |
| r2 | Coefficient of determination. 1.0 is perfect, 0.0 matches the mean baseline. |
| mape | Mean absolute percentage error. Rows where actual is 0 are dropped from this metric only. |
| n | Row count after dropping nulls. |
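As a reference for what each number means, here is a plain-Python sketch of these metrics, including the null-drop and the MAPE zero-drop rules. It illustrates the definitions only; the node computes them in Polars:

```python
import math

def regression_metrics(actual, predicted):
    """Regression metrics from the table above (definition sketch)."""
    # Rows where either column is null are dropped first.
    pairs = [(a, p) for a, p in zip(actual, predicted)
             if a is not None and p is not None]
    n = len(pairs)
    errors = [a - p for a, p in pairs]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(a for a, _ in pairs) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a, _ in pairs)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    # MAPE: rows where the actual value is 0 are dropped from this metric only.
    nonzero = [(a, p) for a, p in pairs if a != 0]
    mape = sum(abs((a - p) / a) for a, p in nonzero) / len(nonzero)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "mape": mape, "n": n}

metrics = regression_metrics([1.0, 2.0, 3.0, 0.0, None], [1.0, 2.0, 4.0, 1.0, 5.0])
```

Note that n counts the rows that survive the null drop, so the zero-actual row still contributes to mae, mse, rmse, and r2; it is excluded from mape alone.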

For classification:

| Metric | Meaning |
| --- | --- |
| accuracy | Correct predictions / total. |
| precision | Macro-averaged across classes. |
| recall | Macro-averaged across classes. |
| f1 | Macro-averaged F1. |
| n_correct | Raw count of correct predictions. |
| n_total | Total rows after dropping nulls. |

Behaviour

  • Rows where either column is null are dropped before any metric is computed, so a single missing prediction doesn't poison the rest.
  • For classification, both columns are cast to string before grouping, so integer 0/1 labels and string labels share one path.
  • Macro-averaging weights every class equally, which is the right default for imbalanced data. There is no per-class breakdown in v1; the long-form output is meant to be filtered, joined, or charted downstream.
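The three behaviours above (null drop, string cast, equal-weight macro averaging) can be pinned down with a short definition sketch. This is plain Python for clarity, not the node's Polars implementation:

```python
def classification_metrics(actual, predicted):
    """Macro-averaged classification metrics (definition sketch)."""
    # Drop null rows, then cast labels to strings so integer 0/1 labels
    # and string labels share one path.
    pairs = [(str(a), str(p)) for a, p in zip(actual, predicted)
             if a is not None and p is not None]
    n_total = len(pairs)
    n_correct = sum(a == p for a, p in pairs)
    classes = sorted({a for a, _ in pairs} | {p for _, p in pairs})
    precs, recs, f1s = [], [], []
    for c in classes:
        tp = sum(1 for a, p in pairs if a == c and p == c)
        fp = sum(1 for a, p in pairs if a != c and p == c)
        fn = sum(1 for a, p in pairs if a == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    k = len(classes)  # macro average: every class weighs the same
    return {
        "accuracy": n_correct / n_total,
        "precision": sum(precs) / k,
        "recall": sum(recs) / k,
        "f1": sum(f1s) / k,
        "n_correct": n_correct,
        "n_total": n_total,
    }

m = classification_metrics([0, 0, 1, 1, None], [0, 1, 1, 1, 0])
```

Because each class contributes 1/k to the average regardless of its row count, a rare class with poor recall drags the macro score down just as much as a common one would.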

Starter templates

Two beginner templates are shipped under Templates → Beginner:

  • Customer Churn Classification — logistic regression on a synthetic churn dataset. Five features, binary churned target. Walks through the full split → train → wait → apply → evaluate chain.
  • Customer Churn (KNN) — same dataset and shape, but uses the KNN classifier so you can compare a parametric and a neighbour-based model on the same hold-out split.

Both load data/templates/customer_churn.csv and produce an Evaluate Model output you can inspect with the Explore Data node downstream.

The regression sibling, House Price Regression, uses linear_regression on data/templates/house_prices.csv and is the template to mirror when building your own regression flow.


Python API

The same nodes are available as FlowFrame methods. The wiring is identical — train_model returns a frame whose backing node is a Train Model, and you pass that frame to apply_model(upstream=...) to chain them.

```python
import flowfile as ff

raw = ff.read_csv("customer_churn.csv")
train, test = raw.random_split([("train", 80), ("test", 20)], seed=42)

trained = train.train_model(
    target="churned",
    features=["tenure_months", "monthly_charges", "support_calls", "has_contract"],
    model_type="logistic_regression",
    params={"l2_reg": 0.1},
)

scored = (
    test.wait_for(trained)            # block apply until train is done
        .apply_model(
            upstream=trained,
            output_column="predicted_churn",
        )
)

metrics = scored.evaluate_model(
    actual="churned",
    predicted="predicted_churn",
    task_type="auto",
    upstream_train=trained,
)

ff.open_graph_in_editor(metrics.flow_graph)  # see the canvas
```

The catalog path is also supported — pass model_name= (and optionally version=) to apply_model instead of upstream=.


For contributors

Adding a new algorithm to the registry is a backend-only change. See Extending ML Nodes for the recipe.