Virtual Flow Tables

Create catalog tables that store no data on disk. When queried, a virtual table executes its producer flow on demand — delivering always-fresh results with zero storage overhead.


Why Virtual Tables?

Virtual tables change the way you think about catalog data. Instead of materializing every intermediate result to disk, you can expose computed views of your data that stay up to date automatically.

| Benefit | Description |
| --- | --- |
| Zero storage cost | No files are written. The catalog entry holds only metadata and (when optimized) a serialized execution plan. |
| Always-fresh data | Every time a virtual table is read, it produces results from the latest version of the source data and flow logic. No stale snapshots. |
| Automatic optimization | Flowfile analyzes your pipeline and, when all upstream nodes support lazy execution, serializes the Polars execution plan for instant resolution — with predicate and projection pushdown. |
| Full integration | Virtual tables work everywhere physical tables work: Catalog Reader, SQL queries, table triggers, and schedules. |

How They Work

A virtual table is a catalog entry linked to a producer flow — a registered flow that contains a Catalog Writer node in virtual mode. When something reads from the virtual table, Flowfile resolves it in one of two ways:

graph LR
    A[Producer Flow] -->|runs with virtual writer| B[Virtual Table Entry]
    B -->|optimized| C[Deserialize LazyFrame]
    B -->|standard| D[Re-execute Producer Flow]
    C --> E[Query Results]
    D --> E

Creating Virtual Tables

There are two ways to create a virtual table.

Option 1: Via Catalog Writer Node

Add a Catalog Writer node to your flow and switch to virtual mode.

  1. Add a Catalog Writer node to your flow and connect it to the upstream data
  2. Enter a table name and select a catalog / schema namespace
  3. Click the Virtual Table tab (instead of "Write to Catalog")
  4. Review the laziness check result:
     • Green checkmark: your flow is fully lazy — the virtual table will be optimized
     • Yellow warning: some nodes prevent lazy execution — the table will use standard resolution (see Laziness Blockers)
  5. Save and run the flow

Flow must be registered

Virtual tables require the flow to be registered in the catalog. If the flow isn't registered yet, the virtual write will fail with an error. Open the flow from the catalog, or register it first via the catalog page.

Option 2: Via the SQL Editor

Create a query-based virtual table directly from the catalog's SQL Editor.

  1. Open the Catalog page and click the SQL button in the toolbar
  2. Write a SQL query against any combination of catalog tables
  3. Click the Save as Virtual Table button (bolt icon)
  4. Enter a table name and optional description
  5. Select a catalog / schema namespace
  6. Click Create

The SQL query is validated, executed once to derive the output schema, and then stored. No data is written to disk — each time the virtual table is read, the query re-executes against the latest catalog data.

Query-based vs flow-based

Query-based virtual tables store a SQL query and re-execute it on demand. Flow-based virtual tables (Option 1) store a reference to a producer flow and can be optimized with serialized execution plans. See SQL Editor — Save as Virtual Table for more on how query-based resolution works.


Optimization and Laziness

The key differentiator of virtual tables is the laziness system. Flowfile classifies every node in your pipeline as lazy, eager, or conditional, and uses this to determine whether a virtual table can be optimized.

What Makes a Flow "Fully Lazy"?

A virtual table is optimized when every node upstream of the Catalog Writer supports Polars' lazy evaluation. This means the entire pipeline can be represented as a deferred execution plan — no intermediate computation required.

When a flow is fully lazy, Flowfile serializes the Polars LazyFrame execution plan and stores it alongside the virtual table metadata. Reading an optimized virtual table is nearly instant — it deserializes the plan and executes it with full query optimization, including predicate pushdown and projection pushdown.

Node Laziness Classification

Every node type has a fixed laziness classification:

| Classification | Nodes | Behavior |
| --- | --- | --- |
| Lazy | Manual Input, Filter, Select, Formula, Join, Group By, Sort, Add Record ID, Take Sample, Unpivot, Union, Drop Duplicates, Graph Solver, Count Records, Cross Join, Text to Rows, SQL Query, Catalog Reader | Operations are deferred — computation happens only when results are collected |
| Eager | Write Data, External Source, Explore Data, Pivot, Fuzzy Match, Python Script, Database Reader, Database Writer, Cloud Storage Writer, Catalog Writer, Kafka Source | Forces execution of the upstream flow — may break the execution plan |
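The check itself amounts to "every upstream node must be lazy". A minimal sketch, assuming a simple lookup table — the node names and classifications below are illustrative, not Flowfile's internal registry:

```python
# Hypothetical laziness registry; unknown nodes default to "conditional".
LAZINESS = {
    "filter": "lazy", "select": "lazy", "join": "lazy", "group_by": "lazy",
    "pivot": "eager", "python_script": "eager", "fuzzy_match": "eager",
}


def is_fully_lazy(upstream_nodes: list[str]) -> tuple[bool, list[str]]:
    """Return (optimizable, blockers) for the nodes feeding the writer."""
    blockers = [n for n in upstream_nodes if LAZINESS.get(n, "conditional") != "lazy"]
    return (not blockers, blockers)
```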

Optimized Resolution

When a virtual table is optimized (is_optimized = true):

  1. The serialized LazyFrame is read from the catalog database
  2. Polars deserializes the execution plan
  3. The query engine applies predicate pushdown, projection pushdown, and other optimizations
  4. Only the needed data is computed

This path is extremely fast. Because the deserialized plan is a LazyFrame, Polars can push the consumer's filters and column selections through the producer's execution plan — query optimization crosses the flow boundary. Work that the consumer doesn't need never runs in the producer, and results always reflect the current source data.

Standard Resolution

When a virtual table is not optimized (eager or conditional nodes upstream):

  1. The producer flow is loaded and executed end-to-end
  2. The Catalog Writer node's output is captured as a LazyFrame
  3. The result is returned to the caller

This path is slower because it executes the entire producer flow, but it guarantees correct results regardless of pipeline complexity.

Laziness Blockers

When the Catalog Writer's Virtual Table tab shows a yellow warning, it lists the specific laziness blockers — the nodes in your pipeline that prevent optimization.

Each blocker identifies:

  • The node name and ID (e.g., "Node 'Pivot data' (id=3) is eager")
  • Whether the node is eager (always forces computation) or conditional (may or may not be lazy)

Blockers are scoped

The laziness check only examines nodes that feed into the Catalog Writer. Nodes on unrelated branches (e.g., an Explore Data node on a separate path) are ignored.
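That scoping can be sketched as a reverse traversal from the Catalog Writer over the flow graph — the edge representation below is hypothetical:

```python
from collections import deque


def upstream_nodes(inputs: dict[str, list[str]], writer: str) -> set[str]:
    """Collect only the nodes that feed into `writer`.

    `inputs` maps each node to the list of nodes it reads from
    (an illustrative shape, not Flowfile's graph model).
    """
    seen: set[str] = set()
    queue = deque(inputs.get(writer, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(inputs.get(node, []))
    return seen
```

Nodes on branches that never reach the writer simply never enter the traversal, so they cannot register as blockers.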


Using Virtual Tables

Virtual tables integrate seamlessly into the catalog ecosystem.

Reading in Flows

Use the Catalog Reader node to read a virtual table, just like a physical one. Virtual tables appear in the table dropdown with a bolt icon to distinguish them.

When the flow runs, the Catalog Reader resolves the virtual table automatically:

  • Optimized tables → deserialize the stored execution plan (instant)
  • Standard tables → execute the producer flow (may take longer)

SQL Queries

Virtual tables are fully queryable via the SQL Editor. They appear alongside physical Delta tables in the SQL context, so you can join, filter, and aggregate across both types in a single query.

-- Query a virtual table alongside a physical table
SELECT v.customer_id, v.score, p.region
FROM customer_scores v
JOIN regions p ON v.region_id = p.id
WHERE v.score > 0.8

Table Triggers

A schedule can fire when a virtual table's producer flow finishes running, instead of (or in addition to) a cron. When the producer of a virtual table completes, any schedule with a table trigger on that table runs.

This shines when you want to centralize trigger logic across a chain or fan-out of flows:

flowchart LR
    A[Flow A: produces table X] -->|fires| B[Flow B: produces table Y]
    B -->|fires| C[Flow C]
    B -->|fires| D[Flow D]

Instead of Flow C and Flow D each managing their own cron trying to line up after A and B, one A → B → { C, D } trigger chain drives the whole graph off a single upstream event.

Note: Because virtual tables recompute their full lineage on read, you could also fan out directly from A — A → B, A → C, A → D — and C and D would still see the effect of B's transformations. Chain vs. fan-out is a choice about where to express the trigger edges, not about data flow.

In many cases you won't need table triggers: if a flow runs on its own cadence, a plain cron schedule is simpler.

Note

Virtual tables store no data, so a consumer reading the table recomputes its full lineage — including the producer's logic. A table trigger is a scheduling signal, not a data hand-off.


When to Use Virtual vs Physical

| Scenario | Recommendation | Why |
| --- | --- | --- |
| Derived metrics or aggregations | Virtual | Always reflects latest source data, no storage duplication |
| Development and exploration | Virtual | Quick iteration without accumulating files |
| Small-to-medium computed datasets | Virtual | Negligible resolution time, zero storage cost |
| Production reporting with SLAs | Physical | Predictable read speed, no dependency on producer flow availability |
| Large datasets (millions+ rows) | Physical | Materialized reads are faster than on-demand computation |
| Historical snapshots / auditing | Physical | Delta versioning provides time-travel queries |
| Shared across many consumers | Physical | Compute once, read many times |
| Cross-system data landing | Physical | Need a stable file for external tools |

Limitations

  • Non-optimized tables re-execute the full producer flow on every read. For complex or slow flows, this can add significant latency.
  • Requires a registered producer flow — the flow must be saved and registered in the catalog before a virtual table can reference it.
  • No Delta versioning — since no physical data is stored, there's no version history or time-travel capability.
  • One virtual table per producer flow — each registered flow can produce at most one virtual table entry (updates overwrite the existing entry).