FlowFrame and FlowGraph Design Concepts
Understanding how FlowFrame and FlowGraph work together is key to mastering Flowfile. This guide explains the core design principles that make Flowfile both powerful and intuitive.
Related Reading
- Practical Implementation: See these concepts in action in our Code to Flow guide
- Architecture Overview: Learn about the system design in Technical Architecture
- Visual Building: Compare with Building Flows visually
FlowFrame: Always Lazy, Always Connected
What is FlowFrame?
FlowFrame is Flowfile's version of a Polars DataFrame with a crucial difference: it's always lazy and always connected to a graph.
import flowfile as ff
# This creates a FlowFrame, not a regular DataFrame
df = ff.FlowFrame({
"id": [1, 2, 3, 4, 5],
"amount": [100, 250, 80, 300, 150],
"category": ["A", "B", "A", "C", "B"]
})
print(type(df)) # <class 'FlowFrame'>
print(type(df.data)) # <class 'polars.LazyFrame'>
Key Properties of FlowFrame
1. Always Lazy Evaluation
A FlowFrame never loads your actual data into memory until you explicitly call .collect(). This means you can build complex transformations on massive datasets without consuming memory:
# None of this processes any data yet
df = (
ff.FlowFrame({
"id": [1, 2, 3, 4, 5],
"amount": [500, 1200, 800, 1500, 900],
"category": ["A", "B", "A", "C", "B"]
}) # Creates manual input node
.filter(ff.col("amount") > 1000) # No filtering happens yet
.group_by("category") # No grouping happens yet
.agg(ff.col("amount").sum()) # No aggregation happens yet
)
# Only now does the data get processed
result = df.collect() # Everything executes at once, optimized
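Because the underlying object is a Polars LazyFrame (exposed via .data, as shown earlier), you can also peek at the optimized query plan before executing anything. A minimal sketch using standard Polars introspection, assuming .data is available on derived FlowFrames as well:
# .data exposes the underlying polars.LazyFrame, so regular Polars tooling applies
print(df.data.explain())  # prints the optimized plan without executing the pipeline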
Performance Benefits
This lazy evaluation is powered by Polars and explained in detail in our Technical Architecture guide.
2. Connected to a DAG (Directed Acyclic Graph)
Every FlowFrame has a reference to a FlowGraph that tracks every operation as a node:
df = ff.FlowFrame({
"id": [1, 2, 3],
"name": ["Alice", "Bob", "Charlie"],
"active": [True, False, True]
})
print(df.flow_graph) # Shows the graph this FlowFrame belongs to
print(df.node_id) # Shows which node in the graph this FlowFrame represents
For a deeper understanding of how this DAG works internally, see FlowGraph in the Developers Guide.
3. Linear Operation Tracking
Each operation creates a new node in the graph, even if you repeat the same operation:
df = ff.FlowFrame({
"id": [1, 2, 3, 4],
"amount": [50, 150, 75, 200],
"status": ["active", "inactive", "active", "active"]
})
print(f"Initial graph has {len(df.flow_graph.nodes)} nodes")
# First filter - creates node 1
df1 = df.filter(ff.col("amount") > 100)
print(f"After first filter: {len(df1.flow_graph.nodes)} nodes")
# Second identical filter - creates node 2 (not reused!)
df2 = df.filter(ff.col("amount") > 100)
print(f"After second filter: {len(df2.flow_graph.nodes)} nodes")
# Both operations are tracked separately in the graph
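Because each filter produced its own node, the two FlowFrames end up pointing at different node ids while still sharing one graph. A quick check, assuming node_id and flow_graph behave as shown above:
# Each filter was recorded as a separate node...
assert df1.node_id != df2.node_id
# ...but both branches belong to the same underlying graph
assert df1.flow_graph is df2.flow_graph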
FlowGraph: The Pipeline's Blueprint
What is FlowGraph?
FlowGraph is the "brain" behind FlowFrame - it's a Directed Acyclic Graph (DAG) that tracks every step in your data transformation pipeline.
# Access the graph from any FlowFrame
df = ff.FlowFrame({
"product": ["Widget", "Gadget", "Tool"],
"price": [10.99, 25.50, 8.75],
"quantity": [100, 50, 200]
})
graph = df.flow_graph
print(f"Graph ID: {graph.flow_id}")
print(f"Number of operations: {len(graph.nodes)}")
print(f"Node connections: {graph.node_connections}")
Visual Integration
Viewing Your Graph
Every FlowFrame can show its complete operation history in the visual editor:
# Build a pipeline
result = (
ff.FlowFrame({
"region": ["North", "South", "North", "East", "South"],
"amount": [1000, 0, 1500, 800, 1200],
"product": ["A", "B", "A", "C", "B"]
}, description="Load sales data")
.filter(ff.col("amount") > 0, description="Remove invalid amounts")
.group_by("region", description="Group by sales region")
.agg(ff.col("amount").sum().alias("total_sales"))
)
# Open the entire pipeline in the visual editor
ff.open_graph_in_editor(result.flow_graph)
Learn More
See Visual UI Integration for details on launching and controlling the visual editor from Python.
This opens a visual representation showing:
- Each operation as a node
- Data flow between operations
- Descriptions you added for documentation
- Schema changes at each step
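The same lineage information is available without leaving Python. A small sketch using the graph attributes introduced earlier (the exact structure of nodes and connections may differ between versions):
graph = result.flow_graph
# One entry per operation in the pipeline
print(f"Operations tracked: {len(graph.nodes)}")
# Each connection is an edge of the DAG (upstream node -> downstream node)
for connection in graph.node_connections:
    print(connection)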
Real-time Schema Prediction
The DAG enables instant schema prediction without processing data:
df = ff.FlowFrame({
"product": ["Widget", "Gadget"],
"price": [10.50, 25.00],
"quantity": [2, 3]
})
print("Original schema:", df.schema)
# Schema is predicted instantly, no data processed
transformed = df.with_columns([
(ff.col("price") * ff.col("quantity")).alias("total")
])
print("New schema:", transformed.schema) # Shows new 'total' column immediately
How Schema Prediction Works
Learn about the closure pattern that enables this in The Magic of Closures.
Practical Implications
Memory Efficiency
Since FlowFrame is always lazy:
# This can handle large datasets efficiently through lazy evaluation
large_pipeline = (
ff.FlowFrame({
"id": list(range(10000)),
"quality_score": [0.1, 0.9, 0.8, 0.95] * 2500, # Simulating large data
"value": list(range(10000)),
"category": ["A", "B", "C", "D"] * 2500
})
.filter(ff.col("quality_score") > 0.95) # Reduces data early
.select(["id", "value", "category"]) # Reduces columns early
.group_by("category")
.agg(ff.col("value").mean())
)
# Only processes what's needed when you collect
result = large_pipeline.collect() # Optimized execution plan
Performance Guide
For more on optimization strategies, see Execution Methods in our philosophy guide.
Graph Reuse and Copying
You can work with the same graph across multiple FlowFrames:
# Start with common base
base = ff.FlowFrame({
"region": ["North", "South", "East"],
"year": [2024, 2024, 2023],
"sales": [1000, 1500, 800],
"product": ["Widget", "Gadget", "Tool"],
"quantity": [10, 15, 8]
}).filter(ff.col("year") == 2024)
# Create different branches (same graph, different endpoints)
sales_summary = base.group_by("region").agg(ff.col("sales").sum())
product_summary = base.group_by("product").agg(ff.col("quantity").sum())
# Both share the same underlying graph
assert sales_summary.flow_graph is product_summary.flow_graph
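Each branch can then be materialized independently; the shared filter on year is part of both execution plans. A short usage sketch:
# Collect each branch on its own; both plans include the shared year filter
sales_by_region = sales_summary.collect()
quantity_by_product = product_summary.collect()
print(sales_by_region)
print(quantity_by_product)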
Best Practices
1. Use Descriptions for Complex Pipelines
import flowfile as ff
pipeline = (
ff.FlowFrame({
"customer_id": [1, 2, 3, 4, 5],
"status": ["active", "inactive", "active", "active", "inactive"],
"signup_date": ["2024-01-15", "2023-12-10", "2024-02-20", "2023-11-05", "2024-03-01"],
"customer_segment": ["premium", "basic", "premium", "basic", "premium"],
"revenue": [1000, 500, 1500, 300, 2000]
}, description="Load raw customer data")
.filter(ff.col("status") == "active", description="Keep only active customers")
.with_columns([
ff.col("signup_date").str.strptime(ff.Date, "%Y-%m-%d").alias("signup_date")
], description="Parse signup dates")
.group_by("customer_segment", description="Aggregate by customer segment")
.agg([
ff.col("revenue").sum().alias("total_revenue"),
ff.col("customer_id").count().alias("customer_count")
], description="Calculate segment metrics")
)
2. Visualize During Development
# Check your pipeline structure frequently
ff.open_graph_in_editor(pipeline.flow_graph)
Complete Examples
- Database Pipeline: See PostgreSQL Integration for a real-world UI example
- Cloud Pipeline: Check Cloud Connections for S3 workflows
- Export to Code: Learn how your pipelines convert to pure Python in Flow to Code
Summary
FlowFrame and FlowGraph work together to provide:
- Lazy evaluation: No memory waste, optimal performance
- Complete lineage: Every operation is tracked and visualizable
- Real-time feedback: Instant schema prediction and error detection
- Seamless integration: Switch between code and visual editing
- Polars compatibility: Nearly identical API with additional features
- Automatic adaptation: Complex operations automatically fall back to code nodes
Understanding this design helps you build efficient, maintainable data pipelines that scale from quick analyses to production ETL workflows.
Related Documentation
- FlowFrame Operations - Available transformations and methods
- Expressions - Column operations and formula syntax
- Joins - Combining datasets
- Aggregations - Group by and summarization
- Visual UI Integration - Working with the visual editor
- Developers Guide - Core architecture and design philosophy