
Flowfile Core API Reference

This section provides a detailed API reference for the core Python objects, data models, and API routes in flowfile-core. The documentation is generated directly from the source code docstrings.


Core Components

This section covers the fundamental classes that manage the state and execution of data pipelines. These are the main "verbs" of the library.

FlowGraph

The FlowGraph is the central object that orchestrates the execution of data transformations. It is built incrementally as you chain operations, and the resulting DAG (Directed Acyclic Graph) represents the entire pipeline.
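The sketch below illustrates that incremental style. It is a minimal, hypothetical example: the import path of the schemas modules and the constructor fields of the settings objects (FlowGraphConfig, NodeDatasource, NodeRecordCount) are assumptions inferred from the source shown further down, not a verbatim quick-start.

# A minimal sketch, assuming the schema import path and constructor fields below;
# consult the settings models in this reference for the authoritative signatures.
from flowfile_core.flowfile.flow_graph import FlowGraph
from flowfile_core.schemas import schemas, input_schema  # assumed module layout

graph = FlowGraph(flow_settings=schemas.FlowGraphConfig(flow_id=1))  # assumed fields
# Each add_* call registers one node; depending_on_id wires a node to its input.
graph.add_datasource(input_schema.NodeDatasource(node_id=1, file_path="sales.csv"))
graph.add_record_count(input_schema.NodeRecordCount(node_id=2, depending_on_id=1))
graph.run_graph()  # executes the whole DAG from its starting nodes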

flowfile_core.flowfile.flow_graph.FlowGraph

A class representing a Directed Acyclic Graph (DAG) for data processing pipelines.

It manages nodes, connections, and the execution of the entire flow.

Methods:

__init__: Initializes a new FlowGraph instance.
__repr__: Provides the official string representation of the FlowGraph instance.
add_cloud_storage_reader: Adds a cloud storage read node to the flow graph.
add_cloud_storage_writer: Adds a node to write data to a cloud storage provider.
add_cross_join: Adds a cross join node to the graph.
add_database_reader: Adds a node to read data from a database.
add_database_writer: Adds a node to write data to a database.
add_datasource: Adds a data source node to the graph.
add_dependency_on_polars_lazy_frame: Adds a special node that directly injects a Polars LazyFrame into the graph.
add_explore_data: Adds a specialized node for data exploration and visualization.
add_external_source: Adds a node for a custom external data source.
add_filter: Adds a filter node to the graph.
add_formula: Adds a node that applies a formula to create or modify a column.
add_fuzzy_match: Adds a fuzzy matching node to join data on approximate string matches.
add_graph_solver: Adds a node that solves graph-like problems within the data.
add_group_by: Adds a group-by aggregation node to the graph.
add_include_cols: Adds columns to both the input and output column lists.
add_initial_node_analysis: Adds a data exploration/analysis node based on a node promise.
add_join: Adds a join node to combine two data streams based on key columns.
add_manual_input: Adds a node for manual data entry.
add_node_promise: Adds a placeholder node to the graph that is not yet fully configured.
add_node_step: The core method for adding or updating a node in the graph.
add_output: Adds an output node to write the final data to a destination.
add_pivot: Adds a pivot node to the graph.
add_polars_code: Adds a node that executes custom Polars code.
add_read: Adds a node to read data from a local file (e.g., CSV, Parquet, Excel).
add_record_count: Adds a node that counts the number of records in the data.
add_record_id: Adds a node to create a new column with a unique ID for each record.
add_sample: Adds a node to take a random or top-N sample of the data.
add_select: Adds a node to select, rename, reorder, or drop columns.
add_sort: Adds a node to sort the data based on one or more columns.
add_sql_source: Adds a node that reads data from a SQL source.
add_text_to_rows: Adds a node that splits cell values into multiple rows.
add_union: Adds a union node to combine multiple data streams.
add_unique: Adds a node to find and remove duplicate rows.
add_unpivot: Adds an unpivot node to the graph.
apply_layout: Calculates and applies a layered layout to all nodes in the graph.
cancel: Cancels an ongoing graph execution.
close_flow: Performs cleanup operations, such as clearing node caches.
copy_node: Creates a copy of an existing node.
delete_node: Deletes a node from the graph and updates all its connections.
generate_code: Generates code for the flow graph.
get_frontend_data: Formats the graph structure into a JSON-like dictionary for a specific legacy frontend.
get_implicit_starter_nodes: Finds nodes that can act as starting points but are not explicitly defined as such.
get_node: Retrieves a node from the graph by its ID.
get_node_data: Retrieves all data needed to render a node in the UI.
get_node_storage: Serializes the entire graph's state into a storable format.
get_nodes_overview: Gets a list of dictionary representations for all nodes in the graph.
get_run_info: Gets a summary of the most recent graph execution.
get_vue_flow_input: Formats the graph's nodes and edges into a schema suitable for the VueFlow frontend.
print_tree: Prints the flow graph as a tree.
remove_from_output_cols: Removes specified columns from the list of expected output columns.
reset: Forces a deep reset on all nodes in the graph.
run_graph: Executes the entire data flow graph from start to finish.
save_flow: Saves the current state of the flow graph to a file.

Attributes:

execution_location (ExecutionLocationsLiteral): Gets the current execution location.
execution_mode (ExecutionModeLiteral): Gets the current execution mode ('Development' or 'Performance').
flow_id (int): Gets the unique identifier of the flow.
graph_has_functions (bool): Checks if the graph has any nodes.
graph_has_input_data (bool): Checks if the graph has an initial input data source.
node_connections (List[Tuple[int, int]]): Computes and returns a list of all connections in the graph.
nodes (List[FlowNode]): Gets a list of all FlowNode objects in the graph.
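Execution and inspection go through run_graph and the properties listed above. The following hedged sketch continues the earlier example; the method and attribute names come from the tables above, while the exact contents of the returned values are not shown on this page and are assumptions.

# Hedged sketch: names are taken from the tables above; the printed structures
# (run info fields, connection tuple meaning) are illustrative assumptions.
graph.run_graph()                 # execute the entire data flow graph
print(graph.get_run_info())       # summary of the most recent execution
print(graph.node_connections)     # list of (node_id, node_id) tuples, one per edge (interpretation assumed)
for node in graph.nodes:          # FlowNode objects currently registered in the graph
    print(node.node_id)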

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
class FlowGraph:
    """A class representing a Directed Acyclic Graph (DAG) for data processing pipelines.

    It manages nodes, connections, and the execution of the entire flow.
    """
    uuid: str
    depends_on: Dict[int, Union[ParquetFile, FlowDataEngine, "FlowGraph", pl.DataFrame,]]
    _flow_id: int
    _input_data: Union[ParquetFile, FlowDataEngine, "FlowGraph"]
    _input_cols: List[str]
    _output_cols: List[str]
    _node_db: Dict[Union[str, int], FlowNode]
    _node_ids: List[Union[str, int]]
    _results: Optional[FlowDataEngine] = None
    cache_results: bool = False
    schema: Optional[List[FlowfileColumn]] = None
    has_over_row_function: bool = False
    _flow_starts: List[Union[int, str]] = None
    node_results: List[NodeResult] = None
    latest_run_info: Optional[RunInformation] = None
    start_datetime: datetime = None
    end_datetime: datetime = None
    nodes_completed: int = 0
    flow_settings: schemas.FlowSettings = None
    flow_logger: FlowLogger

    def __init__(self,
                 flow_settings: schemas.FlowSettings | schemas.FlowGraphConfig,
                 name: str = None, input_cols: List[str] = None,
                 output_cols: List[str] = None,
                 path_ref: str = None,
                 input_flow: Union[ParquetFile, FlowDataEngine, "FlowGraph"] = None,
                 cache_results: bool = False):
        """Initializes a new FlowGraph instance.

        Args:
            flow_settings: The configuration settings for the flow.
            name: The name of the flow.
            input_cols: A list of input column names.
            output_cols: A list of output column names.
            path_ref: An optional path to an initial data source.
            input_flow: An optional existing data object to start the flow with.
            cache_results: A global flag to enable or disable result caching.
        """
        if isinstance(flow_settings, schemas.FlowGraphConfig):
            flow_settings = schemas.FlowSettings.from_flow_settings_input(flow_settings)

        self.flow_settings = flow_settings
        self.uuid = str(uuid1())
        self.nodes_completed = 0
        self.start_datetime = None
        self.end_datetime = None
        self.latest_run_info = None
        self.node_results = []
        self._flow_id = flow_settings.flow_id
        self.flow_logger = FlowLogger(flow_settings.flow_id)
        self._flow_starts: List[FlowNode] = []
        self._results = None
        self.schema = None
        self.has_over_row_function = False
        self._input_cols = [] if input_cols is None else input_cols
        self._output_cols = [] if output_cols is None else output_cols
        self._node_ids = []
        self._node_db = {}
        self.cache_results = cache_results
        self.__name__ = name if name else id(self)
        self.depends_on = {}
        if path_ref is not None:
            self.add_datasource(input_schema.NodeDatasource(file_path=path_ref))
        elif input_flow is not None:
            self.add_datasource(input_file=input_flow)

    def add_node_promise(self, node_promise: input_schema.NodePromise):
        """Adds a placeholder node to the graph that is not yet fully configured.

        Useful for building the graph structure before all settings are available.

        Args:
            node_promise: A promise object containing basic node information.
        """
        def placeholder(n: FlowNode = None):
            if n is None:
                return FlowDataEngine()
            return n

        self.add_node_step(node_id=node_promise.node_id, node_type=node_promise.node_type, function=placeholder,
                           setting_input=node_promise)

    def print_tree(self, show_schema=False, show_descriptions=False):
        """
        Print flow_graph as a tree.
        """
        max_node_id = max(self._node_db.keys())

        tree = ""
        tabs = 0
        tab_counter = 0
        for node in self.nodes:
            tab_counter += 1
            node_input = node.setting_input
            operation = str(self._node_db[node_input.node_id]).split("(")[1][:-1].replace("_", " ").title()

            if operation == "Formula":
                operation = "With Columns"

            tree += str(operation) + " (id=" + str(node_input.node_id) + ")"

            if show_descriptions & show_schema:
                raise ValueError('show_descriptions and show_schema cannot be True simultaneously')
            if show_descriptions:
                tree += ": " + str(node_input.description)
            elif show_schema:
                tree += " -> ["
                if operation == "Manual Input":
                    schema = ", ".join([str(i.name) + ": " + str(i.data_type) for i in node_input.raw_data_format.columns])
                    tree += schema
                elif operation == "With Columns":
                    tree_with_col_schema = ", " + node_input.function.field.name + ": " + node_input.function.field.data_type
                    tree += schema + tree_with_col_schema
                elif operation == "Filter":
                    index = node_input.filter_input.advanced_filter.find("]")
                    filtered_column = str(node_input.filter_input.advanced_filter[1:index])
                    schema = re.sub(rf'({re.escape(filtered_column)}: [A-Za-z0-9]+,\s)', "", schema)
                    tree += schema
                elif operation == "Group By":
                    for col in node_input.groupby_input.agg_cols:
                        schema = re.sub(re.escape(str(col.old_name)) + r': [a-z0-9]+, ', "", schema)
                    tree += schema
                tree += "]"
            else:
                if operation == "Manual Input":
                    tree += ": " + str(node_input.raw_data_format.data)
                elif operation == "With Columns":
                    tree += ": " + str(node_input.function)
                elif operation == "Filter":
                    tree += ": " + str(node_input.filter_input.advanced_filter)
                elif operation == "Group By":
                    tree += ": groupby=[" + ", ".join([col.old_name for col in node_input.groupby_input.agg_cols if col.agg == "groupby"]) + "], "
                    tree += "agg=[" + ", ".join([str(col.agg) + "(" + str(col.old_name) + ")" for col in node_input.groupby_input.agg_cols if col.agg != "groupby"]) + "]"

            if node_input.node_id < max_node_id:
                tree += "\n" + "# " + " "*3*(tabs-1) + "|___ "
            print("\n"*2)

        return print(tree)

    def apply_layout(self, y_spacing: int = 150, x_spacing: int = 200, initial_y: int = 100):
        """Calculates and applies a layered layout to all nodes in the graph.

        This updates their x and y positions for UI rendering.

        Args:
            y_spacing: The vertical spacing between layers.
            x_spacing: The horizontal spacing between nodes in the same layer.
            initial_y: The initial y-position for the first layer.
        """
        self.flow_logger.info("Applying layered layout...")
        start_time = time()
        try:
            # Calculate new positions for all nodes
            new_positions = calculate_layered_layout(
                self, y_spacing=y_spacing, x_spacing=x_spacing, initial_y=initial_y
            )

            if not new_positions:
                self.flow_logger.warning("Layout calculation returned no positions.")
                return

            # Apply the new positions to the setting_input of each node
            updated_count = 0
            for node_id, (pos_x, pos_y) in new_positions.items():
                node = self.get_node(node_id)
                if node and hasattr(node, 'setting_input'):
                    setting = node.setting_input
                    if hasattr(setting, 'pos_x') and hasattr(setting, 'pos_y'):
                        setting.pos_x = pos_x
                        setting.pos_y = pos_y
                        updated_count += 1
                    else:
                        self.flow_logger.warning(f"Node {node_id} setting_input ({type(setting)}) lacks pos_x/pos_y attributes.")
                elif node:
                    self.flow_logger.warning(f"Node {node_id} lacks setting_input attribute.")
                # else: Node not found, already warned by calculate_layered_layout

            end_time = time()
            self.flow_logger.info(f"Layout applied to {updated_count}/{len(self.nodes)} nodes in {end_time - start_time:.2f} seconds.")

        except Exception as e:
            self.flow_logger.error(f"Error applying layout: {e}")
            raise  # Optional: re-raise the exception

    @property
    def flow_id(self) -> int:
        """Gets the unique identifier of the flow."""
        return self._flow_id

    @flow_id.setter
    def flow_id(self, new_id: int):
        """Sets the unique identifier for the flow and updates all child nodes.

        Args:
            new_id: The new flow ID.
        """
        self._flow_id = new_id
        for node in self.nodes:
            if hasattr(node.setting_input, 'flow_id'):
                node.setting_input.flow_id = new_id
        self.flow_settings.flow_id = new_id

    def __repr__(self):
        """Provides the official string representation of the FlowGraph instance."""
        settings_str = "  -" + '\n  -'.join(f"{k}: {v}" for k, v in self.flow_settings)
        return f"FlowGraph(\nNodes: {self._node_db}\n\nSettings:\n{settings_str}"

    def get_nodes_overview(self):
        """Gets a list of dictionary representations for all nodes in the graph."""
        output = []
        for v in self._node_db.values():
            output.append(v.get_repr())
        return output

    def remove_from_output_cols(self, columns: List[str]):
        """Removes specified columns from the list of expected output columns.

        Args:
            columns: A list of column names to remove.
        """
        cols = set(columns)
        self._output_cols = [c for c in self._output_cols if c not in cols]

    def get_node(self, node_id: Union[int, str] = None) -> FlowNode | None:
        """Retrieves a node from the graph by its ID.

        Args:
            node_id: The ID of the node to retrieve. If None, retrieves the last added node.

        Returns:
            The FlowNode object, or None if not found.
        """
        if node_id is None:
            node_id = self._node_ids[-1]
        node = self._node_db.get(node_id)
        if node is not None:
            return node

    def add_pivot(self, pivot_settings: input_schema.NodePivot):
        """Adds a pivot node to the graph.

        Args:
            pivot_settings: The settings for the pivot operation.
        """

        def _func(fl: FlowDataEngine):
            return fl.do_pivot(pivot_settings.pivot_input, self.flow_logger.get_node_logger(pivot_settings.node_id))

        self.add_node_step(node_id=pivot_settings.node_id,
                           function=_func,
                           node_type='pivot',
                           setting_input=pivot_settings,
                           input_node_ids=[pivot_settings.depending_on_id])

        node = self.get_node(pivot_settings.node_id)

        def schema_callback():
            input_data = node.singular_main_input.get_resulting_data()  # get from the previous step the data
            input_data.lazy = True  # ensure the dataset is lazy
            input_lf = input_data.data_frame  # get the lazy frame
            return pre_calculate_pivot_schema(input_data.schema, pivot_settings.pivot_input, input_lf=input_lf)
        node.schema_callback = schema_callback

    def add_unpivot(self, unpivot_settings: input_schema.NodeUnpivot):
        """Adds an unpivot node to the graph.

        Args:
            unpivot_settings: The settings for the unpivot operation.
        """

        def _func(fl: FlowDataEngine) -> FlowDataEngine:
            return fl.unpivot(unpivot_settings.unpivot_input)

        self.add_node_step(node_id=unpivot_settings.node_id,
                           function=_func,
                           node_type='unpivot',
                           setting_input=unpivot_settings,
                           input_node_ids=[unpivot_settings.depending_on_id])

    def add_union(self, union_settings: input_schema.NodeUnion):
        """Adds a union node to combine multiple data streams.

        Args:
            union_settings: The settings for the union operation.
        """

        def _func(*flowfile_tables: FlowDataEngine):
            dfs: List[pl.LazyFrame] | List[pl.DataFrame] = [flt.data_frame for flt in flowfile_tables]
            return FlowDataEngine(pl.concat(dfs, how='diagonal_relaxed'))

        self.add_node_step(node_id=union_settings.node_id,
                           function=_func,
                           node_type=f'union',
                           setting_input=union_settings,
                           input_node_ids=union_settings.depending_on_ids)

    def add_initial_node_analysis(self, node_promise: input_schema.NodePromise):
        """Adds a data exploration/analysis node based on a node promise.

        Args:
            node_promise: The promise representing the node to be analyzed.
        """
        node_analysis = create_graphic_walker_node_from_node_promise(node_promise)
        self.add_explore_data(node_analysis)

    def add_explore_data(self, node_analysis: input_schema.NodeExploreData):
        """Adds a specialized node for data exploration and visualization.

        Args:
            node_analysis: The settings for the data exploration node.
        """
        sample_size: int = 10000

        def analysis_preparation(flowfile_table: FlowDataEngine):
            if flowfile_table.number_of_records <= 0:
                number_of_records = flowfile_table.get_number_of_records(calculate_in_worker_process=True)
            else:
                number_of_records = flowfile_table.number_of_records
            if number_of_records > sample_size:
                flowfile_table = flowfile_table.get_sample(sample_size, random=True)
            external_sampler = ExternalDfFetcher(
                lf=flowfile_table.data_frame,
                file_ref="__gf_walker"+node.hash,
                wait_on_completion=True,
                node_id=node.node_id,
                flow_id=self.flow_id,
            )
            node.results.analysis_data_generator = get_read_top_n(external_sampler.status.file_ref,
                                                                  n=min(sample_size, number_of_records))
            return flowfile_table

        def schema_callback():
            node = self.get_node(node_analysis.node_id)
            if len(node.all_inputs) == 1:
                input_node = node.all_inputs[0]
                return input_node.schema
            else:
                return [FlowfileColumn.from_input('col_1', 'na')]

        self.add_node_step(node_id=node_analysis.node_id, node_type='explore_data',
                           function=analysis_preparation,
                           setting_input=node_analysis, schema_callback=schema_callback)
        node = self.get_node(node_analysis.node_id)

    def add_group_by(self, group_by_settings: input_schema.NodeGroupBy):
        """Adds a group-by aggregation node to the graph.

        Args:
            group_by_settings: The settings for the group-by operation.
        """

        def _func(fl: FlowDataEngine) -> FlowDataEngine:
            return fl.do_group_by(group_by_settings.groupby_input, False)

        self.add_node_step(node_id=group_by_settings.node_id,
                           function=_func,
                           node_type=f'group_by',
                           setting_input=group_by_settings,
                           input_node_ids=[group_by_settings.depending_on_id])

        node = self.get_node(group_by_settings.node_id)

        def schema_callback():

            output_columns = [(c.old_name, c.new_name, c.output_type) for c in group_by_settings.groupby_input.agg_cols]
            depends_on = node.node_inputs.main_inputs[0]
            input_schema_dict: Dict[str, str] = {s.name: s.data_type for s in depends_on.schema}
            output_schema = []
            for old_name, new_name, data_type in output_columns:
                data_type = input_schema_dict[old_name] if data_type is None else data_type
                output_schema.append(FlowfileColumn.from_input(data_type=data_type, column_name=new_name))
            return output_schema

        node.schema_callback = schema_callback

    def add_filter(self, filter_settings: input_schema.NodeFilter):
        """Adds a filter node to the graph.

        Args:
            filter_settings: The settings for the filter operation.
        """

        is_advanced = filter_settings.filter_input.filter_type == 'advanced'
        if is_advanced:
            predicate = filter_settings.filter_input.advanced_filter
        else:
            _basic_filter = filter_settings.filter_input.basic_filter
            filter_settings.filter_input.advanced_filter = (f'[{_basic_filter.field}]{_basic_filter.filter_type}"'
                                                            f'{_basic_filter.filter_value}"')

        def _func(fl: FlowDataEngine):
            is_advanced = filter_settings.filter_input.filter_type == 'advanced'
            if is_advanced:
                return fl.do_filter(predicate)
            else:
                basic_filter = filter_settings.filter_input.basic_filter
                if basic_filter.filter_value.isnumeric():
                    field_data_type = fl.get_schema_column(basic_filter.field).generic_datatype()
                    if field_data_type == 'str':
                        _f = f'[{basic_filter.field}]{basic_filter.filter_type}"{basic_filter.filter_value}"'
                    else:
                        _f = f'[{basic_filter.field}]{basic_filter.filter_type}{basic_filter.filter_value}'
                else:
                    _f = f'[{basic_filter.field}]{basic_filter.filter_type}"{basic_filter.filter_value}"'
                filter_settings.filter_input.advanced_filter = _f
                return fl.do_filter(_f)

        self.add_node_step(filter_settings.node_id, _func,
                           node_type='filter',
                           renew_schema=False,
                           setting_input=filter_settings,
                           input_node_ids=[filter_settings.depending_on_id]
                           )

    def add_record_count(self, node_number_of_records: input_schema.NodeRecordCount):
        """Adds a filter node to the graph.

        Args:
            node_number_of_records: The settings for the record count operation.
        """

        def _func(fl: FlowDataEngine) -> FlowDataEngine:
            return fl.get_record_count()

        self.add_node_step(node_id=node_number_of_records.node_id,
                           function=_func,
                           node_type='record_count',
                           setting_input=node_number_of_records,
                           input_node_ids=[node_number_of_records.depending_on_id])

    def add_polars_code(self, node_polars_code: input_schema.NodePolarsCode):
        """Adds a node that executes custom Polars code.

        Args:
            node_polars_code: The settings for the Polars code node.
        """

        def _func(*flowfile_tables: FlowDataEngine) -> FlowDataEngine:
            return execute_polars_code(*flowfile_tables, code=node_polars_code.polars_code_input.polars_code)
        self.add_node_step(node_id=node_polars_code.node_id,
                           function=_func,
                           node_type='polars_code',
                           setting_input=node_polars_code,
                           input_node_ids=node_polars_code.depending_on_ids)

        try:
            polars_code_parser.validate_code(node_polars_code.polars_code_input.polars_code)
        except Exception as e:
            node = self.get_node(node_id=node_polars_code.node_id)
            node.results.errors = str(e)

    def add_dependency_on_polars_lazy_frame(self,
                                            lazy_frame: pl.LazyFrame,
                                            node_id: int):
        """Adds a special node that directly injects a Polars LazyFrame into the graph.

        Note: This is intended for backend use and will not work in the UI editor.

        Args:
            lazy_frame: The Polars LazyFrame to inject.
            node_id: The ID for the new node.
        """
        def _func():
            return FlowDataEngine(lazy_frame)
        node_promise = input_schema.NodePromise(flow_id=self.flow_id,
                                                node_id=node_id, node_type="polars_lazy_frame",
                                                is_setup=True)
        self.add_node_step(node_id=node_promise.node_id, node_type=node_promise.node_type, function=_func,
                           setting_input=node_promise)

    def add_unique(self, unique_settings: input_schema.NodeUnique):
        """Adds a node to find and remove duplicate rows.

        Args:
            unique_settings: The settings for the unique operation.
        """

        def _func(fl: FlowDataEngine) -> FlowDataEngine:
            return fl.make_unique(unique_settings.unique_input)

        self.add_node_step(node_id=unique_settings.node_id,
                           function=_func,
                           input_columns=[],
                           node_type='unique',
                           setting_input=unique_settings,
                           input_node_ids=[unique_settings.depending_on_id])

    def add_graph_solver(self, graph_solver_settings: input_schema.NodeGraphSolver):
        """Adds a node that solves graph-like problems within the data.

        This node can be used for operations like finding network paths,
        calculating connected components, or performing other graph algorithms
        on relational data that represents nodes and edges.

        Args:
            graph_solver_settings: The settings object defining the graph inputs
                and the specific algorithm to apply.
        """
        def _func(fl: FlowDataEngine) -> FlowDataEngine:
            return fl.solve_graph(graph_solver_settings.graph_solver_input)

        self.add_node_step(node_id=graph_solver_settings.node_id,
                           function=_func,
                           node_type='graph_solver',
                           setting_input=graph_solver_settings,
                           input_node_ids=[graph_solver_settings.depending_on_id])

    def add_formula(self, function_settings: input_schema.NodeFormula):
        """Adds a node that applies a formula to create or modify a column.

        Args:
            function_settings: The settings for the formula operation.
        """

        error = ""
        if function_settings.function.field.data_type not in (None, "Auto"):
            output_type = cast_str_to_polars_type(function_settings.function.field.data_type)
        else:
            output_type = None
        if output_type not in (None, "Auto"):
            new_col = [FlowfileColumn.from_input(column_name=function_settings.function.field.name,
                                                 data_type=str(output_type))]
        else:
            new_col = [FlowfileColumn.from_input(function_settings.function.field.name, 'String')]

        def _func(fl: FlowDataEngine):
            return fl.apply_sql_formula(func=function_settings.function.function,
                                        col_name=function_settings.function.field.name,
                                        output_data_type=output_type)

        self.add_node_step(function_settings.node_id, _func,
                           output_schema=new_col,
                           node_type='formula',
                           renew_schema=False,
                           setting_input=function_settings,
                           input_node_ids=[function_settings.depending_on_id]
                           )
        if error != "":
            node = self.get_node(function_settings.node_id)
            node.results.errors = error
            return False, error
        else:
            return True, ""

    def add_cross_join(self, cross_join_settings: input_schema.NodeCrossJoin) -> "FlowGraph":
        """Adds a cross join node to the graph.

        Args:
            cross_join_settings: The settings for the cross join operation.

        Returns:
            The `FlowGraph` instance for method chaining.
        """

        def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
            for left_select in cross_join_settings.cross_join_input.left_select.renames:
                left_select.is_available = True if left_select.old_name in main.schema else False
            for right_select in cross_join_settings.cross_join_input.right_select.renames:
                right_select.is_available = True if right_select.old_name in right.schema else False

            return main.do_cross_join(cross_join_input=cross_join_settings.cross_join_input,
                                      auto_generate_selection=cross_join_settings.auto_generate_selection,
                                      verify_integrity=False,
                                      other=right)

        self.add_node_step(node_id=cross_join_settings.node_id,
                           function=_func,
                           input_columns=[],
                           node_type='cross_join',
                           setting_input=cross_join_settings,
                           input_node_ids=cross_join_settings.depending_on_ids)
        return self

    def add_join(self, join_settings: input_schema.NodeJoin) -> "FlowGraph":
        """Adds a join node to combine two data streams based on key columns.

        Args:
            join_settings: The settings for the join operation.

        Returns:
            The `FlowGraph` instance for method chaining.
        """

        def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
            for left_select in join_settings.join_input.left_select.renames:
                left_select.is_available = True if left_select.old_name in main.schema else False
            for right_select in join_settings.join_input.right_select.renames:
                right_select.is_available = True if right_select.old_name in right.schema else False

            return main.join(join_input=join_settings.join_input,
                             auto_generate_selection=join_settings.auto_generate_selection,
                             verify_integrity=False,
                             other=right)

        self.add_node_step(node_id=join_settings.node_id,
                           function=_func,
                           input_columns=[],
                           node_type='join',
                           setting_input=join_settings,
                           input_node_ids=join_settings.depending_on_ids)
        return self

    def add_fuzzy_match(self, fuzzy_settings: input_schema.NodeFuzzyMatch) -> "FlowGraph":
        """Adds a fuzzy matching node to join data on approximate string matches.

        Args:
            fuzzy_settings: The settings for the fuzzy match operation.

        Returns:
            The `FlowGraph` instance for method chaining.
        """

        def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
            f = main.start_fuzzy_join(fuzzy_match_input=fuzzy_settings.join_input, other=right, file_ref=node.hash,
                                      flow_id=self.flow_id, node_id=fuzzy_settings.node_id)
            logger.info("Started the fuzzy match action")
            node._fetch_cached_df = f
            return FlowDataEngine(f.get_result())

        self.add_node_step(node_id=fuzzy_settings.node_id,
                           function=_func,
                           input_columns=[],
                           node_type='fuzzy_match',
                           setting_input=fuzzy_settings)
        node = self.get_node(node_id=fuzzy_settings.node_id)

        def schema_callback():
            return calculate_fuzzy_match_schema(fuzzy_settings.join_input,
                                                left_schema=node.node_inputs.main_inputs[0].schema,
                                                right_schema=node.node_inputs.right_input.schema
                                                )

        node.schema_callback = schema_callback
        return self

    def add_text_to_rows(self, node_text_to_rows: input_schema.NodeTextToRows) -> "FlowGraph":
        """Adds a node that splits cell values into multiple rows.

        This is useful for un-nesting data where a single field contains multiple
        values separated by a delimiter.

        Args:
            node_text_to_rows: The settings object that specifies the column to split
                and the delimiter to use.

        Returns:
            The `FlowGraph` instance for method chaining.
        """
        def _func(table: FlowDataEngine) -> FlowDataEngine:
            return table.split(node_text_to_rows.text_to_rows_input)

        self.add_node_step(node_id=node_text_to_rows.node_id,
                           function=_func,
                           node_type='text_to_rows',
                           setting_input=node_text_to_rows,
                           input_node_ids=[node_text_to_rows.depending_on_id])
        return self

    def add_sort(self, sort_settings: input_schema.NodeSort) -> "FlowGraph":
        """Adds a node to sort the data based on one or more columns.

        Args:
            sort_settings: The settings for the sort operation.

        Returns:
            The `FlowGraph` instance for method chaining.
        """

        def _func(table: FlowDataEngine) -> FlowDataEngine:
            return table.do_sort(sort_settings.sort_input)

        self.add_node_step(node_id=sort_settings.node_id,
                           function=_func,
                           node_type='sort',
                           setting_input=sort_settings,
                           input_node_ids=[sort_settings.depending_on_id])
        return self

    def add_sample(self, sample_settings: input_schema.NodeSample) -> "FlowGraph":
        """Adds a node to take a random or top-N sample of the data.

        Args:
            sample_settings: The settings object specifying the size of the sample.

        Returns:
            The `FlowGraph` instance for method chaining.
        """
        def _func(table: FlowDataEngine) -> FlowDataEngine:
            return table.get_sample(sample_settings.sample_size)

        self.add_node_step(node_id=sample_settings.node_id,
                           function=_func,
                           node_type='sample',
                           setting_input=sample_settings,
                           input_node_ids=[sample_settings.depending_on_id]
                           )
        return self

    def add_record_id(self, record_id_settings: input_schema.NodeRecordId) -> "FlowGraph":
        """Adds a node to create a new column with a unique ID for each record.

        Args:
            record_id_settings: The settings object specifying the name of the
                new record ID column.

        Returns:
            The `FlowGraph` instance for method chaining.
        """

        def _func(table: FlowDataEngine) -> FlowDataEngine:
            return table.add_record_id(record_id_settings.record_id_input)

        self.add_node_step(node_id=record_id_settings.node_id,
                           function=_func,
                           node_type='record_id',
                           setting_input=record_id_settings,
                           input_node_ids=[record_id_settings.depending_on_id]
                           )
        return self

    def add_select(self, select_settings: input_schema.NodeSelect) -> "FlowGraph":
        """Adds a node to select, rename, reorder, or drop columns.

        Args:
            select_settings: The settings for the select operation.

        Returns:
            The `FlowGraph` instance for method chaining.
        """

        select_cols = select_settings.select_input
        drop_cols = tuple(s.old_name for s in select_settings.select_input)

        def _func(table: FlowDataEngine) -> FlowDataEngine:
            input_cols = set(f.name for f in table.schema)
            ids_to_remove = []
            for i, select_col in enumerate(select_cols):
                if select_col.data_type is None:
                    select_col.data_type = table.get_schema_column(select_col.old_name).data_type
                if select_col.old_name not in input_cols:
                    select_col.is_available = False
                    if not select_col.keep:
                        ids_to_remove.append(i)
                else:
                    select_col.is_available = True
            ids_to_remove.reverse()
            for i in ids_to_remove:
                v = select_cols.pop(i)
                del v
            return table.do_select(select_inputs=transform_schema.SelectInputs(select_cols),
                                   keep_missing=select_settings.keep_missing)

        self.add_node_step(node_id=select_settings.node_id,
                           function=_func,
                           input_columns=[],
                           node_type='select',
                           drop_columns=list(drop_cols),
                           setting_input=select_settings,
                           input_node_ids=[select_settings.depending_on_id])
        return self

    @property
    def graph_has_functions(self) -> bool:
        """Checks if the graph has any nodes."""
        return len(self._node_ids) > 0

    def delete_node(self, node_id: Union[int, str]):
        """Deletes a node from the graph and updates all its connections.

        Args:
            node_id: The ID of the node to delete.

        Raises:
            Exception: If the node with the given ID does not exist.
        """
        logger.info(f"Starting deletion of node with ID: {node_id}")

        node = self._node_db.get(node_id)
        if node:
            logger.info(f"Found node: {node_id}, processing deletion")

            lead_to_steps: List[FlowNode] = node.leads_to_nodes
            logger.debug(f"Node {node_id} leads to {len(lead_to_steps)} other nodes")

            if len(lead_to_steps) > 0:
                for lead_to_step in lead_to_steps:
                    logger.debug(f"Deleting input node {node_id} from dependent node {lead_to_step}")
                    lead_to_step.delete_input_node(node_id, complete=True)

            if not node.is_start:
                depends_on: List[FlowNode] = node.node_inputs.get_all_inputs()
                logger.debug(f"Node {node_id} depends on {len(depends_on)} other nodes")

                for depend_on in depends_on:
                    logger.debug(f"Removing lead_to reference {node_id} from node {depend_on}")
                    depend_on.delete_lead_to_node(node_id)

            self._node_db.pop(node_id)
            logger.debug(f"Successfully removed node {node_id} from node_db")
            del node
            logger.info("Node object deleted")
        else:
            logger.error(f"Failed to find node with id {node_id}")
            raise Exception(f"Node with id {node_id} does not exist")

    @property
    def graph_has_input_data(self) -> bool:
        """Checks if the graph has an initial input data source."""
        return self._input_data is not None

    def add_node_step(self,
                      node_id: Union[int, str],
                      function: Callable,
                      input_columns: List[str] = None,
                      output_schema: List[FlowfileColumn] = None,
                      node_type: str = None,
                      drop_columns: List[str] = None,
                      renew_schema: bool = True,
                      setting_input: Any = None,
                      cache_results: bool = None,
                      schema_callback: Callable = None,
                      input_node_ids: List[int] = None) -> FlowNode:
        """The core method for adding or updating a node in the graph.

        Args:
            node_id: The unique ID for the node.
            function: The core processing function for the node.
            input_columns: A list of input column names required by the function.
            output_schema: A predefined schema for the node's output.
            node_type: A string identifying the type of node (e.g., 'filter', 'join').
            drop_columns: A list of columns to be dropped after the function executes.
            renew_schema: If True, the schema is recalculated after execution.
            setting_input: A configuration object containing settings for the node.
            cache_results: If True, the node's results are cached for future runs.
            schema_callback: A function that dynamically calculates the output schema.
            input_node_ids: A list of IDs for the nodes that this node depends on.

        Returns:
            The created or updated FlowNode object.
        """
        existing_node = self.get_node(node_id)
        if existing_node is not None:
            if existing_node.node_type != node_type:
                self.delete_node(existing_node.node_id)
                existing_node = None
        if existing_node:
            input_nodes = existing_node.all_inputs
        elif input_node_ids is not None:
            input_nodes = [self.get_node(node_id) for node_id in input_node_ids]
        else:
            input_nodes = None
        if isinstance(input_columns, str):
            input_columns = [input_columns]
        if (
                input_nodes is not None or
                function.__name__ in ('placeholder', 'analysis_preparation') or
                node_type in ("cloud_storage_reader", "polars_lazy_frame", "input_data")
        ):
            if not existing_node:
                node = FlowNode(node_id=node_id,
                                function=function,
                                output_schema=output_schema,
                                input_columns=input_columns,
                                drop_columns=drop_columns,
                                renew_schema=renew_schema,
                                setting_input=setting_input,
                                node_type=node_type,
                                name=function.__name__,
                                schema_callback=schema_callback,
                                parent_uuid=self.uuid)
            else:
                existing_node.update_node(function=function,
                                          output_schema=output_schema,
                                          input_columns=input_columns,
                                          drop_columns=drop_columns,
                                          setting_input=setting_input,
                                          schema_callback=schema_callback)
                node = existing_node
        else:
            raise Exception("No data initialized")
        self._node_db[node_id] = node
        self._node_ids.append(node_id)
        return node

    def add_include_cols(self, include_columns: List[str]):
        """Adds columns to both the input and output column lists.

        Args:
            include_columns: A list of column names to include.
        """
        for column in include_columns:
            if column not in self._input_cols:
                self._input_cols.append(column)
            if column not in self._output_cols:
                self._output_cols.append(column)
        return self

    def add_output(self, output_file: input_schema.NodeOutput):
        """Adds an output node to write the final data to a destination.

        Args:
            output_file: The settings for the output file.
        """

        def _func(df: FlowDataEngine):
            output_file.output_settings.populate_abs_file_path()
            execute_remote = self.execution_location != 'local'
            df.output(output_fs=output_file.output_settings, flow_id=self.flow_id, node_id=output_file.node_id,
                      execute_remote=execute_remote)
            return df

        def schema_callback():
            input_node: FlowNode = self.get_node(output_file.node_id).node_inputs.main_inputs[0]

            return input_node.schema
        input_node_id = getattr(output_file, "depending_on_id") if hasattr(output_file, 'depending_on_id') else None
        self.add_node_step(node_id=output_file.node_id,
                           function=_func,
                           input_columns=[],
                           node_type='output',
                           setting_input=output_file,
                           schema_callback=schema_callback,
                           input_node_ids=[input_node_id])

    def add_database_writer(self, node_database_writer: input_schema.NodeDatabaseWriter):
        """Adds a node to write data to a database.

        Args:
            node_database_writer: The settings for the database writer node.
        """

        node_type = 'database_writer'
        database_settings: input_schema.DatabaseWriteSettings = node_database_writer.database_write_settings
        database_connection: Optional[input_schema.DatabaseConnection | input_schema.FullDatabaseConnection]
        if database_settings.connection_mode == 'inline':
            database_connection: input_schema.DatabaseConnection = database_settings.database_connection
            encrypted_password = get_encrypted_secret(current_user_id=node_database_writer.user_id,
                                                      secret_name=database_connection.password_ref)
            if encrypted_password is None:
                raise HTTPException(status_code=400, detail="Password not found")
        else:
            database_reference_settings = get_local_database_connection(database_settings.database_connection_name,
                                                                        node_database_writer.user_id)
            encrypted_password = database_reference_settings.password.get_secret_value()

        def _func(df: FlowDataEngine):
            df.lazy = True
            database_external_write_settings = (
                sql_models.DatabaseExternalWriteSettings.create_from_from_node_database_writer(
                    node_database_writer=node_database_writer,
                    password=encrypted_password,
                    table_name=(database_settings.schema_name+'.'+database_settings.table_name
                                if database_settings.schema_name else database_settings.table_name),
                    database_reference_settings=(database_reference_settings if database_settings.connection_mode == 'reference'
                                                 else None),
                    lf=df.data_frame
                )
            )
            external_database_writer = ExternalDatabaseWriter(database_external_write_settings, wait_on_completion=False)
            node._fetch_cached_df = external_database_writer
            external_database_writer.get_result()
            return df

        def schema_callback():
            input_node: FlowNode = self.get_node(node_database_writer.node_id).node_inputs.main_inputs[0]
            return input_node.schema

        self.add_node_step(
            node_id=node_database_writer.node_id,
            function=_func,
            input_columns=[],
            node_type=node_type,
            setting_input=node_database_writer,
            schema_callback=schema_callback,
        )
        node = self.get_node(node_database_writer.node_id)

    def add_database_reader(self, node_database_reader: input_schema.NodeDatabaseReader):
        """Adds a node to read data from a database.

        Args:
            node_database_reader: The settings for the database reader node.
        """

        logger.info("Adding database reader")
        node_type = 'database_reader'
        database_settings: input_schema.DatabaseSettings = node_database_reader.database_settings
        database_connection: Optional[input_schema.DatabaseConnection | input_schema.FullDatabaseConnection]
        if database_settings.connection_mode == 'inline':
            database_connection: input_schema.DatabaseConnection = database_settings.database_connection
            encrypted_password = get_encrypted_secret(current_user_id=node_database_reader.user_id,
                                                      secret_name=database_connection.password_ref)
            if encrypted_password is None:
                raise HTTPException(status_code=400, detail="Password not found")
        else:
            database_reference_settings = get_local_database_connection(database_settings.database_connection_name,
                                                                        node_database_reader.user_id)
            database_connection = database_reference_settings
            encrypted_password = database_reference_settings.password.get_secret_value()

        def _func():
            sql_source = BaseSqlSource(query=None if database_settings.query_mode == 'table' else database_settings.query,
                                       table_name=database_settings.table_name,
                                       schema_name=database_settings.schema_name,
                                       fields=node_database_reader.fields,
                                       )
            database_external_read_settings = (
                sql_models.DatabaseExternalReadSettings.create_from_from_node_database_reader(
                    node_database_reader=node_database_reader,
                    password=encrypted_password,
                    query=sql_source.query,
                    database_reference_settings=(database_reference_settings if database_settings.connection_mode == 'reference'
                                                 else None),
                )
            )

            external_database_fetcher = ExternalDatabaseFetcher(database_external_read_settings, wait_on_completion=False)
            node._fetch_cached_df = external_database_fetcher
            fl = FlowDataEngine(external_database_fetcher.get_result())
            node_database_reader.fields = [c.get_minimal_field_info() for c in fl.schema]
            return fl

        def schema_callback():
            sql_source = SqlSource(connection_string=
                                   sql_utils.construct_sql_uri(database_type=database_connection.database_type,
                                                               host=database_connection.host,
                                                               port=database_connection.port,
                                                               database=database_connection.database,
                                                               username=database_connection.username,
                                                               password=decrypt_secret(encrypted_password)),
                                   query=None if database_settings.query_mode == 'table' else database_settings.query,
                                   table_name=database_settings.table_name,
                                   schema_name=database_settings.schema_name,
                                   fields=node_database_reader.fields,
                                   )
            return sql_source.get_schema()

        node = self.get_node(node_database_reader.node_id)
        if node:
            node.node_type = node_type
            node.name = node_type
            node.function = _func
            node.setting_input = node_database_reader
            node.node_settings.cache_results = node_database_reader.cache_results
            if node_database_reader.node_id not in set(start_node.node_id for start_node in self._flow_starts):
                self._flow_starts.append(node)
            node.schema_callback = schema_callback
        else:
            node = FlowNode(node_database_reader.node_id, function=_func,
                            setting_input=node_database_reader,
                            name=node_type, node_type=node_type, parent_uuid=self.uuid,
                            schema_callback=schema_callback)
            self._node_db[node_database_reader.node_id] = node
            self._flow_starts.append(node)
            self._node_ids.append(node_database_reader.node_id)

    def add_sql_source(self, external_source_input: input_schema.NodeExternalSource):
        """Adds a node that reads data from a SQL source.

        This is a convenience alias for `add_external_source`.

        Args:
            external_source_input: The settings for the external SQL source node.
        """
        logger.info('Adding sql source')
        self.add_external_source(external_source_input)

    def add_cloud_storage_writer(self, node_cloud_storage_writer: input_schema.NodeCloudStorageWriter) -> None:
        """Adds a node to write data to a cloud storage provider.

        Args:
            node_cloud_storage_writer: The settings for the cloud storage writer node.
        """

        node_type = "cloud_storage_writer"
        def _func(df: FlowDataEngine):
            df.lazy = True
            execute_remote = self.execution_location != 'local'
            cloud_connection_settings = get_cloud_connection_settings(
                connection_name=node_cloud_storage_writer.cloud_storage_settings.connection_name,
                user_id=node_cloud_storage_writer.user_id,
                auth_mode=node_cloud_storage_writer.cloud_storage_settings.auth_mode
            )
            full_cloud_storage_connection = FullCloudStorageConnection(
                storage_type=cloud_connection_settings.storage_type,
                auth_method=cloud_connection_settings.auth_method,
                aws_allow_unsafe_html=cloud_connection_settings.aws_allow_unsafe_html,
                **CloudStorageReader.get_storage_options(cloud_connection_settings)
            )
            if execute_remote:
                settings = get_cloud_storage_write_settings_worker_interface(
                    write_settings=node_cloud_storage_writer.cloud_storage_settings,
                    connection=full_cloud_storage_connection,
                    lf=df.data_frame,
                    flowfile_node_id=node_cloud_storage_writer.node_id,
                    flowfile_flow_id=self.flow_id)
                external_cloud_writer = ExternalCloudWriter(settings, wait_on_completion=False)
                node._fetch_cached_df = external_cloud_writer
                external_cloud_writer.get_result()
            else:
                cloud_storage_write_settings_internal = CloudStorageWriteSettingsInternal(
                    connection=full_cloud_storage_connection,
                    write_settings=node_cloud_storage_writer.cloud_storage_settings,
                )
                df.to_cloud_storage_obj(cloud_storage_write_settings_internal)
            return df

        def schema_callback():
            logger.info("Starting to run the schema callback for cloud storage writer")
            if self.get_node(node_cloud_storage_writer.node_id).is_correct:
                return self.get_node(node_cloud_storage_writer.node_id).node_inputs.main_inputs[0].schema
            else:
                return [FlowfileColumn.from_input(column_name="__error__", data_type="String")]

        self.add_node_step(
            node_id=node_cloud_storage_writer.node_id,
            function=_func,
            input_columns=[],
            node_type=node_type,
            setting_input=node_cloud_storage_writer,
            schema_callback=schema_callback,
            input_node_ids=[node_cloud_storage_writer.depending_on_id]
        )

        node = self.get_node(node_cloud_storage_writer.node_id)

    def add_cloud_storage_reader(self, node_cloud_storage_reader: input_schema.NodeCloudStorageReader) -> None:
        """Adds a cloud storage read node to the flow graph.

        Args:
            node_cloud_storage_reader: The settings for the cloud storage read node.
        """
        node_type = "cloud_storage_reader"
        logger.info("Adding cloud storage reader")
        cloud_storage_read_settings = node_cloud_storage_reader.cloud_storage_settings

        def _func():
            logger.info("Starting to run the schema callback for cloud storage reader")
            self.flow_logger.info("Starting to run the schema callback for cloud storage reader")
            settings = CloudStorageReadSettingsInternal(read_settings=cloud_storage_read_settings,
                                                        connection=get_cloud_connection_settings(
                                                            connection_name=cloud_storage_read_settings.connection_name,
                                                            user_id=node_cloud_storage_reader.user_id,
                                                            auth_mode=cloud_storage_read_settings.auth_mode
                                                        ))
            fl = FlowDataEngine.from_cloud_storage_obj(settings)
            return fl

        node = self.add_node_step(node_id=node_cloud_storage_reader.node_id,
                                  function=_func,
                                  cache_results=node_cloud_storage_reader.cache_results,
                                  setting_input=node_cloud_storage_reader,
                                  node_type=node_type,
                                  )
        if node_cloud_storage_reader.node_id not in set(start_node.node_id for start_node in self._flow_starts):
            self._flow_starts.append(node)

    def add_external_source(self,
                            external_source_input: input_schema.NodeExternalSource):
        """Adds a node for a custom external data source.

        Args:
            external_source_input: The settings for the external source node.
        """

        node_type = 'external_source'
        external_source_script = getattr(external_sources.custom_external_sources, external_source_input.identifier)
        source_settings = (getattr(input_schema, snake_case_to_camel_case(external_source_input.identifier)).
                           model_validate(external_source_input.source_settings))
        if hasattr(external_source_script, 'initial_getter'):
            initial_getter = getattr(external_source_script, 'initial_getter')(source_settings)
        else:
            initial_getter = None
        data_getter = external_source_script.getter(source_settings)
        external_source = data_source_factory(source_type='custom',
                                              data_getter=data_getter,
                                              initial_data_getter=initial_getter,
                                              orientation=external_source_input.source_settings.orientation,
                                              schema=None)

        def _func():
            logger.info('Calling external source')
            fl = FlowDataEngine.create_from_external_source(external_source=external_source)
            external_source_input.source_settings.fields = [c.get_minimal_field_info() for c in fl.schema]
            return fl

        node = self.get_node(external_source_input.node_id)
        if node:
            node.node_type = node_type
            node.name = node_type
            node.function = _func
            node.setting_input = external_source_input
            node.node_settings.cache_results = external_source_input.cache_results
            if external_source_input.node_id not in set(start_node.node_id for start_node in self._flow_starts):
                self._flow_starts.append(node)
        else:
            node = FlowNode(external_source_input.node_id, function=_func,
                            setting_input=external_source_input,
                            name=node_type, node_type=node_type, parent_uuid=self.uuid)
            self._node_db[external_source_input.node_id] = node
            self._flow_starts.append(node)
            self._node_ids.append(external_source_input.node_id)
        if external_source_input.source_settings.fields and len(external_source_input.source_settings.fields) > 0:
            logger.info('Using provided schema in the node')

            def schema_callback():
                return [FlowfileColumn.from_input(f.name, f.data_type) for f in
                        external_source_input.source_settings.fields]

            node.schema_callback = schema_callback
        else:
            logger.warning('Removing schema')
            node._schema_callback = None
        self.add_node_step(node_id=external_source_input.node_id,
                           function=_func,
                           input_columns=[],
                           node_type=node_type,
                           setting_input=external_source_input)

    def add_read(self, input_file: input_schema.NodeRead):
        """Adds a node to read data from a local file (e.g., CSV, Parquet, Excel).

        Args:
            input_file: The settings for the read operation.
        """

        if input_file.received_file.file_type in ('xlsx', 'excel') and input_file.received_file.sheet_name == '':
            sheet_name = fastexcel.read_excel(input_file.received_file.path).sheet_names[0]
            input_file.received_file.sheet_name = sheet_name

        received_file = input_file.received_file
        input_file.received_file.set_absolute_filepath()

        def _func():
            input_file.received_file.set_absolute_filepath()
            if input_file.received_file.file_type == 'parquet':
                input_data = FlowDataEngine.create_from_path(input_file.received_file)
            elif input_file.received_file.file_type == 'csv' and 'utf' in input_file.received_file.encoding:
                input_data = FlowDataEngine.create_from_path(input_file.received_file)
            else:
                input_data = FlowDataEngine.create_from_path_worker(input_file.received_file,
                                                                    node_id=input_file.node_id,
                                                                    flow_id=self.flow_id)
            input_data.name = input_file.received_file.name
            return input_data

        node = self.get_node(input_file.node_id)
        schema_callback = None
        if node:
            start_hash = node.hash
            node.node_type = 'read'
            node.name = 'read'
            node.function = _func
            node.setting_input = input_file
            if input_file.node_id not in set(start_node.node_id for start_node in self._flow_starts):
                self._flow_starts.append(node)

            if start_hash != node.hash:
                logger.info('Hash changed, updating schema')
                if len(received_file.fields) > 0:
                    # If the file has fields defined, we can use them to create the schema
                    def schema_callback():
                        return [FlowfileColumn.from_input(f.name, f.data_type) for f in received_file.fields]

                elif input_file.received_file.file_type in ('csv', 'json', 'parquet'):
                    # everything that can be scanned by polars
                    def schema_callback():
                        input_data = FlowDataEngine.create_from_path(input_file.received_file)
                        return input_data.schema

                elif input_file.received_file.file_type in ('xlsx', 'excel'):
                    # If the file is an Excel file, we need to use the openpyxl engine to read the schema
                    schema_callback = get_xlsx_schema_callback(engine='openpyxl',
                                                               file_path=received_file.file_path,
                                                               sheet_name=received_file.sheet_name,
                                                               start_row=received_file.start_row,
                                                               end_row=received_file.end_row,
                                                               start_column=received_file.start_column,
                                                               end_column=received_file.end_column,
                                                               has_headers=received_file.has_headers)
                else:
                    schema_callback = None
        else:
            node = FlowNode(input_file.node_id, function=_func,
                            setting_input=input_file,
                            name='read', node_type='read', parent_uuid=self.uuid)
            self._node_db[input_file.node_id] = node
            self._flow_starts.append(node)
            self._node_ids.append(input_file.node_id)

        if schema_callback is not None:
            node.schema_callback = schema_callback
        return self

    def add_datasource(self, input_file: Union[input_schema.NodeDatasource, input_schema.NodeManualInput]) -> "FlowGraph":
        """Adds a data source node to the graph.

        This method serves as a factory for creating starting nodes, handling both
        file-based sources and direct manual data entry.

        Args:
            input_file: The configuration object for the data source.

        Returns:
            The `FlowGraph` instance for method chaining.
        """
        if isinstance(input_file, input_schema.NodeManualInput):
            input_data = FlowDataEngine(input_file.raw_data_format)
            ref = 'manual_input'
        else:
            input_data = FlowDataEngine(path_ref=input_file.file_ref)
            ref = 'datasource'
        node = self.get_node(input_file.node_id)
        if node:
            node.node_type = ref
            node.name = ref
            node.function = input_data
            node.setting_input = input_file
            if input_file.node_id not in set(start_node.node_id for start_node in self._flow_starts):
                self._flow_starts.append(node)
        else:
            input_data.collect()
            node = FlowNode(input_file.node_id, function=input_data,
                            setting_input=input_file,
                            name=ref, node_type=ref, parent_uuid=self.uuid)
            self._node_db[input_file.node_id] = node
            self._flow_starts.append(node)
            self._node_ids.append(input_file.node_id)
        return self

    def add_manual_input(self, input_file: input_schema.NodeManualInput):
        """Adds a node for manual data entry.

        This is a convenience alias for `add_datasource`.

        Args:
            input_file: The settings and data for the manual input node.
        """
        self.add_datasource(input_file)

    @property
    def nodes(self) -> List[FlowNode]:
        """Gets a list of all FlowNode objects in the graph."""

        return list(self._node_db.values())

    @property
    def execution_mode(self) -> schemas.ExecutionModeLiteral:
        """Gets the current execution mode ('Development' or 'Performance')."""
        return self.flow_settings.execution_mode

    def get_implicit_starter_nodes(self) -> List[FlowNode]:
        """Finds nodes that can act as starting points but are not explicitly defined as such.

        Some nodes, like the Polars Code node, can function without an input. This
        method identifies such nodes if they have no incoming connections.

        Returns:
            A list of `FlowNode` objects that are implicit starting nodes.
        """
        starting_node_ids = [node.node_id for node in self._flow_starts]
        implicit_starting_nodes = []
        for node in self.nodes:
            if node.node_template.can_be_start and not node.has_input and node.node_id not in starting_node_ids:
                implicit_starting_nodes.append(node)
        return implicit_starting_nodes

    @execution_mode.setter
    def execution_mode(self, mode: schemas.ExecutionModeLiteral):
        """Sets the execution mode for the flow.

        Args:
            mode: The execution mode to set.
        """
        self.flow_settings.execution_mode = mode

    @property
    def execution_location(self) -> schemas.ExecutionLocationsLiteral:
        """Gets the current execution location."""
        return self.flow_settings.execution_location

    @execution_location.setter
    def execution_location(self, execution_location: schemas.ExecutionLocationsLiteral):
        """Sets the execution location for the flow.

        Args:
            execution_location: The execution location to set.
        """
        self.flow_settings.execution_location = execution_location

    def run_graph(self) -> RunInformation | None:
        """Executes the entire data flow graph from start to finish.

        It determines the correct execution order, runs each node,
        collects results, and handles errors and cancellations.

        Returns:
            A RunInformation object summarizing the execution results.

        Raises:
            Exception: If the flow is already running.
        """
        if self.flow_settings.is_running:
            raise Exception('Flow is already running')
        try:
            self.flow_settings.is_running = True
            self.flow_settings.is_canceled = False
            self.flow_logger.clear_log_file()
            self.nodes_completed = 0
            self.node_results = []
            self.start_datetime = datetime.datetime.now()
            self.end_datetime = None
            self.latest_run_info = None
            self.flow_logger.info('Starting to run flowfile flow...')
            skip_nodes = [node for node in self.nodes if not node.is_correct]
            skip_nodes.extend([lead_to_node for node in skip_nodes for lead_to_node in node.leads_to_nodes])
            execution_order = determine_execution_order(all_nodes=[node for node in self.nodes if
                                                                   node not in skip_nodes],
                                                        flow_starts=self._flow_starts+self.get_implicit_starter_nodes())
            skip_node_message(self.flow_logger, skip_nodes)
            execution_order_message(self.flow_logger, execution_order)
            performance_mode = self.flow_settings.execution_mode == 'Performance'
            if self.flow_settings.execution_location == 'local':
                OFFLOAD_TO_WORKER.value = False
            elif self.flow_settings.execution_location == 'remote':
                OFFLOAD_TO_WORKER.value = True
            for node in execution_order:
                node_logger = self.flow_logger.get_node_logger(node.node_id)
                if self.flow_settings.is_canceled:
                    self.flow_logger.info('Flow canceled')
                    break
                if node in skip_nodes:
                    node_logger.info(f'Skipping node {node.node_id}')
                    continue
                node_result = NodeResult(node_id=node.node_id, node_name=node.name)
                self.node_results.append(node_result)
                logger.info(f'Starting to run: node {node.node_id}, start time: {node_result.start_timestamp}')
                node.execute_node(run_location=self.flow_settings.execution_location,
                                  performance_mode=performance_mode,
                                  node_logger=node_logger)
                try:
                    node_result.error = str(node.results.errors)
                    if self.flow_settings.is_canceled:
                        node_result.success = None
                        node_result.is_running = False
                        continue
                    node_result.success = node.results.errors is None
                    node_result.end_timestamp = time()
                    node_result.run_time = int(node_result.end_timestamp - node_result.start_timestamp)
                    node_result.is_running = False
                except Exception as e:
                    node_result.error = 'Node did not run'
                    node_result.success = False
                    node_result.end_timestamp = time()
                    node_result.run_time = int(node_result.end_timestamp - node_result.start_timestamp)
                    node_result.is_running = False
                    node_logger.error(f'Error in node {node.node_id}: {e}')
                if not node_result.success:
                    skip_nodes.extend(list(node.get_all_dependent_nodes()))
                node_logger.info(f'Completed node with success: {node_result.success}')
                self.nodes_completed += 1
            self.flow_logger.info('Flow completed!')
            self.end_datetime = datetime.datetime.now()
            self.flow_settings.is_running = False
            if self.flow_settings.is_canceled:
                self.flow_logger.info('Flow canceled')
            return self.get_run_info()
        except Exception as e:
            raise e
        finally:
            self.flow_settings.is_running = False

    def get_run_info(self) -> RunInformation:
        """Gets a summary of the most recent graph execution.

        Returns:
            A RunInformation object with details about the last run.
        """
        if self.latest_run_info is None:
            node_results = self.node_results
            success = all(nr.success for nr in node_results)
            self.latest_run_info = RunInformation(start_time=self.start_datetime, end_time=self.end_datetime,
                                                  success=success,
                                                  node_step_result=node_results, flow_id=self.flow_id,
                                                  nodes_completed=self.nodes_completed,
                                                  number_of_nodes=len(self.nodes))
        elif self.latest_run_info.nodes_completed != self.nodes_completed:
            node_results = self.node_results
            self.latest_run_info = RunInformation(start_time=self.start_datetime, end_time=self.end_datetime,
                                                  success=all(nr.success for nr in node_results),
                                                  node_step_result=node_results, flow_id=self.flow_id,
                                                  nodes_completed=self.nodes_completed,
                                                  number_of_nodes=len(self.nodes))
        return self.latest_run_info

    @property
    def node_connections(self) -> List[Tuple[int, int]]:
        """Computes and returns a list of all connections in the graph.

        Returns:
            A list of tuples, where each tuple is a (source_id, target_id) pair.
        """
        connections = set()
        for node in self.nodes:
            outgoing_connections = [(node.node_id, ltn.node_id) for ltn in node.leads_to_nodes]
            incoming_connections = [(don.node_id, node.node_id) for don in node.all_inputs]
            node_connections = [c for c in outgoing_connections + incoming_connections if (c[0] is not None
                                                                                           and c[1] is not None)]
            for node_connection in node_connections:
                if node_connection not in connections:
                    connections.add(node_connection)
        return list(connections)

    def get_node_data(self, node_id: int, include_example: bool = True) -> NodeData:
        """Retrieves all data needed to render a node in the UI.

        Args:
            node_id: The ID of the node.
            include_example: Whether to include data samples in the result.

        Returns:
            A NodeData object for the requested node.

        Raises:
            KeyError: If the node is not found in the graph.
        """
        node = self._node_db[node_id]
        return node.get_node_data(flow_id=self.flow_id, include_example=include_example)

    def get_node_storage(self) -> schemas.FlowInformation:
        """Serializes the entire graph's state into a storable format.

        Returns:
            A FlowInformation object representing the complete graph.
        """
        node_information = {node.node_id: node.get_node_information() for
                            node in self.nodes if node.is_setup and node.is_correct}

        return schemas.FlowInformation(flow_id=self.flow_id,
                                       flow_name=self.__name__,
                                       flow_settings=self.flow_settings,
                                       data=node_information,
                                       node_starts=[v.node_id for v in self._flow_starts],
                                       node_connections=self.node_connections
                                       )

    def cancel(self):
        """Cancels an ongoing graph execution."""

        if not self.flow_settings.is_running:
            return
        self.flow_settings.is_canceled = True
        for node in self.nodes:
            node.cancel()

    def close_flow(self):
        """Performs cleanup operations, such as clearing node caches."""

        for node in self.nodes:
            node.remove_cache()

    def save_flow(self, flow_path: str):
        """Saves the current state of the flow graph to a file.

        Args:
            flow_path: The path where the flow file will be saved.
        """
        with open(flow_path, 'wb') as f:
            pickle.dump(self.get_node_storage(), f)
        self.flow_settings.path = flow_path

    def get_frontend_data(self) -> dict:
        """Formats the graph structure into a JSON-like dictionary for a specific legacy frontend.

        This method transforms the graph's state into a format compatible with the
        Drawflow.js library.

        Returns:
            A dictionary representing the graph in Drawflow format.
        """
        result = {
            'Home': {
                "data": {}
            }
        }
        flow_info: schemas.FlowInformation = self.get_node_storage()

        for node_id, node_info in flow_info.data.items():
            if node_info.is_setup:
                try:
                    pos_x = node_info.data.pos_x
                    pos_y = node_info.data.pos_y
                    # Basic node structure
                    result["Home"]["data"][str(node_id)] = {
                        "id": node_info.id,
                        "name": node_info.type,
                        "data": {},  # Additional data can go here
                        "class": node_info.type,
                        "html": node_info.type,
                        "typenode": "vue",
                        "inputs": {},
                        "outputs": {},
                        "pos_x": pos_x,
                        "pos_y": pos_y
                    }
                except Exception as e:
                    logger.error(e)
            # Add outputs to the node based on `outputs` in your backend data
            if node_info.outputs:
                outputs = {o: 0 for o in node_info.outputs}
                for o in node_info.outputs:
                    outputs[o] += 1
                connections = []
                for output_node_id, n_connections in outputs.items():
                    leading_to_node = self.get_node(output_node_id)
                    input_types = leading_to_node.get_input_type(node_info.id)
                    for input_type in input_types:
                        if input_type == 'main':
                            input_frontend_id = 'input_1'
                        elif input_type == 'right':
                            input_frontend_id = 'input_2'
                        elif input_type == 'left':
                            input_frontend_id = 'input_3'
                        else:
                            input_frontend_id = 'input_1'
                        connection = {"node": str(output_node_id), "input": input_frontend_id}
                        connections.append(connection)

                result["Home"]["data"][str(node_id)]["outputs"]["output_1"] = {
                    "connections": connections}
            else:
                result["Home"]["data"][str(node_id)]["outputs"] = {"output_1": {"connections": []}}

            # Add input to the node based on `depending_on_id` in your backend data
            if node_info.left_input_id is not None or node_info.right_input_id is not None or node_info.input_ids is not None:
                main_inputs = node_info.main_input_ids
                result["Home"]["data"][str(node_id)]["inputs"]["input_1"] = {
                    "connections": [{"node": str(main_node_id), "input": "output_1"} for main_node_id in main_inputs]
                }
                if node_info.right_input_id is not None:
                    result["Home"]["data"][str(node_id)]["inputs"]["input_2"] = {
                        "connections": [{"node": str(node_info.right_input_id), "input": "output_1"}]
                    }
                if node_info.left_input_id is not None:
                    result["Home"]["data"][str(node_id)]["inputs"]["input_3"] = {
                        "connections": [{"node": str(node_info.left_input_id), "input": "output_1"}]
                    }
        return result

    def get_vue_flow_input(self) -> schemas.VueFlowInput:
        """Formats the graph's nodes and edges into a schema suitable for the VueFlow frontend.

        Returns:
            A VueFlowInput object.
        """
        edges: List[schemas.NodeEdge] = []
        nodes: List[schemas.NodeInput] = []
        for node in self.nodes:
            nodes.append(node.get_node_input())
            edges.extend(node.get_edge_input())
        return schemas.VueFlowInput(node_edges=edges, node_inputs=nodes)

    def reset(self):
        """Forces a deep reset on all nodes in the graph."""

        for node in self.nodes:
            node.reset(True)

    def copy_node(self, new_node_settings: input_schema.NodePromise, existing_setting_input: Any, node_type: str) -> None:
        """Creates a copy of an existing node.

        Args:
            new_node_settings: The promise containing new settings (like ID and position).
            existing_setting_input: The settings object from the node being copied.
            node_type: The type of the node being copied.
        """
        self.add_node_promise(new_node_settings)

        if isinstance(existing_setting_input, input_schema.NodePromise):
            return

        combined_settings = combine_existing_settings_and_new_settings(
            existing_setting_input, new_node_settings
        )
        getattr(self, f"add_{node_type}")(combined_settings)

    def generate_code(self):
        """Generates code for the flow graph.
        This method exports the flow graph to a Polars-compatible format.
        """
        from flowfile_core.flowfile.code_generator.code_generator import export_flow_to_polars
        print(export_flow_to_polars(self))
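The listing above is the full class source. As a quick orientation before the per-member reference below, here is a minimal, hedged sketch of building and running a flow. The import path of `input_schema`/`schemas`, the constructor arguments of `FlowGraphConfig`, and the list-of-dicts shape passed to `raw_data_format` are assumptions made for illustration; only the `FlowGraph` methods themselves are taken from the source above.

# Minimal sketch, not the canonical API: module paths and model fields marked
# below are assumptions made for illustration.
from flowfile_core.flowfile.flow_graph import FlowGraph
from flowfile_core.schemas import input_schema, schemas  # import path assumed

graph = FlowGraph(flow_settings=schemas.FlowGraphConfig(flow_id=1))  # fields assumed

# Manual input becomes a start node of the flow (raw_data_format shape assumed).
graph.add_manual_input(input_schema.NodeManualInput(
    flow_id=1, node_id=1,
    raw_data_format=[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
))

run_info = graph.run_graph()          # executes nodes in dependency order
print(run_info.success, run_info.nodes_completed)

graph.generate_code()                 # prints a Polars-compatible export of the flow
graph.save_flow("example.flowfile")   # pickles get_node_storage() to this path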
execution_location property writable

Gets the current execution location.

execution_mode property writable

Gets the current execution mode ('Development' or 'Performance').

flow_id property writable

Gets the unique identifier of the flow.

graph_has_functions property

Checks if the graph has any nodes.

graph_has_input_data property

Checks if the graph has an initial input data source.

node_connections property

Computes and returns a list of all connections in the graph.

Returns:

Type Description
List[Tuple[int, int]]

A list of tuples, where each tuple is a (source_id, target_id) pair.

nodes property

Gets a list of all FlowNode objects in the graph.

__init__(flow_settings, name=None, input_cols=None, output_cols=None, path_ref=None, input_flow=None, cache_results=False)

Initializes a new FlowGraph instance.

Parameters:

Name Type Description Default
flow_settings FlowSettings | FlowGraphConfig

The configuration settings for the flow.

required
name str

The name of the flow.

None
input_cols List[str]

A list of input column names.

None
output_cols List[str]

A list of output column names.

None
path_ref str

An optional path to an initial data source.

None
input_flow Union[ParquetFile, FlowDataEngine, FlowGraph]

An optional existing data object to start the flow with.

None
cache_results bool

A global flag to enable or disable result caching.

False
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def __init__(self,
             flow_settings: schemas.FlowSettings | schemas.FlowGraphConfig,
             name: str = None, input_cols: List[str] = None,
             output_cols: List[str] = None,
             path_ref: str = None,
             input_flow: Union[ParquetFile, FlowDataEngine, "FlowGraph"] = None,
             cache_results: bool = False):
    """Initializes a new FlowGraph instance.

    Args:
        flow_settings: The configuration settings for the flow.
        name: The name of the flow.
        input_cols: A list of input column names.
        output_cols: A list of output column names.
        path_ref: An optional path to an initial data source.
        input_flow: An optional existing data object to start the flow with.
        cache_results: A global flag to enable or disable result caching.
    """
    if isinstance(flow_settings, schemas.FlowGraphConfig):
        flow_settings = schemas.FlowSettings.from_flow_settings_input(flow_settings)

    self.flow_settings = flow_settings
    self.uuid = str(uuid1())
    self.nodes_completed = 0
    self.start_datetime = None
    self.end_datetime = None
    self.latest_run_info = None
    self.node_results = []
    self._flow_id = flow_settings.flow_id
    self.flow_logger = FlowLogger(flow_settings.flow_id)
    self._flow_starts: List[FlowNode] = []
    self._results = None
    self.schema = None
    self.has_over_row_function = False
    self._input_cols = [] if input_cols is None else input_cols
    self._output_cols = [] if output_cols is None else output_cols
    self._node_ids = []
    self._node_db = {}
    self.cache_results = cache_results
    self.__name__ = name if name else id(self)
    self.depends_on = {}
    if path_ref is not None:
        self.add_datasource(input_schema.NodeDatasource(file_path=path_ref))
    elif input_flow is not None:
        self.add_datasource(input_file=input_flow)
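A hedged constructor sketch; the `FlowGraphConfig` field names are assumptions, and note that passing `path_ref` immediately bootstraps a datasource start node via `add_datasource`:

# FlowGraphConfig fields are assumptions made for illustration.
graph = FlowGraph(flow_settings=schemas.FlowGraphConfig(flow_id=7),
                  name="sales_pipeline",
                  path_ref="data/sales.parquet",  # creates a datasource start node
                  cache_results=False)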
__repr__()

Provides the official string representation of the FlowGraph instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def __repr__(self):
    """Provides the official string representation of the FlowGraph instance."""
    settings_str = "  -" + '\n  -'.join(f"{k}: {v}" for k, v in self.flow_settings)
    return f"FlowGraph(\nNodes: {self._node_db}\n\nSettings:\n{settings_str}"
add_cloud_storage_reader(node_cloud_storage_reader)

Adds a cloud storage read node to the flow graph.

Parameters:

Name Type Description Default
node_cloud_storage_reader NodeCloudStorageReader

The settings for the cloud storage read node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_cloud_storage_reader(self, node_cloud_storage_reader: input_schema.NodeCloudStorageReader) -> None:
    """Adds a cloud storage read node to the flow graph.

    Args:
        node_cloud_storage_reader: The settings for the cloud storage read node.
    """
    node_type = "cloud_storage_reader"
    logger.info("Adding cloud storage reader")
    cloud_storage_read_settings = node_cloud_storage_reader.cloud_storage_settings

    def _func():
        logger.info("Starting to run the schema callback for cloud storage reader")
        self.flow_logger.info("Starting to run the schema callback for cloud storage reader")
        settings = CloudStorageReadSettingsInternal(read_settings=cloud_storage_read_settings,
                                                    connection=get_cloud_connection_settings(
                                                        connection_name=cloud_storage_read_settings.connection_name,
                                                        user_id=node_cloud_storage_reader.user_id,
                                                        auth_mode=cloud_storage_read_settings.auth_mode
                                                    ))
        fl = FlowDataEngine.from_cloud_storage_obj(settings)
        return fl

    node = self.add_node_step(node_id=node_cloud_storage_reader.node_id,
                              function=_func,
                              cache_results=node_cloud_storage_reader.cache_results,
                              setting_input=node_cloud_storage_reader,
                              node_type=node_type,
                              )
    if node_cloud_storage_reader.node_id not in set(start_node.node_id for start_node in self._flow_starts):
        self._flow_starts.append(node)
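A hedged usage sketch (assuming `graph` and `input_schema` are in scope as above). `connection_name`, `auth_mode`, `user_id` and `cache_results` appear in the source; the settings model name `CloudStorageReadSettings` and the `resource_path` field are assumptions:

reader = input_schema.NodeCloudStorageReader(
    flow_id=1, node_id=1, user_id=1, cache_results=True,
    cloud_storage_settings=input_schema.CloudStorageReadSettings(  # model name assumed
        connection_name="my_s3_connection",
        auth_mode="access_key",                        # value assumed
        resource_path="s3://my-bucket/raw/*.parquet",  # field name assumed
    ),
)
graph.add_cloud_storage_reader(reader)  # registered as a start node of the flow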
add_cloud_storage_writer(node_cloud_storage_writer)

Adds a node to write data to a cloud storage provider.

Parameters:

Name Type Description Default
node_cloud_storage_writer NodeCloudStorageWriter

The settings for the cloud storage writer node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_cloud_storage_writer(self, node_cloud_storage_writer: input_schema.NodeCloudStorageWriter) -> None:
    """Adds a node to write data to a cloud storage provider.

    Args:
        node_cloud_storage_writer: The settings for the cloud storage writer node.
    """

    node_type = "cloud_storage_writer"
    def _func(df: FlowDataEngine):
        df.lazy = True
        execute_remote = self.execution_location != 'local'
        cloud_connection_settings = get_cloud_connection_settings(
            connection_name=node_cloud_storage_writer.cloud_storage_settings.connection_name,
            user_id=node_cloud_storage_writer.user_id,
            auth_mode=node_cloud_storage_writer.cloud_storage_settings.auth_mode
        )
        full_cloud_storage_connection = FullCloudStorageConnection(
            storage_type=cloud_connection_settings.storage_type,
            auth_method=cloud_connection_settings.auth_method,
            aws_allow_unsafe_html=cloud_connection_settings.aws_allow_unsafe_html,
            **CloudStorageReader.get_storage_options(cloud_connection_settings)
        )
        if execute_remote:
            settings = get_cloud_storage_write_settings_worker_interface(
                write_settings=node_cloud_storage_writer.cloud_storage_settings,
                connection=full_cloud_storage_connection,
                lf=df.data_frame,
                flowfile_node_id=node_cloud_storage_writer.node_id,
                flowfile_flow_id=self.flow_id)
            external_cloud_writer = ExternalCloudWriter(settings, wait_on_completion=False)
            node._fetch_cached_df = external_cloud_writer
            external_cloud_writer.get_result()
        else:
            cloud_storage_write_settings_internal = CloudStorageWriteSettingsInternal(
                connection=full_cloud_storage_connection,
                write_settings=node_cloud_storage_writer.cloud_storage_settings,
            )
            df.to_cloud_storage_obj(cloud_storage_write_settings_internal)
        return df

    def schema_callback():
        logger.info("Starting to run the schema callback for cloud storage writer")
        if self.get_node(node_cloud_storage_writer.node_id).is_correct:
            return self.get_node(node_cloud_storage_writer.node_id).node_inputs.main_inputs[0].schema
        else:
            return [FlowfileColumn.from_input(column_name="__error__", data_type="String")]

    self.add_node_step(
        node_id=node_cloud_storage_writer.node_id,
        function=_func,
        input_columns=[],
        node_type=node_type,
        setting_input=node_cloud_storage_writer,
        schema_callback=schema_callback,
        input_node_ids=[node_cloud_storage_writer.depending_on_id]
    )

    node = self.get_node(node_cloud_storage_writer.node_id)
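A hedged usage sketch along the same lines; `depending_on_id` wires the writer to its upstream node, while the settings model name and `resource_path` field are assumptions:

writer = input_schema.NodeCloudStorageWriter(
    flow_id=1, node_id=3, user_id=1,
    depending_on_id=1,  # the node whose output should be written
    cloud_storage_settings=input_schema.CloudStorageWriteSettings(  # model name assumed
        connection_name="my_s3_connection",
        auth_mode="access_key",                             # value assumed
        resource_path="s3://my-bucket/out/result.parquet",  # field name assumed
    ),
)
graph.add_cloud_storage_writer(writer)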
add_cross_join(cross_join_settings)

Adds a cross join node to the graph.

Parameters:

Name Type Description Default
cross_join_settings NodeCrossJoin

The settings for the cross join operation.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_cross_join(self, cross_join_settings: input_schema.NodeCrossJoin) -> "FlowGraph":
    """Adds a cross join node to the graph.

    Args:
        cross_join_settings: The settings for the cross join operation.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
        for left_select in cross_join_settings.cross_join_input.left_select.renames:
            left_select.is_available = True if left_select.old_name in main.schema else False
        for right_select in cross_join_settings.cross_join_input.right_select.renames:
            right_select.is_available = True if right_select.old_name in right.schema else False

        return main.do_cross_join(cross_join_input=cross_join_settings.cross_join_input,
                                  auto_generate_selection=cross_join_settings.auto_generate_selection,
                                  verify_integrity=False,
                                  other=right)

    self.add_node_step(node_id=cross_join_settings.node_id,
                       function=_func,
                       input_columns=[],
                       node_type='cross_join',
                       setting_input=cross_join_settings,
                       input_node_ids=cross_join_settings.depending_on_ids)
    return self
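A hedged sketch; `depending_on_ids`, `auto_generate_selection` and `cross_join_input` appear in the source above, while the shape of the cross-join input model is not shown here and is left as a placeholder:

cross_join = input_schema.NodeCrossJoin(
    flow_id=1, node_id=4,
    depending_on_ids=[2, 3],          # main input first, right input second
    auto_generate_selection=True,
    cross_join_input=...,             # placeholder: a CrossJoinInput with left/right selects
)
graph.add_cross_join(cross_join)      # returns the FlowGraph for chaining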
add_database_reader(node_database_reader)

Adds a node to read data from a database.

Parameters:

Name Type Description Default
node_database_reader NodeDatabaseReader

The settings for the database reader node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_database_reader(self, node_database_reader: input_schema.NodeDatabaseReader):
    """Adds a node to read data from a database.

    Args:
        node_database_reader: The settings for the database reader node.
    """

    logger.info("Adding database reader")
    node_type = 'database_reader'
    database_settings: input_schema.DatabaseSettings = node_database_reader.database_settings
    database_connection: Optional[input_schema.DatabaseConnection | input_schema.FullDatabaseConnection]
    if database_settings.connection_mode == 'inline':
        database_connection: input_schema.DatabaseConnection = database_settings.database_connection
        encrypted_password = get_encrypted_secret(current_user_id=node_database_reader.user_id,
                                                  secret_name=database_connection.password_ref)
        if encrypted_password is None:
            raise HTTPException(status_code=400, detail="Password not found")
    else:
        database_reference_settings = get_local_database_connection(database_settings.database_connection_name,
                                                                    node_database_reader.user_id)
        database_connection = database_reference_settings
        encrypted_password = database_reference_settings.password.get_secret_value()

    def _func():
        sql_source = BaseSqlSource(query=None if database_settings.query_mode == 'table' else database_settings.query,
                                   table_name=database_settings.table_name,
                                   schema_name=database_settings.schema_name,
                                   fields=node_database_reader.fields,
                                   )
        database_external_read_settings = (
            sql_models.DatabaseExternalReadSettings.create_from_from_node_database_reader(
                node_database_reader=node_database_reader,
                password=encrypted_password,
                query=sql_source.query,
                database_reference_settings=(database_reference_settings if database_settings.connection_mode == 'reference'
                                             else None),
            )
        )

        external_database_fetcher = ExternalDatabaseFetcher(database_external_read_settings, wait_on_completion=False)
        node._fetch_cached_df = external_database_fetcher
        fl = FlowDataEngine(external_database_fetcher.get_result())
        node_database_reader.fields = [c.get_minimal_field_info() for c in fl.schema]
        return fl

    def schema_callback():
        sql_source = SqlSource(connection_string=
                               sql_utils.construct_sql_uri(database_type=database_connection.database_type,
                                                           host=database_connection.host,
                                                           port=database_connection.port,
                                                           database=database_connection.database,
                                                           username=database_connection.username,
                                                           password=decrypt_secret(encrypted_password)),
                               query=None if database_settings.query_mode == 'table' else database_settings.query,
                               table_name=database_settings.table_name,
                               schema_name=database_settings.schema_name,
                               fields=node_database_reader.fields,
                               )
        return sql_source.get_schema()

    node = self.get_node(node_database_reader.node_id)
    if node:
        node.node_type = node_type
        node.name = node_type
        node.function = _func
        node.setting_input = node_database_reader
        node.node_settings.cache_results = node_database_reader.cache_results
        if node_database_reader.node_id not in set(start_node.node_id for start_node in self._flow_starts):
            self._flow_starts.append(node)
        node.schema_callback = schema_callback
    else:
        node = FlowNode(node_database_reader.node_id, function=_func,
                        setting_input=node_database_reader,
                        name=node_type, node_type=node_type, parent_uuid=self.uuid,
                        schema_callback=schema_callback)
        self._node_db[node_database_reader.node_id] = node
        self._flow_starts.append(node)
        self._node_ids.append(node_database_reader.node_id)
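A hedged usage sketch; the `DatabaseSettings` field names (`connection_mode`, `database_connection_name`, `query_mode`, `schema_name`, `table_name`) are taken from the source above, while the keyword-style construction is an assumption:

db_reader = input_schema.NodeDatabaseReader(
    flow_id=1, node_id=1, user_id=1, cache_results=True,
    database_settings=input_schema.DatabaseSettings(
        connection_mode="reference",              # or "inline" with a DatabaseConnection
        database_connection_name="warehouse_pg",  # stored connection for this user
        query_mode="table",                       # or pass a raw query via `query`
        schema_name="public",
        table_name="orders",
    ),
)
graph.add_database_reader(db_reader)  # registered as a start node of the flow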
add_database_writer(node_database_writer)

Adds a node to write data to a database.

Parameters:

Name Type Description Default
node_database_writer NodeDatabaseWriter

The settings for the database writer node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_database_writer(self, node_database_writer: input_schema.NodeDatabaseWriter):
    """Adds a node to write data to a database.

    Args:
        node_database_writer: The settings for the database writer node.
    """

    node_type = 'database_writer'
    database_settings: input_schema.DatabaseWriteSettings = node_database_writer.database_write_settings
    database_connection: Optional[input_schema.DatabaseConnection | input_schema.FullDatabaseConnection]
    if database_settings.connection_mode == 'inline':
        database_connection: input_schema.DatabaseConnection = database_settings.database_connection
        encrypted_password = get_encrypted_secret(current_user_id=node_database_writer.user_id,
                                                  secret_name=database_connection.password_ref)
        if encrypted_password is None:
            raise HTTPException(status_code=400, detail="Password not found")
    else:
        database_reference_settings = get_local_database_connection(database_settings.database_connection_name,
                                                                    node_database_writer.user_id)
        encrypted_password = database_reference_settings.password.get_secret_value()

    def _func(df: FlowDataEngine):
        df.lazy = True
        database_external_write_settings = (
            sql_models.DatabaseExternalWriteSettings.create_from_from_node_database_writer(
                node_database_writer=node_database_writer,
                password=encrypted_password,
                table_name=(database_settings.schema_name+'.'+database_settings.table_name
                            if database_settings.schema_name else database_settings.table_name),
                database_reference_settings=(database_reference_settings if database_settings.connection_mode == 'reference'
                                             else None),
                lf=df.data_frame
            )
        )
        external_database_writer = ExternalDatabaseWriter(database_external_write_settings, wait_on_completion=False)
        node._fetch_cached_df = external_database_writer
        external_database_writer.get_result()
        return df

    def schema_callback():
        input_node: FlowNode = self.get_node(node_database_writer.node_id).node_inputs.main_inputs[0]
        return input_node.schema

    self.add_node_step(
        node_id=node_database_writer.node_id,
        function=_func,
        input_columns=[],
        node_type=node_type,
        setting_input=node_database_writer,
        schema_callback=schema_callback,
    )
    node = self.get_node(node_database_writer.node_id)
add_datasource(input_file)

Adds a data source node to the graph.

This method serves as a factory for creating starting nodes, handling both file-based sources and direct manual data entry.

Parameters:

Name Type Description Default
input_file Union[NodeDatasource, NodeManualInput]

The configuration object for the data source.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_datasource(self, input_file: Union[input_schema.NodeDatasource, input_schema.NodeManualInput]) -> "FlowGraph":
    """Adds a data source node to the graph.

    This method serves as a factory for creating starting nodes, handling both
    file-based sources and direct manual data entry.

    Args:
        input_file: The configuration object for the data source.

    Returns:
        The `FlowGraph` instance for method chaining.
    """
    if isinstance(input_file, input_schema.NodeManualInput):
        input_data = FlowDataEngine(input_file.raw_data_format)
        ref = 'manual_input'
    else:
        input_data = FlowDataEngine(path_ref=input_file.file_ref)
        ref = 'datasource'
    node = self.get_node(input_file.node_id)
    if node:
        node.node_type = ref
        node.name = ref
        node.function = input_data
        node.setting_input = input_file
        if not input_file.node_id in set(start_node.node_id for start_node in self._flow_starts):
            self._flow_starts.append(node)
    else:
        input_data.collect()
        node = FlowNode(input_file.node_id, function=input_data,
                        setting_input=input_file,
                        name=ref, node_type=ref, parent_uuid=self.uuid)
        self._node_db[input_file.node_id] = node
        self._flow_starts.append(node)
        self._node_ids.append(input_file.node_id)
    return self
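
Usage sketch (illustrative): the snippet below assumes an existing FlowGraph instance named graph, and that NodeManualInput accepts the keyword arguments shown; the import path and the shape of raw_data_format are assumptions inferred from the attributes read above, not a verified schema.

from flowfile_core.schemas import input_schema  # import path assumed

# Hypothetical manual-input node: node 1 becomes a starting node of the flow.
manual_node = input_schema.NodeManualInput(
    flow_id=graph.flow_id,                                           # field name assumed
    node_id=1,
    raw_data_format=[{"city": "Amsterdam", "population": 821752}],  # shape assumed
)
graph.add_datasource(manual_node)
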
add_dependency_on_polars_lazy_frame(lazy_frame, node_id)

Adds a special node that directly injects a Polars LazyFrame into the graph.

Note: This is intended for backend use and will not work in the UI editor.

Parameters:

Name Type Description Default
lazy_frame LazyFrame

The Polars LazyFrame to inject.

required
node_id int

The ID for the new node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_dependency_on_polars_lazy_frame(self,
                                        lazy_frame: pl.LazyFrame,
                                        node_id: int):
    """Adds a special node that directly injects a Polars LazyFrame into the graph.

    Note: This is intended for backend use and will not work in the UI editor.

    Args:
        lazy_frame: The Polars LazyFrame to inject.
        node_id: The ID for the new node.
    """
    def _func():
        return FlowDataEngine(lazy_frame)
    node_promise = input_schema.NodePromise(flow_id=self.flow_id,
                                            node_id=node_id, node_type="polars_lazy_frame",
                                            is_setup=True)
    self.add_node_step(node_id=node_promise.node_id, node_type=node_promise.node_type, function=_func,
                       setting_input=node_promise)
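
A minimal usage sketch for injecting an existing LazyFrame, assuming graph is an already-constructed FlowGraph; the call signature itself matches the source above.

import polars as pl

lf = pl.LazyFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# Node 5 yields this LazyFrame when the graph runs (backend-only; not shown in the UI editor).
graph.add_dependency_on_polars_lazy_frame(lazy_frame=lf, node_id=5)
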
add_explore_data(node_analysis)

Adds a specialized node for data exploration and visualization.

Parameters:

Name Type Description Default
node_analysis NodeExploreData

The settings for the data exploration node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_explore_data(self, node_analysis: input_schema.NodeExploreData):
    """Adds a specialized node for data exploration and visualization.

    Args:
        node_analysis: The settings for the data exploration node.
    """
    sample_size: int = 10000

    def analysis_preparation(flowfile_table: FlowDataEngine):
        if flowfile_table.number_of_records <= 0:
            number_of_records = flowfile_table.get_number_of_records(calculate_in_worker_process=True)
        else:
            number_of_records = flowfile_table.number_of_records
        if number_of_records > sample_size:
            flowfile_table = flowfile_table.get_sample(sample_size, random=True)
        external_sampler = ExternalDfFetcher(
            lf=flowfile_table.data_frame,
            file_ref="__gf_walker"+node.hash,
            wait_on_completion=True,
            node_id=node.node_id,
            flow_id=self.flow_id,
        )
        node.results.analysis_data_generator = get_read_top_n(external_sampler.status.file_ref,
                                                              n=min(sample_size, number_of_records))
        return flowfile_table

    def schema_callback():
        node = self.get_node(node_analysis.node_id)
        if len(node.all_inputs) == 1:
            input_node = node.all_inputs[0]
            return input_node.schema
        else:
            return [FlowfileColumn.from_input('col_1', 'na')]

    self.add_node_step(node_id=node_analysis.node_id, node_type='explore_data',
                       function=analysis_preparation,
                       setting_input=node_analysis, schema_callback=schema_callback)
    node = self.get_node(node_analysis.node_id)
add_external_source(external_source_input)

Adds a node for a custom external data source.

Parameters:

Name Type Description Default
external_source_input NodeExternalSource

The settings for the external source node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_external_source(self,
                        external_source_input: input_schema.NodeExternalSource):
    """Adds a node for a custom external data source.

    Args:
        external_source_input: The settings for the external source node.
    """

    node_type = 'external_source'
    external_source_script = getattr(external_sources.custom_external_sources, external_source_input.identifier)
    source_settings = (getattr(input_schema, snake_case_to_camel_case(external_source_input.identifier)).
                       model_validate(external_source_input.source_settings))
    if hasattr(external_source_script, 'initial_getter'):
        initial_getter = getattr(external_source_script, 'initial_getter')(source_settings)
    else:
        initial_getter = None
    data_getter = external_source_script.getter(source_settings)
    external_source = data_source_factory(source_type='custom',
                                          data_getter=data_getter,
                                          initial_data_getter=initial_getter,
                                          orientation=external_source_input.source_settings.orientation,
                                          schema=None)

    def _func():
        logger.info('Calling external source')
        fl = FlowDataEngine.create_from_external_source(external_source=external_source)
        external_source_input.source_settings.fields = [c.get_minimal_field_info() for c in fl.schema]
        return fl

    node = self.get_node(external_source_input.node_id)
    if node:
        node.node_type = node_type
        node.name = node_type
        node.function = _func
        node.setting_input = external_source_input
        node.node_settings.cache_results = external_source_input.cache_results
        if external_source_input.node_id not in set(start_node.node_id for start_node in self._flow_starts):
            self._flow_starts.append(node)
    else:
        node = FlowNode(external_source_input.node_id, function=_func,
                        setting_input=external_source_input,
                        name=node_type, node_type=node_type, parent_uuid=self.uuid)
        self._node_db[external_source_input.node_id] = node
        self._flow_starts.append(node)
        self._node_ids.append(external_source_input.node_id)
    if external_source_input.source_settings.fields and len(external_source_input.source_settings.fields) > 0:
        logger.info('Using provided schema in the node')

        def schema_callback():
            return [FlowfileColumn.from_input(f.name, f.data_type) for f in
                    external_source_input.source_settings.fields]

        node.schema_callback = schema_callback
    else:
        logger.warning('Removing schema')
        node._schema_callback = None
    self.add_node_step(node_id=external_source_input.node_id,
                       function=_func,
                       input_columns=[],
                       node_type=node_type,
                       setting_input=external_source_input)
add_filter(filter_settings)

Adds a filter node to the graph.

Parameters:

Name Type Description Default
filter_settings NodeFilter

The settings for the filter operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_filter(self, filter_settings: input_schema.NodeFilter):
    """Adds a filter node to the graph.

    Args:
        filter_settings: The settings for the filter operation.
    """

    is_advanced = filter_settings.filter_input.filter_type == 'advanced'
    if is_advanced:
        predicate = filter_settings.filter_input.advanced_filter
    else:
        _basic_filter = filter_settings.filter_input.basic_filter
        filter_settings.filter_input.advanced_filter = (f'[{_basic_filter.field}]{_basic_filter.filter_type}"'
                                                        f'{_basic_filter.filter_value}"')

    def _func(fl: FlowDataEngine):
        is_advanced = filter_settings.filter_input.filter_type == 'advanced'
        if is_advanced:
            return fl.do_filter(predicate)
        else:
            basic_filter = filter_settings.filter_input.basic_filter
            if basic_filter.filter_value.isnumeric():
                field_data_type = fl.get_schema_column(basic_filter.field).generic_datatype()
                if field_data_type == 'str':
                    _f = f'[{basic_filter.field}]{basic_filter.filter_type}"{basic_filter.filter_value}"'
                else:
                    _f = f'[{basic_filter.field}]{basic_filter.filter_type}{basic_filter.filter_value}'
            else:
                _f = f'[{basic_filter.field}]{basic_filter.filter_type}"{basic_filter.filter_value}"'
            filter_settings.filter_input.advanced_filter = _f
            return fl.do_filter(_f)

    self.add_node_step(filter_settings.node_id, _func,
                       node_type='filter',
                       renew_schema=False,
                       setting_input=filter_settings,
                       input_node_ids=[filter_settings.depending_on_id]
                       )
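
Usage sketch (illustrative): the nested model names (FilterInput, BasicFilter) and constructor keywords below are assumptions based on the attribute paths read in the function above; only graph.add_filter itself is taken from the source.

from flowfile_core.schemas import input_schema  # import path assumed

filter_node = input_schema.NodeFilter(
    flow_id=graph.flow_id,
    node_id=3,
    depending_on_id=1,                           # upstream node to filter
    filter_input=input_schema.FilterInput(       # model name assumed
        filter_type="basic",
        basic_filter=input_schema.BasicFilter(   # model name assumed
            field="population",
            filter_type=">",
            filter_value="500000",
        ),
    ),
)
graph.add_filter(filter_node)
# At run time the basic filter above is compiled into an advanced expression,
# roughly: [population]>500000
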
add_formula(function_settings)

Adds a node that applies a formula to create or modify a column.

Parameters:

Name Type Description Default
function_settings NodeFormula

The settings for the formula operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_formula(self, function_settings: input_schema.NodeFormula):
    """Adds a node that applies a formula to create or modify a column.

    Args:
        function_settings: The settings for the formula operation.
    """

    error = ""
    if function_settings.function.field.data_type not in (None, "Auto"):
        output_type = cast_str_to_polars_type(function_settings.function.field.data_type)
    else:
        output_type = None
    if output_type not in (None, "Auto"):
        new_col = [FlowfileColumn.from_input(column_name=function_settings.function.field.name,
                                             data_type=str(output_type))]
    else:
        new_col = [FlowfileColumn.from_input(function_settings.function.field.name, 'String')]

    def _func(fl: FlowDataEngine):
        return fl.apply_sql_formula(func=function_settings.function.function,
                                    col_name=function_settings.function.field.name,
                                    output_data_type=output_type)

    self.add_node_step(function_settings.node_id, _func,
                       output_schema=new_col,
                       node_type='formula',
                       renew_schema=False,
                       setting_input=function_settings,
                       input_node_ids=[function_settings.depending_on_id]
                       )
    if error != "":
        node = self.get_node(function_settings.node_id)
        node.results.errors = error
        return False, error
    else:
        return True, ""
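
Usage sketch (illustrative): the FunctionInput/FieldInput model names and keywords are assumptions derived from the attribute paths above (function.function, function.field.name, function.field.data_type).

from flowfile_core.schemas import input_schema  # import path assumed

formula_node = input_schema.NodeFormula(
    flow_id=graph.flow_id,
    node_id=4,
    depending_on_id=3,
    function=input_schema.FunctionInput(         # model name assumed
        function="[population] / 1000",          # SQL-style formula string
        field=input_schema.FieldInput(           # model name assumed
            name="population_k",
            data_type="Double",                  # or "Auto" to let the type be inferred
        ),
    ),
)
graph.add_formula(formula_node)
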
add_fuzzy_match(fuzzy_settings)

Adds a fuzzy matching node to join data on approximate string matches.

Parameters:

Name Type Description Default
fuzzy_settings NodeFuzzyMatch

The settings for the fuzzy match operation.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_fuzzy_match(self, fuzzy_settings: input_schema.NodeFuzzyMatch) -> "FlowGraph":
    """Adds a fuzzy matching node to join data on approximate string matches.

    Args:
        fuzzy_settings: The settings for the fuzzy match operation.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
        f = main.start_fuzzy_join(fuzzy_match_input=fuzzy_settings.join_input, other=right, file_ref=node.hash,
                                  flow_id=self.flow_id, node_id=fuzzy_settings.node_id)
        logger.info("Started the fuzzy match action")
        node._fetch_cached_df = f
        return FlowDataEngine(f.get_result())

    self.add_node_step(node_id=fuzzy_settings.node_id,
                       function=_func,
                       input_columns=[],
                       node_type='fuzzy_match',
                       setting_input=fuzzy_settings)
    node = self.get_node(node_id=fuzzy_settings.node_id)

    def schema_callback():
        return calculate_fuzzy_match_schema(fuzzy_settings.join_input,
                                            left_schema=node.node_inputs.main_inputs[0].schema,
                                            right_schema=node.node_inputs.right_input.schema
                                            )

    node.schema_callback = schema_callback
    return self
add_graph_solver(graph_solver_settings)

Adds a node that solves graph-like problems within the data.

This node can be used for operations like finding network paths, calculating connected components, or performing other graph algorithms on relational data that represents nodes and edges.

Parameters:

Name Type Description Default
graph_solver_settings NodeGraphSolver

The settings object defining the graph inputs and the specific algorithm to apply.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_graph_solver(self, graph_solver_settings: input_schema.NodeGraphSolver):
    """Adds a node that solves graph-like problems within the data.

    This node can be used for operations like finding network paths,
    calculating connected components, or performing other graph algorithms
    on relational data that represents nodes and edges.

    Args:
        graph_solver_settings: The settings object defining the graph inputs
            and the specific algorithm to apply.
    """
    def _func(fl: FlowDataEngine) -> FlowDataEngine:
        return fl.solve_graph(graph_solver_settings.graph_solver_input)

    self.add_node_step(node_id=graph_solver_settings.node_id,
                       function=_func,
                       node_type='graph_solver',
                       setting_input=graph_solver_settings,
                       input_node_ids=[graph_solver_settings.depending_on_id])
add_group_by(group_by_settings)

Adds a group-by aggregation node to the graph.

Parameters:

Name Type Description Default
group_by_settings NodeGroupBy

The settings for the group-by operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_group_by(self, group_by_settings: input_schema.NodeGroupBy):
    """Adds a group-by aggregation node to the graph.

    Args:
        group_by_settings: The settings for the group-by operation.
    """

    def _func(fl: FlowDataEngine) -> FlowDataEngine:
        return fl.do_group_by(group_by_settings.groupby_input, False)

    self.add_node_step(node_id=group_by_settings.node_id,
                       function=_func,
                       node_type=f'group_by',
                       setting_input=group_by_settings,
                       input_node_ids=[group_by_settings.depending_on_id])

    node = self.get_node(group_by_settings.node_id)

    def schema_callback():

        output_columns = [(c.old_name, c.new_name, c.output_type) for c in group_by_settings.groupby_input.agg_cols]
        depends_on = node.node_inputs.main_inputs[0]
        input_schema_dict: Dict[str, str] = {s.name: s.data_type for s in depends_on.schema}
        output_schema = []
        for old_name, new_name, data_type in output_columns:
            data_type = input_schema_dict[old_name] if data_type is None else data_type
            output_schema.append(FlowfileColumn.from_input(data_type=data_type, column_name=new_name))
        return output_schema

    node.schema_callback = schema_callback
add_include_cols(include_columns)

Adds columns to both the input and output column lists.

Parameters:

Name Type Description Default
include_columns List[str]

A list of column names to include.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_include_cols(self, include_columns: List[str]):
    """Adds columns to both the input and output column lists.

    Args:
        include_columns: A list of column names to include.
    """
    for column in include_columns:
        if column not in self._input_cols:
            self._input_cols.append(column)
        if column not in self._output_cols:
            self._output_cols.append(column)
    return self
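
Usage sketch: add_include_cols takes a plain list of column names and registers them on both the graph's input and output column lists (assuming graph is an existing FlowGraph).

# Ensure these columns are tracked as both inputs and outputs of the graph.
graph.add_include_cols(["city", "population"])
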
add_initial_node_analysis(node_promise)

Adds a data exploration/analysis node based on a node promise.

Parameters:

Name Type Description Default
node_promise NodePromise

The promise representing the node to be analyzed.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_initial_node_analysis(self, node_promise: input_schema.NodePromise):
    """Adds a data exploration/analysis node based on a node promise.

    Args:
        node_promise: The promise representing the node to be analyzed.
    """
    node_analysis = create_graphic_walker_node_from_node_promise(node_promise)
    self.add_explore_data(node_analysis)
add_join(join_settings)

Adds a join node to combine two data streams based on key columns.

Parameters:

Name Type Description Default
join_settings NodeJoin

The settings for the join operation.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_join(self, join_settings: input_schema.NodeJoin) -> "FlowGraph":
    """Adds a join node to combine two data streams based on key columns.

    Args:
        join_settings: The settings for the join operation.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
        for left_select in join_settings.join_input.left_select.renames:
            left_select.is_available = True if left_select.old_name in main.schema else False
        for right_select in join_settings.join_input.right_select.renames:
            right_select.is_available = True if right_select.old_name in right.schema else False

        return main.join(join_input=join_settings.join_input,
                         auto_generate_selection=join_settings.auto_generate_selection,
                         verify_integrity=False,
                         other=right)

    self.add_node_step(node_id=join_settings.node_id,
                       function=_func,
                       input_columns=[],
                       node_type='join',
                       setting_input=join_settings,
                       input_node_ids=join_settings.depending_on_ids)
    return self
add_manual_input(input_file)

Adds a node for manual data entry.

This is a convenience alias for add_datasource.

Parameters:

Name Type Description Default
input_file NodeManualInput

The settings and data for the manual input node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_manual_input(self, input_file: input_schema.NodeManualInput):
    """Adds a node for manual data entry.

    This is a convenience alias for `add_datasource`.

    Args:
        input_file: The settings and data for the manual input node.
    """
    self.add_datasource(input_file)
add_node_promise(node_promise)

Adds a placeholder node to the graph that is not yet fully configured.

Useful for building the graph structure before all settings are available.

Parameters:

Name Type Description Default
node_promise NodePromise

A promise object containing basic node information.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_node_promise(self, node_promise: input_schema.NodePromise):
    """Adds a placeholder node to the graph that is not yet fully configured.

    Useful for building the graph structure before all settings are available.

    Args:
        node_promise: A promise object containing basic node information.
    """
    def placeholder(n: FlowNode = None):
        if n is None:
            return FlowDataEngine()
        return n

    self.add_node_step(node_id=node_promise.node_id, node_type=node_promise.node_type, function=placeholder,
                       setting_input=node_promise)
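
Usage sketch: a promise reserves a node id and type before the full settings exist. The NodePromise fields used below (flow_id, node_id, node_type, is_setup) are the same ones constructed inside add_dependency_on_polars_lazy_frame above; the exact constructor form is otherwise an assumption, and graph is assumed to be an existing FlowGraph.

from flowfile_core.schemas import input_schema  # import path assumed

# Reserve node 7 as a not-yet-configured filter; it can be completed later with add_filter().
promise = input_schema.NodePromise(
    flow_id=graph.flow_id,
    node_id=7,
    node_type="filter",
    is_setup=False,
)
graph.add_node_promise(promise)
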
add_node_step(node_id, function, input_columns=None, output_schema=None, node_type=None, drop_columns=None, renew_schema=True, setting_input=None, cache_results=None, schema_callback=None, input_node_ids=None)

The core method for adding or updating a node in the graph.

Parameters:

Name Type Description Default
node_id Union[int, str]

The unique ID for the node.

required
function Callable

The core processing function for the node.

required
input_columns List[str]

A list of input column names required by the function.

None
output_schema List[FlowfileColumn]

A predefined schema for the node's output.

None
node_type str

A string identifying the type of node (e.g., 'filter', 'join').

None
drop_columns List[str]

A list of columns to be dropped after the function executes.

None
renew_schema bool

If True, the schema is recalculated after execution.

True
setting_input Any

A configuration object containing settings for the node.

None
cache_results bool

If True, the node's results are cached for future runs.

None
schema_callback Callable

A function that dynamically calculates the output schema.

None
input_node_ids List[int]

A list of IDs for the nodes that this node depends on.

None

Returns:

Type Description
FlowNode

The created or updated FlowNode object.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_node_step(self,
                  node_id: Union[int, str],
                  function: Callable,
                  input_columns: List[str] = None,
                  output_schema: List[FlowfileColumn] = None,
                  node_type: str = None,
                  drop_columns: List[str] = None,
                  renew_schema: bool = True,
                  setting_input: Any = None,
                  cache_results: bool = None,
                  schema_callback: Callable = None,
                  input_node_ids: List[int] = None) -> FlowNode:
    """The core method for adding or updating a node in the graph.

    Args:
        node_id: The unique ID for the node.
        function: The core processing function for the node.
        input_columns: A list of input column names required by the function.
        output_schema: A predefined schema for the node's output.
        node_type: A string identifying the type of node (e.g., 'filter', 'join').
        drop_columns: A list of columns to be dropped after the function executes.
        renew_schema: If True, the schema is recalculated after execution.
        setting_input: A configuration object containing settings for the node.
        cache_results: If True, the node's results are cached for future runs.
        schema_callback: A function that dynamically calculates the output schema.
        input_node_ids: A list of IDs for the nodes that this node depends on.

    Returns:
        The created or updated FlowNode object.
    """
    existing_node = self.get_node(node_id)
    if existing_node is not None:
        if existing_node.node_type != node_type:
            self.delete_node(existing_node.node_id)
            existing_node = None
    if existing_node:
        input_nodes = existing_node.all_inputs
    elif input_node_ids is not None:
        input_nodes = [self.get_node(node_id) for node_id in input_node_ids]
    else:
        input_nodes = None
    if isinstance(input_columns, str):
        input_columns = [input_columns]
    if (
            input_nodes is not None or
            function.__name__ in ('placeholder', 'analysis_preparation') or
            node_type in ("cloud_storage_reader", "polars_lazy_frame", "input_data")
    ):
        if not existing_node:
            node = FlowNode(node_id=node_id,
                            function=function,
                            output_schema=output_schema,
                            input_columns=input_columns,
                            drop_columns=drop_columns,
                            renew_schema=renew_schema,
                            setting_input=setting_input,
                            node_type=node_type,
                            name=function.__name__,
                            schema_callback=schema_callback,
                            parent_uuid=self.uuid)
        else:
            existing_node.update_node(function=function,
                                      output_schema=output_schema,
                                      input_columns=input_columns,
                                      drop_columns=drop_columns,
                                      setting_input=setting_input,
                                      schema_callback=schema_callback)
            node = existing_node
    else:
        raise Exception("No data initialized")
    self._node_db[node_id] = node
    self._node_ids.append(node_id)
    return node
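
Usage sketch: add_node_step is what the higher-level add_* helpers call under the hood. A custom step can be registered directly with any callable that accepts and returns FlowDataEngine objects; the parameters used below are those listed in the signature above, graph is assumed to be an existing FlowGraph, and the FlowDataEngine import path is an assumption.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # import path assumed

def add_country_column(fl: FlowDataEngine) -> FlowDataEngine:
    # .data_frame exposes the underlying (lazy) Polars frame, as used throughout the source above.
    return FlowDataEngine(fl.data_frame.with_columns(pl.lit("NL").alias("country")))

graph.add_node_step(
    node_id=9,
    function=add_country_column,
    node_type="polars_code",   # label stored on the node; reusing an existing type string here
    setting_input=None,
    input_node_ids=[4],        # this step consumes the output of node 4
)
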
add_output(output_file)

Adds an output node to write the final data to a destination.

Parameters:

Name Type Description Default
output_file NodeOutput

The settings for the output file.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_output(self, output_file: input_schema.NodeOutput):
    """Adds an output node to write the final data to a destination.

    Args:
        output_file: The settings for the output file.
    """

    def _func(df: FlowDataEngine):
        output_file.output_settings.populate_abs_file_path()
        execute_remote = self.execution_location != 'local'
        df.output(output_fs=output_file.output_settings, flow_id=self.flow_id, node_id=output_file.node_id,
                  execute_remote=execute_remote)
        return df

    def schema_callback():
        input_node: FlowNode = self.get_node(output_file.node_id).node_inputs.main_inputs[0]

        return input_node.schema
    input_node_id = getattr(output_file, "depending_on_id") if hasattr(output_file, 'depending_on_id') else None
    self.add_node_step(node_id=output_file.node_id,
                       function=_func,
                       input_columns=[],
                       node_type='output',
                       setting_input=output_file,
                       schema_callback=schema_callback,
                       input_node_ids=[input_node_id])
add_pivot(pivot_settings)

Adds a pivot node to the graph.

Parameters:

Name Type Description Default
pivot_settings NodePivot

The settings for the pivot operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_pivot(self, pivot_settings: input_schema.NodePivot):
    """Adds a pivot node to the graph.

    Args:
        pivot_settings: The settings for the pivot operation.
    """

    def _func(fl: FlowDataEngine):
        return fl.do_pivot(pivot_settings.pivot_input, self.flow_logger.get_node_logger(pivot_settings.node_id))

    self.add_node_step(node_id=pivot_settings.node_id,
                       function=_func,
                       node_type='pivot',
                       setting_input=pivot_settings,
                       input_node_ids=[pivot_settings.depending_on_id])

    node = self.get_node(pivot_settings.node_id)

    def schema_callback():
        input_data = node.singular_main_input.get_resulting_data()  # get from the previous step the data
        input_data.lazy = True  # ensure the dataset is lazy
        input_lf = input_data.data_frame  # get the lazy frame
        return pre_calculate_pivot_schema(input_data.schema, pivot_settings.pivot_input, input_lf=input_lf)
    node.schema_callback = schema_callback
add_polars_code(node_polars_code)

Adds a node that executes custom Polars code.

Parameters:

Name Type Description Default
node_polars_code NodePolarsCode

The settings for the Polars code node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_polars_code(self, node_polars_code: input_schema.NodePolarsCode):
    """Adds a node that executes custom Polars code.

    Args:
        node_polars_code: The settings for the Polars code node.
    """

    def _func(*flowfile_tables: FlowDataEngine) -> FlowDataEngine:
        return execute_polars_code(*flowfile_tables, code=node_polars_code.polars_code_input.polars_code)
    self.add_node_step(node_id=node_polars_code.node_id,
                       function=_func,
                       node_type='polars_code',
                       setting_input=node_polars_code,
                       input_node_ids=node_polars_code.depending_on_ids)

    try:
        polars_code_parser.validate_code(node_polars_code.polars_code_input.polars_code)
    except Exception as e:
        node = self.get_node(node_id=node_polars_code.node_id)
        node.results.errors = str(e)
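
Usage sketch (illustrative): the NodePolarsCode constructor keywords and the nested polars_code_input model are assumptions based on the attributes read above, and the variable names available inside the code string (input_df, output_df) are assumed conventions rather than confirmed API.

from flowfile_core.schemas import input_schema  # import path assumed

polars_node = input_schema.NodePolarsCode(
    flow_id=graph.flow_id,
    node_id=6,
    depending_on_ids=[4],                              # upstream node(s) fed into the code
    polars_code_input=input_schema.PolarsCodeInput(    # model name assumed
        polars_code="output_df = input_df.filter(pl.col('population') > 0)",  # variable names assumed
    ),
)
graph.add_polars_code(polars_node)
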
add_read(input_file)

Adds a node to read data from a local file (e.g., CSV, Parquet, Excel).

Parameters:

Name Type Description Default
input_file NodeRead

The settings for the read operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_read(self, input_file: input_schema.NodeRead):
    """Adds a node to read data from a local file (e.g., CSV, Parquet, Excel).

    Args:
        input_file: The settings for the read operation.
    """

    if input_file.received_file.file_type in ('xlsx', 'excel') and input_file.received_file.sheet_name == '':
        sheet_name = fastexcel.read_excel(input_file.received_file.path).sheet_names[0]
        input_file.received_file.sheet_name = sheet_name

    received_file = input_file.received_file
    input_file.received_file.set_absolute_filepath()

    def _func():
        input_file.received_file.set_absolute_filepath()
        if input_file.received_file.file_type == 'parquet':
            input_data = FlowDataEngine.create_from_path(input_file.received_file)
        elif input_file.received_file.file_type == 'csv' and 'utf' in input_file.received_file.encoding:
            input_data = FlowDataEngine.create_from_path(input_file.received_file)
        else:
            input_data = FlowDataEngine.create_from_path_worker(input_file.received_file,
                                                                node_id=input_file.node_id,
                                                                flow_id=self.flow_id)
        input_data.name = input_file.received_file.name
        return input_data

    node = self.get_node(input_file.node_id)
    schema_callback = None
    if node:
        start_hash = node.hash
        node.node_type = 'read'
        node.name = 'read'
        node.function = _func
        node.setting_input = input_file
        if input_file.node_id not in set(start_node.node_id for start_node in self._flow_starts):
            self._flow_starts.append(node)

        if start_hash != node.hash:
            logger.info('Hash changed, updating schema')
            if len(received_file.fields) > 0:
                # If the file has fields defined, we can use them to create the schema
                def schema_callback():
                    return [FlowfileColumn.from_input(f.name, f.data_type) for f in received_file.fields]

            elif input_file.received_file.file_type in ('csv', 'json', 'parquet'):
                # everything that can be scanned by polars
                def schema_callback():
                    input_data = FlowDataEngine.create_from_path(input_file.received_file)
                    return input_data.schema

            elif input_file.received_file.file_type in ('xlsx', 'excel'):
                # If the file is an Excel file, we need to use the openpyxl engine to read the schema
                schema_callback = get_xlsx_schema_callback(engine='openpyxl',
                                                           file_path=received_file.file_path,
                                                           sheet_name=received_file.sheet_name,
                                                           start_row=received_file.start_row,
                                                           end_row=received_file.end_row,
                                                           start_column=received_file.start_column,
                                                           end_column=received_file.end_column,
                                                           has_headers=received_file.has_headers)
            else:
                schema_callback = None
    else:
        node = FlowNode(input_file.node_id, function=_func,
                        setting_input=input_file,
                        name='read', node_type='read', parent_uuid=self.uuid)
        self._node_db[input_file.node_id] = node
        self._flow_starts.append(node)
        self._node_ids.append(input_file.node_id)

    if schema_callback is not None:
        node.schema_callback = schema_callback
    return self
add_record_count(node_number_of_records)

Adds a record count node to the graph.

Parameters:

Name Type Description Default
node_number_of_records NodeRecordCount

The settings for the record count operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_record_count(self, node_number_of_records: input_schema.NodeRecordCount):
    """Adds a filter node to the graph.

    Args:
        node_number_of_records: The settings for the record count operation.
    """

    def _func(fl: FlowDataEngine) -> FlowDataEngine:
        return fl.get_record_count()

    self.add_node_step(node_id=node_number_of_records.node_id,
                       function=_func,
                       node_type='record_count',
                       setting_input=node_number_of_records,
                       input_node_ids=[node_number_of_records.depending_on_id])
add_record_id(record_id_settings)

Adds a node to create a new column with a unique ID for each record.

Parameters:

Name Type Description Default
record_id_settings NodeRecordId

The settings object specifying the name of the new record ID column.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_record_id(self, record_id_settings: input_schema.NodeRecordId) -> "FlowGraph":
    """Adds a node to create a new column with a unique ID for each record.

    Args:
        record_id_settings: The settings object specifying the name of the
            new record ID column.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(table: FlowDataEngine) -> FlowDataEngine:
        return table.add_record_id(record_id_settings.record_id_input)

    self.add_node_step(node_id=record_id_settings.node_id,
                       function=_func,
                       node_type='record_id',
                       setting_input=record_id_settings,
                       input_node_ids=[record_id_settings.depending_on_id]
                       )
    return self
add_sample(sample_settings)

Adds a node to take a random or top-N sample of the data.

Parameters:

Name Type Description Default
sample_settings NodeSample

The settings object specifying the size of the sample.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_sample(self, sample_settings: input_schema.NodeSample) -> "FlowGraph":
    """Adds a node to take a random or top-N sample of the data.

    Args:
        sample_settings: The settings object specifying the size of the sample.

    Returns:
        The `FlowGraph` instance for method chaining.
    """
    def _func(table: FlowDataEngine) -> FlowDataEngine:
        return table.get_sample(sample_settings.sample_size)

    self.add_node_step(node_id=sample_settings.node_id,
                       function=_func,
                       node_type='sample',
                       setting_input=sample_settings,
                       input_node_ids=[sample_settings.depending_on_id]
                       )
    return self
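
Usage sketch (illustrative): the NodeSample keywords are assumptions based on the attributes accessed above (sample_size, depending_on_id); graph is assumed to be an existing FlowGraph.

from flowfile_core.schemas import input_schema  # import path assumed

sample_node = input_schema.NodeSample(
    flow_id=graph.flow_id,
    node_id=8,
    depending_on_id=4,
    sample_size=1000,          # keep at most 1000 rows
)
graph.add_sample(sample_node)
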
add_select(select_settings)

Adds a node to select, rename, reorder, or drop columns.

Parameters:

Name Type Description Default
select_settings NodeSelect

The settings for the select operation.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_select(self, select_settings: input_schema.NodeSelect) -> "FlowGraph":
    """Adds a node to select, rename, reorder, or drop columns.

    Args:
        select_settings: The settings for the select operation.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    select_cols = select_settings.select_input
    drop_cols = tuple(s.old_name for s in select_settings.select_input)

    def _func(table: FlowDataEngine) -> FlowDataEngine:
        input_cols = set(f.name for f in table.schema)
        ids_to_remove = []
        for i, select_col in enumerate(select_cols):
            if select_col.data_type is None:
                select_col.data_type = table.get_schema_column(select_col.old_name).data_type
            if select_col.old_name not in input_cols:
                select_col.is_available = False
                if not select_col.keep:
                    ids_to_remove.append(i)
            else:
                select_col.is_available = True
        ids_to_remove.reverse()
        for i in ids_to_remove:
            v = select_cols.pop(i)
            del v
        return table.do_select(select_inputs=transform_schema.SelectInputs(select_cols),
                               keep_missing=select_settings.keep_missing)

    self.add_node_step(node_id=select_settings.node_id,
                       function=_func,
                       input_columns=[],
                       node_type='select',
                       drop_columns=list(drop_cols),
                       setting_input=select_settings,
                       input_node_ids=[select_settings.depending_on_id])
    return self
add_sort(sort_settings)

Adds a node to sort the data based on one or more columns.

Parameters:

Name Type Description Default
sort_settings NodeSort

The settings for the sort operation.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_sort(self, sort_settings: input_schema.NodeSort) -> "FlowGraph":
    """Adds a node to sort the data based on one or more columns.

    Args:
        sort_settings: The settings for the sort operation.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(table: FlowDataEngine) -> FlowDataEngine:
        return table.do_sort(sort_settings.sort_input)

    self.add_node_step(node_id=sort_settings.node_id,
                       function=_func,
                       node_type='sort',
                       setting_input=sort_settings,
                       input_node_ids=[sort_settings.depending_on_id])
    return self
add_sql_source(external_source_input)

Adds a node that reads data from a SQL source.

This is a convenience alias for add_external_source.

Parameters:

Name Type Description Default
external_source_input NodeExternalSource

The settings for the external SQL source node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_sql_source(self, external_source_input: input_schema.NodeExternalSource):
    """Adds a node that reads data from a SQL source.

    This is a convenience alias for `add_external_source`.

    Args:
        external_source_input: The settings for the external SQL source node.
    """
    logger.info('Adding sql source')
    self.add_external_source(external_source_input)
add_text_to_rows(node_text_to_rows)

Adds a node that splits cell values into multiple rows.

This is useful for un-nesting data where a single field contains multiple values separated by a delimiter.

Parameters:

Name Type Description Default
node_text_to_rows NodeTextToRows

The settings object that specifies the column to split and the delimiter to use.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_text_to_rows(self, node_text_to_rows: input_schema.NodeTextToRows) -> "FlowGraph":
    """Adds a node that splits cell values into multiple rows.

    This is useful for un-nesting data where a single field contains multiple
    values separated by a delimiter.

    Args:
        node_text_to_rows: The settings object that specifies the column to split
            and the delimiter to use.

    Returns:
        The `FlowGraph` instance for method chaining.
    """
    def _func(table: FlowDataEngine) -> FlowDataEngine:
        return table.split(node_text_to_rows.text_to_rows_input)

    self.add_node_step(node_id=node_text_to_rows.node_id,
                       function=_func,
                       node_type='text_to_rows',
                       setting_input=node_text_to_rows,
                       input_node_ids=[node_text_to_rows.depending_on_id])
    return self
add_union(union_settings)

Adds a union node to combine multiple data streams.

Parameters:

Name Type Description Default
union_settings NodeUnion

The settings for the union operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_union(self, union_settings: input_schema.NodeUnion):
    """Adds a union node to combine multiple data streams.

    Args:
        union_settings: The settings for the union operation.
    """

    def _func(*flowfile_tables: FlowDataEngine):
        dfs: List[pl.LazyFrame] | List[pl.DataFrame] = [flt.data_frame for flt in flowfile_tables]
        return FlowDataEngine(pl.concat(dfs, how='diagonal_relaxed'))

    self.add_node_step(node_id=union_settings.node_id,
                       function=_func,
                       node_type=f'union',
                       setting_input=union_settings,
                       input_node_ids=union_settings.depending_on_ids)
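
Usage sketch (illustrative): the NodeUnion keywords are assumptions, with depending_on_ids mirroring the attribute used above; the streams are concatenated with Polars' diagonal_relaxed strategy, so differing schemas are reconciled where possible.

from flowfile_core.schemas import input_schema  # import path assumed

union_node = input_schema.NodeUnion(
    flow_id=graph.flow_id,
    node_id=10,
    depending_on_ids=[4, 8],   # streams to stack on top of each other
)
graph.add_union(union_node)
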
add_unique(unique_settings)

Adds a node to find and remove duplicate rows.

Parameters:

Name Type Description Default
unique_settings NodeUnique

The settings for the unique operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_unique(self, unique_settings: input_schema.NodeUnique):
    """Adds a node to find and remove duplicate rows.

    Args:
        unique_settings: The settings for the unique operation.
    """

    def _func(fl: FlowDataEngine) -> FlowDataEngine:
        return fl.make_unique(unique_settings.unique_input)

    self.add_node_step(node_id=unique_settings.node_id,
                       function=_func,
                       input_columns=[],
                       node_type='unique',
                       setting_input=unique_settings,
                       input_node_ids=[unique_settings.depending_on_id])
add_unpivot(unpivot_settings)

Adds an unpivot node to the graph.

Parameters:

Name Type Description Default
unpivot_settings NodeUnpivot

The settings for the unpivot operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_unpivot(self, unpivot_settings: input_schema.NodeUnpivot):
    """Adds an unpivot node to the graph.

    Args:
        unpivot_settings: The settings for the unpivot operation.
    """

    def _func(fl: FlowDataEngine) -> FlowDataEngine:
        return fl.unpivot(unpivot_settings.unpivot_input)

    self.add_node_step(node_id=unpivot_settings.node_id,
                       function=_func,
                       node_type='unpivot',
                       setting_input=unpivot_settings,
                       input_node_ids=[unpivot_settings.depending_on_id])
apply_layout(y_spacing=150, x_spacing=200, initial_y=100)

Calculates and applies a layered layout to all nodes in the graph.

This updates their x and y positions for UI rendering.

Parameters:

Name Type Description Default
y_spacing int

The vertical spacing between layers.

150
x_spacing int

The horizontal spacing between nodes in the same layer.

200
initial_y int

The initial y-position for the first layer.

100
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def apply_layout(self, y_spacing: int = 150, x_spacing: int = 200, initial_y: int = 100):
    """Calculates and applies a layered layout to all nodes in the graph.

    This updates their x and y positions for UI rendering.

    Args:
        y_spacing: The vertical spacing between layers.
        x_spacing: The horizontal spacing between nodes in the same layer.
        initial_y: The initial y-position for the first layer.
    """
    self.flow_logger.info("Applying layered layout...")
    start_time = time()
    try:
        # Calculate new positions for all nodes
        new_positions = calculate_layered_layout(
            self, y_spacing=y_spacing, x_spacing=x_spacing, initial_y=initial_y
        )

        if not new_positions:
            self.flow_logger.warning("Layout calculation returned no positions.")
            return

        # Apply the new positions to the setting_input of each node
        updated_count = 0
        for node_id, (pos_x, pos_y) in new_positions.items():
            node = self.get_node(node_id)
            if node and hasattr(node, 'setting_input'):
                setting = node.setting_input
                if hasattr(setting, 'pos_x') and hasattr(setting, 'pos_y'):
                    setting.pos_x = pos_x
                    setting.pos_y = pos_y
                    updated_count += 1
                else:
                    self.flow_logger.warning(f"Node {node_id} setting_input ({type(setting)}) lacks pos_x/pos_y attributes.")
            elif node:
                self.flow_logger.warning(f"Node {node_id} lacks setting_input attribute.")
            # else: Node not found, already warned by calculate_layered_layout

        end_time = time()
        self.flow_logger.info(f"Layout applied to {updated_count}/{len(self.nodes)} nodes in {end_time - start_time:.2f} seconds.")

    except Exception as e:
        self.flow_logger.error(f"Error applying layout: {e}")
        raise  # Optional: re-raise the exception
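
Usage sketch: the defaults match the signature above; tighter spacing simply packs the UI layout closer together (graph is assumed to be an existing FlowGraph).

# Recompute node positions for the UI with slightly tighter spacing than the defaults.
graph.apply_layout(y_spacing=120, x_spacing=180, initial_y=80)
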
cancel()

Cancels an ongoing graph execution.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def cancel(self):
    """Cancels an ongoing graph execution."""

    if not self.flow_settings.is_running:
        return
    self.flow_settings.is_canceled = True
    for node in self.nodes:
        node.cancel()
close_flow()

Performs cleanup operations, such as clearing node caches.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def close_flow(self):
    """Performs cleanup operations, such as clearing node caches."""

    for node in self.nodes:
        node.remove_cache()
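
Usage sketch: cancel() is a no-op unless the flow is currently running, and close_flow() clears per-node caches once you are done with the graph.

# Stop a running execution (ignored if nothing is running), then release cached results.
graph.cancel()
graph.close_flow()
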
copy_node(new_node_settings, existing_setting_input, node_type)

Creates a copy of an existing node.

Parameters:

Name Type Description Default
new_node_settings NodePromise

The promise containing new settings (like ID and position).

required
existing_setting_input Any

The settings object from the node being copied.

required
node_type str

The type of the node being copied.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def copy_node(self, new_node_settings: input_schema.NodePromise, existing_setting_input: Any, node_type: str) -> None:
    """Creates a copy of an existing node.

    Args:
        new_node_settings: The promise containing new settings (like ID and position).
        existing_setting_input: The settings object from the node being copied.
        node_type: The type of the node being copied.
    """
    self.add_node_promise(new_node_settings)

    if isinstance(existing_setting_input, input_schema.NodePromise):
        return

    combined_settings = combine_existing_settings_and_new_settings(
        existing_setting_input, new_node_settings
    )
    getattr(self, f"add_{node_type}")(combined_settings)
delete_node(node_id)

Deletes a node from the graph and updates all its connections.

Parameters:

Name Type Description Default
node_id Union[int, str]

The ID of the node to delete.

required

Raises:

Type Description
Exception

If the node with the given ID does not exist.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def delete_node(self, node_id: Union[int, str]):
    """Deletes a node from the graph and updates all its connections.

    Args:
        node_id: The ID of the node to delete.

    Raises:
        Exception: If the node with the given ID does not exist.
    """
    logger.info(f"Starting deletion of node with ID: {node_id}")

    node = self._node_db.get(node_id)
    if node:
        logger.info(f"Found node: {node_id}, processing deletion")

        lead_to_steps: List[FlowNode] = node.leads_to_nodes
        logger.debug(f"Node {node_id} leads to {len(lead_to_steps)} other nodes")

        if len(lead_to_steps) > 0:
            for lead_to_step in lead_to_steps:
                logger.debug(f"Deleting input node {node_id} from dependent node {lead_to_step}")
                lead_to_step.delete_input_node(node_id, complete=True)

        if not node.is_start:
            depends_on: List[FlowNode] = node.node_inputs.get_all_inputs()
            logger.debug(f"Node {node_id} depends on {len(depends_on)} other nodes")

            for depend_on in depends_on:
                logger.debug(f"Removing lead_to reference {node_id} from node {depend_on}")
                depend_on.delete_lead_to_node(node_id)

        self._node_db.pop(node_id)
        logger.debug(f"Successfully removed node {node_id} from node_db")
        del node
        logger.info("Node object deleted")
    else:
        logger.error(f"Failed to find node with id {node_id}")
        raise Exception(f"Node with id {node_id} does not exist")
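
Usage sketch: deleting a node also detaches it from every upstream and downstream connection; a missing id raises, so guard with get_node when unsure.

# Remove node 8 and rewire its neighbours; raises if the id does not exist.
if graph.get_node(8) is not None:
    graph.delete_node(8)
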
generate_code()

Generates code for the flow graph. This method exports the flow graph to a Polars-compatible format.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def generate_code(self):
    """Generates code for the flow graph.
    This method exports the flow graph to a Polars-compatible format.
    """
    from flowfile_core.flowfile.code_generator.code_generator import export_flow_to_polars
    print(export_flow_to_polars(self))
get_frontend_data()

Formats the graph structure into a JSON-like dictionary for a specific legacy frontend.

This method transforms the graph's state into a format compatible with the Drawflow.js library.

Returns:

Type Description
dict

A dictionary representing the graph in Drawflow format.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def get_frontend_data(self) -> dict:
    """Formats the graph structure into a JSON-like dictionary for a specific legacy frontend.

    This method transforms the graph's state into a format compatible with the
    Drawflow.js library.

    Returns:
        A dictionary representing the graph in Drawflow format.
    """
    result = {
        'Home': {
            "data": {}
        }
    }
    flow_info: schemas.FlowInformation = self.get_node_storage()

    for node_id, node_info in flow_info.data.items():
        if node_info.is_setup:
            try:
                pos_x = node_info.data.pos_x
                pos_y = node_info.data.pos_y
                # Basic node structure
                result["Home"]["data"][str(node_id)] = {
                    "id": node_info.id,
                    "name": node_info.type,
                    "data": {},  # Additional data can go here
                    "class": node_info.type,
                    "html": node_info.type,
                    "typenode": "vue",
                    "inputs": {},
                    "outputs": {},
                    "pos_x": pos_x,
                    "pos_y": pos_y
                }
            except Exception as e:
                logger.error(e)
        # Add outputs to the node based on `outputs` in your backend data
        if node_info.outputs:
            outputs = {o: 0 for o in node_info.outputs}
            for o in node_info.outputs:
                outputs[o] += 1
            connections = []
            for output_node_id, n_connections in outputs.items():
                leading_to_node = self.get_node(output_node_id)
                input_types = leading_to_node.get_input_type(node_info.id)
                for input_type in input_types:
                    if input_type == 'main':
                        input_frontend_id = 'input_1'
                    elif input_type == 'right':
                        input_frontend_id = 'input_2'
                    elif input_type == 'left':
                        input_frontend_id = 'input_3'
                    else:
                        input_frontend_id = 'input_1'
                    connection = {"node": str(output_node_id), "input": input_frontend_id}
                    connections.append(connection)

            result["Home"]["data"][str(node_id)]["outputs"]["output_1"] = {
                "connections": connections}
        else:
            result["Home"]["data"][str(node_id)]["outputs"] = {"output_1": {"connections": []}}

        # Add input to the node based on `depending_on_id` in your backend data
        if node_info.left_input_id is not None or node_info.right_input_id is not None or node_info.input_ids is not None:
            main_inputs = node_info.main_input_ids
            result["Home"]["data"][str(node_id)]["inputs"]["input_1"] = {
                "connections": [{"node": str(main_node_id), "input": "output_1"} for main_node_id in main_inputs]
            }
            if node_info.right_input_id is not None:
                result["Home"]["data"][str(node_id)]["inputs"]["input_2"] = {
                    "connections": [{"node": str(node_info.right_input_id), "input": "output_1"}]
                }
            if node_info.left_input_id is not None:
                result["Home"]["data"][str(node_id)]["inputs"]["input_3"] = {
                    "connections": [{"node": str(node_info.left_input_id), "input": "output_1"}]
                }
    return result
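For orientation, the returned dictionary follows the Drawflow layout sketched below (node IDs, names, and positions are illustrative placeholders, not values from a real flow):

    {
        "Home": {
            "data": {
                "1": {
                    "id": 1,
                    "name": "manual_input",
                    "data": {},
                    "class": "manual_input",
                    "html": "manual_input",
                    "typenode": "vue",
                    "inputs": {},
                    "outputs": {"output_1": {"connections": [{"node": "2", "input": "input_1"}]}},
                    "pos_x": 100,
                    "pos_y": 200
                }
            }
        }
    }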
get_implicit_starter_nodes()

Finds nodes that can act as starting points but are not explicitly defined as such.

Some nodes, like the Polars Code node, can function without an input. This method identifies such nodes if they have no incoming connections.

Returns:

    List[FlowNode]: A list of FlowNode objects that are implicit starting nodes.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def get_implicit_starter_nodes(self) -> List[FlowNode]:
    """Finds nodes that can act as starting points but are not explicitly defined as such.

    Some nodes, like the Polars Code node, can function without an input. This
    method identifies such nodes if they have no incoming connections.

    Returns:
        A list of `FlowNode` objects that are implicit starting nodes.
    """
    starting_node_ids = [node.node_id for node in self._flow_starts]
    implicit_starting_nodes = []
    for node in self.nodes:
        if node.node_template.can_be_start and not node.has_input and node.node_id not in starting_node_ids:
            implicit_starting_nodes.append(node)
    return implicit_starting_nodes
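A quick way to inspect which nodes would be picked up as implicit starts (assuming graph is an existing FlowGraph; a Polars Code node without inputs is a typical example):

    for node in graph.get_implicit_starter_nodes():
        print(node.node_id, node.name)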
get_node(node_id=None)

Retrieves a node from the graph by its ID.

Parameters:

    node_id (Union[int, str], default None): The ID of the node to retrieve. If None, retrieves the last added node.

Returns:

    FlowNode | None: The FlowNode object, or None if not found.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def get_node(self, node_id: Union[int, str] = None) -> FlowNode | None:
    """Retrieves a node from the graph by its ID.

    Args:
        node_id: The ID of the node to retrieve. If None, retrieves the last added node.

    Returns:
        The FlowNode object, or None if not found.
    """
    if node_id is None:
        node_id = self._node_ids[-1]
    node = self._node_db.get(node_id)
    if node is not None:
        return node
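A minimal lookup sketch (assuming graph is an existing FlowGraph with at least one node):

    node = graph.get_node(1)        # fetch the node with ID 1, or None if it does not exist
    last_added = graph.get_node()   # omit the ID to fetch the most recently added node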
get_node_data(node_id, include_example=True)

Retrieves all data needed to render a node in the UI.

Parameters:

    node_id (int, required): The ID of the node.
    include_example (bool, default True): Whether to include data samples in the result.

Returns:

    NodeData: A NodeData object, or None if the node is not found.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def get_node_data(self, node_id: int, include_example: bool = True) -> NodeData:
    """Retrieves all data needed to render a node in the UI.

    Args:
        node_id: The ID of the node.
        include_example: Whether to include data samples in the result.

    Returns:
        A NodeData object, or None if the node is not found.
    """
    node = self._node_db[node_id]
    return node.get_node_data(flow_id=self.flow_id, include_example=include_example)
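A hedged example of fetching the UI payload for a node (assuming graph contains a node with ID 3; skipping the data sample keeps the call cheap). Note that the implementation indexes the internal node store directly, so passing an unknown ID raises a KeyError rather than returning None:

    node_data = graph.get_node_data(3, include_example=False)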
get_node_storage()

Serializes the entire graph's state into a storable format.

Returns:

    FlowInformation: A FlowInformation object representing the complete graph.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def get_node_storage(self) -> schemas.FlowInformation:
    """Serializes the entire graph's state into a storable format.

    Returns:
        A FlowInformation object representing the complete graph.
    """
    node_information = {node.node_id: node.get_node_information() for
                        node in self.nodes if node.is_setup and node.is_correct}

    return schemas.FlowInformation(flow_id=self.flow_id,
                                   flow_name=self.__name__,
                                   flow_settings=self.flow_settings,
                                   data=node_information,
                                   node_starts=[v.node_id for v in self._flow_starts],
                                   node_connections=self.node_connections
                                   )
get_nodes_overview()

Gets a list of dictionary representations for all nodes in the graph.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def get_nodes_overview(self):
    """Gets a list of dictionary representations for all nodes in the graph."""
    output = []
    for v in self._node_db.values():
        output.append(v.get_repr())
    return output
get_run_info()

Gets a summary of the most recent graph execution.

Returns:

    RunInformation: A RunInformation object with details about the last run.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def get_run_info(self) -> RunInformation:
    """Gets a summary of the most recent graph execution.

    Returns:
        A RunInformation object with details about the last run.
    """
    if self.latest_run_info is None:
        node_results = self.node_results
        success = all(nr.success for nr in node_results)
        self.latest_run_info = RunInformation(start_time=self.start_datetime, end_time=self.end_datetime,
                                              success=success,
                                              node_step_result=node_results, flow_id=self.flow_id,
                                              nodes_completed=self.nodes_completed,
                                              number_of_nodes=len(self.nodes))
    elif self.latest_run_info.nodes_completed != self.nodes_completed:
        node_results = self.node_results
        self.latest_run_info = RunInformation(start_time=self.start_datetime, end_time=self.end_datetime,
                                              success=all(nr.success for nr in node_results),
                                              node_step_result=node_results, flow_id=self.flow_id,
                                              nodes_completed=self.nodes_completed,
                                              number_of_nodes=len(self.nodes))
    return self.latest_run_info
get_vue_flow_input()

Formats the graph's nodes and edges into a schema suitable for the VueFlow frontend.

Returns:

    VueFlowInput: A VueFlowInput object.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def get_vue_flow_input(self) -> schemas.VueFlowInput:
    """Formats the graph's nodes and edges into a schema suitable for the VueFlow frontend.

    Returns:
        A VueFlowInput object.
    """
    edges: List[schemas.NodeEdge] = []
    nodes: List[schemas.NodeInput] = []
    for node in self.nodes:
        nodes.append(node.get_node_input())
        edges.extend(node.get_edge_input())
    return schemas.VueFlowInput(node_edges=edges, node_inputs=nodes)
print_tree(show_schema=False, show_descriptions=False)

Prints the flow graph as a tree of operations and node IDs.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def print_tree(self, show_schema=False, show_descriptions=False):
    """
    Print flow_graph as a tree.
    """
    max_node_id = max(self._node_db.keys())

    tree = ""
    tabs = 0
    tab_counter = 0
    for node in self.nodes:
        tab_counter += 1
        node_input = node.setting_input
        operation = str(self._node_db[node_input.node_id]).split("(")[1][:-1].replace("_", " ").title()

        if operation == "Formula":
            operation = "With Columns"

        tree += str(operation) + " (id=" + str(node_input.node_id) + ")"

        if show_descriptions & show_schema:
            raise ValueError('show_descriptions and show_schema cannot be True simultaneously')
        if show_descriptions:
            tree += ": " + str(node_input.description)
        elif show_schema:
            tree += " -> ["
            if operation == "Manual Input":
                schema = ", ".join([str(i.name) + ": " + str(i.data_type) for i in node_input.raw_data_format.columns])
                tree += schema
            elif operation == "With Columns":
                tree_with_col_schema = ", " + node_input.function.field.name + ": " + node_input.function.field.data_type
                tree += schema + tree_with_col_schema
            elif operation == "Filter":
                index = node_input.filter_input.advanced_filter.find("]")
                filtered_column = str(node_input.filter_input.advanced_filter[1:index])
                schema = re.sub('({str(filtered_column)}: [A-Za-z0-9]+\,\s)', "", schema)
                tree += schema
            elif operation == "Group By":
                for col in node_input.groupby_input.agg_cols:
                    schema = re.sub(str(col.old_name) + ': [a-z0-9]+\, ', "", schema)
                tree += schema
            tree += "]"
        else:
            if operation == "Manual Input":
                tree += ": " + str(node_input.raw_data_format.data)
            elif operation == "With Columns":
                tree += ": " + str(node_input.function)
            elif operation == "Filter":
                tree += ": " + str(node_input.filter_input.advanced_filter)
            elif operation == "Group By":
                tree += ": groupby=[" + ", ".join([col.old_name for col in node_input.groupby_input.agg_cols if col.agg == "groupby"]) + "], "
                tree += "agg=[" + ", ".join([str(col.agg) + "(" + str(col.old_name) + ")" for col in node_input.groupby_input.agg_cols if col.agg != "groupby"]) + "]"

        if node_input.node_id < max_node_id:
            tree += "\n" + "# " + " "*3*(tabs-1) + "|___ "
        print("\n"*2)

    return print(tree)
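Example calls (assuming graph is a configured FlowGraph; show_schema and show_descriptions are mutually exclusive and raise a ValueError when combined):

    graph.print_tree()                        # operations and node IDs only
    graph.print_tree(show_descriptions=True)  # include node descriptions
    graph.print_tree(show_schema=True)        # include inferred column schemas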
remove_from_output_cols(columns)

Removes specified columns from the list of expected output columns.

Parameters:

    columns (List[str], required): A list of column names to remove.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def remove_from_output_cols(self, columns: List[str]):
    """Removes specified columns from the list of expected output columns.

    Args:
        columns: A list of column names to remove.
    """
    cols = set(columns)
    self._output_cols = [c for c in self._output_cols if c not in cols]
reset()

Forces a deep reset on all nodes in the graph.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def reset(self):
    """Forces a deep reset on all nodes in the graph."""

    for node in self.nodes:
        node.reset(True)
run_graph()

Executes the entire data flow graph from start to finish.

It determines the correct execution order, runs each node, collects results, and handles errors and cancellations.

Returns:

    RunInformation | None: A RunInformation object summarizing the execution results.

Raises:

    Exception: If the flow is already running.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def run_graph(self) -> RunInformation | None:
    """Executes the entire data flow graph from start to finish.

    It determines the correct execution order, runs each node,
    collects results, and handles errors and cancellations.

    Returns:
        A RunInformation object summarizing the execution results.

    Raises:
        Exception: If the flow is already running.
    """
    if self.flow_settings.is_running:
        raise Exception('Flow is already running')
    try:
        self.flow_settings.is_running = True
        self.flow_settings.is_canceled = False
        self.flow_logger.clear_log_file()
        self.nodes_completed = 0
        self.node_results = []
        self.start_datetime = datetime.datetime.now()
        self.end_datetime = None
        self.latest_run_info = None
        self.flow_logger.info('Starting to run flowfile flow...')
        skip_nodes = [node for node in self.nodes if not node.is_correct]
        skip_nodes.extend([lead_to_node for node in skip_nodes for lead_to_node in node.leads_to_nodes])
        execution_order = determine_execution_order(all_nodes=[node for node in self.nodes if
                                                               node not in skip_nodes],
                                                    flow_starts=self._flow_starts+self.get_implicit_starter_nodes())
        skip_node_message(self.flow_logger, skip_nodes)
        execution_order_message(self.flow_logger, execution_order)
        performance_mode = self.flow_settings.execution_mode == 'Performance'
        if self.flow_settings.execution_location == 'local':
            OFFLOAD_TO_WORKER.value = False
        elif self.flow_settings.execution_location == 'remote':
            OFFLOAD_TO_WORKER.value = True
        for node in execution_order:
            node_logger = self.flow_logger.get_node_logger(node.node_id)
            if self.flow_settings.is_canceled:
                self.flow_logger.info('Flow canceled')
                break
            if node in skip_nodes:
                node_logger.info(f'Skipping node {node.node_id}')
                continue
            node_result = NodeResult(node_id=node.node_id, node_name=node.name)
            self.node_results.append(node_result)
            logger.info(f'Starting to run: node {node.node_id}, start time: {node_result.start_timestamp}')
            node.execute_node(run_location=self.flow_settings.execution_location,
                              performance_mode=performance_mode,
                              node_logger=node_logger)
            try:
                node_result.error = str(node.results.errors)
                if self.flow_settings.is_canceled:
                    node_result.success = None
                    node_result.is_running = False
                    continue
                node_result.success = node.results.errors is None
                node_result.end_timestamp = time()
                node_result.run_time = int(node_result.end_timestamp - node_result.start_timestamp)
                node_result.is_running = False
            except Exception as e:
                node_result.error = 'Node did not run'
                node_result.success = False
                node_result.end_timestamp = time()
                node_result.run_time = int(node_result.end_timestamp - node_result.start_timestamp)
                node_result.is_running = False
                node_logger.error(f'Error in node {node.node_id}: {e}')
            if not node_result.success:
                skip_nodes.extend(list(node.get_all_dependent_nodes()))
            node_logger.info(f'Completed node with success: {node_result.success}')
            self.nodes_completed += 1
        self.flow_logger.info('Flow completed!')
        self.end_datetime = datetime.datetime.now()
        self.flow_settings.is_running = False
        if self.flow_settings.is_canceled:
            self.flow_logger.info('Flow canceled')
        return self.get_run_info()
    except Exception as e:
        raise e
    finally:
        self.flow_settings.is_running = False
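A minimal end-to-end sketch (assuming graph is a fully configured FlowGraph; run_graph raises if the flow is already running and returns the same summary exposed by get_run_info):

    run_info = graph.run_graph()
    print(run_info.success, run_info.nodes_completed, run_info.number_of_nodes)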
save_flow(flow_path)

Saves the current state of the flow graph to a file.

Parameters:

    flow_path (str, required): The path where the flow file will be saved.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def save_flow(self, flow_path: str):
    """Saves the current state of the flow graph to a file.

    Args:
        flow_path: The path where the flow file will be saved.
    """
    with open(flow_path, 'wb') as f:
        pickle.dump(self.get_node_storage(), f)
    self.flow_settings.path = flow_path
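Saving is a straightforward pickle of the serialized graph state (the .flowfile extension below is only an illustrative convention, not a requirement):

    graph.save_flow("my_pipeline.flowfile")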

FlowNode

The FlowNode represents a single operation in the FlowGraph. Each node corresponds to a specific transformation or action, such as filtering or grouping data.

flowfile_core.flowfile.flow_node.flow_node.FlowNode

Represents a single node in a data flow graph.

This class manages the node's state, its data processing function, and its connections to other nodes within the graph.

Methods:

Name Description
__call__

Makes the node instance callable, acting as an alias for execute_node.

__init__

Initializes a FlowNode instance.

__repr__

Provides a string representation of the FlowNode instance.

add_lead_to_in_depend_source

Ensures this node is registered in the leads_to_nodes list of its inputs.

add_node_connection

Adds a connection from a source node to this node.

calculate_hash

Calculates a hash based on settings and input node hashes.

cancel

Cancels an ongoing external process if one is running.

create_schema_callback_from_function

Wraps a node's function to create a schema callback that extracts the schema.

delete_input_node

Removes a connection from a specific input node.

delete_lead_to_node

Removes a connection to a specific downstream node.

evaluate_nodes

Triggers a state reset for all directly connected downstream nodes.

execute_full_local

Executes the node's logic locally, including example data generation.

execute_local

Executes the node's logic locally.

execute_node

Orchestrates the execution, handling location, caching, and retries.

execute_remote

Executes the node's logic remotely or handles cached results.

get_all_dependent_node_ids

Yields the IDs of all downstream nodes recursively.

get_all_dependent_nodes

Yields all downstream nodes recursively.

get_edge_input

Generates NodeEdge objects for all input connections to this node.

get_flow_file_column_schema

Retrieves the schema for a specific column from the output schema.

get_input_type

Gets the type of connection ('main', 'left', 'right') for a given input node ID.

get_node_data

Gathers all necessary data for representing the node in the UI.

get_node_information

Updates and returns the node's information object.

get_node_input

Creates a NodeInput schema object for representing this node in the UI.

get_output_data

Gets the full output data sample for this node.

get_predicted_resulting_data

Creates a FlowDataEngine instance based on the predicted schema.

get_predicted_schema

Predicts the output schema of the node without full execution.

get_repr

Gets a detailed dictionary representation of the node's state.

get_resulting_data

Executes the node's function to produce the actual output data.

get_table_example

Generates a TableExample model summarizing the node's output.

needs_reset

Checks if the node's hash has changed, indicating an outdated state.

needs_run

Determines if the node needs to be executed.

post_init

Initializes or resets the node's attributes to their default states.

prepare_before_run

Resets results and errors before a new execution.

print

Helper method to log messages with node context.

remove_cache

Removes cached results for this node.

reset

Resets the node's execution state and schema information.

set_node_information

Populates the node_information attribute with the current state.

store_example_data_generator

Stores a generator function for fetching a sample of the result data.

update_node

Updates the properties of the node.

Attributes:

Name Type Description
all_inputs List[FlowNode]

Gets a list of all nodes connected to any input port.

function Callable

Gets the core processing function of the node.

has_input bool

Checks if this node has any input connections.

has_next_step bool

Checks if this node has any downstream connections.

hash str

Gets the cached hash for the node, calculating it if it doesn't exist.

is_correct bool

Checks if the node's input connections satisfy its template requirements.

is_setup bool

Checks if the node has been properly configured and is ready for execution.

is_start bool

Determines if the node is a starting node in the flow.

left_input Optional[FlowNode]

Gets the node connected to the left input port.

main_input List[FlowNode]

Gets the list of nodes connected to the main input port(s).

name str

Gets the name of the node.

node_id Union[str, int]

Gets the unique identifier of the node.

number_of_leads_to_nodes int | None

Counts the number of downstream node connections.

right_input Optional[FlowNode]

Gets the node connected to the right input port.

schema List[FlowfileColumn]

Gets the definitive output schema of the node.

schema_callback SingleExecutionFuture

Gets the schema callback function, creating one if it doesn't exist.

setting_input Any

Gets the node's specific configuration settings.

singular_input bool

Checks if the node template specifies exactly one input.

singular_main_input FlowNode

Gets the input node, assuming it is a single-input type.

state_needs_reset bool

Checks if the node's state needs to be reset.
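
A few of these properties can be inspected directly once a node exists in a graph (a hedged sketch; node here is assumed to be a FlowNode obtained via FlowGraph.get_node):

    node = graph.get_node(1)
    print(node.node_id, node.name)          # identifier and display name
    print(node.is_setup, node.is_correct)   # configuration and connection checks
    print(len(node.schema))                 # number of columns in the (predicted or actual) output schema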

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
class FlowNode:
    """Represents a single node in a data flow graph.

    This class manages the node's state, its data processing function,
    and its connections to other nodes within the graph.
    """
    parent_uuid: str
    node_type: str
    node_template: node_interface.NodeTemplate
    node_default: schemas.NodeDefault
    node_schema: NodeSchemaInformation
    node_inputs: NodeStepInputs
    node_stats: NodeStepStats
    node_settings: NodeStepSettings
    results: NodeResults
    node_information: Optional[schemas.NodeInformation] = None
    leads_to_nodes: List["FlowNode"] = []  # list with target flows, after execution the step will trigger those step(s)
    user_provided_schema_callback: Optional[Callable] = None  # user provided callback function for schema calculation
    _setting_input: Any = None
    _hash: Optional[str] = None  # host this for caching results
    _function: Callable = None  # the function that needs to be executed when triggered
    _name: str = None  # name of the node, used for display
    _schema_callback: Optional[SingleExecutionFuture] = None  # Function that calculates the schema without executing
    _state_needs_reset: bool = False
    _fetch_cached_df: Optional[ExternalDfFetcher | ExternalDatabaseFetcher | ExternalDatabaseWriter | ExternalCloudWriter] = None
    _cache_progress: Optional[ExternalDfFetcher | ExternalDatabaseFetcher | ExternalDatabaseWriter | ExternalCloudWriter] = None

    def __init__(self, node_id: Union[str, int], function: Callable,
                 parent_uuid: str,
                 setting_input: Any,
                 name: str,
                 node_type: str,
                 input_columns: List[str] = None,
                 output_schema: List[FlowfileColumn] = None,
                 drop_columns: List[str] = None,
                 renew_schema: bool = True,
                 pos_x: float = 0,
                 pos_y: float = 0,
                 schema_callback: Callable = None,
                 ):
        """Initializes a FlowNode instance.

        Args:
            node_id: Unique identifier for the node.
            function: The core data processing function for the node.
            parent_uuid: The UUID of the parent flow.
            setting_input: The configuration/settings object for the node.
            name: The name of the node.
            node_type: The type identifier of the node (e.g., 'join', 'filter').
            input_columns: List of column names expected as input.
            output_schema: The schema of the columns to be added.
            drop_columns: List of column names to be dropped.
            renew_schema: Flag to indicate if the schema should be renewed.
            pos_x: The x-coordinate on the canvas.
            pos_y: The y-coordinate on the canvas.
            schema_callback: A custom function to calculate the output schema.
        """
        self._name = None
        self.parent_uuid = parent_uuid
        self.post_init()
        self.active = True
        self.node_information.id = node_id
        self.node_type = node_type
        self.node_settings.renew_schema = renew_schema
        self.update_node(function=function,
                         input_columns=input_columns,
                         output_schema=output_schema,
                         drop_columns=drop_columns,
                         setting_input=setting_input,
                         name=name,
                         pos_x=pos_x,
                         pos_y=pos_y,
                         schema_callback=schema_callback,
                         )

    def post_init(self):
        """Initializes or resets the node's attributes to their default states."""
        self.node_inputs = NodeStepInputs()
        self.node_stats = NodeStepStats()
        self.node_settings = NodeStepSettings()
        self.node_schema = NodeSchemaInformation()
        self.results = NodeResults()
        self.node_information = schemas.NodeInformation()
        self.leads_to_nodes = []
        self._setting_input = None
        self._cache_progress = None
        self._schema_callback = None
        self._state_needs_reset = False

    @property
    def state_needs_reset(self) -> bool:
        """Checks if the node's state needs to be reset.

        Returns:
            True if a reset is required, False otherwise.
        """
        return self._state_needs_reset

    @state_needs_reset.setter
    def state_needs_reset(self, v: bool):
        """Sets the flag indicating that the node's state needs to be reset.

        Args:
            v: The boolean value to set.
        """
        self._state_needs_reset = v

    @staticmethod
    def create_schema_callback_from_function(f: Callable) -> Callable[[], List[FlowfileColumn]]:
        """Wraps a node's function to create a schema callback that extracts the schema.

        Args:
            f: The node's core function that returns a FlowDataEngine instance.

        Returns:
            A callable that, when executed, returns the output schema.
        """
        def schema_callback() -> List[FlowfileColumn]:
            try:
                logger.info('Executing the schema callback function based on the node function')
                return f().schema
            except Exception as e:
                logger.warning(f'Error with the schema callback: {e}')
                return []
        return schema_callback

    @property
    def schema_callback(self) -> SingleExecutionFuture:
        """Gets the schema callback function, creating one if it doesn't exist.

        The callback is used for predicting the output schema without full execution.

        Returns:
            A SingleExecutionFuture instance wrapping the schema function.
        """
        if self._schema_callback is None:
            if self.user_provided_schema_callback is not None:
                self.schema_callback = self.user_provided_schema_callback
            elif self.is_start:
                self.schema_callback = self.create_schema_callback_from_function(self._function)
        return self._schema_callback

    @schema_callback.setter
    def schema_callback(self, f: Callable):
        """Sets the schema callback function for the node.

        Args:
            f: The function to be used for schema calculation.
        """
        if f is None:
            return

        def error_callback(e: Exception) -> List:
            logger.warning(e)

            self.node_settings.setup_errors = True
            return []

        self._schema_callback = SingleExecutionFuture(f, error_callback)

    @property
    def is_start(self) -> bool:
        """Determines if the node is a starting node in the flow.

        A starting node requires no inputs.

        Returns:
            True if the node is a start node, False otherwise.
        """
        return not self.has_input and self.node_template.input == 0

    def get_input_type(self, node_id: int) -> List:
        """Gets the type of connection ('main', 'left', 'right') for a given input node ID.

        Args:
            node_id: The ID of the input node.

        Returns:
            A list of connection types for that node ID.
        """
        relation_type = []
        if node_id in [n.node_id for n in self.node_inputs.main_inputs]:
            relation_type.append('main')
        if self.node_inputs.left_input is not None and node_id == self.node_inputs.left_input.node_id:
            relation_type.append('left')
        if self.node_inputs.right_input is not None and node_id == self.node_inputs.right_input.node_id:
            relation_type.append('right')
        return list(set(relation_type))

    def update_node(self,
                    function: Callable,
                    input_columns: List[str] = None,
                    output_schema: List[FlowfileColumn] = None,
                    drop_columns: List[str] = None,
                    name: str = None,
                    setting_input: Any = None,
                    pos_x: float = 0,
                    pos_y: float = 0,
                    schema_callback: Callable = None,
                    ):
        """Updates the properties of the node.

        This is called during initialization and when settings are changed.

        Args:
            function: The new core data processing function.
            input_columns: The new list of input columns.
            output_schema: The new schema of added columns.
            drop_columns: The new list of dropped columns.
            name: The new name for the node.
            setting_input: The new settings object.
            pos_x: The new x-coordinate.
            pos_y: The new y-coordinate.
            schema_callback: The new custom schema callback function.
        """
        self.user_provided_schema_callback = schema_callback
        self.node_information.y_position = int(pos_y)
        self.node_information.x_position = int(pos_x)
        self.node_information.setting_input = setting_input
        self.name = self.node_type if name is None else name
        self._function = function

        self.node_schema.input_columns = [] if input_columns is None else input_columns
        self.node_schema.output_columns = [] if output_schema is None else output_schema
        self.node_schema.drop_columns = [] if drop_columns is None else drop_columns
        self.node_settings.renew_schema = True
        if hasattr(setting_input, 'cache_results'):
            self.node_settings.cache_results = setting_input.cache_results

        self.results.errors = None
        self.add_lead_to_in_depend_source()
        _ = self.hash
        self.node_template = node_interface.node_dict.get(self.node_type)
        if self.node_template is None:
            raise Exception(f'Node template {self.node_type} not found')
        self.node_default = node_interface.node_defaults.get(self.node_type)
        self.setting_input = setting_input  # wait until the end so that the hash is calculated correctly

    @property
    def name(self) -> str:
        """Gets the name of the node.

        Returns:
            The node's name.
        """
        return self._name

    @name.setter
    def name(self, name: str):
        """Sets the name of the node.

        Args:
            name: The new name.
        """
        self._name = name
        self.__name__ = name

    @property
    def setting_input(self) -> Any:
        """Gets the node's specific configuration settings.

        Returns:
            The settings object.
        """
        return self._setting_input

    @setting_input.setter
    def setting_input(self, setting_input: Any):
        """Sets the node's configuration and triggers a reset if necessary.

        Args:
            setting_input: The new settings object.
        """
        is_manual_input = (self.node_type == 'manual_input' and
                           isinstance(setting_input, input_schema.NodeManualInput) and
                           isinstance(self._setting_input, input_schema.NodeManualInput)
                           )
        if is_manual_input:
            _ = self.hash
        self._setting_input = setting_input
        self.set_node_information()
        if is_manual_input:
            if self.hash != self.calculate_hash(setting_input) or not self.node_stats.has_run_with_current_setup:
                self.function = FlowDataEngine(setting_input.raw_data_format)
                self.reset()
                self.get_predicted_schema()
        elif self._setting_input is not None:
            self.reset()

    @property
    def node_id(self) -> Union[str, int]:
        """Gets the unique identifier of the node.

        Returns:
            The node's ID.
        """
        return self.node_information.id

    @property
    def left_input(self) -> Optional["FlowNode"]:
        """Gets the node connected to the left input port.

        Returns:
            The left input FlowNode, or None.
        """
        return self.node_inputs.left_input

    @property
    def right_input(self) -> Optional["FlowNode"]:
        """Gets the node connected to the right input port.

        Returns:
            The right input FlowNode, or None.
        """
        return self.node_inputs.right_input

    @property
    def main_input(self) -> List["FlowNode"]:
        """Gets the list of nodes connected to the main input port(s).

        Returns:
            A list of main input FlowNodes.
        """
        return self.node_inputs.main_inputs

    @property
    def is_correct(self) -> bool:
        """Checks if the node's input connections satisfy its template requirements.

        Returns:
            True if connections are valid, False otherwise.
        """
        if isinstance(self.setting_input, input_schema.NodePromise):
            return False
        return (self.node_template.input == len(self.node_inputs.get_all_inputs()) or
                (self.node_template.multi and len(self.node_inputs.get_all_inputs()) > 0) or
                (self.node_template.multi and self.node_template.can_be_start))

    def set_node_information(self):
        """Populates the `node_information` attribute with the current state.

        This includes the node's connections, settings, and position.
        """
        logger.info('setting node information')
        node_information = self.node_information
        node_information.left_input_id = self.node_inputs.left_input.node_id if self.left_input else None
        node_information.right_input_id = self.node_inputs.right_input.node_id if self.right_input else None
        node_information.input_ids = [mi.node_id for mi in
                                      self.node_inputs.main_inputs] if self.node_inputs.main_inputs is not None else None
        node_information.setting_input = self.setting_input
        node_information.outputs = [n.node_id for n in self.leads_to_nodes]
        node_information.is_setup = self.is_setup
        node_information.x_position = self.setting_input.pos_x
        node_information.y_position = self.setting_input.pos_y
        node_information.type = self.node_type

    def get_node_information(self) -> schemas.NodeInformation:
        """Updates and returns the node's information object.

        Returns:
            The `NodeInformation` object for this node.
        """
        self.set_node_information()
        return self.node_information

    @property
    def function(self) -> Callable:
        """Gets the core processing function of the node.

        Returns:
            The callable function.
        """
        return self._function

    @function.setter
    def function(self, function: Callable):
        """Sets the core processing function of the node.

        Args:
            function: The new callable function.
        """
        self._function = function

    @property
    def all_inputs(self) -> List["FlowNode"]:
        """Gets a list of all nodes connected to any input port.

        Returns:
            A list of all input FlowNodes.
        """
        return self.node_inputs.get_all_inputs()

    def calculate_hash(self, setting_input: Any) -> str:
        """Calculates a hash based on settings and input node hashes.

        Args:
            setting_input: The node's settings object to be included in the hash.

        Returns:
            A string hash value.
        """
        depends_on_hashes = [_node.hash for _node in self.all_inputs]
        node_data_hash = get_hash(setting_input)
        return get_hash(depends_on_hashes + [node_data_hash, self.parent_uuid])

    @property
    def hash(self) -> str:
        """Gets the cached hash for the node, calculating it if it doesn't exist.

        Returns:
            The string hash value.
        """
        if not self._hash:
            self._hash = self.calculate_hash(self.setting_input)
        return self._hash

    def add_node_connection(self, from_node: "FlowNode",
                            insert_type: Literal['main', 'left', 'right'] = 'main') -> None:
        """Adds a connection from a source node to this node.

        Args:
            from_node: The node to connect from.
            insert_type: The type of input to connect to ('main', 'left', 'right').

        Raises:
            Exception: If the insert_type is invalid.
        """
        from_node.leads_to_nodes.append(self)
        if insert_type == 'main':
            if self.node_template.input <= 2 or self.node_inputs.main_inputs is None:
                self.node_inputs.main_inputs = [from_node]
            else:
                self.node_inputs.main_inputs.append(from_node)
        elif insert_type == 'right':
            self.node_inputs.right_input = from_node
        elif insert_type == 'left':
            self.node_inputs.left_input = from_node
        else:
            raise Exception('Cannot find the connection')
        if self.setting_input.is_setup:
            if hasattr(self.setting_input, 'depending_on_id') and insert_type == 'main':
                self.setting_input.depending_on_id = from_node.node_id
        self.reset()
        from_node.reset()

    def evaluate_nodes(self, deep: bool = False) -> None:
        """Triggers a state reset for all directly connected downstream nodes.

        Args:
            deep: If True, the reset propagates recursively through the entire downstream graph.
        """
        for node in self.leads_to_nodes:
            self.print(f'resetting node: {node.node_id}')
            node.reset(deep)

    def get_flow_file_column_schema(self, col_name: str) -> FlowfileColumn | None:
        """Retrieves the schema for a specific column from the output schema.

        Args:
            col_name: The name of the column.

        Returns:
            The FlowfileColumn object for that column, or None if not found.
        """
        for s in self.schema:
            if s.column_name == col_name:
                return s

    def get_predicted_schema(self, force: bool = False) -> List[FlowfileColumn] | None:
        """Predicts the output schema of the node without full execution.

        It uses the schema_callback or infers from predicted data.

        Args:
            force: If True, forces recalculation even if a predicted schema exists.

        Returns:
            A list of FlowfileColumn objects representing the predicted schema.
        """
        if self.node_schema.predicted_schema and not force:
            return self.node_schema.predicted_schema
        if self.schema_callback is not None and (self.node_schema.predicted_schema is None or force):
            self.print('Getting the data from a schema callback')
            if force:
                # Force the schema callback to reset, so that it will be executed again
                self.schema_callback.reset()
            schema = self.schema_callback()
            if schema is not None and len(schema) > 0:
                self.print('Calculating the schema based on the schema callback')
                self.node_schema.predicted_schema = schema
                return self.node_schema.predicted_schema
        predicted_data = self._predicted_data_getter()
        if predicted_data is not None and predicted_data.schema is not None:
            self.print('Calculating the schema based on the predicted resulting data')
            self.node_schema.predicted_schema = self._predicted_data_getter().schema
        return self.node_schema.predicted_schema

    @property
    def is_setup(self) -> bool:
        """Checks if the node has been properly configured and is ready for execution.

        Returns:
            True if the node is set up, False otherwise.
        """
        if not self.node_information.is_setup:
            if self.function.__name__ != 'placeholder':
                self.node_information.is_setup = True
                self.setting_input.is_setup = True
        return self.node_information.is_setup

    def print(self, v: Any):
        """Helper method to log messages with node context.

        Args:
            v: The message or value to log.
        """
        logger.info(f'{self.node_type}, node_id: {self.node_id}: {v}')

    def get_resulting_data(self) -> FlowDataEngine | None:
        """Executes the node's function to produce the actual output data.

        Handles both regular functions and external data sources.

        Returns:
            A FlowDataEngine instance containing the result, or None on error.

        Raises:
            Exception: Propagates exceptions from the node's function execution.
        """
        if self.is_setup:
            if self.results.resulting_data is None and self.results.errors is None:
                self.print('getting resulting data')
                try:
                    if isinstance(self.function, FlowDataEngine):
                        fl: FlowDataEngine = self.function
                    elif self.node_type == 'external_source':
                        fl: FlowDataEngine = self.function()
                        fl.collect_external()
                        self.node_settings.streamable = False
                    else:
                        try:
                            fl = self._function(*[v.get_resulting_data() for v in self.all_inputs])
                        except Exception as e:
                            raise e
                    fl.set_streamable(self.node_settings.streamable)
                    self.results.resulting_data = fl
                    self.node_schema.result_schema = fl.schema
                except Exception as e:
                    self.results.resulting_data = FlowDataEngine()
                    self.results.errors = str(e)
                    self.node_stats.has_run_with_current_setup = False
                    self.node_stats.has_completed_last_run = False
                    raise e
            return self.results.resulting_data

    def _predicted_data_getter(self) -> FlowDataEngine | None:
        """Internal helper to get a predicted data result.

        This calls the function with predicted data from input nodes.

        Returns:
            A FlowDataEngine instance with predicted data, or an empty one on error.
        """
        try:
            fl = self._function(*[v.get_predicted_resulting_data() for v in self.all_inputs])
            return fl
        except ValueError as e:
            if str(e) == "generator already executing":
                logger.info('Generator already executing, waiting for the result')
                sleep(1)
                return self._predicted_data_getter()
            fl = FlowDataEngine()
            return fl

        except Exception as e:
            logger.warning('there was an issue with the function, returning an empty Flowfile')
            logger.warning(e)

    def get_predicted_resulting_data(self) -> FlowDataEngine:
        """Creates a `FlowDataEngine` instance based on the predicted schema.

        This avoids executing the node's full logic.

        Returns:
            A FlowDataEngine instance with a schema but no data.
        """
        if self.needs_run(False) and self.schema_callback is not None or self.node_schema.result_schema is not None:
            self.print('Getting data based on the schema')

            _s = self.schema_callback() if self.node_schema.result_schema is None else self.node_schema.result_schema
            return FlowDataEngine.create_from_schema(_s)
        else:
            if isinstance(self.function, FlowDataEngine):
                fl = self.function
            else:
                fl = FlowDataEngine.create_from_schema(self.get_predicted_schema())
            return fl

    def add_lead_to_in_depend_source(self):
        """Ensures this node is registered in the `leads_to_nodes` list of its inputs."""
        for input_node in self.all_inputs:
            if self.node_id not in [n.node_id for n in input_node.leads_to_nodes]:
                input_node.leads_to_nodes.append(self)

    def get_all_dependent_nodes(self) -> Generator["FlowNode", None, None]:
        """Yields all downstream nodes recursively.

        Returns:
            A generator of all dependent FlowNode objects.
        """
        for node in self.leads_to_nodes:
            yield node
            for n in node.get_all_dependent_nodes():
                yield n

    def get_all_dependent_node_ids(self) -> Generator[int, None, None]:
        """Yields the IDs of all downstream nodes recursively.

        Returns:
            A generator of all dependent node IDs.
        """
        for node in self.leads_to_nodes:
            yield node.node_id
            for n in node.get_all_dependent_node_ids():
                yield n

    @property
    def schema(self) -> List[FlowfileColumn]:
        """Gets the definitive output schema of the node.

        If not already run, it falls back to the predicted schema.

        Returns:
            A list of FlowfileColumn objects.
        """
        try:
            if self.is_setup and self.results.errors is None:
                if self.node_schema.result_schema is not None and len(self.node_schema.result_schema) > 0:
                    return self.node_schema.result_schema
                elif self.node_type == 'output':
                    if len(self.node_inputs.main_inputs) > 0:
                        self.node_schema.result_schema = self.node_inputs.main_inputs[0].schema
                else:
                    self.node_schema.result_schema = self.get_predicted_schema()
                return self.node_schema.result_schema
            else:
                return []
        except Exception as e:
            logger.error(e)
            return []

    def remove_cache(self):
        """Removes cached results for this node.

        Note: Currently not fully implemented.
        """

        if results_exists(self.hash):
            logger.warning('Not implemented')

    def needs_run(self, performance_mode: bool, node_logger: NodeLogger = None,
                  execution_location: schemas.ExecutionLocationsLiteral = "auto") -> bool:
        """Determines if the node needs to be executed.

        The decision is based on its run state, caching settings, and execution mode.

        Args:
            performance_mode: True if the flow is in performance mode.
            node_logger: The logger instance for this node.
            execution_location: The target execution location.

        Returns:
            True if the node should be run, False otherwise.
        """
        if execution_location == "local" or SINGLE_FILE_MODE:
            return False

        flow_logger = logger if node_logger is None else node_logger
        cache_result_exists = results_exists(self.hash)
        if not self.node_stats.has_run_with_current_setup:
            flow_logger.info('Node has not run, needs to run')
            return True
        if self.node_settings.cache_results and cache_result_exists:
            return False
        elif self.node_settings.cache_results and not cache_result_exists:
            return True
        elif not performance_mode and cache_result_exists:
            return False
        else:
            return True

    def __call__(self, *args, **kwargs):
        """Makes the node instance callable, acting as an alias for execute_node."""
        self.execute_node(*args, **kwargs)

    def execute_full_local(self, performance_mode: bool = False) -> None:
        """Executes the node's logic locally, including example data generation.

        Args:
            performance_mode: If True, skips generating example data.

        Raises:
            Exception: Propagates exceptions from the execution.
        """
        def example_data_generator():
            example_data = None

            def get_example_data():
                nonlocal example_data
                if example_data is None:
                    example_data = resulting_data.get_sample(100).to_arrow()
                return example_data
            return get_example_data
        resulting_data = self.get_resulting_data()

        if not performance_mode:
            self.results.example_data_generator = example_data_generator()
            self.node_schema.result_schema = self.results.resulting_data.schema
            self.node_stats.has_completed_last_run = True

    def execute_local(self, flow_id: int, performance_mode: bool = False):
        """Executes the node's logic locally.

        Args:
            flow_id: The ID of the parent flow.
            performance_mode: If True, skips generating example data.

        Raises:
            Exception: Propagates exceptions from the execution.
        """
        try:
            resulting_data = self.get_resulting_data()
            if not performance_mode:
                external_sampler = ExternalSampler(lf=resulting_data.data_frame, file_ref=self.hash,
                                                   wait_on_completion=True, node_id=self.node_id, flow_id=flow_id)
                self.store_example_data_generator(external_sampler)
                if self.results.errors is None and not self.node_stats.is_canceled:
                    self.node_stats.has_run_with_current_setup = True
            self.node_schema.result_schema = resulting_data.schema

        except Exception as e:
            logger.warning(f"Error with step {self.__name__}")
            logger.error(str(e))
            self.results.errors = str(e)
            self.node_stats.has_run_with_current_setup = False
            self.node_stats.has_completed_last_run = False
            raise e

        if self.node_stats.has_run_with_current_setup:
            for step in self.leads_to_nodes:
                if not self.node_settings.streamable:
                    step.node_settings.streamable = self.node_settings.streamable

    def execute_remote(self, performance_mode: bool = False, node_logger: NodeLogger = None):
        """Executes the node's logic remotely or handles cached results.

        Args:
            performance_mode: If True, skips generating example data.
            node_logger: The logger for this node execution.

        Raises:
            Exception: If the node_logger is not provided or if execution fails.
        """
        if node_logger is None:
            raise Exception('Node logger is not defined')
        if self.node_settings.cache_results and results_exists(self.hash):
            try:
                self.results.resulting_data = get_external_df_result(self.hash)
                self._cache_progress = None
                return
            except Exception as e:
                node_logger.warning('Failed to read the cache, rerunning the code')
        if self.node_type == 'output':
            self.results.resulting_data = self.get_resulting_data()
            self.node_stats.has_run_with_current_setup = True
            return
        try:
            self.get_resulting_data()
        except Exception as e:
            self.results.errors = 'Error with creating the lazy frame, most likely due to invalid graph'
            raise e
        if not performance_mode:
            external_df_fetcher = ExternalDfFetcher(lf=self.get_resulting_data().data_frame,
                                                    file_ref=self.hash, wait_on_completion=False,
                                                    flow_id=node_logger.flow_id,
                                                    node_id=self.node_id)
            self._fetch_cached_df = external_df_fetcher
            try:
                lf = external_df_fetcher.get_result()
                self.results.resulting_data = FlowDataEngine(
                    lf, number_of_records=ExternalDfFetcher(lf=lf, operation_type='calculate_number_of_records',
                                                            flow_id=node_logger.flow_id, node_id=self.node_id).result
                )
                if not performance_mode:
                    self.store_example_data_generator(external_df_fetcher)
                    self.node_stats.has_run_with_current_setup = True

            except Exception as e:
                node_logger.error('Error with external process')
                if external_df_fetcher.error_code == -1:
                    try:
                        self.results.resulting_data = self.get_resulting_data()
                        self.results.warnings = ('Error with external process (unknown error), '
                                                 'likely the process was killed by the server because of memory constraints, '
                                                 'continue with the process. '
                                                 'We cannot display example data...')
                    except Exception as e:
                        self.results.errors = str(e)
                        raise e
                elif external_df_fetcher.error_description is None:
                    self.results.errors = str(e)
                    raise e
                else:
                    self.results.errors = external_df_fetcher.error_description
                    raise Exception(external_df_fetcher.error_description)
            finally:
                self._fetch_cached_df = None

    def prepare_before_run(self):
        """Resets results and errors before a new execution."""

        self.results.errors = None
        self.results.resulting_data = None
        self.results.example_data = None

    def cancel(self):
        """Cancels an ongoing external process if one is running."""

        if self._fetch_cached_df is not None:
            self._fetch_cached_df.cancel()
            self.node_stats.is_canceled = True
        else:
            logger.warning('No external process to cancel')
        self.node_stats.is_canceled = True

    def execute_node(self, run_location: schemas.ExecutionLocationsLiteral, reset_cache: bool = False,
                     performance_mode: bool = False, retry: bool = True, node_logger: NodeLogger = None):
        """Orchestrates the execution, handling location, caching, and retries.

        Args:
            run_location: The location for execution ('local', 'remote').
            reset_cache: If True, forces removal of any existing cache.
            performance_mode: If True, optimizes for speed over diagnostics.
            retry: If True, allows retrying execution on recoverable errors.
            node_logger: The logger for this node execution.

        Raises:
            Exception: If the node_logger is not defined.
        """
        if node_logger is None:
            raise Exception('Flow logger is not defined')
        # node_logger = flow_logger.get_node_logger(self.node_id)
        if reset_cache:
            self.remove_cache()
            self.node_stats.has_run_with_current_setup = False
            self.node_stats.has_completed_last_run = False
        if self.is_setup:
            node_logger.info(f'Starting to run {self.__name__}')
            if (self.needs_run(performance_mode, node_logger, run_location) or self.node_template.node_group == "output"
                    and not (run_location == 'local' or SINGLE_FILE_MODE)):
                self.prepare_before_run()
                try:
                    if ((run_location == 'remote' or (self.node_default.transform_type == 'wide')
                            and not run_location == 'local')) or self.node_settings.cache_results:
                        node_logger.info('Running the node remotely')
                        if self.node_settings.cache_results:
                            performance_mode = False
                        self.execute_remote(performance_mode=(performance_mode if not self.node_settings.cache_results
                                                              else False),
                                            node_logger=node_logger
                                            )
                    else:
                        node_logger.info('Running the node locally')
                        self.execute_local(performance_mode=performance_mode, flow_id=node_logger.flow_id)
                except Exception as e:
                    if 'No such file or directory (os error' in str(e) and retry:
                        logger.warning('Error with the input node, starting to rerun the input node...')
                        all_inputs: List[FlowNode] = self.node_inputs.get_all_inputs()
                        for node_input in all_inputs:
                            node_input.execute_node(run_location=run_location,
                                                    performance_mode=performance_mode, retry=True,
                                                    reset_cache=True,
                                                    node_logger=node_logger)
                        self.execute_node(run_location=run_location,
                                          performance_mode=performance_mode, retry=False,
                                          node_logger=node_logger)
                    else:
                        self.results.errors = str(e)
                        node_logger.error(f'Error with running the node: {e}')
            elif ((run_location == 'local' or SINGLE_FILE_MODE) and
                  (not self.node_stats.has_run_with_current_setup or self.node_template.node_group == "output")):
                try:
                    node_logger.info('Executing fully locally')
                    self.execute_full_local(performance_mode)
                except Exception as e:
                    self.results.errors = str(e)
                    node_logger.error(f'Error with running the node: {e}')
                    self.node_stats.error = str(e)
                    self.node_stats.has_completed_last_run = False
                self.node_stats.has_run_with_current_setup = True
            else:
                node_logger.info('Node has already run, not running the node')
        else:
            node_logger.warning(f'Node {self.__name__} is not setup, cannot run the node')

    def store_example_data_generator(self, external_df_fetcher: ExternalDfFetcher | ExternalSampler):
        """Stores a generator function for fetching a sample of the result data.

        Args:
            external_df_fetcher: The process that generated the sample data.
        """
        if external_df_fetcher.status is not None:
            file_ref = external_df_fetcher.status.file_ref
            self.results.example_data_path = file_ref
            self.results.example_data_generator = get_read_top_n(file_path=file_ref, n=100)
        else:
            logger.error('Could not get the sample data, the external process is not ready')

    def needs_reset(self) -> bool:
        """Checks if the node's hash has changed, indicating an outdated state.

        Returns:
            True if the calculated hash differs from the stored hash.
        """
        return self._hash != self.calculate_hash(self.setting_input)

    def reset(self, deep: bool = False):
        """Resets the node's execution state and schema information.

        This also triggers a reset on all downstream nodes.

        Args:
            deep: If True, forces a reset even if the hash hasn't changed.
        """
        needs_reset = self.needs_reset() or deep
        if needs_reset:
            logger.info(f'{self.node_id}: Node needs reset')
            self.node_stats.has_run_with_current_setup = False
            self.results.reset()
            if self.is_correct:
                self._schema_callback = None  # Ensure the schema callback is reset
                if self.schema_callback:
                    logger.info(f'{self.node_id}: Resetting the schema callback')
                    self.schema_callback.start()
            self.node_schema.result_schema = None
            self.node_schema.predicted_schema = None
            self._hash = None
            self.node_information.is_setup = None
            self.results.errors = None
            self.evaluate_nodes()
            _ = self.hash  # Recalculate the hash after reset

    def delete_lead_to_node(self, node_id: int) -> bool:
        """Removes a connection to a specific downstream node.

        Args:
            node_id: The ID of the downstream node to disconnect.

        Returns:
            True if the connection was found and removed, False otherwise.
        """
        logger.info(f'Deleting lead to node: {node_id}')
        for i, lead_to_node in enumerate(self.leads_to_nodes):
            logger.info(f'Checking lead to node: {lead_to_node.node_id}')
            if lead_to_node.node_id == node_id:
                logger.info(f'Found the node to delete: {node_id}')
                self.leads_to_nodes.pop(i)
                return True
        return False

    def delete_input_node(self, node_id: int, connection_type: input_schema.InputConnectionClass = 'input-0',
                          complete: bool = False) -> bool:
        """Removes a connection from a specific input node.

        Args:
            node_id: The ID of the input node to disconnect.
            connection_type: The specific input handle (e.g., 'input-0', 'input-1').
            complete: If True, tries to delete from all input types.

        Returns:
            True if a connection was found and removed, False otherwise.
        """
        deleted: bool = False
        if connection_type == 'input-0':
            for i, node in enumerate(self.node_inputs.main_inputs):
                if node.node_id == node_id:
                    self.node_inputs.main_inputs.pop(i)
                    deleted = True
                    if not complete:
                        continue
        elif connection_type == 'input-1' or complete:
            if self.node_inputs.right_input is not None and self.node_inputs.right_input.node_id == node_id:
                self.node_inputs.right_input = None
                deleted = True
        elif connection_type == 'input-2' or complete:
            if self.node_inputs.left_input is not None and self.node_inputs.right_input.node_id == node_id:
                self.node_inputs.left_input = None
                deleted = True
        else:
            logger.warning('Could not find the connection to delete...')
        if deleted:
            self.reset()
        return deleted

    def __repr__(self) -> str:
        """Provides a string representation of the FlowNode instance.

        Returns:
            A string showing the node's ID and type.
        """
        return f"Node id: {self.node_id} ({self.node_type})"

    def _get_readable_schema(self) -> List[dict] | None:
        """Helper to get a simplified, dictionary representation of the output schema.

        Returns:
            A list of dictionaries, each with 'column_name' and 'data_type'.
        """
        if self.is_setup:
            output = []
            for s in self.schema:
                output.append(dict(column_name=s.column_name, data_type=s.data_type))
            return output

    def get_repr(self) -> dict:
        """Gets a detailed dictionary representation of the node's state.

        Returns:
            A dictionary containing key information about the node.
        """
        return dict(FlowNode=
                    dict(node_id=self.node_id,
                         step_name=self.__name__,
                         output_columns=self.node_schema.output_columns,
                         output_schema=self._get_readable_schema()))

    @property
    def number_of_leads_to_nodes(self) -> int | None:
        """Counts the number of downstream node connections.

        Returns:
            The number of nodes this node leads to.
        """
        if self.is_setup:
            return len(self.leads_to_nodes)

    @property
    def has_next_step(self) -> bool:
        """Checks if this node has any downstream connections.

        Returns:
            True if it has at least one downstream node.
        """
        return len(self.leads_to_nodes) > 0

    @property
    def has_input(self) -> bool:
        """Checks if this node has any input connections.

        Returns:
            True if it has at least one input node.
        """
        return len(self.all_inputs) > 0

    @property
    def singular_input(self) -> bool:
        """Checks if the node template specifies exactly one input.

        Returns:
            True if the node is a single-input type.
        """
        return self.node_template.input == 1

    @property
    def singular_main_input(self) -> "FlowNode":
        """Gets the input node, assuming it is a single-input type.

        Returns:
            The single input FlowNode, or None.
        """
        if self.singular_input:
            return self.all_inputs[0]

    def get_table_example(self, include_data: bool = False) -> TableExample | None:
        """Generates a `TableExample` model summarizing the node's output.

        This can optionally include a sample of the data.

        Args:
            include_data: If True, includes a data sample in the result.

        Returns:
            A `TableExample` object, or None if the node is not set up.
        """
        self.print('Getting a table example')
        if self.is_setup and include_data and self.node_stats.has_completed_last_run:

            if self.node_template.node_group == 'output':
                self.print('getting the table example')
                return self.main_input[0].get_table_example(include_data)

            logger.info('getting the table example since the node has run')
            example_data_getter = self.results.example_data_generator
            if example_data_getter is not None:
                data = example_data_getter().to_pylist()
                if data is None:
                    data = []
            else:
                data = []
            schema = [FileColumn.model_validate(c.get_column_repr()) for c in self.schema]
            fl = self.get_resulting_data()
            return TableExample(node_id=self.node_id,
                                name=str(self.node_id), number_of_records=999,
                                number_of_columns=fl.number_of_fields,
                                table_schema=schema, columns=fl.columns, data=data)
        else:
            logger.warning('getting the table example but the node has not run')
            try:
                schema = [FileColumn.model_validate(c.get_column_repr()) for c in self.schema]
            except Exception as e:
                logger.warning(e)
                schema = []
            columns = [s.name for s in schema]
            return TableExample(node_id=self.node_id,
                                name=str(self.node_id), number_of_records=0,
                                number_of_columns=len(columns),
                                table_schema=schema, columns=columns,
                                data=[])

    def get_node_data(self, flow_id: int, include_example: bool = False) -> NodeData:
        """Gathers all necessary data for representing the node in the UI.

        Args:
            flow_id: The ID of the parent flow.
            include_example: If True, includes data samples.

        Returns:
            A `NodeData` object.
        """
        node = NodeData(flow_id=flow_id,
                        node_id=self.node_id,
                        has_run=self.node_stats.has_run_with_current_setup,
                        setting_input=self.setting_input,
                        flow_type=self.node_type)
        if self.main_input:
            node.main_input = self.main_input[0].get_table_example()
        if self.left_input:
            node.left_input = self.left_input.get_table_example()
        if self.right_input:
            node.right_input = self.right_input.get_table_example()
        if self.is_setup:
            node.main_output = self.get_table_example(include_example)
        node = setting_generator.get_setting_generator(self.node_type)(node)

        node = setting_updator.get_setting_updator(self.node_type)(node)
        return node

    def get_output_data(self) -> TableExample:
        """Gets the full output data sample for this node.

        Returns:
            A `TableExample` object with data.
        """
        return self.get_table_example(True)

    def get_node_input(self) -> schemas.NodeInput:
        """Creates a `NodeInput` schema object for representing this node in the UI.

        Returns:
            A `NodeInput` object.
        """
        return schemas.NodeInput(pos_y=self.setting_input.pos_y,
                                 pos_x=self.setting_input.pos_x,
                                 id=self.node_id,
                                 **self.node_template.__dict__)

    def get_edge_input(self) -> List[schemas.NodeEdge]:
        """Generates `NodeEdge` objects for all input connections to this node.

        Returns:
            A list of `NodeEdge` objects.
        """
        edges = []
        if self.node_inputs.main_inputs is not None:
            for i, main_input in enumerate(self.node_inputs.main_inputs):
                edges.append(schemas.NodeEdge(id=f'{main_input.node_id}-{self.node_id}-{i}',
                                              source=main_input.node_id,
                                              target=self.node_id,
                                              sourceHandle='output-0',
                                              targetHandle='input-0',
                                              ))
        if self.node_inputs.left_input is not None:
            edges.append(schemas.NodeEdge(id=f'{self.node_inputs.left_input.node_id}-{self.node_id}-right',
                                          source=self.node_inputs.left_input.node_id,
                                          target=self.node_id,
                                          sourceHandle='output-0',
                                          targetHandle='input-2',
                                          ))
        if self.node_inputs.right_input is not None:
            edges.append(schemas.NodeEdge(id=f'{self.node_inputs.right_input.node_id}-{self.node_id}-left',
                                          source=self.node_inputs.right_input.node_id,
                                          target=self.node_id,
                                          sourceHandle='output-0',
                                          targetHandle='input-1',
                                          ))
        return edges
all_inputs property

Gets a list of all nodes connected to any input port.

Returns:

Type Description
List[FlowNode]

A list of all input FlowNodes.

function property writable

Gets the core processing function of the node.

Returns:

Type Description
Callable

The callable function.

has_input property

Checks if this node has any input connections.

Returns:

Type Description
bool

True if it has at least one input node.

has_next_step property

Checks if this node has any downstream connections.

Returns:

Type Description
bool

True if it has at least one downstream node.

hash property

Gets the cached hash for the node, calculating it if it doesn't exist.

Returns:

Type Description
str

The string hash value.

is_correct property

Checks if the node's input connections satisfy its template requirements.

Returns:

Type Description
bool

True if connections are valid, False otherwise.

is_setup property

Checks if the node has been properly configured and is ready for execution.

Returns:

Type Description
bool

True if the node is set up, False otherwise.

is_start property

Determines if the node is a starting node in the flow.

A starting node requires no inputs.

Returns:

Type Description
bool

True if the node is a start node, False otherwise.

left_input property

Gets the node connected to the left input port.

Returns:

Type Description
Optional[FlowNode]

The left input FlowNode, or None.

main_input property

Gets the list of nodes connected to the main input port(s).

Returns:

Type Description
List[FlowNode]

A list of main input FlowNodes.

name property writable

Gets the name of the node.

Returns:

Type Description
str

The node's name.

node_id property

Gets the unique identifier of the node.

Returns:

Type Description
Union[str, int]

The node's ID.

number_of_leads_to_nodes property

Counts the number of downstream node connections.

Returns:

Type Description
int | None

The number of nodes this node leads to.

right_input property

Gets the node connected to the right input port.

Returns:

Type Description
Optional[FlowNode]

The right input FlowNode, or None.

schema property

Gets the definitive output schema of the node.

If not already run, it falls back to the predicted schema.

Returns:

Type Description
List[FlowfileColumn]

A list of FlowfileColumn objects.

schema_callback property writable

Gets the schema callback function, creating one if it doesn't exist.

The callback is used for predicting the output schema without full execution.

Returns:

Type Description
SingleExecutionFuture

A SingleExecutionFuture instance wrapping the schema function.

setting_input property writable

Gets the node's specific configuration settings.

Returns:

Type Description
Any

The settings object.

singular_input property

Checks if the node template specifies exactly one input.

Returns:

Type Description
bool

True if the node is a single-input type.

singular_main_input property

Gets the input node, assuming it is a single-input type.

Returns:

Type Description
FlowNode

The single input FlowNode, or None.

state_needs_reset property writable

Checks if the node's state needs to be reset.

Returns:

Type Description
bool

True if a reset is required, False otherwise.

__call__(*args, **kwargs)

Makes the node instance callable, acting as an alias for execute_node.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def __call__(self, *args, **kwargs):
    """Makes the node instance callable, acting as an alias for execute_node."""
    self.execute_node(*args, **kwargs)
__init__(node_id, function, parent_uuid, setting_input, name, node_type, input_columns=None, output_schema=None, drop_columns=None, renew_schema=True, pos_x=0, pos_y=0, schema_callback=None)

Initializes a FlowNode instance.

Parameters:

Name Type Description Default
node_id Union[str, int]

Unique identifier for the node.

required
function Callable

The core data processing function for the node.

required
parent_uuid str

The UUID of the parent flow.

required
setting_input Any

The configuration/settings object for the node.

required
name str

The name of the node.

required
node_type str

The type identifier of the node (e.g., 'join', 'filter').

required
input_columns List[str]

List of column names expected as input.

None
output_schema List[FlowfileColumn]

The schema of the columns to be added.

None
drop_columns List[str]

List of column names to be dropped.

None
renew_schema bool

Flag to indicate if the schema should be renewed.

True
pos_x float

The x-coordinate on the canvas.

0
pos_y float

The y-coordinate on the canvas.

0
schema_callback Callable

A custom function to calculate the output schema.

None
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def __init__(self, node_id: Union[str, int], function: Callable,
             parent_uuid: str,
             setting_input: Any,
             name: str,
             node_type: str,
             input_columns: List[str] = None,
             output_schema: List[FlowfileColumn] = None,
             drop_columns: List[str] = None,
             renew_schema: bool = True,
             pos_x: float = 0,
             pos_y: float = 0,
             schema_callback: Callable = None,
             ):
    """Initializes a FlowNode instance.

    Args:
        node_id: Unique identifier for the node.
        function: The core data processing function for the node.
        parent_uuid: The UUID of the parent flow.
        setting_input: The configuration/settings object for the node.
        name: The name of the node.
        node_type: The type identifier of the node (e.g., 'join', 'filter').
        input_columns: List of column names expected as input.
        output_schema: The schema of the columns to be added.
        drop_columns: List of column names to be dropped.
        renew_schema: Flag to indicate if the schema should be renewed.
        pos_x: The x-coordinate on the canvas.
        pos_y: The y-coordinate on the canvas.
        schema_callback: A custom function to calculate the output schema.
    """
    self._name = None
    self.parent_uuid = parent_uuid
    self.post_init()
    self.active = True
    self.node_information.id = node_id
    self.node_type = node_type
    self.node_settings.renew_schema = renew_schema
    self.update_node(function=function,
                     input_columns=input_columns,
                     output_schema=output_schema,
                     drop_columns=drop_columns,
                     setting_input=setting_input,
                     name=name,
                     pos_x=pos_x,
                     pos_y=pos_y,
                     schema_callback=schema_callback,
                     )
__repr__()

Provides a string representation of the FlowNode instance.

Returns:

Type Description
str

A string showing the node's ID and type.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def __repr__(self) -> str:
    """Provides a string representation of the FlowNode instance.

    Returns:
        A string showing the node's ID and type.
    """
    return f"Node id: {self.node_id} ({self.node_type})"
add_lead_to_in_depend_source()

Ensures this node is registered in the leads_to_nodes list of its inputs.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def add_lead_to_in_depend_source(self):
    """Ensures this node is registered in the `leads_to_nodes` list of its inputs."""
    for input_node in self.all_inputs:
        if self.node_id not in [n.node_id for n in input_node.leads_to_nodes]:
            input_node.leads_to_nodes.append(self)
add_node_connection(from_node, insert_type='main')

Adds a connection from a source node to this node.

Parameters:

Name Type Description Default
from_node FlowNode

The node to connect from.

required
insert_type Literal['main', 'left', 'right']

The type of input to connect to ('main', 'left', 'right').

'main'

Raises:

Type Description
Exception

If the insert_type is invalid.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def add_node_connection(self, from_node: "FlowNode",
                        insert_type: Literal['main', 'left', 'right'] = 'main') -> None:
    """Adds a connection from a source node to this node.

    Args:
        from_node: The node to connect from.
        insert_type: The type of input to connect to ('main', 'left', 'right').

    Raises:
        Exception: If the insert_type is invalid.
    """
    from_node.leads_to_nodes.append(self)
    if insert_type == 'main':
        if self.node_template.input <= 2 or self.node_inputs.main_inputs is None:
            self.node_inputs.main_inputs = [from_node]
        else:
            self.node_inputs.main_inputs.append(from_node)
    elif insert_type == 'right':
        self.node_inputs.right_input = from_node
    elif insert_type == 'left':
        self.node_inputs.left_input = from_node
    else:
        raise Exception('Cannot find the connection')
    if self.setting_input.is_setup:
        if hasattr(self.setting_input, 'depending_on_id') and insert_type == 'main':
            self.setting_input.depending_on_id = from_node.node_id
    self.reset()
    from_node.reset()
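
As an illustration, the hypothetical helper below wires a join-style node to two upstream nodes; `join_node`, `left_source`, and `right_source` are assumed to be FlowNode instances from the same FlowGraph, and the handles mirror the left/right input ports used by get_edge_input.

def wire_join_inputs(join_node, left_source, right_source):
    """Connect two upstream nodes to the left and right input ports."""
    join_node.add_node_connection(left_source, insert_type='left')
    join_node.add_node_connection(right_source, insert_type='right')
    # add_node_connection resets both sides, so schemas are re-predicted
    # the next time the flow runs.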
calculate_hash(setting_input)

Calculates a hash based on settings and input node hashes.

Parameters:

Name Type Description Default
setting_input Any

The node's settings object to be included in the hash.

required

Returns:

Type Description
str

A string hash value.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def calculate_hash(self, setting_input: Any) -> str:
    """Calculates a hash based on settings and input node hashes.

    Args:
        setting_input: The node's settings object to be included in the hash.

    Returns:
        A string hash value.
    """
    depends_on_hashes = [_node.hash for _node in self.all_inputs]
    node_data_hash = get_hash(setting_input)
    return get_hash(depends_on_hashes + [node_data_hash, self.parent_uuid])
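
Because the hash folds in the hashes of all input nodes plus the settings and parent UUID, a settings change ripples into every downstream hash once those nodes are reset. A small sketch, assuming `node` is a FlowNode (the helper name is illustrative):

def hashes_along_path(node) -> dict:
    """Map node IDs to hashes for a node and everything downstream of it."""
    return {node.node_id: node.hash,
            **{n.node_id: n.hash for n in node.get_all_dependent_nodes()}}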
cancel()

Cancels an ongoing external process if one is running.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def cancel(self):
    """Cancels an ongoing external process if one is running."""

    if self._fetch_cached_df is not None:
        self._fetch_cached_df.cancel()
        self.node_stats.is_canceled = True
    else:
        logger.warning('No external process to cancel')
    self.node_stats.is_canceled = True
create_schema_callback_from_function(f) staticmethod

Wraps a node's function to create a schema callback that extracts the schema.

Parameters:

Name Type Description Default
f Callable

The node's core function that returns a FlowDataEngine instance.

required

Returns:

Type Description
Callable[[], List[FlowfileColumn]]

A callable that, when executed, returns the output schema.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
@staticmethod
def create_schema_callback_from_function(f: Callable) -> Callable[[], List[FlowfileColumn]]:
    """Wraps a node's function to create a schema callback that extracts the schema.

    Args:
        f: The node's core function that returns a FlowDataEngine instance.

    Returns:
        A callable that, when executed, returns the output schema.
    """
    def schema_callback() -> List[FlowfileColumn]:
        try:
            logger.info('Executing the schema callback function based on the node function')
            return f().schema
        except Exception as e:
            logger.warning(f'Error with the schema callback: {e}')
            return []
    return schema_callback
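
A minimal sketch of wrapping a data-producing function into a schema callback; `load_customers` is a hypothetical stand-in for a function returning a FlowDataEngine, and the import path for FlowNode is assumed from the source location shown above.

from flowfile_core.flowfile.flow_node.flow_node import FlowNode

def load_customers():
    """Hypothetical producer that would return a FlowDataEngine."""
    raise NotImplementedError  # placeholder for a real reader or transform

schema_cb = FlowNode.create_schema_callback_from_function(load_customers)
print(schema_cb())  # [] here: the wrapped call raised, and the callback swallows the error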
delete_input_node(node_id, connection_type='input-0', complete=False)

Removes a connection from a specific input node.

Parameters:

Name Type Description Default
node_id int

The ID of the input node to disconnect.

required
connection_type InputConnectionClass

The specific input handle (e.g., 'input-0', 'input-1').

'input-0'
complete bool

If True, tries to delete from all input types.

False

Returns:

Type Description
bool

True if a connection was found and removed, False otherwise.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def delete_input_node(self, node_id: int, connection_type: input_schema.InputConnectionClass = 'input-0',
                      complete: bool = False) -> bool:
    """Removes a connection from a specific input node.

    Args:
        node_id: The ID of the input node to disconnect.
        connection_type: The specific input handle (e.g., 'input-0', 'input-1').
        complete: If True, tries to delete from all input types.

    Returns:
        True if a connection was found and removed, False otherwise.
    """
    deleted: bool = False
    if connection_type == 'input-0':
        for i, node in enumerate(self.node_inputs.main_inputs):
            if node.node_id == node_id:
                self.node_inputs.main_inputs.pop(i)
                deleted = True
                if not complete:
                    continue
    elif connection_type == 'input-1' or complete:
        if self.node_inputs.right_input is not None and self.node_inputs.right_input.node_id == node_id:
            self.node_inputs.right_input = None
            deleted = True
    elif connection_type == 'input-2' or complete:
        if self.node_inputs.left_input is not None and self.node_inputs.right_input.node_id == node_id:
            self.node_inputs.left_input = None
            deleted = True
    else:
        logger.warning('Could not find the connection to delete...')
    if deleted:
        self.reset()
    return deleted
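
For example, disconnecting an upstream node from a join's right input might look like the sketch below; `join_node` is assumed to be a FlowNode, and the handle names follow the mapping used in get_edge_input ('input-0' main, 'input-1' right, 'input-2' left).

def disconnect_right_input(join_node, upstream_id):
    """Drop the right-input connection coming from `upstream_id`, if present."""
    return join_node.delete_input_node(upstream_id, connection_type='input-1')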
delete_lead_to_node(node_id)

Removes a connection to a specific downstream node.

Parameters:

Name Type Description Default
node_id int

The ID of the downstream node to disconnect.

required

Returns:

Type Description
bool

True if the connection was found and removed, False otherwise.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def delete_lead_to_node(self, node_id: int) -> bool:
    """Removes a connection to a specific downstream node.

    Args:
        node_id: The ID of the downstream node to disconnect.

    Returns:
        True if the connection was found and removed, False otherwise.
    """
    logger.info(f'Deleting lead to node: {node_id}')
    for i, lead_to_node in enumerate(self.leads_to_nodes):
        logger.info(f'Checking lead to node: {lead_to_node.node_id}')
        if lead_to_node.node_id == node_id:
            logger.info(f'Found the node to delete: {node_id}')
            self.leads_to_nodes.pop(i)
            return True
    return False
evaluate_nodes(deep=False)

Triggers a state reset for all directly connected downstream nodes.

Parameters:

Name Type Description Default
deep bool

If True, the reset propagates recursively through the entire downstream graph.

False
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def evaluate_nodes(self, deep: bool = False) -> None:
    """Triggers a state reset for all directly connected downstream nodes.

    Args:
        deep: If True, the reset propagates recursively through the entire downstream graph.
    """
    for node in self.leads_to_nodes:
        self.print(f'resetting node: {node.node_id}')
        node.reset(deep)
execute_full_local(performance_mode=False)

Executes the node's logic locally, including example data generation.

Parameters:

Name Type Description Default
performance_mode bool

If True, skips generating example data.

False

Raises:

Type Description
Exception

Propagates exceptions from the execution.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def execute_full_local(self, performance_mode: bool = False) -> None:
    """Executes the node's logic locally, including example data generation.

    Args:
        performance_mode: If True, skips generating example data.

    Raises:
        Exception: Propagates exceptions from the execution.
    """
    def example_data_generator():
        example_data = None

        def get_example_data():
            nonlocal example_data
            if example_data is None:
                example_data = resulting_data.get_sample(100).to_arrow()
            return example_data
        return get_example_data
    resulting_data = self.get_resulting_data()

    if not performance_mode:
        self.results.example_data_generator = example_data_generator()
        self.node_schema.result_schema = self.results.resulting_data.schema
        self.node_stats.has_completed_last_run = True
execute_local(flow_id, performance_mode=False)

Executes the node's logic locally.

Parameters:

Name Type Description Default
flow_id int

The ID of the parent flow.

required
performance_mode bool

If True, skips generating example data.

False

Raises:

Type Description
Exception

Propagates exceptions from the execution.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def execute_local(self, flow_id: int, performance_mode: bool = False):
    """Executes the node's logic locally.

    Args:
        flow_id: The ID of the parent flow.
        performance_mode: If True, skips generating example data.

    Raises:
        Exception: Propagates exceptions from the execution.
    """
    try:
        resulting_data = self.get_resulting_data()
        if not performance_mode:
            external_sampler = ExternalSampler(lf=resulting_data.data_frame, file_ref=self.hash,
                                               wait_on_completion=True, node_id=self.node_id, flow_id=flow_id)
            self.store_example_data_generator(external_sampler)
            if self.results.errors is None and not self.node_stats.is_canceled:
                self.node_stats.has_run_with_current_setup = True
        self.node_schema.result_schema = resulting_data.schema

    except Exception as e:
        logger.warning(f"Error with step {self.__name__}")
        logger.error(str(e))
        self.results.errors = str(e)
        self.node_stats.has_run_with_current_setup = False
        self.node_stats.has_completed_last_run = False
        raise e

    if self.node_stats.has_run_with_current_setup:
        for step in self.leads_to_nodes:
            if not self.node_settings.streamable:
                step.node_settings.streamable = self.node_settings.streamable
execute_node(run_location, reset_cache=False, performance_mode=False, retry=True, node_logger=None)

Orchestrates the execution, handling location, caching, and retries.

Parameters:

Name Type Description Default
run_location ExecutionLocationsLiteral

The location for execution ('local', 'remote').

required
reset_cache bool

If True, forces removal of any existing cache.

False
performance_mode bool

If True, optimizes for speed over diagnostics.

False
retry bool

If True, allows retrying execution on recoverable errors.

True
node_logger NodeLogger

The logger for this node execution.

None

Raises:

Type Description
Exception

If the node_logger is not defined.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def execute_node(self, run_location: schemas.ExecutionLocationsLiteral, reset_cache: bool = False,
                 performance_mode: bool = False, retry: bool = True, node_logger: NodeLogger = None):
    """Orchestrates the execution, handling location, caching, and retries.

    Args:
        run_location: The location for execution ('local', 'remote').
        reset_cache: If True, forces removal of any existing cache.
        performance_mode: If True, optimizes for speed over diagnostics.
        retry: If True, allows retrying execution on recoverable errors.
        node_logger: The logger for this node execution.

    Raises:
        Exception: If the node_logger is not defined.
    """
    if node_logger is None:
        raise Exception('Flow logger is not defined')
    # node_logger = flow_logger.get_node_logger(self.node_id)
    if reset_cache:
        self.remove_cache()
        self.node_stats.has_run_with_current_setup = False
        self.node_stats.has_completed_last_run = False
    if self.is_setup:
        node_logger.info(f'Starting to run {self.__name__}')
        if (self.needs_run(performance_mode, node_logger, run_location) or self.node_template.node_group == "output"
                and not (run_location == 'local' or SINGLE_FILE_MODE)):
            self.prepare_before_run()
            try:
                if ((run_location == 'remote' or (self.node_default.transform_type == 'wide')
                        and not run_location == 'local')) or self.node_settings.cache_results:
                    node_logger.info('Running the node remotely')
                    if self.node_settings.cache_results:
                        performance_mode = False
                    self.execute_remote(performance_mode=(performance_mode if not self.node_settings.cache_results
                                                          else False),
                                        node_logger=node_logger
                                        )
                else:
                    node_logger.info('Running the node locally')
                    self.execute_local(performance_mode=performance_mode, flow_id=node_logger.flow_id)
            except Exception as e:
                if 'No such file or directory (os error' in str(e) and retry:
                    logger.warning('Error with the input node, starting to rerun the input node...')
                    all_inputs: List[FlowNode] = self.node_inputs.get_all_inputs()
                    for node_input in all_inputs:
                        node_input.execute_node(run_location=run_location,
                                                performance_mode=performance_mode, retry=True,
                                                reset_cache=True,
                                                node_logger=node_logger)
                    self.execute_node(run_location=run_location,
                                      performance_mode=performance_mode, retry=False,
                                      node_logger=node_logger)
                else:
                    self.results.errors = str(e)
                    node_logger.error(f'Error with running the node: {e}')
        elif ((run_location == 'local' or SINGLE_FILE_MODE) and
              (not self.node_stats.has_run_with_current_setup or self.node_template.node_group == "output")):
            try:
                node_logger.info('Executing fully locally')
                self.execute_full_local(performance_mode)
            except Exception as e:
                self.results.errors = str(e)
                node_logger.error(f'Error with running the node: {e}')
                self.node_stats.error = str(e)
                self.node_stats.has_completed_last_run = False
            self.node_stats.has_run_with_current_setup = True
        else:
            node_logger.info('Node has already run, not running the node')
    else:
        node_logger.warning(f'Node {self.__name__} is not setup, cannot run the node')
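
A hedged usage sketch for driving one node by hand: `node` is assumed to be a FlowNode and `node_logger` a NodeLogger; in practice the logger is created by the surrounding FlowGraph run rather than constructed manually.

def run_single_node(node, node_logger):
    """Run one node remotely with a fresh cache and surface any error."""
    node.execute_node(run_location='remote',
                      reset_cache=True,
                      performance_mode=False,
                      node_logger=node_logger)
    if node.results.errors:
        print(f"node {node.node_id} failed: {node.results.errors}")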
execute_remote(performance_mode=False, node_logger=None)

Executes the node's logic remotely or handles cached results.

Parameters:

Name Type Description Default
performance_mode bool

If True, skips generating example data.

False
node_logger NodeLogger

The logger for this node execution.

None

Raises:

Type Description
Exception

If the node_logger is not provided or if execution fails.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def execute_remote(self, performance_mode: bool = False, node_logger: NodeLogger = None):
    """Executes the node's logic remotely or handles cached results.

    Args:
        performance_mode: If True, skips generating example data.
        node_logger: The logger for this node execution.

    Raises:
        Exception: If the node_logger is not provided or if execution fails.
    """
    if node_logger is None:
        raise Exception('Node logger is not defined')
    if self.node_settings.cache_results and results_exists(self.hash):
        try:
            self.results.resulting_data = get_external_df_result(self.hash)
            self._cache_progress = None
            return
        except Exception as e:
            node_logger.warning('Failed to read the cache, rerunning the code')
    if self.node_type == 'output':
        self.results.resulting_data = self.get_resulting_data()
        self.node_stats.has_run_with_current_setup = True
        return
    try:
        self.get_resulting_data()
    except Exception as e:
        self.results.errors = 'Error with creating the lazy frame, most likely due to invalid graph'
        raise e
    if not performance_mode:
        external_df_fetcher = ExternalDfFetcher(lf=self.get_resulting_data().data_frame,
                                                file_ref=self.hash, wait_on_completion=False,
                                                flow_id=node_logger.flow_id,
                                                node_id=self.node_id)
        self._fetch_cached_df = external_df_fetcher
        try:
            lf = external_df_fetcher.get_result()
            self.results.resulting_data = FlowDataEngine(
                lf, number_of_records=ExternalDfFetcher(lf=lf, operation_type='calculate_number_of_records',
                                                        flow_id=node_logger.flow_id, node_id=self.node_id).result
            )
            if not performance_mode:
                self.store_example_data_generator(external_df_fetcher)
                self.node_stats.has_run_with_current_setup = True

        except Exception as e:
            node_logger.error('Error with external process')
            if external_df_fetcher.error_code == -1:
                try:
                    self.results.resulting_data = self.get_resulting_data()
                    self.results.warnings = ('Error with external process (unknown error), '
                                             'likely the process was killed by the server because of memory constraints, '
                                             'continue with the process. '
                                             'We cannot display example data...')
                except Exception as e:
                    self.results.errors = str(e)
                    raise e
            elif external_df_fetcher.error_description is None:
                self.results.errors = str(e)
                raise e
            else:
                self.results.errors = external_df_fetcher.error_description
                raise Exception(external_df_fetcher.error_description)
        finally:
            self._fetch_cached_df = None
get_all_dependent_node_ids()

Yields the IDs of all downstream nodes recursively.

Returns:

Type Description
Generator[int, None, None]

A generator of all dependent node IDs.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_all_dependent_node_ids(self) -> Generator[int, None, None]:
    """Yields the IDs of all downstream nodes recursively.

    Returns:
        A generator of all dependent node IDs.
    """
    for node in self.leads_to_nodes:
        yield node.node_id
        for n in node.get_all_dependent_node_ids():
            yield n
get_all_dependent_nodes()

Yields all downstream nodes recursively.

Returns:

Type Description
Generator[FlowNode, None, None]

A generator of all dependent FlowNode objects.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_all_dependent_nodes(self) -> Generator["FlowNode", None, None]:
    """Yields all downstream nodes recursively.

    Returns:
        A generator of all dependent FlowNode objects.
    """
    for node in self.leads_to_nodes:
        yield node
        for n in node.get_all_dependent_nodes():
            yield n
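
Because the walk visits every path, a node reachable through several branches is yielded more than once; deduplicating by ID, as in this small sketch (assuming `node` is a FlowNode), gives the set of nodes affected by a change here.

def downstream_ids(node) -> set:
    """Unique IDs of all nodes downstream of `node`."""
    return set(node.get_all_dependent_node_ids())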
get_edge_input()

Generates NodeEdge objects for all input connections to this node.

Returns:

Type Description
List[NodeEdge]

A list of NodeEdge objects.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_edge_input(self) -> List[schemas.NodeEdge]:
    """Generates `NodeEdge` objects for all input connections to this node.

    Returns:
        A list of `NodeEdge` objects.
    """
    edges = []
    if self.node_inputs.main_inputs is not None:
        for i, main_input in enumerate(self.node_inputs.main_inputs):
            edges.append(schemas.NodeEdge(id=f'{main_input.node_id}-{self.node_id}-{i}',
                                          source=main_input.node_id,
                                          target=self.node_id,
                                          sourceHandle='output-0',
                                          targetHandle='input-0',
                                          ))
    if self.node_inputs.left_input is not None:
        edges.append(schemas.NodeEdge(id=f'{self.node_inputs.left_input.node_id}-{self.node_id}-right',
                                      source=self.node_inputs.left_input.node_id,
                                      target=self.node_id,
                                      sourceHandle='output-0',
                                      targetHandle='input-2',
                                      ))
    if self.node_inputs.right_input is not None:
        edges.append(schemas.NodeEdge(id=f'{self.node_inputs.right_input.node_id}-{self.node_id}-left',
                                      source=self.node_inputs.right_input.node_id,
                                      target=self.node_id,
                                      sourceHandle='output-0',
                                      targetHandle='input-1',
                                      ))
    return edges
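
A sketch of assembling the full edge list for the canvas, assuming `nodes` is an iterable of the graph's FlowNode objects (however they are obtained from the FlowGraph):

def collect_edges(nodes):
    """Flatten the per-node input edges into one list of schemas.NodeEdge."""
    edges = []
    for node in nodes:
        edges.extend(node.get_edge_input())
    return edges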
get_flow_file_column_schema(col_name)

Retrieves the schema for a specific column from the output schema.

Parameters:

Name Type Description Default
col_name str

The name of the column.

required

Returns:

Type Description
FlowfileColumn | None

The FlowfileColumn object for that column, or None if not found.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_flow_file_column_schema(self, col_name: str) -> FlowfileColumn | None:
    """Retrieves the schema for a specific column from the output schema.

    Args:
        col_name: The name of the column.

    Returns:
        The FlowfileColumn object for that column, or None if not found.
    """
    for s in self.schema:
        if s.column_name == col_name:
            return s
get_input_type(node_id)

Gets the type of connection ('main', 'left', 'right') for a given input node ID.

Parameters:

Name Type Description Default
node_id int

The ID of the input node.

required

Returns:

Type Description
List

A list of connection types for that node ID.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_input_type(self, node_id: int) -> List:
    """Gets the type of connection ('main', 'left', 'right') for a given input node ID.

    Args:
        node_id: The ID of the input node.

    Returns:
        A list of connection types for that node ID.
    """
    relation_type = []
    if node_id in [n.node_id for n in self.node_inputs.main_inputs]:
        relation_type.append('main')
    if self.node_inputs.left_input is not None and node_id == self.node_inputs.left_input.node_id:
        relation_type.append('left')
    if self.node_inputs.right_input is not None and node_id == self.node_inputs.right_input.node_id:
        relation_type.append('right')
    return list(set(relation_type))
get_node_data(flow_id, include_example=False)

Gathers all necessary data for representing the node in the UI.

Parameters:

Name Type Description Default
flow_id int

The ID of the parent flow.

required
include_example bool

If True, includes data samples.

False

Returns:

Type Description
NodeData

A NodeData object.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_node_data(self, flow_id: int, include_example: bool = False) -> NodeData:
    """Gathers all necessary data for representing the node in the UI.

    Args:
        flow_id: The ID of the parent flow.
        include_example: If True, includes data samples.

    Returns:
        A `NodeData` object.
    """
    node = NodeData(flow_id=flow_id,
                    node_id=self.node_id,
                    has_run=self.node_stats.has_run_with_current_setup,
                    setting_input=self.setting_input,
                    flow_type=self.node_type)
    if self.main_input:
        node.main_input = self.main_input[0].get_table_example()
    if self.left_input:
        node.left_input = self.left_input.get_table_example()
    if self.right_input:
        node.right_input = self.right_input.get_table_example()
    if self.is_setup:
        node.main_output = self.get_table_example(include_example)
    node = setting_generator.get_setting_generator(self.node_type)(node)

    node = setting_updator.get_setting_updator(self.node_type)(node)
    return node
get_node_information()

Updates and returns the node's information object.

Returns:

Type Description
NodeInformation

The NodeInformation object for this node.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_node_information(self) -> schemas.NodeInformation:
    """Updates and returns the node's information object.

    Returns:
        The `NodeInformation` object for this node.
    """
    self.set_node_information()
    return self.node_information
get_node_input()

Creates a NodeInput schema object for representing this node in the UI.

Returns:

Type Description
NodeInput

A NodeInput object.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_node_input(self) -> schemas.NodeInput:
    """Creates a `NodeInput` schema object for representing this node in the UI.

    Returns:
        A `NodeInput` object.
    """
    return schemas.NodeInput(pos_y=self.setting_input.pos_y,
                             pos_x=self.setting_input.pos_x,
                             id=self.node_id,
                             **self.node_template.__dict__)
get_output_data()

Gets the full output data sample for this node.

Returns:

Type Description
TableExample

A TableExample object with data.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_output_data(self) -> TableExample:
    """Gets the full output data sample for this node.

    Returns:
        A `TableExample` object with data.
    """
    return self.get_table_example(True)
get_predicted_resulting_data()

Creates a FlowDataEngine instance based on the predicted schema.

This avoids executing the node's full logic.

Returns:

Type Description
FlowDataEngine

A FlowDataEngine instance with a schema but no data.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_predicted_resulting_data(self) -> FlowDataEngine:
    """Creates a `FlowDataEngine` instance based on the predicted schema.

    This avoids executing the node's full logic.

    Returns:
        A FlowDataEngine instance with a schema but no data.
    """
    if self.needs_run(False) and self.schema_callback is not None or self.node_schema.result_schema is not None:
        self.print('Getting data based on the schema')

        _s = self.schema_callback() if self.node_schema.result_schema is None else self.node_schema.result_schema
        return FlowDataEngine.create_from_schema(_s)
    else:
        if isinstance(self.function, FlowDataEngine):
            fl = self.function
        else:
            fl = FlowDataEngine.create_from_schema(self.get_predicted_schema())
        return fl
get_predicted_schema(force=False)

Predicts the output schema of the node without full execution.

It uses the schema_callback or infers from predicted data.

Parameters:

Name Type Description Default
force bool

If True, forces recalculation even if a predicted schema exists.

False

Returns:

Type Description
List[FlowfileColumn] | None

A list of FlowfileColumn objects representing the predicted schema.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_predicted_schema(self, force: bool = False) -> List[FlowfileColumn] | None:
    """Predicts the output schema of the node without full execution.

    It uses the schema_callback or infers from predicted data.

    Args:
        force: If True, forces recalculation even if a predicted schema exists.

    Returns:
        A list of FlowfileColumn objects representing the predicted schema.
    """
    if self.node_schema.predicted_schema and not force:
        return self.node_schema.predicted_schema
    if self.schema_callback is not None and (self.node_schema.predicted_schema is None or force):
        self.print('Getting the data from a schema callback')
        if force:
            # Force the schema callback to reset, so that it will be executed again
            self.schema_callback.reset()
        schema = self.schema_callback()
        if schema is not None and len(schema) > 0:
            self.print('Calculating the schema based on the schema callback')
            self.node_schema.predicted_schema = schema
            return self.node_schema.predicted_schema
    predicted_data = self._predicted_data_getter()
    if predicted_data is not None and predicted_data.schema is not None:
        self.print('Calculating the schema based on the predicted resulting data')
        self.node_schema.predicted_schema = self._predicted_data_getter().schema
    return self.node_schema.predicted_schema
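
Schema prediction is what lets the designer display column names and types before a node has ever executed. A short sketch, again using a hypothetical graph.get_node(...) lookup for illustration:

node = graph.get_node(node_id=5)          # hypothetical lookup

predicted = node.get_predicted_schema()   # cached prediction, schema callback, or predicted data
if predicted is not None:
    print([column.column_name for column in predicted])

# After changing settings, force a recalculation instead of reusing the cached prediction.
fresh = node.get_predicted_schema(force=True)
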
get_repr()

Gets a detailed dictionary representation of the node's state.

Returns:

Type Description
dict

A dictionary containing key information about the node.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_repr(self) -> dict:
    """Gets a detailed dictionary representation of the node's state.

    Returns:
        A dictionary containing key information about the node.
    """
    return dict(FlowNode=
                dict(node_id=self.node_id,
                     step_name=self.__name__,
                     output_columns=self.node_schema.output_columns,
                     output_schema=self._get_readable_schema()))
get_resulting_data()

Executes the node's function to produce the actual output data.

Handles both regular functions and external data sources.

Returns:

Type Description
FlowDataEngine | None

A FlowDataEngine instance containing the result, or None on error.

Raises:

Type Description
Exception

Propagates exceptions from the node's function execution.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_resulting_data(self) -> FlowDataEngine | None:
    """Executes the node's function to produce the actual output data.

    Handles both regular functions and external data sources.

    Returns:
        A FlowDataEngine instance containing the result, or None on error.

    Raises:
        Exception: Propagates exceptions from the node's function execution.
    """
    if self.is_setup:
        if self.results.resulting_data is None and self.results.errors is None:
            self.print('getting resulting data')
            try:
                if isinstance(self.function, FlowDataEngine):
                    fl: FlowDataEngine = self.function
                elif self.node_type == 'external_source':
                    fl: FlowDataEngine = self.function()
                    fl.collect_external()
                    self.node_settings.streamable = False
                else:
                    try:
                        fl = self._function(*[v.get_resulting_data() for v in self.all_inputs])
                    except Exception as e:
                        raise e
                fl.set_streamable(self.node_settings.streamable)
                self.results.resulting_data = fl
                self.node_schema.result_schema = fl.schema
            except Exception as e:
                self.results.resulting_data = FlowDataEngine()
                self.results.errors = str(e)
                self.node_stats.has_run_with_current_setup = False
                self.node_stats.has_completed_last_run = False
                raise e
        return self.results.resulting_data
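
The result is computed once and cached on node.results; subsequent calls return the cached FlowDataEngine. A sketch of calling it directly and handling a failure (the node lookup is hypothetical):

node = graph.get_node(node_id=7)          # hypothetical lookup

try:
    result = node.get_resulting_data()    # runs the node's function only if no cached result exists
except Exception as exc:
    # The error is also stored on node.results.errors before being re-raised.
    print(f"node {node.node_id} failed: {exc}")
else:
    if result is not None:
        df = result.collect()             # materialise the FlowDataEngine as a Polars DataFrame
        print(df.shape)
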
get_table_example(include_data=False)

Generates a TableExample model summarizing the node's output.

This can optionally include a sample of the data.

Parameters:

Name Type Description Default
include_data bool

If True, includes a data sample in the result.

False

Returns:

Type Description
TableExample | None

A TableExample object, or None if the node is not set up.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_table_example(self, include_data: bool = False) -> TableExample | None:
    """Generates a `TableExample` model summarizing the node's output.

    This can optionally include a sample of the data.

    Args:
        include_data: If True, includes a data sample in the result.

    Returns:
        A `TableExample` object, or None if the node is not set up.
    """
    self.print('Getting a table example')
    if self.is_setup and include_data and self.node_stats.has_completed_last_run:

        if self.node_template.node_group == 'output':
            self.print('getting the table example')
            return self.main_input[0].get_table_example(include_data)

        logger.info('getting the table example since the node has run')
        example_data_getter = self.results.example_data_generator
        if example_data_getter is not None:
            data = example_data_getter().to_pylist()
            if data is None:
                data = []
        else:
            data = []
        schema = [FileColumn.model_validate(c.get_column_repr()) for c in self.schema]
        fl = self.get_resulting_data()
        return TableExample(node_id=self.node_id,
                            name=str(self.node_id), number_of_records=999,
                            number_of_columns=fl.number_of_fields,
                            table_schema=schema, columns=fl.columns, data=data)
    else:
        logger.warning('getting the table example but the node has not run')
        try:
            schema = [FileColumn.model_validate(c.get_column_repr()) for c in self.schema]
        except Exception as e:
            logger.warning(e)
            schema = []
        columns = [s.name for s in schema]
        return TableExample(node_id=self.node_id,
                            name=str(self.node_id), number_of_records=0,
                            number_of_columns=len(columns),
                            table_schema=schema, columns=columns,
                            data=[])
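
This is the method behind the data preview in the UI. A sketch of requesting a preview with sample rows (hypothetical node lookup; when the node has completed a run, data holds row dictionaries read back from the stored example data):

node = graph.get_node(node_id=4)          # hypothetical lookup

preview = node.get_table_example(include_data=True)
if preview is not None:
    print(preview.number_of_columns, "columns:", preview.columns)
    for row in preview.data[:5]:          # sample rows come from the stored example data generator
        print(row)
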
needs_reset()

Checks if the node's hash has changed, indicating an outdated state.

Returns:

Type Description
bool

True if the calculated hash differs from the stored hash.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def needs_reset(self) -> bool:
    """Checks if the node's hash has changed, indicating an outdated state.

    Returns:
        True if the calculated hash differs from the stored hash.
    """
    return self._hash != self.calculate_hash(self.setting_input)
needs_run(performance_mode, node_logger=None, execution_location='auto')

Determines if the node needs to be executed.

The decision is based on its run state, caching settings, and execution mode.

Parameters:

Name Type Description Default
performance_mode bool

True if the flow is in performance mode.

required
node_logger NodeLogger

The logger instance for this node.

None
execution_location ExecutionLocationsLiteral

The target execution location.

'auto'

Returns:

Type Description
bool

True if the node should be run, False otherwise.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def needs_run(self, performance_mode: bool, node_logger: NodeLogger = None,
              execution_location: schemas.ExecutionLocationsLiteral = "auto") -> bool:
    """Determines if the node needs to be executed.

    The decision is based on its run state, caching settings, and execution mode.

    Args:
        performance_mode: True if the flow is in performance mode.
        node_logger: The logger instance for this node.
        execution_location: The target execution location.

    Returns:
        True if the node should be run, False otherwise.
    """
    if execution_location == "local" or SINGLE_FILE_MODE:
        return False

    flow_logger = logger if node_logger is None else node_logger
    cache_result_exists = results_exists(self.hash)
    if not self.node_stats.has_run_with_current_setup:
        flow_logger.info('Node has not run, needs to run')
        return True
    if self.node_settings.cache_results and cache_result_exists:
        return False
    elif self.node_settings.cache_results and not cache_result_exists:
        return True
    elif not performance_mode and cache_result_exists:
        return False
    else:
        return True
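
In short: nothing is dispatched when running locally or in single-file mode, a node that has never run with its current setup always runs, and otherwise a cached result is reused when node caching is enabled or the flow is not in performance mode; in the remaining cases the node runs. A sketch of the typical call sequence around it (hypothetical node lookup):

node = graph.get_node(node_id=6)          # hypothetical lookup

if node.needs_run(performance_mode=False, execution_location="auto"):
    node.prepare_before_run()             # clear previous results and errors
    node.get_resulting_data()             # execute the node
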
post_init()

Initializes or resets the node's attributes to their default states.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def post_init(self):
    """Initializes or resets the node's attributes to their default states."""
    self.node_inputs = NodeStepInputs()
    self.node_stats = NodeStepStats()
    self.node_settings = NodeStepSettings()
    self.node_schema = NodeSchemaInformation()
    self.results = NodeResults()
    self.node_information = schemas.NodeInformation()
    self.leads_to_nodes = []
    self._setting_input = None
    self._cache_progress = None
    self._schema_callback = None
    self._state_needs_reset = False
prepare_before_run()

Resets results and errors before a new execution.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def prepare_before_run(self):
    """Resets results and errors before a new execution."""

    self.results.errors = None
    self.results.resulting_data = None
    self.results.example_data = None
print(v)

Helper method to log messages with node context.

Parameters:

Name Type Description Default
v Any

The message or value to log.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def print(self, v: Any):
    """Helper method to log messages with node context.

    Args:
        v: The message or value to log.
    """
    logger.info(f'{self.node_type}, node_id: {self.node_id}: {v}')
remove_cache()

Removes cached results for this node.

Note: Currently not fully implemented.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def remove_cache(self):
    """Removes cached results for this node.

    Note: Currently not fully implemented.
    """

    if results_exists(self.hash):
        logger.warning('Not implemented')
reset(deep=False)

Resets the node's execution state and schema information.

This also triggers a reset on all downstream nodes.

Parameters:

Name Type Description Default
deep bool

If True, forces a reset even if the hash hasn't changed.

False
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def reset(self, deep: bool = False):
    """Resets the node's execution state and schema information.

    This also triggers a reset on all downstream nodes.

    Args:
        deep: If True, forces a reset even if the hash hasn't changed.
    """
    needs_reset = self.needs_reset() or deep
    if needs_reset:
        logger.info(f'{self.node_id}: Node needs reset')
        self.node_stats.has_run_with_current_setup = False
        self.results.reset()
        if self.is_correct:
            self._schema_callback = None  # Ensure the schema callback is reset
            if self.schema_callback:
                logger.info(f'{self.node_id}: Resetting the schema callback')
                self.schema_callback.start()
        self.node_schema.result_schema = None
        self.node_schema.predicted_schema = None
        self._hash = None
        self.node_information.is_setup = None
        self.results.errors = None
        self.evaluate_nodes()
        _ = self.hash  # Recalculate the hash after reset
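
A reset is cheap when nothing has changed: it only does work when the node's settings hash differs from the stored one, or when deep=True is passed. A sketch (hypothetical node lookup):

node = graph.get_node(node_id=2)          # hypothetical lookup

node.reset()                              # no-op unless the settings hash has changed
node.reset(deep=True)                     # force the reset regardless of the hash
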
set_node_information()

Populates the node_information attribute with the current state.

This includes the node's connections, settings, and position.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def set_node_information(self):
    """Populates the `node_information` attribute with the current state.

    This includes the node's connections, settings, and position.
    """
    logger.info('setting node information')
    node_information = self.node_information
    node_information.left_input_id = self.node_inputs.left_input.node_id if self.left_input else None
    node_information.right_input_id = self.node_inputs.right_input.node_id if self.right_input else None
    node_information.input_ids = [mi.node_id for mi in
                                  self.node_inputs.main_inputs] if self.node_inputs.main_inputs is not None else None
    node_information.setting_input = self.setting_input
    node_information.outputs = [n.node_id for n in self.leads_to_nodes]
    node_information.is_setup = self.is_setup
    node_information.x_position = self.setting_input.pos_x
    node_information.y_position = self.setting_input.pos_y
    node_information.type = self.node_type
store_example_data_generator(external_df_fetcher)

Stores a generator function for fetching a sample of the result data.

Parameters:

Name Type Description Default
external_df_fetcher ExternalDfFetcher | ExternalSampler

The process that generated the sample data.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def store_example_data_generator(self, external_df_fetcher: ExternalDfFetcher | ExternalSampler):
    """Stores a generator function for fetching a sample of the result data.

    Args:
        external_df_fetcher: The process that generated the sample data.
    """
    if external_df_fetcher.status is not None:
        file_ref = external_df_fetcher.status.file_ref
        self.results.example_data_path = file_ref
        self.results.example_data_generator = get_read_top_n(file_path=file_ref, n=100)
    else:
        logger.error('Could not get the sample data, the external process is not ready')
update_node(function, input_columns=None, output_schema=None, drop_columns=None, name=None, setting_input=None, pos_x=0, pos_y=0, schema_callback=None)

Updates the properties of the node.

This is called during initialization and when settings are changed.

Parameters:

Name Type Description Default
function Callable

The new core data processing function.

required
input_columns List[str]

The new list of input columns.

None
output_schema List[FlowfileColumn]

The new schema of added columns.

None
drop_columns List[str]

The new list of dropped columns.

None
name str

The new name for the node.

None
setting_input Any

The new settings object.

None
pos_x float

The new x-coordinate.

0
pos_y float

The new y-coordinate.

0
schema_callback Callable

The new custom schema callback function.

None
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def update_node(self,
                function: Callable,
                input_columns: List[str] = None,
                output_schema: List[FlowfileColumn] = None,
                drop_columns: List[str] = None,
                name: str = None,
                setting_input: Any = None,
                pos_x: float = 0,
                pos_y: float = 0,
                schema_callback: Callable = None,
                ):
    """Updates the properties of the node.

    This is called during initialization and when settings are changed.

    Args:
        function: The new core data processing function.
        input_columns: The new list of input columns.
        output_schema: The new schema of added columns.
        drop_columns: The new list of dropped columns.
        name: The new name for the node.
        setting_input: The new settings object.
        pos_x: The new x-coordinate.
        pos_y: The new y-coordinate.
        schema_callback: The new custom schema callback function.
    """
    self.user_provided_schema_callback = schema_callback
    self.node_information.y_position = int(pos_y)
    self.node_information.x_position = int(pos_x)
    self.node_information.setting_input = setting_input
    self.name = self.node_type if name is None else name
    self._function = function

    self.node_schema.input_columns = [] if input_columns is None else input_columns
    self.node_schema.output_columns = [] if output_schema is None else output_schema
    self.node_schema.drop_columns = [] if drop_columns is None else drop_columns
    self.node_settings.renew_schema = True
    if hasattr(setting_input, 'cache_results'):
        self.node_settings.cache_results = setting_input.cache_results

    self.results.errors = None
    self.add_lead_to_in_depend_source()
    _ = self.hash
    self.node_template = node_interface.node_dict.get(self.node_type)
    if self.node_template is None:
        raise Exception(f'Node template {self.node_type} not found')
    self.node_default = node_interface.node_defaults.get(self.node_type)
    self.setting_input = setting_input  # wait until the end so that the hash is calculated correctly

The FlowDataEngine

The FlowDataEngine is the primary engine of the library, providing a rich API for data manipulation, I/O, and transformation. Its methods are grouped below by functionality.
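
Before the full reference, a minimal usage sketch; constructor arguments follow the __init__ signature documented below, and collect()/to_pylist() behave as summarised in the method table:

import polars as pl

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

# From plain Python records.
engine = FlowDataEngine(raw_data=[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
print(len(engine))                        # number of records

# From an existing Polars LazyFrame (stays lazy; number_of_records may be -1).
lazy_engine = FlowDataEngine(raw_data=pl.DataFrame({"x": [1, 2, 3]}).lazy())

df = engine.collect()                     # materialise as a Polars DataFrame
records = engine.to_pylist()              # or as a list of dictionaries
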

flowfile_core.flowfile.flow_data_engine.flow_data_engine.FlowDataEngine dataclass

The core data handling engine for Flowfile.

This class acts as a high-level wrapper around a Polars DataFrame or LazyFrame, providing a unified API for data ingestion, transformation, and output. It manages data state (lazy vs. eager), schema information, and execution logic.

Attributes:

Name Type Description
_data_frame Union[DataFrame, LazyFrame]

The underlying Polars DataFrame or LazyFrame.

columns List[Any]

A list of column names in the current data frame.

name str

An optional name for the data engine instance.

number_of_records int

The number of records. Can be -1 for lazy frames.

errors List

A list of errors encountered during operations.

_schema Optional[List[FlowfileColumn]]

A cached list of FlowfileColumn objects representing the schema.

Methods:

Name Description
__call__

Makes the class instance callable, returning itself.

__get_sample__

Internal method to get a sample of the data.

__getitem__

Accesses a specific column or item from the DataFrame.

__init__

Initializes the FlowDataEngine from various data sources.

__len__

Returns the number of records in the table.

__repr__

Returns a string representation of the FlowDataEngine.

add_new_values

Adds a new column with the provided values.

add_record_id

Adds a record ID (row number) column to the DataFrame.

apply_flowfile_formula

Applies a formula to create a new column or transform an existing one.

apply_sql_formula

Applies an SQL-style formula using pl.sql_expr.

assert_equal

Asserts that this DataFrame is equal to another.

cache

Caches the current DataFrame to disk and updates the internal reference.

calculate_schema

Calculates and returns the schema.

change_column_types

Changes the data type of one or more columns.

collect

Collects the data and returns it as a Polars DataFrame.

collect_external

Materializes data from a tracked external source.

concat

Concatenates this DataFrame with one or more other DataFrames.

count

Gets the total number of records.

create_from_external_source

Creates a FlowDataEngine from an external data source.

create_from_path

Creates a FlowDataEngine from a local file path.

create_from_path_worker

Creates a FlowDataEngine from a path in a worker process.

create_from_schema

Creates an empty FlowDataEngine from a schema definition.

create_from_sql

Creates a FlowDataEngine by executing a SQL query.

create_random

Creates a FlowDataEngine with randomly generated data.

do_cross_join

Performs a cross join with another DataFrame.

do_filter

Filters rows based on a predicate expression.

do_fuzzy_join

Performs a fuzzy join with another DataFrame.

do_group_by

Performs a group-by operation on the DataFrame.

do_pivot

Converts the DataFrame from a long to a wide format, aggregating values.

do_select

Performs a complex column selection, renaming, and reordering operation.

do_sort

Sorts the DataFrame by one or more columns.

drop_columns

Drops specified columns from the DataFrame.

from_cloud_storage_obj

Creates a FlowDataEngine from an object in cloud storage.

fuzzy_match

Performs a simple fuzzy match between two DataFrames on a single column pair.

generate_enumerator

Generates a FlowDataEngine with a single column containing a sequence of integers.

get_estimated_file_size

Estimates the file size in bytes if the data originated from a local file.

get_number_of_records

Gets the total number of records in the DataFrame.

get_output_sample

Gets a sample of the data as a list of dictionaries.

get_record_count

Returns a new FlowDataEngine with a single column 'number_of_records'

get_sample

Gets a sample of rows from the DataFrame.

get_schema_column

Retrieves the schema information for a single column by its name.

get_select_inputs

Gets SelectInput specifications for all columns in the current schema.

get_subset

Gets the first n_rows from the DataFrame.

initialize_empty_fl

Initializes an empty LazyFrame.

iter_batches

Iterates over the DataFrame in batches.

join

Performs a standard SQL-style join with another DataFrame.

make_unique

Gets the unique rows from the DataFrame.

output

Writes the DataFrame to an output file.

reorganize_order

Reorganizes columns into a specified order.

save

Saves the DataFrame to a file in a separate thread.

select_columns

Selects a subset of columns from the DataFrame.

set_streamable

Sets whether DataFrame operations should be streamable.

solve_graph

Solves a graph problem represented by 'from' and 'to' columns.

split

Splits a column's text values into multiple rows based on a delimiter.

start_fuzzy_join

Starts a fuzzy join operation in a background process.

to_arrow

Converts the DataFrame to a PyArrow Table.

to_cloud_storage_obj

Writes the DataFrame to an object in cloud storage.

to_dict

Converts the DataFrame to a Python dictionary of columns.

to_pylist

Converts the DataFrame to a list of Python dictionaries.

to_raw_data

Converts the DataFrame to a RawData schema object.

unpivot

Converts the DataFrame from a wide to a long format.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@dataclass
class FlowDataEngine:
    """The core data handling engine for Flowfile.

    This class acts as a high-level wrapper around a Polars DataFrame or
    LazyFrame, providing a unified API for data ingestion, transformation,
    and output. It manages data state (lazy vs. eager), schema information,
    and execution logic.

    Attributes:
        _data_frame: The underlying Polars DataFrame or LazyFrame.
        columns: A list of column names in the current data frame.
        name: An optional name for the data engine instance.
        number_of_records: The number of records. Can be -1 for lazy frames.
        errors: A list of errors encountered during operations.
        _schema: A cached list of `FlowfileColumn` objects representing the schema.
    """
    # Core attributes
    _data_frame: Union[pl.DataFrame, pl.LazyFrame]
    columns: List[Any]

    # Metadata attributes
    name: str = None
    number_of_records: int = None
    errors: List = None
    _schema: Optional[List['FlowfileColumn']] = None

    # Configuration attributes
    _optimize_memory: bool = False
    _lazy: bool = None
    _streamable: bool = True
    _calculate_schema_stats: bool = False

    # Cache and optimization attributes
    __col_name_idx_map: Dict = None
    __data_map: Dict = None
    __optimized_columns: List = None
    __sample__: str = None
    __number_of_fields: int = None
    _col_idx: Dict[str, int] = None

    # Source tracking
    _org_path: Optional[str] = None
    _external_source: Optional[ExternalDataSource] = None

    # State tracking
    sorted_by: int = None
    is_future: bool = False
    is_collected: bool = True
    ind_schema_calculated: bool = False

    # Callbacks
    _future: Future = None
    _number_of_records_callback: Callable = None
    _data_callback: Callable = None


    def __init__(self,
                 raw_data: Union[List[Dict], List[Any], Dict[str, Any], 'ParquetFile', pl.DataFrame, pl.LazyFrame, input_schema.RawData] = None,
                 path_ref: str = None,
                 name: str = None,
                 optimize_memory: bool = True,
                 schema: List['FlowfileColumn'] | List[str] | pl.Schema = None,
                 number_of_records: int = None,
                 calculate_schema_stats: bool = False,
                 streamable: bool = True,
                 number_of_records_callback: Callable = None,
                 data_callback: Callable = None):
        """Initializes the FlowDataEngine from various data sources.

        Args:
            raw_data: The input data. Can be a list of dicts, a Polars DataFrame/LazyFrame,
                or a `RawData` schema object.
            path_ref: A string path to a Parquet file.
            name: An optional name for the data engine instance.
            optimize_memory: If True, prefers lazy operations to conserve memory.
            schema: An optional schema definition. Can be a list of `FlowfileColumn` objects,
                a list of column names, or a Polars `Schema`.
            number_of_records: The number of records, if known.
            calculate_schema_stats: If True, computes detailed statistics for each column.
            streamable: If True, allows for streaming operations when possible.
            number_of_records_callback: A callback function to retrieve the number of records.
            data_callback: A callback function to retrieve the data.
        """
        self._initialize_attributes(number_of_records_callback, data_callback, streamable)

        if raw_data is not None:
            self._handle_raw_data(raw_data, number_of_records, optimize_memory)
        elif path_ref:
            self._handle_path_ref(path_ref, optimize_memory)
        else:
            self.initialize_empty_fl()
        self._finalize_initialization(name, optimize_memory, schema, calculate_schema_stats)

    def _initialize_attributes(self, number_of_records_callback, data_callback, streamable):
        """(Internal) Sets the initial default attributes for a new instance.

        This helper is called first during initialization to ensure all state-tracking
        and configuration attributes have a clean default value before data is processed.
        """
        self._external_source = None
        self._number_of_records_callback = number_of_records_callback
        self._data_callback = data_callback
        self.ind_schema_calculated = False
        self._streamable = streamable
        self._org_path = None
        self._lazy = False
        self.errors = []
        self._calculate_schema_stats = False
        self.is_collected = True
        self.is_future = False

    def _handle_raw_data(self, raw_data, number_of_records, optimize_memory):
        """(Internal) Dispatches raw data to the appropriate handler based on its type.

        This acts as a router during initialization, inspecting the type of `raw_data`
        and calling the corresponding specialized `_handle_*` method to process it.
        """
        if isinstance(raw_data, input_schema.RawData):
            self._handle_raw_data_format(raw_data)
        elif isinstance(raw_data, pl.DataFrame):
            self._handle_polars_dataframe(raw_data, number_of_records)
        elif isinstance(raw_data, pl.LazyFrame):
            self._handle_polars_lazy_frame(raw_data, number_of_records, optimize_memory)
        elif isinstance(raw_data, (list, dict)):
            self._handle_python_data(raw_data)

    def _handle_polars_dataframe(self, df: pl.DataFrame, number_of_records: Optional[int]):
        """(Internal) Initializes the engine from an eager Polars DataFrame."""
        self.data_frame = df
        self.number_of_records = number_of_records or df.select(pl.len())[0, 0]

    def _handle_polars_lazy_frame(self, lf: pl.LazyFrame, number_of_records: Optional[int], optimize_memory: bool):
        """(Internal) Initializes the engine from a Polars LazyFrame."""
        self.data_frame = lf
        self._lazy = True
        if number_of_records is not None:
            self.number_of_records = number_of_records
        elif optimize_memory:
            self.number_of_records = -1
        else:
            self.number_of_records = lf.select(pl.len()).collect()[0, 0]

    def _handle_python_data(self, data: Union[List, Dict]):
        """(Internal) Dispatches Python collections to the correct handler."""
        if isinstance(data, dict):
            self._handle_dict_input(data)
        else:
            self._handle_list_input(data)

    def _handle_dict_input(self, data: Dict):
        """(Internal) Initializes the engine from a Python dictionary."""
        if len(data) == 0:
            self.initialize_empty_fl()
        lengths = [len(v) if isinstance(v, (list, tuple)) else 1 for v in data.values()]

        if len(set(lengths)) == 1 and lengths[0] > 1:
            self.number_of_records = lengths[0]
            self.data_frame = pl.DataFrame(data)
        else:
            self.number_of_records = 1
            self.data_frame = pl.DataFrame([data])
        self.lazy = True

    def _handle_raw_data_format(self, raw_data: input_schema.RawData):
        """(Internal) Initializes the engine from a `RawData` schema object.

        This method uses the schema provided in the `RawData` object to correctly
        infer data types when creating the Polars DataFrame.

        Args:
            raw_data: An instance of `RawData` containing the data and schema.
        """
        flowfile_schema = list(FlowfileColumn.create_from_minimal_field_info(c) for c in raw_data.columns)
        polars_schema = pl.Schema([(flowfile_column.column_name, flowfile_column.get_polars_type().pl_datatype)
                                   for flowfile_column in flowfile_schema])
        try:
            df = pl.DataFrame(raw_data.data, polars_schema, strict=False)
        except TypeError as e:
            logger.warning(f"Could not parse the data with the schema:\n{e}")
            df = pl.DataFrame(raw_data.data)
        self.number_of_records = len(df)
        self.data_frame = df.lazy()
        self.lazy = True

    def _handle_list_input(self, data: List):
        """(Internal) Initializes the engine from a list of records."""
        number_of_records = len(data)
        if number_of_records > 0:
            processed_data = self._process_list_data(data)
            self.number_of_records = number_of_records
            self.data_frame = pl.DataFrame(processed_data)
            self.lazy = True
        else:
            self.initialize_empty_fl()
            self.number_of_records = 0

    @staticmethod
    def _process_list_data(data: List) -> List[Dict]:
        """(Internal) Normalizes list data into a list of dictionaries.

        Ensures that a list of objects or non-dict items is converted into a
        uniform list of dictionaries suitable for Polars DataFrame creation.
        """
        if not (isinstance(data[0], dict) or hasattr(data[0], '__dict__')):
            try:
                return pl.DataFrame(data).to_dicts()
            except TypeError:
                raise Exception('Value must be able to be converted to dictionary')
            except Exception as e:
                raise Exception(f'Value must be able to be converted to dictionary: {e}')

        if not isinstance(data[0], dict):
            data = [row.__dict__ for row in data]

        return ensure_similarity_dicts(data)

    def to_cloud_storage_obj(self, settings: cloud_storage_schemas.CloudStorageWriteSettingsInternal):
        """Writes the DataFrame to an object in cloud storage.

        This method supports writing to various cloud storage providers like AWS S3,
        Azure Data Lake Storage, and Google Cloud Storage.

        Args:
            settings: A `CloudStorageWriteSettingsInternal` object containing connection
                details, file format, and write options.

        Raises:
            ValueError: If the specified file format is not supported for writing.
            NotImplementedError: If the 'append' write mode is used with an unsupported format.
            Exception: If the write operation to cloud storage fails for any reason.
        """
        connection = settings.connection
        write_settings = settings.write_settings

        logger.info(f"Writing to {connection.storage_type} storage: {write_settings.resource_path}")

        if write_settings.write_mode == 'append' and write_settings.file_format != "delta":
            raise NotImplementedError("The 'append' write mode is not yet supported for this destination.")
        storage_options = CloudStorageReader.get_storage_options(connection)
        credential_provider = CloudStorageReader.get_credential_provider(connection)
        # Dispatch to the correct writer based on file format
        if write_settings.file_format == "parquet":
            self._write_parquet_to_cloud(
                write_settings.resource_path,
                storage_options,
                credential_provider,
                write_settings
            )
        elif write_settings.file_format == "delta":
            self._write_delta_to_cloud(
                write_settings.resource_path,
                storage_options,
                credential_provider,
                write_settings
            )
        elif write_settings.file_format == "csv":
            self._write_csv_to_cloud(
                write_settings.resource_path,
                storage_options,
                credential_provider,
                write_settings
            )
        elif write_settings.file_format == "json":
            self._write_json_to_cloud(
                write_settings.resource_path,
                storage_options,
                credential_provider,
                write_settings
            )
        else:
            raise ValueError(f"Unsupported file format for writing: {write_settings.file_format}")

        logger.info(f"Successfully wrote data to {write_settings.resource_path}")

    def _write_parquet_to_cloud(self,
                                resource_path: str,
                                storage_options: Dict[str, Any],
                                credential_provider: Optional[Callable],
                                write_settings: cloud_storage_schemas.CloudStorageWriteSettings):
        """(Internal) Writes the DataFrame to a Parquet file in cloud storage.

        Uses `sink_parquet` for efficient streaming writes. Falls back to a
        collect-then-write pattern if sinking fails.
        """
        try:
            sink_kwargs = {
                "path": resource_path,
                "compression": write_settings.parquet_compression,
            }
            if storage_options:
                sink_kwargs["storage_options"] = storage_options
            if credential_provider:
                sink_kwargs["credential_provider"] = credential_provider
            try:
                self.data_frame.sink_parquet(**sink_kwargs)
            except Exception as e:
                logger.warning(f"Failed to sink the data, falling back to collecing and writing. \n {e}")
                pl_df = self.collect()
                sink_kwargs['file'] = sink_kwargs.pop("path")
                pl_df.write_parquet(**sink_kwargs)

        except Exception as e:
            logger.error(f"Failed to write Parquet to {resource_path}: {str(e)}")
            raise Exception(f"Failed to write Parquet to cloud storage: {str(e)}")

    def _write_delta_to_cloud(self,
                              resource_path: str,
                              storage_options: Dict[str, Any],
                              credential_provider: Optional[Callable],
                              write_settings: cloud_storage_schemas.CloudStorageWriteSettings):
        """(Internal) Writes the DataFrame to a Delta Lake table in cloud storage.

        This operation requires collecting the data first, as `write_delta` operates
        on an eager DataFrame.
        """
        sink_kwargs = {
            "target": resource_path,
            "mode": write_settings.write_mode,
        }
        if storage_options:
            sink_kwargs["storage_options"] = storage_options
        if credential_provider:
            sink_kwargs["credential_provider"] = credential_provider
        self.collect().write_delta(**sink_kwargs)

    def _write_csv_to_cloud(self,
                            resource_path: str,
                            storage_options: Dict[str, Any],
                            credential_provider: Optional[Callable],
                            write_settings: cloud_storage_schemas.CloudStorageWriteSettings):
        """(Internal) Writes the DataFrame to a CSV file in cloud storage.

        Uses `sink_csv` for efficient, streaming writes of the data.
        """
        try:
            sink_kwargs = {
                "path": resource_path,
                "separator": write_settings.csv_delimiter,
            }
            if storage_options:
                sink_kwargs["storage_options"] = storage_options
            if credential_provider:
                sink_kwargs["credential_provider"] = credential_provider

            # sink_csv executes the lazy query and writes the result
            self.data_frame.sink_csv(**sink_kwargs)

        except Exception as e:
            logger.error(f"Failed to write CSV to {resource_path}: {str(e)}")
            raise Exception(f"Failed to write CSV to cloud storage: {str(e)}")

    def _write_json_to_cloud(self,
                             resource_path: str,
                             storage_options: Dict[str, Any],
                             credential_provider: Optional[Callable],
                             write_settings: cloud_storage_schemas.CloudStorageWriteSettings):
        """(Internal) Writes the DataFrame to a line-delimited JSON (NDJSON) file.

        Uses `sink_ndjson` for efficient, streaming writes.
        """
        try:
            sink_kwargs = {"path": resource_path}
            if storage_options:
                sink_kwargs["storage_options"] = storage_options
            if credential_provider:
                sink_kwargs["credential_provider"] = credential_provider
            self.data_frame.sink_ndjson(**sink_kwargs)

        except Exception as e:
            logger.error(f"Failed to write JSON to {resource_path}: {str(e)}")
            raise Exception(f"Failed to write JSON to cloud storage: {str(e)}")

    @classmethod
    def from_cloud_storage_obj(cls, settings: cloud_storage_schemas.CloudStorageReadSettingsInternal) -> "FlowDataEngine":
        """Creates a FlowDataEngine from an object in cloud storage.

        This method supports reading from various cloud storage providers like AWS S3,
        Azure Data Lake Storage, and Google Cloud Storage, with support for
        various authentication methods.

        Args:
            settings: A `CloudStorageReadSettingsInternal` object containing connection
                details, file format, and read options.

        Returns:
            A new `FlowDataEngine` instance containing the data from cloud storage.

        Raises:
            ValueError: If the storage type or file format is not supported.
            NotImplementedError: If a requested file format like "delta" or "iceberg"
                is not yet implemented.
            Exception: If reading from cloud storage fails.
        """
        connection = settings.connection
        read_settings = settings.read_settings

        logger.info(f"Reading from {connection.storage_type} storage: {read_settings.resource_path}")
        # Get storage options based on connection type
        storage_options = CloudStorageReader.get_storage_options(connection)
        # Get credential provider if needed
        credential_provider = CloudStorageReader.get_credential_provider(connection)
        if read_settings.file_format == "parquet":
            return cls._read_parquet_from_cloud(
                read_settings.resource_path,
                storage_options,
                credential_provider,
                read_settings.scan_mode == "directory",
            )
        elif read_settings.file_format == "delta":
            return cls._read_delta_from_cloud(
                read_settings.resource_path,
                storage_options,
                credential_provider,
                read_settings
            )
        elif read_settings.file_format == "csv":
            return cls._read_csv_from_cloud(
                read_settings.resource_path,
                storage_options,
                credential_provider,
                read_settings
            )
        elif read_settings.file_format == "json":
            return cls._read_json_from_cloud(
                read_settings.resource_path,
                storage_options,
                credential_provider,
                read_settings.scan_mode == "directory"
            )
        elif read_settings.file_format == "iceberg":
            return cls._read_iceberg_from_cloud(
                read_settings.resource_path,
                storage_options,
                credential_provider,
                read_settings
            )

        else:
            raise ValueError(f"Unsupported file format: {read_settings.file_format}")

    @staticmethod
    def _get_schema_from_first_file_in_dir(source: str, storage_options: Dict[str, Any],
                                           file_format: Literal["csv", "parquet", "json", "delta"]) -> List[FlowfileColumn] | None:
        """Infers the schema by scanning the first file in a cloud directory."""
        try:
            scan_func = getattr(pl, "scan_" + file_format)
            first_file_ref = get_first_file_from_s3_dir(source, storage_options=storage_options)
            return convert_stats_to_column_info(FlowDataEngine._create_schema_stats_from_pl_schema(
                scan_func(first_file_ref, storage_options=storage_options).collect_schema()))
        except Exception as e:
            logger.warning(f"Could not read schema from first file in directory, using default schema: {e}")


    @classmethod
    def _read_iceberg_from_cloud(cls,
                                 resource_path: str,
                                 storage_options: Dict[str, Any],
                                 credential_provider: Optional[Callable],
                                 read_settings: cloud_storage_schemas.CloudStorageReadSettings) -> "FlowDataEngine":
        """Reads Iceberg table(s) from cloud storage."""
        raise NotImplementedError(f"Failed to read Iceberg table from cloud storage: Not yet implemented")

    @classmethod
    def _read_parquet_from_cloud(cls,
                                 resource_path: str,
                                 storage_options: Dict[str, Any],
                                 credential_provider: Optional[Callable],
                                 is_directory: bool) -> "FlowDataEngine":
        """Reads Parquet file(s) from cloud storage."""
        try:
            # Use scan_parquet for lazy evaluation
            if is_directory:
                resource_path = ensure_path_has_wildcard_pattern(resource_path=resource_path, file_format="parquet")
            scan_kwargs = {"source": resource_path}

            if storage_options:
                scan_kwargs["storage_options"] = storage_options

            if credential_provider:
                scan_kwargs["credential_provider"] = credential_provider
            if storage_options and is_directory:
                schema = cls._get_schema_from_first_file_in_dir(resource_path, storage_options, "parquet")
            else:
                schema = None
            lf = pl.scan_parquet(**scan_kwargs)

            return cls(
                lf,
                number_of_records=6_666_666,  # Set so the provider is not accessed for this stat
                optimize_memory=True,
                streamable=True,
                schema=schema
            )

        except Exception as e:
            logger.error(f"Failed to read Parquet from {resource_path}: {str(e)}")
            raise Exception(f"Failed to read Parquet from cloud storage: {str(e)}")

    @classmethod
    def _read_delta_from_cloud(cls,
                               resource_path: str,
                               storage_options: Dict[str, Any],
                               credential_provider: Optional[Callable],
                               read_settings: cloud_storage_schemas.CloudStorageReadSettings) -> "FlowDataEngine":
        """Reads a Delta Lake table from cloud storage."""
        try:
            logger.info("Reading Delta file from cloud storage...")
            logger.info(f"read_settings: {read_settings}")
            scan_kwargs = {"source": resource_path}
            if read_settings.delta_version:
                scan_kwargs['version'] = read_settings.delta_version
            if storage_options:
                scan_kwargs["storage_options"] = storage_options
            if credential_provider:
                scan_kwargs["credential_provider"] = credential_provider
            lf = pl.scan_delta(**scan_kwargs)

            return cls(
                lf,
                number_of_records=6_666_666,  # Set so the provider is not accessed for this stat
                optimize_memory=True,
                streamable=True
            )
        except Exception as e:
            logger.error(f"Failed to read Delta file from {resource_path}: {str(e)}")
            raise Exception(f"Failed to read Delta file from cloud storage: {str(e)}")

    @classmethod
    def _read_csv_from_cloud(cls,
                             resource_path: str,
                             storage_options: Dict[str, Any],
                             credential_provider: Optional[Callable],
                             read_settings: cloud_storage_schemas.CloudStorageReadSettings) -> "FlowDataEngine":
        """Reads CSV file(s) from cloud storage."""
        try:
            scan_kwargs = {
                "source": resource_path,
                "has_header": read_settings.csv_has_header,
                "separator": read_settings.csv_delimiter,
                "encoding": read_settings.csv_encoding,
            }
            if storage_options:
                scan_kwargs["storage_options"] = storage_options
            if credential_provider:
                scan_kwargs["credential_provider"] = credential_provider

            if read_settings.scan_mode == "directory":
                resource_path = ensure_path_has_wildcard_pattern(resource_path=resource_path, file_format="csv")
                scan_kwargs["source"] = resource_path
            if storage_options and read_settings.scan_mode == "directory":
                schema = cls._get_schema_from_first_file_in_dir(resource_path, storage_options, "csv")
            else:
                schema = None

            lf = pl.scan_csv(**scan_kwargs)

            return cls(
                lf,
                number_of_records=6_666_666,  # Will be calculated lazily
                optimize_memory=True,
                streamable=True,
                schema=schema
            )

        except Exception as e:
            logger.error(f"Failed to read CSV from {resource_path}: {str(e)}")
            raise Exception(f"Failed to read CSV from cloud storage: {str(e)}")

    @classmethod
    def _read_json_from_cloud(cls,
                              resource_path: str,
                              storage_options: Dict[str, Any],
                              credential_provider: Optional[Callable],
                              is_directory: bool) -> "FlowDataEngine":
        """Reads JSON file(s) from cloud storage."""
        try:
            if is_directory:
                resource_path = ensure_path_has_wildcard_pattern(resource_path, "json")
            scan_kwargs = {"source": resource_path}

            if storage_options:
                scan_kwargs["storage_options"] = storage_options
            if credential_provider:
                scan_kwargs["credential_provider"] = credential_provider

            lf = pl.scan_ndjson(**scan_kwargs)  # Using NDJSON for line-delimited JSON

            return cls(
                lf,
                number_of_records=-1,
                optimize_memory=True,
                streamable=True,
            )

        except Exception as e:
            logger.error(f"Failed to read JSON from {resource_path}: {str(e)}")
            raise Exception(f"Failed to read JSON from cloud storage: {str(e)}")

    def _handle_path_ref(self, path_ref: str, optimize_memory: bool):
        """Handles file path reference input."""
        try:
            pf = ParquetFile(path_ref)
        except Exception as e:
            logger.error(e)
            raise Exception("Provided ref is not a parquet file")

        self.number_of_records = pf.metadata.num_rows
        if optimize_memory:
            self._lazy = True
            self.data_frame = pl.scan_parquet(path_ref)
        else:
            self.data_frame = pl.read_parquet(path_ref)

    def _finalize_initialization(self, name: str, optimize_memory: bool, schema: Optional[Any],
                                 calculate_schema_stats: bool):
        """Finalizes initialization by setting remaining attributes."""
        _ = calculate_schema_stats
        self.name = name
        self._optimize_memory = optimize_memory
        if assert_if_flowfile_schema(schema):
            self._schema = schema
            self.columns = [c.column_name for c in self._schema]
        else:
            pl_schema = self.data_frame.collect_schema()
            self._schema = self._handle_schema(schema, pl_schema)
            self.columns = [c.column_name for c in self._schema] if self._schema else pl_schema.names()

    def __getitem__(self, item):
        """Accesses a specific column or item from the DataFrame."""
        return self.data_frame.select([item])

    @property
    def data_frame(self) -> pl.LazyFrame | pl.DataFrame | None:
        """The underlying Polars DataFrame or LazyFrame.

        This property provides access to the Polars object that backs the
        FlowDataEngine. It handles lazy-loading from external sources if necessary.

        Returns:
            The active Polars `DataFrame` or `LazyFrame`.
        """
        if self._data_frame is not None and not self.is_future:
            return self._data_frame
        elif self.is_future:
            return self._data_frame
        elif self._external_source is not None and self.lazy:
            return self._data_frame
        elif self._external_source is not None and not self.lazy:
            if self._external_source.get_pl_df() is None:
                data_frame = list(self._external_source.get_iter())
                if len(data_frame) > 0:
                    self.data_frame = pl.DataFrame(data_frame)
            else:
                self.data_frame = self._external_source.get_pl_df()
            self.calculate_schema()
            return self._data_frame

    @data_frame.setter
    def data_frame(self, df: pl.LazyFrame | pl.DataFrame):
        """Sets the underlying Polars DataFrame or LazyFrame."""
        if self.lazy and isinstance(df, pl.DataFrame):
            raise Exception('Cannot set a non-lazy dataframe to a lazy flowfile')
        self._data_frame = df

    @staticmethod
    def _create_schema_stats_from_pl_schema(pl_schema: pl.Schema) -> List[Dict]:
        """Converts a Polars Schema into a list of schema statistics dictionaries."""
        return [
            dict(column_name=k, pl_datatype=v, col_index=i)
            for i, (k, v) in enumerate(pl_schema.items())
        ]

    def _add_schema_from_schema_stats(self, schema_stats: List[Dict]):
        """Populates the schema from a list of schema statistics dictionaries."""
        self._schema = convert_stats_to_column_info(schema_stats)

    @property
    def schema(self) -> List[FlowfileColumn]:
        """The schema of the DataFrame as a list of `FlowfileColumn` objects.

        This property lazily calculates the schema if it hasn't been determined yet.

        Returns:
            A list of `FlowfileColumn` objects describing the schema.
        """
        if self.number_of_fields == 0:
            return []
        if self._schema is None or (self._calculate_schema_stats and not self.ind_schema_calculated):
            if self._calculate_schema_stats and not self.ind_schema_calculated:
                schema_stats = self._calculate_schema()
                self.ind_schema_calculated = True
            else:
                schema_stats = self._create_schema_stats_from_pl_schema(self.data_frame.collect_schema())
            self._add_schema_from_schema_stats(schema_stats)
        return self._schema

    @property
    def number_of_fields(self) -> int:
        """The number of columns (fields) in the DataFrame.

        Returns:
            The integer count of columns.
        """
        if self.__number_of_fields is None:
            self.__number_of_fields = len(self.columns)
        return self.__number_of_fields

    def collect(self, n_records: int = None) -> pl.DataFrame:
        """Collects the data and returns it as a Polars DataFrame.

        This method triggers the execution of the lazy query plan (if applicable)
        and returns the result. It supports streaming to optimize memory usage
        for large datasets.

        Args:
            n_records: The maximum number of records to collect. If None, all
                records are collected.

        Returns:
            A Polars `DataFrame` containing the collected data.
        """
        if n_records is None:
            logger.info(f'Fetching all data for Table object "{id(self)}". Settings: streaming={self._streamable}')
        else:
            logger.info(f'Fetching {n_records} record(s) for Table object "{id(self)}". '
                        f'Settings: streaming={self._streamable}')

        if not self.lazy:
            return self.data_frame

        try:
            return self._collect_data(n_records)
        except Exception as e:
            self.errors = [e]
            return self._handle_collection_error(n_records)
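    # Usage sketch (illustrative): collecting a preview versus the full result.
    #
    #     engine = FlowDataEngine.create_random(10_000)
    #     preview = engine.collect(n_records=5)   # first 5 rows as a pl.DataFrame
    #     full = engine.collect()                 # entire result, streaming when possible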

    def _collect_data(self, n_records: int = None) -> pl.DataFrame:
        """Internal method to handle data collection logic."""
        if n_records is None:

            self.collect_external()
            if self._streamable:
                try:
                    logger.info('Collecting data in streaming mode')
                    return self.data_frame.collect(engine="streaming")
                except PanicException:
                    self._streamable = False

            logger.info('Collecting data in non-streaming mode')
            return self.data_frame.collect()

        if self.external_source is not None:
            return self._collect_from_external_source(n_records)

        if self._streamable:
            return self.data_frame.head(n_records).collect(engine="streaming")
        return self.data_frame.head(n_records).collect()

    def _collect_from_external_source(self, n_records: int) -> pl.DataFrame:
        """Handles collection from an external source."""
        if self.external_source.get_pl_df() is not None:
            all_data = self.external_source.get_pl_df().head(n_records)
            self.data_frame = all_data
        else:
            all_data = self.external_source.get_sample(n_records)
            self.data_frame = pl.LazyFrame(all_data)
        return self.data_frame

    def _handle_collection_error(self, n_records: int) -> pl.DataFrame:
        """Handles errors during collection by attempting partial collection."""
        n_records = 100_000_000 if n_records is None else n_records
        ok_cols, error_cols = self._identify_valid_columns(n_records)

        if len(ok_cols) > 0:
            return self._create_partial_dataframe(ok_cols, error_cols, n_records)
        return self._create_empty_dataframe(n_records)

    def _identify_valid_columns(self, n_records: int) -> Tuple[List[str], List[Tuple[str, Any]]]:
        """Identifies which columns can be collected successfully."""
        ok_cols = []
        error_cols = []
        for c in self.columns:
            try:
                _ = self.data_frame.select(c).head(n_records).collect()
                ok_cols.append(c)
            except Exception:
                error_cols.append((c, self.data_frame.schema[c]))
        return ok_cols, error_cols

    def _create_partial_dataframe(self, ok_cols: List[str], error_cols: List[Tuple[str, Any]],
                                  n_records: int) -> pl.DataFrame:
        """Creates a DataFrame with partial data for columns that could be collected."""
        df = self.data_frame.select(ok_cols)
        df = df.with_columns([
            pl.lit(None).alias(column_name).cast(data_type)
            for column_name, data_type in error_cols
        ])
        return df.select(self.columns).head(n_records).collect()

    def _create_empty_dataframe(self, n_records: int) -> pl.DataFrame:
        """Creates an empty DataFrame with the correct schema."""
        if self.number_of_records > 0:
            return pl.DataFrame({
                column_name: pl.Series(
                    name=column_name,
                    values=[None] * min(self.number_of_records, n_records)
                ).cast(data_type)
                for column_name, data_type in self.data_frame.schema.items()
            })
        return pl.DataFrame(schema=self.data_frame.schema)

    def do_group_by(self, group_by_input: transform_schemas.GroupByInput,
                    calculate_schema_stats: bool = True) -> "FlowDataEngine":
        """Performs a group-by operation on the DataFrame.

        Args:
            group_by_input: A `GroupByInput` object defining the grouping columns
                and aggregations.
            calculate_schema_stats: If True, calculates schema statistics for the
                resulting DataFrame.

        Returns:
            A new `FlowDataEngine` instance with the grouped and aggregated data.
        """
        aggregations = [c for c in group_by_input.agg_cols if c.agg != 'groupby']
        group_columns = [c for c in group_by_input.agg_cols if c.agg == 'groupby']

        if len(group_columns) == 0:
            return FlowDataEngine(
                self.data_frame.select(
                    ac.agg_func(ac.old_name).alias(ac.new_name) for ac in aggregations
                ),
                calculate_schema_stats=calculate_schema_stats
            )

        df = self.data_frame.rename({c.old_name: c.new_name for c in group_columns})
        group_by_columns = [n_c.new_name for n_c in group_columns]
        return FlowDataEngine(
            df.group_by(*group_by_columns).agg(
                ac.agg_func(ac.old_name).alias(ac.new_name) for ac in aggregations
            ),
            calculate_schema_stats=calculate_schema_stats
        )
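    # For orientation, the plain-Polars shape of the grouped branch above, with
    # made-up column names:
    #
    #     import polars as pl
    #
    #     lf = pl.LazyFrame({"city": ["a", "a", "b"], "sales": [1, 2, 3]})
    #     out = lf.group_by("city").agg(pl.col("sales").sum().alias("total_sales"))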

    def do_sort(self, sorts: List[transform_schemas.SortByInput]) -> "FlowDataEngine":
        """Sorts the DataFrame by one or more columns.

        Args:
            sorts: A list of `SortByInput` objects, each specifying a column
                and sort direction ('asc' or 'desc').

        Returns:
            A new `FlowDataEngine` instance with the sorted data.
        """
        if not sorts:
            return self

        descending = [s.how == 'desc' or s.how.lower() == 'descending' for s in sorts]
        df = self.data_frame.sort([sort_by.column for sort_by in sorts], descending=descending)
        return FlowDataEngine(df, number_of_records=self.number_of_records, schema=self.schema)
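    # Usage sketch; the keyword names below are assumed from the attributes read above
    # (`column` and `how`):
    #
    #     sorted_engine = engine.do_sort([
    #         transform_schemas.SortByInput(column="sales", how="desc"),
    #     ])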

    def change_column_types(self, transforms: List[transform_schemas.SelectInput],
                            calculate_schema: bool = False) -> "FlowDataEngine":
        """Changes the data type of one or more columns.

        Args:
            transforms: A list of `SelectInput` objects, where each object specifies
                the column and its new `polars_type`.
            calculate_schema: If True, recalculates the schema after the type change.

        Returns:
            A new `FlowDataEngine` instance with the updated column types.
        """
        dtypes = [dtype.base_type() for dtype in self.data_frame.collect_schema().dtypes()]
        idx_mapping = list(
            (transform.old_name, self.cols_idx.get(transform.old_name), getattr(pl, transform.polars_type))
            for transform in transforms if transform.data_type is not None
        )

        actual_transforms = [c for c in idx_mapping if c[2] != dtypes[c[1]]]
        transformations = [
            utils.define_pl_col_transformation(col_name=transform[0], col_type=transform[2])
            for transform in actual_transforms
        ]

        df = self.data_frame.with_columns(transformations)
        return FlowDataEngine(
            df,
            number_of_records=self.number_of_records,
            calculate_schema_stats=calculate_schema,
            streamable=self._streamable
        )

    def save(self, path: str, data_type: str = 'parquet') -> Future:
        """Saves the DataFrame to a file in a separate thread.

        Args:
            path: The file path to save to.
            data_type: The format to save in (e.g., 'parquet', 'csv').

        Returns:
            A `loky.Future` object representing the asynchronous save operation.
        """
        estimated_size = deepcopy(self.get_estimated_file_size() * 4)
        df = deepcopy(self.data_frame)
        return write_threaded(_df=df, path=path, data_type=data_type, estimated_size=estimated_size)

    def to_pylist(self) -> List[Dict]:
        """Converts the DataFrame to a list of Python dictionaries.

        Returns:
            A list where each item is a dictionary representing a row.
        """
        if self.lazy:
            return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_dicts()
        return self.data_frame.to_dicts()

    def to_arrow(self) -> PaTable:
        """Converts the DataFrame to a PyArrow Table.

        This method triggers a `.collect()` call if the data is lazy,
        then converts the resulting eager DataFrame into a `pyarrow.Table`.

        Returns:
            A `pyarrow.Table` instance representing the data.
        """
        if self.lazy:
            return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_arrow()
        else:
            return self.data_frame.to_arrow()

    def to_raw_data(self) -> input_schema.RawData:
        """Converts the DataFrame to a `RawData` schema object.

        Returns:
            An `input_schema.RawData` object containing the schema and data.
        """
        columns = [c.get_minimal_field_info() for c in self.schema]
        data = list(self.to_dict().values())
        return input_schema.RawData(columns=columns, data=data)

    def to_dict(self) -> Dict[str, List]:
        """Converts the DataFrame to a Python dictionary of columns.

        Each key in the dictionary is a column name, and the corresponding value
        is a list of the data in that column.

        Returns:
            A dictionary mapping column names to lists of their values.
        """
        if self.lazy:
            return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_dict(as_series=False)
        else:
            return self.data_frame.to_dict(as_series=False)
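    # The conversion helpers side by side (illustrative):
    #
    #     engine.to_pylist()  # -> [{"col": value, ...}, ...]   one dict per row
    #     engine.to_dict()    # -> {"col": [values, ...], ...}  one list per column
    #     engine.to_arrow()   # -> pyarrow.Table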

    @classmethod
    def create_from_external_source(cls, external_source: ExternalDataSource) -> "FlowDataEngine":
        """Creates a FlowDataEngine from an external data source.

        Args:
            external_source: An object that conforms to the `ExternalDataSource`
                interface.

        Returns:
            A new `FlowDataEngine` instance.
        """
        if external_source.schema is not None:
            ff = cls.create_from_schema(external_source.schema)
        elif external_source.initial_data_getter is not None:
            ff = cls(raw_data=external_source.initial_data_getter())
        else:
            ff = cls()
        ff._external_source = external_source
        return ff

    @classmethod
    def create_from_sql(cls, sql: str, conn: Any) -> "FlowDataEngine":
        """Creates a FlowDataEngine by executing a SQL query.

        Args:
            sql: The SQL query string to execute.
            conn: A database connection object or connection URI string.

        Returns:
            A new `FlowDataEngine` instance with the query result.
        """
        return cls(pl.read_sql(sql, conn))

    @classmethod
    def create_from_schema(cls, schema: List[FlowfileColumn]) -> "FlowDataEngine":
        """Creates an empty FlowDataEngine from a schema definition.

        Args:
            schema: A list of `FlowfileColumn` objects defining the schema.

        Returns:
            A new, empty `FlowDataEngine` instance with the specified schema.
        """
        pl_schema = []
        for i, flow_file_column in enumerate(schema):
            pl_schema.append((flow_file_column.name, cast_str_to_polars_type(flow_file_column.data_type)))
            schema[i].col_index = i
        df = pl.LazyFrame(schema=pl_schema)
        return cls(df, schema=schema, calculate_schema_stats=False, number_of_records=0)

    @classmethod
    def create_from_path(cls, received_table: input_schema.ReceivedTableBase) -> "FlowDataEngine":
        """Creates a FlowDataEngine from a local file path.

        Supports various file types like CSV, Parquet, and Excel.

        Args:
            received_table: A `ReceivedTableBase` object containing the file path
                and format details.

        Returns:
            A new `FlowDataEngine` instance with data from the file.
        """
        received_table.set_absolute_filepath()
        file_type_handlers = {
            'csv': create_funcs.create_from_path_csv,
            'parquet': create_funcs.create_from_path_parquet,
            'excel': create_funcs.create_from_path_excel
        }

        handler = file_type_handlers.get(received_table.file_type)
        if not handler:
            raise Exception(f'Cannot create from {received_table.file_type}')

        flow_file = cls(handler(received_table))
        flow_file._org_path = received_table.abs_file_path
        return flow_file

    @classmethod
    def create_random(cls, number_of_records: int = 1000) -> "FlowDataEngine":
        """Creates a FlowDataEngine with randomly generated data.

        Useful for testing and examples.

        Args:
            number_of_records: The number of random records to generate.

        Returns:
            A new `FlowDataEngine` instance with fake data.
        """
        return cls(create_fake_data(number_of_records))

    @classmethod
    def generate_enumerator(cls, length: int = 1000, output_name: str = 'output_column') -> "FlowDataEngine":
        """Generates a FlowDataEngine with a single column containing a sequence of integers.

        Args:
            length: The number of integers to generate (capped at 10,000,000).
            output_name: The name of the output column.

        Returns:
            A new `FlowDataEngine` instance.
        """
        if length > 10_000_000:
            length = 10_000_000
        return cls(pl.LazyFrame().select((pl.int_range(0, length, dtype=pl.UInt32)).alias(output_name)))
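    # Usage sketch:
    #
    #     ids = FlowDataEngine.generate_enumerator(length=5, output_name="row_nr")
    #     # a single UInt32 column "row_nr" holding the values 0..4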

    def _handle_schema(self, schema: List[FlowfileColumn] | List[str] | pl.Schema | None,
                       pl_schema: pl.Schema) -> List[FlowfileColumn] | None:
        """Handles schema processing and validation during initialization."""
        if schema is None and pl_schema is not None:
            return convert_stats_to_column_info(self._create_schema_stats_from_pl_schema(pl_schema))
        elif schema is None and pl_schema is None:
            return None
        elif assert_if_flowfile_schema(schema) and pl_schema is None:
            return schema
        elif pl_schema is not None and schema is not None:
            if len(schema) != len(pl_schema):
                raise Exception(
                    f'Schema does not match the data: got {len(schema)} columns, expected {len(pl_schema)}')
            if isinstance(schema, pl.Schema):
                return self._handle_polars_schema(schema, pl_schema)
            elif isinstance(schema, list) and len(schema) == 0:
                return []
            elif isinstance(schema[0], str):
                return self._handle_string_schema(schema, pl_schema)
            return schema

    def _handle_polars_schema(self, schema: pl.Schema, pl_schema: pl.Schema) -> List[FlowfileColumn]:
        """Handles Polars schema conversion."""
        flow_file_columns = [
            FlowfileColumn.create_from_polars_dtype(column_name=col_name, data_type=dtype)
            for col_name, dtype in zip(schema.names(), schema.dtypes())
        ]

        select_arg = [
            pl.col(o).alias(n).cast(schema_dtype)
            for o, n, schema_dtype in zip(pl_schema.names(), schema.names(), schema.dtypes())
        ]

        self.data_frame = self.data_frame.select(select_arg)
        return flow_file_columns

    def _handle_string_schema(self, schema: List[str], pl_schema: pl.Schema) -> List[FlowfileColumn]:
        """Handles string-based schema conversion."""
        flow_file_columns = [
            FlowfileColumn.create_from_polars_dtype(column_name=col_name, data_type=dtype)
            for col_name, dtype in zip(schema, pl_schema.dtypes())
        ]

        self.data_frame = self.data_frame.rename({
            o: n for o, n in zip(pl_schema.names(), schema)
        })

        return flow_file_columns

    def split(self, split_input: transform_schemas.TextToRowsInput) -> "FlowDataEngine":
        """Splits a column's text values into multiple rows based on a delimiter.

        This operation is often referred to as "exploding" the DataFrame, as it
        increases the number of rows.

        Args:
            split_input: A `TextToRowsInput` object specifying the column to split,
                the delimiter, and the output column name.

        Returns:
            A new `FlowDataEngine` instance with the exploded rows.
        """
        output_column_name = (
            split_input.output_column_name
            if split_input.output_column_name
            else split_input.column_to_split
        )

        split_value = (
            split_input.split_fixed_value
            if split_input.split_by_fixed_value
            else pl.col(split_input.split_by_column)
        )

        df = (
            self.data_frame.with_columns(
                pl.col(split_input.column_to_split)
                .str.split(by=split_value)
                .alias(output_column_name)
            )
            .explode(output_column_name)
        )

        return FlowDataEngine(df)
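    # The underlying split-and-explode pattern in plain Polars, with placeholder data:
    #
    #     import polars as pl
    #
    #     lf = pl.LazyFrame({"tags": ["a,b", "c"]})
    #     exploded = (
    #         lf.with_columns(pl.col("tags").str.split(",").alias("tag"))
    #           .explode("tag")
    #     )
    #     # rows: ("a,b", "a"), ("a,b", "b"), ("c", "c")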

    def unpivot(self, unpivot_input: transform_schemas.UnpivotInput) -> "FlowDataEngine":
        """Converts the DataFrame from a wide to a long format.

        This is the inverse of a pivot operation, taking columns and transforming
        them into `variable` and `value` rows.

        Args:
            unpivot_input: An `UnpivotInput` object specifying which columns to
                unpivot and which to keep as index columns.

        Returns:
            A new, unpivoted `FlowDataEngine` instance.
        """
        lf = self.data_frame

        if unpivot_input.data_type_selector_expr is not None:
            result = lf.unpivot(
                on=unpivot_input.data_type_selector_expr(),
                index=unpivot_input.index_columns
            )
        elif unpivot_input.value_columns is not None:
            result = lf.unpivot(
                on=unpivot_input.value_columns,
                index=unpivot_input.index_columns
            )
        else:
            result = lf.unpivot()

        return FlowDataEngine(result)
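    # Plain-Polars illustration of the wide-to-long reshape, with invented columns:
    #
    #     import polars as pl
    #
    #     lf = pl.LazyFrame({"id": [1, 2], "q1": [10, 20], "q2": [30, 40]})
    #     long = lf.unpivot(on=["q1", "q2"], index="id")
    #     # result columns: id, variable, value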

    def do_pivot(self, pivot_input: transform_schemas.PivotInput, node_logger: NodeLogger = None) -> "FlowDataEngine":
        """Converts the DataFrame from a long to a wide format, aggregating values.

        Args:
            pivot_input: A `PivotInput` object defining the index, pivot, and value
                columns, along with the aggregation logic.
            node_logger: An optional logger for reporting warnings, e.g., if the
                pivot column has too many unique values.

        Returns:
            A new, pivoted `FlowDataEngine` instance.
        """
        # Get unique values for pivot columns
        max_unique_vals = 200
        new_cols_unique = fetch_unique_values(self.data_frame.select(pivot_input.pivot_column)
                                              .unique()
                                              .sort(pivot_input.pivot_column)
                                              .limit(max_unique_vals).cast(pl.String))
        if len(new_cols_unique) >= max_unique_vals:
            if node_logger:
                node_logger.warning('Pivot column has too many unique values. Please consider using a different column.'
                                    f' Max unique values: {max_unique_vals}')

        if len(pivot_input.index_columns) == 0:
            no_index_cols = True
            pivot_input.index_columns = ['__temp__']
            ff = self.apply_flowfile_formula('1', col_name='__temp__')
        else:
            no_index_cols = False
            ff = self

        # Perform pivot operations
        index_columns = pivot_input.get_index_columns()
        grouped_ff = ff.do_group_by(pivot_input.get_group_by_input(), False)
        pivot_column = pivot_input.get_pivot_column()

        input_df = grouped_ff.data_frame.with_columns(
            pivot_column.cast(pl.String).alias(pivot_input.pivot_column)
        )
        number_of_aggregations = len(pivot_input.aggregations)
        df = (
            input_df.select(
                *index_columns,
                pivot_column,
                pivot_input.get_values_expr()
            )
            .group_by(*index_columns)
            .agg([
                (pl.col('vals').filter(pivot_column == new_col_value))
                .first()
                .alias(new_col_value)
                for new_col_value in new_cols_unique
            ])
            .select(
                *index_columns,
                *[
                    pl.col(new_col).struct.field(agg).alias(f'{new_col + "_" + agg if number_of_aggregations > 1 else new_col}')
                    for new_col in new_cols_unique
                    for agg in pivot_input.aggregations
                ]
            )
        )

        # Clean up temporary columns if needed
        if no_index_cols:
            df = df.drop('__temp__')
            pivot_input.index_columns = []

        return FlowDataEngine(df, calculate_schema_stats=False)

    def do_filter(self, predicate: str) -> "FlowDataEngine":
        """Filters rows based on a predicate expression.

        Args:
            predicate: A string containing a Polars expression that evaluates to
                a boolean value.

        Returns:
            A new `FlowDataEngine` instance containing only the rows that match
            the predicate.
        """
        try:
            f = to_expr(predicate)
        except Exception as e:
            logger.warning(f'Error in filter expression: {e}')
            f = to_expr("False")
        df = self.data_frame.filter(f)
        return FlowDataEngine(df, schema=self.schema, streamable=self._streamable)
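    # Usage sketch; the exact predicate grammar is whatever `to_expr` accepts, so the
    # expression string below is only an assumption:
    #
    #     kept = engine.do_filter('[sales] > 100')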

    def add_record_id(self, record_id_settings: transform_schemas.RecordIdInput) -> "FlowDataEngine":
        """Adds a record ID (row number) column to the DataFrame.

        Can generate a simple sequential ID or a grouped ID that resets for
        each group.

        Args:
            record_id_settings: A `RecordIdInput` object specifying the output
                column name, offset, and optional grouping columns.

        Returns:
            A new `FlowDataEngine` instance with the added record ID column.
        """
        if record_id_settings.group_by and len(record_id_settings.group_by_columns) > 0:
            return self._add_grouped_record_id(record_id_settings)
        return self._add_simple_record_id(record_id_settings)

    def _add_grouped_record_id(self, record_id_settings: transform_schemas.RecordIdInput) -> "FlowDataEngine":
        """Adds a record ID column with grouping."""
        select_cols = [pl.col(record_id_settings.output_column_name)] + [pl.col(c) for c in self.columns]

        df = (
            self.data_frame
            .with_columns(pl.lit(1).alias(record_id_settings.output_column_name))
            .with_columns(
                (pl.cum_count(record_id_settings.output_column_name)
                 .over(record_id_settings.group_by_columns) + record_id_settings.offset - 1)
                .alias(record_id_settings.output_column_name)
            )
            .select(select_cols)
        )

        output_schema = [FlowfileColumn.from_input(record_id_settings.output_column_name, 'UInt64')]
        output_schema.extend(self.schema)

        return FlowDataEngine(df, schema=output_schema)

    def _add_simple_record_id(self, record_id_settings: transform_schemas.RecordIdInput) -> "FlowDataEngine":
        """Adds a simple sequential record ID column."""
        df = self.data_frame.with_row_index(
            record_id_settings.output_column_name,
            record_id_settings.offset
        )

        output_schema = [FlowfileColumn.from_input(record_id_settings.output_column_name, 'UInt64')]
        output_schema.extend(self.schema)

        return FlowDataEngine(df, schema=output_schema)
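    # The two record-id shapes above, expressed directly in Polars on placeholder data:
    #
    #     import polars as pl
    #
    #     lf = pl.LazyFrame({"grp": ["a", "a", "b"]})
    #     simple = lf.with_row_index("record_id", offset=1)
    #     grouped = lf.with_columns(pl.lit(1).alias("record_id")).with_columns(
    #         pl.cum_count("record_id").over("grp").alias("record_id")
    #     )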

    def get_schema_column(self, col_name: str) -> FlowfileColumn:
        """Retrieves the schema information for a single column by its name.

        Args:
            col_name: The name of the column to retrieve.

        Returns:
            A `FlowfileColumn` object for the specified column, or `None` if not found.
        """
        for s in self.schema:
            if s.name == col_name:
                return s

    def get_estimated_file_size(self) -> int:
        """Estimates the file size in bytes if the data originated from a local file.

        This relies on the original path being tracked during file ingestion.

        Returns:
            The file size in bytes, or 0 if the original path is unknown.
        """
        if self._org_path is not None:
            return os.path.getsize(self._org_path)
        return 0

    def __repr__(self) -> str:
        """Returns a string representation of the FlowDataEngine."""
        return f'flow data engine\n{self.data_frame.__repr__()}'

    def __call__(self) -> "FlowDataEngine":
        """Makes the class instance callable, returning itself."""
        return self

    def __len__(self) -> int:
        """Returns the number of records in the table."""
        return self.number_of_records if self.number_of_records >= 0 else self.get_number_of_records()

    def cache(self) -> "FlowDataEngine":
        """Caches the current DataFrame to disk and updates the internal reference.

        This triggers a background process to write the current LazyFrame's result
        to a temporary file. Subsequent operations on this `FlowDataEngine` instance
        will read from the cached file, which can speed up downstream computations.

        Returns:
            The same `FlowDataEngine` instance, now backed by the cached data.
        """
        edf = ExternalDfFetcher(lf=self.data_frame, file_ref=str(id(self)), wait_on_completion=False,
                                flow_id=-1,
                                node_id=-1)
        logger.info('Caching data in background')
        result = edf.get_result()
        if isinstance(result, pl.LazyFrame):
            logger.info('Data cached')
            del self._data_frame
            self.data_frame = result
            logger.info('Data loaded from cache')
        return self

    def collect_external(self):
        """Materializes data from a tracked external source.

        If the `FlowDataEngine` was created from an `ExternalDataSource`, this
        method will trigger the data retrieval, update the internal `_data_frame`
        to a `LazyFrame` of the collected data, and reset the schema to be
        re-evaluated.
        """
        if self._external_source is not None:
            logger.info('Collecting external source')
            if self.external_source.get_pl_df() is not None:
                self.data_frame = self.external_source.get_pl_df().lazy()
            else:
                self.data_frame = pl.LazyFrame(list(self.external_source.get_iter()))
            self._schema = None  # enforce reset schema

    def get_output_sample(self, n_rows: int = 10) -> List[Dict]:
        """Gets a sample of the data as a list of dictionaries.

        This is typically used to display a preview of the data in a UI.

        Args:
            n_rows: The number of rows to sample.

        Returns:
            A list of dictionaries, where each dictionary represents a row.
        """
        if self.number_of_records > n_rows or self.number_of_records < 0:
            df = self.collect(n_rows)
        else:
            df = self.collect()
        return df.to_dicts()

    def __get_sample__(self, n_rows: int = 100, streamable: bool = True) -> "FlowDataEngine":
        """Internal method to get a sample of the data."""
        if not self.lazy:
            df = self.data_frame.lazy()
        else:
            df = self.data_frame

        if streamable:
            try:
                df = df.head(n_rows).collect()
            except Exception as e:
                logger.warning(f'Error in getting sample: {e}')
                df = df.head(n_rows).collect(engine="auto")
        else:
            df = self.collect()
        return FlowDataEngine(df, number_of_records=len(df), schema=self.schema)

    def get_sample(self, n_rows: int = 100, random: bool = False, shuffle: bool = False,
                   seed: int = None) -> "FlowDataEngine":
        """Gets a sample of rows from the DataFrame.

        Args:
            n_rows: The number of rows to sample.
            random: If True, performs random sampling. If False, takes the first n_rows.
            shuffle: If True (and `random` is True), shuffles the data before sampling.
            seed: A random seed for reproducibility.

        Returns:
            A new `FlowDataEngine` instance containing the sampled data.
        """
        n_records = min(n_rows, self.get_number_of_records(calculate_in_worker_process=OFFLOAD_TO_WORKER))
        logger.info(f'Getting sample of {n_rows} rows')

        if random:
            if self.lazy and self.external_source is not None:
                self.collect_external()

            if self.lazy and shuffle:
                sample_df = (
                    self.data_frame
                    .collect(engine="streaming" if self._streamable else "auto")
                    .sample(n_rows, seed=seed, shuffle=shuffle)
                )
            elif shuffle:
                sample_df = self.data_frame.sample(n_rows, seed=seed, shuffle=shuffle)
            else:
                every_n_records = ceil(self.number_of_records / n_rows)
                sample_df = self.data_frame.gather_every(every_n_records)
        else:
            if self.external_source:
                self.collect(n_rows)
            sample_df = self.data_frame.head(n_rows)

        return FlowDataEngine(sample_df, schema=self.schema, number_of_records=n_records)
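    # Usage sketch:
    #
    #     head_sample = engine.get_sample(50)                                         # first 50 rows
    #     random_sample = engine.get_sample(50, random=True, shuffle=True, seed=42)   # reproducible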

    def get_subset(self, n_rows: int = 100) -> "FlowDataEngine":
        """Gets the first `n_rows` from the DataFrame.

        Args:
            n_rows: The number of rows to include in the subset.

        Returns:
            A new `FlowDataEngine` instance containing the subset of data.
        """
        return FlowDataEngine(self.data_frame.head(n_rows), calculate_schema_stats=True)

    def iter_batches(self, batch_size: int = 1000,
                     columns: Union[List, Tuple, str] = None) -> Generator["FlowDataEngine", None, None]:
        """Iterates over the DataFrame in batches.

        Args:
            batch_size: The size of each batch.
            columns: A list of column names to include in the batches. If None,
                all columns are included.

        Yields:
            A `FlowDataEngine` instance for each batch.
        """
        if columns:
            self.data_frame = self.data_frame.select(columns)
        self.lazy = False
        batches = self.data_frame.iter_slices(batch_size)
        for batch in batches:
            yield FlowDataEngine(batch)

    def start_fuzzy_join(self, fuzzy_match_input: transform_schemas.FuzzyMatchInput,
                         other: "FlowDataEngine", file_ref: str, flow_id: int = -1,
                         node_id: int | str = -1) -> ExternalFuzzyMatchFetcher:
        """Starts a fuzzy join operation in a background process.

        This method prepares the data and initiates the fuzzy matching in a
        separate process, returning a tracker object immediately.

        Args:
            fuzzy_match_input: A `FuzzyMatchInput` object with the matching parameters.
            other: The right `FlowDataEngine` to join with.
            file_ref: A reference string for temporary files.
            flow_id: The flow ID for tracking.
            node_id: The node ID for tracking.

        Returns:
            An `ExternalFuzzyMatchFetcher` object that can be used to track the
            progress and retrieve the result of the fuzzy join.
        """
        left_df, right_df = prepare_for_fuzzy_match(left=self, right=other,
                                                    fuzzy_match_input=fuzzy_match_input)
        return ExternalFuzzyMatchFetcher(left_df, right_df,
                                         fuzzy_maps=fuzzy_match_input.fuzzy_maps,
                                         file_ref=file_ref + '_fm',
                                         wait_on_completion=False,
                                         flow_id=flow_id,
                                         node_id=node_id)

    def do_fuzzy_join(self, fuzzy_match_input: transform_schemas.FuzzyMatchInput,
                      other: "FlowDataEngine", file_ref: str, flow_id: int = -1,
                      node_id: int | str = -1) -> "FlowDataEngine":
        """Performs a fuzzy join with another DataFrame.

        This method blocks until the fuzzy join operation is complete.

        Args:
            fuzzy_match_input: A `FuzzyMatchInput` object with the matching parameters.
            other: The right `FlowDataEngine` to join with.
            file_ref: A reference string for temporary files.
            flow_id: The flow ID for tracking.
            node_id: The node ID for tracking.

        Returns:
            A new `FlowDataEngine` instance with the result of the fuzzy join.
        """
        left_df, right_df = prepare_for_fuzzy_match(left=self, right=other,
                                                    fuzzy_match_input=fuzzy_match_input)
        f = ExternalFuzzyMatchFetcher(left_df, right_df,
                                      fuzzy_maps=fuzzy_match_input.fuzzy_maps,
                                      file_ref=file_ref + '_fm',
                                      wait_on_completion=True,
                                      flow_id=flow_id,
                                      node_id=node_id)
        return FlowDataEngine(f.get_result())

    def fuzzy_match(self, right: "FlowDataEngine", left_on: str, right_on: str,
                    fuzzy_method: str = 'levenshtein', threshold: float = 0.75) -> "FlowDataEngine":
        """Performs a simple fuzzy match between two DataFrames on a single column pair.

        This is a convenience method for a common fuzzy join scenario.

        Args:
            right: The right `FlowDataEngine` to match against.
            left_on: The column name from the left DataFrame to match on.
            right_on: The column name from the right DataFrame to match on.
            fuzzy_method: The fuzzy matching algorithm to use (e.g., 'levenshtein').
            threshold: The similarity score threshold (0.0 to 1.0) for a match.

        Returns:
            A new `FlowDataEngine` with the matched data.
        """
        fuzzy_match_input = transform_schemas.FuzzyMatchInput(
            [transform_schemas.FuzzyMap(
                left_on, right_on,
                fuzzy_type=fuzzy_method,
                threshold_score=threshold
            )],
            left_select=self.columns,
            right_select=right.columns
        )
        return self.do_fuzzy_join(fuzzy_match_input, right, str(id(self)))
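    # Usage sketch with invented engine and column names:
    #
    #     matched = customers.fuzzy_match(
    #         right=crm_records,
    #         left_on="name", right_on="customer_name",
    #         fuzzy_method="levenshtein", threshold=0.8,
    #     )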

    def do_cross_join(self, cross_join_input: transform_schemas.CrossJoinInput,
                      auto_generate_selection: bool, verify_integrity: bool,
                      other: "FlowDataEngine") -> "FlowDataEngine":
        """Performs a cross join with another DataFrame.

        A cross join produces the Cartesian product of the two DataFrames.

        Args:
            cross_join_input: A `CrossJoinInput` object specifying column selections.
            auto_generate_selection: If True, automatically renames columns to avoid conflicts.
            verify_integrity: If True, checks if the resulting join would be too large.
            other: The right `FlowDataEngine` to join with.

        Returns:
            A new `FlowDataEngine` with the result of the cross join.

        Raises:
            Exception: If `verify_integrity` is True and the join would result in
                an excessively large number of records.
        """
        self.lazy = True
        other.lazy = True

        verify_join_select_integrity(cross_join_input, left_columns=self.columns, right_columns=other.columns)

        right_select = [v.old_name for v in cross_join_input.right_select.renames
                        if (v.keep or v.join_key) and v.is_available]
        left_select = [v.old_name for v in cross_join_input.left_select.renames
                       if (v.keep or v.join_key) and v.is_available]

        left = self.data_frame.select(left_select).rename(cross_join_input.left_select.rename_table)
        right = other.data_frame.select(right_select).rename(cross_join_input.right_select.rename_table)

        if verify_integrity:
            n_records = self.get_number_of_records() * other.get_number_of_records()
            if n_records > 1_000_000_000:
                raise Exception("Join will result in too many records, ending process")
        else:
            n_records = -1

        joined_df = left.join(right, how='cross')

        cols_to_delete_after = [col.new_name for col in
                                cross_join_input.left_select.renames + cross_join_input.right_select.renames
                                if col.join_key and not col.keep and col.is_available]

        if verify_integrity:
            return FlowDataEngine(joined_df.drop(cols_to_delete_after), calculate_schema_stats=False,
                                 number_of_records=n_records, streamable=False)
        else:
            fl = FlowDataEngine(joined_df.drop(cols_to_delete_after), calculate_schema_stats=False,
                               number_of_records=0, streamable=False)
            return fl

    def join(self, join_input: transform_schemas.JoinInput, auto_generate_selection: bool,
             verify_integrity: bool, other: "FlowDataEngine") -> "FlowDataEngine":
        """Performs a standard SQL-style join with another DataFrame.

        Supports various join types like 'inner', 'left', 'right', 'outer', 'semi', and 'anti'.

        Args:
            join_input: A `JoinInput` object defining the join keys, join type,
                and column selections.
            auto_generate_selection: If True, automatically handles column renaming.
            verify_integrity: If True, performs checks to prevent excessively large joins.
            other: The right `FlowDataEngine` to join with.

        Returns:
            A new `FlowDataEngine` with the joined data.

        Raises:
            Exception: If the join configuration is invalid or if `verify_integrity`
                is True and the join is predicted to be too large.
        """
        ensure_right_unselect_for_semi_and_anti_joins(join_input)
        verify_join_select_integrity(join_input, left_columns=self.columns, right_columns=other.columns)
        if not verify_join_map_integrity(join_input, left_columns=self.schema, right_columns=other.schema):
            raise Exception('Join is not valid by the data fields')
        if auto_generate_selection:
            join_input.auto_rename()
        left = self.data_frame.select(get_select_columns(join_input.left_select.renames)).rename(join_input.left_select.rename_table)
        right = other.data_frame.select(get_select_columns(join_input.right_select.renames)).rename(join_input.right_select.rename_table)
        if verify_integrity and join_input.how != 'right':
            n_records = get_join_count(left, right, left_on_keys=join_input.left_join_keys,
                                       right_on_keys=join_input.right_join_keys, how=join_input.how)
            if n_records > 1_000_000_000:
                raise Exception("Join will result in too many records, ending process")
        else:
            n_records = -1
        left, right, reverse_join_key_mapping = _handle_duplication_join_keys(left, right, join_input)
        left, right = rename_df_table_for_join(left, right, join_input.get_join_key_renames())
        if join_input.how == 'right':
            joined_df = right.join(
                other=left,
                left_on=join_input.right_join_keys,
                right_on=join_input.left_join_keys,
                how="left",
                suffix="").rename(reverse_join_key_mapping)
        else:
            joined_df = left.join(
                other=right,
                left_on=join_input.left_join_keys,
                right_on=join_input.right_join_keys,
                how=join_input.how,
                suffix="").rename(reverse_join_key_mapping)
        left_cols_to_delete_after = [get_col_name_to_delete(col, 'left') for col in join_input.left_select.renames
                                     if not col.keep
                                     and col.is_available and col.join_key
                                     ]
        right_cols_to_delete_after = [get_col_name_to_delete(col, 'right') for col in join_input.right_select.renames
                                      if not col.keep
                                      and col.is_available and col.join_key
                                      and join_input.how in ("left", "right", "inner", "cross", "outer")
                                      ]
        if len(right_cols_to_delete_after + left_cols_to_delete_after) > 0:
            joined_df = joined_df.drop(left_cols_to_delete_after + right_cols_to_delete_after)
        undo_join_key_remapping = get_undo_rename_mapping_join(join_input)
        joined_df = joined_df.rename(undo_join_key_remapping)

        if verify_integrity:
            return FlowDataEngine(joined_df, calculate_schema_stats=True,
                                  number_of_records=n_records, streamable=False)
        else:
            fl = FlowDataEngine(joined_df, calculate_schema_stats=False,
                                number_of_records=0, streamable=False)
            return fl

    def solve_graph(self, graph_solver_input: transform_schemas.GraphSolverInput) -> "FlowDataEngine":
        """Solves a graph problem represented by 'from' and 'to' columns.

        This is used for operations like finding connected components in a graph.

        Args:
            graph_solver_input: A `GraphSolverInput` object defining the source,
                destination, and output column names.

        Returns:
            A new `FlowDataEngine` instance with the solved graph data.
        """
        lf = self.data_frame.with_columns(
            graph_solver(graph_solver_input.col_from, graph_solver_input.col_to)
            .alias(graph_solver_input.output_column_name)
        )
        return FlowDataEngine(lf)

    def add_new_values(self, values: Iterable, col_name: str = None) -> "FlowDataEngine":
        """Adds a new column with the provided values.

        Args:
            values: An iterable (e.g., list, tuple) of values to add as a new column.
            col_name: The name for the new column. Defaults to 'new_values'.

        Returns:
            A new `FlowDataEngine` instance with the added column.
        """
        if col_name is None:
            col_name = 'new_values'
        return FlowDataEngine(self.data_frame.with_columns(pl.Series(values).alias(col_name)))

    def get_record_count(self) -> "FlowDataEngine":
        """Returns a new FlowDataEngine with a single column 'number_of_records'
        containing the total number of records.

        Returns:
            A new `FlowDataEngine` instance.
        """
        return FlowDataEngine(self.data_frame.select(pl.len().alias('number_of_records')))

    def assert_equal(self, other: "FlowDataEngine", ordered: bool = True, strict_schema: bool = False):
        """Asserts that this DataFrame is equal to another.

        Useful for testing.

        Args:
            other: The other `FlowDataEngine` to compare with.
            ordered: If True, the row order must be identical.
            strict_schema: If True, the data types of the schemas must be identical.

        Raises:
            Exception: If the DataFrames are not equal based on the specified criteria.
        """
        org_laziness = self.lazy, other.lazy
        self.lazy = False
        other.lazy = False
        self.number_of_records = -1
        other.number_of_records = -1
        other = other.select_columns(self.columns)

        if self.get_number_of_records() != other.get_number_of_records():
            raise Exception('Number of records is not equal')

        if self.columns != other.columns:
            raise Exception('Schema is not equal')

        if strict_schema:
            assert self.data_frame.schema == other.data_frame.schema, 'Data types do not match'

        if ordered:
            self_lf = self.data_frame.sort(by=self.columns)
            other_lf = other.data_frame.sort(by=other.columns)
        else:
            self_lf = self.data_frame
            other_lf = other.data_frame

        self.lazy, other.lazy = org_laziness
        assert self_lf.equals(other_lf), 'Data is not equal'

    def initialize_empty_fl(self):
        """Initializes an empty LazyFrame."""
        self.data_frame = pl.LazyFrame()
        self.number_of_records = 0
        self._lazy = True

    def _calculate_number_of_records_in_worker(self) -> int:
        """Calculates the number of records in a worker process."""
        number_of_records = ExternalDfFetcher(
            lf=self.data_frame,
            operation_type="calculate_number_of_records",
            flow_id=-1,
            node_id=-1,
            wait_on_completion=True
        ).result
        return number_of_records

    def get_number_of_records(self, warn: bool = False, force_calculate: bool = False,
                              calculate_in_worker_process: bool = False) -> int:
        """Gets the total number of records in the DataFrame.

        For lazy frames, this may trigger a full data scan, which can be expensive.

        Args:
            warn: If True, logs a warning if a potentially expensive calculation is triggered.
            force_calculate: If True, forces recalculation even if a value is cached.
            calculate_in_worker_process: If True, offloads the calculation to a worker process.

        Returns:
            The total number of records.

        Raises:
            ValueError: If the number of records could not be determined.
        """
        if self.is_future and not self.is_collected:
            return -1
        calculate_in_worker_process = False if not OFFLOAD_TO_WORKER else calculate_in_worker_process
        if self.number_of_records is None or self.number_of_records < 0 or force_calculate:
            if self._number_of_records_callback is not None:
                self._number_of_records_callback(self)

            if self.lazy:
                if calculate_in_worker_process:
                    try:
                        self.number_of_records = self._calculate_number_of_records_in_worker()
                        return self.number_of_records
                    except Exception as e:
                        logger.error(f"Error: {e}")
                if warn:
                    logger.warning('Calculating the number of records; this can be expensive on a lazy frame')
                try:
                    self.number_of_records = self.data_frame.select(pl.len()).collect(
                        engine="streaming" if self._streamable else "auto")[0, 0]
                except Exception:
                    raise ValueError('Could not get number of records')
            else:
                self.number_of_records = self.data_frame.__len__()
        return self.number_of_records

    @property
    def has_errors(self) -> bool:
        """Checks if there are any errors."""
        return len(self.errors) > 0

    @property
    def lazy(self) -> bool:
        """Indicates if the DataFrame is in lazy mode."""
        return self._lazy

    @lazy.setter
    def lazy(self, exec_lazy: bool = False):
        """Sets the laziness of the DataFrame.

        Args:
            exec_lazy: If True, converts the DataFrame to a LazyFrame. If False,
                collects the data and converts it to an eager DataFrame.
        """
        if exec_lazy != self._lazy:
            if exec_lazy:
                self.data_frame = self.data_frame.lazy()
            else:
                self._lazy = exec_lazy
                if self.external_source is not None:
                    df = self.collect()
                    self.data_frame = df
                else:
                    self.data_frame = self.data_frame.collect(engine="streaming" if self._streamable else "auto")
            self._lazy = exec_lazy

    @property
    def external_source(self) -> ExternalDataSource:
        """The external data source, if any."""
        return self._external_source

    @property
    def cols_idx(self) -> Dict[str, int]:
        """A dictionary mapping column names to their integer index."""
        if self._col_idx is None:
            self._col_idx = {c: i for i, c in enumerate(self.columns)}
        return self._col_idx

    @property
    def __name__(self) -> str:
        """The name of the table."""
        return self.name

    def get_select_inputs(self) -> transform_schemas.SelectInputs:
        """Gets `SelectInput` specifications for all columns in the current schema.

        Returns:
            A `SelectInputs` object that can be used to configure selection or
            transformation operations.
        """
        return transform_schemas.SelectInputs(
            [transform_schemas.SelectInput(old_name=c.name, data_type=c.data_type) for c in self.schema]
        )

    def select_columns(self, list_select: Union[List[str], Tuple[str], str]) -> "FlowDataEngine":
        """Selects a subset of columns from the DataFrame.

        Args:
            list_select: A list, tuple, or single string of column names to select.

        Returns:
            A new `FlowDataEngine` instance containing only the selected columns.
        """
        if isinstance(list_select, str):
            list_select = [list_select]

        idx_to_keep = [self.cols_idx.get(c) for c in list_select]
        selects = [ls for ls, id_to_keep in zip(list_select, idx_to_keep) if id_to_keep is not None]
        new_schema = [self.schema[i] for i in idx_to_keep if i is not None]

        return FlowDataEngine(
            self.data_frame.select(selects),
            number_of_records=self.number_of_records,
            schema=new_schema,
            streamable=self._streamable
        )

    def drop_columns(self, columns: List[str]) -> "FlowDataEngine":
        """Drops specified columns from the DataFrame.

        Args:
            columns: A list of column names to drop.

        Returns:
            A new `FlowDataEngine` instance without the dropped columns.
        """
        cols_for_select = tuple(set(self.columns) - set(columns))
        idx_to_keep = [self.cols_idx.get(c) for c in cols_for_select]
        new_schema = [self.schema[i] for i in idx_to_keep]

        return FlowDataEngine(
            self.data_frame.select(cols_for_select),
            number_of_records=self.number_of_records,
            schema=new_schema
        )

    def reorganize_order(self, column_order: List[str]) -> "FlowDataEngine":
        """Reorganizes columns into a specified order.

        Args:
            column_order: A list of column names in the desired order.

        Returns:
            A new `FlowDataEngine` instance with the columns reordered.
        """
        df = self.data_frame.select(column_order)
        schema = sorted(self.schema, key=lambda x: column_order.index(x.column_name))
        return FlowDataEngine(df, schema=schema, number_of_records=self.number_of_records)

    def apply_flowfile_formula(self, func: str, col_name: str,
                               output_data_type: pl.DataType = None) -> "FlowDataEngine":
        """Applies a formula to create a new column or transform an existing one.

        Args:
            func: A string containing a Polars expression formula.
            col_name: The name of the new or transformed column.
            output_data_type: The desired Polars data type for the output column.

        Returns:
            A new `FlowDataEngine` instance with the applied formula.
        """
        parsed_func = to_expr(func)
        if output_data_type is not None:
            df2 = self.data_frame.with_columns(parsed_func.cast(output_data_type).alias(col_name))
        else:
            df2 = self.data_frame.with_columns(parsed_func.alias(col_name))

        return FlowDataEngine(df2, number_of_records=self.number_of_records)

    def apply_sql_formula(self, func: str, col_name: str,
                          output_data_type: pl.DataType = None) -> "FlowDataEngine":
        """Applies an SQL-style formula using `pl.sql_expr`.

        Args:
            func: A string containing an SQL expression.
            col_name: The name of the new or transformed column.
            output_data_type: The desired Polars data type for the output column.

        Returns:
            A new `FlowDataEngine` instance with the applied formula.
        """
        expr = to_expr(func)
        if output_data_type not in (None, "Auto"):
            df = self.data_frame.with_columns(expr.cast(output_data_type).alias(col_name))
        else:
            df = self.data_frame.with_columns(expr.alias(col_name))

        return FlowDataEngine(df, number_of_records=self.number_of_records)

    def output(self, output_fs: input_schema.OutputSettings, flow_id: int, node_id: int | str,
               execute_remote: bool = True) -> "FlowDataEngine":
        """Writes the DataFrame to an output file.

        Can execute the write operation locally or in a remote worker process.

        Args:
            output_fs: An `OutputSettings` object with details about the output file.
            flow_id: The flow ID for tracking.
            node_id: The node ID for tracking.
            execute_remote: If True, executes the write in a worker process.

        Returns:
            The same `FlowDataEngine` instance for chaining.
        """
        logger.info('Starting to write output')
        if execute_remote:
            status = utils.write_output(
                self.data_frame,
                data_type=output_fs.file_type,
                path=output_fs.abs_file_path,
                write_mode=output_fs.write_mode,
                sheet_name=output_fs.output_excel_table.sheet_name,
                delimiter=output_fs.output_csv_table.delimiter,
                flow_id=flow_id,
                node_id=node_id
            )
            tracker = ExternalExecutorTracker(status)
            tracker.get_result()
            logger.info('Finished writing output')
        else:
            logger.info("Starting to write results locally")
            utils.local_write_output(
                self.data_frame,
                data_type=output_fs.file_type,
                path=output_fs.abs_file_path,
                write_mode=output_fs.write_mode,
                sheet_name=output_fs.output_excel_table.sheet_name,
                delimiter=output_fs.output_csv_table.delimiter,
                flow_id=flow_id,
                node_id=node_id,
            )
            logger.info("Finished writing output")
        return self

    def make_unique(self, unique_input: transform_schemas.UniqueInput = None) -> "FlowDataEngine":
        """Gets the unique rows from the DataFrame.

        Args:
            unique_input: A `UniqueInput` object specifying a subset of columns
                to consider for uniqueness and a strategy for keeping rows.

        Returns:
            A new `FlowDataEngine` instance with unique rows.
        """
        if unique_input is None or unique_input.columns is None:
            return FlowDataEngine(self.data_frame.unique())
        return FlowDataEngine(self.data_frame.unique(unique_input.columns, keep=unique_input.strategy))

    def concat(self, other: Iterable["FlowDataEngine"] | "FlowDataEngine") -> "FlowDataEngine":
        """Concatenates this DataFrame with one or more other DataFrames.

        Args:
            other: A single `FlowDataEngine` or an iterable of them.

        Returns:
            A new `FlowDataEngine` containing the concatenated data.
        """
        if isinstance(other, FlowDataEngine):
            other = [other]

        dfs: List[pl.LazyFrame] | List[pl.DataFrame] = [self.data_frame] + [flt.data_frame for flt in other]
        return FlowDataEngine(pl.concat(dfs, how='diagonal_relaxed'))

    def do_select(self, select_inputs: transform_schemas.SelectInputs,
                  keep_missing: bool = True) -> "FlowDataEngine":
        """Performs a complex column selection, renaming, and reordering operation.

        Args:
            select_inputs: A `SelectInputs` object defining the desired transformations.
            keep_missing: If True, columns not specified in `select_inputs` are kept.
                If False, they are dropped.

        Returns:
            A new `FlowDataEngine` with the transformed selection.
        """
        new_schema = deepcopy(self.schema)
        renames = [r for r in select_inputs.renames if r.is_available]

        if not keep_missing:
            drop_cols = set(self.data_frame.collect_schema().names()) - set(r.old_name for r in renames).union(
                set(r.old_name for r in renames if not r.keep))
            keep_cols = []
        else:
            keep_cols = list(set(self.data_frame.collect_schema().names()) - set(r.old_name for r in renames))
            drop_cols = set(r.old_name for r in renames if not r.keep)

        if len(drop_cols) > 0:
            new_schema = [s for s in new_schema if s.name not in drop_cols]
        new_schema_mapping = {v.name: v for v in new_schema}

        available_renames = []
        for rename in renames:
            if (rename.new_name != rename.old_name or rename.new_name not in new_schema_mapping) and rename.keep:
                schema_entry = new_schema_mapping.get(rename.old_name)
                if schema_entry is not None:
                    available_renames.append(rename)
                    schema_entry.column_name = rename.new_name

        rename_dict = {r.old_name: r.new_name for r in available_renames}
        fl = self.select_columns(
            list_select=[col_to_keep.old_name for col_to_keep in renames if col_to_keep.keep] + keep_cols)
        fl = fl.change_column_types(transforms=[r for r in renames if r.keep])
        ndf = fl.data_frame.rename(rename_dict)
        renames.sort(key=lambda r: 0 if r.position is None else r.position)
        sorted_cols = utils.match_order(ndf.collect_schema().names(),
                                        [r.new_name for r in renames] + self.data_frame.collect_schema().names())
        output_file = FlowDataEngine(ndf, number_of_records=self.number_of_records)
        return output_file.reorganize_order(sorted_cols)

    def set_streamable(self, streamable: bool = False):
        """Sets whether DataFrame operations should be streamable."""
        self._streamable = streamable

    def _calculate_schema(self) -> List[Dict]:
        """Calculates schema statistics."""
        if self.external_source is not None:
            self.collect_external()
        v = utils.calculate_schema(self.data_frame)
        return v

    def calculate_schema(self):
        """Calculates and returns the schema."""
        self._calculate_schema_stats = True
        return self.schema

    def count(self) -> int:
        """Gets the total number of records."""
        return self.get_number_of_records()

    @classmethod
    def create_from_path_worker(cls, received_table: input_schema.ReceivedTable, flow_id: int, node_id: int | str):
        """Creates a FlowDataEngine from a path in a worker process."""
        received_table.set_absolute_filepath()
        external_fetcher = ExternalCreateFetcher(received_table=received_table,
                                                 file_type=received_table.file_type, flow_id=flow_id, node_id=node_id)
        return cls(external_fetcher.get_result())
__name__ property

The name of the table.

cols_idx property

A dictionary mapping column names to their integer index.

data_frame property writable

The underlying Polars DataFrame or LazyFrame.

This property provides access to the Polars object that backs the FlowDataEngine. It handles lazy-loading from external sources if necessary.

Returns:

Type Description
LazyFrame | DataFrame | None

The active Polars DataFrame or LazyFrame.

external_source property

The external data source, if any.

has_errors property

Checks if there are any errors.

lazy property writable

Indicates if the DataFrame is in lazy mode.

number_of_fields property

The number of columns (fields) in the DataFrame.

Returns:

Type Description
int

The integer count of columns.

schema property

The schema of the DataFrame as a list of FlowfileColumn objects.

This property lazily calculates the schema if it hasn't been determined yet.

Returns:

Type Description
List[FlowfileColumn]

A list of FlowfileColumn objects describing the schema.

__call__()

Makes the class instance callable, returning itself.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __call__(self) -> "FlowDataEngine":
    """Makes the class instance callable, returning itself."""
    return self
__get_sample__(n_rows=100, streamable=True)

Internal method to get a sample of the data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __get_sample__(self, n_rows: int = 100, streamable: bool = True) -> "FlowDataEngine":
    """Internal method to get a sample of the data."""
    if not self.lazy:
        df = self.data_frame.lazy()
    else:
        df = self.data_frame

    if streamable:
        try:
            df = df.head(n_rows).collect()
        except Exception as e:
            logger.warning(f'Error in getting sample: {e}')
            df = df.head(n_rows).collect(engine="auto")
    else:
        df = self.collect()
    return FlowDataEngine(df, number_of_records=len(df), schema=self.schema)
__getitem__(item)

Accesses a specific column or item from the DataFrame.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __getitem__(self, item):
    """Accesses a specific column or item from the DataFrame."""
    return self.data_frame.select([item])
__init__(raw_data=None, path_ref=None, name=None, optimize_memory=True, schema=None, number_of_records=None, calculate_schema_stats=False, streamable=True, number_of_records_callback=None, data_callback=None)

Initializes the FlowDataEngine from various data sources.

Parameters:

Name Type Description Default
raw_data Union[List[Dict], List[Any], Dict[str, Any], ParquetFile, DataFrame, LazyFrame, RawData]

The input data. Can be a list of dicts, a Polars DataFrame/LazyFrame, or a RawData schema object.

None
path_ref str

A string path to a Parquet file.

None
name str

An optional name for the data engine instance.

None
optimize_memory bool

If True, prefers lazy operations to conserve memory.

True
schema List[FlowfileColumn] | List[str] | Schema

An optional schema definition. Can be a list of FlowfileColumn objects, a list of column names, or a Polars Schema.

None
number_of_records int

The number of records, if known.

None
calculate_schema_stats bool

If True, computes detailed statistics for each column.

False
streamable bool

If True, allows for streaming operations when possible.

True
number_of_records_callback Callable

A callback function to retrieve the number of records.

None
data_callback Callable

A callback function to retrieve the data.

None
Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __init__(self,
             raw_data: Union[List[Dict], List[Any], Dict[str, Any], 'ParquetFile', pl.DataFrame, pl.LazyFrame, input_schema.RawData] = None,
             path_ref: str = None,
             name: str = None,
             optimize_memory: bool = True,
             schema: List['FlowfileColumn'] | List[str] | pl.Schema = None,
             number_of_records: int = None,
             calculate_schema_stats: bool = False,
             streamable: bool = True,
             number_of_records_callback: Callable = None,
             data_callback: Callable = None):
    """Initializes the FlowDataEngine from various data sources.

    Args:
        raw_data: The input data. Can be a list of dicts, a Polars DataFrame/LazyFrame,
            or a `RawData` schema object.
        path_ref: A string path to a Parquet file.
        name: An optional name for the data engine instance.
        optimize_memory: If True, prefers lazy operations to conserve memory.
        schema: An optional schema definition. Can be a list of `FlowfileColumn` objects,
            a list of column names, or a Polars `Schema`.
        number_of_records: The number of records, if known.
        calculate_schema_stats: If True, computes detailed statistics for each column.
        streamable: If True, allows for streaming operations when possible.
        number_of_records_callback: A callback function to retrieve the number of records.
        data_callback: A callback function to retrieve the data.
    """
    self._initialize_attributes(number_of_records_callback, data_callback, streamable)

    if raw_data is not None:
        self._handle_raw_data(raw_data, number_of_records, optimize_memory)
    elif path_ref:
        self._handle_path_ref(path_ref, optimize_memory)
    else:
        self.initialize_empty_fl()
    self._finalize_initialization(name, optimize_memory, schema, calculate_schema_stats)
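A minimal usage sketch of the constructor, assuming FlowDataEngine can be imported from flowfile_core.flowfile.flow_data_engine.flow_data_engine (the module named in the source path above):

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

# Build from in-memory records (raw_data) or from an existing Polars (Lazy)Frame.
fde_from_records = FlowDataEngine(raw_data=[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
fde_from_polars = FlowDataEngine(pl.LazyFrame({"id": [1, 2], "name": ["a", "b"]}), name="example")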
__len__()

Returns the number of records in the table.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __len__(self) -> int:
    """Returns the number of records in the table."""
    return self.number_of_records if self.number_of_records >= 0 else self.get_number_of_records()
__repr__()

Returns a string representation of the FlowDataEngine.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __repr__(self) -> str:
    """Returns a string representation of the FlowDataEngine."""
    return f'flow data engine\n{self.data_frame.__repr__()}'
add_new_values(values, col_name=None)

Adds a new column with the provided values.

Parameters:

Name Type Description Default
values Iterable

An iterable (e.g., list, tuple) of values to add as a new column.

required
col_name str

The name for the new column. Defaults to 'new_values'.

None

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the added column.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def add_new_values(self, values: Iterable, col_name: str = None) -> "FlowDataEngine":
    """Adds a new column with the provided values.

    Args:
        values: An iterable (e.g., list, tuple) of values to add as a new column.
        col_name: The name for the new column. Defaults to 'new_values'.

    Returns:
        A new `FlowDataEngine` instance with the added column.
    """
    if col_name is None:
        col_name = 'new_values'
    return FlowDataEngine(self.data_frame.with_columns(pl.Series(values).alias(col_name)))
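A short illustrative sketch (same assumed import path as above); the number of values should match the row count, since the column is added via with_columns:

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.DataFrame({"id": [1, 2, 3]}))
flagged = fde.add_new_values([True, False, True], col_name="flag")  # adds a 'flag' column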
add_record_id(record_id_settings)

Adds a record ID (row number) column to the DataFrame.

Can generate a simple sequential ID or a grouped ID that resets for each group.

Parameters:

Name Type Description Default
record_id_settings RecordIdInput

A RecordIdInput object specifying the output column name, offset, and optional grouping columns.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the added record ID column.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def add_record_id(self, record_id_settings: transform_schemas.RecordIdInput) -> "FlowDataEngine":
    """Adds a record ID (row number) column to the DataFrame.

    Can generate a simple sequential ID or a grouped ID that resets for
    each group.

    Args:
        record_id_settings: A `RecordIdInput` object specifying the output
            column name, offset, and optional grouping columns.

    Returns:
        A new `FlowDataEngine` instance with the added record ID column.
    """
    if record_id_settings.group_by and len(record_id_settings.group_by_columns) > 0:
        return self._add_grouped_record_id(record_id_settings)
    return self._add_simple_record_id(record_id_settings)
apply_flowfile_formula(func, col_name, output_data_type=None)

Applies a formula to create a new column or transform an existing one.

Parameters:

Name Type Description Default
func str

A string containing a Polars expression formula.

required
col_name str

The name of the new or transformed column.

required
output_data_type DataType

The desired Polars data type for the output column.

None

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the applied formula.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def apply_flowfile_formula(self, func: str, col_name: str,
                           output_data_type: pl.DataType = None) -> "FlowDataEngine":
    """Applies a formula to create a new column or transform an existing one.

    Args:
        func: A string containing a Polars expression formula.
        col_name: The name of the new or transformed column.
        output_data_type: The desired Polars data type for the output column.

    Returns:
        A new `FlowDataEngine` instance with the applied formula.
    """
    parsed_func = to_expr(func)
    if output_data_type is not None:
        df2 = self.data_frame.with_columns(parsed_func.cast(output_data_type).alias(col_name))
    else:
        df2 = self.data_frame.with_columns(parsed_func.alias(col_name))

    return FlowDataEngine(df2, number_of_records=self.number_of_records)
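A hedged sketch: the formula string is parsed by the internal to_expr helper, whose exact grammar is not documented here, so the expression below is illustrative only:

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.DataFrame({"price": [10.0, 20.0]}))
# Illustrative formula string; adapt it to the expression syntax accepted by to_expr.
with_vat = fde.apply_flowfile_formula("price * 1.21", col_name="price_incl_vat", output_data_type=pl.Float64)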
apply_sql_formula(func, col_name, output_data_type=None)

Applies an SQL-style formula using pl.sql_expr.

Parameters:

Name Type Description Default
func str

A string containing an SQL expression.

required
col_name str

The name of the new or transformed column.

required
output_data_type DataType

The desired Polars data type for the output column.

None

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the applied formula.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def apply_sql_formula(self, func: str, col_name: str,
                      output_data_type: pl.DataType = None) -> "FlowDataEngine":
    """Applies an SQL-style formula using `pl.sql_expr`.

    Args:
        func: A string containing an SQL expression.
        col_name: The name of the new or transformed column.
        output_data_type: The desired Polars data type for the output column.

    Returns:
        A new `FlowDataEngine` instance with the applied formula.
    """
    expr = to_expr(func)
    if output_data_type not in (None, "Auto"):
        df = self.data_frame.with_columns(expr.cast(output_data_type).alias(col_name))
    else:
        df = self.data_frame.with_columns(expr.alias(col_name))

    return FlowDataEngine(df, number_of_records=self.number_of_records)
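A sketch assuming a plain SQL-style expression string evaluated against the existing columns (import path assumed as in the earlier sketches); the cast to output_data_type is optional:

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.DataFrame({"quantity": [2, 5], "unit_price": [3.0, 4.0]}))
totals = fde.apply_sql_formula("quantity * unit_price", col_name="total", output_data_type=pl.Float64)  # illustrative expression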
assert_equal(other, ordered=True, strict_schema=False)

Asserts that this DataFrame is equal to another.

Useful for testing.

Parameters:

Name Type Description Default
other FlowDataEngine

The other FlowDataEngine to compare with.

required
ordered bool

If True, the row order must be identical.

True
strict_schema bool

If True, the data types of the schemas must be identical.

False

Raises:

Type Description
Exception

If the DataFrames are not equal based on the specified criteria.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def assert_equal(self, other: "FlowDataEngine", ordered: bool = True, strict_schema: bool = False):
    """Asserts that this DataFrame is equal to another.

    Useful for testing.

    Args:
        other: The other `FlowDataEngine` to compare with.
        ordered: If True, the row order must be identical.
        strict_schema: If True, the data types of the schemas must be identical.

    Raises:
        Exception: If the DataFrames are not equal based on the specified criteria.
    """
    org_laziness = self.lazy, other.lazy
    self.lazy = False
    other.lazy = False
    self.number_of_records = -1
    other.number_of_records = -1
    other = other.select_columns(self.columns)

    if self.get_number_of_records() != other.get_number_of_records():
        raise Exception('Number of records is not equal')

    if self.columns != other.columns:
        raise Exception('Schema is not equal')

    if strict_schema:
        assert self.data_frame.schema == other.data_frame.schema, 'Data types do not match'

    if ordered:
        self_lf = self.data_frame.sort(by=self.columns)
        other_lf = other.data_frame.sort(by=other.columns)
    else:
        self_lf = self.data_frame
        other_lf = other.data_frame

    self.lazy, other.lazy = org_laziness
    assert self_lf.equals(other_lf), 'Data is not equal'
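A small testing sketch (import path assumed as above):

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

a = FlowDataEngine(pl.DataFrame({"id": [1, 2], "name": ["a", "b"]}))
b = FlowDataEngine(pl.DataFrame({"id": [1, 2], "name": ["a", "b"]}))
a.assert_equal(b)                      # passes silently when the data matches
a.assert_equal(b, strict_schema=True)  # additionally requires identical dtypes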
cache()

Caches the current DataFrame to disk and updates the internal reference.

This triggers a background process to write the current LazyFrame's result to a temporary file. Subsequent operations on this FlowDataEngine instance will read from the cached file, which can speed up downstream computations.

Returns:

Type Description
FlowDataEngine

The same FlowDataEngine instance, now backed by the cached data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def cache(self) -> "FlowDataEngine":
    """Caches the current DataFrame to disk and updates the internal reference.

    This triggers a background process to write the current LazyFrame's result
    to a temporary file. Subsequent operations on this `FlowDataEngine` instance
    will read from the cached file, which can speed up downstream computations.

    Returns:
        The same `FlowDataEngine` instance, now backed by the cached data.
    """
    edf = ExternalDfFetcher(lf=self.data_frame, file_ref=str(id(self)), wait_on_completion=False,
                            flow_id=-1,
                            node_id=-1)
    logger.info('Caching data in background')
    result = edf.get_result()
    if isinstance(result, pl.LazyFrame):
        logger.info('Data cached')
        del self._data_frame
        self.data_frame = result
        logger.info('Data loaded from cache')
    return self
calculate_schema()

Calculates and returns the schema.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def calculate_schema(self):
    """Calculates and returns the schema."""
    self._calculate_schema_stats = True
    return self.schema
change_column_types(transforms, calculate_schema=False)

Changes the data type of one or more columns.

Parameters:

Name Type Description Default
transforms List[SelectInput]

A list of SelectInput objects, where each object specifies the column and its new polars_type.

required
calculate_schema bool

If True, recalculates the schema after the type change.

False

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the updated column types.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def change_column_types(self, transforms: List[transform_schemas.SelectInput],
                        calculate_schema: bool = False) -> "FlowDataEngine":
    """Changes the data type of one or more columns.

    Args:
        transforms: A list of `SelectInput` objects, where each object specifies
            the column and its new `polars_type`.
        calculate_schema: If True, recalculates the schema after the type change.

    Returns:
        A new `FlowDataEngine` instance with the updated column types.
    """
    dtypes = [dtype.base_type() for dtype in self.data_frame.collect_schema().dtypes()]
    idx_mapping = list(
        (transform.old_name, self.cols_idx.get(transform.old_name), getattr(pl, transform.polars_type))
        for transform in transforms if transform.data_type is not None
    )

    actual_transforms = [c for c in idx_mapping if c[2] != dtypes[c[1]]]
    transformations = [
        utils.define_pl_col_transformation(col_name=transform[0], col_type=transform[2])
        for transform in actual_transforms
    ]

    df = self.data_frame.with_columns(transformations)
    return FlowDataEngine(
        df,
        number_of_records=self.number_of_records,
        calculate_schema_stats=calculate_schema,
        streamable=self._streamable
    )
collect(n_records=None)

Collects the data and returns it as a Polars DataFrame.

This method triggers the execution of the lazy query plan (if applicable) and returns the result. It supports streaming to optimize memory usage for large datasets.

Parameters:

Name Type Description Default
n_records int

The maximum number of records to collect. If None, all records are collected.

None

Returns:

Type Description
DataFrame

A Polars DataFrame containing the collected data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def collect(self, n_records: int = None) -> pl.DataFrame:
    """Collects the data and returns it as a Polars DataFrame.

    This method triggers the execution of the lazy query plan (if applicable)
    and returns the result. It supports streaming to optimize memory usage
    for large datasets.

    Args:
        n_records: The maximum number of records to collect. If None, all
            records are collected.

    Returns:
        A Polars `DataFrame` containing the collected data.
    """
    if n_records is None:
        logger.info(f'Fetching all data for Table object "{id(self)}". Settings: streaming={self._streamable}')
    else:
        logger.info(f'Fetching {n_records} record(s) for Table object "{id(self)}". '
                    f'Settings: streaming={self._streamable}')

    if not self.lazy:
        return self.data_frame

    try:
        return self._collect_data(n_records)
    except Exception as e:
        self.errors = [e]
        return self._handle_collection_error(n_records)
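Illustrative usage (import path assumed as above); n_records caps how many rows are materialized:

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.LazyFrame({"x": list(range(10))}))
df_all = fde.collect()    # full Polars DataFrame
df_head = fde.collect(5)  # at most 5 records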
collect_external()

Materializes data from a tracked external source.

If the FlowDataEngine was created from an ExternalDataSource, this method will trigger the data retrieval, update the internal _data_frame to a LazyFrame of the collected data, and reset the schema to be re-evaluated.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def collect_external(self):
    """Materializes data from a tracked external source.

    If the `FlowDataEngine` was created from an `ExternalDataSource`, this
    method will trigger the data retrieval, update the internal `_data_frame`
    to a `LazyFrame` of the collected data, and reset the schema to be
    re-evaluated.
    """
    if self._external_source is not None:
        logger.info('Collecting external source')
        if self.external_source.get_pl_df() is not None:
            self.data_frame = self.external_source.get_pl_df().lazy()
        else:
            self.data_frame = pl.LazyFrame(list(self.external_source.get_iter()))
        self._schema = None  # enforce reset schema
concat(other)

Concatenates this DataFrame with one or more other DataFrames.

Parameters:

Name Type Description Default
other Iterable[FlowDataEngine] | FlowDataEngine

A single FlowDataEngine or an iterable of them.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine containing the concatenated data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def concat(self, other: Iterable["FlowDataEngine"] | "FlowDataEngine") -> "FlowDataEngine":
    """Concatenates this DataFrame with one or more other DataFrames.

    Args:
        other: A single `FlowDataEngine` or an iterable of them.

    Returns:
        A new `FlowDataEngine` containing the concatenated data.
    """
    if isinstance(other, FlowDataEngine):
        other = [other]

    dfs: List[pl.LazyFrame] | List[pl.DataFrame] = [self.data_frame] + [flt.data_frame for flt in other]
    return FlowDataEngine(pl.concat(dfs, how='diagonal_relaxed'))
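A sketch of a diagonal concatenation (import path assumed as above); because the underlying call uses how='diagonal_relaxed', columns missing on one side are filled with nulls:

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

a = FlowDataEngine(pl.DataFrame({"x": [1]}))
b = FlowDataEngine(pl.DataFrame({"x": [2], "y": ["extra"]}))
combined = a.concat(b)  # two rows; 'y' is null for the row coming from a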
count()

Gets the total number of records.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def count(self) -> int:
    """Gets the total number of records."""
    return self.get_number_of_records()
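A trivial sketch (import path assumed as above):

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.DataFrame({"x": [1, 2, 3]}))
assert fde.count() == 3  # delegates to get_number_of_records()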
create_from_external_source(external_source) classmethod

Creates a FlowDataEngine from an external data source.

Parameters:

Name Type Description Default
external_source ExternalDataSource

An object that conforms to the ExternalDataSource interface.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_from_external_source(cls, external_source: ExternalDataSource) -> "FlowDataEngine":
    """Creates a FlowDataEngine from an external data source.

    Args:
        external_source: An object that conforms to the `ExternalDataSource`
            interface.

    Returns:
        A new `FlowDataEngine` instance.
    """
    if external_source.schema is not None:
        ff = cls.create_from_schema(external_source.schema)
    elif external_source.initial_data_getter is not None:
        ff = cls(raw_data=external_source.initial_data_getter())
    else:
        ff = cls()
    ff._external_source = external_source
    return ff
create_from_path(received_table) classmethod

Creates a FlowDataEngine from a local file path.

Supports various file types like CSV, Parquet, and Excel.

Parameters:

Name Type Description Default
received_table ReceivedTableBase

A ReceivedTableBase object containing the file path and format details.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with data from the file.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_from_path(cls, received_table: input_schema.ReceivedTableBase) -> "FlowDataEngine":
    """Creates a FlowDataEngine from a local file path.

    Supports various file types like CSV, Parquet, and Excel.

    Args:
        received_table: A `ReceivedTableBase` object containing the file path
            and format details.

    Returns:
        A new `FlowDataEngine` instance with data from the file.
    """
    received_table.set_absolute_filepath()
    file_type_handlers = {
        'csv': create_funcs.create_from_path_csv,
        'parquet': create_funcs.create_from_path_parquet,
        'excel': create_funcs.create_from_path_excel
    }

    handler = file_type_handlers.get(received_table.file_type)
    if not handler:
        raise Exception(f'Cannot create from {received_table.file_type}')

    flow_file = cls(handler(received_table))
    flow_file._org_path = received_table.abs_file_path
    return flow_file
create_from_path_worker(received_table, flow_id, node_id) classmethod

Creates a FlowDataEngine from a path in a worker process.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_from_path_worker(cls, received_table: input_schema.ReceivedTable, flow_id: int, node_id: int | str):
    """Creates a FlowDataEngine from a path in a worker process."""
    received_table.set_absolute_filepath()
    external_fetcher = ExternalCreateFetcher(received_table=received_table,
                                             file_type=received_table.file_type, flow_id=flow_id, node_id=node_id)
    return cls(external_fetcher.get_result())
create_from_schema(schema) classmethod

Creates an empty FlowDataEngine from a schema definition.

Parameters:

Name Type Description Default
schema List[FlowfileColumn]

A list of FlowfileColumn objects defining the schema.

required

Returns:

Type Description
FlowDataEngine

A new, empty FlowDataEngine instance with the specified schema.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_from_schema(cls, schema: List[FlowfileColumn]) -> "FlowDataEngine":
    """Creates an empty FlowDataEngine from a schema definition.

    Args:
        schema: A list of `FlowfileColumn` objects defining the schema.

    Returns:
        A new, empty `FlowDataEngine` instance with the specified schema.
    """
    pl_schema = []
    for i, flow_file_column in enumerate(schema):
        pl_schema.append((flow_file_column.name, cast_str_to_polars_type(flow_file_column.data_type)))
        schema[i].col_index = i
    df = pl.LazyFrame(schema=pl_schema)
    return cls(df, schema=schema, calculate_schema_stats=False, number_of_records=0)
create_from_sql(sql, conn) classmethod

Creates a FlowDataEngine by executing a SQL query.

Parameters:

Name Type Description Default
sql str

The SQL query string to execute.

required
conn Any

A database connection object or connection URI string.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the query result.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_from_sql(cls, sql: str, conn: Any) -> "FlowDataEngine":
    """Creates a FlowDataEngine by executing a SQL query.

    Args:
        sql: The SQL query string to execute.
        conn: A database connection object or connection URI string.

    Returns:
        A new `FlowDataEngine` instance with the query result.
    """
    return cls(pl.read_sql(sql, conn))
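An illustrative sketch only: the query and connection URI below are placeholders, and the call requires a SQL backend that Polars can read from:

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

# Placeholder query and connection URI; substitute your own database details.
fde = FlowDataEngine.create_from_sql("SELECT 1 AS id, 'a' AS name", "sqlite:///example.db")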
create_random(number_of_records=1000) classmethod

Creates a FlowDataEngine with randomly generated data.

Useful for testing and examples.

Parameters:

Name Type Description Default
number_of_records int

The number of random records to generate.

1000

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with fake data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_random(cls, number_of_records: int = 1000) -> "FlowDataEngine":
    """Creates a FlowDataEngine with randomly generated data.

    Useful for testing and examples.

    Args:
        number_of_records: The number of random records to generate.

    Returns:
        A new `FlowDataEngine` instance with fake data.
    """
    return cls(create_fake_data(number_of_records))
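Handy for tests and demos (import path assumed as above):

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

sample = FlowDataEngine.create_random(number_of_records=100)  # 100 rows of generated fake data
print(sample.count())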
do_cross_join(cross_join_input, auto_generate_selection, verify_integrity, other)

Performs a cross join with another DataFrame.

A cross join produces the Cartesian product of the two DataFrames.

Parameters:

Name Type Description Default
cross_join_input CrossJoinInput

A CrossJoinInput object specifying column selections.

required
auto_generate_selection bool

If True, automatically renames columns to avoid conflicts.

required
verify_integrity bool

If True, checks if the resulting join would be too large.

required
other FlowDataEngine

The right FlowDataEngine to join with.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine with the result of the cross join.

Raises:

Type Description
Exception

If verify_integrity is True and the join would result in an excessively large number of records.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_cross_join(self, cross_join_input: transform_schemas.CrossJoinInput,
                  auto_generate_selection: bool, verify_integrity: bool,
                  other: "FlowDataEngine") -> "FlowDataEngine":
    """Performs a cross join with another DataFrame.

    A cross join produces the Cartesian product of the two DataFrames.

    Args:
        cross_join_input: A `CrossJoinInput` object specifying column selections.
        auto_generate_selection: If True, automatically renames columns to avoid conflicts.
        verify_integrity: If True, checks if the resulting join would be too large.
        other: The right `FlowDataEngine` to join with.

    Returns:
        A new `FlowDataEngine` with the result of the cross join.

    Raises:
        Exception: If `verify_integrity` is True and the join would result in
            an excessively large number of records.
    """
    self.lazy = True
    other.lazy = True

    verify_join_select_integrity(cross_join_input, left_columns=self.columns, right_columns=other.columns)

    right_select = [v.old_name for v in cross_join_input.right_select.renames
                    if (v.keep or v.join_key) and v.is_available]
    left_select = [v.old_name for v in cross_join_input.left_select.renames
                   if (v.keep or v.join_key) and v.is_available]

    left = self.data_frame.select(left_select).rename(cross_join_input.left_select.rename_table)
    right = other.data_frame.select(right_select).rename(cross_join_input.right_select.rename_table)

    if verify_integrity:
        n_records = self.get_number_of_records() * other.get_number_of_records()
        if n_records > 1_000_000_000:
            raise Exception("Join will result in too many records, ending process")
    else:
        n_records = -1

    joined_df = left.join(right, how='cross')

    cols_to_delete_after = [col.new_name for col in
                            cross_join_input.left_select.renames + cross_join_input.right_select.renames
                            if col.join_key and not col.keep and col.is_available]

    if verify_integrity:
        return FlowDataEngine(joined_df.drop(cols_to_delete_after), calculate_schema_stats=False,
                             number_of_records=n_records, streamable=False)
    else:
        fl = FlowDataEngine(joined_df.drop(cols_to_delete_after), calculate_schema_stats=False,
                           number_of_records=0, streamable=False)
        return fl
do_filter(predicate)

Filters rows based on a predicate expression.

Parameters:

Name Type Description Default
predicate str

A string containing a Polars expression that evaluates to a boolean value.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance containing only the rows that match the predicate.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_filter(self, predicate: str) -> "FlowDataEngine":
    """Filters rows based on a predicate expression.

    Args:
        predicate: A string containing a Polars expression that evaluates to
            a boolean value.

    Returns:
        A new `FlowDataEngine` instance containing only the rows that match
        the predicate.
    """
    try:
        f = to_expr(predicate)
    except Exception as e:
        logger.warning(f'Error in filter expression: {e}')
        f = to_expr("False")
    df = self.data_frame.filter(f)
    return FlowDataEngine(df, schema=self.schema, streamable=self._streamable)
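A hedged sketch: the predicate string is parsed by the internal to_expr helper, so the expression below is illustrative; if the string fails to parse, the method logs a warning and falls back to a literal False predicate:

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.DataFrame({"age": [15, 30, 45]}))
adults = fde.do_filter("age >= 18")  # illustrative predicate string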
do_fuzzy_join(fuzzy_match_input, other, file_ref, flow_id=-1, node_id=-1)

Performs a fuzzy join with another DataFrame.

This method blocks until the fuzzy join operation is complete.

Parameters:

Name Type Description Default
fuzzy_match_input FuzzyMatchInput

A FuzzyMatchInput object with the matching parameters.

required
other FlowDataEngine

The right FlowDataEngine to join with.

required
file_ref str

A reference string for temporary files.

required
flow_id int

The flow ID for tracking.

-1
node_id int | str

The node ID for tracking.

-1

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the result of the fuzzy join.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_fuzzy_join(self, fuzzy_match_input: transform_schemas.FuzzyMatchInput,
                  other: "FlowDataEngine", file_ref: str, flow_id: int = -1,
                  node_id: int | str = -1) -> "FlowDataEngine":
    """Performs a fuzzy join with another DataFrame.

    This method blocks until the fuzzy join operation is complete.

    Args:
        fuzzy_match_input: A `FuzzyMatchInput` object with the matching parameters.
        other: The right `FlowDataEngine` to join with.
        file_ref: A reference string for temporary files.
        flow_id: The flow ID for tracking.
        node_id: The node ID for tracking.

    Returns:
        A new `FlowDataEngine` instance with the result of the fuzzy join.
    """
    left_df, right_df = prepare_for_fuzzy_match(left=self, right=other,
                                                fuzzy_match_input=fuzzy_match_input)
    f = ExternalFuzzyMatchFetcher(left_df, right_df,
                                  fuzzy_maps=fuzzy_match_input.fuzzy_maps,
                                  file_ref=file_ref + '_fm',
                                  wait_on_completion=True,
                                  flow_id=flow_id,
                                  node_id=node_id)
    return FlowDataEngine(f.get_result())
do_group_by(group_by_input, calculate_schema_stats=True)

Performs a group-by operation on the DataFrame.

Parameters:

Name Type Description Default
group_by_input GroupByInput

A GroupByInput object defining the grouping columns and aggregations.

required
calculate_schema_stats bool

If True, calculates schema statistics for the resulting DataFrame.

True

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the grouped and aggregated data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_group_by(self, group_by_input: transform_schemas.GroupByInput,
                calculate_schema_stats: bool = True) -> "FlowDataEngine":
    """Performs a group-by operation on the DataFrame.

    Args:
        group_by_input: A `GroupByInput` object defining the grouping columns
            and aggregations.
        calculate_schema_stats: If True, calculates schema statistics for the
            resulting DataFrame.

    Returns:
        A new `FlowDataEngine` instance with the grouped and aggregated data.
    """
    aggregations = [c for c in group_by_input.agg_cols if c.agg != 'groupby']
    group_columns = [c for c in group_by_input.agg_cols if c.agg == 'groupby']

    if len(group_columns) == 0:
        return FlowDataEngine(
            self.data_frame.select(
                ac.agg_func(ac.old_name).alias(ac.new_name) for ac in aggregations
            ),
            calculate_schema_stats=calculate_schema_stats
        )

    df = self.data_frame.rename({c.old_name: c.new_name for c in group_columns})
    group_by_columns = [n_c.new_name for n_c in group_columns]
    return FlowDataEngine(
        df.group_by(*group_by_columns).agg(
            ac.agg_func(ac.old_name).alias(ac.new_name) for ac in aggregations
        ),
        calculate_schema_stats=calculate_schema_stats
    )
do_pivot(pivot_input, node_logger=None)

Converts the DataFrame from a long to a wide format, aggregating values.

Parameters:

- `pivot_input` (PivotInput, required): A `PivotInput` object defining the index, pivot, and value columns, along with the aggregation logic.
- `node_logger` (NodeLogger, default `None`): An optional logger for reporting warnings, e.g., if the pivot column has too many unique values.

Returns:

- FlowDataEngine: A new, pivoted `FlowDataEngine` instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_pivot(self, pivot_input: transform_schemas.PivotInput, node_logger: NodeLogger = None) -> "FlowDataEngine":
    """Converts the DataFrame from a long to a wide format, aggregating values.

    Args:
        pivot_input: A `PivotInput` object defining the index, pivot, and value
            columns, along with the aggregation logic.
        node_logger: An optional logger for reporting warnings, e.g., if the
            pivot column has too many unique values.

    Returns:
        A new, pivoted `FlowDataEngine` instance.
    """
    # Get unique values for pivot columns
    max_unique_vals = 200
    new_cols_unique = fetch_unique_values(self.data_frame.select(pivot_input.pivot_column)
                                          .unique()
                                          .sort(pivot_input.pivot_column)
                                          .limit(max_unique_vals).cast(pl.String))
    if len(new_cols_unique) >= max_unique_vals:
        if node_logger:
            node_logger.warning('Pivot column has too many unique values. Please consider using a different column.'
                                f' Max unique values: {max_unique_vals}')

    if len(pivot_input.index_columns) == 0:
        no_index_cols = True
        pivot_input.index_columns = ['__temp__']
        ff = self.apply_flowfile_formula('1', col_name='__temp__')
    else:
        no_index_cols = False
        ff = self

    # Perform pivot operations
    index_columns = pivot_input.get_index_columns()
    grouped_ff = ff.do_group_by(pivot_input.get_group_by_input(), False)
    pivot_column = pivot_input.get_pivot_column()

    input_df = grouped_ff.data_frame.with_columns(
        pivot_column.cast(pl.String).alias(pivot_input.pivot_column)
    )
    number_of_aggregations = len(pivot_input.aggregations)
    df = (
        input_df.select(
            *index_columns,
            pivot_column,
            pivot_input.get_values_expr()
        )
        .group_by(*index_columns)
        .agg([
            (pl.col('vals').filter(pivot_column == new_col_value))
            .first()
            .alias(new_col_value)
            for new_col_value in new_cols_unique
        ])
        .select(
            *index_columns,
            *[
                pl.col(new_col).struct.field(agg).alias(f'{new_col + "_" + agg if number_of_aggregations > 1 else new_col }')
                for new_col in new_cols_unique
                for agg in pivot_input.aggregations
            ]
        )
    )

    # Clean up temporary columns if needed
    if no_index_cols:
        df = df.drop('__temp__')
        pivot_input.index_columns = []

    return FlowDataEngine(df, calculate_schema_stats=False)
do_select(select_inputs, keep_missing=True)

Performs a complex column selection, renaming, and reordering operation.

Parameters:

- `select_inputs` (SelectInputs, required): A `SelectInputs` object defining the desired transformations.
- `keep_missing` (bool, default `True`): If True, columns not specified in `select_inputs` are kept. If False, they are dropped.

Returns:

- FlowDataEngine: A new `FlowDataEngine` with the transformed selection.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_select(self, select_inputs: transform_schemas.SelectInputs,
              keep_missing: bool = True) -> "FlowDataEngine":
    """Performs a complex column selection, renaming, and reordering operation.

    Args:
        select_inputs: A `SelectInputs` object defining the desired transformations.
        keep_missing: If True, columns not specified in `select_inputs` are kept.
            If False, they are dropped.

    Returns:
        A new `FlowDataEngine` with the transformed selection.
    """
    new_schema = deepcopy(self.schema)
    renames = [r for r in select_inputs.renames if r.is_available]

    if not keep_missing:
        drop_cols = set(self.data_frame.collect_schema().names()) - set(r.old_name for r in renames).union(
            set(r.old_name for r in renames if not r.keep))
        keep_cols = []
    else:
        keep_cols = list(set(self.data_frame.collect_schema().names()) - set(r.old_name for r in renames))
        drop_cols = set(r.old_name for r in renames if not r.keep)

    if len(drop_cols) > 0:
        new_schema = [s for s in new_schema if s.name not in drop_cols]
    new_schema_mapping = {v.name: v for v in new_schema}

    available_renames = []
    for rename in renames:
        if (rename.new_name != rename.old_name or rename.new_name not in new_schema_mapping) and rename.keep:
            schema_entry = new_schema_mapping.get(rename.old_name)
            if schema_entry is not None:
                available_renames.append(rename)
                schema_entry.column_name = rename.new_name

    rename_dict = {r.old_name: r.new_name for r in available_renames}
    fl = self.select_columns(
        list_select=[col_to_keep.old_name for col_to_keep in renames if col_to_keep.keep] + keep_cols)
    fl = fl.change_column_types(transforms=[r for r in renames if r.keep])
    ndf = fl.data_frame.rename(rename_dict)
    renames.sort(key=lambda r: 0 if r.position is None else r.position)
    sorted_cols = utils.match_order(ndf.collect_schema().names(),
                                    [r.new_name for r in renames] + self.data_frame.collect_schema().names())
    output_file = FlowDataEngine(ndf, number_of_records=self.number_of_records)
    return output_file.reorganize_order(sorted_cols)
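
A hedged sketch of building the selection from the current schema via `get_select_inputs()` (documented later in this section). It assumes the `SelectInput` entries returned there expose writable `new_name` and `keep` attributes with sensible defaults, matching the attributes read in the method body above; the import path is inferred from the source reference.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

engine = FlowDataEngine(pl.LazyFrame({"id": [1, 2], "name": ["a", "b"], "tmp": [0, 0]}))
select_inputs = engine.get_select_inputs()      # one SelectInput per current column
for rename in select_inputs.renames:
    if rename.old_name == "name":
        rename.new_name = "customer_name"       # rename this column
    if rename.old_name == "tmp":
        rename.keep = False                     # drop this column
result = engine.do_select(select_inputs, keep_missing=True)
print(result.columns)
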
do_sort(sorts)

Sorts the DataFrame by one or more columns.

Parameters:

- `sorts` (List[SortByInput], required): A list of `SortByInput` objects, each specifying a column and sort direction ('asc' or 'desc').

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance with the sorted data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_sort(self, sorts: List[transform_schemas.SortByInput]) -> "FlowDataEngine":
    """Sorts the DataFrame by one or more columns.

    Args:
        sorts: A list of `SortByInput` objects, each specifying a column
            and sort direction ('asc' or 'desc').

    Returns:
        A new `FlowDataEngine` instance with the sorted data.
    """
    if not sorts:
        return self

    descending = [s.how == 'desc' or s.how.lower() == 'descending' for s in sorts]
    df = self.data_frame.sort([sort_by.column for sort_by in sorts], descending=descending)
    return FlowDataEngine(df, number_of_records=self.number_of_records, schema=self.schema)
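
A short sketch. The `SortByInput` keyword names are inferred from the attributes (`column`, `how`) read in the method body above, and the `transform_schemas` import path is an assumption.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference
from flowfile_core.schemas import transform_schemas  # import path assumed

engine = FlowDataEngine(pl.LazyFrame({"name": ["a", "b", "c"], "age": [30, 25, 41]}))
sorted_engine = engine.do_sort([
    transform_schemas.SortByInput(column="age", how="desc"),  # field names assumed from the attributes used above
])
print(sorted_engine.to_pylist())
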
drop_columns(columns)

Drops specified columns from the DataFrame.

Parameters:

- `columns` (List[str], required): A list of column names to drop.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance without the dropped columns.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def drop_columns(self, columns: List[str]) -> "FlowDataEngine":
    """Drops specified columns from the DataFrame.

    Args:
        columns: A list of column names to drop.

    Returns:
        A new `FlowDataEngine` instance without the dropped columns.
    """
    cols_for_select = tuple(set(self.columns) - set(columns))
    idx_to_keep = [self.cols_idx.get(c) for c in cols_for_select]
    new_schema = [self.schema[i] for i in idx_to_keep]

    return FlowDataEngine(
        self.data_frame.select(cols_for_select),
        number_of_records=self.number_of_records,
        schema=new_schema
    )
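
A minimal sketch, assuming the import path inferred from the source reference:

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

engine = FlowDataEngine(pl.LazyFrame({"id": [1, 2], "debug_flag": [True, False]}))
slim = engine.drop_columns(["debug_flag"])
print(slim.columns)  # the dropped column is gone; note the remaining column order is not guaranteed
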
from_cloud_storage_obj(settings) classmethod

Creates a FlowDataEngine from an object in cloud storage.

This method supports reading from various cloud storage providers like AWS S3, Azure Data Lake Storage, and Google Cloud Storage, with support for various authentication methods.

Parameters:

- `settings` (CloudStorageReadSettingsInternal, required): A `CloudStorageReadSettingsInternal` object containing connection details, file format, and read options.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance containing the data from cloud storage.

Raises:

- ValueError: If the storage type or file format is not supported.
- NotImplementedError: If a requested file format like "delta" or "iceberg" is not yet implemented.
- Exception: If reading from cloud storage fails.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def from_cloud_storage_obj(cls, settings: cloud_storage_schemas.CloudStorageReadSettingsInternal) -> "FlowDataEngine":
    """Creates a FlowDataEngine from an object in cloud storage.

    This method supports reading from various cloud storage providers like AWS S3,
    Azure Data Lake Storage, and Google Cloud Storage, with support for
    various authentication methods.

    Args:
        settings: A `CloudStorageReadSettingsInternal` object containing connection
            details, file format, and read options.

    Returns:
        A new `FlowDataEngine` instance containing the data from cloud storage.

    Raises:
        ValueError: If the storage type or file format is not supported.
        NotImplementedError: If a requested file format like "delta" or "iceberg"
            is not yet implemented.
        Exception: If reading from cloud storage fails.
    """
    connection = settings.connection
    read_settings = settings.read_settings

    logger.info(f"Reading from {connection.storage_type} storage: {read_settings.resource_path}")
    # Get storage options based on connection type
    storage_options = CloudStorageReader.get_storage_options(connection)
    # Get credential provider if needed
    credential_provider = CloudStorageReader.get_credential_provider(connection)
    if read_settings.file_format == "parquet":
        return cls._read_parquet_from_cloud(
            read_settings.resource_path,
            storage_options,
            credential_provider,
            read_settings.scan_mode == "directory",
        )
    elif read_settings.file_format == "delta":
        return cls._read_delta_from_cloud(
            read_settings.resource_path,
            storage_options,
            credential_provider,
            read_settings
        )
    elif read_settings.file_format == "csv":
        return cls._read_csv_from_cloud(
            read_settings.resource_path,
            storage_options,
            credential_provider,
            read_settings
        )
    elif read_settings.file_format == "json":
        return cls._read_json_from_cloud(
            read_settings.resource_path,
            storage_options,
            credential_provider,
            read_settings.scan_mode == "directory"
        )
    elif read_settings.file_format == "iceberg":
        return cls._read_iceberg_from_cloud(
            read_settings.resource_path,
            storage_options,
            credential_provider,
            read_settings
        )

    elif read_settings.file_format in ["delta", "iceberg"]:
        # These would require additional libraries
        raise NotImplementedError(f"File format {read_settings.file_format} not yet implemented")
    else:
        raise ValueError(f"Unsupported file format: {read_settings.file_format}")
fuzzy_match(right, left_on, right_on, fuzzy_method='levenshtein', threshold=0.75)

Performs a simple fuzzy match between two DataFrames on a single column pair.

This is a convenience method for a common fuzzy join scenario.

Parameters:

- `right` (FlowDataEngine, required): The right `FlowDataEngine` to match against.
- `left_on` (str, required): The column name from the left DataFrame to match on.
- `right_on` (str, required): The column name from the right DataFrame to match on.
- `fuzzy_method` (str, default `'levenshtein'`): The fuzzy matching algorithm to use (e.g., 'levenshtein').
- `threshold` (float, default `0.75`): The similarity score threshold (0.0 to 1.0) for a match.

Returns:

- FlowDataEngine: A new `FlowDataEngine` with the matched data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def fuzzy_match(self, right: "FlowDataEngine", left_on: str, right_on: str,
                fuzzy_method: str = 'levenshtein', threshold: float = 0.75) -> "FlowDataEngine":
    """Performs a simple fuzzy match between two DataFrames on a single column pair.

    This is a convenience method for a common fuzzy join scenario.

    Args:
        right: The right `FlowDataEngine` to match against.
        left_on: The column name from the left DataFrame to match on.
        right_on: The column name from the right DataFrame to match on.
        fuzzy_method: The fuzzy matching algorithm to use (e.g., 'levenshtein').
        threshold: The similarity score threshold (0.0 to 1.0) for a match.

    Returns:
        A new `FlowDataEngine` with the matched data.
    """
    fuzzy_match_input = transform_schemas.FuzzyMatchInput(
        [transform_schemas.FuzzyMap(
            left_on, right_on,
            fuzzy_type=fuzzy_method,
            threshold_score=threshold
        )],
        left_select=self.columns,
        right_select=right.columns
    )
    return self.do_fuzzy_join(fuzzy_match_input, right, str(id(self)))
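
A minimal sketch of the convenience wrapper, using only the documented signature and an import path inferred from the source reference. Because it delegates to `do_fuzzy_join`, the call blocks and may hand the work to a worker process.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

left = FlowDataEngine(pl.LazyFrame({"company": ["Acme Inc", "Globex"]}))
right = FlowDataEngine(pl.LazyFrame({"supplier": ["ACME Incorporated", "Initech"]}))
matched = left.fuzzy_match(right, left_on="company", right_on="supplier",
                           fuzzy_method="levenshtein", threshold=0.6)
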
generate_enumerator(length=1000, output_name='output_column') classmethod

Generates a FlowDataEngine with a single column containing a sequence of integers.

Parameters:

- `length` (int, default `1000`): The number of integers to generate in the sequence.
- `output_name` (str, default `'output_column'`): The name of the output column.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def generate_enumerator(cls, length: int = 1000, output_name: str = 'output_column') -> "FlowDataEngine":
    """Generates a FlowDataEngine with a single column containing a sequence of integers.

    Args:
        length: The number of integers to generate in the sequence.
        output_name: The name of the output column.

    Returns:
        A new `FlowDataEngine` instance.
    """
    if length > 10_000_000:
        length = 10_000_000
    return cls(pl.LazyFrame().select((pl.int_range(0, length, dtype=pl.UInt32)).alias(output_name)))
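
A minimal sketch of the classmethod, assuming the import path inferred from the source reference:

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

ids = FlowDataEngine.generate_enumerator(length=5, output_name="row_id")
print(ids.to_pylist())  # [{'row_id': 0}, {'row_id': 1}, ..., {'row_id': 4}]
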
get_estimated_file_size()

Estimates the file size in bytes if the data originated from a local file.

This relies on the original path being tracked during file ingestion.

Returns:

- int: The file size in bytes, or 0 if the original path is unknown.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_estimated_file_size(self) -> int:
    """Estimates the file size in bytes if the data originated from a local file.

    This relies on the original path being tracked during file ingestion.

    Returns:
        The file size in bytes, or 0 if the original path is unknown.
    """
    if self._org_path is not None:
        return os.path.getsize(self._org_path)
    return 0
get_number_of_records(warn=False, force_calculate=False, calculate_in_worker_process=False)

Gets the total number of records in the DataFrame.

For lazy frames, this may trigger a full data scan, which can be expensive.

Parameters:

- `warn` (bool, default `False`): If True, logs a warning if a potentially expensive calculation is triggered.
- `force_calculate` (bool, default `False`): If True, forces recalculation even if a value is cached.
- `calculate_in_worker_process` (bool, default `False`): If True, offloads the calculation to a worker process.

Returns:

- int: The total number of records.

Raises:

- ValueError: If the number of records could not be determined.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_number_of_records(self, warn: bool = False, force_calculate: bool = False,
                          calculate_in_worker_process: bool = False) -> int:
    """Gets the total number of records in the DataFrame.

    For lazy frames, this may trigger a full data scan, which can be expensive.

    Args:
        warn: If True, logs a warning if a potentially expensive calculation is triggered.
        force_calculate: If True, forces recalculation even if a value is cached.
        calculate_in_worker_process: If True, offloads the calculation to a worker process.

    Returns:
        The total number of records.

    Raises:
        ValueError: If the number of records could not be determined.
    """
    if self.is_future and not self.is_collected:
        return -1
    calculate_in_worker_process = False if not OFFLOAD_TO_WORKER else calculate_in_worker_process
    if self.number_of_records is None or self.number_of_records < 0 or force_calculate:
        if self._number_of_records_callback is not None:
            self._number_of_records_callback(self)

        if self.lazy:
            if calculate_in_worker_process:
                try:
                    self.number_of_records = self._calculate_number_of_records_in_worker()
                    return self.number_of_records
                except Exception as e:
                    logger.error(f"Error: {e}")
            if warn:
                logger.warning('Calculating the number of records this can be expensive on a lazy frame')
            try:
                self.number_of_records = self.data_frame.select(pl.len()).collect(
                    engine="streaming" if self._streamable else "auto")[0, 0]
            except Exception:
                raise ValueError('Could not get number of records')
        else:
            self.number_of_records = self.data_frame.__len__()
    return self.number_of_records
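
A minimal sketch, assuming the import path inferred from the source reference; on a lazy frame the call below may trigger a full scan, as the docstring warns.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

engine = FlowDataEngine(pl.LazyFrame({"x": [1, 2, 3]}))
print(engine.get_number_of_records(warn=True))  # 3
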
get_output_sample(n_rows=10)

Gets a sample of the data as a list of dictionaries.

This is typically used to display a preview of the data in a UI.

Parameters:

- `n_rows` (int, default `10`): The number of rows to sample.

Returns:

- List[Dict]: A list of dictionaries, where each dictionary represents a row.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_output_sample(self, n_rows: int = 10) -> List[Dict]:
    """Gets a sample of the data as a list of dictionaries.

    This is typically used to display a preview of the data in a UI.

    Args:
        n_rows: The number of rows to sample.

    Returns:
        A list of dictionaries, where each dictionary represents a row.
    """
    if self.number_of_records > n_rows or self.number_of_records < 0:
        df = self.collect(n_rows)
    else:
        df = self.collect()
    return df.to_dicts()
get_record_count()

Returns a new FlowDataEngine with a single column 'number_of_records' containing the total number of records.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_record_count(self) -> "FlowDataEngine":
    """Returns a new FlowDataEngine with a single column 'number_of_records'
    containing the total number of records.

    Returns:
        A new `FlowDataEngine` instance.
    """
    return FlowDataEngine(self.data_frame.select(pl.len().alias('number_of_records')))
get_sample(n_rows=100, random=False, shuffle=False, seed=None)

Gets a sample of rows from the DataFrame.

Parameters:

- `n_rows` (int, default `100`): The number of rows to sample.
- `random` (bool, default `False`): If True, performs random sampling. If False, takes the first n_rows.
- `shuffle` (bool, default `False`): If True (and `random` is True), shuffles the data before sampling.
- `seed` (int, default `None`): A random seed for reproducibility.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance containing the sampled data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_sample(self, n_rows: int = 100, random: bool = False, shuffle: bool = False,
               seed: int = None) -> "FlowDataEngine":
    """Gets a sample of rows from the DataFrame.

    Args:
        n_rows: The number of rows to sample.
        random: If True, performs random sampling. If False, takes the first n_rows.
        shuffle: If True (and `random` is True), shuffles the data before sampling.
        seed: A random seed for reproducibility.

    Returns:
        A new `FlowDataEngine` instance containing the sampled data.
    """
    n_records = min(n_rows, self.get_number_of_records(calculate_in_worker_process=OFFLOAD_TO_WORKER))
    logging.info(f'Getting sample of {n_rows} rows')

    if random:
        if self.lazy and self.external_source is not None:
            self.collect_external()

        if self.lazy and shuffle:
            sample_df = self.data_frame.collect(
                engine="streaming" if self._streamable else "auto"
            ).sample(n_rows, seed=seed, shuffle=shuffle)
        elif shuffle:
            sample_df = self.data_frame.sample(n_rows, seed=seed, shuffle=shuffle)
        else:
            every_n_records = ceil(self.number_of_records / n_rows)
            sample_df = self.data_frame.gather_every(every_n_records)
    else:
        if self.external_source:
            self.collect(n_rows)
        sample_df = self.data_frame.head(n_rows)

    return FlowDataEngine(sample_df, schema=self.schema, number_of_records=n_records)
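
A minimal sketch contrasting head-style and random sampling, assuming the import path inferred from the source reference:

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

engine = FlowDataEngine(pl.LazyFrame({"x": list(range(1_000))}))
first_ten = engine.get_sample(n_rows=10)                                      # first 10 rows
shuffled = engine.get_sample(n_rows=10, random=True, shuffle=True, seed=42)   # reproducible random sample
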
get_schema_column(col_name)

Retrieves the schema information for a single column by its name.

Parameters:

- `col_name` (str, required): The name of the column to retrieve.

Returns:

- FlowfileColumn: A `FlowfileColumn` object for the specified column, or `None` if not found.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_schema_column(self, col_name: str) -> FlowfileColumn:
    """Retrieves the schema information for a single column by its name.

    Args:
        col_name: The name of the column to retrieve.

    Returns:
        A `FlowfileColumn` object for the specified column, or `None` if not found.
    """
    for s in self.schema:
        if s.name == col_name:
            return s
get_select_inputs()

Gets SelectInput specifications for all columns in the current schema.

Returns:

- SelectInputs: A `SelectInputs` object that can be used to configure selection or transformation operations.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_select_inputs(self) -> transform_schemas.SelectInputs:
    """Gets `SelectInput` specifications for all columns in the current schema.

    Returns:
        A `SelectInputs` object that can be used to configure selection or
        transformation operations.
    """
    return transform_schemas.SelectInputs(
        [transform_schemas.SelectInput(old_name=c.name, data_type=c.data_type) for c in self.schema]
    )
get_subset(n_rows=100)

Gets the first n_rows from the DataFrame.

Parameters:

- `n_rows` (int, default `100`): The number of rows to include in the subset.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance containing the subset of data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_subset(self, n_rows: int = 100) -> "FlowDataEngine":
    """Gets the first `n_rows` from the DataFrame.

    Args:
        n_rows: The number of rows to include in the subset.

    Returns:
        A new `FlowDataEngine` instance containing the subset of data.
    """
    if not self.lazy:
        return FlowDataEngine(self.data_frame.head(n_rows), calculate_schema_stats=True)
    else:
        return FlowDataEngine(self.data_frame.head(n_rows), calculate_schema_stats=True)
initialize_empty_fl()

Initializes an empty LazyFrame.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def initialize_empty_fl(self):
    """Initializes an empty LazyFrame."""
    self.data_frame = pl.LazyFrame()
    self.number_of_records = 0
    self._lazy = True
iter_batches(batch_size=1000, columns=None)

Iterates over the DataFrame in batches.

Parameters:

- `batch_size` (int, default `1000`): The size of each batch.
- `columns` (Union[List, Tuple, str], default `None`): A list of column names to include in the batches. If None, all columns are included.

Yields:

- FlowDataEngine: A `FlowDataEngine` instance for each batch.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def iter_batches(self, batch_size: int = 1000,
                 columns: Union[List, Tuple, str] = None) -> Generator["FlowDataEngine", None, None]:
    """Iterates over the DataFrame in batches.

    Args:
        batch_size: The size of each batch.
        columns: A list of column names to include in the batches. If None,
            all columns are included.

    Yields:
        A `FlowDataEngine` instance for each batch.
    """
    if columns:
        self.data_frame = self.data_frame.select(columns)
    self.lazy = False
    batches = self.data_frame.iter_slices(batch_size)
    for batch in batches:
        yield FlowDataEngine(batch)
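
A minimal sketch, assuming the import path inferred from the source reference. Note that the method switches the engine out of lazy mode before slicing, so the data is materialized.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

engine = FlowDataEngine(pl.LazyFrame({"x": list(range(2_500))}))
for batch in engine.iter_batches(batch_size=1_000, columns=["x"]):
    print(batch.get_number_of_records())  # 1000, 1000, 500
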
join(join_input, auto_generate_selection, verify_integrity, other)

Performs a standard SQL-style join with another DataFrame.

Supports various join types like 'inner', 'left', 'right', 'outer', 'semi', and 'anti'.

Parameters:

- `join_input` (JoinInput, required): A `JoinInput` object defining the join keys, join type, and column selections.
- `auto_generate_selection` (bool, required): If True, automatically handles column renaming.
- `verify_integrity` (bool, required): If True, performs checks to prevent excessively large joins.
- `other` (FlowDataEngine, required): The right `FlowDataEngine` to join with.

Returns:

- FlowDataEngine: A new `FlowDataEngine` with the joined data.

Raises:

- Exception: If the join configuration is invalid or if `verify_integrity` is True and the join is predicted to be too large.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def join(self, join_input: transform_schemas.JoinInput, auto_generate_selection: bool,
         verify_integrity: bool, other: "FlowDataEngine") -> "FlowDataEngine":
    """Performs a standard SQL-style join with another DataFrame.

    Supports various join types like 'inner', 'left', 'right', 'outer', 'semi', and 'anti'.

    Args:
        join_input: A `JoinInput` object defining the join keys, join type,
            and column selections.
        auto_generate_selection: If True, automatically handles column renaming.
        verify_integrity: If True, performs checks to prevent excessively large joins.
        other: The right `FlowDataEngine` to join with.

    Returns:
        A new `FlowDataEngine` with the joined data.

    Raises:
        Exception: If the join configuration is invalid or if `verify_integrity`
            is True and the join is predicted to be too large.
    """
    ensure_right_unselect_for_semi_and_anti_joins(join_input)
    verify_join_select_integrity(join_input, left_columns=self.columns, right_columns=other.columns)
    if not verify_join_map_integrity(join_input, left_columns=self.schema, right_columns=other.schema):
        raise Exception('Join is not valid by the data fields')
    if auto_generate_selection:
        join_input.auto_rename()
    left = self.data_frame.select(get_select_columns(join_input.left_select.renames)).rename(join_input.left_select.rename_table)
    right = other.data_frame.select(get_select_columns(join_input.right_select.renames)).rename(join_input.right_select.rename_table)
    if verify_integrity and join_input.how != 'right':
        n_records = get_join_count(left, right, left_on_keys=join_input.left_join_keys,
                                   right_on_keys=join_input.right_join_keys, how=join_input.how)
        if n_records > 1_000_000_000:
            raise Exception("Join will result in too many records, ending process")
    else:
        n_records = -1
    left, right, reverse_join_key_mapping = _handle_duplication_join_keys(left, right, join_input)
    left, right = rename_df_table_for_join(left, right, join_input.get_join_key_renames())
    if join_input.how == 'right':
        joined_df = right.join(
            other=left,
            left_on=join_input.right_join_keys,
            right_on=join_input.left_join_keys,
            how="left",
            suffix="").rename(reverse_join_key_mapping)
    else:
        joined_df = left.join(
            other=right,
            left_on=join_input.left_join_keys,
            right_on=join_input.right_join_keys,
            how=join_input.how,
            suffix="").rename(reverse_join_key_mapping)
    left_cols_to_delete_after = [get_col_name_to_delete(col, 'left') for col in join_input.left_select.renames
                                 if not col.keep
                                 and col.is_available and col.join_key
                                 ]
    right_cols_to_delete_after = [get_col_name_to_delete(col, 'right') for col in join_input.right_select.renames
                                  if not col.keep
                                  and col.is_available and col.join_key
                                  and join_input.how in ("left", "right", "inner", "cross", "outer")
                                  ]
    if len(right_cols_to_delete_after + left_cols_to_delete_after) > 0:
        joined_df = joined_df.drop(left_cols_to_delete_after + right_cols_to_delete_after)
    undo_join_key_remapping = get_undo_rename_mapping_join(join_input)
    joined_df = joined_df.rename(undo_join_key_remapping)

    if verify_integrity:
        return FlowDataEngine(joined_df, calculate_schema_stats=True,
                              number_of_records=n_records, streamable=False)
    else:
        fl = FlowDataEngine(joined_df, calculate_schema_stats=False,
                            number_of_records=0, streamable=False)
        return fl
make_unique(unique_input=None)

Gets the unique rows from the DataFrame.

Parameters:

- `unique_input` (UniqueInput, default `None`): A `UniqueInput` object specifying a subset of columns to consider for uniqueness and a strategy for keeping rows.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance with unique rows.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def make_unique(self, unique_input: transform_schemas.UniqueInput = None) -> "FlowDataEngine":
    """Gets the unique rows from the DataFrame.

    Args:
        unique_input: A `UniqueInput` object specifying a subset of columns
            to consider for uniqueness and a strategy for keeping rows.

    Returns:
        A new `FlowDataEngine` instance with unique rows.
    """
    if unique_input is None or unique_input.columns is None:
        return FlowDataEngine(self.data_frame.unique())
    return FlowDataEngine(self.data_frame.unique(unique_input.columns, keep=unique_input.strategy))
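
A hedged sketch. The `UniqueInput` keyword names are inferred from the attributes (`columns`, `strategy`) read in the method body above, and the `transform_schemas` import path is an assumption.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference
from flowfile_core.schemas import transform_schemas  # import path assumed

engine = FlowDataEngine(pl.LazyFrame({"email": ["a@x.com", "a@x.com", "b@x.com"], "n": [1, 2, 3]}))
all_cols_unique = engine.make_unique()  # uniqueness over all columns
by_email = engine.make_unique(
    transform_schemas.UniqueInput(columns=["email"], strategy="first")  # field names assumed from the attributes used above
)
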
output(output_fs, flow_id, node_id, execute_remote=True)

Writes the DataFrame to an output file.

Can execute the write operation locally or in a remote worker process.

Parameters:

- `output_fs` (OutputSettings, required): An `OutputSettings` object with details about the output file.
- `flow_id` (int, required): The flow ID for tracking.
- `node_id` (int | str, required): The node ID for tracking.
- `execute_remote` (bool, default `True`): If True, executes the write in a worker process.

Returns:

- FlowDataEngine: The same `FlowDataEngine` instance for chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def output(self, output_fs: input_schema.OutputSettings, flow_id: int, node_id: int | str,
           execute_remote: bool = True) -> "FlowDataEngine":
    """Writes the DataFrame to an output file.

    Can execute the write operation locally or in a remote worker process.

    Args:
        output_fs: An `OutputSettings` object with details about the output file.
        flow_id: The flow ID for tracking.
        node_id: The node ID for tracking.
        execute_remote: If True, executes the write in a worker process.

    Returns:
        The same `FlowDataEngine` instance for chaining.
    """
    logger.info('Starting to write output')
    if execute_remote:
        status = utils.write_output(
            self.data_frame,
            data_type=output_fs.file_type,
            path=output_fs.abs_file_path,
            write_mode=output_fs.write_mode,
            sheet_name=output_fs.output_excel_table.sheet_name,
            delimiter=output_fs.output_csv_table.delimiter,
            flow_id=flow_id,
            node_id=node_id
        )
        tracker = ExternalExecutorTracker(status)
        tracker.get_result()
        logger.info('Finished writing output')
    else:
        logger.info("Starting to write results locally")
        utils.local_write_output(
            self.data_frame,
            data_type=output_fs.file_type,
            path=output_fs.abs_file_path,
            write_mode=output_fs.write_mode,
            sheet_name=output_fs.output_excel_table.sheet_name,
            delimiter=output_fs.output_csv_table.delimiter,
            flow_id=flow_id,
            node_id=node_id,
        )
        logger.info("Finished writing output")
    return self
reorganize_order(column_order)

Reorganizes columns into a specified order.

Parameters:

- `column_order` (List[str], required): A list of column names in the desired order.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance with the columns reordered.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def reorganize_order(self, column_order: List[str]) -> "FlowDataEngine":
    """Reorganizes columns into a specified order.

    Args:
        column_order: A list of column names in the desired order.

    Returns:
        A new `FlowDataEngine` instance with the columns reordered.
    """
    df = self.data_frame.select(column_order)
    schema = sorted(self.schema, key=lambda x: column_order.index(x.column_name))
    return FlowDataEngine(df, schema=schema, number_of_records=self.number_of_records)
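
A minimal sketch, assuming the import path inferred from the source reference; every current column should appear in the new order, since the schema is re-sorted against it.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

engine = FlowDataEngine(pl.LazyFrame({"b": [1], "a": [2], "c": [3]}))
ordered = engine.reorganize_order(["a", "b", "c"])
print(ordered.columns)  # ['a', 'b', 'c']
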
save(path, data_type='parquet')

Saves the DataFrame to a file in a separate thread.

Parameters:

- `path` (str, required): The file path to save to.
- `data_type` (str, default `'parquet'`): The format to save in (e.g., 'parquet', 'csv').

Returns:

- Future: A `loky.Future` object representing the asynchronous save operation.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def save(self, path: str, data_type: str = 'parquet') -> Future:
    """Saves the DataFrame to a file in a separate thread.

    Args:
        path: The file path to save to.
        data_type: The format to save in (e.g., 'parquet', 'csv').

    Returns:
        A `loky.Future` object representing the asynchronous save operation.
    """
    estimated_size = deepcopy(self.get_estimated_file_size() * 4)
    df = deepcopy(self.data_frame)
    return write_threaded(_df=df, path=path, data_type=data_type, estimated_size=estimated_size)
select_columns(list_select)

Selects a subset of columns from the DataFrame.

Parameters:

- `list_select` (Union[List[str], Tuple[str], str], required): A list, tuple, or single string of column names to select.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance containing only the selected columns.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def select_columns(self, list_select: Union[List[str], Tuple[str], str]) -> "FlowDataEngine":
    """Selects a subset of columns from the DataFrame.

    Args:
        list_select: A list, tuple, or single string of column names to select.

    Returns:
        A new `FlowDataEngine` instance containing only the selected columns.
    """
    if isinstance(list_select, str):
        list_select = [list_select]

    idx_to_keep = [self.cols_idx.get(c) for c in list_select]
    selects = [ls for ls, id_to_keep in zip(list_select, idx_to_keep) if id_to_keep is not None]
    new_schema = [self.schema[i] for i in idx_to_keep if i is not None]

    return FlowDataEngine(
        self.data_frame.select(selects),
        number_of_records=self.number_of_records,
        schema=new_schema,
        streamable=self._streamable
    )
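
A minimal sketch, assuming the import path inferred from the source reference; column names that do not exist are silently skipped by the method.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

engine = FlowDataEngine(pl.LazyFrame({"id": [1], "name": ["a"], "age": [30]}))
subset = engine.select_columns(["id", "name"])
single = engine.select_columns("age")          # a single string is also accepted
print(subset.columns, single.columns)
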
set_streamable(streamable=False)

Sets whether DataFrame operations should be streamable.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def set_streamable(self, streamable: bool = False):
    """Sets whether DataFrame operations should be streamable."""
    self._streamable = streamable
solve_graph(graph_solver_input)

Solves a graph problem represented by 'from' and 'to' columns.

This is used for operations like finding connected components in a graph.

Parameters:

- `graph_solver_input` (GraphSolverInput, required): A `GraphSolverInput` object defining the source, destination, and output column names.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance with the solved graph data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def solve_graph(self, graph_solver_input: transform_schemas.GraphSolverInput) -> "FlowDataEngine":
    """Solves a graph problem represented by 'from' and 'to' columns.

    This is used for operations like finding connected components in a graph.

    Args:
        graph_solver_input: A `GraphSolverInput` object defining the source,
            destination, and output column names.

    Returns:
        A new `FlowDataEngine` instance with the solved graph data.
    """
    lf = self.data_frame.with_columns(
        graph_solver(graph_solver_input.col_from, graph_solver_input.col_to)
        .alias(graph_solver_input.output_column_name)
    )
    return FlowDataEngine(lf)
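
A hedged sketch. The `GraphSolverInput` keyword names are inferred from the attributes (`col_from`, `col_to`, `output_column_name`) read in the method body above, and the `transform_schemas` import path is an assumption.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference
from flowfile_core.schemas import transform_schemas  # import path assumed

edges = FlowDataEngine(pl.LazyFrame({"src": [1, 2, 4], "dst": [2, 3, 5]}))
components = edges.solve_graph(
    transform_schemas.GraphSolverInput(col_from="src", col_to="dst", output_column_name="component")  # field names assumed
)
print(components.to_pylist())  # rows in the 1-2-3 chain share one component id, 4-5 another
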
split(split_input)

Splits a column's text values into multiple rows based on a delimiter.

This operation is often referred to as "exploding" the DataFrame, as it increases the number of rows.

Parameters:

- `split_input` (TextToRowsInput, required): A `TextToRowsInput` object specifying the column to split, the delimiter, and the output column name.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance with the exploded rows.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def split(self, split_input: transform_schemas.TextToRowsInput) -> "FlowDataEngine":
    """Splits a column's text values into multiple rows based on a delimiter.

    This operation is often referred to as "exploding" the DataFrame, as it
    increases the number of rows.

    Args:
        split_input: A `TextToRowsInput` object specifying the column to split,
            the delimiter, and the output column name.

    Returns:
        A new `FlowDataEngine` instance with the exploded rows.
    """
    output_column_name = (
        split_input.output_column_name
        if split_input.output_column_name
        else split_input.column_to_split
    )

    split_value = (
        split_input.split_fixed_value
        if split_input.split_by_fixed_value
        else pl.col(split_input.split_by_column)
    )

    df = (
        self.data_frame.with_columns(
            pl.col(split_input.column_to_split)
            .str.split(by=split_value)
            .alias(output_column_name)
        )
        .explode(output_column_name)
    )

    return FlowDataEngine(df)
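
A hedged sketch. The `TextToRowsInput` keyword names are inferred from the attributes read in the method body above (`column_to_split`, `split_by_fixed_value`, `split_fixed_value`, `output_column_name`), and the `transform_schemas` import path is an assumption.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference
from flowfile_core.schemas import transform_schemas  # import path assumed

tagged = FlowDataEngine(pl.LazyFrame({"id": [1, 2], "tags": ["red,blue", "green"]}))
exploded = tagged.split(
    transform_schemas.TextToRowsInput(   # field names assumed from the attributes used above
        column_to_split="tags",
        split_by_fixed_value=True,
        split_fixed_value=",",
        output_column_name="tag",
    )
)
print(exploded.to_pylist())  # one row per tag
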
start_fuzzy_join(fuzzy_match_input, other, file_ref, flow_id=-1, node_id=-1)

Starts a fuzzy join operation in a background process.

This method prepares the data and initiates the fuzzy matching in a separate process, returning a tracker object immediately.

Parameters:

- `fuzzy_match_input` (FuzzyMatchInput, required): A `FuzzyMatchInput` object with the matching parameters.
- `other` (FlowDataEngine, required): The right `FlowDataEngine` to join with.
- `file_ref` (str, required): A reference string for temporary files.
- `flow_id` (int, default `-1`): The flow ID for tracking.
- `node_id` (int | str, default `-1`): The node ID for tracking.

Returns:

- ExternalFuzzyMatchFetcher: An `ExternalFuzzyMatchFetcher` object that can be used to track the progress and retrieve the result of the fuzzy join.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def start_fuzzy_join(self, fuzzy_match_input: transform_schemas.FuzzyMatchInput,
                     other: "FlowDataEngine", file_ref: str, flow_id: int = -1,
                     node_id: int | str = -1) -> ExternalFuzzyMatchFetcher:
    """Starts a fuzzy join operation in a background process.

    This method prepares the data and initiates the fuzzy matching in a
    separate process, returning a tracker object immediately.

    Args:
        fuzzy_match_input: A `FuzzyMatchInput` object with the matching parameters.
        other: The right `FlowDataEngine` to join with.
        file_ref: A reference string for temporary files.
        flow_id: The flow ID for tracking.
        node_id: The node ID for tracking.

    Returns:
        An `ExternalFuzzyMatchFetcher` object that can be used to track the
        progress and retrieve the result of the fuzzy join.
    """
    left_df, right_df = prepare_for_fuzzy_match(left=self, right=other,
                                                fuzzy_match_input=fuzzy_match_input)
    return ExternalFuzzyMatchFetcher(left_df, right_df,
                                     fuzzy_maps=fuzzy_match_input.fuzzy_maps,
                                     file_ref=file_ref + '_fm',
                                     wait_on_completion=False,
                                     flow_id=flow_id,
                                     node_id=node_id)
to_arrow()

Converts the DataFrame to a PyArrow Table.

This method triggers a .collect() call if the data is lazy, then converts the resulting eager DataFrame into a pyarrow.Table.

Returns:

- Table: A `pyarrow.Table` instance representing the data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def to_arrow(self) -> PaTable:
    """Converts the DataFrame to a PyArrow Table.

    This method triggers a `.collect()` call if the data is lazy,
    then converts the resulting eager DataFrame into a `pyarrow.Table`.

    Returns:
        A `pyarrow.Table` instance representing the data.
    """
    if self.lazy:
        return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_arrow()
    else:
        return self.data_frame.to_arrow()
to_cloud_storage_obj(settings)

Writes the DataFrame to an object in cloud storage.

This method supports writing to various cloud storage providers like AWS S3, Azure Data Lake Storage, and Google Cloud Storage.

Parameters:

- `settings` (CloudStorageWriteSettingsInternal, required): A `CloudStorageWriteSettingsInternal` object containing connection details, file format, and write options.

Raises:

- ValueError: If the specified file format is not supported for writing.
- NotImplementedError: If the 'append' write mode is used with an unsupported format.
- Exception: If the write operation to cloud storage fails for any reason.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def to_cloud_storage_obj(self, settings: cloud_storage_schemas.CloudStorageWriteSettingsInternal):
    """Writes the DataFrame to an object in cloud storage.

    This method supports writing to various cloud storage providers like AWS S3,
    Azure Data Lake Storage, and Google Cloud Storage.

    Args:
        settings: A `CloudStorageWriteSettingsInternal` object containing connection
            details, file format, and write options.

    Raises:
        ValueError: If the specified file format is not supported for writing.
        NotImplementedError: If the 'append' write mode is used with an unsupported format.
        Exception: If the write operation to cloud storage fails for any reason.
    """
    connection = settings.connection
    write_settings = settings.write_settings

    logger.info(f"Writing to {connection.storage_type} storage: {write_settings.resource_path}")

    if write_settings.write_mode == 'append' and write_settings.file_format != "delta":
        raise NotImplementedError("The 'append' write mode is not yet supported for this destination.")
    storage_options = CloudStorageReader.get_storage_options(connection)
    credential_provider = CloudStorageReader.get_credential_provider(connection)
    # Dispatch to the correct writer based on file format
    if write_settings.file_format == "parquet":
        self._write_parquet_to_cloud(
            write_settings.resource_path,
            storage_options,
            credential_provider,
            write_settings
        )
    elif write_settings.file_format == "delta":
        self._write_delta_to_cloud(
            write_settings.resource_path,
            storage_options,
            credential_provider,
            write_settings
        )
    elif write_settings.file_format == "csv":
        self._write_csv_to_cloud(
            write_settings.resource_path,
            storage_options,
            credential_provider,
            write_settings
        )
    elif write_settings.file_format == "json":
        self._write_json_to_cloud(
            write_settings.resource_path,
            storage_options,
            credential_provider,
            write_settings
        )
    else:
        raise ValueError(f"Unsupported file format for writing: {write_settings.file_format}")

    logger.info(f"Successfully wrote data to {write_settings.resource_path}")
to_dict()

Converts the DataFrame to a Python dictionary of columns.

Each key in the dictionary is a column name, and the corresponding value is a list of the data in that column.

Returns:

- Dict[str, List]: A dictionary mapping column names to lists of their values.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def to_dict(self) -> Dict[str, List]:
    """Converts the DataFrame to a Python dictionary of columns.

     Each key in the dictionary is a column name, and the corresponding value
     is a list of the data in that column.

     Returns:
         A dictionary mapping column names to lists of their values.
     """
    if self.lazy:
        return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_dict(as_series=False)
    else:
        return self.data_frame.to_dict(as_series=False)
to_pylist()

Converts the DataFrame to a list of Python dictionaries.

Returns:

- List[Dict]: A list where each item is a dictionary representing a row.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def to_pylist(self) -> List[Dict]:
    """Converts the DataFrame to a list of Python dictionaries.

    Returns:
        A list where each item is a dictionary representing a row.
    """
    if self.lazy:
        return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_dicts()
    return self.data_frame.to_dicts()
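
A minimal sketch of the two eager conversion helpers (`to_dict` above and `to_pylist` here), assuming the import path inferred from the source reference; both collect the frame when it is lazy.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

engine = FlowDataEngine(pl.LazyFrame({"id": [1, 2], "name": ["a", "b"]}))
print(engine.to_dict())    # {'id': [1, 2], 'name': ['a', 'b']}
print(engine.to_pylist())  # [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
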
to_raw_data()

Converts the DataFrame to a RawData schema object.

Returns:

- RawData: An `input_schema.RawData` object containing the schema and data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def to_raw_data(self) -> input_schema.RawData:
    """Converts the DataFrame to a `RawData` schema object.

    Returns:
        An `input_schema.RawData` object containing the schema and data.
    """
    columns = [c.get_minimal_field_info() for c in self.schema]
    data = list(self.to_dict().values())
    return input_schema.RawData(columns=columns, data=data)
unpivot(unpivot_input)

Converts the DataFrame from a wide to a long format.

This is the inverse of a pivot operation, taking columns and transforming them into variable and value rows.

Parameters:

- `unpivot_input` (UnpivotInput, required): An `UnpivotInput` object specifying which columns to unpivot and which to keep as index columns.

Returns:

- FlowDataEngine: A new, unpivoted `FlowDataEngine` instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def unpivot(self, unpivot_input: transform_schemas.UnpivotInput) -> "FlowDataEngine":
    """Converts the DataFrame from a wide to a long format.

    This is the inverse of a pivot operation, taking columns and transforming
    them into `variable` and `value` rows.

    Args:
        unpivot_input: An `UnpivotInput` object specifying which columns to
            unpivot and which to keep as index columns.

    Returns:
        A new, unpivoted `FlowDataEngine` instance.
    """
    lf = self.data_frame

    if unpivot_input.data_type_selector_expr is not None:
        result = lf.unpivot(
            on=unpivot_input.data_type_selector_expr(),
            index=unpivot_input.index_columns
        )
    elif unpivot_input.value_columns is not None:
        result = lf.unpivot(
            on=unpivot_input.value_columns,
            index=unpivot_input.index_columns
        )
    else:
        result = lf.unpivot()

    return FlowDataEngine(result)

FlowfileColumn

The FlowfileColumn is a data class that holds the schema and rich metadata for a single column managed by the FlowDataEngine.

flowfile_core.flowfile.flow_data_engine.flow_file_column.main.FlowfileColumn dataclass
Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_file_column/main.py
@dataclass
class FlowfileColumn:
    column_name: str
    data_type: str
    size: int
    max_value: str
    min_value: str
    col_index: int
    number_of_empty_values: int
    number_of_unique_values: int
    example_values: str
    __sql_type: Optional[Any]
    __is_unique: Optional[bool]
    __nullable: Optional[bool]
    __has_values: Optional[bool]
    average_value: Optional[str]
    __perc_unique: Optional[float]

    def __init__(self, polars_type: PlType):
        self.data_type = convert_pl_type_to_string(polars_type.pl_datatype)
        self.size = polars_type.count - polars_type.null_count
        self.max_value = polars_type.max
        self.min_value = polars_type.min
        self.number_of_unique_values = polars_type.n_unique
        self.number_of_empty_values = polars_type.null_count
        self.example_values = polars_type.examples
        self.column_name = polars_type.column_name
        self.average_value = polars_type.mean
        self.col_index = polars_type.col_index
        self.__has_values = None
        self.__nullable = None
        self.__is_unique = None
        self.__sql_type = None
        self.__perc_unique = None

    @classmethod
    def create_from_polars_type(cls, polars_type: PlType, **kwargs) -> "FlowfileColumn":
        for k, v in kwargs.items():
            if hasattr(polars_type, k):
                setattr(polars_type, k, v)
        return cls(polars_type)

    @classmethod
    def from_input(cls, column_name: str, data_type: str, **kwargs) -> "FlowfileColumn":
        pl_type = cast_str_to_polars_type(data_type)
        if pl_type is not None:
            data_type = pl_type
        return cls(PlType(column_name=column_name, pl_datatype=data_type, **kwargs))

    @classmethod
    def create_from_polars_dtype(cls, column_name: str, data_type: pl.DataType, **kwargs):
        return cls(PlType(column_name=column_name, pl_datatype=data_type, **kwargs))

    def get_minimal_field_info(self) -> input_schema.MinimalFieldInfo:
        return input_schema.MinimalFieldInfo(name=self.column_name, data_type=self.data_type)

    @classmethod
    def create_from_minimal_field_info(cls, minimal_field_info: input_schema.MinimalFieldInfo) -> "FlowfileColumn":
        return cls.from_input(column_name=minimal_field_info.name,
                              data_type=minimal_field_info.data_type)

    @property
    def is_unique(self) -> bool:
        if self.__is_unique is None:
            if self.has_values:
                self.__is_unique = self.number_of_unique_values == self.number_of_filled_values
            else:
                self.__is_unique = False
        return self.__is_unique

    @property
    def perc_unique(self) -> float:
        if self.__perc_unique is None:
            self.__perc_unique = self.number_of_unique_values / self.number_of_filled_values
        return self.__perc_unique

    @property
    def has_values(self) -> bool:
        if not self.__has_values:
            self.__has_values = self.number_of_unique_values > 0
        return self.__has_values

    @property
    def number_of_filled_values(self):
        return self.size

    @property
    def nullable(self):
        if self.__nullable is None:
            self.__nullable = self.number_of_empty_values > 0
        return self.__nullable

    @property
    def name(self):
        return self.column_name

    def get_column_repr(self):
        return dict(name=self.name,
                    size=self.size,
                    data_type=str(self.data_type),
                    has_values=self.has_values,
                    is_unique=self.is_unique,
                    max_value=str(self.max_value),
                    min_value=str(self.min_value),
                    number_of_unique_values=self.number_of_unique_values,
                    number_of_filled_values=self.number_of_filled_values,
                    number_of_empty_values=self.number_of_empty_values,
                    average_size=self.average_value)

    def generic_datatype(self) -> DataTypeGroup:
        if self.data_type in ('Utf8', 'VARCHAR', 'CHAR', 'NVARCHAR', 'String'):
            return 'str'
        elif self.data_type in ('fixed_decimal', 'decimal', 'float', 'integer', 'boolean', 'double', 'Int16', 'Int32',
                                'Int64', 'Float32', 'Float64', 'Decimal', 'Binary', 'Boolean', 'Uint8', 'Uint16',
                                'Uint32', 'Uint64'):
            return 'numeric'
        elif self.data_type in ('datetime', 'date', 'Date', 'Datetime', 'Time'):
            return 'date'

    def get_polars_type(self) -> PlType:
        pl_datatype = cast_str_to_polars_type(self.data_type)
        pl_type = PlType(pl_datatype=pl_datatype, **self.__dict__)
        return pl_type

    def update_type_from_polars_type(self, pl_type: PlType):
        self.data_type = str(pl_type.pl_datatype.base_type())
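
Example: a minimal sketch of constructing a FlowfileColumn from field metadata, assuming the import path shown in the heading above and that the documented classmethods behave as in the listed source (in particular, that PlType supplies defaults for the statistics fields); the column name and data type are illustrative.

from flowfile_core.schemas.input_schema import MinimalFieldInfo
from flowfile_core.flowfile.flow_data_engine.flow_file_column.main import FlowfileColumn

field = MinimalFieldInfo(name="age", data_type="Int64")
col = FlowfileColumn.create_from_minimal_field_info(field)

print(col.name, col.data_type)       # "age" and its Polars type rendered as a string
print(col.generic_datatype())        # integer types fall in the 'numeric' group
print(col.get_minimal_field_info())  # round-trips back to a MinimalFieldInfo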

Data Modeling (Schemas)

This section documents the Pydantic models that define the structure of settings and data.

schemas

flowfile_core.schemas.schemas

Classes:

Name Description
FlowGraphConfig

Configuration model for a flow graph's basic properties.

FlowInformation

Represents the complete state of a flow, including settings, nodes, and connections.

FlowSettings

Extends FlowGraphConfig with additional operational settings for a flow.

NodeDefault

Defines default properties for a node type.

NodeEdge

Represents a connection (edge) between two nodes in the frontend.

NodeInformation

Stores the state and configuration of a specific node instance within a flow.

NodeInput

Represents a node as it is received from the frontend, including position.

NodeTemplate

Defines the template for a node type, specifying its UI and functional characteristics.

RawLogInput

Schema for a raw log message.

VueFlowInput

Represents the complete graph structure from the Vue-based frontend.

FlowGraphConfig pydantic-model

Bases: BaseModel

Configuration model for a flow graph's basic properties.

Attributes:

Name Type Description
flow_id int

Unique identifier for the flow.

description Optional[str]

A description of the flow.

save_location Optional[str]

The location where the flow is saved.

name str

The name of the flow.

path str

The file path associated with the flow.

execution_mode ExecutionModeLiteral

The mode of execution ('Development' or 'Performance').

execution_location ExecutionLocationsLiteral

The location for execution ('auto', 'local', 'remote').

Show JSON schema:
{
  "description": "Configuration model for a flow graph's basic properties.\n\nAttributes:\n    flow_id (int): Unique identifier for the flow.\n    description (Optional[str]): A description of the flow.\n    save_location (Optional[str]): The location where the flow is saved.\n    name (str): The name of the flow.\n    path (str): The file path associated with the flow.\n    execution_mode (ExecutionModeLiteral): The mode of execution ('Development' or 'Performance').\n    execution_location (ExecutionLocationsLiteral): The location for execution ('auto', 'local', 'remote').",
  "properties": {
    "flow_id": {
      "description": "Unique identifier for the flow.",
      "title": "Flow Id",
      "type": "integer"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Description"
    },
    "save_location": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Save Location"
    },
    "name": {
      "default": "",
      "title": "Name",
      "type": "string"
    },
    "path": {
      "default": "",
      "title": "Path",
      "type": "string"
    },
    "execution_mode": {
      "default": "Performance",
      "enum": [
        "Development",
        "Performance"
      ],
      "title": "Execution Mode",
      "type": "string"
    },
    "execution_location": {
      "default": "auto",
      "enum": [
        "auto",
        "local",
        "remote"
      ],
      "title": "Execution Location",
      "type": "string"
    }
  },
  "title": "FlowGraphConfig",
  "type": "object"
}

Fields:

  • flow_id (int)
  • description (Optional[str])
  • save_location (Optional[str])
  • name (str)
  • path (str)
  • execution_mode (ExecutionModeLiteral)
  • execution_location (ExecutionLocationsLiteral)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class FlowGraphConfig(BaseModel):
    """
    Configuration model for a flow graph's basic properties.

    Attributes:
        flow_id (int): Unique identifier for the flow.
        description (Optional[str]): A description of the flow.
        save_location (Optional[str]): The location where the flow is saved.
        name (str): The name of the flow.
        path (str): The file path associated with the flow.
        execution_mode (ExecutionModeLiteral): The mode of execution ('Development' or 'Performance').
        execution_location (ExecutionLocationsLiteral): The location for execution ('auto', 'local', 'remote').
    """
    flow_id: int = Field(default_factory=create_unique_id, description="Unique identifier for the flow.")
    description: Optional[str] = None
    save_location: Optional[str] = None
    name: str = ''
    path: str = ''
    execution_mode: ExecutionModeLiteral = 'Performance'
    execution_location: ExecutionLocationsLiteral = "auto"
flow_id pydantic-field

Unique identifier for the flow.
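
Example: a minimal instantiation sketch, assuming the model is imported from flowfile_core.schemas.schemas as documented above; the flow name is illustrative.

from flowfile_core.schemas.schemas import FlowGraphConfig

cfg = FlowGraphConfig(name="sales_pipeline", execution_mode="Development")
print(cfg.flow_id)             # generated by the create_unique_id default factory
print(cfg.execution_location)  # defaults to "auto"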

FlowInformation pydantic-model

Bases: BaseModel

Represents the complete state of a flow, including settings, nodes, and connections.

Attributes:

Name Type Description
flow_id int

The unique ID of the flow.

flow_name Optional[str]

The name of the flow.

flow_settings FlowSettings

The settings for the flow.

data Dict[int, NodeInformation]

A dictionary mapping node IDs to their information.

node_starts List[int]

A list of starting node IDs.

node_connections List[Tuple[int, int]]

A list of tuples representing connections between nodes.

Show JSON schema:
{
  "$defs": {
    "FlowSettings": {
      "description": "Extends FlowGraphConfig with additional operational settings for a flow.\n\nAttributes:\n    auto_save (bool): Flag to enable or disable automatic saving.\n    modified_on (Optional[float]): Timestamp of the last modification.\n    show_detailed_progress (bool): Flag to show detailed progress during execution.\n    is_running (bool): Indicates if the flow is currently running.\n    is_canceled (bool): Indicates if the flow execution has been canceled.",
      "properties": {
        "flow_id": {
          "description": "Unique identifier for the flow.",
          "title": "Flow Id",
          "type": "integer"
        },
        "description": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Description"
        },
        "save_location": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Save Location"
        },
        "name": {
          "default": "",
          "title": "Name",
          "type": "string"
        },
        "path": {
          "default": "",
          "title": "Path",
          "type": "string"
        },
        "execution_mode": {
          "default": "Performance",
          "enum": [
            "Development",
            "Performance"
          ],
          "title": "Execution Mode",
          "type": "string"
        },
        "execution_location": {
          "default": "auto",
          "enum": [
            "auto",
            "local",
            "remote"
          ],
          "title": "Execution Location",
          "type": "string"
        },
        "auto_save": {
          "default": false,
          "title": "Auto Save",
          "type": "boolean"
        },
        "modified_on": {
          "anyOf": [
            {
              "type": "number"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Modified On"
        },
        "show_detailed_progress": {
          "default": true,
          "title": "Show Detailed Progress",
          "type": "boolean"
        },
        "is_running": {
          "default": false,
          "title": "Is Running",
          "type": "boolean"
        },
        "is_canceled": {
          "default": false,
          "title": "Is Canceled",
          "type": "boolean"
        }
      },
      "title": "FlowSettings",
      "type": "object"
    },
    "NodeInformation": {
      "description": "Stores the state and configuration of a specific node instance within a flow.\n\nAttributes:\n    id (Optional[int]): The unique ID of the node instance.\n    type (Optional[str]): The type of the node (e.g., 'join', 'filter').\n    is_setup (Optional[bool]): Whether the node has been configured.\n    description (Optional[str]): A user-provided description.\n    x_position (Optional[int]): The x-coordinate on the canvas.\n    y_position (Optional[int]): The y-coordinate on the canvas.\n    left_input_id (Optional[int]): The ID of the node connected to the left input.\n    right_input_id (Optional[int]): The ID of the node connected to the right input.\n    input_ids (Optional[List[int]]): A list of IDs for main input nodes.\n    outputs (Optional[List[int]]): A list of IDs for nodes this node outputs to.\n    setting_input (Optional[Any]): The specific settings for this node instance.",
      "properties": {
        "id": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Id"
        },
        "type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Type"
        },
        "is_setup": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Is Setup"
        },
        "description": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": "",
          "title": "Description"
        },
        "x_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": 0,
          "title": "X Position"
        },
        "y_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": 0,
          "title": "Y Position"
        },
        "left_input_id": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Left Input Id"
        },
        "right_input_id": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Right Input Id"
        },
        "input_ids": {
          "anyOf": [
            {
              "items": {
                "type": "integer"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": [
            -1
          ],
          "title": "Input Ids"
        },
        "outputs": {
          "anyOf": [
            {
              "items": {
                "type": "integer"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": [
            -1
          ],
          "title": "Outputs"
        },
        "setting_input": {
          "anyOf": [
            {},
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Setting Input"
        }
      },
      "title": "NodeInformation",
      "type": "object"
    }
  },
  "description": "Represents the complete state of a flow, including settings, nodes, and connections.\n\nAttributes:\n    flow_id (int): The unique ID of the flow.\n    flow_name (Optional[str]): The name of the flow.\n    flow_settings (FlowSettings): The settings for the flow.\n    data (Dict[int, NodeInformation]): A dictionary mapping node IDs to their information.\n    node_starts (List[int]): A list of starting node IDs.\n    node_connections (List[Tuple[int, int]]): A list of tuples representing connections between nodes.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "flow_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Flow Name"
    },
    "flow_settings": {
      "$ref": "#/$defs/FlowSettings"
    },
    "data": {
      "additionalProperties": {
        "$ref": "#/$defs/NodeInformation"
      },
      "default": {},
      "title": "Data",
      "type": "object"
    },
    "node_starts": {
      "items": {
        "type": "integer"
      },
      "title": "Node Starts",
      "type": "array"
    },
    "node_connections": {
      "default": [],
      "items": {
        "maxItems": 2,
        "minItems": 2,
        "prefixItems": [
          {
            "type": "integer"
          },
          {
            "type": "integer"
          }
        ],
        "type": "array"
      },
      "title": "Node Connections",
      "type": "array"
    }
  },
  "required": [
    "flow_id",
    "flow_settings",
    "node_starts"
  ],
  "title": "FlowInformation",
  "type": "object"
}

Fields:

  • flow_id (int)
  • flow_name (Optional[str])
  • flow_settings (FlowSettings)
  • data (Dict[int, NodeInformation])
  • node_starts (List[int])
  • node_connections (List[Tuple[int, int]])

Validators:

  • ensure_string
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class FlowInformation(BaseModel):
    """
    Represents the complete state of a flow, including settings, nodes, and connections.

    Attributes:
        flow_id (int): The unique ID of the flow.
        flow_name (Optional[str]): The name of the flow.
        flow_settings (FlowSettings): The settings for the flow.
        data (Dict[int, NodeInformation]): A dictionary mapping node IDs to their information.
        node_starts (List[int]): A list of starting node IDs.
        node_connections (List[Tuple[int, int]]): A list of tuples representing connections between nodes.
    """
    flow_id: int
    flow_name: Optional[str] = ''
    flow_settings: FlowSettings
    data: Dict[int, NodeInformation] = {}
    node_starts: List[int]
    node_connections: List[Tuple[int, int]] = []

    @field_validator('flow_name', mode="before")
    def ensure_string(cls, v):
        """
        Validator to ensure the flow_name is always a string.
        :param v: The value to validate.
        :return: The value as a string, or an empty string if it's None.
        """
        return str(v) if v is not None else ''
ensure_string(v) pydantic-validator

Validator to ensure the flow_name is always a string. Takes the value to validate and returns it as a string, or an empty string if it is None.

Source code in flowfile_core/flowfile_core/schemas/schemas.py
@field_validator('flow_name', mode="before")
def ensure_string(cls, v):
    """
    Validator to ensure the flow_name is always a string.
    :param v: The value to validate.
    :return: The value as a string, or an empty string if it's None.
    """
    return str(v) if v is not None else ''
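
Example: the sketch below constructs a minimal FlowInformation and shows the ensure_string validator coercing a missing flow name; imports assume flowfile_core.schemas.schemas as above, and the IDs are illustrative.

from flowfile_core.schemas.schemas import FlowInformation, FlowSettings

info = FlowInformation(
    flow_id=1,
    flow_name=None,                                   # coerced to "" by ensure_string
    flow_settings=FlowSettings(flow_id=1, name="demo"),
    node_starts=[1],
)
assert info.flow_name == ""
assert info.node_connections == []                    # defaults to an empty list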
FlowSettings pydantic-model

Bases: FlowGraphConfig

Extends FlowGraphConfig with additional operational settings for a flow.

Attributes:

Name Type Description
auto_save bool

Flag to enable or disable automatic saving.

modified_on Optional[float]

Timestamp of the last modification.

show_detailed_progress bool

Flag to show detailed progress during execution.

is_running bool

Indicates if the flow is currently running.

is_canceled bool

Indicates if the flow execution has been canceled.

Show JSON schema:
{
  "description": "Extends FlowGraphConfig with additional operational settings for a flow.\n\nAttributes:\n    auto_save (bool): Flag to enable or disable automatic saving.\n    modified_on (Optional[float]): Timestamp of the last modification.\n    show_detailed_progress (bool): Flag to show detailed progress during execution.\n    is_running (bool): Indicates if the flow is currently running.\n    is_canceled (bool): Indicates if the flow execution has been canceled.",
  "properties": {
    "flow_id": {
      "description": "Unique identifier for the flow.",
      "title": "Flow Id",
      "type": "integer"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Description"
    },
    "save_location": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Save Location"
    },
    "name": {
      "default": "",
      "title": "Name",
      "type": "string"
    },
    "path": {
      "default": "",
      "title": "Path",
      "type": "string"
    },
    "execution_mode": {
      "default": "Performance",
      "enum": [
        "Development",
        "Performance"
      ],
      "title": "Execution Mode",
      "type": "string"
    },
    "execution_location": {
      "default": "auto",
      "enum": [
        "auto",
        "local",
        "remote"
      ],
      "title": "Execution Location",
      "type": "string"
    },
    "auto_save": {
      "default": false,
      "title": "Auto Save",
      "type": "boolean"
    },
    "modified_on": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modified On"
    },
    "show_detailed_progress": {
      "default": true,
      "title": "Show Detailed Progress",
      "type": "boolean"
    },
    "is_running": {
      "default": false,
      "title": "Is Running",
      "type": "boolean"
    },
    "is_canceled": {
      "default": false,
      "title": "Is Canceled",
      "type": "boolean"
    }
  },
  "title": "FlowSettings",
  "type": "object"
}

Fields:

  • flow_id (int)
  • description (Optional[str])
  • save_location (Optional[str])
  • name (str)
  • path (str)
  • execution_mode (ExecutionModeLiteral)
  • execution_location (ExecutionLocationsLiteral)
  • auto_save (bool)
  • modified_on (Optional[float])
  • show_detailed_progress (bool)
  • is_running (bool)
  • is_canceled (bool)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class FlowSettings(FlowGraphConfig):
    """
    Extends FlowGraphConfig with additional operational settings for a flow.

    Attributes:
        auto_save (bool): Flag to enable or disable automatic saving.
        modified_on (Optional[float]): Timestamp of the last modification.
        show_detailed_progress (bool): Flag to show detailed progress during execution.
        is_running (bool): Indicates if the flow is currently running.
        is_canceled (bool): Indicates if the flow execution has been canceled.
    """
    auto_save: bool = False
    modified_on: Optional[float] = None
    show_detailed_progress: bool = True
    is_running: bool = False
    is_canceled: bool = False

    @classmethod
    def from_flow_settings_input(cls, flow_graph_config: FlowGraphConfig):
        """
        Creates a FlowSettings instance from a FlowGraphConfig instance.

        :param flow_graph_config: The base flow graph configuration.
        :return: A new instance of FlowSettings with data from flow_graph_config.
        """
        return cls.model_validate(flow_graph_config.model_dump())
from_flow_settings_input(flow_graph_config) classmethod

Creates a FlowSettings instance from a FlowGraphConfig instance.

Takes the base flow graph configuration and returns a new FlowSettings instance populated with its data.

Source code in flowfile_core/flowfile_core/schemas/schemas.py
@classmethod
def from_flow_settings_input(cls, flow_graph_config: FlowGraphConfig):
    """
    Creates a FlowSettings instance from a FlowGraphConfig instance.

    :param flow_graph_config: The base flow graph configuration.
    :return: A new instance of FlowSettings with data from flow_graph_config.
    """
    return cls.model_validate(flow_graph_config.model_dump())
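
Example: a short sketch of from_flow_settings_input, assuming the import path documented above; the configuration values are illustrative.

from flowfile_core.schemas.schemas import FlowGraphConfig, FlowSettings

cfg = FlowGraphConfig(name="demo", execution_mode="Development")
settings = FlowSettings.from_flow_settings_input(cfg)

assert settings.name == "demo"          # copied from the FlowGraphConfig
assert settings.auto_save is False      # operational fields keep their defaults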
NodeDefault pydantic-model

Bases: BaseModel

Defines default properties for a node type.

Attributes:

Name Type Description
node_name str

The name of the node.

node_type NodeTypeLiteral

The functional type of the node ('input', 'output', 'process').

transform_type TransformTypeLiteral

The data transformation behavior ('narrow', 'wide', 'other').

has_default_settings Optional[Any]

Indicates if the node has predefined default settings.

Show JSON schema:
{
  "description": "Defines default properties for a node type.\n\nAttributes:\n    node_name (str): The name of the node.\n    node_type (NodeTypeLiteral): The functional type of the node ('input', 'output', 'process').\n    transform_type (TransformTypeLiteral): The data transformation behavior ('narrow', 'wide', 'other').\n    has_default_settings (Optional[Any]): Indicates if the node has predefined default settings.",
  "properties": {
    "node_name": {
      "title": "Node Name",
      "type": "string"
    },
    "node_type": {
      "enum": [
        "input",
        "output",
        "process"
      ],
      "title": "Node Type",
      "type": "string"
    },
    "transform_type": {
      "enum": [
        "narrow",
        "wide",
        "other"
      ],
      "title": "Transform Type",
      "type": "string"
    },
    "has_default_settings": {
      "anyOf": [
        {},
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Has Default Settings"
    }
  },
  "required": [
    "node_name",
    "node_type",
    "transform_type"
  ],
  "title": "NodeDefault",
  "type": "object"
}

Fields:

  • node_name (str)
  • node_type (NodeTypeLiteral)
  • transform_type (TransformTypeLiteral)
  • has_default_settings (Optional[Any])
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class NodeDefault(BaseModel):
    """
    Defines default properties for a node type.

    Attributes:
        node_name (str): The name of the node.
        node_type (NodeTypeLiteral): The functional type of the node ('input', 'output', 'process').
        transform_type (TransformTypeLiteral): The data transformation behavior ('narrow', 'wide', 'other').
        has_default_settings (Optional[Any]): Indicates if the node has predefined default settings.
    """
    node_name: str
    node_type: NodeTypeLiteral
    transform_type: TransformTypeLiteral
    has_default_settings: Optional[Any] = None
NodeEdge pydantic-model

Bases: BaseModel

Represents a connection (edge) between two nodes in the frontend.

Attributes:

Name Type Description
id str

A unique identifier for the edge.

source str

The ID of the source node.

target str

The ID of the target node.

targetHandle str

The specific input handle on the target node.

sourceHandle str

The specific output handle on the source node.

Show JSON schema:
{
  "description": "Represents a connection (edge) between two nodes in the frontend.\n\nAttributes:\n    id (str): A unique identifier for the edge.\n    source (str): The ID of the source node.\n    target (str): The ID of the target node.\n    targetHandle (str): The specific input handle on the target node.\n    sourceHandle (str): The specific output handle on the source node.",
  "properties": {
    "id": {
      "title": "Id",
      "type": "string"
    },
    "source": {
      "title": "Source",
      "type": "string"
    },
    "target": {
      "title": "Target",
      "type": "string"
    },
    "targetHandle": {
      "title": "Targethandle",
      "type": "string"
    },
    "sourceHandle": {
      "title": "Sourcehandle",
      "type": "string"
    }
  },
  "required": [
    "id",
    "source",
    "target",
    "targetHandle",
    "sourceHandle"
  ],
  "title": "NodeEdge",
  "type": "object"
}

Config:

  • coerce_numbers_to_str: True

Fields:

  • id (str)
  • source (str)
  • target (str)
  • targetHandle (str)
  • sourceHandle (str)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class NodeEdge(BaseModel):
    """
    Represents a connection (edge) between two nodes in the frontend.

    Attributes:
        id (str): A unique identifier for the edge.
        source (str): The ID of the source node.
        target (str): The ID of the target node.
        targetHandle (str): The specific input handle on the target node.
        sourceHandle (str): The specific output handle on the source node.
    """
    model_config = ConfigDict(coerce_numbers_to_str=True)
    id: str
    source: str
    target: str
    targetHandle: str
    sourceHandle: str
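
Example: because the model sets coerce_numbers_to_str, numeric node IDs coming from the frontend are accepted and stored as strings. A minimal sketch, with illustrative handle names:

from flowfile_core.schemas.schemas import NodeEdge

edge = NodeEdge(id=1, source=1, target=2,
                targetHandle="input-0", sourceHandle="output-0")
assert edge.source == "1" and edge.target == "2"   # numbers coerced to strings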
NodeInformation pydantic-model

Bases: BaseModel

Stores the state and configuration of a specific node instance within a flow.

Attributes:

Name Type Description
id Optional[int]

The unique ID of the node instance.

type Optional[str]

The type of the node (e.g., 'join', 'filter').

is_setup Optional[bool]

Whether the node has been configured.

description Optional[str]

A user-provided description.

x_position Optional[int]

The x-coordinate on the canvas.

y_position Optional[int]

The y-coordinate on the canvas.

left_input_id Optional[int]

The ID of the node connected to the left input.

right_input_id Optional[int]

The ID of the node connected to the right input.

input_ids Optional[List[int]]

A list of IDs for main input nodes.

outputs Optional[List[int]]

A list of IDs for nodes this node outputs to.

setting_input Optional[Any]

The specific settings for this node instance.

Show JSON schema:
{
  "description": "Stores the state and configuration of a specific node instance within a flow.\n\nAttributes:\n    id (Optional[int]): The unique ID of the node instance.\n    type (Optional[str]): The type of the node (e.g., 'join', 'filter').\n    is_setup (Optional[bool]): Whether the node has been configured.\n    description (Optional[str]): A user-provided description.\n    x_position (Optional[int]): The x-coordinate on the canvas.\n    y_position (Optional[int]): The y-coordinate on the canvas.\n    left_input_id (Optional[int]): The ID of the node connected to the left input.\n    right_input_id (Optional[int]): The ID of the node connected to the right input.\n    input_ids (Optional[List[int]]): A list of IDs for main input nodes.\n    outputs (Optional[List[int]]): A list of IDs for nodes this node outputs to.\n    setting_input (Optional[Any]): The specific settings for this node instance.",
  "properties": {
    "id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Id"
    },
    "type": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Type"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "x_position": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "X Position"
    },
    "y_position": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Y Position"
    },
    "left_input_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Left Input Id"
    },
    "right_input_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Right Input Id"
    },
    "input_ids": {
      "anyOf": [
        {
          "items": {
            "type": "integer"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": [
        -1
      ],
      "title": "Input Ids"
    },
    "outputs": {
      "anyOf": [
        {
          "items": {
            "type": "integer"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": [
        -1
      ],
      "title": "Outputs"
    },
    "setting_input": {
      "anyOf": [
        {},
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Setting Input"
    }
  },
  "title": "NodeInformation",
  "type": "object"
}

Fields:

  • id (Optional[int])
  • type (Optional[str])
  • is_setup (Optional[bool])
  • description (Optional[str])
  • x_position (Optional[int])
  • y_position (Optional[int])
  • left_input_id (Optional[int])
  • right_input_id (Optional[int])
  • input_ids (Optional[List[int]])
  • outputs (Optional[List[int]])
  • setting_input (Optional[Any])
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class NodeInformation(BaseModel):
    """
    Stores the state and configuration of a specific node instance within a flow.

    Attributes:
        id (Optional[int]): The unique ID of the node instance.
        type (Optional[str]): The type of the node (e.g., 'join', 'filter').
        is_setup (Optional[bool]): Whether the node has been configured.
        description (Optional[str]): A user-provided description.
        x_position (Optional[int]): The x-coordinate on the canvas.
        y_position (Optional[int]): The y-coordinate on the canvas.
        left_input_id (Optional[int]): The ID of the node connected to the left input.
        right_input_id (Optional[int]): The ID of the node connected to the right input.
        input_ids (Optional[List[int]]): A list of IDs for main input nodes.
        outputs (Optional[List[int]]): A list of IDs for nodes this node outputs to.
        setting_input (Optional[Any]): The specific settings for this node instance.
    """
    id: Optional[int] = None
    type: Optional[str] = None
    is_setup: Optional[bool] = None
    description: Optional[str] = ''
    x_position: Optional[int] = 0
    y_position: Optional[int] = 0
    left_input_id: Optional[int] = None
    right_input_id: Optional[int] = None
    input_ids: Optional[List[int]] = [-1]
    outputs: Optional[List[int]] = [-1]
    setting_input: Optional[Any] = None

    @property
    def data(self) -> Any:
        """
        Property to access the node's specific settings.
        :return: The settings of the node.
        """
        return self.setting_input

    @property
    def main_input_ids(self) -> Optional[List[int]]:
        """
        Property to access the main input node IDs.
        :return: A list of main input node IDs.
        """
        return self.input_ids
data property

Property to access the node's specific settings; returns the settings of the node.

main_input_ids property

Property to access the main input node IDs; returns a list of main input node IDs.
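
Example: a minimal sketch of the model's defaults and convenience properties, assuming the import path documented above; the field values are illustrative.

from flowfile_core.schemas.schemas import NodeInformation

node = NodeInformation(id=3, type="filter", is_setup=True,
                       x_position=120, y_position=80)
print(node.main_input_ids)   # [-1] until real inputs are connected
print(node.data)             # alias for setting_input; None until configured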

NodeInput pydantic-model

Bases: NodeTemplate

Represents a node as it is received from the frontend, including position.

Attributes:

Name Type Description
id int

The unique ID of the node instance.

pos_x float

The x-coordinate on the canvas.

pos_y float

The y-coordinate on the canvas.

Show JSON schema:
{
  "description": "Represents a node as it is received from the frontend, including position.\n\nAttributes:\n    id (int): The unique ID of the node instance.\n    pos_x (float): The x-coordinate on the canvas.\n    pos_y (float): The y-coordinate on the canvas.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "item": {
      "title": "Item",
      "type": "string"
    },
    "input": {
      "title": "Input",
      "type": "integer"
    },
    "output": {
      "title": "Output",
      "type": "integer"
    },
    "image": {
      "title": "Image",
      "type": "string"
    },
    "multi": {
      "default": false,
      "title": "Multi",
      "type": "boolean"
    },
    "node_group": {
      "title": "Node Group",
      "type": "string"
    },
    "prod_ready": {
      "default": true,
      "title": "Prod Ready",
      "type": "boolean"
    },
    "can_be_start": {
      "default": false,
      "title": "Can Be Start",
      "type": "boolean"
    },
    "id": {
      "title": "Id",
      "type": "integer"
    },
    "pos_x": {
      "title": "Pos X",
      "type": "number"
    },
    "pos_y": {
      "title": "Pos Y",
      "type": "number"
    }
  },
  "required": [
    "name",
    "item",
    "input",
    "output",
    "image",
    "node_group",
    "id",
    "pos_x",
    "pos_y"
  ],
  "title": "NodeInput",
  "type": "object"
}

Fields:

  • name (str)
  • item (str)
  • input (int)
  • output (int)
  • image (str)
  • multi (bool)
  • node_group (str)
  • prod_ready (bool)
  • can_be_start (bool)
  • id (int)
  • pos_x (float)
  • pos_y (float)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class NodeInput(NodeTemplate):
    """
    Represents a node as it is received from the frontend, including position.

    Attributes:
        id (int): The unique ID of the node instance.
        pos_x (float): The x-coordinate on the canvas.
        pos_y (float): The y-coordinate on the canvas.
    """
    id: int
    pos_x: float
    pos_y: float
NodeTemplate pydantic-model

Bases: BaseModel

Defines the template for a node type, specifying its UI and functional characteristics.

Attributes:

Name Type Description
name str

The display name of the node.

item str

The unique identifier for the node type.

input int

The number of required input connections.

output int

The number of output connections.

image str

The filename of the icon for the node.

multi bool

Whether the node accepts multiple main input connections.

node_group str

The category group the node belongs to (e.g., 'input', 'transform').

prod_ready bool

Whether the node is considered production-ready.

can_be_start bool

Whether the node can be a starting point in a flow.

Show JSON schema:
{
  "description": "Defines the template for a node type, specifying its UI and functional characteristics.\n\nAttributes:\n    name (str): The display name of the node.\n    item (str): The unique identifier for the node type.\n    input (int): The number of required input connections.\n    output (int): The number of output connections.\n    image (str): The filename of the icon for the node.\n    multi (bool): Whether the node accepts multiple main input connections.\n    node_group (str): The category group the node belongs to (e.g., 'input', 'transform').\n    prod_ready (bool): Whether the node is considered production-ready.\n    can_be_start (bool): Whether the node can be a starting point in a flow.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "item": {
      "title": "Item",
      "type": "string"
    },
    "input": {
      "title": "Input",
      "type": "integer"
    },
    "output": {
      "title": "Output",
      "type": "integer"
    },
    "image": {
      "title": "Image",
      "type": "string"
    },
    "multi": {
      "default": false,
      "title": "Multi",
      "type": "boolean"
    },
    "node_group": {
      "title": "Node Group",
      "type": "string"
    },
    "prod_ready": {
      "default": true,
      "title": "Prod Ready",
      "type": "boolean"
    },
    "can_be_start": {
      "default": false,
      "title": "Can Be Start",
      "type": "boolean"
    }
  },
  "required": [
    "name",
    "item",
    "input",
    "output",
    "image",
    "node_group"
  ],
  "title": "NodeTemplate",
  "type": "object"
}

Fields:

  • name (str)
  • item (str)
  • input (int)
  • output (int)
  • image (str)
  • multi (bool)
  • node_group (str)
  • prod_ready (bool)
  • can_be_start (bool)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class NodeTemplate(BaseModel):
    """
    Defines the template for a node type, specifying its UI and functional characteristics.

    Attributes:
        name (str): The display name of the node.
        item (str): The unique identifier for the node type.
        input (int): The number of required input connections.
        output (int): The number of output connections.
        image (str): The filename of the icon for the node.
        multi (bool): Whether the node accepts multiple main input connections.
        node_group (str): The category group the node belongs to (e.g., 'input', 'transform').
        prod_ready (bool): Whether the node is considered production-ready.
        can_be_start (bool): Whether the node can be a starting point in a flow.
    """
    name: str
    item: str
    input: int
    output: int
    image: str
    multi: bool = False
    node_group: str
    prod_ready: bool = True
    can_be_start: bool = False
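
Example: a minimal sketch of declaring a template, assuming the import path documented above; the concrete values (name, item, image, group) are illustrative placeholders, not entries from the real node registry.

from flowfile_core.schemas.schemas import NodeTemplate

filter_template = NodeTemplate(
    name="Filter",            # display name shown in the UI
    item="filter",            # unique identifier for the node type
    input=1, output=1,        # one main input, one output
    image="filter.png",
    node_group="transform",
)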
RawLogInput pydantic-model

Bases: BaseModel

Schema for a raw log message.

Attributes:

Name Type Description
flowfile_flow_id int

The ID of the flow that generated the log.

log_message str

The content of the log message.

log_type Literal['INFO', 'ERROR']

The type of log.

extra Optional[dict]

Extra context data for the log.

Show JSON schema:
{
  "description": "Schema for a raw log message.\n\nAttributes:\n    flowfile_flow_id (int): The ID of the flow that generated the log.\n    log_message (str): The content of the log message.\n    log_type (Literal[\"INFO\", \"ERROR\"]): The type of log.\n    extra (Optional[dict]): Extra context data for the log.",
  "properties": {
    "flowfile_flow_id": {
      "title": "Flowfile Flow Id",
      "type": "integer"
    },
    "log_message": {
      "title": "Log Message",
      "type": "string"
    },
    "log_type": {
      "enum": [
        "INFO",
        "ERROR"
      ],
      "title": "Log Type",
      "type": "string"
    },
    "extra": {
      "anyOf": [
        {
          "type": "object"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Extra"
    }
  },
  "required": [
    "flowfile_flow_id",
    "log_message",
    "log_type"
  ],
  "title": "RawLogInput",
  "type": "object"
}

Fields:

  • flowfile_flow_id (int)
  • log_message (str)
  • log_type (Literal['INFO', 'ERROR'])
  • extra (Optional[dict])
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class RawLogInput(BaseModel):
    """
    Schema for a raw log message.

    Attributes:
        flowfile_flow_id (int): The ID of the flow that generated the log.
        log_message (str): The content of the log message.
        log_type (Literal["INFO", "ERROR"]): The type of log.
        extra (Optional[dict]): Extra context data for the log.
    """
    flowfile_flow_id: int
    log_message: str
    log_type: Literal["INFO", "ERROR"]
    extra: Optional[dict] = None
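
Example: a one-line construction sketch, assuming the documented import path; the message text is illustrative.

from flowfile_core.schemas.schemas import RawLogInput

log = RawLogInput(flowfile_flow_id=1, log_message="Node 3 finished", log_type="INFO")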
VueFlowInput pydantic-model

Bases: BaseModel

Represents the complete graph structure from the Vue-based frontend.

Attributes:

Name Type Description
node_edges List[NodeEdge]

A list of all edges in the graph.

node_inputs List[NodeInput]

A list of all nodes in the graph.

Show JSON schema:
{
  "$defs": {
    "NodeEdge": {
      "description": "Represents a connection (edge) between two nodes in the frontend.\n\nAttributes:\n    id (str): A unique identifier for the edge.\n    source (str): The ID of the source node.\n    target (str): The ID of the target node.\n    targetHandle (str): The specific input handle on the target node.\n    sourceHandle (str): The specific output handle on the source node.",
      "properties": {
        "id": {
          "title": "Id",
          "type": "string"
        },
        "source": {
          "title": "Source",
          "type": "string"
        },
        "target": {
          "title": "Target",
          "type": "string"
        },
        "targetHandle": {
          "title": "Targethandle",
          "type": "string"
        },
        "sourceHandle": {
          "title": "Sourcehandle",
          "type": "string"
        }
      },
      "required": [
        "id",
        "source",
        "target",
        "targetHandle",
        "sourceHandle"
      ],
      "title": "NodeEdge",
      "type": "object"
    },
    "NodeInput": {
      "description": "Represents a node as it is received from the frontend, including position.\n\nAttributes:\n    id (int): The unique ID of the node instance.\n    pos_x (float): The x-coordinate on the canvas.\n    pos_y (float): The y-coordinate on the canvas.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "item": {
          "title": "Item",
          "type": "string"
        },
        "input": {
          "title": "Input",
          "type": "integer"
        },
        "output": {
          "title": "Output",
          "type": "integer"
        },
        "image": {
          "title": "Image",
          "type": "string"
        },
        "multi": {
          "default": false,
          "title": "Multi",
          "type": "boolean"
        },
        "node_group": {
          "title": "Node Group",
          "type": "string"
        },
        "prod_ready": {
          "default": true,
          "title": "Prod Ready",
          "type": "boolean"
        },
        "can_be_start": {
          "default": false,
          "title": "Can Be Start",
          "type": "boolean"
        },
        "id": {
          "title": "Id",
          "type": "integer"
        },
        "pos_x": {
          "title": "Pos X",
          "type": "number"
        },
        "pos_y": {
          "title": "Pos Y",
          "type": "number"
        }
      },
      "required": [
        "name",
        "item",
        "input",
        "output",
        "image",
        "node_group",
        "id",
        "pos_x",
        "pos_y"
      ],
      "title": "NodeInput",
      "type": "object"
    }
  },
  "description": "Represents the complete graph structure from the Vue-based frontend.\n\nAttributes:\n    node_edges (List[NodeEdge]): A list of all edges in the graph.\n    node_inputs (List[NodeInput]): A list of all nodes in the graph.",
  "properties": {
    "node_edges": {
      "items": {
        "$ref": "#/$defs/NodeEdge"
      },
      "title": "Node Edges",
      "type": "array"
    },
    "node_inputs": {
      "items": {
        "$ref": "#/$defs/NodeInput"
      },
      "title": "Node Inputs",
      "type": "array"
    }
  },
  "required": [
    "node_edges",
    "node_inputs"
  ],
  "title": "VueFlowInput",
  "type": "object"
}

Fields:

  • node_edges (List[NodeEdge])
  • node_inputs (List[NodeInput])
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class VueFlowInput(BaseModel):
    """

    Represents the complete graph structure from the Vue-based frontend.

    Attributes:
        node_edges (List[NodeEdge]): A list of all edges in the graph.
        node_inputs (List[NodeInput]): A list of all nodes in the graph.
    """
    node_edges: List[NodeEdge]
    node_inputs: List[NodeInput]
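
Example: the sketch below assembles the frontend payload from the two building blocks above; node names, items, and handle strings are illustrative placeholders, and the import path follows the module heading.

from flowfile_core.schemas.schemas import NodeEdge, NodeInput, VueFlowInput

nodes = [
    NodeInput(name="Manual input", item="manual_input", input=0, output=1,
              image="manual_input.png", node_group="input", can_be_start=True,
              id=1, pos_x=0.0, pos_y=0.0),
    NodeInput(name="Filter", item="filter", input=1, output=1,
              image="filter.png", node_group="transform",
              id=2, pos_x=200.0, pos_y=0.0),
]
edges = [NodeEdge(id="1-2", source="1", target="2",
                  targetHandle="input-0", sourceHandle="output-0")]

graph = VueFlowInput(node_edges=edges, node_inputs=nodes)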

input_schema

flowfile_core.schemas.input_schema

Classes:

Name Description
DatabaseConnection

Defines the connection parameters for a database.

DatabaseSettings

Defines settings for reading from a database, either via table or query.

DatabaseWriteSettings

Defines settings for writing data to a database table.

ExternalSource

Base model for data coming from a predefined external source.

FullDatabaseConnection

A complete database connection model including the secret password.

FullDatabaseConnectionInterface

A database connection model intended for UI display, omitting the password.

MinimalFieldInfo

Represents the most basic information about a data field (column).

NewDirectory

Defines the information required to create a new directory.

NodeBase

Base model for all nodes in a FlowGraph. Contains common metadata.

NodeCloudStorageReader

Settings for a node that reads from a cloud storage service (S3, GCS, etc.).

NodeCloudStorageWriter

Settings for a node that writes to a cloud storage service.

NodeConnection

Represents a connection (edge) between two nodes in the graph.

NodeCrossJoin

Settings for a node that performs a cross join.

NodeDatabaseReader

Settings for a node that reads from a database.

NodeDatabaseWriter

Settings for a node that writes data to a database.

NodeDatasource

Base settings for a node that acts as a data source.

NodeDescription

A simple model for updating a node's description text.

NodeExploreData

Settings for a node that provides an interactive data exploration interface.

NodeExternalSource

Settings for a node that connects to a registered external data source.

NodeFilter

Settings for a node that filters rows based on a condition.

NodeFormula

Settings for a node that applies a formula to create/modify a column.

NodeFuzzyMatch

Settings for a node that performs a fuzzy join based on string similarity.

NodeGraphSolver

Settings for a node that solves graph-based problems (e.g., connected components).

NodeGroupBy

Settings for a node that performs a group-by and aggregation operation.

NodeInputConnection

Represents the input side of a connection between two nodes.

NodeJoin

Settings for a node that performs a standard SQL-style join.

NodeManualInput

Settings for a node that allows direct data entry in the UI.

NodeMultiInput

A base model for any node that takes multiple data inputs.

NodeOutput

Settings for a node that writes its input to a file.

NodeOutputConnection

Represents the output side of a connection between two nodes.

NodePivot

Settings for a node that pivots data from a long to a wide format.

NodePolarsCode

Settings for a node that executes arbitrary user-provided Polars code.

NodePromise

A placeholder node for an operation that has not yet been configured.

NodeRead

Settings for a node that reads data from a file.

NodeRecordCount

Settings for a node that counts the number of records.

NodeRecordId

Settings for a node that adds a unique record ID column.

NodeSample

Settings for a node that samples a subset of the data.

NodeSelect

Settings for a node that selects, renames, and reorders columns.

NodeSingleInput

A base model for any node that takes a single data input.

NodeSort

Settings for a node that sorts the data by one or more columns.

NodeTextToRows

Settings for a node that splits a text column into multiple rows.

NodeUnion

Settings for a node that concatenates multiple data inputs.

NodeUnique

Settings for a node that returns the unique rows from the data.

NodeUnpivot

Settings for a node that unpivots data from a wide to a long format.

OutputCsvTable

Defines settings for writing a CSV file.

OutputExcelTable

Defines settings for writing an Excel file.

OutputParquetTable

Defines settings for writing a Parquet file.

OutputSettings

Defines the complete settings for an output node.

RawData

Represents data in a raw, columnar format for manual input.

ReceivedCsvTable

Defines settings for reading a CSV file.

ReceivedExcelTable

Defines settings for reading an Excel file.

ReceivedJsonTable

Defines settings for reading a JSON file (inherits from CSV settings).

ReceivedParquetTable

Defines settings for reading a Parquet file.

ReceivedTable

A comprehensive model that can represent any type of received table.

ReceivedTableBase

Base model for defining a table received from an external source.

RemoveItem

Represents a single item to be removed from a directory or list.

RemoveItemsInput

Defines a list of items to be removed.

SampleUsers

Settings for generating a sample dataset of users.

DatabaseConnection

Bases: BaseModel

Defines the connection parameters for a database.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class DatabaseConnection(BaseModel):
    """Defines the connection parameters for a database."""
    database_type: str = "postgresql"
    username: Optional[str] = None
    password_ref: Optional[SecretRef] = None
    host: Optional[str] = None
    port: Optional[int] = None
    database: Optional[str] = None
    url: Optional[str] = None
DatabaseSettings

Bases: BaseModel

Defines settings for reading from a database, either via table or query.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class DatabaseSettings(BaseModel):
    """Defines settings for reading from a database, either via table or query."""
    connection_mode: Optional[Literal['inline', 'reference']] = 'inline'
    database_connection: Optional[DatabaseConnection] = None
    database_connection_name: Optional[str] = None
    schema_name: Optional[str] = None
    table_name: Optional[str] = None
    query: Optional[str] = None
    query_mode: Literal['query', 'table', 'reference'] = 'table'

    @model_validator(mode='after')
    def validate_table_or_query(self):
        # Validate that either table_name or query is provided
        if (not self.table_name and not self.query) and self.query_mode == 'inline':
            raise ValueError("Either 'table_name' or 'query' must be provided")

        # Validate correct connection information based on connection_mode
        if self.connection_mode == 'inline' and self.database_connection is None:
            raise ValueError("When 'connection_mode' is 'inline', 'database_connection' must be provided")

        if self.connection_mode == 'reference' and not self.database_connection_name:
            raise ValueError("When 'connection_mode' is 'reference', 'database_connection_name' must be provided")

        return self
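
Example: a sketch of the validator's behaviour, assuming both models import from flowfile_core.schemas.input_schema as documented above; the connection details are illustrative.

from flowfile_core.schemas.input_schema import DatabaseConnection, DatabaseSettings

# Inline connection plus an explicit table name passes validation.
settings = DatabaseSettings(
    connection_mode="inline",
    database_connection=DatabaseConnection(host="localhost", port=5432,
                                           database="analytics", username="etl"),
    table_name="public.orders",
)

# A 'reference' connection without a stored connection name is rejected.
try:
    DatabaseSettings(connection_mode="reference", table_name="public.orders")
except ValueError as exc:
    print(exc)   # "When 'connection_mode' is 'reference', ..."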
DatabaseWriteSettings

Bases: BaseModel

Defines settings for writing data to a database table.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class DatabaseWriteSettings(BaseModel):
    """Defines settings for writing data to a database table."""
    connection_mode: Optional[Literal['inline', 'reference']] = 'inline'
    database_connection: Optional[DatabaseConnection] = None
    database_connection_name: Optional[str] = None
    table_name: str
    schema_name: Optional[str] = None
    if_exists: Optional[Literal['append', 'replace', 'fail']] = 'append'
ExternalSource

Bases: BaseModel

Base model for data coming from a predefined external source.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class ExternalSource(BaseModel):
    """Base model for data coming from a predefined external source."""
    orientation: str = 'row'
    fields: Optional[List[MinimalFieldInfo]] = None
FullDatabaseConnection

Bases: BaseModel

A complete database connection model including the secret password.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class FullDatabaseConnection(BaseModel):
    """A complete database connection model including the secret password."""
    connection_name: str
    database_type: str = "postgresql"
    username: str
    password: SecretStr
    host: Optional[str] = None
    port: Optional[int] = None
    database: Optional[str] = None
    ssl_enabled: Optional[bool] = False
    url: Optional[str] = None
FullDatabaseConnectionInterface

Bases: BaseModel

A database connection model intended for UI display, omitting the password.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class FullDatabaseConnectionInterface(BaseModel):
    """A database connection model intended for UI display, omitting the password."""
    connection_name: str
    database_type: str = "postgresql"
    username: str
    host: Optional[str] = None
    port: Optional[int] = None
    database: Optional[str] = None
    ssl_enabled: Optional[bool] = False
    url: Optional[str] = None
MinimalFieldInfo

Bases: BaseModel

Represents the most basic information about a data field (column).

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class MinimalFieldInfo(BaseModel):
    """Represents the most basic information about a data field (column)."""
    name: str
    data_type: str = "String"
NewDirectory

Bases: BaseModel

Defines the information required to create a new directory.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NewDirectory(BaseModel):
    """Defines the information required to create a new directory."""
    source_path: str
    dir_name: str
NodeBase

Bases: BaseModel

Base model for all nodes in a FlowGraph. Contains common metadata.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeBase(BaseModel):
    """Base model for all nodes in a FlowGraph. Contains common metadata."""
    model_config = ConfigDict(arbitrary_types_allowed=True)
    flow_id: int
    node_id: int
    cache_results: Optional[bool] = False
    pos_x: Optional[float] = 0
    pos_y: Optional[float] = 0
    is_setup: Optional[bool] = True
    description: Optional[str] = ''
    user_id: Optional[int] = None
    is_flow_output: Optional[bool] = False
NodeCloudStorageReader

Bases: NodeBase

Settings for a node that reads from a cloud storage service (S3, GCS, etc.).

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeCloudStorageReader(NodeBase):
    """Settings for a node that reads from a cloud storage service (S3, GCS, etc.)."""
    cloud_storage_settings: CloudStorageReadSettings
    fields: Optional[List[MinimalFieldInfo]] = None
NodeCloudStorageWriter

Bases: NodeSingleInput

Settings for a node that writes to a cloud storage service.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeCloudStorageWriter(NodeSingleInput):
    """Settings for a node that writes to a cloud storage service."""
    cloud_storage_settings: CloudStorageWriteSettings
NodeConnection

Bases: BaseModel

Represents a connection (edge) between two nodes in the graph.

Methods:

Name Description
create_from_simple_input

Creates a standard connection between two nodes.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeConnection(BaseModel):
    """Represents a connection (edge) between two nodes in the graph."""
    input_connection: NodeInputConnection
    output_connection: NodeOutputConnection

    @classmethod
    def create_from_simple_input(cls, from_id: int, to_id: int, input_type: InputType = "input-0"):
        """Creates a standard connection between two nodes."""
        match input_type:
            case "main": connection_class: InputConnectionClass = "input-0"
            case "right": connection_class: InputConnectionClass = "input-1"
            case "left": connection_class: InputConnectionClass = "input-2"
            case _: connection_class: InputConnectionClass = "input-0"
        node_input = NodeInputConnection(node_id=to_id, connection_class=connection_class)
        node_output = NodeOutputConnection(node_id=from_id, connection_class='output-0')
        return cls(input_connection=node_input, output_connection=node_output)
create_from_simple_input(from_id, to_id, input_type='input-0') classmethod

Creates a standard connection between two nodes.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
@classmethod
def create_from_simple_input(cls, from_id: int, to_id: int, input_type: InputType = "input-0"):
    """Creates a standard connection between two nodes."""
    match input_type:
        case "main": connection_class: InputConnectionClass = "input-0"
        case "right": connection_class: InputConnectionClass = "input-1"
        case "left": connection_class: InputConnectionClass = "input-2"
        case _: connection_class: InputConnectionClass = "input-0"
    node_input = NodeInputConnection(node_id=to_id, connection_class=connection_class)
    node_output = NodeOutputConnection(node_id=from_id, connection_class='output-0')
    return cls(input_connection=node_input, output_connection=node_output)
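
A short, hypothetical sketch of how this classmethod maps a semantic input type onto connection classes; node ids 1 and 2 are placeholders:

conn = NodeConnection.create_from_simple_input(from_id=1, to_id=2, input_type="main")
# "main" resolves to "input-0" on the receiving node; the source side is always "output-0".
assert conn.input_connection.connection_class == "input-0"
assert conn.output_connection.node_id == 1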
NodeCrossJoin

Bases: NodeMultiInput

Settings for a node that performs a cross join.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeCrossJoin(NodeMultiInput):
    """Settings for a node that performs a cross join."""
    auto_generate_selection: bool = True
    verify_integrity: bool = True
    cross_join_input: transform_schema.CrossJoinInput
    auto_keep_all: bool = True
    auto_keep_right: bool = True
    auto_keep_left: bool = True
NodeDatabaseReader

Bases: NodeBase

Settings for a node that reads from a database.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeDatabaseReader(NodeBase):
    """Settings for a node that reads from a database."""
    database_settings: DatabaseSettings
    fields: Optional[List[MinimalFieldInfo]] = None
NodeDatabaseWriter

Bases: NodeSingleInput

Settings for a node that writes data to a database.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeDatabaseWriter(NodeSingleInput):
    """Settings for a node that writes data to a database."""
    database_write_settings: DatabaseWriteSettings
NodeDatasource

Bases: NodeBase

Base settings for a node that acts as a data source.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeDatasource(NodeBase):
    """Base settings for a node that acts as a data source."""
    file_ref: str = None
NodeDescription

Bases: BaseModel

A simple model for updating a node's description text.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeDescription(BaseModel):
    """A simple model for updating a node's description text."""
    description: str = ''
NodeExploreData

Bases: NodeBase

Settings for a node that provides an interactive data exploration interface.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeExploreData(NodeBase):
    """Settings for a node that provides an interactive data exploration interface."""
    graphic_walker_input: Optional[gs_schemas.GraphicWalkerInput] = None
NodeExternalSource

Bases: NodeBase

Settings for a node that connects to a registered external data source.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeExternalSource(NodeBase):
    """Settings for a node that connects to a registered external data source."""
    identifier: str
    source_settings: SampleUsers
NodeFilter

Bases: NodeSingleInput

Settings for a node that filters rows based on a condition.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeFilter(NodeSingleInput):
    """Settings for a node that filters rows based on a condition."""
    filter_input: transform_schema.FilterInput
NodeFormula

Bases: NodeSingleInput

Settings for a node that applies a formula to create/modify a column.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeFormula(NodeSingleInput):
    """Settings for a node that applies a formula to create/modify a column."""
    function: transform_schema.FunctionInput = None
NodeFuzzyMatch

Bases: NodeJoin

Settings for a node that performs a fuzzy join based on string similarity.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeFuzzyMatch(NodeJoin):
    """Settings for a node that performs a fuzzy join based on string similarity."""
    join_input: transform_schema.FuzzyMatchInput
NodeGraphSolver

Bases: NodeSingleInput

Settings for a node that solves graph-based problems (e.g., connected components).

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeGraphSolver(NodeSingleInput):
    """Settings for a node that solves graph-based problems (e.g., connected components)."""
    graph_solver_input: transform_schema.GraphSolverInput
NodeGroupBy

Bases: NodeSingleInput

Settings for a node that performs a group-by and aggregation operation.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeGroupBy(NodeSingleInput):
    """Settings for a node that performs a group-by and aggregation operation."""
    groupby_input: transform_schema.GroupByInput = None
NodeInputConnection

Bases: BaseModel

Represents the input side of a connection between two nodes.

Methods:

Name Description
get_node_input_connection_type

Determines the semantic type of the input (e.g., for a join).

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeInputConnection(BaseModel):
    """Represents the input side of a connection between two nodes."""
    node_id: int
    connection_class: InputConnectionClass

    def get_node_input_connection_type(self) -> Literal['main', 'right', 'left']:
        """Determines the semantic type of the input (e.g., for a join)."""
        match self.connection_class:
            case 'input-0': return 'main'
            case 'input-1': return 'right'
            case 'input-2': return 'left'
            case _: raise ValueError(f"Unexpected connection_class: {self.connection_class}")
get_node_input_connection_type()

Determines the semantic type of the input (e.g., for a join).

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
def get_node_input_connection_type(self) -> Literal['main', 'right', 'left']:
    """Determines the semantic type of the input (e.g., for a join)."""
    match self.connection_class:
        case 'input-0': return 'main'
        case 'input-1': return 'right'
        case 'input-2': return 'left'
        case _: raise ValueError(f"Unexpected connection_class: {self.connection_class}")
NodeJoin

Bases: NodeMultiInput

Settings for a node that performs a standard SQL-style join.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeJoin(NodeMultiInput):
    """Settings for a node that performs a standard SQL-style join."""
    auto_generate_selection: bool = True
    verify_integrity: bool = True
    join_input: transform_schema.JoinInput
    auto_keep_all: bool = True
    auto_keep_right: bool = True
    auto_keep_left: bool = True
NodeManualInput

Bases: NodeBase

Settings for a node that allows direct data entry in the UI.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeManualInput(NodeBase):
    """Settings for a node that allows direct data entry in the UI."""
    raw_data_format: Optional[RawData] = None
NodeMultiInput

Bases: NodeBase

A base model for any node that takes multiple data inputs.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeMultiInput(NodeBase):
    """A base model for any node that takes multiple data inputs."""
    depending_on_ids: Optional[List[int]] = [-1]
NodeOutput

Bases: NodeSingleInput

Settings for a node that writes its input to a file.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeOutput(NodeSingleInput):
    """Settings for a node that writes its input to a file."""
    output_settings: OutputSettings
NodeOutputConnection

Bases: BaseModel

Represents the output side of a connection between two nodes.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeOutputConnection(BaseModel):
    """Represents the output side of a connection between two nodes."""
    node_id: int
    connection_class: OutputConnectionClass
NodePivot

Bases: NodeSingleInput

Settings for a node that pivots data from a long to a wide format.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodePivot(NodeSingleInput):
    """Settings for a node that pivots data from a long to a wide format."""
    pivot_input: transform_schema.PivotInput = None
    output_fields: Optional[List[MinimalFieldInfo]] = None
NodePolarsCode

Bases: NodeMultiInput

Settings for a node that executes arbitrary user-provided Polars code.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodePolarsCode(NodeMultiInput):
    """Settings for a node that executes arbitrary user-provided Polars code."""
    polars_code_input: transform_schema.PolarsCodeInput
NodePromise

Bases: NodeBase

A placeholder node for an operation that has not yet been configured.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodePromise(NodeBase):
    """A placeholder node for an operation that has not yet been configured."""
    is_setup: bool = False
    node_type: str
NodeRead

Bases: NodeBase

Settings for a node that reads data from a file.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeRead(NodeBase):
    """Settings for a node that reads data from a file."""
    received_file: ReceivedTable
NodeRecordCount

Bases: NodeSingleInput

Settings for a node that counts the number of records.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeRecordCount(NodeSingleInput):
    """Settings for a node that counts the number of records."""
    pass
NodeRecordId

Bases: NodeSingleInput

Settings for a node that adds a unique record ID column.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeRecordId(NodeSingleInput):
    """Settings for a node that adds a unique record ID column."""
    record_id_input: transform_schema.RecordIdInput
NodeSample

Bases: NodeSingleInput

Settings for a node that samples a subset of the data.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeSample(NodeSingleInput):
    """Settings for a node that samples a subset of the data."""
    sample_size: int = 1000
NodeSelect

Bases: NodeSingleInput

Settings for a node that selects, renames, and reorders columns.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeSelect(NodeSingleInput):
    """Settings for a node that selects, renames, and reorders columns."""
    keep_missing: bool = True
    select_input: List[transform_schema.SelectInput] = Field(default_factory=list)
    sorted_by: Optional[Literal['none', 'asc', 'desc']] = 'none'
NodeSingleInput

Bases: NodeBase

A base model for any node that takes a single data input.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeSingleInput(NodeBase):
    """A base model for any node that takes a single data input."""
    depending_on_id: Optional[int] = -1
NodeSort

Bases: NodeSingleInput

Settings for a node that sorts the data by one or more columns.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeSort(NodeSingleInput):
    """Settings for a node that sorts the data by one or more columns."""
    sort_input: List[transform_schema.SortByInput] = Field(default_factory=list)
NodeTextToRows

Bases: NodeSingleInput

Settings for a node that splits a text column into multiple rows.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeTextToRows(NodeSingleInput):
    """Settings for a node that splits a text column into multiple rows."""
    text_to_rows_input: transform_schema.TextToRowsInput
NodeUnion

Bases: NodeMultiInput

Settings for a node that concatenates multiple data inputs.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeUnion(NodeMultiInput):
    """Settings for a node that concatenates multiple data inputs."""
    union_input: transform_schema.UnionInput = Field(default_factory=transform_schema.UnionInput)
NodeUnique

Bases: NodeSingleInput

Settings for a node that returns the unique rows from the data.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeUnique(NodeSingleInput):
    """Settings for a node that returns the unique rows from the data."""
    unique_input: transform_schema.UniqueInput
NodeUnpivot

Bases: NodeSingleInput

Settings for a node that unpivots data from a wide to a long format.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeUnpivot(NodeSingleInput):
    """Settings for a node that unpivots data from a wide to a long format."""
    unpivot_input: transform_schema.UnpivotInput = None
OutputCsvTable

Bases: BaseModel

Defines settings for writing a CSV file.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class OutputCsvTable(BaseModel):
    """Defines settings for writing a CSV file."""
    file_type: str = 'csv'
    delimiter: str = ','
    encoding: str = 'utf-8'
OutputExcelTable

Bases: BaseModel

Defines settings for writing an Excel file.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class OutputExcelTable(BaseModel):
    """Defines settings for writing an Excel file."""
    file_type: str = 'excel'
    sheet_name: str = 'Sheet1'
OutputParquetTable

Bases: BaseModel

Defines settings for writing a Parquet file.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class OutputParquetTable(BaseModel):
    """Defines settings for writing a Parquet file."""
    file_type: str = 'parquet'
OutputSettings

Bases: BaseModel

Defines the complete settings for an output node.

Methods:

Name Description
populate_abs_file_path

Ensures the absolute file path is populated after validation.

set_absolute_filepath

Resolves the output directory and name into an absolute path.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class OutputSettings(BaseModel):
    """Defines the complete settings for an output node."""
    name: str
    directory: str
    file_type: str
    fields: Optional[List[str]] = Field(default_factory=list)
    write_mode: str = 'overwrite'
    output_csv_table: Optional[OutputCsvTable] = Field(default_factory=OutputCsvTable)
    output_parquet_table: OutputParquetTable = Field(default_factory=OutputParquetTable)
    output_excel_table: OutputExcelTable = Field(default_factory=OutputExcelTable)
    abs_file_path: Optional[str] = None

    def set_absolute_filepath(self):
        """Resolves the output directory and name into an absolute path."""
        base_path = Path(self.directory)
        if not base_path.is_absolute():
            base_path = Path.cwd() / base_path
        if self.name and self.name not in base_path.name:
            base_path = base_path / self.name
        self.abs_file_path = str(base_path.resolve())

    @model_validator(mode='after')
    def populate_abs_file_path(self):
        """Ensures the absolute file path is populated after validation."""
        self.set_absolute_filepath()
        return self
populate_abs_file_path()

Ensures the absolute file path is populated after validation.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
@model_validator(mode='after')
def populate_abs_file_path(self):
    """Ensures the absolute file path is populated after validation."""
    self.set_absolute_filepath()
    return self
set_absolute_filepath()

Resolves the output directory and name into an absolute path.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
def set_absolute_filepath(self):
    """Resolves the output directory and name into an absolute path."""
    base_path = Path(self.directory)
    if not base_path.is_absolute():
        base_path = Path.cwd() / base_path
    if self.name and self.name not in base_path.name:
        base_path = base_path / self.name
    self.abs_file_path = str(base_path.resolve())
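
A brief sketch (file name and directory are made up) of how the validator fills abs_file_path when an OutputSettings instance is created with a relative directory:

settings = OutputSettings(name='result.parquet', directory='output', file_type='parquet')
# populate_abs_file_path runs after validation, so the absolute path is already resolved,
# e.g. <current working directory>/output/result.parquet.
print(settings.abs_file_path)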
RawData

Bases: BaseModel

Represents data in a raw, columnar format for manual input.

Methods:

Name Description
from_pylist

Creates a RawData object from a list of Python dictionaries.

to_pylist

Converts the RawData object back into a list of Python dictionaries.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class RawData(BaseModel):
    """Represents data in a raw, columnar format for manual input."""
    columns: List[MinimalFieldInfo] = None
    data: List[List]

    @classmethod
    def from_pylist(cls, pylist: List[dict]):
        """Creates a RawData object from a list of Python dictionaries."""
        if len(pylist) == 0:
            return cls(columns=[], data=[])
        pylist = ensure_similarity_dicts(pylist)
        values = [standardize_col_dtype([vv for vv in c]) for c in
                  zip(*(r.values() for r in pylist))]
        data_types = (pl.DataType.from_python(type(next((v for v in column_values), None))) for column_values in values)
        columns = [MinimalFieldInfo(name=c, data_type=str(next(data_types))) for c in pylist[0].keys()]
        return cls(columns=columns, data=values)

    def to_pylist(self) -> List[dict]:
        """Converts the RawData object back into a list of Python dictionaries."""
        return [{c.name: self.data[ci][ri] for ci, c in enumerate(self.columns)} for ri in range(len(self.data[0]))]
from_pylist(pylist) classmethod

Creates a RawData object from a list of Python dictionaries.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
@classmethod
def from_pylist(cls, pylist: List[dict]):
    """Creates a RawData object from a list of Python dictionaries."""
    if len(pylist) == 0:
        return cls(columns=[], data=[])
    pylist = ensure_similarity_dicts(pylist)
    values = [standardize_col_dtype([vv for vv in c]) for c in
              zip(*(r.values() for r in pylist))]
    data_types = (pl.DataType.from_python(type(next((v for v in column_values), None))) for column_values in values)
    columns = [MinimalFieldInfo(name=c, data_type=str(next(data_types))) for c in pylist[0].keys()]
    return cls(columns=columns, data=values)
to_pylist()

Converts the RawData object back into a list of Python dictionaries.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
def to_pylist(self) -> List[dict]:
    """Converts the RawData object back into a list of Python dictionaries."""
    return [{c.name: self.data[ci][ri] for ci, c in enumerate(self.columns)} for ri in range(len(self.data[0]))]
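
An illustrative round trip (the records are invented) showing that the data is stored column-wise and can be converted back to a list of row dictionaries:

records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
raw = RawData.from_pylist(records)
# from_pylist stores one list per column, e.g. [["Alice", "Bob"], [30, 25]],
# and infers a MinimalFieldInfo for each key.
round_tripped = raw.to_pylist()  # back to a list of row dictionaries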
ReceivedCsvTable

Bases: ReceivedTableBase

Defines settings for reading a CSV file.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class ReceivedCsvTable(ReceivedTableBase):
    """Defines settings for reading a CSV file."""
    file_type: str = 'csv'
    reference: str = ''
    starting_from_line: int = 0
    delimiter: str = ','
    has_headers: bool = True
    encoding: Optional[str] = 'utf-8'
    parquet_ref: Optional[str] = None
    row_delimiter: str = '\n'
    quote_char: str = '"'
    infer_schema_length: int = 10_000
    truncate_ragged_lines: bool = False
    ignore_errors: bool = False
ReceivedExcelTable

Bases: ReceivedTableBase

Defines settings for reading an Excel file.

Methods:

Name Description
validate_range_values

Validates that the Excel cell range is logical.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class ReceivedExcelTable(ReceivedTableBase):
    """Defines settings for reading an Excel file."""
    sheet_name: Optional[str] = None
    start_row: int = 0
    start_column: int = 0
    end_row: int = 0
    end_column: int = 0
    has_headers: bool = True
    type_inference: bool = False

    def validate_range_values(self):
        """Validates that the Excel cell range is logical."""
        for attribute in [self.start_row, self.start_column, self.end_row, self.end_column]:
            if not isinstance(attribute, int) or attribute < 0:
                raise ValueError("Row and column indices must be non-negative integers")
        if (self.end_row > 0 and self.start_row > self.end_row) or \
           (self.end_column > 0 and self.start_column > self.end_column):
            raise ValueError("Start row/column must not be greater than end row/column")
validate_range_values()

Validates that the Excel cell range is logical.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
def validate_range_values(self):
    """Validates that the Excel cell range is logical."""
    for attribute in [self.start_row, self.start_column, self.end_row, self.end_column]:
        if not isinstance(attribute, int) or attribute < 0:
            raise ValueError("Row and column indices must be non-negative integers")
    if (self.end_row > 0 and self.start_row > self.end_row) or \
       (self.end_column > 0 and self.start_column > self.end_column):
        raise ValueError("Start row/column must not be greater than end row/column")
ReceivedJsonTable

Bases: ReceivedCsvTable

Defines settings for reading a JSON file (inherits from CSV settings).

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class ReceivedJsonTable(ReceivedCsvTable):
    """Defines settings for reading a JSON file (inherits from CSV settings)."""
    pass
ReceivedParquetTable

Bases: ReceivedTableBase

Defines settings for reading a Parquet file.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class ReceivedParquetTable(ReceivedTableBase):
    """Defines settings for reading a Parquet file."""
    file_type: str = 'parquet'
ReceivedTable

Bases: ReceivedExcelTable, ReceivedCsvTable, ReceivedParquetTable

A comprehensive model that can represent any type of received table.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class ReceivedTable(ReceivedExcelTable, ReceivedCsvTable, ReceivedParquetTable):
    """A comprehensive model that can represent any type of received table."""
    ...
ReceivedTableBase

Bases: BaseModel

Base model for defining a table received from an external source.

Methods:

Name Description
create_from_path

Creates an instance from a file path string.

populate_abs_file_path

Ensures the absolute file path is populated after validation.

set_absolute_filepath

Resolves the path to an absolute file path.

Attributes:

Name Type Description
file_path str

Constructs the full file path from the directory and name.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class ReceivedTableBase(BaseModel):
    """Base model for defining a table received from an external source."""
    id: Optional[int] = None
    name: Optional[str]
    path: str  # This can be an absolute or relative path
    directory: Optional[str] = None
    analysis_file_available: bool = False
    status: Optional[str] = None
    file_type: Optional[str] = None
    fields: List[MinimalFieldInfo] = Field(default_factory=list)
    abs_file_path: Optional[str] = None

    @classmethod
    def create_from_path(cls, path: str):
        """Creates an instance from a file path string."""
        filename = Path(path).name
        return cls(name=filename, path=path)

    @property
    def file_path(self) -> str:
        """Constructs the full file path from the directory and name."""
        if not self.name in self.path:
            return os.path.join(self.path, self.name)
        else:
            return self.path

    def set_absolute_filepath(self):
        """Resolves the path to an absolute file path."""
        base_path = Path(self.path).expanduser()
        if not base_path.is_absolute():
            base_path = Path.cwd() / base_path
        if self.name and self.name not in base_path.name:
            base_path = base_path / self.name
        self.abs_file_path = str(base_path.resolve())

    @model_validator(mode='after')
    def populate_abs_file_path(self):
        """Ensures the absolute file path is populated after validation."""
        if not self.abs_file_path:
            self.set_absolute_filepath()
        return self
file_path property

Constructs the full file path from the directory and name.

create_from_path(path) classmethod

Creates an instance from a file path string.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
@classmethod
def create_from_path(cls, path: str):
    """Creates an instance from a file path string."""
    filename = Path(path).name
    return cls(name=filename, path=path)
populate_abs_file_path()

Ensures the absolute file path is populated after validation.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
@model_validator(mode='after')
def populate_abs_file_path(self):
    """Ensures the absolute file path is populated after validation."""
    if not self.abs_file_path:
        self.set_absolute_filepath()
    return self
set_absolute_filepath()

Resolves the path to an absolute file path.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
def set_absolute_filepath(self):
    """Resolves the path to an absolute file path."""
    base_path = Path(self.path).expanduser()
    if not base_path.is_absolute():
        base_path = Path.cwd() / base_path
    if self.name and self.name not in base_path.name:
        base_path = base_path / self.name
    self.abs_file_path = str(base_path.resolve())
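
A small sketch (the CSV path is invented) of the typical construction path: create_from_path derives the name from the path, and the model validator resolves abs_file_path:

csv_table = ReceivedCsvTable.create_from_path('data/sales.csv')
# name is set to 'sales.csv'; abs_file_path is resolved against the current working directory.
print(csv_table.name, csv_table.abs_file_path)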
RemoveItem

Bases: BaseModel

Represents a single item to be removed from a directory or list.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class RemoveItem(BaseModel):
    """Represents a single item to be removed from a directory or list."""
    path: str
    id: int = -1
RemoveItemsInput

Bases: BaseModel

Defines a list of items to be removed.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class RemoveItemsInput(BaseModel):
    """Defines a list of items to be removed."""
    paths: List[RemoveItem]
    source_path: str
SampleUsers

Bases: ExternalSource

Settings for generating a sample dataset of users.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class SampleUsers(ExternalSource):
    """Settings for generating a sample dataset of users."""
    SAMPLE_USERS: bool
    class_name: str = "sample_users"
    size: int = 100

transform_schema

flowfile_core.schemas.transform_schema

Classes:

Name Description
AggColl

A data class that represents a single aggregation operation for a group by operation.

BasicFilter

Defines a simple, single-condition filter (e.g., 'column' 'equals' 'value').

CrossJoinInput

Defines the settings for a cross join operation, including column selections for both inputs.

FieldInput

Represents a single field with its name and data type, typically for defining an output column.

FilterInput

Defines the settings for a filter operation, supporting basic or advanced (expression-based) modes.

FullJoinKeyResponse

Holds the join key rename responses for both sides of a join.

FunctionInput

Defines a formula to be applied, including the output field information.

FuzzyMap

Extends JoinMap with settings for fuzzy string matching, such as the algorithm and similarity threshold.

FuzzyMatchInput

Extends JoinInput with settings specific to fuzzy matching, such as the matching algorithm and threshold.

GraphSolverInput

Defines settings for a graph-solving operation (e.g., finding connected components).

GroupByInput

A data class that represents the input for a group by operation.

JoinInput

Defines the settings for a standard SQL-style join, including keys, strategy, and selections.

JoinInputs

Extends SelectInputs with functionality specific to join operations, like handling join keys.

JoinKeyRename

Represents the renaming of a join key from its original to a temporary name.

JoinKeyRenameResponse

Contains a list of join key renames for one side of a join.

JoinMap

Defines a single mapping between a left and right column for a join key.

JoinSelectMixin

A mixin providing common methods for join-like operations that involve left and right inputs.

PivotInput

Defines the settings for a pivot (long-to-wide) operation.

PolarsCodeInput

A simple container for a string of user-provided Polars code to be executed.

RecordIdInput

Defines settings for adding a record ID (row number) column to the data.

SelectInput

Defines how a single column should be selected, renamed, or type-cast.

SelectInputs

A container for a list of SelectInput objects, providing helper methods for managing selections.

SortByInput

Defines a single sort condition on a column, including the direction.

TextToRowsInput

Defines settings for splitting a text column into multiple rows based on a delimiter.

UnionInput

Defines settings for a union (concatenation) operation.

UniqueInput

Defines settings for a uniqueness operation, specifying columns and which row to keep.

UnpivotInput

Defines settings for an unpivot (wide-to-long) operation.

Functions:

Name Description
construct_join_key_name

Creates a temporary, unique name for a join key column.

get_func_type_mapping

Infers the output data type of common aggregation functions.

string_concat

A simple wrapper to concatenate string columns in Polars.

AggColl dataclass

A data class that represents a single aggregation operation for a group by operation.

Attributes

old_name : str

The name of the column in the original DataFrame to be aggregated.

agg : Any

The aggregation function to use. This can be a string representing a built-in function or a custom function.

new_name : Optional[str]

The name of the resulting aggregated column in the output DataFrame. If not provided, it will default to the old_name appended with the aggregation function.

output_type : Optional[str]

The type of the output values of the aggregation. If not provided, it is inferred from the aggregation function using the get_func_type_mapping function.

Example

agg_col = AggColl(old_name='col1', agg='sum', new_name='sum_col1', output_type='float')

Methods:

Name Description
__init__

Initializes an aggregation column with its source, function, and new name.

Attributes:

Name Type Description
agg_func

Returns the corresponding Polars aggregation function from the agg string.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class AggColl:
    """
    A data class that represents a single aggregation operation for a group by operation.

    Attributes
    ----------
    old_name : str
        The name of the column in the original DataFrame to be aggregated.

    agg : Any
        The aggregation function to use. This can be a string representing a built-in function or a custom function.

    new_name : Optional[str]
        The name of the resulting aggregated column in the output DataFrame. If not provided, it will default to the
        old_name appended with the aggregation function.

    output_type : Optional[str]
        The type of the output values of the aggregation. If not provided, it is inferred from the aggregation function
        using the `get_func_type_mapping` function.

    Example
    --------
    agg_col = AggColl(
        old_name='col1',
        agg='sum',
        new_name='sum_col1',
        output_type='float'
    )
    """
    old_name: str
    agg: str
    new_name: Optional[str]
    output_type: Optional[str] = None

    def __init__(self, old_name: str, agg: str, new_name: str = None, output_type: str = None):
        """Initializes an aggregation column with its source, function, and new name."""
        self.old_name = str(old_name)
        if agg != 'groupby':
            self.new_name = new_name if new_name is not None else self.old_name + "_" + agg
        else:
            self.new_name = new_name if new_name is not None else self.old_name
        self.output_type = output_type if output_type is not None else get_func_type_mapping(agg)
        self.agg = agg

    @property
    def agg_func(self):
        """Returns the corresponding Polars aggregation function from the `agg` string."""
        if self.agg == 'groupby':
            return self.agg
        elif self.agg == 'concat':
            return string_concat
        else:
            return getattr(pl, self.agg) if isinstance(self.agg, str) else self.agg
agg_func property

Returns the corresponding Polars aggregation function from the agg string.

__init__(old_name, agg, new_name=None, output_type=None)

Initializes an aggregation column with its source, function, and new name.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def __init__(self, old_name: str, agg: str, new_name: str = None, output_type: str = None):
    """Initializes an aggregation column with its source, function, and new name."""
    self.old_name = str(old_name)
    if agg != 'groupby':
        self.new_name = new_name if new_name is not None else self.old_name + "_" + agg
    else:
        self.new_name = new_name if new_name is not None else self.old_name
    self.output_type = output_type if output_type is not None else get_func_type_mapping(agg)
    self.agg = agg
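
A quick illustration (column names invented) of the defaulting behaviour in __init__: a real aggregation gets the function name appended to its output column, while 'groupby' keeps the original name:

key_col = AggColl('region', 'groupby')   # new_name stays 'region'
sum_col = AggColl('sales', 'sum')        # new_name becomes 'sales_sum'
named = AggColl('sales', 'mean', new_name='avg_sales')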
BasicFilter dataclass

Defines a simple, single-condition filter (e.g., 'column' 'equals' 'value').

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class BasicFilter:
    """Defines a simple, single-condition filter (e.g., 'column' 'equals' 'value')."""
    field: str = ''
    filter_type: str = ''
    filter_value: str = ''
CrossJoinInput dataclass

Bases: JoinSelectMixin

Defines the settings for a cross join operation, including column selections for both inputs.

Methods:

Name Description
__init__

Initializes the CrossJoinInput with selections for left and right tables.

auto_rename

Automatically renames columns on the right side to prevent naming conflicts.

Attributes:

Name Type Description
overlapping_records

Finds column names that would conflict after the join.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class CrossJoinInput(JoinSelectMixin):
    """Defines the settings for a cross join operation, including column selections for both inputs."""
    left_select: SelectInputs = None
    right_select: SelectInputs = None

    def __init__(self, left_select: List[SelectInput] | List[str],
                 right_select: List[SelectInput] | List[str]):
        """Initializes the CrossJoinInput with selections for left and right tables."""
        self.left_select = self.parse_select(left_select)
        self.right_select = self.parse_select(right_select)

    @property
    def overlapping_records(self):
        """Finds column names that would conflict after the join."""
        return self.left_select.new_cols & self.right_select.new_cols

    def auto_rename(self):
        """Automatically renames columns on the right side to prevent naming conflicts."""
        overlapping_records = self.overlapping_records
        while len(overlapping_records) > 0:
            for right_col in self.right_select.renames:
                if right_col.new_name in overlapping_records:
                    right_col.new_name = 'right_' + right_col.new_name
            overlapping_records = self.overlapping_records
overlapping_records property

Finds column names that would conflict after the join.

__init__(left_select, right_select)

Initializes the CrossJoinInput with selections for left and right tables.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def __init__(self, left_select: List[SelectInput] | List[str],
             right_select: List[SelectInput] | List[str]):
    """Initializes the CrossJoinInput with selections for left and right tables."""
    self.left_select = self.parse_select(left_select)
    self.right_select = self.parse_select(right_select)
auto_rename()

Automatically renames columns on the right side to prevent naming conflicts.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def auto_rename(self):
    """Automatically renames columns on the right side to prevent naming conflicts."""
    overlapping_records = self.overlapping_records
    while len(overlapping_records) > 0:
        for right_col in self.right_select.renames:
            if right_col.new_name in overlapping_records:
                right_col.new_name = 'right_' + right_col.new_name
        overlapping_records = self.overlapping_records
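
A sketch, assuming plain column-name lists are accepted by parse_select (inherited from JoinSelectMixin, documented elsewhere): a name that appears on both sides shows up in overlapping_records, and auto_rename prefixes the right-hand copy:

cross = CrossJoinInput(left_select=['id', 'name'], right_select=['id', 'price'])
cross.auto_rename()  # the right-side 'id' becomes 'right_id' so output column names stay unique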
FieldInput dataclass

Represents a single field with its name and data type, typically for defining an output column.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class FieldInput:
    """Represents a single field with its name and data type, typically for defining an output column."""
    name: str
    data_type: Optional[str] = None

    def __init__(self, name: str, data_type: str = None):
        self.name = name
        self.data_type = data_type
FilterInput dataclass

Defines the settings for a filter operation, supporting basic or advanced (expression-based) modes.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class FilterInput:
    """Defines the settings for a filter operation, supporting basic or advanced (expression-based) modes."""
    advanced_filter: str = ''
    basic_filter: BasicFilter = None
    filter_type: str = 'basic'
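
Two hedged examples of the supported modes (the column names and the expression syntax are illustrative, not prescribed by this schema):

basic = FilterInput(
    basic_filter=BasicFilter(field='country', filter_type='equals', filter_value='NL'),
    filter_type='basic',
)
advanced = FilterInput(advanced_filter='[sales] > 100', filter_type='advanced')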
FullJoinKeyResponse

Bases: NamedTuple

Holds the join key rename responses for both sides of a join.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class FullJoinKeyResponse(NamedTuple):
    """Holds the join key rename responses for both sides of a join."""
    left: JoinKeyRenameResponse
    right: JoinKeyRenameResponse
FunctionInput dataclass

Defines a formula to be applied, including the output field information.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class FunctionInput:
    """Defines a formula to be applied, including the output field information."""
    field: FieldInput
    function: str
FuzzyMap dataclass

Bases: JoinMap

Extends JoinMap with settings for fuzzy string matching, such as the algorithm and similarity threshold.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class FuzzyMap(JoinMap):
    """Extends `JoinMap` with settings for fuzzy string matching, such as the algorithm and similarity threshold."""
    threshold_score: Optional[float] = 80.0
    fuzzy_type: Optional[FuzzyTypeLiteral] = 'levenshtein'
    perc_unique: Optional[float] = 0.0
    output_column_name: Optional[str] = None
    valid: Optional[bool] = True

    def __init__(self, left_col: str, right_col: str = None, threshold_score: float = 80.0,
                 fuzzy_type: FuzzyTypeLiteral = 'levenshtein', perc_unique: float = 0, output_column_name: str = None,
                 _output_col_name: str = None, valid: bool = True, output_col_name: str = None):
        if right_col is None:
            right_col = left_col
        self.valid = valid
        self.left_col = left_col
        self.right_col = right_col
        self.threshold_score = threshold_score
        self.fuzzy_type = fuzzy_type
        self.perc_unique = perc_unique
        self.output_column_name = output_column_name if output_column_name is not None else _output_col_name
        self.output_column_name = self.output_column_name if self.output_column_name is not None else output_col_name
        if self.output_column_name is None:
            self.output_column_name = f'fuzzy_score_{self.left_col}_{self.right_col}'
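
A brief sketch (column name invented) of the defaulting in __init__: omitting right_col reuses the left column, and the score column name is derived from both:

fm = FuzzyMap('company_name', threshold_score=85.0, fuzzy_type='levenshtein')
# right_col defaults to 'company_name'; output_column_name becomes
# 'fuzzy_score_company_name_company_name'.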
FuzzyMatchInput dataclass

Bases: JoinInput

Extends JoinInput with settings specific to fuzzy matching, such as the matching algorithm and threshold.

Attributes:

Name Type Description
fuzzy_maps List[FuzzyMap]

Returns the final fuzzy mappings after applying all column renames.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class FuzzyMatchInput(JoinInput):
    """Extends `JoinInput` with settings specific to fuzzy matching, such as the matching algorithm and threshold."""
    join_mapping: List[FuzzyMap]
    aggregate_output: bool = False

    @staticmethod
    def parse_fuzz_mapping(fuzz_mapping: List[FuzzyMap] | Tuple[str, str] | str) -> List[FuzzyMap]:
        if isinstance(fuzz_mapping, (tuple, list)):
            assert len(fuzz_mapping) > 0
            if all(isinstance(fm, dict) for fm in fuzz_mapping):
                fuzz_mapping = [FuzzyMap(**fm) for fm in fuzz_mapping]

            if not isinstance(fuzz_mapping[0], FuzzyMap):
                assert len(fuzz_mapping) <= 2
                if len(fuzz_mapping) == 2:
                    assert isinstance(fuzz_mapping[0], str) and isinstance(fuzz_mapping[1], str)
                    fuzz_mapping = [FuzzyMap(*fuzz_mapping)]
                elif isinstance(fuzz_mapping[0], str):
                    fuzz_mapping = [FuzzyMap(fuzz_mapping[0], fuzz_mapping[0])]
        elif isinstance(fuzz_mapping, str):
            fuzz_mapping = [FuzzyMap(fuzz_mapping, fuzz_mapping)]
        elif isinstance(fuzz_mapping, FuzzyMap):
            fuzz_mapping = [fuzz_mapping]
        else:
            raise Exception('No valid join mapping as input')
        return fuzz_mapping

    def __init__(self, join_mapping: List[FuzzyMap] | Tuple[str, str] | str, left_select: List[SelectInput] | List[str],
                 right_select: List[SelectInput] | List[str], aggregate_output: bool = False, how: JoinStrategy = 'inner'):
        self.join_mapping = self.parse_fuzz_mapping(join_mapping)
        self.left_select = self.parse_select(left_select)
        self.right_select = self.parse_select(right_select)
        self.how = how
        for jm in self.join_mapping:

            if jm.right_col not in self.right_select.old_cols:
                self.right_select.append(SelectInput(jm.right_col, keep=False, join_key=True))
            if jm.left_col not in self.left_select.old_cols:
                self.left_select.append(SelectInput(jm.left_col, keep=False, join_key=True))
        [setattr(v, "join_key", v.old_name in self._left_join_keys) for v in self.left_select.renames]
        [setattr(v, "join_key", v.old_name in self._right_join_keys) for v in self.right_select.renames]
        self.aggregate_output = aggregate_output

    @property
    def overlapping_records(self):
        return self.left_select.new_cols & self.right_select.new_cols

    @property
    def fuzzy_maps(self) -> List[FuzzyMap]:
        """Returns the final fuzzy mappings after applying all column renames."""
        new_mappings = []
        left_rename_table, right_rename_table = self.left_select.rename_table, self.right_select.rename_table
        for org_fuzzy_map in self.join_mapping:
            right_col = right_rename_table.get(org_fuzzy_map.right_col)
            left_col = left_rename_table.get(org_fuzzy_map.left_col)
            if right_col != org_fuzzy_map.right_col or left_col != org_fuzzy_map.left_col:
                new_mapping = deepcopy(org_fuzzy_map)
                new_mapping.left_col = left_col
                new_mapping.right_col = right_col
                new_mappings.append(new_mapping)
            else:
                new_mappings.append(org_fuzzy_map)
        return new_mappings
fuzzy_maps property

Returns the final fuzzy mappings after applying all column renames.

GraphSolverInput dataclass

Defines settings for a graph-solving operation (e.g., finding connected components).

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class GraphSolverInput:
    """Defines settings for a graph-solving operation (e.g., finding connected components)."""
    col_from: str
    col_to: str
    output_column_name: Optional[str] = 'graph_group'
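
A minimal example (column names invented) of wiring an edge list into the solver settings:

edges = GraphSolverInput(col_from='parent', col_to='child')
# Each connected component is labelled in the default 'graph_group' output column.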
GroupByInput dataclass

A data class that represents the input for a group by operation.

Attributes

group_columns : List[str]

A list of column names to group the DataFrame by. These column(s) will be set as the DataFrame index.

agg_cols : List[AggColl]

A list of AggColl objects that specify the aggregation operations to perform on the DataFrame columns after grouping. Each AggColl object should specify the column to be aggregated and the aggregation function to use.

Example

group_by_input = GroupByInput(
    agg_cols=[
        AggColl(old_name='ix', agg='groupby'),
        AggColl(old_name='groups', agg='groupby'),
        AggColl(old_name='col1', agg='sum'),
        AggColl(old_name='col2', agg='mean'),
    ]
)

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class GroupByInput:
    """
    A data class that represents the input for a group by operation.

    Attributes
    ----------
    group_columns : List[str]
        A list of column names to group the DataFrame by. These column(s) will be set as the DataFrame index.

    agg_cols : List[AggColl]
        A list of `AggColl` objects that specify the aggregation operations to perform on the DataFrame columns
        after grouping. Each `AggColl` object should specify the column to be aggregated and the aggregation
        function to use.

    Example
    --------
    group_by_input = GroupByInput(
        agg_cols=[AggColl(old_name='ix', agg='groupby'), AggColl(old_name='groups', agg='groupby'), AggColl(old_name='col1', agg='sum'), AggColl(old_name='col2', agg='mean')]
    )
    """
    agg_cols: List[AggColl]
JoinInput dataclass

Bases: JoinSelectMixin

Defines the settings for a standard SQL-style join, including keys, strategy, and selections.

Methods:

Name Description
__init__

Initializes the JoinInput with keys, selections, and join strategy.

auto_rename

Automatically renames columns on the right side to prevent naming conflicts.

get_join_key_renames

Gets the temporary rename mappings for the join keys on both sides.

parse_join_mapping

Parses various input formats for join keys into a standardized list of JoinMap objects.

set_join_keys

Marks the SelectInput objects corresponding to join keys.

Attributes:

Name Type Description
left_join_keys List[str]

Returns an ordered list of the left-side join key column names to be used in the join.

right_join_keys List[str]

Returns an ordered list of the right-side join key column names to be used in the join.

used_join_mapping List[JoinMap]

Returns the final join mapping after applying all renames and transformations.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class JoinInput(JoinSelectMixin):
    """Defines the settings for a standard SQL-style join, including keys, strategy, and selections."""
    join_mapping: List[JoinMap]
    left_select: JoinInputs = None
    right_select: JoinInputs = None
    how: JoinStrategy = 'inner'

    @staticmethod
    def parse_join_mapping(join_mapping: any) -> List[JoinMap]:
        """Parses various input formats for join keys into a standardized list of `JoinMap` objects."""
        if isinstance(join_mapping, (tuple, list)):
            assert len(join_mapping) > 0
            if all(isinstance(jm, dict) for jm in join_mapping):
                join_mapping = [JoinMap(**jm) for jm in join_mapping]

            if not isinstance(join_mapping[0], JoinMap):
                assert len(join_mapping) <= 2
                if len(join_mapping) == 2:
                    assert isinstance(join_mapping[0], str) and isinstance(join_mapping[1], str)
                    join_mapping = [JoinMap(*join_mapping)]
                elif isinstance(join_mapping[0], str):
                    join_mapping = [JoinMap(join_mapping[0], join_mapping[0])]
        elif isinstance(join_mapping, str):
            join_mapping = [JoinMap(join_mapping, join_mapping)]
        else:
            raise Exception('No valid join mapping as input')
        return join_mapping

    def __init__(self, join_mapping: List[JoinMap] | Tuple[str, str] | str,
                 left_select: List[SelectInput] | List[str],
                 right_select: List[SelectInput] | List[str],
                 how: JoinStrategy = 'inner'):
        """Initializes the JoinInput with keys, selections, and join strategy."""
        self.join_mapping = self.parse_join_mapping(join_mapping)
        self.left_select = self.parse_select(left_select)
        self.right_select = self.parse_select(right_select)
        self.set_join_keys()
        self.how = how

    def set_join_keys(self):
        """Marks the `SelectInput` objects corresponding to join keys."""
        [setattr(v, "join_key", v.old_name in self._left_join_keys) for v in self.left_select.renames]
        [setattr(v, "join_key", v.old_name in self._right_join_keys) for v in self.right_select.renames]

    def get_join_key_renames(self, filter_drop: bool = False) -> FullJoinKeyResponse:
        """Gets the temporary rename mappings for the join keys on both sides."""
        return FullJoinKeyResponse(self.left_select.get_join_key_renames(side="left", filter_drop=filter_drop),
                                   self.right_select.get_join_key_renames(side="right", filter_drop=filter_drop))

    def get_names_for_table_rename(self) -> List[JoinMap]:
        new_mappings: List[JoinMap] = []
        left_rename_table, right_rename_table = self.left_select.rename_table, self.right_select.rename_table
        for join_map in self.join_mapping:
            new_mappings.append(JoinMap(left_rename_table.get(join_map.left_col, join_map.left_col),
                                        right_rename_table.get(join_map.right_col, join_map.right_col)
                                        )
                                )
        return new_mappings

    @property
    def _left_join_keys(self) -> Set:
        """Returns a set of the left-side join key column names."""
        return set(jm.left_col for jm in self.join_mapping)

    @property
    def _right_join_keys(self) -> Set:
        """Returns a set of the right-side join key column names."""
        return set(jm.right_col for jm in self.join_mapping)

    @property
    def left_join_keys(self) -> List[str]:
        """Returns an ordered list of the left-side join key column names to be used in the join."""
        return [jm.left_col for jm in self.used_join_mapping]

    @property
    def right_join_keys(self) -> List[str]:
        """Returns an ordered list of the right-side join key column names to be used in the join."""
        return [jm.right_col for jm in self.used_join_mapping]

    @property
    def overlapping_records(self):
        if self.how in ('left', 'right', 'inner'):
            return self.left_select.new_cols & self.right_select.new_cols
        else:
            return self.left_select.new_cols & self.right_select.new_cols

    def auto_rename(self):
        """Automatically renames columns on the right side to prevent naming conflicts."""
        self.set_join_keys()
        overlapping_records = self.overlapping_records
        while len(overlapping_records) > 0:
            for right_col in self.right_select.renames:
                if right_col.new_name in overlapping_records:
                    right_col.new_name = right_col.new_name + '_right'
            overlapping_records = self.overlapping_records

    @property
    def used_join_mapping(self) -> List[JoinMap]:
        """Returns the final join mapping after applying all renames and transformations."""
        new_mappings: List[JoinMap] = []
        left_rename_table, right_rename_table = self.left_select.rename_table, self.right_select.rename_table
        left_join_rename_mapping: Dict[str, str] = self.left_select.get_join_key_rename_mapping("left")
        right_join_rename_mapping: Dict[str, str] = self.right_select.get_join_key_rename_mapping("right")
        for join_map in self.join_mapping:
            # del self.right_select.rename_table, self.left_select.rename_table
            new_mappings.append(JoinMap(left_join_rename_mapping.get(left_rename_table.get(join_map.left_col, join_map.left_col)),
                                        right_join_rename_mapping.get(right_rename_table.get(join_map.right_col, join_map.right_col))
                                        )
                                )
        return new_mappings
left_join_keys property

Returns an ordered list of the left-side join key column names to be used in the join.

right_join_keys property

Returns an ordered list of the right-side join key column names to be used in the join.

used_join_mapping property

Returns the final join mapping after applying all renames and transformations.

__init__(join_mapping, left_select, right_select, how='inner')

Initializes the JoinInput with keys, selections, and join strategy.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def __init__(self, join_mapping: List[JoinMap] | Tuple[str, str] | str,
             left_select: List[SelectInput] | List[str],
             right_select: List[SelectInput] | List[str],
             how: JoinStrategy = 'inner'):
    """Initializes the JoinInput with keys, selections, and join strategy."""
    self.join_mapping = self.parse_join_mapping(join_mapping)
    self.left_select = self.parse_select(left_select)
    self.right_select = self.parse_select(right_select)
    self.set_join_keys()
    self.how = how
auto_rename()

Automatically renames columns on the right side to prevent naming conflicts.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def auto_rename(self):
    """Automatically renames columns on the right side to prevent naming conflicts."""
    self.set_join_keys()
    overlapping_records = self.overlapping_records
    while len(overlapping_records) > 0:
        for right_col in self.right_select.renames:
            if right_col.new_name in overlapping_records:
                right_col.new_name = right_col.new_name + '_right'
        overlapping_records = self.overlapping_records
get_join_key_renames(filter_drop=False)

Gets the temporary rename mappings for the join keys on both sides.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_join_key_renames(self, filter_drop: bool = False) -> FullJoinKeyResponse:
    """Gets the temporary rename mappings for the join keys on both sides."""
    return FullJoinKeyResponse(self.left_select.get_join_key_renames(side="left", filter_drop=filter_drop),
                               self.right_select.get_join_key_renames(side="right", filter_drop=filter_drop))
parse_join_mapping(join_mapping) staticmethod

Parses various input formats for join keys into a standardized list of JoinMap objects.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@staticmethod
def parse_join_mapping(join_mapping: any) -> List[JoinMap]:
    """Parses various input formats for join keys into a standardized list of `JoinMap` objects."""
    if isinstance(join_mapping, (tuple, list)):
        assert len(join_mapping) > 0
        if all(isinstance(jm, dict) for jm in join_mapping):
            join_mapping = [JoinMap(**jm) for jm in join_mapping]

        if not isinstance(join_mapping[0], JoinMap):
            assert len(join_mapping) <= 2
            if len(join_mapping) == 2:
                assert isinstance(join_mapping[0], str) and isinstance(join_mapping[1], str)
                join_mapping = [JoinMap(*join_mapping)]
            elif isinstance(join_mapping[0], str):
                join_mapping = [JoinMap(join_mapping[0], join_mapping[0])]
    elif isinstance(join_mapping, str):
        join_mapping = [JoinMap(join_mapping, join_mapping)]
    else:
        raise Exception('No valid join mapping as input')
    return join_mapping
set_join_keys()

Marks the SelectInput objects corresponding to join keys.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def set_join_keys(self):
    """Marks the `SelectInput` objects corresponding to join keys."""
    [setattr(v, "join_key", v.old_name in self._left_join_keys) for v in self.left_select.renames]
    [setattr(v, "join_key", v.old_name in self._right_join_keys) for v in self.right_select.renames]
JoinInputs

Bases: SelectInputs

Extends SelectInputs with functionality specific to join operations, like handling join keys.

Methods:

Name Description
get_join_key_rename_mapping

Returns a dictionary mapping original join key names to their temporary names.

get_join_key_renames

Gets the temporary rename mapping for all join keys on one side of a join.

Attributes:

Name Type Description
join_key_selects List[SelectInput]

Returns only the SelectInput objects that are marked as join keys.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class JoinInputs(SelectInputs):
    """Extends `SelectInputs` with functionality specific to join operations, like handling join keys."""

    def __init__(self, renames: List[SelectInput]):
        self.renames = renames

    @property
    def join_key_selects(self) -> List[SelectInput]:
        """Returns only the `SelectInput` objects that are marked as join keys."""
        return [v for v in self.renames if v.join_key]

    def get_join_key_renames(self, side: SideLit, filter_drop: bool = False) -> JoinKeyRenameResponse:
        """Gets the temporary rename mapping for all join keys on one side of a join."""
        return JoinKeyRenameResponse(
            side,
            [JoinKeyRename(jk.new_name,
                           construct_join_key_name(side, jk.new_name))
             for jk in self.join_key_selects if jk.keep or not filter_drop]
        )

    def get_join_key_rename_mapping(self, side: SideLit) -> Dict[str, str]:
        """Returns a dictionary mapping original join key names to their temporary names."""
        return {jkr[0]: jkr[1] for jkr in self.get_join_key_renames(side)[1]}
join_key_selects property

Returns only the SelectInput objects that are marked as join keys.

get_join_key_rename_mapping(side)

Returns a dictionary mapping original join key names to their temporary names.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_join_key_rename_mapping(self, side: SideLit) -> Dict[str, str]:
    """Returns a dictionary mapping original join key names to their temporary names."""
    return {jkr[0]: jkr[1] for jkr in self.get_join_key_renames(side)[1]}
get_join_key_renames(side, filter_drop=False)

Gets the temporary rename mapping for all join keys on one side of a join.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_join_key_renames(self, side: SideLit, filter_drop: bool = False) -> JoinKeyRenameResponse:
    """Gets the temporary rename mapping for all join keys on one side of a join."""
    return JoinKeyRenameResponse(
        side,
        [JoinKeyRename(jk.new_name,
                       construct_join_key_name(side, jk.new_name))
         for jk in self.join_key_selects if jk.keep or not filter_drop]
    )
JoinKeyRename

Bases: NamedTuple

Represents the renaming of a join key from its original to a temporary name.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class JoinKeyRename(NamedTuple):
    """Represents the renaming of a join key from its original to a temporary name."""
    original_name: str
    temp_name: str
JoinKeyRenameResponse

Bases: NamedTuple

Contains a list of join key renames for one side of a join.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class JoinKeyRenameResponse(NamedTuple):
    """Contains a list of join key renames for one side of a join."""
    side: SideLit
    join_key_renames: List[JoinKeyRename]
JoinMap dataclass

Defines a single mapping between a left and right column for a join key.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class JoinMap:
    """Defines a single mapping between a left and right column for a join key."""
    left_col: str
    right_col: str
JoinSelectMixin

A mixin providing common methods for join-like operations that involve left and right inputs.

Methods:

Name Description
add_new_select_column

Adds a new column to the selection for either the left or right side.

auto_generate_new_col_name

Generates a new, non-conflicting column name by adding a suffix if necessary.

parse_select

Parses various input formats into a standardized JoinInputs object.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class JoinSelectMixin:
    """A mixin providing common methods for join-like operations that involve left and right inputs."""
    left_select: JoinInputs = None
    right_select: JoinInputs = None

    @staticmethod
    def parse_select(select: List[SelectInput] | List[str] | List[Dict]) -> JoinInputs | None:
        """Parses various input formats into a standardized `JoinInputs` object."""
        if all(isinstance(c, SelectInput) for c in select):
            return JoinInputs(select)
        elif all(isinstance(c, dict) for c in select):
            return JoinInputs([SelectInput(**c.__dict__) for c in select])
        elif isinstance(select, dict):
            renames = select.get('renames')
            if renames:
                return JoinInputs([SelectInput(**c) for c in renames])
        elif all(isinstance(c, str) for c in select):
            return JoinInputs([SelectInput(s, s) for s in select])

    def auto_generate_new_col_name(self, old_col_name: str, side: str) -> str:
        """Generates a new, non-conflicting column name by adding a suffix if necessary."""
        current_names = self.left_select.new_cols & self.right_select.new_cols
        if old_col_name not in current_names:
            return old_col_name
        while True:
            if old_col_name not in current_names:
                return old_col_name
            old_col_name = f'{side}_{old_col_name}'

    def add_new_select_column(self, select_input: SelectInput, side: str):
        """Adds a new column to the selection for either the left or right side."""
        selects = self.right_select if side == 'right' else self.left_select
        select_input.new_name = self.auto_generate_new_col_name(select_input.old_name, side=side)
        selects.__add__(select_input)
add_new_select_column(select_input, side)

Adds a new column to the selection for either the left or right side.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def add_new_select_column(self, select_input: SelectInput, side: str):
    """Adds a new column to the selection for either the left or right side."""
    selects = self.right_select if side == 'right' else self.left_select
    select_input.new_name = self.auto_generate_new_col_name(select_input.old_name, side=side)
    selects.__add__(select_input)
auto_generate_new_col_name(old_col_name, side)

Generates a new, non-conflicting column name by adding a suffix if necessary.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def auto_generate_new_col_name(self, old_col_name: str, side: str) -> str:
    """Generates a new, non-conflicting column name by adding a suffix if necessary."""
    current_names = self.left_select.new_cols & self.right_select.new_cols
    if old_col_name not in current_names:
        return old_col_name
    while True:
        if old_col_name not in current_names:
            return old_col_name
        old_col_name = f'{side}_{old_col_name}'
parse_select(select) staticmethod

Parses various input formats into a standardized JoinInputs object.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@staticmethod
def parse_select(select: List[SelectInput] | List[str] | List[Dict]) -> JoinInputs | None:
    """Parses various input formats into a standardized `JoinInputs` object."""
    if all(isinstance(c, SelectInput) for c in select):
        return JoinInputs(select)
    elif all(isinstance(c, dict) for c in select):
        return JoinInputs([SelectInput(**c.__dict__) for c in select])
    elif isinstance(select, dict):
        renames = select.get('renames')
        if renames:
            return JoinInputs([SelectInput(**c) for c in renames])
    elif all(isinstance(c, str) for c in select):
        return JoinInputs([SelectInput(s, s) for s in select])
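
A rough illustration of the parsing shown above; parse_select accepts plain column names or ready-made SelectInput objects (names are illustrative):

selects = JoinInput.parse_select(["id", "name"])
selects.new_cols          # {"id", "name"}

selects = JoinInput.parse_select([SelectInput("id", "order_id"), SelectInput("name")])
selects.rename_table      # {"id": "order_id", "name": "name"}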
PivotInput dataclass

Defines the settings for a pivot (long-to-wide) operation.

Methods:

Name Description
get_group_by_input

Constructs the GroupByInput needed for the pre-aggregation step of the pivot.

get_pivot_column

Returns the pivot column as a Polars column expression.

get_values_expr

Creates the struct expression used to gather the values for pivoting.

Attributes:

Name Type Description
grouped_columns List[str]

Returns the list of columns to be used for the initial grouping stage of the pivot.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class PivotInput:
    """Defines the settings for a pivot (long-to-wide) operation."""
    index_columns: List[str]
    pivot_column: str
    value_col: str
    aggregations: List[str]

    @property
    def grouped_columns(self) -> List[str]:
        """Returns the list of columns to be used for the initial grouping stage of the pivot."""
        return self.index_columns + [self.pivot_column]

    def get_group_by_input(self) -> GroupByInput:
        """Constructs the `GroupByInput` needed for the pre-aggregation step of the pivot."""
        group_by_cols = [AggColl(c, 'groupby') for c in self.grouped_columns]
        agg_cols = [AggColl(self.value_col, agg=aggregation, new_name=aggregation) for aggregation in self.aggregations]
        return GroupByInput(group_by_cols+agg_cols)

    def get_index_columns(self) -> List[pl.col]:
        return [pl.col(c) for c in self.index_columns]

    def get_pivot_column(self) -> pl.Expr:
        """Returns the pivot column as a Polars column expression."""
        return pl.col(self.pivot_column)

    def get_values_expr(self) -> pl.Expr:
        """Creates the struct expression used to gather the values for pivoting."""
        return pl.struct([pl.col(c) for c in self.aggregations]).alias('vals')
grouped_columns property

Returns the list of columns to be used for the initial grouping stage of the pivot.

get_group_by_input()

Constructs the GroupByInput needed for the pre-aggregation step of the pivot.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_group_by_input(self) -> GroupByInput:
    """Constructs the `GroupByInput` needed for the pre-aggregation step of the pivot."""
    group_by_cols = [AggColl(c, 'groupby') for c in self.grouped_columns]
    agg_cols = [AggColl(self.value_col, agg=aggregation, new_name=aggregation) for aggregation in self.aggregations]
    return GroupByInput(group_by_cols+agg_cols)
get_pivot_column()

Returns the pivot column as a Polars column expression.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_pivot_column(self) -> pl.Expr:
    """Returns the pivot column as a Polars column expression."""
    return pl.col(self.pivot_column)
get_values_expr()

Creates the struct expression used to gather the values for pivoting.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_values_expr(self) -> pl.Expr:
    """Creates the struct expression used to gather the values for pivoting."""
    return pl.struct([pl.col(c) for c in self.aggregations]).alias('vals')
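
A small sketch of how PivotInput feeds the pre-aggregation step; the column names and aggregations are illustrative:

pivot = PivotInput(
    index_columns=["region"],
    pivot_column="year",
    value_col="revenue",
    aggregations=["sum", "mean"],
)
pivot.grouped_columns          # ["region", "year"]
pivot.get_group_by_input()     # GroupByInput grouping by region/year, aggregating revenue as sum and mean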
PolarsCodeInput dataclass

A simple container for a string of user-provided Polars code to be executed.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class PolarsCodeInput:
    """A simple container for a string of user-provided Polars code to be executed."""
    polars_code: str
RecordIdInput dataclass

Defines settings for adding a record ID (row number) column to the data.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class RecordIdInput:
    """Defines settings for adding a record ID (row number) column to the data."""
    output_column_name: str = 'record_id'
    offset: int = 1
    group_by: Optional[bool] = False
    group_by_columns: Optional[List[str]] = field(default_factory=list)
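
For example, a record ID that restarts numbering per customer might be configured as follows (illustrative values):

record_id_settings = RecordIdInput(
    output_column_name="row_nr",
    offset=1,
    group_by=True,
    group_by_columns=["customer_id"],
)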
SelectInput dataclass

Defines how a single column should be selected, renamed, or type-cast.

This is a core building block for any operation that involves column manipulation. It holds all the configuration for a single field in a selection operation.

Attributes:

Name Type Description
polars_type str

Translates a user-friendly type name to a Polars data type string.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class SelectInput:
    """Defines how a single column should be selected, renamed, or type-cast.

    This is a core building block for any operation that involves column manipulation.
    It holds all the configuration for a single field in a selection operation.
    """
    old_name: str
    original_position: Optional[int] = None
    new_name: Optional[str] = None
    data_type: Optional[str] = None
    data_type_change: Optional[bool] = False
    join_key: Optional[bool] = False
    is_altered: Optional[bool] = False
    position: Optional[int] = None
    is_available: Optional[bool] = True
    keep: Optional[bool] = True

    def __hash__(self):
        return hash(self.old_name)

    def __init__(self, old_name: str, new_name: str = None, keep: bool = True, data_type: str = None,
                 data_type_change: bool = False, join_key: bool = False, is_altered: bool = False,
                 is_available: bool = True, position: int = None):
        self.old_name = old_name
        if new_name is None:
            new_name = old_name
        self.new_name = new_name
        self.keep = keep
        self.data_type = data_type
        self.data_type_change = data_type_change
        self.join_key = join_key
        self.is_altered = is_altered
        self.is_available = is_available
        self.position = position

    @property
    def polars_type(self) -> str:
        """Translates a user-friendly type name to a Polars data type string."""
        if self.data_type.lower() == 'string':
            return 'Utf8'
        elif self.data_type.lower() == 'integer':
            return 'Int64'
        elif self.data_type.lower() == 'double':
            return 'Float64'
        return self.data_type
polars_type property

Translates a user-friendly type name to a Polars data type string.
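
A brief sketch of SelectInput and the polars_type translation shown above; the column names are illustrative:

col = SelectInput("customer_name", new_name="name", data_type="string", data_type_change=True)
col.polars_type          # "Utf8"
col_int = SelectInput("age", data_type="integer")
col_int.polars_type      # "Int64"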

SelectInputs dataclass

A container for a list of SelectInput objects, providing helper methods for managing selections.

Methods:

Name Description
__add__

Allows adding a SelectInput using the '+' operator.

append

Appends a new SelectInput to the list of renames.

create_from_list

Creates a SelectInputs object from a simple list of column names.

create_from_pl_df

Creates a SelectInputs object from a Polars DataFrame's columns.

get_select_cols

Gets a list of original column names to select from the source DataFrame.

remove_select_input

Removes a SelectInput from the list based on its original name.

unselect_field

Marks a field to be dropped from the final selection by setting keep to False.

Attributes:

Name Type Description
new_cols Set

Returns a set of new (renamed) column names to be kept in the selection.

old_cols Set

Returns a set of original column names to be kept in the selection.

rename_table

Generates a dictionary for use in Polars' .rename() method.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class SelectInputs:
    """A container for a list of `SelectInput` objects, providing helper methods for managing selections."""
    renames: List[SelectInput]

    @property
    def old_cols(self) -> Set:
        """Returns a set of original column names to be kept in the selection."""
        return set(v.old_name for v in self.renames if v.keep)

    @property
    def new_cols(self) -> Set:
        """Returns a set of new (renamed) column names to be kept in the selection."""
        return set(v.new_name for v in self.renames if v.keep)

    @property
    def rename_table(self):
        """Generates a dictionary for use in Polars' `.rename()` method."""
        return {v.old_name: v.new_name for v in self.renames if v.is_available and (v.keep or v.join_key)}

    def get_select_cols(self, include_join_key: bool = True):
        """Gets a list of original column names to select from the source DataFrame."""
        return [v.old_name for v in self.renames if v.keep or (v.join_key and include_join_key)]

    def __add__(self, other: "SelectInput"):
        """Allows adding a SelectInput using the '+' operator."""
        self.renames.append(other)

    def append(self, other: "SelectInput"):
        """Appends a new SelectInput to the list of renames."""
        self.renames.append(other)

    def remove_select_input(self, old_key: str):
        """Removes a SelectInput from the list based on its original name."""
        self.renames = [rename for rename in self.renames if rename.old_name != old_key]

    def unselect_field(self, old_key: str):
        """Marks a field to be dropped from the final selection by setting `keep` to False."""
        for rename in self.renames:
            if old_key == rename.old_name:
                rename.keep = False

    @classmethod
    def create_from_list(cls, col_list: List[str]):
        """Creates a SelectInputs object from a simple list of column names."""
        return cls([SelectInput(c) for c in col_list])

    @classmethod
    def create_from_pl_df(cls, df: pl.DataFrame | pl.LazyFrame):
        """Creates a SelectInputs object from a Polars DataFrame's columns."""
        return cls([SelectInput(c) for c in df.columns])

    def get_select_input_on_old_name(self, old_name: str) -> SelectInput | None:
        return next((v for v in self.renames if v.old_name == old_name), None)

    def get_select_input_on_new_name(self, old_name: str) -> SelectInput | None:
        return next((v for v in self.renames if v.new_name == old_name), None)
new_cols property

Returns a set of new (renamed) column names to be kept in the selection.

old_cols property

Returns a set of original column names to be kept in the selection.

rename_table property

Generates a dictionary for use in Polars' .rename() method.

__add__(other)

Allows adding a SelectInput using the '+' operator.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def __add__(self, other: "SelectInput"):
    """Allows adding a SelectInput using the '+' operator."""
    self.renames.append(other)
append(other)

Appends a new SelectInput to the list of renames.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def append(self, other: "SelectInput"):
    """Appends a new SelectInput to the list of renames."""
    self.renames.append(other)
create_from_list(col_list) classmethod

Creates a SelectInputs object from a simple list of column names.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@classmethod
def create_from_list(cls, col_list: List[str]):
    """Creates a SelectInputs object from a simple list of column names."""
    return cls([SelectInput(c) for c in col_list])
create_from_pl_df(df) classmethod

Creates a SelectInputs object from a Polars DataFrame's columns.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@classmethod
def create_from_pl_df(cls, df: pl.DataFrame | pl.LazyFrame):
    """Creates a SelectInputs object from a Polars DataFrame's columns."""
    return cls([SelectInput(c) for c in df.columns])
get_select_cols(include_join_key=True)

Gets a list of original column names to select from the source DataFrame.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_select_cols(self, include_join_key: bool = True):
    """Gets a list of original column names to select from the source DataFrame."""
    return [v.old_name for v in self.renames if v.keep or (v.join_key and include_join_key)]
remove_select_input(old_key)

Removes a SelectInput from the list based on its original name.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def remove_select_input(self, old_key: str):
    """Removes a SelectInput from the list based on its original name."""
    self.renames = [rename for rename in self.renames if rename.old_name != old_key]
unselect_field(old_key)

Marks a field to be dropped from the final selection by setting keep to False.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def unselect_field(self, old_key: str):
    """Marks a field to be dropped from the final selection by setting `keep` to False."""
    for rename in self.renames:
        if old_key == rename.old_name:
            rename.keep = False
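
A minimal sketch of working with SelectInputs, using illustrative column names:

selection = SelectInputs.create_from_list(["id", "name", "age"])
selection.unselect_field("age")     # keep the column definition but drop it from the output
selection.rename_table              # {"id": "id", "name": "name"}
selection.get_select_cols()         # ["id", "name"]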
SortByInput dataclass

Defines a single sort condition on a column, including the direction.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class SortByInput:
    """Defines a single sort condition on a column, including the direction."""
    column: str
    how: str = 'asc'
TextToRowsInput dataclass

Defines settings for splitting a text column into multiple rows based on a delimiter.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class TextToRowsInput:
    """Defines settings for splitting a text column into multiple rows based on a delimiter."""
    column_to_split: str
    output_column_name: Optional[str] = None
    split_by_fixed_value: Optional[bool] = True
    split_fixed_value: Optional[str] = ','
    split_by_column: Optional[str] = None
UnionInput dataclass

Defines settings for a union (concatenation) operation.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class UnionInput:
    """Defines settings for a union (concatenation) operation."""
    mode: Literal['selective', 'relaxed'] = 'relaxed'
UniqueInput dataclass

Defines settings for a uniqueness operation, specifying columns and which row to keep.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class UniqueInput:
    """Defines settings for a uniqueness operation, specifying columns and which row to keep."""
    columns: Optional[List[str]] = None
    strategy: Literal["first", "last", "any", "none"] = "any"
UnpivotInput dataclass

Defines settings for an unpivot (wide-to-long) operation.

Methods:

Name Description
__post_init__

Ensures that list attributes are initialized correctly if they are None.

Attributes:

Name Type Description
data_type_selector_expr Optional[Callable]

Returns a Polars selector function based on the data_type_selector string.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class UnpivotInput:
    """Defines settings for an unpivot (wide-to-long) operation."""
    index_columns: Optional[List[str]] = field(default_factory=list)
    value_columns: Optional[List[str]] = field(default_factory=list)
    data_type_selector: Optional[Literal['float', 'all', 'date', 'numeric', 'string']] = None
    data_type_selector_mode: Optional[Literal['data_type', 'column']] = 'column'

    def __post_init__(self):
        """Ensures that list attributes are initialized correctly if they are None."""
        if self.index_columns is None:
            self.index_columns = []
        if self.value_columns is None:
            self.value_columns = []
        if self.data_type_selector_mode is None:
            self.data_type_selector_mode = 'column'

    @property
    def data_type_selector_expr(self) -> Optional[Callable]:
        """Returns a Polars selector function based on the `data_type_selector` string."""
        if self.data_type_selector_mode == 'data_type':
            if self.data_type_selector is not None:
                try:
                    return getattr(selectors, self.data_type_selector)
                except Exception as e:
                    print(f'Could not find the selector: {self.data_type_selector}')
                    return selectors.all
            return selectors.all
data_type_selector_expr property

Returns a Polars selector function based on the data_type_selector string.

__post_init__()

Ensures that list attributes are initialized correctly if they are None.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def __post_init__(self):
    """Ensures that list attributes are initialized correctly if they are None."""
    if self.index_columns is None:
        self.index_columns = []
    if self.value_columns is None:
        self.value_columns = []
    if self.data_type_selector_mode is None:
        self.data_type_selector_mode = 'column'
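
As a rough sketch, value columns can be chosen by data type rather than by name; this relies on the data_type_selector_expr property above (the index column is illustrative):

unpivot = UnpivotInput(
    index_columns=["id"],
    data_type_selector="numeric",
    data_type_selector_mode="data_type",
)
unpivot.data_type_selector_expr     # the polars.selectors.numeric selector function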
construct_join_key_name(side, column_name)

Creates a temporary, unique name for a join key column.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def construct_join_key_name(side: SideLit, column_name: str) -> str:
    """Creates a temporary, unique name for a join key column."""
    return "_FLOWFILE_JOIN_KEY_" + side.upper() + "_" + column_name
get_func_type_mapping(func)

Infers the output data type of common aggregation functions.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_func_type_mapping(func: str):
    """Infers the output data type of common aggregation functions."""
    if func in ["mean", "avg", "median", "std", "var"]:
        return "Float64"
    elif func in ['min', 'max', 'first', 'last', "cumsum", "sum"]:
        return None
    elif func in ['count', 'n_unique']:
        return "Int64"
    elif func in ['concat']:
        return "Utf8"
string_concat(*column)

A simple wrapper to concatenate string columns in Polars.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def string_concat(*column: str):
    """A simple wrapper to concatenate string columns in Polars."""
    return pl.col(column).cast(pl.Utf8).str.concat(delimiter=',')

cloud_storage_schemas

flowfile_core.schemas.cloud_storage_schemas

Cloud storage connection schemas for S3, ADLS, and other cloud providers.

Classes:

Name Description
AuthSettingsInput

The information the user must provide to specify how to connect to the cloud provider.

CloudStorageReadSettings

Settings for reading from cloud storage

CloudStorageSettings

Settings for cloud storage nodes in the visual designer

CloudStorageWriteSettings

Settings for writing to cloud storage

CloudStorageWriteSettingsWorkerInterface

Settings for writing to cloud storage in worker context

FullCloudStorageConnection

Internal model with decrypted secrets

FullCloudStorageConnectionInterface

API response model - no secrets exposed

FullCloudStorageConnectionWorkerInterface

Internal model with decrypted secrets

WriteSettingsWorkerInterface

Settings for writing to cloud storage

Functions:

Name Description
encrypt_for_worker

Encrypts a secret value for use in worker contexts.

get_cloud_storage_write_settings_worker_interface

Convert to a worker interface model with hashed secrets.

AuthSettingsInput pydantic-model

Bases: BaseModel

The information the user must provide to specify how to connect to the cloud provider.

Show JSON schema:
{
  "description": "The information needed for the user to provide the details that are needed to provide how to connect to the\n Cloud provider",
  "properties": {
    "storage_type": {
      "enum": [
        "s3",
        "adls",
        "gcs"
      ],
      "title": "Storage Type",
      "type": "string"
    },
    "auth_method": {
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Method",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "None",
      "title": "Connection Name"
    }
  },
  "required": [
    "storage_type",
    "auth_method"
  ],
  "title": "AuthSettingsInput",
  "type": "object"
}

Fields:

  • storage_type (CloudStorageType)
  • auth_method (AuthMethod)
  • connection_name (Optional[str])
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class AuthSettingsInput(BaseModel):
    """
    The information needed for the user to provide the details that are needed to provide how to connect to the
     Cloud provider
    """
    storage_type: CloudStorageType
    auth_method: AuthMethod
    connection_name: Optional[str] = "None"  # This is the reference to the item we will fetch that contains the data
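
A minimal sketch, using an illustrative connection name:

auth_settings = AuthSettingsInput(
    storage_type="s3",
    auth_method="access_key",
    connection_name="my-s3-connection",
)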
CloudStorageReadSettings pydantic-model

Bases: CloudStorageSettings

Settings for reading from cloud storage

Show JSON schema:
{
  "description": "Settings for reading from cloud storage",
  "properties": {
    "auth_mode": {
      "default": "auto",
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Mode",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Connection Name"
    },
    "resource_path": {
      "title": "Resource Path",
      "type": "string"
    },
    "scan_mode": {
      "default": "single_file",
      "enum": [
        "single_file",
        "directory"
      ],
      "title": "Scan Mode",
      "type": "string"
    },
    "file_format": {
      "default": "parquet",
      "enum": [
        "csv",
        "parquet",
        "json",
        "delta",
        "iceberg"
      ],
      "title": "File Format",
      "type": "string"
    },
    "csv_has_header": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Csv Has Header"
    },
    "csv_delimiter": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": ",",
      "title": "Csv Delimiter"
    },
    "csv_encoding": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "utf8",
      "title": "Csv Encoding"
    },
    "delta_version": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Delta Version"
    }
  },
  "required": [
    "resource_path"
  ],
  "title": "CloudStorageReadSettings",
  "type": "object"
}

Fields:

  • auth_mode (AuthMethod)
  • connection_name (Optional[str])
  • resource_path (str)
  • scan_mode (Literal['single_file', 'directory'])
  • file_format (Literal['csv', 'parquet', 'json', 'delta', 'iceberg'])
  • csv_has_header (Optional[bool])
  • csv_delimiter (Optional[str])
  • csv_encoding (Optional[str])
  • delta_version (Optional[int])

Validators:

  • validate_auth_requirements (auth_mode)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class CloudStorageReadSettings(CloudStorageSettings):
    """Settings for reading from cloud storage"""

    scan_mode: Literal["single_file", "directory"] = "single_file"
    file_format: Literal["csv", "parquet", "json", "delta", "iceberg"] = "parquet"
    # CSV specific options
    csv_has_header: Optional[bool] = True
    csv_delimiter: Optional[str] = ","
    csv_encoding: Optional[str] = "utf8"
    # Deltalake specific settings
    delta_version: Optional[int] = None
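
A minimal sketch of reading a CSV file from S3; the bucket path and delimiter are illustrative:

read_settings = CloudStorageReadSettings(
    resource_path="s3://my-bucket/raw/sales.csv",
    auth_mode="aws-cli",
    file_format="csv",
    csv_delimiter=";",
    csv_has_header=True,
)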
CloudStorageSettings pydantic-model

Bases: BaseModel

Settings for cloud storage nodes in the visual designer

Show JSON schema:
{
  "description": "Settings for cloud storage nodes in the visual designer",
  "properties": {
    "auth_mode": {
      "default": "auto",
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Mode",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Connection Name"
    },
    "resource_path": {
      "title": "Resource Path",
      "type": "string"
    }
  },
  "required": [
    "resource_path"
  ],
  "title": "CloudStorageSettings",
  "type": "object"
}

Fields:

  • auth_mode (AuthMethod)
  • connection_name (Optional[str])
  • resource_path (str)

Validators:

  • validate_auth_requirements (auth_mode)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class CloudStorageSettings(BaseModel):
    """Settings for cloud storage nodes in the visual designer"""

    auth_mode: AuthMethod = "auto"
    connection_name: Optional[str] = None  # Required only for 'reference' mode
    resource_path: str  # s3://bucket/path/to/file.csv

    @field_validator("auth_mode", mode="after")
    def validate_auth_requirements(cls, v, values):
        data = values.data
        if v == "reference" and not data.get("connection_name"):
            raise ValueError("connection_name required when using reference mode")
        return v
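
A minimal sketch; the resource path and connection name are illustrative:

settings = CloudStorageSettings(
    resource_path="s3://my-bucket/data/orders.parquet",
    auth_mode="access_key",
    connection_name="my-s3-connection",
)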
CloudStorageWriteSettings pydantic-model

Bases: CloudStorageSettings, WriteSettingsWorkerInterface

Settings for writing to cloud storage

Show JSON schema:
{
  "description": "Settings for writing to cloud storage",
  "properties": {
    "resource_path": {
      "title": "Resource Path",
      "type": "string"
    },
    "write_mode": {
      "default": "overwrite",
      "enum": [
        "overwrite",
        "append"
      ],
      "title": "Write Mode",
      "type": "string"
    },
    "file_format": {
      "default": "parquet",
      "enum": [
        "csv",
        "parquet",
        "json",
        "delta"
      ],
      "title": "File Format",
      "type": "string"
    },
    "parquet_compression": {
      "default": "snappy",
      "enum": [
        "snappy",
        "gzip",
        "brotli",
        "lz4",
        "zstd"
      ],
      "title": "Parquet Compression",
      "type": "string"
    },
    "csv_delimiter": {
      "default": ",",
      "title": "Csv Delimiter",
      "type": "string"
    },
    "csv_encoding": {
      "default": "utf8",
      "title": "Csv Encoding",
      "type": "string"
    },
    "auth_mode": {
      "default": "auto",
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Mode",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Connection Name"
    }
  },
  "required": [
    "resource_path"
  ],
  "title": "CloudStorageWriteSettings",
  "type": "object"
}

Fields:

  • resource_path (str)
  • write_mode (Literal['overwrite', 'append'])
  • file_format (Literal['csv', 'parquet', 'json', 'delta'])
  • parquet_compression (Literal['snappy', 'gzip', 'brotli', 'lz4', 'zstd'])
  • csv_delimiter (str)
  • csv_encoding (str)
  • auth_mode (AuthMethod)
  • connection_name (Optional[str])

Validators:

  • validate_auth_requirements (auth_mode)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class CloudStorageWriteSettings(CloudStorageSettings, WriteSettingsWorkerInterface):
    """Settings for writing to cloud storage"""
    pass

    def get_write_setting_worker_interface(self) -> WriteSettingsWorkerInterface:
        """
        Convert to a worker interface model without secrets.
        """
        return WriteSettingsWorkerInterface(
            resource_path=self.resource_path,
            write_mode=self.write_mode,
            file_format=self.file_format,
            parquet_compression=self.parquet_compression,
            csv_delimiter=self.csv_delimiter,
            csv_encoding=self.csv_encoding
        )
get_write_setting_worker_interface()

Convert to a worker interface model without secrets.

Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
def get_write_setting_worker_interface(self) -> WriteSettingsWorkerInterface:
    """
    Convert to a worker interface model without secrets.
    """
    return WriteSettingsWorkerInterface(
        resource_path=self.resource_path,
        write_mode=self.write_mode,
        file_format=self.file_format,
        parquet_compression=self.parquet_compression,
        csv_delimiter=self.csv_delimiter,
        csv_encoding=self.csv_encoding
    )
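
A minimal sketch of configuring a cloud write and deriving the worker-side settings; the path and compression are illustrative:

write_settings = CloudStorageWriteSettings(
    resource_path="s3://my-bucket/processed/orders.parquet",
    write_mode="overwrite",
    file_format="parquet",
    parquet_compression="zstd",
)
worker_settings = write_settings.get_write_setting_worker_interface()   # keeps only the write options, no auth fields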
CloudStorageWriteSettingsWorkerInterface pydantic-model

Bases: BaseModel

Settings for writing to cloud storage in worker context

Show JSON schema:
{
  "$defs": {
    "FullCloudStorageConnectionWorkerInterface": {
      "description": "Internal model with decrypted secrets",
      "properties": {
        "storage_type": {
          "enum": [
            "s3",
            "adls",
            "gcs"
          ],
          "title": "Storage Type",
          "type": "string"
        },
        "auth_method": {
          "enum": [
            "access_key",
            "iam_role",
            "service_principal",
            "managed_identity",
            "sas_token",
            "aws-cli",
            "env_vars"
          ],
          "title": "Auth Method",
          "type": "string"
        },
        "connection_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": "None",
          "title": "Connection Name"
        },
        "aws_region": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Region"
        },
        "aws_access_key_id": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Access Key Id"
        },
        "aws_secret_access_key": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Secret Access Key"
        },
        "aws_role_arn": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Role Arn"
        },
        "aws_allow_unsafe_html": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Allow Unsafe Html"
        },
        "aws_session_token": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Session Token"
        },
        "azure_account_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Azure Account Name"
        },
        "azure_account_key": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Azure Account Key"
        },
        "azure_tenant_id": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Azure Tenant Id"
        },
        "azure_client_id": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Azure Client Id"
        },
        "azure_client_secret": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Azure Client Secret"
        },
        "endpoint_url": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Endpoint Url"
        },
        "verify_ssl": {
          "default": true,
          "title": "Verify Ssl",
          "type": "boolean"
        }
      },
      "required": [
        "storage_type",
        "auth_method"
      ],
      "title": "FullCloudStorageConnectionWorkerInterface",
      "type": "object"
    },
    "WriteSettingsWorkerInterface": {
      "description": "Settings for writing to cloud storage",
      "properties": {
        "resource_path": {
          "title": "Resource Path",
          "type": "string"
        },
        "write_mode": {
          "default": "overwrite",
          "enum": [
            "overwrite",
            "append"
          ],
          "title": "Write Mode",
          "type": "string"
        },
        "file_format": {
          "default": "parquet",
          "enum": [
            "csv",
            "parquet",
            "json",
            "delta"
          ],
          "title": "File Format",
          "type": "string"
        },
        "parquet_compression": {
          "default": "snappy",
          "enum": [
            "snappy",
            "gzip",
            "brotli",
            "lz4",
            "zstd"
          ],
          "title": "Parquet Compression",
          "type": "string"
        },
        "csv_delimiter": {
          "default": ",",
          "title": "Csv Delimiter",
          "type": "string"
        },
        "csv_encoding": {
          "default": "utf8",
          "title": "Csv Encoding",
          "type": "string"
        }
      },
      "required": [
        "resource_path"
      ],
      "title": "WriteSettingsWorkerInterface",
      "type": "object"
    }
  },
  "description": "Settings for writing to cloud storage in worker context",
  "properties": {
    "operation": {
      "title": "Operation",
      "type": "string"
    },
    "write_settings": {
      "$ref": "#/$defs/WriteSettingsWorkerInterface"
    },
    "connection": {
      "$ref": "#/$defs/FullCloudStorageConnectionWorkerInterface"
    },
    "flowfile_flow_id": {
      "default": 1,
      "title": "Flowfile Flow Id",
      "type": "integer"
    },
    "flowfile_node_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "string"
        }
      ],
      "default": -1,
      "title": "Flowfile Node Id"
    }
  },
  "required": [
    "operation",
    "write_settings",
    "connection"
  ],
  "title": "CloudStorageWriteSettingsWorkerInterface",
  "type": "object"
}

Fields:

Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class CloudStorageWriteSettingsWorkerInterface(BaseModel):
    """Settings for writing to cloud storage in worker context"""
    operation: str
    write_settings: WriteSettingsWorkerInterface
    connection: FullCloudStorageConnectionWorkerInterface
    flowfile_flow_id: int = 1
    flowfile_node_id: int | str = -1
FullCloudStorageConnection pydantic-model

Bases: AuthSettingsInput

Internal model with decrypted secrets

Show JSON schema:
{
  "description": "Internal model with decrypted secrets",
  "properties": {
    "storage_type": {
      "enum": [
        "s3",
        "adls",
        "gcs"
      ],
      "title": "Storage Type",
      "type": "string"
    },
    "auth_method": {
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Method",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "None",
      "title": "Connection Name"
    },
    "aws_region": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Region"
    },
    "aws_access_key_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Access Key Id"
    },
    "aws_secret_access_key": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Secret Access Key"
    },
    "aws_role_arn": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Role Arn"
    },
    "aws_allow_unsafe_html": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Allow Unsafe Html"
    },
    "aws_session_token": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Session Token"
    },
    "azure_account_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Account Name"
    },
    "azure_account_key": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Account Key"
    },
    "azure_tenant_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Tenant Id"
    },
    "azure_client_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Client Id"
    },
    "azure_client_secret": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Client Secret"
    },
    "endpoint_url": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Endpoint Url"
    },
    "verify_ssl": {
      "default": true,
      "title": "Verify Ssl",
      "type": "boolean"
    }
  },
  "required": [
    "storage_type",
    "auth_method"
  ],
  "title": "FullCloudStorageConnection",
  "type": "object"
}

Fields:

  • storage_type (CloudStorageType)
  • auth_method (AuthMethod)
  • connection_name (Optional[str])
  • aws_region (Optional[str])
  • aws_access_key_id (Optional[str])
  • aws_secret_access_key (Optional[SecretStr])
  • aws_role_arn (Optional[str])
  • aws_allow_unsafe_html (Optional[bool])
  • aws_session_token (Optional[SecretStr])
  • azure_account_name (Optional[str])
  • azure_account_key (Optional[SecretStr])
  • azure_tenant_id (Optional[str])
  • azure_client_id (Optional[str])
  • azure_client_secret (Optional[SecretStr])
  • endpoint_url (Optional[str])
  • verify_ssl (bool)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class FullCloudStorageConnection(AuthSettingsInput):
    """Internal model with decrypted secrets"""

    # AWS S3
    aws_region: Optional[str] = None
    aws_access_key_id: Optional[str] = None
    aws_secret_access_key: Optional[SecretStr] = None
    aws_role_arn: Optional[str] = None
    aws_allow_unsafe_html: Optional[bool] = None
    aws_session_token: Optional[SecretStr] = None

    # Azure ADLS
    azure_account_name: Optional[str] = None
    azure_account_key: Optional[SecretStr] = None
    azure_tenant_id: Optional[str] = None
    azure_client_id: Optional[str] = None
    azure_client_secret: Optional[SecretStr] = None

    # Common
    endpoint_url: Optional[str] = None
    verify_ssl: bool = True

    def get_worker_interface(self) -> "FullCloudStorageConnectionWorkerInterface":
        """
        Convert to a public interface model without secrets.
        """
        return FullCloudStorageConnectionWorkerInterface(
            storage_type=self.storage_type,
            auth_method=self.auth_method,
            connection_name=self.connection_name,
            aws_allow_unsafe_html=self.aws_allow_unsafe_html,
            aws_secret_access_key=encrypt_for_worker(self.aws_secret_access_key),
            aws_region=self.aws_region,
            aws_access_key_id=self.aws_access_key_id,
            aws_role_arn=self.aws_role_arn,
            aws_session_token=encrypt_for_worker(self.aws_session_token),
            azure_account_name=self.azure_account_name,
            azure_tenant_id=self.azure_tenant_id,
            azure_account_key=encrypt_for_worker(self.azure_account_key),
            azure_client_id=self.azure_client_id,
            azure_client_secret=encrypt_for_worker(self.azure_client_secret),
            endpoint_url=self.endpoint_url,
            verify_ssl=self.verify_ssl
        )
get_worker_interface()

Convert to a public interface model without secrets.

Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
def get_worker_interface(self) -> "FullCloudStorageConnectionWorkerInterface":
    """
    Convert to a public interface model without secrets.
    """
    return FullCloudStorageConnectionWorkerInterface(
        storage_type=self.storage_type,
        auth_method=self.auth_method,
        connection_name=self.connection_name,
        aws_allow_unsafe_html=self.aws_allow_unsafe_html,
        aws_secret_access_key=encrypt_for_worker(self.aws_secret_access_key),
        aws_region=self.aws_region,
        aws_access_key_id=self.aws_access_key_id,
        aws_role_arn=self.aws_role_arn,
        aws_session_token=encrypt_for_worker(self.aws_session_token),
        azure_account_name=self.azure_account_name,
        azure_tenant_id=self.azure_tenant_id,
        azure_account_key=encrypt_for_worker(self.azure_account_key),
        azure_client_id=self.azure_client_id,
        azure_client_secret=encrypt_for_worker(self.azure_client_secret),
        endpoint_url=self.endpoint_url,
        verify_ssl=self.verify_ssl
    )
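
The model above pairs the connection settings with Pydantic's SecretStr so credentials are masked in reprs and logs, and get_worker_interface() re-encrypts them before handing them to a worker. A minimal usage sketch (all credential values are hypothetical, the import path is assumed from the source path shown above, and the worker-encryption step may require flowfile's encryption key to be configured):

from pydantic import SecretStr

from flowfile_core.schemas.cloud_storage_schemas import FullCloudStorageConnection

# Build an S3 connection that authenticates with an access key.
connection = FullCloudStorageConnection(
    storage_type="s3",
    auth_method="access_key",
    connection_name="my-s3-connection",
    aws_region="eu-west-1",
    aws_access_key_id="AKIA...",                      # hypothetical key id
    aws_secret_access_key=SecretStr("secret-value"),  # hypothetical secret
)

# SecretStr fields are masked when printed.
print(connection.aws_secret_access_key)  # **********

# Hand the connection to a worker: secret fields pass through
# encrypt_for_worker(), so the worker only receives encrypted values.
# (This step may require the application's encryption key to be set up.)
worker_model = connection.get_worker_interface()
print(worker_model.aws_access_key_id)
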
FullCloudStorageConnectionInterface pydantic-model

Bases: AuthSettingsInput

API response model - no secrets exposed

Show JSON schema:
{
  "description": "API response model - no secrets exposed",
  "properties": {
    "storage_type": {
      "enum": [
        "s3",
        "adls",
        "gcs"
      ],
      "title": "Storage Type",
      "type": "string"
    },
    "auth_method": {
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Method",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "None",
      "title": "Connection Name"
    },
    "aws_allow_unsafe_html": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Allow Unsafe Html"
    },
    "aws_region": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Region"
    },
    "aws_access_key_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Access Key Id"
    },
    "aws_role_arn": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Role Arn"
    },
    "azure_account_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Account Name"
    },
    "azure_tenant_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Tenant Id"
    },
    "azure_client_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Client Id"
    },
    "endpoint_url": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Endpoint Url"
    },
    "verify_ssl": {
      "default": true,
      "title": "Verify Ssl",
      "type": "boolean"
    }
  },
  "required": [
    "storage_type",
    "auth_method"
  ],
  "title": "FullCloudStorageConnectionInterface",
  "type": "object"
}

Fields:

  • storage_type (CloudStorageType)
  • auth_method (AuthMethod)
  • connection_name (Optional[str])
  • aws_allow_unsafe_html (Optional[bool])
  • aws_region (Optional[str])
  • aws_access_key_id (Optional[str])
  • aws_role_arn (Optional[str])
  • azure_account_name (Optional[str])
  • azure_tenant_id (Optional[str])
  • azure_client_id (Optional[str])
  • endpoint_url (Optional[str])
  • verify_ssl (bool)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class FullCloudStorageConnectionInterface(AuthSettingsInput):
    """API response model - no secrets exposed"""

    # Public fields only
    aws_allow_unsafe_html: Optional[bool] = None
    aws_region: Optional[str] = None
    aws_access_key_id: Optional[str] = None
    aws_role_arn: Optional[str] = None
    azure_account_name: Optional[str] = None
    azure_tenant_id: Optional[str] = None
    azure_client_id: Optional[str] = None
    endpoint_url: Optional[str] = None
    verify_ssl: bool = True
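
This interface mirrors the full connection model but declares only the non-secret fields. One way to project a full connection onto it, shown here as a sketch rather than the library's own mechanism, is to dump only the fields the interface declares:

from pydantic import SecretStr

from flowfile_core.schemas.cloud_storage_schemas import (
    FullCloudStorageConnection,
    FullCloudStorageConnectionInterface,
)

full = FullCloudStorageConnection(
    storage_type="s3",
    auth_method="access_key",
    aws_access_key_id="AKIA...",                      # hypothetical
    aws_secret_access_key=SecretStr("secret-value"),  # hypothetical
)

# Keep only the fields declared on the public interface; secret fields
# (aws_secret_access_key, aws_session_token, ...) are left out entirely.
public_fields = set(FullCloudStorageConnectionInterface.model_fields)
public = FullCloudStorageConnectionInterface(**full.model_dump(include=public_fields))
print(public.model_dump())
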
FullCloudStorageConnectionWorkerInterface pydantic-model

Bases: AuthSettingsInput

Internal model with decrypted secrets

Show JSON schema:
{
  "description": "Internal model with decrypted secrets",
  "properties": {
    "storage_type": {
      "enum": [
        "s3",
        "adls",
        "gcs"
      ],
      "title": "Storage Type",
      "type": "string"
    },
    "auth_method": {
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Method",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "None",
      "title": "Connection Name"
    },
    "aws_region": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Region"
    },
    "aws_access_key_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Access Key Id"
    },
    "aws_secret_access_key": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Secret Access Key"
    },
    "aws_role_arn": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Role Arn"
    },
    "aws_allow_unsafe_html": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Allow Unsafe Html"
    },
    "aws_session_token": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Session Token"
    },
    "azure_account_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Account Name"
    },
    "azure_account_key": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Account Key"
    },
    "azure_tenant_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Tenant Id"
    },
    "azure_client_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Client Id"
    },
    "azure_client_secret": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Client Secret"
    },
    "endpoint_url": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Endpoint Url"
    },
    "verify_ssl": {
      "default": true,
      "title": "Verify Ssl",
      "type": "boolean"
    }
  },
  "required": [
    "storage_type",
    "auth_method"
  ],
  "title": "FullCloudStorageConnectionWorkerInterface",
  "type": "object"
}

Fields:

  • storage_type (CloudStorageType)
  • auth_method (AuthMethod)
  • connection_name (Optional[str])
  • aws_region (Optional[str])
  • aws_access_key_id (Optional[str])
  • aws_secret_access_key (Optional[str])
  • aws_role_arn (Optional[str])
  • aws_allow_unsafe_html (Optional[bool])
  • aws_session_token (Optional[str])
  • azure_account_name (Optional[str])
  • azure_account_key (Optional[str])
  • azure_tenant_id (Optional[str])
  • azure_client_id (Optional[str])
  • azure_client_secret (Optional[str])
  • endpoint_url (Optional[str])
  • verify_ssl (bool)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class FullCloudStorageConnectionWorkerInterface(AuthSettingsInput):
    """Internal model with decrypted secrets"""

    # AWS S3
    aws_region: Optional[str] = None
    aws_access_key_id: Optional[str] = None
    aws_secret_access_key: Optional[str] = None
    aws_role_arn: Optional[str] = None
    aws_allow_unsafe_html: Optional[bool] = None
    aws_session_token: Optional[str] = None

    # Azure ADLS
    azure_account_name: Optional[str] = None
    azure_account_key: Optional[str] = None
    azure_tenant_id: Optional[str] = None
    azure_client_id: Optional[str] = None
    azure_client_secret: Optional[str] = None

    # Common
    endpoint_url: Optional[str] = None
    verify_ssl: bool = True
WriteSettingsWorkerInterface pydantic-model

Bases: BaseModel

Settings for writing to cloud storage

Show JSON schema:
{
  "description": "Settings for writing to cloud storage",
  "properties": {
    "resource_path": {
      "title": "Resource Path",
      "type": "string"
    },
    "write_mode": {
      "default": "overwrite",
      "enum": [
        "overwrite",
        "append"
      ],
      "title": "Write Mode",
      "type": "string"
    },
    "file_format": {
      "default": "parquet",
      "enum": [
        "csv",
        "parquet",
        "json",
        "delta"
      ],
      "title": "File Format",
      "type": "string"
    },
    "parquet_compression": {
      "default": "snappy",
      "enum": [
        "snappy",
        "gzip",
        "brotli",
        "lz4",
        "zstd"
      ],
      "title": "Parquet Compression",
      "type": "string"
    },
    "csv_delimiter": {
      "default": ",",
      "title": "Csv Delimiter",
      "type": "string"
    },
    "csv_encoding": {
      "default": "utf8",
      "title": "Csv Encoding",
      "type": "string"
    }
  },
  "required": [
    "resource_path"
  ],
  "title": "WriteSettingsWorkerInterface",
  "type": "object"
}

Fields:

  • resource_path (str)
  • write_mode (Literal['overwrite', 'append'])
  • file_format (Literal['csv', 'parquet', 'json', 'delta'])
  • parquet_compression (Literal['snappy', 'gzip', 'brotli', 'lz4', 'zstd'])
  • csv_delimiter (str)
  • csv_encoding (str)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class WriteSettingsWorkerInterface(BaseModel):
    """Settings for writing to cloud storage"""
    resource_path: str  # s3://bucket/path/to/file.csv

    write_mode: Literal["overwrite", "append"] = "overwrite"
    file_format: Literal["csv", "parquet", "json", "delta"] = "parquet"

    parquet_compression: Literal["snappy", "gzip", "brotli", "lz4", "zstd"] = "snappy"

    csv_delimiter: str = ","
    csv_encoding: str = "utf8"
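
These write settings are plain Pydantic values, so constructing them directly is straightforward. A brief sketch (bucket paths are illustrative, and the import path is assumed from the source path shown above):

from flowfile_core.schemas.cloud_storage_schemas import WriteSettingsWorkerInterface

# Parquet output with zstd compression, replacing whatever is already there.
parquet_settings = WriteSettingsWorkerInterface(
    resource_path="s3://my-bucket/exports/sales.parquet",
    write_mode="overwrite",
    file_format="parquet",
    parquet_compression="zstd",
)

# CSV output appended to an existing location, using a semicolon delimiter.
csv_settings = WriteSettingsWorkerInterface(
    resource_path="s3://my-bucket/exports/sales.csv",
    write_mode="append",
    file_format="csv",
    csv_delimiter=";",
)

print(parquet_settings.model_dump())
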
encrypt_for_worker(secret_value)

Encrypts a secret value for use in worker contexts. This is a placeholder function that simulates encryption. In practice, you would use a secure encryption method.

Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
def encrypt_for_worker(secret_value: SecretStr|None) -> str|None:
    """
    Encrypts a secret value for use in worker contexts.
    This is a placeholder function that simulates encryption.
    In practice, you would use a secure encryption method.
    """
    if secret_value is not None:
        return encrypt_secret(secret_value.get_secret_value())
get_cloud_storage_write_settings_worker_interface(write_settings, connection, lf, flowfile_flow_id=1, flowfile_node_id=-1)

Convert to a worker interface model with hashed secrets.

Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
def get_cloud_storage_write_settings_worker_interface(
        write_settings: CloudStorageWriteSettings,
        connection: FullCloudStorageConnection,
        lf: pl.LazyFrame,
        flowfile_flow_id: int = 1,
        flowfile_node_id: int | str = -1,
        ) -> CloudStorageWriteSettingsWorkerInterface:
    """
    Convert to a worker interface model with hashed secrets.
    """
    operation = base64.b64encode(lf.serialize()).decode()

    return CloudStorageWriteSettingsWorkerInterface(
        operation=operation,
        write_settings=write_settings.get_write_setting_worker_interface(),
        connection=connection.get_worker_interface(),
        flowfile_flow_id=flowfile_flow_id,  # Default value, can be overridden
        flowfile_node_id=flowfile_node_id  # Default value, can be overridden
    )
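
The essential step in this helper is how the query plan travels to the worker: the LazyFrame is serialized and base64-encoded, and the worker decodes and deserializes it before executing. A minimal round-trip of just that step, using only the Polars API (the surrounding settings and connection objects are omitted; Polars versions whose serialize() returns a string would need a small adjustment):

import base64
import io

import polars as pl

lf = pl.LazyFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Core side: encode the logical plan exactly as the helper above does.
operation = base64.b64encode(lf.serialize()).decode()

# Worker side: decode, rebuild the LazyFrame, and execute the same plan.
restored = pl.LazyFrame.deserialize(io.BytesIO(base64.b64decode(operation)))
print(restored.collect())
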

output_model

flowfile_core.schemas.output_model

Classes:

Name Description
BaseItem

A base model for any item in a file system, like a file or directory.

ExpressionRef

A reference to a single Polars expression, including its name and docstring.

ExpressionsOverview

Represents a categorized list of available Polars expressions.

FileColumn

Represents detailed schema and statistics for a single column (field).

InstantFuncResult

Represents the result of a function that is expected to execute instantly.

ItemInfo

Provides detailed information about a single item in an output directory.

NodeData

A comprehensive model holding the complete state and data for a single node.

NodeResult

Represents the execution result of a single node in a FlowGraph run.

OutputDir

Represents the contents of a single output directory.

OutputFile

Represents a single file in an output directory, extending BaseItem.

OutputFiles

Represents a collection of files, typically within a directory.

OutputTree

Represents a directory tree, including subdirectories.

RunInformation

Contains summary information about a complete FlowGraph execution.

TableExample

Represents a preview of a table, including schema and sample data.

BaseItem pydantic-model

Bases: BaseModel

A base model for any item in a file system, like a file or directory.

Show JSON schema:
{
  "description": "A base model for any item in a file system, like a file or directory.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    }
  },
  "required": [
    "name",
    "path"
  ],
  "title": "BaseItem",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (Optional[int])
  • creation_date (Optional[datetime])
  • access_date (Optional[datetime])
  • modification_date (Optional[datetime])
  • source_path (Optional[str])
  • number_of_items (int)
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class BaseItem(BaseModel):
    """A base model for any item in a file system, like a file or directory."""
    name: str
    path: str
    size: Optional[int] = None
    creation_date: Optional[datetime] = None
    access_date: Optional[datetime] = None
    modification_date: Optional[datetime] = None
    source_path: Optional[str] = None
    number_of_items: int = -1
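
BaseItem is the shared shape for files and directories, so it maps naturally onto os.stat() results. An illustrative construction (using a throwaway temporary file so the snippet runs anywhere):

import os
import tempfile
from datetime import datetime

from flowfile_core.schemas.output_model import BaseItem

# Create a small temporary file to describe.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,value\n1,a\n")
    path = tmp.name

stat = os.stat(path)
item = BaseItem(
    name=os.path.basename(path),
    path=path,
    size=stat.st_size,
    creation_date=datetime.fromtimestamp(stat.st_ctime),
    access_date=datetime.fromtimestamp(stat.st_atime),
    modification_date=datetime.fromtimestamp(stat.st_mtime),
)
print(item.model_dump())
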
ExpressionRef pydantic-model

Bases: BaseModel

A reference to a single Polars expression, including its name and docstring.

Show JSON schema:
{
  "description": "A reference to a single Polars expression, including its name and docstring.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "doc": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "title": "Doc"
    }
  },
  "required": [
    "name",
    "doc"
  ],
  "title": "ExpressionRef",
  "type": "object"
}

Fields:

  • name (str)
  • doc (Optional[str])
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class ExpressionRef(BaseModel):
    """A reference to a single Polars expression, including its name and docstring."""
    name: str
    doc: Optional[str]
ExpressionsOverview pydantic-model

Bases: BaseModel

Represents a categorized list of available Polars expressions.

Show JSON schema:
{
  "$defs": {
    "ExpressionRef": {
      "description": "A reference to a single Polars expression, including its name and docstring.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "doc": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "title": "Doc"
        }
      },
      "required": [
        "name",
        "doc"
      ],
      "title": "ExpressionRef",
      "type": "object"
    }
  },
  "description": "Represents a categorized list of available Polars expressions.",
  "properties": {
    "expression_type": {
      "title": "Expression Type",
      "type": "string"
    },
    "expressions": {
      "items": {
        "$ref": "#/$defs/ExpressionRef"
      },
      "title": "Expressions",
      "type": "array"
    }
  },
  "required": [
    "expression_type",
    "expressions"
  ],
  "title": "ExpressionsOverview",
  "type": "object"
}

Fields:

  • expression_type (str)
  • expressions (List[ExpressionRef])
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class ExpressionsOverview(BaseModel):
    """Represents a categorized list of available Polars expressions."""
    expression_type: str
    expressions: List[ExpressionRef]
FileColumn pydantic-model

Bases: BaseModel

Represents detailed schema and statistics for a single column (field).

Show JSON schema:
{
  "description": "Represents detailed schema and statistics for a single column (field).",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "data_type": {
      "title": "Data Type",
      "type": "string"
    },
    "is_unique": {
      "title": "Is Unique",
      "type": "boolean"
    },
    "max_value": {
      "title": "Max Value",
      "type": "string"
    },
    "min_value": {
      "title": "Min Value",
      "type": "string"
    },
    "number_of_empty_values": {
      "title": "Number Of Empty Values",
      "type": "integer"
    },
    "number_of_filled_values": {
      "title": "Number Of Filled Values",
      "type": "integer"
    },
    "number_of_unique_values": {
      "title": "Number Of Unique Values",
      "type": "integer"
    },
    "size": {
      "title": "Size",
      "type": "integer"
    }
  },
  "required": [
    "name",
    "data_type",
    "is_unique",
    "max_value",
    "min_value",
    "number_of_empty_values",
    "number_of_filled_values",
    "number_of_unique_values",
    "size"
  ],
  "title": "FileColumn",
  "type": "object"
}

Fields:

  • name (str)
  • data_type (str)
  • is_unique (bool)
  • max_value (str)
  • min_value (str)
  • number_of_empty_values (int)
  • number_of_filled_values (int)
  • number_of_unique_values (int)
  • size (int)
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class FileColumn(BaseModel):
    """Represents detailed schema and statistics for a single column (field)."""
    name: str
    data_type: str
    is_unique: bool
    max_value: str
    min_value: str
    number_of_empty_values: int
    number_of_filled_values: int
    number_of_unique_values: int
    size: int
InstantFuncResult pydantic-model

Bases: BaseModel

Represents the result of a function that is expected to execute instantly.

Show JSON schema:
{
  "description": "Represents the result of a function that is expected to execute instantly.",
  "properties": {
    "success": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Success"
    },
    "result": {
      "title": "Result",
      "type": "string"
    }
  },
  "required": [
    "result"
  ],
  "title": "InstantFuncResult",
  "type": "object"
}

Fields:

  • success (Optional[bool])
  • result (str)
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class InstantFuncResult(BaseModel):
    """Represents the result of a function that is expected to execute instantly."""
    success: Optional[bool] = None
    result: str
ItemInfo pydantic-model

Bases: OutputFile

Provides detailed information about a single item in an output directory.

Show JSON schema:
{
  "description": "Provides detailed information about a single item in an output directory.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    },
    "ext": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Ext"
    },
    "mimetype": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Mimetype"
    },
    "id": {
      "default": -1,
      "title": "Id",
      "type": "integer"
    },
    "type": {
      "title": "Type",
      "type": "string"
    },
    "analysis_file_available": {
      "default": false,
      "title": "Analysis File Available",
      "type": "boolean"
    },
    "analysis_file_location": {
      "default": null,
      "title": "Analysis File Location",
      "type": "string"
    },
    "analysis_file_error": {
      "default": null,
      "title": "Analysis File Error",
      "type": "string"
    }
  },
  "required": [
    "name",
    "path",
    "type"
  ],
  "title": "ItemInfo",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (Optional[int])
  • creation_date (Optional[datetime])
  • access_date (Optional[datetime])
  • modification_date (Optional[datetime])
  • source_path (Optional[str])
  • number_of_items (int)
  • ext (Optional[str])
  • mimetype (Optional[str])
  • id (int)
  • type (str)
  • analysis_file_available (bool)
  • analysis_file_location (str)
  • analysis_file_error (str)
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class ItemInfo(OutputFile):
    """Provides detailed information about a single item in an output directory."""
    id: int = -1
    type: str
    analysis_file_available: bool = False
    analysis_file_location: str = None
    analysis_file_error: str = None
NodeData pydantic-model

Bases: BaseModel

A comprehensive model holding the complete state and data for a single node.

This includes its input/output data previews, settings, and run status.

Show JSON schema:
{
  "$defs": {
    "FileColumn": {
      "description": "Represents detailed schema and statistics for a single column (field).",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "title": "Data Type",
          "type": "string"
        },
        "is_unique": {
          "title": "Is Unique",
          "type": "boolean"
        },
        "max_value": {
          "title": "Max Value",
          "type": "string"
        },
        "min_value": {
          "title": "Min Value",
          "type": "string"
        },
        "number_of_empty_values": {
          "title": "Number Of Empty Values",
          "type": "integer"
        },
        "number_of_filled_values": {
          "title": "Number Of Filled Values",
          "type": "integer"
        },
        "number_of_unique_values": {
          "title": "Number Of Unique Values",
          "type": "integer"
        },
        "size": {
          "title": "Size",
          "type": "integer"
        }
      },
      "required": [
        "name",
        "data_type",
        "is_unique",
        "max_value",
        "min_value",
        "number_of_empty_values",
        "number_of_filled_values",
        "number_of_unique_values",
        "size"
      ],
      "title": "FileColumn",
      "type": "object"
    },
    "TableExample": {
      "description": "Represents a preview of a table, including schema and sample data.",
      "properties": {
        "node_id": {
          "title": "Node Id",
          "type": "integer"
        },
        "number_of_records": {
          "title": "Number Of Records",
          "type": "integer"
        },
        "number_of_columns": {
          "title": "Number Of Columns",
          "type": "integer"
        },
        "name": {
          "title": "Name",
          "type": "string"
        },
        "table_schema": {
          "items": {
            "$ref": "#/$defs/FileColumn"
          },
          "title": "Table Schema",
          "type": "array"
        },
        "columns": {
          "items": {
            "type": "string"
          },
          "title": "Columns",
          "type": "array"
        },
        "data": {
          "anyOf": [
            {
              "items": {
                "type": "object"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": {},
          "title": "Data"
        }
      },
      "required": [
        "node_id",
        "number_of_records",
        "number_of_columns",
        "name",
        "table_schema",
        "columns"
      ],
      "title": "TableExample",
      "type": "object"
    }
  },
  "description": "A comprehensive model holding the complete state and data for a single node.\n\nThis includes its input/output data previews, settings, and run status.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "flow_type": {
      "title": "Flow Type",
      "type": "string"
    },
    "left_input": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "right_input": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "main_input": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "main_output": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "left_output": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "right_output": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "has_run": {
      "default": false,
      "title": "Has Run",
      "type": "boolean"
    },
    "is_cached": {
      "default": false,
      "title": "Is Cached",
      "type": "boolean"
    },
    "setting_input": {
      "default": null,
      "title": "Setting Input"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "flow_type"
  ],
  "title": "NodeData",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • flow_type (str)
  • left_input (Optional[TableExample])
  • right_input (Optional[TableExample])
  • main_input (Optional[TableExample])
  • main_output (Optional[TableExample])
  • left_output (Optional[TableExample])
  • right_output (Optional[TableExample])
  • has_run (bool)
  • is_cached (bool)
  • setting_input (Any)
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class NodeData(BaseModel):
    """A comprehensive model holding the complete state and data for a single node.

    This includes its input/output data previews, settings, and run status.
    """
    flow_id: int
    node_id: int
    flow_type: str
    left_input: Optional[TableExample] = None
    right_input: Optional[TableExample] = None
    main_input: Optional[TableExample] = None
    main_output: Optional[TableExample] = None
    left_output: Optional[TableExample] = None
    right_output: Optional[TableExample] = None
    has_run: bool = False
    is_cached: bool = False
    setting_input: Any = None
NodeResult pydantic-model

Bases: BaseModel

Represents the execution result of a single node in a FlowGraph run.

Show JSON schema:
{
  "description": "Represents the execution result of a single node in a FlowGraph run.",
  "properties": {
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "node_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Name"
    },
    "start_timestamp": {
      "title": "Start Timestamp",
      "type": "number"
    },
    "end_timestamp": {
      "default": 0,
      "title": "End Timestamp",
      "type": "number"
    },
    "success": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Success"
    },
    "error": {
      "default": "",
      "title": "Error",
      "type": "string"
    },
    "run_time": {
      "default": -1,
      "title": "Run Time",
      "type": "integer"
    },
    "is_running": {
      "default": true,
      "title": "Is Running",
      "type": "boolean"
    }
  },
  "required": [
    "node_id"
  ],
  "title": "NodeResult",
  "type": "object"
}

Fields:

  • node_id (int)
  • node_name (Optional[str])
  • start_timestamp (float)
  • end_timestamp (float)
  • success (Optional[bool])
  • error (str)
  • run_time (int)
  • is_running (bool)
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class NodeResult(BaseModel):
    """Represents the execution result of a single node in a FlowGraph run."""
    node_id: int
    node_name: Optional[str] = None
    start_timestamp: float = Field(default_factory=time.time)
    end_timestamp: float = 0
    success: Optional[bool] = None
    error: str = ''
    run_time: int = -1
    is_running: bool = True
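
start_timestamp defaults to the creation time and is_running starts as True, so a result can be created when a node starts and completed when it finishes. A sketch of that lifecycle (the bookkeeping here is illustrative, not the engine's own code):

import time

from flowfile_core.schemas.output_model import NodeResult

# Created when the node starts executing; start_timestamp is set automatically.
result = NodeResult(node_id=3, node_name="filter_orders")

time.sleep(0.1)  # stand-in for the node's actual work

# Filled in once the node finishes.
result.end_timestamp = time.time()
result.run_time = int(result.end_timestamp - result.start_timestamp)
result.success = True
result.is_running = False

print(result.model_dump())
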
OutputDir pydantic-model

Bases: BaseItem

Represents the contents of a single output directory.

Show JSON schema:
{
  "$defs": {
    "ItemInfo": {
      "description": "Provides detailed information about a single item in an output directory.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "path": {
          "title": "Path",
          "type": "string"
        },
        "size": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Size"
        },
        "creation_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Creation Date"
        },
        "access_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Access Date"
        },
        "modification_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Modification Date"
        },
        "source_path": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Source Path"
        },
        "number_of_items": {
          "default": -1,
          "title": "Number Of Items",
          "type": "integer"
        },
        "ext": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Ext"
        },
        "mimetype": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Mimetype"
        },
        "id": {
          "default": -1,
          "title": "Id",
          "type": "integer"
        },
        "type": {
          "title": "Type",
          "type": "string"
        },
        "analysis_file_available": {
          "default": false,
          "title": "Analysis File Available",
          "type": "boolean"
        },
        "analysis_file_location": {
          "default": null,
          "title": "Analysis File Location",
          "type": "string"
        },
        "analysis_file_error": {
          "default": null,
          "title": "Analysis File Error",
          "type": "string"
        }
      },
      "required": [
        "name",
        "path",
        "type"
      ],
      "title": "ItemInfo",
      "type": "object"
    }
  },
  "description": "Represents the contents of a single output directory.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    },
    "all_items": {
      "items": {
        "type": "string"
      },
      "title": "All Items",
      "type": "array"
    },
    "items": {
      "items": {
        "$ref": "#/$defs/ItemInfo"
      },
      "title": "Items",
      "type": "array"
    }
  },
  "required": [
    "name",
    "path",
    "all_items",
    "items"
  ],
  "title": "OutputDir",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (Optional[int])
  • creation_date (Optional[datetime])
  • access_date (Optional[datetime])
  • modification_date (Optional[datetime])
  • source_path (Optional[str])
  • number_of_items (int)
  • all_items (List[str])
  • items (List[ItemInfo])
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class OutputDir(BaseItem):
    """Represents the contents of a single output directory."""
    all_items: List[str]
    items: List[ItemInfo]
OutputFile pydantic-model

Bases: BaseItem

Represents a single file in an output directory, extending BaseItem.

Show JSON schema:
{
  "description": "Represents a single file in an output directory, extending BaseItem.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    },
    "ext": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Ext"
    },
    "mimetype": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Mimetype"
    }
  },
  "required": [
    "name",
    "path"
  ],
  "title": "OutputFile",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (Optional[int])
  • creation_date (Optional[datetime])
  • access_date (Optional[datetime])
  • modification_date (Optional[datetime])
  • source_path (Optional[str])
  • number_of_items (int)
  • ext (Optional[str])
  • mimetype (Optional[str])
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class OutputFile(BaseItem):
    """Represents a single file in an output directory, extending BaseItem."""
    ext: Optional[str] = None
    mimetype: Optional[str] = None
OutputFiles pydantic-model

Bases: BaseItem

Represents a collection of files, typically within a directory.

Show JSON schema:
{
  "$defs": {
    "OutputFile": {
      "description": "Represents a single file in an output directory, extending BaseItem.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "path": {
          "title": "Path",
          "type": "string"
        },
        "size": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Size"
        },
        "creation_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Creation Date"
        },
        "access_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Access Date"
        },
        "modification_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Modification Date"
        },
        "source_path": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Source Path"
        },
        "number_of_items": {
          "default": -1,
          "title": "Number Of Items",
          "type": "integer"
        },
        "ext": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Ext"
        },
        "mimetype": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Mimetype"
        }
      },
      "required": [
        "name",
        "path"
      ],
      "title": "OutputFile",
      "type": "object"
    }
  },
  "description": "Represents a collection of files, typically within a directory.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    },
    "files": {
      "items": {
        "$ref": "#/$defs/OutputFile"
      },
      "title": "Files",
      "type": "array"
    }
  },
  "required": [
    "name",
    "path"
  ],
  "title": "OutputFiles",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (Optional[int])
  • creation_date (Optional[datetime])
  • access_date (Optional[datetime])
  • modification_date (Optional[datetime])
  • source_path (Optional[str])
  • number_of_items (int)
  • files (List[OutputFile])
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class OutputFiles(BaseItem):
    """Represents a collection of files, typically within a directory."""
    files: List[OutputFile] = Field(default_factory=list)
OutputTree pydantic-model

Bases: OutputFiles

Represents a directory tree, including subdirectories.

Show JSON schema:
{
  "$defs": {
    "OutputFile": {
      "description": "Represents a single file in an output directory, extending BaseItem.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "path": {
          "title": "Path",
          "type": "string"
        },
        "size": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Size"
        },
        "creation_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Creation Date"
        },
        "access_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Access Date"
        },
        "modification_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Modification Date"
        },
        "source_path": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Source Path"
        },
        "number_of_items": {
          "default": -1,
          "title": "Number Of Items",
          "type": "integer"
        },
        "ext": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Ext"
        },
        "mimetype": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Mimetype"
        }
      },
      "required": [
        "name",
        "path"
      ],
      "title": "OutputFile",
      "type": "object"
    },
    "OutputFiles": {
      "description": "Represents a collection of files, typically within a directory.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "path": {
          "title": "Path",
          "type": "string"
        },
        "size": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Size"
        },
        "creation_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Creation Date"
        },
        "access_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Access Date"
        },
        "modification_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Modification Date"
        },
        "source_path": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Source Path"
        },
        "number_of_items": {
          "default": -1,
          "title": "Number Of Items",
          "type": "integer"
        },
        "files": {
          "items": {
            "$ref": "#/$defs/OutputFile"
          },
          "title": "Files",
          "type": "array"
        }
      },
      "required": [
        "name",
        "path"
      ],
      "title": "OutputFiles",
      "type": "object"
    }
  },
  "description": "Represents a directory tree, including subdirectories.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    },
    "files": {
      "items": {
        "$ref": "#/$defs/OutputFile"
      },
      "title": "Files",
      "type": "array"
    },
    "directories": {
      "items": {
        "$ref": "#/$defs/OutputFiles"
      },
      "title": "Directories",
      "type": "array"
    }
  },
  "required": [
    "name",
    "path"
  ],
  "title": "OutputTree",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (Optional[int])
  • creation_date (Optional[datetime])
  • access_date (Optional[datetime])
  • modification_date (Optional[datetime])
  • source_path (Optional[str])
  • number_of_items (int)
  • files (List[OutputFile])
  • directories (List[OutputFiles])
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class OutputTree(OutputFiles):
    """Represents a directory tree, including subdirectories."""
    directories: List[OutputFiles] = Field(default_factory=list)
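
Because OutputTree extends OutputFiles, a directory tree is simply a file listing plus a list of subdirectory listings. An illustrative assembly (names and paths are made up):

from flowfile_core.schemas.output_model import OutputFile, OutputFiles, OutputTree

reports = OutputFiles(
    name="reports",
    path="/data/output/reports",
    files=[
        OutputFile(
            name="summary.parquet",
            path="/data/output/reports/summary.parquet",
            ext="parquet",
        ),
    ],
)

tree = OutputTree(
    name="output",
    path="/data/output",
    files=[OutputFile(name="run.log", path="/data/output/run.log", ext="log")],
    directories=[reports],
)

print(tree.model_dump_json(indent=2))
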
RunInformation pydantic-model

Bases: BaseModel

Contains summary information about a complete FlowGraph execution.

Show JSON schema:
{
  "$defs": {
    "NodeResult": {
      "description": "Represents the execution result of a single node in a FlowGraph run.",
      "properties": {
        "node_id": {
          "title": "Node Id",
          "type": "integer"
        },
        "node_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Node Name"
        },
        "start_timestamp": {
          "title": "Start Timestamp",
          "type": "number"
        },
        "end_timestamp": {
          "default": 0,
          "title": "End Timestamp",
          "type": "number"
        },
        "success": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Success"
        },
        "error": {
          "default": "",
          "title": "Error",
          "type": "string"
        },
        "run_time": {
          "default": -1,
          "title": "Run Time",
          "type": "integer"
        },
        "is_running": {
          "default": true,
          "title": "Is Running",
          "type": "boolean"
        }
      },
      "required": [
        "node_id"
      ],
      "title": "NodeResult",
      "type": "object"
    }
  },
  "description": "Contains summary information about a complete FlowGraph execution.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "start_time": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "title": "Start Time"
    },
    "end_time": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "End Time"
    },
    "success": {
      "title": "Success",
      "type": "boolean"
    },
    "nodes_completed": {
      "default": 0,
      "title": "Nodes Completed",
      "type": "integer"
    },
    "number_of_nodes": {
      "default": 0,
      "title": "Number Of Nodes",
      "type": "integer"
    },
    "node_step_result": {
      "items": {
        "$ref": "#/$defs/NodeResult"
      },
      "title": "Node Step Result",
      "type": "array"
    }
  },
  "required": [
    "flow_id",
    "success",
    "node_step_result"
  ],
  "title": "RunInformation",
  "type": "object"
}

Fields:

  • flow_id (int)
  • start_time (Optional[datetime])
  • end_time (Optional[datetime])
  • success (bool)
  • nodes_completed (int)
  • number_of_nodes (int)
  • node_step_result (List[NodeResult])
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class RunInformation(BaseModel):
    """Contains summary information about a complete FlowGraph execution."""
    flow_id: int
    start_time: Optional[datetime] = Field(default_factory=datetime.now)
    end_time: Optional[datetime] = None
    success: bool
    nodes_completed: int = 0
    number_of_nodes: int = 0
    node_step_result: List[NodeResult]
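
As a rough illustration of how a client might consume this model, the sketch below fetches the payload from the run-status endpoint and validates it with RunInformation. The base URL is a placeholder, and the import path is assumed from the "Source code in .../schemas/output_model.py" reference above.

import requests
from flowfile_core.schemas.output_model import RunInformation  # assumed import path

resp = requests.get("http://localhost:8000/flow/run_status/", params={"flow_id": 1})
resp.raise_for_status()

run_info = RunInformation(**resp.json())  # validate the JSON payload against the model
failed = [n for n in run_info.node_step_result if n.success is False]
print(f"{run_info.nodes_completed}/{run_info.number_of_nodes} nodes completed, {len(failed)} failed")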
TableExample pydantic-model

Bases: BaseModel

Represents a preview of a table, including schema and sample data.

Show JSON schema:
{
  "$defs": {
    "FileColumn": {
      "description": "Represents detailed schema and statistics for a single column (field).",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "title": "Data Type",
          "type": "string"
        },
        "is_unique": {
          "title": "Is Unique",
          "type": "boolean"
        },
        "max_value": {
          "title": "Max Value",
          "type": "string"
        },
        "min_value": {
          "title": "Min Value",
          "type": "string"
        },
        "number_of_empty_values": {
          "title": "Number Of Empty Values",
          "type": "integer"
        },
        "number_of_filled_values": {
          "title": "Number Of Filled Values",
          "type": "integer"
        },
        "number_of_unique_values": {
          "title": "Number Of Unique Values",
          "type": "integer"
        },
        "size": {
          "title": "Size",
          "type": "integer"
        }
      },
      "required": [
        "name",
        "data_type",
        "is_unique",
        "max_value",
        "min_value",
        "number_of_empty_values",
        "number_of_filled_values",
        "number_of_unique_values",
        "size"
      ],
      "title": "FileColumn",
      "type": "object"
    }
  },
  "description": "Represents a preview of a table, including schema and sample data.",
  "properties": {
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "number_of_records": {
      "title": "Number Of Records",
      "type": "integer"
    },
    "number_of_columns": {
      "title": "Number Of Columns",
      "type": "integer"
    },
    "name": {
      "title": "Name",
      "type": "string"
    },
    "table_schema": {
      "items": {
        "$ref": "#/$defs/FileColumn"
      },
      "title": "Table Schema",
      "type": "array"
    },
    "columns": {
      "items": {
        "type": "string"
      },
      "title": "Columns",
      "type": "array"
    },
    "data": {
      "anyOf": [
        {
          "items": {
            "type": "object"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": {},
      "title": "Data"
    }
  },
  "required": [
    "node_id",
    "number_of_records",
    "number_of_columns",
    "name",
    "table_schema",
    "columns"
  ],
  "title": "TableExample",
  "type": "object"
}

Fields:

  • node_id (int)
  • number_of_records (int)
  • number_of_columns (int)
  • name (str)
  • table_schema (List[FileColumn])
  • columns (List[str])
  • data (Optional[List[Dict]])
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class TableExample(BaseModel):
    """Represents a preview of a table, including schema and sample data."""
    node_id: int
    number_of_records: int
    number_of_columns: int
    name: str
    table_schema: List[FileColumn]
    columns: List[str]
    data: Optional[List[Dict]] = {}
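
Because data is a plain list of row dictionaries, a preview returned by the /node/data endpoint can be loaded directly into a Polars DataFrame. A minimal sketch, assuming a placeholder base URL for the server:

import polars as pl
import requests

resp = requests.get("http://localhost:8000/node/data", params={"flow_id": 1, "node_id": 2})
resp.raise_for_status()
preview = resp.json()

df = pl.DataFrame(preview["data"] or [])  # sample rows as a Polars DataFrame
dtypes = {col["name"]: col["data_type"] for col in preview["table_schema"]}
print(preview["number_of_records"], "records;", dtypes)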

Web API

This section documents the FastAPI routes that expose flowfile-core's functionality over HTTP.

routes

flowfile_core.routes.routes

Main API router and endpoint definitions for the Flowfile application.

This module sets up the FastAPI router and defines all the API endpoints for interacting with flows, nodes, files, and other core components of the application. It handles the logic for creating, reading, updating, and deleting these resources.

Functions:

Name Description
add_generic_settings

A generic endpoint to update the settings of any node.

add_node

Adds a new, unconfigured node (a "promise") to the flow graph.

cancel_flow

Cancels a currently running flow execution.

close_flow

Closes an active flow session.

connect_node

Creates a connection (edge) between two nodes in the flow graph.

copy_node

Copies an existing node's settings to a new node promise.

create_db_connection

Creates and securely stores a new database connection.

create_directory

Creates a new directory at the specified path.

create_flow

Creates a new, empty flow file at the specified path and registers a session for it.

delete_db_connection

Deletes a stored database connection.

delete_node

Deletes a node from the flow graph.

delete_node_connection

Deletes a connection (edge) between two nodes.

get_active_flow_file_sessions

Retrieves a list of all currently active flow sessions.

get_current_directory_contents

Gets the contents of the file explorer's current directory.

get_current_files

Gets the contents of the file explorer's current directory.

get_current_path

Returns the current absolute path of the file explorer.

get_db_connections

Retrieves all stored database connections for the current user (without passwords).

get_description_node

Retrieves the description text for a specific node.

get_directory_contents

Gets the contents of an arbitrary directory path.

get_downstream_node_ids

Gets a list of all node IDs that are downstream dependencies of a given node.

get_excel_sheet_names

Retrieves the sheet names from an Excel file.

get_expression_doc

Retrieves documentation for available Polars expressions.

get_expressions

Retrieves a list of all available Flowfile expression names.

get_flow

Retrieves the settings for a specific flow.

get_flow_frontend_data

Retrieves the data needed to render the flow graph in the frontend.

get_flow_settings

Retrieves the main settings for a flow.

get_generated_code

Generates and returns a Python script with Polars code representing the flow.

get_graphic_walker_input

Gets the data and configuration for the Graphic Walker data exploration tool.

get_instant_function_result

Executes a simple, instant function on a node's data and returns the result.

get_list_of_saved_flows

Scans a directory for saved flow files (.flowfile).

get_local_files

Retrieves a list of files from a specified local directory.

get_node

Retrieves the complete state and data preview for a single node.

get_node_list

Retrieves the list of all available node types and their templates.

get_node_model

(Internal) Retrieves a node's Pydantic model from the input_schema module by its name.

get_run_status

Retrieves the run status information for a specific flow.

get_table_example

Retrieves a data preview (schema and sample rows) for a node's output.

get_vue_flow_data

Retrieves the flow data formatted for the Vue-based frontend.

import_saved_flow

Imports a flow from a saved .flowfile and registers it as a new session.

navigate_into_directory

Navigates the file explorer into a specified subdirectory.

navigate_to_directory

Navigates the file explorer to an absolute directory path.

navigate_up

Navigates the file explorer one directory level up.

register_flow

Registers a new flow session with the application.

run_flow

Executes a flow in a background task.

save_flow

Saves the current state of a flow to a .flowfile.

update_description_node

Updates the description text for a specific node.

update_flow_settings

Updates the main settings for a flow.

upload_file

Uploads a file to the server's 'uploads' directory.

validate_db_settings

Validates that a connection can be made to a database with the given settings.

add_generic_settings(input_data, node_type, current_user=Depends(get_current_active_user))

A generic endpoint to update the settings of any node.

This endpoint dynamically determines the correct Pydantic model and update function based on the node_type parameter.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/update_settings/', tags=['transform'])
def add_generic_settings(input_data: Dict[str, Any], node_type: str, current_user=Depends(get_current_active_user)):
    """A generic endpoint to update the settings of any node.

    This endpoint dynamically determines the correct Pydantic model and update
    function based on the `node_type` parameter.
    """
    input_data['user_id'] = current_user.id
    node_type = camel_case_to_snake_case(node_type)
    flow_id = int(input_data.get('flow_id'))
    logger.info(f'Updating the data for flow: {flow_id}, node {input_data["node_id"]}')
    flow = flow_file_handler.get_flow(flow_id)
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    add_func = getattr(flow, 'add_' + node_type)
    parsed_input = None
    setting_name_ref = 'node' + node_type.replace('_', '')
    if add_func is None:
        raise HTTPException(404, 'could not find the function')
    try:
        ref = get_node_model(setting_name_ref)
        if ref:
            parsed_input = ref(**input_data)
    except Exception as e:
        raise HTTPException(421, str(e))
    if parsed_input is None:
        raise HTTPException(404, 'could not find the interface')
    try:
        add_func(parsed_input)
    except Exception as e:
        logger.error(e)
        raise HTTPException(419, str(f'error: {e}'))
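
A hedged client-side sketch of this endpoint: node_type is passed as a query parameter and the request body must match the node's Pydantic settings model. The filter payload shown here is purely illustrative (the real field names come from the corresponding input_schema model), and the base URL and token are placeholders.

import requests

payload = {
    "flow_id": 1,
    "node_id": 3,
    # The remaining keys must match the node's settings model (e.g. the filter node's schema);
    # the ones below are illustrative placeholders, not documented field names.
    "filter_input": {"advanced_filter": "[city] == 'Amsterdam'"},
}
resp = requests.post(
    "http://localhost:8000/update_settings/",
    params={"node_type": "filter"},
    json=payload,
    headers={"Authorization": "Bearer <access-token>"},  # placeholder token
)
resp.raise_for_status()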
add_node(flow_id, node_id, node_type, pos_x=0, pos_y=0)

Adds a new, unconfigured node (a "promise") to the flow graph.

Parameters:

Name Type Description Default
flow_id int

The ID of the flow to add the node to.

required
node_id int

The client-generated ID for the new node.

required
node_type str

The type of the node to add (e.g., 'filter', 'join').

required
pos_x int

The X coordinate for the node's position in the UI.

0
pos_y int

The Y coordinate for the node's position in the UI.

0
Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/add_node/', tags=['editor'])
def add_node(flow_id: int, node_id: int, node_type: str, pos_x: int = 0, pos_y: int = 0):
    """Adds a new, unconfigured node (a "promise") to the flow graph.

    Args:
        flow_id: The ID of the flow to add the node to.
        node_id: The client-generated ID for the new node.
        node_type: The type of the node to add (e.g., 'filter', 'join').
        pos_x: The X coordinate for the node's position in the UI.
        pos_y: The Y coordinate for the node's position in the UI.
    """
    flow = flow_file_handler.get_flow(flow_id)
    logger.info(f'Adding a promise for {node_type}')
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')
    node = flow.get_node(node_id)
    if node is not None:
        flow.delete_node(node_id)
    node_promise = input_schema.NodePromise(flow_id=flow_id, node_id=node_id, cache_results=False, pos_x=pos_x,
                                            pos_y=pos_y,
                                            node_type=node_type)
    if node_type == 'explore_data':
        flow.add_initial_node_analysis(node_promise)
        return
    else:
        logger.info("Adding node")
        flow.add_node_promise(node_promise)

    if nodes.check_if_has_default_setting(node_type):
        logger.info(f'Found standard settings for {node_type}, trying to upload them')
        setting_name_ref = 'node' + node_type.replace('_', '')
        node_model = get_node_model(setting_name_ref)
        add_func = getattr(flow, 'add_' + node_type)
        initial_settings = node_model(flow_id=flow_id, node_id=node_id, cache_results=False,
                                      pos_x=pos_x, pos_y=pos_y, node_type=node_type)
        add_func(initial_settings)
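
For example, a client could place an unconfigured filter node at position (100, 200) in flow 1 like this; the base URL is a placeholder and node_id is chosen by the client:

import requests

resp = requests.post(
    "http://localhost:8000/editor/add_node/",
    params={"flow_id": 1, "node_id": 7, "node_type": "filter", "pos_x": 100, "pos_y": 200},
)
resp.raise_for_status()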
cancel_flow(flow_id)

Cancels a currently running flow execution.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/flow/cancel/', tags=['editor'])
def cancel_flow(flow_id: int):
    """Cancels a currently running flow execution."""
    flow = flow_file_handler.get_flow(flow_id)
    if not flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is not running')
    flow.cancel()
close_flow(flow_id)

Closes an active flow session.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/close_flow/', tags=['editor'])
def close_flow(flow_id: int) -> None:
    """Closes an active flow session."""
    flow_file_handler.delete_flow(flow_id)
connect_node(flow_id, node_connection)

Creates a connection (edge) between two nodes in the flow graph.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/connect_node/', tags=['editor'])
def connect_node(flow_id: int, node_connection: input_schema.NodeConnection):
    """Creates a connection (edge) between two nodes in the flow graph."""
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        logger.info('could not find the flow')
        raise HTTPException(404, 'could not find the flow')
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')
    add_connection(flow, node_connection)
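
The exact shape of input_schema.NodeConnection is not reproduced in this reference; the sketch below only assumes it nests an output_connection and an input_connection keyed by node_id, as the delete endpoint's log message suggests, so treat the payload as illustrative.

import requests

connection = {
    "output_connection": {"node_id": 7},  # upstream node (assumed field layout)
    "input_connection": {"node_id": 9},   # downstream node (assumed field layout)
}
resp = requests.post(
    "http://localhost:8000/editor/connect_node/",
    params={"flow_id": 1},
    json=connection,
)
resp.raise_for_status()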
copy_node(node_id_to_copy_from, flow_id_to_copy_from, node_promise)

Copies an existing node's settings to a new node promise.

Parameters:

Name Type Description Default
node_id_to_copy_from int

The ID of the node to copy the settings from.

required
flow_id_to_copy_from int

The ID of the flow containing the source node.

required
node_promise NodePromise

A NodePromise representing the new node to be created.

required
Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/copy_node', tags=['editor'])
def copy_node(node_id_to_copy_from: int, flow_id_to_copy_from: int, node_promise: input_schema.NodePromise):
    """Copies an existing node's settings to a new node promise.

    Args:
        node_id_to_copy_from: The ID of the node to copy the settings from.
        flow_id_to_copy_from: The ID of the flow containing the source node.
        node_promise: A `NodePromise` representing the new node to be created.
    """
    try:
        flow_to_copy_from = flow_file_handler.get_flow(flow_id_to_copy_from)
        flow = (flow_to_copy_from
                if flow_id_to_copy_from == node_promise.flow_id
                else flow_file_handler.get_flow(node_promise.flow_id)
                )
        node_to_copy = flow_to_copy_from.get_node(node_id_to_copy_from)
        logger.info(f"Copying data {node_promise.node_type}")

        if flow.flow_settings.is_running:
            raise HTTPException(422, "Flow is running")

        if flow.get_node(node_promise.node_id) is not None:
            flow.delete_node(node_promise.node_id)

        if node_promise.node_type == "explore_data":
            flow.add_initial_node_analysis(node_promise)
            return

        flow.copy_node(node_promise, node_to_copy.setting_input, node_to_copy.node_type)

    except Exception as e:
        logger.error(e)
        raise HTTPException(422, str(e))
create_db_connection(input_connection, current_user=Depends(get_current_active_user), db=Depends(get_db))

Creates and securely stores a new database connection.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post("/db_connection_lib", tags=['db_connections'])
def create_db_connection(input_connection: input_schema.FullDatabaseConnection,
                         current_user=Depends(get_current_active_user),
                         db: Session = Depends(get_db)
                         ):
    """Creates and securely stores a new database connection."""
    logger.info(f'Creating database connection {input_connection.connection_name}')
    try:
        store_database_connection(db, input_connection, current_user.id)
    except ValueError:
        raise HTTPException(422, 'Connection name already exists')
    except Exception as e:
        logger.error(e)
        raise HTTPException(422, str(e))
    return {"message": "Database connection created successfully"}
create_directory(new_directory)

Creates a new directory at the specified path.

Parameters:

Name Type Description Default
new_directory NewDirectory

An input_schema.NewDirectory object with the path and name.

required

Returns:

Type Description
bool

True if the directory was created successfully.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/files/create_directory', response_model=output_model.OutputDir, tags=['file manager'])
def create_directory(new_directory: input_schema.NewDirectory) -> bool:
    """Creates a new directory at the specified path.

    Args:
        new_directory: An `input_schema.NewDirectory` object with the path and name.

    Returns:
        `True` if the directory was created successfully.
    """
    result, error = create_dir(new_directory)
    if result:
        return True
    else:
        raise error
create_flow(flow_path)

Creates a new, empty flow file at the specified path and registers a session for it.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/create_flow/', tags=['editor'])
def create_flow(flow_path: str):
    """Creates a new, empty flow file at the specified path and registers a session for it."""
    flow_path = Path(flow_path)
    logger.info('Creating flow')
    return flow_file_handler.add_flow(name=flow_path.stem, flow_path=str(flow_path))
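
A minimal sketch of creating and registering a new flow from a client (placeholder base URL; the response body is whatever flow_file_handler.add_flow returns):

import requests

resp = requests.post(
    "http://localhost:8000/editor/create_flow/",
    params={"flow_path": "/tmp/my_pipeline.flowfile"},
)
resp.raise_for_status()
print(resp.json())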
delete_db_connection(connection_name, current_user=Depends(get_current_active_user), db=Depends(get_db))

Deletes a stored database connection.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.delete('/db_connection_lib', tags=['db_connections'])
def delete_db_connection(connection_name: str,
                         current_user=Depends(get_current_active_user),
                         db: Session = Depends(get_db)
                         ):
    """Deletes a stored database connection."""
    logger.info(f'Deleting database connection {connection_name}')
    db_connection = get_database_connection(db, connection_name, current_user.id)
    if db_connection is None:
        raise HTTPException(404, 'Database connection not found')
    delete_database_connection(db, connection_name, current_user.id)
    return {"message": "Database connection deleted successfully"}
delete_node(flow_id, node_id)

Deletes a node from the flow graph.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/delete_node/', tags=['editor'])
def delete_node(flow_id: Optional[int], node_id: int):
    """Deletes a node from the flow graph."""
    logger.info('Deleting node')
    flow = flow_file_handler.get_flow(flow_id)
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')
    flow.delete_node(node_id)
delete_node_connection(flow_id, node_connection=None)

Deletes a connection (edge) between two nodes.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/delete_connection/', tags=['editor'])
def delete_node_connection(flow_id: int, node_connection: input_schema.NodeConnection = None):
    """Deletes a connection (edge) between two nodes."""
    flow_id = int(flow_id)
    logger.info(
        f'Deleting connection node {node_connection.output_connection.node_id} to node {node_connection.input_connection.node_id}')
    flow = flow_file_handler.get_flow(flow_id)
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')
    delete_connection(flow, node_connection)
get_active_flow_file_sessions() async

Retrieves a list of all currently active flow sessions.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/active_flowfile_sessions/', response_model=List[schemas.FlowSettings])
async def get_active_flow_file_sessions() -> List[schemas.FlowSettings]:
    """Retrieves a list of all currently active flow sessions."""
    return [flf.flow_settings for flf in flow_file_handler.flowfile_flows]
get_current_directory_contents(file_types=None, include_hidden=False) async

Gets the contents of the file explorer's current directory.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/files/current_directory_contents/', response_model=List[FileInfo], tags=['file manager'])
async def get_current_directory_contents(file_types: List[str] = None, include_hidden: bool = False) -> List[FileInfo]:
    """Gets the contents of the file explorer's current directory."""
    return file_explorer.list_contents(file_types=file_types, show_hidden=include_hidden)
get_current_files() async

Gets the contents of the file explorer's current directory.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/files/tree/', response_model=List[FileInfo], tags=['file manager'])
async def get_current_files() -> List[FileInfo]:
    """Gets the contents of the file explorer's current directory."""
    f = file_explorer.list_contents()
    return f
get_current_path() async

Returns the current absolute path of the file explorer.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/files/current_path/', response_model=str, tags=['file manager'])
async def get_current_path() -> str:
    """Returns the current absolute path of the file explorer."""
    return str(file_explorer.current_path)
get_db_connections(db=Depends(get_db), current_user=Depends(get_current_active_user))

Retrieves all stored database connections for the current user (without passwords).

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/db_connection_lib', tags=['db_connections'],
            response_model=List[input_schema.FullDatabaseConnectionInterface])
def get_db_connections(
        db: Session = Depends(get_db),
        current_user=Depends(get_current_active_user)) -> List[input_schema.FullDatabaseConnectionInterface]:
    """Retrieves all stored database connections for the current user (without passwords)."""
    return get_all_database_connections_interface(db, current_user.id)
get_description_node(flow_id, node_id)

Retrieves the description text for a specific node.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node/description', tags=['editor'])
def get_description_node(flow_id: int, node_id: int):
    """Retrieves the description text for a specific node."""
    try:
        node = flow_file_handler.get_flow(flow_id).get_node(node_id)
    except:
        raise HTTPException(404, 'Could not find the node')
    if node is None:
        raise HTTPException(404, 'Could not find the node')
    return node.setting_input.description
get_directory_contents(directory, file_types=None, include_hidden=False) async

Gets the contents of an arbitrary directory path.

Parameters:

Name Type Description Default
directory str

The absolute path to the directory.

required
file_types List[str]

An optional list of file extensions to filter by.

None
include_hidden bool

If True, includes hidden files and directories.

False

Returns:

Type Description
List[FileInfo]

A list of FileInfo objects representing the directory's contents.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/files/directory_contents/', response_model=List[FileInfo], tags=['file manager'])
async def get_directory_contents(directory: str, file_types: List[str] = None,
                                 include_hidden: bool = False) -> List[FileInfo]:
    """Gets the contents of an arbitrary directory path.

    Args:
        directory: The absolute path to the directory.
        file_types: An optional list of file extensions to filter by.
        include_hidden: If True, includes hidden files and directories.

    Returns:
        A list of `FileInfo` objects representing the directory's contents.
    """
    directory_explorer = FileExplorer(directory)
    try:
        return directory_explorer.list_contents(show_hidden=include_hidden, file_types=file_types)
    except Exception as e:
        logger.error(e)
        HTTPException(404, 'Could not access the directory')
get_downstream_node_ids(flow_id, node_id) async

Gets a list of all node IDs that are downstream dependencies of a given node.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node/downstream_node_ids', response_model=List[int], tags=['editor'])
async def get_downstream_node_ids(flow_id: int, node_id: int) -> List[int]:
    """Gets a list of all node IDs that are downstream dependencies of a given node."""
    flow = flow_file_handler.get_flow(flow_id)
    node = flow.get_node(node_id)
    return list(node.get_all_dependent_node_ids())
get_excel_sheet_names(path) async

Retrieves the sheet names from an Excel file.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/api/get_xlsx_sheet_names', tags=['excel_reader'], response_model=List[str])
async def get_excel_sheet_names(path: str) -> List[str] | None:
    """Retrieves the sheet names from an Excel file."""
    sheet_names = excel_file_manager.get_sheet_names(path)
    if sheet_names:
        return sheet_names
    else:
        raise HTTPException(404, 'File not found')
get_expression_doc()

Retrieves documentation for available Polars expressions.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/editor/expression_doc', tags=['editor'], response_model=List[output_model.ExpressionsOverview])
def get_expression_doc() -> List[output_model.ExpressionsOverview]:
    """Retrieves documentation for available Polars expressions."""
    return get_expression_overview()
get_expressions()

Retrieves a list of all available Flowfile expression names.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/editor/expressions', tags=['editor'], response_model=List[str])
def get_expressions() -> List[str]:
    """Retrieves a list of all available Flowfile expression names."""
    return get_all_expressions()
get_flow(flow_id)

Retrieves the settings for a specific flow.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/editor/flow', tags=['editor'], response_model=schemas.FlowSettings)
def get_flow(flow_id: int):
    """Retrieves the settings for a specific flow."""
    flow_id = int(flow_id)
    result = get_flow_settings(flow_id)
    return result
get_flow_frontend_data(flow_id=1)

Retrieves the data needed to render the flow graph in the frontend.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/flow_data', tags=['manager'])
def get_flow_frontend_data(flow_id: Optional[int] = 1):
    """Retrieves the data needed to render the flow graph in the frontend."""
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    return flow.get_frontend_data()
get_flow_settings(flow_id=1)

Retrieves the main settings for a flow.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/flow_settings', tags=['manager'], response_model=schemas.FlowSettings)
def get_flow_settings(flow_id: Optional[int] = 1) -> schemas.FlowSettings:
    """Retrieves the main settings for a flow."""
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    return flow.flow_settings
get_generated_code(flow_id)

Generates and returns a Python script with Polars code representing the flow.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get("/editor/code_to_polars", tags=[], response_model=str)
def get_generated_code(flow_id: int) -> str:
    """Generates and returns a Python script with Polars code representing the flow."""
    flow_id = int(flow_id)
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    return export_flow_to_polars(flow)
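
For instance, the generated script can be fetched and written to disk for standalone use (placeholder base URL); because the response_model is str, the body is a JSON-encoded string:

import requests

resp = requests.get("http://localhost:8000/editor/code_to_polars", params={"flow_id": 1})
resp.raise_for_status()
with open("exported_pipeline.py", "w") as fh:
    fh.write(resp.json())  # the endpoint returns the Polars script as a JSON string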
get_graphic_walker_input(flow_id, node_id)

Gets the data and configuration for the Graphic Walker data exploration tool.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/analysis_data/graphic_walker_input', tags=['analysis'], response_model=input_schema.NodeExploreData)
def get_graphic_walker_input(flow_id: int, node_id: int):
    """Gets the data and configuration for the Graphic Walker data exploration tool."""
    flow = flow_file_handler.get_flow(flow_id)
    node = flow.get_node(node_id)
    if node.results.analysis_data_generator is None:
        logger.error('The data is not refreshed and available for analysis')
        raise HTTPException(422, 'The data is not refreshed and available for analysis')
    return AnalyticsProcessor.process_graphic_walker_input(node)
get_instant_function_result(flow_id, node_id, func_string) async

Executes a simple, instant function on a node's data and returns the result.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/custom_functions/instant_result', tags=[])
async def get_instant_function_result(flow_id: int, node_id: int, func_string: str):
    """Executes a simple, instant function on a node's data and returns the result."""
    try:
        node = flow_file_handler.get_node(flow_id, node_id)
        result = await asyncio.to_thread(get_instant_func_results, node, func_string)
        return result
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
get_list_of_saved_flows(path)

Scans a directory for saved flow files (.flowfile).

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/files/available_flow_files', tags=['editor'], response_model=List[FileInfo])
def get_list_of_saved_flows(path: str):
    """Scans a directory for saved flow files (`.flowfile`)."""
    try:
        return get_files_from_directory(path, types=['flowfile'])
    except:
        return []
get_local_files(directory) async

Retrieves a list of files from a specified local directory.

Parameters:

Name Type Description Default
directory str

The absolute path of the directory to scan.

required

Returns:

Type Description
List[FileInfo]

A list of FileInfo objects for each item in the directory.

Raises:

Type Description
HTTPException

404 if the directory does not exist.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/files/files_in_local_directory/', response_model=List[FileInfo], tags=['file manager'])
async def get_local_files(directory: str) -> List[FileInfo]:
    """Retrieves a list of files from a specified local directory.

    Args:
        directory: The absolute path of the directory to scan.

    Returns:
        A list of `FileInfo` objects for each item in the directory.

    Raises:
        HTTPException: 404 if the directory does not exist.
    """
    files = get_files_from_directory(directory)
    if files is None:
        raise HTTPException(404, 'Directory does not exist')
    return files
get_node(flow_id, node_id, get_data=False)

Retrieves the complete state and data preview for a single node.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node', response_model=output_model.NodeData, tags=['editor'])
def get_node(flow_id: int, node_id: int, get_data: bool = False):
    """Retrieves the complete state and data preview for a single node."""
    logging.info(f'Getting node {node_id} from flow {flow_id}')
    flow = flow_file_handler.get_flow(flow_id)
    node = flow.get_node(node_id)
    if node is None:
        raise HTTPException(422, 'Not found')
    v = node.get_node_data(flow_id=flow.flow_id, include_example=get_data)
    return v
get_node_list()

Retrieves the list of all available node types and their templates.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node_list', response_model=List[nodes.NodeTemplate])
def get_node_list() -> List[nodes.NodeTemplate]:
    """Retrieves the list of all available node types and their templates."""
    return nodes.nodes_list
get_node_model(setting_name_ref)

(Internal) Retrieves a node's Pydantic model from the input_schema module by its name.

Source code in flowfile_core/flowfile_core/routes/routes.py
def get_node_model(setting_name_ref: str):
    """(Internal) Retrieves a node's Pydantic model from the input_schema module by its name."""
    logger.info("Getting node model for: " + setting_name_ref)
    for ref_name, ref in inspect.getmodule(input_schema).__dict__.items():
        if ref_name.lower() == setting_name_ref:
            return ref
    logger.error(f"Could not find node model for: {setting_name_ref}")
get_run_status(flow_id, response)

Retrieves the run status information for a specific flow.

Returns a 202 Accepted status while the flow is running, and 200 OK when finished.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/flow/run_status/', tags=['editor'],
            response_model=output_model.RunInformation)
def get_run_status(flow_id: int, response: Response):
    """Retrieves the run status information for a specific flow.

    Returns a 202 Accepted status while the flow is running, and 200 OK when finished.
    """
    flow = flow_file_handler.get_flow(flow_id)
    if not flow:
        raise HTTPException(status_code=404, detail="Flow not found")
    if flow.flow_settings.is_running:
        response.status_code = status.HTTP_202_ACCEPTED
        return flow.get_run_info()
    response.status_code = status.HTTP_200_OK
    return flow.get_run_info()
get_table_example(flow_id, node_id)

Retrieves a data preview (schema and sample rows) for a node's output.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node/data', response_model=output_model.TableExample, tags=['editor'])
def get_table_example(flow_id: int, node_id: int):
    """Retrieves a data preview (schema and sample rows) for a node's output."""
    flow = flow_file_handler.get_flow(flow_id)
    node = flow.get_node(node_id)
    return node.get_table_example(True)
get_vue_flow_data(flow_id)

Retrieves the flow data formatted for the Vue-based frontend.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/flow_data/v2', tags=['manager'])
def get_vue_flow_data(flow_id: int) -> schemas.VueFlowInput:
    """Retrieves the flow data formatted for the Vue-based frontend."""
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    data = flow.get_vue_flow_input()
    return data
import_saved_flow(flow_path)

Imports a flow from a saved .flowfile and registers it as a new session.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/import_flow/', tags=['editor'], response_model=int)
def import_saved_flow(flow_path: str) -> int:
    """Imports a flow from a saved `.flowfile` and registers it as a new session."""
    flow_path = Path(flow_path)
    if not flow_path.exists():
        raise HTTPException(404, 'File not found')
    return flow_file_handler.import_flow(flow_path)
navigate_into_directory(directory_name) async

Navigates the file explorer into a specified subdirectory.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/files/navigate_into/', response_model=str, tags=['file manager'])
async def navigate_into_directory(directory_name: str) -> str:
    """Navigates the file explorer into a specified subdirectory."""
    file_explorer.navigate_into(directory_name)
    return str(file_explorer.current_path)
navigate_to_directory(directory_name) async

Navigates the file explorer to an absolute directory path.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/files/navigate_to/', tags=['file manager'])
async def navigate_to_directory(directory_name: str) -> str:
    """Navigates the file explorer to an absolute directory path."""
    file_explorer.navigate_to(directory_name)
    return str(file_explorer.current_path)
navigate_up() async

Navigates the file explorer one directory level up.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/files/navigate_up/', response_model=str, tags=['file manager'])
async def navigate_up() -> str:
    """Navigates the file explorer one directory level up."""
    file_explorer.navigate_up()
    return str(file_explorer.current_path)
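
The three navigation endpoints above mutate server-side file-explorer state, so a typical browsing sequence chains them with the current-path and directory-contents endpoints documented earlier. A sketch with placeholder base URL and directory names:

import requests

BASE = "http://localhost:8000"

print(requests.get(f"{BASE}/files/current_path/").json())                        # current explorer location
requests.post(f"{BASE}/files/navigate_to/", params={"directory_name": "/data"})
requests.post(f"{BASE}/files/navigate_into/", params={"directory_name": "raw"})
for item in requests.get(f"{BASE}/files/current_directory_contents/").json():
    print(item)                                                                   # FileInfo entries
requests.post(f"{BASE}/files/navigate_up/")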
register_flow(flow_data)

Registers a new flow session with the application.

Parameters:

Name Type Description Default
flow_data FlowSettings

The FlowSettings for the new flow.

required

Returns:

Type Description
int

The ID of the newly registered flow.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/flow/register/', tags=['editor'])
def register_flow(flow_data: schemas.FlowSettings) -> int:
    """Registers a new flow session with the application.

    Args:
        flow_data: The `FlowSettings` for the new flow.

    Returns:
        The ID of the newly registered flow.
    """
    return flow_file_handler.register_flow(flow_data)
run_flow(flow_id, background_tasks) async

Executes a flow in a background task.

Parameters:

Name Type Description Default
flow_id int

The ID of the flow to execute.

required
background_tasks BackgroundTasks

FastAPI's background task runner.

required

Returns:

Type Description
JSONResponse

A JSON response indicating that the flow has started.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/flow/run/', tags=['editor'])
async def run_flow(flow_id: int, background_tasks: BackgroundTasks) -> JSONResponse:
    """Executes a flow in a background task.

    Args:
        flow_id: The ID of the flow to execute.
        background_tasks: FastAPI's background task runner.

    Returns:
        A JSON response indicating that the flow has started.
    """
    logger.info('starting to run...')
    flow = flow_file_handler.get_flow(flow_id)
    lock = get_flow_run_lock(flow_id)
    async with lock:
        if flow.flow_settings.is_running:
            raise HTTPException(422, 'Flow is already running')
        background_tasks.add_task(flow.run_graph)
    return JSONResponse(content={"message": "Data started", "flow_id": flow_id}, status_code=status.HTTP_200_OK)
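
Because execution happens in a background task, a client typically starts the run and then polls the run-status endpoint described earlier, which responds with 202 while the flow is still running and 200 once it has finished. A minimal polling sketch (placeholder base URL):

import time
import requests

BASE = "http://localhost:8000"

requests.post(f"{BASE}/flow/run/", params={"flow_id": 1}).raise_for_status()

while True:
    status_resp = requests.get(f"{BASE}/flow/run_status/", params={"flow_id": 1})
    if status_resp.status_code == 200:  # 202 means the flow is still running
        break
    time.sleep(1)
print("success:", status_resp.json()["success"])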
save_flow(flow_id, flow_path=None)

Saves the current state of a flow to a .flowfile.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/save_flow', tags=['editor'])
def save_flow(flow_id: int, flow_path: str = None):
    """Saves the current state of a flow to a `.flowfile`."""
    flow = flow_file_handler.get_flow(flow_id)
    flow.save_flow(flow_path=flow_path)
update_description_node(flow_id, node_id, description=Body(...))

Updates the description text for a specific node.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/node/description/', tags=['editor'])
def update_description_node(flow_id: int, node_id: int, description: str = Body(...)):
    """Updates the description text for a specific node."""
    try:
        node = flow_file_handler.get_flow(flow_id).get_node(node_id)
    except:
        raise HTTPException(404, 'Could not find the node')
    node.setting_input.description = description
    return True
update_flow_settings(flow_settings)

Updates the main settings for a flow.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/flow_settings', tags=['manager'])
def update_flow_settings(flow_settings: schemas.FlowSettings):
    """Updates the main settings for a flow."""
    flow = flow_file_handler.get_flow(flow_settings.flow_id)
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    flow.flow_settings = flow_settings
upload_file(file=File(...)) async

Uploads a file to the server's 'uploads' directory.

Parameters:

Name Type Description Default
file UploadFile

The file to be uploaded.

File(...)

Returns:

Type Description
JSONResponse

A JSON response containing the filename and the path where it was saved.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post("/upload/")
async def upload_file(file: UploadFile = File(...)) -> JSONResponse:
    """Uploads a file to the server's 'uploads' directory.

    Args:
        file: The file to be uploaded.

    Returns:
        A JSON response containing the filename and the path where it was saved.
    """
    file_location = f"uploads/{file.filename}"
    with open(file_location, "wb+") as file_object:
        file_object.write(file.file.read())
    return JSONResponse(content={"filename": file.filename, "filepath": file_location})
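
A quick upload sketch with requests (placeholder base URL); the server writes the file to an 'uploads/' directory relative to its working directory and echoes the path back:

import requests

with open("sales.csv", "rb") as fh:
    resp = requests.post("http://localhost:8000/upload/", files={"file": fh})
resp.raise_for_status()
print(resp.json())  # e.g. {"filename": "sales.csv", "filepath": "uploads/sales.csv"}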
validate_db_settings(database_settings, current_user=Depends(get_current_active_user)) async

Validates that a connection can be made to a database with the given settings.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post("/validate_db_settings")
async def validate_db_settings(
        database_settings: input_schema.DatabaseSettings,
        current_user=Depends(get_current_active_user)
):
    """Validates that a connection can be made to a database with the given settings."""
    # Validate the query settings
    try:
        sql_source = create_sql_source_from_db_settings(database_settings, user_id=current_user.id)
        sql_source.validate()
        return {"message": "Query settings are valid"}
    except Exception as e:
        raise HTTPException(status_code=422, detail=str(e))

auth

flowfile_core.routes.auth

cloud_connections

flowfile_core.routes.cloud_connections

Functions:

Name Description
create_cloud_storage_connection

Create a new cloud storage connection.

delete_cloud_connection_with_connection_name

Delete a cloud connection.

get_cloud_connections

Get all cloud storage connections for the current user.

create_cloud_storage_connection(input_connection, current_user=Depends(get_current_active_user), db=Depends(get_db))

Create a new cloud storage connection.

Parameters:

  • input_connection: FullCloudStorageConnection schema containing connection details
  • current_user: User obtained from Depends(get_current_active_user)
  • db: Session obtained from Depends(get_db)

Returns:

  • Dict with a success message

Source code in flowfile_core/flowfile_core/routes/cloud_connections.py
@router.post("/cloud_connection", tags=['cloud_connections'])
def create_cloud_storage_connection(input_connection: FullCloudStorageConnection,
                                    current_user=Depends(get_current_active_user),
                                    db: Session = Depends(get_db)
                                    ):
    """
    Create a new cloud storage connection.
    Parameters
        input_connection: FullCloudStorageConnection schema containing connection details
        current_user: User obtained from Depends(get_current_active_user)
        db: Session obtained from Depends(get_db)
    Returns
        Dict with a success message
    """
    logger.info(f'Create cloud connection {input_connection.connection_name}')
    try:
        store_cloud_connection(db, input_connection, current_user.id)
    except ValueError:
        raise HTTPException(422, 'Connection name already exists')
    except Exception as e:
        logger.error(e)
        raise HTTPException(422, str(e))
    return {"message": "Cloud connection created successfully"}
delete_cloud_connection_with_connection_name(connection_name, current_user=Depends(get_current_active_user), db=Depends(get_db))

Delete a cloud connection.

Source code in flowfile_core/flowfile_core/routes/cloud_connections.py
@router.delete('/cloud_connection', tags=['cloud_connections'])
def delete_cloud_connection_with_connection_name(connection_name: str,
                                                 current_user=Depends(get_current_active_user),
                                                 db: Session = Depends(get_db)
                                                 ):
    """
    Delete a cloud connection.
    """
    logger.info(f'Deleting cloud connection {connection_name}')
    cloud_storage_connection = get_cloud_connection_schema(db, connection_name, current_user.id)
    if cloud_storage_connection is None:
        raise HTTPException(404, 'Cloud connection connection not found')
    delete_cloud_connection(db, connection_name, current_user.id)
    return {"message": "Cloud connection deleted successfully"}
get_cloud_connections(db=Depends(get_db), current_user=Depends(get_current_active_user))

Get all cloud storage connections for the current user.

Parameters:

  • db: Session obtained from Depends(get_db)
  • current_user: User obtained from Depends(get_current_active_user)

Returns:

  • List[FullCloudStorageConnectionInterface]

Source code in flowfile_core/flowfile_core/routes/cloud_connections.py
@router.get('/cloud_connections', tags=['cloud_connection'],
            response_model=List[FullCloudStorageConnectionInterface])
def get_cloud_connections(
        db: Session = Depends(get_db),
        current_user=Depends(get_current_active_user)) -> List[FullCloudStorageConnectionInterface]:
    """
    Get all cloud storage connections for the current user.
    Parameters
        db: Session obtained from Depends(get_db)
        current_user: User obtained from Depends(get_current_active_user)

    Returns
        List[FullCloudStorageConnectionInterface]
    """
    return get_all_cloud_connections_interface(db, current_user.id)

logs

flowfile_core.routes.logs

Functions:

Name Description
add_log

Adds a log message to the log file for a given flow_id.

add_raw_log

Adds a log message to the log file for a given flow_id.

format_sse_message

Format the data as a proper SSE message

stream_logs

Streams logs for a given flow_id using Server-Sent Events.

add_log(flow_id, log_message) async

Adds a log message to the log file for a given flow_id.

Source code in flowfile_core/flowfile_core/routes/logs.py
@router.post("/logs/{flow_id}", tags=['flow_logging'])
async def add_log(flow_id: int, log_message: str):
    """Adds a log message to the log file for a given flow_id."""
    flow = flow_file_handler.get_flow(flow_id)
    if not flow:
        raise HTTPException(status_code=404, detail="Flow not found")
    flow.flow_logger.info(log_message)
    return {"message": "Log added successfully"}
add_raw_log(raw_log_input) async

Adds a log message to the log file for a given flow_id.

Source code in flowfile_core/flowfile_core/routes/logs.py
@router.post("/raw_logs", tags=['flow_logging'])
async def add_raw_log(raw_log_input: schemas.RawLogInput):
    """Adds a log message to the log file for a given flow_id."""
    logger.info('Adding raw logs')
    flow = flow_file_handler.get_flow(raw_log_input.flowfile_flow_id)
    if not flow:
        raise HTTPException(status_code=404, detail="Flow not found")
    flow.flow_logger.get_log_filepath()
    flow_logger = flow.flow_logger
    flow_logger.get_log_filepath()
    if raw_log_input.log_type == 'INFO':
        flow_logger.info(raw_log_input.log_message,
                         extra=raw_log_input.extra)
    elif raw_log_input.log_type == 'ERROR':
        flow_logger.error(raw_log_input.log_message,
                          extra=raw_log_input.extra)
    return {"message": "Log added successfully"}
format_sse_message(data) async

Format the data as a proper SSE message

Source code in flowfile_core/flowfile_core/routes/logs.py
async def format_sse_message(data: str) -> str:
    """Format the data as a proper SSE message"""
    return f"data: {json.dumps(data)}\n\n"
stream_logs(flow_id, idle_timeout=300, current_user=Depends(get_current_user_from_query)) async

Streams logs for a given flow_id using Server-Sent Events. Requires authentication via token in query parameter. The connection will close gracefully if the server shuts down.

Source code in flowfile_core/flowfile_core/routes/logs.py
@router.get("/logs/{flow_id}", tags=['flow_logging'])
async def stream_logs(
    flow_id: int,
    idle_timeout: int = 300,
    current_user=Depends(get_current_user_from_query)
):
    """
    Streams logs for a given flow_id using Server-Sent Events.
    Requires authentication via token in query parameter.
    The connection will close gracefully if the server shuts down.
    """
    logger.info(f"Starting log stream for flow_id: {flow_id} by user: {current_user.username}")
    await asyncio.sleep(.3)
    flow = flow_file_handler.get_flow(flow_id)
    logger.info('Streaming logs')
    if not flow:
        raise HTTPException(status_code=404, detail="Flow not found")

    log_file_path = flow.flow_logger.get_log_filepath()
    if not Path(log_file_path).exists():
        raise HTTPException(status_code=404, detail="Log file not found")

    class RunningState:
        def __init__(self):
            self.has_started = False

        def is_running(self):
            if flow.flow_settings.is_running:
                self.has_started = True
            return flow.flow_settings.is_running or not self.has_started

    running_state = RunningState()

    return StreamingResponse(
        stream_log_file(log_file_path, running_state.is_running, idle_timeout),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Content-Type": "text/event-stream",
        }
    )
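
Each SSE message produced by format_sse_message is a "data: <json>" line, so a client can read the stream line by line. The query-parameter name used for the token is not shown in this reference, so 'token' below is an assumption, as is the base URL.

import json
import requests

resp = requests.get(
    "http://localhost:8000/logs/1",
    params={"token": "<access-token>", "idle_timeout": 60},  # 'token' parameter name is assumed
    stream=True,
)
for line in resp.iter_lines(decode_unicode=True):
    if line and line.startswith("data: "):
        print(json.loads(line[len("data: "):]))  # each message carries a JSON-encoded log line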

public

flowfile_core.routes.public

Functions:

Name Description
docs_redirect

Redirects to the documentation page.

docs_redirect() async

Redirects to the documentation page.

Source code in flowfile_core/flowfile_core/routes/public.py
@router.get("/", tags=['admin'])
async def docs_redirect():
    """ Redirects to the documentation page."""
    return RedirectResponse(url='/docs')

secrets

flowfile_core.routes.secrets

Manages CRUD (Create, Read, Update, Delete) operations for secrets.

This router provides secure endpoints for creating, retrieving, and deleting sensitive credentials for the authenticated user. Secrets are encrypted before being stored and are associated with the user's ID.

Functions:

Name Description
create_secret

Creates a new secret for the authenticated user.

delete_secret

Deletes a secret by name for the authenticated user.

get_secret

Retrieves a specific secret by name for the authenticated user.

get_secrets

Retrieves all secret names for the currently authenticated user.

create_secret(secret, current_user=Depends(get_current_active_user), db=Depends(get_db)) async

Creates a new secret for the authenticated user.

The secret value is encrypted before being stored in the database. A secret name must be unique for a given user.

Parameters:

Name Type Description Default
secret SecretInput

A SecretInput object containing the name and plaintext value of the secret.

required
current_user

The authenticated user object, injected by FastAPI.

Depends(get_current_active_user)
db Session

The database session, injected by FastAPI.

Depends(get_db)

Raises:

Type Description
HTTPException

400 if a secret with the same name already exists for the user.

Returns:

Type Description
Secret

A Secret object containing the name and the encrypted value.

Source code in flowfile_core/flowfile_core/routes/secrets.py
@router.post("/secrets", response_model=Secret)
async def create_secret(secret: SecretInput, current_user=Depends(get_current_active_user),
                        db: Session = Depends(get_db)) -> Secret:
    """Creates a new secret for the authenticated user.

    The secret value is encrypted before being stored in the database. A secret
    name must be unique for a given user.

    Args:
        secret: A `SecretInput` object containing the name and plaintext value of the secret.
        current_user: The authenticated user object, injected by FastAPI.
        db: The database session, injected by FastAPI.

    Raises:
        HTTPException: 400 if a secret with the same name already exists for the user.

    Returns:
        A `Secret` object containing the name and the *encrypted* value.
    """
    # Get user ID
    user_id = 1 if os.environ.get("FLOWFILE_MODE") == "electron" else current_user.id

    existing_secret = db.query(db_models.Secret).filter(
        db_models.Secret.user_id == user_id,
        db_models.Secret.name == secret.name
    ).first()

    if existing_secret:
        raise HTTPException(status_code=400, detail="Secret with this name already exists")

    # The store_secret function handles encryption and DB storage
    stored_secret = store_secret(db, secret, user_id)
    return Secret(name=stored_secret.name, value=stored_secret.encrypted_value, user_id=str(user_id))
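
A hedged example of storing a secret (placeholder base URL and token); the SecretInput field names are assumed from the docstring's "name and plaintext value", and the response carries the encrypted value, not the plaintext:

import requests

resp = requests.post(
    "http://localhost:8000/secrets",
    json={"name": "postgres_password", "value": "hunter2"},  # field names assumed from SecretInput
    headers={"Authorization": "Bearer <access-token>"},       # placeholder token
)
resp.raise_for_status()
print(resp.json()["name"])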
delete_secret(secret_name, current_user=Depends(get_current_active_user), db=Depends(get_db)) async

Deletes a secret by name for the authenticated user.

Parameters:

    secret_name (str): The name of the secret to delete. Required.
    current_user: The authenticated user object, injected by FastAPI. Default: Depends(get_current_active_user).
    db (Session): The database session, injected by FastAPI. Default: Depends(get_db).

Returns:

    None: An empty response with a 204 No Content status code upon success.

Source code in flowfile_core/flowfile_core/routes/secrets.py
@router.delete("/secrets/{secret_name}", status_code=204)
async def delete_secret(secret_name: str, current_user=Depends(get_current_active_user),
                        db: Session = Depends(get_db)) -> None:
    """Deletes a secret by name for the authenticated user.

    Args:
        secret_name: The name of the secret to delete.
        current_user: The authenticated user object, injected by FastAPI.
        db: The database session, injected by FastAPI.

    Returns:
        An empty response with a 204 No Content status code upon success.
    """
    # Get user ID
    user_id = 1 if os.environ.get("FLOWFILE_MODE") == "electron" else current_user.id
    delete_secret_action(db, secret_name, user_id)
    return None
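
A corresponding client-side sketch, under the same assumptions as above (base URL, router mount point, and bearer-token auth are hypothetical):

import requests

BASE_URL = "http://localhost:8000"               # hypothetical host/port
HEADERS = {"Authorization": "Bearer <token>"}    # assumed bearer-token auth

# DELETE /secrets/{secret_name}; a successful deletion returns 204 No Content.
resp = requests.delete(f"{BASE_URL}/secrets/my_database_password", headers=HEADERS)
assert resp.status_code == 204
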
get_secret(secret_name, current_user=Depends(get_current_active_user), db=Depends(get_db)) async

Retrieves a specific secret by name for the authenticated user.

Note: This endpoint returns the secret name and metadata but does not expose the decrypted secret value.

Parameters:

    secret_name (str): The name of the secret to retrieve. Required.
    current_user: The authenticated user object, injected by FastAPI. Default: Depends(get_current_active_user).
    db (Session): The database session, injected by FastAPI. Default: Depends(get_db).

Raises:

    HTTPException: 404 if the secret is not found.

Returns:

    Secret: A Secret object containing the name and encrypted value.

Source code in flowfile_core/flowfile_core/routes/secrets.py
@router.get("/secrets/{secret_name}", response_model=Secret)
async def get_secret(secret_name: str,
                     current_user=Depends(get_current_active_user), db: Session = Depends(get_db)) -> Secret:
    """Retrieves a specific secret by name for the authenticated user.

    Note: This endpoint returns the secret name and metadata but does not
    expose the decrypted secret value.

    Args:
        secret_name: The name of the secret to retrieve.
        current_user: The authenticated user object, injected by FastAPI.
        db: The database session, injected by FastAPI.

    Raises:
        HTTPException: 404 if the secret is not found.

    Returns:
        A `Secret` object containing the name and encrypted value.
    """
    # Get user ID
    user_id = 1 if os.environ.get("FLOWFILE_MODE") == "electron" else current_user.id

    # Get secret from database
    db_secret = db.query(db_models.Secret).filter(
        db_models.Secret.user_id == user_id,
        db_models.Secret.name == secret_name
    ).first()

    if not db_secret:
        raise HTTPException(status_code=404, detail="Secret not found")

    return Secret(
        name=db_secret.name,
        value=db_secret.encrypted_value,
        user_id=str(db_secret.user_id)
    )
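
A client-side sketch of retrieving a single secret, again assuming a local deployment and bearer-token auth (both hypothetical):

import requests

BASE_URL = "http://localhost:8000"               # hypothetical host/port
HEADERS = {"Authorization": "Bearer <token>"}    # assumed bearer-token auth

# GET /secrets/{secret_name}; a missing secret yields 404.
resp = requests.get(f"{BASE_URL}/secrets/my_database_password", headers=HEADERS)
if resp.status_code == 404:
    print("Secret not found")
else:
    resp.raise_for_status()
    secret = resp.json()
    # The "value" field holds the encrypted value; the plaintext is not exposed.
    print(secret["name"], secret["user_id"])
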
get_secrets(current_user=Depends(get_current_active_user), db=Depends(get_db)) async

Retrieves all secret names for the currently authenticated user.

Note: This endpoint returns the secret names and metadata but does not expose the decrypted secret values.

Parameters:

    current_user: The authenticated user object, injected by FastAPI. Default: Depends(get_current_active_user).
    db (Session): The database session, injected by FastAPI. Default: Depends(get_db).

Returns:

    List[Secret]: A list of Secret objects, each containing the name and encrypted value.

Source code in flowfile_core/flowfile_core/routes/secrets.py
@router.get("/secrets", response_model=List[Secret])
async def get_secrets(current_user=Depends(get_current_active_user), db: Session = Depends(get_db)):
    """Retrieves all secret names for the currently authenticated user.

    Note: This endpoint returns the secret names and metadata but does not
    expose the decrypted secret values.

    Args:
        current_user: The authenticated user object, injected by FastAPI.
        db: The database session, injected by FastAPI.

    Returns:
        A list of `Secret` objects, each containing the name and encrypted value.
    """
    user_id = current_user.id

    # Get secrets from database
    db_secrets = db.query(db_models.Secret).filter(db_models.Secret.user_id == user_id).all()

    # Prepare response model (without decrypting)
    secrets = []
    for db_secret in db_secrets:
        secrets.append(Secret(
            name=db_secret.name,
            value=db_secret.encrypted_value,
            user_id=str(db_secret.user_id)
        ))

    return secrets
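
Finally, a sketch of listing all secrets for the current user, under the same hypothetical base URL and auth assumptions:

import requests

BASE_URL = "http://localhost:8000"               # hypothetical host/port
HEADERS = {"Authorization": "Bearer <token>"}    # assumed bearer-token auth

# GET /secrets returns one Secret per stored credential for the user.
resp = requests.get(f"{BASE_URL}/secrets", headers=HEADERS)
resp.raise_for_status()
# Only names and encrypted values are returned; plaintext values stay server-side.
for secret in resp.json():
    print(secret["name"])
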