
Flowfile Core API Reference

This section provides a detailed API reference for the core Python objects, data models, and API routes in flowfile-core. The documentation is generated directly from the source code docstrings.


Core Components

This section covers the fundamental classes that manage the state and execution of data pipelines. These are the main "verbs" of the library.

FlowGraph

The FlowGraph is the central object that orchestrates the execution of data transformations. It is built incrementally as you chain operations, and the resulting DAG (Directed Acyclic Graph) represents the entire pipeline.
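The sketch below illustrates that incremental style. It is a minimal, hypothetical example: the import path of the schemas modules and the constructor fields of the settings objects (FlowGraphConfig, NodeDatasource, NodeRecordCount) are assumptions inferred from the source shown further down, not a verbatim quick-start.

# A minimal sketch, assuming the schema import path and constructor fields below;
# consult the settings models in this reference for the authoritative signatures.
from flowfile_core.flowfile.flow_graph import FlowGraph
from flowfile_core.schemas import schemas, input_schema  # assumed module layout

graph = FlowGraph(flow_settings=schemas.FlowGraphConfig(flow_id=1))  # assumed fields
# Each add_* call registers one node; depending_on_id wires a node to its input.
graph.add_datasource(input_schema.NodeDatasource(node_id=1, file_path="sales.csv"))
graph.add_record_count(input_schema.NodeRecordCount(node_id=2, depending_on_id=1))
graph.run_graph()  # executes the whole DAG from its starting nodes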

flowfile_core.flowfile.flow_graph.FlowGraph

A class representing a Directed Acyclic Graph (DAG) for data processing pipelines.

It manages nodes, connections, and the execution of the entire flow.

Methods:

__init__: Initializes a new FlowGraph instance.
__repr__: Provides the official string representation of the FlowGraph instance.
add_cloud_storage_reader: Adds a cloud storage read node to the flow graph.
add_cloud_storage_writer: Adds a node to write data to a cloud storage provider.
add_cross_join: Adds a cross join node to the graph.
add_database_reader: Adds a node to read data from a database.
add_database_writer: Adds a node to write data to a database.
add_datasource: Adds a data source node to the graph.
add_dependency_on_polars_lazy_frame: Adds a special node that directly injects a Polars LazyFrame into the graph.
add_explore_data: Adds a specialized node for data exploration and visualization.
add_external_source: Adds a node for a custom external data source.
add_filter: Adds a filter node to the graph.
add_formula: Adds a node that applies a formula to create or modify a column.
add_fuzzy_match: Adds a fuzzy matching node to join data on approximate string matches.
add_graph_solver: Adds a node that solves graph-like problems within the data.
add_group_by: Adds a group-by aggregation node to the graph.
add_include_cols: Adds columns to both the input and output column lists.
add_initial_node_analysis: Adds a data exploration/analysis node based on a node promise.
add_join: Adds a join node to combine two data streams based on key columns.
add_manual_input: Adds a node for manual data entry.
add_node_promise: Adds a placeholder node to the graph that is not yet fully configured.
add_node_step: The core method for adding or updating a node in the graph.
add_output: Adds an output node to write the final data to a destination.
add_pivot: Adds a pivot node to the graph.
add_polars_code: Adds a node that executes custom Polars code.
add_read: Adds a node to read data from a local file (e.g., CSV, Parquet, Excel).
add_record_count: Adds a node that counts the number of records in the data.
add_record_id: Adds a node to create a new column with a unique ID for each record.
add_sample: Adds a node to take a random or top-N sample of the data.
add_select: Adds a node to select, rename, reorder, or drop columns.
add_sort: Adds a node to sort the data based on one or more columns.
add_sql_source: Adds a node that reads data from a SQL source.
add_text_to_rows: Adds a node that splits cell values into multiple rows.
add_union: Adds a union node to combine multiple data streams.
add_unique: Adds a node to find and remove duplicate rows.
add_unpivot: Adds an unpivot node to the graph.
apply_layout: Calculates and applies a layered layout to all nodes in the graph.
cancel: Cancels an ongoing graph execution.
close_flow: Performs cleanup operations, such as clearing node caches.
copy_node: Creates a copy of an existing node.
delete_node: Deletes a node from the graph and updates all its connections.
generate_code: Generates code for the flow graph.
get_frontend_data: Formats the graph structure into a JSON-like dictionary for a specific legacy frontend.
get_implicit_starter_nodes: Finds nodes that can act as starting points but are not explicitly defined as such.
get_node: Retrieves a node from the graph by its ID.
get_node_data: Retrieves all data needed to render a node in the UI.
get_node_storage: Serializes the entire graph's state into a storable format.
get_nodes_overview: Gets a list of dictionary representations for all nodes in the graph.
get_run_info: Gets a summary of the most recent graph execution.
get_vue_flow_input: Formats the graph's nodes and edges into a schema suitable for the VueFlow frontend.
print_tree: Prints the flow graph as a tree.
remove_from_output_cols: Removes specified columns from the list of expected output columns.
reset: Forces a deep reset on all nodes in the graph.
run_graph: Executes the entire data flow graph from start to finish.
save_flow: Saves the current state of the flow graph to a file.

Attributes:

execution_location (ExecutionLocationsLiteral): Gets the current execution location.
execution_mode (ExecutionModeLiteral): Gets the current execution mode ('Development' or 'Performance').
flow_id (int): Gets the unique identifier of the flow.
graph_has_functions (bool): Checks if the graph has any nodes.
graph_has_input_data (bool): Checks if the graph has an initial input data source.
node_connections (List[Tuple[int, int]]): Computes and returns a list of all connections in the graph.
nodes (List[FlowNode]): Gets a list of all FlowNode objects in the graph.
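Execution and inspection go through run_graph and the properties listed above. The following hedged sketch continues the earlier example; the method and attribute names come from the tables above, while the exact contents of the returned values are not shown on this page and are assumptions.

# Hedged sketch: names are taken from the tables above; the printed structures
# (run info fields, connection tuple meaning) are illustrative assumptions.
graph.run_graph()                 # execute the entire data flow graph
print(graph.get_run_info())       # summary of the most recent execution
print(graph.node_connections)     # list of (node_id, node_id) tuples, one per edge (interpretation assumed)
for node in graph.nodes:          # FlowNode objects currently registered in the graph
    print(node.node_id)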

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
class FlowGraph:
    """A class representing a Directed Acyclic Graph (DAG) for data processing pipelines.

    It manages nodes, connections, and the execution of the entire flow.
    """
    uuid: str
    depends_on: Dict[int, Union[ParquetFile, FlowDataEngine, "FlowGraph", pl.DataFrame,]]
    _flow_id: int
    _input_data: Union[ParquetFile, FlowDataEngine, "FlowGraph"]
    _input_cols: List[str]
    _output_cols: List[str]
    _node_db: Dict[Union[str, int], FlowNode]
    _node_ids: List[Union[str, int]]
    _results: Optional[FlowDataEngine] = None
    cache_results: bool = False
    schema: Optional[List[FlowfileColumn]] = None
    has_over_row_function: bool = False
    _flow_starts: List[Union[int, str]] = None
    node_results: List[NodeResult] = None
    latest_run_info: Optional[RunInformation] = None
    start_datetime: datetime = None
    end_datetime: datetime = None
    nodes_completed: int = 0
    flow_settings: schemas.FlowSettings = None
    flow_logger: FlowLogger

    def __init__(self,
                 flow_settings: schemas.FlowSettings | schemas.FlowGraphConfig,
                 name: str = None, input_cols: List[str] = None,
                 output_cols: List[str] = None,
                 path_ref: str = None,
                 input_flow: Union[ParquetFile, FlowDataEngine, "FlowGraph"] = None,
                 cache_results: bool = False):
        """Initializes a new FlowGraph instance.

        Args:
            flow_settings: The configuration settings for the flow.
            name: The name of the flow.
            input_cols: A list of input column names.
            output_cols: A list of output column names.
            path_ref: An optional path to an initial data source.
            input_flow: An optional existing data object to start the flow with.
            cache_results: A global flag to enable or disable result caching.
        """
        if isinstance(flow_settings, schemas.FlowGraphConfig):
            flow_settings = schemas.FlowSettings.from_flow_settings_input(flow_settings)

        self.flow_settings = flow_settings
        self.uuid = str(uuid1())
        self.nodes_completed = 0
        self.start_datetime = None
        self.end_datetime = None
        self.latest_run_info = None
        self.node_results = []
        self._flow_id = flow_settings.flow_id
        self.flow_logger = FlowLogger(flow_settings.flow_id)
        self._flow_starts: List[FlowNode] = []
        self._results = None
        self.schema = None
        self.has_over_row_function = False
        self._input_cols = [] if input_cols is None else input_cols
        self._output_cols = [] if output_cols is None else output_cols
        self._node_ids = []
        self._node_db = {}
        self.cache_results = cache_results
        self.__name__ = name if name else id(self)
        self.depends_on = {}
        if path_ref is not None:
            self.add_datasource(input_schema.NodeDatasource(file_path=path_ref))
        elif input_flow is not None:
            self.add_datasource(input_file=input_flow)

    def add_node_promise(self, node_promise: input_schema.NodePromise):
        """Adds a placeholder node to the graph that is not yet fully configured.

        Useful for building the graph structure before all settings are available.

        Args:
            node_promise: A promise object containing basic node information.
        """
        def placeholder(n: FlowNode = None):
            if n is None:
                return FlowDataEngine()
            return n

        self.add_node_step(node_id=node_promise.node_id, node_type=node_promise.node_type, function=placeholder,
                           setting_input=node_promise)

    def print_tree(self, show_schema=False, show_descriptions=False):
        """
        Print flow_graph as a tree.
        """
        max_node_id = max(self._node_db.keys())

        tree = ""
        tabs = 0
        tab_counter = 0
        for node in self.nodes:
            tab_counter += 1
            node_input = node.setting_input
            operation = str(self._node_db[node_input.node_id]).split("(")[1][:-1].replace("_", " ").title()

            if operation == "Formula":
                operation = "With Columns"

            tree += str(operation) + " (id=" + str(node_input.node_id) + ")"

            if show_descriptions & show_schema:
                raise ValueError('show_descriptions and show_schema cannot be True simultaneously')
            if show_descriptions:
                tree += ": " + str(node_input.description)
            elif show_schema:
                tree += " -> ["
                if operation == "Manual Input":
                    schema = ", ".join([str(i.name) + ": " + str(i.data_type) for i in node_input.raw_data_format.columns])
                    tree += schema
                elif operation == "With Columns":
                    tree_with_col_schema = ", " + node_input.function.field.name + ": " + node_input.function.field.data_type
                    tree += schema + tree_with_col_schema
                elif operation == "Filter":
                    index = node_input.filter_input.advanced_filter.find("]")
                    filtered_column = str(node_input.filter_input.advanced_filter[1:index])
                    schema = re.sub(rf'({re.escape(filtered_column)}: [A-Za-z0-9]+,\s)', "", schema)
                    tree += schema
                elif operation == "Group By":
                    for col in node_input.groupby_input.agg_cols:
                        schema = re.sub(re.escape(str(col.old_name)) + r': [a-z0-9]+, ', "", schema)
                    tree += schema
                tree += "]"
            else:
                if operation == "Manual Input":
                    tree += ": " + str(node_input.raw_data_format.data)
                elif operation == "With Columns":
                    tree += ": " + str(node_input.function)
                elif operation == "Filter":
                    tree += ": " + str(node_input.filter_input.advanced_filter)
                elif operation == "Group By":
                    tree += ": groupby=[" + ", ".join([col.old_name for col in node_input.groupby_input.agg_cols if col.agg == "groupby"]) + "], "
                    tree += "agg=[" + ", ".join([str(col.agg) + "(" + str(col.old_name) + ")" for col in node_input.groupby_input.agg_cols if col.agg != "groupby"]) + "]"

            if node_input.node_id < max_node_id:
                tree += "\n" + "# " + " "*3*(tabs-1) + "|___ "
            print("\n"*2)

        return print(tree)

    def apply_layout(self, y_spacing: int = 150, x_spacing: int = 200, initial_y: int = 100):
        """Calculates and applies a layered layout to all nodes in the graph.

        This updates their x and y positions for UI rendering.

        Args:
            y_spacing: The vertical spacing between layers.
            x_spacing: The horizontal spacing between nodes in the same layer.
            initial_y: The initial y-position for the first layer.
        """
        self.flow_logger.info("Applying layered layout...")
        start_time = time()
        try:
            # Calculate new positions for all nodes
            new_positions = calculate_layered_layout(
                self, y_spacing=y_spacing, x_spacing=x_spacing, initial_y=initial_y
            )

            if not new_positions:
                self.flow_logger.warning("Layout calculation returned no positions.")
                return

            # Apply the new positions to the setting_input of each node
            updated_count = 0
            for node_id, (pos_x, pos_y) in new_positions.items():
                node = self.get_node(node_id)
                if node and hasattr(node, 'setting_input'):
                    setting = node.setting_input
                    if hasattr(setting, 'pos_x') and hasattr(setting, 'pos_y'):
                        setting.pos_x = pos_x
                        setting.pos_y = pos_y
                        updated_count += 1
                    else:
                        self.flow_logger.warning(f"Node {node_id} setting_input ({type(setting)}) lacks pos_x/pos_y attributes.")
                elif node:
                    self.flow_logger.warning(f"Node {node_id} lacks setting_input attribute.")
                # else: Node not found, already warned by calculate_layered_layout

            end_time = time()
            self.flow_logger.info(f"Layout applied to {updated_count}/{len(self.nodes)} nodes in {end_time - start_time:.2f} seconds.")

        except Exception as e:
            self.flow_logger.error(f"Error applying layout: {e}")
            raise  # Optional: re-raise the exception

    @property
    def flow_id(self) -> int:
        """Gets the unique identifier of the flow."""
        return self._flow_id

    @flow_id.setter
    def flow_id(self, new_id: int):
        """Sets the unique identifier for the flow and updates all child nodes.

        Args:
            new_id: The new flow ID.
        """
        self._flow_id = new_id
        for node in self.nodes:
            if hasattr(node.setting_input, 'flow_id'):
                node.setting_input.flow_id = new_id
        self.flow_settings.flow_id = new_id

    def __repr__(self):
        """Provides the official string representation of the FlowGraph instance."""
        settings_str = "  -" + '\n  -'.join(f"{k}: {v}" for k, v in self.flow_settings)
        return f"FlowGraph(\nNodes: {self._node_db}\n\nSettings:\n{settings_str}"

    def get_nodes_overview(self):
        """Gets a list of dictionary representations for all nodes in the graph."""
        output = []
        for v in self._node_db.values():
            output.append(v.get_repr())
        return output

    def remove_from_output_cols(self, columns: List[str]):
        """Removes specified columns from the list of expected output columns.

        Args:
            columns: A list of column names to remove.
        """
        cols = set(columns)
        self._output_cols = [c for c in self._output_cols if c not in cols]

    def get_node(self, node_id: Union[int, str] = None) -> FlowNode | None:
        """Retrieves a node from the graph by its ID.

        Args:
            node_id: The ID of the node to retrieve. If None, retrieves the last added node.

        Returns:
            The FlowNode object, or None if not found.
        """
        if node_id is None:
            node_id = self._node_ids[-1]
        node = self._node_db.get(node_id)
        if node is not None:
            return node

    def add_pivot(self, pivot_settings: input_schema.NodePivot):
        """Adds a pivot node to the graph.

        Args:
            pivot_settings: The settings for the pivot operation.
        """

        def _func(fl: FlowDataEngine):
            return fl.do_pivot(pivot_settings.pivot_input, self.flow_logger.get_node_logger(pivot_settings.node_id))

        self.add_node_step(node_id=pivot_settings.node_id,
                           function=_func,
                           node_type='pivot',
                           setting_input=pivot_settings,
                           input_node_ids=[pivot_settings.depending_on_id])

        node = self.get_node(pivot_settings.node_id)

        def schema_callback():
            input_data = node.singular_main_input.get_resulting_data()  # get from the previous step the data
            input_data.lazy = True  # ensure the dataset is lazy
            input_lf = input_data.data_frame  # get the lazy frame
            return pre_calculate_pivot_schema(input_data.schema, pivot_settings.pivot_input, input_lf=input_lf)
        node.schema_callback = schema_callback

    def add_unpivot(self, unpivot_settings: input_schema.NodeUnpivot):
        """Adds an unpivot node to the graph.

        Args:
            unpivot_settings: The settings for the unpivot operation.
        """

        def _func(fl: FlowDataEngine) -> FlowDataEngine:
            return fl.unpivot(unpivot_settings.unpivot_input)

        self.add_node_step(node_id=unpivot_settings.node_id,
                           function=_func,
                           node_type='unpivot',
                           setting_input=unpivot_settings,
                           input_node_ids=[unpivot_settings.depending_on_id])

    def add_union(self, union_settings: input_schema.NodeUnion):
        """Adds a union node to combine multiple data streams.

        Args:
            union_settings: The settings for the union operation.
        """

        def _func(*flowfile_tables: FlowDataEngine):
            dfs: List[pl.LazyFrame] | List[pl.DataFrame] = [flt.data_frame for flt in flowfile_tables]
            return FlowDataEngine(pl.concat(dfs, how='diagonal_relaxed'))

        self.add_node_step(node_id=union_settings.node_id,
                           function=_func,
                           node_type=f'union',
                           setting_input=union_settings,
                           input_node_ids=union_settings.depending_on_ids)

    def add_initial_node_analysis(self, node_promise: input_schema.NodePromise):
        """Adds a data exploration/analysis node based on a node promise.

        Args:
            node_promise: The promise representing the node to be analyzed.
        """
        node_analysis = create_graphic_walker_node_from_node_promise(node_promise)
        self.add_explore_data(node_analysis)

    def add_explore_data(self, node_analysis: input_schema.NodeExploreData):
        """Adds a specialized node for data exploration and visualization.

        Args:
            node_analysis: The settings for the data exploration node.
        """
        sample_size: int = 10000

        def analysis_preparation(flowfile_table: FlowDataEngine):
            if flowfile_table.number_of_records <= 0:
                number_of_records = flowfile_table.get_number_of_records(calculate_in_worker_process=True)
            else:
                number_of_records = flowfile_table.number_of_records
            if number_of_records > sample_size:
                flowfile_table = flowfile_table.get_sample(sample_size, random=True)
            external_sampler = ExternalDfFetcher(
                lf=flowfile_table.data_frame,
                file_ref="__gf_walker"+node.hash,
                wait_on_completion=True,
                node_id=node.node_id,
                flow_id=self.flow_id,
            )
            node.results.analysis_data_generator = get_read_top_n(external_sampler.status.file_ref,
                                                                  n=min(sample_size, number_of_records))
            return flowfile_table

        def schema_callback():
            node = self.get_node(node_analysis.node_id)
            if len(node.all_inputs) == 1:
                input_node = node.all_inputs[0]
                return input_node.schema
            else:
                return [FlowfileColumn.from_input('col_1', 'na')]

        self.add_node_step(node_id=node_analysis.node_id, node_type='explore_data',
                           function=analysis_preparation,
                           setting_input=node_analysis, schema_callback=schema_callback)
        node = self.get_node(node_analysis.node_id)

    def add_group_by(self, group_by_settings: input_schema.NodeGroupBy):
        """Adds a group-by aggregation node to the graph.

        Args:
            group_by_settings: The settings for the group-by operation.
        """

        def _func(fl: FlowDataEngine) -> FlowDataEngine:
            return fl.do_group_by(group_by_settings.groupby_input, False)

        self.add_node_step(node_id=group_by_settings.node_id,
                           function=_func,
                           node_type=f'group_by',
                           setting_input=group_by_settings,
                           input_node_ids=[group_by_settings.depending_on_id])

        node = self.get_node(group_by_settings.node_id)

        def schema_callback():

            output_columns = [(c.old_name, c.new_name, c.output_type) for c in group_by_settings.groupby_input.agg_cols]
            depends_on = node.node_inputs.main_inputs[0]
            input_schema_dict: Dict[str, str] = {s.name: s.data_type for s in depends_on.schema}
            output_schema = []
            for old_name, new_name, data_type in output_columns:
                data_type = input_schema_dict[old_name] if data_type is None else data_type
                output_schema.append(FlowfileColumn.from_input(data_type=data_type, column_name=new_name))
            return output_schema

        node.schema_callback = schema_callback

    def add_filter(self, filter_settings: input_schema.NodeFilter):
        """Adds a filter node to the graph.

        Args:
            filter_settings: The settings for the filter operation.
        """

        is_advanced = filter_settings.filter_input.filter_type == 'advanced'
        if is_advanced:
            predicate = filter_settings.filter_input.advanced_filter
        else:
            _basic_filter = filter_settings.filter_input.basic_filter
            filter_settings.filter_input.advanced_filter = (f'[{_basic_filter.field}]{_basic_filter.filter_type}"'
                                                            f'{_basic_filter.filter_value}"')

        def _func(fl: FlowDataEngine):
            is_advanced = filter_settings.filter_input.filter_type == 'advanced'
            if is_advanced:
                return fl.do_filter(predicate)
            else:
                basic_filter = filter_settings.filter_input.basic_filter
                if basic_filter.filter_value.isnumeric():
                    field_data_type = fl.get_schema_column(basic_filter.field).generic_datatype()
                    if field_data_type == 'str':
                        _f = f'[{basic_filter.field}]{basic_filter.filter_type}"{basic_filter.filter_value}"'
                    else:
                        _f = f'[{basic_filter.field}]{basic_filter.filter_type}{basic_filter.filter_value}'
                else:
                    _f = f'[{basic_filter.field}]{basic_filter.filter_type}"{basic_filter.filter_value}"'
                filter_settings.filter_input.advanced_filter = _f
                return fl.do_filter(_f)

        self.add_node_step(filter_settings.node_id, _func,
                           node_type='filter',
                           renew_schema=False,
                           setting_input=filter_settings,
                           input_node_ids=[filter_settings.depending_on_id]
                           )

    def add_record_count(self, node_number_of_records: input_schema.NodeRecordCount):
        """Adds a filter node to the graph.

        Args:
            node_number_of_records: The settings for the record count operation.
        """

        def _func(fl: FlowDataEngine) -> FlowDataEngine:
            return fl.get_record_count()

        self.add_node_step(node_id=node_number_of_records.node_id,
                           function=_func,
                           node_type='record_count',
                           setting_input=node_number_of_records,
                           input_node_ids=[node_number_of_records.depending_on_id])

    def add_polars_code(self, node_polars_code: input_schema.NodePolarsCode):
        """Adds a node that executes custom Polars code.

        Args:
            node_polars_code: The settings for the Polars code node.
        """

        def _func(*flowfile_tables: FlowDataEngine) -> FlowDataEngine:
            return execute_polars_code(*flowfile_tables, code=node_polars_code.polars_code_input.polars_code)
        self.add_node_step(node_id=node_polars_code.node_id,
                           function=_func,
                           node_type='polars_code',
                           setting_input=node_polars_code,
                           input_node_ids=node_polars_code.depending_on_ids)

        try:
            polars_code_parser.validate_code(node_polars_code.polars_code_input.polars_code)
        except Exception as e:
            node = self.get_node(node_id=node_polars_code.node_id)
            node.results.errors = str(e)

    def add_dependency_on_polars_lazy_frame(self,
                                            lazy_frame: pl.LazyFrame,
                                            node_id: int):
        """Adds a special node that directly injects a Polars LazyFrame into the graph.

        Note: This is intended for backend use and will not work in the UI editor.

        Args:
            lazy_frame: The Polars LazyFrame to inject.
            node_id: The ID for the new node.
        """
        def _func():
            return FlowDataEngine(lazy_frame)
        node_promise = input_schema.NodePromise(flow_id=self.flow_id,
                                                node_id=node_id, node_type="polars_lazy_frame",
                                                is_setup=True)
        self.add_node_step(node_id=node_promise.node_id, node_type=node_promise.node_type, function=_func,
                           setting_input=node_promise)

    def add_unique(self, unique_settings: input_schema.NodeUnique):
        """Adds a node to find and remove duplicate rows.

        Args:
            unique_settings: The settings for the unique operation.
        """

        def _func(fl: FlowDataEngine) -> FlowDataEngine:
            return fl.make_unique(unique_settings.unique_input)

        self.add_node_step(node_id=unique_settings.node_id,
                           function=_func,
                           input_columns=[],
                           node_type='unique',
                           setting_input=unique_settings,
                           input_node_ids=[unique_settings.depending_on_id])

    def add_graph_solver(self, graph_solver_settings: input_schema.NodeGraphSolver):
        """Adds a node that solves graph-like problems within the data.

        This node can be used for operations like finding network paths,
        calculating connected components, or performing other graph algorithms
        on relational data that represents nodes and edges.

        Args:
            graph_solver_settings: The settings object defining the graph inputs
                and the specific algorithm to apply.
        """
        def _func(fl: FlowDataEngine) -> FlowDataEngine:
            return fl.solve_graph(graph_solver_settings.graph_solver_input)

        self.add_node_step(node_id=graph_solver_settings.node_id,
                           function=_func,
                           node_type='graph_solver',
                           setting_input=graph_solver_settings,
                           input_node_ids=[graph_solver_settings.depending_on_id])

    def add_formula(self, function_settings: input_schema.NodeFormula):
        """Adds a node that applies a formula to create or modify a column.

        Args:
            function_settings: The settings for the formula operation.
        """

        error = ""
        if function_settings.function.field.data_type not in (None, "Auto"):
            output_type = cast_str_to_polars_type(function_settings.function.field.data_type)
        else:
            output_type = None
        if output_type not in (None, "Auto"):
            new_col = [FlowfileColumn.from_input(column_name=function_settings.function.field.name,
                                                 data_type=str(output_type))]
        else:
            new_col = [FlowfileColumn.from_input(function_settings.function.field.name, 'String')]

        def _func(fl: FlowDataEngine):
            return fl.apply_sql_formula(func=function_settings.function.function,
                                        col_name=function_settings.function.field.name,
                                        output_data_type=output_type)

        self.add_node_step(function_settings.node_id, _func,
                           output_schema=new_col,
                           node_type='formula',
                           renew_schema=False,
                           setting_input=function_settings,
                           input_node_ids=[function_settings.depending_on_id]
                           )
        if error != "":
            node = self.get_node(function_settings.node_id)
            node.results.errors = error
            return False, error
        else:
            return True, ""

    def add_cross_join(self, cross_join_settings: input_schema.NodeCrossJoin) -> "FlowGraph":
        """Adds a cross join node to the graph.

        Args:
            cross_join_settings: The settings for the cross join operation.

        Returns:
            The `FlowGraph` instance for method chaining.
        """

        def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
            for left_select in cross_join_settings.cross_join_input.left_select.renames:
                left_select.is_available = True if left_select.old_name in main.schema else False
            for right_select in cross_join_settings.cross_join_input.right_select.renames:
                right_select.is_available = True if right_select.old_name in right.schema else False

            return main.do_cross_join(cross_join_input=cross_join_settings.cross_join_input,
                                      auto_generate_selection=cross_join_settings.auto_generate_selection,
                                      verify_integrity=False,
                                      other=right)

        self.add_node_step(node_id=cross_join_settings.node_id,
                           function=_func,
                           input_columns=[],
                           node_type='cross_join',
                           setting_input=cross_join_settings,
                           input_node_ids=cross_join_settings.depending_on_ids)
        return self

    def add_join(self, join_settings: input_schema.NodeJoin) -> "FlowGraph":
        """Adds a join node to combine two data streams based on key columns.

        Args:
            join_settings: The settings for the join operation.

        Returns:
            The `FlowGraph` instance for method chaining.
        """

        def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
            for left_select in join_settings.join_input.left_select.renames:
                left_select.is_available = True if left_select.old_name in main.schema else False
            for right_select in join_settings.join_input.right_select.renames:
                right_select.is_available = True if right_select.old_name in right.schema else False

            return main.join(join_input=join_settings.join_input,
                             auto_generate_selection=join_settings.auto_generate_selection,
                             verify_integrity=False,
                             other=right)

        self.add_node_step(node_id=join_settings.node_id,
                           function=_func,
                           input_columns=[],
                           node_type='join',
                           setting_input=join_settings,
                           input_node_ids=join_settings.depending_on_ids)
        return self

    def add_fuzzy_match(self, fuzzy_settings: input_schema.NodeFuzzyMatch) -> "FlowGraph":
        """Adds a fuzzy matching node to join data on approximate string matches.

        Args:
            fuzzy_settings: The settings for the fuzzy match operation.

        Returns:
            The `FlowGraph` instance for method chaining.
        """

        def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
            f = main.start_fuzzy_join(fuzzy_match_input=fuzzy_settings.join_input, other=right, file_ref=node.hash,
                                      flow_id=self.flow_id, node_id=fuzzy_settings.node_id)
            logger.info("Started the fuzzy match action")
            node._fetch_cached_df = f
            return FlowDataEngine(f.get_result())

        self.add_node_step(node_id=fuzzy_settings.node_id,
                           function=_func,
                           input_columns=[],
                           node_type='fuzzy_match',
                           setting_input=fuzzy_settings)
        node = self.get_node(node_id=fuzzy_settings.node_id)

        def schema_callback():
            return calculate_fuzzy_match_schema(fuzzy_settings.join_input,
                                                left_schema=node.node_inputs.main_inputs[0].schema,
                                                right_schema=node.node_inputs.right_input.schema
                                                )

        node.schema_callback = schema_callback
        return self

    def add_text_to_rows(self, node_text_to_rows: input_schema.NodeTextToRows) -> "FlowGraph":
        """Adds a node that splits cell values into multiple rows.

        This is useful for un-nesting data where a single field contains multiple
        values separated by a delimiter.

        Args:
            node_text_to_rows: The settings object that specifies the column to split
                and the delimiter to use.

        Returns:
            The `FlowGraph` instance for method chaining.
        """
        def _func(table: FlowDataEngine) -> FlowDataEngine:
            return table.split(node_text_to_rows.text_to_rows_input)

        self.add_node_step(node_id=node_text_to_rows.node_id,
                           function=_func,
                           node_type='text_to_rows',
                           setting_input=node_text_to_rows,
                           input_node_ids=[node_text_to_rows.depending_on_id])
        return self

    def add_sort(self, sort_settings: input_schema.NodeSort) -> "FlowGraph":
        """Adds a node to sort the data based on one or more columns.

        Args:
            sort_settings: The settings for the sort operation.

        Returns:
            The `FlowGraph` instance for method chaining.
        """

        def _func(table: FlowDataEngine) -> FlowDataEngine:
            return table.do_sort(sort_settings.sort_input)

        self.add_node_step(node_id=sort_settings.node_id,
                           function=_func,
                           node_type='sort',
                           setting_input=sort_settings,
                           input_node_ids=[sort_settings.depending_on_id])
        return self

    def add_sample(self, sample_settings: input_schema.NodeSample) -> "FlowGraph":
        """Adds a node to take a random or top-N sample of the data.

        Args:
            sample_settings: The settings object specifying the size of the sample.

        Returns:
            The `FlowGraph` instance for method chaining.
        """
        def _func(table: FlowDataEngine) -> FlowDataEngine:
            return table.get_sample(sample_settings.sample_size)

        self.add_node_step(node_id=sample_settings.node_id,
                           function=_func,
                           node_type='sample',
                           setting_input=sample_settings,
                           input_node_ids=[sample_settings.depending_on_id]
                           )
        return self

    def add_record_id(self, record_id_settings: input_schema.NodeRecordId) -> "FlowGraph":
        """Adds a node to create a new column with a unique ID for each record.

        Args:
            record_id_settings: The settings object specifying the name of the
                new record ID column.

        Returns:
            The `FlowGraph` instance for method chaining.
        """

        def _func(table: FlowDataEngine) -> FlowDataEngine:
            return table.add_record_id(record_id_settings.record_id_input)

        self.add_node_step(node_id=record_id_settings.node_id,
                           function=_func,
                           node_type='record_id',
                           setting_input=record_id_settings,
                           input_node_ids=[record_id_settings.depending_on_id]
                           )
        return self

    def add_select(self, select_settings: input_schema.NodeSelect) -> "FlowGraph":
        """Adds a node to select, rename, reorder, or drop columns.

        Args:
            select_settings: The settings for the select operation.

        Returns:
            The `FlowGraph` instance for method chaining.
        """

        select_cols = select_settings.select_input
        drop_cols = tuple(s.old_name for s in select_settings.select_input)

        def _func(table: FlowDataEngine) -> FlowDataEngine:
            input_cols = set(f.name for f in table.schema)
            ids_to_remove = []
            for i, select_col in enumerate(select_cols):
                if select_col.data_type is None:
                    select_col.data_type = table.get_schema_column(select_col.old_name).data_type
                if select_col.old_name not in input_cols:
                    select_col.is_available = False
                    if not select_col.keep:
                        ids_to_remove.append(i)
                else:
                    select_col.is_available = True
            ids_to_remove.reverse()
            for i in ids_to_remove:
                v = select_cols.pop(i)
                del v
            return table.do_select(select_inputs=transform_schema.SelectInputs(select_cols),
                                   keep_missing=select_settings.keep_missing)

        self.add_node_step(node_id=select_settings.node_id,
                           function=_func,
                           input_columns=[],
                           node_type='select',
                           drop_columns=list(drop_cols),
                           setting_input=select_settings,
                           input_node_ids=[select_settings.depending_on_id])
        return self

    @property
    def graph_has_functions(self) -> bool:
        """Checks if the graph has any nodes."""
        return len(self._node_ids) > 0

    def delete_node(self, node_id: Union[int, str]):
        """Deletes a node from the graph and updates all its connections.

        Args:
            node_id: The ID of the node to delete.

        Raises:
            Exception: If the node with the given ID does not exist.
        """
        logger.info(f"Starting deletion of node with ID: {node_id}")

        node = self._node_db.get(node_id)
        if node:
            logger.info(f"Found node: {node_id}, processing deletion")

            lead_to_steps: List[FlowNode] = node.leads_to_nodes
            logger.debug(f"Node {node_id} leads to {len(lead_to_steps)} other nodes")

            if len(lead_to_steps) > 0:
                for lead_to_step in lead_to_steps:
                    logger.debug(f"Deleting input node {node_id} from dependent node {lead_to_step}")
                    lead_to_step.delete_input_node(node_id, complete=True)

            if not node.is_start:
                depends_on: List[FlowNode] = node.node_inputs.get_all_inputs()
                logger.debug(f"Node {node_id} depends on {len(depends_on)} other nodes")

                for depend_on in depends_on:
                    logger.debug(f"Removing lead_to reference {node_id} from node {depend_on}")
                    depend_on.delete_lead_to_node(node_id)

            self._node_db.pop(node_id)
            logger.debug(f"Successfully removed node {node_id} from node_db")
            del node
            logger.info("Node object deleted")
        else:
            logger.error(f"Failed to find node with id {node_id}")
            raise Exception(f"Node with id {node_id} does not exist")

    @property
    def graph_has_input_data(self) -> bool:
        """Checks if the graph has an initial input data source."""
        return self._input_data is not None

    def add_node_step(self,
                      node_id: Union[int, str],
                      function: Callable,
                      input_columns: List[str] = None,
                      output_schema: List[FlowfileColumn] = None,
                      node_type: str = None,
                      drop_columns: List[str] = None,
                      renew_schema: bool = True,
                      setting_input: Any = None,
                      cache_results: bool = None,
                      schema_callback: Callable = None,
                      input_node_ids: List[int] = None) -> FlowNode:
        """The core method for adding or updating a node in the graph.

        Args:
            node_id: The unique ID for the node.
            function: The core processing function for the node.
            input_columns: A list of input column names required by the function.
            output_schema: A predefined schema for the node's output.
            node_type: A string identifying the type of node (e.g., 'filter', 'join').
            drop_columns: A list of columns to be dropped after the function executes.
            renew_schema: If True, the schema is recalculated after execution.
            setting_input: A configuration object containing settings for the node.
            cache_results: If True, the node's results are cached for future runs.
            schema_callback: A function that dynamically calculates the output schema.
            input_node_ids: A list of IDs for the nodes that this node depends on.

        Returns:
            The created or updated FlowNode object.
        """
        existing_node = self.get_node(node_id)
        if existing_node is not None:
            if existing_node.node_type != node_type:
                self.delete_node(existing_node.node_id)
                existing_node = None
        if existing_node:
            input_nodes = existing_node.all_inputs
        elif input_node_ids is not None:
            input_nodes = [self.get_node(node_id) for node_id in input_node_ids]
        else:
            input_nodes = None
        if isinstance(input_columns, str):
            input_columns = [input_columns]
        if (
                input_nodes is not None or
                function.__name__ in ('placeholder', 'analysis_preparation') or
                node_type in ("cloud_storage_reader", "polars_lazy_frame", "input_data")
        ):
            if not existing_node:
                node = FlowNode(node_id=node_id,
                                function=function,
                                output_schema=output_schema,
                                input_columns=input_columns,
                                drop_columns=drop_columns,
                                renew_schema=renew_schema,
                                setting_input=setting_input,
                                node_type=node_type,
                                name=function.__name__,
                                schema_callback=schema_callback,
                                parent_uuid=self.uuid)
            else:
                existing_node.update_node(function=function,
                                          output_schema=output_schema,
                                          input_columns=input_columns,
                                          drop_columns=drop_columns,
                                          setting_input=setting_input,
                                          schema_callback=schema_callback)
                node = existing_node
        else:
            raise Exception("No data initialized")
        self._node_db[node_id] = node
        self._node_ids.append(node_id)
        return node

    def add_include_cols(self, include_columns: List[str]):
        """Adds columns to both the input and output column lists.

        Args:
            include_columns: A list of column names to include.
        """
        for column in include_columns:
            if column not in self._input_cols:
                self._input_cols.append(column)
            if column not in self._output_cols:
                self._output_cols.append(column)
        return self

    def add_output(self, output_file: input_schema.NodeOutput):
        """Adds an output node to write the final data to a destination.

        Args:
            output_file: The settings for the output file.
        """

        def _func(df: FlowDataEngine):
            output_file.output_settings.populate_abs_file_path()
            execute_remote = self.execution_location != 'local'
            df.output(output_fs=output_file.output_settings, flow_id=self.flow_id, node_id=output_file.node_id,
                      execute_remote=execute_remote)
            return df

        def schema_callback():
            input_node: FlowNode = self.get_node(output_file.node_id).node_inputs.main_inputs[0]

            return input_node.schema
        input_node_id = getattr(output_file, "depending_on_id") if hasattr(output_file, 'depending_on_id') else None
        self.add_node_step(node_id=output_file.node_id,
                           function=_func,
                           input_columns=[],
                           node_type='output',
                           setting_input=output_file,
                           schema_callback=schema_callback,
                           input_node_ids=[input_node_id])

    def add_database_writer(self, node_database_writer: input_schema.NodeDatabaseWriter):
        """Adds a node to write data to a database.

        Args:
            node_database_writer: The settings for the database writer node.
        """

        node_type = 'database_writer'
        database_settings: input_schema.DatabaseWriteSettings = node_database_writer.database_write_settings
        database_connection: Optional[input_schema.DatabaseConnection | input_schema.FullDatabaseConnection]
        if database_settings.connection_mode == 'inline':
            database_connection: input_schema.DatabaseConnection = database_settings.database_connection
            encrypted_password = get_encrypted_secret(current_user_id=node_database_writer.user_id,
                                                      secret_name=database_connection.password_ref)
            if encrypted_password is None:
                raise HTTPException(status_code=400, detail="Password not found")
        else:
            database_reference_settings = get_local_database_connection(database_settings.database_connection_name,
                                                                        node_database_writer.user_id)
            encrypted_password = database_reference_settings.password.get_secret_value()

        def _func(df: FlowDataEngine):
            df.lazy = True
            database_external_write_settings = (
                sql_models.DatabaseExternalWriteSettings.create_from_from_node_database_writer(
                    node_database_writer=node_database_writer,
                    password=encrypted_password,
                    table_name=(database_settings.schema_name+'.'+database_settings.table_name
                                if database_settings.schema_name else database_settings.table_name),
                    database_reference_settings=(database_reference_settings if database_settings.connection_mode == 'reference'
                                                 else None),
                    lf=df.data_frame
                )
            )
            external_database_writer = ExternalDatabaseWriter(database_external_write_settings, wait_on_completion=False)
            node._fetch_cached_df = external_database_writer
            external_database_writer.get_result()
            return df

        def schema_callback():
            input_node: FlowNode = self.get_node(node_database_writer.node_id).node_inputs.main_inputs[0]
            return input_node.schema

        self.add_node_step(
            node_id=node_database_writer.node_id,
            function=_func,
            input_columns=[],
            node_type=node_type,
            setting_input=node_database_writer,
            schema_callback=schema_callback,
        )
        node = self.get_node(node_database_writer.node_id)

    def add_database_reader(self, node_database_reader: input_schema.NodeDatabaseReader):
        """Adds a node to read data from a database.

        Args:
            node_database_reader: The settings for the database reader node.
        """

        logger.info("Adding database reader")
        node_type = 'database_reader'
        database_settings: input_schema.DatabaseSettings = node_database_reader.database_settings
        database_connection: Optional[input_schema.DatabaseConnection | input_schema.FullDatabaseConnection]
        if database_settings.connection_mode == 'inline':
            database_connection: input_schema.DatabaseConnection = database_settings.database_connection
            encrypted_password = get_encrypted_secret(current_user_id=node_database_reader.user_id,
                                                      secret_name=database_connection.password_ref)
            if encrypted_password is None:
                raise HTTPException(status_code=400, detail="Password not found")
        else:
            database_reference_settings = get_local_database_connection(database_settings.database_connection_name,
                                                                        node_database_reader.user_id)
            database_connection = database_reference_settings
            encrypted_password = database_reference_settings.password.get_secret_value()

        def _func():
            sql_source = BaseSqlSource(query=None if database_settings.query_mode == 'table' else database_settings.query,
                                       table_name=database_settings.table_name,
                                       schema_name=database_settings.schema_name,
                                       fields=node_database_reader.fields,
                                       )
            database_external_read_settings = (
                sql_models.DatabaseExternalReadSettings.create_from_from_node_database_reader(
                    node_database_reader=node_database_reader,
                    password=encrypted_password,
                    query=sql_source.query,
                    database_reference_settings=(database_reference_settings if database_settings.connection_mode == 'reference'
                                                 else None),
                )
            )

            external_database_fetcher = ExternalDatabaseFetcher(database_external_read_settings, wait_on_completion=False)
            node._fetch_cached_df = external_database_fetcher
            fl = FlowDataEngine(external_database_fetcher.get_result())
            node_database_reader.fields = [c.get_minimal_field_info() for c in fl.schema]
            return fl

        def schema_callback():
            sql_source = SqlSource(connection_string=
                                   sql_utils.construct_sql_uri(database_type=database_connection.database_type,
                                                               host=database_connection.host,
                                                               port=database_connection.port,
                                                               database=database_connection.database,
                                                               username=database_connection.username,
                                                               password=decrypt_secret(encrypted_password)),
                                   query=None if database_settings.query_mode == 'table' else database_settings.query,
                                   table_name=database_settings.table_name,
                                   schema_name=database_settings.schema_name,
                                   fields=node_database_reader.fields,
                                   )
            return sql_source.get_schema()

        node = self.get_node(node_database_reader.node_id)
        if node:
            node.node_type = node_type
            node.name = node_type
            node.function = _func
            node.setting_input = node_database_reader
            node.node_settings.cache_results = node_database_reader.cache_results
            if node_database_reader.node_id not in set(start_node.node_id for start_node in self._flow_starts):
                self._flow_starts.append(node)
            node.schema_callback = schema_callback
        else:
            node = FlowNode(node_database_reader.node_id, function=_func,
                            setting_input=node_database_reader,
                            name=node_type, node_type=node_type, parent_uuid=self.uuid,
                            schema_callback=schema_callback)
            self._node_db[node_database_reader.node_id] = node
            self._flow_starts.append(node)
            self._node_ids.append(node_database_reader.node_id)

    def add_sql_source(self, external_source_input: input_schema.NodeExternalSource):
        """Adds a node that reads data from a SQL source.

        This is a convenience alias for `add_external_source`.

        Args:
            external_source_input: The settings for the external SQL source node.
        """
        logger.info('Adding sql source')
        self.add_external_source(external_source_input)

    def add_cloud_storage_writer(self, node_cloud_storage_writer: input_schema.NodeCloudStorageWriter) -> None:
        """Adds a node to write data to a cloud storage provider.

        Args:
            node_cloud_storage_writer: The settings for the cloud storage writer node.
        """

        node_type = "cloud_storage_writer"
        def _func(df: FlowDataEngine):
            df.lazy = True
            execute_remote = self.execution_location != 'local'
            cloud_connection_settings = get_cloud_connection_settings(
                connection_name=node_cloud_storage_writer.cloud_storage_settings.connection_name,
                user_id=node_cloud_storage_writer.user_id,
                auth_mode=node_cloud_storage_writer.cloud_storage_settings.auth_mode
            )
            full_cloud_storage_connection = FullCloudStorageConnection(
                storage_type=cloud_connection_settings.storage_type,
                auth_method=cloud_connection_settings.auth_method,
                aws_allow_unsafe_html=cloud_connection_settings.aws_allow_unsafe_html,
                **CloudStorageReader.get_storage_options(cloud_connection_settings)
            )
            if execute_remote:
                settings = get_cloud_storage_write_settings_worker_interface(
                    write_settings=node_cloud_storage_writer.cloud_storage_settings,
                    connection=full_cloud_storage_connection,
                    lf=df.data_frame,
                    flowfile_node_id=node_cloud_storage_writer.node_id,
                    flowfile_flow_id=self.flow_id)
                external_cloud_writer = ExternalCloudWriter(settings, wait_on_completion=False)
                node._fetch_cached_df = external_cloud_writer
                external_cloud_writer.get_result()
            else:
                cloud_storage_write_settings_internal = CloudStorageWriteSettingsInternal(
                    connection=full_cloud_storage_connection,
                    write_settings=node_cloud_storage_writer.cloud_storage_settings,
                )
                df.to_cloud_storage_obj(cloud_storage_write_settings_internal)
            return df

        def schema_callback():
            logger.info("Starting to run the schema callback for cloud storage writer")
            if self.get_node(node_cloud_storage_writer.node_id).is_correct:
                return self.get_node(node_cloud_storage_writer.node_id).node_inputs.main_inputs[0].schema
            else:
                return [FlowfileColumn.from_input(column_name="__error__", data_type="String")]

        self.add_node_step(
            node_id=node_cloud_storage_writer.node_id,
            function=_func,
            input_columns=[],
            node_type=node_type,
            setting_input=node_cloud_storage_writer,
            schema_callback=schema_callback,
            input_node_ids=[node_cloud_storage_writer.depending_on_id]
        )

        node = self.get_node(node_cloud_storage_writer.node_id)

    def add_cloud_storage_reader(self, node_cloud_storage_reader: input_schema.NodeCloudStorageReader) -> None:
        """Adds a cloud storage read node to the flow graph.

        Args:
            node_cloud_storage_reader: The settings for the cloud storage read node.
        """
        node_type = "cloud_storage_reader"
        logger.info("Adding cloud storage reader")
        cloud_storage_read_settings = node_cloud_storage_reader.cloud_storage_settings

        def _func():
            logger.info("Starting to run the schema callback for cloud storage reader")
            self.flow_logger.info("Starting to run the schema callback for cloud storage reader")
            settings = CloudStorageReadSettingsInternal(read_settings=cloud_storage_read_settings,
                                                        connection=get_cloud_connection_settings(
                                                            connection_name=cloud_storage_read_settings.connection_name,
                                                            user_id=node_cloud_storage_reader.user_id,
                                                            auth_mode=cloud_storage_read_settings.auth_mode
                                                        ))
            fl = FlowDataEngine.from_cloud_storage_obj(settings)
            return fl

        node = self.add_node_step(node_id=node_cloud_storage_reader.node_id,
                                  function=_func,
                                  cache_results=node_cloud_storage_reader.cache_results,
                                  setting_input=node_cloud_storage_reader,
                                  node_type=node_type,
                                  )
        if node_cloud_storage_reader.node_id not in set(start_node.node_id for start_node in self._flow_starts):
            self._flow_starts.append(node)

    def add_external_source(self,
                            external_source_input: input_schema.NodeExternalSource):
        """Adds a node for a custom external data source.

        Args:
            external_source_input: The settings for the external source node.
        """

        node_type = 'external_source'
        external_source_script = getattr(external_sources.custom_external_sources, external_source_input.identifier)
        source_settings = (getattr(input_schema, snake_case_to_camel_case(external_source_input.identifier)).
                           model_validate(external_source_input.source_settings))
        if hasattr(external_source_script, 'initial_getter'):
            initial_getter = getattr(external_source_script, 'initial_getter')(source_settings)
        else:
            initial_getter = None
        data_getter = external_source_script.getter(source_settings)
        external_source = data_source_factory(source_type='custom',
                                              data_getter=data_getter,
                                              initial_data_getter=initial_getter,
                                              orientation=external_source_input.source_settings.orientation,
                                              schema=None)

        def _func():
            logger.info('Calling external source')
            fl = FlowDataEngine.create_from_external_source(external_source=external_source)
            external_source_input.source_settings.fields = [c.get_minimal_field_info() for c in fl.schema]
            return fl

        node = self.get_node(external_source_input.node_id)
        if node:
            node.node_type = node_type
            node.name = node_type
            node.function = _func
            node.setting_input = external_source_input
            node.node_settings.cache_results = external_source_input.cache_results
            if external_source_input.node_id not in set(start_node.node_id for start_node in self._flow_starts):
                self._flow_starts.append(node)
        else:
            node = FlowNode(external_source_input.node_id, function=_func,
                            setting_input=external_source_input,
                            name=node_type, node_type=node_type, parent_uuid=self.uuid)
            self._node_db[external_source_input.node_id] = node
            self._flow_starts.append(node)
            self._node_ids.append(external_source_input.node_id)
        if external_source_input.source_settings.fields and len(external_source_input.source_settings.fields) > 0:
            logger.info('Using provided schema in the node')

            def schema_callback():
                return [FlowfileColumn.from_input(f.name, f.data_type) for f in
                        external_source_input.source_settings.fields]

            node.schema_callback = schema_callback
        else:
            logger.warning('Removing schema')
            node._schema_callback = None
        self.add_node_step(node_id=external_source_input.node_id,
                           function=_func,
                           input_columns=[],
                           node_type=node_type,
                           setting_input=external_source_input)

    def add_read(self, input_file: input_schema.NodeRead):
        """Adds a node to read data from a local file (e.g., CSV, Parquet, Excel).

        Args:
            input_file: The settings for the read operation.
        """

        if input_file.received_file.file_type in ('xlsx', 'excel') and input_file.received_file.sheet_name == '':
            sheet_name = fastexcel.read_excel(input_file.received_file.path).sheet_names[0]
            input_file.received_file.sheet_name = sheet_name

        received_file = input_file.received_file
        input_file.received_file.set_absolute_filepath()

        def _func():
            input_file.received_file.set_absolute_filepath()
            if input_file.received_file.file_type == 'parquet':
                input_data = FlowDataEngine.create_from_path(input_file.received_file)
            elif input_file.received_file.file_type == 'csv' and 'utf' in input_file.received_file.encoding:
                input_data = FlowDataEngine.create_from_path(input_file.received_file)
            else:
                input_data = FlowDataEngine.create_from_path_worker(input_file.received_file,
                                                                    node_id=input_file.node_id,
                                                                    flow_id=self.flow_id)
            input_data.name = input_file.received_file.name
            return input_data

        node = self.get_node(input_file.node_id)
        schema_callback = None
        if node:
            start_hash = node.hash
            node.node_type = 'read'
            node.name = 'read'
            node.function = _func
            node.setting_input = input_file
            if input_file.node_id not in set(start_node.node_id for start_node in self._flow_starts):
                self._flow_starts.append(node)

            if start_hash != node.hash:
                logger.info('Hash changed, updating schema')
                if len(received_file.fields) > 0:
                    # If the file has fields defined, we can use them to create the schema
                    def schema_callback():
                        return [FlowfileColumn.from_input(f.name, f.data_type) for f in received_file.fields]

                elif input_file.received_file.file_type in ('csv', 'json', 'parquet'):
                    # everything that can be scanned by polars
                    def schema_callback():
                        input_data = FlowDataEngine.create_from_path(input_file.received_file)
                        return input_data.schema

                elif input_file.received_file.file_type in ('xlsx', 'excel'):
                    # If the file is an Excel file, we need to use the openpyxl engine to read the schema
                    schema_callback = get_xlsx_schema_callback(engine='openpyxl',
                                                               file_path=received_file.file_path,
                                                               sheet_name=received_file.sheet_name,
                                                               start_row=received_file.start_row,
                                                               end_row=received_file.end_row,
                                                               start_column=received_file.start_column,
                                                               end_column=received_file.end_column,
                                                               has_headers=received_file.has_headers)
                else:
                    schema_callback = None
        else:
            node = FlowNode(input_file.node_id, function=_func,
                            setting_input=input_file,
                            name='read', node_type='read', parent_uuid=self.uuid)
            self._node_db[input_file.node_id] = node
            self._flow_starts.append(node)
            self._node_ids.append(input_file.node_id)

        if schema_callback is not None:
            node.schema_callback = schema_callback
        return self

    def add_datasource(self, input_file: Union[input_schema.NodeDatasource, input_schema.NodeManualInput]) -> "FlowGraph":
        """Adds a data source node to the graph.

        This method serves as a factory for creating starting nodes, handling both
        file-based sources and direct manual data entry.

        Args:
            input_file: The configuration object for the data source.

        Returns:
            The `FlowGraph` instance for method chaining.
        """
        if isinstance(input_file, input_schema.NodeManualInput):
            input_data = FlowDataEngine(input_file.raw_data_format)
            ref = 'manual_input'
        else:
            input_data = FlowDataEngine(path_ref=input_file.file_ref)
            ref = 'datasource'
        node = self.get_node(input_file.node_id)
        if node:
            node.node_type = ref
            node.name = ref
            node.function = input_data
            node.setting_input = input_file
            if input_file.node_id not in set(start_node.node_id for start_node in self._flow_starts):
                self._flow_starts.append(node)
        else:
            input_data.collect()
            node = FlowNode(input_file.node_id, function=input_data,
                            setting_input=input_file,
                            name=ref, node_type=ref, parent_uuid=self.uuid)
            self._node_db[input_file.node_id] = node
            self._flow_starts.append(node)
            self._node_ids.append(input_file.node_id)
        return self

    def add_manual_input(self, input_file: input_schema.NodeManualInput):
        """Adds a node for manual data entry.

        This is a convenience alias for `add_datasource`.

        Args:
            input_file: The settings and data for the manual input node.
        """
        self.add_datasource(input_file)

    @property
    def nodes(self) -> List[FlowNode]:
        """Gets a list of all FlowNode objects in the graph."""

        return list(self._node_db.values())

    @property
    def execution_mode(self) -> schemas.ExecutionModeLiteral:
        """Gets the current execution mode ('Development' or 'Performance')."""
        return self.flow_settings.execution_mode

    def get_implicit_starter_nodes(self) -> List[FlowNode]:
        """Finds nodes that can act as starting points but are not explicitly defined as such.

        Some nodes, like the Polars Code node, can function without an input. This
        method identifies such nodes if they have no incoming connections.

        Returns:
            A list of `FlowNode` objects that are implicit starting nodes.
        """
        starting_node_ids = [node.node_id for node in self._flow_starts]
        implicit_starting_nodes = []
        for node in self.nodes:
            if node.node_template.can_be_start and not node.has_input and node.node_id not in starting_node_ids:
                implicit_starting_nodes.append(node)
        return implicit_starting_nodes

    @execution_mode.setter
    def execution_mode(self, mode: schemas.ExecutionModeLiteral):
        """Sets the execution mode for the flow.

        Args:
            mode: The execution mode to set.
        """
        self.flow_settings.execution_mode = mode

    @property
    def execution_location(self) -> schemas.ExecutionLocationsLiteral:
        """Gets the current execution location."""
        return self.flow_settings.execution_location

    @execution_location.setter
    def execution_location(self, execution_location: schemas.ExecutionLocationsLiteral):
        """Sets the execution location for the flow.

        Args:
            execution_location: The execution location to set.
        """
        self.flow_settings.execution_location = execution_location

    def run_graph(self) -> RunInformation | None:
        """Executes the entire data flow graph from start to finish.

        It determines the correct execution order, runs each node,
        collects results, and handles errors and cancellations.

        Returns:
            A RunInformation object summarizing the execution results.

        Raises:
            Exception: If the flow is already running.
        """
        if self.flow_settings.is_running:
            raise Exception('Flow is already running')
        try:
            self.flow_settings.is_running = True
            self.flow_settings.is_canceled = False
            self.flow_logger.clear_log_file()
            self.nodes_completed = 0
            self.node_results = []
            self.start_datetime = datetime.datetime.now()
            self.end_datetime = None
            self.latest_run_info = None
            self.flow_logger.info('Starting to run flowfile flow...')
            skip_nodes = [node for node in self.nodes if not node.is_correct]
            skip_nodes.extend([lead_to_node for node in skip_nodes for lead_to_node in node.leads_to_nodes])
            execution_order = determine_execution_order(all_nodes=[node for node in self.nodes if
                                                                   node not in skip_nodes],
                                                        flow_starts=self._flow_starts+self.get_implicit_starter_nodes())
            skip_node_message(self.flow_logger, skip_nodes)
            execution_order_message(self.flow_logger, execution_order)
            performance_mode = self.flow_settings.execution_mode == 'Performance'
            if self.flow_settings.execution_location == 'local':
                OFFLOAD_TO_WORKER.value = False
            elif self.flow_settings.execution_location == 'remote':
                OFFLOAD_TO_WORKER.value = True
            for node in execution_order:
                node_logger = self.flow_logger.get_node_logger(node.node_id)
                if self.flow_settings.is_canceled:
                    self.flow_logger.info('Flow canceled')
                    break
                if node in skip_nodes:
                    node_logger.info(f'Skipping node {node.node_id}')
                    continue
                node_result = NodeResult(node_id=node.node_id, node_name=node.name)
                self.node_results.append(node_result)
                logger.info(f'Starting to run: node {node.node_id}, start time: {node_result.start_timestamp}')
                node.execute_node(run_location=self.flow_settings.execution_location,
                                  performance_mode=performance_mode,
                                  node_logger=node_logger)
                try:
                    node_result.error = str(node.results.errors)
                    if self.flow_settings.is_canceled:
                        node_result.success = None
                        node_result.is_running = False
                        continue
                    node_result.success = node.results.errors is None
                    node_result.end_timestamp = time()
                    node_result.run_time = int(node_result.end_timestamp - node_result.start_timestamp)
                    node_result.is_running = False
                except Exception as e:
                    node_result.error = 'Node did not run'
                    node_result.success = False
                    node_result.end_timestamp = time()
                    node_result.run_time = int(node_result.end_timestamp - node_result.start_timestamp)
                    node_result.is_running = False
                    node_logger.error(f'Error in node {node.node_id}: {e}')
                if not node_result.success:
                    skip_nodes.extend(list(node.get_all_dependent_nodes()))
                node_logger.info(f'Completed node with success: {node_result.success}')
                self.nodes_completed += 1
            self.flow_logger.info('Flow completed!')
            self.end_datetime = datetime.datetime.now()
            self.flow_settings.is_running = False
            if self.flow_settings.is_canceled:
                self.flow_logger.info('Flow canceled')
            return self.get_run_info()
        except Exception as e:
            raise e
        finally:
            self.flow_settings.is_running = False

    def get_run_info(self) -> RunInformation:
        """Gets a summary of the most recent graph execution.

        Returns:
            A RunInformation object with details about the last run.
        """
        if self.latest_run_info is None:
            node_results = self.node_results
            success = all(nr.success for nr in node_results)
            self.latest_run_info = RunInformation(start_time=self.start_datetime, end_time=self.end_datetime,
                                                  success=success,
                                                  node_step_result=node_results, flow_id=self.flow_id,
                                                  nodes_completed=self.nodes_completed,
                                                  number_of_nodes=len(self.nodes))
        elif self.latest_run_info.nodes_completed != self.nodes_completed:
            node_results = self.node_results
            self.latest_run_info = RunInformation(start_time=self.start_datetime, end_time=self.end_datetime,
                                                  success=all(nr.success for nr in node_results),
                                                  node_step_result=node_results, flow_id=self.flow_id,
                                                  nodes_completed=self.nodes_completed,
                                                  number_of_nodes=len(self.nodes))
        return self.latest_run_info

    @property
    def node_connections(self) -> List[Tuple[int, int]]:
        """Computes and returns a list of all connections in the graph.

        Returns:
            A list of tuples, where each tuple is a (source_id, target_id) pair.
        """
        connections = set()
        for node in self.nodes:
            outgoing_connections = [(node.node_id, ltn.node_id) for ltn in node.leads_to_nodes]
            incoming_connections = [(don.node_id, node.node_id) for don in node.all_inputs]
            node_connections = [c for c in outgoing_connections + incoming_connections if (c[0] is not None
                                                                                           and c[1] is not None)]
            for node_connection in node_connections:
                if node_connection not in connections:
                    connections.add(node_connection)
        return list(connections)

    def get_node_data(self, node_id: int, include_example: bool = True) -> NodeData:
        """Retrieves all data needed to render a node in the UI.

        Args:
            node_id: The ID of the node.
            include_example: Whether to include data samples in the result.

        Returns:
            A NodeData object for the requested node.

        Raises:
            KeyError: If the node is not found in the graph.
        """
        node = self._node_db[node_id]
        return node.get_node_data(flow_id=self.flow_id, include_example=include_example)

    def get_node_storage(self) -> schemas.FlowInformation:
        """Serializes the entire graph's state into a storable format.

        Returns:
            A FlowInformation object representing the complete graph.
        """
        node_information = {node.node_id: node.get_node_information() for
                            node in self.nodes if node.is_setup and node.is_correct}

        return schemas.FlowInformation(flow_id=self.flow_id,
                                       flow_name=self.__name__,
                                       flow_settings=self.flow_settings,
                                       data=node_information,
                                       node_starts=[v.node_id for v in self._flow_starts],
                                       node_connections=self.node_connections
                                       )

    def cancel(self):
        """Cancels an ongoing graph execution."""

        if not self.flow_settings.is_running:
            return
        self.flow_settings.is_canceled = True
        for node in self.nodes:
            node.cancel()

    def close_flow(self):
        """Performs cleanup operations, such as clearing node caches."""

        for node in self.nodes:
            node.remove_cache()

    def save_flow(self, flow_path: str):
        """Saves the current state of the flow graph to a file.

        Args:
            flow_path: The path where the flow file will be saved.
        """
        with open(flow_path, 'wb') as f:
            pickle.dump(self.get_node_storage(), f)
        self.flow_settings.path = flow_path

    def get_frontend_data(self) -> dict:
        """Formats the graph structure into a JSON-like dictionary for a specific legacy frontend.

        This method transforms the graph's state into a format compatible with the
        Drawflow.js library.

        Returns:
            A dictionary representing the graph in Drawflow format.
        """
        result = {
            'Home': {
                "data": {}
            }
        }
        flow_info: schemas.FlowInformation = self.get_node_storage()

        for node_id, node_info in flow_info.data.items():
            if node_info.is_setup:
                try:
                    pos_x = node_info.data.pos_x
                    pos_y = node_info.data.pos_y
                    # Basic node structure
                    result["Home"]["data"][str(node_id)] = {
                        "id": node_info.id,
                        "name": node_info.type,
                        "data": {},  # Additional data can go here
                        "class": node_info.type,
                        "html": node_info.type,
                        "typenode": "vue",
                        "inputs": {},
                        "outputs": {},
                        "pos_x": pos_x,
                        "pos_y": pos_y
                    }
                except Exception as e:
                    logger.error(e)
            # Add outputs to the node based on `outputs` in your backend data
            if node_info.outputs:
                outputs = {o: 0 for o in node_info.outputs}
                for o in node_info.outputs:
                    outputs[o] += 1
                connections = []
                for output_node_id, n_connections in outputs.items():
                    leading_to_node = self.get_node(output_node_id)
                    input_types = leading_to_node.get_input_type(node_info.id)
                    for input_type in input_types:
                        if input_type == 'main':
                            input_frontend_id = 'input_1'
                        elif input_type == 'right':
                            input_frontend_id = 'input_2'
                        elif input_type == 'left':
                            input_frontend_id = 'input_3'
                        else:
                            input_frontend_id = 'input_1'
                        connection = {"node": str(output_node_id), "input": input_frontend_id}
                        connections.append(connection)

                result["Home"]["data"][str(node_id)]["outputs"]["output_1"] = {
                    "connections": connections}
            else:
                result["Home"]["data"][str(node_id)]["outputs"] = {"output_1": {"connections": []}}

            # Add input to the node based on `depending_on_id` in your backend data
            if node_info.left_input_id is not None or node_info.right_input_id is not None or node_info.input_ids is not None:
                main_inputs = node_info.main_input_ids
                result["Home"]["data"][str(node_id)]["inputs"]["input_1"] = {
                    "connections": [{"node": str(main_node_id), "input": "output_1"} for main_node_id in main_inputs]
                }
                if node_info.right_input_id is not None:
                    result["Home"]["data"][str(node_id)]["inputs"]["input_2"] = {
                        "connections": [{"node": str(node_info.right_input_id), "input": "output_1"}]
                    }
                if node_info.left_input_id is not None:
                    result["Home"]["data"][str(node_id)]["inputs"]["input_3"] = {
                        "connections": [{"node": str(node_info.left_input_id), "input": "output_1"}]
                    }
        return result

    def get_vue_flow_input(self) -> schemas.VueFlowInput:
        """Formats the graph's nodes and edges into a schema suitable for the VueFlow frontend.

        Returns:
            A VueFlowInput object.
        """
        edges: List[schemas.NodeEdge] = []
        nodes: List[schemas.NodeInput] = []
        for node in self.nodes:
            nodes.append(node.get_node_input())
            edges.extend(node.get_edge_input())
        return schemas.VueFlowInput(node_edges=edges, node_inputs=nodes)

    def reset(self):
        """Forces a deep reset on all nodes in the graph."""

        for node in self.nodes:
            node.reset(True)

    def copy_node(self, new_node_settings: input_schema.NodePromise, existing_setting_input: Any, node_type: str) -> None:
        """Creates a copy of an existing node.

        Args:
            new_node_settings: The promise containing new settings (like ID and position).
            existing_setting_input: The settings object from the node being copied.
            node_type: The type of the node being copied.
        """
        self.add_node_promise(new_node_settings)

        if isinstance(existing_setting_input, input_schema.NodePromise):
            return

        combined_settings = combine_existing_settings_and_new_settings(
            existing_setting_input, new_node_settings
        )
        getattr(self, f"add_{node_type}")(combined_settings)

    def generate_code(self):
        """Generates code for the flow graph.
        This method exports the flow graph to a Polars-compatible format.
        """
        from flowfile_core.flowfile.code_generator.code_generator import export_flow_to_polars
        print(export_flow_to_polars(self))
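The listing above is the full class source. As a quick orientation before the per-member reference below, here is a minimal, hedged sketch of building and running a flow. The import path of `input_schema`/`schemas`, the constructor arguments of `FlowGraphConfig`, and the list-of-dicts shape passed to `raw_data_format` are assumptions made for illustration; only the `FlowGraph` methods themselves are taken from the source above.

# Minimal sketch, not the canonical API: module paths and model fields marked
# below are assumptions made for illustration.
from flowfile_core.flowfile.flow_graph import FlowGraph
from flowfile_core.schemas import input_schema, schemas  # import path assumed

graph = FlowGraph(flow_settings=schemas.FlowGraphConfig(flow_id=1))  # fields assumed

# Manual input becomes a start node of the flow (raw_data_format shape assumed).
graph.add_manual_input(input_schema.NodeManualInput(
    flow_id=1, node_id=1,
    raw_data_format=[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
))

run_info = graph.run_graph()          # executes nodes in dependency order
print(run_info.success, run_info.nodes_completed)

graph.generate_code()                 # prints a Polars-compatible export of the flow
graph.save_flow("example.flowfile")   # pickles get_node_storage() to this path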
execution_location property writable

Gets the current execution location.

execution_mode property writable

Gets the current execution mode ('Development' or 'Performance').

flow_id property writable

Gets the unique identifier of the flow.

graph_has_functions property

Checks if the graph has any nodes.

graph_has_input_data property

Checks if the graph has an initial input data source.

node_connections property

Computes and returns a list of all connections in the graph.

Returns:

Type Description
List[Tuple[int, int]]

A list of tuples, where each tuple is a (source_id, target_id) pair.

nodes property

Gets a list of all FlowNode objects in the graph.

__init__(flow_settings, name=None, input_cols=None, output_cols=None, path_ref=None, input_flow=None, cache_results=False)

Initializes a new FlowGraph instance.

Parameters:

Name Type Description Default
flow_settings FlowSettings | FlowGraphConfig

The configuration settings for the flow.

required
name str

The name of the flow.

None
input_cols List[str]

A list of input column names.

None
output_cols List[str]

A list of output column names.

None
path_ref str

An optional path to an initial data source.

None
input_flow Union[ParquetFile, FlowDataEngine, FlowGraph]

An optional existing data object to start the flow with.

None
cache_results bool

A global flag to enable or disable result caching.

False
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def __init__(self,
             flow_settings: schemas.FlowSettings | schemas.FlowGraphConfig,
             name: str = None, input_cols: List[str] = None,
             output_cols: List[str] = None,
             path_ref: str = None,
             input_flow: Union[ParquetFile, FlowDataEngine, "FlowGraph"] = None,
             cache_results: bool = False):
    """Initializes a new FlowGraph instance.

    Args:
        flow_settings: The configuration settings for the flow.
        name: The name of the flow.
        input_cols: A list of input column names.
        output_cols: A list of output column names.
        path_ref: An optional path to an initial data source.
        input_flow: An optional existing data object to start the flow with.
        cache_results: A global flag to enable or disable result caching.
    """
    if isinstance(flow_settings, schemas.FlowGraphConfig):
        flow_settings = schemas.FlowSettings.from_flow_settings_input(flow_settings)

    self.flow_settings = flow_settings
    self.uuid = str(uuid1())
    self.nodes_completed = 0
    self.start_datetime = None
    self.end_datetime = None
    self.latest_run_info = None
    self.node_results = []
    self._flow_id = flow_settings.flow_id
    self.flow_logger = FlowLogger(flow_settings.flow_id)
    self._flow_starts: List[FlowNode] = []
    self._results = None
    self.schema = None
    self.has_over_row_function = False
    self._input_cols = [] if input_cols is None else input_cols
    self._output_cols = [] if output_cols is None else output_cols
    self._node_ids = []
    self._node_db = {}
    self.cache_results = cache_results
    self.__name__ = name if name else id(self)
    self.depends_on = {}
    if path_ref is not None:
        self.add_datasource(input_schema.NodeDatasource(file_path=path_ref))
    elif input_flow is not None:
        self.add_datasource(input_file=input_flow)
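A hedged constructor sketch; the `FlowGraphConfig` field names are assumptions, and note that passing `path_ref` immediately bootstraps a datasource start node via `add_datasource`:

# FlowGraphConfig fields are assumptions made for illustration.
graph = FlowGraph(flow_settings=schemas.FlowGraphConfig(flow_id=7),
                  name="sales_pipeline",
                  path_ref="data/sales.parquet",  # creates a datasource start node
                  cache_results=False)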
__repr__()

Provides the official string representation of the FlowGraph instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def __repr__(self):
    """Provides the official string representation of the FlowGraph instance."""
    settings_str = "  -" + '\n  -'.join(f"{k}: {v}" for k, v in self.flow_settings)
    return f"FlowGraph(\nNodes: {self._node_db}\n\nSettings:\n{settings_str}"
add_cloud_storage_reader(node_cloud_storage_reader)

Adds a cloud storage read node to the flow graph.

Parameters:

Name Type Description Default
node_cloud_storage_reader NodeCloudStorageReader

The settings for the cloud storage read node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_cloud_storage_reader(self, node_cloud_storage_reader: input_schema.NodeCloudStorageReader) -> None:
    """Adds a cloud storage read node to the flow graph.

    Args:
        node_cloud_storage_reader: The settings for the cloud storage read node.
    """
    node_type = "cloud_storage_reader"
    logger.info("Adding cloud storage reader")
    cloud_storage_read_settings = node_cloud_storage_reader.cloud_storage_settings

    def _func():
        logger.info("Starting to run the schema callback for cloud storage reader")
        self.flow_logger.info("Starting to run the schema callback for cloud storage reader")
        settings = CloudStorageReadSettingsInternal(read_settings=cloud_storage_read_settings,
                                                    connection=get_cloud_connection_settings(
                                                        connection_name=cloud_storage_read_settings.connection_name,
                                                        user_id=node_cloud_storage_reader.user_id,
                                                        auth_mode=cloud_storage_read_settings.auth_mode
                                                    ))
        fl = FlowDataEngine.from_cloud_storage_obj(settings)
        return fl

    node = self.add_node_step(node_id=node_cloud_storage_reader.node_id,
                              function=_func,
                              cache_results=node_cloud_storage_reader.cache_results,
                              setting_input=node_cloud_storage_reader,
                              node_type=node_type,
                              )
    if node_cloud_storage_reader.node_id not in set(start_node.node_id for start_node in self._flow_starts):
        self._flow_starts.append(node)
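A hedged usage sketch (assuming `graph` and `input_schema` are in scope as above). `connection_name`, `auth_mode`, `user_id` and `cache_results` appear in the source; the settings model name `CloudStorageReadSettings` and the `resource_path` field are assumptions:

reader = input_schema.NodeCloudStorageReader(
    flow_id=1, node_id=1, user_id=1, cache_results=True,
    cloud_storage_settings=input_schema.CloudStorageReadSettings(  # model name assumed
        connection_name="my_s3_connection",
        auth_mode="access_key",                        # value assumed
        resource_path="s3://my-bucket/raw/*.parquet",  # field name assumed
    ),
)
graph.add_cloud_storage_reader(reader)  # registered as a start node of the flow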
add_cloud_storage_writer(node_cloud_storage_writer)

Adds a node to write data to a cloud storage provider.

Parameters:

Name Type Description Default
node_cloud_storage_writer NodeCloudStorageWriter

The settings for the cloud storage writer node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_cloud_storage_writer(self, node_cloud_storage_writer: input_schema.NodeCloudStorageWriter) -> None:
    """Adds a node to write data to a cloud storage provider.

    Args:
        node_cloud_storage_writer: The settings for the cloud storage writer node.
    """

    node_type = "cloud_storage_writer"
    def _func(df: FlowDataEngine):
        df.lazy = True
        execute_remote = self.execution_location != 'local'
        cloud_connection_settings = get_cloud_connection_settings(
            connection_name=node_cloud_storage_writer.cloud_storage_settings.connection_name,
            user_id=node_cloud_storage_writer.user_id,
            auth_mode=node_cloud_storage_writer.cloud_storage_settings.auth_mode
        )
        full_cloud_storage_connection = FullCloudStorageConnection(
            storage_type=cloud_connection_settings.storage_type,
            auth_method=cloud_connection_settings.auth_method,
            aws_allow_unsafe_html=cloud_connection_settings.aws_allow_unsafe_html,
            **CloudStorageReader.get_storage_options(cloud_connection_settings)
        )
        if execute_remote:
            settings = get_cloud_storage_write_settings_worker_interface(
                write_settings=node_cloud_storage_writer.cloud_storage_settings,
                connection=full_cloud_storage_connection,
                lf=df.data_frame,
                flowfile_node_id=node_cloud_storage_writer.node_id,
                flowfile_flow_id=self.flow_id)
            external_cloud_writer = ExternalCloudWriter(settings, wait_on_completion=False)
            node._fetch_cached_df = external_cloud_writer
            external_cloud_writer.get_result()
        else:
            cloud_storage_write_settings_internal = CloudStorageWriteSettingsInternal(
                connection=full_cloud_storage_connection,
                write_settings=node_cloud_storage_writer.cloud_storage_settings,
            )
            df.to_cloud_storage_obj(cloud_storage_write_settings_internal)
        return df

    def schema_callback():
        logger.info("Starting to run the schema callback for cloud storage writer")
        if self.get_node(node_cloud_storage_writer.node_id).is_correct:
            return self.get_node(node_cloud_storage_writer.node_id).node_inputs.main_inputs[0].schema
        else:
            return [FlowfileColumn.from_input(column_name="__error__", data_type="String")]

    self.add_node_step(
        node_id=node_cloud_storage_writer.node_id,
        function=_func,
        input_columns=[],
        node_type=node_type,
        setting_input=node_cloud_storage_writer,
        schema_callback=schema_callback,
        input_node_ids=[node_cloud_storage_writer.depending_on_id]
    )

    node = self.get_node(node_cloud_storage_writer.node_id)
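A hedged usage sketch along the same lines; `depending_on_id` wires the writer to its upstream node, while the settings model name and `resource_path` field are assumptions:

writer = input_schema.NodeCloudStorageWriter(
    flow_id=1, node_id=3, user_id=1,
    depending_on_id=1,  # the node whose output should be written
    cloud_storage_settings=input_schema.CloudStorageWriteSettings(  # model name assumed
        connection_name="my_s3_connection",
        auth_mode="access_key",                             # value assumed
        resource_path="s3://my-bucket/out/result.parquet",  # field name assumed
    ),
)
graph.add_cloud_storage_writer(writer)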
add_cross_join(cross_join_settings)

Adds a cross join node to the graph.

Parameters:

Name Type Description Default
cross_join_settings NodeCrossJoin

The settings for the cross join operation.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_cross_join(self, cross_join_settings: input_schema.NodeCrossJoin) -> "FlowGraph":
    """Adds a cross join node to the graph.

    Args:
        cross_join_settings: The settings for the cross join operation.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
        for left_select in cross_join_settings.cross_join_input.left_select.renames:
            left_select.is_available = True if left_select.old_name in main.schema else False
        for right_select in cross_join_settings.cross_join_input.right_select.renames:
            right_select.is_available = True if right_select.old_name in right.schema else False

        return main.do_cross_join(cross_join_input=cross_join_settings.cross_join_input,
                                  auto_generate_selection=cross_join_settings.auto_generate_selection,
                                  verify_integrity=False,
                                  other=right)

    self.add_node_step(node_id=cross_join_settings.node_id,
                       function=_func,
                       input_columns=[],
                       node_type='cross_join',
                       setting_input=cross_join_settings,
                       input_node_ids=cross_join_settings.depending_on_ids)
    return self
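A hedged sketch; `depending_on_ids`, `auto_generate_selection` and `cross_join_input` appear in the source above, while the shape of the cross-join input model is not shown here and is left as a placeholder:

cross_join = input_schema.NodeCrossJoin(
    flow_id=1, node_id=4,
    depending_on_ids=[2, 3],          # main input first, right input second
    auto_generate_selection=True,
    cross_join_input=...,             # placeholder: a CrossJoinInput with left/right selects
)
graph.add_cross_join(cross_join)      # returns the FlowGraph for chaining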
add_database_reader(node_database_reader)

Adds a node to read data from a database.

Parameters:

Name Type Description Default
node_database_reader NodeDatabaseReader

The settings for the database reader node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_database_reader(self, node_database_reader: input_schema.NodeDatabaseReader):
    """Adds a node to read data from a database.

    Args:
        node_database_reader: The settings for the database reader node.
    """

    logger.info("Adding database reader")
    node_type = 'database_reader'
    database_settings: input_schema.DatabaseSettings = node_database_reader.database_settings
    database_connection: Optional[input_schema.DatabaseConnection | input_schema.FullDatabaseConnection]
    if database_settings.connection_mode == 'inline':
        database_connection: input_schema.DatabaseConnection = database_settings.database_connection
        encrypted_password = get_encrypted_secret(current_user_id=node_database_reader.user_id,
                                                  secret_name=database_connection.password_ref)
        if encrypted_password is None:
            raise HTTPException(status_code=400, detail="Password not found")
    else:
        database_reference_settings = get_local_database_connection(database_settings.database_connection_name,
                                                                    node_database_reader.user_id)
        database_connection = database_reference_settings
        encrypted_password = database_reference_settings.password.get_secret_value()

    def _func():
        sql_source = BaseSqlSource(query=None if database_settings.query_mode == 'table' else database_settings.query,
                                   table_name=database_settings.table_name,
                                   schema_name=database_settings.schema_name,
                                   fields=node_database_reader.fields,
                                   )
        database_external_read_settings = (
            sql_models.DatabaseExternalReadSettings.create_from_from_node_database_reader(
                node_database_reader=node_database_reader,
                password=encrypted_password,
                query=sql_source.query,
                database_reference_settings=(database_reference_settings if database_settings.connection_mode == 'reference'
                                             else None),
            )
        )

        external_database_fetcher = ExternalDatabaseFetcher(database_external_read_settings, wait_on_completion=False)
        node._fetch_cached_df = external_database_fetcher
        fl = FlowDataEngine(external_database_fetcher.get_result())
        node_database_reader.fields = [c.get_minimal_field_info() for c in fl.schema]
        return fl

    def schema_callback():
        sql_source = SqlSource(connection_string=
                               sql_utils.construct_sql_uri(database_type=database_connection.database_type,
                                                           host=database_connection.host,
                                                           port=database_connection.port,
                                                           database=database_connection.database,
                                                           username=database_connection.username,
                                                           password=decrypt_secret(encrypted_password)),
                               query=None if database_settings.query_mode == 'table' else database_settings.query,
                               table_name=database_settings.table_name,
                               schema_name=database_settings.schema_name,
                               fields=node_database_reader.fields,
                               )
        return sql_source.get_schema()

    node = self.get_node(node_database_reader.node_id)
    if node:
        node.node_type = node_type
        node.name = node_type
        node.function = _func
        node.setting_input = node_database_reader
        node.node_settings.cache_results = node_database_reader.cache_results
        if node_database_reader.node_id not in set(start_node.node_id for start_node in self._flow_starts):
            self._flow_starts.append(node)
        node.schema_callback = schema_callback
    else:
        node = FlowNode(node_database_reader.node_id, function=_func,
                        setting_input=node_database_reader,
                        name=node_type, node_type=node_type, parent_uuid=self.uuid,
                        schema_callback=schema_callback)
        self._node_db[node_database_reader.node_id] = node
        self._flow_starts.append(node)
        self._node_ids.append(node_database_reader.node_id)
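A hedged usage sketch; the `DatabaseSettings` field names (`connection_mode`, `database_connection_name`, `query_mode`, `schema_name`, `table_name`) are taken from the source above, while the keyword-style construction is an assumption:

db_reader = input_schema.NodeDatabaseReader(
    flow_id=1, node_id=1, user_id=1, cache_results=True,
    database_settings=input_schema.DatabaseSettings(
        connection_mode="reference",              # or "inline" with a DatabaseConnection
        database_connection_name="warehouse_pg",  # stored connection for this user
        query_mode="table",                       # or pass a raw query via `query`
        schema_name="public",
        table_name="orders",
    ),
)
graph.add_database_reader(db_reader)  # registered as a start node of the flow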
add_database_writer(node_database_writer)

Adds a node to write data to a database.

Parameters:

Name Type Description Default
node_database_writer NodeDatabaseWriter

The settings for the database writer node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_database_writer(self, node_database_writer: input_schema.NodeDatabaseWriter):
    """Adds a node to write data to a database.

    Args:
        node_database_writer: The settings for the database writer node.
    """

    node_type = 'database_writer'
    database_settings: input_schema.DatabaseWriteSettings = node_database_writer.database_write_settings
    database_connection: Optional[input_schema.DatabaseConnection | input_schema.FullDatabaseConnection]
    if database_settings.connection_mode == 'inline':
        database_connection: input_schema.DatabaseConnection = database_settings.database_connection
        encrypted_password = get_encrypted_secret(current_user_id=node_database_writer.user_id,
                                                  secret_name=database_connection.password_ref)
        if encrypted_password is None:
            raise HTTPException(status_code=400, detail="Password not found")
    else:
        database_reference_settings = get_local_database_connection(database_settings.database_connection_name,
                                                                    node_database_writer.user_id)
        encrypted_password = database_reference_settings.password.get_secret_value()

    def _func(df: FlowDataEngine):
        df.lazy = True
        database_external_write_settings = (
            sql_models.DatabaseExternalWriteSettings.create_from_from_node_database_writer(
                node_database_writer=node_database_writer,
                password=encrypted_password,
                table_name=(database_settings.schema_name+'.'+database_settings.table_name
                            if database_settings.schema_name else database_settings.table_name),
                database_reference_settings=(database_reference_settings if database_settings.connection_mode == 'reference'
                                             else None),
                lf=df.data_frame
            )
        )
        external_database_writer = ExternalDatabaseWriter(database_external_write_settings, wait_on_completion=False)
        node._fetch_cached_df = external_database_writer
        external_database_writer.get_result()
        return df

    def schema_callback():
        input_node: FlowNode = self.get_node(node_database_writer.node_id).node_inputs.main_inputs[0]
        return input_node.schema

    self.add_node_step(
        node_id=node_database_writer.node_id,
        function=_func,
        input_columns=[],
        node_type=node_type,
        setting_input=node_database_writer,
        schema_callback=schema_callback,
    )
    node = self.get_node(node_database_writer.node_id)
add_datasource(input_file)

Adds a data source node to the graph.

This method serves as a factory for creating starting nodes, handling both file-based sources and direct manual data entry.

Parameters:

Name Type Description Default
input_file Union[NodeDatasource, NodeManualInput]

The configuration object for the data source.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_datasource(self, input_file: Union[input_schema.NodeDatasource, input_schema.NodeManualInput]) -> "FlowGraph":
    """Adds a data source node to the graph.

    This method serves as a factory for creating starting nodes, handling both
    file-based sources and direct manual data entry.

    Args:
        input_file: The configuration object for the data source.

    Returns:
        The `FlowGraph` instance for method chaining.
    """
    if isinstance(input_file, input_schema.NodeManualInput):
        input_data = FlowDataEngine(input_file.raw_data_format)
        ref = 'manual_input'
    else:
        input_data = FlowDataEngine(path_ref=input_file.file_ref)
        ref = 'datasource'
    node = self.get_node(input_file.node_id)
    if node:
        node.node_type = ref
        node.name = ref
        node.function = input_data
        node.setting_input = input_file
        if not input_file.node_id in set(start_node.node_id for start_node in self._flow_starts):
            self._flow_starts.append(node)
    else:
        input_data.collect()
        node = FlowNode(input_file.node_id, function=input_data,
                        setting_input=input_file,
                        name=ref, node_type=ref, parent_uuid=self.uuid)
        self._node_db[input_file.node_id] = node
        self._flow_starts.append(node)
        self._node_ids.append(input_file.node_id)
    return self
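
Usage sketch (illustrative): the snippet below assumes an existing FlowGraph instance named graph, and that NodeManualInput accepts the keyword arguments shown; the import path and the shape of raw_data_format are assumptions inferred from the attributes read above, not a verified schema.

from flowfile_core.schemas import input_schema  # import path assumed

# Hypothetical manual-input node: node 1 becomes a starting node of the flow.
manual_node = input_schema.NodeManualInput(
    flow_id=graph.flow_id,                                           # field name assumed
    node_id=1,
    raw_data_format=[{"city": "Amsterdam", "population": 821752}],  # shape assumed
)
graph.add_datasource(manual_node)
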
add_dependency_on_polars_lazy_frame(lazy_frame, node_id)

Adds a special node that directly injects a Polars LazyFrame into the graph.

Note: This is intended for backend use and will not work in the UI editor.

Parameters:

Name Type Description Default
lazy_frame LazyFrame

The Polars LazyFrame to inject.

required
node_id int

The ID for the new node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_dependency_on_polars_lazy_frame(self,
                                        lazy_frame: pl.LazyFrame,
                                        node_id: int):
    """Adds a special node that directly injects a Polars LazyFrame into the graph.

    Note: This is intended for backend use and will not work in the UI editor.

    Args:
        lazy_frame: The Polars LazyFrame to inject.
        node_id: The ID for the new node.
    """
    def _func():
        return FlowDataEngine(lazy_frame)
    node_promise = input_schema.NodePromise(flow_id=self.flow_id,
                                            node_id=node_id, node_type="polars_lazy_frame",
                                            is_setup=True)
    self.add_node_step(node_id=node_promise.node_id, node_type=node_promise.node_type, function=_func,
                       setting_input=node_promise)
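
A minimal usage sketch for injecting an existing LazyFrame, assuming graph is an already-constructed FlowGraph; the call signature itself matches the source above.

import polars as pl

lf = pl.LazyFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# Node 5 yields this LazyFrame when the graph runs (backend-only; not shown in the UI editor).
graph.add_dependency_on_polars_lazy_frame(lazy_frame=lf, node_id=5)
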
add_explore_data(node_analysis)

Adds a specialized node for data exploration and visualization.

Parameters:

Name Type Description Default
node_analysis NodeExploreData

The settings for the data exploration node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_explore_data(self, node_analysis: input_schema.NodeExploreData):
    """Adds a specialized node for data exploration and visualization.

    Args:
        node_analysis: The settings for the data exploration node.
    """
    sample_size: int = 10000

    def analysis_preparation(flowfile_table: FlowDataEngine):
        if flowfile_table.number_of_records <= 0:
            number_of_records = flowfile_table.get_number_of_records(calculate_in_worker_process=True)
        else:
            number_of_records = flowfile_table.number_of_records
        if number_of_records > sample_size:
            flowfile_table = flowfile_table.get_sample(sample_size, random=True)
        external_sampler = ExternalDfFetcher(
            lf=flowfile_table.data_frame,
            file_ref="__gf_walker"+node.hash,
            wait_on_completion=True,
            node_id=node.node_id,
            flow_id=self.flow_id,
        )
        node.results.analysis_data_generator = get_read_top_n(external_sampler.status.file_ref,
                                                              n=min(sample_size, number_of_records))
        return flowfile_table

    def schema_callback():
        node = self.get_node(node_analysis.node_id)
        if len(node.all_inputs) == 1:
            input_node = node.all_inputs[0]
            return input_node.schema
        else:
            return [FlowfileColumn.from_input('col_1', 'na')]

    self.add_node_step(node_id=node_analysis.node_id, node_type='explore_data',
                       function=analysis_preparation,
                       setting_input=node_analysis, schema_callback=schema_callback)
    node = self.get_node(node_analysis.node_id)
add_external_source(external_source_input)

Adds a node for a custom external data source.

Parameters:

Name Type Description Default
external_source_input NodeExternalSource

The settings for the external source node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_external_source(self,
                        external_source_input: input_schema.NodeExternalSource):
    """Adds a node for a custom external data source.

    Args:
        external_source_input: The settings for the external source node.
    """

    node_type = 'external_source'
    external_source_script = getattr(external_sources.custom_external_sources, external_source_input.identifier)
    source_settings = (getattr(input_schema, snake_case_to_camel_case(external_source_input.identifier)).
                       model_validate(external_source_input.source_settings))
    if hasattr(external_source_script, 'initial_getter'):
        initial_getter = getattr(external_source_script, 'initial_getter')(source_settings)
    else:
        initial_getter = None
    data_getter = external_source_script.getter(source_settings)
    external_source = data_source_factory(source_type='custom',
                                          data_getter=data_getter,
                                          initial_data_getter=initial_getter,
                                          orientation=external_source_input.source_settings.orientation,
                                          schema=None)

    def _func():
        logger.info('Calling external source')
        fl = FlowDataEngine.create_from_external_source(external_source=external_source)
        external_source_input.source_settings.fields = [c.get_minimal_field_info() for c in fl.schema]
        return fl

    node = self.get_node(external_source_input.node_id)
    if node:
        node.node_type = node_type
        node.name = node_type
        node.function = _func
        node.setting_input = external_source_input
        node.node_settings.cache_results = external_source_input.cache_results
        if external_source_input.node_id not in set(start_node.node_id for start_node in self._flow_starts):
            self._flow_starts.append(node)
    else:
        node = FlowNode(external_source_input.node_id, function=_func,
                        setting_input=external_source_input,
                        name=node_type, node_type=node_type, parent_uuid=self.uuid)
        self._node_db[external_source_input.node_id] = node
        self._flow_starts.append(node)
        self._node_ids.append(external_source_input.node_id)
    if external_source_input.source_settings.fields and len(external_source_input.source_settings.fields) > 0:
        logger.info('Using provided schema in the node')

        def schema_callback():
            return [FlowfileColumn.from_input(f.name, f.data_type) for f in
                    external_source_input.source_settings.fields]

        node.schema_callback = schema_callback
    else:
        logger.warning('Removing schema')
        node._schema_callback = None
    self.add_node_step(node_id=external_source_input.node_id,
                       function=_func,
                       input_columns=[],
                       node_type=node_type,
                       setting_input=external_source_input)
add_filter(filter_settings)

Adds a filter node to the graph.

Parameters:

Name Type Description Default
filter_settings NodeFilter

The settings for the filter operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_filter(self, filter_settings: input_schema.NodeFilter):
    """Adds a filter node to the graph.

    Args:
        filter_settings: The settings for the filter operation.
    """

    is_advanced = filter_settings.filter_input.filter_type == 'advanced'
    if is_advanced:
        predicate = filter_settings.filter_input.advanced_filter
    else:
        _basic_filter = filter_settings.filter_input.basic_filter
        filter_settings.filter_input.advanced_filter = (f'[{_basic_filter.field}]{_basic_filter.filter_type}"'
                                                        f'{_basic_filter.filter_value}"')

    def _func(fl: FlowDataEngine):
        is_advanced = filter_settings.filter_input.filter_type == 'advanced'
        if is_advanced:
            return fl.do_filter(predicate)
        else:
            basic_filter = filter_settings.filter_input.basic_filter
            if basic_filter.filter_value.isnumeric():
                field_data_type = fl.get_schema_column(basic_filter.field).generic_datatype()
                if field_data_type == 'str':
                    _f = f'[{basic_filter.field}]{basic_filter.filter_type}"{basic_filter.filter_value}"'
                else:
                    _f = f'[{basic_filter.field}]{basic_filter.filter_type}{basic_filter.filter_value}'
            else:
                _f = f'[{basic_filter.field}]{basic_filter.filter_type}"{basic_filter.filter_value}"'
            filter_settings.filter_input.advanced_filter = _f
            return fl.do_filter(_f)

    self.add_node_step(filter_settings.node_id, _func,
                       node_type='filter',
                       renew_schema=False,
                       setting_input=filter_settings,
                       input_node_ids=[filter_settings.depending_on_id]
                       )
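
Usage sketch (illustrative): the nested model names (FilterInput, BasicFilter) and constructor keywords below are assumptions based on the attribute paths read in the function above; only graph.add_filter itself is taken from the source.

from flowfile_core.schemas import input_schema  # import path assumed

filter_node = input_schema.NodeFilter(
    flow_id=graph.flow_id,
    node_id=3,
    depending_on_id=1,                           # upstream node to filter
    filter_input=input_schema.FilterInput(       # model name assumed
        filter_type="basic",
        basic_filter=input_schema.BasicFilter(   # model name assumed
            field="population",
            filter_type=">",
            filter_value="500000",
        ),
    ),
)
graph.add_filter(filter_node)
# At run time the basic filter above is compiled into an advanced expression,
# roughly: [population]>500000
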
add_formula(function_settings)

Adds a node that applies a formula to create or modify a column.

Parameters:

Name Type Description Default
function_settings NodeFormula

The settings for the formula operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_formula(self, function_settings: input_schema.NodeFormula):
    """Adds a node that applies a formula to create or modify a column.

    Args:
        function_settings: The settings for the formula operation.
    """

    error = ""
    if function_settings.function.field.data_type not in (None, "Auto"):
        output_type = cast_str_to_polars_type(function_settings.function.field.data_type)
    else:
        output_type = None
    if output_type not in (None, "Auto"):
        new_col = [FlowfileColumn.from_input(column_name=function_settings.function.field.name,
                                             data_type=str(output_type))]
    else:
        new_col = [FlowfileColumn.from_input(function_settings.function.field.name, 'String')]

    def _func(fl: FlowDataEngine):
        return fl.apply_sql_formula(func=function_settings.function.function,
                                    col_name=function_settings.function.field.name,
                                    output_data_type=output_type)

    self.add_node_step(function_settings.node_id, _func,
                       output_schema=new_col,
                       node_type='formula',
                       renew_schema=False,
                       setting_input=function_settings,
                       input_node_ids=[function_settings.depending_on_id]
                       )
    if error != "":
        node = self.get_node(function_settings.node_id)
        node.results.errors = error
        return False, error
    else:
        return True, ""
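
Usage sketch (illustrative): the FunctionInput/FieldInput model names and keywords are assumptions derived from the attribute paths above (function.function, function.field.name, function.field.data_type).

from flowfile_core.schemas import input_schema  # import path assumed

formula_node = input_schema.NodeFormula(
    flow_id=graph.flow_id,
    node_id=4,
    depending_on_id=3,
    function=input_schema.FunctionInput(         # model name assumed
        function="[population] / 1000",          # SQL-style formula string
        field=input_schema.FieldInput(           # model name assumed
            name="population_k",
            data_type="Double",                  # or "Auto" to let the type be inferred
        ),
    ),
)
graph.add_formula(formula_node)
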
add_fuzzy_match(fuzzy_settings)

Adds a fuzzy matching node to join data on approximate string matches.

Parameters:

Name Type Description Default
fuzzy_settings NodeFuzzyMatch

The settings for the fuzzy match operation.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_fuzzy_match(self, fuzzy_settings: input_schema.NodeFuzzyMatch) -> "FlowGraph":
    """Adds a fuzzy matching node to join data on approximate string matches.

    Args:
        fuzzy_settings: The settings for the fuzzy match operation.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
        f = main.start_fuzzy_join(fuzzy_match_input=fuzzy_settings.join_input, other=right, file_ref=node.hash,
                                  flow_id=self.flow_id, node_id=fuzzy_settings.node_id)
        logger.info("Started the fuzzy match action")
        node._fetch_cached_df = f
        return FlowDataEngine(f.get_result())

    self.add_node_step(node_id=fuzzy_settings.node_id,
                       function=_func,
                       input_columns=[],
                       node_type='fuzzy_match',
                       setting_input=fuzzy_settings)
    node = self.get_node(node_id=fuzzy_settings.node_id)

    def schema_callback():
        return calculate_fuzzy_match_schema(fuzzy_settings.join_input,
                                            left_schema=node.node_inputs.main_inputs[0].schema,
                                            right_schema=node.node_inputs.right_input.schema
                                            )

    node.schema_callback = schema_callback
    return self
add_graph_solver(graph_solver_settings)

Adds a node that solves graph-like problems within the data.

This node can be used for operations like finding network paths, calculating connected components, or performing other graph algorithms on relational data that represents nodes and edges.

Parameters:

Name Type Description Default
graph_solver_settings NodeGraphSolver

The settings object defining the graph inputs and the specific algorithm to apply.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_graph_solver(self, graph_solver_settings: input_schema.NodeGraphSolver):
    """Adds a node that solves graph-like problems within the data.

    This node can be used for operations like finding network paths,
    calculating connected components, or performing other graph algorithms
    on relational data that represents nodes and edges.

    Args:
        graph_solver_settings: The settings object defining the graph inputs
            and the specific algorithm to apply.
    """
    def _func(fl: FlowDataEngine) -> FlowDataEngine:
        return fl.solve_graph(graph_solver_settings.graph_solver_input)

    self.add_node_step(node_id=graph_solver_settings.node_id,
                       function=_func,
                       node_type='graph_solver',
                       setting_input=graph_solver_settings,
                       input_node_ids=[graph_solver_settings.depending_on_id])
add_group_by(group_by_settings)

Adds a group-by aggregation node to the graph.

Parameters:

Name Type Description Default
group_by_settings NodeGroupBy

The settings for the group-by operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_group_by(self, group_by_settings: input_schema.NodeGroupBy):
    """Adds a group-by aggregation node to the graph.

    Args:
        group_by_settings: The settings for the group-by operation.
    """

    def _func(fl: FlowDataEngine) -> FlowDataEngine:
        return fl.do_group_by(group_by_settings.groupby_input, False)

    self.add_node_step(node_id=group_by_settings.node_id,
                       function=_func,
                       node_type=f'group_by',
                       setting_input=group_by_settings,
                       input_node_ids=[group_by_settings.depending_on_id])

    node = self.get_node(group_by_settings.node_id)

    def schema_callback():

        output_columns = [(c.old_name, c.new_name, c.output_type) for c in group_by_settings.groupby_input.agg_cols]
        depends_on = node.node_inputs.main_inputs[0]
        input_schema_dict: Dict[str, str] = {s.name: s.data_type for s in depends_on.schema}
        output_schema = []
        for old_name, new_name, data_type in output_columns:
            data_type = input_schema_dict[old_name] if data_type is None else data_type
            output_schema.append(FlowfileColumn.from_input(data_type=data_type, column_name=new_name))
        return output_schema

    node.schema_callback = schema_callback
add_include_cols(include_columns)

Adds columns to both the input and output column lists.

Parameters:

Name Type Description Default
include_columns List[str]

A list of column names to include.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_include_cols(self, include_columns: List[str]):
    """Adds columns to both the input and output column lists.

    Args:
        include_columns: A list of column names to include.
    """
    for column in include_columns:
        if column not in self._input_cols:
            self._input_cols.append(column)
        if column not in self._output_cols:
            self._output_cols.append(column)
    return self
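
Usage sketch: add_include_cols takes a plain list of column names and registers them on both the graph's input and output column lists (assuming graph is an existing FlowGraph).

# Ensure these columns are tracked as both inputs and outputs of the graph.
graph.add_include_cols(["city", "population"])
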
add_initial_node_analysis(node_promise)

Adds a data exploration/analysis node based on a node promise.

Parameters:

Name Type Description Default
node_promise NodePromise

The promise representing the node to be analyzed.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_initial_node_analysis(self, node_promise: input_schema.NodePromise):
    """Adds a data exploration/analysis node based on a node promise.

    Args:
        node_promise: The promise representing the node to be analyzed.
    """
    node_analysis = create_graphic_walker_node_from_node_promise(node_promise)
    self.add_explore_data(node_analysis)
add_join(join_settings)

Adds a join node to combine two data streams based on key columns.

Parameters:

Name Type Description Default
join_settings NodeJoin

The settings for the join operation.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_join(self, join_settings: input_schema.NodeJoin) -> "FlowGraph":
    """Adds a join node to combine two data streams based on key columns.

    Args:
        join_settings: The settings for the join operation.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
        for left_select in join_settings.join_input.left_select.renames:
            left_select.is_available = True if left_select.old_name in main.schema else False
        for right_select in join_settings.join_input.right_select.renames:
            right_select.is_available = True if right_select.old_name in right.schema else False

        return main.join(join_input=join_settings.join_input,
                         auto_generate_selection=join_settings.auto_generate_selection,
                         verify_integrity=False,
                         other=right)

    self.add_node_step(node_id=join_settings.node_id,
                       function=_func,
                       input_columns=[],
                       node_type='join',
                       setting_input=join_settings,
                       input_node_ids=join_settings.depending_on_ids)
    return self
add_manual_input(input_file)

Adds a node for manual data entry.

This is a convenience alias for add_datasource.

Parameters:

Name Type Description Default
input_file NodeManualInput

The settings and data for the manual input node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_manual_input(self, input_file: input_schema.NodeManualInput):
    """Adds a node for manual data entry.

    This is a convenience alias for `add_datasource`.

    Args:
        input_file: The settings and data for the manual input node.
    """
    self.add_datasource(input_file)
add_node_promise(node_promise)

Adds a placeholder node to the graph that is not yet fully configured.

Useful for building the graph structure before all settings are available.

Parameters:

Name Type Description Default
node_promise NodePromise

A promise object containing basic node information.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_node_promise(self, node_promise: input_schema.NodePromise):
    """Adds a placeholder node to the graph that is not yet fully configured.

    Useful for building the graph structure before all settings are available.

    Args:
        node_promise: A promise object containing basic node information.
    """
    def placeholder(n: FlowNode = None):
        if n is None:
            return FlowDataEngine()
        return n

    self.add_node_step(node_id=node_promise.node_id, node_type=node_promise.node_type, function=placeholder,
                       setting_input=node_promise)
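
Usage sketch: a promise reserves a node id and type before the full settings exist. The NodePromise fields used below (flow_id, node_id, node_type, is_setup) are the same ones constructed inside add_dependency_on_polars_lazy_frame above; the exact constructor form is otherwise an assumption, and graph is assumed to be an existing FlowGraph.

from flowfile_core.schemas import input_schema  # import path assumed

# Reserve node 7 as a not-yet-configured filter; it can be completed later with add_filter().
promise = input_schema.NodePromise(
    flow_id=graph.flow_id,
    node_id=7,
    node_type="filter",
    is_setup=False,
)
graph.add_node_promise(promise)
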
add_node_step(node_id, function, input_columns=None, output_schema=None, node_type=None, drop_columns=None, renew_schema=True, setting_input=None, cache_results=None, schema_callback=None, input_node_ids=None)

The core method for adding or updating a node in the graph.

Parameters:

Name Type Description Default
node_id Union[int, str]

The unique ID for the node.

required
function Callable

The core processing function for the node.

required
input_columns List[str]

A list of input column names required by the function.

None
output_schema List[FlowfileColumn]

A predefined schema for the node's output.

None
node_type str

A string identifying the type of node (e.g., 'filter', 'join').

None
drop_columns List[str]

A list of columns to be dropped after the function executes.

None
renew_schema bool

If True, the schema is recalculated after execution.

True
setting_input Any

A configuration object containing settings for the node.

None
cache_results bool

If True, the node's results are cached for future runs.

None
schema_callback Callable

A function that dynamically calculates the output schema.

None
input_node_ids List[int]

A list of IDs for the nodes that this node depends on.

None

Returns:

Type Description
FlowNode

The created or updated FlowNode object.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_node_step(self,
                  node_id: Union[int, str],
                  function: Callable,
                  input_columns: List[str] = None,
                  output_schema: List[FlowfileColumn] = None,
                  node_type: str = None,
                  drop_columns: List[str] = None,
                  renew_schema: bool = True,
                  setting_input: Any = None,
                  cache_results: bool = None,
                  schema_callback: Callable = None,
                  input_node_ids: List[int] = None) -> FlowNode:
    """The core method for adding or updating a node in the graph.

    Args:
        node_id: The unique ID for the node.
        function: The core processing function for the node.
        input_columns: A list of input column names required by the function.
        output_schema: A predefined schema for the node's output.
        node_type: A string identifying the type of node (e.g., 'filter', 'join').
        drop_columns: A list of columns to be dropped after the function executes.
        renew_schema: If True, the schema is recalculated after execution.
        setting_input: A configuration object containing settings for the node.
        cache_results: If True, the node's results are cached for future runs.
        schema_callback: A function that dynamically calculates the output schema.
        input_node_ids: A list of IDs for the nodes that this node depends on.

    Returns:
        The created or updated FlowNode object.
    """
    existing_node = self.get_node(node_id)
    if existing_node is not None:
        if existing_node.node_type != node_type:
            self.delete_node(existing_node.node_id)
            existing_node = None
    if existing_node:
        input_nodes = existing_node.all_inputs
    elif input_node_ids is not None:
        input_nodes = [self.get_node(node_id) for node_id in input_node_ids]
    else:
        input_nodes = None
    if isinstance(input_columns, str):
        input_columns = [input_columns]
    if (
            input_nodes is not None or
            function.__name__ in ('placeholder', 'analysis_preparation') or
            node_type in ("cloud_storage_reader", "polars_lazy_frame", "input_data")
    ):
        if not existing_node:
            node = FlowNode(node_id=node_id,
                            function=function,
                            output_schema=output_schema,
                            input_columns=input_columns,
                            drop_columns=drop_columns,
                            renew_schema=renew_schema,
                            setting_input=setting_input,
                            node_type=node_type,
                            name=function.__name__,
                            schema_callback=schema_callback,
                            parent_uuid=self.uuid)
        else:
            existing_node.update_node(function=function,
                                      output_schema=output_schema,
                                      input_columns=input_columns,
                                      drop_columns=drop_columns,
                                      setting_input=setting_input,
                                      schema_callback=schema_callback)
            node = existing_node
    else:
        raise Exception("No data initialized")
    self._node_db[node_id] = node
    self._node_ids.append(node_id)
    return node
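
Usage sketch: add_node_step is what the higher-level add_* helpers call under the hood. A custom step can be registered directly with any callable that accepts and returns FlowDataEngine objects; the parameters used below are those listed in the signature above, graph is assumed to be an existing FlowGraph, and the FlowDataEngine import path is an assumption.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # import path assumed

def add_country_column(fl: FlowDataEngine) -> FlowDataEngine:
    # .data_frame exposes the underlying (lazy) Polars frame, as used throughout the source above.
    return FlowDataEngine(fl.data_frame.with_columns(pl.lit("NL").alias("country")))

graph.add_node_step(
    node_id=9,
    function=add_country_column,
    node_type="polars_code",   # label stored on the node; reusing an existing type string here
    setting_input=None,
    input_node_ids=[4],        # this step consumes the output of node 4
)
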
add_output(output_file)

Adds an output node to write the final data to a destination.

Parameters:

Name Type Description Default
output_file NodeOutput

The settings for the output file.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_output(self, output_file: input_schema.NodeOutput):
    """Adds an output node to write the final data to a destination.

    Args:
        output_file: The settings for the output file.
    """

    def _func(df: FlowDataEngine):
        output_file.output_settings.populate_abs_file_path()
        execute_remote = self.execution_location != 'local'
        df.output(output_fs=output_file.output_settings, flow_id=self.flow_id, node_id=output_file.node_id,
                  execute_remote=execute_remote)
        return df

    def schema_callback():
        input_node: FlowNode = self.get_node(output_file.node_id).node_inputs.main_inputs[0]

        return input_node.schema
    input_node_id = getattr(output_file, "depending_on_id") if hasattr(output_file, 'depending_on_id') else None
    self.add_node_step(node_id=output_file.node_id,
                       function=_func,
                       input_columns=[],
                       node_type='output',
                       setting_input=output_file,
                       schema_callback=schema_callback,
                       input_node_ids=[input_node_id])
add_pivot(pivot_settings)

Adds a pivot node to the graph.

Parameters:

Name Type Description Default
pivot_settings NodePivot

The settings for the pivot operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_pivot(self, pivot_settings: input_schema.NodePivot):
    """Adds a pivot node to the graph.

    Args:
        pivot_settings: The settings for the pivot operation.
    """

    def _func(fl: FlowDataEngine):
        return fl.do_pivot(pivot_settings.pivot_input, self.flow_logger.get_node_logger(pivot_settings.node_id))

    self.add_node_step(node_id=pivot_settings.node_id,
                       function=_func,
                       node_type='pivot',
                       setting_input=pivot_settings,
                       input_node_ids=[pivot_settings.depending_on_id])

    node = self.get_node(pivot_settings.node_id)

    def schema_callback():
        input_data = node.singular_main_input.get_resulting_data()  # get from the previous step the data
        input_data.lazy = True  # ensure the dataset is lazy
        input_lf = input_data.data_frame  # get the lazy frame
        return pre_calculate_pivot_schema(input_data.schema, pivot_settings.pivot_input, input_lf=input_lf)
    node.schema_callback = schema_callback
add_polars_code(node_polars_code)

Adds a node that executes custom Polars code.

Parameters:

Name Type Description Default
node_polars_code NodePolarsCode

The settings for the Polars code node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_polars_code(self, node_polars_code: input_schema.NodePolarsCode):
    """Adds a node that executes custom Polars code.

    Args:
        node_polars_code: The settings for the Polars code node.
    """

    def _func(*flowfile_tables: FlowDataEngine) -> FlowDataEngine:
        return execute_polars_code(*flowfile_tables, code=node_polars_code.polars_code_input.polars_code)
    self.add_node_step(node_id=node_polars_code.node_id,
                       function=_func,
                       node_type='polars_code',
                       setting_input=node_polars_code,
                       input_node_ids=node_polars_code.depending_on_ids)

    try:
        polars_code_parser.validate_code(node_polars_code.polars_code_input.polars_code)
    except Exception as e:
        node = self.get_node(node_id=node_polars_code.node_id)
        node.results.errors = str(e)
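
Usage sketch (illustrative): the NodePolarsCode constructor keywords and the nested polars_code_input model are assumptions based on the attributes read above, and the variable names available inside the code string (input_df, output_df) are assumed conventions rather than confirmed API.

from flowfile_core.schemas import input_schema  # import path assumed

polars_node = input_schema.NodePolarsCode(
    flow_id=graph.flow_id,
    node_id=6,
    depending_on_ids=[4],                              # upstream node(s) fed into the code
    polars_code_input=input_schema.PolarsCodeInput(    # model name assumed
        polars_code="output_df = input_df.filter(pl.col('population') > 0)",  # variable names assumed
    ),
)
graph.add_polars_code(polars_node)
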
add_read(input_file)

Adds a node to read data from a local file (e.g., CSV, Parquet, Excel).

Parameters:

Name Type Description Default
input_file NodeRead

The settings for the read operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_read(self, input_file: input_schema.NodeRead):
    """Adds a node to read data from a local file (e.g., CSV, Parquet, Excel).

    Args:
        input_file: The settings for the read operation.
    """

    if input_file.received_file.file_type in ('xlsx', 'excel') and input_file.received_file.sheet_name == '':
        sheet_name = fastexcel.read_excel(input_file.received_file.path).sheet_names[0]
        input_file.received_file.sheet_name = sheet_name

    received_file = input_file.received_file
    input_file.received_file.set_absolute_filepath()

    def _func():
        input_file.received_file.set_absolute_filepath()
        if input_file.received_file.file_type == 'parquet':
            input_data = FlowDataEngine.create_from_path(input_file.received_file)
        elif input_file.received_file.file_type == 'csv' and 'utf' in input_file.received_file.encoding:
            input_data = FlowDataEngine.create_from_path(input_file.received_file)
        else:
            input_data = FlowDataEngine.create_from_path_worker(input_file.received_file,
                                                                node_id=input_file.node_id,
                                                                flow_id=self.flow_id)
        input_data.name = input_file.received_file.name
        return input_data

    node = self.get_node(input_file.node_id)
    schema_callback = None
    if node:
        start_hash = node.hash
        node.node_type = 'read'
        node.name = 'read'
        node.function = _func
        node.setting_input = input_file
        if input_file.node_id not in set(start_node.node_id for start_node in self._flow_starts):
            self._flow_starts.append(node)

        if start_hash != node.hash:
            logger.info('Hash changed, updating schema')
            if len(received_file.fields) > 0:
                # If the file has fields defined, we can use them to create the schema
                def schema_callback():
                    return [FlowfileColumn.from_input(f.name, f.data_type) for f in received_file.fields]

            elif input_file.received_file.file_type in ('csv', 'json', 'parquet'):
                # everything that can be scanned by polars
                def schema_callback():
                    input_data = FlowDataEngine.create_from_path(input_file.received_file)
                    return input_data.schema

            elif input_file.received_file.file_type in ('xlsx', 'excel'):
                # If the file is an Excel file, we need to use the openpyxl engine to read the schema
                schema_callback = get_xlsx_schema_callback(engine='openpyxl',
                                                           file_path=received_file.file_path,
                                                           sheet_name=received_file.sheet_name,
                                                           start_row=received_file.start_row,
                                                           end_row=received_file.end_row,
                                                           start_column=received_file.start_column,
                                                           end_column=received_file.end_column,
                                                           has_headers=received_file.has_headers)
            else:
                schema_callback = None
    else:
        node = FlowNode(input_file.node_id, function=_func,
                        setting_input=input_file,
                        name='read', node_type='read', parent_uuid=self.uuid)
        self._node_db[input_file.node_id] = node
        self._flow_starts.append(node)
        self._node_ids.append(input_file.node_id)

    if schema_callback is not None:
        node.schema_callback = schema_callback
    return self
add_record_count(node_number_of_records)

Adds a record count node to the graph.

Parameters:

Name Type Description Default
node_number_of_records NodeRecordCount

The settings for the record count operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_record_count(self, node_number_of_records: input_schema.NodeRecordCount):
    """Adds a filter node to the graph.

    Args:
        node_number_of_records: The settings for the record count operation.
    """

    def _func(fl: FlowDataEngine) -> FlowDataEngine:
        return fl.get_record_count()

    self.add_node_step(node_id=node_number_of_records.node_id,
                       function=_func,
                       node_type='record_count',
                       setting_input=node_number_of_records,
                       input_node_ids=[node_number_of_records.depending_on_id])
add_record_id(record_id_settings)

Adds a node to create a new column with a unique ID for each record.

Parameters:

Name Type Description Default
record_id_settings NodeRecordId

The settings object specifying the name of the new record ID column.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_record_id(self, record_id_settings: input_schema.NodeRecordId) -> "FlowGraph":
    """Adds a node to create a new column with a unique ID for each record.

    Args:
        record_id_settings: The settings object specifying the name of the
            new record ID column.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(table: FlowDataEngine) -> FlowDataEngine:
        return table.add_record_id(record_id_settings.record_id_input)

    self.add_node_step(node_id=record_id_settings.node_id,
                       function=_func,
                       node_type='record_id',
                       setting_input=record_id_settings,
                       input_node_ids=[record_id_settings.depending_on_id]
                       )
    return self
add_sample(sample_settings)

Adds a node to take a random or top-N sample of the data.

Parameters:

Name Type Description Default
sample_settings NodeSample

The settings object specifying the size of the sample.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_sample(self, sample_settings: input_schema.NodeSample) -> "FlowGraph":
    """Adds a node to take a random or top-N sample of the data.

    Args:
        sample_settings: The settings object specifying the size of the sample.

    Returns:
        The `FlowGraph` instance for method chaining.
    """
    def _func(table: FlowDataEngine) -> FlowDataEngine:
        return table.get_sample(sample_settings.sample_size)

    self.add_node_step(node_id=sample_settings.node_id,
                       function=_func,
                       node_type='sample',
                       setting_input=sample_settings,
                       input_node_ids=[sample_settings.depending_on_id]
                       )
    return self
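
Usage sketch (illustrative): the NodeSample keywords are assumptions based on the attributes accessed above (sample_size, depending_on_id); graph is assumed to be an existing FlowGraph.

from flowfile_core.schemas import input_schema  # import path assumed

sample_node = input_schema.NodeSample(
    flow_id=graph.flow_id,
    node_id=8,
    depending_on_id=4,
    sample_size=1000,          # keep at most 1000 rows
)
graph.add_sample(sample_node)
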
add_select(select_settings)

Adds a node to select, rename, reorder, or drop columns.

Parameters:

Name Type Description Default
select_settings NodeSelect

The settings for the select operation.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_select(self, select_settings: input_schema.NodeSelect) -> "FlowGraph":
    """Adds a node to select, rename, reorder, or drop columns.

    Args:
        select_settings: The settings for the select operation.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    select_cols = select_settings.select_input
    drop_cols = tuple(s.old_name for s in select_settings.select_input)

    def _func(table: FlowDataEngine) -> FlowDataEngine:
        input_cols = set(f.name for f in table.schema)
        ids_to_remove = []
        for i, select_col in enumerate(select_cols):
            if select_col.data_type is None:
                select_col.data_type = table.get_schema_column(select_col.old_name).data_type
            if select_col.old_name not in input_cols:
                select_col.is_available = False
                if not select_col.keep:
                    ids_to_remove.append(i)
            else:
                select_col.is_available = True
        ids_to_remove.reverse()
        for i in ids_to_remove:
            v = select_cols.pop(i)
            del v
        return table.do_select(select_inputs=transform_schema.SelectInputs(select_cols),
                               keep_missing=select_settings.keep_missing)

    self.add_node_step(node_id=select_settings.node_id,
                       function=_func,
                       input_columns=[],
                       node_type='select',
                       drop_columns=list(drop_cols),
                       setting_input=select_settings,
                       input_node_ids=[select_settings.depending_on_id])
    return self
add_sort(sort_settings)

Adds a node to sort the data based on one or more columns.

Parameters:

Name Type Description Default
sort_settings NodeSort

The settings for the sort operation.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_sort(self, sort_settings: input_schema.NodeSort) -> "FlowGraph":
    """Adds a node to sort the data based on one or more columns.

    Args:
        sort_settings: The settings for the sort operation.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(table: FlowDataEngine) -> FlowDataEngine:
        return table.do_sort(sort_settings.sort_input)

    self.add_node_step(node_id=sort_settings.node_id,
                       function=_func,
                       node_type='sort',
                       setting_input=sort_settings,
                       input_node_ids=[sort_settings.depending_on_id])
    return self
add_sql_source(external_source_input)

Adds a node that reads data from a SQL source.

This is a convenience alias for add_external_source.

Parameters:

Name Type Description Default
external_source_input NodeExternalSource

The settings for the external SQL source node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_sql_source(self, external_source_input: input_schema.NodeExternalSource):
    """Adds a node that reads data from a SQL source.

    This is a convenience alias for `add_external_source`.

    Args:
        external_source_input: The settings for the external SQL source node.
    """
    logger.info('Adding sql source')
    self.add_external_source(external_source_input)
add_text_to_rows(node_text_to_rows)

Adds a node that splits cell values into multiple rows.

This is useful for un-nesting data where a single field contains multiple values separated by a delimiter.

Parameters:

Name Type Description Default
node_text_to_rows NodeTextToRows

The settings object that specifies the column to split and the delimiter to use.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_text_to_rows(self, node_text_to_rows: input_schema.NodeTextToRows) -> "FlowGraph":
    """Adds a node that splits cell values into multiple rows.

    This is useful for un-nesting data where a single field contains multiple
    values separated by a delimiter.

    Args:
        node_text_to_rows: The settings object that specifies the column to split
            and the delimiter to use.

    Returns:
        The `FlowGraph` instance for method chaining.
    """
    def _func(table: FlowDataEngine) -> FlowDataEngine:
        return table.split(node_text_to_rows.text_to_rows_input)

    self.add_node_step(node_id=node_text_to_rows.node_id,
                       function=_func,
                       node_type='text_to_rows',
                       setting_input=node_text_to_rows,
                       input_node_ids=[node_text_to_rows.depending_on_id])
    return self
add_union(union_settings)

Adds a union node to combine multiple data streams.

Parameters:

Name Type Description Default
union_settings NodeUnion

The settings for the union operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_union(self, union_settings: input_schema.NodeUnion):
    """Adds a union node to combine multiple data streams.

    Args:
        union_settings: The settings for the union operation.
    """

    def _func(*flowfile_tables: FlowDataEngine):
        dfs: List[pl.LazyFrame] | List[pl.DataFrame] = [flt.data_frame for flt in flowfile_tables]
        return FlowDataEngine(pl.concat(dfs, how='diagonal_relaxed'))

    self.add_node_step(node_id=union_settings.node_id,
                       function=_func,
                       node_type=f'union',
                       setting_input=union_settings,
                       input_node_ids=union_settings.depending_on_ids)
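
Usage sketch (illustrative): the NodeUnion keywords are assumptions, with depending_on_ids mirroring the attribute used above; the streams are concatenated with Polars' diagonal_relaxed strategy, so differing schemas are reconciled where possible.

from flowfile_core.schemas import input_schema  # import path assumed

union_node = input_schema.NodeUnion(
    flow_id=graph.flow_id,
    node_id=10,
    depending_on_ids=[4, 8],   # streams to stack on top of each other
)
graph.add_union(union_node)
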
add_unique(unique_settings)

Adds a node to find and remove duplicate rows.

Parameters:

Name Type Description Default
unique_settings NodeUnique

The settings for the unique operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_unique(self, unique_settings: input_schema.NodeUnique):
    """Adds a node to find and remove duplicate rows.

    Args:
        unique_settings: The settings for the unique operation.
    """

    def _func(fl: FlowDataEngine) -> FlowDataEngine:
        return fl.make_unique(unique_settings.unique_input)

    self.add_node_step(node_id=unique_settings.node_id,
                       function=_func,
                       input_columns=[],
                       node_type='unique',
                       setting_input=unique_settings,
                       input_node_ids=[unique_settings.depending_on_id])
add_unpivot(unpivot_settings)

Adds an unpivot node to the graph.

Parameters:

Name Type Description Default
unpivot_settings NodeUnpivot

The settings for the unpivot operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_unpivot(self, unpivot_settings: input_schema.NodeUnpivot):
    """Adds an unpivot node to the graph.

    Args:
        unpivot_settings: The settings for the unpivot operation.
    """

    def _func(fl: FlowDataEngine) -> FlowDataEngine:
        return fl.unpivot(unpivot_settings.unpivot_input)

    self.add_node_step(node_id=unpivot_settings.node_id,
                       function=_func,
                       node_type='unpivot',
                       setting_input=unpivot_settings,
                       input_node_ids=[unpivot_settings.depending_on_id])
apply_layout(y_spacing=150, x_spacing=200, initial_y=100)

Calculates and applies a layered layout to all nodes in the graph.

This updates their x and y positions for UI rendering.

Parameters:

Name Type Description Default
y_spacing int

The vertical spacing between layers.

150
x_spacing int

The horizontal spacing between nodes in the same layer.

200
initial_y int

The initial y-position for the first layer.

100
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def apply_layout(self, y_spacing: int = 150, x_spacing: int = 200, initial_y: int = 100):
    """Calculates and applies a layered layout to all nodes in the graph.

    This updates their x and y positions for UI rendering.

    Args:
        y_spacing: The vertical spacing between layers.
        x_spacing: The horizontal spacing between nodes in the same layer.
        initial_y: The initial y-position for the first layer.
    """
    self.flow_logger.info("Applying layered layout...")
    start_time = time()
    try:
        # Calculate new positions for all nodes
        new_positions = calculate_layered_layout(
            self, y_spacing=y_spacing, x_spacing=x_spacing, initial_y=initial_y
        )

        if not new_positions:
            self.flow_logger.warning("Layout calculation returned no positions.")
            return

        # Apply the new positions to the setting_input of each node
        updated_count = 0
        for node_id, (pos_x, pos_y) in new_positions.items():
            node = self.get_node(node_id)
            if node and hasattr(node, 'setting_input'):
                setting = node.setting_input
                if hasattr(setting, 'pos_x') and hasattr(setting, 'pos_y'):
                    setting.pos_x = pos_x
                    setting.pos_y = pos_y
                    updated_count += 1
                else:
                    self.flow_logger.warning(f"Node {node_id} setting_input ({type(setting)}) lacks pos_x/pos_y attributes.")
            elif node:
                self.flow_logger.warning(f"Node {node_id} lacks setting_input attribute.")
            # else: Node not found, already warned by calculate_layered_layout

        end_time = time()
        self.flow_logger.info(f"Layout applied to {updated_count}/{len(self.nodes)} nodes in {end_time - start_time:.2f} seconds.")

    except Exception as e:
        self.flow_logger.error(f"Error applying layout: {e}")
        raise  # Optional: re-raise the exception
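
Usage sketch: the defaults match the signature above; tighter spacing simply packs the UI layout closer together (graph is assumed to be an existing FlowGraph).

# Recompute node positions for the UI with slightly tighter spacing than the defaults.
graph.apply_layout(y_spacing=120, x_spacing=180, initial_y=80)
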
cancel()

Cancels an ongoing graph execution.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def cancel(self):
    """Cancels an ongoing graph execution."""

    if not self.flow_settings.is_running:
        return
    self.flow_settings.is_canceled = True
    for node in self.nodes:
        node.cancel()
close_flow()

Performs cleanup operations, such as clearing node caches.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def close_flow(self):
    """Performs cleanup operations, such as clearing node caches."""

    for node in self.nodes:
        node.remove_cache()
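
Usage sketch: cancel() is a no-op unless the flow is currently running, and close_flow() clears per-node caches once you are done with the graph.

# Stop a running execution (ignored if nothing is running), then release cached results.
graph.cancel()
graph.close_flow()
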
copy_node(new_node_settings, existing_setting_input, node_type)

Creates a copy of an existing node.

Parameters:

Name Type Description Default
new_node_settings NodePromise

The promise containing new settings (like ID and position).

required
existing_setting_input Any

The settings object from the node being copied.

required
node_type str

The type of the node being copied.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def copy_node(self, new_node_settings: input_schema.NodePromise, existing_setting_input: Any, node_type: str) -> None:
    """Creates a copy of an existing node.

    Args:
        new_node_settings: The promise containing new settings (like ID and position).
        existing_setting_input: The settings object from the node being copied.
        node_type: The type of the node being copied.
    """
    self.add_node_promise(new_node_settings)

    if isinstance(existing_setting_input, input_schema.NodePromise):
        return

    combined_settings = combine_existing_settings_and_new_settings(
        existing_setting_input, new_node_settings
    )
    getattr(self, f"add_{node_type}")(combined_settings)
delete_node(node_id)

Deletes a node from the graph and updates all its connections.

Parameters:

Name Type Description Default
node_id Union[int, str]

The ID of the node to delete.

required

Raises:

Type Description
Exception

If the node with the given ID does not exist.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def delete_node(self, node_id: Union[int, str]):
    """Deletes a node from the graph and updates all its connections.

    Args:
        node_id: The ID of the node to delete.

    Raises:
        Exception: If the node with the given ID does not exist.
    """
    logger.info(f"Starting deletion of node with ID: {node_id}")

    node = self._node_db.get(node_id)
    if node:
        logger.info(f"Found node: {node_id}, processing deletion")

        lead_to_steps: List[FlowNode] = node.leads_to_nodes
        logger.debug(f"Node {node_id} leads to {len(lead_to_steps)} other nodes")

        if len(lead_to_steps) > 0:
            for lead_to_step in lead_to_steps:
                logger.debug(f"Deleting input node {node_id} from dependent node {lead_to_step}")
                lead_to_step.delete_input_node(node_id, complete=True)

        if not node.is_start:
            depends_on: List[FlowNode] = node.node_inputs.get_all_inputs()
            logger.debug(f"Node {node_id} depends on {len(depends_on)} other nodes")

            for depend_on in depends_on:
                logger.debug(f"Removing lead_to reference {node_id} from node {depend_on}")
                depend_on.delete_lead_to_node(node_id)

        self._node_db.pop(node_id)
        logger.debug(f"Successfully removed node {node_id} from node_db")
        del node
        logger.info("Node object deleted")
    else:
        logger.error(f"Failed to find node with id {node_id}")
        raise Exception(f"Node with id {node_id} does not exist")
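
Usage sketch: deleting a node also detaches it from every upstream and downstream connection; a missing id raises, so guard with get_node when unsure.

# Remove node 8 and rewire its neighbours; raises if the id does not exist.
if graph.get_node(8) is not None:
    graph.delete_node(8)
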
generate_code()

Generates code for the flow graph. This method exports the flow graph to a Polars-compatible format.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def generate_code(self):
    """Generates code for the flow graph.
    This method exports the flow graph to a Polars-compatible format.
    """
    from flowfile_core.flowfile.code_generator.code_generator import export_flow_to_polars
    print(export_flow_to_polars(self))
get_frontend_data()

Formats the graph structure into a JSON-like dictionary for a specific legacy frontend.

This method transforms the graph's state into a format compatible with the Drawflow.js library.

Returns:

Type Description
dict

A dictionary representing the graph in Drawflow format.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def get_frontend_data(self) -> dict:
    """Formats the graph structure into a JSON-like dictionary for a specific legacy frontend.

    This method transforms the graph's state into a format compatible with the
    Drawflow.js library.

    Returns:
        A dictionary representing the graph in Drawflow format.
    """
    result = {
        'Home': {
            "data": {}
        }
    }
    flow_info: schemas.FlowInformation = self.get_node_storage()

    for node_id, node_info in flow_info.data.items():
        if node_info.is_setup:
            try:
                pos_x = node_info.data.pos_x
                pos_y = node_info.data.pos_y
                # Basic node structure
                result["Home"]["data"][str(node_id)] = {
                    "id": node_info.id,
                    "name": node_info.type,
                    "data": {},  # Additional data can go here
                    "class": node_info.type,
                    "html": node_info.type,
                    "typenode": "vue",
                    "inputs": {},
                    "outputs": {},
                    "pos_x": pos_x,
                    "pos_y": pos_y
                }
            except Exception as e:
                logger.error(e)
        # Add outputs to the node based on `outputs` in your backend data
        if node_info.outputs:
            outputs = {o: 0 for o in node_info.outputs}
            for o in node_info.outputs:
                outputs[o] += 1
            connections = []
            for output_node_id, n_connections in outputs.items():
                leading_to_node = self.get_node(output_node_id)
                input_types = leading_to_node.get_input_type(node_info.id)
                for input_type in input_types:
                    if input_type == 'main':
                        input_frontend_id = 'input_1'
                    elif input_type == 'right':
                        input_frontend_id = 'input_2'
                    elif input_type == 'left':
                        input_frontend_id = 'input_3'
                    else:
                        input_frontend_id = 'input_1'
                    connection = {"node": str(output_node_id), "input": input_frontend_id}
                    connections.append(connection)

            result["Home"]["data"][str(node_id)]["outputs"]["output_1"] = {
                "connections": connections}
        else:
            result["Home"]["data"][str(node_id)]["outputs"] = {"output_1": {"connections": []}}

        # Add input to the node based on `depending_on_id` in your backend data
        if node_info.left_input_id is not None or node_info.right_input_id is not None or node_info.input_ids is not None:
            main_inputs = node_info.main_input_ids
            result["Home"]["data"][str(node_id)]["inputs"]["input_1"] = {
                "connections": [{"node": str(main_node_id), "input": "output_1"} for main_node_id in main_inputs]
            }
            if node_info.right_input_id is not None:
                result["Home"]["data"][str(node_id)]["inputs"]["input_2"] = {
                    "connections": [{"node": str(node_info.right_input_id), "input": "output_1"}]
                }
            if node_info.left_input_id is not None:
                result["Home"]["data"][str(node_id)]["inputs"]["input_3"] = {
                    "connections": [{"node": str(node_info.left_input_id), "input": "output_1"}]
                }
    return result
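For orientation, the returned dictionary follows the Drawflow layout sketched below (node IDs, names, and positions are illustrative placeholders, not values from a real flow):

    {
        "Home": {
            "data": {
                "1": {
                    "id": 1,
                    "name": "manual_input",
                    "data": {},
                    "class": "manual_input",
                    "html": "manual_input",
                    "typenode": "vue",
                    "inputs": {},
                    "outputs": {"output_1": {"connections": [{"node": "2", "input": "input_1"}]}},
                    "pos_x": 100,
                    "pos_y": 200
                }
            }
        }
    }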
get_implicit_starter_nodes()

Finds nodes that can act as starting points but are not explicitly defined as such.

Some nodes, like the Polars Code node, can function without an input. This method identifies such nodes if they have no incoming connections.

Returns:

    List[FlowNode]: A list of FlowNode objects that are implicit starting nodes.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def get_implicit_starter_nodes(self) -> List[FlowNode]:
    """Finds nodes that can act as starting points but are not explicitly defined as such.

    Some nodes, like the Polars Code node, can function without an input. This
    method identifies such nodes if they have no incoming connections.

    Returns:
        A list of `FlowNode` objects that are implicit starting nodes.
    """
    starting_node_ids = [node.node_id for node in self._flow_starts]
    implicit_starting_nodes = []
    for node in self.nodes:
        if node.node_template.can_be_start and not node.has_input and node.node_id not in starting_node_ids:
            implicit_starting_nodes.append(node)
    return implicit_starting_nodes
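A quick way to inspect which nodes would be picked up as implicit starts (assuming graph is an existing FlowGraph; a Polars Code node without inputs is a typical example):

    for node in graph.get_implicit_starter_nodes():
        print(node.node_id, node.name)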
get_node(node_id=None)

Retrieves a node from the graph by its ID.

Parameters:

    node_id (Union[int, str], default None): The ID of the node to retrieve. If None, retrieves the last added node.

Returns:

    FlowNode | None: The FlowNode object, or None if not found.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def get_node(self, node_id: Union[int, str] = None) -> FlowNode | None:
    """Retrieves a node from the graph by its ID.

    Args:
        node_id: The ID of the node to retrieve. If None, retrieves the last added node.

    Returns:
        The FlowNode object, or None if not found.
    """
    if node_id is None:
        node_id = self._node_ids[-1]
    node = self._node_db.get(node_id)
    if node is not None:
        return node
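A minimal lookup sketch (assuming graph is an existing FlowGraph with at least one node):

    node = graph.get_node(1)        # fetch the node with ID 1, or None if it does not exist
    last_added = graph.get_node()   # omit the ID to fetch the most recently added node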
get_node_data(node_id, include_example=True)

Retrieves all data needed to render a node in the UI.

Parameters:

    node_id (int, required): The ID of the node.
    include_example (bool, default True): Whether to include data samples in the result.

Returns:

    NodeData: A NodeData object, or None if the node is not found.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def get_node_data(self, node_id: int, include_example: bool = True) -> NodeData:
    """Retrieves all data needed to render a node in the UI.

    Args:
        node_id: The ID of the node.
        include_example: Whether to include data samples in the result.

    Returns:
        A NodeData object, or None if the node is not found.
    """
    node = self._node_db[node_id]
    return node.get_node_data(flow_id=self.flow_id, include_example=include_example)
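A hedged example of fetching the UI payload for a node (assuming graph contains a node with ID 3; skipping the data sample keeps the call cheap). Note that the implementation indexes the internal node store directly, so passing an unknown ID raises a KeyError rather than returning None:

    node_data = graph.get_node_data(3, include_example=False)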
get_node_storage()

Serializes the entire graph's state into a storable format.

Returns:

    FlowInformation: A FlowInformation object representing the complete graph.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def get_node_storage(self) -> schemas.FlowInformation:
    """Serializes the entire graph's state into a storable format.

    Returns:
        A FlowInformation object representing the complete graph.
    """
    node_information = {node.node_id: node.get_node_information() for
                        node in self.nodes if node.is_setup and node.is_correct}

    return schemas.FlowInformation(flow_id=self.flow_id,
                                   flow_name=self.__name__,
                                   flow_settings=self.flow_settings,
                                   data=node_information,
                                   node_starts=[v.node_id for v in self._flow_starts],
                                   node_connections=self.node_connections
                                   )
get_nodes_overview()

Gets a list of dictionary representations for all nodes in the graph.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def get_nodes_overview(self):
    """Gets a list of dictionary representations for all nodes in the graph."""
    output = []
    for v in self._node_db.values():
        output.append(v.get_repr())
    return output
get_run_info()

Gets a summary of the most recent graph execution.

Returns:

    RunInformation: A RunInformation object with details about the last run.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def get_run_info(self) -> RunInformation:
    """Gets a summary of the most recent graph execution.

    Returns:
        A RunInformation object with details about the last run.
    """
    if self.latest_run_info is None:
        node_results = self.node_results
        success = all(nr.success for nr in node_results)
        self.latest_run_info = RunInformation(start_time=self.start_datetime, end_time=self.end_datetime,
                                              success=success,
                                              node_step_result=node_results, flow_id=self.flow_id,
                                              nodes_completed=self.nodes_completed,
                                              number_of_nodes=len(self.nodes))
    elif self.latest_run_info.nodes_completed != self.nodes_completed:
        node_results = self.node_results
        self.latest_run_info = RunInformation(start_time=self.start_datetime, end_time=self.end_datetime,
                                              success=all(nr.success for nr in node_results),
                                              node_step_result=node_results, flow_id=self.flow_id,
                                              nodes_completed=self.nodes_completed,
                                              number_of_nodes=len(self.nodes))
    return self.latest_run_info
get_vue_flow_input()

Formats the graph's nodes and edges into a schema suitable for the VueFlow frontend.

Returns:

    VueFlowInput: A VueFlowInput object.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def get_vue_flow_input(self) -> schemas.VueFlowInput:
    """Formats the graph's nodes and edges into a schema suitable for the VueFlow frontend.

    Returns:
        A VueFlowInput object.
    """
    edges: List[schemas.NodeEdge] = []
    nodes: List[schemas.NodeInput] = []
    for node in self.nodes:
        nodes.append(node.get_node_input())
        edges.extend(node.get_edge_input())
    return schemas.VueFlowInput(node_edges=edges, node_inputs=nodes)
print_tree(show_schema=False, show_descriptions=False)

Prints the flow graph as a tree of operations and node IDs.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def print_tree(self, show_schema=False, show_descriptions=False):
    """
    Print flow_graph as a tree.
    """
    max_node_id = max(self._node_db.keys())

    tree = ""
    tabs = 0
    tab_counter = 0
    for node in self.nodes:
        tab_counter += 1
        node_input = node.setting_input
        operation = str(self._node_db[node_input.node_id]).split("(")[1][:-1].replace("_", " ").title()

        if operation == "Formula":
            operation = "With Columns"

        tree += str(operation) + " (id=" + str(node_input.node_id) + ")"

        if show_descriptions & show_schema:
            raise ValueError('show_descriptions and show_schema cannot be True simultaneously')
        if show_descriptions:
            tree += ": " + str(node_input.description)
        elif show_schema:
            tree += " -> ["
            if operation == "Manual Input":
                schema = ", ".join([str(i.name) + ": " + str(i.data_type) for i in node_input.raw_data_format.columns])
                tree += schema
            elif operation == "With Columns":
                tree_with_col_schema = ", " + node_input.function.field.name + ": " + node_input.function.field.data_type
                tree += schema + tree_with_col_schema
            elif operation == "Filter":
                index = node_input.filter_input.advanced_filter.find("]")
                filtered_column = str(node_input.filter_input.advanced_filter[1:index])
                schema = re.sub('({str(filtered_column)}: [A-Za-z0-9]+\,\s)', "", schema)
                tree += schema
            elif operation == "Group By":
                for col in node_input.groupby_input.agg_cols:
                    schema = re.sub(str(col.old_name) + ': [a-z0-9]+\, ', "", schema)
                tree += schema
            tree += "]"
        else:
            if operation == "Manual Input":
                tree += ": " + str(node_input.raw_data_format.data)
            elif operation == "With Columns":
                tree += ": " + str(node_input.function)
            elif operation == "Filter":
                tree += ": " + str(node_input.filter_input.advanced_filter)
            elif operation == "Group By":
                tree += ": groupby=[" + ", ".join([col.old_name for col in node_input.groupby_input.agg_cols if col.agg == "groupby"]) + "], "
                tree += "agg=[" + ", ".join([str(col.agg) + "(" + str(col.old_name) + ")" for col in node_input.groupby_input.agg_cols if col.agg != "groupby"]) + "]"

        if node_input.node_id < max_node_id:
            tree += "\n" + "# " + " "*3*(tabs-1) + "|___ "
        print("\n"*2)

    return print(tree)
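Example calls (assuming graph is a configured FlowGraph; show_schema and show_descriptions are mutually exclusive and raise a ValueError when combined):

    graph.print_tree()                        # operations and node IDs only
    graph.print_tree(show_descriptions=True)  # include node descriptions
    graph.print_tree(show_schema=True)        # include inferred column schemas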
remove_from_output_cols(columns)

Removes specified columns from the list of expected output columns.

Parameters:

    columns (List[str], required): A list of column names to remove.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def remove_from_output_cols(self, columns: List[str]):
    """Removes specified columns from the list of expected output columns.

    Args:
        columns: A list of column names to remove.
    """
    cols = set(columns)
    self._output_cols = [c for c in self._output_cols if c not in cols]
reset()

Forces a deep reset on all nodes in the graph.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def reset(self):
    """Forces a deep reset on all nodes in the graph."""

    for node in self.nodes:
        node.reset(True)
run_graph()

Executes the entire data flow graph from start to finish.

It determines the correct execution order, runs each node, collects results, and handles errors and cancellations.

Returns:

    RunInformation | None: A RunInformation object summarizing the execution results.

Raises:

    Exception: If the flow is already running.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def run_graph(self) -> RunInformation | None:
    """Executes the entire data flow graph from start to finish.

    It determines the correct execution order, runs each node,
    collects results, and handles errors and cancellations.

    Returns:
        A RunInformation object summarizing the execution results.

    Raises:
        Exception: If the flow is already running.
    """
    if self.flow_settings.is_running:
        raise Exception('Flow is already running')
    try:
        self.flow_settings.is_running = True
        self.flow_settings.is_canceled = False
        self.flow_logger.clear_log_file()
        self.nodes_completed = 0
        self.node_results = []
        self.start_datetime = datetime.datetime.now()
        self.end_datetime = None
        self.latest_run_info = None
        self.flow_logger.info('Starting to run flowfile flow...')
        skip_nodes = [node for node in self.nodes if not node.is_correct]
        skip_nodes.extend([lead_to_node for node in skip_nodes for lead_to_node in node.leads_to_nodes])
        execution_order = determine_execution_order(all_nodes=[node for node in self.nodes if
                                                               node not in skip_nodes],
                                                    flow_starts=self._flow_starts+self.get_implicit_starter_nodes())
        skip_node_message(self.flow_logger, skip_nodes)
        execution_order_message(self.flow_logger, execution_order)
        performance_mode = self.flow_settings.execution_mode == 'Performance'
        if self.flow_settings.execution_location == 'local':
            OFFLOAD_TO_WORKER.value = False
        elif self.flow_settings.execution_location == 'remote':
            OFFLOAD_TO_WORKER.value = True
        for node in execution_order:
            node_logger = self.flow_logger.get_node_logger(node.node_id)
            if self.flow_settings.is_canceled:
                self.flow_logger.info('Flow canceled')
                break
            if node in skip_nodes:
                node_logger.info(f'Skipping node {node.node_id}')
                continue
            node_result = NodeResult(node_id=node.node_id, node_name=node.name)
            self.node_results.append(node_result)
            logger.info(f'Starting to run: node {node.node_id}, start time: {node_result.start_timestamp}')
            node.execute_node(run_location=self.flow_settings.execution_location,
                              performance_mode=performance_mode,
                              node_logger=node_logger)
            try:
                node_result.error = str(node.results.errors)
                if self.flow_settings.is_canceled:
                    node_result.success = None
                    node_result.is_running = False
                    continue
                node_result.success = node.results.errors is None
                node_result.end_timestamp = time()
                node_result.run_time = int(node_result.end_timestamp - node_result.start_timestamp)
                node_result.is_running = False
            except Exception as e:
                node_result.error = 'Node did not run'
                node_result.success = False
                node_result.end_timestamp = time()
                node_result.run_time = int(node_result.end_timestamp - node_result.start_timestamp)
                node_result.is_running = False
                node_logger.error(f'Error in node {node.node_id}: {e}')
            if not node_result.success:
                skip_nodes.extend(list(node.get_all_dependent_nodes()))
            node_logger.info(f'Completed node with success: {node_result.success}')
            self.nodes_completed += 1
        self.flow_logger.info('Flow completed!')
        self.end_datetime = datetime.datetime.now()
        self.flow_settings.is_running = False
        if self.flow_settings.is_canceled:
            self.flow_logger.info('Flow canceled')
        return self.get_run_info()
    except Exception as e:
        raise e
    finally:
        self.flow_settings.is_running = False
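A minimal end-to-end sketch (assuming graph is a fully configured FlowGraph; run_graph raises if the flow is already running and returns the same summary exposed by get_run_info):

    run_info = graph.run_graph()
    print(run_info.success, run_info.nodes_completed, run_info.number_of_nodes)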
save_flow(flow_path)

Saves the current state of the flow graph to a file.

Parameters:

    flow_path (str, required): The path where the flow file will be saved.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def save_flow(self, flow_path: str):
    """Saves the current state of the flow graph to a file.

    Args:
        flow_path: The path where the flow file will be saved.
    """
    with open(flow_path, 'wb') as f:
        pickle.dump(self.get_node_storage(), f)
    self.flow_settings.path = flow_path
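Saving is a straightforward pickle of the serialized graph state (the .flowfile extension below is only an illustrative convention, not a requirement):

    graph.save_flow("my_pipeline.flowfile")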

FlowNode

The FlowNode represents a single operation in the FlowGraph. Each node corresponds to a specific transformation or action, such as filtering or grouping data.

flowfile_core.flowfile.flow_node.flow_node.FlowNode

Represents a single node in a data flow graph.

This class manages the node's state, its data processing function, and its connections to other nodes within the graph.

Methods:

Name Description
__call__

Makes the node instance callable, acting as an alias for execute_node.

__init__

Initializes a FlowNode instance.

__repr__

Provides a string representation of the FlowNode instance.

add_lead_to_in_depend_source

Ensures this node is registered in the leads_to_nodes list of its inputs.

add_node_connection

Adds a connection from a source node to this node.

calculate_hash

Calculates a hash based on settings and input node hashes.

cancel

Cancels an ongoing external process if one is running.

create_schema_callback_from_function

Wraps a node's function to create a schema callback that extracts the schema.

delete_input_node

Removes a connection from a specific input node.

delete_lead_to_node

Removes a connection to a specific downstream node.

evaluate_nodes

Triggers a state reset for all directly connected downstream nodes.

execute_full_local

Executes the node's logic locally, including example data generation.

execute_local

Executes the node's logic locally.

execute_node

Orchestrates the execution, handling location, caching, and retries.

execute_remote

Executes the node's logic remotely or handles cached results.

get_all_dependent_node_ids

Yields the IDs of all downstream nodes recursively.

get_all_dependent_nodes

Yields all downstream nodes recursively.

get_edge_input

Generates NodeEdge objects for all input connections to this node.

get_flow_file_column_schema

Retrieves the schema for a specific column from the output schema.

get_input_type

Gets the type of connection ('main', 'left', 'right') for a given input node ID.

get_node_data

Gathers all necessary data for representing the node in the UI.

get_node_information

Updates and returns the node's information object.

get_node_input

Creates a NodeInput schema object for representing this node in the UI.

get_output_data

Gets the full output data sample for this node.

get_predicted_resulting_data

Creates a FlowDataEngine instance based on the predicted schema.

get_predicted_schema

Predicts the output schema of the node without full execution.

get_repr

Gets a detailed dictionary representation of the node's state.

get_resulting_data

Executes the node's function to produce the actual output data.

get_table_example

Generates a TableExample model summarizing the node's output.

needs_reset

Checks if the node's hash has changed, indicating an outdated state.

needs_run

Determines if the node needs to be executed.

post_init

Initializes or resets the node's attributes to their default states.

prepare_before_run

Resets results and errors before a new execution.

print

Helper method to log messages with node context.

remove_cache

Removes cached results for this node.

reset

Resets the node's execution state and schema information.

set_node_information

Populates the node_information attribute with the current state.

store_example_data_generator

Stores a generator function for fetching a sample of the result data.

update_node

Updates the properties of the node.

Attributes:

Name Type Description
all_inputs List[FlowNode]

Gets a list of all nodes connected to any input port.

function Callable

Gets the core processing function of the node.

has_input bool

Checks if this node has any input connections.

has_next_step bool

Checks if this node has any downstream connections.

hash str

Gets the cached hash for the node, calculating it if it doesn't exist.

is_correct bool

Checks if the node's input connections satisfy its template requirements.

is_setup bool

Checks if the node has been properly configured and is ready for execution.

is_start bool

Determines if the node is a starting node in the flow.

left_input Optional[FlowNode]

Gets the node connected to the left input port.

main_input List[FlowNode]

Gets the list of nodes connected to the main input port(s).

name str

Gets the name of the node.

node_id Union[str, int]

Gets the unique identifier of the node.

number_of_leads_to_nodes int | None

Counts the number of downstream node connections.

right_input Optional[FlowNode]

Gets the node connected to the right input port.

schema List[FlowfileColumn]

Gets the definitive output schema of the node.

schema_callback SingleExecutionFuture

Gets the schema callback function, creating one if it doesn't exist.

setting_input Any

Gets the node's specific configuration settings.

singular_input bool

Checks if the node template specifies exactly one input.

singular_main_input FlowNode

Gets the input node, assuming it is a single-input type.

state_needs_reset bool

Checks if the node's state needs to be reset.
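
A few of these properties can be inspected directly once a node exists in a graph (a hedged sketch; node here is assumed to be a FlowNode obtained via FlowGraph.get_node):

    node = graph.get_node(1)
    print(node.node_id, node.name)          # identifier and display name
    print(node.is_setup, node.is_correct)   # configuration and connection checks
    print(len(node.schema))                 # number of columns in the (predicted or actual) output schema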

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
class FlowNode:
    """Represents a single node in a data flow graph.

    This class manages the node's state, its data processing function,
    and its connections to other nodes within the graph.
    """
    parent_uuid: str
    node_type: str
    node_template: node_interface.NodeTemplate
    node_default: schemas.NodeDefault
    node_schema: NodeSchemaInformation
    node_inputs: NodeStepInputs
    node_stats: NodeStepStats
    node_settings: NodeStepSettings
    results: NodeResults
    node_information: Optional[schemas.NodeInformation] = None
    leads_to_nodes: List["FlowNode"] = []  # list with target flows, after execution the step will trigger those step(s)
    user_provided_schema_callback: Optional[Callable] = None  # user provided callback function for schema calculation
    _setting_input: Any = None
    _hash: Optional[str] = None  # host this for caching results
    _function: Callable = None  # the function that needs to be executed when triggered
    _name: str = None  # name of the node, used for display
    _schema_callback: Optional[SingleExecutionFuture] = None  # Function that calculates the schema without executing
    _state_needs_reset: bool = False
    _fetch_cached_df: Optional[ExternalDfFetcher | ExternalDatabaseFetcher | ExternalDatabaseWriter | ExternalCloudWriter] = None
    _cache_progress: Optional[ExternalDfFetcher | ExternalDatabaseFetcher | ExternalDatabaseWriter | ExternalCloudWriter] = None

    def __init__(self, node_id: Union[str, int], function: Callable,
                 parent_uuid: str,
                 setting_input: Any,
                 name: str,
                 node_type: str,
                 input_columns: List[str] = None,
                 output_schema: List[FlowfileColumn] = None,
                 drop_columns: List[str] = None,
                 renew_schema: bool = True,
                 pos_x: float = 0,
                 pos_y: float = 0,
                 schema_callback: Callable = None,
                 ):
        """Initializes a FlowNode instance.

        Args:
            node_id: Unique identifier for the node.
            function: The core data processing function for the node.
            parent_uuid: The UUID of the parent flow.
            setting_input: The configuration/settings object for the node.
            name: The name of the node.
            node_type: The type identifier of the node (e.g., 'join', 'filter').
            input_columns: List of column names expected as input.
            output_schema: The schema of the columns to be added.
            drop_columns: List of column names to be dropped.
            renew_schema: Flag to indicate if the schema should be renewed.
            pos_x: The x-coordinate on the canvas.
            pos_y: The y-coordinate on the canvas.
            schema_callback: A custom function to calculate the output schema.
        """
        self._name = None
        self.parent_uuid = parent_uuid
        self.post_init()
        self.active = True
        self.node_information.id = node_id
        self.node_type = node_type
        self.node_settings.renew_schema = renew_schema
        self.update_node(function=function,
                         input_columns=input_columns,
                         output_schema=output_schema,
                         drop_columns=drop_columns,
                         setting_input=setting_input,
                         name=name,
                         pos_x=pos_x,
                         pos_y=pos_y,
                         schema_callback=schema_callback,
                         )

    def post_init(self):
        """Initializes or resets the node's attributes to their default states."""
        self.node_inputs = NodeStepInputs()
        self.node_stats = NodeStepStats()
        self.node_settings = NodeStepSettings()
        self.node_schema = NodeSchemaInformation()
        self.results = NodeResults()
        self.node_information = schemas.NodeInformation()
        self.leads_to_nodes = []
        self._setting_input = None
        self._cache_progress = None
        self._schema_callback = None
        self._state_needs_reset = False

    @property
    def state_needs_reset(self) -> bool:
        """Checks if the node's state needs to be reset.

        Returns:
            True if a reset is required, False otherwise.
        """
        return self._state_needs_reset

    @state_needs_reset.setter
    def state_needs_reset(self, v: bool):
        """Sets the flag indicating that the node's state needs to be reset.

        Args:
            v: The boolean value to set.
        """
        self._state_needs_reset = v

    @staticmethod
    def create_schema_callback_from_function(f: Callable) -> Callable[[], List[FlowfileColumn]]:
        """Wraps a node's function to create a schema callback that extracts the schema.

        Args:
            f: The node's core function that returns a FlowDataEngine instance.

        Returns:
            A callable that, when executed, returns the output schema.
        """
        def schema_callback() -> List[FlowfileColumn]:
            try:
                logger.info('Executing the schema callback function based on the node function')
                return f().schema
            except Exception as e:
                logger.warning(f'Error with the schema callback: {e}')
                return []
        return schema_callback

    @property
    def schema_callback(self) -> SingleExecutionFuture:
        """Gets the schema callback function, creating one if it doesn't exist.

        The callback is used for predicting the output schema without full execution.

        Returns:
            A SingleExecutionFuture instance wrapping the schema function.
        """
        if self._schema_callback is None:
            if self.user_provided_schema_callback is not None:
                self.schema_callback = self.user_provided_schema_callback
            elif self.is_start:
                self.schema_callback = self.create_schema_callback_from_function(self._function)
        return self._schema_callback

    @schema_callback.setter
    def schema_callback(self, f: Callable):
        """Sets the schema callback function for the node.

        Args:
            f: The function to be used for schema calculation.
        """
        if f is None:
            return

        def error_callback(e: Exception) -> List:
            logger.warning(e)

            self.node_settings.setup_errors = True
            return []

        self._schema_callback = SingleExecutionFuture(f, error_callback)

    @property
    def is_start(self) -> bool:
        """Determines if the node is a starting node in the flow.

        A starting node requires no inputs.

        Returns:
            True if the node is a start node, False otherwise.
        """
        return not self.has_input and self.node_template.input == 0

    def get_input_type(self, node_id: int) -> List:
        """Gets the type of connection ('main', 'left', 'right') for a given input node ID.

        Args:
            node_id: The ID of the input node.

        Returns:
            A list of connection types for that node ID.
        """
        relation_type = []
        if node_id in [n.node_id for n in self.node_inputs.main_inputs]:
            relation_type.append('main')
        if self.node_inputs.left_input is not None and node_id == self.node_inputs.left_input.node_id:
            relation_type.append('left')
        if self.node_inputs.right_input is not None and node_id == self.node_inputs.right_input.node_id:
            relation_type.append('right')
        return list(set(relation_type))

    def update_node(self,
                    function: Callable,
                    input_columns: List[str] = None,
                    output_schema: List[FlowfileColumn] = None,
                    drop_columns: List[str] = None,
                    name: str = None,
                    setting_input: Any = None,
                    pos_x: float = 0,
                    pos_y: float = 0,
                    schema_callback: Callable = None,
                    ):
        """Updates the properties of the node.

        This is called during initialization and when settings are changed.

        Args:
            function: The new core data processing function.
            input_columns: The new list of input columns.
            output_schema: The new schema of added columns.
            drop_columns: The new list of dropped columns.
            name: The new name for the node.
            setting_input: The new settings object.
            pos_x: The new x-coordinate.
            pos_y: The new y-coordinate.
            schema_callback: The new custom schema callback function.
        """
        self.user_provided_schema_callback = schema_callback
        self.node_information.y_position = int(pos_y)
        self.node_information.x_position = int(pos_x)
        self.node_information.setting_input = setting_input
        self.name = self.node_type if name is None else name
        self._function = function

        self.node_schema.input_columns = [] if input_columns is None else input_columns
        self.node_schema.output_columns = [] if output_schema is None else output_schema
        self.node_schema.drop_columns = [] if drop_columns is None else drop_columns
        self.node_settings.renew_schema = True
        if hasattr(setting_input, 'cache_results'):
            self.node_settings.cache_results = setting_input.cache_results

        self.results.errors = None
        self.add_lead_to_in_depend_source()
        _ = self.hash
        self.node_template = node_interface.node_dict.get(self.node_type)
        if self.node_template is None:
            raise Exception(f'Node template {self.node_type} not found')
        self.node_default = node_interface.node_defaults.get(self.node_type)
        self.setting_input = setting_input  # wait until the end so that the hash is calculated correctly

    @property
    def name(self) -> str:
        """Gets the name of the node.

        Returns:
            The node's name.
        """
        return self._name

    @name.setter
    def name(self, name: str):
        """Sets the name of the node.

        Args:
            name: The new name.
        """
        self._name = name
        self.__name__ = name

    @property
    def setting_input(self) -> Any:
        """Gets the node's specific configuration settings.

        Returns:
            The settings object.
        """
        return self._setting_input

    @setting_input.setter
    def setting_input(self, setting_input: Any):
        """Sets the node's configuration and triggers a reset if necessary.

        Args:
            setting_input: The new settings object.
        """
        is_manual_input = (self.node_type == 'manual_input' and
                           isinstance(setting_input, input_schema.NodeManualInput) and
                           isinstance(self._setting_input, input_schema.NodeManualInput)
                           )
        if is_manual_input:
            _ = self.hash
        self._setting_input = setting_input
        self.set_node_information()
        if is_manual_input:
            if self.hash != self.calculate_hash(setting_input) or not self.node_stats.has_run_with_current_setup:
                self.function = FlowDataEngine(setting_input.raw_data_format)
                self.reset()
                self.get_predicted_schema()
        elif self._setting_input is not None:
            self.reset()

    @property
    def node_id(self) -> Union[str, int]:
        """Gets the unique identifier of the node.

        Returns:
            The node's ID.
        """
        return self.node_information.id

    @property
    def left_input(self) -> Optional["FlowNode"]:
        """Gets the node connected to the left input port.

        Returns:
            The left input FlowNode, or None.
        """
        return self.node_inputs.left_input

    @property
    def right_input(self) -> Optional["FlowNode"]:
        """Gets the node connected to the right input port.

        Returns:
            The right input FlowNode, or None.
        """
        return self.node_inputs.right_input

    @property
    def main_input(self) -> List["FlowNode"]:
        """Gets the list of nodes connected to the main input port(s).

        Returns:
            A list of main input FlowNodes.
        """
        return self.node_inputs.main_inputs

    @property
    def is_correct(self) -> bool:
        """Checks if the node's input connections satisfy its template requirements.

        Returns:
            True if connections are valid, False otherwise.
        """
        if isinstance(self.setting_input, input_schema.NodePromise):
            return False
        return (self.node_template.input == len(self.node_inputs.get_all_inputs()) or
                (self.node_template.multi and len(self.node_inputs.get_all_inputs()) > 0) or
                (self.node_template.multi and self.node_template.can_be_start))

    def set_node_information(self):
        """Populates the `node_information` attribute with the current state.

        This includes the node's connections, settings, and position.
        """
        logger.info('setting node information')
        node_information = self.node_information
        node_information.left_input_id = self.node_inputs.left_input.node_id if self.left_input else None
        node_information.right_input_id = self.node_inputs.right_input.node_id if self.right_input else None
        node_information.input_ids = [mi.node_id for mi in
                                      self.node_inputs.main_inputs] if self.node_inputs.main_inputs is not None else None
        node_information.setting_input = self.setting_input
        node_information.outputs = [n.node_id for n in self.leads_to_nodes]
        node_information.is_setup = self.is_setup
        node_information.x_position = self.setting_input.pos_x
        node_information.y_position = self.setting_input.pos_y
        node_information.type = self.node_type

    def get_node_information(self) -> schemas.NodeInformation:
        """Updates and returns the node's information object.

        Returns:
            The `NodeInformation` object for this node.
        """
        self.set_node_information()
        return self.node_information

    @property
    def function(self) -> Callable:
        """Gets the core processing function of the node.

        Returns:
            The callable function.
        """
        return self._function

    @function.setter
    def function(self, function: Callable):
        """Sets the core processing function of the node.

        Args:
            function: The new callable function.
        """
        self._function = function

    @property
    def all_inputs(self) -> List["FlowNode"]:
        """Gets a list of all nodes connected to any input port.

        Returns:
            A list of all input FlowNodes.
        """
        return self.node_inputs.get_all_inputs()

    def calculate_hash(self, setting_input: Any) -> str:
        """Calculates a hash based on settings and input node hashes.

        Args:
            setting_input: The node's settings object to be included in the hash.

        Returns:
            A string hash value.
        """
        depends_on_hashes = [_node.hash for _node in self.all_inputs]
        node_data_hash = get_hash(setting_input)
        return get_hash(depends_on_hashes + [node_data_hash, self.parent_uuid])

    @property
    def hash(self) -> str:
        """Gets the cached hash for the node, calculating it if it doesn't exist.

        Returns:
            The string hash value.
        """
        if not self._hash:
            self._hash = self.calculate_hash(self.setting_input)
        return self._hash

    def add_node_connection(self, from_node: "FlowNode",
                            insert_type: Literal['main', 'left', 'right'] = 'main') -> None:
        """Adds a connection from a source node to this node.

        Args:
            from_node: The node to connect from.
            insert_type: The type of input to connect to ('main', 'left', 'right').

        Raises:
            Exception: If the insert_type is invalid.
        """
        from_node.leads_to_nodes.append(self)
        if insert_type == 'main':
            if self.node_template.input <= 2 or self.node_inputs.main_inputs is None:
                self.node_inputs.main_inputs = [from_node]
            else:
                self.node_inputs.main_inputs.append(from_node)
        elif insert_type == 'right':
            self.node_inputs.right_input = from_node
        elif insert_type == 'left':
            self.node_inputs.left_input = from_node
        else:
            raise Exception('Cannot find the connection')
        if self.setting_input.is_setup:
            if hasattr(self.setting_input, 'depending_on_id') and insert_type == 'main':
                self.setting_input.depending_on_id = from_node.node_id
        self.reset()
        from_node.reset()

    def evaluate_nodes(self, deep: bool = False) -> None:
        """Triggers a state reset for all directly connected downstream nodes.

        Args:
            deep: If True, the reset propagates recursively through the entire downstream graph.
        """
        for node in self.leads_to_nodes:
            self.print(f'resetting node: {node.node_id}')
            node.reset(deep)

    def get_flow_file_column_schema(self, col_name: str) -> FlowfileColumn | None:
        """Retrieves the schema for a specific column from the output schema.

        Args:
            col_name: The name of the column.

        Returns:
            The FlowfileColumn object for that column, or None if not found.
        """
        for s in self.schema:
            if s.column_name == col_name:
                return s

    def get_predicted_schema(self, force: bool = False) -> List[FlowfileColumn] | None:
        """Predicts the output schema of the node without full execution.

        It uses the schema_callback or infers from predicted data.

        Args:
            force: If True, forces recalculation even if a predicted schema exists.

        Returns:
            A list of FlowfileColumn objects representing the predicted schema.
        """
        if self.node_schema.predicted_schema and not force:
            return self.node_schema.predicted_schema
        if self.schema_callback is not None and (self.node_schema.predicted_schema is None or force):
            self.print('Getting the data from a schema callback')
            if force:
                # Force the schema callback to reset, so that it will be executed again
                self.schema_callback.reset()
            schema = self.schema_callback()
            if schema is not None and len(schema) > 0:
                self.print('Calculating the schema based on the schema callback')
                self.node_schema.predicted_schema = schema
                return self.node_schema.predicted_schema
        predicted_data = self._predicted_data_getter()
        if predicted_data is not None and predicted_data.schema is not None:
            self.print('Calculating the schema based on the predicted resulting data')
            self.node_schema.predicted_schema = self._predicted_data_getter().schema
        return self.node_schema.predicted_schema

    @property
    def is_setup(self) -> bool:
        """Checks if the node has been properly configured and is ready for execution.

        Returns:
            True if the node is set up, False otherwise.
        """
        if not self.node_information.is_setup:
            if self.function.__name__ != 'placeholder':
                self.node_information.is_setup = True
                self.setting_input.is_setup = True
        return self.node_information.is_setup

    def print(self, v: Any):
        """Helper method to log messages with node context.

        Args:
            v: The message or value to log.
        """
        logger.info(f'{self.node_type}, node_id: {self.node_id}: {v}')

    def get_resulting_data(self) -> FlowDataEngine | None:
        """Executes the node's function to produce the actual output data.

        Handles both regular functions and external data sources.

        Returns:
            A FlowDataEngine instance containing the result, or None on error.

        Raises:
            Exception: Propagates exceptions from the node's function execution.
        """
        if self.is_setup:
            if self.results.resulting_data is None and self.results.errors is None:
                self.print('getting resulting data')
                try:
                    if isinstance(self.function, FlowDataEngine):
                        fl: FlowDataEngine = self.function
                    elif self.node_type == 'external_source':
                        fl: FlowDataEngine = self.function()
                        fl.collect_external()
                        self.node_settings.streamable = False
                    else:
                        try:
                            fl = self._function(*[v.get_resulting_data() for v in self.all_inputs])
                        except Exception as e:
                            raise e
                    fl.set_streamable(self.node_settings.streamable)
                    self.results.resulting_data = fl
                    self.node_schema.result_schema = fl.schema
                except Exception as e:
                    self.results.resulting_data = FlowDataEngine()
                    self.results.errors = str(e)
                    self.node_stats.has_run_with_current_setup = False
                    self.node_stats.has_completed_last_run = False
                    raise e
            return self.results.resulting_data

    def _predicted_data_getter(self) -> FlowDataEngine | None:
        """Internal helper to get a predicted data result.

        This calls the function with predicted data from input nodes.

        Returns:
            A FlowDataEngine instance with predicted data, or an empty one on error.
        """
        try:
            fl = self._function(*[v.get_predicted_resulting_data() for v in self.all_inputs])
            return fl
        except ValueError as e:
            if str(e) == "generator already executing":
                logger.info('Generator already executing, waiting for the result')
                sleep(1)
                return self._predicted_data_getter()
            fl = FlowDataEngine()
            return fl

        except Exception as e:
            logger.warning('there was an issue with the function, returning an empty Flowfile')
            logger.warning(e)

    def get_predicted_resulting_data(self) -> FlowDataEngine:
        """Creates a `FlowDataEngine` instance based on the predicted schema.

        This avoids executing the node's full logic.

        Returns:
            A FlowDataEngine instance with a schema but no data.
        """
        if self.needs_run(False) and self.schema_callback is not None or self.node_schema.result_schema is not None:
            self.print('Getting data based on the schema')

            _s = self.schema_callback() if self.node_schema.result_schema is None else self.node_schema.result_schema
            return FlowDataEngine.create_from_schema(_s)
        else:
            if isinstance(self.function, FlowDataEngine):
                fl = self.function
            else:
                fl = FlowDataEngine.create_from_schema(self.get_predicted_schema())
            return fl

    def add_lead_to_in_depend_source(self):
        """Ensures this node is registered in the `leads_to_nodes` list of its inputs."""
        for input_node in self.all_inputs:
            if self.node_id not in [n.node_id for n in input_node.leads_to_nodes]:
                input_node.leads_to_nodes.append(self)

    def get_all_dependent_nodes(self) -> Generator["FlowNode", None, None]:
        """Yields all downstream nodes recursively.

        Returns:
            A generator of all dependent FlowNode objects.
        """
        for node in self.leads_to_nodes:
            yield node
            for n in node.get_all_dependent_nodes():
                yield n

    def get_all_dependent_node_ids(self) -> Generator[int, None, None]:
        """Yields the IDs of all downstream nodes recursively.

        Returns:
            A generator of all dependent node IDs.
        """
        for node in self.leads_to_nodes:
            yield node.node_id
            for n in node.get_all_dependent_node_ids():
                yield n

    @property
    def schema(self) -> List[FlowfileColumn]:
        """Gets the definitive output schema of the node.

        If not already run, it falls back to the predicted schema.

        Returns:
            A list of FlowfileColumn objects.
        """
        try:
            if self.is_setup and self.results.errors is None:
                if self.node_schema.result_schema is not None and len(self.node_schema.result_schema) > 0:
                    return self.node_schema.result_schema
                elif self.node_type == 'output':
                    if len(self.node_inputs.main_inputs) > 0:
                        self.node_schema.result_schema = self.node_inputs.main_inputs[0].schema
                else:
                    self.node_schema.result_schema = self.get_predicted_schema()
                return self.node_schema.result_schema
            else:
                return []
        except Exception as e:
            logger.error(e)
            return []

    def remove_cache(self):
        """Removes cached results for this node.

        Note: Currently not fully implemented.
        """

        if results_exists(self.hash):
            logger.warning('Not implemented')

    def needs_run(self, performance_mode: bool, node_logger: NodeLogger = None,
                  execution_location: schemas.ExecutionLocationsLiteral = "auto") -> bool:
        """Determines if the node needs to be executed.

        The decision is based on its run state, caching settings, and execution mode.

        Args:
            performance_mode: True if the flow is in performance mode.
            node_logger: The logger instance for this node.
            execution_location: The target execution location.

        Returns:
            True if the node should be run, False otherwise.
        """
        if execution_location == "local" or SINGLE_FILE_MODE:
            return False

        flow_logger = logger if node_logger is None else node_logger
        cache_result_exists = results_exists(self.hash)
        if not self.node_stats.has_run_with_current_setup:
            flow_logger.info('Node has not run, needs to run')
            return True
        if self.node_settings.cache_results and cache_result_exists:
            return False
        elif self.node_settings.cache_results and not cache_result_exists:
            return True
        elif not performance_mode and cache_result_exists:
            return False
        else:
            return True

    def __call__(self, *args, **kwargs):
        """Makes the node instance callable, acting as an alias for execute_node."""
        self.execute_node(*args, **kwargs)

    def execute_full_local(self, performance_mode: bool = False) -> None:
        """Executes the node's logic locally, including example data generation.

        Args:
            performance_mode: If True, skips generating example data.

        Raises:
            Exception: Propagates exceptions from the execution.
        """
        def example_data_generator():
            example_data = None

            def get_example_data():
                nonlocal example_data
                if example_data is None:
                    example_data = resulting_data.get_sample(100).to_arrow()
                return example_data
            return get_example_data
        resulting_data = self.get_resulting_data()

        if not performance_mode:
            self.results.example_data_generator = example_data_generator()
            self.node_schema.result_schema = self.results.resulting_data.schema
            self.node_stats.has_completed_last_run = True

    def execute_local(self, flow_id: int, performance_mode: bool = False):
        """Executes the node's logic locally.

        Args:
            flow_id: The ID of the parent flow.
            performance_mode: If True, skips generating example data.

        Raises:
            Exception: Propagates exceptions from the execution.
        """
        try:
            resulting_data = self.get_resulting_data()
            if not performance_mode:
                external_sampler = ExternalSampler(lf=resulting_data.data_frame, file_ref=self.hash,
                                                   wait_on_completion=True, node_id=self.node_id, flow_id=flow_id)
                self.store_example_data_generator(external_sampler)
                if self.results.errors is None and not self.node_stats.is_canceled:
                    self.node_stats.has_run_with_current_setup = True
            self.node_schema.result_schema = resulting_data.schema

        except Exception as e:
            logger.warning(f"Error with step {self.__name__}")
            logger.error(str(e))
            self.results.errors = str(e)
            self.node_stats.has_run_with_current_setup = False
            self.node_stats.has_completed_last_run = False
            raise e

        if self.node_stats.has_run_with_current_setup:
            for step in self.leads_to_nodes:
                if not self.node_settings.streamable:
                    step.node_settings.streamable = self.node_settings.streamable

    def execute_remote(self, performance_mode: bool = False, node_logger: NodeLogger = None):
        """Executes the node's logic remotely or handles cached results.

        Args:
            performance_mode: If True, skips generating example data.
            node_logger: The logger for this node execution.

        Raises:
            Exception: If the node_logger is not provided or if execution fails.
        """
        if node_logger is None:
            raise Exception('Node logger is not defined')
        if self.node_settings.cache_results and results_exists(self.hash):
            try:
                self.results.resulting_data = get_external_df_result(self.hash)
                self._cache_progress = None
                return
            except Exception as e:
                node_logger.warning('Failed to read the cache, rerunning the code')
        if self.node_type == 'output':
            self.results.resulting_data = self.get_resulting_data()
            self.node_stats.has_run_with_current_setup = True
            return
        try:
            self.get_resulting_data()
        except Exception as e:
            self.results.errors = 'Error with creating the lazy frame, most likely due to invalid graph'
            raise e
        if not performance_mode:
            external_df_fetcher = ExternalDfFetcher(lf=self.get_resulting_data().data_frame,
                                                    file_ref=self.hash, wait_on_completion=False,
                                                    flow_id=node_logger.flow_id,
                                                    node_id=self.node_id)
            self._fetch_cached_df = external_df_fetcher
            try:
                lf = external_df_fetcher.get_result()
                self.results.resulting_data = FlowDataEngine(
                    lf, number_of_records=ExternalDfFetcher(lf=lf, operation_type='calculate_number_of_records',
                                                            flow_id=node_logger.flow_id, node_id=self.node_id).result
                )
                if not performance_mode:
                    self.store_example_data_generator(external_df_fetcher)
                    self.node_stats.has_run_with_current_setup = True

            except Exception as e:
                node_logger.error('Error with external process')
                if external_df_fetcher.error_code == -1:
                    try:
                        self.results.resulting_data = self.get_resulting_data()
                        self.results.warnings = ('Error with external process (unknown error), '
                                                 'likely the process was killed by the server because of memory constraints, '
                                                 'continue with the process. '
                                                 'We cannot display example data...')
                    except Exception as e:
                        self.results.errors = str(e)
                        raise e
                elif external_df_fetcher.error_description is None:
                    self.results.errors = str(e)
                    raise e
                else:
                    self.results.errors = external_df_fetcher.error_description
                    raise Exception(external_df_fetcher.error_description)
            finally:
                self._fetch_cached_df = None

    def prepare_before_run(self):
        """Resets results and errors before a new execution."""

        self.results.errors = None
        self.results.resulting_data = None
        self.results.example_data = None

    def cancel(self):
        """Cancels an ongoing external process if one is running."""

        if self._fetch_cached_df is not None:
            self._fetch_cached_df.cancel()
            self.node_stats.is_canceled = True
        else:
            logger.warning('No external process to cancel')
        self.node_stats.is_canceled = True

    def execute_node(self, run_location: schemas.ExecutionLocationsLiteral, reset_cache: bool = False,
                     performance_mode: bool = False, retry: bool = True, node_logger: NodeLogger = None):
        """Orchestrates the execution, handling location, caching, and retries.

        Args:
            run_location: The location for execution ('local', 'remote').
            reset_cache: If True, forces removal of any existing cache.
            performance_mode: If True, optimizes for speed over diagnostics.
            retry: If True, allows retrying execution on recoverable errors.
            node_logger: The logger for this node execution.

        Raises:
            Exception: If the node_logger is not defined.
        """
        if node_logger is None:
            raise Exception('Flow logger is not defined')
        # node_logger = flow_logger.get_node_logger(self.node_id)
        if reset_cache:
            self.remove_cache()
            self.node_stats.has_run_with_current_setup = False
            self.node_stats.has_completed_last_run = False
        if self.is_setup:
            node_logger.info(f'Starting to run {self.__name__}')
            if (self.needs_run(performance_mode, node_logger, run_location) or self.node_template.node_group == "output"
                    and not (run_location == 'local' or SINGLE_FILE_MODE)):
                self.prepare_before_run()
                try:
                    if ((run_location == 'remote' or (self.node_default.transform_type == 'wide')
                            and not run_location == 'local')) or self.node_settings.cache_results:
                        node_logger.info('Running the node remotely')
                        if self.node_settings.cache_results:
                            performance_mode = False
                        self.execute_remote(performance_mode=(performance_mode if not self.node_settings.cache_results
                                                              else False),
                                            node_logger=node_logger
                                            )
                    else:
                        node_logger.info('Running the node locally')
                        self.execute_local(performance_mode=performance_mode, flow_id=node_logger.flow_id)
                except Exception as e:
                    if 'No such file or directory (os error' in str(e) and retry:
                        logger.warning('Error with the input node, starting to rerun the input node...')
                        all_inputs: List[FlowNode] = self.node_inputs.get_all_inputs()
                        for node_input in all_inputs:
                            node_input.execute_node(run_location=run_location,
                                                    performance_mode=performance_mode, retry=True,
                                                    reset_cache=True,
                                                    node_logger=node_logger)
                        self.execute_node(run_location=run_location,
                                          performance_mode=performance_mode, retry=False,
                                          node_logger=node_logger)
                    else:
                        self.results.errors = str(e)
                        node_logger.error(f'Error with running the node: {e}')
            elif ((run_location == 'local' or SINGLE_FILE_MODE) and
                  (not self.node_stats.has_run_with_current_setup or self.node_template.node_group == "output")):
                try:
                    node_logger.info('Executing fully locally')
                    self.execute_full_local(performance_mode)
                except Exception as e:
                    self.results.errors = str(e)
                    node_logger.error(f'Error with running the node: {e}')
                    self.node_stats.error = str(e)
                    self.node_stats.has_completed_last_run = False
                self.node_stats.has_run_with_current_setup = True
            else:
                node_logger.info('Node has already run, not running the node')
        else:
            node_logger.warning(f'Node {self.__name__} is not setup, cannot run the node')

    def store_example_data_generator(self, external_df_fetcher: ExternalDfFetcher | ExternalSampler):
        """Stores a generator function for fetching a sample of the result data.

        Args:
            external_df_fetcher: The process that generated the sample data.
        """
        if external_df_fetcher.status is not None:
            file_ref = external_df_fetcher.status.file_ref
            self.results.example_data_path = file_ref
            self.results.example_data_generator = get_read_top_n(file_path=file_ref, n=100)
        else:
            logger.error('Could not get the sample data, the external process is not ready')

    def needs_reset(self) -> bool:
        """Checks if the node's hash has changed, indicating an outdated state.

        Returns:
            True if the calculated hash differs from the stored hash.
        """
        return self._hash != self.calculate_hash(self.setting_input)

    def reset(self, deep: bool = False):
        """Resets the node's execution state and schema information.

        This also triggers a reset on all downstream nodes.

        Args:
            deep: If True, forces a reset even if the hash hasn't changed.
        """
        needs_reset = self.needs_reset() or deep
        if needs_reset:
            logger.info(f'{self.node_id}: Node needs reset')
            self.node_stats.has_run_with_current_setup = False
            self.results.reset()
            if self.is_correct:
                self._schema_callback = None  # Ensure the schema callback is reset
                if self.schema_callback:
                    logger.info(f'{self.node_id}: Resetting the schema callback')
                    self.schema_callback.start()
            self.node_schema.result_schema = None
            self.node_schema.predicted_schema = None
            self._hash = None
            self.node_information.is_setup = None
            self.results.errors = None
            self.evaluate_nodes()
            _ = self.hash  # Recalculate the hash after reset

    def delete_lead_to_node(self, node_id: int) -> bool:
        """Removes a connection to a specific downstream node.

        Args:
            node_id: The ID of the downstream node to disconnect.

        Returns:
            True if the connection was found and removed, False otherwise.
        """
        logger.info(f'Deleting lead to node: {node_id}')
        for i, lead_to_node in enumerate(self.leads_to_nodes):
            logger.info(f'Checking lead to node: {lead_to_node.node_id}')
            if lead_to_node.node_id == node_id:
                logger.info(f'Found the node to delete: {node_id}')
                self.leads_to_nodes.pop(i)
                return True
        return False

    def delete_input_node(self, node_id: int, connection_type: input_schema.InputConnectionClass = 'input-0',
                          complete: bool = False) -> bool:
        """Removes a connection from a specific input node.

        Args:
            node_id: The ID of the input node to disconnect.
            connection_type: The specific input handle (e.g., 'input-0', 'input-1').
            complete: If True, tries to delete from all input types.

        Returns:
            True if a connection was found and removed, False otherwise.
        """
        deleted: bool = False
        if connection_type == 'input-0':
            for i, node in enumerate(self.node_inputs.main_inputs):
                if node.node_id == node_id:
                    self.node_inputs.main_inputs.pop(i)
                    deleted = True
                    if not complete:
                        continue
        elif connection_type == 'input-1' or complete:
            if self.node_inputs.right_input is not None and self.node_inputs.right_input.node_id == node_id:
                self.node_inputs.right_input = None
                deleted = True
        elif connection_type == 'input-2' or complete:
            if self.node_inputs.left_input is not None and self.node_inputs.right_input.node_id == node_id:
                self.node_inputs.left_input = None
                deleted = True
        else:
            logger.warning('Could not find the connection to delete...')
        if deleted:
            self.reset()
        return deleted

    def __repr__(self) -> str:
        """Provides a string representation of the FlowNode instance.

        Returns:
            A string showing the node's ID and type.
        """
        return f"Node id: {self.node_id} ({self.node_type})"

    def _get_readable_schema(self) -> List[dict] | None:
        """Helper to get a simplified, dictionary representation of the output schema.

        Returns:
            A list of dictionaries, each with 'column_name' and 'data_type'.
        """
        if self.is_setup:
            output = []
            for s in self.schema:
                output.append(dict(column_name=s.column_name, data_type=s.data_type))
            return output

    def get_repr(self) -> dict:
        """Gets a detailed dictionary representation of the node's state.

        Returns:
            A dictionary containing key information about the node.
        """
        return dict(FlowNode=
                    dict(node_id=self.node_id,
                         step_name=self.__name__,
                         output_columns=self.node_schema.output_columns,
                         output_schema=self._get_readable_schema()))

    @property
    def number_of_leads_to_nodes(self) -> int | None:
        """Counts the number of downstream node connections.

        Returns:
            The number of nodes this node leads to.
        """
        if self.is_setup:
            return len(self.leads_to_nodes)

    @property
    def has_next_step(self) -> bool:
        """Checks if this node has any downstream connections.

        Returns:
            True if it has at least one downstream node.
        """
        return len(self.leads_to_nodes) > 0

    @property
    def has_input(self) -> bool:
        """Checks if this node has any input connections.

        Returns:
            True if it has at least one input node.
        """
        return len(self.all_inputs) > 0

    @property
    def singular_input(self) -> bool:
        """Checks if the node template specifies exactly one input.

        Returns:
            True if the node is a single-input type.
        """
        return self.node_template.input == 1

    @property
    def singular_main_input(self) -> "FlowNode":
        """Gets the input node, assuming it is a single-input type.

        Returns:
            The single input FlowNode, or None.
        """
        if self.singular_input:
            return self.all_inputs[0]

    def get_table_example(self, include_data: bool = False) -> TableExample | None:
        """Generates a `TableExample` model summarizing the node's output.

        This can optionally include a sample of the data.

        Args:
            include_data: If True, includes a data sample in the result.

        Returns:
            A `TableExample` object, or None if the node is not set up.
        """
        self.print('Getting a table example')
        if self.is_setup and include_data and self.node_stats.has_completed_last_run:

            if self.node_template.node_group == 'output':
                self.print('getting the table example')
                return self.main_input[0].get_table_example(include_data)

            logger.info('getting the table example since the node has run')
            example_data_getter = self.results.example_data_generator
            if example_data_getter is not None:
                data = example_data_getter().to_pylist()
                if data is None:
                    data = []
            else:
                data = []
            schema = [FileColumn.model_validate(c.get_column_repr()) for c in self.schema]
            fl = self.get_resulting_data()
            return TableExample(node_id=self.node_id,
                                name=str(self.node_id), number_of_records=999,
                                number_of_columns=fl.number_of_fields,
                                table_schema=schema, columns=fl.columns, data=data)
        else:
            logger.warning('getting the table example but the node has not run')
            try:
                schema = [FileColumn.model_validate(c.get_column_repr()) for c in self.schema]
            except Exception as e:
                logger.warning(e)
                schema = []
            columns = [s.name for s in schema]
            return TableExample(node_id=self.node_id,
                                name=str(self.node_id), number_of_records=0,
                                number_of_columns=len(columns),
                                table_schema=schema, columns=columns,
                                data=[])

    def get_node_data(self, flow_id: int, include_example: bool = False) -> NodeData:
        """Gathers all necessary data for representing the node in the UI.

        Args:
            flow_id: The ID of the parent flow.
            include_example: If True, includes data samples.

        Returns:
            A `NodeData` object.
        """
        node = NodeData(flow_id=flow_id,
                        node_id=self.node_id,
                        has_run=self.node_stats.has_run_with_current_setup,
                        setting_input=self.setting_input,
                        flow_type=self.node_type)
        if self.main_input:
            node.main_input = self.main_input[0].get_table_example()
        if self.left_input:
            node.left_input = self.left_input.get_table_example()
        if self.right_input:
            node.right_input = self.right_input.get_table_example()
        if self.is_setup:
            node.main_output = self.get_table_example(include_example)
        node = setting_generator.get_setting_generator(self.node_type)(node)

        node = setting_updator.get_setting_updator(self.node_type)(node)
        return node

    def get_output_data(self) -> TableExample:
        """Gets the full output data sample for this node.

        Returns:
            A `TableExample` object with data.
        """
        return self.get_table_example(True)

    def get_node_input(self) -> schemas.NodeInput:
        """Creates a `NodeInput` schema object for representing this node in the UI.

        Returns:
            A `NodeInput` object.
        """
        return schemas.NodeInput(pos_y=self.setting_input.pos_y,
                                 pos_x=self.setting_input.pos_x,
                                 id=self.node_id,
                                 **self.node_template.__dict__)

    def get_edge_input(self) -> List[schemas.NodeEdge]:
        """Generates `NodeEdge` objects for all input connections to this node.

        Returns:
            A list of `NodeEdge` objects.
        """
        edges = []
        if self.node_inputs.main_inputs is not None:
            for i, main_input in enumerate(self.node_inputs.main_inputs):
                edges.append(schemas.NodeEdge(id=f'{main_input.node_id}-{self.node_id}-{i}',
                                              source=main_input.node_id,
                                              target=self.node_id,
                                              sourceHandle='output-0',
                                              targetHandle='input-0',
                                              ))
        if self.node_inputs.left_input is not None:
            edges.append(schemas.NodeEdge(id=f'{self.node_inputs.left_input.node_id}-{self.node_id}-right',
                                          source=self.node_inputs.left_input.node_id,
                                          target=self.node_id,
                                          sourceHandle='output-0',
                                          targetHandle='input-2',
                                          ))
        if self.node_inputs.right_input is not None:
            edges.append(schemas.NodeEdge(id=f'{self.node_inputs.right_input.node_id}-{self.node_id}-left',
                                          source=self.node_inputs.right_input.node_id,
                                          target=self.node_id,
                                          sourceHandle='output-0',
                                          targetHandle='input-1',
                                          ))
        return edges
all_inputs property

Gets a list of all nodes connected to any input port.

Returns:

Type Description
List[FlowNode]

A list of all input FlowNodes.

function property writable

Gets the core processing function of the node.

Returns:

Type Description
Callable

The callable function.

has_input property

Checks if this node has any input connections.

Returns:

Type Description
bool

True if it has at least one input node.

has_next_step property

Checks if this node has any downstream connections.

Returns:

Type Description
bool

True if it has at least one downstream node.

hash property

Gets the cached hash for the node, calculating it if it doesn't exist.

Returns:

Type Description
str

The string hash value.

is_correct property

Checks if the node's input connections satisfy its template requirements.

Returns:

Type Description
bool

True if connections are valid, False otherwise.

is_setup property

Checks if the node has been properly configured and is ready for execution.

Returns:

Type Description
bool

True if the node is set up, False otherwise.

is_start property

Determines if the node is a starting node in the flow.

A starting node requires no inputs.

Returns:

Type Description
bool

True if the node is a start node, False otherwise.

left_input property

Gets the node connected to the left input port.

Returns:

Type Description
Optional[FlowNode]

The left input FlowNode, or None.

main_input property

Gets the list of nodes connected to the main input port(s).

Returns:

Type Description
List[FlowNode]

A list of main input FlowNodes.

name property writable

Gets the name of the node.

Returns:

Type Description
str

The node's name.

node_id property

Gets the unique identifier of the node.

Returns:

Type Description
Union[str, int]

The node's ID.

number_of_leads_to_nodes property

Counts the number of downstream node connections.

Returns:

Type Description
int | None

The number of nodes this node leads to.

right_input property

Gets the node connected to the right input port.

Returns:

Type Description
Optional[FlowNode]

The right input FlowNode, or None.

schema property

Gets the definitive output schema of the node.

If not already run, it falls back to the predicted schema.

Returns:

Type Description
List[FlowfileColumn]

A list of FlowfileColumn objects.

schema_callback property writable

Gets the schema callback function, creating one if it doesn't exist.

The callback is used for predicting the output schema without full execution.

Returns:

Type Description
SingleExecutionFuture

A SingleExecutionFuture instance wrapping the schema function.

setting_input property writable

Gets the node's specific configuration settings.

Returns:

Type Description
Any

The settings object.

singular_input property

Checks if the node template specifies exactly one input.

Returns:

Type Description
bool

True if the node is a single-input type.

singular_main_input property

Gets the input node, assuming it is a single-input type.

Returns:

Type Description
FlowNode

The single input FlowNode, or None.

state_needs_reset property writable

Checks if the node's state needs to be reset.

Returns:

Type Description
bool

True if a reset is required, False otherwise.

__call__(*args, **kwargs)

Makes the node instance callable, acting as an alias for execute_node.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def __call__(self, *args, **kwargs):
    """Makes the node instance callable, acting as an alias for execute_node."""
    self.execute_node(*args, **kwargs)
__init__(node_id, function, parent_uuid, setting_input, name, node_type, input_columns=None, output_schema=None, drop_columns=None, renew_schema=True, pos_x=0, pos_y=0, schema_callback=None)

Initializes a FlowNode instance.

Parameters:

Name Type Description Default
node_id Union[str, int]

Unique identifier for the node.

required
function Callable

The core data processing function for the node.

required
parent_uuid str

The UUID of the parent flow.

required
setting_input Any

The configuration/settings object for the node.

required
name str

The name of the node.

required
node_type str

The type identifier of the node (e.g., 'join', 'filter').

required
input_columns List[str]

List of column names expected as input.

None
output_schema List[FlowfileColumn]

The schema of the columns to be added.

None
drop_columns List[str]

List of column names to be dropped.

None
renew_schema bool

Flag to indicate if the schema should be renewed.

True
pos_x float

The x-coordinate on the canvas.

0
pos_y float

The y-coordinate on the canvas.

0
schema_callback Callable

A custom function to calculate the output schema.

None
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def __init__(self, node_id: Union[str, int], function: Callable,
             parent_uuid: str,
             setting_input: Any,
             name: str,
             node_type: str,
             input_columns: List[str] = None,
             output_schema: List[FlowfileColumn] = None,
             drop_columns: List[str] = None,
             renew_schema: bool = True,
             pos_x: float = 0,
             pos_y: float = 0,
             schema_callback: Callable = None,
             ):
    """Initializes a FlowNode instance.

    Args:
        node_id: Unique identifier for the node.
        function: The core data processing function for the node.
        parent_uuid: The UUID of the parent flow.
        setting_input: The configuration/settings object for the node.
        name: The name of the node.
        node_type: The type identifier of the node (e.g., 'join', 'filter').
        input_columns: List of column names expected as input.
        output_schema: The schema of the columns to be added.
        drop_columns: List of column names to be dropped.
        renew_schema: Flag to indicate if the schema should be renewed.
        pos_x: The x-coordinate on the canvas.
        pos_y: The y-coordinate on the canvas.
        schema_callback: A custom function to calculate the output schema.
    """
    self._name = None
    self.parent_uuid = parent_uuid
    self.post_init()
    self.active = True
    self.node_information.id = node_id
    self.node_type = node_type
    self.node_settings.renew_schema = renew_schema
    self.update_node(function=function,
                     input_columns=input_columns,
                     output_schema=output_schema,
                     drop_columns=drop_columns,
                     setting_input=setting_input,
                     name=name,
                     pos_x=pos_x,
                     pos_y=pos_y,
                     schema_callback=schema_callback,
                     )
__repr__()

Provides a string representation of the FlowNode instance.

Returns:

Type Description
str

A string showing the node's ID and type.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def __repr__(self) -> str:
    """Provides a string representation of the FlowNode instance.

    Returns:
        A string showing the node's ID and type.
    """
    return f"Node id: {self.node_id} ({self.node_type})"
add_lead_to_in_depend_source()

Ensures this node is registered in the leads_to_nodes list of its inputs.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def add_lead_to_in_depend_source(self):
    """Ensures this node is registered in the `leads_to_nodes` list of its inputs."""
    for input_node in self.all_inputs:
        if self.node_id not in [n.node_id for n in input_node.leads_to_nodes]:
            input_node.leads_to_nodes.append(self)
add_node_connection(from_node, insert_type='main')

Adds a connection from a source node to this node.

Parameters:

Name Type Description Default
from_node FlowNode

The node to connect from.

required
insert_type Literal['main', 'left', 'right']

The type of input to connect to ('main', 'left', 'right').

'main'

Raises:

Type Description
Exception

If the insert_type is invalid.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def add_node_connection(self, from_node: "FlowNode",
                        insert_type: Literal['main', 'left', 'right'] = 'main') -> None:
    """Adds a connection from a source node to this node.

    Args:
        from_node: The node to connect from.
        insert_type: The type of input to connect to ('main', 'left', 'right').

    Raises:
        Exception: If the insert_type is invalid.
    """
    from_node.leads_to_nodes.append(self)
    if insert_type == 'main':
        if self.node_template.input <= 2 or self.node_inputs.main_inputs is None:
            self.node_inputs.main_inputs = [from_node]
        else:
            self.node_inputs.main_inputs.append(from_node)
    elif insert_type == 'right':
        self.node_inputs.right_input = from_node
    elif insert_type == 'left':
        self.node_inputs.left_input = from_node
    else:
        raise Exception('Cannot find the connection')
    if self.setting_input.is_setup:
        if hasattr(self.setting_input, 'depending_on_id') and insert_type == 'main':
            self.setting_input.depending_on_id = from_node.node_id
    self.reset()
    from_node.reset()
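
As an illustration, the hypothetical helper below wires a join-style node to two upstream nodes; `join_node`, `left_source`, and `right_source` are assumed to be FlowNode instances from the same FlowGraph, and the handles mirror the left/right input ports used by get_edge_input.

def wire_join_inputs(join_node, left_source, right_source):
    """Connect two upstream nodes to the left and right input ports."""
    join_node.add_node_connection(left_source, insert_type='left')
    join_node.add_node_connection(right_source, insert_type='right')
    # add_node_connection resets both sides, so schemas are re-predicted
    # the next time the flow runs.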
calculate_hash(setting_input)

Calculates a hash based on settings and input node hashes.

Parameters:

Name Type Description Default
setting_input Any

The node's settings object to be included in the hash.

required

Returns:

Type Description
str

A string hash value.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def calculate_hash(self, setting_input: Any) -> str:
    """Calculates a hash based on settings and input node hashes.

    Args:
        setting_input: The node's settings object to be included in the hash.

    Returns:
        A string hash value.
    """
    depends_on_hashes = [_node.hash for _node in self.all_inputs]
    node_data_hash = get_hash(setting_input)
    return get_hash(depends_on_hashes + [node_data_hash, self.parent_uuid])
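
Because the hash folds in the hashes of all input nodes plus the settings and parent UUID, a settings change ripples into every downstream hash once those nodes are reset. A small sketch, assuming `node` is a FlowNode (the helper name is illustrative):

def hashes_along_path(node) -> dict:
    """Map node IDs to hashes for a node and everything downstream of it."""
    return {node.node_id: node.hash,
            **{n.node_id: n.hash for n in node.get_all_dependent_nodes()}}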
cancel()

Cancels an ongoing external process if one is running.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def cancel(self):
    """Cancels an ongoing external process if one is running."""

    if self._fetch_cached_df is not None:
        self._fetch_cached_df.cancel()
        self.node_stats.is_canceled = True
    else:
        logger.warning('No external process to cancel')
    self.node_stats.is_canceled = True
create_schema_callback_from_function(f) staticmethod

Wraps a node's function to create a schema callback that extracts the schema.

Parameters:

Name Type Description Default
f Callable

The node's core function that returns a FlowDataEngine instance.

required

Returns:

Type Description
Callable[[], List[FlowfileColumn]]

A callable that, when executed, returns the output schema.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
@staticmethod
def create_schema_callback_from_function(f: Callable) -> Callable[[], List[FlowfileColumn]]:
    """Wraps a node's function to create a schema callback that extracts the schema.

    Args:
        f: The node's core function that returns a FlowDataEngine instance.

    Returns:
        A callable that, when executed, returns the output schema.
    """
    def schema_callback() -> List[FlowfileColumn]:
        try:
            logger.info('Executing the schema callback function based on the node function')
            return f().schema
        except Exception as e:
            logger.warning(f'Error with the schema callback: {e}')
            return []
    return schema_callback
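
A minimal sketch of wrapping a data-producing function into a schema callback; `load_customers` is a hypothetical stand-in for a function returning a FlowDataEngine, and the import path for FlowNode is assumed from the source location shown above.

from flowfile_core.flowfile.flow_node.flow_node import FlowNode

def load_customers():
    """Hypothetical producer that would return a FlowDataEngine."""
    raise NotImplementedError  # placeholder for a real reader or transform

schema_cb = FlowNode.create_schema_callback_from_function(load_customers)
print(schema_cb())  # [] here: the wrapped call raised, and the callback swallows the error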
delete_input_node(node_id, connection_type='input-0', complete=False)

Removes a connection from a specific input node.

Parameters:

Name Type Description Default
node_id int

The ID of the input node to disconnect.

required
connection_type InputConnectionClass

The specific input handle (e.g., 'input-0', 'input-1').

'input-0'
complete bool

If True, tries to delete from all input types.

False

Returns:

Type Description
bool

True if a connection was found and removed, False otherwise.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def delete_input_node(self, node_id: int, connection_type: input_schema.InputConnectionClass = 'input-0',
                      complete: bool = False) -> bool:
    """Removes a connection from a specific input node.

    Args:
        node_id: The ID of the input node to disconnect.
        connection_type: The specific input handle (e.g., 'input-0', 'input-1').
        complete: If True, tries to delete from all input types.

    Returns:
        True if a connection was found and removed, False otherwise.
    """
    deleted: bool = False
    if connection_type == 'input-0':
        for i, node in enumerate(self.node_inputs.main_inputs):
            if node.node_id == node_id:
                self.node_inputs.main_inputs.pop(i)
                deleted = True
                if not complete:
                    continue
    elif connection_type == 'input-1' or complete:
        if self.node_inputs.right_input is not None and self.node_inputs.right_input.node_id == node_id:
            self.node_inputs.right_input = None
            deleted = True
    elif connection_type == 'input-2' or complete:
        if self.node_inputs.left_input is not None and self.node_inputs.right_input.node_id == node_id:
            self.node_inputs.left_input = None
            deleted = True
    else:
        logger.warning('Could not find the connection to delete...')
    if deleted:
        self.reset()
    return deleted
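
For example, disconnecting an upstream node from a join's right input might look like the sketch below; `join_node` is assumed to be a FlowNode, and the handle names follow the mapping used in get_edge_input ('input-0' main, 'input-1' right, 'input-2' left).

def disconnect_right_input(join_node, upstream_id):
    """Drop the right-input connection coming from `upstream_id`, if present."""
    return join_node.delete_input_node(upstream_id, connection_type='input-1')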
delete_lead_to_node(node_id)

Removes a connection to a specific downstream node.

Parameters:

Name Type Description Default
node_id int

The ID of the downstream node to disconnect.

required

Returns:

Type Description
bool

True if the connection was found and removed, False otherwise.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def delete_lead_to_node(self, node_id: int) -> bool:
    """Removes a connection to a specific downstream node.

    Args:
        node_id: The ID of the downstream node to disconnect.

    Returns:
        True if the connection was found and removed, False otherwise.
    """
    logger.info(f'Deleting lead to node: {node_id}')
    for i, lead_to_node in enumerate(self.leads_to_nodes):
        logger.info(f'Checking lead to node: {lead_to_node.node_id}')
        if lead_to_node.node_id == node_id:
            logger.info(f'Found the node to delete: {node_id}')
            self.leads_to_nodes.pop(i)
            return True
    return False
evaluate_nodes(deep=False)

Triggers a state reset for all directly connected downstream nodes.

Parameters:

Name Type Description Default
deep bool

If True, the reset propagates recursively through the entire downstream graph.

False
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def evaluate_nodes(self, deep: bool = False) -> None:
    """Triggers a state reset for all directly connected downstream nodes.

    Args:
        deep: If True, the reset propagates recursively through the entire downstream graph.
    """
    for node in self.leads_to_nodes:
        self.print(f'resetting node: {node.node_id}')
        node.reset(deep)
execute_full_local(performance_mode=False)

Executes the node's logic locally, including example data generation.

Parameters:

Name Type Description Default
performance_mode bool

If True, skips generating example data.

False

Raises:

Type Description
Exception

Propagates exceptions from the execution.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def execute_full_local(self, performance_mode: bool = False) -> None:
    """Executes the node's logic locally, including example data generation.

    Args:
        performance_mode: If True, skips generating example data.

    Raises:
        Exception: Propagates exceptions from the execution.
    """
    def example_data_generator():
        example_data = None

        def get_example_data():
            nonlocal example_data
            if example_data is None:
                example_data = resulting_data.get_sample(100).to_arrow()
            return example_data
        return get_example_data
    resulting_data = self.get_resulting_data()

    if not performance_mode:
        self.results.example_data_generator = example_data_generator()
        self.node_schema.result_schema = self.results.resulting_data.schema
        self.node_stats.has_completed_last_run = True
execute_local(flow_id, performance_mode=False)

Executes the node's logic locally.

Parameters:

Name Type Description Default
flow_id int

The ID of the parent flow.

required
performance_mode bool

If True, skips generating example data.

False

Raises:

Type Description
Exception

Propagates exceptions from the execution.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def execute_local(self, flow_id: int, performance_mode: bool = False):
    """Executes the node's logic locally.

    Args:
        flow_id: The ID of the parent flow.
        performance_mode: If True, skips generating example data.

    Raises:
        Exception: Propagates exceptions from the execution.
    """
    try:
        resulting_data = self.get_resulting_data()
        if not performance_mode:
            external_sampler = ExternalSampler(lf=resulting_data.data_frame, file_ref=self.hash,
                                               wait_on_completion=True, node_id=self.node_id, flow_id=flow_id)
            self.store_example_data_generator(external_sampler)
            if self.results.errors is None and not self.node_stats.is_canceled:
                self.node_stats.has_run_with_current_setup = True
        self.node_schema.result_schema = resulting_data.schema

    except Exception as e:
        logger.warning(f"Error with step {self.__name__}")
        logger.error(str(e))
        self.results.errors = str(e)
        self.node_stats.has_run_with_current_setup = False
        self.node_stats.has_completed_last_run = False
        raise e

    if self.node_stats.has_run_with_current_setup:
        for step in self.leads_to_nodes:
            if not self.node_settings.streamable:
                step.node_settings.streamable = self.node_settings.streamable
execute_node(run_location, reset_cache=False, performance_mode=False, retry=True, node_logger=None)

Orchestrates the execution, handling location, caching, and retries.

Parameters:

Name Type Description Default
run_location ExecutionLocationsLiteral

The location for execution ('local', 'remote').

required
reset_cache bool

If True, forces removal of any existing cache.

False
performance_mode bool

If True, optimizes for speed over diagnostics.

False
retry bool

If True, allows retrying execution on recoverable errors.

True
node_logger NodeLogger

The logger for this node execution.

None

Raises:

Type Description
Exception

If the node_logger is not defined.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def execute_node(self, run_location: schemas.ExecutionLocationsLiteral, reset_cache: bool = False,
                 performance_mode: bool = False, retry: bool = True, node_logger: NodeLogger = None):
    """Orchestrates the execution, handling location, caching, and retries.

    Args:
        run_location: The location for execution ('local', 'remote').
        reset_cache: If True, forces removal of any existing cache.
        performance_mode: If True, optimizes for speed over diagnostics.
        retry: If True, allows retrying execution on recoverable errors.
        node_logger: The logger for this node execution.

    Raises:
        Exception: If the node_logger is not defined.
    """
    if node_logger is None:
        raise Exception('Flow logger is not defined')
    # node_logger = flow_logger.get_node_logger(self.node_id)
    if reset_cache:
        self.remove_cache()
        self.node_stats.has_run_with_current_setup = False
        self.node_stats.has_completed_last_run = False
    if self.is_setup:
        node_logger.info(f'Starting to run {self.__name__}')
        if (self.needs_run(performance_mode, node_logger, run_location) or self.node_template.node_group == "output"
                and not (run_location == 'local' or SINGLE_FILE_MODE)):
            self.prepare_before_run()
            try:
                if ((run_location == 'remote' or (self.node_default.transform_type == 'wide')
                        and not run_location == 'local')) or self.node_settings.cache_results:
                    node_logger.info('Running the node remotely')
                    if self.node_settings.cache_results:
                        performance_mode = False
                    self.execute_remote(performance_mode=(performance_mode if not self.node_settings.cache_results
                                                          else False),
                                        node_logger=node_logger
                                        )
                else:
                    node_logger.info('Running the node locally')
                    self.execute_local(performance_mode=performance_mode, flow_id=node_logger.flow_id)
            except Exception as e:
                if 'No such file or directory (os error' in str(e) and retry:
                    logger.warning('Error with the input node, starting to rerun the input node...')
                    all_inputs: List[FlowNode] = self.node_inputs.get_all_inputs()
                    for node_input in all_inputs:
                        node_input.execute_node(run_location=run_location,
                                                performance_mode=performance_mode, retry=True,
                                                reset_cache=True,
                                                node_logger=node_logger)
                    self.execute_node(run_location=run_location,
                                      performance_mode=performance_mode, retry=False,
                                      node_logger=node_logger)
                else:
                    self.results.errors = str(e)
                    node_logger.error(f'Error with running the node: {e}')
        elif ((run_location == 'local' or SINGLE_FILE_MODE) and
              (not self.node_stats.has_run_with_current_setup or self.node_template.node_group == "output")):
            try:
                node_logger.info('Executing fully locally')
                self.execute_full_local(performance_mode)
            except Exception as e:
                self.results.errors = str(e)
                node_logger.error(f'Error with running the node: {e}')
                self.node_stats.error = str(e)
                self.node_stats.has_completed_last_run = False
            self.node_stats.has_run_with_current_setup = True
        else:
            node_logger.info('Node has already run, not running the node')
    else:
        node_logger.warning(f'Node {self.__name__} is not setup, cannot run the node')
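
A hedged usage sketch for driving one node by hand: `node` is assumed to be a FlowNode and `node_logger` a NodeLogger; in practice the logger is created by the surrounding FlowGraph run rather than constructed manually.

def run_single_node(node, node_logger):
    """Run one node remotely with a fresh cache and surface any error."""
    node.execute_node(run_location='remote',
                      reset_cache=True,
                      performance_mode=False,
                      node_logger=node_logger)
    if node.results.errors:
        print(f"node {node.node_id} failed: {node.results.errors}")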
execute_remote(performance_mode=False, node_logger=None)

Executes the node's logic remotely or handles cached results.

Parameters:

Name Type Description Default
performance_mode bool

If True, skips generating example data.

False
node_logger NodeLogger

The logger for this node execution.

None

Raises:

Type Description
Exception

If the node_logger is not provided or if execution fails.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def execute_remote(self, performance_mode: bool = False, node_logger: NodeLogger = None):
    """Executes the node's logic remotely or handles cached results.

    Args:
        performance_mode: If True, skips generating example data.
        node_logger: The logger for this node execution.

    Raises:
        Exception: If the node_logger is not provided or if execution fails.
    """
    if node_logger is None:
        raise Exception('Node logger is not defined')
    if self.node_settings.cache_results and results_exists(self.hash):
        try:
            self.results.resulting_data = get_external_df_result(self.hash)
            self._cache_progress = None
            return
        except Exception as e:
            node_logger.warning('Failed to read the cache, rerunning the code')
    if self.node_type == 'output':
        self.results.resulting_data = self.get_resulting_data()
        self.node_stats.has_run_with_current_setup = True
        return
    try:
        self.get_resulting_data()
    except Exception as e:
        self.results.errors = 'Error with creating the lazy frame, most likely due to invalid graph'
        raise e
    if not performance_mode:
        external_df_fetcher = ExternalDfFetcher(lf=self.get_resulting_data().data_frame,
                                                file_ref=self.hash, wait_on_completion=False,
                                                flow_id=node_logger.flow_id,
                                                node_id=self.node_id)
        self._fetch_cached_df = external_df_fetcher
        try:
            lf = external_df_fetcher.get_result()
            self.results.resulting_data = FlowDataEngine(
                lf, number_of_records=ExternalDfFetcher(lf=lf, operation_type='calculate_number_of_records',
                                                        flow_id=node_logger.flow_id, node_id=self.node_id).result
            )
            if not performance_mode:
                self.store_example_data_generator(external_df_fetcher)
                self.node_stats.has_run_with_current_setup = True

        except Exception as e:
            node_logger.error('Error with external process')
            if external_df_fetcher.error_code == -1:
                try:
                    self.results.resulting_data = self.get_resulting_data()
                    self.results.warnings = ('Error with external process (unknown error), '
                                             'likely the process was killed by the server because of memory constraints, '
                                             'continue with the process. '
                                             'We cannot display example data...')
                except Exception as e:
                    self.results.errors = str(e)
                    raise e
            elif external_df_fetcher.error_description is None:
                self.results.errors = str(e)
                raise e
            else:
                self.results.errors = external_df_fetcher.error_description
                raise Exception(external_df_fetcher.error_description)
        finally:
            self._fetch_cached_df = None
get_all_dependent_node_ids()

Yields the IDs of all downstream nodes recursively.

Returns:

Type Description
Generator[int, None, None]

A generator of all dependent node IDs.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_all_dependent_node_ids(self) -> Generator[int, None, None]:
    """Yields the IDs of all downstream nodes recursively.

    Returns:
        A generator of all dependent node IDs.
    """
    for node in self.leads_to_nodes:
        yield node.node_id
        for n in node.get_all_dependent_node_ids():
            yield n
get_all_dependent_nodes()

Yields all downstream nodes recursively.

Returns:

Type Description
Generator[FlowNode, None, None]

A generator of all dependent FlowNode objects.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_all_dependent_nodes(self) -> Generator["FlowNode", None, None]:
    """Yields all downstream nodes recursively.

    Returns:
        A generator of all dependent FlowNode objects.
    """
    for node in self.leads_to_nodes:
        yield node
        for n in node.get_all_dependent_nodes():
            yield n
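
Because the walk visits every path, a node reachable through several branches is yielded more than once; deduplicating by ID, as in this small sketch (assuming `node` is a FlowNode), gives the set of nodes affected by a change here.

def downstream_ids(node) -> set:
    """Unique IDs of all nodes downstream of `node`."""
    return set(node.get_all_dependent_node_ids())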
get_edge_input()

Generates NodeEdge objects for all input connections to this node.

Returns:

Type Description
List[NodeEdge]

A list of NodeEdge objects.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_edge_input(self) -> List[schemas.NodeEdge]:
    """Generates `NodeEdge` objects for all input connections to this node.

    Returns:
        A list of `NodeEdge` objects.
    """
    edges = []
    if self.node_inputs.main_inputs is not None:
        for i, main_input in enumerate(self.node_inputs.main_inputs):
            edges.append(schemas.NodeEdge(id=f'{main_input.node_id}-{self.node_id}-{i}',
                                          source=main_input.node_id,
                                          target=self.node_id,
                                          sourceHandle='output-0',
                                          targetHandle='input-0',
                                          ))
    if self.node_inputs.left_input is not None:
        edges.append(schemas.NodeEdge(id=f'{self.node_inputs.left_input.node_id}-{self.node_id}-right',
                                      source=self.node_inputs.left_input.node_id,
                                      target=self.node_id,
                                      sourceHandle='output-0',
                                      targetHandle='input-2',
                                      ))
    if self.node_inputs.right_input is not None:
        edges.append(schemas.NodeEdge(id=f'{self.node_inputs.right_input.node_id}-{self.node_id}-left',
                                      source=self.node_inputs.right_input.node_id,
                                      target=self.node_id,
                                      sourceHandle='output-0',
                                      targetHandle='input-1',
                                      ))
    return edges
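
A sketch of assembling the full edge list for the canvas, assuming `nodes` is an iterable of the graph's FlowNode objects (however they are obtained from the FlowGraph):

def collect_edges(nodes):
    """Flatten the per-node input edges into one list of schemas.NodeEdge."""
    edges = []
    for node in nodes:
        edges.extend(node.get_edge_input())
    return edges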
get_flow_file_column_schema(col_name)

Retrieves the schema for a specific column from the output schema.

Parameters:

Name Type Description Default
col_name str

The name of the column.

required

Returns:

Type Description
FlowfileColumn | None

The FlowfileColumn object for that column, or None if not found.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_flow_file_column_schema(self, col_name: str) -> FlowfileColumn | None:
    """Retrieves the schema for a specific column from the output schema.

    Args:
        col_name: The name of the column.

    Returns:
        The FlowfileColumn object for that column, or None if not found.
    """
    for s in self.schema:
        if s.column_name == col_name:
            return s
get_input_type(node_id)

Gets the type of connection ('main', 'left', 'right') for a given input node ID.

Parameters:

Name Type Description Default
node_id int

The ID of the input node.

required

Returns:

Type Description
List

A list of connection types for that node ID.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_input_type(self, node_id: int) -> List:
    """Gets the type of connection ('main', 'left', 'right') for a given input node ID.

    Args:
        node_id: The ID of the input node.

    Returns:
        A list of connection types for that node ID.
    """
    relation_type = []
    if node_id in [n.node_id for n in self.node_inputs.main_inputs]:
        relation_type.append('main')
    if self.node_inputs.left_input is not None and node_id == self.node_inputs.left_input.node_id:
        relation_type.append('left')
    if self.node_inputs.right_input is not None and node_id == self.node_inputs.right_input.node_id:
        relation_type.append('right')
    return list(set(relation_type))
get_node_data(flow_id, include_example=False)

Gathers all necessary data for representing the node in the UI.

Parameters:

Name Type Description Default
flow_id int

The ID of the parent flow.

required
include_example bool

If True, includes data samples.

False

Returns:

Type Description
NodeData

A NodeData object.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_node_data(self, flow_id: int, include_example: bool = False) -> NodeData:
    """Gathers all necessary data for representing the node in the UI.

    Args:
        flow_id: The ID of the parent flow.
        include_example: If True, includes data samples.

    Returns:
        A `NodeData` object.
    """
    node = NodeData(flow_id=flow_id,
                    node_id=self.node_id,
                    has_run=self.node_stats.has_run_with_current_setup,
                    setting_input=self.setting_input,
                    flow_type=self.node_type)
    if self.main_input:
        node.main_input = self.main_input[0].get_table_example()
    if self.left_input:
        node.left_input = self.left_input.get_table_example()
    if self.right_input:
        node.right_input = self.right_input.get_table_example()
    if self.is_setup:
        node.main_output = self.get_table_example(include_example)
    node = setting_generator.get_setting_generator(self.node_type)(node)

    node = setting_updator.get_setting_updator(self.node_type)(node)
    return node
get_node_information()

Updates and returns the node's information object.

Returns:

Type Description
NodeInformation

The NodeInformation object for this node.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_node_information(self) -> schemas.NodeInformation:
    """Updates and returns the node's information object.

    Returns:
        The `NodeInformation` object for this node.
    """
    self.set_node_information()
    return self.node_information
get_node_input()

Creates a NodeInput schema object for representing this node in the UI.

Returns:

Type Description
NodeInput

A NodeInput object.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_node_input(self) -> schemas.NodeInput:
    """Creates a `NodeInput` schema object for representing this node in the UI.

    Returns:
        A `NodeInput` object.
    """
    return schemas.NodeInput(pos_y=self.setting_input.pos_y,
                             pos_x=self.setting_input.pos_x,
                             id=self.node_id,
                             **self.node_template.__dict__)
get_output_data()

Gets the full output data sample for this node.

Returns:

Type Description
TableExample

A TableExample object with data.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_output_data(self) -> TableExample:
    """Gets the full output data sample for this node.

    Returns:
        A `TableExample` object with data.
    """
    return self.get_table_example(True)
get_predicted_resulting_data()

Creates a FlowDataEngine instance based on the predicted schema.

This avoids executing the node's full logic.

Returns:

Type Description
FlowDataEngine

A FlowDataEngine instance with a schema but no data.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_predicted_resulting_data(self) -> FlowDataEngine:
    """Creates a `FlowDataEngine` instance based on the predicted schema.

    This avoids executing the node's full logic.

    Returns:
        A FlowDataEngine instance with a schema but no data.
    """
    if self.needs_run(False) and self.schema_callback is not None or self.node_schema.result_schema is not None:
        self.print('Getting data based on the schema')

        _s = self.schema_callback() if self.node_schema.result_schema is None else self.node_schema.result_schema
        return FlowDataEngine.create_from_schema(_s)
    else:
        if isinstance(self.function, FlowDataEngine):
            fl = self.function
        else:
            fl = FlowDataEngine.create_from_schema(self.get_predicted_schema())
        return fl
get_predicted_schema(force=False)

Predicts the output schema of the node without full execution.

It uses the schema_callback or infers from predicted data.

Parameters:

Name Type Description Default
force bool

If True, forces recalculation even if a predicted schema exists.

False

Returns:

Type Description
List[FlowfileColumn] | None

A list of FlowfileColumn objects representing the predicted schema.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_predicted_schema(self, force: bool = False) -> List[FlowfileColumn] | None:
    """Predicts the output schema of the node without full execution.

    It uses the schema_callback or infers from predicted data.

    Args:
        force: If True, forces recalculation even if a predicted schema exists.

    Returns:
        A list of FlowfileColumn objects representing the predicted schema.
    """
    if self.node_schema.predicted_schema and not force:
        return self.node_schema.predicted_schema
    if self.schema_callback is not None and (self.node_schema.predicted_schema is None or force):
        self.print('Getting the data from a schema callback')
        if force:
            # Force the schema callback to reset, so that it will be executed again
            self.schema_callback.reset()
        schema = self.schema_callback()
        if schema is not None and len(schema) > 0:
            self.print('Calculating the schema based on the schema callback')
            self.node_schema.predicted_schema = schema
            return self.node_schema.predicted_schema
    predicted_data = self._predicted_data_getter()
    if predicted_data is not None and predicted_data.schema is not None:
        self.print('Calculating the schema based on the predicted resulting data')
        self.node_schema.predicted_schema = self._predicted_data_getter().schema
    return self.node_schema.predicted_schema
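
Schema prediction is what lets the designer display column names and types before a node has ever executed. A short sketch, again using a hypothetical graph.get_node(...) lookup for illustration:

node = graph.get_node(node_id=5)          # hypothetical lookup

predicted = node.get_predicted_schema()   # cached prediction, schema callback, or predicted data
if predicted is not None:
    print([column.column_name for column in predicted])

# After changing settings, force a recalculation instead of reusing the cached prediction.
fresh = node.get_predicted_schema(force=True)
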
get_repr()

Gets a detailed dictionary representation of the node's state.

Returns:

Type Description
dict

A dictionary containing key information about the node.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_repr(self) -> dict:
    """Gets a detailed dictionary representation of the node's state.

    Returns:
        A dictionary containing key information about the node.
    """
    return dict(FlowNode=
                dict(node_id=self.node_id,
                     step_name=self.__name__,
                     output_columns=self.node_schema.output_columns,
                     output_schema=self._get_readable_schema()))
get_resulting_data()

Executes the node's function to produce the actual output data.

Handles both regular functions and external data sources.

Returns:

Type Description
FlowDataEngine | None

A FlowDataEngine instance containing the result, or None on error.

Raises:

Type Description
Exception

Propagates exceptions from the node's function execution.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_resulting_data(self) -> FlowDataEngine | None:
    """Executes the node's function to produce the actual output data.

    Handles both regular functions and external data sources.

    Returns:
        A FlowDataEngine instance containing the result, or None on error.

    Raises:
        Exception: Propagates exceptions from the node's function execution.
    """
    if self.is_setup:
        if self.results.resulting_data is None and self.results.errors is None:
            self.print('getting resulting data')
            try:
                if isinstance(self.function, FlowDataEngine):
                    fl: FlowDataEngine = self.function
                elif self.node_type == 'external_source':
                    fl: FlowDataEngine = self.function()
                    fl.collect_external()
                    self.node_settings.streamable = False
                else:
                    try:
                        fl = self._function(*[v.get_resulting_data() for v in self.all_inputs])
                    except Exception as e:
                        raise e
                fl.set_streamable(self.node_settings.streamable)
                self.results.resulting_data = fl
                self.node_schema.result_schema = fl.schema
            except Exception as e:
                self.results.resulting_data = FlowDataEngine()
                self.results.errors = str(e)
                self.node_stats.has_run_with_current_setup = False
                self.node_stats.has_completed_last_run = False
                raise e
        return self.results.resulting_data
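
The result is computed once and cached on node.results; subsequent calls return the cached FlowDataEngine. A sketch of calling it directly and handling a failure (the node lookup is hypothetical):

node = graph.get_node(node_id=7)          # hypothetical lookup

try:
    result = node.get_resulting_data()    # runs the node's function only if no cached result exists
except Exception as exc:
    # The error is also stored on node.results.errors before being re-raised.
    print(f"node {node.node_id} failed: {exc}")
else:
    if result is not None:
        df = result.collect()             # materialise the FlowDataEngine as a Polars DataFrame
        print(df.shape)
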
get_table_example(include_data=False)

Generates a TableExample model summarizing the node's output.

This can optionally include a sample of the data.

Parameters:

Name Type Description Default
include_data bool

If True, includes a data sample in the result.

False

Returns:

Type Description
TableExample | None

A TableExample object, or None if the node is not set up.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_table_example(self, include_data: bool = False) -> TableExample | None:
    """Generates a `TableExample` model summarizing the node's output.

    This can optionally include a sample of the data.

    Args:
        include_data: If True, includes a data sample in the result.

    Returns:
        A `TableExample` object, or None if the node is not set up.
    """
    self.print('Getting a table example')
    if self.is_setup and include_data and self.node_stats.has_completed_last_run:

        if self.node_template.node_group == 'output':
            self.print('getting the table example')
            return self.main_input[0].get_table_example(include_data)

        logger.info('getting the table example since the node has run')
        example_data_getter = self.results.example_data_generator
        if example_data_getter is not None:
            data = example_data_getter().to_pylist()
            if data is None:
                data = []
        else:
            data = []
        schema = [FileColumn.model_validate(c.get_column_repr()) for c in self.schema]
        fl = self.get_resulting_data()
        return TableExample(node_id=self.node_id,
                            name=str(self.node_id), number_of_records=999,
                            number_of_columns=fl.number_of_fields,
                            table_schema=schema, columns=fl.columns, data=data)
    else:
        logger.warning('getting the table example but the node has not run')
        try:
            schema = [FileColumn.model_validate(c.get_column_repr()) for c in self.schema]
        except Exception as e:
            logger.warning(e)
            schema = []
        columns = [s.name for s in schema]
        return TableExample(node_id=self.node_id,
                            name=str(self.node_id), number_of_records=0,
                            number_of_columns=len(columns),
                            table_schema=schema, columns=columns,
                            data=[])
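
This is the method behind the data preview in the UI. A sketch of requesting a preview with sample rows (hypothetical node lookup; when the node has completed a run, data holds row dictionaries read back from the stored example data):

node = graph.get_node(node_id=4)          # hypothetical lookup

preview = node.get_table_example(include_data=True)
if preview is not None:
    print(preview.number_of_columns, "columns:", preview.columns)
    for row in preview.data[:5]:          # sample rows come from the stored example data generator
        print(row)
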
needs_reset()

Checks if the node's hash has changed, indicating an outdated state.

Returns:

Type Description
bool

True if the calculated hash differs from the stored hash.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def needs_reset(self) -> bool:
    """Checks if the node's hash has changed, indicating an outdated state.

    Returns:
        True if the calculated hash differs from the stored hash.
    """
    return self._hash != self.calculate_hash(self.setting_input)
needs_run(performance_mode, node_logger=None, execution_location='auto')

Determines if the node needs to be executed.

The decision is based on its run state, caching settings, and execution mode.

Parameters:

Name Type Description Default
performance_mode bool

True if the flow is in performance mode.

required
node_logger NodeLogger

The logger instance for this node.

None
execution_location ExecutionLocationsLiteral

The target execution location.

'auto'

Returns:

Type Description
bool

True if the node should be run, False otherwise.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def needs_run(self, performance_mode: bool, node_logger: NodeLogger = None,
              execution_location: schemas.ExecutionLocationsLiteral = "auto") -> bool:
    """Determines if the node needs to be executed.

    The decision is based on its run state, caching settings, and execution mode.

    Args:
        performance_mode: True if the flow is in performance mode.
        node_logger: The logger instance for this node.
        execution_location: The target execution location.

    Returns:
        True if the node should be run, False otherwise.
    """
    if execution_location == "local" or SINGLE_FILE_MODE:
        return False

    flow_logger = logger if node_logger is None else node_logger
    cache_result_exists = results_exists(self.hash)
    if not self.node_stats.has_run_with_current_setup:
        flow_logger.info('Node has not run, needs to run')
        return True
    if self.node_settings.cache_results and cache_result_exists:
        return False
    elif self.node_settings.cache_results and not cache_result_exists:
        return True
    elif not performance_mode and cache_result_exists:
        return False
    else:
        return True
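
In short: nothing is dispatched when running locally or in single-file mode, a node that has never run with its current setup always runs, and otherwise a cached result is reused when node caching is enabled or the flow is not in performance mode; in the remaining cases the node runs. A sketch of the typical call sequence around it (hypothetical node lookup):

node = graph.get_node(node_id=6)          # hypothetical lookup

if node.needs_run(performance_mode=False, execution_location="auto"):
    node.prepare_before_run()             # clear previous results and errors
    node.get_resulting_data()             # execute the node
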
post_init()

Initializes or resets the node's attributes to their default states.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def post_init(self):
    """Initializes or resets the node's attributes to their default states."""
    self.node_inputs = NodeStepInputs()
    self.node_stats = NodeStepStats()
    self.node_settings = NodeStepSettings()
    self.node_schema = NodeSchemaInformation()
    self.results = NodeResults()
    self.node_information = schemas.NodeInformation()
    self.leads_to_nodes = []
    self._setting_input = None
    self._cache_progress = None
    self._schema_callback = None
    self._state_needs_reset = False
prepare_before_run()

Resets results and errors before a new execution.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def prepare_before_run(self):
    """Resets results and errors before a new execution."""

    self.results.errors = None
    self.results.resulting_data = None
    self.results.example_data = None
print(v)

Helper method to log messages with node context.

Parameters:

Name Type Description Default
v Any

The message or value to log.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def print(self, v: Any):
    """Helper method to log messages with node context.

    Args:
        v: The message or value to log.
    """
    logger.info(f'{self.node_type}, node_id: {self.node_id}: {v}')
remove_cache()

Removes cached results for this node.

Note: Currently not fully implemented.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def remove_cache(self):
    """Removes cached results for this node.

    Note: Currently not fully implemented.
    """

    if results_exists(self.hash):
        logger.warning('Not implemented')
reset(deep=False)

Resets the node's execution state and schema information.

This also triggers a reset on all downstream nodes.

Parameters:

Name Type Description Default
deep bool

If True, forces a reset even if the hash hasn't changed.

False
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def reset(self, deep: bool = False):
    """Resets the node's execution state and schema information.

    This also triggers a reset on all downstream nodes.

    Args:
        deep: If True, forces a reset even if the hash hasn't changed.
    """
    needs_reset = self.needs_reset() or deep
    if needs_reset:
        logger.info(f'{self.node_id}: Node needs reset')
        self.node_stats.has_run_with_current_setup = False
        self.results.reset()
        if self.is_correct:
            self._schema_callback = None  # Ensure the schema callback is reset
            if self.schema_callback:
                logger.info(f'{self.node_id}: Resetting the schema callback')
                self.schema_callback.start()
        self.node_schema.result_schema = None
        self.node_schema.predicted_schema = None
        self._hash = None
        self.node_information.is_setup = None
        self.results.errors = None
        self.evaluate_nodes()
        _ = self.hash  # Recalculate the hash after reset
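
A reset is cheap when nothing has changed: it only does work when the node's settings hash differs from the stored one, or when deep=True is passed. A sketch (hypothetical node lookup):

node = graph.get_node(node_id=2)          # hypothetical lookup

node.reset()                              # no-op unless the settings hash has changed
node.reset(deep=True)                     # force the reset regardless of the hash
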
set_node_information()

Populates the node_information attribute with the current state.

This includes the node's connections, settings, and position.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def set_node_information(self):
    """Populates the `node_information` attribute with the current state.

    This includes the node's connections, settings, and position.
    """
    logger.info('setting node information')
    node_information = self.node_information
    node_information.left_input_id = self.node_inputs.left_input.node_id if self.left_input else None
    node_information.right_input_id = self.node_inputs.right_input.node_id if self.right_input else None
    node_information.input_ids = [mi.node_id for mi in
                                  self.node_inputs.main_inputs] if self.node_inputs.main_inputs is not None else None
    node_information.setting_input = self.setting_input
    node_information.outputs = [n.node_id for n in self.leads_to_nodes]
    node_information.is_setup = self.is_setup
    node_information.x_position = self.setting_input.pos_x
    node_information.y_position = self.setting_input.pos_y
    node_information.type = self.node_type
store_example_data_generator(external_df_fetcher)

Stores a generator function for fetching a sample of the result data.

Parameters:

Name Type Description Default
external_df_fetcher ExternalDfFetcher | ExternalSampler

The process that generated the sample data.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def store_example_data_generator(self, external_df_fetcher: ExternalDfFetcher | ExternalSampler):
    """Stores a generator function for fetching a sample of the result data.

    Args:
        external_df_fetcher: The process that generated the sample data.
    """
    if external_df_fetcher.status is not None:
        file_ref = external_df_fetcher.status.file_ref
        self.results.example_data_path = file_ref
        self.results.example_data_generator = get_read_top_n(file_path=file_ref, n=100)
    else:
        logger.error('Could not get the sample data, the external process is not ready')
update_node(function, input_columns=None, output_schema=None, drop_columns=None, name=None, setting_input=None, pos_x=0, pos_y=0, schema_callback=None)

Updates the properties of the node.

This is called during initialization and when settings are changed.

Parameters:

Name Type Description Default
function Callable

The new core data processing function.

required
input_columns List[str]

The new list of input columns.

None
output_schema List[FlowfileColumn]

The new schema of added columns.

None
drop_columns List[str]

The new list of dropped columns.

None
name str

The new name for the node.

None
setting_input Any

The new settings object.

None
pos_x float

The new x-coordinate.

0
pos_y float

The new y-coordinate.

0
schema_callback Callable

The new custom schema callback function.

None
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def update_node(self,
                function: Callable,
                input_columns: List[str] = None,
                output_schema: List[FlowfileColumn] = None,
                drop_columns: List[str] = None,
                name: str = None,
                setting_input: Any = None,
                pos_x: float = 0,
                pos_y: float = 0,
                schema_callback: Callable = None,
                ):
    """Updates the properties of the node.

    This is called during initialization and when settings are changed.

    Args:
        function: The new core data processing function.
        input_columns: The new list of input columns.
        output_schema: The new schema of added columns.
        drop_columns: The new list of dropped columns.
        name: The new name for the node.
        setting_input: The new settings object.
        pos_x: The new x-coordinate.
        pos_y: The new y-coordinate.
        schema_callback: The new custom schema callback function.
    """
    self.user_provided_schema_callback = schema_callback
    self.node_information.y_position = int(pos_y)
    self.node_information.x_position = int(pos_x)
    self.node_information.setting_input = setting_input
    self.name = self.node_type if name is None else name
    self._function = function

    self.node_schema.input_columns = [] if input_columns is None else input_columns
    self.node_schema.output_columns = [] if output_schema is None else output_schema
    self.node_schema.drop_columns = [] if drop_columns is None else drop_columns
    self.node_settings.renew_schema = True
    if hasattr(setting_input, 'cache_results'):
        self.node_settings.cache_results = setting_input.cache_results

    self.results.errors = None
    self.add_lead_to_in_depend_source()
    _ = self.hash
    self.node_template = node_interface.node_dict.get(self.node_type)
    if self.node_template is None:
        raise Exception(f'Node template {self.node_type} not found')
    self.node_default = node_interface.node_defaults.get(self.node_type)
    self.setting_input = setting_input  # wait until the end so that the hash is calculated correctly

The FlowDataEngine

The FlowDataEngine is the primary engine of the library, providing a rich API for data manipulation, I/O, and transformation. Its methods are grouped below by functionality.
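
Before the full reference, a minimal usage sketch; constructor arguments follow the __init__ signature documented below, and collect()/to_pylist() behave as summarised in the method table:

import polars as pl

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

# From plain Python records.
engine = FlowDataEngine(raw_data=[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
print(len(engine))                        # number of records

# From an existing Polars LazyFrame (stays lazy; number_of_records may be -1).
lazy_engine = FlowDataEngine(raw_data=pl.DataFrame({"x": [1, 2, 3]}).lazy())

df = engine.collect()                     # materialise as a Polars DataFrame
records = engine.to_pylist()              # or as a list of dictionaries
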

flowfile_core.flowfile.flow_data_engine.flow_data_engine.FlowDataEngine dataclass

The core data handling engine for Flowfile.

This class acts as a high-level wrapper around a Polars DataFrame or LazyFrame, providing a unified API for data ingestion, transformation, and output. It manages data state (lazy vs. eager), schema information, and execution logic.

Attributes:

Name Type Description
_data_frame Union[DataFrame, LazyFrame]

The underlying Polars DataFrame or LazyFrame.

columns List[Any]

A list of column names in the current data frame.

name str

An optional name for the data engine instance.

number_of_records int

The number of records. Can be -1 for lazy frames.

errors List

A list of errors encountered during operations.

_schema Optional[List[FlowfileColumn]]

A cached list of FlowfileColumn objects representing the schema.

Methods:

Name Description
__call__

Makes the class instance callable, returning itself.

__get_sample__

Internal method to get a sample of the data.

__getitem__

Accesses a specific column or item from the DataFrame.

__init__

Initializes the FlowDataEngine from various data sources.

__len__

Returns the number of records in the table.

__repr__

Returns a string representation of the FlowDataEngine.

add_new_values

Adds a new column with the provided values.

add_record_id

Adds a record ID (row number) column to the DataFrame.

apply_flowfile_formula

Applies a formula to create a new column or transform an existing one.

apply_sql_formula

Applies an SQL-style formula using pl.sql_expr.

assert_equal

Asserts that this DataFrame is equal to another.

cache

Caches the current DataFrame to disk and updates the internal reference.

calculate_schema

Calculates and returns the schema.

change_column_types

Changes the data type of one or more columns.

collect

Collects the data and returns it as a Polars DataFrame.

collect_external

Materializes data from a tracked external source.

concat

Concatenates this DataFrame with one or more other DataFrames.

count

Gets the total number of records.

create_from_external_source

Creates a FlowDataEngine from an external data source.

create_from_path

Creates a FlowDataEngine from a local file path.

create_from_path_worker

Creates a FlowDataEngine from a path in a worker process.

create_from_schema

Creates an empty FlowDataEngine from a schema definition.

create_from_sql

Creates a FlowDataEngine by executing a SQL query.

create_random

Creates a FlowDataEngine with randomly generated data.

do_cross_join

Performs a cross join with another DataFrame.

do_filter

Filters rows based on a predicate expression.

do_fuzzy_join

Performs a fuzzy join with another DataFrame.

do_group_by

Performs a group-by operation on the DataFrame.

do_pivot

Converts the DataFrame from a long to a wide format, aggregating values.

do_select

Performs a complex column selection, renaming, and reordering operation.

do_sort

Sorts the DataFrame by one or more columns.

drop_columns

Drops specified columns from the DataFrame.

from_cloud_storage_obj

Creates a FlowDataEngine from an object in cloud storage.

fuzzy_match

Performs a simple fuzzy match between two DataFrames on a single column pair.

generate_enumerator

Generates a FlowDataEngine with a single column containing a sequence of integers.

get_estimated_file_size

Estimates the file size in bytes if the data originated from a local file.

get_number_of_records

Gets the total number of records in the DataFrame.

get_output_sample

Gets a sample of the data as a list of dictionaries.

get_record_count

Returns a new FlowDataEngine with a single column 'number_of_records'

get_sample

Gets a sample of rows from the DataFrame.

get_schema_column

Retrieves the schema information for a single column by its name.

get_select_inputs

Gets SelectInput specifications for all columns in the current schema.

get_subset

Gets the first n_rows from the DataFrame.

initialize_empty_fl

Initializes an empty LazyFrame.

iter_batches

Iterates over the DataFrame in batches.

join

Performs a standard SQL-style join with another DataFrame.

make_unique

Gets the unique rows from the DataFrame.

output

Writes the DataFrame to an output file.

reorganize_order

Reorganizes columns into a specified order.

save

Saves the DataFrame to a file in a separate thread.

select_columns

Selects a subset of columns from the DataFrame.

set_streamable

Sets whether DataFrame operations should be streamable.

solve_graph

Solves a graph problem represented by 'from' and 'to' columns.

split

Splits a column's text values into multiple rows based on a delimiter.

start_fuzzy_join

Starts a fuzzy join operation in a background process.

to_arrow

Converts the DataFrame to a PyArrow Table.

to_cloud_storage_obj

Writes the DataFrame to an object in cloud storage.

to_dict

Converts the DataFrame to a Python dictionary of columns.

to_pylist

Converts the DataFrame to a list of Python dictionaries.

to_raw_data

Converts the DataFrame to a RawData schema object.

unpivot

Converts the DataFrame from a wide to a long format.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@dataclass
class FlowDataEngine:
    """The core data handling engine for Flowfile.

    This class acts as a high-level wrapper around a Polars DataFrame or
    LazyFrame, providing a unified API for data ingestion, transformation,
    and output. It manages data state (lazy vs. eager), schema information,
    and execution logic.

    Attributes:
        _data_frame: The underlying Polars DataFrame or LazyFrame.
        columns: A list of column names in the current data frame.
        name: An optional name for the data engine instance.
        number_of_records: The number of records. Can be -1 for lazy frames.
        errors: A list of errors encountered during operations.
        _schema: A cached list of `FlowfileColumn` objects representing the schema.
    """
    # Core attributes
    _data_frame: Union[pl.DataFrame, pl.LazyFrame]
    columns: List[Any]

    # Metadata attributes
    name: str = None
    number_of_records: int = None
    errors: List = None
    _schema: Optional[List['FlowfileColumn']] = None

    # Configuration attributes
    _optimize_memory: bool = False
    _lazy: bool = None
    _streamable: bool = True
    _calculate_schema_stats: bool = False

    # Cache and optimization attributes
    __col_name_idx_map: Dict = None
    __data_map: Dict = None
    __optimized_columns: List = None
    __sample__: str = None
    __number_of_fields: int = None
    _col_idx: Dict[str, int] = None

    # Source tracking
    _org_path: Optional[str] = None
    _external_source: Optional[ExternalDataSource] = None

    # State tracking
    sorted_by: int = None
    is_future: bool = False
    is_collected: bool = True
    ind_schema_calculated: bool = False

    # Callbacks
    _future: Future = None
    _number_of_records_callback: Callable = None
    _data_callback: Callable = None


    def __init__(self,
                 raw_data: Union[List[Dict], List[Any], Dict[str, Any], 'ParquetFile', pl.DataFrame, pl.LazyFrame, input_schema.RawData] = None,
                 path_ref: str = None,
                 name: str = None,
                 optimize_memory: bool = True,
                 schema: List['FlowfileColumn'] | List[str] | pl.Schema = None,
                 number_of_records: int = None,
                 calculate_schema_stats: bool = False,
                 streamable: bool = True,
                 number_of_records_callback: Callable = None,
                 data_callback: Callable = None):
        """Initializes the FlowDataEngine from various data sources.

        Args:
            raw_data: The input data. Can be a list of dicts, a Polars DataFrame/LazyFrame,
                or a `RawData` schema object.
            path_ref: A string path to a Parquet file.
            name: An optional name for the data engine instance.
            optimize_memory: If True, prefers lazy operations to conserve memory.
            schema: An optional schema definition. Can be a list of `FlowfileColumn` objects,
                a list of column names, or a Polars `Schema`.
            number_of_records: The number of records, if known.
            calculate_schema_stats: If True, computes detailed statistics for each column.
            streamable: If True, allows for streaming operations when possible.
            number_of_records_callback: A callback function to retrieve the number of records.
            data_callback: A callback function to retrieve the data.
        """
        self._initialize_attributes(number_of_records_callback, data_callback, streamable)

        if raw_data is not None:
            self._handle_raw_data(raw_data, number_of_records, optimize_memory)
        elif path_ref:
            self._handle_path_ref(path_ref, optimize_memory)
        else:
            self.initialize_empty_fl()
        self._finalize_initialization(name, optimize_memory, schema, calculate_schema_stats)

    def _initialize_attributes(self, number_of_records_callback, data_callback, streamable):
        """(Internal) Sets the initial default attributes for a new instance.

        This helper is called first during initialization to ensure all state-tracking
        and configuration attributes have a clean default value before data is processed.
        """
        self._external_source = None
        self._number_of_records_callback = number_of_records_callback
        self._data_callback = data_callback
        self.ind_schema_calculated = False
        self._streamable = streamable
        self._org_path = None
        self._lazy = False
        self.errors = []
        self._calculate_schema_stats = False
        self.is_collected = True
        self.is_future = False

    def _handle_raw_data(self, raw_data, number_of_records, optimize_memory):
        """(Internal) Dispatches raw data to the appropriate handler based on its type.

        This acts as a router during initialization, inspecting the type of `raw_data`
        and calling the corresponding specialized `_handle_*` method to process it.
        """
        if isinstance(raw_data, input_schema.RawData):
            self._handle_raw_data_format(raw_data)
        elif isinstance(raw_data, pl.DataFrame):
            self._handle_polars_dataframe(raw_data, number_of_records)
        elif isinstance(raw_data, pl.LazyFrame):
            self._handle_polars_lazy_frame(raw_data, number_of_records, optimize_memory)
        elif isinstance(raw_data, (list, dict)):
            self._handle_python_data(raw_data)

    def _handle_polars_dataframe(self, df: pl.DataFrame, number_of_records: Optional[int]):
        """(Internal) Initializes the engine from an eager Polars DataFrame."""
        self.data_frame = df
        self.number_of_records = number_of_records or df.select(pl.len())[0, 0]

    def _handle_polars_lazy_frame(self, lf: pl.LazyFrame, number_of_records: Optional[int], optimize_memory: bool):
        """(Internal) Initializes the engine from a Polars LazyFrame."""
        self.data_frame = lf
        self._lazy = True
        if number_of_records is not None:
            self.number_of_records = number_of_records
        elif optimize_memory:
            self.number_of_records = -1
        else:
            self.number_of_records = lf.select(pl.len()).collect()[0, 0]

    def _handle_python_data(self, data: Union[List, Dict]):
        """(Internal) Dispatches Python collections to the correct handler."""
        if isinstance(data, dict):
            self._handle_dict_input(data)
        else:
            self._handle_list_input(data)

    def _handle_dict_input(self, data: Dict):
        """(Internal) Initializes the engine from a Python dictionary."""
        if len(data) == 0:
            self.initialize_empty_fl()
        lengths = [len(v) if isinstance(v, (list, tuple)) else 1 for v in data.values()]

        if len(set(lengths)) == 1 and lengths[0] > 1:
            self.number_of_records = lengths[0]
            self.data_frame = pl.DataFrame(data)
        else:
            self.number_of_records = 1
            self.data_frame = pl.DataFrame([data])
        self.lazy = True

    def _handle_raw_data_format(self, raw_data: input_schema.RawData):
        """(Internal) Initializes the engine from a `RawData` schema object.

        This method uses the schema provided in the `RawData` object to correctly
        infer data types when creating the Polars DataFrame.

        Args:
            raw_data: An instance of `RawData` containing the data and schema.
        """
        flowfile_schema = list(FlowfileColumn.create_from_minimal_field_info(c) for c in raw_data.columns)
        polars_schema = pl.Schema([(flowfile_column.column_name, flowfile_column.get_polars_type().pl_datatype)
                                   for flowfile_column in flowfile_schema])
        try:
            df = pl.DataFrame(raw_data.data, polars_schema, strict=False)
        except TypeError as e:
            logger.warning(f"Could not parse the data with the schema:\n{e}")
            df = pl.DataFrame(raw_data.data)
        self.number_of_records = len(df)
        self.data_frame = df.lazy()
        self.lazy = True

    def _handle_list_input(self, data: List):
        """(Internal) Initializes the engine from a list of records."""
        number_of_records = len(data)
        if number_of_records > 0:
            processed_data = self._process_list_data(data)
            self.number_of_records = number_of_records
            self.data_frame = pl.DataFrame(processed_data)
            self.lazy = True
        else:
            self.initialize_empty_fl()
            self.number_of_records = 0

    @staticmethod
    def _process_list_data(data: List) -> List[Dict]:
        """(Internal) Normalizes list data into a list of dictionaries.

        Ensures that a list of objects or non-dict items is converted into a
        uniform list of dictionaries suitable for Polars DataFrame creation.
        """
        if not (isinstance(data[0], dict) or hasattr(data[0], '__dict__')):
            try:
                return pl.DataFrame(data).to_dicts()
            except TypeError:
                raise Exception('Value must be able to be converted to dictionary')
            except Exception as e:
                raise Exception(f'Value must be able to be converted to dictionary: {e}')

        if not isinstance(data[0], dict):
            data = [row.__dict__ for row in data]

        return ensure_similarity_dicts(data)

    def to_cloud_storage_obj(self, settings: cloud_storage_schemas.CloudStorageWriteSettingsInternal):
        """Writes the DataFrame to an object in cloud storage.

        This method supports writing to various cloud storage providers like AWS S3,
        Azure Data Lake Storage, and Google Cloud Storage.

        Args:
            settings: A `CloudStorageWriteSettingsInternal` object containing connection
                details, file format, and write options.

        Raises:
            ValueError: If the specified file format is not supported for writing.
            NotImplementedError: If the 'append' write mode is used with an unsupported format.
            Exception: If the write operation to cloud storage fails for any reason.
        """
        connection = settings.connection
        write_settings = settings.write_settings

        logger.info(f"Writing to {connection.storage_type} storage: {write_settings.resource_path}")

        if write_settings.write_mode == 'append' and write_settings.file_format != "delta":
            raise NotImplementedError("The 'append' write mode is not yet supported for this destination.")
        storage_options = CloudStorageReader.get_storage_options(connection)
        credential_provider = CloudStorageReader.get_credential_provider(connection)
        # Dispatch to the correct writer based on file format
        if write_settings.file_format == "parquet":
            self._write_parquet_to_cloud(
                write_settings.resource_path,
                storage_options,
                credential_provider,
                write_settings
            )
        elif write_settings.file_format == "delta":
            self._write_delta_to_cloud(
                write_settings.resource_path,
                storage_options,
                credential_provider,
                write_settings
            )
        elif write_settings.file_format == "csv":
            self._write_csv_to_cloud(
                write_settings.resource_path,
                storage_options,
                credential_provider,
                write_settings
            )
        elif write_settings.file_format == "json":
            self._write_json_to_cloud(
                write_settings.resource_path,
                storage_options,
                credential_provider,
                write_settings
            )
        else:
            raise ValueError(f"Unsupported file format for writing: {write_settings.file_format}")

        logger.info(f"Successfully wrote data to {write_settings.resource_path}")

    def _write_parquet_to_cloud(self,
                                resource_path: str,
                                storage_options: Dict[str, Any],
                                credential_provider: Optional[Callable],
                                write_settings: cloud_storage_schemas.CloudStorageWriteSettings):
        """(Internal) Writes the DataFrame to a Parquet file in cloud storage.

        Uses `sink_parquet` for efficient streaming writes. Falls back to a
        collect-then-write pattern if sinking fails.
        """
        try:
            sink_kwargs = {
                "path": resource_path,
                "compression": write_settings.parquet_compression,
            }
            if storage_options:
                sink_kwargs["storage_options"] = storage_options
            if credential_provider:
                sink_kwargs["credential_provider"] = credential_provider
            try:
                self.data_frame.sink_parquet(**sink_kwargs)
            except Exception as e:
                logger.warning(f"Failed to sink the data, falling back to collecing and writing. \n {e}")
                pl_df = self.collect()
                sink_kwargs['file'] = sink_kwargs.pop("path")
                pl_df.write_parquet(**sink_kwargs)

        except Exception as e:
            logger.error(f"Failed to write Parquet to {resource_path}: {str(e)}")
            raise Exception(f"Failed to write Parquet to cloud storage: {str(e)}")

    def _write_delta_to_cloud(self,
                              resource_path: str,
                              storage_options: Dict[str, Any],
                              credential_provider: Optional[Callable],
                              write_settings: cloud_storage_schemas.CloudStorageWriteSettings):
        """(Internal) Writes the DataFrame to a Delta Lake table in cloud storage.

        This operation requires collecting the data first, as `write_delta` operates
        on an eager DataFrame.
        """
        sink_kwargs = {
            "target": resource_path,
            "mode": write_settings.write_mode,
        }
        if storage_options:
            sink_kwargs["storage_options"] = storage_options
        if credential_provider:
            sink_kwargs["credential_provider"] = credential_provider
        self.collect().write_delta(**sink_kwargs)

    def _write_csv_to_cloud(self,
                            resource_path: str,
                            storage_options: Dict[str, Any],
                            credential_provider: Optional[Callable],
                            write_settings: cloud_storage_schemas.CloudStorageWriteSettings):
        """(Internal) Writes the DataFrame to a CSV file in cloud storage.

        Uses `sink_csv` for efficient, streaming writes of the data.
        """
        try:
            sink_kwargs = {
                "path": resource_path,
                "separator": write_settings.csv_delimiter,
            }
            if storage_options:
                sink_kwargs["storage_options"] = storage_options
            if credential_provider:
                sink_kwargs["credential_provider"] = credential_provider

            # sink_csv executes the lazy query and writes the result
            self.data_frame.sink_csv(**sink_kwargs)

        except Exception as e:
            logger.error(f"Failed to write CSV to {resource_path}: {str(e)}")
            raise Exception(f"Failed to write CSV to cloud storage: {str(e)}")

    def _write_json_to_cloud(self,
                             resource_path: str,
                             storage_options: Dict[str, Any],
                             credential_provider: Optional[Callable],
                             write_settings: cloud_storage_schemas.CloudStorageWriteSettings):
        """(Internal) Writes the DataFrame to a line-delimited JSON (NDJSON) file.

        Uses `sink_ndjson` for efficient, streaming writes.
        """
        try:
            sink_kwargs = {"path": resource_path}
            if storage_options:
                sink_kwargs["storage_options"] = storage_options
            if credential_provider:
                sink_kwargs["credential_provider"] = credential_provider
            self.data_frame.sink_ndjson(**sink_kwargs)

        except Exception as e:
            logger.error(f"Failed to write JSON to {resource_path}: {str(e)}")
            raise Exception(f"Failed to write JSON to cloud storage: {str(e)}")

    @classmethod
    def from_cloud_storage_obj(cls, settings: cloud_storage_schemas.CloudStorageReadSettingsInternal) -> "FlowDataEngine":
        """Creates a FlowDataEngine from an object in cloud storage.

        This method supports reading from various cloud storage providers like AWS S3,
        Azure Data Lake Storage, and Google Cloud Storage, with support for
        various authentication methods.

        Args:
            settings: A `CloudStorageReadSettingsInternal` object containing connection
                details, file format, and read options.

        Returns:
            A new `FlowDataEngine` instance containing the data from cloud storage.

        Raises:
            ValueError: If the storage type or file format is not supported.
            NotImplementedError: If a requested file format like "delta" or "iceberg"
                is not yet implemented.
            Exception: If reading from cloud storage fails.
        """
        connection = settings.connection
        read_settings = settings.read_settings

        logger.info(f"Reading from {connection.storage_type} storage: {read_settings.resource_path}")
        # Get storage options based on connection type
        storage_options = CloudStorageReader.get_storage_options(connection)
        # Get credential provider if needed
        credential_provider = CloudStorageReader.get_credential_provider(connection)
        if read_settings.file_format == "parquet":
            return cls._read_parquet_from_cloud(
                read_settings.resource_path,
                storage_options,
                credential_provider,
                read_settings.scan_mode == "directory",
            )
        elif read_settings.file_format == "delta":
            return cls._read_delta_from_cloud(
                read_settings.resource_path,
                storage_options,
                credential_provider,
                read_settings
            )
        elif read_settings.file_format == "csv":
            return cls._read_csv_from_cloud(
                read_settings.resource_path,
                storage_options,
                credential_provider,
                read_settings
            )
        elif read_settings.file_format == "json":
            return cls._read_json_from_cloud(
                read_settings.resource_path,
                storage_options,
                credential_provider,
                read_settings.scan_mode == "directory"
            )
        elif read_settings.file_format == "iceberg":
            return cls._read_iceberg_from_cloud(
                read_settings.resource_path,
                storage_options,
                credential_provider,
                read_settings
            )

        else:
            raise ValueError(f"Unsupported file format: {read_settings.file_format}")

    @staticmethod
    def _get_schema_from_first_file_in_dir(source: str, storage_options: Dict[str, Any],
                                           file_format: Literal["csv", "parquet", "json", "delta"]) -> List[FlowfileColumn] | None:
        """Infers the schema by scanning the first file in a cloud directory."""
        try:
            scan_func = getattr(pl, "scan_" + file_format)
            first_file_ref = get_first_file_from_s3_dir(source, storage_options=storage_options)
            return convert_stats_to_column_info(FlowDataEngine._create_schema_stats_from_pl_schema(
                scan_func(first_file_ref, storage_options=storage_options).collect_schema()))
        except Exception as e:
            logger.warning(f"Could not read schema from first file in directory, using default schema: {e}")


    @classmethod
    def _read_iceberg_from_cloud(cls,
                                 resource_path: str,
                                 storage_options: Dict[str, Any],
                                 credential_provider: Optional[Callable],
                                 read_settings: cloud_storage_schemas.CloudStorageReadSettings) -> "FlowDataEngine":
        """Reads Iceberg table(s) from cloud storage."""
        raise NotImplementedError(f"Failed to read Iceberg table from cloud storage: Not yet implemented")

    @classmethod
    def _read_parquet_from_cloud(cls,
                                 resource_path: str,
                                 storage_options: Dict[str, Any],
                                 credential_provider: Optional[Callable],
                                 is_directory: bool) -> "FlowDataEngine":
        """Reads Parquet file(s) from cloud storage."""
        try:
            # Use scan_parquet for lazy evaluation
            if is_directory:
                resource_path = ensure_path_has_wildcard_pattern(resource_path=resource_path, file_format="parquet")
            scan_kwargs = {"source": resource_path}

            if storage_options:
                scan_kwargs["storage_options"] = storage_options

            if credential_provider:
                scan_kwargs["credential_provider"] = credential_provider
            if storage_options and is_directory:
                schema = cls._get_schema_from_first_file_in_dir(resource_path, storage_options, "parquet")
            else:
                schema = None
            lf = pl.scan_parquet(**scan_kwargs)

            return cls(
                lf,
                number_of_records=6_666_666,  # Set so the provider is not accessed for this stat
                optimize_memory=True,
                streamable=True,
                schema=schema
            )

        except Exception as e:
            logger.error(f"Failed to read Parquet from {resource_path}: {str(e)}")
            raise Exception(f"Failed to read Parquet from cloud storage: {str(e)}")

    @classmethod
    def _read_delta_from_cloud(cls,
                               resource_path: str,
                               storage_options: Dict[str, Any],
                               credential_provider: Optional[Callable],
                               read_settings: cloud_storage_schemas.CloudStorageReadSettings) -> "FlowDataEngine":
        """Reads a Delta Lake table from cloud storage."""
        try:
            logger.info("Reading Delta file from cloud storage...")
            logger.info(f"read_settings: {read_settings}")
            scan_kwargs = {"source": resource_path}
            if read_settings.delta_version:
                scan_kwargs['version'] = read_settings.delta_version
            if storage_options:
                scan_kwargs["storage_options"] = storage_options
            if credential_provider:
                scan_kwargs["credential_provider"] = credential_provider
            lf = pl.scan_delta(**scan_kwargs)

            return cls(
                lf,
                number_of_records=6_666_666,  # Set so the provider is not accessed for this stat
                optimize_memory=True,
                streamable=True
            )
        except Exception as e:
            logger.error(f"Failed to read Delta file from {resource_path}: {str(e)}")
            raise Exception(f"Failed to read Delta file from cloud storage: {str(e)}")

    @classmethod
    def _read_csv_from_cloud(cls,
                             resource_path: str,
                             storage_options: Dict[str, Any],
                             credential_provider: Optional[Callable],
                             read_settings: cloud_storage_schemas.CloudStorageReadSettings) -> "FlowDataEngine":
        """Reads CSV file(s) from cloud storage."""
        try:
            scan_kwargs = {
                "source": resource_path,
                "has_header": read_settings.csv_has_header,
                "separator": read_settings.csv_delimiter,
                "encoding": read_settings.csv_encoding,
            }
            if storage_options:
                scan_kwargs["storage_options"] = storage_options
            if credential_provider:
                scan_kwargs["credential_provider"] = credential_provider

            if read_settings.scan_mode == "directory":
                resource_path = ensure_path_has_wildcard_pattern(resource_path=resource_path, file_format="csv")
                scan_kwargs["source"] = resource_path
            if storage_options and read_settings.scan_mode == "directory":
                schema = cls._get_schema_from_first_file_in_dir(resource_path, storage_options, "csv")
            else:
                schema = None

            lf = pl.scan_csv(**scan_kwargs)

            return cls(
                lf,
                number_of_records=6_666_666,  # Will be calculated lazily
                optimize_memory=True,
                streamable=True,
                schema=schema
            )

        except Exception as e:
            logger.error(f"Failed to read CSV from {resource_path}: {str(e)}")
            raise Exception(f"Failed to read CSV from cloud storage: {str(e)}")

    @classmethod
    def _read_json_from_cloud(cls,
                              resource_path: str,
                              storage_options: Dict[str, Any],
                              credential_provider: Optional[Callable],
                              is_directory: bool) -> "FlowDataEngine":
        """Reads JSON file(s) from cloud storage."""
        try:
            if is_directory:
                resource_path = ensure_path_has_wildcard_pattern(resource_path, "json")
            scan_kwargs = {"source": resource_path}

            if storage_options:
                scan_kwargs["storage_options"] = storage_options
            if credential_provider:
                scan_kwargs["credential_provider"] = credential_provider

            lf = pl.scan_ndjson(**scan_kwargs)  # Using NDJSON for line-delimited JSON

            return cls(
                lf,
                number_of_records=-1,
                optimize_memory=True,
                streamable=True,
            )

        except Exception as e:
            logger.error(f"Failed to read JSON from {resource_path}: {str(e)}")
            raise Exception(f"Failed to read JSON from cloud storage: {str(e)}")

    def _handle_path_ref(self, path_ref: str, optimize_memory: bool):
        """Handles file path reference input."""
        try:
            pf = ParquetFile(path_ref)
        except Exception as e:
            logger.error(e)
            raise Exception("Provided ref is not a parquet file")

        self.number_of_records = pf.metadata.num_rows
        if optimize_memory:
            self._lazy = True
            self.data_frame = pl.scan_parquet(path_ref)
        else:
            self.data_frame = pl.read_parquet(path_ref)

    def _finalize_initialization(self, name: str, optimize_memory: bool, schema: Optional[Any],
                                 calculate_schema_stats: bool):
        """Finalizes initialization by setting remaining attributes."""
        _ = calculate_schema_stats
        self.name = name
        self._optimize_memory = optimize_memory
        if assert_if_flowfile_schema(schema):
            self._schema = schema
            self.columns = [c.column_name for c in self._schema]
        else:
            pl_schema = self.data_frame.collect_schema()
            self._schema = self._handle_schema(schema, pl_schema)
            self.columns = [c.column_name for c in self._schema] if self._schema else pl_schema.names()

    def __getitem__(self, item):
        """Accesses a specific column or item from the DataFrame."""
        return self.data_frame.select([item])

    @property
    def data_frame(self) -> pl.LazyFrame | pl.DataFrame | None:
        """The underlying Polars DataFrame or LazyFrame.

        This property provides access to the Polars object that backs the
        FlowDataEngine. It handles lazy-loading from external sources if necessary.

        Returns:
            The active Polars `DataFrame` or `LazyFrame`.
        """
        if self._data_frame is not None and not self.is_future:
            return self._data_frame
        elif self.is_future:
            return self._data_frame
        elif self._external_source is not None and self.lazy:
            return self._data_frame
        elif self._external_source is not None and not self.lazy:
            if self._external_source.get_pl_df() is None:
                data_frame = list(self._external_source.get_iter())
                if len(data_frame) > 0:
                    self.data_frame = pl.DataFrame(data_frame)
            else:
                self.data_frame = self._external_source.get_pl_df()
            self.calculate_schema()
            return self._data_frame

    @data_frame.setter
    def data_frame(self, df: pl.LazyFrame | pl.DataFrame):
        """Sets the underlying Polars DataFrame or LazyFrame."""
        if self.lazy and isinstance(df, pl.DataFrame):
            raise Exception('Cannot set a non-lazy dataframe to a lazy flowfile')
        self._data_frame = df

    @staticmethod
    def _create_schema_stats_from_pl_schema(pl_schema: pl.Schema) -> List[Dict]:
        """Converts a Polars Schema into a list of schema statistics dictionaries."""
        return [
            dict(column_name=k, pl_datatype=v, col_index=i)
            for i, (k, v) in enumerate(pl_schema.items())
        ]

    def _add_schema_from_schema_stats(self, schema_stats: List[Dict]):
        """Populates the schema from a list of schema statistics dictionaries."""
        self._schema = convert_stats_to_column_info(schema_stats)

    @property
    def schema(self) -> List[FlowfileColumn]:
        """The schema of the DataFrame as a list of `FlowfileColumn` objects.

        This property lazily calculates the schema if it hasn't been determined yet.

        Returns:
            A list of `FlowfileColumn` objects describing the schema.
        """
        if self.number_of_fields == 0:
            return []
        if self._schema is None or (self._calculate_schema_stats and not self.ind_schema_calculated):
            if self._calculate_schema_stats and not self.ind_schema_calculated:
                schema_stats = self._calculate_schema()
                self.ind_schema_calculated = True
            else:
                schema_stats = self._create_schema_stats_from_pl_schema(self.data_frame.collect_schema())
            self._add_schema_from_schema_stats(schema_stats)
        return self._schema

    @property
    def number_of_fields(self) -> int:
        """The number of columns (fields) in the DataFrame.

        Returns:
            The integer count of columns.
        """
        if self.__number_of_fields is None:
            self.__number_of_fields = len(self.columns)
        return self.__number_of_fields

    def collect(self, n_records: int = None) -> pl.DataFrame:
        """Collects the data and returns it as a Polars DataFrame.

        This method triggers the execution of the lazy query plan (if applicable)
        and returns the result. It supports streaming to optimize memory usage
        for large datasets.

        Args:
            n_records: The maximum number of records to collect. If None, all
                records are collected.

        Returns:
            A Polars `DataFrame` containing the collected data.
        """
        if n_records is None:
            logger.info(f'Fetching all data for Table object "{id(self)}". Settings: streaming={self._streamable}')
        else:
            logger.info(f'Fetching {n_records} record(s) for Table object "{id(self)}". '
                        f'Settings: streaming={self._streamable}')

        if not self.lazy:
            return self.data_frame

        try:
            return self._collect_data(n_records)
        except Exception as e:
            self.errors = [e]
            return self._handle_collection_error(n_records)
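    # Usage sketch (illustrative): collecting a preview versus the full result.
    #
    #     engine = FlowDataEngine.create_random(10_000)
    #     preview = engine.collect(n_records=5)   # first 5 rows as a pl.DataFrame
    #     full = engine.collect()                 # entire result, streaming when possible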

    def _collect_data(self, n_records: int = None) -> pl.DataFrame:
        """Internal method to handle data collection logic."""
        if n_records is None:

            self.collect_external()
            if self._streamable:
                try:
                    logger.info('Collecting data in streaming mode')
                    return self.data_frame.collect(engine="streaming")
                except PanicException:
                    self._streamable = False

            logger.info('Collecting data in non-streaming mode')
            return self.data_frame.collect()

        if self.external_source is not None:
            return self._collect_from_external_source(n_records)

        if self._streamable:
            return self.data_frame.head(n_records).collect(engine="streaming")
        return self.data_frame.head(n_records).collect()

    def _collect_from_external_source(self, n_records: int) -> pl.DataFrame:
        """Handles collection from an external source."""
        if self.external_source.get_pl_df() is not None:
            all_data = self.external_source.get_pl_df().head(n_records)
            self.data_frame = all_data
        else:
            all_data = self.external_source.get_sample(n_records)
            self.data_frame = pl.LazyFrame(all_data)
        return self.data_frame

    def _handle_collection_error(self, n_records: int) -> pl.DataFrame:
        """Handles errors during collection by attempting partial collection."""
        n_records = 100_000_000 if n_records is None else n_records
        ok_cols, error_cols = self._identify_valid_columns(n_records)

        if len(ok_cols) > 0:
            return self._create_partial_dataframe(ok_cols, error_cols, n_records)
        return self._create_empty_dataframe(n_records)

    def _identify_valid_columns(self, n_records: int) -> Tuple[List[str], List[Tuple[str, Any]]]:
        """Identifies which columns can be collected successfully."""
        ok_cols = []
        error_cols = []
        for c in self.columns:
            try:
                _ = self.data_frame.select(c).head(n_records).collect()
                ok_cols.append(c)
            except Exception:
                error_cols.append((c, self.data_frame.schema[c]))
        return ok_cols, error_cols

    def _create_partial_dataframe(self, ok_cols: List[str], error_cols: List[Tuple[str, Any]],
                                  n_records: int) -> pl.DataFrame:
        """Creates a DataFrame with partial data for columns that could be collected."""
        df = self.data_frame.select(ok_cols)
        df = df.with_columns([
            pl.lit(None).alias(column_name).cast(data_type)
            for column_name, data_type in error_cols
        ])
        return df.select(self.columns).head(n_records).collect()

    def _create_empty_dataframe(self, n_records: int) -> pl.DataFrame:
        """Creates an empty DataFrame with the correct schema."""
        if self.number_of_records > 0:
            return pl.DataFrame({
                column_name: pl.Series(
                    name=column_name,
                    values=[None] * min(self.number_of_records, n_records)
                ).cast(data_type)
                for column_name, data_type in self.data_frame.schema.items()
            })
        return pl.DataFrame(schema=self.data_frame.schema)

    def do_group_by(self, group_by_input: transform_schemas.GroupByInput,
                    calculate_schema_stats: bool = True) -> "FlowDataEngine":
        """Performs a group-by operation on the DataFrame.

        Args:
            group_by_input: A `GroupByInput` object defining the grouping columns
                and aggregations.
            calculate_schema_stats: If True, calculates schema statistics for the
                resulting DataFrame.

        Returns:
            A new `FlowDataEngine` instance with the grouped and aggregated data.
        """
        aggregations = [c for c in group_by_input.agg_cols if c.agg != 'groupby']
        group_columns = [c for c in group_by_input.agg_cols if c.agg == 'groupby']

        if len(group_columns) == 0:
            return FlowDataEngine(
                self.data_frame.select(
                    ac.agg_func(ac.old_name).alias(ac.new_name) for ac in aggregations
                ),
                calculate_schema_stats=calculate_schema_stats
            )

        df = self.data_frame.rename({c.old_name: c.new_name for c in group_columns})
        group_by_columns = [n_c.new_name for n_c in group_columns]
        return FlowDataEngine(
            df.group_by(*group_by_columns).agg(
                ac.agg_func(ac.old_name).alias(ac.new_name) for ac in aggregations
            ),
            calculate_schema_stats=calculate_schema_stats
        )
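    # For orientation, the plain-Polars shape of the grouped branch above, with
    # made-up column names:
    #
    #     import polars as pl
    #
    #     lf = pl.LazyFrame({"city": ["a", "a", "b"], "sales": [1, 2, 3]})
    #     out = lf.group_by("city").agg(pl.col("sales").sum().alias("total_sales"))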

    def do_sort(self, sorts: List[transform_schemas.SortByInput]) -> "FlowDataEngine":
        """Sorts the DataFrame by one or more columns.

        Args:
            sorts: A list of `SortByInput` objects, each specifying a column
                and sort direction ('asc' or 'desc').

        Returns:
            A new `FlowDataEngine` instance with the sorted data.
        """
        if not sorts:
            return self

        descending = [s.how == 'desc' or s.how.lower() == 'descending' for s in sorts]
        df = self.data_frame.sort([sort_by.column for sort_by in sorts], descending=descending)
        return FlowDataEngine(df, number_of_records=self.number_of_records, schema=self.schema)
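    # Usage sketch; the keyword names below are assumed from the attributes read above
    # (`column` and `how`):
    #
    #     sorted_engine = engine.do_sort([
    #         transform_schemas.SortByInput(column="sales", how="desc"),
    #     ])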

    def change_column_types(self, transforms: List[transform_schemas.SelectInput],
                            calculate_schema: bool = False) -> "FlowDataEngine":
        """Changes the data type of one or more columns.

        Args:
            transforms: A list of `SelectInput` objects, where each object specifies
                the column and its new `polars_type`.
            calculate_schema: If True, recalculates the schema after the type change.

        Returns:
            A new `FlowDataEngine` instance with the updated column types.
        """
        dtypes = [dtype.base_type() for dtype in self.data_frame.collect_schema().dtypes()]
        idx_mapping = list(
            (transform.old_name, self.cols_idx.get(transform.old_name), getattr(pl, transform.polars_type))
            for transform in transforms if transform.data_type is not None
        )

        actual_transforms = [c for c in idx_mapping if c[2] != dtypes[c[1]]]
        transformations = [
            utils.define_pl_col_transformation(col_name=transform[0], col_type=transform[2])
            for transform in actual_transforms
        ]

        df = self.data_frame.with_columns(transformations)
        return FlowDataEngine(
            df,
            number_of_records=self.number_of_records,
            calculate_schema_stats=calculate_schema,
            streamable=self._streamable
        )

    def save(self, path: str, data_type: str = 'parquet') -> Future:
        """Saves the DataFrame to a file in a separate thread.

        Args:
            path: The file path to save to.
            data_type: The format to save in (e.g., 'parquet', 'csv').

        Returns:
            A `loky.Future` object representing the asynchronous save operation.
        """
        estimated_size = deepcopy(self.get_estimated_file_size() * 4)
        df = deepcopy(self.data_frame)
        return write_threaded(_df=df, path=path, data_type=data_type, estimated_size=estimated_size)

    def to_pylist(self) -> List[Dict]:
        """Converts the DataFrame to a list of Python dictionaries.

        Returns:
            A list where each item is a dictionary representing a row.
        """
        if self.lazy:
            return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_dicts()
        return self.data_frame.to_dicts()

    def to_arrow(self) -> PaTable:
        """Converts the DataFrame to a PyArrow Table.

        This method triggers a `.collect()` call if the data is lazy,
        then converts the resulting eager DataFrame into a `pyarrow.Table`.

        Returns:
            A `pyarrow.Table` instance representing the data.
        """
        if self.lazy:
            return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_arrow()
        else:
            return self.data_frame.to_arrow()

    def to_raw_data(self) -> input_schema.RawData:
        """Converts the DataFrame to a `RawData` schema object.

        Returns:
            An `input_schema.RawData` object containing the schema and data.
        """
        columns = [c.get_minimal_field_info() for c in self.schema]
        data = list(self.to_dict().values())
        return input_schema.RawData(columns=columns, data=data)

    def to_dict(self) -> Dict[str, List]:
        """Converts the DataFrame to a Python dictionary of columns.

        Each key in the dictionary is a column name, and the corresponding value
        is a list of the data in that column.

        Returns:
            A dictionary mapping column names to lists of their values.
        """
        if self.lazy:
            return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_dict(as_series=False)
        else:
            return self.data_frame.to_dict(as_series=False)
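    # The conversion helpers side by side (illustrative):
    #
    #     engine.to_pylist()  # -> [{"col": value, ...}, ...]   one dict per row
    #     engine.to_dict()    # -> {"col": [values, ...], ...}  one list per column
    #     engine.to_arrow()   # -> pyarrow.Table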

    @classmethod
    def create_from_external_source(cls, external_source: ExternalDataSource) -> "FlowDataEngine":
        """Creates a FlowDataEngine from an external data source.

        Args:
            external_source: An object that conforms to the `ExternalDataSource`
                interface.

        Returns:
            A new `FlowDataEngine` instance.
        """
        if external_source.schema is not None:
            ff = cls.create_from_schema(external_source.schema)
        elif external_source.initial_data_getter is not None:
            ff = cls(raw_data=external_source.initial_data_getter())
        else:
            ff = cls()
        ff._external_source = external_source
        return ff

    @classmethod
    def create_from_sql(cls, sql: str, conn: Any) -> "FlowDataEngine":
        """Creates a FlowDataEngine by executing a SQL query.

        Args:
            sql: The SQL query string to execute.
            conn: A database connection object or connection URI string.

        Returns:
            A new `FlowDataEngine` instance with the query result.
        """
        return cls(pl.read_sql(sql, conn))

    @classmethod
    def create_from_schema(cls, schema: List[FlowfileColumn]) -> "FlowDataEngine":
        """Creates an empty FlowDataEngine from a schema definition.

        Args:
            schema: A list of `FlowfileColumn` objects defining the schema.

        Returns:
            A new, empty `FlowDataEngine` instance with the specified schema.
        """
        pl_schema = []
        for i, flow_file_column in enumerate(schema):
            pl_schema.append((flow_file_column.name, cast_str_to_polars_type(flow_file_column.data_type)))
            schema[i].col_index = i
        df = pl.LazyFrame(schema=pl_schema)
        return cls(df, schema=schema, calculate_schema_stats=False, number_of_records=0)

    @classmethod
    def create_from_path(cls, received_table: input_schema.ReceivedTableBase) -> "FlowDataEngine":
        """Creates a FlowDataEngine from a local file path.

        Supports various file types like CSV, Parquet, and Excel.

        Args:
            received_table: A `ReceivedTableBase` object containing the file path
                and format details.

        Returns:
            A new `FlowDataEngine` instance with data from the file.
        """
        received_table.set_absolute_filepath()
        file_type_handlers = {
            'csv': create_funcs.create_from_path_csv,
            'parquet': create_funcs.create_from_path_parquet,
            'excel': create_funcs.create_from_path_excel
        }

        handler = file_type_handlers.get(received_table.file_type)
        if not handler:
            raise Exception(f'Cannot create from {received_table.file_type}')

        flow_file = cls(handler(received_table))
        flow_file._org_path = received_table.abs_file_path
        return flow_file

    @classmethod
    def create_random(cls, number_of_records: int = 1000) -> "FlowDataEngine":
        """Creates a FlowDataEngine with randomly generated data.

        Useful for testing and examples.

        Args:
            number_of_records: The number of random records to generate.

        Returns:
            A new `FlowDataEngine` instance with fake data.
        """
        return cls(create_fake_data(number_of_records))

    @classmethod
    def generate_enumerator(cls, length: int = 1000, output_name: str = 'output_column') -> "FlowDataEngine":
        """Generates a FlowDataEngine with a single column containing a sequence of integers.

        Args:
            length: The number of integers to generate (capped at 10,000,000).
            output_name: The name of the output column.

        Returns:
            A new `FlowDataEngine` instance.
        """
        if length > 10_000_000:
            length = 10_000_000
        return cls(pl.LazyFrame().select((pl.int_range(0, length, dtype=pl.UInt32)).alias(output_name)))
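    # Usage sketch:
    #
    #     ids = FlowDataEngine.generate_enumerator(length=5, output_name="row_nr")
    #     # a single UInt32 column "row_nr" holding the values 0..4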

    def _handle_schema(self, schema: List[FlowfileColumn] | List[str] | pl.Schema | None,
                       pl_schema: pl.Schema) -> List[FlowfileColumn] | None:
        """Handles schema processing and validation during initialization."""
        if schema is None and pl_schema is not None:
            return convert_stats_to_column_info(self._create_schema_stats_from_pl_schema(pl_schema))
        elif schema is None and pl_schema is None:
            return None
        elif assert_if_flowfile_schema(schema) and pl_schema is None:
            return schema
        elif pl_schema is not None and schema is not None:
            if len(schema) != len(pl_schema):
                raise Exception(
                    f'Schema does not match the data: got {len(schema)} columns, expected {len(pl_schema)}')
            if isinstance(schema, pl.Schema):
                return self._handle_polars_schema(schema, pl_schema)
            elif isinstance(schema, list) and len(schema) == 0:
                return []
            elif isinstance(schema[0], str):
                return self._handle_string_schema(schema, pl_schema)
            return schema

    def _handle_polars_schema(self, schema: pl.Schema, pl_schema: pl.Schema) -> List[FlowfileColumn]:
        """Handles Polars schema conversion."""
        flow_file_columns = [
            FlowfileColumn.create_from_polars_dtype(column_name=col_name, data_type=dtype)
            for col_name, dtype in zip(schema.names(), schema.dtypes())
        ]

        select_arg = [
            pl.col(o).alias(n).cast(schema_dtype)
            for o, n, schema_dtype in zip(pl_schema.names(), schema.names(), schema.dtypes())
        ]

        self.data_frame = self.data_frame.select(select_arg)
        return flow_file_columns

    def _handle_string_schema(self, schema: List[str], pl_schema: pl.Schema) -> List[FlowfileColumn]:
        """Handles string-based schema conversion."""
        flow_file_columns = [
            FlowfileColumn.create_from_polars_dtype(column_name=col_name, data_type=dtype)
            for col_name, dtype in zip(schema, pl_schema.dtypes())
        ]

        self.data_frame = self.data_frame.rename({
            o: n for o, n in zip(pl_schema.names(), schema)
        })

        return flow_file_columns

    def split(self, split_input: transform_schemas.TextToRowsInput) -> "FlowDataEngine":
        """Splits a column's text values into multiple rows based on a delimiter.

        This operation is often referred to as "exploding" the DataFrame, as it
        increases the number of rows.

        Args:
            split_input: A `TextToRowsInput` object specifying the column to split,
                the delimiter, and the output column name.

        Returns:
            A new `FlowDataEngine` instance with the exploded rows.
        """
        output_column_name = (
            split_input.output_column_name
            if split_input.output_column_name
            else split_input.column_to_split
        )

        split_value = (
            split_input.split_fixed_value
            if split_input.split_by_fixed_value
            else pl.col(split_input.split_by_column)
        )

        df = (
            self.data_frame.with_columns(
                pl.col(split_input.column_to_split)
                .str.split(by=split_value)
                .alias(output_column_name)
            )
            .explode(output_column_name)
        )

        return FlowDataEngine(df)
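    # The underlying split-and-explode pattern in plain Polars, with placeholder data:
    #
    #     import polars as pl
    #
    #     lf = pl.LazyFrame({"tags": ["a,b", "c"]})
    #     exploded = (
    #         lf.with_columns(pl.col("tags").str.split(",").alias("tag"))
    #           .explode("tag")
    #     )
    #     # rows: ("a,b", "a"), ("a,b", "b"), ("c", "c")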

    def unpivot(self, unpivot_input: transform_schemas.UnpivotInput) -> "FlowDataEngine":
        """Converts the DataFrame from a wide to a long format.

        This is the inverse of a pivot operation, taking columns and transforming
        them into `variable` and `value` rows.

        Args:
            unpivot_input: An `UnpivotInput` object specifying which columns to
                unpivot and which to keep as index columns.

        Returns:
            A new, unpivoted `FlowDataEngine` instance.
        """
        lf = self.data_frame

        if unpivot_input.data_type_selector_expr is not None:
            result = lf.unpivot(
                on=unpivot_input.data_type_selector_expr(),
                index=unpivot_input.index_columns
            )
        elif unpivot_input.value_columns is not None:
            result = lf.unpivot(
                on=unpivot_input.value_columns,
                index=unpivot_input.index_columns
            )
        else:
            result = lf.unpivot()

        return FlowDataEngine(result)
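    # Plain-Polars illustration of the wide-to-long reshape, with invented columns:
    #
    #     import polars as pl
    #
    #     lf = pl.LazyFrame({"id": [1, 2], "q1": [10, 20], "q2": [30, 40]})
    #     long = lf.unpivot(on=["q1", "q2"], index="id")
    #     # result columns: id, variable, value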

    def do_pivot(self, pivot_input: transform_schemas.PivotInput, node_logger: NodeLogger = None) -> "FlowDataEngine":
        """Converts the DataFrame from a long to a wide format, aggregating values.

        Args:
            pivot_input: A `PivotInput` object defining the index, pivot, and value
                columns, along with the aggregation logic.
            node_logger: An optional logger for reporting warnings, e.g., if the
                pivot column has too many unique values.

        Returns:
            A new, pivoted `FlowDataEngine` instance.
        """
        # Get unique values for pivot columns
        max_unique_vals = 200
        new_cols_unique = fetch_unique_values(self.data_frame.select(pivot_input.pivot_column)
                                              .unique()
                                              .sort(pivot_input.pivot_column)
                                              .limit(max_unique_vals).cast(pl.String))
        if len(new_cols_unique) >= max_unique_vals:
            if node_logger:
                node_logger.warning('Pivot column has too many unique values. Please consider using a different column.'
                                    f' Max unique values: {max_unique_vals}')

        if len(pivot_input.index_columns) == 0:
            no_index_cols = True
            pivot_input.index_columns = ['__temp__']
            ff = self.apply_flowfile_formula('1', col_name='__temp__')
        else:
            no_index_cols = False
            ff = self

        # Perform pivot operations
        index_columns = pivot_input.get_index_columns()
        grouped_ff = ff.do_group_by(pivot_input.get_group_by_input(), False)
        pivot_column = pivot_input.get_pivot_column()

        input_df = grouped_ff.data_frame.with_columns(
            pivot_column.cast(pl.String).alias(pivot_input.pivot_column)
        )
        number_of_aggregations = len(pivot_input.aggregations)
        df = (
            input_df.select(
                *index_columns,
                pivot_column,
                pivot_input.get_values_expr()
            )
            .group_by(*index_columns)
            .agg([
                (pl.col('vals').filter(pivot_column == new_col_value))
                .first()
                .alias(new_col_value)
                for new_col_value in new_cols_unique
            ])
            .select(
                *index_columns,
                *[
                    pl.col(new_col).struct.field(agg).alias(f'{new_col + "_" + agg if number_of_aggregations > 1 else new_col}')
                    for new_col in new_cols_unique
                    for agg in pivot_input.aggregations
                ]
            )
        )

        # Clean up temporary columns if needed
        if no_index_cols:
            df = df.drop('__temp__')
            pivot_input.index_columns = []

        return FlowDataEngine(df, calculate_schema_stats=False)

    def do_filter(self, predicate: str) -> "FlowDataEngine":
        """Filters rows based on a predicate expression.

        Args:
            predicate: A string containing a Polars expression that evaluates to
                a boolean value.

        Returns:
            A new `FlowDataEngine` instance containing only the rows that match
            the predicate.
        """
        try:
            f = to_expr(predicate)
        except Exception as e:
            logger.warning(f'Error in filter expression: {e}')
            f = to_expr("False")
        df = self.data_frame.filter(f)
        return FlowDataEngine(df, schema=self.schema, streamable=self._streamable)
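    # Usage sketch; the exact predicate grammar is whatever `to_expr` accepts, so the
    # expression string below is only an assumption:
    #
    #     kept = engine.do_filter('[sales] > 100')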

    def add_record_id(self, record_id_settings: transform_schemas.RecordIdInput) -> "FlowDataEngine":
        """Adds a record ID (row number) column to the DataFrame.

        Can generate a simple sequential ID or a grouped ID that resets for
        each group.

        Args:
            record_id_settings: A `RecordIdInput` object specifying the output
                column name, offset, and optional grouping columns.

        Returns:
            A new `FlowDataEngine` instance with the added record ID column.
        """
        if record_id_settings.group_by and len(record_id_settings.group_by_columns) > 0:
            return self._add_grouped_record_id(record_id_settings)
        return self._add_simple_record_id(record_id_settings)

    def _add_grouped_record_id(self, record_id_settings: transform_schemas.RecordIdInput) -> "FlowDataEngine":
        """Adds a record ID column with grouping."""
        select_cols = [pl.col(record_id_settings.output_column_name)] + [pl.col(c) for c in self.columns]

        df = (
            self.data_frame
            .with_columns(pl.lit(1).alias(record_id_settings.output_column_name))
            .with_columns(
                (pl.cum_count(record_id_settings.output_column_name)
                 .over(record_id_settings.group_by_columns) + record_id_settings.offset - 1)
                .alias(record_id_settings.output_column_name)
            )
            .select(select_cols)
        )

        output_schema = [FlowfileColumn.from_input(record_id_settings.output_column_name, 'UInt64')]
        output_schema.extend(self.schema)

        return FlowDataEngine(df, schema=output_schema)

    def _add_simple_record_id(self, record_id_settings: transform_schemas.RecordIdInput) -> "FlowDataEngine":
        """Adds a simple sequential record ID column."""
        df = self.data_frame.with_row_index(
            record_id_settings.output_column_name,
            record_id_settings.offset
        )

        output_schema = [FlowfileColumn.from_input(record_id_settings.output_column_name, 'UInt64')]
        output_schema.extend(self.schema)

        return FlowDataEngine(df, schema=output_schema)
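    # The two record-id shapes above, expressed directly in Polars on placeholder data:
    #
    #     import polars as pl
    #
    #     lf = pl.LazyFrame({"grp": ["a", "a", "b"]})
    #     simple = lf.with_row_index("record_id", offset=1)
    #     grouped = lf.with_columns(pl.lit(1).alias("record_id")).with_columns(
    #         pl.cum_count("record_id").over("grp").alias("record_id")
    #     )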

    def get_schema_column(self, col_name: str) -> FlowfileColumn:
        """Retrieves the schema information for a single column by its name.

        Args:
            col_name: The name of the column to retrieve.

        Returns:
            A `FlowfileColumn` object for the specified column, or `None` if not found.
        """
        for s in self.schema:
            if s.name == col_name:
                return s

    def get_estimated_file_size(self) -> int:
        """Estimates the file size in bytes if the data originated from a local file.

        This relies on the original path being tracked during file ingestion.

        Returns:
            The file size in bytes, or 0 if the original path is unknown.
        """
        if self._org_path is not None:
            return os.path.getsize(self._org_path)
        return 0

    def __repr__(self) -> str:
        """Returns a string representation of the FlowDataEngine."""
        return f'flow data engine\n{self.data_frame.__repr__()}'

    def __call__(self) -> "FlowDataEngine":
        """Makes the class instance callable, returning itself."""
        return self

    def __len__(self) -> int:
        """Returns the number of records in the table."""
        return self.number_of_records if self.number_of_records >= 0 else self.get_number_of_records()

    def cache(self) -> "FlowDataEngine":
        """Caches the current DataFrame to disk and updates the internal reference.

        This triggers a background process to write the current LazyFrame's result
        to a temporary file. Subsequent operations on this `FlowDataEngine` instance
        will read from the cached file, which can speed up downstream computations.

        Returns:
            The same `FlowDataEngine` instance, now backed by the cached data.
        """
        edf = ExternalDfFetcher(lf=self.data_frame, file_ref=str(id(self)), wait_on_completion=False,
                                flow_id=-1,
                                node_id=-1)
        logger.info('Caching data in background')
        result = edf.get_result()
        if isinstance(result, pl.LazyFrame):
            logger.info('Data cached')
            del self._data_frame
            self.data_frame = result
            logger.info('Data loaded from cache')
        return self

    def collect_external(self):
        """Materializes data from a tracked external source.

        If the `FlowDataEngine` was created from an `ExternalDataSource`, this
        method will trigger the data retrieval, update the internal `_data_frame`
        to a `LazyFrame` of the collected data, and reset the schema to be
        re-evaluated.
        """
        if self._external_source is not None:
            logger.info('Collecting external source')
            if self.external_source.get_pl_df() is not None:
                self.data_frame = self.external_source.get_pl_df().lazy()
            else:
                self.data_frame = pl.LazyFrame(list(self.external_source.get_iter()))
            self._schema = None  # enforce reset schema

    def get_output_sample(self, n_rows: int = 10) -> List[Dict]:
        """Gets a sample of the data as a list of dictionaries.

        This is typically used to display a preview of the data in a UI.

        Args:
            n_rows: The number of rows to sample.

        Returns:
            A list of dictionaries, where each dictionary represents a row.
        """
        if self.number_of_records > n_rows or self.number_of_records < 0:
            df = self.collect(n_rows)
        else:
            df = self.collect()
        return df.to_dicts()

    def __get_sample__(self, n_rows: int = 100, streamable: bool = True) -> "FlowDataEngine":
        """Internal method to get a sample of the data."""
        if not self.lazy:
            df = self.data_frame.lazy()
        else:
            df = self.data_frame

        if streamable:
            try:
                df = df.head(n_rows).collect()
            except Exception as e:
                logger.warning(f'Error in getting sample: {e}')
                df = df.head(n_rows).collect(engine="auto")
        else:
            df = self.collect()
        return FlowDataEngine(df, number_of_records=len(df), schema=self.schema)

    def get_sample(self, n_rows: int = 100, random: bool = False, shuffle: bool = False,
                   seed: int = None) -> "FlowDataEngine":
        """Gets a sample of rows from the DataFrame.

        Args:
            n_rows: The number of rows to sample.
            random: If True, performs random sampling. If False, takes the first n_rows.
            shuffle: If True (and `random` is True), shuffles the data before sampling.
            seed: A random seed for reproducibility.

        Returns:
            A new `FlowDataEngine` instance containing the sampled data.
        """
        n_records = min(n_rows, self.get_number_of_records(calculate_in_worker_process=OFFLOAD_TO_WORKER))
        logger.info(f'Getting sample of {n_rows} rows')

        if random:
            if self.lazy and self.external_source is not None:
                self.collect_external()

            if self.lazy and shuffle:
                sample_df = (
                    self.data_frame
                    .collect(engine="streaming" if self._streamable else "auto")
                    .sample(n_rows, seed=seed, shuffle=shuffle)
                )
            elif shuffle:
                sample_df = self.data_frame.sample(n_rows, seed=seed, shuffle=shuffle)
            else:
                every_n_records = ceil(self.number_of_records / n_rows)
                sample_df = self.data_frame.gather_every(every_n_records)
        else:
            if self.external_source:
                self.collect(n_rows)
            sample_df = self.data_frame.head(n_rows)

        return FlowDataEngine(sample_df, schema=self.schema, number_of_records=n_records)
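    # Usage sketch:
    #
    #     head_sample = engine.get_sample(50)                                         # first 50 rows
    #     random_sample = engine.get_sample(50, random=True, shuffle=True, seed=42)   # reproducible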

    def get_subset(self, n_rows: int = 100) -> "FlowDataEngine":
        """Gets the first `n_rows` from the DataFrame.

        Args:
            n_rows: The number of rows to include in the subset.

        Returns:
            A new `FlowDataEngine` instance containing the subset of data.
        """
        return FlowDataEngine(self.data_frame.head(n_rows), calculate_schema_stats=True)

    def iter_batches(self, batch_size: int = 1000,
                     columns: Union[List, Tuple, str] = None) -> Generator["FlowDataEngine", None, None]:
        """Iterates over the DataFrame in batches.

        Args:
            batch_size: The size of each batch.
            columns: A list of column names to include in the batches. If None,
                all columns are included.

        Yields:
            A `FlowDataEngine` instance for each batch.
        """
        if columns:
            self.data_frame = self.data_frame.select(columns)
        self.lazy = False
        batches = self.data_frame.iter_slices(batch_size)
        for batch in batches:
            yield FlowDataEngine(batch)

    def start_fuzzy_join(self, fuzzy_match_input: transform_schemas.FuzzyMatchInput,
                         other: "FlowDataEngine", file_ref: str, flow_id: int = -1,
                         node_id: int | str = -1) -> ExternalFuzzyMatchFetcher:
        """Starts a fuzzy join operation in a background process.

        This method prepares the data and initiates the fuzzy matching in a
        separate process, returning a tracker object immediately.

        Args:
            fuzzy_match_input: A `FuzzyMatchInput` object with the matching parameters.
            other: The right `FlowDataEngine` to join with.
            file_ref: A reference string for temporary files.
            flow_id: The flow ID for tracking.
            node_id: The node ID for tracking.

        Returns:
            An `ExternalFuzzyMatchFetcher` object that can be used to track the
            progress and retrieve the result of the fuzzy join.
        """
        left_df, right_df = prepare_for_fuzzy_match(left=self, right=other,
                                                    fuzzy_match_input=fuzzy_match_input)
        return ExternalFuzzyMatchFetcher(left_df, right_df,
                                         fuzzy_maps=fuzzy_match_input.fuzzy_maps,
                                         file_ref=file_ref + '_fm',
                                         wait_on_completion=False,
                                         flow_id=flow_id,
                                         node_id=node_id)

    def do_fuzzy_join(self, fuzzy_match_input: transform_schemas.FuzzyMatchInput,
                      other: "FlowDataEngine", file_ref: str, flow_id: int = -1,
                      node_id: int | str = -1) -> "FlowDataEngine":
        """Performs a fuzzy join with another DataFrame.

        This method blocks until the fuzzy join operation is complete.

        Args:
            fuzzy_match_input: A `FuzzyMatchInput` object with the matching parameters.
            other: The right `FlowDataEngine` to join with.
            file_ref: A reference string for temporary files.
            flow_id: The flow ID for tracking.
            node_id: The node ID for tracking.

        Returns:
            A new `FlowDataEngine` instance with the result of the fuzzy join.
        """
        left_df, right_df = prepare_for_fuzzy_match(left=self, right=other,
                                                    fuzzy_match_input=fuzzy_match_input)
        f = ExternalFuzzyMatchFetcher(left_df, right_df,
                                      fuzzy_maps=fuzzy_match_input.fuzzy_maps,
                                      file_ref=file_ref + '_fm',
                                      wait_on_completion=True,
                                      flow_id=flow_id,
                                      node_id=node_id)
        return FlowDataEngine(f.get_result())

    def fuzzy_match(self, right: "FlowDataEngine", left_on: str, right_on: str,
                    fuzzy_method: str = 'levenshtein', threshold: float = 0.75) -> "FlowDataEngine":
        """Performs a simple fuzzy match between two DataFrames on a single column pair.

        This is a convenience method for a common fuzzy join scenario.

        Args:
            right: The right `FlowDataEngine` to match against.
            left_on: The column name from the left DataFrame to match on.
            right_on: The column name from the right DataFrame to match on.
            fuzzy_method: The fuzzy matching algorithm to use (e.g., 'levenshtein').
            threshold: The similarity score threshold (0.0 to 1.0) for a match.

        Returns:
            A new `FlowDataEngine` with the matched data.
        """
        fuzzy_match_input = transform_schemas.FuzzyMatchInput(
            [transform_schemas.FuzzyMap(
                left_on, right_on,
                fuzzy_type=fuzzy_method,
                threshold_score=threshold
            )],
            left_select=self.columns,
            right_select=right.columns
        )
        return self.do_fuzzy_join(fuzzy_match_input, right, str(id(self)))
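    # Usage sketch with invented engine and column names:
    #
    #     matched = customers.fuzzy_match(
    #         right=crm_records,
    #         left_on="name", right_on="customer_name",
    #         fuzzy_method="levenshtein", threshold=0.8,
    #     )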

    def do_cross_join(self, cross_join_input: transform_schemas.CrossJoinInput,
                      auto_generate_selection: bool, verify_integrity: bool,
                      other: "FlowDataEngine") -> "FlowDataEngine":
        """Performs a cross join with another DataFrame.

        A cross join produces the Cartesian product of the two DataFrames.

        Args:
            cross_join_input: A `CrossJoinInput` object specifying column selections.
            auto_generate_selection: If True, automatically renames columns to avoid conflicts.
            verify_integrity: If True, checks if the resulting join would be too large.
            other: The right `FlowDataEngine` to join with.

        Returns:
            A new `FlowDataEngine` with the result of the cross join.

        Raises:
            Exception: If `verify_integrity` is True and the join would result in
                an excessively large number of records.
        """
        self.lazy = True
        other.lazy = True

        verify_join_select_integrity(cross_join_input, left_columns=self.columns, right_columns=other.columns)

        right_select = [v.old_name for v in cross_join_input.right_select.renames
                        if (v.keep or v.join_key) and v.is_available]
        left_select = [v.old_name for v in cross_join_input.left_select.renames
                       if (v.keep or v.join_key) and v.is_available]

        left = self.data_frame.select(left_select).rename(cross_join_input.left_select.rename_table)
        right = other.data_frame.select(right_select).rename(cross_join_input.right_select.rename_table)

        if verify_integrity:
            n_records = self.get_number_of_records() * other.get_number_of_records()
            if n_records > 1_000_000_000:
                raise Exception("Join will result in too many records, ending process")
        else:
            n_records = -1

        joined_df = left.join(right, how='cross')

        cols_to_delete_after = [col.new_name for col in
                                cross_join_input.left_select.renames + cross_join_input.right_select.renames
                                if col.join_key and not col.keep and col.is_available]

        if verify_integrity:
            return FlowDataEngine(joined_df.drop(cols_to_delete_after), calculate_schema_stats=False,
                                 number_of_records=n_records, streamable=False)
        else:
            fl = FlowDataEngine(joined_df.drop(cols_to_delete_after), calculate_schema_stats=False,
                               number_of_records=0, streamable=False)
            return fl

    def join(self, join_input: transform_schemas.JoinInput, auto_generate_selection: bool,
             verify_integrity: bool, other: "FlowDataEngine") -> "FlowDataEngine":
        """Performs a standard SQL-style join with another DataFrame.

        Supports various join types like 'inner', 'left', 'right', 'outer', 'semi', and 'anti'.

        Args:
            join_input: A `JoinInput` object defining the join keys, join type,
                and column selections.
            auto_generate_selection: If True, automatically handles column renaming.
            verify_integrity: If True, performs checks to prevent excessively large joins.
            other: The right `FlowDataEngine` to join with.

        Returns:
            A new `FlowDataEngine` with the joined data.

        Raises:
            Exception: If the join configuration is invalid or if `verify_integrity`
                is True and the join is predicted to be too large.
        """
        ensure_right_unselect_for_semi_and_anti_joins(join_input)
        verify_join_select_integrity(join_input, left_columns=self.columns, right_columns=other.columns)
        if not verify_join_map_integrity(join_input, left_columns=self.schema, right_columns=other.schema):
            raise Exception('Join is not valid by the data fields')
        if auto_generate_selection:
            join_input.auto_rename()
        left = self.data_frame.select(get_select_columns(join_input.left_select.renames)).rename(join_input.left_select.rename_table)
        right = other.data_frame.select(get_select_columns(join_input.right_select.renames)).rename(join_input.right_select.rename_table)
        if verify_integrity and join_input.how != 'right':
            n_records = get_join_count(left, right, left_on_keys=join_input.left_join_keys,
                                       right_on_keys=join_input.right_join_keys, how=join_input.how)
            if n_records > 1_000_000_000:
                raise Exception("Join will result in too many records, ending process")
        else:
            n_records = -1
        left, right, reverse_join_key_mapping = _handle_duplication_join_keys(left, right, join_input)
        left, right = rename_df_table_for_join(left, right, join_input.get_join_key_renames())
        if join_input.how == 'right':
            joined_df = right.join(
                other=left,
                left_on=join_input.right_join_keys,
                right_on=join_input.left_join_keys,
                how="left",
                suffix="").rename(reverse_join_key_mapping)
        else:
            joined_df = left.join(
                other=right,
                left_on=join_input.left_join_keys,
                right_on=join_input.right_join_keys,
                how=join_input.how,
                suffix="").rename(reverse_join_key_mapping)
        left_cols_to_delete_after = [get_col_name_to_delete(col, 'left') for col in join_input.left_select.renames
                                     if not col.keep
                                     and col.is_available and col.join_key
                                     ]
        right_cols_to_delete_after = [get_col_name_to_delete(col, 'right') for col in join_input.right_select.renames
                                      if not col.keep
                                      and col.is_available and col.join_key
                                      and join_input.how in ("left", "right", "inner", "cross", "outer")
                                      ]
        if len(right_cols_to_delete_after + left_cols_to_delete_after) > 0:
            joined_df = joined_df.drop(left_cols_to_delete_after + right_cols_to_delete_after)
        undo_join_key_remapping = get_undo_rename_mapping_join(join_input)
        joined_df = joined_df.rename(undo_join_key_remapping)

        if verify_integrity:
            return FlowDataEngine(joined_df, calculate_schema_stats=True,
                                  number_of_records=n_records, streamable=False)
        else:
            fl = FlowDataEngine(joined_df, calculate_schema_stats=False,
                                number_of_records=0, streamable=False)
            return fl

    def solve_graph(self, graph_solver_input: transform_schemas.GraphSolverInput) -> "FlowDataEngine":
        """Solves a graph problem represented by 'from' and 'to' columns.

        This is used for operations like finding connected components in a graph.

        Args:
            graph_solver_input: A `GraphSolverInput` object defining the source,
                destination, and output column names.

        Returns:
            A new `FlowDataEngine` instance with the solved graph data.
        """
        lf = self.data_frame.with_columns(
            graph_solver(graph_solver_input.col_from, graph_solver_input.col_to)
            .alias(graph_solver_input.output_column_name)
        )
        return FlowDataEngine(lf)

    def add_new_values(self, values: Iterable, col_name: str = None) -> "FlowDataEngine":
        """Adds a new column with the provided values.

        Args:
            values: An iterable (e.g., list, tuple) of values to add as a new column.
            col_name: The name for the new column. Defaults to 'new_values'.

        Returns:
            A new `FlowDataEngine` instance with the added column.
        """
        if col_name is None:
            col_name = 'new_values'
        return FlowDataEngine(self.data_frame.with_columns(pl.Series(values).alias(col_name)))

    def get_record_count(self) -> "FlowDataEngine":
        """Returns a new FlowDataEngine with a single column 'number_of_records'
        containing the total number of records.

        Returns:
            A new `FlowDataEngine` instance.
        """
        return FlowDataEngine(self.data_frame.select(pl.len().alias('number_of_records')))

    def assert_equal(self, other: "FlowDataEngine", ordered: bool = True, strict_schema: bool = False):
        """Asserts that this DataFrame is equal to another.

        Useful for testing.

        Args:
            other: The other `FlowDataEngine` to compare with.
            ordered: If True, the row order must be identical.
            strict_schema: If True, the data types of the schemas must be identical.

        Raises:
            Exception: If the DataFrames are not equal based on the specified criteria.
        """
        org_laziness = self.lazy, other.lazy
        self.lazy = False
        other.lazy = False
        self.number_of_records = -1
        other.number_of_records = -1
        other = other.select_columns(self.columns)

        if self.get_number_of_records() != other.get_number_of_records():
            raise Exception('Number of records is not equal')

        if self.columns != other.columns:
            raise Exception('Schema is not equal')

        if strict_schema:
            assert self.data_frame.schema == other.data_frame.schema, 'Data types do not match'

        if ordered:
            self_lf = self.data_frame.sort(by=self.columns)
            other_lf = other.data_frame.sort(by=other.columns)
        else:
            self_lf = self.data_frame
            other_lf = other.data_frame

        self.lazy, other.lazy = org_laziness
        assert self_lf.equals(other_lf), 'Data is not equal'

    def initialize_empty_fl(self):
        """Initializes an empty LazyFrame."""
        self.data_frame = pl.LazyFrame()
        self.number_of_records = 0
        self._lazy = True

    def _calculate_number_of_records_in_worker(self) -> int:
        """Calculates the number of records in a worker process."""
        number_of_records = ExternalDfFetcher(
            lf=self.data_frame,
            operation_type="calculate_number_of_records",
            flow_id=-1,
            node_id=-1,
            wait_on_completion=True
        ).result
        return number_of_records

    def get_number_of_records(self, warn: bool = False, force_calculate: bool = False,
                              calculate_in_worker_process: bool = False) -> int:
        """Gets the total number of records in the DataFrame.

        For lazy frames, this may trigger a full data scan, which can be expensive.

        Args:
            warn: If True, logs a warning if a potentially expensive calculation is triggered.
            force_calculate: If True, forces recalculation even if a value is cached.
            calculate_in_worker_process: If True, offloads the calculation to a worker process.

        Returns:
            The total number of records.

        Raises:
            ValueError: If the number of records could not be determined.
        """
        if self.is_future and not self.is_collected:
            return -1
        calculate_in_worker_process = False if not OFFLOAD_TO_WORKER else calculate_in_worker_process
        if self.number_of_records is None or self.number_of_records < 0 or force_calculate:
            if self._number_of_records_callback is not None:
                self._number_of_records_callback(self)

            if self.lazy:
                if calculate_in_worker_process:
                    try:
                        self.number_of_records = self._calculate_number_of_records_in_worker()
                        return self.number_of_records
                    except Exception as e:
                        logger.error(f"Error: {e}")
                if warn:
                    logger.warning('Calculating the number of records; this can be expensive on a lazy frame')
                try:
                    self.number_of_records = self.data_frame.select(pl.len()).collect(
                        engine="streaming" if self._streamable else "auto")[0, 0]
                except Exception:
                    raise ValueError('Could not get number of records')
            else:
                self.number_of_records = self.data_frame.__len__()
        return self.number_of_records

    @property
    def has_errors(self) -> bool:
        """Checks if there are any errors."""
        return len(self.errors) > 0

    @property
    def lazy(self) -> bool:
        """Indicates if the DataFrame is in lazy mode."""
        return self._lazy

    @lazy.setter
    def lazy(self, exec_lazy: bool = False):
        """Sets the laziness of the DataFrame.

        Args:
            exec_lazy: If True, converts the DataFrame to a LazyFrame. If False,
                collects the data and converts it to an eager DataFrame.
        """
        if exec_lazy != self._lazy:
            if exec_lazy:
                self.data_frame = self.data_frame.lazy()
            else:
                self._lazy = exec_lazy
                if self.external_source is not None:
                    df = self.collect()
                    self.data_frame = df
                else:
                    self.data_frame = self.data_frame.collect(engine="streaming" if self._streamable else "auto")
            self._lazy = exec_lazy

    @property
    def external_source(self) -> ExternalDataSource:
        """The external data source, if any."""
        return self._external_source

    @property
    def cols_idx(self) -> Dict[str, int]:
        """A dictionary mapping column names to their integer index."""
        if self._col_idx is None:
            self._col_idx = {c: i for i, c in enumerate(self.columns)}
        return self._col_idx

    @property
    def __name__(self) -> str:
        """The name of the table."""
        return self.name

    def get_select_inputs(self) -> transform_schemas.SelectInputs:
        """Gets `SelectInput` specifications for all columns in the current schema.

        Returns:
            A `SelectInputs` object that can be used to configure selection or
            transformation operations.
        """
        return transform_schemas.SelectInputs(
            [transform_schemas.SelectInput(old_name=c.name, data_type=c.data_type) for c in self.schema]
        )

    def select_columns(self, list_select: Union[List[str], Tuple[str], str]) -> "FlowDataEngine":
        """Selects a subset of columns from the DataFrame.

        Args:
            list_select: A list, tuple, or single string of column names to select.

        Returns:
            A new `FlowDataEngine` instance containing only the selected columns.
        """
        if isinstance(list_select, str):
            list_select = [list_select]

        idx_to_keep = [self.cols_idx.get(c) for c in list_select]
        selects = [ls for ls, id_to_keep in zip(list_select, idx_to_keep) if id_to_keep is not None]
        new_schema = [self.schema[i] for i in idx_to_keep if i is not None]

        return FlowDataEngine(
            self.data_frame.select(selects),
            number_of_records=self.number_of_records,
            schema=new_schema,
            streamable=self._streamable
        )

    def drop_columns(self, columns: List[str]) -> "FlowDataEngine":
        """Drops specified columns from the DataFrame.

        Args:
            columns: A list of column names to drop.

        Returns:
            A new `FlowDataEngine` instance without the dropped columns.
        """
        cols_for_select = tuple(set(self.columns) - set(columns))
        idx_to_keep = [self.cols_idx.get(c) for c in cols_for_select]
        new_schema = [self.schema[i] for i in idx_to_keep]

        return FlowDataEngine(
            self.data_frame.select(cols_for_select),
            number_of_records=self.number_of_records,
            schema=new_schema
        )

    def reorganize_order(self, column_order: List[str]) -> "FlowDataEngine":
        """Reorganizes columns into a specified order.

        Args:
            column_order: A list of column names in the desired order.

        Returns:
            A new `FlowDataEngine` instance with the columns reordered.
        """
        df = self.data_frame.select(column_order)
        schema = sorted(self.schema, key=lambda x: column_order.index(x.column_name))
        return FlowDataEngine(df, schema=schema, number_of_records=self.number_of_records)

    def apply_flowfile_formula(self, func: str, col_name: str,
                               output_data_type: pl.DataType = None) -> "FlowDataEngine":
        """Applies a formula to create a new column or transform an existing one.

        Args:
            func: A string containing a Polars expression formula.
            col_name: The name of the new or transformed column.
            output_data_type: The desired Polars data type for the output column.

        Returns:
            A new `FlowDataEngine` instance with the applied formula.
        """
        parsed_func = to_expr(func)
        if output_data_type is not None:
            df2 = self.data_frame.with_columns(parsed_func.cast(output_data_type).alias(col_name))
        else:
            df2 = self.data_frame.with_columns(parsed_func.alias(col_name))

        return FlowDataEngine(df2, number_of_records=self.number_of_records)

    def apply_sql_formula(self, func: str, col_name: str,
                          output_data_type: pl.DataType = None) -> "FlowDataEngine":
        """Applies an SQL-style formula using `pl.sql_expr`.

        Args:
            func: A string containing an SQL expression.
            col_name: The name of the new or transformed column.
            output_data_type: The desired Polars data type for the output column.

        Returns:
            A new `FlowDataEngine` instance with the applied formula.
        """
        expr = to_expr(func)
        if output_data_type not in (None, "Auto"):
            df = self.data_frame.with_columns(expr.cast(output_data_type).alias(col_name))
        else:
            df = self.data_frame.with_columns(expr.alias(col_name))

        return FlowDataEngine(df, number_of_records=self.number_of_records)

    def output(self, output_fs: input_schema.OutputSettings, flow_id: int, node_id: int | str,
               execute_remote: bool = True) -> "FlowDataEngine":
        """Writes the DataFrame to an output file.

        Can execute the write operation locally or in a remote worker process.

        Args:
            output_fs: An `OutputSettings` object with details about the output file.
            flow_id: The flow ID for tracking.
            node_id: The node ID for tracking.
            execute_remote: If True, executes the write in a worker process.

        Returns:
            The same `FlowDataEngine` instance for chaining.
        """
        logger.info('Starting to write output')
        if execute_remote:
            status = utils.write_output(
                self.data_frame,
                data_type=output_fs.file_type,
                path=output_fs.abs_file_path,
                write_mode=output_fs.write_mode,
                sheet_name=output_fs.output_excel_table.sheet_name,
                delimiter=output_fs.output_csv_table.delimiter,
                flow_id=flow_id,
                node_id=node_id
            )
            tracker = ExternalExecutorTracker(status)
            tracker.get_result()
            logger.info('Finished writing output')
        else:
            logger.info("Starting to write results locally")
            utils.local_write_output(
                self.data_frame,
                data_type=output_fs.file_type,
                path=output_fs.abs_file_path,
                write_mode=output_fs.write_mode,
                sheet_name=output_fs.output_excel_table.sheet_name,
                delimiter=output_fs.output_csv_table.delimiter,
                flow_id=flow_id,
                node_id=node_id,
            )
            logger.info("Finished writing output")
        return self

    def make_unique(self, unique_input: transform_schemas.UniqueInput = None) -> "FlowDataEngine":
        """Gets the unique rows from the DataFrame.

        Args:
            unique_input: A `UniqueInput` object specifying a subset of columns
                to consider for uniqueness and a strategy for keeping rows.

        Returns:
            A new `FlowDataEngine` instance with unique rows.
        """
        if unique_input is None or unique_input.columns is None:
            return FlowDataEngine(self.data_frame.unique())
        return FlowDataEngine(self.data_frame.unique(unique_input.columns, keep=unique_input.strategy))

    def concat(self, other: Iterable["FlowDataEngine"] | "FlowDataEngine") -> "FlowDataEngine":
        """Concatenates this DataFrame with one or more other DataFrames.

        Args:
            other: A single `FlowDataEngine` or an iterable of them.

        Returns:
            A new `FlowDataEngine` containing the concatenated data.
        """
        if isinstance(other, FlowDataEngine):
            other = [other]

        dfs: List[pl.LazyFrame] | List[pl.DataFrame] = [self.data_frame] + [flt.data_frame for flt in other]
        return FlowDataEngine(pl.concat(dfs, how='diagonal_relaxed'))

    def do_select(self, select_inputs: transform_schemas.SelectInputs,
                  keep_missing: bool = True) -> "FlowDataEngine":
        """Performs a complex column selection, renaming, and reordering operation.

        Args:
            select_inputs: A `SelectInputs` object defining the desired transformations.
            keep_missing: If True, columns not specified in `select_inputs` are kept.
                If False, they are dropped.

        Returns:
            A new `FlowDataEngine` with the transformed selection.
        """
        new_schema = deepcopy(self.schema)
        renames = [r for r in select_inputs.renames if r.is_available]

        if not keep_missing:
            drop_cols = set(self.data_frame.collect_schema().names()) - set(r.old_name for r in renames).union(
                set(r.old_name for r in renames if not r.keep))
            keep_cols = []
        else:
            keep_cols = list(set(self.data_frame.collect_schema().names()) - set(r.old_name for r in renames))
            drop_cols = set(r.old_name for r in renames if not r.keep)

        if len(drop_cols) > 0:
            new_schema = [s for s in new_schema if s.name not in drop_cols]
        new_schema_mapping = {v.name: v for v in new_schema}

        available_renames = []
        for rename in renames:
            if (rename.new_name != rename.old_name or rename.new_name not in new_schema_mapping) and rename.keep:
                schema_entry = new_schema_mapping.get(rename.old_name)
                if schema_entry is not None:
                    available_renames.append(rename)
                    schema_entry.column_name = rename.new_name

        rename_dict = {r.old_name: r.new_name for r in available_renames}
        fl = self.select_columns(
            list_select=[col_to_keep.old_name for col_to_keep in renames if col_to_keep.keep] + keep_cols)
        fl = fl.change_column_types(transforms=[r for r in renames if r.keep])
        ndf = fl.data_frame.rename(rename_dict)
        renames.sort(key=lambda r: 0 if r.position is None else r.position)
        sorted_cols = utils.match_order(ndf.collect_schema().names(),
                                        [r.new_name for r in renames] + self.data_frame.collect_schema().names())
        output_file = FlowDataEngine(ndf, number_of_records=self.number_of_records)
        return output_file.reorganize_order(sorted_cols)

    def set_streamable(self, streamable: bool = False):
        """Sets whether DataFrame operations should be streamable."""
        self._streamable = streamable

    def _calculate_schema(self) -> List[Dict]:
        """Calculates schema statistics."""
        if self.external_source is not None:
            self.collect_external()
        v = utils.calculate_schema(self.data_frame)
        return v

    def calculate_schema(self):
        """Calculates and returns the schema."""
        self._calculate_schema_stats = True
        return self.schema

    def count(self) -> int:
        """Gets the total number of records."""
        return self.get_number_of_records()

    @classmethod
    def create_from_path_worker(cls, received_table: input_schema.ReceivedTable, flow_id: int, node_id: int | str):
        """Creates a FlowDataEngine from a path in a worker process."""
        received_table.set_absolute_filepath()
        external_fetcher = ExternalCreateFetcher(received_table=received_table,
                                                 file_type=received_table.file_type, flow_id=flow_id, node_id=node_id)
        return cls(external_fetcher.get_result())
__name__ property

The name of the table.

cols_idx property

A dictionary mapping column names to their integer index.

data_frame property writable

The underlying Polars DataFrame or LazyFrame.

This property provides access to the Polars object that backs the FlowDataEngine. It handles lazy-loading from external sources if necessary.

Returns:

Type Description
LazyFrame | DataFrame | None

The active Polars DataFrame or LazyFrame.

external_source property

The external data source, if any.

has_errors property

Checks if there are any errors.

lazy property writable

Indicates if the DataFrame is in lazy mode.

number_of_fields property

The number of columns (fields) in the DataFrame.

Returns:

Type Description
int

The integer count of columns.

schema property

The schema of the DataFrame as a list of FlowfileColumn objects.

This property lazily calculates the schema if it hasn't been determined yet.

Returns:

Type Description
List[FlowfileColumn]

A list of FlowfileColumn objects describing the schema.

__call__()

Makes the class instance callable, returning itself.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __call__(self) -> "FlowDataEngine":
    """Makes the class instance callable, returning itself."""
    return self
__get_sample__(n_rows=100, streamable=True)

Internal method to get a sample of the data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __get_sample__(self, n_rows: int = 100, streamable: bool = True) -> "FlowDataEngine":
    """Internal method to get a sample of the data."""
    if not self.lazy:
        df = self.data_frame.lazy()
    else:
        df = self.data_frame

    if streamable:
        try:
            df = df.head(n_rows).collect()
        except Exception as e:
            logger.warning(f'Error in getting sample: {e}')
            df = df.head(n_rows).collect(engine="auto")
    else:
        df = self.collect()
    return FlowDataEngine(df, number_of_records=len(df), schema=self.schema)
__getitem__(item)

Accesses a specific column or item from the DataFrame.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __getitem__(self, item):
    """Accesses a specific column or item from the DataFrame."""
    return self.data_frame.select([item])
__init__(raw_data=None, path_ref=None, name=None, optimize_memory=True, schema=None, number_of_records=None, calculate_schema_stats=False, streamable=True, number_of_records_callback=None, data_callback=None)

Initializes the FlowDataEngine from various data sources.

Parameters:

Name Type Description Default
raw_data Union[List[Dict], List[Any], Dict[str, Any], ParquetFile, DataFrame, LazyFrame, RawData]

The input data. Can be a list of dicts, a Polars DataFrame/LazyFrame, or a RawData schema object.

None
path_ref str

A string path to a Parquet file.

None
name str

An optional name for the data engine instance.

None
optimize_memory bool

If True, prefers lazy operations to conserve memory.

True
schema List[FlowfileColumn] | List[str] | Schema

An optional schema definition. Can be a list of FlowfileColumn objects, a list of column names, or a Polars Schema.

None
number_of_records int

The number of records, if known.

None
calculate_schema_stats bool

If True, computes detailed statistics for each column.

False
streamable bool

If True, allows for streaming operations when possible.

True
number_of_records_callback Callable

A callback function to retrieve the number of records.

None
data_callback Callable

A callback function to retrieve the data.

None
Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __init__(self,
             raw_data: Union[List[Dict], List[Any], Dict[str, Any], 'ParquetFile', pl.DataFrame, pl.LazyFrame, input_schema.RawData] = None,
             path_ref: str = None,
             name: str = None,
             optimize_memory: bool = True,
             schema: List['FlowfileColumn'] | List[str] | pl.Schema = None,
             number_of_records: int = None,
             calculate_schema_stats: bool = False,
             streamable: bool = True,
             number_of_records_callback: Callable = None,
             data_callback: Callable = None):
    """Initializes the FlowDataEngine from various data sources.

    Args:
        raw_data: The input data. Can be a list of dicts, a Polars DataFrame/LazyFrame,
            or a `RawData` schema object.
        path_ref: A string path to a Parquet file.
        name: An optional name for the data engine instance.
        optimize_memory: If True, prefers lazy operations to conserve memory.
        schema: An optional schema definition. Can be a list of `FlowfileColumn` objects,
            a list of column names, or a Polars `Schema`.
        number_of_records: The number of records, if known.
        calculate_schema_stats: If True, computes detailed statistics for each column.
        streamable: If True, allows for streaming operations when possible.
        number_of_records_callback: A callback function to retrieve the number of records.
        data_callback: A callback function to retrieve the data.
    """
    self._initialize_attributes(number_of_records_callback, data_callback, streamable)

    if raw_data is not None:
        self._handle_raw_data(raw_data, number_of_records, optimize_memory)
    elif path_ref:
        self._handle_path_ref(path_ref, optimize_memory)
    else:
        self.initialize_empty_fl()
    self._finalize_initialization(name, optimize_memory, schema, calculate_schema_stats)
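A minimal usage sketch of the constructor, assuming FlowDataEngine can be imported from flowfile_core.flowfile.flow_data_engine.flow_data_engine (the module named in the source path above):

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

# Build from in-memory records (raw_data) or from an existing Polars (Lazy)Frame.
fde_from_records = FlowDataEngine(raw_data=[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
fde_from_polars = FlowDataEngine(pl.LazyFrame({"id": [1, 2], "name": ["a", "b"]}), name="example")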
__len__()

Returns the number of records in the table.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __len__(self) -> int:
    """Returns the number of records in the table."""
    return self.number_of_records if self.number_of_records >= 0 else self.get_number_of_records()
__repr__()

Returns a string representation of the FlowDataEngine.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __repr__(self) -> str:
    """Returns a string representation of the FlowDataEngine."""
    return f'flow data engine\n{self.data_frame.__repr__()}'
add_new_values(values, col_name=None)

Adds a new column with the provided values.

Parameters:

Name Type Description Default
values Iterable

An iterable (e.g., list, tuple) of values to add as a new column.

required
col_name str

The name for the new column. Defaults to 'new_values'.

None

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the added column.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def add_new_values(self, values: Iterable, col_name: str = None) -> "FlowDataEngine":
    """Adds a new column with the provided values.

    Args:
        values: An iterable (e.g., list, tuple) of values to add as a new column.
        col_name: The name for the new column. Defaults to 'new_values'.

    Returns:
        A new `FlowDataEngine` instance with the added column.
    """
    if col_name is None:
        col_name = 'new_values'
    return FlowDataEngine(self.data_frame.with_columns(pl.Series(values).alias(col_name)))
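A short illustrative sketch (same assumed import path as above); the number of values should match the row count, since the column is added via with_columns:

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.DataFrame({"id": [1, 2, 3]}))
flagged = fde.add_new_values([True, False, True], col_name="flag")  # adds a 'flag' column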
add_record_id(record_id_settings)

Adds a record ID (row number) column to the DataFrame.

Can generate a simple sequential ID or a grouped ID that resets for each group.

Parameters:

Name Type Description Default
record_id_settings RecordIdInput

A RecordIdInput object specifying the output column name, offset, and optional grouping columns.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the added record ID column.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def add_record_id(self, record_id_settings: transform_schemas.RecordIdInput) -> "FlowDataEngine":
    """Adds a record ID (row number) column to the DataFrame.

    Can generate a simple sequential ID or a grouped ID that resets for
    each group.

    Args:
        record_id_settings: A `RecordIdInput` object specifying the output
            column name, offset, and optional grouping columns.

    Returns:
        A new `FlowDataEngine` instance with the added record ID column.
    """
    if record_id_settings.group_by and len(record_id_settings.group_by_columns) > 0:
        return self._add_grouped_record_id(record_id_settings)
    return self._add_simple_record_id(record_id_settings)
apply_flowfile_formula(func, col_name, output_data_type=None)

Applies a formula to create a new column or transform an existing one.

Parameters:

Name Type Description Default
func str

A string containing a Polars expression formula.

required
col_name str

The name of the new or transformed column.

required
output_data_type DataType

The desired Polars data type for the output column.

None

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the applied formula.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def apply_flowfile_formula(self, func: str, col_name: str,
                           output_data_type: pl.DataType = None) -> "FlowDataEngine":
    """Applies a formula to create a new column or transform an existing one.

    Args:
        func: A string containing a Polars expression formula.
        col_name: The name of the new or transformed column.
        output_data_type: The desired Polars data type for the output column.

    Returns:
        A new `FlowDataEngine` instance with the applied formula.
    """
    parsed_func = to_expr(func)
    if output_data_type is not None:
        df2 = self.data_frame.with_columns(parsed_func.cast(output_data_type).alias(col_name))
    else:
        df2 = self.data_frame.with_columns(parsed_func.alias(col_name))

    return FlowDataEngine(df2, number_of_records=self.number_of_records)
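A hedged sketch: the formula string is parsed by the internal to_expr helper, whose exact grammar is not documented here, so the expression below is illustrative only:

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.DataFrame({"price": [10.0, 20.0]}))
# Illustrative formula string; adapt it to the expression syntax accepted by to_expr.
with_vat = fde.apply_flowfile_formula("price * 1.21", col_name="price_incl_vat", output_data_type=pl.Float64)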
apply_sql_formula(func, col_name, output_data_type=None)

Applies an SQL-style formula using pl.sql_expr.

Parameters:

Name Type Description Default
func str

A string containing an SQL expression.

required
col_name str

The name of the new or transformed column.

required
output_data_type DataType

The desired Polars data type for the output column.

None

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the applied formula.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def apply_sql_formula(self, func: str, col_name: str,
                      output_data_type: pl.DataType = None) -> "FlowDataEngine":
    """Applies an SQL-style formula using `pl.sql_expr`.

    Args:
        func: A string containing an SQL expression.
        col_name: The name of the new or transformed column.
        output_data_type: The desired Polars data type for the output column.

    Returns:
        A new `FlowDataEngine` instance with the applied formula.
    """
    expr = to_expr(func)
    if output_data_type not in (None, "Auto"):
        df = self.data_frame.with_columns(expr.cast(output_data_type).alias(col_name))
    else:
        df = self.data_frame.with_columns(expr.alias(col_name))

    return FlowDataEngine(df, number_of_records=self.number_of_records)
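A sketch assuming a plain SQL-style expression string evaluated against the existing columns (import path assumed as in the earlier sketches); the cast to output_data_type is optional:

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.DataFrame({"quantity": [2, 5], "unit_price": [3.0, 4.0]}))
totals = fde.apply_sql_formula("quantity * unit_price", col_name="total", output_data_type=pl.Float64)  # illustrative expression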
assert_equal(other, ordered=True, strict_schema=False)

Asserts that this DataFrame is equal to another.

Useful for testing.

Parameters:

Name Type Description Default
other FlowDataEngine

The other FlowDataEngine to compare with.

required
ordered bool

If True, the row order must be identical.

True
strict_schema bool

If True, the data types of the schemas must be identical.

False

Raises:

Type Description
Exception

If the DataFrames are not equal based on the specified criteria.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def assert_equal(self, other: "FlowDataEngine", ordered: bool = True, strict_schema: bool = False):
    """Asserts that this DataFrame is equal to another.

    Useful for testing.

    Args:
        other: The other `FlowDataEngine` to compare with.
        ordered: If True, the row order must be identical.
        strict_schema: If True, the data types of the schemas must be identical.

    Raises:
        Exception: If the DataFrames are not equal based on the specified criteria.
    """
    org_laziness = self.lazy, other.lazy
    self.lazy = False
    other.lazy = False
    self.number_of_records = -1
    other.number_of_records = -1
    other = other.select_columns(self.columns)

    if self.get_number_of_records() != other.get_number_of_records():
        raise Exception('Number of records is not equal')

    if self.columns != other.columns:
        raise Exception('Schema is not equal')

    if strict_schema:
        assert self.data_frame.schema == other.data_frame.schema, 'Data types do not match'

    if ordered:
        self_lf = self.data_frame.sort(by=self.columns)
        other_lf = other.data_frame.sort(by=other.columns)
    else:
        self_lf = self.data_frame
        other_lf = other.data_frame

    self.lazy, other.lazy = org_laziness
    assert self_lf.equals(other_lf), 'Data is not equal'
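A small testing sketch (import path assumed as above):

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

a = FlowDataEngine(pl.DataFrame({"id": [1, 2], "name": ["a", "b"]}))
b = FlowDataEngine(pl.DataFrame({"id": [1, 2], "name": ["a", "b"]}))
a.assert_equal(b)                      # passes silently when the data matches
a.assert_equal(b, strict_schema=True)  # additionally requires identical dtypes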
cache()

Caches the current DataFrame to disk and updates the internal reference.

This triggers a background process to write the current LazyFrame's result to a temporary file. Subsequent operations on this FlowDataEngine instance will read from the cached file, which can speed up downstream computations.

Returns:

Type Description
FlowDataEngine

The same FlowDataEngine instance, now backed by the cached data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def cache(self) -> "FlowDataEngine":
    """Caches the current DataFrame to disk and updates the internal reference.

    This triggers a background process to write the current LazyFrame's result
    to a temporary file. Subsequent operations on this `FlowDataEngine` instance
    will read from the cached file, which can speed up downstream computations.

    Returns:
        The same `FlowDataEngine` instance, now backed by the cached data.
    """
    edf = ExternalDfFetcher(lf=self.data_frame, file_ref=str(id(self)), wait_on_completion=False,
                            flow_id=-1,
                            node_id=-1)
    logger.info('Caching data in background')
    result = edf.get_result()
    if isinstance(result, pl.LazyFrame):
        logger.info('Data cached')
        del self._data_frame
        self.data_frame = result
        logger.info('Data loaded from cache')
    return self
calculate_schema()

Calculates and returns the schema.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def calculate_schema(self):
    """Calculates and returns the schema."""
    self._calculate_schema_stats = True
    return self.schema
change_column_types(transforms, calculate_schema=False)

Changes the data type of one or more columns.

Parameters:

Name Type Description Default
transforms List[SelectInput]

A list of SelectInput objects, where each object specifies the column and its new polars_type.

required
calculate_schema bool

If True, recalculates the schema after the type change.

False

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the updated column types.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def change_column_types(self, transforms: List[transform_schemas.SelectInput],
                        calculate_schema: bool = False) -> "FlowDataEngine":
    """Changes the data type of one or more columns.

    Args:
        transforms: A list of `SelectInput` objects, where each object specifies
            the column and its new `polars_type`.
        calculate_schema: If True, recalculates the schema after the type change.

    Returns:
        A new `FlowDataEngine` instance with the updated column types.
    """
    dtypes = [dtype.base_type() for dtype in self.data_frame.collect_schema().dtypes()]
    idx_mapping = list(
        (transform.old_name, self.cols_idx.get(transform.old_name), getattr(pl, transform.polars_type))
        for transform in transforms if transform.data_type is not None
    )

    actual_transforms = [c for c in idx_mapping if c[2] != dtypes[c[1]]]
    transformations = [
        utils.define_pl_col_transformation(col_name=transform[0], col_type=transform[2])
        for transform in actual_transforms
    ]

    df = self.data_frame.with_columns(transformations)
    return FlowDataEngine(
        df,
        number_of_records=self.number_of_records,
        calculate_schema_stats=calculate_schema,
        streamable=self._streamable
    )
collect(n_records=None)

Collects the data and returns it as a Polars DataFrame.

This method triggers the execution of the lazy query plan (if applicable) and returns the result. It supports streaming to optimize memory usage for large datasets.

Parameters:

Name Type Description Default
n_records int

The maximum number of records to collect. If None, all records are collected.

None

Returns:

Type Description
DataFrame

A Polars DataFrame containing the collected data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def collect(self, n_records: int = None) -> pl.DataFrame:
    """Collects the data and returns it as a Polars DataFrame.

    This method triggers the execution of the lazy query plan (if applicable)
    and returns the result. It supports streaming to optimize memory usage
    for large datasets.

    Args:
        n_records: The maximum number of records to collect. If None, all
            records are collected.

    Returns:
        A Polars `DataFrame` containing the collected data.
    """
    if n_records is None:
        logger.info(f'Fetching all data for Table object "{id(self)}". Settings: streaming={self._streamable}')
    else:
        logger.info(f'Fetching {n_records} record(s) for Table object "{id(self)}". '
                    f'Settings: streaming={self._streamable}')

    if not self.lazy:
        return self.data_frame

    try:
        return self._collect_data(n_records)
    except Exception as e:
        self.errors = [e]
        return self._handle_collection_error(n_records)
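Illustrative usage (import path assumed as above); n_records caps how many rows are materialized:

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.LazyFrame({"x": list(range(10))}))
df_all = fde.collect()    # full Polars DataFrame
df_head = fde.collect(5)  # at most 5 records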
collect_external()

Materializes data from a tracked external source.

If the FlowDataEngine was created from an ExternalDataSource, this method will trigger the data retrieval, update the internal _data_frame to a LazyFrame of the collected data, and reset the schema to be re-evaluated.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def collect_external(self):
    """Materializes data from a tracked external source.

    If the `FlowDataEngine` was created from an `ExternalDataSource`, this
    method will trigger the data retrieval, update the internal `_data_frame`
    to a `LazyFrame` of the collected data, and reset the schema to be
    re-evaluated.
    """
    if self._external_source is not None:
        logger.info('Collecting external source')
        if self.external_source.get_pl_df() is not None:
            self.data_frame = self.external_source.get_pl_df().lazy()
        else:
            self.data_frame = pl.LazyFrame(list(self.external_source.get_iter()))
        self._schema = None  # enforce reset schema
concat(other)

Concatenates this DataFrame with one or more other DataFrames.

Parameters:

Name Type Description Default
other Iterable[FlowDataEngine] | FlowDataEngine

A single FlowDataEngine or an iterable of them.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine containing the concatenated data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def concat(self, other: Iterable["FlowDataEngine"] | "FlowDataEngine") -> "FlowDataEngine":
    """Concatenates this DataFrame with one or more other DataFrames.

    Args:
        other: A single `FlowDataEngine` or an iterable of them.

    Returns:
        A new `FlowDataEngine` containing the concatenated data.
    """
    if isinstance(other, FlowDataEngine):
        other = [other]

    dfs: List[pl.LazyFrame] | List[pl.DataFrame] = [self.data_frame] + [flt.data_frame for flt in other]
    return FlowDataEngine(pl.concat(dfs, how='diagonal_relaxed'))
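A sketch of a diagonal concatenation (import path assumed as above); because the underlying call uses how='diagonal_relaxed', columns missing on one side are filled with nulls:

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

a = FlowDataEngine(pl.DataFrame({"x": [1]}))
b = FlowDataEngine(pl.DataFrame({"x": [2], "y": ["extra"]}))
combined = a.concat(b)  # two rows; 'y' is null for the row coming from a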
count()

Gets the total number of records.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def count(self) -> int:
    """Gets the total number of records."""
    return self.get_number_of_records()
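A trivial sketch (import path assumed as above):

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.DataFrame({"x": [1, 2, 3]}))
assert fde.count() == 3  # delegates to get_number_of_records()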
create_from_external_source(external_source) classmethod

Creates a FlowDataEngine from an external data source.

Parameters:

Name Type Description Default
external_source ExternalDataSource

An object that conforms to the ExternalDataSource interface.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_from_external_source(cls, external_source: ExternalDataSource) -> "FlowDataEngine":
    """Creates a FlowDataEngine from an external data source.

    Args:
        external_source: An object that conforms to the `ExternalDataSource`
            interface.

    Returns:
        A new `FlowDataEngine` instance.
    """
    if external_source.schema is not None:
        ff = cls.create_from_schema(external_source.schema)
    elif external_source.initial_data_getter is not None:
        ff = cls(raw_data=external_source.initial_data_getter())
    else:
        ff = cls()
    ff._external_source = external_source
    return ff
create_from_path(received_table) classmethod

Creates a FlowDataEngine from a local file path.

Supports various file types like CSV, Parquet, and Excel.

Parameters:

Name Type Description Default
received_table ReceivedTableBase

A ReceivedTableBase object containing the file path and format details.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with data from the file.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_from_path(cls, received_table: input_schema.ReceivedTableBase) -> "FlowDataEngine":
    """Creates a FlowDataEngine from a local file path.

    Supports various file types like CSV, Parquet, and Excel.

    Args:
        received_table: A `ReceivedTableBase` object containing the file path
            and format details.

    Returns:
        A new `FlowDataEngine` instance with data from the file.
    """
    received_table.set_absolute_filepath()
    file_type_handlers = {
        'csv': create_funcs.create_from_path_csv,
        'parquet': create_funcs.create_from_path_parquet,
        'excel': create_funcs.create_from_path_excel
    }

    handler = file_type_handlers.get(received_table.file_type)
    if not handler:
        raise Exception(f'Cannot create from {received_table.file_type}')

    flow_file = cls(handler(received_table))
    flow_file._org_path = received_table.abs_file_path
    return flow_file
create_from_path_worker(received_table, flow_id, node_id) classmethod

Creates a FlowDataEngine from a path in a worker process.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_from_path_worker(cls, received_table: input_schema.ReceivedTable, flow_id: int, node_id: int | str):
    """Creates a FlowDataEngine from a path in a worker process."""
    received_table.set_absolute_filepath()
    external_fetcher = ExternalCreateFetcher(received_table=received_table,
                                             file_type=received_table.file_type, flow_id=flow_id, node_id=node_id)
    return cls(external_fetcher.get_result())
create_from_schema(schema) classmethod

Creates an empty FlowDataEngine from a schema definition.

Parameters:

Name Type Description Default
schema List[FlowfileColumn]

A list of FlowfileColumn objects defining the schema.

required

Returns:

Type Description
FlowDataEngine

A new, empty FlowDataEngine instance with the specified schema.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_from_schema(cls, schema: List[FlowfileColumn]) -> "FlowDataEngine":
    """Creates an empty FlowDataEngine from a schema definition.

    Args:
        schema: A list of `FlowfileColumn` objects defining the schema.

    Returns:
        A new, empty `FlowDataEngine` instance with the specified schema.
    """
    pl_schema = []
    for i, flow_file_column in enumerate(schema):
        pl_schema.append((flow_file_column.name, cast_str_to_polars_type(flow_file_column.data_type)))
        schema[i].col_index = i
    df = pl.LazyFrame(schema=pl_schema)
    return cls(df, schema=schema, calculate_schema_stats=False, number_of_records=0)
create_from_sql(sql, conn) classmethod

Creates a FlowDataEngine by executing a SQL query.

Parameters:

Name Type Description Default
sql str

The SQL query string to execute.

required
conn Any

A database connection object or connection URI string.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the query result.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_from_sql(cls, sql: str, conn: Any) -> "FlowDataEngine":
    """Creates a FlowDataEngine by executing a SQL query.

    Args:
        sql: The SQL query string to execute.
        conn: A database connection object or connection URI string.

    Returns:
        A new `FlowDataEngine` instance with the query result.
    """
    return cls(pl.read_sql(sql, conn))
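An illustrative sketch only: the query and connection URI below are placeholders, and the call requires a SQL backend that Polars can read from:

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

# Placeholder query and connection URI; substitute your own database details.
fde = FlowDataEngine.create_from_sql("SELECT 1 AS id, 'a' AS name", "sqlite:///example.db")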
create_random(number_of_records=1000) classmethod

Creates a FlowDataEngine with randomly generated data.

Useful for testing and examples.

Parameters:

Name Type Description Default
number_of_records int

The number of random records to generate.

1000

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with fake data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_random(cls, number_of_records: int = 1000) -> "FlowDataEngine":
    """Creates a FlowDataEngine with randomly generated data.

    Useful for testing and examples.

    Args:
        number_of_records: The number of random records to generate.

    Returns:
        A new `FlowDataEngine` instance with fake data.
    """
    return cls(create_fake_data(number_of_records))
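Handy for tests and demos (import path assumed as above):

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

sample = FlowDataEngine.create_random(number_of_records=100)  # 100 rows of generated fake data
print(sample.count())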
do_cross_join(cross_join_input, auto_generate_selection, verify_integrity, other)

Performs a cross join with another DataFrame.

A cross join produces the Cartesian product of the two DataFrames.

Parameters:

Name Type Description Default
cross_join_input CrossJoinInput

A CrossJoinInput object specifying column selections.

required
auto_generate_selection bool

If True, automatically renames columns to avoid conflicts.

required
verify_integrity bool

If True, checks if the resulting join would be too large.

required
other FlowDataEngine

The right FlowDataEngine to join with.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine with the result of the cross join.

Raises:

Type Description
Exception

If verify_integrity is True and the join would result in an excessively large number of records.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_cross_join(self, cross_join_input: transform_schemas.CrossJoinInput,
                  auto_generate_selection: bool, verify_integrity: bool,
                  other: "FlowDataEngine") -> "FlowDataEngine":
    """Performs a cross join with another DataFrame.

    A cross join produces the Cartesian product of the two DataFrames.

    Args:
        cross_join_input: A `CrossJoinInput` object specifying column selections.
        auto_generate_selection: If True, automatically renames columns to avoid conflicts.
        verify_integrity: If True, checks if the resulting join would be too large.
        other: The right `FlowDataEngine` to join with.

    Returns:
        A new `FlowDataEngine` with the result of the cross join.

    Raises:
        Exception: If `verify_integrity` is True and the join would result in
            an excessively large number of records.
    """
    self.lazy = True
    other.lazy = True

    verify_join_select_integrity(cross_join_input, left_columns=self.columns, right_columns=other.columns)

    right_select = [v.old_name for v in cross_join_input.right_select.renames
                    if (v.keep or v.join_key) and v.is_available]
    left_select = [v.old_name for v in cross_join_input.left_select.renames
                   if (v.keep or v.join_key) and v.is_available]

    left = self.data_frame.select(left_select).rename(cross_join_input.left_select.rename_table)
    right = other.data_frame.select(right_select).rename(cross_join_input.right_select.rename_table)

    if verify_integrity:
        n_records = self.get_number_of_records() * other.get_number_of_records()
        if n_records > 1_000_000_000:
            raise Exception("Join will result in too many records, ending process")
    else:
        n_records = -1

    joined_df = left.join(right, how='cross')

    cols_to_delete_after = [col.new_name for col in
                            cross_join_input.left_select.renames + cross_join_input.right_select.renames
                            if col.join_key and not col.keep and col.is_available]

    if verify_integrity:
        return FlowDataEngine(joined_df.drop(cols_to_delete_after), calculate_schema_stats=False,
                             number_of_records=n_records, streamable=False)
    else:
        fl = FlowDataEngine(joined_df.drop(cols_to_delete_after), calculate_schema_stats=False,
                           number_of_records=0, streamable=False)
        return fl
do_filter(predicate)

Filters rows based on a predicate expression.

Parameters:

Name Type Description Default
predicate str

A string containing a Polars expression that evaluates to a boolean value.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance containing only the rows that match the predicate.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_filter(self, predicate: str) -> "FlowDataEngine":
    """Filters rows based on a predicate expression.

    Args:
        predicate: A string containing a Polars expression that evaluates to
            a boolean value.

    Returns:
        A new `FlowDataEngine` instance containing only the rows that match
        the predicate.
    """
    try:
        f = to_expr(predicate)
    except Exception as e:
        logger.warning(f'Error in filter expression: {e}')
        f = to_expr("False")
    df = self.data_frame.filter(f)
    return FlowDataEngine(df, schema=self.schema, streamable=self._streamable)
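A hedged sketch: the predicate string is parsed by the internal to_expr helper, so the expression below is illustrative; if the string fails to parse, the method logs a warning and falls back to a literal False predicate:

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.DataFrame({"age": [15, 30, 45]}))
adults = fde.do_filter("age >= 18")  # illustrative predicate string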
do_fuzzy_join(fuzzy_match_input, other, file_ref, flow_id=-1, node_id=-1)

Performs a fuzzy join with another DataFrame.

This method blocks until the fuzzy join operation is complete.

Parameters:

Name Type Description Default
fuzzy_match_input FuzzyMatchInput

A FuzzyMatchInput object with the matching parameters.

required
other FlowDataEngine

The right FlowDataEngine to join with.

required
file_ref str

A reference string for temporary files.

required
flow_id int

The flow ID for tracking.

-1
node_id int | str

The node ID for tracking.

-1

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the result of the fuzzy join.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_fuzzy_join(self, fuzzy_match_input: transform_schemas.FuzzyMatchInput,
                  other: "FlowDataEngine", file_ref: str, flow_id: int = -1,
                  node_id: int | str = -1) -> "FlowDataEngine":
    """Performs a fuzzy join with another DataFrame.

    This method blocks until the fuzzy join operation is complete.

    Args:
        fuzzy_match_input: A `FuzzyMatchInput` object with the matching parameters.
        other: The right `FlowDataEngine` to join with.
        file_ref: A reference string for temporary files.
        flow_id: The flow ID for tracking.
        node_id: The node ID for tracking.

    Returns:
        A new `FlowDataEngine` instance with the result of the fuzzy join.
    """
    left_df, right_df = prepare_for_fuzzy_match(left=self, right=other,
                                                fuzzy_match_input=fuzzy_match_input)
    f = ExternalFuzzyMatchFetcher(left_df, right_df,
                                  fuzzy_maps=fuzzy_match_input.fuzzy_maps,
                                  file_ref=file_ref + '_fm',
                                  wait_on_completion=True,
                                  flow_id=flow_id,
                                  node_id=node_id)
    return FlowDataEngine(f.get_result())
do_group_by(group_by_input, calculate_schema_stats=True)

Performs a group-by operation on the DataFrame.

Parameters:

Name Type Description Default
group_by_input GroupByInput

A GroupByInput object defining the grouping columns and aggregations.

required
calculate_schema_stats bool

If True, calculates schema statistics for the resulting DataFrame.

True

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the grouped and aggregated data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_group_by(self, group_by_input: transform_schemas.GroupByInput,
                calculate_schema_stats: bool = True) -> "FlowDataEngine":
    """Performs a group-by operation on the DataFrame.

    Args:
        group_by_input: A `GroupByInput` object defining the grouping columns
            and aggregations.
        calculate_schema_stats: If True, calculates schema statistics for the
            resulting DataFrame.

    Returns:
        A new `FlowDataEngine` instance with the grouped and aggregated data.
    """
    aggregations = [c for c in group_by_input.agg_cols if c.agg != 'groupby']
    group_columns = [c for c in group_by_input.agg_cols if c.agg == 'groupby']

    if len(group_columns) == 0:
        return FlowDataEngine(
            self.data_frame.select(
                ac.agg_func(ac.old_name).alias(ac.new_name) for ac in aggregations
            ),
            calculate_schema_stats=calculate_schema_stats
        )

    df = self.data_frame.rename({c.old_name: c.new_name for c in group_columns})
    group_by_columns = [n_c.new_name for n_c in group_columns]
    return FlowDataEngine(
        df.group_by(*group_by_columns).agg(
            ac.agg_func(ac.old_name).alias(ac.new_name) for ac in aggregations
        ),
        calculate_schema_stats=calculate_schema_stats
    )
do_pivot(pivot_input, node_logger=None)

Converts the DataFrame from a long to a wide format, aggregating values.

Parameters:

- `pivot_input` (PivotInput, required): A `PivotInput` object defining the index, pivot, and value columns, along with the aggregation logic.
- `node_logger` (NodeLogger, default `None`): An optional logger for reporting warnings, e.g., if the pivot column has too many unique values.

Returns:

- FlowDataEngine: A new, pivoted `FlowDataEngine` instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_pivot(self, pivot_input: transform_schemas.PivotInput, node_logger: NodeLogger = None) -> "FlowDataEngine":
    """Converts the DataFrame from a long to a wide format, aggregating values.

    Args:
        pivot_input: A `PivotInput` object defining the index, pivot, and value
            columns, along with the aggregation logic.
        node_logger: An optional logger for reporting warnings, e.g., if the
            pivot column has too many unique values.

    Returns:
        A new, pivoted `FlowDataEngine` instance.
    """
    # Get unique values for pivot columns
    max_unique_vals = 200
    new_cols_unique = fetch_unique_values(self.data_frame.select(pivot_input.pivot_column)
                                          .unique()
                                          .sort(pivot_input.pivot_column)
                                          .limit(max_unique_vals).cast(pl.String))
    if len(new_cols_unique) >= max_unique_vals:
        if node_logger:
            node_logger.warning('Pivot column has too many unique values. Please consider using a different column.'
                                f' Max unique values: {max_unique_vals}')

    if len(pivot_input.index_columns) == 0:
        no_index_cols = True
        pivot_input.index_columns = ['__temp__']
        ff = self.apply_flowfile_formula('1', col_name='__temp__')
    else:
        no_index_cols = False
        ff = self

    # Perform pivot operations
    index_columns = pivot_input.get_index_columns()
    grouped_ff = ff.do_group_by(pivot_input.get_group_by_input(), False)
    pivot_column = pivot_input.get_pivot_column()

    input_df = grouped_ff.data_frame.with_columns(
        pivot_column.cast(pl.String).alias(pivot_input.pivot_column)
    )
    number_of_aggregations = len(pivot_input.aggregations)
    df = (
        input_df.select(
            *index_columns,
            pivot_column,
            pivot_input.get_values_expr()
        )
        .group_by(*index_columns)
        .agg([
            (pl.col('vals').filter(pivot_column == new_col_value))
            .first()
            .alias(new_col_value)
            for new_col_value in new_cols_unique
        ])
        .select(
            *index_columns,
            *[
                pl.col(new_col).struct.field(agg).alias(f'{new_col + "_" + agg if number_of_aggregations > 1 else new_col }')
                for new_col in new_cols_unique
                for agg in pivot_input.aggregations
            ]
        )
    )

    # Clean up temporary columns if needed
    if no_index_cols:
        df = df.drop('__temp__')
        pivot_input.index_columns = []

    return FlowDataEngine(df, calculate_schema_stats=False)
do_select(select_inputs, keep_missing=True)

Performs a complex column selection, renaming, and reordering operation.

Parameters:

- `select_inputs` (SelectInputs, required): A `SelectInputs` object defining the desired transformations.
- `keep_missing` (bool, default `True`): If True, columns not specified in `select_inputs` are kept. If False, they are dropped.

Returns:

- FlowDataEngine: A new `FlowDataEngine` with the transformed selection.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_select(self, select_inputs: transform_schemas.SelectInputs,
              keep_missing: bool = True) -> "FlowDataEngine":
    """Performs a complex column selection, renaming, and reordering operation.

    Args:
        select_inputs: A `SelectInputs` object defining the desired transformations.
        keep_missing: If True, columns not specified in `select_inputs` are kept.
            If False, they are dropped.

    Returns:
        A new `FlowDataEngine` with the transformed selection.
    """
    new_schema = deepcopy(self.schema)
    renames = [r for r in select_inputs.renames if r.is_available]

    if not keep_missing:
        drop_cols = set(self.data_frame.collect_schema().names()) - set(r.old_name for r in renames).union(
            set(r.old_name for r in renames if not r.keep))
        keep_cols = []
    else:
        keep_cols = list(set(self.data_frame.collect_schema().names()) - set(r.old_name for r in renames))
        drop_cols = set(r.old_name for r in renames if not r.keep)

    if len(drop_cols) > 0:
        new_schema = [s for s in new_schema if s.name not in drop_cols]
    new_schema_mapping = {v.name: v for v in new_schema}

    available_renames = []
    for rename in renames:
        if (rename.new_name != rename.old_name or rename.new_name not in new_schema_mapping) and rename.keep:
            schema_entry = new_schema_mapping.get(rename.old_name)
            if schema_entry is not None:
                available_renames.append(rename)
                schema_entry.column_name = rename.new_name

    rename_dict = {r.old_name: r.new_name for r in available_renames}
    fl = self.select_columns(
        list_select=[col_to_keep.old_name for col_to_keep in renames if col_to_keep.keep] + keep_cols)
    fl = fl.change_column_types(transforms=[r for r in renames if r.keep])
    ndf = fl.data_frame.rename(rename_dict)
    renames.sort(key=lambda r: 0 if r.position is None else r.position)
    sorted_cols = utils.match_order(ndf.collect_schema().names(),
                                    [r.new_name for r in renames] + self.data_frame.collect_schema().names())
    output_file = FlowDataEngine(ndf, number_of_records=self.number_of_records)
    return output_file.reorganize_order(sorted_cols)
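
A hedged sketch of building the selection from the current schema via `get_select_inputs()` (documented later in this section). It assumes the `SelectInput` entries returned there expose writable `new_name` and `keep` attributes with sensible defaults, matching the attributes read in the method body above; the import path is inferred from the source reference.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

engine = FlowDataEngine(pl.LazyFrame({"id": [1, 2], "name": ["a", "b"], "tmp": [0, 0]}))
select_inputs = engine.get_select_inputs()      # one SelectInput per current column
for rename in select_inputs.renames:
    if rename.old_name == "name":
        rename.new_name = "customer_name"       # rename this column
    if rename.old_name == "tmp":
        rename.keep = False                     # drop this column
result = engine.do_select(select_inputs, keep_missing=True)
print(result.columns)
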
do_sort(sorts)

Sorts the DataFrame by one or more columns.

Parameters:

- `sorts` (List[SortByInput], required): A list of `SortByInput` objects, each specifying a column and sort direction ('asc' or 'desc').

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance with the sorted data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_sort(self, sorts: List[transform_schemas.SortByInput]) -> "FlowDataEngine":
    """Sorts the DataFrame by one or more columns.

    Args:
        sorts: A list of `SortByInput` objects, each specifying a column
            and sort direction ('asc' or 'desc').

    Returns:
        A new `FlowDataEngine` instance with the sorted data.
    """
    if not sorts:
        return self

    descending = [s.how == 'desc' or s.how.lower() == 'descending' for s in sorts]
    df = self.data_frame.sort([sort_by.column for sort_by in sorts], descending=descending)
    return FlowDataEngine(df, number_of_records=self.number_of_records, schema=self.schema)
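
A short sketch. The `SortByInput` keyword names are inferred from the attributes (`column`, `how`) read in the method body above, and the `transform_schemas` import path is an assumption.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference
from flowfile_core.schemas import transform_schemas  # import path assumed

engine = FlowDataEngine(pl.LazyFrame({"name": ["a", "b", "c"], "age": [30, 25, 41]}))
sorted_engine = engine.do_sort([
    transform_schemas.SortByInput(column="age", how="desc"),  # field names assumed from the attributes used above
])
print(sorted_engine.to_pylist())
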
drop_columns(columns)

Drops specified columns from the DataFrame.

Parameters:

- `columns` (List[str], required): A list of column names to drop.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance without the dropped columns.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def drop_columns(self, columns: List[str]) -> "FlowDataEngine":
    """Drops specified columns from the DataFrame.

    Args:
        columns: A list of column names to drop.

    Returns:
        A new `FlowDataEngine` instance without the dropped columns.
    """
    cols_for_select = tuple(set(self.columns) - set(columns))
    idx_to_keep = [self.cols_idx.get(c) for c in cols_for_select]
    new_schema = [self.schema[i] for i in idx_to_keep]

    return FlowDataEngine(
        self.data_frame.select(cols_for_select),
        number_of_records=self.number_of_records,
        schema=new_schema
    )
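
A minimal sketch, assuming the import path inferred from the source reference:

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

engine = FlowDataEngine(pl.LazyFrame({"id": [1, 2], "debug_flag": [True, False]}))
slim = engine.drop_columns(["debug_flag"])
print(slim.columns)  # the dropped column is gone; note the remaining column order is not guaranteed
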
from_cloud_storage_obj(settings) classmethod

Creates a FlowDataEngine from an object in cloud storage.

This method supports reading from various cloud storage providers like AWS S3, Azure Data Lake Storage, and Google Cloud Storage, with support for various authentication methods.

Parameters:

- `settings` (CloudStorageReadSettingsInternal, required): A `CloudStorageReadSettingsInternal` object containing connection details, file format, and read options.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance containing the data from cloud storage.

Raises:

- ValueError: If the storage type or file format is not supported.
- NotImplementedError: If a requested file format like "delta" or "iceberg" is not yet implemented.
- Exception: If reading from cloud storage fails.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def from_cloud_storage_obj(cls, settings: cloud_storage_schemas.CloudStorageReadSettingsInternal) -> "FlowDataEngine":
    """Creates a FlowDataEngine from an object in cloud storage.

    This method supports reading from various cloud storage providers like AWS S3,
    Azure Data Lake Storage, and Google Cloud Storage, with support for
    various authentication methods.

    Args:
        settings: A `CloudStorageReadSettingsInternal` object containing connection
            details, file format, and read options.

    Returns:
        A new `FlowDataEngine` instance containing the data from cloud storage.

    Raises:
        ValueError: If the storage type or file format is not supported.
        NotImplementedError: If a requested file format like "delta" or "iceberg"
            is not yet implemented.
        Exception: If reading from cloud storage fails.
    """
    connection = settings.connection
    read_settings = settings.read_settings

    logger.info(f"Reading from {connection.storage_type} storage: {read_settings.resource_path}")
    # Get storage options based on connection type
    storage_options = CloudStorageReader.get_storage_options(connection)
    # Get credential provider if needed
    credential_provider = CloudStorageReader.get_credential_provider(connection)
    if read_settings.file_format == "parquet":
        return cls._read_parquet_from_cloud(
            read_settings.resource_path,
            storage_options,
            credential_provider,
            read_settings.scan_mode == "directory",
        )
    elif read_settings.file_format == "delta":
        return cls._read_delta_from_cloud(
            read_settings.resource_path,
            storage_options,
            credential_provider,
            read_settings
        )
    elif read_settings.file_format == "csv":
        return cls._read_csv_from_cloud(
            read_settings.resource_path,
            storage_options,
            credential_provider,
            read_settings
        )
    elif read_settings.file_format == "json":
        return cls._read_json_from_cloud(
            read_settings.resource_path,
            storage_options,
            credential_provider,
            read_settings.scan_mode == "directory"
        )
    elif read_settings.file_format == "iceberg":
        return cls._read_iceberg_from_cloud(
            read_settings.resource_path,
            storage_options,
            credential_provider,
            read_settings
        )

    elif read_settings.file_format in ["delta", "iceberg"]:
        # These would require additional libraries
        raise NotImplementedError(f"File format {read_settings.file_format} not yet implemented")
    else:
        raise ValueError(f"Unsupported file format: {read_settings.file_format}")
fuzzy_match(right, left_on, right_on, fuzzy_method='levenshtein', threshold=0.75)

Performs a simple fuzzy match between two DataFrames on a single column pair.

This is a convenience method for a common fuzzy join scenario.

Parameters:

- `right` (FlowDataEngine, required): The right `FlowDataEngine` to match against.
- `left_on` (str, required): The column name from the left DataFrame to match on.
- `right_on` (str, required): The column name from the right DataFrame to match on.
- `fuzzy_method` (str, default `'levenshtein'`): The fuzzy matching algorithm to use (e.g., 'levenshtein').
- `threshold` (float, default `0.75`): The similarity score threshold (0.0 to 1.0) for a match.

Returns:

- FlowDataEngine: A new `FlowDataEngine` with the matched data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def fuzzy_match(self, right: "FlowDataEngine", left_on: str, right_on: str,
                fuzzy_method: str = 'levenshtein', threshold: float = 0.75) -> "FlowDataEngine":
    """Performs a simple fuzzy match between two DataFrames on a single column pair.

    This is a convenience method for a common fuzzy join scenario.

    Args:
        right: The right `FlowDataEngine` to match against.
        left_on: The column name from the left DataFrame to match on.
        right_on: The column name from the right DataFrame to match on.
        fuzzy_method: The fuzzy matching algorithm to use (e.g., 'levenshtein').
        threshold: The similarity score threshold (0.0 to 1.0) for a match.

    Returns:
        A new `FlowDataEngine` with the matched data.
    """
    fuzzy_match_input = transform_schemas.FuzzyMatchInput(
        [transform_schemas.FuzzyMap(
            left_on, right_on,
            fuzzy_type=fuzzy_method,
            threshold_score=threshold
        )],
        left_select=self.columns,
        right_select=right.columns
    )
    return self.do_fuzzy_join(fuzzy_match_input, right, str(id(self)))
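
A minimal sketch of the convenience wrapper, using only the documented signature and an import path inferred from the source reference. Because it delegates to `do_fuzzy_join`, the call blocks and may hand the work to a worker process.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

left = FlowDataEngine(pl.LazyFrame({"company": ["Acme Inc", "Globex"]}))
right = FlowDataEngine(pl.LazyFrame({"supplier": ["ACME Incorporated", "Initech"]}))
matched = left.fuzzy_match(right, left_on="company", right_on="supplier",
                           fuzzy_method="levenshtein", threshold=0.6)
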
generate_enumerator(length=1000, output_name='output_column') classmethod

Generates a FlowDataEngine with a single column containing a sequence of integers.

Parameters:

- `length` (int, default `1000`): The number of integers to generate in the sequence.
- `output_name` (str, default `'output_column'`): The name of the output column.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def generate_enumerator(cls, length: int = 1000, output_name: str = 'output_column') -> "FlowDataEngine":
    """Generates a FlowDataEngine with a single column containing a sequence of integers.

    Args:
        length: The number of integers to generate in the sequence.
        output_name: The name of the output column.

    Returns:
        A new `FlowDataEngine` instance.
    """
    if length > 10_000_000:
        length = 10_000_000
    return cls(pl.LazyFrame().select((pl.int_range(0, length, dtype=pl.UInt32)).alias(output_name)))
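
A minimal sketch of the classmethod, assuming the import path inferred from the source reference:

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

ids = FlowDataEngine.generate_enumerator(length=5, output_name="row_id")
print(ids.to_pylist())  # [{'row_id': 0}, {'row_id': 1}, ..., {'row_id': 4}]
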
get_estimated_file_size()

Estimates the file size in bytes if the data originated from a local file.

This relies on the original path being tracked during file ingestion.

Returns:

- int: The file size in bytes, or 0 if the original path is unknown.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_estimated_file_size(self) -> int:
    """Estimates the file size in bytes if the data originated from a local file.

    This relies on the original path being tracked during file ingestion.

    Returns:
        The file size in bytes, or 0 if the original path is unknown.
    """
    if self._org_path is not None:
        return os.path.getsize(self._org_path)
    return 0
get_number_of_records(warn=False, force_calculate=False, calculate_in_worker_process=False)

Gets the total number of records in the DataFrame.

For lazy frames, this may trigger a full data scan, which can be expensive.

Parameters:

- `warn` (bool, default `False`): If True, logs a warning if a potentially expensive calculation is triggered.
- `force_calculate` (bool, default `False`): If True, forces recalculation even if a value is cached.
- `calculate_in_worker_process` (bool, default `False`): If True, offloads the calculation to a worker process.

Returns:

- int: The total number of records.

Raises:

- ValueError: If the number of records could not be determined.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_number_of_records(self, warn: bool = False, force_calculate: bool = False,
                          calculate_in_worker_process: bool = False) -> int:
    """Gets the total number of records in the DataFrame.

    For lazy frames, this may trigger a full data scan, which can be expensive.

    Args:
        warn: If True, logs a warning if a potentially expensive calculation is triggered.
        force_calculate: If True, forces recalculation even if a value is cached.
        calculate_in_worker_process: If True, offloads the calculation to a worker process.

    Returns:
        The total number of records.

    Raises:
        ValueError: If the number of records could not be determined.
    """
    if self.is_future and not self.is_collected:
        return -1
    calculate_in_worker_process = False if not OFFLOAD_TO_WORKER else calculate_in_worker_process
    if self.number_of_records is None or self.number_of_records < 0 or force_calculate:
        if self._number_of_records_callback is not None:
            self._number_of_records_callback(self)

        if self.lazy:
            if calculate_in_worker_process:
                try:
                    self.number_of_records = self._calculate_number_of_records_in_worker()
                    return self.number_of_records
                except Exception as e:
                    logger.error(f"Error: {e}")
            if warn:
                logger.warning('Calculating the number of records this can be expensive on a lazy frame')
            try:
                self.number_of_records = self.data_frame.select(pl.len()).collect(
                    engine="streaming" if self._streamable else "auto")[0, 0]
            except Exception:
                raise ValueError('Could not get number of records')
        else:
            self.number_of_records = self.data_frame.__len__()
    return self.number_of_records
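
A minimal sketch, assuming the import path inferred from the source reference; on a lazy frame the call below may trigger a full scan, as the docstring warns.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

engine = FlowDataEngine(pl.LazyFrame({"x": [1, 2, 3]}))
print(engine.get_number_of_records(warn=True))  # 3
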
get_output_sample(n_rows=10)

Gets a sample of the data as a list of dictionaries.

This is typically used to display a preview of the data in a UI.

Parameters:

- `n_rows` (int, default `10`): The number of rows to sample.

Returns:

- List[Dict]: A list of dictionaries, where each dictionary represents a row.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_output_sample(self, n_rows: int = 10) -> List[Dict]:
    """Gets a sample of the data as a list of dictionaries.

    This is typically used to display a preview of the data in a UI.

    Args:
        n_rows: The number of rows to sample.

    Returns:
        A list of dictionaries, where each dictionary represents a row.
    """
    if self.number_of_records > n_rows or self.number_of_records < 0:
        df = self.collect(n_rows)
    else:
        df = self.collect()
    return df.to_dicts()
get_record_count()

Returns a new FlowDataEngine with a single column 'number_of_records' containing the total number of records.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_record_count(self) -> "FlowDataEngine":
    """Returns a new FlowDataEngine with a single column 'number_of_records'
    containing the total number of records.

    Returns:
        A new `FlowDataEngine` instance.
    """
    return FlowDataEngine(self.data_frame.select(pl.len().alias('number_of_records')))
get_sample(n_rows=100, random=False, shuffle=False, seed=None)

Gets a sample of rows from the DataFrame.

Parameters:

- `n_rows` (int, default `100`): The number of rows to sample.
- `random` (bool, default `False`): If True, performs random sampling. If False, takes the first n_rows.
- `shuffle` (bool, default `False`): If True (and `random` is True), shuffles the data before sampling.
- `seed` (int, default `None`): A random seed for reproducibility.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance containing the sampled data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_sample(self, n_rows: int = 100, random: bool = False, shuffle: bool = False,
               seed: int = None) -> "FlowDataEngine":
    """Gets a sample of rows from the DataFrame.

    Args:
        n_rows: The number of rows to sample.
        random: If True, performs random sampling. If False, takes the first n_rows.
        shuffle: If True (and `random` is True), shuffles the data before sampling.
        seed: A random seed for reproducibility.

    Returns:
        A new `FlowDataEngine` instance containing the sampled data.
    """
    n_records = min(n_rows, self.get_number_of_records(calculate_in_worker_process=OFFLOAD_TO_WORKER))
    logging.info(f'Getting sample of {n_rows} rows')

    if random:
        if self.lazy and self.external_source is not None:
            self.collect_external()

        if self.lazy and shuffle:
            sample_df = self.data_frame.collect(
                engine="streaming" if self._streamable else "auto"
            ).sample(n_rows, seed=seed, shuffle=shuffle)
        elif shuffle:
            sample_df = self.data_frame.sample(n_rows, seed=seed, shuffle=shuffle)
        else:
            every_n_records = ceil(self.number_of_records / n_rows)
            sample_df = self.data_frame.gather_every(every_n_records)
    else:
        if self.external_source:
            self.collect(n_rows)
        sample_df = self.data_frame.head(n_rows)

    return FlowDataEngine(sample_df, schema=self.schema, number_of_records=n_records)
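
A minimal sketch contrasting head-style and random sampling, assuming the import path inferred from the source reference:

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

engine = FlowDataEngine(pl.LazyFrame({"x": list(range(1_000))}))
first_ten = engine.get_sample(n_rows=10)                                      # first 10 rows
shuffled = engine.get_sample(n_rows=10, random=True, shuffle=True, seed=42)   # reproducible random sample
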
get_schema_column(col_name)

Retrieves the schema information for a single column by its name.

Parameters:

- `col_name` (str, required): The name of the column to retrieve.

Returns:

- FlowfileColumn: A `FlowfileColumn` object for the specified column, or `None` if not found.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_schema_column(self, col_name: str) -> FlowfileColumn:
    """Retrieves the schema information for a single column by its name.

    Args:
        col_name: The name of the column to retrieve.

    Returns:
        A `FlowfileColumn` object for the specified column, or `None` if not found.
    """
    for s in self.schema:
        if s.name == col_name:
            return s
get_select_inputs()

Gets SelectInput specifications for all columns in the current schema.

Returns:

- SelectInputs: A `SelectInputs` object that can be used to configure selection or transformation operations.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_select_inputs(self) -> transform_schemas.SelectInputs:
    """Gets `SelectInput` specifications for all columns in the current schema.

    Returns:
        A `SelectInputs` object that can be used to configure selection or
        transformation operations.
    """
    return transform_schemas.SelectInputs(
        [transform_schemas.SelectInput(old_name=c.name, data_type=c.data_type) for c in self.schema]
    )
get_subset(n_rows=100)

Gets the first n_rows from the DataFrame.

Parameters:

- `n_rows` (int, default `100`): The number of rows to include in the subset.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance containing the subset of data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_subset(self, n_rows: int = 100) -> "FlowDataEngine":
    """Gets the first `n_rows` from the DataFrame.

    Args:
        n_rows: The number of rows to include in the subset.

    Returns:
        A new `FlowDataEngine` instance containing the subset of data.
    """
    if not self.lazy:
        return FlowDataEngine(self.data_frame.head(n_rows), calculate_schema_stats=True)
    else:
        return FlowDataEngine(self.data_frame.head(n_rows), calculate_schema_stats=True)
initialize_empty_fl()

Initializes an empty LazyFrame.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def initialize_empty_fl(self):
    """Initializes an empty LazyFrame."""
    self.data_frame = pl.LazyFrame()
    self.number_of_records = 0
    self._lazy = True
iter_batches(batch_size=1000, columns=None)

Iterates over the DataFrame in batches.

Parameters:

- `batch_size` (int, default `1000`): The size of each batch.
- `columns` (Union[List, Tuple, str], default `None`): A list of column names to include in the batches. If None, all columns are included.

Yields:

- FlowDataEngine: A `FlowDataEngine` instance for each batch.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def iter_batches(self, batch_size: int = 1000,
                 columns: Union[List, Tuple, str] = None) -> Generator["FlowDataEngine", None, None]:
    """Iterates over the DataFrame in batches.

    Args:
        batch_size: The size of each batch.
        columns: A list of column names to include in the batches. If None,
            all columns are included.

    Yields:
        A `FlowDataEngine` instance for each batch.
    """
    if columns:
        self.data_frame = self.data_frame.select(columns)
    self.lazy = False
    batches = self.data_frame.iter_slices(batch_size)
    for batch in batches:
        yield FlowDataEngine(batch)
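
A minimal sketch, assuming the import path inferred from the source reference. Note that the method switches the engine out of lazy mode before slicing, so the data is materialized.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

engine = FlowDataEngine(pl.LazyFrame({"x": list(range(2_500))}))
for batch in engine.iter_batches(batch_size=1_000, columns=["x"]):
    print(batch.get_number_of_records())  # 1000, 1000, 500
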
join(join_input, auto_generate_selection, verify_integrity, other)

Performs a standard SQL-style join with another DataFrame.

Supports various join types like 'inner', 'left', 'right', 'outer', 'semi', and 'anti'.

Parameters:

- `join_input` (JoinInput, required): A `JoinInput` object defining the join keys, join type, and column selections.
- `auto_generate_selection` (bool, required): If True, automatically handles column renaming.
- `verify_integrity` (bool, required): If True, performs checks to prevent excessively large joins.
- `other` (FlowDataEngine, required): The right `FlowDataEngine` to join with.

Returns:

- FlowDataEngine: A new `FlowDataEngine` with the joined data.

Raises:

- Exception: If the join configuration is invalid or if `verify_integrity` is True and the join is predicted to be too large.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def join(self, join_input: transform_schemas.JoinInput, auto_generate_selection: bool,
         verify_integrity: bool, other: "FlowDataEngine") -> "FlowDataEngine":
    """Performs a standard SQL-style join with another DataFrame.

    Supports various join types like 'inner', 'left', 'right', 'outer', 'semi', and 'anti'.

    Args:
        join_input: A `JoinInput` object defining the join keys, join type,
            and column selections.
        auto_generate_selection: If True, automatically handles column renaming.
        verify_integrity: If True, performs checks to prevent excessively large joins.
        other: The right `FlowDataEngine` to join with.

    Returns:
        A new `FlowDataEngine` with the joined data.

    Raises:
        Exception: If the join configuration is invalid or if `verify_integrity`
            is True and the join is predicted to be too large.
    """
    ensure_right_unselect_for_semi_and_anti_joins(join_input)
    verify_join_select_integrity(join_input, left_columns=self.columns, right_columns=other.columns)
    if not verify_join_map_integrity(join_input, left_columns=self.schema, right_columns=other.schema):
        raise Exception('Join is not valid by the data fields')
    if auto_generate_selection:
        join_input.auto_rename()
    left = self.data_frame.select(get_select_columns(join_input.left_select.renames)).rename(join_input.left_select.rename_table)
    right = other.data_frame.select(get_select_columns(join_input.right_select.renames)).rename(join_input.right_select.rename_table)
    if verify_integrity and join_input.how != 'right':
        n_records = get_join_count(left, right, left_on_keys=join_input.left_join_keys,
                                   right_on_keys=join_input.right_join_keys, how=join_input.how)
        if n_records > 1_000_000_000:
            raise Exception("Join will result in too many records, ending process")
    else:
        n_records = -1
    left, right, reverse_join_key_mapping = _handle_duplication_join_keys(left, right, join_input)
    left, right = rename_df_table_for_join(left, right, join_input.get_join_key_renames())
    if join_input.how == 'right':
        joined_df = right.join(
            other=left,
            left_on=join_input.right_join_keys,
            right_on=join_input.left_join_keys,
            how="left",
            suffix="").rename(reverse_join_key_mapping)
    else:
        joined_df = left.join(
            other=right,
            left_on=join_input.left_join_keys,
            right_on=join_input.right_join_keys,
            how=join_input.how,
            suffix="").rename(reverse_join_key_mapping)
    left_cols_to_delete_after = [get_col_name_to_delete(col, 'left') for col in join_input.left_select.renames
                                 if not col.keep
                                 and col.is_available and col.join_key
                                 ]
    right_cols_to_delete_after = [get_col_name_to_delete(col, 'right') for col in join_input.right_select.renames
                                  if not col.keep
                                  and col.is_available and col.join_key
                                  and join_input.how in ("left", "right", "inner", "cross", "outer")
                                  ]
    if len(right_cols_to_delete_after + left_cols_to_delete_after) > 0:
        joined_df = joined_df.drop(left_cols_to_delete_after + right_cols_to_delete_after)
    undo_join_key_remapping = get_undo_rename_mapping_join(join_input)
    joined_df = joined_df.rename(undo_join_key_remapping)

    if verify_integrity:
        return FlowDataEngine(joined_df, calculate_schema_stats=True,
                              number_of_records=n_records, streamable=False)
    else:
        fl = FlowDataEngine(joined_df, calculate_schema_stats=False,
                            number_of_records=0, streamable=False)
        return fl
make_unique(unique_input=None)

Gets the unique rows from the DataFrame.

Parameters:

- `unique_input` (UniqueInput, default `None`): A `UniqueInput` object specifying a subset of columns to consider for uniqueness and a strategy for keeping rows.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance with unique rows.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def make_unique(self, unique_input: transform_schemas.UniqueInput = None) -> "FlowDataEngine":
    """Gets the unique rows from the DataFrame.

    Args:
        unique_input: A `UniqueInput` object specifying a subset of columns
            to consider for uniqueness and a strategy for keeping rows.

    Returns:
        A new `FlowDataEngine` instance with unique rows.
    """
    if unique_input is None or unique_input.columns is None:
        return FlowDataEngine(self.data_frame.unique())
    return FlowDataEngine(self.data_frame.unique(unique_input.columns, keep=unique_input.strategy))
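
A hedged sketch. The `UniqueInput` keyword names are inferred from the attributes (`columns`, `strategy`) read in the method body above, and the `transform_schemas` import path is an assumption.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference
from flowfile_core.schemas import transform_schemas  # import path assumed

engine = FlowDataEngine(pl.LazyFrame({"email": ["a@x.com", "a@x.com", "b@x.com"], "n": [1, 2, 3]}))
all_cols_unique = engine.make_unique()  # uniqueness over all columns
by_email = engine.make_unique(
    transform_schemas.UniqueInput(columns=["email"], strategy="first")  # field names assumed from the attributes used above
)
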
output(output_fs, flow_id, node_id, execute_remote=True)

Writes the DataFrame to an output file.

Can execute the write operation locally or in a remote worker process.

Parameters:

- `output_fs` (OutputSettings, required): An `OutputSettings` object with details about the output file.
- `flow_id` (int, required): The flow ID for tracking.
- `node_id` (int | str, required): The node ID for tracking.
- `execute_remote` (bool, default `True`): If True, executes the write in a worker process.

Returns:

- FlowDataEngine: The same `FlowDataEngine` instance for chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def output(self, output_fs: input_schema.OutputSettings, flow_id: int, node_id: int | str,
           execute_remote: bool = True) -> "FlowDataEngine":
    """Writes the DataFrame to an output file.

    Can execute the write operation locally or in a remote worker process.

    Args:
        output_fs: An `OutputSettings` object with details about the output file.
        flow_id: The flow ID for tracking.
        node_id: The node ID for tracking.
        execute_remote: If True, executes the write in a worker process.

    Returns:
        The same `FlowDataEngine` instance for chaining.
    """
    logger.info('Starting to write output')
    if execute_remote:
        status = utils.write_output(
            self.data_frame,
            data_type=output_fs.file_type,
            path=output_fs.abs_file_path,
            write_mode=output_fs.write_mode,
            sheet_name=output_fs.output_excel_table.sheet_name,
            delimiter=output_fs.output_csv_table.delimiter,
            flow_id=flow_id,
            node_id=node_id
        )
        tracker = ExternalExecutorTracker(status)
        tracker.get_result()
        logger.info('Finished writing output')
    else:
        logger.info("Starting to write results locally")
        utils.local_write_output(
            self.data_frame,
            data_type=output_fs.file_type,
            path=output_fs.abs_file_path,
            write_mode=output_fs.write_mode,
            sheet_name=output_fs.output_excel_table.sheet_name,
            delimiter=output_fs.output_csv_table.delimiter,
            flow_id=flow_id,
            node_id=node_id,
        )
        logger.info("Finished writing output")
    return self
reorganize_order(column_order)

Reorganizes columns into a specified order.

Parameters:

- `column_order` (List[str], required): A list of column names in the desired order.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance with the columns reordered.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def reorganize_order(self, column_order: List[str]) -> "FlowDataEngine":
    """Reorganizes columns into a specified order.

    Args:
        column_order: A list of column names in the desired order.

    Returns:
        A new `FlowDataEngine` instance with the columns reordered.
    """
    df = self.data_frame.select(column_order)
    schema = sorted(self.schema, key=lambda x: column_order.index(x.column_name))
    return FlowDataEngine(df, schema=schema, number_of_records=self.number_of_records)
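
A minimal sketch, assuming the import path inferred from the source reference; every current column should appear in the new order, since the schema is re-sorted against it.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

engine = FlowDataEngine(pl.LazyFrame({"b": [1], "a": [2], "c": [3]}))
ordered = engine.reorganize_order(["a", "b", "c"])
print(ordered.columns)  # ['a', 'b', 'c']
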
save(path, data_type='parquet')

Saves the DataFrame to a file in a separate thread.

Parameters:

- `path` (str, required): The file path to save to.
- `data_type` (str, default `'parquet'`): The format to save in (e.g., 'parquet', 'csv').

Returns:

- Future: A `loky.Future` object representing the asynchronous save operation.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def save(self, path: str, data_type: str = 'parquet') -> Future:
    """Saves the DataFrame to a file in a separate thread.

    Args:
        path: The file path to save to.
        data_type: The format to save in (e.g., 'parquet', 'csv').

    Returns:
        A `loky.Future` object representing the asynchronous save operation.
    """
    estimated_size = deepcopy(self.get_estimated_file_size() * 4)
    df = deepcopy(self.data_frame)
    return write_threaded(_df=df, path=path, data_type=data_type, estimated_size=estimated_size)
select_columns(list_select)

Selects a subset of columns from the DataFrame.

Parameters:

- `list_select` (Union[List[str], Tuple[str], str], required): A list, tuple, or single string of column names to select.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance containing only the selected columns.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def select_columns(self, list_select: Union[List[str], Tuple[str], str]) -> "FlowDataEngine":
    """Selects a subset of columns from the DataFrame.

    Args:
        list_select: A list, tuple, or single string of column names to select.

    Returns:
        A new `FlowDataEngine` instance containing only the selected columns.
    """
    if isinstance(list_select, str):
        list_select = [list_select]

    idx_to_keep = [self.cols_idx.get(c) for c in list_select]
    selects = [ls for ls, id_to_keep in zip(list_select, idx_to_keep) if id_to_keep is not None]
    new_schema = [self.schema[i] for i in idx_to_keep if i is not None]

    return FlowDataEngine(
        self.data_frame.select(selects),
        number_of_records=self.number_of_records,
        schema=new_schema,
        streamable=self._streamable
    )
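
A minimal sketch, assuming the import path inferred from the source reference; column names that do not exist are silently skipped by the method.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

engine = FlowDataEngine(pl.LazyFrame({"id": [1], "name": ["a"], "age": [30]}))
subset = engine.select_columns(["id", "name"])
single = engine.select_columns("age")          # a single string is also accepted
print(subset.columns, single.columns)
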
set_streamable(streamable=False)

Sets whether DataFrame operations should be streamable.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def set_streamable(self, streamable: bool = False):
    """Sets whether DataFrame operations should be streamable."""
    self._streamable = streamable
solve_graph(graph_solver_input)

Solves a graph problem represented by 'from' and 'to' columns.

This is used for operations like finding connected components in a graph.

Parameters:

- `graph_solver_input` (GraphSolverInput, required): A `GraphSolverInput` object defining the source, destination, and output column names.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance with the solved graph data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def solve_graph(self, graph_solver_input: transform_schemas.GraphSolverInput) -> "FlowDataEngine":
    """Solves a graph problem represented by 'from' and 'to' columns.

    This is used for operations like finding connected components in a graph.

    Args:
        graph_solver_input: A `GraphSolverInput` object defining the source,
            destination, and output column names.

    Returns:
        A new `FlowDataEngine` instance with the solved graph data.
    """
    lf = self.data_frame.with_columns(
        graph_solver(graph_solver_input.col_from, graph_solver_input.col_to)
        .alias(graph_solver_input.output_column_name)
    )
    return FlowDataEngine(lf)
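
A hedged sketch. The `GraphSolverInput` keyword names are inferred from the attributes (`col_from`, `col_to`, `output_column_name`) read in the method body above, and the `transform_schemas` import path is an assumption.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference
from flowfile_core.schemas import transform_schemas  # import path assumed

edges = FlowDataEngine(pl.LazyFrame({"src": [1, 2, 4], "dst": [2, 3, 5]}))
components = edges.solve_graph(
    transform_schemas.GraphSolverInput(col_from="src", col_to="dst", output_column_name="component")  # field names assumed
)
print(components.to_pylist())  # rows in the 1-2-3 chain share one component id, 4-5 another
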
split(split_input)

Splits a column's text values into multiple rows based on a delimiter.

This operation is often referred to as "exploding" the DataFrame, as it increases the number of rows.

Parameters:

- `split_input` (TextToRowsInput, required): A `TextToRowsInput` object specifying the column to split, the delimiter, and the output column name.

Returns:

- FlowDataEngine: A new `FlowDataEngine` instance with the exploded rows.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def split(self, split_input: transform_schemas.TextToRowsInput) -> "FlowDataEngine":
    """Splits a column's text values into multiple rows based on a delimiter.

    This operation is often referred to as "exploding" the DataFrame, as it
    increases the number of rows.

    Args:
        split_input: A `TextToRowsInput` object specifying the column to split,
            the delimiter, and the output column name.

    Returns:
        A new `FlowDataEngine` instance with the exploded rows.
    """
    output_column_name = (
        split_input.output_column_name
        if split_input.output_column_name
        else split_input.column_to_split
    )

    split_value = (
        split_input.split_fixed_value
        if split_input.split_by_fixed_value
        else pl.col(split_input.split_by_column)
    )

    df = (
        self.data_frame.with_columns(
            pl.col(split_input.column_to_split)
            .str.split(by=split_value)
            .alias(output_column_name)
        )
        .explode(output_column_name)
    )

    return FlowDataEngine(df)
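
A hedged sketch. The `TextToRowsInput` keyword names are inferred from the attributes read in the method body above (`column_to_split`, `split_by_fixed_value`, `split_fixed_value`, `output_column_name`), and the `transform_schemas` import path is an assumption.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference
from flowfile_core.schemas import transform_schemas  # import path assumed

tagged = FlowDataEngine(pl.LazyFrame({"id": [1, 2], "tags": ["red,blue", "green"]}))
exploded = tagged.split(
    transform_schemas.TextToRowsInput(   # field names assumed from the attributes used above
        column_to_split="tags",
        split_by_fixed_value=True,
        split_fixed_value=",",
        output_column_name="tag",
    )
)
print(exploded.to_pylist())  # one row per tag
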
start_fuzzy_join(fuzzy_match_input, other, file_ref, flow_id=-1, node_id=-1)

Starts a fuzzy join operation in a background process.

This method prepares the data and initiates the fuzzy matching in a separate process, returning a tracker object immediately.

Parameters:

- `fuzzy_match_input` (FuzzyMatchInput, required): A `FuzzyMatchInput` object with the matching parameters.
- `other` (FlowDataEngine, required): The right `FlowDataEngine` to join with.
- `file_ref` (str, required): A reference string for temporary files.
- `flow_id` (int, default `-1`): The flow ID for tracking.
- `node_id` (int | str, default `-1`): The node ID for tracking.

Returns:

- ExternalFuzzyMatchFetcher: An `ExternalFuzzyMatchFetcher` object that can be used to track the progress and retrieve the result of the fuzzy join.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def start_fuzzy_join(self, fuzzy_match_input: transform_schemas.FuzzyMatchInput,
                     other: "FlowDataEngine", file_ref: str, flow_id: int = -1,
                     node_id: int | str = -1) -> ExternalFuzzyMatchFetcher:
    """Starts a fuzzy join operation in a background process.

    This method prepares the data and initiates the fuzzy matching in a
    separate process, returning a tracker object immediately.

    Args:
        fuzzy_match_input: A `FuzzyMatchInput` object with the matching parameters.
        other: The right `FlowDataEngine` to join with.
        file_ref: A reference string for temporary files.
        flow_id: The flow ID for tracking.
        node_id: The node ID for tracking.

    Returns:
        An `ExternalFuzzyMatchFetcher` object that can be used to track the
        progress and retrieve the result of the fuzzy join.
    """
    left_df, right_df = prepare_for_fuzzy_match(left=self, right=other,
                                                fuzzy_match_input=fuzzy_match_input)
    return ExternalFuzzyMatchFetcher(left_df, right_df,
                                     fuzzy_maps=fuzzy_match_input.fuzzy_maps,
                                     file_ref=file_ref + '_fm',
                                     wait_on_completion=False,
                                     flow_id=flow_id,
                                     node_id=node_id)
to_arrow()

Converts the DataFrame to a PyArrow Table.

This method triggers a .collect() call if the data is lazy, then converts the resulting eager DataFrame into a pyarrow.Table.

Returns:

- Table: A `pyarrow.Table` instance representing the data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def to_arrow(self) -> PaTable:
    """Converts the DataFrame to a PyArrow Table.

    This method triggers a `.collect()` call if the data is lazy,
    then converts the resulting eager DataFrame into a `pyarrow.Table`.

    Returns:
        A `pyarrow.Table` instance representing the data.
    """
    if self.lazy:
        return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_arrow()
    else:
        return self.data_frame.to_arrow()
to_cloud_storage_obj(settings)

Writes the DataFrame to an object in cloud storage.

This method supports writing to various cloud storage providers like AWS S3, Azure Data Lake Storage, and Google Cloud Storage.

Parameters:

- `settings` (CloudStorageWriteSettingsInternal, required): A `CloudStorageWriteSettingsInternal` object containing connection details, file format, and write options.

Raises:

- ValueError: If the specified file format is not supported for writing.
- NotImplementedError: If the 'append' write mode is used with an unsupported format.
- Exception: If the write operation to cloud storage fails for any reason.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def to_cloud_storage_obj(self, settings: cloud_storage_schemas.CloudStorageWriteSettingsInternal):
    """Writes the DataFrame to an object in cloud storage.

    This method supports writing to various cloud storage providers like AWS S3,
    Azure Data Lake Storage, and Google Cloud Storage.

    Args:
        settings: A `CloudStorageWriteSettingsInternal` object containing connection
            details, file format, and write options.

    Raises:
        ValueError: If the specified file format is not supported for writing.
        NotImplementedError: If the 'append' write mode is used with an unsupported format.
        Exception: If the write operation to cloud storage fails for any reason.
    """
    connection = settings.connection
    write_settings = settings.write_settings

    logger.info(f"Writing to {connection.storage_type} storage: {write_settings.resource_path}")

    if write_settings.write_mode == 'append' and write_settings.file_format != "delta":
        raise NotImplementedError("The 'append' write mode is not yet supported for this destination.")
    storage_options = CloudStorageReader.get_storage_options(connection)
    credential_provider = CloudStorageReader.get_credential_provider(connection)
    # Dispatch to the correct writer based on file format
    if write_settings.file_format == "parquet":
        self._write_parquet_to_cloud(
            write_settings.resource_path,
            storage_options,
            credential_provider,
            write_settings
        )
    elif write_settings.file_format == "delta":
        self._write_delta_to_cloud(
            write_settings.resource_path,
            storage_options,
            credential_provider,
            write_settings
        )
    elif write_settings.file_format == "csv":
        self._write_csv_to_cloud(
            write_settings.resource_path,
            storage_options,
            credential_provider,
            write_settings
        )
    elif write_settings.file_format == "json":
        self._write_json_to_cloud(
            write_settings.resource_path,
            storage_options,
            credential_provider,
            write_settings
        )
    else:
        raise ValueError(f"Unsupported file format for writing: {write_settings.file_format}")

    logger.info(f"Successfully wrote data to {write_settings.resource_path}")
to_dict()

Converts the DataFrame to a Python dictionary of columns.

Each key in the dictionary is a column name, and the corresponding value is a list of the data in that column.

Returns:

- Dict[str, List]: A dictionary mapping column names to lists of their values.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def to_dict(self) -> Dict[str, List]:
    """Converts the DataFrame to a Python dictionary of columns.

     Each key in the dictionary is a column name, and the corresponding value
     is a list of the data in that column.

     Returns:
         A dictionary mapping column names to lists of their values.
     """
    if self.lazy:
        return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_dict(as_series=False)
    else:
        return self.data_frame.to_dict(as_series=False)
to_pylist()

Converts the DataFrame to a list of Python dictionaries.

Returns:

- List[Dict]: A list where each item is a dictionary representing a row.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def to_pylist(self) -> List[Dict]:
    """Converts the DataFrame to a list of Python dictionaries.

    Returns:
        A list where each item is a dictionary representing a row.
    """
    if self.lazy:
        return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_dicts()
    return self.data_frame.to_dicts()
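
A minimal sketch of the two eager conversion helpers (`to_dict` above and `to_pylist` here), assuming the import path inferred from the source reference; both collect the frame when it is lazy.

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # path inferred from the source reference

engine = FlowDataEngine(pl.LazyFrame({"id": [1, 2], "name": ["a", "b"]}))
print(engine.to_dict())    # {'id': [1, 2], 'name': ['a', 'b']}
print(engine.to_pylist())  # [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
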
to_raw_data()

Converts the DataFrame to a RawData schema object.

Returns:

- RawData: An `input_schema.RawData` object containing the schema and data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def to_raw_data(self) -> input_schema.RawData:
    """Converts the DataFrame to a `RawData` schema object.

    Returns:
        An `input_schema.RawData` object containing the schema and data.
    """
    columns = [c.get_minimal_field_info() for c in self.schema]
    data = list(self.to_dict().values())
    return input_schema.RawData(columns=columns, data=data)
unpivot(unpivot_input)

Converts the DataFrame from a wide to a long format.

This is the inverse of a pivot operation, taking columns and transforming them into variable and value rows.

Parameters:

- `unpivot_input` (UnpivotInput, required): An `UnpivotInput` object specifying which columns to unpivot and which to keep as index columns.

Returns:

- FlowDataEngine: A new, unpivoted `FlowDataEngine` instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def unpivot(self, unpivot_input: transform_schemas.UnpivotInput) -> "FlowDataEngine":
    """Converts the DataFrame from a wide to a long format.

    This is the inverse of a pivot operation, taking columns and transforming
    them into `variable` and `value` rows.

    Args:
        unpivot_input: An `UnpivotInput` object specifying which columns to
            unpivot and which to keep as index columns.

    Returns:
        A new, unpivoted `FlowDataEngine` instance.
    """
    lf = self.data_frame

    if unpivot_input.data_type_selector_expr is not None:
        result = lf.unpivot(
            on=unpivot_input.data_type_selector_expr(),
            index=unpivot_input.index_columns
        )
    elif unpivot_input.value_columns is not None:
        result = lf.unpivot(
            on=unpivot_input.value_columns,
            index=unpivot_input.index_columns
        )
    else:
        result = lf.unpivot()

    return FlowDataEngine(result)

FlowfileColumn

The FlowfileColumn is a data class that holds the schema and rich metadata for a single column managed by the FlowDataEngine.

flowfile_core.flowfile.flow_data_engine.flow_file_column.main.FlowfileColumn dataclass
Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_file_column/main.py
@dataclass
class FlowfileColumn:
    column_name: str
    data_type: str
    size: int
    max_value: str
    min_value: str
    col_index: int
    number_of_empty_values: int
    number_of_unique_values: int
    example_values: str
    __sql_type: Optional[Any]
    __is_unique: Optional[bool]
    __nullable: Optional[bool]
    __has_values: Optional[bool]
    average_value: Optional[str]
    __perc_unique: Optional[float]

    def __init__(self, polars_type: PlType):
        self.data_type = convert_pl_type_to_string(polars_type.pl_datatype)
        self.size = polars_type.count - polars_type.null_count
        self.max_value = polars_type.max
        self.min_value = polars_type.min
        self.number_of_unique_values = polars_type.n_unique
        self.number_of_empty_values = polars_type.null_count
        self.example_values = polars_type.examples
        self.column_name = polars_type.column_name
        self.average_value = polars_type.mean
        self.col_index = polars_type.col_index
        self.__has_values = None
        self.__nullable = None
        self.__is_unique = None
        self.__sql_type = None
        self.__perc_unique = None

    @classmethod
    def create_from_polars_type(cls, polars_type: PlType, **kwargs) -> "FlowfileColumn":
        for k, v in kwargs.items():
            if hasattr(polars_type, k):
                setattr(polars_type, k, v)
        return cls(polars_type)

    @classmethod
    def from_input(cls, column_name: str, data_type: str, **kwargs) -> "FlowfileColumn":
        pl_type = cast_str_to_polars_type(data_type)
        if pl_type is not None:
            data_type = pl_type
        return cls(PlType(column_name=column_name, pl_datatype=data_type, **kwargs))

    @classmethod
    def create_from_polars_dtype(cls, column_name: str, data_type: pl.DataType, **kwargs):
        return cls(PlType(column_name=column_name, pl_datatype=data_type, **kwargs))

    def get_minimal_field_info(self) -> input_schema.MinimalFieldInfo:
        return input_schema.MinimalFieldInfo(name=self.column_name, data_type=self.data_type)

    @classmethod
    def create_from_minimal_field_info(cls, minimal_field_info: input_schema.MinimalFieldInfo) -> "FlowfileColumn":
        return cls.from_input(column_name=minimal_field_info.name,
                              data_type=minimal_field_info.data_type)

    @property
    def is_unique(self) -> bool:
        if self.__is_unique is None:
            if self.has_values:
                self.__is_unique = self.number_of_unique_values == self.number_of_filled_values
            else:
                self.__is_unique = False
        return self.__is_unique

    @property
    def perc_unique(self) -> float:
        if self.__perc_unique is None:
            self.__perc_unique = self.number_of_unique_values / self.number_of_filled_values
        return self.__perc_unique

    @property
    def has_values(self) -> bool:
        if not self.__has_values:
            self.__has_values = self.number_of_unique_values > 0
        return self.__has_values

    @property
    def number_of_filled_values(self):
        return self.size

    @property
    def nullable(self):
        if self.__nullable is None:
            self.__nullable = self.number_of_empty_values > 0
        return self.__nullable

    @property
    def name(self):
        return self.column_name

    def get_column_repr(self):
        return dict(name=self.name,
                    size=self.size,
                    data_type=str(self.data_type),
                    has_values=self.has_values,
                    is_unique=self.is_unique,
                    max_value=str(self.max_value),
                    min_value=str(self.min_value),
                    number_of_unique_values=self.number_of_unique_values,
                    number_of_filled_values=self.number_of_filled_values,
                    number_of_empty_values=self.number_of_empty_values,
                    average_size=self.average_value)

    def generic_datatype(self) -> DataTypeGroup:
        if self.data_type in ('Utf8', 'VARCHAR', 'CHAR', 'NVARCHAR', 'String'):
            return 'str'
        elif self.data_type in ('fixed_decimal', 'decimal', 'float', 'integer', 'boolean', 'double', 'Int16', 'Int32',
                                'Int64', 'Float32', 'Float64', 'Decimal', 'Binary', 'Boolean', 'Uint8', 'Uint16',
                                'Uint32', 'Uint64'):
            return 'numeric'
        elif self.data_type in ('datetime', 'date', 'Date', 'Datetime', 'Time'):
            return 'date'

    def get_polars_type(self) -> PlType:
        pl_datatype = cast_str_to_polars_type(self.data_type)
        pl_type = PlType(pl_datatype=pl_datatype, **self.__dict__)
        return pl_type

    def update_type_from_polars_type(self, pl_type: PlType):
        self.data_type = str(pl_type.pl_datatype.base_type())
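
Example: a minimal sketch of constructing a FlowfileColumn from field metadata, assuming the import path shown in the heading above and that the documented classmethods behave as in the listed source (in particular, that PlType supplies defaults for the statistics fields); the column name and data type are illustrative.

from flowfile_core.schemas.input_schema import MinimalFieldInfo
from flowfile_core.flowfile.flow_data_engine.flow_file_column.main import FlowfileColumn

field = MinimalFieldInfo(name="age", data_type="Int64")
col = FlowfileColumn.create_from_minimal_field_info(field)

print(col.name, col.data_type)       # "age" and its Polars type rendered as a string
print(col.generic_datatype())        # integer types fall in the 'numeric' group
print(col.get_minimal_field_info())  # round-trips back to a MinimalFieldInfo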

Data Modeling (Schemas)

This section documents the Pydantic models that define the structure of settings and data.

schemas

flowfile_core.schemas.schemas

Classes:

Name Description
FlowGraphConfig

Configuration model for a flow graph's basic properties.

FlowInformation

Represents the complete state of a flow, including settings, nodes, and connections.

FlowSettings

Extends FlowGraphConfig with additional operational settings for a flow.

NodeDefault

Defines default properties for a node type.

NodeEdge

Represents a connection (edge) between two nodes in the frontend.

NodeInformation

Stores the state and configuration of a specific node instance within a flow.

NodeInput

Represents a node as it is received from the frontend, including position.

NodeTemplate

Defines the template for a node type, specifying its UI and functional characteristics.

RawLogInput

Schema for a raw log message.

VueFlowInput

Represents the complete graph structure from the Vue-based frontend.

FlowGraphConfig pydantic-model

Bases: BaseModel

Configuration model for a flow graph's basic properties.

Attributes:

Name Type Description
flow_id int

Unique identifier for the flow.

description Optional[str]

A description of the flow.

save_location Optional[str]

The location where the flow is saved.

name str

The name of the flow.

path str

The file path associated with the flow.

execution_mode ExecutionModeLiteral

The mode of execution ('Development' or 'Performance').

execution_location ExecutionLocationsLiteral

The location for execution ('auto', 'local', 'remote').

Show JSON schema:
{
  "description": "Configuration model for a flow graph's basic properties.\n\nAttributes:\n    flow_id (int): Unique identifier for the flow.\n    description (Optional[str]): A description of the flow.\n    save_location (Optional[str]): The location where the flow is saved.\n    name (str): The name of the flow.\n    path (str): The file path associated with the flow.\n    execution_mode (ExecutionModeLiteral): The mode of execution ('Development' or 'Performance').\n    execution_location (ExecutionLocationsLiteral): The location for execution ('auto', 'local', 'remote').",
  "properties": {
    "flow_id": {
      "description": "Unique identifier for the flow.",
      "title": "Flow Id",
      "type": "integer"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Description"
    },
    "save_location": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Save Location"
    },
    "name": {
      "default": "",
      "title": "Name",
      "type": "string"
    },
    "path": {
      "default": "",
      "title": "Path",
      "type": "string"
    },
    "execution_mode": {
      "default": "Performance",
      "enum": [
        "Development",
        "Performance"
      ],
      "title": "Execution Mode",
      "type": "string"
    },
    "execution_location": {
      "default": "auto",
      "enum": [
        "auto",
        "local",
        "remote"
      ],
      "title": "Execution Location",
      "type": "string"
    }
  },
  "title": "FlowGraphConfig",
  "type": "object"
}

Fields:

  • flow_id (int)
  • description (Optional[str])
  • save_location (Optional[str])
  • name (str)
  • path (str)
  • execution_mode (ExecutionModeLiteral)
  • execution_location (ExecutionLocationsLiteral)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class FlowGraphConfig(BaseModel):
    """
    Configuration model for a flow graph's basic properties.

    Attributes:
        flow_id (int): Unique identifier for the flow.
        description (Optional[str]): A description of the flow.
        save_location (Optional[str]): The location where the flow is saved.
        name (str): The name of the flow.
        path (str): The file path associated with the flow.
        execution_mode (ExecutionModeLiteral): The mode of execution ('Development' or 'Performance').
        execution_location (ExecutionLocationsLiteral): The location for execution ('auto', 'local', 'remote').
    """
    flow_id: int = Field(default_factory=create_unique_id, description="Unique identifier for the flow.")
    description: Optional[str] = None
    save_location: Optional[str] = None
    name: str = ''
    path: str = ''
    execution_mode: ExecutionModeLiteral = 'Performance'
    execution_location: ExecutionLocationsLiteral = "auto"
flow_id pydantic-field

Unique identifier for the flow.
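
Example: a minimal instantiation sketch, assuming the model is imported from flowfile_core.schemas.schemas as documented above; the flow name is illustrative.

from flowfile_core.schemas.schemas import FlowGraphConfig

cfg = FlowGraphConfig(name="sales_pipeline", execution_mode="Development")
print(cfg.flow_id)             # generated by the create_unique_id default factory
print(cfg.execution_location)  # defaults to "auto"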

FlowInformation pydantic-model

Bases: BaseModel

Represents the complete state of a flow, including settings, nodes, and connections.

Attributes:

Name Type Description
flow_id int

The unique ID of the flow.

flow_name Optional[str]

The name of the flow.

flow_settings FlowSettings

The settings for the flow.

data Dict[int, NodeInformation]

A dictionary mapping node IDs to their information.

node_starts List[int]

A list of starting node IDs.

node_connections List[Tuple[int, int]]

A list of tuples representing connections between nodes.

Show JSON schema:
{
  "$defs": {
    "FlowSettings": {
      "description": "Extends FlowGraphConfig with additional operational settings for a flow.\n\nAttributes:\n    auto_save (bool): Flag to enable or disable automatic saving.\n    modified_on (Optional[float]): Timestamp of the last modification.\n    show_detailed_progress (bool): Flag to show detailed progress during execution.\n    is_running (bool): Indicates if the flow is currently running.\n    is_canceled (bool): Indicates if the flow execution has been canceled.",
      "properties": {
        "flow_id": {
          "description": "Unique identifier for the flow.",
          "title": "Flow Id",
          "type": "integer"
        },
        "description": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Description"
        },
        "save_location": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Save Location"
        },
        "name": {
          "default": "",
          "title": "Name",
          "type": "string"
        },
        "path": {
          "default": "",
          "title": "Path",
          "type": "string"
        },
        "execution_mode": {
          "default": "Performance",
          "enum": [
            "Development",
            "Performance"
          ],
          "title": "Execution Mode",
          "type": "string"
        },
        "execution_location": {
          "default": "auto",
          "enum": [
            "auto",
            "local",
            "remote"
          ],
          "title": "Execution Location",
          "type": "string"
        },
        "auto_save": {
          "default": false,
          "title": "Auto Save",
          "type": "boolean"
        },
        "modified_on": {
          "anyOf": [
            {
              "type": "number"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Modified On"
        },
        "show_detailed_progress": {
          "default": true,
          "title": "Show Detailed Progress",
          "type": "boolean"
        },
        "is_running": {
          "default": false,
          "title": "Is Running",
          "type": "boolean"
        },
        "is_canceled": {
          "default": false,
          "title": "Is Canceled",
          "type": "boolean"
        }
      },
      "title": "FlowSettings",
      "type": "object"
    },
    "NodeInformation": {
      "description": "Stores the state and configuration of a specific node instance within a flow.\n\nAttributes:\n    id (Optional[int]): The unique ID of the node instance.\n    type (Optional[str]): The type of the node (e.g., 'join', 'filter').\n    is_setup (Optional[bool]): Whether the node has been configured.\n    description (Optional[str]): A user-provided description.\n    x_position (Optional[int]): The x-coordinate on the canvas.\n    y_position (Optional[int]): The y-coordinate on the canvas.\n    left_input_id (Optional[int]): The ID of the node connected to the left input.\n    right_input_id (Optional[int]): The ID of the node connected to the right input.\n    input_ids (Optional[List[int]]): A list of IDs for main input nodes.\n    outputs (Optional[List[int]]): A list of IDs for nodes this node outputs to.\n    setting_input (Optional[Any]): The specific settings for this node instance.",
      "properties": {
        "id": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Id"
        },
        "type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Type"
        },
        "is_setup": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Is Setup"
        },
        "description": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": "",
          "title": "Description"
        },
        "x_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": 0,
          "title": "X Position"
        },
        "y_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": 0,
          "title": "Y Position"
        },
        "left_input_id": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Left Input Id"
        },
        "right_input_id": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Right Input Id"
        },
        "input_ids": {
          "anyOf": [
            {
              "items": {
                "type": "integer"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": [
            -1
          ],
          "title": "Input Ids"
        },
        "outputs": {
          "anyOf": [
            {
              "items": {
                "type": "integer"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": [
            -1
          ],
          "title": "Outputs"
        },
        "setting_input": {
          "anyOf": [
            {},
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Setting Input"
        }
      },
      "title": "NodeInformation",
      "type": "object"
    }
  },
  "description": "Represents the complete state of a flow, including settings, nodes, and connections.\n\nAttributes:\n    flow_id (int): The unique ID of the flow.\n    flow_name (Optional[str]): The name of the flow.\n    flow_settings (FlowSettings): The settings for the flow.\n    data (Dict[int, NodeInformation]): A dictionary mapping node IDs to their information.\n    node_starts (List[int]): A list of starting node IDs.\n    node_connections (List[Tuple[int, int]]): A list of tuples representing connections between nodes.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "flow_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Flow Name"
    },
    "flow_settings": {
      "$ref": "#/$defs/FlowSettings"
    },
    "data": {
      "additionalProperties": {
        "$ref": "#/$defs/NodeInformation"
      },
      "default": {},
      "title": "Data",
      "type": "object"
    },
    "node_starts": {
      "items": {
        "type": "integer"
      },
      "title": "Node Starts",
      "type": "array"
    },
    "node_connections": {
      "default": [],
      "items": {
        "maxItems": 2,
        "minItems": 2,
        "prefixItems": [
          {
            "type": "integer"
          },
          {
            "type": "integer"
          }
        ],
        "type": "array"
      },
      "title": "Node Connections",
      "type": "array"
    }
  },
  "required": [
    "flow_id",
    "flow_settings",
    "node_starts"
  ],
  "title": "FlowInformation",
  "type": "object"
}

Fields:

  • flow_id (int)
  • flow_name (Optional[str])
  • flow_settings (FlowSettings)
  • data (Dict[int, NodeInformation])
  • node_starts (List[int])
  • node_connections (List[Tuple[int, int]])

Validators:

  • ensure_string
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class FlowInformation(BaseModel):
    """
    Represents the complete state of a flow, including settings, nodes, and connections.

    Attributes:
        flow_id (int): The unique ID of the flow.
        flow_name (Optional[str]): The name of the flow.
        flow_settings (FlowSettings): The settings for the flow.
        data (Dict[int, NodeInformation]): A dictionary mapping node IDs to their information.
        node_starts (List[int]): A list of starting node IDs.
        node_connections (List[Tuple[int, int]]): A list of tuples representing connections between nodes.
    """
    flow_id: int
    flow_name: Optional[str] = ''
    flow_settings: FlowSettings
    data: Dict[int, NodeInformation] = {}
    node_starts: List[int]
    node_connections: List[Tuple[int, int]] = []

    @field_validator('flow_name', mode="before")
    def ensure_string(cls, v):
        """
        Validator to ensure the flow_name is always a string.
        :param v: The value to validate.
        :return: The value as a string, or an empty string if it's None.
        """
        return str(v) if v is not None else ''
ensure_string(v) pydantic-validator

Validator to ensure the flow_name is always a string. Takes the value to validate and returns it as a string, or an empty string if it is None.

Source code in flowfile_core/flowfile_core/schemas/schemas.py
@field_validator('flow_name', mode="before")
def ensure_string(cls, v):
    """
    Validator to ensure the flow_name is always a string.
    :param v: The value to validate.
    :return: The value as a string, or an empty string if it's None.
    """
    return str(v) if v is not None else ''
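
Example: the sketch below constructs a minimal FlowInformation and shows the ensure_string validator coercing a missing flow name; imports assume flowfile_core.schemas.schemas as above, and the IDs are illustrative.

from flowfile_core.schemas.schemas import FlowInformation, FlowSettings

info = FlowInformation(
    flow_id=1,
    flow_name=None,                                   # coerced to "" by ensure_string
    flow_settings=FlowSettings(flow_id=1, name="demo"),
    node_starts=[1],
)
assert info.flow_name == ""
assert info.node_connections == []                    # defaults to an empty list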
FlowSettings pydantic-model

Bases: FlowGraphConfig

Extends FlowGraphConfig with additional operational settings for a flow.

Attributes:

Name Type Description
auto_save bool

Flag to enable or disable automatic saving.

modified_on Optional[float]

Timestamp of the last modification.

show_detailed_progress bool

Flag to show detailed progress during execution.

is_running bool

Indicates if the flow is currently running.

is_canceled bool

Indicates if the flow execution has been canceled.

Show JSON schema:
{
  "description": "Extends FlowGraphConfig with additional operational settings for a flow.\n\nAttributes:\n    auto_save (bool): Flag to enable or disable automatic saving.\n    modified_on (Optional[float]): Timestamp of the last modification.\n    show_detailed_progress (bool): Flag to show detailed progress during execution.\n    is_running (bool): Indicates if the flow is currently running.\n    is_canceled (bool): Indicates if the flow execution has been canceled.",
  "properties": {
    "flow_id": {
      "description": "Unique identifier for the flow.",
      "title": "Flow Id",
      "type": "integer"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Description"
    },
    "save_location": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Save Location"
    },
    "name": {
      "default": "",
      "title": "Name",
      "type": "string"
    },
    "path": {
      "default": "",
      "title": "Path",
      "type": "string"
    },
    "execution_mode": {
      "default": "Performance",
      "enum": [
        "Development",
        "Performance"
      ],
      "title": "Execution Mode",
      "type": "string"
    },
    "execution_location": {
      "default": "auto",
      "enum": [
        "auto",
        "local",
        "remote"
      ],
      "title": "Execution Location",
      "type": "string"
    },
    "auto_save": {
      "default": false,
      "title": "Auto Save",
      "type": "boolean"
    },
    "modified_on": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modified On"
    },
    "show_detailed_progress": {
      "default": true,
      "title": "Show Detailed Progress",
      "type": "boolean"
    },
    "is_running": {
      "default": false,
      "title": "Is Running",
      "type": "boolean"
    },
    "is_canceled": {
      "default": false,
      "title": "Is Canceled",
      "type": "boolean"
    }
  },
  "title": "FlowSettings",
  "type": "object"
}

Fields:

  • flow_id (int)
  • description (Optional[str])
  • save_location (Optional[str])
  • name (str)
  • path (str)
  • execution_mode (ExecutionModeLiteral)
  • execution_location (ExecutionLocationsLiteral)
  • auto_save (bool)
  • modified_on (Optional[float])
  • show_detailed_progress (bool)
  • is_running (bool)
  • is_canceled (bool)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class FlowSettings(FlowGraphConfig):
    """
    Extends FlowGraphConfig with additional operational settings for a flow.

    Attributes:
        auto_save (bool): Flag to enable or disable automatic saving.
        modified_on (Optional[float]): Timestamp of the last modification.
        show_detailed_progress (bool): Flag to show detailed progress during execution.
        is_running (bool): Indicates if the flow is currently running.
        is_canceled (bool): Indicates if the flow execution has been canceled.
    """
    auto_save: bool = False
    modified_on: Optional[float] = None
    show_detailed_progress: bool = True
    is_running: bool = False
    is_canceled: bool = False

    @classmethod
    def from_flow_settings_input(cls, flow_graph_config: FlowGraphConfig):
        """
        Creates a FlowSettings instance from a FlowGraphConfig instance.

        :param flow_graph_config: The base flow graph configuration.
        :return: A new instance of FlowSettings with data from flow_graph_config.
        """
        return cls.model_validate(flow_graph_config.model_dump())
from_flow_settings_input(flow_graph_config) classmethod

Creates a FlowSettings instance from a FlowGraphConfig instance.

Takes the base flow graph configuration and returns a new FlowSettings instance populated with its data.

Source code in flowfile_core/flowfile_core/schemas/schemas.py
@classmethod
def from_flow_settings_input(cls, flow_graph_config: FlowGraphConfig):
    """
    Creates a FlowSettings instance from a FlowGraphConfig instance.

    :param flow_graph_config: The base flow graph configuration.
    :return: A new instance of FlowSettings with data from flow_graph_config.
    """
    return cls.model_validate(flow_graph_config.model_dump())
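
Example: a short sketch of from_flow_settings_input, assuming the import path documented above; the configuration values are illustrative.

from flowfile_core.schemas.schemas import FlowGraphConfig, FlowSettings

cfg = FlowGraphConfig(name="demo", execution_mode="Development")
settings = FlowSettings.from_flow_settings_input(cfg)

assert settings.name == "demo"          # copied from the FlowGraphConfig
assert settings.auto_save is False      # operational fields keep their defaults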
NodeDefault pydantic-model

Bases: BaseModel

Defines default properties for a node type.

Attributes:

Name Type Description
node_name str

The name of the node.

node_type NodeTypeLiteral

The functional type of the node ('input', 'output', 'process').

transform_type TransformTypeLiteral

The data transformation behavior ('narrow', 'wide', 'other').

has_default_settings Optional[Any]

Indicates if the node has predefined default settings.

Show JSON schema:
{
  "description": "Defines default properties for a node type.\n\nAttributes:\n    node_name (str): The name of the node.\n    node_type (NodeTypeLiteral): The functional type of the node ('input', 'output', 'process').\n    transform_type (TransformTypeLiteral): The data transformation behavior ('narrow', 'wide', 'other').\n    has_default_settings (Optional[Any]): Indicates if the node has predefined default settings.",
  "properties": {
    "node_name": {
      "title": "Node Name",
      "type": "string"
    },
    "node_type": {
      "enum": [
        "input",
        "output",
        "process"
      ],
      "title": "Node Type",
      "type": "string"
    },
    "transform_type": {
      "enum": [
        "narrow",
        "wide",
        "other"
      ],
      "title": "Transform Type",
      "type": "string"
    },
    "has_default_settings": {
      "anyOf": [
        {},
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Has Default Settings"
    }
  },
  "required": [
    "node_name",
    "node_type",
    "transform_type"
  ],
  "title": "NodeDefault",
  "type": "object"
}

Fields:

  • node_name (str)
  • node_type (NodeTypeLiteral)
  • transform_type (TransformTypeLiteral)
  • has_default_settings (Optional[Any])
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class NodeDefault(BaseModel):
    """
    Defines default properties for a node type.

    Attributes:
        node_name (str): The name of the node.
        node_type (NodeTypeLiteral): The functional type of the node ('input', 'output', 'process').
        transform_type (TransformTypeLiteral): The data transformation behavior ('narrow', 'wide', 'other').
        has_default_settings (Optional[Any]): Indicates if the node has predefined default settings.
    """
    node_name: str
    node_type: NodeTypeLiteral
    transform_type: TransformTypeLiteral
    has_default_settings: Optional[Any] = None
NodeEdge pydantic-model

Bases: BaseModel

Represents a connection (edge) between two nodes in the frontend.

Attributes:

Name Type Description
id str

A unique identifier for the edge.

source str

The ID of the source node.

target str

The ID of the target node.

targetHandle str

The specific input handle on the target node.

sourceHandle str

The specific output handle on the source node.

Show JSON schema:
{
  "description": "Represents a connection (edge) between two nodes in the frontend.\n\nAttributes:\n    id (str): A unique identifier for the edge.\n    source (str): The ID of the source node.\n    target (str): The ID of the target node.\n    targetHandle (str): The specific input handle on the target node.\n    sourceHandle (str): The specific output handle on the source node.",
  "properties": {
    "id": {
      "title": "Id",
      "type": "string"
    },
    "source": {
      "title": "Source",
      "type": "string"
    },
    "target": {
      "title": "Target",
      "type": "string"
    },
    "targetHandle": {
      "title": "Targethandle",
      "type": "string"
    },
    "sourceHandle": {
      "title": "Sourcehandle",
      "type": "string"
    }
  },
  "required": [
    "id",
    "source",
    "target",
    "targetHandle",
    "sourceHandle"
  ],
  "title": "NodeEdge",
  "type": "object"
}

Config:

  • coerce_numbers_to_str: True

Fields:

  • id (str)
  • source (str)
  • target (str)
  • targetHandle (str)
  • sourceHandle (str)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class NodeEdge(BaseModel):
    """
    Represents a connection (edge) between two nodes in the frontend.

    Attributes:
        id (str): A unique identifier for the edge.
        source (str): The ID of the source node.
        target (str): The ID of the target node.
        targetHandle (str): The specific input handle on the target node.
        sourceHandle (str): The specific output handle on the source node.
    """
    model_config = ConfigDict(coerce_numbers_to_str=True)
    id: str
    source: str
    target: str
    targetHandle: str
    sourceHandle: str
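
Example: because the model sets coerce_numbers_to_str, numeric node IDs coming from the frontend are accepted and stored as strings. A minimal sketch, with illustrative handle names:

from flowfile_core.schemas.schemas import NodeEdge

edge = NodeEdge(id=1, source=1, target=2,
                targetHandle="input-0", sourceHandle="output-0")
assert edge.source == "1" and edge.target == "2"   # numbers coerced to strings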
NodeInformation pydantic-model

Bases: BaseModel

Stores the state and configuration of a specific node instance within a flow.

Attributes:

Name Type Description
id Optional[int]

The unique ID of the node instance.

type Optional[str]

The type of the node (e.g., 'join', 'filter').

is_setup Optional[bool]

Whether the node has been configured.

description Optional[str]

A user-provided description.

x_position Optional[int]

The x-coordinate on the canvas.

y_position Optional[int]

The y-coordinate on the canvas.

left_input_id Optional[int]

The ID of the node connected to the left input.

right_input_id Optional[int]

The ID of the node connected to the right input.

input_ids Optional[List[int]]

A list of IDs for main input nodes.

outputs Optional[List[int]]

A list of IDs for nodes this node outputs to.

setting_input Optional[Any]

The specific settings for this node instance.

Show JSON schema:
{
  "description": "Stores the state and configuration of a specific node instance within a flow.\n\nAttributes:\n    id (Optional[int]): The unique ID of the node instance.\n    type (Optional[str]): The type of the node (e.g., 'join', 'filter').\n    is_setup (Optional[bool]): Whether the node has been configured.\n    description (Optional[str]): A user-provided description.\n    x_position (Optional[int]): The x-coordinate on the canvas.\n    y_position (Optional[int]): The y-coordinate on the canvas.\n    left_input_id (Optional[int]): The ID of the node connected to the left input.\n    right_input_id (Optional[int]): The ID of the node connected to the right input.\n    input_ids (Optional[List[int]]): A list of IDs for main input nodes.\n    outputs (Optional[List[int]]): A list of IDs for nodes this node outputs to.\n    setting_input (Optional[Any]): The specific settings for this node instance.",
  "properties": {
    "id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Id"
    },
    "type": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Type"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "x_position": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "X Position"
    },
    "y_position": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Y Position"
    },
    "left_input_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Left Input Id"
    },
    "right_input_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Right Input Id"
    },
    "input_ids": {
      "anyOf": [
        {
          "items": {
            "type": "integer"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": [
        -1
      ],
      "title": "Input Ids"
    },
    "outputs": {
      "anyOf": [
        {
          "items": {
            "type": "integer"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": [
        -1
      ],
      "title": "Outputs"
    },
    "setting_input": {
      "anyOf": [
        {},
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Setting Input"
    }
  },
  "title": "NodeInformation",
  "type": "object"
}

Fields:

  • id (Optional[int])
  • type (Optional[str])
  • is_setup (Optional[bool])
  • description (Optional[str])
  • x_position (Optional[int])
  • y_position (Optional[int])
  • left_input_id (Optional[int])
  • right_input_id (Optional[int])
  • input_ids (Optional[List[int]])
  • outputs (Optional[List[int]])
  • setting_input (Optional[Any])
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class NodeInformation(BaseModel):
    """
    Stores the state and configuration of a specific node instance within a flow.

    Attributes:
        id (Optional[int]): The unique ID of the node instance.
        type (Optional[str]): The type of the node (e.g., 'join', 'filter').
        is_setup (Optional[bool]): Whether the node has been configured.
        description (Optional[str]): A user-provided description.
        x_position (Optional[int]): The x-coordinate on the canvas.
        y_position (Optional[int]): The y-coordinate on the canvas.
        left_input_id (Optional[int]): The ID of the node connected to the left input.
        right_input_id (Optional[int]): The ID of the node connected to the right input.
        input_ids (Optional[List[int]]): A list of IDs for main input nodes.
        outputs (Optional[List[int]]): A list of IDs for nodes this node outputs to.
        setting_input (Optional[Any]): The specific settings for this node instance.
    """
    id: Optional[int] = None
    type: Optional[str] = None
    is_setup: Optional[bool] = None
    description: Optional[str] = ''
    x_position: Optional[int] = 0
    y_position: Optional[int] = 0
    left_input_id: Optional[int] = None
    right_input_id: Optional[int] = None
    input_ids: Optional[List[int]] = [-1]
    outputs: Optional[List[int]] = [-1]
    setting_input: Optional[Any] = None

    @property
    def data(self) -> Any:
        """
        Property to access the node's specific settings.
        :return: The settings of the node.
        """
        return self.setting_input

    @property
    def main_input_ids(self) -> Optional[List[int]]:
        """
        Property to access the main input node IDs.
        :return: A list of main input node IDs.
        """
        return self.input_ids
data property

Property to access the node's specific settings; returns the settings of the node.

main_input_ids property

Property to access the main input node IDs; returns a list of main input node IDs.
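
Example: a minimal sketch of the model's defaults and convenience properties, assuming the import path documented above; the field values are illustrative.

from flowfile_core.schemas.schemas import NodeInformation

node = NodeInformation(id=3, type="filter", is_setup=True,
                       x_position=120, y_position=80)
print(node.main_input_ids)   # [-1] until real inputs are connected
print(node.data)             # alias for setting_input; None until configured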

NodeInput pydantic-model

Bases: NodeTemplate

Represents a node as it is received from the frontend, including position.

Attributes:

Name Type Description
id int

The unique ID of the node instance.

pos_x float

The x-coordinate on the canvas.

pos_y float

The y-coordinate on the canvas.

Show JSON schema:
{
  "description": "Represents a node as it is received from the frontend, including position.\n\nAttributes:\n    id (int): The unique ID of the node instance.\n    pos_x (float): The x-coordinate on the canvas.\n    pos_y (float): The y-coordinate on the canvas.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "item": {
      "title": "Item",
      "type": "string"
    },
    "input": {
      "title": "Input",
      "type": "integer"
    },
    "output": {
      "title": "Output",
      "type": "integer"
    },
    "image": {
      "title": "Image",
      "type": "string"
    },
    "multi": {
      "default": false,
      "title": "Multi",
      "type": "boolean"
    },
    "node_group": {
      "title": "Node Group",
      "type": "string"
    },
    "prod_ready": {
      "default": true,
      "title": "Prod Ready",
      "type": "boolean"
    },
    "can_be_start": {
      "default": false,
      "title": "Can Be Start",
      "type": "boolean"
    },
    "id": {
      "title": "Id",
      "type": "integer"
    },
    "pos_x": {
      "title": "Pos X",
      "type": "number"
    },
    "pos_y": {
      "title": "Pos Y",
      "type": "number"
    }
  },
  "required": [
    "name",
    "item",
    "input",
    "output",
    "image",
    "node_group",
    "id",
    "pos_x",
    "pos_y"
  ],
  "title": "NodeInput",
  "type": "object"
}

Fields:

  • name (str)
  • item (str)
  • input (int)
  • output (int)
  • image (str)
  • multi (bool)
  • node_group (str)
  • prod_ready (bool)
  • can_be_start (bool)
  • id (int)
  • pos_x (float)
  • pos_y (float)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class NodeInput(NodeTemplate):
    """
    Represents a node as it is received from the frontend, including position.

    Attributes:
        id (int): The unique ID of the node instance.
        pos_x (float): The x-coordinate on the canvas.
        pos_y (float): The y-coordinate on the canvas.
    """
    id: int
    pos_x: float
    pos_y: float
NodeTemplate pydantic-model

Bases: BaseModel

Defines the template for a node type, specifying its UI and functional characteristics.

Attributes:

Name Type Description
name str

The display name of the node.

item str

The unique identifier for the node type.

input int

The number of required input connections.

output int

The number of output connections.

image str

The filename of the icon for the node.

multi bool

Whether the node accepts multiple main input connections.

node_group str

The category group the node belongs to (e.g., 'input', 'transform').

prod_ready bool

Whether the node is considered production-ready.

can_be_start bool

Whether the node can be a starting point in a flow.

Show JSON schema:
{
  "description": "Defines the template for a node type, specifying its UI and functional characteristics.\n\nAttributes:\n    name (str): The display name of the node.\n    item (str): The unique identifier for the node type.\n    input (int): The number of required input connections.\n    output (int): The number of output connections.\n    image (str): The filename of the icon for the node.\n    multi (bool): Whether the node accepts multiple main input connections.\n    node_group (str): The category group the node belongs to (e.g., 'input', 'transform').\n    prod_ready (bool): Whether the node is considered production-ready.\n    can_be_start (bool): Whether the node can be a starting point in a flow.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "item": {
      "title": "Item",
      "type": "string"
    },
    "input": {
      "title": "Input",
      "type": "integer"
    },
    "output": {
      "title": "Output",
      "type": "integer"
    },
    "image": {
      "title": "Image",
      "type": "string"
    },
    "multi": {
      "default": false,
      "title": "Multi",
      "type": "boolean"
    },
    "node_group": {
      "title": "Node Group",
      "type": "string"
    },
    "prod_ready": {
      "default": true,
      "title": "Prod Ready",
      "type": "boolean"
    },
    "can_be_start": {
      "default": false,
      "title": "Can Be Start",
      "type": "boolean"
    }
  },
  "required": [
    "name",
    "item",
    "input",
    "output",
    "image",
    "node_group"
  ],
  "title": "NodeTemplate",
  "type": "object"
}

Fields:

  • name (str)
  • item (str)
  • input (int)
  • output (int)
  • image (str)
  • multi (bool)
  • node_group (str)
  • prod_ready (bool)
  • can_be_start (bool)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class NodeTemplate(BaseModel):
    """
    Defines the template for a node type, specifying its UI and functional characteristics.

    Attributes:
        name (str): The display name of the node.
        item (str): The unique identifier for the node type.
        input (int): The number of required input connections.
        output (int): The number of output connections.
        image (str): The filename of the icon for the node.
        multi (bool): Whether the node accepts multiple main input connections.
        node_group (str): The category group the node belongs to (e.g., 'input', 'transform').
        prod_ready (bool): Whether the node is considered production-ready.
        can_be_start (bool): Whether the node can be a starting point in a flow.
    """
    name: str
    item: str
    input: int
    output: int
    image: str
    multi: bool = False
    node_group: str
    prod_ready: bool = True
    can_be_start: bool = False
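
Example: a minimal sketch of declaring a template, assuming the import path documented above; the concrete values (name, item, image, group) are illustrative placeholders, not entries from the real node registry.

from flowfile_core.schemas.schemas import NodeTemplate

filter_template = NodeTemplate(
    name="Filter",            # display name shown in the UI
    item="filter",            # unique identifier for the node type
    input=1, output=1,        # one main input, one output
    image="filter.png",
    node_group="transform",
)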
RawLogInput pydantic-model

Bases: BaseModel

Schema for a raw log message.

Attributes:

Name Type Description
flowfile_flow_id int

The ID of the flow that generated the log.

log_message str

The content of the log message.

log_type Literal['INFO', 'ERROR']

The type of log.

extra Optional[dict]

Extra context data for the log.

Show JSON schema:
{
  "description": "Schema for a raw log message.\n\nAttributes:\n    flowfile_flow_id (int): The ID of the flow that generated the log.\n    log_message (str): The content of the log message.\n    log_type (Literal[\"INFO\", \"ERROR\"]): The type of log.\n    extra (Optional[dict]): Extra context data for the log.",
  "properties": {
    "flowfile_flow_id": {
      "title": "Flowfile Flow Id",
      "type": "integer"
    },
    "log_message": {
      "title": "Log Message",
      "type": "string"
    },
    "log_type": {
      "enum": [
        "INFO",
        "ERROR"
      ],
      "title": "Log Type",
      "type": "string"
    },
    "extra": {
      "anyOf": [
        {
          "type": "object"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Extra"
    }
  },
  "required": [
    "flowfile_flow_id",
    "log_message",
    "log_type"
  ],
  "title": "RawLogInput",
  "type": "object"
}

Fields:

  • flowfile_flow_id (int)
  • log_message (str)
  • log_type (Literal['INFO', 'ERROR'])
  • extra (Optional[dict])
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class RawLogInput(BaseModel):
    """
    Schema for a raw log message.

    Attributes:
        flowfile_flow_id (int): The ID of the flow that generated the log.
        log_message (str): The content of the log message.
        log_type (Literal["INFO", "ERROR"]): The type of log.
        extra (Optional[dict]): Extra context data for the log.
    """
    flowfile_flow_id: int
    log_message: str
    log_type: Literal["INFO", "ERROR"]
    extra: Optional[dict] = None
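
Example: a one-line construction sketch, assuming the documented import path; the message text is illustrative.

from flowfile_core.schemas.schemas import RawLogInput

log = RawLogInput(flowfile_flow_id=1, log_message="Node 3 finished", log_type="INFO")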
VueFlowInput pydantic-model

Bases: BaseModel

Represents the complete graph structure from the Vue-based frontend.

Attributes:

Name Type Description
node_edges List[NodeEdge]

A list of all edges in the graph.

node_inputs List[NodeInput]

A list of all nodes in the graph.

Show JSON schema:
{
  "$defs": {
    "NodeEdge": {
      "description": "Represents a connection (edge) between two nodes in the frontend.\n\nAttributes:\n    id (str): A unique identifier for the edge.\n    source (str): The ID of the source node.\n    target (str): The ID of the target node.\n    targetHandle (str): The specific input handle on the target node.\n    sourceHandle (str): The specific output handle on the source node.",
      "properties": {
        "id": {
          "title": "Id",
          "type": "string"
        },
        "source": {
          "title": "Source",
          "type": "string"
        },
        "target": {
          "title": "Target",
          "type": "string"
        },
        "targetHandle": {
          "title": "Targethandle",
          "type": "string"
        },
        "sourceHandle": {
          "title": "Sourcehandle",
          "type": "string"
        }
      },
      "required": [
        "id",
        "source",
        "target",
        "targetHandle",
        "sourceHandle"
      ],
      "title": "NodeEdge",
      "type": "object"
    },
    "NodeInput": {
      "description": "Represents a node as it is received from the frontend, including position.\n\nAttributes:\n    id (int): The unique ID of the node instance.\n    pos_x (float): The x-coordinate on the canvas.\n    pos_y (float): The y-coordinate on the canvas.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "item": {
          "title": "Item",
          "type": "string"
        },
        "input": {
          "title": "Input",
          "type": "integer"
        },
        "output": {
          "title": "Output",
          "type": "integer"
        },
        "image": {
          "title": "Image",
          "type": "string"
        },
        "multi": {
          "default": false,
          "title": "Multi",
          "type": "boolean"
        },
        "node_group": {
          "title": "Node Group",
          "type": "string"
        },
        "prod_ready": {
          "default": true,
          "title": "Prod Ready",
          "type": "boolean"
        },
        "can_be_start": {
          "default": false,
          "title": "Can Be Start",
          "type": "boolean"
        },
        "id": {
          "title": "Id",
          "type": "integer"
        },
        "pos_x": {
          "title": "Pos X",
          "type": "number"
        },
        "pos_y": {
          "title": "Pos Y",
          "type": "number"
        }
      },
      "required": [
        "name",
        "item",
        "input",
        "output",
        "image",
        "node_group",
        "id",
        "pos_x",
        "pos_y"
      ],
      "title": "NodeInput",
      "type": "object"
    }
  },
  "description": "Represents the complete graph structure from the Vue-based frontend.\n\nAttributes:\n    node_edges (List[NodeEdge]): A list of all edges in the graph.\n    node_inputs (List[NodeInput]): A list of all nodes in the graph.",
  "properties": {
    "node_edges": {
      "items": {
        "$ref": "#/$defs/NodeEdge"
      },
      "title": "Node Edges",
      "type": "array"
    },
    "node_inputs": {
      "items": {
        "$ref": "#/$defs/NodeInput"
      },
      "title": "Node Inputs",
      "type": "array"
    }
  },
  "required": [
    "node_edges",
    "node_inputs"
  ],
  "title": "VueFlowInput",
  "type": "object"
}

Fields:

  • node_edges (List[NodeEdge])
  • node_inputs (List[NodeInput])
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class VueFlowInput(BaseModel):
    """

    Represents the complete graph structure from the Vue-based frontend.

    Attributes:
        node_edges (List[NodeEdge]): A list of all edges in the graph.
        node_inputs (List[NodeInput]): A list of all nodes in the graph.
    """
    node_edges: List[NodeEdge]
    node_inputs: List[NodeInput]
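
Example: the sketch below assembles the frontend payload from the two building blocks above; node names, items, and handle strings are illustrative placeholders, and the import path follows the module heading.

from flowfile_core.schemas.schemas import NodeEdge, NodeInput, VueFlowInput

nodes = [
    NodeInput(name="Manual input", item="manual_input", input=0, output=1,
              image="manual_input.png", node_group="input", can_be_start=True,
              id=1, pos_x=0.0, pos_y=0.0),
    NodeInput(name="Filter", item="filter", input=1, output=1,
              image="filter.png", node_group="transform",
              id=2, pos_x=200.0, pos_y=0.0),
]
edges = [NodeEdge(id="1-2", source="1", target="2",
                  targetHandle="input-0", sourceHandle="output-0")]

graph = VueFlowInput(node_edges=edges, node_inputs=nodes)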

input_schema

flowfile_core.schemas.input_schema

Classes:

Name Description
DatabaseConnection

Defines the connection parameters for a database.

DatabaseSettings

Defines settings for reading from a database, either via table or query.

DatabaseWriteSettings

Defines settings for writing data to a database table.

ExternalSource

Base model for data coming from a predefined external source.

FullDatabaseConnection

A complete database connection model including the secret password.

FullDatabaseConnectionInterface

A database connection model intended for UI display, omitting the password.

MinimalFieldInfo

Represents the most basic information about a data field (column).

NewDirectory

Defines the information required to create a new directory.

NodeBase

Base model for all nodes in a FlowGraph. Contains common metadata.

NodeCloudStorageReader

Settings for a node that reads from a cloud storage service (S3, GCS, etc.).

NodeCloudStorageWriter

Settings for a node that writes to a cloud storage service.

NodeConnection

Represents a connection (edge) between two nodes in the graph.

NodeCrossJoin

Settings for a node that performs a cross join.

NodeDatabaseReader

Settings for a node that reads from a database.

NodeDatabaseWriter

Settings for a node that writes data to a database.

NodeDatasource

Base settings for a node that acts as a data source.

NodeDescription

A simple model for updating a node's description text.

NodeExploreData

Settings for a node that provides an interactive data exploration interface.

NodeExternalSource

Settings for a node that connects to a registered external data source.

NodeFilter

Settings for a node that filters rows based on a condition.

NodeFormula

Settings for a node that applies a formula to create/modify a column.

NodeFuzzyMatch

Settings for a node that performs a fuzzy join based on string similarity.

NodeGraphSolver

Settings for a node that solves graph-based problems (e.g., connected components).

NodeGroupBy

Settings for a node that performs a group-by and aggregation operation.

NodeInputConnection

Represents the input side of a connection between two nodes.

NodeJoin

Settings for a node that performs a standard SQL-style join.

NodeManualInput

Settings for a node that allows direct data entry in the UI.

NodeMultiInput

A base model for any node that takes multiple data inputs.

NodeOutput

Settings for a node that writes its input to a file.

NodeOutputConnection

Represents the output side of a connection between two nodes.

NodePivot

Settings for a node that pivots data from a long to a wide format.

NodePolarsCode

Settings for a node that executes arbitrary user-provided Polars code.

NodePromise

A placeholder node for an operation that has not yet been configured.

NodeRead

Settings for a node that reads data from a file.

NodeRecordCount

Settings for a node that counts the number of records.

NodeRecordId

Settings for a node that adds a unique record ID column.

NodeSample

Settings for a node that samples a subset of the data.

NodeSelect

Settings for a node that selects, renames, and reorders columns.

NodeSingleInput

A base model for any node that takes a single data input.

NodeSort

Settings for a node that sorts the data by one or more columns.

NodeTextToRows

Settings for a node that splits a text column into multiple rows.

NodeUnion

Settings for a node that concatenates multiple data inputs.

NodeUnique

Settings for a node that returns the unique rows from the data.

NodeUnpivot

Settings for a node that unpivots data from a wide to a long format.

OutputCsvTable

Defines settings for writing a CSV file.

OutputExcelTable

Defines settings for writing an Excel file.

OutputParquetTable

Defines settings for writing a Parquet file.

OutputSettings

Defines the complete settings for an output node.

RawData

Represents data in a raw, columnar format for manual input.

ReceivedCsvTable

Defines settings for reading a CSV file.

ReceivedExcelTable

Defines settings for reading an Excel file.

ReceivedJsonTable

Defines settings for reading a JSON file (inherits from CSV settings).

ReceivedParquetTable

Defines settings for reading a Parquet file.

ReceivedTable

A comprehensive model that can represent any type of received table.

ReceivedTableBase

Base model for defining a table received from an external source.

RemoveItem

Represents a single item to be removed from a directory or list.

RemoveItemsInput

Defines a list of items to be removed.

SampleUsers

Settings for generating a sample dataset of users.

DatabaseConnection

Bases: BaseModel

Defines the connection parameters for a database.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class DatabaseConnection(BaseModel):
    """Defines the connection parameters for a database."""
    database_type: str = "postgresql"
    username: Optional[str] = None
    password_ref: Optional[SecretRef] = None
    host: Optional[str] = None
    port: Optional[int] = None
    database: Optional[str] = None
    url: Optional[str] = None
DatabaseSettings

Bases: BaseModel

Defines settings for reading from a database, either via table or query.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class DatabaseSettings(BaseModel):
    """Defines settings for reading from a database, either via table or query."""
    connection_mode: Optional[Literal['inline', 'reference']] = 'inline'
    database_connection: Optional[DatabaseConnection] = None
    database_connection_name: Optional[str] = None
    schema_name: Optional[str] = None
    table_name: Optional[str] = None
    query: Optional[str] = None
    query_mode: Literal['query', 'table', 'reference'] = 'table'

    @model_validator(mode='after')
    def validate_table_or_query(self):
        # Validate that either table_name or query is provided
        if (not self.table_name and not self.query) and self.query_mode == 'inline':
            raise ValueError("Either 'table_name' or 'query' must be provided")

        # Validate correct connection information based on connection_mode
        if self.connection_mode == 'inline' and self.database_connection is None:
            raise ValueError("When 'connection_mode' is 'inline', 'database_connection' must be provided")

        if self.connection_mode == 'reference' and not self.database_connection_name:
            raise ValueError("When 'connection_mode' is 'reference', 'database_connection_name' must be provided")

        return self
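
Example: a sketch of the validator's behaviour, assuming both models import from flowfile_core.schemas.input_schema as documented above; the connection details are illustrative.

from flowfile_core.schemas.input_schema import DatabaseConnection, DatabaseSettings

# Inline connection plus an explicit table name passes validation.
settings = DatabaseSettings(
    connection_mode="inline",
    database_connection=DatabaseConnection(host="localhost", port=5432,
                                           database="analytics", username="etl"),
    table_name="public.orders",
)

# A 'reference' connection without a stored connection name is rejected.
try:
    DatabaseSettings(connection_mode="reference", table_name="public.orders")
except ValueError as exc:
    print(exc)   # "When 'connection_mode' is 'reference', ..."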
DatabaseWriteSettings

Bases: BaseModel

Defines settings for writing data to a database table.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class DatabaseWriteSettings(BaseModel):
    """Defines settings for writing data to a database table."""
    connection_mode: Optional[Literal['inline', 'reference']] = 'inline'
    database_connection: Optional[DatabaseConnection] = None
    database_connection_name: Optional[str] = None
    table_name: str
    schema_name: Optional[str] = None
    if_exists: Optional[Literal['append', 'replace', 'fail']] = 'append'
ExternalSource

Bases: BaseModel

Base model for data coming from a predefined external source.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class ExternalSource(BaseModel):
    """Base model for data coming from a predefined external source."""
    orientation: str = 'row'
    fields: Optional[List[MinimalFieldInfo]] = None
FullDatabaseConnection

Bases: BaseModel

A complete database connection model including the secret password.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class FullDatabaseConnection(BaseModel):
    """A complete database connection model including the secret password."""
    connection_name: str
    database_type: str = "postgresql"
    username: str
    password: SecretStr
    host: Optional[str] = None
    port: Optional[int] = None
    database: Optional[str] = None
    ssl_enabled: Optional[bool] = False
    url: Optional[str] = None
FullDatabaseConnectionInterface

Bases: BaseModel

A database connection model intended for UI display, omitting the password.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class FullDatabaseConnectionInterface(BaseModel):
    """A database connection model intended for UI display, omitting the password."""
    connection_name: str
    database_type: str = "postgresql"
    username: str
    host: Optional[str] = None
    port: Optional[int] = None
    database: Optional[str] = None
    ssl_enabled: Optional[bool] = False
    url: Optional[str] = None
MinimalFieldInfo

Bases: BaseModel

Represents the most basic information about a data field (column).

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class MinimalFieldInfo(BaseModel):
    """Represents the most basic information about a data field (column)."""
    name: str
    data_type: str = "String"
NewDirectory

Bases: BaseModel

Defines the information required to create a new directory.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NewDirectory(BaseModel):
    """Defines the information required to create a new directory."""
    source_path: str
    dir_name: str
NodeBase

Bases: BaseModel

Base model for all nodes in a FlowGraph. Contains common metadata.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeBase(BaseModel):
    """Base model for all nodes in a FlowGraph. Contains common metadata."""
    model_config = ConfigDict(arbitrary_types_allowed=True)
    flow_id: int
    node_id: int
    cache_results: Optional[bool] = False
    pos_x: Optional[float] = 0
    pos_y: Optional[float] = 0
    is_setup: Optional[bool] = True
    description: Optional[str] = ''
    user_id: Optional[int] = None
    is_flow_output: Optional[bool] = False
NodeCloudStorageReader

Bases: NodeBase

Settings for a node that reads from a cloud storage service (S3, GCS, etc.).

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeCloudStorageReader(NodeBase):
    """Settings for a node that reads from a cloud storage service (S3, GCS, etc.)."""
    cloud_storage_settings: CloudStorageReadSettings
    fields: Optional[List[MinimalFieldInfo]] = None
NodeCloudStorageWriter

Bases: NodeSingleInput

Settings for a node that writes to a cloud storage service.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeCloudStorageWriter(NodeSingleInput):
    """Settings for a node that writes to a cloud storage service."""
    cloud_storage_settings: CloudStorageWriteSettings
NodeConnection

Bases: BaseModel

Represents a connection (edge) between two nodes in the graph.

Methods:

Name Description
create_from_simple_input

Creates a standard connection between two nodes.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeConnection(BaseModel):
    """Represents a connection (edge) between two nodes in the graph."""
    input_connection: NodeInputConnection
    output_connection: NodeOutputConnection

    @classmethod
    def create_from_simple_input(cls, from_id: int, to_id: int, input_type: InputType = "input-0"):
        """Creates a standard connection between two nodes."""
        match input_type:
            case "main": connection_class: InputConnectionClass = "input-0"
            case "right": connection_class: InputConnectionClass = "input-1"
            case "left": connection_class: InputConnectionClass = "input-2"
            case _: connection_class: InputConnectionClass = "input-0"
        node_input = NodeInputConnection(node_id=to_id, connection_class=connection_class)
        node_output = NodeOutputConnection(node_id=from_id, connection_class='output-0')
        return cls(input_connection=node_input, output_connection=node_output)
create_from_simple_input(from_id, to_id, input_type='input-0') classmethod

Creates a standard connection between two nodes.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
@classmethod
def create_from_simple_input(cls, from_id: int, to_id: int, input_type: InputType = "input-0"):
    """Creates a standard connection between two nodes."""
    match input_type:
        case "main": connection_class: InputConnectionClass = "input-0"
        case "right": connection_class: InputConnectionClass = "input-1"
        case "left": connection_class: InputConnectionClass = "input-2"
        case _: connection_class: InputConnectionClass = "input-0"
    node_input = NodeInputConnection(node_id=to_id, connection_class=connection_class)
    node_output = NodeOutputConnection(node_id=from_id, connection_class='output-0')
    return cls(input_connection=node_input, output_connection=node_output)
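
A short, hypothetical sketch of how this classmethod maps a semantic input type onto connection classes; node ids 1 and 2 are placeholders:

conn = NodeConnection.create_from_simple_input(from_id=1, to_id=2, input_type="main")
# "main" resolves to "input-0" on the receiving node; the source side is always "output-0".
assert conn.input_connection.connection_class == "input-0"
assert conn.output_connection.node_id == 1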
NodeCrossJoin

Bases: NodeMultiInput

Settings for a node that performs a cross join.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeCrossJoin(NodeMultiInput):
    """Settings for a node that performs a cross join."""
    auto_generate_selection: bool = True
    verify_integrity: bool = True
    cross_join_input: transform_schema.CrossJoinInput
    auto_keep_all: bool = True
    auto_keep_right: bool = True
    auto_keep_left: bool = True
NodeDatabaseReader

Bases: NodeBase

Settings for a node that reads from a database.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeDatabaseReader(NodeBase):
    """Settings for a node that reads from a database."""
    database_settings: DatabaseSettings
    fields: Optional[List[MinimalFieldInfo]] = None
NodeDatabaseWriter

Bases: NodeSingleInput

Settings for a node that writes data to a database.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeDatabaseWriter(NodeSingleInput):
    """Settings for a node that writes data to a database."""
    database_write_settings: DatabaseWriteSettings
NodeDatasource

Bases: NodeBase

Base settings for a node that acts as a data source.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeDatasource(NodeBase):
    """Base settings for a node that acts as a data source."""
    file_ref: str = None
NodeDescription

Bases: BaseModel

A simple model for updating a node's description text.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeDescription(BaseModel):
    """A simple model for updating a node's description text."""
    description: str = ''
NodeExploreData

Bases: NodeBase

Settings for a node that provides an interactive data exploration interface.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeExploreData(NodeBase):
    """Settings for a node that provides an interactive data exploration interface."""
    graphic_walker_input: Optional[gs_schemas.GraphicWalkerInput] = None
NodeExternalSource

Bases: NodeBase

Settings for a node that connects to a registered external data source.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeExternalSource(NodeBase):
    """Settings for a node that connects to a registered external data source."""
    identifier: str
    source_settings: SampleUsers
NodeFilter

Bases: NodeSingleInput

Settings for a node that filters rows based on a condition.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeFilter(NodeSingleInput):
    """Settings for a node that filters rows based on a condition."""
    filter_input: transform_schema.FilterInput
NodeFormula

Bases: NodeSingleInput

Settings for a node that applies a formula to create/modify a column.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeFormula(NodeSingleInput):
    """Settings for a node that applies a formula to create/modify a column."""
    function: transform_schema.FunctionInput = None
NodeFuzzyMatch

Bases: NodeJoin

Settings for a node that performs a fuzzy join based on string similarity.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeFuzzyMatch(NodeJoin):
    """Settings for a node that performs a fuzzy join based on string similarity."""
    join_input: transform_schema.FuzzyMatchInput
NodeGraphSolver

Bases: NodeSingleInput

Settings for a node that solves graph-based problems (e.g., connected components).

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeGraphSolver(NodeSingleInput):
    """Settings for a node that solves graph-based problems (e.g., connected components)."""
    graph_solver_input: transform_schema.GraphSolverInput
NodeGroupBy

Bases: NodeSingleInput

Settings for a node that performs a group-by and aggregation operation.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeGroupBy(NodeSingleInput):
    """Settings for a node that performs a group-by and aggregation operation."""
    groupby_input: transform_schema.GroupByInput = None
NodeInputConnection

Bases: BaseModel

Represents the input side of a connection between two nodes.

Methods:

Name Description
get_node_input_connection_type

Determines the semantic type of the input (e.g., for a join).

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeInputConnection(BaseModel):
    """Represents the input side of a connection between two nodes."""
    node_id: int
    connection_class: InputConnectionClass

    def get_node_input_connection_type(self) -> Literal['main', 'right', 'left']:
        """Determines the semantic type of the input (e.g., for a join)."""
        match self.connection_class:
            case 'input-0': return 'main'
            case 'input-1': return 'right'
            case 'input-2': return 'left'
            case _: raise ValueError(f"Unexpected connection_class: {self.connection_class}")
get_node_input_connection_type()

Determines the semantic type of the input (e.g., for a join).

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
def get_node_input_connection_type(self) -> Literal['main', 'right', 'left']:
    """Determines the semantic type of the input (e.g., for a join)."""
    match self.connection_class:
        case 'input-0': return 'main'
        case 'input-1': return 'right'
        case 'input-2': return 'left'
        case _: raise ValueError(f"Unexpected connection_class: {self.connection_class}")
NodeJoin

Bases: NodeMultiInput

Settings for a node that performs a standard SQL-style join.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeJoin(NodeMultiInput):
    """Settings for a node that performs a standard SQL-style join."""
    auto_generate_selection: bool = True
    verify_integrity: bool = True
    join_input: transform_schema.JoinInput
    auto_keep_all: bool = True
    auto_keep_right: bool = True
    auto_keep_left: bool = True
NodeManualInput

Bases: NodeBase

Settings for a node that allows direct data entry in the UI.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeManualInput(NodeBase):
    """Settings for a node that allows direct data entry in the UI."""
    raw_data_format: Optional[RawData] = None
NodeMultiInput

Bases: NodeBase

A base model for any node that takes multiple data inputs.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeMultiInput(NodeBase):
    """A base model for any node that takes multiple data inputs."""
    depending_on_ids: Optional[List[int]] = [-1]
NodeOutput

Bases: NodeSingleInput

Settings for a node that writes its input to a file.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeOutput(NodeSingleInput):
    """Settings for a node that writes its input to a file."""
    output_settings: OutputSettings
NodeOutputConnection

Bases: BaseModel

Represents the output side of a connection between two nodes.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeOutputConnection(BaseModel):
    """Represents the output side of a connection between two nodes."""
    node_id: int
    connection_class: OutputConnectionClass
NodePivot

Bases: NodeSingleInput

Settings for a node that pivots data from a long to a wide format.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodePivot(NodeSingleInput):
    """Settings for a node that pivots data from a long to a wide format."""
    pivot_input: transform_schema.PivotInput = None
    output_fields: Optional[List[MinimalFieldInfo]] = None
NodePolarsCode

Bases: NodeMultiInput

Settings for a node that executes arbitrary user-provided Polars code.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodePolarsCode(NodeMultiInput):
    """Settings for a node that executes arbitrary user-provided Polars code."""
    polars_code_input: transform_schema.PolarsCodeInput
NodePromise

Bases: NodeBase

A placeholder node for an operation that has not yet been configured.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodePromise(NodeBase):
    """A placeholder node for an operation that has not yet been configured."""
    is_setup: bool = False
    node_type: str
NodeRead

Bases: NodeBase

Settings for a node that reads data from a file.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeRead(NodeBase):
    """Settings for a node that reads data from a file."""
    received_file: ReceivedTable
NodeRecordCount

Bases: NodeSingleInput

Settings for a node that counts the number of records.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeRecordCount(NodeSingleInput):
    """Settings for a node that counts the number of records."""
    pass
NodeRecordId

Bases: NodeSingleInput

Settings for a node that adds a unique record ID column.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeRecordId(NodeSingleInput):
    """Settings for a node that adds a unique record ID column."""
    record_id_input: transform_schema.RecordIdInput
NodeSample

Bases: NodeSingleInput

Settings for a node that samples a subset of the data.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeSample(NodeSingleInput):
    """Settings for a node that samples a subset of the data."""
    sample_size: int = 1000
NodeSelect

Bases: NodeSingleInput

Settings for a node that selects, renames, and reorders columns.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeSelect(NodeSingleInput):
    """Settings for a node that selects, renames, and reorders columns."""
    keep_missing: bool = True
    select_input: List[transform_schema.SelectInput] = Field(default_factory=list)
    sorted_by: Optional[Literal['none', 'asc', 'desc']] = 'none'
NodeSingleInput

Bases: NodeBase

A base model for any node that takes a single data input.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeSingleInput(NodeBase):
    """A base model for any node that takes a single data input."""
    depending_on_id: Optional[int] = -1
NodeSort

Bases: NodeSingleInput

Settings for a node that sorts the data by one or more columns.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeSort(NodeSingleInput):
    """Settings for a node that sorts the data by one or more columns."""
    sort_input: List[transform_schema.SortByInput] = Field(default_factory=list)
NodeTextToRows

Bases: NodeSingleInput

Settings for a node that splits a text column into multiple rows.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeTextToRows(NodeSingleInput):
    """Settings for a node that splits a text column into multiple rows."""
    text_to_rows_input: transform_schema.TextToRowsInput
NodeUnion

Bases: NodeMultiInput

Settings for a node that concatenates multiple data inputs.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeUnion(NodeMultiInput):
    """Settings for a node that concatenates multiple data inputs."""
    union_input: transform_schema.UnionInput = Field(default_factory=transform_schema.UnionInput)
NodeUnique

Bases: NodeSingleInput

Settings for a node that returns the unique rows from the data.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeUnique(NodeSingleInput):
    """Settings for a node that returns the unique rows from the data."""
    unique_input: transform_schema.UniqueInput
NodeUnpivot

Bases: NodeSingleInput

Settings for a node that unpivots data from a wide to a long format.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeUnpivot(NodeSingleInput):
    """Settings for a node that unpivots data from a wide to a long format."""
    unpivot_input: transform_schema.UnpivotInput = None
OutputCsvTable

Bases: BaseModel

Defines settings for writing a CSV file.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class OutputCsvTable(BaseModel):
    """Defines settings for writing a CSV file."""
    file_type: str = 'csv'
    delimiter: str = ','
    encoding: str = 'utf-8'
OutputExcelTable

Bases: BaseModel

Defines settings for writing an Excel file.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class OutputExcelTable(BaseModel):
    """Defines settings for writing an Excel file."""
    file_type: str = 'excel'
    sheet_name: str = 'Sheet1'
OutputParquetTable

Bases: BaseModel

Defines settings for writing a Parquet file.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class OutputParquetTable(BaseModel):
    """Defines settings for writing a Parquet file."""
    file_type: str = 'parquet'
OutputSettings

Bases: BaseModel

Defines the complete settings for an output node.

Methods:

Name Description
populate_abs_file_path

Ensures the absolute file path is populated after validation.

set_absolute_filepath

Resolves the output directory and name into an absolute path.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class OutputSettings(BaseModel):
    """Defines the complete settings for an output node."""
    name: str
    directory: str
    file_type: str
    fields: Optional[List[str]] = Field(default_factory=list)
    write_mode: str = 'overwrite'
    output_csv_table: Optional[OutputCsvTable] = Field(default_factory=OutputCsvTable)
    output_parquet_table: OutputParquetTable = Field(default_factory=OutputParquetTable)
    output_excel_table: OutputExcelTable = Field(default_factory=OutputExcelTable)
    abs_file_path: Optional[str] = None

    def set_absolute_filepath(self):
        """Resolves the output directory and name into an absolute path."""
        base_path = Path(self.directory)
        if not base_path.is_absolute():
            base_path = Path.cwd() / base_path
        if self.name and self.name not in base_path.name:
            base_path = base_path / self.name
        self.abs_file_path = str(base_path.resolve())

    @model_validator(mode='after')
    def populate_abs_file_path(self):
        """Ensures the absolute file path is populated after validation."""
        self.set_absolute_filepath()
        return self
populate_abs_file_path()

Ensures the absolute file path is populated after validation.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
@model_validator(mode='after')
def populate_abs_file_path(self):
    """Ensures the absolute file path is populated after validation."""
    self.set_absolute_filepath()
    return self
set_absolute_filepath()

Resolves the output directory and name into an absolute path.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
def set_absolute_filepath(self):
    """Resolves the output directory and name into an absolute path."""
    base_path = Path(self.directory)
    if not base_path.is_absolute():
        base_path = Path.cwd() / base_path
    if self.name and self.name not in base_path.name:
        base_path = base_path / self.name
    self.abs_file_path = str(base_path.resolve())
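
A brief sketch (file name and directory are made up) of how the validator fills abs_file_path when an OutputSettings instance is created with a relative directory:

settings = OutputSettings(name='result.parquet', directory='output', file_type='parquet')
# populate_abs_file_path runs after validation, so the absolute path is already resolved,
# e.g. <current working directory>/output/result.parquet.
print(settings.abs_file_path)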
RawData

Bases: BaseModel

Represents data in a raw, columnar format for manual input.

Methods:

Name Description
from_pylist

Creates a RawData object from a list of Python dictionaries.

to_pylist

Converts the RawData object back into a list of Python dictionaries.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class RawData(BaseModel):
    """Represents data in a raw, columnar format for manual input."""
    columns: List[MinimalFieldInfo] = None
    data: List[List]

    @classmethod
    def from_pylist(cls, pylist: List[dict]):
        """Creates a RawData object from a list of Python dictionaries."""
        if len(pylist) == 0:
            return cls(columns=[], data=[])
        pylist = ensure_similarity_dicts(pylist)
        values = [standardize_col_dtype([vv for vv in c]) for c in
                  zip(*(r.values() for r in pylist))]
        data_types = (pl.DataType.from_python(type(next((v for v in column_values), None))) for column_values in values)
        columns = [MinimalFieldInfo(name=c, data_type=str(next(data_types))) for c in pylist[0].keys()]
        return cls(columns=columns, data=values)

    def to_pylist(self) -> List[dict]:
        """Converts the RawData object back into a list of Python dictionaries."""
        return [{c.name: self.data[ci][ri] for ci, c in enumerate(self.columns)} for ri in range(len(self.data[0]))]
from_pylist(pylist) classmethod

Creates a RawData object from a list of Python dictionaries.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
@classmethod
def from_pylist(cls, pylist: List[dict]):
    """Creates a RawData object from a list of Python dictionaries."""
    if len(pylist) == 0:
        return cls(columns=[], data=[])
    pylist = ensure_similarity_dicts(pylist)
    values = [standardize_col_dtype([vv for vv in c]) for c in
              zip(*(r.values() for r in pylist))]
    data_types = (pl.DataType.from_python(type(next((v for v in column_values), None))) for column_values in values)
    columns = [MinimalFieldInfo(name=c, data_type=str(next(data_types))) for c in pylist[0].keys()]
    return cls(columns=columns, data=values)
to_pylist()

Converts the RawData object back into a list of Python dictionaries.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
def to_pylist(self) -> List[dict]:
    """Converts the RawData object back into a list of Python dictionaries."""
    return [{c.name: self.data[ci][ri] for ci, c in enumerate(self.columns)} for ri in range(len(self.data[0]))]
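
An illustrative round trip (the records are invented) showing that the data is stored column-wise and can be converted back to a list of row dictionaries:

records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
raw = RawData.from_pylist(records)
# from_pylist stores one list per column, e.g. [["Alice", "Bob"], [30, 25]],
# and infers a MinimalFieldInfo for each key.
round_tripped = raw.to_pylist()  # back to a list of row dictionaries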
ReceivedCsvTable

Bases: ReceivedTableBase

Defines settings for reading a CSV file.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class ReceivedCsvTable(ReceivedTableBase):
    """Defines settings for reading a CSV file."""
    file_type: str = 'csv'
    reference: str = ''
    starting_from_line: int = 0
    delimiter: str = ','
    has_headers: bool = True
    encoding: Optional[str] = 'utf-8'
    parquet_ref: Optional[str] = None
    row_delimiter: str = '\n'
    quote_char: str = '"'
    infer_schema_length: int = 10_000
    truncate_ragged_lines: bool = False
    ignore_errors: bool = False
ReceivedExcelTable

Bases: ReceivedTableBase

Defines settings for reading an Excel file.

Methods:

Name Description
validate_range_values

Validates that the Excel cell range is logical.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class ReceivedExcelTable(ReceivedTableBase):
    """Defines settings for reading an Excel file."""
    sheet_name: Optional[str] = None
    start_row: int = 0
    start_column: int = 0
    end_row: int = 0
    end_column: int = 0
    has_headers: bool = True
    type_inference: bool = False

    def validate_range_values(self):
        """Validates that the Excel cell range is logical."""
        for attribute in [self.start_row, self.start_column, self.end_row, self.end_column]:
            if not isinstance(attribute, int) or attribute < 0:
                raise ValueError("Row and column indices must be non-negative integers")
        if (self.end_row > 0 and self.start_row > self.end_row) or \
           (self.end_column > 0 and self.start_column > self.end_column):
            raise ValueError("Start row/column must not be greater than end row/column")
validate_range_values()

Validates that the Excel cell range is logical.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
def validate_range_values(self):
    """Validates that the Excel cell range is logical."""
    for attribute in [self.start_row, self.start_column, self.end_row, self.end_column]:
        if not isinstance(attribute, int) or attribute < 0:
            raise ValueError("Row and column indices must be non-negative integers")
    if (self.end_row > 0 and self.start_row > self.end_row) or \
       (self.end_column > 0 and self.start_column > self.end_column):
        raise ValueError("Start row/column must not be greater than end row/column")
ReceivedJsonTable

Bases: ReceivedCsvTable

Defines settings for reading a JSON file (inherits from CSV settings).

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class ReceivedJsonTable(ReceivedCsvTable):
    """Defines settings for reading a JSON file (inherits from CSV settings)."""
    pass
ReceivedParquetTable

Bases: ReceivedTableBase

Defines settings for reading a Parquet file.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class ReceivedParquetTable(ReceivedTableBase):
    """Defines settings for reading a Parquet file."""
    file_type: str = 'parquet'
ReceivedTable

Bases: ReceivedExcelTable, ReceivedCsvTable, ReceivedParquetTable

A comprehensive model that can represent any type of received table.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class ReceivedTable(ReceivedExcelTable, ReceivedCsvTable, ReceivedParquetTable):
    """A comprehensive model that can represent any type of received table."""
    ...
ReceivedTableBase

Bases: BaseModel

Base model for defining a table received from an external source.

Methods:

Name Description
create_from_path

Creates an instance from a file path string.

populate_abs_file_path

Ensures the absolute file path is populated after validation.

set_absolute_filepath

Resolves the path to an absolute file path.

Attributes:

Name Type Description
file_path str

Constructs the full file path from the directory and name.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class ReceivedTableBase(BaseModel):
    """Base model for defining a table received from an external source."""
    id: Optional[int] = None
    name: Optional[str]
    path: str  # This can be an absolute or relative path
    directory: Optional[str] = None
    analysis_file_available: bool = False
    status: Optional[str] = None
    file_type: Optional[str] = None
    fields: List[MinimalFieldInfo] = Field(default_factory=list)
    abs_file_path: Optional[str] = None

    @classmethod
    def create_from_path(cls, path: str):
        """Creates an instance from a file path string."""
        filename = Path(path).name
        return cls(name=filename, path=path)

    @property
    def file_path(self) -> str:
        """Constructs the full file path from the directory and name."""
        if not self.name in self.path:
            return os.path.join(self.path, self.name)
        else:
            return self.path

    def set_absolute_filepath(self):
        """Resolves the path to an absolute file path."""
        base_path = Path(self.path).expanduser()
        if not base_path.is_absolute():
            base_path = Path.cwd() / base_path
        if self.name and self.name not in base_path.name:
            base_path = base_path / self.name
        self.abs_file_path = str(base_path.resolve())

    @model_validator(mode='after')
    def populate_abs_file_path(self):
        """Ensures the absolute file path is populated after validation."""
        if not self.abs_file_path:
            self.set_absolute_filepath()
        return self
file_path property

Constructs the full file path from the directory and name.

create_from_path(path) classmethod

Creates an instance from a file path string.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
@classmethod
def create_from_path(cls, path: str):
    """Creates an instance from a file path string."""
    filename = Path(path).name
    return cls(name=filename, path=path)
populate_abs_file_path()

Ensures the absolute file path is populated after validation.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
@model_validator(mode='after')
def populate_abs_file_path(self):
    """Ensures the absolute file path is populated after validation."""
    if not self.abs_file_path:
        self.set_absolute_filepath()
    return self
set_absolute_filepath()

Resolves the path to an absolute file path.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
def set_absolute_filepath(self):
    """Resolves the path to an absolute file path."""
    base_path = Path(self.path).expanduser()
    if not base_path.is_absolute():
        base_path = Path.cwd() / base_path
    if self.name and self.name not in base_path.name:
        base_path = base_path / self.name
    self.abs_file_path = str(base_path.resolve())
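
A small sketch (the CSV path is invented) of the typical construction path: create_from_path derives the name from the path, and the model validator resolves abs_file_path:

csv_table = ReceivedCsvTable.create_from_path('data/sales.csv')
# name is set to 'sales.csv'; abs_file_path is resolved against the current working directory.
print(csv_table.name, csv_table.abs_file_path)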
RemoveItem

Bases: BaseModel

Represents a single item to be removed from a directory or list.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class RemoveItem(BaseModel):
    """Represents a single item to be removed from a directory or list."""
    path: str
    id: int = -1
RemoveItemsInput

Bases: BaseModel

Defines a list of items to be removed.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class RemoveItemsInput(BaseModel):
    """Defines a list of items to be removed."""
    paths: List[RemoveItem]
    source_path: str
SampleUsers

Bases: ExternalSource

Settings for generating a sample dataset of users.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class SampleUsers(ExternalSource):
    """Settings for generating a sample dataset of users."""
    SAMPLE_USERS: bool
    class_name: str = "sample_users"
    size: int = 100

transform_schema

flowfile_core.schemas.transform_schema

Classes:

Name Description
AggColl

A data class that represents a single aggregation operation for a group by operation.

BasicFilter

Defines a simple, single-condition filter (e.g., 'column' 'equals' 'value').

CrossJoinInput

Defines the settings for a cross join operation, including column selections for both inputs.

FieldInput

Represents a single field with its name and data type, typically for defining an output column.

FilterInput

Defines the settings for a filter operation, supporting basic or advanced (expression-based) modes.

FullJoinKeyResponse

Holds the join key rename responses for both sides of a join.

FunctionInput

Defines a formula to be applied, including the output field information.

FuzzyMap

Extends JoinMap with settings for fuzzy string matching, such as the algorithm and similarity threshold.

FuzzyMatchInput

Extends JoinInput with settings specific to fuzzy matching, such as the matching algorithm and threshold.

GraphSolverInput

Defines settings for a graph-solving operation (e.g., finding connected components).

GroupByInput

A data class that represents the input for a group by operation.

JoinInput

Defines the settings for a standard SQL-style join, including keys, strategy, and selections.

JoinInputs

Extends SelectInputs with functionality specific to join operations, like handling join keys.

JoinKeyRename

Represents the renaming of a join key from its original to a temporary name.

JoinKeyRenameResponse

Contains a list of join key renames for one side of a join.

JoinMap

Defines a single mapping between a left and right column for a join key.

JoinSelectMixin

A mixin providing common methods for join-like operations that involve left and right inputs.

PivotInput

Defines the settings for a pivot (long-to-wide) operation.

PolarsCodeInput

A simple container for a string of user-provided Polars code to be executed.

RecordIdInput

Defines settings for adding a record ID (row number) column to the data.

SelectInput

Defines how a single column should be selected, renamed, or type-cast.

SelectInputs

A container for a list of SelectInput objects, providing helper methods for managing selections.

SortByInput

Defines a single sort condition on a column, including the direction.

TextToRowsInput

Defines settings for splitting a text column into multiple rows based on a delimiter.

UnionInput

Defines settings for a union (concatenation) operation.

UniqueInput

Defines settings for a uniqueness operation, specifying columns and which row to keep.

UnpivotInput

Defines settings for an unpivot (wide-to-long) operation.

Functions:

Name Description
construct_join_key_name

Creates a temporary, unique name for a join key column.

get_func_type_mapping

Infers the output data type of common aggregation functions.

string_concat

A simple wrapper to concatenate string columns in Polars.

AggColl dataclass

A data class that represents a single aggregation operation for a group by operation.

Attributes

old_name : str

The name of the column in the original DataFrame to be aggregated.

agg : Any

The aggregation function to use. This can be a string representing a built-in function or a custom function.

new_name : Optional[str]

The name of the resulting aggregated column in the output DataFrame. If not provided, it will default to the old_name appended with the aggregation function.

output_type : Optional[str]

The type of the output values of the aggregation. If not provided, it is inferred from the aggregation function using the get_func_type_mapping function.

Example

agg_col = AggColl(old_name='col1', agg='sum', new_name='sum_col1', output_type='float')

Methods:

Name Description
__init__

Initializes an aggregation column with its source, function, and new name.

Attributes:

Name Type Description
agg_func

Returns the corresponding Polars aggregation function from the agg string.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class AggColl:
    """
    A data class that represents a single aggregation operation for a group by operation.

    Attributes
    ----------
    old_name : str
        The name of the column in the original DataFrame to be aggregated.

    agg : Any
        The aggregation function to use. This can be a string representing a built-in function or a custom function.

    new_name : Optional[str]
        The name of the resulting aggregated column in the output DataFrame. If not provided, it will default to the
        old_name appended with the aggregation function.

    output_type : Optional[str]
        The type of the output values of the aggregation. If not provided, it is inferred from the aggregation function
        using the `get_func_type_mapping` function.

    Example
    --------
    agg_col = AggColl(
        old_name='col1',
        agg='sum',
        new_name='sum_col1',
        output_type='float'
    )
    """
    old_name: str
    agg: str
    new_name: Optional[str]
    output_type: Optional[str] = None

    def __init__(self, old_name: str, agg: str, new_name: str = None, output_type: str = None):
        """Initializes an aggregation column with its source, function, and new name."""
        self.old_name = str(old_name)
        if agg != 'groupby':
            self.new_name = new_name if new_name is not None else self.old_name + "_" + agg
        else:
            self.new_name = new_name if new_name is not None else self.old_name
        self.output_type = output_type if output_type is not None else get_func_type_mapping(agg)
        self.agg = agg

    @property
    def agg_func(self):
        """Returns the corresponding Polars aggregation function from the `agg` string."""
        if self.agg == 'groupby':
            return self.agg
        elif self.agg == 'concat':
            return string_concat
        else:
            return getattr(pl, self.agg) if isinstance(self.agg, str) else self.agg
agg_func property

Returns the corresponding Polars aggregation function from the agg string.

__init__(old_name, agg, new_name=None, output_type=None)

Initializes an aggregation column with its source, function, and new name.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def __init__(self, old_name: str, agg: str, new_name: str = None, output_type: str = None):
    """Initializes an aggregation column with its source, function, and new name."""
    self.old_name = str(old_name)
    if agg != 'groupby':
        self.new_name = new_name if new_name is not None else self.old_name + "_" + agg
    else:
        self.new_name = new_name if new_name is not None else self.old_name
    self.output_type = output_type if output_type is not None else get_func_type_mapping(agg)
    self.agg = agg
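
A quick illustration (column names invented) of the defaulting behaviour in __init__: a real aggregation gets the function name appended to its output column, while 'groupby' keeps the original name:

key_col = AggColl('region', 'groupby')   # new_name stays 'region'
sum_col = AggColl('sales', 'sum')        # new_name becomes 'sales_sum'
named = AggColl('sales', 'mean', new_name='avg_sales')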
BasicFilter dataclass

Defines a simple, single-condition filter (e.g., 'column' 'equals' 'value').

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class BasicFilter:
    """Defines a simple, single-condition filter (e.g., 'column' 'equals' 'value')."""
    field: str = ''
    filter_type: str = ''
    filter_value: str = ''
CrossJoinInput dataclass

Bases: JoinSelectMixin

Defines the settings for a cross join operation, including column selections for both inputs.

Methods:

Name Description
__init__

Initializes the CrossJoinInput with selections for left and right tables.

auto_rename

Automatically renames columns on the right side to prevent naming conflicts.

Attributes:

Name Type Description
overlapping_records

Finds column names that would conflict after the join.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class CrossJoinInput(JoinSelectMixin):
    """Defines the settings for a cross join operation, including column selections for both inputs."""
    left_select: SelectInputs = None
    right_select: SelectInputs = None

    def __init__(self, left_select: List[SelectInput] | List[str],
                 right_select: List[SelectInput] | List[str]):
        """Initializes the CrossJoinInput with selections for left and right tables."""
        self.left_select = self.parse_select(left_select)
        self.right_select = self.parse_select(right_select)

    @property
    def overlapping_records(self):
        """Finds column names that would conflict after the join."""
        return self.left_select.new_cols & self.right_select.new_cols

    def auto_rename(self):
        """Automatically renames columns on the right side to prevent naming conflicts."""
        overlapping_records = self.overlapping_records
        while len(overlapping_records) > 0:
            for right_col in self.right_select.renames:
                if right_col.new_name in overlapping_records:
                    right_col.new_name = 'right_' + right_col.new_name
            overlapping_records = self.overlapping_records
overlapping_records property

Finds column names that would conflict after the join.

__init__(left_select, right_select)

Initializes the CrossJoinInput with selections for left and right tables.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def __init__(self, left_select: List[SelectInput] | List[str],
             right_select: List[SelectInput] | List[str]):
    """Initializes the CrossJoinInput with selections for left and right tables."""
    self.left_select = self.parse_select(left_select)
    self.right_select = self.parse_select(right_select)
auto_rename()

Automatically renames columns on the right side to prevent naming conflicts.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def auto_rename(self):
    """Automatically renames columns on the right side to prevent naming conflicts."""
    overlapping_records = self.overlapping_records
    while len(overlapping_records) > 0:
        for right_col in self.right_select.renames:
            if right_col.new_name in overlapping_records:
                right_col.new_name = 'right_' + right_col.new_name
        overlapping_records = self.overlapping_records
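
A sketch, assuming plain column-name lists are accepted by parse_select (inherited from JoinSelectMixin, documented elsewhere): a name that appears on both sides shows up in overlapping_records, and auto_rename prefixes the right-hand copy:

cross = CrossJoinInput(left_select=['id', 'name'], right_select=['id', 'price'])
cross.auto_rename()  # the right-side 'id' becomes 'right_id' so output column names stay unique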
FieldInput dataclass

Represents a single field with its name and data type, typically for defining an output column.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class FieldInput:
    """Represents a single field with its name and data type, typically for defining an output column."""
    name: str
    data_type: Optional[str] = None

    def __init__(self, name: str, data_type: str = None):
        self.name = name
        self.data_type = data_type
FilterInput dataclass

Defines the settings for a filter operation, supporting basic or advanced (expression-based) modes.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class FilterInput:
    """Defines the settings for a filter operation, supporting basic or advanced (expression-based) modes."""
    advanced_filter: str = ''
    basic_filter: BasicFilter = None
    filter_type: str = 'basic'
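
Two hedged examples of the supported modes (the column names and the expression syntax are illustrative, not prescribed by this schema):

basic = FilterInput(
    basic_filter=BasicFilter(field='country', filter_type='equals', filter_value='NL'),
    filter_type='basic',
)
advanced = FilterInput(advanced_filter='[sales] > 100', filter_type='advanced')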
FullJoinKeyResponse

Bases: NamedTuple

Holds the join key rename responses for both sides of a join.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class FullJoinKeyResponse(NamedTuple):
    """Holds the join key rename responses for both sides of a join."""
    left: JoinKeyRenameResponse
    right: JoinKeyRenameResponse
FunctionInput dataclass

Defines a formula to be applied, including the output field information.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class FunctionInput:
    """Defines a formula to be applied, including the output field information."""
    field: FieldInput
    function: str
FuzzyMap dataclass

Bases: JoinMap

Extends JoinMap with settings for fuzzy string matching, such as the algorithm and similarity threshold.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class FuzzyMap(JoinMap):
    """Extends `JoinMap` with settings for fuzzy string matching, such as the algorithm and similarity threshold."""
    threshold_score: Optional[float] = 80.0
    fuzzy_type: Optional[FuzzyTypeLiteral] = 'levenshtein'
    perc_unique: Optional[float] = 0.0
    output_column_name: Optional[str] = None
    valid: Optional[bool] = True

    def __init__(self, left_col: str, right_col: str = None, threshold_score: float = 80.0,
                 fuzzy_type: FuzzyTypeLiteral = 'levenshtein', perc_unique: float = 0, output_column_name: str = None,
                 _output_col_name: str = None, valid: bool = True, output_col_name: str = None):
        if right_col is None:
            right_col = left_col
        self.valid = valid
        self.left_col = left_col
        self.right_col = right_col
        self.threshold_score = threshold_score
        self.fuzzy_type = fuzzy_type
        self.perc_unique = perc_unique
        self.output_column_name = output_column_name if output_column_name is not None else _output_col_name
        self.output_column_name = self.output_column_name if self.output_column_name is not None else output_col_name
        if self.output_column_name is None:
            self.output_column_name = f'fuzzy_score_{self.left_col}_{self.right_col}'
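
A brief sketch (column name invented) of the defaulting in __init__: omitting right_col reuses the left column, and the score column name is derived from both:

fm = FuzzyMap('company_name', threshold_score=85.0, fuzzy_type='levenshtein')
# right_col defaults to 'company_name'; output_column_name becomes
# 'fuzzy_score_company_name_company_name'.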
FuzzyMatchInput dataclass

Bases: JoinInput

Extends JoinInput with settings specific to fuzzy matching, such as the matching algorithm and threshold.

Attributes:

Name Type Description
fuzzy_maps List[FuzzyMap]

Returns the final fuzzy mappings after applying all column renames.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class FuzzyMatchInput(JoinInput):
    """Extends `JoinInput` with settings specific to fuzzy matching, such as the matching algorithm and threshold."""
    join_mapping: List[FuzzyMap]
    aggregate_output: bool = False

    @staticmethod
    def parse_fuzz_mapping(fuzz_mapping: List[FuzzyMap] | Tuple[str, str] | str) -> List[FuzzyMap]:
        if isinstance(fuzz_mapping, (tuple, list)):
            assert len(fuzz_mapping) > 0
            if all(isinstance(fm, dict) for fm in fuzz_mapping):
                fuzz_mapping = [FuzzyMap(**fm) for fm in fuzz_mapping]

            if not isinstance(fuzz_mapping[0], FuzzyMap):
                assert len(fuzz_mapping) <= 2
                if len(fuzz_mapping) == 2:
                    assert isinstance(fuzz_mapping[0], str) and isinstance(fuzz_mapping[1], str)
                    fuzz_mapping = [FuzzyMap(*fuzz_mapping)]
                elif isinstance(fuzz_mapping[0], str):
                    fuzz_mapping = [FuzzyMap(fuzz_mapping[0], fuzz_mapping[0])]
        elif isinstance(fuzz_mapping, str):
            fuzz_mapping = [FuzzyMap(fuzz_mapping, fuzz_mapping)]
        elif isinstance(fuzz_mapping, FuzzyMap):
            fuzz_mapping = [fuzz_mapping]
        else:
            raise Exception('No valid join mapping as input')
        return fuzz_mapping

    def __init__(self, join_mapping: List[FuzzyMap] | Tuple[str, str] | str, left_select: List[SelectInput] | List[str],
                 right_select: List[SelectInput] | List[str], aggregate_output: bool = False, how: JoinStrategy = 'inner'):
        self.join_mapping = self.parse_fuzz_mapping(join_mapping)
        self.left_select = self.parse_select(left_select)
        self.right_select = self.parse_select(right_select)
        self.how = how
        for jm in self.join_mapping:

            if jm.right_col not in self.right_select.old_cols:
                self.right_select.append(SelectInput(jm.right_col, keep=False, join_key=True))
            if jm.left_col not in self.left_select.old_cols:
                self.left_select.append(SelectInput(jm.left_col, keep=False, join_key=True))
        [setattr(v, "join_key", v.old_name in self._left_join_keys) for v in self.left_select.renames]
        [setattr(v, "join_key", v.old_name in self._right_join_keys) for v in self.right_select.renames]
        self.aggregate_output = aggregate_output

    @property
    def overlapping_records(self):
        return self.left_select.new_cols & self.right_select.new_cols

    @property
    def fuzzy_maps(self) -> List[FuzzyMap]:
        """Returns the final fuzzy mappings after applying all column renames."""
        new_mappings = []
        left_rename_table, right_rename_table = self.left_select.rename_table, self.right_select.rename_table
        for org_fuzzy_map in self.join_mapping:
            right_col = right_rename_table.get(org_fuzzy_map.right_col)
            left_col = left_rename_table.get(org_fuzzy_map.left_col)
            if right_col != org_fuzzy_map.right_col or left_col != org_fuzzy_map.left_col:
                new_mapping = deepcopy(org_fuzzy_map)
                new_mapping.left_col = left_col
                new_mapping.right_col = right_col
                new_mappings.append(new_mapping)
            else:
                new_mappings.append(org_fuzzy_map)
        return new_mappings
fuzzy_maps property

Returns the final fuzzy mappings after applying all column renames.

GraphSolverInput dataclass

Defines settings for a graph-solving operation (e.g., finding connected components).

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class GraphSolverInput:
    """Defines settings for a graph-solving operation (e.g., finding connected components)."""
    col_from: str
    col_to: str
    output_column_name: Optional[str] = 'graph_group'
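
A minimal example (column names invented) of wiring an edge list into the solver settings:

edges = GraphSolverInput(col_from='parent', col_to='child')
# Each connected component is labelled in the default 'graph_group' output column.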
GroupByInput dataclass

A data class that represents the input for a group by operation.

Attributes

group_columns : List[str]

A list of column names to group the DataFrame by. These column(s) will be set as the DataFrame index.

agg_cols : List[AggColl]

A list of AggColl objects that specify the aggregation operations to perform on the DataFrame columns after grouping. Each AggColl object should specify the column to be aggregated and the aggregation function to use.

Example

group_by_input = GroupByInput(
    agg_cols=[
        AggColl(old_name='ix', agg='groupby'),
        AggColl(old_name='groups', agg='groupby'),
        AggColl(old_name='col1', agg='sum'),
        AggColl(old_name='col2', agg='mean'),
    ]
)

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class GroupByInput:
    """
    A data class that represents the input for a group by operation.

    Attributes
    ----------
    group_columns : List[str]
        A list of column names to group the DataFrame by. These column(s) will be set as the DataFrame index.

    agg_cols : List[AggColl]
        A list of `AggColl` objects that specify the aggregation operations to perform on the DataFrame columns
        after grouping. Each `AggColl` object should specify the column to be aggregated and the aggregation
        function to use.

    Example
    --------
    group_by_input = GroupByInput(
        agg_cols=[AggColl(old_name='ix', agg='groupby'), AggColl(old_name='groups', agg='groupby'), AggColl(old_name='col1', agg='sum'), AggColl(old_name='col2', agg='mean')]
    )
    """
    agg_cols: List[AggColl]
JoinInput dataclass

Bases: JoinSelectMixin

Defines the settings for a standard SQL-style join, including keys, strategy, and selections.

Methods:

Name Description
__init__

Initializes the JoinInput with keys, selections, and join strategy.

auto_rename

Automatically renames columns on the right side to prevent naming conflicts.

get_join_key_renames

Gets the temporary rename mappings for the join keys on both sides.

parse_join_mapping

Parses various input formats for join keys into a standardized list of JoinMap objects.

set_join_keys

Marks the SelectInput objects corresponding to join keys.

Attributes:

Name Type Description
left_join_keys List[str]

Returns an ordered list of the left-side join key column names to be used in the join.

right_join_keys List[str]

Returns an ordered list of the right-side join key column names to be used in the join.

used_join_mapping List[JoinMap]

Returns the final join mapping after applying all renames and transformations.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class JoinInput(JoinSelectMixin):
    """Defines the settings for a standard SQL-style join, including keys, strategy, and selections."""
    join_mapping: List[JoinMap]
    left_select: JoinInputs = None
    right_select: JoinInputs = None
    how: JoinStrategy = 'inner'

    @staticmethod
    def parse_join_mapping(join_mapping: any) -> List[JoinMap]:
        """Parses various input formats for join keys into a standardized list of `JoinMap` objects."""
        if isinstance(join_mapping, (tuple, list)):
            assert len(join_mapping) > 0
            if all(isinstance(jm, dict) for jm in join_mapping):
                join_mapping = [JoinMap(**jm) for jm in join_mapping]

            if not isinstance(join_mapping[0], JoinMap):
                assert len(join_mapping) <= 2
                if len(join_mapping) == 2:
                    assert isinstance(join_mapping[0], str) and isinstance(join_mapping[1], str)
                    join_mapping = [JoinMap(*join_mapping)]
                elif isinstance(join_mapping[0], str):
                    join_mapping = [JoinMap(join_mapping[0], join_mapping[0])]
        elif isinstance(join_mapping, str):
            join_mapping = [JoinMap(join_mapping, join_mapping)]
        else:
            raise Exception('No valid join mapping as input')
        return join_mapping

    def __init__(self, join_mapping: List[JoinMap] | Tuple[str, str] | str,
                 left_select: List[SelectInput] | List[str],
                 right_select: List[SelectInput] | List[str],
                 how: JoinStrategy = 'inner'):
        """Initializes the JoinInput with keys, selections, and join strategy."""
        self.join_mapping = self.parse_join_mapping(join_mapping)
        self.left_select = self.parse_select(left_select)
        self.right_select = self.parse_select(right_select)
        self.set_join_keys()
        self.how = how

    def set_join_keys(self):
        """Marks the `SelectInput` objects corresponding to join keys."""
        [setattr(v, "join_key", v.old_name in self._left_join_keys) for v in self.left_select.renames]
        [setattr(v, "join_key", v.old_name in self._right_join_keys) for v in self.right_select.renames]

    def get_join_key_renames(self, filter_drop: bool = False) -> FullJoinKeyResponse:
        """Gets the temporary rename mappings for the join keys on both sides."""
        return FullJoinKeyResponse(self.left_select.get_join_key_renames(side="left", filter_drop=filter_drop),
                                   self.right_select.get_join_key_renames(side="right", filter_drop=filter_drop))

    def get_names_for_table_rename(self) -> List[JoinMap]:
        new_mappings: List[JoinMap] = []
        left_rename_table, right_rename_table = self.left_select.rename_table, self.right_select.rename_table
        for join_map in self.join_mapping:
            new_mappings.append(JoinMap(left_rename_table.get(join_map.left_col, join_map.left_col),
                                        right_rename_table.get(join_map.right_col, join_map.right_col)
                                        )
                                )
        return new_mappings

    @property
    def _left_join_keys(self) -> Set:
        """Returns a set of the left-side join key column names."""
        return set(jm.left_col for jm in self.join_mapping)

    @property
    def _right_join_keys(self) -> Set:
        """Returns a set of the right-side join key column names."""
        return set(jm.right_col for jm in self.join_mapping)

    @property
    def left_join_keys(self) -> List[str]:
        """Returns an ordered list of the left-side join key column names to be used in the join."""
        return [jm.left_col for jm in self.used_join_mapping]

    @property
    def right_join_keys(self) -> List[str]:
        """Returns an ordered list of the right-side join key column names to be used in the join."""
        return [jm.right_col for jm in self.used_join_mapping]

    @property
    def overlapping_records(self):
        if self.how in ('left', 'right', 'inner'):
            return self.left_select.new_cols & self.right_select.new_cols
        else:
            return self.left_select.new_cols & self.right_select.new_cols

    def auto_rename(self):
        """Automatically renames columns on the right side to prevent naming conflicts."""
        self.set_join_keys()
        overlapping_records = self.overlapping_records
        while len(overlapping_records) > 0:
            for right_col in self.right_select.renames:
                if right_col.new_name in overlapping_records:
                    right_col.new_name = right_col.new_name + '_right'
            overlapping_records = self.overlapping_records

    @property
    def used_join_mapping(self) -> List[JoinMap]:
        """Returns the final join mapping after applying all renames and transformations."""
        new_mappings: List[JoinMap] = []
        left_rename_table, right_rename_table = self.left_select.rename_table, self.right_select.rename_table
        left_join_rename_mapping: Dict[str, str] = self.left_select.get_join_key_rename_mapping("left")
        right_join_rename_mapping: Dict[str, str] = self.right_select.get_join_key_rename_mapping("right")
        for join_map in self.join_mapping:
            # del self.right_select.rename_table, self.left_select.rename_table
            new_mappings.append(JoinMap(left_join_rename_mapping.get(left_rename_table.get(join_map.left_col, join_map.left_col)),
                                        right_join_rename_mapping.get(right_rename_table.get(join_map.right_col, join_map.right_col))
                                        )
                                )
        return new_mappings
left_join_keys property

Returns an ordered list of the left-side join key column names to be used in the join.

right_join_keys property

Returns an ordered list of the right-side join key column names to be used in the join.

used_join_mapping property

Returns the final join mapping after applying all renames and transformations.

__init__(join_mapping, left_select, right_select, how='inner')

Initializes the JoinInput with keys, selections, and join strategy.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def __init__(self, join_mapping: List[JoinMap] | Tuple[str, str] | str,
             left_select: List[SelectInput] | List[str],
             right_select: List[SelectInput] | List[str],
             how: JoinStrategy = 'inner'):
    """Initializes the JoinInput with keys, selections, and join strategy."""
    self.join_mapping = self.parse_join_mapping(join_mapping)
    self.left_select = self.parse_select(left_select)
    self.right_select = self.parse_select(right_select)
    self.set_join_keys()
    self.how = how
auto_rename()

Automatically renames columns on the right side to prevent naming conflicts.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def auto_rename(self):
    """Automatically renames columns on the right side to prevent naming conflicts."""
    self.set_join_keys()
    overlapping_records = self.overlapping_records
    while len(overlapping_records) > 0:
        for right_col in self.right_select.renames:
            if right_col.new_name in overlapping_records:
                right_col.new_name = right_col.new_name + '_right'
        overlapping_records = self.overlapping_records
get_join_key_renames(filter_drop=False)

Gets the temporary rename mappings for the join keys on both sides.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_join_key_renames(self, filter_drop: bool = False) -> FullJoinKeyResponse:
    """Gets the temporary rename mappings for the join keys on both sides."""
    return FullJoinKeyResponse(self.left_select.get_join_key_renames(side="left", filter_drop=filter_drop),
                               self.right_select.get_join_key_renames(side="right", filter_drop=filter_drop))
parse_join_mapping(join_mapping) staticmethod

Parses various input formats for join keys into a standardized list of JoinMap objects.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@staticmethod
def parse_join_mapping(join_mapping: any) -> List[JoinMap]:
    """Parses various input formats for join keys into a standardized list of `JoinMap` objects."""
    if isinstance(join_mapping, (tuple, list)):
        assert len(join_mapping) > 0
        if all(isinstance(jm, dict) for jm in join_mapping):
            join_mapping = [JoinMap(**jm) for jm in join_mapping]

        if not isinstance(join_mapping[0], JoinMap):
            assert len(join_mapping) <= 2
            if len(join_mapping) == 2:
                assert isinstance(join_mapping[0], str) and isinstance(join_mapping[1], str)
                join_mapping = [JoinMap(*join_mapping)]
            elif isinstance(join_mapping[0], str):
                join_mapping = [JoinMap(join_mapping[0], join_mapping[0])]
    elif isinstance(join_mapping, str):
        join_mapping = [JoinMap(join_mapping, join_mapping)]
    else:
        raise Exception('No valid join mapping as input')
    return join_mapping
set_join_keys()

Marks the SelectInput objects corresponding to join keys.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def set_join_keys(self):
    """Marks the `SelectInput` objects corresponding to join keys."""
    [setattr(v, "join_key", v.old_name in self._left_join_keys) for v in self.left_select.renames]
    [setattr(v, "join_key", v.old_name in self._right_join_keys) for v in self.right_select.renames]
JoinInputs

Bases: SelectInputs

Extends SelectInputs with functionality specific to join operations, like handling join keys.

Methods:

Name Description
get_join_key_rename_mapping

Returns a dictionary mapping original join key names to their temporary names.

get_join_key_renames

Gets the temporary rename mapping for all join keys on one side of a join.

Attributes:

Name Type Description
join_key_selects List[SelectInput]

Returns only the SelectInput objects that are marked as join keys.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class JoinInputs(SelectInputs):
    """Extends `SelectInputs` with functionality specific to join operations, like handling join keys."""

    def __init__(self, renames: List[SelectInput]):
        self.renames = renames

    @property
    def join_key_selects(self) -> List[SelectInput]:
        """Returns only the `SelectInput` objects that are marked as join keys."""
        return [v for v in self.renames if v.join_key]

    def get_join_key_renames(self, side: SideLit, filter_drop: bool = False) -> JoinKeyRenameResponse:
        """Gets the temporary rename mapping for all join keys on one side of a join."""
        return JoinKeyRenameResponse(
            side,
            [JoinKeyRename(jk.new_name,
                           construct_join_key_name(side, jk.new_name))
             for jk in self.join_key_selects if jk.keep or not filter_drop]
        )

    def get_join_key_rename_mapping(self, side: SideLit) -> Dict[str, str]:
        """Returns a dictionary mapping original join key names to their temporary names."""
        return {jkr[0]: jkr[1] for jkr in self.get_join_key_renames(side)[1]}
join_key_selects property

Returns only the SelectInput objects that are marked as join keys.

get_join_key_rename_mapping(side)

Returns a dictionary mapping original join key names to their temporary names.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_join_key_rename_mapping(self, side: SideLit) -> Dict[str, str]:
    """Returns a dictionary mapping original join key names to their temporary names."""
    return {jkr[0]: jkr[1] for jkr in self.get_join_key_renames(side)[1]}
get_join_key_renames(side, filter_drop=False)

Gets the temporary rename mapping for all join keys on one side of a join.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_join_key_renames(self, side: SideLit, filter_drop: bool = False) -> JoinKeyRenameResponse:
    """Gets the temporary rename mapping for all join keys on one side of a join."""
    return JoinKeyRenameResponse(
        side,
        [JoinKeyRename(jk.new_name,
                       construct_join_key_name(side, jk.new_name))
         for jk in self.join_key_selects if jk.keep or not filter_drop]
    )
JoinKeyRename

Bases: NamedTuple

Represents the renaming of a join key from its original to a temporary name.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class JoinKeyRename(NamedTuple):
    """Represents the renaming of a join key from its original to a temporary name."""
    original_name: str
    temp_name: str
JoinKeyRenameResponse

Bases: NamedTuple

Contains a list of join key renames for one side of a join.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class JoinKeyRenameResponse(NamedTuple):
    """Contains a list of join key renames for one side of a join."""
    side: SideLit
    join_key_renames: List[JoinKeyRename]
JoinMap dataclass

Defines a single mapping between a left and right column for a join key.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class JoinMap:
    """Defines a single mapping between a left and right column for a join key."""
    left_col: str
    right_col: str
JoinSelectMixin

A mixin providing common methods for join-like operations that involve left and right inputs.

Methods:

Name Description
add_new_select_column

Adds a new column to the selection for either the left or right side.

auto_generate_new_col_name

Generates a new, non-conflicting column name by adding a suffix if necessary.

parse_select

Parses various input formats into a standardized JoinInputs object.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class JoinSelectMixin:
    """A mixin providing common methods for join-like operations that involve left and right inputs."""
    left_select: JoinInputs = None
    right_select: JoinInputs = None

    @staticmethod
    def parse_select(select: List[SelectInput] | List[str] | List[Dict]) -> JoinInputs | None:
        """Parses various input formats into a standardized `JoinInputs` object."""
        if all(isinstance(c, SelectInput) for c in select):
            return JoinInputs(select)
        elif all(isinstance(c, dict) for c in select):
            return JoinInputs([SelectInput(**c.__dict__) for c in select])
        elif isinstance(select, dict):
            renames = select.get('renames')
            if renames:
                return JoinInputs([SelectInput(**c) for c in renames])
        elif all(isinstance(c, str) for c in select):
            return JoinInputs([SelectInput(s, s) for s in select])

    def auto_generate_new_col_name(self, old_col_name: str, side: str) -> str:
        """Generates a new, non-conflicting column name by adding a suffix if necessary."""
        current_names = self.left_select.new_cols & self.right_select.new_cols
        if old_col_name not in current_names:
            return old_col_name
        while True:
            if old_col_name not in current_names:
                return old_col_name
            old_col_name = f'{side}_{old_col_name}'

    def add_new_select_column(self, select_input: SelectInput, side: str):
        """Adds a new column to the selection for either the left or right side."""
        selects = self.right_select if side == 'right' else self.left_select
        select_input.new_name = self.auto_generate_new_col_name(select_input.old_name, side=side)
        selects.__add__(select_input)
add_new_select_column(select_input, side)

Adds a new column to the selection for either the left or right side.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def add_new_select_column(self, select_input: SelectInput, side: str):
    """Adds a new column to the selection for either the left or right side."""
    selects = self.right_select if side == 'right' else self.left_select
    select_input.new_name = self.auto_generate_new_col_name(select_input.old_name, side=side)
    selects.__add__(select_input)
auto_generate_new_col_name(old_col_name, side)

Generates a new, non-conflicting column name by adding a suffix if necessary.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def auto_generate_new_col_name(self, old_col_name: str, side: str) -> str:
    """Generates a new, non-conflicting column name by adding a suffix if necessary."""
    current_names = self.left_select.new_cols & self.right_select.new_cols
    if old_col_name not in current_names:
        return old_col_name
    while True:
        if old_col_name not in current_names:
            return old_col_name
        old_col_name = f'{side}_{old_col_name}'
parse_select(select) staticmethod

Parses various input formats into a standardized JoinInputs object.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@staticmethod
def parse_select(select: List[SelectInput] | List[str] | List[Dict]) -> JoinInputs | None:
    """Parses various input formats into a standardized `JoinInputs` object."""
    if all(isinstance(c, SelectInput) for c in select):
        return JoinInputs(select)
    elif all(isinstance(c, dict) for c in select):
        return JoinInputs([SelectInput(**c.__dict__) for c in select])
    elif isinstance(select, dict):
        renames = select.get('renames')
        if renames:
            return JoinInputs([SelectInput(**c) for c in renames])
    elif all(isinstance(c, str) for c in select):
        return JoinInputs([SelectInput(s, s) for s in select])
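
A rough illustration of the parsing shown above; parse_select accepts plain column names or ready-made SelectInput objects (names are illustrative):

selects = JoinInput.parse_select(["id", "name"])
selects.new_cols          # {"id", "name"}

selects = JoinInput.parse_select([SelectInput("id", "order_id"), SelectInput("name")])
selects.rename_table      # {"id": "order_id", "name": "name"}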
PivotInput dataclass

Defines the settings for a pivot (long-to-wide) operation.

Methods:

Name Description
get_group_by_input

Constructs the GroupByInput needed for the pre-aggregation step of the pivot.

get_pivot_column

Returns the pivot column as a Polars column expression.

get_values_expr

Creates the struct expression used to gather the values for pivoting.

Attributes:

Name Type Description
grouped_columns List[str]

Returns the list of columns to be used for the initial grouping stage of the pivot.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class PivotInput:
    """Defines the settings for a pivot (long-to-wide) operation."""
    index_columns: List[str]
    pivot_column: str
    value_col: str
    aggregations: List[str]

    @property
    def grouped_columns(self) -> List[str]:
        """Returns the list of columns to be used for the initial grouping stage of the pivot."""
        return self.index_columns + [self.pivot_column]

    def get_group_by_input(self) -> GroupByInput:
        """Constructs the `GroupByInput` needed for the pre-aggregation step of the pivot."""
        group_by_cols = [AggColl(c, 'groupby') for c in self.grouped_columns]
        agg_cols = [AggColl(self.value_col, agg=aggregation, new_name=aggregation) for aggregation in self.aggregations]
        return GroupByInput(group_by_cols+agg_cols)

    def get_index_columns(self) -> List[pl.col]:
        return [pl.col(c) for c in self.index_columns]

    def get_pivot_column(self) -> pl.Expr:
        """Returns the pivot column as a Polars column expression."""
        return pl.col(self.pivot_column)

    def get_values_expr(self) -> pl.Expr:
        """Creates the struct expression used to gather the values for pivoting."""
        return pl.struct([pl.col(c) for c in self.aggregations]).alias('vals')
grouped_columns property

Returns the list of columns to be used for the initial grouping stage of the pivot.

get_group_by_input()

Constructs the GroupByInput needed for the pre-aggregation step of the pivot.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_group_by_input(self) -> GroupByInput:
    """Constructs the `GroupByInput` needed for the pre-aggregation step of the pivot."""
    group_by_cols = [AggColl(c, 'groupby') for c in self.grouped_columns]
    agg_cols = [AggColl(self.value_col, agg=aggregation, new_name=aggregation) for aggregation in self.aggregations]
    return GroupByInput(group_by_cols+agg_cols)
get_pivot_column()

Returns the pivot column as a Polars column expression.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_pivot_column(self) -> pl.Expr:
    """Returns the pivot column as a Polars column expression."""
    return pl.col(self.pivot_column)
get_values_expr()

Creates the struct expression used to gather the values for pivoting.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_values_expr(self) -> pl.Expr:
    """Creates the struct expression used to gather the values for pivoting."""
    return pl.struct([pl.col(c) for c in self.aggregations]).alias('vals')
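
A small sketch of how PivotInput feeds the pre-aggregation step; the column names and aggregations are illustrative:

pivot = PivotInput(
    index_columns=["region"],
    pivot_column="year",
    value_col="revenue",
    aggregations=["sum", "mean"],
)
pivot.grouped_columns          # ["region", "year"]
pivot.get_group_by_input()     # GroupByInput grouping by region/year, aggregating revenue as sum and mean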
PolarsCodeInput dataclass

A simple container for a string of user-provided Polars code to be executed.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class PolarsCodeInput:
    """A simple container for a string of user-provided Polars code to be executed."""
    polars_code: str
RecordIdInput dataclass

Defines settings for adding a record ID (row number) column to the data.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class RecordIdInput:
    """Defines settings for adding a record ID (row number) column to the data."""
    output_column_name: str = 'record_id'
    offset: int = 1
    group_by: Optional[bool] = False
    group_by_columns: Optional[List[str]] = field(default_factory=list)
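
For example, a record ID that restarts numbering per customer might be configured as follows (illustrative values):

record_id_settings = RecordIdInput(
    output_column_name="row_nr",
    offset=1,
    group_by=True,
    group_by_columns=["customer_id"],
)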
SelectInput dataclass

Defines how a single column should be selected, renamed, or type-cast.

This is a core building block for any operation that involves column manipulation. It holds all the configuration for a single field in a selection operation.

Attributes:

Name Type Description
polars_type str

Translates a user-friendly type name to a Polars data type string.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class SelectInput:
    """Defines how a single column should be selected, renamed, or type-cast.

    This is a core building block for any operation that involves column manipulation.
    It holds all the configuration for a single field in a selection operation.
    """
    old_name: str
    original_position: Optional[int] = None
    new_name: Optional[str] = None
    data_type: Optional[str] = None
    data_type_change: Optional[bool] = False
    join_key: Optional[bool] = False
    is_altered: Optional[bool] = False
    position: Optional[int] = None
    is_available: Optional[bool] = True
    keep: Optional[bool] = True

    def __hash__(self):
        return hash(self.old_name)

    def __init__(self, old_name: str, new_name: str = None, keep: bool = True, data_type: str = None,
                 data_type_change: bool = False, join_key: bool = False, is_altered: bool = False,
                 is_available: bool = True, position: int = None):
        self.old_name = old_name
        if new_name is None:
            new_name = old_name
        self.new_name = new_name
        self.keep = keep
        self.data_type = data_type
        self.data_type_change = data_type_change
        self.join_key = join_key
        self.is_altered = is_altered
        self.is_available = is_available
        self.position = position

    @property
    def polars_type(self) -> str:
        """Translates a user-friendly type name to a Polars data type string."""
        if self.data_type.lower() == 'string':
            return 'Utf8'
        elif self.data_type.lower() == 'integer':
            return 'Int64'
        elif self.data_type.lower() == 'double':
            return 'Float64'
        return self.data_type
polars_type property

Translates a user-friendly type name to a Polars data type string.
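
A brief sketch of SelectInput and the polars_type translation shown above; the column names are illustrative:

col = SelectInput("customer_name", new_name="name", data_type="string", data_type_change=True)
col.polars_type          # "Utf8"
col_int = SelectInput("age", data_type="integer")
col_int.polars_type      # "Int64"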

SelectInputs dataclass

A container for a list of SelectInput objects, providing helper methods for managing selections.

Methods:

Name Description
__add__

Allows adding a SelectInput using the '+' operator.

append

Appends a new SelectInput to the list of renames.

create_from_list

Creates a SelectInputs object from a simple list of column names.

create_from_pl_df

Creates a SelectInputs object from a Polars DataFrame's columns.

get_select_cols

Gets a list of original column names to select from the source DataFrame.

remove_select_input

Removes a SelectInput from the list based on its original name.

unselect_field

Marks a field to be dropped from the final selection by setting keep to False.

Attributes:

Name Type Description
new_cols Set

Returns a set of new (renamed) column names to be kept in the selection.

old_cols Set

Returns a set of original column names to be kept in the selection.

rename_table

Generates a dictionary for use in Polars' .rename() method.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class SelectInputs:
    """A container for a list of `SelectInput` objects, providing helper methods for managing selections."""
    renames: List[SelectInput]

    @property
    def old_cols(self) -> Set:
        """Returns a set of original column names to be kept in the selection."""
        return set(v.old_name for v in self.renames if v.keep)

    @property
    def new_cols(self) -> Set:
        """Returns a set of new (renamed) column names to be kept in the selection."""
        return set(v.new_name for v in self.renames if v.keep)

    @property
    def rename_table(self):
        """Generates a dictionary for use in Polars' `.rename()` method."""
        return {v.old_name: v.new_name for v in self.renames if v.is_available and (v.keep or v.join_key)}

    def get_select_cols(self, include_join_key: bool = True):
        """Gets a list of original column names to select from the source DataFrame."""
        return [v.old_name for v in self.renames if v.keep or (v.join_key and include_join_key)]

    def __add__(self, other: "SelectInput"):
        """Allows adding a SelectInput using the '+' operator."""
        self.renames.append(other)

    def append(self, other: "SelectInput"):
        """Appends a new SelectInput to the list of renames."""
        self.renames.append(other)

    def remove_select_input(self, old_key: str):
        """Removes a SelectInput from the list based on its original name."""
        self.renames = [rename for rename in self.renames if rename.old_name != old_key]

    def unselect_field(self, old_key: str):
        """Marks a field to be dropped from the final selection by setting `keep` to False."""
        for rename in self.renames:
            if old_key == rename.old_name:
                rename.keep = False

    @classmethod
    def create_from_list(cls, col_list: List[str]):
        """Creates a SelectInputs object from a simple list of column names."""
        return cls([SelectInput(c) for c in col_list])

    @classmethod
    def create_from_pl_df(cls, df: pl.DataFrame | pl.LazyFrame):
        """Creates a SelectInputs object from a Polars DataFrame's columns."""
        return cls([SelectInput(c) for c in df.columns])

    def get_select_input_on_old_name(self, old_name: str) -> SelectInput | None:
        return next((v for v in self.renames if v.old_name == old_name), None)

    def get_select_input_on_new_name(self, old_name: str) -> SelectInput | None:
        return next((v for v in self.renames if v.new_name == old_name), None)
new_cols property

Returns a set of new (renamed) column names to be kept in the selection.

old_cols property

Returns a set of original column names to be kept in the selection.

rename_table property

Generates a dictionary for use in Polars' .rename() method.

__add__(other)

Allows adding a SelectInput using the '+' operator.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def __add__(self, other: "SelectInput"):
    """Allows adding a SelectInput using the '+' operator."""
    self.renames.append(other)
append(other)

Appends a new SelectInput to the list of renames.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def append(self, other: "SelectInput"):
    """Appends a new SelectInput to the list of renames."""
    self.renames.append(other)
create_from_list(col_list) classmethod

Creates a SelectInputs object from a simple list of column names.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@classmethod
def create_from_list(cls, col_list: List[str]):
    """Creates a SelectInputs object from a simple list of column names."""
    return cls([SelectInput(c) for c in col_list])
create_from_pl_df(df) classmethod

Creates a SelectInputs object from a Polars DataFrame's columns.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@classmethod
def create_from_pl_df(cls, df: pl.DataFrame | pl.LazyFrame):
    """Creates a SelectInputs object from a Polars DataFrame's columns."""
    return cls([SelectInput(c) for c in df.columns])
get_select_cols(include_join_key=True)

Gets a list of original column names to select from the source DataFrame.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_select_cols(self, include_join_key: bool = True):
    """Gets a list of original column names to select from the source DataFrame."""
    return [v.old_name for v in self.renames if v.keep or (v.join_key and include_join_key)]
remove_select_input(old_key)

Removes a SelectInput from the list based on its original name.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def remove_select_input(self, old_key: str):
    """Removes a SelectInput from the list based on its original name."""
    self.renames = [rename for rename in self.renames if rename.old_name != old_key]
unselect_field(old_key)

Marks a field to be dropped from the final selection by setting keep to False.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def unselect_field(self, old_key: str):
    """Marks a field to be dropped from the final selection by setting `keep` to False."""
    for rename in self.renames:
        if old_key == rename.old_name:
            rename.keep = False
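
A minimal sketch of working with SelectInputs, using illustrative column names:

selection = SelectInputs.create_from_list(["id", "name", "age"])
selection.unselect_field("age")     # keep the column definition but drop it from the output
selection.rename_table              # {"id": "id", "name": "name"}
selection.get_select_cols()         # ["id", "name"]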
SortByInput dataclass

Defines a single sort condition on a column, including the direction.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class SortByInput:
    """Defines a single sort condition on a column, including the direction."""
    column: str
    how: str = 'asc'
TextToRowsInput dataclass

Defines settings for splitting a text column into multiple rows based on a delimiter.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class TextToRowsInput:
    """Defines settings for splitting a text column into multiple rows based on a delimiter."""
    column_to_split: str
    output_column_name: Optional[str] = None
    split_by_fixed_value: Optional[bool] = True
    split_fixed_value: Optional[str] = ','
    split_by_column: Optional[str] = None
UnionInput dataclass

Defines settings for a union (concatenation) operation.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class UnionInput:
    """Defines settings for a union (concatenation) operation."""
    mode: Literal['selective', 'relaxed'] = 'relaxed'
UniqueInput dataclass

Defines settings for a uniqueness operation, specifying columns and which row to keep.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class UniqueInput:
    """Defines settings for a uniqueness operation, specifying columns and which row to keep."""
    columns: Optional[List[str]] = None
    strategy: Literal["first", "last", "any", "none"] = "any"
UnpivotInput dataclass

Defines settings for an unpivot (wide-to-long) operation.

Methods:

Name Description
__post_init__

Ensures that list attributes are initialized correctly if they are None.

Attributes:

Name Type Description
data_type_selector_expr Optional[Callable]

Returns a Polars selector function based on the data_type_selector string.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@dataclass
class UnpivotInput:
    """Defines settings for an unpivot (wide-to-long) operation."""
    index_columns: Optional[List[str]] = field(default_factory=list)
    value_columns: Optional[List[str]] = field(default_factory=list)
    data_type_selector: Optional[Literal['float', 'all', 'date', 'numeric', 'string']] = None
    data_type_selector_mode: Optional[Literal['data_type', 'column']] = 'column'

    def __post_init__(self):
        """Ensures that list attributes are initialized correctly if they are None."""
        if self.index_columns is None:
            self.index_columns = []
        if self.value_columns is None:
            self.value_columns = []
        if self.data_type_selector_mode is None:
            self.data_type_selector_mode = 'column'

    @property
    def data_type_selector_expr(self) -> Optional[Callable]:
        """Returns a Polars selector function based on the `data_type_selector` string."""
        if self.data_type_selector_mode == 'data_type':
            if self.data_type_selector is not None:
                try:
                    return getattr(selectors, self.data_type_selector)
                except Exception as e:
                    print(f'Could not find the selector: {self.data_type_selector}')
                    return selectors.all
            return selectors.all
data_type_selector_expr property

Returns a Polars selector function based on the data_type_selector string.

__post_init__()

Ensures that list attributes are initialized correctly if they are None.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def __post_init__(self):
    """Ensures that list attributes are initialized correctly if they are None."""
    if self.index_columns is None:
        self.index_columns = []
    if self.value_columns is None:
        self.value_columns = []
    if self.data_type_selector_mode is None:
        self.data_type_selector_mode = 'column'
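
As a rough sketch, value columns can be chosen by data type rather than by name; this relies on the data_type_selector_expr property above (the index column is illustrative):

unpivot = UnpivotInput(
    index_columns=["id"],
    data_type_selector="numeric",
    data_type_selector_mode="data_type",
)
unpivot.data_type_selector_expr     # the polars.selectors.numeric selector function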
construct_join_key_name(side, column_name)

Creates a temporary, unique name for a join key column.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def construct_join_key_name(side: SideLit, column_name: str) -> str:
    """Creates a temporary, unique name for a join key column."""
    return "_FLOWFILE_JOIN_KEY_" + side.upper() + "_" + column_name
get_func_type_mapping(func)

Infers the output data type of common aggregation functions.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_func_type_mapping(func: str):
    """Infers the output data type of common aggregation functions."""
    if func in ["mean", "avg", "median", "std", "var"]:
        return "Float64"
    elif func in ['min', 'max', 'first', 'last', "cumsum", "sum"]:
        return None
    elif func in ['count', 'n_unique']:
        return "Int64"
    elif func in ['concat']:
        return "Utf8"
string_concat(*column)

A simple wrapper to concatenate string columns in Polars.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def string_concat(*column: str):
    """A simple wrapper to concatenate string columns in Polars."""
    return pl.col(column).cast(pl.Utf8).str.concat(delimiter=',')

cloud_storage_schemas

flowfile_core.schemas.cloud_storage_schemas

Cloud storage connection schemas for S3, ADLS, and other cloud providers.

Classes:

Name Description
AuthSettingsInput

The information the user must provide to specify how to connect to the cloud provider.

CloudStorageReadSettings

Settings for reading from cloud storage

CloudStorageSettings

Settings for cloud storage nodes in the visual designer

CloudStorageWriteSettings

Settings for writing to cloud storage

CloudStorageWriteSettingsWorkerInterface

Settings for writing to cloud storage in worker context

FullCloudStorageConnection

Internal model with decrypted secrets

FullCloudStorageConnectionInterface

API response model - no secrets exposed

FullCloudStorageConnectionWorkerInterface

Internal model with decrypted secrets

WriteSettingsWorkerInterface

Settings for writing to cloud storage

Functions:

Name Description
encrypt_for_worker

Encrypts a secret value for use in worker contexts.

get_cloud_storage_write_settings_worker_interface

Convert to a worker interface model with hashed secrets.

AuthSettingsInput pydantic-model

Bases: BaseModel

The information the user must provide to specify how to connect to the cloud provider.

Show JSON schema:
{
  "description": "The information needed for the user to provide the details that are needed to provide how to connect to the\n Cloud provider",
  "properties": {
    "storage_type": {
      "enum": [
        "s3",
        "adls",
        "gcs"
      ],
      "title": "Storage Type",
      "type": "string"
    },
    "auth_method": {
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Method",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "None",
      "title": "Connection Name"
    }
  },
  "required": [
    "storage_type",
    "auth_method"
  ],
  "title": "AuthSettingsInput",
  "type": "object"
}

Fields:

  • storage_type (CloudStorageType)
  • auth_method (AuthMethod)
  • connection_name (Optional[str])
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class AuthSettingsInput(BaseModel):
    """
    The information needed for the user to provide the details that are needed to provide how to connect to the
     Cloud provider
    """
    storage_type: CloudStorageType
    auth_method: AuthMethod
    connection_name: Optional[str] = "None"  # This is the reference to the item we will fetch that contains the data
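
A minimal sketch, using an illustrative connection name:

auth_settings = AuthSettingsInput(
    storage_type="s3",
    auth_method="access_key",
    connection_name="my-s3-connection",
)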
CloudStorageReadSettings pydantic-model

Bases: CloudStorageSettings

Settings for reading from cloud storage

Show JSON schema:
{
  "description": "Settings for reading from cloud storage",
  "properties": {
    "auth_mode": {
      "default": "auto",
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Mode",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Connection Name"
    },
    "resource_path": {
      "title": "Resource Path",
      "type": "string"
    },
    "scan_mode": {
      "default": "single_file",
      "enum": [
        "single_file",
        "directory"
      ],
      "title": "Scan Mode",
      "type": "string"
    },
    "file_format": {
      "default": "parquet",
      "enum": [
        "csv",
        "parquet",
        "json",
        "delta",
        "iceberg"
      ],
      "title": "File Format",
      "type": "string"
    },
    "csv_has_header": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Csv Has Header"
    },
    "csv_delimiter": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": ",",
      "title": "Csv Delimiter"
    },
    "csv_encoding": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "utf8",
      "title": "Csv Encoding"
    },
    "delta_version": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Delta Version"
    }
  },
  "required": [
    "resource_path"
  ],
  "title": "CloudStorageReadSettings",
  "type": "object"
}

Fields:

  • auth_mode (AuthMethod)
  • connection_name (Optional[str])
  • resource_path (str)
  • scan_mode (Literal['single_file', 'directory'])
  • file_format (Literal['csv', 'parquet', 'json', 'delta', 'iceberg'])
  • csv_has_header (Optional[bool])
  • csv_delimiter (Optional[str])
  • csv_encoding (Optional[str])
  • delta_version (Optional[int])

Validators:

  • validate_auth_requirements (auth_mode)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class CloudStorageReadSettings(CloudStorageSettings):
    """Settings for reading from cloud storage"""

    scan_mode: Literal["single_file", "directory"] = "single_file"
    file_format: Literal["csv", "parquet", "json", "delta", "iceberg"] = "parquet"
    # CSV specific options
    csv_has_header: Optional[bool] = True
    csv_delimiter: Optional[str] = ","
    csv_encoding: Optional[str] = "utf8"
    # Deltalake specific settings
    delta_version: Optional[int] = None
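
A minimal sketch of reading a CSV file from S3; the bucket path and delimiter are illustrative:

read_settings = CloudStorageReadSettings(
    resource_path="s3://my-bucket/raw/sales.csv",
    auth_mode="aws-cli",
    file_format="csv",
    csv_delimiter=";",
    csv_has_header=True,
)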
CloudStorageSettings pydantic-model

Bases: BaseModel

Settings for cloud storage nodes in the visual designer

Show JSON schema:
{
  "description": "Settings for cloud storage nodes in the visual designer",
  "properties": {
    "auth_mode": {
      "default": "auto",
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Mode",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Connection Name"
    },
    "resource_path": {
      "title": "Resource Path",
      "type": "string"
    }
  },
  "required": [
    "resource_path"
  ],
  "title": "CloudStorageSettings",
  "type": "object"
}

Fields:

  • auth_mode (AuthMethod)
  • connection_name (Optional[str])
  • resource_path (str)

Validators:

  • validate_auth_requirements (auth_mode)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class CloudStorageSettings(BaseModel):
    """Settings for cloud storage nodes in the visual designer"""

    auth_mode: AuthMethod = "auto"
    connection_name: Optional[str] = None  # Required only for 'reference' mode
    resource_path: str  # s3://bucket/path/to/file.csv

    @field_validator("auth_mode", mode="after")
    def validate_auth_requirements(cls, v, values):
        data = values.data
        if v == "reference" and not data.get("connection_name"):
            raise ValueError("connection_name required when using reference mode")
        return v
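
A minimal sketch; the resource path and connection name are illustrative:

settings = CloudStorageSettings(
    resource_path="s3://my-bucket/data/orders.parquet",
    auth_mode="access_key",
    connection_name="my-s3-connection",
)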
CloudStorageWriteSettings pydantic-model

Bases: CloudStorageSettings, WriteSettingsWorkerInterface

Settings for writing to cloud storage

Show JSON schema:
{
  "description": "Settings for writing to cloud storage",
  "properties": {
    "resource_path": {
      "title": "Resource Path",
      "type": "string"
    },
    "write_mode": {
      "default": "overwrite",
      "enum": [
        "overwrite",
        "append"
      ],
      "title": "Write Mode",
      "type": "string"
    },
    "file_format": {
      "default": "parquet",
      "enum": [
        "csv",
        "parquet",
        "json",
        "delta"
      ],
      "title": "File Format",
      "type": "string"
    },
    "parquet_compression": {
      "default": "snappy",
      "enum": [
        "snappy",
        "gzip",
        "brotli",
        "lz4",
        "zstd"
      ],
      "title": "Parquet Compression",
      "type": "string"
    },
    "csv_delimiter": {
      "default": ",",
      "title": "Csv Delimiter",
      "type": "string"
    },
    "csv_encoding": {
      "default": "utf8",
      "title": "Csv Encoding",
      "type": "string"
    },
    "auth_mode": {
      "default": "auto",
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Mode",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Connection Name"
    }
  },
  "required": [
    "resource_path"
  ],
  "title": "CloudStorageWriteSettings",
  "type": "object"
}

Fields:

  • resource_path (str)
  • write_mode (Literal['overwrite', 'append'])
  • file_format (Literal['csv', 'parquet', 'json', 'delta'])
  • parquet_compression (Literal['snappy', 'gzip', 'brotli', 'lz4', 'zstd'])
  • csv_delimiter (str)
  • csv_encoding (str)
  • auth_mode (AuthMethod)
  • connection_name (Optional[str])

Validators:

  • validate_auth_requirements (auth_mode)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class CloudStorageWriteSettings(CloudStorageSettings, WriteSettingsWorkerInterface):
    """Settings for writing to cloud storage"""
    pass

    def get_write_setting_worker_interface(self) -> WriteSettingsWorkerInterface:
        """
        Convert to a worker interface model without secrets.
        """
        return WriteSettingsWorkerInterface(
            resource_path=self.resource_path,
            write_mode=self.write_mode,
            file_format=self.file_format,
            parquet_compression=self.parquet_compression,
            csv_delimiter=self.csv_delimiter,
            csv_encoding=self.csv_encoding
        )
get_write_setting_worker_interface()

Convert to a worker interface model without secrets.

Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
def get_write_setting_worker_interface(self) -> WriteSettingsWorkerInterface:
    """
    Convert to a worker interface model without secrets.
    """
    return WriteSettingsWorkerInterface(
        resource_path=self.resource_path,
        write_mode=self.write_mode,
        file_format=self.file_format,
        parquet_compression=self.parquet_compression,
        csv_delimiter=self.csv_delimiter,
        csv_encoding=self.csv_encoding
    )
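
A minimal sketch of configuring a cloud write and deriving the worker-side settings; the path and compression are illustrative:

write_settings = CloudStorageWriteSettings(
    resource_path="s3://my-bucket/processed/orders.parquet",
    write_mode="overwrite",
    file_format="parquet",
    parquet_compression="zstd",
)
worker_settings = write_settings.get_write_setting_worker_interface()   # keeps only the write options, no auth fields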
CloudStorageWriteSettingsWorkerInterface pydantic-model

Bases: BaseModel

Settings for writing to cloud storage in worker context

Show JSON schema:
{
  "$defs": {
    "FullCloudStorageConnectionWorkerInterface": {
      "description": "Internal model with decrypted secrets",
      "properties": {
        "storage_type": {
          "enum": [
            "s3",
            "adls",
            "gcs"
          ],
          "title": "Storage Type",
          "type": "string"
        },
        "auth_method": {
          "enum": [
            "access_key",
            "iam_role",
            "service_principal",
            "managed_identity",
            "sas_token",
            "aws-cli",
            "env_vars"
          ],
          "title": "Auth Method",
          "type": "string"
        },
        "connection_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": "None",
          "title": "Connection Name"
        },
        "aws_region": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Region"
        },
        "aws_access_key_id": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Access Key Id"
        },
        "aws_secret_access_key": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Secret Access Key"
        },
        "aws_role_arn": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Role Arn"
        },
        "aws_allow_unsafe_html": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Allow Unsafe Html"
        },
        "aws_session_token": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Session Token"
        },
        "azure_account_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Azure Account Name"
        },
        "azure_account_key": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Azure Account Key"
        },
        "azure_tenant_id": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Azure Tenant Id"
        },
        "azure_client_id": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Azure Client Id"
        },
        "azure_client_secret": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Azure Client Secret"
        },
        "endpoint_url": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Endpoint Url"
        },
        "verify_ssl": {
          "default": true,
          "title": "Verify Ssl",
          "type": "boolean"
        }
      },
      "required": [
        "storage_type",
        "auth_method"
      ],
      "title": "FullCloudStorageConnectionWorkerInterface",
      "type": "object"
    },
    "WriteSettingsWorkerInterface": {
      "description": "Settings for writing to cloud storage",
      "properties": {
        "resource_path": {
          "title": "Resource Path",
          "type": "string"
        },
        "write_mode": {
          "default": "overwrite",
          "enum": [
            "overwrite",
            "append"
          ],
          "title": "Write Mode",
          "type": "string"
        },
        "file_format": {
          "default": "parquet",
          "enum": [
            "csv",
            "parquet",
            "json",
            "delta"
          ],
          "title": "File Format",
          "type": "string"
        },
        "parquet_compression": {
          "default": "snappy",
          "enum": [
            "snappy",
            "gzip",
            "brotli",
            "lz4",
            "zstd"
          ],
          "title": "Parquet Compression",
          "type": "string"
        },
        "csv_delimiter": {
          "default": ",",
          "title": "Csv Delimiter",
          "type": "string"
        },
        "csv_encoding": {
          "default": "utf8",
          "title": "Csv Encoding",
          "type": "string"
        }
      },
      "required": [
        "resource_path"
      ],
      "title": "WriteSettingsWorkerInterface",
      "type": "object"
    }
  },
  "description": "Settings for writing to cloud storage in worker context",
  "properties": {
    "operation": {
      "title": "Operation",
      "type": "string"
    },
    "write_settings": {
      "$ref": "#/$defs/WriteSettingsWorkerInterface"
    },
    "connection": {
      "$ref": "#/$defs/FullCloudStorageConnectionWorkerInterface"
    },
    "flowfile_flow_id": {
      "default": 1,
      "title": "Flowfile Flow Id",
      "type": "integer"
    },
    "flowfile_node_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "string"
        }
      ],
      "default": -1,
      "title": "Flowfile Node Id"
    }
  },
  "required": [
    "operation",
    "write_settings",
    "connection"
  ],
  "title": "CloudStorageWriteSettingsWorkerInterface",
  "type": "object"
}

Fields:

Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class CloudStorageWriteSettingsWorkerInterface(BaseModel):
    """Settings for writing to cloud storage in worker context"""
    operation: str
    write_settings: WriteSettingsWorkerInterface
    connection: FullCloudStorageConnectionWorkerInterface
    flowfile_flow_id: int = 1
    flowfile_node_id: int | str = -1
FullCloudStorageConnection pydantic-model

Bases: AuthSettingsInput

Internal model with decrypted secrets

Show JSON schema:
{
  "description": "Internal model with decrypted secrets",
  "properties": {
    "storage_type": {
      "enum": [
        "s3",
        "adls",
        "gcs"
      ],
      "title": "Storage Type",
      "type": "string"
    },
    "auth_method": {
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Method",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "None",
      "title": "Connection Name"
    },
    "aws_region": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Region"
    },
    "aws_access_key_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Access Key Id"
    },
    "aws_secret_access_key": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Secret Access Key"
    },
    "aws_role_arn": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Role Arn"
    },
    "aws_allow_unsafe_html": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Allow Unsafe Html"
    },
    "aws_session_token": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Session Token"
    },
    "azure_account_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Account Name"
    },
    "azure_account_key": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Account Key"
    },
    "azure_tenant_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Tenant Id"
    },
    "azure_client_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Client Id"
    },
    "azure_client_secret": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Client Secret"
    },
    "endpoint_url": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Endpoint Url"
    },
    "verify_ssl": {
      "default": true,
      "title": "Verify Ssl",
      "type": "boolean"
    }
  },
  "required": [
    "storage_type",
    "auth_method"
  ],
  "title": "FullCloudStorageConnection",
  "type": "object"
}

Fields:

  • storage_type (CloudStorageType)
  • auth_method (AuthMethod)
  • connection_name (Optional[str])
  • aws_region (Optional[str])
  • aws_access_key_id (Optional[str])
  • aws_secret_access_key (Optional[SecretStr])
  • aws_role_arn (Optional[str])
  • aws_allow_unsafe_html (Optional[bool])
  • aws_session_token (Optional[SecretStr])
  • azure_account_name (Optional[str])
  • azure_account_key (Optional[SecretStr])
  • azure_tenant_id (Optional[str])
  • azure_client_id (Optional[str])
  • azure_client_secret (Optional[SecretStr])
  • endpoint_url (Optional[str])
  • verify_ssl (bool)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class FullCloudStorageConnection(AuthSettingsInput):
    """Internal model with decrypted secrets"""

    # AWS S3
    aws_region: Optional[str] = None
    aws_access_key_id: Optional[str] = None
    aws_secret_access_key: Optional[SecretStr] = None
    aws_role_arn: Optional[str] = None
    aws_allow_unsafe_html: Optional[bool] = None
    aws_session_token: Optional[SecretStr] = None

    # Azure ADLS
    azure_account_name: Optional[str] = None
    azure_account_key: Optional[SecretStr] = None
    azure_tenant_id: Optional[str] = None
    azure_client_id: Optional[str] = None
    azure_client_secret: Optional[SecretStr] = None

    # Common
    endpoint_url: Optional[str] = None
    verify_ssl: bool = True

    def get_worker_interface(self) -> "FullCloudStorageConnectionWorkerInterface":
        """
        Convert to a public interface model without secrets.
        """
        return FullCloudStorageConnectionWorkerInterface(
            storage_type=self.storage_type,
            auth_method=self.auth_method,
            connection_name=self.connection_name,
            aws_allow_unsafe_html=self.aws_allow_unsafe_html,
            aws_secret_access_key=encrypt_for_worker(self.aws_secret_access_key),
            aws_region=self.aws_region,
            aws_access_key_id=self.aws_access_key_id,
            aws_role_arn=self.aws_role_arn,
            aws_session_token=encrypt_for_worker(self.aws_session_token),
            azure_account_name=self.azure_account_name,
            azure_tenant_id=self.azure_tenant_id,
            azure_account_key=encrypt_for_worker(self.azure_account_key),
            azure_client_id=self.azure_client_id,
            azure_client_secret=encrypt_for_worker(self.azure_client_secret),
            endpoint_url=self.endpoint_url,
            verify_ssl=self.verify_ssl
        )
get_worker_interface()

Convert to a public interface model without secrets.

Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
def get_worker_interface(self) -> "FullCloudStorageConnectionWorkerInterface":
    """
    Convert to a public interface model without secrets.
    """
    return FullCloudStorageConnectionWorkerInterface(
        storage_type=self.storage_type,
        auth_method=self.auth_method,
        connection_name=self.connection_name,
        aws_allow_unsafe_html=self.aws_allow_unsafe_html,
        aws_secret_access_key=encrypt_for_worker(self.aws_secret_access_key),
        aws_region=self.aws_region,
        aws_access_key_id=self.aws_access_key_id,
        aws_role_arn=self.aws_role_arn,
        aws_session_token=encrypt_for_worker(self.aws_session_token),
        azure_account_name=self.azure_account_name,
        azure_tenant_id=self.azure_tenant_id,
        azure_account_key=encrypt_for_worker(self.azure_account_key),
        azure_client_id=self.azure_client_id,
        azure_client_secret=encrypt_for_worker(self.azure_client_secret),
        endpoint_url=self.endpoint_url,
        verify_ssl=self.verify_ssl
    )
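
The model above pairs the connection settings with Pydantic's SecretStr so credentials are masked in reprs and logs, and get_worker_interface() re-encrypts them before handing them to a worker. A minimal usage sketch (all credential values are hypothetical, the import path is assumed from the source path shown above, and the worker-encryption step may require flowfile's encryption key to be configured):

from pydantic import SecretStr

from flowfile_core.schemas.cloud_storage_schemas import FullCloudStorageConnection

# Build an S3 connection that authenticates with an access key.
connection = FullCloudStorageConnection(
    storage_type="s3",
    auth_method="access_key",
    connection_name="my-s3-connection",
    aws_region="eu-west-1",
    aws_access_key_id="AKIA...",                      # hypothetical key id
    aws_secret_access_key=SecretStr("secret-value"),  # hypothetical secret
)

# SecretStr fields are masked when printed.
print(connection.aws_secret_access_key)  # **********

# Hand the connection to a worker: secret fields pass through
# encrypt_for_worker(), so the worker only receives encrypted values.
# (This step may require the application's encryption key to be set up.)
worker_model = connection.get_worker_interface()
print(worker_model.aws_access_key_id)
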
FullCloudStorageConnectionInterface pydantic-model

Bases: AuthSettingsInput

API response model - no secrets exposed

Show JSON schema:
{
  "description": "API response model - no secrets exposed",
  "properties": {
    "storage_type": {
      "enum": [
        "s3",
        "adls",
        "gcs"
      ],
      "title": "Storage Type",
      "type": "string"
    },
    "auth_method": {
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Method",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "None",
      "title": "Connection Name"
    },
    "aws_allow_unsafe_html": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Allow Unsafe Html"
    },
    "aws_region": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Region"
    },
    "aws_access_key_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Access Key Id"
    },
    "aws_role_arn": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Role Arn"
    },
    "azure_account_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Account Name"
    },
    "azure_tenant_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Tenant Id"
    },
    "azure_client_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Client Id"
    },
    "endpoint_url": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Endpoint Url"
    },
    "verify_ssl": {
      "default": true,
      "title": "Verify Ssl",
      "type": "boolean"
    }
  },
  "required": [
    "storage_type",
    "auth_method"
  ],
  "title": "FullCloudStorageConnectionInterface",
  "type": "object"
}

Fields:

  • storage_type (CloudStorageType)
  • auth_method (AuthMethod)
  • connection_name (Optional[str])
  • aws_allow_unsafe_html (Optional[bool])
  • aws_region (Optional[str])
  • aws_access_key_id (Optional[str])
  • aws_role_arn (Optional[str])
  • azure_account_name (Optional[str])
  • azure_tenant_id (Optional[str])
  • azure_client_id (Optional[str])
  • endpoint_url (Optional[str])
  • verify_ssl (bool)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class FullCloudStorageConnectionInterface(AuthSettingsInput):
    """API response model - no secrets exposed"""

    # Public fields only
    aws_allow_unsafe_html: Optional[bool] = None
    aws_region: Optional[str] = None
    aws_access_key_id: Optional[str] = None
    aws_role_arn: Optional[str] = None
    azure_account_name: Optional[str] = None
    azure_tenant_id: Optional[str] = None
    azure_client_id: Optional[str] = None
    endpoint_url: Optional[str] = None
    verify_ssl: bool = True
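
This interface mirrors the full connection model but declares only the non-secret fields. One way to project a full connection onto it, shown here as a sketch rather than the library's own mechanism, is to dump only the fields the interface declares:

from pydantic import SecretStr

from flowfile_core.schemas.cloud_storage_schemas import (
    FullCloudStorageConnection,
    FullCloudStorageConnectionInterface,
)

full = FullCloudStorageConnection(
    storage_type="s3",
    auth_method="access_key",
    aws_access_key_id="AKIA...",                      # hypothetical
    aws_secret_access_key=SecretStr("secret-value"),  # hypothetical
)

# Keep only the fields declared on the public interface; secret fields
# (aws_secret_access_key, aws_session_token, ...) are left out entirely.
public_fields = set(FullCloudStorageConnectionInterface.model_fields)
public = FullCloudStorageConnectionInterface(**full.model_dump(include=public_fields))
print(public.model_dump())
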
FullCloudStorageConnectionWorkerInterface pydantic-model

Bases: AuthSettingsInput

Internal model with decrypted secrets

Show JSON schema:
{
  "description": "Internal model with decrypted secrets",
  "properties": {
    "storage_type": {
      "enum": [
        "s3",
        "adls",
        "gcs"
      ],
      "title": "Storage Type",
      "type": "string"
    },
    "auth_method": {
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Method",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "None",
      "title": "Connection Name"
    },
    "aws_region": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Region"
    },
    "aws_access_key_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Access Key Id"
    },
    "aws_secret_access_key": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Secret Access Key"
    },
    "aws_role_arn": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Role Arn"
    },
    "aws_allow_unsafe_html": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Allow Unsafe Html"
    },
    "aws_session_token": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Session Token"
    },
    "azure_account_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Account Name"
    },
    "azure_account_key": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Account Key"
    },
    "azure_tenant_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Tenant Id"
    },
    "azure_client_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Client Id"
    },
    "azure_client_secret": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Client Secret"
    },
    "endpoint_url": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Endpoint Url"
    },
    "verify_ssl": {
      "default": true,
      "title": "Verify Ssl",
      "type": "boolean"
    }
  },
  "required": [
    "storage_type",
    "auth_method"
  ],
  "title": "FullCloudStorageConnectionWorkerInterface",
  "type": "object"
}

Fields:

  • storage_type (CloudStorageType)
  • auth_method (AuthMethod)
  • connection_name (Optional[str])
  • aws_region (Optional[str])
  • aws_access_key_id (Optional[str])
  • aws_secret_access_key (Optional[str])
  • aws_role_arn (Optional[str])
  • aws_allow_unsafe_html (Optional[bool])
  • aws_session_token (Optional[str])
  • azure_account_name (Optional[str])
  • azure_account_key (Optional[str])
  • azure_tenant_id (Optional[str])
  • azure_client_id (Optional[str])
  • azure_client_secret (Optional[str])
  • endpoint_url (Optional[str])
  • verify_ssl (bool)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class FullCloudStorageConnectionWorkerInterface(AuthSettingsInput):
    """Internal model with decrypted secrets"""

    # AWS S3
    aws_region: Optional[str] = None
    aws_access_key_id: Optional[str] = None
    aws_secret_access_key: Optional[str] = None
    aws_role_arn: Optional[str] = None
    aws_allow_unsafe_html: Optional[bool] = None
    aws_session_token: Optional[str] = None

    # Azure ADLS
    azure_account_name: Optional[str] = None
    azure_account_key: Optional[str] = None
    azure_tenant_id: Optional[str] = None
    azure_client_id: Optional[str] = None
    azure_client_secret: Optional[str] = None

    # Common
    endpoint_url: Optional[str] = None
    verify_ssl: bool = True
WriteSettingsWorkerInterface pydantic-model

Bases: BaseModel

Settings for writing to cloud storage

Show JSON schema:
{
  "description": "Settings for writing to cloud storage",
  "properties": {
    "resource_path": {
      "title": "Resource Path",
      "type": "string"
    },
    "write_mode": {
      "default": "overwrite",
      "enum": [
        "overwrite",
        "append"
      ],
      "title": "Write Mode",
      "type": "string"
    },
    "file_format": {
      "default": "parquet",
      "enum": [
        "csv",
        "parquet",
        "json",
        "delta"
      ],
      "title": "File Format",
      "type": "string"
    },
    "parquet_compression": {
      "default": "snappy",
      "enum": [
        "snappy",
        "gzip",
        "brotli",
        "lz4",
        "zstd"
      ],
      "title": "Parquet Compression",
      "type": "string"
    },
    "csv_delimiter": {
      "default": ",",
      "title": "Csv Delimiter",
      "type": "string"
    },
    "csv_encoding": {
      "default": "utf8",
      "title": "Csv Encoding",
      "type": "string"
    }
  },
  "required": [
    "resource_path"
  ],
  "title": "WriteSettingsWorkerInterface",
  "type": "object"
}

Fields:

  • resource_path (str)
  • write_mode (Literal['overwrite', 'append'])
  • file_format (Literal['csv', 'parquet', 'json', 'delta'])
  • parquet_compression (Literal['snappy', 'gzip', 'brotli', 'lz4', 'zstd'])
  • csv_delimiter (str)
  • csv_encoding (str)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
class WriteSettingsWorkerInterface(BaseModel):
    """Settings for writing to cloud storage"""
    resource_path: str  # s3://bucket/path/to/file.csv

    write_mode: Literal["overwrite", "append"] = "overwrite"
    file_format: Literal["csv", "parquet", "json", "delta"] = "parquet"

    parquet_compression: Literal["snappy", "gzip", "brotli", "lz4", "zstd"] = "snappy"

    csv_delimiter: str = ","
    csv_encoding: str = "utf8"
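
These write settings are plain Pydantic values, so constructing them directly is straightforward. A brief sketch (bucket paths are illustrative, and the import path is assumed from the source path shown above):

from flowfile_core.schemas.cloud_storage_schemas import WriteSettingsWorkerInterface

# Parquet output with zstd compression, replacing whatever is already there.
parquet_settings = WriteSettingsWorkerInterface(
    resource_path="s3://my-bucket/exports/sales.parquet",
    write_mode="overwrite",
    file_format="parquet",
    parquet_compression="zstd",
)

# CSV output appended to an existing location, using a semicolon delimiter.
csv_settings = WriteSettingsWorkerInterface(
    resource_path="s3://my-bucket/exports/sales.csv",
    write_mode="append",
    file_format="csv",
    csv_delimiter=";",
)

print(parquet_settings.model_dump())
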
encrypt_for_worker(secret_value)

Encrypts a secret value for use in worker contexts. This is a placeholder function that simulates encryption. In practice, you would use a secure encryption method.

Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
def encrypt_for_worker(secret_value: SecretStr|None) -> str|None:
    """
    Encrypts a secret value for use in worker contexts.
    This is a placeholder function that simulates encryption.
    In practice, you would use a secure encryption method.
    """
    if secret_value is not None:
        return encrypt_secret(secret_value.get_secret_value())
get_cloud_storage_write_settings_worker_interface(write_settings, connection, lf, flowfile_flow_id=1, flowfile_node_id=-1)

Convert to a worker interface model with hashed secrets.

Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py
def get_cloud_storage_write_settings_worker_interface(
        write_settings: CloudStorageWriteSettings,
        connection: FullCloudStorageConnection,
        lf: pl.LazyFrame,
        flowfile_flow_id: int = 1,
        flowfile_node_id: int | str = -1,
        ) -> CloudStorageWriteSettingsWorkerInterface:
    """
    Convert to a worker interface model with hashed secrets.
    """
    operation = base64.b64encode(lf.serialize()).decode()

    return CloudStorageWriteSettingsWorkerInterface(
        operation=operation,
        write_settings=write_settings.get_write_setting_worker_interface(),
        connection=connection.get_worker_interface(),
        flowfile_flow_id=flowfile_flow_id,  # Default value, can be overridden
        flowfile_node_id=flowfile_node_id  # Default value, can be overridden
    )
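
The essential step in this helper is how the query plan travels to the worker: the LazyFrame is serialized and base64-encoded, and the worker decodes and deserializes it before executing. A minimal round-trip of just that step, using only the Polars API (the surrounding settings and connection objects are omitted; Polars versions whose serialize() returns a string would need a small adjustment):

import base64
import io

import polars as pl

lf = pl.LazyFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Core side: encode the logical plan exactly as the helper above does.
operation = base64.b64encode(lf.serialize()).decode()

# Worker side: decode, rebuild the LazyFrame, and execute the same plan.
restored = pl.LazyFrame.deserialize(io.BytesIO(base64.b64decode(operation)))
print(restored.collect())
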

output_model

flowfile_core.schemas.output_model

Classes:

Name Description
BaseItem

A base model for any item in a file system, like a file or directory.

ExpressionRef

A reference to a single Polars expression, including its name and docstring.

ExpressionsOverview

Represents a categorized list of available Polars expressions.

FileColumn

Represents detailed schema and statistics for a single column (field).

InstantFuncResult

Represents the result of a function that is expected to execute instantly.

ItemInfo

Provides detailed information about a single item in an output directory.

NodeData

A comprehensive model holding the complete state and data for a single node.

NodeResult

Represents the execution result of a single node in a FlowGraph run.

OutputDir

Represents the contents of a single output directory.

OutputFile

Represents a single file in an output directory, extending BaseItem.

OutputFiles

Represents a collection of files, typically within a directory.

OutputTree

Represents a directory tree, including subdirectories.

RunInformation

Contains summary information about a complete FlowGraph execution.

TableExample

Represents a preview of a table, including schema and sample data.

BaseItem pydantic-model

Bases: BaseModel

A base model for any item in a file system, like a file or directory.

Show JSON schema:
{
  "description": "A base model for any item in a file system, like a file or directory.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    }
  },
  "required": [
    "name",
    "path"
  ],
  "title": "BaseItem",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (Optional[int])
  • creation_date (Optional[datetime])
  • access_date (Optional[datetime])
  • modification_date (Optional[datetime])
  • source_path (Optional[str])
  • number_of_items (int)
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class BaseItem(BaseModel):
    """A base model for any item in a file system, like a file or directory."""
    name: str
    path: str
    size: Optional[int] = None
    creation_date: Optional[datetime] = None
    access_date: Optional[datetime] = None
    modification_date: Optional[datetime] = None
    source_path: Optional[str] = None
    number_of_items: int = -1
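
BaseItem is the shared shape for files and directories, so it maps naturally onto os.stat() results. An illustrative construction (using a throwaway temporary file so the snippet runs anywhere):

import os
import tempfile
from datetime import datetime

from flowfile_core.schemas.output_model import BaseItem

# Create a small temporary file to describe.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,value\n1,a\n")
    path = tmp.name

stat = os.stat(path)
item = BaseItem(
    name=os.path.basename(path),
    path=path,
    size=stat.st_size,
    creation_date=datetime.fromtimestamp(stat.st_ctime),
    access_date=datetime.fromtimestamp(stat.st_atime),
    modification_date=datetime.fromtimestamp(stat.st_mtime),
)
print(item.model_dump())
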
ExpressionRef pydantic-model

Bases: BaseModel

A reference to a single Polars expression, including its name and docstring.

Show JSON schema:
{
  "description": "A reference to a single Polars expression, including its name and docstring.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "doc": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "title": "Doc"
    }
  },
  "required": [
    "name",
    "doc"
  ],
  "title": "ExpressionRef",
  "type": "object"
}

Fields:

  • name (str)
  • doc (Optional[str])
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class ExpressionRef(BaseModel):
    """A reference to a single Polars expression, including its name and docstring."""
    name: str
    doc: Optional[str]
ExpressionsOverview pydantic-model

Bases: BaseModel

Represents a categorized list of available Polars expressions.

Show JSON schema:
{
  "$defs": {
    "ExpressionRef": {
      "description": "A reference to a single Polars expression, including its name and docstring.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "doc": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "title": "Doc"
        }
      },
      "required": [
        "name",
        "doc"
      ],
      "title": "ExpressionRef",
      "type": "object"
    }
  },
  "description": "Represents a categorized list of available Polars expressions.",
  "properties": {
    "expression_type": {
      "title": "Expression Type",
      "type": "string"
    },
    "expressions": {
      "items": {
        "$ref": "#/$defs/ExpressionRef"
      },
      "title": "Expressions",
      "type": "array"
    }
  },
  "required": [
    "expression_type",
    "expressions"
  ],
  "title": "ExpressionsOverview",
  "type": "object"
}

Fields:

  • expression_type (str)
  • expressions (List[ExpressionRef])
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class ExpressionsOverview(BaseModel):
    """Represents a categorized list of available Polars expressions."""
    expression_type: str
    expressions: List[ExpressionRef]
FileColumn pydantic-model

Bases: BaseModel

Represents detailed schema and statistics for a single column (field).

Show JSON schema:
{
  "description": "Represents detailed schema and statistics for a single column (field).",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "data_type": {
      "title": "Data Type",
      "type": "string"
    },
    "is_unique": {
      "title": "Is Unique",
      "type": "boolean"
    },
    "max_value": {
      "title": "Max Value",
      "type": "string"
    },
    "min_value": {
      "title": "Min Value",
      "type": "string"
    },
    "number_of_empty_values": {
      "title": "Number Of Empty Values",
      "type": "integer"
    },
    "number_of_filled_values": {
      "title": "Number Of Filled Values",
      "type": "integer"
    },
    "number_of_unique_values": {
      "title": "Number Of Unique Values",
      "type": "integer"
    },
    "size": {
      "title": "Size",
      "type": "integer"
    }
  },
  "required": [
    "name",
    "data_type",
    "is_unique",
    "max_value",
    "min_value",
    "number_of_empty_values",
    "number_of_filled_values",
    "number_of_unique_values",
    "size"
  ],
  "title": "FileColumn",
  "type": "object"
}

Fields:

  • name (str)
  • data_type (str)
  • is_unique (bool)
  • max_value (str)
  • min_value (str)
  • number_of_empty_values (int)
  • number_of_filled_values (int)
  • number_of_unique_values (int)
  • size (int)
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class FileColumn(BaseModel):
    """Represents detailed schema and statistics for a single column (field)."""
    name: str
    data_type: str
    is_unique: bool
    max_value: str
    min_value: str
    number_of_empty_values: int
    number_of_filled_values: int
    number_of_unique_values: int
    size: int
InstantFuncResult pydantic-model

Bases: BaseModel

Represents the result of a function that is expected to execute instantly.

Show JSON schema:
{
  "description": "Represents the result of a function that is expected to execute instantly.",
  "properties": {
    "success": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Success"
    },
    "result": {
      "title": "Result",
      "type": "string"
    }
  },
  "required": [
    "result"
  ],
  "title": "InstantFuncResult",
  "type": "object"
}

Fields:

  • success (Optional[bool])
  • result (str)
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class InstantFuncResult(BaseModel):
    """Represents the result of a function that is expected to execute instantly."""
    success: Optional[bool] = None
    result: str
ItemInfo pydantic-model

Bases: OutputFile

Provides detailed information about a single item in an output directory.

Show JSON schema:
{
  "description": "Provides detailed information about a single item in an output directory.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    },
    "ext": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Ext"
    },
    "mimetype": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Mimetype"
    },
    "id": {
      "default": -1,
      "title": "Id",
      "type": "integer"
    },
    "type": {
      "title": "Type",
      "type": "string"
    },
    "analysis_file_available": {
      "default": false,
      "title": "Analysis File Available",
      "type": "boolean"
    },
    "analysis_file_location": {
      "default": null,
      "title": "Analysis File Location",
      "type": "string"
    },
    "analysis_file_error": {
      "default": null,
      "title": "Analysis File Error",
      "type": "string"
    }
  },
  "required": [
    "name",
    "path",
    "type"
  ],
  "title": "ItemInfo",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (Optional[int])
  • creation_date (Optional[datetime])
  • access_date (Optional[datetime])
  • modification_date (Optional[datetime])
  • source_path (Optional[str])
  • number_of_items (int)
  • ext (Optional[str])
  • mimetype (Optional[str])
  • id (int)
  • type (str)
  • analysis_file_available (bool)
  • analysis_file_location (str)
  • analysis_file_error (str)
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class ItemInfo(OutputFile):
    """Provides detailed information about a single item in an output directory."""
    id: int = -1
    type: str
    analysis_file_available: bool = False
    analysis_file_location: str = None
    analysis_file_error: str = None
NodeData pydantic-model

Bases: BaseModel

A comprehensive model holding the complete state and data for a single node.

This includes its input/output data previews, settings, and run status.

Show JSON schema:
{
  "$defs": {
    "FileColumn": {
      "description": "Represents detailed schema and statistics for a single column (field).",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "title": "Data Type",
          "type": "string"
        },
        "is_unique": {
          "title": "Is Unique",
          "type": "boolean"
        },
        "max_value": {
          "title": "Max Value",
          "type": "string"
        },
        "min_value": {
          "title": "Min Value",
          "type": "string"
        },
        "number_of_empty_values": {
          "title": "Number Of Empty Values",
          "type": "integer"
        },
        "number_of_filled_values": {
          "title": "Number Of Filled Values",
          "type": "integer"
        },
        "number_of_unique_values": {
          "title": "Number Of Unique Values",
          "type": "integer"
        },
        "size": {
          "title": "Size",
          "type": "integer"
        }
      },
      "required": [
        "name",
        "data_type",
        "is_unique",
        "max_value",
        "min_value",
        "number_of_empty_values",
        "number_of_filled_values",
        "number_of_unique_values",
        "size"
      ],
      "title": "FileColumn",
      "type": "object"
    },
    "TableExample": {
      "description": "Represents a preview of a table, including schema and sample data.",
      "properties": {
        "node_id": {
          "title": "Node Id",
          "type": "integer"
        },
        "number_of_records": {
          "title": "Number Of Records",
          "type": "integer"
        },
        "number_of_columns": {
          "title": "Number Of Columns",
          "type": "integer"
        },
        "name": {
          "title": "Name",
          "type": "string"
        },
        "table_schema": {
          "items": {
            "$ref": "#/$defs/FileColumn"
          },
          "title": "Table Schema",
          "type": "array"
        },
        "columns": {
          "items": {
            "type": "string"
          },
          "title": "Columns",
          "type": "array"
        },
        "data": {
          "anyOf": [
            {
              "items": {
                "type": "object"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": {},
          "title": "Data"
        }
      },
      "required": [
        "node_id",
        "number_of_records",
        "number_of_columns",
        "name",
        "table_schema",
        "columns"
      ],
      "title": "TableExample",
      "type": "object"
    }
  },
  "description": "A comprehensive model holding the complete state and data for a single node.\n\nThis includes its input/output data previews, settings, and run status.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "flow_type": {
      "title": "Flow Type",
      "type": "string"
    },
    "left_input": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "right_input": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "main_input": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "main_output": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "left_output": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "right_output": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "has_run": {
      "default": false,
      "title": "Has Run",
      "type": "boolean"
    },
    "is_cached": {
      "default": false,
      "title": "Is Cached",
      "type": "boolean"
    },
    "setting_input": {
      "default": null,
      "title": "Setting Input"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "flow_type"
  ],
  "title": "NodeData",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • flow_type (str)
  • left_input (Optional[TableExample])
  • right_input (Optional[TableExample])
  • main_input (Optional[TableExample])
  • main_output (Optional[TableExample])
  • left_output (Optional[TableExample])
  • right_output (Optional[TableExample])
  • has_run (bool)
  • is_cached (bool)
  • setting_input (Any)
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class NodeData(BaseModel):
    """A comprehensive model holding the complete state and data for a single node.

    This includes its input/output data previews, settings, and run status.
    """
    flow_id: int
    node_id: int
    flow_type: str
    left_input: Optional[TableExample] = None
    right_input: Optional[TableExample] = None
    main_input: Optional[TableExample] = None
    main_output: Optional[TableExample] = None
    left_output: Optional[TableExample] = None
    right_output: Optional[TableExample] = None
    has_run: bool = False
    is_cached: bool = False
    setting_input: Any = None
NodeResult pydantic-model

Bases: BaseModel

Represents the execution result of a single node in a FlowGraph run.

Show JSON schema:
{
  "description": "Represents the execution result of a single node in a FlowGraph run.",
  "properties": {
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "node_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Name"
    },
    "start_timestamp": {
      "title": "Start Timestamp",
      "type": "number"
    },
    "end_timestamp": {
      "default": 0,
      "title": "End Timestamp",
      "type": "number"
    },
    "success": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Success"
    },
    "error": {
      "default": "",
      "title": "Error",
      "type": "string"
    },
    "run_time": {
      "default": -1,
      "title": "Run Time",
      "type": "integer"
    },
    "is_running": {
      "default": true,
      "title": "Is Running",
      "type": "boolean"
    }
  },
  "required": [
    "node_id"
  ],
  "title": "NodeResult",
  "type": "object"
}

Fields:

  • node_id (int)
  • node_name (Optional[str])
  • start_timestamp (float)
  • end_timestamp (float)
  • success (Optional[bool])
  • error (str)
  • run_time (int)
  • is_running (bool)
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class NodeResult(BaseModel):
    """Represents the execution result of a single node in a FlowGraph run."""
    node_id: int
    node_name: Optional[str] = None
    start_timestamp: float = Field(default_factory=time.time)
    end_timestamp: float = 0
    success: Optional[bool] = None
    error: str = ''
    run_time: int = -1
    is_running: bool = True
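
start_timestamp defaults to the creation time and is_running starts as True, so a result can be created when a node starts and completed when it finishes. A sketch of that lifecycle (the bookkeeping here is illustrative, not the engine's own code):

import time

from flowfile_core.schemas.output_model import NodeResult

# Created when the node starts executing; start_timestamp is set automatically.
result = NodeResult(node_id=3, node_name="filter_orders")

time.sleep(0.1)  # stand-in for the node's actual work

# Filled in once the node finishes.
result.end_timestamp = time.time()
result.run_time = int(result.end_timestamp - result.start_timestamp)
result.success = True
result.is_running = False

print(result.model_dump())
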
OutputDir pydantic-model

Bases: BaseItem

Represents the contents of a single output directory.

Show JSON schema:
{
  "$defs": {
    "ItemInfo": {
      "description": "Provides detailed information about a single item in an output directory.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "path": {
          "title": "Path",
          "type": "string"
        },
        "size": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Size"
        },
        "creation_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Creation Date"
        },
        "access_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Access Date"
        },
        "modification_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Modification Date"
        },
        "source_path": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Source Path"
        },
        "number_of_items": {
          "default": -1,
          "title": "Number Of Items",
          "type": "integer"
        },
        "ext": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Ext"
        },
        "mimetype": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Mimetype"
        },
        "id": {
          "default": -1,
          "title": "Id",
          "type": "integer"
        },
        "type": {
          "title": "Type",
          "type": "string"
        },
        "analysis_file_available": {
          "default": false,
          "title": "Analysis File Available",
          "type": "boolean"
        },
        "analysis_file_location": {
          "default": null,
          "title": "Analysis File Location",
          "type": "string"
        },
        "analysis_file_error": {
          "default": null,
          "title": "Analysis File Error",
          "type": "string"
        }
      },
      "required": [
        "name",
        "path",
        "type"
      ],
      "title": "ItemInfo",
      "type": "object"
    }
  },
  "description": "Represents the contents of a single output directory.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    },
    "all_items": {
      "items": {
        "type": "string"
      },
      "title": "All Items",
      "type": "array"
    },
    "items": {
      "items": {
        "$ref": "#/$defs/ItemInfo"
      },
      "title": "Items",
      "type": "array"
    }
  },
  "required": [
    "name",
    "path",
    "all_items",
    "items"
  ],
  "title": "OutputDir",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (Optional[int])
  • creation_date (Optional[datetime])
  • access_date (Optional[datetime])
  • modification_date (Optional[datetime])
  • source_path (Optional[str])
  • number_of_items (int)
  • all_items (List[str])
  • items (List[ItemInfo])
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class OutputDir(BaseItem):
    """Represents the contents of a single output directory."""
    all_items: List[str]
    items: List[ItemInfo]
OutputFile pydantic-model

Bases: BaseItem

Represents a single file in an output directory, extending BaseItem.

Show JSON schema:
{
  "description": "Represents a single file in an output directory, extending BaseItem.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    },
    "ext": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Ext"
    },
    "mimetype": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Mimetype"
    }
  },
  "required": [
    "name",
    "path"
  ],
  "title": "OutputFile",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (Optional[int])
  • creation_date (Optional[datetime])
  • access_date (Optional[datetime])
  • modification_date (Optional[datetime])
  • source_path (Optional[str])
  • number_of_items (int)
  • ext (Optional[str])
  • mimetype (Optional[str])
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class OutputFile(BaseItem):
    """Represents a single file in an output directory, extending BaseItem."""
    ext: Optional[str] = None
    mimetype: Optional[str] = None
OutputFiles pydantic-model

Bases: BaseItem

Represents a collection of files, typically within a directory.

Show JSON schema:
{
  "$defs": {
    "OutputFile": {
      "description": "Represents a single file in an output directory, extending BaseItem.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "path": {
          "title": "Path",
          "type": "string"
        },
        "size": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Size"
        },
        "creation_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Creation Date"
        },
        "access_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Access Date"
        },
        "modification_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Modification Date"
        },
        "source_path": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Source Path"
        },
        "number_of_items": {
          "default": -1,
          "title": "Number Of Items",
          "type": "integer"
        },
        "ext": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Ext"
        },
        "mimetype": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Mimetype"
        }
      },
      "required": [
        "name",
        "path"
      ],
      "title": "OutputFile",
      "type": "object"
    }
  },
  "description": "Represents a collection of files, typically within a directory.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    },
    "files": {
      "items": {
        "$ref": "#/$defs/OutputFile"
      },
      "title": "Files",
      "type": "array"
    }
  },
  "required": [
    "name",
    "path"
  ],
  "title": "OutputFiles",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (Optional[int])
  • creation_date (Optional[datetime])
  • access_date (Optional[datetime])
  • modification_date (Optional[datetime])
  • source_path (Optional[str])
  • number_of_items (int)
  • files (List[OutputFile])
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class OutputFiles(BaseItem):
    """Represents a collection of files, typically within a directory."""
    files: List[OutputFile] = Field(default_factory=list)
OutputTree pydantic-model

Bases: OutputFiles

Represents a directory tree, including subdirectories.

Show JSON schema:
{
  "$defs": {
    "OutputFile": {
      "description": "Represents a single file in an output directory, extending BaseItem.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "path": {
          "title": "Path",
          "type": "string"
        },
        "size": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Size"
        },
        "creation_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Creation Date"
        },
        "access_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Access Date"
        },
        "modification_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Modification Date"
        },
        "source_path": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Source Path"
        },
        "number_of_items": {
          "default": -1,
          "title": "Number Of Items",
          "type": "integer"
        },
        "ext": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Ext"
        },
        "mimetype": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Mimetype"
        }
      },
      "required": [
        "name",
        "path"
      ],
      "title": "OutputFile",
      "type": "object"
    },
    "OutputFiles": {
      "description": "Represents a collection of files, typically within a directory.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "path": {
          "title": "Path",
          "type": "string"
        },
        "size": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Size"
        },
        "creation_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Creation Date"
        },
        "access_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Access Date"
        },
        "modification_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Modification Date"
        },
        "source_path": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Source Path"
        },
        "number_of_items": {
          "default": -1,
          "title": "Number Of Items",
          "type": "integer"
        },
        "files": {
          "items": {
            "$ref": "#/$defs/OutputFile"
          },
          "title": "Files",
          "type": "array"
        }
      },
      "required": [
        "name",
        "path"
      ],
      "title": "OutputFiles",
      "type": "object"
    }
  },
  "description": "Represents a directory tree, including subdirectories.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    },
    "files": {
      "items": {
        "$ref": "#/$defs/OutputFile"
      },
      "title": "Files",
      "type": "array"
    },
    "directories": {
      "items": {
        "$ref": "#/$defs/OutputFiles"
      },
      "title": "Directories",
      "type": "array"
    }
  },
  "required": [
    "name",
    "path"
  ],
  "title": "OutputTree",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (Optional[int])
  • creation_date (Optional[datetime])
  • access_date (Optional[datetime])
  • modification_date (Optional[datetime])
  • source_path (Optional[str])
  • number_of_items (int)
  • files (List[OutputFile])
  • directories (List[OutputFiles])
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class OutputTree(OutputFiles):
    """Represents a directory tree, including subdirectories."""
    directories: List[OutputFiles] = Field(default_factory=list)
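
Because OutputTree extends OutputFiles, a directory tree is simply a file listing plus a list of subdirectory listings. An illustrative assembly (names and paths are made up):

from flowfile_core.schemas.output_model import OutputFile, OutputFiles, OutputTree

reports = OutputFiles(
    name="reports",
    path="/data/output/reports",
    files=[
        OutputFile(
            name="summary.parquet",
            path="/data/output/reports/summary.parquet",
            ext="parquet",
        ),
    ],
)

tree = OutputTree(
    name="output",
    path="/data/output",
    files=[OutputFile(name="run.log", path="/data/output/run.log", ext="log")],
    directories=[reports],
)

print(tree.model_dump_json(indent=2))
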
RunInformation pydantic-model

Bases: BaseModel

Contains summary information about a complete FlowGraph execution.

Show JSON schema:
{
  "$defs": {
    "NodeResult": {
      "description": "Represents the execution result of a single node in a FlowGraph run.",
      "properties": {
        "node_id": {
          "title": "Node Id",
          "type": "integer"
        },
        "node_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Node Name"
        },
        "start_timestamp": {
          "title": "Start Timestamp",
          "type": "number"
        },
        "end_timestamp": {
          "default": 0,
          "title": "End Timestamp",
          "type": "number"
        },
        "success": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Success"
        },
        "error": {
          "default": "",
          "title": "Error",
          "type": "string"
        },
        "run_time": {
          "default": -1,
          "title": "Run Time",
          "type": "integer"
        },
        "is_running": {
          "default": true,
          "title": "Is Running",
          "type": "boolean"
        }
      },
      "required": [
        "node_id"
      ],
      "title": "NodeResult",
      "type": "object"
    }
  },
  "description": "Contains summary information about a complete FlowGraph execution.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "start_time": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "title": "Start Time"
    },
    "end_time": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "End Time"
    },
    "success": {
      "title": "Success",
      "type": "boolean"
    },
    "nodes_completed": {
      "default": 0,
      "title": "Nodes Completed",
      "type": "integer"
    },
    "number_of_nodes": {
      "default": 0,
      "title": "Number Of Nodes",
      "type": "integer"
    },
    "node_step_result": {
      "items": {
        "$ref": "#/$defs/NodeResult"
      },
      "title": "Node Step Result",
      "type": "array"
    }
  },
  "required": [
    "flow_id",
    "success",
    "node_step_result"
  ],
  "title": "RunInformation",
  "type": "object"
}

Fields:

  • flow_id (int)
  • start_time (Optional[datetime])
  • end_time (Optional[datetime])
  • success (bool)
  • nodes_completed (int)
  • number_of_nodes (int)
  • node_step_result (List[NodeResult])
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class RunInformation(BaseModel):
    """Contains summary information about a complete FlowGraph execution."""
    flow_id: int
    start_time: Optional[datetime] = Field(default_factory=datetime.now)
    end_time: Optional[datetime] = None
    success: bool
    nodes_completed: int = 0
    number_of_nodes: int = 0
    node_step_result: List[NodeResult]
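
As a rough illustration of how a client might consume this model, the sketch below fetches the payload from the run-status endpoint and validates it with RunInformation. The base URL is a placeholder, and the import path is assumed from the "Source code in .../schemas/output_model.py" reference above.

import requests
from flowfile_core.schemas.output_model import RunInformation  # assumed import path

resp = requests.get("http://localhost:8000/flow/run_status/", params={"flow_id": 1})
resp.raise_for_status()

run_info = RunInformation(**resp.json())  # validate the JSON payload against the model
failed = [n for n in run_info.node_step_result if n.success is False]
print(f"{run_info.nodes_completed}/{run_info.number_of_nodes} nodes completed, {len(failed)} failed")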
TableExample pydantic-model

Bases: BaseModel

Represents a preview of a table, including schema and sample data.

Show JSON schema:
{
  "$defs": {
    "FileColumn": {
      "description": "Represents detailed schema and statistics for a single column (field).",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "title": "Data Type",
          "type": "string"
        },
        "is_unique": {
          "title": "Is Unique",
          "type": "boolean"
        },
        "max_value": {
          "title": "Max Value",
          "type": "string"
        },
        "min_value": {
          "title": "Min Value",
          "type": "string"
        },
        "number_of_empty_values": {
          "title": "Number Of Empty Values",
          "type": "integer"
        },
        "number_of_filled_values": {
          "title": "Number Of Filled Values",
          "type": "integer"
        },
        "number_of_unique_values": {
          "title": "Number Of Unique Values",
          "type": "integer"
        },
        "size": {
          "title": "Size",
          "type": "integer"
        }
      },
      "required": [
        "name",
        "data_type",
        "is_unique",
        "max_value",
        "min_value",
        "number_of_empty_values",
        "number_of_filled_values",
        "number_of_unique_values",
        "size"
      ],
      "title": "FileColumn",
      "type": "object"
    }
  },
  "description": "Represents a preview of a table, including schema and sample data.",
  "properties": {
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "number_of_records": {
      "title": "Number Of Records",
      "type": "integer"
    },
    "number_of_columns": {
      "title": "Number Of Columns",
      "type": "integer"
    },
    "name": {
      "title": "Name",
      "type": "string"
    },
    "table_schema": {
      "items": {
        "$ref": "#/$defs/FileColumn"
      },
      "title": "Table Schema",
      "type": "array"
    },
    "columns": {
      "items": {
        "type": "string"
      },
      "title": "Columns",
      "type": "array"
    },
    "data": {
      "anyOf": [
        {
          "items": {
            "type": "object"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": {},
      "title": "Data"
    }
  },
  "required": [
    "node_id",
    "number_of_records",
    "number_of_columns",
    "name",
    "table_schema",
    "columns"
  ],
  "title": "TableExample",
  "type": "object"
}

Fields:

  • node_id (int)
  • number_of_records (int)
  • number_of_columns (int)
  • name (str)
  • table_schema (List[FileColumn])
  • columns (List[str])
  • data (Optional[List[Dict]])
Source code in flowfile_core/flowfile_core/schemas/output_model.py
class TableExample(BaseModel):
    """Represents a preview of a table, including schema and sample data."""
    node_id: int
    number_of_records: int
    number_of_columns: int
    name: str
    table_schema: List[FileColumn]
    columns: List[str]
    data: Optional[List[Dict]] = {}
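
Because data is a plain list of row dictionaries, a preview returned by the /node/data endpoint can be loaded directly into a Polars DataFrame. A minimal sketch, assuming a placeholder base URL for the server:

import polars as pl
import requests

resp = requests.get("http://localhost:8000/node/data", params={"flow_id": 1, "node_id": 2})
resp.raise_for_status()
preview = resp.json()

df = pl.DataFrame(preview["data"] or [])  # sample rows as a Polars DataFrame
dtypes = {col["name"]: col["data_type"] for col in preview["table_schema"]}
print(preview["number_of_records"], "records;", dtypes)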

Web API

This section documents the FastAPI routes that expose flowfile-core's functionality over HTTP.

routes

flowfile_core.routes.routes

Main API router and endpoint definitions for the Flowfile application.

This module sets up the FastAPI router and defines all the API endpoints for interacting with flows, nodes, files, and other core components of the application. It handles the logic for creating, reading, updating, and deleting these resources.

Functions:

Name Description
add_generic_settings

A generic endpoint to update the settings of any node.

add_node

Adds a new, unconfigured node (a "promise") to the flow graph.

cancel_flow

Cancels a currently running flow execution.

close_flow

Closes an active flow session.

connect_node

Creates a connection (edge) between two nodes in the flow graph.

copy_node

Copies an existing node's settings to a new node promise.

create_db_connection

Creates and securely stores a new database connection.

create_directory

Creates a new directory at the specified path.

create_flow

Creates a new, empty flow file at the specified path and registers a session for it.

delete_db_connection

Deletes a stored database connection.

delete_node

Deletes a node from the flow graph.

delete_node_connection

Deletes a connection (edge) between two nodes.

get_active_flow_file_sessions

Retrieves a list of all currently active flow sessions.

get_current_directory_contents

Gets the contents of the file explorer's current directory.

get_current_files

Gets the contents of the file explorer's current directory.

get_current_path

Returns the current absolute path of the file explorer.

get_db_connections

Retrieves all stored database connections for the current user (without passwords).

get_description_node

Retrieves the description text for a specific node.

get_directory_contents

Gets the contents of an arbitrary directory path.

get_downstream_node_ids

Gets a list of all node IDs that are downstream dependencies of a given node.

get_excel_sheet_names

Retrieves the sheet names from an Excel file.

get_expression_doc

Retrieves documentation for available Polars expressions.

get_expressions

Retrieves a list of all available Flowfile expression names.

get_flow

Retrieves the settings for a specific flow.

get_flow_frontend_data

Retrieves the data needed to render the flow graph in the frontend.

get_flow_settings

Retrieves the main settings for a flow.

get_generated_code

Generates and returns a Python script with Polars code representing the flow.

get_graphic_walker_input

Gets the data and configuration for the Graphic Walker data exploration tool.

get_instant_function_result

Executes a simple, instant function on a node's data and returns the result.

get_list_of_saved_flows

Scans a directory for saved flow files (.flowfile).

get_local_files

Retrieves a list of files from a specified local directory.

get_node

Retrieves the complete state and data preview for a single node.

get_node_list

Retrieves the list of all available node types and their templates.

get_node_model

(Internal) Retrieves a node's Pydantic model from the input_schema module by its name.

get_run_status

Retrieves the run status information for a specific flow.

get_table_example

Retrieves a data preview (schema and sample rows) for a node's output.

get_vue_flow_data

Retrieves the flow data formatted for the Vue-based frontend.

import_saved_flow

Imports a flow from a saved .flowfile and registers it as a new session.

navigate_into_directory

Navigates the file explorer into a specified subdirectory.

navigate_to_directory

Navigates the file explorer to an absolute directory path.

navigate_up

Navigates the file explorer one directory level up.

register_flow

Registers a new flow session with the application.

run_flow

Executes a flow in a background task.

save_flow

Saves the current state of a flow to a .flowfile.

update_description_node

Updates the description text for a specific node.

update_flow_settings

Updates the main settings for a flow.

upload_file

Uploads a file to the server's 'uploads' directory.

validate_db_settings

Validates that a connection can be made to a database with the given settings.

add_generic_settings(input_data, node_type, current_user=Depends(get_current_active_user))

A generic endpoint to update the settings of any node.

This endpoint dynamically determines the correct Pydantic model and update function based on the node_type parameter.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/update_settings/', tags=['transform'])
def add_generic_settings(input_data: Dict[str, Any], node_type: str, current_user=Depends(get_current_active_user)):
    """A generic endpoint to update the settings of any node.

    This endpoint dynamically determines the correct Pydantic model and update
    function based on the `node_type` parameter.
    """
    input_data['user_id'] = current_user.id
    node_type = camel_case_to_snake_case(node_type)
    flow_id = int(input_data.get('flow_id'))
    logger.info(f'Updating the data for flow: {flow_id}, node {input_data["node_id"]}')
    flow = flow_file_handler.get_flow(flow_id)
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    add_func = getattr(flow, 'add_' + node_type)
    parsed_input = None
    setting_name_ref = 'node' + node_type.replace('_', '')
    if add_func is None:
        raise HTTPException(404, 'could not find the function')
    try:
        ref = get_node_model(setting_name_ref)
        if ref:
            parsed_input = ref(**input_data)
    except Exception as e:
        raise HTTPException(421, str(e))
    if parsed_input is None:
        raise HTTPException(404, 'could not find the interface')
    try:
        add_func(parsed_input)
    except Exception as e:
        logger.error(e)
        raise HTTPException(419, str(f'error: {e}'))
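
A hedged client-side sketch of this endpoint: node_type is passed as a query parameter and the request body must match the node's Pydantic settings model. The filter payload shown here is purely illustrative (the real field names come from the corresponding input_schema model), and the base URL and token are placeholders.

import requests

payload = {
    "flow_id": 1,
    "node_id": 3,
    # The remaining keys must match the node's settings model (e.g. the filter node's schema);
    # the ones below are illustrative placeholders, not documented field names.
    "filter_input": {"advanced_filter": "[city] == 'Amsterdam'"},
}
resp = requests.post(
    "http://localhost:8000/update_settings/",
    params={"node_type": "filter"},
    json=payload,
    headers={"Authorization": "Bearer <access-token>"},  # placeholder token
)
resp.raise_for_status()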
add_node(flow_id, node_id, node_type, pos_x=0, pos_y=0)

Adds a new, unconfigured node (a "promise") to the flow graph.

Parameters:

Name Type Description Default
flow_id int

The ID of the flow to add the node to.

required
node_id int

The client-generated ID for the new node.

required
node_type str

The type of the node to add (e.g., 'filter', 'join').

required
pos_x int

The X coordinate for the node's position in the UI.

0
pos_y int

The Y coordinate for the node's position in the UI.

0
Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/add_node/', tags=['editor'])
def add_node(flow_id: int, node_id: int, node_type: str, pos_x: int = 0, pos_y: int = 0):
    """Adds a new, unconfigured node (a "promise") to the flow graph.

    Args:
        flow_id: The ID of the flow to add the node to.
        node_id: The client-generated ID for the new node.
        node_type: The type of the node to add (e.g., 'filter', 'join').
        pos_x: The X coordinate for the node's position in the UI.
        pos_y: The Y coordinate for the node's position in the UI.
    """
    flow = flow_file_handler.get_flow(flow_id)
    logger.info(f'Adding a promise for {node_type}')
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')
    node = flow.get_node(node_id)
    if node is not None:
        flow.delete_node(node_id)
    node_promise = input_schema.NodePromise(flow_id=flow_id, node_id=node_id, cache_results=False, pos_x=pos_x,
                                            pos_y=pos_y,
                                            node_type=node_type)
    if node_type == 'explore_data':
        flow.add_initial_node_analysis(node_promise)
        return
    else:
        logger.info("Adding node")
        flow.add_node_promise(node_promise)

    if nodes.check_if_has_default_setting(node_type):
        logger.info(f'Found standard settings for {node_type}, trying to upload them')
        setting_name_ref = 'node' + node_type.replace('_', '')
        node_model = get_node_model(setting_name_ref)
        add_func = getattr(flow, 'add_' + node_type)
        initial_settings = node_model(flow_id=flow_id, node_id=node_id, cache_results=False,
                                      pos_x=pos_x, pos_y=pos_y, node_type=node_type)
        add_func(initial_settings)
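
For example, a client could place an unconfigured filter node at position (100, 200) in flow 1 like this; the base URL is a placeholder and node_id is chosen by the client:

import requests

resp = requests.post(
    "http://localhost:8000/editor/add_node/",
    params={"flow_id": 1, "node_id": 7, "node_type": "filter", "pos_x": 100, "pos_y": 200},
)
resp.raise_for_status()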
cancel_flow(flow_id)

Cancels a currently running flow execution.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/flow/cancel/', tags=['editor'])
def cancel_flow(flow_id: int):
    """Cancels a currently running flow execution."""
    flow = flow_file_handler.get_flow(flow_id)
    if not flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is not running')
    flow.cancel()
close_flow(flow_id)

Closes an active flow session.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/close_flow/', tags=['editor'])
def close_flow(flow_id: int) -> None:
    """Closes an active flow session."""
    flow_file_handler.delete_flow(flow_id)
connect_node(flow_id, node_connection)

Creates a connection (edge) between two nodes in the flow graph.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/connect_node/', tags=['editor'])
def connect_node(flow_id: int, node_connection: input_schema.NodeConnection):
    """Creates a connection (edge) between two nodes in the flow graph."""
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        logger.info('could not find the flow')
        raise HTTPException(404, 'could not find the flow')
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')
    add_connection(flow, node_connection)
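
The exact shape of input_schema.NodeConnection is not reproduced in this reference; the sketch below only assumes it nests an output_connection and an input_connection keyed by node_id, as the delete endpoint's log message suggests, so treat the payload as illustrative.

import requests

connection = {
    "output_connection": {"node_id": 7},  # upstream node (assumed field layout)
    "input_connection": {"node_id": 9},   # downstream node (assumed field layout)
}
resp = requests.post(
    "http://localhost:8000/editor/connect_node/",
    params={"flow_id": 1},
    json=connection,
)
resp.raise_for_status()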
copy_node(node_id_to_copy_from, flow_id_to_copy_from, node_promise)

Copies an existing node's settings to a new node promise.

Parameters:

Name Type Description Default
node_id_to_copy_from int

The ID of the node to copy the settings from.

required
flow_id_to_copy_from int

The ID of the flow containing the source node.

required
node_promise NodePromise

A NodePromise representing the new node to be created.

required
Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/copy_node', tags=['editor'])
def copy_node(node_id_to_copy_from: int, flow_id_to_copy_from: int, node_promise: input_schema.NodePromise):
    """Copies an existing node's settings to a new node promise.

    Args:
        node_id_to_copy_from: The ID of the node to copy the settings from.
        flow_id_to_copy_from: The ID of the flow containing the source node.
        node_promise: A `NodePromise` representing the new node to be created.
    """
    try:
        flow_to_copy_from = flow_file_handler.get_flow(flow_id_to_copy_from)
        flow = (flow_to_copy_from
                if flow_id_to_copy_from == node_promise.flow_id
                else flow_file_handler.get_flow(node_promise.flow_id)
                )
        node_to_copy = flow_to_copy_from.get_node(node_id_to_copy_from)
        logger.info(f"Copying data {node_promise.node_type}")

        if flow.flow_settings.is_running:
            raise HTTPException(422, "Flow is running")

        if flow.get_node(node_promise.node_id) is not None:
            flow.delete_node(node_promise.node_id)

        if node_promise.node_type == "explore_data":
            flow.add_initial_node_analysis(node_promise)
            return

        flow.copy_node(node_promise, node_to_copy.setting_input, node_to_copy.node_type)

    except Exception as e:
        logger.error(e)
        raise HTTPException(422, str(e))
create_db_connection(input_connection, current_user=Depends(get_current_active_user), db=Depends(get_db))

Creates and securely stores a new database connection.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post("/db_connection_lib", tags=['db_connections'])
def create_db_connection(input_connection: input_schema.FullDatabaseConnection,
                         current_user=Depends(get_current_active_user),
                         db: Session = Depends(get_db)
                         ):
    """Creates and securely stores a new database connection."""
    logger.info(f'Creating database connection {input_connection.connection_name}')
    try:
        store_database_connection(db, input_connection, current_user.id)
    except ValueError:
        raise HTTPException(422, 'Connection name already exists')
    except Exception as e:
        logger.error(e)
        raise HTTPException(422, str(e))
    return {"message": "Database connection created successfully"}
create_directory(new_directory)

Creates a new directory at the specified path.

Parameters:

Name Type Description Default
new_directory NewDirectory

An input_schema.NewDirectory object with the path and name.

required

Returns:

Type Description
bool

True if the directory was created successfully.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/files/create_directory', response_model=output_model.OutputDir, tags=['file manager'])
def create_directory(new_directory: input_schema.NewDirectory) -> bool:
    """Creates a new directory at the specified path.

    Args:
        new_directory: An `input_schema.NewDirectory` object with the path and name.

    Returns:
        `True` if the directory was created successfully.
    """
    result, error = create_dir(new_directory)
    if result:
        return True
    else:
        raise error
create_flow(flow_path)

Creates a new, empty flow file at the specified path and registers a session for it.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/create_flow/', tags=['editor'])
def create_flow(flow_path: str):
    """Creates a new, empty flow file at the specified path and registers a session for it."""
    flow_path = Path(flow_path)
    logger.info('Creating flow')
    return flow_file_handler.add_flow(name=flow_path.stem, flow_path=str(flow_path))
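
A minimal sketch of creating and registering a new flow from a client (placeholder base URL; the response body is whatever flow_file_handler.add_flow returns):

import requests

resp = requests.post(
    "http://localhost:8000/editor/create_flow/",
    params={"flow_path": "/tmp/my_pipeline.flowfile"},
)
resp.raise_for_status()
print(resp.json())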
delete_db_connection(connection_name, current_user=Depends(get_current_active_user), db=Depends(get_db))

Deletes a stored database connection.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.delete('/db_connection_lib', tags=['db_connections'])
def delete_db_connection(connection_name: str,
                         current_user=Depends(get_current_active_user),
                         db: Session = Depends(get_db)
                         ):
    """Deletes a stored database connection."""
    logger.info(f'Deleting database connection {connection_name}')
    db_connection = get_database_connection(db, connection_name, current_user.id)
    if db_connection is None:
        raise HTTPException(404, 'Database connection not found')
    delete_database_connection(db, connection_name, current_user.id)
    return {"message": "Database connection deleted successfully"}
delete_node(flow_id, node_id)

Deletes a node from the flow graph.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/delete_node/', tags=['editor'])
def delete_node(flow_id: Optional[int], node_id: int):
    """Deletes a node from the flow graph."""
    logger.info('Deleting node')
    flow = flow_file_handler.get_flow(flow_id)
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')
    flow.delete_node(node_id)
delete_node_connection(flow_id, node_connection=None)

Deletes a connection (edge) between two nodes.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/delete_connection/', tags=['editor'])
def delete_node_connection(flow_id: int, node_connection: input_schema.NodeConnection = None):
    """Deletes a connection (edge) between two nodes."""
    flow_id = int(flow_id)
    logger.info(
        f'Deleting connection node {node_connection.output_connection.node_id} to node {node_connection.input_connection.node_id}')
    flow = flow_file_handler.get_flow(flow_id)
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')
    delete_connection(flow, node_connection)
get_active_flow_file_sessions() async

Retrieves a list of all currently active flow sessions.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/active_flowfile_sessions/', response_model=List[schemas.FlowSettings])
async def get_active_flow_file_sessions() -> List[schemas.FlowSettings]:
    """Retrieves a list of all currently active flow sessions."""
    return [flf.flow_settings for flf in flow_file_handler.flowfile_flows]
get_current_directory_contents(file_types=None, include_hidden=False) async

Gets the contents of the file explorer's current directory.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/files/current_directory_contents/', response_model=List[FileInfo], tags=['file manager'])
async def get_current_directory_contents(file_types: List[str] = None, include_hidden: bool = False) -> List[FileInfo]:
    """Gets the contents of the file explorer's current directory."""
    return file_explorer.list_contents(file_types=file_types, show_hidden=include_hidden)
get_current_files() async

Gets the contents of the file explorer's current directory.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/files/tree/', response_model=List[FileInfo], tags=['file manager'])
async def get_current_files() -> List[FileInfo]:
    """Gets the contents of the file explorer's current directory."""
    f = file_explorer.list_contents()
    return f
get_current_path() async

Returns the current absolute path of the file explorer.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/files/current_path/', response_model=str, tags=['file manager'])
async def get_current_path() -> str:
    """Returns the current absolute path of the file explorer."""
    return str(file_explorer.current_path)
get_db_connections(db=Depends(get_db), current_user=Depends(get_current_active_user))

Retrieves all stored database connections for the current user (without passwords).

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/db_connection_lib', tags=['db_connections'],
            response_model=List[input_schema.FullDatabaseConnectionInterface])
def get_db_connections(
        db: Session = Depends(get_db),
        current_user=Depends(get_current_active_user)) -> List[input_schema.FullDatabaseConnectionInterface]:
    """Retrieves all stored database connections for the current user (without passwords)."""
    return get_all_database_connections_interface(db, current_user.id)
get_description_node(flow_id, node_id)

Retrieves the description text for a specific node.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node/description', tags=['editor'])
def get_description_node(flow_id: int, node_id: int):
    """Retrieves the description text for a specific node."""
    try:
        node = flow_file_handler.get_flow(flow_id).get_node(node_id)
    except:
        raise HTTPException(404, 'Could not find the node')
    if node is None:
        raise HTTPException(404, 'Could not find the node')
    return node.setting_input.description
get_directory_contents(directory, file_types=None, include_hidden=False) async

Gets the contents of an arbitrary directory path.

Parameters:

Name Type Description Default
directory str

The absolute path to the directory.

required
file_types List[str]

An optional list of file extensions to filter by.

None
include_hidden bool

If True, includes hidden files and directories.

False

Returns:

Type Description
List[FileInfo]

A list of FileInfo objects representing the directory's contents.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/files/directory_contents/', response_model=List[FileInfo], tags=['file manager'])
async def get_directory_contents(directory: str, file_types: List[str] = None,
                                 include_hidden: bool = False) -> List[FileInfo]:
    """Gets the contents of an arbitrary directory path.

    Args:
        directory: The absolute path to the directory.
        file_types: An optional list of file extensions to filter by.
        include_hidden: If True, includes hidden files and directories.

    Returns:
        A list of `FileInfo` objects representing the directory's contents.
    """
    directory_explorer = FileExplorer(directory)
    try:
        return directory_explorer.list_contents(show_hidden=include_hidden, file_types=file_types)
    except Exception as e:
        logger.error(e)
        HTTPException(404, 'Could not access the directory')
get_downstream_node_ids(flow_id, node_id) async

Gets a list of all node IDs that are downstream dependencies of a given node.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node/downstream_node_ids', response_model=List[int], tags=['editor'])
async def get_downstream_node_ids(flow_id: int, node_id: int) -> List[int]:
    """Gets a list of all node IDs that are downstream dependencies of a given node."""
    flow = flow_file_handler.get_flow(flow_id)
    node = flow.get_node(node_id)
    return list(node.get_all_dependent_node_ids())
get_excel_sheet_names(path) async

Retrieves the sheet names from an Excel file.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/api/get_xlsx_sheet_names', tags=['excel_reader'], response_model=List[str])
async def get_excel_sheet_names(path: str) -> List[str] | None:
    """Retrieves the sheet names from an Excel file."""
    sheet_names = excel_file_manager.get_sheet_names(path)
    if sheet_names:
        return sheet_names
    else:
        raise HTTPException(404, 'File not found')
get_expression_doc()

Retrieves documentation for available Polars expressions.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/editor/expression_doc', tags=['editor'], response_model=List[output_model.ExpressionsOverview])
def get_expression_doc() -> List[output_model.ExpressionsOverview]:
    """Retrieves documentation for available Polars expressions."""
    return get_expression_overview()
get_expressions()

Retrieves a list of all available Flowfile expression names.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/editor/expressions', tags=['editor'], response_model=List[str])
def get_expressions() -> List[str]:
    """Retrieves a list of all available Flowfile expression names."""
    return get_all_expressions()
get_flow(flow_id)

Retrieves the settings for a specific flow.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/editor/flow', tags=['editor'], response_model=schemas.FlowSettings)
def get_flow(flow_id: int):
    """Retrieves the settings for a specific flow."""
    flow_id = int(flow_id)
    result = get_flow_settings(flow_id)
    return result
get_flow_frontend_data(flow_id=1)

Retrieves the data needed to render the flow graph in the frontend.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/flow_data', tags=['manager'])
def get_flow_frontend_data(flow_id: Optional[int] = 1):
    """Retrieves the data needed to render the flow graph in the frontend."""
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    return flow.get_frontend_data()
get_flow_settings(flow_id=1)

Retrieves the main settings for a flow.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/flow_settings', tags=['manager'], response_model=schemas.FlowSettings)
def get_flow_settings(flow_id: Optional[int] = 1) -> schemas.FlowSettings:
    """Retrieves the main settings for a flow."""
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    return flow.flow_settings
get_generated_code(flow_id)

Generates and returns a Python script with Polars code representing the flow.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get("/editor/code_to_polars", tags=[], response_model=str)
def get_generated_code(flow_id: int) -> str:
    """Generates and returns a Python script with Polars code representing the flow."""
    flow_id = int(flow_id)
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    return export_flow_to_polars(flow)
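
For instance, the generated script can be fetched and written to disk for standalone use (placeholder base URL); because the response_model is str, the body is a JSON-encoded string:

import requests

resp = requests.get("http://localhost:8000/editor/code_to_polars", params={"flow_id": 1})
resp.raise_for_status()
with open("exported_pipeline.py", "w") as fh:
    fh.write(resp.json())  # the endpoint returns the Polars script as a JSON string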
get_graphic_walker_input(flow_id, node_id)

Gets the data and configuration for the Graphic Walker data exploration tool.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/analysis_data/graphic_walker_input', tags=['analysis'], response_model=input_schema.NodeExploreData)
def get_graphic_walker_input(flow_id: int, node_id: int):
    """Gets the data and configuration for the Graphic Walker data exploration tool."""
    flow = flow_file_handler.get_flow(flow_id)
    node = flow.get_node(node_id)
    if node.results.analysis_data_generator is None:
        logger.error('The data is not refreshed and available for analysis')
        raise HTTPException(422, 'The data is not refreshed and available for analysis')
    return AnalyticsProcessor.process_graphic_walker_input(node)
get_instant_function_result(flow_id, node_id, func_string) async

Executes a simple, instant function on a node's data and returns the result.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/custom_functions/instant_result', tags=[])
async def get_instant_function_result(flow_id: int, node_id: int, func_string: str):
    """Executes a simple, instant function on a node's data and returns the result."""
    try:
        node = flow_file_handler.get_node(flow_id, node_id)
        result = await asyncio.to_thread(get_instant_func_results, node, func_string)
        return result
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
get_list_of_saved_flows(path)

Scans a directory for saved flow files (.flowfile).

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/files/available_flow_files', tags=['editor'], response_model=List[FileInfo])
def get_list_of_saved_flows(path: str):
    """Scans a directory for saved flow files (`.flowfile`)."""
    try:
        return get_files_from_directory(path, types=['flowfile'])
    except:
        return []
get_local_files(directory) async

Retrieves a list of files from a specified local directory.

Parameters:

Name Type Description Default
directory str

The absolute path of the directory to scan.

required

Returns:

Type Description
List[FileInfo]

A list of FileInfo objects for each item in the directory.

Raises:

Type Description
HTTPException

404 if the directory does not exist.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/files/files_in_local_directory/', response_model=List[FileInfo], tags=['file manager'])
async def get_local_files(directory: str) -> List[FileInfo]:
    """Retrieves a list of files from a specified local directory.

    Args:
        directory: The absolute path of the directory to scan.

    Returns:
        A list of `FileInfo` objects for each item in the directory.

    Raises:
        HTTPException: 404 if the directory does not exist.
    """
    files = get_files_from_directory(directory)
    if files is None:
        raise HTTPException(404, 'Directory does not exist')
    return files
get_node(flow_id, node_id, get_data=False)

Retrieves the complete state and data preview for a single node.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node', response_model=output_model.NodeData, tags=['editor'])
def get_node(flow_id: int, node_id: int, get_data: bool = False):
    """Retrieves the complete state and data preview for a single node."""
    logging.info(f'Getting node {node_id} from flow {flow_id}')
    flow = flow_file_handler.get_flow(flow_id)
    node = flow.get_node(node_id)
    if node is None:
        raise HTTPException(422, 'Not found')
    v = node.get_node_data(flow_id=flow.flow_id, include_example=get_data)
    return v
get_node_list()

Retrieves the list of all available node types and their templates.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node_list', response_model=List[nodes.NodeTemplate])
def get_node_list() -> List[nodes.NodeTemplate]:
    """Retrieves the list of all available node types and their templates."""
    return nodes.nodes_list
get_node_model(setting_name_ref)

(Internal) Retrieves a node's Pydantic model from the input_schema module by its name.

Source code in flowfile_core/flowfile_core/routes/routes.py
def get_node_model(setting_name_ref: str):
    """(Internal) Retrieves a node's Pydantic model from the input_schema module by its name."""
    logger.info("Getting node model for: " + setting_name_ref)
    for ref_name, ref in inspect.getmodule(input_schema).__dict__.items():
        if ref_name.lower() == setting_name_ref:
            return ref
    logger.error(f"Could not find node model for: {setting_name_ref}")
get_run_status(flow_id, response)

Retrieves the run status information for a specific flow.

Returns a 202 Accepted status while the flow is running, and 200 OK when finished.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/flow/run_status/', tags=['editor'],
            response_model=output_model.RunInformation)
def get_run_status(flow_id: int, response: Response):
    """Retrieves the run status information for a specific flow.

    Returns a 202 Accepted status while the flow is running, and 200 OK when finished.
    """
    flow = flow_file_handler.get_flow(flow_id)
    if not flow:
        raise HTTPException(status_code=404, detail="Flow not found")
    if flow.flow_settings.is_running:
        response.status_code = status.HTTP_202_ACCEPTED
        return flow.get_run_info()
    response.status_code = status.HTTP_200_OK
    return flow.get_run_info()
get_table_example(flow_id, node_id)

Retrieves a data preview (schema and sample rows) for a node's output.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node/data', response_model=output_model.TableExample, tags=['editor'])
def get_table_example(flow_id: int, node_id: int):
    """Retrieves a data preview (schema and sample rows) for a node's output."""
    flow = flow_file_handler.get_flow(flow_id)
    node = flow.get_node(node_id)
    return node.get_table_example(True)
get_vue_flow_data(flow_id)

Retrieves the flow data formatted for the Vue-based frontend.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/flow_data/v2', tags=['manager'])
def get_vue_flow_data(flow_id: int) -> schemas.VueFlowInput:
    """Retrieves the flow data formatted for the Vue-based frontend."""
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    data = flow.get_vue_flow_input()
    return data
import_saved_flow(flow_path)

Imports a flow from a saved .flowfile and registers it as a new session.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/import_flow/', tags=['editor'], response_model=int)
def import_saved_flow(flow_path: str) -> int:
    """Imports a flow from a saved `.flowfile` and registers it as a new session."""
    flow_path = Path(flow_path)
    if not flow_path.exists():
        raise HTTPException(404, 'File not found')
    return flow_file_handler.import_flow(flow_path)
navigate_into_directory(directory_name) async

Navigates the file explorer into a specified subdirectory.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/files/navigate_into/', response_model=str, tags=['file manager'])
async def navigate_into_directory(directory_name: str) -> str:
    """Navigates the file explorer into a specified subdirectory."""
    file_explorer.navigate_into(directory_name)
    return str(file_explorer.current_path)
navigate_to_directory(directory_name) async

Navigates the file explorer to an absolute directory path.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/files/navigate_to/', tags=['file manager'])
async def navigate_to_directory(directory_name: str) -> str:
    """Navigates the file explorer to an absolute directory path."""
    file_explorer.navigate_to(directory_name)
    return str(file_explorer.current_path)
navigate_up() async

Navigates the file explorer one directory level up.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/files/navigate_up/', response_model=str, tags=['file manager'])
async def navigate_up() -> str:
    """Navigates the file explorer one directory level up."""
    file_explorer.navigate_up()
    return str(file_explorer.current_path)
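
The three navigation endpoints above mutate server-side file-explorer state, so a typical browsing sequence chains them with the current-path and directory-contents endpoints documented earlier. A sketch with placeholder base URL and directory names:

import requests

BASE = "http://localhost:8000"

print(requests.get(f"{BASE}/files/current_path/").json())                        # current explorer location
requests.post(f"{BASE}/files/navigate_to/", params={"directory_name": "/data"})
requests.post(f"{BASE}/files/navigate_into/", params={"directory_name": "raw"})
for item in requests.get(f"{BASE}/files/current_directory_contents/").json():
    print(item)                                                                   # FileInfo entries
requests.post(f"{BASE}/files/navigate_up/")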
register_flow(flow_data)

Registers a new flow session with the application.

Parameters:

Name Type Description Default
flow_data FlowSettings

The FlowSettings for the new flow.

required

Returns:

Type Description
int

The ID of the newly registered flow.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/flow/register/', tags=['editor'])
def register_flow(flow_data: schemas.FlowSettings) -> int:
    """Registers a new flow session with the application.

    Args:
        flow_data: The `FlowSettings` for the new flow.

    Returns:
        The ID of the newly registered flow.
    """
    return flow_file_handler.register_flow(flow_data)
run_flow(flow_id, background_tasks) async

Executes a flow in a background task.

Parameters:

Name Type Description Default
flow_id int

The ID of the flow to execute.

required
background_tasks BackgroundTasks

FastAPI's background task runner.

required

Returns:

Type Description
JSONResponse

A JSON response indicating that the flow has started.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/flow/run/', tags=['editor'])
async def run_flow(flow_id: int, background_tasks: BackgroundTasks) -> JSONResponse:
    """Executes a flow in a background task.

    Args:
        flow_id: The ID of the flow to execute.
        background_tasks: FastAPI's background task runner.

    Returns:
        A JSON response indicating that the flow has started.
    """
    logger.info('starting to run...')
    flow = flow_file_handler.get_flow(flow_id)
    lock = get_flow_run_lock(flow_id)
    async with lock:
        if flow.flow_settings.is_running:
            raise HTTPException(422, 'Flow is already running')
        background_tasks.add_task(flow.run_graph)
    return JSONResponse(content={"message": "Data started", "flow_id": flow_id}, status_code=status.HTTP_200_OK)
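
Because execution happens in a background task, a client typically starts the run and then polls the run-status endpoint described earlier, which responds with 202 while the flow is still running and 200 once it has finished. A minimal polling sketch (placeholder base URL):

import time
import requests

BASE = "http://localhost:8000"

requests.post(f"{BASE}/flow/run/", params={"flow_id": 1}).raise_for_status()

while True:
    status_resp = requests.get(f"{BASE}/flow/run_status/", params={"flow_id": 1})
    if status_resp.status_code == 200:  # 202 means the flow is still running
        break
    time.sleep(1)
print("success:", status_resp.json()["success"])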
save_flow(flow_id, flow_path=None)

Saves the current state of a flow to a .flowfile.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/save_flow', tags=['editor'])
def save_flow(flow_id: int, flow_path: str = None):
    """Saves the current state of a flow to a `.flowfile`."""
    flow = flow_file_handler.get_flow(flow_id)
    flow.save_flow(flow_path=flow_path)
update_description_node(flow_id, node_id, description=Body(...))

Updates the description text for a specific node.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/node/description/', tags=['editor'])
def update_description_node(flow_id: int, node_id: int, description: str = Body(...)):
    """Updates the description text for a specific node."""
    try:
        node = flow_file_handler.get_flow(flow_id).get_node(node_id)
    except:
        raise HTTPException(404, 'Could not find the node')
    node.setting_input.description = description
    return True
update_flow_settings(flow_settings)

Updates the main settings for a flow.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/flow_settings', tags=['manager'])
def update_flow_settings(flow_settings: schemas.FlowSettings):
    """Updates the main settings for a flow."""
    flow = flow_file_handler.get_flow(flow_settings.flow_id)
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    flow.flow_settings = flow_settings
upload_file(file=File(...)) async

Uploads a file to the server's 'uploads' directory.

Parameters:

Name Type Description Default
file UploadFile

The file to be uploaded.

File(...)

Returns:

Type Description
JSONResponse

A JSON response containing the filename and the path where it was saved.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post("/upload/")
async def upload_file(file: UploadFile = File(...)) -> JSONResponse:
    """Uploads a file to the server's 'uploads' directory.

    Args:
        file: The file to be uploaded.

    Returns:
        A JSON response containing the filename and the path where it was saved.
    """
    file_location = f"uploads/{file.filename}"
    with open(file_location, "wb+") as file_object:
        file_object.write(file.file.read())
    return JSONResponse(content={"filename": file.filename, "filepath": file_location})
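
A quick upload sketch with requests (placeholder base URL); the server writes the file to an 'uploads/' directory relative to its working directory and echoes the path back:

import requests

with open("sales.csv", "rb") as fh:
    resp = requests.post("http://localhost:8000/upload/", files={"file": fh})
resp.raise_for_status()
print(resp.json())  # e.g. {"filename": "sales.csv", "filepath": "uploads/sales.csv"}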
validate_db_settings(database_settings, current_user=Depends(get_current_active_user)) async

Validates that a connection can be made to a database with the given settings.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post("/validate_db_settings")
async def validate_db_settings(
        database_settings: input_schema.DatabaseSettings,
        current_user=Depends(get_current_active_user)
):
    """Validates that a connection can be made to a database with the given settings."""
    # Validate the query settings
    try:
        sql_source = create_sql_source_from_db_settings(database_settings, user_id=current_user.id)
        sql_source.validate()
        return {"message": "Query settings are valid"}
    except Exception as e:
        raise HTTPException(status_code=422, detail=str(e))

auth

flowfile_core.routes.auth

cloud_connections

flowfile_core.routes.cloud_connections

Functions:

Name Description
create_cloud_storage_connection

Create a new cloud storage connection.

delete_cloud_connection_with_connection_name

Delete a cloud connection.

get_cloud_connections

Get all cloud storage connections for the current user.

create_cloud_storage_connection(input_connection, current_user=Depends(get_current_active_user), db=Depends(get_db))

Create a new cloud storage connection.

Parameters:

  • input_connection: FullCloudStorageConnection schema containing connection details
  • current_user: User obtained from Depends(get_current_active_user)
  • db: Session obtained from Depends(get_db)

Returns:

  • Dict with a success message

Source code in flowfile_core/flowfile_core/routes/cloud_connections.py
@router.post("/cloud_connection", tags=['cloud_connections'])
def create_cloud_storage_connection(input_connection: FullCloudStorageConnection,
                                    current_user=Depends(get_current_active_user),
                                    db: Session = Depends(get_db)
                                    ):
    """
    Create a new cloud storage connection.
    Parameters
        input_connection: FullCloudStorageConnection schema containing connection details
        current_user: User obtained from Depends(get_current_active_user)
        db: Session obtained from Depends(get_db)
    Returns
        Dict with a success message
    """
    logger.info(f'Create cloud connection {input_connection.connection_name}')
    try:
        store_cloud_connection(db, input_connection, current_user.id)
    except ValueError:
        raise HTTPException(422, 'Connection name already exists')
    except Exception as e:
        logger.error(e)
        raise HTTPException(422, str(e))
    return {"message": "Cloud connection created successfully"}
delete_cloud_connection_with_connection_name(connection_name, current_user=Depends(get_current_active_user), db=Depends(get_db))

Delete a cloud connection.

Source code in flowfile_core/flowfile_core/routes/cloud_connections.py
@router.delete('/cloud_connection', tags=['cloud_connections'])
def delete_cloud_connection_with_connection_name(connection_name: str,
                                                 current_user=Depends(get_current_active_user),
                                                 db: Session = Depends(get_db)
                                                 ):
    """
    Delete a cloud connection.
    """
    logger.info(f'Deleting cloud connection {connection_name}')
    cloud_storage_connection = get_cloud_connection_schema(db, connection_name, current_user.id)
    if cloud_storage_connection is None:
        raise HTTPException(404, 'Cloud connection connection not found')
    delete_cloud_connection(db, connection_name, current_user.id)
    return {"message": "Cloud connection deleted successfully"}
get_cloud_connections(db=Depends(get_db), current_user=Depends(get_current_active_user))

Get all cloud storage connections for the current user.

Parameters:

  • db: Session obtained from Depends(get_db)
  • current_user: User obtained from Depends(get_current_active_user)

Returns:

  • List[FullCloudStorageConnectionInterface]

Source code in flowfile_core/flowfile_core/routes/cloud_connections.py
@router.get('/cloud_connections', tags=['cloud_connection'],
            response_model=List[FullCloudStorageConnectionInterface])
def get_cloud_connections(
        db: Session = Depends(get_db),
        current_user=Depends(get_current_active_user)) -> List[FullCloudStorageConnectionInterface]:
    """
    Get all cloud storage connections for the current user.
    Parameters
        db: Session obtained from Depends(get_db)
        current_user: User obtained from Depends(get_current_active_user)

    Returns
        List[FullCloudStorageConnectionInterface]
    """
    return get_all_cloud_connections_interface(db, current_user.id)

logs

flowfile_core.routes.logs

Functions:

Name Description
add_log

Adds a log message to the log file for a given flow_id.

add_raw_log

Adds a log message to the log file for a given flow_id.

format_sse_message

Format the data as a proper SSE message

stream_logs

Streams logs for a given flow_id using Server-Sent Events.

add_log(flow_id, log_message) async

Adds a log message to the log file for a given flow_id.

Source code in flowfile_core/flowfile_core/routes/logs.py
@router.post("/logs/{flow_id}", tags=['flow_logging'])
async def add_log(flow_id: int, log_message: str):
    """Adds a log message to the log file for a given flow_id."""
    flow = flow_file_handler.get_flow(flow_id)
    if not flow:
        raise HTTPException(status_code=404, detail="Flow not found")
    flow.flow_logger.info(log_message)
    return {"message": "Log added successfully"}
add_raw_log(raw_log_input) async

Adds a log message to the log file for a given flow_id.

Source code in flowfile_core/flowfile_core/routes/logs.py
@router.post("/raw_logs", tags=['flow_logging'])
async def add_raw_log(raw_log_input: schemas.RawLogInput):
    """Adds a log message to the log file for a given flow_id."""
    logger.info('Adding raw logs')
    flow = flow_file_handler.get_flow(raw_log_input.flowfile_flow_id)
    if not flow:
        raise HTTPException(status_code=404, detail="Flow not found")
    flow.flow_logger.get_log_filepath()
    flow_logger = flow.flow_logger
    flow_logger.get_log_filepath()
    if raw_log_input.log_type == 'INFO':
        flow_logger.info(raw_log_input.log_message,
                         extra=raw_log_input.extra)
    elif raw_log_input.log_type == 'ERROR':
        flow_logger.error(raw_log_input.log_message,
                          extra=raw_log_input.extra)
    return {"message": "Log added successfully"}
format_sse_message(data) async

Format the data as a proper SSE message

Source code in flowfile_core/flowfile_core/routes/logs.py
async def format_sse_message(data: str) -> str:
    """Format the data as a proper SSE message"""
    return f"data: {json.dumps(data)}\n\n"
stream_logs(flow_id, idle_timeout=300, current_user=Depends(get_current_user_from_query)) async

Streams logs for a given flow_id using Server-Sent Events. Requires authentication via token in query parameter. The connection will close gracefully if the server shuts down.

Source code in flowfile_core/flowfile_core/routes/logs.py
@router.get("/logs/{flow_id}", tags=['flow_logging'])
async def stream_logs(
    flow_id: int,
    idle_timeout: int = 300,
    current_user=Depends(get_current_user_from_query)
):
    """
    Streams logs for a given flow_id using Server-Sent Events.
    Requires authentication via token in query parameter.
    The connection will close gracefully if the server shuts down.
    """
    logger.info(f"Starting log stream for flow_id: {flow_id} by user: {current_user.username}")
    await asyncio.sleep(.3)
    flow = flow_file_handler.get_flow(flow_id)
    logger.info('Streaming logs')
    if not flow:
        raise HTTPException(status_code=404, detail="Flow not found")

    log_file_path = flow.flow_logger.get_log_filepath()
    if not Path(log_file_path).exists():
        raise HTTPException(status_code=404, detail="Log file not found")

    class RunningState:
        def __init__(self):
            self.has_started = False

        def is_running(self):
            if flow.flow_settings.is_running:
                self.has_started = True
            return flow.flow_settings.is_running or not self.has_started

    running_state = RunningState()

    return StreamingResponse(
        stream_log_file(log_file_path, running_state.is_running, idle_timeout),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Content-Type": "text/event-stream",
        }
    )
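
Each SSE message produced by format_sse_message is a "data: <json>" line, so a client can read the stream line by line. The query-parameter name used for the token is not shown in this reference, so 'token' below is an assumption, as is the base URL.

import json
import requests

resp = requests.get(
    "http://localhost:8000/logs/1",
    params={"token": "<access-token>", "idle_timeout": 60},  # 'token' parameter name is assumed
    stream=True,
)
for line in resp.iter_lines(decode_unicode=True):
    if line and line.startswith("data: "):
        print(json.loads(line[len("data: "):]))  # each message carries a JSON-encoded log line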

public

flowfile_core.routes.public

Functions:

Name Description
docs_redirect

Redirects to the documentation page.

docs_redirect() async

Redirects to the documentation page.

Source code in flowfile_core/flowfile_core/routes/public.py
@router.get("/", tags=['admin'])
async def docs_redirect():
    """ Redirects to the documentation page."""
    return RedirectResponse(url='/docs')

secrets

flowfile_core.routes.secrets

Manages CRUD (Create, Read, Update, Delete) operations for secrets.

This router provides secure endpoints for creating, retrieving, and deleting sensitive credentials for the authenticated user. Secrets are encrypted before being stored and are associated with the user's ID.

Functions:

Name Description
create_secret

Creates a new secret for the authenticated user.

delete_secret

Deletes a secret by name for the authenticated user.

get_secret

Retrieves a specific secret by name for the authenticated user.

get_secrets

Retrieves all secret names for the currently authenticated user.

create_secret(secret, current_user=Depends(get_current_active_user), db=Depends(get_db)) async

Creates a new secret for the authenticated user.

The secret value is encrypted before being stored in the database. A secret name must be unique for a given user.

Parameters:

Name Type Description Default
secret SecretInput

A SecretInput object containing the name and plaintext value of the secret.

required
current_user

The authenticated user object, injected by FastAPI.

Depends(get_current_active_user)
db Session

The database session, injected by FastAPI.

Depends(get_db)

Raises:

Type Description
HTTPException

400 if a secret with the same name already exists for the user.

Returns:

Type Description
Secret

A Secret object containing the name and the encrypted value.

Source code in flowfile_core/flowfile_core/routes/secrets.py
@router.post("/secrets", response_model=Secret)
async def create_secret(secret: SecretInput, current_user=Depends(get_current_active_user),
                        db: Session = Depends(get_db)) -> Secret:
    """Creates a new secret for the authenticated user.

    The secret value is encrypted before being stored in the database. A secret
    name must be unique for a given user.

    Args:
        secret: A `SecretInput` object containing the name and plaintext value of the secret.
        current_user: The authenticated user object, injected by FastAPI.
        db: The database session, injected by FastAPI.

    Raises:
        HTTPException: 400 if a secret with the same name already exists for the user.

    Returns:
        A `Secret` object containing the name and the *encrypted* value.
    """
    # Get user ID
    user_id = 1 if os.environ.get("FLOWFILE_MODE") == "electron" else current_user.id

    existing_secret = db.query(db_models.Secret).filter(
        db_models.Secret.user_id == user_id,
        db_models.Secret.name == secret.name
    ).first()

    if existing_secret:
        raise HTTPException(status_code=400, detail="Secret with this name already exists")

    # The store_secret function handles encryption and DB storage
    stored_secret = store_secret(db, secret, user_id)
    return Secret(name=stored_secret.name, value=stored_secret.encrypted_value, user_id=str(user_id))
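
A hedged example of storing a secret (placeholder base URL and token); the SecretInput field names are assumed from the docstring's "name and plaintext value", and the response carries the encrypted value, not the plaintext:

import requests

resp = requests.post(
    "http://localhost:8000/secrets",
    json={"name": "postgres_password", "value": "hunter2"},  # field names assumed from SecretInput
    headers={"Authorization": "Bearer <access-token>"},       # placeholder token
)
resp.raise_for_status()
print(resp.json()["name"])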
delete_secret(secret_name, current_user=Depends(get_current_active_user), db=Depends(get_db)) async

Deletes a secret by name for the authenticated user.

Parameters:

    secret_name (str): The name of the secret to delete. Required.
    current_user: The authenticated user object, injected by FastAPI. Default: Depends(get_current_active_user).
    db (Session): The database session, injected by FastAPI. Default: Depends(get_db).

Returns:

    None: An empty response with a 204 No Content status code upon success.

Source code in flowfile_core/flowfile_core/routes/secrets.py
@router.delete("/secrets/{secret_name}", status_code=204)
async def delete_secret(secret_name: str, current_user=Depends(get_current_active_user),
                        db: Session = Depends(get_db)) -> None:
    """Deletes a secret by name for the authenticated user.

    Args:
        secret_name: The name of the secret to delete.
        current_user: The authenticated user object, injected by FastAPI.
        db: The database session, injected by FastAPI.

    Returns:
        An empty response with a 204 No Content status code upon success.
    """
    # Get user ID
    user_id = 1 if os.environ.get("FLOWFILE_MODE") == "electron" else current_user.id
    delete_secret_action(db, secret_name, user_id)
    return None
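
A corresponding client-side sketch, under the same assumptions as above (base URL, router mount point, and bearer-token auth are hypothetical):

import requests

BASE_URL = "http://localhost:8000"               # hypothetical host/port
HEADERS = {"Authorization": "Bearer <token>"}    # assumed bearer-token auth

# DELETE /secrets/{secret_name}; a successful deletion returns 204 No Content.
resp = requests.delete(f"{BASE_URL}/secrets/my_database_password", headers=HEADERS)
assert resp.status_code == 204
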
get_secret(secret_name, current_user=Depends(get_current_active_user), db=Depends(get_db)) async

Retrieves a specific secret by name for the authenticated user.

Note: This endpoint returns the secret name and metadata but does not expose the decrypted secret value.

Parameters:

    secret_name (str): The name of the secret to retrieve. Required.
    current_user: The authenticated user object, injected by FastAPI. Default: Depends(get_current_active_user).
    db (Session): The database session, injected by FastAPI. Default: Depends(get_db).

Raises:

    HTTPException: 404 if the secret is not found.

Returns:

    Secret: A Secret object containing the name and encrypted value.

Source code in flowfile_core/flowfile_core/routes/secrets.py
@router.get("/secrets/{secret_name}", response_model=Secret)
async def get_secret(secret_name: str,
                     current_user=Depends(get_current_active_user), db: Session = Depends(get_db)) -> Secret:
    """Retrieves a specific secret by name for the authenticated user.

    Note: This endpoint returns the secret name and metadata but does not
    expose the decrypted secret value.

    Args:
        secret_name: The name of the secret to retrieve.
        current_user: The authenticated user object, injected by FastAPI.
        db: The database session, injected by FastAPI.

    Raises:
        HTTPException: 404 if the secret is not found.

    Returns:
        A `Secret` object containing the name and encrypted value.
    """
    # Get user ID
    user_id = 1 if os.environ.get("FLOWFILE_MODE") == "electron" else current_user.id

    # Get secret from database
    db_secret = db.query(db_models.Secret).filter(
        db_models.Secret.user_id == user_id,
        db_models.Secret.name == secret_name
    ).first()

    if not db_secret:
        raise HTTPException(status_code=404, detail="Secret not found")

    return Secret(
        name=db_secret.name,
        value=db_secret.encrypted_value,
        user_id=str(db_secret.user_id)
    )
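
A client-side sketch of retrieving a single secret, again assuming a local deployment and bearer-token auth (both hypothetical):

import requests

BASE_URL = "http://localhost:8000"               # hypothetical host/port
HEADERS = {"Authorization": "Bearer <token>"}    # assumed bearer-token auth

# GET /secrets/{secret_name}; a missing secret yields 404.
resp = requests.get(f"{BASE_URL}/secrets/my_database_password", headers=HEADERS)
if resp.status_code == 404:
    print("Secret not found")
else:
    resp.raise_for_status()
    secret = resp.json()
    # The "value" field holds the encrypted value; the plaintext is not exposed.
    print(secret["name"], secret["user_id"])
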
get_secrets(current_user=Depends(get_current_active_user), db=Depends(get_db)) async

Retrieves all secret names for the currently authenticated user.

Note: This endpoint returns the secret names and metadata but does not expose the decrypted secret values.

Parameters:

    current_user: The authenticated user object, injected by FastAPI. Default: Depends(get_current_active_user).
    db (Session): The database session, injected by FastAPI. Default: Depends(get_db).

Returns:

    List[Secret]: A list of Secret objects, each containing the name and encrypted value.

Source code in flowfile_core/flowfile_core/routes/secrets.py
@router.get("/secrets", response_model=List[Secret])
async def get_secrets(current_user=Depends(get_current_active_user), db: Session = Depends(get_db)):
    """Retrieves all secret names for the currently authenticated user.

    Note: This endpoint returns the secret names and metadata but does not
    expose the decrypted secret values.

    Args:
        current_user: The authenticated user object, injected by FastAPI.
        db: The database session, injected by FastAPI.

    Returns:
        A list of `Secret` objects, each containing the name and encrypted value.
    """
    user_id = current_user.id

    # Get secrets from database
    db_secrets = db.query(db_models.Secret).filter(db_models.Secret.user_id == user_id).all()

    # Prepare response model (without decrypting)
    secrets = []
    for db_secret in db_secrets:
        secrets.append(Secret(
            name=db_secret.name,
            value=db_secret.encrypted_value,
            user_id=str(db_secret.user_id)
        ))

    return secrets
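
Finally, a sketch of listing all secrets for the current user, under the same hypothetical base URL and auth assumptions:

import requests

BASE_URL = "http://localhost:8000"               # hypothetical host/port
HEADERS = {"Authorization": "Bearer <token>"}    # assumed bearer-token auth

# GET /secrets returns one Secret per stored credential for the user.
resp = requests.get(f"{BASE_URL}/secrets", headers=HEADERS)
resp.raise_for_status()
# Only names and encrypted values are returned; plaintext values stay server-side.
for secret in resp.json():
    print(secret["name"])
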