Flowfile Core API Reference

This section provides a detailed API reference for the core Python objects, data models, and API routes in flowfile-core. The documentation is generated directly from the source code docstrings.


Core Components

This section covers the fundamental classes that manage the state and execution of data pipelines. These are the main building blocks of the library.

FlowGraph

The FlowGraph is the central object that orchestrates the execution of data transformations. It is a Directed Acyclic Graph (DAG) that represents the entire pipeline and is built incrementally as you add and connect nodes.
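
Below is a minimal sketch of building and running a flow programmatically. The import path for FlowGraph matches the reference below; the schema module layout, the FlowGraphConfig fields, and the file path are assumptions for illustration only.

# A minimal sketch, not a verified recipe: the schema module paths and the
# FlowGraphConfig fields are assumptions; the FlowGraph methods themselves
# (add_datasource, print_tree, run_graph, get_run_info) are documented below.
from flowfile_core.flowfile.flow_graph import FlowGraph
from flowfile_core.schemas import schemas, input_schema  # assumed module layout

# Create an empty graph from a configuration object (fields are illustrative).
graph = FlowGraph(flow_settings=schemas.FlowGraphConfig(flow_id=1), name="example_flow")

# Add a starting data source node; NodeDatasource(file_path=...) mirrors the
# shortcut FlowGraph uses internally when it is given a path_ref.
graph.add_datasource(input_schema.NodeDatasource(file_path="data/input.csv"))

# Inspect and execute the pipeline.
graph.print_tree()               # ASCII visualization of the DAG
graph.run_graph()                # execute the whole flow
print(graph.get_run_info())      # summary of the most recent run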

flowfile_core.flowfile.flow_graph.FlowGraph

A class representing a Directed Acyclic Graph (DAG) for data processing pipelines.

It manages nodes, connections, and the execution of the entire flow.

Methods:

Name Description
__init__

Initializes a new FlowGraph instance.

__repr__

Provides the official string representation of the FlowGraph instance.

add_cloud_storage_reader

Adds a cloud storage read node to the flow graph.

add_cloud_storage_writer

Adds a node to write data to a cloud storage provider.

add_cross_join

Adds a cross join node to the graph.

add_database_reader

Adds a node to read data from a database.

add_database_writer

Adds a node to write data to a database.

add_datasource

Adds a data source node to the graph.

add_dependency_on_polars_lazy_frame

Adds a special node that directly injects a Polars LazyFrame into the graph.

add_explore_data

Adds a specialized node for data exploration and visualization.

add_external_source

Adds a node for a custom external data source.

add_filter

Adds a filter node to the graph.

add_formula

Adds a node that applies a formula to create or modify a column.

add_fuzzy_match

Adds a fuzzy matching node to join data on approximate string matches.

add_graph_solver

Adds a node that solves graph-like problems within the data.

add_group_by

Adds a group-by aggregation node to the graph.

add_include_cols

Adds columns to both the input and output column lists.

add_initial_node_analysis

Adds a data exploration/analysis node based on a node promise.

add_join

Adds a join node to combine two data streams based on key columns.

add_manual_input

Adds a node for manual data entry.

add_node_promise

Adds a placeholder node to the graph that is not yet fully configured.

add_node_step

The core method for adding or updating a node in the graph.

add_node_to_starting_list

Adds a node to the list of starting nodes for the flow if not already present.

add_output

Adds an output node to write the final data to a destination.

add_pivot

Adds a pivot node to the graph.

add_polars_code

Adds a node that executes custom Polars code.

add_read

Adds a node to read data from a local file (e.g., CSV, Parquet, Excel).

add_record_count

Adds a node that counts the number of records in the data.

add_record_id

Adds a node to create a new column with a unique ID for each record.

add_sample

Adds a node to take a random or top-N sample of the data.

add_select

Adds a node to select, rename, reorder, or drop columns.

add_sort

Adds a node to sort the data based on one or more columns.

add_sql_source

Adds a node that reads data from a SQL source.

add_text_to_rows

Adds a node that splits cell values into multiple rows.

add_union

Adds a union node to combine multiple data streams.

add_unique

Adds a node to find and remove duplicate rows.

add_unpivot

Adds an unpivot node to the graph.

add_user_defined_node

Adds a user-defined custom node to the graph.

apply_layout

Calculates and applies a layered layout to all nodes in the graph.

cancel

Cancels an ongoing graph execution.

capture_history_if_changed

Capture history only if the flow state actually changed.

capture_history_snapshot

Capture the current state before a change for undo support.

close_flow

Performs cleanup operations, such as clearing node caches.

copy_node

Creates a copy of an existing node.

delete_node

Deletes a node from the graph and updates all its connections.

generate_code

Generates code for the flow graph.

get_frontend_data

Formats the graph structure into a JSON-like dictionary for a specific legacy frontend.

get_history_state

Get the current state of the history system.

get_implicit_starter_nodes

Finds nodes that can act as starting points but are not explicitly defined as such.

get_node

Retrieves a node from the graph by its ID.

get_node_data

Retrieves all data needed to render a node in the UI.

get_node_storage

Serializes the entire graph's state into a storable format.

get_nodes_overview

Gets a list of dictionary representations for all nodes in the graph.

get_run_info

Gets a summary of the most recent graph execution.

get_vue_flow_input

Formats the graph's nodes and edges into a schema suitable for the VueFlow frontend.

print_tree

Print flow_graph as a visual tree structure, showing the DAG relationships with ASCII art.

redo

Redo the last undone action.

remove_from_output_cols

Removes specified columns from the list of expected output columns.

reset

Forces a deep reset on all nodes in the graph.

restore_from_snapshot

Clear current state and rebuild from a snapshot.

run_graph

Executes the entire data flow graph from start to finish.

save_flow

Saves the current state of the flow graph to a file.

trigger_fetch_node

Executes a specific node in the graph by its ID.

undo

Undo the last action by restoring to the previous state.

Attributes:

Name Type Description
execution_location ExecutionLocationsLiteral

Gets the current execution location.

execution_mode ExecutionModeLiteral

Gets the current execution mode ('Development' or 'Performance').

flow_id int

Gets the unique identifier of the flow.

graph_has_functions bool

Checks if the graph has any nodes.

graph_has_input_data bool

Checks if the graph has an initial input data source.

node_connections list[tuple[int, int]]

Computes and returns a list of all connections in the graph.

nodes list[FlowNode]

Gets a list of all FlowNode objects in the graph.
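
The properties above can be read directly from a built graph. The short sketch below continues the hypothetical graph from the earlier example and shows a few of them alongside the undo/redo helpers; it assumes history tracking is enabled in the flow settings.

# Continuing the hypothetical graph from the earlier sketch.
for node in graph.nodes:                 # list[FlowNode]
    print(node.node_id)

print(graph.node_connections)            # list of (from_node_id, to_node_id) tuples
print(graph.execution_mode)              # 'Development' or 'Performance'
print(graph.flow_id)

# Undo/redo are backed by the history manager (requires track_history in the
# flow settings); both return an UndoRedoResult describing success or failure.
print(graph.get_history_state())         # available undo/redo operations
result = graph.undo()
graph.redo()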

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
class FlowGraph:
    """A class representing a Directed Acyclic Graph (DAG) for data processing pipelines.

    It manages nodes, connections, and the execution of the entire flow.
    """

    uuid: str
    depends_on: dict[
        int,
        Union[
            ParquetFile,
            FlowDataEngine,
            "FlowGraph",
            pl.DataFrame,
        ],
    ]
    _flow_id: int
    _input_data: Union[ParquetFile, FlowDataEngine, "FlowGraph"]
    _input_cols: list[str]
    _output_cols: list[str]
    _node_db: dict[str | int, FlowNode]
    _node_ids: list[str | int]
    _results: FlowDataEngine | None = None
    cache_results: bool = False
    schema: list[FlowfileColumn] | None = None
    has_over_row_function: bool = False
    _flow_starts: list[int | str] = None
    latest_run_info: RunInformation | None = None
    start_datetime: datetime = None
    end_datetime: datetime = None
    _flow_settings: schemas.FlowSettings = None
    flow_logger: FlowLogger

    def __init__(
        self,
        flow_settings: schemas.FlowSettings | schemas.FlowGraphConfig,
        name: str = None,
        input_cols: list[str] = None,
        output_cols: list[str] = None,
        path_ref: str = None,
        input_flow: Union[ParquetFile, FlowDataEngine, "FlowGraph"] = None,
        cache_results: bool = False,
    ):
        """Initializes a new FlowGraph instance.

        Args:
            flow_settings: The configuration settings for the flow.
            name: The name of the flow.
            input_cols: A list of input column names.
            output_cols: A list of output column names.
            path_ref: An optional path to an initial data source.
            input_flow: An optional existing data object to start the flow with.
            cache_results: A global flag to enable or disable result caching.
        """
        if isinstance(flow_settings, schemas.FlowGraphConfig):
            flow_settings = schemas.FlowSettings.from_flow_settings_input(flow_settings)

        self._flow_settings = flow_settings
        self.uuid = str(uuid1())
        self.start_datetime = None
        self.end_datetime = None
        self.latest_run_info = None
        self._flow_id = flow_settings.flow_id
        self.flow_logger = FlowLogger(flow_settings.flow_id)
        self._flow_starts: list[FlowNode] = []
        self._results = None
        self.schema = None
        self.has_over_row_function = False
        self._input_cols = [] if input_cols is None else input_cols
        self._output_cols = [] if output_cols is None else output_cols
        self._node_ids = []
        self._node_db = {}
        self.cache_results = cache_results
        self.__name__ = name if name else "flow_" + str(id(self))
        self.depends_on = {}

        # Initialize history manager for undo/redo support
        from flowfile_core.flowfile.history_manager import HistoryManager
        from flowfile_core.schemas.history_schema import HistoryConfig
        history_config = HistoryConfig(enabled=flow_settings.track_history)
        self._history_manager = HistoryManager(config=history_config)

        if path_ref is not None:
            self.add_datasource(input_schema.NodeDatasource(file_path=path_ref))
        elif input_flow is not None:
            self.add_datasource(input_file=input_flow)

    @property
    def flow_settings(self) -> schemas.FlowSettings:
        return self._flow_settings

    @flow_settings.setter
    def flow_settings(self, flow_settings: schemas.FlowSettings):
        if (self._flow_settings.execution_location != flow_settings.execution_location) or (
            self._flow_settings.execution_mode != flow_settings.execution_mode
        ):
            self.reset()
        self._flow_settings = flow_settings

    # ==================== History Management Methods ====================

    def capture_history_snapshot(
        self,
        action_type: HistoryActionType,
        description: str,
        node_id: int = None,
    ) -> bool:
        """Capture the current state before a change for undo support.

        Args:
            action_type: The type of action being performed.
            description: Human-readable description of the action.
            node_id: Optional ID of the affected node.

        Returns:
            True if snapshot was captured, False if skipped.
        """
        return self._history_manager.capture_snapshot(self, action_type, description, node_id)

    def capture_history_if_changed(
        self,
        pre_snapshot: schemas.FlowfileData,
        action_type: HistoryActionType,
        description: str,
        node_id: int = None,
    ) -> bool:
        """Capture history only if the flow state actually changed.

        Use this for settings updates where the change might be a no-op.
        Call this AFTER the change is applied.

        Args:
            pre_snapshot: The FlowfileData captured BEFORE the change.
            action_type: The type of action that was performed.
            description: Human-readable description of the action.
            node_id: Optional ID of the affected node.

        Returns:
            True if a change was detected and snapshot was captured.
        """
        return self._history_manager.capture_if_changed(
            self, pre_snapshot, action_type, description, node_id
        )

    def undo(self) -> UndoRedoResult:
        """Undo the last action by restoring to the previous state.

        Returns:
            UndoRedoResult indicating success or failure.
        """
        return self._history_manager.undo(self)

    def redo(self) -> UndoRedoResult:
        """Redo the last undone action.

        Returns:
            UndoRedoResult indicating success or failure.
        """
        return self._history_manager.redo(self)

    def get_history_state(self) -> HistoryState:
        """Get the current state of the history system.

        Returns:
            HistoryState with information about available undo/redo operations.
        """
        return self._history_manager.get_state()

    def _execute_with_history(
        self,
        operation: Callable[[], Any],
        action_type: HistoryActionType,
        description: str,
        node_id: int = None,
    ) -> Any:
        """Execute an operation with automatic history capture.

        This helper captures the state before the operation, executes it,
        and records history only if the state actually changed.

        Args:
            operation: A callable that performs the actual operation.
            action_type: The type of action being performed.
            description: Human-readable description of the action.
            node_id: Optional ID of the affected node.

        Returns:
            The result of the operation (if any).
        """
        # Skip history capture if tracking is disabled for this flow
        if not self.flow_settings.track_history:
            return operation()

        pre_snapshot = self.get_flowfile_data()
        result = operation()
        self._history_manager.capture_if_changed(
            self, pre_snapshot, action_type, description, node_id
        )
        return result

    def restore_from_snapshot(self, snapshot: schemas.FlowfileData) -> None:
        """Clear current state and rebuild from a snapshot.

        This method is used internally by undo/redo to restore a previous state.

        Args:
            snapshot: The FlowfileData snapshot to restore from.
        """
        from flowfile_core.flowfile.manage.io_flowfile import (
            _flowfile_data_to_flow_information,
            determine_insertion_order,
        )

        # Preserve the current flow_id
        original_flow_id = self._flow_id

        # Convert snapshot to FlowInformation
        flow_info = _flowfile_data_to_flow_information(snapshot)

        # Clear current state
        self._node_db.clear()
        self._node_ids.clear()
        self._flow_starts.clear()
        self._results = None

        # Restore flow settings (preserve original flow_id)
        self._flow_settings = flow_info.flow_settings
        self._flow_settings.flow_id = original_flow_id
        self._flow_id = original_flow_id
        self.__name__ = flow_info.flow_name or self.__name__

        # Determine node insertion order
        ingestion_order = determine_insertion_order(flow_info)

        # First pass: Create all nodes as promises
        for node_id in ingestion_order:
            node_info = flow_info.data[node_id]
            node_promise = input_schema.NodePromise(
                flow_id=original_flow_id,
                node_id=node_info.id,
                pos_x=node_info.x_position or 0,
                pos_y=node_info.y_position or 0,
                node_type=node_info.type,
            )
            if hasattr(node_info.setting_input, "cache_results"):
                node_promise.cache_results = node_info.setting_input.cache_results
            self.add_node_promise(node_promise)

        # Second pass: Apply settings using add_<node_type> methods
        for node_id in ingestion_order:
            node_info = flow_info.data[node_id]
            if node_info.is_setup and node_info.setting_input is not None:
                # Update flow_id in setting_input
                if hasattr(node_info.setting_input, "flow_id"):
                    node_info.setting_input.flow_id = original_flow_id

                if hasattr(node_info.setting_input, "is_user_defined") and node_info.setting_input.is_user_defined:
                    if node_info.type in CUSTOM_NODE_STORE:
                        user_defined_node_class = CUSTOM_NODE_STORE[node_info.type]
                        self.add_user_defined_node(
                            custom_node=user_defined_node_class.from_settings(node_info.setting_input.settings),
                            user_defined_node_settings=node_info.setting_input,
                        )
                else:
                    add_method = getattr(self, "add_" + node_info.type, None)
                    if add_method:
                        add_method(node_info.setting_input)

        # Third pass: Restore connections
        for node_id in ingestion_order:
            node_info = flow_info.data[node_id]
            from_node = self.get_node(node_id)
            if from_node is None:
                continue

            for output_node_id in node_info.outputs or []:
                to_node = self.get_node(output_node_id)
                if to_node is None:
                    continue

                output_node_info = flow_info.data.get(output_node_id)
                if output_node_info is None:
                    continue

                # Determine connection type
                is_left_input = (output_node_info.left_input_id == node_id) and (
                    to_node.left_input is None or to_node.left_input.node_id != node_id
                )
                is_right_input = (output_node_info.right_input_id == node_id) and (
                    to_node.right_input is None or to_node.right_input.node_id != node_id
                )
                is_main_input = node_id in (output_node_info.input_ids or [])

                if is_left_input:
                    insert_type = "left"
                elif is_right_input:
                    insert_type = "right"
                elif is_main_input:
                    insert_type = "main"
                else:
                    continue

                to_node.add_node_connection(from_node, insert_type)

        logger.info(f"Restored flow from snapshot with {len(self._node_db)} nodes")

    # ==================== End History Management Methods ====================

    def add_node_to_starting_list(self, node: FlowNode) -> None:
        """Adds a node to the list of starting nodes for the flow if not already present.

        Args:
            node: The FlowNode to add as a starting node.
        """
        if node.node_id not in {self_node.node_id for self_node in self._flow_starts}:
            self._flow_starts.append(node)

    def add_node_promise(self, node_promise: input_schema.NodePromise, track_history: bool = True):
        """Adds a placeholder node to the graph that is not yet fully configured.

        Useful for building the graph structure before all settings are available.
        Automatically captures history for undo/redo support.

        Args:
            node_promise: A promise object containing basic node information.
            track_history: Whether to track this change in history (default True).
        """
        def _do_add():
            def placeholder(n: FlowNode = None):
                if n is None:
                    return FlowDataEngine()
                return n

            self.add_node_step(
                node_id=node_promise.node_id,
                node_type=node_promise.node_type,
                function=placeholder,
                setting_input=node_promise,
            )
            if node_promise.is_user_defined:
                node_needs_settings: bool
                custom_node = CUSTOM_NODE_STORE.get(node_promise.node_type)
                if custom_node is None:
                    raise Exception(f"Custom node type '{node_promise.node_type}' not found in registry.")
                settings_schema = custom_node.model_fields["settings_schema"].default
                node_needs_settings = settings_schema is not None and not settings_schema.is_empty()
                if not node_needs_settings:
                    user_defined_node_settings = input_schema.UserDefinedNode(settings={}, **node_promise.model_dump())
                    initialized_model = custom_node()
                    self.add_user_defined_node(
                        custom_node=initialized_model, user_defined_node_settings=user_defined_node_settings
                    )

        if track_history:
            self._execute_with_history(
                _do_add,
                HistoryActionType.ADD_NODE,
                f"Add {node_promise.node_type} node",
                node_id=node_promise.node_id,
            )
        else:
            _do_add()

    def apply_layout(self, y_spacing: int = 150, x_spacing: int = 200, initial_y: int = 100):
        """Calculates and applies a layered layout to all nodes in the graph.

        This updates their x and y positions for UI rendering.

        Args:
            y_spacing: The vertical spacing between layers.
            x_spacing: The horizontal spacing between nodes in the same layer.
            initial_y: The initial y-position for the first layer.
        """
        self.flow_logger.info("Applying layered layout...")
        start_time = time()
        try:
            # Calculate new positions for all nodes
            new_positions = calculate_layered_layout(
                self, y_spacing=y_spacing, x_spacing=x_spacing, initial_y=initial_y
            )

            if not new_positions:
                self.flow_logger.warning("Layout calculation returned no positions.")
                return

            # Apply the new positions to the setting_input of each node
            updated_count = 0
            for node_id, (pos_x, pos_y) in new_positions.items():
                node = self.get_node(node_id)
                if node and hasattr(node, "setting_input"):
                    setting = node.setting_input
                    if hasattr(setting, "pos_x") and hasattr(setting, "pos_y"):
                        setting.pos_x = pos_x
                        setting.pos_y = pos_y
                        updated_count += 1
                    else:
                        self.flow_logger.warning(
                            f"Node {node_id} setting_input ({type(setting)}) lacks pos_x/pos_y attributes."
                        )
                elif node:
                    self.flow_logger.warning(f"Node {node_id} lacks setting_input attribute.")
                # else: Node not found, already warned by calculate_layered_layout

            end_time = time()
            self.flow_logger.info(
                f"Layout applied to {updated_count}/{len(self.nodes)} nodes in {end_time - start_time:.2f} seconds."
            )

        except Exception as e:
            self.flow_logger.error(f"Error applying layout: {e}")
            raise  # Optional: re-raise the exception

    @property
    def flow_id(self) -> int:
        """Gets the unique identifier of the flow."""
        return self._flow_id

    @flow_id.setter
    def flow_id(self, new_id: int):
        """Sets the unique identifier for the flow and updates all child nodes.

        Args:
            new_id: The new flow ID.
        """
        self._flow_id = new_id
        for node in self.nodes:
            if hasattr(node.setting_input, "flow_id"):
                node.setting_input.flow_id = new_id
        self.flow_settings.flow_id = new_id

    def __repr__(self):
        """Provides the official string representation of the FlowGraph instance."""
        settings_str = "  -" + "\n  -".join(f"{k}: {v}" for k, v in self.flow_settings)
        return f"FlowGraph(\nNodes: {self._node_db}\n\nSettings:\n{settings_str}"

    def print_tree(self):
        """Print flow_graph as a visual tree structure, showing the DAG relationships with ASCII art."""
        if not self._node_db:
            self.flow_logger.info("Empty flow graph")
            return

        # Build node information
        node_info = build_node_info(self.nodes)

        # Calculate depths for all nodes
        for node_id in node_info:
            calculate_depth(node_id, node_info)

        # Group nodes by depth
        depth_groups, max_depth = group_nodes_by_depth(node_info)

        # Sort nodes within each depth group
        for depth in depth_groups:
            depth_groups[depth].sort()

        # Create the main flow visualization
        lines = ["=" * 80, "Flow Graph Visualization", "=" * 80, ""]

        # Track which nodes connect to what
        merge_points = define_node_connections(node_info)

        # Build the flow paths

        # Find the maximum label length for each depth level
        max_label_length = {}
        for depth in range(max_depth + 1):
            if depth in depth_groups:
                max_len = max(len(node_info[nid].label) for nid in depth_groups[depth])
                max_label_length[depth] = max_len

        # Draw the paths
        drawn_nodes = set()
        merge_drawn = set()

        # Group paths by their merge points
        paths_by_merge = {}
        standalone_paths = []

        # Build flow paths
        paths = build_flow_paths(node_info, self._flow_starts, merge_points)

        # Define paths to merge and standalone paths
        for path in paths:
            if len(path) > 1 and path[-1] in merge_points and len(merge_points[path[-1]]) > 1:
                merge_id = path[-1]
                if merge_id not in paths_by_merge:
                    paths_by_merge[merge_id] = []
                paths_by_merge[merge_id].append(path)
            else:
                standalone_paths.append(path)

        # Draw merged paths
        draw_merged_paths(node_info, merge_points, paths_by_merge, merge_drawn, drawn_nodes, lines)

        # Draw standalone paths
        draw_standalone_paths(drawn_nodes, standalone_paths, lines, node_info)

        # Add undrawn nodes
        add_un_drawn_nodes(drawn_nodes, node_info, lines)

        try:
            execution_plan = compute_execution_plan(
                nodes=self.nodes, flow_starts=self._flow_starts + self.get_implicit_starter_nodes()
            )
            ordered_nodes = execution_plan.all_nodes
            if ordered_nodes:
                for i, node in enumerate(ordered_nodes, 1):
                    lines.append(f"  {i:3d}. {node_info[node.node_id].label}")
        except Exception as e:
            lines.append(f"  Could not determine execution order: {e}")

        # Print everything
        output = "\n".join(lines)

        print(output)

    def get_nodes_overview(self):
        """Gets a list of dictionary representations for all nodes in the graph."""
        output = []
        for v in self._node_db.values():
            output.append(v.get_repr())
        return output

    def remove_from_output_cols(self, columns: list[str]):
        """Removes specified columns from the list of expected output columns.

        Args:
            columns: A list of column names to remove.
        """
        cols = set(columns)
        self._output_cols = [c for c in self._output_cols if c not in cols]

    def get_node(self, node_id: int | str = None) -> FlowNode | None:
        """Retrieves a node from the graph by its ID.

        Args:
            node_id: The ID of the node to retrieve. If None, retrieves the last added node.

        Returns:
            The FlowNode object, or None if not found.
        """
        if node_id is None:
            node_id = self._node_ids[-1]
        node = self._node_db.get(node_id)
        if node is not None:
            return node

    def add_user_defined_node(
        self, *, custom_node: CustomNodeBase, user_defined_node_settings: input_schema.UserDefinedNode
    ):
        """Adds a user-defined custom node to the graph.

        Args:
            custom_node: The custom node instance to add.
            user_defined_node_settings: The settings for the user-defined node.
        """

        def _func(*flow_data_engine: FlowDataEngine) -> FlowDataEngine | None:
            user_id = user_defined_node_settings.user_id
            if user_id is not None:
                custom_node.set_execution_context(user_id)
                if custom_node.settings_schema:
                    custom_node.settings_schema.set_secret_context(user_id, custom_node.accessed_secrets)

            output = custom_node.process(*(fde.data_frame for fde in flow_data_engine))

            accessed_secrets = custom_node.get_accessed_secrets()
            if accessed_secrets:
                logger.info(f"Node '{user_defined_node_settings.node_id}' accessed secrets: {accessed_secrets}")
            if isinstance(output, (pl.LazyFrame, pl.DataFrame)):
                return FlowDataEngine(output)
            return None

        self.add_node_step(
            node_id=user_defined_node_settings.node_id,
            function=_func,
            setting_input=user_defined_node_settings,
            input_node_ids=user_defined_node_settings.depending_on_ids,
            node_type=custom_node.item,
        )
        if custom_node.number_of_inputs == 0:
            node = self.get_node(user_defined_node_settings.node_id)
            self.add_node_to_starting_list(node)

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_pivot(self, pivot_settings: input_schema.NodePivot):
        """Adds a pivot node to the graph.

        Args:
            pivot_settings: The settings for the pivot operation.
        """

        def _func(fl: FlowDataEngine):
            return fl.do_pivot(pivot_settings.pivot_input, self.flow_logger.get_node_logger(pivot_settings.node_id))

        self.add_node_step(
            node_id=pivot_settings.node_id,
            function=_func,
            node_type="pivot",
            setting_input=pivot_settings,
            input_node_ids=[pivot_settings.depending_on_id],
        )

        node = self.get_node(pivot_settings.node_id)

        def schema_callback():
            input_data = node.singular_main_input.get_resulting_data()  # get the data from the previous step
            input_data.lazy = True  # ensure the dataset is lazy
            input_lf = input_data.data_frame  # get the lazy frame
            return pre_calculate_pivot_schema(input_data.schema, pivot_settings.pivot_input, input_lf=input_lf)

        node.schema_callback = schema_callback

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_unpivot(self, unpivot_settings: input_schema.NodeUnpivot):
        """Adds an unpivot node to the graph.

        Args:
            unpivot_settings: The settings for the unpivot operation.
        """

        def _func(fl: FlowDataEngine) -> FlowDataEngine:
            return fl.unpivot(unpivot_settings.unpivot_input)

        self.add_node_step(
            node_id=unpivot_settings.node_id,
            function=_func,
            node_type="unpivot",
            setting_input=unpivot_settings,
            input_node_ids=[unpivot_settings.depending_on_id],
        )

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_union(self, union_settings: input_schema.NodeUnion):
        """Adds a union node to combine multiple data streams.

        Args:
            union_settings: The settings for the union operation.
        """

        def _func(*flowfile_tables: FlowDataEngine):
            dfs: list[pl.LazyFrame] | list[pl.DataFrame] = [flt.data_frame for flt in flowfile_tables]
            return FlowDataEngine(pl.concat(dfs, how="diagonal_relaxed"))

        self.add_node_step(
            node_id=union_settings.node_id,
            function=_func,
            node_type="union",
            setting_input=union_settings,
            input_node_ids=union_settings.depending_on_ids,
        )

    def add_initial_node_analysis(self, node_promise: input_schema.NodePromise, track_history: bool = True):
        """Adds a data exploration/analysis node based on a node promise.

        Automatically captures history for undo/redo support.

        Args:
            node_promise: The promise representing the node to be analyzed.
            track_history: Whether to track this change in history (default True).
        """
        def _do_add():
            node_analysis = create_graphic_walker_node_from_node_promise(node_promise)
            self.add_explore_data(node_analysis)

        if track_history:
            self._execute_with_history(
                _do_add,
                HistoryActionType.ADD_NODE,
                f"Add {node_promise.node_type} node",
                node_id=node_promise.node_id,
            )
        else:
            _do_add()

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_explore_data(self, node_analysis: input_schema.NodeExploreData):
        """Adds a specialized node for data exploration and visualization.

        Args:
            node_analysis: The settings for the data exploration node.
        """
        sample_size: int = 10000

        def analysis_preparation(flowfile_table: FlowDataEngine):
            if flowfile_table.number_of_records <= 0:
                number_of_records = flowfile_table.get_number_of_records(calculate_in_worker_process=True)
            else:
                number_of_records = flowfile_table.number_of_records
            if number_of_records > sample_size:
                flowfile_table = flowfile_table.get_sample(sample_size, random=True)
            external_sampler = ExternalDfFetcher(
                lf=flowfile_table.data_frame,
                file_ref="__gf_walker" + node.hash,
                wait_on_completion=True,
                node_id=node.node_id,
                flow_id=self.flow_id,
            )
            node.results.analysis_data_generator = get_read_top_n(
                external_sampler.status.file_ref, n=min(sample_size, number_of_records)
            )
            return flowfile_table

        def schema_callback():
            node = self.get_node(node_analysis.node_id)
            if len(node.all_inputs) == 1:
                input_node = node.all_inputs[0]
                return input_node.schema
            else:
                return [FlowfileColumn.from_input("col_1", "na")]

        self.add_node_step(
            node_id=node_analysis.node_id,
            node_type="explore_data",
            function=analysis_preparation,
            setting_input=node_analysis,
            schema_callback=schema_callback,
        )
        node = self.get_node(node_analysis.node_id)

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_group_by(self, group_by_settings: input_schema.NodeGroupBy):
        """Adds a group-by aggregation node to the graph.

        Args:
            group_by_settings: The settings for the group-by operation.
        """

        def _func(fl: FlowDataEngine) -> FlowDataEngine:
            return fl.do_group_by(group_by_settings.groupby_input, False)

        self.add_node_step(
            node_id=group_by_settings.node_id,
            function=_func,
            node_type="group_by",
            setting_input=group_by_settings,
            input_node_ids=[group_by_settings.depending_on_id],
        )

        node = self.get_node(group_by_settings.node_id)

        def schema_callback():
            output_columns = [(c.old_name, c.new_name, c.output_type) for c in group_by_settings.groupby_input.agg_cols]
            depends_on = node.node_inputs.main_inputs[0]
            input_schema_dict: dict[str, str] = {s.name: s.data_type for s in depends_on.schema}
            output_schema = []
            for old_name, new_name, data_type in output_columns:
                data_type = input_schema_dict[old_name] if data_type is None else data_type
                output_schema.append(FlowfileColumn.from_input(data_type=data_type, column_name=new_name))
            return output_schema

        node.schema_callback = schema_callback

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_filter(self, filter_settings: input_schema.NodeFilter):
        """Adds a filter node to the graph.

        Args:
            filter_settings: The settings for the filter operation.
        """

        def _func(fl: FlowDataEngine):
            is_advanced = filter_settings.filter_input.is_advanced()

            if is_advanced:
                predicate = filter_settings.filter_input.advanced_filter
                return fl.do_filter(predicate)
            else:
                basic_filter = filter_settings.filter_input.basic_filter
                if basic_filter is None:
                    logger.warning("Basic filter is None, returning unfiltered data")
                    return fl

                try:
                    field_data_type = fl.get_schema_column(basic_filter.field).generic_datatype()
                except Exception:
                    field_data_type = None

                expression = build_filter_expression(basic_filter, field_data_type)
                filter_settings.filter_input.advanced_filter = expression
                return fl.do_filter(expression)

        self.add_node_step(
            filter_settings.node_id,
            _func,
            node_type="filter",
            renew_schema=False,
            setting_input=filter_settings,
            input_node_ids=[filter_settings.depending_on_id],
        )

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_record_count(self, node_number_of_records: input_schema.NodeRecordCount):
        """Adds a filter node to the graph.

        Args:
            node_number_of_records: The settings for the record count operation.
        """

        def _func(fl: FlowDataEngine) -> FlowDataEngine:
            return fl.get_record_count()

        self.add_node_step(
            node_id=node_number_of_records.node_id,
            function=_func,
            node_type="record_count",
            setting_input=node_number_of_records,
            input_node_ids=[node_number_of_records.depending_on_id],
        )

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_polars_code(self, node_polars_code: input_schema.NodePolarsCode):
        """Adds a node that executes custom Polars code.

        Args:
            node_polars_code: The settings for the Polars code node.
        """

        def _func(*flowfile_tables: FlowDataEngine) -> FlowDataEngine:
            return execute_polars_code(*flowfile_tables, code=node_polars_code.polars_code_input.polars_code)

        self.add_node_step(
            node_id=node_polars_code.node_id,
            function=_func,
            node_type="polars_code",
            setting_input=node_polars_code,
            input_node_ids=node_polars_code.depending_on_ids,
        )

        try:
            polars_code_parser.validate_code(node_polars_code.polars_code_input.polars_code)
        except Exception as e:
            node = self.get_node(node_id=node_polars_code.node_id)
            node.results.errors = str(e)

    def add_dependency_on_polars_lazy_frame(self, lazy_frame: pl.LazyFrame, node_id: int):
        """Adds a special node that directly injects a Polars LazyFrame into the graph.

        Note: This is intended for backend use and will not work in the UI editor.

        Args:
            lazy_frame: The Polars LazyFrame to inject.
            node_id: The ID for the new node.
        """

        def _func():
            return FlowDataEngine(lazy_frame)

        node_promise = input_schema.NodePromise(
            flow_id=self.flow_id, node_id=node_id, node_type="polars_lazy_frame", is_setup=True
        )
        self.add_node_step(
            node_id=node_promise.node_id, node_type=node_promise.node_type, function=_func, setting_input=node_promise
        )

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_unique(self, unique_settings: input_schema.NodeUnique):
        """Adds a node to find and remove duplicate rows.

        Args:
            unique_settings: The settings for the unique operation.
        """

        def _func(fl: FlowDataEngine) -> FlowDataEngine:
            return fl.make_unique(unique_settings.unique_input)

        self.add_node_step(
            node_id=unique_settings.node_id,
            function=_func,
            input_columns=[],
            node_type="unique",
            setting_input=unique_settings,
            input_node_ids=[unique_settings.depending_on_id],
        )

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_graph_solver(self, graph_solver_settings: input_schema.NodeGraphSolver):
        """Adds a node that solves graph-like problems within the data.

        This node can be used for operations like finding network paths,
        calculating connected components, or performing other graph algorithms
        on relational data that represents nodes and edges.

        Args:
            graph_solver_settings: The settings object defining the graph inputs
                and the specific algorithm to apply.
        """

        def _func(fl: FlowDataEngine) -> FlowDataEngine:
            return fl.solve_graph(graph_solver_settings.graph_solver_input)

        self.add_node_step(
            node_id=graph_solver_settings.node_id,
            function=_func,
            node_type="graph_solver",
            setting_input=graph_solver_settings,
            input_node_ids=[graph_solver_settings.depending_on_id],
        )

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_formula(self, function_settings: input_schema.NodeFormula):
        """Adds a node that applies a formula to create or modify a column.

        Args:
            function_settings: The settings for the formula operation.
        """

        error = ""
        if function_settings.function.field.data_type not in (None, transform_schema.AUTO_DATA_TYPE):
            output_type = cast_str_to_polars_type(function_settings.function.field.data_type)
        else:
            output_type = None
        if output_type not in (None, transform_schema.AUTO_DATA_TYPE):
            new_col = [
                FlowfileColumn.from_input(column_name=function_settings.function.field.name, data_type=str(output_type))
            ]
        else:
            new_col = [FlowfileColumn.from_input(function_settings.function.field.name, "String")]

        def _func(fl: FlowDataEngine):
            return fl.apply_sql_formula(
                func=function_settings.function.function,
                col_name=function_settings.function.field.name,
                output_data_type=output_type,
            )

        self.add_node_step(
            function_settings.node_id,
            _func,
            output_schema=new_col,
            node_type="formula",
            renew_schema=False,
            setting_input=function_settings,
            input_node_ids=[function_settings.depending_on_id],
        )
        if error != "":
            node = self.get_node(function_settings.node_id)
            node.results.errors = error
            return False, error
        else:
            return True, ""

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_cross_join(self, cross_join_settings: input_schema.NodeCrossJoin) -> "FlowGraph":
        """Adds a cross join node to the graph.

        Args:
            cross_join_settings: The settings for the cross join operation.

        Returns:
            The `FlowGraph` instance for method chaining.
        """

        def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
            for left_select in cross_join_settings.cross_join_input.left_select.renames:
                left_select.is_available = True if left_select.old_name in main.schema else False
            for right_select in cross_join_settings.cross_join_input.right_select.renames:
                right_select.is_available = True if right_select.old_name in right.schema else False
            return main.do_cross_join(
                cross_join_input=cross_join_settings.cross_join_input,
                auto_generate_selection=cross_join_settings.auto_generate_selection,
                verify_integrity=False,
                other=right,
            )

        self.add_node_step(
            node_id=cross_join_settings.node_id,
            function=_func,
            input_columns=[],
            node_type="cross_join",
            setting_input=cross_join_settings,
            input_node_ids=cross_join_settings.depending_on_ids,
        )
        return self

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_join(self, join_settings: input_schema.NodeJoin) -> "FlowGraph":
        """Adds a join node to combine two data streams based on key columns.

        Args:
            join_settings: The settings for the join operation.

        Returns:
            The `FlowGraph` instance for method chaining.
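
        Example:
            A minimal sketch; `join_settings` is assumed to be a configured
            `input_schema.NodeJoin` whose `depending_on_ids` reference the two
            upstream nodes to combine::

                graph.add_join(join_settings).add_select(select_settings)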
        """

        def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
            for left_select in join_settings.join_input.left_select.renames:
                left_select.is_available = left_select.old_name in main.schema
            for right_select in join_settings.join_input.right_select.renames:
                right_select.is_available = right_select.old_name in right.schema
            return main.join(
                join_input=join_settings.join_input,
                auto_generate_selection=join_settings.auto_generate_selection,
                verify_integrity=False,
                other=right,
            )

        self.add_node_step(
            node_id=join_settings.node_id,
            function=_func,
            input_columns=[],
            node_type="join",
            setting_input=join_settings,
            input_node_ids=join_settings.depending_on_ids,
        )
        return self

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_fuzzy_match(self, fuzzy_settings: input_schema.NodeFuzzyMatch) -> "FlowGraph":
        """Adds a fuzzy matching node to join data on approximate string matches.

        Args:
            fuzzy_settings: The settings for the fuzzy match operation.

        Returns:
            The `FlowGraph` instance for method chaining.
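
        Example:
            A minimal sketch; `fuzzy_settings` is assumed to be a configured
            `input_schema.NodeFuzzyMatch` whose `join_input` describes the
            approximate-match keys::

                graph.add_fuzzy_match(fuzzy_settings)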
        """

        def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
            node = self.get_node(node_id=fuzzy_settings.node_id)
            if self.execution_location == "local":
                return main.fuzzy_join(
                    fuzzy_match_input=deepcopy(fuzzy_settings.join_input),
                    other=right,
                    node_logger=self.flow_logger.get_node_logger(fuzzy_settings.node_id),
                )

            f = main.start_fuzzy_join(
                fuzzy_match_input=deepcopy(fuzzy_settings.join_input),
                other=right,
                file_ref=node.hash,
                flow_id=self.flow_id,
                node_id=fuzzy_settings.node_id,
            )
            logger.info("Started the fuzzy match action")
            node._fetch_cached_df = f  # Attach to the node so it can be cancelled or fetched later if needed
            return FlowDataEngine(f.get_result())

        def schema_callback():
            fm_input_copy = FuzzyMatchInputManager(
                fuzzy_settings.join_input
            )  # Deepcopy creates a unique object per call
            node = self.get_node(node_id=fuzzy_settings.node_id)
            return calculate_fuzzy_match_schema(
                fm_input_copy,
                left_schema=node.node_inputs.main_inputs[0].schema,
                right_schema=node.node_inputs.right_input.schema,
            )

        self.add_node_step(
            node_id=fuzzy_settings.node_id,
            function=_func,
            input_columns=[],
            node_type="fuzzy_match",
            setting_input=fuzzy_settings,
            input_node_ids=fuzzy_settings.depending_on_ids,
            schema_callback=schema_callback,
        )

        return self

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_text_to_rows(self, node_text_to_rows: input_schema.NodeTextToRows) -> "FlowGraph":
        """Adds a node that splits cell values into multiple rows.

        This is useful for un-nesting data where a single field contains multiple
        values separated by a delimiter.

        Args:
            node_text_to_rows: The settings object that specifies the column to split
                and the delimiter to use.

        Returns:
            The `FlowGraph` instance for method chaining.
        """

        def _func(table: FlowDataEngine) -> FlowDataEngine:
            return table.split(node_text_to_rows.text_to_rows_input)

        self.add_node_step(
            node_id=node_text_to_rows.node_id,
            function=_func,
            node_type="text_to_rows",
            setting_input=node_text_to_rows,
            input_node_ids=[node_text_to_rows.depending_on_id],
        )
        return self

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_sort(self, sort_settings: input_schema.NodeSort) -> "FlowGraph":
        """Adds a node to sort the data based on one or more columns.

        Args:
            sort_settings: The settings for the sort operation.

        Returns:
            The `FlowGraph` instance for method chaining.
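
        Example:
            A minimal sketch of chaining simple transform nodes; the settings
            objects are assumed to be pre-configured `input_schema` models::

                graph.add_sort(sort_settings).add_sample(sample_settings)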
        """

        def _func(table: FlowDataEngine) -> FlowDataEngine:
            return table.do_sort(sort_settings.sort_input)

        self.add_node_step(
            node_id=sort_settings.node_id,
            function=_func,
            node_type="sort",
            setting_input=sort_settings,
            input_node_ids=[sort_settings.depending_on_id],
        )
        return self

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_sample(self, sample_settings: input_schema.NodeSample) -> "FlowGraph":
        """Adds a node to take a random or top-N sample of the data.

        Args:
            sample_settings: The settings object specifying the size of the sample.

        Returns:
            The `FlowGraph` instance for method chaining.
        """

        def _func(table: FlowDataEngine) -> FlowDataEngine:
            return table.get_sample(sample_settings.sample_size)

        self.add_node_step(
            node_id=sample_settings.node_id,
            function=_func,
            node_type="sample",
            setting_input=sample_settings,
            input_node_ids=[sample_settings.depending_on_id],
        )
        return self

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_record_id(self, record_id_settings: input_schema.NodeRecordId) -> "FlowGraph":
        """Adds a node to create a new column with a unique ID for each record.

        Args:
            record_id_settings: The settings object specifying the name of the
                new record ID column.

        Returns:
            The `FlowGraph` instance for method chaining.
        """

        def _func(table: FlowDataEngine) -> FlowDataEngine:
            return table.add_record_id(record_id_settings.record_id_input)

        self.add_node_step(
            node_id=record_id_settings.node_id,
            function=_func,
            node_type="record_id",
            setting_input=record_id_settings,
            input_node_ids=[record_id_settings.depending_on_id],
        )
        return self

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_select(self, select_settings: input_schema.NodeSelect) -> "FlowGraph":
        """Adds a node to select, rename, reorder, or drop columns.

        Args:
            select_settings: The settings for the select operation.

        Returns:
            The `FlowGraph` instance for method chaining.
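
        Example:
            A minimal sketch; `select_settings` is assumed to be a configured
            `input_schema.NodeSelect` whose `select_input` lists the columns to
            keep, rename, or drop::

                graph.add_select(select_settings)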
        """

        select_cols = select_settings.select_input
        drop_cols = tuple(s.old_name for s in select_settings.select_input)

        def _func(table: FlowDataEngine) -> FlowDataEngine:
            input_cols = set(f.name for f in table.schema)
            ids_to_remove = []
            for i, select_col in enumerate(select_cols):
                if select_col.data_type is None:
                    select_col.data_type = table.get_schema_column(select_col.old_name).data_type
                if select_col.old_name not in input_cols:
                    select_col.is_available = False
                    if not select_col.keep:
                        ids_to_remove.append(i)
                else:
                    select_col.is_available = True
            # Pop from the highest index down so earlier indices stay valid
            for i in reversed(ids_to_remove):
                select_cols.pop(i)
            return table.do_select(
                select_inputs=transform_schema.SelectInputs(select_cols), keep_missing=select_settings.keep_missing
            )

        self.add_node_step(
            node_id=select_settings.node_id,
            function=_func,
            input_columns=[],
            node_type="select",
            drop_columns=list(drop_cols),
            setting_input=select_settings,
            input_node_ids=[select_settings.depending_on_id],
        )
        return self

    @property
    def graph_has_functions(self) -> bool:
        """Checks if the graph has any nodes."""
        return len(self._node_ids) > 0

    def delete_node(self, node_id: int | str):
        """Deletes a node from the graph and updates all its connections.

        Args:
            node_id: The ID of the node to delete.

        Raises:
            Exception: If the node with the given ID does not exist.
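
        Example:
            A minimal sketch; the node ID is illustrative::

                graph.delete_node(3)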
        """
        logger.info(f"Starting deletion of node with ID: {node_id}")

        node = self._node_db.get(node_id)
        if node:
            logger.info(f"Found node: {node_id}, processing deletion")

            lead_to_steps: list[FlowNode] = node.leads_to_nodes
            logger.debug(f"Node {node_id} leads to {len(lead_to_steps)} other nodes")

            if len(lead_to_steps) > 0:
                for lead_to_step in lead_to_steps:
                    logger.debug(f"Deleting input node {node_id} from dependent node {lead_to_step}")
                    lead_to_step.delete_input_node(node_id, complete=True)

            if not node.is_start:
                depends_on: list[FlowNode] = node.node_inputs.get_all_inputs()
                logger.debug(f"Node {node_id} depends on {len(depends_on)} other nodes")

                for depend_on in depends_on:
                    logger.debug(f"Removing lead_to reference {node_id} from node {depend_on}")
                    depend_on.delete_lead_to_node(node_id)

            self._node_db.pop(node_id)
            logger.debug(f"Successfully removed node {node_id} from node_db")
            del node
            logger.info("Node object deleted")
        else:
            logger.error(f"Failed to find node with id {node_id}")
            raise Exception(f"Node with id {node_id} does not exist")

    @property
    def graph_has_input_data(self) -> bool:
        """Checks if the graph has an initial input data source."""
        return self._input_data is not None

    def add_node_step(
        self,
        node_id: int | str,
        function: Callable,
        input_columns: list[str] = None,
        output_schema: list[FlowfileColumn] = None,
        node_type: str = None,
        drop_columns: list[str] = None,
        renew_schema: bool = True,
        setting_input: Any = None,
        cache_results: bool = None,
        schema_callback: Callable = None,
        input_node_ids: list[int] = None,
    ) -> FlowNode:
        """The core method for adding or updating a node in the graph.

        Args:
            node_id: The unique ID for the node.
            function: The core processing function for the node.
            input_columns: A list of input column names required by the function.
            output_schema: A predefined schema for the node's output.
            node_type: A string identifying the type of node (e.g., 'filter', 'join').
            drop_columns: A list of columns to be dropped after the function executes.
            renew_schema: If True, the schema is recalculated after execution.
            setting_input: A configuration object containing settings for the node.
            cache_results: If True, the node's results are cached for future runs.
            schema_callback: A function that dynamically calculates the output schema.
            input_node_ids: A list of IDs for the nodes that this node depends on.

        Returns:
            The created or updated FlowNode object.
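
        Example:
            A minimal sketch of registering a pass-through node; the node IDs,
            node type, and settings object are illustrative assumptions, not a
            prescribed pattern::

                def _identity(table: FlowDataEngine) -> FlowDataEngine:
                    return table

                graph.add_node_step(
                    node_id=42,
                    function=_identity,
                    node_type="polars_code",
                    setting_input=my_settings,  # hypothetical, pre-built settings object
                    input_node_ids=[41],
                )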
        """
        # Wrap schema_callback with output_field_config support
        # If the node has output_field_config enabled, use it for schema prediction
        output_field_config = getattr(setting_input, 'output_field_config', None) if setting_input else None

        logger.info(
            f"add_node_step: node_id={node_id}, node_type={node_type}, "
            f"has_setting_input={setting_input is not None}, "
            f"has_output_field_config={output_field_config is not None}, "
            f"config_enabled={output_field_config.enabled if output_field_config else False}, "
            f"has_schema_callback={schema_callback is not None}"
        )

        # IMPORTANT: Always create a wrapped callback if output_field_config exists (even if enabled=False)
        # This ensures nodes like PolarsCode get a schema callback when output_field_config is defined
        if output_field_config:
            if output_field_config.enabled:
                logger.info(
                    f"add_node_step: Creating/wrapping schema_callback for node {node_id} with output_field_config "
                    f"(validation_mode={output_field_config.validation_mode_behavior}, {len(output_field_config.fields)} fields, "
                    f"base_callback={'present' if schema_callback else 'None'})"
                )
            else:
                logger.debug(f"add_node_step: output_field_config present for node {node_id} but disabled")

            # Even if schema_callback is None, create a wrapped one for output_field_config
            schema_callback = create_schema_callback_with_output_config(schema_callback, output_field_config)
            logger.info(f"add_node_step: schema_callback {'created' if schema_callback else 'failed'} for node {node_id}")

        existing_node = self.get_node(node_id)
        if existing_node is not None:
            if existing_node.node_type != node_type:
                self.delete_node(existing_node.node_id)
                existing_node = None
        if existing_node:
            input_nodes = existing_node.all_inputs
        elif input_node_ids is not None:
            input_nodes = [self.get_node(input_node_id) for input_node_id in input_node_ids]
        else:
            input_nodes = None
        if isinstance(input_columns, str):
            input_columns = [input_columns]
        if (
            input_nodes is not None
            or function.__name__ in ("placeholder", "analysis_preparation")
            or node_type in ("cloud_storage_reader", "polars_lazy_frame", "input_data")
        ):
            if not existing_node:
                node = FlowNode(
                    node_id=node_id,
                    function=function,
                    output_schema=output_schema,
                    input_columns=input_columns,
                    drop_columns=drop_columns,
                    renew_schema=renew_schema,
                    setting_input=setting_input,
                    node_type=node_type,
                    name=function.__name__,
                    schema_callback=schema_callback,
                    parent_uuid=self.uuid,
                )
            else:
                existing_node.update_node(
                    function=function,
                    output_schema=output_schema,
                    input_columns=input_columns,
                    drop_columns=drop_columns,
                    setting_input=setting_input,
                    schema_callback=schema_callback,
                )
                node = existing_node
        else:
            raise Exception("No data initialized")
        self._node_db[node_id] = node
        self._node_ids.append(node_id)
        return node

    def add_include_cols(self, include_columns: list[str]):
        """Adds columns to both the input and output column lists.

        Args:
            include_columns: A list of column names to include.
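
        Example:
            A minimal sketch; the column names are illustrative::

                graph.add_include_cols(["customer_id", "order_total"])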
        """
        for column in include_columns:
            if column not in self._input_cols:
                self._input_cols.append(column)
            if column not in self._output_cols:
                self._output_cols.append(column)
        return self

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_output(self, output_file: input_schema.NodeOutput):
        """Adds an output node to write the final data to a destination.

        Args:
            output_file: The settings for the output file.
        """

        def _func(df: FlowDataEngine):
            execute_remote = self.execution_location != "local"
            df.output(
                output_fs=output_file.output_settings,
                flow_id=self.flow_id,
                node_id=output_file.node_id,
                execute_remote=execute_remote,
            )
            return df

        def schema_callback():
            input_node: FlowNode = self.get_node(output_file.node_id).node_inputs.main_inputs[0]

            return input_node.schema

        input_node_id = output_file.depending_on_id if hasattr(output_file, "depending_on_id") else None
        self.add_node_step(
            node_id=output_file.node_id,
            function=_func,
            input_columns=[],
            node_type="output",
            setting_input=output_file,
            schema_callback=schema_callback,
            input_node_ids=[input_node_id],
        )

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_database_writer(self, node_database_writer: input_schema.NodeDatabaseWriter):
        """Adds a node to write data to a database.

        Args:
            node_database_writer: The settings for the database writer node.
        """

        node_type = "database_writer"
        database_settings: input_schema.DatabaseWriteSettings = node_database_writer.database_write_settings
        database_connection: input_schema.DatabaseConnection | input_schema.FullDatabaseConnection | None
        if database_settings.connection_mode == "inline":
            database_connection: input_schema.DatabaseConnection = database_settings.database_connection
            encrypted_password = get_encrypted_secret(
                current_user_id=node_database_writer.user_id, secret_name=database_connection.password_ref
            )
            if encrypted_password is None:
                raise HTTPException(status_code=400, detail="Password not found")
        else:
            database_reference_settings = get_local_database_connection(
                database_settings.database_connection_name, node_database_writer.user_id
            )
            encrypted_password = database_reference_settings.password.get_secret_value()

        def _func(df: FlowDataEngine):
            df.lazy = True
            database_external_write_settings = (
                sql_models.DatabaseExternalWriteSettings.create_from_from_node_database_writer(
                    node_database_writer=node_database_writer,
                    password=encrypted_password,
                    table_name=(
                        database_settings.schema_name + "." + database_settings.table_name
                        if database_settings.schema_name
                        else database_settings.table_name
                    ),
                    database_reference_settings=(
                        database_reference_settings if database_settings.connection_mode == "reference" else None
                    ),
                    lf=df.data_frame,
                )
            )
            external_database_writer = ExternalDatabaseWriter(
                database_external_write_settings, wait_on_completion=False
            )
            node._fetch_cached_df = external_database_writer
            external_database_writer.get_result()
            return df

        def schema_callback():
            input_node: FlowNode = self.get_node(node_database_writer.node_id).node_inputs.main_inputs[0]
            return input_node.schema

        self.add_node_step(
            node_id=node_database_writer.node_id,
            function=_func,
            input_columns=[],
            node_type=node_type,
            setting_input=node_database_writer,
            schema_callback=schema_callback,
        )
        # Bind the node so the _func closure above can attach the external writer to it for cancellation.
        node = self.get_node(node_database_writer.node_id)

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_database_reader(self, node_database_reader: input_schema.NodeDatabaseReader):
        """Adds a node to read data from a database.

        Args:
            node_database_reader: The settings for the database reader node.
        """

        logger.info("Adding database reader")
        node_type = "database_reader"
        database_settings: input_schema.DatabaseSettings = node_database_reader.database_settings
        database_connection: input_schema.DatabaseConnection | input_schema.FullDatabaseConnection | None
        if database_settings.connection_mode == "inline":
            database_connection: input_schema.DatabaseConnection = database_settings.database_connection
            encrypted_password = get_encrypted_secret(
                current_user_id=node_database_reader.user_id, secret_name=database_connection.password_ref
            )
            if encrypted_password is None:
                raise HTTPException(status_code=400, detail="Password not found")
        else:
            database_reference_settings = get_local_database_connection(
                database_settings.database_connection_name, node_database_reader.user_id
            )
            database_connection = database_reference_settings
            encrypted_password = database_reference_settings.password.get_secret_value()

        def _func():
            sql_source = BaseSqlSource(
                query=None if database_settings.query_mode == "table" else database_settings.query,
                table_name=database_settings.table_name,
                schema_name=database_settings.schema_name,
                fields=node_database_reader.fields,
            )
            database_external_read_settings = (
                sql_models.DatabaseExternalReadSettings.create_from_from_node_database_reader(
                    node_database_reader=node_database_reader,
                    password=encrypted_password,
                    query=sql_source.query,
                    database_reference_settings=(
                        database_reference_settings if database_settings.connection_mode == "reference" else None
                    ),
                )
            )

            external_database_fetcher = ExternalDatabaseFetcher(
                database_external_read_settings, wait_on_completion=False
            )
            node._fetch_cached_df = external_database_fetcher
            fl = FlowDataEngine(external_database_fetcher.get_result())
            node_database_reader.fields = [c.get_minimal_field_info() for c in fl.schema]
            return fl

        def schema_callback():
            sql_source = SqlSource(
                connection_string=sql_utils.construct_sql_uri(
                    database_type=database_connection.database_type,
                    host=database_connection.host,
                    port=database_connection.port,
                    database=database_connection.database,
                    username=database_connection.username,
                    password=decrypt_secret(encrypted_password),
                ),
                query=None if database_settings.query_mode == "table" else database_settings.query,
                table_name=database_settings.table_name,
                schema_name=database_settings.schema_name,
                fields=node_database_reader.fields,
            )
            return sql_source.get_schema()

        node = self.get_node(node_database_reader.node_id)
        if node:
            node.node_type = node_type
            node.name = node_type
            node.function = _func
            node.setting_input = node_database_reader
            node.node_settings.cache_results = node_database_reader.cache_results
            self.add_node_to_starting_list(node)
            node.schema_callback = schema_callback
        else:
            node = FlowNode(
                node_database_reader.node_id,
                function=_func,
                setting_input=node_database_reader,
                name=node_type,
                node_type=node_type,
                parent_uuid=self.uuid,
                schema_callback=schema_callback,
            )
            self._node_db[node_database_reader.node_id] = node
            self.add_node_to_starting_list(node)
            self._node_ids.append(node_database_reader.node_id)

    def add_sql_source(self, external_source_input: input_schema.NodeExternalSource):
        """Adds a node that reads data from a SQL source.

        This is a convenience alias for `add_external_source`.

        Args:
            external_source_input: The settings for the external SQL source node.
        """
        logger.info("Adding sql source")
        self.add_external_source(external_source_input)

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_cloud_storage_writer(self, node_cloud_storage_writer: input_schema.NodeCloudStorageWriter) -> None:
        """Adds a node to write data to a cloud storage provider.

        Args:
            node_cloud_storage_writer: The settings for the cloud storage writer node.
        """

        node_type = "cloud_storage_writer"

        def _func(df: FlowDataEngine):
            df.lazy = True
            execute_remote = self.execution_location != "local"
            cloud_connection_settings = get_cloud_connection_settings(
                connection_name=node_cloud_storage_writer.cloud_storage_settings.connection_name,
                user_id=node_cloud_storage_writer.user_id,
                auth_mode=node_cloud_storage_writer.cloud_storage_settings.auth_mode,
            )
            full_cloud_storage_connection = FullCloudStorageConnection(
                storage_type=cloud_connection_settings.storage_type,
                auth_method=cloud_connection_settings.auth_method,
                aws_allow_unsafe_html=cloud_connection_settings.aws_allow_unsafe_html,
                **CloudStorageReader.get_storage_options(cloud_connection_settings),
            )
            if execute_remote:
                settings = get_cloud_storage_write_settings_worker_interface(
                    write_settings=node_cloud_storage_writer.cloud_storage_settings,
                    connection=full_cloud_storage_connection,
                    lf=df.data_frame,
                    user_id=node_cloud_storage_writer.user_id,
                    flowfile_node_id=node_cloud_storage_writer.node_id,
                    flowfile_flow_id=self.flow_id,
                )
                external_database_writer = ExternalCloudWriter(settings, wait_on_completion=False)
                node._fetch_cached_df = external_database_writer
                external_database_writer.get_result()
            else:
                cloud_storage_write_settings_internal = CloudStorageWriteSettingsInternal(
                    connection=full_cloud_storage_connection,
                    write_settings=node_cloud_storage_writer.cloud_storage_settings,
                )
                df.to_cloud_storage_obj(cloud_storage_write_settings_internal)
            return df

        def schema_callback():
            logger.info("Starting to run the schema callback for cloud storage writer")
            if self.get_node(node_cloud_storage_writer.node_id).is_correct:
                return self.get_node(node_cloud_storage_writer.node_id).node_inputs.main_inputs[0].schema
            else:
                return [FlowfileColumn.from_input(column_name="__error__", data_type="String")]

        self.add_node_step(
            node_id=node_cloud_storage_writer.node_id,
            function=_func,
            input_columns=[],
            node_type=node_type,
            setting_input=node_cloud_storage_writer,
            schema_callback=schema_callback,
            input_node_ids=[node_cloud_storage_writer.depending_on_id],
        )

        # Bind the node so the _func closure above can attach the remote writer to it for cancellation.
        node = self.get_node(node_cloud_storage_writer.node_id)

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_cloud_storage_reader(self, node_cloud_storage_reader: input_schema.NodeCloudStorageReader) -> None:
        """Adds a cloud storage read node to the flow graph.

        Args:
            node_cloud_storage_reader: The settings for the cloud storage read node.
        """
        node_type = "cloud_storage_reader"
        logger.info("Adding cloud storage reader")
        cloud_storage_read_settings = node_cloud_storage_reader.cloud_storage_settings

        def _func():
            logger.info("Starting to run the schema callback for cloud storage reader")
            self.flow_logger.info("Starting to run the schema callback for cloud storage reader")
            settings = CloudStorageReadSettingsInternal(
                read_settings=cloud_storage_read_settings,
                connection=get_cloud_connection_settings(
                    connection_name=cloud_storage_read_settings.connection_name,
                    user_id=node_cloud_storage_reader.user_id,
                    auth_mode=cloud_storage_read_settings.auth_mode,
                ),
            )
            fl = FlowDataEngine.from_cloud_storage_obj(settings)
            return fl

        node = self.add_node_step(
            node_id=node_cloud_storage_reader.node_id,
            function=_func,
            cache_results=node_cloud_storage_reader.cache_results,
            setting_input=node_cloud_storage_reader,
            node_type=node_type,
        )
        self.add_node_to_starting_list(node)

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_external_source(self, external_source_input: input_schema.NodeExternalSource):
        """Adds a node for a custom external data source.

        Args:
            external_source_input: The settings for the external source node.
        """

        node_type = "external_source"
        external_source_script = getattr(external_sources.custom_external_sources, external_source_input.identifier)
        source_settings = getattr(
            input_schema, snake_case_to_camel_case(external_source_input.identifier)
        ).model_validate(external_source_input.source_settings)
        if hasattr(external_source_script, "initial_getter"):
            initial_getter = external_source_script.initial_getter(source_settings)
        else:
            initial_getter = None
        data_getter = external_source_script.getter(source_settings)
        external_source = data_source_factory(
            source_type="custom",
            data_getter=data_getter,
            initial_data_getter=initial_getter,
            orientation=external_source_input.source_settings.orientation,
            schema=None,
        )

        def _func():
            logger.info("Calling external source")
            fl = FlowDataEngine.create_from_external_source(external_source=external_source)
            external_source_input.source_settings.fields = [c.get_minimal_field_info() for c in fl.schema]
            return fl

        node = self.get_node(external_source_input.node_id)
        if node:
            node.node_type = node_type
            node.name = node_type
            node.function = _func
            node.setting_input = external_source_input
            node.node_settings.cache_results = external_source_input.cache_results
            self.add_node_to_starting_list(node)

        else:
            node = FlowNode(
                external_source_input.node_id,
                function=_func,
                setting_input=external_source_input,
                name=node_type,
                node_type=node_type,
                parent_uuid=self.uuid,
            )
            self._node_db[external_source_input.node_id] = node
            self.add_node_to_starting_list(node)
            self._node_ids.append(external_source_input.node_id)
        if external_source_input.source_settings.fields and len(external_source_input.source_settings.fields) > 0:
            logger.info("Using provided schema in the node")

            def schema_callback():
                return [
                    FlowfileColumn.from_input(f.name, f.data_type) for f in external_source_input.source_settings.fields
                ]

            node.schema_callback = schema_callback
        else:
            logger.warning("Removing schema")
            node._schema_callback = None
        self.add_node_step(
            node_id=external_source_input.node_id,
            function=_func,
            input_columns=[],
            node_type=node_type,
            setting_input=external_source_input,
        )

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_read(self, input_file: input_schema.NodeRead):
        """Adds a node to read data from a local file (e.g., CSV, Parquet, Excel).

        Args:
            input_file: The settings for the read operation.
        """
        if (
            input_file.received_file.file_type in ("xlsx", "excel")
            and input_file.received_file.table_settings.sheet_name == ""
        ):
            sheet_name = fastexcel.read_excel(input_file.received_file.path).sheet_names[0]
            input_file.received_file.table_settings.sheet_name = sheet_name

        received_file = input_file.received_file
        input_file.received_file.set_absolute_filepath()

        def _func():
            input_file.received_file.set_absolute_filepath()
            if input_file.received_file.file_type == "parquet":
                input_data = FlowDataEngine.create_from_path(input_file.received_file)
            elif (
                input_file.received_file.file_type == "csv"
                and "utf" in input_file.received_file.table_settings.encoding
            ):
                input_data = FlowDataEngine.create_from_path(input_file.received_file)
            else:
                input_data = FlowDataEngine.create_from_path_worker(
                    input_file.received_file, node_id=input_file.node_id, flow_id=self.flow_id
                )
            input_data.name = input_file.received_file.name
            return input_data

        node = self.get_node(input_file.node_id)
        schema_callback = None
        if node:
            start_hash = node.hash
            node.node_type = "read"
            node.name = "read"
            node.function = _func
            node.setting_input = input_file
            self.add_node_to_starting_list(node)

            if start_hash != node.hash:
                logger.info("Hash changed, updating schema")
                if len(received_file.fields) > 0:
                    # If the file has fields defined, we can use them to create the schema
                    def schema_callback():
                        return [FlowfileColumn.from_input(f.name, f.data_type) for f in received_file.fields]

                elif input_file.received_file.file_type in ("csv", "json", "parquet"):
                    # everything that can be scanned by polars
                    def schema_callback():
                        input_data = FlowDataEngine.create_from_path(input_file.received_file)
                        return input_data.schema

                elif input_file.received_file.file_type in ("xlsx", "excel"):
                    # If the file is an Excel file, we need to use the openpyxl engine to read the schema
                    schema_callback = get_xlsx_schema_callback(
                        engine="openpyxl",
                        file_path=received_file.file_path,
                        sheet_name=received_file.table_settings.sheet_name,
                        start_row=received_file.table_settings.start_row,
                        end_row=received_file.table_settings.end_row,
                        start_column=received_file.table_settings.start_column,
                        end_column=received_file.table_settings.end_column,
                        has_headers=received_file.table_settings.has_headers,
                    )
                else:
                    schema_callback = None
        else:
            node = FlowNode(
                input_file.node_id,
                function=_func,
                setting_input=input_file,
                name="read",
                node_type="read",
                parent_uuid=self.uuid,
            )
            self._node_db[input_file.node_id] = node
            self.add_node_to_starting_list(node)
            self._node_ids.append(input_file.node_id)

        if schema_callback is not None:
            node.schema_callback = schema_callback
            node.user_provided_schema_callback = schema_callback
        return self

    @with_history_capture(HistoryActionType.UPDATE_SETTINGS)
    def add_datasource(self, input_file: input_schema.NodeDatasource | input_schema.NodeManualInput) -> "FlowGraph":
        """Adds a data source node to the graph.

        This method serves as a factory for creating starting nodes, handling both
        file-based sources and direct manual data entry.

        Args:
            input_file: The configuration object for the data source.

        Returns:
            The `FlowGraph` instance for method chaining.
        """
        if isinstance(input_file, input_schema.NodeManualInput):
            input_data = FlowDataEngine(input_file.raw_data_format)
            ref = "manual_input"
        else:
            input_data = FlowDataEngine(path_ref=input_file.file_ref)
            ref = "datasource"
        node = self.get_node(input_file.node_id)
        if node:
            node.node_type = ref
            node.name = ref
            node.function = input_data
            node.setting_input = input_file
            self.add_node_to_starting_list(node)

        else:
            input_data.collect()
            node = FlowNode(
                input_file.node_id,
                function=input_data,
                setting_input=input_file,
                name=ref,
                node_type=ref,
                parent_uuid=self.uuid,
            )
            self._node_db[input_file.node_id] = node
            self.add_node_to_starting_list(node)
            self._node_ids.append(input_file.node_id)
        return self

    def add_manual_input(self, input_file: input_schema.NodeManualInput):
        """Adds a node for manual data entry.

        This is a convenience alias for `add_datasource`.

        Args:
            input_file: The settings and data for the manual input node.
        """
        self.add_datasource(input_file)

    @property
    def nodes(self) -> list[FlowNode]:
        """Gets a list of all FlowNode objects in the graph."""

        return list(self._node_db.values())

    @property
    def execution_mode(self) -> schemas.ExecutionModeLiteral:
        """Gets the current execution mode ('Development' or 'Performance')."""
        return self.flow_settings.execution_mode

    def get_implicit_starter_nodes(self) -> list[FlowNode]:
        """Finds nodes that can act as starting points but are not explicitly defined as such.

        Some nodes, like the Polars Code node, can function without an input. This
        method identifies such nodes if they have no incoming connections.

        Returns:
            A list of `FlowNode` objects that are implicit starting nodes.
        """
        starting_node_ids = [node.node_id for node in self._flow_starts]
        implicit_starting_nodes = []
        for node in self.nodes:
            if node.node_template.can_be_start and not node.has_input and node.node_id not in starting_node_ids:
                implicit_starting_nodes.append(node)
        return implicit_starting_nodes

    @execution_mode.setter
    def execution_mode(self, mode: schemas.ExecutionModeLiteral):
        """Sets the execution mode for the flow.

        Args:
            mode: The execution mode to set.
        """
        self.flow_settings.execution_mode = mode

    @property
    def execution_location(self) -> schemas.ExecutionLocationsLiteral:
        """Gets the current execution location."""
        return self.flow_settings.execution_location

    @execution_location.setter
    def execution_location(self, execution_location: schemas.ExecutionLocationsLiteral):
        """Sets the execution location for the flow.

        Args:
            execution_location: The execution location to set.
        """
        if self.flow_settings.execution_location != execution_location:
            self.reset()
        self.flow_settings.execution_location = execution_location

    def validate_if_node_can_be_fetched(self, node_id: int) -> None:
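        """Validates that a node exists and has all inputs needed to be fetched.

        Args:
            node_id: The ID of the node to validate.

        Raises:
            Exception: If the node does not exist, or if it would be skipped
                because one or more of its inputs are missing.
        """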
        flow_node = self._node_db.get(node_id)
        if not flow_node:
            raise Exception("Node not found found")
        execution_plan = compute_execution_plan(
            nodes=self.nodes, flow_starts=self._flow_starts + self.get_implicit_starter_nodes()
        )
        if flow_node.node_id in [skip_node.node_id for skip_node in execution_plan.skip_nodes]:
            raise Exception("Node can not be executed because it does not have it's inputs")

    def create_initial_run_information(self, number_of_nodes: int, run_type: Literal["fetch_one", "full_run"]):
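        """Creates a fresh RunInformation record for a run that is about to start.

        Args:
            number_of_nodes: The number of nodes scheduled for this run.
            run_type: Either "fetch_one" for a single-node fetch or "full_run".

        Returns:
            A RunInformation object with the start time set and no results yet.
        """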
        return RunInformation(
            flow_id=self.flow_id,
            start_time=datetime.datetime.now(),
            end_time=None,
            success=None,
            number_of_nodes=number_of_nodes,
            node_step_result=[],
            run_type=run_type,
        )

    def create_empty_run_information(self) -> RunInformation:
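        """Creates a placeholder RunInformation for when no run has been executed yet.

        Returns:
            A RunInformation object with zero nodes and the run type "init".
        """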
        return RunInformation(
            flow_id=self.flow_id,
            start_time=None,
            end_time=None,
            success=None,
            number_of_nodes=0,
            node_step_result=[],
            run_type="init",
        )

    def trigger_fetch_node(self, node_id: int) -> RunInformation | None:
        """Executes a specific node in the graph by its ID."""
        if self.flow_settings.is_running:
            raise Exception("Flow is already running")
        flow_node = self.get_node(node_id)
        self.flow_settings.is_running = True
        self.flow_settings.is_canceled = False
        self.flow_logger.clear_log_file()
        self.latest_run_info = self.create_initial_run_information(1, "fetch_one")
        node_logger = self.flow_logger.get_node_logger(flow_node.node_id)
        node_result = NodeResult(node_id=flow_node.node_id, node_name=flow_node.name)
        logger.info(f"Starting to run: node {flow_node.node_id}, start time: {node_result.start_timestamp}")
        try:
            self.latest_run_info.node_step_result.append(node_result)
            flow_node.execute_node(
                run_location=self.flow_settings.execution_location,
                performance_mode=False,
                node_logger=node_logger,
                optimize_for_downstream=False,
                reset_cache=True,
            )
            node_result.error = str(flow_node.results.errors)
            if self.flow_settings.is_canceled:
                node_result.success = None
                node_result.is_running = False
            else:
                node_result.success = flow_node.results.errors is None
            node_result.end_timestamp = time()
            node_result.run_time = int(node_result.end_timestamp - node_result.start_timestamp)
            node_result.is_running = False
            self.latest_run_info.nodes_completed += 1
            self.latest_run_info.end_time = datetime.datetime.now()
            self.flow_settings.is_running = False
            return self.get_run_info()
        except Exception as e:
            node_result.error = "Node did not run"
            node_result.success = False
            node_result.end_timestamp = time()
            node_result.run_time = int(node_result.end_timestamp - node_result.start_timestamp)
            node_result.is_running = False
            node_logger.error(f"Error in node {flow_node.node_id}: {e}")
        finally:
            self.flow_settings.is_running = False

    def _execute_single_node(
        self,
        node: FlowNode,
        performance_mode: bool,
        run_info_lock: threading.Lock,
    ) -> tuple[NodeResult, FlowNode]:
        """Executes a single node, records its result, and returns both.

        Thread-safe: uses run_info_lock when mutating shared run information.

        Args:
            node: The node to execute.
            performance_mode: Whether to run in performance mode.
            run_info_lock: Lock protecting shared RunInformation state.

        Returns:
            A (NodeResult, FlowNode) tuple for post-stage failure propagation.
        """
        node_logger = self.flow_logger.get_node_logger(node.node_id)
        node_result = NodeResult(node_id=node.node_id, node_name=node.name)

        with run_info_lock:
            self.latest_run_info.node_step_result.append(node_result)

        logger.info(f"Starting to run: node {node.node_id}, start time: {node_result.start_timestamp}")
        node.execute_node(
            run_location=self.flow_settings.execution_location,
            performance_mode=performance_mode,
            node_logger=node_logger,
        )
        try:
            node_result.error = str(node.results.errors)
            if self.flow_settings.is_canceled:
                node_result.success = None
                node_result.is_running = False
                return node_result, node
            node_result.success = node.results.errors is None
            node_result.end_timestamp = time()
            node_result.run_time = int(node_result.end_timestamp - node_result.start_timestamp)
            node_result.is_running = False
        except Exception as e:
            node_result.error = "Node did not run"
            node_result.success = False
            node_result.end_timestamp = time()
            node_result.run_time = int(node_result.end_timestamp - node_result.start_timestamp)
            node_result.is_running = False
            node_logger.error(f"Error in node {node.node_id}: {e}")

        node_logger.info(f"Completed node with success: {node_result.success}")
        with run_info_lock:
            self.latest_run_info.nodes_completed += 1

        return node_result, node

    def run_graph(self) -> RunInformation | None:
        """Executes the entire data flow graph from start to finish.

        Independent nodes within the same execution stage are run in parallel
        using threads. Stages are processed sequentially so that all dependencies
        are satisfied before a stage begins.

        Returns:
            A RunInformation object summarizing the execution results.

        Raises:
            Exception: If the flow is already running.
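
        Example:
            A minimal sketch, assuming the graph already contains configured
            and connected nodes::

                run_info = graph.run_graph()
                if run_info is not None:
                    print(run_info.success, run_info.nodes_completed)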
        """
        if self.flow_settings.is_running:
            raise Exception("Flow is already running")
        try:
            self.flow_settings.is_running = True
            self.flow_settings.is_canceled = False
            self.flow_logger.clear_log_file()
            self.flow_logger.info("Starting to run flowfile flow...")
            execution_plan = compute_execution_plan(
                nodes=self.nodes, flow_starts=self._flow_starts + self.get_implicit_starter_nodes()
            )

            self.latest_run_info = self.create_initial_run_information(
                execution_plan.node_count, "full_run"
            )

            skip_node_message(self.flow_logger, execution_plan.skip_nodes)
            execution_order_message(self.flow_logger, execution_plan.stages)
            performance_mode = self.flow_settings.execution_mode == "Performance"

            run_info_lock = threading.Lock()
            skip_node_ids: set[str | int] = {n.node_id for n in execution_plan.skip_nodes}

            for stage in execution_plan.stages:
                if self.flow_settings.is_canceled:
                    self.flow_logger.info("Flow canceled")
                    break

                nodes_to_run = [n for n in stage.nodes if n.node_id not in skip_node_ids]

                for skipped in stage.nodes:
                    if skipped.node_id in skip_node_ids:
                        node_logger = self.flow_logger.get_node_logger(skipped.node_id)
                        node_logger.info(f"Skipping node {skipped.node_id}")

                if not nodes_to_run:
                    continue

                is_local = self.flow_settings.execution_location == "local"
                max_workers = 1 if is_local else self.flow_settings.max_parallel_workers
                if len(nodes_to_run) == 1 or max_workers == 1:
                    # Single node or parallelism disabled — run sequentially
                    stage_results = [
                        self._execute_single_node(node, performance_mode, run_info_lock)
                        for node in nodes_to_run
                    ]
                else:
                    # Multiple independent nodes — run in parallel
                    stage_results: list[tuple[NodeResult, FlowNode]] = []
                    workers = min(max_workers, len(nodes_to_run))
                    with ThreadPoolExecutor(max_workers=workers) as executor:
                        futures = {
                            executor.submit(
                                self._execute_single_node, node, performance_mode, run_info_lock
                            ): node
                            for node in nodes_to_run
                        }
                        for future in as_completed(futures):
                            stage_results.append(future.result())

                # After the stage completes, propagate failures to downstream nodes
                for node_result, node in stage_results:
                    if not node_result.success:
                        for dep in node.get_all_dependent_nodes():
                            skip_node_ids.add(dep.node_id)

            self.latest_run_info.end_time = datetime.datetime.now()
            self.flow_logger.info("Flow completed!")
            self.end_datetime = datetime.datetime.now()
            self.flow_settings.is_running = False
            if self.flow_settings.is_canceled:
                self.flow_logger.info("Flow canceled")
            return self.get_run_info()
        except Exception as e:
            raise e
        finally:
            self.flow_settings.is_running = False

    def get_run_info(self) -> RunInformation:
        """Gets a summary of the most recent graph execution.

        Returns:
            A RunInformation object with details about the last run.
        """
        is_running = self.flow_settings.is_running
        if self.latest_run_info is None:
            return self.create_empty_run_information()

        elif not is_running and self.latest_run_info.success is not None:
            return self.latest_run_info

        run_info = self.latest_run_info
        if not is_running:
            run_info.success = all(nr.success for nr in run_info.node_step_result)
        return run_info

    @property
    def node_connections(self) -> list[tuple[int, int]]:
        """Computes and returns a list of all connections in the graph.

        Returns:
            A list of tuples, where each tuple is a (source_id, target_id) pair.
        """
        connections = set()
        for node in self.nodes:
            outgoing_connections = [(node.node_id, ltn.node_id) for ltn in node.leads_to_nodes]
            incoming_connections = [(don.node_id, node.node_id) for don in node.all_inputs]
            node_connections = [
                c for c in outgoing_connections + incoming_connections if (c[0] is not None and c[1] is not None)
            ]
            for node_connection in node_connections:
                if node_connection not in connections:
                    connections.add(node_connection)
        return list(connections)

    def get_node_data(self, node_id: int, include_example: bool = True) -> NodeData:
        """Retrieves all data needed to render a node in the UI.

        Args:
            node_id: The ID of the node.
            include_example: Whether to include data samples in the result.

        Returns:
            A NodeData object for the requested node.

        Raises:
            KeyError: If no node with the given ID exists in the graph.
        """
        node = self._node_db[node_id]
        return node.get_node_data(flow_id=self.flow_id, include_example=include_example)

    def get_flowfile_data(self) -> schemas.FlowfileData:
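        """Serializes the graph into a FlowfileData object.

        This is the representation used by `save_flow` for the YAML and JSON
        formats.

        Returns:
            A FlowfileData object describing the flow settings and all nodes.
        """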
        start_node_ids = {v.node_id for v in self._flow_starts}

        nodes = []
        for node in self.nodes:
            node_info = node.get_node_information()
            flowfile_node = schemas.FlowfileNode(
                id=node_info.id,
                type=node_info.type,
                is_start_node=node.node_id in start_node_ids,
                description=node_info.description,
                node_reference=node_info.node_reference,
                x_position=int(node_info.x_position),
                y_position=int(node_info.y_position),
                left_input_id=node_info.left_input_id,
                right_input_id=node_info.right_input_id,
                input_ids=node_info.input_ids,
                outputs=node_info.outputs,
                setting_input=node_info.setting_input,
            )
            nodes.append(flowfile_node)

        settings = schemas.FlowfileSettings(
            description=self.flow_settings.description,
            execution_mode=self.flow_settings.execution_mode,
            execution_location=self.flow_settings.execution_location,
            auto_save=self.flow_settings.auto_save,
            show_detailed_progress=self.flow_settings.show_detailed_progress,
            max_parallel_workers=self.flow_settings.max_parallel_workers,
        )
        return schemas.FlowfileData(
            flowfile_version=__version__,
            flowfile_id=self.flow_id,
            flowfile_name=self.__name__,
            flowfile_settings=settings,
            nodes=nodes,
        )

    def get_node_storage(self) -> schemas.FlowInformation:
        """Serializes the entire graph's state into a storable format.

        Returns:
            A FlowInformation object representing the complete graph.
        """
        node_information = {
            node.node_id: node.get_node_information() for node in self.nodes if node.is_setup and node.is_correct
        }

        return schemas.FlowInformation(
            flow_id=self.flow_id,
            flow_name=self.__name__,
            flow_settings=self.flow_settings,
            data=node_information,
            node_starts=[v.node_id for v in self._flow_starts],
            node_connections=self.node_connections,
        )

    def cancel(self):
        """Cancels an ongoing graph execution."""

        if not self.flow_settings.is_running:
            return
        self.flow_settings.is_canceled = True
        for node in self.nodes:
            node.cancel()

    def close_flow(self):
        """Performs cleanup operations, such as clearing node caches."""

        for node in self.nodes:
            node.remove_cache()

    def _handle_flow_renaming(self, new_name: str, new_path: Path):
        """
        Handle the rename of a flow when it is being saved.
        """
        if (
            self.flow_settings
            and self.flow_settings.path
            and Path(self.flow_settings.path).absolute() != new_path.absolute()
        ):
            self.__name__ = new_name
            self.flow_settings.save_location = str(new_path.absolute())
            self.flow_settings.name = new_name
        if self.flow_settings and not self.flow_settings.save_location:
            self.flow_settings.save_location = str(new_path.absolute())
            self.__name__ = new_name
            self.flow_settings.name = new_name

    def save_flow(self, flow_path: str):
        """Saves the current state of the flow graph to a file.

        Supports multiple formats based on file extension:
        - .yaml / .yml: New YAML format
        - .json: JSON format

        Args:
            flow_path: The path where the flow file will be saved.
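
        Example:
            A minimal sketch; the path is illustrative::

                graph.save_flow("flows/my_flow.yaml")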
        """
        logger.info("Saving flow to %s", flow_path)
        path = Path(flow_path)
        os.makedirs(path.parent, exist_ok=True)
        suffix = path.suffix.lower()
        new_flow_name = path.stem
        self._handle_flow_renaming(new_flow_name, path)
        self.flow_settings.modified_on = datetime.datetime.now().timestamp()
        try:
            if suffix == ".flowfile":
                raise DeprecationWarning(
                    "The .flowfile format is deprecated. Please use .yaml or .json formats.\n\n"
                    "Or stay on v0.4.1 if you still need .flowfile support.\n\n"
                )
            elif suffix in (".yaml", ".yml"):
                flowfile_data = self.get_flowfile_data()
                data = flowfile_data.model_dump(mode="json")
                with open(flow_path, "w", encoding="utf-8") as f:
                    yaml.dump(data, f, default_flow_style=False, sort_keys=False, allow_unicode=True)
            elif suffix == ".json":
                flowfile_data = self.get_flowfile_data()
                data = flowfile_data.model_dump(mode="json")
                with open(flow_path, "w", encoding="utf-8") as f:
                    json.dump(data, f, indent=2, ensure_ascii=False)

            else:
                flowfile_data = self.get_flowfile_data()
                logger.warning(f"Unknown file extension {suffix}. Defaulting to YAML format.")
                data = flowfile_data.model_dump(mode="json")
                with open(flow_path, "w", encoding="utf-8") as f:
                    yaml.dump(data, f, default_flow_style=False, sort_keys=False, allow_unicode=True)

        except Exception as e:
            logger.error(f"Error saving flow: {e}")
            raise

        self.flow_settings.path = flow_path

    def get_frontend_data(self) -> dict:
        """Formats the graph structure into a JSON-like dictionary for a specific legacy frontend.

        This method transforms the graph's state into a format compatible with the
        Drawflow.js library.

        Returns:
            A dictionary representing the graph in Drawflow format.
        """
        result = {"Home": {"data": {}}}
        flow_info: schemas.FlowInformation = self.get_node_storage()

        for node_id, node_info in flow_info.data.items():
            if node_info.is_setup:
                try:
                    pos_x = node_info.data.pos_x
                    pos_y = node_info.data.pos_y
                    # Basic node structure
                    result["Home"]["data"][str(node_id)] = {
                        "id": node_info.id,
                        "name": node_info.type,
                        "data": {},  # Additional data can go here
                        "class": node_info.type,
                        "html": node_info.type,
                        "typenode": "vue",
                        "inputs": {},
                        "outputs": {},
                        "pos_x": pos_x,
                        "pos_y": pos_y,
                    }
                except Exception as e:
                    logger.error(e)
            # Add outputs to the node based on `outputs` in your backend data
            if node_info.outputs:
                outputs = {o: 0 for o in node_info.outputs}
                for o in node_info.outputs:
                    outputs[o] += 1
                connections = []
                for output_node_id, n_connections in outputs.items():
                    leading_to_node = self.get_node(output_node_id)
                    input_types = leading_to_node.get_input_type(node_info.id)
                    for input_type in input_types:
                        if input_type == "main":
                            input_frontend_id = "input_1"
                        elif input_type == "right":
                            input_frontend_id = "input_2"
                        elif input_type == "left":
                            input_frontend_id = "input_3"
                        else:
                            input_frontend_id = "input_1"
                        connection = {"node": str(output_node_id), "input": input_frontend_id}
                        connections.append(connection)

                result["Home"]["data"][str(node_id)]["outputs"]["output_1"] = {"connections": connections}
            else:
                result["Home"]["data"][str(node_id)]["outputs"] = {"output_1": {"connections": []}}

            # Add input to the node based on `depending_on_id` in your backend data
            if (
                node_info.left_input_id is not None
                or node_info.right_input_id is not None
                or node_info.input_ids is not None
            ):
                main_inputs = node_info.main_input_ids
                result["Home"]["data"][str(node_id)]["inputs"]["input_1"] = {
                    "connections": [{"node": str(main_node_id), "input": "output_1"} for main_node_id in main_inputs]
                }
                if node_info.right_input_id is not None:
                    result["Home"]["data"][str(node_id)]["inputs"]["input_2"] = {
                        "connections": [{"node": str(node_info.right_input_id), "input": "output_1"}]
                    }
                if node_info.left_input_id is not None:
                    result["Home"]["data"][str(node_id)]["inputs"]["input_3"] = {
                        "connections": [{"node": str(node_info.left_input_id), "input": "output_1"}]
                    }
        return result

    def get_vue_flow_input(self) -> schemas.VueFlowInput:
        """Formats the graph's nodes and edges into a schema suitable for the VueFlow frontend.

        Returns:
            A VueFlowInput object.
        """
        edges: list[schemas.NodeEdge] = []
        nodes: list[schemas.NodeInput] = []
        for node in self.nodes:
            nodes.append(node.get_node_input())
            edges.extend(node.get_edge_input())
        return schemas.VueFlowInput(node_edges=edges, node_inputs=nodes)

    def reset(self):
        """Forces a deep reset on all nodes in the graph."""

        for node in self.nodes:
            node.reset(True)

    def copy_node(
        self, new_node_settings: input_schema.NodePromise, existing_setting_input: Any, node_type: str
    ) -> None:
        """Creates a copy of an existing node.

        Args:
            new_node_settings: The promise containing new settings (like ID and position).
            existing_setting_input: The settings object from the node being copied.
            node_type: The type of the node being copied.
        """
        self.add_node_promise(new_node_settings)

        if isinstance(existing_setting_input, input_schema.NodePromise):
            return

        combined_settings = combine_existing_settings_and_new_settings(existing_setting_input, new_node_settings)
        getattr(self, f"add_{node_type}")(combined_settings)

    def generate_code(self):
        """Generates code for the flow graph.
        This method exports the flow graph to a Polars-compatible format.
        """
        from flowfile_core.flowfile.code_generator.code_generator import export_flow_to_polars

        print(export_flow_to_polars(self))
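The save_flow and generate_code methods above can be exercised directly once a graph is configured. A minimal usage sketch, assuming graph is an existing FlowGraph instance:

graph.save_flow("flows/my_pipeline.yaml")   # .yaml / .yml is written as YAML
graph.save_flow("flows/my_pipeline.json")   # .json is written as JSON
graph.generate_code()                       # prints the Polars export of the graph

Unknown extensions fall back to YAML with a warning, and the deprecated .flowfile extension raises a DeprecationWarning.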
execution_location property writable

Gets the current execution location.

execution_mode property writable

Gets the current execution mode ('Development' or 'Performance').

flow_id property writable

Gets the unique identifier of the flow.

graph_has_functions property

Checks if the graph has any nodes.

graph_has_input_data property

Checks if the graph has an initial input data source.

node_connections property

Computes and returns a list of all connections in the graph.

Returns:

    list[tuple[int, int]]: A list of tuples, where each tuple is a (source_id, target_id) pair.
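A quick way to inspect the wiring of an existing graph, assuming graph is a populated FlowGraph:

for source_id, target_id in graph.node_connections:
    print(f"node {source_id} -> node {target_id}")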

nodes property

Gets a list of all FlowNode objects in the graph.

__init__(flow_settings, name=None, input_cols=None, output_cols=None, path_ref=None, input_flow=None, cache_results=False)

Initializes a new FlowGraph instance.

Parameters:

    flow_settings (FlowSettings | FlowGraphConfig): The configuration settings for the flow. Required.
    name (str): The name of the flow. Default: None.
    input_cols (list[str]): A list of input column names. Default: None.
    output_cols (list[str]): A list of output column names. Default: None.
    path_ref (str): An optional path to an initial data source. Default: None.
    input_flow (Union[ParquetFile, FlowDataEngine, FlowGraph]): An optional existing data object to start the flow with. Default: None.
    cache_results (bool): A global flag to enable or disable result caching. Default: False.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def __init__(
    self,
    flow_settings: schemas.FlowSettings | schemas.FlowGraphConfig,
    name: str = None,
    input_cols: list[str] = None,
    output_cols: list[str] = None,
    path_ref: str = None,
    input_flow: Union[ParquetFile, FlowDataEngine, "FlowGraph"] = None,
    cache_results: bool = False,
):
    """Initializes a new FlowGraph instance.

    Args:
        flow_settings: The configuration settings for the flow.
        name: The name of the flow.
        input_cols: A list of input column names.
        output_cols: A list of output column names.
        path_ref: An optional path to an initial data source.
        input_flow: An optional existing data object to start the flow with.
        cache_results: A global flag to enable or disable result caching.
    """
    if isinstance(flow_settings, schemas.FlowGraphConfig):
        flow_settings = schemas.FlowSettings.from_flow_settings_input(flow_settings)

    self._flow_settings = flow_settings
    self.uuid = str(uuid1())
    self.start_datetime = None
    self.end_datetime = None
    self.latest_run_info = None
    self._flow_id = flow_settings.flow_id
    self.flow_logger = FlowLogger(flow_settings.flow_id)
    self._flow_starts: list[FlowNode] = []
    self._results = None
    self.schema = None
    self.has_over_row_function = False
    self._input_cols = [] if input_cols is None else input_cols
    self._output_cols = [] if output_cols is None else output_cols
    self._node_ids = []
    self._node_db = {}
    self.cache_results = cache_results
    self.__name__ = name if name else "flow_" + str(id(self))
    self.depends_on = {}

    # Initialize history manager for undo/redo support
    from flowfile_core.flowfile.history_manager import HistoryManager
    from flowfile_core.schemas.history_schema import HistoryConfig
    history_config = HistoryConfig(enabled=flow_settings.track_history)
    self._history_manager = HistoryManager(config=history_config)

    if path_ref is not None:
        self.add_datasource(input_schema.NodeDatasource(file_path=path_ref))
    elif input_flow is not None:
        self.add_datasource(input_file=input_flow)
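A minimal construction sketch. The FlowGraph import path matches this reference; the schemas import path and the FlowGraphConfig fields below are assumptions and may differ in your installation:

from flowfile_core.flowfile.flow_graph import FlowGraph
from flowfile_core.schemas import schemas  # assumed import path

config = schemas.FlowGraphConfig(flow_id=1)  # flow_id is an assumed constructor field
graph = FlowGraph(flow_settings=config, name="example_flow")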
__repr__()

Provides the official string representation of the FlowGraph instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def __repr__(self):
    """Provides the official string representation of the FlowGraph instance."""
    settings_str = "  -" + "\n  -".join(f"{k}: {v}" for k, v in self.flow_settings)
    return f"FlowGraph(\nNodes: {self._node_db}\n\nSettings:\n{settings_str}"
add_cloud_storage_reader(node_cloud_storage_reader)

Adds a cloud storage read node to the flow graph.

Parameters:

    node_cloud_storage_reader (NodeCloudStorageReader): The settings for the cloud storage read node. Required.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_cloud_storage_reader(self, node_cloud_storage_reader: input_schema.NodeCloudStorageReader) -> None:
    """Adds a cloud storage read node to the flow graph.

    Args:
        node_cloud_storage_reader: The settings for the cloud storage read node.
    """
    node_type = "cloud_storage_reader"
    logger.info("Adding cloud storage reader")
    cloud_storage_read_settings = node_cloud_storage_reader.cloud_storage_settings

    def _func():
        logger.info("Starting to run the schema callback for cloud storage reader")
        self.flow_logger.info("Starting to run the schema callback for cloud storage reader")
        settings = CloudStorageReadSettingsInternal(
            read_settings=cloud_storage_read_settings,
            connection=get_cloud_connection_settings(
                connection_name=cloud_storage_read_settings.connection_name,
                user_id=node_cloud_storage_reader.user_id,
                auth_mode=cloud_storage_read_settings.auth_mode,
            ),
        )
        fl = FlowDataEngine.from_cloud_storage_obj(settings)
        return fl

    node = self.add_node_step(
        node_id=node_cloud_storage_reader.node_id,
        function=_func,
        cache_results=node_cloud_storage_reader.cache_results,
        setting_input=node_cloud_storage_reader,
        node_type=node_type,
    )
    self.add_node_to_starting_list(node)
add_cloud_storage_writer(node_cloud_storage_writer)

Adds a node to write data to a cloud storage provider.

Parameters:

    node_cloud_storage_writer (NodeCloudStorageWriter): The settings for the cloud storage writer node. Required.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_cloud_storage_writer(self, node_cloud_storage_writer: input_schema.NodeCloudStorageWriter) -> None:
    """Adds a node to write data to a cloud storage provider.

    Args:
        node_cloud_storage_writer: The settings for the cloud storage writer node.
    """

    node_type = "cloud_storage_writer"

    def _func(df: FlowDataEngine):
        df.lazy = True
        execute_remote = self.execution_location != "local"
        cloud_connection_settings = get_cloud_connection_settings(
            connection_name=node_cloud_storage_writer.cloud_storage_settings.connection_name,
            user_id=node_cloud_storage_writer.user_id,
            auth_mode=node_cloud_storage_writer.cloud_storage_settings.auth_mode,
        )
        full_cloud_storage_connection = FullCloudStorageConnection(
            storage_type=cloud_connection_settings.storage_type,
            auth_method=cloud_connection_settings.auth_method,
            aws_allow_unsafe_html=cloud_connection_settings.aws_allow_unsafe_html,
            **CloudStorageReader.get_storage_options(cloud_connection_settings),
        )
        if execute_remote:
            settings = get_cloud_storage_write_settings_worker_interface(
                write_settings=node_cloud_storage_writer.cloud_storage_settings,
                connection=full_cloud_storage_connection,
                lf=df.data_frame,
                user_id=node_cloud_storage_writer.user_id,
                flowfile_node_id=node_cloud_storage_writer.node_id,
                flowfile_flow_id=self.flow_id,
            )
            external_database_writer = ExternalCloudWriter(settings, wait_on_completion=False)
            node._fetch_cached_df = external_database_writer
            external_database_writer.get_result()
        else:
            cloud_storage_write_settings_internal = CloudStorageWriteSettingsInternal(
                connection=full_cloud_storage_connection,
                write_settings=node_cloud_storage_writer.cloud_storage_settings,
            )
            df.to_cloud_storage_obj(cloud_storage_write_settings_internal)
        return df

    def schema_callback():
        logger.info("Starting to run the schema callback for cloud storage writer")
        if self.get_node(node_cloud_storage_writer.node_id).is_correct:
            return self.get_node(node_cloud_storage_writer.node_id).node_inputs.main_inputs[0].schema
        else:
            return [FlowfileColumn.from_input(column_name="__error__", data_type="String")]

    self.add_node_step(
        node_id=node_cloud_storage_writer.node_id,
        function=_func,
        input_columns=[],
        node_type=node_type,
        setting_input=node_cloud_storage_writer,
        schema_callback=schema_callback,
        input_node_ids=[node_cloud_storage_writer.depending_on_id],
    )

    node = self.get_node(node_cloud_storage_writer.node_id)
add_cross_join(cross_join_settings)

Adds a cross join node to the graph.

Parameters:

    cross_join_settings (NodeCrossJoin): The settings for the cross join operation. Required.

Returns:

    FlowGraph: The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_cross_join(self, cross_join_settings: input_schema.NodeCrossJoin) -> "FlowGraph":
    """Adds a cross join node to the graph.

    Args:
        cross_join_settings: The settings for the cross join operation.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
        for left_select in cross_join_settings.cross_join_input.left_select.renames:
            left_select.is_available = True if left_select.old_name in main.schema else False
        for right_select in cross_join_settings.cross_join_input.right_select.renames:
            right_select.is_available = True if right_select.old_name in right.schema else False
        return main.do_cross_join(
            cross_join_input=cross_join_settings.cross_join_input,
            auto_generate_selection=cross_join_settings.auto_generate_selection,
            verify_integrity=False,
            other=right,
        )

    self.add_node_step(
        node_id=cross_join_settings.node_id,
        function=_func,
        input_columns=[],
        node_type="cross_join",
        setting_input=cross_join_settings,
        input_node_ids=cross_join_settings.depending_on_ids,
    )
    return self
add_database_reader(node_database_reader)

Adds a node to read data from a database.

Parameters:

    node_database_reader (NodeDatabaseReader): The settings for the database reader node. Required.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_database_reader(self, node_database_reader: input_schema.NodeDatabaseReader):
    """Adds a node to read data from a database.

    Args:
        node_database_reader: The settings for the database reader node.
    """

    logger.info("Adding database reader")
    node_type = "database_reader"
    database_settings: input_schema.DatabaseSettings = node_database_reader.database_settings
    database_connection: input_schema.DatabaseConnection | input_schema.FullDatabaseConnection | None
    if database_settings.connection_mode == "inline":
        database_connection: input_schema.DatabaseConnection = database_settings.database_connection
        encrypted_password = get_encrypted_secret(
            current_user_id=node_database_reader.user_id, secret_name=database_connection.password_ref
        )
        if encrypted_password is None:
            raise HTTPException(status_code=400, detail="Password not found")
    else:
        database_reference_settings = get_local_database_connection(
            database_settings.database_connection_name, node_database_reader.user_id
        )
        database_connection = database_reference_settings
        encrypted_password = database_reference_settings.password.get_secret_value()

    def _func():
        sql_source = BaseSqlSource(
            query=None if database_settings.query_mode == "table" else database_settings.query,
            table_name=database_settings.table_name,
            schema_name=database_settings.schema_name,
            fields=node_database_reader.fields,
        )
        database_external_read_settings = (
            sql_models.DatabaseExternalReadSettings.create_from_from_node_database_reader(
                node_database_reader=node_database_reader,
                password=encrypted_password,
                query=sql_source.query,
                database_reference_settings=(
                    database_reference_settings if database_settings.connection_mode == "reference" else None
                ),
            )
        )

        external_database_fetcher = ExternalDatabaseFetcher(
            database_external_read_settings, wait_on_completion=False
        )
        node._fetch_cached_df = external_database_fetcher
        fl = FlowDataEngine(external_database_fetcher.get_result())
        node_database_reader.fields = [c.get_minimal_field_info() for c in fl.schema]
        return fl

    def schema_callback():
        sql_source = SqlSource(
            connection_string=sql_utils.construct_sql_uri(
                database_type=database_connection.database_type,
                host=database_connection.host,
                port=database_connection.port,
                database=database_connection.database,
                username=database_connection.username,
                password=decrypt_secret(encrypted_password),
            ),
            query=None if database_settings.query_mode == "table" else database_settings.query,
            table_name=database_settings.table_name,
            schema_name=database_settings.schema_name,
            fields=node_database_reader.fields,
        )
        return sql_source.get_schema()

    node = self.get_node(node_database_reader.node_id)
    if node:
        node.node_type = node_type
        node.name = node_type
        node.function = _func
        node.setting_input = node_database_reader
        node.node_settings.cache_results = node_database_reader.cache_results
        self.add_node_to_starting_list(node)
        node.schema_callback = schema_callback
    else:
        node = FlowNode(
            node_database_reader.node_id,
            function=_func,
            setting_input=node_database_reader,
            name=node_type,
            node_type=node_type,
            parent_uuid=self.uuid,
            schema_callback=schema_callback,
        )
        self._node_db[node_database_reader.node_id] = node
        self.add_node_to_starting_list(node)
        self._node_ids.append(node_database_reader.node_id)
add_database_writer(node_database_writer)

Adds a node to write data to a database.

Parameters:

    node_database_writer (NodeDatabaseWriter): The settings for the database writer node. Required.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_database_writer(self, node_database_writer: input_schema.NodeDatabaseWriter):
    """Adds a node to write data to a database.

    Args:
        node_database_writer: The settings for the database writer node.
    """

    node_type = "database_writer"
    database_settings: input_schema.DatabaseWriteSettings = node_database_writer.database_write_settings
    database_connection: input_schema.DatabaseConnection | input_schema.FullDatabaseConnection | None
    if database_settings.connection_mode == "inline":
        database_connection: input_schema.DatabaseConnection = database_settings.database_connection
        encrypted_password = get_encrypted_secret(
            current_user_id=node_database_writer.user_id, secret_name=database_connection.password_ref
        )
        if encrypted_password is None:
            raise HTTPException(status_code=400, detail="Password not found")
    else:
        database_reference_settings = get_local_database_connection(
            database_settings.database_connection_name, node_database_writer.user_id
        )
        encrypted_password = database_reference_settings.password.get_secret_value()

    def _func(df: FlowDataEngine):
        df.lazy = True
        database_external_write_settings = (
            sql_models.DatabaseExternalWriteSettings.create_from_from_node_database_writer(
                node_database_writer=node_database_writer,
                password=encrypted_password,
                table_name=(
                    database_settings.schema_name + "." + database_settings.table_name
                    if database_settings.schema_name
                    else database_settings.table_name
                ),
                database_reference_settings=(
                    database_reference_settings if database_settings.connection_mode == "reference" else None
                ),
                lf=df.data_frame,
            )
        )
        external_database_writer = ExternalDatabaseWriter(
            database_external_write_settings, wait_on_completion=False
        )
        node._fetch_cached_df = external_database_writer
        external_database_writer.get_result()
        return df

    def schema_callback():
        input_node: FlowNode = self.get_node(node_database_writer.node_id).node_inputs.main_inputs[0]
        return input_node.schema

    self.add_node_step(
        node_id=node_database_writer.node_id,
        function=_func,
        input_columns=[],
        node_type=node_type,
        setting_input=node_database_writer,
        schema_callback=schema_callback,
    )
    node = self.get_node(node_database_writer.node_id)
add_datasource(input_file)

Adds a data source node to the graph.

This method serves as a factory for creating starting nodes, handling both file-based sources and direct manual data entry.

Parameters:

    input_file (NodeDatasource | NodeManualInput): The configuration object for the data source. Required.

Returns:

    FlowGraph: The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_datasource(self, input_file: input_schema.NodeDatasource | input_schema.NodeManualInput) -> "FlowGraph":
    """Adds a data source node to the graph.

    This method serves as a factory for creating starting nodes, handling both
    file-based sources and direct manual data entry.

    Args:
        input_file: The configuration object for the data source.

    Returns:
        The `FlowGraph` instance for method chaining.
    """
    if isinstance(input_file, input_schema.NodeManualInput):
        input_data = FlowDataEngine(input_file.raw_data_format)
        ref = "manual_input"
    else:
        input_data = FlowDataEngine(path_ref=input_file.file_ref)
        ref = "datasource"
    node = self.get_node(input_file.node_id)
    if node:
        node.node_type = ref
        node.name = ref
        node.function = input_data
        node.setting_input = input_file
        self.add_node_to_starting_list(node)

    else:
        input_data.collect()
        node = FlowNode(
            input_file.node_id,
            function=input_data,
            setting_input=input_file,
            name=ref,
            node_type=ref,
            parent_uuid=self.uuid,
        )
        self._node_db[input_file.node_id] = node
        self.add_node_to_starting_list(node)
        self._node_ids.append(input_file.node_id)
    return self
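A minimal sketch that mirrors how __init__ wires a path_ref into the graph; any NodeDatasource fields not set here (node and flow IDs) are assumed to fall back to their schema defaults:

graph.add_datasource(input_schema.NodeDatasource(file_path="data/customers.csv"))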
add_dependency_on_polars_lazy_frame(lazy_frame, node_id)

Adds a special node that directly injects a Polars LazyFrame into the graph.

Note: This is intended for backend use and will not work in the UI editor.

Parameters:

    lazy_frame (LazyFrame): The Polars LazyFrame to inject. Required.
    node_id (int): The ID for the new node. Required.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_dependency_on_polars_lazy_frame(self, lazy_frame: pl.LazyFrame, node_id: int):
    """Adds a special node that directly injects a Polars LazyFrame into the graph.

    Note: This is intended for backend use and will not work in the UI editor.

    Args:
        lazy_frame: The Polars LazyFrame to inject.
        node_id: The ID for the new node.
    """

    def _func():
        return FlowDataEngine(lazy_frame)

    node_promise = input_schema.NodePromise(
        flow_id=self.flow_id, node_id=node_id, node_type="polars_lazy_frame", is_setup=True
    )
    self.add_node_step(
        node_id=node_promise.node_id, node_type=node_promise.node_type, function=_func, setting_input=node_promise
    )
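A sketch of injecting an in-memory Polars LazyFrame as a starting node (backend-only, as noted above), assuming graph is an existing FlowGraph:

import polars as pl

lf = pl.LazyFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
graph.add_dependency_on_polars_lazy_frame(lf, node_id=1)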
add_explore_data(node_analysis)

Adds a specialized node for data exploration and visualization.

Parameters:

    node_analysis (NodeExploreData): The settings for the data exploration node. Required.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_explore_data(self, node_analysis: input_schema.NodeExploreData):
    """Adds a specialized node for data exploration and visualization.

    Args:
        node_analysis: The settings for the data exploration node.
    """
    sample_size: int = 10000

    def analysis_preparation(flowfile_table: FlowDataEngine):
        if flowfile_table.number_of_records <= 0:
            number_of_records = flowfile_table.get_number_of_records(calculate_in_worker_process=True)
        else:
            number_of_records = flowfile_table.number_of_records
        if number_of_records > sample_size:
            flowfile_table = flowfile_table.get_sample(sample_size, random=True)
        external_sampler = ExternalDfFetcher(
            lf=flowfile_table.data_frame,
            file_ref="__gf_walker" + node.hash,
            wait_on_completion=True,
            node_id=node.node_id,
            flow_id=self.flow_id,
        )
        node.results.analysis_data_generator = get_read_top_n(
            external_sampler.status.file_ref, n=min(sample_size, number_of_records)
        )
        return flowfile_table

    def schema_callback():
        node = self.get_node(node_analysis.node_id)
        if len(node.all_inputs) == 1:
            input_node = node.all_inputs[0]
            return input_node.schema
        else:
            return [FlowfileColumn.from_input("col_1", "na")]

    self.add_node_step(
        node_id=node_analysis.node_id,
        node_type="explore_data",
        function=analysis_preparation,
        setting_input=node_analysis,
        schema_callback=schema_callback,
    )
    node = self.get_node(node_analysis.node_id)
add_external_source(external_source_input)

Adds a node for a custom external data source.

Parameters:

    external_source_input (NodeExternalSource): The settings for the external source node. Required.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_external_source(self, external_source_input: input_schema.NodeExternalSource):
    """Adds a node for a custom external data source.

    Args:
        external_source_input: The settings for the external source node.
    """

    node_type = "external_source"
    external_source_script = getattr(external_sources.custom_external_sources, external_source_input.identifier)
    source_settings = getattr(
        input_schema, snake_case_to_camel_case(external_source_input.identifier)
    ).model_validate(external_source_input.source_settings)
    if hasattr(external_source_script, "initial_getter"):
        initial_getter = external_source_script.initial_getter(source_settings)
    else:
        initial_getter = None
    data_getter = external_source_script.getter(source_settings)
    external_source = data_source_factory(
        source_type="custom",
        data_getter=data_getter,
        initial_data_getter=initial_getter,
        orientation=external_source_input.source_settings.orientation,
        schema=None,
    )

    def _func():
        logger.info("Calling external source")
        fl = FlowDataEngine.create_from_external_source(external_source=external_source)
        external_source_input.source_settings.fields = [c.get_minimal_field_info() for c in fl.schema]
        return fl

    node = self.get_node(external_source_input.node_id)
    if node:
        node.node_type = node_type
        node.name = node_type
        node.function = _func
        node.setting_input = external_source_input
        node.node_settings.cache_results = external_source_input.cache_results
        self.add_node_to_starting_list(node)

    else:
        node = FlowNode(
            external_source_input.node_id,
            function=_func,
            setting_input=external_source_input,
            name=node_type,
            node_type=node_type,
            parent_uuid=self.uuid,
        )
        self._node_db[external_source_input.node_id] = node
        self.add_node_to_starting_list(node)
        self._node_ids.append(external_source_input.node_id)
    if external_source_input.source_settings.fields and len(external_source_input.source_settings.fields) > 0:
        logger.info("Using provided schema in the node")

        def schema_callback():
            return [
                FlowfileColumn.from_input(f.name, f.data_type) for f in external_source_input.source_settings.fields
            ]

        node.schema_callback = schema_callback
    else:
        logger.warning("Removing schema")
        node._schema_callback = None
    self.add_node_step(
        node_id=external_source_input.node_id,
        function=_func,
        input_columns=[],
        node_type=node_type,
        setting_input=external_source_input,
    )
add_filter(filter_settings)

Adds a filter node to the graph.

Parameters:

    filter_settings (NodeFilter): The settings for the filter operation. Required.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_filter(self, filter_settings: input_schema.NodeFilter):
    """Adds a filter node to the graph.

    Args:
        filter_settings: The settings for the filter operation.
    """

    def _func(fl: FlowDataEngine):
        is_advanced = filter_settings.filter_input.is_advanced()

        if is_advanced:
            predicate = filter_settings.filter_input.advanced_filter
            return fl.do_filter(predicate)
        else:
            basic_filter = filter_settings.filter_input.basic_filter
            if basic_filter is None:
                logger.warning("Basic filter is None, returning unfiltered data")
                return fl

            try:
                field_data_type = fl.get_schema_column(basic_filter.field).generic_datatype()
            except Exception:
                field_data_type = None

            expression = build_filter_expression(basic_filter, field_data_type)
            filter_settings.filter_input.advanced_filter = expression
            return fl.do_filter(expression)

    self.add_node_step(
        filter_settings.node_id,
        _func,
        node_type="filter",
        renew_schema=False,
        setting_input=filter_settings,
        input_node_ids=[filter_settings.depending_on_id],
    )
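A hedged sketch of configuring a filter node. The filter-input class name and the predicate syntax below are assumptions inferred from how add_filter reads its settings (advanced_filter, depending_on_id); check input_schema for the real definitions:

# All class and field names below are illustrative assumptions, not the verified schema.
filter_settings = input_schema.NodeFilter(
    node_id=3,
    depending_on_id=2,
    filter_input=input_schema.FilterInput(advanced_filter="[age] > 30"),
)
graph.add_filter(filter_settings)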
add_formula(function_settings)

Adds a node that applies a formula to create or modify a column.

Parameters:

    function_settings (NodeFormula): The settings for the formula operation. Required.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_formula(self, function_settings: input_schema.NodeFormula):
    """Adds a node that applies a formula to create or modify a column.

    Args:
        function_settings: The settings for the formula operation.
    """

    error = ""
    if function_settings.function.field.data_type not in (None, transform_schema.AUTO_DATA_TYPE):
        output_type = cast_str_to_polars_type(function_settings.function.field.data_type)
    else:
        output_type = None
    if output_type not in (None, transform_schema.AUTO_DATA_TYPE):
        new_col = [
            FlowfileColumn.from_input(column_name=function_settings.function.field.name, data_type=str(output_type))
        ]
    else:
        new_col = [FlowfileColumn.from_input(function_settings.function.field.name, "String")]

    def _func(fl: FlowDataEngine):
        return fl.apply_sql_formula(
            func=function_settings.function.function,
            col_name=function_settings.function.field.name,
            output_data_type=output_type,
        )

    self.add_node_step(
        function_settings.node_id,
        _func,
        output_schema=new_col,
        node_type="formula",
        renew_schema=False,
        setting_input=function_settings,
        input_node_ids=[function_settings.depending_on_id],
    )
    if error != "":
        node = self.get_node(function_settings.node_id)
        node.results.errors = error
        return False, error
    else:
        return True, ""
add_fuzzy_match(fuzzy_settings)

Adds a fuzzy matching node to join data on approximate string matches.

Parameters:

    fuzzy_settings (NodeFuzzyMatch): The settings for the fuzzy match operation. Required.

Returns:

    FlowGraph: The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_fuzzy_match(self, fuzzy_settings: input_schema.NodeFuzzyMatch) -> "FlowGraph":
    """Adds a fuzzy matching node to join data on approximate string matches.

    Args:
        fuzzy_settings: The settings for the fuzzy match operation.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
        node = self.get_node(node_id=fuzzy_settings.node_id)
        if self.execution_location == "local":
            return main.fuzzy_join(
                fuzzy_match_input=deepcopy(fuzzy_settings.join_input),
                other=right,
                node_logger=self.flow_logger.get_node_logger(fuzzy_settings.node_id),
            )

        f = main.start_fuzzy_join(
            fuzzy_match_input=deepcopy(fuzzy_settings.join_input),
            other=right,
            file_ref=node.hash,
            flow_id=self.flow_id,
            node_id=fuzzy_settings.node_id,
        )
        logger.info("Started the fuzzy match action")
        node._fetch_cached_df = f  # Add to the node so it can be cancelled and fetch later if needed
        return FlowDataEngine(f.get_result())

    def schema_callback():
        fm_input_copy = FuzzyMatchInputManager(
            fuzzy_settings.join_input
        )  # Deepcopy create an unique object per func
        node = self.get_node(node_id=fuzzy_settings.node_id)
        return calculate_fuzzy_match_schema(
            fm_input_copy,
            left_schema=node.node_inputs.main_inputs[0].schema,
            right_schema=node.node_inputs.right_input.schema,
        )

    self.add_node_step(
        node_id=fuzzy_settings.node_id,
        function=_func,
        input_columns=[],
        node_type="fuzzy_match",
        setting_input=fuzzy_settings,
        input_node_ids=fuzzy_settings.depending_on_ids,
        schema_callback=schema_callback,
    )

    return self
add_graph_solver(graph_solver_settings)

Adds a node that solves graph-like problems within the data.

This node can be used for operations like finding network paths, calculating connected components, or performing other graph algorithms on relational data that represents nodes and edges.

Parameters:

    graph_solver_settings (NodeGraphSolver): The settings object defining the graph inputs and the specific algorithm to apply. Required.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_graph_solver(self, graph_solver_settings: input_schema.NodeGraphSolver):
    """Adds a node that solves graph-like problems within the data.

    This node can be used for operations like finding network paths,
    calculating connected components, or performing other graph algorithms
    on relational data that represents nodes and edges.

    Args:
        graph_solver_settings: The settings object defining the graph inputs
            and the specific algorithm to apply.
    """

    def _func(fl: FlowDataEngine) -> FlowDataEngine:
        return fl.solve_graph(graph_solver_settings.graph_solver_input)

    self.add_node_step(
        node_id=graph_solver_settings.node_id,
        function=_func,
        node_type="graph_solver",
        setting_input=graph_solver_settings,
        input_node_ids=[graph_solver_settings.depending_on_id],
    )
add_group_by(group_by_settings)

Adds a group-by aggregation node to the graph.

Parameters:

    group_by_settings (NodeGroupBy): The settings for the group-by operation. Required.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_group_by(self, group_by_settings: input_schema.NodeGroupBy):
    """Adds a group-by aggregation node to the graph.

    Args:
        group_by_settings: The settings for the group-by operation.
    """

    def _func(fl: FlowDataEngine) -> FlowDataEngine:
        return fl.do_group_by(group_by_settings.groupby_input, False)

    self.add_node_step(
        node_id=group_by_settings.node_id,
        function=_func,
        node_type="group_by",
        setting_input=group_by_settings,
        input_node_ids=[group_by_settings.depending_on_id],
    )

    node = self.get_node(group_by_settings.node_id)

    def schema_callback():
        output_columns = [(c.old_name, c.new_name, c.output_type) for c in group_by_settings.groupby_input.agg_cols]
        depends_on = node.node_inputs.main_inputs[0]
        input_schema_dict: dict[str, str] = {s.name: s.data_type for s in depends_on.schema}
        output_schema = []
        for old_name, new_name, data_type in output_columns:
            data_type = input_schema_dict[old_name] if data_type is None else data_type
            output_schema.append(FlowfileColumn.from_input(data_type=data_type, column_name=new_name))
        return output_schema

    node.schema_callback = schema_callback
add_include_cols(include_columns)

Adds columns to both the input and output column lists.

Parameters:

    include_columns (list[str]): A list of column names to include. Required.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_include_cols(self, include_columns: list[str]):
    """Adds columns to both the input and output column lists.

    Args:
        include_columns: A list of column names to include.
    """
    for column in include_columns:
        if column not in self._input_cols:
            self._input_cols.append(column)
        if column not in self._output_cols:
            self._output_cols.append(column)
    return self
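For example, assuming graph is an existing FlowGraph; the method extends both internal column lists and returns self, so calls can be chained:

graph.add_include_cols(["customer_id", "order_total"])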
add_initial_node_analysis(node_promise, track_history=True)

Adds a data exploration/analysis node based on a node promise.

Automatically captures history for undo/redo support.

Parameters:

    node_promise (NodePromise): The promise representing the node to be analyzed. Required.
    track_history (bool): Whether to track this change in history. Default: True.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_initial_node_analysis(self, node_promise: input_schema.NodePromise, track_history: bool = True):
    """Adds a data exploration/analysis node based on a node promise.

    Automatically captures history for undo/redo support.

    Args:
        node_promise: The promise representing the node to be analyzed.
        track_history: Whether to track this change in history (default True).
    """
    def _do_add():
        node_analysis = create_graphic_walker_node_from_node_promise(node_promise)
        self.add_explore_data(node_analysis)

    if track_history:
        self._execute_with_history(
            _do_add,
            HistoryActionType.ADD_NODE,
            f"Add {node_promise.node_type} node",
            node_id=node_promise.node_id,
        )
    else:
        _do_add()
add_join(join_settings)

Adds a join node to combine two data streams based on key columns.

Parameters:

    join_settings (NodeJoin): The settings for the join operation. Required.

Returns:

    FlowGraph: The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_join(self, join_settings: input_schema.NodeJoin) -> "FlowGraph":
    """Adds a join node to combine two data streams based on key columns.

    Args:
        join_settings: The settings for the join operation.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(main: FlowDataEngine, right: FlowDataEngine) -> FlowDataEngine:
        for left_select in join_settings.join_input.left_select.renames:
            left_select.is_available = True if left_select.old_name in main.schema else False
        for right_select in join_settings.join_input.right_select.renames:
            right_select.is_available = True if right_select.old_name in right.schema else False
        return main.join(
            join_input=join_settings.join_input,
            auto_generate_selection=join_settings.auto_generate_selection,
            verify_integrity=False,
            other=right,
        )

    self.add_node_step(
        node_id=join_settings.node_id,
        function=_func,
        input_columns=[],
        node_type="join",
        setting_input=join_settings,
        input_node_ids=join_settings.depending_on_ids,
    )
    return self
add_manual_input(input_file)

Adds a node for manual data entry.

This is a convenience alias for add_datasource.

Parameters:

    input_file (NodeManualInput): The settings and data for the manual input node. Required.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_manual_input(self, input_file: input_schema.NodeManualInput):
    """Adds a node for manual data entry.

    This is a convenience alias for `add_datasource`.

    Args:
        input_file: The settings and data for the manual input node.
    """
    self.add_datasource(input_file)
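A hedged sketch; the expected shape of raw_data_format is not shown in this reference, so the list-of-dicts payload below is an assumption:

manual = input_schema.NodeManualInput(
    node_id=1,
    raw_data_format=[{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}],  # assumed payload shape
)
graph.add_manual_input(manual)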
add_node_promise(node_promise, track_history=True)

Adds a placeholder node to the graph that is not yet fully configured.

Useful for building the graph structure before all settings are available. Automatically captures history for undo/redo support.

Parameters:

    node_promise (NodePromise): A promise object containing basic node information. Required.
    track_history (bool): Whether to track this change in history. Default: True.
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_node_promise(self, node_promise: input_schema.NodePromise, track_history: bool = True):
    """Adds a placeholder node to the graph that is not yet fully configured.

    Useful for building the graph structure before all settings are available.
    Automatically captures history for undo/redo support.

    Args:
        node_promise: A promise object containing basic node information.
        track_history: Whether to track this change in history (default True).
    """
    def _do_add():
        def placeholder(n: FlowNode = None):
            if n is None:
                return FlowDataEngine()
            return n

        self.add_node_step(
            node_id=node_promise.node_id,
            node_type=node_promise.node_type,
            function=placeholder,
            setting_input=node_promise,
        )
        if node_promise.is_user_defined:
            node_needs_settings: bool
            custom_node = CUSTOM_NODE_STORE.get(node_promise.node_type)
            if custom_node is None:
                raise Exception(f"Custom node type '{node_promise.node_type}' not found in registry.")
            settings_schema = custom_node.model_fields["settings_schema"].default
            node_needs_settings = settings_schema is not None and not settings_schema.is_empty()
            if not node_needs_settings:
                user_defined_node_settings = input_schema.UserDefinedNode(settings={}, **node_promise.model_dump())
                initialized_model = custom_node()
                self.add_user_defined_node(
                    custom_node=initialized_model, user_defined_node_settings=user_defined_node_settings
                )

    if track_history:
        self._execute_with_history(
            _do_add,
            HistoryActionType.ADD_NODE,
            f"Add {node_promise.node_type} node",
            node_id=node_promise.node_id,
        )
    else:
        _do_add()
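A sketch grounded in how add_dependency_on_polars_lazy_frame builds its own promise; the is_setup value and any NodePromise fields not shown there are assumptions:

promise = input_schema.NodePromise(
    flow_id=graph.flow_id,
    node_id=5,
    node_type="filter",
    is_setup=False,  # assumed: the placeholder is not yet configured
)
graph.add_node_promise(promise)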
add_node_step(node_id, function, input_columns=None, output_schema=None, node_type=None, drop_columns=None, renew_schema=True, setting_input=None, cache_results=None, schema_callback=None, input_node_ids=None)

The core method for adding or updating a node in the graph.

Parameters:

    node_id (int | str): The unique ID for the node. Required.
    function (Callable): The core processing function for the node. Required.
    input_columns (list[str]): A list of input column names required by the function. Default: None.
    output_schema (list[FlowfileColumn]): A predefined schema for the node's output. Default: None.
    node_type (str): A string identifying the type of node (e.g., 'filter', 'join'). Default: None.
    drop_columns (list[str]): A list of columns to be dropped after the function executes. Default: None.
    renew_schema (bool): If True, the schema is recalculated after execution. Default: True.
    setting_input (Any): A configuration object containing settings for the node. Default: None.
    cache_results (bool): If True, the node's results are cached for future runs. Default: None.
    schema_callback (Callable): A function that dynamically calculates the output schema. Default: None.
    input_node_ids (list[int]): A list of IDs for the nodes that this node depends on. Default: None.

Returns:

    FlowNode: The created or updated FlowNode object.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def add_node_step(
    self,
    node_id: int | str,
    function: Callable,
    input_columns: list[str] = None,
    output_schema: list[FlowfileColumn] = None,
    node_type: str = None,
    drop_columns: list[str] = None,
    renew_schema: bool = True,
    setting_input: Any = None,
    cache_results: bool = None,
    schema_callback: Callable = None,
    input_node_ids: list[int] = None,
) -> FlowNode:
    """The core method for adding or updating a node in the graph.

    Args:
        node_id: The unique ID for the node.
        function: The core processing function for the node.
        input_columns: A list of input column names required by the function.
        output_schema: A predefined schema for the node's output.
        node_type: A string identifying the type of node (e.g., 'filter', 'join').
        drop_columns: A list of columns to be dropped after the function executes.
        renew_schema: If True, the schema is recalculated after execution.
        setting_input: A configuration object containing settings for the node.
        cache_results: If True, the node's results are cached for future runs.
        schema_callback: A function that dynamically calculates the output schema.
        input_node_ids: A list of IDs for the nodes that this node depends on.

    Returns:
        The created or updated FlowNode object.
    """
    # Wrap schema_callback with output_field_config support
    # If the node has output_field_config enabled, use it for schema prediction
    output_field_config = getattr(setting_input, 'output_field_config', None) if setting_input else None

    logger.info(
        f"add_node_step: node_id={node_id}, node_type={node_type}, "
        f"has_setting_input={setting_input is not None}, "
        f"has_output_field_config={output_field_config is not None}, "
        f"config_enabled={output_field_config.enabled if output_field_config else False}, "
        f"has_schema_callback={schema_callback is not None}"
    )

    # IMPORTANT: Always create wrapped callback if output_field_config exists (even if enabled=False)
    # This ensures nodes like PolarsCode get a schema callback when output_field_config is defined
    if output_field_config:
        if output_field_config.enabled:
            logger.info(
                f"add_node_step: Creating/wrapping schema_callback for node {node_id} with output_field_config "
                f"(validation_mode={output_field_config.validation_mode_behavior}, {len(output_field_config.fields)} fields, "
                f"base_callback={'present' if schema_callback else 'None'})"
            )
        else:
            logger.debug(f"add_node_step: output_field_config present for node {node_id} but disabled")

        # Even if schema_callback is None, create a wrapped one for output_field_config
        schema_callback = create_schema_callback_with_output_config(schema_callback, output_field_config)
        logger.info(f"add_node_step: schema_callback {'created' if schema_callback else 'failed'} for node {node_id}")

    existing_node = self.get_node(node_id)
    if existing_node is not None:
        if existing_node.node_type != node_type:
            self.delete_node(existing_node.node_id)
            existing_node = None
    if existing_node:
        input_nodes = existing_node.all_inputs
    elif input_node_ids is not None:
        input_nodes = [self.get_node(node_id) for node_id in input_node_ids]
    else:
        input_nodes = None
    if isinstance(input_columns, str):
        input_columns = [input_columns]
    if (
        input_nodes is not None
        or function.__name__ in ("placeholder", "analysis_preparation")
        or node_type in ("cloud_storage_reader", "polars_lazy_frame", "input_data")
    ):
        if not existing_node:
            node = FlowNode(
                node_id=node_id,
                function=function,
                output_schema=output_schema,
                input_columns=input_columns,
                drop_columns=drop_columns,
                renew_schema=renew_schema,
                setting_input=setting_input,
                node_type=node_type,
                name=function.__name__,
                schema_callback=schema_callback,
                parent_uuid=self.uuid,
            )
        else:
            existing_node.update_node(
                function=function,
                output_schema=output_schema,
                input_columns=input_columns,
                drop_columns=drop_columns,
                setting_input=setting_input,
                schema_callback=schema_callback,
            )
            node = existing_node
    else:
        raise Exception("No data initialized")
    self._node_db[node_id] = node
    self._node_ids.append(node_id)
    return node
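For orientation, `schema_callback` is a zero-argument callable that returns the node's output schema, and the wrapping performed above composes an optional base callback with the node's output field configuration. The sketch below shows that composition pattern in generic Python; `wrap_schema_callback` and `field_config` are illustrative names only, not the actual `create_schema_callback_with_output_config` implementation.

def wrap_schema_callback(base_callback, field_config):
    """Illustrative only: compose an optional base schema callback with a field config."""
    def wrapped():
        base_schema = base_callback() if base_callback is not None else []
        # A real implementation would apply field_config (renames, type overrides,
        # validation) to base_schema; this sketch returns it unchanged.
        return base_schema
    return wrapped

callback = wrap_schema_callback(None, field_config=None)
print(callback())  # -> []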
add_node_to_starting_list(node)

Adds a node to the list of starting nodes for the flow if not already present.

Parameters:

Name Type Description Default
node FlowNode

The FlowNode to add as a starting node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 592-599
def add_node_to_starting_list(self, node: FlowNode) -> None:
    """Adds a node to the list of starting nodes for the flow if not already present.

    Args:
        node: The FlowNode to add as a starting node.
    """
    if node.node_id not in {self_node.node_id for self_node in self._flow_starts}:
        self._flow_starts.append(node)
add_output(output_file)

Adds an output node to write the final data to a destination.

Parameters:

Name Type Description Default
output_file NodeOutput

The settings for the output file.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 1658-1690
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_output(self, output_file: input_schema.NodeOutput):
    """Adds an output node to write the final data to a destination.

    Args:
        output_file: The settings for the output file.
    """

    def _func(df: FlowDataEngine):
        execute_remote = self.execution_location != "local"
        df.output(
            output_fs=output_file.output_settings,
            flow_id=self.flow_id,
            node_id=output_file.node_id,
            execute_remote=execute_remote,
        )
        return df

    def schema_callback():
        input_node: FlowNode = self.get_node(output_file.node_id).node_inputs.main_inputs[0]

        return input_node.schema

    input_node_id = output_file.depending_on_id if hasattr(output_file, "depending_on_id") else None
    self.add_node_step(
        node_id=output_file.node_id,
        function=_func,
        input_columns=[],
        node_type="output",
        setting_input=output_file,
        schema_callback=schema_callback,
        input_node_ids=[input_node_id],
    )
add_pivot(pivot_settings)

Adds a pivot node to the graph.

Parameters:

Name Type Description Default
pivot_settings NodePivot

The settings for the pivot operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 868-895
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_pivot(self, pivot_settings: input_schema.NodePivot):
    """Adds a pivot node to the graph.

    Args:
        pivot_settings: The settings for the pivot operation.
    """

    def _func(fl: FlowDataEngine):
        return fl.do_pivot(pivot_settings.pivot_input, self.flow_logger.get_node_logger(pivot_settings.node_id))

    self.add_node_step(
        node_id=pivot_settings.node_id,
        function=_func,
        node_type="pivot",
        setting_input=pivot_settings,
        input_node_ids=[pivot_settings.depending_on_id],
    )

    node = self.get_node(pivot_settings.node_id)

    def schema_callback():
        input_data = node.singular_main_input.get_resulting_data()  # get from the previous step the data
        input_data.lazy = True  # ensure the dataset is lazy
        input_lf = input_data.data_frame  # get the lazy frame
        return pre_calculate_pivot_schema(input_data.schema, pivot_settings.pivot_input, input_lf=input_lf)

    node.schema_callback = schema_callback
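The heavy lifting is delegated to `FlowDataEngine.do_pivot` and `pre_calculate_pivot_schema`. For reference, a plain-Polars pivot that the node conceptually mirrors is shown below; the column names are made up for the example and the keyword-style call assumes Polars >= 1.0.

import polars as pl

df = pl.DataFrame({
    "region": ["EU", "EU", "US", "US"],
    "month": ["jan", "feb", "jan", "feb"],
    "sales": [10, 20, 30, 40],
})
# One output column per month value, aggregated with sum.
wide = df.pivot(on="month", index="region", values="sales", aggregate_function="sum")
print(wide)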
add_polars_code(node_polars_code)

Adds a node that executes custom Polars code.

Parameters:

Name Type Description Default
node_polars_code NodePolarsCode

The settings for the Polars code node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 1094-1117
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_polars_code(self, node_polars_code: input_schema.NodePolarsCode):
    """Adds a node that executes custom Polars code.

    Args:
        node_polars_code: The settings for the Polars code node.
    """

    def _func(*flowfile_tables: FlowDataEngine) -> FlowDataEngine:
        return execute_polars_code(*flowfile_tables, code=node_polars_code.polars_code_input.polars_code)

    self.add_node_step(
        node_id=node_polars_code.node_id,
        function=_func,
        node_type="polars_code",
        setting_input=node_polars_code,
        input_node_ids=node_polars_code.depending_on_ids,
    )

    try:
        polars_code_parser.validate_code(node_polars_code.polars_code_input.polars_code)
    except Exception as e:
        node = self.get_node(node_id=node_polars_code.node_id)
        node.results.errors = str(e)
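The node forwards the user-supplied string to `execute_polars_code` and records any validation error from `polars_code_parser.validate_code` on the node's results. The exact names a snippet can reference are defined by `execute_polars_code` and are not documented here; the sketch below only illustrates the kind of Polars transformation such a snippet typically contains.

import polars as pl

lf = pl.LazyFrame({"city": ["NY", "LA", "SF"], "sales": [10, 25, 40]})
# Filter and derive a column while staying lazy.
result = lf.filter(pl.col("sales") > 15).with_columns(
    (pl.col("sales") * 1.2).alias("sales_incl_tax")
)
print(result.collect())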
add_read(input_file)

Adds a node to read data from a local file (e.g., CSV, Parquet, Excel).

Parameters:

Name Type Description Default
input_file NodeRead

The settings for the read operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 2027-2113
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_read(self, input_file: input_schema.NodeRead):
    """Adds a node to read data from a local file (e.g., CSV, Parquet, Excel).

    Args:
        input_file: The settings for the read operation.
    """
    if (
        input_file.received_file.file_type in ("xlsx", "excel")
        and input_file.received_file.table_settings.sheet_name == ""
    ):
        sheet_name = fastexcel.read_excel(input_file.received_file.path).sheet_names[0]
        input_file.received_file.table_settings.sheet_name = sheet_name

    received_file = input_file.received_file
    input_file.received_file.set_absolute_filepath()

    def _func():
        input_file.received_file.set_absolute_filepath()
        if input_file.received_file.file_type == "parquet":
            input_data = FlowDataEngine.create_from_path(input_file.received_file)
        elif (
            input_file.received_file.file_type == "csv"
            and "utf" in input_file.received_file.table_settings.encoding
        ):
            input_data = FlowDataEngine.create_from_path(input_file.received_file)
        else:
            input_data = FlowDataEngine.create_from_path_worker(
                input_file.received_file, node_id=input_file.node_id, flow_id=self.flow_id
            )
        input_data.name = input_file.received_file.name
        return input_data

    node = self.get_node(input_file.node_id)
    schema_callback = None
    if node:
        start_hash = node.hash
        node.node_type = "read"
        node.name = "read"
        node.function = _func
        node.setting_input = input_file
        self.add_node_to_starting_list(node)

        if start_hash != node.hash:
            logger.info("Hash changed, updating schema")
            if len(received_file.fields) > 0:
                # If the file has fields defined, we can use them to create the schema
                def schema_callback():
                    return [FlowfileColumn.from_input(f.name, f.data_type) for f in received_file.fields]

            elif input_file.received_file.file_type in ("csv", "json", "parquet"):
                # everything that can be scanned by polars
                def schema_callback():
                    input_data = FlowDataEngine.create_from_path(input_file.received_file)
                    return input_data.schema

            elif input_file.received_file.file_type in ("xlsx", "excel"):
                # If the file is an Excel file, we need to use the openpyxl engine to read the schema
                schema_callback = get_xlsx_schema_callback(
                    engine="openpyxl",
                    file_path=received_file.file_path,
                    sheet_name=received_file.table_settings.sheet_name,
                    start_row=received_file.table_settings.start_row,
                    end_row=received_file.table_settings.end_row,
                    start_column=received_file.table_settings.start_column,
                    end_column=received_file.table_settings.end_column,
                    has_headers=received_file.table_settings.has_headers,
                )
            else:
                schema_callback = None
    else:
        node = FlowNode(
            input_file.node_id,
            function=_func,
            setting_input=input_file,
            name="read",
            node_type="read",
            parent_uuid=self.uuid,
        )
        self._node_db[input_file.node_id] = node
        self.add_node_to_starting_list(node)
        self._node_ids.append(input_file.node_id)

    if schema_callback is not None:
        node.schema_callback = schema_callback
        node.user_provided_schema_callback = schema_callback
    return self
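For the formats Polars can scan directly (csv, json, parquet), the schema callback above simply builds a `FlowDataEngine` from the path and reads its schema. The analogous cheap schema probe in plain Polars is a lazy scan; a minimal, self-contained sketch (written against recent Polars, where `collect_schema()` is available):

import pathlib
import tempfile

import polars as pl

# Write a tiny CSV so the example is self-contained.
path = pathlib.Path(tempfile.mkdtemp()) / "example.csv"
path.write_text("id,name\n1,a\n2,b\n")

# Scanning is lazy: the schema is resolved without reading the full file.
schema = pl.scan_csv(path).collect_schema()
print(schema)  # maps column names to dtypes, e.g. id -> Int64, name -> String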
add_record_count(node_number_of_records)

Adds a record count node to the graph.

Parameters:

Name Type Description Default
node_number_of_records NodeRecordCount

The settings for the record count operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 1075-1092
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_record_count(self, node_number_of_records: input_schema.NodeRecordCount):
    """Adds a filter node to the graph.

    Args:
        node_number_of_records: The settings for the record count operation.
    """

    def _func(fl: FlowDataEngine) -> FlowDataEngine:
        return fl.get_record_count()

    self.add_node_step(
        node_id=node_number_of_records.node_id,
        function=_func,
        node_type="record_count",
        setting_input=node_number_of_records,
        input_node_ids=[node_number_of_records.depending_on_id],
    )
add_record_id(record_id_settings)

Adds a node to create a new column with a unique ID for each record.

Parameters:

Name Type Description Default
record_id_settings NodeRecordId

The settings object specifying the name of the new record ID column.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 1419-1441
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_record_id(self, record_id_settings: input_schema.NodeRecordId) -> "FlowGraph":
    """Adds a node to create a new column with a unique ID for each record.

    Args:
        record_id_settings: The settings object specifying the name of the
            new record ID column.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(table: FlowDataEngine) -> FlowDataEngine:
        return table.add_record_id(record_id_settings.record_id_input)

    self.add_node_step(
        node_id=record_id_settings.node_id,
        function=_func,
        node_type="record_id",
        setting_input=record_id_settings,
        input_node_ids=[record_id_settings.depending_on_id],
    )
    return self
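`FlowDataEngine.add_record_id` adds a monotonically increasing identifier column. The plain-Polars equivalent looks like this; the column name and 1-based offset are illustrative choices:

import polars as pl

df = pl.DataFrame({"name": ["a", "b", "c"]})
# with_row_index replaces the older with_row_count in recent Polars versions.
df_with_id = df.with_row_index(name="record_id", offset=1)
print(df_with_id)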
add_sample(sample_settings)

Adds a node to take a random or top-N sample of the data.

Parameters:

Name Type Description Default
sample_settings NodeSample

The settings object specifying the size of the sample.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 1396-1417
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_sample(self, sample_settings: input_schema.NodeSample) -> "FlowGraph":
    """Adds a node to take a random or top-N sample of the data.

    Args:
        sample_settings: The settings object specifying the size of the sample.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(table: FlowDataEngine) -> FlowDataEngine:
        return table.get_sample(sample_settings.sample_size)

    self.add_node_step(
        node_id=sample_settings.node_id,
        function=_func,
        node_type="sample",
        setting_input=sample_settings,
        input_node_ids=[sample_settings.depending_on_id],
    )
    return self
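`get_sample` returns `sample_settings.sample_size` rows. In plain Polars the two common flavours are a deterministic top-N slice and a random sample:

import polars as pl

df = pl.DataFrame({"x": range(100)})
top_n = df.head(10)                   # deterministic top-N
random_n = df.sample(n=10, seed=42)   # random sample, seeded for reproducibility
print(top_n.height, random_n.height)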
add_select(select_settings)

Adds a node to select, rename, reorder, or drop columns.

Parameters:

Name Type Description Default
select_settings NodeSelect

The settings for the select operation.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 1443-1486
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_select(self, select_settings: input_schema.NodeSelect) -> "FlowGraph":
    """Adds a node to select, rename, reorder, or drop columns.

    Args:
        select_settings: The settings for the select operation.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    select_cols = select_settings.select_input
    drop_cols = tuple(s.old_name for s in select_settings.select_input)

    def _func(table: FlowDataEngine) -> FlowDataEngine:
        input_cols = set(f.name for f in table.schema)
        ids_to_remove = []
        for i, select_col in enumerate(select_cols):
            if select_col.data_type is None:
                select_col.data_type = table.get_schema_column(select_col.old_name).data_type
            if select_col.old_name not in input_cols:
                select_col.is_available = False
                if not select_col.keep:
                    ids_to_remove.append(i)
            else:
                select_col.is_available = True
        ids_to_remove.reverse()
        for i in ids_to_remove:
            v = select_cols.pop(i)
            del v
        return table.do_select(
            select_inputs=transform_schema.SelectInputs(select_cols), keep_missing=select_settings.keep_missing
        )

    self.add_node_step(
        node_id=select_settings.node_id,
        function=_func,
        input_columns=[],
        node_type="select",
        drop_columns=list(drop_cols),
        setting_input=select_settings,
        input_node_ids=[select_settings.depending_on_id],
    )
    return self
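The `_func` above reconciles the configured selection with the columns actually present in the input, then delegates to `do_select`. The underlying Polars building blocks for rename/drop/reorder look like this:

import polars as pl

df = pl.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
out = (
    df.rename({"a": "id"})   # rename a -> id
      .drop("c")             # drop a column
      .select(["b", "id"])   # keep and reorder the remaining columns
)
print(out)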
add_sort(sort_settings)

Adds a node to sort the data based on one or more columns.

Parameters:

Name Type Description Default
sort_settings NodeSort

The settings for the sort operation.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 1373-1394
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_sort(self, sort_settings: input_schema.NodeSort) -> "FlowGraph":
    """Adds a node to sort the data based on one or more columns.

    Args:
        sort_settings: The settings for the sort operation.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(table: FlowDataEngine) -> FlowDataEngine:
        return table.do_sort(sort_settings.sort_input)

    self.add_node_step(
        node_id=sort_settings.node_id,
        function=_func,
        node_type="sort",
        setting_input=sort_settings,
        input_node_ids=[sort_settings.depending_on_id],
    )
    return self
add_sql_source(external_source_input)

Adds a node that reads data from a SQL source.

This is a convenience alias for add_external_source.

Parameters:

Name Type Description Default
external_source_input NodeExternalSource

The settings for the external SQL source node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 1846-1855
def add_sql_source(self, external_source_input: input_schema.NodeExternalSource):
    """Adds a node that reads data from a SQL source.

    This is a convenience alias for `add_external_source`.

    Args:
        external_source_input: The settings for the external SQL source node.
    """
    logger.info("Adding sql source")
    self.add_external_source(external_source_input)
add_text_to_rows(node_text_to_rows)

Adds a node that splits cell values into multiple rows.

This is useful for un-nesting data where a single field contains multiple values separated by a delimiter.

Parameters:

Name Type Description Default
node_text_to_rows NodeTextToRows

The settings object that specifies the column to split and the delimiter to use.

required

Returns:

Type Description
FlowGraph

The FlowGraph instance for method chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 1346-1371
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_text_to_rows(self, node_text_to_rows: input_schema.NodeTextToRows) -> "FlowGraph":
    """Adds a node that splits cell values into multiple rows.

    This is useful for un-nesting data where a single field contains multiple
    values separated by a delimiter.

    Args:
        node_text_to_rows: The settings object that specifies the column to split
            and the delimiter to use.

    Returns:
        The `FlowGraph` instance for method chaining.
    """

    def _func(table: FlowDataEngine) -> FlowDataEngine:
        return table.split(node_text_to_rows.text_to_rows_input)

    self.add_node_step(
        node_id=node_text_to_rows.node_id,
        function=_func,
        node_type="text_to_rows",
        setting_input=node_text_to_rows,
        input_node_ids=[node_text_to_rows.depending_on_id],
    )
    return self
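In plain Polars, splitting a delimited cell into multiple rows is a string split followed by an explode (the column name and delimiter are illustrative):

import polars as pl

df = pl.DataFrame({"id": [1, 2], "tags": ["red,blue", "green"]})
rows = df.with_columns(pl.col("tags").str.split(",")).explode("tags")
print(rows)  # one row per tag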
add_union(union_settings)

Adds a union node to combine multiple data streams.

Parameters:

Name Type Description Default
union_settings NodeUnion

The settings for the union operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 916-934
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_union(self, union_settings: input_schema.NodeUnion):
    """Adds a union node to combine multiple data streams.

    Args:
        union_settings: The settings for the union operation.
    """

    def _func(*flowfile_tables: FlowDataEngine):
        dfs: list[pl.LazyFrame] | list[pl.DataFrame] = [flt.data_frame for flt in flowfile_tables]
        return FlowDataEngine(pl.concat(dfs, how="diagonal_relaxed"))

    self.add_node_step(
        node_id=union_settings.node_id,
        function=_func,
        node_type="union",
        setting_input=union_settings,
        input_node_ids=union_settings.depending_on_ids,
    )
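The union concatenates all incoming frames with how="diagonal_relaxed", which aligns columns by name, null-fills columns that are missing from a frame, and relaxes compatible dtype mismatches. A minimal Polars illustration:

import polars as pl

a = pl.DataFrame({"id": [1, 2], "name": ["x", "y"]})
b = pl.DataFrame({"id": [3], "score": [0.5]})
# 'name' and 'score' are null-filled where a frame does not provide them.
print(pl.concat([a, b], how="diagonal_relaxed"))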
add_unique(unique_settings)

Adds a node to find and remove duplicate rows.

Parameters:

Name Type Description Default
unique_settings NodeUnique

The settings for the unique operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 1139-1157
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_unique(self, unique_settings: input_schema.NodeUnique):
    """Adds a node to find and remove duplicate rows.

    Args:
        unique_settings: The settings for the unique operation.
    """

    def _func(fl: FlowDataEngine) -> FlowDataEngine:
        return fl.make_unique(unique_settings.unique_input)

    self.add_node_step(
        node_id=unique_settings.node_id,
        function=_func,
        input_columns=[],
        node_type="unique",
        setting_input=unique_settings,
        input_node_ids=[unique_settings.depending_on_id],
    )
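`make_unique` deduplicates rows according to the configured column subset. The Polars primitive is `unique`; the subset and keep strategy below are illustrative:

import polars as pl

df = pl.DataFrame({"key": [1, 1, 2], "value": ["a", "b", "c"]})
deduped = df.unique(subset=["key"], keep="first", maintain_order=True)
print(deduped)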
add_unpivot(unpivot_settings)

Adds an unpivot node to the graph.

Parameters:

Name Type Description Default
unpivot_settings NodeUnpivot

The settings for the unpivot operation.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 897-914
@with_history_capture(HistoryActionType.UPDATE_SETTINGS)
def add_unpivot(self, unpivot_settings: input_schema.NodeUnpivot):
    """Adds an unpivot node to the graph.

    Args:
        unpivot_settings: The settings for the unpivot operation.
    """

    def _func(fl: FlowDataEngine) -> FlowDataEngine:
        return fl.unpivot(unpivot_settings.unpivot_input)

    self.add_node_step(
        node_id=unpivot_settings.node_id,
        function=_func,
        node_type="unpivot",
        setting_input=unpivot_settings,
        input_node_ids=[unpivot_settings.depending_on_id],
    )
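Unpivoting turns a set of wide columns into (variable, value) rows. In Polars >= 1.0 the primitive is `unpivot` (formerly `melt`); the column names below are illustrative:

import polars as pl

wide = pl.DataFrame({"region": ["EU", "US"], "jan": [10, 30], "feb": [20, 40]})
long = wide.unpivot(on=["jan", "feb"], index="region", variable_name="month", value_name="sales")
print(long)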
add_user_defined_node(*, custom_node, user_defined_node_settings)

Adds a user-defined custom node to the graph.

Parameters:

Name Type Description Default
custom_node CustomNodeBase

The custom node instance to add.

required
user_defined_node_settings UserDefinedNode

The settings for the user-defined node.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 831-866
def add_user_defined_node(
    self, *, custom_node: CustomNodeBase, user_defined_node_settings: input_schema.UserDefinedNode
):
    """Adds a user-defined custom node to the graph.

    Args:
        custom_node: The custom node instance to add.
        user_defined_node_settings: The settings for the user-defined node.
    """

    def _func(*flow_data_engine: FlowDataEngine) -> FlowDataEngine | None:
        user_id = user_defined_node_settings.user_id
        if user_id is not None:
            custom_node.set_execution_context(user_id)
            if custom_node.settings_schema:
                custom_node.settings_schema.set_secret_context(user_id, custom_node.accessed_secrets)

        output = custom_node.process(*(fde.data_frame for fde in flow_data_engine))

        accessed_secrets = custom_node.get_accessed_secrets()
        if accessed_secrets:
            logger.info(f"Node '{user_defined_node_settings.node_id}' accessed secrets: {accessed_secrets}")
        if isinstance(output, (pl.LazyFrame, pl.DataFrame)):
            return FlowDataEngine(output)
        return None

    self.add_node_step(
        node_id=user_defined_node_settings.node_id,
        function=_func,
        setting_input=user_defined_node_settings,
        input_node_ids=user_defined_node_settings.depending_on_ids,
        node_type=custom_node.item,
    )
    if custom_node.number_of_inputs == 0:
        node = self.get_node(user_defined_node_settings.node_id)
        self.add_node_to_starting_list(node)
apply_layout(y_spacing=150, x_spacing=200, initial_y=100)

Calculates and applies a layered layout to all nodes in the graph.

This updates their x and y positions for UI rendering.

Parameters:

Name Type Description Default
y_spacing int

The vertical spacing between layers.

150
x_spacing int

The horizontal spacing between nodes in the same layer.

200
initial_y int

The initial y-position for the first layer.

100
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 647-694
def apply_layout(self, y_spacing: int = 150, x_spacing: int = 200, initial_y: int = 100):
    """Calculates and applies a layered layout to all nodes in the graph.

    This updates their x and y positions for UI rendering.

    Args:
        y_spacing: The vertical spacing between layers.
        x_spacing: The horizontal spacing between nodes in the same layer.
        initial_y: The initial y-position for the first layer.
    """
    self.flow_logger.info("Applying layered layout...")
    start_time = time()
    try:
        # Calculate new positions for all nodes
        new_positions = calculate_layered_layout(
            self, y_spacing=y_spacing, x_spacing=x_spacing, initial_y=initial_y
        )

        if not new_positions:
            self.flow_logger.warning("Layout calculation returned no positions.")
            return

        # Apply the new positions to the setting_input of each node
        updated_count = 0
        for node_id, (pos_x, pos_y) in new_positions.items():
            node = self.get_node(node_id)
            if node and hasattr(node, "setting_input"):
                setting = node.setting_input
                if hasattr(setting, "pos_x") and hasattr(setting, "pos_y"):
                    setting.pos_x = pos_x
                    setting.pos_y = pos_y
                    updated_count += 1
                else:
                    self.flow_logger.warning(
                        f"Node {node_id} setting_input ({type(setting)}) lacks pos_x/pos_y attributes."
                    )
            elif node:
                self.flow_logger.warning(f"Node {node_id} lacks setting_input attribute.")
            # else: Node not found, already warned by calculate_layered_layout

        end_time = time()
        self.flow_logger.info(
            f"Layout applied to {updated_count}/{len(self.nodes)} nodes in {end_time - start_time:.2f} seconds."
        )

    except Exception as e:
        self.flow_logger.error(f"Error applying layout: {e}")
        raise  # Optional: re-raise the exception
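The positions come from `calculate_layered_layout`, which assigns each node to a layer based on its distance from the start nodes and spaces layers and siblings apart. A generic, library-independent sketch of that idea follows (not the actual flowfile implementation):

from collections import deque

def layered_positions(edges, starts, x_spacing=200, y_spacing=150, initial_y=100):
    """Assign (x, y) per node: layer depth controls y, position within a layer controls x."""
    depth = {node: 0 for node in starts}
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if depth.get(child, -1) < depth[node] + 1:
                depth[child] = depth[node] + 1
                queue.append(child)
    layers = {}
    for node, d in depth.items():
        layers.setdefault(d, []).append(node)
    positions = {}
    for d, nodes in layers.items():
        for i, node in enumerate(sorted(nodes)):
            positions[node] = (i * x_spacing, initial_y + d * y_spacing)
    return positions

# Diamond-shaped DAG: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4
print(layered_positions({1: [2, 3], 2: [4], 3: [4]}, starts=[1]))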
cancel()

Cancels an ongoing graph execution.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 2546-2553
def cancel(self):
    """Cancels an ongoing graph execution."""

    if not self.flow_settings.is_running:
        return
    self.flow_settings.is_canceled = True
    for node in self.nodes:
        node.cancel()
capture_history_if_changed(pre_snapshot, action_type, description, node_id=None)

Capture history only if the flow state actually changed.

Use this for settings updates where the change might be a no-op. Call this AFTER the change is applied.

Parameters:

Name Type Description Default
pre_snapshot FlowfileData

The FlowfileData captured BEFORE the change.

required
action_type HistoryActionType

The type of action that was performed.

required
description str

Human-readable description of the action.

required
node_id int

Optional ID of the affected node.

None

Returns:

Type Description
bool

True if a change was detected and snapshot was captured.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 403-426
def capture_history_if_changed(
    self,
    pre_snapshot: schemas.FlowfileData,
    action_type: HistoryActionType,
    description: str,
    node_id: int = None,
) -> bool:
    """Capture history only if the flow state actually changed.

    Use this for settings updates where the change might be a no-op.
    Call this AFTER the change is applied.

    Args:
        pre_snapshot: The FlowfileData captured BEFORE the change.
        action_type: The type of action that was performed.
        description: Human-readable description of the action.
        node_id: Optional ID of the affected node.

    Returns:
        True if a change was detected and snapshot was captured.
    """
    return self._history_manager.capture_if_changed(
        self, pre_snapshot, action_type, description, node_id
    )
capture_history_snapshot(action_type, description, node_id=None)

Capture the current state before a change for undo support.

Parameters:

Name Type Description Default
action_type HistoryActionType

The type of action being performed.

required
description str

Human-readable description of the action.

required
node_id int

Optional ID of the affected node.

None

Returns:

Type Description
bool

True if snapshot was captured, False if skipped.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 385-401
def capture_history_snapshot(
    self,
    action_type: HistoryActionType,
    description: str,
    node_id: int = None,
) -> bool:
    """Capture the current state before a change for undo support.

    Args:
        action_type: The type of action being performed.
        description: Human-readable description of the action.
        node_id: Optional ID of the affected node.

    Returns:
        True if snapshot was captured, False if skipped.
    """
    return self._history_manager.capture_snapshot(self, action_type, description, node_id)
close_flow()

Performs cleanup operations, such as clearing node caches.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 2555-2559
def close_flow(self):
    """Performs cleanup operations, such as clearing node caches."""

    for node in self.nodes:
        node.remove_cache()
copy_node(new_node_settings, existing_setting_input, node_type)

Creates a copy of an existing node.

Parameters:

Name Type Description Default
new_node_settings NodePromise

The promise containing new settings (like ID and position).

required
existing_setting_input Any

The settings object from the node being copied.

required
node_type str

The type of the node being copied.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 2721-2737
def copy_node(
    self, new_node_settings: input_schema.NodePromise, existing_setting_input: Any, node_type: str
) -> None:
    """Creates a copy of an existing node.

    Args:
        new_node_settings: The promise containing new settings (like ID and position).
        existing_setting_input: The settings object from the node being copied.
        node_type: The type of the node being copied.
    """
    self.add_node_promise(new_node_settings)

    if isinstance(existing_setting_input, input_schema.NodePromise):
        return

    combined_settings = combine_existing_settings_and_new_settings(existing_setting_input, new_node_settings)
    getattr(self, f"add_{node_type}")(combined_settings)
delete_node(node_id)

Deletes a node from the graph and updates all its connections.

Parameters:

Name Type Description Default
node_id int | str

The ID of the node to delete.

required

Raises:

Type Description
Exception

If the node with the given ID does not exist.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 1493-1530
def delete_node(self, node_id: int | str):
    """Deletes a node from the graph and updates all its connections.

    Args:
        node_id: The ID of the node to delete.

    Raises:
        Exception: If the node with the given ID does not exist.
    """
    logger.info(f"Starting deletion of node with ID: {node_id}")

    node = self._node_db.get(node_id)
    if node:
        logger.info(f"Found node: {node_id}, processing deletion")

        lead_to_steps: list[FlowNode] = node.leads_to_nodes
        logger.debug(f"Node {node_id} leads to {len(lead_to_steps)} other nodes")

        if len(lead_to_steps) > 0:
            for lead_to_step in lead_to_steps:
                logger.debug(f"Deleting input node {node_id} from dependent node {lead_to_step}")
                lead_to_step.delete_input_node(node_id, complete=True)

        if not node.is_start:
            depends_on: list[FlowNode] = node.node_inputs.get_all_inputs()
            logger.debug(f"Node {node_id} depends on {len(depends_on)} other nodes")

            for depend_on in depends_on:
                logger.debug(f"Removing lead_to reference {node_id} from node {depend_on}")
                depend_on.delete_lead_to_node(node_id)

        self._node_db.pop(node_id)
        logger.debug(f"Successfully removed node {node_id} from node_db")
        del node
        logger.info("Node object deleted")
    else:
        logger.error(f"Failed to find node with id {node_id}")
        raise Exception(f"Node with id {node_id} does not exist")
generate_code()

Generates code for the flow graph. This method exports the flow graph to a Polars-compatible format.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 2739-2745
def generate_code(self):
    """Generates code for the flow graph.
    This method exports the flow graph to a Polars-compatible format.
    """
    from flowfile_core.flowfile.code_generator.code_generator import export_flow_to_polars

    print(export_flow_to_polars(self))
get_frontend_data()

Formats the graph structure into a JSON-like dictionary for a specific legacy frontend.

This method transforms the graph's state into a format compatible with the Drawflow.js library.

Returns:

Type Description
dict

A dictionary representing the graph in Drawflow format.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 2625-2700
def get_frontend_data(self) -> dict:
    """Formats the graph structure into a JSON-like dictionary for a specific legacy frontend.

    This method transforms the graph's state into a format compatible with the
    Drawflow.js library.

    Returns:
        A dictionary representing the graph in Drawflow format.
    """
    result = {"Home": {"data": {}}}
    flow_info: schemas.FlowInformation = self.get_node_storage()

    for node_id, node_info in flow_info.data.items():
        if node_info.is_setup:
            try:
                pos_x = node_info.data.pos_x
                pos_y = node_info.data.pos_y
                # Basic node structure
                result["Home"]["data"][str(node_id)] = {
                    "id": node_info.id,
                    "name": node_info.type,
                    "data": {},  # Additional data can go here
                    "class": node_info.type,
                    "html": node_info.type,
                    "typenode": "vue",
                    "inputs": {},
                    "outputs": {},
                    "pos_x": pos_x,
                    "pos_y": pos_y,
                }
            except Exception as e:
                logger.error(e)
        # Add outputs to the node based on `outputs` in your backend data
        if node_info.outputs:
            outputs = {o: 0 for o in node_info.outputs}
            for o in node_info.outputs:
                outputs[o] += 1
            connections = []
            for output_node_id, n_connections in outputs.items():
                leading_to_node = self.get_node(output_node_id)
                input_types = leading_to_node.get_input_type(node_info.id)
                for input_type in input_types:
                    if input_type == "main":
                        input_frontend_id = "input_1"
                    elif input_type == "right":
                        input_frontend_id = "input_2"
                    elif input_type == "left":
                        input_frontend_id = "input_3"
                    else:
                        input_frontend_id = "input_1"
                    connection = {"node": str(output_node_id), "input": input_frontend_id}
                    connections.append(connection)

            result["Home"]["data"][str(node_id)]["outputs"]["output_1"] = {"connections": connections}
        else:
            result["Home"]["data"][str(node_id)]["outputs"] = {"output_1": {"connections": []}}

        # Add input to the node based on `depending_on_id` in your backend data
        if (
            node_info.left_input_id is not None
            or node_info.right_input_id is not None
            or node_info.input_ids is not None
        ):
            main_inputs = node_info.main_input_ids
            result["Home"]["data"][str(node_id)]["inputs"]["input_1"] = {
                "connections": [{"node": str(main_node_id), "input": "output_1"} for main_node_id in main_inputs]
            }
            if node_info.right_input_id is not None:
                result["Home"]["data"][str(node_id)]["inputs"]["input_2"] = {
                    "connections": [{"node": str(node_info.right_input_id), "input": "output_1"}]
                }
            if node_info.left_input_id is not None:
                result["Home"]["data"][str(node_id)]["inputs"]["input_3"] = {
                    "connections": [{"node": str(node_info.left_input_id), "input": "output_1"}]
                }
    return result
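Given the construction above, the returned dictionary contains one entry per configured node under result["Home"]["data"], keyed by the stringified node id. A hypothetical single-node flow serializes roughly as follows (all values are illustrative):

drawflow_payload = {
    "Home": {
        "data": {
            "1": {
                "id": 1,
                "name": "read",   # the node type is reused for name, class and html
                "data": {},
                "class": "read",
                "html": "read",
                "typenode": "vue",
                "inputs": {},
                "outputs": {"output_1": {"connections": [{"node": "2", "input": "input_1"}]}},
                "pos_x": 100,
                "pos_y": 100,
            }
        }
    }
}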
get_history_state()

Get the current state of the history system.

Returns:

Type Description
HistoryState

HistoryState with information about available undo/redo operations.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 444-450
def get_history_state(self) -> HistoryState:
    """Get the current state of the history system.

    Returns:
        HistoryState with information about available undo/redo operations.
    """
    return self._history_manager.get_state()
get_implicit_starter_nodes()

Finds nodes that can act as starting points but are not explicitly defined as such.

Some nodes, like the Polars Code node, can function without an input. This method identifies such nodes if they have no incoming connections.

Returns:

Type Description
list[FlowNode]

A list of FlowNode objects that are implicit starting nodes.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 2178-2192
def get_implicit_starter_nodes(self) -> list[FlowNode]:
    """Finds nodes that can act as starting points but are not explicitly defined as such.

    Some nodes, like the Polars Code node, can function without an input. This
    method identifies such nodes if they have no incoming connections.

    Returns:
        A list of `FlowNode` objects that are implicit starting nodes.
    """
    starting_node_ids = [node.node_id for node in self._flow_starts]
    implicit_starting_nodes = []
    for node in self.nodes:
        if node.node_template.can_be_start and not node.has_input and node.node_id not in starting_node_ids:
            implicit_starting_nodes.append(node)
    return implicit_starting_nodes
get_node(node_id=None)

Retrieves a node from the graph by its ID.

Parameters:

Name Type Description Default
node_id int | str

The ID of the node to retrieve. If None, retrieves the last added node.

None

Returns:

Type Description
FlowNode | None

The FlowNode object, or None if not found.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 816-829
def get_node(self, node_id: int | str = None) -> FlowNode | None:
    """Retrieves a node from the graph by its ID.

    Args:
        node_id: The ID of the node to retrieve. If None, retrieves the last added node.

    Returns:
        The FlowNode object, or None if not found.
    """
    if node_id is None:
        node_id = self._node_ids[-1]
    node = self._node_db.get(node_id)
    if node is not None:
        return node
get_node_data(node_id, include_example=True)

Retrieves all data needed to render a node in the UI.

Parameters:

Name Type Description Default
node_id int

The ID of the node.

required
include_example bool

Whether to include data samples in the result.

True

Returns:

Type Description
NodeData

A NodeData object, or None if the node is not found.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 2476-2487
def get_node_data(self, node_id: int, include_example: bool = True) -> NodeData:
    """Retrieves all data needed to render a node in the UI.

    Args:
        node_id: The ID of the node.
        include_example: Whether to include data samples in the result.

    Returns:
        A NodeData object, or None if the node is not found.
    """
    node = self._node_db[node_id]
    return node.get_node_data(flow_id=self.flow_id, include_example=include_example)
get_node_storage()

Serializes the entire graph's state into a storable format.

Returns:

Type Description
FlowInformation

A FlowInformation object representing the complete graph.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 2527-2544
def get_node_storage(self) -> schemas.FlowInformation:
    """Serializes the entire graph's state into a storable format.

    Returns:
        A FlowInformation object representing the complete graph.
    """
    node_information = {
        node.node_id: node.get_node_information() for node in self.nodes if node.is_setup and node.is_correct
    }

    return schemas.FlowInformation(
        flow_id=self.flow_id,
        flow_name=self.__name__,
        flow_settings=self.flow_settings,
        data=node_information,
        node_starts=[v.node_id for v in self._flow_starts],
        node_connections=self.node_connections,
    )
get_nodes_overview()

Gets a list of dictionary representations for all nodes in the graph.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 800-805
def get_nodes_overview(self):
    """Gets a list of dictionary representations for all nodes in the graph."""
    output = []
    for v in self._node_db.values():
        output.append(v.get_repr())
    return output
get_run_info()

Gets a summary of the most recent graph execution.

Returns:

Type Description
RunInformation

A RunInformation object with details about the last run.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 2439-2455
def get_run_info(self) -> RunInformation:
    """Gets a summary of the most recent graph execution.

    Returns:
        A RunInformation object with details about the last run.
    """
    is_running = self.flow_settings.is_running
    if self.latest_run_info is None:
        return self.create_empty_run_information()

    elif not is_running and self.latest_run_info.success is not None:
        return self.latest_run_info

    run_info = self.latest_run_info
    if not is_running:
        run_info.success = all(nr.success for nr in run_info.node_step_result)
    return run_info
get_vue_flow_input()

Formats the graph's nodes and edges into a schema suitable for the VueFlow frontend.

Returns:

Type Description
VueFlowInput

A VueFlowInput object.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 2702-2713
def get_vue_flow_input(self) -> schemas.VueFlowInput:
    """Formats the graph's nodes and edges into a schema suitable for the VueFlow frontend.

    Returns:
        A VueFlowInput object.
    """
    edges: list[schemas.NodeEdge] = []
    nodes: list[schemas.NodeInput] = []
    for node in self.nodes:
        nodes.append(node.get_node_input())
        edges.extend(node.get_edge_input())
    return schemas.VueFlowInput(node_edges=edges, node_inputs=nodes)
print_tree()

Print flow_graph as a visual tree structure, showing the DAG relationships with ASCII art.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 719-798
def print_tree(self):
    """Print flow_graph as a visual tree structure, showing the DAG relationships with ASCII art."""
    if not self._node_db:
        self.flow_logger.info("Empty flow graph")
        return

    # Build node information
    node_info = build_node_info(self.nodes)

    # Calculate depths for all nodes
    for node_id in node_info:
        calculate_depth(node_id, node_info)

    # Group nodes by depth
    depth_groups, max_depth = group_nodes_by_depth(node_info)

    # Sort nodes within each depth group
    for depth in depth_groups:
        depth_groups[depth].sort()

    # Create the main flow visualization
    lines = ["=" * 80, "Flow Graph Visualization", "=" * 80, ""]

    # Track which nodes connect to what
    merge_points = define_node_connections(node_info)

    # Build the flow paths

    # Find the maximum label length for each depth level
    max_label_length = {}
    for depth in range(max_depth + 1):
        if depth in depth_groups:
            max_len = max(len(node_info[nid].label) for nid in depth_groups[depth])
            max_label_length[depth] = max_len

    # Draw the paths
    drawn_nodes = set()
    merge_drawn = set()

    # Group paths by their merge points
    paths_by_merge = {}
    standalone_paths = []

    # Build flow paths
    paths = build_flow_paths(node_info, self._flow_starts, merge_points)

    # Define paths to merge and standalone paths
    for path in paths:
        if len(path) > 1 and path[-1] in merge_points and len(merge_points[path[-1]]) > 1:
            merge_id = path[-1]
            if merge_id not in paths_by_merge:
                paths_by_merge[merge_id] = []
            paths_by_merge[merge_id].append(path)
        else:
            standalone_paths.append(path)

    # Draw merged paths
    draw_merged_paths(node_info, merge_points, paths_by_merge, merge_drawn, drawn_nodes, lines)

    # Draw standalone paths
    draw_standalone_paths(drawn_nodes, standalone_paths, lines, node_info)

    # Add undrawn nodes
    add_un_drawn_nodes(drawn_nodes, node_info, lines)

    try:
        execution_plan = compute_execution_plan(
            nodes=self.nodes, flow_starts=self._flow_starts + self.get_implicit_starter_nodes()
        )
        ordered_nodes = execution_plan.all_nodes
        if ordered_nodes:
            for i, node in enumerate(ordered_nodes, 1):
                lines.append(f"  {i:3d}. {node_info[node.node_id].label}")
    except Exception as e:
        lines.append(f"  Could not determine execution order: {e}")

    # Print everything
    output = "\n".join(lines)

    print(output)
redo()

Redo the last undone action.

Returns:

Type Description
UndoRedoResult

UndoRedoResult indicating success or failure.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 436-442
def redo(self) -> UndoRedoResult:
    """Redo the last undone action.

    Returns:
        UndoRedoResult indicating success or failure.
    """
    return self._history_manager.redo(self)
remove_from_output_cols(columns)

Removes specified columns from the list of expected output columns.

Parameters:

Name Type Description Default
columns list[str]

A list of column names to remove.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 807-814
def remove_from_output_cols(self, columns: list[str]):
    """Removes specified columns from the list of expected output columns.

    Args:
        columns: A list of column names to remove.
    """
    cols = set(columns)
    self._output_cols = [c for c in self._output_cols if c not in cols]
reset()

Forces a deep reset on all nodes in the graph.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 2715-2719
def reset(self):
    """Forces a deep reset on all nodes in the graph."""

    for node in self.nodes:
        node.reset(True)
restore_from_snapshot(snapshot)

Clear current state and rebuild from a snapshot.

This method is used internally by undo/redo to restore a previous state.

Parameters:

Name Type Description Default
snapshot FlowfileData

The FlowfileData snapshot to restore from.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 484-588
def restore_from_snapshot(self, snapshot: schemas.FlowfileData) -> None:
    """Clear current state and rebuild from a snapshot.

    This method is used internally by undo/redo to restore a previous state.

    Args:
        snapshot: The FlowfileData snapshot to restore from.
    """
    from flowfile_core.flowfile.manage.io_flowfile import (
        _flowfile_data_to_flow_information,
        determine_insertion_order,
    )

    # Preserve the current flow_id
    original_flow_id = self._flow_id

    # Convert snapshot to FlowInformation
    flow_info = _flowfile_data_to_flow_information(snapshot)

    # Clear current state
    self._node_db.clear()
    self._node_ids.clear()
    self._flow_starts.clear()
    self._results = None

    # Restore flow settings (preserve original flow_id)
    self._flow_settings = flow_info.flow_settings
    self._flow_settings.flow_id = original_flow_id
    self._flow_id = original_flow_id
    self.__name__ = flow_info.flow_name or self.__name__

    # Determine node insertion order
    ingestion_order = determine_insertion_order(flow_info)

    # First pass: Create all nodes as promises
    for node_id in ingestion_order:
        node_info = flow_info.data[node_id]
        node_promise = input_schema.NodePromise(
            flow_id=original_flow_id,
            node_id=node_info.id,
            pos_x=node_info.x_position or 0,
            pos_y=node_info.y_position or 0,
            node_type=node_info.type,
        )
        if hasattr(node_info.setting_input, "cache_results"):
            node_promise.cache_results = node_info.setting_input.cache_results
        self.add_node_promise(node_promise)

    # Second pass: Apply settings using add_<node_type> methods
    for node_id in ingestion_order:
        node_info = flow_info.data[node_id]
        if node_info.is_setup and node_info.setting_input is not None:
            # Update flow_id in setting_input
            if hasattr(node_info.setting_input, "flow_id"):
                node_info.setting_input.flow_id = original_flow_id

            if hasattr(node_info.setting_input, "is_user_defined") and node_info.setting_input.is_user_defined:
                if node_info.type in CUSTOM_NODE_STORE:
                    user_defined_node_class = CUSTOM_NODE_STORE[node_info.type]
                    self.add_user_defined_node(
                        custom_node=user_defined_node_class.from_settings(node_info.setting_input.settings),
                        user_defined_node_settings=node_info.setting_input,
                    )
            else:
                add_method = getattr(self, "add_" + node_info.type, None)
                if add_method:
                    add_method(node_info.setting_input)

    # Third pass: Restore connections
    for node_id in ingestion_order:
        node_info = flow_info.data[node_id]
        from_node = self.get_node(node_id)
        if from_node is None:
            continue

        for output_node_id in node_info.outputs or []:
            to_node = self.get_node(output_node_id)
            if to_node is None:
                continue

            output_node_info = flow_info.data.get(output_node_id)
            if output_node_info is None:
                continue

            # Determine connection type
            is_left_input = (output_node_info.left_input_id == node_id) and (
                to_node.left_input is None or to_node.left_input.node_id != node_id
            )
            is_right_input = (output_node_info.right_input_id == node_id) and (
                to_node.right_input is None or to_node.right_input.node_id != node_id
            )
            is_main_input = node_id in (output_node_info.input_ids or [])

            if is_left_input:
                insert_type = "left"
            elif is_right_input:
                insert_type = "right"
            elif is_main_input:
                insert_type = "main"
            else:
                continue

            to_node.add_node_connection(from_node, insert_type)

    logger.info(f"Restored flow from snapshot with {len(self._node_db)} nodes")
run_graph()

Executes the entire data flow graph from start to finish.

Independent nodes within the same execution stage are run in parallel using threads. Stages are processed sequentially so that all dependencies are satisfied before a stage begins.

Returns:

Type Description
RunInformation | None

A RunInformation object summarizing the execution results.

Raises:

Type Description
Exception

If the flow is already running.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py, lines 2349-2437
def run_graph(self) -> RunInformation | None:
    """Executes the entire data flow graph from start to finish.

    Independent nodes within the same execution stage are run in parallel
    using threads. Stages are processed sequentially so that all dependencies
    are satisfied before a stage begins.

    Returns:
        A RunInformation object summarizing the execution results.

    Raises:
        Exception: If the flow is already running.
    """
    if self.flow_settings.is_running:
        raise Exception("Flow is already running")
    try:
        self.flow_settings.is_running = True
        self.flow_settings.is_canceled = False
        self.flow_logger.clear_log_file()
        self.flow_logger.info("Starting to run flowfile flow...")
        execution_plan = compute_execution_plan(
            nodes=self.nodes, flow_starts=self._flow_starts + self.get_implicit_starter_nodes()
        )

        self.latest_run_info = self.create_initial_run_information(
            execution_plan.node_count, "full_run"
        )

        skip_node_message(self.flow_logger, execution_plan.skip_nodes)
        execution_order_message(self.flow_logger, execution_plan.stages)
        performance_mode = self.flow_settings.execution_mode == "Performance"

        run_info_lock = threading.Lock()
        skip_node_ids: set[str | int] = {n.node_id for n in execution_plan.skip_nodes}

        for stage in execution_plan.stages:
            if self.flow_settings.is_canceled:
                self.flow_logger.info("Flow canceled")
                break

            nodes_to_run = [n for n in stage.nodes if n.node_id not in skip_node_ids]

            for skipped in stage.nodes:
                if skipped.node_id in skip_node_ids:
                    node_logger = self.flow_logger.get_node_logger(skipped.node_id)
                    node_logger.info(f"Skipping node {skipped.node_id}")

            if not nodes_to_run:
                continue

            is_local = self.flow_settings.execution_location == "local"
            max_workers = 1 if is_local else self.flow_settings.max_parallel_workers
            if len(nodes_to_run) == 1 or max_workers == 1:
                # Single node or parallelism disabled — run sequentially
                stage_results = [
                    self._execute_single_node(node, performance_mode, run_info_lock)
                    for node in nodes_to_run
                ]
            else:
                # Multiple independent nodes — run in parallel
                stage_results: list[tuple[NodeResult, FlowNode]] = []
                workers = min(max_workers, len(nodes_to_run))
                with ThreadPoolExecutor(max_workers=workers) as executor:
                    futures = {
                        executor.submit(
                            self._execute_single_node, node, performance_mode, run_info_lock
                        ): node
                        for node in nodes_to_run
                    }
                    for future in as_completed(futures):
                        stage_results.append(future.result())

            # After the stage completes, propagate failures to downstream nodes
            for node_result, node in stage_results:
                if not node_result.success:
                    for dep in node.get_all_dependent_nodes():
                        skip_node_ids.add(dep.node_id)

        self.latest_run_info.end_time = datetime.datetime.now()
        self.flow_logger.info("Flow completed!")
        self.end_datetime = datetime.datetime.now()
        self.flow_settings.is_running = False
        if self.flow_settings.is_canceled:
            self.flow_logger.info("Flow canceled")
        return self.get_run_info()
    except Exception as e:
        raise e
    finally:
        self.flow_settings.is_running = False
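
A minimal usage sketch, assuming a FlowGraph instance named `graph` has already been built and configured (construction not shown):

run_info = graph.run_graph()      # raises Exception if the flow is already running
if run_info is not None:
    print(run_info.end_time)      # populated once the flow finishes (see source above)
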
save_flow(flow_path)

Saves the current state of the flow graph to a file.

Supports multiple formats based on file extension:

- .yaml / .yml: New YAML format
- .json: JSON format

Parameters:

Name Type Description Default
flow_path str

The path where the flow file will be saved.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def save_flow(self, flow_path: str):
    """Saves the current state of the flow graph to a file.

    Supports multiple formats based on file extension:
    - .yaml / .yml: New YAML format
    - .json: JSON format

    Args:
        flow_path: The path where the flow file will be saved.
    """
    logger.info("Saving flow to %s", flow_path)
    path = Path(flow_path)
    os.makedirs(path.parent, exist_ok=True)
    suffix = path.suffix.lower()
    new_flow_name = path.name.replace(suffix, "")
    self._handle_flow_renaming(new_flow_name, path)
    self.flow_settings.modified_on = datetime.datetime.now().timestamp()
    try:
        if suffix == ".flowfile":
            raise DeprecationWarning(
                "The .flowfile format is deprecated. Please use .yaml or .json formats.\n\n"
                "Or stay on v0.4.1 if you still need .flowfile support.\n\n"
            )
        elif suffix in (".yaml", ".yml"):
            flowfile_data = self.get_flowfile_data()
            data = flowfile_data.model_dump(mode="json")
            with open(flow_path, "w", encoding="utf-8") as f:
                yaml.dump(data, f, default_flow_style=False, sort_keys=False, allow_unicode=True)
        elif suffix == ".json":
            flowfile_data = self.get_flowfile_data()
            data = flowfile_data.model_dump(mode="json")
            with open(flow_path, "w", encoding="utf-8") as f:
                json.dump(data, f, indent=2, ensure_ascii=False)

        else:
            flowfile_data = self.get_flowfile_data()
            logger.warning(f"Unknown file extension {suffix}. Defaulting to YAML format.")
            data = flowfile_data.model_dump(mode="json")
            with open(flow_path, "w", encoding="utf-8") as f:
                yaml.dump(data, f, default_flow_style=False, sort_keys=False, allow_unicode=True)

    except Exception as e:
        logger.error(f"Error saving flow: {e}")
        raise

    self.flow_settings.path = flow_path
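
A short usage sketch, assuming a FlowGraph instance named `graph`; the file paths are illustrative only:

graph.save_flow("flows/my_pipeline.yaml")   # YAML format
graph.save_flow("flows/my_pipeline.json")   # JSON format
# graph.save_flow("flows/old.flowfile")     # raises DeprecationWarning (see source above)
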
trigger_fetch_node(node_id)

Executes a specific node in the graph by its ID.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def trigger_fetch_node(self, node_id: int) -> RunInformation | None:
    """Executes a specific node in the graph by its ID."""
    if self.flow_settings.is_running:
        raise Exception("Flow is already running")
    flow_node = self.get_node(node_id)
    self.flow_settings.is_running = True
    self.flow_settings.is_canceled = False
    self.flow_logger.clear_log_file()
    self.latest_run_info = self.create_initial_run_information(1, "fetch_one")
    node_logger = self.flow_logger.get_node_logger(flow_node.node_id)
    node_result = NodeResult(node_id=flow_node.node_id, node_name=flow_node.name)
    logger.info(f"Starting to run: node {flow_node.node_id}, start time: {node_result.start_timestamp}")
    try:
        self.latest_run_info.node_step_result.append(node_result)
        flow_node.execute_node(
            run_location=self.flow_settings.execution_location,
            performance_mode=False,
            node_logger=node_logger,
            optimize_for_downstream=False,
            reset_cache=True,
        )
        node_result.error = str(flow_node.results.errors)
        if self.flow_settings.is_canceled:
            node_result.success = None
            node_result.is_running = False
        node_result.success = flow_node.results.errors is None
        node_result.end_timestamp = time()
        node_result.run_time = int(node_result.end_timestamp - node_result.start_timestamp)
        node_result.is_running = False
        self.latest_run_info.nodes_completed += 1
        self.latest_run_info.end_time = datetime.datetime.now()
        self.flow_settings.is_running = False
        return self.get_run_info()
    except Exception as e:
        node_result.error = "Node did not run"
        node_result.success = False
        node_result.end_timestamp = time()
        node_result.run_time = int(node_result.end_timestamp - node_result.start_timestamp)
        node_result.is_running = False
        node_logger.error(f"Error in node {flow_node.node_id}: {e}")
    finally:
        self.flow_settings.is_running = False
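
A hedged sketch of fetching a single node, assuming `graph` is a FlowGraph and a node with id 3 exists:

run_info = graph.trigger_fetch_node(3)
if run_info is not None:
    last = run_info.node_step_result[-1]    # the NodeResult appended by this call
    print(last.success, last.error)
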
undo()

Undo the last action by restoring to the previous state.

Returns:

Type Description
UndoRedoResult

UndoRedoResult indicating success or failure.

Source code in flowfile_core/flowfile_core/flowfile/flow_graph.py
def undo(self) -> UndoRedoResult:
    """Undo the last action by restoring to the previous state.

    Returns:
        UndoRedoResult indicating success or failure.
    """
    return self._history_manager.undo(self)
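
A minimal sketch, assuming `graph` is a FlowGraph with at least one change recorded by its history manager:

result = graph.undo()   # UndoRedoResult indicating success or failure
print(result)
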

FlowNode

The FlowNode represents a single operation in the FlowGraph. Each node corresponds to a specific transformation or action, such as filtering or grouping data.
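
A hedged sketch of inspecting a node, assuming `graph` is a FlowGraph and a node with id 1 exists (get_node is used in the snapshot-restore code above):

node = graph.get_node(1)
if node is not None:
    print(node.is_setup)                          # True once the node is fully configured
    print([c.column_name for c in node.schema])   # output schema, or the predicted schema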

flowfile_core.flowfile.flow_node.flow_node.FlowNode

Represents a single node in a data flow graph.

This class manages the node's state, its data processing function, and its connections to other nodes within the graph.

Methods:

Name Description
__call__

Makes the node instance callable, acting as an alias for execute_node.

__init__

Initializes a FlowNode instance.

__repr__

Provides a string representation of the FlowNode instance.

add_lead_to_in_depend_source

Ensures this node is registered in the leads_to_nodes list of its inputs.

add_node_connection

Adds a connection from a source node to this node.

calculate_hash

Calculates a hash based on settings and input node hashes.

cancel

Cancels an ongoing external process if one is running.

clear_table_example

Clears the table example from the results so that existing results are removed.

create_schema_callback_from_function

Wraps a node's function to create a schema callback that extracts the schema.

delete_input_node

Removes a connection from a specific input node.

delete_lead_to_node

Removes a connection to a specific downstream node.

evaluate_nodes

Triggers a state reset for all directly connected downstream nodes.

execute_full_local

Backward-compatible alias for _do_execute_full_local.

execute_local

Backward-compatible alias for _do_execute_local_with_sampling.

execute_node

Execute the node based on its current state and settings.

execute_remote

Backward-compatible alias for _do_execute_remote.

get_all_dependent_node_ids

Yields the IDs of all downstream nodes recursively.

get_all_dependent_nodes

Yields all downstream nodes recursively.

get_edge_input

Generates NodeEdge objects for all input connections to this node.

get_flow_file_column_schema

Retrieves the schema for a specific column from the output schema.

get_input_type

Gets the type of connection ('main', 'left', 'right') for a given input node ID.

get_node_data

Gathers all necessary data for representing the node in the UI.

get_node_information

Updates and returns the node's information object.

get_node_input

Creates a NodeInput schema object for representing this node in the UI.

get_output_data

Gets the full output data sample for this node.

get_predicted_resulting_data

Creates a FlowDataEngine instance based on the predicted schema.

get_predicted_schema

Predicts the output schema of the node without full execution.

get_repr

Gets a detailed dictionary representation of the node's state.

get_resulting_data

Executes the node's function to produce the actual output data.

get_table_example

Generates a TableExample model summarizing the node's output.

needs_reset

Checks if the node's hash has changed, indicating an outdated state.

needs_run

Determines if the node needs to be executed.

post_init

Initializes or resets the node's attributes to their default states.

prepare_before_run

Resets results and errors before a new execution.

print

Helper method to log messages with node context.

remove_cache

Removes cached results for this node.

reset

Resets the node's execution state and schema information.

set_node_information

Populates the node_information attribute with the current state.

store_example_data_generator

Stores a generator function for fetching a sample of the result data.

update_node

Updates the properties of the node.

Attributes:

Name Type Description
all_inputs list[FlowNode]

Gets a list of all nodes connected to any input port.

executor NodeExecutor

Lazy-initialized executor instance.

function Callable

Gets the core processing function of the node.

has_input bool

Checks if this node has any input connections.

has_next_step bool

Checks if this node has any downstream connections.

hash str

Gets the cached hash for the node, calculating it if it doesn't exist.

is_correct bool

Checks if the node's input connections satisfy its template requirements.

is_setup bool

Checks if the node has been properly configured and is ready for execution.

is_start bool

Determines if the node is a starting node in the flow.

left_input Optional[FlowNode]

Gets the node connected to the left input port.

main_input list[FlowNode]

Gets the list of nodes connected to the main input port(s).

name str

Gets the name of the node.

node_id str | int

Gets the unique identifier of the node.

number_of_leads_to_nodes int | None

Counts the number of downstream node connections.

right_input Optional[FlowNode]

Gets the node connected to the right input port.

schema list[FlowfileColumn]

Gets the definitive output schema of the node.

schema_callback SingleExecutionFuture

Gets the schema callback function, creating one if it doesn't exist.

setting_input Any

Gets the node's specific configuration settings.

singular_input bool

Checks if the node template specifies exactly one input.

singular_main_input FlowNode

Gets the input node, assuming it is a single-input type.

state_needs_reset bool

Checks if the node's state needs to be reset.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
class FlowNode:
    """Represents a single node in a data flow graph.

    This class manages the node's state, its data processing function,
    and its connections to other nodes within the graph.
    """

    parent_uuid: str
    node_type: str
    node_template: node_store.NodeTemplate
    node_default: schemas.NodeDefault
    node_schema: NodeSchemaInformation
    node_inputs: NodeStepInputs
    node_stats: NodeStepStats
    node_settings: NodeStepSettings
    results: NodeResults
    node_information: schemas.NodeInformation | None = None
    leads_to_nodes: list["FlowNode"] = []  # list with target flows, after execution the step will trigger those step(s)
    user_provided_schema_callback: Callable | None = None  # user provided callback function for schema calculation
    _setting_input: Any = None
    _hash: str | None = None  # stored hash, used for caching results
    _function: Callable = None  # the function that needs to be executed when triggered
    _name: str = None  # name of the node, used for display
    _schema_callback: SingleExecutionFuture | None = None  # Function that calculates the schema without executing
    _state_needs_reset: bool = False
    _fetch_cached_df: (
        ExternalDfFetcher | ExternalDatabaseFetcher | ExternalDatabaseWriter | ExternalCloudWriter | None
    ) = None
    _cache_progress: (
        ExternalDfFetcher | ExternalDatabaseFetcher | ExternalDatabaseWriter | ExternalCloudWriter | None
    ) = None
    _execution_state: NodeExecutionState = None
    _executor: NodeExecutor | None = None  # Lazy-initialized

    def __init__(
        self,
        node_id: str | int,
        function: Callable,
        parent_uuid: str,
        setting_input: Any,
        name: str,
        node_type: str,
        input_columns: list[str] = None,
        output_schema: list[FlowfileColumn] = None,
        drop_columns: list[str] = None,
        renew_schema: bool = True,
        pos_x: float = 0,
        pos_y: float = 0,
        schema_callback: Callable = None,
    ):
        """Initializes a FlowNode instance.

        Args:
            node_id: Unique identifier for the node.
            function: The core data processing function for the node.
            parent_uuid: The UUID of the parent flow.
            setting_input: The configuration/settings object for the node.
            name: The name of the node.
            node_type: The type identifier of the node (e.g., 'join', 'filter').
            input_columns: List of column names expected as input.
            output_schema: The schema of the columns to be added.
            drop_columns: List of column names to be dropped.
            renew_schema: Flag to indicate if the schema should be renewed.
            pos_x: The x-coordinate on the canvas.
            pos_y: The y-coordinate on the canvas.
            schema_callback: A custom function to calculate the output schema.
        """
        self._name = None
        self.parent_uuid = parent_uuid
        self.post_init()
        self.active = True
        self.node_information.id = node_id
        self.node_type = node_type
        self.node_settings.renew_schema = renew_schema
        self.update_node(
            function=function,
            input_columns=input_columns,
            output_schema=output_schema,
            drop_columns=drop_columns,
            setting_input=setting_input,
            name=name,
            pos_x=pos_x,
            pos_y=pos_y,
            schema_callback=schema_callback,
        )

    def post_init(self):
        """Initializes or resets the node's attributes to their default states."""
        self.node_inputs = NodeStepInputs()
        self.node_stats = NodeStepStats()
        self.node_settings = NodeStepSettings()
        self.node_schema = NodeSchemaInformation()
        self.results = NodeResults()
        self.node_information = schemas.NodeInformation()
        self.leads_to_nodes = []
        self._setting_input = None
        self._cache_progress = None
        self._schema_callback = None
        self._state_needs_reset = False
        self._execution_lock = threading.RLock()  # Protects concurrent access to get_resulting_data
        # Initialize execution state
        self._execution_state = NodeExecutionState()
        self._executor = None  # Will be lazily created

    @property
    def state_needs_reset(self) -> bool:
        """Checks if the node's state needs to be reset.

        Returns:
            True if a reset is required, False otherwise.
        """
        return self._state_needs_reset

    @state_needs_reset.setter
    def state_needs_reset(self, v: bool):
        """Sets the flag indicating that the node's state needs to be reset.

        Args:
            v: The boolean value to set.
        """
        self._state_needs_reset = v

    def create_schema_callback_from_function(self, f: Callable) -> Callable[[], list[FlowfileColumn]]:
        """Wraps a node's function to create a schema callback that extracts the schema.

        Thread-safe: uses _execution_lock to prevent concurrent execution with get_resulting_data.

        Args:
            f: The node's core function that returns a FlowDataEngine instance.

        Returns:
            A callable that, when executed, returns the output schema.
        """

        def schema_callback() -> list[FlowfileColumn]:
            try:
                logger.info("Executing the schema callback function based on the node function")
                with self._execution_lock:
                    return f().schema
            except Exception as e:
                logger.warning(f"Error with the schema callback: {e}")
                return []

        return schema_callback

    @property
    def schema_callback(self) -> SingleExecutionFuture:
        """Gets the schema callback function, creating one if it doesn't exist.

        The callback is used for predicting the output schema without full execution.

        Returns:
            A SingleExecutionFuture instance wrapping the schema function.
        """
        if self._schema_callback is None:
            if self.user_provided_schema_callback is not None:
                self.schema_callback = self.user_provided_schema_callback
            elif self.is_start:
                self.schema_callback = self.create_schema_callback_from_function(self._function)
        return self._schema_callback

    @schema_callback.setter
    def schema_callback(self, f: Callable):
        """Sets the schema callback function for the node.

        If the node has an enabled output_field_config, the callback is automatically
        wrapped to use the output_field_config schema for prediction.

        Args:
            f: The function to be used for schema calculation.
        """
        if f is None:
            return

        # Wrap callback with output_field_config support if present and enabled
        output_field_config = getattr(self._setting_input, 'output_field_config', None)
        if output_field_config and output_field_config.enabled:
            f = create_schema_callback_with_output_config(f, output_field_config)

        def error_callback(e: Exception) -> list:
            logger.warning(e)

            self.node_settings.setup_errors = True
            return []

        self._schema_callback = SingleExecutionFuture(f, error_callback)

    @property
    def executor(self) -> NodeExecutor:
        """Lazy-initialized executor instance.

        Reusing the same executor avoids object creation overhead
        when execute_node is called multiple times.
        """
        if self._executor is None:
            self._executor = NodeExecutor(self)
        return self._executor

    @property
    def is_start(self) -> bool:
        """Determines if the node is a starting node in the flow.

        A starting node requires no inputs.

        Returns:
            True if the node is a start node, False otherwise.
        """
        return not self.has_input and self.node_template.input == 0

    def get_input_type(self, node_id: int) -> list:
        """Gets the type of connection ('main', 'left', 'right') for a given input node ID.

        Args:
            node_id: The ID of the input node.

        Returns:
            A list of connection types for that node ID.
        """
        relation_type = []
        if node_id in [n.node_id for n in self.node_inputs.main_inputs]:
            relation_type.append("main")
        if self.node_inputs.left_input is not None and node_id == self.node_inputs.left_input.node_id:
            relation_type.append("left")
        if self.node_inputs.right_input is not None and node_id == self.node_inputs.right_input.node_id:
            relation_type.append("right")
        return list(set(relation_type))

    def update_node(
        self,
        function: Callable,
        input_columns: list[str] = None,
        output_schema: list[FlowfileColumn] = None,
        drop_columns: list[str] = None,
        name: str = None,
        setting_input: Any = None,
        pos_x: float = 0,
        pos_y: float = 0,
        schema_callback: Callable = None,
    ):
        """Updates the properties of the node.

        This is called during initialization and when settings are changed.

        Args:
            function: The new core data processing function.
            input_columns: The new list of input columns.
            output_schema: The new schema of added columns.
            drop_columns: The new list of dropped columns.
            name: The new name for the node.
            setting_input: The new settings object.
            pos_x: The new x-coordinate.
            pos_y: The new y-coordinate.
            schema_callback: The new custom schema callback function.
        """
        self.user_provided_schema_callback = schema_callback
        self.node_information.y_position = int(pos_y)
        self.node_information.x_position = int(pos_x)
        self.node_information.setting_input = setting_input
        self.name = self.node_type if name is None else name
        self._function = function

        self.node_schema.input_columns = [] if input_columns is None else input_columns
        self.node_schema.output_columns = [] if output_schema is None else output_schema
        self.node_schema.drop_columns = [] if drop_columns is None else drop_columns
        self.node_settings.renew_schema = True
        if hasattr(setting_input, "cache_results"):
            self.node_settings.cache_results = setting_input.cache_results

        self.results.errors = None
        self.add_lead_to_in_depend_source()
        _ = self.hash
        self.node_template = node_store.node_dict.get(self.node_type)
        if self.node_template is None:
            raise Exception(f"Node template {self.node_type} not found")
        self.node_default = node_store.node_defaults.get(self.node_type)
        self.setting_input = setting_input  # wait until the end so that the hash is calculated correctly

    @property
    def name(self) -> str:
        """Gets the name of the node.

        Returns:
            The node's name.
        """
        return self._name

    @name.setter
    def name(self, name: str):
        """Sets the name of the node.

        Args:
            name: The new name.
        """
        self._name = name
        self.__name__ = name

    @property
    def setting_input(self) -> Any:
        """Gets the node's specific configuration settings.

        Returns:
            The settings object.
        """
        return self._setting_input

    @setting_input.setter
    def setting_input(self, setting_input: Any):
        """Sets the node's configuration and triggers a reset if necessary.

        Args:
            setting_input: The new settings object.
        """
        is_manual_input = (
            self.node_type == "manual_input"
            and isinstance(setting_input, input_schema.NodeManualInput)
            and isinstance(self._setting_input, input_schema.NodeManualInput)
        )
        if is_manual_input:
            _ = self.hash
        self._setting_input = setting_input
        # Copy cache_results from setting_input to node_settings
        if hasattr(setting_input, "cache_results"):
            self.node_settings.cache_results = setting_input.cache_results
        self.set_node_information()
        if is_manual_input:
            if self.hash != self.calculate_hash(setting_input) or not self.node_stats.has_run_with_current_setup:
                self.function = FlowDataEngine(setting_input.raw_data_format)
                self.reset()
                self.get_predicted_schema()
        elif self._setting_input is not None:
            self.reset()

    @property
    def node_id(self) -> str | int:
        """Gets the unique identifier of the node.

        Returns:
            The node's ID.
        """
        return self.node_information.id

    @property
    def left_input(self) -> Optional["FlowNode"]:
        """Gets the node connected to the left input port.

        Returns:
            The left input FlowNode, or None.
        """
        return self.node_inputs.left_input

    @property
    def right_input(self) -> Optional["FlowNode"]:
        """Gets the node connected to the right input port.

        Returns:
            The right input FlowNode, or None.
        """
        return self.node_inputs.right_input

    @property
    def main_input(self) -> list["FlowNode"]:
        """Gets the list of nodes connected to the main input port(s).

        Returns:
            A list of main input FlowNodes.
        """
        return self.node_inputs.main_inputs

    @property
    def is_correct(self) -> bool:
        """Checks if the node's input connections satisfy its template requirements.

        Returns:
            True if connections are valid, False otherwise.
        """
        if isinstance(self.setting_input, input_schema.NodePromise):
            return False
        return (
            self.node_template.input == len(self.node_inputs.get_all_inputs())
            or (self.node_template.multi and len(self.node_inputs.get_all_inputs()) > 0)
            or (self.node_template.multi and self.node_template.can_be_start)
        )

    def set_node_information(self):
        """Populates the `node_information` attribute with the current state.

        This includes the node's connections, settings, and position.
        """
        node_information = self.node_information
        node_information.left_input_id = self.node_inputs.left_input.node_id if self.left_input else None
        node_information.right_input_id = self.node_inputs.right_input.node_id if self.right_input else None
        node_information.input_ids = (
            [mi.node_id for mi in self.node_inputs.main_inputs] if self.node_inputs.main_inputs is not None else None
        )
        node_information.setting_input = self.setting_input
        node_information.outputs = [n.node_id for n in self.leads_to_nodes]
        node_information.description = (
            self.setting_input.description if hasattr(self.setting_input, "description") else ""
        )
        node_information.node_reference = (
            self.setting_input.node_reference if hasattr(self.setting_input, "node_reference") else None
        )
        node_information.is_setup = self.is_setup
        node_information.x_position = self.setting_input.pos_x
        node_information.y_position = self.setting_input.pos_y
        node_information.type = self.node_type

    def get_node_information(self) -> schemas.NodeInformation:
        """Updates and returns the node's information object.

        Returns:
            The `NodeInformation` object for this node.
        """
        self.set_node_information()
        return self.node_information

    @property
    def function(self) -> Callable:
        """Gets the core processing function of the node.

        Returns:
            The callable function.
        """
        return self._function

    @function.setter
    def function(self, function: Callable):
        """Sets the core processing function of the node.

        Args:
            function: The new callable function.
        """
        self._function = function

    @property
    def all_inputs(self) -> list["FlowNode"]:
        """Gets a list of all nodes connected to any input port.

        Returns:
            A list of all input FlowNodes.
        """
        return self.node_inputs.get_all_inputs()

    def calculate_hash(self, setting_input: Any) -> str:
        """Calculates a hash based on settings and input node hashes.

        Args:
            setting_input: The node's settings object to be included in the hash.

        Returns:
            A string hash value.
        """
        depends_on_hashes = [_node.hash for _node in self.all_inputs]
        node_data_hash = get_hash(setting_input)
        return get_hash(depends_on_hashes + [node_data_hash, self.parent_uuid])

    @property
    def hash(self) -> str:
        """Gets the cached hash for the node, calculating it if it doesn't exist.

        Returns:
            The string hash value.
        """
        if not self._hash:
            self._hash = self.calculate_hash(self.setting_input)
        return self._hash

    def add_node_connection(
        self, from_node: "FlowNode", insert_type: Literal["main", "left", "right"] = "main"
    ) -> None:
        """Adds a connection from a source node to this node.

        Args:
            from_node: The node to connect from.
            insert_type: The type of input to connect to ('main', 'left', 'right').

        Raises:
            Exception: If the insert_type is invalid.
        """
        from_node.leads_to_nodes.append(self)
        if insert_type == "main":
            if self.node_template.input <= 2 or self.node_inputs.main_inputs is None:
                self.node_inputs.main_inputs = [from_node]
            else:
                self.node_inputs.main_inputs.append(from_node)
        elif insert_type == "right":
            self.node_inputs.right_input = from_node
        elif insert_type == "left":
            self.node_inputs.left_input = from_node
        else:
            raise Exception("Cannot find the connection")
        if self.setting_input.is_setup:
            if hasattr(self.setting_input, "depending_on_id") and insert_type == "main":
                self.setting_input.depending_on_id = from_node.node_id
        self.reset()
        from_node.reset()

    def evaluate_nodes(self, deep: bool = False) -> None:
        """Triggers a state reset for all directly connected downstream nodes.

        Args:
            deep: If True, the reset propagates recursively through the entire downstream graph.
        """
        for node in self.leads_to_nodes:
            self.print(f"resetting node: {node.node_id}")
            node.reset(deep)

    def get_flow_file_column_schema(self, col_name: str) -> FlowfileColumn | None:
        """Retrieves the schema for a specific column from the output schema.

        Args:
            col_name: The name of the column.

        Returns:
            The FlowfileColumn object for that column, or None if not found.
        """
        for s in self.schema:
            if s.column_name == col_name:
                return s

    def get_predicted_schema(self, force: bool = False) -> list[FlowfileColumn] | None:
        """Predicts the output schema of the node without full execution.

        It uses the schema_callback or infers from predicted data.

        Args:
            force: If True, forces recalculation even if a predicted schema exists.

        Returns:
            A list of FlowfileColumn objects representing the predicted schema.
        """
        logger.info(
            f"get_predicted_schema: node_id={self.node_id}, node_type={self.node_type}, force={force}, "
            f"has_predicted_schema={self.node_schema.predicted_schema is not None}, "
            f"has_schema_callback={self.schema_callback is not None}, "
            f"has_output_field_config={hasattr(self._setting_input, 'output_field_config') and self._setting_input.output_field_config is not None if self._setting_input else False}"
        )

        if self.node_schema.predicted_schema and not force:
            logger.debug(f"get_predicted_schema: node_id={self.node_id} - returning cached predicted_schema")
            return self.node_schema.predicted_schema

        if self.schema_callback is not None and (self.node_schema.predicted_schema is None or force):
            self.print("Getting the data from a schema callback")
            logger.info(f"get_predicted_schema: node_id={self.node_id} - invoking schema_callback")
            if force:
                # Force the schema callback to reset, so that it will be executed again
                logger.debug(f"get_predicted_schema: node_id={self.node_id} - forcing schema_callback reset")
                self.schema_callback.reset()

            try:
                schema = self.schema_callback()
                logger.info(
                    f"get_predicted_schema: node_id={self.node_id} - schema_callback returned "
                    f"{len(schema) if schema else 0} columns: {[c.name for c in schema] if schema else []}"
                )
            except Exception as e:
                logger.error(f"get_predicted_schema: node_id={self.node_id} - schema_callback raised exception: {e}")
                schema = None

            if schema is not None and len(schema) > 0:
                self.print("Calculating the schema based on the schema callback")
                self.node_schema.predicted_schema = schema
                logger.info(f"get_predicted_schema: node_id={self.node_id} - set predicted_schema from schema_callback")
                return self.node_schema.predicted_schema
            else:
                logger.warning(f"get_predicted_schema: node_id={self.node_id} - schema_callback returned empty/None schema")
        else:
            logger.debug(f"get_predicted_schema: node_id={self.node_id} - no schema_callback available")

        logger.debug(f"get_predicted_schema: node_id={self.node_id} - falling back to _predicted_data_getter")
        predicted_data = self._predicted_data_getter()
        if predicted_data is not None and predicted_data.schema is not None:
            self.print("Calculating the schema based on the predicted resulting data")
            logger.info(
                f"get_predicted_schema: node_id={self.node_id} - using schema from predicted_data "
                f"({len(predicted_data.schema)} columns)"
            )
            self.node_schema.predicted_schema = self._predicted_data_getter().schema
        else:
            logger.warning(
                f"get_predicted_schema: node_id={self.node_id} - no schema available from any source "
                f"(predicted_data={'None' if predicted_data is None else 'has_data'}, "
                f"schema={'None' if predicted_data is None or predicted_data.schema is None else 'has_schema'})"
            )

        return self.node_schema.predicted_schema

    @property
    def is_setup(self) -> bool:
        """Checks if the node has been properly configured and is ready for execution.

        Returns:
            True if the node is set up, False otherwise.
        """
        if not self.node_information.is_setup:
            if self.function.__name__ != "placeholder":
                self.node_information.is_setup = True
                self.setting_input.is_setup = True
        return self.node_information.is_setup

    def print(self, v: Any):
        """Helper method to log messages with node context.

        Args:
            v: The message or value to log.
        """
        logger.info(f"{self.node_type}, node_id: {self.node_id}: {v}")

    def get_resulting_data(self) -> FlowDataEngine | None:
        """Executes the node's function to produce the actual output data.

        Handles both regular functions and external data sources.
        Thread-safe: uses _execution_lock to prevent concurrent execution
        and concurrent access to the underlying LazyFrame by sibling nodes.

        Returns:
            A FlowDataEngine instance containing the result, or None on error.

        Raises:
            Exception: Propagates exceptions from the node's function execution.
        """
        if self.is_setup:
            with self._execution_lock:
                if self.results.resulting_data is None and self.results.errors is None:
                    self.print("getting resulting data")
                    try:
                        if isinstance(self.function, FlowDataEngine):
                            fl: FlowDataEngine = self.function
                        elif self.node_type == "external_source":
                            fl: FlowDataEngine = self.function()
                            fl.collect_external()
                            self.node_settings.streamable = False
                        else:
                            try:
                                self.print("Collecting input data from all inputs")
                                input_data = []
                                input_locks = []
                                try:
                                    for i, v in enumerate(self.all_inputs):
                                        self.print(f"Getting resulting data from input {i} (node {v.node_id})")
                                        # Lock the input node to prevent sibling nodes from
                                        # concurrently accessing the same upstream LazyFrame.
                                        v._execution_lock.acquire()
                                        input_locks.append(v._execution_lock)
                                        input_result = v.get_resulting_data()
                                        self.print(f"Input {i} data type: {type(input_result)}, dataframe type: {type(input_result.data_frame) if input_result else 'None'}")
                                        input_data.append(input_result)
                                    self.print(f"All {len(input_data)} inputs collected, calling node function")
                                    fl = self._function(*input_data)
                                finally:
                                    for lock in input_locks:
                                        lock.release()
                            except Exception as e:
                                raise e
                        fl.set_streamable(self.node_settings.streamable)

                        # Apply output field configuration if enabled
                        if hasattr(self._setting_input, 'output_field_config') and self._setting_input.output_field_config:
                            try:
                                fl = apply_output_field_config(fl, self._setting_input.output_field_config)
                            except Exception as e:
                                logger.error(f"Error applying output field config for node {self.node_id}: {e}")
                                raise

                        self.results.resulting_data = fl
                        self.node_schema.result_schema = fl.schema
                    except Exception as e:
                        self.results.resulting_data = FlowDataEngine()
                        self.results.errors = str(e)
                        self.node_stats.has_run_with_current_setup = False
                        self.node_stats.has_completed_last_run = False
                        raise e
                return self.results.resulting_data

    def _predicted_data_getter(self) -> FlowDataEngine | None:
        """Internal helper to get a predicted data result.

        This calls the function with predicted data from input nodes.

        Returns:
            A FlowDataEngine instance with predicted data, or an empty one on error.
        """
        try:
            fl = self._function(*[v.get_predicted_resulting_data() for v in self.all_inputs])

            # Apply output field configuration if enabled (mirrors get_resulting_data behavior)
            # This ensures schema prediction accounts for output_field_config validation
            if hasattr(self._setting_input, 'output_field_config') and self._setting_input.output_field_config:
                if self._setting_input.output_field_config.enabled:
                    fl = apply_output_field_config(fl, self._setting_input.output_field_config)

            return fl
        except ValueError as e:
            if str(e) == "generator already executing":
                logger.info("Generator already executing, waiting for the result")
                sleep(1)
                return self._predicted_data_getter()
            fl = FlowDataEngine()
            return fl

        except Exception as e:
            logger.warning("there was an issue with the function, returning an empty Flowfile")
            logger.warning(e)

    def get_predicted_resulting_data(self) -> FlowDataEngine:
        """Creates a `FlowDataEngine` instance based on the predicted schema.

        This avoids executing the node's full logic.

        Returns:
            A FlowDataEngine instance with a schema but no data.
        """
        if self.needs_run(False) and self.schema_callback is not None or self.node_schema.result_schema is not None:
            self.print("Getting data based on the schema")

            _s = self.schema_callback() if self.node_schema.result_schema is None else self.node_schema.result_schema
            return FlowDataEngine.create_from_schema(_s)
        else:
            if isinstance(self.function, FlowDataEngine):
                fl = self.function
            else:
                fl = FlowDataEngine.create_from_schema(self.get_predicted_schema())
            return fl

    def add_lead_to_in_depend_source(self):
        """Ensures this node is registered in the `leads_to_nodes` list of its inputs."""
        for input_node in self.all_inputs:
            if self.node_id not in [n.node_id for n in input_node.leads_to_nodes]:
                input_node.leads_to_nodes.append(self)

    def get_all_dependent_nodes(self) -> Generator["FlowNode", None, None]:
        """Yields all downstream nodes recursively.

        Returns:
            A generator of all dependent FlowNode objects.
        """
        for node in self.leads_to_nodes:
            yield node
            for n in node.get_all_dependent_nodes():
                yield n

    def get_all_dependent_node_ids(self) -> Generator[int, None, None]:
        """Yields the IDs of all downstream nodes recursively.

        Returns:
            A generator of all dependent node IDs.
        """
        for node in self.leads_to_nodes:
            yield node.node_id
            for n in node.get_all_dependent_node_ids():
                yield n

    @property
    def schema(self) -> list[FlowfileColumn]:
        """Gets the definitive output schema of the node.

        If not already run, it falls back to the predicted schema.

        Returns:
            A list of FlowfileColumn objects.
        """
        try:
            if self.is_setup and self.results.errors is None:
                if self.node_schema.result_schema is not None and len(self.node_schema.result_schema) > 0:
                    return self.node_schema.result_schema
                elif self.node_type == "output":
                    if len(self.node_inputs.main_inputs) > 0:
                        self.node_schema.result_schema = self.node_inputs.main_inputs[0].schema
                else:
                    self.node_schema.result_schema = self.get_predicted_schema()
                return self.node_schema.result_schema
            else:
                return []
        except Exception as e:
            logger.error(e)
            return []

    def remove_cache(self):
        """Removes cached results for this node.

        Note: Currently not fully implemented.
        """

        if results_exists(self.hash):
            logger.warning("Not implemented")
            clear_task_from_worker(self.hash)

    def needs_run(
        self,
        performance_mode: bool,
        node_logger: NodeLogger = None,
        execution_location: schemas.ExecutionLocationsLiteral = "remote",
    ) -> bool:
        """Determines if the node needs to be executed.

        The decision is based on its run state, caching settings, and execution mode.

        Args:
            performance_mode: True if the flow is in performance mode.
            node_logger: The logger instance for this node.
            execution_location: The target execution location.

        Returns:
            True if the node should be run, False otherwise.
        """
        if execution_location == "local":
            return False

        flow_logger = logger if node_logger is None else node_logger
        cache_result_exists = results_exists(self.hash)
        if not self.node_stats.has_run_with_current_setup:
            flow_logger.info("Node has not run, needs to run")
            return True
        if self.node_settings.cache_results and cache_result_exists:
            return False
        elif self.node_settings.cache_results and not cache_result_exists:
            return True
        elif not performance_mode and cache_result_exists:
            return False
        else:
            return True

    def __call__(self, *args, **kwargs):
        """Makes the node instance callable, acting as an alias for execute_node."""
        self.execute_node(*args, **kwargs)

    def _can_skip_execution_fast(
        self,
        run_location: schemas.ExecutionLocationsLiteral,
        performance_mode: bool,
        reset_cache: bool,
    ) -> bool:
        """Fast-path check to avoid executor overhead when we can skip.

        This inlines the most common skip conditions to avoid
        creating an executor instance when not needed.

        Returns True if execution can definitely be skipped.
        Returns False if full execution logic is needed.
        """
        # Can't skip if forced refresh
        if reset_cache:
            return False

        # Output nodes always run
        if self.node_template.node_group == "output":
            return False

        # Must run if never ran before
        if not self._execution_state.has_run_with_current_setup:
            return False

        # Check for source file changes (read nodes only)
        if self.node_type == "read" and self._execution_state.source_file_info:
            if self._execution_state.source_file_info.has_changed():
                return False

        # Cache-enabled nodes: only skip if the cache file is still present
        if self.node_settings.cache_results:
            return results_exists(self.hash)

        # Already ran with current settings → skip
        # Results are available in memory from previous execution
        return True

    def _do_execute_full_local(self, performance_mode: bool = False) -> None:
        """Executes the node's logic locally, including example data generation.

        Internal method called by NodeExecutor.

        Args:
            performance_mode: If True, skips generating example data.

        Raises:
            Exception: Propagates exceptions from the execution.
        """
        self.clear_table_example()

        def example_data_generator():
            example_data = None

            def get_example_data():
                nonlocal example_data
                if example_data is None:
                    example_data = resulting_data.get_sample(100).to_arrow()
                return example_data

            return get_example_data

        resulting_data = self.get_resulting_data()

        if not performance_mode:
            self.node_stats.has_run_with_current_setup = True
            self.results.example_data_generator = example_data_generator()
            self.node_schema.result_schema = self.results.resulting_data.schema
            self.node_stats.has_completed_last_run = True

    def _do_execute_local_with_sampling(self, performance_mode: bool = False, flow_id: int = None):
        """Executes the node's logic locally with external sampling.

        Internal method called by NodeExecutor.

        Args:
            performance_mode: If True, skips generating example data.
            flow_id: The ID of the parent flow.

        Raises:
            Exception: Propagates exceptions from the execution.
        """
        try:
            resulting_data = self.get_resulting_data()
            if not performance_mode:
                external_sampler = ExternalSampler(
                    lf=resulting_data.data_frame,
                    file_ref=self.hash,
                    wait_on_completion=True,
                    node_id=self.node_id,
                    flow_id=flow_id,
                )
                self.store_example_data_generator(external_sampler)
                if self.results.errors is None and not self.node_stats.is_canceled:
                    self.node_stats.has_run_with_current_setup = True
            self.node_schema.result_schema = resulting_data.schema

        except Exception as e:
            logger.warning(f"Error with step {self.__name__}")
            logger.error(str(e))
            self.results.errors = str(e)
            self.node_stats.has_run_with_current_setup = False
            self.node_stats.has_completed_last_run = False
            raise e

        if self.node_stats.has_run_with_current_setup:
            for step in self.leads_to_nodes:
                if not self.node_settings.streamable:
                    step.node_settings.streamable = self.node_settings.streamable

    def _do_execute_remote(self, performance_mode: bool = False, node_logger: NodeLogger = None):
        """Executes the node's logic remotely or handles cached results.

        Internal method called by NodeExecutor.

        Args:
            performance_mode: If True, skips generating example data.
            node_logger: The logger for this node execution.

        Raises:
            Exception: If the node_logger is not provided or if execution fails.
        """
        if node_logger is None:
            raise Exception("Node logger is not defined")
        if self.node_settings.cache_results and results_exists(self.hash):
            try:
                self.results.resulting_data = FlowDataEngine(get_external_df_result(self.hash))
                self._cache_progress = None
                return
            except Exception:
                node_logger.warning("Failed to read the cache, rerunning the code")
        if self.node_type == "output":
            self.results.resulting_data = self.get_resulting_data()
            self.node_stats.has_run_with_current_setup = True
            return

        try:
            result_data = self.get_resulting_data()
            # Use 'is not None' instead of truthiness check to avoid triggering __len__()
            # which calls .collect() on the LazyFrame and can cause issues
            if result_data is None:
                self.results.errors = "Error with creating the lazy frame, most likely due to invalid graph"
                raise Exception("get_resulting_data returned None")
        except Exception as e:
            self.results.errors = "Error with creating the lazy frame, most likely due to invalid graph"
            raise e

        if not performance_mode:
            external_df_fetcher = ExternalDfFetcher(
                lf=self.get_resulting_data().data_frame,
                file_ref=self.hash,
                wait_on_completion=False,
                flow_id=node_logger.flow_id,
                node_id=self.node_id,
            )
            self._fetch_cached_df = external_df_fetcher

            try:
                lf = external_df_fetcher.get_result()
                self.results.resulting_data = FlowDataEngine(
                    lf,
                    number_of_records=ExternalDfFetcher(
                        lf=lf,
                        operation_type="calculate_number_of_records",
                        flow_id=node_logger.flow_id,
                        node_id=self.node_id,
                    ).result,
                )

                if not performance_mode:
                    self.store_example_data_generator(external_df_fetcher)
                    self.node_stats.has_run_with_current_setup = True

            except Exception as e:
                node_logger.error("Error with external process")
                if external_df_fetcher.error_code == -1:
                    try:
                        self.results.resulting_data = self.get_resulting_data()
                        self.results.warnings = (
                            "Error with external process (unknown error), "
                            "likely the process was killed by the server because of memory constraints, "
                            "continue with the process. "
                            "We cannot display example data..."
                        )
                    except Exception as e:
                        self.results.errors = str(e)
                        raise e
                elif external_df_fetcher.error_description is None:
                    self.results.errors = str(e)
                    raise e
                else:
                    self.results.errors = external_df_fetcher.error_description
                    raise Exception(external_df_fetcher.error_description)
            finally:
                self._fetch_cached_df = None

    # Backward-compatible aliases for renamed methods
    def execute_full_local(self, performance_mode: bool = False) -> None:
        """Backward-compatible alias for _do_execute_full_local."""
        return self._do_execute_full_local(performance_mode)

    def execute_local(self, flow_id: int, performance_mode: bool = False):
        """Backward-compatible alias for _do_execute_local_with_sampling."""
        return self._do_execute_local_with_sampling(performance_mode, flow_id)

    def execute_remote(self, performance_mode: bool = False, node_logger: NodeLogger = None):
        """Backward-compatible alias for _do_execute_remote."""
        return self._do_execute_remote(performance_mode, node_logger)

    def prepare_before_run(self):
        """Resets results and errors before a new execution."""

        self.results.errors = None
        self.results.resulting_data = None
        self.results.example_data = None

    def cancel(self):
        """Cancels an ongoing external process if one is running."""

        if self._fetch_cached_df is not None:
            self._fetch_cached_df.cancel()
            self.node_stats.is_canceled = True
        else:
            logger.warning("No external process to cancel")
        self.node_stats.is_canceled = True

    def execute_node(
        self,
        run_location: schemas.ExecutionLocationsLiteral,
        reset_cache: bool = False,
        performance_mode: bool = False,
        retry: bool = True,
        node_logger: NodeLogger = None,
        optimize_for_downstream: bool = True,
    ) -> None:
        """Execute the node based on its current state and settings.

        This method uses a fast-path to quickly skip execution when possible,
        avoiding executor overhead. For cases requiring full execution logic,
        it delegates to the NodeExecutor.

        Args:
            run_location: Where to execute ('local' or 'remote')
            reset_cache: Force cache invalidation
            performance_mode: Skip example data generation for speed
            retry: Allow retry on recoverable errors
            node_logger: Logger for this node's execution
            optimize_for_downstream: Cache wide transforms for downstream nodes
        """
        if node_logger is None:
            raise ValueError("node_logger is required")

        if not self.is_setup:
            node_logger.warning(f"Node {self.__name__} is not setup, cannot run")
            return

        # Fast-path: check if we can skip without creating executor
        if self._can_skip_execution_fast(run_location, performance_mode, reset_cache):
            node_logger.info("Node is up-to-date, skipping execution")
            return

        # Full execution logic via executor
        self.executor.execute(
            run_location=run_location,
            reset_cache=reset_cache,
            performance_mode=performance_mode,
            retry=retry,
            node_logger=node_logger,
            optimize_for_downstream=optimize_for_downstream,
        )

    def store_example_data_generator(self, external_df_fetcher: ExternalDfFetcher | ExternalSampler):
        """Stores a generator function for fetching a sample of the result data.

        Args:
            external_df_fetcher: The process that generated the sample data.
        """
        if external_df_fetcher.status is not None:
            file_ref = external_df_fetcher.status.file_ref
            self.results.example_data_path = file_ref
            self.results.example_data_generator = get_read_top_n(file_path=file_ref, n=100)
        else:
            logger.error("Could not get the sample data, the external process is not ready")

    def needs_reset(self) -> bool:
        """Checks if the node's hash has changed, indicating an outdated state.

        Returns:
            True if the calculated hash differs from the stored hash.
        """
        return self._hash != self.calculate_hash(self.setting_input)

    def reset(self, deep: bool = False):
        """Resets the node's execution state and schema information.

        This also triggers a reset on all downstream nodes.

        Args:
            deep: If True, forces a reset even if the hash hasn't changed.
        """
        needs_reset = self.needs_reset() or deep
        if needs_reset:
            logger.info(f"{self.node_id}: Node needs reset")
            self.node_stats.has_run_with_current_setup = False
            self.results.reset()
            self.node_schema.result_schema = None
            self.node_schema.predicted_schema = None
            self._hash = None
            self.node_information.is_setup = None
            self.results.errors = None

            # Reset execution state but preserve source file info for change detection
            self._execution_state.has_run_with_current_setup = False
            self._execution_state.has_completed_last_run = False
            self._execution_state.result_schema = None
            self._execution_state.predicted_schema = None
            self._execution_state.execution_hash = None
            # Note: source_file_info NOT reset - needed for change detection

            if self.is_correct:
                self._schema_callback = None  # Ensure the schema callback is reset
                if self.schema_callback:
                    logger.info(f"{self.node_id}: Resetting the schema callback")
                    self.schema_callback.start()
            self.evaluate_nodes()
            _ = self.hash  # Recalculate the hash after reset

    def delete_lead_to_node(self, node_id: int) -> bool:
        """Removes a connection to a specific downstream node.

        Args:
            node_id: The ID of the downstream node to disconnect.

        Returns:
            True if the connection was found and removed, False otherwise.
        """
        logger.info(f"Deleting lead to node: {node_id}")
        for i, lead_to_node in enumerate(self.leads_to_nodes):
            logger.info(f"Checking lead to node: {lead_to_node.node_id}")
            if lead_to_node.node_id == node_id:
                logger.info(f"Found the node to delete: {node_id}")
                self.leads_to_nodes.pop(i)
                return True
        return False

    def delete_input_node(
        self, node_id: int, connection_type: input_schema.InputConnectionClass = "input-0", complete: bool = False
    ) -> bool:
        """Removes a connection from a specific input node.

        Args:
            node_id: The ID of the input node to disconnect.
            connection_type: The specific input handle (e.g., 'input-0', 'input-1').
            complete: If True, tries to delete from all input types.

        Returns:
            True if a connection was found and removed, False otherwise.
        """
        deleted: bool = False
        if connection_type == "input-0":
            for i, node in enumerate(self.node_inputs.main_inputs):
                if node.node_id == node_id:
                    self.node_inputs.main_inputs.pop(i)
                    deleted = True
                    if not complete:
                        continue
        elif connection_type == "input-1" or complete:
            if self.node_inputs.right_input is not None and self.node_inputs.right_input.node_id == node_id:
                self.node_inputs.right_input = None
                deleted = True
        elif connection_type == "input-2" or complete:
            if self.node_inputs.left_input is not None and self.node_inputs.left_input.node_id == node_id:
                self.node_inputs.left_input = None
                deleted = True
        else:
            logger.warning("Could not find the connection to delete...")
        if deleted:
            self.reset()
        return deleted

    def __repr__(self) -> str:
        """Provides a string representation of the FlowNode instance.

        Returns:
            A string showing the node's ID and type.
        """
        return f"Node id: {self.node_id} ({self.node_type})"

    def _get_readable_schema(self) -> list[dict] | None:
        """Helper to get a simplified, dictionary representation of the output schema.

        Returns:
            A list of dictionaries, each with 'column_name' and 'data_type'.
        """
        if self.is_setup:
            output = []
            for s in self.schema:
                output.append(dict(column_name=s.column_name, data_type=s.data_type))
            return output

    def get_repr(self) -> dict:
        """Gets a detailed dictionary representation of the node's state.

        Returns:
            A dictionary containing key information about the node.
        """
        return dict(
            FlowNode=dict(
                node_id=self.node_id,
                step_name=self.__name__,
                output_columns=self.node_schema.output_columns,
                output_schema=self._get_readable_schema(),
            )
        )

    @property
    def number_of_leads_to_nodes(self) -> int | None:
        """Counts the number of downstream node connections.

        Returns:
            The number of nodes this node leads to.
        """
        if self.is_setup:
            return len(self.leads_to_nodes)

    @property
    def has_next_step(self) -> bool:
        """Checks if this node has any downstream connections.

        Returns:
            True if it has at least one downstream node.
        """
        return len(self.leads_to_nodes) > 0

    @property
    def has_input(self) -> bool:
        """Checks if this node has any input connections.

        Returns:
            True if it has at least one input node.
        """
        return len(self.all_inputs) > 0

    @property
    def singular_input(self) -> bool:
        """Checks if the node template specifies exactly one input.

        Returns:
            True if the node is a single-input type.
        """
        return self.node_template.input == 1

    @property
    def singular_main_input(self) -> "FlowNode":
        """Gets the input node, assuming it is a single-input type.

        Returns:
            The single input FlowNode, or None.
        """
        if self.singular_input:
            return self.all_inputs[0]

    def clear_table_example(self) -> None:
        """
        Clear the table example in the results so that it clears the existing results
        Returns:
            None
        """

        self.results.example_data = None
        self.results.example_data_generator = None
        self.results.example_data_path = None

    def get_table_example(self, include_data: bool = False) -> TableExample | None:
        """Generates a `TableExample` model summarizing the node's output.

        This can optionally include a sample of the data.

        Args:
            include_data: If True, includes a data sample in the result.

        Returns:
            A `TableExample` object, or None if the node is not set up.
        """
        self.print("Getting a table example")
        if self.is_setup and include_data and self.node_stats.has_completed_last_run:
            if self.node_template.node_group == "output":
                self.print("getting the table example")
                return self.main_input[0].get_table_example(include_data)

            logger.info("getting the table example since the node has run")
            example_data_getter = self.results.example_data_generator
            if example_data_getter is not None:
                data = example_data_getter().to_pylist()
                if data is None:
                    data = []
            else:
                data = []
            schema = [FileColumn.model_validate(c.get_column_repr()) for c in self.schema]
            fl = self.get_resulting_data()
            has_example_data = self.results.example_data_generator is not None

            return TableExample(
                node_id=self.node_id,
                name=str(self.node_id),
                number_of_records=999,
                number_of_columns=fl.number_of_fields,
                table_schema=schema,
                columns=fl.columns,
                data=data,
                has_example_data=has_example_data,
                has_run_with_current_setup=self.node_stats.has_run_with_current_setup,
            )
        else:
            logger.warning("getting the table example but the node has not run")
            try:
                schema = [FileColumn.model_validate(c.get_column_repr()) for c in self.schema]
            except Exception as e:
                logger.warning(e)
                schema = []
            columns = [s.name for s in schema]
            return TableExample(
                node_id=self.node_id,
                name=str(self.node_id),
                number_of_records=0,
                number_of_columns=len(columns),
                table_schema=schema,
                columns=columns,
                data=[],
            )

    def get_node_data(self, flow_id: int, include_example: bool = False) -> NodeData:
        """Gathers all necessary data for representing the node in the UI.

        Args:
            flow_id: The ID of the parent flow.
            include_example: If True, includes data samples.

        Returns:
            A `NodeData` object.
        """
        node = NodeData(
            flow_id=flow_id,
            node_id=self.node_id,
            has_run=self.node_stats.has_run_with_current_setup,
            setting_input=self.setting_input,
            flow_type=self.node_type,
        )
        if self.main_input:
            node.main_input = self.main_input[0].get_table_example()
        if self.left_input:
            node.left_input = self.left_input.get_table_example()
        if self.right_input:
            node.right_input = self.right_input.get_table_example()
        if self.is_setup:
            node.main_output = self.get_table_example(include_example)
        node = setting_generator.get_setting_generator(self.node_type)(node)

        node = setting_updator.get_setting_updator(self.node_type)(node)
        # Save the updated settings back to the node so they persist across calls
        if node.setting_input is not None and not isinstance(node.setting_input, input_schema.NodePromise):
            self.setting_input = node.setting_input
        return node

    def get_output_data(self) -> TableExample:
        """Gets the full output data sample for this node.

        Returns:
            A `TableExample` object with data.
        """
        return self.get_table_example(True)

    def get_node_input(self) -> schemas.NodeInput:
        """Creates a `NodeInput` schema object for representing this node in the UI.

        Returns:
            A `NodeInput` object.
        """
        return schemas.NodeInput(
            pos_y=self.setting_input.pos_y,
            pos_x=self.setting_input.pos_x,
            id=self.node_id,
            **self.node_template.__dict__,
        )

    def get_edge_input(self) -> list[schemas.NodeEdge]:
        """Generates `NodeEdge` objects for all input connections to this node.

        Returns:
            A list of `NodeEdge` objects.
        """
        edges = []
        if self.node_inputs.main_inputs is not None:
            for i, main_input in enumerate(self.node_inputs.main_inputs):
                edges.append(
                    schemas.NodeEdge(
                        id=f"{main_input.node_id}-{self.node_id}-{i}",
                        source=main_input.node_id,
                        target=self.node_id,
                        sourceHandle="output-0",
                        targetHandle="input-0",
                    )
                )
        if self.node_inputs.left_input is not None:
            edges.append(
                schemas.NodeEdge(
                    id=f"{self.node_inputs.left_input.node_id}-{self.node_id}-right",
                    source=self.node_inputs.left_input.node_id,
                    target=self.node_id,
                    sourceHandle="output-0",
                    targetHandle="input-2",
                )
            )
        if self.node_inputs.right_input is not None:
            edges.append(
                schemas.NodeEdge(
                    id=f"{self.node_inputs.right_input.node_id}-{self.node_id}-left",
                    source=self.node_inputs.right_input.node_id,
                    target=self.node_id,
                    sourceHandle="output-0",
                    targetHandle="input-1",
                )
            )
        return edges
all_inputs property

Gets a list of all nodes connected to any input port.

Returns:

Type Description
list[FlowNode]

A list of all input FlowNodes.

executor property

Lazy-initialized executor instance.

Reusing the same executor avoids object creation overhead when execute_node is called multiple times.

function property writable

Gets the core processing function of the node.

Returns:

Type Description
Callable

The callable function.

has_input property

Checks if this node has any input connections.

Returns:

Type Description
bool

True if it has at least one input node.

has_next_step property

Checks if this node has any downstream connections.

Returns:

Type Description
bool

True if it has at least one downstream node.

hash property

Gets the cached hash for the node, calculating it if it doesn't exist.

Returns:

Type Description
str

The string hash value.

is_correct property

Checks if the node's input connections satisfy its template requirements.

Returns:

Type Description
bool

True if connections are valid, False otherwise.

is_setup property

Checks if the node has been properly configured and is ready for execution.

Returns:

Type Description
bool

True if the node is set up, False otherwise.

is_start property

Determines if the node is a starting node in the flow.

A starting node requires no inputs.

Returns:

Type Description
bool

True if the node is a start node, False otherwise.

left_input property

Gets the node connected to the left input port.

Returns:

Type Description
Optional[FlowNode]

The left input FlowNode, or None.

main_input property

Gets the list of nodes connected to the main input port(s).

Returns:

Type Description
list[FlowNode]

A list of main input FlowNodes.

name property writable

Gets the name of the node.

Returns:

Type Description
str

The node's name.

node_id property

Gets the unique identifier of the node.

Returns:

Type Description
str | int

The node's ID.

number_of_leads_to_nodes property

Counts the number of downstream node connections.

Returns:

Type Description
int | None

The number of nodes this node leads to.

right_input property

Gets the node connected to the right input port.

Returns:

Type Description
Optional[FlowNode]

The right input FlowNode, or None.

schema property

Gets the definitive output schema of the node.

If not already run, it falls back to the predicted schema.

Returns:

Type Description
list[FlowfileColumn]

A list of FlowfileColumn objects.

schema_callback property writable

Gets the schema callback function, creating one if it doesn't exist.

The callback is used for predicting the output schema without full execution.

Returns:

Type Description
SingleExecutionFuture

A SingleExecutionFuture instance wrapping the schema function.

setting_input property writable

Gets the node's specific configuration settings.

Returns:

Type Description
Any

The settings object.

singular_input property

Checks if the node template specifies exactly one input.

Returns:

Type Description
bool

True if the node is a single-input type.

singular_main_input property

Gets the input node, assuming it is a single-input type.

Returns:

Type Description
FlowNode

The single input FlowNode, or None.

state_needs_reset property writable

Checks if the node's state needs to be reset.

Returns:

Type Description
bool

True if a reset is required, False otherwise.

__call__(*args, **kwargs)

Makes the node instance callable, acting as an alias for execute_node.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def __call__(self, *args, **kwargs):
    """Makes the node instance callable, acting as an alias for execute_node."""
    self.execute_node(*args, **kwargs)
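
Calling the instance forwards all arguments to execute_node. A minimal usage sketch (assuming node is a configured FlowNode and node_logger is a NodeLogger):

node(run_location="remote", node_logger=node_logger)  # equivalent to node.execute_node(...)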
__init__(node_id, function, parent_uuid, setting_input, name, node_type, input_columns=None, output_schema=None, drop_columns=None, renew_schema=True, pos_x=0, pos_y=0, schema_callback=None)

Initializes a FlowNode instance.

Parameters:

Name Type Description Default
node_id str | int

Unique identifier for the node.

required
function Callable

The core data processing function for the node.

required
parent_uuid str

The UUID of the parent flow.

required
setting_input Any

The configuration/settings object for the node.

required
name str

The name of the node.

required
node_type str

The type identifier of the node (e.g., 'join', 'filter').

required
input_columns list[str]

List of column names expected as input.

None
output_schema list[FlowfileColumn]

The schema of the columns to be added.

None
drop_columns list[str]

List of column names to be dropped.

None
renew_schema bool

Flag to indicate if the schema should be renewed.

True
pos_x float

The x-coordinate on the canvas.

0
pos_y float

The y-coordinate on the canvas.

0
schema_callback Callable

A custom function to calculate the output schema.

None
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def __init__(
    self,
    node_id: str | int,
    function: Callable,
    parent_uuid: str,
    setting_input: Any,
    name: str,
    node_type: str,
    input_columns: list[str] = None,
    output_schema: list[FlowfileColumn] = None,
    drop_columns: list[str] = None,
    renew_schema: bool = True,
    pos_x: float = 0,
    pos_y: float = 0,
    schema_callback: Callable = None,
):
    """Initializes a FlowNode instance.

    Args:
        node_id: Unique identifier for the node.
        function: The core data processing function for the node.
        parent_uuid: The UUID of the parent flow.
        setting_input: The configuration/settings object for the node.
        name: The name of the node.
        node_type: The type identifier of the node (e.g., 'join', 'filter').
        input_columns: List of column names expected as input.
        output_schema: The schema of the columns to be added.
        drop_columns: List of column names to be dropped.
        renew_schema: Flag to indicate if the schema should be renewed.
        pos_x: The x-coordinate on the canvas.
        pos_y: The y-coordinate on the canvas.
        schema_callback: A custom function to calculate the output schema.
    """
    self._name = None
    self.parent_uuid = parent_uuid
    self.post_init()
    self.active = True
    self.node_information.id = node_id
    self.node_type = node_type
    self.node_settings.renew_schema = renew_schema
    self.update_node(
        function=function,
        input_columns=input_columns,
        output_schema=output_schema,
        drop_columns=drop_columns,
        setting_input=setting_input,
        name=name,
        pos_x=pos_x,
        pos_y=pos_y,
        schema_callback=schema_callback,
    )
__repr__()

Provides a string representation of the FlowNode instance.

Returns:

Type Description
str

A string showing the node's ID and type.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def __repr__(self) -> str:
    """Provides a string representation of the FlowNode instance.

    Returns:
        A string showing the node's ID and type.
    """
    return f"Node id: {self.node_id} ({self.node_type})"
add_lead_to_in_depend_source()

Ensures this node is registered in the leads_to_nodes list of its inputs.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def add_lead_to_in_depend_source(self):
    """Ensures this node is registered in the `leads_to_nodes` list of its inputs."""
    for input_node in self.all_inputs:
        if self.node_id not in [n.node_id for n in input_node.leads_to_nodes]:
            input_node.leads_to_nodes.append(self)
add_node_connection(from_node, insert_type='main')

Adds a connection from a source node to this node.

Parameters:

Name Type Description Default
from_node FlowNode

The node to connect from.

required
insert_type Literal['main', 'left', 'right']

The type of input to connect to ('main', 'left', 'right').

'main'

Raises:

Type Description
Exception

If the insert_type is invalid.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def add_node_connection(
    self, from_node: "FlowNode", insert_type: Literal["main", "left", "right"] = "main"
) -> None:
    """Adds a connection from a source node to this node.

    Args:
        from_node: The node to connect from.
        insert_type: The type of input to connect to ('main', 'left', 'right').

    Raises:
        Exception: If the insert_type is invalid.
    """
    from_node.leads_to_nodes.append(self)
    if insert_type == "main":
        if self.node_template.input <= 2 or self.node_inputs.main_inputs is None:
            self.node_inputs.main_inputs = [from_node]
        else:
            self.node_inputs.main_inputs.append(from_node)
    elif insert_type == "right":
        self.node_inputs.right_input = from_node
    elif insert_type == "left":
        self.node_inputs.left_input = from_node
    else:
        raise Exception("Cannot find the connection")
    if self.setting_input.is_setup:
        if hasattr(self.setting_input, "depending_on_id") and insert_type == "main":
            self.setting_input.depending_on_id = from_node.node_id
    self.reset()
    from_node.reset()
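
A minimal sketch (assuming join_node, left_source and right_source are existing FlowNode instances in the same flow):

join_node.add_node_connection(left_source, insert_type="left")    # connect to the left input
join_node.add_node_connection(right_source, insert_type="right")  # connect to the right input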
calculate_hash(setting_input)

Calculates a hash based on settings and input node hashes.

Parameters:

Name Type Description Default
setting_input Any

The node's settings object to be included in the hash.

required

Returns:

Type Description
str

A string hash value.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def calculate_hash(self, setting_input: Any) -> str:
    """Calculates a hash based on settings and input node hashes.

    Args:
        setting_input: The node's settings object to be included in the hash.

    Returns:
        A string hash value.
    """
    depends_on_hashes = [_node.hash for _node in self.all_inputs]
    node_data_hash = get_hash(setting_input)
    return get_hash(depends_on_hashes + [node_data_hash, self.parent_uuid])
cancel()

Cancels an ongoing external process if one is running.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def cancel(self):
    """Cancels an ongoing external process if one is running."""

    if self._fetch_cached_df is not None:
        self._fetch_cached_df.cancel()
        self.node_stats.is_canceled = True
    else:
        logger.warning("No external process to cancel")
    self.node_stats.is_canceled = True
clear_table_example()

Clears the table example stored in the results so stale sample data is removed.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def clear_table_example(self) -> None:
    """
    Clear the table example in the results so that it clears the existing results
    Returns:
        None
    """

    self.results.example_data = None
    self.results.example_data_generator = None
    self.results.example_data_path = None
create_schema_callback_from_function(f)

Wraps a node's function to create a schema callback that extracts the schema.

Thread-safe: uses _execution_lock to prevent concurrent execution with get_resulting_data.

Parameters:

Name Type Description Default
f Callable

The node's core function that returns a FlowDataEngine instance.

required

Returns:

Type Description
Callable[[], list[FlowfileColumn]]

A callable that, when executed, returns the output schema.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def create_schema_callback_from_function(self, f: Callable) -> Callable[[], list[FlowfileColumn]]:
    """Wraps a node's function to create a schema callback that extracts the schema.

    Thread-safe: uses _execution_lock to prevent concurrent execution with get_resulting_data.

    Args:
        f: The node's core function that returns a FlowDataEngine instance.

    Returns:
        A callable that, when executed, returns the output schema.
    """

    def schema_callback() -> list[FlowfileColumn]:
        try:
            logger.info("Executing the schema callback function based on the node function")
            with self._execution_lock:
                return f().schema
        except Exception as e:
            logger.warning(f"Error with the schema callback: {e}")
            return []

    return schema_callback
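
A minimal sketch (assuming node is a configured FlowNode whose function returns a FlowDataEngine):

schema_callback = node.create_schema_callback_from_function(node.function)
predicted_columns = schema_callback()  # list[FlowfileColumn], or [] if the function raises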
delete_input_node(node_id, connection_type='input-0', complete=False)

Removes a connection from a specific input node.

Parameters:

Name Type Description Default
node_id int

The ID of the input node to disconnect.

required
connection_type InputConnectionClass

The specific input handle (e.g., 'input-0', 'input-1').

'input-0'
complete bool

If True, tries to delete from all input types.

False

Returns:

Type Description
bool

True if a connection was found and removed, False otherwise.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def delete_input_node(
    self, node_id: int, connection_type: input_schema.InputConnectionClass = "input-0", complete: bool = False
) -> bool:
    """Removes a connection from a specific input node.

    Args:
        node_id: The ID of the input node to disconnect.
        connection_type: The specific input handle (e.g., 'input-0', 'input-1').
        complete: If True, tries to delete from all input types.

    Returns:
        True if a connection was found and removed, False otherwise.
    """
    deleted: bool = False
    if connection_type == "input-0":
        for i, node in enumerate(self.node_inputs.main_inputs):
            if node.node_id == node_id:
                self.node_inputs.main_inputs.pop(i)
                deleted = True
                if not complete:
                    continue
    elif connection_type == "input-1" or complete:
        if self.node_inputs.right_input is not None and self.node_inputs.right_input.node_id == node_id:
            self.node_inputs.right_input = None
            deleted = True
    elif connection_type == "input-2" or complete:
        if self.node_inputs.left_input is not None and self.node_inputs.left_input.node_id == node_id:
            self.node_inputs.left_input = None
            deleted = True
    else:
        logger.warning("Could not find the connection to delete...")
    if deleted:
        self.reset()
    return deleted
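
A minimal sketch (assuming join_node is a FlowNode with node 3 connected to its main input handle):

removed = join_node.delete_input_node(3, connection_type="input-0")
if removed:
    print("connection removed; the node state was reset")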
delete_lead_to_node(node_id)

Removes a connection to a specific downstream node.

Parameters:

Name Type Description Default
node_id int

The ID of the downstream node to disconnect.

required

Returns:

Type Description
bool

True if the connection was found and removed, False otherwise.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def delete_lead_to_node(self, node_id: int) -> bool:
    """Removes a connection to a specific downstream node.

    Args:
        node_id: The ID of the downstream node to disconnect.

    Returns:
        True if the connection was found and removed, False otherwise.
    """
    logger.info(f"Deleting lead to node: {node_id}")
    for i, lead_to_node in enumerate(self.leads_to_nodes):
        logger.info(f"Checking lead to node: {lead_to_node.node_id}")
        if lead_to_node.node_id == node_id:
            logger.info(f"Found the node to delete: {node_id}")
            self.leads_to_nodes.pop(i)
            return True
    return False
evaluate_nodes(deep=False)

Triggers a state reset for all directly connected downstream nodes.

Parameters:

Name Type Description Default
deep bool

If True, the reset propagates recursively through the entire downstream graph.

False
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def evaluate_nodes(self, deep: bool = False) -> None:
    """Triggers a state reset for all directly connected downstream nodes.

    Args:
        deep: If True, the reset propagates recursively through the entire downstream graph.
    """
    for node in self.leads_to_nodes:
        self.print(f"resetting node: {node.node_id}")
        node.reset(deep)
execute_full_local(performance_mode=False)

Backward-compatible alias for _do_execute_full_local.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def execute_full_local(self, performance_mode: bool = False) -> None:
    """Backward-compatible alias for _do_execute_full_local."""
    return self._do_execute_full_local(performance_mode)
execute_local(flow_id, performance_mode=False)

Backward-compatible alias for _do_execute_local_with_sampling.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def execute_local(self, flow_id: int, performance_mode: bool = False):
    """Backward-compatible alias for _do_execute_local_with_sampling."""
    return self._do_execute_local_with_sampling(performance_mode, flow_id)
execute_node(run_location, reset_cache=False, performance_mode=False, retry=True, node_logger=None, optimize_for_downstream=True)

Execute the node based on its current state and settings.

This method uses a fast-path to quickly skip execution when possible, avoiding executor overhead. For cases requiring full execution logic, it delegates to the NodeExecutor.

Parameters:

Name Type Description Default
run_location ExecutionLocationsLiteral

Where to execute ('local' or 'remote')

required
reset_cache bool

Force cache invalidation

False
performance_mode bool

Skip example data generation for speed

False
retry bool

Allow retry on recoverable errors

True
node_logger NodeLogger

Logger for this node's execution

None
optimize_for_downstream bool

Cache wide transforms for downstream nodes

True
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def execute_node(
    self,
    run_location: schemas.ExecutionLocationsLiteral,
    reset_cache: bool = False,
    performance_mode: bool = False,
    retry: bool = True,
    node_logger: NodeLogger = None,
    optimize_for_downstream: bool = True,
) -> None:
    """Execute the node based on its current state and settings.

    This method uses a fast-path to quickly skip execution when possible,
    avoiding executor overhead. For cases requiring full execution logic,
    it delegates to the NodeExecutor.

    Args:
        run_location: Where to execute ('local' or 'remote')
        reset_cache: Force cache invalidation
        performance_mode: Skip example data generation for speed
        retry: Allow retry on recoverable errors
        node_logger: Logger for this node's execution
        optimize_for_downstream: Cache wide transforms for downstream nodes
    """
    if node_logger is None:
        raise ValueError("node_logger is required")

    if not self.is_setup:
        node_logger.warning(f"Node {self.__name__} is not setup, cannot run")
        return

    # Fast-path: check if we can skip without creating executor
    if self._can_skip_execution_fast(run_location, performance_mode, reset_cache):
        node_logger.info("Node is up-to-date, skipping execution")
        return

    # Full execution logic via executor
    self.executor.execute(
        run_location=run_location,
        reset_cache=reset_cache,
        performance_mode=performance_mode,
        retry=retry,
        node_logger=node_logger,
        optimize_for_downstream=optimize_for_downstream,
    )
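
A minimal sketch (assuming node is a configured FlowNode and node_logger is a NodeLogger for this run):

node.execute_node(
    run_location="remote",    # or "local"
    performance_mode=False,   # keep example data generation enabled
    node_logger=node_logger,  # required; a ValueError is raised if omitted
)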
execute_remote(performance_mode=False, node_logger=None)

Backward-compatible alias for _do_execute_remote.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def execute_remote(self, performance_mode: bool = False, node_logger: NodeLogger = None):
    """Backward-compatible alias for _do_execute_remote."""
    return self._do_execute_remote(performance_mode, node_logger)
get_all_dependent_node_ids()

Yields the IDs of all downstream nodes recursively.

Returns:

Type Description
Generator[int, None, None]

A generator of all dependent node IDs.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_all_dependent_node_ids(self) -> Generator[int, None, None]:
    """Yields the IDs of all downstream nodes recursively.

    Returns:
        A generator of all dependent node IDs.
    """
    for node in self.leads_to_nodes:
        yield node.node_id
        for n in node.get_all_dependent_node_ids():
            yield n
get_all_dependent_nodes()

Yields all downstream nodes recursively.

Returns:

Type Description
Generator[FlowNode, None, None]

A generator of all dependent FlowNode objects.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_all_dependent_nodes(self) -> Generator["FlowNode", None, None]:
    """Yields all downstream nodes recursively.

    Returns:
        A generator of all dependent FlowNode objects.
    """
    for node in self.leads_to_nodes:
        yield node
        for n in node.get_all_dependent_nodes():
            yield n
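
A minimal sketch of walking the downstream graph (assuming node is an existing FlowNode):

downstream_ids = list(node.get_all_dependent_node_ids())
for dependent in node.get_all_dependent_nodes():
    print(dependent.node_id, dependent.node_type)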
get_edge_input()

Generates NodeEdge objects for all input connections to this node.

Returns:

Type Description
list[NodeEdge]

A list of NodeEdge objects.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_edge_input(self) -> list[schemas.NodeEdge]:
    """Generates `NodeEdge` objects for all input connections to this node.

    Returns:
        A list of `NodeEdge` objects.
    """
    edges = []
    if self.node_inputs.main_inputs is not None:
        for i, main_input in enumerate(self.node_inputs.main_inputs):
            edges.append(
                schemas.NodeEdge(
                    id=f"{main_input.node_id}-{self.node_id}-{i}",
                    source=main_input.node_id,
                    target=self.node_id,
                    sourceHandle="output-0",
                    targetHandle="input-0",
                )
            )
    if self.node_inputs.left_input is not None:
        edges.append(
            schemas.NodeEdge(
                id=f"{self.node_inputs.left_input.node_id}-{self.node_id}-right",
                source=self.node_inputs.left_input.node_id,
                target=self.node_id,
                sourceHandle="output-0",
                targetHandle="input-2",
            )
        )
    if self.node_inputs.right_input is not None:
        edges.append(
            schemas.NodeEdge(
                id=f"{self.node_inputs.right_input.node_id}-{self.node_id}-left",
                source=self.node_inputs.right_input.node_id,
                target=self.node_id,
                sourceHandle="output-0",
                targetHandle="input-1",
            )
        )
    return edges
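
A minimal sketch of inspecting the edges implied by this node's inputs (assuming node is an existing FlowNode):

for edge in node.get_edge_input():
    print(edge.source, "->", edge.target, edge.targetHandle)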
get_flow_file_column_schema(col_name)

Retrieves the schema for a specific column from the output schema.

Parameters:

Name Type Description Default
col_name str

The name of the column.

required

Returns:

Type Description
FlowfileColumn | None

The FlowfileColumn object for that column, or None if not found.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_flow_file_column_schema(self, col_name: str) -> FlowfileColumn | None:
    """Retrieves the schema for a specific column from the output schema.

    Args:
        col_name: The name of the column.

    Returns:
        The FlowfileColumn object for that column, or None if not found.
    """
    for s in self.schema:
        if s.column_name == col_name:
            return s
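
A minimal sketch (assuming node is a configured FlowNode; the column name is hypothetical):

col = node.get_flow_file_column_schema("customer_id")
if col is not None:
    print(col.column_name, col.data_type)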
get_input_type(node_id)

Gets the type of connection ('main', 'left', 'right') for a given input node ID.

Parameters:

Name Type Description Default
node_id int

The ID of the input node.

required

Returns:

Type Description
list

A list of connection types for that node ID.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_input_type(self, node_id: int) -> list:
    """Gets the type of connection ('main', 'left', 'right') for a given input node ID.

    Args:
        node_id: The ID of the input node.

    Returns:
        A list of connection types for that node ID.
    """
    relation_type = []
    if node_id in [n.node_id for n in self.node_inputs.main_inputs]:
        relation_type.append("main")
    if self.node_inputs.left_input is not None and node_id == self.node_inputs.left_input.node_id:
        relation_type.append("left")
    if self.node_inputs.right_input is not None and node_id == self.node_inputs.right_input.node_id:
        relation_type.append("right")
    return list(set(relation_type))
get_node_data(flow_id, include_example=False)

Gathers all necessary data for representing the node in the UI.

Parameters:

Name Type Description Default
flow_id int

The ID of the parent flow.

required
include_example bool

If True, includes data samples.

False

Returns:

Type Description
NodeData

A NodeData object.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_node_data(self, flow_id: int, include_example: bool = False) -> NodeData:
    """Gathers all necessary data for representing the node in the UI.

    Args:
        flow_id: The ID of the parent flow.
        include_example: If True, includes data samples.

    Returns:
        A `NodeData` object.
    """
    node = NodeData(
        flow_id=flow_id,
        node_id=self.node_id,
        has_run=self.node_stats.has_run_with_current_setup,
        setting_input=self.setting_input,
        flow_type=self.node_type,
    )
    if self.main_input:
        node.main_input = self.main_input[0].get_table_example()
    if self.left_input:
        node.left_input = self.left_input.get_table_example()
    if self.right_input:
        node.right_input = self.right_input.get_table_example()
    if self.is_setup:
        node.main_output = self.get_table_example(include_example)
    node = setting_generator.get_setting_generator(self.node_type)(node)

    node = setting_updator.get_setting_updator(self.node_type)(node)
    # Save the updated settings back to the node so they persist across calls
    if node.setting_input is not None and not isinstance(node.setting_input, input_schema.NodePromise):
        self.setting_input = node.setting_input
    return node
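
A minimal sketch (assuming node is a configured FlowNode that belongs to flow 1):

node_data = node.get_node_data(flow_id=1, include_example=True)
print(node_data.has_run, node_data.flow_type)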
get_node_information()

Updates and returns the node's information object.

Returns:

Type Description
NodeInformation

The NodeInformation object for this node.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_node_information(self) -> schemas.NodeInformation:
    """Updates and returns the node's information object.

    Returns:
        The `NodeInformation` object for this node.
    """
    self.set_node_information()
    return self.node_information
get_node_input()

Creates a NodeInput schema object for representing this node in the UI.

Returns:

Type Description
NodeInput

A NodeInput object.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_node_input(self) -> schemas.NodeInput:
    """Creates a `NodeInput` schema object for representing this node in the UI.

    Returns:
        A `NodeInput` object.
    """
    return schemas.NodeInput(
        pos_y=self.setting_input.pos_y,
        pos_x=self.setting_input.pos_x,
        id=self.node_id,
        **self.node_template.__dict__,
    )
get_output_data()

Gets the full output data sample for this node.

Returns:

Type Description
TableExample

A TableExample object with data.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_output_data(self) -> TableExample:
    """Gets the full output data sample for this node.

    Returns:
        A `TableExample` object with data.
    """
    return self.get_table_example(True)
get_predicted_resulting_data()

Creates a FlowDataEngine instance based on the predicted schema.

This avoids executing the node's full logic.

Returns:

Type Description
FlowDataEngine

A FlowDataEngine instance with a schema but no data.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_predicted_resulting_data(self) -> FlowDataEngine:
    """Creates a `FlowDataEngine` instance based on the predicted schema.

    This avoids executing the node's full logic.

    Returns:
        A FlowDataEngine instance with a schema but no data.
    """
    if self.needs_run(False) and self.schema_callback is not None or self.node_schema.result_schema is not None:
        self.print("Getting data based on the schema")

        _s = self.schema_callback() if self.node_schema.result_schema is None else self.node_schema.result_schema
        return FlowDataEngine.create_from_schema(_s)
    else:
        if isinstance(self.function, FlowDataEngine):
            fl = self.function
        else:
            fl = FlowDataEngine.create_from_schema(self.get_predicted_schema())
        return fl
get_predicted_schema(force=False)

Predicts the output schema of the node without full execution.

It uses the schema_callback or infers from predicted data.

Parameters:

Name Type Description Default
force bool

If True, forces recalculation even if a predicted schema exists.

False

Returns:

Type Description
list[FlowfileColumn] | None

A list of FlowfileColumn objects representing the predicted schema.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_predicted_schema(self, force: bool = False) -> list[FlowfileColumn] | None:
    """Predicts the output schema of the node without full execution.

    It uses the schema_callback or infers from predicted data.

    Args:
        force: If True, forces recalculation even if a predicted schema exists.

    Returns:
        A list of FlowfileColumn objects representing the predicted schema.
    """
    logger.info(
        f"get_predicted_schema: node_id={self.node_id}, node_type={self.node_type}, force={force}, "
        f"has_predicted_schema={self.node_schema.predicted_schema is not None}, "
        f"has_schema_callback={self.schema_callback is not None}, "
        f"has_output_field_config={hasattr(self._setting_input, 'output_field_config') and self._setting_input.output_field_config is not None if self._setting_input else False}"
    )

    if self.node_schema.predicted_schema and not force:
        logger.debug(f"get_predicted_schema: node_id={self.node_id} - returning cached predicted_schema")
        return self.node_schema.predicted_schema

    if self.schema_callback is not None and (self.node_schema.predicted_schema is None or force):
        self.print("Getting the data from a schema callback")
        logger.info(f"get_predicted_schema: node_id={self.node_id} - invoking schema_callback")
        if force:
            # Force the schema callback to reset, so that it will be executed again
            logger.debug(f"get_predicted_schema: node_id={self.node_id} - forcing schema_callback reset")
            self.schema_callback.reset()

        try:
            schema = self.schema_callback()
            logger.info(
                f"get_predicted_schema: node_id={self.node_id} - schema_callback returned "
                f"{len(schema) if schema else 0} columns: {[c.name for c in schema] if schema else []}"
            )
        except Exception as e:
            logger.error(f"get_predicted_schema: node_id={self.node_id} - schema_callback raised exception: {e}")
            schema = None

        if schema is not None and len(schema) > 0:
            self.print("Calculating the schema based on the schema callback")
            self.node_schema.predicted_schema = schema
            logger.info(f"get_predicted_schema: node_id={self.node_id} - set predicted_schema from schema_callback")
            return self.node_schema.predicted_schema
        else:
            logger.warning(f"get_predicted_schema: node_id={self.node_id} - schema_callback returned empty/None schema")
    else:
        logger.debug(f"get_predicted_schema: node_id={self.node_id} - no schema_callback available")

    logger.debug(f"get_predicted_schema: node_id={self.node_id} - falling back to _predicted_data_getter")
    predicted_data = self._predicted_data_getter()
    if predicted_data is not None and predicted_data.schema is not None:
        self.print("Calculating the schema based on the predicted resulting data")
        logger.info(
            f"get_predicted_schema: node_id={self.node_id} - using schema from predicted_data "
            f"({len(predicted_data.schema)} columns)"
        )
        self.node_schema.predicted_schema = self._predicted_data_getter().schema
    else:
        logger.warning(
            f"get_predicted_schema: node_id={self.node_id} - no schema available from any source "
            f"(predicted_data={'None' if predicted_data is None else 'has_data'}, "
            f"schema={'None' if predicted_data is None or predicted_data.schema is None else 'has_schema'})"
        )

    return self.node_schema.predicted_schema
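
Example (a minimal sketch): assuming `node` is a configured FlowNode, the output schema can be predicted without executing the node; passing force=True re-invokes the schema callback when one is registered.

    # Sketch only: `node` is a placeholder for a configured FlowNode.
    schema = node.get_predicted_schema()             # cached prediction when available
    fresh = node.get_predicted_schema(force=True)    # recomputes via the schema callback

    if fresh is not None:
        for column in fresh:
            print(column.name)  # FlowfileColumn objects expose at least a name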
get_repr()

Gets a detailed dictionary representation of the node's state.

Returns:

Type Description
dict

A dictionary containing key information about the node.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_repr(self) -> dict:
    """Gets a detailed dictionary representation of the node's state.

    Returns:
        A dictionary containing key information about the node.
    """
    return dict(
        FlowNode=dict(
            node_id=self.node_id,
            step_name=self.__name__,
            output_columns=self.node_schema.output_columns,
            output_schema=self._get_readable_schema(),
        )
    )
get_resulting_data()

Executes the node's function to produce the actual output data.

Handles both regular functions and external data sources. Thread-safe: uses _execution_lock to prevent concurrent execution and concurrent access to the underlying LazyFrame by sibling nodes.

Returns:

Type Description
FlowDataEngine | None

A FlowDataEngine instance containing the result, or None on error.

Raises:

Type Description
Exception

Propagates exceptions from the node's function execution.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_resulting_data(self) -> FlowDataEngine | None:
    """Executes the node's function to produce the actual output data.

    Handles both regular functions and external data sources.
    Thread-safe: uses _execution_lock to prevent concurrent execution
    and concurrent access to the underlying LazyFrame by sibling nodes.

    Returns:
        A FlowDataEngine instance containing the result, or None on error.

    Raises:
        Exception: Propagates exceptions from the node's function execution.
    """
    if self.is_setup:
        with self._execution_lock:
            if self.results.resulting_data is None and self.results.errors is None:
                self.print("getting resulting data")
                try:
                    if isinstance(self.function, FlowDataEngine):
                        fl: FlowDataEngine = self.function
                    elif self.node_type == "external_source":
                        fl: FlowDataEngine = self.function()
                        fl.collect_external()
                        self.node_settings.streamable = False
                    else:
                        try:
                            self.print("Collecting input data from all inputs")
                            input_data = []
                            input_locks = []
                            try:
                                for i, v in enumerate(self.all_inputs):
                                    self.print(f"Getting resulting data from input {i} (node {v.node_id})")
                                    # Lock the input node to prevent sibling nodes from
                                    # concurrently accessing the same upstream LazyFrame.
                                    v._execution_lock.acquire()
                                    input_locks.append(v._execution_lock)
                                    input_result = v.get_resulting_data()
                                    self.print(f"Input {i} data type: {type(input_result)}, dataframe type: {type(input_result.data_frame) if input_result else 'None'}")
                                    input_data.append(input_result)
                                self.print(f"All {len(input_data)} inputs collected, calling node function")
                                fl = self._function(*input_data)
                            finally:
                                for lock in input_locks:
                                    lock.release()
                        except Exception as e:
                            raise e
                    fl.set_streamable(self.node_settings.streamable)

                    # Apply output field configuration if enabled
                    if hasattr(self._setting_input, 'output_field_config') and self._setting_input.output_field_config:
                        try:
                            fl = apply_output_field_config(fl, self._setting_input.output_field_config)
                        except Exception as e:
                            logger.error(f"Error applying output field config for node {self.node_id}: {e}")
                            raise

                    self.results.resulting_data = fl
                    self.node_schema.result_schema = fl.schema
                except Exception as e:
                    self.results.resulting_data = FlowDataEngine()
                    self.results.errors = str(e)
                    self.node_stats.has_run_with_current_setup = False
                    self.node_stats.has_completed_last_run = False
                    raise e
            return self.results.resulting_data
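
Example (a minimal sketch): assuming `node` is a configured FlowNode whose inputs are set up, get_resulting_data executes the node and returns a FlowDataEngine. Per the source above, on failure the error message is stored on node.results.errors before the exception is re-raised.

    # Sketch only: `node` is a placeholder for a configured FlowNode.
    try:
        result = node.get_resulting_data()   # FlowDataEngine with the node's output
        df = result.collect()                # materialize as a Polars DataFrame
        print(df.shape)
    except Exception:
        print("node failed:", node.results.errors)
        raise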
get_table_example(include_data=False)

Generates a TableExample model summarizing the node's output.

This can optionally include a sample of the data.

Parameters:

Name Type Description Default
include_data bool

If True, includes a data sample in the result.

False

Returns:

Type Description
TableExample | None

A TableExample object, or None if the node is not set up.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def get_table_example(self, include_data: bool = False) -> TableExample | None:
    """Generates a `TableExample` model summarizing the node's output.

    This can optionally include a sample of the data.

    Args:
        include_data: If True, includes a data sample in the result.

    Returns:
        A `TableExample` object, or None if the node is not set up.
    """
    self.print("Getting a table example")
    if self.is_setup and include_data and self.node_stats.has_completed_last_run:
        if self.node_template.node_group == "output":
            self.print("getting the table example")
            return self.main_input[0].get_table_example(include_data)

        logger.info("getting the table example since the node has run")
        example_data_getter = self.results.example_data_generator
        if example_data_getter is not None:
            data = example_data_getter().to_pylist()
            if data is None:
                data = []
        else:
            data = []
        schema = [FileColumn.model_validate(c.get_column_repr()) for c in self.schema]
        fl = self.get_resulting_data()
        has_example_data = self.results.example_data_generator is not None

        return TableExample(
            node_id=self.node_id,
            name=str(self.node_id),
            number_of_records=999,
            number_of_columns=fl.number_of_fields,
            table_schema=schema,
            columns=fl.columns,
            data=data,
            has_example_data=has_example_data,
            has_run_with_current_setup=self.node_stats.has_run_with_current_setup,
        )
    else:
        logger.warning("getting the table example but the node has not run")
        try:
            schema = [FileColumn.model_validate(c.get_column_repr()) for c in self.schema]
        except Exception as e:
            logger.warning(e)
            schema = []
        columns = [s.name for s in schema]
        return TableExample(
            node_id=self.node_id,
            name=str(self.node_id),
            number_of_records=0,
            number_of_columns=len(columns),
            table_schema=schema,
            columns=columns,
            data=[],
        )
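
Example (a minimal sketch): assuming `node` has already run, get_table_example returns a TableExample summary that can optionally include a small data sample.

    # Sketch only: `node` is a placeholder for a FlowNode that has completed a run.
    example = node.get_table_example(include_data=True)
    if example is not None:
        print(example.number_of_records, example.number_of_columns)
        print(example.columns)
        print(example.data[:5])   # sample rows as dictionaries, if available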
needs_reset()

Checks if the node's hash has changed, indicating an outdated state.

Returns:

Type Description
bool

True if the calculated hash differs from the stored hash.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def needs_reset(self) -> bool:
    """Checks if the node's hash has changed, indicating an outdated state.

    Returns:
        True if the calculated hash differs from the stored hash.
    """
    return self._hash != self.calculate_hash(self.setting_input)
needs_run(performance_mode, node_logger=None, execution_location='remote')

Determines if the node needs to be executed.

The decision is based on its run state, caching settings, and execution mode.

Parameters:

Name Type Description Default
performance_mode bool

True if the flow is in performance mode.

required
node_logger NodeLogger

The logger instance for this node.

None
execution_location ExecutionLocationsLiteral

The target execution location.

'remote'

Returns:

Type Description
bool

True if the node should be run, False otherwise.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def needs_run(
    self,
    performance_mode: bool,
    node_logger: NodeLogger = None,
    execution_location: schemas.ExecutionLocationsLiteral = "remote",
) -> bool:
    """Determines if the node needs to be executed.

    The decision is based on its run state, caching settings, and execution mode.

    Args:
        performance_mode: True if the flow is in performance mode.
        node_logger: The logger instance for this node.
        execution_location: The target execution location.

    Returns:
        True if the node should be run, False otherwise.
    """
    if execution_location == "local":
        return False

    flow_logger = logger if node_logger is None else node_logger
    cache_result_exists = results_exists(self.hash)
    if not self.node_stats.has_run_with_current_setup:
        flow_logger.info("Node has not run, needs to run")
        return True
    if self.node_settings.cache_results and cache_result_exists:
        return False
    elif self.node_settings.cache_results and not cache_result_exists:
        return True
    elif not performance_mode and cache_result_exists:
        return False
    else:
        return True
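
Example (a minimal sketch): assuming `node` is a configured FlowNode, needs_run combines the node's run state, its cache settings, and the target execution location. Per the source above, a "local" execution location always returns False through this path.

    # Sketch only: `node` is a placeholder for a configured FlowNode.
    node.needs_run(performance_mode=False, execution_location="local")    # always False
    should_run = node.needs_run(performance_mode=True, execution_location="remote")
    print("run node:", should_run)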
post_init()

Initializes or resets the node's attributes to their default states.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def post_init(self):
    """Initializes or resets the node's attributes to their default states."""
    self.node_inputs = NodeStepInputs()
    self.node_stats = NodeStepStats()
    self.node_settings = NodeStepSettings()
    self.node_schema = NodeSchemaInformation()
    self.results = NodeResults()
    self.node_information = schemas.NodeInformation()
    self.leads_to_nodes = []
    self._setting_input = None
    self._cache_progress = None
    self._schema_callback = None
    self._state_needs_reset = False
    self._execution_lock = threading.RLock()  # Protects concurrent access to get_resulting_data
    # Initialize execution state
    self._execution_state = NodeExecutionState()
    self._executor = None  # Will be lazily created
prepare_before_run()

Resets results and errors before a new execution.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def prepare_before_run(self):
    """Resets results and errors before a new execution."""

    self.results.errors = None
    self.results.resulting_data = None
    self.results.example_data = None
print(v)

Helper method to log messages with node context.

Parameters:

Name Type Description Default
v Any

The message or value to log.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def print(self, v: Any):
    """Helper method to log messages with node context.

    Args:
        v: The message or value to log.
    """
    logger.info(f"{self.node_type}, node_id: {self.node_id}: {v}")
remove_cache()

Removes cached results for this node.

Note: Currently not fully implemented.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def remove_cache(self):
    """Removes cached results for this node.

    Note: Currently not fully implemented.
    """

    if results_exists(self.hash):
        logger.warning("Not implemented")
        clear_task_from_worker(self.hash)
reset(deep=False)

Resets the node's execution state and schema information.

This also triggers a reset on all downstream nodes.

Parameters:

Name Type Description Default
deep bool

If True, forces a reset even if the hash hasn't changed.

False
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def reset(self, deep: bool = False):
    """Resets the node's execution state and schema information.

    This also triggers a reset on all downstream nodes.

    Args:
        deep: If True, forces a reset even if the hash hasn't changed.
    """
    needs_reset = self.needs_reset() or deep
    if needs_reset:
        logger.info(f"{self.node_id}: Node needs reset")
        self.node_stats.has_run_with_current_setup = False
        self.results.reset()
        self.node_schema.result_schema = None
        self.node_schema.predicted_schema = None
        self._hash = None
        self.node_information.is_setup = None
        self.results.errors = None

        # Reset execution state but preserve source file info for change detection
        self._execution_state.has_run_with_current_setup = False
        self._execution_state.has_completed_last_run = False
        self._execution_state.result_schema = None
        self._execution_state.predicted_schema = None
        self._execution_state.execution_hash = None
        # Note: source_file_info NOT reset - needed for change detection

        if self.is_correct:
            self._schema_callback = None  # Ensure the schema callback is reset
            if self.schema_callback:
                logger.info(f"{self.node_id}: Resetting the schema callback")
                self.schema_callback.start()
        self.evaluate_nodes()
        _ = self.hash  # Recalculate the hash after reset
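
Example (a minimal sketch): assuming `node` is a FlowNode whose settings were just changed, reset clears results and schema information and cascades to downstream nodes; deep=True forces the reset even when the settings hash is unchanged.

    # Sketch only: `node` is a placeholder for a FlowNode.
    if node.needs_reset():    # settings hash differs from the stored hash
        node.reset()          # clears results/schemas and resets downstream nodes

    node.reset(deep=True)     # force a reset regardless of the hash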
set_node_information()

Populates the node_information attribute with the current state.

This includes the node's connections, settings, and position.

Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def set_node_information(self):
    """Populates the `node_information` attribute with the current state.

    This includes the node's connections, settings, and position.
    """
    node_information = self.node_information
    node_information.left_input_id = self.node_inputs.left_input.node_id if self.left_input else None
    node_information.right_input_id = self.node_inputs.right_input.node_id if self.right_input else None
    node_information.input_ids = (
        [mi.node_id for mi in self.node_inputs.main_inputs] if self.node_inputs.main_inputs is not None else None
    )
    node_information.setting_input = self.setting_input
    node_information.outputs = [n.node_id for n in self.leads_to_nodes]
    node_information.description = (
        self.setting_input.description if hasattr(self.setting_input, "description") else ""
    )
    node_information.node_reference = (
        self.setting_input.node_reference if hasattr(self.setting_input, "node_reference") else None
    )
    node_information.is_setup = self.is_setup
    node_information.x_position = self.setting_input.pos_x
    node_information.y_position = self.setting_input.pos_y
    node_information.type = self.node_type
store_example_data_generator(external_df_fetcher)

Stores a generator function for fetching a sample of the result data.

Parameters:

Name Type Description Default
external_df_fetcher ExternalDfFetcher | ExternalSampler

The process that generated the sample data.

required
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def store_example_data_generator(self, external_df_fetcher: ExternalDfFetcher | ExternalSampler):
    """Stores a generator function for fetching a sample of the result data.

    Args:
        external_df_fetcher: The process that generated the sample data.
    """
    if external_df_fetcher.status is not None:
        file_ref = external_df_fetcher.status.file_ref
        self.results.example_data_path = file_ref
        self.results.example_data_generator = get_read_top_n(file_path=file_ref, n=100)
    else:
        logger.error("Could not get the sample data, the external process is not ready")
update_node(function, input_columns=None, output_schema=None, drop_columns=None, name=None, setting_input=None, pos_x=0, pos_y=0, schema_callback=None)

Updates the properties of the node.

This is called during initialization and when settings are changed.

Parameters:

Name Type Description Default
function Callable

The new core data processing function.

required
input_columns list[str]

The new list of input columns.

None
output_schema list[FlowfileColumn]

The new schema of added columns.

None
drop_columns list[str]

The new list of dropped columns.

None
name str

The new name for the node.

None
setting_input Any

The new settings object.

None
pos_x float

The new x-coordinate.

0
pos_y float

The new y-coordinate.

0
schema_callback Callable

The new custom schema callback function.

None
Source code in flowfile_core/flowfile_core/flowfile/flow_node/flow_node.py
def update_node(
    self,
    function: Callable,
    input_columns: list[str] = None,
    output_schema: list[FlowfileColumn] = None,
    drop_columns: list[str] = None,
    name: str = None,
    setting_input: Any = None,
    pos_x: float = 0,
    pos_y: float = 0,
    schema_callback: Callable = None,
):
    """Updates the properties of the node.

    This is called during initialization and when settings are changed.

    Args:
        function: The new core data processing function.
        input_columns: The new list of input columns.
        output_schema: The new schema of added columns.
        drop_columns: The new list of dropped columns.
        name: The new name for the node.
        setting_input: The new settings object.
        pos_x: The new x-coordinate.
        pos_y: The new y-coordinate.
        schema_callback: The new custom schema callback function.
    """
    self.user_provided_schema_callback = schema_callback
    self.node_information.y_position = int(pos_y)
    self.node_information.x_position = int(pos_x)
    self.node_information.setting_input = setting_input
    self.name = self.node_type if name is None else name
    self._function = function

    self.node_schema.input_columns = [] if input_columns is None else input_columns
    self.node_schema.output_columns = [] if output_schema is None else output_schema
    self.node_schema.drop_columns = [] if drop_columns is None else drop_columns
    self.node_settings.renew_schema = True
    if hasattr(setting_input, "cache_results"):
        self.node_settings.cache_results = setting_input.cache_results

    self.results.errors = None
    self.add_lead_to_in_depend_source()
    _ = self.hash
    self.node_template = node_store.node_dict.get(self.node_type)
    if self.node_template is None:
        raise Exception(f"Node template {self.node_type} not found")
    self.node_default = node_store.node_defaults.get(self.node_type)
    self.setting_input = setting_input  # wait until the end so that the hash is calculated correctly

The FlowDataEngine

The FlowDataEngine is the primary engine of the library, providing a rich API for data manipulation, I/O, and transformation. Its methods are grouped below by functionality.
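
Example (a minimal sketch, using the constructor documented below): a FlowDataEngine can be created directly from a list of dictionaries and converted back with to_pylist. The records used here are illustrative only.

    from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

    records = [
        {"id": 1, "city": "Utrecht"},
        {"id": 2, "city": "Delft"},
    ]

    engine = FlowDataEngine(raw_data=records, name="cities")
    print(engine.columns)      # e.g. ['id', 'city']
    print(len(engine))         # 2
    print(engine.to_pylist())  # back to a list of dictionaries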

flowfile_core.flowfile.flow_data_engine.flow_data_engine.FlowDataEngine dataclass

The core data handling engine for Flowfile.

This class acts as a high-level wrapper around a Polars DataFrame or LazyFrame, providing a unified API for data ingestion, transformation, and output. It manages data state (lazy vs. eager), schema information, and execution logic.

Attributes:

Name Type Description
_data_frame DataFrame | LazyFrame

The underlying Polars DataFrame or LazyFrame.

columns list[Any]

A list of column names in the current data frame.

name str

An optional name for the data engine instance.

number_of_records int

The number of records. Can be -1 for lazy frames.

errors list

A list of errors encountered during operations.

_schema list[FlowfileColumn] | None

A cached list of FlowfileColumn objects representing the schema.

Methods:

Name Description
__call__

Makes the class instance callable, returning itself.

__get_sample__

Internal method to get a sample of the data.

__getitem__

Accesses a specific column or item from the DataFrame.

__init__

Initializes the FlowDataEngine from various data sources.

__len__

Returns the number of records in the table.

__repr__

Returns a string representation of the FlowDataEngine.

add_new_values

Adds a new column with the provided values.

add_record_id

Adds a record ID (row number) column to the DataFrame.

apply_flowfile_formula

Applies a formula to create a new column or transform an existing one.

apply_sql_formula

Applies an SQL-style formula using pl.sql_expr.

assert_equal

Asserts that this DataFrame is equal to another.

cache

Caches the current DataFrame to disk and updates the internal reference.

calculate_schema

Calculates and returns the schema.

change_column_types

Changes the data type of one or more columns.

collect

Collects the data and returns it as a Polars DataFrame.

collect_external

Materializes data from a tracked external source.

concat

Concatenates this DataFrame with one or more other DataFrames.

count

Gets the total number of records.

create_from_external_source

Creates a FlowDataEngine from an external data source.

create_from_path

Creates a FlowDataEngine from a local file path.

create_from_path_worker

Creates a FlowDataEngine from a path in a worker process.

create_from_schema

Creates an empty FlowDataEngine from a schema definition.

create_from_sql

Creates a FlowDataEngine by executing a SQL query.

create_random

Creates a FlowDataEngine with randomly generated data.

do_cross_join

Performs a cross join with another DataFrame.

do_filter

Filters rows based on a predicate expression.

do_group_by

Performs a group-by operation on the DataFrame.

do_pivot

Converts the DataFrame from a long to a wide format, aggregating values.

do_select

Performs a complex column selection, renaming, and reordering operation.

do_sort

Sorts the DataFrame by one or more columns.

drop_columns

Drops specified columns from the DataFrame.

from_cloud_storage_obj

Creates a FlowDataEngine from an object in cloud storage.

generate_enumerator

Generates a FlowDataEngine with a single column containing a sequence of integers.

get_estimated_file_size

Estimates the file size in bytes if the data originated from a local file.

get_number_of_records

Gets the total number of records in the DataFrame.

get_number_of_records_in_process

Gets the number of records in the DataFrame within the local process.

get_output_sample

Gets a sample of the data as a list of dictionaries.

get_record_count

Returns a new FlowDataEngine with a single column 'number_of_records'

get_sample

Gets a sample of rows from the DataFrame.

get_schema_column

Retrieves the schema information for a single column by its name.

get_select_inputs

Gets SelectInput specifications for all columns in the current schema.

get_subset

Gets the first n_rows from the DataFrame.

initialize_empty_fl

Initializes an empty LazyFrame.

iter_batches

Iterates over the DataFrame in batches.

join

Performs a standard SQL-style join with another DataFrame.

make_unique

Gets the unique rows from the DataFrame.

output

Writes the DataFrame to an output file.

reorganize_order

Reorganizes columns into a specified order.

save

Saves the DataFrame to a file in a separate thread.

select_columns

Selects a subset of columns from the DataFrame.

set_streamable

Sets whether DataFrame operations should be streamable.

solve_graph

Solves a graph problem represented by 'from' and 'to' columns.

split

Splits a column's text values into multiple rows based on a delimiter.

start_fuzzy_join

Starts a fuzzy join operation in a background process.

to_arrow

Converts the DataFrame to a PyArrow Table.

to_cloud_storage_obj

Writes the DataFrame to an object in cloud storage.

to_dict

Converts the DataFrame to a Python dictionary of columns.

to_pylist

Converts the DataFrame to a list of Python dictionaries.

to_raw_data

Converts the DataFrame to a RawData schema object.

unpivot

Converts the DataFrame from a wide to a long format.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@dataclass
class FlowDataEngine:
    """The core data handling engine for Flowfile.

    This class acts as a high-level wrapper around a Polars DataFrame or
    LazyFrame, providing a unified API for data ingestion, transformation,
    and output. It manages data state (lazy vs. eager), schema information,
    and execution logic.

    Attributes:
        _data_frame: The underlying Polars DataFrame or LazyFrame.
        columns: A list of column names in the current data frame.
        name: An optional name for the data engine instance.
        number_of_records: The number of records. Can be -1 for lazy frames.
        errors: A list of errors encountered during operations.
        _schema: A cached list of `FlowfileColumn` objects representing the schema.
    """

    # Core attributes
    _data_frame: pl.DataFrame | pl.LazyFrame
    columns: list[Any]

    # Metadata attributes
    name: str = None
    number_of_records: int = None
    errors: list = None
    _schema: list[FlowfileColumn] | None = None

    # Configuration attributes
    _optimize_memory: bool = False
    _lazy: bool = None
    _streamable: bool = True
    _calculate_schema_stats: bool = False

    # Cache and optimization attributes
    __col_name_idx_map: dict = None
    __data_map: dict = None
    __optimized_columns: list = None
    __sample__: str = None
    __number_of_fields: int = None
    _col_idx: dict[str, int] = None

    # Source tracking
    _org_path: str | None = None
    _external_source: ExternalDataSource | None = None

    # State tracking
    sorted_by: int = None
    is_future: bool = False
    is_collected: bool = True
    ind_schema_calculated: bool = False

    # Callbacks
    _future: Future = None
    _number_of_records_callback: Callable = None
    _data_callback: Callable = None

    def __init__(
        self,
        raw_data: list[dict] | list[Any] | dict[str, Any] | ParquetFile | pl.DataFrame | pl.LazyFrame | input_schema.RawData = None,
        path_ref: str = None,
        name: str = None,
        optimize_memory: bool = True,
        schema: list[FlowfileColumn] | list[str] | pl.Schema = None,
        number_of_records: int = None,
        calculate_schema_stats: bool = False,
        streamable: bool = True,
        number_of_records_callback: Callable = None,
        data_callback: Callable = None,
    ):
        """Initializes the FlowDataEngine from various data sources.

        Args:
            raw_data: The input data. Can be a list of dicts, a Polars DataFrame/LazyFrame,
                or a `RawData` schema object.
            path_ref: A string path to a Parquet file.
            name: An optional name for the data engine instance.
            optimize_memory: If True, prefers lazy operations to conserve memory.
            schema: An optional schema definition. Can be a list of `FlowfileColumn` objects,
                a list of column names, or a Polars `Schema`.
            number_of_records: The number of records, if known.
            calculate_schema_stats: If True, computes detailed statistics for each column.
            streamable: If True, allows for streaming operations when possible.
            number_of_records_callback: A callback function to retrieve the number of records.
            data_callback: A callback function to retrieve the data.
        """
        self._initialize_attributes(number_of_records_callback, data_callback, streamable)

        if raw_data is not None:
            self._handle_raw_data(raw_data, number_of_records, optimize_memory)
        elif path_ref:
            self._handle_path_ref(path_ref, optimize_memory)
        else:
            self.initialize_empty_fl()
        self._finalize_initialization(name, optimize_memory, schema, calculate_schema_stats)

    def _initialize_attributes(self, number_of_records_callback, data_callback, streamable):
        """(Internal) Sets the initial default attributes for a new instance.

        This helper is called first during initialization to ensure all state-tracking
        and configuration attributes have a clean default value before data is processed.
        """
        self._external_source = None
        self._number_of_records_callback = number_of_records_callback
        self._data_callback = data_callback
        self.ind_schema_calculated = False
        self._streamable = streamable
        self._org_path = None
        self._lazy = False
        self.errors = []
        self._calculate_schema_stats = False
        self.is_collected = True
        self.is_future = False

    def _handle_raw_data(self, raw_data, number_of_records, optimize_memory):
        """(Internal) Dispatches raw data to the appropriate handler based on its type.

        This acts as a router during initialization, inspecting the type of `raw_data`
        and calling the corresponding specialized `_handle_*` method to process it.
        """
        if isinstance(raw_data, input_schema.RawData):
            self._handle_raw_data_format(raw_data)
        elif isinstance(raw_data, pl.DataFrame):
            self._handle_polars_dataframe(raw_data, number_of_records)
        elif isinstance(raw_data, pl.LazyFrame):
            self._handle_polars_lazy_frame(raw_data, number_of_records, optimize_memory)
        elif isinstance(raw_data, (list, dict)):
            self._handle_python_data(raw_data)

    def _handle_polars_dataframe(self, df: pl.DataFrame, number_of_records: int | None):
        """(Internal) Initializes the engine from an eager Polars DataFrame."""
        self.data_frame = df
        self.number_of_records = number_of_records or df.select(pl.len())[0, 0]

    def _handle_polars_lazy_frame(self, lf: pl.LazyFrame, number_of_records: int | None, optimize_memory: bool):
        """(Internal) Initializes the engine from a Polars LazyFrame."""
        self.data_frame = lf
        self._lazy = True
        if number_of_records is not None:
            self.number_of_records = number_of_records
        elif optimize_memory:
            self.number_of_records = -1
        else:
            self.number_of_records = lf.select(pl.len()).collect()[0, 0]

    def _handle_python_data(self, data: list | dict):
        """(Internal) Dispatches Python collections to the correct handler."""
        if isinstance(data, dict):
            self._handle_dict_input(data)
        else:
            self._handle_list_input(data)

    def _handle_dict_input(self, data: dict):
        """(Internal) Initializes the engine from a Python dictionary."""
        if len(data) == 0:
            self.initialize_empty_fl()
        lengths = [len(v) if isinstance(v, (list, tuple)) else 1 for v in data.values()]

        if len(set(lengths)) == 1 and lengths[0] > 1:
            self.number_of_records = lengths[0]
            self.data_frame = pl.DataFrame(data)
        else:
            self.number_of_records = 1
            self.data_frame = pl.DataFrame([data])
        self.lazy = True

    def _handle_raw_data_format(self, raw_data: input_schema.RawData):
        """(Internal) Initializes the engine from a `RawData` schema object.

        This method uses the schema provided in the `RawData` object to correctly
        infer data types when creating the Polars DataFrame.

        Args:
            raw_data: An instance of `RawData` containing the data and schema.
        """
        flowfile_schema = list(FlowfileColumn.create_from_minimal_field_info(c) for c in raw_data.columns)
        polars_schema = pl.Schema(
            [
                (flowfile_column.column_name, flowfile_column.get_polars_type().pl_datatype)
                for flowfile_column in flowfile_schema
            ]
        )
        try:
            df = pl.DataFrame(raw_data.data, polars_schema, strict=False)
        except TypeError as e:
            logger.warning(f"Could not parse the data with the schema:\n{e}")
            df = pl.DataFrame(raw_data.data)
        self.number_of_records = len(df)
        self.data_frame = df.lazy()
        self.lazy = True

    def _handle_list_input(self, data: list):
        """(Internal) Initializes the engine from a list of records."""
        number_of_records = len(data)
        if number_of_records > 0:
            processed_data = self._process_list_data(data)
            self.number_of_records = number_of_records
            self.data_frame = pl.DataFrame(processed_data)
            self.lazy = True
        else:
            self.initialize_empty_fl()
            self.number_of_records = 0

    @staticmethod
    def _process_list_data(data: list) -> list[dict]:
        """(Internal) Normalizes list data into a list of dictionaries.

        Ensures that a list of objects or non-dict items is converted into a
        uniform list of dictionaries suitable for Polars DataFrame creation.
        """
        if not (isinstance(data[0], dict) or hasattr(data[0], "__dict__")):
            try:
                return pl.DataFrame(data).to_dicts()
            except TypeError:
                raise Exception("Value must be able to be converted to dictionary")
            except Exception as e:
                raise Exception(f"Value must be able to be converted to dictionary: {e}")

        if not isinstance(data[0], dict):
            data = [row.__dict__ for row in data]

        return ensure_similarity_dicts(data)

    def to_cloud_storage_obj(self, settings: cloud_storage_schemas.CloudStorageWriteSettingsInternal):
        """Writes the DataFrame to an object in cloud storage.

        This method supports writing to various cloud storage providers like AWS S3,
        Azure Data Lake Storage, and Google Cloud Storage.

        Args:
            settings: A `CloudStorageWriteSettingsInternal` object containing connection
                details, file format, and write options.

        Raises:
            ValueError: If the specified file format is not supported for writing.
            NotImplementedError: If the 'append' write mode is used with an unsupported format.
            Exception: If the write operation to cloud storage fails for any reason.
        """
        connection = settings.connection
        write_settings = settings.write_settings

        logger.info(f"Writing to {connection.storage_type} storage: {write_settings.resource_path}")

        if write_settings.write_mode == "append" and write_settings.file_format != "delta":
            raise NotImplementedError("The 'append' write mode is not yet supported for this destination.")
        storage_options = CloudStorageReader.get_storage_options(connection)
        credential_provider = CloudStorageReader.get_credential_provider(connection)
        # Dispatch to the correct writer based on file format
        if write_settings.file_format == "parquet":
            self._write_parquet_to_cloud(
                write_settings.resource_path, storage_options, credential_provider, write_settings
            )
        elif write_settings.file_format == "delta":
            self._write_delta_to_cloud(
                write_settings.resource_path, storage_options, credential_provider, write_settings
            )
        elif write_settings.file_format == "csv":
            self._write_csv_to_cloud(write_settings.resource_path, storage_options, credential_provider, write_settings)
        elif write_settings.file_format == "json":
            self._write_json_to_cloud(
                write_settings.resource_path, storage_options, credential_provider, write_settings
            )
        else:
            raise ValueError(f"Unsupported file format for writing: {write_settings.file_format}")

        logger.info(f"Successfully wrote data to {write_settings.resource_path}")

    def _write_parquet_to_cloud(
        self,
        resource_path: str,
        storage_options: dict[str, Any],
        credential_provider: Callable | None,
        write_settings: cloud_storage_schemas.CloudStorageWriteSettings,
    ):
        """(Internal) Writes the DataFrame to a Parquet file in cloud storage.

        Uses `sink_parquet` for efficient streaming writes. Falls back to a
        collect-then-write pattern if sinking fails.
        """
        try:
            sink_kwargs = {
                "path": resource_path,
                "compression": write_settings.parquet_compression,
            }
            if storage_options:
                sink_kwargs["storage_options"] = storage_options
            if credential_provider:
                sink_kwargs["credential_provider"] = credential_provider
            try:
                self.data_frame.sink_parquet(**sink_kwargs)
            except Exception as e:
                logger.warning(f"Failed to sink the data, falling back to collecing and writing. \n {e}")
                pl_df = self.collect()
                sink_kwargs["file"] = sink_kwargs.pop("path")
                pl_df.write_parquet(**sink_kwargs)

        except Exception as e:
            logger.error(f"Failed to write Parquet to {resource_path}: {str(e)}")
            raise Exception(f"Failed to write Parquet to cloud storage: {str(e)}")

    def _write_delta_to_cloud(
        self,
        resource_path: str,
        storage_options: dict[str, Any],
        credential_provider: Callable | None,
        write_settings: cloud_storage_schemas.CloudStorageWriteSettings,
    ):
        """(Internal) Writes the DataFrame to a Delta Lake table in cloud storage.

        This operation requires collecting the data first, as `write_delta` operates
        on an eager DataFrame.
        """
        sink_kwargs = {
            "target": resource_path,
            "mode": write_settings.write_mode,
        }
        if storage_options:
            sink_kwargs["storage_options"] = storage_options
        if credential_provider:
            sink_kwargs["credential_provider"] = credential_provider
        self.collect().write_delta(**sink_kwargs)

    def _write_csv_to_cloud(
        self,
        resource_path: str,
        storage_options: dict[str, Any],
        credential_provider: Callable | None,
        write_settings: cloud_storage_schemas.CloudStorageWriteSettings,
    ):
        """(Internal) Writes the DataFrame to a CSV file in cloud storage.

        Uses `sink_csv` for efficient, streaming writes of the data.
        """
        try:
            sink_kwargs = {
                "path": resource_path,
                "separator": write_settings.csv_delimiter,
            }
            if storage_options:
                sink_kwargs["storage_options"] = storage_options
            if credential_provider:
                sink_kwargs["credential_provider"] = credential_provider

            # sink_csv executes the lazy query and writes the result
            self.data_frame.sink_csv(**sink_kwargs)

        except Exception as e:
            logger.error(f"Failed to write CSV to {resource_path}: {str(e)}")
            raise Exception(f"Failed to write CSV to cloud storage: {str(e)}")

    def _write_json_to_cloud(
        self,
        resource_path: str,
        storage_options: dict[str, Any],
        credential_provider: Callable | None,
        write_settings: cloud_storage_schemas.CloudStorageWriteSettings,
    ):
        """(Internal) Writes the DataFrame to a line-delimited JSON (NDJSON) file.

        Uses `sink_ndjson` for efficient, streaming writes.
        """
        try:
            sink_kwargs = {"path": resource_path}
            if storage_options:
                sink_kwargs["storage_options"] = storage_options
            if credential_provider:
                sink_kwargs["credential_provider"] = credential_provider
            self.data_frame.sink_ndjson(**sink_kwargs)

        except Exception as e:
            logger.error(f"Failed to write JSON to {resource_path}: {str(e)}")
            raise Exception(f"Failed to write JSON to cloud storage: {str(e)}")

    @classmethod
    def from_cloud_storage_obj(
        cls, settings: cloud_storage_schemas.CloudStorageReadSettingsInternal
    ) -> FlowDataEngine:
        """Creates a FlowDataEngine from an object in cloud storage.

        This method supports reading from various cloud storage providers like AWS S3,
        Azure Data Lake Storage, and Google Cloud Storage, with support for
        various authentication methods.

        Args:
            settings: A `CloudStorageReadSettingsInternal` object containing connection
                details, file format, and read options.

        Returns:
            A new `FlowDataEngine` instance containing the data from cloud storage.

        Raises:
            ValueError: If the storage type or file format is not supported.
            NotImplementedError: If a requested file format like "delta" or "iceberg"
                is not yet implemented.
            Exception: If reading from cloud storage fails.
        """
        connection = settings.connection
        read_settings = settings.read_settings

        logger.info(f"Reading from {connection.storage_type} storage: {read_settings.resource_path}")
        # Get storage options based on connection type
        storage_options = CloudStorageReader.get_storage_options(connection)
        # Get credential provider if needed
        credential_provider = CloudStorageReader.get_credential_provider(connection)
        if read_settings.file_format == "parquet":
            return cls._read_parquet_from_cloud(
                read_settings.resource_path,
                storage_options,
                credential_provider,
                read_settings.scan_mode == "directory",
            )
        elif read_settings.file_format == "delta":
            return cls._read_delta_from_cloud(
                read_settings.resource_path, storage_options, credential_provider, read_settings
            )
        elif read_settings.file_format == "csv":
            return cls._read_csv_from_cloud(
                read_settings.resource_path, storage_options, credential_provider, read_settings
            )
        elif read_settings.file_format == "json":
            return cls._read_json_from_cloud(
                read_settings.resource_path,
                storage_options,
                credential_provider,
                read_settings.scan_mode == "directory",
            )
        elif read_settings.file_format == "iceberg":
            return cls._read_iceberg_from_cloud(
                read_settings.resource_path, storage_options, credential_provider, read_settings
            )

        elif read_settings.file_format in ["delta", "iceberg"]:
            # These would require additional libraries
            raise NotImplementedError(f"File format {read_settings.file_format} not yet implemented")
        else:
            raise ValueError(f"Unsupported file format: {read_settings.file_format}")

    @staticmethod
    def _get_schema_from_first_file_in_dir(
        source: str, storage_options: dict[str, Any], file_format: Literal["csv", "parquet", "json", "delta"]
    ) -> list[FlowfileColumn] | None:
        """Infers the schema by scanning the first file in a cloud directory."""
        try:
            scan_func = getattr(pl, "scan_" + file_format)
            first_file_ref = get_first_file_from_s3_dir(source, storage_options=storage_options)
            return convert_stats_to_column_info(
                FlowDataEngine._create_schema_stats_from_pl_schema(
                    scan_func(first_file_ref, storage_options=storage_options).collect_schema()
                )
            )
        except Exception as e:
            logger.warning(f"Could not read schema from first file in directory, using default schema: {e}")

    @classmethod
    def _read_iceberg_from_cloud(
        cls,
        resource_path: str,
        storage_options: dict[str, Any],
        credential_provider: Callable | None,
        read_settings: cloud_storage_schemas.CloudStorageReadSettings,
    ) -> FlowDataEngine:
        """Reads Iceberg table(s) from cloud storage."""
        raise NotImplementedError("Failed to read Iceberg table from cloud storage: Not yet implemented")

    @classmethod
    def _read_parquet_from_cloud(
        cls,
        resource_path: str,
        storage_options: dict[str, Any],
        credential_provider: Callable | None,
        is_directory: bool,
    ) -> FlowDataEngine:
        """Reads Parquet file(s) from cloud storage."""
        try:
            # Use scan_parquet for lazy evaluation
            if is_directory:
                resource_path = ensure_path_has_wildcard_pattern(resource_path=resource_path, file_format="parquet")
            scan_kwargs = {"source": resource_path}

            if storage_options:
                scan_kwargs["storage_options"] = storage_options

            if credential_provider:
                scan_kwargs["credential_provider"] = credential_provider
            if storage_options and is_directory:
                schema = cls._get_schema_from_first_file_in_dir(resource_path, storage_options, "parquet")
            else:
                schema = None
            lf = pl.scan_parquet(**scan_kwargs)

            return cls(
                lf,
                number_of_records=6_666_666,  # Set so the provider is not accessed for this stat
                optimize_memory=True,
                streamable=True,
                schema=schema,
            )

        except Exception as e:
            logger.error(f"Failed to read Parquet from {resource_path}: {str(e)}")
            raise Exception(f"Failed to read Parquet from cloud storage: {str(e)}")

    @classmethod
    def _read_delta_from_cloud(
        cls,
        resource_path: str,
        storage_options: dict[str, Any],
        credential_provider: Callable | None,
        read_settings: cloud_storage_schemas.CloudStorageReadSettings,
    ) -> FlowDataEngine:
        """Reads a Delta Lake table from cloud storage."""
        try:
            logger.info("Reading Delta file from cloud storage...")
            logger.info(f"read_settings: {read_settings}")
            scan_kwargs = {"source": resource_path}
            if read_settings.delta_version:
                scan_kwargs["version"] = read_settings.delta_version
            if storage_options:
                scan_kwargs["storage_options"] = storage_options
            if credential_provider:
                scan_kwargs["credential_provider"] = credential_provider
            lf = pl.scan_delta(**scan_kwargs)

            return cls(
                lf,
                number_of_records=6_666_666,  # Set so the provider is not accessed for this stat
                optimize_memory=True,
                streamable=True,
            )
        except Exception as e:
            logger.error(f"Failed to read Delta file from {resource_path}: {str(e)}")
            raise Exception(f"Failed to read Delta file from cloud storage: {str(e)}")

    @classmethod
    def _read_csv_from_cloud(
        cls,
        resource_path: str,
        storage_options: dict[str, Any],
        credential_provider: Callable | None,
        read_settings: cloud_storage_schemas.CloudStorageReadSettings,
    ) -> FlowDataEngine:
        """Reads CSV file(s) from cloud storage."""
        try:
            scan_kwargs = {
                "source": resource_path,
                "has_header": read_settings.csv_has_header,
                "separator": read_settings.csv_delimiter,
                "encoding": read_settings.csv_encoding,
            }
            if storage_options:
                scan_kwargs["storage_options"] = storage_options
            if credential_provider:
                scan_kwargs["credential_provider"] = credential_provider

            if read_settings.scan_mode == "directory":
                resource_path = ensure_path_has_wildcard_pattern(resource_path=resource_path, file_format="csv")
                scan_kwargs["source"] = resource_path
            if storage_options and read_settings.scan_mode == "directory":
                schema = cls._get_schema_from_first_file_in_dir(resource_path, storage_options, "csv")
            else:
                schema = None

            lf = pl.scan_csv(**scan_kwargs)

            return cls(
                lf,
                number_of_records=6_666_666,  # Set so the provider is not accessed for this stat
                optimize_memory=True,
                streamable=True,
                schema=schema,
            )

        except Exception as e:
            logger.error(f"Failed to read CSV from {resource_path}: {str(e)}")
            raise Exception(f"Failed to read CSV from cloud storage: {str(e)}")

    @classmethod
    def _read_json_from_cloud(
        cls,
        resource_path: str,
        storage_options: dict[str, Any],
        credential_provider: Callable | None,
        is_directory: bool,
    ) -> FlowDataEngine:
        """Reads JSON file(s) from cloud storage."""
        try:
            if is_directory:
                resource_path = ensure_path_has_wildcard_pattern(resource_path, "json")
            scan_kwargs = {"source": resource_path}

            if storage_options:
                scan_kwargs["storage_options"] = storage_options
            if credential_provider:
                scan_kwargs["credential_provider"] = credential_provider

            lf = pl.scan_ndjson(**scan_kwargs)  # Using NDJSON for line-delimited JSON

            return cls(
                lf,
                number_of_records=-1,
                optimize_memory=True,
                streamable=True,
            )

        except Exception as e:
            logger.error(f"Failed to read JSON from {resource_path}: {str(e)}")
            raise Exception(f"Failed to read JSON from cloud storage: {str(e)}")

    def _handle_path_ref(self, path_ref: str, optimize_memory: bool):
        """Handles file path reference input."""
        try:
            pf = ParquetFile(path_ref)
        except Exception as e:
            logger.error(e)
            raise Exception("Provided ref is not a parquet file")

        self.number_of_records = pf.metadata.num_rows
        if optimize_memory:
            self._lazy = True
            self.data_frame = pl.scan_parquet(path_ref)
        else:
            self.data_frame = pl.read_parquet(path_ref)

    def _finalize_initialization(
        self, name: str, optimize_memory: bool, schema: Any | None, calculate_schema_stats: bool
    ):
        """Finalizes initialization by setting remaining attributes."""
        _ = calculate_schema_stats
        self.name = name
        self._optimize_memory = optimize_memory
        if assert_if_flowfile_schema(schema):
            self._schema = schema
            self.columns = [c.column_name for c in self._schema]
        else:
            pl_schema = self.data_frame.collect_schema()
            self._schema = self._handle_schema(schema, pl_schema)
            self.columns = [c.column_name for c in self._schema] if self._schema else pl_schema.names()

    def __getitem__(self, item):
        """Accesses a specific column or item from the DataFrame."""
        return self.data_frame.select([item])

    @property
    def data_frame(self) -> pl.LazyFrame | pl.DataFrame | None:
        """The underlying Polars DataFrame or LazyFrame.

        This property provides access to the Polars object that backs the
        FlowDataEngine. It handles lazy-loading from external sources if necessary.

        Returns:
            The active Polars `DataFrame` or `LazyFrame`.
        """
        if self._data_frame is not None and not self.is_future:
            return self._data_frame
        elif self.is_future:
            return self._data_frame
        elif self._external_source is not None and self.lazy:
            return self._data_frame
        elif self._external_source is not None and not self.lazy:
            if self._external_source.get_pl_df() is None:
                data_frame = list(self._external_source.get_iter())
                if len(data_frame) > 0:
                    self.data_frame = pl.DataFrame(data_frame)
            else:
                self.data_frame = self._external_source.get_pl_df()
            self.calculate_schema()
            return self._data_frame

    @data_frame.setter
    def data_frame(self, df: pl.LazyFrame | pl.DataFrame):
        """Sets the underlying Polars DataFrame or LazyFrame."""
        if self.lazy and isinstance(df, pl.DataFrame):
            raise Exception("Cannot set a non-lazy dataframe to a lazy flowfile")
        self._data_frame = df
        self._schema = None

    @staticmethod
    def _create_schema_stats_from_pl_schema(pl_schema: pl.Schema) -> list[dict]:
        """Converts a Polars Schema into a list of schema statistics dictionaries."""
        return [dict(column_name=k, pl_datatype=v, col_index=i) for i, (k, v) in enumerate(pl_schema.items())]

    def _add_schema_from_schema_stats(self, schema_stats: list[dict]):
        """Populates the schema from a list of schema statistics dictionaries."""
        self._schema = convert_stats_to_column_info(schema_stats)

    @property
    def schema(self) -> list[FlowfileColumn]:
        """The schema of the DataFrame as a list of `FlowfileColumn` objects.

        This property lazily calculates the schema if it hasn't been determined yet.

        Returns:
            A list of `FlowfileColumn` objects describing the schema.
        """
        if self.number_of_fields == 0:
            return []
        if self._schema is None or (self._calculate_schema_stats and not self.ind_schema_calculated):
            if self._calculate_schema_stats and not self.ind_schema_calculated:
                schema_stats = self._calculate_schema()
                self.ind_schema_calculated = True
            else:
                schema_stats = self._create_schema_stats_from_pl_schema(self.data_frame.collect_schema())
            self._add_schema_from_schema_stats(schema_stats)
        return self._schema

    @property
    def number_of_fields(self) -> int:
        """The number of columns (fields) in the DataFrame.

        Returns:
            The integer count of columns.
        """
        if self.__number_of_fields is None:
            self.__number_of_fields = len(self.columns)
        return self.__number_of_fields

    def collect(self, n_records: int = None) -> pl.DataFrame:
        """Collects the data and returns it as a Polars DataFrame.

        This method triggers the execution of the lazy query plan (if applicable)
        and returns the result. It supports streaming to optimize memory usage
        for large datasets.

        Args:
            n_records: The maximum number of records to collect. If None, all
                records are collected.

        Returns:
            A Polars `DataFrame` containing the collected data.
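
        Example:
            A minimal sketch:

                engine = FlowDataEngine(pl.LazyFrame({"a": [1, 2, 3]}))
                preview = engine.collect(n_records=2)  # Polars DataFrame with two rows
                full = engine.collect()                # collects the entire frame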
        """
        if n_records is None:
            logger.info(f'Fetching all data for Table object "{id(self)}". Settings: streaming={self._streamable}')
        else:
            logger.info(
                f'Fetching {n_records} record(s) for Table object "{id(self)}". '
                f"Settings: streaming={self._streamable}"
            )

        if not self.lazy:
            return self.data_frame

        try:
            return self._collect_data(n_records)
        except Exception as e:
            self.errors = [e]
            return self._handle_collection_error(n_records)

    def _collect_data(self, n_records: int = None) -> pl.DataFrame:
        """Internal method to handle data collection logic."""
        if n_records is None:
            self.collect_external()
            if self._streamable:
                try:
                    logger.info("Collecting data in streaming mode")
                    return self.data_frame.collect(engine="streaming")
                except PanicException:
                    self._streamable = False

            logger.info("Collecting data in non-streaming mode")
            return self.data_frame.collect()

        if self.external_source is not None:
            return self._collect_from_external_source(n_records)

        if self._streamable:
            return self.data_frame.head(n_records).collect(engine="streaming")
        return self.data_frame.head(n_records).collect()

    def _collect_from_external_source(self, n_records: int) -> pl.DataFrame:
        """Handles collection from an external source."""
        if self.external_source.get_pl_df() is not None:
            all_data = self.external_source.get_pl_df().head(n_records)
            self.data_frame = all_data
        else:
            all_data = self.external_source.get_sample(n_records)
            self.data_frame = pl.LazyFrame(all_data)
        return self.data_frame

    def _handle_collection_error(self, n_records: int) -> pl.DataFrame:
        """Handles errors during collection by attempting partial collection."""
        n_records = 100_000_000 if n_records is None else n_records
        ok_cols, error_cols = self._identify_valid_columns(n_records)

        if len(ok_cols) > 0:
            return self._create_partial_dataframe(ok_cols, error_cols, n_records)
        return self._create_empty_dataframe(n_records)

    def _identify_valid_columns(self, n_records: int) -> tuple[list[str], list[tuple[str, Any]]]:
        """Identifies which columns can be collected successfully."""
        ok_cols = []
        error_cols = []
        for c in self.columns:
            try:
                _ = self.data_frame.select(c).head(n_records).collect()
                ok_cols.append(c)
            except Exception:
                error_cols.append((c, self.data_frame.schema[c]))
        return ok_cols, error_cols

    def _create_partial_dataframe(
        self, ok_cols: list[str], error_cols: list[tuple[str, Any]], n_records: int
    ) -> pl.DataFrame:
        """Creates a DataFrame with partial data for columns that could be collected."""
        df = self.data_frame.select(ok_cols)
        df = df.with_columns([pl.lit(None).alias(column_name).cast(data_type) for column_name, data_type in error_cols])
        return df.select(self.columns).head(n_records).collect()

    def _create_empty_dataframe(self, n_records: int) -> pl.DataFrame:
        """Creates an empty DataFrame with the correct schema."""
        if self.number_of_records > 0:
            return pl.DataFrame(
                {
                    column_name: pl.Series(
                        name=column_name, values=[None] * min(self.number_of_records, n_records)
                    ).cast(data_type)
                    for column_name, data_type in self.data_frame.schema.items()
                }
            )
        return pl.DataFrame(schema=self.data_frame.schema)

    def do_group_by(
        self, group_by_input: transform_schemas.GroupByInput, calculate_schema_stats: bool = True
    ) -> FlowDataEngine:
        """Performs a group-by operation on the DataFrame.

        Args:
            group_by_input: A `GroupByInput` object defining the grouping columns
                and aggregations.
            calculate_schema_stats: If True, calculates schema statistics for the
                resulting DataFrame.

        Returns:
            A new `FlowDataEngine` instance with the grouped and aggregated data.
        """
        aggregations = [c for c in group_by_input.agg_cols if c.agg != "groupby"]
        group_columns = [c for c in group_by_input.agg_cols if c.agg == "groupby"]

        if len(group_columns) == 0:
            return FlowDataEngine(
                self.data_frame.select(ac.agg_func(ac.old_name).alias(ac.new_name) for ac in aggregations),
                calculate_schema_stats=calculate_schema_stats,
            )

        df = self.data_frame.rename({c.old_name: c.new_name for c in group_columns})
        group_by_columns = [n_c.new_name for n_c in group_columns]

        # Handle case where there are no aggregations - just get unique combinations of group columns
        if len(aggregations) == 0:
            return FlowDataEngine(
                df.select(group_by_columns).unique(),
                calculate_schema_stats=calculate_schema_stats,
            )

        grouped_df = df.group_by(*group_by_columns)
        agg_exprs = [ac.agg_func(ac.old_name).alias(ac.new_name) for ac in aggregations]
        result_df = grouped_df.agg(agg_exprs)

        return FlowDataEngine(
            result_df,
            calculate_schema_stats=calculate_schema_stats,
        )

    def do_sort(self, sorts: list[transform_schemas.SortByInput]) -> FlowDataEngine:
        """Sorts the DataFrame by one or more columns.

        Args:
            sorts: A list of `SortByInput` objects, each specifying a column
                and sort direction ('asc' or 'desc').

        Returns:
            A new `FlowDataEngine` instance with the sorted data.
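
        Example:
            A minimal sketch, assuming `SortByInput` accepts `column` and `how` as
            keyword arguments (only the attribute names are visible in this method):

                engine = FlowDataEngine(pl.LazyFrame({"price": [3, 1, 2]}))
                sorts = [transform_schemas.SortByInput(column="price", how="desc")]
                sorted_engine = engine.do_sort(sorts)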
        """
        if not sorts:
            return self

        descending = [s.how == "desc" or s.how.lower() == "descending" for s in sorts]
        df = self.data_frame.sort([sort_by.column for sort_by in sorts], descending=descending)
        return FlowDataEngine(df, number_of_records=self.number_of_records, schema=self.schema)

    def change_column_types(
        self, transforms: list[transform_schemas.SelectInput], calculate_schema: bool = False
    ) -> FlowDataEngine:
        """Changes the data type of one or more columns.

        Args:
            transforms: A list of `SelectInput` objects, where each object specifies
                the column and its new `polars_type`.
            calculate_schema: If True, recalculates the schema after the type change.

        Returns:
            A new `FlowDataEngine` instance with the updated column types.
        """
        dtypes = [dtype.base_type() for dtype in self.data_frame.collect_schema().dtypes()]
        idx_mapping = list(
            (transform.old_name, self.cols_idx.get(transform.old_name), get_polars_type(transform.polars_type))
            for transform in transforms
            if transform.data_type is not None
        )

        actual_transforms = [c for c in idx_mapping if c[2] != dtypes[c[1]]]
        transformations = [
            utils.define_pl_col_transformation(col_name=transform[0], col_type=transform[2])
            for transform in actual_transforms
        ]

        df = self.data_frame.with_columns(transformations)
        return FlowDataEngine(
            df,
            number_of_records=self.number_of_records,
            calculate_schema_stats=calculate_schema,
            streamable=self._streamable,
        )

    def save(self, path: str, data_type: str = "parquet") -> Future:
        """Saves the DataFrame to a file in a separate thread.

        Args:
            path: The file path to save to.
            data_type: The format to save in (e.g., 'parquet', 'csv').

        Returns:
            A `loky.Future` object representing the asynchronous save operation.
        """
        estimated_size = deepcopy(self.get_estimated_file_size() * 4)
        df = deepcopy(self.data_frame)
        return write_threaded(_df=df, path=path, data_type=data_type, estimated_size=estimated_size)

    def to_pylist(self) -> list[dict]:
        """Converts the DataFrame to a list of Python dictionaries.

        Returns:
            A list where each item is a dictionary representing a row.
        """
        if self.lazy:
            return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_dicts()
        return self.data_frame.to_dicts()

    def to_arrow(self) -> PaTable:
        """Converts the DataFrame to a PyArrow Table.

        This method triggers a `.collect()` call if the data is lazy,
        then converts the resulting eager DataFrame into a `pyarrow.Table`.

        Returns:
            A `pyarrow.Table` instance representing the data.
        """
        if self.lazy:
            return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_arrow()
        else:
            return self.data_frame.to_arrow()

    def to_raw_data(self) -> input_schema.RawData:
        """Converts the DataFrame to a `RawData` schema object.

        Returns:
            An `input_schema.RawData` object containing the schema and data.
        """
        columns = [c.get_minimal_field_info() for c in self.schema]
        data = list(self.to_dict().values())
        return input_schema.RawData(columns=columns, data=data)

    def to_dict(self) -> dict[str, list]:
        """Converts the DataFrame to a Python dictionary of columns.

        Each key in the dictionary is a column name, and the corresponding value
        is a list of the data in that column.

        Returns:
            A dictionary mapping column names to lists of their values.
        """
        if self.lazy:
            return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_dict(as_series=False)
        else:
            return self.data_frame.to_dict(as_series=False)

    @classmethod
    def create_from_external_source(cls, external_source: ExternalDataSource) -> FlowDataEngine:
        """Creates a FlowDataEngine from an external data source.

        Args:
            external_source: An object that conforms to the `ExternalDataSource`
                interface.

        Returns:
            A new `FlowDataEngine` instance.
        """
        if external_source.schema is not None:
            ff = cls.create_from_schema(external_source.schema)
        elif external_source.initial_data_getter is not None:
            ff = cls(raw_data=external_source.initial_data_getter())
        else:
            ff = cls()
        ff._external_source = external_source
        return ff

    @classmethod
    def create_from_sql(cls, sql: str, conn: Any) -> FlowDataEngine:
        """Creates a FlowDataEngine by executing a SQL query.

        Args:
            sql: The SQL query string to execute.
            conn: A database connection object or connection URI string.

        Returns:
            A new `FlowDataEngine` instance with the query result.
        """
        return cls(pl.read_database(sql, conn))

    @classmethod
    def create_from_schema(cls, schema: list[FlowfileColumn]) -> FlowDataEngine:
        """Creates an empty FlowDataEngine from a schema definition.

        Args:
            schema: A list of `FlowfileColumn` objects defining the schema.

        Returns:
            A new, empty `FlowDataEngine` instance with the specified schema.
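
        Example:
            A minimal sketch, assuming `FlowfileColumn.from_input(name, dtype)` is
            available (it is used elsewhere in this module) and that the data type
            strings are valid Polars type names:

                schema = [
                    FlowfileColumn.from_input("id", "UInt64"),
                    FlowfileColumn.from_input("name", "String"),
                ]
                empty_engine = FlowDataEngine.create_from_schema(schema)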
        """
        pl_schema = []
        for i, flow_file_column in enumerate(schema):
            pl_schema.append((flow_file_column.name, cast_str_to_polars_type(flow_file_column.data_type)))
            schema[i].col_index = i
        df = pl.LazyFrame(schema=pl_schema)
        return cls(df, schema=schema, calculate_schema_stats=False, number_of_records=0)

    @classmethod
    def create_from_path(cls, received_table: input_schema.ReceivedTable) -> FlowDataEngine:
        """Creates a FlowDataEngine from a local file path.

        Supports various file types like CSV, Parquet, and Excel.

        Args:
            received_table: A `ReceivedTableBase` object containing the file path
                and format details.

        Returns:
            A new `FlowDataEngine` instance with data from the file.
        """
        received_table.set_absolute_filepath()
        file_type_handlers = {
            "csv": create_funcs.create_from_path_csv,
            "parquet": create_funcs.create_from_path_parquet,
            "excel": create_funcs.create_from_path_excel,
        }

        handler = file_type_handlers.get(received_table.file_type)
        if not handler:
            raise Exception(f"Cannot create from {received_table.file_type}")

        flow_file = cls(handler(received_table))
        flow_file._org_path = received_table.abs_file_path
        return flow_file

    @classmethod
    def create_random(cls, number_of_records: int = 1000) -> FlowDataEngine:
        """Creates a FlowDataEngine with randomly generated data.

        Useful for testing and examples.

        Args:
            number_of_records: The number of random records to generate.

        Returns:
            A new `FlowDataEngine` instance with fake data.
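
        Example:
            A minimal sketch:

                engine = FlowDataEngine.create_random(number_of_records=100)
                df = engine.collect()  # materializes the fake data as a Polars DataFrame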
        """
        return cls(create_fake_data(number_of_records))

    @classmethod
    def generate_enumerator(cls, length: int = 1000, output_name: str = "output_column") -> FlowDataEngine:
        """Generates a FlowDataEngine with a single column containing a sequence of integers.

        Args:
            length: The number of integers to generate in the sequence.
            output_name: The name of the output column.

        Returns:
            A new `FlowDataEngine` instance.
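
        Example:
            A minimal sketch:

                ids = FlowDataEngine.generate_enumerator(length=5, output_name="row_nr")
                ids.collect()  # a single UInt32 column "row_nr" with values 0 through 4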
        """
        if length > 10_000_000:
            length = 10_000_000
        return cls(pl.LazyFrame().select((pl.int_range(0, length, dtype=pl.UInt32)).alias(output_name)))

    def _handle_schema(
        self, schema: list[FlowfileColumn] | list[str] | pl.Schema | None, pl_schema: pl.Schema
    ) -> list[FlowfileColumn] | None:
        """Handles schema processing and validation during initialization."""
        if schema is None and pl_schema is not None:
            return convert_stats_to_column_info(self._create_schema_stats_from_pl_schema(pl_schema))
        elif schema is None and pl_schema is None:
            return None
        elif assert_if_flowfile_schema(schema) and pl_schema is None:
            return schema
        elif pl_schema is not None and schema is not None:
            if len(schema) != len(pl_schema):
                raise Exception(
                    f"Schema does not match the data: got {len(schema)} columns, expected {len(pl_schema)}"
                )
            if isinstance(schema, pl.Schema):
                return self._handle_polars_schema(schema, pl_schema)
            elif isinstance(schema, list) and len(schema) == 0:
                return []
            elif isinstance(schema[0], str):
                return self._handle_string_schema(schema, pl_schema)
            return schema

    def _handle_polars_schema(self, schema: pl.Schema, pl_schema: pl.Schema) -> list[FlowfileColumn]:
        """Handles Polars schema conversion."""
        flow_file_columns = [
            FlowfileColumn.create_from_polars_dtype(column_name=col_name, data_type=dtype)
            for col_name, dtype in zip(schema.names(), schema.dtypes(), strict=False)
        ]

        select_arg = [
            pl.col(o).alias(n).cast(schema_dtype)
            for o, n, schema_dtype in zip(pl_schema.names(), schema.names(), schema.dtypes(), strict=False)
        ]

        self.data_frame = self.data_frame.select(select_arg)
        return flow_file_columns

    def _handle_string_schema(self, schema: list[str], pl_schema: pl.Schema) -> list[FlowfileColumn]:
        """Handles string-based schema conversion."""
        flow_file_columns = [
            FlowfileColumn.create_from_polars_dtype(column_name=col_name, data_type=dtype)
            for col_name, dtype in zip(schema, pl_schema.dtypes(), strict=False)
        ]

        self.data_frame = self.data_frame.rename({o: n for o, n in zip(pl_schema.names(), schema, strict=False)})

        return flow_file_columns

    def split(self, split_input: transform_schemas.TextToRowsInput) -> FlowDataEngine:
        """Splits a column's text values into multiple rows based on a delimiter.

        This operation is often referred to as "exploding" the DataFrame, as it
        increases the number of rows.

        Args:
            split_input: A `TextToRowsInput` object specifying the column to split,
                the delimiter, and the output column name.

        Returns:
            A new `FlowDataEngine` instance with the exploded rows.
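
        Example:
            A minimal sketch, assuming `TextToRowsInput` accepts these fields as
            keyword arguments:

                engine = FlowDataEngine(pl.LazyFrame({"tags": ["a,b", "c"]}))
                settings = transform_schemas.TextToRowsInput(
                    column_to_split="tags",
                    output_column_name="tag",
                    split_by_fixed_value=True,
                    split_fixed_value=",",
                )
                exploded = engine.split(settings)  # one row per tag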
        """
        output_column_name = (
            split_input.output_column_name if split_input.output_column_name else split_input.column_to_split
        )

        split_value = (
            split_input.split_fixed_value if split_input.split_by_fixed_value else pl.col(split_input.split_by_column)
        )

        df = self.data_frame.with_columns(
            pl.col(split_input.column_to_split).str.split(by=split_value).alias(output_column_name)
        ).explode(output_column_name)

        return FlowDataEngine(df)

    def unpivot(self, unpivot_input: transform_schemas.UnpivotInput) -> FlowDataEngine:
        """Converts the DataFrame from a wide to a long format.

        This is the inverse of a pivot operation, taking columns and transforming
        them into `variable` and `value` rows.

        Args:
            unpivot_input: An `UnpivotInput` object specifying which columns to
                unpivot and which to keep as index columns.

        Returns:
            A new, unpivoted `FlowDataEngine` instance.
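
        Example:
            A minimal sketch, assuming `UnpivotInput` accepts `value_columns` and
            `index_columns` as keyword arguments:

                engine = FlowDataEngine(pl.LazyFrame({"id": [1], "q1": [10], "q2": [20]}))
                settings = transform_schemas.UnpivotInput(
                    index_columns=["id"],
                    value_columns=["q1", "q2"],
                )
                long_engine = engine.unpivot(settings)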
        """
        lf = self.data_frame

        if unpivot_input.data_type_selector_expr is not None:
            result = lf.unpivot(on=unpivot_input.data_type_selector_expr(), index=unpivot_input.index_columns)
        elif unpivot_input.value_columns is not None:
            result = lf.unpivot(on=unpivot_input.value_columns, index=unpivot_input.index_columns)
        else:
            result = lf.unpivot()

        return FlowDataEngine(result)

    def do_pivot(self, pivot_input: transform_schemas.PivotInput, node_logger: NodeLogger = None) -> FlowDataEngine:
        """Converts the DataFrame from a long to a wide format, aggregating values.

        Args:
            pivot_input: A `PivotInput` object defining the index, pivot, and value
                columns, along with the aggregation logic.
            node_logger: An optional logger for reporting warnings, e.g., if the
                pivot column has too many unique values.

        Returns:
            A new, pivoted `FlowDataEngine` instance.
        """
        # Get unique values for pivot columns
        max_unique_vals = 200
        new_cols_unique = fetch_unique_values(
            self.data_frame.select(pivot_input.pivot_column)
            .unique()
            .sort(pivot_input.pivot_column)
            .limit(max_unique_vals)
            .cast(pl.String)
        )
        if len(new_cols_unique) >= max_unique_vals:
            if node_logger:
                node_logger.warning(
                    "Pivot column has too many unique values. Please consider using a different column."
                    f" Max unique values: {max_unique_vals}"
                )

        if len(pivot_input.index_columns) == 0:
            no_index_cols = True
            pivot_input.index_columns = ["__temp__"]
            ff = self.apply_flowfile_formula("1", col_name="__temp__")
        else:
            no_index_cols = False
            ff = self

        # Perform pivot operations
        index_columns = pivot_input.get_index_columns()
        grouped_ff = ff.do_group_by(pivot_input.get_group_by_input(), False)
        pivot_column = pivot_input.get_pivot_column()

        input_df = grouped_ff.data_frame.with_columns(pivot_column.cast(pl.String).alias(pivot_input.pivot_column))
        number_of_aggregations = len(pivot_input.aggregations)
        df = (
            input_df.select(*index_columns, pivot_column, pivot_input.get_values_expr())
            .group_by(*index_columns)
            .agg(
                [
                    (pl.col("vals").filter(pivot_column == new_col_value)).first().alias(new_col_value)
                    for new_col_value in new_cols_unique
                ]
            )
            .select(
                *index_columns,
                *[
                    pl.col(new_col)
                    .struct.field(agg)
                    .alias(f"{new_col}_{agg}" if number_of_aggregations > 1 else new_col)
                    for new_col in new_cols_unique
                    for agg in pivot_input.aggregations
                ],
            )
        )

        # Clean up temporary columns if needed
        if no_index_cols:
            df = df.drop("__temp__")
            pivot_input.index_columns = []

        return FlowDataEngine(df, calculate_schema_stats=False)

    def do_filter(self, predicate: str) -> FlowDataEngine:
        """Filters rows based on a predicate expression.

        Args:
            predicate: A string containing a Polars expression that evaluates to
                a boolean value.

        Returns:
            A new `FlowDataEngine` instance containing only the rows that match
            the predicate.
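
        Example:
            A minimal sketch; the predicate is passed to `to_expr`, so it should use
            whatever expression syntax that helper accepts:

                engine = FlowDataEngine(pl.LazyFrame({"price": [5, 15, 25]}))
                expensive = engine.do_filter("price > 10")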
        """
        try:
            f = to_expr(predicate)
        except Exception as e:
            logger.warning(f"Error in filter expression: {e}")
            f = to_expr("False")
        df = self.data_frame.filter(f)
        return FlowDataEngine(df, schema=self.schema, streamable=self._streamable)

    def add_record_id(self, record_id_settings: transform_schemas.RecordIdInput) -> FlowDataEngine:
        """Adds a record ID (row number) column to the DataFrame.

        Can generate a simple sequential ID or a grouped ID that resets for
        each group.

        Args:
            record_id_settings: A `RecordIdInput` object specifying the output
                column name, offset, and optional grouping columns.

        Returns:
            A new `FlowDataEngine` instance with the added record ID column.
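
        Example:
            A minimal sketch, assuming `RecordIdInput` accepts these fields as
            keyword arguments:

                engine = FlowDataEngine(pl.LazyFrame({"a": ["x", "y", "z"]}))
                settings = transform_schemas.RecordIdInput(
                    output_column_name="record_id",
                    offset=1,
                    group_by=False,
                )
                with_ids = engine.add_record_id(settings)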
        """
        if record_id_settings.group_by and len(record_id_settings.group_by_columns) > 0:
            return self._add_grouped_record_id(record_id_settings)
        return self._add_simple_record_id(record_id_settings)

    def _add_grouped_record_id(self, record_id_settings: transform_schemas.RecordIdInput) -> FlowDataEngine:
        """Adds a record ID column with grouping."""
        select_cols = [pl.col(record_id_settings.output_column_name)] + [pl.col(c) for c in self.columns]

        df = (
            self.data_frame.with_columns(pl.lit(1).alias(record_id_settings.output_column_name))
            .with_columns(
                (
                    pl.cum_count(record_id_settings.output_column_name).over(record_id_settings.group_by_columns)
                    + record_id_settings.offset
                    - 1
                ).alias(record_id_settings.output_column_name)
            )
            .select(select_cols)
        )

        output_schema = [FlowfileColumn.from_input(record_id_settings.output_column_name, "UInt64")]
        output_schema.extend(self.schema)

        return FlowDataEngine(df, schema=output_schema)

    def _add_simple_record_id(self, record_id_settings: transform_schemas.RecordIdInput) -> FlowDataEngine:
        """Adds a simple sequential record ID column."""
        df = self.data_frame.with_row_index(record_id_settings.output_column_name, record_id_settings.offset)

        output_schema = [FlowfileColumn.from_input(record_id_settings.output_column_name, "UInt64")]
        output_schema.extend(self.schema)

        return FlowDataEngine(df, schema=output_schema)

    def get_schema_column(self, col_name: str) -> FlowfileColumn:
        """Retrieves the schema information for a single column by its name.

        Args:
            col_name: The name of the column to retrieve.

        Returns:
            A `FlowfileColumn` object for the specified column, or `None` if not found.
        """
        for s in self.schema:
            if s.name == col_name:
                return s

    def get_estimated_file_size(self) -> int:
        """Estimates the file size in bytes if the data originated from a local file.

        This relies on the original path being tracked during file ingestion.

        Returns:
            The file size in bytes, or 0 if the original path is unknown.
        """
        if self._org_path is not None:
            return os.path.getsize(self._org_path)
        return 0

    def __repr__(self) -> str:
        """Returns a string representation of the FlowDataEngine."""
        return f"flow data engine\n{self.data_frame.__repr__()}"

    def __call__(self) -> FlowDataEngine:
        """Makes the class instance callable, returning itself."""
        return self

    def __len__(self) -> int:
        """Returns the number of records in the table."""
        return self.number_of_records if self.number_of_records >= 0 else self.get_number_of_records()

    def cache(self) -> FlowDataEngine:
        """Caches the current DataFrame to disk and updates the internal reference.

        This triggers a background process to write the current LazyFrame's result
        to a temporary file. Subsequent operations on this `FlowDataEngine` instance
        will read from the cached file, which can speed up downstream computations.

        Returns:
            The same `FlowDataEngine` instance, now backed by the cached data.
        """
        edf = ExternalDfFetcher(
            lf=self.data_frame, file_ref=str(id(self)), wait_on_completion=False, flow_id=-1, node_id=-1
        )
        logger.info("Caching data in background")
        result = edf.get_result()
        if isinstance(result, pl.LazyFrame):
            logger.info("Data cached")
            del self._data_frame
            self.data_frame = result
            logger.info("Data loaded from cache")
        return self

    def collect_external(self):
        """Materializes data from a tracked external source.

        If the `FlowDataEngine` was created from an `ExternalDataSource`, this
        method will trigger the data retrieval, update the internal `_data_frame`
        to a `LazyFrame` of the collected data, and reset the schema to be
        re-evaluated.
        """
        if self._external_source is not None:
            logger.info("Collecting external source")
            if self.external_source.get_pl_df() is not None:
                self.data_frame = self.external_source.get_pl_df().lazy()
            else:
                self.data_frame = pl.LazyFrame(list(self.external_source.get_iter()))
            self._schema = None  # enforce reset schema

    def get_output_sample(self, n_rows: int = 10) -> list[dict]:
        """Gets a sample of the data as a list of dictionaries.

        This is typically used to display a preview of the data in a UI.

        Args:
            n_rows: The number of rows to sample.

        Returns:
            A list of dictionaries, where each dictionary represents a row.
        """
        if self.number_of_records > n_rows or self.number_of_records < 0:
            df = self.collect(n_rows)
        else:
            df = self.collect()
        return df.to_dicts()

    def __get_sample__(self, n_rows: int = 100, streamable: bool = True) -> FlowDataEngine:
        """Internal method to get a sample of the data."""
        if not self.lazy:
            df = self.data_frame.lazy()
        else:
            df = self.data_frame

        if streamable:
            try:
                df = df.head(n_rows).collect()
            except Exception as e:
                logger.warning(f"Error in getting sample: {e}")
                df = df.head(n_rows).collect(engine="auto")
        else:
            df = self.collect()
        return FlowDataEngine(df, number_of_records=len(df), schema=self.schema)

    def get_sample(
        self,
        n_rows: int = 100,
        random: bool = False,
        shuffle: bool = False,
        seed: int = None,
        execution_location: ExecutionLocationsLiteral | None = None,
    ) -> FlowDataEngine:
        """Gets a sample of rows from the DataFrame.

        Args:
            n_rows: The number of rows to sample.
            random: If True, performs random sampling. If False, takes the first n_rows.
            shuffle: If True (and `random` is True), shuffles the data before sampling.
            seed: A random seed for reproducibility.
            execution_location: The location used to calculate the size of the DataFrame.

        Returns:
            A new `FlowDataEngine` instance containing the sampled data.
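
        Example:
            A minimal sketch:

                engine = FlowDataEngine.create_random(1_000)
                head_sample = engine.get_sample(n_rows=50)
                random_sample = engine.get_sample(n_rows=50, random=True, shuffle=True, seed=42)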
        """
        logging.info(f"Getting sample of {n_rows} rows")
        if random:
            if self.lazy and self.external_source is not None:
                self.collect_external()

            if self.lazy and shuffle:
                sample_df = self.data_frame.collect(engine="streaming" if self._streamable else "auto").sample(
                    n_rows, seed=seed, shuffle=shuffle
                )
            elif shuffle:
                sample_df = self.data_frame.sample(n_rows, seed=seed, shuffle=shuffle)
            else:
                if execution_location is None:
                    execution_location = get_global_execution_location()
                n_rows = min(
                    n_rows, self.get_number_of_records(calculate_in_worker_process=execution_location == "remote")
                )

                every_n_records = ceil(self.number_of_records / n_rows)
                sample_df = self.data_frame.gather_every(every_n_records)
        else:
            if self.external_source:
                self.collect(n_rows)
            sample_df = self.data_frame.head(n_rows)

        return FlowDataEngine(sample_df, schema=self.schema)

    def get_subset(self, n_rows: int = 100) -> FlowDataEngine:
        """Gets the first `n_rows` from the DataFrame.

        Args:
            n_rows: The number of rows to include in the subset.

        Returns:
            A new `FlowDataEngine` instance containing the subset of data.
        """
        return FlowDataEngine(self.data_frame.head(n_rows), calculate_schema_stats=True)

    def iter_batches(
        self, batch_size: int = 1000, columns: list | tuple | str = None
    ) -> Generator[FlowDataEngine, None, None]:
        """Iterates over the DataFrame in batches.

        Args:
            batch_size: The size of each batch.
            columns: A list of column names to include in the batches. If None,
                all columns are included.

        Yields:
            A `FlowDataEngine` instance for each batch.
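
        Example:
            A minimal sketch (note that iterating materializes the data, since
            `lazy` is set to False internally):

                engine = FlowDataEngine.create_random(10_000)
                for batch in engine.iter_batches(batch_size=1_000):
                    print(len(batch))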
        """
        if columns:
            self.data_frame = self.data_frame.select(columns)
        self.lazy = False
        batches = self.data_frame.iter_slices(batch_size)
        for batch in batches:
            yield FlowDataEngine(batch)

    def start_fuzzy_join(
        self,
        fuzzy_match_input: transform_schemas.FuzzyMatchInput,
        other: FlowDataEngine,
        file_ref: str,
        flow_id: int = -1,
        node_id: int | str = -1,
    ) -> ExternalFuzzyMatchFetcher:
        """Starts a fuzzy join operation in a background process.

        This method prepares the data and initiates the fuzzy matching in a
        separate process, returning a tracker object immediately.

        Args:
            fuzzy_match_input: A `FuzzyMatchInput` object with the matching parameters.
            other: The right `FlowDataEngine` to join with.
            file_ref: A reference string for temporary files.
            flow_id: The flow ID for tracking.
            node_id: The node ID for tracking.

        Returns:
            An `ExternalFuzzyMatchFetcher` object that can be used to track the
            progress and retrieve the result of the fuzzy join.
        """
        fuzzy_match_input_manager = transform_schemas.FuzzyMatchInputManager(fuzzy_match_input)
        left_df, right_df = prepare_for_fuzzy_match(
            left=self, right=other, fuzzy_match_input_manager=fuzzy_match_input_manager
        )

        return ExternalFuzzyMatchFetcher(
            left_df,
            right_df,
            fuzzy_maps=fuzzy_match_input_manager.fuzzy_maps,
            file_ref=file_ref + "_fm",
            wait_on_completion=False,
            flow_id=flow_id,
            node_id=node_id,
        )

    def fuzzy_join_external(
        self,
        fuzzy_match_input: transform_schemas.FuzzyMatchInput,
        other: FlowDataEngine,
        file_ref: str = None,
        flow_id: int = -1,
        node_id: int = -1,
    ):
        if file_ref is None:
            file_ref = str(id(self)) + "_" + str(id(other))
        fuzzy_match_input_manager = transform_schemas.FuzzyMatchInputManager(fuzzy_match_input)

        left_df, right_df = prepare_for_fuzzy_match(
            left=self, right=other, fuzzy_match_input_manager=fuzzy_match_input_manager
        )
        external_tracker = ExternalFuzzyMatchFetcher(
            left_df,
            right_df,
            fuzzy_maps=fuzzy_match_input_manager.fuzzy_maps,
            file_ref=file_ref + "_fm",
            wait_on_completion=False,
            flow_id=flow_id,
            node_id=node_id,
        )
        return FlowDataEngine(external_tracker.get_result())

    def fuzzy_join(
        self,
        fuzzy_match_input: transform_schemas.FuzzyMatchInput,
        other: FlowDataEngine,
        node_logger: NodeLogger = None,
    ) -> FlowDataEngine:
        fuzzy_match_input_manager = transform_schemas.FuzzyMatchInputManager(fuzzy_match_input)
        left_df, right_df = prepare_for_fuzzy_match(
            left=self, right=other, fuzzy_match_input_manager=fuzzy_match_input_manager
        )
        fuzzy_mappings = [FuzzyMapping(**fm.__dict__) for fm in fuzzy_match_input_manager.fuzzy_maps]
        return FlowDataEngine(
            fuzzy_match_dfs(
                left_df, right_df, fuzzy_maps=fuzzy_mappings, logger=node_logger.logger if node_logger else logger
            ).lazy()
        )

    def do_cross_join(
        self,
        cross_join_input: transform_schemas.CrossJoinInput,
        auto_generate_selection: bool,
        verify_integrity: bool,
        other: FlowDataEngine,
    ) -> FlowDataEngine:
        """Performs a cross join with another DataFrame.

        A cross join produces the Cartesian product of the two DataFrames.

        Args:
            cross_join_input: A `CrossJoinInput` object specifying column selections.
            auto_generate_selection: If True, automatically renames columns to avoid conflicts.
            verify_integrity: If True, checks if the resulting join would be too large.
            other: The right `FlowDataEngine` to join with.

        Returns:
            A new `FlowDataEngine` with the result of the cross join.

        Raises:
            Exception: If `verify_integrity` is True and the join would result in
                an excessively large number of records.
        """
        self.lazy = True
        other.lazy = True
        cross_join_input_manager = transform_schemas.CrossJoinInputManager(cross_join_input)
        verify_join_select_integrity(
            cross_join_input_manager.input, left_columns=self.columns, right_columns=other.columns
        )
        right_select = [
            v.old_name
            for v in cross_join_input_manager.right_select.renames
            if (v.keep or v.join_key) and v.is_available
        ]
        left_select = [
            v.old_name
            for v in cross_join_input_manager.left_select.renames
            if (v.keep or v.join_key) and v.is_available
        ]
        cross_join_input_manager.auto_rename(rename_mode="suffix")
        left = self.data_frame.select(left_select).rename(cross_join_input_manager.left_select.rename_table)
        right = other.data_frame.select(right_select).rename(cross_join_input_manager.right_select.rename_table)

        joined_df = left.join(right, how="cross")

        cols_to_delete_after = [
            col.new_name
            for col in cross_join_input_manager.left_select.renames + cross_join_input_manager.right_select.renames
            if col.join_key and not col.keep and col.is_available
        ]

        fl = FlowDataEngine(joined_df.drop(cols_to_delete_after), calculate_schema_stats=False, streamable=False)
        return fl

    def join(
        self,
        join_input: transform_schemas.JoinInput,
        auto_generate_selection: bool,
        verify_integrity: bool,
        other: FlowDataEngine,
    ) -> FlowDataEngine:
        """Performs a standard SQL-style join with another DataFrame."""
        # Create manager from input
        join_manager = transform_schemas.JoinInputManager(join_input)
        ensure_right_unselect_for_semi_and_anti_joins(join_manager.input)
        for jk in join_manager.join_mapping:
            if jk.left_col not in {c.old_name for c in join_manager.left_select.renames}:
                join_manager.left_select.append(transform_schemas.SelectInput(jk.left_col, keep=False))
            if jk.right_col not in {c.old_name for c in join_manager.right_select.renames}:
                join_manager.right_select.append(transform_schemas.SelectInput(jk.right_col, keep=False))
        verify_join_select_integrity(join_manager.input, left_columns=self.columns, right_columns=other.columns)
        if not verify_join_map_integrity(join_manager.input, left_columns=self.schema, right_columns=other.schema):
            raise Exception("Join is not valid by the data fields")

        if auto_generate_selection:
            join_manager.auto_rename()

        # Use manager properties throughout
        left = self.data_frame.select(join_manager.left_manager.get_select_cols()).rename(
            join_manager.left_manager.get_rename_table()
        )
        right = other.data_frame.select(join_manager.right_manager.get_select_cols()).rename(
            join_manager.right_manager.get_rename_table()
        )

        left, right, reverse_join_key_mapping = _handle_duplication_join_keys(left, right, join_manager)
        left, right = rename_df_table_for_join(left, right, join_manager.get_join_key_renames())
        if join_manager.how == "right":
            joined_df = right.join(
                other=left,
                left_on=join_manager.right_join_keys,
                right_on=join_manager.left_join_keys,
                how="left",
                suffix="",
            ).rename(reverse_join_key_mapping)
        else:
            joined_df = left.join(
                other=right,
                left_on=join_manager.left_join_keys,
                right_on=join_manager.right_join_keys,
                how=join_manager.how,
                suffix="",
            ).rename(reverse_join_key_mapping)

        left_cols_to_delete_after = [
            get_col_name_to_delete(col, "left")
            for col in join_manager.input.left_select.renames
            if not col.keep and col.is_available and col.join_key
        ]

        right_cols_to_delete_after = [
            get_col_name_to_delete(col, "right")
            for col in join_manager.input.right_select.renames
            if not col.keep
            and col.is_available
            and col.join_key
            and join_manager.how in ("left", "right", "inner", "cross", "outer")
        ]

        if len(right_cols_to_delete_after + left_cols_to_delete_after) > 0:
            joined_df = joined_df.drop(left_cols_to_delete_after + right_cols_to_delete_after)

        undo_join_key_remapping = get_undo_rename_mapping_join(join_manager)
        joined_df = joined_df.rename(undo_join_key_remapping)

        return FlowDataEngine(joined_df, calculate_schema_stats=False, number_of_records=0, streamable=False)

    def solve_graph(self, graph_solver_input: transform_schemas.GraphSolverInput) -> FlowDataEngine:
        """Solves a graph problem represented by 'from' and 'to' columns.

        This is used for operations like finding connected components in a graph.

        Args:
            graph_solver_input: A `GraphSolverInput` object defining the source,
                destination, and output column names.

        Returns:
            A new `FlowDataEngine` instance with the solved graph data.
        """
        lf = self.data_frame.with_columns(
            graph_solver(graph_solver_input.col_from, graph_solver_input.col_to).alias(
                graph_solver_input.output_column_name
            )
        )
        return FlowDataEngine(lf)

    def add_new_values(self, values: Iterable, col_name: str = None) -> FlowDataEngine:
        """Adds a new column with the provided values.

        Args:
            values: An iterable (e.g., list, tuple) of values to add as a new column.
            col_name: The name for the new column. Defaults to 'new_values'.

        Returns:
            A new `FlowDataEngine` instance with the added column.
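
        Example:
            A minimal sketch:

                engine = FlowDataEngine(pl.LazyFrame({"a": [1, 2, 3]}))
                flagged = engine.add_new_values([True, False, True], col_name="flag")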
        """
        if col_name is None:
            col_name = "new_values"
        return FlowDataEngine(self.data_frame.with_columns(pl.Series(values).alias(col_name)))

    def get_record_count(self) -> FlowDataEngine:
        """Returns a new FlowDataEngine with a single column 'number_of_records'
        containing the total number of records.

        Returns:
            A new `FlowDataEngine` instance.
        """
        return FlowDataEngine(self.data_frame.select(pl.len().alias("number_of_records")))

    def assert_equal(self, other: FlowDataEngine, ordered: bool = True, strict_schema: bool = False):
        """Asserts that this DataFrame is equal to another.

        Useful for testing.

        Args:
            other: The other `FlowDataEngine` to compare with.
            ordered: If True, the row order must be identical.
            strict_schema: If True, the data types of the schemas must be identical.

        Raises:
            Exception: If the DataFrames are not equal based on the specified criteria.
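
        Example:
            A minimal sketch:

                a = FlowDataEngine(pl.LazyFrame({"x": [1, 2]}))
                b = FlowDataEngine(pl.LazyFrame({"x": [1, 2]}))
                a.assert_equal(b)  # raises if row counts, columns, or data differ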
        """
        org_laziness = self.lazy, other.lazy
        self.lazy = False
        other.lazy = False
        self.number_of_records = -1
        other.number_of_records = -1
        other = other.select_columns(self.columns)

        if self.get_number_of_records_in_process() != other.get_number_of_records_in_process():
            raise Exception("Number of records is not equal")

        if self.columns != other.columns:
            raise Exception("Schema is not equal")

        if strict_schema:
            assert self.data_frame.schema == other.data_frame.schema, "Data types do not match"

        if ordered:
            self_lf = self.data_frame.sort(by=self.columns)
            other_lf = other.data_frame.sort(by=other.columns)
        else:
            self_lf = self.data_frame
            other_lf = other.data_frame

        self.lazy, other.lazy = org_laziness
        assert self_lf.equals(other_lf), "Data is not equal"

    def initialize_empty_fl(self):
        """Initializes an empty LazyFrame."""
        self.data_frame = pl.LazyFrame()
        self.number_of_records = 0
        self._lazy = True

    def _calculate_number_of_records_in_worker(self) -> int:
        """Calculates the number of records in a worker process."""
        number_of_records = ExternalDfFetcher(
            lf=self.data_frame,
            operation_type="calculate_number_of_records",
            flow_id=-1,
            node_id=-1,
            wait_on_completion=True,
        ).result
        return number_of_records

    def get_number_of_records_in_process(self, force_calculate: bool = False) -> int:
        """Gets the number of records in the DataFrame in the local process.

        Args:
            force_calculate: If True, forces recalculation even if a value is cached.

        Returns:
            The total number of records.
        """
        return self.get_number_of_records(force_calculate=force_calculate)

    def get_number_of_records(
        self, warn: bool = False, force_calculate: bool = False, calculate_in_worker_process: bool = False
    ) -> int:
        """Gets the total number of records in the DataFrame.

        For lazy frames, this may trigger a full data scan, which can be expensive.

        Args:
            warn: If True, logs a warning if a potentially expensive calculation is triggered.
            force_calculate: If True, forces recalculation even if a value is cached.
            calculate_in_worker_process: If True, offloads the calculation to a worker process.

        Returns:
            The total number of records.

        Raises:
            ValueError: If the number of records could not be determined.
        """
        if self.is_future and not self.is_collected:
            return -1
        if self.number_of_records is None or self.number_of_records < 0 or force_calculate:
            if self._number_of_records_callback is not None:
                self._number_of_records_callback(self)

            if self.lazy:
                if calculate_in_worker_process:
                    try:
                        self.number_of_records = self._calculate_number_of_records_in_worker()
                        return self.number_of_records
                    except Exception as e:
                        logger.error(f"Error: {e}")
                if warn:
                    logger.warning("Calculating the number of records this can be expensive on a lazy frame")
                try:
                    self.number_of_records = self.data_frame.select(pl.len()).collect(
                        engine="streaming" if self._streamable else "auto"
                    )[0, 0]
                except Exception:
                    raise ValueError("Could not get number of records")
            else:
                self.number_of_records = self.data_frame.__len__()
        return self.number_of_records

    @property
    def has_errors(self) -> bool:
        """Checks if there are any errors."""
        return len(self.errors) > 0

    @property
    def lazy(self) -> bool:
        """Indicates if the DataFrame is in lazy mode."""
        return self._lazy

    @lazy.setter
    def lazy(self, exec_lazy: bool = False):
        """Sets the laziness of the DataFrame.

        Args:
            exec_lazy: If True, converts the DataFrame to a LazyFrame. If False,
                collects the data and converts it to an eager DataFrame.
        """
        if exec_lazy != self._lazy:
            if exec_lazy:
                self.data_frame = self.data_frame.lazy()
            else:
                self._lazy = exec_lazy
                if self.external_source is not None:
                    df = self.collect()
                    self.data_frame = df
                else:
                    self.data_frame = self.data_frame.collect(engine="streaming" if self._streamable else "auto")
            self._lazy = exec_lazy

    @property
    def external_source(self) -> ExternalDataSource:
        """The external data source, if any."""
        return self._external_source

    @property
    def cols_idx(self) -> dict[str, int]:
        """A dictionary mapping column names to their integer index."""
        if self._col_idx is None:
            self._col_idx = {c: i for i, c in enumerate(self.columns)}
        return self._col_idx

    @property
    def __name__(self) -> str:
        """The name of the table."""
        return self.name

    def get_select_inputs(self) -> transform_schemas.SelectInputs:
        """Gets `SelectInput` specifications for all columns in the current schema.

        Returns:
            A `SelectInputs` object that can be used to configure selection or
            transformation operations.
        """
        return transform_schemas.SelectInputs(
            [transform_schemas.SelectInput(old_name=c.name, data_type=c.data_type) for c in self.schema]
        )

    def select_columns(self, list_select: list[str] | tuple[str] | str) -> FlowDataEngine:
        """Selects a subset of columns from the DataFrame.

        Args:
            list_select: A list, tuple, or single string of column names to select.

        Returns:
            A new `FlowDataEngine` instance containing only the selected columns.
        """
        if isinstance(list_select, str):
            list_select = [list_select]

        idx_to_keep = [self.cols_idx.get(c) for c in list_select]
        selects = [ls for ls, id_to_keep in zip(list_select, idx_to_keep, strict=False) if id_to_keep is not None]
        new_schema = [self.schema[i] for i in idx_to_keep if i is not None]

        return FlowDataEngine(
            self.data_frame.select(selects),
            number_of_records=self.number_of_records,
            schema=new_schema,
            streamable=self._streamable,
        )

    def drop_columns(self, columns: list[str]) -> FlowDataEngine:
        """Drops specified columns from the DataFrame.

        Args:
            columns: A list of column names to drop.

        Returns:
            A new `FlowDataEngine` instance without the dropped columns.
        """
        cols_for_select = tuple(set(self.columns) - set(columns))
        idx_to_keep = [self.cols_idx.get(c) for c in cols_for_select]
        new_schema = [self.schema[i] for i in idx_to_keep]

        return FlowDataEngine(
            self.data_frame.select(cols_for_select), number_of_records=self.number_of_records, schema=new_schema
        )

    def reorganize_order(self, column_order: list[str]) -> FlowDataEngine:
        """Reorganizes columns into a specified order.

        Args:
            column_order: A list of column names in the desired order.

        Returns:
            A new `FlowDataEngine` instance with the columns reordered.
        """
        df = self.data_frame.select(column_order)
        schema = sorted(self.schema, key=lambda x: column_order.index(x.column_name))
        return FlowDataEngine(df, schema=schema, number_of_records=self.number_of_records)

    def apply_flowfile_formula(
        self, func: str, col_name: str, output_data_type: pl.DataType = None
    ) -> FlowDataEngine:
        """Applies a formula to create a new column or transform an existing one.

        Args:
            func: A string containing a Polars expression formula.
            col_name: The name of the new or transformed column.
            output_data_type: The desired Polars data type for the output column.

        Returns:
            A new `FlowDataEngine` instance with the applied formula.
        """
        parsed_func = to_expr(func)
        if output_data_type is not None:
            df2 = self.data_frame.with_columns(parsed_func.cast(output_data_type).alias(col_name))
        else:
            df2 = self.data_frame.with_columns(parsed_func.alias(col_name))

        return FlowDataEngine(df2, number_of_records=self.number_of_records)

    def apply_sql_formula(self, func: str, col_name: str, output_data_type: pl.DataType = None) -> FlowDataEngine:
        """Applies an SQL-style formula using `pl.sql_expr`.

        Args:
            func: A string containing an SQL expression.
            col_name: The name of the new or transformed column.
            output_data_type: The desired Polars data type for the output column.

        Returns:
            A new `FlowDataEngine` instance with the applied formula.
        """
        expr = to_expr(func)
        if output_data_type not in (None, transform_schemas.AUTO_DATA_TYPE):
            df = self.data_frame.with_columns(expr.cast(output_data_type).alias(col_name))
        else:
            df = self.data_frame.with_columns(expr.alias(col_name))

        return FlowDataEngine(df, number_of_records=self.number_of_records)

    def output(
        self, output_fs: input_schema.OutputSettings, flow_id: int, node_id: int | str, execute_remote: bool = True
    ) -> FlowDataEngine:
        """Writes the DataFrame to an output file.

        Can execute the write operation locally or in a remote worker process.

        Args:
            output_fs: An `OutputSettings` object with details about the output file.
            flow_id: The flow ID for tracking.
            node_id: The node ID for tracking.
            execute_remote: If True, executes the write in a worker process.

        Returns:
            The same `FlowDataEngine` instance for chaining.
        """
        logger.info("Starting to write output")
        if execute_remote:
            status = utils.write_output(
                self.data_frame,
                data_type=output_fs.file_type,
                path=output_fs.abs_file_path,
                write_mode=output_fs.write_mode,
                sheet_name=output_fs.sheet_name,
                delimiter=output_fs.delimiter,
                flow_id=flow_id,
                node_id=node_id,
            )
            tracker = ExternalExecutorTracker(status)
            tracker.get_result()
            logger.info("Finished writing output")
        else:
            logger.info("Starting to write results locally")
            utils.local_write_output(
                self.data_frame,
                data_type=output_fs.file_type,
                path=output_fs.abs_file_path,
                write_mode=output_fs.write_mode,
                sheet_name=output_fs.sheet_name,
                delimiter=output_fs.delimiter,
                flow_id=flow_id,
                node_id=node_id,
            )
            logger.info("Finished writing output")
        return self

    def make_unique(self, unique_input: transform_schemas.UniqueInput = None) -> FlowDataEngine:
        """Gets the unique rows from the DataFrame.

        Args:
            unique_input: A `UniqueInput` object specifying a subset of columns
                to consider for uniqueness and a strategy for keeping rows.

        Returns:
            A new `FlowDataEngine` instance with unique rows.
        """
        if unique_input is None or unique_input.columns is None:
            return FlowDataEngine(self.data_frame.unique())
        return FlowDataEngine(self.data_frame.unique(unique_input.columns, keep=unique_input.strategy))

    def concat(self, other: Iterable[FlowDataEngine] | FlowDataEngine) -> FlowDataEngine:
        """Concatenates this DataFrame with one or more other DataFrames.

        Args:
            other: A single `FlowDataEngine` or an iterable of them.

        Returns:
            A new `FlowDataEngine` containing the concatenated data.
        """
        if isinstance(other, FlowDataEngine):
            other = [other]

        dfs: list[pl.LazyFrame] | list[pl.DataFrame] = [self.data_frame] + [flt.data_frame for flt in other]
        return FlowDataEngine(pl.concat(dfs, how="diagonal_relaxed"))

    def do_select(self, select_inputs: transform_schemas.SelectInputs, keep_missing: bool = True) -> FlowDataEngine:
        """Performs a complex column selection, renaming, and reordering operation.

        Args:
            select_inputs: A `SelectInputs` object defining the desired transformations.
            keep_missing: If True, columns not specified in `select_inputs` are kept.
                If False, they are dropped.

        Returns:
            A new `FlowDataEngine` with the transformed selection.
        """
        new_schema = deepcopy(self.schema)
        renames = [r for r in select_inputs.renames if r.is_available]
        if not keep_missing:
            drop_cols = set(self.data_frame.collect_schema().names()) - set(r.old_name for r in renames).union(
                set(r.old_name for r in renames if not r.keep)
            )
            keep_cols = []
        else:
            keep_cols = list(set(self.data_frame.collect_schema().names()) - set(r.old_name for r in renames))
            drop_cols = set(r.old_name for r in renames if not r.keep)

        if len(drop_cols) > 0:
            new_schema = [s for s in new_schema if s.name not in drop_cols]
        new_schema_mapping = {v.name: v for v in new_schema}

        available_renames = []
        for rename in renames:
            if (rename.new_name != rename.old_name or rename.new_name not in new_schema_mapping) and rename.keep:
                schema_entry = new_schema_mapping.get(rename.old_name)
                if schema_entry is not None:
                    available_renames.append(rename)
                    schema_entry.column_name = rename.new_name

        rename_dict = {r.old_name: r.new_name for r in available_renames}
        fl = self.select_columns(
            list_select=[col_to_keep.old_name for col_to_keep in renames if col_to_keep.keep] + keep_cols
        )
        fl = fl.change_column_types(transforms=[r for r in renames if r.keep])
        ndf = fl.data_frame.rename(rename_dict)
        renames.sort(key=lambda r: 0 if r.position is None else r.position)
        sorted_cols = utils.match_order(
            ndf.collect_schema().names(), [r.new_name for r in renames] + self.data_frame.collect_schema().names()
        )
        output_file = FlowDataEngine(ndf, number_of_records=self.number_of_records)
        return output_file.reorganize_order(sorted_cols)

    def set_streamable(self, streamable: bool = False):
        """Sets whether DataFrame operations should be streamable."""
        self._streamable = streamable

    def _calculate_schema(self) -> list[dict]:
        """Calculates schema statistics."""
        if self.external_source is not None:
            self.collect_external()
        v = utils.calculate_schema(self.data_frame)
        return v

    def calculate_schema(self):
        """Calculates and returns the schema."""
        self._calculate_schema_stats = True
        return self.schema

    def count(self) -> int:
        """Gets the total number of records."""
        return self.get_number_of_records()

    @classmethod
    def create_from_path_worker(cls, received_table: input_schema.ReceivedTable, flow_id: int, node_id: int | str):
        """Creates a FlowDataEngine from a path in a worker process."""
        received_table.set_absolute_filepath()

        external_fetcher = ExternalCreateFetcher(
            received_table=received_table, file_type=received_table.file_type, flow_id=flow_id, node_id=node_id
        )
        return cls(external_fetcher.get_result())
__name__ property

The name of the table.

cols_idx property

A dictionary mapping column names to their integer index.

data_frame property writable

The underlying Polars DataFrame or LazyFrame.

This property provides access to the Polars object that backs the FlowDataEngine. It handles lazy-loading from external sources if necessary.

Returns:

Type Description
LazyFrame | DataFrame | None

The active Polars DataFrame or LazyFrame.

external_source property

The external data source, if any.

has_errors property

Checks if there are any errors.

lazy property writable

Indicates if the DataFrame is in lazy mode.

number_of_fields property

The number of columns (fields) in the DataFrame.

Returns:

Type Description
int

The integer count of columns.

schema property

The schema of the DataFrame as a list of FlowfileColumn objects.

This property lazily calculates the schema if it hasn't been determined yet.

Returns:

Type Description
list[FlowfileColumn]

A list of FlowfileColumn objects describing the schema.

__call__()

Makes the class instance callable, returning itself.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __call__(self) -> FlowDataEngine:
    """Makes the class instance callable, returning itself."""
    return self
__get_sample__(n_rows=100, streamable=True)

Internal method to get a sample of the data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __get_sample__(self, n_rows: int = 100, streamable: bool = True) -> FlowDataEngine:
    """Internal method to get a sample of the data."""
    if not self.lazy:
        df = self.data_frame.lazy()
    else:
        df = self.data_frame

    if streamable:
        try:
            df = df.head(n_rows).collect()
        except Exception as e:
            logger.warning(f"Error in getting sample: {e}")
            df = df.head(n_rows).collect(engine="auto")
    else:
        df = self.collect()
    return FlowDataEngine(df, number_of_records=len(df), schema=self.schema)
__getitem__(item)

Accesses a specific column or item from the DataFrame.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __getitem__(self, item):
    """Accesses a specific column or item from the DataFrame."""
    return self.data_frame.select([item])
__init__(raw_data=None, path_ref=None, name=None, optimize_memory=True, schema=None, number_of_records=None, calculate_schema_stats=False, streamable=True, number_of_records_callback=None, data_callback=None)

Initializes the FlowDataEngine from various data sources.

Parameters:

Name Type Description Default
raw_data list[dict] | list[Any] | dict[str, Any] | ParquetFile | DataFrame | LazyFrame | RawData

The input data. Can be a list of dicts, a Polars DataFrame/LazyFrame, or a RawData schema object.

None
path_ref str

A string path to a Parquet file.

None
name str

An optional name for the data engine instance.

None
optimize_memory bool

If True, prefers lazy operations to conserve memory.

True
schema list[FlowfileColumn] | list[str] | Schema

An optional schema definition. Can be a list of FlowfileColumn objects, a list of column names, or a Polars Schema.

None
number_of_records int

The number of records, if known.

None
calculate_schema_stats bool

If True, computes detailed statistics for each column.

False
streamable bool

If True, allows for streaming operations when possible.

True
number_of_records_callback Callable

A callback function to retrieve the number of records.

None
data_callback Callable

A callback function to retrieve the data.

None
Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __init__(
    self,
    raw_data: list[dict] | list[Any] | dict[str, Any] | ParquetFile | pl.DataFrame | pl.LazyFrame | input_schema.RawData = None,
    path_ref: str = None,
    name: str = None,
    optimize_memory: bool = True,
    schema: list[FlowfileColumn] | list[str] | pl.Schema = None,
    number_of_records: int = None,
    calculate_schema_stats: bool = False,
    streamable: bool = True,
    number_of_records_callback: Callable = None,
    data_callback: Callable = None,
):
    """Initializes the FlowDataEngine from various data sources.

    Args:
        raw_data: The input data. Can be a list of dicts, a Polars DataFrame/LazyFrame,
            or a `RawData` schema object.
        path_ref: A string path to a Parquet file.
        name: An optional name for the data engine instance.
        optimize_memory: If True, prefers lazy operations to conserve memory.
        schema: An optional schema definition. Can be a list of `FlowfileColumn` objects,
            a list of column names, or a Polars `Schema`.
        number_of_records: The number of records, if known.
        calculate_schema_stats: If True, computes detailed statistics for each column.
        streamable: If True, allows for streaming operations when possible.
        number_of_records_callback: A callback function to retrieve the number of records.
        data_callback: A callback function to retrieve the data.
    """
    self._initialize_attributes(number_of_records_callback, data_callback, streamable)

    if raw_data is not None:
        self._handle_raw_data(raw_data, number_of_records, optimize_memory)
    elif path_ref:
        self._handle_path_ref(path_ref, optimize_memory)
    else:
        self.initialize_empty_fl()
    self._finalize_initialization(name, optimize_memory, schema, calculate_schema_stats)
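Example (a minimal sketch; the import path is inferred from the source location shown above and the sample data is made up):

import polars as pl
from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine  # assumed import path

# Build an engine from plain Python records
engine = FlowDataEngine(raw_data=[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])

# Or wrap an existing Polars LazyFrame, preferring lazy execution to conserve memory
lf = pl.LazyFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
lazy_engine = FlowDataEngine(lf, optimize_memory=True)
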
__len__()

Returns the number of records in the table.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __len__(self) -> int:
    """Returns the number of records in the table."""
    return self.number_of_records if self.number_of_records >= 0 else self.get_number_of_records()
__repr__()

Returns a string representation of the FlowDataEngine.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def __repr__(self) -> str:
    """Returns a string representation of the FlowDataEngine."""
    return f"flow data engine\n{self.data_frame.__repr__()}"
add_new_values(values, col_name=None)

Adds a new column with the provided values.

Parameters:

Name Type Description Default
values Iterable

An iterable (e.g., list, tuple) of values to add as a new column.

required
col_name str

The name for the new column. Defaults to 'new_values'.

None

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the added column.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def add_new_values(self, values: Iterable, col_name: str = None) -> FlowDataEngine:
    """Adds a new column with the provided values.

    Args:
        values: An iterable (e.g., list, tuple) of values to add as a new column.
        col_name: The name for the new column. Defaults to 'new_values'.

    Returns:
        A new `FlowDataEngine` instance with the added column.
    """
    if col_name is None:
        col_name = "new_values"
    return FlowDataEngine(self.data_frame.with_columns(pl.Series(values).alias(col_name)))
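Example (a minimal sketch using hypothetical data; imports as in the __init__ example above):

engine = FlowDataEngine(raw_data=[{"id": 1}, {"id": 2}, {"id": 3}])
flagged = engine.add_new_values([True, False, True], col_name="is_active")
# `flagged` is a new FlowDataEngine carrying the extra "is_active" column;
# the original `engine` is left untouched.
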
add_record_id(record_id_settings)

Adds a record ID (row number) column to the DataFrame.

Can generate a simple sequential ID or a grouped ID that resets for each group.

Parameters:

Name Type Description Default
record_id_settings RecordIdInput

A RecordIdInput object specifying the output column name, offset, and optional grouping columns.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the added record ID column.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def add_record_id(self, record_id_settings: transform_schemas.RecordIdInput) -> FlowDataEngine:
    """Adds a record ID (row number) column to the DataFrame.

    Can generate a simple sequential ID or a grouped ID that resets for
    each group.

    Args:
        record_id_settings: A `RecordIdInput` object specifying the output
            column name, offset, and optional grouping columns.

    Returns:
        A new `FlowDataEngine` instance with the added record ID column.
    """
    if record_id_settings.group_by and len(record_id_settings.group_by_columns) > 0:
        return self._add_grouped_record_id(record_id_settings)
    return self._add_simple_record_id(record_id_settings)
apply_flowfile_formula(func, col_name, output_data_type=None)

Applies a formula to create a new column or transform an existing one.

Parameters:

Name Type Description Default
func str

A string containing a Polars expression formula.

required
col_name str

The name of the new or transformed column.

required
output_data_type DataType

The desired Polars data type for the output column.

None

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the applied formula.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def apply_flowfile_formula(
    self, func: str, col_name: str, output_data_type: pl.DataType = None
) -> FlowDataEngine:
    """Applies a formula to create a new column or transform an existing one.

    Args:
        func: A string containing a Polars expression formula.
        col_name: The name of the new or transformed column.
        output_data_type: The desired Polars data type for the output column.

    Returns:
        A new `FlowDataEngine` instance with the applied formula.
    """
    parsed_func = to_expr(func)
    if output_data_type is not None:
        df2 = self.data_frame.with_columns(parsed_func.cast(output_data_type).alias(col_name))
    else:
        df2 = self.data_frame.with_columns(parsed_func.alias(col_name))

    return FlowDataEngine(df2, number_of_records=self.number_of_records)
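Example (a minimal sketch; the formula string is whatever `to_expr` accepts — a constant literal is used here, mirroring the internal "__temp__" column trick in do_pivot):

import polars as pl

engine = FlowDataEngine(raw_data=[{"id": 1}, {"id": 2}])
with_constant = engine.apply_flowfile_formula("1", col_name="constant_one", output_data_type=pl.Int64)
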
apply_sql_formula(func, col_name, output_data_type=None)

Applies an SQL-style formula using pl.sql_expr.

Parameters:

Name Type Description Default
func str

A string containing an SQL expression.

required
col_name str

The name of the new or transformed column.

required
output_data_type DataType

The desired Polars data type for the output column.

None

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the applied formula.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def apply_sql_formula(self, func: str, col_name: str, output_data_type: pl.DataType = None) -> FlowDataEngine:
    """Applies an SQL-style formula using `pl.sql_expr`.

    Args:
        func: A string containing an SQL expression.
        col_name: The name of the new or transformed column.
        output_data_type: The desired Polars data type for the output column.

    Returns:
        A new `FlowDataEngine` instance with the applied formula.
    """
    expr = to_expr(func)
    if output_data_type not in (None, transform_schemas.AUTO_DATA_TYPE):
        df = self.data_frame.with_columns(expr.cast(output_data_type).alias(col_name))
    else:
        df = self.data_frame.with_columns(expr.alias(col_name))

    return FlowDataEngine(df, number_of_records=self.number_of_records)
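Example (a minimal sketch; it assumes `to_expr` accepts an SQL-style column expression, as suggested by the `pl.sql_expr` reference in the docstring — the column names and tax rate are hypothetical):

import polars as pl

engine = FlowDataEngine(raw_data=[{"price": 10.0}, {"price": 20.0}])
taxed = engine.apply_sql_formula("price * 1.21", col_name="price_incl_vat", output_data_type=pl.Float64)
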
assert_equal(other, ordered=True, strict_schema=False)

Asserts that this DataFrame is equal to another.

Useful for testing.

Parameters:

Name Type Description Default
other FlowDataEngine

The other FlowDataEngine to compare with.

required
ordered bool

If True, the row order must be identical.

True
strict_schema bool

If True, the data types of the schemas must be identical.

False

Raises:

Type Description
Exception

If the DataFrames are not equal based on the specified criteria.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def assert_equal(self, other: FlowDataEngine, ordered: bool = True, strict_schema: bool = False):
    """Asserts that this DataFrame is equal to another.

    Useful for testing.

    Args:
        other: The other `FlowDataEngine` to compare with.
        ordered: If True, the row order must be identical.
        strict_schema: If True, the data types of the schemas must be identical.

    Raises:
        Exception: If the DataFrames are not equal based on the specified criteria.
    """
    org_laziness = self.lazy, other.lazy
    self.lazy = False
    other.lazy = False
    self.number_of_records = -1
    other.number_of_records = -1
    other = other.select_columns(self.columns)

    if self.get_number_of_records_in_process() != other.get_number_of_records_in_process():
        raise Exception("Number of records is not equal")

    if self.columns != other.columns:
        raise Exception("Schema is not equal")

    if strict_schema:
        assert self.data_frame.schema == other.data_frame.schema, "Data types do not match"

    if ordered:
        self_lf = self.data_frame.sort(by=self.columns)
        other_lf = other.data_frame.sort(by=other.columns)
    else:
        self_lf = self.data_frame
        other_lf = other.data_frame

    self.lazy, other.lazy = org_laziness
    assert self_lf.equals(other_lf), "Data is not equal"
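Example (a minimal sketch for use in tests; data is hypothetical):

a = FlowDataEngine(raw_data=[{"id": 1, "v": 10}, {"id": 2, "v": 20}])
b = FlowDataEngine(raw_data=[{"id": 1, "v": 10}, {"id": 2, "v": 20}])
a.assert_equal(b)                      # passes silently
a.assert_equal(b, strict_schema=True)  # additionally requires identical dtypes
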
cache()

Caches the current DataFrame to disk and updates the internal reference.

This triggers a background process to write the current LazyFrame's result to a temporary file. Subsequent operations on this FlowDataEngine instance will read from the cached file, which can speed up downstream computations.

Returns:

Type Description
FlowDataEngine

The same FlowDataEngine instance, now backed by the cached data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def cache(self) -> FlowDataEngine:
    """Caches the current DataFrame to disk and updates the internal reference.

    This triggers a background process to write the current LazyFrame's result
    to a temporary file. Subsequent operations on this `FlowDataEngine` instance
    will read from the cached file, which can speed up downstream computations.

    Returns:
        The same `FlowDataEngine` instance, now backed by the cached data.
    """
    edf = ExternalDfFetcher(
        lf=self.data_frame, file_ref=str(id(self)), wait_on_completion=False, flow_id=-1, node_id=-1
    )
    logger.info("Caching data in background")
    result = edf.get_result()
    if isinstance(result, pl.LazyFrame):
        logger.info("Data cached")
        del self._data_frame
        self.data_frame = result
        logger.info("Data loaded from cache")
    return self
calculate_schema()

Calculates and returns the schema.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def calculate_schema(self):
    """Calculates and returns the schema."""
    self._calculate_schema_stats = True
    return self.schema
change_column_types(transforms, calculate_schema=False)

Changes the data type of one or more columns.

Parameters:

Name Type Description Default
transforms list[SelectInput]

A list of SelectInput objects, where each object specifies the column and its new polars_type.

required
calculate_schema bool

If True, recalculates the schema after the type change.

False

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the updated column types.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def change_column_types(
    self, transforms: list[transform_schemas.SelectInput], calculate_schema: bool = False
) -> FlowDataEngine:
    """Changes the data type of one or more columns.

    Args:
        transforms: A list of `SelectInput` objects, where each object specifies
            the column and its new `polars_type`.
        calculate_schema: If True, recalculates the schema after the type change.

    Returns:
        A new `FlowDataEngine` instance with the updated column types.
    """
    dtypes = [dtype.base_type() for dtype in self.data_frame.collect_schema().dtypes()]
    idx_mapping = list(
        (transform.old_name, self.cols_idx.get(transform.old_name), get_polars_type(transform.polars_type))
        for transform in transforms
        if transform.data_type is not None
    )

    actual_transforms = [c for c in idx_mapping if c[2] != dtypes[c[1]]]
    transformations = [
        utils.define_pl_col_transformation(col_name=transform[0], col_type=transform[2])
        for transform in actual_transforms
    ]

    df = self.data_frame.with_columns(transformations)
    return FlowDataEngine(
        df,
        number_of_records=self.number_of_records,
        calculate_schema_stats=calculate_schema,
        streamable=self._streamable,
    )
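Example (a minimal sketch; the import path for `transform_schemas` and the "Float64" type string are assumptions — only the `SelectInput(old_name=..., data_type=...)` shape is taken from `get_select_inputs`):

from flowfile_core.schemas import transform_schemas  # assumed import path

engine = FlowDataEngine(raw_data=[{"id": "1", "amount": "2.50"}])
typed = engine.change_column_types(
    [transform_schemas.SelectInput(old_name="amount", data_type="Float64")]
)
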
collect(n_records=None)

Collects the data and returns it as a Polars DataFrame.

This method triggers the execution of the lazy query plan (if applicable) and returns the result. It supports streaming to optimize memory usage for large datasets.

Parameters:

Name Type Description Default
n_records int

The maximum number of records to collect. If None, all records are collected.

None

Returns:

Type Description
DataFrame

A Polars DataFrame containing the collected data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def collect(self, n_records: int = None) -> pl.DataFrame:
    """Collects the data and returns it as a Polars DataFrame.

    This method triggers the execution of the lazy query plan (if applicable)
    and returns the result. It supports streaming to optimize memory usage
    for large datasets.

    Args:
        n_records: The maximum number of records to collect. If None, all
            records are collected.

    Returns:
        A Polars `DataFrame` containing the collected data.
    """
    if n_records is None:
        logger.info(f'Fetching all data for Table object "{id(self)}". Settings: streaming={self._streamable}')
    else:
        logger.info(
            f'Fetching {n_records} record(s) for Table object "{id(self)}". '
            f"Settings: streaming={self._streamable}"
        )

    if not self.lazy:
        return self.data_frame

    try:
        return self._collect_data(n_records)
    except Exception as e:
        self.errors = [e]
        return self._handle_collection_error(n_records)
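Example (a minimal sketch; data is hypothetical):

engine = FlowDataEngine(raw_data=[{"id": i} for i in range(1_000)])
df_head = engine.collect(n_records=10)   # Polars DataFrame with at most 10 rows
df_full = engine.collect()               # collects the full result
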
collect_external()

Materializes data from a tracked external source.

If the FlowDataEngine was created from an ExternalDataSource, this method will trigger the data retrieval, update the internal _data_frame to a LazyFrame of the collected data, and reset the schema to be re-evaluated.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def collect_external(self):
    """Materializes data from a tracked external source.

    If the `FlowDataEngine` was created from an `ExternalDataSource`, this
    method will trigger the data retrieval, update the internal `_data_frame`
    to a `LazyFrame` of the collected data, and reset the schema to be
    re-evaluated.
    """
    if self._external_source is not None:
        logger.info("Collecting external source")
        if self.external_source.get_pl_df() is not None:
            self.data_frame = self.external_source.get_pl_df().lazy()
        else:
            self.data_frame = pl.LazyFrame(list(self.external_source.get_iter()))
        self._schema = None  # enforce reset schema
concat(other)

Concatenates this DataFrame with one or more other DataFrames.

Parameters:

Name Type Description Default
other Iterable[FlowDataEngine] | FlowDataEngine

A single FlowDataEngine or an iterable of them.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine containing the concatenated data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def concat(self, other: Iterable[FlowDataEngine] | FlowDataEngine) -> FlowDataEngine:
    """Concatenates this DataFrame with one or more other DataFrames.

    Args:
        other: A single `FlowDataEngine` or an iterable of them.

    Returns:
        A new `FlowDataEngine` containing the concatenated data.
    """
    if isinstance(other, FlowDataEngine):
        other = [other]

    dfs: list[pl.LazyFrame] | list[pl.DataFrame] = [self.data_frame] + [flt.data_frame for flt in other]
    return FlowDataEngine(pl.concat(dfs, how="diagonal_relaxed"))
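Example (a minimal sketch; data is hypothetical):

first = FlowDataEngine(raw_data=[{"id": 1, "a": "x"}])
second = FlowDataEngine(raw_data=[{"id": 2, "b": "y"}])
combined = first.concat(second)
# The "diagonal_relaxed" strategy keeps the union of columns and
# fills values that are missing on either side with nulls.
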
count()

Gets the total number of records.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def count(self) -> int:
    """Gets the total number of records."""
    return self.get_number_of_records()
create_from_external_source(external_source) classmethod

Creates a FlowDataEngine from an external data source.

Parameters:

Name Type Description Default
external_source ExternalDataSource

An object that conforms to the ExternalDataSource interface.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_from_external_source(cls, external_source: ExternalDataSource) -> FlowDataEngine:
    """Creates a FlowDataEngine from an external data source.

    Args:
        external_source: An object that conforms to the `ExternalDataSource`
            interface.

    Returns:
        A new `FlowDataEngine` instance.
    """
    if external_source.schema is not None:
        ff = cls.create_from_schema(external_source.schema)
    elif external_source.initial_data_getter is not None:
        ff = cls(raw_data=external_source.initial_data_getter())
    else:
        ff = cls()
    ff._external_source = external_source
    return ff
create_from_path(received_table) classmethod

Creates a FlowDataEngine from a local file path.

Supports various file types like CSV, Parquet, and Excel.

Parameters:

Name Type Description Default
received_table ReceivedTable

A ReceivedTableBase object containing the file path and format details.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with data from the file.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_from_path(cls, received_table: input_schema.ReceivedTable) -> FlowDataEngine:
    """Creates a FlowDataEngine from a local file path.

    Supports various file types like CSV, Parquet, and Excel.

    Args:
        received_table: A `ReceivedTableBase` object containing the file path
            and format details.

    Returns:
        A new `FlowDataEngine` instance with data from the file.
    """
    received_table.set_absolute_filepath()
    file_type_handlers = {
        "csv": create_funcs.create_from_path_csv,
        "parquet": create_funcs.create_from_path_parquet,
        "excel": create_funcs.create_from_path_excel,
    }

    handler = file_type_handlers.get(received_table.file_type)
    if not handler:
        raise Exception(f"Cannot create from {received_table.file_type}")

    flow_file = cls(handler(received_table))
    flow_file._org_path = received_table.abs_file_path
    return flow_file
create_from_path_worker(received_table, flow_id, node_id) classmethod

Creates a FlowDataEngine from a path in a worker process.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_from_path_worker(cls, received_table: input_schema.ReceivedTable, flow_id: int, node_id: int | str):
    """Creates a FlowDataEngine from a path in a worker process."""
    received_table.set_absolute_filepath()

    external_fetcher = ExternalCreateFetcher(
        received_table=received_table, file_type=received_table.file_type, flow_id=flow_id, node_id=node_id
    )
    return cls(external_fetcher.get_result())
create_from_schema(schema) classmethod

Creates an empty FlowDataEngine from a schema definition.

Parameters:

Name Type Description Default
schema list[FlowfileColumn]

A list of FlowfileColumn objects defining the schema.

required

Returns:

Type Description
FlowDataEngine

A new, empty FlowDataEngine instance with the specified schema.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_from_schema(cls, schema: list[FlowfileColumn]) -> FlowDataEngine:
    """Creates an empty FlowDataEngine from a schema definition.

    Args:
        schema: A list of `FlowfileColumn` objects defining the schema.

    Returns:
        A new, empty `FlowDataEngine` instance with the specified schema.
    """
    pl_schema = []
    for i, flow_file_column in enumerate(schema):
        pl_schema.append((flow_file_column.name, cast_str_to_polars_type(flow_file_column.data_type)))
        schema[i].col_index = i
    df = pl.LazyFrame(schema=pl_schema)
    return cls(df, schema=schema, calculate_schema_stats=False, number_of_records=0)
create_from_sql(sql, conn) classmethod

Creates a FlowDataEngine by executing a SQL query.

Parameters:

Name Type Description Default
sql str

The SQL query string to execute.

required
conn Any

A database connection object or connection URI string.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the query result.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_from_sql(cls, sql: str, conn: Any) -> FlowDataEngine:
    """Creates a FlowDataEngine by executing a SQL query.

    Args:
        sql: The SQL query string to execute.
        conn: A database connection object or connection URI string.

    Returns:
        A new `FlowDataEngine` instance with the query result.
    """
    return cls(pl.read_sql(sql, conn))
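Example (a minimal sketch; the query and connection URI are hypothetical, and the underlying `pl.read_sql` call needs a compatible SQL driver such as connectorx to be installed):

engine = FlowDataEngine.create_from_sql(
    "SELECT id, amount FROM orders",
    "postgresql://user:password@localhost:5432/shop",
)
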
create_random(number_of_records=1000) classmethod

Creates a FlowDataEngine with randomly generated data.

Useful for testing and examples.

Parameters:

Name Type Description Default
number_of_records int

The number of random records to generate.

1000

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with fake data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def create_random(cls, number_of_records: int = 1000) -> FlowDataEngine:
    """Creates a FlowDataEngine with randomly generated data.

    Useful for testing and examples.

    Args:
        number_of_records: The number of random records to generate.

    Returns:
        A new `FlowDataEngine` instance with fake data.
    """
    return cls(create_fake_data(number_of_records))
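Example (a minimal sketch):

sample = FlowDataEngine.create_random(100)
print(sample.count())  # -> 100
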
do_cross_join(cross_join_input, auto_generate_selection, verify_integrity, other)

Performs a cross join with another DataFrame.

A cross join produces the Cartesian product of the two DataFrames.

Parameters:

Name Type Description Default
cross_join_input CrossJoinInput

A CrossJoinInput object specifying column selections.

required
auto_generate_selection bool

If True, automatically renames columns to avoid conflicts.

required
verify_integrity bool

If True, checks if the resulting join would be too large.

required
other FlowDataEngine

The right FlowDataEngine to join with.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine with the result of the cross join.

Raises:

Type Description
Exception

If verify_integrity is True and the join would result in an excessively large number of records.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_cross_join(
    self,
    cross_join_input: transform_schemas.CrossJoinInput,
    auto_generate_selection: bool,
    verify_integrity: bool,
    other: FlowDataEngine,
) -> FlowDataEngine:
    """Performs a cross join with another DataFrame.

    A cross join produces the Cartesian product of the two DataFrames.

    Args:
        cross_join_input: A `CrossJoinInput` object specifying column selections.
        auto_generate_selection: If True, automatically renames columns to avoid conflicts.
        verify_integrity: If True, checks if the resulting join would be too large.
        other: The right `FlowDataEngine` to join with.

    Returns:
        A new `FlowDataEngine` with the result of the cross join.

    Raises:
        Exception: If `verify_integrity` is True and the join would result in
            an excessively large number of records.
    """
    self.lazy = True
    other.lazy = True
    cross_join_input_manager = transform_schemas.CrossJoinInputManager(cross_join_input)
    verify_join_select_integrity(
        cross_join_input_manager.input, left_columns=self.columns, right_columns=other.columns
    )
    right_select = [
        v.old_name
        for v in cross_join_input_manager.right_select.renames
        if (v.keep or v.join_key) and v.is_available
    ]
    left_select = [
        v.old_name
        for v in cross_join_input_manager.left_select.renames
        if (v.keep or v.join_key) and v.is_available
    ]
    cross_join_input_manager.auto_rename(rename_mode="suffix")
    left = self.data_frame.select(left_select).rename(cross_join_input_manager.left_select.rename_table)
    right = other.data_frame.select(right_select).rename(cross_join_input_manager.right_select.rename_table)

    joined_df = left.join(right, how="cross")

    cols_to_delete_after = [
        col.new_name
        for col in cross_join_input_manager.left_select.renames + cross_join_input_manager.right_select.renames
        if col.join_key and not col.keep and col.is_available
    ]

    fl = FlowDataEngine(joined_df.drop(cols_to_delete_after), calculate_schema_stats=False, streamable=False)
    return fl
do_filter(predicate)

Filters rows based on a predicate expression.

Parameters:

Name Type Description Default
predicate str

A string containing a Polars expression that evaluates to a boolean value.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance containing only the rows that match

FlowDataEngine

the predicate.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_filter(self, predicate: str) -> FlowDataEngine:
    """Filters rows based on a predicate expression.

    Args:
        predicate: A string containing a Polars expression that evaluates to
            a boolean value.

    Returns:
        A new `FlowDataEngine` instance containing only the rows that match
        the predicate.
    """
    try:
        f = to_expr(predicate)
    except Exception as e:
        logger.warning(f"Error in filter expression: {e}")
        f = to_expr("False")
    df = self.data_frame.filter(f)
    return FlowDataEngine(df, schema=self.schema, streamable=self._streamable)
do_group_by(group_by_input, calculate_schema_stats=True)

Performs a group-by operation on the DataFrame.

Parameters:

Name Type Description Default
group_by_input GroupByInput

A GroupByInput object defining the grouping columns and aggregations.

required
calculate_schema_stats bool

If True, calculates schema statistics for the resulting DataFrame.

True

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the grouped and aggregated data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_group_by(
    self, group_by_input: transform_schemas.GroupByInput, calculate_schema_stats: bool = True
) -> FlowDataEngine:
    """Performs a group-by operation on the DataFrame.

    Args:
        group_by_input: A `GroupByInput` object defining the grouping columns
            and aggregations.
        calculate_schema_stats: If True, calculates schema statistics for the
            resulting DataFrame.

    Returns:
        A new `FlowDataEngine` instance with the grouped and aggregated data.
    """
    aggregations = [c for c in group_by_input.agg_cols if c.agg != "groupby"]
    group_columns = [c for c in group_by_input.agg_cols if c.agg == "groupby"]

    if len(group_columns) == 0:
        return FlowDataEngine(
            self.data_frame.select(ac.agg_func(ac.old_name).alias(ac.new_name) for ac in aggregations),
            calculate_schema_stats=calculate_schema_stats,
        )

    df = self.data_frame.rename({c.old_name: c.new_name for c in group_columns})
    group_by_columns = [n_c.new_name for n_c in group_columns]

    # Handle case where there are no aggregations - just get unique combinations of group columns
    if len(aggregations) == 0:
        return FlowDataEngine(
            df.select(group_by_columns).unique(),
            calculate_schema_stats=calculate_schema_stats,
        )

    grouped_df = df.group_by(*group_by_columns)
    agg_exprs = [ac.agg_func(ac.old_name).alias(ac.new_name) for ac in aggregations]
    result_df = grouped_df.agg(agg_exprs)

    return FlowDataEngine(
        result_df,
        calculate_schema_stats=calculate_schema_stats,
    )
do_pivot(pivot_input, node_logger=None)

Converts the DataFrame from a long to a wide format, aggregating values.

Parameters:

Name Type Description Default
pivot_input PivotInput

A PivotInput object defining the index, pivot, and value columns, along with the aggregation logic.

required
node_logger NodeLogger

An optional logger for reporting warnings, e.g., if the pivot column has too many unique values.

None

Returns:

Type Description
FlowDataEngine

A new, pivoted FlowDataEngine instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_pivot(self, pivot_input: transform_schemas.PivotInput, node_logger: NodeLogger = None) -> FlowDataEngine:
    """Converts the DataFrame from a long to a wide format, aggregating values.

    Args:
        pivot_input: A `PivotInput` object defining the index, pivot, and value
            columns, along with the aggregation logic.
        node_logger: An optional logger for reporting warnings, e.g., if the
            pivot column has too many unique values.

    Returns:
        A new, pivoted `FlowDataEngine` instance.
    """
    # Get unique values for pivot columns
    max_unique_vals = 200
    new_cols_unique = fetch_unique_values(
        self.data_frame.select(pivot_input.pivot_column)
        .unique()
        .sort(pivot_input.pivot_column)
        .limit(max_unique_vals)
        .cast(pl.String)
    )
    if len(new_cols_unique) >= max_unique_vals:
        if node_logger:
            node_logger.warning(
                "Pivot column has too many unique values. Please consider using a different column."
                f" Max unique values: {max_unique_vals}"
            )

    if len(pivot_input.index_columns) == 0:
        no_index_cols = True
        pivot_input.index_columns = ["__temp__"]
        ff = self.apply_flowfile_formula("1", col_name="__temp__")
    else:
        no_index_cols = False
        ff = self

    # Perform pivot operations
    index_columns = pivot_input.get_index_columns()
    grouped_ff = ff.do_group_by(pivot_input.get_group_by_input(), False)
    pivot_column = pivot_input.get_pivot_column()

    input_df = grouped_ff.data_frame.with_columns(pivot_column.cast(pl.String).alias(pivot_input.pivot_column))
    number_of_aggregations = len(pivot_input.aggregations)
    df = (
        input_df.select(*index_columns, pivot_column, pivot_input.get_values_expr())
        .group_by(*index_columns)
        .agg(
            [
                (pl.col("vals").filter(pivot_column == new_col_value)).first().alias(new_col_value)
                for new_col_value in new_cols_unique
            ]
        )
        .select(
            *index_columns,
            *[
                pl.col(new_col)
                .struct.field(agg)
                .alias(f'{new_col + "_" + agg if number_of_aggregations > 1 else new_col }')
                for new_col in new_cols_unique
                for agg in pivot_input.aggregations
            ],
        )
    )

    # Clean up temporary columns if needed
    if no_index_cols:
        df = df.drop("__temp__")
        pivot_input.index_columns = []

    return FlowDataEngine(df, calculate_schema_stats=False)
do_select(select_inputs, keep_missing=True)

Performs a complex column selection, renaming, and reordering operation.

Parameters:

Name Type Description Default
select_inputs SelectInputs

A SelectInputs object defining the desired transformations.

required
keep_missing bool

If True, columns not specified in select_inputs are kept. If False, they are dropped.

True

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine with the transformed selection.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_select(self, select_inputs: transform_schemas.SelectInputs, keep_missing: bool = True) -> FlowDataEngine:
    """Performs a complex column selection, renaming, and reordering operation.

    Args:
        select_inputs: A `SelectInputs` object defining the desired transformations.
        keep_missing: If True, columns not specified in `select_inputs` are kept.
            If False, they are dropped.

    Returns:
        A new `FlowDataEngine` with the transformed selection.
    """
    new_schema = deepcopy(self.schema)
    renames = [r for r in select_inputs.renames if r.is_available]
    if not keep_missing:
        drop_cols = set(self.data_frame.collect_schema().names()) - set(r.old_name for r in renames).union(
            set(r.old_name for r in renames if not r.keep)
        )
        keep_cols = []
    else:
        keep_cols = list(set(self.data_frame.collect_schema().names()) - set(r.old_name for r in renames))
        drop_cols = set(r.old_name for r in renames if not r.keep)

    if len(drop_cols) > 0:
        new_schema = [s for s in new_schema if s.name not in drop_cols]
    new_schema_mapping = {v.name: v for v in new_schema}

    available_renames = []
    for rename in renames:
        if (rename.new_name != rename.old_name or rename.new_name not in new_schema_mapping) and rename.keep:
            schema_entry = new_schema_mapping.get(rename.old_name)
            if schema_entry is not None:
                available_renames.append(rename)
                schema_entry.column_name = rename.new_name

    rename_dict = {r.old_name: r.new_name for r in available_renames}
    fl = self.select_columns(
        list_select=[col_to_keep.old_name for col_to_keep in renames if col_to_keep.keep] + keep_cols
    )
    fl = fl.change_column_types(transforms=[r for r in renames if r.keep])
    ndf = fl.data_frame.rename(rename_dict)
    renames.sort(key=lambda r: 0 if r.position is None else r.position)
    sorted_cols = utils.match_order(
        ndf.collect_schema().names(), [r.new_name for r in renames] + self.data_frame.collect_schema().names()
    )
    output_file = FlowDataEngine(ndf, number_of_records=self.number_of_records)
    return output_file.reorganize_order(sorted_cols)
do_sort(sorts)

Sorts the DataFrame by one or more columns.

Parameters:

Name Type Description Default
sorts list[SortByInput]

A list of SortByInput objects, each specifying a column and sort direction ('asc' or 'desc').

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the sorted data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def do_sort(self, sorts: list[transform_schemas.SortByInput]) -> FlowDataEngine:
    """Sorts the DataFrame by one or more columns.

    Args:
        sorts: A list of `SortByInput` objects, each specifying a column
            and sort direction ('asc' or 'desc').

    Returns:
        A new `FlowDataEngine` instance with the sorted data.
    """
    if not sorts:
        return self

    descending = [s.how == "desc" or s.how.lower() == "descending" for s in sorts]
    df = self.data_frame.sort([sort_by.column for sort_by in sorts], descending=descending)
    return FlowDataEngine(df, number_of_records=self.number_of_records, schema=self.schema)
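Example (an illustrative sketch, not generated from the source; it assumes transform_schemas is importable from flowfile_core.schemas.transform_schemas and that SortByInput accepts column and how keyword arguments, matching the attributes read above):

import polars as pl

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine
from flowfile_core.schemas import transform_schemas  # assumed import path

# Sort by city ascending, then amount descending.
fde = FlowDataEngine(pl.LazyFrame({"city": ["b", "a", "c"], "amount": [2, 9, 5]}))
sorted_fde = fde.do_sort([
    transform_schemas.SortByInput(column="city", how="asc"),
    transform_schemas.SortByInput(column="amount", how="desc"),
])
print(sorted_fde.to_pylist())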
drop_columns(columns)

Drops specified columns from the DataFrame.

Parameters:

Name Type Description Default
columns list[str]

A list of column names to drop.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance without the dropped columns.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def drop_columns(self, columns: list[str]) -> FlowDataEngine:
    """Drops specified columns from the DataFrame.

    Args:
        columns: A list of column names to drop.

    Returns:
        A new `FlowDataEngine` instance without the dropped columns.
    """
    cols_for_select = tuple(set(self.columns) - set(columns))
    idx_to_keep = [self.cols_idx.get(c) for c in cols_for_select]
    new_schema = [self.schema[i] for i in idx_to_keep]

    return FlowDataEngine(
        self.data_frame.select(cols_for_select), number_of_records=self.number_of_records, schema=new_schema
    )
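Example (illustrative sketch; note the implementation builds the keep-list from a set, so the surviving column order is not guaranteed):

import polars as pl

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.DataFrame({"id": [1, 2], "name": ["a", "b"], "tmp": [0, 0]}))
slimmed = fde.drop_columns(["tmp"])
print(slimmed.columns)  # ['id', 'name'] in some order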
from_cloud_storage_obj(settings) classmethod

Creates a FlowDataEngine from an object in cloud storage.

This method supports reading from various cloud storage providers like AWS S3, Azure Data Lake Storage, and Google Cloud Storage, with support for various authentication methods.

Parameters:

Name Type Description Default
settings CloudStorageReadSettingsInternal

A CloudStorageReadSettingsInternal object containing connection details, file format, and read options.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance containing the data from cloud storage.

Raises:

Type Description
ValueError

If the storage type or file format is not supported.

NotImplementedError

If a requested file format like "delta" or "iceberg" is not yet implemented.

Exception

If reading from cloud storage fails.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def from_cloud_storage_obj(
    cls, settings: cloud_storage_schemas.CloudStorageReadSettingsInternal
) -> FlowDataEngine:
    """Creates a FlowDataEngine from an object in cloud storage.

    This method supports reading from various cloud storage providers like AWS S3,
    Azure Data Lake Storage, and Google Cloud Storage, with support for
    various authentication methods.

    Args:
        settings: A `CloudStorageReadSettingsInternal` object containing connection
            details, file format, and read options.

    Returns:
        A new `FlowDataEngine` instance containing the data from cloud storage.

    Raises:
        ValueError: If the storage type or file format is not supported.
        NotImplementedError: If a requested file format like "delta" or "iceberg"
            is not yet implemented.
        Exception: If reading from cloud storage fails.
    """
    connection = settings.connection
    read_settings = settings.read_settings

    logger.info(f"Reading from {connection.storage_type} storage: {read_settings.resource_path}")
    # Get storage options based on connection type
    storage_options = CloudStorageReader.get_storage_options(connection)
    # Get credential provider if needed
    credential_provider = CloudStorageReader.get_credential_provider(connection)
    if read_settings.file_format == "parquet":
        return cls._read_parquet_from_cloud(
            read_settings.resource_path,
            storage_options,
            credential_provider,
            read_settings.scan_mode == "directory",
        )
    elif read_settings.file_format == "delta":
        return cls._read_delta_from_cloud(
            read_settings.resource_path, storage_options, credential_provider, read_settings
        )
    elif read_settings.file_format == "csv":
        return cls._read_csv_from_cloud(
            read_settings.resource_path, storage_options, credential_provider, read_settings
        )
    elif read_settings.file_format == "json":
        return cls._read_json_from_cloud(
            read_settings.resource_path,
            storage_options,
            credential_provider,
            read_settings.scan_mode == "directory",
        )
    elif read_settings.file_format == "iceberg":
        return cls._read_iceberg_from_cloud(
            read_settings.resource_path, storage_options, credential_provider, read_settings
        )

    elif read_settings.file_format in ["delta", "iceberg"]:
        # These would require additional libraries
        raise NotImplementedError(f"File format {read_settings.file_format} not yet implemented")
    else:
        raise ValueError(f"Unsupported file format: {read_settings.file_format}")
generate_enumerator(length=1000, output_name='output_column') classmethod

Generates a FlowDataEngine with a single column containing a sequence of integers.

Parameters:

Name Type Description Default
length int

The number of integers to generate in the sequence.

1000
output_name str

The name of the output column.

'output_column'

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
@classmethod
def generate_enumerator(cls, length: int = 1000, output_name: str = "output_column") -> FlowDataEngine:
    """Generates a FlowDataEngine with a single column containing a sequence of integers.

    Args:
        length: The number of integers to generate in the sequence.
        output_name: The name of the output column.

    Returns:
        A new `FlowDataEngine` instance.
    """
    if length > 10_000_000:
        length = 10_000_000
    return cls(pl.LazyFrame().select((pl.int_range(0, length, dtype=pl.UInt32)).alias(output_name)))
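Example (illustrative sketch):

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

# Five sequential integers in a column named "row_id"; lengths above
# 10_000_000 are capped by the implementation shown above.
ids = FlowDataEngine.generate_enumerator(length=5, output_name="row_id")
print(ids.to_pylist())  # [{'row_id': 0}, ..., {'row_id': 4}]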
get_estimated_file_size()

Estimates the file size in bytes if the data originated from a local file.

This relies on the original path being tracked during file ingestion.

Returns:

Type Description
int

The file size in bytes, or 0 if the original path is unknown.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_estimated_file_size(self) -> int:
    """Estimates the file size in bytes if the data originated from a local file.

    This relies on the original path being tracked during file ingestion.

    Returns:
        The file size in bytes, or 0 if the original path is unknown.
    """
    if self._org_path is not None:
        return os.path.getsize(self._org_path)
    return 0
get_number_of_records(warn=False, force_calculate=False, calculate_in_worker_process=False)

Gets the total number of records in the DataFrame.

For lazy frames, this may trigger a full data scan, which can be expensive.

Parameters:

Name Type Description Default
warn bool

If True, logs a warning if a potentially expensive calculation is triggered.

False
force_calculate bool

If True, forces recalculation even if a value is cached.

False
calculate_in_worker_process bool

If True, offloads the calculation to a worker process.

False

Returns:

Type Description
int

The total number of records.

Raises:

Type Description
ValueError

If the number of records could not be determined.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_number_of_records(
    self, warn: bool = False, force_calculate: bool = False, calculate_in_worker_process: bool = False
) -> int:
    """Gets the total number of records in the DataFrame.

    For lazy frames, this may trigger a full data scan, which can be expensive.

    Args:
        warn: If True, logs a warning if a potentially expensive calculation is triggered.
        force_calculate: If True, forces recalculation even if a value is cached.
        calculate_in_worker_process: If True, offloads the calculation to a worker process.

    Returns:
        The total number of records.

    Raises:
        ValueError: If the number of records could not be determined.
    """
    if self.is_future and not self.is_collected:
        return -1
    if self.number_of_records is None or self.number_of_records < 0 or force_calculate:
        if self._number_of_records_callback is not None:
            self._number_of_records_callback(self)

        if self.lazy:
            if calculate_in_worker_process:
                try:
                    self.number_of_records = self._calculate_number_of_records_in_worker()
                    return self.number_of_records
                except Exception as e:
                    logger.error(f"Error: {e}")
            if warn:
                logger.warning("Calculating the number of records; this can be expensive on a lazy frame")
            try:
                self.number_of_records = self.data_frame.select(pl.len()).collect(
                    engine="streaming" if self._streamable else "auto"
                )[0, 0]
            except Exception:
                raise ValueError("Could not get number of records")
        else:
            self.number_of_records = self.data_frame.__len__()
    return self.number_of_records
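Example (illustrative sketch; warn only logs a message, it does not prevent the scan):

import polars as pl

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.LazyFrame({"x": range(1_000)}))
n = fde.get_number_of_records(warn=True)  # may trigger a full scan on lazy data
print(n)  # 1000; the result is cached on the instance afterwards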
get_number_of_records_in_process(force_calculate=False)

Get the number of records in the DataFrame in the local process.

Parameters:

Name Type Description Default
force_calculate bool

If True, forces recalculation even if a value is cached.

False

Returns:

Type Description

The total number of records.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_number_of_records_in_process(self, force_calculate: bool = False):
    """
    Get the number of records in the DataFrame in the local process.

    Args:
        force_calculate: If True, forces recalculation even if a value is cached.

    Returns:
        The total number of records.
    """
    return self.get_number_of_records(force_calculate=force_calculate)
get_output_sample(n_rows=10)

Gets a sample of the data as a list of dictionaries.

This is typically used to display a preview of the data in a UI.

Parameters:

Name Type Description Default
n_rows int

The number of rows to sample.

10

Returns:

Type Description
list[dict]

A list of dictionaries, where each dictionary represents a row.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_output_sample(self, n_rows: int = 10) -> list[dict]:
    """Gets a sample of the data as a list of dictionaries.

    This is typically used to display a preview of the data in a UI.

    Args:
        n_rows: The number of rows to sample.

    Returns:
        A list of dictionaries, where each dictionary represents a row.
    """
    if self.number_of_records > n_rows or self.number_of_records < 0:
        df = self.collect(n_rows)
    else:
        df = self.collect()
    return df.to_dicts()
get_record_count()

Returns a new FlowDataEngine with a single column 'number_of_records' containing the total number of records.

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_record_count(self) -> FlowDataEngine:
    """Returns a new FlowDataEngine with a single column 'number_of_records'
    containing the total number of records.

    Returns:
        A new `FlowDataEngine` instance.
    """
    return FlowDataEngine(self.data_frame.select(pl.len().alias("number_of_records")))
get_sample(n_rows=100, random=False, shuffle=False, seed=None, execution_location=None)

Gets a sample of rows from the DataFrame.

Parameters:

Name Type Description Default
n_rows int

The number of rows to sample.

100
random bool

If True, performs random sampling. If False, takes the first n_rows.

False
shuffle bool

If True (and random is True), shuffles the data before sampling.

False
seed int

A random seed for reproducibility.

None
execution_location ExecutionLocationsLiteral | None

The execution location used to determine the size of the DataFrame.

None

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance containing the sampled data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_sample(
    self,
    n_rows: int = 100,
    random: bool = False,
    shuffle: bool = False,
    seed: int = None,
    execution_location: ExecutionLocationsLiteral | None = None,
) -> FlowDataEngine:
    """Gets a sample of rows from the DataFrame.

    Args:
        n_rows: The number of rows to sample.
        random: If True, performs random sampling. If False, takes the first n_rows.
        shuffle: If True (and `random` is True), shuffles the data before sampling.
        seed: A random seed for reproducibility.
        execution_location: The execution location used to determine the size of the DataFrame.

    Returns:
        A new `FlowDataEngine` instance containing the sampled data.
    """
    logging.info(f"Getting sample of {n_rows} rows")
    if random:
        if self.lazy and self.external_source is not None:
            self.collect_external()

        if self.lazy and shuffle:
            sample_df = self.data_frame.collect(engine="streaming" if self._streamable else "auto").sample(
                n_rows, seed=seed, shuffle=shuffle
            )
        elif shuffle:
            sample_df = self.data_frame.sample(n_rows, seed=seed, shuffle=shuffle)
        else:
            if execution_location is None:
                execution_location = get_global_execution_location()
            n_rows = min(
                n_rows, self.get_number_of_records(calculate_in_worker_process=execution_location == "remote")
            )

            every_n_records = ceil(self.number_of_records / n_rows)
            sample_df = self.data_frame.gather_every(every_n_records)
    else:
        if self.external_source:
            self.collect(n_rows)
        sample_df = self.data_frame.head(n_rows)

    return FlowDataEngine(sample_df, schema=self.schema)
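Example (illustrative sketch; passing execution_location="local" keeps the record-count calculation in-process, using the 'local'/'remote' values described by the schemas later in this document):

import polars as pl

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.LazyFrame({"x": range(100)}))
head_sample = fde.get_sample(n_rows=5)  # first 5 rows
spread_sample = fde.get_sample(n_rows=5, random=True, execution_location="local")  # every ~20th row
shuffled = fde.get_sample(n_rows=5, random=True, shuffle=True, seed=42)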
get_schema_column(col_name)

Retrieves the schema information for a single column by its name.

Parameters:

Name Type Description Default
col_name str

The name of the column to retrieve.

required

Returns:

Type Description
FlowfileColumn

A FlowfileColumn object for the specified column, or None if not found.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_schema_column(self, col_name: str) -> FlowfileColumn:
    """Retrieves the schema information for a single column by its name.

    Args:
        col_name: The name of the column to retrieve.

    Returns:
        A `FlowfileColumn` object for the specified column, or `None` if not found.
    """
    for s in self.schema:
        if s.name == col_name:
            return s
get_select_inputs()

Gets SelectInput specifications for all columns in the current schema.

Returns:

Type Description
SelectInputs

A SelectInputs object that can be used to configure selection or transformation operations.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_select_inputs(self) -> transform_schemas.SelectInputs:
    """Gets `SelectInput` specifications for all columns in the current schema.

    Returns:
        A `SelectInputs` object that can be used to configure selection or
        transformation operations.
    """
    return transform_schemas.SelectInputs(
        [transform_schemas.SelectInput(old_name=c.name, data_type=c.data_type) for c in self.schema]
    )
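Example pairing get_select_inputs with do_select (illustrative sketch; it assumes SelectInput exposes mutable new_name and keep fields, which the do_select implementation reads, that new_name defaults to the old name when not provided, and that inputs generated from the current schema are marked as available):

import polars as pl

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.DataFrame({"id": [1, 2], "name": ["a", "b"], "tmp": [0, 0]}))

select_inputs = fde.get_select_inputs()  # one SelectInput per column
for select_input in select_inputs.renames:
    if select_input.old_name == "name":
        select_input.new_name = "customer_name"  # rename
    if select_input.old_name == "tmp":
        select_input.keep = False  # drop

renamed = fde.do_select(select_inputs, keep_missing=True)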
get_subset(n_rows=100)

Gets the first n_rows from the DataFrame.

Parameters:

Name Type Description Default
n_rows int

The number of rows to include in the subset.

100

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance containing the subset of data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def get_subset(self, n_rows: int = 100) -> FlowDataEngine:
    """Gets the first `n_rows` from the DataFrame.

    Args:
        n_rows: The number of rows to include in the subset.

    Returns:
        A new `FlowDataEngine` instance containing the subset of data.
    """
    return FlowDataEngine(self.data_frame.head(n_rows), calculate_schema_stats=True)
initialize_empty_fl()

Initializes an empty LazyFrame.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def initialize_empty_fl(self):
    """Initializes an empty LazyFrame."""
    self.data_frame = pl.LazyFrame()
    self.number_of_records = 0
    self._lazy = True
iter_batches(batch_size=1000, columns=None)

Iterates over the DataFrame in batches.

Parameters:

Name Type Description Default
batch_size int

The size of each batch.

1000
columns list | tuple | str

A list of column names to include in the batches. If None, all columns are included.

None

Yields:

Type Description
FlowDataEngine

A FlowDataEngine instance for each batch.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def iter_batches(
    self, batch_size: int = 1000, columns: list | tuple | str = None
) -> Generator[FlowDataEngine, None, None]:
    """Iterates over the DataFrame in batches.

    Args:
        batch_size: The size of each batch.
        columns: A list of column names to include in the batches. If None,
            all columns are included.

    Yields:
        A `FlowDataEngine` instance for each batch.
    """
    if columns:
        self.data_frame = self.data_frame.select(columns)
    self.lazy = False
    batches = self.data_frame.iter_slices(batch_size)
    for batch in batches:
        yield FlowDataEngine(batch)
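Example (illustrative sketch; note the implementation materializes the frame and sets lazy to False on the parent instance):

import polars as pl

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.DataFrame({"x": range(10)}))
for batch in fde.iter_batches(batch_size=4, columns=["x"]):
    # Each batch is itself a FlowDataEngine wrapping an eager slice.
    print(batch.get_number_of_records_in_process())  # 4, 4, 2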
join(join_input, auto_generate_selection, verify_integrity, other)

Performs a standard SQL-style join with another DataFrame.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def join(
    self,
    join_input: transform_schemas.JoinInput,
    auto_generate_selection: bool,
    verify_integrity: bool,
    other: FlowDataEngine,
) -> FlowDataEngine:
    """Performs a standard SQL-style join with another DataFrame."""
    # Create manager from input
    join_manager = transform_schemas.JoinInputManager(join_input)
    ensure_right_unselect_for_semi_and_anti_joins(join_manager.input)
    for jk in join_manager.join_mapping:
        if jk.left_col not in {c.old_name for c in join_manager.left_select.renames}:
            join_manager.left_select.append(transform_schemas.SelectInput(jk.left_col, keep=False))
        if jk.right_col not in {c.old_name for c in join_manager.right_select.renames}:
            join_manager.right_select.append(transform_schemas.SelectInput(jk.right_col, keep=False))
    verify_join_select_integrity(join_manager.input, left_columns=self.columns, right_columns=other.columns)
    if not verify_join_map_integrity(join_manager.input, left_columns=self.schema, right_columns=other.schema):
        raise Exception("Join is not valid by the data fields")

    if auto_generate_selection:
        join_manager.auto_rename()

    # Use manager properties throughout
    left = self.data_frame.select(join_manager.left_manager.get_select_cols()).rename(
        join_manager.left_manager.get_rename_table()
    )
    right = other.data_frame.select(join_manager.right_manager.get_select_cols()).rename(
        join_manager.right_manager.get_rename_table()
    )

    left, right, reverse_join_key_mapping = _handle_duplication_join_keys(left, right, join_manager)
    left, right = rename_df_table_for_join(left, right, join_manager.get_join_key_renames())
    if join_manager.how == "right":
        joined_df = right.join(
            other=left,
            left_on=join_manager.right_join_keys,
            right_on=join_manager.left_join_keys,
            how="left",
            suffix="",
        ).rename(reverse_join_key_mapping)
    else:
        joined_df = left.join(
            other=right,
            left_on=join_manager.left_join_keys,
            right_on=join_manager.right_join_keys,
            how=join_manager.how,
            suffix="",
        ).rename(reverse_join_key_mapping)

    left_cols_to_delete_after = [
        get_col_name_to_delete(col, "left")
        for col in join_manager.input.left_select.renames
        if not col.keep and col.is_available and col.join_key
    ]

    right_cols_to_delete_after = [
        get_col_name_to_delete(col, "right")
        for col in join_manager.input.right_select.renames
        if not col.keep
        and col.is_available
        and col.join_key
        and join_manager.how in ("left", "right", "inner", "cross", "outer")
    ]

    if len(right_cols_to_delete_after + left_cols_to_delete_after) > 0:
        joined_df = joined_df.drop(left_cols_to_delete_after + right_cols_to_delete_after)

    undo_join_key_remapping = get_undo_rename_mapping_join(join_manager)
    joined_df = joined_df.rename(undo_join_key_remapping)

    return FlowDataEngine(joined_df, calculate_schema_stats=False, number_of_records=0, streamable=False)
make_unique(unique_input=None)

Gets the unique rows from the DataFrame.

Parameters:

Name Type Description Default
unique_input UniqueInput

A UniqueInput object specifying a subset of columns to consider for uniqueness and a strategy for keeping rows.

None

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with unique rows.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def make_unique(self, unique_input: transform_schemas.UniqueInput = None) -> FlowDataEngine:
    """Gets the unique rows from the DataFrame.

    Args:
        unique_input: A `UniqueInput` object specifying a subset of columns
            to consider for uniqueness and a strategy for keeping rows.

    Returns:
        A new `FlowDataEngine` instance with unique rows.
    """
    if unique_input is None or unique_input.columns is None:
        return FlowDataEngine(self.data_frame.unique())
    return FlowDataEngine(self.data_frame.unique(unique_input.columns, keep=unique_input.strategy))
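Example (illustrative sketch; assumes UniqueInput takes columns and strategy keyword arguments, the fields read above, and that strategy accepts Polars keep values such as 'first'):

import polars as pl

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine
from flowfile_core.schemas import transform_schemas  # assumed import path

fde = FlowDataEngine(pl.DataFrame({"key": [1, 1, 2], "val": ["a", "b", "c"]}))

all_unique = fde.make_unique()  # unique over all columns
per_key = fde.make_unique(
    transform_schemas.UniqueInput(columns=["key"], strategy="first")
)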
output(output_fs, flow_id, node_id, execute_remote=True)

Writes the DataFrame to an output file.

Can execute the write operation locally or in a remote worker process.

Parameters:

Name Type Description Default
output_fs OutputSettings

An OutputSettings object with details about the output file.

required
flow_id int

The flow ID for tracking.

required
node_id int | str

The node ID for tracking.

required
execute_remote bool

If True, executes the write in a worker process.

True

Returns:

Type Description
FlowDataEngine

The same FlowDataEngine instance for chaining.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def output(
    self, output_fs: input_schema.OutputSettings, flow_id: int, node_id: int | str, execute_remote: bool = True
) -> FlowDataEngine:
    """Writes the DataFrame to an output file.

    Can execute the write operation locally or in a remote worker process.

    Args:
        output_fs: An `OutputSettings` object with details about the output file.
        flow_id: The flow ID for tracking.
        node_id: The node ID for tracking.
        execute_remote: If True, executes the write in a worker process.

    Returns:
        The same `FlowDataEngine` instance for chaining.
    """
    logger.info("Starting to write output")
    if execute_remote:
        status = utils.write_output(
            self.data_frame,
            data_type=output_fs.file_type,
            path=output_fs.abs_file_path,
            write_mode=output_fs.write_mode,
            sheet_name=output_fs.sheet_name,
            delimiter=output_fs.delimiter,
            flow_id=flow_id,
            node_id=node_id,
        )
        tracker = ExternalExecutorTracker(status)
        tracker.get_result()
        logger.info("Finished writing output")
    else:
        logger.info("Starting to write results locally")
        utils.local_write_output(
            self.data_frame,
            data_type=output_fs.file_type,
            path=output_fs.abs_file_path,
            write_mode=output_fs.write_mode,
            sheet_name=output_fs.sheet_name,
            delimiter=output_fs.delimiter,
            flow_id=flow_id,
            node_id=node_id,
        )
        logger.info("Finished writing output")
    return self
reorganize_order(column_order)

Reorganizes columns into a specified order.

Parameters:

Name Type Description Default
column_order list[str]

A list of column names in the desired order.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the columns reordered.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def reorganize_order(self, column_order: list[str]) -> FlowDataEngine:
    """Reorganizes columns into a specified order.

    Args:
        column_order: A list of column names in the desired order.

    Returns:
        A new `FlowDataEngine` instance with the columns reordered.
    """
    df = self.data_frame.select(column_order)
    schema = sorted(self.schema, key=lambda x: column_order.index(x.column_name))
    return FlowDataEngine(df, schema=schema, number_of_records=self.number_of_records)
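Example (illustrative sketch; every current column should appear in column_order, since the schema is re-sorted by its index in that list):

import polars as pl

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.DataFrame({"b": [1], "a": [2], "c": [3]}))
reordered = fde.reorganize_order(["a", "b", "c"])
print(reordered.columns)  # ['a', 'b', 'c']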
save(path, data_type='parquet')

Saves the DataFrame to a file in a separate thread.

Parameters:

Name Type Description Default
path str

The file path to save to.

required
data_type str

The format to save in (e.g., 'parquet', 'csv').

'parquet'

Returns:

Type Description
Future

A loky.Future object representing the asynchronous save operation.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def save(self, path: str, data_type: str = "parquet") -> Future:
    """Saves the DataFrame to a file in a separate thread.

    Args:
        path: The file path to save to.
        data_type: The format to save in (e.g., 'parquet', 'csv').

    Returns:
        A `loky.Future` object representing the asynchronous save operation.
    """
    estimated_size = deepcopy(self.get_estimated_file_size() * 4)
    df = deepcopy(self.data_frame)
    return write_threaded(_df=df, path=path, data_type=data_type, estimated_size=estimated_size)
select_columns(list_select)

Selects a subset of columns from the DataFrame.

Parameters:

Name Type Description Default
list_select list[str] | tuple[str] | str

A list, tuple, or single string of column names to select.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance containing only the selected columns.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def select_columns(self, list_select: list[str] | tuple[str] | str) -> FlowDataEngine:
    """Selects a subset of columns from the DataFrame.

    Args:
        list_select: A list, tuple, or single string of column names to select.

    Returns:
        A new `FlowDataEngine` instance containing only the selected columns.
    """
    if isinstance(list_select, str):
        list_select = [list_select]

    idx_to_keep = [self.cols_idx.get(c) for c in list_select]
    selects = [ls for ls, id_to_keep in zip(list_select, idx_to_keep, strict=False) if id_to_keep is not None]
    new_schema = [self.schema[i] for i in idx_to_keep if i is not None]

    return FlowDataEngine(
        self.data_frame.select(selects),
        number_of_records=self.number_of_records,
        schema=new_schema,
        streamable=self._streamable,
    )
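Example (illustrative sketch; names that are not present are silently ignored by the implementation above):

import polars as pl

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.DataFrame({"id": [1], "name": ["a"], "tmp": [0]}))
subset = fde.select_columns(["id", "name", "does_not_exist"])  # unknown name is dropped
print(subset.columns)  # ['id', 'name']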
set_streamable(streamable=False)

Sets whether DataFrame operations should be streamable.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def set_streamable(self, streamable: bool = False):
    """Sets whether DataFrame operations should be streamable."""
    self._streamable = streamable
solve_graph(graph_solver_input)

Solves a graph problem represented by 'from' and 'to' columns.

This is used for operations like finding connected components in a graph.

Parameters:

Name Type Description Default
graph_solver_input GraphSolverInput

A GraphSolverInput object defining the source, destination, and output column names.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the solved graph data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def solve_graph(self, graph_solver_input: transform_schemas.GraphSolverInput) -> FlowDataEngine:
    """Solves a graph problem represented by 'from' and 'to' columns.

    This is used for operations like finding connected components in a graph.

    Args:
        graph_solver_input: A `GraphSolverInput` object defining the source,
            destination, and output column names.

    Returns:
        A new `FlowDataEngine` instance with the solved graph data.
    """
    lf = self.data_frame.with_columns(
        graph_solver(graph_solver_input.col_from, graph_solver_input.col_to).alias(
            graph_solver_input.output_column_name
        )
    )
    return FlowDataEngine(lf)
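Example (illustrative sketch; assumes GraphSolverInput accepts col_from, col_to and output_column_name keyword arguments, the attributes read above):

import polars as pl

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine
from flowfile_core.schemas import transform_schemas  # assumed import path

edges = FlowDataEngine(pl.DataFrame({"src": [1, 2, 4], "dst": [2, 3, 5]}))
components = edges.solve_graph(
    transform_schemas.GraphSolverInput(col_from="src", col_to="dst", output_column_name="group")
)
# Nodes 1-2-3 end up sharing one group id, nodes 4-5 another.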
split(split_input)

Splits a column's text values into multiple rows based on a delimiter.

This operation is often referred to as "exploding" the DataFrame, as it increases the number of rows.

Parameters:

Name Type Description Default
split_input TextToRowsInput

A TextToRowsInput object specifying the column to split, the delimiter, and the output column name.

required

Returns:

Type Description
FlowDataEngine

A new FlowDataEngine instance with the exploded rows.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def split(self, split_input: transform_schemas.TextToRowsInput) -> FlowDataEngine:
    """Splits a column's text values into multiple rows based on a delimiter.

    This operation is often referred to as "exploding" the DataFrame, as it
    increases the number of rows.

    Args:
        split_input: A `TextToRowsInput` object specifying the column to split,
            the delimiter, and the output column name.

    Returns:
        A new `FlowDataEngine` instance with the exploded rows.
    """
    output_column_name = (
        split_input.output_column_name if split_input.output_column_name else split_input.column_to_split
    )

    split_value = (
        split_input.split_fixed_value if split_input.split_by_fixed_value else pl.col(split_input.split_by_column)
    )

    df = self.data_frame.with_columns(
        pl.col(split_input.column_to_split).str.split(by=split_value).alias(output_column_name)
    ).explode(output_column_name)

    return FlowDataEngine(df)
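Example (illustrative sketch; assumes TextToRowsInput exposes the fields read above: column_to_split, output_column_name, split_by_fixed_value and split_fixed_value):

import polars as pl

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine
from flowfile_core.schemas import transform_schemas  # assumed import path

fde = FlowDataEngine(pl.DataFrame({"id": [1, 2], "tags": ["a,b", "c"]}))
exploded = fde.split(
    transform_schemas.TextToRowsInput(
        column_to_split="tags",
        output_column_name="tag",
        split_by_fixed_value=True,
        split_fixed_value=",",
    )
)
# Result rows: (1, "a"), (1, "b"), (2, "c")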
start_fuzzy_join(fuzzy_match_input, other, file_ref, flow_id=-1, node_id=-1)

Starts a fuzzy join operation in a background process.

This method prepares the data and initiates the fuzzy matching in a separate process, returning a tracker object immediately.

Parameters:

Name Type Description Default
fuzzy_match_input FuzzyMatchInput

A FuzzyMatchInput object with the matching parameters.

required
other FlowDataEngine

The right FlowDataEngine to join with.

required
file_ref str

A reference string for temporary files.

required
flow_id int

The flow ID for tracking.

-1
node_id int | str

The node ID for tracking.

-1

Returns:

Type Description
ExternalFuzzyMatchFetcher

An ExternalFuzzyMatchFetcher object that can be used to track the progress and retrieve the result of the fuzzy join.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def start_fuzzy_join(
    self,
    fuzzy_match_input: transform_schemas.FuzzyMatchInput,
    other: FlowDataEngine,
    file_ref: str,
    flow_id: int = -1,
    node_id: int | str = -1,
) -> ExternalFuzzyMatchFetcher:
    """Starts a fuzzy join operation in a background process.

    This method prepares the data and initiates the fuzzy matching in a
    separate process, returning a tracker object immediately.

    Args:
        fuzzy_match_input: A `FuzzyMatchInput` object with the matching parameters.
        other: The right `FlowDataEngine` to join with.
        file_ref: A reference string for temporary files.
        flow_id: The flow ID for tracking.
        node_id: The node ID for tracking.

    Returns:
        An `ExternalFuzzyMatchFetcher` object that can be used to track the
        progress and retrieve the result of the fuzzy join.
    """
    fuzzy_match_input_manager = transform_schemas.FuzzyMatchInputManager(fuzzy_match_input)
    left_df, right_df = prepare_for_fuzzy_match(
        left=self, right=other, fuzzy_match_input_manager=fuzzy_match_input_manager
    )

    return ExternalFuzzyMatchFetcher(
        left_df,
        right_df,
        fuzzy_maps=fuzzy_match_input_manager.fuzzy_maps,
        file_ref=file_ref + "_fm",
        wait_on_completion=False,
        flow_id=flow_id,
        node_id=node_id,
    )
to_arrow()

Converts the DataFrame to a PyArrow Table.

This method triggers a .collect() call if the data is lazy, then converts the resulting eager DataFrame into a pyarrow.Table.

Returns:

Type Description
Table

A pyarrow.Table instance representing the data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def to_arrow(self) -> PaTable:
    """Converts the DataFrame to a PyArrow Table.

    This method triggers a `.collect()` call if the data is lazy,
    then converts the resulting eager DataFrame into a `pyarrow.Table`.

    Returns:
        A `pyarrow.Table` instance representing the data.
    """
    if self.lazy:
        return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_arrow()
    else:
        return self.data_frame.to_arrow()
to_cloud_storage_obj(settings)

Writes the DataFrame to an object in cloud storage.

This method supports writing to various cloud storage providers like AWS S3, Azure Data Lake Storage, and Google Cloud Storage.

Parameters:

Name Type Description Default
settings CloudStorageWriteSettingsInternal

A CloudStorageWriteSettingsInternal object containing connection details, file format, and write options.

required

Raises:

Type Description
ValueError

If the specified file format is not supported for writing.

NotImplementedError

If the 'append' write mode is used with an unsupported format.

Exception

If the write operation to cloud storage fails for any reason.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def to_cloud_storage_obj(self, settings: cloud_storage_schemas.CloudStorageWriteSettingsInternal):
    """Writes the DataFrame to an object in cloud storage.

    This method supports writing to various cloud storage providers like AWS S3,
    Azure Data Lake Storage, and Google Cloud Storage.

    Args:
        settings: A `CloudStorageWriteSettingsInternal` object containing connection
            details, file format, and write options.

    Raises:
        ValueError: If the specified file format is not supported for writing.
        NotImplementedError: If the 'append' write mode is used with an unsupported format.
        Exception: If the write operation to cloud storage fails for any reason.
    """
    connection = settings.connection
    write_settings = settings.write_settings

    logger.info(f"Writing to {connection.storage_type} storage: {write_settings.resource_path}")

    if write_settings.write_mode == "append" and write_settings.file_format != "delta":
        raise NotImplementedError("The 'append' write mode is not yet supported for this destination.")
    storage_options = CloudStorageReader.get_storage_options(connection)
    credential_provider = CloudStorageReader.get_credential_provider(connection)
    # Dispatch to the correct writer based on file format
    if write_settings.file_format == "parquet":
        self._write_parquet_to_cloud(
            write_settings.resource_path, storage_options, credential_provider, write_settings
        )
    elif write_settings.file_format == "delta":
        self._write_delta_to_cloud(
            write_settings.resource_path, storage_options, credential_provider, write_settings
        )
    elif write_settings.file_format == "csv":
        self._write_csv_to_cloud(write_settings.resource_path, storage_options, credential_provider, write_settings)
    elif write_settings.file_format == "json":
        self._write_json_to_cloud(
            write_settings.resource_path, storage_options, credential_provider, write_settings
        )
    else:
        raise ValueError(f"Unsupported file format for writing: {write_settings.file_format}")

    logger.info(f"Successfully wrote data to {write_settings.resource_path}")
to_dict()

Converts the DataFrame to a Python dictionary of columns.

Each key in the dictionary is a column name, and the corresponding value is a list of the data in that column.

Returns:

Type Description
dict[str, list]

A dictionary mapping column names to lists of their values.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def to_dict(self) -> dict[str, list]:
    """Converts the DataFrame to a Python dictionary of columns.

    Each key in the dictionary is a column name, and the corresponding value
    is a list of the data in that column.

    Returns:
        A dictionary mapping column names to lists of their values.
    """
    if self.lazy:
        return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_dict(as_series=False)
    else:
        return self.data_frame.to_dict(as_series=False)
to_pylist()

Converts the DataFrame to a list of Python dictionaries.

Returns:

Type Description
list[dict]

A list where each item is a dictionary representing a row.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def to_pylist(self) -> list[dict]:
    """Converts the DataFrame to a list of Python dictionaries.

    Returns:
        A list where each item is a dictionary representing a row.
    """
    if self.lazy:
        return self.data_frame.collect(engine="streaming" if self._streamable else "auto").to_dicts()
    return self.data_frame.to_dicts()
to_raw_data()

Converts the DataFrame to a RawData schema object.

Returns:

Type Description
RawData

An input_schema.RawData object containing the schema and data.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def to_raw_data(self) -> input_schema.RawData:
    """Converts the DataFrame to a `RawData` schema object.

    Returns:
        An `input_schema.RawData` object containing the schema and data.
    """
    columns = [c.get_minimal_field_info() for c in self.schema]
    data = list(self.to_dict().values())
    return input_schema.RawData(columns=columns, data=data)
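The conversion helpers above (to_arrow, to_dict, to_pylist, to_raw_data) all materialize lazy data first; an illustrative sketch:

import polars as pl

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.LazyFrame({"x": [1, 2], "y": ["a", "b"]}))
fde.to_dict()      # {'x': [1, 2], 'y': ['a', 'b']}
fde.to_pylist()    # [{'x': 1, 'y': 'a'}, {'x': 2, 'y': 'b'}]
fde.to_arrow()     # pyarrow.Table with the same data
fde.to_raw_data()  # input_schema.RawData with column info and data lists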
unpivot(unpivot_input)

Converts the DataFrame from a wide to a long format.

This is the inverse of a pivot operation, taking columns and transforming them into variable and value rows.

Parameters:

Name Type Description Default
unpivot_input UnpivotInput

An UnpivotInput object specifying which columns to unpivot and which to keep as index columns.

required

Returns:

Type Description
FlowDataEngine

A new, unpivoted FlowDataEngine instance.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_data_engine.py
def unpivot(self, unpivot_input: transform_schemas.UnpivotInput) -> FlowDataEngine:
    """Converts the DataFrame from a wide to a long format.

    This is the inverse of a pivot operation, taking columns and transforming
    them into `variable` and `value` rows.

    Args:
        unpivot_input: An `UnpivotInput` object specifying which columns to
            unpivot and which to keep as index columns.

    Returns:
        A new, unpivoted `FlowDataEngine` instance.
    """
    lf = self.data_frame

    if unpivot_input.data_type_selector_expr is not None:
        result = lf.unpivot(on=unpivot_input.data_type_selector_expr(), index=unpivot_input.index_columns)
    elif unpivot_input.value_columns is not None:
        result = lf.unpivot(on=unpivot_input.value_columns, index=unpivot_input.index_columns)
    else:
        result = lf.unpivot()

    return FlowDataEngine(result)
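Example (illustrative sketch; assumes UnpivotInput accepts index_columns and value_columns keyword arguments, the fields read above, with the data-type selector left unset):

import polars as pl

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine
from flowfile_core.schemas import transform_schemas  # assumed import path

wide = FlowDataEngine(pl.DataFrame({"id": [1, 2], "q1": [10, 20], "q2": [30, 40]}))
long_fde = wide.unpivot(
    transform_schemas.UnpivotInput(index_columns=["id"], value_columns=["q1", "q2"])
)
# Result columns: id, variable, value (one row per id/quarter pair)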

FlowfileColumn

The FlowfileColumn is a data class that holds the schema and rich metadata for a single column managed by the FlowDataEngine.

flowfile_core.flowfile.flow_data_engine.flow_file_column.main.FlowfileColumn dataclass

Methods:

Name Description
__repr__

Provides a concise, developer-friendly representation of the object.

__str__

Provides a detailed, readable summary of the column's metadata.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_file_column/main.py
@dataclass
class FlowfileColumn:
    column_name: str
    data_type: str
    size: int
    max_value: str
    min_value: str
    col_index: int
    number_of_empty_values: int
    number_of_unique_values: int
    example_values: str
    data_type_group: ReadableDataTypeGroup
    __sql_type: Any | None
    __is_unique: bool | None
    __nullable: bool | None
    __has_values: bool | None
    average_value: str | None
    __perc_unique: float | None

    def __init__(self, polars_type: PlType):
        self.data_type = convert_pl_type_to_string(polars_type.pl_datatype)
        self.size = polars_type.count - polars_type.null_count
        self.max_value = polars_type.max
        self.min_value = polars_type.min
        self.number_of_unique_values = polars_type.n_unique
        self.number_of_empty_values = polars_type.null_count
        self.example_values = polars_type.examples
        self.column_name = polars_type.column_name
        self.average_value = polars_type.mean
        self.col_index = polars_type.col_index
        self.__has_values = None
        self.__nullable = None
        self.__is_unique = None
        self.__sql_type = None
        self.__perc_unique = None
        self.data_type_group = self.get_readable_datatype_group()

    def __repr__(self):
        """
        Provides a concise, developer-friendly representation of the object.
        Ideal for debugging and console inspection.
        """
        return (
            f"FlowfileColumn(name='{self.column_name}', "
            f"type={self.data_type}, "
            f"size={self.size}, "
            f"nulls={self.number_of_empty_values})"
        )

    def __str__(self):
        """
        Provides a detailed, readable summary of the column's metadata.
        It conditionally omits any attribute that is None, ensuring a clean output.
        """
        # --- Header (Always Shown) ---
        header = f"<FlowfileColumn: '{self.column_name}'>"
        lines = []

        # --- Core Attributes (Conditionally Shown) ---
        if self.data_type is not None:
            lines.append(f"  Type: {self.data_type}")
        if self.size is not None:
            lines.append(f"  Non-Nulls: {self.size}")

        # Calculate and display nulls if possible
        if self.size is not None and self.number_of_empty_values is not None:
            total_entries = self.size + self.number_of_empty_values
            if total_entries > 0:
                null_perc = (self.number_of_empty_values / total_entries) * 100
                null_info = f"{self.number_of_empty_values} ({null_perc:.1f}%)"
            else:
                null_info = "0 (0.0%)"
            lines.append(f"  Nulls: {null_info}")

        if self.number_of_unique_values is not None:
            lines.append(f"  Unique: {self.number_of_unique_values}")

        # --- Conditional Stats Section ---
        stats = []
        if self.min_value is not None:
            stats.append(f"    Min: {self.min_value}")
        if self.max_value is not None:
            stats.append(f"    Max: {self.max_value}")
        if self.average_value is not None:
            stats.append(f"    Mean: {self.average_value}")

        if stats:
            lines.append("  Stats:")
            lines.extend(stats)

        # --- Conditional Examples Section ---
        if self.example_values:
            example_str = str(self.example_values)
            # Truncate long example strings for cleaner display
            if len(example_str) > 70:
                example_str = example_str[:67] + "..."
            lines.append(f"  Examples: {example_str}")

        return f"{header}\n" + "\n".join(lines)

    @classmethod
    def create_from_polars_type(cls, polars_type: PlType, **kwargs) -> "FlowfileColumn":
        for k, v in kwargs.items():
            if hasattr(polars_type, k):
                setattr(polars_type, k, v)
        return cls(polars_type)

    @classmethod
    def from_input(cls, column_name: str, data_type: str, **kwargs) -> "FlowfileColumn":
        pl_type = cast_str_to_polars_type(data_type)
        if pl_type is not None:
            data_type = pl_type
        return cls(PlType(column_name=column_name, pl_datatype=data_type, **kwargs))

    @classmethod
    def create_from_polars_dtype(cls, column_name: str, data_type: pl.DataType, **kwargs):
        return cls(PlType(column_name=column_name, pl_datatype=data_type, **kwargs))

    def get_minimal_field_info(self) -> input_schema.MinimalFieldInfo:
        return input_schema.MinimalFieldInfo(name=self.column_name, data_type=self.data_type)

    @classmethod
    def create_from_minimal_field_info(cls, minimal_field_info: input_schema.MinimalFieldInfo) -> "FlowfileColumn":
        return cls.from_input(column_name=minimal_field_info.name, data_type=minimal_field_info.data_type)

    @property
    def is_unique(self) -> bool:
        if self.__is_unique is None:
            if self.has_values:
                self.__is_unique = self.number_of_unique_values == self.number_of_filled_values
            else:
                self.__is_unique = False
        return self.__is_unique

    @property
    def perc_unique(self) -> float:
        if self.__perc_unique is None:
            self.__perc_unique = self.number_of_unique_values / self.number_of_filled_values
        return self.__perc_unique

    @property
    def has_values(self) -> bool:
        if not self.__has_values:
            self.__has_values = self.number_of_unique_values > 0
        return self.__has_values

    @property
    def number_of_filled_values(self):
        return self.size

    @property
    def nullable(self):
        if self.__nullable is None:
            self.__nullable = self.number_of_empty_values > 0
        return self.__nullable

    @property
    def name(self):
        return self.column_name

    def get_column_repr(self):
        return dict(
            name=self.name,
            size=self.size,
            data_type=str(self.data_type),
            has_values=self.has_values,
            is_unique=self.is_unique,
            max_value=str(self.max_value),
            min_value=str(self.min_value),
            number_of_unique_values=self.number_of_unique_values,
            number_of_filled_values=self.number_of_filled_values,
            number_of_empty_values=self.number_of_empty_values,
            average_size=self.average_value,
        )

    def generic_datatype(self) -> DataTypeGroup:
        if self.data_type in ("Utf8", "VARCHAR", "CHAR", "NVARCHAR", "String"):
            return "str"
        elif self.data_type in (
            "fixed_decimal",
            "decimal",
            "float",
            "integer",
            "boolean",
            "double",
            "Int16",
            "Int32",
            "Int64",
            "Float32",
            "Float64",
            "Decimal",
            "Binary",
            "Boolean",
            "Uint8",
            "Uint16",
            "Uint32",
            "Uint64",
        ):
            return "numeric"
        elif self.data_type in ("datetime", "date", "Date", "Datetime", "Time"):
            return "date"
        else:
            return "str"

    def get_readable_datatype_group(self) -> ReadableDataTypeGroup:
        if self.data_type in ("Utf8", "VARCHAR", "CHAR", "NVARCHAR", "String"):
            return "String"
        elif self.data_type in (
            "fixed_decimal",
            "decimal",
            "float",
            "integer",
            "boolean",
            "double",
            "Int16",
            "Int32",
            "Int64",
            "Float32",
            "Float64",
            "Decimal",
            "Binary",
            "Boolean",
            "Uint8",
            "Uint16",
            "Uint32",
            "Uint64",
        ):
            return "Numeric"
        elif self.data_type in ("datetime", "date", "Date", "Datetime", "Time"):
            return "Date"
        else:
            return "Other"

    def get_polars_type(self) -> PlType:
        pl_datatype = cast_str_to_polars_type(self.data_type)
        pl_type = PlType(pl_datatype=pl_datatype, **self.__dict__)
        return pl_type

    def update_type_from_polars_type(self, pl_type: PlType):
        self.data_type = str(pl_type.pl_datatype.base_type())
__repr__()

Provides a concise, developer-friendly representation of the object. Ideal for debugging and console inspection.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_file_column/main.py
def __repr__(self):
    """
    Provides a concise, developer-friendly representation of the object.
    Ideal for debugging and console inspection.
    """
    return (
        f"FlowfileColumn(name='{self.column_name}', "
        f"type={self.data_type}, "
        f"size={self.size}, "
        f"nulls={self.number_of_empty_values})"
    )
__str__()

Provides a detailed, readable summary of the column's metadata. It conditionally omits any attribute that is None, ensuring a clean output.

Source code in flowfile_core/flowfile_core/flowfile/flow_data_engine/flow_file_column/main.py
def __str__(self):
    """
    Provides a detailed, readable summary of the column's metadata.
    It conditionally omits any attribute that is None, ensuring a clean output.
    """
    # --- Header (Always Shown) ---
    header = f"<FlowfileColumn: '{self.column_name}'>"
    lines = []

    # --- Core Attributes (Conditionally Shown) ---
    if self.data_type is not None:
        lines.append(f"  Type: {self.data_type}")
    if self.size is not None:
        lines.append(f"  Non-Nulls: {self.size}")

    # Calculate and display nulls if possible
    if self.size is not None and self.number_of_empty_values is not None:
        total_entries = self.size + self.number_of_empty_values
        if total_entries > 0:
            null_perc = (self.number_of_empty_values / total_entries) * 100
            null_info = f"{self.number_of_empty_values} ({null_perc:.1f}%)"
        else:
            null_info = "0 (0.0%)"
        lines.append(f"  Nulls: {null_info}")

    if self.number_of_unique_values is not None:
        lines.append(f"  Unique: {self.number_of_unique_values}")

    # --- Conditional Stats Section ---
    stats = []
    if self.min_value is not None:
        stats.append(f"    Min: {self.min_value}")
    if self.max_value is not None:
        stats.append(f"    Max: {self.max_value}")
    if self.average_value is not None:
        stats.append(f"    Mean: {self.average_value}")

    if stats:
        lines.append("  Stats:")
        lines.extend(stats)

    # --- Conditional Examples Section ---
    if self.example_values:
        example_str = str(self.example_values)
        # Truncate long example strings for cleaner display
        if len(example_str) > 70:
            example_str = example_str[:67] + "..."
        lines.append(f"  Examples: {example_str}")

    return f"{header}\n" + "\n".join(lines)
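Example of inspecting column metadata through FlowfileColumn (illustrative sketch; the entries come from FlowDataEngine.schema and get_schema_column documented above, and calculate_schema_stats=True is assumed to populate the statistics):

import polars as pl

from flowfile_core.flowfile.flow_data_engine.flow_data_engine import FlowDataEngine

fde = FlowDataEngine(pl.DataFrame({"score": [1, 2, None]}), calculate_schema_stats=True)
col = fde.get_schema_column("score")
print(repr(col))  # FlowfileColumn(name='score', type=..., size=..., nulls=...)
print(col)        # multi-line summary: type, non-nulls, nulls, unique counts, stats
print(col.nullable, col.is_unique)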

Data Modeling (Schemas)

This section documents the Pydantic models that define the structure of settings and data.

schemas

flowfile_core.schemas.schemas

Classes:

Name Description
FlowGraphConfig

Configuration model for a flow graph's basic properties.

FlowInformation

Represents the complete state of a flow, including settings, nodes, and connections.

FlowSettings

Extends FlowGraphConfig with additional operational settings for a flow.

FlowfileData

Root model for flowfile serialization (YAML/JSON).

FlowfileNode

Node representation for flowfile serialization (YAML/JSON).

FlowfileSettings

Settings for flowfile serialization (YAML/JSON).

NodeConnection

Represents a connection between two nodes in the flow.

NodeDefault

Defines default properties for a node type.

NodeEdge

Represents a connection (edge) between two nodes in the frontend.

NodeInformation

Stores the state and configuration of a specific node instance within a flow.

NodeInput

Represents a node as it is received from the frontend, including position.

NodeTemplate

Defines the template for a node type, specifying its UI and functional characteristics.

RawLogInput

Schema for a raw log message.

VueFlowInput

Represents the complete graph structure from the Vue-based frontend.

Functions:

Name Description
get_global_execution_location

Calculates the default execution location based on the global settings

get_settings_class_for_node_type

Get the settings class for a node type, supporting both standard and user-defined nodes.

FlowGraphConfig pydantic-model

Bases: BaseModel

Configuration model for a flow graph's basic properties.

Attributes:

Name Type Description
flow_id int

Unique identifier for the flow.

description Optional[str]

A description of the flow.

save_location Optional[str]

The location where the flow is saved.

name str

The name of the flow.

path str

The file path associated with the flow.

execution_mode ExecutionModeLiteral

The mode of execution ('Development' or 'Performance').

execution_location ExecutionLocationsLiteral

The location for execution ('local', 'remote').

max_parallel_workers int

Maximum number of threads used for parallel node execution within a stage. Set to 1 to disable parallelism. Defaults to 4.

Show JSON schema:
{
  "description": "Configuration model for a flow graph's basic properties.\n\nAttributes:\n    flow_id (int): Unique identifier for the flow.\n    description (Optional[str]): A description of the flow.\n    save_location (Optional[str]): The location where the flow is saved.\n    name (str): The name of the flow.\n    path (str): The file path associated with the flow.\n    execution_mode (ExecutionModeLiteral): The mode of execution ('Development' or 'Performance').\n    execution_location (ExecutionLocationsLiteral): The location for execution ('local', 'remote').\n    max_parallel_workers (int): Maximum number of threads used for parallel node execution within a\n        stage. Set to 1 to disable parallelism. Defaults to 4.",
  "properties": {
    "flow_id": {
      "description": "Unique identifier for the flow.",
      "title": "Flow Id",
      "type": "integer"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Description"
    },
    "save_location": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Save Location"
    },
    "name": {
      "default": "",
      "title": "Name",
      "type": "string"
    },
    "path": {
      "default": "",
      "title": "Path",
      "type": "string"
    },
    "execution_mode": {
      "default": "Performance",
      "enum": [
        "Development",
        "Performance"
      ],
      "title": "Execution Mode",
      "type": "string"
    },
    "execution_location": {
      "enum": [
        "local",
        "remote"
      ],
      "title": "Execution Location",
      "type": "string"
    },
    "max_parallel_workers": {
      "default": 4,
      "description": "Max threads for parallel node execution.",
      "minimum": 1,
      "title": "Max Parallel Workers",
      "type": "integer"
    }
  },
  "title": "FlowGraphConfig",
  "type": "object"
}

Fields:

  • flow_id (int)
  • description (str | None)
  • save_location (str | None)
  • name (str)
  • path (str)
  • execution_mode (ExecutionModeLiteral)
  • execution_location (ExecutionLocationsLiteral)
  • max_parallel_workers (int)

Validators:

  • validate_and_set_execution_location → execution_location

Source code in flowfile_core/flowfile_core/schemas/schemas.py
class FlowGraphConfig(BaseModel):
    """
    Configuration model for a flow graph's basic properties.

    Attributes:
        flow_id (int): Unique identifier for the flow.
        description (Optional[str]): A description of the flow.
        save_location (Optional[str]): The location where the flow is saved.
        name (str): The name of the flow.
        path (str): The file path associated with the flow.
        execution_mode (ExecutionModeLiteral): The mode of execution ('Development' or 'Performance').
        execution_location (ExecutionLocationsLiteral): The location for execution ('local', 'remote').
        max_parallel_workers (int): Maximum number of threads used for parallel node execution within a
            stage. Set to 1 to disable parallelism. Defaults to 4.
    """

    flow_id: int = Field(default_factory=create_unique_id, description="Unique identifier for the flow.")
    description: str | None = None
    save_location: str | None = None
    name: str = ""
    path: str = ""
    execution_mode: ExecutionModeLiteral = "Performance"
    execution_location: ExecutionLocationsLiteral = Field(default_factory=get_global_execution_location)
    max_parallel_workers: int = Field(default=4, ge=1, description="Max threads for parallel node execution.")

    @field_validator("execution_location", mode="before")
    def validate_and_set_execution_location(cls, v: ExecutionLocationsLiteral | None) -> ExecutionLocationsLiteral:
        """
        Validates and sets the execution location.
        1.  **If `None` is provided**: It defaults to the location determined by global settings.
        2.  **If a value is provided**: It checks if the value is compatible with the global
            settings. If not (e.g., requesting 'remote' when only 'local' is possible),
            it corrects the value to a compatible one.
        """
        if v is None:
            return get_global_execution_location()
        if v == "auto":
            return get_global_execution_location()

        return get_prio_execution_location(v, get_global_execution_location())
flow_id pydantic-field

Unique identifier for the flow.

max_parallel_workers = 4 pydantic-field

Max threads for parallel node execution.

validate_and_set_execution_location(v) pydantic-validator

Validates and sets the execution location.

  1. If None is provided: it defaults to the location determined by the global settings.
  2. If a value is provided: it checks whether the value is compatible with the global settings. If not (e.g., requesting 'remote' when only 'local' is possible), it corrects the value to a compatible one.

Source code in flowfile_core/flowfile_core/schemas/schemas.py
@field_validator("execution_location", mode="before")
def validate_and_set_execution_location(cls, v: ExecutionLocationsLiteral | None) -> ExecutionLocationsLiteral:
    """
    Validates and sets the execution location.
    1.  **If `None` is provided**: It defaults to the location determined by global settings.
    2.  **If a value is provided**: It checks if the value is compatible with the global
        settings. If not (e.g., requesting 'remote' when only 'local' is possible),
        it corrects the value to a compatible one.
    """
    if v is None:
        return get_global_execution_location()
    if v == "auto":
        return get_global_execution_location()

    return get_prio_execution_location(v, get_global_execution_location())
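
A minimal usage sketch, assuming the module imports as shown in the source path above; the name and mode values are examples:

from flowfile_core.schemas.schemas import FlowGraphConfig

# Omitting execution_location (or passing None / "auto") lets the validator above
# fall back to the globally configured default.
cfg = FlowGraphConfig(name="daily_sales", execution_mode="Development")
print(cfg.execution_location)    # "local" or "remote", depending on global settings
print(cfg.max_parallel_workers)  # 4 (default)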
FlowInformation pydantic-model

Bases: BaseModel

Represents the complete state of a flow, including settings, nodes, and connections.

Attributes:

Name Type Description
flow_id int

The unique ID of the flow.

flow_name Optional[str]

The name of the flow.

flow_settings FlowSettings

The settings for the flow.

data Dict[int, NodeInformation]

A dictionary mapping node IDs to their information.

node_starts List[int]

A list of starting node IDs.

node_connections List[Tuple[int, int]]

A list of tuples representing connections between nodes.

Show JSON schema:
{
  "$defs": {
    "FlowSettings": {
      "description": "Extends FlowGraphConfig with additional operational settings for a flow.\n\nAttributes:\n    auto_save (bool): Flag to enable or disable automatic saving.\n    modified_on (Optional[float]): Timestamp of the last modification.\n    show_detailed_progress (bool): Flag to show detailed progress during execution.\n    is_running (bool): Indicates if the flow is currently running.\n    is_canceled (bool): Indicates if the flow execution has been canceled.\n    track_history (bool): Flag to enable or disable undo/redo history tracking.",
      "properties": {
        "flow_id": {
          "description": "Unique identifier for the flow.",
          "title": "Flow Id",
          "type": "integer"
        },
        "description": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Description"
        },
        "save_location": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Save Location"
        },
        "name": {
          "default": "",
          "title": "Name",
          "type": "string"
        },
        "path": {
          "default": "",
          "title": "Path",
          "type": "string"
        },
        "execution_mode": {
          "default": "Performance",
          "enum": [
            "Development",
            "Performance"
          ],
          "title": "Execution Mode",
          "type": "string"
        },
        "execution_location": {
          "enum": [
            "local",
            "remote"
          ],
          "title": "Execution Location",
          "type": "string"
        },
        "max_parallel_workers": {
          "default": 4,
          "description": "Max threads for parallel node execution.",
          "minimum": 1,
          "title": "Max Parallel Workers",
          "type": "integer"
        },
        "auto_save": {
          "default": false,
          "title": "Auto Save",
          "type": "boolean"
        },
        "modified_on": {
          "anyOf": [
            {
              "type": "number"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Modified On"
        },
        "show_detailed_progress": {
          "default": true,
          "title": "Show Detailed Progress",
          "type": "boolean"
        },
        "is_running": {
          "default": false,
          "title": "Is Running",
          "type": "boolean"
        },
        "is_canceled": {
          "default": false,
          "title": "Is Canceled",
          "type": "boolean"
        },
        "track_history": {
          "default": true,
          "title": "Track History",
          "type": "boolean"
        }
      },
      "title": "FlowSettings",
      "type": "object"
    },
    "NodeInformation": {
      "description": "Stores the state and configuration of a specific node instance within a flow.",
      "properties": {
        "id": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Id"
        },
        "type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Type"
        },
        "is_setup": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Is Setup"
        },
        "is_start_node": {
          "default": false,
          "title": "Is Start Node",
          "type": "boolean"
        },
        "description": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": "",
          "title": "Description"
        },
        "node_reference": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Node Reference"
        },
        "x_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": 0,
          "title": "X Position"
        },
        "y_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": 0,
          "title": "Y Position"
        },
        "left_input_id": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Left Input Id"
        },
        "right_input_id": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Right Input Id"
        },
        "input_ids": {
          "anyOf": [
            {
              "items": {
                "type": "integer"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "title": "Input Ids"
        },
        "outputs": {
          "anyOf": [
            {
              "items": {
                "type": "integer"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "title": "Outputs"
        },
        "setting_input": {
          "anyOf": [
            {},
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Setting Input"
        }
      },
      "title": "NodeInformation",
      "type": "object"
    }
  },
  "description": "Represents the complete state of a flow, including settings, nodes, and connections.\n\nAttributes:\n    flow_id (int): The unique ID of the flow.\n    flow_name (Optional[str]): The name of the flow.\n    flow_settings (FlowSettings): The settings for the flow.\n    data (Dict[int, NodeInformation]): A dictionary mapping node IDs to their information.\n    node_starts (List[int]): A list of starting node IDs.\n    node_connections (List[Tuple[int, int]]): A list of tuples representing connections between nodes.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "flow_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Flow Name"
    },
    "flow_settings": {
      "$ref": "#/$defs/FlowSettings"
    },
    "data": {
      "additionalProperties": {
        "$ref": "#/$defs/NodeInformation"
      },
      "default": {},
      "title": "Data",
      "type": "object"
    },
    "node_starts": {
      "items": {
        "type": "integer"
      },
      "title": "Node Starts",
      "type": "array"
    },
    "node_connections": {
      "default": [],
      "items": {
        "maxItems": 2,
        "minItems": 2,
        "prefixItems": [
          {
            "type": "integer"
          },
          {
            "type": "integer"
          }
        ],
        "type": "array"
      },
      "title": "Node Connections",
      "type": "array"
    }
  },
  "required": [
    "flow_id",
    "flow_settings",
    "node_starts"
  ],
  "title": "FlowInformation",
  "type": "object"
}

Fields:

  • flow_id (int)
  • flow_name (str | None)
  • flow_settings (FlowSettings)
  • data (dict[int, NodeInformation])
  • node_starts (list[int])
  • node_connections (list[tuple[int, int]])

Validators:

  • ensure_string → flow_name

Source code in flowfile_core/flowfile_core/schemas/schemas.py
class FlowInformation(BaseModel):
    """
    Represents the complete state of a flow, including settings, nodes, and connections.

    Attributes:
        flow_id (int): The unique ID of the flow.
        flow_name (Optional[str]): The name of the flow.
        flow_settings (FlowSettings): The settings for the flow.
        data (Dict[int, NodeInformation]): A dictionary mapping node IDs to their information.
        node_starts (List[int]): A list of starting node IDs.
        node_connections (List[Tuple[int, int]]): A list of tuples representing connections between nodes.
    """

    flow_id: int
    flow_name: str | None = ""
    flow_settings: FlowSettings
    data: dict[int, NodeInformation] = {}
    node_starts: list[int]
    node_connections: list[tuple[int, int]] = []

    @field_validator("flow_name", mode="before")
    def ensure_string(cls, v):
        """
        Validator to ensure the flow_name is always a string.
        :param v: The value to validate.
        :return: The value as a string, or an empty string if it's None.
        """
        return str(v) if v is not None else ""
ensure_string(v) pydantic-validator

Validator to ensure the flow_name is always a string.

Parameters:

  • v: The value to validate.

Returns:

  • The value as a string, or an empty string if it is None.

Source code in flowfile_core/flowfile_core/schemas/schemas.py
@field_validator("flow_name", mode="before")
def ensure_string(cls, v):
    """
    Validator to ensure the flow_name is always a string.
    :param v: The value to validate.
    :return: The value as a string, or an empty string if it's None.
    """
    return str(v) if v is not None else ""
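
A minimal construction sketch; the IDs and the single start node are example values, and FlowSettings() relies entirely on its defaults:

from flowfile_core.schemas.schemas import FlowInformation, FlowSettings

# flow_name=None is coerced to "" by the ensure_string validator above.
info = FlowInformation(
    flow_id=1,
    flow_name=None,
    flow_settings=FlowSettings(),
    node_starts=[1],
    node_connections=[(1, 2)],
)
print(repr(info.flow_name))  # ''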
FlowSettings pydantic-model

Bases: FlowGraphConfig

Extends FlowGraphConfig with additional operational settings for a flow.

Attributes:

Name Type Description
auto_save bool

Flag to enable or disable automatic saving.

modified_on Optional[float]

Timestamp of the last modification.

show_detailed_progress bool

Flag to show detailed progress during execution.

is_running bool

Indicates if the flow is currently running.

is_canceled bool

Indicates if the flow execution has been canceled.

track_history bool

Flag to enable or disable undo/redo history tracking.

Show JSON schema:
{
  "description": "Extends FlowGraphConfig with additional operational settings for a flow.\n\nAttributes:\n    auto_save (bool): Flag to enable or disable automatic saving.\n    modified_on (Optional[float]): Timestamp of the last modification.\n    show_detailed_progress (bool): Flag to show detailed progress during execution.\n    is_running (bool): Indicates if the flow is currently running.\n    is_canceled (bool): Indicates if the flow execution has been canceled.\n    track_history (bool): Flag to enable or disable undo/redo history tracking.",
  "properties": {
    "flow_id": {
      "description": "Unique identifier for the flow.",
      "title": "Flow Id",
      "type": "integer"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Description"
    },
    "save_location": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Save Location"
    },
    "name": {
      "default": "",
      "title": "Name",
      "type": "string"
    },
    "path": {
      "default": "",
      "title": "Path",
      "type": "string"
    },
    "execution_mode": {
      "default": "Performance",
      "enum": [
        "Development",
        "Performance"
      ],
      "title": "Execution Mode",
      "type": "string"
    },
    "execution_location": {
      "enum": [
        "local",
        "remote"
      ],
      "title": "Execution Location",
      "type": "string"
    },
    "max_parallel_workers": {
      "default": 4,
      "description": "Max threads for parallel node execution.",
      "minimum": 1,
      "title": "Max Parallel Workers",
      "type": "integer"
    },
    "auto_save": {
      "default": false,
      "title": "Auto Save",
      "type": "boolean"
    },
    "modified_on": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modified On"
    },
    "show_detailed_progress": {
      "default": true,
      "title": "Show Detailed Progress",
      "type": "boolean"
    },
    "is_running": {
      "default": false,
      "title": "Is Running",
      "type": "boolean"
    },
    "is_canceled": {
      "default": false,
      "title": "Is Canceled",
      "type": "boolean"
    },
    "track_history": {
      "default": true,
      "title": "Track History",
      "type": "boolean"
    }
  },
  "title": "FlowSettings",
  "type": "object"
}

Fields:

  • flow_id (int)
  • description (str | None)
  • save_location (str | None)
  • name (str)
  • path (str)
  • execution_mode (ExecutionModeLiteral)
  • execution_location (ExecutionLocationsLiteral)
  • max_parallel_workers (int)
  • auto_save (bool)
  • modified_on (float | None)
  • show_detailed_progress (bool)
  • is_running (bool)
  • is_canceled (bool)
  • track_history (bool)

Validators:

  • validate_and_set_execution_location → execution_location

Source code in flowfile_core/flowfile_core/schemas/schemas.py
class FlowSettings(FlowGraphConfig):
    """
    Extends FlowGraphConfig with additional operational settings for a flow.

    Attributes:
        auto_save (bool): Flag to enable or disable automatic saving.
        modified_on (Optional[float]): Timestamp of the last modification.
        show_detailed_progress (bool): Flag to show detailed progress during execution.
        is_running (bool): Indicates if the flow is currently running.
        is_canceled (bool): Indicates if the flow execution has been canceled.
        track_history (bool): Flag to enable or disable undo/redo history tracking.
    """

    auto_save: bool = False
    modified_on: float | None = None
    show_detailed_progress: bool = True
    is_running: bool = False
    is_canceled: bool = False
    track_history: bool = True

    @classmethod
    def from_flow_settings_input(cls, flow_graph_config: FlowGraphConfig):
        """
        Creates a FlowSettings instance from a FlowGraphConfig instance.

        :param flow_graph_config: The base flow graph configuration.
        :return: A new instance of FlowSettings with data from flow_graph_config.
        """
        return cls.model_validate(flow_graph_config.model_dump())
from_flow_settings_input(flow_graph_config) classmethod

Creates a FlowSettings instance from a FlowGraphConfig instance.

Parameters:

  • flow_graph_config: The base flow graph configuration.

Returns:

  • A new FlowSettings instance with data from flow_graph_config.

Source code in flowfile_core/flowfile_core/schemas/schemas.py
@classmethod
def from_flow_settings_input(cls, flow_graph_config: FlowGraphConfig):
    """
    Creates a FlowSettings instance from a FlowGraphConfig instance.

    :param flow_graph_config: The base flow graph configuration.
    :return: A new instance of FlowSettings with data from flow_graph_config.
    """
    return cls.model_validate(flow_graph_config.model_dump())
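
A short sketch of promoting a FlowGraphConfig to FlowSettings; the name is an example value:

from flowfile_core.schemas.schemas import FlowGraphConfig, FlowSettings

base = FlowGraphConfig(name="etl", execution_mode="Development")
settings = FlowSettings.from_flow_settings_input(base)
print(settings.name)       # "etl", carried over from the base config
print(settings.auto_save)  # False, added by FlowSettings with its default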
FlowfileData pydantic-model

Bases: BaseModel

Root model for flowfile serialization (YAML/JSON).

Show JSON schema:
{
  "$defs": {
    "FlowfileNode": {
      "description": "Node representation for flowfile serialization (YAML/JSON).",
      "properties": {
        "id": {
          "title": "Id",
          "type": "integer"
        },
        "type": {
          "title": "Type",
          "type": "string"
        },
        "is_start_node": {
          "default": false,
          "title": "Is Start Node",
          "type": "boolean"
        },
        "description": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": "",
          "title": "Description"
        },
        "node_reference": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Node Reference"
        },
        "x_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": 0,
          "title": "X Position"
        },
        "y_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": 0,
          "title": "Y Position"
        },
        "left_input_id": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Left Input Id"
        },
        "right_input_id": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Right Input Id"
        },
        "input_ids": {
          "anyOf": [
            {
              "items": {
                "type": "integer"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "title": "Input Ids"
        },
        "outputs": {
          "anyOf": [
            {
              "items": {
                "type": "integer"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "title": "Outputs"
        },
        "setting_input": {
          "anyOf": [
            {},
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Setting Input"
        }
      },
      "required": [
        "id",
        "type"
      ],
      "title": "FlowfileNode",
      "type": "object"
    },
    "FlowfileSettings": {
      "description": "Settings for flowfile serialization (YAML/JSON).\n\nExcludes runtime state fields like is_running, is_canceled, modified_on.",
      "properties": {
        "description": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Description"
        },
        "execution_mode": {
          "default": "Performance",
          "enum": [
            "Development",
            "Performance"
          ],
          "title": "Execution Mode",
          "type": "string"
        },
        "execution_location": {
          "default": "local",
          "enum": [
            "local",
            "remote"
          ],
          "title": "Execution Location",
          "type": "string"
        },
        "auto_save": {
          "default": false,
          "title": "Auto Save",
          "type": "boolean"
        },
        "show_detailed_progress": {
          "default": true,
          "title": "Show Detailed Progress",
          "type": "boolean"
        },
        "max_parallel_workers": {
          "default": 4,
          "minimum": 1,
          "title": "Max Parallel Workers",
          "type": "integer"
        }
      },
      "title": "FlowfileSettings",
      "type": "object"
    }
  },
  "description": "Root model for flowfile serialization (YAML/JSON).",
  "properties": {
    "flowfile_version": {
      "title": "Flowfile Version",
      "type": "string"
    },
    "flowfile_id": {
      "title": "Flowfile Id",
      "type": "integer"
    },
    "flowfile_name": {
      "title": "Flowfile Name",
      "type": "string"
    },
    "flowfile_settings": {
      "$ref": "#/$defs/FlowfileSettings"
    },
    "nodes": {
      "items": {
        "$ref": "#/$defs/FlowfileNode"
      },
      "title": "Nodes",
      "type": "array"
    }
  },
  "required": [
    "flowfile_version",
    "flowfile_id",
    "flowfile_name",
    "flowfile_settings",
    "nodes"
  ],
  "title": "FlowfileData",
  "type": "object"
}

Fields:

Source code in flowfile_core/flowfile_core/schemas/schemas.py
class FlowfileData(BaseModel):
    """Root model for flowfile serialization (YAML/JSON)."""

    flowfile_version: str
    flowfile_id: int
    flowfile_name: str
    flowfile_settings: FlowfileSettings
    nodes: list[FlowfileNode]
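
A minimal sketch of building the root serialization model; the version string and node type names are illustrative, not a fixed vocabulary from this reference:

from flowfile_core.schemas.schemas import FlowfileData, FlowfileNode, FlowfileSettings

data = FlowfileData(
    flowfile_version="1.0",
    flowfile_id=1,
    flowfile_name="example_flow",
    flowfile_settings=FlowfileSettings(),
    nodes=[
        FlowfileNode(id=1, type="read"),                  # "read" is an example type
        FlowfileNode(id=2, type="filter", input_ids=[1]), # "filter" is an example type
    ],
)
print(len(data.nodes))  # 2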
FlowfileNode pydantic-model

Bases: BaseModel

Node representation for flowfile serialization (YAML/JSON).

Show JSON schema:
{
  "description": "Node representation for flowfile serialization (YAML/JSON).",
  "properties": {
    "id": {
      "title": "Id",
      "type": "integer"
    },
    "type": {
      "title": "Type",
      "type": "string"
    },
    "is_start_node": {
      "default": false,
      "title": "Is Start Node",
      "type": "boolean"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "x_position": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "X Position"
    },
    "y_position": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Y Position"
    },
    "left_input_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Left Input Id"
    },
    "right_input_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Right Input Id"
    },
    "input_ids": {
      "anyOf": [
        {
          "items": {
            "type": "integer"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "title": "Input Ids"
    },
    "outputs": {
      "anyOf": [
        {
          "items": {
            "type": "integer"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "title": "Outputs"
    },
    "setting_input": {
      "anyOf": [
        {},
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Setting Input"
    }
  },
  "required": [
    "id",
    "type"
  ],
  "title": "FlowfileNode",
  "type": "object"
}

Fields:

  • id (int)
  • type (str)
  • is_start_node (bool)
  • description (str | None)
  • node_reference (str | None)
  • x_position (int | None)
  • y_position (int | None)
  • left_input_id (int | None)
  • right_input_id (int | None)
  • input_ids (list[int] | None)
  • outputs (list[int] | None)
  • setting_input (Any | None)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class FlowfileNode(BaseModel):
    """Node representation for flowfile serialization (YAML/JSON)."""

    id: int
    type: str
    is_start_node: bool = False
    description: str | None = ""
    node_reference: str | None = None  # Unique reference identifier for code generation
    x_position: int | None = 0
    y_position: int | None = 0
    left_input_id: int | None = None
    right_input_id: int | None = None
    input_ids: list[int] | None = Field(default_factory=list)
    outputs: list[int] | None = Field(default_factory=list)
    setting_input: Any | None = None

    _setting_input_exclude: ClassVar[set] = {
        "flow_id",
        "node_id",
        "pos_x",
        "pos_y",
        "is_setup",
        "description",
        "node_reference",
        "user_id",
        "is_flow_output",
        "is_user_defined",
        "depending_on_id",
        "depending_on_ids",
    }

    @field_serializer("setting_input")
    def serialize_setting_input(self, value, _info):
        if value is None:
            return None
        if isinstance(value, input_schema.NodePromise):
            return None
        if hasattr(value, "to_yaml_dict"):
            return value.to_yaml_dict()
        return value.model_dump(exclude=self._setting_input_exclude)
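
A quick serialization sketch; the node type string is illustrative:

from flowfile_core.schemas.schemas import FlowfileNode

node = FlowfileNode(id=1, type="read")
dumped = node.model_dump()
print(dumped["setting_input"])  # None: the serializer above passes None straight through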
FlowfileSettings pydantic-model

Bases: BaseModel

Settings for flowfile serialization (YAML/JSON).

Excludes runtime state fields like is_running, is_canceled, modified_on.

Show JSON schema:
{
  "description": "Settings for flowfile serialization (YAML/JSON).\n\nExcludes runtime state fields like is_running, is_canceled, modified_on.",
  "properties": {
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Description"
    },
    "execution_mode": {
      "default": "Performance",
      "enum": [
        "Development",
        "Performance"
      ],
      "title": "Execution Mode",
      "type": "string"
    },
    "execution_location": {
      "default": "local",
      "enum": [
        "local",
        "remote"
      ],
      "title": "Execution Location",
      "type": "string"
    },
    "auto_save": {
      "default": false,
      "title": "Auto Save",
      "type": "boolean"
    },
    "show_detailed_progress": {
      "default": true,
      "title": "Show Detailed Progress",
      "type": "boolean"
    },
    "max_parallel_workers": {
      "default": 4,
      "minimum": 1,
      "title": "Max Parallel Workers",
      "type": "integer"
    }
  },
  "title": "FlowfileSettings",
  "type": "object"
}

Fields:

  • description (str | None)
  • execution_mode (ExecutionModeLiteral)
  • execution_location (ExecutionLocationsLiteral)
  • auto_save (bool)
  • show_detailed_progress (bool)
  • max_parallel_workers (int)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class FlowfileSettings(BaseModel):
    """Settings for flowfile serialization (YAML/JSON).

    Excludes runtime state fields like is_running, is_canceled, modified_on.
    """

    description: str | None = None
    execution_mode: ExecutionModeLiteral = "Performance"
    execution_location: ExecutionLocationsLiteral = "local"
    auto_save: bool = False
    show_detailed_progress: bool = True
    max_parallel_workers: int = Field(default=4, ge=1)
NodeConnection pydantic-model

Bases: BaseModel

Represents a connection between two nodes in the flow.

Attributes:

Name Type Description
from_node_id int

The ID of the source node.

to_node_id int

The ID of the target node.

Show JSON schema:
{
  "description": "Represents a connection between two nodes in the flow.\n\nAttributes:\n    from_node_id (int): The ID of the source node.\n    to_node_id (int): The ID of the target node.",
  "properties": {
    "from_node_id": {
      "title": "From Node Id",
      "type": "integer"
    },
    "to_node_id": {
      "title": "To Node Id",
      "type": "integer"
    }
  },
  "required": [
    "from_node_id",
    "to_node_id"
  ],
  "title": "NodeConnection",
  "type": "object"
}

Config:

  • frozen: True

Fields:

  • from_node_id (int)
  • to_node_id (int)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class NodeConnection(BaseModel):
    """
    Represents a connection between two nodes in the flow.

    Attributes:
        from_node_id (int): The ID of the source node.
        to_node_id (int): The ID of the target node.
    """

    model_config = ConfigDict(frozen=True)
    from_node_id: int
    to_node_id: int
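
Because the model is frozen, instances are immutable and hashable, which makes de-duplicating edges straightforward, as in this sketch:

from flowfile_core.schemas.schemas import NodeConnection

a = NodeConnection(from_node_id=1, to_node_id=2)
b = NodeConnection(from_node_id=1, to_node_id=2)
print(a == b)       # True
print(len({a, b}))  # 1: identical connections collapse in a set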
NodeDefault pydantic-model

Bases: BaseModel

Defines default properties for a node type.

Attributes:

Name Type Description
node_name str

The name of the node.

node_type NodeTypeLiteral

The functional type of the node ('input', 'output', 'process').

transform_type TransformTypeLiteral

The data transformation behavior ('narrow', 'wide', 'other').

has_default_settings Optional[Any]

Indicates if the node has predefined default settings.

Show JSON schema:
{
  "description": "Defines default properties for a node type.\n\nAttributes:\n    node_name (str): The name of the node.\n    node_type (NodeTypeLiteral): The functional type of the node ('input', 'output', 'process').\n    transform_type (TransformTypeLiteral): The data transformation behavior ('narrow', 'wide', 'other').\n    has_default_settings (Optional[Any]): Indicates if the node has predefined default settings.",
  "properties": {
    "node_name": {
      "title": "Node Name",
      "type": "string"
    },
    "node_type": {
      "enum": [
        "input",
        "output",
        "process"
      ],
      "title": "Node Type",
      "type": "string"
    },
    "transform_type": {
      "enum": [
        "narrow",
        "wide",
        "other"
      ],
      "title": "Transform Type",
      "type": "string"
    },
    "has_default_settings": {
      "anyOf": [
        {},
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Has Default Settings"
    }
  },
  "required": [
    "node_name",
    "node_type",
    "transform_type"
  ],
  "title": "NodeDefault",
  "type": "object"
}

Fields:

  • node_name (str)
  • node_type (NodeTypeLiteral)
  • transform_type (TransformTypeLiteral)
  • has_default_settings (Any | None)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class NodeDefault(BaseModel):
    """
    Defines default properties for a node type.

    Attributes:
        node_name (str): The name of the node.
        node_type (NodeTypeLiteral): The functional type of the node ('input', 'output', 'process').
        transform_type (TransformTypeLiteral): The data transformation behavior ('narrow', 'wide', 'other').
        has_default_settings (Optional[Any]): Indicates if the node has predefined default settings.
    """

    node_name: str
    node_type: NodeTypeLiteral
    transform_type: TransformTypeLiteral
    has_default_settings: Any | None = None
NodeEdge pydantic-model

Bases: BaseModel

Represents a connection (edge) between two nodes in the frontend.

Attributes:

Name Type Description
id str

A unique identifier for the edge.

source str

The ID of the source node.

target str

The ID of the target node.

targetHandle str

The specific input handle on the target node.

sourceHandle str

The specific output handle on the source node.

Show JSON schema:
{
  "description": "Represents a connection (edge) between two nodes in the frontend.\n\nAttributes:\n    id (str): A unique identifier for the edge.\n    source (str): The ID of the source node.\n    target (str): The ID of the target node.\n    targetHandle (str): The specific input handle on the target node.\n    sourceHandle (str): The specific output handle on the source node.",
  "properties": {
    "id": {
      "title": "Id",
      "type": "string"
    },
    "source": {
      "title": "Source",
      "type": "string"
    },
    "target": {
      "title": "Target",
      "type": "string"
    },
    "targetHandle": {
      "title": "Targethandle",
      "type": "string"
    },
    "sourceHandle": {
      "title": "Sourcehandle",
      "type": "string"
    }
  },
  "required": [
    "id",
    "source",
    "target",
    "targetHandle",
    "sourceHandle"
  ],
  "title": "NodeEdge",
  "type": "object"
}

Config:

  • coerce_numbers_to_str: True

Fields:

  • id (str)
  • source (str)
  • target (str)
  • targetHandle (str)
  • sourceHandle (str)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class NodeEdge(BaseModel):
    """
    Represents a connection (edge) between two nodes in the frontend.

    Attributes:
        id (str): A unique identifier for the edge.
        source (str): The ID of the source node.
        target (str): The ID of the target node.
        targetHandle (str): The specific input handle on the target node.
        sourceHandle (str): The specific output handle on the source node.
    """

    model_config = ConfigDict(coerce_numbers_to_str=True)
    id: str
    source: str
    target: str
    targetHandle: str
    sourceHandle: str
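
A small sketch of the number-to-string coercion; the handle names are illustrative, not a documented convention:

from flowfile_core.schemas.schemas import NodeEdge

edge = NodeEdge(id=7, source=1, target=2, targetHandle="input-0", sourceHandle="output-0")
print(type(edge.source).__name__, edge.source)  # str 1: numeric frontend IDs become strings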
NodeInformation pydantic-model

Bases: BaseModel

Stores the state and configuration of a specific node instance within a flow.

Show JSON schema:
{
  "description": "Stores the state and configuration of a specific node instance within a flow.",
  "properties": {
    "id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Id"
    },
    "type": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Type"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Is Setup"
    },
    "is_start_node": {
      "default": false,
      "title": "Is Start Node",
      "type": "boolean"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "x_position": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "X Position"
    },
    "y_position": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Y Position"
    },
    "left_input_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Left Input Id"
    },
    "right_input_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Right Input Id"
    },
    "input_ids": {
      "anyOf": [
        {
          "items": {
            "type": "integer"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "title": "Input Ids"
    },
    "outputs": {
      "anyOf": [
        {
          "items": {
            "type": "integer"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "title": "Outputs"
    },
    "setting_input": {
      "anyOf": [
        {},
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Setting Input"
    }
  },
  "title": "NodeInformation",
  "type": "object"
}

Fields:

  • id (int | None)
  • type (str | None)
  • is_setup (bool | None)
  • is_start_node (bool)
  • description (str | None)
  • node_reference (str | None)
  • x_position (int | None)
  • y_position (int | None)
  • left_input_id (int | None)
  • right_input_id (int | None)
  • input_ids (list[int] | None)
  • outputs (list[int] | None)
  • setting_input (Any | None)

Validators:

  • validate_setting_input → setting_input
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class NodeInformation(BaseModel):
    """
    Stores the state and configuration of a specific node instance within a flow.
    """

    id: int | None = None
    type: str | None = None
    is_setup: bool | None = None
    is_start_node: bool = False
    description: str | None = ""
    node_reference: str | None = None  # Unique reference identifier for code generation
    x_position: int | None = 0
    y_position: int | None = 0
    left_input_id: int | None = None
    right_input_id: int | None = None
    input_ids: list[int] | None = Field(default_factory=list)
    outputs: list[int] | None = Field(default_factory=list)
    setting_input: Any | None = None

    @property
    def data(self) -> Any:
        return self.setting_input

    @property
    def main_input_ids(self) -> list[int] | None:
        return self.input_ids

    @field_validator("setting_input", mode="before")
    @classmethod
    def validate_setting_input(cls, v, info: ValidationInfo):
        if v is None:
            return None
        if isinstance(v, BaseModel):
            return v

        node_type = info.data.get("type")
        model_class = get_settings_class_for_node_type(node_type)

        if model_class is None:
            raise ValueError(f"Unknown node type: {node_type}")

        if isinstance(v, model_class):
            return v

        return model_class.model_validate(v)
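
A minimal sketch; the node type string and positions are example values, and setting_input is left unset, so it stays None:

from flowfile_core.schemas.schemas import NodeInformation

info = NodeInformation(id=3, type="filter", x_position=120, y_position=80)
print(info.main_input_ids)  # [] (property alias for input_ids)
print(info.data)            # None (property alias for setting_input)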
NodeInput pydantic-model

Bases: NodeTemplate

Represents a node as it is received from the frontend, including position.

Attributes:

Name Type Description
id int

The unique ID of the node instance.

pos_x float

The x-coordinate on the canvas.

pos_y float

The y-coordinate on the canvas.

Show JSON schema:
{
  "description": "Represents a node as it is received from the frontend, including position.\n\nAttributes:\n    id (int): The unique ID of the node instance.\n    pos_x (float): The x-coordinate on the canvas.\n    pos_y (float): The y-coordinate on the canvas.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "item": {
      "title": "Item",
      "type": "string"
    },
    "input": {
      "title": "Input",
      "type": "integer"
    },
    "output": {
      "title": "Output",
      "type": "integer"
    },
    "image": {
      "title": "Image",
      "type": "string"
    },
    "multi": {
      "default": false,
      "title": "Multi",
      "type": "boolean"
    },
    "node_type": {
      "enum": [
        "input",
        "output",
        "process"
      ],
      "title": "Node Type",
      "type": "string"
    },
    "transform_type": {
      "enum": [
        "narrow",
        "wide",
        "other"
      ],
      "title": "Transform Type",
      "type": "string"
    },
    "node_group": {
      "title": "Node Group",
      "type": "string"
    },
    "prod_ready": {
      "default": true,
      "title": "Prod Ready",
      "type": "boolean"
    },
    "can_be_start": {
      "default": false,
      "title": "Can Be Start",
      "type": "boolean"
    },
    "drawer_title": {
      "default": "Node title",
      "title": "Drawer Title",
      "type": "string"
    },
    "drawer_intro": {
      "default": "Drawer into",
      "title": "Drawer Intro",
      "type": "string"
    },
    "custom_node": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Custom Node"
    },
    "id": {
      "title": "Id",
      "type": "integer"
    },
    "pos_x": {
      "title": "Pos X",
      "type": "number"
    },
    "pos_y": {
      "title": "Pos Y",
      "type": "number"
    }
  },
  "required": [
    "name",
    "item",
    "input",
    "output",
    "image",
    "node_type",
    "transform_type",
    "node_group",
    "id",
    "pos_x",
    "pos_y"
  ],
  "title": "NodeInput",
  "type": "object"
}

Fields:

  • name (str)
  • item (str)
  • input (int)
  • output (int)
  • image (str)
  • multi (bool)
  • node_type (NodeTypeLiteral)
  • transform_type (TransformTypeLiteral)
  • node_group (str)
  • prod_ready (bool)
  • can_be_start (bool)
  • drawer_title (str)
  • drawer_intro (str)
  • custom_node (bool | None)
  • id (int)
  • pos_x (float)
  • pos_y (float)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class NodeInput(NodeTemplate):
    """
    Represents a node as it is received from the frontend, including position.

    Attributes:
        id (int): The unique ID of the node instance.
        pos_x (float): The x-coordinate on the canvas.
        pos_y (float): The y-coordinate on the canvas.
    """

    id: int
    pos_x: float
    pos_y: float
NodeTemplate pydantic-model

Bases: BaseModel

Defines the template for a node type, specifying its UI and functional characteristics.

Attributes:

Name Type Description
name str

The display name of the node.

item str

The unique identifier for the node type.

input int

The number of required input connections.

output int

The number of output connections.

image str

The filename of the icon for the node.

multi bool

Whether the node accepts multiple main input connections.

node_group str

The category group the node belongs to (e.g., 'input', 'transform').

prod_ready bool

Whether the node is considered production-ready.

can_be_start bool

Whether the node can be a starting point in a flow.

Show JSON schema:
{
  "description": "Defines the template for a node type, specifying its UI and functional characteristics.\n\nAttributes:\n    name (str): The display name of the node.\n    item (str): The unique identifier for the node type.\n    input (int): The number of required input connections.\n    output (int): The number of output connections.\n    image (str): The filename of the icon for the node.\n    multi (bool): Whether the node accepts multiple main input connections.\n    node_group (str): The category group the node belongs to (e.g., 'input', 'transform').\n    prod_ready (bool): Whether the node is considered production-ready.\n    can_be_start (bool): Whether the node can be a starting point in a flow.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "item": {
      "title": "Item",
      "type": "string"
    },
    "input": {
      "title": "Input",
      "type": "integer"
    },
    "output": {
      "title": "Output",
      "type": "integer"
    },
    "image": {
      "title": "Image",
      "type": "string"
    },
    "multi": {
      "default": false,
      "title": "Multi",
      "type": "boolean"
    },
    "node_type": {
      "enum": [
        "input",
        "output",
        "process"
      ],
      "title": "Node Type",
      "type": "string"
    },
    "transform_type": {
      "enum": [
        "narrow",
        "wide",
        "other"
      ],
      "title": "Transform Type",
      "type": "string"
    },
    "node_group": {
      "title": "Node Group",
      "type": "string"
    },
    "prod_ready": {
      "default": true,
      "title": "Prod Ready",
      "type": "boolean"
    },
    "can_be_start": {
      "default": false,
      "title": "Can Be Start",
      "type": "boolean"
    },
    "drawer_title": {
      "default": "Node title",
      "title": "Drawer Title",
      "type": "string"
    },
    "drawer_intro": {
      "default": "Drawer into",
      "title": "Drawer Intro",
      "type": "string"
    },
    "custom_node": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Custom Node"
    }
  },
  "required": [
    "name",
    "item",
    "input",
    "output",
    "image",
    "node_type",
    "transform_type",
    "node_group"
  ],
  "title": "NodeTemplate",
  "type": "object"
}

Fields:

  • name (str)
  • item (str)
  • input (int)
  • output (int)
  • image (str)
  • multi (bool)
  • node_type (NodeTypeLiteral)
  • transform_type (TransformTypeLiteral)
  • node_group (str)
  • prod_ready (bool)
  • can_be_start (bool)
  • drawer_title (str)
  • drawer_intro (str)
  • custom_node (bool | None)
Source code in flowfile_core/flowfile_core/schemas/schemas.py
class NodeTemplate(BaseModel):
    """
    Defines the template for a node type, specifying its UI and functional characteristics.

    Attributes:
        name (str): The display name of the node.
        item (str): The unique identifier for the node type.
        input (int): The number of required input connections.
        output (int): The number of output connections.
        image (str): The filename of the icon for the node.
        multi (bool): Whether the node accepts multiple main input connections.
        node_group (str): The category group the node belongs to (e.g., 'input', 'transform').
        prod_ready (bool): Whether the node is considered production-ready.
        can_be_start (bool): Whether the node can be a starting point in a flow.
    """

    name: str
    item: str
    input: int
    output: int
    image: str
    multi: bool = False
    node_type: NodeTypeLiteral
    transform_type: TransformTypeLiteral
    node_group: str
    prod_ready: bool = True
    can_be_start: bool = False
    drawer_title: str = "Node title"
    drawer_intro: str = "Drawer into"
    custom_node: bool | None = False
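
A construction sketch with illustrative values; only the fields without defaults are required:

from flowfile_core.schemas.schemas import NodeTemplate

tpl = NodeTemplate(
    name="Filter",
    item="filter",
    input=1,
    output=1,
    image="filter.png",
    node_type="process",
    transform_type="narrow",
    node_group="transform",
)
print(tpl.can_be_start)  # False (default)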
RawLogInput pydantic-model

Bases: BaseModel

Schema for a raw log message.

Attributes:

Name Type Description
flowfile_flow_id int

The ID of the flow that generated the log.

log_message str

The content of the log message.

log_type Literal['INFO', 'ERROR']

The type of log.

extra Optional[dict]

Extra context data for the log.

Show JSON schema:
{
  "description": "Schema for a raw log message.\n\nAttributes:\n    flowfile_flow_id (int): The ID of the flow that generated the log.\n    log_message (str): The content of the log message.\n    log_type (Literal[\"INFO\", \"ERROR\"]): The type of log.\n    extra (Optional[dict]): Extra context data for the log.",
  "properties": {
    "flowfile_flow_id": {
      "title": "Flowfile Flow Id",
      "type": "integer"
    },
    "log_message": {
      "title": "Log Message",
      "type": "string"
    },
    "log_type": {
      "enum": [
        "INFO",
        "ERROR"
      ],
      "title": "Log Type",
      "type": "string"
    },
    "extra": {
      "anyOf": [
        {
          "type": "object"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Extra"
    }
  },
  "required": [
    "flowfile_flow_id",
    "log_message",
    "log_type"
  ],
  "title": "RawLogInput",
  "type": "object"
}

Fields:

  • flowfile_flow_id (int)
  • log_message (str)
  • log_type (Literal['INFO', 'ERROR'])
  • extra (dict | None)
Source code in flowfile_core/flowfile_core/schemas/schemas.py, lines 170-184
class RawLogInput(BaseModel):
    """
    Schema for a raw log message.

    Attributes:
        flowfile_flow_id (int): The ID of the flow that generated the log.
        log_message (str): The content of the log message.
        log_type (Literal["INFO", "ERROR"]): The type of log.
        extra (Optional[dict]): Extra context data for the log.
    """

    flowfile_flow_id: int
    log_message: str
    log_type: Literal["INFO", "ERROR"]
    extra: dict | None = None
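
A minimal usage sketch, assuming the module path shown above:

from flowfile_core.schemas.schemas import RawLogInput

# log_type must be "INFO" or "ERROR"; extra carries optional context data.
log = RawLogInput(
    flowfile_flow_id=1,
    log_message="Flow started",
    log_type="INFO",
    extra={"node_id": 3},
)
print(log.model_dump())
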
VueFlowInput pydantic-model

Bases: BaseModel

Represents the complete graph structure from the Vue-based frontend.

Attributes:

Name Type Description
node_edges List[NodeEdge]

A list of all edges in the graph.

node_inputs List[NodeInput]

A list of all nodes in the graph.

Show JSON schema:
{
  "$defs": {
    "NodeEdge": {
      "description": "Represents a connection (edge) between two nodes in the frontend.\n\nAttributes:\n    id (str): A unique identifier for the edge.\n    source (str): The ID of the source node.\n    target (str): The ID of the target node.\n    targetHandle (str): The specific input handle on the target node.\n    sourceHandle (str): The specific output handle on the source node.",
      "properties": {
        "id": {
          "title": "Id",
          "type": "string"
        },
        "source": {
          "title": "Source",
          "type": "string"
        },
        "target": {
          "title": "Target",
          "type": "string"
        },
        "targetHandle": {
          "title": "Targethandle",
          "type": "string"
        },
        "sourceHandle": {
          "title": "Sourcehandle",
          "type": "string"
        }
      },
      "required": [
        "id",
        "source",
        "target",
        "targetHandle",
        "sourceHandle"
      ],
      "title": "NodeEdge",
      "type": "object"
    },
    "NodeInput": {
      "description": "Represents a node as it is received from the frontend, including position.\n\nAttributes:\n    id (int): The unique ID of the node instance.\n    pos_x (float): The x-coordinate on the canvas.\n    pos_y (float): The y-coordinate on the canvas.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "item": {
          "title": "Item",
          "type": "string"
        },
        "input": {
          "title": "Input",
          "type": "integer"
        },
        "output": {
          "title": "Output",
          "type": "integer"
        },
        "image": {
          "title": "Image",
          "type": "string"
        },
        "multi": {
          "default": false,
          "title": "Multi",
          "type": "boolean"
        },
        "node_type": {
          "enum": [
            "input",
            "output",
            "process"
          ],
          "title": "Node Type",
          "type": "string"
        },
        "transform_type": {
          "enum": [
            "narrow",
            "wide",
            "other"
          ],
          "title": "Transform Type",
          "type": "string"
        },
        "node_group": {
          "title": "Node Group",
          "type": "string"
        },
        "prod_ready": {
          "default": true,
          "title": "Prod Ready",
          "type": "boolean"
        },
        "can_be_start": {
          "default": false,
          "title": "Can Be Start",
          "type": "boolean"
        },
        "drawer_title": {
          "default": "Node title",
          "title": "Drawer Title",
          "type": "string"
        },
        "drawer_intro": {
          "default": "Drawer into",
          "title": "Drawer Intro",
          "type": "string"
        },
        "custom_node": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": false,
          "title": "Custom Node"
        },
        "id": {
          "title": "Id",
          "type": "integer"
        },
        "pos_x": {
          "title": "Pos X",
          "type": "number"
        },
        "pos_y": {
          "title": "Pos Y",
          "type": "number"
        }
      },
      "required": [
        "name",
        "item",
        "input",
        "output",
        "image",
        "node_type",
        "transform_type",
        "node_group",
        "id",
        "pos_x",
        "pos_y"
      ],
      "title": "NodeInput",
      "type": "object"
    }
  },
  "description": "Represents the complete graph structure from the Vue-based frontend.\n\nAttributes:\n    node_edges (List[NodeEdge]): A list of all edges in the graph.\n    node_inputs (List[NodeInput]): A list of all nodes in the graph.",
  "properties": {
    "node_edges": {
      "items": {
        "$ref": "#/$defs/NodeEdge"
      },
      "title": "Node Edges",
      "type": "array"
    },
    "node_inputs": {
      "items": {
        "$ref": "#/$defs/NodeInput"
      },
      "title": "Node Inputs",
      "type": "array"
    }
  },
  "required": [
    "node_edges",
    "node_inputs"
  ],
  "title": "VueFlowInput",
  "type": "object"
}

Fields:

  • node_edges (list[NodeEdge])
  • node_inputs (list[NodeInput])
Source code in flowfile_core/flowfile_core/schemas/schemas.py, lines 413-424
class VueFlowInput(BaseModel):
    """

    Represents the complete graph structure from the Vue-based frontend.

    Attributes:
        node_edges (List[NodeEdge]): A list of all edges in the graph.
        node_inputs (List[NodeInput]): A list of all nodes in the graph.
    """

    node_edges: list[NodeEdge]
    node_inputs: list[NodeInput]
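
A minimal sketch of validating a frontend payload into a VueFlowInput, assuming the module path shown above; the node and edge values are hypothetical:

from flowfile_core.schemas.schemas import VueFlowInput

graph = VueFlowInput.model_validate({
    "node_edges": [
        {"id": "e1", "source": "1", "target": "2",
         "targetHandle": "input-0", "sourceHandle": "output-0"},
    ],
    "node_inputs": [
        {"name": "Read", "item": "read", "input": 0, "output": 1,
         "image": "read.png", "node_type": "input",
         "transform_type": "other", "node_group": "input",
         "id": 1, "pos_x": 100.0, "pos_y": 200.0},
    ],
})
print(len(graph.node_edges), len(graph.node_inputs))  # 1 1
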
get_global_execution_location()

Calculates the default execution location based on the global settings.

Returns an ExecutionLocationsLiteral: "remote" when OFFLOAD_TO_WORKER is enabled, otherwise "local".

Source code in flowfile_core/flowfile_core/schemas/schemas.py, lines 50-59
def get_global_execution_location() -> ExecutionLocationsLiteral:
    """
    Calculates the default execution location based on the global settings
    Returns
    -------
    ExecutionLocationsLiteral where the current
    """
    if OFFLOAD_TO_WORKER:
        return "remote"
    return "local"
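
A short usage sketch, assuming the module path shown above:

from flowfile_core.schemas.schemas import get_global_execution_location

# Resolves to "remote" when OFFLOAD_TO_WORKER is enabled, otherwise "local".
location = get_global_execution_location()
assert location in ("local", "remote")
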
get_settings_class_for_node_type(node_type)

Get the settings class for a node type, supporting both standard and user-defined nodes.

Source code in flowfile_core/flowfile_core/schemas/schemas.py, lines 72-79
def get_settings_class_for_node_type(node_type: str):
    """Get the settings class for a node type, supporting both standard and user-defined nodes."""
    model_class = NODE_TYPE_TO_SETTINGS_CLASS.get(node_type)
    if model_class is None:
        if node_type in _get_custom_node_store():
            return input_schema.UserDefinedNode
        return None
    return model_class
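
A short usage sketch, assuming the module path shown above; "filter" is used as an example node type and may differ from the identifiers registered in your installation:

from flowfile_core.schemas.schemas import get_settings_class_for_node_type

settings_cls = get_settings_class_for_node_type("filter")
if settings_cls is None:
    print("No settings class registered for this node type")
else:
    print(settings_cls.__name__)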

input_schema

flowfile_core.schemas.input_schema

Classes:

Name Description
DatabaseConnection

Defines the connection parameters for a database.

DatabaseSettings

Defines settings for reading from a database, either via table or query.

DatabaseWriteSettings

Defines settings for writing data to a database table.

ExternalSource

Base model for data coming from a predefined external source.

FullDatabaseConnection

A complete database connection model including the secret password.

FullDatabaseConnectionInterface

A database connection model intended for UI display, omitting the password.

InputCsvTable

Defines settings for reading a CSV file.

InputExcelTable

Defines settings for reading an Excel file.

InputJsonTable

Defines settings for reading a JSON file.

InputParquetTable

Defines settings for reading a Parquet file.

InputTableBase

Base settings for input file operations.

MinimalFieldInfo

Represents the most basic information about a data field (column).

NewDirectory

Defines the information required to create a new directory.

NodeBase

Base model for all nodes in a FlowGraph. Contains common metadata.

NodeCloudStorageReader

Settings for a node that reads from a cloud storage service (S3, GCS, etc.).

NodeCloudStorageWriter

Settings for a node that writes to a cloud storage service.

NodeConnection

Represents a connection (edge) between two nodes in the graph.

NodeCrossJoin

Settings for a node that performs a cross join.

NodeDatabaseReader

Settings for a node that reads from a database.

NodeDatabaseWriter

Settings for a node that writes data to a database.

NodeDatasource

Base settings for a node that acts as a data source.

NodeDescription

A simple model for updating a node's description text.

NodeExploreData

Settings for a node that provides an interactive data exploration interface.

NodeExternalSource

Settings for a node that connects to a registered external data source.

NodeFilter

Settings for a node that filters rows based on a condition.

NodeFormula

Settings for a node that applies a formula to create/modify a column.

NodeFuzzyMatch

Settings for a node that performs a fuzzy join based on string similarity.

NodeGraphSolver

Settings for a node that solves graph-based problems (e.g., connected components).

NodeGroupBy

Settings for a node that performs a group-by and aggregation operation.

NodeInputConnection

Represents the input side of a connection between two nodes.

NodeJoin

Settings for a node that performs a standard SQL-style join.

NodeManualInput

Settings for a node that allows direct data entry in the UI.

NodeMultiInput

A base model for any node that takes multiple data inputs.

NodeOutput

Settings for a node that writes its input to a file.

NodeOutputConnection

Represents the output side of a connection between two nodes.

NodePivot

Settings for a node that pivots data from a long to a wide format.

NodePolarsCode

Settings for a node that executes arbitrary user-provided Polars code.

NodePromise

A placeholder node for an operation that has not yet been configured.

NodeRead

Settings for a node that reads data from a file.

NodeRecordCount

Settings for a node that counts the number of records.

NodeRecordId

Settings for a node that adds a unique record ID column.

NodeSample

Settings for a node that samples a subset of the data.

NodeSelect

Settings for a node that selects, renames, and reorders columns.

NodeSingleInput

A base model for any node that takes a single data input.

NodeSort

Settings for a node that sorts the data by one or more columns.

NodeTextToRows

Settings for a node that splits a text column into multiple rows.

NodeUnion

Settings for a node that concatenates multiple data inputs.

NodeUnique

Settings for a node that returns the unique rows from the data.

NodeUnpivot

Settings for a node that unpivots data from a wide to a long format.

OutputCsvTable

Defines settings for writing a CSV file.

OutputExcelTable

Defines settings for writing an Excel file.

OutputFieldConfig

Configuration for output field validation and transformation behavior.

OutputFieldInfo

Field information with optional default value for output field configuration.

OutputParquetTable

Defines settings for writing a Parquet file.

OutputSettings

Defines the complete settings for an output node.

RawData

Represents data in a raw, columnar format for manual input.

ReceivedTable

Model for defining a table received from an external source.

RemoveItem

Represents a single item to be removed from a directory or list.

RemoveItemsInput

Defines a list of items to be removed.

SampleUsers

Settings for generating a sample dataset of users.

UserDefinedNode

Settings for a node that contains the user-defined node information.

DatabaseConnection pydantic-model

Bases: BaseModel

Defines the connection parameters for a database.

Show JSON schema:
{
  "description": "Defines the connection parameters for a database.",
  "properties": {
    "database_type": {
      "default": "postgresql",
      "title": "Database Type",
      "type": "string"
    },
    "username": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Username"
    },
    "password_ref": {
      "anyOf": [
        {
          "description": "An ID referencing an encrypted secret.",
          "maxLength": 100,
          "minLength": 1,
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Password Ref"
    },
    "host": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Host"
    },
    "port": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Port"
    },
    "database": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Database"
    },
    "url": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Url"
    }
  },
  "title": "DatabaseConnection",
  "type": "object"
}

Fields:

  • database_type (str)
  • username (str | None)
  • password_ref (SecretRef | None)
  • host (str | None)
  • port (int | None)
  • database (str | None)
  • url (str | None)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 614-623
class DatabaseConnection(BaseModel):
    """Defines the connection parameters for a database."""

    database_type: str = "postgresql"
    username: str | None = None
    password_ref: SecretRef | None = None
    host: str | None = None
    port: int | None = None
    database: str | None = None
    url: str | None = None
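
A minimal sketch, assuming the module path flowfile_core.schemas.input_schema; the host and database names are hypothetical:

from flowfile_core.schemas.input_schema import DatabaseConnection

conn = DatabaseConnection(
    database_type="postgresql",
    username="analytics",
    host="db.internal",
    port=5432,
    database="warehouse",
)
# password_ref (an ID referencing an encrypted secret) can be set instead of embedding credentials.
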
DatabaseSettings pydantic-model

Bases: BaseModel

Defines settings for reading from a database, either via table or query.

Show JSON schema:
{
  "$defs": {
    "DatabaseConnection": {
      "description": "Defines the connection parameters for a database.",
      "properties": {
        "database_type": {
          "default": "postgresql",
          "title": "Database Type",
          "type": "string"
        },
        "username": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Username"
        },
        "password_ref": {
          "anyOf": [
            {
              "description": "An ID referencing an encrypted secret.",
              "maxLength": 100,
              "minLength": 1,
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Password Ref"
        },
        "host": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Host"
        },
        "port": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Port"
        },
        "database": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Database"
        },
        "url": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Url"
        }
      },
      "title": "DatabaseConnection",
      "type": "object"
    }
  },
  "description": "Defines settings for reading from a database, either via table or query.",
  "properties": {
    "connection_mode": {
      "anyOf": [
        {
          "enum": [
            "inline",
            "reference"
          ],
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "inline",
      "title": "Connection Mode"
    },
    "database_connection": {
      "anyOf": [
        {
          "$ref": "#/$defs/DatabaseConnection"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "database_connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Database Connection Name"
    },
    "schema_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Schema Name"
    },
    "table_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Table Name"
    },
    "query": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Query"
    },
    "query_mode": {
      "default": "table",
      "enum": [
        "query",
        "table",
        "reference"
      ],
      "title": "Query Mode",
      "type": "string"
    }
  },
  "title": "DatabaseSettings",
  "type": "object"
}

Fields:

  • connection_mode (Literal['inline', 'reference'] | None)
  • database_connection (DatabaseConnection | None)
  • database_connection_name (str | None)
  • schema_name (str | None)
  • table_name (str | None)
  • query (str | None)
  • query_mode (Literal['query', 'table', 'reference'])

Validators:

  • validate_table_or_query
Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 653-677
class DatabaseSettings(BaseModel):
    """Defines settings for reading from a database, either via table or query."""

    connection_mode: Literal["inline", "reference"] | None = "inline"
    database_connection: DatabaseConnection | None = None
    database_connection_name: str | None = None
    schema_name: str | None = None
    table_name: str | None = None
    query: str | None = None
    query_mode: Literal["query", "table", "reference"] = "table"

    @model_validator(mode="after")
    def validate_table_or_query(self):
        # Validate that either table_name or query is provided
        if (not self.table_name and not self.query) and self.query_mode == "inline":
            raise ValueError("Either 'table_name' or 'query' must be provided")

        # Validate correct connection information based on connection_mode
        if self.connection_mode == "inline" and self.database_connection is None:
            raise ValueError("When 'connection_mode' is 'inline', 'database_connection' must be provided")

        if self.connection_mode == "reference" and not self.database_connection_name:
            raise ValueError("When 'connection_mode' is 'reference', 'database_connection_name' must be provided")

        return self
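
A minimal sketch showing the two connection modes accepted by the validator above, assuming the module path flowfile_core.schemas.input_schema; the connection details are hypothetical:

from flowfile_core.schemas.input_schema import DatabaseConnection, DatabaseSettings

# Inline mode (the default): the connection is embedded in the settings.
inline = DatabaseSettings(
    database_connection=DatabaseConnection(username="analytics", host="db.internal",
                                           port=5432, database="warehouse"),
    schema_name="public",
    table_name="orders",
)

# Reference mode: point at a stored connection by name and supply a query instead of a table.
referenced = DatabaseSettings(
    connection_mode="reference",
    database_connection_name="warehouse",
    query_mode="query",
    query="SELECT * FROM public.orders",
)
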
DatabaseWriteSettings pydantic-model

Bases: BaseModel

Defines settings for writing data to a database table.

Show JSON schema:
{
  "$defs": {
    "DatabaseConnection": {
      "description": "Defines the connection parameters for a database.",
      "properties": {
        "database_type": {
          "default": "postgresql",
          "title": "Database Type",
          "type": "string"
        },
        "username": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Username"
        },
        "password_ref": {
          "anyOf": [
            {
              "description": "An ID referencing an encrypted secret.",
              "maxLength": 100,
              "minLength": 1,
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Password Ref"
        },
        "host": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Host"
        },
        "port": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Port"
        },
        "database": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Database"
        },
        "url": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Url"
        }
      },
      "title": "DatabaseConnection",
      "type": "object"
    }
  },
  "description": "Defines settings for writing data to a database table.",
  "properties": {
    "connection_mode": {
      "anyOf": [
        {
          "enum": [
            "inline",
            "reference"
          ],
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "inline",
      "title": "Connection Mode"
    },
    "database_connection": {
      "anyOf": [
        {
          "$ref": "#/$defs/DatabaseConnection"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "database_connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Database Connection Name"
    },
    "table_name": {
      "title": "Table Name",
      "type": "string"
    },
    "schema_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Schema Name"
    },
    "if_exists": {
      "anyOf": [
        {
          "enum": [
            "append",
            "replace",
            "fail"
          ],
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "append",
      "title": "If Exists"
    }
  },
  "required": [
    "table_name"
  ],
  "title": "DatabaseWriteSettings",
  "type": "object"
}

Fields:

  • connection_mode (Literal['inline', 'reference'] | None)
  • database_connection (DatabaseConnection | None)
  • database_connection_name (str | None)
  • table_name (str)
  • schema_name (str | None)
  • if_exists (Literal['append', 'replace', 'fail'] | None)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 680-688
class DatabaseWriteSettings(BaseModel):
    """Defines settings for writing data to a database table."""

    connection_mode: Literal["inline", "reference"] | None = "inline"
    database_connection: DatabaseConnection | None = None
    database_connection_name: str | None = None
    table_name: str
    schema_name: str | None = None
    if_exists: Literal["append", "replace", "fail"] | None = "append"
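
A minimal sketch, assuming the module path flowfile_core.schemas.input_schema; the connection name and table are hypothetical:

from flowfile_core.schemas.input_schema import DatabaseWriteSettings

write_settings = DatabaseWriteSettings(
    connection_mode="reference",
    database_connection_name="warehouse",
    table_name="orders_out",
    schema_name="public",
    if_exists="replace",   # "append" (default), "replace", or "fail"
)
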
ExternalSource pydantic-model

Bases: BaseModel

Base model for data coming from a predefined external source.

Show JSON schema:
{
  "$defs": {
    "MinimalFieldInfo": {
      "description": "Represents the most basic information about a data field (column).",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "title": "Data Type",
          "type": "string"
        }
      },
      "required": [
        "name"
      ],
      "title": "MinimalFieldInfo",
      "type": "object"
    }
  },
  "description": "Base model for data coming from a predefined external source.",
  "properties": {
    "orientation": {
      "default": "row",
      "title": "Orientation",
      "type": "string"
    },
    "fields": {
      "anyOf": [
        {
          "items": {
            "$ref": "#/$defs/MinimalFieldInfo"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Fields"
    }
  },
  "title": "ExternalSource",
  "type": "object"
}

Fields:

  • orientation (str)
  • fields (list[MinimalFieldInfo] | None)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 717-721
class ExternalSource(BaseModel):
    """Base model for data coming from a predefined external source."""

    orientation: str = "row"
    fields: list[MinimalFieldInfo] | None = None
FullDatabaseConnection pydantic-model

Bases: BaseModel

A complete database connection model including the secret password.

Show JSON schema:
{
  "description": "A complete database connection model including the secret password.",
  "properties": {
    "connection_name": {
      "title": "Connection Name",
      "type": "string"
    },
    "database_type": {
      "default": "postgresql",
      "title": "Database Type",
      "type": "string"
    },
    "username": {
      "title": "Username",
      "type": "string"
    },
    "password": {
      "format": "password",
      "title": "Password",
      "type": "string",
      "writeOnly": true
    },
    "host": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Host"
    },
    "port": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Port"
    },
    "database": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Database"
    },
    "ssl_enabled": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Ssl Enabled"
    },
    "url": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Url"
    }
  },
  "required": [
    "connection_name",
    "username",
    "password"
  ],
  "title": "FullDatabaseConnection",
  "type": "object"
}

Fields:

  • connection_name (str)
  • database_type (str)
  • username (str)
  • password (SecretStr)
  • host (str | None)
  • port (int | None)
  • database (str | None)
  • ssl_enabled (bool | None)
  • url (str | None)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 626-637
class FullDatabaseConnection(BaseModel):
    """A complete database connection model including the secret password."""

    connection_name: str
    database_type: str = "postgresql"
    username: str
    password: SecretStr
    host: str | None = None
    port: int | None = None
    database: str | None = None
    ssl_enabled: bool | None = False
    url: str | None = None
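
A minimal sketch, assuming the module path flowfile_core.schemas.input_schema; the credentials are hypothetical. Because password is a Pydantic SecretStr, it stays masked unless explicitly unwrapped:

from flowfile_core.schemas.input_schema import FullDatabaseConnection

full_conn = FullDatabaseConnection(
    connection_name="warehouse",
    username="analytics",
    password="s3cret",
    host="db.internal",
    port=5432,
    database="warehouse",
)
print(full_conn.password)                      # prints **********
print(full_conn.password.get_secret_value())   # reveals the raw value
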
FullDatabaseConnectionInterface pydantic-model

Bases: BaseModel

A database connection model intended for UI display, omitting the password.

Show JSON schema:
{
  "description": "A database connection model intended for UI display, omitting the password.",
  "properties": {
    "connection_name": {
      "title": "Connection Name",
      "type": "string"
    },
    "database_type": {
      "default": "postgresql",
      "title": "Database Type",
      "type": "string"
    },
    "username": {
      "title": "Username",
      "type": "string"
    },
    "host": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Host"
    },
    "port": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Port"
    },
    "database": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Database"
    },
    "ssl_enabled": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Ssl Enabled"
    },
    "url": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Url"
    }
  },
  "required": [
    "connection_name",
    "username"
  ],
  "title": "FullDatabaseConnectionInterface",
  "type": "object"
}

Fields:

  • connection_name (str)
  • database_type (str)
  • username (str)
  • host (str | None)
  • port (int | None)
  • database (str | None)
  • ssl_enabled (bool | None)
  • url (str | None)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 640-650
class FullDatabaseConnectionInterface(BaseModel):
    """A database connection model intended for UI display, omitting the password."""

    connection_name: str
    database_type: str = "postgresql"
    username: str
    host: str | None = None
    port: int | None = None
    database: str | None = None
    ssl_enabled: bool | None = False
    url: str | None = None
InputCsvTable pydantic-model

Bases: InputTableBase

Defines settings for reading a CSV file.

Show JSON schema:
{
  "description": "Defines settings for reading a CSV file.",
  "properties": {
    "file_type": {
      "const": "csv",
      "default": "csv",
      "enum": [
        "csv"
      ],
      "title": "File Type",
      "type": "string"
    },
    "reference": {
      "default": "",
      "title": "Reference",
      "type": "string"
    },
    "starting_from_line": {
      "default": 0,
      "title": "Starting From Line",
      "type": "integer"
    },
    "delimiter": {
      "default": ",",
      "title": "Delimiter",
      "type": "string"
    },
    "has_headers": {
      "default": true,
      "title": "Has Headers",
      "type": "boolean"
    },
    "encoding": {
      "default": "utf-8",
      "title": "Encoding",
      "type": "string"
    },
    "parquet_ref": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Parquet Ref"
    },
    "row_delimiter": {
      "default": "\n",
      "title": "Row Delimiter",
      "type": "string"
    },
    "quote_char": {
      "default": "\"",
      "title": "Quote Char",
      "type": "string"
    },
    "infer_schema_length": {
      "default": 10000,
      "title": "Infer Schema Length",
      "type": "integer"
    },
    "truncate_ragged_lines": {
      "default": false,
      "title": "Truncate Ragged Lines",
      "type": "boolean"
    },
    "ignore_errors": {
      "default": false,
      "title": "Ignore Errors",
      "type": "boolean"
    }
  },
  "title": "InputCsvTable",
  "type": "object"
}

Fields:

  • file_type (Literal['csv'])
  • reference (str)
  • starting_from_line (int)
  • delimiter (str)
  • has_headers (bool)
  • encoding (str)
  • parquet_ref (str | None)
  • row_delimiter (str)
  • quote_char (str)
  • infer_schema_length (int)
  • truncate_ragged_lines (bool)
  • ignore_errors (bool)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 112-126
class InputCsvTable(InputTableBase):
    """Defines settings for reading a CSV file."""

    file_type: Literal["csv"] = "csv"
    reference: str = ""
    starting_from_line: int = 0
    delimiter: str = ","
    has_headers: bool = True
    encoding: str = "utf-8"
    parquet_ref: str | None = None
    row_delimiter: str = "\n"
    quote_char: str = '"'
    infer_schema_length: int = 10_000
    truncate_ragged_lines: bool = False
    ignore_errors: bool = False
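
A minimal sketch, assuming the module path flowfile_core.schemas.input_schema; the file path is hypothetical:

from flowfile_core.schemas.input_schema import InputCsvTable

csv_settings = InputCsvTable(
    reference="data/orders.csv",
    delimiter=";",
    has_headers=True,
    encoding="utf-8",
)
print(csv_settings.file_type)  # "csv"
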
InputExcelTable pydantic-model

Bases: InputTableBase

Defines settings for reading an Excel file.

Show JSON schema:
{
  "description": "Defines settings for reading an Excel file.",
  "properties": {
    "file_type": {
      "const": "excel",
      "default": "excel",
      "enum": [
        "excel"
      ],
      "title": "File Type",
      "type": "string"
    },
    "sheet_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Sheet Name"
    },
    "start_row": {
      "default": 0,
      "title": "Start Row",
      "type": "integer"
    },
    "start_column": {
      "default": 0,
      "title": "Start Column",
      "type": "integer"
    },
    "end_row": {
      "default": 0,
      "title": "End Row",
      "type": "integer"
    },
    "end_column": {
      "default": 0,
      "title": "End Column",
      "type": "integer"
    },
    "has_headers": {
      "default": true,
      "title": "Has Headers",
      "type": "boolean"
    },
    "type_inference": {
      "default": false,
      "title": "Type Inference",
      "type": "boolean"
    }
  },
  "title": "InputExcelTable",
  "type": "object"
}

Fields:

  • file_type (Literal['excel'])
  • sheet_name (str | None)
  • start_row (int)
  • start_column (int)
  • end_row (int)
  • end_column (int)
  • has_headers (bool)
  • type_inference (bool)

Validators:

  • validate_range_values
Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 141-163
class InputExcelTable(InputTableBase):
    """Defines settings for reading an Excel file."""

    file_type: Literal["excel"] = "excel"
    sheet_name: str | None = None
    start_row: int = 0
    start_column: int = 0
    end_row: int = 0
    end_column: int = 0
    has_headers: bool = True
    type_inference: bool = False

    @model_validator(mode="after")
    def validate_range_values(self):
        """Validates that the Excel cell range is logical."""
        for attribute in [self.start_row, self.start_column, self.end_row, self.end_column]:
            if not isinstance(attribute, int) or attribute < 0:
                raise ValueError("Row and column indices must be non-negative integers")
        if (self.end_row > 0 and self.start_row > self.end_row) or (
            self.end_column > 0 and self.start_column > self.end_column
        ):
            raise ValueError("Start row/column must not be greater than end row/column")
        return self
validate_range_values() pydantic-validator

Validates that the Excel cell range is logical.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 153-163
@model_validator(mode="after")
def validate_range_values(self):
    """Validates that the Excel cell range is logical."""
    for attribute in [self.start_row, self.start_column, self.end_row, self.end_column]:
        if not isinstance(attribute, int) or attribute < 0:
            raise ValueError("Row and column indices must be non-negative integers")
    if (self.end_row > 0 and self.start_row > self.end_row) or (
        self.end_column > 0 and self.start_column > self.end_column
    ):
        raise ValueError("Start row/column must not be greater than end row/column")
    return self
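
A short sketch of the range validation in practice, assuming the module path flowfile_core.schemas.input_schema:

from pydantic import ValidationError
from flowfile_core.schemas.input_schema import InputExcelTable

ok = InputExcelTable(sheet_name="Sheet1", start_row=1, end_row=100)

try:
    InputExcelTable(start_row=10, end_row=5)  # start row after end row
except ValidationError as exc:
    print(exc.errors()[0]["msg"])  # reports the invalid start/end range
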
InputJsonTable pydantic-model

Bases: InputCsvTable

Defines settings for reading a JSON file.

Show JSON schema:
{
  "description": "Defines settings for reading a JSON file.",
  "properties": {
    "file_type": {
      "const": "json",
      "default": "json",
      "enum": [
        "json"
      ],
      "title": "File Type",
      "type": "string"
    },
    "reference": {
      "default": "",
      "title": "Reference",
      "type": "string"
    },
    "starting_from_line": {
      "default": 0,
      "title": "Starting From Line",
      "type": "integer"
    },
    "delimiter": {
      "default": ",",
      "title": "Delimiter",
      "type": "string"
    },
    "has_headers": {
      "default": true,
      "title": "Has Headers",
      "type": "boolean"
    },
    "encoding": {
      "default": "utf-8",
      "title": "Encoding",
      "type": "string"
    },
    "parquet_ref": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Parquet Ref"
    },
    "row_delimiter": {
      "default": "\n",
      "title": "Row Delimiter",
      "type": "string"
    },
    "quote_char": {
      "default": "\"",
      "title": "Quote Char",
      "type": "string"
    },
    "infer_schema_length": {
      "default": 10000,
      "title": "Infer Schema Length",
      "type": "integer"
    },
    "truncate_ragged_lines": {
      "default": false,
      "title": "Truncate Ragged Lines",
      "type": "boolean"
    },
    "ignore_errors": {
      "default": false,
      "title": "Ignore Errors",
      "type": "boolean"
    }
  },
  "title": "InputJsonTable",
  "type": "object"
}

Fields:

  • reference (str)
  • starting_from_line (int)
  • delimiter (str)
  • has_headers (bool)
  • encoding (str)
  • parquet_ref (str | None)
  • row_delimiter (str)
  • quote_char (str)
  • infer_schema_length (int)
  • truncate_ragged_lines (bool)
  • ignore_errors (bool)
  • file_type (Literal['json'])
Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 129-132
class InputJsonTable(InputCsvTable):
    """Defines settings for reading a JSON file."""

    file_type: Literal["json"] = "json"
InputParquetTable pydantic-model

Bases: InputTableBase

Defines settings for reading a Parquet file.

Show JSON schema:
{
  "description": "Defines settings for reading a Parquet file.",
  "properties": {
    "file_type": {
      "const": "parquet",
      "default": "parquet",
      "enum": [
        "parquet"
      ],
      "title": "File Type",
      "type": "string"
    }
  },
  "title": "InputParquetTable",
  "type": "object"
}

Fields:

  • file_type (Literal['parquet'])
Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 135-138
class InputParquetTable(InputTableBase):
    """Defines settings for reading a Parquet file."""

    file_type: Literal["parquet"] = "parquet"
InputTableBase pydantic-model

Bases: BaseModel

Base settings for input file operations.

Show JSON schema:
{
  "description": "Base settings for input file operations.",
  "properties": {
    "file_type": {
      "title": "File Type",
      "type": "string"
    }
  },
  "required": [
    "file_type"
  ],
  "title": "InputTableBase",
  "type": "object"
}

Fields:

  • file_type (str)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 106-109
class InputTableBase(BaseModel):
    """Base settings for input file operations."""

    file_type: str  # Will be overridden with Literal in subclasses
MinimalFieldInfo pydantic-model

Bases: BaseModel

Represents the most basic information about a data field (column).

Show JSON schema:
{
  "description": "Represents the most basic information about a data field (column).",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "data_type": {
      "default": "String",
      "title": "Data Type",
      "type": "string"
    }
  },
  "required": [
    "name"
  ],
  "title": "MinimalFieldInfo",
  "type": "object"
}

Fields:

  • name (str)
  • data_type (str)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 77-81
class MinimalFieldInfo(BaseModel):
    """Represents the most basic information about a data field (column)."""

    name: str
    data_type: str = "String"
NewDirectory pydantic-model

Bases: BaseModel

Defines the information required to create a new directory.

Show JSON schema:
{
  "description": "Defines the information required to create a new directory.",
  "properties": {
    "source_path": {
      "title": "Source Path",
      "type": "string"
    },
    "dir_name": {
      "title": "Dir Name",
      "type": "string"
    }
  },
  "required": [
    "source_path",
    "dir_name"
  ],
  "title": "NewDirectory",
  "type": "object"
}

Fields:

  • source_path (str)
  • dir_name (str)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 56-60
class NewDirectory(BaseModel):
    """Defines the information required to create a new directory."""

    source_path: str
    dir_name: str
NodeBase pydantic-model

Bases: BaseModel

Base model for all nodes in a FlowGraph. Contains common metadata.

Show JSON schema:
{
  "$defs": {
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "Base model for all nodes in a FlowGraph. Contains common metadata.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    }
  },
  "required": [
    "flow_id",
    "node_id"
  ],
  "title": "NodeBase",
  "type": "object"
}

Config:

  • arbitrary_types_allowed: True

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)

Validators:

  • validate_node_reference
Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 349-378
class NodeBase(BaseModel):
    """Base model for all nodes in a FlowGraph. Contains common metadata."""

    model_config = ConfigDict(arbitrary_types_allowed=True)
    flow_id: int
    node_id: int
    cache_results: bool | None = False
    pos_x: float | None = 0
    pos_y: float | None = 0
    is_setup: bool | None = True
    description: str | None = ""
    node_reference: str | None = None  # Unique reference identifier for code generation (lowercase, no spaces)
    user_id: int | None = None
    is_flow_output: bool | None = False
    is_user_defined: bool | None = False  # Indicator if the node is a user defined node
    output_field_config: OutputFieldConfig | None = None

    @field_validator("node_reference", mode="before")
    @classmethod
    def validate_node_reference(cls, v):
        """Validates that node_reference is lowercase and contains no spaces."""
        if v is None or v == "":
            return None
        if not isinstance(v, str):
            raise ValueError("node_reference must be a string")
        if " " in v:
            raise ValueError("node_reference cannot contain spaces")
        if v != v.lower():
            raise ValueError("node_reference must be lowercase")
        return v
validate_node_reference(v) pydantic-validator

Validates that node_reference is lowercase and contains no spaces.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 366-378
@field_validator("node_reference", mode="before")
@classmethod
def validate_node_reference(cls, v):
    """Validates that node_reference is lowercase and contains no spaces."""
    if v is None or v == "":
        return None
    if not isinstance(v, str):
        raise ValueError("node_reference must be a string")
    if " " in v:
        raise ValueError("node_reference cannot contain spaces")
    if v != v.lower():
        raise ValueError("node_reference must be lowercase")
    return v
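
A short sketch of how the node_reference validation behaves, assuming the module path flowfile_core.schemas.input_schema:

from pydantic import ValidationError
from flowfile_core.schemas.input_schema import NodeBase

node = NodeBase(flow_id=1, node_id=1, node_reference="filter_orders")

try:
    NodeBase(flow_id=1, node_id=2, node_reference="Filter Orders")
except ValidationError as exc:
    print(exc.errors()[0]["msg"])  # rejects spaces and uppercase characters
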
NodeCloudStorageReader pydantic-model

Bases: NodeBase

Settings for a node that reads from a cloud storage service (S3, GCS, etc.).

Show JSON schema:
{
  "$defs": {
    "CloudStorageReadSettings": {
      "description": "Settings for reading from cloud storage",
      "properties": {
        "auth_mode": {
          "default": "auto",
          "enum": [
            "access_key",
            "iam_role",
            "service_principal",
            "managed_identity",
            "sas_token",
            "aws-cli",
            "env_vars"
          ],
          "title": "Auth Mode",
          "type": "string"
        },
        "connection_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Connection Name"
        },
        "resource_path": {
          "title": "Resource Path",
          "type": "string"
        },
        "scan_mode": {
          "default": "single_file",
          "enum": [
            "single_file",
            "directory"
          ],
          "title": "Scan Mode",
          "type": "string"
        },
        "file_format": {
          "default": "parquet",
          "enum": [
            "csv",
            "parquet",
            "json",
            "delta",
            "iceberg"
          ],
          "title": "File Format",
          "type": "string"
        },
        "csv_has_header": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": true,
          "title": "Csv Has Header"
        },
        "csv_delimiter": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": ",",
          "title": "Csv Delimiter"
        },
        "csv_encoding": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": "utf8",
          "title": "Csv Encoding"
        },
        "delta_version": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Delta Version"
        }
      },
      "required": [
        "resource_path"
      ],
      "title": "CloudStorageReadSettings",
      "type": "object"
    },
    "MinimalFieldInfo": {
      "description": "Represents the most basic information about a data field (column).",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "title": "Data Type",
          "type": "string"
        }
      },
      "required": [
        "name"
      ],
      "title": "MinimalFieldInfo",
      "type": "object"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "Settings for a node that reads from a cloud storage service (S3, GCS, etc.).",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "cloud_storage_settings": {
      "$ref": "#/$defs/CloudStorageReadSettings"
    },
    "fields": {
      "anyOf": [
        {
          "items": {
            "$ref": "#/$defs/MinimalFieldInfo"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Fields"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "cloud_storage_settings"
  ],
  "title": "NodeCloudStorageReader",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • cloud_storage_settings (CloudStorageReadSettings)
  • fields (list[MinimalFieldInfo] | None)

Validators:

  • validate_node_reference (inherited from NodeBase)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 704-708
class NodeCloudStorageReader(NodeBase):
    """Settings for a node that reads from a cloud storage service (S3, GCS, etc.)."""

    cloud_storage_settings: CloudStorageReadSettings
    fields: list[MinimalFieldInfo] | None = None
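
Example: a minimal sketch of a cloud storage reader node. All values are placeholders; the import assumes CloudStorageReadSettings is reachable from the module that defines the node (it is referenced unqualified in the source above), and the settings fields used here (resource_path, file_format) are assumptions made by analogy with CloudStorageWriteSettings, so verify them against the CloudStorageReadSettings schema.

from flowfile_core.schemas.input_schema import NodeCloudStorageReader, CloudStorageReadSettings

# Hypothetical bucket path and settings fields - check against the CloudStorageReadSettings schema.
reader_node = NodeCloudStorageReader(
    flow_id=1,
    node_id=1,
    cloud_storage_settings=CloudStorageReadSettings(
        resource_path="s3://example-bucket/raw/orders.parquet",
        file_format="parquet",
    ),
)
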
NodeCloudStorageWriter pydantic-model

Bases: NodeSingleInput

Settings for a node that writes to a cloud storage service.

Show JSON schema:
{
  "$defs": {
    "CloudStorageWriteSettings": {
      "description": "Settings for writing to cloud storage",
      "properties": {
        "resource_path": {
          "title": "Resource Path",
          "type": "string"
        },
        "write_mode": {
          "default": "overwrite",
          "enum": [
            "overwrite",
            "append"
          ],
          "title": "Write Mode",
          "type": "string"
        },
        "file_format": {
          "default": "parquet",
          "enum": [
            "csv",
            "parquet",
            "json",
            "delta"
          ],
          "title": "File Format",
          "type": "string"
        },
        "parquet_compression": {
          "default": "snappy",
          "enum": [
            "snappy",
            "gzip",
            "brotli",
            "lz4",
            "zstd"
          ],
          "title": "Parquet Compression",
          "type": "string"
        },
        "csv_delimiter": {
          "default": ",",
          "title": "Csv Delimiter",
          "type": "string"
        },
        "csv_encoding": {
          "default": "utf8",
          "title": "Csv Encoding",
          "type": "string"
        },
        "auth_mode": {
          "default": "auto",
          "enum": [
            "access_key",
            "iam_role",
            "service_principal",
            "managed_identity",
            "sas_token",
            "aws-cli",
            "env_vars"
          ],
          "title": "Auth Mode",
          "type": "string"
        },
        "connection_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Connection Name"
        }
      },
      "required": [
        "resource_path"
      ],
      "title": "CloudStorageWriteSettings",
      "type": "object"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "Settings for a node that writes to a cloud storage service.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    },
    "cloud_storage_settings": {
      "$ref": "#/$defs/CloudStorageWriteSettings"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "cloud_storage_settings"
  ],
  "title": "NodeCloudStorageWriter",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)
  • cloud_storage_settings (CloudStorageWriteSettings)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 711-714
class NodeCloudStorageWriter(NodeSingleInput):
    """Settings for a node that writes to a cloud storage service."""

    cloud_storage_settings: CloudStorageWriteSettings
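
Example: a minimal sketch of a cloud storage writer node, using only fields shown in the schema above. The values are placeholders, and the import assumes CloudStorageWriteSettings is reachable from the same module that defines the node.

from flowfile_core.schemas.input_schema import NodeCloudStorageWriter, CloudStorageWriteSettings

writer_node = NodeCloudStorageWriter(
    flow_id=1,
    node_id=10,
    depending_on_id=9,  # id of the upstream node supplying the data (placeholder)
    cloud_storage_settings=CloudStorageWriteSettings(
        resource_path="s3://example-bucket/output/orders.parquet",  # placeholder destination
        write_mode="overwrite",      # or "append"
        file_format="parquet",       # "csv", "json" and "delta" are also supported
        parquet_compression="zstd",  # defaults to "snappy"
        auth_mode="aws-cli",         # one of the auth modes listed in the schema
    ),
)
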
NodeConnection pydantic-model

Bases: BaseModel

Represents a connection (edge) between two nodes in the graph.

Show JSON schema:
{
  "$defs": {
    "NodeInputConnection": {
      "description": "Represents the input side of a connection between two nodes.",
      "properties": {
        "node_id": {
          "title": "Node Id",
          "type": "integer"
        },
        "connection_class": {
          "enum": [
            "input-0",
            "input-1",
            "input-2",
            "input-3",
            "input-4",
            "input-5",
            "input-6",
            "input-7",
            "input-8",
            "input-9"
          ],
          "title": "Connection Class",
          "type": "string"
        }
      },
      "required": [
        "node_id",
        "connection_class"
      ],
      "title": "NodeInputConnection",
      "type": "object"
    },
    "NodeOutputConnection": {
      "description": "Represents the output side of a connection between two nodes.",
      "properties": {
        "node_id": {
          "title": "Node Id",
          "type": "integer"
        },
        "connection_class": {
          "enum": [
            "output-0",
            "output-1",
            "output-2",
            "output-3",
            "output-4",
            "output-5",
            "output-6",
            "output-7",
            "output-8",
            "output-9"
          ],
          "title": "Connection Class",
          "type": "string"
        }
      },
      "required": [
        "node_id",
        "connection_class"
      ],
      "title": "NodeOutputConnection",
      "type": "object"
    }
  },
  "description": "Represents a connection (edge) between two nodes in the graph.",
  "properties": {
    "input_connection": {
      "$ref": "#/$defs/NodeInputConnection"
    },
    "output_connection": {
      "$ref": "#/$defs/NodeOutputConnection"
    }
  },
  "required": [
    "input_connection",
    "output_connection"
  ],
  "title": "NodeConnection",
  "type": "object"
}

Fields:

  • input_connection (NodeInputConnection)
  • output_connection (NodeOutputConnection)

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 831-851
class NodeConnection(BaseModel):
    """Represents a connection (edge) between two nodes in the graph."""

    input_connection: NodeInputConnection
    output_connection: NodeOutputConnection

    @classmethod
    def create_from_simple_input(cls, from_id: int, to_id: int, input_type: InputType = "input-0"):
        """Creates a standard connection between two nodes."""
        match input_type:
            case "main":
                connection_class: InputConnectionClass = "input-0"
            case "right":
                connection_class: InputConnectionClass = "input-1"
            case "left":
                connection_class: InputConnectionClass = "input-2"
            case _:
                connection_class: InputConnectionClass = "input-0"
        node_input = NodeInputConnection(node_id=to_id, connection_class=connection_class)
        node_output = NodeOutputConnection(node_id=from_id, connection_class="output-0")
        return cls(input_connection=node_input, output_connection=node_output)
create_from_simple_input(from_id, to_id, input_type='input-0') classmethod

Creates a standard connection between two nodes.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 837-851
@classmethod
def create_from_simple_input(cls, from_id: int, to_id: int, input_type: InputType = "input-0"):
    """Creates a standard connection between two nodes."""
    match input_type:
        case "main":
            connection_class: InputConnectionClass = "input-0"
        case "right":
            connection_class: InputConnectionClass = "input-1"
        case "left":
            connection_class: InputConnectionClass = "input-2"
        case _:
            connection_class: InputConnectionClass = "input-0"
    node_input = NodeInputConnection(node_id=to_id, connection_class=connection_class)
    node_output = NodeOutputConnection(node_id=from_id, connection_class="output-0")
    return cls(input_connection=node_input, output_connection=node_output)
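
Usage example for create_from_simple_input, based directly on the source above; the node ids are placeholders.

from flowfile_core.schemas.input_schema import NodeConnection

# Connect the output of node 3 to the main input of node 4.
conn = NodeConnection.create_from_simple_input(from_id=3, to_id=4, input_type="main")
assert conn.output_connection.node_id == 3
assert conn.input_connection.connection_class == "input-0"

# For two-input nodes such as joins, the second stream is typically wired to the "right" input.
right_conn = NodeConnection.create_from_simple_input(from_id=5, to_id=4, input_type="right")
assert right_conn.input_connection.connection_class == "input-1"
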
NodeCrossJoin pydantic-model

Bases: NodeMultiInput

Settings for a node that performs a cross join.

Show JSON schema:
{
  "$defs": {
    "CrossJoinInput": {
      "description": "Data model for cross join operations.",
      "properties": {
        "left_select": {
          "$ref": "#/$defs/JoinInputs"
        },
        "right_select": {
          "$ref": "#/$defs/JoinInputs"
        }
      },
      "required": [
        "left_select",
        "right_select"
      ],
      "title": "CrossJoinInput",
      "type": "object"
    },
    "JoinInputs": {
      "description": "Data model for join-specific select inputs (extends SelectInputs).",
      "properties": {
        "renames": {
          "items": {
            "$ref": "#/$defs/SelectInput"
          },
          "title": "Renames",
          "type": "array"
        }
      },
      "title": "JoinInputs",
      "type": "object"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    },
    "SelectInput": {
      "description": "Defines how a single column should be selected, renamed, or type-cast.\n\nThis is a core building block for any operation that involves column manipulation.\nIt holds all the configuration for a single field in a selection operation.",
      "properties": {
        "old_name": {
          "title": "Old Name",
          "type": "string"
        },
        "original_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Original Position"
        },
        "new_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "New Name"
        },
        "data_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Data Type"
        },
        "data_type_change": {
          "default": false,
          "title": "Data Type Change",
          "type": "boolean"
        },
        "join_key": {
          "default": false,
          "title": "Join Key",
          "type": "boolean"
        },
        "is_altered": {
          "default": false,
          "title": "Is Altered",
          "type": "boolean"
        },
        "position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Position"
        },
        "is_available": {
          "default": true,
          "title": "Is Available",
          "type": "boolean"
        },
        "keep": {
          "default": true,
          "title": "Keep",
          "type": "boolean"
        }
      },
      "required": [
        "old_name"
      ],
      "title": "SelectInput",
      "type": "object"
    }
  },
  "description": "Settings for a node that performs a cross join.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_ids": {
      "anyOf": [
        {
          "items": {
            "type": "integer"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "title": "Depending On Ids"
    },
    "auto_generate_selection": {
      "default": true,
      "title": "Auto Generate Selection",
      "type": "boolean"
    },
    "verify_integrity": {
      "default": true,
      "title": "Verify Integrity",
      "type": "boolean"
    },
    "cross_join_input": {
      "$ref": "#/$defs/CrossJoinInput"
    },
    "auto_keep_all": {
      "default": true,
      "title": "Auto Keep All",
      "type": "boolean"
    },
    "auto_keep_right": {
      "default": true,
      "title": "Auto Keep Right",
      "type": "boolean"
    },
    "auto_keep_left": {
      "default": true,
      "title": "Auto Keep Left",
      "type": "boolean"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "cross_join_input"
  ],
  "title": "NodeCrossJoin",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_ids (list[int] | None)
  • auto_generate_selection (bool)
  • verify_integrity (bool)
  • cross_join_input (CrossJoinInput)
  • auto_keep_all (bool)
  • auto_keep_right (bool)
  • auto_keep_left (bool)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 493-528
class NodeCrossJoin(NodeMultiInput):
    """Settings for a node that performs a cross join."""

    auto_generate_selection: bool = True
    verify_integrity: bool = True
    cross_join_input: transform_schema.CrossJoinInput
    auto_keep_all: bool = True
    auto_keep_right: bool = True
    auto_keep_left: bool = True

    def to_yaml_dict(self) -> NodeCrossJoinYaml:
        """Converts the cross join node settings to a dictionary for YAML serialization."""
        result: NodeCrossJoinYaml = {
            "cache_results": self.cache_results,
            "auto_generate_selection": self.auto_generate_selection,
            "verify_integrity": self.verify_integrity,
            "cross_join_input": self.cross_join_input.to_yaml_dict(),
            "auto_keep_all": self.auto_keep_all,
            "auto_keep_right": self.auto_keep_right,
            "auto_keep_left": self.auto_keep_left,
        }
        if self.output_field_config:
            result["output_field_config"] = {
                "enabled": self.output_field_config.enabled,
                "validation_mode_behavior": self.output_field_config.validation_mode_behavior,
                "validate_data_types": self.output_field_config.validate_data_types,
                "fields": [
                    {
                        "name": f.name,
                        "data_type": f.data_type,
                        "default_value": f.default_value,
                    }
                    for f in self.output_field_config.fields
                ],
            }
        return result
to_yaml_dict()

Converts the cross join node settings to a dictionary for YAML serialization.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 503-528
def to_yaml_dict(self) -> NodeCrossJoinYaml:
    """Converts the cross join node settings to a dictionary for YAML serialization."""
    result: NodeCrossJoinYaml = {
        "cache_results": self.cache_results,
        "auto_generate_selection": self.auto_generate_selection,
        "verify_integrity": self.verify_integrity,
        "cross_join_input": self.cross_join_input.to_yaml_dict(),
        "auto_keep_all": self.auto_keep_all,
        "auto_keep_right": self.auto_keep_right,
        "auto_keep_left": self.auto_keep_left,
    }
    if self.output_field_config:
        result["output_field_config"] = {
            "enabled": self.output_field_config.enabled,
            "validation_mode_behavior": self.output_field_config.validation_mode_behavior,
            "validate_data_types": self.output_field_config.validate_data_types,
            "fields": [
                {
                    "name": f.name,
                    "data_type": f.data_type,
                    "default_value": f.default_value,
                }
                for f in self.output_field_config.fields
            ],
        }
    return result
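
Example: building a cross join node and serializing it with to_yaml_dict. This is a sketch: the transform_schema import relies on the alias used in the source above, and the column names and node ids are placeholders.

from flowfile_core.schemas.input_schema import NodeCrossJoin, transform_schema

cross_join = NodeCrossJoin(
    flow_id=1,
    node_id=7,
    depending_on_ids=[5, 6],  # upstream node ids (placeholders)
    cross_join_input=transform_schema.CrossJoinInput(
        left_select=transform_schema.JoinInputs(
            renames=[transform_schema.SelectInput(old_name="order_id")]
        ),
        right_select=transform_schema.JoinInputs(
            renames=[transform_schema.SelectInput(old_name="price", new_name="unit_price")]
        ),
    ),
)

yaml_dict = cross_join.to_yaml_dict()  # plain dict, ready to pass to a YAML dumper
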
NodeDatabaseReader pydantic-model

Bases: NodeBase

Settings for a node that reads from a database.

Show JSON schema:
{
  "$defs": {
    "DatabaseConnection": {
      "description": "Defines the connection parameters for a database.",
      "properties": {
        "database_type": {
          "default": "postgresql",
          "title": "Database Type",
          "type": "string"
        },
        "username": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Username"
        },
        "password_ref": {
          "anyOf": [
            {
              "description": "An ID referencing an encrypted secret.",
              "maxLength": 100,
              "minLength": 1,
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Password Ref"
        },
        "host": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Host"
        },
        "port": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Port"
        },
        "database": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Database"
        },
        "url": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Url"
        }
      },
      "title": "DatabaseConnection",
      "type": "object"
    },
    "DatabaseSettings": {
      "description": "Defines settings for reading from a database, either via table or query.",
      "properties": {
        "connection_mode": {
          "anyOf": [
            {
              "enum": [
                "inline",
                "reference"
              ],
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": "inline",
          "title": "Connection Mode"
        },
        "database_connection": {
          "anyOf": [
            {
              "$ref": "#/$defs/DatabaseConnection"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        },
        "database_connection_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Database Connection Name"
        },
        "schema_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Schema Name"
        },
        "table_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Table Name"
        },
        "query": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Query"
        },
        "query_mode": {
          "default": "table",
          "enum": [
            "query",
            "table",
            "reference"
          ],
          "title": "Query Mode",
          "type": "string"
        }
      },
      "title": "DatabaseSettings",
      "type": "object"
    },
    "MinimalFieldInfo": {
      "description": "Represents the most basic information about a data field (column).",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "title": "Data Type",
          "type": "string"
        }
      },
      "required": [
        "name"
      ],
      "title": "MinimalFieldInfo",
      "type": "object"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "Settings for a node that reads from a database.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "database_settings": {
      "$ref": "#/$defs/DatabaseSettings"
    },
    "fields": {
      "anyOf": [
        {
          "items": {
            "$ref": "#/$defs/MinimalFieldInfo"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Fields"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "database_settings"
  ],
  "title": "NodeDatabaseReader",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • database_settings (DatabaseSettings)
  • fields (list[MinimalFieldInfo] | None)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 691-695
class NodeDatabaseReader(NodeBase):
    """Settings for a node that reads from a database."""

    database_settings: DatabaseSettings
    fields: list[MinimalFieldInfo] | None = None
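
Example: a minimal sketch of a database reader node configured with an inline connection in table mode. The connection values are placeholders, and the imports assume DatabaseSettings and DatabaseConnection are reachable from the same module that defines the node.

from flowfile_core.schemas.input_schema import (
    NodeDatabaseReader,
    DatabaseSettings,
    DatabaseConnection,
)

reader = NodeDatabaseReader(
    flow_id=1,
    node_id=2,
    database_settings=DatabaseSettings(
        connection_mode="inline",
        database_connection=DatabaseConnection(
            database_type="postgresql",
            username="analytics",        # placeholder credentials
            host="db.example.internal",  # placeholder host
            port=5432,
            database="warehouse",
        ),
        query_mode="table",  # or "query", together with the query field
        schema_name="public",
        table_name="orders",
    ),
)
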
NodeDatabaseWriter pydantic-model

Bases: NodeSingleInput

Settings for a node that writes data to a database.

Show JSON schema:
{
  "$defs": {
    "DatabaseConnection": {
      "description": "Defines the connection parameters for a database.",
      "properties": {
        "database_type": {
          "default": "postgresql",
          "title": "Database Type",
          "type": "string"
        },
        "username": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Username"
        },
        "password_ref": {
          "anyOf": [
            {
              "description": "An ID referencing an encrypted secret.",
              "maxLength": 100,
              "minLength": 1,
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Password Ref"
        },
        "host": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Host"
        },
        "port": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Port"
        },
        "database": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Database"
        },
        "url": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Url"
        }
      },
      "title": "DatabaseConnection",
      "type": "object"
    },
    "DatabaseWriteSettings": {
      "description": "Defines settings for writing data to a database table.",
      "properties": {
        "connection_mode": {
          "anyOf": [
            {
              "enum": [
                "inline",
                "reference"
              ],
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": "inline",
          "title": "Connection Mode"
        },
        "database_connection": {
          "anyOf": [
            {
              "$ref": "#/$defs/DatabaseConnection"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        },
        "database_connection_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Database Connection Name"
        },
        "table_name": {
          "title": "Table Name",
          "type": "string"
        },
        "schema_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Schema Name"
        },
        "if_exists": {
          "anyOf": [
            {
              "enum": [
                "append",
                "replace",
                "fail"
              ],
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": "append",
          "title": "If Exists"
        }
      },
      "required": [
        "table_name"
      ],
      "title": "DatabaseWriteSettings",
      "type": "object"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "Settings for a node that writes data to a database.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    },
    "database_write_settings": {
      "$ref": "#/$defs/DatabaseWriteSettings"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "database_write_settings"
  ],
  "title": "NodeDatabaseWriter",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)
  • database_write_settings (DatabaseWriteSettings)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 698-701
class NodeDatabaseWriter(NodeSingleInput):
    """Settings for a node that writes data to a database."""

    database_write_settings: DatabaseWriteSettings
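
Example: a minimal sketch of a database writer node that appends to a table through a saved (referenced) connection. The connection name and ids are placeholders, and the import assumes DatabaseWriteSettings is reachable from the same module that defines the node.

from flowfile_core.schemas.input_schema import NodeDatabaseWriter, DatabaseWriteSettings

writer = NodeDatabaseWriter(
    flow_id=1,
    node_id=8,
    depending_on_id=7,  # upstream node id (placeholder)
    database_write_settings=DatabaseWriteSettings(
        connection_mode="reference",
        database_connection_name="warehouse_prod",  # placeholder saved-connection name
        schema_name="public",
        table_name="daily_orders",
        if_exists="append",  # "replace" and "fail" are also accepted
    ),
)
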
NodeDatasource pydantic-model

Bases: NodeBase

Base settings for a node that acts as a data source.

Show JSON schema:
{
  "$defs": {
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "Base settings for a node that acts as a data source.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "file_ref": {
      "default": null,
      "title": "File Ref",
      "type": "string"
    }
  },
  "required": [
    "flow_id",
    "node_id"
  ],
  "title": "NodeDatasource",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • file_ref (str)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 564-567
class NodeDatasource(NodeBase):
    """Base settings for a node that acts as a data source."""

    file_ref: str = None
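
Example: a minimal sketch of a datasource node pointing at a local file reference; the path is a placeholder.

from flowfile_core.schemas.input_schema import NodeDatasource

datasource = NodeDatasource(flow_id=1, node_id=1, file_ref="data/customers.csv")
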
NodeDescription pydantic-model

Bases: BaseModel

A simple model for updating a node's description text.

Show JSON schema:
{
  "description": "A simple model for updating a node's description text.",
  "properties": {
    "description": {
      "default": "",
      "title": "Description",
      "type": "string"
    }
  },
  "title": "NodeDescription",
  "type": "object"
}

Fields:

  • description (str)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 854-857
class NodeDescription(BaseModel):
    """A simple model for updating a node's description text."""

    description: str = ""
NodeExploreData pydantic-model

Bases: NodeBase

Settings for a node that provides an interactive data exploration interface.

Show JSON schema:
{
  "$defs": {
    "DataModel": {
      "properties": {
        "data": {
          "items": {
            "type": "object"
          },
          "title": "Data",
          "type": "array"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/MutField"
          },
          "title": "Fields",
          "type": "array"
        }
      },
      "required": [
        "data",
        "fields"
      ],
      "title": "DataModel",
      "type": "object"
    },
    "GraphicWalkerInput": {
      "properties": {
        "dataModel": {
          "$ref": "#/$defs/DataModel"
        },
        "is_initial": {
          "default": true,
          "title": "Is Initial",
          "type": "boolean"
        },
        "specList": {
          "anyOf": [
            {
              "items": {},
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Speclist"
        }
      },
      "title": "GraphicWalkerInput",
      "type": "object"
    },
    "MutField": {
      "properties": {
        "fid": {
          "title": "Fid",
          "type": "string"
        },
        "key": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Key"
        },
        "name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Name"
        },
        "basename": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Basename"
        },
        "disable": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": false,
          "title": "Disable"
        },
        "semanticType": {
          "title": "Semantictype",
          "type": "string"
        },
        "analyticType": {
          "enum": [
            "measure",
            "dimension"
          ],
          "title": "Analytictype",
          "type": "string"
        },
        "path": {
          "anyOf": [
            {
              "items": {
                "type": "string"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Path"
        },
        "offset": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Offset"
        }
      },
      "required": [
        "fid",
        "semanticType",
        "analyticType"
      ],
      "title": "MutField",
      "type": "object"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "Settings for a node that provides an interactive data exploration interface.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "graphic_walker_input": {
      "anyOf": [
        {
          "$ref": "#/$defs/GraphicWalkerInput"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    }
  },
  "required": [
    "flow_id",
    "node_id"
  ],
  "title": "NodeExploreData",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • graphic_walker_input (GraphicWalkerInput | None)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeExploreData(NodeBase):
    """Settings for a node that provides an interactive data exploration interface."""

    graphic_walker_input: gs_schemas.GraphicWalkerInput | None = None
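
Example: a minimal construction sketch. The import path is inferred from the source listing above; fields omitted here fall back to the defaults shown in the schema.

from flowfile_core.schemas.input_schema import NodeExploreData  # assumed import path

# Only the required identifiers are needed; the Graphic Walker
# configuration stays None until the UI supplies one.
explore_node = NodeExploreData(flow_id=1, node_id=10, description="Inspect the joined result")
print(explore_node.is_setup)              # True (schema default)
print(explore_node.graphic_walker_input)  # None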
NodeExternalSource pydantic-model

Bases: NodeBase

Settings for a node that connects to a registered external data source.

Show JSON schema:
{
  "$defs": {
    "MinimalFieldInfo": {
      "description": "Represents the most basic information about a data field (column).",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "title": "Data Type",
          "type": "string"
        }
      },
      "required": [
        "name"
      ],
      "title": "MinimalFieldInfo",
      "type": "object"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    },
    "SampleUsers": {
      "description": "Settings for generating a sample dataset of users.",
      "properties": {
        "orientation": {
          "default": "row",
          "title": "Orientation",
          "type": "string"
        },
        "fields": {
          "anyOf": [
            {
              "items": {
                "$ref": "#/$defs/MinimalFieldInfo"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Fields"
        },
        "SAMPLE_USERS": {
          "title": "Sample Users",
          "type": "boolean"
        },
        "class_name": {
          "default": "sample_users",
          "title": "Class Name",
          "type": "string"
        },
        "size": {
          "default": 100,
          "title": "Size",
          "type": "integer"
        }
      },
      "required": [
        "SAMPLE_USERS"
      ],
      "title": "SampleUsers",
      "type": "object"
    }
  },
  "description": "Settings for a node that connects to a registered external data source.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "identifier": {
      "title": "Identifier",
      "type": "string"
    },
    "source_settings": {
      "$ref": "#/$defs/SampleUsers"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "identifier",
    "source_settings"
  ],
  "title": "NodeExternalSource",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • identifier (str)
  • source_settings (SampleUsers)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeExternalSource(NodeBase):
    """Settings for a node that connects to a registered external data source."""

    identifier: str
    source_settings: SampleUsers
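
Example: a minimal sketch wiring the bundled SampleUsers source into the node. The import path is inferred from the source listing, the nested settings are passed as a plain dict that Pydantic coerces into SampleUsers, and the identifier value is illustrative.

from flowfile_core.schemas.input_schema import NodeExternalSource  # assumed import path

external_node = NodeExternalSource(
    flow_id=1,
    node_id=2,
    identifier="sample_users",                      # illustrative registration name
    source_settings={"SAMPLE_USERS": True, "size": 250},
)
print(external_node.source_settings.class_name)     # "sample_users" (schema default)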
NodeFilter pydantic-model

Bases: NodeSingleInput

Settings for a node that filters rows based on a condition.

Show JSON schema:
{
  "$defs": {
    "BasicFilter": {
      "description": "Defines a simple, single-condition filter (e.g., 'column' 'equals' 'value').\n\nAttributes:\n    field: The column name to filter on.\n    operator: The comparison operator (FilterOperator enum value or symbol).\n    value: The value to compare against.\n    value2: Second value for BETWEEN operator (optional).",
      "properties": {
        "field": {
          "default": "",
          "title": "Field",
          "type": "string"
        },
        "operator": {
          "anyOf": [
            {
              "$ref": "#/$defs/FilterOperator"
            },
            {
              "type": "string"
            }
          ],
          "default": "equals",
          "title": "Operator"
        },
        "value": {
          "default": "",
          "title": "Value",
          "type": "string"
        },
        "value2": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Value2"
        },
        "filter_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Filter Type"
        },
        "filter_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Filter Value"
        }
      },
      "title": "BasicFilter",
      "type": "object"
    },
    "FilterInput": {
      "description": "Defines the settings for a filter operation, supporting basic or advanced (expression-based) modes.\n\nAttributes:\n    mode: The filter mode - \"basic\" or \"advanced\".\n    basic_filter: The basic filter configuration (used when mode=\"basic\").\n    advanced_filter: The advanced filter expression string (used when mode=\"advanced\").",
      "properties": {
        "mode": {
          "default": "basic",
          "enum": [
            "basic",
            "advanced"
          ],
          "title": "Mode",
          "type": "string"
        },
        "basic_filter": {
          "anyOf": [
            {
              "$ref": "#/$defs/BasicFilter"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        },
        "advanced_filter": {
          "default": "",
          "title": "Advanced Filter",
          "type": "string"
        },
        "filter_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Filter Type"
        }
      },
      "title": "FilterInput",
      "type": "object"
    },
    "FilterOperator": {
      "description": "Supported filter comparison operators.",
      "enum": [
        "equals",
        "not_equals",
        "greater_than",
        "greater_than_or_equals",
        "less_than",
        "less_than_or_equals",
        "contains",
        "not_contains",
        "starts_with",
        "ends_with",
        "is_null",
        "is_not_null",
        "in",
        "not_in",
        "between"
      ],
      "title": "FilterOperator",
      "type": "string"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "Settings for a node that filters rows based on a condition.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    },
    "filter_input": {
      "$ref": "#/$defs/FilterInput"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "filter_input"
  ],
  "title": "NodeFilter",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)
  • filter_input (FilterInput)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeFilter(NodeSingleInput):
    """Settings for a node that filters rows based on a condition."""

    filter_input: transform_schema.FilterInput
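
Example: a minimal sketch showing both filter modes. The import path is inferred from the source listing, the nested FilterInput/BasicFilter are passed as plain dicts, and the advanced expression string is only illustrative of the field, not of Flowfile's expression syntax.

from flowfile_core.schemas.input_schema import NodeFilter  # assumed import path

basic_filter_node = NodeFilter(
    flow_id=1,
    node_id=3,
    filter_input={
        "mode": "basic",
        "basic_filter": {"field": "country", "operator": "equals", "value": "NL"},
    },
)

advanced_filter_node = NodeFilter(
    flow_id=1,
    node_id=4,
    filter_input={"mode": "advanced", "advanced_filter": "[sales] > 1000"},  # illustrative expression
)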
NodeFormula pydantic-model

Bases: NodeSingleInput

Settings for a node that applies a formula to create/modify a column.

Show JSON schema:
{
  "$defs": {
    "DataType": {
      "description": "Specific data types for fine-grained control.",
      "enum": [
        "Int8",
        "Int16",
        "Int32",
        "Int64",
        "UInt8",
        "UInt16",
        "UInt32",
        "UInt64",
        "Float32",
        "Float64",
        "Decimal",
        "String",
        "Categorical",
        "Date",
        "Datetime",
        "Time",
        "Duration",
        "Boolean",
        "Binary",
        "List",
        "Struct",
        "Array"
      ],
      "title": "DataType",
      "type": "string"
    },
    "FieldInput": {
      "description": "Represents a single field with its name and data type, typically for defining an output column.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "anyOf": [
            {
              "$ref": "#/$defs/DataType"
            },
            {
              "const": "Auto",
              "enum": [
                "Auto"
              ],
              "type": "string"
            },
            {
              "enum": [
                "Int8",
                "Int16",
                "Int32",
                "Int64",
                "UInt8",
                "UInt16",
                "UInt32",
                "UInt64",
                "Float32",
                "Float64",
                "Decimal",
                "String",
                "Date",
                "Datetime",
                "Time",
                "Duration",
                "Boolean",
                "Binary",
                "List",
                "Struct",
                "Array",
                "Integer",
                "Double",
                "Utf8"
              ],
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": "Auto",
          "title": "Data Type"
        }
      },
      "required": [
        "name"
      ],
      "title": "FieldInput",
      "type": "object"
    },
    "FunctionInput": {
      "description": "Defines a formula to be applied, including the output field information.",
      "properties": {
        "field": {
          "$ref": "#/$defs/FieldInput"
        },
        "function": {
          "title": "Function",
          "type": "string"
        }
      },
      "required": [
        "field",
        "function"
      ],
      "title": "FunctionInput",
      "type": "object"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "Settings for a node that applies a formula to create/modify a column.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    },
    "function": {
      "$ref": "#/$defs/FunctionInput",
      "default": null
    }
  },
  "required": [
    "flow_id",
    "node_id"
  ],
  "title": "NodeFormula",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)
  • function (FunctionInput)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeFormula(NodeSingleInput):
    """Settings for a node that applies a formula to create/modify a column."""

    function: transform_schema.FunctionInput = None
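
Example: a minimal sketch adding a derived column. The import path is inferred from the source listing, the FunctionInput is passed as a plain dict, and the formula string is illustrative only.

from flowfile_core.schemas.input_schema import NodeFormula  # assumed import path

formula_node = NodeFormula(
    flow_id=1,
    node_id=5,
    function={
        "field": {"name": "total_price", "data_type": "Float64"},
        "function": "[quantity] * [unit_price]",   # illustrative formula syntax
    },
)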
NodeFuzzyMatch pydantic-model

Bases: NodeJoin

Settings for a node that performs a fuzzy join based on string similarity.

Show JSON schema:
{
  "$defs": {
    "FuzzyMapping": {
      "properties": {
        "left_col": {
          "title": "Left Col",
          "type": "string"
        },
        "right_col": {
          "title": "Right Col",
          "type": "string"
        },
        "threshold_score": {
          "default": 80.0,
          "title": "Threshold Score",
          "type": "number"
        },
        "fuzzy_type": {
          "default": "levenshtein",
          "enum": [
            "levenshtein",
            "jaro",
            "jaro_winkler",
            "hamming",
            "damerau_levenshtein",
            "indel"
          ],
          "title": "Fuzzy Type",
          "type": "string"
        },
        "perc_unique": {
          "default": 0.0,
          "title": "Perc Unique",
          "type": "number"
        },
        "output_column_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Output Column Name"
        },
        "valid": {
          "default": true,
          "title": "Valid",
          "type": "boolean"
        }
      },
      "required": [
        "left_col",
        "right_col"
      ],
      "title": "FuzzyMapping",
      "type": "object"
    },
    "FuzzyMatchInput": {
      "description": "Data model for fuzzy matching join operations.",
      "properties": {
        "join_mapping": {
          "items": {
            "$ref": "#/$defs/FuzzyMapping"
          },
          "title": "Join Mapping",
          "type": "array"
        },
        "left_select": {
          "$ref": "#/$defs/JoinInputs"
        },
        "right_select": {
          "$ref": "#/$defs/JoinInputs"
        },
        "how": {
          "default": "inner",
          "enum": [
            "inner",
            "left",
            "right",
            "full",
            "semi",
            "anti",
            "cross",
            "outer"
          ],
          "title": "How",
          "type": "string"
        },
        "aggregate_output": {
          "default": false,
          "title": "Aggregate Output",
          "type": "boolean"
        }
      },
      "required": [
        "join_mapping",
        "left_select",
        "right_select"
      ],
      "title": "FuzzyMatchInput",
      "type": "object"
    },
    "JoinInputs": {
      "description": "Data model for join-specific select inputs (extends SelectInputs).",
      "properties": {
        "renames": {
          "items": {
            "$ref": "#/$defs/SelectInput"
          },
          "title": "Renames",
          "type": "array"
        }
      },
      "title": "JoinInputs",
      "type": "object"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    },
    "SelectInput": {
      "description": "Defines how a single column should be selected, renamed, or type-cast.\n\nThis is a core building block for any operation that involves column manipulation.\nIt holds all the configuration for a single field in a selection operation.",
      "properties": {
        "old_name": {
          "title": "Old Name",
          "type": "string"
        },
        "original_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Original Position"
        },
        "new_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "New Name"
        },
        "data_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Data Type"
        },
        "data_type_change": {
          "default": false,
          "title": "Data Type Change",
          "type": "boolean"
        },
        "join_key": {
          "default": false,
          "title": "Join Key",
          "type": "boolean"
        },
        "is_altered": {
          "default": false,
          "title": "Is Altered",
          "type": "boolean"
        },
        "position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Position"
        },
        "is_available": {
          "default": true,
          "title": "Is Available",
          "type": "boolean"
        },
        "keep": {
          "default": true,
          "title": "Keep",
          "type": "boolean"
        }
      },
      "required": [
        "old_name"
      ],
      "title": "SelectInput",
      "type": "object"
    }
  },
  "description": "Settings for a node that performs a fuzzy join based on string similarity.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_ids": {
      "anyOf": [
        {
          "items": {
            "type": "integer"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "title": "Depending On Ids"
    },
    "auto_generate_selection": {
      "default": true,
      "title": "Auto Generate Selection",
      "type": "boolean"
    },
    "verify_integrity": {
      "default": true,
      "title": "Verify Integrity",
      "type": "boolean"
    },
    "join_input": {
      "$ref": "#/$defs/FuzzyMatchInput"
    },
    "auto_keep_all": {
      "default": true,
      "title": "Auto Keep All",
      "type": "boolean"
    },
    "auto_keep_right": {
      "default": true,
      "title": "Auto Keep Right",
      "type": "boolean"
    },
    "auto_keep_left": {
      "default": true,
      "title": "Auto Keep Left",
      "type": "boolean"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "join_input"
  ],
  "title": "NodeFuzzyMatch",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_ids (list[int] | None)
  • auto_generate_selection (bool)
  • verify_integrity (bool)
  • auto_keep_all (bool)
  • auto_keep_right (bool)
  • auto_keep_left (bool)
  • join_input (FuzzyMatchInput)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeFuzzyMatch(NodeJoin):
    """Settings for a node that performs a fuzzy join based on string similarity."""

    join_input: transform_schema.FuzzyMatchInput

    def to_yaml_dict(self) -> NodeFuzzyMatchYaml:
        """Converts the fuzzy match node settings to a dictionary for YAML serialization."""
        result: NodeFuzzyMatchYaml = {
            "cache_results": self.cache_results,
            "auto_generate_selection": self.auto_generate_selection,
            "verify_integrity": self.verify_integrity,
            "join_input": self.join_input.to_yaml_dict(),
            "auto_keep_all": self.auto_keep_all,
            "auto_keep_right": self.auto_keep_right,
            "auto_keep_left": self.auto_keep_left,
        }
        if self.output_field_config:
            result["output_field_config"] = {
                "enabled": self.output_field_config.enabled,
                "validation_mode_behavior": self.output_field_config.validation_mode_behavior,
                "validate_data_types": self.output_field_config.validate_data_types,
                "fields": [
                    {
                        "name": f.name,
                        "data_type": f.data_type,
                        "default_value": f.default_value,
                    }
                    for f in self.output_field_config.fields
                ],
            }
        return result
to_yaml_dict()

Converts the fuzzy match node settings to a dictionary for YAML serialization.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
def to_yaml_dict(self) -> NodeFuzzyMatchYaml:
    """Converts the fuzzy match node settings to a dictionary for YAML serialization."""
    result: NodeFuzzyMatchYaml = {
        "cache_results": self.cache_results,
        "auto_generate_selection": self.auto_generate_selection,
        "verify_integrity": self.verify_integrity,
        "join_input": self.join_input.to_yaml_dict(),
        "auto_keep_all": self.auto_keep_all,
        "auto_keep_right": self.auto_keep_right,
        "auto_keep_left": self.auto_keep_left,
    }
    if self.output_field_config:
        result["output_field_config"] = {
            "enabled": self.output_field_config.enabled,
            "validation_mode_behavior": self.output_field_config.validation_mode_behavior,
            "validate_data_types": self.output_field_config.validate_data_types,
            "fields": [
                {
                    "name": f.name,
                    "data_type": f.data_type,
                    "default_value": f.default_value,
                }
                for f in self.output_field_config.fields
            ],
        }
    return result
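
Example: a minimal sketch of a fuzzy join on approximate company names, followed by YAML-ready serialization via to_yaml_dict(). The import path is inferred from the source listing and the nested FuzzyMatchInput is passed as a plain dict; the column names are illustrative.

from flowfile_core.schemas.input_schema import NodeFuzzyMatch  # assumed import path

fuzzy_node = NodeFuzzyMatch(
    flow_id=1,
    node_id=6,
    depending_on_ids=[4, 5],
    join_input={
        "join_mapping": [
            {
                "left_col": "company_name",
                "right_col": "vendor_name",
                "threshold_score": 85.0,
                "fuzzy_type": "jaro_winkler",
            }
        ],
        "left_select": {"renames": [{"old_name": "company_name"}]},
        "right_select": {"renames": [{"old_name": "vendor_name"}]},
        "how": "inner",
    },
)
yaml_ready = fuzzy_node.to_yaml_dict()   # plain dict suitable for YAML dumping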
NodeGraphSolver pydantic-model

Bases: NodeSingleInput

Settings for a node that solves graph-based problems (e.g., connected components).

Show JSON schema:
{
  "$defs": {
    "GraphSolverInput": {
      "description": "Defines settings for a graph-solving operation (e.g., finding connected components).",
      "properties": {
        "col_from": {
          "title": "Col From",
          "type": "string"
        },
        "col_to": {
          "title": "Col To",
          "type": "string"
        },
        "output_column_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": "graph_group",
          "title": "Output Column Name"
        }
      },
      "required": [
        "col_from",
        "col_to"
      ],
      "title": "GraphSolverInput",
      "type": "object"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "Settings for a node that solves graph-based problems (e.g., connected components).",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    },
    "graph_solver_input": {
      "$ref": "#/$defs/GraphSolverInput"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "graph_solver_input"
  ],
  "title": "NodeGraphSolver",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)
  • graph_solver_input (GraphSolverInput)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeGraphSolver(NodeSingleInput):
    """Settings for a node that solves graph-based problems (e.g., connected components)."""

    graph_solver_input: transform_schema.GraphSolverInput
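
Example: a minimal sketch labelling connected components from an edge list. The import path is inferred from the source listing and the GraphSolverInput is passed as a plain dict; the column names are illustrative.

from flowfile_core.schemas.input_schema import NodeGraphSolver  # assumed import path

graph_node = NodeGraphSolver(
    flow_id=1,
    node_id=7,
    graph_solver_input={
        "col_from": "source_id",
        "col_to": "target_id",
        "output_column_name": "component_id",
    },
)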
NodeGroupBy pydantic-model

Bases: NodeSingleInput

Settings for a node that performs a group-by and aggregation operation.

Show JSON schema:
{
  "$defs": {
    "AggColl": {
      "description": "A data class that represents a single aggregation operation for a group by operation.\n\nAttributes\n----------\nold_name : str\n    The name of the column in the original DataFrame to be aggregated.\n\nagg : str\n    The aggregation function to use. This can be a string representing a built-in function or a custom function.\n\nnew_name : Optional[str]\n    The name of the resulting aggregated column in the output DataFrame. If not provided, it will default to the\n    old_name appended with the aggregation function.\n\noutput_type : Optional[str]\n    The type of the output values of the aggregation. If not provided, it is inferred from the aggregation function\n    using the `get_func_type_mapping` function.\n\nExample\n--------\nagg_col = AggColl(\n    old_name='col1',\n    agg='sum',\n    new_name='sum_col1',\n    output_type='float'\n)",
      "properties": {
        "old_name": {
          "title": "Old Name",
          "type": "string"
        },
        "agg": {
          "title": "Agg",
          "type": "string"
        },
        "new_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "New Name"
        },
        "output_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Output Type"
        }
      },
      "required": [
        "old_name",
        "agg"
      ],
      "title": "AggColl",
      "type": "object"
    },
    "GroupByInput": {
      "description": "A data class that represents the input for a group by operation.\n\nAttributes\n----------\nagg_cols : List[AggColl]\n    A list of `AggColl` objects that specify the aggregation operations to perform on the DataFrame columns\n    after grouping. Each `AggColl` object should specify the column to be aggregated and the aggregation\n    function to use.\n\nExample\n--------\ngroup_by_input = GroupByInput(\n    agg_cols=[AggColl(old_name='ix', agg='groupby'), AggColl(old_name='groups', agg='groupby'),\n              AggColl(old_name='col1', agg='sum'), AggColl(old_name='col2', agg='mean')]\n)",
      "properties": {
        "agg_cols": {
          "items": {
            "$ref": "#/$defs/AggColl"
          },
          "title": "Agg Cols",
          "type": "array"
        }
      },
      "required": [
        "agg_cols"
      ],
      "title": "GroupByInput",
      "type": "object"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "Settings for a node that performs a group-by and aggregation operation.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    },
    "groupby_input": {
      "$ref": "#/$defs/GroupByInput",
      "default": null
    }
  },
  "required": [
    "flow_id",
    "node_id"
  ],
  "title": "NodeGroupBy",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)
  • groupby_input (GroupByInput)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeGroupBy(NodeSingleInput):
    """Settings for a node that performs a group-by and aggregation operation."""

    groupby_input: transform_schema.GroupByInput = None
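
Example: a minimal sketch that groups by one column and aggregates another, following the AggColl/GroupByInput pattern from the docstrings above. The import path is inferred from the source listing and the nested input is passed as a plain dict.

from flowfile_core.schemas.input_schema import NodeGroupBy  # assumed import path

group_node = NodeGroupBy(
    flow_id=1,
    node_id=8,
    groupby_input={
        "agg_cols": [
            {"old_name": "region", "agg": "groupby"},
            {"old_name": "sales", "agg": "sum", "new_name": "total_sales"},
            {"old_name": "sales", "agg": "mean", "new_name": "avg_sales"},
        ]
    },
)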
NodeInputConnection pydantic-model

Bases: BaseModel

Represents the input side of a connection between two nodes.

Show JSON schema:
{
  "description": "Represents the input side of a connection between two nodes.",
  "properties": {
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "connection_class": {
      "enum": [
        "input-0",
        "input-1",
        "input-2",
        "input-3",
        "input-4",
        "input-5",
        "input-6",
        "input-7",
        "input-8",
        "input-9"
      ],
      "title": "Connection Class",
      "type": "string"
    }
  },
  "required": [
    "node_id",
    "connection_class"
  ],
  "title": "NodeInputConnection",
  "type": "object"
}

Fields:

  • node_id (int)
  • connection_class (InputConnectionClass)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeInputConnection(BaseModel):
    """Represents the input side of a connection between two nodes."""

    node_id: int
    connection_class: InputConnectionClass

    def get_node_input_connection_type(self) -> Literal["main", "right", "left"]:
        """Determines the semantic type of the input (e.g., for a join)."""
        match self.connection_class:
            case "input-0":
                return "main"
            case "input-1":
                return "right"
            case "input-2":
                return "left"
            case _:
                raise ValueError(f"Unexpected connection_class: {self.connection_class}")
get_node_input_connection_type()

Determines the semantic type of the input (e.g., for a join).

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
def get_node_input_connection_type(self) -> Literal["main", "right", "left"]:
    """Determines the semantic type of the input (e.g., for a join)."""
    match self.connection_class:
        case "input-0":
            return "main"
        case "input-1":
            return "right"
        case "input-2":
            return "left"
        case _:
            raise ValueError(f"Unexpected connection_class: {self.connection_class}")
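A short usage sketch (illustrative only; it assumes the import path flowfile_core.schemas.input_schema):

from flowfile_core.schemas.input_schema import NodeInputConnection

# "input-0" maps to "main", "input-1" to "right", "input-2" to "left".
conn = NodeInputConnection(node_id=3, connection_class="input-1")
assert conn.get_node_input_connection_type() == "right"

# "input-3" through "input-9" are valid connection classes for the model,
# but get_node_input_connection_type() raises ValueError for them.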
NodeJoin pydantic-model

Bases: NodeMultiInput

Settings for a node that performs a standard SQL-style join.

Show JSON schema:
{
  "$defs": {
    "JoinInput": {
      "description": "Data model for standard SQL-style join operations.",
      "properties": {
        "join_mapping": {
          "items": {
            "$ref": "#/$defs/JoinMap"
          },
          "title": "Join Mapping",
          "type": "array"
        },
        "left_select": {
          "$ref": "#/$defs/JoinInputs"
        },
        "right_select": {
          "$ref": "#/$defs/JoinInputs"
        },
        "how": {
          "default": "inner",
          "enum": [
            "inner",
            "left",
            "right",
            "full",
            "semi",
            "anti",
            "cross",
            "outer"
          ],
          "title": "How",
          "type": "string"
        }
      },
      "required": [
        "join_mapping",
        "left_select",
        "right_select"
      ],
      "title": "JoinInput",
      "type": "object"
    },
    "JoinInputs": {
      "description": "Data model for join-specific select inputs (extends SelectInputs).",
      "properties": {
        "renames": {
          "items": {
            "$ref": "#/$defs/SelectInput"
          },
          "title": "Renames",
          "type": "array"
        }
      },
      "title": "JoinInputs",
      "type": "object"
    },
    "JoinMap": {
      "description": "Defines a single mapping between a left and right column for a join key.",
      "properties": {
        "left_col": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Left Col"
        },
        "right_col": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Right Col"
        }
      },
      "title": "JoinMap",
      "type": "object"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    },
    "SelectInput": {
      "description": "Defines how a single column should be selected, renamed, or type-cast.\n\nThis is a core building block for any operation that involves column manipulation.\nIt holds all the configuration for a single field in a selection operation.",
      "properties": {
        "old_name": {
          "title": "Old Name",
          "type": "string"
        },
        "original_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Original Position"
        },
        "new_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "New Name"
        },
        "data_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Data Type"
        },
        "data_type_change": {
          "default": false,
          "title": "Data Type Change",
          "type": "boolean"
        },
        "join_key": {
          "default": false,
          "title": "Join Key",
          "type": "boolean"
        },
        "is_altered": {
          "default": false,
          "title": "Is Altered",
          "type": "boolean"
        },
        "position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Position"
        },
        "is_available": {
          "default": true,
          "title": "Is Available",
          "type": "boolean"
        },
        "keep": {
          "default": true,
          "title": "Keep",
          "type": "boolean"
        }
      },
      "required": [
        "old_name"
      ],
      "title": "SelectInput",
      "type": "object"
    }
  },
  "description": "Settings for a node that performs a standard SQL-style join.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_ids": {
      "anyOf": [
        {
          "items": {
            "type": "integer"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "title": "Depending On Ids"
    },
    "auto_generate_selection": {
      "default": true,
      "title": "Auto Generate Selection",
      "type": "boolean"
    },
    "verify_integrity": {
      "default": true,
      "title": "Verify Integrity",
      "type": "boolean"
    },
    "join_input": {
      "$ref": "#/$defs/JoinInput"
    },
    "auto_keep_all": {
      "default": true,
      "title": "Auto Keep All",
      "type": "boolean"
    },
    "auto_keep_right": {
      "default": true,
      "title": "Auto Keep Right",
      "type": "boolean"
    },
    "auto_keep_left": {
      "default": true,
      "title": "Auto Keep Left",
      "type": "boolean"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "join_input"
  ],
  "title": "NodeJoin",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_ids (list[int] | None)
  • auto_generate_selection (bool)
  • verify_integrity (bool)
  • join_input (JoinInput)
  • auto_keep_all (bool)
  • auto_keep_right (bool)
  • auto_keep_left (bool)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeJoin(NodeMultiInput):
    """Settings for a node that performs a standard SQL-style join."""

    auto_generate_selection: bool = True
    verify_integrity: bool = True
    join_input: transform_schema.JoinInput
    auto_keep_all: bool = True
    auto_keep_right: bool = True
    auto_keep_left: bool = True

    def to_yaml_dict(self) -> NodeJoinYaml:
        """Converts the join node settings to a dictionary for YAML serialization."""
        result: NodeJoinYaml = {
            "cache_results": self.cache_results,
            "auto_generate_selection": self.auto_generate_selection,
            "verify_integrity": self.verify_integrity,
            "join_input": self.join_input.to_yaml_dict(),
            "auto_keep_all": self.auto_keep_all,
            "auto_keep_right": self.auto_keep_right,
            "auto_keep_left": self.auto_keep_left,
        }
        if self.output_field_config:
            result["output_field_config"] = {
                "enabled": self.output_field_config.enabled,
                "validation_mode_behavior": self.output_field_config.validation_mode_behavior,
                "validate_data_types": self.output_field_config.validate_data_types,
                "fields": [
                    {
                        "name": f.name,
                        "data_type": f.data_type,
                        "default_value": f.default_value,
                    }
                    for f in self.output_field_config.fields
                ],
            }
        return result
to_yaml_dict()

Converts the join node settings to a dictionary for YAML serialization.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
def to_yaml_dict(self) -> NodeJoinYaml:
    """Converts the join node settings to a dictionary for YAML serialization."""
    result: NodeJoinYaml = {
        "cache_results": self.cache_results,
        "auto_generate_selection": self.auto_generate_selection,
        "verify_integrity": self.verify_integrity,
        "join_input": self.join_input.to_yaml_dict(),
        "auto_keep_all": self.auto_keep_all,
        "auto_keep_right": self.auto_keep_right,
        "auto_keep_left": self.auto_keep_left,
    }
    if self.output_field_config:
        result["output_field_config"] = {
            "enabled": self.output_field_config.enabled,
            "validation_mode_behavior": self.output_field_config.validation_mode_behavior,
            "validate_data_types": self.output_field_config.validate_data_types,
            "fields": [
                {
                    "name": f.name,
                    "data_type": f.data_type,
                    "default_value": f.default_value,
                }
                for f in self.output_field_config.fields
            ],
        }
    return result
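A hedged construction sketch for a left join (not from the source; it assumes input_schema and transform_schema are importable as shown, and that JoinMap, JoinInputs, and SelectInput live in transform_schema):

from flowfile_core.schemas.input_schema import NodeJoin
from flowfile_core.schemas import transform_schema

join_node = NodeJoin(
    flow_id=1,
    node_id=3,
    join_input=transform_schema.JoinInput(
        join_mapping=[transform_schema.JoinMap(left_col="customer_id", right_col="id")],
        left_select=transform_schema.JoinInputs(
            renames=[transform_schema.SelectInput(old_name="customer_id")]
        ),
        right_select=transform_schema.JoinInputs(
            renames=[transform_schema.SelectInput(old_name="id", new_name="customer_ref")]
        ),
        how="left",
    ),
)

# to_yaml_dict() returns a plain dict (cache_results, join_input, auto_keep_* flags, ...)
# that can be handed directly to a YAML serializer.
yaml_ready = join_node.to_yaml_dict()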
NodeManualInput pydantic-model

Bases: NodeBase

Settings for a node that allows direct data entry in the UI.

Show JSON schema:
{
  "$defs": {
    "MinimalFieldInfo": {
      "description": "Represents the most basic information about a data field (column).",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "title": "Data Type",
          "type": "string"
        }
      },
      "required": [
        "name"
      ],
      "title": "MinimalFieldInfo",
      "type": "object"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    },
    "RawData": {
      "description": "Represents data in a raw, columnar format for manual input.",
      "properties": {
        "columns": {
          "default": null,
          "items": {
            "$ref": "#/$defs/MinimalFieldInfo"
          },
          "title": "Columns",
          "type": "array"
        },
        "data": {
          "items": {
            "items": {},
            "type": "array"
          },
          "title": "Data",
          "type": "array"
        }
      },
      "required": [
        "data"
      ],
      "title": "RawData",
      "type": "object"
    }
  },
  "description": "Settings for a node that allows direct data entry in the UI.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "raw_data_format": {
      "anyOf": [
        {
          "$ref": "#/$defs/RawData"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    }
  },
  "required": [
    "flow_id",
    "node_id"
  ],
  "title": "NodeManualInput",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • raw_data_format (RawData | None)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeManualInput(NodeBase):
    """Settings for a node that allows direct data entry in the UI."""

    raw_data_format: RawData | None = None
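A minimal sketch (illustrative; it assumes RawData and MinimalFieldInfo are importable from the same input_schema module):

from flowfile_core.schemas.input_schema import NodeManualInput, RawData, MinimalFieldInfo

# RawData holds columnar data: one inner list per column, in the same order
# as the column definitions.
manual_node = NodeManualInput(
    flow_id=1,
    node_id=1,
    raw_data_format=RawData(
        columns=[
            MinimalFieldInfo(name="city"),  # data_type defaults to "String"
            MinimalFieldInfo(name="population", data_type="Int64"),
        ],
        data=[["Amsterdam", "Utrecht"], [905_000, 362_000]],
    ),
)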
NodeMultiInput pydantic-model

Bases: NodeBase

A base model for any node that takes multiple data inputs.

Show JSON schema:
{
  "$defs": {
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "A base model for any node that takes multiple data inputs.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_ids": {
      "anyOf": [
        {
          "items": {
            "type": "integer"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "title": "Depending On Ids"
    }
  },
  "required": [
    "flow_id",
    "node_id"
  ],
  "title": "NodeMultiInput",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_ids (list[int] | None)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeMultiInput(NodeBase):
    """A base model for any node that takes multiple data inputs."""

    depending_on_ids: list[int] | None = Field(default_factory=list)
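A small sketch of the default behavior (illustrative only):

from flowfile_core.schemas.input_schema import NodeMultiInput

node = NodeMultiInput(flow_id=1, node_id=2)
# depending_on_ids uses default_factory=list, so each instance gets its own empty list.
assert node.depending_on_ids == []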
NodeOutput pydantic-model

Bases: NodeSingleInput

Settings for a node that writes its input to a file.

Show JSON schema:
{
  "$defs": {
    "OutputCsvTable": {
      "description": "Defines settings for writing a CSV file.",
      "properties": {
        "file_type": {
          "const": "csv",
          "default": "csv",
          "enum": [
            "csv"
          ],
          "title": "File Type",
          "type": "string"
        },
        "delimiter": {
          "default": ",",
          "title": "Delimiter",
          "type": "string"
        },
        "encoding": {
          "default": "utf-8",
          "title": "Encoding",
          "type": "string"
        }
      },
      "title": "OutputCsvTable",
      "type": "object"
    },
    "OutputExcelTable": {
      "description": "Defines settings for writing an Excel file.",
      "properties": {
        "file_type": {
          "const": "excel",
          "default": "excel",
          "enum": [
            "excel"
          ],
          "title": "File Type",
          "type": "string"
        },
        "sheet_name": {
          "default": "Sheet1",
          "title": "Sheet Name",
          "type": "string"
        }
      },
      "title": "OutputExcelTable",
      "type": "object"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    },
    "OutputParquetTable": {
      "description": "Defines settings for writing a Parquet file.",
      "properties": {
        "file_type": {
          "const": "parquet",
          "default": "parquet",
          "enum": [
            "parquet"
          ],
          "title": "File Type",
          "type": "string"
        }
      },
      "title": "OutputParquetTable",
      "type": "object"
    },
    "OutputSettings": {
      "description": "Defines the complete settings for an output node.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "directory": {
          "title": "Directory",
          "type": "string"
        },
        "file_type": {
          "title": "File Type",
          "type": "string"
        },
        "fields": {
          "anyOf": [
            {
              "items": {
                "type": "string"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "title": "Fields"
        },
        "write_mode": {
          "default": "overwrite",
          "title": "Write Mode",
          "type": "string"
        },
        "table_settings": {
          "discriminator": {
            "mapping": {
              "csv": "#/$defs/OutputCsvTable",
              "excel": "#/$defs/OutputExcelTable",
              "parquet": "#/$defs/OutputParquetTable"
            },
            "propertyName": "file_type"
          },
          "oneOf": [
            {
              "$ref": "#/$defs/OutputCsvTable"
            },
            {
              "$ref": "#/$defs/OutputParquetTable"
            },
            {
              "$ref": "#/$defs/OutputExcelTable"
            }
          ],
          "title": "Table Settings"
        },
        "abs_file_path": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Abs File Path"
        }
      },
      "required": [
        "name",
        "directory",
        "file_type",
        "table_settings"
      ],
      "title": "OutputSettings",
      "type": "object"
    }
  },
  "description": "Settings for a node that writes its input to a file.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    },
    "output_settings": {
      "$ref": "#/$defs/OutputSettings"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "output_settings"
  ],
  "title": "NodeOutput",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)
  • output_settings (OutputSettings)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeOutput(NodeSingleInput):
    """Settings for a node that writes its input to a file."""

    output_settings: OutputSettings

    def to_yaml_dict(self) -> NodeOutputYaml:
        """Converts the output node settings to a dictionary for YAML serialization."""
        result: NodeOutputYaml = {
            "cache_results": self.cache_results,
            "output_settings": self.output_settings.to_yaml_dict(),
        }
        if self.output_field_config:
            result["output_field_config"] = {
                "enabled": self.output_field_config.enabled,
                "validation_mode_behavior": self.output_field_config.validation_mode_behavior,
                "validate_data_types": self.output_field_config.validate_data_types,
                "fields": [
                    {
                        "name": f.name,
                        "data_type": f.data_type,
                        "default_value": f.default_value,
                    }
                    for f in self.output_field_config.fields
                ],
            }
        return result
to_yaml_dict()

Converts the output node settings to a dictionary for YAML serialization.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
def to_yaml_dict(self) -> NodeOutputYaml:
    """Converts the output node settings to a dictionary for YAML serialization."""
    result: NodeOutputYaml = {
        "cache_results": self.cache_results,
        "output_settings": self.output_settings.to_yaml_dict(),
    }
    if self.output_field_config:
        result["output_field_config"] = {
            "enabled": self.output_field_config.enabled,
            "validation_mode_behavior": self.output_field_config.validation_mode_behavior,
            "validate_data_types": self.output_field_config.validate_data_types,
            "fields": [
                {
                    "name": f.name,
                    "data_type": f.data_type,
                    "default_value": f.default_value,
                }
                for f in self.output_field_config.fields
            ],
        }
    return result
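A hedged sketch of configuring a CSV output (not from the source; it assumes OutputSettings and OutputCsvTable are importable from input_schema, with table_settings selected by its file_type discriminator, and the directory path is hypothetical):

from flowfile_core.schemas.input_schema import NodeOutput, OutputSettings, OutputCsvTable

output_node = NodeOutput(
    flow_id=1,
    node_id=9,
    output_settings=OutputSettings(
        name="result.csv",
        directory="/data/exports",  # hypothetical path
        file_type="csv",
        table_settings=OutputCsvTable(delimiter=";"),
        # write_mode defaults to "overwrite"
    ),
)
yaml_ready = output_node.to_yaml_dict()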
NodeOutputConnection pydantic-model

Bases: BaseModel

Represents the output side of a connection between two nodes.

Show JSON schema:
{
  "description": "Represents the output side of a connection between two nodes.",
  "properties": {
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "connection_class": {
      "enum": [
        "output-0",
        "output-1",
        "output-2",
        "output-3",
        "output-4",
        "output-5",
        "output-6",
        "output-7",
        "output-8",
        "output-9"
      ],
      "title": "Connection Class",
      "type": "string"
    }
  },
  "required": [
    "node_id",
    "connection_class"
  ],
  "title": "NodeOutputConnection",
  "type": "object"
}

Fields:

  • node_id (int)
  • connection_class (OutputConnectionClass)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodeOutputConnection(BaseModel):
    """Represents the output side of a connection between two nodes."""

    node_id: int
    connection_class: OutputConnectionClass
NodePivot pydantic-model

Bases: NodeSingleInput

Settings for a node that pivots data from a long to a wide format.

Show JSON schema:
{
  "$defs": {
    "MinimalFieldInfo": {
      "description": "Represents the most basic information about a data field (column).",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "title": "Data Type",
          "type": "string"
        }
      },
      "required": [
        "name"
      ],
      "title": "MinimalFieldInfo",
      "type": "object"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    },
    "PivotInput": {
      "description": "Defines the settings for a pivot (long-to-wide) operation.",
      "properties": {
        "index_columns": {
          "items": {
            "type": "string"
          },
          "title": "Index Columns",
          "type": "array"
        },
        "pivot_column": {
          "title": "Pivot Column",
          "type": "string"
        },
        "value_col": {
          "title": "Value Col",
          "type": "string"
        },
        "aggregations": {
          "items": {
            "type": "string"
          },
          "title": "Aggregations",
          "type": "array"
        }
      },
      "required": [
        "index_columns",
        "pivot_column",
        "value_col",
        "aggregations"
      ],
      "title": "PivotInput",
      "type": "object"
    }
  },
  "description": "Settings for a node that pivots data from a long to a wide format.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    },
    "pivot_input": {
      "$ref": "#/$defs/PivotInput",
      "default": null
    },
    "output_fields": {
      "anyOf": [
        {
          "items": {
            "$ref": "#/$defs/MinimalFieldInfo"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Output Fields"
    }
  },
  "required": [
    "flow_id",
    "node_id"
  ],
  "title": "NodePivot",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)
  • pivot_input (PivotInput)
  • output_fields (list[MinimalFieldInfo] | None)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodePivot(NodeSingleInput):
    """Settings for a node that pivots data from a long to a wide format."""

    pivot_input: transform_schema.PivotInput = None
    output_fields: list[MinimalFieldInfo] | None = None
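A construction sketch (illustrative; it assumes PivotInput lives in transform_schema and that "sum" is an accepted aggregation name):

from flowfile_core.schemas.input_schema import NodePivot
from flowfile_core.schemas import transform_schema

pivot_node = NodePivot(
    flow_id=1,
    node_id=6,
    pivot_input=transform_schema.PivotInput(
        index_columns=["region"],
        pivot_column="year",   # values of this column become new column headers
        value_col="revenue",
        aggregations=["sum"],
    ),
)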
NodePolarsCode pydantic-model

Bases: NodeMultiInput

Settings for a node that executes arbitrary user-provided Polars code.

Show JSON schema:
{
  "$defs": {
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    },
    "PolarsCodeInput": {
      "description": "A simple container for a string of user-provided Polars code to be executed.",
      "properties": {
        "polars_code": {
          "title": "Polars Code",
          "type": "string"
        }
      },
      "required": [
        "polars_code"
      ],
      "title": "PolarsCodeInput",
      "type": "object"
    }
  },
  "description": "Settings for a node that executes arbitrary user-provided Polars code.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_ids": {
      "anyOf": [
        {
          "items": {
            "type": "integer"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "title": "Depending On Ids"
    },
    "polars_code_input": {
      "$ref": "#/$defs/PolarsCodeInput"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "polars_code_input"
  ],
  "title": "NodePolarsCode",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_ids (list[int] | None)
  • polars_code_input (PolarsCodeInput)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class NodePolarsCode(NodeMultiInput):
    """Settings for a node that executes arbitrary user-provided Polars code."""

    polars_code_input: transform_schema.PolarsCodeInput
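A minimal sketch (illustrative only; the variable names available inside the code string are defined by the executor, not by this schema, so the snippet below is a placeholder):

from flowfile_core.schemas.input_schema import NodePolarsCode
from flowfile_core.schemas import transform_schema

polars_node = NodePolarsCode(
    flow_id=1,
    node_id=7,
    polars_code_input=transform_schema.PolarsCodeInput(
        polars_code="# placeholder Polars snippet; the execution context is documented elsewhere"
    ),
)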
NodePromise pydantic-model

Bases: NodeBase

A placeholder node for an operation that has not yet been configured.

Show JSON schema:
{
  "$defs": {
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "A placeholder node for an operation that has not yet been configured.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "default": false,
      "title": "Is Setup",
      "type": "boolean"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "node_type": {
      "title": "Node Type",
      "type": "string"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "node_type"
  ],
  "title": "NodePromise",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • is_setup (bool)
  • node_type (str)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 751-755
class NodePromise(NodeBase):
    """A placeholder node for an operation that has not yet been configured."""

    is_setup: bool = False
    node_type: str
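
A minimal construction sketch (the import path is inferred from the source location above; the node_type value is illustrative, not taken from the source):

from flowfile_core.schemas.input_schema import NodePromise

# Reserve a spot in the flow for a node whose settings are not known yet.
# "filter" is an illustrative node_type value.
promise = NodePromise(flow_id=1, node_id=5, node_type="filter", pos_x=120.0, pos_y=80.0)
assert promise.is_setup is False  # promises start unconfigured
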
NodeRead pydantic-model

Bases: NodeBase

Settings for a node that reads data from a file.

Show JSON schema:
{
  "$defs": {
    "InputCsvTable": {
      "description": "Defines settings for reading a CSV file.",
      "properties": {
        "file_type": {
          "const": "csv",
          "default": "csv",
          "enum": [
            "csv"
          ],
          "title": "File Type",
          "type": "string"
        },
        "reference": {
          "default": "",
          "title": "Reference",
          "type": "string"
        },
        "starting_from_line": {
          "default": 0,
          "title": "Starting From Line",
          "type": "integer"
        },
        "delimiter": {
          "default": ",",
          "title": "Delimiter",
          "type": "string"
        },
        "has_headers": {
          "default": true,
          "title": "Has Headers",
          "type": "boolean"
        },
        "encoding": {
          "default": "utf-8",
          "title": "Encoding",
          "type": "string"
        },
        "parquet_ref": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Parquet Ref"
        },
        "row_delimiter": {
          "default": "\n",
          "title": "Row Delimiter",
          "type": "string"
        },
        "quote_char": {
          "default": "\"",
          "title": "Quote Char",
          "type": "string"
        },
        "infer_schema_length": {
          "default": 10000,
          "title": "Infer Schema Length",
          "type": "integer"
        },
        "truncate_ragged_lines": {
          "default": false,
          "title": "Truncate Ragged Lines",
          "type": "boolean"
        },
        "ignore_errors": {
          "default": false,
          "title": "Ignore Errors",
          "type": "boolean"
        }
      },
      "title": "InputCsvTable",
      "type": "object"
    },
    "InputExcelTable": {
      "description": "Defines settings for reading an Excel file.",
      "properties": {
        "file_type": {
          "const": "excel",
          "default": "excel",
          "enum": [
            "excel"
          ],
          "title": "File Type",
          "type": "string"
        },
        "sheet_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Sheet Name"
        },
        "start_row": {
          "default": 0,
          "title": "Start Row",
          "type": "integer"
        },
        "start_column": {
          "default": 0,
          "title": "Start Column",
          "type": "integer"
        },
        "end_row": {
          "default": 0,
          "title": "End Row",
          "type": "integer"
        },
        "end_column": {
          "default": 0,
          "title": "End Column",
          "type": "integer"
        },
        "has_headers": {
          "default": true,
          "title": "Has Headers",
          "type": "boolean"
        },
        "type_inference": {
          "default": false,
          "title": "Type Inference",
          "type": "boolean"
        }
      },
      "title": "InputExcelTable",
      "type": "object"
    },
    "InputJsonTable": {
      "description": "Defines settings for reading a JSON file.",
      "properties": {
        "file_type": {
          "const": "json",
          "default": "json",
          "enum": [
            "json"
          ],
          "title": "File Type",
          "type": "string"
        },
        "reference": {
          "default": "",
          "title": "Reference",
          "type": "string"
        },
        "starting_from_line": {
          "default": 0,
          "title": "Starting From Line",
          "type": "integer"
        },
        "delimiter": {
          "default": ",",
          "title": "Delimiter",
          "type": "string"
        },
        "has_headers": {
          "default": true,
          "title": "Has Headers",
          "type": "boolean"
        },
        "encoding": {
          "default": "utf-8",
          "title": "Encoding",
          "type": "string"
        },
        "parquet_ref": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Parquet Ref"
        },
        "row_delimiter": {
          "default": "\n",
          "title": "Row Delimiter",
          "type": "string"
        },
        "quote_char": {
          "default": "\"",
          "title": "Quote Char",
          "type": "string"
        },
        "infer_schema_length": {
          "default": 10000,
          "title": "Infer Schema Length",
          "type": "integer"
        },
        "truncate_ragged_lines": {
          "default": false,
          "title": "Truncate Ragged Lines",
          "type": "boolean"
        },
        "ignore_errors": {
          "default": false,
          "title": "Ignore Errors",
          "type": "boolean"
        }
      },
      "title": "InputJsonTable",
      "type": "object"
    },
    "InputParquetTable": {
      "description": "Defines settings for reading a Parquet file.",
      "properties": {
        "file_type": {
          "const": "parquet",
          "default": "parquet",
          "enum": [
            "parquet"
          ],
          "title": "File Type",
          "type": "string"
        }
      },
      "title": "InputParquetTable",
      "type": "object"
    },
    "MinimalFieldInfo": {
      "description": "Represents the most basic information about a data field (column).",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "title": "Data Type",
          "type": "string"
        }
      },
      "required": [
        "name"
      ],
      "title": "MinimalFieldInfo",
      "type": "object"
    },
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    },
    "ReceivedTable": {
      "description": "Model for defining a table received from an external source.",
      "properties": {
        "id": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Id"
        },
        "name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Name"
        },
        "path": {
          "title": "Path",
          "type": "string"
        },
        "directory": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Directory"
        },
        "analysis_file_available": {
          "default": false,
          "title": "Analysis File Available",
          "type": "boolean"
        },
        "status": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Status"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/MinimalFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "abs_file_path": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Abs File Path"
        },
        "file_type": {
          "enum": [
            "csv",
            "json",
            "parquet",
            "excel"
          ],
          "title": "File Type",
          "type": "string"
        },
        "table_settings": {
          "discriminator": {
            "mapping": {
              "csv": "#/$defs/InputCsvTable",
              "excel": "#/$defs/InputExcelTable",
              "json": "#/$defs/InputJsonTable",
              "parquet": "#/$defs/InputParquetTable"
            },
            "propertyName": "file_type"
          },
          "oneOf": [
            {
              "$ref": "#/$defs/InputCsvTable"
            },
            {
              "$ref": "#/$defs/InputJsonTable"
            },
            {
              "$ref": "#/$defs/InputParquetTable"
            },
            {
              "$ref": "#/$defs/InputExcelTable"
            }
          ],
          "title": "Table Settings"
        }
      },
      "required": [
        "path",
        "file_type",
        "table_settings"
      ],
      "title": "ReceivedTable",
      "type": "object"
    }
  },
  "description": "Settings for a node that reads data from a file.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "received_file": {
      "$ref": "#/$defs/ReceivedTable"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "received_file"
  ],
  "title": "NodeRead",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • received_file (ReceivedTable)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 608-611
class NodeRead(NodeBase):
    """Settings for a node that reads data from a file."""

    received_file: ReceivedTable
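
A hedged construction sketch. It assumes ReceivedTable and InputCsvTable are importable from the same input_schema module; the file path and CSV options are illustrative:

from flowfile_core.schemas.input_schema import InputCsvTable, NodeRead, ReceivedTable

# Read a semicolon-delimited CSV file; table_settings is the discriminated union
# keyed on file_type (csv / json / parquet / excel).
read_node = NodeRead(
    flow_id=1,
    node_id=2,
    received_file=ReceivedTable(
        path="sales.csv",
        file_type="csv",
        table_settings=InputCsvTable(delimiter=";", has_headers=True),
    ),
)
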
NodeRecordCount pydantic-model

Bases: NodeSingleInput

Settings for a node that counts the number of records.

Show JSON schema:
{
  "$defs": {
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "Settings for a node that counts the number of records.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    }
  },
  "required": [
    "flow_id",
    "node_id"
  ],
  "title": "NodeRecordCount",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 878-881
class NodeRecordCount(NodeSingleInput):
    """Settings for a node that counts the number of records."""

    pass
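
Since the model adds no fields of its own, constructing it only needs the base identifiers; a minimal sketch (import path inferred from the source location above):

from flowfile_core.schemas.input_schema import NodeRecordCount

# depending_on_id points at the upstream node whose records are counted.
count_node = NodeRecordCount(flow_id=1, node_id=3, depending_on_id=2)
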
NodeRecordId pydantic-model

Bases: NodeSingleInput

Settings for a node that adds a unique record ID column.

Show JSON schema:
{
  "$defs": {
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    },
    "RecordIdInput": {
      "description": "Defines settings for adding a record ID (row number) column to the data.",
      "properties": {
        "output_column_name": {
          "default": "record_id",
          "title": "Output Column Name",
          "type": "string"
        },
        "offset": {
          "default": 1,
          "title": "Offset",
          "type": "integer"
        },
        "group_by": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": false,
          "title": "Group By"
        },
        "group_by_columns": {
          "anyOf": [
            {
              "items": {
                "type": "string"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "title": "Group By Columns"
        }
      },
      "title": "RecordIdInput",
      "type": "object"
    }
  },
  "description": "Settings for a node that adds a unique record ID column.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    },
    "record_id_input": {
      "$ref": "#/$defs/RecordIdInput"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "record_id_input"
  ],
  "title": "NodeRecordId",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)
  • record_id_input (RecordIdInput)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 449-452
class NodeRecordId(NodeSingleInput):
    """Settings for a node that adds a unique record ID column."""

    record_id_input: transform_schema.RecordIdInput
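
A hedged sketch of configuring the record-ID settings. It assumes RecordIdInput lives in flowfile_core.schemas.transform_schema (the class source only shows the transform_schema alias); the column names are illustrative:

from flowfile_core.schemas import transform_schema
from flowfile_core.schemas.input_schema import NodeRecordId

# Number rows per customer, starting at 1, into a new "row_nr" column.
record_id_node = NodeRecordId(
    flow_id=1,
    node_id=4,
    depending_on_id=3,
    record_id_input=transform_schema.RecordIdInput(
        output_column_name="row_nr",
        offset=1,
        group_by=True,
        group_by_columns=["customer_id"],
    ),
)
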
NodeSample pydantic-model

Bases: NodeSingleInput

Settings for a node that samples a subset of the data.

Show JSON schema:
{
  "$defs": {
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "Settings for a node that samples a subset of the data.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    },
    "sample_size": {
      "default": 1000,
      "title": "Sample Size",
      "type": "integer"
    }
  },
  "required": [
    "flow_id",
    "node_id"
  ],
  "title": "NodeSample",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)
  • sample_size (int)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 443-446
class NodeSample(NodeSingleInput):
    """Settings for a node that samples a subset of the data."""

    sample_size: int = 1000
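
A minimal sketch (import path inferred from the source location above; the size is illustrative):

from flowfile_core.schemas.input_schema import NodeSample

# Keep at most 500 records from the upstream node; sample_size defaults to 1000.
sample_node = NodeSample(flow_id=1, node_id=5, depending_on_id=4, sample_size=500)
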
NodeSelect pydantic-model

Bases: NodeSingleInput

Settings for a node that selects, renames, and reorders columns.

Show JSON schema:
{
  "$defs": {
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    },
    "SelectInput": {
      "description": "Defines how a single column should be selected, renamed, or type-cast.\n\nThis is a core building block for any operation that involves column manipulation.\nIt holds all the configuration for a single field in a selection operation.",
      "properties": {
        "old_name": {
          "title": "Old Name",
          "type": "string"
        },
        "original_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Original Position"
        },
        "new_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "New Name"
        },
        "data_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Data Type"
        },
        "data_type_change": {
          "default": false,
          "title": "Data Type Change",
          "type": "boolean"
        },
        "join_key": {
          "default": false,
          "title": "Join Key",
          "type": "boolean"
        },
        "is_altered": {
          "default": false,
          "title": "Is Altered",
          "type": "boolean"
        },
        "position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Position"
        },
        "is_available": {
          "default": true,
          "title": "Is Available",
          "type": "boolean"
        },
        "keep": {
          "default": true,
          "title": "Keep",
          "type": "boolean"
        }
      },
      "required": [
        "old_name"
      ],
      "title": "SelectInput",
      "type": "object"
    }
  },
  "description": "Settings for a node that selects, renames, and reorders columns.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    },
    "keep_missing": {
      "default": true,
      "title": "Keep Missing",
      "type": "boolean"
    },
    "select_input": {
      "items": {
        "$ref": "#/$defs/SelectInput"
      },
      "title": "Select Input",
      "type": "array"
    },
    "sorted_by": {
      "anyOf": [
        {
          "enum": [
            "none",
            "asc",
            "desc"
          ],
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "none",
      "title": "Sorted By"
    }
  },
  "required": [
    "flow_id",
    "node_id"
  ],
  "title": "NodeSelect",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)
  • keep_missing (bool)
  • select_input (list[SelectInput])
  • sorted_by (Literal['none', 'asc', 'desc'] | None)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 393-422
class NodeSelect(NodeSingleInput):
    """Settings for a node that selects, renames, and reorders columns."""

    keep_missing: bool = True
    select_input: list[transform_schema.SelectInput] = Field(default_factory=list)
    sorted_by: Literal["none", "asc", "desc"] | None = "none"

    def to_yaml_dict(self) -> NodeSelectYaml:
        """Converts the select node settings to a dictionary for YAML serialization."""
        result: NodeSelectYaml = {
            "cache_results": bool(self.cache_results),
            "keep_missing": self.keep_missing,
            "select_input": [s.to_yaml_dict() for s in self.select_input],
            "sorted_by": self.sorted_by,
        }
        if self.output_field_config:
            result["output_field_config"] = {
                "enabled": self.output_field_config.enabled,
                "validation_mode_behavior": self.output_field_config.validation_mode_behavior,
                "validate_data_types": self.output_field_config.validate_data_types,
                "fields": [
                    {
                        "name": f.name,
                        "data_type": f.data_type,
                        "default_value": f.default_value,
                    }
                    for f in self.output_field_config.fields
                ],
            }
        return result
to_yaml_dict()

Converts the select node settings to a dictionary for YAML serialization.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 400-422
def to_yaml_dict(self) -> NodeSelectYaml:
    """Converts the select node settings to a dictionary for YAML serialization."""
    result: NodeSelectYaml = {
        "cache_results": bool(self.cache_results),
        "keep_missing": self.keep_missing,
        "select_input": [s.to_yaml_dict() for s in self.select_input],
        "sorted_by": self.sorted_by,
    }
    if self.output_field_config:
        result["output_field_config"] = {
            "enabled": self.output_field_config.enabled,
            "validation_mode_behavior": self.output_field_config.validation_mode_behavior,
            "validate_data_types": self.output_field_config.validate_data_types,
            "fields": [
                {
                    "name": f.name,
                    "data_type": f.data_type,
                    "default_value": f.default_value,
                }
                for f in self.output_field_config.fields
            ],
        }
    return result
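
A hedged sketch tying the pieces together: build a selection (rename one column, drop another) and serialize it with to_yaml_dict(). It assumes SelectInput is importable via flowfile_core.schemas.transform_schema; column names are illustrative:

from flowfile_core.schemas import transform_schema
from flowfile_core.schemas.input_schema import NodeSelect

select_node = NodeSelect(
    flow_id=1,
    node_id=6,
    depending_on_id=5,
    keep_missing=False,  # drop columns that are not listed below
    select_input=[
        transform_schema.SelectInput(old_name="cust_name", new_name="customer_name"),
        transform_schema.SelectInput(old_name="internal_id", keep=False),
    ],
)
yaml_ready = select_node.to_yaml_dict()  # plain dict, ready for YAML serialization
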
NodeSingleInput pydantic-model

Bases: NodeBase

A base model for any node that takes a single data input.

Show JSON schema:
{
  "$defs": {
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "A base model for any node that takes a single data input.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    }
  },
  "required": [
    "flow_id",
    "node_id"
  ],
  "title": "NodeSingleInput",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 381-384
class NodeSingleInput(NodeBase):
    """A base model for any node that takes a single data input."""

    depending_on_id: int | None = -1
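
Every concrete single-input node documented above inherits this one upstream link; a small sketch of the default, using NodeSample purely as an example subclass:

from flowfile_core.schemas.input_schema import NodeSample

node = NodeSample(flow_id=1, node_id=7)
assert node.depending_on_id == -1  # -1 signals "no upstream node connected yet"
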
NodeSort pydantic-model

Bases: NodeSingleInput

Settings for a node that sorts the data by one or more columns.

Show JSON schema:
{
  "$defs": {
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    },
    "SortByInput": {
      "description": "Defines a single sort condition on a column, including the direction.",
      "properties": {
        "column": {
          "title": "Column",
          "type": "string"
        },
        "how": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": "asc",
          "title": "How"
        }
      },
      "required": [
        "column"
      ],
      "title": "SortByInput",
      "type": "object"
    }
  },
  "description": "Settings for a node that sorts the data by one or more columns.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    },
    "sort_input": {
      "items": {
        "$ref": "#/$defs/SortByInput"
      },
      "title": "Sort Input",
      "type": "array"
    }
  },
  "required": [
    "flow_id",
    "node_id"
  ],
  "title": "NodeSort",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)
  • sort_input (list[SortByInput])

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py, lines 431-434
class NodeSort(NodeSingleInput):
    """Settings for a node that sorts the data by one or more columns."""

    sort_input: list[transform_schema.SortByInput] = Field(default_factory=list)
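
A hedged sketch of a two-level sort, assuming SortByInput is importable via flowfile_core.schemas.transform_schema; column names are illustrative:

from flowfile_core.schemas import transform_schema
from flowfile_core.schemas.input_schema import NodeSort

sort_node = NodeSort(
    flow_id=1,
    node_id=8,
    depending_on_id=7,
    sort_input=[
        transform_schema.SortByInput(column="order_date", how="desc"),
        transform_schema.SortByInput(column="customer_id"),  # "asc" is the default direction
    ],
)
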
NodeTextToRows pydantic-model

Bases: NodeSingleInput

Settings for a node that splits a text column into multiple rows.

Show JSON schema:
{
  "$defs": {
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    },
    "TextToRowsInput": {
      "description": "Defines settings for splitting a text column into multiple rows based on a delimiter.",
      "properties": {
        "column_to_split": {
          "title": "Column To Split",
          "type": "string"
        },
        "output_column_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Output Column Name"
        },
        "split_by_fixed_value": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": true,
          "title": "Split By Fixed Value"
        },
        "split_fixed_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": ",",
          "title": "Split Fixed Value"
        },
        "split_by_column": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Split By Column"
        }
      },
      "required": [
        "column_to_split"
      ],
      "title": "TextToRowsInput",
      "type": "object"
    }
  },
  "description": "Settings for a node that splits a text column into multiple rows.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    },
    "text_to_rows_input": {
      "$ref": "#/$defs/TextToRowsInput"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "text_to_rows_input"
  ],
  "title": "NodeTextToRows",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)
  • text_to_rows_input (TextToRowsInput)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 437-440)
class NodeTextToRows(NodeSingleInput):
    """Settings for a node that splits a text column into multiple rows."""

    text_to_rows_input: transform_schema.TextToRowsInput
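
A minimal usage sketch, assuming the models are importable from flowfile_core.schemas.input_schema and flowfile_core.schemas.transform_schema (import paths inferred from the source location shown above):

from flowfile_core.schemas import input_schema, transform_schema

# Split the comma-separated "tags" column into one row per tag
node = input_schema.NodeTextToRows(
    flow_id=1,
    node_id=5,
    text_to_rows_input=transform_schema.TextToRowsInput(
        column_to_split="tags",
        output_column_name="tag",  # optional new column name (schema default: null)
        split_fixed_value=",",     # delimiter used when split_by_fixed_value is true
    ),
)
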
NodeUnion pydantic-model

Bases: NodeMultiInput

Settings for a node that concatenates multiple data inputs.

Show JSON schema:
{
  "$defs": {
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    },
    "UnionInput": {
      "description": "Defines settings for a union (concatenation) operation.",
      "properties": {
        "mode": {
          "default": "relaxed",
          "enum": [
            "selective",
            "relaxed"
          ],
          "title": "Mode",
          "type": "string"
        }
      },
      "title": "UnionInput",
      "type": "object"
    }
  },
  "description": "Settings for a node that concatenates multiple data inputs.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_ids": {
      "anyOf": [
        {
          "items": {
            "type": "integer"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "title": "Depending On Ids"
    },
    "union_input": {
      "$ref": "#/$defs/UnionInput"
    }
  },
  "required": [
    "flow_id",
    "node_id"
  ],
  "title": "NodeUnion",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_ids (list[int] | None)
  • union_input (UnionInput)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 790-793)
class NodeUnion(NodeMultiInput):
    """Settings for a node that concatenates multiple data inputs."""

    union_input: transform_schema.UnionInput = Field(default_factory=transform_schema.UnionInput)
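
A minimal usage sketch, under the same assumed import paths as above:

from flowfile_core.schemas import input_schema, transform_schema

# Concatenate the outputs of nodes 2 and 3
node = input_schema.NodeUnion(
    flow_id=1,
    node_id=4,
    depending_on_ids=[2, 3],
    union_input=transform_schema.UnionInput(mode="relaxed"),  # "relaxed" (default) or "selective"
)
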
NodeUnique pydantic-model

Bases: NodeSingleInput

Settings for a node that returns the unique rows from the data.

Show JSON schema:
{
  "$defs": {
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    },
    "UniqueInput": {
      "description": "Defines settings for a uniqueness operation, specifying columns and which row to keep.",
      "properties": {
        "columns": {
          "anyOf": [
            {
              "items": {
                "type": "string"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Columns"
        },
        "strategy": {
          "default": "any",
          "enum": [
            "first",
            "last",
            "any",
            "none"
          ],
          "title": "Strategy",
          "type": "string"
        }
      },
      "title": "UniqueInput",
      "type": "object"
    }
  },
  "description": "Settings for a node that returns the unique rows from the data.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    },
    "unique_input": {
      "$ref": "#/$defs/UniqueInput"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "unique_input"
  ],
  "title": "NodeUnique",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)
  • unique_input (UniqueInput)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 872-875)
class NodeUnique(NodeSingleInput):
    """Settings for a node that returns the unique rows from the data."""

    unique_input: transform_schema.UniqueInput
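
A minimal usage sketch, under the same assumed import paths as above:

from flowfile_core.schemas import input_schema, transform_schema

# Keep one row per (customer_id, order_date) combination
node = input_schema.NodeUnique(
    flow_id=1,
    node_id=6,
    unique_input=transform_schema.UniqueInput(
        columns=["customer_id", "order_date"],  # columns to deduplicate on (schema default: null)
        strategy="any",  # one of "first", "last", "any", "none"
    ),
)
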
NodeUnpivot pydantic-model

Bases: NodeSingleInput

Settings for a node that unpivots data from a wide to a long format.

Show JSON schema:
{
  "$defs": {
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    },
    "UnpivotInput": {
      "description": "Defines settings for an unpivot (wide-to-long) operation.",
      "properties": {
        "index_columns": {
          "items": {
            "type": "string"
          },
          "title": "Index Columns",
          "type": "array"
        },
        "value_columns": {
          "items": {
            "type": "string"
          },
          "title": "Value Columns",
          "type": "array"
        },
        "data_type_selector": {
          "anyOf": [
            {
              "enum": [
                "float",
                "all",
                "date",
                "numeric",
                "string"
              ],
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Data Type Selector"
        },
        "data_type_selector_mode": {
          "default": "column",
          "enum": [
            "data_type",
            "column"
          ],
          "title": "Data Type Selector Mode",
          "type": "string"
        }
      },
      "title": "UnpivotInput",
      "type": "object"
    }
  },
  "description": "Settings for a node that unpivots data from a wide to a long format.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": -1,
      "title": "Depending On Id"
    },
    "unpivot_input": {
      "$ref": "#/$defs/UnpivotInput",
      "default": null
    }
  },
  "required": [
    "flow_id",
    "node_id"
  ],
  "title": "NodeUnpivot",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_id (int | None)
  • unpivot_input (UnpivotInput)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 784-787)
class NodeUnpivot(NodeSingleInput):
    """Settings for a node that unpivots data from a wide to a long format."""

    unpivot_input: transform_schema.UnpivotInput = None
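
A minimal usage sketch, under the same assumed import paths as above:

from flowfile_core.schemas import input_schema, transform_schema

# Unpivot the monthly value columns into long format, keeping "region" as the index
node = input_schema.NodeUnpivot(
    flow_id=1,
    node_id=7,
    unpivot_input=transform_schema.UnpivotInput(
        index_columns=["region"],
        value_columns=["jan", "feb", "mar"],
    ),
)
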
OutputCsvTable pydantic-model

Bases: BaseModel

Defines settings for writing a CSV file.

Show JSON schema:
{
  "description": "Defines settings for writing a CSV file.",
  "properties": {
    "file_type": {
      "const": "csv",
      "default": "csv",
      "enum": [
        "csv"
      ],
      "title": "File Type",
      "type": "string"
    },
    "delimiter": {
      "default": ",",
      "title": "Delimiter",
      "type": "string"
    },
    "encoding": {
      "default": "utf-8",
      "title": "Encoding",
      "type": "string"
    }
  },
  "title": "OutputCsvTable",
  "type": "object"
}

Fields:

  • file_type (Literal['csv'])
  • delimiter (str)
  • encoding (str)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 244-249)
class OutputCsvTable(BaseModel):
    """Defines settings for writing a CSV file."""

    file_type: Literal["csv"] = "csv"
    delimiter: str = ","
    encoding: str = "utf-8"
OutputExcelTable pydantic-model

Bases: BaseModel

Defines settings for writing an Excel file.

Show JSON schema:
{
  "description": "Defines settings for writing an Excel file.",
  "properties": {
    "file_type": {
      "const": "excel",
      "default": "excel",
      "enum": [
        "excel"
      ],
      "title": "File Type",
      "type": "string"
    },
    "sheet_name": {
      "default": "Sheet1",
      "title": "Sheet Name",
      "type": "string"
    }
  },
  "title": "OutputExcelTable",
  "type": "object"
}

Fields:

  • file_type (Literal['excel'])
  • sheet_name (str)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 258-262)
class OutputExcelTable(BaseModel):
    """Defines settings for writing an Excel file."""

    file_type: Literal["excel"] = "excel"
    sheet_name: str = "Sheet1"
OutputFieldConfig pydantic-model

Bases: BaseModel

Configuration for output field validation and transformation behavior.

Show JSON schema:
{
  "$defs": {
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "Configuration for output field validation and transformation behavior.",
  "properties": {
    "enabled": {
      "default": false,
      "title": "Enabled",
      "type": "boolean"
    },
    "validation_mode_behavior": {
      "default": "select_only",
      "enum": [
        "add_missing",
        "add_missing_keep_extra",
        "raise_on_missing",
        "select_only"
      ],
      "title": "Validation Mode Behavior",
      "type": "string"
    },
    "fields": {
      "items": {
        "$ref": "#/$defs/OutputFieldInfo"
      },
      "title": "Fields",
      "type": "array"
    },
    "validate_data_types": {
      "default": false,
      "title": "Validate Data Types",
      "type": "boolean"
    }
  },
  "title": "OutputFieldConfig",
  "type": "object"
}

Fields:

  • enabled (bool)
  • validation_mode_behavior (Literal['add_missing', 'add_missing_keep_extra', 'raise_on_missing', 'select_only'])
  • fields (list[OutputFieldInfo])
  • validate_data_types (bool)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 92-103)
class OutputFieldConfig(BaseModel):
    """Configuration for output field validation and transformation behavior."""

    enabled: bool = False
    validation_mode_behavior: Literal[
        "add_missing",  # Add missing fields with defaults, remove extra columns
        "add_missing_keep_extra",  # Add missing fields with defaults, keep all incoming columns
        "raise_on_missing",  # Raise error if any fields are missing
        "select_only"  # Select only specified fields, skip missing silently
    ] = "select_only"
    fields: list[OutputFieldInfo] = Field(default_factory=list)
    validate_data_types: bool = False  # Enable data type validation without casting
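
A minimal usage sketch, assuming the model is importable from flowfile_core.schemas.input_schema (path inferred from the source location shown above):

from flowfile_core.schemas import input_schema

# Require these two output fields: add them with defaults when missing and
# drop any extra columns ("add_missing", per the inline comments above)
config = input_schema.OutputFieldConfig(
    enabled=True,
    validation_mode_behavior="add_missing",
    fields=[
        input_schema.OutputFieldInfo(name="customer_id", data_type="Int64"),
        input_schema.OutputFieldInfo(name="country", data_type="String", default_value="unknown"),
    ],
    validate_data_types=True,
)
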
OutputFieldInfo pydantic-model

Bases: BaseModel

Field information with optional default value for output field configuration.

Show JSON schema:
{
  "description": "Field information with optional default value for output field configuration.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "data_type": {
      "default": "String",
      "enum": [
        "Int8",
        "Int16",
        "Int32",
        "Int64",
        "UInt8",
        "UInt16",
        "UInt32",
        "UInt64",
        "Float32",
        "Float64",
        "Decimal",
        "String",
        "Date",
        "Datetime",
        "Time",
        "Duration",
        "Boolean",
        "Binary",
        "List",
        "Struct",
        "Array",
        "Integer",
        "Double",
        "Utf8"
      ],
      "title": "Data Type",
      "type": "string"
    },
    "default_value": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Default Value"
    }
  },
  "required": [
    "name"
  ],
  "title": "OutputFieldInfo",
  "type": "object"
}

Fields:

  • name (str)
  • data_type (DataTypeStr)
  • default_value (str | None)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 84-89)
class OutputFieldInfo(BaseModel):
    """Field information with optional default value for output field configuration."""

    name: str
    data_type: DataTypeStr = "String"
    default_value: str | None = None  # Can be a literal value or expression
OutputParquetTable pydantic-model

Bases: BaseModel

Defines settings for writing a Parquet file.

Show JSON schema:
{
  "description": "Defines settings for writing a Parquet file.",
  "properties": {
    "file_type": {
      "const": "parquet",
      "default": "parquet",
      "enum": [
        "parquet"
      ],
      "title": "File Type",
      "type": "string"
    }
  },
  "title": "OutputParquetTable",
  "type": "object"
}

Fields:

  • file_type (Literal['parquet'])
Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 252-255)
class OutputParquetTable(BaseModel):
    """Defines settings for writing a Parquet file."""

    file_type: Literal["parquet"] = "parquet"
OutputSettings pydantic-model

Bases: BaseModel

Defines the complete settings for an output node.

Show JSON schema:
{
  "$defs": {
    "OutputCsvTable": {
      "description": "Defines settings for writing a CSV file.",
      "properties": {
        "file_type": {
          "const": "csv",
          "default": "csv",
          "enum": [
            "csv"
          ],
          "title": "File Type",
          "type": "string"
        },
        "delimiter": {
          "default": ",",
          "title": "Delimiter",
          "type": "string"
        },
        "encoding": {
          "default": "utf-8",
          "title": "Encoding",
          "type": "string"
        }
      },
      "title": "OutputCsvTable",
      "type": "object"
    },
    "OutputExcelTable": {
      "description": "Defines settings for writing an Excel file.",
      "properties": {
        "file_type": {
          "const": "excel",
          "default": "excel",
          "enum": [
            "excel"
          ],
          "title": "File Type",
          "type": "string"
        },
        "sheet_name": {
          "default": "Sheet1",
          "title": "Sheet Name",
          "type": "string"
        }
      },
      "title": "OutputExcelTable",
      "type": "object"
    },
    "OutputParquetTable": {
      "description": "Defines settings for writing a Parquet file.",
      "properties": {
        "file_type": {
          "const": "parquet",
          "default": "parquet",
          "enum": [
            "parquet"
          ],
          "title": "File Type",
          "type": "string"
        }
      },
      "title": "OutputParquetTable",
      "type": "object"
    }
  },
  "description": "Defines the complete settings for an output node.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "directory": {
      "title": "Directory",
      "type": "string"
    },
    "file_type": {
      "title": "File Type",
      "type": "string"
    },
    "fields": {
      "anyOf": [
        {
          "items": {
            "type": "string"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "title": "Fields"
    },
    "write_mode": {
      "default": "overwrite",
      "title": "Write Mode",
      "type": "string"
    },
    "table_settings": {
      "discriminator": {
        "mapping": {
          "csv": "#/$defs/OutputCsvTable",
          "excel": "#/$defs/OutputExcelTable",
          "parquet": "#/$defs/OutputParquetTable"
        },
        "propertyName": "file_type"
      },
      "oneOf": [
        {
          "$ref": "#/$defs/OutputCsvTable"
        },
        {
          "$ref": "#/$defs/OutputParquetTable"
        },
        {
          "$ref": "#/$defs/OutputExcelTable"
        }
      ],
      "title": "Table Settings"
    },
    "abs_file_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Abs File Path"
    }
  },
  "required": [
    "name",
    "directory",
    "file_type",
    "table_settings"
  ],
  "title": "OutputSettings",
  "type": "object"
}

Fields:

  • name (str)
  • directory (str)
  • file_type (str)
  • fields (list[str] | None)
  • write_mode (str)
  • table_settings (OutputTableSettings)
  • abs_file_path (str | None)

Validators:

  • validate_table_settings
  • populate_abs_file_path

Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 271-346)
class OutputSettings(BaseModel):
    """Defines the complete settings for an output node."""

    name: str
    directory: str
    file_type: str  # This drives which table_settings to use
    fields: list[str] | None = Field(default_factory=list)
    write_mode: str = "overwrite"
    table_settings: OutputTableSettings
    abs_file_path: str | None = None

    def to_yaml_dict(self) -> OutputSettingsYaml:
        """Converts the output settings to a dictionary suitable for YAML serialization."""
        result: OutputSettingsYaml = {
            "name": self.name,
            "directory": self.directory,
            "file_type": self.file_type,
            "write_mode": self.write_mode,
        }
        if self.abs_file_path:
            result["abs_file_path"] = self.abs_file_path
        if self.fields:
            result["fields"] = self.fields
        # Only include table_settings if it has non-default values beyond file_type
        ts_dict = self.table_settings.model_dump(exclude={"file_type"})
        if any(v for v in ts_dict.values()):  # Has meaningful settings
            result["table_settings"] = ts_dict
        return result

    @property
    def sheet_name(self) -> str | None:
        if self.file_type == "excel":
            return self.table_settings.sheet_name

    @property
    def delimiter(self) -> str | None:
        if self.file_type == "csv":
            return self.table_settings.delimiter

    @field_validator("table_settings", mode="before")
    @classmethod
    def validate_table_settings(cls, v, info: ValidationInfo):
        """Ensures table_settings matches the file_type."""
        if v is None:
            file_type = info.data.get("file_type", "csv")
            # Create default based on file_type
            match file_type:
                case "csv":
                    return OutputCsvTable()
                case "parquet":
                    return OutputParquetTable()
                case "excel":
                    return OutputExcelTable()
                case _:
                    return OutputCsvTable()

        # If it's a dict, add file_type if missing
        if isinstance(v, dict) and "file_type" not in v:
            v["file_type"] = info.data.get("file_type", "csv")

        return v

    def set_absolute_filepath(self):
        """Resolves the output directory and name into an absolute path."""
        base_path = Path(self.directory)
        if not base_path.is_absolute():
            base_path = Path.cwd() / base_path
        if self.name and self.name not in base_path.name:
            base_path = base_path / self.name
        self.abs_file_path = str(base_path.resolve())

    @model_validator(mode="after")
    def populate_abs_file_path(self):
        """Ensures the absolute file path is populated after validation."""
        self.set_absolute_filepath()
        return self
populate_abs_file_path() pydantic-validator

Ensures the absolute file path is populated after validation.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 342-346)
@model_validator(mode="after")
def populate_abs_file_path(self):
    """Ensures the absolute file path is populated after validation."""
    self.set_absolute_filepath()
    return self
set_absolute_filepath()

Resolves the output directory and name into an absolute path.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 333-340)
def set_absolute_filepath(self):
    """Resolves the output directory and name into an absolute path."""
    base_path = Path(self.directory)
    if not base_path.is_absolute():
        base_path = Path.cwd() / base_path
    if self.name and self.name not in base_path.name:
        base_path = base_path / self.name
    self.abs_file_path = str(base_path.resolve())
to_yaml_dict()

Converts the output settings to a dictionary suitable for YAML serialization.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 282-298)
def to_yaml_dict(self) -> OutputSettingsYaml:
    """Converts the output settings to a dictionary suitable for YAML serialization."""
    result: OutputSettingsYaml = {
        "name": self.name,
        "directory": self.directory,
        "file_type": self.file_type,
        "write_mode": self.write_mode,
    }
    if self.abs_file_path:
        result["abs_file_path"] = self.abs_file_path
    if self.fields:
        result["fields"] = self.fields
    # Only include table_settings if it has non-default values beyond file_type
    ts_dict = self.table_settings.model_dump(exclude={"file_type"})
    if any(v for v in ts_dict.values()):  # Has meaningful settings
        result["table_settings"] = ts_dict
    return result
validate_table_settings(v, info) pydantic-validator

Ensures table_settings matches the file_type.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 310-331)
@field_validator("table_settings", mode="before")
@classmethod
def validate_table_settings(cls, v, info: ValidationInfo):
    """Ensures table_settings matches the file_type."""
    if v is None:
        file_type = info.data.get("file_type", "csv")
        # Create default based on file_type
        match file_type:
            case "csv":
                return OutputCsvTable()
            case "parquet":
                return OutputParquetTable()
            case "excel":
                return OutputExcelTable()
            case _:
                return OutputCsvTable()

    # If it's a dict, add file_type if missing
    if isinstance(v, dict) and "file_type" not in v:
        v["file_type"] = info.data.get("file_type", "csv")

    return v
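
A minimal usage sketch (assumed import path as above); the validators shown above fill in table_settings defaults and resolve the absolute output path automatically:

from flowfile_core.schemas import input_schema

out = input_schema.OutputSettings(
    name="customers.csv",
    directory="exports",
    file_type="csv",
    table_settings=input_schema.OutputCsvTable(delimiter=";"),
)
print(out.abs_file_path)   # e.g. /home/user/project/exports/customers.csv
print(out.delimiter)       # ";" via the csv-only convenience property
print(out.to_yaml_dict())  # YAML-ready dict including the table settings
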
RawData pydantic-model

Bases: BaseModel

Represents data in a raw, columnar format for manual input.

Show JSON schema:
{
  "$defs": {
    "MinimalFieldInfo": {
      "description": "Represents the most basic information about a data field (column).",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "title": "Data Type",
          "type": "string"
        }
      },
      "required": [
        "name"
      ],
      "title": "MinimalFieldInfo",
      "type": "object"
    }
  },
  "description": "Represents data in a raw, columnar format for manual input.",
  "properties": {
    "columns": {
      "default": null,
      "items": {
        "$ref": "#/$defs/MinimalFieldInfo"
      },
      "title": "Columns",
      "type": "array"
    },
    "data": {
      "items": {
        "items": {},
        "type": "array"
      },
      "title": "Data",
      "type": "array"
    }
  },
  "required": [
    "data"
  ],
  "title": "RawData",
  "type": "object"
}

Fields:

  • columns (list[MinimalFieldInfo])
  • data (list[list])

Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 570-599)
class RawData(BaseModel):
    """Represents data in a raw, columnar format for manual input."""

    columns: list[MinimalFieldInfo] = None
    data: list[list]

    @classmethod
    def from_pylist(cls, pylist: list[dict]):
        """Creates a RawData object from a list of Python dictionaries."""
        if len(pylist) == 0:
            return cls(columns=[], data=[])
        pylist = ensure_similarity_dicts(pylist)
        values = [standardize_col_dtype([vv for vv in c]) for c in zip(*(r.values() for r in pylist), strict=False)]
        data_types = (pl.DataType.from_python(type(next((v for v in column_values), None))) for column_values in values)
        columns = [MinimalFieldInfo(name=c, data_type=str(next(data_types))) for c in pylist[0].keys()]
        return cls(columns=columns, data=values)

    @classmethod
    def from_pydict(cls, pydict: dict[str, list]):
        """Creates a RawData object from a dictionary of lists."""
        if len(pydict) == 0:
            return cls(columns=[], data=[])
        values = [standardize_col_dtype(column_values) for column_values in pydict.values()]
        data_types = (pl.DataType.from_python(type(next((v for v in column_values), None))) for column_values in values)
        columns = [MinimalFieldInfo(name=c, data_type=str(next(data_types))) for c in pydict.keys()]
        return cls(columns=columns, data=values)

    def to_pylist(self) -> list[dict]:
        """Converts the RawData object back into a list of Python dictionaries."""
        return [{c.name: self.data[ci][ri] for ci, c in enumerate(self.columns)} for ri in range(len(self.data[0]))]
from_pydict(pydict) classmethod

Creates a RawData object from a dictionary of lists.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 587-595)
@classmethod
def from_pydict(cls, pydict: dict[str, list]):
    """Creates a RawData object from a dictionary of lists."""
    if len(pydict) == 0:
        return cls(columns=[], data=[])
    values = [standardize_col_dtype(column_values) for column_values in pydict.values()]
    data_types = (pl.DataType.from_python(type(next((v for v in column_values), None))) for column_values in values)
    columns = [MinimalFieldInfo(name=c, data_type=str(next(data_types))) for c in pydict.keys()]
    return cls(columns=columns, data=values)
from_pylist(pylist) classmethod

Creates a RawData object from a list of Python dictionaries.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 576-585)
@classmethod
def from_pylist(cls, pylist: list[dict]):
    """Creates a RawData object from a list of Python dictionaries."""
    if len(pylist) == 0:
        return cls(columns=[], data=[])
    pylist = ensure_similarity_dicts(pylist)
    values = [standardize_col_dtype([vv for vv in c]) for c in zip(*(r.values() for r in pylist), strict=False)]
    data_types = (pl.DataType.from_python(type(next((v for v in column_values), None))) for column_values in values)
    columns = [MinimalFieldInfo(name=c, data_type=str(next(data_types))) for c in pylist[0].keys()]
    return cls(columns=columns, data=values)
to_pylist()

Converts the RawData object back into a list of Python dictionaries.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 597-599)
def to_pylist(self) -> list[dict]:
    """Converts the RawData object back into a list of Python dictionaries."""
    return [{c.name: self.data[ci][ri] for ci, c in enumerate(self.columns)} for ri in range(len(self.data[0]))]
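
A round-trip sketch using the classmethods shown above (assumed import path as before):

from flowfile_core.schemas import input_schema

raw = input_schema.RawData.from_pydict({"id": [1, 2], "name": ["ann", "bob"]})
print([c.name for c in raw.columns])  # ['id', 'name']
print(raw.to_pylist())                # [{'id': 1, 'name': 'ann'}, {'id': 2, 'name': 'bob'}]
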
ReceivedTable pydantic-model

Bases: BaseModel

Model for defining a table received from an external source.

Show JSON schema:
{
  "$defs": {
    "InputCsvTable": {
      "description": "Defines settings for reading a CSV file.",
      "properties": {
        "file_type": {
          "const": "csv",
          "default": "csv",
          "enum": [
            "csv"
          ],
          "title": "File Type",
          "type": "string"
        },
        "reference": {
          "default": "",
          "title": "Reference",
          "type": "string"
        },
        "starting_from_line": {
          "default": 0,
          "title": "Starting From Line",
          "type": "integer"
        },
        "delimiter": {
          "default": ",",
          "title": "Delimiter",
          "type": "string"
        },
        "has_headers": {
          "default": true,
          "title": "Has Headers",
          "type": "boolean"
        },
        "encoding": {
          "default": "utf-8",
          "title": "Encoding",
          "type": "string"
        },
        "parquet_ref": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Parquet Ref"
        },
        "row_delimiter": {
          "default": "\n",
          "title": "Row Delimiter",
          "type": "string"
        },
        "quote_char": {
          "default": "\"",
          "title": "Quote Char",
          "type": "string"
        },
        "infer_schema_length": {
          "default": 10000,
          "title": "Infer Schema Length",
          "type": "integer"
        },
        "truncate_ragged_lines": {
          "default": false,
          "title": "Truncate Ragged Lines",
          "type": "boolean"
        },
        "ignore_errors": {
          "default": false,
          "title": "Ignore Errors",
          "type": "boolean"
        }
      },
      "title": "InputCsvTable",
      "type": "object"
    },
    "InputExcelTable": {
      "description": "Defines settings for reading an Excel file.",
      "properties": {
        "file_type": {
          "const": "excel",
          "default": "excel",
          "enum": [
            "excel"
          ],
          "title": "File Type",
          "type": "string"
        },
        "sheet_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Sheet Name"
        },
        "start_row": {
          "default": 0,
          "title": "Start Row",
          "type": "integer"
        },
        "start_column": {
          "default": 0,
          "title": "Start Column",
          "type": "integer"
        },
        "end_row": {
          "default": 0,
          "title": "End Row",
          "type": "integer"
        },
        "end_column": {
          "default": 0,
          "title": "End Column",
          "type": "integer"
        },
        "has_headers": {
          "default": true,
          "title": "Has Headers",
          "type": "boolean"
        },
        "type_inference": {
          "default": false,
          "title": "Type Inference",
          "type": "boolean"
        }
      },
      "title": "InputExcelTable",
      "type": "object"
    },
    "InputJsonTable": {
      "description": "Defines settings for reading a JSON file.",
      "properties": {
        "file_type": {
          "const": "json",
          "default": "json",
          "enum": [
            "json"
          ],
          "title": "File Type",
          "type": "string"
        },
        "reference": {
          "default": "",
          "title": "Reference",
          "type": "string"
        },
        "starting_from_line": {
          "default": 0,
          "title": "Starting From Line",
          "type": "integer"
        },
        "delimiter": {
          "default": ",",
          "title": "Delimiter",
          "type": "string"
        },
        "has_headers": {
          "default": true,
          "title": "Has Headers",
          "type": "boolean"
        },
        "encoding": {
          "default": "utf-8",
          "title": "Encoding",
          "type": "string"
        },
        "parquet_ref": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Parquet Ref"
        },
        "row_delimiter": {
          "default": "\n",
          "title": "Row Delimiter",
          "type": "string"
        },
        "quote_char": {
          "default": "\"",
          "title": "Quote Char",
          "type": "string"
        },
        "infer_schema_length": {
          "default": 10000,
          "title": "Infer Schema Length",
          "type": "integer"
        },
        "truncate_ragged_lines": {
          "default": false,
          "title": "Truncate Ragged Lines",
          "type": "boolean"
        },
        "ignore_errors": {
          "default": false,
          "title": "Ignore Errors",
          "type": "boolean"
        }
      },
      "title": "InputJsonTable",
      "type": "object"
    },
    "InputParquetTable": {
      "description": "Defines settings for reading a Parquet file.",
      "properties": {
        "file_type": {
          "const": "parquet",
          "default": "parquet",
          "enum": [
            "parquet"
          ],
          "title": "File Type",
          "type": "string"
        }
      },
      "title": "InputParquetTable",
      "type": "object"
    },
    "MinimalFieldInfo": {
      "description": "Represents the most basic information about a data field (column).",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "title": "Data Type",
          "type": "string"
        }
      },
      "required": [
        "name"
      ],
      "title": "MinimalFieldInfo",
      "type": "object"
    }
  },
  "description": "Model for defining a table received from an external source.",
  "properties": {
    "id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Id"
    },
    "name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Name"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "directory": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Directory"
    },
    "analysis_file_available": {
      "default": false,
      "title": "Analysis File Available",
      "type": "boolean"
    },
    "status": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Status"
    },
    "fields": {
      "items": {
        "$ref": "#/$defs/MinimalFieldInfo"
      },
      "title": "Fields",
      "type": "array"
    },
    "abs_file_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Abs File Path"
    },
    "file_type": {
      "enum": [
        "csv",
        "json",
        "parquet",
        "excel"
      ],
      "title": "File Type",
      "type": "string"
    },
    "table_settings": {
      "discriminator": {
        "mapping": {
          "csv": "#/$defs/InputCsvTable",
          "excel": "#/$defs/InputExcelTable",
          "json": "#/$defs/InputJsonTable",
          "parquet": "#/$defs/InputParquetTable"
        },
        "propertyName": "file_type"
      },
      "oneOf": [
        {
          "$ref": "#/$defs/InputCsvTable"
        },
        {
          "$ref": "#/$defs/InputJsonTable"
        },
        {
          "$ref": "#/$defs/InputParquetTable"
        },
        {
          "$ref": "#/$defs/InputExcelTable"
        }
      ],
      "title": "Table Settings"
    }
  },
  "required": [
    "path",
    "file_type",
    "table_settings"
  ],
  "title": "ReceivedTable",
  "type": "object"
}

Fields:

  • id (int | None)
  • name (str | None)
  • path (str)
  • directory (str | None)
  • analysis_file_available (bool)
  • status (str | None)
  • fields (list[MinimalFieldInfo])
  • abs_file_path (str | None)
  • file_type (Literal['csv', 'json', 'parquet', 'excel'])
  • table_settings (InputTableSettings)

Validators:

  • set_default_table_settings
  • populate_abs_file_path

Source code in flowfile_core/flowfile_core/schemas/input_schema.py (lines 173-241)
class ReceivedTable(BaseModel):
    """Model for defining a table received from an external source."""

    # Metadata fields
    id: int | None = None
    name: str | None = None
    path: str  # This can be an absolute or relative path
    directory: str | None = None
    analysis_file_available: bool = False
    status: str | None = None
    fields: list[MinimalFieldInfo] = Field(default_factory=list)
    abs_file_path: str | None = None

    file_type: Literal["csv", "json", "parquet", "excel"]

    table_settings: InputTableSettings

    @classmethod
    def create_from_path(cls, path: str, file_type: Literal["csv", "json", "parquet", "excel"] = "csv"):
        """Creates an instance from a file path string."""
        filename = Path(path).name

        # Create appropriate table_settings based on file_type
        settings_map = {
            "csv": InputCsvTable(),
            "json": InputJsonTable(),
            "parquet": InputParquetTable(),
            "excel": InputExcelTable(),
        }

        return cls(
            name=filename, path=path, file_type=file_type, table_settings=settings_map.get(file_type, InputCsvTable())
        )

    @property
    def file_path(self) -> str:
        """Constructs the full file path from the directory and name."""
        if self.name and self.name not in self.path:
            return os.path.join(self.path, self.name)
        else:
            return self.path

    def set_absolute_filepath(self):
        """Resolves the path to an absolute file path."""
        base_path = Path(self.path).expanduser()
        if not base_path.is_absolute():
            base_path = Path.cwd() / base_path
        if self.name and self.name not in base_path.name:
            base_path = base_path / self.name
        self.abs_file_path = str(base_path.resolve())

    @model_validator(mode="before")
    @classmethod
    def set_default_table_settings(cls, data):
        """Create default table_settings based on file_type if not provided."""
        if isinstance(data, dict):
            if "table_settings" not in data or data["table_settings"] is None:
                data["table_settings"] = {}

            if isinstance(data["table_settings"], dict) and "file_type" not in data["table_settings"]:
                data["table_settings"]["file_type"] = data.get("file_type", "csv")
        return data

    @model_validator(mode="after")
    def populate_abs_file_path(self):
        """Ensures the absolute file path is populated after validation."""
        if not self.abs_file_path:
            self.set_absolute_filepath()
        return self
file_path property

Constructs the full file path from the directory and name.

create_from_path(path, file_type='csv') classmethod

Creates an instance from a file path string.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
@classmethod
def create_from_path(cls, path: str, file_type: Literal["csv", "json", "parquet", "excel"] = "csv"):
    """Creates an instance from a file path string."""
    filename = Path(path).name

    # Create appropriate table_settings based on file_type
    settings_map = {
        "csv": InputCsvTable(),
        "json": InputJsonTable(),
        "parquet": InputParquetTable(),
        "excel": InputExcelTable(),
    }

    return cls(
        name=filename, path=path, file_type=file_type, table_settings=settings_map.get(file_type, InputCsvTable())
    )
populate_abs_file_path() pydantic-validator

Ensures the absolute file path is populated after validation.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
@model_validator(mode="after")
def populate_abs_file_path(self):
    """Ensures the absolute file path is populated after validation."""
    if not self.abs_file_path:
        self.set_absolute_filepath()
    return self
set_absolute_filepath()

Resolves the path to an absolute file path.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
def set_absolute_filepath(self):
    """Resolves the path to an absolute file path."""
    base_path = Path(self.path).expanduser()
    if not base_path.is_absolute():
        base_path = Path.cwd() / base_path
    if self.name and self.name not in base_path.name:
        base_path = base_path / self.name
    self.abs_file_path = str(base_path.resolve())
set_default_table_settings(data) pydantic-validator

Create default table_settings based on file_type if not provided.

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
@model_validator(mode="before")
@classmethod
def set_default_table_settings(cls, data):
    """Create default table_settings based on file_type if not provided."""
    if isinstance(data, dict):
        if "table_settings" not in data or data["table_settings"] is None:
            data["table_settings"] = {}

        if isinstance(data["table_settings"], dict) and "file_type" not in data["table_settings"]:
            data["table_settings"]["file_type"] = data.get("file_type", "csv")
    return data
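Example

A minimal usage sketch (assuming the import path flowfile_core.schemas.input_schema from the source location above; the file path is illustrative):

from flowfile_core.schemas.input_schema import ReceivedTable

# create_from_path picks the matching table_settings for the given file_type
table = ReceivedTable.create_from_path("data/sales.parquet", file_type="parquet")

print(table.name)           # "sales.parquet" (file-name portion of the path)
print(table.file_type)      # "parquet"
print(table.abs_file_path)  # absolute path, filled in by populate_abs_file_path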
RemoveItem pydantic-model

Bases: BaseModel

Represents a single item to be removed from a directory or list.

Show JSON schema:
{
  "description": "Represents a single item to be removed from a directory or list.",
  "properties": {
    "path": {
      "title": "Path",
      "type": "string"
    },
    "id": {
      "default": -1,
      "title": "Id",
      "type": "integer"
    }
  },
  "required": [
    "path"
  ],
  "title": "RemoveItem",
  "type": "object"
}

Fields:

  • path (str)
  • id (int)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class RemoveItem(BaseModel):
    """Represents a single item to be removed from a directory or list."""

    path: str
    id: int = -1
RemoveItemsInput pydantic-model

Bases: BaseModel

Defines a list of items to be removed.

Show JSON schema:
{
  "$defs": {
    "RemoveItem": {
      "description": "Represents a single item to be removed from a directory or list.",
      "properties": {
        "path": {
          "title": "Path",
          "type": "string"
        },
        "id": {
          "default": -1,
          "title": "Id",
          "type": "integer"
        }
      },
      "required": [
        "path"
      ],
      "title": "RemoveItem",
      "type": "object"
    }
  },
  "description": "Defines a list of items to be removed.",
  "properties": {
    "paths": {
      "items": {
        "$ref": "#/$defs/RemoveItem"
      },
      "title": "Paths",
      "type": "array"
    },
    "source_path": {
      "title": "Source Path",
      "type": "string"
    }
  },
  "required": [
    "paths",
    "source_path"
  ],
  "title": "RemoveItemsInput",
  "type": "object"
}

Fields:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class RemoveItemsInput(BaseModel):
    """Defines a list of items to be removed."""

    paths: list[RemoveItem]
    source_path: str
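Example

A small sketch of building a removal request (paths are illustrative; model_dump assumes Pydantic v2, which the validators in this module use):

from flowfile_core.schemas.input_schema import RemoveItem, RemoveItemsInput

payload = RemoveItemsInput(
    paths=[RemoveItem(path="flows/old_flow.flowfile"), RemoveItem(path="flows/tmp", id=42)],
    source_path="flows",
)
print(payload.model_dump())  # plain dict representation of the request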
SampleUsers pydantic-model

Bases: ExternalSource

Settings for generating a sample dataset of users.

Show JSON schema:
{
  "$defs": {
    "MinimalFieldInfo": {
      "description": "Represents the most basic information about a data field (column).",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "title": "Data Type",
          "type": "string"
        }
      },
      "required": [
        "name"
      ],
      "title": "MinimalFieldInfo",
      "type": "object"
    }
  },
  "description": "Settings for generating a sample dataset of users.",
  "properties": {
    "orientation": {
      "default": "row",
      "title": "Orientation",
      "type": "string"
    },
    "fields": {
      "anyOf": [
        {
          "items": {
            "$ref": "#/$defs/MinimalFieldInfo"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Fields"
    },
    "SAMPLE_USERS": {
      "title": "Sample Users",
      "type": "boolean"
    },
    "class_name": {
      "default": "sample_users",
      "title": "Class Name",
      "type": "string"
    },
    "size": {
      "default": 100,
      "title": "Size",
      "type": "integer"
    }
  },
  "required": [
    "SAMPLE_USERS"
  ],
  "title": "SampleUsers",
  "type": "object"
}

Fields:

  • orientation (str)
  • fields (list[MinimalFieldInfo] | None)
  • SAMPLE_USERS (bool)
  • class_name (str)
  • size (int)
Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class SampleUsers(ExternalSource):
    """Settings for generating a sample dataset of users."""

    SAMPLE_USERS: bool
    class_name: str = "sample_users"
    size: int = 100
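Example

A sketch of configuring the sample-users source; this only builds the settings object, attaching it to a flow happens elsewhere (for example via add_external_source):

from flowfile_core.schemas.input_schema import SampleUsers

source = SampleUsers(SAMPLE_USERS=True, size=250)  # SAMPLE_USERS is the only required field
print(source.class_name)  # "sample_users"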
UserDefinedNode pydantic-model

Bases: NodeMultiInput

Settings for a node that contains the user defined node information

Show JSON schema:
{
  "$defs": {
    "OutputFieldConfig": {
      "description": "Configuration for output field validation and transformation behavior.",
      "properties": {
        "enabled": {
          "default": false,
          "title": "Enabled",
          "type": "boolean"
        },
        "validation_mode_behavior": {
          "default": "select_only",
          "enum": [
            "add_missing",
            "add_missing_keep_extra",
            "raise_on_missing",
            "select_only"
          ],
          "title": "Validation Mode Behavior",
          "type": "string"
        },
        "fields": {
          "items": {
            "$ref": "#/$defs/OutputFieldInfo"
          },
          "title": "Fields",
          "type": "array"
        },
        "validate_data_types": {
          "default": false,
          "title": "Validate Data Types",
          "type": "boolean"
        }
      },
      "title": "OutputFieldConfig",
      "type": "object"
    },
    "OutputFieldInfo": {
      "description": "Field information with optional default value for output field configuration.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "default": "String",
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "title": "Data Type",
          "type": "string"
        },
        "default_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Default Value"
        }
      },
      "required": [
        "name"
      ],
      "title": "OutputFieldInfo",
      "type": "object"
    }
  },
  "description": "Settings for a node that contains the user defined node information",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "cache_results": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Cache Results"
    },
    "pos_x": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos X"
    },
    "pos_y": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": 0,
      "title": "Pos Y"
    },
    "is_setup": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Is Setup"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "",
      "title": "Description"
    },
    "node_reference": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Reference"
    },
    "user_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "User Id"
    },
    "is_flow_output": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is Flow Output"
    },
    "is_user_defined": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Is User Defined"
    },
    "output_field_config": {
      "anyOf": [
        {
          "$ref": "#/$defs/OutputFieldConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "depending_on_ids": {
      "anyOf": [
        {
          "items": {
            "type": "integer"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "title": "Depending On Ids"
    },
    "settings": {
      "title": "Settings"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "settings"
  ],
  "title": "UserDefinedNode",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • cache_results (bool | None)
  • pos_x (float | None)
  • pos_y (float | None)
  • is_setup (bool | None)
  • description (str | None)
  • node_reference (str | None)
  • user_id (int | None)
  • is_flow_output (bool | None)
  • is_user_defined (bool | None)
  • output_field_config (OutputFieldConfig | None)
  • depending_on_ids (list[int] | None)
  • settings (Any)

Validators:

Source code in flowfile_core/flowfile_core/schemas/input_schema.py
class UserDefinedNode(NodeMultiInput):
    """Settings for a node that contains the user defined node information"""

    settings: Any
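Example

A sketch of a UserDefinedNode payload; the settings field is typed as Any, so the dictionary below is purely hypothetical and depends on the custom node's own schema:

from flowfile_core.schemas.input_schema import UserDefinedNode

node = UserDefinedNode(
    flow_id=1,
    node_id=7,
    depending_on_ids=[5, 6],      # upstream node ids feeding this node
    settings={"threshold": 0.5},  # hypothetical payload for a custom node
)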

transform_schema

flowfile_core.schemas.transform_schema

Classes:

Name Description
AggColl

A data class that represents a single aggregation operation for a group by operation.

BasicFilter

Defines a simple, single-condition filter (e.g., 'column' 'equals' 'value').

CrossJoinInput

Data model for cross join operations.

CrossJoinInputManager

Manager for cross join operations.

FieldInput

Represents a single field with its name and data type, typically for defining an output column.

FilterInput

Defines the settings for a filter operation, supporting basic or advanced (expression-based) modes.

FilterOperator

Supported filter comparison operators.

FullJoinKeyResponse

Holds the join key rename responses for both sides of a join.

FunctionInput

Defines a formula to be applied, including the output field information.

FuzzyMatchInput

Data model for fuzzy matching join operations.

FuzzyMatchInputManager

Manager for fuzzy matching join operations.

GraphSolverInput

Defines settings for a graph-solving operation (e.g., finding connected components).

GroupByInput

A data class that represents the input for a group by operation.

JoinInput

Data model for standard SQL-style join operations.

JoinInputManager

Manager for standard SQL-style join operations.

JoinInputs

Data model for join-specific select inputs (extends SelectInputs).

JoinInputsManager

Manager for join-specific operations, extends SelectInputsManager.

JoinKeyRename

Represents the renaming of a join key from its original to a temporary name.

JoinKeyRenameResponse

Contains a list of join key renames for one side of a join.

JoinMap

Defines a single mapping between a left and right column for a join key.

JoinSelectManagerMixin

Mixin providing common methods for join-like operations.

PivotInput

Defines the settings for a pivot (long-to-wide) operation.

PolarsCodeInput

A simple container for a string of user-provided Polars code to be executed.

RecordIdInput

Defines settings for adding a record ID (row number) column to the data.

SelectInput

Defines how a single column should be selected, renamed, or type-cast.

SelectInputs

A container for a list of SelectInput objects (pure data, no logic).

SelectInputsManager

Manager class that provides all query and mutation operations.

SortByInput

Defines a single sort condition on a column, including the direction.

TextToRowsInput

Defines settings for splitting a text column into multiple rows based on a delimiter.

UnionInput

Defines settings for a union (concatenation) operation.

UniqueInput

Defines settings for a uniqueness operation, specifying columns and which row to keep.

UnpivotInput

Defines settings for an unpivot (wide-to-long) operation.

Functions:

Name Description
construct_join_key_name

Creates a temporary, unique name for a join key column.

get_func_type_mapping

Infers the output data type of common aggregation functions.

string_concat

A simple wrapper to concatenate string columns in Polars.

AggColl pydantic-model

Bases: BaseModel

A data class that represents a single aggregation operation for a group by operation.

Attributes:

Name Type Description
old_name str

The name of the column in the original DataFrame to be aggregated.

agg str

The aggregation function to use. This can be a string representing a built-in function or a custom function.

new_name str | None

The name of the resulting aggregated column in the output DataFrame. If not provided, it will default to the old_name appended with the aggregation function.

output_type str | None

The type of the output values of the aggregation. If not provided, it is inferred from the aggregation function using the get_func_type_mapping function.

Example

agg_col = AggColl(old_name='col1', agg='sum', new_name='sum_col1', output_type='float')

Show JSON schema:
{
  "description": "A data class that represents a single aggregation operation for a group by operation.\n\nAttributes\n----------\nold_name : str\n    The name of the column in the original DataFrame to be aggregated.\n\nagg : str\n    The aggregation function to use. This can be a string representing a built-in function or a custom function.\n\nnew_name : Optional[str]\n    The name of the resulting aggregated column in the output DataFrame. If not provided, it will default to the\n    old_name appended with the aggregation function.\n\noutput_type : Optional[str]\n    The type of the output values of the aggregation. If not provided, it is inferred from the aggregation function\n    using the `get_func_type_mapping` function.\n\nExample\n--------\nagg_col = AggColl(\n    old_name='col1',\n    agg='sum',\n    new_name='sum_col1',\n    output_type='float'\n)",
  "properties": {
    "old_name": {
      "title": "Old Name",
      "type": "string"
    },
    "agg": {
      "title": "Agg",
      "type": "string"
    },
    "new_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "New Name"
    },
    "output_type": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Output Type"
    }
  },
  "required": [
    "old_name",
    "agg"
  ],
  "title": "AggColl",
  "type": "object"
}

Fields:

  • old_name (str)
  • agg (str)
  • new_name (str | None)
  • output_type (str | None)

Validators:

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class AggColl(BaseModel):
    """
    A data class that represents a single aggregation operation for a group by operation.

    Attributes
    ----------
    old_name : str
        The name of the column in the original DataFrame to be aggregated.

    agg : str
        The aggregation function to use. This can be a string representing a built-in function or a custom function.

    new_name : Optional[str]
        The name of the resulting aggregated column in the output DataFrame. If not provided, it will default to the
        old_name appended with the aggregation function.

    output_type : Optional[str]
        The type of the output values of the aggregation. If not provided, it is inferred from the aggregation function
        using the `get_func_type_mapping` function.

    Example
    --------
    agg_col = AggColl(
        old_name='col1',
        agg='sum',
        new_name='sum_col1',
        output_type='float'
    )
    """

    old_name: str
    agg: str
    new_name: str | None = None
    output_type: str | None = None

    def __init__(self, old_name: str, agg: str, new_name: str | None = None, output_type: str | None = None):
        data = {"old_name": old_name, "agg": agg}
        if new_name is not None:
            data["new_name"] = new_name
        if output_type is not None:
            data["output_type"] = output_type

        super().__init__(**data)

    @model_validator(mode="after")
    def set_defaults(self):
        """Set default new_name and output_type based on agg function."""
        # Set new_name
        if self.new_name is None:
            if self.agg != "groupby":
                self.new_name = self.old_name + "_" + self.agg
            else:
                self.new_name = self.old_name

        # Set output_type
        if self.output_type is None:
            self.output_type = get_func_type_mapping(self.agg)

        # Ensure old_name is a string
        self.old_name = str(self.old_name)

        return self

    @property
    def agg_func(self):
        """Returns the corresponding Polars aggregation function from the `agg` string."""
        if self.agg == "groupby":
            return self.agg
        elif self.agg == "concat":
            return string_concat
        else:
            return getattr(pl, self.agg) if isinstance(self.agg, str) else self.agg
agg_func property

Returns the corresponding Polars aggregation function from the agg string.

set_defaults() pydantic-validator

Set default new_name and output_type based on agg function.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@model_validator(mode="after")
def set_defaults(self):
    """Set default new_name and output_type based on agg function."""
    # Set new_name
    if self.new_name is None:
        if self.agg != "groupby":
            self.new_name = self.old_name + "_" + self.agg
        else:
            self.new_name = self.old_name

    # Set output_type
    if self.output_type is None:
        self.output_type = get_func_type_mapping(self.agg)

    # Ensure old_name is a string
    self.old_name = str(self.old_name)

    return self
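Example

A sketch of how the set_defaults validator fills in missing values (assuming the import path flowfile_core.schemas.transform_schema from the source location above):

from flowfile_core.schemas.transform_schema import AggColl

group_key = AggColl(old_name="region", agg="groupby")  # new_name stays "region" for groupby
total = AggColl(old_name="revenue", agg="sum")         # new_name defaults to "revenue_sum"
print(total.output_type)                               # inferred via get_func_type_mapping("sum")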
BasicFilter pydantic-model

Bases: BaseModel

Defines a simple, single-condition filter (e.g., 'column' 'equals' 'value').

Attributes:

Name Type Description
field str

The column name to filter on.

operator FilterOperator | str

The comparison operator (FilterOperator enum value or symbol).

value str

The value to compare against.

value2 str | None

Second value for BETWEEN operator (optional).

Show JSON schema:
{
  "$defs": {
    "FilterOperator": {
      "description": "Supported filter comparison operators.",
      "enum": [
        "equals",
        "not_equals",
        "greater_than",
        "greater_than_or_equals",
        "less_than",
        "less_than_or_equals",
        "contains",
        "not_contains",
        "starts_with",
        "ends_with",
        "is_null",
        "is_not_null",
        "in",
        "not_in",
        "between"
      ],
      "title": "FilterOperator",
      "type": "string"
    }
  },
  "description": "Defines a simple, single-condition filter (e.g., 'column' 'equals' 'value').\n\nAttributes:\n    field: The column name to filter on.\n    operator: The comparison operator (FilterOperator enum value or symbol).\n    value: The value to compare against.\n    value2: Second value for BETWEEN operator (optional).",
  "properties": {
    "field": {
      "default": "",
      "title": "Field",
      "type": "string"
    },
    "operator": {
      "anyOf": [
        {
          "$ref": "#/$defs/FilterOperator"
        },
        {
          "type": "string"
        }
      ],
      "default": "equals",
      "title": "Operator"
    },
    "value": {
      "default": "",
      "title": "Value",
      "type": "string"
    },
    "value2": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Value2"
    },
    "filter_type": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Filter Type"
    },
    "filter_value": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Filter Value"
    }
  },
  "title": "BasicFilter",
  "type": "object"
}

Fields:

  • field (str)
  • operator (FilterOperator | str)
  • value (str)
  • value2 (str | None)
  • filter_type (str | None)
  • filter_value (str | None)

Validators:

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class BasicFilter(BaseModel):
    """Defines a simple, single-condition filter (e.g., 'column' 'equals' 'value').

    Attributes:
        field: The column name to filter on.
        operator: The comparison operator (FilterOperator enum value or symbol).
        value: The value to compare against.
        value2: Second value for BETWEEN operator (optional).
    """

    field: str = ""
    operator: FilterOperator | str = FilterOperator.EQUALS
    value: str = ""
    value2: str | None = None  # For BETWEEN operator

    # Keep old field names for backward compatibility
    filter_type: str | None = None
    filter_value: str | None = None

    def __init__(
        self,
        field: str = None,
        operator: FilterOperator | str = None,
        value: str = None,
        value2: str = None,
        # Backward compatibility parameters
        filter_type: str = None,
        filter_value: str = None,
        **data,
    ):
        # Handle backward compatibility
        if filter_type is not None and operator is None:
            data["operator"] = filter_type
        elif operator is not None:
            data["operator"] = operator

        if filter_value is not None and value is None:
            data["value"] = filter_value
        elif value is not None:
            data["value"] = value

        if field is not None:
            data["field"] = field
        if value2 is not None:
            data["value2"] = value2

        super().__init__(**data)

    @model_validator(mode="after")
    def normalize_operator(self):
        """Normalize the operator to FilterOperator enum."""
        if isinstance(self.operator, str):
            try:
                self.operator = FilterOperator.from_symbol(self.operator)
            except ValueError:
                # Keep as string if conversion fails (for backward compat)
                pass
        return self

    def get_operator(self) -> FilterOperator:
        """Get the operator as FilterOperator enum."""
        if isinstance(self.operator, FilterOperator):
            return self.operator
        return FilterOperator.from_symbol(self.operator)

    def to_yaml_dict(self) -> BasicFilterYaml:
        """Serialize for YAML output."""
        result: BasicFilterYaml = {
            "field": self.field,
            "operator": self.operator.value if isinstance(self.operator, FilterOperator) else self.operator,
            "value": self.value,
        }
        if self.value2:
            result["value2"] = self.value2
        return result

    @classmethod
    def from_yaml_dict(cls, data: dict) -> "BasicFilter":
        """Load from YAML format."""
        return cls(
            field=data.get("field", ""),
            operator=data.get("operator", FilterOperator.EQUALS),
            value=data.get("value", ""),
            value2=data.get("value2"),
        )
from_yaml_dict(data) classmethod

Load from YAML format.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@classmethod
def from_yaml_dict(cls, data: dict) -> "BasicFilter":
    """Load from YAML format."""
    return cls(
        field=data.get("field", ""),
        operator=data.get("operator", FilterOperator.EQUALS),
        value=data.get("value", ""),
        value2=data.get("value2"),
    )
get_operator()

Get the operator as FilterOperator enum.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_operator(self) -> FilterOperator:
    """Get the operator as FilterOperator enum."""
    if isinstance(self.operator, FilterOperator):
        return self.operator
    return FilterOperator.from_symbol(self.operator)
normalize_operator() pydantic-validator

Normalize the operator to FilterOperator enum.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@model_validator(mode="after")
def normalize_operator(self):
    """Normalize the operator to FilterOperator enum."""
    if isinstance(self.operator, str):
        try:
            self.operator = FilterOperator.from_symbol(self.operator)
        except ValueError:
            # Keep as string if conversion fails (for backward compat)
            pass
    return self
to_yaml_dict()

Serialize for YAML output.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def to_yaml_dict(self) -> BasicFilterYaml:
    """Serialize for YAML output."""
    result: BasicFilterYaml = {
        "field": self.field,
        "operator": self.operator.value if isinstance(self.operator, FilterOperator) else self.operator,
        "value": self.value,
    }
    if self.value2:
        result["value2"] = self.value2
    return result
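Example

A sketch showing the current and the backward-compatible constructor arguments; the exact printed values depend on the FilterOperator symbol mapping:

from flowfile_core.schemas.transform_schema import BasicFilter, FilterOperator

f_new = BasicFilter(field="country", operator=FilterOperator.EQUALS, value="NL")
f_old = BasicFilter(field="country", filter_type="equals", filter_value="NL")  # mapped onto operator/value

print(f_new.get_operator())  # 'equals' (FilterOperator.EQUALS)
print(f_old.to_yaml_dict())  # e.g. {'field': 'country', 'operator': 'equals', 'value': 'NL'}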
CrossJoinInput pydantic-model

Bases: BaseModel

Data model for cross join operations.

Show JSON schema:
{
  "$defs": {
    "JoinInputs": {
      "description": "Data model for join-specific select inputs (extends SelectInputs).",
      "properties": {
        "renames": {
          "items": {
            "$ref": "#/$defs/SelectInput"
          },
          "title": "Renames",
          "type": "array"
        }
      },
      "title": "JoinInputs",
      "type": "object"
    },
    "SelectInput": {
      "description": "Defines how a single column should be selected, renamed, or type-cast.\n\nThis is a core building block for any operation that involves column manipulation.\nIt holds all the configuration for a single field in a selection operation.",
      "properties": {
        "old_name": {
          "title": "Old Name",
          "type": "string"
        },
        "original_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Original Position"
        },
        "new_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "New Name"
        },
        "data_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Data Type"
        },
        "data_type_change": {
          "default": false,
          "title": "Data Type Change",
          "type": "boolean"
        },
        "join_key": {
          "default": false,
          "title": "Join Key",
          "type": "boolean"
        },
        "is_altered": {
          "default": false,
          "title": "Is Altered",
          "type": "boolean"
        },
        "position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Position"
        },
        "is_available": {
          "default": true,
          "title": "Is Available",
          "type": "boolean"
        },
        "keep": {
          "default": true,
          "title": "Keep",
          "type": "boolean"
        }
      },
      "required": [
        "old_name"
      ],
      "title": "SelectInput",
      "type": "object"
    }
  },
  "description": "Data model for cross join operations.",
  "properties": {
    "left_select": {
      "$ref": "#/$defs/JoinInputs"
    },
    "right_select": {
      "$ref": "#/$defs/JoinInputs"
    }
  },
  "required": [
    "left_select",
    "right_select"
  ],
  "title": "CrossJoinInput",
  "type": "object"
}

Fields:

Validators:

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class CrossJoinInput(BaseModel):
    """Data model for cross join operations."""

    left_select: JoinInputs
    right_select: JoinInputs

    @model_validator(mode="before")
    @classmethod
    def parse_inputs(cls, data: Any) -> Any:
        """Parse flexible input formats before validation."""
        if isinstance(data, dict):
            # Parse join_mapping
            if "join_mapping" in data:
                data["join_mapping"] = cls._parse_join_mapping(data["join_mapping"])

            # Parse left_select
            if "left_select" in data:
                data["left_select"] = cls._parse_select(data["left_select"])

            # Parse right_select
            if "right_select" in data:
                data["right_select"] = cls._parse_select(data["right_select"])

        return data

    @staticmethod
    def _parse_join_mapping(join_mapping: Any) -> list[JoinMap]:
        """Parse various join_mapping formats."""
        # Already a list of JoinMaps
        if isinstance(join_mapping, list):
            result = []
            for jm in join_mapping:
                if isinstance(jm, JoinMap):
                    result.append(jm)
                elif isinstance(jm, dict):
                    result.append(JoinMap(**jm))
                elif isinstance(jm, (tuple, list)) and len(jm) == 2:
                    result.append(JoinMap(left_col=jm[0], right_col=jm[1]))
                elif isinstance(jm, str):
                    result.append(JoinMap(left_col=jm, right_col=jm))
                else:
                    raise ValueError(f"Invalid join mapping item: {jm}")
            return result

        # Single JoinMap
        if isinstance(join_mapping, JoinMap):
            return [join_mapping]

        # String: same column on both sides
        if isinstance(join_mapping, str):
            return [JoinMap(left_col=join_mapping, right_col=join_mapping)]

        # Tuple: (left, right)
        if isinstance(join_mapping, tuple) and len(join_mapping) == 2:
            return [JoinMap(left_col=join_mapping[0], right_col=join_mapping[1])]

        raise ValueError(f"Invalid join_mapping format: {type(join_mapping)}")

    @staticmethod
    def _parse_select(select: Any) -> JoinInputs:
        """Parse various select input formats."""
        # Already JoinInputs
        if isinstance(select, JoinInputs):
            return select

        # List of SelectInput objects
        if isinstance(select, list):
            if all(isinstance(s, SelectInput) for s in select):
                return JoinInputs(renames=select)
            elif all(isinstance(s, str) for s in select):
                return JoinInputs(renames=[SelectInput(old_name=s) for s in select])
            elif all(isinstance(s, dict) for s in select):
                return JoinInputs(renames=[SelectInput(**s) for s in select])

        # Dict with 'select' (new YAML) or 'renames' (internal) key
        if isinstance(select, dict):
            if "select" in select:
                return JoinInputs(renames=[SelectInput.from_yaml_dict(s) for s in select["select"]])
            if "renames" in select:
                return JoinInputs(**select)

        raise ValueError(f"Invalid select format: {type(select)}")

    def __init__(
        self,
        left_select: JoinInputs | list[SelectInput] | list[str] = None,
        right_select: JoinInputs | list[SelectInput] | list[str] = None,
        **data,
    ):
        """Custom init for backward compatibility with positional arguments."""
        if left_select is not None:
            data["left_select"] = left_select
        if right_select is not None:
            data["right_select"] = right_select
        super().__init__(**data)

    def to_yaml_dict(self) -> CrossJoinInputYaml:
        """Serialize for YAML output."""
        return {
            "left_select": self.left_select.to_yaml_dict(),
            "right_select": self.right_select.to_yaml_dict(),
        }

    def add_new_select_column(self, select_input: SelectInput, side: str) -> None:
        """Adds a new column to the selection for either the left or right side."""
        target_input = self.right_select if side == "right" else self.left_select
        if select_input.new_name is None:
            select_input.new_name = select_input.old_name
        target_input.renames.append(select_input)
__init__(left_select=None, right_select=None, **data)

Custom init for backward compatibility with positional arguments.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def __init__(
    self,
    left_select: JoinInputs | list[SelectInput] | list[str] = None,
    right_select: JoinInputs | list[SelectInput] | list[str] = None,
    **data,
):
    """Custom init for backward compatibility with positional arguments."""
    if left_select is not None:
        data["left_select"] = left_select
    if right_select is not None:
        data["right_select"] = right_select
    super().__init__(**data)
add_new_select_column(select_input, side)

Adds a new column to the selection for either the left or right side.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def add_new_select_column(self, select_input: SelectInput, side: str) -> None:
    """Adds a new column to the selection for either the left or right side."""
    target_input = self.right_select if side == "right" else self.left_select
    if select_input.new_name is None:
        select_input.new_name = select_input.old_name
    target_input.renames.append(select_input)
parse_inputs(data) pydantic-validator

Parse flexible input formats before validation.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@model_validator(mode="before")
@classmethod
def parse_inputs(cls, data: Any) -> Any:
    """Parse flexible input formats before validation."""
    if isinstance(data, dict):
        # Parse join_mapping
        if "join_mapping" in data:
            data["join_mapping"] = cls._parse_join_mapping(data["join_mapping"])

        # Parse left_select
        if "left_select" in data:
            data["left_select"] = cls._parse_select(data["left_select"])

        # Parse right_select
        if "right_select" in data:
            data["right_select"] = cls._parse_select(data["right_select"])

    return data
to_yaml_dict()

Serialize for YAML output.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def to_yaml_dict(self) -> CrossJoinInputYaml:
    """Serialize for YAML output."""
    return {
        "left_select": self.left_select.to_yaml_dict(),
        "right_select": self.right_select.to_yaml_dict(),
    }
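Example

A sketch relying on the parse_inputs validator, which accepts plain lists of column names and wraps them in JoinInputs (column names are illustrative):

from flowfile_core.schemas.transform_schema import CrossJoinInput, SelectInput

cross_join = CrossJoinInput(
    left_select=["customer_id", "customer_name"],
    right_select=["product_id", "price"],
)
cross_join.add_new_select_column(SelectInput(old_name="currency"), side="right")  # extra right-hand column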
CrossJoinInputManager

Bases: JoinSelectManagerMixin

Manager for cross join operations.

Methods:

Name Description
auto_rename

Automatically renames columns on the right side to prevent naming conflicts.

create

Factory method to create CrossJoinInput from various input formats.

get_overlapping_records

Finds column names that would conflict after the join.

to_cross_join_input

Creates a new CrossJoinInput instance based on the current manager settings.

Attributes:

Name Type Description
left_select JoinInputsManager

Backward compatibility: Access left_manager as left_select.

overlapping_records set[str]

Backward compatibility: Returns overlapping column names.

right_select JoinInputsManager

Backward compatibility: Access right_manager as right_select.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class CrossJoinInputManager(JoinSelectManagerMixin):
    """Manager for cross join operations."""

    def __init__(self, cross_join_input: CrossJoinInput):
        self.input = deepcopy(cross_join_input)
        self.left_manager = JoinInputsManager(self.input.left_select)
        self.right_manager = JoinInputsManager(self.input.right_select)

    @classmethod
    def create(
        cls, left_select: list[SelectInput] | list[str], right_select: list[SelectInput] | list[str]
    ) -> "CrossJoinInputManager":
        """Factory method to create CrossJoinInput from various input formats."""
        left_inputs = cls.parse_select(left_select)
        right_inputs = cls.parse_select(right_select)

        cross_join = CrossJoinInput(left_select=left_inputs, right_select=right_inputs)
        return cls(cross_join)

    def get_overlapping_records(self) -> set[str]:
        """Finds column names that would conflict after the join."""
        return self.get_overlapping_columns()

    def auto_rename(self, rename_mode: Literal["suffix", "prefix"] = "prefix") -> None:
        """Automatically renames columns on the right side to prevent naming conflicts."""
        overlapping_records = self.get_overlapping_records()

        while len(overlapping_records) > 0:
            for right_col in self.input.right_select.renames:
                if right_col.new_name in overlapping_records:
                    if rename_mode == "prefix":
                        right_col.new_name = "right_" + right_col.new_name
                    elif rename_mode == "suffix":
                        right_col.new_name = right_col.new_name + "_right"
                    else:
                        raise ValueError(f"Unknown rename_mode: {rename_mode}")
            overlapping_records = self.get_overlapping_records()

    # === Backward Compatibility Properties ===

    @property
    def left_select(self) -> JoinInputsManager:
        """Backward compatibility: Access left_manager as left_select."""
        return self.left_manager

    @property
    def right_select(self) -> JoinInputsManager:
        """Backward compatibility: Access right_manager as right_select."""
        return self.right_manager

    @property
    def overlapping_records(self) -> set[str]:
        """Backward compatibility: Returns overlapping column names."""
        return self.get_overlapping_records()

    def to_cross_join_input(self) -> CrossJoinInput:
        """Creates a new CrossJoinInput instance based on the current manager settings.

        This is useful when you've modified the manager (e.g., via auto_rename) and
        want to get a fresh CrossJoinInput with all the current settings applied.

        Returns:
            A new CrossJoinInput instance with current settings
        """
        return CrossJoinInput(
            left_select=JoinInputs(renames=self.input.left_select.renames.copy()),
            right_select=JoinInputs(renames=self.input.right_select.renames.copy()),
        )
left_select property

Backward compatibility: Access left_manager as left_select.

overlapping_records property

Backward compatibility: Returns overlapping column names.

right_select property

Backward compatibility: Access right_manager as right_select.

auto_rename(rename_mode='prefix')

Automatically renames columns on the right side to prevent naming conflicts.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def auto_rename(self, rename_mode: Literal["suffix", "prefix"] = "prefix") -> None:
    """Automatically renames columns on the right side to prevent naming conflicts."""
    overlapping_records = self.get_overlapping_records()

    while len(overlapping_records) > 0:
        for right_col in self.input.right_select.renames:
            if right_col.new_name in overlapping_records:
                if rename_mode == "prefix":
                    right_col.new_name = "right_" + right_col.new_name
                elif rename_mode == "suffix":
                    right_col.new_name = right_col.new_name + "_right"
                else:
                    raise ValueError(f"Unknown rename_mode: {rename_mode}")
        overlapping_records = self.get_overlapping_records()
create(left_select, right_select) classmethod

Factory method to create CrossJoinInput from various input formats.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@classmethod
def create(
    cls, left_select: list[SelectInput] | list[str], right_select: list[SelectInput] | list[str]
) -> "CrossJoinInputManager":
    """Factory method to create CrossJoinInput from various input formats."""
    left_inputs = cls.parse_select(left_select)
    right_inputs = cls.parse_select(right_select)

    cross_join = CrossJoinInput(left_select=left_inputs, right_select=right_inputs)
    return cls(cross_join)
get_overlapping_records()

Finds column names that would conflict after the join.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_overlapping_records(self) -> set[str]:
    """Finds column names that would conflict after the join."""
    return self.get_overlapping_columns()
to_cross_join_input()

Creates a new CrossJoinInput instance based on the current manager settings.

This is useful when you've modified the manager (e.g., via auto_rename) and want to get a fresh CrossJoinInput with all the current settings applied.

Returns:

Type Description
CrossJoinInput

A new CrossJoinInput instance with current settings

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def to_cross_join_input(self) -> CrossJoinInput:
    """Creates a new CrossJoinInput instance based on the current manager settings.

    This is useful when you've modified the manager (e.g., via auto_rename) and
    want to get a fresh CrossJoinInput with all the current settings applied.

    Returns:
        A new CrossJoinInput instance with current settings
    """
    return CrossJoinInput(
        left_select=JoinInputs(renames=self.input.left_select.renames.copy()),
        right_select=JoinInputs(renames=self.input.right_select.renames.copy()),
    )
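Example

A sketch of resolving name conflicts before a cross join; the rename behavior follows auto_rename above and the column names are illustrative:

from flowfile_core.schemas.transform_schema import CrossJoinInputManager

manager = CrossJoinInputManager.create(
    left_select=["id", "name"],
    right_select=["id", "price"],   # "id" also exists on the left side
)
manager.auto_rename(rename_mode="prefix")   # prefixes conflicting right-hand columns with "right_"
cross_join = manager.to_cross_join_input()  # fresh CrossJoinInput with the renames applied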
FieldInput pydantic-model

Bases: BaseModel

Represents a single field with its name and data type, typically for defining an output column.

Show JSON schema:
{
  "$defs": {
    "DataType": {
      "description": "Specific data types for fine-grained control.",
      "enum": [
        "Int8",
        "Int16",
        "Int32",
        "Int64",
        "UInt8",
        "UInt16",
        "UInt32",
        "UInt64",
        "Float32",
        "Float64",
        "Decimal",
        "String",
        "Categorical",
        "Date",
        "Datetime",
        "Time",
        "Duration",
        "Boolean",
        "Binary",
        "List",
        "Struct",
        "Array"
      ],
      "title": "DataType",
      "type": "string"
    }
  },
  "description": "Represents a single field with its name and data type, typically for defining an output column.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "data_type": {
      "anyOf": [
        {
          "$ref": "#/$defs/DataType"
        },
        {
          "const": "Auto",
          "enum": [
            "Auto"
          ],
          "type": "string"
        },
        {
          "enum": [
            "Int8",
            "Int16",
            "Int32",
            "Int64",
            "UInt8",
            "UInt16",
            "UInt32",
            "UInt64",
            "Float32",
            "Float64",
            "Decimal",
            "String",
            "Date",
            "Datetime",
            "Time",
            "Duration",
            "Boolean",
            "Binary",
            "List",
            "Struct",
            "Array",
            "Integer",
            "Double",
            "Utf8"
          ],
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "Auto",
      "title": "Data Type"
    }
  },
  "required": [
    "name"
  ],
  "title": "FieldInput",
  "type": "object"
}

Fields:

  • name (str)
  • data_type (DataType | Literal['Auto'] | DataTypeStr | None)
Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class FieldInput(BaseModel):
    """Represents a single field with its name and data type, typically for defining an output column."""

    name: str
    data_type: DataType | Literal["Auto"] | DataTypeStr | None = AUTO_DATA_TYPE
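Example

A sketch of the two ways to set data_type: the "Auto" default versus an explicit name from the DataType list above (field names are illustrative):

from flowfile_core.schemas.transform_schema import FieldInput

inferred = FieldInput(name="order_date")                   # data_type defaults to "Auto"
explicit = FieldInput(name="amount", data_type="Float64")  # explicit type name from the DataType enum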
FilterInput pydantic-model

Bases: BaseModel

Defines the settings for a filter operation, supporting basic or advanced (expression-based) modes.

Attributes:

Name Type Description
mode FilterModeLiteral

The filter mode - "basic" or "advanced".

basic_filter BasicFilter | None

The basic filter configuration (used when mode="basic").

advanced_filter str

The advanced filter expression string (used when mode="advanced").

Show JSON schema:
{
  "$defs": {
    "BasicFilter": {
      "description": "Defines a simple, single-condition filter (e.g., 'column' 'equals' 'value').\n\nAttributes:\n    field: The column name to filter on.\n    operator: The comparison operator (FilterOperator enum value or symbol).\n    value: The value to compare against.\n    value2: Second value for BETWEEN operator (optional).",
      "properties": {
        "field": {
          "default": "",
          "title": "Field",
          "type": "string"
        },
        "operator": {
          "anyOf": [
            {
              "$ref": "#/$defs/FilterOperator"
            },
            {
              "type": "string"
            }
          ],
          "default": "equals",
          "title": "Operator"
        },
        "value": {
          "default": "",
          "title": "Value",
          "type": "string"
        },
        "value2": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Value2"
        },
        "filter_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Filter Type"
        },
        "filter_value": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Filter Value"
        }
      },
      "title": "BasicFilter",
      "type": "object"
    },
    "FilterOperator": {
      "description": "Supported filter comparison operators.",
      "enum": [
        "equals",
        "not_equals",
        "greater_than",
        "greater_than_or_equals",
        "less_than",
        "less_than_or_equals",
        "contains",
        "not_contains",
        "starts_with",
        "ends_with",
        "is_null",
        "is_not_null",
        "in",
        "not_in",
        "between"
      ],
      "title": "FilterOperator",
      "type": "string"
    }
  },
  "description": "Defines the settings for a filter operation, supporting basic or advanced (expression-based) modes.\n\nAttributes:\n    mode: The filter mode - \"basic\" or \"advanced\".\n    basic_filter: The basic filter configuration (used when mode=\"basic\").\n    advanced_filter: The advanced filter expression string (used when mode=\"advanced\").",
  "properties": {
    "mode": {
      "default": "basic",
      "enum": [
        "basic",
        "advanced"
      ],
      "title": "Mode",
      "type": "string"
    },
    "basic_filter": {
      "anyOf": [
        {
          "$ref": "#/$defs/BasicFilter"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "advanced_filter": {
      "default": "",
      "title": "Advanced Filter",
      "type": "string"
    },
    "filter_type": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Filter Type"
    }
  },
  "title": "FilterInput",
  "type": "object"
}

Fields:

  • mode (FilterModeLiteral)
  • basic_filter (BasicFilter | None)
  • advanced_filter (str)
  • filter_type (str | None)

Validators:

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class FilterInput(BaseModel):
    """Defines the settings for a filter operation, supporting basic or advanced (expression-based) modes.

    Attributes:
        mode: The filter mode - "basic" or "advanced".
        basic_filter: The basic filter configuration (used when mode="basic").
        advanced_filter: The advanced filter expression string (used when mode="advanced").
    """

    mode: FilterModeLiteral = "basic"
    basic_filter: BasicFilter | None = None
    advanced_filter: str = ""

    # Keep old field name for backward compatibility
    filter_type: str | None = None

    def __init__(
        self,
        mode: FilterModeLiteral = None,
        basic_filter: BasicFilter = None,
        advanced_filter: str = None,
        # Backward compatibility
        filter_type: str = None,
        **data,
    ):
        # Handle backward compatibility: filter_type -> mode
        if filter_type is not None and mode is None:
            data["mode"] = filter_type
        elif mode is not None:
            data["mode"] = mode

        if advanced_filter is not None:
            data["advanced_filter"] = advanced_filter
        if basic_filter is not None:
            data["basic_filter"] = basic_filter

        super().__init__(**data)

    @model_validator(mode="after")
    def ensure_basic_filter(self):
        """Ensure basic_filter exists when mode is basic."""
        if self.mode == "basic" and self.basic_filter is None:
            self.basic_filter = BasicFilter()
        return self

    def is_advanced(self) -> bool:
        """Check if filter is in advanced mode."""
        return self.mode == "advanced"

    def to_yaml_dict(self) -> FilterInputYaml:
        """Serialize for YAML output."""
        result: FilterInputYaml = {"mode": self.mode}
        if self.mode == "basic" and self.basic_filter:
            result["basic_filter"] = self.basic_filter.to_yaml_dict()
        elif self.mode == "advanced" and self.advanced_filter:
            result["advanced_filter"] = self.advanced_filter
        return result

    @classmethod
    def from_yaml_dict(cls, data: dict) -> "FilterInput":
        """Load from YAML format."""
        mode = data.get("mode", "basic")
        basic_filter = None
        if "basic_filter" in data:
            basic_filter = BasicFilter.from_yaml_dict(data["basic_filter"])
        return cls(
            mode=mode,
            basic_filter=basic_filter,
            advanced_filter=data.get("advanced_filter", ""),
        )
ensure_basic_filter() pydantic-validator

Ensure basic_filter exists when mode is basic.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 405-410
@model_validator(mode="after")
def ensure_basic_filter(self):
    """Ensure basic_filter exists when mode is basic."""
    if self.mode == "basic" and self.basic_filter is None:
        self.basic_filter = BasicFilter()
    return self
from_yaml_dict(data) classmethod

Load from YAML format.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 425-436
@classmethod
def from_yaml_dict(cls, data: dict) -> "FilterInput":
    """Load from YAML format."""
    mode = data.get("mode", "basic")
    basic_filter = None
    if "basic_filter" in data:
        basic_filter = BasicFilter.from_yaml_dict(data["basic_filter"])
    return cls(
        mode=mode,
        basic_filter=basic_filter,
        advanced_filter=data.get("advanced_filter", ""),
    )
is_advanced()

Check if filter is in advanced mode.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 412-414
def is_advanced(self) -> bool:
    """Check if filter is in advanced mode."""
    return self.mode == "advanced"
to_yaml_dict()

Serialize for YAML output.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 416-423
def to_yaml_dict(self) -> FilterInputYaml:
    """Serialize for YAML output."""
    result: FilterInputYaml = {"mode": self.mode}
    if self.mode == "basic" and self.basic_filter:
        result["basic_filter"] = self.basic_filter.to_yaml_dict()
    elif self.mode == "advanced" and self.advanced_filter:
        result["advanced_filter"] = self.advanced_filter
    return result
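Example (a minimal usage sketch; the column name and expression syntax are illustrative assumptions, not taken from the library):

# Basic mode: the ensure_basic_filter validator fills in a default BasicFilter
basic = FilterInput(mode="basic")
assert basic.basic_filter is not None

# Advanced mode: provide the filter expression as a string
advanced = FilterInput(mode="advanced", advanced_filter="sales > 1000")
assert advanced.is_advanced()
assert advanced.to_yaml_dict() == {"mode": "advanced", "advanced_filter": "sales > 1000"}

# Backward compatibility: the legacy filter_type keyword is mapped onto mode
legacy = FilterInput(filter_type="advanced", advanced_filter="sales > 1000")
assert legacy.mode == "advanced"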
FilterOperator

Bases: str, Enum

Supported filter comparison operators.

Methods:

Name Description
from_symbol

Convert UI symbol to FilterOperator enum.

to_symbol

Convert FilterOperator to UI-friendly symbol.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 24-95
class FilterOperator(str, Enum):
    """Supported filter comparison operators."""

    EQUALS = "equals"
    NOT_EQUALS = "not_equals"
    GREATER_THAN = "greater_than"
    GREATER_THAN_OR_EQUALS = "greater_than_or_equals"
    LESS_THAN = "less_than"
    LESS_THAN_OR_EQUALS = "less_than_or_equals"
    CONTAINS = "contains"
    NOT_CONTAINS = "not_contains"
    STARTS_WITH = "starts_with"
    ENDS_WITH = "ends_with"
    IS_NULL = "is_null"
    IS_NOT_NULL = "is_not_null"
    IN = "in"
    NOT_IN = "not_in"
    BETWEEN = "between"

    def __str__(self) -> str:
        return self.value

    @classmethod
    def from_symbol(cls, symbol: str) -> "FilterOperator":
        """Convert UI symbol to FilterOperator enum."""
        symbol_mapping = {
            "=": cls.EQUALS,
            "==": cls.EQUALS,
            "!=": cls.NOT_EQUALS,
            "<>": cls.NOT_EQUALS,
            ">": cls.GREATER_THAN,
            ">=": cls.GREATER_THAN_OR_EQUALS,
            "<": cls.LESS_THAN,
            "<=": cls.LESS_THAN_OR_EQUALS,
            "contains": cls.CONTAINS,
            "not_contains": cls.NOT_CONTAINS,
            "starts_with": cls.STARTS_WITH,
            "ends_with": cls.ENDS_WITH,
            "is_null": cls.IS_NULL,
            "is_not_null": cls.IS_NOT_NULL,
            "in": cls.IN,
            "not_in": cls.NOT_IN,
            "between": cls.BETWEEN,
        }
        if symbol in symbol_mapping:
            return symbol_mapping[symbol]
        # Try to match by value directly
        try:
            return cls(symbol)
        except ValueError:
            raise ValueError(f"Unknown filter operator symbol: {symbol}")

    def to_symbol(self) -> str:
        """Convert FilterOperator to UI-friendly symbol."""
        symbol_mapping = {
            FilterOperator.EQUALS: "=",
            FilterOperator.NOT_EQUALS: "!=",
            FilterOperator.GREATER_THAN: ">",
            FilterOperator.GREATER_THAN_OR_EQUALS: ">=",
            FilterOperator.LESS_THAN: "<",
            FilterOperator.LESS_THAN_OR_EQUALS: "<=",
            FilterOperator.CONTAINS: "contains",
            FilterOperator.NOT_CONTAINS: "not_contains",
            FilterOperator.STARTS_WITH: "starts_with",
            FilterOperator.ENDS_WITH: "ends_with",
            FilterOperator.IS_NULL: "is_null",
            FilterOperator.IS_NOT_NULL: "is_not_null",
            FilterOperator.IN: "in",
            FilterOperator.NOT_IN: "not_in",
            FilterOperator.BETWEEN: "between",
        }
        return symbol_mapping.get(self, self.value)
from_symbol(symbol) classmethod

Convert UI symbol to FilterOperator enum.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 46-74
@classmethod
def from_symbol(cls, symbol: str) -> "FilterOperator":
    """Convert UI symbol to FilterOperator enum."""
    symbol_mapping = {
        "=": cls.EQUALS,
        "==": cls.EQUALS,
        "!=": cls.NOT_EQUALS,
        "<>": cls.NOT_EQUALS,
        ">": cls.GREATER_THAN,
        ">=": cls.GREATER_THAN_OR_EQUALS,
        "<": cls.LESS_THAN,
        "<=": cls.LESS_THAN_OR_EQUALS,
        "contains": cls.CONTAINS,
        "not_contains": cls.NOT_CONTAINS,
        "starts_with": cls.STARTS_WITH,
        "ends_with": cls.ENDS_WITH,
        "is_null": cls.IS_NULL,
        "is_not_null": cls.IS_NOT_NULL,
        "in": cls.IN,
        "not_in": cls.NOT_IN,
        "between": cls.BETWEEN,
    }
    if symbol in symbol_mapping:
        return symbol_mapping[symbol]
    # Try to match by value directly
    try:
        return cls(symbol)
    except ValueError:
        raise ValueError(f"Unknown filter operator symbol: {symbol}")
to_symbol()

Convert FilterOperator to UI-friendly symbol.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 76-95
def to_symbol(self) -> str:
    """Convert FilterOperator to UI-friendly symbol."""
    symbol_mapping = {
        FilterOperator.EQUALS: "=",
        FilterOperator.NOT_EQUALS: "!=",
        FilterOperator.GREATER_THAN: ">",
        FilterOperator.GREATER_THAN_OR_EQUALS: ">=",
        FilterOperator.LESS_THAN: "<",
        FilterOperator.LESS_THAN_OR_EQUALS: "<=",
        FilterOperator.CONTAINS: "contains",
        FilterOperator.NOT_CONTAINS: "not_contains",
        FilterOperator.STARTS_WITH: "starts_with",
        FilterOperator.ENDS_WITH: "ends_with",
        FilterOperator.IS_NULL: "is_null",
        FilterOperator.IS_NOT_NULL: "is_not_null",
        FilterOperator.IN: "in",
        FilterOperator.NOT_IN: "not_in",
        FilterOperator.BETWEEN: "between",
    }
    return symbol_mapping.get(self, self.value)
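Example (a short sketch of the symbol conversion helpers):

op = FilterOperator.from_symbol(">=")
assert op is FilterOperator.GREATER_THAN_OR_EQUALS
assert op.to_symbol() == ">="
assert str(op) == "greater_than_or_equals"

# Inputs not in the symbol mapping fall back to matching the enum value;
# anything else raises a ValueError
assert FilterOperator.from_symbol("equals") is FilterOperator.EQUALS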
FullJoinKeyResponse

Bases: NamedTuple

Holds the join key rename responses for both sides of a join.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 146-150
class FullJoinKeyResponse(NamedTuple):
    """Holds the join key rename responses for both sides of a join."""

    left: JoinKeyRenameResponse
    right: JoinKeyRenameResponse
FunctionInput pydantic-model

Bases: BaseModel

Defines a formula to be applied, including the output field information.

Show JSON schema:
{
  "$defs": {
    "DataType": {
      "description": "Specific data types for fine-grained control.",
      "enum": [
        "Int8",
        "Int16",
        "Int32",
        "Int64",
        "UInt8",
        "UInt16",
        "UInt32",
        "UInt64",
        "Float32",
        "Float64",
        "Decimal",
        "String",
        "Categorical",
        "Date",
        "Datetime",
        "Time",
        "Duration",
        "Boolean",
        "Binary",
        "List",
        "Struct",
        "Array"
      ],
      "title": "DataType",
      "type": "string"
    },
    "FieldInput": {
      "description": "Represents a single field with its name and data type, typically for defining an output column.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "anyOf": [
            {
              "$ref": "#/$defs/DataType"
            },
            {
              "const": "Auto",
              "enum": [
                "Auto"
              ],
              "type": "string"
            },
            {
              "enum": [
                "Int8",
                "Int16",
                "Int32",
                "Int64",
                "UInt8",
                "UInt16",
                "UInt32",
                "UInt64",
                "Float32",
                "Float64",
                "Decimal",
                "String",
                "Date",
                "Datetime",
                "Time",
                "Duration",
                "Boolean",
                "Binary",
                "List",
                "Struct",
                "Array",
                "Integer",
                "Double",
                "Utf8"
              ],
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": "Auto",
          "title": "Data Type"
        }
      },
      "required": [
        "name"
      ],
      "title": "FieldInput",
      "type": "object"
    }
  },
  "description": "Defines a formula to be applied, including the output field information.",
  "properties": {
    "field": {
      "$ref": "#/$defs/FieldInput"
    },
    "function": {
      "title": "Function",
      "type": "string"
    }
  },
  "required": [
    "field",
    "function"
  ],
  "title": "FunctionInput",
  "type": "object"
}

Fields:

  • field (FieldInput)
  • function (str)
Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 266-277
class FunctionInput(BaseModel):
    """Defines a formula to be applied, including the output field information."""

    field: FieldInput
    function: str

    def __init__(self, field: FieldInput = None, function: str = None, **data):
        if field is not None:
            data["field"] = field
        if function is not None:
            data["function"] = function
        super().__init__(**data)
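Example (a minimal sketch; the output column name and the formula string syntax are illustrative assumptions):

formula = FunctionInput(
    field=FieldInput(name="total", data_type="Float64"),
    function="[price] * [quantity]",
)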
FuzzyMatchInput pydantic-model

Bases: BaseModel

Data model for fuzzy matching join operations.

Show JSON schema:
{
  "$defs": {
    "FuzzyMapping": {
      "properties": {
        "left_col": {
          "title": "Left Col",
          "type": "string"
        },
        "right_col": {
          "title": "Right Col",
          "type": "string"
        },
        "threshold_score": {
          "default": 80.0,
          "title": "Threshold Score",
          "type": "number"
        },
        "fuzzy_type": {
          "default": "levenshtein",
          "enum": [
            "levenshtein",
            "jaro",
            "jaro_winkler",
            "hamming",
            "damerau_levenshtein",
            "indel"
          ],
          "title": "Fuzzy Type",
          "type": "string"
        },
        "perc_unique": {
          "default": 0.0,
          "title": "Perc Unique",
          "type": "number"
        },
        "output_column_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Output Column Name"
        },
        "valid": {
          "default": true,
          "title": "Valid",
          "type": "boolean"
        }
      },
      "required": [
        "left_col",
        "right_col"
      ],
      "title": "FuzzyMapping",
      "type": "object"
    },
    "JoinInputs": {
      "description": "Data model for join-specific select inputs (extends SelectInputs).",
      "properties": {
        "renames": {
          "items": {
            "$ref": "#/$defs/SelectInput"
          },
          "title": "Renames",
          "type": "array"
        }
      },
      "title": "JoinInputs",
      "type": "object"
    },
    "SelectInput": {
      "description": "Defines how a single column should be selected, renamed, or type-cast.\n\nThis is a core building block for any operation that involves column manipulation.\nIt holds all the configuration for a single field in a selection operation.",
      "properties": {
        "old_name": {
          "title": "Old Name",
          "type": "string"
        },
        "original_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Original Position"
        },
        "new_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "New Name"
        },
        "data_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Data Type"
        },
        "data_type_change": {
          "default": false,
          "title": "Data Type Change",
          "type": "boolean"
        },
        "join_key": {
          "default": false,
          "title": "Join Key",
          "type": "boolean"
        },
        "is_altered": {
          "default": false,
          "title": "Is Altered",
          "type": "boolean"
        },
        "position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Position"
        },
        "is_available": {
          "default": true,
          "title": "Is Available",
          "type": "boolean"
        },
        "keep": {
          "default": true,
          "title": "Keep",
          "type": "boolean"
        }
      },
      "required": [
        "old_name"
      ],
      "title": "SelectInput",
      "type": "object"
    }
  },
  "description": "Data model for fuzzy matching join operations.",
  "properties": {
    "join_mapping": {
      "items": {
        "$ref": "#/$defs/FuzzyMapping"
      },
      "title": "Join Mapping",
      "type": "array"
    },
    "left_select": {
      "$ref": "#/$defs/JoinInputs"
    },
    "right_select": {
      "$ref": "#/$defs/JoinInputs"
    },
    "how": {
      "default": "inner",
      "enum": [
        "inner",
        "left",
        "right",
        "full",
        "semi",
        "anti",
        "cross",
        "outer"
      ],
      "title": "How",
      "type": "string"
    },
    "aggregate_output": {
      "default": false,
      "title": "Aggregate Output",
      "type": "boolean"
    }
  },
  "required": [
    "join_mapping",
    "left_select",
    "right_select"
  ],
  "title": "FuzzyMatchInput",
  "type": "object"
}

Fields:

  • join_mapping (list[FuzzyMapping])
  • left_select (JoinInputs)
  • right_select (JoinInputs)
  • how (JoinStrategy)
  • aggregate_output (bool)

Validators:

  • parse_inputs
Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 741-819
class FuzzyMatchInput(BaseModel):
    """Data model for fuzzy matching join operations."""

    join_mapping: list[FuzzyMapping]
    left_select: JoinInputs
    right_select: JoinInputs
    how: JoinStrategy = "inner"
    aggregate_output: bool = False

    def __init__(
        self,
        left_select: JoinInputs | list[SelectInput] | list[str] = None,
        right_select: JoinInputs | list[SelectInput] | list[str] = None,
        **data,
    ):
        """Custom init for backward compatibility with positional arguments."""
        if left_select is not None:
            data["left_select"] = left_select
        if right_select is not None:
            data["right_select"] = right_select

        super().__init__(**data)

    def to_yaml_dict(self) -> FuzzyMatchInputYaml:
        """Serialize for YAML output."""
        return {
            "join_mapping": [asdict(jm) for jm in self.join_mapping],
            "left_select": self.left_select.to_yaml_dict(),
            "right_select": self.right_select.to_yaml_dict(),
            "how": self.how,
            "aggregate_output": self.aggregate_output,
        }

    def add_new_select_column(self, select_input: SelectInput, side: str) -> None:
        """Adds a new column to the selection for either the left or right side."""
        target_input = self.right_select if side == "right" else self.left_select
        if select_input.new_name is None:
            select_input.new_name = select_input.old_name
        target_input.renames.append(select_input)

    @staticmethod
    def _parse_select(select: Any) -> JoinInputs:
        """Parse various select input formats."""
        # Already JoinInputs
        if isinstance(select, JoinInputs):
            return select

        # List of SelectInput objects
        if isinstance(select, list):
            if all(isinstance(s, SelectInput) for s in select):
                return JoinInputs(renames=select)
            elif all(isinstance(s, str) for s in select):
                return JoinInputs(renames=[SelectInput(old_name=s) for s in select])
            elif all(isinstance(s, dict) for s in select):
                return JoinInputs(renames=[SelectInput(**s) for s in select])

        # Dict with 'select' (new YAML) or 'renames' (internal) key
        if isinstance(select, dict):
            if "select" in select:
                return JoinInputs(renames=[SelectInput.from_yaml_dict(s) for s in select["select"]])
            if "renames" in select:
                return JoinInputs(**select)

        raise ValueError(f"Invalid select format: {type(select)}")

    @model_validator(mode="before")
    @classmethod
    def parse_inputs(cls, data: Any) -> Any:
        """Parse flexible input formats before validation."""
        if isinstance(data, dict):
            # Parse left_select
            if "left_select" in data:
                data["left_select"] = cls._parse_select(data["left_select"])

            # Parse right_select
            if "right_select" in data:
                data["right_select"] = cls._parse_select(data["right_select"])

        return data
__init__(left_select=None, right_select=None, **data)

Custom init for backward compatibility with positional arguments.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 750-762
def __init__(
    self,
    left_select: JoinInputs | list[SelectInput] | list[str] = None,
    right_select: JoinInputs | list[SelectInput] | list[str] = None,
    **data,
):
    """Custom init for backward compatibility with positional arguments."""
    if left_select is not None:
        data["left_select"] = left_select
    if right_select is not None:
        data["right_select"] = right_select

    super().__init__(**data)
add_new_select_column(select_input, side)

Adds a new column to the selection for either the left or right side.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 774-779
def add_new_select_column(self, select_input: SelectInput, side: str) -> None:
    """Adds a new column to the selection for either the left or right side."""
    target_input = self.right_select if side == "right" else self.left_select
    if select_input.new_name is None:
        select_input.new_name = select_input.old_name
    target_input.renames.append(select_input)
parse_inputs(data) pydantic-validator

Parse flexible input formats before validation.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 806-819
@model_validator(mode="before")
@classmethod
def parse_inputs(cls, data: Any) -> Any:
    """Parse flexible input formats before validation."""
    if isinstance(data, dict):
        # Parse left_select
        if "left_select" in data:
            data["left_select"] = cls._parse_select(data["left_select"])

        # Parse right_select
        if "right_select" in data:
            data["right_select"] = cls._parse_select(data["right_select"])

    return data
to_yaml_dict()

Serialize for YAML output.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 764-772
def to_yaml_dict(self) -> FuzzyMatchInputYaml:
    """Serialize for YAML output."""
    return {
        "join_mapping": [asdict(jm) for jm in self.join_mapping],
        "left_select": self.left_select.to_yaml_dict(),
        "right_select": self.right_select.to_yaml_dict(),
        "how": self.how,
        "aggregate_output": self.aggregate_output,
    }
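Example (a sketch of the flexible inputs accepted by the parse_inputs validator; column names are hypothetical):

fuzzy = FuzzyMatchInput(
    join_mapping=[FuzzyMapping(left_col="name", right_col="customer_name", threshold_score=85.0)],
    left_select=["name", "city"],          # plain column names are wrapped in SelectInput objects
    right_select=["customer_name"],
    how="left",
)
fuzzy.add_new_select_column(SelectInput(old_name="country"), side="right")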
FuzzyMatchInputManager

Bases: JoinInputManager

Manager for fuzzy matching join operations.

Methods:

Name Description
create

Factory method to create FuzzyMatchInput from various input formats.

get_fuzzy_maps

Returns the final fuzzy mappings after applying all column renames.

parse_fuzz_mapping

Parses various input formats into a list of FuzzyMapping objects.

to_fuzzy_match_input

Creates a new FuzzyMatchInput instance based on the current manager settings.

Attributes:

Name Type Description
aggregate_output bool

Backward compatibility: Access aggregate_output setting.

fuzzy_maps list[FuzzyMapping]

Backward compatibility: Returns fuzzy mappings.

join_mapping list[FuzzyMapping]

Backward compatibility: Access fuzzy join mapping.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 1490-1624
class FuzzyMatchInputManager(JoinInputManager):
    """Manager for fuzzy matching join operations."""

    def __init__(self, fuzzy_input: FuzzyMatchInput):
        self.fuzzy_input = deepcopy(fuzzy_input)
        super().__init__(
            JoinInput(
                join_mapping=[
                    JoinMap(left_col=fm.left_col, right_col=fm.right_col) for fm in self.fuzzy_input.join_mapping
                ],
                left_select=self.fuzzy_input.left_select,
                right_select=self.fuzzy_input.right_select,
                how=self.fuzzy_input.how,
            )
        )

    @classmethod
    def create(
        cls,
        join_mapping: list[FuzzyMapping] | tuple[str, str] | str,
        left_select: list[SelectInput] | list[str],
        right_select: list[SelectInput] | list[str],
        aggregate_output: bool = False,
        how: JoinStrategy = "inner",
    ) -> "FuzzyMatchInputManager":
        """Factory method to create FuzzyMatchInput from various input formats."""
        parsed_mapping = cls.parse_fuzz_mapping(join_mapping)
        left_inputs = cls.parse_select(left_select)
        right_inputs = cls.parse_select(right_select)

        fuzzy_input = FuzzyMatchInput(
            join_mapping=parsed_mapping,
            left_select=left_inputs,
            right_select=right_inputs,
            how=how,
            aggregate_output=aggregate_output,
        )

        manager = cls(fuzzy_input)

        right_old_names = {v.old_name for v in fuzzy_input.right_select.renames}
        left_old_names = {v.old_name for v in fuzzy_input.left_select.renames}

        for jm in parsed_mapping:
            if jm.right_col not in right_old_names:
                manager.right_manager.append(SelectInput(old_name=jm.right_col, keep=False, join_key=True))
            if jm.left_col not in left_old_names:
                manager.left_manager.append(SelectInput(old_name=jm.left_col, keep=False, join_key=True))

        manager.set_join_keys()
        return manager

    @staticmethod
    def parse_fuzz_mapping(
        fuzz_mapping: list[FuzzyMapping] | tuple[str, str] | str | FuzzyMapping | list[dict],
    ) -> list[FuzzyMapping]:
        """Parses various input formats into a list of FuzzyMapping objects."""
        if isinstance(fuzz_mapping, (tuple, list)):
            if len(fuzz_mapping) == 0:
                raise ValueError("Fuzzy mapping cannot be empty")

            if all(isinstance(fm, dict) for fm in fuzz_mapping):
                return [FuzzyMapping(**fm) for fm in fuzz_mapping]

            if all(isinstance(fm, FuzzyMapping) for fm in fuzz_mapping):
                return fuzz_mapping

            if len(fuzz_mapping) <= 2:
                if len(fuzz_mapping) == 2:
                    if isinstance(fuzz_mapping[0], str) and isinstance(fuzz_mapping[1], str):
                        return [FuzzyMapping(left_col=fuzz_mapping[0], right_col=fuzz_mapping[1])]
                elif len(fuzz_mapping) == 1 and isinstance(fuzz_mapping[0], str):
                    return [FuzzyMapping(left_col=fuzz_mapping[0], right_col=fuzz_mapping[0])]

        elif isinstance(fuzz_mapping, str):
            return [FuzzyMapping(left_col=fuzz_mapping, right_col=fuzz_mapping)]

        elif isinstance(fuzz_mapping, FuzzyMapping):
            return [fuzz_mapping]

        raise ValueError(f"No valid fuzzy mapping as input: {type(fuzz_mapping)}")

    def get_fuzzy_maps(self) -> list[FuzzyMapping]:
        """Returns the final fuzzy mappings after applying all column renames."""
        new_mappings = []
        left_rename_table = self.left_manager.get_rename_table()
        right_rename_table = self.right_manager.get_rename_table()

        for org_fuzzy_map in self.fuzzy_input.join_mapping:
            right_col = right_rename_table.get(org_fuzzy_map.right_col, org_fuzzy_map.right_col)
            left_col = left_rename_table.get(org_fuzzy_map.left_col, org_fuzzy_map.left_col)

            if right_col != org_fuzzy_map.right_col or left_col != org_fuzzy_map.left_col:
                new_mapping = deepcopy(org_fuzzy_map)
                new_mapping.left_col = left_col
                new_mapping.right_col = right_col
                new_mappings.append(new_mapping)
            else:
                new_mappings.append(org_fuzzy_map)

        return new_mappings

    # === Backward Compatibility Properties ===

    @property
    def fuzzy_maps(self) -> list[FuzzyMapping]:
        """Backward compatibility: Returns fuzzy mappings."""
        return self.get_fuzzy_maps()

    @property
    def join_mapping(self) -> list[FuzzyMapping]:
        """Backward compatibility: Access fuzzy join mapping."""
        return self.get_fuzzy_maps()

    @property
    def aggregate_output(self) -> bool:
        """Backward compatibility: Access aggregate_output setting."""
        return self.fuzzy_input.aggregate_output

    def to_fuzzy_match_input(self) -> FuzzyMatchInput:
        """Creates a new FuzzyMatchInput instance based on the current manager settings.

        This is useful when you've modified the manager (e.g., via auto_rename) and
        want to get a fresh FuzzyMatchInput with all the current settings applied.

        Returns:
            A new FuzzyMatchInput instance with current settings
        """
        return FuzzyMatchInput(
            join_mapping=self.fuzzy_input.join_mapping,
            left_select=JoinInputs(renames=self.input.left_select.renames.copy()),
            right_select=JoinInputs(renames=self.input.right_select.renames.copy()),
            how=self.fuzzy_input.how,
            aggregate_output=self.fuzzy_input.aggregate_output,
        )
aggregate_output property

Backward compatibility: Access aggregate_output setting.

fuzzy_maps property

Backward compatibility: Returns fuzzy mappings.

join_mapping property

Backward compatibility: Access fuzzy join mapping.

create(join_mapping, left_select, right_select, aggregate_output=False, how='inner') classmethod

Factory method to create FuzzyMatchInput from various input formats.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 1506-1540
@classmethod
def create(
    cls,
    join_mapping: list[FuzzyMapping] | tuple[str, str] | str,
    left_select: list[SelectInput] | list[str],
    right_select: list[SelectInput] | list[str],
    aggregate_output: bool = False,
    how: JoinStrategy = "inner",
) -> "FuzzyMatchInputManager":
    """Factory method to create FuzzyMatchInput from various input formats."""
    parsed_mapping = cls.parse_fuzz_mapping(join_mapping)
    left_inputs = cls.parse_select(left_select)
    right_inputs = cls.parse_select(right_select)

    fuzzy_input = FuzzyMatchInput(
        join_mapping=parsed_mapping,
        left_select=left_inputs,
        right_select=right_inputs,
        how=how,
        aggregate_output=aggregate_output,
    )

    manager = cls(fuzzy_input)

    right_old_names = {v.old_name for v in fuzzy_input.right_select.renames}
    left_old_names = {v.old_name for v in fuzzy_input.left_select.renames}

    for jm in parsed_mapping:
        if jm.right_col not in right_old_names:
            manager.right_manager.append(SelectInput(old_name=jm.right_col, keep=False, join_key=True))
        if jm.left_col not in left_old_names:
            manager.left_manager.append(SelectInput(old_name=jm.left_col, keep=False, join_key=True))

    manager.set_join_keys()
    return manager
get_fuzzy_maps()

Returns the final fuzzy mappings after applying all column renames.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 1572-1590
def get_fuzzy_maps(self) -> list[FuzzyMapping]:
    """Returns the final fuzzy mappings after applying all column renames."""
    new_mappings = []
    left_rename_table = self.left_manager.get_rename_table()
    right_rename_table = self.right_manager.get_rename_table()

    for org_fuzzy_map in self.fuzzy_input.join_mapping:
        right_col = right_rename_table.get(org_fuzzy_map.right_col, org_fuzzy_map.right_col)
        left_col = left_rename_table.get(org_fuzzy_map.left_col, org_fuzzy_map.left_col)

        if right_col != org_fuzzy_map.right_col or left_col != org_fuzzy_map.left_col:
            new_mapping = deepcopy(org_fuzzy_map)
            new_mapping.left_col = left_col
            new_mapping.right_col = right_col
            new_mappings.append(new_mapping)
        else:
            new_mappings.append(org_fuzzy_map)

    return new_mappings
parse_fuzz_mapping(fuzz_mapping) staticmethod

Parses various input formats into a list of FuzzyMapping objects.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 1542-1570
@staticmethod
def parse_fuzz_mapping(
    fuzz_mapping: list[FuzzyMapping] | tuple[str, str] | str | FuzzyMapping | list[dict],
) -> list[FuzzyMapping]:
    """Parses various input formats into a list of FuzzyMapping objects."""
    if isinstance(fuzz_mapping, (tuple, list)):
        if len(fuzz_mapping) == 0:
            raise ValueError("Fuzzy mapping cannot be empty")

        if all(isinstance(fm, dict) for fm in fuzz_mapping):
            return [FuzzyMapping(**fm) for fm in fuzz_mapping]

        if all(isinstance(fm, FuzzyMapping) for fm in fuzz_mapping):
            return fuzz_mapping

        if len(fuzz_mapping) <= 2:
            if len(fuzz_mapping) == 2:
                if isinstance(fuzz_mapping[0], str) and isinstance(fuzz_mapping[1], str):
                    return [FuzzyMapping(left_col=fuzz_mapping[0], right_col=fuzz_mapping[1])]
            elif len(fuzz_mapping) == 1 and isinstance(fuzz_mapping[0], str):
                return [FuzzyMapping(left_col=fuzz_mapping[0], right_col=fuzz_mapping[0])]

    elif isinstance(fuzz_mapping, str):
        return [FuzzyMapping(left_col=fuzz_mapping, right_col=fuzz_mapping)]

    elif isinstance(fuzz_mapping, FuzzyMapping):
        return [fuzz_mapping]

    raise ValueError(f"No valid fuzzy mapping as input: {type(fuzz_mapping)}")
to_fuzzy_match_input()

Creates a new FuzzyMatchInput instance based on the current manager settings.

This is useful when you've modified the manager (e.g., via auto_rename) and want to get a fresh FuzzyMatchInput with all the current settings applied.

Returns:

Type Description
FuzzyMatchInput

A new FuzzyMatchInput instance with current settings

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 1609-1624
def to_fuzzy_match_input(self) -> FuzzyMatchInput:
    """Creates a new FuzzyMatchInput instance based on the current manager settings.

    This is useful when you've modified the manager (e.g., via auto_rename) and
    want to get a fresh FuzzyMatchInput with all the current settings applied.

    Returns:
        A new FuzzyMatchInput instance with current settings
    """
    return FuzzyMatchInput(
        join_mapping=self.fuzzy_input.join_mapping,
        left_select=JoinInputs(renames=self.input.left_select.renames.copy()),
        right_select=JoinInputs(renames=self.input.right_select.renames.copy()),
        how=self.fuzzy_input.how,
        aggregate_output=self.fuzzy_input.aggregate_output,
    )
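Example (a sketch of the factory method; column names are hypothetical):

manager = FuzzyMatchInputManager.create(
    join_mapping=("name", "customer_name"),   # a (left, right) tuple becomes a single FuzzyMapping
    left_select=["name", "city"],
    right_select=["customer_name", "country"],
    how="left",
)
final_maps = manager.get_fuzzy_maps()          # mappings with any column renames applied
rebuilt = manager.to_fuzzy_match_input()       # fresh FuzzyMatchInput reflecting the manager state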
GraphSolverInput pydantic-model

Bases: BaseModel

Defines settings for a graph-solving operation (e.g., finding connected components).

Show JSON schema:
{
  "description": "Defines settings for a graph-solving operation (e.g., finding connected components).",
  "properties": {
    "col_from": {
      "title": "Col From",
      "type": "string"
    },
    "col_to": {
      "title": "Col To",
      "type": "string"
    },
    "output_column_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "graph_group",
      "title": "Output Column Name"
    }
  },
  "required": [
    "col_from",
    "col_to"
  ],
  "title": "GraphSolverInput",
  "type": "object"
}

Fields:

  • col_from (str)
  • col_to (str)
  • output_column_name (str | None)
Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 1019-1024
class GraphSolverInput(BaseModel):
    """Defines settings for a graph-solving operation (e.g., finding connected components)."""

    col_from: str
    col_to: str
    output_column_name: str | None = "graph_group"
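Example (a minimal sketch; column names are hypothetical):

graph_settings = GraphSolverInput(col_from="parent_id", col_to="child_id")
assert graph_settings.output_column_name == "graph_group"   # default output column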
GroupByInput pydantic-model

Bases: BaseModel

A data class that represents the input for a group by operation.

Attributes

agg_cols : List[AggColl]
    A list of AggColl objects that specify the aggregation operations to perform on the DataFrame columns after grouping. Each AggColl object should specify the column to be aggregated and the aggregation function to use.

Example

group_by_input = GroupByInput(
    agg_cols=[AggColl(old_name='ix', agg='groupby'), AggColl(old_name='groups', agg='groupby'),
              AggColl(old_name='col1', agg='sum'), AggColl(old_name='col2', agg='mean')]
)

Show JSON schema:
{
  "$defs": {
    "AggColl": {
      "description": "A data class that represents a single aggregation operation for a group by operation.\n\nAttributes\n----------\nold_name : str\n    The name of the column in the original DataFrame to be aggregated.\n\nagg : str\n    The aggregation function to use. This can be a string representing a built-in function or a custom function.\n\nnew_name : Optional[str]\n    The name of the resulting aggregated column in the output DataFrame. If not provided, it will default to the\n    old_name appended with the aggregation function.\n\noutput_type : Optional[str]\n    The type of the output values of the aggregation. If not provided, it is inferred from the aggregation function\n    using the `get_func_type_mapping` function.\n\nExample\n--------\nagg_col = AggColl(\n    old_name='col1',\n    agg='sum',\n    new_name='sum_col1',\n    output_type='float'\n)",
      "properties": {
        "old_name": {
          "title": "Old Name",
          "type": "string"
        },
        "agg": {
          "title": "Agg",
          "type": "string"
        },
        "new_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "New Name"
        },
        "output_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Output Type"
        }
      },
      "required": [
        "old_name",
        "agg"
      ],
      "title": "AggColl",
      "type": "object"
    }
  },
  "description": "A data class that represents the input for a group by operation.\n\nAttributes\n----------\nagg_cols : List[AggColl]\n    A list of `AggColl` objects that specify the aggregation operations to perform on the DataFrame columns\n    after grouping. Each `AggColl` object should specify the column to be aggregated and the aggregation\n    function to use.\n\nExample\n--------\ngroup_by_input = GroupByInput(\n    agg_cols=[AggColl(old_name='ix', agg='groupby'), AggColl(old_name='groups', agg='groupby'),\n              AggColl(old_name='col1', agg='sum'), AggColl(old_name='col2', agg='mean')]\n)",
  "properties": {
    "agg_cols": {
      "items": {
        "$ref": "#/$defs/AggColl"
      },
      "title": "Agg Cols",
      "type": "array"
    }
  },
  "required": [
    "agg_cols"
  ],
  "title": "GroupByInput",
  "type": "object"
}

Fields:

  • agg_cols (list[AggColl])
Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 896-919
class GroupByInput(BaseModel):
    """
    A data class that represents the input for a group by operation.

    Attributes
    ----------
    agg_cols : List[AggColl]
        A list of `AggColl` objects that specify the aggregation operations to perform on the DataFrame columns
        after grouping. Each `AggColl` object should specify the column to be aggregated and the aggregation
        function to use.

    Example
    --------
    group_by_input = GroupByInput(
        agg_cols=[AggColl(old_name='ix', agg='groupby'), AggColl(old_name='groups', agg='groupby'),
                  AggColl(old_name='col1', agg='sum'), AggColl(old_name='col2', agg='mean')]
    )
    """

    agg_cols: list[AggColl]

    def __init__(self, agg_cols: list[AggColl]):
        """Backwards compatibility implementation"""
        super().__init__(agg_cols=agg_cols)
__init__(agg_cols)

Backwards compatibility implementation

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 917-919
def __init__(self, agg_cols: list[AggColl]):
    """Backwards compatibility implementation"""
    super().__init__(agg_cols=agg_cols)
JoinInput pydantic-model

Bases: BaseModel

Data model for standard SQL-style join operations.

Show JSON schema:
{
  "$defs": {
    "JoinInputs": {
      "description": "Data model for join-specific select inputs (extends SelectInputs).",
      "properties": {
        "renames": {
          "items": {
            "$ref": "#/$defs/SelectInput"
          },
          "title": "Renames",
          "type": "array"
        }
      },
      "title": "JoinInputs",
      "type": "object"
    },
    "JoinMap": {
      "description": "Defines a single mapping between a left and right column for a join key.",
      "properties": {
        "left_col": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Left Col"
        },
        "right_col": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Right Col"
        }
      },
      "title": "JoinMap",
      "type": "object"
    },
    "SelectInput": {
      "description": "Defines how a single column should be selected, renamed, or type-cast.\n\nThis is a core building block for any operation that involves column manipulation.\nIt holds all the configuration for a single field in a selection operation.",
      "properties": {
        "old_name": {
          "title": "Old Name",
          "type": "string"
        },
        "original_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Original Position"
        },
        "new_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "New Name"
        },
        "data_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Data Type"
        },
        "data_type_change": {
          "default": false,
          "title": "Data Type Change",
          "type": "boolean"
        },
        "join_key": {
          "default": false,
          "title": "Join Key",
          "type": "boolean"
        },
        "is_altered": {
          "default": false,
          "title": "Is Altered",
          "type": "boolean"
        },
        "position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Position"
        },
        "is_available": {
          "default": true,
          "title": "Is Available",
          "type": "boolean"
        },
        "keep": {
          "default": true,
          "title": "Keep",
          "type": "boolean"
        }
      },
      "required": [
        "old_name"
      ],
      "title": "SelectInput",
      "type": "object"
    }
  },
  "description": "Data model for standard SQL-style join operations.",
  "properties": {
    "join_mapping": {
      "items": {
        "$ref": "#/$defs/JoinMap"
      },
      "title": "Join Mapping",
      "type": "array"
    },
    "left_select": {
      "$ref": "#/$defs/JoinInputs"
    },
    "right_select": {
      "$ref": "#/$defs/JoinInputs"
    },
    "how": {
      "default": "inner",
      "enum": [
        "inner",
        "left",
        "right",
        "full",
        "semi",
        "anti",
        "cross",
        "outer"
      ],
      "title": "How",
      "type": "string"
    }
  },
  "required": [
    "join_mapping",
    "left_select",
    "right_select"
  ],
  "title": "JoinInput",
  "type": "object"
}

Fields:

  • join_mapping (list[JoinMap])
  • left_select (JoinInputs)
  • right_select (JoinInputs)
  • how (JoinStrategy)

Validators:

  • parse_inputs
Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 619-738
class JoinInput(BaseModel):
    """Data model for standard SQL-style join operations."""

    join_mapping: list[JoinMap]
    left_select: JoinInputs
    right_select: JoinInputs
    how: JoinStrategy = "inner"

    @model_validator(mode="before")
    @classmethod
    def parse_inputs(cls, data: Any) -> Any:
        """Parse flexible input formats before validation."""
        if isinstance(data, dict):
            # Parse join_mapping
            if "join_mapping" in data:
                data["join_mapping"] = cls._parse_join_mapping(data["join_mapping"])

            # Parse left_select
            if "left_select" in data:
                data["left_select"] = cls._parse_select(data["left_select"])

            # Parse right_select
            if "right_select" in data:
                data["right_select"] = cls._parse_select(data["right_select"])

        return data

    @staticmethod
    def _parse_join_mapping(join_mapping: Any) -> list[JoinMap]:
        """Parse various join_mapping formats."""
        # Already a list of JoinMaps
        if isinstance(join_mapping, list):
            result = []
            for jm in join_mapping:
                if isinstance(jm, JoinMap):
                    result.append(jm)
                elif isinstance(jm, dict):
                    result.append(JoinMap(**jm))
                elif isinstance(jm, (tuple, list)) and len(jm) == 2:
                    result.append(JoinMap(left_col=jm[0], right_col=jm[1]))
                elif isinstance(jm, str):
                    result.append(JoinMap(left_col=jm, right_col=jm))
                else:
                    raise ValueError(f"Invalid join mapping item: {jm}")
            return result

        # Single JoinMap
        if isinstance(join_mapping, JoinMap):
            return [join_mapping]

        # String: same column on both sides
        if isinstance(join_mapping, str):
            return [JoinMap(left_col=join_mapping, right_col=join_mapping)]

        # Tuple: (left, right)
        if isinstance(join_mapping, tuple) and len(join_mapping) == 2:
            return [JoinMap(left_col=join_mapping[0], right_col=join_mapping[1])]

        raise ValueError(f"Invalid join_mapping format: {type(join_mapping)}")

    @staticmethod
    def _parse_select(select: Any) -> JoinInputs:
        """Parse various select input formats."""
        # Already JoinInputs
        if isinstance(select, JoinInputs):
            return select

        # List of SelectInput objects
        if isinstance(select, list):
            if all(isinstance(s, SelectInput) for s in select):
                return JoinInputs(renames=select)
            elif all(isinstance(s, str) for s in select):
                return JoinInputs(renames=[SelectInput(old_name=s) for s in select])
            elif all(isinstance(s, dict) for s in select):
                return JoinInputs(renames=[SelectInput(**s) for s in select])

        # Dict with 'select' (new YAML) or 'renames' (internal) key
        if isinstance(select, dict):
            if "select" in select:
                return JoinInputs(renames=[SelectInput.from_yaml_dict(s) for s in select["select"]])
            if "renames" in select:
                return JoinInputs(**select)

        raise ValueError(f"Invalid select format: {type(select)}")

    def __init__(
        self,
        join_mapping: list[JoinMap] | JoinMap | tuple[str, str] | str | list[tuple] | list[str] = None,
        left_select: JoinInputs | list[SelectInput] | list[str] = None,
        right_select: JoinInputs | list[SelectInput] | list[str] = None,
        how: JoinStrategy = "inner",
        **data,
    ):
        """Custom init for backward compatibility with positional arguments."""
        if join_mapping is not None:
            data["join_mapping"] = join_mapping
        if left_select is not None:
            data["left_select"] = left_select
        if right_select is not None:
            data["right_select"] = right_select
        if how is not None:
            data["how"] = how

        super().__init__(**data)

    def to_yaml_dict(self) -> JoinInputYaml:
        """Serialize for YAML output."""
        return {
            "join_mapping": [{"left_col": jm.left_col, "right_col": jm.right_col} for jm in self.join_mapping],
            "left_select": self.left_select.to_yaml_dict(),
            "right_select": self.right_select.to_yaml_dict(),
            "how": self.how,
        }

    def add_new_select_column(self, select_input: SelectInput, side: str) -> None:
        """Adds a new column to the selection for either the left or right side."""
        target_input = self.right_select if side == "right" else self.left_select
        if select_input.new_name is None:
            select_input.new_name = select_input.old_name
        target_input.renames.append(select_input)
__init__(join_mapping=None, left_select=None, right_select=None, how='inner', **data)

Custom init for backward compatibility with positional arguments.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 704-722
def __init__(
    self,
    join_mapping: list[JoinMap] | JoinMap | tuple[str, str] | str | list[tuple] | list[str] = None,
    left_select: JoinInputs | list[SelectInput] | list[str] = None,
    right_select: JoinInputs | list[SelectInput] | list[str] = None,
    how: JoinStrategy = "inner",
    **data,
):
    """Custom init for backward compatibility with positional arguments."""
    if join_mapping is not None:
        data["join_mapping"] = join_mapping
    if left_select is not None:
        data["left_select"] = left_select
    if right_select is not None:
        data["right_select"] = right_select
    if how is not None:
        data["how"] = how

    super().__init__(**data)
add_new_select_column(select_input, side)

Adds a new column to the selection for either the left or right side.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 733-738
def add_new_select_column(self, select_input: SelectInput, side: str) -> None:
    """Adds a new column to the selection for either the left or right side."""
    target_input = self.right_select if side == "right" else self.left_select
    if select_input.new_name is None:
        select_input.new_name = select_input.old_name
    target_input.renames.append(select_input)
parse_inputs(data) pydantic-validator

Parse flexible input formats before validation.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 627-644
@model_validator(mode="before")
@classmethod
def parse_inputs(cls, data: Any) -> Any:
    """Parse flexible input formats before validation."""
    if isinstance(data, dict):
        # Parse join_mapping
        if "join_mapping" in data:
            data["join_mapping"] = cls._parse_join_mapping(data["join_mapping"])

        # Parse left_select
        if "left_select" in data:
            data["left_select"] = cls._parse_select(data["left_select"])

        # Parse right_select
        if "right_select" in data:
            data["right_select"] = cls._parse_select(data["right_select"])

    return data
to_yaml_dict()

Serialize for YAML output.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 724-731
def to_yaml_dict(self) -> JoinInputYaml:
    """Serialize for YAML output."""
    return {
        "join_mapping": [{"left_col": jm.left_col, "right_col": jm.right_col} for jm in self.join_mapping],
        "left_select": self.left_select.to_yaml_dict(),
        "right_select": self.right_select.to_yaml_dict(),
        "how": self.how,
    }
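Example (a sketch of the flexible construction formats; column names are hypothetical):

# A plain string joins on the same column name on both sides
simple = JoinInput(
    join_mapping="customer_id",
    left_select=["customer_id", "amount"],
    right_select=["customer_id", "region"],
    how="left",
)

# A (left, right) tuple joins columns with different names
keyed = JoinInput(
    join_mapping=("id", "customer_id"),
    left_select=["id", "amount"],
    right_select=["customer_id"],
)
assert keyed.to_yaml_dict()["join_mapping"] == [{"left_col": "id", "right_col": "customer_id"}]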
JoinInputManager

Bases: JoinSelectManagerMixin

Manager for standard SQL-style join operations.

Methods:

Name Description
auto_rename

Automatically renames columns on the right side to prevent naming conflicts.

create

Factory method to create JoinInput from various input formats.

get_join_key_renames

Gets the temporary rename mappings for the join keys on both sides.

get_left_join_keys

Returns a set of the left-side join key column names.

get_left_join_keys_list

Returns an ordered list of the left-side join key column names.

get_names_for_table_rename

Gets join mapping with renamed columns applied.

get_overlapping_records

Finds column names that would conflict after the join.

get_right_join_keys

Returns a set of the right-side join key column names.

get_right_join_keys_list

Returns an ordered list of the right-side join key column names.

get_used_join_mapping

Returns the final join mapping after applying all renames and transformations.

set_join_keys

Marks the SelectInput objects corresponding to join keys.

to_join_input

Creates a new JoinInput instance based on the current manager settings.

Attributes:

Name Type Description
how JoinStrategy

Backward compatibility: Access join strategy.

join_mapping list[JoinMap]

Backward compatibility: Access join mapping.

left_join_keys list[str]

Backward compatibility: Returns left join keys list.

left_select JoinInputsManager

Backward compatibility: Access left_manager as left_select.

overlapping_records set[str]

Backward compatibility: Returns overlapping column names.

right_join_keys list[str]

Backward compatibility: Returns right join keys list.

right_select JoinInputsManager

Backward compatibility: Access right_manager as right_select.

used_join_mapping list[JoinMap]

Backward compatibility: Returns used join mapping.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py, lines 1294-1487
class JoinInputManager(JoinSelectManagerMixin):
    """Manager for standard SQL-style join operations."""

    def __init__(self, join_input: JoinInput):
        self.input = deepcopy(join_input)
        self.left_manager = JoinInputsManager(self.input.left_select)
        self.right_manager = JoinInputsManager(self.input.right_select)
        self.set_join_keys()

    @classmethod
    def create(
        cls,
        join_mapping: list[JoinMap] | tuple[str, str] | str,
        left_select: list[SelectInput] | list[str],
        right_select: list[SelectInput] | list[str],
        how: JoinStrategy = "inner",
    ) -> "JoinInputManager":
        """Factory method to create JoinInput from various input formats."""
        # Use JoinInput's own create method for parsing
        join_input = JoinInput(join_mapping=join_mapping, left_select=left_select, right_select=right_select, how=how)

        manager = cls(join_input)
        manager.set_join_keys()
        return manager

    def set_join_keys(self) -> None:
        """Marks the `SelectInput` objects corresponding to join keys."""
        left_join_keys = self._get_left_join_keys_set()
        right_join_keys = self._get_right_join_keys_set()

        for select_input in self.input.left_select.renames:
            select_input.join_key = select_input.old_name in left_join_keys

        for select_input in self.input.right_select.renames:
            select_input.join_key = select_input.old_name in right_join_keys

    def _get_left_join_keys_set(self) -> set[str]:
        """Internal: Returns a set of the left-side join key column names."""
        return {jm.left_col for jm in self.input.join_mapping}

    def _get_right_join_keys_set(self) -> set[str]:
        """Internal: Returns a set of the right-side join key column names."""
        return {jm.right_col for jm in self.input.join_mapping}

    def get_left_join_keys(self) -> set[str]:
        """Returns a set of the left-side join key column names."""
        return self._get_left_join_keys_set()

    def get_right_join_keys(self) -> set[str]:
        """Returns a set of the right-side join key column names."""
        return self._get_right_join_keys_set()

    def get_left_join_keys_list(self) -> list[str]:
        """Returns an ordered list of the left-side join key column names."""
        return [jm.left_col for jm in self.used_join_mapping]

    def get_right_join_keys_list(self) -> list[str]:
        """Returns an ordered list of the right-side join key column names."""
        return [jm.right_col for jm in self.used_join_mapping]

    def get_overlapping_records(self) -> set[str]:
        """Finds column names that would conflict after the join."""
        return self.get_overlapping_columns()

    def auto_rename(self) -> None:
        """Automatically renames columns on the right side to prevent naming conflicts."""
        self.set_join_keys()
        overlapping_records = self.get_overlapping_records()

        while len(overlapping_records) > 0:
            for right_col in self.input.right_select.renames:
                if right_col.new_name in overlapping_records:
                    right_col.new_name = right_col.new_name + "_right"
            overlapping_records = self.get_overlapping_records()

    def get_join_key_renames(self, filter_drop: bool = False) -> FullJoinKeyResponse:
        """Gets the temporary rename mappings for the join keys on both sides."""
        left_renames = self.left_manager.get_join_key_renames(side="left", filter_drop=filter_drop)
        right_renames = self.right_manager.get_join_key_renames(side="right", filter_drop=filter_drop)
        return FullJoinKeyResponse(left_renames, right_renames)

    def get_names_for_table_rename(self) -> list[JoinMap]:
        """Gets join mapping with renamed columns applied."""
        new_mappings: list[JoinMap] = []
        left_rename_table = self.left_manager.get_rename_table()
        right_rename_table = self.right_manager.get_rename_table()

        for join_map in self.input.join_mapping:
            new_left = left_rename_table.get(join_map.left_col, join_map.left_col)
            new_right = right_rename_table.get(join_map.right_col, join_map.right_col)
            new_mappings.append(JoinMap(left_col=new_left, right_col=new_right))

        return new_mappings

    def get_used_join_mapping(self) -> list[JoinMap]:
        """Returns the final join mapping after applying all renames and transformations."""
        new_mappings: list[JoinMap] = []
        left_rename_table = self.left_manager.get_rename_table()
        right_rename_table = self.right_manager.get_rename_table()
        left_join_rename_mapping = self.left_manager.get_join_key_rename_mapping("left")
        right_join_rename_mapping = self.right_manager.get_join_key_rename_mapping("right")
        for join_map in self.input.join_mapping:
            left_col = left_rename_table.get(join_map.left_col, join_map.left_col)
            right_col = right_rename_table.get(join_map.right_col, join_map.left_col)

            final_left = left_join_rename_mapping.get(left_col, None)
            final_right = right_join_rename_mapping.get(right_col, None)

            new_mappings.append(JoinMap(left_col=final_left, right_col=final_right))

        return new_mappings

    def to_join_input(self) -> JoinInput:
        """Creates a new JoinInput instance based on the current manager settings.

        This is useful when you've modified the manager (e.g., via auto_rename) and
        want to get a fresh JoinInput with all the current settings applied.

        Returns:
            A new JoinInput instance with current settings
        """
        return JoinInput(
            join_mapping=self.input.join_mapping,
            left_select=JoinInputs(renames=self.input.left_select.renames.copy()),
            right_select=JoinInputs(renames=self.input.right_select.renames.copy()),
            how=self.input.how,
        )

    @property
    def left_select(self) -> JoinInputsManager:
        """Backward compatibility: Access left_manager as left_select.

        This returns the MANAGER, not the data model.
        Usage: manager.left_select.join_key_selects
        """
        return self.left_manager

    @property
    def right_select(self) -> JoinInputsManager:
        """Backward compatibility: Access right_manager as right_select.

        This returns the MANAGER, not the data model.
        Usage: manager.right_select.join_key_selects
        """
        return self.right_manager

    @property
    def how(self) -> JoinStrategy:
        """Backward compatibility: Access join strategy."""
        return self.input.how

    @property
    def join_mapping(self) -> list[JoinMap]:
        """Backward compatibility: Access join mapping."""
        return self.input.join_mapping

    @property
    def overlapping_records(self) -> set[str]:
        """Backward compatibility: Returns overlapping column names."""
        return self.get_overlapping_records()

    @property
    def used_join_mapping(self) -> list[JoinMap]:
        """Backward compatibility: Returns used join mapping.

        This property is critical - it's used by left_join_keys and right_join_keys.
        """
        return self.get_used_join_mapping()

    @property
    def left_join_keys(self) -> list[str]:
        """Backward compatibility: Returns left join keys list.

        IMPORTANT: Uses the used_join_mapping PROPERTY (not method).
        """
        return [jm.left_col for jm in self.used_join_mapping]

    @property
    def right_join_keys(self) -> list[str]:
        """Backward compatibility: Returns right join keys list.

        IMPORTANT: Uses the used_join_mapping PROPERTY (not method).
        """
        return [jm.right_col for jm in self.used_join_mapping]

    @property
    def _left_join_keys(self) -> set[str]:
        """Backward compatibility: Private property for left join key set."""
        return self._get_left_join_keys_set()

    @property
    def _right_join_keys(self) -> set[str]:
        """Backward compatibility: Private property for right join key set."""
        return self._get_right_join_keys_set()
how property

Backward compatibility: Access join strategy.

join_mapping property

Backward compatibility: Access join mapping.

left_join_keys property

Backward compatibility: Returns left join keys list.

IMPORTANT: Uses the used_join_mapping PROPERTY (not method).

left_select property

Backward compatibility: Access left_manager as left_select.

This returns the MANAGER, not the data model. Usage: manager.left_select.join_key_selects

overlapping_records property

Backward compatibility: Returns overlapping column names.

right_join_keys property

Backward compatibility: Returns right join keys list.

IMPORTANT: Uses the used_join_mapping PROPERTY (not method).

right_select property

Backward compatibility: Access right_manager as right_select.

This returns the MANAGER, not the data model. Usage: manager.right_select.join_key_selects

used_join_mapping property

Backward compatibility: Returns used join mapping.

This property is critical - it's used by left_join_keys and right_join_keys.

auto_rename()

Automatically renames columns on the right side to prevent naming conflicts.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def auto_rename(self) -> None:
    """Automatically renames columns on the right side to prevent naming conflicts."""
    self.set_join_keys()
    overlapping_records = self.get_overlapping_records()

    while len(overlapping_records) > 0:
        for right_col in self.input.right_select.renames:
            if right_col.new_name in overlapping_records:
                right_col.new_name = right_col.new_name + "_right"
        overlapping_records = self.get_overlapping_records()
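
As a quick illustration, the sketch below builds a manager whose two sides share column names and lets auto_rename suffix the right-hand columns. This is a minimal, hedged example: the import path is inferred from the source path shown above, and plain column-name lists are assumed to be accepted because the create() signature documents them.

from flowfile_core.schemas.transform_schema import JoinInputManager, JoinMap

manager = JoinInputManager.create(
    join_mapping=[JoinMap("id")],      # join both sides on "id"
    left_select=["id", "amount"],
    right_select=["id", "amount"],     # both names overlap with the left side
)

print(manager.get_overlapping_records())   # {"id", "amount"} (a set — order may vary)
manager.auto_rename()                      # right-side columns become "id_right" and "amount_right"
print(manager.get_overlapping_records())   # set() — no conflicts remain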
create(join_mapping, left_select, right_select, how='inner') classmethod

Factory method to create JoinInput from various input formats.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@classmethod
def create(
    cls,
    join_mapping: list[JoinMap] | tuple[str, str] | str,
    left_select: list[SelectInput] | list[str],
    right_select: list[SelectInput] | list[str],
    how: JoinStrategy = "inner",
) -> "JoinInputManager":
    """Factory method to create JoinInput from various input formats."""
    # Use JoinInput's own create method for parsing
    join_input = JoinInput(join_mapping=join_mapping, left_select=left_select, right_select=right_select, how=how)

    manager = cls(join_input)
    manager.set_join_keys()
    return manager
get_join_key_renames(filter_drop=False)

Gets the temporary rename mappings for the join keys on both sides.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_join_key_renames(self, filter_drop: bool = False) -> FullJoinKeyResponse:
    """Gets the temporary rename mappings for the join keys on both sides."""
    left_renames = self.left_manager.get_join_key_renames(side="left", filter_drop=filter_drop)
    right_renames = self.right_manager.get_join_key_renames(side="right", filter_drop=filter_drop)
    return FullJoinKeyResponse(left_renames, right_renames)
get_left_join_keys()

Returns a set of the left-side join key column names.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_left_join_keys(self) -> set[str]:
    """Returns a set of the left-side join key column names."""
    return self._get_left_join_keys_set()
get_left_join_keys_list()

Returns an ordered list of the left-side join key column names.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_left_join_keys_list(self) -> list[str]:
    """Returns an ordered list of the left-side join key column names."""
    return [jm.left_col for jm in self.used_join_mapping]
get_names_for_table_rename()

Gets join mapping with renamed columns applied.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_names_for_table_rename(self) -> list[JoinMap]:
    """Gets join mapping with renamed columns applied."""
    new_mappings: list[JoinMap] = []
    left_rename_table = self.left_manager.get_rename_table()
    right_rename_table = self.right_manager.get_rename_table()

    for join_map in self.input.join_mapping:
        new_left = left_rename_table.get(join_map.left_col, join_map.left_col)
        new_right = right_rename_table.get(join_map.right_col, join_map.right_col)
        new_mappings.append(JoinMap(left_col=new_left, right_col=new_right))

    return new_mappings
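
For example, if the left join key is renamed in the select configuration, get_names_for_table_rename reflects that rename in the returned mapping. A minimal sketch, assuming the module is importable as flowfile_core.schemas.transform_schema:

from flowfile_core.schemas.transform_schema import JoinInputManager, JoinMap, SelectInput

manager = JoinInputManager.create(
    join_mapping=[JoinMap("customer_id")],
    left_select=[SelectInput("customer_id", "cid")],   # rename the left key to "cid"
    right_select=[SelectInput("customer_id")],
)

renamed = manager.get_names_for_table_rename()
print(renamed[0].left_col, renamed[0].right_col)       # "cid" "customer_id"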
get_overlapping_records()

Finds column names that would conflict after the join.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_overlapping_records(self) -> set[str]:
    """Finds column names that would conflict after the join."""
    return self.get_overlapping_columns()
get_right_join_keys()

Returns a set of the right-side join key column names.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_right_join_keys(self) -> set[str]:
    """Returns a set of the right-side join key column names."""
    return self._get_right_join_keys_set()
get_right_join_keys_list()

Returns an ordered list of the right-side join key column names.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_right_join_keys_list(self) -> list[str]:
    """Returns an ordered list of the right-side join key column names."""
    return [jm.right_col for jm in self.used_join_mapping]
get_used_join_mapping()

Returns the final join mapping after applying all renames and transformations.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_used_join_mapping(self) -> list[JoinMap]:
    """Returns the final join mapping after applying all renames and transformations."""
    new_mappings: list[JoinMap] = []
    left_rename_table = self.left_manager.get_rename_table()
    right_rename_table = self.right_manager.get_rename_table()
    left_join_rename_mapping = self.left_manager.get_join_key_rename_mapping("left")
    right_join_rename_mapping = self.right_manager.get_join_key_rename_mapping("right")
    for join_map in self.input.join_mapping:
        left_col = left_rename_table.get(join_map.left_col, join_map.left_col)
        right_col = right_rename_table.get(join_map.right_col, join_map.left_col)

        final_left = left_join_rename_mapping.get(left_col, None)
        final_right = right_join_rename_mapping.get(right_col, None)

        new_mappings.append(JoinMap(left_col=final_left, right_col=final_right))

    return new_mappings
set_join_keys()

Marks the SelectInput objects corresponding to join keys.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def set_join_keys(self) -> None:
    """Marks the `SelectInput` objects corresponding to join keys."""
    left_join_keys = self._get_left_join_keys_set()
    right_join_keys = self._get_right_join_keys_set()

    for select_input in self.input.left_select.renames:
        select_input.join_key = select_input.old_name in left_join_keys

    for select_input in self.input.right_select.renames:
        select_input.join_key = select_input.old_name in right_join_keys
to_join_input()

Creates a new JoinInput instance based on the current manager settings.

This is useful when you've modified the manager (e.g., via auto_rename) and want to get a fresh JoinInput with all the current settings applied.

Returns:

Type Description
JoinInput

A new JoinInput instance with current settings

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def to_join_input(self) -> JoinInput:
    """Creates a new JoinInput instance based on the current manager settings.

    This is useful when you've modified the manager (e.g., via auto_rename) and
    want to get a fresh JoinInput with all the current settings applied.

    Returns:
        A new JoinInput instance with current settings
    """
    return JoinInput(
        join_mapping=self.input.join_mapping,
        left_select=JoinInputs(renames=self.input.left_select.renames.copy()),
        right_select=JoinInputs(renames=self.input.right_select.renames.copy()),
        how=self.input.how,
    )
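
Because the manager works on a deep copy of the original JoinInput, to_join_input is how any changes are exported back into a plain data model. A hedged sketch (import path inferred from the source path above); add_new_select_column comes from the JoinSelectManagerMixin documented further down:

from flowfile_core.schemas.transform_schema import JoinInputManager, JoinMap, SelectInput

manager = JoinInputManager.create(
    join_mapping=[JoinMap("order_id")],
    left_select=["order_id", "amount"],
    right_select=["order_id", "status"],
)

# Add an extra column to the left selection, then export a fresh JoinInput.
manager.add_new_select_column(SelectInput("region"), side="left")
exported = manager.to_join_input()
print([s.old_name for s in exported.left_select.renames])   # ["order_id", "amount", "region"]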
JoinInputs pydantic-model

Bases: SelectInputs

Data model for join-specific select inputs (extends SelectInputs).

Show JSON schema:
{
  "$defs": {
    "SelectInput": {
      "description": "Defines how a single column should be selected, renamed, or type-cast.\n\nThis is a core building block for any operation that involves column manipulation.\nIt holds all the configuration for a single field in a selection operation.",
      "properties": {
        "old_name": {
          "title": "Old Name",
          "type": "string"
        },
        "original_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Original Position"
        },
        "new_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "New Name"
        },
        "data_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Data Type"
        },
        "data_type_change": {
          "default": false,
          "title": "Data Type Change",
          "type": "boolean"
        },
        "join_key": {
          "default": false,
          "title": "Join Key",
          "type": "boolean"
        },
        "is_altered": {
          "default": false,
          "title": "Is Altered",
          "type": "boolean"
        },
        "position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Position"
        },
        "is_available": {
          "default": true,
          "title": "Is Available",
          "type": "boolean"
        },
        "keep": {
          "default": true,
          "title": "Keep",
          "type": "boolean"
        }
      },
      "required": [
        "old_name"
      ],
      "title": "SelectInput",
      "type": "object"
    }
  },
  "description": "Data model for join-specific select inputs (extends SelectInputs).",
  "properties": {
    "renames": {
      "items": {
        "$ref": "#/$defs/SelectInput"
      },
      "title": "Renames",
      "type": "array"
    }
  },
  "title": "JoinInputs",
  "type": "object"
}

Fields:

  • renames (list[SelectInput])

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class JoinInputs(SelectInputs):
    """Data model for join-specific select inputs (extends SelectInputs)."""

    def __init__(self, renames: list[SelectInput] = None, **kwargs):
        if renames is not None:
            kwargs["renames"] = renames
        else:
            kwargs["renames"] = []
        super().__init__(**kwargs)
JoinInputsManager

Bases: SelectInputsManager

Manager for join-specific operations, extends SelectInputsManager.

Methods:

Name Description
get_join_key_rename_mapping

Returns a dictionary mapping original join key names to their temporary names.

get_join_key_renames

Gets the temporary rename mapping for all join keys on one side of a join.

get_join_key_selects

Returns only the SelectInput objects that are marked as join keys.

Attributes:

Name Type Description
join_key_selects list[SelectInput]

Backward compatibility: Returns join key SelectInputs.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class JoinInputsManager(SelectInputsManager):
    """Manager for join-specific operations, extends SelectInputsManager."""

    def __init__(self, join_inputs: JoinInputs):
        super().__init__(join_inputs)
        self.join_inputs = join_inputs

    # === Query Methods ===

    def get_join_key_selects(self) -> list[SelectInput]:
        """Returns only the `SelectInput` objects that are marked as join keys."""
        return [v for v in self.join_inputs.renames if v.join_key]

    def get_join_key_renames(self, side: SideLit, filter_drop: bool = False) -> JoinKeyRenameResponse:
        """Gets the temporary rename mapping for all join keys on one side of a join."""
        join_key_selects = self.get_join_key_selects()
        join_key_list = [
            JoinKeyRename(jk.new_name, construct_join_key_name(side, jk.new_name))
            for jk in join_key_selects
            if jk.keep or not filter_drop
        ]
        return JoinKeyRenameResponse(side, join_key_list)

    def get_join_key_rename_mapping(self, side: SideLit) -> dict[str, str]:
        """Returns a dictionary mapping original join key names to their temporary names."""
        join_key_response = self.get_join_key_renames(side)
        return {jkr.original_name: jkr.temp_name for jkr in join_key_response.join_key_renames}

    @property
    def join_key_selects(self) -> list[SelectInput]:
        """Backward compatibility: Returns join key SelectInputs."""
        return self.get_join_key_selects()
join_key_selects property

Backward compatibility: Returns join key SelectInputs.

get_join_key_rename_mapping(side)

Returns a dictionary mapping original join key names to their temporary names.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_join_key_rename_mapping(self, side: SideLit) -> dict[str, str]:
    """Returns a dictionary mapping original join key names to their temporary names."""
    join_key_response = self.get_join_key_renames(side)
    return {jkr.original_name: jkr.temp_name for jkr in join_key_response.join_key_renames}
get_join_key_renames(side, filter_drop=False)

Gets the temporary rename mapping for all join keys on one side of a join.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_join_key_renames(self, side: SideLit, filter_drop: bool = False) -> JoinKeyRenameResponse:
    """Gets the temporary rename mapping for all join keys on one side of a join."""
    join_key_selects = self.get_join_key_selects()
    join_key_list = [
        JoinKeyRename(jk.new_name, construct_join_key_name(side, jk.new_name))
        for jk in join_key_selects
        if jk.keep or not filter_drop
    ]
    return JoinKeyRenameResponse(side, join_key_list)
get_join_key_selects()

Returns only the SelectInput objects that are marked as join keys.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_join_key_selects(self) -> list[SelectInput]:
    """Returns only the `SelectInput` objects that are marked as join keys."""
    return [v for v in self.join_inputs.renames if v.join_key]
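
A short sketch of the manager in isolation (import path inferred from the source path above). The temporary names returned by get_join_key_rename_mapping are produced by construct_join_key_name, whose exact format is defined elsewhere in the module:

from flowfile_core.schemas.transform_schema import JoinInputs, JoinInputsManager, SelectInput

inputs = JoinInputs(renames=[
    SelectInput("order_id"),
    SelectInput("customer_id", join_key=True),   # mark the join key explicitly
])
manager = JoinInputsManager(inputs)

print([s.old_name for s in manager.get_join_key_selects()])   # ["customer_id"]
print(manager.get_join_key_rename_mapping("left"))            # {"customer_id": <temporary left-side name>}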
JoinKeyRename

Bases: NamedTuple

Represents the renaming of a join key from its original to a temporary name.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class JoinKeyRename(NamedTuple):
    """Represents the renaming of a join key from its original to a temporary name."""

    original_name: str
    temp_name: str
JoinKeyRenameResponse

Bases: NamedTuple

Contains a list of join key renames for one side of a join.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class JoinKeyRenameResponse(NamedTuple):
    """Contains a list of join key renames for one side of a join."""

    side: SideLit
    join_key_renames: list[JoinKeyRename]
JoinMap pydantic-model

Bases: BaseModel

Defines a single mapping between a left and right column for a join key.

Show JSON schema:
{
  "description": "Defines a single mapping between a left and right column for a join key.",
  "properties": {
    "left_col": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Left Col"
    },
    "right_col": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Right Col"
    }
  },
  "title": "JoinMap",
  "type": "object"
}

Fields:

  • left_col (str | None)
  • right_col (str | None)

Validators:

  • set_default_right_col

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class JoinMap(BaseModel):
    """Defines a single mapping between a left and right column for a join key."""

    left_col: str | None = None
    right_col: str | None = None

    def __init__(self, left_col: str = None, right_col: str = None, **data):
        if left_col is not None:
            data["left_col"] = left_col
        if right_col is not None:
            data["right_col"] = right_col
        super().__init__(**data)

    @model_validator(mode="after")
    def set_default_right_col(self):
        """If right_col is None, default it to left_col."""
        if self.right_col is None:
            self.right_col = self.left_col
        return self
set_default_right_col() pydantic-validator

If right_col is None, default it to left_col.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@model_validator(mode="after")
def set_default_right_col(self):
    """If right_col is None, default it to left_col."""
    if self.right_col is None:
        self.right_col = self.left_col
    return self
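
In practice this means the common case of joining on identically named columns only needs the left name, as in this small sketch (import path inferred from the source path above):

from flowfile_core.schemas.transform_schema import JoinMap

same = JoinMap("customer_id")                    # right_col omitted
print(same.right_col)                            # "customer_id" — defaulted by the validator

different = JoinMap(left_col="cust_id", right_col="customer_id")
print(different.left_col, different.right_col)   # "cust_id" "customer_id"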
JoinSelectManagerMixin

Mixin providing common methods for join-like operations.

Methods:

Name Description
add_new_select_column

Adds a new column to the selection for either the left or right side.

auto_generate_new_col_name

Generates a new, non-conflicting column name by adding a suffix if necessary.

get_overlapping_columns

Finds column names that would conflict after the join.

parse_select

Parses various input formats into a standardized JoinInputs object.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class JoinSelectManagerMixin:
    """Mixin providing common methods for join-like operations."""

    left_manager: JoinInputsManager
    right_manager: JoinInputsManager
    input: CrossJoinInput | JoinInput | FuzzyMatchInput

    @staticmethod
    def parse_select(select: list[SelectInput] | list[str] | list[dict] | dict) -> JoinInputs:
        """Parses various input formats into a standardized `JoinInputs` object."""
        if not select:
            return JoinInputs(renames=[])

        if all(isinstance(c, SelectInput) for c in select):
            return JoinInputs(renames=select)
        elif all(isinstance(c, dict) for c in select):
            return JoinInputs(renames=[SelectInput(**c) for c in select])
        elif isinstance(select, dict):
            renames = select.get("renames")
            if renames:
                return JoinInputs(renames=[SelectInput(**c) for c in renames])
            return JoinInputs(renames=[])
        elif all(isinstance(c, str) for c in select):
            return JoinInputs(renames=[SelectInput(old_name=s, new_name=s) for s in select])

        raise ValueError(f"Unable to parse select input: {type(select)}")

    def get_overlapping_columns(self) -> set[str]:
        """Finds column names that would conflict after the join."""
        return self.left_manager.get_new_cols() & self.right_manager.get_new_cols()

    def auto_generate_new_col_name(self, old_col_name: str, side: str) -> str:
        """Generates a new, non-conflicting column name by adding a suffix if necessary."""
        current_names = self.get_overlapping_columns()
        if old_col_name not in current_names:
            return old_col_name

        new_name = old_col_name
        while new_name in current_names:
            new_name = f"{side}_{new_name}"
        return new_name

    def add_new_select_column(self, select_input: SelectInput, side: str) -> None:
        """Adds a new column to the selection for either the left or right side."""
        target_input = self.input.right_select if side == "right" else self.input.left_select

        select_input.new_name = self.auto_generate_new_col_name(select_input.old_name, side=side)

        target_input.renames.append(select_input)
add_new_select_column(select_input, side)

Adds a new column to the selection for either the left or right side.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def add_new_select_column(self, select_input: SelectInput, side: str) -> None:
    """Adds a new column to the selection for either the left or right side."""
    target_input = self.input.right_select if side == "right" else self.input.left_select

    select_input.new_name = self.auto_generate_new_col_name(select_input.old_name, side=side)

    target_input.renames.append(select_input)
auto_generate_new_col_name(old_col_name, side)

Generates a new, non-conflicting column name by adding a suffix if necessary.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def auto_generate_new_col_name(self, old_col_name: str, side: str) -> str:
    """Generates a new, non-conflicting column name by adding a suffix if necessary."""
    current_names = self.get_overlapping_columns()
    if old_col_name not in current_names:
        return old_col_name

    new_name = old_col_name
    while new_name in current_names:
        new_name = f"{side}_{new_name}"
    return new_name
get_overlapping_columns()

Finds column names that would conflict after the join.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_overlapping_columns(self) -> set[str]:
    """Finds column names that would conflict after the join."""
    return self.left_manager.get_new_cols() & self.right_manager.get_new_cols()
parse_select(select) staticmethod

Parses various input formats into a standardized JoinInputs object.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@staticmethod
def parse_select(select: list[SelectInput] | list[str] | list[dict] | dict) -> JoinInputs:
    """Parses various input formats into a standardized `JoinInputs` object."""
    if not select:
        return JoinInputs(renames=[])

    if all(isinstance(c, SelectInput) for c in select):
        return JoinInputs(renames=select)
    elif all(isinstance(c, dict) for c in select):
        return JoinInputs(renames=[SelectInput(**c) for c in select])
    elif isinstance(select, dict):
        renames = select.get("renames")
        if renames:
            return JoinInputs(renames=[SelectInput(**c) for c in renames])
        return JoinInputs(renames=[])
    elif all(isinstance(c, str) for c in select):
        return JoinInputs(renames=[SelectInput(old_name=s, new_name=s) for s in select])

    raise ValueError(f"Unable to parse select input: {type(select)}")
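
Because parse_select is a staticmethod, it can be called directly on the mixin (or on any class that inherits it). A minimal sketch, assuming the import path flowfile_core.schemas.transform_schema:

from flowfile_core.schemas.transform_schema import JoinSelectManagerMixin

from_names = JoinSelectManagerMixin.parse_select(["id", "amount"])
print([(s.old_name, s.new_name) for s in from_names.renames])
# [("id", "id"), ("amount", "amount")]

from_dicts = JoinSelectManagerMixin.parse_select(
    [{"old_name": "id"}, {"old_name": "amount", "new_name": "total"}]
)
print(from_dicts.renames[1].new_name)   # "total"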
PivotInput pydantic-model

Bases: BaseModel

Defines the settings for a pivot (long-to-wide) operation.

Show JSON schema:
{
  "description": "Defines the settings for a pivot (long-to-wide) operation.",
  "properties": {
    "index_columns": {
      "items": {
        "type": "string"
      },
      "title": "Index Columns",
      "type": "array"
    },
    "pivot_column": {
      "title": "Pivot Column",
      "type": "string"
    },
    "value_col": {
      "title": "Value Col",
      "type": "string"
    },
    "aggregations": {
      "items": {
        "type": "string"
      },
      "title": "Aggregations",
      "type": "array"
    }
  },
  "required": [
    "index_columns",
    "pivot_column",
    "value_col",
    "aggregations"
  ],
  "title": "PivotInput",
  "type": "object"
}

Fields:

  • index_columns (list[str])
  • pivot_column (str)
  • value_col (str)
  • aggregations (list[str])
Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class PivotInput(BaseModel):
    """Defines the settings for a pivot (long-to-wide) operation."""

    index_columns: list[str]
    pivot_column: str
    value_col: str
    aggregations: list[str]

    @property
    def grouped_columns(self) -> list[str]:
        """Returns the list of columns to be used for the initial grouping stage of the pivot."""
        return self.index_columns + [self.pivot_column]

    def get_group_by_input(self) -> GroupByInput:
        """Constructs the `GroupByInput` needed for the pre-aggregation step of the pivot."""
        group_by_cols = [AggColl(old_name=c, agg="groupby") for c in self.grouped_columns]
        agg_cols = [
            AggColl(old_name=self.value_col, agg=aggregation, new_name=aggregation) for aggregation in self.aggregations
        ]
        return GroupByInput(agg_cols=group_by_cols + agg_cols)

    def get_index_columns(self) -> list[pl.col]:
        """Returns the index columns as Polars column expressions."""
        return [pl.col(c) for c in self.index_columns]

    def get_pivot_column(self) -> pl.Expr:
        """Returns the pivot column as a Polars column expression."""
        return pl.col(self.pivot_column)

    def get_values_expr(self) -> pl.Expr:
        """Creates the struct expression used to gather the values for pivoting."""
        return pl.struct([pl.col(c) for c in self.aggregations]).alias("vals")
grouped_columns property

Returns the list of columns to be used for the initial grouping stage of the pivot.

get_group_by_input()

Constructs the GroupByInput needed for the pre-aggregation step of the pivot.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_group_by_input(self) -> GroupByInput:
    """Constructs the `GroupByInput` needed for the pre-aggregation step of the pivot."""
    group_by_cols = [AggColl(old_name=c, agg="groupby") for c in self.grouped_columns]
    agg_cols = [
        AggColl(old_name=self.value_col, agg=aggregation, new_name=aggregation) for aggregation in self.aggregations
    ]
    return GroupByInput(agg_cols=group_by_cols + agg_cols)
get_index_columns()

Returns the index columns as Polars column expressions.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_index_columns(self) -> list[pl.col]:
    """Returns the index columns as Polars column expressions."""
    return [pl.col(c) for c in self.index_columns]
get_pivot_column()

Returns the pivot column as a Polars column expression.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_pivot_column(self) -> pl.Expr:
    """Returns the pivot column as a Polars column expression."""
    return pl.col(self.pivot_column)
get_values_expr()

Creates the struct expression used to gather the values for pivoting.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def get_values_expr(self) -> pl.Expr:
    """Creates the struct expression used to gather the values for pivoting."""
    return pl.struct([pl.col(c) for c in self.aggregations]).alias("vals")
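
A short sketch of how the pieces fit together: the index and pivot columns form the pre-aggregation grouping, and each requested aggregation of value_col becomes a column in the grouped result (import path inferred from the source path above):

from flowfile_core.schemas.transform_schema import PivotInput

pivot = PivotInput(
    index_columns=["region"],
    pivot_column="year",
    value_col="sales",
    aggregations=["sum", "mean"],
)

print(pivot.grouped_columns)       # ["region", "year"]
group_by = pivot.get_group_by_input()
# groups on "region" and "year", then aggregates "sales" as "sum" and "mean"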
PolarsCodeInput pydantic-model

Bases: BaseModel

A simple container for a string of user-provided Polars code to be executed.

Show JSON schema:
{
  "description": "A simple container for a string of user-provided Polars code to be executed.",
  "properties": {
    "polars_code": {
      "title": "Polars Code",
      "type": "string"
    }
  },
  "required": [
    "polars_code"
  ],
  "title": "PolarsCodeInput",
  "type": "object"
}

Fields:

  • polars_code (str)
Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class PolarsCodeInput(BaseModel):
    """A simple container for a string of user-provided Polars code to be executed."""

    polars_code: str
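
The model only carries the code string; which names the snippet may reference is defined by the Polars-code node that executes it, not by this model. A hypothetical sketch (import path inferred from the source path above):

from flowfile_core.schemas.transform_schema import PolarsCodeInput

# "input_df" is a placeholder here — the variable names available to the snippet
# are determined by the executing node, not by this data model.
code_input = PolarsCodeInput(polars_code="input_df.filter(pl.col('amount') > 0)")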
RecordIdInput pydantic-model

Bases: BaseModel

Defines settings for adding a record ID (row number) column to the data.

Show JSON schema:
{
  "description": "Defines settings for adding a record ID (row number) column to the data.",
  "properties": {
    "output_column_name": {
      "default": "record_id",
      "title": "Output Column Name",
      "type": "string"
    },
    "offset": {
      "default": 1,
      "title": "Offset",
      "type": "integer"
    },
    "group_by": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": false,
      "title": "Group By"
    },
    "group_by_columns": {
      "anyOf": [
        {
          "items": {
            "type": "string"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "title": "Group By Columns"
    }
  },
  "title": "RecordIdInput",
  "type": "object"
}

Fields:

  • output_column_name (str)
  • offset (int)
  • group_by (bool | None)
  • group_by_columns (list[str] | None)
Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class RecordIdInput(BaseModel):
    """Defines settings for adding a record ID (row number) column to the data."""

    output_column_name: str = "record_id"
    offset: int = 1
    group_by: bool | None = False
    group_by_columns: list[str] | None = Field(default_factory=list)
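
Two hedged configuration sketches (import path inferred from the source path above): the defaults produce a plain record_id column starting at 1, while group_by and group_by_columns carry the settings for per-group numbering applied by the executing node:

from flowfile_core.schemas.transform_schema import RecordIdInput

defaults = RecordIdInput()            # column "record_id", numbering starts at 1
per_customer = RecordIdInput(
    output_column_name="row_nr",
    offset=0,
    group_by=True,
    group_by_columns=["customer_id"],
)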
SelectInput pydantic-model

Bases: BaseModel

Defines how a single column should be selected, renamed, or type-cast.

This is a core building block for any operation that involves column manipulation. It holds all the configuration for a single field in a selection operation.

Show JSON schema:
{
  "description": "Defines how a single column should be selected, renamed, or type-cast.\n\nThis is a core building block for any operation that involves column manipulation.\nIt holds all the configuration for a single field in a selection operation.",
  "properties": {
    "old_name": {
      "title": "Old Name",
      "type": "string"
    },
    "original_position": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Original Position"
    },
    "new_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "New Name"
    },
    "data_type": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Data Type"
    },
    "data_type_change": {
      "default": false,
      "title": "Data Type Change",
      "type": "boolean"
    },
    "join_key": {
      "default": false,
      "title": "Join Key",
      "type": "boolean"
    },
    "is_altered": {
      "default": false,
      "title": "Is Altered",
      "type": "boolean"
    },
    "position": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Position"
    },
    "is_available": {
      "default": true,
      "title": "Is Available",
      "type": "boolean"
    },
    "keep": {
      "default": true,
      "title": "Keep",
      "type": "boolean"
    }
  },
  "required": [
    "old_name"
  ],
  "title": "SelectInput",
  "type": "object"
}

Config:

  • frozen: False

Fields:

  • old_name (str)
  • original_position (int | None)
  • new_name (str | None)
  • data_type (str | None)
  • data_type_change (bool)
  • join_key (bool)
  • is_altered (bool)
  • position (int | None)
  • is_available (bool)
  • keep (bool)

Validators:

  • infer_data_type_change
  • set_default_new_name

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class SelectInput(BaseModel):
    """Defines how a single column should be selected, renamed, or type-cast.

    This is a core building block for any operation that involves column manipulation.
    It holds all the configuration for a single field in a selection operation.
    """

    model_config = ConfigDict(frozen=False)

    old_name: str
    original_position: int | None = None
    new_name: str | None = None
    data_type: str | None = None
    data_type_change: bool = False
    join_key: bool = False
    is_altered: bool = False
    position: int | None = None
    is_available: bool = True
    keep: bool = True

    def __init__(self, old_name: str = None, new_name: str = None, **data):
        if old_name is not None:
            data["old_name"] = old_name
        if new_name is not None:
            data["new_name"] = new_name
        super().__init__(**data)

    def to_yaml_dict(self) -> SelectInputYaml:
        """Serialize for YAML output - only user-relevant fields."""
        result: SelectInputYaml = {"old_name": self.old_name}
        if self.new_name != self.old_name:
            result["new_name"] = self.new_name
        if not self.keep:
            result["keep"] = self.keep
        # Always include data_type if it's set, not just when data_type_change is True
        # This ensures undo/redo snapshots preserve the data_type field
        if self.data_type:
            result["data_type"] = self.data_type
        return result

    @classmethod
    def from_yaml_dict(cls, data: dict) -> "SelectInput":
        """Load from slim YAML format."""
        old_name = data["old_name"]
        new_name = data.get("new_name", old_name)
        data_type = data.get("data_type")
        # is_altered should be True if either name was changed OR data_type was explicitly set
        # This ensures updateNodeSelect in the frontend won't overwrite user-specified data_type
        is_altered = (old_name != new_name) or (data_type is not None)
        return cls(
            old_name=old_name,
            new_name=new_name,
            keep=data.get("keep", True),
            data_type=data_type,
            data_type_change=data_type is not None,
            is_altered=is_altered,
        )

    @model_validator(mode="before")
    @classmethod
    def infer_data_type_change(cls, data):
        """Infer data_type_change when loading from YAML.

        When data_type is present but data_type_change is not explicitly set,
        infer that the user explicitly set the data_type (e.g., when loading from YAML).
        This ensures is_altered will be set correctly in the after validator.
        """
        if isinstance(data, dict):
            if data.get("data_type") is not None and "data_type_change" not in data:
                data["data_type_change"] = True
        return data

    @model_validator(mode="after")
    def set_default_new_name(self):
        """If new_name is None, default it to old_name. Also set is_altered if needed."""
        if self.new_name is None:
            self.new_name = self.old_name
        if self.old_name != self.new_name:
            self.is_altered = True
        if self.data_type_change:
            self.is_altered = True
        return self

    def __hash__(self):
        """Allow SelectInput to be used in sets and as dict keys."""
        return hash(self.old_name)

    def __eq__(self, other):
        """Required when implementing __hash__."""
        if not isinstance(other, SelectInput):
            return False
        return self.old_name == other.old_name

    @property
    def polars_type(self) -> str:
        """Translates a user-friendly type name to a Polars data type string."""
        data_type_lower = self.data_type.lower()
        if data_type_lower == "string":
            return "Utf8"
        elif data_type_lower == "integer":
            return "Int64"
        elif data_type_lower == "double":
            return "Float64"
        return self.data_type
polars_type property

Translates a user-friendly type name to a Polars data type string.

__eq__(other)

Required when implementing __hash__.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def __eq__(self, other):
    """Required when implementing __hash__."""
    if not isinstance(other, SelectInput):
        return False
    return self.old_name == other.old_name
__hash__()

Allow SelectInput to be used in sets and as dict keys.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def __hash__(self):
    """Allow SelectInput to be used in sets and as dict keys."""
    return hash(self.old_name)
from_yaml_dict(data) classmethod

Load from slim YAML format.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@classmethod
def from_yaml_dict(cls, data: dict) -> "SelectInput":
    """Load from slim YAML format."""
    old_name = data["old_name"]
    new_name = data.get("new_name", old_name)
    data_type = data.get("data_type")
    # is_altered should be True if either name was changed OR data_type was explicitly set
    # This ensures updateNodeSelect in the frontend won't overwrite user-specified data_type
    is_altered = (old_name != new_name) or (data_type is not None)
    return cls(
        old_name=old_name,
        new_name=new_name,
        keep=data.get("keep", True),
        data_type=data_type,
        data_type_change=data_type is not None,
        is_altered=is_altered,
    )
infer_data_type_change(data) pydantic-validator

Infer data_type_change when loading from YAML.

When data_type is present but data_type_change is not explicitly set, infer that the user explicitly set the data_type (e.g., when loading from YAML). This ensures is_altered will be set correctly in the after validator.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@model_validator(mode="before")
@classmethod
def infer_data_type_change(cls, data):
    """Infer data_type_change when loading from YAML.

    When data_type is present but data_type_change is not explicitly set,
    infer that the user explicitly set the data_type (e.g., when loading from YAML).
    This ensures is_altered will be set correctly in the after validator.
    """
    if isinstance(data, dict):
        if data.get("data_type") is not None and "data_type_change" not in data:
            data["data_type_change"] = True
    return data
set_default_new_name() pydantic-validator

If new_name is None, default it to old_name. Also set is_altered if needed.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@model_validator(mode="after")
def set_default_new_name(self):
    """If new_name is None, default it to old_name. Also set is_altered if needed."""
    if self.new_name is None:
        self.new_name = self.old_name
    if self.old_name != self.new_name:
        self.is_altered = True
    if self.data_type_change:
        self.is_altered = True
    return self
to_yaml_dict()

Serialize for YAML output - only user-relevant fields.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def to_yaml_dict(self) -> SelectInputYaml:
    """Serialize for YAML output - only user-relevant fields."""
    result: SelectInputYaml = {"old_name": self.old_name}
    if self.new_name != self.old_name:
        result["new_name"] = self.new_name
    if not self.keep:
        result["keep"] = self.keep
    # Always include data_type if it's set, not just when data_type_change is True
    # This ensures undo/redo snapshots preserve the data_type field
    if self.data_type:
        result["data_type"] = self.data_type
    return result
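
The validators and the YAML helpers are designed to round-trip: setting a data_type flips data_type_change and is_altered, to_yaml_dict keeps only the user-relevant fields, and from_yaml_dict restores an equivalent object. A minimal sketch (import path inferred from the source path above):

from flowfile_core.schemas.transform_schema import SelectInput

sel = SelectInput(old_name="order_date", new_name="ordered_at", data_type="string")
print(sel.data_type_change, sel.is_altered)   # True True — set by the validators
print(sel.polars_type)                        # "Utf8"

slim = sel.to_yaml_dict()
# {"old_name": "order_date", "new_name": "ordered_at", "data_type": "string"}
restored = SelectInput.from_yaml_dict(slim)
print(restored == sel)                        # True — equality compares old_name only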
SelectInputs pydantic-model

Bases: BaseModel

A container for a list of SelectInput objects (pure data, no logic).

Show JSON schema:
{
  "$defs": {
    "SelectInput": {
      "description": "Defines how a single column should be selected, renamed, or type-cast.\n\nThis is a core building block for any operation that involves column manipulation.\nIt holds all the configuration for a single field in a selection operation.",
      "properties": {
        "old_name": {
          "title": "Old Name",
          "type": "string"
        },
        "original_position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Original Position"
        },
        "new_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "New Name"
        },
        "data_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Data Type"
        },
        "data_type_change": {
          "default": false,
          "title": "Data Type Change",
          "type": "boolean"
        },
        "join_key": {
          "default": false,
          "title": "Join Key",
          "type": "boolean"
        },
        "is_altered": {
          "default": false,
          "title": "Is Altered",
          "type": "boolean"
        },
        "position": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Position"
        },
        "is_available": {
          "default": true,
          "title": "Is Available",
          "type": "boolean"
        },
        "keep": {
          "default": true,
          "title": "Keep",
          "type": "boolean"
        }
      },
      "required": [
        "old_name"
      ],
      "title": "SelectInput",
      "type": "object"
    }
  },
  "description": "A container for a list of `SelectInput` objects (pure data, no logic).",
  "properties": {
    "renames": {
      "items": {
        "$ref": "#/$defs/SelectInput"
      },
      "title": "Renames",
      "type": "array"
    }
  },
  "title": "SelectInputs",
  "type": "object"
}

Fields:

  • renames (list[SelectInput])

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class SelectInputs(BaseModel):
    """A container for a list of `SelectInput` objects (pure data, no logic)."""

    renames: list[SelectInput] = Field(default_factory=list)

    def __init__(self, renames: list[SelectInput] = None, **kwargs):
        if renames is not None:
            kwargs["renames"] = renames
        else:
            kwargs["renames"] = []
        super().__init__(**kwargs)

    def to_yaml_dict(self) -> JoinInputsYaml:
        """Serialize for YAML output."""
        return {"select": [r.to_yaml_dict() for r in self.renames]}

    @classmethod
    def from_yaml_dict(cls, data: dict) -> "SelectInputs":
        """Load from slim YAML format. Supports both 'select' (new) and 'renames' (internal)."""
        items = data.get("select", data.get("renames", []))
        return cls(renames=[SelectInput.from_yaml_dict(item) for item in items])

    @classmethod
    def create_from_list(cls, col_list: list[str]) -> "SelectInputs":
        """Creates a SelectInputs object from a simple list of column names."""
        return cls(renames=[SelectInput(old_name=c) for c in col_list])

    @classmethod
    def create_from_pl_df(cls, df: pl.DataFrame | pl.LazyFrame) -> "SelectInputs":
        """Creates a SelectInputs object from a Polars DataFrame's columns."""
        return cls(renames=[SelectInput(old_name=c) for c in df.columns])

    def remove_select_input(self, old_key: str) -> None:
        """Removes a SelectInput from the list based on its original name."""
        self.renames = [rename for rename in self.renames if rename.old_name != old_key]
create_from_list(col_list) classmethod

Creates a SelectInputs object from a simple list of column names.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@classmethod
def create_from_list(cls, col_list: list[str]) -> "SelectInputs":
    """Creates a SelectInputs object from a simple list of column names."""
    return cls(renames=[SelectInput(old_name=c) for c in col_list])
create_from_pl_df(df) classmethod

Creates a SelectInputs object from a Polars DataFrame's columns.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@classmethod
def create_from_pl_df(cls, df: pl.DataFrame | pl.LazyFrame) -> "SelectInputs":
    """Creates a SelectInputs object from a Polars DataFrame's columns."""
    return cls(renames=[SelectInput(old_name=c) for c in df.columns])
from_yaml_dict(data) classmethod

Load from slim YAML format. Supports both 'select' (new) and 'renames' (internal).

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
@classmethod
def from_yaml_dict(cls, data: dict) -> "SelectInputs":
    """Load from slim YAML format. Supports both 'select' (new) and 'renames' (internal)."""
    items = data.get("select", data.get("renames", []))
    return cls(renames=[SelectInput.from_yaml_dict(item) for item in items])
remove_select_input(old_key)

Removes a SelectInput from the list based on its original name.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def remove_select_input(self, old_key: str) -> None:
    """Removes a SelectInput from the list based on its original name."""
    self.renames = [rename for rename in self.renames if rename.old_name != old_key]
to_yaml_dict()

Serialize for YAML output.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
def to_yaml_dict(self) -> JoinInputsYaml:
    """Serialize for YAML output."""
    return {"select": [r.to_yaml_dict() for r in self.renames]}
SelectInputsManager

Manager class that provides all query and mutation operations.

Methods:

Name Description
__add__

Backward compatibility: Support += operator for appending.

append

Appends a new SelectInput to the list of renames.

find_by_new_name

Find SelectInput by new column name.

find_by_old_name

Find SelectInput by original column name.

get_drop_columns

Returns a list of SelectInput objects that are marked to be dropped.

get_new_cols

Returns a set of new (renamed) column names to be kept in the selection.

get_non_jk_drop_columns

Returns drop columns that are not join keys.

get_old_cols

Returns a set of original column names to be kept in the selection.

get_rename_table

Generates a dictionary for use in Polars' .rename() method.

get_select_cols

Gets a list of original column names to select from the source DataFrame.

get_select_input_on_new_name

Backward compatibility alias: Find SelectInput by new column name.

get_select_input_on_old_name

Backward compatibility alias: Find SelectInput by original column name.

has_drop_cols

Checks if any column is marked to be dropped from the selection.

remove_select_input

Removes a SelectInput from the list based on its original name.

unselect_field

Marks a field to be dropped from the final selection by setting keep to False.

Attributes:

Name Type Description
drop_columns list[SelectInput]

Backward compatibility: Returns list of columns to drop.

new_cols set[str]

Backward compatibility: Returns set of new column names.

non_jk_drop_columns list[SelectInput]

Backward compatibility: Returns non-join-key columns to drop.

old_cols set[str]

Backward compatibility: Returns set of old column names.

rename_table dict[str, str]

Backward compatibility: Returns rename table dictionary.

renames list[SelectInput]

Backward compatibility: Direct access to renames list.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py
class SelectInputsManager:
    """Manager class that provides all query and mutation operations."""

    def __init__(self, select_inputs: SelectInputs):
        self.select_inputs = select_inputs

    # === Query Methods (read-only) ===

    def get_old_cols(self) -> set[str]:
        """Returns a set of original column names to be kept in the selection."""
        return set(v.old_name for v in self.select_inputs.renames if v.keep)

    def get_new_cols(self) -> set[str]:
        """Returns a set of new (renamed) column names to be kept in the selection."""
        return set(v.new_name for v in self.select_inputs.renames if v.keep)

    def get_rename_table(self) -> dict[str, str]:
        """Generates a dictionary for use in Polars' `.rename()` method."""
        return {v.old_name: v.new_name for v in self.select_inputs.renames if v.is_available and (v.keep or v.join_key)}

    def get_select_cols(self, include_join_key: bool = True) -> list[str]:
        """Gets a list of original column names to select from the source DataFrame."""
        return [v.old_name for v in self.select_inputs.renames if v.keep or (v.join_key and include_join_key)]

    def has_drop_cols(self) -> bool:
        """Checks if any column is marked to be dropped from the selection."""
        return any(not v.keep for v in self.select_inputs.renames)

    def get_drop_columns(self) -> list[SelectInput]:
        """Returns a list of SelectInput objects that are marked to be dropped."""
        return [v for v in self.select_inputs.renames if not v.keep and v.is_available]

    def get_non_jk_drop_columns(self) -> list[SelectInput]:
        """Returns drop columns that are not join keys."""
        return [v for v in self.select_inputs.renames if not v.keep and v.is_available and not v.join_key]

    def find_by_old_name(self, old_name: str) -> SelectInput | None:
        """Find SelectInput by original column name."""
        return next((v for v in self.select_inputs.renames if v.old_name == old_name), None)

    def find_by_new_name(self, new_name: str) -> SelectInput | None:
        """Find SelectInput by new column name."""
        return next((v for v in self.select_inputs.renames if v.new_name == new_name), None)

    # === Mutation Methods ===

    def append(self, other: SelectInput) -> None:
        """Appends a new SelectInput to the list of renames."""
        self.select_inputs.renames.append(other)

    def remove_select_input(self, old_key: str) -> None:
        """Removes a SelectInput from the list based on its original name."""
        self.select_inputs.renames = [rename for rename in self.select_inputs.renames if rename.old_name != old_key]

    def unselect_field(self, old_key: str) -> None:
        """Marks a field to be dropped from the final selection by setting `keep` to False."""
        for rename in self.select_inputs.renames:
            if old_key == rename.old_name:
                rename.keep = False

    # === Backward Compatibility Properties ===

    @property
    def old_cols(self) -> set[str]:
        """Backward compatibility: Returns set of old column names."""
        return self.get_old_cols()

    @property
    def new_cols(self) -> set[str]:
        """Backward compatibility: Returns set of new column names."""
        return self.get_new_cols()

    @property
    def rename_table(self) -> dict[str, str]:
        """Backward compatibility: Returns rename table dictionary."""
        return self.get_rename_table()

    @property
    def drop_columns(self) -> list[SelectInput]:
        """Backward compatibility: Returns list of columns to drop."""
        return self.get_drop_columns()

    @property
    def non_jk_drop_columns(self) -> list[SelectInput]:
        """Backward compatibility: Returns non-join-key columns to drop."""
        return self.get_non_jk_drop_columns()

    @property
    def renames(self) -> list[SelectInput]:
        """Backward compatibility: Direct access to renames list."""
        return self.select_inputs.renames

    def get_select_input_on_old_name(self, old_name: str) -> SelectInput | None:
        """Backward compatibility alias: Find SelectInput by original column name."""
        return self.find_by_old_name(old_name)

    def get_select_input_on_new_name(self, new_name: str) -> SelectInput | None:
        """Backward compatibility alias: Find SelectInput by new column name."""
        return self.find_by_new_name(new_name)

    def __add__(self, other: SelectInput) -> "SelectInputsManager":
        """Backward compatibility: Support += operator for appending."""
        self.append(other)
        return self
drop_columns property

Backward compatibility: Returns list of columns to drop.

new_cols property

Backward compatibility: Returns set of new column names.

non_jk_drop_columns property

Backward compatibility: Returns non-join-key columns to drop.

old_cols property

Backward compatibility: Returns set of old column names.

rename_table property

Backward compatibility: Returns rename table dictionary.

renames property

Backward compatibility: Direct access to renames list.

__add__(other)

Backward compatibility: Support += operator for appending.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1133-1136)
def __add__(self, other: SelectInput) -> "SelectInputsManager":
    """Backward compatibility: Support += operator for appending."""
    self.append(other)
    return self
append(other)

Appends a new SelectInput to the list of renames.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1079-1081)
def append(self, other: SelectInput) -> None:
    """Appends a new SelectInput to the list of renames."""
    self.select_inputs.renames.append(other)
find_by_new_name(new_name)

Find SelectInput by new column name.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1073-1075)
def find_by_new_name(self, new_name: str) -> SelectInput | None:
    """Find SelectInput by new column name."""
    return next((v for v in self.select_inputs.renames if v.new_name == new_name), None)
find_by_old_name(old_name)

Find SelectInput by original column name.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1069-1071)
def find_by_old_name(self, old_name: str) -> SelectInput | None:
    """Find SelectInput by original column name."""
    return next((v for v in self.select_inputs.renames if v.old_name == old_name), None)
get_drop_columns()

Returns a list of SelectInput objects that are marked to be dropped.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1061-1063)
def get_drop_columns(self) -> list[SelectInput]:
    """Returns a list of SelectInput objects that are marked to be dropped."""
    return [v for v in self.select_inputs.renames if not v.keep and v.is_available]
get_new_cols()

Returns a set of new (renamed) column names to be kept in the selection.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1045-1047)
def get_new_cols(self) -> set[str]:
    """Returns a set of new (renamed) column names to be kept in the selection."""
    return set(v.new_name for v in self.select_inputs.renames if v.keep)
get_non_jk_drop_columns()

Returns drop columns that are not join keys.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1065-1067)
def get_non_jk_drop_columns(self) -> list[SelectInput]:
    """Returns drop columns that are not join keys."""
    return [v for v in self.select_inputs.renames if not v.keep and v.is_available and not v.join_key]
get_old_cols()

Returns a set of original column names to be kept in the selection.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1041-1043)
def get_old_cols(self) -> set[str]:
    """Returns a set of original column names to be kept in the selection."""
    return set(v.old_name for v in self.select_inputs.renames if v.keep)
get_rename_table()

Generates a dictionary for use in Polars' .rename() method.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1049-1051)
def get_rename_table(self) -> dict[str, str]:
    """Generates a dictionary for use in Polars' `.rename()` method."""
    return {v.old_name: v.new_name for v in self.select_inputs.renames if v.is_available and (v.keep or v.join_key)}
get_select_cols(include_join_key=True)

Gets a list of original column names to select from the source DataFrame.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1053-1055)
def get_select_cols(self, include_join_key: bool = True) -> list[str]:
    """Gets a list of original column names to select from the source DataFrame."""
    return [v.old_name for v in self.select_inputs.renames if v.keep or (v.join_key and include_join_key)]
get_select_input_on_new_name(new_name)

Backward compatibility alias: Find SelectInput by new column name.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1129-1131)
def get_select_input_on_new_name(self, new_name: str) -> SelectInput | None:
    """Backward compatibility alias: Find SelectInput by new column name."""
    return self.find_by_new_name(new_name)
get_select_input_on_old_name(old_name)

Backward compatibility alias: Find SelectInput by original column name.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1125-1127)
def get_select_input_on_old_name(self, old_name: str) -> SelectInput | None:
    """Backward compatibility alias: Find SelectInput by original column name."""
    return self.find_by_old_name(old_name)
has_drop_cols()

Checks if any column is marked to be dropped from the selection.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1057-1059)
def has_drop_cols(self) -> bool:
    """Checks if any column is marked to be dropped from the selection."""
    return any(not v.keep for v in self.select_inputs.renames)
remove_select_input(old_key)

Removes a SelectInput from the list based on its original name.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1083-1085)
def remove_select_input(self, old_key: str) -> None:
    """Removes a SelectInput from the list based on its original name."""
    self.select_inputs.renames = [rename for rename in self.select_inputs.renames if rename.old_name != old_key]
unselect_field(old_key)

Marks a field to be dropped from the final selection by setting keep to False.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1087-1091)
def unselect_field(self, old_key: str) -> None:
    """Marks a field to be dropped from the final selection by setting `keep` to False."""
    for rename in self.select_inputs.renames:
        if old_key == rename.old_name:
            rename.keep = False
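A hedged end-to-end sketch of the manager (assuming SelectInput defaults of keep=True and is_available=True, that new_name falls back to the original name when not set, and that the column names below are hypothetical):

from flowfile_core.schemas.transform_schema import SelectInput, SelectInputs, SelectInputsManager

inputs = SelectInputs(renames=[
    SelectInput(old_name="customer_id", new_name="id"),
    SelectInput(old_name="tmp_col"),
])
manager = SelectInputsManager(inputs)

manager.unselect_field("tmp_col")   # mark tmp_col to be dropped
manager.has_drop_cols()             # True
manager.get_rename_table()          # {"customer_id": "id"} under the assumed defaults
manager.get_select_cols()           # original names still kept (plus join keys)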
SortByInput pydantic-model

Bases: BaseModel

Defines a single sort condition on a column, including the direction.

Show JSON schema:
{
  "description": "Defines a single sort condition on a column, including the direction.",
  "properties": {
    "column": {
      "title": "Column",
      "type": "string"
    },
    "how": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "asc",
      "title": "How"
    }
  },
  "required": [
    "column"
  ],
  "title": "SortByInput",
  "type": "object"
}

Fields:

  • column (str)
  • how (str | None)
Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 956-960)
class SortByInput(BaseModel):
    """Defines a single sort condition on a column, including the direction."""

    column: str
    how: str | None = "asc"
TextToRowsInput pydantic-model

Bases: BaseModel

Defines settings for splitting a text column into multiple rows based on a delimiter.

Show JSON schema:
{
  "description": "Defines settings for splitting a text column into multiple rows based on a delimiter.",
  "properties": {
    "column_to_split": {
      "title": "Column To Split",
      "type": "string"
    },
    "output_column_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Output Column Name"
    },
    "split_by_fixed_value": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Split By Fixed Value"
    },
    "split_fixed_value": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": ",",
      "title": "Split Fixed Value"
    },
    "split_by_column": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Split By Column"
    }
  },
  "required": [
    "column_to_split"
  ],
  "title": "TextToRowsInput",
  "type": "object"
}

Fields:

  • column_to_split (str)
  • output_column_name (str | None)
  • split_by_fixed_value (bool | None)
  • split_fixed_value (str | None)
  • split_by_column (str | None)
Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 972-979)
class TextToRowsInput(BaseModel):
    """Defines settings for splitting a text column into multiple rows based on a delimiter."""

    column_to_split: str
    output_column_name: str | None = None
    split_by_fixed_value: bool | None = True
    split_fixed_value: str | None = ","
    split_by_column: str | None = None
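For example, a configuration that splits a hypothetical "tags" column on a semicolon and writes the pieces to a new "tag" column could look like this (field names taken from the model above):

from flowfile_core.schemas.transform_schema import TextToRowsInput

settings = TextToRowsInput(
    column_to_split="tags",
    split_fixed_value=";",
    output_column_name="tag",
)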
UnionInput pydantic-model

Bases: BaseModel

Defines settings for a union (concatenation) operation.

Show JSON schema:
{
  "description": "Defines settings for a union (concatenation) operation.",
  "properties": {
    "mode": {
      "default": "relaxed",
      "enum": [
        "selective",
        "relaxed"
      ],
      "title": "Mode",
      "type": "string"
    }
  },
  "title": "UnionInput",
  "type": "object"
}

Fields:

  • mode (Literal['selective', 'relaxed'])
Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1006-1009)
class UnionInput(BaseModel):
    """Defines settings for a union (concatenation) operation."""

    mode: Literal["selective", "relaxed"] = "relaxed"
UniqueInput pydantic-model

Bases: BaseModel

Defines settings for a uniqueness operation, specifying columns and which row to keep.

Show JSON schema:
{
  "description": "Defines settings for a uniqueness operation, specifying columns and which row to keep.",
  "properties": {
    "columns": {
      "anyOf": [
        {
          "items": {
            "type": "string"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Columns"
    },
    "strategy": {
      "default": "any",
      "enum": [
        "first",
        "last",
        "any",
        "none"
      ],
      "title": "Strategy",
      "type": "string"
    }
  },
  "title": "UniqueInput",
  "type": "object"
}

Fields:

  • columns (list[str] | None)
  • strategy (Literal['first', 'last', 'any', 'none'])
Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 1012-1016)
class UniqueInput(BaseModel):
    """Defines settings for a uniqueness operation, specifying columns and which row to keep."""

    columns: list[str] | None = None
    strategy: Literal["first", "last", "any", "none"] = "any"
UnpivotInput pydantic-model

Bases: BaseModel

Defines settings for an unpivot (wide-to-long) operation.

Show JSON schema:
{
  "description": "Defines settings for an unpivot (wide-to-long) operation.",
  "properties": {
    "index_columns": {
      "items": {
        "type": "string"
      },
      "title": "Index Columns",
      "type": "array"
    },
    "value_columns": {
      "items": {
        "type": "string"
      },
      "title": "Value Columns",
      "type": "array"
    },
    "data_type_selector": {
      "anyOf": [
        {
          "enum": [
            "float",
            "all",
            "date",
            "numeric",
            "string"
          ],
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Data Type Selector"
    },
    "data_type_selector_mode": {
      "default": "column",
      "enum": [
        "data_type",
        "column"
      ],
      "title": "Data Type Selector Mode",
      "type": "string"
    }
  },
  "title": "UnpivotInput",
  "type": "object"
}

Config:

  • arbitrary_types_allowed: True

Fields:

  • index_columns (list[str])
  • value_columns (list[str])
  • data_type_selector (Literal['float', 'all', 'date', 'numeric', 'string'] | None)
  • data_type_selector_mode (Literal['data_type', 'column'])
Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 982-1003)
class UnpivotInput(BaseModel):
    """Defines settings for an unpivot (wide-to-long) operation."""

    model_config = ConfigDict(arbitrary_types_allowed=True)

    index_columns: list[str] = Field(default_factory=list)
    value_columns: list[str] = Field(default_factory=list)
    data_type_selector: Literal["float", "all", "date", "numeric", "string"] | None = None
    data_type_selector_mode: Literal["data_type", "column"] = "column"

    @property
    def data_type_selector_expr(self) -> Callable | None:
        """Returns a Polars selector function based on the `data_type_selector` string."""
        if self.data_type_selector_mode == "data_type":
            if self.data_type_selector is not None:
                try:
                    return getattr(selectors, self.data_type_selector)
                except Exception:
                    print(f"Could not find the selector: {self.data_type_selector}")
                    return selectors.all
            return selectors.all
        return None
data_type_selector_expr property

Returns a Polars selector function based on the data_type_selector string.
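A small sketch of how the selector resolution behaves (assuming polars.selectors is available, as used in the schema module):

from flowfile_core.schemas.transform_schema import UnpivotInput

settings = UnpivotInput(
    index_columns=["id"],
    data_type_selector="numeric",
    data_type_selector_mode="data_type",
)
settings.data_type_selector_expr                            # polars.selectors.numeric
UnpivotInput(index_columns=["id"]).data_type_selector_expr  # None (mode defaults to "column")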

construct_join_key_name(side, column_name)

Creates a temporary, unique name for a join key column.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 127-129)
def construct_join_key_name(side: SideLit, column_name: str) -> str:
    """Creates a temporary, unique name for a join key column."""
    return "_FLOWFILE_JOIN_KEY_" + side.upper() + "_" + column_name
get_func_type_mapping(func)

Infers the output data type of common aggregation functions.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 105-114)
def get_func_type_mapping(func: str):
    """Infers the output data type of common aggregation functions."""
    if func in ["mean", "avg", "median", "std", "var"]:
        return "Float64"
    elif func in ["min", "max", "first", "last", "cumsum", "sum"]:
        return None
    elif func in ["count", "n_unique"]:
        return "Int64"
    elif func in ["concat"]:
        return "Utf8"
string_concat(*column)

A simple wrapper to concatenate string columns in Polars.

Source code in flowfile_core/flowfile_core/schemas/transform_schema.py (lines 117-119)
def string_concat(*column: str):
    """A simple wrapper to concatenate string columns in Polars."""
    return pl.col(column).cast(pl.Utf8).str.concat(delimiter=",")

cloud_storage_schemas

flowfile_core.schemas.cloud_storage_schemas

Cloud storage connection schemas for S3, ADLS, and other cloud providers.

Classes:

Name Description
AuthSettingsInput

The information the user must provide to specify how to connect to the cloud provider.

CloudStorageReadSettings

Settings for reading from cloud storage

CloudStorageSettings

Settings for cloud storage nodes in the visual designer

CloudStorageWriteSettings

Settings for writing to cloud storage

CloudStorageWriteSettingsWorkerInterface

Settings for writing to cloud storage in worker context

FullCloudStorageConnection

Internal model with decrypted secrets

FullCloudStorageConnectionInterface

API response model - no secrets exposed

FullCloudStorageConnectionWorkerInterface

Internal model with decrypted secrets

WriteSettingsWorkerInterface

Settings for writing to cloud storage

Functions:

Name Description
encrypt_for_worker

Encrypts a secret value for use in worker contexts using per-user key derivation.

get_cloud_storage_write_settings_worker_interface

Convert to a worker interface model with encrypted secrets.

AuthSettingsInput pydantic-model

Bases: BaseModel

The information needed for the user to provide the details that are needed to provide how to connect to the Cloud provider

Show JSON schema:
{
  "description": "The information needed for the user to provide the details that are needed to provide how to connect to the\n Cloud provider",
  "properties": {
    "storage_type": {
      "enum": [
        "s3",
        "adls",
        "gcs"
      ],
      "title": "Storage Type",
      "type": "string"
    },
    "auth_method": {
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Method",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "None",
      "title": "Connection Name"
    }
  },
  "required": [
    "storage_type",
    "auth_method"
  ],
  "title": "AuthSettingsInput",
  "type": "object"
}

Fields:

  • storage_type (CloudStorageType)
  • auth_method (AuthMethod)
  • connection_name (str | None)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py (lines 33-41)
class AuthSettingsInput(BaseModel):
    """
    The information needed for the user to provide the details that are needed to provide how to connect to the
     Cloud provider
    """

    storage_type: CloudStorageType
    auth_method: AuthMethod
    connection_name: str | None = "None"  # This is the reference to the item we will fetch that contains the data
CloudStorageReadSettings pydantic-model

Bases: CloudStorageSettings

Settings for reading from cloud storage

Show JSON schema:
{
  "description": "Settings for reading from cloud storage",
  "properties": {
    "auth_mode": {
      "default": "auto",
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Mode",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Connection Name"
    },
    "resource_path": {
      "title": "Resource Path",
      "type": "string"
    },
    "scan_mode": {
      "default": "single_file",
      "enum": [
        "single_file",
        "directory"
      ],
      "title": "Scan Mode",
      "type": "string"
    },
    "file_format": {
      "default": "parquet",
      "enum": [
        "csv",
        "parquet",
        "json",
        "delta",
        "iceberg"
      ],
      "title": "File Format",
      "type": "string"
    },
    "csv_has_header": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": true,
      "title": "Csv Has Header"
    },
    "csv_delimiter": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": ",",
      "title": "Csv Delimiter"
    },
    "csv_encoding": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "utf8",
      "title": "Csv Encoding"
    },
    "delta_version": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Delta Version"
    }
  },
  "required": [
    "resource_path"
  ],
  "title": "CloudStorageReadSettings",
  "type": "object"
}

Fields:

  • auth_mode (AuthMethod)
  • connection_name (str | None)
  • resource_path (str)
  • scan_mode (Literal['single_file', 'directory'])
  • file_format (Literal['csv', 'parquet', 'json', 'delta', 'iceberg'])
  • csv_has_header (bool | None)
  • csv_delimiter (str | None)
  • csv_encoding (str | None)
  • delta_version (int | None)

Validators:

  • validate_auth_requirements → auth_mode
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py (lines 149-157)
class CloudStorageReadSettings(CloudStorageSettings):
    """Settings for reading from cloud storage"""

    scan_mode: Literal["single_file", "directory"] = "single_file"
    file_format: Literal["csv", "parquet", "json", "delta", "iceberg"] = "parquet"
    csv_has_header: bool | None = True
    csv_delimiter: str | None = ","
    csv_encoding: str | None = "utf8"
    delta_version: int | None = None
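A hedged example of configuring a CSV read from S3 (the bucket path and connection name are hypothetical; field names come from the model above):

from flowfile_core.schemas.cloud_storage_schemas import CloudStorageReadSettings

settings = CloudStorageReadSettings(
    resource_path="s3://my-bucket/data/sales.csv",
    connection_name="my-s3-connection",
    file_format="csv",
    csv_delimiter=";",
)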
CloudStorageSettings pydantic-model

Bases: BaseModel

Settings for cloud storage nodes in the visual designer

Show JSON schema:
{
  "description": "Settings for cloud storage nodes in the visual designer",
  "properties": {
    "auth_mode": {
      "default": "auto",
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Mode",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Connection Name"
    },
    "resource_path": {
      "title": "Resource Path",
      "type": "string"
    }
  },
  "required": [
    "resource_path"
  ],
  "title": "CloudStorageSettings",
  "type": "object"
}

Fields:

  • auth_mode (AuthMethod)
  • connection_name (str | None)
  • resource_path (str)

Validators:

  • validate_auth_requirements → auth_mode
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py (lines 134-146)
class CloudStorageSettings(BaseModel):
    """Settings for cloud storage nodes in the visual designer"""

    auth_mode: AuthMethod = "auto"
    connection_name: str | None = None  # Required only for 'reference' mode
    resource_path: str  # s3://bucket/path/to/file.csv

    @field_validator("auth_mode", mode="after")
    def validate_auth_requirements(cls, v, values):
        data = values.data
        if v == "reference" and not data.get("connection_name"):
            raise ValueError("connection_name required when using reference mode")
        return v
CloudStorageWriteSettings pydantic-model

Bases: CloudStorageSettings, WriteSettingsWorkerInterface

Settings for writing to cloud storage

Show JSON schema:
{
  "description": "Settings for writing to cloud storage",
  "properties": {
    "resource_path": {
      "title": "Resource Path",
      "type": "string"
    },
    "write_mode": {
      "default": "overwrite",
      "enum": [
        "overwrite",
        "append"
      ],
      "title": "Write Mode",
      "type": "string"
    },
    "file_format": {
      "default": "parquet",
      "enum": [
        "csv",
        "parquet",
        "json",
        "delta"
      ],
      "title": "File Format",
      "type": "string"
    },
    "parquet_compression": {
      "default": "snappy",
      "enum": [
        "snappy",
        "gzip",
        "brotli",
        "lz4",
        "zstd"
      ],
      "title": "Parquet Compression",
      "type": "string"
    },
    "csv_delimiter": {
      "default": ",",
      "title": "Csv Delimiter",
      "type": "string"
    },
    "csv_encoding": {
      "default": "utf8",
      "title": "Csv Encoding",
      "type": "string"
    },
    "auth_mode": {
      "default": "auto",
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Mode",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Connection Name"
    }
  },
  "required": [
    "resource_path"
  ],
  "title": "CloudStorageWriteSettings",
  "type": "object"
}

Fields:

  • resource_path (str)
  • write_mode (Literal['overwrite', 'append'])
  • file_format (Literal['csv', 'parquet', 'json', 'delta'])
  • parquet_compression (Literal['snappy', 'gzip', 'brotli', 'lz4', 'zstd'])
  • csv_delimiter (str)
  • csv_encoding (str)
  • auth_mode (AuthMethod)
  • connection_name (str | None)

Validators:

  • validate_auth_requirements → auth_mode
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py (lines 179-195)
class CloudStorageWriteSettings(CloudStorageSettings, WriteSettingsWorkerInterface):
    """Settings for writing to cloud storage"""

    pass

    def get_write_setting_worker_interface(self) -> WriteSettingsWorkerInterface:
        """
        Convert to a worker interface model without secrets.
        """
        return WriteSettingsWorkerInterface(
            resource_path=self.resource_path,
            write_mode=self.write_mode,
            file_format=self.file_format,
            parquet_compression=self.parquet_compression,
            csv_delimiter=self.csv_delimiter,
            csv_encoding=self.csv_encoding,
        )
get_write_setting_worker_interface()

Convert to a worker interface model without secrets.

Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py (lines 184-195)
def get_write_setting_worker_interface(self) -> WriteSettingsWorkerInterface:
    """
    Convert to a worker interface model without secrets.
    """
    return WriteSettingsWorkerInterface(
        resource_path=self.resource_path,
        write_mode=self.write_mode,
        file_format=self.file_format,
        parquet_compression=self.parquet_compression,
        csv_delimiter=self.csv_delimiter,
        csv_encoding=self.csv_encoding,
    )
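A minimal sketch (the resource path is hypothetical): the returned object carries only path and format options, so it can be handed to a worker without exposing credentials.

from flowfile_core.schemas.cloud_storage_schemas import CloudStorageWriteSettings

write_settings = CloudStorageWriteSettings(
    resource_path="s3://my-bucket/output/result.parquet",
    write_mode="append",
)
worker_settings = write_settings.get_write_setting_worker_interface()
worker_settings.file_format   # 'parquet' (default)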
CloudStorageWriteSettingsWorkerInterface pydantic-model

Bases: BaseModel

Settings for writing to cloud storage in worker context

Show JSON schema:
{
  "$defs": {
    "FullCloudStorageConnectionWorkerInterface": {
      "description": "Internal model with decrypted secrets",
      "properties": {
        "storage_type": {
          "enum": [
            "s3",
            "adls",
            "gcs"
          ],
          "title": "Storage Type",
          "type": "string"
        },
        "auth_method": {
          "enum": [
            "access_key",
            "iam_role",
            "service_principal",
            "managed_identity",
            "sas_token",
            "aws-cli",
            "env_vars"
          ],
          "title": "Auth Method",
          "type": "string"
        },
        "connection_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": "None",
          "title": "Connection Name"
        },
        "aws_region": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Region"
        },
        "aws_access_key_id": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Access Key Id"
        },
        "aws_secret_access_key": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Secret Access Key"
        },
        "aws_role_arn": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Role Arn"
        },
        "aws_allow_unsafe_html": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Allow Unsafe Html"
        },
        "aws_session_token": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Aws Session Token"
        },
        "azure_account_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Azure Account Name"
        },
        "azure_account_key": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Azure Account Key"
        },
        "azure_tenant_id": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Azure Tenant Id"
        },
        "azure_client_id": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Azure Client Id"
        },
        "azure_client_secret": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Azure Client Secret"
        },
        "endpoint_url": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Endpoint Url"
        },
        "verify_ssl": {
          "default": true,
          "title": "Verify Ssl",
          "type": "boolean"
        }
      },
      "required": [
        "storage_type",
        "auth_method"
      ],
      "title": "FullCloudStorageConnectionWorkerInterface",
      "type": "object"
    },
    "WriteSettingsWorkerInterface": {
      "description": "Settings for writing to cloud storage",
      "properties": {
        "resource_path": {
          "title": "Resource Path",
          "type": "string"
        },
        "write_mode": {
          "default": "overwrite",
          "enum": [
            "overwrite",
            "append"
          ],
          "title": "Write Mode",
          "type": "string"
        },
        "file_format": {
          "default": "parquet",
          "enum": [
            "csv",
            "parquet",
            "json",
            "delta"
          ],
          "title": "File Format",
          "type": "string"
        },
        "parquet_compression": {
          "default": "snappy",
          "enum": [
            "snappy",
            "gzip",
            "brotli",
            "lz4",
            "zstd"
          ],
          "title": "Parquet Compression",
          "type": "string"
        },
        "csv_delimiter": {
          "default": ",",
          "title": "Csv Delimiter",
          "type": "string"
        },
        "csv_encoding": {
          "default": "utf8",
          "title": "Csv Encoding",
          "type": "string"
        }
      },
      "required": [
        "resource_path"
      ],
      "title": "WriteSettingsWorkerInterface",
      "type": "object"
    }
  },
  "description": "Settings for writing to cloud storage in worker context",
  "properties": {
    "operation": {
      "title": "Operation",
      "type": "string"
    },
    "write_settings": {
      "$ref": "#/$defs/WriteSettingsWorkerInterface"
    },
    "connection": {
      "$ref": "#/$defs/FullCloudStorageConnectionWorkerInterface"
    },
    "flowfile_flow_id": {
      "default": 1,
      "title": "Flowfile Flow Id",
      "type": "integer"
    },
    "flowfile_node_id": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "string"
        }
      ],
      "default": -1,
      "title": "Flowfile Node Id"
    }
  },
  "required": [
    "operation",
    "write_settings",
    "connection"
  ],
  "title": "CloudStorageWriteSettingsWorkerInterface",
  "type": "object"
}

Fields:

  • operation (str)
  • write_settings (WriteSettingsWorkerInterface)
  • connection (FullCloudStorageConnectionWorkerInterface)
  • flowfile_flow_id (int)
  • flowfile_node_id (int | str)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py (lines 203-210)
class CloudStorageWriteSettingsWorkerInterface(BaseModel):
    """Settings for writing to cloud storage in worker context"""

    operation: str
    write_settings: WriteSettingsWorkerInterface
    connection: FullCloudStorageConnectionWorkerInterface
    flowfile_flow_id: int = 1
    flowfile_node_id: int | str = -1
FullCloudStorageConnection pydantic-model

Bases: AuthSettingsInput

Internal model with decrypted secrets

Show JSON schema:
{
  "description": "Internal model with decrypted secrets",
  "properties": {
    "storage_type": {
      "enum": [
        "s3",
        "adls",
        "gcs"
      ],
      "title": "Storage Type",
      "type": "string"
    },
    "auth_method": {
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Method",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "None",
      "title": "Connection Name"
    },
    "aws_region": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Region"
    },
    "aws_access_key_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Access Key Id"
    },
    "aws_secret_access_key": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Secret Access Key"
    },
    "aws_role_arn": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Role Arn"
    },
    "aws_allow_unsafe_html": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Allow Unsafe Html"
    },
    "aws_session_token": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Session Token"
    },
    "azure_account_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Account Name"
    },
    "azure_account_key": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Account Key"
    },
    "azure_tenant_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Tenant Id"
    },
    "azure_client_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Client Id"
    },
    "azure_client_secret": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Client Secret"
    },
    "endpoint_url": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Endpoint Url"
    },
    "verify_ssl": {
      "default": true,
      "title": "Verify Ssl",
      "type": "boolean"
    }
  },
  "required": [
    "storage_type",
    "auth_method"
  ],
  "title": "FullCloudStorageConnection",
  "type": "object"
}

Fields:

  • storage_type (CloudStorageType)
  • auth_method (AuthMethod)
  • connection_name (str | None)
  • aws_region (str | None)
  • aws_access_key_id (str | None)
  • aws_secret_access_key (SecretStr | None)
  • aws_role_arn (str | None)
  • aws_allow_unsafe_html (bool | None)
  • aws_session_token (SecretStr | None)
  • azure_account_name (str | None)
  • azure_account_key (SecretStr | None)
  • azure_tenant_id (str | None)
  • azure_client_id (str | None)
  • azure_client_secret (SecretStr | None)
  • endpoint_url (str | None)
  • verify_ssl (bool)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py (lines 67-116)
class FullCloudStorageConnection(AuthSettingsInput):
    """Internal model with decrypted secrets"""

    # AWS S3
    aws_region: str | None = None
    aws_access_key_id: str | None = None
    aws_secret_access_key: SecretStr | None = None
    aws_role_arn: str | None = None
    aws_allow_unsafe_html: bool | None = None
    aws_session_token: SecretStr | None = None

    # Azure ADLS
    azure_account_name: str | None = None
    azure_account_key: SecretStr | None = None
    azure_tenant_id: str | None = None
    azure_client_id: str | None = None
    azure_client_secret: SecretStr | None = None

    # Common
    endpoint_url: str | None = None
    verify_ssl: bool = True

    def get_worker_interface(self, user_id: int) -> "FullCloudStorageConnectionWorkerInterface":
        """
        Convert to a worker interface model with encrypted secrets.

        Args:
            user_id: The user ID for per-user key derivation

        Returns:
            FullCloudStorageConnectionWorkerInterface with encrypted secrets
        """
        return FullCloudStorageConnectionWorkerInterface(
            storage_type=self.storage_type,
            auth_method=self.auth_method,
            connection_name=self.connection_name,
            aws_allow_unsafe_html=self.aws_allow_unsafe_html,
            aws_secret_access_key=encrypt_for_worker(self.aws_secret_access_key, user_id),
            aws_region=self.aws_region,
            aws_access_key_id=self.aws_access_key_id,
            aws_role_arn=self.aws_role_arn,
            aws_session_token=encrypt_for_worker(self.aws_session_token, user_id),
            azure_account_name=self.azure_account_name,
            azure_tenant_id=self.azure_tenant_id,
            azure_account_key=encrypt_for_worker(self.azure_account_key, user_id),
            azure_client_id=self.azure_client_id,
            azure_client_secret=encrypt_for_worker(self.azure_client_secret, user_id),
            endpoint_url=self.endpoint_url,
            verify_ssl=self.verify_ssl,
        )
get_worker_interface(user_id)

Convert to a worker interface model with encrypted secrets.

Parameters:

Name Type Description Default
user_id int

The user ID for per-user key derivation

required

Returns:

Type Description
FullCloudStorageConnectionWorkerInterface

FullCloudStorageConnectionWorkerInterface with encrypted secrets

Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py (lines 89-116)
def get_worker_interface(self, user_id: int) -> "FullCloudStorageConnectionWorkerInterface":
    """
    Convert to a worker interface model with encrypted secrets.

    Args:
        user_id: The user ID for per-user key derivation

    Returns:
        FullCloudStorageConnectionWorkerInterface with encrypted secrets
    """
    return FullCloudStorageConnectionWorkerInterface(
        storage_type=self.storage_type,
        auth_method=self.auth_method,
        connection_name=self.connection_name,
        aws_allow_unsafe_html=self.aws_allow_unsafe_html,
        aws_secret_access_key=encrypt_for_worker(self.aws_secret_access_key, user_id),
        aws_region=self.aws_region,
        aws_access_key_id=self.aws_access_key_id,
        aws_role_arn=self.aws_role_arn,
        aws_session_token=encrypt_for_worker(self.aws_session_token, user_id),
        azure_account_name=self.azure_account_name,
        azure_tenant_id=self.azure_tenant_id,
        azure_account_key=encrypt_for_worker(self.azure_account_key, user_id),
        azure_client_id=self.azure_client_id,
        azure_client_secret=encrypt_for_worker(self.azure_client_secret, user_id),
        endpoint_url=self.endpoint_url,
        verify_ssl=self.verify_ssl,
    )
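A hedged construction example (all credential values are placeholders; calling get_worker_interface additionally requires the core's secret-encryption setup, since it routes the SecretStr fields through encrypt_for_worker):

from pydantic import SecretStr
from flowfile_core.schemas.cloud_storage_schemas import FullCloudStorageConnection

conn = FullCloudStorageConnection(
    storage_type="s3",
    auth_method="access_key",
    connection_name="my-s3-connection",
    aws_region="eu-west-1",
    aws_access_key_id="AKIA...",               # placeholder
    aws_secret_access_key=SecretStr("..."),    # placeholder, kept as SecretStr in memory
)
# conn.get_worker_interface(user_id=1) returns a FullCloudStorageConnectionWorkerInterface
# whose secret fields have been encrypted for that user.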
FullCloudStorageConnectionInterface pydantic-model

Bases: AuthSettingsInput

API response model - no secrets exposed

Show JSON schema:
{
  "description": "API response model - no secrets exposed",
  "properties": {
    "storage_type": {
      "enum": [
        "s3",
        "adls",
        "gcs"
      ],
      "title": "Storage Type",
      "type": "string"
    },
    "auth_method": {
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Method",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "None",
      "title": "Connection Name"
    },
    "aws_allow_unsafe_html": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Allow Unsafe Html"
    },
    "aws_region": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Region"
    },
    "aws_access_key_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Access Key Id"
    },
    "aws_role_arn": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Role Arn"
    },
    "azure_account_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Account Name"
    },
    "azure_tenant_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Tenant Id"
    },
    "azure_client_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Client Id"
    },
    "endpoint_url": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Endpoint Url"
    },
    "verify_ssl": {
      "default": true,
      "title": "Verify Ssl",
      "type": "boolean"
    }
  },
  "required": [
    "storage_type",
    "auth_method"
  ],
  "title": "FullCloudStorageConnectionInterface",
  "type": "object"
}

Fields:

  • storage_type (CloudStorageType)
  • auth_method (AuthMethod)
  • connection_name (str | None)
  • aws_allow_unsafe_html (bool | None)
  • aws_region (str | None)
  • aws_access_key_id (str | None)
  • aws_role_arn (str | None)
  • azure_account_name (str | None)
  • azure_tenant_id (str | None)
  • azure_client_id (str | None)
  • endpoint_url (str | None)
  • verify_ssl (bool)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py (lines 119-131)
class FullCloudStorageConnectionInterface(AuthSettingsInput):
    """API response model - no secrets exposed"""

    # Public fields only
    aws_allow_unsafe_html: bool | None = None
    aws_region: str | None = None
    aws_access_key_id: str | None = None
    aws_role_arn: str | None = None
    azure_account_name: str | None = None
    azure_tenant_id: str | None = None
    azure_client_id: str | None = None
    endpoint_url: str | None = None
    verify_ssl: bool = True
FullCloudStorageConnectionWorkerInterface pydantic-model

Bases: AuthSettingsInput

Internal model with decrypted secrets

Show JSON schema:
{
  "description": "Internal model with decrypted secrets",
  "properties": {
    "storage_type": {
      "enum": [
        "s3",
        "adls",
        "gcs"
      ],
      "title": "Storage Type",
      "type": "string"
    },
    "auth_method": {
      "enum": [
        "access_key",
        "iam_role",
        "service_principal",
        "managed_identity",
        "sas_token",
        "aws-cli",
        "env_vars"
      ],
      "title": "Auth Method",
      "type": "string"
    },
    "connection_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": "None",
      "title": "Connection Name"
    },
    "aws_region": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Region"
    },
    "aws_access_key_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Access Key Id"
    },
    "aws_secret_access_key": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Secret Access Key"
    },
    "aws_role_arn": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Role Arn"
    },
    "aws_allow_unsafe_html": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Allow Unsafe Html"
    },
    "aws_session_token": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Aws Session Token"
    },
    "azure_account_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Account Name"
    },
    "azure_account_key": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Account Key"
    },
    "azure_tenant_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Tenant Id"
    },
    "azure_client_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Client Id"
    },
    "azure_client_secret": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Azure Client Secret"
    },
    "endpoint_url": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Endpoint Url"
    },
    "verify_ssl": {
      "default": true,
      "title": "Verify Ssl",
      "type": "boolean"
    }
  },
  "required": [
    "storage_type",
    "auth_method"
  ],
  "title": "FullCloudStorageConnectionWorkerInterface",
  "type": "object"
}

Fields:

  • storage_type (CloudStorageType)
  • auth_method (AuthMethod)
  • connection_name (str | None)
  • aws_region (str | None)
  • aws_access_key_id (str | None)
  • aws_secret_access_key (str | None)
  • aws_role_arn (str | None)
  • aws_allow_unsafe_html (bool | None)
  • aws_session_token (str | None)
  • azure_account_name (str | None)
  • azure_account_key (str | None)
  • azure_tenant_id (str | None)
  • azure_client_id (str | None)
  • azure_client_secret (str | None)
  • endpoint_url (str | None)
  • verify_ssl (bool)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py (lines 44-64)
class FullCloudStorageConnectionWorkerInterface(AuthSettingsInput):
    """Internal model with decrypted secrets"""

    # AWS S3
    aws_region: str | None = None
    aws_access_key_id: str | None = None
    aws_secret_access_key: str | None = None
    aws_role_arn: str | None = None
    aws_allow_unsafe_html: bool | None = None
    aws_session_token: str | None = None

    # Azure ADLS
    azure_account_name: str | None = None
    azure_account_key: str | None = None
    azure_tenant_id: str | None = None
    azure_client_id: str | None = None
    azure_client_secret: str | None = None

    # Common
    endpoint_url: str | None = None
    verify_ssl: bool = True
WriteSettingsWorkerInterface pydantic-model

Bases: BaseModel

Settings for writing to cloud storage

Show JSON schema:
{
  "description": "Settings for writing to cloud storage",
  "properties": {
    "resource_path": {
      "title": "Resource Path",
      "type": "string"
    },
    "write_mode": {
      "default": "overwrite",
      "enum": [
        "overwrite",
        "append"
      ],
      "title": "Write Mode",
      "type": "string"
    },
    "file_format": {
      "default": "parquet",
      "enum": [
        "csv",
        "parquet",
        "json",
        "delta"
      ],
      "title": "File Format",
      "type": "string"
    },
    "parquet_compression": {
      "default": "snappy",
      "enum": [
        "snappy",
        "gzip",
        "brotli",
        "lz4",
        "zstd"
      ],
      "title": "Parquet Compression",
      "type": "string"
    },
    "csv_delimiter": {
      "default": ",",
      "title": "Csv Delimiter",
      "type": "string"
    },
    "csv_encoding": {
      "default": "utf8",
      "title": "Csv Encoding",
      "type": "string"
    }
  },
  "required": [
    "resource_path"
  ],
  "title": "WriteSettingsWorkerInterface",
  "type": "object"
}

Fields:

  • resource_path (str)
  • write_mode (Literal['overwrite', 'append'])
  • file_format (Literal['csv', 'parquet', 'json', 'delta'])
  • parquet_compression (Literal['snappy', 'gzip', 'brotli', 'lz4', 'zstd'])
  • csv_delimiter (str)
  • csv_encoding (str)
Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py (lines 165-176)
class WriteSettingsWorkerInterface(BaseModel):
    """Settings for writing to cloud storage"""

    resource_path: str  # s3://bucket/path/to/file.csv

    write_mode: Literal["overwrite", "append"] = "overwrite"
    file_format: Literal["csv", "parquet", "json", "delta"] = "parquet"

    parquet_compression: Literal["snappy", "gzip", "brotli", "lz4", "zstd"] = "snappy"

    csv_delimiter: str = ","
    csv_encoding: str = "utf8"
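
Example (a short usage sketch): the literal values below come straight from the fields documented above, so constructing CSV write settings looks like this; the bucket and path are illustrative.

from flowfile_core.schemas.cloud_storage_schemas import WriteSettingsWorkerInterface

settings = WriteSettingsWorkerInterface(
    resource_path="s3://my-bucket/exports/output.csv",  # illustrative location
    write_mode="append",
    file_format="csv",
    csv_delimiter=";",
)
print(settings.parquet_compression)  # "snappy" (default, unused for CSV output)
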
encrypt_for_worker(secret_value, user_id)

Encrypts a secret value for use in worker contexts using per-user key derivation.

Parameters:

  • secret_value (SecretStr | None): The secret value to encrypt. Required.
  • user_id (int): The user ID for key derivation. Required.

Returns:

  • str | None: Encrypted secret with embedded user_id, or None if secret_value is None.

Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py, lines 17-30
def encrypt_for_worker(secret_value: SecretStr | None, user_id: int) -> str | None:
    """
    Encrypts a secret value for use in worker contexts using per-user key derivation.

    Args:
        secret_value: The secret value to encrypt
        user_id: The user ID for key derivation

    Returns:
        Encrypted secret with embedded user_id, or None if secret_value is None
    """
    if secret_value is not None:
        return encrypt_secret(secret_value.get_secret_value(), user_id)
    return None
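
Example (a small sketch of calling the helper directly): SecretStr comes from Pydantic, the user id is illustrative, and the returned ciphertext format is whatever encrypt_secret produces.

from pydantic import SecretStr

from flowfile_core.schemas.cloud_storage_schemas import encrypt_for_worker

token = encrypt_for_worker(SecretStr("super-secret-key"), user_id=42)
assert encrypt_for_worker(None, user_id=42) is None  # None passes through unchanged
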
get_cloud_storage_write_settings_worker_interface(write_settings, connection, lf, user_id, flowfile_flow_id=1, flowfile_node_id=-1)

Convert to a worker interface model with encrypted secrets.

Parameters:

  • write_settings (CloudStorageWriteSettings): Cloud storage write settings. Required.
  • connection (FullCloudStorageConnection): Full cloud storage connection with secrets. Required.
  • lf (LazyFrame): LazyFrame to serialize. Required.
  • user_id (int): User ID for per-user key derivation. Required.
  • flowfile_flow_id (int): Flow ID for tracking. Default: 1.
  • flowfile_node_id (int | str): Node ID for tracking. Default: -1.

Returns:

  • CloudStorageWriteSettingsWorkerInterface: CloudStorageWriteSettingsWorkerInterface ready for worker.

Source code in flowfile_core/flowfile_core/schemas/cloud_storage_schemas.py, lines 213-243
def get_cloud_storage_write_settings_worker_interface(
    write_settings: CloudStorageWriteSettings,
    connection: FullCloudStorageConnection,
    lf: pl.LazyFrame,
    user_id: int,
    flowfile_flow_id: int = 1,
    flowfile_node_id: int | str = -1,
) -> CloudStorageWriteSettingsWorkerInterface:
    """
    Convert to a worker interface model with encrypted secrets.

    Args:
        write_settings: Cloud storage write settings
        connection: Full cloud storage connection with secrets
        lf: LazyFrame to serialize
        user_id: User ID for per-user key derivation
        flowfile_flow_id: Flow ID for tracking
        flowfile_node_id: Node ID for tracking

    Returns:
        CloudStorageWriteSettingsWorkerInterface ready for worker
    """
    operation = base64.b64encode(lf.serialize()).decode()

    return CloudStorageWriteSettingsWorkerInterface(
        operation=operation,
        write_settings=write_settings.get_write_setting_worker_interface(),
        connection=connection.get_worker_interface(user_id),
        flowfile_flow_id=flowfile_flow_id,
        flowfile_node_id=flowfile_node_id,
    )
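
The interesting part of this helper is how the LazyFrame travels to the worker: the query plan is serialized and base64-encoded, as in the sketch below. The encode step mirrors the source above; the decode step is an assumption about the worker side and may depend on the installed Polars version.

import base64
import io

import polars as pl

lf = pl.DataFrame({"id": [1, 2, 3]}).lazy()

# What the helper embeds in the worker payload:
operation = base64.b64encode(lf.serialize()).decode()

# How a worker could rebuild the plan from that string (assumed, version dependent):
restored = pl.LazyFrame.deserialize(io.BytesIO(base64.b64decode(operation)))
print(restored.collect())
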

output_model

flowfile_core.schemas.output_model

Classes:

Name Description
BaseItem

A base model for any item in a file system, like a file or directory.

ExpressionRef

A reference to a single Polars expression, including its name and docstring.

ExpressionsOverview

Represents a categorized list of available Polars expressions.

FileColumn

Represents detailed schema and statistics for a single column (field).

InstantFuncResult

Represents the result of a function that is expected to execute instantly.

ItemInfo

Provides detailed information about a single item in an output directory.

NodeData

A comprehensive model holding the complete state and data for a single node.

NodeResult

Represents the execution result of a single node in a FlowGraph run.

OutputDir

Represents the contents of a single output directory.

OutputFile

Represents a single file in an output directory, extending BaseItem.

OutputFiles

Represents a collection of files, typically within a directory.

OutputTree

Represents a directory tree, including subdirectories.

RunInformation

Contains summary information about a complete FlowGraph execution.

TableExample

Represents a preview of a table, including schema and sample data.

BaseItem pydantic-model

Bases: BaseModel

A base model for any item in a file system, like a file or directory.

Show JSON schema:
{
  "description": "A base model for any item in a file system, like a file or directory.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    }
  },
  "required": [
    "name",
    "path"
  ],
  "title": "BaseItem",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (int | None)
  • creation_date (datetime | None)
  • access_date (datetime | None)
  • modification_date (datetime | None)
  • source_path (str | None)
  • number_of_items (int)
Source code in flowfile_core/flowfile_core/schemas/output_model.py, lines 34-44
class BaseItem(BaseModel):
    """A base model for any item in a file system, like a file or directory."""

    name: str
    path: str
    size: int | None = None
    creation_date: datetime | None = None
    access_date: datetime | None = None
    modification_date: datetime | None = None
    source_path: str | None = None
    number_of_items: int = -1
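
Example (a runnable sketch of filling this model from a real file using pathlib metadata; the file name is illustrative):

from datetime import datetime
from pathlib import Path

from flowfile_core.schemas.output_model import BaseItem

path = Path("example.txt")
path.write_text("hello")
stat = path.stat()

item = BaseItem(
    name=path.name,
    path=str(path.resolve()),
    size=stat.st_size,
    modification_date=datetime.fromtimestamp(stat.st_mtime),
)
print(item.number_of_items)  # -1 until a directory item count is filled in
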
ExpressionRef pydantic-model

Bases: BaseModel

A reference to a single Polars expression, including its name and docstring.

Show JSON schema:
{
  "description": "A reference to a single Polars expression, including its name and docstring.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "doc": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "title": "Doc"
    }
  },
  "required": [
    "name",
    "doc"
  ],
  "title": "ExpressionRef",
  "type": "object"
}

Fields:

  • name (str)
  • doc (str | None)
Source code in flowfile_core/flowfile_core/schemas/output_model.py, lines 131-135
class ExpressionRef(BaseModel):
    """A reference to a single Polars expression, including its name and docstring."""

    name: str
    doc: str | None
ExpressionsOverview pydantic-model

Bases: BaseModel

Represents a categorized list of available Polars expressions.

Show JSON schema:
{
  "$defs": {
    "ExpressionRef": {
      "description": "A reference to a single Polars expression, including its name and docstring.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "doc": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "title": "Doc"
        }
      },
      "required": [
        "name",
        "doc"
      ],
      "title": "ExpressionRef",
      "type": "object"
    }
  },
  "description": "Represents a categorized list of available Polars expressions.",
  "properties": {
    "expression_type": {
      "title": "Expression Type",
      "type": "string"
    },
    "expressions": {
      "items": {
        "$ref": "#/$defs/ExpressionRef"
      },
      "title": "Expressions",
      "type": "array"
    }
  },
  "required": [
    "expression_type",
    "expressions"
  ],
  "title": "ExpressionsOverview",
  "type": "object"
}

Fields:

  • expression_type (str)
  • expressions (list[ExpressionRef])
Source code in flowfile_core/flowfile_core/schemas/output_model.py, lines 138-142
class ExpressionsOverview(BaseModel):
    """Represents a categorized list of available Polars expressions."""

    expression_type: str
    expressions: list[ExpressionRef]
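
Example (a brief sketch of assembling a categorized overview from individual references; the expression names are illustrative, not an exhaustive list of what Flowfile exposes):

from flowfile_core.schemas.output_model import ExpressionRef, ExpressionsOverview

overview = ExpressionsOverview(
    expression_type="string",
    expressions=[
        ExpressionRef(name="to_uppercase", doc="Convert a string column to upper case."),
        ExpressionRef(name="strip_chars", doc=None),  # doc is required but may be null
    ],
)
print(len(overview.expressions))
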
FileColumn pydantic-model

Bases: BaseModel

Represents detailed schema and statistics for a single column (field).

Show JSON schema:
{
  "description": "Represents detailed schema and statistics for a single column (field).",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "data_type": {
      "title": "Data Type",
      "type": "string"
    },
    "is_unique": {
      "title": "Is Unique",
      "type": "boolean"
    },
    "max_value": {
      "title": "Max Value",
      "type": "string"
    },
    "min_value": {
      "title": "Min Value",
      "type": "string"
    },
    "number_of_empty_values": {
      "title": "Number Of Empty Values",
      "type": "integer"
    },
    "number_of_filled_values": {
      "title": "Number Of Filled Values",
      "type": "integer"
    },
    "number_of_unique_values": {
      "title": "Number Of Unique Values",
      "type": "integer"
    },
    "size": {
      "title": "Size",
      "type": "integer"
    }
  },
  "required": [
    "name",
    "data_type",
    "is_unique",
    "max_value",
    "min_value",
    "number_of_empty_values",
    "number_of_filled_values",
    "number_of_unique_values",
    "size"
  ],
  "title": "FileColumn",
  "type": "object"
}

Fields:

  • name (str)
  • data_type (str)
  • is_unique (bool)
  • max_value (str)
  • min_value (str)
  • number_of_empty_values (int)
  • number_of_filled_values (int)
  • number_of_unique_values (int)
  • size (int)
Source code in flowfile_core/flowfile_core/schemas/output_model.py, lines 47-58
class FileColumn(BaseModel):
    """Represents detailed schema and statistics for a single column (field)."""

    name: str
    data_type: str
    is_unique: bool
    max_value: str
    min_value: str
    number_of_empty_values: int
    number_of_filled_values: int
    number_of_unique_values: int
    size: int
InstantFuncResult pydantic-model

Bases: BaseModel

Represents the result of a function that is expected to execute instantly.

Show JSON schema:
{
  "description": "Represents the result of a function that is expected to execute instantly.",
  "properties": {
    "success": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Success"
    },
    "result": {
      "title": "Result",
      "type": "string"
    }
  },
  "required": [
    "result"
  ],
  "title": "InstantFuncResult",
  "type": "object"
}

Fields:

  • success (bool | None)
  • result (str)
Source code in flowfile_core/flowfile_core/schemas/output_model.py, lines 145-149
class InstantFuncResult(BaseModel):
    """Represents the result of a function that is expected to execute instantly."""

    success: bool | None = None
    result: str
ItemInfo pydantic-model

Bases: OutputFile

Provides detailed information about a single item in an output directory.

Show JSON schema:
{
  "description": "Provides detailed information about a single item in an output directory.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    },
    "ext": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Ext"
    },
    "mimetype": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Mimetype"
    },
    "id": {
      "default": -1,
      "title": "Id",
      "type": "integer"
    },
    "type": {
      "title": "Type",
      "type": "string"
    },
    "analysis_file_available": {
      "default": false,
      "title": "Analysis File Available",
      "type": "boolean"
    },
    "analysis_file_location": {
      "default": null,
      "title": "Analysis File Location",
      "type": "string"
    },
    "analysis_file_error": {
      "default": null,
      "title": "Analysis File Error",
      "type": "string"
    }
  },
  "required": [
    "name",
    "path",
    "type"
  ],
  "title": "ItemInfo",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (int | None)
  • creation_date (datetime | None)
  • access_date (datetime | None)
  • modification_date (datetime | None)
  • source_path (str | None)
  • number_of_items (int)
  • ext (str | None)
  • mimetype (str | None)
  • id (int)
  • type (str)
  • analysis_file_available (bool)
  • analysis_file_location (str)
  • analysis_file_error (str)
Source code in flowfile_core/flowfile_core/schemas/output_model.py, lines 114-121
class ItemInfo(OutputFile):
    """Provides detailed information about a single item in an output directory."""

    id: int = -1
    type: str
    analysis_file_available: bool = False
    analysis_file_location: str = None
    analysis_file_error: str = None
NodeData pydantic-model

Bases: BaseModel

A comprehensive model holding the complete state and data for a single node.

This includes its input/output data previews, settings, and run status.

Show JSON schema:
{
  "$defs": {
    "FileColumn": {
      "description": "Represents detailed schema and statistics for a single column (field).",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "title": "Data Type",
          "type": "string"
        },
        "is_unique": {
          "title": "Is Unique",
          "type": "boolean"
        },
        "max_value": {
          "title": "Max Value",
          "type": "string"
        },
        "min_value": {
          "title": "Min Value",
          "type": "string"
        },
        "number_of_empty_values": {
          "title": "Number Of Empty Values",
          "type": "integer"
        },
        "number_of_filled_values": {
          "title": "Number Of Filled Values",
          "type": "integer"
        },
        "number_of_unique_values": {
          "title": "Number Of Unique Values",
          "type": "integer"
        },
        "size": {
          "title": "Size",
          "type": "integer"
        }
      },
      "required": [
        "name",
        "data_type",
        "is_unique",
        "max_value",
        "min_value",
        "number_of_empty_values",
        "number_of_filled_values",
        "number_of_unique_values",
        "size"
      ],
      "title": "FileColumn",
      "type": "object"
    },
    "TableExample": {
      "description": "Represents a preview of a table, including schema and sample data.",
      "properties": {
        "node_id": {
          "title": "Node Id",
          "type": "integer"
        },
        "number_of_records": {
          "title": "Number Of Records",
          "type": "integer"
        },
        "number_of_columns": {
          "title": "Number Of Columns",
          "type": "integer"
        },
        "name": {
          "title": "Name",
          "type": "string"
        },
        "table_schema": {
          "items": {
            "$ref": "#/$defs/FileColumn"
          },
          "title": "Table Schema",
          "type": "array"
        },
        "columns": {
          "items": {
            "type": "string"
          },
          "title": "Columns",
          "type": "array"
        },
        "data": {
          "anyOf": [
            {
              "items": {
                "type": "object"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": {},
          "title": "Data"
        },
        "has_example_data": {
          "default": false,
          "title": "Has Example Data",
          "type": "boolean"
        },
        "has_run_with_current_setup": {
          "default": false,
          "title": "Has Run With Current Setup",
          "type": "boolean"
        }
      },
      "required": [
        "node_id",
        "number_of_records",
        "number_of_columns",
        "name",
        "table_schema",
        "columns"
      ],
      "title": "TableExample",
      "type": "object"
    }
  },
  "description": "A comprehensive model holding the complete state and data for a single node.\n\nThis includes its input/output data previews, settings, and run status.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "flow_type": {
      "title": "Flow Type",
      "type": "string"
    },
    "left_input": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "right_input": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "main_input": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "main_output": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "left_output": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "right_output": {
      "anyOf": [
        {
          "$ref": "#/$defs/TableExample"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "has_run": {
      "default": false,
      "title": "Has Run",
      "type": "boolean"
    },
    "is_cached": {
      "default": false,
      "title": "Is Cached",
      "type": "boolean"
    },
    "setting_input": {
      "default": null,
      "title": "Setting Input"
    }
  },
  "required": [
    "flow_id",
    "node_id",
    "flow_type"
  ],
  "title": "NodeData",
  "type": "object"
}

Fields:

  • flow_id (int)
  • node_id (int)
  • flow_type (str)
  • left_input (TableExample | None)
  • right_input (TableExample | None)
  • main_input (TableExample | None)
  • main_output (TableExample | None)
  • left_output (TableExample | None)
  • right_output (TableExample | None)
  • has_run (bool)
  • is_cached (bool)
  • setting_input (Any)
Source code in flowfile_core/flowfile_core/schemas/output_model.py, lines 75-92
class NodeData(BaseModel):
    """A comprehensive model holding the complete state and data for a single node.

    This includes its input/output data previews, settings, and run status.
    """

    flow_id: int
    node_id: int
    flow_type: str
    left_input: TableExample | None = None
    right_input: TableExample | None = None
    main_input: TableExample | None = None
    main_output: TableExample | None = None
    left_output: TableExample | None = None
    right_output: TableExample | None = None
    has_run: bool = False
    is_cached: bool = False
    setting_input: Any = None
NodeResult pydantic-model

Bases: BaseModel

Represents the execution result of a single node in a FlowGraph run.

Show JSON schema:
{
  "description": "Represents the execution result of a single node in a FlowGraph run.",
  "properties": {
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "node_name": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Node Name"
    },
    "start_timestamp": {
      "title": "Start Timestamp",
      "type": "number"
    },
    "end_timestamp": {
      "default": 0,
      "title": "End Timestamp",
      "type": "number"
    },
    "success": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Success"
    },
    "error": {
      "default": "",
      "title": "Error",
      "type": "string"
    },
    "run_time": {
      "default": -1,
      "title": "Run Time",
      "type": "integer"
    },
    "is_running": {
      "default": true,
      "title": "Is Running",
      "type": "boolean"
    }
  },
  "required": [
    "node_id"
  ],
  "title": "NodeResult",
  "type": "object"
}

Fields:

  • node_id (int)
  • node_name (str | None)
  • start_timestamp (float)
  • end_timestamp (float)
  • success (bool | None)
  • error (str)
  • run_time (int)
  • is_running (bool)
Source code in flowfile_core/flowfile_core/schemas/output_model.py, lines 8-18
class NodeResult(BaseModel):
    """Represents the execution result of a single node in a FlowGraph run."""

    node_id: int
    node_name: str | None = None
    start_timestamp: float = Field(default_factory=time.time)
    end_timestamp: float = 0
    success: bool | None = None
    error: str = ""
    run_time: int = -1
    is_running: bool = True
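
The model records timing but does not compute it. The sketch below shows how a caller might fill in the fields once a node finishes; this bookkeeping is an assumption about typical usage, not behaviour built into the model.

import time

from flowfile_core.schemas.output_model import NodeResult

result = NodeResult(node_id=3, node_name="filter")  # start_timestamp defaults to now
time.sleep(0.1)  # stand-in for the node's actual work
result.end_timestamp = time.time()
result.run_time = int(result.end_timestamp - result.start_timestamp)
result.success = True
result.is_running = False
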
OutputDir pydantic-model

Bases: BaseItem

Represents the contents of a single output directory.

Show JSON schema:
{
  "$defs": {
    "ItemInfo": {
      "description": "Provides detailed information about a single item in an output directory.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "path": {
          "title": "Path",
          "type": "string"
        },
        "size": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Size"
        },
        "creation_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Creation Date"
        },
        "access_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Access Date"
        },
        "modification_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Modification Date"
        },
        "source_path": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Source Path"
        },
        "number_of_items": {
          "default": -1,
          "title": "Number Of Items",
          "type": "integer"
        },
        "ext": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Ext"
        },
        "mimetype": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Mimetype"
        },
        "id": {
          "default": -1,
          "title": "Id",
          "type": "integer"
        },
        "type": {
          "title": "Type",
          "type": "string"
        },
        "analysis_file_available": {
          "default": false,
          "title": "Analysis File Available",
          "type": "boolean"
        },
        "analysis_file_location": {
          "default": null,
          "title": "Analysis File Location",
          "type": "string"
        },
        "analysis_file_error": {
          "default": null,
          "title": "Analysis File Error",
          "type": "string"
        }
      },
      "required": [
        "name",
        "path",
        "type"
      ],
      "title": "ItemInfo",
      "type": "object"
    }
  },
  "description": "Represents the contents of a single output directory.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    },
    "all_items": {
      "items": {
        "type": "string"
      },
      "title": "All Items",
      "type": "array"
    },
    "items": {
      "items": {
        "$ref": "#/$defs/ItemInfo"
      },
      "title": "Items",
      "type": "array"
    }
  },
  "required": [
    "name",
    "path",
    "all_items",
    "items"
  ],
  "title": "OutputDir",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (int | None)
  • creation_date (datetime | None)
  • access_date (datetime | None)
  • modification_date (datetime | None)
  • source_path (str | None)
  • number_of_items (int)
  • all_items (list[str])
  • items (list[ItemInfo])
Source code in flowfile_core/flowfile_core/schemas/output_model.py, lines 124-128
class OutputDir(BaseItem):
    """Represents the contents of a single output directory."""

    all_items: list[str]
    items: list[ItemInfo]
OutputFile pydantic-model

Bases: BaseItem

Represents a single file in an output directory, extending BaseItem.

Show JSON schema:
{
  "description": "Represents a single file in an output directory, extending BaseItem.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    },
    "ext": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Ext"
    },
    "mimetype": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Mimetype"
    }
  },
  "required": [
    "name",
    "path"
  ],
  "title": "OutputFile",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (int | None)
  • creation_date (datetime | None)
  • access_date (datetime | None)
  • modification_date (datetime | None)
  • source_path (str | None)
  • number_of_items (int)
  • ext (str | None)
  • mimetype (str | None)
Source code in flowfile_core/flowfile_core/schemas/output_model.py, lines 95-99
class OutputFile(BaseItem):
    """Represents a single file in an output directory, extending BaseItem."""

    ext: str | None = None
    mimetype: str | None = None
OutputFiles pydantic-model

Bases: BaseItem

Represents a collection of files, typically within a directory.

Show JSON schema:
{
  "$defs": {
    "OutputFile": {
      "description": "Represents a single file in an output directory, extending BaseItem.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "path": {
          "title": "Path",
          "type": "string"
        },
        "size": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Size"
        },
        "creation_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Creation Date"
        },
        "access_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Access Date"
        },
        "modification_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Modification Date"
        },
        "source_path": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Source Path"
        },
        "number_of_items": {
          "default": -1,
          "title": "Number Of Items",
          "type": "integer"
        },
        "ext": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Ext"
        },
        "mimetype": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Mimetype"
        }
      },
      "required": [
        "name",
        "path"
      ],
      "title": "OutputFile",
      "type": "object"
    }
  },
  "description": "Represents a collection of files, typically within a directory.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    },
    "files": {
      "items": {
        "$ref": "#/$defs/OutputFile"
      },
      "title": "Files",
      "type": "array"
    }
  },
  "required": [
    "name",
    "path"
  ],
  "title": "OutputFiles",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (int | None)
  • creation_date (datetime | None)
  • access_date (datetime | None)
  • modification_date (datetime | None)
  • source_path (str | None)
  • number_of_items (int)
  • files (list[OutputFile])
Source code in flowfile_core/flowfile_core/schemas/output_model.py, lines 102-105
class OutputFiles(BaseItem):
    """Represents a collection of files, typically within a directory."""

    files: list[OutputFile] = Field(default_factory=list)
OutputTree pydantic-model

Bases: OutputFiles

Represents a directory tree, including subdirectories.

Show JSON schema:
{
  "$defs": {
    "OutputFile": {
      "description": "Represents a single file in an output directory, extending BaseItem.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "path": {
          "title": "Path",
          "type": "string"
        },
        "size": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Size"
        },
        "creation_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Creation Date"
        },
        "access_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Access Date"
        },
        "modification_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Modification Date"
        },
        "source_path": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Source Path"
        },
        "number_of_items": {
          "default": -1,
          "title": "Number Of Items",
          "type": "integer"
        },
        "ext": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Ext"
        },
        "mimetype": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Mimetype"
        }
      },
      "required": [
        "name",
        "path"
      ],
      "title": "OutputFile",
      "type": "object"
    },
    "OutputFiles": {
      "description": "Represents a collection of files, typically within a directory.",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "path": {
          "title": "Path",
          "type": "string"
        },
        "size": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Size"
        },
        "creation_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Creation Date"
        },
        "access_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Access Date"
        },
        "modification_date": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Modification Date"
        },
        "source_path": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Source Path"
        },
        "number_of_items": {
          "default": -1,
          "title": "Number Of Items",
          "type": "integer"
        },
        "files": {
          "items": {
            "$ref": "#/$defs/OutputFile"
          },
          "title": "Files",
          "type": "array"
        }
      },
      "required": [
        "name",
        "path"
      ],
      "title": "OutputFiles",
      "type": "object"
    }
  },
  "description": "Represents a directory tree, including subdirectories.",
  "properties": {
    "name": {
      "title": "Name",
      "type": "string"
    },
    "path": {
      "title": "Path",
      "type": "string"
    },
    "size": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Size"
    },
    "creation_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Creation Date"
    },
    "access_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Access Date"
    },
    "modification_date": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Modification Date"
    },
    "source_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Source Path"
    },
    "number_of_items": {
      "default": -1,
      "title": "Number Of Items",
      "type": "integer"
    },
    "files": {
      "items": {
        "$ref": "#/$defs/OutputFile"
      },
      "title": "Files",
      "type": "array"
    },
    "directories": {
      "items": {
        "$ref": "#/$defs/OutputFiles"
      },
      "title": "Directories",
      "type": "array"
    }
  },
  "required": [
    "name",
    "path"
  ],
  "title": "OutputTree",
  "type": "object"
}

Fields:

  • name (str)
  • path (str)
  • size (int | None)
  • creation_date (datetime | None)
  • access_date (datetime | None)
  • modification_date (datetime | None)
  • source_path (str | None)
  • number_of_items (int)
  • files (list[OutputFile])
  • directories (list[OutputFiles])
Source code in flowfile_core/flowfile_core/schemas/output_model.py, lines 108-111
class OutputTree(OutputFiles):
    """Represents a directory tree, including subdirectories."""

    directories: list[OutputFiles] = Field(default_factory=list)
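
Example (a sketch of how OutputFile, OutputFiles, and OutputTree nest to describe a directory tree; paths and names are illustrative):

from flowfile_core.schemas.output_model import OutputFile, OutputFiles, OutputTree

tree = OutputTree(
    name="output",
    path="/data/output",
    files=[
        OutputFile(name="result.parquet", path="/data/output/result.parquet", ext="parquet"),
    ],
    directories=[
        OutputFiles(
            name="archive",
            path="/data/output/archive",
            files=[OutputFile(name="old.csv", path="/data/output/archive/old.csv", ext="csv")],
        ),
    ],
)
print([d.name for d in tree.directories])  # ['archive']
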
RunInformation pydantic-model

Bases: BaseModel

Contains summary information about a complete FlowGraph execution.

Show JSON schema:
{
  "$defs": {
    "NodeResult": {
      "description": "Represents the execution result of a single node in a FlowGraph run.",
      "properties": {
        "node_id": {
          "title": "Node Id",
          "type": "integer"
        },
        "node_name": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Node Name"
        },
        "start_timestamp": {
          "title": "Start Timestamp",
          "type": "number"
        },
        "end_timestamp": {
          "default": 0,
          "title": "End Timestamp",
          "type": "number"
        },
        "success": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Success"
        },
        "error": {
          "default": "",
          "title": "Error",
          "type": "string"
        },
        "run_time": {
          "default": -1,
          "title": "Run Time",
          "type": "integer"
        },
        "is_running": {
          "default": true,
          "title": "Is Running",
          "type": "boolean"
        }
      },
      "required": [
        "node_id"
      ],
      "title": "NodeResult",
      "type": "object"
    }
  },
  "description": "Contains summary information about a complete FlowGraph execution.",
  "properties": {
    "flow_id": {
      "title": "Flow Id",
      "type": "integer"
    },
    "start_time": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "title": "Start Time"
    },
    "end_time": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "End Time"
    },
    "success": {
      "anyOf": [
        {
          "type": "boolean"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Success"
    },
    "nodes_completed": {
      "default": 0,
      "title": "Nodes Completed",
      "type": "integer"
    },
    "number_of_nodes": {
      "default": 0,
      "title": "Number Of Nodes",
      "type": "integer"
    },
    "node_step_result": {
      "items": {
        "$ref": "#/$defs/NodeResult"
      },
      "title": "Node Step Result",
      "type": "array"
    },
    "run_type": {
      "enum": [
        "fetch_one",
        "full_run",
        "init"
      ],
      "title": "Run Type",
      "type": "string"
    }
  },
  "required": [
    "flow_id",
    "node_step_result",
    "run_type"
  ],
  "title": "RunInformation",
  "type": "object"
}

Fields:

  • flow_id (int)
  • start_time (datetime | None)
  • end_time (datetime | None)
  • success (bool | None)
  • nodes_completed (int)
  • number_of_nodes (int)
  • node_step_result (list[NodeResult])
  • run_type (Literal['fetch_one', 'full_run', 'init'])
Source code in flowfile_core/flowfile_core/schemas/output_model.py, lines 21-31
class RunInformation(BaseModel):
    """Contains summary information about a complete FlowGraph execution."""

    flow_id: int
    start_time: datetime | None = Field(default_factory=datetime.now)
    end_time: datetime | None = None
    success: bool | None = None
    nodes_completed: int = 0
    number_of_nodes: int = 0
    node_step_result: list[NodeResult]
    run_type: Literal["fetch_one", "full_run", "init"]
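
Example (a sketch of aggregating per-node results into a run summary; deriving the counts this way is an assumption about how callers use the model, not behaviour built into it):

from flowfile_core.schemas.output_model import NodeResult, RunInformation

results = [
    NodeResult(node_id=1, success=True, is_running=False, run_time=2),
    NodeResult(node_id=2, success=False, is_running=False, error="column not found"),
]

run = RunInformation(flow_id=1, run_type="full_run", node_step_result=results)
run.number_of_nodes = len(results)
run.nodes_completed = sum(1 for r in results if r.success)
run.success = all(r.success for r in results)
print(run.nodes_completed, "of", run.number_of_nodes, "nodes succeeded")
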
TableExample pydantic-model

Bases: BaseModel

Represents a preview of a table, including schema and sample data.

Show JSON schema:
{
  "$defs": {
    "FileColumn": {
      "description": "Represents detailed schema and statistics for a single column (field).",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "data_type": {
          "title": "Data Type",
          "type": "string"
        },
        "is_unique": {
          "title": "Is Unique",
          "type": "boolean"
        },
        "max_value": {
          "title": "Max Value",
          "type": "string"
        },
        "min_value": {
          "title": "Min Value",
          "type": "string"
        },
        "number_of_empty_values": {
          "title": "Number Of Empty Values",
          "type": "integer"
        },
        "number_of_filled_values": {
          "title": "Number Of Filled Values",
          "type": "integer"
        },
        "number_of_unique_values": {
          "title": "Number Of Unique Values",
          "type": "integer"
        },
        "size": {
          "title": "Size",
          "type": "integer"
        }
      },
      "required": [
        "name",
        "data_type",
        "is_unique",
        "max_value",
        "min_value",
        "number_of_empty_values",
        "number_of_filled_values",
        "number_of_unique_values",
        "size"
      ],
      "title": "FileColumn",
      "type": "object"
    }
  },
  "description": "Represents a preview of a table, including schema and sample data.",
  "properties": {
    "node_id": {
      "title": "Node Id",
      "type": "integer"
    },
    "number_of_records": {
      "title": "Number Of Records",
      "type": "integer"
    },
    "number_of_columns": {
      "title": "Number Of Columns",
      "type": "integer"
    },
    "name": {
      "title": "Name",
      "type": "string"
    },
    "table_schema": {
      "items": {
        "$ref": "#/$defs/FileColumn"
      },
      "title": "Table Schema",
      "type": "array"
    },
    "columns": {
      "items": {
        "type": "string"
      },
      "title": "Columns",
      "type": "array"
    },
    "data": {
      "anyOf": [
        {
          "items": {
            "type": "object"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": {},
      "title": "Data"
    },
    "has_example_data": {
      "default": false,
      "title": "Has Example Data",
      "type": "boolean"
    },
    "has_run_with_current_setup": {
      "default": false,
      "title": "Has Run With Current Setup",
      "type": "boolean"
    }
  },
  "required": [
    "node_id",
    "number_of_records",
    "number_of_columns",
    "name",
    "table_schema",
    "columns"
  ],
  "title": "TableExample",
  "type": "object"
}

Fields:

  • node_id (int)
  • number_of_records (int)
  • number_of_columns (int)
  • name (str)
  • table_schema (list[FileColumn])
  • columns (list[str])
  • data (list[dict] | None)
  • has_example_data (bool)
  • has_run_with_current_setup (bool)
Source code in flowfile_core/flowfile_core/schemas/output_model.py, lines 61-72
class TableExample(BaseModel):
    """Represents a preview of a table, including schema and sample data."""

    node_id: int
    number_of_records: int
    number_of_columns: int
    name: str
    table_schema: list[FileColumn]
    columns: list[str]
    data: list[dict] | None = {}
    has_example_data: bool = False
    has_run_with_current_setup: bool = False
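
Example (a hand-built preview for a single-column table; all values are illustrative, and in practice these objects are produced by the flow itself):

from flowfile_core.schemas.output_model import FileColumn, TableExample

city_column = FileColumn(
    name="city",
    data_type="String",
    is_unique=False,
    max_value="Utrecht",
    min_value="Amsterdam",
    number_of_empty_values=0,
    number_of_filled_values=3,
    number_of_unique_values=2,
    size=3,
)

preview = TableExample(
    node_id=5,
    number_of_records=3,
    number_of_columns=1,
    name="sample",
    table_schema=[city_column],
    columns=["city"],
    data=[{"city": "Amsterdam"}, {"city": "Amsterdam"}, {"city": "Utrecht"}],
    has_example_data=True,
)
print(preview.columns)
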

Web API

This section documents the FastAPI routes that expose flowfile-core's functionality over HTTP.

routes

flowfile_core.routes.routes

Main API router and endpoint definitions for the Flowfile application.

This module sets up the FastAPI router, defines all the API endpoints for interacting with flows, nodes, files, and other core components of the application. It handles the logic for creating, reading, updating, and deleting these resources.

Functions:

Name Description
add_generic_settings

A generic endpoint to update the settings of any node.

add_node

Adds a new, unconfigured node (a "promise") to the flow graph.

cancel_flow

Cancels a currently running flow execution.

clear_history

Clear all history for a flow.

close_flow

Closes an active flow session for the current user.

connect_node

Creates a connection (edge) between two nodes in the flow graph.

copy_node

Copies an existing node's settings to a new node promise.

create_db_connection

Creates and securely stores a new database connection.

create_directory

Creates a new directory at the specified path.

create_flow

Creates a new, empty flow file at the specified path and registers a session for it.

delete_db_connection

Deletes a stored database connection.

delete_node

Deletes a node from the flow graph.

delete_node_connection

Deletes a connection (edge) between two nodes.

get_active_flow_file_sessions

Retrieves a list of all currently active flow sessions for the current user.

get_db_connections

Retrieves all stored database connections for the current user (without passwords).

get_default_path

Returns the default starting path for the file browser (user data directory).

get_description_node

Retrieves the description text for a specific node.

get_directory_contents

Gets the contents of a directory path.

get_downstream_node_ids

Gets a list of all node IDs that are downstream dependencies of a given node.

get_excel_sheet_names

Retrieves the sheet names from an Excel file.

get_expression_doc

Retrieves documentation for available Polars expressions.

get_expressions

Retrieves a list of all available Flowfile expression names.

get_flow

Retrieves the settings for a specific flow.

get_flow_frontend_data

Retrieves the data needed to render the flow graph in the frontend.

get_flow_settings

Retrieves the main settings for a flow.

get_generated_code

Generates and returns a Python script with Polars code representing the flow.

get_graphic_walker_input

Gets the data and configuration for the Graphic Walker data exploration tool.

get_history_status

Get the current state of the history system for a flow.

get_instant_function_result

Executes a simple, instant function on a node's data and returns the result.

get_list_of_saved_flows

Scans a directory for saved flow files (.flowfile).

get_local_files

Retrieves a list of files from a specified local directory.

get_node

Retrieves the complete state and data preview for a single node.

get_node_list

Retrieves the list of all available node types and their templates.

get_node_model

(Internal) Retrieves a node's Pydantic model from the input_schema module by its name.

get_reference_node

Retrieves the reference identifier for a specific node.

get_run_status

Retrieves the run status information for a specific flow.

get_table_example

Retrieves a data preview (schema and sample rows) for a node's output.

get_vue_flow_data

Retrieves the flow data formatted for the Vue-based frontend.

import_saved_flow

Imports a flow from a saved .yaml file and registers it as a new session for the current user.

redo_action

Redo the last undone action on the flow graph.

register_flow

Registers a new flow session with the application for the current user.

run_flow

Executes a flow in a background task.

save_flow

Saves the current state of a flow to a .yaml file.

trigger_fetch_node_data

Fetches and refreshes the data for a specific node.

undo_action

Undo the last action on the flow graph.

update_description_node

Updates the description text for a specific node.

update_flow_settings

Updates the main settings for a flow.

update_reference_node

Updates the reference identifier for a specific node.

upload_file

Uploads a file to the server's 'uploads' directory.

validate_db_settings

Validates that a connection can be made to a database with the given settings.

validate_node_reference

Validates if a reference is valid and unique for a node.

add_generic_settings(input_data, node_type, current_user=Depends(get_current_active_user))

A generic endpoint to update the settings of any node.

This endpoint dynamically determines the correct Pydantic model and update function based on the node_type parameter.

Returns:

  • OperationResponse: OperationResponse with current history state.

Source code in flowfile_core/flowfile_core/routes/routes.py, lines 647-688
@router.post('/update_settings/', tags=['transform'], response_model=OperationResponse)
def add_generic_settings(input_data: dict[str, Any], node_type: str, current_user=Depends(get_current_active_user)) -> OperationResponse:
    """A generic endpoint to update the settings of any node.

    This endpoint dynamically determines the correct Pydantic model and update
    function based on the `node_type` parameter.

    Returns:
        OperationResponse with current history state.
    """
    input_data['user_id'] = current_user.id
    node_type = camel_case_to_snake_case(node_type)
    flow_id = int(input_data.get('flow_id'))
    node_id = int(input_data.get('node_id'))
    logger.info(f'Updating the data for flow: {flow_id}, node {node_id}')
    flow = flow_file_handler.get_flow(flow_id)
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    add_func = getattr(flow, 'add_' + node_type)
    parsed_input = None
    setting_name_ref = 'node' + node_type.replace('_', '')

    if add_func is None:
        raise HTTPException(404, 'could not find the function')
    try:
        ref = get_node_model(setting_name_ref)
        if ref:
            parsed_input = ref(**input_data)
    except Exception as e:
        raise HTTPException(421, str(e))
    if parsed_input is None:
        raise HTTPException(404, 'could not find the interface')
    try:
        # History capture is handled by the decorator on each add_* method
        add_func(parsed_input)
    except Exception as e:
        logger.error(e)
        raise HTTPException(419, f'error: {e}')

    return OperationResponse(success=True, history=flow.get_history_state())
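
For illustration, a minimal client-side sketch of calling this endpoint with the requests library. The base URL, bearer token, any router mount prefix, and the exact settings fields are assumptions; the body must match whatever Pydantic model get_node_model resolves for the given node_type.

import requests

BASE_URL = "http://localhost:8000"             # assumed host/port
HEADERS = {"Authorization": "Bearer <token>"}  # assumed auth scheme

# node_type travels as a query parameter; the raw settings dict is the JSON body.
settings = {
    "flow_id": 1,
    "node_id": 5,
    # ...node-specific fields expected by the resolved Pydantic model...
}
resp = requests.post(
    f"{BASE_URL}/update_settings/",
    params={"node_type": "filter"},
    json=settings,
    headers=HEADERS,
)
resp.raise_for_status()
print(resp.json()["history"])  # OperationResponse carries the current history state
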
add_node(flow_id, node_id, node_type, pos_x=0, pos_y=0)

Adds a new, unconfigured node (a "promise") to the flow graph.

Parameters:

Name Type Description Default
flow_id int

The ID of the flow to add the node to.

required
node_id int

The client-generated ID for the new node.

required
node_type str

The type of the node to add (e.g., 'filter', 'join').

required
pos_x int | float

The X coordinate for the node's position in the UI.

0
pos_y int | float

The Y coordinate for the node's position in the UI.

0

Returns:

Type Description
OperationResponse | None

OperationResponse with current history state.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/add_node/', tags=['editor'], response_model=OperationResponse)
def add_node(flow_id: int, node_id: int, node_type: str, pos_x: int | float = 0,
             pos_y: int | float = 0) -> OperationResponse | None:
    """Adds a new, unconfigured node (a "promise") to the flow graph.

    Args:
        flow_id: The ID of the flow to add the node to.
        node_id: The client-generated ID for the new node.
        node_type: The type of the node to add (e.g., 'filter', 'join').
        pos_x: The X coordinate for the node's position in the UI.
        pos_y: The Y coordinate for the node's position in the UI.

    Returns:
        OperationResponse with current history state.
    """
    if isinstance(pos_x, float):
        pos_x = int(pos_x)
    if isinstance(pos_y, float):
        pos_y = int(pos_y)
    flow = flow_file_handler.get_flow(flow_id)
    logger.info(f'Adding a promise for {node_type}')
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')

    node = flow.get_node(node_id)
    if node is not None:
        flow.delete_node(node_id)
    node_promise = input_schema.NodePromise(flow_id=flow_id, node_id=node_id, cache_results=False, pos_x=pos_x,
                                            pos_y=pos_y,
                                            node_type=node_type)
    if node_type == 'explore_data':
        flow.add_initial_node_analysis(node_promise)
    else:
        # Capture state BEFORE adding node (for batched history)
        pre_snapshot = flow.get_flowfile_data() if flow.flow_settings.track_history else None

        logger.info("Adding node")
        # Add node without individual history tracking
        flow.add_node_promise(node_promise, track_history=False)

        if check_if_has_default_setting(node_type):
            logger.info(f'Found standard settings for {node_type}, trying to upload them')
            setting_name_ref = 'node' + node_type.replace('_', '')
            node_model = get_node_model(setting_name_ref)

            # Temporarily disable history tracking for initial settings
            original_track_history = flow.flow_settings.track_history
            flow.flow_settings.track_history = False
            try:
                add_func = getattr(flow, 'add_' + node_type)
                initial_settings = node_model(flow_id=flow_id, node_id=node_id, cache_results=False,
                                              pos_x=pos_x, pos_y=pos_y, node_type=node_type)
                add_func(initial_settings)
            finally:
                flow.flow_settings.track_history = original_track_history

        # Capture batched history entry for the whole add_node operation
        if pre_snapshot is not None and flow.flow_settings.track_history:
            from flowfile_core.schemas.history_schema import HistoryActionType
            flow._history_manager.capture_if_changed(
                flow,
                pre_snapshot,
                HistoryActionType.ADD_NODE,
                f"Add {node_type} node",
                node_id,
            )
            logger.info(f"History: Captured batched 'Add {node_type} node' entry")

    logger.info(f"History state after add_node: {flow.get_history_state()}")
    return OperationResponse(success=True, history=flow.get_history_state())
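
A hedged client sketch (assumed base URL, token, and mount prefix); all arguments are query parameters, matching the signature above.

import requests

BASE_URL = "http://localhost:8000"             # assumed
HEADERS = {"Authorization": "Bearer <token>"}  # assumed

resp = requests.post(
    f"{BASE_URL}/editor/add_node/",
    params={"flow_id": 1, "node_id": 7, "node_type": "filter", "pos_x": 120, "pos_y": 80},
    headers=HEADERS,
)
resp.raise_for_status()
print(resp.json()["history"])  # the batched "Add filter node" entry appears here
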
cancel_flow(flow_id)

Cancels a currently running flow execution.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/flow/cancel/', tags=['editor'])
def cancel_flow(flow_id: int):
    """Cancels a currently running flow execution."""
    flow = flow_file_handler.get_flow(flow_id)
    if not flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is not running')
    flow.cancel()
clear_history(flow_id)

Clear all history for a flow.

Parameters:

Name Type Description Default
flow_id int

The ID of the flow to clear history for.

required
Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/history_clear/', tags=['editor'])
def clear_history(flow_id: int):
    """Clear all history for a flow.

    Args:
        flow_id: The ID of the flow to clear history for.
    """
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        raise HTTPException(404, 'Could not find the flow')
    flow._history_manager.clear()
    return {"message": "History cleared successfully"}
close_flow(flow_id, current_user=Depends(get_current_active_user))

Closes an active flow session for the current user.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post("/editor/close_flow/", tags=["editor"])
def close_flow(flow_id: int, current_user=Depends(get_current_active_user)) -> None:
    """Closes an active flow session for the current user."""
    user_id = current_user.id if current_user else None
    flow_file_handler.delete_flow(flow_id, user_id=user_id)
connect_node(flow_id, node_connection)

Creates a connection (edge) between two nodes in the flow graph.

Returns:

Type Description
OperationResponse

OperationResponse with current history state.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/connect_node/', tags=['editor'], response_model=OperationResponse)
def connect_node(flow_id: int, node_connection: input_schema.NodeConnection) -> OperationResponse:
    """Creates a connection (edge) between two nodes in the flow graph.

    Returns:
        OperationResponse with current history state.
    """
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        logger.info('could not find the flow')
        raise HTTPException(404, 'could not find the flow')
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')

    # Capture history BEFORE the change
    from_id = node_connection.output_connection.node_id
    to_id = node_connection.input_connection.node_id
    flow.capture_history_snapshot(
        HistoryActionType.ADD_CONNECTION,
        f"Connect {from_id} -> {to_id}"
    )

    add_connection(flow, node_connection)

    return OperationResponse(success=True, history=flow.get_history_state())
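
Only output_connection.node_id and input_connection.node_id are referenced above, so the sketch below sends just those two nested fields; the real input_schema.NodeConnection may require more (for example connection handle names), which is an assumption to verify. Base URL and token are also assumed.

import requests

BASE_URL = "http://localhost:8000"             # assumed
HEADERS = {"Authorization": "Bearer <token>"}  # assumed

connection = {
    "output_connection": {"node_id": 7},  # upstream node
    "input_connection": {"node_id": 9},   # downstream node
}
resp = requests.post(
    f"{BASE_URL}/editor/connect_node/",
    params={"flow_id": 1},
    json=connection,
    headers=HEADERS,
)
resp.raise_for_status()
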
copy_node(node_id_to_copy_from, flow_id_to_copy_from, node_promise)

Copies an existing node's settings to a new node promise.

Parameters:

Name Type Description Default
node_id_to_copy_from int

The ID of the node to copy the settings from.

required
flow_id_to_copy_from int

The ID of the flow containing the source node.

required
node_promise NodePromise

A NodePromise representing the new node to be created.

required

Returns:

Type Description
OperationResponse

OperationResponse with current history state.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/copy_node', tags=['editor'], response_model=OperationResponse)
def copy_node(node_id_to_copy_from: int, flow_id_to_copy_from: int, node_promise: input_schema.NodePromise) -> OperationResponse:
    """Copies an existing node's settings to a new node promise.

    Args:
        node_id_to_copy_from: The ID of the node to copy the settings from.
        flow_id_to_copy_from: The ID of the flow containing the source node.
        node_promise: A `NodePromise` representing the new node to be created.

    Returns:
        OperationResponse with current history state.
    """
    try:
        flow_to_copy_from = flow_file_handler.get_flow(flow_id_to_copy_from)
        flow = (flow_to_copy_from
                if flow_id_to_copy_from == node_promise.flow_id
                else flow_file_handler.get_flow(node_promise.flow_id)
                )
        node_to_copy = flow_to_copy_from.get_node(node_id_to_copy_from)
        logger.info(f"Copying data {node_promise.node_type}")

        if flow.flow_settings.is_running:
            raise HTTPException(422, "Flow is running")

        # Capture history BEFORE the change
        flow.capture_history_snapshot(
            HistoryActionType.COPY_NODE,
            f"Copy {node_promise.node_type} node",
            node_id=node_promise.node_id
        )

        if flow.get_node(node_promise.node_id) is not None:
            flow.delete_node(node_promise.node_id)

        if node_promise.node_type == "explore_data":
            flow.add_initial_node_analysis(node_promise)
            return OperationResponse(success=True, history=flow.get_history_state())

        flow.copy_node(node_promise, node_to_copy.setting_input, node_to_copy.node_type)

        return OperationResponse(success=True, history=flow.get_history_state())

    except Exception as e:
        logger.error(e)
        raise HTTPException(422, str(e))
create_db_connection(input_connection, current_user=Depends(get_current_active_user), db=Depends(get_db))

Creates and securely stores a new database connection.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post("/db_connection_lib", tags=['db_connections'])
def create_db_connection(input_connection: input_schema.FullDatabaseConnection,
                         current_user=Depends(get_current_active_user),
                         db: Session = Depends(get_db)
                         ):
    """Creates and securely stores a new database connection."""
    logger.info(f'Creating database connection {input_connection.connection_name}')
    try:
        store_database_connection(db, input_connection, current_user.id)
    except ValueError:
        raise HTTPException(422, 'Connection name already exists')
    except Exception as e:
        logger.error(e)
        raise HTTPException(422, str(e))
    return {"message": "Database connection created successfully"}
create_directory(new_directory)

Creates a new directory at the specified path.

Parameters:

Name Type Description Default
new_directory NewDirectory

An input_schema.NewDirectory object with the path and name.

required

Returns:

Type Description
bool

True if the directory was created successfully.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/files/create_directory', response_model=output_model.OutputDir, tags=['file manager'])
def create_directory(new_directory: input_schema.NewDirectory) -> bool:
    """Creates a new directory at the specified path.

    Args:
        new_directory: An `input_schema.NewDirectory` object with the path and name.

    Returns:
        `True` if the directory was created successfully.
    """
    result, error = create_dir(new_directory)
    if result:
        return True
    else:
        raise error
create_flow(flow_path=None, name=None, current_user=Depends(get_current_active_user))

Creates a new, empty flow file at the specified path and registers a session for it.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post("/editor/create_flow/", tags=["editor"])
def create_flow(flow_path: str = None, name: str = None, current_user=Depends(get_current_active_user)):
    """Creates a new, empty flow file at the specified path and registers a session for it."""
    if flow_path is not None and name is None:
        name = Path(flow_path).stem
    elif flow_path is not None and name is not None:
        if name not in flow_path and (flow_path.endswith(".yaml") or flow_path.endswith(".yml")):
            raise HTTPException(422, 'The name must be part of the flow path when a full path is provided')
        elif name in flow_path and not (flow_path.endswith(".yaml") or flow_path.endswith(".yml")):
            flow_path = str(Path(flow_path) / (name + ".yaml"))
        elif name not in flow_path and (name.endswith(".yaml") or name.endswith(".yml")):
            flow_path = str(Path(flow_path) / name)
        elif name not in flow_path and not (name.endswith(".yaml") or name.endswith(".yml")):
            flow_path = str(Path(flow_path) / (name + ".yaml"))
    if flow_path is not None:
        # Validate path is within allowed sandbox
        flow_path = validate_path_under_cwd(flow_path)
        flow_path_ref = Path(flow_path)
        if not flow_path_ref.parent.exists():
            raise HTTPException(422, "The directory does not exist")
    user_id = current_user.id if current_user else None
    return flow_file_handler.add_flow(name=name, flow_path=flow_path, user_id=user_id)
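
A hedged sketch of creating a flow file (assumed base URL and token). flow_path and name are query parameters; per the branching above, when the path does not already end in .yaml/.yml the name plus a .yaml suffix is appended to it, and the resulting path must resolve inside the server's sandbox.

import requests

BASE_URL = "http://localhost:8000"             # assumed
HEADERS = {"Authorization": "Bearer <token>"}  # assumed

# Intended result: <sandbox>/flows/sales_pipeline.yaml (the 'flows' directory must exist)
resp = requests.post(
    f"{BASE_URL}/editor/create_flow/",
    params={"flow_path": "flows", "name": "sales_pipeline"},
    headers=HEADERS,
)
resp.raise_for_status()
print(resp.json())  # whatever flow_file_handler.add_flow returns for the new session
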
delete_db_connection(connection_name, current_user=Depends(get_current_active_user), db=Depends(get_db))

Deletes a stored database connection.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.delete('/db_connection_lib', tags=['db_connections'])
def delete_db_connection(connection_name: str,
                         current_user=Depends(get_current_active_user),
                         db: Session = Depends(get_db)
                         ):
    """Deletes a stored database connection."""
    logger.info(f'Deleting database connection {connection_name}')
    db_connection = get_database_connection(db, connection_name, current_user.id)
    if db_connection is None:
        raise HTTPException(404, 'Database connection not found')
    delete_database_connection(db, connection_name, current_user.id)
    return {"message": "Database connection deleted successfully"}
delete_node(flow_id, node_id)

Deletes a node from the flow graph.

Returns:

Type Description
OperationResponse

OperationResponse with current history state.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/delete_node/', tags=['editor'], response_model=OperationResponse)
def delete_node(flow_id: int | None, node_id: int) -> OperationResponse:
    """Deletes a node from the flow graph.

    Returns:
        OperationResponse with current history state.
    """
    logger.info('Deleting node')
    flow = flow_file_handler.get_flow(flow_id)
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')

    # Capture history BEFORE the change
    node = flow.get_node(node_id)
    node_type = node.node_type if node else "unknown"
    flow.capture_history_snapshot(HistoryActionType.DELETE_NODE, f"Delete {node_type} node", node_id=node_id)

    flow.delete_node(node_id)

    return OperationResponse(success=True, history=flow.get_history_state())
delete_node_connection(flow_id, node_connection=None)

Deletes a connection (edge) between two nodes.

Returns:

Type Description
OperationResponse

OperationResponse with current history state.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/delete_connection/', tags=['editor'], response_model=OperationResponse)
def delete_node_connection(flow_id: int, node_connection: input_schema.NodeConnection = None) -> OperationResponse:
    """Deletes a connection (edge) between two nodes.

    Returns:
        OperationResponse with current history state.
    """
    flow_id = int(flow_id)
    logger.info(
        f'Deleting connection node {node_connection.output_connection.node_id} to node {node_connection.input_connection.node_id}')
    flow = flow_file_handler.get_flow(flow_id)
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')

    # Capture history BEFORE the change
    from_id = node_connection.output_connection.node_id
    to_id = node_connection.input_connection.node_id
    flow.capture_history_snapshot(
        HistoryActionType.DELETE_CONNECTION,
        f"Delete connection {from_id} -> {to_id}"
    )

    delete_connection(flow, node_connection)

    return OperationResponse(success=True, history=flow.get_history_state())
get_active_flow_file_sessions(current_user=Depends(get_current_active_user)) async

Retrieves a list of all currently active flow sessions for the current user.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get("/active_flowfile_sessions/", response_model=list[schemas.FlowSettings])
async def get_active_flow_file_sessions(current_user=Depends(get_current_active_user)) -> list[schemas.FlowSettings]:
    """Retrieves a list of all currently active flow sessions for the current user."""
    user_id = current_user.id if current_user else None
    return [flf.flow_settings for flf in flow_file_handler.get_user_flows(user_id)]
get_db_connections(db=Depends(get_db), current_user=Depends(get_current_active_user))

Retrieves all stored database connections for the current user (without passwords).

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/db_connection_lib', tags=['db_connections'],
            response_model=list[input_schema.FullDatabaseConnectionInterface])
def get_db_connections(
        db: Session = Depends(get_db),
        current_user=Depends(get_current_active_user)) -> list[input_schema.FullDatabaseConnectionInterface]:
    """Retrieves all stored database connections for the current user (without passwords)."""
    return get_all_database_connections_interface(db, current_user.id)
get_default_path() async

Returns the default starting path for the file browser (user data directory).

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/files/default_path/', response_model=str, tags=['file manager'])
async def get_default_path() -> str:
    """Returns the default starting path for the file browser (user data directory)."""
    return str(storage.user_data_directory)
get_description_node(flow_id, node_id)

Retrieves the description text for a specific node.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node/description', tags=['editor'])
def get_description_node(flow_id: int, node_id: int):
    """Retrieves the description text for a specific node."""
    try:
        node = flow_file_handler.get_flow(flow_id).get_node(node_id)
    except:
        raise HTTPException(404, 'Could not find the node')
    if node is None:
        raise HTTPException(404, 'Could not find the node')
    return node.setting_input.description
get_directory_contents(directory, file_types=None, include_hidden=False) async

Gets the contents of a directory path.

Parameters:

Name Type Description Default
directory str

The absolute path to the directory.

required
file_types list[str]

An optional list of file extensions to filter by.

None
include_hidden bool

If True, includes hidden files and directories.

False

Returns:

Type Description
list[FileInfo]

A list of FileInfo objects representing the directory's contents.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/files/directory_contents/', response_model=list[FileInfo], tags=['file manager'])
async def get_directory_contents(directory: str, file_types: list[str] = None,
                                 include_hidden: bool = False) -> list[FileInfo]:
    """Gets the contents of a directory path.

    Args:
        directory: The absolute path to the directory.
        file_types: An optional list of file extensions to filter by.
        include_hidden: If True, includes hidden files and directories.

    Returns:
        A list of `FileInfo` objects representing the directory's contents.
    """
    directory_explorer = SecureFileExplorer(directory, storage.user_data_directory)
    try:
        return directory_explorer.list_contents(show_hidden=include_hidden, file_types=file_types)
    except Exception as e:
        logger.error(e)
        raise HTTPException(404, 'Could not access the directory')
get_downstream_node_ids(flow_id, node_id) async

Gets a list of all node IDs that are downstream dependencies of a given node.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node/downstream_node_ids', response_model=list[int], tags=['editor'])
async def get_downstream_node_ids(flow_id: int, node_id: int) -> list[int]:
    """Gets a list of all node IDs that are downstream dependencies of a given node."""
    flow = flow_file_handler.get_flow(flow_id)
    node = flow.get_node(node_id)
    return list(node.get_all_dependent_node_ids())
get_excel_sheet_names(path) async

Retrieves the sheet names from an Excel file.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/api/get_xlsx_sheet_names', tags=['excel_reader'], response_model=list[str])
async def get_excel_sheet_names(path: str) -> list[str] | None:
    """Retrieves the sheet names from an Excel file."""
    validated_path = validate_path_under_cwd(path)
    sheet_names = excel_file_manager.get_sheet_names(validated_path)
    if sheet_names:
        return sheet_names
    else:
        raise HTTPException(404, 'File not found')
get_expression_doc()

Retrieves documentation for available Polars expressions.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/editor/expression_doc', tags=['editor'], response_model=list[output_model.ExpressionsOverview])
def get_expression_doc() -> list[output_model.ExpressionsOverview]:
    """Retrieves documentation for available Polars expressions."""
    return get_expression_overview()
get_expressions()

Retrieves a list of all available Flowfile expression names.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/editor/expressions', tags=['editor'], response_model=list[str])
def get_expressions() -> list[str]:
    """Retrieves a list of all available Flowfile expression names."""
    return get_all_expressions()
get_flow(flow_id)

Retrieves the settings for a specific flow.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/editor/flow', tags=['editor'], response_model=schemas.FlowSettings)
def get_flow(flow_id: int):
    """Retrieves the settings for a specific flow."""
    flow_id = int(flow_id)
    result = get_flow_settings(flow_id)
    return result
get_flow_frontend_data(flow_id=1)

Retrieves the data needed to render the flow graph in the frontend.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/flow_data', tags=['manager'])
def get_flow_frontend_data(flow_id: int | None = 1):
    """Retrieves the data needed to render the flow graph in the frontend."""
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    return flow.get_frontend_data()
get_flow_settings(flow_id=1)

Retrieves the main settings for a flow.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/flow_settings', tags=['manager'], response_model=schemas.FlowSettings)
def get_flow_settings(flow_id: int | None = 1) -> schemas.FlowSettings:
    """Retrieves the main settings for a flow."""
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    return flow.flow_settings
get_generated_code(flow_id)

Generates and returns a Python script with Polars code representing the flow.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get("/editor/code_to_polars", tags=[], response_model=str)
def get_generated_code(flow_id: int) -> str:
    """Generates and returns a Python script with Polars code representing the flow."""
    flow_id = int(flow_id)
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    return export_flow_to_polars(flow)
get_graphic_walker_input(flow_id, node_id)

Gets the data and configuration for the Graphic Walker data exploration tool.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/analysis_data/graphic_walker_input', tags=['analysis'], response_model=input_schema.NodeExploreData)
def get_graphic_walker_input(flow_id: int, node_id: int):
    """Gets the data and configuration for the Graphic Walker data exploration tool."""
    flow = flow_file_handler.get_flow(flow_id)
    node = flow.get_node(node_id)
    if node.results.analysis_data_generator is None:
        logger.error('The data is not refreshed and available for analysis')
        raise HTTPException(422, 'The data is not refreshed and available for analysis')
    return AnalyticsProcessor.process_graphic_walker_input(node)
get_history_status(flow_id)

Get the current state of the history system for a flow.

Parameters:

Name Type Description Default
flow_id int

The ID of the flow to get history status for.

required

Returns:

Type Description
HistoryState

HistoryState with information about available undo/redo operations.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/editor/history_status/', tags=['editor'], response_model=HistoryState)
def get_history_status(flow_id: int) -> HistoryState:
    """Get the current state of the history system for a flow.

    Args:
        flow_id: The ID of the flow to get history status for.

    Returns:
        HistoryState with information about available undo/redo operations.
    """
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        raise HTTPException(404, 'Could not find the flow')
    return flow.get_history_state()
get_instant_function_result(flow_id, node_id, func_string) async

Executes a simple, instant function on a node's data and returns the result.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/custom_functions/instant_result', tags=[])
async def get_instant_function_result(flow_id: int, node_id: int, func_string: str):
    """Executes a simple, instant function on a node's data and returns the result."""
    try:
        node = flow_file_handler.get_node(flow_id, node_id)
        result = await asyncio.to_thread(get_instant_func_results, node, func_string)
        return result
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
get_list_of_saved_flows(path)

Scans a directory for saved flow files (.flowfile).

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/files/available_flow_files', tags=['editor'], response_model=list[FileInfo])
def get_list_of_saved_flows(path: str):
    """Scans a directory for saved flow files (`.flowfile`)."""
    try:
        # Validate path is within sandbox before proceeding
        explorer = SecureFileExplorer(
            start_path=storage.user_data_directory,
            sandbox_root=storage.user_data_directory
        )
        validated_path = explorer.get_absolute_path(path)
        if validated_path is None:
            return []
        return get_files_from_directory(str(validated_path), types=['flowfile'], sandbox_root=storage.user_data_directory)
    except:
        return []
get_local_files(directory) async

Retrieves a list of files from a specified local directory.

Parameters:

Name Type Description Default
directory str

The absolute path of the directory to scan.

required

Returns:

Type Description
list[FileInfo]

A list of FileInfo objects for each item in the directory.

Raises:

Type Description
HTTPException

404 if the directory does not exist.

HTTPException

403 if access is denied (path outside sandbox).

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/files/files_in_local_directory/', response_model=list[FileInfo], tags=['file manager'])
async def get_local_files(directory: str) -> list[FileInfo]:
    """Retrieves a list of files from a specified local directory.

    Args:
        directory: The absolute path of the directory to scan.

    Returns:
        A list of `FileInfo` objects for each item in the directory.

    Raises:
        HTTPException: 404 if the directory does not exist.
        HTTPException: 403 if access is denied (path outside sandbox).
    """
    # Validate path is within sandbox before proceeding
    explorer = SecureFileExplorer(
        start_path=storage.user_data_directory,
        sandbox_root=storage.user_data_directory
    )
    validated_path = explorer.get_absolute_path(directory)
    if validated_path is None:
        raise HTTPException(403, 'Access denied or directory does not exist')
    if not validated_path.exists() or not validated_path.is_dir():
        raise HTTPException(404, 'Directory does not exist')
    files = get_files_from_directory(str(validated_path), sandbox_root=storage.user_data_directory)
    if files is None:
        raise HTTPException(403, 'Access denied or directory does not exist')
    return files
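
A minimal sketch of listing a directory through this endpoint (assumed base URL and token). The directory must resolve inside the server's sandbox (the user data directory), otherwise a 403 is returned.

import requests

BASE_URL = "http://localhost:8000"             # assumed
HEADERS = {"Authorization": "Bearer <token>"}  # assumed

resp = requests.get(
    f"{BASE_URL}/files/files_in_local_directory/",
    params={"directory": "/path/inside/user_data_directory"},  # placeholder path
    headers=HEADERS,
)
resp.raise_for_status()
for file_info in resp.json():  # list[FileInfo]
    print(file_info)
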
get_node(flow_id, node_id, get_data=False)

Retrieves the complete state and data preview for a single node.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node', response_model=output_model.NodeData, tags=['editor'])
def get_node(flow_id: int, node_id: int, get_data: bool = False):
    """Retrieves the complete state and data preview for a single node."""
    logging.info(f'Getting node {node_id} from flow {flow_id}')
    flow = flow_file_handler.get_flow(flow_id)
    node = flow.get_node(node_id)
    if node is None:
        raise HTTPException(422, 'Not found')
    v = node.get_node_data(flow_id=flow.flow_id, include_example=get_data)
    return v
get_node_list()

Retrieves the list of all available node types and their templates.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node_list', response_model=list[schemas.NodeTemplate])
def get_node_list() -> list[schemas.NodeTemplate]:
    """Retrieves the list of all available node types and their templates."""
    return nodes_list
get_node_model(setting_name_ref)

(Internal) Retrieves a node's Pydantic model from the input_schema module by its name.

Source code in flowfile_core/flowfile_core/routes/routes.py
def get_node_model(setting_name_ref: str):
    """(Internal) Retrieves a node's Pydantic model from the input_schema module by its name."""
    logger.info("Getting node model for: " + setting_name_ref)
    for ref_name, ref in inspect.getmodule(input_schema).__dict__.items():
        if ref_name.lower() == setting_name_ref:
            return ref
    logger.error(f"Could not find node model for: {setting_name_ref}")
get_reference_node(flow_id, node_id)

Retrieves the reference identifier for a specific node.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node/reference', tags=['editor'])
def get_reference_node(flow_id: int, node_id: int):
    """Retrieves the reference identifier for a specific node."""
    try:
        node = flow_file_handler.get_flow(flow_id).get_node(node_id)
    except:
        raise HTTPException(404, 'Could not find the node')
    if node is None:
        raise HTTPException(404, 'Could not find the node')
    return node.setting_input.node_reference or ""
get_run_status(flow_id, response)

Retrieves the run status information for a specific flow.

Returns a 202 Accepted status while the flow is running, and 200 OK when finished.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/flow/run_status/', tags=['editor'],
            response_model=output_model.RunInformation)
def get_run_status(flow_id: int, response: Response):
    """Retrieves the run status information for a specific flow.

    Returns a 202 Accepted status while the flow is running, and 200 OK when finished.
    """
    flow = flow_file_handler.get_flow(flow_id)
    if not flow:
        raise HTTPException(status_code=404, detail="Flow not found")
    if flow.flow_settings.is_running:
        response.status_code = status.HTTP_202_ACCEPTED
    else:
        response.status_code = status.HTTP_200_OK
    return flow.get_run_info()
get_table_example(flow_id, node_id)

Retrieves a data preview (schema and sample rows) for a node's output.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node/data', response_model=output_model.TableExample, tags=['editor'])
def get_table_example(flow_id: int, node_id: int):
    """Retrieves a data preview (schema and sample rows) for a node's output."""
    flow = flow_file_handler.get_flow(flow_id)
    node = flow.get_node(node_id)
    return node.get_table_example(True)
get_vue_flow_data(flow_id)

Retrieves the flow data formatted for the Vue-based frontend.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/flow_data/v2', tags=['manager'])
def get_vue_flow_data(flow_id: int) -> schemas.VueFlowInput:
    """Retrieves the flow data formatted for the Vue-based frontend."""
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    data = flow.get_vue_flow_input()
    return data
import_saved_flow(flow_path, current_user=Depends(get_current_active_user))

Imports a flow from a saved .yaml file and registers it as a new session for the current user.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get("/import_flow/", tags=["editor"], response_model=int)
def import_saved_flow(flow_path: str, current_user=Depends(get_current_active_user)) -> int:
    """Imports a flow from a saved `.yaml` and registers it as a new session for the current user."""
    validated_path = validate_path_under_cwd(flow_path)
    if not os.path.exists(validated_path):
        raise HTTPException(404, "File not found")
    user_id = current_user.id if current_user else None
    return flow_file_handler.import_flow(Path(validated_path), user_id=user_id)
redo_action(flow_id)

Redo the last undone action on the flow graph.

Parameters:

Name Type Description Default
flow_id int

The ID of the flow to redo.

required

Returns:

Type Description
UndoRedoResult

UndoRedoResult indicating success or failure.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/redo/', tags=['editor'], response_model=UndoRedoResult)
def redo_action(flow_id: int) -> UndoRedoResult:
    """Redo the last undone action on the flow graph.

    Args:
        flow_id: The ID of the flow to redo.

    Returns:
        UndoRedoResult indicating success or failure.
    """
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        raise HTTPException(404, 'Could not find the flow')
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')
    return flow.redo()
register_flow(flow_data, current_user=Depends(get_current_active_user))

Registers a new flow session with the application for the current user.

Parameters:

Name Type Description Default
flow_data FlowSettings

The FlowSettings for the new flow.

required

Returns:

Type Description
int

The ID of the newly registered flow.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post("/flow/register/", tags=["editor"])
def register_flow(flow_data: schemas.FlowSettings, current_user=Depends(get_current_active_user)) -> int:
    """Registers a new flow session with the application for the current user.

    Args:
        flow_data: The `FlowSettings` for the new flow.

    Returns:
        The ID of the newly registered flow.
    """
    user_id = current_user.id if current_user else None
    return flow_file_handler.register_flow(flow_data, user_id=user_id)
run_flow(flow_id, background_tasks) async

Executes a flow in a background task.

Parameters:

Name Type Description Default
flow_id int

The ID of the flow to execute.

required
background_tasks BackgroundTasks

FastAPI's background task runner.

required

Returns:

Type Description
JSONResponse

A JSON response indicating that the flow has started.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/flow/run/', tags=['editor'])
async def run_flow(flow_id: int, background_tasks: BackgroundTasks) -> JSONResponse:
    """Executes a flow in a background task.

    Args:
        flow_id: The ID of the flow to execute.
        background_tasks: FastAPI's background task runner.

    Returns:
        A JSON response indicating that the flow has started.
    """
    logger.info('starting to run...')
    flow = flow_file_handler.get_flow(flow_id)
    lock = get_flow_run_lock(flow_id)
    async with lock:
        if flow.flow_settings.is_running:
            raise HTTPException(422, 'Flow is already running')
        background_tasks.add_task(flow.run_graph)
    return JSONResponse(content={"message": "Data started", "flow_id": flow_id}, status_code=status.HTTP_200_OK)
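
A sketch of starting a run and waiting for it to finish (assumed base URL and token). Per get_run_status above, the status endpoint answers 202 while the flow is running and 200 once it has finished.

import time
import requests

BASE_URL = "http://localhost:8000"             # assumed
HEADERS = {"Authorization": "Bearer <token>"}  # assumed
flow_id = 1

# Kick off the background run
requests.post(f"{BASE_URL}/flow/run/", params={"flow_id": flow_id}, headers=HEADERS).raise_for_status()

# Poll until the run finishes (202 -> still running, 200 -> done)
while True:
    status_resp = requests.get(f"{BASE_URL}/flow/run_status/", params={"flow_id": flow_id}, headers=HEADERS)
    if status_resp.status_code == 200:
        break
    time.sleep(1)

print(status_resp.json())  # RunInformation for the finished run
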
save_flow(flow_id, flow_path=None)

Saves the current state of a flow to a .yaml file.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/save_flow', tags=['editor'])
def save_flow(flow_id: int, flow_path: str = None):
    """Saves the current state of a flow to a `.yaml`."""
    if flow_path is not None:
        flow_path = validate_path_under_cwd(flow_path)
    flow = flow_file_handler.get_flow(flow_id)
    flow.save_flow(flow_path=flow_path)
trigger_fetch_node_data(flow_id, node_id, background_tasks) async

Fetches and refreshes the data for a specific node.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post("/node/trigger_fetch_data", tags=['editor'])
async def trigger_fetch_node_data(flow_id: int, node_id: int, background_tasks: BackgroundTasks):
    """Fetches and refreshes the data for a specific node."""
    flow = flow_file_handler.get_flow(flow_id)
    lock = get_flow_run_lock(flow_id)
    async with lock:
        if flow.flow_settings.is_running:
            raise HTTPException(422, 'Flow is already running')
        try:
            flow.validate_if_node_can_be_fetched(node_id)
        except Exception as e:
            raise HTTPException(422, str(e))
        background_tasks.add_task(flow.trigger_fetch_node, node_id)
    return JSONResponse(content={"message": "Data started",
                                 "flow_id": flow_id,
                                 "node_id": node_id}, status_code=status.HTTP_200_OK)
undo_action(flow_id)

Undo the last action on the flow graph.

Parameters:

Name Type Description Default
flow_id int

The ID of the flow to undo.

required

Returns:

Type Description
UndoRedoResult

UndoRedoResult indicating success or failure.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/editor/undo/', tags=['editor'], response_model=UndoRedoResult)
def undo_action(flow_id: int) -> UndoRedoResult:
    """Undo the last action on the flow graph.

    Args:
        flow_id: The ID of the flow to undo.

    Returns:
        UndoRedoResult indicating success or failure.
    """
    flow = flow_file_handler.get_flow(flow_id)
    if flow is None:
        raise HTTPException(404, 'Could not find the flow')
    if flow.flow_settings.is_running:
        raise HTTPException(422, 'Flow is running')
    return flow.undo()
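
Undo and redo are simple POSTs keyed only by flow_id; a hedged sketch (assumed base URL and token):

import requests

BASE_URL = "http://localhost:8000"             # assumed
HEADERS = {"Authorization": "Bearer <token>"}  # assumed

undo = requests.post(f"{BASE_URL}/editor/undo/", params={"flow_id": 1}, headers=HEADERS)
print(undo.json())  # UndoRedoResult for the undone action

redo = requests.post(f"{BASE_URL}/editor/redo/", params={"flow_id": 1}, headers=HEADERS)
print(redo.json())  # UndoRedoResult for the redone action
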
update_description_node(flow_id, node_id, description=Body(...))

Updates the description text for a specific node.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/node/description/', tags=['editor'])
def update_description_node(flow_id: int, node_id: int, description: str = Body(...)):
    """Updates the description text for a specific node."""
    try:
        node = flow_file_handler.get_flow(flow_id).get_node(node_id)
    except:
        raise HTTPException(404, 'Could not find the node')
    node.setting_input.description = description
    return True
update_flow_settings(flow_settings)

Updates the main settings for a flow.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/flow_settings', tags=['manager'])
def update_flow_settings(flow_settings: schemas.FlowSettings):
    """Updates the main settings for a flow."""
    flow = flow_file_handler.get_flow(flow_settings.flow_id)
    if flow is None:
        raise HTTPException(404, 'could not find the flow')
    flow.flow_settings = flow_settings
update_reference_node(flow_id, node_id, reference=Body(...))

Updates the reference identifier for a specific node.

The reference must be:

- Lowercase only
- No spaces allowed
- Unique across all nodes in the flow

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post('/node/reference/', tags=['editor'])
def update_reference_node(flow_id: int, node_id: int, reference: str = Body(...)):
    """Updates the reference identifier for a specific node.

    The reference must be:
    - Lowercase only
    - No spaces allowed
    - Unique across all nodes in the flow
    """
    try:
        flow = flow_file_handler.get_flow(flow_id)
        node = flow.get_node(node_id)
    except:
        raise HTTPException(404, 'Could not find the node')
    if node is None:
        raise HTTPException(404, 'Could not find the node')

    # Handle empty reference (allow clearing)
    if reference == "" or reference is None:
        node.setting_input.node_reference = None
        return True

    # Validate: lowercase only, no spaces
    if " " in reference:
        raise HTTPException(422, 'Reference cannot contain spaces')
    if reference != reference.lower():
        raise HTTPException(422, 'Reference must be lowercase')

    # Validate: unique across all nodes in the flow
    for other_node in flow.nodes:
        if other_node.node_id != node_id:
            other_ref = getattr(other_node.setting_input, 'node_reference', None)
            if other_ref and other_ref == reference:
                raise HTTPException(422, f'Reference "{reference}" is already used by another node')

    node.setting_input.node_reference = reference
    return True
upload_file(file=File(...)) async

Uploads a file to the server's 'uploads' directory.

Parameters:

Name Type Description Default
file UploadFile

The file to be uploaded.

File(...)

Returns:

Type Description
JSONResponse

A JSON response containing the filename and the path where it was saved.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post("/upload/")
async def upload_file(file: UploadFile = File(...)) -> JSONResponse:
    """Uploads a file to the server's 'uploads' directory.

    Args:
        file: The file to be uploaded.

    Returns:
        A JSON response containing the filename and the path where it was saved.
    """
    safe_name = Path(file.filename).name.replace("..", "")
    if not safe_name:
        raise HTTPException(400, 'Invalid filename')
    uploads_dir = Path("uploads")
    uploads_dir.mkdir(exist_ok=True)
    file_location = uploads_dir / safe_name
    with open(file_location, "wb+") as file_object:
        file_object.write(file.file.read())
    return JSONResponse(content={"filename": safe_name, "filepath": str(file_location)})
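
Uploading is a standard multipart/form-data POST; a sketch using requests (assumed base URL and token, placeholder file name):

import requests

BASE_URL = "http://localhost:8000"             # assumed
HEADERS = {"Authorization": "Bearer <token>"}  # assumed

with open("sales.csv", "rb") as fh:  # placeholder local file
    resp = requests.post(f"{BASE_URL}/upload/", files={"file": fh}, headers=HEADERS)
resp.raise_for_status()
print(resp.json())  # e.g. {"filename": "sales.csv", "filepath": "uploads/sales.csv"}
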
validate_db_settings(database_settings, current_user=Depends(get_current_active_user)) async

Validates that a connection can be made to a database with the given settings.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.post("/validate_db_settings")
async def validate_db_settings(
        database_settings: input_schema.DatabaseSettings,
        current_user=Depends(get_current_active_user)
):
    """Validates that a connection can be made to a database with the given settings."""
    # Validate the query settings
    try:
        sql_source = create_sql_source_from_db_settings(database_settings, user_id=current_user.id)
        sql_source.validate()
        return {"message": "Query settings are valid"}
    except Exception as e:
        raise HTTPException(status_code=422, detail=str(e))
validate_node_reference(flow_id, node_id, reference)

Validates if a reference is valid and unique for a node.

Returns:

Type Description

Dict with 'valid' (bool) and 'error' (str or None) fields.

Source code in flowfile_core/flowfile_core/routes/routes.py
@router.get('/node/validate_reference', tags=['editor'])
def validate_node_reference(flow_id: int, node_id: int, reference: str):
    """Validates if a reference is valid and unique for a node.

    Returns:
        Dict with 'valid' (bool) and 'error' (str or None) fields.
    """
    try:
        flow = flow_file_handler.get_flow(flow_id)
    except:
        raise HTTPException(404, 'Could not find the flow')

    # Handle empty reference (always valid - means use default)
    if reference == "" or reference is None:
        return {"valid": True, "error": None}

    # Validate: lowercase only
    if reference != reference.lower():
        return {"valid": False, "error": "Reference must be lowercase"}

    # Validate: no spaces
    if " " in reference:
        return {"valid": False, "error": "Reference cannot contain spaces"}

    # Validate: unique across all nodes in the flow
    for other_node in flow.nodes:
        if other_node.node_id != node_id:
            other_ref = getattr(other_node.setting_input, 'node_reference', None)
            if other_ref and other_ref == reference:
                return {"valid": False, "error": f'Reference "{reference}" is already used by another node'}

    return {"valid": True, "error": None}

auth

flowfile_core.routes.auth

Functions:

Name Description
change_own_password

Change the current user's password

create_user

Create a new user (admin only)

delete_user

Delete a user (admin only)

get_password_requirements

Get password requirements for client-side validation

list_users

List all users (admin only)

update_user

Update a user (admin only)

change_own_password(password_data, current_user=Depends(get_current_active_user), db=Depends(get_db)) async

Change the current user's password

Source code in flowfile_core/flowfile_core/routes/auth.py
@router.post("/users/me/change-password", response_model=User)
async def change_own_password(
    password_data: ChangePassword,
    current_user: User = Depends(get_current_active_user),
    db: Session = Depends(get_db)
):
    """Change the current user's password"""
    user = db.query(db_models.User).filter(db_models.User.id == current_user.id).first()
    if not user:
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail="User not found"
        )

    # Verify current password
    if not verify_password(password_data.current_password, user.hashed_password):
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail="Current password is incorrect"
        )

    # Validate new password requirements
    is_valid, error_message = validate_password(password_data.new_password)
    if not is_valid:
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail=error_message
        )

    # Update password and clear must_change_password flag
    user.hashed_password = get_password_hash(password_data.new_password)
    user.must_change_password = False
    db.commit()
    db.refresh(user)

    return User(
        username=user.username,
        id=user.id,
        email=user.email,
        full_name=user.full_name,
        disabled=user.disabled,
        is_admin=user.is_admin,
        must_change_password=user.must_change_password
    )
create_user(user_data, current_user=Depends(get_current_admin_user), db=Depends(get_db)) async

Create a new user (admin only)

Source code in flowfile_core/flowfile_core/routes/auth.py
@router.post("/users", response_model=User)
async def create_user(
    user_data: UserCreate,
    current_user: User = Depends(get_current_admin_user),
    db: Session = Depends(get_db)
):
    """Create a new user (admin only)"""
    # Check if username already exists
    existing_user = db.query(db_models.User).filter(
        db_models.User.username == user_data.username
    ).first()
    if existing_user:
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail="Username already exists"
        )

    # Check if email already exists (if provided)
    if user_data.email:
        existing_email = db.query(db_models.User).filter(
            db_models.User.email == user_data.email
        ).first()
        if existing_email:
            raise HTTPException(
                status_code=status.HTTP_400_BAD_REQUEST,
                detail="Email already exists"
            )

    # Validate password requirements
    is_valid, error_message = validate_password(user_data.password)
    if not is_valid:
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail=error_message
        )

    # Create new user with must_change_password=True
    hashed_password = get_password_hash(user_data.password)
    new_user = db_models.User(
        username=user_data.username,
        email=user_data.email or f"{user_data.username}@flowfile.app",
        full_name=user_data.full_name,
        hashed_password=hashed_password,
        is_admin=user_data.is_admin,
        must_change_password=True
    )
    db.add(new_user)
    db.commit()
    db.refresh(new_user)

    return User(
        username=new_user.username,
        id=new_user.id,
        email=new_user.email,
        full_name=new_user.full_name,
        disabled=new_user.disabled,
        is_admin=new_user.is_admin,
        must_change_password=new_user.must_change_password
    )
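
A hedged admin-side sketch of creating a user (assumed base URL, admin token, and any mount prefix for the auth router); the UserCreate fields shown are the ones read by the handler above.

import requests

BASE_URL = "http://localhost:8000"                         # assumed
ADMIN_HEADERS = {"Authorization": "Bearer <admin-token>"}  # assumed

new_user = {
    "username": "analyst1",
    "password": "S0me-Strong-Passw0rd!",  # must satisfy /password-requirements
    "email": "analyst1@example.com",
    "full_name": "Analyst One",
    "is_admin": False,
}
resp = requests.post(f"{BASE_URL}/users", json=new_user, headers=ADMIN_HEADERS)
resp.raise_for_status()
print(resp.json())  # returned User has must_change_password=True for first login
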
delete_user(user_id, current_user=Depends(get_current_admin_user), db=Depends(get_db)) async

Delete a user (admin only)

Source code in flowfile_core/flowfile_core/routes/auth.py
@router.delete("/users/{user_id}")
async def delete_user(
    user_id: int,
    current_user: User = Depends(get_current_admin_user),
    db: Session = Depends(get_db)
):
    """Delete a user (admin only)"""
    user = db.query(db_models.User).filter(db_models.User.id == user_id).first()
    if not user:
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail="User not found"
        )

    # Prevent admin from deleting themselves
    if user.id == current_user.id:
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail="Cannot delete your own account"
        )

    # Delete user's secrets and connections first (cascade)
    db.query(db_models.Secret).filter(db_models.Secret.user_id == user_id).delete()
    db.query(db_models.DatabaseConnection).filter(db_models.DatabaseConnection.user_id == user_id).delete()
    db.query(db_models.CloudStorageConnection).filter(db_models.CloudStorageConnection.user_id == user_id).delete()

    db.delete(user)
    db.commit()

    return {"message": f"User '{user.username}' deleted successfully"}
get_password_requirements() async

Get password requirements for client-side validation

Source code in flowfile_core/flowfile_core/routes/auth.py
@router.get("/password-requirements")
async def get_password_requirements():
    """Get password requirements for client-side validation"""
    return PASSWORD_REQUIREMENTS
list_users(current_user=Depends(get_current_admin_user), db=Depends(get_db)) async

List all users (admin only)

Source code in flowfile_core/flowfile_core/routes/auth.py
@router.get("/users", response_model=list[User])
async def list_users(
    current_user: User = Depends(get_current_admin_user),
    db: Session = Depends(get_db)
):
    """List all users (admin only)"""
    users = db.query(db_models.User).all()
    return [
        User(
            username=u.username,
            id=u.id,
            email=u.email,
            full_name=u.full_name,
            disabled=u.disabled,
            is_admin=u.is_admin,
            must_change_password=u.must_change_password
        )
        for u in users
    ]
update_user(user_id, user_data, current_user=Depends(get_current_admin_user), db=Depends(get_db)) async

Update a user (admin only)

Source code in flowfile_core/flowfile_core/routes/auth.py
@router.put("/users/{user_id}", response_model=User)
async def update_user(
    user_id: int,
    user_data: UserUpdate,
    current_user: User = Depends(get_current_admin_user),
    db: Session = Depends(get_db)
):
    """Update a user (admin only)"""
    user = db.query(db_models.User).filter(db_models.User.id == user_id).first()
    if not user:
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail="User not found"
        )

    # Prevent admin from disabling themselves
    if user.id == current_user.id and user_data.disabled:
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail="Cannot disable your own account"
        )

    # Prevent admin from removing their own admin status
    if user.id == current_user.id and user_data.is_admin is False:
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail="Cannot remove your own admin privileges"
        )

    # Update fields
    if user_data.email is not None:
        # Check if email already exists for another user
        existing_email = db.query(db_models.User).filter(
            db_models.User.email == user_data.email,
            db_models.User.id != user_id
        ).first()
        if existing_email:
            raise HTTPException(
                status_code=status.HTTP_400_BAD_REQUEST,
                detail="Email already exists"
            )
        user.email = user_data.email

    if user_data.full_name is not None:
        user.full_name = user_data.full_name

    if user_data.disabled is not None:
        user.disabled = user_data.disabled

    if user_data.is_admin is not None:
        user.is_admin = user_data.is_admin

    if user_data.password is not None:
        # Validate password requirements
        is_valid, error_message = validate_password(user_data.password)
        if not is_valid:
            raise HTTPException(
                status_code=status.HTTP_400_BAD_REQUEST,
                detail=error_message
            )
        user.hashed_password = get_password_hash(user_data.password)
        # Reset must_change_password when admin sets a new password
        user.must_change_password = True

    if user_data.must_change_password is not None:
        user.must_change_password = user_data.must_change_password

    db.commit()
    db.refresh(user)

    return User(
        username=user.username,
        id=user.id,
        email=user.email,
        full_name=user.full_name,
        disabled=user.disabled,
        is_admin=user.is_admin,
        must_change_password=user.must_change_password
    )
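Because only non-null UserUpdate fields are applied, a partial update can send just the fields that should change. A sketch, assuming the same placeholder base URL and admin token as above:

import requests

BASE_URL = "http://localhost:8000"  # assumed placeholder
ADMIN_TOKEN = "..."                 # assumed admin token

# Only the provided fields are updated; omitted fields are left unchanged.
resp = requests.put(
    f"{BASE_URL}/users/42",
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    json={"full_name": "Jane Doe", "is_admin": True},
)
resp.raise_for_status()
print(resp.json())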

cloud_connections

flowfile_core.routes.cloud_connections

Functions:

Name Description
create_cloud_storage_connection

Create a new cloud storage connection.

delete_cloud_connection_with_connection_name

Delete a cloud connection.

get_cloud_connections

Get all cloud storage connections for the current user.

create_cloud_storage_connection(input_connection, current_user=Depends(get_current_active_user), db=Depends(get_db))

Create a new cloud storage connection.

Parameters
    input_connection: FullCloudStorageConnection schema containing connection details
    current_user: User obtained from Depends(get_current_active_user)
    db: Session obtained from Depends(get_db)

Returns
    Dict with a success message

Source code in flowfile_core/flowfile_core/routes/cloud_connections.py, lines 23-46
@router.post("/cloud_connection", tags=["cloud_connections"])
def create_cloud_storage_connection(
    input_connection: FullCloudStorageConnection,
    current_user=Depends(get_current_active_user),
    db: Session = Depends(get_db),
):
    """
    Create a new cloud storage connection.
    Parameters
        input_connection: FullCloudStorageConnection schema containing connection details
        current_user: User obtained from Depends(get_current_active_user)
        db: Session obtained from Depends(get_db)
    Returns
        Dict with a success message
    """
    logger.info(f"Create cloud connection {input_connection.connection_name}")
    try:
        store_cloud_connection(db, input_connection, current_user.id)
    except ValueError:
        raise HTTPException(422, "Connection name already exists")
    except Exception as e:
        logger.error(e)
        raise HTTPException(422, str(e))
    return {"message": "Cloud connection created successfully"}
delete_cloud_connection_with_connection_name(connection_name, current_user=Depends(get_current_active_user), db=Depends(get_db))

Delete a cloud connection.

Source code in flowfile_core/flowfile_core/routes/cloud_connections.py, lines 49-61
@router.delete("/cloud_connection", tags=["cloud_connections"])
def delete_cloud_connection_with_connection_name(
    connection_name: str, current_user=Depends(get_current_active_user), db: Session = Depends(get_db)
):
    """
    Delete a cloud connection.
    """
    logger.info(f"Deleting cloud connection {connection_name}")
    cloud_storage_connection = get_cloud_connection_schema(db, connection_name, current_user.id)
    if cloud_storage_connection is None:
        raise HTTPException(404, "Cloud connection connection not found")
    delete_cloud_connection(db, connection_name, current_user.id)
    return {"message": "Cloud connection deleted successfully"}
get_cloud_connections(db=Depends(get_db), current_user=Depends(get_current_active_user))

Get all cloud storage connections for the current user.

Parameters
    db: Session obtained from Depends(get_db)
    current_user: User obtained from Depends(get_current_active_user)

Returns
    List[FullCloudStorageConnectionInterface]

Source code in flowfile_core/flowfile_core/routes/cloud_connections.py, lines 64-77
@router.get("/cloud_connections", tags=["cloud_connection"], response_model=list[FullCloudStorageConnectionInterface])
def get_cloud_connections(
    db: Session = Depends(get_db), current_user=Depends(get_current_active_user)
) -> list[FullCloudStorageConnectionInterface]:
    """
    Get all cloud storage connections for the current user.
    Parameters
        db: Session obtained from Depends(get_db)
        current_user: User obtained from Depends(get_current_active_user)

    Returns
        List[FullCloudStorageConnectionInterface]
    """
    return get_all_cloud_connections_interface(db, current_user.id)

logs

flowfile_core.routes.logs

Functions:

Name Description
add_log

Adds a log message to the log file for a given flow_id.

add_raw_log

Adds a log message to the log file for a given flow_id.

format_sse_message

Format the data as a proper SSE message

stream_logs

Streams logs for a given flow_id using Server-Sent Events.

add_log(flow_id, log_message) async

Adds a log message to the log file for a given flow_id.

Source code in flowfile_core/flowfile_core/routes/logs.py, lines 35-42
@router.post("/logs/{flow_id}", tags=["flow_logging"])
async def add_log(flow_id: int, log_message: str):
    """Adds a log message to the log file for a given flow_id."""
    flow = flow_file_handler.get_flow(flow_id)
    if not flow:
        raise HTTPException(status_code=404, detail="Flow not found")
    flow.flow_logger.info(log_message)
    return {"message": "Log added successfully"}
add_raw_log(raw_log_input) async

Adds a log message to the log file for a given flow_id.

Source code in flowfile_core/flowfile_core/routes/logs.py, lines 45-59
@router.post("/raw_logs", tags=["flow_logging"])
async def add_raw_log(raw_log_input: schemas.RawLogInput):
    """Adds a log message to the log file for a given flow_id."""
    logger.info("Adding raw logs")
    flow = flow_file_handler.get_flow(raw_log_input.flowfile_flow_id)
    if not flow:
        raise HTTPException(status_code=404, detail="Flow not found")
    flow.flow_logger.get_log_filepath()
    flow_logger = flow.flow_logger
    flow_logger.get_log_filepath()
    if raw_log_input.log_type == "INFO":
        flow_logger.info(raw_log_input.log_message, extra=raw_log_input.extra)
    elif raw_log_input.log_type == "ERROR":
        flow_logger.error(raw_log_input.log_message, extra=raw_log_input.extra)
    return {"message": "Log added successfully"}
format_sse_message(data) async

Format the data as a proper SSE message

Source code in flowfile_core/flowfile_core/routes/logs.py, lines 30-32
async def format_sse_message(data: str) -> str:
    """Format the data as a proper SSE message"""
    return f"data: {json.dumps(data)}\n\n"
stream_logs(flow_id, idle_timeout=300, current_user=Depends(get_current_user_from_query)) async

Streams logs for a given flow_id using Server-Sent Events. Requires authentication via token in query parameter. The connection will close gracefully if the server shuts down.

Source code in flowfile_core/flowfile_core/routes/logs.py, lines 113-150
@router.get("/logs/{flow_id}", tags=["flow_logging"])
async def stream_logs(flow_id: int, idle_timeout: int = 300, current_user=Depends(get_current_user_from_query)):
    """
    Streams logs for a given flow_id using Server-Sent Events.
    Requires authentication via token in query parameter.
    The connection will close gracefully if the server shuts down.
    """
    logger.info(f"Starting log stream for flow_id: {flow_id} by user: {current_user.username}")
    await asyncio.sleep(0.3)
    flow = flow_file_handler.get_flow(flow_id)
    logger.info("Streaming logs")
    if not flow:
        raise HTTPException(status_code=404, detail="Flow not found")

    log_file_path = flow.flow_logger.get_log_filepath()
    if not Path(log_file_path).exists():
        raise HTTPException(status_code=404, detail="Log file not found")

    class RunningState:
        def __init__(self):
            self.has_started = False

        def is_running(self):
            if flow.flow_settings.is_running:
                self.has_started = True
            return flow.flow_settings.is_running or not self.has_started

    running_state = RunningState()

    return StreamingResponse(
        stream_log_file(log_file_path, running_state.is_running, idle_timeout),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Content-Type": "text/event-stream",
        },
    )
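Consuming the stream amounts to reading Server-Sent Events from a long-lived GET request. The sketch below assumes the authentication token is passed as a query parameter named token (the actual parameter name used by get_current_user_from_query is not shown here) and uses a placeholder base URL:

import json
import requests

BASE_URL = "http://localhost:8000"  # assumed placeholder
TOKEN = "..."                       # assumed access token
FLOW_ID = 1

with requests.get(
    f"{BASE_URL}/logs/{FLOW_ID}",
    params={"token": TOKEN, "idle_timeout": 60},  # "token" is an assumed parameter name
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # Each SSE frame is "data: <json-encoded string>" (see format_sse_message above)
        if line and line.startswith("data: "):
            print(json.loads(line[len("data: "):]))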

public

flowfile_core.routes.public

Classes:

Name Description
GeneratedKey

Response model for the generate key endpoint.

SetupStatus

Response model for the setup status endpoint.

Functions:

Name Description
docs_redirect

Redirects to the documentation page.

generate_key

Generate a new master encryption key.

get_setup_status

Get the current setup status of the application.

GeneratedKey pydantic-model

Bases: BaseModel

Response model for the generate key endpoint.

Show JSON schema:
{
  "description": "Response model for the generate key endpoint.",
  "properties": {
    "key": {
      "title": "Key",
      "type": "string"
    },
    "instructions": {
      "title": "Instructions",
      "type": "string"
    }
  },
  "required": [
    "key",
    "instructions"
  ],
  "title": "GeneratedKey",
  "type": "object"
}

Fields:

  • key (str)
  • instructions (str)
Source code in flowfile_core/flowfile_core/routes/public.py, lines 20-24
class GeneratedKey(BaseModel):
    """Response model for the generate key endpoint."""

    key: str
    instructions: str
SetupStatus pydantic-model

Bases: BaseModel

Response model for the setup status endpoint.

Show JSON schema:
{
  "description": "Response model for the setup status endpoint.",
  "properties": {
    "setup_required": {
      "title": "Setup Required",
      "type": "boolean"
    },
    "master_key_configured": {
      "title": "Master Key Configured",
      "type": "boolean"
    },
    "mode": {
      "title": "Mode",
      "type": "string"
    }
  },
  "required": [
    "setup_required",
    "master_key_configured",
    "mode"
  ],
  "title": "SetupStatus",
  "type": "object"
}

Fields:

  • setup_required (bool)
  • master_key_configured (bool)
  • mode (str)
Source code in flowfile_core/flowfile_core/routes/public.py, lines 12-17
class SetupStatus(BaseModel):
    """Response model for the setup status endpoint."""

    setup_required: bool
    master_key_configured: bool
    mode: str
docs_redirect() async

Redirects to the documentation page.

Source code in flowfile_core/flowfile_core/routes/public.py, lines 27-30
@router.get("/", tags=["admin"])
async def docs_redirect():
    """Redirects to the documentation page."""
    return RedirectResponse(url="/docs")
generate_key() async

Generate a new master encryption key.

Source code in flowfile_core/flowfile_core/routes/public.py, lines 45-53
@router.post("/setup/generate-key", response_model=GeneratedKey, tags=["setup"])
async def generate_key():
    """Generate a new master encryption key."""
    key = generate_master_key()
    instructions = (
        f'Add to your .env file:\n  FLOWFILE_MASTER_KEY="{key}"\n\n'
        "Then restart: docker-compose down && docker-compose up"
    )
    return GeneratedKey(key=key, instructions=instructions)
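A sketch of generating a key during initial setup, assuming the public router is mounted at the application root and setup routes need no authentication (placeholder base URL):

import requests

BASE_URL = "http://localhost:8000"  # assumed placeholder

resp = requests.post(f"{BASE_URL}/setup/generate-key")
resp.raise_for_status()
generated = resp.json()
print(generated["key"])
print(generated["instructions"])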
get_setup_status() async

Get the current setup status of the application.

Source code in flowfile_core/flowfile_core/routes/public.py, lines 33-42
@router.get("/health/status", response_model=SetupStatus, tags=["health"])
async def get_setup_status():
    """Get the current setup status of the application."""
    mode = os.environ.get("FLOWFILE_MODE", "electron")
    master_key_ok = is_master_key_configured()
    return SetupStatus(
        setup_required=not master_key_ok,
        master_key_configured=master_key_ok,
        mode=mode,
    )
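Checking whether setup is still required before starting work, under the same placeholder assumptions:

import requests

BASE_URL = "http://localhost:8000"  # assumed placeholder

status_info = requests.get(f"{BASE_URL}/health/status").json()
if status_info["setup_required"]:
    print("Master key not configured yet; generate one via /setup/generate-key")
else:
    print(f"Ready (mode: {status_info['mode']})")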

secrets

flowfile_core.routes.secrets

Manages CRUD (Create, Read, Update, Delete) operations for secrets.

This router provides secure endpoints for creating, retrieving, and deleting sensitive credentials for the authenticated user. Secrets are encrypted before being stored and are associated with the user's ID.

Functions:

Name Description
create_secret

Creates a new secret for the authenticated user.

delete_secret

Deletes a secret by name for the authenticated user.

get_secret

Retrieves a specific secret by name for the authenticated user.

get_secrets

Retrieves all secret names for the currently authenticated user.

create_secret(secret, current_user=Depends(get_current_active_user), db=Depends(get_db)) async

Creates a new secret for the authenticated user.

The secret value is encrypted before being stored in the database. A secret name must be unique for a given user.

Parameters:

  • secret (SecretInput): A SecretInput object containing the name and plaintext value of the secret. Required.
  • current_user: The authenticated user object, injected by FastAPI. Default: Depends(get_current_active_user)
  • db (Session): The database session, injected by FastAPI. Default: Depends(get_db)

Raises:

  • HTTPException: 400 if a secret with the same name already exists for the user.

Returns:

  • Secret: A Secret object containing the name and the encrypted value.

Source code in flowfile_core/flowfile_core/routes/secrets.py, lines 51-85
@router.post("/secrets", response_model=Secret)
async def create_secret(
    secret: SecretInput, current_user=Depends(get_current_active_user), db: Session = Depends(get_db)
) -> Secret:
    """Creates a new secret for the authenticated user.

    The secret value is encrypted before being stored in the database. A secret
    name must be unique for a given user.

    Args:
        secret: A `SecretInput` object containing the name and plaintext value of the secret.
        current_user: The authenticated user object, injected by FastAPI.
        db: The database session, injected by FastAPI.

    Raises:
        HTTPException: 400 if a secret with the same name already exists for the user.

    Returns:
        A `Secret` object containing the name and the *encrypted* value.
    """
    # Get user ID
    user_id = 1 if os.environ.get("FLOWFILE_MODE") == "electron" else current_user.id

    existing_secret = (
        db.query(db_models.Secret)
        .filter(db_models.Secret.user_id == user_id, db_models.Secret.name == secret.name)
        .first()
    )

    if existing_secret:
        raise HTTPException(status_code=400, detail="Secret with this name already exists")

    # The store_secret function handles encryption and DB storage
    stored_secret = store_secret(db, secret, user_id)
    return Secret(name=stored_secret.name, value=stored_secret.encrypted_value, user_id=str(user_id))
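A hedged client sketch. The docstring describes SecretInput as carrying a name and a plaintext value; the exact field name for the value is assumed below, as are the base URL and token:

import requests

BASE_URL = "http://localhost:8000"  # assumed placeholder
TOKEN = "..."                       # assumed access token

# "name" comes from the handler above; "value" is an assumed field name
# for the plaintext secret described by SecretInput.
resp = requests.post(
    f"{BASE_URL}/secrets",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"name": "openai_api_key", "value": "sk-..."},
)
if resp.status_code == 400:
    print("A secret with this name already exists")
else:
    resp.raise_for_status()
    print(resp.json()["name"])  # the returned value is the encrypted form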
delete_secret(secret_name, current_user=Depends(get_current_active_user), db=Depends(get_db)) async

Deletes a secret by name for the authenticated user.

Parameters:

  • secret_name (str): The name of the secret to delete. Required.
  • current_user: The authenticated user object, injected by FastAPI. Default: Depends(get_current_active_user)
  • db (Session): The database session, injected by FastAPI. Default: Depends(get_db)

Returns:

  • None: An empty response with a 204 No Content status code upon success.

Source code in flowfile_core/flowfile_core/routes/secrets.py, lines 124-141
@router.delete("/secrets/{secret_name}", status_code=204)
async def delete_secret(
    secret_name: str, current_user=Depends(get_current_active_user), db: Session = Depends(get_db)
) -> None:
    """Deletes a secret by name for the authenticated user.

    Args:
        secret_name: The name of the secret to delete.
        current_user: The authenticated user object, injected by FastAPI.
        db: The database session, injected by FastAPI.

    Returns:
        An empty response with a 204 No Content status code upon success.
    """
    # Get user ID
    user_id = 1 if os.environ.get("FLOWFILE_MODE") == "electron" else current_user.id
    delete_secret_action(db, secret_name, user_id)
    return None
get_secret(secret_name, current_user=Depends(get_current_active_user), db=Depends(get_db)) async

Retrieves a specific secret by name for the authenticated user.

Note: This endpoint returns the secret name and metadata but does not expose the decrypted secret value.

Parameters:

  • secret_name (str): The name of the secret to retrieve. Required.
  • current_user: The authenticated user object, injected by FastAPI. Default: Depends(get_current_active_user)
  • db (Session): The database session, injected by FastAPI. Default: Depends(get_db)

Raises:

  • HTTPException: 404 if the secret is not found.

Returns:

  • Secret: A Secret object containing the name and encrypted value.

Source code in flowfile_core/flowfile_core/routes/secrets.py, lines 88-121
@router.get("/secrets/{secret_name}", response_model=Secret)
async def get_secret(
    secret_name: str, current_user=Depends(get_current_active_user), db: Session = Depends(get_db)
) -> Secret:
    """Retrieves a specific secret by name for the authenticated user.

    Note: This endpoint returns the secret name and metadata but does not
    expose the decrypted secret value.

    Args:
        secret_name: The name of the secret to retrieve.
        current_user: The authenticated user object, injected by FastAPI.
        db: The database session, injected by FastAPI.

    Raises:
        HTTPException: 404 if the secret is not found.

    Returns:
        A `Secret` object containing the name and encrypted value.
    """
    # Get user ID
    user_id = 1 if os.environ.get("FLOWFILE_MODE") == "electron" else current_user.id

    # Get secret from database
    db_secret = (
        db.query(db_models.Secret)
        .filter(db_models.Secret.user_id == user_id, db_models.Secret.name == secret_name)
        .first()
    )

    if not db_secret:
        raise HTTPException(status_code=404, detail="Secret not found")

    return Secret(name=db_secret.name, value=db_secret.encrypted_value, user_id=str(db_secret.user_id))
get_secrets(current_user=Depends(get_current_active_user), db=Depends(get_db)) async

Retrieves all secret names for the currently authenticated user.

Note: This endpoint returns the secret names and metadata but does not expose the decrypted secret values.

Parameters:

  • current_user: The authenticated user object, injected by FastAPI. Default: Depends(get_current_active_user)
  • db (Session): The database session, injected by FastAPI. Default: Depends(get_db)

Returns:

  • A list of Secret objects, each containing the name and encrypted value.

Source code in flowfile_core/flowfile_core/routes/secrets.py, lines 24-48
@router.get("/secrets", response_model=list[Secret])
async def get_secrets(current_user=Depends(get_current_active_user), db: Session = Depends(get_db)):
    """Retrieves all secret names for the currently authenticated user.

    Note: This endpoint returns the secret names and metadata but does not
    expose the decrypted secret values.

    Args:
        current_user: The authenticated user object, injected by FastAPI.
        db: The database session, injected by FastAPI.

    Returns:
        A list of `Secret` objects, each containing the name and encrypted value.
    """
    user_id = current_user.id

    # Get secrets from database
    db_secrets = db.query(db_models.Secret).filter(db_models.Secret.user_id == user_id).all()

    # Prepare response model (without decrypting)
    secrets = []
    for db_secret in db_secrets:
        secrets.append(Secret(name=db_secret.name, value=db_secret.encrypted_value, user_id=str(db_secret.user_id)))

    return secrets
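Listing secrets returns names and encrypted values only; decrypted values are never exposed through this route. A sketch under the usual placeholder assumptions:

import requests

BASE_URL = "http://localhost:8000"  # assumed placeholder
TOKEN = "..."                       # assumed access token

resp = requests.get(f"{BASE_URL}/secrets", headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()
for secret in resp.json():
    print(secret["name"])  # values remain encrypted in the response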