Luke [07/28/2022, 4:38 AM]
Huib Keemink [07/28/2022, 7:21 AM]
Luke [07/28/2022, 3:35 PM]
from pipelines.features import SeasonalFeaturizer

# create feature extraction pipeline from default boilerplate
pipeline = SeasonalFeaturizer()

# inspect default pipeline
# returns `yaml` definition?
# a display method would be nice too (DOT, html, etc.)
pipeline.inspect()

# create custom nodes
...

# modify pipeline:
# drop an unneeded node / step,
# add two new project-specific custom nodes
custom_pipeline = (
    pipeline
    .drop_node(...)
    .add_node(...)
    .add_node(...)
)

# could be local (single process or local Spark)
# or remote (Spark)
results_df = custom_pipeline.run(df, config)

# save transformed data frame
results_df.save(...)
Pipeline nodes could be primitive transformations (`.groupby`), flow control (`.map`), other pipelines, etc.
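The chainable API sketched above could be prototyped in plain Python. This is a minimal, hypothetical sketch (the `Node` and `Pipeline` names and methods are invented for illustration, not an existing library): each modification returns a new immutable `Pipeline`, so `.drop_node(...)` / `.add_node(...)` calls chain naturally, and `run` just applies each node's function in order in a single local process.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical sketch: Node/Pipeline are illustrative, not a real library API.
@dataclass(frozen=True)
class Node:
    name: str
    fn: Callable[[Any], Any]

@dataclass(frozen=True)
class Pipeline:
    nodes: tuple = ()

    def add_node(self, node: Node) -> "Pipeline":
        # return a new Pipeline so calls can be chained
        return Pipeline(self.nodes + (node,))

    def drop_node(self, name: str) -> "Pipeline":
        # drop any node matching the given name
        return Pipeline(tuple(n for n in self.nodes if n.name != name))

    def run(self, data):
        # apply each node's function in order (local, single process)
        for node in self.nodes:
            data = node.fn(data)
        return data

# start from a "boilerplate" pipeline, then customize it by chaining
base = Pipeline().add_node(Node("square", lambda xs: [x * x for x in xs]))
custom = (
    base
    .drop_node("square")
    .add_node(Node("double", lambda xs: [2 * x for x in xs]))
    .add_node(Node("total", sum))
)
print(custom.run([1, 2, 3]))  # → 12
```

A distributed backend (e.g. Spark) would replace the `run` loop, but the fluent construction API could stay the same.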
Sklearn Pipelines, neuraxle, feature_engine, pdpipe, sspipe, Apache Beam, etc. each cover portions of the workflow I'm looking for. Dagster, using just the Python API for pipeline definition and inspection (without Dagit, logging, scheduling, etc.), seems like it could be a fit.
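The "inspect returns a yaml definition" idea from the sketch could look something like this. It is purely illustrative, using only the standard library; the `inspect_pipeline` helper and the node names are invented for the example and are not any library's API.

```python
# Hypothetical sketch of what pipeline.inspect() might emit: render node
# metadata as a YAML-style definition using only string formatting.
def inspect_pipeline(name, nodes):
    """nodes: list of (node_name, node_type) pairs."""
    lines = [f"pipeline: {name}", "nodes:"]
    for node_name, node_type in nodes:
        lines.append(f"  - name: {node_name}")
        lines.append(f"    type: {node_type}")
    return "\n".join(lines)

# example: an imagined seasonal-features pipeline with invented node names
print(inspect_pipeline("seasonal_features", [
    ("resample", "transform"),
    ("lag_features", "transform"),
    ("map_partitions", "flow_control"),
]))
```

The same node metadata could just as easily feed a DOT/HTML display method.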