# dagster-support

Luke

07/28/2022, 4:38 AM
Hi all, I'm looking for a lightweight way to reuse and compose pipelines and transformers for semi-ad-hoc data profiling, analysis, and feature engineering. To start, I'd be interested in defining and executing pipelines within a Databricks notebook context. Future scalability would be a plus, but isn't currently needed. I think Dagster can fit this use case. This talk shows an example of a similar workflow: https://m.youtube.com/watch?v=xyEk-31Ff2w&t=16m40s
I'm curious:
• whether the demo'd approach is still recommended with the current Dagster version
• whether this is a reasonable use case for Dagster
Background: Each pipeline will be run on demand 30 or so times a year. The user will need the ability to modify the pipeline per run, including both parameters and ops. In this way, the pipelines will serve as canonical boilerplate, improving reuse and speed. Orchestration and monitoring are not needed. A pure Python interface for pipeline definition, inspection, and execution is needed. I've had some success with sklearn Pipelines, but quickly hit a wall because they don't support branching / merging paths; I'm looking into neuraxle to solve this. Our team uses Airflow and Prefect for scheduled prod workflows, but those are too heavy here. Similar thread: https://dagster.slack.com/archives/C01U954MEER/p1657110830972549
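For reference, a minimal sketch of the pure-Python, in-notebook usage being described here, assuming the current op/job API (the op names and data are made up; execute_in_process is Dagster's real in-process entry point, so no Dagit, daemon, or scheduler is involved):

from dagster import job, op

@op
def load_raw():
    # stand-in for reading a table / dataframe in the notebook
    return [1, 2, 3]

@op
def profile(rows):
    # hypothetical profiling / feature-engineering step
    return {"count": len(rows)}

@job
def profiling_job():
    profile(load_raw())

# executes synchronously in the notebook process
result = profiling_job.execute_in_process()
assert result.success
print(result.output_for_node("profile"))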

Huib Keemink

07/28/2022, 7:21 AM
Not sure Dagster fits the "very lightweight" category, to be honest; to me it's somewhere between Airflow and Prefect. Why would you not use Databricks' own Workflows for this?

Luke

07/28/2022, 3:35 PM
Databricks Workflows lets you orchestrate notebooks and scripts; I'd like to compose more primitive transformations. This is the type of workflow I'd like to support:
from pipelines.features import SeasonalFeaturizer

# create feature extraction pipeline from default boilerplate
pipeline = SeasonalFeaturizer()

# inspect default pipeline
# returns a `yaml` definition?
# a display method would be nice too (DOT, html, etc.)
pipeline.inspect()

# create custom nodes
. . . 

# modify pipeline:
# drop an unneeded node / step,
# add two new custom, project-specific ones
custom_pipeline = (
    pipeline
    .drop_node( . . . )
    .add_node( . . . )
    .add_node( . . . )
)

# could be local (single process or local Spark)
# or remote (Spark)
results_df = custom_pipeline.run(df, config)

# save transformed data frame
results_df.save( . . . )
Pipeline nodes could be primitive transformations (.groupby), flow control (.map), other pipelines, etc. Sklearn Pipelines, neuraxle, feature_engine, pdpipe, sspipe, Apache Beam, etc. allow for portions of the workflow I'm looking for. Seems like Dagster with just the Python API for pipeline definition and inspection (without dagit, logging, scheduling, etc.) could be a fit.