David Baur

07/09/2021, 9:18 PM
Hello! Sorry in advance for the wall of text; tldr is in bold at the end. I’m looking for a system to orchestrate my data analysis workflows, and dagster seems like the best option except for one detail I’m struggling with. The most natural way for me to decompose my analysis into reusable functional steps/solids requires some simple logic at the dag/pipeline level to determine which steps need to be included and how they need to be connected. This logic would be simple loops and conditionals driven by configuration parameters known at launch time; it would not depend on any outputs or anything else that could change during runtime.

When I look at the dagster documentation, I see that there are dynamic pipeline features for mapping solids over a dynamic output, but this only covers some of the variability I’m looking for. It would also seem to require creating extra solids purely to convert configuration parameters into outputs, which gives me the impression that this feature is intended for things that vary during runtime rather than at launch time. I also see features like lazily loaded pipeline definitions and reconstructable pipelines that defer pipeline generation until launch time, but those functions still don’t seem to have access to the context or configuration objects. There is an example in the docs showing a pipeline being programmatically built up from a yaml file (perhaps the most extreme version of dynamic pipeline definition possible), but even in that case the file being referenced is static/hardcoded. I got excited when I saw that composite solids can take configuration, but I can’t figure out how to actually access that configuration when deciding which solids to run.

The basic functionality I’m looking for here doesn’t seem like a strange ask, and it’s so close to being provided by some of these features that I think I’m probably missing something obvious.
Can anyone suggest a way to generate a pipeline as a function of the run configuration? Thanks!

Bryan Johnson

07/10/2021, 1:55 AM
I haven’t tried using the newer dynamic features yet … but one way I’ve hacked dynamic sub-graphs is to make those portions of the pipeline Optional inputs/outputs and then on all downstream tasks do the following:
if input is not None:
    # do normal logic
    yield Output(actual_output, "var_name")
else:
    yield Output(None, "var_name")
It’s quite hacky from a typing standpoint, but does work for skipping sub-graphs. There is probably a more correct way to do this.
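Stripped of the Dagster machinery, the hack above is just None-propagation through a chain of steps. A minimal Dagster-free sketch of that pattern (all names here are illustrative, not Dagster API):

```python
# Each step passes None along unchanged when the optional branch
# was skipped upstream, and only runs its real logic otherwise.
def optional_step(upstream, fn):
    return None if upstream is None else fn(upstream)

enabled = 5          # the optional branch produced a value
for fn in (lambda x: x + 1, lambda x: x * 2):
    enabled = optional_step(enabled, fn)
# enabled == 12

skipped = None       # the optional branch was skipped upstream
for fn in (lambda x: x + 1, lambda x: x * 2):
    skipped = optional_step(skipped, fn)
# skipped is None
```

In Dagster itself, the same skip behavior maps onto yielding (or not yielding) an `Output` from each solid, as in the snippet above.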

alex

07/12/2021, 3:49 PM
There is currently not a way to use the dagster run configuration or context objects to control the pipeline structure. The example from the docs about programmatic pipeline creation is the direction others have taken to support this type of functionality. The pipeline-structure-config --> pipeline-to-be-run translation has to happen at a layer above dagster.
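That "layer above dagster" translation might look something like the following sketch, where launch-time parameters are turned into an explicit step list and dependency edges before any Dagster objects are built. Every name here (`build_dag_spec`, `normalize`, `n_models`, the step names) is illustrative, not Dagster API:

```python
# Translate run parameters into a DAG spec: ordered step names plus
# (upstream, downstream) dependency edges.
def build_dag_spec(params):
    steps = ["load"]
    edges = []
    upstream = "load"
    if params.get("normalize", False):
        steps.append("normalize")
        edges.append(("load", "normalize"))
        upstream = "normalize"
    # Fan out one training step per requested model.
    for i in range(params.get("n_models", 1)):
        name = f"train_{i}"
        steps.append(name)
        edges.append((upstream, name))
    return steps, edges
```

The resulting spec could then be handed to Dagster's programmatic construction APIs (e.g. building a pipeline definition from solid definitions and dependencies) before submission.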

David Baur

07/12/2021, 5:06 PM
Thank you both for the replies. @alex I’d be happy to have our system carry out the internal business logic to arrive at a final dag structure right before job submission, but I’m still a little confused about how to communicate that dag structure to dagster. In the approaches you’re referring to that others have taken, are they typically generating a fixed set of dags well in advance of job submission and placing those files on their dagit instance, or is there a way to include the “pipeline DSL” text as part of the job submission?

alex

07/12/2021, 5:22 PM
In the approaches you’re referring to that others have taken, are they typically generating a fixed set of dags well in advance of job submission and placing those files on their dagit instance, or is there a way to include the “pipeline DSL” text as part of the job submission?
The most dynamic system I have seen used a database as the means of storing the “pipeline DSL”; a fixed repository definition would then fetch from the DB to create the working set of pipelines. In that example they authored the pipelines in a web GUI that persisted to the database. I have not seen an example of more ephemeral pipelines, and I cannot think of any way to use the existing abstractions to stash the “pipeline DSL” in dagster-managed places. You would need to manage the working set available to the fixed repository location that the dagster machinery is pointed at.
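A rough sketch of that DB-backed pattern (all names here are made up for illustration): pipeline "DSL" rows live in a store, and a fixed repository-style function rebuilds the working set of pipelines each time it loads.

```python
# Stand-in for the database of pipeline specs authored in the web GUI.
FAKE_DB = {
    "nightly_etl": {"steps": ["extract", "transform", "load"]},
    "ml_refresh": {"steps": ["fetch", "train"]},
}

def fetch_specs(db):
    # Stand-in for a real database query.
    return {name: row["steps"] for name, row in db.items()}

def build_working_set(db):
    # In Dagster this would construct pipeline definitions inside a fixed
    # repository function; here we just materialize linear dependency edges.
    return {
        name: list(zip(steps, steps[1:]))
        for name, steps in fetch_specs(db).items()
    }
```

Because the repository function re-queries the store on load, updating a row in the database changes the working set the next time the repository is loaded, without editing any pipeline code.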

Bryan Johnson

07/12/2021, 5:36 PM
Is this generally a design constraint, where dagster DAGs are “compile-time”, whereas something like Luigi allows overriding the “requires” method on a task, which is a “runtime” definition?

alex

07/12/2021, 5:57 PM
yep

David Baur

07/12/2021, 6:03 PM
Thanks for the elaboration Alex, that makes a lot of sense. I'm not sure if I can make "compile-time" DAGs work for my specific use case, but I'll definitely at least keep dagster in mind for future projects.
👍 1