I am new to Dagster and trying to determine if it ...
# announcements
d
I am new to Dagster and trying to determine if it is a good fit. I would like to port in our data engineering pipeline, which I currently run in a single ScriptRun in AzureML, where I generate all potential features prior to training a model and then later filter to a specific feature set for each model. I’m hoping to filter the DAG built in dagster for our deployed model so that it only runs the parts of the pipeline we need to generate the features we used in the model. Can dagster be used in reverse to go from leaf columns in a dataframe back to the set of pipeline steps and inputs required?
n
Not really sure I understand what you are asking
You write the DAG code yourself and then it executes
There isn't like a SAT solver running under the hood to dynamically build it for you.
(unless you write one I guess)
d
thank you for replying
so I could build a method on my pipeline that would accept the target column list I need in my final dataframe, traverse the DAG created by dagster for my pipeline, and trim unnecessary branches?
n
Why would it have unnecessary branches though?
I don't understand what you are trying to remove 🙂
If you mean you want to share code between a "train all" and "train just X" mode
Then you would probably have two pipelines that use the same solids
(and depending on the complexity of the pipeline itself, possibly a shared base function or similar)
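A minimal sketch of that idea, written in plain Python rather than against Dagster's API: the step functions stand in for solids, and a shared base function builds both a "train all" and a "train just X" pipeline. Every name here (`load_points`, `sample_grid`, `build_pipeline`, the grid names) is hypothetical and only for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for solids: small, reusable steps.
def load_points() -> List[float]:
    return [1.0, 2.0, 3.0]

def sample_grid(grid_name: str, points: List[float]) -> List[str]:
    # Placeholder for an expensive grid-sampling step.
    return [f"{grid_name}@{p}" for p in points]

def build_pipeline(grid_names: List[str]) -> Callable[[], Dict[str, List[str]]]:
    """Shared base function: returns a pipeline that samples only the given grids."""
    def run() -> Dict[str, List[str]]:
        points = load_points()
        return {name: sample_grid(name, points) for name in grid_names}
    return run

ALL_GRIDS = [f"grid_{i}" for i in range(150)]

# Two pipelines built from the same steps.
train_all = build_pipeline(ALL_GRIDS)
train_subset = build_pipeline(["grid_3", "grid_7"])
```

Both pipelines reuse the same step code; the only difference is the list passed in when the pipeline is constructed, so nothing needs to be removed afterwards.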
d
there is likely a better way to do this given dagster, but we currently pre-compute all possible features we might need at model training time in our feature engineering pipeline, and then we try many possible permutations of features to determine the best model. Once I have that best feature set and am deploying that model to our simulation phase, I don’t want to generate the other 350 features that are not being used in the final model.
n
Right, I guess I don't understand why this would imply unused code in the pipeline though
You wouldn't usually want to hardcode all 350 of those features as a solid invocation
d
one such set of these features is 150 X,Y,Z grids, each of which is quite large. We are attempting to make predictions on our new simulated data points, and if our model uses only 10 of those grids, I would prefer not to have to sample all of our simulation data points against the other ~140 grids
n
Okay? I'm still not understanding what you are looking for. You would write the code that does the thing you want.
I still think you are assuming there is more magic than there actually is 😄
d
fair enough, my initial question was to determine if anyone is doing this, or if I’m missing the point and this is the wrong way to think about/use dagster. As we’ve chatted, I gather that what is necessary is to just write the code myself to prune the DAG and generate a new pipeline given a set of target columns. Is there any show stopper/magic/design assumption that precludes that?
Thank you for helping me, Noah 😄
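As a rough sketch of the pruning described above, here is a plain-Python reverse traversal from target columns back to the steps needed to produce them. It does not use Dagster's API; the `STEPS` table, its step names, and its column names are hypothetical, and it assumes you maintain metadata recording which columns each step produces and which steps it depends on.

```python
from typing import Dict, List, Set

# Hypothetical metadata: the columns each step produces and the steps it needs.
# Steps are assumed to be declared in a valid execution (topological) order.
STEPS: Dict[str, Dict[str, List[str]]] = {
    "load_raw": {"produces": ["x", "y", "z"], "needs": []},
    "sample_grid_a": {"produces": ["grid_a"], "needs": ["load_raw"]},
    "sample_grid_b": {"produces": ["grid_b"], "needs": ["load_raw"]},
    "ratio_ab": {"produces": ["ratio_ab"], "needs": ["sample_grid_a", "sample_grid_b"]},
}

def required_steps(target_columns: List[str]) -> List[str]:
    """Walk backwards from the target columns and return only the steps they need."""
    producer = {col: name for name, step in STEPS.items() for col in step["produces"]}
    missing = [col for col in target_columns if col not in producer]
    if missing:
        raise KeyError(f"no step produces columns: {missing}")

    needed: Set[str] = set()
    stack = [producer[col] for col in target_columns]
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        stack.extend(STEPS[name]["needs"])  # follow upstream dependencies

    # Keep declaration order so the result can be executed front to back.
    return [name for name in STEPS if name in needed]

print(required_steps(["grid_a"]))
```

The returned list could then drive whatever code constructs the pipeline, so the pruning happens before the DAG is built rather than while it runs.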
n
The only thing to be aware of is that Dagster has fairly limited ability to change the DAG at runtime. There are a few options: notably, you can use optional outputs to prune the tree below that point, and there is the new experimental map system in 0.10. But you would usually do a lot of the heavy lifting at the point where the DAG is compiled, rather than when it is run 🙂
The runtime mutation APIs are getting better quickly though
d
thank you, good to know
in this particular use case I was hoping to run the DAG directly on a single node. I would prune the serialized dagster pipeline from feature engineering, I’m guessing just after it’s deserialized in dagster, and then execute that.
n
There are many executor layers available, including single-node direct runs
But that's decoupled from what is being executed 🙂
m
Could memoized solids help here? I.e., rather than pruning the pipeline, only trigger re-execution of a subset of the solids, based on some (custom) logic about whether they contribute to the features you want changed?
d
@mrdavidlaing I think that sounds like the ideal implementation for grids, and likely our biggest win compute-wise at simulation time. Essentially I’m just adding a run-time parameter to the “memoized solid” that dictates which grids to sample? And then at simulation time I set that value to only the grids used in the model?
Thank you both for the help!
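A small sketch of the run-time parameter idea, assuming a list of grid names is passed in at execution time. This is not Dagster's memoization mechanism; `functools.lru_cache` is used here only as a plain-Python stand-in, and `sample_grid`, `sample_selected`, and the grid names are hypothetical.

```python
from functools import lru_cache
from typing import Dict, Iterable, List, Tuple

CALLS: List[str] = []  # records which grids were actually sampled

@lru_cache(maxsize=None)
def sample_grid(grid_name: str, points: Tuple[float, ...]) -> Tuple[float, ...]:
    # Placeholder for an expensive grid lookup; results are cached per (grid, points).
    CALLS.append(grid_name)
    return tuple(p * len(grid_name) for p in points)

def sample_selected(points: Tuple[float, ...], grids: Iterable[str]) -> Dict[str, Tuple[float, ...]]:
    """Sample only the grids named in the run-time parameter `grids`."""
    return {g: sample_grid(g, points) for g in grids}

points = (1.0, 2.0)
sample_selected(points, ["grid_1", "grid_2"])  # samples grid_1 and grid_2
sample_selected(points, ["grid_2", "grid_3"])  # grid_2 comes from the cache
```

At simulation time, the `grids` argument would hold only the grids the deployed model uses, so the other grids are never sampled.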
m
I’d love to see what your DAG graph looks like if this works out :)
d
I’ll follow up when we have it implemented