I am new to Dagster and trying to determine if it is a good dagster #announcements

David Farnan-Williams

02/06/2021, 6:58 PM

I am new to Dagster and trying to determine if it is a good fit. I would like to port in our data engineering pipeline that I run in a single ScriptRun in AzureML where I generate all potential features prior to training a model and then later filter to a specific feature set for each model. I’m hoping to filter the DAG built in dagster in our deployed model to only run the parts of the pipeline we need to generate the features we used in the model. Can dagster be used in reverse to go from leaf columns in a dataframe back to a set of pipeline sets and inputs required?

Noah K

02/06/2021, 7:19 PM

Not really sure I understand what you are asking

Noah K

02/06/2021, 7:20 PM

You write the DAG code yourself and then it executes

Noah K

02/06/2021, 7:20 PM

There isn't like a SAT solver running under the hood to dynamically build it for you.

Noah K

02/06/2021, 7:20 PM

(unless you write one I guess)

David Farnan-Williams

02/06/2021, 7:24 PM

thank you for replying

David Farnan-Williams

02/06/2021, 7:26 PM

so I could build a method on my pipeline that would accept the target column list I need in my final dataframe, and I can traverse the DAG created by dagster for my pipeline and trim unnecessary branches?

Noah K

02/06/2021, 7:26 PM

Why would it have unnecessary branches though?

Noah K

02/06/2021, 7:27 PM

I don't understand what you are trying to remove 🙂

Noah K

02/06/2021, 7:27 PM

If you mean you want to share code between a "train all" and "train just X" mode

Noah K

02/06/2021, 7:28 PM

Then you would probably have two pipelines that use the same solids

Noah K

02/06/2021, 7:28 PM

(and depending on the complexity of the pipeline itself, possibly a shared base function or similar)
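
A minimal sketch of the "two pipelines, same steps, shared base function" idea, in plain Python rather than the Dagster API. The feature functions, `ALL_FEATURES`, and `build_pipeline` are hypothetical names made up for illustration; in Dagster the steps would be solids and the builder would produce pipeline definitions.

```python
from typing import Callable, Dict, List

# Hypothetical feature steps standing in for solids; each reads a raw record.
def feature_a(raw: Dict[str, float]) -> float:
    return raw["x"] + raw["y"]

def feature_b(raw: Dict[str, float]) -> float:
    return raw["x"] * raw["z"]

def feature_c(raw: Dict[str, float]) -> float:
    return raw["y"] - raw["z"]

# Registry of every feature step that the "train all" mode would run.
ALL_FEATURES: Dict[str, Callable[[Dict[str, float]], float]] = {
    "a": feature_a,
    "b": feature_b,
    "c": feature_c,
}

def build_pipeline(names: List[str]) -> Callable[[Dict[str, float]], Dict[str, float]]:
    """Shared base function: wire up only the named steps, decided at build time."""
    steps = {name: ALL_FEATURES[name] for name in names}

    def run(raw: Dict[str, float]) -> Dict[str, float]:
        return {name: step(raw) for name, step in steps.items()}

    return run

# Two "pipelines" that reuse the same step definitions.
train_all = build_pipeline(list(ALL_FEATURES))
train_selected = build_pipeline(["a", "c"])
```

Both callables share the step code; only the wiring differs, and it is fixed when the pipeline is built rather than while it runs.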

David Farnan-Williams

02/06/2021, 7:30 PM

there is likely a better way to do this given dagster, but we currently pre-compute all possible features we might need at model training time in our feature engineering pipeline, and then we try many possible permutations of features to determine what is the best model. Once I have that best feature set and am deploying that model to our simulation phase, I don’t want to generate the other 350 features that are not being used in the final model.

Noah K

02/06/2021, 7:31 PM

Right, I guess I don't understand why this would imply unused code in the pipeline though

Noah K

02/06/2021, 7:31 PM

You wouldn't usually want to hardcode all 350 of those features as a solid invocation

David Farnan-Williams

02/06/2021, 7:36 PM

one such set of features is 150 X,Y,Z grids, each of which is quite large. For the new simulated data points we are attempting to predict, if our model only uses 10 of those grids, I would prefer not to have to sample all of our simulation data points against the other ~140 grids

Noah K

02/06/2021, 7:36 PM

Okay? I'm still not understanding what you are looking for. You would write the code that does the thing you want.

Noah K

02/06/2021, 7:36 PM

I still think you are assuming there is more magic than there actually is 😄

David Farnan-Williams

02/06/2021, 7:40 PM

fair enough, my initial question was to determine if someone is doing this or if I’m missing the point and this is the wrong way to think about/use dagster. As we’ve chatted I gather that what is necessary is to just write the code myself to prune the DAG and generate a new pipeline given a set of target columns. Is there any showstopper/magic/design assumption that precludes that?
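
A rough sketch of what "write the code myself to prune the DAG" could look like: walk backwards from the target columns and keep only the upstream steps. This is plain Python, not a Dagster API. The step names, `DEPENDENCIES`, and `COLUMN_PRODUCER` are hypothetical, and it assumes you maintain the step dependencies and the column-to-step mapping yourself.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each step lists the steps whose output it consumes.
DEPENDENCIES: Dict[str, List[str]] = {
    "load_points": [],
    "sample_grid_1": ["load_points"],
    "sample_grid_2": ["load_points"],
    "sample_grid_3": ["load_points"],
    "ratio_1_2": ["sample_grid_1", "sample_grid_2"],
}

# Hypothetical map from an output column to the step that produces it.
COLUMN_PRODUCER: Dict[str, str] = {
    "grid_1": "sample_grid_1",
    "grid_2": "sample_grid_2",
    "grid_3": "sample_grid_3",
    "ratio_1_2": "ratio_1_2",
}

def required_steps(target_columns: List[str]) -> Set[str]:
    """Walk backwards from the target columns and collect every upstream step."""
    needed: Set[str] = set()
    stack = [COLUMN_PRODUCER[column] for column in target_columns]
    while stack:
        step = stack.pop()
        if step in needed:
            continue  # already visited via another branch
        needed.add(step)
        stack.extend(DEPENDENCIES[step])
    return needed

def prune(target_columns: List[str]) -> Dict[str, List[str]]:
    """Return the dependency map restricted to the steps the targets need."""
    keep = required_steps(target_columns)
    return {step: deps for step, deps in DEPENDENCIES.items() if step in keep}
```

The pruned map could then drive whatever code builds the pipeline definition, so the trimming happens before execution rather than during it.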

David Farnan-Williams

02/06/2021, 7:40 PM

Thank you for helping me, Noah 😄

Noah K

02/06/2021, 7:42 PM

The only thing to be aware of is that Dagster has fairly limited ability to change the DAG at runtime. There are a few ways, notably optional outputs, which prune the tree below that point, and the new experimental map system in 0.10. But you would usually do a lot of the heavy lifting at the point where the DAG is compiled, rather than when it is run 🙂

Noah K

02/06/2021, 7:42 PM

The runtime mutation APIs are getting better quickly though

David Farnan-Williams

02/06/2021, 7:43 PM

thank you, good to know

David Farnan-Williams

02/06/2021, 7:47 PM

in this particular use case I was hoping to run the DAG directly on a single node. I would prune the serialized dagster pipeline from feature engineering, I’m guessing just after it’s deserialized in dagster, and then execute that.

Noah K

02/06/2021, 7:51 PM

There are many executor layers available, including single-node direct runs

Noah K

02/06/2021, 7:51 PM

But that's decoupled from what is being executed 🙂

mrdavidlaing

02/06/2021, 7:54 PM

Could memoized solids help here? I.e., rather than pruning the pipeline, only trigger re-execution of a subset of the solids based on some (custom) logic about whether they contribute to the features you want changed?

David Farnan-Williams

02/06/2021, 7:57 PM

@mrdavidlaing I think that sounds like the ideal implementation for grids, and likely our biggest win compute-wise at simulation time. Essentially I’m just adding a run time parameter to the “memoized solid” that dictates which grids to sample? And then at simulation time, changing that value to only the grids used in the model?
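
A small sketch of the "run time parameter that dictates which grids to sample" idea, again in plain Python rather than Dagster code. `sample_grids`, the 2D grid layout, and the integer index points are assumptions for illustration; in Dagster the `grid_ids` list would presumably be supplied through the solid's config.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[int, int]    # hypothetical (row, column) index into a grid
Grid = List[List[float]]   # hypothetical 2D grid of values

def sample_grids(
    points: Sequence[Point],
    grids: Dict[str, Grid],
    grid_ids: Sequence[str],
) -> Dict[str, List[float]]:
    """Sample every point from only the requested grids; other grids are never read."""
    return {
        grid_id: [grids[grid_id][row][col] for row, col in points]
        for grid_id in grid_ids
    }

grids = {
    "g1": [[1.0, 2.0], [3.0, 4.0]],
    "g2": [[5.0, 6.0], [7.0, 8.0]],
    "g3": [[9.0, 10.0], [11.0, 12.0]],
}
points = [(0, 0), (1, 1)]

# Training time: sample everything. Simulation time: only the grids the model uses.
training_features = sample_grids(points, grids, list(grids))
simulation_features = sample_grids(points, grids, ["g1"])
```

The step stays the same in both modes; only the parameter value changes, so no pipeline restructuring is needed for this part.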

David Farnan-Williams

02/06/2021, 8:02 PM

Thank you both for the help!

mrdavidlaing

02/07/2021, 3:46 PM

I’d love to see what your DAG graph looks like if this works out :)

David Farnan-Williams

02/08/2021, 3:42 AM

I’ll follow up when we have it implemented
