# ask-community
**g:**
I was very happy to see the new YouTube video and blog post https://dagster.io/blog/partitioned-data-pipelines on partitioned data pipelines, and the reference to dynamic partitions in particular. At 6:38 (https://youtu.be/LFOikWqCOAM?t=398), Sandy mentions running an ML pipeline ad hoc, generating a new partition. Is there code or a more detailed example for this? I'm wondering how this is done specifically.

Would one approach be to specify a partition name (i.e., a run name) and a set of hyperparameters in the run config (edited in the launchpad), or is this an abuse of config? Being able to manually iterate in Dagit would be very useful.

How is it anticipated the hyperparameters would later be queried (for instance, if one were making a graph of accuracy versus some parameter)? Would I populate my own database within the asset op (or use something like MLflow), or is it anticipated that Dagster's facilities provide something natively, i.e., can I use the Dagster database somehow?

What role are metadata and run config intended to play in the system after a run? Does this differ between the SDA and graph/op paradigms? Are they just there to be visible in Dagit, or is it anticipated they'll be queried? If the latter, then how? GraphQL?
**s:**
> Would one approach be to specify a partition name (i.e., a run name) and a set of hyperparameters in the run config (edited in the launchpad), or is this an abuse of config? Being able to manually iterate in Dagit would be very useful.
Yes - exactly. This isn't an abuse of config. In the future, we'd like to find a way to allow these to be more tightly coupled; i.e., if you later want to re-run that experiment, ideally you'd be able to just launch a materialization of that partition without needing to specify that config again. Curious if you have thoughts on the ideal workflow here.
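For concreteness, a minimal sketch of that workflow (all names here are hypothetical): each ad-hoc experiment becomes a new partition of a dynamically partitioned asset, and the hyperparameters ride along in run config edited in the launchpad.

```python
from dagster import DynamicPartitionsDefinition, asset

# Each ad-hoc experiment gets its own partition. New partitions can be
# created from Dagit when launching a materialization, or programmatically
# via DagsterInstance.add_dynamic_partitions("experiments", ["exp-001"]).
experiments = DynamicPartitionsDefinition(name="experiments")

@asset(
    partitions_def=experiments,
    config_schema={"learning_rate": float, "epochs": int},
)
def trained_model(context):
    # Hyperparameters arrive via the run config for this materialization.
    lr = context.op_config["learning_rate"]
    epochs = context.op_config["epochs"]
    context.log.info(
        f"training partition {context.partition_key} with lr={lr}, epochs={epochs}"
    )
    # ... train and persist the model here ...
```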
> How is it anticipated the hyperparameters would later be queried (for instance, if one were making a graph of accuracy versus some parameter)? Would I populate my own database within the asset op (or use something like MLflow), or is it anticipated that Dagster's facilities provide something natively, i.e., can I use the Dagster database somehow?
The interface for this isn't particularly nice right now, but all of this data can be recovered from the Dagster database using the Python APIs. You'd need to do the following (sketched in code below):
- Fetch a list of all the partitions (`DagsterInstance.get_dynamic_partitions`)
- For each partition, find the latest materialization (`DagsterInstance.get_event_records`)
- On each of those materializations, use the run ID to find the run config (`DagsterInstance.get_run_records`)

Would you ideally want to be able to see this visualized in the UI? Or do you imagine you're more likely to want a custom visualization in an internal tool / notebook?
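Stitching those three calls together might look roughly like this (`experiments` and `trained_model` are hypothetical names, and `RunRecord.dagster_run` assumes a recent Dagster 1.x release):

```python
from dagster import (
    AssetKey,
    DagsterEventType,
    DagsterInstance,
    EventRecordsFilter,
    RunsFilter,
)

# Hypothetical names: "experiments" is the DynamicPartitionsDefinition's
# name, and "trained_model" is the partitioned ML asset.
PARTITIONS_DEF_NAME = "experiments"
MODEL_ASSET_KEY = AssetKey("trained_model")

instance = DagsterInstance.get()

configs_by_partition = {}
# 1. Fetch a list of all the partitions.
for partition in instance.get_dynamic_partitions(PARTITIONS_DEF_NAME):
    # 2. For each partition, find the latest materialization.
    records = instance.get_event_records(
        EventRecordsFilter(
            event_type=DagsterEventType.ASSET_MATERIALIZATION,
            asset_key=MODEL_ASSET_KEY,
            asset_partitions=[partition],
        ),
        limit=1,  # records come back newest-first by default
    )
    if not records:
        continue  # partition was added but never materialized
    run_id = records[0].event_log_entry.run_id
    # 3. Use the run ID on that materialization to look up the run config.
    run_record = instance.get_run_records(RunsFilter(run_ids=[run_id]))[0]
    configs_by_partition[partition] = run_record.dagster_run.run_config

# configs_by_partition can now feed a plot of accuracy vs. hyperparameter.
print(configs_by_partition)
```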
Another related question for you: right now, you're essentially required to come up with a partition name for each ad-hoc run. Theoretically, we could infer the name from the config. Would that be helpful for you? Or do you end up giving these names anyway to make them easier to refer back to?
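If names were inferred from config, one plausible scheme (purely illustrative, not an existing Dagster API) would be to hash a canonical form of the run config, so identical hyperparameters always map back to the same partition:

```python
import hashlib
import json

def partition_name_from_config(run_config: dict) -> str:
    # Serialize with sorted keys so logically identical configs produce
    # identical names, then truncate the digest for readability.
    canonical = json.dumps(run_config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# e.g. partition_name_from_config({"learning_rate": 1e-3, "epochs": 10})
# -> a stable 12-character hex name
```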
**g:**
Sandy, thanks so much for taking the time to reply. Having config coupled with materializations somehow would be nice and would align more with how I think of things (not sure if it's the Dagster way!). A pipeline has a fixed topology. The behavior (output/asset materializations) is affected by the input and the config. We want to be able to partition (sorry for overloading the term) the set of pipeline materializations in a way that we know how each subset was generated, where a subset is a "materialization" if I'm understanding correctly. Ideally, we want to capture the input and config generating each materialization in a way we can look back at later to know what was done, and reproduce the materialization if we delete it, without having to separately keep track of the config ourselves. (As an aside, I've been trying to think of all the ways to change the output of an op/asset -- hopefully it's just input and config.)

For the ML model example, the model would have a name, and its config might specify the training data used, in the form of upstream asset keys, as well as the model training parameters. A slightly more complicated example is data augmentation. We have some very expensive data augmentation computations where we save the output. Right now we have a table with a GUID for each row that includes a field for the data (a file) being augmented and other fields defining the parameters of the augmentation. We don't really have a way of dealing with upstream changes to the asset being augmented. With Dagster, we can use a partition key on the upstream asset for the data to be augmented and another partition for the augmentation parameters. Now, I guess the augmentation parameters could go in a table that's partitioned by row and would be an upstream asset to the `augmented_data` asset, which would have a multi-partition key (source_data, augmentation_parameters). However, another option would be to use config for the augmentation parameters, and that would aid in making it more interactive. (Or maybe you use config to add to the upstream parameters interactively. So many knobs!)

So the bottom line is, I think the ideal workflow would let us add partitions via config and keep that config associated with the partition somehow. But we need at least another dimension to keep the "configuration partition" (behavior) decoupled from the input partition; the output is then the combination of the two.

On the UI for visualizing, e.g., accuracy vs. model: I imagine I'm more likely to have a custom tool for plotting that sort of thing, but if there were some hooks for customizing what's in Dagit, I wouldn't turn it down. On that front, maybe I missed it, but being able to display image metadata would be nice (like a thumbnail of a video). Also, in the above construct, being able to see config next to jobs in an easier manner would be nice (rather than clicking to get a popup).

On auto-generating names: we definitely want to be able to specify our own, but having them auto-generated in some cases might be useful -- only a nice-to-have. For some things -- like model names -- we make them deliberately, but for others, like the augmentation parameters above, we generate a GUID name. Not a feature I see as critical.
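A minimal sketch of the multi-partitioned augmentation scheme described above (dimension and asset names are hypothetical, and this assumes your Dagster version allows dynamic dimensions inside a `MultiPartitionsDefinition`; if not, one dimension could be static instead):

```python
from dagster import (
    DynamicPartitionsDefinition,
    MultiPartitionsDefinition,
    asset,
)

# One dynamic dimension per "knob": new source files and new GUID-named
# parameter sets are registered as partitions at runtime.
source_data_partitions = DynamicPartitionsDefinition(name="source_data")
augmentation_partitions = DynamicPartitionsDefinition(name="augmentation_parameters")

@asset(
    partitions_def=MultiPartitionsDefinition(
        {
            "source_data": source_data_partitions,
            "augmentation_parameters": augmentation_partitions,
        }
    )
)
def augmented_data(context):
    # For a multi-partitioned run, the partition key carries one key per
    # dimension, e.g. {"source_data": "file_123",
    #                  "augmentation_parameters": "9f1c..."}.
    keys = context.partition_key.keys_by_dimension
    context.log.info(
        f"augmenting {keys['source_data']} "
        f"with parameters {keys['augmentation_parameters']}"
    )
    # ... run the expensive augmentation and persist the result ...
```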