# announcements
v
Hi all 👋 Is there a standard pattern for initializing singleton-like resources? I'd like to be able to write a resource definition that initializes a given resource only once and reuses that initialization for the resources provided to solids in other executors. This is useful when initializing a resource is side-effecting: e.g., you create an external resource, are given back a UUID for it, and then want that UUID available so you can use that external resource in other parts of the pipeline. (This might be a bit abstract, but I'll add an example in the comments.)
The particular use case I'm looking at enabling is MLflow experiment tracking: I'd like to have a parent experiment run that is the same across the whole pipeline run, and some solids will start new experiment runs that will be registered as children. That way, something like a hyperparameter search where each hyperparameter vector is tested by a different solid instance will be somewhat neatly organized. Right now I'm starting the experiment run as a `@resource` and yielding the generated run, so I can use it in solids. This works as I need for the `in_process` executor, but any others will create a new parent experiment run for each executor process that is started.
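For context, here's roughly what that resource looks like today (a minimal sketch only; it assumes the default MLflow tracking setup):
```python
import mlflow
from dagster import resource

@resource
def mlflow_run(init_context):
    # Start the parent MLflow run for this pipeline run and hand it to solids.
    # With the in_process executor this happens once; multiprocess/celery
    # executors re-initialize the resource per process, so each process ends up
    # starting its own parent run.
    run = mlflow.start_run()
    try:
        yield run
    finally:
        mlflow.end_run()
```
Solids that need child runs can then call `mlflow.start_run(nested=True)` while the parent run is active.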
Since I haven't found any built-in dagster APIs that allow me to do that (are there? should there be?), these are what I think my options are right now:
1. On first experiment run creation, yield an event with the run UUID. Then, on every new initialization, check the event log for the current dagster run and, if an event with the MLflow UUID is found, use that.
2. Create the MLflow experiment run with a tag indicating the current dagster run. On every new initialization, use the MLflow API to search for a run with that tag and use it if present; create it otherwise.
Right now this is quite use-case specific, but I feel like it could be quite a general pattern for stateful resources, so it would be good to have some documentation on a dagster-idiomatic way to do it.
For the record, I just implemented option 2 above successfully.
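A rough sketch of what it looks like (the experiment name and tag key are arbitrary, and `init_context.run_id` may need adjusting depending on your dagster version):
```python
from dagster import resource
from mlflow.tracking import MlflowClient

@resource
def mlflow_parent_run(init_context):
    client = MlflowClient()
    experiment = client.get_experiment_by_name("my_experiment")  # placeholder name
    dagster_run_id = init_context.run_id

    # Reuse an MLflow run already tagged with this dagster run id, if one exists.
    existing = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        filter_string=f"tags.dagster_run_id = '{dagster_run_id}'",
    )
    if existing:
        run = existing[0]
    else:
        # First initialization for this dagster run: create the run and tag it.
        run = client.create_run(
            experiment.experiment_id,
            tags={"dagster_run_id": dagster_run_id},
        )
    yield run
```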
s
I think given the current feature set both of the solutions you came up with are quite good. This notion of stateful resources is interesting. The other use case we’ve thought would come up along this front is a resource that spins up an EMR/Dataproc cluster once per run, shuffles enough state to each solid to boot up a Spark context that talks to that cluster, and then spins it down at the end of the run. Without that abstraction, users would have to have special-purpose solids for spinning the clusters up and down. What would you expect the API/capability to be? My first thought would be to provide a key-value store where you can associate attributes and serializable values with run ids, and allow resources to stash state there. (In your case there would be an `mlflow_experiment_id` attribute.) This facility could easily be abused, which would be my primary reservation to adding it.
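Something like this is what I'd picture, purely as a sketch of the proposal (the `run_kv_store` method and the helper below are invented for illustration; nothing like them exists in dagster today):
```python
from dagster import resource

@resource
def mlflow_parent_run(init_context):
    # Hypothetical: a per-run key-value store for serializable resource state.
    store = init_context.instance.run_kv_store(init_context.run_id)  # does not exist today
    experiment_run_id = store.get("mlflow_experiment_id")
    if experiment_run_id is None:
        experiment_run_id = create_mlflow_run()  # hypothetical helper that creates the run
        store.set("mlflow_experiment_id", experiment_run_id)
    yield experiment_run_id
```
The risk is people stashing arbitrary mutable state there, which is the abuse I'm worried about.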
For your specific case, I would recommend having a special purpose solid at the beginning of the pipeline whose sole responsibility is to create the dagster run id <--> mlflow id mapping. We make no guarantees about the parallelism of resource spinups. You will need to account for potential race conditions where your mlflow resource is booted up in parallel and two mlflow experiments are created.
v
I would recommend having a special purpose solid at the beginning of the pipeline whose sole responsibility is to create the dagster run id <--> mlflow id mapping
I thought about doing this, but it would be annoying to have to pass around the mlflow run ID as an output/input as opposed to having it magically in all the contexts.
We make no guarantees about the parallelism of resource spinups
Good point, I had not thought about that...
a resource that spins up a EMR/Dataproc cluster once per run
That's definitely a good use case. I was actually thinking of the same thing but with a Dask cluster (possibly on kubernetes with the same image as the UCD for our use case), and we'd need the same pattern.
What would you expect the API/capability to be? My first thought would be to provide a key-value store where you can associate attributes and serializable values with run ids, and allow resources to stash state there. (In your case there would be an `mlflow_experiment_id` attribute.) This facility could easily be abused, which would be my primary reservation to adding it.
That sounds like a good API, and abuse could be minimised (while keeping the functionality needed by the use cases we have mentioned) by making the pairs immutable for a given run.
t
Another option is to bring in an external KV store (e.g. Consul/etcd) that you can use to store and retrieve the ID, and do that as part of the resource definition.
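A sketch of that idea, using Redis purely as a stand-in because its Python API is compact (the same pattern works with Consul or etcd; the key and experiment names are placeholders):
```python
import redis
from dagster import resource
from mlflow.tracking import MlflowClient

@resource
def mlflow_parent_run(init_context):
    kv = redis.Redis()  # assumes a reachable KV store
    key = f"mlflow-run-for-dagster-run/{init_context.run_id}"

    client = MlflowClient()
    experiment = client.get_experiment_by_name("my_experiment")  # placeholder
    candidate = client.create_run(experiment.experiment_id)

    # Atomically claim the key; only the first initialization for this dagster run wins.
    if kv.set(key, candidate.info.run_id, nx=True):
        run_id = candidate.info.run_id
    else:
        # Lost the race: discard our candidate and reuse the run created by the winner.
        client.delete_run(candidate.info.run_id)
        run_id = kv.get(key).decode()
    yield run_id
```
The atomic claim (`nx=True` here, CAS in Consul, a transaction in etcd) is what closes the race condition mentioned above.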
s
Oh @victor, to be clear, the special-purpose solid I referred to would jam the mapping into MLflow as you suggested, rather than thread it around using inputs and outputs.
v
Ah, right, so just using a root solid and DAG dependencies as a way to make sure there are no race conditions. It would probably need to be a no-op solid that just triggers the initialization of the resource by requiring it (something like the sketch below). However, I wanted to write the resource in a way that it could be imported from a library without many changes to the actual pipeline, so this is somewhat unsatisfactory. Thanks for the suggestion, though, will definitely look into it.
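Roughly what that root-solid approach could look like with the solid/pipeline API (`mlflow_run` here is a placeholder for whatever resource does the MLflow lookup/creation; the `Nothing` dependency just forces ordering):
```python
from dagster import (
    InputDefinition,
    ModeDefinition,
    Nothing,
    OutputDefinition,
    pipeline,
    resource,
    solid,
)

@resource
def mlflow_run(init_context):
    # Placeholder: create or look up the MLflow parent run for this dagster run.
    yield "mlflow-run-id"

@solid(required_resource_keys={"mlflow_run"}, output_defs=[OutputDefinition(Nothing)])
def init_mlflow(context):
    # No-op: requiring the resource forces its initialization up front,
    # before any parallel solids initialize it themselves.
    pass

@solid(
    required_resource_keys={"mlflow_run"},
    input_defs=[InputDefinition("start", Nothing)],
)
def train(context):
    context.log.info(f"using mlflow run {context.resources.mlflow_run}")

@pipeline(mode_defs=[ModeDefinition(resource_defs={"mlflow_run": mlflow_run})])
def my_pipeline():
    train(start=init_mlflow())
```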