# ask-community
b
👋 Hello, team! First time posting here: thanks for the resource to ask questions. Hopefully this is an easy one. I have a repo that uses a sensor to detect new files in a folder. For each new file it runs a job of ops on it. Some of those ops need to load an ML model that takes a long time and eats up resources on the GPU. Since the processing of the files is all decoupled, I'm hoping to somehow load the model just once and have all the inference done with that single model load. Question: is this the use case for Resources (which would be great), or do I need to drop into devops and start spinning up something like Docker images exposing APIs?
j
hey @Bryan Wood! you have a couple of options:
1. pass the model as a dependency between the ops
```python
@job
def my_job():
    model = get_ml_model()
    out_1 = op_1(model)
    out_2 = op_2(model)
    ...
```
2. depending on what trade-offs you're willing to make a resource could work as well. by default dagster runs jobs in multi-process mode and each process gets an instantiation of each resource. so you'd still end up loading the ml model multiple times. If you set your jobs so that they only run in a single process, you could have a resource that loads the model into memory and then access the model in each op
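For intuition on that per-process trade-off, here's a minimal plain-Python sketch (not Dagster API; `get_ml_model`, `LOAD_COUNT`, and the dict it returns are stand-ins for a real loader): caching at module scope means each process pays the load cost at most once, which is what a resource buys you under a single-process executor.

```python
from functools import lru_cache

# Track how many times the expensive load actually runs (for illustration only).
LOAD_COUNT = 0

@lru_cache(maxsize=1)
def get_ml_model():
    """Expensive model load; cached so repeat calls in the same process reuse it."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"weights": "stand-in for the real model object"}

m1 = get_ml_model()
m2 = get_ml_model()
# m1 is m2: the model was loaded once in this process. A separate worker
# process (multiprocess mode) would still pay the load cost for its own copy.
```

The same caching trick also works as a stopgap inside a multiprocess job: each worker process loads once rather than once per op invocation.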
b
I think option 1 is clever ... I actually don't need the same model in different ops but ... hmm ... wait, the sensor is run at the job level per file, so that won't help, or am I mistaken?
Appreciate the quick response by the way ... thanks ... I know it's super late depending on where you are ... none of this is urgent btw
Maybe I need to rethink what I'm doing? High level: a sensor watches a directory / s3 / whatever for changes, any new / changed file gets sent in, a bunch of stuff happens, and that's the end ... what I was doing seemed like a good fit for how I was doing it, but maybe not
So ... maybe I am back to having to use something like k8s or Docker compose to spin up resources so that they aren't loaded each time an op in a job gets executed?
If that's the case, since I see you're with Dagster, I know that the Streamlit folks went through something similar-ish ... y'all should think about having something like a "singleton" / "shared resource" / "whatever" ... or just be clear about the slice of the devops / mlops space you're carving out (which wasn't clear to me ... if it was just ELT / ETL I'd have just moved to Docker compose / k8s for the questions I've posed)
Thanks a lot for your thoughts and your time ... have a great night!
p
I think the trade-offs that Jamie mentioned are the key here… Is it okay that all of the ops run in a single process, sequentially? If so, then both options Jamie mentioned are good (model as op dependency or model as resource). You can configure your job to run in a single process like so:
```python
@job(executor_def=in_process_executor)
def my_job():
    ...
```
This will only initialize each resource once, at the start of the job run. If you are using the default multiprocess executor, you’ll still have to select which ops incur the loading cost by using `required_resource_keys`:
```python
@op(required_resource_keys={'model'})
def my_op_that_needs_the_model(context):
    # does something with context.resources.model;
    # will incur the resource initialization cost
    ...

@op
def my_other_op():
    # will not have an initialized resource
    ...

@job(resource_defs={'model': my_model_loading_resource})
def my_multiprocess_job():
    my_op_that_needs_the_model()
    my_other_op()
```
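To illustrate the idea behind `required_resource_keys` — that only the ops which declare a key pay its initialization cost — here is a plain-Python sketch (not Dagster internals; `LazyResources`, `load_model`, and `LOADS` are hypothetical names) of lazy, cached resource initialization:

```python
class LazyResources:
    """Initializes each resource factory only on first access, then caches it."""
    def __init__(self, factories):
        self._factories = factories
        self._cache = {}

    def __getattr__(self, key):
        # Called only when normal attribute lookup fails, i.e. for resource keys.
        if key not in self._cache:
            self._cache[key] = self._factories[key]()  # init cost paid here, once
        return self._cache[key]

# Count loads so we can see exactly when the expensive init happens.
LOADS = {"model": 0}

def load_model():
    LOADS["model"] += 1
    return "big-gpu-model"

resources = LazyResources({"model": load_model})

def op_without_model(resources):
    return "ran without touching the model"   # never triggers the load

def op_that_needs_model(resources):
    return f"ran with {resources.model}"      # first access triggers the load

first = op_without_model(resources)   # model still not loaded here
second = op_that_needs_model(resources)
```

The design point this mirrors: declaring the key is what opts an op into the cost, so ops that never touch the model stay cheap even in a multiprocess run.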
b
I should have said before that this helped a lot: thanks!