Is it possible to request GPU instances for specific assets/ops/jobs? #dagster-plus

Charles Lariviere

02/27/2023, 7:56 PM

Hi 👋 Is it possible to request GPU instances for specific assets/ops/jobs (assuming a hybrid deployment)? I'm interested in using Dagster to orchestrate training jobs for deep learning models that require hardware acceleration on AWS. I'm aware of the `ecs/cpu` and `ecs/memory` tags for an ECS deployment with Fargate, but it doesn't seem to be possible to request specific EC2 instances within the Dagster context. Is using Kubernetes and per-job/op tags the recommended solution currently?
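For reference, a minimal sketch of the two tags named in the question. The values are illustrative only, and the usage in the final comment (passing the dictionary as `tags=` on a job definition) is an assumption about how they would typically be attached, not something shown in this thread.

```python
# Per-job resource tags for an ECS (Fargate) deployment, as named in the question.
# The sizes are illustrative; Fargate only accepts certain CPU/memory pairings.
ecs_fargate_tags = {
    "ecs/cpu": "4096",      # CPU units (1024 units = 1 vCPU)
    "ecs/memory": "16384",  # memory in MiB
}

# Assumed usage: pass as `tags=ecs_fargate_tags` on a job definition, e.g. @job(tags=...).
```

Note that neither tag selects an instance type or a GPU, which is the gap the question is about.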

daniel

02/27/2023, 8:03 PM

Hey Charles - I think not being able to use GPUs is an ECS Fargate restriction. Once it's possible in Fargate it should be no problem for us to support in ECS hybrid deployments as well (or if you're using a hybrid ECS deployment on ECS+EC2, it should be an option there today). The easiest way with what's available today is likely to use Kubernetes with tags to route it to an EC2 instance running a GPU, as you mention
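The Kubernetes route suggested above could look roughly like the sketch below. It builds a `dagster-k8s/config` tag that requests a GPU and pins the pod to GPU nodes. The helper name `gpu_k8s_tags` and the node label `workload-type: gpu` are hypothetical, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed on the cluster.

```python
def gpu_k8s_tags(gpu_count: int = 1) -> dict:
    """Build a `dagster-k8s/config` tag that requests NVIDIA GPUs (illustrative helper)."""
    return {
        "dagster-k8s/config": {
            "container_config": {
                # Kubernetes schedules GPUs through resource limits.
                "resources": {"limits": {"nvidia.com/gpu": str(gpu_count)}},
            },
            "pod_spec_config": {
                # Hypothetical label; use whatever label your GPU node group carries.
                "node_selector": {"workload-type": "gpu"},
            },
        }
    }


training_tags = gpu_k8s_tags(1)
# Assumed usage: `tags=training_tags` on a job, or on an individual op when the job
# runs with an executor that launches each op in its own pod.
```

Keeping the tag in a helper means only the jobs or ops that actually need hardware acceleration opt in, while everything else stays on cheaper CPU nodes.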

daniel

02/27/2023, 8:04 PM

the tl;dr is that if it's something you can do today without dagster in the underlying compute environment we're committed to making it possible in dagster as well

Charles Lariviere

02/27/2023, 8:13 PM

Hey Daniel, thanks for your quick reply! Agreed on the Fargate/GPU limitations, though I believe ECS does support GPU instances through registering a managed EC2 cluster in ECS. From the docs, it seems to be a matter of specifying the resource requirements in the task definition, and IIUC Dagster controls the task definition on job runs. Can we define the task definition on jobs from within Dagster in the same way that we can define the `ecs/cpu` and `ecs/memory` for Fargate?

daniel

02/27/2023, 8:14 PM

Yeah, that's the "ECS+EC2" option I mentioned above. We would need to expose that resourceRequirements key as a configuration option, but that would be a quick change on our end
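For context, the `resourceRequirements` key lives on a container definition inside an ECS task definition. The sketch below shows only the plain AWS shape; per this thread, Dagster did not yet expose it as a configuration option. The helper name, container name, and image are hypothetical.

```python
import json


def gpu_container_definition(name: str, image: str, gpu_count: int = 1) -> dict:
    """Return an ECS container definition fragment that reserves GPUs (EC2 launch type)."""
    return {
        "name": name,
        "image": image,
        # ECS expects the GPU count as a string value.
        "resourceRequirements": [{"type": "GPU", "value": str(gpu_count)}],
    }


# Hypothetical container name and image, for illustration only.
definition = gpu_container_definition("training", "example/training:latest")
print(json.dumps(definition, indent=2))
```

A task using this fragment has to be placed on GPU-backed EC2 container instances registered with the cluster, which is why it does not apply to Fargate.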

daniel

02/27/2023, 8:15 PM

(And similarly we could expose it as a tag for a specific run, yeah - that would take a bit more work since I think right now it assumes you can use a single task definition for the whole code location, but that's a solvable problem)

Charles Lariviere

02/27/2023, 8:56 PM

That's great to hear! We'd prefer using ECS over Kubernetes if it did allow for that level of configuration (i.e. running specific steps in a job on GPU instances) and it sounds like it's not supported directly right now, but could be in the future. Thanks @daniel 🙏

daniel

02/27/2023, 8:57 PM

Ah and I think for it to work with individual steps we'd also need to ship the ECS task executor that lets you run each op in its own task - which is definitely on our radar but not going to be available in the short-term. So overall I think k8s is the way to go in an immediate timeframe for this one

Charles Lariviere

02/27/2023, 9:04 PM

Sounds good -- that makes sense. We'll look into k8s in that case. Thanks again!

Matt Clarke

03/03/2023, 11:40 AM

We'd also be interested in the ECS version of this. EKS is a bit unwieldy for a team of our scale, but we have the odd few tasks which would really benefit from GPU acceleration.

Matt Clarke

03/03/2023, 11:41 AM

Not something we're desperate for, but would be useful. Generally speaking the ability to do it per job would be less relevant than per task for our case, as if a task requires a GPU it will always need it
