# deployment-kubernetes
a
Hey all, when using Dagster with k8s, I'm curious what happens if an op uses too much memory on the pod it is running in? That usually shows up as an OOMKilled status on the pod. Will that error be captured by a failure hook? If so, what information will the failure hook receive about the failure? Is it possible to catch the OOM and run a different op without the job failing?
s
I think the behavior depends slightly on whether you are using the K8sExecutionEngine (every op in its own pod) or the default, which I think is the MultiprocessExecutionEngine. What generally happens is that the op will fail, and you'll see a mysterious error message: "op failed without receiving error message". If you are spinning up each op in a new pod, I think it will create a new pod and retry. If you are running all ops in the same pod, it will not retry, because the pod no longer exists, and failure hooks will not fire, because the execution environment is gone. In all cases, if the job fails, it will be marked as a failure, and you can set up a sensor to retry it, or you can use the `dagster/max_retries` and `dagster/retry_strategy` tags to set up a retry policy for the job as a whole or in part.
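A minimal sketch of what those tags look like on a job, assuming run retries are enabled on the deployment; `memory_hungry_op`, `my_k8s_job`, and the choice of `k8s_job_executor` here are illustrative assumptions, not something from the thread:

```python
from dagster import job, op
from dagster_k8s import k8s_job_executor  # op-per-pod execution

@op
def memory_hungry_op():
    ...

# Run-level retry tags: dagster/max_retries caps how many times the run is
# retried, and dagster/retry_strategy chooses between re-running every step
# ("ALL_STEPS") or only the failed/skipped ones ("FROM_FAILURE").
@job(
    executor_def=k8s_job_executor,
    tags={
        "dagster/max_retries": 2,
        "dagster/retry_strategy": "FROM_FAILURE",
    },
)
def my_k8s_job():
    memory_hungry_op()
```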
j
Some caveats here: the `dagster/max_retries` and `dagster/retry_strategy` tags will be available in OSS Dagster 0.15.0, releasing this week (we were able to release them a bit earlier in Cloud, since it's all stuff on our side).
> Is it possible to catch the OOM and run a different op without the job failing?

Not currently, but having a try/catch inside jobs is an interesting idea.
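As a workaround along the lines of the sensor suggestion above, a run-failure sensor could launch a lower-memory fallback job when a run dies this way. This is only a sketch under assumptions: `low_memory_job` and the message check are hypothetical, and depending on the Dagster version the failed run may be exposed differently on the context.

```python
from dagster import (
    RunFailureSensorContext,
    RunRequest,
    job,
    op,
    run_failure_sensor,
)

@op
def low_memory_op():
    ...  # redo the work with a smaller memory footprint (application-specific)

@job
def low_memory_job():
    low_memory_op()

# Fires whenever a monitored run fails; launches the fallback job if the
# failure looks like the generic "op failed without receiving error message"
# case described above (which is how an OOM-killed pod typically surfaces).
@run_failure_sensor(request_job=low_memory_job)
def oom_fallback_sensor(context: RunFailureSensorContext):
    message = context.failure_event.message or ""
    if "without receiving error message" in message:
        yield RunRequest(run_key=context.dagster_run.run_id)
```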
a
Great, thanks for this context. For my use case, being able to see or catch that the job failed with an OOM would be helpful.