# deployment-kubernetes
a
Hey all, when using Dagster with k8s, I'm curious what happens if an op uses too much memory on the pod it is running in? That usually shows up as an OOMKilled status on the pod. Will that error be captured by a failure hook? If so, what information will the failure hook receive about the failure? Is it possible to catch the OOM and run a different op without the job failing?
s
I think the behavior depends slightly on whether you are using the K8sExecutionEngine (every op in its own pod) or the default, which I think is the MultiprocessExecutionEngine. What generally happens is that the op will fail, and you'll see a mysterious error message: "op failed without receiving error message". If you are spinning up each op in a new pod, I think it will create a new pod and retry. If you are running all ops in the same pod, it will not retry, because the pod no longer exists, and failure hooks will not fire, because the execution environment is gone. In all cases, if the job fails, it will be marked as a failure, and you can set up a sensor to retry it, or you can use the `dagster/max_retries` and `dagster/retry_strategy` tags to set up a retry policy for the job as a whole or in part.
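A minimal sketch of what those tags look like on a job, assuming run retries are enabled on the deployment; `memory_hungry_op`, `my_k8s_job`, and the choice of `k8s_job_executor` here are illustrative assumptions, not something from the thread:

```python
from dagster import job, op
from dagster_k8s import k8s_job_executor  # op-per-pod execution

@op
def memory_hungry_op():
    ...

# Run-level retry tags: dagster/max_retries caps how many times the run is
# retried, and dagster/retry_strategy chooses between re-running every step
# ("ALL_STEPS") or only the failed/skipped ones ("FROM_FAILURE").
@job(
    executor_def=k8s_job_executor,
    tags={
        "dagster/max_retries": 2,
        "dagster/retry_strategy": "FROM_FAILURE",
    },
)
def my_k8s_job():
    memory_hungry_op()
```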
j
Some caveats here: the `dagster/max_retries` and `dagster/retry_strategy` tags will be available in OSS Dagster 0.15.0, releasing this week (we were able to release them a bit earlier in Cloud, since it's all stuff on our side).
> Is it possible to catch the OOM and run a different op without the job failing?

Not currently, but having a try/catch inside jobs is an interesting idea.
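As a workaround along the lines of the sensor suggestion above, a run-failure sensor could launch a lower-memory fallback job when a run dies this way. This is only a sketch under assumptions: `low_memory_job` and the message check are hypothetical, and depending on the Dagster version the failed run may be exposed differently on the context.

```python
from dagster import (
    RunFailureSensorContext,
    RunRequest,
    job,
    op,
    run_failure_sensor,
)

@op
def low_memory_op():
    ...  # redo the work with a smaller memory footprint (application-specific)

@job
def low_memory_job():
    low_memory_op()

# Fires whenever a monitored run fails; launches the fallback job if the
# failure looks like the generic "op failed without receiving error message"
# case described above (which is how an OOM-killed pod typically surfaces).
@run_failure_sensor(request_job=low_memory_job)
def oom_fallback_sensor(context: RunFailureSensorContext):
    message = context.failure_event.message or ""
    if "without receiving error message" in message:
        yield RunRequest(run_key=context.dagster_run.run_id)
```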
a
Great, thanks for this context. For my use case, being able to see or catch that the job failed with an OOM would be helpful.