# dagster-kubernetes

Elizabeth

06/28/2021, 1:45 PM
My pipelines are failing frequently with a KeyboardInterrupt error. They are running with the K8sRunLauncher on EKS with Spot instances. The failure occurs at different solids throughout the pipeline and it appears Dagster is unable to restart from this failed state:
Invariant failed, Description: Pipeline run ... in state PipelineRunStatus.FAILURE, expected NOT_STARTED or STARTING
so that once a pod is evicted and a new one created, the job fails to run in the new pod. Our pipelines run for hours, some even for days, and pod eviction is not uncommon. I have seen the suggestions for cluster_autoscaler, which we are not using, and podDisruptionBudgets, also not an option as we are on k8s v1.19. But these are workarounds and do not account for all possible failures. It would be ideal if a pipeline could restart automatically should its pod be evicted. If there is a solution I am missing, please let me know.
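For illustration only, here is a minimal sketch of the behaviour described above: a run that has already reached a failed state cannot be launched again, so an automatic retry would have to create a fresh run for each attempt. Everything in it (`RunStatus`, `Run`, `execute`, `run_with_retries`) is a hypothetical stand-in written in plain Python. None of it is Dagster's API, and it does not talk to Kubernetes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class RunStatus(Enum):
    # Hypothetical statuses, loosely modelled on the names in the error above
    NOT_STARTED = "NOT_STARTED"
    STARTING = "STARTING"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"


@dataclass
class Run:
    run_id: int
    status: RunStatus = RunStatus.NOT_STARTED


def execute(run: Run, body: Callable[[], None]) -> Run:
    """Execute a run once; refuse runs that are already past STARTING."""
    if run.status not in (RunStatus.NOT_STARTED, RunStatus.STARTING):
        raise RuntimeError(
            f"run {run.run_id} in state {run.status.name}, "
            "expected NOT_STARTED or STARTING"
        )
    try:
        body()
        run.status = RunStatus.SUCCESS
    except KeyboardInterrupt:
        # Stand-in for the interrupt seen when the pod is evicted
        run.status = RunStatus.FAILURE
    return run


def run_with_retries(body: Callable[[], None], max_attempts: int = 3) -> List[Run]:
    """Create a fresh run per attempt until one succeeds or attempts run out."""
    history: List[Run] = []
    for attempt in range(1, max_attempts + 1):
        run = execute(Run(run_id=attempt), body)
        history.append(run)
        if run.status is RunStatus.SUCCESS:
            break
    return history
```

Calling `execute` a second time on a run whose status is `FAILURE` raises an error of the same shape as the invariant failure quoted above, whereas `run_with_retries` avoids it by never reusing a failed run.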

Noah K

06/28/2021, 5:33 PM
With IO managers you can do manual retries but I think automatic stuff is still WIP.

Elizabeth

06/28/2021, 5:42 PM
Yes, your IOManager works great and the manual retries are really nice. The Dagit UI for running pipelines is beautifully done, but automatic retries for pipelines that fail due to resource issues are key. If we need manual intervention, it defeats the purpose of automating workflows with Dagster. We’re looking at On-Demand instances instead of Spot, but they are more expensive and still not a great solution. Can you give me any idea of when Dagster may support this?

Noah K

06/28/2021, 5:57 PM
I'm just a fellow user so that's not my department 😄

Elizabeth

06/28/2021, 5:59 PM
oh my goodness 🤣, my bad, sorry about that and thanks @Noah K for your suggestion

Noah K

06/28/2021, 6:00 PM
I'm sure some Elementl folks will be along soon and can give you a better idea of stuff from their side

rex

06/29/2021, 5:21 PM

Noah K

06/29/2021, 5:23 PM
The second of those issues has a ticket filed 🙂

rex

06/29/2021, 6:01 PM
You’re right - here’s a similar issue you can upvote https://github.com/dagster-io/dagster/issues/3705

Elizabeth

06/30/2021, 7:36 PM
Thank you @rex and @Noah K. I upvoted and added to both of those requests. 🤞