Elizabeth
06/28/2021, 1:45 PM
KeyboardInterrupt error. They are running with the K8sRunLauncher on EKS with Spot instances.
The failure occurs at different solids throughout the pipeline, and it appears Dagster is unable to restart from this failed state:
Invariant failed, Description: Pipeline run ... in state PipelineRunStatus.FAILURE, expected NOT_STARTED or STARTING
so once a pod is evicted and a new one is created, the job fails to run in the new pod.
Our pipelines run for hours, some even for days, and pod eviction is not uncommon.
I have seen the suggestions for cluster_autoscaler, which we are not using, and podDisruptionBudgets, which are also not an option as we are on k8s v1.19. But these are workarounds and do not account for all possible failures.
It would be ideal if a pipeline could restart automatically should its pod be evicted.
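To make the ask concrete, here is a rough toy model of the behaviour described above. It is only a sketch and not Dagster code: RunStatus, Run, PodEvicted, launch and launch_with_retries are all hypothetical names made up for illustration. It assumes a failed run cannot be launched again under the same run record (mirroring the invariant error quoted above), so an automatic restart has to create a fresh run for each attempt.

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import count


class RunStatus(Enum):
    # Hypothetical statuses, loosely modelled on the names in the error message.
    NOT_STARTED = "NOT_STARTED"
    STARTING = "STARTING"
    STARTED = "STARTED"
    FAILURE = "FAILURE"
    SUCCESS = "SUCCESS"


_run_ids = count(1)


@dataclass
class Run:
    pipeline: str
    status: RunStatus = RunStatus.NOT_STARTED
    run_id: int = field(default_factory=lambda: next(_run_ids))


class PodEvicted(Exception):
    """Stands in for the interruption seen when a Spot node is reclaimed."""


def launch(run, work):
    # Same shape as the invariant in the error: only a fresh run may be launched.
    if run.status not in (RunStatus.NOT_STARTED, RunStatus.STARTING):
        raise RuntimeError(
            f"Run {run.run_id} in state {run.status}, "
            "expected NOT_STARTED or STARTING"
        )
    run.status = RunStatus.STARTED
    try:
        work()
    except PodEvicted:
        run.status = RunStatus.FAILURE
        raise
    run.status = RunStatus.SUCCESS


def launch_with_retries(pipeline, work, max_attempts=3):
    # Each attempt gets a brand-new Run instead of reusing the failed one.
    runs = []
    for _ in range(max_attempts):
        run = Run(pipeline)
        runs.append(run)
        try:
            launch(run, work)
            return runs
        except PodEvicted:
            continue
    return runs
```

Note that this toy version restarts the whole pipeline from the beginning on every attempt. For jobs that run for hours or days, resuming from the failed solid would need the outputs of completed solids to be persisted somewhere that survives the pod, which is outside what this sketch covers.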
If there is a solution I am missing please let me know.
Noah K
06/28/2021, 5:33 PM
Elizabeth
06/28/2021, 5:42 PM
Noah K
06/28/2021, 5:57 PM
Elizabeth
06/28/2021, 5:59 PM
Noah K
06/28/2021, 6:00 PM
rex
06/29/2021, 5:21 PM
Noah K
06/29/2021, 5:23 PM
rex
06/29/2021, 6:01 PM
Noah K
06/29/2021, 6:07 PM
Elizabeth
06/30/2021, 7:36 PM