#ask-community

Frank Dekervel

05/06/2022, 3:37 PM
Hello, I have a Dagster job with a lot of (dynamic) ops. Some fail (an overloaded Kubernetes cluster evicts op pods). I can then retry from failure, which works nicely. Now far fewer ops still fail (9 instead of 60), and I want to retry once more to complete the last 9 remaining ops. But I can only retry once.
So if I retry a failed job, it works. If I retry a retry of a failed job, I get the above error. The problem is that my Kubernetes cluster is unreliable, so I always need more than 2 tries before everything completes successfully (the job has 1408 ops).
I think it is because the "retry from failure" job only has a partial job graph, and it then cannot reconstruct a new graph from this partial graph.
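For illustration only, here is a minimal pure-Python sketch of the retry chain described above. It is not Dagster code and makes no claim about how Dagster builds its execution plan; `run_ops`, `retry_until_done`, and the random failure model are hypothetical stand-ins. It only shows that each retry works on a subset of the previous attempt's ops, so the second retry starts from an already partial set.

```python
import random


def run_ops(ops, fail_rate, rng):
    """Pretend to execute ops; return the set that failed (e.g. evicted pods)."""
    return {op for op in ops if rng.random() < fail_rate}


def retry_until_done(all_ops, fail_rate=0.05, max_attempts=10, seed=0):
    """Re-run only the failed ops until none fail or attempts run out.

    Returns a list of (attempt, ops_attempted, ops_failed) tuples.
    """
    rng = random.Random(seed)
    remaining = set(all_ops)
    history = []
    for attempt in range(1, max_attempts + 1):
        # Each retry sees only what failed last time: a partial graph.
        failed = run_ops(sorted(remaining), fail_rate, rng)
        history.append((attempt, len(remaining), len(failed)))
        if not failed:
            break
        remaining = failed
    return history


if __name__ == "__main__":
    # 1408 ops, as in the job described above; the failure rate is made up.
    ops = [f"op_{i}" for i in range(1408)]
    for attempt, attempted, failed in retry_until_done(ops):
        print(f"attempt {attempt}: ran {attempted} ops, {failed} failed")
```

In this toy model the third attempt is planned from the second attempt's failures alone, which is the situation the hypothesis above says goes wrong.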

alex

05/06/2022, 3:43 PM
I believe this issue is related. Would you be willing to add a comment with the stack trace (as text) and this context? https://github.com/dagster-io/dagster/issues/6044

Frank Dekervel

05/06/2022, 3:56 PM
Done! It seems to be the same issue indeed.