#ask-community

Airton Neto

01/03/2023, 6:56 PM
Have you tried clearing the caches involved, like the Docker cache, etc.?

Caio Tavares

01/03/2023, 6:57 PM
I haven't, but would that be relevant when running Dagster in k8s?

Airton Neto

01/03/2023, 7:41 PM
I think you should try this out

daniel

01/03/2023, 7:51 PM
Hi Caio - that "Sending run termination request" should only appear when somebody presses the "Terminate" button in Dagit or terminates a run over the GraphQL API. Are you certain that nobody did that just before that line was logged?

Caio Tavares

01/03/2023, 7:52 PM
@daniel my apologies, the log message is actually
Ignoring a duplicate run that was started from somewhere other than the run monitor daemon
then after that the job just hangs

daniel

01/03/2023, 7:52 PM
I see - what that sounds like to me is that your k8s cluster may be spinning down the pod on which your run is happening and then restarting it
"but then the subsequent jobs won't never resume after this event and hence the pipeline just hangs." - can you share more details about this? What do you mean by "the subsequent jobs" exactly here?
you might find the run monitoring features here useful for automatically failing the run when the k8s cluster decides to kill it: https://docs.dagster.io/deployment/run-monitoring#run-monitoring

Caio Tavares

01/03/2023, 7:55 PM
What I mean by that is the subsequent step in the run won't resume. It's an intermittent issue which happens in the middle of a run

daniel

01/03/2023, 7:55 PM
and run retries useful for kicking off a new run to try to recover: https://docs.dagster.io/deployment/run-retries#run-retries

Caio Tavares

01/03/2023, 7:56 PM
I have enabled runMonitor and was going to give that a try. Let's say a node scale-down event happens in the cluster and hence the pod gets killed. Will `max_resume_run_attempts > 0` try to recover the run?

daniel

01/03/2023, 7:57 PM
that's this bit here: https://docs.dagster.io/deployment/run-monitoring#resuming-runs-after-run-worker-crashes-experimental - which only works in certain situations. I'd recommend using the "Run retries" link (the 2nd one I posted) to make it retry
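For reference, a rough sketch of the two instance-level settings being discussed. The dictionary mirrors the `run_monitoring` and `run_retries` sections that the linked docs describe for `dagster.yaml`. The numeric values are placeholder assumptions, and it is written as a Python dict (dumped as JSON, which YAML parsers also accept) purely for illustration. On a Helm-based k8s deployment these settings would normally be set through the chart's values instead.

```python
import json

# Mirrors the `run_monitoring` and `run_retries` sections of dagster.yaml.
# All numeric values are placeholder assumptions, not recommendations.
instance_settings = {
    "run_monitoring": {
        "enabled": True,
        # Attempts to resume a run whose worker crashed. This feature is
        # experimental and only works in certain situations; 0 leaves it off
        # so that recovery is left to run retries.
        "max_resume_run_attempts": 0,
    },
    "run_retries": {
        "enabled": True,
        # Upper bound on new runs launched after a run fails.
        "max_retries": 3,
    },
}

# JSON is also valid YAML, so this output could be pasted into a YAML file.
print(json.dumps(instance_settings, indent=2))
```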

Caio Tavares

01/03/2023, 7:59 PM

daniel

01/03/2023, 7:59 PM
yeah

Caio Tavares

01/03/2023, 8:04 PM
would that retry the entire run or resume where it was left off?

daniel

01/03/2023, 8:05 PM

Caio Tavares

01/03/2023, 8:13 PM
ty! We'll give it a try
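
On the question of whether a retry reruns the entire run or picks up where it left off: the run retries docs describe a `dagster/retry_strategy` tag, where `FROM_FAILURE` skips ops that already succeeded and `ALL_STEPS` runs everything again. Below is a minimal toy model of that difference. `steps_to_rerun` and the step names are hypothetical helpers for illustration, not Dagster code, and the real selection logic is more involved.

```python
from typing import Dict, List

# Tag names documented for Dagster run retries; the values are examples.
RETRY_TAGS = {
    "dagster/max_retries": "3",
    "dagster/retry_strategy": "FROM_FAILURE",  # or "ALL_STEPS"
}


def steps_to_rerun(step_status: Dict[str, str], strategy: str) -> List[str]:
    """Toy model (not Dagster code) of which steps a retried run executes."""
    if strategy == "ALL_STEPS":
        # Every step runs again, including ones that already succeeded.
        return list(step_status)
    if strategy == "FROM_FAILURE":
        # Successful steps are skipped; everything else runs.
        return [s for s, status in step_status.items() if status != "success"]
    raise ValueError(f"unknown retry strategy: {strategy}")


# Hypothetical run that lost its pod while `transform` was executing.
previous_run = {"extract": "success", "transform": "failed", "load": "not_started"}

print(steps_to_rerun(previous_run, RETRY_TAGS["dagster/retry_strategy"]))
# prints ['transform', 'load']
print(steps_to_rerun(previous_run, "ALL_STEPS"))
# prints ['extract', 'transform', 'load']
```

In an actual deployment the tags would be attached to the job, for example with `@job(tags=RETRY_TAGS)`. `FROM_FAILURE` relies on an IO manager that can read outputs from the earlier run.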