Have you tried cleaning involved cache? like docke...
# ask-community
a
Have you tried clearing the caches involved, such as the Docker cache?
c
I haven't, but would that be relevant when running Dagster in k8s?
a
I think you should try this out
d
Hi Caio - that "Sending run termination request" should only appear when somebody presses the "Terminate" button in Dagit or terminates a run over the GraphQL API. Are you certain that nobody did that just before that line was logged?
c
@daniel my apologies, the log message is actually
Ignoring a duplicate run that was started from somewhere other than the run monitor daemon
then after that the job just hangs
d
I see - what that sounds like to me is that your k8s cluster may be spinning down the pod on which your run is happening and then restarting it
"but then the subsequent jobs won't never resume after this event and hence the pipeline just hangs." - can you share more details about this? What do you mean by "the subsequent jobs" exactly here?
you might find the run monitoring features here useful for automatically failing the run when the k8s cluster decides to kill it: https://docs.dagster.io/deployment/run-monitoring#run-monitoring
c
What I mean by that is the subsequent step in the run won't resume. It's an intermittent issue which happens in the middle of a run
d
and run retries useful for kicking off a new run to try to recover: https://docs.dagster.io/deployment/run-retries#run-retries
c
I have enabled runMonitor and was going to give that a try. Let's say a node scale-down event happens in the cluster and hence the pod gets killed. Will `max_resume_run_attempts > 0` try to recover the run?
d
that's this bit here: https://docs.dagster.io/deployment/run-monitoring#resuming-runs-after-run-worker-crashes-experimental - which only works in certain situations. I'd recommend using "Run retries" (the 2nd link I posted) to make it retry
c
d
yeah
c
would that retry the entire run or resume where it was left off?
d
c
ty! We'll give it a try
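For reference, a minimal sketch of how the run-retry behavior discussed above might be tuned per job. It assumes the `dagster/max_retries` and `dagster/retry_strategy` run tags described in the run retries docs linked above, and that run retries are enabled on the instance. The helper `retry_tags` and the values shown are hypothetical and not from this thread.

```python
# Hypothetical helper (not part of Dagster): builds the run tags that the
# run-retries feature is documented to read. Values are illustrative only.
VALID_STRATEGIES = {"FROM_FAILURE", "ALL_STEPS"}


def retry_tags(max_retries, strategy="FROM_FAILURE"):
    """Return a tags dict intended to be passed as `tags=` on a job."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    if strategy not in VALID_STRATEGIES:
        raise ValueError("strategy must be one of %s" % sorted(VALID_STRATEGIES))
    return {
        "dagster/max_retries": str(max_retries),
        "dagster/retry_strategy": strategy,
    }


# Example: allow up to 3 retries, re-running every step each time.
tags = retry_tags(3, "ALL_STEPS")
```

These tags would then be attached to a job, e.g. `@job(tags=retry_tags(3))`. As I understand the docs, `FROM_FAILURE` re-executes from the failed steps and `ALL_STEPS` reruns the whole run, but check the linked page for your Dagster version.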