# ask-community
I'm running a large backfill (366 days) using a Helm deployed Dagster 0.14.6 instance and the
. I expect the whole process to take many hours to complete - each job takes ~20 min & I'm limited to running 5 at a time due to K8s resource constraints. When I leave the backfill running overnight, ~1/3 of the jobs are marked as succeeded. The rest get stuck in a
state. I suspect this is because K8s is cleaning up the
s that haven't yet started (due to the aforementioned resource constraints). I enabled Run Monitoring https://docs.dagster.io/deployment/run-monitoring with a
start_timeout_seconds: 14400
and a
max_resume_run_attempts: 8
. My expectation was that the monitoring daemon would (a) detect the jobs that had failed to start, (b) mark them as failed, then (c) launch new jobs to replace them. (a) and (b) happened as expected, but (c) didn't. Have I misunderstood how this is meant to work? Or just configured something wrong?
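For a sense of scale, here is a rough back-of-the-envelope sketch of the numbers in the question. It assumes every run takes exactly 20 minutes and that the 5 slots fill in lock-step waves with no scheduling overhead, which is a simplification; the function name backfill_estimate is made up for illustration.

```python
import math


def backfill_estimate(num_runs, minutes_per_run, max_concurrent):
    """Estimate waves and wall-clock hours for a backfill.

    Assumes (simplification) that runs execute in lock-step waves of
    `max_concurrent`, each wave taking exactly `minutes_per_run`.
    """
    waves = math.ceil(num_runs / max_concurrent)
    total_hours = waves * minutes_per_run / 60
    return waves, total_hours


# Figures taken from the question: 366 daily partitions, ~20 min each, 5 at a time.
waves, hours = backfill_estimate(num_runs=366, minutes_per_run=20, max_concurrent=5)

# The configured start_timeout_seconds, converted to hours for comparison.
start_timeout_hours = 14400 / 3600

print(f"{waves} waves, roughly {hours:.1f} hours in total")
print(f"start timeout: {start_timeout_hours:.0f} hours")
```

Under these assumptions the full backfill needs about a day of wall-clock time, while the start timeout is 4 hours, so most runs have to wait considerably longer than the timeout before a slot frees up.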
Hey david - are you using the k8s_job_executor with the jobs in question here? As per https://docs.dagster.io/deployment/run-monitoring#resuming-runs-after-run-worker-crashes-experimental that would be needed to take advantage of the resuming feature. We're considering a more generic job-level retry feature that would work with all executors though (cc @johann)
Aha - I hadn't grokked that part of the docs.
If I understand correctly, the
would launch each step as a separate K8s job, rather than running all the steps in the same job. Is that true?
@mrdavidlaing Yeah that's right