mrdavidlaing
03/31/2022, 5:49 PM
K8sRunLauncher.
I expect the whole process to take many hours to complete - each job takes ~20 min & I'm limited to running 5 at a time due to K8s resource constraints.
When I leave the backfill overnight, ~1/3 of the jobs are marked as succeeded. The rest get stuck in a STARTING state.
I suspect this is because K8s is cleaning up the Jobs that haven't yet started (due to the aforementioned resource constraints).
I enabled Run Monitoring (https://docs.dagster.io/deployment/run-monitoring) with start_timeout_seconds: 14400 and max_resume_run_attempts: 8.
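For reference, a minimal sketch of what those two settings might look like in dagster.yaml, assuming the instance-level run_monitoring block described in the linked docs. The enabled: true line is an assumption (the message only says monitoring was enabled), and a Helm-based deployment would set the equivalent values through its chart values instead.

```yaml
# Sketch only: run monitoring settings as described in the message above.
run_monitoring:
  enabled: true                  # assumed; the message only says monitoring was turned on
  start_timeout_seconds: 14400   # 4 hours before a run stuck in STARTING is marked failed
  max_resume_run_attempts: 8     # value quoted in the message
```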
My expectation was that the monitoring daemon would (a) detect the jobs that had failed to start, (b) mark them as failed, then (c) start new jobs to replace them.
(a) and (b) happened as expected, but (c) didn't.
Have I misunderstood how this is meant to work? Or just configured something wrong?
daniel
03/31/2022, 9:01 PM
mrdavidlaing
04/01/2022, 8:50 AM
k8s_job_executor would launch each step as a separate K8s job, rather than running all the steps in the same job.
Is that true?
Andrea Giardini
04/01/2022, 12:29 PM