mrdavidlaing 03/31/2022, 5:49 PM
. I expect the whole process to take many hours to complete - each job takes ~20 min & I'm limited to running 5 at a time due to K8s resource constraints. When I leave the backfill running overnight, ~1/3 of the jobs are marked as succeeded. The rest get stuck in a
state. I suspect this is because K8s is cleaning up the
s that haven't yet started (due to the aforementioned resource constraints). I enabled Run Monitoring (https://docs.dagster.io/deployment/run-monitoring) with a
. My expectation was that the monitoring daemon would (a) detect the jobs that had failed to start, (b) mark them as failed, then (c) launch new jobs to replace them. (a) and (b) happened as expected, but (c) didn't. Have I misunderstood how this is meant to work? Or just configured something wrong?
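To make the (a)/(b)/(c) expectation concrete, here is a minimal sketch of one monitoring pass in plain Python. It is only a toy model of the behaviour described above, not Dagster's implementation: `Run`, `monitor_once`, the `relaunch` flag, and the `PENDING`/`RUNNING`/`FAILED` labels are all hypothetical names made up for illustration, and the 180-second timeout is an arbitrary example value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    run_id: str
    status: str               # placeholder labels: "PENDING", "RUNNING", "FAILED"
    seconds_in_status: float  # how long the run has been in its current status


def monitor_once(runs: List[Run], start_timeout_seconds: float, relaunch: bool) -> List[Run]:
    """One hypothetical monitoring pass; returns any replacement runs it created."""
    replacements = []
    for run in runs:
        # (a) detect runs that have not started within the timeout
        if run.status == "PENDING" and run.seconds_in_status > start_timeout_seconds:
            # (b) mark them as failed
            run.status = "FAILED"
            # (c) only create a replacement run if relaunching is switched on
            if relaunch:
                replacements.append(Run(f"{run.run_id}-retry", "PENDING", 0.0))
    return replacements


runs = [Run("a", "PENDING", 600.0), Run("b", "RUNNING", 600.0), Run("c", "PENDING", 30.0)]
new_runs = monitor_once(runs, start_timeout_seconds=180.0, relaunch=False)
print([r.status for r in runs], [r.run_id for r in new_runs])
```

With `relaunch=False` the pass does (a) and (b) but never (c), which matches what I observed; whether the real daemon has any equivalent of `relaunch` is exactly my question.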
daniel 03/31/2022, 9:01 PM
mrdavidlaing 04/01/2022, 8:50 AM
would launch each step as a separate K8s job, rather than running all the steps in the same job. Is that true?
Andrea Giardini 04/01/2022, 12:29 PM