I'm running a large backfill (366 days) using a Helm-deployed Dagster #ask-community


mrdavidlaing

03/31/2022, 5:49 PM

I'm running a large backfill (366 days) using a Helm-deployed Dagster 0.14.6 instance and the K8sRunLauncher. I expect the whole process to take many hours to complete: each job takes ~20 min and I'm limited to running 5 at a time due to K8s resource constraints. When I leave the backfill overnight, ~1/3 of the jobs are marked as succeeded. The rest get stuck in a STARTING state. I suspect this is because K8s is cleaning up the Jobs that haven't yet started (due to the aforementioned resource constraints). I enabled Run Monitoring (https://docs.dagster.io/deployment/run-monitoring) with start_timeout_seconds: 14400 and max_resume_run_attempts: 8. My expectation was that the monitoring daemon would (a) detect the jobs that had failed to start, (b) mark them as failed, then (c) restart new jobs. (a) and (b) happened as expected, but (c) didn't. Have I misunderstood how this is meant to work? Or just configured something wrong?
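For a sense of scale, here is a rough back-of-the-envelope sketch of the numbers in the message above (366 runs, ~20 min each, 5 at a time, a start timeout of 14400 seconds). The 8-hour "overnight" window and the helper names are assumptions for illustration only, and the sketch assumes runs proceed in full waves of 5 with no scheduling overhead.

```python
import math

# Figures taken from the message above.
TOTAL_RUNS = 366
MINUTES_PER_RUN = 20
MAX_CONCURRENT = 5
START_TIMEOUT_SECONDS = 14400

# Assumption (not stated in the thread): "overnight" is about 8 hours.
OVERNIGHT_HOURS = 8


def backfill_hours(total_runs, minutes_per_run, max_concurrent):
    """Wall-clock hours if runs execute in full waves of max_concurrent."""
    waves = math.ceil(total_runs / max_concurrent)
    return waves * minutes_per_run / 60


def runs_done_after(hours, minutes_per_run, max_concurrent):
    """Number of runs finished after the given hours, counting whole waves only."""
    completed_waves = int(hours * 60 // minutes_per_run)
    return completed_waves * max_concurrent


total_hours = backfill_hours(TOTAL_RUNS, MINUTES_PER_RUN, MAX_CONCURRENT)
done_overnight = runs_done_after(OVERNIGHT_HOURS, MINUTES_PER_RUN, MAX_CONCURRENT)

print(f"Full backfill: about {total_hours:.1f} hours")
print(f"Done after {OVERNIGHT_HOURS} hours: {done_overnight} of {TOTAL_RUNS} runs")
print(f"Start timeout: {START_TIMEOUT_SECONDS / 3600:.0f} hours")
```

Under these assumptions the full backfill needs roughly a day of wall-clock time, while the configured start timeout is 4 hours, so it may be worth comparing how long a run can wait for cluster resources against that timeout.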

daniel

03/31/2022, 9:01 PM

Hey david - are you using the k8s_job_executor with the jobs in question here? As per https://docs.dagster.io/deployment/run-monitoring#resuming-runs-after-run-worker-crashes-experimental that would be needed to take advantage of the resuming feature. We're considering a more generic job-level retry feature that would work with all executors though (cc @johann)

mrdavidlaing

04/01/2022, 8:50 AM

Aha - I hadn't grokked that part of the docs.

mrdavidlaing

04/01/2022, 8:52 AM

If I understand correctly, the k8s_job_executor would launch each step as a separate K8s job, rather than running all the steps in the same job. Is that true?

Andrea Giardini

04/01/2022, 12:29 PM

@mrdavidlaing Yeah that's right

