# ask-community
r
Hi! Weird behavior regarding the run monitoring feature: quick jobs (a few minutes at most) are automatically marked as failed. We get the following message as a `RUN_FAILURE` event:
```
Detected run worker status SUCCESS: 'None'. Marking run <run_id> as failed, because it has surpassed the configured maximum attempts to resume the run: {max_resume_run_attempts}.
```
It's weird, as we get a `RUN_SUCCESS` event a few seconds earlier. We configured `poll_interval_seconds` to 45 seconds. It seems like a race condition between the job finishing and the run being fetched by `instance.get_runs` with the `IN_PROGRESS_RUN_STATUSES` filter. It's possible that `instance.get_runs` is taking some time on our instance due to a large number of jobs (and the TODO comment above it might improve that greatly 🙂). (Code reference, which also contains a small bug, a missing format string: https://github.com/dagster-io/dagster/blob/HEAD/python_modules/dagster/dagster/daemon/monitoring/monitoring_daemon.py#L64) Thanks 🙂
@Dagster Bot issue Run monitoring marks successful quick runs as failed
d
r
Found this PR, created 15 minutes ago: https://github.com/dagster-io/dagster/pull/8729 @ba I wanted to create a PR for that too and stumbled upon yours by chance 🙂 thanks
b
sorry I didn't chime in on the discussion, but this PR is in response to your Slack messages :) I work on the cloud product for Elementl, but I haven't worked directly on the dagster codebase much, so I was going to run it by some folks more familiar with it on Monday or Tuesday.
so I wouldn't be offended if you had different suggestions on how to fix it 👍
#8729 released in 0.15.5 🎉
• Fixed issue where the run monitoring daemon could mark completed runs as failed if they transitioned quickly between `STARTING` and `SUCCESS` status.
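Conceptually, avoiding this race means not acting on a stale poll result: before marking a run failed, re-check whether it is still actually in progress. A rough sketch of that idea under assumed names (`maybe_mark_failed`, `fetch_current_status` are hypothetical, not the code from the PR):

```python
from enum import Enum

class RunStatus(Enum):
    STARTING = "STARTING"
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"

# Mirrors the in-progress filter mentioned earlier in the thread.
IN_PROGRESS_RUN_STATUSES = {RunStatus.STARTING, RunStatus.STARTED}

def maybe_mark_failed(fetch_current_status, run_id, worker_healthy):
    """Hypothetical monitoring check: never fail a run that finished
    between the initial poll and this health check."""
    if worker_healthy:
        return None
    # Re-fetch the status right before acting on the poll result,
    # so a run that raced to SUCCESS is left alone.
    status = fetch_current_status(run_id)
    if status not in IN_PROGRESS_RUN_STATUSES:
        return None
    return RunStatus.FAILURE

# A quick run that already reached SUCCESS is not marked failed:
result = maybe_mark_failed(
    lambda _rid: RunStatus.SUCCESS, "run_1", worker_healthy=False
)
print(result)  # None
```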
r
Awesome, thanks @ba!