# ask-community
r
Hi! Weird behavior regarding the run monitoring feature: quick jobs (a few minutes at most) are automatically marked as failed. We get the following message as a `RUN_FAILURE` event:
```
Detected run worker status SUCCESS: 'None'. Marking run <run_id> as failed, because it has surpassed the configured maximum attempts to resume the run: {max_resume_run_attempts}.
```
It's weird, as we get a `RUN_SUCCESS` event a few seconds earlier. We configured `poll_interval_seconds` to 45 seconds. It seems like a race condition between the job finishing and the run being fetched by `instance.get_runs` with the `IN_PROGRESS_RUN_STATUSES` filter. It's possible that `instance.get_runs` is taking some time on our instance due to a large number of jobs (and the TODO comment above it might improve that greatly 🙂). (Code reference, which also contains a small bug, a missing format string: https://github.com/dagster-io/dagster/blob/HEAD/python_modules/dagster/dagster/daemon/monitoring/monitoring_daemon.py#L64) Thanks 🙂
@Dagster Bot issue Run monitoring marks successful quick runs as failed
d
r
Found this PR, created 15 minutes ago: https://github.com/dagster-io/dagster/pull/8729 @ba I wanted to create a PR for that too and stumbled upon yours by chance 🙂 thanks
b
sorry I didn't chime in on the discussion, but this PR is in response to your Slack messages :) I work on the cloud product for Elementl, but I haven't worked directly on the dagster codebase much, so I was going to run it by some folks more familiar with it on Monday or Tuesday.
so I wouldn't be offended if you had different suggestions on how to fix it 👍
#8729 released in 0.15.5 🎉
• Fixed issue where the run monitoring daemon could mark completed runs as failed if they transitioned quickly between `STARTING` and `SUCCESS` status.
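Conceptually, avoiding this race means not acting on a stale poll result: before marking a run failed, re-check whether it is still actually in progress. A rough sketch of that idea under assumed names (`maybe_mark_failed`, `fetch_current_status` are hypothetical, not the code from the PR):

```python
from enum import Enum

class RunStatus(Enum):
    STARTING = "STARTING"
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"

# Mirrors the in-progress filter mentioned earlier in the thread.
IN_PROGRESS_RUN_STATUSES = {RunStatus.STARTING, RunStatus.STARTED}

def maybe_mark_failed(fetch_current_status, run_id, worker_healthy):
    """Hypothetical monitoring check: never fail a run that finished
    between the initial poll and this health check."""
    if worker_healthy:
        return None
    # Re-fetch the status right before acting on the poll result,
    # so a run that raced to SUCCESS is left alone.
    status = fetch_current_status(run_id)
    if status not in IN_PROGRESS_RUN_STATUSES:
        return None
    return RunStatus.FAILURE

# A quick run that already reached SUCCESS is not marked failed:
result = maybe_mark_failed(
    lambda _rid: RunStatus.SUCCESS, "run_1", worker_healthy=False
)
print(result)  # None
```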
r
Awesome, thanks @ba!