Roei Jacobovich
07/02/2022, 4:16 PM
RUN_FAILURE event:
Detected run worker status SUCCESS: 'None'. Marking run <run_id> as failed, because it has surpassed the configured maximum attempts to resume the run: {max_resume_run_attempts}.
It’s weird, as we get a RUN_SUCCESS event a few seconds earlier.
We configured poll_interval_seconds to 45 seconds.
Seems like a race condition between the job finishing and the run itself being fetched by instance.get_runs with the IN_PROGRESS_RUN_STATUSES filter. It’s possible that instance.get_runs is taking some time on our instance due to a large number of jobs (and the TODO comment above might improve that greatly 🙂).
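To illustrate the suspected race, here is a minimal sketch of it. It does not use the real Dagster code: FakeInstance, Run, monitor and the IN_PROGRESS set are hypothetical stand-ins, and the recheck flag is just one possible mitigation, not something the daemon is known to do.

```python
from dataclasses import dataclass

# Hypothetical stand-in for IN_PROGRESS_RUN_STATUSES.
IN_PROGRESS = {"STARTING", "STARTED"}


@dataclass
class Run:
    run_id: str
    status: str


class FakeInstance:
    """Hypothetical in-memory stand-in for the Dagster instance."""

    def __init__(self):
        self.runs = {}

    def get_runs(self, statuses):
        # Return copies, so the caller holds a snapshot that can go stale.
        return [
            Run(r.run_id, r.status)
            for r in self.runs.values()
            if r.status in statuses
        ]

    def get_run(self, run_id):
        return self.runs[run_id]


def monitor(instance, between_fetch_and_check, recheck):
    """One monitoring pass; returns the ids of runs it would mark as failed."""
    snapshot = instance.get_runs(IN_PROGRESS)
    # Simulates a slow fetch during which the job finishes.
    between_fetch_and_check()
    failed = []
    for run in snapshot:
        # The run worker has already exited because the job completed.
        worker_alive = False
        if recheck:
            # Possible mitigation: re-read the current status before acting.
            if instance.get_run(run.run_id).status not in IN_PROGRESS:
                continue
        if not worker_alive:
            failed.append(run.run_id)
    return failed
```

With recheck=False a run that succeeded after the snapshot was taken still ends up in the failed list, which matches what we see. With recheck=True it is skipped.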
(Code reference and also a small bug there: missing format string at https://github.com/dagster-io/dagster/blob/HEAD/python_modules/dagster/dagster/daemon/monitoring/monitoring_daemon.py#L64)
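The literal {max_resume_run_attempts} in the event message above is what that bug looks like. A small sketch, assuming the cause is a missing f prefix on the message string (the value 3 is made up for illustration):

```python
max_resume_run_attempts = 3  # hypothetical value, for illustration only

# Without the f prefix the placeholder is emitted literally.
broken = (
    "surpassed the configured maximum attempts to resume the run: "
    "{max_resume_run_attempts}."
)

# With the f prefix the value is interpolated.
fixed = (
    "surpassed the configured maximum attempts to resume the run: "
    f"{max_resume_run_attempts}."
)

print(broken)
print(fixed)
```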
Thanks 🙂
Dagster Bot
07/03/2022, 7:15 PM
Roei Jacobovich
07/03/2022, 7:21 PM
ba
07/03/2022, 7:36 PM
Roei Jacobovich
07/08/2022, 11:45 AM