# ask-community
Hi Team, I’m trying to retry failed jobs at a later time without them taking a place in the queue while they wait. RequestRetry and retry policies both keep the job in the queue. Two questions:
• What is the best way to retry jobs from a run_failure_sensor?
• What’s the best generic way to rerun a job at a later time?
Thanks in advance for the help
hi @Jose Estudillo! There's no built-in way to do exactly what you're describing -- in general you can either schedule a job to run on a static schedule, or ask for a job to be run immediately (there's no abstraction for dynamically requesting a job to be run at a specific time). Retrying jobs from a run_failure_sensor can be done just by yielding RunRequests from within the body, but that setup will end up launching a run very soon after the initial failure, and there's not really a way to make it wait. I think your best bet would be to create a custom sensor that monitors your job, and manually queries the instance for failed runs of your job. For example, you can get the most recent run of a given job with:
most_recent_run_record = context.instance.get_run_records(
You can then check whether it failed with `most_recent_run_record.dagster_run.status == DagsterRunStatus.FAILURE`. If the most recent run has failed, you can check `most_recent_run_record.end_time`, which is a float timestamp (in UTC) representing when that run failed. You can compare that to the current time, and if it's more than (let's say) 2 hours ago, you can kick off a run of that job.
Thanks @owen, let me experiment a little bit. The issue I have found is that for a run request to work, the job needs to be tied to the sensor, and I’m looking for something that can retry any job.