
Hi Team, I’m trying to retry failed jobs at a later time without them taking up a place in the queue while they wait. `RetryRequested` and `RetryPolicy` keep the job in the queue. Two questions:
• What is the best way to retry jobs from a `run_failure_sensor`?
• What’s the best generic way to rerun a job at a later time?
Thanks in advance for the help

hi <@U03M0344KB8>! There's no built-in way to do exactly what you're describing. In general, you can either schedule a job to run on a static schedule or ask for a job to be run immediately; there's no abstraction for dynamically requesting that a job run at a specific time.

Retrying jobs from a `run_failure_sensor` can be done by yielding `RunRequest`s from within the sensor body, but that setup launches a run very soon after the initial failure, and there isn't really a way to make it wait.

I think your best bet would be to create a custom sensor that monitors your job, and manually queries the instance for failed runs of your job. For example, you can get the most recent run of a given job with:
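```python
from dagster import DagsterRunStatus, RunsFilter

# Inside a sensor body: `context.instance` is the DagsterInstance.
# Records come back newest first by default, so limit=1 gives the latest run.
most_recent_run_record = context.instance.get_run_records(
    filters=RunsFilter(job_name=...),
    limit=1,
)[0]
```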
You can then check whether it failed with `most_recent_run_record.dagster_run.status == DagsterRunStatus.FAILURE`.

If the most recent run has failed, you can then check `most_recent_run_record.end_time`, a float timestamp (in UTC) recording when that run failed. Compare that to the current time, and if it's more than (let's say) 2 hours ago, kick off a run of that job.
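The status-and-timing check described above can be isolated into a small pure-Python helper that is easy to test without a Dagster instance. This is a minimal sketch, assuming the caller passes the status as a string (e.g. `DagsterRunStatus.FAILURE.value`, which is `"FAILURE"`) and the record's `end_time`; the name `should_retry` and the 2-hour default are hypothetical, not part of Dagster's API.

```python
import time
from typing import Optional

# Hypothetical default delay; the thread suggests "let's say 2 hours".
RETRY_DELAY_SECONDS = 2 * 60 * 60


def should_retry(
    status: str,
    end_time: Optional[float],
    now: Optional[float] = None,
    delay_seconds: float = RETRY_DELAY_SECONDS,
) -> bool:
    """Return True if the latest run failed more than `delay_seconds` ago.

    `status` is assumed to be the run status as a string, e.g.
    DagsterRunStatus.FAILURE.value ("FAILURE"); `end_time` is the
    record's UTC float timestamp, or None if none was recorded.
    """
    if status != "FAILURE":
        return False  # latest run did not fail, so there is nothing to retry
    if end_time is None:
        return False  # no end timestamp, so the wait can't be measured
    # time.time() is seconds since the epoch, comparable to end_time
    current = time.time() if now is None else now
    return current - end_time > delay_seconds


# Hypothetical usage inside a sensor body (Dagster imports omitted):
#   record = ...most recent RunRecord for the job...
#   if should_retry(record.dagster_run.status.value, record.end_time):
#       yield RunRequest(run_key=record.dagster_run.run_id)
```

Keeping the decision separate from the Dagster calls lets you check the timing logic on its own. Using the failed run's ID as the `run_key` is one way to avoid launching the same retry twice, since Dagster skips a sensor's run requests whose `run_key` it has already seen.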

Thanks <@U01J51Y6B9D>, let me experiment a little. The issue I’ve found is that for a run request to work, the job needs to be tied to the sensor, and I’m looking for something that can retry any job.