# ask-community
o
Hey, I'm looking into an issue with jobs failing but not retrying. I'm running a big batch of jobs (~400), and I noticed that sometimes they don't retry after failing (~1%–2% of the jobs). It doesn't seem like a configuration issue, because most of them do retry when they fail. In the first attachment, you see a run that failed and then nothing happens. In the second attachment, you see a different run (same job, different range of dates) that failed, and shortly after there's an ENGINE EVENT saying that it was re-queued as a new run. Is this a known issue? I've seen this before in other jobs, but dismissed it because it was very rare. However, when I have 400–1200 runs, 1% is still a lot of failures (which I currently need to look for manually, because it's hard to filter for jobs that should have retried but didn't). I can try to debug it further, but I need some guidance on which logs or code I should look at. I understand that there's a daemon that polls the event log for run failures (https://docs.dagster.io/deployment/run-retries), so I'm guessing it's missing some events for some reason. Where does this daemon run? Where can I find the code? (Note: this is happening in a self-hosted Dagster installation, but I have seen it in our Dagster Cloud installation as well.)
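For context, here's a minimal sketch of the kind of retry setup in play here (placeholder names; it assumes run retries are enabled on the instance and capped per job with the dagster/max_retries tag from the run-retries docs):

```python
from dagster import job, op


@op
def process_date_range():
    # Placeholder op standing in for the real batch work over a date range.
    ...


# Assumes run retries are enabled instance-wide (run_retries in dagster.yaml);
# the tag below is the documented per-job override for the retry count.
@job(tags={"dagster/max_retries": "2"})
def backfill_job():
    process_date_range()
```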
a
Do you see a pattern in the types of the failures that fail to retry?
Where does this daemon run?
https://docs.dagster.io/deployment/overview - in k8s there is a Deployment for the daemon process
Where can I find the code?
for run retries (aka auto re-execution): https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_daemon/auto_run_reexecution/auto_run_reexecution.py; the failures in the screenshot came from run monitoring: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_daemon/monitoring/run_monitoring.py
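at a high level, the retry daemon reads run-failure events off the event log (tracking a cursor) and re-submits failed runs that still have retries left. Something like this heavily simplified sketch (illustrative names only, not the actual code in that file):

```python
from dagster import DagsterInstance, DagsterRun

# Illustrative constants/names; the real logic lives in
# auto_run_reexecution.py linked above and differs in detail.
MAX_RETRIES_TAG = "dagster/max_retries"
RETRY_NUMBER_TAG = "dagster/retry_number"


def retries_left(run: DagsterRun, default_max_retries: int) -> bool:
    """Return True if this failed run should be re-submitted."""
    max_retries = int(run.tags.get(MAX_RETRIES_TAG, default_max_retries))
    retry_number = int(run.tags.get(RETRY_NUMBER_TAG, 0))
    return retry_number < max_retries


def process_new_failures(
    instance: DagsterInstance, failed_runs, default_max_retries: int
) -> None:
    # The daemon only acts on failures whose RUN_FAILURE event it reads off
    # the event log cursor; if an event is missed, no retry run is submitted.
    for run in failed_runs:
        if retries_left(run, default_max_retries):
            # The real daemon builds a re-execution of `run` here and hands
            # it to the run coordinator.
            ...
```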
o
Thanks @alex! Looks like I have some reading to do. For the daemon, are there more detailed logs I can turn on to help understand what it's doing? (Or are there specific messages in the current log output I should be looking for?) I didn't see any pattern to this issue. I have 400 total runs in the queue, 50 running at the same time; it takes about a day to run all of them, with runs failing at different times and failing to get re-queued at different times 🤷 I do remember seeing at least one occasion where a job took several minutes before it got re-queued (between the failure message and the ENGINE EVENT message showing up), but perhaps I didn't read the logs correctly. I wish I had a screenshot.
a
there should be pod logs you can check; there are no current options for increasing log output
o
I searched the daemon pod logs, and I see the same entries for the runs that managed to retry and for the runs that didn't. So there's nothing about kicking off a retry run in the logs (I also searched all the GCP logs in the dagster namespace and couldn't find anything relevant). Time to look at the code, then…
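In the meantime, something like this could flag failed runs that never got a retry run, instead of hunting for them manually (sketch only; it assumes the retry run points back at the failed run via parent_run_id, and that RunsFilter/DagsterRunStatus are importable from the top-level dagster package, which they are on recent versions):

```python
from dagster import DagsterInstance, DagsterRunStatus, RunsFilter

JOB_NAME = "backfill_job"  # placeholder job name

# Requires DAGSTER_HOME to point at the instance's dagster.yaml.
instance = DagsterInstance.get()

# All failed runs for the job.
failed = instance.get_runs(
    filters=RunsFilter(job_name=JOB_NAME, statuses=[DagsterRunStatus.FAILURE])
)

# Collect the parent run ids of every run of the job; a retry run should
# carry the failed run's id as its parent_run_id.
all_runs = instance.get_runs(filters=RunsFilter(job_name=JOB_NAME))
retried_parent_ids = {r.parent_run_id for r in all_runs if r.parent_run_id}

for run in failed:
    if run.run_id not in retried_parent_ids:
        print(f"failed without retry: {run.run_id}")
```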