Non-terminating Job Blocking Execution Queue I no...
# ask-community
Non-terminating Job Blocking Execution Queue I noticed today that a job had got stuck (for an unknown reason) in a state where it was still considered to be running and therefore there was a long backlog of jobs in the queue waiting to execute. I'm using a
. The only way for me to notice that jobs that should have run on a schedule hadn't run, was to notice a gap in the logs (which are going to Google Cloud Logging) or to check dagit. I'm looking for a general way to know when a job that "should" have run did not run. For example, • get alerted if a certain amount of time has elapsed since a job's scheduled start time, i.e. a schedule is not being fulfilled • get alerted if a jobs run time exceeds some threshold, i.e. something appears fishy with a job Of course I need to figure out specifically why this particular job was not completing, but more importantly I need to put an early warning system in place to catch irregularities in a general way. Please let me know if anyone has any ideas.
My thinking at this stage is to run a job periodically which performs a healthcheck. It will run graphql queries and then do certain sanity checks. It will log an "all clear" if everything looks fine. It will log an error if not. For the sanity checking, the age of the oldest queued job could be compared against a threshold. A GraphQL query like this one could be used,
Copy code
query QueuedJobs {
  runsOrError(filter: {
    statuses: [QUEUED]
  }) {
    ... on Runs {
      results {
        tags {
        events {
          ... on RunEnqueuedEvent {
From an alerting perspective I can obviously alert on an error. But I could also alert if a lack of healthcheck messages is detected, i.e. a kind of dead man switch. Any other ideas?
Hi Stefan- to answer part of your requests here, we’re working on run monitoring to detect when a run worker has died
This is currently only supported on Kubernetes and Docker, but we’re looking to improve coverage. With this, the run would be marked as FAILED and you’d be able to alert on that
Hi Stefan. Another option you could try is to query the instance directly:
Copy code
On the RunsFilter object you can also specify specific job names or run ids.
@johann Thanks for the response. This get some of the way there. It looks like there is some emphasis being put on higher level job monitoring, which is great. @claire Thank you for your response. This dramatically simplifies what I was planning!
@claire I'm trying the approach you suggested. I notice that
is not documented, which makes me worry that this functionality might disappear in the future. Also,
doesn't appear to have the events and (more importantly for me) the timestamps associated with those events, which is what I need to be able to check for "stalled" jobs. Or am I missing something?
Hi Stefan. Yes, the methods are undocumented, but in the event we replace it, we will deprecate this function and it will still exist. The GraphQL API at the moment is still experimental and subject to change. In order to get the
, you can use a separate method from the instance:
Copy code
context.instance.logs_after(run_id=run_id, cursor=-1, of_type=DagsterEventType.RUN_ENQUEUED)
parameter returns all events after
, so
returns all events. The
objects returned from this method will contain the timestamps you are looking for
Thanks, @claire. I'll give this a go.