Non-terminating Job Blocking Execution Queue I no...
# ask-community
s
Non-terminating Job Blocking Execution Queue I noticed today that a job had got stuck (for an unknown reason) in a state where it was still considered to be running and therefore there was a long backlog of jobs in the queue waiting to execute. I'm using a
QueuedRunCoordinator
with
max_concurrent_runs=1
. The only way for me to notice that jobs that should have run on a schedule hadn't run, was to notice a gap in the logs (which are going to Google Cloud Logging) or to check dagit. I'm looking for a general way to know when a job that "should" have run did not run. For example, • get alerted if a certain amount of time has elapsed since a job's scheduled start time, i.e. a schedule is not being fulfilled • get alerted if a jobs run time exceeds some threshold, i.e. something appears fishy with a job Of course I need to figure out specifically why this particular job was not completing, but more importantly I need to put an early warning system in place to catch irregularities in a general way. Please let me know if anyone has any ideas.
My thinking at this stage is to run a job periodically which performs a healthcheck. It will run graphql queries and then do certain sanity checks. It will log an "all clear" if everything looks fine. It will log an error if not. For the sanity checking, the age of the oldest queued job could be compared against a threshold. A GraphQL query like this one could be used,
Copy code
query QueuedJobs {
  runsOrError(filter: {
    statuses: [QUEUED]
  }) {
    ... on Runs {
      results {
        runId
        jobName
        pipelineName
        tags {
          key
          value
        }
        events {
          ... on RunEnqueuedEvent {
            timestamp
          }
        }
      }
    }
  }
}
From an alerting perspective I can obviously alert on an error. But I could also alert if a lack of healthcheck messages is detected, i.e. a kind of dead man switch. Any other ideas?
j
Hi Stefan- to answer part of your requests here, we’re working on run monitoring to detect when a run worker has died https://docs.dagster.io/deployment/run-monitoring
This is currently only supported on Kubernetes and Docker, but we’re looking to improve coverage. With this, the run would be marked as FAILED and you’d be able to alert on that
c
Hi Stefan. Another option you could try is to query the instance directly:
Copy code
context.instance.get_runs(RunsFilter(statuses=[PipelineRunStatus.QUEUED]))
On the RunsFilter object you can also specify specific job names or run ids.
s
@johann Thanks for the response. This get some of the way there. It looks like there is some emphasis being put on higher level job monitoring, which is great. @claire Thank you for your response. This dramatically simplifies what I was planning!
@claire I'm trying the approach you suggested. I notice that
instance.get_runs
is not documented, which makes me worry that this functionality might disappear in the future. Also,
DagsterRun
doesn't appear to have the events and (more importantly for me) the timestamps associated with those events, which is what I need to be able to check for "stalled" jobs. Or am I missing something?
c
Hi Stefan. Yes, the methods are undocumented, but in the event we replace it, we will deprecate this function and it will still exist. The GraphQL API at the moment is still experimental and subject to change. In order to get the
RunEnqueuedEvent
, you can use a separate method from the instance:
Copy code
context.instance.logs_after(run_id=run_id, cursor=-1, of_type=DagsterEventType.RUN_ENQUEUED)
The
cursor
parameter returns all events after
cursor+1
, so
cursor=-1
returns all events. The
EventLogEntry
objects returned from this method will contain the timestamps you are looking for
s
Thanks, @claire. I'll give this a go.