Stefan Adelbert
04/03/2022, 10:47 PM
QueuedRunCoordinator with max_concurrent_runs=1.
The only way for me to notice that scheduled jobs hadn't run was to spot a gap in the logs (which go to Google Cloud Logging) or to check dagit.
I'm looking for a general way to know when a job that "should" have run did not run. For example,
• get alerted if a certain amount of time has elapsed since a job's scheduled start time, i.e. a schedule is not being fulfilled
• get alerted if a job's run time exceeds some threshold, i.e. something appears fishy with a job
Of course I need to figure out specifically why this particular job was not completing, but more importantly I need to put an early warning system in place to catch irregularities in a general way.
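A minimal sketch of the two checks listed above, in plain Python. RunRecord, check_run and both thresholds are hypothetical names for illustration, not Dagster APIs; the timestamps would have to come from wherever run metadata is available.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class RunRecord:
    """Hypothetical record of one scheduled run (not a Dagster type)."""
    job_name: str
    scheduled_at: datetime
    started_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None


def check_run(run: RunRecord, now: datetime,
              max_start_delay: timedelta,
              max_duration: timedelta) -> List[str]:
    """Return alert messages for one run; an empty list means healthy."""
    alerts = []
    if run.started_at is None:
        # Check 1: the schedule is not being fulfilled.
        if now - run.scheduled_at > max_start_delay:
            alerts.append(f"{run.job_name}: not started within allowed delay")
    else:
        # Check 2: the run (finished or still going) exceeded its budget.
        end = run.finished_at or now
        if end - run.started_at > max_duration:
            alerts.append(f"{run.job_name}: run time exceeded threshold")
    return alerts


now = datetime(2022, 4, 3, 12, 0, tzinfo=timezone.utc)
late = RunRecord("job_a", scheduled_at=now - timedelta(minutes=30))
print(check_run(late, now, timedelta(minutes=10), timedelta(hours=1)))
```

Passing `now` in explicitly keeps the check easy to test and lets the same function run from a cron job, a sensor, or an external monitor.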
Please let me know if anyone has any ideas.
query QueuedJobs {
  runsOrError(filter: {
    statuses: [QUEUED]
  }) {
    ... on Runs {
      results {
        runId
        jobName
        pipelineName
        tags {
          key
          value
        }
        events {
          ... on RunEnqueuedEvent {
            timestamp
          }
        }
      }
    }
  }
}
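One way the query above could be sent from a script, sketched with only the standard library. The URL in the usage line is a placeholder, and the response shape handled in enqueue_timestamps is an assumption read off the query's field names; the timestamp values are passed through unparsed because their unit is not confirmed here.

```python
import json
import urllib.request

# Trimmed copy of the query from the thread (runId and enqueue timestamp only).
QUEUED_RUNS_QUERY = """
query QueuedJobs {
  runsOrError(filter: {statuses: [QUEUED]}) {
    ... on Runs {
      results {
        runId
        events {
          ... on RunEnqueuedEvent {
            timestamp
          }
        }
      }
    }
  }
}
"""


def build_request(graphql_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the query."""
    body = json.dumps({"query": QUEUED_RUNS_QUERY}).encode("utf-8")
    return urllib.request.Request(
        graphql_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def enqueue_timestamps(response: dict) -> dict:
    """Map runId -> raw enqueue timestamp from a decoded JSON response."""
    runs = (response.get("data") or {}).get("runsOrError") or {}
    found = {}
    for run in runs.get("results", []):
        for event in run.get("events", []):
            # Events of other types match no fragment and come back empty.
            if "timestamp" in event:
                found[run["runId"]] = event["timestamp"]
    return found


# Placeholder URL; sending would be urllib.request.urlopen(request).
request = build_request("http://localhost:3000/graphql")
```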
From an alerting perspective I can obviously alert on an error. But I could also alert if a lack of healthcheck messages is detected, i.e. a kind of dead man's switch.
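The dead man's switch idea could look roughly like this. DeadMansSwitch and its methods are hypothetical names, and the sketch keeps state in memory, so a real monitor would need to persist heartbeats somewhere outside the process being watched.

```python
import time
from typing import Dict, List, Optional


class DeadMansSwitch:
    """Tracks the last healthcheck per source and reports the silent ones."""

    def __init__(self, max_silence_seconds: float) -> None:
        self.max_silence_seconds = max_silence_seconds
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, name: str, now: Optional[float] = None) -> None:
        # Record that a healthcheck message arrived from `name`.
        self._last_seen[name] = time.time() if now is None else now

    def silent(self, now: Optional[float] = None) -> List[str]:
        # Sources whose last heartbeat is older than the allowed silence.
        current = time.time() if now is None else now
        return sorted(
            name for name, seen in self._last_seen.items()
            if current - seen > self.max_silence_seconds
        )


switch = DeadMansSwitch(max_silence_seconds=60)
switch.heartbeat("nightly_job", now=0)
switch.heartbeat("hourly_job", now=100)
print(switch.silent(now=120))  # only nightly_job has been silent for over 60 s
```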
Any other ideas?
johann
04/04/2022, 4:45 PM
claire
04/04/2022, 4:48 PM
context.instance.get_runs(RunsFilter(statuses=[PipelineRunStatus.QUEUED]))
On the RunsFilter object you can also specify specific job names or run ids.
Stefan Adelbert
04/04/2022, 10:26 PM
instance.get_runs is not documented, which makes me worry that this functionality might disappear in the future. Also, DagsterRun doesn't appear to have the events and (more importantly for me) the timestamps associated with those events, which is what I need to be able to check for "stalled" jobs. Or am I missing something?
claire
04/04/2022, 11:22 PM
RunEnqueuedEvent, you can use a separate method from the instance:
context.instance.logs_after(run_id=run_id, cursor=-1, of_type=DagsterEventType.RUN_ENQUEUED)
The cursor parameter returns all events after cursor+1, so cursor=-1 returns all events. The EventLogEntry objects returned from this method will contain the timestamps you are looking for.
Stefan Adelbert
04/04/2022, 11:25 PM
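A rough sketch of how the pieces in this thread might combine into a "stalled run" check. It assumes each enqueue event exposes a numeric timestamp attribute in Unix epoch seconds; queued_seconds, stalled_run_ids and the threshold are hypothetical names, and the SimpleNamespace objects stand in for the event objects that the logs_after call quoted above would return.

```python
from types import SimpleNamespace
from typing import Dict, Iterable, List, Optional


def queued_seconds(enqueued_events: Iterable, now: float) -> Optional[float]:
    """Seconds since the earliest enqueue event, or None if there is none."""
    timestamps = [event.timestamp for event in enqueued_events]
    if not timestamps:
        return None
    return now - min(timestamps)


def stalled_run_ids(events_by_run: Dict[str, list], now: float,
                    max_queued_seconds: float) -> List[str]:
    """Run ids that have been sitting in the queue longer than allowed."""
    stalled = []
    for run_id, events in events_by_run.items():
        age = queued_seconds(events, now)
        if age is not None and age > max_queued_seconds:
            stalled.append(run_id)
    return stalled


# Stand-in data; in practice events_by_run would be filled per queued run
# from the instance calls mentioned in the thread.
events_by_run = {
    "run_1": [SimpleNamespace(timestamp=1000.0)],
    "run_2": [SimpleNamespace(timestamp=1900.0)],
    "run_3": [],
}
print(stalled_run_ids(events_by_run, now=2000.0, max_queued_seconds=600))
```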