# ask-community
m
Hi all 🙂 We need the ability to see in Datadog which jobs have been stuck in the 'canceling' state for more than 30 minutes. Any ideas how to do it? Specifically, is there a way to get the state of a job into Datadog?
z
Interested in how other people might do this, but one direction to look in would be to have a sensor that uses the Dagster instance available on the sensor context to query for all canceling jobs, post those to Datadog as a custom metric, and then have an alarm set up in Datadog on that metric. Unfortunately I'm not sure you could use a run_status_sensor, because it seems like you might not be able to get the run's start time from the run status sensor's context, but you could probably do it through a normal sensor that uses the DagsterInstance on the sensor context with an EventRecordsFilter for canceling events. It might look vaguely like this:
from datetime import datetime
from dagster import DagsterEventType, EventRecordsFilter, RunRequest, SensorExecutionContext, SkipReason, sensor

@sensor(job=send_datadog_metric)
def canceling_sensor(context: SensorExecutionContext):
    canceling_ids = []
    canceling_events = context.instance.get_event_records(
        EventRecordsFilter(event_type=DagsterEventType.PIPELINE_CANCELING)
    )
    # a run counts as "stuck" if its CANCELING event is older than 30 minutes
    cutoff = datetime.now().timestamp() - 30 * 60
    for e in canceling_events:
        if e.event_log_entry.timestamp < cutoff:
            canceling_ids.append(e.event_log_entry.run_id)
    if canceling_ids:
        yield RunRequest(run_key=None, run_config={"ops": ...})
    else:
        yield SkipReason("No canceling jobs detected")
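For the Datadog half of the suggestion (publishing the count as a custom metric and alerting on it), here is a minimal sketch of what the send_datadog_metric job could look like, using the DogStatsD client from the datadog Python package. The op name, config shape, agent host/port, and the metric name dagster.runs.stuck_canceling are illustrative assumptions, not anything confirmed in this thread:

from dagster import job, op
from datadog import initialize, statsd

# point DogStatsD at the local Datadog agent (host/port are assumptions)
initialize(statsd_host="localhost", statsd_port=8125)


@op(config_schema={"canceling_run_ids": [str]})
def report_stuck_canceling_runs(context):
    run_ids = context.op_config["canceling_run_ids"]
    # gauge of how many runs have been stuck in CANCELING for > 30 minutes;
    # a Datadog metric monitor on this gauge (> 0) provides the alert
    statsd.gauge("dagster.runs.stuck_canceling", len(run_ids))
    context.log.info(f"reported {len(run_ids)} stuck run(s) to Datadog: {run_ids}")


@job
def send_datadog_metric():
    report_stuck_canceling_runs()

Under these assumptions, the sensor's RunRequest would fill in the elided run_config with something like {"ops": {"report_stuck_canceling_runs": {"config": {"canceling_run_ids": canceling_ids}}}}, and a Datadog metric monitor that fires when dagster.runs.stuck_canceling is above 0 gives the "stuck for more than 30 min" alert.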
m
Thanks a lot Zach! Did it as you suggested 🙂