How do folks handle metrics, dashboards, and alerting related to Dagster? #ask-community

Mark Fickett

03/14/2022, 8:18 PM

How do folks handle metrics, dashboards, and alerting related to their Dagster pipelines? Examples:
• I want to get an alert if the pipeline doesn't finish successfully.
• I want a trend of overall pipeline duration over days (I think Dagster does this), and for different multi-op portions of the DAG (some notional 'stage 1', 'stage 2' of the pipeline that may map to a subgraph but is larger than an op).
• I want to report a metric if a particular op encounters some condition (such as no response from an API), and able to get an alert if too many ops in a fan-out hit that condition, sliced by some attribute (green widgets had 20 errors, blue widgets had 4k errors).
• One particular op had 500 ERROR-level log lines. Find the op, group the log lines by error message.

How much of this can Dagster do natively? What services do folks use for other pieces -- DataDog, ELK, other? Thanks in advance for sharing your experiences!

yuhan

03/14/2022, 8:36 PM

I want to get an alert if the pipeline doesn’t finish successfully.

Dagster has built-in support for job-level alerting called run_status_sensor. You can find some examples here: https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors#job-failure-sensor

I want a trend of overall pipeline duration over days (I think Dagster does this), and for different multi-op portions of the DAG (some notional ‘stage 1’, ‘stage 2’ of the pipeline that may map to a subgraph but is larger than an op).

Dagit’s homepage “Factory Floor” provides a view that contains this info. Check out our blog post here: https://dagster.io/blog/dagster-0-14-0-never-felt-like-this-before#new-dagit-homepage-factory-floor-view

yuhan

03/14/2022, 8:44 PM

I want to report a metric if a particular op encounters some condition (such as no response from an API), and able to get an alert if too many ops in a fan-out hit that condition, sliced by some attribute (green widgets had 20 errors, blue widgets had 4k errors).

Similarly, run status sensors should be able to address this use case too; they are a way to listen to job-level events. If they don't cover your case, you can also write a custom sensor to manually listen to events, such as:

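from dagster import DagsterEventType, EventRecordsFilter, RunRequest, sensor

# my_job is assumed to be a job defined elsewhere in your code.
@sensor(job=my_job)
def custom_dagster_event_sensor(context):
    # Fetch only the most recent event record of the type you care about.
    dagster_event_records = context.instance.get_event_records(
        EventRecordsFilter(
            event_type=DagsterEventType.<...>,  # insert the event that indicates the condition you're interested in
        ),
        ascending=False,
        limit=1,
    )

    # No matching event yet, so there is nothing to react to.
    if not dagster_event_records:
        return

    yield RunRequest(...)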
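
For the "sliced by some attribute" part of the question, a sensor body like the one above could apply a per-attribute threshold to whatever failures it finds. The following is only a sketch of that counting step in plain Python, not Dagster API: the function name `groups_over_threshold` and the `(op_name, attribute)` pair shape are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def groups_over_threshold(
    failures: Iterable[Tuple[str, str]], threshold: int
) -> Dict[str, int]:
    """Count failures per attribute value and keep groups above the threshold.

    Each failure is a hypothetical (op_name, attribute) pair, for example
    ("fetch_widget[17]", "blue"). How the pairs are extracted from event
    records is left to the caller.
    """
    counts = Counter(attribute for _op_name, attribute in failures)
    return {attr: n for attr, n in counts.items() if n > threshold}
```

With 20 "green" failures and 4,000 "blue" failures and a threshold of 100, only the "blue" group would be returned, which is the group to alert on.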

Mark Fickett

03/14/2022, 8:45 PM

sensors

Thanks, I'll take a more detailed look at sensors.

Factory Floor

Cool! That's reporting at the job level, right? So that works for the overall pipeline duration, but not if I have a few areas of my DAG that I want to keep an eye on that don't map to jobs.

yuhan

03/14/2022, 8:46 PM

One particular op had 500 ERROR-level log lines. Find the op, group the log lines by error message.

Again, similar to other monitoring cases, sensors could be one approach. Also, you could configure your own logger for this use case. Here is an example: https://docs.dagster.io/concepts/logging/loggers#customizing-loggers
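
To illustrate the grouping idea, here is a minimal sketch using only Python's standard `logging` module. It is an assumption about how one might structure such a logger, not the example from the linked docs: the handler class name and the `op_name` record attribute (passed through `extra=`) are hypothetical.

```python
import logging
from collections import Counter


class ErrorGroupingHandler(logging.Handler):
    """Count records at ERROR level or above, grouped by (op name, message)."""

    def __init__(self):
        # Setting the handler level filters out anything below ERROR.
        super().__init__(level=logging.ERROR)
        self.counts = Counter()

    def emit(self, record):
        # "op_name" is a hypothetical attribute supplied by the caller via
        # `extra=`; records without it are grouped under "unknown".
        op_name = getattr(record, "op_name", "unknown")
        self.counts[(op_name, record.getMessage())] += 1
```

After attaching the handler to a logger and logging with `extra={"op_name": "fetch_widget"}`, `handler.counts.most_common()` lists the noisiest op and message pairs first.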

yuhan

03/14/2022, 8:47 PM

Right, for now, factory floor is showing info at the job level.

dwall

03/14/2022, 8:47 PM

hey @Mark Fickett 👋 we use a custom cronitor resource + a custom rollbar logger to do the things you're referring to

dwall

03/14/2022, 8:48 PM

the rollbar logger is responsible for forwarding logs to rollbar for downstream grouping, analysis, etc. and the cronitor resource is implemented to ping cronitor on job run + complete and also is attached as a run status failure sensor on each job

dwall

03/14/2022, 8:48 PM

so that we can get high level job metrics + notifications

Zach

03/14/2022, 8:48 PM

you can get a basic view of execution times for an op by going to a job overview containing the op, then clicking on the op in the graph, it'll populate a tab on the right with a graph of execution times

Mark Fickett

03/14/2022, 8:49 PM

Thanks @dwall I'll check out Cronitor, glad to know it integrates well with Dagster. Yes, I've enjoyed the per-op execution time trend graph!

yuhan

03/14/2022, 8:51 PM

If you’d want things like an aggregated view of op duration, you can get info from the event db. e.g. the duration of an op is: timestamp of STEP_SUCCESS - timestamp of STEP_START
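
As a minimal sketch of that subtraction, assume the event records have already been pulled from the event db and flattened into `(step_key, event_type, timestamp)` tuples ordered by time. That tuple shape, the function names, and the step-to-stage mapping are all assumptions for illustration, not Dagster API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical flattened event row: (step_key, event_type, unix_timestamp).
Event = Tuple[str, str, float]


def step_durations(events: Iterable[Event]) -> Dict[str, float]:
    """Return seconds between STEP_START and STEP_SUCCESS for each step.

    Steps that never reach STEP_SUCCESS are left out of the result.
    """
    starts: Dict[str, float] = {}
    durations: Dict[str, float] = {}
    for step_key, event_type, timestamp in events:
        if event_type == "STEP_START":
            starts[step_key] = timestamp
        elif event_type == "STEP_SUCCESS" and step_key in starts:
            durations[step_key] = timestamp - starts[step_key]
    return durations


def stage_durations(
    durations: Dict[str, float], stage_of: Dict[str, str]
) -> Dict[str, float]:
    """Sum per-step durations into user-defined stages."""
    totals: Dict[str, float] = defaultdict(float)
    for step_key, seconds in durations.items():
        # Steps missing from the mapping are grouped under "unassigned".
        totals[stage_of.get(step_key, "unassigned")] += seconds
    return dict(totals)
```

Note that summing gives total op time per stage; if ops in a stage run in parallel, wall-clock time for the stage would instead be the latest success timestamp minus the earliest start timestamp.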

dwall

03/14/2022, 8:51 PM

the big value add we get from cronitor currently is the notifications for pipeline failures and missed executions. not only can we tell cronitor to let us know if a job failed, we can also configure it to let us know if a job hasn't run when it should have. this helps us catch things like misconfigured jobs, etc.

dwall

03/14/2022, 8:51 PM

outside of that, most high-level performance metrics you can already get to within the factory floor

yuhan

03/29/2022, 12:40 AM

Thanks, folks on this thread, for sharing your approaches/ideas. I’ve created a GitHub Discussion here, hoping to collect more ideas from the community about alerting and monitoring. I also hope the discussion format will help public discoverability (which Slack isn’t a good fit for). Please feel free to leave a comment there about your approaches!

yuhan

03/30/2022, 11:09 PM

Questions in the thread have been surfaced to GitHub Discussions for future discoverability: • https://github.com/dagster-io/dagster/discussions/7159 • https://github.com/dagster-io/dagster/discussions/7163
