How do folks handle metrics, dashboards, and alert...
# ask-community
m
How do folks handle metrics, dashboards, and alerting related to their Dagster pipelines? Examples:
• I want to get an alert if the pipeline doesn't finish successfully.
• I want a trend of overall pipeline duration over days (I think Dagster does this), and for different multi-op portions of the DAG (some notional 'stage 1', 'stage 2' of the pipeline that may map to a subgraph but is larger than an op).
• I want to report a metric if a particular op encounters some condition (such as no response from an API), and able to get an alert if too many ops in a fan-out hit that condition, sliced by some attribute (green widgets had 20 errors, blue widgets had 4k errors).
• One particular op had 500 ERROR-level log lines. Find the op, group the log lines by error message.
How much of this can Dagster do natively? What services do folks use for other pieces -- DataDog, ELK, other?
Thanks in advance for sharing your experiences!
y
I want to get an alert if the pipeline doesn’t finish successfully.
Dagster has built-in support for job-level alerting called run_status_sensor. You can find some examples here: https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors#job-failure-sensor
I want a trend of overall pipeline duration over days (I think Dagster does this), and for different multi-op portions of the DAG (some notional ‘stage 1’, ‘stage 2’ of the pipeline that may map to a subgraph but is larger than an op).
Dagit’s homepage “Factory Floor” provides a view that contains this info. Check out our blog post here: https://dagster.io/blog/dagster-0-14-0-never-felt-like-this-before#new-dagit-homepage-factory-floor-view
I want to report a metric if a particular op encounters some condition (such as no response from an API), and able to get an alert if too many ops in a fan-out hit that condition, sliced by some attribute (green widgets had 20 errors, blue widgets had 4k errors).
Similarly, run status sensors should be able to address this use case too. It is a way to listen to job-level events. If that’s not the case, you can also write a custom sensor to manually listen to events, such as:
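from dagster import DagsterEventType, EventRecordsFilter, RunRequest, sensor

# my_job is the job this sensor targets; it is defined elsewhere.
@sensor(job=my_job)
def custom_dagster_event_sensor(context):
    # Fetch the most recent event record of the type you care about.
    dagster_event_records = context.instance.get_event_records(
        EventRecordsFilter(
            event_type=DagsterEventType.<...>,  # insert the event that indicates the condition you're interested in
        ),
        ascending=False,
        limit=1,
    )

    # Nothing matched, so there is nothing to react to on this tick.
    if not dagster_event_records:
        return

    yield RunRequest(...)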
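For the "sliced by some attribute" part, the aggregation itself can be plain Python once a sensor has collected the matching records. This is only a rough sketch: the record shape (dicts with a "color" key) and the helper names count_by_attribute and groups_over_threshold are hypothetical, not part of Dagster.

```python
from collections import Counter


def count_by_attribute(failures, attribute):
    """Count failure records per value of the given attribute."""
    return Counter(record[attribute] for record in failures)


def groups_over_threshold(counts, threshold):
    """Return only the groups whose failure count exceeds the threshold."""
    return {group: n for group, n in counts.items() if n > threshold}


# Hypothetical failure records, e.g. one per op in the fan-out that got no API response.
failures = [{"color": "green"}] * 20 + [{"color": "blue"}] * 4000
counts = count_by_attribute(failures, "color")
noisy_groups = groups_over_threshold(counts, threshold=100)
```

Here noisy_groups would contain only the blue widgets, which is the set you would alert on.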
m
sensors
Thanks, I'll take a more detailed look at sensors.
Factory Floor
Cool! That's reporting at the job level, right? So that works for the overall pipeline duration, but not if I have a few areas of my DAG that I want to keep an eye on that don't map to jobs.
y
One particular op had 500 ERROR-level log lines. Find the op, group the log lines by error message.
Again, similar to other monitoring cases, sensors could be one approach. Also, you could configure your own logger for this use case. Here is an example: https://docs.dagster.io/concepts/logging/loggers#customizing-loggers
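As a rough idea of what the grouping could look like, here is a standard-library logging handler that counts ERROR-level messages per op. It is a sketch under assumptions: the op_name attribute on the log record and the ErrorGroupingHandler class are hypothetical, and wiring this into a Dagster custom logger is not shown.

```python
import logging
from collections import Counter, defaultdict


class ErrorGroupingHandler(logging.Handler):
    """Counts ERROR-and-above log messages, grouped by op and message text."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.counts = defaultdict(Counter)

    def emit(self, record):
        # op_name is an assumed attribute, supplied here through `extra=`.
        op_name = getattr(record, "op_name", "unknown")
        self.counts[op_name][record.getMessage()] += 1

    def top_errors(self, op_name, n=5):
        """Return the n most frequent error messages for one op."""
        return self.counts[op_name].most_common(n)


handler = ErrorGroupingHandler()
log = logging.getLogger("grouping_example")
log.propagate = False
log.addHandler(handler)

log.error("API timeout", extra={"op_name": "fetch_widgets"})
log.error("API timeout", extra={"op_name": "fetch_widgets"})
log.error("bad payload", extra={"op_name": "fetch_widgets"})
log.warning("slow response", extra={"op_name": "fetch_widgets"})  # below ERROR, not counted
```

Calling handler.top_errors("fetch_widgets") then gives the messages for that op ordered by frequency.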
Right, for now, factory floor is showing info at the job level.
d
hey @Mark Fickett 👋 we use a custom cronitor resource + a custom rollbar logger to do the things you're referring to
the rollbar logger is responsible for forwarding logs to rollbar for downstream grouping, analysis, etc. and the cronitor resource is implemented to ping cronitor on job run + complete and also is attached as a run status failure sensor on each job
so that we can get high level job metrics + notifications
z
you can get a basic view of execution times for an op by going to a job overview containing the op, then clicking on the op in the graph, it'll populate a tab on the right with a graph of execution times
m
Thanks @dwall I'll check out Cronitor, glad to know it integrates well with Dagster. Yes, I've enjoyed the per-op execution time trend graph!
y
If you’d want things like an aggregated view of op duration, you can get info from the event db. e.g. the duration of an op is
timestamp of STEP_SUCCESS - timestamp of STEP_START
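A minimal sketch of that subtraction, assuming the events have already been pulled out of the event db into plain dicts. The keys run_id, step_key, event_type and timestamp are an assumed flattened shape for illustration, not the actual storage schema.

```python
from collections import defaultdict


def step_durations(events):
    """Return {(run_id, step_key): seconds} for steps with both a start and a success."""
    starts = {}
    durations = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = (event["run_id"], event["step_key"])
        if event["event_type"] == "STEP_START":
            starts[key] = event["timestamp"]
        elif event["event_type"] == "STEP_SUCCESS" and key in starts:
            durations[key] = event["timestamp"] - starts[key]
    return durations


def mean_duration_by_step(durations):
    """Average the per-run durations for each step key."""
    grouped = defaultdict(list)
    for (_run_id, step_key), seconds in durations.items():
        grouped[step_key].append(seconds)
    return {step: sum(vals) / len(vals) for step, vals in grouped.items()}


# Hypothetical events for two runs of the same op.
events = [
    {"run_id": "r1", "step_key": "fetch", "event_type": "STEP_START", "timestamp": 100.0},
    {"run_id": "r1", "step_key": "fetch", "event_type": "STEP_SUCCESS", "timestamp": 112.5},
    {"run_id": "r2", "step_key": "fetch", "event_type": "STEP_START", "timestamp": 200.0},
    {"run_id": "r2", "step_key": "fetch", "event_type": "STEP_SUCCESS", "timestamp": 207.5},
]
durations = step_durations(events)
averages = mean_duration_by_step(durations)
```

Summing durations over the steps that make up a notional "stage" would give a per-stage figure in the same way.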
d
the big value add we get from cronitor currently is the notifications for pipeline failures and missed executions. not only can we tell cronitor to let us know if a job failed, we can also configure it to let us know if a job hasn't run when it should have. this helps us catch things like misconfigured jobs, etc.
outside of that, most high-level performance metrics you can already get to within the factory floor
y
Thanks folks on this thread for sharing your approaches/ideas. I’ve created a GitHub Discussion here, hoping to collect more ideas from the community about alerting and monitoring. I also hope the discussion format will help public discoverability (which Slack isn’t a good fit for). Please feel free to leave a comment there about your approaches!
Questions in the thread have been surfaced to GitHub Discussions for future discoverability:
• https://github.com/dagster-io/dagster/discussions/7159
• https://github.com/dagster-io/dagster/discussions/7163