#dagster-support

Jose Uribe

05/23/2022, 5:36 PM
Hello again. Looking for advice on a specific situation. We currently use dagster_slack to alert on errors raised during execution, but I'm wondering what options are available for being alerted when the scheduler fails to launch a task at all. For example, we recently resolved an issue that was preventing our runs from starting, which meant our failure-based alerting didn't make us aware of the problem until someone checked the Dagster run histories. The error in question is as follows:
dagster.core.scheduler.scheduler.DagsterSchedulerError: Unable to reach the user code server for schedule daily_ingestion. Schedule will resume execution once the server is available.
  File "/root/.pyenv/versions/3.6.10/lib/python3.6/site-packages/dagster/scheduler/scheduler.py", line 363, in launch_scheduled_runs_for_schedule
    ) from e
The above exception was caused by the following exception:
dagster.core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
  File "/root/.pyenv/versions/3.6.10/lib/python3.6/site-packages/dagster/scheduler/scheduler.py", line 356, in launch_scheduled_runs_for_schedule
    debug_crash_flags,
🤖 1
👀 1

rex

05/24/2022, 5:05 AM
Hey! This is a problem that we thought through when we implemented alerting in Dagster Cloud 🙂 Basically, this is the “who watches the watchers” problem. As you’ve noticed, using user code to monitor user code will not work if the user code can't execute in the first place. In open source, one way to alleviate this is to have an external system monitor your Dagster components: Dagit, the daemon, and your user code deployments. I’ll assume that you’re deploying on Kubernetes. If any of those components fails, your user code will also fail to execute. So one solution is to have Datadog or Prometheus monitor these Kubernetes deployments for uptime, so that you're notified of the class of errors related to uptime.
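As a rough illustration of the external-monitor idea, here is a minimal sketch of a reachability check that runs outside Dagster, so it still works when user code cannot execute. The service names and ports in `COMPONENTS` are hypothetical placeholders, not values from this thread, and a plain TCP probe is only a stand-in for what Datadog or Prometheus would do against the Kubernetes deployments. The daemon does not expose a port, so it is not covered here.

```python
import socket
from typing import Callable, Dict, List, Tuple

Endpoint = Tuple[str, int]

# Hypothetical endpoints; replace with the hosts/ports of your own deployment.
COMPONENTS: Dict[str, Endpoint] = {
    "dagit": ("dagit.example.internal", 3000),
    "user-code-server": ("user-code.example.internal", 4000),
}


def tcp_reachable(endpoint: Endpoint, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def find_unreachable(
    components: Dict[str, Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_reachable,
) -> List[str]:
    """Return the names of components whose endpoint fails the probe."""
    return [name for name, endpoint in components.items() if not probe(endpoint)]


def main() -> int:
    """Print unreachable components and return a non-zero code if any are down."""
    down = find_unreachable(COMPONENTS)
    for name in down:
        print(f"ALERT: {name} is unreachable")
    return 1 if down else 0
```

Something like `main()` could be run from cron or a separate monitoring job, with the non-zero return value used to trigger whatever alerting channel you already have. An open port only shows that the process is listening, not that it is healthy, so a real setup would likely rely on the deployment-level checks mentioned above.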

Jose Uribe

05/27/2022, 2:52 PM
Makes sense! We're actually in the midst of building an improved watchdog system anyway, so that will meet our needs. Thanks for the response! I didn't get a notification for this somehow, and figured I would check to see if there was an update/response.
❤️ 1