Rich Schiavi

12/02/2019, 8:33 PM
Apologies if I missed it in the docs: is there a notion of a timeout/heartbeat for a Solid? I.e., if we have long-running tasks that need to keep sending a heartbeat/alive signal, and that signal stops, mark the Task as failed and reschedule it.


12/02/2019, 8:59 PM
This is a great question. So we do not have this capability yet, but working on dagster executors is a big priority for us at the moment so I would love to learn more about your use case. There are two workarounds I can suggest. The first would be building a heartbeat resource which gets plugged into solids that can be used for "retrying". The second option could be to use the Dask executor because timeouts/heartbeats are a first class citizen there. However, I haven't really played around with heartbeats on dask/dagster before so there be dragons.
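A rough sketch of what the first workaround (a heartbeat object handed to long-running work) could look like. This is plain standard-library Python, not a dagster API: the `Heartbeat` class, its method names, and the `clock` parameter are all hypothetical, and wiring it into a solid as a resource is left out.

```python
import threading
import time


class Heartbeat:
    """Hypothetical helper: remembers when a task last reported in."""

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        # max_silence_seconds: how long the task may stay quiet before
        # it is considered dead. clock is injectable so it can be faked.
        self._max_silence = max_silence_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._last_beat = clock()

    def beat(self):
        """Called by the long-running task whenever it makes progress."""
        with self._lock:
            self._last_beat = self._clock()

    def is_alive(self):
        """Called by a supervisor to decide whether to retry the task."""
        with self._lock:
            return self._clock() - self._last_beat <= self._max_silence
```

The task would call `beat()` inside its processing loop, and whatever supervises it would poll `is_alive()` and trigger a retry once it returns `False`.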

Rich Schiavi

12/02/2019, 10:35 PM
For the use case, we have tasks that take an indeterminate amount of time. A heartbeat lets us know they are still processing. If we just scheduled, say, a very long timeout, and the task died a few seconds in, we'd be very inefficient on retries, versus, say, a 60-second heartbeat timeout that would have stopped and let us know to reschedule that task. This is similar to AWS Step Functions: `"TimeoutSeconds": 300, "HeartbeatSeconds": 60`. For our use case, we could schedule a one-day "Timeout" (excessive) but 60-second heartbeats.
I saw the heartbeat option in dagster_dask, but was unclear on how it's used. Are there any examples that show that option?
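To make the timeout vs. heartbeat distinction concrete, here is a minimal sketch of the check a supervisor might run. The function name, argument names, and status strings are made up for illustration; it only mirrors the general idea behind the two settings and is not taken from dagster, Dask, or AWS.

```python
def task_status(started_at, last_heartbeat_at, now,
                timeout_seconds, heartbeat_seconds):
    """Classify a task from its timestamps (all values in seconds)."""
    # Too long since the task last reported in: treat it as dead,
    # no matter how much of the overall timeout is left.
    if now - last_heartbeat_at > heartbeat_seconds:
        return "heartbeat_missed"
    # Still reporting in, but over the total allowed run time.
    if now - started_at > timeout_seconds:
        return "timed_out"
    return "running"


ONE_DAY = 24 * 60 * 60  # the "excessive" one-day timeout

# Task died 5 seconds in; about a minute later the supervisor notices.
print(task_status(0, 5, 70, ONE_DAY, 60))       # heartbeat_missed
# Task is an hour in and reported 10 seconds ago.
print(task_status(0, 3590, 3600, ONE_DAY, 60))  # running
```

With only the one-day timeout, the first case would not be detected until the full day had elapsed.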


12/03/2019, 6:27 PM
So in dagster the execution substrate is pluggable. In the default case we execute in process. This is useful for testing but obviously not what you want for lots of long running jobs. This is where
comes in. It provides Dask as an alternative executor, which has its own configuration for its cluster-based execution model. So Dask is what you will be configuring to manage heartbeats etc. Some more details are here