What would be an ideal way to sleep during a task ...
# ask-community
i
What would be an ideal way to sleep during a task execution? We have loads of tasks handled by external services, which take some time to complete (typically in a range of 30-90 minutes). Currently, we just block inside the op’s execution for 60 seconds (aka sleep), then ask the external system for the current status of the task, and do so in a loop until it’s completed or the timeout is reached. There’s really no useful work going on here; all we do is occupy resources on the underlying k8s cluster. So I wonder if there’s a way to essentially release those resources (kill an op for some time), restart it after 10 minutes (for example), and progress the DAG execution that way.
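Roughly, what we do today looks like this (simplified sketch; `get_external_status()` is just a stand-in for our real status call):

```python
import time

from dagster import Failure, op

POLL_INTERVAL_SECONDS = 60
TIMEOUT_SECONDS = 90 * 60  # our external tasks typically take 30-90 minutes


@op
def wait_for_external_task(context):
    deadline = time.time() + TIMEOUT_SECONDS
    while time.time() < deadline:
        # get_external_status() is a placeholder for the real status call.
        if get_external_status() == "COMPLETED":
            return
        # No useful work happens here: the pod just sits and waits,
        # holding its resources on the k8s cluster.
        time.sleep(POLL_INTERVAL_SECONDS)
    raise Failure("External task did not complete before the timeout")
```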
c
I had a similar design problem where the jobs would have to submit and wait on an AWS Batch task that could take a few hours to complete. I eventually ended up breaking the Dagster job up into two parts: one to submit the task and one to do the final processing once the Batch task was finished. The handshake in between the two is the task outputting an SQS message with the needed info and a Dagster sensor/schedule polling the queue and executing the second half of the job.
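Roughly, the sensor half looked something like this (queue URL, job name, and payload shape are placeholders, not the real thing):

```python
import json

import boto3
from dagster import RunRequest, sensor

# Placeholder queue URL; the real one comes from infra config.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/batch-task-done"


@sensor(job_name="finalize_batch_results")  # placeholder name of the "second half" job
def batch_done_sensor(context):
    sqs = boto3.client("sqs")
    response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
    for message in response.get("Messages", []):
        payload = json.loads(message["Body"])
        yield RunRequest(
            run_key=payload["task_id"],  # de-dupes runs for the same Batch task
            run_config={"ops": {"finalize": {"config": payload}}},  # shape depends on the job
        )
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```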
i
In my case there isn’t really any work to do after the job completes. It just means that one of the downstream tasks now has fewer dependencies to be fulfilled by other upstream tasks. And splitting the job into several sub-jobs with sensors isn’t a good option for me either: we currently have >500 tasks, many with multiple upstream dependencies. Imagine maintaining that monster 😄
I’ve seen the RetryRequested(max_retries, seconds_to_wait) exception in the documentation, but sadly this doesn’t kill the underlying k8s pod; the sleep just moves to a higher level.
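i.e. something like this (a rough sketch; `get_external_status()` is again a stand-in for our real status call):

```python
from dagster import RetryRequested, op


@op
def wait_for_external_task(context):
    # get_external_status() is a placeholder for the real status call.
    if get_external_status() != "COMPLETED":
        # Re-schedules the step after the delay, but the waiting happens one
        # level up in the run, so the underlying k8s pod isn't released.
        raise RetryRequested(max_retries=20, seconds_to_wait=600)
```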
Found an ugly(?) solution: it seems like I can actually fail the step itself and control the restart behaviour via the .spec.backoffLimit option of the k8s Job. Kubernetes will keep restarting the job, eventually every 6 minutes (the maximum back-off, unconfigurable), until .spec.backoffLimit is reached; after that the job is considered failed. Not sure the devops guys will be happy about that 😄
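Something like the following, via the per-op dagster-k8s config tag (I’m not 100% sure about the exact key names, so treat this as a sketch; `get_external_status()` is a stand-in as before):

```python
import sys

from dagster import op


@op(
    tags={
        # dagster-k8s merges this tag into the k8s Job created for the step
        # (when each op runs as its own Job, e.g. with the k8s_job_executor).
        # Key names are my best guess, not verified.
        "dagster-k8s/config": {
            "job_spec_config": {"backoff_limit": 10},
        }
    }
)
def wait_for_external_task(context):
    # get_external_status() is a placeholder for the real status call.
    if get_external_status() != "COMPLETED":
        # Exit non-zero so k8s sees a failed pod and restarts the Job on its
        # capped exponential back-off, up to backoffLimit attempts.
        sys.exit(1)
```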
Nah, that would be horrible; what happens if I actually need to fail a job 😄
c
Oh yeah, that’s actually a problem I ran into with this method; having multiple jobs provide a “lazy fan-in” trigger to execute the next step is still something I haven’t found a good solution for.
s
I think the pattern for this in Airflow is called "deferrable operators", where the job itself would essentially bake in an ephemeral sensor, pausing for some amount of time before resuming and polling. I'm not aware of a great solution for it in Dagster yet, except for the sensor "check" method @Cameron Gallivan mentioned.