# announcements
a
Are there any primitives in dagster I can use to respond to celery worker failures and restarts?
I have dagster set up in a k8s environment, with dagit passing jobs to celery workers with rabbitmq. Just as a test, I simulated a failure by killing and restarting the celery pod while a long-running solid was being executed by the worker.
The outcome of this test: dagit thinks the pipeline is still running, and the new celery worker pod did not re-execute the solid.
Ideally, I would use dagster primitives to seamlessly restart failed jobs. But I wouldn't mind building them if need be. I just wanted to know what the recommendation was. (I couldn't find past chats on slack on this topic)
m
@nate
a
i'm thinking maybe the correct way to do this is at the celery/rabbitmq level: make sure celery doesn't ack a task message in the broker until the task is complete. i don't know what the celery defaults are, or whether dagster changes them, so I'm looking into that at the moment
if my reading of the celery documentation is right, I think a combination of enabling `task_reject_on_worker_lost` and `task_acks_late` would address this: https://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-task_acks_late
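for reference, a minimal sketch of what those two settings look like at the raw celery level (the app name and broker URL here are just illustrative):

```python
from celery import Celery

# illustrative app name and broker URL; substitute your rabbitmq endpoint
app = Celery("dagster", broker="pyamqp://guest@rabbitmq:5672//")

app.conf.update(
    # ack the task message only after the task finishes, instead of
    # right before execution (celery's default is early acking)
    task_acks_late=True,
    # requeue the message if the worker process dies mid-task,
    # so a restarted worker pod will pick the task up again
    task_reject_on_worker_lost=True,
)
```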
but i can't pass these arguments to celery via dagster, can I?
m
you should be able to pass celery arguments through the dagster engine config
a
ah lemme take a look
m
it's the `config_source` argument to `CeleryConfig`. an awful name, but there's inconsistent naming in the underlying celery apis
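so in the run config it would look something like this (a sketch, assuming the celery executor is enabled in the pipeline's mode and that `config_source` entries are merged straight into the underlying celery app config):

```python
# run config passed at execution time (e.g. in dagit's playground);
# keys under config_source use the names from the celery docs above
run_config = {
    "execution": {
        "celery": {
            "config": {
                "config_source": {
                    "task_acks_late": True,
                    "task_reject_on_worker_lost": True,
                }
            }
        }
    }
}
```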
i think there are certain kinds of retry/acks-late behavior you should be able to configure at the celery level
there's a broader issue which is that right now there's no heartbeating for dagster execution steps, so across the various execution paths there are ways to get a pipeline into an eternally running state that probably shouldn't exist
but yes the intention with the celery engine is to give you as much flexibility as possible when configuring the underlying celery cluster
a
ok, `config_source` works perfectly. I saw it before but it just slipped my mind, and enabling the two celery options makes this test case pass. of course, there are implications here: all solids must now be idempotent, and we'll have to make sure re-executed side effects won't hurt us, but that's good practice anyway.
re: heartbeats and eternal pipelines, perhaps one mitigation would be to at least have a cli command that terminates all currently running pipelines. that sledgehammer could be run in the entrypoint script of a dagster container, to kill anything still marked as running before starting a new run. but maybe it's a bit too coarse
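a hypothetical sketch of that sledgehammer, assuming instance APIs along the lines of `get_runs` with a status filter and `report_run_failed` (names and import paths may differ between dagster versions):

```python
# hypothetical helper: mark every run dagit thinks is running as failed.
# this only rewrites run state in the event log; it does not kill any
# live celery worker processes.
from dagster import DagsterInstance, PipelineRunStatus
from dagster.core.storage.pipeline_run import PipelineRunsFilter

def fail_all_running_pipelines():
    instance = DagsterInstance.get()  # resolved from $DAGSTER_HOME
    running = instance.get_runs(
        filters=PipelineRunsFilter(statuses=[PipelineRunStatus.STARTED])
    )
    for run in running:
        instance.report_run_failed(run)

if __name__ == "__main__":
    fail_all_running_pipelines()
```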
i'll be interested in discussions around this topic, especially with the upcoming 0.8 release
in any case, thanks for the assist!
m
yep, though with a variety of execution engines, we will also need to persist information about how to go find the pipelines we want to terminate, etc
i suspect we'll be looking at this as we plan for 0.9.0
so starting next week 🙂