How does Dagster handle in-progress runs if the Docker container restarts mid-run?
1. Does it restart the runs from the first op?
2. Does it restart the run from the same op where the execution was interrupted? If so, does it store the data/output from the previous op even after the container outage?
3. Does it do nothing?
03/01/2022, 4:04 PM
Hi Anoop - which Docker container are you referring to here? Is this a Docker container spun up by the run launcher for the run (e.g. using the DockerRunLauncher)? Or the Docker container that your Dagster system components (dagit etc.) are running in?
03/01/2022, 4:05 PM
It's the Docker container that the Dagster service is running in.
03/01/2022, 4:07 PM
Our recommendation to make your runs resilient to dagit going down is to put each run in its own container using the DockerRunLauncher (or some other run launcher) - that way, dagit going down won't affect any in-progress runs.
In the future we'll also have some run monitoring features that will be able to pick things up where they left off even if you're using the default run launcher, which runs in the same container as dagit, but we don't currently have that.
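For reference, here is a rough sketch of the `run_launcher` section that would go in `dagster.yaml` to use the DockerRunLauncher, written as a Python dict so it can be checked before it's written out. The network name and the environment variable name are placeholders, not values from this thread, and the output is JSON, which should parse as YAML for a simple structure like this. Check the dagster-docker docs for the options your version supports.

```python
import json

# Sketch of the dagster.yaml section that selects the DockerRunLauncher.
# "dagster_network" and "DAGSTER_POSTGRES_PASSWORD" are placeholder values;
# substitute whatever your own deployment uses.
instance_config = {
    "run_launcher": {
        "module": "dagster_docker",
        "class": "DockerRunLauncher",
        "config": {
            # Docker network the launched run containers should join.
            "network": "dagster_network",
            # Environment variables forwarded into each run container.
            "env_vars": ["DAGSTER_POSTGRES_PASSWORD"],
        },
    },
}


def render(config: dict) -> str:
    """Serialize the config as indented JSON."""
    return json.dumps(config, indent=2)


rendered = render(instance_config)
print(rendered)
```

Since each run then lives in its own container, the run containers need to reach the same run and event storage as dagit, which in practice usually means a shared database rather than local files.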
03/01/2022, 4:11 PM
Even with the DockerRunLauncher, what if the separate Docker container executing the run goes down for any reason? How is that handled currently?
03/01/2022, 4:15 PM
The work-in-progress run monitoring features will be able to start up a new container and pick up where the failed run left off. Right now it's available in Docker if your run is using the docker_executor to launch each op in its own container, and we're in the process of expanding it to include all run launchers and executors as well - cc @johann who knows the latest state of that project.
Instructions on how to enable it for docker runs are here: https://github.com/dagster-io/dagster/blob/master/CHANGES.md#experimental-9
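As a rough illustration of what enabling it looks like, the sketch below adds a `run_monitoring` section to an instance config held as a Python dict. The helper `with_run_monitoring` is hypothetical, and the `max_resume_run_attempts` key and its value of 3 are assumptions that may not match your Dagster version, so treat the linked changelog entry as the authority. The job itself still needs to use the docker_executor, as described above.

```python
import copy
import json


def with_run_monitoring(instance_config: dict, max_resume_attempts: int = 3) -> dict:
    """Return a copy of the config with a run_monitoring section added.

    Hypothetical helper for illustration only. The key names mirror the
    dagster.yaml run_monitoring section as best understood; verify them
    against the changelog for the version you are running.
    """
    updated = copy.deepcopy(instance_config)
    updated["run_monitoring"] = {
        "enabled": True,
        # Assumed setting: how many times a crashed run may be resumed.
        "max_resume_run_attempts": max_resume_attempts,
    }
    return updated


# Minimal starting config; a real one also carries storage settings.
base = {"run_launcher": {"module": "dagster_docker", "class": "DockerRunLauncher"}}
monitored = with_run_monitoring(base)
print(json.dumps(monitored, indent=2))
```

The helper copies the config rather than mutating it, so the original dict can still be written out unchanged if monitoring turns out not to be wanted.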