# deployment-ecs
k
Hey everyone - I have been working on getting a Dagster deployment working using docker compose and ECS. When I attempt a rolling update, the Stack hangs out in an UPDATE_IN_PROGRESS status for quite some time (about 20 minutes so far). I'm wondering if this is typical based on anyone else's experience. Thanks in advance.
j
I’ve seen the docker compose tooling do that when a Service gets stuck in a loop (it brings a Task up, the Task fails, it tries to bring a new Task up). Can you check your ECS cluster to see if something is repeatedly not able to start? Maybe because it can’t pull an image or because when it runs the container, something is erroring? I’ve gotten around it by setting the desired count for the Service back to 0 which stops the crash loop.
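For anyone who wants to script that check, here is a rough sketch in Python. It assumes you pass in a boto3 ECS client (`boto3.client("ecs")`); the function names `summarize_stopped_tasks` and `scale_to_zero` and the cluster/service names are made up for illustration. It only uses the `list_tasks`, `describe_tasks`, and `update_service` calls, and it looks at a single page of stopped tasks, so treat it as a starting point rather than a finished tool.

```python
from collections import Counter


def summarize_stopped_tasks(ecs, cluster, service):
    """Count the reasons ECS recorded for a service's recently stopped tasks.

    `ecs` is assumed to be a boto3 ECS client. Only the first page of
    results (up to 100 tasks) is inspected.
    """
    arns = ecs.list_tasks(
        cluster=cluster, serviceName=service, desiredStatus="STOPPED"
    )["taskArns"]
    if not arns:
        return Counter()

    tasks = ecs.describe_tasks(cluster=cluster, tasks=arns)["tasks"]
    reasons = Counter()
    for task in tasks:
        # Task-level reason, e.g. "Essential container in task exited".
        reasons[task.get("stoppedReason", "unknown")] += 1
        # Container-level reason, e.g. an image pull error, when present.
        for container in task.get("containers", []):
            if container.get("reason"):
                reasons[f"{container['name']}: {container['reason']}"] += 1
    return reasons


def scale_to_zero(ecs, cluster, service):
    """Set the service's desired count to 0 to break a crash loop."""
    ecs.update_service(cluster=cluster, service=service, desiredCount=0)
```

Usage would look something like `summarize_stopped_tasks(boto3.client("ecs"), "my-cluster", "my-service").most_common()`, where both names are placeholders. If the same reason shows up over and over, that is probably the loop.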
k
I think that must be what's happening. Setting the desired task count to 0 for each service let the stack complete and I now see the changes in the task definition that I was expecting. I don't have a health check defined in my compose file. Could that be what's leading to this? The container likely doesn't even know it's unhealthy...
j
That could be the case. I actually don’t know if the problem is with ECS not being able to recognize that the container is unhealthy or with CloudFormation not being able to recognize that the new ECS Service will never reach its desired count.
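One way to narrow down which side is confused might be to look at what ECS itself reports for the service's deployments. A rough sketch, assuming a boto3 ECS client is passed in; `deployment_status` and `looks_stuck` are hypothetical helper names, and the "stuck" rule is just a heuristic I made up (primary deployment has failed tasks and fewer running than desired), not something ECS or CloudFormation defines.

```python
def deployment_status(ecs, cluster, service):
    """Return a compact summary of each deployment ECS reports for a service.

    `ecs` is assumed to be a boto3 ECS client.
    """
    response = ecs.describe_services(cluster=cluster, services=[service])
    if not response.get("services"):
        raise LookupError(f"service {service!r} not found in cluster {cluster!r}")
    svc = response["services"][0]
    return [
        {
            "status": d.get("status"),          # e.g. PRIMARY or ACTIVE
            "rollout": d.get("rolloutState"),   # may be absent for some deployments
            "desired": d.get("desiredCount") or 0,
            "running": d.get("runningCount") or 0,
            "failed": d.get("failedTasks") or 0,
        }
        for d in svc.get("deployments", [])
    ]


def looks_stuck(deployments):
    """Heuristic only: the PRIMARY deployment has failures and is under its desired count."""
    return any(
        d["status"] == "PRIMARY" and d["failed"] > 0 and d["running"] < d["desired"]
        for d in deployments
    )
```

If ECS shows a growing failed-task count while the stack is still updating, that would at least suggest ECS sees the failures and the stack is simply waiting on the service.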
k
I have no idea either, but I suspect it's the former, because when we tried to deploy without docker compose (by writing a task definition JSON and uploading it to ECS), it knew the service was unhealthy and tried to restart. With this new docker compose deployment, it showed the dagit and daemon services had 1 task running each. So I would guess that when CloudFormation tried to update in place, it just sat there waiting for the service to say "okay I'm ready" while the service was actually in a death loop.
☝️ Total guess, because this is the first time I've used docker compose.
I also just realized that this ECS cluster is in a different VPC than the back end RDS, which is probably what's causing both dagit and daemon to hang on an error. I probably have a few things going wrong here 😬
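A quick way to test the VPC theory would be a plain TCP check run from something inside the cluster's network, such as a one-off task or a shell in the container. This is only a sketch: `can_reach` is a made-up helper, and the hostname and port in the usage note are placeholders, not values from this deployment.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `can_reach("my-db.example.internal", 5432)` with your actual RDS endpoint and port (5432 assumes Postgres). A `False` from inside the cluster would point at routing or security groups between the two VPCs rather than at Dagster itself.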