#ask-community

Ben Jordan

04/06/2022, 5:57 PM
Hi Dagster, I have a question about Docker and AWS ECS - I successfully followed the example in GitHub and was running jobs, but after rebuilding the Docker images I get an error:
botocore.errorfactory.ClientException: An error occurred (ClientException) when calling the RunTask operation: ECS was unable to assume the role 'arn:aws:iam::<aws_account_id>:role/portal-DaemonTaskRole-1SLWIFK9HLFR0'
which seems as though it wants to pass a role that doesn't exist in IAM (there's a different `DaemonTaskRole` there now) - is this correct? Should this new role be created for the run or is there a cached role_id somewhere? Thanks for all your help!

johann

04/07/2022, 1:58 PM
This error is when you launch a run?
Or when you spin up the daemon?

Ben Jordan

04/07/2022, 1:58 PM
Yes, using EcsRunLauncher
All containers appear to load successfully, the daemons report success in Dagit
It seems like the run launcher is not getting updated secrets from AWS

johann

04/07/2022, 2:08 PM
I’m not the most familiar with ECS, but my understanding is that we specify
- Effect: "Allow"
  Action:
    - "iam:PassRole"
on dagit and the daemon so that the run tasks that we spin up will have the same role
I’m guessing `role/portal-DaemonTaskRole` is what you specified for the daemon?

Ben Jordan

04/07/2022, 2:10 PM
Yes, I have that in the definition. "portal" is the name I specified. New roles are generated as expected, but the role passed to the EcsRunLauncher is never one of the new ones: every run uses the same expired, nonexistent role name
I've rebuilt the containers and pushed updates to ECR but the role name is always the same when I try to launch a run
I can, however, launch runs with the example project from your GitHub on the same AWS account
j

johann

04/07/2022, 2:11 PM
Strange. Would you be able to find the minimal change to the example project that breaks it?

Ben Jordan

04/07/2022, 2:17 PM
Sure, I'd be happy to work through that - the `docker-compose.yml` files are extremely similar, any idea where to start?

johann

04/07/2022, 2:22 PM
What IAM roles are getting assigned for the example?

Ben Jordan

04/07/2022, 2:26 PM
I see the same roles, with the exception of an extra role for PostgresqlTaskExecutionRole (I use RDS for run storage)
• DaemonTaskExecutionRole-XXXXXXXXXX
• DaemonTaskRole-XXXXXXXXXX
• DagitTaskExecutionRole-XXXXXXXXXX
• DagitTaskRole-XXXXXXXXXX
• UsercodeTaskExecutionRole-XXXXXXXXXX
New roles are generated when the containers spin up (as expected) but the runs do not inherit the new roles
OK, well, I spun up the example and loaded it with the same `docker context` as my code, and got the same error
I rebuilt the example with a different `docker context` and the incorrect role ARN is the same, so it does not appear to be related to the context. Next I will delete the ECR repositories and recreate them to see if that works
@johann I rebuilt the containers and replaced the definitions in ECR; trying to launch a run results in the same error with the same (incorrect) ARN specified. Where does `EcsRunLauncher` get its secrets from?

johann

04/07/2022, 6:05 PM
Which secrets?

Ben Jordan

04/07/2022, 6:06 PM
Well, the AWS role names... Working down another lead, I found that the Task Definitions in ECS contain the incorrect role names that are causing problems. Not sure why these are not refreshed on a new build, but it might be some indication of what's happening
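A minimal sketch of how one might list the role ARNs stored on active task definitions, to compare them against what IAM currently has. It assumes boto3 and working AWS credentials; the helper names and the `family` argument are hypothetical and not part of Dagster or the example project.

```python
from typing import Dict, Iterator, Optional


def role_name_from_arn(role_arn: str) -> str:
    """Return the role name, i.e. the last path segment of an IAM role ARN."""
    return role_arn.rsplit("/", 1)[-1]


def roles_in_task_definition(task_def: dict) -> Dict[str, Optional[str]]:
    """Pick out the two role ARNs that ECS stores on a task definition."""
    return {
        "taskRoleArn": task_def.get("taskRoleArn"),
        "executionRoleArn": task_def.get("executionRoleArn"),
    }


def iter_active_task_definitions(family: str) -> Iterator[dict]:
    """Yield every ACTIVE task definition revision in the given family.

    Needs boto3 and AWS credentials; boto3 is imported lazily so the two
    helpers above stay usable without it.
    """
    import boto3

    ecs = boto3.client("ecs")
    paginator = ecs.get_paginator("list_task_definitions")
    for page in paginator.paginate(familyPrefix=family, status="ACTIVE"):
        for arn in page["taskDefinitionArns"]:
            yield ecs.describe_task_definition(taskDefinition=arn)["taskDefinition"]
```

Printing `roles_in_task_definition(td)` for each result of `iter_active_task_definitions(...)` would show whether a definition still points at a role name that IAM no longer has.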
Appreciate you looking into this @johann - I think I've found the problem, at least high-level. When a run is launched from the run launcher, it creates a new ECS TaskDefinition and Task. When the run completes the Task ends but the TaskDefinition remains active. Following a teardown (using `docker compose` as in the example), this child TaskDefinition is still active. As the roles are provisioned when the instance is built, a subsequent rebuild creates a new set of roles with new ARNs. When a new run is launched, the orphaned TaskDefinition with the expired ARN generates a new Task with the error I mentioned previously:
botocore.errorfactory.ClientException: An error occurred (ClientException) when calling the RunTask operation: ECS was unable to assume the role
I was able to get a new run to complete by deregistering the orphaned TaskDefinition - after this, a new run still uses this TaskDefinition, but as a new revision with the current, correct ARNs. It seems to me that `docker compose down` should deregister this TaskDefinition, as it does for the other services (dagit, daemon...)
Reproduction with the `deploy_ecs` example:
• Build and push the example containers
• `docker compose up`
• Launch a run
• `docker compose down`
• Check that the ECS TaskDefinition is still active: `user_code`
• `docker compose up`
• Compare the ARNs in IAM vs the TaskDefinition above
• Launch a run - it fails with the botocore error
• Deregister the TaskDefinition
• Launch a run - check the TaskDefinition to see the ARN has updated (you can compare to previous revisions as well)
• The Task now completes as the correct ARN is passed to it
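The "compare ARNs, then deregister" steps could be scripted roughly as below. This is only a sketch under assumptions: the function names are hypothetical, `task_defs` is expected to be a list of dicts shaped like the `taskDefinition` field returned by boto3's `describe_task_definition`, and `ecs_client` is expected to be a `boto3.client("ecs")`.

```python
from typing import Iterable, List, Set


def stale_task_definitions(task_defs: Iterable[dict], existing_role_arns: Set[str]) -> List[str]:
    """Return ARNs of task definitions that reference a role not in existing_role_arns."""
    stale = []
    for task_def in task_defs:
        # A task definition may omit either role, so drop missing values.
        referenced = {task_def.get("taskRoleArn"), task_def.get("executionRoleArn")} - {None}
        if referenced - existing_role_arns:
            stale.append(task_def["taskDefinitionArn"])
    return stale


def deregister(task_definition_arns: Iterable[str], ecs_client) -> List[str]:
    """Deregister each task definition revision and return the ARNs handled.

    ecs_client is assumed to be boto3.client("ecs"); it is passed in so the
    function can be exercised with a stand-in object.
    """
    handled = []
    for arn in task_definition_arns:
        ecs_client.deregister_task_definition(taskDefinition=arn)
        handled.append(arn)
    return handled
```

The set of existing role ARNs could be collected from the `Arn` field of each role returned by IAM's `list_roles` paginator. Running something like this after `docker compose down` would approximate the cleanup suggested above, though it is a workaround rather than a fix in the run launcher or in the compose teardown.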

johann

04/08/2022, 3:25 PM
Ah, that makes sense. Thank you for investigating!
Would you mind making a GH issue for this?

Ben Jordan

04/08/2022, 3:25 PM
will do