#dagster-ecs
# dagster-ecs

rectalogic

04/06/2023, 4:01 PM
I have egress locked down in my ECS cluster. When I try to add a code location, the new task it spawns eventually fails with this error. This seems like a network issue rather than a permissions issue: it launches the new task 5 times, each attempt fails, and then it gives up. Is there some gRPC port or something that I need to allow egress on between the agent task and the code location task?
dagster_cloud.workspace.ecs.client.EcsServiceError: ECS service failed because task arn:aws:ecs:us-west-2:XXXX:task/dagster-cluster/9cd06ec3d77940d4a3032497b3b58561 failed: ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve secret from asm: service call has been retried 5 time(s): failed to fetch secret arn:aws:secretsmanager:us-west-2:XXXX:secret:XXXXDatascienceSecrets from secrets manager: RequestCanceled: request context canceled caused by: context deadline exceeded. Please check your task network configuration.

  File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1255, in _reconcile
    self._wait_for_new_server_ready(
  File "/dagster-cloud/dagster_cloud/workspace/ecs/launcher.py", line 359, in _wait_for_new_server_ready
    task_arn = self.client.wait_for_new_service(
  File "/dagster-cloud/dagster_cloud/workspace/ecs/client.py", line 405, in wait_for_new_service
    return self.check_service_has_running_task(
  File "/dagster-cloud/dagster_cloud/workspace/ecs/client.py", line 517, in check_service_has_running_task
    self._raise_failed_task(task, container_name, logger)
  File "/dagster-cloud/dagster_cloud/workspace/ecs/client.py", line 442, in _raise_failed_task
    raise EcsServiceError(task_arn=task_arn, stopped_reason=stopped_reason, logs=logs)
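
The error above reports a timeout while fetching a secret from Secrets Manager, so one way to narrow it down is a plain TCP probe on port 443. This is a minimal sketch, not part of the original thread: `can_reach` is a hypothetical helper, and the hostname in the usage note is an assumption (the regional Secrets Manager endpoint for the `us-west-2` region that appears in the ARN).

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration. It only tests TCP reachability;
    it says nothing about IAM permissions or TLS.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False
```

For example, `can_reach("secretsmanager.us-west-2.amazonaws.com")` returning `False` would be consistent with blocked egress. The result is only meaningful when the probe runs from inside the same subnets and security group as the failing task.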

daniel

04/06/2023, 4:03 PM
Here's the checklist we usually go through for ECS agent networking requirements when installing the agent in an existing VPC:
- The VPC needs to use route53 for DNS
    - You can verify this by looking at the DHCP option set on the VPC
- The VPC needs to have assign_hostnames enabled
- The "default" security group in the VPC needs the following rules
    - An ingress rule that allows traffic from other addresses within the default security group. This allows the agent and gRPC server to communicate with each other
    - Open egress from addresses in the security group to the internet. This allows the agent to communicate with Dagster Cloud
- (if using private subnets) The network ACL should allow the same rules as the security group, egress to the public internet and ingress from other hosts in the private subnet

How to check things:

- For the VPC DNS you can go to the VPC console, find the VPC the user wants and click on the DHCP option set
- For the security group go to the security groups section in the VPC console, filter for your VPC and find the one named "default"
- For the network ACLs, first find the subnet (also in the VPC console), then click on the tab for network ACLs
(if you use the cloudformation template that spins up a new VPC, it should create a cluster and VPC that has those properties out of the box)
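
The checklist above can also be checked programmatically. The following is a rough sketch, not something from the thread: the helper names are hypothetical, and the functions only inspect dictionaries shaped like the responses of boto3's `describe_dhcp_options`, `describe_vpc_attribute`, and `describe_security_groups` calls. Two readings are assumptions: "use route53 for DNS" is taken to mean the DHCP option set lists `AmazonProvidedDNS`, and "assign_hostnames" is taken to mean the VPC's `enableDnsHostnames` attribute.

```python
def uses_amazon_dns(dhcp_options: dict) -> bool:
    """Assumption: 'route53 for DNS' means domain-name-servers is AmazonProvidedDNS."""
    for cfg in dhcp_options.get("DhcpConfigurations", []):
        if cfg.get("Key") == "domain-name-servers":
            return any(v.get("Value") == "AmazonProvidedDNS" for v in cfg.get("Values", []))
    return False


def dns_hostnames_enabled(attr_response: dict) -> bool:
    """Assumption: 'assign_hostnames' corresponds to the enableDnsHostnames attribute."""
    return bool(attr_response.get("EnableDnsHostnames", {}).get("Value", False))


def allows_intra_group_ingress(group: dict) -> bool:
    """True if any ingress rule names the security group itself as the source."""
    group_id = group["GroupId"]
    return any(
        pair.get("GroupId") == group_id
        for perm in group.get("IpPermissions", [])
        for pair in perm.get("UserIdGroupPairs", [])
    )


def allows_internet_egress(group: dict, port: int = 443) -> bool:
    """True if an egress rule to 0.0.0.0/0 covers the given TCP port."""
    for perm in group.get("IpPermissionsEgress", []):
        to_anywhere = any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", []))
        if not to_anywhere:
            continue
        if perm.get("IpProtocol") == "-1":  # "-1" means all protocols and ports
            return True
        if perm.get("IpProtocol") == "tcp" and perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535):
            return True
    return False
```

Each function takes one already-fetched item, for example a single entry of `SecurityGroups` from `describe_security_groups`, so the checks can be tried against saved output without AWS credentials. Network ACLs are not covered by this sketch.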

rectalogic

04/06/2023, 4:05 PM
yeah, I'm using CDK to build my own based on that template, but locked down more. I think the security group rules you described are what I need to tweak
Is there a specific port I can lock this down to (443?) for this rule:
"Open egress from addresses in the Security Group to the internet, this allows the agent to communicate with Dagster Cloud"
Same question here, is there a gRPC port I can allow for this one:
"An ingress rule that allows traffic from other addresses within the default security group. this allows the agent and grpc server to communicate with each other"

daniel

04/06/2023, 4:10 PM
Egress to Dagster Cloud happens via HTTPS, so I think 443 is correct
I believe the port the gRPC server uses is 4000. I'm a little unclear whether that's exposed directly in the service discovery setup that we use, though
We think that allowing 443 for outbound and 4000 and 443 for communicating within the cluster should work for port restrictions, but let us know if that gives you any trouble
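
As a sketch of what those port restrictions could look like as security group rules: the helpers below are hypothetical, and they only build rule dictionaries in the `IpPermissions` shape that boto3's `authorize_security_group_ingress` and `authorize_security_group_egress` accept. Port 4000 is taken from the reply above, which itself is hedged, and `sg-...` IDs are placeholders.

```python
def tcp_rule_for_group(port: int, group_id: str) -> dict:
    """One TCP port, with another security group (or the same one) as the peer."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": group_id}],
    }


def tcp_rule_for_cidr(port: int, cidr: str = "0.0.0.0/0") -> dict:
    """One TCP port, with a CIDR block as the peer."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr}],
    }


def restricted_rules(group_id: str) -> dict:
    """Rules matching the thread: 443 outbound, 443 and 4000 within the group."""
    return {
        # Agent and code location tasks talk to each other inside the group.
        "ingress": [tcp_rule_for_group(p, group_id) for p in (443, 4000)],
        # 443 to anywhere already covers 443 inside the group; with egress
        # locked down, 4000 to the group needs its own rule.
        "egress": [tcp_rule_for_cidr(443), tcp_rule_for_group(4000, group_id)],
    }
```

With a boto3 EC2 client, the lists would be passed as `IpPermissions=rules["ingress"]` and `IpPermissions=rules["egress"]` together with `GroupId`. The same shape maps onto CDK security group rules, though that translation is not shown here.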

rectalogic

04/06/2023, 4:31 PM
thanks, I'll try it
nice, that worked - thanks