# dagster-ecs

Jeremy Fisher (07/23/2021, 7:34 PM)

Tiri Georgiou (07/23/2021, 8:26 PM)
```bash
#!/bin/bash
echo ECS_CLUSTER=your_cluster_name >> /etc/ecs/ecs.config
```
^^ have you added this into your launch_template resource?
This needs to be the name of the ECS cluster you created

Jeremy Fisher (07/23/2021, 8:27 PM)
This one?
```hcl
// Autoscaling configs
// Launch an EC2
resource "aws_launch_configuration" "ecs_dagster_launch_config" {
  name_prefix     = "ecs_dagster_launch_config"
  image_id        = "ami-00129b193dc81bc31"
  instance_type   = "t2.small"
  security_groups = [aws_security_group.jeremy_sg.id]
  // Must be the same name as the ecs-cluster.name created
  user_data = <<EOF
              #!/bin/bash
              echo ECS_CLUSTER=${var.ecs_dagster_cluster} >> /etc/ecs/ecs.config
              EOF
  // Updates require destroying the original resource
  lifecycle {
    create_before_destroy = true
  }
  depends_on = [
    aws_db_instance.pg
  ]
}
```
I have `var.ecs_dagster_cluster` set to `"ecs_dagster_cluster"`

Tiri Georgiou (07/23/2021, 8:29 PM)
so your ecs_cluster resource should look something like …
```hcl
resource "aws_ecs_cluster" "dagster" {
  name = var.ecs_dagster_cluster
  // other stuff..
}
```

Jeremy Fisher (07/23/2021, 8:29 PM)
Actually, I uncommented these lines:
```hcl
capacity_provider_strategy {
  capacity_provider = aws_ecs_capacity_provider.dagster_cp.name
  weight            = 100
}
```
That seemed to make the "No Container Instances were found in your cluster" error disappear, but now the task seems to be stuck in "provisioning"

Tiri Georgiou (07/23/2021, 8:31 PM)
Give it some time to load the task

Jeremy Fisher (07/23/2021, 8:31 PM)
Ok

Tiri Georgiou (07/23/2021, 8:31 PM)
provisioning can take 5-10 mins
depends on the infra being loaded, the size of the containers, etc.

Jeremy Fisher (07/23/2021, 8:31 PM)
Thanks for your help 🙂
My ecs_cluster resource seems to be written correctly
```hcl
resource "aws_ecs_cluster" "dagster" {
  // Name must be the same given under ecs_dagster_launch_config.user_data
  name = var.ecs_dagster_cluster
}
```

Tiri Georgiou (07/23/2021, 8:33 PM)
^^ Yeah, that's the right convention
in practice you might want to set up a capacity provider, which manages your autoscaling group
not sure if you’ve set that up

Jeremy Fisher (07/23/2021, 8:35 PM)
Are you referring to the `aws_ecs_capacity_provider.dagster_cp` resource?

Tiri Georgiou (07/23/2021, 8:36 PM)
yeah haha I couldn't remember if I had sent those TF modules over
but yes, that's the one

Jeremy Fisher (07/23/2021, 8:36 PM)
lol
I'm new to devops, the best I can do is adapt code from people who actually know what they're doing 😅

Tiri Georgiou (07/23/2021, 8:37 PM)
autoscaling group will look something like this…
```hcl
resource "aws_autoscaling_group" "dagster_asg" {
  name                = "${var.infra_role}-asg-${var.infra_env}"
  vpc_zone_identifier = var.subnet_private_ids

  launch_template {
    id      = aws_launch_template.launch_template.id
    version = aws_launch_template.launch_template.latest_version
  }
  force_delete              = true
  desired_capacity          = var.desired_capacity
  min_size                  = var.desired_capacity
  max_size                  = var.desired_capacity * 2
  health_check_grace_period = 60
  health_check_type         = "EC2"
  default_cooldown          = 10

  // This prevents all instances running tasks from being terminated during scale-in
  protect_from_scale_in = var.protect_from_scale_in

  lifecycle {
    create_before_destroy = true
  }

  tag {
    key                 = "AmazonECSManaged"
    value               = ""
    propagate_at_launch = true
  }
}
```

Jeremy Fisher (07/23/2021, 8:38 PM)
Ok, looks like you've added stuff to that
I'm going to do a clean `terraform destroy && terraform apply` and try that

Tiri Georgiou (07/23/2021, 8:39 PM)
And the capacity provider also needs a strategy, which can be set up with the cluster…
```hcl
################################################
//-------------- ECS CLUSTER ----------------//
################################################

resource "aws_ecs_cluster" "dagster" {
  // Name must be the same given under ecs_dagster_launch_config.user_data
  name               = var.ecs_dagster_cluster
  capacity_providers = [aws_ecs_capacity_provider.dagster_cp.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.dagster_cp.name
    weight            = 1
    base              = 1
  }

  tags = {
    Name        = "data-${var.infra_env}-${var.infra_role}"
    Role        = var.infra_role
    Environment = var.infra_env
    ManagedBy   = "terraform"
  }
}

####################################################
// ------- Autoscaling capacity provider -------- //
####################################################

resource "aws_ecs_capacity_provider" "dagster_cp" {
  name = "${var.infra_role}-cp-${var.infra_env}"

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.dagster_asg.arn
    managed_termination_protection = var.managed_termination_protection

    managed_scaling {
      minimum_scaling_step_size = 1
      maximum_scaling_step_size = 5
      instance_warmup_period    = 10
      status                    = "ENABLED"
      target_capacity           = var.target_capacity
    }
  }
  tags = {
    Name        = "data-${var.infra_env}-${var.infra_role}"
    Role        = var.infra_role
    Environment = var.infra_env
    ManagedBy   = "terraform"
  }
}
```
with capacity providers, you have to manually destroy the cluster before `terraform destroy` will go through
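(For reference, a minimal sketch of that manual teardown, assuming the cluster is named after `var.ecs_dagster_cluster`; the cluster's container instances may need to be drained or deregistered first.)
```bash
# Detach the capacity providers, then delete the cluster so terraform destroy can proceed
aws ecs put-cluster-capacity-providers --cluster ecs_dagster_cluster \
  --capacity-providers [] --default-capacity-provider-strategy []
aws ecs delete-cluster --cluster ecs_dagster_cluster
```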

Jeremy Fisher (07/23/2021, 8:40 PM)
good to know!

Tiri Georgiou (07/23/2021, 8:41 PM)
Cool, sorry I can't share the whole project I built because it's associated with my work, but happy to help

Jeremy Fisher (07/23/2021, 8:41 PM)
I really appreciate it
Is there a reason that you replaced the `aws_launch_configuration` resource with the `aws_launch_template` one?

Tiri Georgiou (07/23/2021, 9:13 PM)
> We recommend that you create Auto Scaling groups from launch templates to ensure that you're accessing the latest features and improvements.
Just following best practices really
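(A minimal sketch of the equivalent `aws_launch_template`, reusing the names from the snippets above; note that, unlike `aws_launch_configuration`, its `user_data` must be base64-encoded.)
```hcl
resource "aws_launch_template" "launch_template" {
  name_prefix   = "ecs_dagster_launch_template"
  image_id      = "ami-00129b193dc81bc31" // same ECS-optimized AMI as the launch configuration
  instance_type = "t2.small"

  vpc_security_group_ids = [aws_security_group.jeremy_sg.id]

  // Register the instance with the cluster; launch templates expect base64-encoded user data
  user_data = base64encode(<<EOF
#!/bin/bash
echo ECS_CLUSTER=${var.ecs_dagster_cluster} >> /etc/ecs/ecs.config
EOF
  )

  lifecycle {
    create_before_destroy = true
  }
}
```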

Jeremy Fisher (07/23/2021, 10:15 PM)
This is weird--it repeatedly tries to provision the task but fails after ten minutes
Looks like it's also causing a CapacityProviderReservation alarm
Do I need to increase the `target_capacity` or `desired_capacity` variables? I have those both set to 2

Tiri Georgiou (07/23/2021, 11:12 PM)
`desired_capacity` I set to 1, with max being 2 to allow some scaling
`target_capacity` I set to 100%
that way it doesn't keep spare resources around and fully utilises 1 task = 1 container instance
The error is most probably down to your task definition/containers
Have you tested it locally?
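(Expressed as Terraform variable defaults, that advice might look like the sketch below; the variable names are the ones assumed in the earlier snippets.)
```hcl
variable "desired_capacity" {
  description = "Container instances the ASG keeps running (its max_size is desired_capacity * 2)"
  default     = 1
}

variable "target_capacity" {
  description = "Percent utilisation the capacity provider targets; 100 keeps no spare instances"
  default     = 100
}
```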

Jeremy Fisher (07/23/2021, 11:18 PM)
I've been developing it using the docker-compose example
Is that what you mean by local?
Hmm, poking around the logs gives me this
Seems suspicious, but I don't see anything in the documentation about an `environment.yml`

Tiri Georgiou (07/24/2021, 11:03 AM)
yeah, it seems to be targeting an environment.yml file which is obviously not found
hmm, you would need your dagster configs to be in a dagster.yaml
then define a dagster path (DAGSTER_HOME)
then, depending on whether you've set up a grpc server as one of your containers, define this inside the workspace.yaml file
Something like this
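(As an illustration, a workspace.yaml pointing Dagit at a gRPC server container might look like this; the host, port, and location name are placeholders.)
```yaml
load_from:
  - grpc_server:
      host: docker-example-pipelines # the grpc container's name on the task's network
      port: 4000
      location_name: example_pipelines
```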

Jeremy Fisher (07/26/2021, 5:01 PM)
I think the issue is that the containers aren't even starting. The containers table is stuck in a "loading" state
The log from earlier was part of a completely different project 😅... CloudWatch doesn't seem to be picking up any logs from ECS
I finally get some relevant logs by ssh'ing into the ECS-created EC2 instances
```
level=info time=2021-07-26T18:35:29Z msg="Creating root ecs cgroup: /ecs" module=init_linux.go
level=info time=2021-07-26T18:35:29Z msg="Creating cgroup /ecs" module=cgroup_controller_linux.go
level=info time=2021-07-26T18:35:29Z msg="Loading state!" module=state_manager.go
level=info time=2021-07-26T18:35:29Z msg="Event stream ContainerChange start listening..." module=eventstream.go
level=error time=2021-07-26T18:35:30Z msg="Error getting valid credentials: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors" module=agent.go
level=info time=2021-07-26T18:35:30Z msg="Registering Instance with ECS" module=agent.go
level=info time=2021-07-26T18:35:30Z msg="Remaining mem: 1993" module=client.go
level=error time=2021-07-26T18:35:30Z msg="Unable to register as a container instance with ECS: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors" module=client.go
```
I think the instance isn't getting the `aws_iam_role_policy_attachment.ecs_dagster` role attached to it
Is that something that needs to be specified in the launch template?
Ah, okay--I need to add
```hcl
iam_instance_profile {
  name = aws_iam_instance_profile.ecs_dagster.name
}
```
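(A sketch of the supporting IAM resources that block assumes; the profile name and role reference are hypothetical, but the ECS agent does need something like the AmazonEC2ContainerServiceforEC2Role managed policy to register instances.)
```hcl
resource "aws_iam_instance_profile" "ecs_dagster" {
  name = "ecs-dagster-instance-profile"
  role = aws_iam_role.ecs_dagster.name // EC2-assumable role, assumed to be defined elsewhere
}

resource "aws_iam_role_policy_attachment" "ecs_dagster" {
  role       = aws_iam_role.ecs_dagster.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}
```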
Now I have container instances but it's still stuck at PROVISIONING
I just needed to destroy and re-create the infrastructure!
Ok, one more naive question: once the containers are up and running, how do I connect to them from my local machine?
I've configured the launch template to allow ingress from my IP address on port 22 and port 3000. I think that applies to the container instance, but when I run `curl "$container_instance_ip_address":3000`, I just get
```
curl: (7) Failed to connect to --.---.---.-- port 3000: Connection refused
```
Ah, it looks like it's failing to start
```
[ec2-user@ip-10-38-31-120 ~]$ docker ps -a
CONTAINER ID        IMAGE                                                                      COMMAND                  CREATED              STATUS                            PORTS               NAMES
b341b1b0ed16        211883317150.dkr.ecr.us-east-1.amazonaws.com/ecr_dagit:latest              "dagster-daemon run"     40 seconds ago       Exited (1) 35 seconds ago                             ecs-ecs_dagster_cluster-task-3-daemon-fc98e88183fcec8a2b00
6f54d42f87ee        211883317150.dkr.ecr.us-east-1.amazonaws.com/ecr_dagit:latest              "dagit -h 0.0.0.0 -p…"   40 seconds ago       Exited (1) 33 seconds ago                             ecs-ecs_dagster_cluster-task-3-dagit-b093acb295d6c2a4ae01
658259970bf4        211883317150.dkr.ecr.us-east-1.amazonaws.com/ecr_dagster_pipeline:latest   "dagster api grpc -h…"   40 seconds ago       Exited (137) 4 seconds ago
```
I think the issue is that the database is called "postgres", not "podpoint", and also that the AWS credentials in the daemon/dagit are not configured
```
botocore.exceptions.NoRegionError: You must specify a region.
```

Tiri Georgiou (07/27/2021, 8:40 AM)
yeah, remember to set up an RDS Postgres DB to handle your runs/logs etc
and have that all configured as env variables in your dagster.yaml
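(An illustrative dagster.yaml fragment wiring run storage to Postgres through env variables; the env var names are placeholders, and event_log_storage and schedule_storage take the same postgres_db block.)
```yaml
run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      username:
        env: DAGSTER_PG_USERNAME
      password:
        env: DAGSTER_PG_PASSWORD
      hostname:
        env: DAGSTER_PG_HOST # the RDS endpoint
      db_name:
        env: DAGSTER_PG_DB
      port: 5432
```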

Jeremy Fisher (07/27/2021, 1:47 PM)
It seems the daemon and grpc server both start, but dagit fails
```
/usr/local/lib/python3.8/site-packages/dagster/core/workspace/context.py:485: UserWarning: Error loading repository location example_pipelines:grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "DNS resolution failed for service: docker-example-pipelines:4000"
        debug_error_string = "{"created":"@1627393453.385512190","description":"Resolver transient failure","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":1360,"referenced_errors":[{"created":"@1627393453.385510578","description":"DNS resolution failed for service: docker-example-pipelines:4000","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc","file_line":359,"grpc_status":14,"referenced_errors":[{"created":"@1627393453.385478570","description":"C-ares status is not ARES_SUCCESS qtype=A name=docker-example-pipelines is_balancer=0: Could not contact DNS servers","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc","file_line":724}]}]}"
>
```
Somehow I've managed to deploy dagit but it can't access the grpc server
Should I use `network_mode = "awsvpc"`?
It finally deploys with `network_mode = "host"` and setting the workspace.yaml host to `localhost`. This works!
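(In the task definition, that setting would look something like the sketch below; the family name and container-definitions path are hypothetical.)
```hcl
resource "aws_ecs_task_definition" "dagster" {
  family = "ecs_dagster_task" // hypothetical family name
  // host networking binds container ports straight to the instance's ENI,
  // so dagit can reach the grpc server on localhost
  network_mode          = "host"
  container_definitions = file("${path.module}/containers.json") // placeholder path
}
```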
According to the docs:
> host — The task bypasses Docker's built-in virtual network and maps container ports directly to the ENI of the Amazon EC2 instance hosting the task. As a result, you can't run multiple instantiations of the same task on a single Amazon EC2 instance when port mappings are used.

Would this be an issue?
Looks like it's working 🙌 thanks @Tiri Georgiou

Tiri Georgiou (07/28/2021, 7:35 AM)
No problem!