# dagster-plus
b
How would I define the ECS cpu/memory for the task definitions? I see these values are hardcoded in the template in the AgentTaskDefinition to 256/512, but the task definition for executing jobs is also 256/512 with no clear way to modify it. I would’ve assumed it’s a configuration in one of our yml files for deployment but don’t see that
d
Hi Ben - this can be configured at the job level following the example here: https://docs.dagster.io/deployment/guides/aws#customizing-cpu-and-memory-in-ecs (this should really be in the ECS agent docs as well - I'm planning to do a whole pass on them shortly; I think I saw you make a correct observation on an issue that they could use some fleshing out)
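For reference, the pattern from the linked docs looks roughly like this; the op and job names here are just placeholders:

```python
from dagster import job, op


@op
def do_work():
    ...


@job(
    tags={
        # Fargate-compatible CPU/memory pair, expressed as strings
        "ecs/cpu": "512",
        "ecs/memory": "1024",
    }
)
def my_job():
    do_work()
```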
b
oh interesting, we tried this but it didn’t do anything
maybe because we were trying it on define_asset_job instead of a @job decorator
d
Hm as long as the tags ended up on the resulting run, I'd expect it to still work. Do you have a link to a run in cloud?
b
Hi Daniel! Ben and I work together. Here is the UID for a run that we tried to change the memory/cpu for: f1e96562-53c7-427c-9073-1ad1373abe87 We set the memory and cpu in our asset job definition by doing the following since we’re not using an @job decorator:
from dagster import define_asset_job

define_asset_job(
    "name",
    tags={
        "ecs/cpu": "512",
        "ecs/memory": "1024",
    },
)
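One way to confirm the tags actually landed on the run (besides checking the run's tags in Dagit) is to log them from an asset. A minimal sketch, assuming a recent Dagster version with `AssetExecutionContext`; the asset name is made up:

```python
from dagster import AssetExecutionContext, asset


@asset
def tag_check(context: AssetExecutionContext) -> None:
    # The launched run's tags should include "ecs/cpu" and "ecs/memory"
    # if the job-level tags propagated correctly.
    context.log.info(f"run tags: {context.run.tags}")
```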
b
It also feels a little strange to control the server’s size on a per-job basis when the task definition is getting created in the deployment cycle, as opposed to the jobs that are running. Maybe it would imply the required autoscaling or something by summing up these individual job requirements?
d
Ah it's possible we're talking about different things - what exactly do you mean by "the task definition for executing jobs"? I was referring to the task that gets spun up when the run launches
In that case the task definition is also registered/created during the run launch (but we do some checks to avoid creating the same task definition over and over again)
are you talking about a different task definition?
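This isn't the agent's actual code, just a sketch of that register-or-reuse pattern using boto3; the family name, container name, and image are placeholders:

```python
import boto3

ecs = boto3.client("ecs")


def ensure_task_definition(family: str, image: str, cpu: str, memory: str) -> str:
    """Register a task definition for a run, reusing the active revision if it matches."""
    try:
        existing = ecs.describe_task_definition(taskDefinition=family)["taskDefinition"]
        container = existing["containerDefinitions"][0]
        if (
            existing.get("cpu") == cpu
            and existing.get("memory") == memory
            and container["image"] == image
        ):
            # An equivalent revision already exists; skip re-registering it.
            return existing["taskDefinitionArn"]
    except ecs.exceptions.ClientException:
        pass  # no active revision for this family yet

    response = ecs.register_task_definition(
        family=family,
        requiresCompatibilities=["FARGATE"],
        networkMode="awsvpc",
        cpu=cpu,
        memory=memory,
        containerDefinitions=[{"name": "run", "image": image, "essential": True}],
    )
    return response["taskDefinition"]["taskDefinitionArn"]
```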
b
Our ECS has two services - one for the agent and one for task execution. I understand the agent one is hardcoded in the CloudFormation stack because it needs very little compute, but I can’t figure out how to increase the compute of that task execution one. Within the task execution service, there is only one task definition (which shows the same .25/.5 compute config). So maybe my comment is really about services instead of tasks, though the service is defined by one task definition (container), so the terminology is confusing
d
So that second service doesn't actually execute your runs - it loads your code and serves metadata about your jobs to appear in Dagit, and it does things like execute schedule and sensor code. Each run executes in its own ECS task, using a different task definition than the one backing either of those services
I suspect if you go look at the ECS Task Arn for that run you sent me (It appears early in the event logs), then look at its task definition in your cluster, you'll see your desired resource limits applied to that task definition
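If you'd rather check that from a script than the console, something like the following should show the limits on that run's task definition; the cluster name and task ARN are placeholders taken from the run's event log:

```python
import boto3

ecs = boto3.client("ecs")

# Placeholders: your agent's cluster and the Task ARN from the run's event log
CLUSTER = "my-dagster-cluster"
TASK_ARN = "arn:aws:ecs:us-east-1:123456789012:task/abc123"

task = ecs.describe_tasks(cluster=CLUSTER, tasks=[TASK_ARN])["tasks"][0]
task_def = ecs.describe_task_definition(
    taskDefinition=task["taskDefinitionArn"]
)["taskDefinition"]

# For Fargate these come back as strings, e.g. "512" and "1024" if the run tags applied
print(task_def["cpu"], task_def["memory"])
```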
b
oooh, so that initial load time on each of the runs is it spinning up an ECS task - which is why it takes like 20-30 sec or so
d
That's right
The tradeoff there is isolation between different runs (in exchange for some startup cost)
b
yeah, that’s a cool concept - we don’t have to worry about tasks hogging resources from others on the same machine. Is it normal for it to take at least 30 seconds to start a run? Soon we’re going to be introducing 1-minute cron schedules, so that may be a problem
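For context, a schedule like the one being described, wired to an asset job with the ECS tags from earlier (names are placeholders); with ~20-30s of Fargate startup per run, a large chunk of each 1-minute interval is spent provisioning before the run executes:

```python
from dagster import ScheduleDefinition, define_asset_job

assets_job = define_asset_job(
    "assets_job",
    tags={"ecs/cpu": "512", "ecs/memory": "1024"},
)

# Fires every minute; each tick launches its own ECS task, so the Fargate
# startup cost is paid on every run.
every_minute_schedule = ScheduleDefinition(
    job=assets_job,
    cron_schedule="* * * * *",
)
```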
d
Sadly that's a somewhat known issue with ECS Fargate 😕 We're looking into adding ECS EC2 support quite soon which has faster startup times (but requires you to manage the instances more)
Here's a github issue with an angry mob of developers complaining about fargate task startup time https://github.com/aws/containers-roadmap/issues/696
b
got it, makes sense - hopefully we aren’t going to rack up those data transfer fees if it’s all in the same region 🤞
at the bottom of that GitHub discussion they mentioned setting the ECS_IMAGE_PULL_BEHAVIOR configuration to ‘prefer-cached’ to speed things up. Is that something you guys set, or no? https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-config.html
d
I think that might only apply to ECS on EC2?
it's listed under "Container image pull behavior for Amazon EC2 launch types"
But once we support EC2, then we should definitely look into that
b
just attempted it by adding it as an env variable to the docker run and, as expected, it didn’t change anything - either because it doesn’t apply to Fargate, or maybe because the tasks are getting torn down and rebuilt, so the cache isn’t available on subsequent runs?
d
Yeah, I think that's what the title of the issue is referring to