# dagster-plus
b
How would I define the ECS cpu/memory for the task definitions? I see these values are hardcoded in the template in the AgentTaskDefinition to 256/512, but the task definition for executing jobs is also 256/512 with no clear way to modify it. I would’ve assumed it’s a configuration in one of our yml files for deployment but don’t see that
d
Hi Ben - this can be configured at the job level following the example here: https://docs.dagster.io/deployment/guides/aws#customizing-cpu-and-memory-in-ecs (this should really be in the ECS agent docs as well - I'm planning to do a whole pass on them shortly; I think I saw you make a correct observation on an issue that they could use some fleshing out)
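For reference, the pattern from the linked docs looks roughly like this; the op and job names here are just placeholders:

```python
from dagster import job, op


@op
def do_work():
    ...


@job(
    tags={
        # Fargate-compatible CPU/memory pair, expressed as strings
        "ecs/cpu": "512",
        "ecs/memory": "1024",
    }
)
def my_job():
    do_work()
```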
b
oh interesting, we tried this but it didn’t do anything
maybe because we were trying it on define_asset_job instead of a @job decorator
d
Hm as long as the tags ended up on the resulting run, I'd expect it to still work. Do you have a link to a run in cloud?
b
Hi Daniel! Ben and I work together. Here is the UID for a run that we tried to change the memory/cpu for: f1e96562-53c7-427c-9073-1ad1373abe87 We set the memory and cpu in our asset job definition by doing the following since we’re not using an @job decorator:
from dagster import define_asset_job

define_asset_job(
    "name",
    tags={
        "ecs/cpu": "512",
        "ecs/memory": "1024",
    },
)
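One way to confirm the tags actually landed on the run (besides checking the run's tags in Dagit) is to log them from an asset. A minimal sketch, assuming a recent Dagster version with `AssetExecutionContext`; the asset name is made up:

```python
from dagster import AssetExecutionContext, asset


@asset
def tag_check(context: AssetExecutionContext) -> None:
    # The launched run's tags should include "ecs/cpu" and "ecs/memory"
    # if the job-level tags propagated correctly.
    context.log.info(f"run tags: {context.run.tags}")
```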
b
It also feels a little strange to control the server’s size on a per-job basis when the task definition is getting created in the deployment cycle, as opposed to the jobs that are running. Maybe it would imply the required autoscaling or something by summing up these individual job requirements?
d
Ah it's possible we're talking about different things - what exactly do you mean by "the task definition for executing jobs"? I was referring to the task that gets spun up when the run launches
In that case the task definition is also registered/created during the run launch (but we do some checks to avoid creating the same task definition over and over again)
are you talking about a different task definition?
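This isn't the agent's actual code, just a sketch of that register-or-reuse pattern using boto3; the family name, container name, and image are placeholders:

```python
import boto3

ecs = boto3.client("ecs")


def ensure_task_definition(family: str, image: str, cpu: str, memory: str) -> str:
    """Register a task definition for a run, reusing the active revision if it matches."""
    try:
        existing = ecs.describe_task_definition(taskDefinition=family)["taskDefinition"]
        container = existing["containerDefinitions"][0]
        if (
            existing.get("cpu") == cpu
            and existing.get("memory") == memory
            and container["image"] == image
        ):
            # An equivalent revision already exists; skip re-registering it.
            return existing["taskDefinitionArn"]
    except ecs.exceptions.ClientException:
        pass  # no active revision for this family yet

    response = ecs.register_task_definition(
        family=family,
        requiresCompatibilities=["FARGATE"],
        networkMode="awsvpc",
        cpu=cpu,
        memory=memory,
        containerDefinitions=[{"name": "run", "image": image, "essential": True}],
    )
    return response["taskDefinition"]["taskDefinitionArn"]
```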
b
Our ECS has two services - one for the agent and one for task execution. I understand the agent one is hardcoded in the CloudFormation stack because it needs very little compute, but I can’t figure out how to increase the compute of that task execution one. Within the task execution service, there is only one task definition (which shows the same .25/.5 compute config). So maybe my comment is really about services instead of tasks, though the service is defined by one task definition (container), so the terminology is confusing
d
So that second service doesn't actually execute your runs - it loads your code and serves metadata about your jobs to appear in Dagit, and it does things like execute schedule and sensor code. Each run executes in its own ECS task, using a different task definition than the one backing either of those services
I suspect if you go look at the ECS Task Arn for that run you sent me (It appears early in the event logs), then look at its task definition in your cluster, you'll see your desired resource limits applied to that task definition
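If you'd rather check that from a script than the console, something like the following should show the limits on that run's task definition; the cluster name and task ARN are placeholders taken from the run's event log:

```python
import boto3

ecs = boto3.client("ecs")

# Placeholders: your agent's cluster and the Task ARN from the run's event log
CLUSTER = "my-dagster-cluster"
TASK_ARN = "arn:aws:ecs:us-east-1:123456789012:task/abc123"

task = ecs.describe_tasks(cluster=CLUSTER, tasks=[TASK_ARN])["tasks"][0]
task_def = ecs.describe_task_definition(
    taskDefinition=task["taskDefinitionArn"]
)["taskDefinition"]

# For Fargate these come back as strings, e.g. "512" and "1024" if the run tags applied
print(task_def["cpu"], task_def["memory"])
```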
b
oooh, so that initial load time on each of the runs is it spinning up an ECS task - which is why it takes like 20-30 sec or so
d
That's right
The tradeoff there is isolation between different runs (in exchange for some startup cost)
b
yeah, that’s a cool concept - we don’t have to worry about tasks hogging resources from others on the same machine. Is it normal for it to take at least 30 seconds to start a run? Soon we’re going to be introducing 1-minute cron schedules, so that may be a problem
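For context, a schedule like the one being described, wired to an asset job with the ECS tags from earlier (names are placeholders); with ~20-30s of Fargate startup per run, a large chunk of each 1-minute interval is spent provisioning before the run executes:

```python
from dagster import ScheduleDefinition, define_asset_job

assets_job = define_asset_job(
    "assets_job",
    tags={"ecs/cpu": "512", "ecs/memory": "1024"},
)

# Fires every minute; each tick launches its own ECS task, so the Fargate
# startup cost is paid on every run.
every_minute_schedule = ScheduleDefinition(
    job=assets_job,
    cron_schedule="* * * * *",
)
```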
d
Sadly that's a somewhat known issue with ECS Fargate 😕 We're looking into adding ECS EC2 support quite soon which has faster startup times (but requires you to manage the instances more)
Here's a github issue with an angry mob of developers complaining about fargate task startup time https://github.com/aws/containers-roadmap/issues/696
b
got it, makes sense - hopefully we aren’t going to rack up those data transfer fees if it’s all in the same region 🤞
at the bottom of that GitHub discussion they mentioned setting the ECS_IMAGE_PULL_BEHAVIOR configuration to ‘prefer-cached’ to speed things up. Is that something you guys set, or no? https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-config.html
d
I think that might only apply to ECS on EC2?
it's listed under "Container image pull behavior for Amazon EC2 launch types"
But once we support EC2, then we should definitely look into that
b
just attempted it by adding it as an env variable to the docker run and, as expected, it didn’t change anything - either because it doesn’t apply to Fargate, or maybe because the tasks are getting torn down and rebuilt, so the cache isn’t available on subsequent runs?
d
Yeah, I think that's what the title of the issue is referring to