Hi Team, i am getting below error while launching...
# deployment-ecs
a
Hi team, I am getting the error below while launching a job from the Launchpad:
dagster._core.errors.DagsterUserCodeUnreachableError: Timed out waiting for call to user code GET_EXTERNAL_EXECUTION_PLAN [1ee4043d-0527-4a7b-931b-46a5e4604d8a]
However, my task on ECS is in the running state 🙂 When I did some research, it led me to this property. Where can I increase the startup timeout when using a hybrid deployment on AWS ECS?
DAGSTER_CODE_SERVER_STARTUP_TIMEOUT
The error is intermittent. It only occurs when I launch from the Launchpad and the job has not been launched from there before; if the pool is hot, it runs without any issues. Agent version: v1.2.1
j
This is a dagster cloud hybrid deployment?
Is it a branch deployment or regular deployment?
a
It's a hybrid deployment, and a regular one.
https://github.com/dagster-io/dagster-cloud/blob/80a29970425c2104ab365182ab92e3ae4439ac4f/dagster-cloud/dagster_cloud/workspace/ecs/launcher.py#[…]3 This is exactly what is happening, as mentioned in the description:
"server_process_startup_timeout": Field(
                    IntSource,
                    is_required=False,
                    default_value=DEFAULT_SERVER_PROCESS_STARTUP_TIMEOUT,
                    description=(
                        "Timeout when waiting for a code server to be ready after it is created."
                        " You might want to increase this if your ECS tasks are successfully"
                        " starting but your gRPC server is timing out."
                    ),
j
I’m surprised it’s happening when you’re launching runs. I’d expect it to fail when you reload the code location
Is it loading properly? E.g. turning green on https://lundbeck.dagster.cloud/sand/locations
In any case to edit the server timeout, you’ll want to edit the cloud formation template for your agent: https://docs.dagster.io/dagster-cloud/deployment/agents/amazon-ecs/configuration-reference#per-deployment-configuration
a
Code locations are loading properly, no issues with them.
j
Got it. I don’t think that timeout setting will help then, it just applies when you’re loading the location
a
"Timeout when waiting for a code server to be ready after it is created."
                        " You might want to increase this if your ECS tasks are successfully"
                        " starting but your gRPC server is timing out."
The ECS task runs, but the UI shows a timeout and no run.
j
Right but the timeout occurs on later calls. That setting would apply if they were timing out here https://lundbeck.dagster.cloud/sand/locations
So the mission is to find out why later gRPC calls are timing out. Can you look at cpu/mem of the task?
a
This never happens with schedules; it only happens when I trigger the job manually.
The run ID never appears for the timed-out launch; it just disappears or never populates, but the task runs in ECS. If I launch another one manually right after it, it works fine.
j
Strange. Do you have server_ttl set in your agent CloudFormation template?
a
Yes, 300 seconds.
Is it due to the cold start of ECS?
j
Yes, I think that’s fairly aggressive because depending on how you are running ECS, the tasks can take over a minute to start back up. So you’d have that latency pretty much every time you start a run.
I’m double checking that the server_process_startup_timeout would increase the timeout there
a
This is the log from the ECS task that started in ECS but gave the error in the UI:
2023-08-08 14:11:50 +0000 - dagster.code_server - INFO - Started Dagster code server for package hackernews_api_example on port 4000 in process 1
The second run of the same job gives descriptive logs.
This task (container) runs for 7 minutes without any output; let me know if my interpretation is wrong.
j
The latter image is the grpc server task. Before we launch a run for a code location we start that task and query it. For the first run that fails, that query timed out. For the second run, it successfully queries then creates a run task (the first logs image)
Ah I understand what’s going on here. There’s a timeout of 45 seconds from the web tier waiting on the request to the grpc server to start, which it’s exceeding. That timeout isn’t currently configurable so I’ll raise this to the team
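To make the interaction concrete, here is a rough sketch of the timing as described in this thread. It is only a simplified model, not Dagster code: the function name and parameters are made up for illustration, and it assumes the code server is shut down once it has been idle for server_ttl seconds and that the web tier gives up after 45 seconds.

```python
def first_launch_times_out(
    idle_seconds: float,
    server_ttl_seconds: float,
    cold_start_seconds: float,
    web_timeout_seconds: float = 45.0,  # value quoted in this thread
) -> bool:
    """Simplified model: does a Launchpad request hit the web-tier timeout?

    Assumption: if the code server has been idle at least as long as the
    TTL, it was shut down and must cold start before it can answer.
    """
    server_is_warm = idle_seconds < server_ttl_seconds
    if server_is_warm:
        # A warm server answers right away in this model.
        return False
    # A cold server only answers after the ECS task has started back up.
    return cold_start_seconds > web_timeout_seconds


# Idle for 10 minutes with a 300 second TTL and a 70 second cold start.
print(first_launch_times_out(600, 300, 70))  # True
# Second click shortly afterwards: the server is warm again.
print(first_launch_times_out(10, 300, 70))   # False
```

This matches what was observed: the first manual launch after an idle period fails, and the launch right after it succeeds.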
In the meantime, I think the options are either remove the server ttl or keep doing 2 requests. Not ideal
a
Perfect, I can just hit the run button twice; it's fine for now.
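If the launch is ever scripted instead of clicked, the "do it twice" workaround could look roughly like the sketch below. Everything here is hypothetical: `request` stands in for whatever function triggers the launch, and `is_timeout` for however the caller recognizes the timeout error. No Dagster API is assumed.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_one_retry(
    request: Callable[[], T],
    is_timeout: Callable[[Exception], bool],
    pause_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`; if it fails with a timeout, wait and call it once more."""
    try:
        return request()
    except Exception as exc:
        if not is_timeout(exc):
            # Not the cold-start timeout, so do not hide the error.
            raise
    # The first attempt should have woken the code server; try again.
    sleep(pause_seconds)
    return request()


if __name__ == "__main__":
    attempts = []

    def fake_launch() -> str:
        # Stand-in for a real launch: fails the first time only.
        attempts.append(1)
        if len(attempts) == 1:
            raise TimeoutError("code server still starting")
        return "launched"

    result = call_with_one_retry(
        fake_launch,
        is_timeout=lambda exc: isinstance(exc, TimeoutError),
        sleep=lambda seconds: None,  # skip the real pause in this demo
    )
    print(result, len(attempts))
```

Retrying only once keeps a real failure from being masked: if the second attempt also fails, the error is raised to the caller.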