Hi Team i am getting below error while launching a job from dagster #deployment-ecs

ashish meshram

08/07/2023, 1:34 PM

Hi Team, I am getting the error below while launching a job from the launchpad:

dagster._core.errors.DagsterUserCodeUnreachableError: Timed out waiting for call to user code GET_EXTERNAL_EXECUTION_PLAN [1ee4043d-0527-4a7b-931b-46a5e4604d8a]

However, my task on ECS is in the running state 🙂. When I did some research, I ended up at this property. Where can I increase the startup timeout when using a hybrid deployment on AWS ECS?

DAGSTER_CODE_SERVER_STARTUP_TIMEOUT

The error is intermittent. It only occurs when I launch from the launchpad and the job has not been tried from the launchpad before; if the pool is hot, it runs without any issues. Agent version: v1.2.1

johann

08/07/2023, 11:13 PM

This is a dagster cloud hybrid deployment?

johann

08/07/2023, 11:13 PM

Is it a branch deployment or regular deployment?

ashish meshram

08/08/2023, 7:56 AM

It's a hybrid deployment, and regular.

ashish meshram

08/08/2023, 10:53 AM

https://github.com/dagster-io/dagster-cloud/blob/80a29970425c2104ab365182ab92e3ae4439ac4f/dagster-cloud/dagster_cloud/workspace/ecs/launcher.py#[…]3 This is exactly what is happening, as mentioned in the description:

"server_process_startup_timeout": Field(
                    IntSource,
                    is_required=False,
                    default_value=DEFAULT_SERVER_PROCESS_STARTUP_TIMEOUT,
                    description=(
                        "Timeout when waiting for a code server to be ready after it is created."
                        " You might want to increase this if your ECS tasks are successfully"
                        " starting but your gRPC server is timing out."
                    ),

johann

08/08/2023, 1:35 PM

I’m surprised it’s happening when you’re launching runs. I’d expect it to fail when you reload the code location

johann

08/08/2023, 1:35 PM

Is it loading properly? E.g. turning green on https://lundbeck.dagster.cloud/sand/locations

johann

08/08/2023, 1:36 PM

In any case, to edit the server timeout, you’ll want to edit the CloudFormation template for your agent: https://docs.dagster.io/dagster-cloud/deployment/agents/amazon-ecs/configuration-reference#per-deployment-configuration

ashish meshram

08/08/2023, 1:52 PM

Code locations are loading properly, no issues with them.

johann

08/08/2023, 1:53 PM

Got it. I don’t think that timeout setting will help then, it just applies when you’re loading the location

ashish meshram

08/08/2023, 1:56 PM

"Timeout when waiting for a code server to be ready after it is created."
                        " You might want to increase this if your ECS tasks are successfully"
                        " starting but your gRPC server is timing out."

The ECS task runs, however the UI shows a timeout and no run.

johann

08/08/2023, 1:58 PM

Right but the timeout occurs on later calls. That setting would apply if they were timing out here https://lundbeck.dagster.cloud/sand/locations

johann

08/08/2023, 1:58 PM

So the mission is to find out why later gRPC calls are timing out. Can you look at cpu/mem of the task?

ashish meshram

08/08/2023, 1:59 PM

This never happens with the schedule; it only happens when I trigger it manually.

ashish meshram

08/08/2023, 2:00 PM

The run ID does not appear for the timed-out one; it just disappears/never populates, but the task runs in ECS. If I run another one manually right after it, it does fine.

johann

08/08/2023, 2:01 PM

Strange. Do you have server_ttl set in your agent CloudFormation?

ashish meshram

08/08/2023, 2:01 PM

yes,

ashish meshram

08/08/2023, 2:02 PM

300 seconds

ashish meshram

08/08/2023, 2:08 PM

Is it due to the cold start of ECS?

johann

08/08/2023, 2:08 PM

Yes, I think that’s fairly aggressive because depending on how you are running ECS, the tasks can take over a minute to start back up. So you’d have that latency pretty much every time you start a run.

johann

08/08/2023, 2:08 PM

I’m double checking that the server_process_startup_timeout would increase the timeout there

ashish meshram

08/08/2023, 2:13 PM

This is the log from the ECS task that gave the error in the UI but started in ECS:

2023-08-08 14:11:50 +0000 - dagster.code_server - INFO - Started Dagster code server for package hackernews_api_example on port 4000 in process 1

ashish meshram

08/08/2023, 2:16 PM

2nd run of the same job, giving descriptive logs

ashish meshram

08/08/2023, 2:34 PM

this task (container) will run for 7 minutes without any output, let me know if my interpretation is wrong

johann

08/08/2023, 2:43 PM

The latter image is the grpc server task. Before we launch a run for a code location we start that task and query it. For the first run that fails, that query timed out. For the second run, it successfully queries then creates a run task (the first logs image)

johann

08/08/2023, 2:46 PM

Ah I understand what’s going on here. There’s a timeout of 45 seconds from the web tier waiting on the request to the grpc server to start, which it’s exceeding. That timeout isn’t currently configurable so I’ll raise this to the team
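
A minimal sketch of the timing described here, assuming the code server is shut down once it has been idle longer than server_ttl (300 seconds in this thread) and that a cold ECS start can take longer than the 45-second web-tier timeout. The function names and the 70-second cold-start figure are hypothetical, chosen only to illustrate "over a minute"; this is a toy model, not Dagster's implementation.

```python
# Values taken from the thread.
WEB_TIER_TIMEOUT_S = 45   # web-tier wait on the gRPC request (not configurable at the time)
SERVER_TTL_S = 300        # server_ttl set in the agent's CloudFormation


def server_is_warm(idle_seconds: float, ttl_seconds: float = SERVER_TTL_S) -> bool:
    """Assumption: the code server stays up until it has been idle for ttl_seconds."""
    return idle_seconds < ttl_seconds


def launch_would_time_out(
    idle_seconds: float,
    cold_start_seconds: float,
    ttl_seconds: float = SERVER_TTL_S,
    web_timeout_seconds: float = WEB_TIER_TIMEOUT_S,
) -> bool:
    """Return True if a launchpad request would exceed the web-tier timeout."""
    if server_is_warm(idle_seconds, ttl_seconds):
        # Server is already running, so there is no cold-start latency.
        return False
    # Server was stopped by the TTL; the request has to wait for a cold start.
    return cold_start_seconds > web_timeout_seconds


# Hypothetical cold start of 70 seconds ("over a minute").
print(launch_would_time_out(idle_seconds=600, cold_start_seconds=70))  # first manual run
print(launch_would_time_out(idle_seconds=10, cold_start_seconds=70))   # run right after it
```

Under these assumptions the first manual launch after a long idle period times out, while a second launch shortly afterwards finds the server already warm, which matches the behavior reported above.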

johann

08/08/2023, 2:47 PM

In the meantime, I think the options are either remove the server ttl or keep doing 2 requests. Not ideal

ashish meshram

08/09/2023, 10:33 AM

Perfect, I can just hit the run button twice, it's fine for now.
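
For anyone scripting launches instead of clicking, the "hit the run button twice" workaround could be sketched as a simple retry wrapper. Everything here is hypothetical: `launch` stands for whatever callable submits the run in your setup, and `UserCodeUnreachable` is a stand-in for the timeout error shown at the top of the thread, not a Dagster class.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class UserCodeUnreachable(Exception):
    """Hypothetical stand-in for the 'user code unreachable' timeout error."""


def launch_with_retry(
    launch: Callable[[], T],
    attempts: int = 2,
    wait_seconds: float = 0.0,
) -> T:
    """Call launch() up to `attempts` times, retrying only on the timeout error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[UserCodeUnreachable] = None
    for attempt in range(attempts):
        try:
            return launch()
        except UserCodeUnreachable as err:
            last_error = err
            # The failed attempt has already triggered the server start,
            # so optionally wait before trying again.
            if attempt < attempts - 1:
                time.sleep(wait_seconds)
    assert last_error is not None
    raise last_error
```

The default of two attempts mirrors the workaround above: the first request warms the code server, and the second one is expected to reach it.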
