# dagster-plus
c
hey team, been getting this a lot recently for a couple of my sensors, but the evaluation function only takes ~20 seconds during testing. I'm thinking something may be up with our sensor daemon
```
dagster._core.errors.DagsterUserCodeUnreachableError: dagster._core.errors.DagsterUserCodeUnreachableError: The sensor tick timed out due to taking longer than 60 seconds to execute the sensor function. One way to avoid this error is to break up the sensor work into chunks, using cursors to let subsequent sensor calls pick up where the previous call left off.
```
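For reference, a minimal sketch of the cursor pattern the error message suggests, assuming a hypothetical `list_new_records(after=..., limit=...)` helper and a placeholder job name:
```python
# Minimal sketch of chunked sensor work using a cursor. list_new_records is a
# hypothetical helper that returns at most `limit` records newer than the cursor.
from dagster import RunRequest, SensorEvaluationContext, sensor


@sensor(job_name="my_job")  # placeholder job name
def chunked_sensor(context: SensorEvaluationContext):
    last_seen = context.cursor or ""
    # Process a bounded chunk per tick so each evaluation stays well under the timeout.
    records = list_new_records(after=last_seen, limit=100)
    for record in records:
        yield RunRequest(run_key=record.id)
    if records:
        # Persist progress so the next tick picks up where this one left off.
        context.update_cursor(records[-1].id)
```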
j
you probably have more memory/CPU allocated locally than we do on Serverless, but I'll take a look
the Serverless gRPC servers are pretty under-provisioned, which can cause some challenges around heavier sensor workloads. it's something we're still deciding the best way to address.
oh wait, you're not on Serverless, are you? in that case it's the same basic answer, but hopefully more solvable: can you take a look at your gRPC server pods and see if they're resource constrained?
c
yeah I'm on a hybrid deployment
would that be the code location pod, or dagster agent pod, or something different?
j
code location pod
c
I was looking at the wrong pod before
running it on the deployed pod, the evaluation is still ~20 seconds
as a workaround, I tried setting this env var yesterday to increase the timeout, but it doesn't appear to have made it to the pod
j
We’re going to have to keep digging into this - I don’t have an answer yet.
a
the agent, which is initiating these requests (and enforcing the timeout), will need that timeout env var set. The built-in environment variables for deployments feature only affects code servers and runs, so you may need to set that env var via helm or however you manage your agent pod. What size nodes are you running on in k8s?
c
thanks @alex, I use helm yeah
I tried adding the env var to the chart but I don't think it worked
I'm using GKE autopilot, which doesn't have a set node size, but the default pod size is 0.5 vCPU/2 GiB https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-resource-requests#defaults
on my helm chart, I set
```yaml
dagsterCloudAgent:
  env:
    DAGSTER_GRPC_TIMEOUT_SECONDS: "120"
```
it's visible on the config map, but the timeout still seems to be 60 according to errors
a
hmm, did the agent pod restart when you made this change?
we may be missing some machinery to force helm/k8s to restart the pod when an env var changes, since it's indirected via the config map
c
I thought it did, but I deleted the agent pod and started a new one
I'm seeing the env var show up correctly when I connect to the pod, so I'll see if the error persists after that
a
got it, keep me posted. In terms of why this takes longer: the user code server is a gRPC server that executes your code in threads. When a sensor evaluation happens, it runs in a thread alongside any other requests that user code server is handling, so if you have multiple sensors whose evaluations overlap, they can contend for compute in the process. Potential mitigations:
• increase the resources allocated to the user code server
• split across code locations (each code location gets its own code server)
Another option, if you have the bandwidth, is to profile the user code server with a tool like py-spy. That can reveal exactly what is slowing things down.
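As a lighter-weight complement to py-spy, here is a rough sketch of timing the evaluation from inside the sensor function itself, assuming a Dagster version where the sensor context exposes `log`; `my_job` is a placeholder job name:
```python
# Rough sketch: log how long each sensor evaluation takes so slow ticks are
# visible in the tick logs without attaching a profiler to the code server pod.
import time

from dagster import SensorEvaluationContext, sensor


@sensor(job_name="my_job")  # placeholder job name
def timed_sensor(context: SensorEvaluationContext):
    start = time.monotonic()

    # ... existing evaluation logic goes here ...

    elapsed = time.monotonic() - start
    context.log.info(f"sensor evaluation took {elapsed:.1f}s")
```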