# dagster-plus
c
hey team, been getting this a lot recently for a couple of my sensors, but the evaluation function only takes ~20 seconds during testing. I'm thinking something may be up with our sensor daemon
```
dagster._core.errors.DagsterUserCodeUnreachableError: dagster._core.errors.DagsterUserCodeUnreachableError: The sensor tick timed out due to taking longer than 60 seconds to execute the sensor function. One way to avoid this error is to break up the sensor work into chunks, using cursors to let subsequent sensor calls pick up where the previous call left off.
```
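For reference, a minimal sketch of the cursor pattern the error message suggests, assuming a hypothetical `list_new_records(after=..., limit=...)` helper and a placeholder job name:
```python
# Minimal sketch of chunked sensor work using a cursor. list_new_records is a
# hypothetical helper that returns at most `limit` records newer than the cursor.
from dagster import RunRequest, SensorEvaluationContext, sensor


@sensor(job_name="my_job")  # placeholder job name
def chunked_sensor(context: SensorEvaluationContext):
    last_seen = context.cursor or ""
    # Process a bounded chunk per tick so each evaluation stays well under the timeout.
    records = list_new_records(after=last_seen, limit=100)
    for record in records:
        yield RunRequest(run_key=record.id)
    if records:
        # Persist progress so the next tick picks up where this one left off.
        context.update_cursor(records[-1].id)
```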
j
you probably have more memory/CPU allocated locally than we do on Serverless, but I'll take a look
the Serverless gRPC servers are pretty under-provisioned, which can cause some challenges around heavier sensor workloads. it's something we're still deciding the best way to address.
oh wait, you're not on Serverless, are you? in that case it's the same basic answer, but hopefully more solvable: can you take a look at your gRPC server pods and see if they're resource constrained?
c
yeah I'm on a hybrid deployment
would that be the code location pod, or dagster agent pod, or something different?
j
code location pod
c
I was looking at the wrong pod before
running it on the deployed pod, the evaluation is still ~20 seconds
as a workaround, I tried setting this env var yesterday to increase the timeout, but it doesn't appear to have made it to the pod
j
We’re going to have to keep digging into this - I don’t have an answer yet.
a
the agent, which is initiating these requests (and enforcing the timeout), will need that timeout env var set. The built-in environment variables for deployments feature only affects code servers and runs, so you may need to set that env var via helm or however you manage your agent pod. What size nodes are you running on in k8s?
c
thanks @alex, I use helm yeah
I tried adding the env var to the chart but I don't think it worked
I'm using GKE autopilot, which doesn't have a set node size, but the default pod size is 0.5 vCPU/2 GiB https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-resource-requests#defaults
on my helm chart, I set
```yaml
dagsterCloudAgent:
  env:
    DAGSTER_GRPC_TIMEOUT_SECONDS: "120"
```
it's visible on the config map, but the timeout still seems to be 60 according to errors
a
hmm, did the agent pod restart when you made this change?
we may be missing some machinery to force helm/k8s to restart the pod when an env var changes, since it's indirected via the config map
c
I thought it did, but I deleted the agent pod and started a new one
I'm seeing the env var show up correctly when I connect to the pod, so I'll see if the error persists after that
a
got it, keep me posted. In terms of why this takes longer: the user code server is a gRPC server that executes your code in threads. When a sensor evaluation happens, it runs in a thread alongside any other requests that user code server is handling, so if you have multiple sensors whose evaluations overlap, they can contend for compute in the process. Potential mitigations:
• increase the resources allocated to the user code server
• split across code locations (each code location gets its own code server)
Another option, if you have the bandwidth, is to profile the user code server with a tool like py-spy. That can reveal exactly what is slowing things down.
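As a lighter-weight complement to py-spy, here is a rough sketch of timing the evaluation from inside the sensor function itself, assuming a Dagster version where the sensor context exposes `log`; `my_job` is a placeholder job name:
```python
# Rough sketch: log how long each sensor evaluation takes so slow ticks are
# visible in the tick logs without attaching a profiler to the code server pod.
import time

from dagster import SensorEvaluationContext, sensor


@sensor(job_name="my_job")  # placeholder job name
def timed_sensor(context: SensorEvaluationContext):
    start = time.monotonic()

    # ... existing evaluation logic goes here ...

    elapsed = time.monotonic() - start
    context.log.info(f"sensor evaluation took {elapsed:.1f}s")
```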