# dagster-plus
Hello! We're running into the below issue on our dagster cloud deployment, anything we can do to resolve this asap?
dagster._core.errors.DagsterUserCodeUnreachableError: Could not send a request to your Dagster Cloud agent since no agents have recently heartbeated. Reach out on Slack or email support@elementl.com.
  File "/dagster/dagster/_daemon/sensor.py", line 536, in _process_tick_generator
    yield from _evaluate_sensor(
  File "/dagster/dagster/_daemon/sensor.py", line 672, in _evaluate_sensor
    sensor_runtime_data = code_location.get_external_sensor_execution_data(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 629, in get_external_sensor_execution_data
    result = self.api_call(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 366, in api_call
    return dagster_cloud_api_call(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 130, in dagster_cloud_api_call
    for result in gen_dagster_cloud_api_call(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 195, in gen_dagster_cloud_api_call
    raise DagsterUserCodeUnreachableError(
Hi Henri, it looks like the agent had stopped responding. I have now restarted it on our side, so this should be back up.
Thank you @Shalabh Chaturvedi !
hello @Shalabh Chaturvedi - how can we prevent this from happening? Looking in our deployment, I see that 4 times in the past day we've had an agent become inactive and replaced - what is the root cause of the agent becoming unreachable, and why did no replacement happen yesterday?
Hi Leo - We expect ECS to automatically restart agents if they become unresponsive, however that did not happen. We are currently investigating the root cause and will share what we find.
👍 1
🙇 1
@Shalabh Chaturvedi looks like our agent is in a bad state again, would you be able to restart it again? Thanks
Sorry about that - it should be back up now. We are still investigating this issue.
@Shalabh Chaturvedi - it looks like our agent went down again around 2:23pm pacific time - right now the status is "Activating":
btw - if there's a broader or more operational way to get help for these scenarios, let us know!
I've restarted it again. The best way would be to reply to this thread that your serverless agent is down and check "also send to #dagster-cloud". It will get picked up by one of us and it will also keep the history of all failures.
hello - our serverless agent is down again - last heartbeat was around 38 minutes ago and the deployment has been "activating"
We have restarted the agent. Sorry about the trouble. We're still investigating at our end.
thank you! It does seem like stability has suddenly become an issue lately... Is there a way to introduce jitter to the first sensor tick after a new agent starts? I'm noticing timed out sensor ticks after an agent restart because they all seem to start ticking at the same time.
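To illustrate the jitter idea raised above, here is a minimal, generic sketch of spreading out first ticks after a restart. It is not a Dagster API or setting: `jittered_start_delays`, `max_jitter_seconds`, and the sensor names are all hypothetical, and the thread does not say whether the agent or daemon exposes any hook like this.

```python
import random


def jittered_start_delays(sensor_names, max_jitter_seconds=30.0, seed=None):
    """Return a random first-tick delay (in seconds) for each sensor.

    Hypothetical helper: each sensor gets an independent delay drawn
    uniformly between 0 and max_jitter_seconds, so the sensors do not
    all evaluate at the same instant after a restart.
    """
    rng = random.Random(seed)  # a seed makes the delays reproducible
    return {name: rng.uniform(0.0, max_jitter_seconds) for name in sensor_names}


# Example with made-up sensor names.
delays = jittered_start_delays(["sensor_a", "sensor_b", "sensor_c"], seed=42)
for name, delay in delays.items():
    print(f"{name}: first tick after {delay:.1f}s")
```

Drawing each delay independently means the first evaluations are spread across the jitter window instead of all landing at once, which is the behavior being asked for here.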
I’m going to reset your agent again so I can attach a debugger to it
🙏 1
hello - our agent is down again, starting around an hour ago @jordan @Shalabh Chaturvedi - related to the debugger?
Could be. We’re chasing a lead right now involving the latest version of grpcio but I don’t think we’re going to have a fix for you this afternoon.
Just a hunch but I’m going to bump your agent’s memory overnight in hopes that it keeps things more stable while we continue to troubleshoot.
I got this error. Is it the same thing? I'm using serverless-deploy
Stack Trace:
  File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1745, in _wait_for_server_process
  File "/dagster/dagster/_grpc/client.py", line 190, in ping
    res = self._query("Ping", api_pb2.PingRequest, echo=echo)
  File "/dagster/dagster/_grpc/client.py", line 157, in _query
  File "/dagster/dagster/_grpc/client.py", line 140, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(