# dagster-plus
Henri:
Hello! We're running into the issue below on our Dagster Cloud deployment - is there anything we can do to resolve this ASAP?
dagster._core.errors.DagsterUserCodeUnreachableError: Could not send a request to your Dagster Cloud agent since no agents have recently heartbeated. Reach out on Slack or email support@elementl.com.
  File "/dagster/dagster/_daemon/sensor.py", line 536, in _process_tick_generator
    yield from _evaluate_sensor(
  File "/dagster/dagster/_daemon/sensor.py", line 672, in _evaluate_sensor
    sensor_runtime_data = code_location.get_external_sensor_execution_data(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 629, in get_external_sensor_execution_data
    result = self.api_call(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 366, in api_call
    return dagster_cloud_api_call(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 130, in dagster_cloud_api_call
    for result in gen_dagster_cloud_api_call(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 195, in gen_dagster_cloud_api_call
    raise DagsterUserCodeUnreachableError(
Shalabh Chaturvedi:
Hi Henri - it looks like the agent had stopped responding. I have now restarted it on our side, so this should be back up.
Henri:
Thank you @Shalabh Chaturvedi!
Leo:
hello @Shalabh Chaturvedi - how can we prevent this from happening? Looking at our deployment, I see that 4 times in the past day we've had an agent become inactive and replaced. What is the root cause of the agent becoming unreachable, and why did no replacement happen yesterday?
Shalabh Chaturvedi:
Hi Leo - We expect ECS to automatically restart agents if they become unresponsive; however, that did not happen here. We are currently investigating the root cause and will share what we find.
👍 1
🙇 1
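For context on the ECS behavior described above: ECS only replaces an unresponsive task automatically when the task definition declares a container health check that the hung process will actually fail. A minimal boto3 sketch of that generic mechanism follows; the image, probe command, and thresholds are illustrative assumptions, not Dagster Cloud's actual agent configuration.

import boto3

# Generic ECS mechanism, not Dagster Cloud's real agent config: a container
# health check lets ECS mark a hung task UNHEALTHY so the service scheduler
# stops it and launches a replacement.
ecs = boto3.client("ecs")
ecs.register_task_definition(
    family="example-agent",  # illustrative task family
    containerDefinitions=[
        {
            "name": "agent",
            "image": "example/agent:latest",  # placeholder image
            "memory": 512,
            "healthCheck": {
                # Illustrative probe; the agent would need to expose an
                # equivalent liveness endpoint for this check to be useful.
                "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
                "interval": 30,     # seconds between probes
                "timeout": 5,       # seconds before a probe counts as failed
                "retries": 3,       # consecutive failures before UNHEALTHY
                "startPeriod": 60,  # grace period after container start
            },
        }
    ],
)

If the process hangs in a way the probe does not detect (for example, it still answers the health check while its gRPC server is wedged), ECS keeps the task running, which would be consistent with the behavior seen here.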
Henri:
@Shalabh Chaturvedi looks like our agent is in a bad state again - would you be able to restart it again? Thanks!
Shalabh Chaturvedi:
Sorry about that - it should be back up now. We are still investigating this issue.
Leo:
@Shalabh Chaturvedi - it looks like our agent went down again around 2:23pm Pacific time. Right now the status is "Activating".
btw - if there's a broader or more operational way to get help for these scenarios, let us know!
Shalabh Chaturvedi:
I've restarted it again. The best way would be to reply to this thread that your serverless agent is down and check "also send to #dagster-cloud". It will get picked up by one of us, and it will also keep a history of all failures.
Leo:
hello - our serverless agent is down again - the last heartbeat was around 38 minutes ago and the deployment has been "activating".
Shalabh Chaturvedi:
We have restarted the agent. Sorry about the trouble. We're still investigating at our end.
Leo:
thank you! It does seem like stability has suddenly become an issue lately... Is there a way to introduce jitter to the first sensor tick after a new agent starts? I'm noticing timed-out sensor ticks after an agent restart because they all seem to start ticking at the same time.
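The thread doesn't confirm a built-in jitter option for this, but one user-level workaround is to randomize the first evaluation inside the sensor body itself. A sketch under that assumption follows: my_job is a placeholder, the 0-15 second window is arbitrary, and the sleep spends part of the tick's evaluation-timeout budget, so it should stay short.

import random
import time

from dagster import RunRequest, job, op, sensor


@op
def my_op():
    pass


@job
def my_job():
    my_op()


_first_tick_done = False  # module-level state resets whenever the process restarts


@sensor(job=my_job, minimum_interval_seconds=60)
def jittered_sensor(context):
    global _first_tick_done
    if not _first_tick_done:
        _first_tick_done = True
        # Sleep a random 0-15s on the first tick after a (re)start so all
        # sensors don't start evaluating at exactly the same moment.
        time.sleep(random.uniform(0, 15))
    yield RunRequest(run_key=None)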
jordan:
I'm going to reset your agent again so I can attach a debugger to it.
🙏 1
Leo:
hello - our agent is down again, starting around an hour ago. @jordan @Shalabh Chaturvedi - related to the debugger?
jordan:
Could be. We're chasing a lead right now involving the latest version of grpcio, but I don't think we're going to have a fix for you this afternoon.
Just a hunch, but I'm going to bump your agent's memory overnight in hopes that it keeps things more stable while we continue to troubleshoot.
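If the grpcio lead is confirmed, the generic stopgap is a dependency pin below the suspect release; for a serverless agent that pin would have to land on Dagster's side, but the same pattern applies to any project hitting a grpcio regression. The bound below is an illustrative assumption, not a confirmed-bad version.

# setup.py - illustrative pin while a grpcio regression is investigated;
# the "<1.48" bound is an assumption, not a confirmed-bad version.
from setuptools import setup

setup(
    name="my_dagster_project",  # placeholder project name
    packages=["my_dagster_project"],
    install_requires=[
        "dagster",
        "grpcio<1.48",  # illustrative upper bound
    ],
)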
y:
I got this error. Is it the same thing? I'm using serverless-deploy.
Stack Trace:
  File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1745, in _wait_for_server_process
    client.ping("")
  File "/dagster/dagster/_grpc/client.py", line 190, in ping
    res = self._query("Ping", api_pb2.PingRequest, echo=echo)
  File "/dagster/dagster/_grpc/client.py", line 157, in _query
    self._raise_grpc_exception(
  File "/dagster/dagster/_grpc/client.py", line 140, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(