#dagster-cloud

Henri Blancke

07/23/2023, 8:14 PM
Hello! We're running into the below issue on our dagster cloud deployment, anything we can do to resolve this asap?
dagster._core.errors.DagsterUserCodeUnreachableError: Could not send a request to your Dagster Cloud agent since no agents have recently heartbeated. Reach out on Slack or email support@elementl.com.
  File "/dagster/dagster/_daemon/sensor.py", line 536, in _process_tick_generator
    yield from _evaluate_sensor(
  File "/dagster/dagster/_daemon/sensor.py", line 672, in _evaluate_sensor
    sensor_runtime_data = code_location.get_external_sensor_execution_data(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 629, in get_external_sensor_execution_data
    result = self.api_call(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 366, in api_call
    return dagster_cloud_api_call(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 130, in dagster_cloud_api_call
    for result in gen_dagster_cloud_api_call(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 195, in gen_dagster_cloud_api_call
    raise DagsterUserCodeUnreachableError(

Shalabh Chaturvedi

07/23/2023, 10:13 PM
Hi Henri, it looks like the agent had stopped responding. I have now restarted it on our side, so this should be back up.

Henri Blancke

07/23/2023, 10:23 PM
Thank you @Shalabh Chaturvedi !

Leo Qin

07/24/2023, 8:46 PM
hello @Shalabh Chaturvedi - how can we prevent this from happening? Looking in our deployment, I see that 4 times in the past day we've had an agent become inactive and replaced - what is the root cause of the agent becoming unreachable, and why did no replacement happen yesterday?

Shalabh Chaturvedi

07/24/2023, 9:43 PM
Hi Leo - We expect ECS to automatically restart agents if they become unresponsive; however, that did not happen. We are currently investigating the root cause and will share what we find.

Henri Blancke

07/26/2023, 12:45 PM
@Shalabh Chaturvedi looks like our agent is in a bad state again, would you be able to restart it again? Thanks

Shalabh Chaturvedi

07/26/2023, 1:43 PM
Sorry about that - it should be back up now. We are still investigating this issue.

Leo Qin

07/27/2023, 12:13 AM
@Shalabh Chaturvedi - it looks like our agent went down again around 2:23pm Pacific time - right now the status is "Activating".
btw - if there's a broader or more operational way to get help for these scenarios, let us know!

Shalabh Chaturvedi

07/27/2023, 1:20 AM
I've restarted it again. The best way would be to reply to this thread that your serverless agent is down and check "also send to #dagster-cloud". It will get picked up by one of us, and it will also keep the history of all failures.

Leo Qin

07/27/2023, 7:21 PM
hello - our serverless agent is down again - last heartbeat was around 38 minutes ago and the deployment has been "activating"

Shalabh Chaturvedi

07/27/2023, 7:34 PM
We have restarted the agent. Sorry about the trouble. We're still investigating at our end.

Leo Qin

07/27/2023, 7:39 PM
thank you! It does seem like stability has suddenly become an issue lately... Is there a way to introduce jitter to the first sensor tick after a new agent starts? I'm noticing timed out sensor ticks after an agent restart because they all seem to start ticking at the same time.
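For illustration only, the kind of jitter being asked about can be sketched in plain Python. This is not a Dagster setting or API: the function `first_tick_delay`, the 30-second cap, and the sensor names are all hypothetical. The sketch derives a stable per-sensor delay from a hash of the sensor name, so that sensors do not all take their first tick at the same moment after a restart.

```python
import hashlib


def first_tick_delay(sensor_name: str, max_jitter_seconds: float = 30.0) -> float:
    """Return a stable delay in [0, max_jitter_seconds) for one sensor.

    Hypothetical helper: the same name always maps to the same delay,
    and different names are spread across the jitter window.
    """
    digest = hashlib.sha256(sensor_name.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash onto the unit interval [0, 1).
    fraction = int.from_bytes(digest[:4], "big") / 2**32
    return fraction * max_jitter_seconds


# Hypothetical sensor names, used only to show the spread of delays.
sensor_names = ["orders_sensor", "billing_sensor", "inventory_sensor"]
delays = {name: first_tick_delay(name) for name in sensor_names}

for name, delay in sorted(delays.items(), key=lambda item: item[1]):
    print(f"{name}: first tick after {delay:.1f}s")
```

A hash-based delay is used here instead of `random.uniform` so that the offsets stay the same across restarts; a random draw would work as well if stable offsets are not needed.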

jordan

07/27/2023, 8:12 PM
I’m going to reset your agent again so I can attach a debugger to it

Leo Qin

07/27/2023, 9:32 PM
hello - our agent is down again, starting around an hour ago @jordan @Shalabh Chaturvedi - related to the debugger?

jordan

07/27/2023, 9:49 PM
Could be. We’re chasing a lead right now involving the latest version of grpcio but I don’t think we’re going to have a fix for you this afternoon.
Just a hunch but I’m going to bump your agent’s memory overnight in hopes that it keeps things more stable while we continue to troubleshoot.

Yang

08/01/2023, 6:06 PM
I got this error. Is it the same thing? I'm using serverless-deploy
Stack Trace:
  File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1745, in _wait_for_server_process
    client.ping("")
  File "/dagster/dagster/_grpc/client.py", line 190, in ping
    res = self._query("Ping", api_pb2.PingRequest, echo=echo)
  File "/dagster/dagster/_grpc/client.py", line 157, in _query
    self._raise_grpc_exception(
  File "/dagster/dagster/_grpc/client.py", line 140, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(