# dagster-plus
l
Anyone know how to resolve this error (in 🧵) that I've been getting in both my prod and branch deployments in Hybrid since this morning?
```
Exception: Timed out after waiting 180s for server dr-e99b708e70d051faf1a4e1dc50b8e7f3ac9673bb-4189e5:4000.

Most recent connection error: dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE

Stack Trace:
  File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1744, in _wait_for_server_process
    client.ping("")
  File "/dagster/dagster/_grpc/client.py", line 190, in ping
    res = self._query("Ping", api_pb2.PingRequest, echo=echo)
  File "/dagster/dagster/_grpc/client.py", line 157, in _query
    self._raise_grpc_exception(
  File "/dagster/dagster/_grpc/client.py", line 140, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(

The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "DNS resolution failed for dr-e99b708e70d051faf1a4e1dc50b8e7f3ac9673bb-4189e5:4000: C-ares status is not ARES_SUCCESS qtype=A name=dr-e99b708e70d051faf1a4e1dc50b8e7f3ac9673bb-4189e5 is_balancer=0: Domain name not found"
	debug_error_string = "UNKNOWN:DNS resolution failed for dr-e99b708e70d051faf1a4e1dc50b8e7f3ac9673bb-4189e5:4000: C-ares status is not ARES_SUCCESS qtype=A name=dr-e99b708e70d051faf1a4e1dc50b8e7f3ac9673bb-4189e5 is_balancer=0: Domain name not found {created_time:"2023-06-30T23:12:01.504637721+00:00", grpc_status:14}"
>

Stack Trace:
  File "/dagster/dagster/_grpc/client.py", line 155, in _query
    return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
  File "/dagster/dagster/_grpc/client.py", line 130, in _get_response
    return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 1030, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable

  File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1342, in _reconcile
    self._wait_for_new_server_ready(
  File "/dagster-cloud/dagster_cloud/workspace/docker/__init__.py", line 324, in _wait_for_new_server_ready
    self._wait_for_dagster_server_process(
  File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1722, in _wait_for_dagster_server_process
    self._wait_for_server_process(
  File "/dagster-cloud/dagster_cloud/workspace/user_code_launcher/user_code_launcher.py", line 1757, in _wait_for_server_process
    raise Exception(
```
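For anyone debugging a similar timeout: the DNS failure in that trace can be reproduced directly with the `grpcio` library from any container on the same Docker network as the agent (e.g. by `docker exec`-ing into the agent container). A minimal sketch, with the host:port copied from the error message above (the name changes on every redeploy):

```python
# Probe the code server the agent is trying to reach. If Docker DNS can't
# resolve the container name, the channel never becomes ready and the future
# times out, mirroring the StatusCode.UNAVAILABLE error in the trace above.
import grpc

target = "dr-e99b708e70d051faf1a4e1dc50b8e7f3ac9673bb-4189e5:4000"
channel = grpc.insecure_channel(target)
try:
    grpc.channel_ready_future(channel).result(timeout=10)
    print(f"{target} is reachable")
except grpc.FutureTimeoutError:
    print(f"{target} is NOT reachable (name doesn't resolve or nothing is listening)")
```

A name that doesn't resolve usually means the code server container either never started or has already exited.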
q
Are you using a preemptible node for your agent?
l
We have a single VM with our Agent and one code location deployed using Docker
Preemptibility is off
p
Hmm, it looks like the agent is unable to resolve those domains for the deployment code locations. Is there anything telling in your agent logs?
l
Hi @prha I'm not seeing anything more informative in the logs for the agent container (using `docker logs <agent-container-name>`), and the Cloud UI lists the agent (v1.3.13) as `Running` with "no recent errors". The Dagster Cloud version seems to be an off-release commit (see attached image, if I'm understanding what the version under the daggy logo represents). Here are some partially redacted logs from the container (let me know if I should be looking elsewhere):
```
2023-07-01 16:48:55 +0000 - dagster_cloud.agent - INFO - Received request [53fb4afb-61e8-4a31-8648-7cf782bbc5e2: DagsterCloudApi.CHECK_FOR_WORKSPACE_UPDATES].
2023-07-01 16:48:55 +0000 - dagster_cloud.agent - INFO - Finished processing request [53fb4afb-61e8-4a31-8648-7cf782bbc5e2: DagsterCloudApi.CHECK_FOR_WORKSPACE_UPDATES].
2023-07-01 16:48:55 +0000 - dagster_cloud.agent - INFO - Uploading response for request [53fb4afb-61e8-4a31-8648-7cf782bbc5e2: DagsterCloudApi.CHECK_FOR_WORKSPACE_UPDATES].
2023-07-01 16:48:55 +0000 - dagster_cloud.agent - INFO - Finished uploading response for request [53fb4afb-61e8-4a31-8648-7cf782bbc5e2: DagsterCloudApi.CHECK_FOR_WORKSPACE_UPDATES].
2023-07-01 16:48:56 +0000 - dagster_cloud.user_code_launcher - INFO - Reconciling to reach {(e99b708e70d051faf1a4e1dc50b8e7f3ac9673bb, dr, 1688230135.046586)}. To add: {}. To update: {(e99b708e70d051faf1a4e1dc50b8e7f3ac9673bb, dr, 1688230135.046586)}. To remove: {}. To upload: {(e99b708e70d051faf1a4e1dc50b8e7f3ac9673bb, dr, 1688230135.046586)}.
2023-07-01 16:48:56 +0000 - dagster_cloud.user_code_launcher - INFO - Updating server for e99b708e70d051faf1a4e1dc50b8e7f3ac9673bb:dr
2023-07-01 16:48:56 +0000 - dagster_cloud.user_code_launcher - INFO - Starting a new container for e99b708e70d051faf1a4e1dc50b8e7f3ac9673bb:dr with image <REDACTED>:0f0add75be627adadf012cd59c26aede53abfef8-5427230978-1: dr-e99b708e70d051faf1a4e1dc50b8e7f3ac9673bb-b2e55e
2023-07-01 16:48:58 +0000 - dagster_cloud.user_code_launcher - INFO - Started container 572601875e80a0205f656b6d92b485cd16279d7794ea537dccfc0636ee44fac2
2023-07-01 16:48:58 +0000 - dagster_cloud.user_code_launcher - INFO - Created a new server for ('e99b708e70d051faf1a4e1dc50b8e7f3ac9673bb', 'dr')
2023-07-01 16:48:58 +0000 - dagster_cloud.user_code_launcher - INFO - Waiting for new grpc server for ('e99b708e70d051faf1a4e1dc50b8e7f3ac9673bb', 'dr') to be ready...
2023-07-01 16:51:59 +0000 - dagster_cloud.user_code_launcher - ERROR - Error while waiting for server for e99b708e70d051faf1a4e1dc50b8e7f3ac9673bb:dr to be ready: Exception: Timed out after waiting 180s for server dr-e99b708e70d051faf1a4e1dc50b8e7f3ac9673bb-b2e55e:4000.
```
These same symptoms are affecting both our Full Deployment and our Branch Deployments, but only for recent branches (old branches from several days ago still seem to have valid code locations). Is it possible this was caused by merging "bad code"? The code location loads without issue locally using `dagster dev`.
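Since the Docker agent launches each code location as its own container, the next place to look after the agent logs is that container's own logs. A minimal sketch using the Docker SDK for Python (`pip install docker`); the plain `docker ps -a` / `docker logs` CLI calls are equivalent, and the `dr-` name prefix comes from the code location name in the logs above:

```python
# Find the code-server containers the agent has launched and dump logs for
# any that have exited; a startup crash in the user code shows up here, not
# in the agent's own logs.
import docker

client = docker.from_env()
for container in client.containers.list(all=True, filters={"name": "dr-"}):
    print(container.name, container.status)
    if container.status == "exited":
        print(container.logs(tail=50).decode())
```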
It turns out the issue was the user code containers crashing on startup with the following Pydantic import error, so the code server was never actually up for the agent to reach:
```
Traceback (most recent call last):
  File "/opt/conda/bin/dagster", line 5, in <module>
    from dagster.cli import main
  File "/opt/conda/lib/python3.10/site-packages/dagster/__init__.py", line 100, in <module>
    from dagster._config.pythonic_config import (
  File "/opt/conda/lib/python3.10/site-packages/dagster/_config/pythonic_config/__init__.py", line 22, in <module>
    from pydantic import ConstrainedFloat, ConstrainedInt, ConstrainedStr
  File "/opt/conda/lib/python3.10/site-packages/pydantic/__init__.py", line 206, in __getattr__
    return _getattr_migration(attr_name)
  File "/opt/conda/lib/python3.10/site-packages/pydantic/_migration.py", line 285, in wrapper
    raise PydanticImportError(f'`{import_path}` has been removed in V2.')
pydantic.errors.PydanticImportError: `pydantic:ConstrainedFloat` has been removed in V2.

For further information visit <https://errors.pydantic.dev/2.0/u/import-error>
```
We hadn't seen this in local dev since local environments weren't being rebuilt with every code change, and there was apparently a new major Pydantic release (2.0) on Friday. facepalm
Pinning `pydantic<2` fixed the issue.
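For reference, the pin just needs to land wherever the code location image declares its dependencies; a minimal sketch assuming a `setup.py`-based project (the thread doesn't show how this project's dependencies are actually declared):

```python
# setup.py (hypothetical; the same pin works in requirements.txt or pyproject.toml)
from setuptools import find_packages, setup

setup(
    name="my_dagster_project",  # placeholder name
    packages=find_packages(),
    install_requires=[
        "dagster",
        "dagster-cloud",
        # Temporary pin: dagster 1.3.x imports ConstrainedFloat/Int/Str, which
        # were removed in pydantic 2.0 (released 2023-06-30).
        "pydantic<2",
    ],
)
```

The pin can be dropped once the project is on a Dagster release that supports pydantic 2.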
p
Glad that you were able to find the issue!