# ask-community

Jordan

06/09/2022, 8:57 AM
Hey everyone! I suspect there are some problems when you materialize many jobs and partitions at the same time. Several hours after I launched a dozen jobs with a hundred partitions each, everything was still going well when suddenly hundreds of partitions failed with the following errors:
Caught an error for run 90e29dd1-679d-41a5-bece-a18534a49236 while removing it from the queue. Marking the run as failed and dropping it from the queue: dagster.core.errors.DagsterRepositoryLocationLoadError: Failure loading staging_repository: Exception: Timed out waiting for gRPC server to start with arguments: "/opt/dagster/dagster_dags/staging/dagster/.venv/bin/python -m dagster api grpc --lazy-load-user-code --socket /tmp/tmpu37j1tqe --heartbeat --heartbeat-timeout 120 --fixed-server-id e58d213f-4410-4c4d-8441-81d13f63e35b --log-level WARNING --use-python-environment-entry-point -f /opt/dagster/dagster_dags/staging/dagster/repository.py -d /opt/dagster/dagster_dags/staging/dagster/". Most recent connection error: dagster.core.errors.DagsterUserCodeUnreachableError: Could not reach user code server

Stack Trace:
  File "/opt/dagster/dagster_envs/dagster_main/lib/python3.8/site-packages/dagster/grpc/server.py", line 961, in wait_for_grpc_server
    client.ping("")
  File "/opt/dagster/dagster_envs/dagster_main/lib/python3.8/site-packages/dagster/grpc/client.py", line 128, in ping
    res = self._query("Ping", api_pb2.PingRequest, echo=echo)
  File "/opt/dagster/dagster_envs/dagster_main/lib/python3.8/site-packages/dagster/grpc/client.py", line 115, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e

The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1654615602.653502925","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3217,"referenced_errors":[{"created":"@1654615602.653501893","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":165,"grpc_status":14}]}"
>

Stack Trace:
  File "/opt/dagster/dagster_envs/dagster_main/lib/python3.8/site-packages/dagster/grpc/client.py", line 112, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/opt/dagster/dagster_envs/dagster_main/lib64/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/opt/dagster/dagster_envs/dagster_main/lib64/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)


Stack Trace:
  File "/opt/dagster/dagster_envs/dagster_main/lib/python3.8/site-packages/dagster/core/host_representation/grpc_server_registry.py", line 207, in _get_grpc_endpoint
    server_process = GrpcServerProcess(
  File "/opt/dagster/dagster_envs/dagster_main/lib/python3.8/site-packages/dagster/grpc/server.py", line 1119, in __init__
    self.server_process = open_server_process(
  File "/opt/dagster/dagster_envs/dagster_main/lib/python3.8/site-packages/dagster/grpc/server.py", line 1032, in open_server_process
    wait_for_grpc_server(server_process, client, subprocess_args, timeout=startup_timeout)
  File "/opt/dagster/dagster_envs/dagster_main/lib/python3.8/site-packages/dagster/grpc/server.py", line 967, in wait_for_grpc_server
    raise Exception(


Stack Trace:
  File "/opt/dagster/dagster_envs/dagster_main/lib/python3.8/site-packages/dagster/daemon/run_coordinator/queued_run_coordinator_daemon.py", line 155, in run_iteration
    self._dequeue_run(instance, run, workspace)
  File "/opt/dagster/dagster_envs/dagster_main/lib/python3.8/site-packages/dagster/daemon/run_coordinator/queued_run_coordinator_daemon.py", line 230, in _dequeue_run
    instance.launch_run(run.run_id, workspace)
  File "/opt/dagster/dagster_envs/dagster_main/lib/python3.8/site-packages/dagster/core/instance/__init__.py", line 1772, in launch_run
    self._run_launcher.launch_run(LaunchRunContext(pipeline_run=run, workspace=workspace))
  File "/opt/dagster/dagster_envs/dagster_main/lib/python3.8/site-packages/dagster/core/launcher/default_run_launcher.py", line 99, in launch_run
    repository_location = context.workspace.get_repository_location(
  File "/opt/dagster/dagster_envs/dagster_main/lib/python3.8/site-packages/dagster/daemon/workspace.py", line 58, in get_repository_location
    raise DagsterRepositoryLocationLoadError(
I don't understand what could have happened. Does anyone have an idea? Thanks

johann

06/09/2022, 4:36 PM
Hi Jordan, it looks like you’re using the DefaultRunLauncher. This launches each run as a subprocess of your gRPC server. My guess is that the machine hosting that server (by default it starts up alongside dagit) got overloaded, which resulted in the timeout error.
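For reference, a minimal sketch of the workspace entry behind that subprocess model, assuming the repository is loaded as a python_file (the paths are taken from the command line in your traceback):

```yaml
# workspace.yaml (sketch) -- with a python_file load like this, dagit and the
# daemon spawn the user code gRPC server themselves as a local subprocess
load_from:
  - python_file:
      relative_path: repository.py
      working_directory: /opt/dagster/dagster_dags/staging/dagster/
      executable_path: /opt/dagster/dagster_dags/staging/dagster/.venv/bin/python
```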
One option: you could restrict the number of runs that execute in parallel: https://docs.dagster.io/deployment/run-coordinator#limiting-run-concurrency
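A sketch of what that page describes, assuming the QueuedRunCoordinator (which the queued_run_coordinator_daemon in your traceback implies you already use); the limits below are illustrative, not recommendations:

```yaml
# dagster.yaml (sketch) -- cap how many runs the daemon dequeues and launches
# at once; tag-based limits can further throttle specific jobs or backfills
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 10
    tag_concurrency_limits:
      - key: "dagster/backfill"
        limit: 5
```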
You could also try hosting the server on a bigger box, or deploying on top of a different compute system (e.g. ECS or Kubernetes). The advantage of the latter is that each run spins up in an ephemeral bit of compute (ECS tasks, K8s Jobs, etc.), which helps with scaling.
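If you go the bigger-box route, one hedged sketch is to run the user code server yourself with `dagster api grpc` and point the workspace at it (the host, port, and hostname below are placeholders; the location name matches the one in your error):

```yaml
# workspace.yaml (sketch) -- instead of letting dagit/the daemon spawn the
# server locally, connect to a gRPC server you run on a separate, larger
# machine, started with something like:
#   dagster api grpc -f repository.py -h 0.0.0.0 -p 4266
load_from:
  - grpc_server:
      host: user-code.internal   # placeholder hostname
      port: 4266
      location_name: staging_repository
```

On ECS or Kubernetes, the analogous setup uses the corresponding run launcher (EcsRunLauncher or K8sRunLauncher) so each run gets its own task or Job.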