# deployment-kubernetes
r
Hey team, I just queued 2k pipelines at once as part of a partition set and my Dagit deployment died, is this expected? The pipelines aren't all run at once, they only run 25 at a time, so the daemon + user deployments were perfectly fine, it was just Dagit that died
d
Not expected, no - I can't actually think of anything in that sequence that would even go through dagit. Are there any logs from the deployment that might give a clue as to why it died? Could the cluster as a whole just be running out of resources?
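A quick way to pull those logs programmatically is the Kubernetes Python client; a minimal sketch, assuming the Helm chart's default `dagster` namespace and a `component=dagit` pod label (both are assumptions - adjust to your install):
```python
from kubernetes import client, config
from kubernetes.client.rest import ApiException

# Works from a workstation with a kubeconfig; use config.load_incluster_config() inside the cluster.
config.load_kube_config()
core = client.CoreV1Api()

# "dagster" namespace and "component=dagit" label are assumptions from typical Helm chart defaults.
pods = core.list_namespaced_pod("dagster", label_selector="component=dagit")
for pod in pods.items:
    name = pod.metadata.name
    print(f"--- {name} (current container) ---")
    print(core.read_namespaced_pod_log(name, "dagster", tail_lines=100))
    try:
        # Logs from the previous container instance, if the pod crashed and restarted.
        print(f"--- {name} (previous container) ---")
        print(core.read_namespaced_pod_log(name, "dagster", previous=True, tail_lines=100))
    except ApiException:
        pass  # no previous container to read from
```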
r
Actually I am going to check the logs myself because I think I've been misinformed and Dagit is only failing because it cannot reach the user deployment. I'll come back with the stack trace in a few mins 🙂
Ok just reviewed, and confirmed that the user-deployment was alive. I was confused by the error message in Dagit:
```
  File "/usr/local/lib/python3.7/site-packages/dagster/core/workspace/context.py", line 481, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/core/host_representation/origin.py", line 308, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/host_representation/repository_location.py", line 523, in __init__
    list_repositories_response = sync_list_repositories_grpc(self.client)
  File "/usr/local/lib/python3.7/site-packages/dagster/api/list_repositories.py", line 19, in sync_list_repositories_grpc
    api_client.list_repositories(),
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 164, in list_repositories
    res = self._query("ListRepositories", api_pb2.ListRepositoriesRequest)
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 110, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e

The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1651151675.723988421","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3128,"referenced_errors":[{"created":"@1651151675.723973195","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 107, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)

location_name=location_name, error_string=error.to_string()
```
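The `StatusCode.UNAVAILABLE` / "failed to connect to all addresses" part means Dagit could not open a gRPC connection to the user code server at all. A minimal sketch for probing that same gRPC server directly with Dagster's client, assuming a hypothetical service name `user-code-deployment-1` and port 3030 (substitute whatever host/port your Helm values expose):
```python
from dagster.grpc.client import DagsterGrpcClient

# Host and port are assumptions -- use the Kubernetes service name and gRPC port
# of your user code deployment.
user_code = DagsterGrpcClient(host="user-code-deployment-1", port=3030)

# A simple reachability check against the server the traceback is failing to reach:
# if this raises, the connection problem is real; if it echoes back, the gRPC
# server itself is up and responding.
print(user_code.ping("hello"))
```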
This happened immediately after re-deploying to the cluster. To be clear, triggering 2k runs at once (with those 2k runs executing in batches of 25 and the rest sitting in the queue) did not immediately kill Dagit. The partition set had been running for 54 minutes, so probably only ~1.2k pipelines were still queued, and after re-deploying, Dagit would die over and over (restarting) with the stack trace posted above.
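Since the pod keeps restarting, it may also be worth checking whether Kubernetes is killing it (e.g. OOM) or the process is exiting on its own; a minimal sketch with the Kubernetes Python client, under the same namespace/label assumptions as above:
```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Same assumptions as before: "dagster" namespace, "component=dagit" label.
for pod in core.list_namespaced_pod("dagster", label_selector="component=dagit").items:
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated
        # reason distinguishes e.g. "OOMKilled" (resource pressure) from "Error" (the process died)
        print(pod.metadata.name, "restarts:", cs.restart_count,
              "last exit:", (term.reason, term.exit_code) if term else None)
```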