# dagster-cloud

Dennis Schwartz (he/him)

03/04/2024, 9:23 AM
Hi all! I keep seeing this error in our Hybrid Cloud deployments:
dagster_cloud_cli.core.errors.GraphQLStorageError: Max retries (6) exceeded, too many ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) error responses.
It seems to be intermittent, although it happens in probably half my runs, and it's making my head explode. Any tips on where to look for causes? I have nothing else to go on. I will post the full error message in the thread.
dagster_cloud_cli.core.errors.GraphQLStorageError: Max retries (6) exceeded, too many ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) error responses.

  File "/usr/local/lib/python3.11/site-packages/dagster/_cli/api.py", line 377, in _execute_step_command_body
    yield DagsterEvent.step_worker_started(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dagster/_core/events/__init__.py", line 1120, in step_worker_started
    log_manager.log_dagster_event(
  File "/usr/local/lib/python3.11/site-packages/dagster/_core/log_manager.py", line 420, in log_dagster_event
    self.log(level=level, msg=msg, extra={DAGSTER_META_KEY: dagster_event})
  File "/usr/local/lib/python3.11/site-packages/dagster/_core/log_manager.py", line 435, in log
    self._log(level, msg, args, **kwargs)
  File "/usr/local/lib/python3.11/logging/__init__.py", line 1634, in _log
    self.handle(record)
  File "/usr/local/lib/python3.11/logging/__init__.py", line 1644, in handle
    self.callHandlers(record)
  File "/usr/local/lib/python3.11/logging/__init__.py", line 1706, in callHandlers
    hdlr.handle(record)
  File "/usr/local/lib/python3.11/logging/__init__.py", line 978, in handle
    self.emit(record)
  File "/usr/local/lib/python3.11/site-packages/dagster/_core/log_manager.py", line 301, in emit
    handler.handle(dagster_record)
  File "/usr/local/lib/python3.11/logging/__init__.py", line 978, in handle
    self.emit(record)
  File "/usr/local/lib/python3.11/site-packages/dagster/_core/instance/__init__.py", line 237, in emit
    self._instance.handle_new_event(event)
  File "/usr/local/lib/python3.11/site-packages/dagster/_core/instance/__init__.py", line 2350, in handle_new_event
    self._event_storage.store_event(event)
  File "/usr/local/lib/python3.11/site-packages/dagster_cloud/storage/event_logs/storage.py", line 519, in store_event
    self._execute_query(
  File "/usr/local/lib/python3.11/site-packages/dagster_cloud/storage/event_logs/storage.py", line 399, in _execute_query
    res = self._graphql_client.execute(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dagster_cloud_cli/core/graphql_client.py", line 135, in execute
    raise GraphQLStorageError(

The above exception was caused by the following exception:
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

  File "/usr/local/lib/python3.11/site-packages/dagster_cloud_cli/core/graphql_client.py", line 81, in execute
    return self._execute_retry(query, variable_values, headers)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dagster_cloud_cli/core/graphql_client.py", line 157, in _execute_retry
    response = self._session.post(
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)

The above exception occurred during handling of the following exception:
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

  File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
                       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/http/client.py", line 1390, in getresponse
    response.begin()
  File "/usr/local/lib/python3.11/http/client.py", line 325, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/http/client.py", line 294, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"

The above exception occurred during handling of the following exception:
http.client.RemoteDisconnected: Remote end closed connection without response

  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
                       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/http/client.py", line 1390, in getresponse
    response.begin()
  File "/usr/local/lib/python3.11/http/client.py", line 325, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/http/client.py", line 294, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"

The above exception occurred during handling of the following exception:
TypeError: HTTPConnection.getresponse() got an unexpected keyword argument 'buffering'

  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse(buffering=True)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
These are the versions I'm running:
dagster	1.6.6
dagster-aws	0.22.6
dagster-k8s	0.22.6
dagstermill	0.22.6
They are the latest as far as I'm aware.
Ah, I stand corrected. I will try upgrading.
Still getting the same error:
Error in Dagster Cloud request (('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))). Retrying now.
Ok, I think I found the issue. One of the nodes in my Kubernetes cluster was somehow unable to resolve dagster-cloud.svc.cluster.local, which caused the connection errors. Since not all workloads were scheduled to this node, the issue was sporadic but happened often. Removing this node from the cluster solved the problem. Hope this might help someone in the future 🙂
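For anyone who wants to check for the same thing, here is a minimal sketch of a DNS check you could run from a pod on each node. It only uses the standard library's socket.getaddrinfo. The function names resolve_host and report, and the NODE_NAME environment variable, are made up for illustration (NODE_NAME assumes you expose the node name to the pod yourself), so treat this as a starting point rather than an official diagnostic.

```python
import os
import socket
from typing import Callable, List


def resolve_host(hostname: str, resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted unique IP addresses for hostname, or [] if the lookup fails."""
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # The resolver could not find the name (or could not reach a DNS server).
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({entry[4][0] for entry in results})


def report(hostname: str, resolver: Callable = socket.getaddrinfo) -> str:
    """Build a one-line summary that includes the node name, if it is available."""
    # NODE_NAME is a hypothetical variable; it is only set if you configure it on the pod.
    node = os.environ.get("NODE_NAME", "unknown-node")
    addresses = resolve_host(hostname, resolver)
    if addresses:
        return f"{node}: {hostname} resolved to {', '.join(addresses)}"
    return f"{node}: {hostname} FAILED to resolve"


# Example: print(report("dagster-cloud.svc.cluster.local"))
```

If one node reports a failure while the others resolve the name, that node's DNS setup is the place to look.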