# deployment-kubernetes
r
Hey team, I am getting this error when attempting to run 200 parallel runs on K8s with Dagster 0.14.8, any ideas?
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "<my_server_ip>", port 5432 failed: FATAL: sorry, too many clients already :: proc.c:363
connection to server at "<my_server_ip>", port 5432 failed: FATAL: sorry, too many clients already :: proc.c:363
dagster_postgres.utils.DagsterPostgresException: too many retries for DB connection
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/grpc/impl.py", line 91, in core_execute_run
    yield from execute_run_iterator(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/execution/api.py", line 879, in __iter__
    yield from self.iterator(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/execution/api.py", line 844, in pipeline_execution_iterator
    event = DagsterEvent.pipeline_success(pipeline_context)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/events/__init__.py", line 831, in pipeline_success
    return DagsterEvent.from_pipeline(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/events/__init__.py", line 351, in from_pipeline
    log_pipeline_event(pipeline_context, event)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/events/__init__.py", line 258, in log_pipeline_event
    pipeline_context.log.log_dagster_event(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/log_manager.py", line 377, in log_dagster_event
    self.log(level=level, msg=msg, extra={DAGSTER_META_KEY: dagster_event})
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/log_manager.py", line 392, in log
    self._log(level, msg, args, **kwargs)
  File "/usr/local/lib/python3.8/logging/__init__.py", line 1589, in _log
    self.handle(record)
  File "/usr/local/lib/python3.8/logging/__init__.py", line 1599, in handle
    self.callHandlers(record)
  File "/usr/local/lib/python3.8/logging/__init__.py", line 1661, in callHandlers
    hdlr.handle(record)
  File "/usr/local/lib/python3.8/logging/__init__.py", line 954, in handle
    self.emit(record)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/log_manager.py", line 259, in emit
    handler.handle(dagster_record)
  File "/usr/local/lib/python3.8/logging/__init__.py", line 954, in handle
    self.emit(record)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/instance/__init__.py", line 150, in emit
    self._instance.handle_new_event(event)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/instance/__init__.py", line 1326, in handle_new_event
    self._run_storage.handle_run_event(run_id, event.dagster_event)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/storage/runs/sql_run_storage.py", line 135, in handle_run_event
    run = self.get_run_by_id(run_id)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/storage/runs/sql_run_storage.py", line 397, in get_run_by_id
    rows = self.fetchall(query)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/storage/runs/sql_run_storage.py", line 71, in fetchall
    with self.connect() as conn:
  File "/usr/local/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster_postgres/utils.py", line 170, in create_pg_connection
    conn = retry_pg_connection_fn(engine.connect)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster_postgres/utils.py", line 131, in retry_pg_connection_fn
    raise DagsterPostgresException("too many retries for DB connection") from exc
a
it looks like you’ll need more resources on your postgres database to handle that volume of simultaneous connections
r
Could this be related to this issue I am getting in the Instance status page?
Operation name: LastTenRunsPerJobQuery

Message: (psycopg2.errors.QueryCanceled) canceling statement due to statement timeout

[SQL: SELECT subquery.id, subquery.run_body, subquery.create_timestamp, subquery.update_timestamp, subquery.start_time, subquery.end_time 
FROM (SELECT runs.id AS id, runs.run_body AS run_body, runs.create_timestamp AS create_timestamp, runs.update_timestamp AS update_timestamp, runs.start_time AS start_time, runs.end_time AS end_time, rank() OVER (PARTITION BY runs.pipeline_name ORDER BY runs.id DESC) AS rank 
FROM runs 
WHERE runs.pipeline_name IN (%(pipeline_name_1_1)s)) AS subquery 
WHERE subquery.rank <= %(rank_1)s ORDER BY subquery.rank ASC]
[parameters: {'rank_1': 10, 'pipeline_name_1_1': 'sync_company_metrics_to_client_api_pipeline'}]
(Background on this error at: <https://sqlalche.me/e/14/e3q8>)

Path: ["workspaceOrError","locationEntries",0,"locationOrLoadError","repositories",0,"pipelines",0,"runs"]

Locations: [{"line":17,"column":17}]
Both happen at the same time.
According to https://cloud.google.com/sql/docs/quotas, my default db-f1-micro should handle 250 concurrent connections, so is it possible that there are 50 other connections going on besides the 200 pipelines? Schedules and such, I assume?
a
which executor are you using? the connection count may scale more with the number of simultaneous ops than with simultaneous jobs, unless you are only using the in-process executor
r
so I'm still using the legacy solids API, I just upgraded Dagster and migrated, but I am using in-process execution, one K8s job per pipeline, and no multiprocessing
a
hm, ok. Dagit, the daemon, and in some cases the user code servers will have connections as well, though 50 is more than I would guess. The limited CPU and memory available on the small instance size is the likely cause of some queries timing out under load
there are ways to query Postgres directly about its connections, which may help identify where the unexpected volume is coming from
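For example, a sketch against the standard pg_stat_activity view (stock Postgres, nothing Dagster-specific), which groups the current connections by client and state:
-- Who is holding connections right now, grouped by client and state
SELECT usename, application_name, client_addr, state, count(*) AS connections
FROM pg_stat_activity
GROUP BY usename, application_name, client_addr, state
ORDER BY connections DESC;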
r
will have a look and let you know, thanks!
Connection count and CPU usage seem fine; memory usage is almost at the limit, but not surpassing it
a
hmm, at what granularity are those measurements taken? I'm not sure how
FATAL: sorry, too many clients already
occurs except when there are too many connections
r
lemme check again, because what you are saying makes total sense
a
it would be surprising for it to spike so much in a really small time window, but I'm not sure how else to explain it. Are you seeing these errors in the jobs that are running, in Dagit, or both?
r
In Dagit, but I think the PostgreSQL max concurrent connections limit is actually 25 rather than the 250 stated in the GCP docs
so it makes total sense
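(A quick way to confirm what the server is actually enforcing, using standard Postgres statements that should also work on Cloud SQL:)
-- Connection ceiling the server is enforcing (Cloud SQL derives this from the instance tier)
SHOW max_connections;
-- Current connection count, to compare against that ceiling
SELECT count(*) FROM pg_stat_activity;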
there is no way to use MySQL for Dagster right?
a
there's dagster-mysql
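It wires into dagster.yaml the same way dagster-postgres does. A minimal sketch, assuming the standard dagster-mysql storage classes and placeholder credentials (double-check the exact keys against the dagster-mysql docs for your version):
# dagster.yaml sketch for dagster-mysql; the &mysql_db YAML anchor just reuses
# the same connection block for all three storages
run_storage:
  module: dagster_mysql.run_storage
  class: MySQLRunStorage
  config:
    mysql_db: &mysql_db
      username: dagster          # placeholder
      password: "<my_password>"  # placeholder
      hostname: <my_server_ip>   # placeholder
      db_name: dagster
      port: 3306
event_log_storage:
  module: dagster_mysql.event_log
  class: MySQLEventLogStorage
  config:
    mysql_db: *mysql_db
schedule_storage:
  module: dagster_mysql.schedule_storage
  class: MySQLScheduleStorage
  config:
    mysql_db: *mysql_db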