# deployment-kubernetes
r
Hey team, I am getting this error when attempting to run 200 parallel runs on K8s with Dagster 0.14.8, any ideas?
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "<my_server_ip>", port 5432 failed: FATAL: sorry, too many clients already :: proc.c:363
connection to server at "<my_server_ip>", port 5432 failed: FATAL: sorry, too many clients already :: proc.c:363
dagster_postgres.utils.DagsterPostgresException: too many retries for DB connection
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/grpc/impl.py", line 91, in core_execute_run
    yield from execute_run_iterator(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/execution/api.py", line 879, in __iter__
    yield from self.iterator(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/execution/api.py", line 844, in pipeline_execution_iterator
    event = DagsterEvent.pipeline_success(pipeline_context)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/events/__init__.py", line 831, in pipeline_success
    return DagsterEvent.from_pipeline(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/events/__init__.py", line 351, in from_pipeline
    log_pipeline_event(pipeline_context, event)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/events/__init__.py", line 258, in log_pipeline_event
    pipeline_context.log.log_dagster_event(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/log_manager.py", line 377, in log_dagster_event
    self.log(level=level, msg=msg, extra={DAGSTER_META_KEY: dagster_event})
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/log_manager.py", line 392, in log
    self._log(level, msg, args, **kwargs)
  File "/usr/local/lib/python3.8/logging/__init__.py", line 1589, in _log
    self.handle(record)
  File "/usr/local/lib/python3.8/logging/__init__.py", line 1599, in handle
    self.callHandlers(record)
  File "/usr/local/lib/python3.8/logging/__init__.py", line 1661, in callHandlers
    hdlr.handle(record)
  File "/usr/local/lib/python3.8/logging/__init__.py", line 954, in handle
    self.emit(record)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/log_manager.py", line 259, in emit
    handler.handle(dagster_record)
  File "/usr/local/lib/python3.8/logging/__init__.py", line 954, in handle
    self.emit(record)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/instance/__init__.py", line 150, in emit
    self._instance.handle_new_event(event)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/instance/__init__.py", line 1326, in handle_new_event
    self._run_storage.handle_run_event(run_id, event.dagster_event)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/storage/runs/sql_run_storage.py", line 135, in handle_run_event
    run = self.get_run_by_id(run_id)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/storage/runs/sql_run_storage.py", line 397, in get_run_by_id
    rows = self.fetchall(query)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster/core/storage/runs/sql_run_storage.py", line 71, in fetchall
    with self.connect() as conn:
  File "/usr/local/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster_postgres/utils.py", line 170, in create_pg_connection
    conn = retry_pg_connection_fn(engine.connect)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/dagster_postgres/utils.py", line 131, in retry_pg_connection_fn
    raise DagsterPostgresException("too many retries for DB connection") from exc
a
it looks like you’ll need more resources on your postgres database to handle that volume of simultaneous connections
r
Could this be related to this issue I am getting in the Instance status page?
Operation name: LastTenRunsPerJobQuery

Message: (psycopg2.errors.QueryCanceled) canceling statement due to statement timeout

[SQL: SELECT subquery.id, subquery.run_body, subquery.create_timestamp, subquery.update_timestamp, subquery.start_time, subquery.end_time 
FROM (SELECT runs.id AS id, runs.run_body AS run_body, runs.create_timestamp AS create_timestamp, runs.update_timestamp AS update_timestamp, runs.start_time AS start_time, runs.end_time AS end_time, rank() OVER (PARTITION BY runs.pipeline_name ORDER BY runs.id DESC) AS rank 
FROM runs 
WHERE runs.pipeline_name IN (%(pipeline_name_1_1)s)) AS subquery 
WHERE subquery.rank <= %(rank_1)s ORDER BY subquery.rank ASC]
[parameters: {'rank_1': 10, 'pipeline_name_1_1': 'sync_company_metrics_to_client_api_pipeline'}]
(Background on this error at: <https://sqlalche.me/e/14/e3q8>)

Path: ["workspaceOrError","locationEntries",0,"locationOrLoadError","repositories",0,"pipelines",0,"runs"]

Locations: [{"line":17,"column":17}]
Both happen at the same time.
According to https://cloud.google.com/sql/docs/quotas, my default db-f1-micro should handle 250 concurrent connections, so is it possible that there are 50 other connections going on besides the 200 pipelines? Schedules and such, I assume?
a
which executor are you using? the connection count may scale more with the number of simultaneous ops than with simultaneous jobs, unless you are only using the in-process executor
r
so I'm still using the legacy solids API, I just upgraded Dagster and migrated, but I am using in-process execution, one K8s job per pipeline, and no multiprocessing
a
hm, ok. Dagit, the daemon, and in some cases the user code servers will have connections as well, though 50 is more than I would guess. The limited CPU and memory available on the small instance size is the likely cause of some queries timing out under load
there are ways to query Postgres directly about its connections, which may help identify where the unexpected volume is coming from
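For example, a sketch against the standard pg_stat_activity view (stock Postgres, nothing Dagster-specific), which groups the current connections by client and state:
-- Who is holding connections right now, grouped by client and state
SELECT usename, application_name, client_addr, state, count(*) AS connections
FROM pg_stat_activity
GROUP BY usename, application_name, client_addr, state
ORDER BY connections DESC;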
r
will have a look and let you know, thanks!
Connection count and CPU usage seem fine; memory usage is almost at the limit, but not surpassing it
a
hmm, at what granularity are those measurements taken? I'm not sure how
FATAL: sorry, too many clients already
occurs except when there are too many connections
r
lemme check again, because what you are saying makes total sense
a
it would be surprising for it to spike so much in a really small time window, but I'm not sure how else to explain it. Are you seeing these errors in the jobs that are running, in Dagit, or both?
r
In Dagit, but I think the PostgreSQL max concurrent connections limit is actually 25 rather than the 250 stated in the GCP docs
so it makes total sense
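(A quick way to confirm what the server is actually enforcing, using standard Postgres statements that should also work on Cloud SQL:)
-- Connection ceiling the server is enforcing (Cloud SQL derives this from the instance tier)
SHOW max_connections;
-- Current connection count, to compare against that ceiling
SELECT count(*) FROM pg_stat_activity;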
there is no way to use MySQL for Dagster right?
a
there's dagster-mysql
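It wires into dagster.yaml the same way dagster-postgres does. A minimal sketch, assuming the standard dagster-mysql storage classes and placeholder credentials (double-check the exact keys against the dagster-mysql docs for your version):
# dagster.yaml sketch for dagster-mysql; the &mysql_db YAML anchor just reuses
# the same connection block for all three storages
run_storage:
  module: dagster_mysql.run_storage
  class: MySQLRunStorage
  config:
    mysql_db: &mysql_db
      username: dagster          # placeholder
      password: "<my_password>"  # placeholder
      hostname: <my_server_ip>   # placeholder
      db_name: dagster
      port: 3306
event_log_storage:
  module: dagster_mysql.event_log
  class: MySQLEventLogStorage
  config:
    mysql_db: *mysql_db
schedule_storage:
  module: dagster_mysql.schedule_storage
  class: MySQLScheduleStorage
  config:
    mysql_db: *mysql_db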