#deployment-kubernetes

Rubén Lopez Lozoya

06/04/2021, 1:07 PM
dagster_postgres.utils.DagsterPostgresException: too many retries for DB connection
  File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/__init__.py", line 1265, in submit_run
    run, external_pipeline=external_pipeline
  File "/usr/local/lib/python3.7/site-packages/dagster/core/run_coordinator/queued_run_coordinator.py", line 96, in submit_run
    self._instance.handle_new_event(event_record)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/__init__.py", line 1068, in handle_new_event
    self._run_storage.handle_run_event(run_id, event.dagster_event)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/storage/runs/sql_run_storage.py", line 142, in handle_run_event
    with self.connect() as conn:
  File "/usr/local/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.7/site-packages/dagster_postgres/utils.py", line 160, in create_pg_connection
    conn = retry_pg_connection_fn(engine.connect)
  File "/usr/local/lib/python3.7/site-packages/dagster_postgres/utils.py", line 127, in retry_pg_connection_fn
    raise DagsterPostgresException("too many retries for DB connection") from exc
The above exception was caused by the following exception:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL: remaining connection slots are reserved for non-replication superuser connections
(Background on this error at: <http://sqlalche.me/e/14/e3q8>)
  File "/usr/local/lib/python3.7/site-packages/dagster_postgres/utils.py", line 116, in retry_pg_connection_fn
    return fn()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 3166, in connect
    return self._connection_cls(self, close_with_result=close_with_result)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
    else engine.raw_connection()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 3245, in raw_connection
    return self._wrap_pool_connect(self.pool.connect, _connection)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 3216, in _wrap_pool_connect
    e, dialect, self
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2069, in _handle_dbapi_exception_noconnection
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 3212, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 301, in connect
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 761, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 419, in checkout
    rec = pool._do_get()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/impl.py", line 259, in _do_get
    return self._create_connection()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 247, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 362, in __init__
    self.__connect()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 605, in __connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 72, in __exit__
    with_traceback=exc_tb,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 599, in __connect
    connection = pool._invoke_creator(self)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/create.py", line 578, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 584, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.7/site-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
Last time this happened was during a backfill with 300+ partition items. It broke at around partition item 256: this error popped up and no further partition items were inserted into the queue.

daniel

06/04/2021, 1:14 PM
Do you have the run queue feature enabled to limit the number of runs that can be happening at the same time?

Rubén Lopez Lozoya

06/04/2021, 1:14 PM
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 20
this?

daniel

06/04/2021, 1:14 PM
Yeah

Rubén Lopez Lozoya

06/04/2021, 1:14 PM
yeah it's set up

daniel

06/04/2021, 1:15 PM
How many runs/solids are typically happening in parallel when you run into this? The error message here looks like the DB is receiving more connections at once than it can handle

Rubén Lopez Lozoya

06/04/2021, 1:15 PM
Right now there are 18 concurrent pipeline runs in progress
It is a 10-solid pipeline

daniel

06/04/2021, 1:19 PM
And the solids also run in parallel?
for example using celery-k8s?

Rubén Lopez Lozoya

06/04/2021, 1:20 PM
No, we moved away from Celery, and we are not using multiprocessing either

alex

06/04/2021, 2:38 PM
what are you using to supply your postgres database?

Rubén Lopez Lozoya

06/04/2021, 3:28 PM
Google Cloud SQL

alex

06/04/2021, 3:34 PM
Do you see any monitoring / settings around concurrent connections there? You could consider putting a pgbouncer between dagster and the google cloud sql db too if that makes sense, but I would start with looking at what google cloud sql says is going on for your instance
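One way to check connection usage from the database side, alongside the Cloud SQL monitoring page, is to ask Postgres directly. The sketch below is only an illustration: `connection_report` is a hypothetical helper (not part of Dagster), it assumes Postgres 10 or newer (for the `backend_type` column of `pg_stat_activity`), and the counts are approximate because superuser sessions are included in the total.

```python
from contextlib import closing

# Count client sessions per user / application / state.
# Assumes Postgres 10+, where pg_stat_activity has a backend_type column.
ACTIVITY_SQL = """
SELECT usename, application_name, state, count(*) AS n
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY usename, application_name, state
ORDER BY n DESC
"""


def connection_report(conn):
    """Summarize connection usage.

    `conn` is any DB-API connection to the Postgres instance,
    for example one returned by psycopg2.connect(...).
    """
    with closing(conn.cursor()) as cur:
        cur.execute("SHOW max_connections")
        max_connections = int(cur.fetchone()[0])
        cur.execute("SHOW superuser_reserved_connections")
        reserved = int(cur.fetchone()[0])
        cur.execute(ACTIVITY_SQL)
        rows = cur.fetchall()

    in_use = sum(row[3] for row in rows)
    return {
        "max_connections": max_connections,
        "superuser_reserved": reserved,
        "in_use": in_use,
        # Approximate: slots still open to non-superuser roles.
        "free_for_regular_users": max_connections - reserved - in_use,
        "by_client": rows,
    }


# Usage (connection string is a placeholder):
#   import psycopg2
#   conn = psycopg2.connect("postgresql://user:password@host:5432/dagster")
#   print(connection_report(conn))
```

Running this periodically during a backfill would show whether usage gets close to `max_connections` minus the reserved slots at the moment the error appears.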

Rubén Lopez Lozoya

06/07/2021, 3:24 PM
Max concurrent connections this week has been 18, and not during the time of the failure. I have hit the retry issue twice this week, but neither time was during peak load
My Cloud SQL instance allows for 25 concurrent connections

alex

06/07/2021, 3:31 PM
hmm odd - I'm not sure what else would cause
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL: remaining connection slots are reserved for non-replication superuser connections
(Background on this error at: <http://sqlalche.me/e/14/e3q8>)
besides connection limit issues

Rubén Lopez Lozoya

06/08/2021, 11:23 AM
How does Dagster manage connections? Let's say I have a backfill of 100 items, how many connections to the Postgres DB would that imply? Also, is there a way to specify the size of the connection pool for the DB?

alex

06/08/2021, 3:05 PM
Dagster just connects to the entity configured on the instance, so in the default set-up each process executing any Dagster code will have its own DB connection. So for those 100 backfill runs, the number of simultaneous connections depends on whether you are doing in-process or multi-process execution. If you need global management of connections you will need to configure Dagster to talk to something like a pgbouncer deployment that sits in front of your pg db.
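A rough back-of-the-envelope estimate for the numbers in this thread (`max_concurrent_runs: 20`, a 25-connection limit) could look like the sketch below. It rests on assumptions not confirmed above: Postgres reserves 3 slots for superusers by default (`superuser_reserved_connections`; the value on a managed instance may differ), each run uses one process with in-process execution, and there are two long-lived processes (for example a webserver and a daemon). The function names and the one-connection-per-process default are hypothetical; a process may hold more than one connection in practice.

```python
def regular_slots(max_connections, superuser_reserved=3):
    """Slots usable by non-superuser roles.

    superuser_reserved=3 is the stock Postgres default; treat it as an
    assumption for a managed instance.
    """
    return max_connections - superuser_reserved


def estimated_connections(
    concurrent_runs,
    processes_per_run=1,        # assumption: in-process executor
    connections_per_process=1,  # assumption: may be higher in practice
    long_lived_processes=2,     # assumption: e.g. webserver + daemon
):
    """Very rough upper estimate of simultaneous DB connections."""
    processes = concurrent_runs * processes_per_run + long_lived_processes
    return processes * connections_per_process


slots = regular_slots(25)
for runs in (18, 20):
    needed = estimated_connections(runs)
    print(f"{runs} runs: ~{needed} connections, {slots} regular slots, "
          f"headroom {slots - needed}")
```

Under these assumptions there is little or no headroom once the queue is full, so any extra connection per process could be enough to trigger the "remaining connection slots are reserved" error.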