# announcements
d
0.8.7 with postgres for run, event log, and schedule storage. Getting a "too many clients" postgres exception from sqlalchemy as an "Engine Event" in the pipeline run log. Pgadmin shows that most of the ~100 idle connections to the `dagster` db are running the query `LISTEN "run_events"`. I think 100 is the max clients setting for the postgres docker image I'm running, so it might be opening as many connections as it can get. Is that normal?
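A quick way to confirm what those idle connections are doing is to query `pg_stat_activity` directly. A minimal sketch, assuming the postgres container is reachable on localhost with `dagster`/`dagster` credentials (all placeholders; adjust to whatever your docker-compose defines):

```python
# Group the dagster database's connections by state and last query; a leak
# like the one described above shows up as a growing count of idle
# connections whose last statement was LISTEN "run_events".
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432, dbname="dagster", user="dagster", password="dagster"
)
try:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT state, query, count(*) AS n
            FROM pg_stat_activity
            WHERE datname = 'dagster'
            GROUP BY state, query
            ORDER BY n DESC;
            """
        )
        for state, query, n in cur.fetchall():
            print(f"{n:4d}  {state or '-':10s}  {query}")
finally:
    conn.close()
```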
Sorry, posted the wrong part of the trace. Here's the full one:
```
dagster_1           | ERROR:root:(psycopg2.OperationalError) FATAL:  sorry, too many clients already
dagster_1           | 
dagster_1           | (Background on this error at: <http://sqlalche.me/e/13/e3q8>)
dagster_1           | Traceback (most recent call last):
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2339, in _wrap_pool_connect
dagster_1           |     return fn()
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 364, in connect
dagster_1           |     return _ConnectionFairy._checkout(self)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
dagster_1           |     fairy = _ConnectionRecord.checkout(pool)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
dagster_1           |     rec = pool._do_get()
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/impl.py", line 239, in _do_get
dagster_1           |     return self._create_connection()
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
dagster_1           |     return _ConnectionRecord(self)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
dagster_1           |     self.__connect(first_connect_check=True)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
dagster_1           |     pool.logger.debug("Error on connect(): %s", e)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 69, in __exit__
dagster_1           |     exc_value, with_traceback=exc_tb,
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
dagster_1           |     raise exception
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
dagster_1           |     connection = pool._invoke_creator(self)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
dagster_1           |     return dialect.connect(*cargs, **cparams)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 493, in connect
dagster_1           |     return self.dbapi.connect(*cargs, **cparams)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/psycopg2/__init__.py", line 127, in connect
dagster_1           |     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
dagster_1           | psycopg2.OperationalError: FATAL:  sorry, too many clients already
dagster_1           | 
dagster_1           | 
dagster_1           | The above exception was the direct cause of the following exception:
dagster_1           | 
dagster_1           | Traceback (most recent call last):
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/dagster_celery/core_execution_loop.py", line 108, in core_celery_execution_loop
dagster_1           |     step_key=step.key,
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/dagster/core/events/__init__.py", line 644, in engine_event
dagster_1           |     step_key=step_key,
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/dagster/core/events/__init__.py", line 199, in from_pipeline
dagster_1           |     log_pipeline_event(pipeline_context, event, step_key)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/dagster/core/events/__init__.py", line 138, in log_pipeline_event
dagster_1           |     step_key=step_key,
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/dagster/core/log_manager.py", line 262, in debug
dagster_1           |     return self._log(logging.DEBUG, msg, kwargs)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/dagster/core/log_manager.py", line 232, in _log
dagster_1           |     logger_.log(level, message, extra=extra)
dagster_1           |   File "/usr/local/lib/python3.7/logging/__init__.py", line 1444, in log
dagster_1           |     self._log(level, msg, args, **kwargs)
dagster_1           |   File "/usr/local/lib/python3.7/logging/__init__.py", line 1514, in _log
dagster_1           |     self.handle(record)
dagster_1           |   File "/usr/local/lib/python3.7/logging/__init__.py", line 1524, in handle
dagster_1           |     self.callHandlers(record)
dagster_1           |   File "/usr/local/lib/python3.7/logging/__init__.py", line 1586, in callHandlers
dagster_1           |     hdlr.handle(record)
dagster_1           |   File "/usr/local/lib/python3.7/logging/__init__.py", line 894, in handle
dagster_1           |     self.emit(record)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/__init__.py", line 120, in emit
dagster_1           |     self._instance.handle_new_event(event)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/__init__.py", line 868, in handle_new_event
dagster_1           |     self._event_storage.store_event(event)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/dagster_postgres/event_log/event_log.py", line 92, in store_event
dagster_1           |     sql_statement.returning(SqlEventLogStorageTable.c.run_id, SqlEventLogStorageTable.c.id)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2237, in execute
dagster_1           |     connection = self._contextual_connect(close_with_result=True)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2305, in _contextual_connect
dagster_1           |     self._wrap_pool_connect(self.pool.connect, None),
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2343, in _wrap_pool_connect
dagster_1           |     e, dialect, self
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1585, in _handle_dbapi_exception_noconnection
dagster_1           |     sqlalchemy_exception, with_traceback=exc_info[2], from_=e
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
dagster_1           |     raise exception
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2339, in _wrap_pool_connect
dagster_1           |     return fn()
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 364, in connect
dagster_1           |     return _ConnectionFairy._checkout(self)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
dagster_1           |     fairy = _ConnectionRecord.checkout(pool)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
dagster_1           |     rec = pool._do_get()
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/impl.py", line 239, in _do_get
dagster_1           |     return self._create_connection()
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
dagster_1           |     return _ConnectionRecord(self)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
dagster_1           |     self.__connect(first_connect_check=True)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
dagster_1           |     pool.logger.debug("Error on connect(): %s", e)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 69, in __exit__
dagster_1           |     exc_value, with_traceback=exc_tb,
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
dagster_1           |     raise exception
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
dagster_1           |     connection = pool._invoke_creator(self)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
dagster_1           |     return dialect.connect(*cargs, **cparams)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 493, in connect
dagster_1           |     return self.dbapi.connect(*cargs, **cparams)
dagster_1           |   File "/usr/local/lib/python3.7/site-packages/psycopg2/__init__.py", line 127, in connect
dagster_1           |     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
```
Doesn't seem to be a problem on 0.8.5
On 0.8.7, I'm seeing the connection count grow continuously as my pipeline progresses, until it hits that 100 limit and the dagit engine starts erroring out. It's not preallocating connections or anything, it looks like a leak
On 0.8.5 with the same pipeline running, pg connection counts to the dagster db typically stay at <10
Possibly relevant: this pipeline continuously polls dagster's run list with the `wait_for_run_slot` function here: https://github.com/dagster-io/dagster/issues/2716#issue-656821567. I don't imagine a lot of setups do this, so I thought I'd mention it. If there's any reason why this would stop working after 0.8.5, that would probably be the answer.
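For context, a rough sketch of the kind of polling helper described in that issue (not the actual code from the issue; names, thresholds, and the `PipelineRunStatus` import path are assumptions and may differ across 0.8.x versions). The relevant detail for the discussion below is the call to `DagsterInstance.get()` inside the loop:

```python
# Hypothetical sketch of a "wait for a run slot" helper: block until fewer
# than max_concurrent runs are in progress on the local instance.
import time

from dagster import DagsterInstance
from dagster.core.storage.pipeline_run import PipelineRunStatus

IN_PROGRESS = {PipelineRunStatus.NOT_STARTED, PipelineRunStatus.STARTED}


def wait_for_run_slot(max_concurrent=4, poll_interval=15):
    while True:
        instance = DagsterInstance.get()  # a new instance on every poll
        active = [r for r in instance.get_runs() if r.status in IN_PROGRESS]
        if len(active) < max_concurrent:
            return
        time.sleep(poll_interval)
```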
m
@prha does this ring any bells?
p
no, not off the top of my head…
m
@Danny i wonder about that call to `DagsterInstance.get`
what if you passed the instance object through from the solid compute context instead: `context.instance`
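A minimal sketch of that suggestion (illustrative names, not the actual code under discussion): accept the instance as an argument and pass `context.instance` in from the solid, instead of calling `DagsterInstance.get()` inside the polling loop.

```python
# Sketch of the suggested change: reuse the instance the run already owns
# rather than constructing one per poll. Solid and helper names are made up.
import time

from dagster import solid
from dagster.core.storage.pipeline_run import PipelineRunStatus

IN_PROGRESS = {PipelineRunStatus.NOT_STARTED, PipelineRunStatus.STARTED}


def wait_for_run_slot(instance, max_concurrent=4, poll_interval=15):
    # Same polling loop as before, but operating on the passed-in instance.
    while True:
        active = [r for r in instance.get_runs() if r.status in IN_PROGRESS]
        if len(active) < max_concurrent:
            return
        time.sleep(poll_interval)


@solid
def throttled_solid(context):
    wait_for_run_slot(context.instance)  # the instance the run already owns
    context.log.info("acquired a run slot")
```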
d
Yep, will try it later and post back. But I have the same thing implemented via graphql as well (`pipelineRunsOrError`), so I hope it doesn't also affect that; I won't have `context.instance` there. Might be worth looking into why there's a difference between 0.8.5 and 0.8.7, since this might bite elsewhere.
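For the remote case, a rough sketch of what that graphql polling might look like over HTTP. The endpoint URL and the exact shape of `pipelineRunsOrError` (its filter argument and result union) are assumptions here and vary by version, so check them against dagit's GraphiQL playground before relying on this:

```python
# Hypothetical sketch of polling dagit's GraphQL API for in-progress runs.
import requests

DAGIT_GRAPHQL_URL = "http://localhost:3000/graphql"  # placeholder

RUNS_QUERY = """
query InProgressRuns {
  pipelineRunsOrError(filter: {statuses: [STARTED]}) {
    ... on PipelineRuns {
      results { runId status pipelineName }
    }
  }
}
"""


def count_in_progress_runs():
    resp = requests.post(DAGIT_GRAPHQL_URL, json={"query": RUNS_QUERY})
    resp.raise_for_status()
    payload = resp.json()["data"]["pipelineRunsOrError"]
    return len(payload.get("results", []))
```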
m
in the solid logic itself i mean
so this was new in 0.8.7?
d
My solid is polling a dagster instance. If it's a local instance, it uses the implementation in the ticket I linked. If remote, it uses a sister implementation that uses graphql. So I'll definitely try the `context.instance` tip, but it can only work in the former case.
Yes, new in 0.8.6 or 0.8.7.
I can confirm the issue isn't present in 0.8.5.
@prha @max I've isolated the issue and it's 100% reproducible in this repo: https://github.com/dbendelman/dagster-testing. The connection leak wasn't present in 0.8.6 or earlier; it only started in 0.8.7. The current 0.8 release still has the problem.
I put repro instructions in the readme; everything runs in docker, so all you need is docker compose.
Would be awesome if someone could take a look at this. We need access to the excellent multi-dep fan-in behavior in the latest releases, so we can't stay on 0.8.6, which makes this connection leak a blocker for us.
TIA!
Sweet - I've been able to track this down to the problematic commit: https://github.com/dagster-io/dagster/commit/90d404169d60fea714bdcd09b827c6de74c8a261 This is the first commit that shows the db connection leak. Binary search trial and error ftw! So now the question is, what in this commit can cause the symptoms I described above: a steadily growing postgres connection count to dagster's database until it maxes out at 100, with Pgadmin showing that most of the ~100 idle connections are running the query `LISTEN "run_events"`. Unfortunately I don't have the knowledge of dagster internals to tease this out, so any help would be appreciated! I'll soon update the repo I used to track this down (https://github.com/dbendelman/dagster-testing) so that you guys can easily repro locally with this bad commit versus the one immediately before it.
m
yikes
@schrockn @alex
d
Opening an issue for this now and will investigate right away. Thanks so much for finding the blamerev, Danny.
d
My pleasure, thanks for looking into this! 🎉