# ask-community
We are encountering an intermittent failure running tests on our dagster pipeline in Jenkins. The failures occur when starting dagit, with it in turn calling into dagster which starts initialising the database (which is confirmed to contain no tables before this point). Here is a snippet of the traceback:
dagster_1   |   File "/usr/local/lib/python3.8/site-packages/dagster/core/instance/ref.py", line 235, in run_storage
dagster_1   |     return self.run_storage_data.rehydrate()
dagster_1   |   File "/usr/local/lib/python3.8/site-packages/dagster/serdes/config_class.py", line 85, in rehydrate
dagster_1   |     return klass.from_config_value(self, result.value)
dagster_1   |   File "/usr/local/lib/python3.8/site-packages/dagster_postgres/run_storage/run_storage.py", line 88, in from_config_value
dagster_1   |     return PostgresRunStorage(
dagster_1   |   File "/usr/local/lib/python3.8/site-packages/dagster_postgres/run_storage/run_storage.py", line 62, in __init__
dagster_1   |     stamp_alembic_rev(pg_alembic_config(__file__), conn)
dagster_1   |   File "/usr/local/lib/python3.8/site-packages/dagster/core/storage/sql.py", line 46, in stamp_alembic_rev
dagster_1   |     stamp(alembic_config, rev)
dagster_1   |     cursor.execute(statement, parameters)
dagster_1   | sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "alembic_version_pkc"
dagster_1   | DETAIL:  Key (version_num)=(7cba9eeaaf1d) already exists.
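(To illustrate the shape of the race for anyone following along: both processes observe an empty alembic_version table, so both issue an unconditional INSERT of the same revision row, and the loser hits the primary key. A minimal stand-in sketch using sqlite, not dagster's actual code; under postgres/psycopg2 the analogous failure is the UniqueViolation above.)

```python
# Stand-in illustration (sqlite, NOT Dagster's actual code) of the race:
# two processes both see an empty alembic_version table, so both try to
# INSERT the same revision row; the second insert violates the PK.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alembic_version (version_num TEXT PRIMARY KEY)")

def stamp(version_num):
    # An unconditional INSERT, as issued by stamping code that believes
    # the table is empty because it checked before the other process wrote.
    conn.execute("INSERT INTO alembic_version VALUES (?)", (version_num,))

stamp("7cba9eeaaf1d")        # "process A" stamps first and succeeds
try:
    stamp("7cba9eeaaf1d")    # "process B" raced the check and fails
except sqlite3.IntegrityError as exc:
    print("duplicate key:", exc)
```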
So I said intermittent, and it does seem like some kind of race condition, because in other cases this initialisation proceeds and the tests all pass. Even when the tests pass, though, postgres still logs uniqueness violations; they're just absorbed by some retry logic (line 59 of `run_storage.py`, just above the line 62 mentioned in the traceback). I don't understand how the retry actually helps, though: if you retry something that results in a uniqueness violation, it should hit the same violation again; the wrapped function is just a sqlalchemy metadata's `create_all` method, which as far as I know doesn't check state before issuing the `CREATE TABLE` statements. This suggests there's something going on that I don't understand. On a "successful" run, the postgres log has 6 error messages from trying to create tables that already exist, and the dagster logs contain several instances of "Retrying failed database creation":
dagster_1   | WARNING:root:Retrying failed database creation
dagster_1   | WARNING:root:Retrying failed database creation
dagster_1   | WARNING:root:Retrying failed database creation
dagster_1   | WARNING:root:Retrying failed database creation
dagster_1   | WARNI [root] Retrying failed database creation
dagster_1   | WARNI [root] Retrying failed database creation
I am fairly sure the numeric correspondence is a coincidence, since the runs schema alone accounts for several of the 6 tables being duplicated. I also don't understand the significance of the two different log formats; I could only work out where one of them comes from in the dagster codebase. As you'd expect, we sometimes see slightly different errors - the same failure can surface while initialising one of the other storage components, for example. I am unfamiliar with driving alembic programmatically, so I don't really understand the significance of stamping the revision here in the dagster code rather than letting alembic do it. If anyone has any idea why this might be happening, we'd be grateful. As an aside, I was wondering why the race condition is mitigated with retry logic here, instead of SQL transaction logic that would serialise the initialisation.
Hi Chris - thanks for the detailed report, our whole DB initialization setup here could definitely be cleaner. One thing to be aware of is that create_all actually does check for existing tables (which is why the retry works): https://docs.sqlalchemy.org/en/14/core/metadata.html#sqlalchemy.schema.MetaData.create_all - but incorporating some transactions into the mix so that all the writes happen at once is probably still a good idea, and we can look into that. In the meantime, the most likely reason this is happening in your test is that you have multiple dagster processes starting up in parallel, each trying to create the same DB. If there's some way to avoid that (for example, starting up dagit first and waiting for it to start up successfully before spinning up a second process), that would also address this.
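(To make the `checkfirst` behaviour concrete, here is a minimal sketch - sqlite stand-in, and the function name and retry count are made up for illustration, this is not dagster's actual code. `create_all(checkfirst=True)`, the default, only issues `CREATE TABLE` for tables the database doesn't yet report, so a table created by a racing process is skipped on the next attempt and the retry converges.)

```python
# Sketch (sqlite stand-in, NOT Dagster's code) of why retrying
# MetaData.create_all converges: checkfirst=True (the default) only
# issues CREATE TABLE for tables the DB doesn't yet report existing,
# so tables created by a racing process are skipped on the next try.
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")  # in-memory stand-in for postgres
metadata = sa.MetaData()
runs = sa.Table("runs", metadata,
                sa.Column("run_id", sa.String, primary_key=True))

def create_tables_with_retry(retries=3):
    for _ in range(retries):
        try:
            metadata.create_all(engine, checkfirst=True)
            return
        except sa.exc.DBAPIError:
            # Another process created a table between our existence check
            # and our CREATE TABLE; the next attempt's check will see it.
            print("Retrying failed database creation")
    raise RuntimeError("could not create tables")

create_tables_with_retry()
```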
OK I can definitely see one way in which that could be happening (and another way in which I hope it isn't happening...) Thanks for the clarification on how sqlalchemy creates the tables, too :)
Just to follow up on this (it took us this long because something else exploded): we were doing something like `dagit &` immediately followed by starting the second process. Introducing a wait between the two looks to be fixing this.
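(For anyone hitting the same thing, one way to implement that wait is to poll until something is listening on dagit's port before launching the second process. The helper name and the host/port are assumptions for illustration, not dagster tooling.)

```python
# Sketch of a startup wait: block until a TCP connection to dagit's
# port succeeds, then let the second process start. Hypothetical helper.
import socket
import time

def wait_for_port(host, port, timeout=60.0):
    """Return True once a TCP connection to (host, port) succeeds,
    or False if nothing is listening within `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False

# e.g. after launching `dagit -p 3000 &`:
# if not wait_for_port("localhost", 3000):
#     raise SystemExit("dagit never came up")
```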