Hi team, we ran a migration from 0.12.12 to 0.14.5...
# ask-community
a
Hi team, we ran a migration from 0.12.12 to 0.14.5 yesterday. We had a small hiccup regarding statement timeout while creating index on event logs table, however we were able fix the issue and run another successful migration. I am seeing some of the today's runs are failing with the following error. However, it looks like the actual ops with in the job are running fine, but seeing the below error at the end of the job run. Any thoughts on why this could happen?
Copy code
dagster.core.errors.DagsterInstanceMigrationRequired: Instance is out of date and must be migrated (Postgres event log storage requires migration). Database is at revision 9c5f00e80ef2, head is f4b6a4885876. Please run `dagster instance migrate`.

Original exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1820, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
cursor.execute(statement, parameters)
psycopg2.errors.LockNotAvailable: canceling statement due to lock timeout


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/dagster/core/storage/sql.py", line 62, in handle_schema_errors
yield
File "/usr/local/lib/python3.7/site-packages/dagster_postgres/utils.py", line 166, in create_pg_connection
yield conn
File "/usr/local/lib/python3.7/site-packages/dagster_postgres/event_log/event_log.py", line 153, in store_event
(res[0] + "_" + str(res[1]),),
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1296, in execute
future=False,
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1602, in _exec_driver_sql
distilled_parameters,
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1863, in _execute_context
e, statement, parameters, cursor, context
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2044, in _handle_dbapi_exception
sqlalchemy_exception, with_traceback=exc_info[2], from_=e
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
raise exception
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1820, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.LockNotAvailable) canceling statement due to lock timeout

[SQL: NOTIFY run_events, %s; ]
[parameters: ('da8a1351-3190-4f25-8b26-825ac7ed42c5_14158049',)]
(Background on this error at: <https://sqlalche.me/e/14/e3q8>)


  File "/usr/local/lib/python3.7/site-packages/dagster/core/execution/api.py", line 748, in pipeline_execution_iterator
    for event in pipeline_context.executor.execute(pipeline_context, execution_plan):
  File "/usr/local/lib/python3.7/site-packages/dagster/core/executor/multiprocess.py", line 244, in execute
    event_specific_data=EngineEventData.multiprocess(os.getpid()),
  File "/usr/local/lib/python3.7/site-packages/dagster/core/events/__init__.py", line 897, in engine_event
    step_handle=step_handle,
  File "/usr/local/lib/python3.7/site-packages/dagster/core/events/__init__.py", line 312, in from_pipeline
    log_pipeline_event(pipeline_context, event)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/events/__init__.py", line 221, in log_pipeline_event
    dagster_event=event,
  File "/usr/local/lib/python3.7/site-packages/dagster/core/log_manager.py", line 336, in log_dagster_event
    self.log(level=level, msg=msg, extra={DAGSTER_META_KEY: dagster_event})
  File "/usr/local/lib/python3.7/site-packages/dagster/core/log_manager.py", line 351, in log
    self._log(level, msg, args, **kwargs)
  File "/usr/local/lib/python3.7/logging/__init__.py", line 1514, in _log
    self.handle(record)
  File "/usr/local/lib/python3.7/logging/__init__.py", line 1524, in handle
    self.callHandlers(record)
  File "/usr/local/lib/python3.7/logging/__init__.py", line 1586, in callHandlers
    hdlr.handle(record)
  File "/usr/local/lib/python3.7/logging/__init__.py", line 894, in handle
    self.emit(record)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/log_manager.py", line 243, in emit
    handler.handle(dagster_record)
  File "/usr/local/lib/python3.7/logging/__init__.py", line 894, in handle
    self.emit(record)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/__init__.py", line 135, in emit
    self._instance.handle_new_event(event)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/__init__.py", line 1174, in handle_new_event
    self._event_storage.store_event(event)
  File "/usr/local/lib/python3.7/site-packages/dagster_postgres/event_log/event_log.py", line 153, in store_event
    (res[0] + "_" + str(res[1]),),
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/dagster_postgres/utils.py", line 166, in create_pg_connection
    yield conn
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/storage/sql.py", line 83, in handle_schema_errors
    ) from None
d
Hi Arun - that error looks to me like the
dagster instance migrate
call may not have actually finished? Not sure if that's the root cause of the bug - but
Database is at revision 9c5f00e80ef2, head is f4b6a4885876
implies to me that it didn't actually get all the way to the end of the schema migration
Do you have the output from when you ran the schema migration? I'm assuming you ran through some version of the guide here https://docs.dagster.io/deployment/guides/kubernetes/how-to-migrate-your-instance#overview
1
just a shot in the dark here - there's no chance these jobs were running while the schema migration was happening, is there?
a
Hi @daniel, unfortunately I don't have access to the migrate job logs anymore 😞 But I am quite sure it succeeded. So when we ran the migration, it failed for the first time ~with the above exact error related to timeout (~Correction: It failed with a different statement timeout error). Then we increased the timeout on the DB and ran the
dagster instance migrate
job again which eventually succeeded. Not sure how that error still appears in the job. Is it possible that DB went to an inconsistent state due to the first error and the second migration did not fix it?
Nope, we ran the migration yesterday night. These jobs started today
p
Are you using the same postgres db for all your storages? Can we check the alembic hash by stored in the DB:
Copy code
select * from alembic_version;
a
Yes, we use the same DB. Yes, I can get that
p
Actually, is it possible that some of your run workers are still on an old version?
a
I made a correction in my previous msg. During migration we did not see the above error, we saw a different statement timeout error which we fixed by increasing the timeout
@prha here is the alembic version
Copy code
version_num
--------------
 9c5f00e80ef2
(1 row)
p
The error message
Database is at revision 9c5f00e80ef2, head is f4b6a4885876
reflects that your DB is in the most recent known migration state as of
0.14.5
, but that the code raising the exception believes that the last known migration is one from
0.12.11
a
where is the expected revision coming from? Also, does the lock timeout error provides any clue?
p
The last known revision is marked as
head
, which is
f4b6a4885876
. In our sequence of migrations,
f4b6a4885876
is an earlier, older migration than the revision that your DB is currently marked at, which is
9c5f00e80ef2
. This tells me that you did in fact successfully run the schema migration.
The fact that the last known revision in the stack trace is
f4b6a4885876
tells me that the dagster version of the code raising the exception is older than the dagster version of the code that actually migrated the DB.
I’m not sure about the significance of the lock timeout error…
My best guess is that it’s transient, but not sure.
h
@daniel I thought that instance migrate error msg is a wrapper to all psql errors?
d
It wraps psql errors where it detects the alembic revision is different than what it thinks is the latest revision. prha’s theory about why it is firing here (the version of dagster running the code that is storing the event is slightly older so it has a different latest revision) makes sense to me as an explanation for why
h
I see...so the psql operation failed and resulted in a diff in rev.
p
yes, that’s my read on it…
h
We are still using 0.12.12 for our user code repo but I thought they are compatible with 14.5?
p
I do think that they are compatible… but it does mean that you will get some misleading error messages when some DB exceptions occur.
thankyou 1
1
d
It's also possible that despite being compatible, that upgrading the version there might help (the code should still work unchanged, but it's possible that if it's writing using the dagster libraries in 0.14.5 that it is more performant)
thankyou 1
1
h
Thanks folks!
a
Closing the loop here. Upgrading the repository to 0.14.5 fixed the issue 👍