Hi team we ran a migration from 0 12 12 to 0 14 5 yesterday dagster #ask-community

Hi team, we ran a migration from 0.12.12 to 0.14.5...

Arun Kumar

04/22/2022, 9:59 PM

Hi team, we ran a migration from 0.12.12 to 0.14.5 yesterday. We had a small hiccup regarding statement timeout while creating index on event logs table, however we were able fix the issue and run another successful migration. I am seeing some of the today's runs are failing with the following error. However, it looks like the actual ops with in the job are running fine, but seeing the below error at the end of the job run. Any thoughts on why this could happen?

Copy code

dagster.core.errors.DagsterInstanceMigrationRequired: Instance is out of date and must be migrated (Postgres event log storage requires migration). Database is at revision 9c5f00e80ef2, head is f4b6a4885876. Please run `dagster instance migrate`.

Original exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1820, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
cursor.execute(statement, parameters)
psycopg2.errors.LockNotAvailable: canceling statement due to lock timeout


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/dagster/core/storage/sql.py", line 62, in handle_schema_errors
yield
File "/usr/local/lib/python3.7/site-packages/dagster_postgres/utils.py", line 166, in create_pg_connection
yield conn
File "/usr/local/lib/python3.7/site-packages/dagster_postgres/event_log/event_log.py", line 153, in store_event
(res[0] + "_" + str(res[1]),),
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1296, in execute
future=False,
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1602, in _exec_driver_sql
distilled_parameters,
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1863, in _execute_context
e, statement, parameters, cursor, context
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2044, in _handle_dbapi_exception
sqlalchemy_exception, with_traceback=exc_info[2], from_=e
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
raise exception
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1820, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.LockNotAvailable) canceling statement due to lock timeout

[SQL: NOTIFY run_events, %s; ]
[parameters: ('da8a1351-3190-4f25-8b26-825ac7ed42c5_14158049',)]
(Background on this error at: <https://sqlalche.me/e/14/e3q8>)


  File "/usr/local/lib/python3.7/site-packages/dagster/core/execution/api.py", line 748, in pipeline_execution_iterator
    for event in pipeline_context.executor.execute(pipeline_context, execution_plan):
  File "/usr/local/lib/python3.7/site-packages/dagster/core/executor/multiprocess.py", line 244, in execute
    event_specific_data=EngineEventData.multiprocess(os.getpid()),
  File "/usr/local/lib/python3.7/site-packages/dagster/core/events/__init__.py", line 897, in engine_event
    step_handle=step_handle,
  File "/usr/local/lib/python3.7/site-packages/dagster/core/events/__init__.py", line 312, in from_pipeline
    log_pipeline_event(pipeline_context, event)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/events/__init__.py", line 221, in log_pipeline_event
    dagster_event=event,
  File "/usr/local/lib/python3.7/site-packages/dagster/core/log_manager.py", line 336, in log_dagster_event
    self.log(level=level, msg=msg, extra={DAGSTER_META_KEY: dagster_event})
  File "/usr/local/lib/python3.7/site-packages/dagster/core/log_manager.py", line 351, in log
    self._log(level, msg, args, **kwargs)
  File "/usr/local/lib/python3.7/logging/__init__.py", line 1514, in _log
    self.handle(record)
  File "/usr/local/lib/python3.7/logging/__init__.py", line 1524, in handle
    self.callHandlers(record)
  File "/usr/local/lib/python3.7/logging/__init__.py", line 1586, in callHandlers
    hdlr.handle(record)
  File "/usr/local/lib/python3.7/logging/__init__.py", line 894, in handle
    self.emit(record)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/log_manager.py", line 243, in emit
    handler.handle(dagster_record)
  File "/usr/local/lib/python3.7/logging/__init__.py", line 894, in handle
    self.emit(record)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/__init__.py", line 135, in emit
    self._instance.handle_new_event(event)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/__init__.py", line 1174, in handle_new_event
    self._event_storage.store_event(event)
  File "/usr/local/lib/python3.7/site-packages/dagster_postgres/event_log/event_log.py", line 153, in store_event
    (res[0] + "_" + str(res[1]),),
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/dagster_postgres/utils.py", line 166, in create_pg_connection
    yield conn
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/storage/sql.py", line 83, in handle_schema_errors
    ) from None

daniel

04/22/2022, 10:05 PM

Hi Arun - that error looks to me like the

dagster instance migrate

call may not have actually finished? Not sure if that's the root cause of the bug - but

Database is at revision 9c5f00e80ef2, head is f4b6a4885876

implies to me that it didn't actually get all the way to the end of the schema migration

daniel

04/22/2022, 10:08 PM

Do you have the output from when you ran the schema migration? I'm assuming you ran through some version of the guide here https://docs.dagster.io/deployment/guides/kubernetes/how-to-migrate-your-instance#overview

✅ 1

daniel

04/22/2022, 10:09 PM

just a shot in the dark here - there's no chance these jobs were running while the schema migration was happening, is there?

Arun Kumar

04/22/2022, 10:34 PM

Hi @daniel, unfortunately I don't have access to the migrate job logs anymore 😞 But I am quite sure it succeeded. So when we ran the migration, it failed for the first time ~with the above exact error related to timeout (~Correction: It failed with a different statement timeout error). Then we increased the timeout on the DB and ran the

dagster instance migrate

job again which eventually succeeded. Not sure how that error still appears in the job. Is it possible that DB went to an inconsistent state due to the first error and the second migration did not fix it?

Arun Kumar

04/22/2022, 10:34 PM

Nope, we ran the migration yesterday night. These jobs started today

prha

04/22/2022, 10:41 PM

Are you using the same postgres db for all your storages? Can we check the alembic hash by stored in the DB:

Copy code

select * from alembic_version;

Arun Kumar

04/22/2022, 10:44 PM

Yes, we use the same DB. Yes, I can get that

prha

04/22/2022, 10:45 PM

Actually, is it possible that some of your run workers are still on an old version?

Arun Kumar

04/22/2022, 10:45 PM

I made a correction in my previous msg. During migration we did not see the above error, we saw a different statement timeout error which we fixed by increasing the timeout

Arun Kumar

04/22/2022, 10:46 PM

@prha here is the alembic version

Copy code

version_num
--------------
 9c5f00e80ef2
(1 row)

prha

04/22/2022, 10:47 PM

The error message

Database is at revision 9c5f00e80ef2, head is f4b6a4885876

reflects that your DB is in the most recent known migration state as of

0.14.5

, but that the code raising the exception believes that the last known migration is one from

0.12.11

Arun Kumar

04/22/2022, 10:54 PM

where is the expected revision coming from? Also, does the lock timeout error provides any clue?

prha

04/22/2022, 10:57 PM

The last known revision is marked as

head

, which is

f4b6a4885876

. In our sequence of migrations,

f4b6a4885876

is an earlier, older migration than the revision that your DB is currently marked at, which is

9c5f00e80ef2

. This tells me that you did in fact successfully run the schema migration.

prha

04/22/2022, 10:58 PM

The fact that the last known revision in the stack trace is

f4b6a4885876

tells me that the dagster version of the code raising the exception is older than the dagster version of the code that actually migrated the DB.

prha

04/22/2022, 10:59 PM

I’m not sure about the significance of the lock timeout error…

prha

04/22/2022, 10:59 PM

My best guess is that it’s transient, but not sure.

Hebo Yang

04/22/2022, 11:26 PM

@daniel I thought that instance migrate error msg is a wrapper to all psql errors?

daniel

04/22/2022, 11:29 PM

It wraps psql errors where it detects the alembic revision is different than what it thinks is the latest revision. prha’s theory about why it is firing here (the version of dagster running the code that is storing the event is slightly older so it has a different latest revision) makes sense to me as an explanation for why

Hebo Yang

04/22/2022, 11:33 PM

I see...so the psql operation failed and resulted in a diff in rev.

prha

04/22/2022, 11:34 PM

yes, that’s my read on it…

Hebo Yang

04/22/2022, 11:34 PM

We are still using 0.12.12 for our user code repo but I thought they are compatible with 14.5?

prha

04/22/2022, 11:37 PM

I do think that they are compatible… but it does mean that you will get some misleading error messages when some DB exceptions occur.

thankyou 1

✅ 1

daniel

04/22/2022, 11:38 PM

It's also possible that despite being compatible, that upgrading the version there might help (the code should still work unchanged, but it's possible that if it's writing using the dagster libraries in 0.14.5 that it is more performant)

thankyou 1

✅ 1

Hebo Yang

04/22/2022, 11:42 PM

Thanks folks!

Arun Kumar

04/29/2022, 2:08 AM

Closing the loop here. Upgrading the repository to 0.14.5 fixed the issue 👍

4 Views

Open in Slack

Previous Next