
Rohan Prasad

01/10/2023, 7:44 PM
Hi, we recently saw this error: DagsterExecutionInterruptedError in a few of our jobs … I was trying to find what this means in the docs but can't seem to find it. Would anyone be able to help in terms of how to resolve it? I found this on GH: https://github.com/dagster-io/dagster/blob/1.0.17/python_modules/dagster/dagster/_core/execution/plan/utils.py#L84-L94 which, if I'm reading it correctly, means that this error will get thrown if there's no retry_policy? CC: @Phil Armour
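For context, a minimal sketch of what an op-level retry_policy looks like (hypothetical op and job names; RetryPolicy and the retry_policy argument are standard dagster APIs), since that's what the linked code checks for:

# Minimal sketch: attach a retry policy to an op so step failures are retried in-process.
# Op/job names are hypothetical.
from dagster import RetryPolicy, job, op

@op(retry_policy=RetryPolicy(max_retries=3, delay=10))  # retry this op up to 3 times, 10 seconds apart
def flaky_op():
    ...

@job
def my_job():
    flaky_op()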

sandy

01/10/2023, 9:14 PM
Hi Rohan - this happens if the process executing the job receives an interrupt signal. Is it possible that something in your environment is interrupting your runs? E.g. Kubernetes shutting down pods?

Rohan Prasad

01/10/2023, 9:24 PM
hmm - I’ll try and take a look
++ @Stanley Yang to this thread
👀 1

Phil Armour

01/10/2023, 9:30 PM
assuming we have Run Retries configured at the Daemon - shouldn’t these get retried?

sandy

01/10/2023, 9:31 PM
@johann - mind chiming in here on run retries?

johann

01/10/2023, 9:34 PM
To clarify- how do you have Dagster deployed? Do you have run retries enabled? Do you see any events regarding a retry at the bottom of the failed run?

Rohan Prasad

01/10/2023, 9:36 PM
I believe so - but we didn’t see any log messages indicating a retry (gonna re-check here). also, adding @Caio Tavares to this thread

Caio Tavares

01/10/2023, 9:39 PM
@johann Dagster is running on GKE, deployed with the official Helm chart, version 1.0.17.
Right after the interrupt error there is a log message from the engine:
Ignoring a duplicate run that was started from somewhere other than the run monitor daemon
and the run didn't continue from there.
runRetries was enabled here:
dagsterDaemon:
    image:
      tag: 1.0.17
    runMonitoring:
      enabled: true
      # Temporary workaround provided by Dagster Support. Revisit this later on.
      maxResumeRunAttempts: -1
    runRetries:
      enabled: true
      maxRetries: 3
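For reference, a minimal sketch of overriding the retry count for a single job via the dagster/max_retries run tag (hypothetical op/job names; assuming the tag-based override is available in this version):

# Minimal sketch: override the daemon-level maxRetries for one job using run tags.
# Names are hypothetical; assumes the dagster/max_retries tag is honored by the run-retries daemon.
from dagster import job, op

@op
def do_work():
    ...

@job(tags={"dagster/max_retries": "3"})  # failed runs of this job are retried up to 3 times
def nightly_job():
    do_work()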

johann

01/10/2023, 9:40 PM
Ah- is the run getting marked as a failure in Dagster or is it hanging in Started status?

Rohan Prasad

01/10/2023, 9:40 PM
it’s marked as a failure

Caio Tavares

01/10/2023, 9:40 PM
it's getting marked as failure.

johann

01/10/2023, 9:44 PM
Do you see any errors in the logs from the Daemon pod? Lines mentioning EventLogConsumerDaemon in particular are relevant when retries aren't starting.

Caio Tavares

01/10/2023, 9:45 PM
hm yes
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "kvs" does not exist
the error above is all over the place, followed by EventLogConsumerDaemon

johann

01/10/2023, 9:47 PM
This will be solved by migrating your Dagster instance: https://docs.dagster.io/deployment/guides/kubernetes/how-to-migrate-your-instance

Caio Tavares

01/10/2023, 9:49 PM
Is that a requirement after Dagster upgrades?

johann

01/10/2023, 9:49 PM
Run retries were added in 0.15.0. If your database was initialized before this, it doesn't have the kvs table. The guide above will migrate the DB to add the new table.
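For a quick check, a minimal sketch (assuming direct SQL access to the storage database; the connection URL is a placeholder) to verify whether the kvs table exists before and after the migration:

# Minimal sketch: check whether the Dagster storage DB already has the "kvs" table.
# The connection URL below is a placeholder for your Postgres instance.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://dagster:password@dagster-postgresql:5432/dagster")
print("kvs" in inspect(engine).get_table_names())  # False means the migration hasn't been applied yet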

Caio Tavares

01/10/2023, 9:50 PM
interesting, that's great info. I will read through the doc and execute the steps.
💪 2

Phil Armour

01/10/2023, 9:51 PM
Thanks @johann and @sandy !!

Caio Tavares

01/10/2023, 9:52 PM
much appreciated!

Rohan Prasad

01/10/2023, 9:52 PM
thank you!!

Caio Tavares

01/11/2023, 2:40 PM
@johann We ran the DB migration last night, but as soon as the db-migration job completed, the daemon triggered a bunch of jobs and kept retrying them without incrementing the retry_number. Even when we cancelled the runs, the daemon would retry them again over and over. So we got to a point where we had to stop the daemon, and to unblock our engineers we pointed Dagster to a fresh DB, but that means we "lost" all the historical metadata. One thing to mention: we have been using Dagster since the very early versions and, as far as I know, we've never done any DB migrations, so the schema could be way behind. What would you recommend in this case?

johann

01/11/2023, 3:20 PM
Apologies for the mess! In terms of returning to the previous state, with the daemon off, could you point your engineers back at the main DB? In the meantime I'm going to try to reproduce your setup and suss this out.

Caio Tavares

01/11/2023, 3:23 PM
I backed up the main DB prior to the migration but haven't restored it yet. I believe we started on Dagster 0.12, so the DB upgrade probably went straight from 0.12 to 1.0.17 (current version).
If there are any specific logs you want to look at, let me know and I can try to get them.
👍 1

johann

01/12/2023, 7:30 PM
Update here: making a few fixes that will go live in the release next week. The first one is to only retry runs that fail after the daemon is enabled. Next I’m going to investigate the retry count issue.

Caio Tavares

01/12/2023, 7:31 PM
Thank you for the update @johann. Does that mean if we upgrade to the latest (once the fix is released) then we will be able to re-run the migration in the "old" database where we have all the historical metadata?

johann

01/12/2023, 8:21 PM
Yes that should work
:dagster-spin: 1

Caio Tavares

01/12/2023, 8:23 PM
awesome, thank you Johann! We'll keep an eye on the next release then.

johann

01/12/2023, 10:43 PM
So far I haven't been able to reproduce any bugs with incrementing the retry_number. I'm wondering if this explains what you saw, or if there was something additional going on: when you flipped on run retries, the daemon started from the beginning of your DB and began applying maxRetries: 3 to every failed run (very silly behavior). All of these retries would have been retry_number: 1, since each was retrying a different original failure. Eventually, if it caught up to the latest runs, it would have retried any of those that failed with retry numbers 2 and 3.
As a heads up, you can see which failure caused a retry in the page here:
As well as at the start of the event log for the runs

Caio Tavares

01/13/2023, 8:04 PM
@johann is there a particular string I can search for in the logs to check the retry_number count? The behavior I saw was that all jobs kept retrying, but the retry_count never increased beyond 1, and it was just in this infinite loop until I terminated the daemon.

johann

01/23/2023, 5:15 PM
Apologies, just getting back from PTO. We don’t currently log it, but it’s in the run tags.
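If it helps, a minimal sketch of pulling those tags with the Dagster Python API (assuming it runs with DAGSTER_HOME pointed at the deployment's instance config, and the standard dagster/retry_number and dagster/parent_run_id tags):

# Minimal sketch: list recent runs and their retry-related tags.
# Assumes DAGSTER_HOME points at the deployment's instance config.
from dagster import DagsterInstance

instance = DagsterInstance.get()
for run in instance.get_runs(limit=20):
    print(
        run.run_id,
        run.status,
        run.tags.get("dagster/retry_number"),   # which retry attempt this run is (absent on the original run)
        run.tags.get("dagster/parent_run_id"),  # the run this one retried, if any
    )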

Caio Tavares

01/25/2023, 3:44 PM
hey @johann no worries. Do you know if the fix for the DB migration is already available in the latest version?

johann

01/25/2023, 4:30 PM
The fix landed, yes

Caio Tavares

01/25/2023, 4:30 PM
Is it in 1.1.13 (core)?

johann

01/25/2023, 4:31 PM
It landed first in 1.1.11, so yes it’s also included in 1.1.13

Caio Tavares

01/25/2023, 4:31 PM
gr8, we'll give it a try. Thank you again!
👍 1