# ask-community
r
Hi, we recently saw this error: `DagsterExecutionInterruptedError` in a few of our jobs … I was trying to find what this means in the docs but can't seem to find it. Would anyone be able to help with how to resolve this? I found this on GH: https://github.com/dagster-io/dagster/blob/1.0.17/python_modules/dagster/dagster/_core/execution/plan/utils.py#L84-L94 which, if I'm reading it correctly, means this error gets thrown if there's no `retry_policy`? CC: @Phil Armour
s
Hi Rohan - this happens if the process executing the job receives an interrupt signal. Is it possible that something in your environment is interrupting your runs? E.g. Kubernetes shutting down pods?
r
hmm - I’ll try and take a look
++ @Stanley Yang to this thread
👀 1
p
assuming we have Run Retries configured at the Daemon - shouldn’t these get retried?
s
@johann - mind chiming in here on run retries?
j
To clarify: how do you have Dagster deployed? Do you have run retries enabled? Do you see any events regarding a retry at the bottom of the failed run?
r
I believe so - but we didn’t see any log messages indicating a retry (gonna re-check here). also, adding @Caio Tavares to this thread
c
@johann Dagster is running on GKE, deployed with the official Helm chart, version `1.0.17`. Right after the interrupt error there is a log message from the engine:
`Ignoring a duplicate run that was started from somewhere other than the run monitor daemon`
and the run didn't continue from there.
`runRetries` was enabled here:
```yaml
dagsterDaemon:
    image:
      tag: 1.0.17
    runMonitoring:
      enabled: true
      # Temporary workaround provided by Dagster Support. Revisit this later on.
      maxResumeRunAttempts: -1
    runRetries:
      enabled: true
      maxRetries: 3
```
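(For reference, a minimal sketch of roughly the same settings expressed directly in `dagster.yaml`, for deployments that don't go through the Helm chart. The key names below are assumed from how the Helm values map onto instance config; verify them against the Dagster docs for your version.)

```yaml
# Sketch of an assumed dagster.yaml equivalent of the Helm values above.
run_monitoring:
  enabled: true
  # Mirrors the maxResumeRunAttempts workaround above.
  max_resume_run_attempts: -1

run_retries:
  enabled: true
  max_retries: 3
```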
j
Ah- is the run getting marked as a failure in Dagster or is it hanging in Started status?
r
it’s marked as a failure
c
it's getting marked as failure.
j
Do you see any errors in the logs from the daemon pod? Lines mentioning `EventLogConsumerDaemon` in particular are relevant for retries not starting.
c
hm yes
```
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "kvs" does not exist
```
this error above is all over the place, followed by `EventLogConsumerDaemon`
j
This will be solved by migrating your Dagster instance: https://docs.dagster.io/deployment/guides/kubernetes/how-to-migrate-your-instance
c
Is that a requirement after Dagster upgrades?
j
Run retries were added in 0.15.0. If your database was initialized before this, it doesn't have the `kvs` table. The guide above will migrate the DB to add the new table.
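(A hedged sketch of what running that migration can look like with the official Helm chart: the chart exposes a flag that spins up a one-off database-migration Job during the upgrade. Treat the value name below as an assumption and confirm it against the linked guide for your chart version.)

```yaml
# Assumed values.yaml fragment: when enabled, the chart runs a Job that
# executes the instance migration (equivalent to "dagster instance migrate")
# against the configured Postgres database during the helm upgrade.
migrate:
  enabled: true
```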
c
interesting, that's great info. I will read through the doc and execute the steps.
💪 2
p
Thanks @johann and @sandy !!
c
much appreciated!
r
thank you!!
c
@johann We ran the DB migration last night, but as soon as the db-migration job completed, the daemon triggered a bunch of jobs and kept retrying them without incrementing the retry_number. Even if we cancelled the runs, the daemon would retry them again over and over. So we got to a point where we had to stop the daemon, and to unblock our engineers we pointed Dagster at a fresh DB, but that means we "lost" all the historical metadata. One thing to mention: we've been using Dagster since the very early versions and, as far as I know, we've never done any DB migrations, so the schema could be way behind. What would you recommend in this case?
j
Apologies for the mess! In terms of returning to the previous state: with the daemon off, could you point your engineers back at the main DB? In the meantime I'm going to try to reproduce your setup and suss this out.
c
I backed up the main DB prior to the migration but haven't restored it yet. I believe we started on Dagster 0.12, so the DB upgrade probably went straight from 0.12 to 1.0.17 (the current version).
If there are any specific logs you want to look at, let me know and I can try to get them.
👍 1
j
Update here: making a few fixes that will go live in the release next week. The first one is to only retry runs that fail after the daemon is enabled. Next I’m going to investigate the retry count issue.
c
Thank you for the update @johann. Does that mean if we upgrade to the latest (once the fix is released) then we will be able to re-run the migration in the "old" database where we have all the historical metadata?
j
Yes that should work
dagster spin 1
c
awesome, thank you Johann! We'll keep an eye on the next release then.
j
So far I haven't been able to reproduce any bugs with incrementing the retry_number. I'm wondering if this explains what you saw, or if there was something additional going on: when you flipped on run retries, the daemon started from the beginning of your DB and began applying `maxRetries: 3` to every failed run (very silly behavior). All of those retries would have been `retry_number: 1`, since each was retrying a different failure. Eventually, once it caught up to the latest runs, it would have retried any of those that failed with retry numbers 2 and 3.
As a heads up, you can see which failure caused a retry on the run's page, as well as at the start of the event log for the run.
c
@johann is there a particular string I can search for in the logs to check the retry_number count? The behavior I saw was that all jobs kept retrying, but the retry count never increased past 1, and it was just in an infinite loop until I terminated the daemon.
j
Apologies, just getting back from PTO. We don’t currently log it, but it’s in the run tags.
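(For anyone following along, a sketch of the retry-related tags as they might appear on a retried run. The tag keys here are assumptions based on Dagster's conventions; check the Tags section of a retried run in the UI to confirm.)

```yaml
# Assumed tag keys on a run launched by the retry daemon:
dagster/retry_number: "1"                 # which retry attempt this run is
dagster/parent_run_id: "<failed run id>"  # the run whose failure triggered it
```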
c
hey @johann no worries. Do you know if the fix for the DB migration is already available in the latest version?
j
The fix landed, yes
c
Is it in `1.1.13` (core)?
j
It landed first in 1.1.11, so yes it’s also included in 1.1.13
c
gr8, we'll give it a try. Thank you again!
👍 1