Hey all, ever since updating to 0.11.13 I've been getting lots of these errors (dagster #ask-community)

Kirk Stennett

10/01/2021, 5:24 PM

Hey all, ever since updating to 0.11.13 I've been getting lots of these errors:

An exception was thrown during execution that is likely a framework error, rather than an error in user code.
dagster.check.CheckError: Invariant failed. Description: Pipeline run cleanup (e44bf39c-e57e-4318-ac8f-1f7c62667a40) in state PipelineRunStatus.STARTED, expected NOT_STARTED or STARTING.

Any idea what would be causing this? I'm running this on a k8s cluster made up of spot instances, and from what I've seen, nodes going down frequently might be a cause. But on 0.11.12 it was failing less often. I also switched from celery_k8s_exec to just the k8s_job_exec around the same time.

Kirk Stennett

10/01/2021, 5:32 PM

It also looks like the scheduler spins up new pods when this happens to compensate for the old ones being removed. Is there any tooling in future versions that would support dagster recognizing these new pods?

johann

10/01/2021, 5:46 PM

@Kirk Stennett we’re aware of this issue and I’m actively working on improving our tolerance here. We currently break if the run worker process (which manages the run) goes down, and starting a new one logs the error you’re seeing

Kirk Stennett

10/01/2021, 5:49 PM

Any issue that I could follow along with / provide background on? @johann

johann

10/01/2021, 5:50 PM

I believe it would be a coincidence that this picked up in frequency with the upgrade

Kirk Stennett

10/01/2021, 5:50 PM

Agreed, I think this is just randomness with bidding. I also don't have full access to the cluster, so it could be a few things. I'm pretty sure it's just coincidence.

johann

10/01/2021, 5:54 PM

"Any issue that I could follow along with"

There is now! https://github.com/dagster-io/dagster/issues/4993
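
The failure mode described in this thread (a run worker dies with its node, and a replacement worker finds the run already marked STARTED) can be sketched in plain Python. This is only an illustrative model, not Dagster's actual code: `RunStatus`, `Run`, `InvariantError`, `start_run_worker`, and `simulate_node_loss` are hypothetical names invented for this example, and the only detail taken from the thread is the state check quoted in the error message.

```python
from dataclasses import dataclass
from enum import Enum


class RunStatus(Enum):
    # Only the states named in the error message are modelled here.
    NOT_STARTED = "NOT_STARTED"
    STARTING = "STARTING"
    STARTED = "STARTED"


class InvariantError(Exception):
    """Hypothetical stand-in for the framework's check error."""


@dataclass
class Run:
    run_id: str
    status: RunStatus = RunStatus.NOT_STARTED


def start_run_worker(run: Run) -> None:
    """Model a run worker picking up a run (hypothetical helper).

    The worker only accepts runs that have not begun executing yet.
    """
    allowed = (RunStatus.NOT_STARTED, RunStatus.STARTING)
    if run.status not in allowed:
        raise InvariantError(
            f"Run {run.run_id} in state {run.status.name}, "
            "expected NOT_STARTED or STARTING."
        )
    run.status = RunStatus.STARTED


def simulate_node_loss() -> str:
    """Start a run, lose the worker, then start a replacement worker."""
    run = Run(run_id="example-run")
    start_run_worker(run)  # first worker marks the run STARTED
    # The node is reclaimed here: the worker is gone, but the stored
    # status is still STARTED because nothing reset it.
    try:
        start_run_worker(run)  # replacement pod tries to pick up the run
    except InvariantError as exc:
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(simulate_node_loss())
```

Under these assumptions, the second call fails because the stored status was never reset, which matches the shape of the message quoted at the top of the thread.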
