# ask-community
k
Hey all, ever since updating to 0.11.13 I've been getting lots of these errors:
An exception was thrown during execution that is likely a framework error, rather than an error in user code.
dagster.check.CheckError: Invariant failed. Description: Pipeline run cleanup (e44bf39c-e57e-4318-ac8f-1f7c62667a40) in state PipelineRunStatus.STARTED, expected NOT_STARTED or STARTING.
Any idea what would be causing this? I'm running this on a k8s cluster made up of spot instances, and from what I've seen that might be a cause if the nodes are going down frequently. But on 0.11.12 it was failing less often. I also switched from celery_k8s_exec to just the k8s_job_exec during this time.
It also looks like the scheduler spins up new pods when this happens to compensate for the old ones being removed. Is there any tooling in future versions that would support dagster recognizing these new pods?
j
@Kirk Stennett we’re aware of this issue and I’m actively working on improving our tolerance here. We currently break if the run worker process (which manages the run) goes down, and starting a new one logs the error you’re seeing
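To illustrate why a replacement run worker trips this check, here is a minimal sketch. It is not dagster's actual code: `RunStatus`, `InvariantError`, and `start_run_worker` are hypothetical stand-ins for `PipelineRunStatus`, `dagster.check.CheckError`, and whatever the run worker does on startup. The assumption is only that a new worker expects the run to still be in NOT_STARTED or STARTING.

```python
from enum import Enum


class RunStatus(Enum):
    # Hypothetical stand-in for dagster's PipelineRunStatus, not the real class.
    NOT_STARTED = "NOT_STARTED"
    STARTING = "STARTING"
    STARTED = "STARTED"


class InvariantError(Exception):
    """Hypothetical stand-in for dagster.check.CheckError."""


def start_run_worker(run_id: str, status: RunStatus) -> RunStatus:
    # Assumed behavior: a worker only picks up a run that has not begun executing.
    if status not in (RunStatus.NOT_STARTED, RunStatus.STARTING):
        raise InvariantError(
            f"Pipeline run {run_id} in state {status.name}, "
            "expected NOT_STARTED or STARTING."
        )
    return RunStatus.STARTED


# The first worker starts the run normally.
status = start_run_worker("example-run", RunStatus.STARTING)

# The spot node is reclaimed and a replacement worker is started for the
# same run, which is already recorded as STARTED, so the check fails.
try:
    start_run_worker("example-run", status)
except InvariantError as err:
    message = str(err)
```

In this sketch the second call fails because the stored run state was never reset when the first worker went away, which matches the shape of the error in the question.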
k
Any issue that I could follow along with / provide background on? @johann
j
I believe it would be a coincidence that this picked up in frequency with the upgrade
k
Agreed, I think this is just randomness with spot bidding. I also don't have full access to the cluster, so it could be a few things. I'm pretty sure it's just a coincidence.
j
Any issue that I could follow along with
There is now! https://github.com/dagster-io/dagster/issues/4993