Hey there! We have been seeing several cases in which jobs die before they come up and then stay stuck in a running state for a long time. In a few other cases we can't see the job in Dagit, but there is a job in k8s that is still active and continues to run. What are the current best practices for stopping these?

Hi Pablo - the feature in Dagster that's intended to detect these hanging runs and move them into a failure state is here: <https://docs.dagster.io/deployment/run-monitoring#run-monitoring>
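
For reference, a minimal sketch of what turning run monitoring on might look like in the instance's `dagster.yaml`. The option names are the ones I recall from the linked docs page, and the timeout values are illustrative assumptions rather than recommendations, so check both against the docs for your Dagster version:

```yaml
# dagster.yaml (instance config) - sketch only, values are assumptions
run_monitoring:
  enabled: true
  # Fail runs that have not started within this many seconds
  start_timeout_seconds: 180
  # How often the daemon checks on in-progress runs
  poll_interval_seconds: 120
```

Run monitoring is carried out by the Dagster daemon, so the daemon needs to be running for hung runs to be detected and moved into a failure state.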

Is there anything that would catch jobs that are running in k8s but don't show up in Dagit?

I'm not totally clear how that would happen - can you share more details / reproduction steps?