Hello, we're running into an issue where a job op has been hanging on start for 200+ hours. Who should I talk to about this?

For some context, it fails at the launching step, before it even executes any of our resource code.

Hi Sean, can you share a link to a run where this is happening in Dagster Cloud?

<https://apella.dagster.cloud/prod/runs/d5b35710-74fe-46c2-94e0-d6df073fa25d?logFileKey=mjobhoio>

If a particular op is hanging and you're not sure why, running py-spy dump against the relevant process is one way to get a stack dump of each thread. That can be a bit involved in Kubernetes, but I added a guide here with some steps you can take to set it up: <https://github.com/dagster-io/dagster/discussions/14771#discussioncomment-6165783>
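
For reference, here is a minimal sketch of what the py-spy step could look like once you have a shell in the same container as the hung process. It assumes py-spy is installed there and that the container is allowed to ptrace the target (in Kubernetes this usually means granting the SYS_PTRACE capability). The helper names `build_dump_command` and `dump_threads` are hypothetical, made up for this illustration, and are not part of Dagster or py-spy.

```python
import shutil
import subprocess


def build_dump_command(pid: int) -> list[str]:
    """Build the py-spy invocation that prints the stack of every thread in `pid`."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    # `py-spy dump --pid <PID>` is the documented form of the dump subcommand.
    return ["py-spy", "dump", "--pid", str(pid)]


def dump_threads(pid: int, timeout_s: float = 30.0) -> str:
    """Run py-spy against `pid` and return its stdout (the thread stacks).

    Hypothetical helper: raises RuntimeError if py-spy is not on PATH, and
    subprocess.CalledProcessError if py-spy exits non-zero (for example,
    when the container lacks permission to ptrace the target process).
    """
    if shutil.which("py-spy") is None:
        raise RuntimeError("py-spy not found on PATH; try `pip install py-spy`")
    result = subprocess.run(
        build_dump_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return result.stdout
```

Usage would be something like `print(dump_threads(12345))`, where 12345 is a placeholder for the PID of the hung step process. The stack that every dump shows stuck at the same frame is usually the place to start looking.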