05/03/2023, 3:01 AM
I'm having trouble getting a job to run that was running fine a couple hours ago, with no code changes made. The job runs its first step just fine. The first op fans out to a handful of downstream ops (in this case, 3 downstream ops are generated), but all the downstream ops just hang on STEP_WORKER_STARTING. To make it more interesting, the runs can't be terminated: when I hit Terminate I see a log saying
[DagsterCloudAgent] Agent d91b05f0 received request to terminate run
, but the run doesn't actually terminate. It won't give me the option to Force Terminate either. Not able to reproduce it in a local Dagit instance. Also tried redeploying; still not working. I'm not seeing the issue in other jobs that use
Hmm, so I can't even find the ECS task corresponding to the ARN that the CloudEcsRunLauncher spits out. It seems like it might be silently dying while the subprocesses are being launched.
Found them as stopped tasks, even though the corresponding Dagster job thinks it's still spinning up subprocesses for the DynamicOutputs. They have generic
Essential container in task exited
messages, and the CloudWatch logs just show the last DagsterEvent: `Launching subprocess for...`
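For anyone hitting the same thing, here's roughly how the stopped tasks can be checked. This is a sketch, not the exact commands I ran: it assumes the ECS `DescribeTasks` response shape (e.g. from boto3's `ecs.describe_tasks`), and the `sample` payload, ARNs, and container name below are made up. An OOM kill usually surfaces in the per-container `reason` field even when the task-level `stoppedReason` is only the generic "Essential container in task exited":

```python
def find_oom_containers(describe_tasks_response):
    """Return (taskArn, reason) pairs for containers that look OOM-killed."""
    hits = []
    for task in describe_tasks_response.get("tasks", []):
        for container in task.get("containers", []):
            reason = container.get("reason", "")
            if "OutOfMemory" in reason:
                hits.append({"taskArn": task["taskArn"], "reason": reason})
    return hits


# Hypothetical DescribeTasks response, trimmed to the relevant fields:
sample = {
    "tasks": [
        {
            "taskArn": "arn:aws:ecs:us-east-1:123456789012:task/demo/abc123",
            "stoppedReason": "Essential container in task exited",
            "containers": [
                {
                    "name": "run",
                    "reason": "OutOfMemoryError: Container killed due to memory usage",
                }
            ],
        }
    ]
}

for hit in find_oom_containers(sample):
    print(hit["taskArn"], "->", hit["reason"])
```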
ahh okay sweet, I got it - looks like it must've been an OOM. Just increased the memory on the job, and it seems to have pushed through the subprocess creation for the dynamic ops. Odd that it was completely silent, though - I didn't even see any indication in the ECS logs. Still not sure how to get these jobs to cancel, though.
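A minimal sketch of the per-job fix, assuming the Dagster ECS agent honors `ecs/cpu` / `ecs/memory` run tags (values are CPU units and MiB, and on Fargate the pair must be one of the supported combinations); the job name and tag values here are placeholders, not my actual config:

```python
# Hypothetical per-job ECS task sizing via run tags, so only this job's
# task gets more memory instead of bumping the whole deployment.
ECS_SIZE_TAGS = {
    "ecs/cpu": "1024",     # 1 vCPU, in CPU units
    "ecs/memory": "8192",  # 8 GiB, in MiB
}

# In the job definition these tags would be applied as:
#   @job(tags=ECS_SIZE_TAGS)
#   def my_dynamic_job(): ...
print(ECS_SIZE_TAGS)
```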


05/03/2023, 3:26 PM
> not sure how to get these jobs to cancel though.
did you check the “force” toggle when terminating?


05/03/2023, 4:56 PM
the "force" checkbox wasn't showing up, which was particularly odd
I should've taken a screenshot; I hope I'm not just misremembering and overlooked the checkbox in my rush to debug