05/03/2023, 3:01 AM
I'm having trouble getting a job to run that was running fine a couple hours ago, with no code changes made. The job runs its first step just fine. The first op fans out to a handful of downstream ops (in this case, 3 downstream ops are generated), but all the downstream ops just hang on STEP_WORKER_STARTING. To make it more interesting, the runs can't be terminated: when I hit Terminate I see a log saying
[DagsterCloudAgent] Agent d91b05f0 received request to terminate run
, but the run doesn't actually terminate. It won't give me the option to Force Terminate either. Not able to reproduce it in a local Dagit instance. Also tried redeploying; still not working. I'm not seeing the issue in other jobs that use
Hmm, so I can't even find the ECS task corresponding to the ARN that the CloudEcsRunLauncher spits out. It seems like it might be silently dying while the subprocesses are being launched.
Found them as stopped tasks, even though the corresponding Dagster job thinks it's still spinning up subprocesses for the DynamicOutputs. They have generic
Essential container in task exited
messages, and the CloudWatch logs just show the last DagsterEvent: `Launching subprocess for...`
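For anyone hitting the same thing, here's roughly how the stopped tasks can be checked. This is a sketch, not the exact commands I ran: it assumes the ECS `DescribeTasks` response shape (e.g. from boto3's `ecs.describe_tasks`), and the `sample` payload, ARNs, and container name below are made up. An OOM kill usually surfaces in the per-container `reason` field even when the task-level `stoppedReason` is only the generic "Essential container in task exited":

```python
def find_oom_containers(describe_tasks_response):
    """Return (taskArn, reason) pairs for containers that look OOM-killed."""
    hits = []
    for task in describe_tasks_response.get("tasks", []):
        for container in task.get("containers", []):
            reason = container.get("reason", "")
            if "OutOfMemory" in reason:
                hits.append({"taskArn": task["taskArn"], "reason": reason})
    return hits


# Hypothetical DescribeTasks response, trimmed to the relevant fields:
sample = {
    "tasks": [
        {
            "taskArn": "arn:aws:ecs:us-east-1:123456789012:task/demo/abc123",
            "stoppedReason": "Essential container in task exited",
            "containers": [
                {
                    "name": "run",
                    "reason": "OutOfMemoryError: Container killed due to memory usage",
                }
            ],
        }
    ]
}

for hit in find_oom_containers(sample):
    print(hit["taskArn"], "->", hit["reason"])
```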
ahh okay sweet, I got it - looks like it must've been an OOM. Just increased the memory on the job, and it seems to have pushed through the subprocess creation for the dynamic ops. Odd that it was completely silent, though - I didn't even see any indication in the ECS logs. Still not sure how to get these jobs to cancel, though.
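A minimal sketch of the per-job fix, assuming the Dagster ECS agent honors `ecs/cpu` / `ecs/memory` run tags (values are CPU units and MiB, and on Fargate the pair must be one of the supported combinations); the job name and tag values here are placeholders, not my actual config:

```python
# Hypothetical per-job ECS task sizing via run tags, so only this job's
# task gets more memory instead of bumping the whole deployment.
ECS_SIZE_TAGS = {
    "ecs/cpu": "1024",     # 1 vCPU, in CPU units
    "ecs/memory": "8192",  # 8 GiB, in MiB
}

# In the job definition these tags would be applied as:
#   @job(tags=ECS_SIZE_TAGS)
#   def my_dynamic_job(): ...
print(ECS_SIZE_TAGS)
```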


05/03/2023, 3:26 PM
> not sure how to get these jobs to cancel though.
did you check the “force” toggle when terminating?


05/03/2023, 4:56 PM
the "force" checkbox wasn't showing up, which was particularly odd
I should've taken a screenshot; I hope I'm not just misremembering and overlooked the checkbox in my rush to debug