# ask-community
m
Hello fellow Dagsterers, I have a job which takes a large list of IDs. This list is chunked into N smaller sublists (where N = number of executors, to allow for some parallelization). Each of these sublists is passed to the N dynamically spawned ops. Each dynamically spawned op simply iterates over the list, and performs some work for each individual ID. At some point, each of the dynamically spawned ops gets stuck indefinitely on a single ID (this is not a Dagster problem, this is a me problem). Root causing this and fixing it at the source isn't an option currently, so I'm looking to work around this for now, by terminating the job and having Dagster spit out all the remaining IDs which haven't been processed. I would manually rerun the job for the unprocessed IDs (which I've confirmed does not get stalled). 1. Would this be possible via failure hooks? The HookContext looks like it has a lot of goodies I can play around with. 2. Would this be better suited for assets? Materialize on failure, and pick up the asset on start (if it exists)?
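The fan-out described here can be sketched in plain Python. The `chunk_ids` helper below is a hypothetical stand-in for the chunking step, and the commented Dagster wiring (names like `fan_out`, `process_chunk`, and `load_ids` are illustrative, not from the thread) shows roughly how it would feed a `DynamicOut` op:

```python
def chunk_ids(ids, n):
    """Split ids into at most n roughly equal sublists (one per executor)."""
    if not ids:
        return []
    size = -(-len(ids) // n)  # ceiling division
    return [ids[i:i + size] for i in range(0, len(ids), size)]

# In Dagster this would feed a dynamic op, roughly:
#
#   @op(out=DynamicOut())
#   def fan_out(context):
#       for i, sublist in enumerate(chunk_ids(load_ids(), N)):
#           yield DynamicOutput(sublist, mapping_key=f"chunk_{i}")
#
#   @job
#   def process_all():
#       fan_out().map(process_chunk)
```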
🤖 1
s
Hi Moody, 1. I think it is possible-- if you have control over the exception thrown from the op, presumably you could put the unprocessed IDs in there? Alternatively, it seems like maybe you could just log them directly from the op with
context.log
, but I’m not sure I fully understand what this failure looks like. 2. Assets could be suitable for this problem, but you’d probably want to use partitioned assets, and given the special handling you need here I suspect that the greater flexibility you have with ops might be a better fit right now.
🙏 1
m
Thanks for the quick reply, Sean! If the job is in a long running op (which has an associated failure hook), and the job gets terminated - does the hook get triggered? I can't seem to make it happen locally
s
As in the job is terminated manually by the user?
m
yeah, sorry
s
Yeah, I don’t think that should trigger the failure hook
Is it possible for you to put a timeout on the per-id call in the op body and throw an exception when it fails? That would trigger the hook-- or you could just log the remaining ones directly from the op before throwing
Hey Moody, Have you been able to resolve/make progress on this?
m
Hey Sean! Yes, sorry. I opted to work around this as there was a tight deadline, and I couldn't spend too much time exploring around this. I simply did some bookkeeping in my database to mark individual IDs with certain states (READY, SUCCESS, FAILURE, etc.), keyed by the job ID. If a job is rerun (context contains a
parent_id
variable for reruns), it simply pulls all the READY-state IDs in order to process them
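A minimal sketch of that bookkeeping, with an in-memory dict standing in for the database table and a plain dict standing in for the Dagster run context (all names here — `StatusStore`, `ids_to_process`, the `parent_run_id` key — are illustrative assumptions, not the actual schema from the thread):

```python
READY, SUCCESS, FAILURE = "READY", "SUCCESS", "FAILURE"

class StatusStore:
    """In-memory stand-in for the per-ID state table described above."""

    def __init__(self):
        self.states = {}  # (run_key, id) -> state

    def mark(self, run_key, id_, state):
        self.states[(run_key, id_)] = state

    def ready_ids(self, run_key):
        return [i for (k, i), s in self.states.items()
                if k == run_key and s == READY]

def ids_to_process(run_info, store, all_ids):
    """Fresh run: register and process everything.
    Rerun (parent run ID present): pull only the IDs still in READY."""
    parent = run_info.get("parent_run_id")
    if parent is None:
        for i in all_ids:
            store.mark(run_info["run_id"], i, READY)
        return all_ids
    return store.ready_ids(parent)
```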
I introduced a timeout decorator before the above change as well, but an I/O block downstream (of Dagster) prevented the timeout exception from being thrown
🙌 1