# ask-community
m
Hello fellow Dagsterers, I have a job which takes a large list of IDs. This list is chunked into N smaller sublists (where N = number of executors, to allow for some parallelization). Each of these sublists is passed to the N dynamically spawned ops. Each dynamically spawned op simply iterates over the list, and performs some work for each individual ID. At some point, each of the dynamically spawned ops gets stuck indefinitely on a single ID (this is not a Dagster problem, this is a me problem). Root causing this and fixing it at the source isn't an option currently, so I'm looking to work around this for now, by terminating the job and having Dagster spit out all the remaining IDs which haven't been processed. I would manually rerun the job for the unprocessed IDs (which I've confirmed does not get stalled). 1. Would this be possible via failure hooks? The HookContext looks like it has a lot of goodies I can play around with. 2. Would this be better suited for assets? Materialize on failure, and pick up the asset on start (if it exists)?
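The fan-out described here can be sketched in plain Python. The `chunk_ids` helper below is a hypothetical stand-in for the chunking step, and the commented Dagster wiring (names like `fan_out`, `process_chunk`, and `load_ids` are illustrative, not from the thread) shows roughly how it would feed a `DynamicOut` op:

```python
def chunk_ids(ids, n):
    """Split ids into at most n roughly equal sublists (one per executor)."""
    if not ids:
        return []
    size = -(-len(ids) // n)  # ceiling division
    return [ids[i:i + size] for i in range(0, len(ids), size)]

# In Dagster this would feed a dynamic op, roughly:
#
#   @op(out=DynamicOut())
#   def fan_out(context):
#       for i, sublist in enumerate(chunk_ids(load_ids(), N)):
#           yield DynamicOutput(sublist, mapping_key=f"chunk_{i}")
#
#   @job
#   def process_all():
#       fan_out().map(process_chunk)
```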
🤖 1
s
Hi Moody, 1. I think it is possible-- if you have control over the exception thrown from the op, presumably you could put the unprocessed IDs in there? Alternatively, it seems like maybe you could just log them directly from the op with
context.log
, but I’m not sure I fully understand what this failure looks like. 2. Assets could be suitable for this problem, but you’d probably want to use partitioned assets, and given the special handling you need here I suspect that the greater flexibility you have with ops might be a better fit right now.
🙏 1
m
Thanks for the quick reply, Sean! If the job is in a long running op (which has an associated failure hook), and the job gets terminated - does the hook get triggered? I can't seem to make it happen locally
s
As in the job is terminated manually by the user?
m
yeah, sorry
s
Yeah, I don’t think that should trigger the failure hook
Is it possible for you to put a timeout on the per-id call in the op body and throw an exception when it fails? That would trigger the hook-- or you could just log the remaining ones directly from the op before throwing
Hey Moody, Have you been able to resolve/make progress on this?
m
Hey Sean! Yes, sorry. I opted to work around this as there was a tight deadline, and I couldn't spend too much time exploring around this. I simply did some bookkeeping in my database to mark individual IDs with certain states (READY, SUCCESS, FAILURE, etc.), keyed by the job ID. If a job is rerun (context contains a
parent_id
variable for reruns), it simply pulls all the READY-state IDs in order to process them
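A minimal sketch of that bookkeeping, with an in-memory dict standing in for the database table and a plain dict standing in for the Dagster run context (all names here — `StatusStore`, `ids_to_process`, the `parent_run_id` key — are illustrative assumptions, not the actual schema from the thread):

```python
READY, SUCCESS, FAILURE = "READY", "SUCCESS", "FAILURE"

class StatusStore:
    """In-memory stand-in for the per-ID state table described above."""

    def __init__(self):
        self.states = {}  # (run_key, id) -> state

    def mark(self, run_key, id_, state):
        self.states[(run_key, id_)] = state

    def ready_ids(self, run_key):
        return [i for (k, i), s in self.states.items()
                if k == run_key and s == READY]

def ids_to_process(run_info, store, all_ids):
    """Fresh run: register and process everything.
    Rerun (parent run ID present): pull only the IDs still in READY."""
    parent = run_info.get("parent_run_id")
    if parent is None:
        for i in all_ids:
            store.mark(run_info["run_id"], i, READY)
        return all_ids
    return store.ready_ids(parent)
```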
I introduced a timeout decorator before the above change as well, but an I/O block downstream (of Dagster) prevented the timeout exception from being thrown
🙌 1