# announcements
Rebeka
Hi everyone! Is there a good way to, from within a solid, know where we are in the pipeline? I.e., to know which solids have already run and which solids remain downstream?
Dan
Hi Rebeka - I think the answer here might depend on what exactly you want to use this for. Could I ask for more context about the goal? One tricky thing is that multiple solids can run in parallel under many of the executors, so the set of solids that have already completed isn't guaranteed to stay the same over the course of another solid's execution.
Rebeka
Hi Dan! 🙂 Here I'm only really interested in upstream and downstream dependencies. The specific use case: in my setup I'm processing a batch of data items. If the processing of a single data item fails, I want it excluded from downstream processing. So far so good - I can just set that data object's status to failed and exclude failed ones downstream. However, something that was very helpful back in my Airflow days, for dev and testing, was being able to clear the statuses of tasks (solids) midway through a run and watch it try again once an issue (a bug, or something external) was fixed. In those cases I'd want to know that I can ignore errors that occurred downstream.
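For reference, this is roughly the pattern I'm using today (a minimal sketch - `Item` and `process_batch` are just illustrative names of mine, not Dagster APIs):

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Item:
    key: str
    payload: dict
    failed: bool = False
    errors: List[str] = field(default_factory=list)

def process_batch(items: List[Item], step_name: str, fn: Callable[[dict], dict]) -> List[Item]:
    """Apply `fn` to every item that has not already failed.

    Items that raise are marked failed and excluded from downstream
    steps, so one bad item doesn't abort the whole batch.
    """
    for item in items:
        if item.failed:
            continue  # skip items that failed in an upstream step
        try:
            item.payload = fn(item.payload)
        except Exception as exc:
            item.failed = True
            item.errors.append(f"{step_name}: {exc}")
    return items
```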
m
cc @nate
Sandy
A way that this is typically handled in Dagster is through "Re-execution" - i.e. you can re-execute a subset of a pipeline, including the solids of your choosing. Here's some documentation: https://docs.dagster.io/tutorial/advanced_intermediates#reexecution Would that satisfy what you're looking for?
If not, would you be able to spell out what the logic looks like a little more specifically? E.g. is it that, if solid B depends on solid A, and solid C depends on solid B, you'd want solid B to ignore errors in solid A if solid C has already completed executing?
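For completeness, here's a rough sketch of what programmatic re-execution looks like. This assumes a solids-era (0.x) Dagster where `execute_pipeline`/`reexecute_pipeline` and `step_selection` are the public APIs, and that intermediates are persisted so the re-run can reuse upstream outputs - the exact config key for that has varied across versions, so please check the docs for yours:

```python
from dagster import DagsterInstance, execute_pipeline, pipeline, reexecute_pipeline, solid

@solid
def solid_a(_):
    return 1

@solid
def solid_b(_, a):
    return a + 1

@pipeline
def my_pipeline():
    solid_b(solid_a())

# Persist intermediates so the re-run can read solid_a's output instead of
# recomputing it. ("intermediate_storage" is the 0.10-era spelling; older
# versions used "storage".)
run_config = {"intermediate_storage": {"filesystem": {}}}

instance = DagsterInstance.ephemeral()
result = execute_pipeline(my_pipeline, run_config=run_config, instance=instance)

# Re-run just solid_b and everything downstream of it (the "*" suffix),
# against the same instance so the parent run can be found.
reexecute_pipeline(
    my_pipeline,
    parent_run_id=result.run_id,
    step_selection=["solid_b*"],
    run_config=run_config,
    instance=instance,
)
```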
Rebeka
Hi Sandy! Thanks for this! Sorry for not being able to reply sooner. Here's the problem a little more clearly: in my pipeline I'm processing N objects. I track the success/failure state of each of them, so that when one object fails, the rest can continue to be processed downstream. So at the start of each solid I look for objects in the batch that have not failed. If I were to use the functionality you mention - re-executing a subset - or even if I just had a complex pipeline with many branches - I'd need to know which of an object's errors occurred downstream of a given solid (or on a different, irrelevant branch), so that I can "ignore" those errors. I only really care about errors that happened upstream.
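To make the logic concrete, here's a version-agnostic sketch of the check I have in mind. The dependency mapping is something I'd maintain myself (or derive from my pipeline definition), since I don't know of a supported way to query it from inside a solid:

```python
from typing import Dict, List, Set

def upstream_solids(deps: Dict[str, List[str]], solid_name: str) -> Set[str]:
    """Return every solid that is a transitive upstream dependency of `solid_name`.

    `deps` maps each solid to the list of solids it directly depends on.
    """
    seen: Set[str] = set()
    stack = list(deps.get(solid_name, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(deps.get(current, []))
    return seen

def relevant_errors(item_errors: Dict[str, str], solid_name: str,
                    deps: Dict[str, List[str]]) -> Dict[str, str]:
    """Keep only errors recorded by solids upstream of `solid_name`;
    errors from downstream solids or sibling branches can be ignored."""
    upstream = upstream_solids(deps, solid_name)
    return {s: err for s, err in item_errors.items() if s in upstream}

# Example: B depends on A, C depends on B, D is an unrelated branch.
deps = {"A": [], "B": ["A"], "C": ["B"], "D": []}
assert upstream_solids(deps, "C") == {"A", "B"}
# An error recorded by D is irrelevant when deciding whether C can run:
assert relevant_errors({"A": "boom", "D": "other branch"}, "C", deps) == {"A": "boom"}
```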