In a straight-line portion of a DAG, could Dagster...
# dagster-feedback
In a straight-line portion of a DAG, could Dagster collapse multiple ops into one execution step? I'm thinking about some cases in my graphs where things are separate ops mostly just for cleanliness (and because ops are so nice and easy to declare and wire together), but it incurs some overhead. Would it be what most people want / easy to implement in the framework / a worthwhile performance improvement if adjacent ops merged at execution time?
Hi Mark. I think users may or may not want to collapse ops into the same execution step; it depends on the case. Some benefits of keeping ops separate:
• You can retry a failed run from the erroring op
• You can reuse ops and observe individual op computation times/behavior
I think that if you want to consolidate linear ops into one step, the easiest solution could be to bundle them together into one op. Another thing to note is that the default multiprocess executor takes additional time to spin up a process for each op. One way to potentially reduce this overhead is to switch to the in-process executor, which executes all ops in a single process.
@Mark Fickett I've long thought that it would be cool to enable some jobs to smush certain ops together into the same process while leaving other ops in separate processes. As Claire said, I don't think we'd want this behavior to be on automatically. I think we'd ultimately want to leave it up to the user to decide which ops to combine. This probably isn't something we'll get to in the near term, but you might be able to implement your own custom executor that combines the in-process executor with another executor implementation to get this behavior.
I've actually just done this for a couple of our graphs, precisely to avoid accumulating spin-up lag (I still want the multiprocess executor for other parts of them, though), so having an opt-in automagic way to do it would be pretty cool, I think.
Thanks Claire and Sandy for the interesting discussion! I agree retries and timing info are good reasons to keep things split up. Mixing multiprocess and in_process would be fun.
Just came across this thread while searching for a way to squash and combine ops into one to avoid the process overhead. Is there any documentation or examples for writing custom executors? Having a way to use in-process for some ops and multi-process for others would be ideal.
Hi @Hunter Young. Unfortunately, custom executors aren't well documented and the internal APIs continue to be in flux. You can, however, raise any implementation questions you have in #dagster-support.