# announcements
I'm in a scenario where I can have a lot of inputs that may be processed in parallel and I've been going through some of the past discussions on this topic.
One way of doing this is to have a solid invoke a Spark job with a list of inputs and leave it to Spark to parallelize. I'm not a fan of this approach because if I deploy dagster to a k8s cluster with celery workers, I already have the infrastructure to perform data-parallel operations, so setting up Spark seems redundant. Besides, I won't be able to use dagit to visualize the overall pipeline progress.
Another approach is to use a fixed degree of parallelism, like in the example. I can define a fixed number of aliased solids and partition my input across those solids. That lets me reuse my infrastructure and keep the dagit visualization, but it doesn't necessarily scale with the number of inputs. For instance, if I have a large number of inputs and a small number of aliased solids, I can't simply scale up the number of worker nodes in k8s because I'll still be limited by the number of aliased solids.
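For concreteness, the partitioning step in that fixed-parallelism approach might look like this (a plain-Python sketch with no Dagster APIs; the round-robin chunking helper is an illustrative assumption, not from the discussion):

```python
def partition(inputs, num_solids):
    """Split inputs round-robin across a fixed number of aliased solids."""
    buckets = [[] for _ in range(num_solids)]
    for i, item in enumerate(inputs):
        buckets[i % num_solids].append(item)
    return buckets

# With 10 inputs and 3 aliased solids, each solid gets 3-4 items.
# Adding more worker nodes beyond 3 wouldn't help, which is the
# scaling limitation described above.
buckets = partition(list(range(10)), 3)
```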
Yet another approach, which I think max suggested in a discussion, is to have two pipelines. One pipeline sizes the jobs that need to be done and dynamically creates a second pipeline. This second pipeline will have as many aliased solids as there are inputs, which is great. Question: how would I actually execute this second pipeline at runtime? I can't trigger the start of a pipeline from another pipeline, can I?
@alir yes, you can trigger one pipeline from another
@sashank has been looking at some of this lately
oh? could you please send me some pointers to code or documentation showing how one pipeline can trigger another? that would be very useful!
Here’s an example where we map over a list and execute a pipeline for each element of the list
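A schematic of that map-and-execute pattern, in plain Python rather than actual Dagster calls (`run_child_pipeline` is a made-up stand-in for whatever pipeline-execution API you use, such as legacy Dagster's `execute_pipeline`):

```python
def run_child_pipeline(element):
    # Stand-in for a real pipeline execution call; in legacy Dagster
    # this would be something like execute_pipeline(child, run_config=...).
    return {"input": element, "status": "SUCCESS"}

def fan_out_solid(elements):
    """Parent solid: map over the list and launch one child run per element."""
    results = [run_child_pipeline(e) for e in elements]
    failed = [r for r in results if r["status"] != "SUCCESS"]
    if failed:
        raise RuntimeError(f"{len(failed)} child runs failed")
    return results

results = fan_out_solid(["a", "b", "c"])
```

The parallelism here matches the input size at runtime, which avoids the fixed-aliased-solids limitation discussed above.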
this is great!! thanks! It simply didn't occur to me to call it from within a solid.
Here’s another example we shared with someone who was trying to implement recursive pipelines:
You can also call
if you want to process events in your main solid and do something even fancier
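The event-processing variant can be sketched with an iterator: the parent solid consumes events as the child run emits them (again plain Python; the event shapes and `child_run_events` generator are made-up stand-ins for a real event stream):

```python
def child_run_events(element):
    """Stand-in for an iterator-style execution API that yields
    events as the child run progresses."""
    yield {"type": "STEP_START", "input": element}
    yield {"type": "STEP_SUCCESS", "input": element}
    yield {"type": "PIPELINE_SUCCESS", "input": element}

def process_events(elements):
    """Parent solid: watch each child run's event stream and react to it."""
    succeeded = []
    for element in elements:
        for event in child_run_events(element):
            # Here you could log progress, record metadata, retry, etc.
            if event["type"] == "PIPELINE_SUCCESS":
                succeeded.append(element)
    return succeeded
```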