a
I'm in a scenario where I can have a lot of inputs that may be processed in parallel and I've been going through some of the past discussions on this topic.
One way of doing this is to have a solid invoke a Spark job with a list of inputs and leave it to Spark to parallelize the work. I'm not a fan of this approach: if I deploy Dagster to a k8s cluster with Celery workers, I already have the infrastructure to perform data-parallel operations, so setting up Spark seems redundant. Besides, I won't be able to use Dagit to visualize the overall pipeline progress.
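To be concrete, this is the shape of the approach I mean (just a sketch, not an endorsement; the script name `process_inputs.py` is made up):
```python
import json
import subprocess

from dagster import solid

@solid
def run_spark_job(context, inputs):
    # Hand the whole serialized input list to an external Spark job and
    # let Spark do the parallelization
    subprocess.run(
        ["spark-submit", "process_inputs.py", json.dumps(inputs)],
        check=True,
    )
```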
Another approach is to use a fixed degree of parallelism, like in the `sleepy.py` example. I can define a fixed number of aliased solids and partition my input across those solids. That lets me reuse my infrastructure and keep the Dagit visualization, but it doesn't necessarily scale with the number of inputs. For instance, if I have a large number of inputs and a small number of aliased solids, I can't simply scale up the number of worker nodes in k8s, because I'm still limited by the number of aliased solids.
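Roughly what I have in mind, sketched with made-up solid names (the multi-output split mirrors the pattern in `sleepy.py`):
```python
from dagster import Output, OutputDefinition, pipeline, solid

PARALLELISM = 4

@solid(output_defs=[OutputDefinition(name=f"part_{i}") for i in range(PARALLELISM)])
def split_inputs(context):
    items = list(range(100))
    # Deal items round-robin into a fixed number of partitions
    for i in range(PARALLELISM):
        yield Output(items[i::PARALLELISM], output_name=f"part_{i}")

@solid
def process_partition(context, items):
    context.log.info(f"processing {len(items)} items")

@pipeline
def fixed_fanout_pipeline():
    parts = split_inputs()
    # One aliased copy of the worker solid per partition; parallelism is
    # capped at PARALLELISM no matter how many inputs there are
    for i in range(PARALLELISM):
        process_partition.alias(f"process_{i}")(parts[i])
```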
Yet another approach, which I think max suggested in a discussion, is to have two pipelines. The first pipeline sizes the work to be done and dynamically creates a second pipeline, which has one aliased solid per input, which is great. Question: how would I actually execute this second pipeline at runtime? I can't trigger the start of one pipeline from another pipeline, can I?
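For the dynamic-construction part, I'm picturing something like this rough sketch, using `PipelineDefinition` and `SolidInvocation` from the legacy API to alias one solid per input (names are made up, and the exact constructor arguments vary a bit across Dagster versions):
```python
from dagster import PipelineDefinition, SolidInvocation, solid

@solid
def process_chunk(context):
    context.log.info("processing one chunk of the input")

def build_fanout_pipeline(num_inputs):
    # One aliased invocation of process_chunk per input, built at runtime
    return PipelineDefinition(
        name="dynamic_fanout",
        solid_defs=[process_chunk],
        dependencies={
            SolidInvocation("process_chunk", alias=f"process_chunk_{i}"): {}
            for i in range(num_inputs)
        },
    )
```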
m
@alir yes, you can trigger one pipeline from another
@sashank has been looking at some of this lately
a
oh? could you please send me some pointers to code or documentation that shows how one pipeline can trigger another? that would be very useful!
s
Here’s an example where we map over a list and execute a pipeline for each element of the list
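Something along these lines (a minimal sketch with made-up names; on older Dagster versions the keywords are `config`/`environment_dict` rather than `config_schema`/`run_config`):
```python
from dagster import execute_pipeline, pipeline, solid

@solid(config_schema=str)
def process_one(context):
    context.log.info(f"processing {context.solid_config}")

@pipeline
def inner_pipeline():
    process_one()

@solid
def fan_out(context, items):
    # Launch one full run of inner_pipeline per element of the list
    for item in items:
        result = execute_pipeline(
            inner_pipeline,
            run_config={"solids": {"process_one": {"config": item}}},
        )
        assert result.success
```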
a
this is great!! thanks! It simply didn't occur to me to call `execute_pipeline` from within a solid.
s
Here’s another example we shared with someone who was trying to implement recursive pipelines:
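The shape of it is roughly this (a sketch, not the exact code that was shared; names are made up): a solid launches another run of its own pipeline until no work remains.
```python
from dagster import execute_pipeline, pipeline, solid

@solid(config_schema=int)
def countdown(context):
    remaining = context.solid_config
    context.log.info(f"{remaining} runs to go")
    if remaining > 0:
        # Recurse by launching another run of the same pipeline
        execute_pipeline(
            recursive_pipeline,
            run_config={"solids": {"countdown": {"config": remaining - 1}}},
        )

@pipeline
def recursive_pipeline():
    countdown()
```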
You can also call `execute_pipeline_iterator` if you want to process events in your main solid and do something even fancier.
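For instance (a sketch; `execute_pipeline_iterator` yields `DagsterEvent` objects as the run progresses, and the attribute names here are from the legacy event API):
```python
from dagster import execute_pipeline_iterator, pipeline, solid

@solid
def do_work(context):
    context.log.info("working")

@pipeline
def worker_pipeline():
    do_work()

# Consume events as they stream in, instead of blocking on the final result
for event in execute_pipeline_iterator(worker_pipeline):
    print(event.event_type_value, event.message)
```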