a
I'm in a scenario where I can have a lot of inputs that may be processed in parallel and I've been going through some of the past discussions on this topic.
One way of doing this is to have a solid invoke a Spark job with a list of inputs and leave it to Spark to parallelize the work. I'm not a fan of this approach: if I deploy Dagster to a k8s cluster with Celery workers, I already have the infrastructure to perform data-parallel operations, so setting up Spark seems redundant. Besides, I won't be able to use Dagit to visualize the overall pipeline progress.
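To be concrete, this is the shape of the approach I mean (just a sketch, not an endorsement; the script name `process_inputs.py` is made up):
```python
import json
import subprocess

from dagster import solid

@solid
def run_spark_job(context, inputs):
    # Hand the whole serialized input list to an external Spark job and
    # let Spark do the parallelization
    subprocess.run(
        ["spark-submit", "process_inputs.py", json.dumps(inputs)],
        check=True,
    )
```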
Another approach is to use a fixed degree of parallelism, like in the `sleepy.py` example. I can define a fixed number of aliased solids and partition my input across those solids. That lets me reuse my infrastructure and keep the Dagit visualization, but it doesn't necessarily scale with the number of inputs. For instance, if I have a large number of inputs and a small number of aliased solids, I can't simply scale up the number of worker nodes in k8s, because I'm still limited by the number of aliased solids.
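Roughly what I have in mind, sketched with made-up solid names (the multi-output split mirrors the pattern in `sleepy.py`):
```python
from dagster import Output, OutputDefinition, pipeline, solid

PARALLELISM = 4

@solid(output_defs=[OutputDefinition(name=f"part_{i}") for i in range(PARALLELISM)])
def split_inputs(context):
    items = list(range(100))
    # Deal items round-robin into a fixed number of partitions
    for i in range(PARALLELISM):
        yield Output(items[i::PARALLELISM], output_name=f"part_{i}")

@solid
def process_partition(context, items):
    context.log.info(f"processing {len(items)} items")

@pipeline
def fixed_fanout_pipeline():
    parts = split_inputs()
    # One aliased copy of the worker solid per partition; parallelism is
    # capped at PARALLELISM no matter how many inputs there are
    for i in range(PARALLELISM):
        process_partition.alias(f"process_{i}")(parts[i])
```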
Yet another approach, which I think max suggested in a discussion, is to have two pipelines. The first pipeline sizes the work to be done and dynamically creates a second pipeline, which has one aliased solid per input, which is great. Question: how would I actually execute this second pipeline at runtime? I can't trigger the start of one pipeline from another pipeline, can I?
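For the dynamic-construction part, I'm picturing something like this rough sketch, using `PipelineDefinition` and `SolidInvocation` from the legacy API to alias one solid per input (names are made up, and the exact constructor arguments vary a bit across Dagster versions):
```python
from dagster import PipelineDefinition, SolidInvocation, solid

@solid
def process_chunk(context):
    context.log.info("processing one chunk of the input")

def build_fanout_pipeline(num_inputs):
    # One aliased invocation of process_chunk per input, built at runtime
    return PipelineDefinition(
        name="dynamic_fanout",
        solid_defs=[process_chunk],
        dependencies={
            SolidInvocation("process_chunk", alias=f"process_chunk_{i}"): {}
            for i in range(num_inputs)
        },
    )
```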
m
@alir yes, you can trigger one pipeline from another
@sashank has been looking at some of this lately
a
oh? could you please send me some pointers to code or documentation that shows how one pipeline can trigger another? that would be very useful!
s
Here’s an example where we map over a list and execute a pipeline for each element of the list
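Something along these lines (a minimal sketch with made-up names; on older Dagster versions the keywords are `config`/`environment_dict` rather than `config_schema`/`run_config`):
```python
from dagster import execute_pipeline, pipeline, solid

@solid(config_schema=str)
def process_one(context):
    context.log.info(f"processing {context.solid_config}")

@pipeline
def inner_pipeline():
    process_one()

@solid
def fan_out(context, items):
    # Launch one full run of inner_pipeline per element of the list
    for item in items:
        result = execute_pipeline(
            inner_pipeline,
            run_config={"solids": {"process_one": {"config": item}}},
        )
        assert result.success
```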
a
this is great!! thanks! It simply didn't occur to me to call `execute_pipeline` from within a solid.
s
Here’s another example we shared with someone who was trying to implement recursive pipelines:
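The shape of it is roughly this (a sketch, not the exact code that was shared; names are made up): a solid launches another run of its own pipeline until no work remains.
```python
from dagster import execute_pipeline, pipeline, solid

@solid(config_schema=int)
def countdown(context):
    remaining = context.solid_config
    context.log.info(f"{remaining} runs to go")
    if remaining > 0:
        # Recurse by launching another run of the same pipeline
        execute_pipeline(
            recursive_pipeline,
            run_config={"solids": {"countdown": {"config": remaining - 1}}},
        )

@pipeline
def recursive_pipeline():
    countdown()
```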
You can also call `execute_pipeline_iterator` if you want to process events in your main solid and do something even fancier.
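For instance (a sketch; `execute_pipeline_iterator` yields `DagsterEvent` objects as the run progresses, and the attribute names here are from the legacy event API):
```python
from dagster import execute_pipeline_iterator, pipeline, solid

@solid
def do_work(context):
    context.log.info("working")

@pipeline
def worker_pipeline():
    do_work()

# Consume events as they stream in, instead of blocking on the final result
for event in execute_pipeline_iterator(worker_pipeline):
    print(event.event_type_value, event.message)
```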