Jean-Pierre M

05/10/2021, 4:14 PM
Hello! I have a pipeline right now that takes a single image as input (from s3) and runs it through a bunch of solids which call external services. A single run for a single image usually takes approximately 30 seconds. We have it deployed on k8s with a QueuedRunCoordinator which spins up one pod per run (so one pod per image). We're trying to call this pipeline for 5000 images at once. We have tried having an external python script that uses the graphql api to submit the 5000 runs with a loop. This approach crashed our dagit/graphql service pods more often than not. I'm currently testing a sensor approach with the sensor looking at the s3 bucket for new files (similarly to the documented examples). This approach is much more reliable (doesn't crash dagit/graphql) but the queue is never being fully utilized. Our queue is configured with 100 max concurrent runs but we only ever reach about 60-70 running with about 40-50 in the queue. It seems like the daemon running the sensor is throttled so the queue never fills up. This is causing our overall process to slow down quite a bit. Are there ways that I could increase the throughput of the sensor? I have tried increasing the k8s resource allocation to the dagster-daemon but it doesn't seem to do much. Or alternatively, is this approach of a single image pipeline actually feasible? Should I instead be looking at the pipeline input being the whole s3 bucket (possibly using dynamic mapping)? Currently using version 0.11.4 but can easily upgrade if needed. Thanks in advance for help and suggestions.

Max Wong

05/10/2021, 4:22 PM
I think using AWS Lambda might be more suitable for your use case. It auto-scales depending on the load, so that should fix the queue issue

Jean-Pierre M

05/10/2021, 4:37 PM
Thanks for the suggestion, but that won't be possible in our case. We have our data in a MinIO store and we won't be able to change that


05/10/2021, 5:04 PM
cc @johann for queue throughput
is this approach of a single image pipeline actually feasible?
hard to say definitively but given most of the heavy lifting is happening in external services it could be worth trying. Depending on how your solids are factored it might be pretty easy to create a dynamic variant of the pipeline to test it out
that said, I doubt the UI will handle 5000 solids very gracefully. I don't expect it to break, but I'm skeptical it will be an awesome experience
you could also try to split the difference and launch N pipeline runs that each handle M photos
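A minimal sketch of the "N runs that each handle M photos" idea, in plain Python so it does not depend on any Dagster version. The solid name `process_images`, the config field `s3_keys`, and the key names are hypothetical; the real run config depends on how the pipeline's solids are defined.

```python
from typing import Dict, Iterator, List


def chunk_keys(keys: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `keys`, each holding at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(keys), batch_size):
        yield keys[start:start + batch_size]


def build_run_configs(keys: List[str], batch_size: int) -> List[Dict]:
    """Build one run config per batch of keys.

    The nesting below (solid `process_images`, field `s3_keys`) is an
    assumption for illustration, not the pipeline's actual schema.
    """
    return [
        {"solids": {"process_images": {"config": {"s3_keys": batch}}}}
        for batch in chunk_keys(keys, batch_size)
    ]


# Hypothetical object keys standing in for the 5000 images in the bucket.
keys = [f"images/{i:04d}.png" for i in range(5000)]
configs = build_run_configs(keys, batch_size=50)
print(len(configs))  # 100 run configs instead of 5000
```

With M = 50 this turns 5000 submissions into 100, so far fewer runs have to pass through the queue; the trade-off is that one failing image affects a whole batch unless the solids handle errors per image.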

Jean-Pierre M

05/10/2021, 5:12 PM
I'm leaning towards splitting the difference as you mentioned. Thanks!


05/10/2021, 5:13 PM
Regarding the queue throughput: would you mind sharing logs from the dagster daemon (either here or by messaging me directly) from while you’re noticing the behavior? I’m curious whether we see any logs like
x runs are currently in progress. Maximum is {y}, won't launch more
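One way to check for that message is to scan the saved daemon logs. This is only a sketch: the regular expression is an assumption derived from the wording quoted above, the exact log text may differ between versions, and the sample lines are made up for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Assumed pattern, based on the message quoted above; the placeholders
# are taken to be integers. Adjust it to match the real log output.
LIMIT_PATTERN = re.compile(
    r"(\d+) runs are currently in progress\. Maximum is (\d+)"
)


def find_limit_hits(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (in_progress, maximum) for each line matching the pattern."""
    hits = []
    for line in lines:
        match = LIMIT_PATTERN.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits


# Made-up sample lines; real daemon output will look different.
sample = [
    "some unrelated daemon log line",
    "100 runs are currently in progress. Maximum is 100, won't launch more",
    "another unrelated line",
]
hits = find_limit_hits(sample)
print(hits)  # [(100, 100)]
```

If no such lines show up while only 60-70 runs are in progress, the concurrency limit itself is probably not what is holding runs back.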