Hello!
I have a pipeline right now that takes a single image as input (from s3) and runs it through a bunch of solids which call external services. A single run for a single image usually takes approximately 30 seconds. We have it deployed on k8s with a QueuedRunCoordinator which spins up one pod per run (so one pod per image).
We're now trying to run this pipeline for 5000 images at once. We first tried an external Python script that uses the GraphQL API to submit the 5000 runs in a loop, but that approach crashed our dagit/graphql service pods more often than not.
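For reference, the submission script does roughly the following (simplified sketch: the repository/pipeline names, run config shape, and key listing are placeholders, and the mutation is paraphrased from the dagit GraphQL schema, so field names may not be exact):

```python
import requests

DAGIT_URL = "http://dagit.example.internal:3000/graphql"  # placeholder endpoint

LAUNCH_MUTATION = """
mutation LaunchPipelineExecution($executionParams: ExecutionParams!) {
  launchPipelineExecution(executionParams: $executionParams) {
    __typename
    ... on LaunchPipelineRunSuccess { run { runId } }
  }
}
"""

def submit_run(image_key):
    # One GraphQL call per image; 5000 of these in a tight loop is what
    # seems to overwhelm the dagit/graphql pods.
    variables = {
        "executionParams": {
            "selector": {
                "repositoryLocationName": "my_location",  # placeholder
                "repositoryName": "my_repo",              # placeholder
                "pipelineName": "process_image_pipeline", # placeholder
            },
            "runConfigData": {
                "solids": {"load_image": {"config": {"s3_key": image_key}}}
            },
            "mode": "default",
        }
    }
    resp = requests.post(
        DAGIT_URL, json={"query": LAUNCH_MUTATION, "variables": variables}
    )
    resp.raise_for_status()
    return resp.json()

for key in list_of_5000_keys:  # placeholder for the S3 listing
    submit_run(key)
```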
I'm currently testing a sensor approach, with the sensor watching the S3 bucket for new files (similar to the documented examples). This is much more reliable (it doesn't crash dagit/graphql), but the queue is never fully utilized: it's configured for 100 max concurrent runs, yet we only ever reach about 60-70 running with about 40-50 queued. It seems like the daemon running the sensor is throttled, so the queue never fills up, which slows down the overall process quite a bit.
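The sensor is modeled on the documented S3 sensor example, roughly like this (sketch only: the bucket name, solid name, and run config are placeholders, and the cursor handling is simplified):

```python
from dagster import RunRequest, sensor
from dagster_aws.s3.sensor import get_s3_keys

@sensor(pipeline_name="process_image_pipeline")  # placeholder pipeline name
def s3_image_sensor(context):
    # List keys added since the last cursor and request one run per image.
    new_keys = get_s3_keys("my-image-bucket", since_key=context.cursor)  # placeholder bucket
    if not new_keys:
        return
    for key in new_keys:
        yield RunRequest(
            run_key=key,  # dedupes: one run per image key
            run_config={
                "solids": {"load_image": {"config": {"s3_key": key}}}
            },
        )
    context.update_cursor(new_keys[-1])
```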
Are there ways I could increase the throughput of the sensor? I have tried increasing the k8s resource allocation for the dagster-daemon, but it doesn't seem to do much.
Or alternatively, is this single-image-per-run approach actually feasible at this scale? Should I instead make the pipeline input the whole S3 bucket (possibly fanning out with dynamic mapping)?
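For that alternative, I'm picturing something like the sketch below, where one run lists the bucket and fans out over the keys with dynamic outputs (sketch only: solid names are placeholders, and since the dynamic APIs were experimental around 0.11.x the import location may differ by version):

```python
from dagster import pipeline, solid
from dagster.experimental import DynamicOutput, DynamicOutputDefinition

@solid(output_defs=[DynamicOutputDefinition(str)])
def list_image_keys(context):
    # Would list the S3 bucket via a resource/boto3; keys hard-coded here for brevity.
    for key in ["images/a.png", "images/b.png"]:
        # mapping_key must be unique and restricted to alphanumerics/underscores
        yield DynamicOutput(key, mapping_key=key.replace("/", "_").replace(".", "_"))

@solid
def process_image(context, s3_key: str) -> str:
    # Calls the external services for a single image (placeholder).
    context.log.info(f"processing {s3_key}")
    return s3_key

@pipeline
def process_bucket_pipeline():
    # Fan out: one process_image step per key emitted by list_image_keys.
    list_image_keys().map(process_image)
```

My understanding is that this would shift concurrency control from the run queue to the executor (all 5000 steps would live inside one run), which is part of why I'm unsure it's the right direction.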
Currently using version 0.11.4 but can easily upgrade if needed.
Thanks in advance for help and suggestions.