Hey all, I have a situation where I'm trying to process a list of data values in order. I have an op that checks an S3 bucket for files (could be one file or a list of them). I then want to trigger a second op to process the files; however, I need to ensure that the files are processed in the correct order. I've used Dynamic Outputs in the past, but I believe they spin up parallel processes and don't ensure FIFO. Is there a best practice for handling data in order without also limiting the execution count for the whole job? Or a way to set the number of possible executions on the op itself? Thanks!
07/06/2022, 3:54 PM
The executor is responsible for how the ops in a job are executed. As you’ve identified, the default multiprocess executor spins up a separate process for each step and runs them concurrently. You could switch to the in_process_executor, but that limits concurrency for all the other steps in the job as well.
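A middle ground is to keep the multiprocess executor but cap its concurrency in run config. A sketch, assuming a job launched with run config like the following; `max_concurrent: 1` means only one step process runs at a time, though step order still follows the dependency graph rather than a strict FIFO:

```yaml
# Run config sketch: keep the multiprocess executor,
# but allow only one step subprocess at a time.
execution:
  config:
    multiprocess:
      max_concurrent: 1
```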
There are other executors that allow for more fine-grained control, at the expense of a more complicated setup. The celery_k8s executor, for example, routes steps through Celery queues, which allows for concurrency limits on a per-op basis. You can read more about it here: https://docs.dagster.io/deployment/guides/kubernetes/deploying-with-helm-advanced#overview
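To give a concrete (hypothetical) shape to the per-op queue idea: with the Celery-based executors you can route an op onto a named queue with an op tag like `tags={"dagster-celery/queue": "serial-queue"}`, and then give that queue a single worker replica in the Helm values so its steps run one at a time. The queue name below is illustrative; see the linked guide for the full setup:

```yaml
# Helm values sketch: a dedicated queue with one worker replica,
# so steps tagged onto it execute serially.
runLauncher:
  config:
    celeryK8sRunLauncher:
      workerQueues:
        - name: "serial-queue"
          replicaCount: 1
```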
We do provide built-in support for limiting run-level concurrency using run coordinators (see https://docs.dagster.io/deployment/run-coordinator#limiting-run-concurrency), but taking advantage of that in your situation would mean restructuring your work to be split across multiple jobs (or jobs triggered by sensors).
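For completeness, here is roughly what that run-coordinator route looks like in `dagster.yaml`: a QueuedRunCoordinator with a tag-based concurrency limit, so runs carrying a given tag execute one at a time and are dequeued in roughly the order they were submitted. The tag key is illustrative; you'd have a sensor launch one run per file, tagged accordingly:

```yaml
# dagster.yaml sketch (tag key is illustrative):
# runs tagged "s3-file-pipeline" execute one at a time.
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    tag_concurrency_limits:
      - key: "s3-file-pipeline"
        limit: 1
```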