Jordan

10/14/2022, 10:30 PM
Hi 👋 I have a partitioned job with two assets. The first asset retrieves a file and, if it is too big, splits it into several sub-files. The second asset performs a series of processing steps; however, I would like each sub-file to be processed separately (for memory reasons), so in separate runs. I think the most obvious solution would be to split this job into two jobs and add a sensor that detects whether there are one or more sub-files per partition, so it knows how many RunRequests to trigger. However, in my use case it is rare to have several files, so I think it's a shame to have to separate the flow of these two assets. I wonder whether it would be possible, in the io_manager between the two assets, to take the first sub-file for processing (most of the time the only file) and, for every other sub-file, to trigger a RunRequest on the second asset with the path of that sub-file as configuration. But I don't have the impression that this is feasible. Do you have any suggestions/solutions to achieve this? Thanks in advance

chris

10/17/2022, 9:24 PM
Hi Jordan - you don't need separate jobs to process each sub-file separately. You could use one of the out-of-process executors (e.g. the multiprocess executor) and launch a dynamic step for each sub-file. What I would suggest here is a graph-backed asset that launches a sub-op for each file that needs to be processed.