I'm writing an IO manager. Is there a good guide on how to deal w/ parallelism/multiprocessing here? Specifically, for time-based partitions my writer is not parallelizable, but for static partitions it is. Ideally, I'd be able to specify this in a way that also works w/ MultiPartitions.
jamie
03/24/2023, 7:36 PM
hey @Drew You, are you looking for advice on how to determine whether an output is from a time partition vs a static partition vs a multi partition?
Drew You
03/24/2023, 7:41 PM
So, I know that. I'm more concerned w/ how Dagster deals w/ multiprocessing. Is there a way to guarantee that a backfill will only be handled by a single thread?
I have a single asset that is a very large time series, and I've defined my time-partition writer as an append to a parquet file (load the parquet file + add the latest partition to the end).
Drew You
03/25/2023, 9:07 PM
This is single-writer, so my intuition is that I should have an easy way to specify this. The time series is large, so I don't want to rematerialize the whole thing each time I get a new partition, but I also don't want to force anything that depends on it to run in a single-threaded executor.
jamie
03/27/2023, 2:08 PM
ok, I think I understand the problem better now. I agree that changing the executor for every single pipeline doesn’t make sense. We don’t have a built-in way to achieve the kind of thing you want, so you could look at other solutions to ensure that two instances of the IO manager aren’t writing to the file at the same time. One option is maintaining a lock file that the IO manager needs to hold in order to write to the parquet file.
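For reference, here's a rough sketch of what that lock file idea could look like, using only the Python standard library. This is not a Dagster API: `file_lock`, `append_partition`, and the `.lock` suffix are hypothetical names made up for illustration, and it assumes all writers share a local filesystem where exclusive file creation is atomic. It also doesn't handle stale locks left behind by a crashed process.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def file_lock(lock_path, timeout=60.0, poll_interval=0.1):
    """Hold an exclusive lock file for the duration of the with-block.

    Hypothetical helper, not part of Dagster.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only
            # one process at a time can create (and thus hold) the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire lock {lock_path}")
            time.sleep(poll_interval)
    try:
        # Record the holder's PID to make a stuck lock easier to debug.
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def append_partition(target_path, write_fn):
    """Run write_fn(target_path) while holding the lock for target_path.

    write_fn is whatever does the load-and-append on the parquet file.
    """
    with file_lock(target_path + ".lock"):
        write_fn(target_path)
```

The time-partitioned branch of the IO manager's output handling would call something like `append_partition(path, do_append)`, while the static-partition branch could skip the lock and write in parallel. Downstream assets can then stay on a multiprocess executor, since only the append itself is serialized.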