https://dagster.io/ logo
Title
c

Caleb Parnell Lampen

03/28/2023, 3:59 PM
I'm looking for advice on how best to structure my project. Notionally, the questoin is how to do "fan-out" operations using (preferably dynamically) partitioned assets, but I might be overlooking another approach. For context, I'm new to dagster. The situation: I have sets of data files. Each set has multiple image data files, and a single calibration data file. I want to be able to build a pipeline to apply the calibration to each uncalibrated image file to create corresponding calibrated image files. There are per-image tasks downstream as well, but I don't think those are relevant to this question. Both the calibration data and each image file need (somewhat time consuming) preprocessing before the calibration is applied to the images. The way I'd think to structure this would be a preprocessed_calibration asset partitioned by set, and a calibrated_image asset partitioned by image file id, with a mapping between the set name and the image file ids (one set goes to many image file ids). As I understand it, this way each set's calibration data would be processed exactly once, and then applied to all of the image assets in the set. Now, the practical problem. We want to be able to add data sets over time. Moreover, we want to be able to use this code in multiple locations (different laptops and a cloud deployment), each of which will have their own datasets. The first thought is to use
DynamicPartitions
to generate the partitions at runtime, based upon the files available on the system its running on. The problem is, it seems that there is no way to map dynamic partitions to each other (see feature request https://github.com/dagster-io/dagster/issues/13139). Thus, with dynamic partitions, I'm seemingly locked out of "fan-out" type operations, so I can't use the mapping strategy to only preprocess the calibration once per set. The second option is to use static partitions, and do the mapping on those. My main concern is that the static partitions it seems have to be instantiated before passing to the
@asset
decorator. So, we'd have to edit the code upstream of the asset, every time we want to run on new files. Possible, but this makes the code less portable. I'd prefer to separate the asset logic from the data it runs on. It would be better if we could define the static partitions downstream of the asset definitions, so they could be reusable against different data locations. For example, if we defined them in a per-deployment script that calls
load_ssets_from_package_module
. Even better would be if the static partitions could be loaded from a config file or environmental variables, but I'm guessing at that point we are back to dynamic partitions. A third option is to define partitions and assets only at the set level, but that isn't really the partitioning scheme that is natural for downstream things we want to do, and since most of the operations are highly parallelizable by file, we'd want to find a new parallelization solution nested inside of dagster (maybe dask). Any advice on how to handle this situation?
:dagster-bot-resolve: 1
s

sean

03/29/2023, 3:48 PM
Hi Caleb, Thanks for this very thorough problem description-- will be useful for us as we continue developing dynamic partitions. I think probably the best solution for you for now (pending custom mapping between dynamic partitions) is to use static partitions. You could create a function that generates the
StaticPartitionsDefinitions
by reading a file, so you shouldn't have to edit any code. You will of course need to reload the code location to register an update to the data file defining the partitions.
c

Caleb Parnell Lampen

03/29/2023, 4:27 PM
Thanks. Yeah, I wasn't sure how to explain the problem in a more pithy way without potentially omitting something, since I really wasn't sure if I was even thinking about the problem correctly in the dagster framework. Thanks for reading the whole thing! This is actually the solution I came to late yesterday. Still vetting it. Thanks for confirming that it is a reasonable way to go about it, since having a "generator function" for these things loaded at run time is not something I noted in the docs, and I wasn't sure if there might be some pitfalls. Simply clicking on the "reload" button in dagit seems to detect new static partitions fine, so this seems like a pretty good solution. If you ever add dynamic partition mappings, I should be able to swap those in with minimal impact to the rest of the code.
s

sean

03/29/2023, 7:11 PM
Great, glad that this solution is working for you. I will add something to the docs explaining this alternative approach to dynamic partitions.
h

Harrison Conlin

03/29/2023, 11:31 PM
interesting problem! gives some food for though
c

Caleb Parnell Lampen

04/04/2023, 8:57 PM
This seems to be working. It's a little awkward, since I need to setup the file locations before the partitions are created and thus also before the software defined assets are declared. Essentially, I need to define it by something like a global variable before importing my assets (if I want the assets to be in an importable python module.) I'm currently doing this via an environmental variable. I'm a little worried this is inflexible as the project scales up and when writing tests. I'd prefer to somehow use the dagster context/config system to define this path. If this winds up in the documentation @sean, I'll be interested in taking a look to see if you have a more elegant way to set this sort of "dynamic-defined" static partitions up.