https://dagster.io/ logo
Title
f

Fabio Picchi

04/25/2023, 5:55 PM
Hi guys, we're progressing quite fast with our dagster deployment and usage in house, but as we move, more doubts keep popping up, so I hope you don't mind me asking 😅 I'll try to describe our pipeline and the modelling approach I have in mind. We work with remote sensing data, so mostly pointclouds and images. The image processing is usually 1 - 1, meaning every step transforms an image, but the number of images remains the same along the pipeline. The pointcloud processing, however, might change the number of files since we tile the data to different tile sizes. Initially, the data isn't even organized in space (tiles) but it is in time (swaths), so the asset dependency chain is a bit trickier to model. We process the data in datasets, arbitrarily defined according to our business logic. These datasets comprise of a number of pointcloud and image files. I was thinking of establishing each stage in our pipeline for each dataset as an asset and there I see two ways of modelling things: defining the s3 subpaths where the data is as the asset or defining each file as an asset partition. The second approach seems more interesting to me, but I think that it might be tricky to understand how things progress across the pipeline since, as I mentioned previously, the number of partitions might change as the data progress down the pipeline, meaning partition definition has to be fully dynamic. Finally, we defined a generic asset dependency chain that should be applied to all datasets and we tie the assets to each dataset using the AssetMaterialization event. We concatenate the dataset ID and the pipeline stage name to create a unique asset key. Is that the way to do this sort of thing? Anyway, hope to get some feedback on this structured from the Dagster team and more experienced users 🙂
c

chris

04/25/2023, 7:49 PM
s3 subpaths sounds like a reasonable thing to define your asset across, but I think the partition space should be something more logical to your pipeline’s function. It sounds like initially your data is partitioned across time window, does that remain the case through the rest of the pipeline / can the end s3 files be partitioned across time?
f

Fabio Picchi

04/26/2023, 4:02 PM
Initially the data is organized in time, meaning each file comprises a few minutes for pointcloud data capture, for instance. Later in the pipeline we organize the data geometrically, meaning in tiles of, let's say, 1km by 1km. We don't go back to the original structure. However, it would be nice to see the dependency chain between these different stages. I guess the subpaths are the way to go, but at some point we might want to process, for instance, not a dataset of tiles but single tiles. This is why I'm thinking we might make good use of partitions. Sorry if I'm being too vague, but our current situation is that we batch process these datasets in big pods. However, in the future, we would want to process each file in a single small kubernetes pod, for the steps in which the work is inherently parallel. Not sure if we're able to describe an asset dependency chain that can work on individual partitions until they're all needed together.
c

chris

04/26/2023, 4:49 PM
I guess the question becomes what are you trying to gain from modeling each tile as its own partition? If I understand correctly, there are going to be many different partitions, and the mapping between time data and tiles isn’t clear. This is my mental model of what you’re describing:
f

Fabio Picchi

04/26/2023, 4:50 PM
that's exactly right! Yeah, sorry, I think I mangled two ideas together into a confusing mess
after we generate the tiles, we apply many transformations to them and those keep being 1 to 1, meaning that the number of tiles is kept across these operations
some of the steps are inherently parallel and others require all the tiles to be downloaded together for the analysis
c

chris

04/26/2023, 4:51 PM
If that’s the case, I think that you’d be better served not even using partitions here. You can use graph backed assets with dynamic ops in order to perform batch processing - if you’re using the k8s executor, each dynamic step will run in its own pod - you can launch off a dynamic step for each tile, for example
So essentially the entire dataset is your asset when modeling things like that
f

Fabio Picchi

04/28/2023, 12:03 PM
@chris, sorry to bug you again, but the further we develop the more questions we have 😅 So, what I just described in this thread is the situation of a single dataset in our system. We are looking for a way of describing the same asset dependency chain for every single dataset so we can see their state separately in Dagit.
I read about the asset factory pattern so I'm wondering if we should generate the asset tree for every dataset and have them in a separate group 🤔
again partitions come to mind (each dataset being a partition of the "tiles" asset) but that feels very weird semantically
c

chris

04/28/2023, 4:21 PM
It’s not the craziest idea to represent them as partitions, we have https://docs.dagster.io/_apidocs/partitions#dagster.MultiPartitionsDefinition for this case. It ultimately comes down to how uniquely you want these things represented in dagit. How many datasets are you dealing with here?