Hi guys, we're progressing quite fast with our Dagster deployment and usage in-house, but as we go, more doubts keep popping up, so I hope you don't mind me asking 😅
I'll try to describe our pipeline and the modelling approach I have in mind. We work with remote sensing data, mostly pointclouds and images. The image processing is usually 1-to-1: every step transforms an image, and the number of images stays the same along the pipeline. The pointcloud processing, however, can change the number of files, since we tile the data to different tile sizes. Initially the data isn't even organized in space (tiles), only in time (swaths), so the asset dependency chain is a bit trickier to model.
We process the data in datasets, arbitrarily defined according to our business logic. Each dataset comprises a number of pointcloud and image files. I was thinking of making each stage of our pipeline, per dataset, an asset, and I see two ways of modelling that: defining the S3 subpath where the data lives as the asset, or defining each file as an asset partition. The second approach seems more interesting to me, but I think it might be tricky to follow how things progress across the pipeline since, as I mentioned, the number of partitions can change as the data moves down the pipeline, meaning the partition definitions have to be fully dynamic.
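For the partition-per-file approach, this is roughly what I have in mind, sketched with `DynamicPartitionsDefinition` (the asset/sensor names and the `list_new_s3_keys` stub are made up for illustration):

```python
from dagster import (
    AssetExecutionContext,
    DynamicPartitionsDefinition,
    RunRequest,
    SensorEvaluationContext,
    SensorResult,
    asset,
    sensor,
)

# One dynamic partition per pointcloud file; keys get registered at runtime.
pointcloud_files = DynamicPartitionsDefinition(name="pointcloud_files")


@asset(partitions_def=pointcloud_files)
def tiled_pointcloud(context: AssetExecutionContext) -> None:
    # context.partition_key identifies the file this run materializes
    context.log.info(f"tiling {context.partition_key}")


def list_new_s3_keys() -> list[str]:
    # Stub: however we end up discovering new files in the bucket
    return []


@sensor(asset_selection=[tiled_pointcloud])
def pointcloud_file_sensor(context: SensorEvaluationContext) -> SensorResult:
    new_keys = list_new_s3_keys()
    return SensorResult(
        run_requests=[RunRequest(partition_key=key) for key in new_keys],
        # register the new partition keys before the runs are launched
        dynamic_partitions_requests=[pointcloud_files.build_add_request(new_keys)],
    )
```

The idea is that the sensor registers the new partition keys and kicks off the corresponding runs in one shot.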
Finally, we defined a generic asset dependency chain that should apply to all datasets, and we tie the assets to each dataset using AssetMaterialization events: we concatenate the dataset ID and the pipeline stage name to create a unique asset key. Is that the right way to do this sort of thing?
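Concretely, each stage currently logs something like this (the dataset ID, stage name, and metadata below are just illustrative):

```python
from dagster import AssetKey, AssetMaterialization, OpExecutionContext, op


@op
def tile_stage(context: OpExecutionContext) -> None:
    dataset_id = "ds_001"  # in practice this comes from run config
    # ... run the tiling for this dataset ...
    context.log_event(
        AssetMaterialization(
            # dataset ID + stage name concatenated into a unique asset key
            asset_key=AssetKey([dataset_id, "tile"]),
            metadata={"num_files": 42},
        )
    )
```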
Anyway, hope to get some feedback on this structure from the Dagster team and more experienced users 🙂