Another question related to the previous one We run projects dagster #ask-community

Fabio Picchi

04/28/2023, 3:57 PM

Another question related to the previous one. We run projects with hundreds of datasets, and each one of them has to go through a pipeline which I modelled with assets, each asset being the artifact we produce at a certain stage of this pipeline. This, however, leaves me with an abstract dependency tree of assets. Abstract in the sense that it isn't tied to a particular dataset, and our team would like to visualize the state of each dataset in this asset dependency DAG separately. One way to model this is with partitions: each dataset would be a partition of an asset. Another one we found, less convenient, is bundling the dataset ID into the asset key with an AssetMaterialization event. Neither of these seems right, though. The first seems to be the better of the two since we can visualize things better through Dagit, but still, you don't see the DAG for each dataset. You have to click an asset and check the state of each partition, so you never see the state of that particular partition in the pipeline as a whole... One alternative I have yet to explore is asset factories, but having 2 thousand replicas of the asset dependency chain in Dagster also doesn't sound like a good idea. Is there a better way to model this?
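To make the gap described above concrete, here is a minimal plain-Python sketch (not Dagster API) of the partition-per-dataset model and the per-dataset view that is being asked for. The stage names, dataset IDs, and the `dataset_view` helper are all hypothetical, chosen only for illustration. The per-asset mapping mirrors what you see when you click an asset and inspect its partitions; the helper pivots it into the state of one dataset across the whole pipeline.

```python
# Hypothetical stage (asset) names in pipeline order; not taken from the thread.
STAGES = ["ingested", "cleaned", "published"]

# Per-asset view: for each asset, the set of dataset partitions already
# materialized. The dataset IDs are made up for this sketch.
materialized = {
    "ingested": {"ds_001", "ds_002", "ds_003"},
    "cleaned": {"ds_001", "ds_002"},
    "published": {"ds_001"},
}


def dataset_view(materialized, dataset_id, stages=STAGES):
    """Pivot the per-asset view into a per-dataset one.

    Returns a list of (stage, is_materialized) pairs in pipeline order
    for the given dataset partition.
    """
    return [
        (stage, dataset_id in materialized.get(stage, set()))
        for stage in stages
    ]


# State of a single dataset across the pipeline as a whole.
for stage, done in dataset_view(materialized, "ds_002"):
    print(f"{stage:<10} {'materialized' if done else 'missing'}")
```

This only shows the shape of the data behind the request: one row per dataset, one column per stage. It says nothing about how Dagster stores or exposes partition state.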

chris

04/28/2023, 4:29 PM

If you have hundreds / thousands of datasets, multi-dimensional partitions might be the most reasonable approach here.

Fabio Picchi

04/28/2023, 5:58 PM

thanks! Seems to be the case indeed. I have another question on partitions, though

Fabio Picchi

04/28/2023, 5:59 PM

I materialize one partition per dataset (in my company's internal language, we "run a dataset through our pipeline"), and when I rerun, the stale partitions are not shown as stale in Dagit

Fabio Picchi

04/28/2023, 5:59 PM

is that the expected behavior?

chris

04/28/2023, 6:40 PM

I’m confused - are you saying that during the process of re-running they aren’t shown as stale? cc @sean who might have thoughts here

sean

04/28/2023, 7:15 PM

Hey Fabio, yes that’s expected, staleness does not currently work with partitions but it’s a very active area of development.

Fabio Picchi

05/03/2023, 9:36 AM

all right, thanks @sean 👍 Are there plans to consider it? We're mostly interested in the visual traceability/monitoring of our pipeline and we're trying to organize our dagster definitions in a way that things are easily confirmable/visible in dagit

sean

05/04/2023, 6:19 PM

Are there plans to consider it?

Definitely, we are working on it a lot right now. It's just a tough problem, because moving staleness tracking from the asset level to the partition level means you're moving from a few dependencies per node to potentially thousands (as when you have an unpartitioned asset downstream of, say, an hourly-partitioned asset).
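The fan-out mentioned above can be illustrated with a small plain-Python count (again not Dagster API). The 90-day window, the key format, and the `hourly_partition_keys` helper are assumptions made for this sketch. It compares the single asset-level dependency with the number of partition-level dependencies an unpartitioned asset would have on an hourly-partitioned upstream asset.

```python
from datetime import datetime, timedelta


def hourly_partition_keys(start, end):
    """Return one key per hour in the half-open interval [start, end)."""
    keys = []
    current = start
    while current < end:
        # Illustrative key format; the exact format is an assumption.
        keys.append(current.strftime("%Y-%m-%d-%H:%M"))
        current += timedelta(hours=1)
    return keys


# Hypothetical window: 90 days of hourly partitions on the upstream asset.
start = datetime(2023, 1, 1)
end = start + timedelta(days=90)
upstream_partitions = hourly_partition_keys(start, end)

# Asset-level tracking: the downstream asset depends on one upstream asset.
asset_level_edges = 1
# Partition-level tracking: the unpartitioned downstream asset depends on
# every upstream partition.
partition_level_edges = len(upstream_partitions)

print(asset_level_edges, partition_level_edges)  # prints: 1 2160
```

Under these assumptions, a single edge at the asset level becomes 2,160 edges at the partition level, and the count keeps growing by 24 every day.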
