https://dagster.io/ logo
Title
e

Eric Cheminot

03/31/2022, 9:52 AM
Hello ! I'm found the blog post by @sandy really great (actually, all are great, but this one is so enlightening for me!). I have many questions, but there's a point I'm really not comfortable with. There's a reference to Maxime Beauchemin's Functional Data Engineering and in this article, about "late arriving facts" it insists on the fact that time partitioning should always be done on event processing time. But most likely, produced assets are partitioned on the data dates. So how do we reconcile the 2 dimensions? There is no longer obvious lineage link between a produced asset (a partition based on data date) and the input (potentially multiple partitions containing upstream input data). And regarding non-mutability of data "units", combined with the need for reproducibility, don't you think that materialization layer must take care of this and be able to serve the possible versions to Dagster? Or else, it's the processing/ingestion partitioning from the first question that is the only enabler for reproducibility?
s

sandy

03/31/2022, 10:15 PM
Hey @Eric Cheminot - glad you enjoyed the post! I think you bring up some great points. In situations where there can be large gaps between "event time" and "arrival time", I think you often end up with with multiple assets: • Raw/base assets, which are partitioned by arrival time and are 100% immutable • Derived assets, which are partitioned by event time and whose contents are either mutable or versioned The choice between mutable vs. versioned ultimately comes down to a tradeoff between need for perfect immutability vs. the storage + complexity cost of maintaining a versioned asset. For mutable derived assets, each partition essentially needs to depend on many partitions in the upstream assets that are partitioned by arrival time. Dagster can model this many-to-many partition dependency via the
PartitionMapping
abstraction. In practice, it's common to bound partition dependencies with the equivalent of a "watermark". I.e. you say that, if data arrives more than X minutes/hours/days later than its event time, it gets ignored. This makes it more tractable to keep the derived asset up-to-date as new data arrives for the upstream assets. It essentially corresponds to the "standard deviation" header in Maxime's post. For versioned derived assets, one approach is to have a 2-D partitioning where one of the dimensions is the version. Does that answer your question?
e

Eric Cheminot

04/01/2022, 8:19 AM
Hey! Thanks for the detailed answer. And indeed, 2-D partitioning is simple and actually Delta seems more adapted for mutable data while here the pattern is to drop the whole asset before reconstructing it. When storage is cheap, I think it's better and simpler to follow the versioned path (and maybe implement housekeeping policies for derived assets)
v

victor

04/05/2022, 8:55 AM
Is there any dagster software-defined assets example with a 2-D partition mapping? The docs seem to focus on either SWA or partitioned jobs.
s

sandy

04/05/2022, 2:58 PM
Hey Victor - we don't have an example written up on that yet, but I could put something together if you tell me a little about what you're aiming to accomplish
e

Eric Cheminot

04/06/2022, 9:16 AM
Maybe derived from the guide example (https://docs.dagster.io/guides/dagster/software-defined-assets)? • how to get version information (and timestamp) from the Source Asset (assuming a fs layout)? • how to add version (and materialize this) to the produced assets?