Jeff Nawrocki
06/29/2023, 5:57 PM
sean
06/29/2023, 8:00 PM
For data versions, it seems like this should be accomplished “under-the-hood” by hashing the data objects without requiring the code version.
You have the option of doing this by attaching a DataVersion to the Output returned when materializing an asset, but this isn’t something Dagster can do on its own, since Dagster does not concern itself with how physical assets are represented. See this guide:
https://docs.dagster.io/guides/dagster/asset-versioning-and-caching
Jeff Nawrocki
06/29/2023, 8:25 PM
sean
06/29/2023, 8:33 PM
from dagster import asset

@asset
def foo():
    return 1

@asset
def bar(foo):
    return foo + 1

• dagit -f this file
• materialize foo
• materialize bar
• materialize foo again
Then you should see this (sorry it’s “Upstream data”, not “New data”):
Jeff Nawrocki
06/29/2023, 8:40 PM
Jeff Nawrocki
06/29/2023, 8:44 PM
code_version. If I add code versions to your example assets and materialize foo, I don't see “Upstream data”.
sean
06/29/2023, 8:47 PM
With code_version on foo, the data version of foo does not change when you materialize it again (because the code version is the same and it has no inputs). Therefore there is no new “Upstream data” for bar, no matter how many times you materialize foo.
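A toy model of this rule, covering both the case with a code version and the case without one. It is only a sketch of the reasoning in this message: the function name toy_data_version and the hashing scheme are assumptions for illustration, not Dagster's actual implementation.

```python
import hashlib
import uuid
from typing import Optional, Sequence


def toy_data_version(code_version: Optional[str], input_versions: Sequence[str]) -> str:
    """Toy model of a default data version; not Dagster's real algorithm."""
    if code_version is None:
        # No code version: assume the code may have changed, so every
        # materialization gets a fresh, unique data version.
        return uuid.uuid4().hex
    # With a code version: the result depends only on the code version and
    # the data versions of the inputs, so identical inputs give the same value.
    payload = "|".join([code_version, *input_versions])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# foo with a code version and no inputs: rematerializing leaves its version unchanged,
# so bar sees no new upstream data.
assert toy_data_version("1", []) == toy_data_version("1", [])

# foo without a code version: each materialization looks like new upstream data to bar.
assert toy_data_version(None, []) != toy_data_version(None, [])
```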
When there is no code version, we assume the code could have changed each time you materialize it, so the data version changes.
Jeff Nawrocki
06/29/2023, 8:50 PM
What if foo reads in data from S3 and that data is new? In that case, the code is the same but the data is updated.
sean
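06/29/2023, 8:53 PM
from dagster import DataVersion, Output, asset

@asset
def foo():
    # get_data_from_s3 and hash_data are placeholders for your own helpers
    data = get_data_from_s3()
    data_hash = hash_data(data)
    # Attach an explicit data version so staleness tracks the data itself
    return Output(data, data_version=DataVersion(data_hash))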
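hash_data is not defined in the snippet above. One possible sketch, assuming the data is bytes, a string, or something JSON-serializable (the body below is an illustration, not a Dagster helper):

```python
import hashlib
import json
from typing import Any


def hash_data(data: Any) -> str:
    """Return a stable hex digest for bytes, str, or JSON-serializable data."""
    if isinstance(data, bytes):
        raw = data
    elif isinstance(data, str):
        raw = data.encode("utf-8")
    else:
        # sort_keys makes the digest independent of dict key order;
        # default=str is a crude fallback for values json cannot encode.
        raw = json.dumps(data, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()
```

A content hash like SHA-256 is used here rather than Python's built-in hash(), because hash() on strings and bytes is randomized per process and so would not give the same data version across runs.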
Jeff Nawrocki
06/29/2023, 9:11 PM