OK revisiting this as I got distracted and Assets are mainline now - I want to define an asset that:
1. Gathers its source data
2. Compares the newly gathered data to the most recent version of the asset
3. Only if step 2 finds a difference, go ahead and materialise the new version
Obviously, at step 2 I need to access the current asset. What's the idiomatic way to do this?
07/19/2022, 3:04 PM
hi @Ben Gatewood this isn't something we support out of the box. I'm not sure it would work (see reasons below) but if you wanted, you could try writing the code to fetch the current version of the asset from wherever you're storing your assets and then compare with the newly gathered data.
You might run into problems though, because the asset will still return something (even if you don't return any data yourself, the python fn will automatically return None). so this None value might get materialized as the asset.
If you don't mind sharing, what are your reasons for wanting to write your asset this way? there might be another approach to solve the larger problem
07/19/2022, 10:58 PM
Hey Jamie - thanks. So I actually currently do this the way you suggest: I use GCS as my asset storage so I can easily just grab the actual file from there but this feels kinda dirty and like I'm going "around" dagster. The reason is simply that I want to hook materialisation sensors up to these assets but I'd rather they didn't run (and spin up jobs etc on our k8s cluster) if the underlying data hadn't actually changed
@jamie Does this make sense at all? More than happy to hear a better way for this tbh
07/25/2022, 7:10 PM
hey @Ben Gatewood - jumping in here and spitballing an idea: when you write out your asset, you could take a hash of its contents and include that in the metadata. in the sensor, you could use a cursor to store the most recent hash, and when you are considering whether to request a run, you could compare the cursor hash to the hash in the latest metadata?
07/25/2022, 11:07 PM
Thanks, @sandy - yeah I mean that's what I'm doing essentially atm but I hadn't thought of stashing it in the metadata, that's a good call I think. I understand this would probably be non-trivial to generalise but doesn't seem like a particularly unusual motivation to conserve downstream compute