I have a somewhat interesting FreshnessPolicy-esque use case (dagster, #dagster-feedback)

Spencer Nelson

02/13/2023, 7:26 PM

I have a somewhat interesting FreshnessPolicy-esque use case that I think is hard to accommodate today, but could be supported, and might be pretty general. https://ztf.uw.edu/alerts/public/ has many years of scientific data, bundled into one tarball per day. New tarballs are created each night. I am writing ETL jobs that scan those tarballs, pull out a subset of interesting features, and then march off to do other work with those features. Sometimes old tarballs are modified, for example because the archiving code was buggy or there was an intermittent network failure. MD5 checksums of the tarballs are in https://ztf.uw.edu/alerts/public/MD5SUMS. I would like to be able to say something like “Our asset is fresh if its MD5 checksum hasn’t changed since we last materialized it.” When the MD5SUMS file reports a difference (maybe polled once a day), I would like to kick off a new job.
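The freshness rule described above can be sketched in plain Python, independent of any Dagster API. This is a minimal sketch that assumes the MD5SUMS file uses the usual `md5sum` output layout (hex digest, whitespace, filename). The function names `parse_md5sums` and `find_stale` are hypothetical, as are the tarball names in the usage note.

```python
from typing import Dict, List


def parse_md5sums(text: str) -> Dict[str, str]:
    """Parse md5sum-style lines ("<hex digest>  <filename>") into {filename: digest}.

    Assumption: the remote MD5SUMS file follows this layout.
    """
    checksums: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, name = parts
        # md5sum marks binary-mode entries with a leading "*" on the filename
        checksums[name.lstrip("*")] = digest.lower()
    return checksums


def find_stale(current: Dict[str, str], materialized: Dict[str, str]) -> List[str]:
    """Return tarballs that are new, or whose checksum differs from the one
    recorded at the last materialization."""
    return sorted(
        name for name, digest in current.items() if materialized.get(name) != digest
    )
```

A daily poll would fetch the MD5SUMS text, call `parse_md5sums` on it, and pass the result to `find_stale` together with the checksums saved at the last run; any filename returned (for instance a hypothetical `ztf_public_20230102.tar.gz`) is a candidate for a new job.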

rex

02/13/2023, 7:32 PM

@sean Seems like we could model this using observable source assets, yeah? https://docs.dagster.io/guides/dagster/asset-versioning-and-caching#step-three-staleness-with-source-assets

Spencer Nelson

02/13/2023, 7:40 PM

Whoa, how did I not find this part of the docs? Yes.

Spencer Nelson

02/13/2023, 7:49 PM

Ah, it looks like this is very recently added. That makes me feel better!

Spencer Nelson

02/13/2023, 7:50 PM

How does this interact with partitioning? Each partition would have its own upstream logical version here.

Spencer Nelson

02/13/2023, 7:51 PM

(I’m picturing treating each source tarball as a separate partition)

sandy

02/13/2023, 8:30 PM

It currently does not work very well with partitioning, but that's on our near-term roadmap

sandy

02/13/2023, 9:43 PM

here's an issue for tracking: https://github.com/dagster-io/dagster/issues/12314

👍 2

Spencer Nelson

02/13/2023, 9:44 PM

Thanks for opening that. It would be nice if I could write a function that returns a map of partition-key -> logical version value.
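The kind of function requested above, one returning a map of partition key to logical version value, might look roughly like the following. This is only a sketch of the shape of such a function, not an existing Dagster API. It assumes each tarball filename embeds its date as eight consecutive digits (YYYYMMDD) and that the MD5 digest serves as the logical version. The names `partition_key_for` and `logical_versions_by_partition` are hypothetical.

```python
import re
from typing import Dict, Optional

# Assumption: each tarball filename contains its date as YYYYMMDD.
_DATE_RE = re.compile(r"(\d{4})(\d{2})(\d{2})")


def partition_key_for(filename: str) -> Optional[str]:
    """Derive a daily partition key ("YYYY-MM-DD") from a tarball filename."""
    match = _DATE_RE.search(filename)
    return "-".join(match.groups()) if match else None


def logical_versions_by_partition(checksums: Dict[str, str]) -> Dict[str, str]:
    """Map partition key -> logical version, using each tarball's MD5 digest
    as the version value. Files without a recognizable date are skipped."""
    versions: Dict[str, str] = {}
    for filename, digest in checksums.items():
        key = partition_key_for(filename)
        if key is not None:
            versions[key] = digest
    return versions
```

Using the checksum as the version means a partition counts as changed exactly when its tarball's digest changes, which matches the freshness rule from the first message.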
