Noah K

02/05/2021, 10:03 PM
But I would probably implement something more specific in the solid itself

Felipe Saldana

02/05/2021, 10:13 PM
So there isn't anything directly built into the framework. I'm comparing against Prefect, and I see "Output caching" in that framework.

Noah K

02/05/2021, 10:15 PM
There's memoization of solid outputs, and there's asset materializations, but I don't think there's specifically time-based invalidations of those.
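A minimal sketch of the asset materialization side of that, assuming the pre-0.13 `@solid` API (the asset key and values here are illustrative):

```python
from dagster import AssetMaterialization, Output, solid

@solid
def build_result(context):
    result = {"row_count": 42}  # stand-in for the real computation
    # Record that an asset was (re)materialized; Dagit tracks these
    # events per asset key, but does not expire them on a timer.
    yield AssetMaterialization(asset_key="my_table")
    yield Output(result)
```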

Felipe Saldana

02/05/2021, 10:23 PM
Just thinking out loud here: would I "implement something" using a combo of asset mats + custom IO manager? https://docs.dagster.io/_apidocs/io-managers
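One hedged sketch of what that combo could look like: an IO manager that persists a small result object to disk and treats cached entries older than a TTL as stale. This is a hand-rolled illustration, not a built-in feature; the names, paths, and TTL are made up, and true invalidation would still need something to re-run the upstream solid.

```python
import os
import pickle
import time

from dagster import IOManager, io_manager

CACHE_DIR = "/tmp/dagster_cache"  # illustrative location
TTL_SECONDS = 3600                # illustrative time-based cutoff

class TTLCachingIOManager(IOManager):
    def _path(self, step_key, name):
        return os.path.join(CACHE_DIR, f"{step_key}__{name}.pkl")

    def handle_output(self, context, obj):
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(self._path(context.step_key, context.name), "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, context):
        path = self._path(context.upstream_output.step_key,
                          context.upstream_output.name)
        age = time.time() - os.path.getmtime(path)
        if age > TTL_SECONDS:
            # A real implementation would recompute rather than fail.
            raise Exception(f"cached value is {age:.0f}s old; stale")
        with open(path, "rb") as f:
            return pickle.load(f)

@io_manager
def ttl_caching_io_manager(_init_context):
    return TTLCachingIOManager()
```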

Noah K

02/05/2021, 10:24 PM
Some of this also depends on how big the data is
Usually when this comes up, you don't actually want to cache the data itself in Dagster or whatever because it's huge
So you want to upload it to a storage bucket somewhere
And maybe just store a URL or key in the Dagster solid output
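A sketch of that pattern, with boto3 and all bucket/path names as placeholder assumptions:

```python
import boto3
from dagster import solid

@solid
def produce_large_dataset(context):
    local_path = "/tmp/output.parquet"  # assume Spark etc. wrote this
    key = "datasets/output.parquet"
    boto3.client("s3").upload_file(local_path, "my-bucket", key)
    # Only the pointer flows between solids, not the data itself.
    return f"s3://my-bucket/{key}"
```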

sandy

02/05/2021, 10:25 PM
Hey Felipe - this might be relevant to what you're trying to do: https://docs.dagster.io/examples/memoized_development
Happy to discuss in more detail if it would be helpful
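For reference, the core of that linked example is versioning solids so unchanged code and inputs can be skipped; a minimal, hedged sketch (the feature was experimental at the time):

```python
from dagster import solid

@solid(version="1")  # bump the version string when the logic changes
def expensive_solid(_context):
    return sum(range(10_000_000))  # stand-in for expensive work
```

As I read the example, this is paired with a versioned, memoization-aware IO manager so outputs whose versions haven't changed are reused instead of recomputed.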

Felipe Saldana

02/05/2021, 10:26 PM
I would be working with a "result object" of some sort and not the actual data itself
great @sandy let me take a look
... just FYI Spark or another tool would be doing the heavy lifting
@sandy That does look to be the start of what I am going after. Do you have any comments on a custom implementation using asset mats + an IO manager? Would that be another way to solve my example?

mrdavidlaing

02/06/2021, 1:28 PM
@sandy is there a memoized example showing how to compute whether an external dataset (e.g., a table in a datalake) has changed (and should thus trigger re-execution of a solid)?
I’m wondering if there is a better way than something like:
`process_data(fingerprint=compute_fingerprint())`
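A fleshed-out sketch of that pattern, with the fingerprint source purely illustrative:

```python
import hashlib

from dagster import pipeline, solid

@solid
def compute_fingerprint(_context) -> str:
    # Illustrative: hash whatever cheap metadata the datalake exposes.
    metadata = "row_count=123;max_updated_at=2021-02-06"
    return hashlib.sha256(metadata.encode()).hexdigest()

@solid
def process_data(context, fingerprint: str):
    context.log.info(f"processing dataset, fingerprint={fingerprint}")

@pipeline
def fingerprinted_pipeline():
    process_data(fingerprint=compute_fingerprint())
```

This mirrors the one-liner above: the fingerprint solid runs cheaply every time, and process_data receives the fingerprint as an input.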

sandy

02/08/2021, 9:32 PM
Hey @mrdavidlaing - for the dataset that you're trying to detect changes in, are you envisioning it's a dataset that's generated by a solid in your pipeline or a dataset that's computed by some other process?

mrdavidlaing

02/08/2021, 9:38 PM
By some other process - e.g., a separate import process into a datalake

sandy

02/08/2021, 9:42 PM
Gotcha. I think there's no silver bullet, but the ones that come to mind for me:
• The approach you suggested - computing a fingerprint of the contents to see whether they've changed
• Looking at the date the dataset was updated
• Having the import process register some sort of event somewhere to say that the dataset has changed, and then consuming that event
👍 1
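A sketch of the second and third bullets using a Dagster sensor (available as of 0.10); the metadata lookup and pipeline name are hypothetical stand-ins:

```python
import os

from dagster import RunRequest, sensor

def get_last_modified(table: str) -> float:
    # Hypothetical: in practice, query the datalake's metadata/catalog.
    return os.path.getmtime(f"/datalake/{table}")

@sensor(pipeline_name="process_data_pipeline")
def dataset_changed_sensor(_context):
    last_modified = get_last_modified("my_table")
    # Dagster de-duplicates on run_key, so an unchanged timestamp
    # does not launch another run.
    yield RunRequest(run_key=str(last_modified), run_config={})
```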