Prasad Chalasani

01/28/2021, 7:59 PM
hi all, liking dagster a lot from the tutorial, planning to use it with dbt to manage a workflow consisting of transforming a dataset from bigquery, then running ML on it, and putting results back in bigquery. I know this was asked before, but one question: when re-running a pipeline containing heavy computations, are there mechanisms for avoiding re-computation of parts whose inputs haven’t changed? Let’s say the solids’ inputs [outputs] are results read from [dumped to] files. Essentially I’m looking for some type of caching/invalidation mechanism.
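
For reference, a minimal sketch of the workflow being described, using the solid/pipeline API; the solid names and the file-based handoff are illustrative assumptions, not a prescribed Dagster pattern:

```python
from dagster import pipeline, solid


@solid
def transform_dataset(context) -> str:
    # e.g. pull a dataset from BigQuery, transform it, dump the result to a file
    path = "/tmp/transformed.parquet"
    context.log.info(f"wrote transformed data to {path}")
    return path


@solid
def run_ml(context, transformed_path: str) -> str:
    # e.g. train/score a model on the file produced upstream
    results_path = "/tmp/ml_results.parquet"
    context.log.info(f"ran ML on {transformed_path}, wrote {results_path}")
    return results_path


@solid
def load_results(context, results_path: str):
    # e.g. load the results file back into BigQuery
    context.log.info(f"loading {results_path} back into bigquery")


@pipeline
def bq_ml_pipeline():
    load_results(run_ml(transform_dataset()))
```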

johann

01/28/2021, 8:01 PM
cc @chris

chris

01/28/2021, 8:03 PM
Hey @Prasad Chalasani! We actually have an experimental feature for versioning/memoization of pipeline runs. An example intended to get someone up and running with these features is coming out later today with our mini-release.
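
Roughly, the experimental API around that release works like this: each solid declares a `version` string, a memoization-aware IO manager reports whether a versioned output already exists, and the run is tagged as memoized so unchanged steps can be skipped. The import path, tag name, and `context.version` plumbing below are assumptions based on the experimental docs of the time; check the example chris mentions for the current form:

```python
import os
import pickle

from dagster import ModeDefinition, execute_pipeline, io_manager, pipeline, solid

# Assumption: 0.10-era location of the experimental base class.
from dagster.core.storage.memoizable_io_manager import MemoizableIOManager


class VersionedFileIOManager(MemoizableIOManager):
    # Each output is stored at a path derived from its computed version;
    # has_output() lets Dagster skip steps whose versioned output already exists.
    def _path(self, context):
        return os.path.join("/tmp/memoized", context.version)

    def has_output(self, context):
        return os.path.exists(self._path(context))

    def handle_output(self, context, obj):
        os.makedirs("/tmp/memoized", exist_ok=True)
        with open(self._path(context), "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, context):
        with open(self._path(context.upstream_output), "rb") as f:
            return pickle.load(f)


@io_manager
def versioned_io_manager(_):
    return VersionedFileIOManager()


# The version string feeds the cache key: bump it when the solid's logic changes.
@solid(version="1")
def expensive_transform(_):
    return 42


@pipeline(mode_defs=[ModeDefinition(resource_defs={"io_manager": versioned_io_manager})])
def memoized_pipeline():
    expensive_transform()


if __name__ == "__main__":
    # Assumption: memoized runs were opted into via this run tag.
    execute_pipeline(memoized_pipeline, tags={"dagster/is_memoized_run": "true"})
```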

Prasad Chalasani

01/28/2021, 8:03 PM
great, thanks for the fast response!

chris

01/28/2021, 8:04 PM
Once the docs are updated, I'd be happy to link that to you, and would love to get your feedback on the user experience

Prasad Chalasani

01/28/2021, 8:04 PM
happy to give feedback
👍 1

cat

01/29/2021, 12:53 AM
you can also re-execute a solid subselection, using the upstream outputs from a previous run. let me find the docs...
here’s a video that shows loading a finished run, selecting one of the solids (you can also select a subset), and then launching a new run with the previous run’s outputs: https://www.loom.com/share/b02dea352c034c15b671307ecd71f0b9
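
The same flow the video shows in Dagit, sketched via the Python API of that era (reusing `bq_ml_pipeline` from the sketch above; parameter names may differ across versions):

```python
from dagster import DagsterInstance, execute_pipeline, reexecute_pipeline

# An instance that persists run history (e.g. with DAGSTER_HOME set).
instance = DagsterInstance.get()

# First, a full run; its outputs must be written by a persistent IO manager
# for a later re-execution to load them (see the note below).
full_run = execute_pipeline(bq_ml_pipeline, instance=instance)

# Then re-execute only a subselection of solids, loading upstream
# outputs from the parent run instead of recomputing them.
reexecute_pipeline(
    bq_ml_pipeline,
    parent_run_id=full_run.run_id,
    step_selection=["run_ml"],
    instance=instance,
)
```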

Prasad Chalasani

01/29/2021, 1:12 AM
will take a look, thank you!

cat

01/29/2021, 1:18 AM
Oh also, this only works if the intermediates are written to persistent storage (i.e. not held in memory), so S3, for example, would work
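
Concretely, that could mean pointing the pipeline's io_manager at S3 via the dagster-aws integration (the built-in fs_io_manager, which writes to local disk, also counts as persistent). The pipeline below reuses the solids from the earlier sketch, and the bucket name in the run config is a placeholder:

```python
from dagster import ModeDefinition, pipeline
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource


@pipeline(
    mode_defs=[
        ModeDefinition(
            resource_defs={"io_manager": s3_pickle_io_manager, "s3": s3_resource}
        )
    ]
)
def bq_ml_pipeline_s3():
    # same solids as the earlier sketch; intermediates are now pickled to S3,
    # so a later run or re-execution can load them back
    load_results(run_ml(transform_dataset()))


# run config then supplies the bucket, e.g.:
# resources:
#   io_manager:
#     config:
#       s3_bucket: my-bucket  # placeholder
```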