Matyas Tamas

02/10/2020, 7:57 PM
Hi all! New user here. I'm really liking the philosophy and documentation of the framework and am trying it out on a client project. I have some n00b questions:
Looking at the last couple of weeks of chat history, it sounds like lots of folks are building feature engineering / ML pipelines (as am I) and trying to get intermediate memoization to fit their use case. I find the current UI for this somewhat inconvenient, since my source data does not update frequently and I'm not running jobs interactively or through dagit; re-using intermediates across jobs and configs seems to fit my use case better. If I understand correctly, this is possible, but requires specifying a particular job run's intermediate output. Someone suggested splitting the pipeline into parts, but this is also inconvenient, since experimentation on the feature engineering parts of the pipeline generates lots of different possible sub-pipelines, so this ends up being close to making a pipeline per solid.
The best options I could think of are either:
a. Build caching logic into solids; this seems unfortunate since most of the work for what I want is already built into the intermediates framework
b. Hack IntermediateStore paths to: (i) not incorporate job_id, (ii) have versioned solids (to invalidate caches when logic changes), (iii) pass along upstream solid dependencies (and their versions) to incorporate into the intermediate path key (so upstream version changes will bump the downstream path)
Before going further down either path, since this seems like a pretty common use case, I wanted to check if there are other better options (or if either of these is a bad idea).
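
To make option (b) concrete, here is a rough sketch of how such a path key might be derived. It does not use any Dagster API: `SolidSpec`, `cache_key`, and `intermediate_path` are hypothetical names invented for illustration, and the versions are assumed to be hand-maintained strings. The key leaves out job_id, includes the solid's own version and config, and folds in the keys of its upstream solids so that an upstream version bump changes every downstream path.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SolidSpec:
    """Hypothetical description of one solid (not a Dagster class)."""
    name: str
    version: str                                        # bumped by hand when the solid's logic changes
    upstream: List[str] = field(default_factory=list)   # names of the solids it depends on


def cache_key(name: str,
              solids: Dict[str, SolidSpec],
              configs: Optional[Dict[str, dict]] = None) -> str:
    """Return a run-independent key for the output of solid `name`.

    The key hashes the solid's name, version, and config together with the
    keys of all upstream solids, computed recursively. No job/run id is used.
    """
    configs = configs or {}
    spec = solids[name]
    payload = {
        "name": spec.name,
        "version": spec.version,
        "config": configs.get(name, {}),
        # Sorting makes the key independent of the order dependencies are listed in.
        "upstream": [cache_key(u, solids, configs) for u in sorted(spec.upstream)],
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def intermediate_path(root: str,
                      name: str,
                      solids: Dict[str, SolidSpec],
                      configs: Optional[Dict[str, dict]] = None) -> str:
    """Build a storage path from the key (assumed layout: root/solid_name/short_key)."""
    return f"{root}/{name}/{cache_key(name, solids, configs)[:16]}"


if __name__ == "__main__":
    graph = {
        "load": SolidSpec("load", "1"),
        "features": SolidSpec("features", "1", ["load"]),
    }
    before = intermediate_path("/tmp/intermediates", "features", graph)
    graph["load"] = SolidSpec("load", "2")  # upstream logic changed
    after = intermediate_path("/tmp/intermediates", "features", graph)
    print(before != after)  # the downstream path moves when an upstream version is bumped
```

A solid would then check whether something already exists at its path before recomputing. This sketch assumes configs are JSON-serializable and that the dependency graph is acyclic; it also recomputes upstream keys on every call, which is fine for small graphs but would want memoizing for large ones.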


02/10/2020, 8:32 PM
hi @Matyas Tamas -- I think that some version of b) is the right answer, and we should probably incorporate support for this workflow into the core
would love it if you could open an issue on the GitHub repo that we could use to track this case -- and would be very interested to see anything that you hack together
versioning solids in a general way seems like the hardest part of this workflow
hm, actually, we already have this:
and would love for you to add your thoughts

Matyas Tamas

02/10/2020, 8:58 PM
thanks for the guidance - I'll take a stab and let you know how it goes 🙂


02/10/2020, 9:15 PM
we definitely need a better general solution - but another stop-gap approach may be to use the re-execution parameters on

Matyas Tamas

02/10/2020, 9:20 PM
oh interesting