fred

02/04/2020, 10:57 AM
Are there any near-term plans for caching?

abhi

02/04/2020, 4:07 PM
Hi Fred. That’s a great point, and I’d love to flesh this out a bit more. Dagster currently stores the outputs of solids as intermediates, which means that if you have run storage set up you can, in theory, re-execute parts of a pipeline over and over again in dagit. However, you are right that if you are tweaking config (i.e. your hyperparameter search space) between runs, then you lose everything. So why not split your code into two pipelines: one that handles feature generation and another that grabs a dataset and trains a model? That way you run the first pipeline once and can then run your training pipeline over and over again.
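A rough sketch of that two-pipeline split, written in plain Python rather than against Dagster's API. The function names, the JSON dataset file, and the toy model are all hypothetical and only illustrate the shape of the idea: the first stage persists its output once, and the second stage reloads it for each hyperparameter setting.

```python
import json
from pathlib import Path


def feature_pipeline(raw_rows, dataset_path):
    """Pipeline 1 (hypothetical): derive features once and persist them to disk."""
    features = [{"x": row, "x_squared": row * row} for row in raw_rows]
    Path(dataset_path).write_text(json.dumps(features))
    return dataset_path


def training_pipeline(dataset_path, learning_rate, epochs):
    """Pipeline 2 (hypothetical): reload the dataset and fit a toy model w * x ~ x_squared."""
    features = json.loads(Path(dataset_path).read_text())
    w = 0.0
    for _ in range(epochs):
        for row in features:
            # Plain stochastic gradient step on the squared error.
            error = w * row["x"] - row["x_squared"]
            w -= learning_rate * error * row["x"]
    return w
```

With this layout, `feature_pipeline` would be called once, and `training_pipeline` could then be called repeatedly with different `learning_rate` and `epochs` values without regenerating the features.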

fred

02/04/2020, 5:12 PM
Hi @abhi! Let me rephrase my question: is caching on the roadmap at all? And if yes, do you have a vision of how it should be implemented (if one would like to contribute to the project)?

abhi

02/04/2020, 5:26 PM
No, it’s not currently on the roadmap, but that’s all subject to user requests. We haven’t had enough signal to justify taking it on, and the user problem isn’t very well fleshed out yet. I don’t have a vision offhand, as the general solution would be really involved, but I’d say it’s worth implementing a local solution for your use case and spinning up a PR. That will help all of us design towards a better solution!

fred

02/04/2020, 5:33 PM
Ok thanks! I’ll take a look! I suppose it would involve linking the intermediates to a hash based on task name and input variables. Flyte.org uses manual cache invalidation, while Prefect uses a time-based approach.
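To make the hashing idea concrete, here is a minimal sketch in plain Python. It is not Dagster, Flyte, or Prefect code: `cache_key`, `run_cached`, the pickle files, and the `max_age_seconds` parameter are all assumptions made for illustration. The key is a SHA-256 hash of the task name plus JSON-serializable inputs. Leaving `max_age_seconds` unset means entries stay valid until the file is deleted by hand (manual invalidation), while setting it gives a simple time-based expiry.

```python
import hashlib
import json
import pickle
import time
from pathlib import Path


def cache_key(task_name, inputs):
    """Hash the task name together with its JSON-serializable inputs."""
    payload = json.dumps({"task": task_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(task_name, fn, inputs, cache_dir, max_age_seconds=None):
    """Return the stored result for (task_name, inputs), recomputing on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(task_name, inputs)}.pkl"

    if path.exists():
        age = time.time() - path.stat().st_mtime
        # Without max_age_seconds the entry never expires on its own.
        if max_age_seconds is None or age <= max_age_seconds:
            with path.open("rb") as f:
                return pickle.load(f)

    # Cache miss or expired entry: compute and store the intermediate.
    result = fn(**inputs)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

A call such as `run_cached("build_features", build_features, {"n": 3, "scale": 2}, "cache")` would run the hypothetical `build_features` only the first time. Changing either the task name or any input produces a different key, so tweaked config does not reuse stale results. Note that this sketch does not detect changes to the task's code itself.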

max

02/04/2020, 7:11 PM
we have this issue open: https://github.com/dagster-io/dagster/issues/1784 and would love further input