Cris

08/07/2020, 5:17 PM
Hi people. Do you know if, when using persistent storage for outputs, it's possible to have some things not be saved forever? In one node I have a model that outputs a bunch of things, most of which are only used by subsequent calculations in the pipeline. On top of that, some are heavy and we wouldn't like our storage to explode. Is there a way to handle this without deleting the runs?

matas

08/07/2020, 7:13 PM
also interested

Cris

08/07/2020, 9:47 PM
An idea from @sashank was: if using S3, set a lifecycle policy to expire objects after a certain time, and if using local storage, use a cron job to remove old run data. Still, it'd be nice to be able to mark certain outputs as ephemeral, so they're kept only in memory. That could complicate the whole data layer though.
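Roughly what I mean for the S3 side, as a sketch with boto3 (the bucket name and prefix here are made up, adjust for wherever your intermediates land):
```python
import boto3

s3 = boto3.client("s3")

# Expire objects under the intermediates prefix after 30 days.
# Bucket name and prefix are assumptions about your storage layout.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-dagster-storage",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-intermediates",
                "Filter": {"Prefix": "dagster/storage/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},  # older objects get deleted automatically
            }
        ]
    },
)
```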

matas

08/08/2020, 10:55 AM
we have a distributed setup, so in-memory is not an option; we have to use S3. But it would be nice to have ephemeral outputs in the sense that their lifetime is limited to a complete pipeline cycle, so they could all be purged once it succeeds
but if it fails, they should probably stay, so you can explore the failure and reproduce it
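something like this sketch is what I'm imagining, if intermediates were written under a run-scoped prefix (the prefix layout and helper are hypothetical, not a Dagster API — you'd call it yourself only after a successful run):
```python
import boto3

def purge_run_intermediates(bucket: str, run_id: str) -> None:
    """Delete all objects under this run's prefix (assumed layout)."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    prefix = f"dagster/storage/{run_id}/"  # assumption: run-scoped prefix
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if objects:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": objects})

# Call only when the run succeeded (e.g. from a final step or an external
# script); on failure, skip it so the outputs stay around for debugging.
```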

Cris

08/10/2020, 4:10 PM
Hmmm, that makes a lot of sense. Then the idea would be to mark somehow which outputs are ephemeral, so pipeline data can be cleaned up on pipeline success. I wonder how difficult that would be to implement
Still, with S3 you could set up a lifecycle policy on the bucket so that older data gets purged automatically, unless you really need some outputs long after the run
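for that case, a tag-filtered variant of the lifecycle rule could expire only objects explicitly tagged as ephemeral, leaving long-lived outputs untouched (tag key/value are just assumptions; you'd tag the objects when writing them):
```python
import boto3

s3 = boto3.client("s3")

# Only objects tagged lifetime=ephemeral get expired; untagged outputs
# are never touched by this rule.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-dagster-storage",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-ephemeral-only",
                "Filter": {"Tag": {"Key": "lifetime", "Value": "ephemeral"}},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)
```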