
dagster

Hi all, quick question as I'm wrapping my head around the framework.

With regard to ops and IO managers: ops are by definition meant for small calculations, but IO managers (unless using the in-memory one) store data somewhere, so isn't a lot of read/write overhead being introduced? Particularly when working with big data, do we really want to store/duplicate the data at each small stage of processing (as opposed to at key checkpoints)? What is the reasoning behind this, given the redundancy (large tables per step per run, so storage balloons)?

Or is the idea that one stays in memory for certain ops and then stores at others?

Thanks heaps 

Hi PB, one pattern is to use IO managers to pass pointers to wherever your table or other piece of data is stored, rather than passing the full table around directly.
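To make that pattern concrete, here is a rough sketch of what it could look like. It is plain Python so it runs without Dagster installed; in a real job each function would presumably be decorated with `@op`, and the IO manager would then only persist the short path string each one returns. The function names, file names, and the CSV format are all made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Assumption: in Dagster each function below would be an @op. They are plain
# functions here so the sketch is self-contained.


def extract(workdir: Path) -> str:
    """Write the raw table to storage and return only its location."""
    path = workdir / "raw.csv"  # hypothetical file name
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])
        writer.writerows([[1, 10], [2, 20], [3, 30]])
    return str(path)  # the pointer: a short string, not the table itself


def double_values(raw_path: str) -> str:
    """Read the table via its pointer, write a derived table, return a new pointer."""
    out_path = Path(raw_path).with_name("doubled.csv")  # hypothetical file name
    with open(raw_path, newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["id", "value"])
        writer.writeheader()
        for row in reader:
            writer.writerow({"id": row["id"], "value": int(row["value"]) * 2})
    return str(out_path)


def total(path: str) -> int:
    """Stream the table from its location and sum the value column."""
    with open(path, newline="") as f:
        return sum(int(row["value"]) for row in csv.DictReader(f))


workdir = Path(tempfile.mkdtemp())
raw_pointer = extract(workdir)                # only a path crosses the op boundary
doubled_pointer = double_values(raw_pointer)  # downstream step reads via the path
result = total(doubled_pointer)
```

With this shape, the step itself decides where and when the big table gets written (for example, only at key checkpoints), and what moves between steps is a few bytes of path or table name.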