Hi all - quick q, I'm wrapping my head around the framework.
With regards to Ops and IO managers: given that Ops are, by definition, for small calculations, while IO managers (unless in-memory) store data somewhere - isn't a lot of read/write overhead being introduced? Particularly when working with big data, do we really want to store/duplicate the data at each small stage of processing (as opposed to at key checkpoints)? What's the reasoning behind this, given the redundancy (a large table per step per run adds up quickly)?
Or is the intent that one stays in memory for certain Ops and stores at others?
04/25/2022, 1:27 PM
Hi PB - one pattern, rather than passing a full table through IO managers directly, is to use them to pass pointers to wherever your table/other piece of data is stored.
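A minimal sketch of that pointer-passing pattern (this is a hypothetical standalone class for illustration, not Dagster's actual `IOManager` base class - in Dagster you would subclass `IOManager` and implement `handle_output`/`load_input` similarly): the op writes the big table to storage itself and returns only its path, so the IO manager serializes a tiny string per step instead of the whole table.

```python
import json
import os
import tempfile


class PointerIOManager:
    """Hypothetical IO manager that persists only a lightweight reference
    (e.g. a file path or warehouse table name), never the data itself."""

    def __init__(self, directory):
        self.directory = directory

    def handle_output(self, step_key, pointer):
        # Store just the pointer string for this step's output.
        with open(os.path.join(self.directory, f"{step_key}.json"), "w") as f:
            json.dump({"pointer": pointer}, f)

    def load_input(self, step_key):
        # Downstream steps get the pointer back and read the data themselves.
        with open(os.path.join(self.directory, f"{step_key}.json")) as f:
            return json.load(f)["pointer"]


# Usage: an upstream "op" writes the big data to storage and emits only a path.
workdir = tempfile.mkdtemp()
big_table_path = os.path.join(workdir, "table.csv")
with open(big_table_path, "w") as f:
    f.write("id,value\n1,10\n2,20\n")

mgr = PointerIOManager(workdir)
mgr.handle_output("upstream_op", big_table_path)  # persists ~50 bytes, not the table
pointer = mgr.load_input("upstream_op")           # downstream op receives the path
```

With this shape, per-step storage cost is constant regardless of table size; the big data lives in one place, and only references move between steps.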