If I need to temporarily store data during an op (...
# ask-community
t
If I need to temporarily store data during an op (retrieve files, sort them individually, write them to files, then merge the sorted files into the final result), what's the canonical way to do it? Just use some temp directory, or is there a special temp directory in Dagster? Make the intermediates op outputs (I don't need them later)? If I want to have a file as a Dagster output, which (Python) type would I use for this?
v
If I understand the question correctly, you can use a custom/different IO Manager for specific ops. If you’re using them to materialize assets instead of running jobs, you can look into Graph-Backed Assets.
t
How do I hand something over to an IO Manager without loading it into memory first?
v
As far as I know, the default behavior in (all?) out-of-the-box IO managers is to load the data into memory for processing, e.g. the S3 IO Manager will save the output of `op_1` into S3 and load it into memory for processing by `op_2`. You could of course just return the file paths if you don’t need to process the underlying data. What are the ops/steps doing exactly? I might be able to sketch out something more appropriate.
t
I've got blob storage, where files are stored as a/b/c{1,...n}. I generate a DynamicPartition with keys "a/b" for all a & b. Now I need to fetch all c1..n, which are large files that need to be sorted and appended into a resulting "file".
I am at the point where I have the resulting file on my hard drive (below some temporary directory) and I want to hand it back to dagster for storage, best without loading it into memory completely.
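For reference, the sort-then-merge step described above can be sketched with the standard library alone. This is a minimal sketch, not Dagster-specific: the function names `sort_file` and `merge_sorted_files` are hypothetical, and it assumes text files where every line ends with a newline and where each individual input file fits in memory (only the merge is streamed).

```python
import heapq
import tempfile
from pathlib import Path
from typing import Iterable


def sort_file(src: Path, dst: Path) -> None:
    """Sort the lines of one file (assumes a single file fits in memory)."""
    with src.open() as f:
        lines = sorted(f)
    with dst.open("w") as f:
        f.writelines(lines)


def merge_sorted_files(sources: Iterable[Path], result: Path) -> Path:
    """Sort each source into a temp directory, then stream-merge into result."""
    with tempfile.TemporaryDirectory() as tmp:
        sorted_paths = []
        for i, src in enumerate(sources):
            dst = Path(tmp) / f"sorted_{i}.txt"
            sort_file(src, dst)
            sorted_paths.append(dst)

        handles = [p.open() for p in sorted_paths]
        try:
            with result.open("w") as out:
                # heapq.merge reads lazily, so only one line per file is held at a time
                out.writelines(heapq.merge(*handles))
        finally:
            # Close before the temp directory is cleaned up
            for h in handles:
                h.close()
    return result
```

The intermediates live only inside the `TemporaryDirectory` block, so nothing needs to be declared as an op output; only `result` survives.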
z
I think if you really want to leverage the IOManager concept, one way would be to compile your files into a file object handle / BytesIO stream from within your op, then return that from the op. Then a custom IOManager could pick up the output from memory and store it. But you could also just pass the IOManager the path to where you wrote the aggregate file, and have it just store the path in memory / in a database / on disk as a small file that just contains the path to your larger file. Then if you need to load that output from another op, the IOManager can use that to find your large file. If you don't need to load the large aggregate file from a downstream op / asset, then you don't really need the IOManager to do anything.
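A rough sketch of the "pass the path" idea follows. It is deliberately framework-free so it runs standalone: the class name `FileCopyIOManager` and the storage layout are hypothetical, and in Dagster the class would presumably subclass `dagster.IOManager`, whose `handle_output(context, obj)` and `load_input(context)` methods it mirrors. It assumes the op returns the path of the finished file and that the context exposes `step_key`, `name`, and `upstream_output`.

```python
import shutil
from pathlib import Path


class FileCopyIOManager:
    """Hypothetical sketch; in Dagster this would subclass dagster.IOManager."""

    def __init__(self, base_dir: str) -> None:
        self.base_dir = Path(base_dir)

    def _target(self, context) -> Path:
        # Assumed layout: <base_dir>/<step_key>/<output name>
        return self.base_dir / context.step_key / context.name

    def handle_output(self, context, obj) -> None:
        # obj is the path the op returned, not the file contents
        target = self._target(context)
        target.parent.mkdir(parents=True, exist_ok=True)
        # copyfile copies the file without reading it all into Python memory
        shutil.copyfile(obj, target)

    def load_input(self, context) -> Path:
        # Downstream ops receive a path and decide themselves how to read it
        return self._target(context.upstream_output)
```

For blob storage, the `shutil.copyfile` call would be replaced by whatever streaming upload the storage client offers; the point is only that a path, not the data, crosses the op boundary.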
t
So passing around file object handles as assets is OK in Dagster? Will built-in IO managers handle that as well?
v
Depends on the IO Manager, e.g. if you use the `fs_io_manager`, this logic will be executed, calling `pickle.dump` on the returned object. I don’t know off the top of my head if that would support a BytesIO/file object, but you could adapt the logic into your own IO Manager that will.
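The pickling question can be checked with plain Python, independent of Dagster. In this small sketch, the helper `can_pickle` is a hypothetical name. In CPython an in-memory `io.BytesIO` can be pickled, while an open file handle cannot.

```python
import io
import pickle
import tempfile


def can_pickle(obj) -> bool:
    """Return True if pickle.dumps accepts the object."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


if __name__ == "__main__":
    print(can_pickle(io.BytesIO(b"abc")))  # in-memory buffer
    with tempfile.TemporaryFile() as fh:
        print(can_pickle(fh))  # real open file handle
```

Note that a `BytesIO` still holds the whole payload in memory, so it does not help with the goal of avoiding a full load; returning a path is the lighter option.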