If I need to temporarily store data during an op retrieve fi dagster #ask-community

Tobias Pankrath

03/01/2023, 1:36 PM

If I need to temporarily store data during an op (retrieve files, sort them individually, write them to files, then merge the sorted files into the final result), what's the canonical way to do it? Just use some temp directory, or is there a special temp directory in Dagster? Should I make the intermediates op outputs (I don't need them later)? And if I want a file as a Dagster output, which (Python) type would I use for it?
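A minimal sketch of the workflow described in the question, using only the Python standard library and no Dagster-specific temp directory. The function name sort_and_merge, the file names, and the line-oriented text format are assumptions made for illustration; the sketch also assumes each individual input fits in memory while the merged result may not.

```python
import heapq
import tempfile
from pathlib import Path


def strip_newline(line):
    """Sort key: compare lines without their trailing newline."""
    return line.rstrip("\n")


def sort_and_merge(input_paths, result_path):
    """Sort each input file on its own, then k-way merge the sorted runs."""
    result_path = Path(result_path)
    # The scratch directory and everything in it is deleted when the block exits.
    with tempfile.TemporaryDirectory(prefix="op_scratch_") as scratch:
        run_paths = []
        for i, src in enumerate(input_paths):
            run = Path(scratch) / f"run_{i}.txt"
            # Assumption: a single input file fits in memory.
            with open(src) as fh:
                lines = sorted(strip_newline(line) for line in fh)
            with open(run, "w") as out:
                out.writelines(line + "\n" for line in lines)
            run_paths.append(run)

        # heapq.merge is lazy: it keeps only one line per run in memory.
        handles = [open(p) for p in run_paths]
        try:
            with open(result_path, "w") as out:
                out.writelines(heapq.merge(*handles, key=strip_newline))
        finally:
            for handle in handles:
                handle.close()
    return result_path
```

The result file is written outside the scratch directory, so it survives the cleanup and its path can be returned from the op.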

Vinnie

03/01/2023, 2:16 PM

If I understand the question correctly, you can use a custom/different IO Manager for specific ops. If you’re using them to materialize assets instead of running jobs, you can look into Graph-Backed Assets.

Tobias Pankrath

03/01/2023, 2:37 PM

How do I hand something over to an IO Manager without loading it into memory first?

Vinnie

03/01/2023, 2:52 PM

As far as I know, the default behavior in (all?) out-of-the-box IO managers is to load the data into memory for processing, e.g. the S3 IO Manager will save the output of op_1 into S3 and load it into memory for processing by op_2. You could of course just return the file paths if you don’t need to process the underlying data. What are the ops/steps doing exactly? I might be able to sketch out something more appropriate.

Tobias Pankrath

03/01/2023, 3:26 PM

I've got blob storage, where files are stored as a/b/c{1,...n}. I generate a DynamicPartition with keys "a/b" for all a & b. Now I need to fetch all c1..n, which are large files that need to be sorted and appended into a resulting "file".

Tobias Pankrath

03/01/2023, 3:27 PM

I am at the point where I have the resulting file on my hard drive (below some temporary directory) and I want to hand it back to dagster for storage, best without loading it into memory completely.

Zach

03/01/2023, 6:57 PM

I think if you really want to leverage the IOManager concept, one way would be to compile your files into a file object handle / BytesIO stream from within your op, then return that from the op. A custom IOManager could then pick up the output from memory and store it. But you could also just pass the IOManager the path to where you wrote the aggregate file, and have it store the path in memory / in a database / on disk as a small file that just contains the path to your larger file. Then, if you need to load that output from another op, the IOManager can use that to find your large file. If you don't need to load the large aggregate file from a downstream op / asset, then you don't really need the IOManager to do anything.
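The path-based suggestion can be sketched roughly as follows. The class name LargeFileIOManager and the directory layout are hypothetical. In a real project the class would subclass dagster.IOManager, which expects these two methods; the base class is left out so the sketch runs without Dagster installed. The attributes step_key, name and upstream_output are the ones Dagster's output and input contexts provide.

```python
import shutil
from pathlib import Path


class LargeFileIOManager:
    """Path-based IO manager sketch: ops return a file path, not file contents.

    Assumption: a real implementation would subclass dagster.IOManager.
    """

    def __init__(self, base_dir):
        self.base_dir = Path(base_dir)

    def _target(self, step_key, name):
        # Hypothetical layout: <base_dir>/<step key>/<output name>
        return self.base_dir / step_key / name

    def handle_output(self, context, obj):
        # obj is the path the op returned.
        target = self._target(context.step_key, context.name)
        target.parent.mkdir(parents=True, exist_ok=True)
        # copyfileobj streams in fixed-size chunks, so memory use stays flat
        # regardless of the file size.
        with open(obj, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst, length=1024 * 1024)

    def load_input(self, context):
        # Downstream ops receive a path and decide how to read the file.
        upstream = context.upstream_output
        return self._target(upstream.step_key, upstream.name)
```

With this shape the op's return type is simply a path (str or pathlib.Path). For partitioned runs the target path would also need to include the partition key, otherwise each partition overwrites the previous one.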

Tobias Pankrath

03/02/2023, 9:09 AM

So passing around file object handles as assets is OK in Dagster? Will the built-in IO managers handle that as well?

Vinnie

03/02/2023, 9:14 AM

Depends on the IO Manager, e.g. if you use the fs_io_manager, this logic will be executed, calling pickle.dump on the returned object. I don’t know off the top of my head if that would support a BytesIO/Fileobj, but you could adapt the logic into your own IO Manager that will.
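The open question about pickle can be checked in plain Python, independent of Dagster. The snippet below only shows how pickle itself treats the two kinds of object; the file name result.bin is made up, and how a particular Dagster version reports the failure is not covered.

```python
import io
import pickle
import tempfile
from pathlib import Path

# An in-memory buffer pickles fine, but the whole payload sits in RAM.
buffer = io.BytesIO(b"sorted rows")
restored = pickle.loads(pickle.dumps(buffer))

# An open file handle wraps an OS-level resource and cannot be pickled.
with tempfile.TemporaryDirectory() as scratch:
    path = Path(scratch) / "result.bin"
    path.write_bytes(b"sorted rows")
    with open(path, "rb") as handle:
        try:
            pickle.dumps(handle)
            handle_pickles = True
        except TypeError:
            handle_pickles = False
```

So a BytesIO would survive a pickle-based IO manager at the cost of holding everything in memory, while a real file handle would not, which is one reason to return a path instead.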
