I am trying to write an IO manager that wraps another IO man dagster #ask-community

Bertrand Nouvel

05/04/2023, 5:23 PM

I am trying to write an IO manager that "wraps" another IO manager. Use cases I imagine include asset deduplication (changing keys and storing the mapping in a database resource), anonymisation, changing the data representation, repartitioning data, and computing analytics in a way that is independent of the underlying storage. I have tried various mechanisms, such as `required_resource_keys`, but I always end up with problems passing the correct context when calling `handle_output` on the wrapped `io_manager`. It seems the context, and the configuration it carries, needs to be duplicated or modified when it is passed to the sub-resources. That is fine in theory, but complex in practice because it is not well documented. I wonder if anyone has done something similar, or if there are recommendations or advice on the best way to achieve this.

sean

05/08/2023, 4:16 PM

Hi Bertrand, interesting question. Could you provide more context on why you’d like to use this “wrapping” pattern instead of just subclassing an IO manager?

Bertrand Nouvel

05/08/2023, 4:33 PM

Hi Sean, thanks for the follow-up. Actually, I have managed to make progress on this. The reason I was considering this pattern is the shape of the data. I have a database that evolves through time and corresponds to a list of financial assets. On most days it will be very similar, so it does not seem like a good idea to save the entire asset over and over again in file storage or in a database. At the same time, I want to be aware if that universe changes due to a change in our code or a change of providers, and I'd like to be able to go back to previous versions. Dagster has the right abstract concepts (`code_version`, `data_version`, etc.). However, storing hundreds of copies of the same data because I reprocess and want to be able to go back through time is not very clever. So the storage needs to understand that there is redundancy, and this problem is somewhat independent of the abstract "partition+asset-key-dataframe" store that I use for my data.

Currently I am testing, and I am not entirely sure whether I want to store on S3 or in a database, so it seems natural to try to abstract the problem of deduplication from the underlying io_manager. I now have a very first version of a hash map that seems to work, in a dirty state: it requires an io_manager for storing blocks of data and a database connection for storing the hash map. Temporal diffs between consecutive partitions look much harder.

I have mentioned other examples of why I wanted to use this pattern; the main reason is that the problem solved by the wrapper io_manager is conceptually independent of what the underlying io_manager does. It was a bit difficult to get started because I had to get a good understanding of how configuration is stored in this type of use case. I have now accepted that I need a dict object in my config/state to store the config of the wrapped io_manager. Naively, at the beginning, I was hoping to be able to reuse an existing preconfigured io_manager and not have to duplicate the config in an untyped dict.
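
The hash-map idea described above (a block store plus a separate index) could look roughly like the sketch below. This is not Bertrand's code, which was not shared in the thread. The class name `DedupStore`, its methods, and the sample tickers and dates are all hypothetical, and two plain dicts stand in for the real block storage (S3 or files) and the database table. JSON is assumed as the serialization only because it gives stable bytes for equal payloads; a dataframe would need a different, equally deterministic encoding.

```python
import hashlib
import json


class DedupStore:
    """Hypothetical content-addressed store: identical payloads are written once."""

    def __init__(self):
        self.blocks = {}  # digest -> serialized payload (stand-in for S3 / file storage)
        self.index = {}   # (asset_key, partition) -> digest (stand-in for a DB table)

    def put(self, asset_key, partition, obj):
        # Serialize deterministically so equal objects produce equal digests.
        payload = json.dumps(obj, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        # Only write the block if this content has never been seen before.
        if digest not in self.blocks:
            self.blocks[digest] = payload
        # Always record which block this (asset, partition) points to.
        self.index[(asset_key, partition)] = digest
        return digest

    def get(self, asset_key, partition):
        digest = self.index[(asset_key, partition)]
        return json.loads(self.blocks[digest].decode("utf-8"))


# Made-up example data: three daily partitions, two of them identical.
store = DedupStore()
store.put("universe", "2023-05-01", ["AAPL", "MSFT"])
store.put("universe", "2023-05-02", ["AAPL", "MSFT"])
store.put("universe", "2023-05-03", ["AAPL", "MSFT", "NVDA"])
```

With this layout every partition keeps its own index entry, so earlier versions stay addressable, while unchanged days share one stored block. It does not attempt the temporal diffs between consecutive partitions mentioned above.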

sean

05/08/2023, 4:41 PM

I see, thanks for the details. Yeah, it sounds like there is a clean conceptual break between Dagster IO managers and this redundancy management you want to implement. You can do anything you want inside `load_input` / `handle_output` of the IO manager, so that’s where you’ll need to implement, or call out to some other resource that does, the tricky redundancy stuff. Seems like you’re in a good place though. Do you need any more help right now?
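
As a rough illustration of "calling out" from `handle_output` and `load_input`, here is a minimal sketch of a wrapper that changes the data representation and delegates storage to an inner manager. It deliberately does not import Dagster: in a real project the classes would subclass Dagster's `IOManager`, and the context would be the output/input context Dagster passes in. `InMemoryIOManager`, `TransformingIOManager`, and the simplified `ctx` object with a tuple `asset_key` are hypothetical stand-ins, and the sketch sidesteps the configuration question discussed in the thread by taking the inner manager as a constructor argument.

```python
import json
from types import SimpleNamespace


class InMemoryIOManager:
    """Hypothetical stand-in for a real storage-backed IO manager."""

    def __init__(self):
        self.values = {}

    def handle_output(self, context, obj):
        self.values[tuple(context.asset_key)] = obj

    def load_input(self, context):
        return self.values[tuple(context.asset_key)]


class TransformingIOManager:
    """Hypothetical wrapper: transforms objects, then delegates storage to `inner`."""

    def __init__(self, inner, encode, decode):
        self.inner = inner
        self.encode = encode  # applied before the inner manager stores the object
        self.decode = decode  # applied after the inner manager loads the object

    def handle_output(self, context, obj):
        # The same context is passed through unchanged in this simple case.
        self.inner.handle_output(context, self.encode(obj))

    def load_input(self, context):
        return self.decode(self.inner.load_input(context))


# Simplified context: only the attribute the stand-in managers read.
ctx = SimpleNamespace(asset_key=("universe",))
inner = InMemoryIOManager()
manager = TransformingIOManager(inner, encode=json.dumps, decode=json.loads)
manager.handle_output(ctx, ["AAPL", "MSFT"])
loaded = manager.load_input(ctx)
```

The wrapper knows nothing about where the inner manager stores data, which is the conceptual separation discussed above. A wrapper that rewrites keys, as in the deduplication case, would additionally need to build a modified context before delegating, which this sketch does not cover.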

Bertrand Nouvel

05/08/2023, 4:42 PM

Thanks, probably not right now, but at some point I am open to sharing the code to see if it can be improved / generalised.
