I’ve got a bunch of existing code that’s set up to process data from an external source and produce a [Data Package](https://specs.frictionlessdata.io/data-package/) of the result. These packages are then loaded and transformed later in our process.
I’d like to make use of Dagster’s IO system to model these sources as assets and get automatic loading. My hesitation with the default pickle-based IO managers is that they store data in an opaque format that’s far less interoperable with other tools.
It seems like my options are:
1. Write a custom IO manager that can read and write Data Packages (there’s a little complication here because my current code produces packages that contain multiple resources, but that’s resolvable) — see the rough sketch after this list.
2. Run my existing code, then load the resulting data into Dagster as a plain CSV (dropping the metadata) and let it be pickled by the standard IO managers for downstream use.
3. Run my existing code, then load the Data Package along with all its metadata and then pickle that whole thing.
4. Give up on interoperability and rewrite everything to run the raw data through Dagster directly.
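
For option 1, this is roughly what I have in mind (just a sketch, assuming the `frictionless` library and a simplified one-resource-per-asset layout; my real packages contain multiple resources, so the asset-key-to-resource mapping would need more thought):

```python
import os

import pandas as pd
from dagster import IOManager, io_manager
from frictionless import Package, describe


class DataPackageIOManager(IOManager):
    """Stores each asset as a CSV resource inside a Frictionless Data Package."""

    def __init__(self, base_dir: str):
        self.base_dir = base_dir

    def _package_dir(self, context) -> str:
        # One package directory per asset key; a multi-resource package would
        # need a different mapping (e.g. key prefix -> package, last part -> resource).
        return os.path.join(self.base_dir, *context.asset_key.path)

    def handle_output(self, context, obj: pd.DataFrame):
        pkg_dir = self._package_dir(context)
        os.makedirs(pkg_dir, exist_ok=True)
        csv_path = os.path.join(pkg_dir, "data.csv")
        obj.to_csv(csv_path, index=False)
        # Infer schema metadata from the CSV and write datapackage.json next to it.
        resource = describe(csv_path, type="resource")
        Package(resources=[resource]).to_json(os.path.join(pkg_dir, "datapackage.json"))

    def load_input(self, context) -> pd.DataFrame:
        pkg_dir = self._package_dir(context)
        package = Package(os.path.join(pkg_dir, "datapackage.json"))
        # Rows come back as dict-like objects, so they can feed a DataFrame directly.
        return pd.DataFrame(package.resources[0].read_rows())


@io_manager(config_schema={"base_dir": str})
def data_package_io_manager(init_context):
    return DataPackageIOManager(init_context.resource_config["base_dir"])
```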
I’m curious whether there are best practices or strategies others have used to handle this kind of situation. Am I overthinking it? Are there better ways to pass schema metadata along with assets within Dagster?
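
For context, the closest thing I’ve found so far is attaching a `TableSchema` to an output’s metadata, roughly like this (a sketch, assuming the current Dagster metadata API; the asset and columns are made up):

```python
import pandas as pd
from dagster import MetadataValue, Output, TableColumn, TableSchema, asset


@asset
def source_table():
    # Stand-in for the real extraction step.
    df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
    # Attach Data Package style column info as structured metadata so it shows
    # up in the asset catalog, independent of how the data itself is stored.
    schema = TableSchema(
        columns=[
            TableColumn("id", "integer"),
            TableColumn("name", "string"),
        ]
    )
    return Output(df, metadata={"schema": MetadataValue.table_schema(schema)})
```

That surfaces the schema in the UI, but it doesn’t round-trip it to downstream assets the way a Data Package would, which is why I’m still weighing the options above.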