# ask-community
**Noah:**
I feel like I'm missing something simple. Our default I/O manager saves to a BigQuery warehouse, but sometimes I just want to do the following:
1. Run two assets that query Redis and BigQuery in parallel.
2. Once both assets are finished, merge the two assets from #1 inside a third asset, and save the result of that third asset to BigQuery.

I have no need for the results of the first two assets to be saved to BigQuery, and it's a lot of data to append each run. It seems like the crux of the issue is the I/O manager and the fact that the BigQuery I/O manager is the default? I definitely could put all the logic into one asset, but that seems less modular, as well as inefficient, since I'd be waiting to query Redis while querying BigQuery for no reason.
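For reference, the asset graph being described looks roughly like this (the asset names are hypothetical, added for illustration):

```python
from dagster import asset

@asset
def redis_events():
    # asset 1: query Redis; runs in parallel with bigquery_events
    ...

@asset
def bigquery_events():
    # asset 2: query BigQuery
    ...

@asset
def merged_events(redis_events, bigquery_events):
    # asset 3: merge the two upstream results; only this output
    # should end up persisted in BigQuery
    ...
```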
**Dagster team:**
Hi Noah! You don't need to use I/O managers if you don't need them; our tutorial has a section that shows how you can skip using I/O managers in an asset. To help you out more tactically: where would you like the result sets of the Redis and BigQuery queries from assets 1 & 2 to live?
**Noah:**
My intent is to only save the results of asset 3 to BigQuery; for all I care, the results of 1 & 2 only need to be persisted long enough for 3 to read them. Would a decent solution maybe be to just save the most recent run's results (a full replace in BQ instead of appending)? I just thought there might be something cleaner that didn't create more schemas in our DW.
**Dagster team:**
Got it! You don't have to use the BQ I/O manager for every asset. If you want something truly ephemeral, you can use the `mem_io_manager` to store the results of assets 1 and 2, and use them in 3. This tutorial section shows how you can specify which I/O manager a specific asset should use.
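A minimal sketch of that wiring, reusing the hypothetical asset names from above: only the two source assets point at an in-memory I/O manager, while asset 3 keeps the default `io_manager` (BigQuery in this setup).

```python
from dagster import Definitions, asset, mem_io_manager

@asset(io_manager_key="mem_io")
def redis_events():
    ...

@asset(io_manager_key="mem_io")
def bigquery_events():
    ...

@asset  # no io_manager_key: falls back to the default "io_manager"
def merged_events(redis_events, bigquery_events):
    ...

defs = Definitions(
    assets=[redis_events, bigquery_events, merged_events],
    resources={
        "mem_io": mem_io_manager,
        # "io_manager": <your BigQuery I/O manager goes here>
    },
)
```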
**Noah:**
Got it to work with `fs_io_manager`, but not `mem_io_manager`, which seems like it would overwrite the files anyway? How would `fs_io_manager` work on Dagster Cloud, though, or would it work at all?
Error for `mem_io_manager`:
```
dagster._core.errors.DagsterUnmetExecutorRequirementsError: You have attempted to use an executor that uses multiple processes, but your job includes op outputs that will not be stored somewhere where other processes can retrieve them. Please use a persistent IO manager for these outputs. E.g. with
the_graph.to_job(resource_defs={"io_manager": fs_io_manager})
```
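The error is about the multiprocess executor rather than `mem_io_manager` itself: in-memory objects can't cross process boundaries. One alternative fix is to force everything into a single process; a sketch under that assumption, reusing the hypothetical asset names from above:

```python
from dagster import Definitions, in_process_executor, mem_io_manager

# Assumed trade-off: the in-process executor makes mem_io_manager legal
# again, but it also serializes the Redis and BigQuery queries, which
# defeats the original goal of running assets 1 and 2 in parallel.
defs = Definitions(
    assets=[redis_events, bigquery_events, merged_events],
    executor=in_process_executor,
    resources={"mem_io": mem_io_manager},
)
```

Given that trade-off, a persistent I/O manager for the intermediates is the better fit here.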
**Dagster team:**
Ah, didn't know you were on Cloud. Yeah, `fs_io_manager` would be a better bet because of how it executes. Best practice on Cloud, though, is to persist the data somewhere that we don't manage, e.g. a GCS bucket.
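A sketch of that pattern with the pickle-based GCS I/O manager from `dagster-gcp`; the bucket name and prefix here are placeholders:

```python
from dagster import Definitions
from dagster_gcp.gcs import gcs_pickle_io_manager, gcs_resource

defs = Definitions(
    assets=[redis_events, bigquery_events, merged_events],
    resources={
        # Intermediates land in GCS instead of a local filesystem,
        # so they survive across separate run workers on Cloud.
        "mem_io": gcs_pickle_io_manager.configured(
            {"gcs_bucket": "my-intermediate-bucket", "gcs_prefix": "dagster-io"}
        ),
        "gcs": gcs_resource,  # gcs_pickle_io_manager requires a "gcs" resource
    },
)
```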
**Noah:**
Gotcha, thanks for the explanation; I'll rethink the BigQuery default and get an S3 I/O manager going. Appreciate it!