Hi, I have a question regarding IO manager. I have...
# ask-community
a
Hi, I have a question regarding IO managers. I have an `upstream_asset` in the raw data zone and a `downstream` asset in the cleansed data zone, and both assets should be materialized as files in S3. When storing the upstream asset in S3, I want to have a timestamp (~= a version) in the file path, like `upstream_asset/partition=2022-08-17/version=1660902342/data.txt`, so that each time that asset is materialised, the old data is not overwritten. However, when building the downstream asset, only the latest version of the upstream asset should be used. I have two questions here:
1. Is there anything designed for IO managers to store that version, like a state, which is written in `handle_output()` and read in `load_input()`?
2. When I execute my job in a single process, is there any way to avoid loading the upstream asset from S3 when building the downstream asset (i.e. having something like an in-memory IO manager while still having the upstream asset materialised in S3)?
Thanks!
s
We're tracking functionality that would make this easier here: https://github.com/dagster-io/dagster/issues/8521. In that issue, there's a suggestion for a workaround.
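One way to get "latest version wins" without keeping any extra state is to derive the version from the stored paths themselves: write each materialization under a new `version=<timestamp>` prefix, and on load, list what exists under the partition prefix and pick the highest version. The sketch below illustrates that idea only; it is not necessarily the workaround suggested in the linked issue. It uses a plain dict as a stand-in for an S3 bucket and a hypothetical `FakeContext` in place of Dagster's context objects, so the class names, the `clock` parameter, and the context fields are all assumptions. In a real IO manager the class would subclass Dagster's `IOManager`, the asset and partition would come from the real contexts, and the dict operations would become S3 put/list/get calls.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class FakeContext:
    """Hypothetical stand-in for Dagster's output/input context objects."""
    asset_name: str
    partition: str


class VersionedIOManager:
    """Writes every materialization to a new version= path; loads the newest one."""

    def __init__(self, store: Dict[str, str], clock: Callable[[], float] = time.time):
        self.store = store  # stand-in for an S3 bucket: object key -> contents
        self.clock = clock  # injectable so the version can be controlled in tests

    def _prefix(self, context: FakeContext) -> str:
        return f"{context.asset_name}/partition={context.partition}/"

    def handle_output(self, context: FakeContext, obj: str) -> None:
        # A fresh timestamp per materialization means old data is never overwritten.
        version = int(self.clock())
        key = f"{self._prefix(context)}version={version}/data.txt"
        self.store[key] = obj

    def load_input(self, context: FakeContext) -> str:
        # Here `context` describes the upstream asset whose data is being loaded.
        prefix = self._prefix(context)
        candidates = [key for key in self.store if key.startswith(prefix)]
        if not candidates:
            raise FileNotFoundError(f"no versions found under {prefix}")

        def version_of(key: str) -> int:
            # "version=1660902342/data.txt" -> 1660902342
            segment = key[len(prefix):].split("/")[0]
            return int(segment.split("=", 1)[1])

        # Compare versions numerically rather than as strings.
        return self.store[max(candidates, key=version_of)]
```

Because the version is recovered from the key layout, nothing has to be passed from `handle_output()` to `load_input()`. The trade-off is a list operation on every load, and two materializations within the same second would collide on the same key.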
a
Thanks @sandy. How about my 2nd question? Are there any recommendations for doing that?
s
Sorry I missed your second question. We don't have out-of-the-box support for that pattern, but it would be possible to write your own IO manager that does that.
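A custom IO manager for this could follow a write-through pattern: `handle_output()` persists the object and also keeps a reference in a process-local cache, and `load_input()` serves from the cache when it can, falling back to storage otherwise. The following is a minimal sketch under simplifying assumptions: the class name and the `store_reads` counter are made up for illustration, a dict stands in for S3, and the methods take a plain string key, whereas Dagster's real `handle_output`/`load_input` receive context objects from which the key would be derived.

```python
from typing import Any, Dict


class WriteThroughIOManager:
    """Persists every output, but serves same-process loads from memory."""

    def __init__(self, store: Dict[str, Any]):
        self.store = store  # stand-in for S3: object key -> contents
        self._cache: Dict[str, Any] = {}  # lives only as long as this process
        self.store_reads = 0  # counts fallbacks to storage, for illustration

    def handle_output(self, key: str, obj: Any) -> None:
        # Write through: the asset is still materialized in storage...
        self.store[key] = obj
        # ...and also remembered in memory for downstream steps.
        self._cache[key] = obj

    def load_input(self, key: str) -> Any:
        if key in self._cache:
            # Produced earlier in this process: no round trip to storage.
            return self._cache[key]
        # Not cached (e.g. written by an earlier run): read from storage.
        self.store_reads += 1
        obj = self.store[key]
        self._cache[key] = obj
        return obj
```

This only helps when upstream and downstream steps run in the same process, since each process would have its own empty cache. The cached object is the same Python object that was written, so a downstream step that mutates it in place would see data that differs from what was stored.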
a
thanks @sandy