# ask-community
r
hi! i'm just starting to use io managers to help us interact with s3. i know you can set the `s3_prefix` in the io manager resource definition on the job itself, but i'm wondering if there's a way of dynamically setting the s3 prefix in an op depending on some inputs to the op. for example, if we're running a pipeline that takes in `customer_id` as an input for some of the op configs, is there a way to include that in the s3_prefix so that things are more easily searchable in s3?
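(For context, the setup described above, with a static `s3_prefix` configured once on the job's IO manager resource, looks roughly like this; the bucket, prefix, op, and config names here are hypothetical:)

```python
from dagster import job, op
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource


@op(config_schema={"customer_id": str})
def build_report(context):
    # customer_id arrives via op config, but the io manager's prefix stays fixed
    return f"report for customer {context.op_config['customer_id']}"


@job(
    resource_defs={
        "io_manager": s3_pickle_io_manager.configured(
            # static prefix set once on the job
            {"s3_bucket": "my-bucket", "s3_prefix": "my-prefix"}
        ),
        "s3": s3_resource,
    }
)
def report_job():
    build_report()
```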
j
not with the dagster provided s3 io manager. You could always write a custom IO manager that can use that information though
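One way to do that, as a minimal sketch rather than anything built in: pass the customer_id to the custom IO manager's resource config in the same run config that feeds the op, and build the S3 key from it. The class name, config fields, and key layout below are invented for illustration.

```python
import pickle

import boto3
from dagster import Field, IOManager, StringSource, io_manager


class CustomerPrefixedS3IOManager(IOManager):
    """Pickles op outputs to S3 under a prefix that includes a customer_id."""

    def __init__(self, bucket, prefix, customer_id):
        self._bucket = bucket
        self._prefix = prefix
        self._customer_id = customer_id
        self._client = boto3.client("s3")

    def _key(self, output_context):
        # e.g. my-prefix/customer-1234/<run_id>/<step_key>/<output_name>
        return "/".join(
            [self._prefix, f"customer-{self._customer_id}", *output_context.get_identifier()]
        )

    def handle_output(self, context, obj):
        self._client.put_object(
            Bucket=self._bucket, Key=self._key(context), Body=pickle.dumps(obj)
        )

    def load_input(self, context):
        body = self._client.get_object(
            Bucket=self._bucket, Key=self._key(context.upstream_output)
        )["Body"].read()
        return pickle.loads(body)


@io_manager(
    config_schema={
        "s3_bucket": StringSource,
        "s3_prefix": Field(StringSource, default_value="dagster"),
        "customer_id": StringSource,
    }
)
def customer_prefixed_s3_io_manager(init_context):
    cfg = init_context.resource_config
    return CustomerPrefixedS3IOManager(
        cfg["s3_bucket"], cfg["s3_prefix"], cfg["customer_id"]
    )
```

In the run config you'd then supply the same customer_id under both the op's config and the io_manager resource's config, so one value drives both the computation and the storage prefix.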
r
thanks! one other question, does the provided s3 io manager allow you to deserialize an output from another op/job? for example, we have job 2 that relies on job 1's output. we want to stop persisting job 1's output to our local filesystem. i think if we move to assets, this would come for free? but what if we're still using jobs / ops?
j
yeah, with assets that would come for free. in your current system where you're persisting to the local filesystem, are you doing that with the filesystem IO manager, or just manually writing the file?
r
just manually writing the file
j
ok - i think you'd have to do the same thing, but you could just write it to S3. the complication with jobs is that each run of the job is stored under a different `run_id`. So to fetch an output from the first job in a second job, the second job would have to figure out what the latest `run_id` of the first job was. That's possible, but i think it would be really complicated
r
yeah i think we ultimately just want to move over to assets
just will take a bunch of work to get us there. thanks for your help!!
j
no problem! the other option would be to write a custom io manager that writes to the same location each time (rather than under the run id) - it would basically be doing the same kind of logic the io managers do for assets, just for ops/jobs
you’d lose the ability to re-execute old runs though, since you’d be overwriting the data each time. not sure if that’s important to you
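A rough, purely illustrative sketch of that idea: the key omits the run_id, so each run of job 1 overwrites the same object and job 2 can always read the latest version. How job 2's unconnected input gets wired to that key (an input manager, or just reading it directly inside an op) is a separate decision; the names and key layout below are invented.

```python
import pickle

import boto3
from dagster import IOManager, StringSource, io_manager


class LatestOnlyS3IOManager(IOManager):
    """Stores each output at a fixed, run-independent S3 key.

    Every run overwrites the previous object, so downstream readers always see
    the latest version, at the cost of not being able to re-execute old runs
    against the data they originally produced.
    """

    def __init__(self, bucket, prefix):
        self._bucket = bucket
        self._prefix = prefix
        self._client = boto3.client("s3")

    def _key(self, step_key, output_name):
        # no run_id in the key, e.g. my-prefix/my_op/result
        return f"{self._prefix}/{step_key}/{output_name}"

    def handle_output(self, context, obj):
        key = self._key(context.step_key, context.name)
        self._client.put_object(Bucket=self._bucket, Key=key, Body=pickle.dumps(obj))

    def load_input(self, context):
        upstream = context.upstream_output
        key = self._key(upstream.step_key, upstream.name)
        body = self._client.get_object(Bucket=self._bucket, Key=key)["Body"].read()
        return pickle.loads(body)


@io_manager(config_schema={"s3_bucket": StringSource, "s3_prefix": StringSource})
def latest_only_s3_io_manager(init_context):
    cfg = init_context.resource_config
    return LatestOnlyS3IOManager(cfg["s3_bucket"], cfg["s3_prefix"])
```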
r
is that the default behavior for assets? losing the ability to see old runs / run on older versions?
j
yeah so the idea with assets is you have your singular storage location (s3 blob, db table, etc.) that is the "data asset", and then the `@asset` function in dagster is responsible for updating that data asset. So each time you materialize an asset, the contents of the "data asset" are replaced. You can do partitioned assets, or incremental updates to the asset, but the basic case is that the asset is replaced each time. You don't lose the ability to see old runs (like you can still see the logs and stuff), but we don't keep old versions of the data assets around by default. You could write that functionality yourself with I/O managers though
the consequence of that is you can’t be like “run this asset, but with the upstream data from 3 months ago” if the upstream data has been more recently updated
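As a rough sketch of what the asset version of the two-job setup could look like (assuming a recent Dagster version with `Definitions`; the asset names, bucket, and prefix are invented, and `s3_pickle_io_manager` is the stock S3 pickle IO manager from `dagster_aws`):

```python
from dagster import Definitions, asset
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource


@asset
def customer_report():
    # stand-in for what job 1 computes today
    return ["row_a", "row_b", "row_c"]


@asset
def customer_summary(customer_report):
    # depends on customer_report; the IO manager loads the latest
    # materialized value from S3, with no run_id bookkeeping needed
    return len(customer_report)


defs = Definitions(
    assets=[customer_report, customer_summary],
    resources={
        "io_manager": s3_pickle_io_manager.configured(
            {"s3_bucket": "my-bucket", "s3_prefix": "dagster-assets"}
        ),
        "s3": s3_resource,
    },
)
```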
r
ah gotcha, that makes sense! thanks so much!