# ask-community
r
hi! i'm just starting to use io managers to help us interact with s3. i know you can set the `s3_prefix` in the io manager resource definition on the job itself, but i'm wondering if there's a way of dynamically setting the s3 prefix in an op depending on some inputs to the op. for example, if we're running a pipeline that takes in `customer_id` as an input for some of the op configs, is there a way to include that in the s3_prefix so that things are more easily searchable in s3?
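(For context, the setup described above, with a static `s3_prefix` configured once on the job's IO manager resource, looks roughly like this; the bucket, prefix, op, and config names here are hypothetical:)

```python
from dagster import job, op
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource


@op(config_schema={"customer_id": str})
def build_report(context):
    # customer_id arrives via op config, but the io manager's prefix stays fixed
    return f"report for customer {context.op_config['customer_id']}"


@job(
    resource_defs={
        "io_manager": s3_pickle_io_manager.configured(
            # static prefix set once on the job
            {"s3_bucket": "my-bucket", "s3_prefix": "my-prefix"}
        ),
        "s3": s3_resource,
    }
)
def report_job():
    build_report()
```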
j
not with the dagster provided s3 io manager. You could always write a custom IO manager that can use that information though
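One way to do that, as a minimal sketch rather than anything built in: pass the customer_id to the custom IO manager's resource config in the same run config that feeds the op, and build the S3 key from it. The class name, config fields, and key layout below are invented for illustration.

```python
import pickle

import boto3
from dagster import Field, IOManager, StringSource, io_manager


class CustomerPrefixedS3IOManager(IOManager):
    """Pickles op outputs to S3 under a prefix that includes a customer_id."""

    def __init__(self, bucket, prefix, customer_id):
        self._bucket = bucket
        self._prefix = prefix
        self._customer_id = customer_id
        self._client = boto3.client("s3")

    def _key(self, output_context):
        # e.g. my-prefix/customer-1234/<run_id>/<step_key>/<output_name>
        return "/".join(
            [self._prefix, f"customer-{self._customer_id}", *output_context.get_identifier()]
        )

    def handle_output(self, context, obj):
        self._client.put_object(
            Bucket=self._bucket, Key=self._key(context), Body=pickle.dumps(obj)
        )

    def load_input(self, context):
        body = self._client.get_object(
            Bucket=self._bucket, Key=self._key(context.upstream_output)
        )["Body"].read()
        return pickle.loads(body)


@io_manager(
    config_schema={
        "s3_bucket": StringSource,
        "s3_prefix": Field(StringSource, default_value="dagster"),
        "customer_id": StringSource,
    }
)
def customer_prefixed_s3_io_manager(init_context):
    cfg = init_context.resource_config
    return CustomerPrefixedS3IOManager(
        cfg["s3_bucket"], cfg["s3_prefix"], cfg["customer_id"]
    )
```

In the run config you'd then supply the same customer_id under both the op's config and the io_manager resource's config, so one value drives both the computation and the storage prefix.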
r
thanks! one other question, does the provided s3 io manager allow you to deserialize an output from another op/job? for example, we have job 2 that relies on job 1's output. we want to stop persisting job 1's output to our local filesystem. i think if we move to assets, this would come for free? but what if we're still using jobs / ops?
j
yeah, with assets that would come for free. in your current system where you're persisting to the local filesystem, are you doing that with the filesystem IO manager, or just manually writing the file?
r
just manually writing the file
j
ok - i think you'd have to do the same thing, but you could just write it to S3. the complication with jobs is that each run of the job is stored under a different `run_id`. So to fetch an output from the first job in a second job, the second job would have to figure out what the latest `run_id` of the first job was. That's possible, but i think it would be really complicated
r
yeah i think we ultimately just want to move over to assets
just will take a bunch of work to get us there. thanks for your help!!
j
no problem! the other option would be to write a custom io manager that writes to the same location each time (rather than under the run id) - it would basically be doing the same kind of logic the io managers do for assets, just for ops/jobs
you’d lose the ability to re-execute old runs though, since you’d be overwriting the data each time. not sure if that’s important to you
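A rough, purely illustrative sketch of that idea: the key omits the run_id, so each run of job 1 overwrites the same object and job 2 can always read the latest version. How job 2's unconnected input gets wired to that key (an input manager, or just reading it directly inside an op) is a separate decision; the names and key layout below are invented.

```python
import pickle

import boto3
from dagster import IOManager, StringSource, io_manager


class LatestOnlyS3IOManager(IOManager):
    """Stores each output at a fixed, run-independent S3 key.

    Every run overwrites the previous object, so downstream readers always see
    the latest version, at the cost of not being able to re-execute old runs
    against the data they originally produced.
    """

    def __init__(self, bucket, prefix):
        self._bucket = bucket
        self._prefix = prefix
        self._client = boto3.client("s3")

    def _key(self, step_key, output_name):
        # no run_id in the key, e.g. my-prefix/my_op/result
        return f"{self._prefix}/{step_key}/{output_name}"

    def handle_output(self, context, obj):
        key = self._key(context.step_key, context.name)
        self._client.put_object(Bucket=self._bucket, Key=key, Body=pickle.dumps(obj))

    def load_input(self, context):
        upstream = context.upstream_output
        key = self._key(upstream.step_key, upstream.name)
        body = self._client.get_object(Bucket=self._bucket, Key=key)["Body"].read()
        return pickle.loads(body)


@io_manager(config_schema={"s3_bucket": StringSource, "s3_prefix": StringSource})
def latest_only_s3_io_manager(init_context):
    cfg = init_context.resource_config
    return LatestOnlyS3IOManager(cfg["s3_bucket"], cfg["s3_prefix"])
```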
r
is that the default behavior for assets? losing the ability to see old runs / run on older versions?
j
yeah so the idea with assets is you have your singular storage location (s3 blob, db table, etc.) that is the "data asset", and then the `@asset` function in dagster is responsible for updating that data asset. So each time you materialize an asset, the contents of the "data asset" are replaced. You can do partitioned assets, or incremental updates to the asset, but the basic case is that the asset is replaced each time. You don't lose the ability to see old runs (like you can still see the logs and stuff), but we don't keep old versions of the data assets around by default. You could write that functionality yourself with I/O managers though
the consequence of that is you can’t be like “run this asset, but with the upstream data from 3 months ago” if the upstream data has been more recently updated
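As a rough sketch of what the asset version of the two-job setup could look like (assuming a recent Dagster version with `Definitions`; the asset names, bucket, and prefix are invented, and `s3_pickle_io_manager` is the stock S3 pickle IO manager from `dagster_aws`):

```python
from dagster import Definitions, asset
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource


@asset
def customer_report():
    # stand-in for what job 1 computes today
    return ["row_a", "row_b", "row_c"]


@asset
def customer_summary(customer_report):
    # depends on customer_report; the IO manager loads the latest
    # materialized value from S3, with no run_id bookkeeping needed
    return len(customer_report)


defs = Definitions(
    assets=[customer_report, customer_summary],
    resources={
        "io_manager": s3_pickle_io_manager.configured(
            {"s3_bucket": "my-bucket", "s3_prefix": "dagster-assets"}
        ),
        "s3": s3_resource,
    },
)
```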
r
ah gotcha, that makes sense! thanks so much!