Hi, I'm trying to get my head around how Dagster works with regards to assets over time (dagster #ask-community)

Peter Law

07/18/2023, 5:14 PM

Hi, I'm trying to get my head around how Dagster works with regards to assets over time. I had assumed that Dagster stored each version of each asset somewhere, yet in fact it looks like the IO managers end up overwriting them instead. Is that expected? Have I misconfigured something?

Peter Law

07/18/2023, 5:16 PM

I've had a look in the docs around assets and they don't seem to say anything one way or the other.

Peter Law

07/18/2023, 5:18 PM

I think I'd made my assumption based on the effort that Dagster seems to go to in presenting the user with the lineage of each asset, but also based on the use-case we have in mind. We're planning to use Dagster to run a data pipeline and a follow-on model-training pipeline, for which we (hopefully obviously) want to preserve each version of the trained models and the data used to train them. Having the following run overwrite the output from a previous one would make comparisons between the runs somewhat harder and less reproducible.

Peter Law

07/18/2023, 5:18 PM

I did find a comment at https://github.com/dagster-io/dagster/discussions/14733#discussioncomment-6136903 which seems to suggest that the IO managers assume overwriting is expected though.

Zach

07/18/2023, 5:20 PM

IOManagers are meant to be customized. Most users will probably write their own, since needs are so diverse when it comes to storing outputs. If you need to maintain a history of all your outputs, it's quite straightforward to create a custom IOManager which writes each output under a unique directory using something like the run ID as a prefix.

Peter Law

07/18/2023, 5:23 PM

Ah right. I saw something about run ids in the docstring for the FilesystemIOManager (https://github.com/dagster-io/dagster/blob/9eabe0e101a4e2efd071040844c44710d186e0c[…]c/python_modules/dagster/dagster/_core/storage/fs_io_manager.py):

Assigns each op output to a unique filepath containing run ID, step key, and output name.

Assigns each asset to a single filesystem path, at "<base_dir>/<asset_key>". If the asset key has multiple components, the final component is used as the name of the file, and the preceding components as parent directories under the base_dir.

So I think I'd assumed that that was meant to be happening already. If that's not the default behaviour how/where would a custom implementation get the run id from? Is it available as a resource or something? Apologies if I'm missing something obvious in the docs.

Peter Law

07/18/2023, 5:25 PM

Ah, I'm guessing the OutputContext given to dump_to_path contains the run id?

Zach

07/18/2023, 5:26 PM

Correct! Context objects are passed to most Dagster components and are usually the first place to look for contextual information like run IDs, partition definitions, run config, etc.
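
A minimal sketch of the run-ID-prefixed layout discussed above, offered as an illustration rather than as Dagster's own implementation. It relies only on the run_id and asset_key attributes of the output context. The FakeAssetKey and FakeOutputContext classes are hypothetical stand-ins so the snippet runs without Dagster installed, and RunScopedPickleStore is an invented name. In a real project the same logic would sit in the handle_output method of an IOManager subclass.

```python
import pickle
from dataclasses import dataclass
from pathlib import Path
from typing import Any, List


@dataclass
class FakeAssetKey:
    """Hypothetical stand-in for dagster.AssetKey (components exposed as .path)."""
    path: List[str]


@dataclass
class FakeOutputContext:
    """Hypothetical stand-in for dagster.OutputContext, reduced to two attributes."""
    run_id: str
    asset_key: FakeAssetKey


class RunScopedPickleStore:
    """Writes each asset to <base_dir>/<run_id>/<asset key components...>.

    Hypothetical helper, not part of Dagster.
    """

    def __init__(self, base_dir: Path) -> None:
        self.base_dir = Path(base_dir)

    def path_for(self, context) -> Path:
        # The run ID becomes the first directory level, so two runs of the
        # same asset never share a path.
        return self.base_dir.joinpath(context.run_id, *context.asset_key.path)

    def handle_output(self, context, obj: Any) -> Path:
        path = self.path_for(context)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as f:
            pickle.dump(obj, f)
        return path
```

Writing the same asset from two runs then produces two separate files instead of one overwritten file. Reading a specific version back still requires knowing which run ID to look under.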

Peter Law

07/18/2023, 5:28 PM

Thanks :) I guess I was hoping that there might be a hook to modify the path passed in some templateable way, rather than needing to implement a custom IO manager for something like this. Kinda feels like this is a slightly different layer than the IO manager (and I might want to apply the same/similar template to various managers in various code locations).
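
To make the idea of a templated path concrete, here is a rough sketch of what such a helper could look like if written by hand. Nothing here is a Dagster feature: render_output_path and the placeholder names {run_id}, {asset_path} and {asset_name} are all assumptions made for illustration. Several custom IO managers could call one shared function like this to agree on a layout.

```python
from pathlib import PurePosixPath
from typing import Sequence

# Hypothetical default layout: one directory per run, then the asset key.
DEFAULT_TEMPLATE = "{run_id}/{asset_path}"


def render_output_path(
    template: str, *, run_id: str, asset_key_path: Sequence[str]
) -> PurePosixPath:
    """Fill a path template from values an IO manager can read off its context."""
    rendered = template.format(
        run_id=run_id,
        asset_path="/".join(asset_key_path),  # all key components joined
        asset_name=asset_key_path[-1],        # final key component only
    )
    return PurePosixPath(rendered)
```

For example, the template "{asset_path}/runs/{run_id}" groups every run of an asset under that asset's directory, while the default groups every asset of a run under that run's directory.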

Zach

07/18/2023, 5:33 PM

I'm not aware of a feature like that. It's really super low effort to create an IOManager, like 30-50 lines of code for something like this. I think trying to stick to just the defaults for Dagster will be quite limiting in the long run. To me Dagster is truly a framework, in that they give you the components and scaffolding and you customize it to match your needs.

Peter Law

07/18/2023, 5:35 PM

Ah, ok. Looking at get_asset_relative_path and get_op_output_relative_path it seems the latter is a bit closer to what I was perhaps expecting (context.get_identifier() seems to include the run id etc.). I wonder if there's a reason they're different?

Zach

07/18/2023, 5:38 PM

I haven't used those methods, but I would assume they're different because one operates on assets and one on ops, and these two different components have quite different metadata (assets being more or less a superset of ops).

Peter Law

07/19/2023, 3:29 PM

Having a look at this it's not clear to me how to make this work. When running a downstream asset it uses an InputContext to construct the path to the previous output asset. While there is an upstream_output member which is an OutputContext, that context doesn't have key information (namely the run id) which would be needed to construct the proper path. In general, I don't know how subsequent runs will know what the proper path is to the previous version of the asset.

Zach

07/19/2023, 3:33 PM

Seems like you'd have to provide what version of the asset you'd want subsequent runs to act on, possibly through a configuration schema on the IOManager

Zach

07/19/2023, 3:34 PM

version being the run ID or some other versioning info you used to generate the path to a previous version of the asset
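
A rough sketch of what passing the version through configuration could look like on the loading side, under the assumption that outputs were written to <base_dir>/<run_id>/<asset key components...>. The function resolve_upstream_path and the pinned_run_ids mapping are hypothetical names. In Dagster the mapping would presumably arrive via the IO manager's config and the asset key via the input context, but that wiring is not shown here.

```python
from pathlib import Path
from typing import Mapping, Sequence


def resolve_upstream_path(
    base_dir: Path,
    asset_key_path: Sequence[str],
    pinned_run_ids: Mapping[str, str],
) -> Path:
    """Return the stored location of one pinned version of an upstream asset.

    pinned_run_ids maps a "/"-joined asset key to the run ID whose output
    should be loaded (hypothetical config shape).
    """
    asset_name = "/".join(asset_key_path)
    try:
        run_id = pinned_run_ids[asset_name]
    except KeyError:
        # Fail loudly rather than guessing which version to read.
        raise LookupError(
            f"No run ID pinned for upstream asset {asset_name!r}"
        ) from None
    return Path(base_dir).joinpath(run_id, *asset_key_path)
```

Raising on a missing pin is a deliberate choice in this sketch: silently falling back to some other version would undermine reproducibility.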

Peter Law

07/19/2023, 3:38 PM

Yeah, this is well into the realm of something I'd hoped/assumed was a built-in rather than something I want to be building for myself. Obviously I could build it myself, but it feels like it's going to be quite fragile if I'm relying on manually passing around config to achieve it. It's also entirely unclear to me how in the UI someone will know which versions of one asset from the lineage relate to which versions of another asset in the same lineage. This is super important to get right for reproducibility.

Zach

07/19/2023, 3:41 PM

Sounds like you might want to make a feature request. Also sounds like your versioning requirements may be better suited to partitioned assets if you want to maintain dependencies between versions. Seems like you could model it as a chain of dynamically partitioned assets

Peter Law

07/19/2023, 3:43 PM

Yeah, I'd seen there was some support for partitioned data. Some of my use-cases would support that (the data pipeline side), however that doesn't feel like it's a good fit for the other things I'm expecting would need this, such as:
• versions of a trained model
• runs of the same asset differentiated by different configs (not sure if I'm actually going to need this, but it feels like a general solution for run-unique asset paths would help here)
Neither of these usefully fits with the idea of partitioned data.

Zach

07/19/2023, 3:46 PM

Multidimensional partitions might help with those requirements. I'm just trying to give you options because right now nothing exists out of the box to do what you want to do, and the Dagster team may not be able to put something together for you in a short period of time that fits your exact requirements. If you make a feature request on their github then I'm sure they'll consider it

Peter Law

07/19/2023, 4:02 PM

Thanks, I've posted https://github.com/dagster-io/dagster/discussions/15386, which will hopefully get a response from the Dagster team.
