The Asset API looks amazing 👏 and I love the dagit view!
Somewhat related to my question in dagster support: I wonder what the best practice is going to be for asset jobs when configuration, or the ability to swap out inputs, is involved.
Example use cases:
• a generic machine learning job that accepts "arbitrary" inputs that are configurable directly from dagit and a config specifying which algorithm to use, etc. How would you create that with assets?
• a re-run of a production pipeline that needs to use a previous snapshot of the data (in a regular job I would just use the op selection and manually add the input config)
02/10/2022, 6:55 AM
Hey @Alessandro Marrella. Thanks for bringing this up - it's something we've been putting thought into.
Do you mind if I turn the question around on you first? Do you have thoughts on how you'd like your configurable assets to be represented in storage? I.e., imagine you have a software-defined asset that accepts a single config parameter, and you materialize it three times: once with value X, once with value Y, and once with value X again. Assuming you store the results in a filesystem, which of these would you prefer?
• A single file that gets overwritten with each materialization.
• Two files: one for each config value. The file corresponding to config value X gets overwritten when the asset is materialized with config value X a second time.
• Three files: one for each run. No files get overwritten.
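To make the three options concrete, here is a minimal sketch in plain Python of how each one could map a materialization to a file path. It does not use any Dagster API; `storage_path`, `simulate`, the strategy names, and the `storage/...` layout are all hypothetical and exist only for illustration.

```python
from pathlib import PurePosixPath


def storage_path(asset_name, config_value, run_id, strategy):
    """Return a hypothetical file path for one materialization."""
    base = PurePosixPath("storage") / asset_name
    if strategy == "single":
        # Option 1: one file, overwritten by every materialization.
        return base / "latest"
    if strategy == "per_config":
        # Option 2: one file per config value, overwritten on repeats.
        return base / f"config={config_value}"
    if strategy == "per_run":
        # Option 3: one file per run, never overwritten.
        return base / f"run={run_id}"
    raise ValueError(f"unknown strategy: {strategy}")


def simulate(strategy):
    """Materialize with X, then Y, then X again; map each path to its last writer."""
    files = {}
    runs = [("run1", "X"), ("run2", "Y"), ("run3", "X")]
    for run_id, config_value in runs:
        path = storage_path("my_asset", config_value, run_id, strategy)
        files[str(path)] = run_id  # a later write replaces an earlier one
    return files


for strategy in ("single", "per_config", "per_run"):
    print(strategy, simulate(strategy))
```

The only difference between the options is which values take part in the path: none, the config value, or the run ID.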
Relatedly, when you navigate to the "asset details" page for that asset, what would you like to see emphasized?
• The most recent materialization of the asset, independent of what config value was used for it.
• The most recent materialization for each config value. I.e. maybe you'd get to type in a config value, and we'd show you the latest materialization for that config value.
• All historical materializations equally.
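The three display options can be sketched the same way, as different views over one materialization history. Again this is plain Python rather than a Dagster API; the record fields and function names are assumptions made for the example.

```python
# Hypothetical materialization history, oldest first.
HISTORY = [
    {"run_id": "run1", "config": "X"},
    {"run_id": "run2", "config": "Y"},
    {"run_id": "run3", "config": "X"},
]


def latest_overall(history):
    """Option 1: the most recent materialization, whatever its config."""
    return history[-1]


def latest_per_config(history):
    """Option 2: the most recent materialization for each config value."""
    latest = {}
    for record in history:
        latest[record["config"]] = record  # later records replace earlier ones
    return latest


def all_materializations(history):
    """Option 3: every materialization, weighted equally."""
    return list(history)


print(latest_overall(HISTORY))
print(latest_per_config(HISTORY))
print(all_materializations(HISTORY))
```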
02/10/2022, 2:23 PM
Hey @sandy, thanks for the reply.
It's a tricky situation. For production re-runs I would probably prefer just overwriting, and setting partitions would be enough. For arbitrary runs of a generic pipeline, though, it would be great to have some influence on where the assets are stored, like when you specify the 'outputs' in an op-based pipeline. (I can be convinced that this is more of a use case for a traditional op-based pipeline, and that asset-based pipelines are more static.)
For the visualization, I'd vote for displaying everything equally, with the option of seeing the config next to each materialization (for example, via a "view config" button).