Hi all How do you version your assets I have the following u dagster #ask-community

Mycchaka Kleinbort

03/07/2023, 12:50 PM

Hi all, How do you version your assets? I have the following use case:

from dagster import asset

@asset
def training_data():
    ...

@asset
def ml_model(training_data):
    # The parameter name makes this asset depend on training_data.
    ...

@asset
def model_report(ml_model):
    ...

This works well, but it overwrites the earlier models and model reports. Is there a way to version the assets without creating copies? Something to replace this:

@asset
def model_report_jan(ml_model_jan):
    ...

@asset
def model_report_feb(ml_model_feb):
    ...

@asset
def model_report_mar(ml_model_mar):
    ...

Mycchaka Kleinbort

03/07/2023, 12:52 PM

I think what I'm looking for is a way to tell dagster to load not the latest, but a previous materialization of an asset.

Mycchaka Kleinbort

03/07/2023, 12:54 PM

I'm thinking maybe a use of SourceAssets is best...

Andras Somi

03/07/2023, 12:59 PM

"This works well, but it overwrites the earlier models and model reports."

Static (or even dynamic) partitions might help with this. You could define your own io manager to handle partitions to avoid overwriting.

Mycchaka Kleinbort

03/07/2023, 1:01 PM

I have an io-manager that writes two versions of each asset to persistent storage: {asset_name} and {asset_name.datetime.runid}. As such, I always have a copy of every asset ever materialized. As I reflect on the question, it's mainly about how I can have downstream assets read a materialization that is not the most recent.
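
A minimal sketch of the two-copy layout described above, in plain Python with no Dagster imports. Each write stores a "latest" copy plus a copy stamped with the time and run id, and the reader accepts an optional version. The names `save_asset` and `load_asset`, the pickle format, and the `.pkl` file naming are all assumptions for illustration; in a real IO manager the same logic would presumably sit inside `handle_output` and `load_input`.

```python
import pickle
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Optional


def save_asset(root: Path, asset_name: str, run_id: str, obj: Any) -> str:
    """Write a 'latest' copy and a run-stamped copy; return the version tag."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    version = f"{stamp}.{run_id}"
    payload = pickle.dumps(obj)
    root.mkdir(parents=True, exist_ok=True)
    # Overwritten on every run: this is what downstream assets read by default.
    (root / f"{asset_name}.pkl").write_bytes(payload)
    # Never overwritten: one file per materialization.
    (root / f"{asset_name}.{version}.pkl").write_bytes(payload)
    return version


def load_asset(root: Path, asset_name: str, version: Optional[str] = None) -> Any:
    """Load the latest copy, or a specific earlier version when one is given."""
    name = f"{asset_name}.pkl" if version is None else f"{asset_name}.{version}.pkl"
    return pickle.loads((root / name).read_bytes())
```

With this layout, the open question in the thread reduces to how the `version` argument reaches the loader for a given run.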

Mycchaka Kleinbort

03/07/2023, 1:01 PM

Thinking that SourceAssets might help... but not 100% sure

Mycchaka Kleinbort

03/07/2023, 1:03 PM

Something like

from dagster import SourceAsset, asset

jan_model = SourceAsset(...)  # Hard coded?

@asset
def model_report_jan(jan_model):
    ...

Vinnie

03/07/2023, 1:14 PM

It seems like this use case is fitting for partitions, even a “simple” MonthlyPartitionsDefinition seems like it would solve it. If you need heavier partitioning logic, as @Andras Somi said I’d go for dynamic partitions. See here.
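
To illustrate why partitions avoid the overwrite: a partition-aware IO manager stores each partition of an asset under its own key, so materializing February leaves January in place. The sketch below is plain Python, not Dagster's API. `PartitionedStore`, its in-memory dict, the toy model, and the date-style keys are hypothetical stand-ins; in Dagster the key would come from the execution context (for example `context.partition_key` inside an asset).

```python
from typing import Any, Dict, List, Tuple


class PartitionedStore:
    """In-memory stand-in for a partition-aware IO manager (hypothetical)."""

    def __init__(self) -> None:
        self._data: Dict[Tuple[str, str], Any] = {}

    def handle_output(self, asset_name: str, partition_key: str, obj: Any) -> None:
        # One slot per (asset, partition): other partitions are left untouched.
        self._data[(asset_name, partition_key)] = obj

    def load_input(self, asset_name: str, partition_key: str) -> Any:
        return self._data[(asset_name, partition_key)]


def ml_model(training_data: List[float]) -> float:
    # Toy "model": just the mean of the training data.
    return sum(training_data) / len(training_data)


def model_report(model: float) -> str:
    return f"model mean = {model:.1f}"


store = PartitionedStore()
monthly_data = {"2023-01-01": [1, 2, 3], "2023-02-01": [10, 20, 30]}
for key, data in monthly_data.items():
    store.handle_output("ml_model", key, ml_model(data))
    # Each report reads the model from its own partition, not from "latest".
    model = store.load_input("ml_model", key)
    store.handle_output("model_report", key, model_report(model))
```

After both months have run, the January model and report are still retrievable by key, which replaces the `model_report_jan` / `model_report_feb` copies with a single asset definition.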

Mycchaka Kleinbort

03/07/2023, 1:19 PM

Thank you

Vinnie

03/07/2023, 1:20 PM

Alternatively check this github discussion that explains the philosophy behind using partitions https://github.com/dagster-io/dagster/discussions/12061

Mycchaka Kleinbort

03/07/2023, 1:22 PM

That looks interesting - I'll check it out. Yet another way to phrase my ask is - what if I want to re-run part of my DAG using an old materialization of an asset - as a form of roll-back

Rahul Dave

03/07/2023, 2:55 PM

I've been thinking about this as well, and am thinking that versioning is something an artifact management system should do, such as wandb/mlflow/dvc

Rahul Dave

03/07/2023, 2:56 PM

Since an iomanager has run-ids/step-ids etc in it, you could register an artifact

Rahul Dave

03/07/2023, 2:57 PM

I've been thinking of partitions more along the lines for inference on new data...

Mycchaka Kleinbort

03/07/2023, 3:00 PM

The trigger for this today was

Hmm... the new model is predicting strange things on new data, I want to apply the old model to the new data to compare

I have an asset

@asset
def y_pred(model, X):
    ...

and I wanted to run it with a previous materialization of the model (which I have saved somewhere)

Andras Somi

03/07/2023, 4:00 PM

You could probably pass the model version as a config param and load the appropriate model inside the asset function?
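
A sketch of that suggestion in plain Python, assuming saved models can be looked up by a version string. `MODEL_STORE`, `load_model`, and the `model_version` key are hypothetical; in Dagster the `config` dict would correspond to the asset's run config, and the lookup would read from wherever the IO manager keeps its run-stamped copies.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical registry of saved models keyed by version.
# Version strings sort lexicographically, so max() picks the latest.
MODEL_STORE: Dict[str, Callable[[float], float]] = {
    "2023-02": lambda x: 2.0 * x,
    "2023-03": lambda x: 2.0 * x + 100.0,  # the model that predicts strange things
}


def load_model(version: Optional[str] = None) -> Callable[[float], float]:
    """Return the requested version, or the most recent one when none is given."""
    key = version if version is not None else max(MODEL_STORE)
    return MODEL_STORE[key]


def y_pred(config: Dict[str, str], X: List[float]) -> List[float]:
    # The model version comes from config rather than from the upstream asset.
    model = load_model(config.get("model_version"))
    return [model(x) for x in X]


X_new = [1.0, 2.0]
new_preds = y_pred({}, X_new)                            # latest model
old_preds = y_pred({"model_version": "2023-02"}, X_new)  # older model, same data
```

The trade-off is that the dependency on the model is no longer visible in the asset graph, since the asset loads it internally.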

Rahul Dave

03/07/2023, 4:53 PM

Ideally if one could call the dagster api from the model management ui/cli etc one could probably invoke a sensor which will run inference with the older model...?

Mycchaka Kleinbort

03/07/2023, 4:56 PM

Interesting idea

Rahul Dave

03/07/2023, 5:06 PM

I looked but could not find any hooks in mlflow for this (for example). My guess is that the most general idea would be to set up a dagster sensor that checks whether a particular mlflow model has been promoted to a "production" stage or tagged specifically (using the mlflow api), and then fetches and runs that model (this part is easy with the mlflow api).
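
The polling logic of such a sensor could look roughly like the sketch below. It is plain Python with no Dagster or MLflow calls: `check_for_promotion` and the shape of the returned dict are hypothetical, and `get_production_version` stands in for whatever MLflow query returns the currently promoted version. In Dagster the body would live in a `@sensor` function, the cursor would be kept in `context.cursor`, and the dict would become a `RunRequest`.

```python
from typing import Callable, Dict, Optional, Tuple


def check_for_promotion(
    get_production_version: Callable[[], Optional[str]],
    cursor: Optional[str],
) -> Tuple[Optional[Dict], Optional[str]]:
    """Return (run request or None, new cursor).

    A run is requested only when the promoted version differs from the one
    recorded in the cursor, so repeated polls do not trigger duplicate runs.
    """
    current = get_production_version()
    if current is None or current == cursor:
        return None, cursor
    run_request = {
        "run_key": current,  # one run per promoted version
        "run_config": {"model_version": current},
    }
    return run_request, current
```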

Mycchaka Kleinbort

03/07/2023, 5:16 PM

I'm updating the model via Dagster itself, so that's not a big issue in and of itself.

chris

03/07/2023, 9:13 PM

Have you seen https://docs.dagster.io/guides/dagster/asset-versioning-and-caching#asset-versioning-and-caching

Chris Histe

03/10/2023, 2:56 PM

I built the IO Manager for W&B that got released recently. I would love to hear your thoughts if you ever use it 🙂
