# ask-community
a
I'm trying to define a training job factory where one op gets the data and the next op trains the model and produces the trained model as an asset. The idea is for the trained models to be assets that a prediction job can then use. Code is here. I'm new to Dagster, so I might have mixed up some patterns. It all runs fine and I can see the models are trained; I just am not seeing any assets.
My job runs fine, but I don't see the two assets I was hoping for (one trained model per metric).
Basically I'm trying to define a job that is op -> asset.
I have a feeling I maybe need to do op -> op and have the last op define the asset as an output or something, but I'm not really sure.
z
Mixing ops and assets within a single job isn't super well supported, nor is using @asset-decorated assets within a @job (asset jobs are defined using define_asset_job).
I think what's happening is that your asset is actually just being executed as an op. There is a specific way to pass an asset into an op using AssetIn, but that doesn't seem like what you're trying to do here. Is there any particular reason you're loading the dataframe in an op only to want to pass it to an asset? I would probably simplify things by just doing the loading and logging within the asset (or modeling the dataframe as a SourceAsset). Then I'd make an asset factory instead of a job factory in order to stamp out the different asset configurations.
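Roughly something like this; a minimal, untested sketch where the metric names and the load/train helpers are placeholders, not your actual code:

```python
import pandas as pd
from dagster import AssetsDefinition, Definitions, asset, define_asset_job

METRICS = ["metric_a", "metric_b"]  # placeholder metric names


def load_metric_df(metric: str) -> pd.DataFrame:
    # Stand-in for the BQ/Snowflake load; returns a toy frame here.
    return pd.DataFrame({"value": [1.0, 2.0, 3.0]})


def train_model(df: pd.DataFrame):
    # Stand-in for the pyod training step.
    from pyod.models.iforest import IForest

    model = IForest()
    model.fit(df[["value"]])
    return model


def build_trained_model_asset(metric: str) -> AssetsDefinition:
    # Asset factory: stamps out one trained-model asset per metric,
    # doing the loading and training inside the asset itself.
    @asset(name=f"{metric}_trained_model")
    def _trained_model():
        df = load_metric_df(metric)
        return train_model(df)

    return _trained_model


trained_model_assets = [build_trained_model_asset(m) for m in METRICS]

defs = Definitions(
    assets=trained_model_assets,
    jobs=[define_asset_job("train_all_models", selection=trained_model_assets)],
)
```

Materializing train_all_models should then show one asset per metric in the asset catalog.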
a
One reason was that in future I'd separate the loading (from BQ, Snowflake, etc.; I'll try to make that nicer with IO managers later, although I like the idea of it all just being pandas DataFrames and using pandas to do all the IO; that might be a bad idea, I'm unsure) from the training (which is just a Python function on a DataFrame using pyod). I'm also not using assets all round because each op actually just works over a subset of metrics from a single metrics table. The main idea is anomaly detection on metrics, where each metric is just some SQL you define plus some YAML config for that metric, like ingest frequency, train frequency, and alert thresholds on an anomaly score (I haven't done scoring yet). So I went with jobs and ops as that felt a bit closer to what I'm trying to do, as opposed to assets (which I like as a general approach, but I feel I might be in a more custom or special use case).
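For example, a per-metric config might look something like this (illustrative only, made-up names and values, not the provider's real format):

```python
# Illustrative per-metric config; each entry pairs the metric's SQL
# with its scheduling and alerting settings.
METRIC_CONFIGS = {
    "metric_a": {
        "sql": "select ts, value from metrics where name = 'metric_a'",  # hypothetical
        "ingest_cron": "*/10 * * * *",  # how often to ingest new data
        "train_cron": "0 */6 * * *",    # how often to retrain the model
        "alert_threshold": 0.9,         # alert when anomaly score exceeds this
    },
}
```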
I'm basically trying to port something like this Airflow provider I built into a Dagster app of sorts: https://github.com/andrewm4894/airflow-provider-anomaly-detection
I should also say, the idea was to start simple by just having the trained models as local assets, and then a score job would actually use them every hour or so to generate anomaly scores and append them to the metrics table too.
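Something like this is what I had in mind for the scoring side; a rough sketch continuing the factory example above, with placeholder names throughout:

```python
from dagster import AssetIn, ScheduleDefinition, asset, define_asset_job


@asset(ins={"model": AssetIn("metric_a_trained_model")})
def metric_a_scores(model):
    # Reuse the placeholder loader from the factory sketch above.
    df = load_metric_df("metric_a")
    # pyod detectors expose decision_function for raw anomaly scores.
    df["anomaly_score"] = model.decision_function(df[["value"]])
    append_scores_to_metrics_table(df)  # hypothetical writer
    return df


def append_scores_to_metrics_table(df):
    # Stand-in for appending scores back to the metrics table.
    print(df.tail())


score_job = define_asset_job("score_job", selection=[metric_a_scores])

# Run the scoring job every hour; it loads the latest trained model
# from storage rather than retraining it.
score_schedule = ScheduleDefinition(job=score_job, cron_schedule="0 * * * *")
```

These would then go into the same Definitions alongside the trained-model assets.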