# ask-community
e
Hi all! I'm new here, and after reading through almost the entirety of the documentation, I want to see if you would agree with some conclusions I have made regarding a highly flexible evaluation pipeline I plan to build:
• Ops instead of assets - an eval pipeline that lets the user configure any input data and any output location for the report does not have a consistent data asset that other pipelines use, so SDAs don't really provide much value.
• Configuring pipeline runs through op selection syntax - the best way to design such a pipeline is as a static graph of ops that manipulate the data in many different ways and produce various metrics (some of which may not be computable, depending on what is in the upstream data). The user then selects the metrics they would like to compute, based on the data they are providing, as a list of clauses:
["*metric_1*", "*metric_2*", ..., "*metric_n*"]
Would love some guidance, as I'm just starting to build my own mental model for Dagster design
s
Glad you’re trying out Dagster, Eric!
Ops instead of assets
Hard to say without more details-- in general we recommend assets for any kind of data transformation where you aren’t doing something explicitly imperative (e.g. sending an email or other notification). Assets can have a configurable physical location via an IO manager, and it’s unclear to me what’s meant by “configure any input data”-- if that data has a consistent form, assets are likely the way to go.
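For example, here's a sketch of what a configurable output location could look like with an IO manager (the names and paths are just illustrative, not a recommendation for your exact setup):
```python
import json
import os

from dagster import Definitions, IOManager, asset, io_manager

class JsonReportIOManager(IOManager):
    """Writes each asset's output as JSON under a configurable base directory."""

    def __init__(self, base_dir: str):
        self.base_dir = base_dir

    def handle_output(self, context, obj):
        os.makedirs(self.base_dir, exist_ok=True)
        path = os.path.join(self.base_dir, f"{context.asset_key.path[-1]}.json")
        with open(path, "w") as f:
            json.dump(obj, f)

    def load_input(self, context):
        path = os.path.join(self.base_dir, f"{context.asset_key.path[-1]}.json")
        with open(path) as f:
            return json.load(f)

@io_manager(config_schema={"base_dir": str})
def json_report_io_manager(init_context):
    return JsonReportIOManager(init_context.resource_config["base_dir"])

# A hypothetical report asset routed through the configurable location
@asset(io_manager_key="report_io_manager")
def eval_report():
    return {"metric_1": 0.92}  # dummy value for illustration

defs = Definitions(
    assets=[eval_report],
    resources={
        "report_io_manager": json_report_io_manager.configured({"base_dir": "/tmp/eval_reports"})
    },
)
```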
Configuring pipeline runs through op selection syntax
Also hard to provide much guidance here without knowing more, but my instinct is that you should consider having multiple jobs rather than a mega-job that you subset through op selection to achieve different objectives. You can reuse the same op across jobs.
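For example, the same ops can be wired into separate jobs, one per objective (hypothetical names again):
```python
from dagster import job, op

@op
def load_data():
    ...

@op
def clean(data):
    ...

@op
def metric_1(data):
    ...

@op
def metric_2(data):
    ...

# Two jobs reusing the same load/clean ops, one per metric
@job
def metric_1_job():
    metric_1(clean(load_data()))

@job
def metric_2_job():
    metric_2(clean(load_data()))
```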
e
Thanks for the detailed response!
Ops instead of assets
Based on the way assets are presented in the documentation, using an asset for my type of data just feels wrong. In the documentation, assets are presented as a collection of data that always has the same interpretation but may need refreshing from time to time, e.g. all stars on a GitHub repo or the recorded temperature for the past week. But in my case, the data definition is always different - it could be data coming from product logs, or it could be fake data or external data. It could even have different features available. From this perspective, it seems awkward to consider this asset as going "stale," since each time the job is run the content should be completely different.
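To illustrate what I mean by the input varying (made-up names, just to show the shape of it), the source would be something the user points at per run rather than a fixed asset:
```python
from dagster import op

# The source is pure run config: product logs one run, synthetic data the next
@op(config_schema={"source_path": str})
def load_eval_data(context):
    context.log.info(f"Loading eval data from {context.op_config['source_path']}")
    ...
```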
Configuring pipeline runs through op selection syntax
This is an interesting proposal! So basically, each metric can be defined as a job, which specifies the chain of ops that get from the input data to the metric value. Some follow-up questions:
1. Is there a way to combine jobs into a single job to launch at runtime? It would be nice if the user provides a list of metric names and that kicks off all of the associated jobs.
2. Even though an op definition is shared across jobs, can its execution be shared? For example, if the metric1 job looks like `op1 -> op2 -> op3 -> metric1` and the metric2 job looks like `op1 -> op2 -> metric2`, can I launch these in a way where `op1 -> op2` only needs to execute once?
Thanks again for taking the time to respond, and sorry for all the questions
s
Is there a way to combine jobs into a single job to launch at runtime? It would be nice if the user provides a list of metric names and that kicks off all of the associated jobs.
I don’t think so-- if you want to launch all this stuff at the same time then putting it in the same job (and possibly subsetting with op selection) is the way to go.
Even though an op definition is shared across jobs, can its execution be shared?
No; as above, in that case they should be in the same job.
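To sketch what that could look like (keeping your hypothetical op names): put both metrics in one job so the shared prefix runs once, and use op selection when you only want a subset:
```python
from dagster import job, op

@op
def op1():
    ...

@op
def op2(x):
    ...

@op
def metric1(x):
    ...

@op
def metric2(x):
    ...

@job
def all_metrics_job():
    shared = op2(op1())  # op1 -> op2 executes exactly once per run
    metric1(shared)
    metric2(shared)

# Launch everything, or subset to a single metric and its ancestors:
# all_metrics_job.execute_in_process(op_selection=["*metric2"])
```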
sorry for all the questions
No prob at all, that’s why we have this channel