# announcements
t
Hi! I enjoyed the dagster podcasts. Is there an example of how one could use dagster as a way to model an ML pipeline down to featurization (say replacing sklearn or Dask ML pipelines)? I see ML pipelines as spanning from coarse (Airflow) to being fine grained (featurization or business logic pipeline). My interest in Dagster is that it provides a single abstraction for both coarse and fine grained pipelines, along with guaranteed execution, low latency execution (say the DAG is to be executed in a serving environment), and parameterization support.
m
Hi Taleb! @abhi and I are building out an example like that right now actually.
t
excited to see it!
When would it be ready? Could I see it in a branch?
I'd love to play with dagster soon
m
i'd also suggest you take a look at the airline demo, which doesn't include any featurization but does exercise a bunch of relevant features (like jupyter integration)
a
Just as an FYI, we haven't explicitly done the featurization/model training piece yet in bay bikes. (I am working on it as we speak). The aim is end of next week!
t
It'd be great to have it show that the same code can run in bulk and in serving a single request with low latency!
This could be done if part of the DAG uses pandas DFs only: in bulk the input is a Spark DF, while in serving mode it's a pandas DF with a single row.
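A minimal sketch of that idea, in plain Python with no Dagster or Spark (all names here, like `featurize` and `serve_one`, are hypothetical illustrations): the row-level transform is defined once, and bulk mode vs. serving mode differ only in how many rows pass through it.

```python
# Sketch: define featurization once, reuse it in bulk and in serving.
# All names are hypothetical; in practice the bulk path would map this
# over a Spark or pandas DataFrame.

def featurize(row):
    """Row-level featurization: the single unit of business logic."""
    return {
        "distance_km": row["distance_m"] / 1000.0,
        "is_weekend": row["day_of_week"] in (5, 6),
    }

def featurize_batch(rows):
    """Bulk mode: apply the same transform to many rows."""
    return [featurize(r) for r in rows]

def serve_one(row):
    """Serving mode: a 'batch' of exactly one row reuses the same code."""
    return featurize(row)

batch = featurize_batch([
    {"distance_m": 1500, "day_of_week": 5},
    {"distance_m": 400, "day_of_week": 2},
])
single = serve_one({"distance_m": 1500, "day_of_week": 5})
assert batch[0] == single  # identical logic in both environments
```

The point of the sketch is that nothing in the transform itself knows whether it is running over a million rows or one.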
m
can you say more about how you'd like to use dagster for model serving
up to now we haven't really envisioned the system being used in a realtime context
t
I'd like to define fine grained pipelines (featurization) within a DAG to be used in both bulk and low latency serving envs. Often engineers have to rewrite the DAG between batch/bulk and serving environment.
I'm evaluating the space of Prefect, Dask ML pipelines, and Dagster: https://github.com/dagster-io/dagster/issues/1593
a
I totally see what you mean, I would love to know more about what your serving environment looks like. I envision some web service that is deployed with model state and has an endpoint which takes an input and then performs featurization/model.predict and returns an output. Is this your mental model, or do you have something different in mind?
t
Yup exactly!
We're also evaluating SeldonCore and their inference graphs. SeldonCore provides many great features, but unfortunately it also imposes its own definition of an inference graph.
a
So we don't have an out-of-the-box solution for executing pipelines in a web app at the moment. The reason is that the side-effect behaviors you want from a batch prediction pipeline are dramatically different from those of a real-time service that passes an input through a prediction execution graph, so we would need to build a solution that is mindful of these differences. Even assuming the data available during real-time prediction is exactly the same as the data you trained on (which is often not true, because ETL does a lot of stuff to data that the raw production systems your app will be talking to haven't), you would be producing different side effects, and you would also have different requirements around things like alerting, model prediction (a lot of runtimes, e.g. sklearn BaseEstimator predictions, are optimized for batch inputs but not for single transactions), and, most importantly, model artifacts. The last one is particularly interesting because the model you train is often dependent on the runtime it will be used in, e.g. tweaking n_jobs when doing RandomizedSearchCV with any sklearn model. Given all these complexities, we haven't really bitten off the real-time serving problem, but would love to see contributions that tackle it!
TLDR: there currently be hydras if you try to use configs to do forking in your prediction pipeline to multiplex between real-time and batch environments.
t
Thanks! What are the side effects? Could dagster run in a web context with the low latency of Dask?
a
It potentially could! It's just that we haven't really tried it out, because model serving is not a feature we want to bite off at the moment, for the reasons expressed above. At the moment, I think dagster is best suited for use cases like ETL, model training, retraining, and batch prediction. Once a dagster pipeline drops a versioned model off into a store (S3, for example), other tools/processes that specialize in model runtime environments can take over from there. What I mean by "side effects" is that the consequences of coupling your data/ML pipelines with web service development can be pretty gnarly from a maintenance perspective!
t
That's the scenario we want to avoid, duplicate DAGs for ETL/training/batch and serving. There'd be offline and serving runners. Maybe another example that would be awesome is that of building my own dagster executor. Say a unit test executor with no side effects.
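The "unit test executor with no side effects" idea could look roughly like this (a plain-Python sketch, not Dagster's actual executor API; `run_with_executor` and `SIDE_EFFECTS` are hypothetical names): the executor walks the same steps but swaps side-effecting ones for no-ops.

```python
# Sketch of a "unit test" executor that stubs out side-effecting steps.
# All names here are hypothetical illustrations, not a real Dagster API.

def load_features(ctx):
    # Pure step: safe to run anywhere.
    ctx["features"] = [1, 2, 3]

def write_to_store(ctx):
    # In production this would write to S3 or a database.
    raise RuntimeError("real side effect: should not run in unit tests")

STEPS = [load_features, write_to_store]
SIDE_EFFECTS = {write_to_store}

def run_with_executor(steps, dry_run=False):
    """A dry-run executor replaces side-effecting steps with no-ops."""
    ctx = {}
    for step in steps:
        if dry_run and step in SIDE_EFFECTS:
            continue  # no-op in the unit-test executor
        step(ctx)
    return ctx

assert run_with_executor(STEPS, dry_run=True) == {"features": [1, 2, 3]}
```

The same step list runs in both modes; only the executor decides whether effects actually fire.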
a
That is totally fair! I would love to see it in action if you get an MVP going!
Also, to help you out there, you can check out the Python API, since there is nothing stopping you from running a pipeline in a web service. This should help you get started! https://dagster.readthedocs.io/en/latest/sections/api/apidocs/execution.html
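As a rough stdlib-only sketch of that run-per-request pattern (no Dagster or web framework imports; `PIPELINE`, `execute_pipeline`, and `handle_request` are hypothetical stand-ins for the real execution API linked above): a web handler can execute a small dependency graph synchronously for each request.

```python
# Stdlib-only sketch of executing a tiny DAG per web request.
# In real code you would call Dagster's Python execution API instead;
# this just illustrates the synchronous run-per-request pattern.

# Each "solid" maps a name to (dependencies, function).
PIPELINE = {
    "featurize": ([], lambda inputs, req: {"x": req["raw"] * 2}),
    "predict": (["featurize"], lambda inputs, req: inputs["featurize"]["x"] + 1),
}

def execute_pipeline(pipeline, request):
    """Run solids in dependency order, threading outputs forward."""
    results, done = {}, []
    while len(done) < len(pipeline):
        for name, (deps, fn) in pipeline.items():
            if name not in results and all(d in results for d in deps):
                results[name] = fn({d: results[d] for d in deps}, request)
                done.append(name)
    return results

def handle_request(payload):
    """A web handler would parse the request, run the DAG, and respond."""
    return {"prediction": execute_pipeline(PIPELINE, payload)["predict"]}

print(handle_request({"raw": 3}))  # {'prediction': 7}
```

A real handler would add the alerting/artifact concerns discussed above, but the control flow is this simple.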