Hi all. I've been evaluating some tools to use in a simple machine learning pipeline that will run from a Docker container. A high-level overview of the flow: pre-process the data -> stitch the labels -> train (produce a model) and/or predict
The data scientist on the team has already written code that uses MLFlow; we just need an easy-to-use pipeline tool to help with the flow. The two tools we are at a crossroads between are Dagster and Kedro. I have done the tutorials for both. My view from the tutorials is that with Dagster it was clear and obvious what is happening and how to use it. Kedro is opinionated about project setup (which can be good, since there is a form of best practice in there), but it was a bit difficult to make out how everything gets stitched together to form a pipeline.
However, Kedro offers one advantage: it has an MLFlow plugin, which makes us lean more towards it. Is there anything similar for Dagster, or are there examples of how to set up ML pipelines?
Another question I have regarding Dagster: if a solid fails in the middle of the pipeline and I fix whatever the issue was, do I need to rerun the entire pipeline, or can I rerun just that solid and have the pipeline continue from where it left off?
Sidenote: Metaflow was also a contender, but the tutorials don't really give an idea of what's going on, and from my reading it is quite tied to AWS-style infrastructure.
01/25/2021, 4:31 PM
Hi @Bongani - what MLFlow integration functionality would be helpful for you?
If a solid fails in the middle of a pipeline, you can re-run the pipeline starting with that solid
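To make the idea concrete, here is a toy sketch in plain Python of what "re-run starting with that solid" means. This is not Dagster's API: `run_pipeline`, `cache`, `start_from` and the step functions are all hypothetical names made up for illustration. The assumption is simply that each step's output is stored somewhere, so the steps before the failed one don't need to run again.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A step is a (name, function) pair; each function takes the previous output.
Step = Tuple[str, Callable[[object], object]]


def run_pipeline(
    steps: List[Step],
    initial: object,
    cache: Dict[str, object],
    start_from: Optional[str] = None,
) -> List[str]:
    """Run steps in order and store every step's output in `cache`.

    If `start_from` is given, earlier steps are skipped and the cached
    output of the step just before it is used as the input.
    Returns the names of the steps that actually executed.
    """
    names = [name for name, _ in steps]
    start = names.index(start_from) if start_from is not None else 0
    value = cache[names[start - 1]] if start > 0 else initial

    executed = []
    for name, fn in steps[start:]:
        value = fn(value)      # may raise; outputs cached so far are kept
        cache[name] = value
        executed.append(name)
    return executed


# Hypothetical steps mirroring the flow described in the question.
def preprocess(rows):
    return [r.strip().lower() for r in rows]


def stitch_labels_broken(rows):
    raise RuntimeError("label file missing")


def stitch_labels(rows):
    return [(r, len(r) % 2) for r in rows]


def train(pairs):
    return {"n_examples": len(pairs)}


cache: Dict[str, object] = {}
raw = [" A ", "bb", " CCC"]

# First run: fails in the middle, but the preprocess output is already cached.
try:
    run_pipeline(
        [("preprocess", preprocess),
         ("stitch_labels", stitch_labels_broken),
         ("train", train)],
        raw,
        cache,
    )
except RuntimeError:
    pass

# After fixing the step, resume from it instead of starting over.
executed = run_pipeline(
    [("preprocess", preprocess),
     ("stitch_labels", stitch_labels),
     ("train", train)],
    raw,
    cache,
    start_from="stitch_labels",
)
```

On the second call only `stitch_labels` and `train` execute; `preprocess` is not run again because its output is read from `cache`.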