# announcements
s
Hey, I'm currently evaluating dagster as a replacement for airflow. I've listened to https://softwareengineeringdaily.com/2019/11/15/dagster-with-nick-schrock/ and could relate to many of the airflow issues you describe @schrockn. Thanks for your insights, very interesting episode 🙂 So, although we started developing our own k8s native tool, I'd like to give dagster a go since it sounded interesting when you described it and looked even better when I went through the tutorial. Is there any reason why you're using your own k8s launcher in dagster-k8s instead of Dask Kubernetes?
a
> Is there any reason why you’re using your own k8s launcher in dagster-k8s instead of Dask Kubernetes?
We are currently building out `dagster-celery` & `dagster-k8s` together. I expect when we expand `dagster-k8s` to support Dask we will use Dask Kubernetes.
m
and would welcome your thoughts on how we could best integrate with dask k8s -- this is an area where our thinking and design is evolving quickly and we would love insights from other practitioners
s
Thanks for your replies. I’m asking because, according to the documentation, your Dask integration runs all solids individually, while the k8s integration, according to my understanding of the code, runs whole pipelines in a single job. I suppose the former has much better scaling potential, and Dask Kubernetes claims to scale dynamically with your workload. So the combination of Dask Kubernetes with dagster-dask seemed straightforward to me.
That being said, I see the advantage of not introducing yet another system. But then I'd go for an implementation similar to the one used by dagster-dask in that you run one pod per solid rather than one job per pipeline. Also we found jobs hard to observe programmatically, which is why we decided to spawn and manage pods ourselves with the tool we're developing.
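Since the scaling argument above leans on dask-kubernetes' adaptive workers, here is a minimal sketch of that piece using the classic `KubeCluster` API. The image, resource values, and workload are placeholders, and how dagster-dask would hand solids to such a cluster is exactly the open design question in this thread, not something this snippet settles.

```python
# Minimal sketch of the dynamic scaling dask-kubernetes advertises.
# Image and resource values are placeholders.
from dask.distributed import Client
from dask_kubernetes import KubeCluster, make_pod_spec

pod_spec = make_pod_spec(
    image="daskdev/dask:latest",   # placeholder worker image
    memory_limit="4G", memory_request="4G",
    cpu_limit=1, cpu_request=1,
)
cluster = KubeCluster(pod_spec)
cluster.adapt(minimum=0, maximum=10)  # add/remove worker pods with the workload
client = Client(cluster)

# Work submitted through the client is spread across worker pods,
# which dask-kubernetes scales up and down as the queue grows and drains.
futures = client.map(lambda x: x * x, range(100))
print(sum(client.gather(futures)))
```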
m
would love to know more about the issues you had observing jobs
there clearly is an approach where we reach deeper into k8s and write some kind of CRD -- this is the approach that argo has taken, e.g.
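For context on the CRD direction mentioned here: Argo models each run as a `Workflow` custom resource created through the Kubernetes API. The sketch below is a toy illustration of that pattern via the `CustomObjectsApi`, not anything dagster ships; the manifest is a hypothetical hello-world workflow.

```python
# Toy example: submitting an Argo-style Workflow custom resource.
from kubernetes import client, config

config.load_kube_config()
crd_api = client.CustomObjectsApi()

workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "hello-"},
    "spec": {
        "entrypoint": "main",
        "templates": [
            {
                "name": "main",
                "container": {"image": "alpine:3.10", "command": ["echo", "hello"]},
            }
        ],
    },
}

crd_api.create_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="default",
    plural="workflows",
    body=workflow,
)
```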
s
> would love to know more about the issues you had observing jobs
It's hard to figure out the reason why a job is in a certain state. If it's running, it might actually be running, but it might just as well be that the pod spawned by the job is stuck in Pending. Why? Maybe because of an image pull error, maybe because it's not schedulable. Is the image pull error due to a missing/wrong image or due to missing/wrong credentials? Is it not schedulable because the cluster is full? Does it fit on a node at all? Most of that you can find out by reading the pod events, but you won't find that information on the job resource. Therefore, since we have to watch the pods anyway, we'll just manage them ourselves and thereby only need to watch and deal with one k8s resource instead of two.
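To make that diagnosis path concrete, here is a rough sketch (not from the thread) using the official `kubernetes` Python client: the Job status only exposes active/succeeded/failed counters, while the image-pull and scheduling reasons live on the pods it owns and their events. The job name and namespace are placeholders.

```python
# Sketch: why does a Job "look running"? Read the pods it spawned and their events.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
batch = client.BatchV1Api()

namespace = "default"      # placeholder
job_name = "my-pipeline"   # placeholder

# The Job resource itself only reports pod counts.
job = batch.read_namespaced_job(job_name, namespace)
print("job status:", job.status.active, job.status.succeeded, job.status.failed)

# The actual reason (ImagePullBackOff, Unschedulable, ...) is on the pods,
# which carry a `job-name` label pointing back at the Job.
pods = core.list_namespaced_pod(namespace, label_selector=f"job-name={job_name}")
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
    for cs in pod.status.container_statuses or []:
        if cs.state.waiting:  # e.g. ImagePullBackOff, ErrImagePull
            print("  waiting:", cs.state.waiting.reason, cs.state.waiting.message)
    # Scheduling problems ("0/3 nodes are available ...") surface as pod events.
    events = core.list_namespaced_event(
        namespace,
        field_selector=f"involvedObject.name={pod.metadata.name},involvedObject.kind=Pod",
    )
    for ev in events.items:
        print("  event:", ev.type, ev.reason, ev.message)
```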
m
yep, that makes sense to me