# announcements
y
Hey group. A friend of mine referred me to dagster and I found it interesting. Currently, we use airflow to schedule work on a kubernetes cluster via the kubernetes job api. The feature of dagster that appeals to me the most is its airflow integration and a web UI editor that lets you write dags interactively. Being able to develop dags interactively has been a shortcoming of airflow imo. Here is my question: what's it like to develop dags with dagster and deploy them to a remote airflow instance? Skimming through the documentation, I thought a user could write a dag on dagster-git, deploy the dag to a repo, then have the airflow dagster plugin pull from that repo. Am I close?
m
Hi @yx! I think that makes sense. The way our Airflow integration is currently conceived: 1) you first write pipelines in Dagster, and 2) you then write a small DAG definition (e.g. https://github.com/dagster-io/dagster/tree/master/python_modules/dagster-airflow#running-uncontainerized-dagsterpythonoperator) that imports the Dagster pipeline and wraps it in an Airflow DAG.
You probably would want to check both your pipeline and the DAG definition into your git repository, and pull them both down to your remote Airflow instance
If you can't install packages on your Airflow instance (so that importing the Dagster pipeline wouldn't work), we also have a containerized route.
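For concreteness, a DAG definition along these lines (module and pipeline names below are hypothetical; this assumes the `dagster_airflow` package is installed where your Airflow instance can import it, as in the uncontainerized route linked above):

```python
# dags/my_dagster_dag.py -- checked into your repo alongside the pipeline,
# and pulled down into your Airflow dags folder.
# `my_project.pipelines` and `my_pipeline` are placeholder names.
from dagster_airflow.factory import make_airflow_dag

# Wraps the Dagster pipeline in an Airflow DAG; each Dagster execution
# step becomes an Airflow task.
dag, steps = make_airflow_dag(
    module_name="my_project.pipelines",
    pipeline_name="my_pipeline",
)
```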
Does that help?
I should mention that we're also considering working on a Kubernetes-oriented integration with Airflow
y
It does make sense, although I was scratching my head when reading the `dag, steps` returned from `make_airflow_dag` -- I assumed airflow would just scan the dag with all the tasks (or `steps` in that example) properly included in it.
For the Docker example, I understand it as wrapping the dag up as a docker container and executing that using airflow's DockerOperator.
m
the `dag` is the Airflow DAG that contains all of the tasks -- we could maybe adjust `make_airflow_dag` so that it doesn't return the `steps` as well
and yes, in the dockerized approach, each Dagster solid is executed in a separate Docker container using the DockerOperator
y
Good to hear that you are also considering working on a kubernetes oriented integration with Airflow!
m
😄 We would love your input on what kind of Kubernetes integration would be the most helpful!
y
@max I have been cooking one idea related to kubernetes which may interest you. Without knowing how much of the project you have scoped out, and not having read dagster's source code, I would love to see users be able to specify compute resources for a step, such as:
```python
@kube(cpu=8, mem="250g", gpu=1)
@lambda_solid(...)
def compute():
    ...
```
this is a neat feature from netflix's metaflow (https://www.youtube.com/watch?v=XV5VGddmP24). You could leverage taints in kubernetes for advanced scheduling (https://kubernetes.io/blog/2017/03/advanced-scheduling-in-kubernetes/)
🔥 1
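A self-contained sketch of the decorator idea above -- neither `kube` nor this metadata scheme exists in dagster; it only illustrates the shape:

```python
# Hypothetical: a decorator that attaches Kubernetes resource requests
# to a step as metadata, without changing the step's behavior.
def kube(cpu=None, mem=None, gpu=None):
    def wrap(fn):
        # A Kubernetes-aware engine could read these hints when building
        # the pod spec (resource requests, taints/tolerations, etc.).
        fn.kube_resources = {"cpu": cpu, "mem": mem, "gpu": gpu}
        return fn
    return wrap

@kube(cpu=8, mem="250g", gpu=1)
def compute():
    return "done"

print(compute.kube_resources)  # {'cpu': 8, 'mem': '250g', 'gpu': 1}
```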
m
Very interesting! My reflex is that we might want to keep hints to the compute substrates out of the code, and instead specify them in config -- so that you could run the same solid code in local test, or containerized in Airflow, or even perhaps directly on Kubernetes, and just change config blocks to tell the engines what constraints to impose on solid execution. Does that feel burdensome or unnatural?
y
Yes, config sounds good. It looks like dagster's configuration is the right place for that (because it's really just pipeline metadata).
👍 1
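A sketch of what the config-based alternative could look like -- the keys here are illustrative, not real dagster config; the point is that the solid code stays unchanged and only the config block varies between environments:

```python
# Hypothetical per-solid resource constraints expressed in config
# rather than in code.
run_config = {
    "solids": {
        "compute": {
            "resources": {"cpu": 8, "mem": "250g", "gpu": 1},
        },
    },
}

def resources_for(solid_name, config):
    """Look up the resource constraints an engine would apply to a solid."""
    return config["solids"].get(solid_name, {}).get("resources", {})

print(resources_for("compute", run_config))      # {'cpu': 8, 'mem': '250g', 'gpu': 1}
print(resources_for("other_solid", run_config))  # {} -- nothing configured
```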