Hi, i am implementing an ETL pipeline in dagster a...
# deployment-kubernetes
h
Hi, I am implementing an ETL pipeline in Dagster and want to use K8s for job execution. Each step of the ETL needs to be packaged in its own image due to dependency conflicts, and each step needs to be executed in either a separate pod or a separate job. I'm wondering:

1. How to pass inputs/outputs between steps. Is there an automatic way of handling this?
2. How to trigger each step in a separate pod/job. Are there any docs available for this?

At a very high level, I have the following structure in mind. Is this the correct way to do it?

    @op()
    def extract() -> pd.DataFrame:
        # Run a Kubernetes job, passing the file path as a reference
        ...
        # Once finished, get the results
        return results

    @op()
    def transform(df: pd.DataFrame) -> pd.DataFrame:
        # Run a Kubernetes job, passing df
        ...
        # Once done, get the results
        return results

    @op()
    def load(df: pd.DataFrame):
        # Run a Kubernetes job, passing df
        ...

Any hints/guidance would be really appreciated since I'm just getting started with Dagster. Thanks a lot!
d
Hi Harris - I wrote up a discussion answer about how to do #2 in your question here: https://github.com/dagster-io/dagster/discussions/14387
For #1: Dagster has a concept called an I/O manager that automatically handles persisting outputs between ops: https://docs.dagster.io/concepts/io-management/io-managers
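To make the I/O manager idea concrete, here is a minimal plain-Python sketch of the pattern, not Dagster's actual classes. The class `PickleFileIOManager`, the step functions, and `run_pipeline` are all hypothetical names for illustration. The method names `handle_output` and `load_input` mirror the two methods a Dagster I/O manager implements, but the real ones receive a context object rather than a step name, and Dagster calls them for you between ops.

```python
import pickle
import tempfile
from pathlib import Path


class PickleFileIOManager:
    """Plain-Python stand-in for the I/O manager idea (not a Dagster class)."""

    def __init__(self, base_dir: str) -> None:
        self.base_dir = Path(base_dir)

    def _path_for(self, step_name: str) -> Path:
        # One file per step output, named after the step that produced it.
        return self.base_dir / f"{step_name}.pkl"

    def handle_output(self, step_name: str, obj) -> None:
        # Persist the value a step returned.
        with open(self._path_for(step_name), "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, upstream_step_name: str):
        # Read back the value the upstream step persisted.
        with open(self._path_for(upstream_step_name), "rb") as f:
            return pickle.load(f)


def extract():
    return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]


def transform(rows):
    return [{**row, "value": row["value"] * 2} for row in rows]


def run_pipeline(base_dir: str):
    # The steps never hand objects to each other directly; every value
    # goes through storage, so each step could run in a different process.
    io = PickleFileIOManager(base_dir)
    io.handle_output("extract", extract())
    io.handle_output("transform", transform(io.load_input("extract")))
    return io.load_input("transform")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_pipeline(tmp))
```

Because every value round-trips through storage, the same pattern works when steps run in separate pods, as long as all pods can reach the storage location (a local temp directory here, shared or object storage in a real cluster).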
h
Thanks @daniel for your response. Is there still a benefit to using data assets if we go with k8s_job_op (option A)? How can we use serde with this approach? Is there a working example in the docs?
d
I don’t think we have any built-in support for serializing dataframes into the container when using k8s_job_op
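Since nothing serializes dataframes into the container, one common workaround is to pass references instead: each container reads its input from a shared storage path and writes its output to another. The sketch below only builds plain config dicts using the `image`, `command`, and `args` keys that `k8s_job_op` accepts. The image names, script names, `--input`/`--output` flags, and storage paths are all hypothetical, and each script is assumed to do its own reading and writing.

```python
from typing import Dict, List, Optional


def step_config(
    image: str, script: str, input_path: str, output_path: Optional[str] = None
) -> Dict[str, object]:
    """Build a config dict for one containerized step."""
    args = ["--input", input_path]
    if output_path is not None:
        args += ["--output", output_path]
    return {"image": image, "command": ["python", script], "args": args}


def build_etl_configs(source_path: str, work_dir: str) -> List[Dict[str, object]]:
    # Intermediate locations on storage that every pod can reach (assumption).
    raw = f"{work_dir}/raw.parquet"
    clean = f"{work_dir}/clean.parquet"
    return [
        step_config("registry.example.com/etl-extract:latest", "extract.py", source_path, raw),
        step_config("registry.example.com/etl-transform:latest", "transform.py", raw, clean),
        step_config("registry.example.com/etl-load:latest", "load.py", clean),
    ]


if __name__ == "__main__":
    for cfg in build_etl_configs("s3://example-bucket/source.csv", "s3://example-bucket/run-1"):
        print(cfg)

# With dagster-k8s installed, each dict could be supplied along these lines:
#   extract_op = k8s_job_op.configured(configs[0], name="extract_op")
# and the resulting ops chained inside a @job to enforce ordering.
```

With this layout Dagster only sequences the containers; the data itself never passes through Dagster, which is why asset-level features do not come for free.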
h
Right. In that case, don't we lose most of the benefits of Dagster, like data assets, by using k8s_job_op? Is there a way to still make use of data assets? What other benefits do we get from Dagster by using k8s_job_op, other than orchestration of ops?
d
Yes, I think it’s safe to say that by using dagster to orchestrate arbitrary containers you lose a lot of the benefits - maybe worth considering option B then
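Option B is not spelled out in this thread. Assuming it means keeping the steps as regular Dagster ops, running them with `k8s_job_executor`, and overriding the image per op through the `dagster-k8s/config` tag, the tag payload could be built as in this sketch. The helper `image_tags` and the image names are hypothetical, and the exact tag shape should be checked against the linked discussion and the dagster-k8s docs for your version.

```python
from typing import Dict


def image_tags(image: str) -> Dict[str, dict]:
    """Build op tags asking for this op's step pod to use a specific image."""
    return {"dagster-k8s/config": {"container_config": {"image": image}}}


# Hypothetical per-step images, one per set of conflicting dependencies.
STEP_IMAGES = {
    "extract": "registry.example.com/etl-extract:latest",
    "transform": "registry.example.com/etl-transform:latest",
    "load": "registry.example.com/etl-load:latest",
}

STEP_TAGS = {step: image_tags(image) for step, image in STEP_IMAGES.items()}

if __name__ == "__main__":
    for step, tags in STEP_TAGS.items():
        print(step, tags)

# With dagster and dagster-k8s installed, the tags would be attached like:
#   @op(tags=STEP_TAGS["transform"])
#   def transform(df): ...
# inside a job defined with executor_def=k8s_job_executor.
```

Under this assumption each image needs Dagster and the job's code installed, since the step pod runs the op itself. In exchange, the ops return ordinary Python values, so an I/O manager can handle passing data between them.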
h
is there any working example for approach B available in the docs/tests?
d
I put a code example in the discussion, but there isn't currently something more comprehensive than that that i'm aware of