Hi i am implementing an ETL pipeline in dagster and want to dagster #deployment-kubernetes

Haris Akhtar

05/22/2023, 8:06 AM

Hi, I am implementing an ETL pipeline in Dagster and want to use K8s for job execution. Each step of the ETL needs to be packaged in its own image due to dependency conflicts, and each step needs to be executed in a separate pod or job. I'm wondering:

1. How to pass inputs/outputs between steps: is there an automatic way of handling this?
2. How to trigger each step in a separate pod/job: are there any docs available for this?

At a very high level, I have the following structure in mind. Is this the correct way to do this?

    @op()
    def extract() -> pd.DataFrame:
        # Run a Kubernetes job, passing the file path as a reference
        ...
        # Once finished, get the results
        return results

    @op()
    def transform(df: pd.DataFrame) -> pd.DataFrame:
        # Run a Kubernetes job, passing df
        ...
        # Once done, get the results
        return results

    @op()
    def load(df: pd.DataFrame):
        # Run a Kubernetes job, passing df
        ...

Any hints/guidance will be really appreciated since I'm just getting started with Dagster. Thanks a lot!

daniel

05/22/2023, 3:27 PM

Hi Haris - I wrote up a discussion answer about how to do #2 in your question here: https://github.com/dagster-io/dagster/discussions/14387

daniel

05/22/2023, 3:27 PM

For #1 - Dagster has a concept called an IO manager that automatically handles persisting outputs between ops: https://docs.dagster.io/concepts/io-management/io-managers
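
To make the IO manager idea concrete, here is a minimal standard-library sketch of the pattern, not Dagster code. The method names handle_output and load_input mirror the two methods of Dagster's IOManager, but the signatures are simplified, and PickleFileStore, run_pipeline, and the sample rows are hypothetical names invented for illustration. In Dagster the framework calls these methods for you between ops; here the calls are written out by hand so the hand-off is visible.

```python
import pickle
import tempfile
from pathlib import Path


class PickleFileStore:
    """Hypothetical stand-in for an IO manager: one pickle file per step output."""

    def __init__(self, base_dir):
        self.base_dir = Path(base_dir)

    def _path(self, step_name):
        return self.base_dir / f"{step_name}.pkl"

    def handle_output(self, step_name, value):
        # Persist a step's return value so that a later step, possibly
        # running in another process or pod, can read it back.
        self.base_dir.mkdir(parents=True, exist_ok=True)
        with open(self._path(step_name), "wb") as f:
            pickle.dump(value, f)

    def load_input(self, upstream_step_name):
        # Load whatever the upstream step persisted.
        with open(self._path(upstream_step_name), "rb") as f:
            return pickle.load(f)


def extract():
    # Sample rows standing in for a DataFrame.
    return [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]


def transform(rows):
    return [{**row, "amount_doubled": row["amount"] * 2} for row in rows]


def run_pipeline(store):
    # Each step only sees storage, never the in-memory result of the previous step.
    store.handle_output("extract", extract())
    store.handle_output("transform", transform(store.load_input("extract")))
    return store.load_input("transform")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_pipeline(PickleFileStore(tmp)))
```

For steps that run in separate pods, the storage location would need to be reachable from every pod (for example object storage or a shared volume) rather than a local temporary directory.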

Haris Akhtar

05/22/2023, 5:49 PM

Thanks @daniel for your response. Is there still a benefit to using data assets if we go with k8s_job_op (option A)? How can we use serde with this approach? Is there a working example in the docs?

daniel

05/22/2023, 5:50 PM

I don’t think we have any built in support for serializing dataframes into the container when using k8s_job_op
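
Since nothing serializes the dataframe into the container automatically, one possible workaround is to pass data by reference: the upstream step writes a file to shared storage, and the container receives only the paths. Below is a rough sketch of what the container-side entrypoint could look like, using only the standard library. The environment variable names INPUT_PATH and OUTPUT_PATH, the function names run_step and demo, and the CSV layout are all assumptions made for illustration; how the variables get set on the container is left out on purpose.

```python
import csv
import os
import tempfile


def run_step(environ):
    """Read rows from INPUT_PATH, double the amount column, write to OUTPUT_PATH."""
    input_path = environ["INPUT_PATH"]    # hypothetical variable name
    output_path = environ["OUTPUT_PATH"]  # hypothetical variable name

    with open(input_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # Placeholder transformation standing in for the real step logic.
    for row in rows:
        row["amount"] = str(int(row["amount"]) * 2)

    with open(output_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def demo():
    # Local dry run: a temporary directory stands in for shared storage.
    # Inside a real container the call would be run_step(os.environ).
    with tempfile.TemporaryDirectory() as tmp:
        input_path = os.path.join(tmp, "input.csv")
        output_path = os.path.join(tmp, "output.csv")
        with open(input_path, "w", newline="") as f:
            f.write("id,amount\n1,10\n2,25\n")
        run_step({"INPUT_PATH": input_path, "OUTPUT_PATH": output_path})
        with open(output_path, newline="") as f:
            return list(csv.DictReader(f))


if __name__ == "__main__":
    print(demo())
```

The trade-off is that the orchestrator then only tracks paths, not the data itself, which matches the loss of benefits discussed in the following messages.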

Haris Akhtar

05/22/2023, 6:02 PM

Right. In that case, by using k8s_job_op, don't we lose most of the benefits of Dagster, like data assets? Is there a way to still make use of data assets? What other benefits do we get from Dagster by using k8s_job_op, other than orchestration of ops?

daniel

05/22/2023, 6:03 PM

Yes, I think it’s safe to say that by using dagster to orchestrate arbitrary containers you lose a lot of the benefits - maybe worth considering option B then

Haris Akhtar

05/22/2023, 7:52 PM

Is there any working example of approach B available in the docs/tests?

daniel

05/22/2023, 8:01 PM

I put a code example in the discussion, but there isn't currently something more comprehensive than that that i'm aware of
