# ask-community

Saurabh

01/19/2023, 2:42 PM
Hey guys, we are trying Dagster as a replacement for Airflow and are in the process of writing some pipelines. All our workloads run on Kubernetes on GKE. Based on our initial evaluation we thought the repository gRPC server was the central place where all Dagster jobs/assets exist and that it serves the other Dagster components. Our POC had all assets defined in the same Dagster image, so it worked pretty well. Now that we are in the process of productionizing it, we realized the gRPC server only serves the Dagit web server and the daemon. The Kubernetes jobs started by the K8sRunLauncher expect to have the job definitions inside their own image. For additional context, our ops then launch pods to process the workload. In our current setup, we don't have the assets/jobs available in the job launched by the K8sRunLauncher. This has become a bit tricky for us and we will need to think of a way to centralize all the jobs. The main thing we wanted to understand is why the gRPC server does not serve all Dagster components. We expected the repository to be a central place to store code, and that the job pod would be able to communicate with the repository to fetch the code it needs to run. I couldn't find these answers in the Dagster docs, so I thought maybe you could point me towards some answers.
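A minimal sketch of the setup being described, assuming a hypothetical module name (`my_pipelines/definitions.py`): the gRPC code server is started from the user code image and serves this repository to Dagit and the daemon.

```python
# my_pipelines/definitions.py -- hypothetical module name
from dagster import asset, define_asset_job, repository


@asset
def raw_events():
    """Placeholder asset; real logic lives in the user code image."""
    ...


events_job = define_asset_job("events_job", selection="raw_events")


@repository
def my_repo():
    # Everything Dagit and the daemon know about comes from this repository.
    return [raw_events, events_job]


# The code server is started from the same image that contains this module, e.g.:
#   dagster api grpc --module-name my_pipelines.definitions --host 0.0.0.0 --port 4000
# Dagit and the daemon then point at it via a grpc_server entry in workspace.yaml.
```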

rex

01/19/2023, 3:02 PM
This is to provide isolation in run execution: if the jobs launched by the K8sRunLauncher relied on the gRPC server and the gRPC server went down, we don’t want all the jobs to stop working while they are still in progress. Same thing with the Dagster UI: the jobs do not directly interface with the UI’s server, but write directly to the database. Again, if the UI goes down, the jobs will still run autonomously.
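A rough illustration of the "write directly to the database" point, assuming a shared storage backend (e.g. Postgres) configured in `dagster.yaml`: the run worker in the launched pod loads the instance from `DAGSTER_HOME` and reads/writes run state against that storage rather than calling Dagit or the gRPC server.

```python
from dagster import DagsterInstance

# In the run pod (as in Dagit), the instance is loaded from
# $DAGSTER_HOME/dagster.yaml, which points at the shared storage.
instance = DagsterInstance.get()

# Run status and event logs come straight from that storage, so they remain
# readable and writable even if Dagit or the gRPC server is down.
run = instance.get_run_by_id("11111111-1111-1111-1111-111111111111")  # hypothetical run id
if run is not None:
    print(run.status)
```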

Yeachan Park

01/19/2023, 3:22 PM
> This is to provide isolation in run execution: if the jobs launched by the K8sRunLauncher relied on the gRPC server and the gRPC server went down, we don’t want all the jobs to stop working while they are still in progress.
Doesn't it just need to sync the code between the repository and the job pod once at the beginning? If I understand correctly, information about op status/progress is written directly to the database anyway, so I'm not sure why the job would fail halfway. Even if it can't sync at the beginning, I expect this to be less of an issue, since I assume most people will be running highly available repositories, seeing as no jobs can be scheduled if the repository is down - and I expect the repository will be getting re-deployed frequently anyway, since the image of the repository will be used to launch a job from the K8sRunLauncher?

rex

01/19/2023, 3:28 PM
It’s more than just syncing code - all the dependencies that are required to run the code will be needed. This is why the K8sRunLauncher pulls the image and runs the code from there. I was assuming a world where the gRPC server was executing code, while the job pod was simply calling into the gRPC server to trigger which ops/assets to execute/materialize.
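A sketch of the "image carries the code and its dependencies" point, assuming the `dagster-k8s/config` tag still accepts a `container_config` image override; the image name is a placeholder.

```python
from dagster import job, op


@op
def do_work():
    ...


# Hypothetical image name: the K8sRunLauncher creates the run pod from an
# image like this, so the op code and every library it imports must already
# be installed in that image.
@job(
    tags={
        "dagster-k8s/config": {
            "container_config": {"image": "gcr.io/my-project/my-pipelines:latest"},
        }
    }
)
def my_k8s_job():
    do_work()
```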

Yeachan Park

01/19/2023, 3:42 PM
> all the dependencies that are required to run the code will be needed. This is why the K8sRunLauncher pulls the image and runs the code from there.
Ah OK, so the main reason the job pod isn't connected to the repository is that there would have been too much load on it to sync this? I guess that would be more relevant if users are running actual work within the job pod itself, using for example the `in_process_executor`? For our use case, we're using something similar to the `k8s_job_executor` (we're just using pods without the jobs abstraction). All the other dependencies for the job pod would be relatively static - it just needs to be able to start pods, etc. - and could be baked into the job pod image, since the actual dependencies needed to run the business logic are baked into the image that gets started for an op by the job pod. So is the reason it's set up this way that the majority of users run their business logic inside the scheduler?
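For reference, a minimal sketch of the two executor setups mentioned above (the op and job names are made up): the executor only decides where each step of an already-launched run executes.

```python
from dagster import in_process_executor, job, op
from dagster_k8s import k8s_job_executor


@op
def crunch_numbers():
    ...


# Every step runs inside the run pod's own process.
@job(executor_def=in_process_executor)
def lightweight_job():
    crunch_numbers()


# Each step is launched as its own Kubernetes job/pod.
@job(executor_def=k8s_job_executor)
def heavy_job():
    crunch_numbers()
```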

rex

01/19/2023, 3:54 PM
I’m not even sure if syncing was an option up for contention when we integrated with gRPC (but another core team member could chime in here)
> So is the reason it's set up this way that the majority of users run their business logic inside the scheduler?
Not sure what you mean here - our scheduler is a daemon process. The daemon doesn’t execute any work, but submits runs to be launched. In the case of the default run launcher, the run is executed within the gRPC server itself; in the case of the K8sRunLauncher, as we’re talking about here, the run is executed in a job pod (with the same image that is running on the gRPC server).

Yeachan Park

01/19/2023, 4:07 PM
> Not sure what you mean here - our scheduler is a daemon process
Sorry for being unclear - I was just asking whether the majority of users run their business logic within Dagster itself (not within the scheduler daemon): so, in the context of the K8sRunLauncher, running their actual business logic inside the job pod that gets spun up, as opposed to us, where we calculate the business logic outside of Dagster in a separate container - i.e. Dagster is only responsible for scheduling.
> I’m not even sure if syncing was an option up for contention when we integrated with gRPC (but another core team member could chime in here)
Yeah, I would be interested in understanding that if possible.
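A rough sketch of the pattern Yeachan describes - a Dagster op launching a separate pod that runs the business logic - using the Kubernetes Python client directly; the image, namespace, and pod name are placeholders.

```python
import time

from dagster import op
from kubernetes import client, config


@op
def run_business_logic_pod():
    # Assumes the op runs in-cluster with RBAC permission to create pods.
    config.load_incluster_config()
    core = client.CoreV1Api()

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="business-logic-123"),  # placeholder name
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="worker",
                    image="gcr.io/my-project/business-logic:latest",  # placeholder image
                )
            ],
        ),
    )
    core.create_namespaced_pod(namespace="pipelines", body=pod)

    # Poll until the external pod finishes; the heavy lifting happens there,
    # not in the Dagster run pod, which only does the scheduling.
    while True:
        phase = core.read_namespaced_pod_status("business-logic-123", "pipelines").status.phase
        if phase in ("Succeeded", "Failed"):
            break
        time.sleep(10)
```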