# ask-community
Hello all. We have a Dagster Hybrid deployment using the Kubernetes executor. Is it possible to reduce the spin-up time of each job pod? It's adding a lot of overhead. I found the Celery documentation, but this appears to be for self-hosted only. Edit: I found this piece of info talking about "non-isolated runs", maybe this is related? https://docs.dagster.io/master/dagster-cloud/developing-testing/deployment-settings-reference#non-isolated-runs
Hi Archie - non-isolated runs are currently only available on Serverless, but we'd like to bring them to Hybrid as well in the future. Are you using the k8s_job_executor that runs each op in its own Kubernetes pod? Or is this just slow startup time for the pod that is created for each run?
Hello, yes, I am using the k8s_job_executor and an S3 IO manager. This is my code location:
location_name: my_pipeline
image: ************.dkr.ecr.us-east-1.amazonaws.com/my-pipeline:staging
  package_name: my_pipeline
      - my-pipeline-aws-access-key
      - AWS_ACCESS_KEY_ID=****************
      - DATABASE_HOST=**********
        cpu: 250m
        memory: 1024Mi
The problem is start-up times for each op are very slow.
"each op in its own kubernetes pod"
That sounds like the issue. Ideally each job would be in its own pod
we're keen on getting this fixed since it could save many days off our processing time 🙂
If you take out the k8s_job_executor and use the default executor instead, it will do everything all in one pod, with each op in its own subprocess on that pod
Do you have a sample slow run you can link to in cloud? I can take a look in our logs and verify that we are talking about the same things, the terminology can get a bit confusing
Ah, I see the confusion about k8s executors. Here is a sample run:
[CloudK8sRunLauncher] Creating Kubernetes run worker job
Kubernetes Job name
Kubernetes Namespace
Run ID
That takes about 2 minutes on production and about 44 seconds on my local machine
OK, right - so it's doing that once per run, not once per op
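To make the per-run versus per-op distinction concrete, here is a rough back-of-the-envelope sketch. It assumes every pod takes the same time to start and that startups do not overlap, so it is an upper bound rather than a measurement; the function and parameter names are hypothetical.

```python
def startup_overhead_seconds(
    pod_startup_s: float, runs: int, ops_per_run: int, pod_per_op: bool
) -> float:
    """Estimate total pod startup overhead under a simple, non-overlapping model."""
    # Every run gets one run worker pod. With a pod-per-op executor
    # (such as k8s_job_executor), each op adds one more pod on top of that.
    pods_per_run = 1 + (ops_per_run if pod_per_op else 0)
    return pod_startup_s * runs * pods_per_run


if __name__ == "__main__":
    # 120 s matches the roughly 2 minute startup reported in this thread;
    # 5 ops per run is an arbitrary illustrative number.
    print(startup_overhead_seconds(120, runs=1, ops_per_run=5, pod_per_op=False))
    print(startup_overhead_seconds(120, runs=1, ops_per_run=5, pod_per_op=True))
```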
Hm, 2 minutes is a bit slow even for kubernetes...
that's using a powerful worker node too
It does sound like non-isolated runs might help here, with the caveat that there are some tradeoffs: specifically, losing isolation makes it easier for different runs to interfere with each other. I expect we will support that on Hybrid fairly soon, but likely sometime after the holidays
ok makes sense
I noticed that if I launch, say, 10 runs concurrently, the slowdown becomes more noticeable
Although I don't think in general 2 minutes startup time is expected in kubernetes - let me see what kind of startup time we see in our own hybrid k8s cluster
almost as if the run coordinator is waiting for resources to free up before starting the next op
Do you have a link to a run that's slow? I can compare the startup logs with one of our hybrid k8s runs
This one was running concurrently with 9 others at the time