Hello all We have a Dagster Hybrid deployment using the kube dagster #ask-community

Archie Kennedy

12/13/2022, 10:07 AM

Hello all. We have a Dagster Hybrid deployment using the kubernetes executor. Is it possible to reduce the spin up time of each job pod? It's adding a lot of overhead. I found the Celery documentation but this appears to be for self-hosted only. Edit: I found this piece of info talking about "non-isolated runs", maybe this is related? https://docs.dagster.io/master/dagster-cloud/developing-testing/deployment-settings-reference#non-isolated-runs

daniel

12/13/2022, 12:58 PM

Hi Archie - non-isolated runs are currently only on for serverless but we'd like to bring it to Hybrid as well in the future. Are you using the k8s_job_executor that does each op in its own kubernetes pod? Or is this just slow startup times for the pod that happens for each run?

Archie Kennedy

12/13/2022, 1:22 PM

Hello yes I am using the k8s_job_executor and an S3 io manager, this is my code location:

location_name: my_pipeline
image: ************.dkr.ecr.us-east-1.amazonaws.com/my-pipeline:staging
code_source:
  package_name: my_pipeline
container_context:
  k8s:
    env_secrets:
      - my-pipeline-aws-access-key
    env_vars:
      - AWS_ACCESS_KEY_ID=****************
      - DATABASE_HOST=**********
    resources:
      requests:
        cpu: 250m
        memory: 1024Mi

The problem is start-up times for each op are very slow.

Archie Kennedy

12/13/2022, 1:24 PM

each op in its own kubernetes pod

That sounds like the issue. Ideally each job would run in its own pod.

Archie Kennedy

12/13/2022, 1:50 PM

we're keen on getting this fixed since it could save many days off our processing time 🙂

daniel

12/13/2022, 2:35 PM

If you take out the k8s_job_executor and use the default executor instead, it will do everything all in one pod, with each op in its own subprocess on that pod

daniel

12/13/2022, 2:36 PM

Do you have a sample slow run you can link to in cloud? I can take a look in our logs and verify that we are talking about the same things, the terminology can get a bit confusing

Archie Kennedy

12/13/2022, 3:20 PM

Ah, I see the confusion about k8s executors. Here is a sample run:

[CloudK8sRunLauncher] Creating Kubernetes run worker job
Kubernetes Job name
dagster-run-64f20da2-2c40-4367-a3ce-864ae1f4a7a4
Kubernetes Namespace
dagster-cloud
Run ID
64f20da2-2c40-4367-a3ce-864ae1f4a7a4

That takes about 2 minutes on production and about 44 seconds on my local machine

daniel

12/13/2022, 3:21 PM

OK, right - so it's doing that once per run, not once per op

daniel

12/13/2022, 3:21 PM

Hm, 2 minutes is a bit slow even for kubernetes...

Archie Kennedy

12/13/2022, 3:22 PM

that's using a powerful worker node too

daniel

12/13/2022, 3:23 PM

It does sound like non-isolated runs might help here (with the caveat that there are some tradeoffs there: specifically, losing isolation makes it easier for different runs to mess each other up). I expect we will support that on Hybrid fairly soon, but likely sometime after the holidays

Archie Kennedy

12/13/2022, 3:25 PM

ok makes sense

Archie Kennedy

12/13/2022, 3:26 PM

I noticed that if I run say 10 runs concurrently the slowdown becomes more noticeable

daniel

12/13/2022, 3:26 PM

Although I don't think in general 2 minutes startup time is expected in kubernetes - let me see what kind of startup time we see in our own hybrid k8s cluster

Archie Kennedy

12/13/2022, 3:27 PM

almost as if the run coordinator is waiting for resources to free up before starting the next op

daniel

12/13/2022, 3:28 PM

Do you have a link to a run that's slow? I can compare the startup logs with one of our hybrid k8s runs

Archie Kennedy

12/13/2022, 3:30 PM

https://aihealth.dagster.cloud/stage/runs/ec546d6e-81f5-41b8-9164-f27d4b4729cd?logFileKey=urgggmuk

Archie Kennedy

12/13/2022, 3:31 PM

This one was running concurrently with 9 others at the time
