#deployment-kubernetes

Adam McCartney

06/23/2022, 7:56 AM
hi all, in relation to https://dagster.slack.com/archives/C014N0PK37E/p1655970710731319?thread_ts=1655541804.905499&cid=C014N0PK37E we are investigating slow spin-up times for our k8s jobs, which sometimes take more than 90 secs from execution to the start of the job in k8s. is there anything in particular we should look at? we are running dagster helm 0.15.0 in AWS EKS with 1 m4 large and 2 m4 medium nodes.

Andrea Giardini

06/23/2022, 8:03 AM
can you describe the pod and post here the list of pod events?
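A minimal sketch of how the two requests above (describe the pod, list its events) could be turned into `kubectl` invocations. The helper name `pod_event_commands` is hypothetical, and the default namespace `dagster` is an assumption taken from the event output posted later in the thread.

```python
import shlex
from typing import List


def pod_event_commands(pod: str, namespace: str = "dagster") -> List[List[str]]:
    """Build kubectl argument lists for describing a pod and listing its events.

    The namespace default is an assumption; adjust it to your deployment.
    """
    describe = ["kubectl", "describe", "pod", pod, "--namespace", namespace]
    events = [
        "kubectl", "get", "events",
        "--namespace", namespace,
        # Only show events that refer to this pod.
        f"--field-selector=involvedObject.name={pod}",
        # Oldest first, so gaps between events are easier to spot.
        "--sort-by=.lastTimestamp",
    ]
    return [describe, events]


# Print the commands so they can be reviewed before running them.
for argv in pod_event_commands("dagster-run-ab72cb46-74b0-4aeb-b5cf-ef870655b7d1-rj6jh"):
    print(shlex.join(argv))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; pass each list to `subprocess.run` if you want to execute them against a cluster.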

Adam McCartney

06/23/2022, 8:07 AM
LAST SEEN   TYPE     REASON      OBJECT                                                       MESSAGE
21m         Normal   Scheduled   pod/dagster-run-ab72cb46-74b0-4aeb-b5cf-ef870655b7d1-rj6jh   Successfully assigned dagster/dagster-run-ab72cb46-74b0-4aeb-b5cf-ef870655b7d1-rj6jh to ip-10-1-2-194.eu-west-1.compute.internal
21m         Normal   Pulling     pod/dagster-run-ab72cb46-74b0-4aeb-b5cf-ef870655b7d1-rj6jh   Pulling image "583736480904.dkr.ecr.eu-west-1.amazonaws.com/dagster-pipelines-k8s:latest"
21m         Normal   Pulled      pod/dagster-run-ab72cb46-74b0-4aeb-b5cf-ef870655b7d1-rj6jh   Successfully pulled image "583736480904.dkr.ecr.eu-west-1.amazonaws.com/dagster-pipelines-k8s:latest" in 108.165958ms
21m         Normal   Created     pod/dagster-run-ab72cb46-74b0-4aeb-b5cf-ef870655b7d1-rj6jh   Created container dagster
21m         Normal   Started     pod/dagster-run-ab72cb46-74b0-4aeb-b5cf-ef870655b7d1-rj6jh   Started container dagster
the pod starts up pretty quickly, but then there's a lag between the pod starting and the job executing.

Andrea Giardini

06/23/2022, 8:09 AM
pod logs?

Adam McCartney

06/23/2022, 8:10 AM
just checking for anything i need to redact..

Andrea Giardini

06/23/2022, 8:10 AM
take your time 🙂

Adam McCartney

06/23/2022, 8:14 AM
the pod is created at 2022-06-23T08:45:43+01:00, then the pod logs start at 2022-06-23T08:47:04+01:00
i'd like to understand what's causing the gap in between..
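To put a number on that gap, here is a small sketch that subtracts the two timestamps quoted above, read as 08:45:43 and 08:47:04 at UTC+01:00. The function name `startup_gap_seconds` is hypothetical.

```python
from datetime import datetime


def startup_gap_seconds(pod_created: str, first_log: str) -> float:
    """Return the seconds between pod creation and the first log line.

    Both arguments are ISO 8601 timestamps with a UTC offset.
    """
    created = datetime.fromisoformat(pod_created)
    started_logging = datetime.fromisoformat(first_log)
    return (started_logging - created).total_seconds()


gap = startup_gap_seconds("2022-06-23T08:45:43+01:00", "2022-06-23T08:47:04+01:00")
print(f"gap between pod creation and first log line: {gap:.0f} s")  # 81 s
```

An 81 second gap is in line with the "more than 90 secs" reported at the top of the thread, so most of the delay appears to fall after the container has started.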

Andrea Giardini

06/23/2022, 8:15 AM
all these events seem to be triggered within the same second (correct me if i'm wrong)... do you see a gap in the event list?
There is nothing before that?

Adam McCartney

06/23/2022, 8:17 AM
no.. perhaps a logging setting i could change if there is something missing?

Andrea Giardini

06/23/2022, 8:19 AM
what do you see happening during this gap? does the pod get scheduled immediately and go in a 'Running' state?

Adam McCartney

06/23/2022, 8:19 AM
yes. that's correct.

Andrea Giardini

06/23/2022, 8:20 AM
can you describe the pod? maybe you have very restrictive resource limits/requests?
docker image size is usually a contributing factor but that does not seem to be a problem in your case
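One way to check the resource limits/requests question is to read them out of the pod's JSON, for example from `kubectl get pod <pod> -n dagster -o json`. The sketch below assumes the standard Pod schema (`spec.containers[].resources`); the helper name `container_resources` and the sample values are illustrative, not taken from the actual cluster.

```python
import json
from typing import Any, Dict


def container_resources(pod: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Map each container name to its resource requests and limits.

    Missing sections are returned as empty dicts, which is what a chart
    that sets no resources would produce.
    """
    result = {}
    for container in pod.get("spec", {}).get("containers", []):
        resources = container.get("resources") or {}
        result[container["name"]] = {
            "requests": resources.get("requests", {}),
            "limits": resources.get("limits", {}),
        }
    return result


# Illustrative pod document; the CPU and memory values are made up.
sample_pod = {
    "spec": {
        "containers": [
            {"name": "dagster", "resources": {"limits": {"cpu": "100m", "memory": "128Mi"}}},
        ]
    }
}
print(json.dumps(container_resources(sample_pod), indent=2))
```

A very low CPU limit can throttle the container while Python starts up, so an empty result here would point away from resource limits as the cause.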

Adam McCartney

06/23/2022, 8:25 AM
i'm using the helm chart, so any limits are most likely left as default..

Andrea Giardini

06/23/2022, 8:32 AM
Indeed, the container starts quickly
Started:      Thu, 23 Jun 2022 08:45:45 +0100
Finished:     Thu, 23 Jun 2022 08:47:04 +0100
Are you doing some pre-processing? pulling data from somewhere? can you share your dockerfile maybe?
you could also try to start a run, get into the pod with `k exec` as soon as it's running and run `ps aux` or any other tool to understand what it is doing
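A hedged sketch of that suggestion: build a `kubectl exec ... -- ps aux` command for the run pod and capture a process snapshot. The helper names are hypothetical, the namespace and container name `dagster` are assumptions based on the events above, and `ps` may not be installed in slim base images.

```python
import subprocess
from typing import List


def exec_ps_command(pod: str, namespace: str = "dagster", container: str = "dagster") -> List[str]:
    """Build the kubectl argument list that runs `ps aux` inside the pod."""
    return [
        "kubectl", "exec", pod,
        "--namespace", namespace,
        "--container", container,
        "--",  # everything after this is the command run inside the container
        "ps", "aux",
    ]


def snapshot_processes(pod: str, namespace: str = "dagster", container: str = "dagster") -> str:
    """Run `ps aux` in the pod and return its output; raises if kubectl fails."""
    completed = subprocess.run(
        exec_ps_command(pod, namespace, container),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Calling `snapshot_processes` a few times during the quiet period, a second or two apart, would show which process is running and whether its CPU time is growing.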

Adam McCartney

06/23/2022, 8:35 AM
the image is prebuilt and in a registry.. there's no data being pulled from anywhere.. i'll try and get some more info from the pod when it spins up.. thanks for your help so far.. i may be back with more questions!

Andrea Giardini

06/23/2022, 8:44 AM
Sure thing, happy to help

Binoy Shah

06/23/2022, 5:51 PM
@Adam McCartney was this issue resolved?

Adam McCartney

06/24/2022, 12:28 PM
not yet.. i've rebuilt our Kubernetes image, which is based on python:3.9-slim, with some extra logging, so i'm going to do some forensics soon.. it's not top of my priority list right now and i might call on some pair programming with a colleague to figure it out..
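As one possible shape for that extra logging, the sketch below times named startup phases and logs each duration with a timestamp. It is an illustration only, not what was actually added to the image; the names `timed` and `startup-timing` are hypothetical, and the `json` import is just a stand-in for a heavier dependency.

```python
import logging
import time
from contextlib import contextmanager
from typing import Dict, Iterator

logging.basicConfig(format="%(asctime)s %(name)s %(message)s", level=logging.INFO)
log = logging.getLogger("startup-timing")


@contextmanager
def timed(phase: str, durations: Dict[str, float]) -> Iterator[None]:
    """Log how long the wrapped block takes and record it in `durations`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        durations[phase] = elapsed
        log.info("%s took %.3f s", phase, elapsed)


durations: Dict[str, float] = {}

# Wrap suspected slow steps, such as heavy imports, at the top of the entry point.
with timed("import json", durations):
    import json  # noqa: F401  (stand-in for a heavier dependency)

with timed("sleep placeholder", durations):
    time.sleep(0.01)  # stand-in for real startup work
```

Running the entry point with `python -X importtime` is another way to see per-module import cost without changing the code.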