# deployment-kubernetes
s
Hi, I have a job that is stuck in the Starting phase on K8s. How do I troubleshoot what might have gone wrong? Not a single job has managed to start for me; all of them are stuck like this.
a
`kubectl describe pod $podname`? and `kubectl describe job $jobname`
s
@Andrea Giardini Not a single pod/job is visible other than the default dagster-daemon and dagster-dagit
Copy code
jovyan@piskorzs-v2-0:~/work/GitRepos/platform$ kubectl get pods -n dagster
NAME                                                              READY   STATUS    RESTARTS   AGE
dagster-daemon-7b576b46f9-r4ppg                                   1/1     Running   0          38h
dagster-dagit-5b6b8946bc-5mwlr                                    1/1     Running   0          38h
dagster-dagster-user-deployments-k8s-dagster-poc-simon-65fjdpnd   1/1     Running   0          38h
dagster-dagster-user-deployments-k8s-example-user-code-3-6lb78c   1/1     Running   0          38h
dagster-postgresql-0                                              1/1     Running   0          38h
There are code repositories and postgres as well
a
what about the jobs?
s
None
Copy code
jovyan@piskorzs-v2-0:~/work/GitRepos/platform$ kubectl get jobs -n dagster
No resources found in dagster namespace.
a
can you post a screenshot of your dagit page?
s
Screenshot 2023-03-06 at 12.00.43.png
a
can you show me the screenshot of one run?
one of the runs that gets stuck
s
It might be that the dagster service account does not have privileges to launch a job. But shouldn't it error out visibly in that case?
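One way to check that from the cluster side (a sketch; I'm assuming the runs use the `dagster` service account in the `dagster` namespace):
Copy code
kubectl auth can-i create jobs -n dagster \
  --as=system:serviceaccount:dagster:dagster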
a
yeah it could be, that's weird… dagit seems to think that the job was created. Any error in the dagster-daemon logs?
s
None
Copy code
jovyan@piskorzs-v2-0:~/work/GitRepos/platform$ kubectl logs -n dagster dagster-daemon-7b576b46f9-r4ppg 

  Telemetry:

  As an open source project, we collect usage statistics to inform development priorities. For more
  information, read <https://docs.dagster.io/install#telemetry>.

  We will not see or store solid definitions, pipeline definitions, modes, resources, context, or
  any data that is processed within solids and pipelines.

  To opt-out, add the following to $DAGSTER_HOME/dagster.yaml, creating that file if necessary:

    telemetry:
      enabled: false


  Welcome to Dagster!

  If you have any questions or would like to engage with the Dagster team, please join us on Slack
  (<https://bit.ly/39dvSsF>).

2023-03-06 05:45:32 +0000 - dagster.daemon - INFO - Instance is configured with the following daemons: ['BackfillDaemon', 'SchedulerDaemon', 'SensorDaemon']
2023-03-06 05:45:32 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
...
2023-03-06 10:56:53 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 10:57:53 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 10:58:53 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 10:59:54 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 11:00:54 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 11:01:55 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 11:02:56 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
a
strange. can you create a new run from dagit and run `k get job` right after?
how did you install dagster? helm chart?
s
Helm, but we had an issue with the service account, so we used Kustomize to patch it. It still sounds like a bug if it doesn't error out, so that's why I came here
Copy code
jovyan@piskorzs-v2-0:~/work/GitRepos/platform$ kubectl get job --all-namespaces
NAMESPACE       NAME                             COMPLETIONS   DURATION   AGE
ingress-nginx   ingress-nginx-admission-create   1/1           5s         53d
ingress-nginx   ingress-nginx-admission-patch    1/1           6s         53d
Screenshot 2023-03-06 at 12.07.33.png
Copy code
jovyan@piskorzs-v2-0:~/work/GitRepos/platform$ kubectl get sa -n dagster dagster -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::083749379286:role/gtm-core-eks-uat-euc1-cfn-eksclustergtmsadagsterda-F9H0IFEFTLSB
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{"eks.amazonaws.com/role-arn":"arn:aws:iam::083749379286:role/gtm-core-eks-uat-euc1-cfn-eksclustergtmsadagsterda-F9H0IFEFTLSB"},"labels":{"app.kubernetes.io/name":"dagster","aws.cdk.eks/prune-c8c325efef07e37e6691673585c3559fcc4effbb9a":"","git-commit-sha":"2c2c774"},"name":"dagster","namespace":"dagster"}}
  creationTimestamp: "2023-03-01T14:00:59Z"
  labels:
    app.kubernetes.io/name: dagster
    aws.cdk.eks/prune-c8c325efef07e37e6691673585c3559fcc4effbb9a: ""
    git-commit-sha: 2c2c774
  name: dagster
  namespace: dagster
  resourceVersion: "29874861"
  uid: 3dbd6b3e-c724-4b75-b112-07f832b649c7
secrets:
- name: dagster-token-skh82
@Andrea Giardini The SA that we patched with Kustomize (above) looks really similar to the one that should be generated by the dagster Helm chart; all the roles and rolebindings are still intact. From Helm:
Copy code
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/instance: dagster
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: dagster
    app.kubernetes.io/version: 1.1.11
    helm.sh/chart: dagster-1.1.11
  name: dagster
The only real difference I see is that some of the labels are missing. Do you have a mechanism that relies on those labels that could potentially fail the job run?
@Bartosz Kopytek cc
After a more thorough investigation it turned out the node group was scaled down to zero, BUT I think there is a bug in dagster on version `1.1.11`. The job and pod both delete themselves after an unsuccessful schedule, which removes any sign of the error on both the `kubectl` and dagit sides. The only way to troubleshoot it was to have a `watch` command scan `kubectl get jobs -n dagster` every 0.01 seconds and describe anything that showed up (a rough sketch of that loop is below, after the events output). Later we discovered that those failures were also visible in `kubectl get events -n dagster`:
Copy code
jovyan@piskorzs-v2-0:~/work/GitRepos/platform/k8s/eks/assets/k8s_manifests/dagster$ kubectl get events -n dagster
LAST SEEN   TYPE      REASON                   OBJECT                                                       MESSAGE
3m26s       Warning   FailedScheduling         pod/dagster-run-03a9455d-9a35-4579-bf43-32f13b644a4a-lv79g   0/10 nodes are available: 1 node(s) had taint {hcp-linking-pipelines: }, that the pod didn't tolerate, 3 node(s) had taint {deployments-control-plane: }, that the pod didn't tolerate, 6 node(s) had taint {kf-notebooks: }, that the pod didn't tolerate.
3m26s       Normal    SuccessfulCreate         job/dagster-run-03a9455d-9a35-4579-bf43-32f13b644a4a         Created pod: dagster-run-03a9455d-9a35-4579-bf43-32f13b644a4a-lv79g
6s          Warning   FailedScheduling         pod/dagster-run-28b3240c-3f83-4c86-b004-4f4b68df2ab9-qrd6x   0/10 nodes are available: 1 node(s) had taint {hcp-linking-pipelines: }, that the pod didn't tolerate, 3 node(s) had taint {deployments-control-plane: }, that the pod didn't tolerate, 6 node(s) had taint {kf-notebooks: }, that the pod didn't tolerate.
6s          Normal    SuccessfulCreate         job/dagster-run-28b3240c-3f83-4c86-b004-4f4b68df2ab9         Created pod: dagster-run-28b3240c-3f83-4c86-b004-4f4b68df2ab9-qrd6x
4m10s       Warning   FailedScheduling         pod/dagster-run-9f566c73-519f-4b55-ae52-130ac9df35a8-qpzqz   0/10 nodes are available: 1 node(s) had taint {hcp-linking-pipelines: }, that the pod didn't tolerate, 3 node(s) had taint {deployments-control-plane: }, that the pod didn't tolerate, 6 node(s) had taint {kf-notebooks: }, that the pod didn't tolerate.
4m10s       Normal    SuccessfulCreate         job/dagster-run-9f566c73-519f-4b55-ae52-130ac9df35a8         Created pod: dagster-run-9f566c73-519f-4b55-ae52-130ac9df35a8-qpzqz
43m         Normal    SuccessfullyReconciled   targetgroupbinding/k8s-dagster-dagsterd-ad19985f97           Successfully reconciled
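For reference, the polling loop we used was roughly this (a sketch; the interval and namespace are just what we happened to use):
Copy code
# crude polling loop to catch the short-lived run job/pod before it disappears
watch -n 0.1 'kubectl get jobs,pods -n dagster -o wide; kubectl describe jobs -n dagster'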
I don't think the deletion of the pod and job is the desired behaviour; the K8s version of dagster should use the backoff mechanism instead, because there is no good way of telling what happened to a pod after it has been deleted.
Can we please have someone let us know whether this has been fixed in any version later than `1.1.11`? We did not want to introduce the additional complexity of a version bump until we had found the cause.
d
Hi Szymon - I'm not aware of anything in Dagster that would delete the pod or the job for you if it fails to schedule - the deletion there would most likely have been initiated within your cluster from some other place (maybe there's a default TTL set?).
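One quick way to check whether a TTL is set on the run jobs (a sketch, assuming a job sticks around long enough to inspect):
Copy code
kubectl get jobs -n dagster \
  -o custom-columns=NAME:.metadata.name,TTL:.spec.ttlSecondsAfterFinished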
s
@daniel The pods were getting 0 resources because the Helm chart for user deployments defaults to zero resources. I had to look at the process history in the kernel on the nodes to verify that. Testing right now with the resources bumped.
d
Hmm, I thought the default if no resources were set was 'unlimited resources', not 'no resources' (which has its own problems) - it might differ between clusters/clouds. Where is your k8s cluster running?
s
EKS
The worst thing was that it was getting silently killed - not even an OOM kill from the kubelet, as it should have been.
d
Hmmm, that's very odd - what you're describing is different from what I've seen on EKS in the past
this is a random Medium article, so not the most reputable source, but it is consistent with what I've observed in the past: https://reuvenharrison.medium.com/kubernetes-resource-limits-defaults-and-limitranges-f1eed8655474
You don't have a strict LimitRange defined in your cluster or anything like that I assume?
s
Not that I know of, will see if I can reproduce the issue once again just to make sure. This is how it looks in the `describe pod` and `get events` commands:
I checked and there is no LimitRange in any namespace on the cluster
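Roughly how I checked for LimitRanges (a sketch):
Copy code
kubectl get limitrange --all-namespaces
kubectl describe limitrange -n dagster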
d
Trying to think of what else could be different/unique - what k8s version are you on?
s
1.23
Some new findings since yesterday: we could not schedule a pod on our node group, and the autoscaler did not scale up in time. We increased the capacity and can now run the code-example job, but we still cannot run our dbt project on the custom image.
We bumped the version of dagster to 1.1.20 yesterday and enabled run monitoring. The dbt runs are still stuck in the Starting phase and just time out after 220 seconds or so
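For reference, run monitoring was enabled with roughly these Helm values (a sketch from memory; the key names and the timeout default may differ between chart versions, so double-check against the chart's values.yaml):
Copy code
dagsterDaemon:
  runMonitoring:
    enabled: true
    # assumption: raise this if the ~220s start timeout is too tight
    startTimeoutSeconds: 300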
The container logs are never there because the container inside the pod never starts, so I can't do a `kubectl logs`.
But the weird behaviour that is the real problem is the instant killing of pods; you cannot troubleshoot anything if they die like that. I even created a job by hand from a manifest (roughly the one below) and forced it to exit with code 42 - and it did. It did not get silently killed.
So only the jobs created by dagit are treated that way. The other containers have their history and events presented normally. That's why we thought it was the resource limits from Helm.
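The hand-made test job was roughly this (a sketch; the name and image are just what I picked for the test):
Copy code
apiVersion: batch/v1
kind: Job
metadata:
  name: exit-code-test
  namespace: dagster
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: test
          image: busybox
          # force a non-zero exit so the failure shows up in job/pod status
          command: ["sh", "-c", "exit 42"]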
d
That is very odd and sounds very frustrating - I don't think I've seen another report of pods just getting silently killed like that
Have you tried with the example image that's included with the Helm chart? That would help rule out it being something specific to your image
Would it be a lot of work to try it in a different small test cluster? My suspicion is still that something about your cluster may be in an unusual state, just because the symptoms are so different than any reports we've seen before
s
We have two user deployments currently, one being the example deployment and the other our custom one. We have been able to run the example deployment just fine since yesterday, when I fixed the scheduling problem.
So there are deployments that can run on this cluster
Screenshot 2023-03-07 at 16.31.05.png
This is working just fine
This one does not
Screenshot 2023-03-07 at 16.31.36.png
d
Is it that the user code deployment won't start up, or that it fails when you go to launch a run for that user code deployment?
Sounds like the latter?
s
It's stuck in the Starting phase
and then fails because of a timeout
d
The run is, right? Not the user code deployment
s
Yes
The run is stuck in the Starting phase
d
here's a question - do those affinity and tolerations and nodeSelector fields need to be on the run as well?
s
Not really, the run gets scheduled on our default node group
And I saw the job being created, and so did the pod, but it died instantly
the tolerations and affinity are there to separate the ad-hoc loads from the things that run statically
d
got it
But it does sound like it's something about the image that induces the problem then, right? If the example job runs fine but this one doesn't, and the deployments are otherwise identical
s
I am just wondering if that's not something specific to the type of repo inside the image
Inside our image is a dbt project loaded from a manifest
120 MB of manifest
We are just running a small part of that manifest, around 40 assets total
But it does not seem to know how to start
d
It's not impossible that it's running out of memory, but I've never seen that result in the pod getting silently killed before - usually there's a reasonably clear OOMKilled message on the pod
s
This is an article I found where people had the same issue when using Helm
d
interesting - Have you tried bumping the memory limits way up?
s
Right now I bumped them to 2 GB of RAM and 1000m of CPU
Waiting for the AWS CDK pipeline to finish and will come back here with results
👍 1
So if I configure resources as follows:
Copy code
dagster-user-deployments:
  deployments:
    - name: "k8s-dagster-poc-simon"
      image:
        repository: "<http://083749379286.dkr.ecr.eu-central-1.amazonaws.com/common/dagster_poc/refactored_project_repo|083749379286.dkr.ecr.eu-central-1.amazonaws.com/common/dagster_poc/refactored_project_repo>"
        tag: f1ac2a0d5698166ce065cdbb5bfb9b8fdacc4d7a
        pullPolicy: Always
      dagsterApiGrpcArgs:
        - "--python-file"
        - "./repo.py"
      port: 3030
      envSecrets:
        - name: redshift-dbt-secrets-envs
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - eu-central-1a
                      - eu-central-1b
      tolerations:
        - key: "deployments-control-plane"
          operator: "Exists"
          effect: "NoSchedule"
      nodeSelector:
        deployments-control-plane: "true"
      resources:
        limits:
          cpu: 1000m
          memory: 2Gi
        requests:
          cpu: 100m
          memory: 128Mi
Does this mean the pod with the code repo will have those resources, or will every single run have them? The default values file mentions that if nothing is specified, the K8s scheduler values will be used, but I am still unsure whether that applies to the code repo or to the runs of the actual assets.
d
right now those will just be applied to the user code deployment - you can apply default run resource limits at the run launcher level in the helm chart here: https://github.com/dagster-io/dagster/blob/master/helm/dagster/values.yaml#L535-L546 Or for individual jobs via tags, like in the example here: https://docs.dagster.io/deployment/guides/kubernetes/customizing-your-deployment#per-job-or-per-op-kubernetes-configuration
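Something roughly like this at the run launcher level (a sketch based on the linked values.yaml; double-check the exact keys against your chart version):
Copy code
runLauncher:
  type: K8sRunLauncher
  config:
    k8sRunLauncher:
      # default resources applied to every run pod that dagster launches
      resources:
        limits:
          cpu: 1000m
          memory: 2Gi
        requests:
          cpu: 100m
          memory: 128Mi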
s
Ah okay, that might be the reason. It didn't work with resources at the user-code-deployment level. Trying to bump them on the K8sRunLauncher now
@daniel Coming back here after a week of fighting. You were right: Argo CD was pruning those jobs/pods because by default they are seen as part of the application, not as ephemeral containers spawned by the application. Had to annotate them with Prune=false for Argo and everything clicked.
🎉 1
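For anyone hitting this later, the annotation we ended up putting on the run jobs was roughly this (a sketch; how you propagate it onto the dagster-launched jobs depends on how Argo CD tracks your resources):
Copy code
metadata:
  annotations:
    # tell Argo CD not to prune this resource even though it is not in the app manifest
    argocd.argoproj.io/sync-options: Prune=false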