# deployment-kubernetes
s
Hi, I have a job that is stuck in the Starting phase on K8s. How do I troubleshoot what might have gone wrong? Not a single job has managed to start for me; all of them are stuck like this.
a
`kubectl describe pod $podname`? and `kubectl describe job $jobname`
s
@Andrea Giardini Not a single pod/job is visible other than the default dagster-daemon and dagster-dagit
Copy code
jovyan@piskorzs-v2-0:~/work/GitRepos/platform$ kubectl get pods -n dagster
NAME                                                              READY   STATUS    RESTARTS   AGE
dagster-daemon-7b576b46f9-r4ppg                                   1/1     Running   0          38h
dagster-dagit-5b6b8946bc-5mwlr                                    1/1     Running   0          38h
dagster-dagster-user-deployments-k8s-dagster-poc-simon-65fjdpnd   1/1     Running   0          38h
dagster-dagster-user-deployments-k8s-example-user-code-3-6lb78c   1/1     Running   0          38h
dagster-postgresql-0                                              1/1     Running   0          38h
There are code repositories and postgres as well
a
what about the jobs?
s
None
Copy code
jovyan@piskorzs-v2-0:~/work/GitRepos/platform$ kubectl get jobs -n dagster
No resources found in dagster namespace.
a
can you post a screenshot of your dagit page?
s
Screenshot 2023-03-06 at 12.00.43.png
a
can you show me the screenshot of one run?
one of the runs that gets stuck
s
It might be that the dagster service account does not have privileges to launch a job. But shouldn't it error out visibly in that case?
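One way to check that from the cluster side (a sketch; I'm assuming the runs use the `dagster` service account in the `dagster` namespace):
Copy code
kubectl auth can-i create jobs -n dagster \
  --as=system:serviceaccount:dagster:dagster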
a
yeah it could be, that's weird… dagit seems to think that the job was created. Any error in the dagster-daemon logs?
s
None
Copy code
jovyan@piskorzs-v2-0:~/work/GitRepos/platform$ kubectl logs -n dagster dagster-daemon-7b576b46f9-r4ppg 

  Telemetry:

  As an open source project, we collect usage statistics to inform development priorities. For more
  information, read <https://docs.dagster.io/install#telemetry>.

  We will not see or store solid definitions, pipeline definitions, modes, resources, context, or
  any data that is processed within solids and pipelines.

  To opt-out, add the following to $DAGSTER_HOME/dagster.yaml, creating that file if necessary:

    telemetry:
      enabled: false


  Welcome to Dagster!

  If you have any questions or would like to engage with the Dagster team, please join us on Slack
  (<https://bit.ly/39dvSsF>).

2023-03-06 05:45:32 +0000 - dagster.daemon - INFO - Instance is configured with the following daemons: ['BackfillDaemon', 'SchedulerDaemon', 'SensorDaemon']
2023-03-06 05:45:32 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
...
2023-03-06 10:56:53 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 10:57:53 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 10:58:53 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 10:59:54 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 11:00:54 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 11:01:55 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2023-03-06 11:02:56 +0000 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
a
strange. can you create a new run from dagit and run `k get job` right after?
how did you install dagster? helm chart?
s
Helm, but we had an issue with the service account, so we used Kustomize to patch it. It still sounds like a bug if it doesn't error out, so that's why I came here
Copy code
jovyan@piskorzs-v2-0:~/work/GitRepos/platform$ kubectl get job --all-namespaces
NAMESPACE       NAME                             COMPLETIONS   DURATION   AGE
ingress-nginx   ingress-nginx-admission-create   1/1           5s         53d
ingress-nginx   ingress-nginx-admission-patch    1/1           6s         53d
Screenshot 2023-03-06 at 12.07.33.png
Copy code
jovyan@piskorzs-v2-0:~/work/GitRepos/platform$ kubectl get sa -n dagster dagster -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::083749379286:role/gtm-core-eks-uat-euc1-cfn-eksclustergtmsadagsterda-F9H0IFEFTLSB
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{"eks.amazonaws.com/role-arn":"arn:aws:iam::083749379286:role/gtm-core-eks-uat-euc1-cfn-eksclustergtmsadagsterda-F9H0IFEFTLSB"},"labels":{"app.kubernetes.io/name":"dagster","aws.cdk.eks/prune-c8c325efef07e37e6691673585c3559fcc4effbb9a":"","git-commit-sha":"2c2c774"},"name":"dagster","namespace":"dagster"}}
  creationTimestamp: "2023-03-01T14:00:59Z"
  labels:
    app.kubernetes.io/name: dagster
    aws.cdk.eks/prune-c8c325efef07e37e6691673585c3559fcc4effbb9a: ""
    git-commit-sha: 2c2c774
  name: dagster
  namespace: dagster
  resourceVersion: "29874861"
  uid: 3dbd6b3e-c724-4b75-b112-07f832b649c7
secrets:
- name: dagster-token-skh82
@Andrea Giardini The SA that we patched with Kustomize (above) looks really similar to the one that should be generated by the dagster Helm chart; all the roles and rolebindings are still intact. From Helm:
Copy code
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/instance: dagster
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: dagster
    app.kubernetes.io/version: 1.1.11
    helm.sh/chart: dagster-1.1.11
  name: dagster
The only real difference I see is that some of the labels are missing. Do you have a mechanism that relies on those labels that could potentially fail the job run?
@Bartosz Kopytek cc
After a more thorough investigation it turned out the node group was scaled down to zero, BUT I think there is a bug in dagster on version `1.1.11`. The job and pod both delete themselves after an unsuccessful schedule, which removes any sign of the error on both the `kubectl` and dagit sides. The only way to troubleshoot it was to have a `watch` command scan `kubectl get jobs -n dagster` every 0.01 seconds and describe anything that showed up (a rough sketch of that loop is below, after the events output). Later we discovered that those failures were also visible in `kubectl get events -n dagster`:
Copy code
jovyan@piskorzs-v2-0:~/work/GitRepos/platform/k8s/eks/assets/k8s_manifests/dagster$ kubectl get events -n dagster
LAST SEEN   TYPE      REASON                   OBJECT                                                       MESSAGE
3m26s       Warning   FailedScheduling         pod/dagster-run-03a9455d-9a35-4579-bf43-32f13b644a4a-lv79g   0/10 nodes are available: 1 node(s) had taint {hcp-linking-pipelines: }, that the pod didn't tolerate, 3 node(s) had taint {deployments-control-plane: }, that the pod didn't tolerate, 6 node(s) had taint {kf-notebooks: }, that the pod didn't tolerate.
3m26s       Normal    SuccessfulCreate         job/dagster-run-03a9455d-9a35-4579-bf43-32f13b644a4a         Created pod: dagster-run-03a9455d-9a35-4579-bf43-32f13b644a4a-lv79g
6s          Warning   FailedScheduling         pod/dagster-run-28b3240c-3f83-4c86-b004-4f4b68df2ab9-qrd6x   0/10 nodes are available: 1 node(s) had taint {hcp-linking-pipelines: }, that the pod didn't tolerate, 3 node(s) had taint {deployments-control-plane: }, that the pod didn't tolerate, 6 node(s) had taint {kf-notebooks: }, that the pod didn't tolerate.
6s          Normal    SuccessfulCreate         job/dagster-run-28b3240c-3f83-4c86-b004-4f4b68df2ab9         Created pod: dagster-run-28b3240c-3f83-4c86-b004-4f4b68df2ab9-qrd6x
4m10s       Warning   FailedScheduling         pod/dagster-run-9f566c73-519f-4b55-ae52-130ac9df35a8-qpzqz   0/10 nodes are available: 1 node(s) had taint {hcp-linking-pipelines: }, that the pod didn't tolerate, 3 node(s) had taint {deployments-control-plane: }, that the pod didn't tolerate, 6 node(s) had taint {kf-notebooks: }, that the pod didn't tolerate.
4m10s       Normal    SuccessfulCreate         job/dagster-run-9f566c73-519f-4b55-ae52-130ac9df35a8         Created pod: dagster-run-9f566c73-519f-4b55-ae52-130ac9df35a8-qpzqz
43m         Normal    SuccessfullyReconciled   targetgroupbinding/k8s-dagster-dagsterd-ad19985f97           Successfully reconciled
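For reference, the polling loop we used was roughly this (a sketch; the interval and namespace are just what we happened to use):
Copy code
# crude polling loop to catch the short-lived run job/pod before it disappears
watch -n 0.1 'kubectl get jobs,pods -n dagster -o wide; kubectl describe jobs -n dagster'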
I don't think the deletion of the pod and job is the desired behaviour; the K8s version of dagster should use the backoff mechanism instead, because there is no good way of telling what happened to a pod after it has been deleted.
Can we please have someone let us know whether this has been fixed in any version later than `1.1.11`? We did not want to introduce the additional complexity of a version bump until we had found the cause.
d
Hi Szymon - I'm not aware of anything in Dagster that would delete the pod or the job for you if it fails to schedule - the deletion there would most likely have been initiated within your cluster from some other place (maybe there's a default TTL set?).
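One quick way to check whether a TTL is set on the run jobs (a sketch, assuming a job sticks around long enough to inspect):
Copy code
kubectl get jobs -n dagster \
  -o custom-columns=NAME:.metadata.name,TTL:.spec.ttlSecondsAfterFinished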
s
@daniel The pods were getting 0 resources because the Helm chart for user deployments defaults to zero resources. I had to look at the process history in the kernel on the nodes to verify that. Testing right now with the resources bumped.
d
Hmm, I thought the default if no resources were set was 'unlimited resources', not 'no resources' (which has its own problems) - it might differ between clusters/clouds. Where is your k8s cluster running?
s
EKS
The worst thing was that it was getting silently killed - not even an OOM kill from the kubelet, as it should have been.
d
Hmmm, that's very odd - what you're describing is different from what I've seen on EKS in the past
this is a random Medium article, so not the most reputable source, but it is consistent with what I've observed in the past: https://reuvenharrison.medium.com/kubernetes-resource-limits-defaults-and-limitranges-f1eed8655474
You don't have a strict LimitRange defined in your cluster or anything like that I assume?
s
Not that I know of, will see if I can reproduce the issue once again just to make sure. This is how it looks in the `describe pod` and `get events` commands:
I checked and there is no LimitRange in any namespace on the cluster
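Roughly how I checked for LimitRanges (a sketch):
Copy code
kubectl get limitrange --all-namespaces
kubectl describe limitrange -n dagster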
d
Trying to think of what else could be different/unique - what k8s version are you on?
s
1.23
Some new findings since yesterday: we could not schedule a pod on our node group, and the autoscaler did not scale up in time. We increased the capacity and can now run the code-example job, but we still cannot run our dbt project on the custom image.
We bumped the version of dagster to 1.1.20 yesterday and enabled run monitoring. The dbt runs are still stuck in the Starting phase and just time out after 220 seconds or so
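For reference, run monitoring was enabled with roughly these Helm values (a sketch from memory; the key names and the timeout default may differ between chart versions, so double-check against the chart's values.yaml):
Copy code
dagsterDaemon:
  runMonitoring:
    enabled: true
    # assumption: raise this if the ~220s start timeout is too tight
    startTimeoutSeconds: 300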
The container logs are never there because the container inside the pod never starts, so I can't do a `kubectl logs`.
But the weird behaviour that is the real problem is the instant killing of pods; you cannot troubleshoot anything if they die like that. I even created a job by hand from a manifest (roughly the one below) and forced it to exit with code 42 - and it did. It did not get silently killed.
So only the jobs created by dagit are treated that way. The other containers have their history and events presented normally. That's why we thought it was the resource limits from Helm.
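The hand-made test job was roughly this (a sketch; the name and image are just what I picked for the test):
Copy code
apiVersion: batch/v1
kind: Job
metadata:
  name: exit-code-test
  namespace: dagster
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: test
          image: busybox
          # force a non-zero exit so the failure shows up in job/pod status
          command: ["sh", "-c", "exit 42"]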
d
That is very odd and sounds very frustrating - I don't think I've seen another report of pods just getting silently killed like that
Have you tried with the example image that's included with the Helm chart? That would help rule out it being something specific to your image
Would it be a lot of work to try it in a different small test cluster? My suspicion is still that something about your cluster may be in an unusual state, just because the symptoms are so different than any reports we've seen before
s
We have two user deployments currently, one being the example deployment and the other our custom one. We have been able to run the example deployment just fine since yesterday, when I fixed the scheduling problem.
So there are deployments that can run on this cluster
Screenshot 2023-03-07 at 16.31.05.png
This is working just fine
This one does not
Screenshot 2023-03-07 at 16.31.36.png
d
Is it that the user code deployment won't start up, or that it fails when you go to launch a run for that user code deployment?
Sounds like the latter?
s
It's stuck in the Starting phase
and then fails because of a timeout
d
The run is, right? Not the user code deployment
s
Yes
The run is stuck in the Starting phase
d
here's a question - do those affinity and tolerations and nodeSelector fields need to be on the run as well?
s
Not really, the run gets scheduled on our default node group
And I saw the job being created, and so did the pod, but it died instantly
the tolerations and affinity are there to separate the ad-hoc loads from the things that run statically
d
got it
But it does sound like it's something about the image that induces the problem then, right? If the example job runs fine but this one doesn't, and the deployments are otherwise identical
s
I am just wondering if that's not something specific to the type of repo inside the image
Inside our image is a dbt project loaded from a manifest
120 MB of manifest
We are just running a small part of that manifest, around 40 assets total
But it does not seem to know how to start
d
It's not impossible that it's running out of memory, but I've never seen that result in the pod getting silently killed before - usually there's a reasonably clear OOMKilled message on the pod
s
This is an article I found where people had the same issue when using Helm
d
interesting - Have you tried bumping the memory limits way up?
s
Right now I bumped them to 2 GB of RAM and 1000m of CPU
Waiting for the AWS CDK pipeline to finish and will come back here with results
👍 1
So if I configure resources as follows:
Copy code
dagster-user-deployments:
  deployments:
    - name: "k8s-dagster-poc-simon"
      image:
        repository: "<http://083749379286.dkr.ecr.eu-central-1.amazonaws.com/common/dagster_poc/refactored_project_repo|083749379286.dkr.ecr.eu-central-1.amazonaws.com/common/dagster_poc/refactored_project_repo>"
        tag: f1ac2a0d5698166ce065cdbb5bfb9b8fdacc4d7a
        pullPolicy: Always
      dagsterApiGrpcArgs:
        - "--python-file"
        - "./repo.py"
      port: 3030
      envSecrets:
        - name: redshift-dbt-secrets-envs
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - eu-central-1a
                      - eu-central-1b
      tolerations:
        - key: "deployments-control-plane"
          operator: "Exists"
          effect: "NoSchedule"
      nodeSelector:
        deployments-control-plane: "true"
      resources:
        limits:
          cpu: 1000m
          memory: 2Gi
        requests:
          cpu: 100m
          memory: 128Mi
Does this mean the pod with the code repo will have those resources, or will every single run have them? The default values file mentions that if nothing is specified, the K8s scheduler values will be used, but I am still unsure whether that applies to the code repo or to the runs of the actual assets.
d
right now those will just be applied to the user code deployment - you can apply default run resource limits at the run launcher level in the helm chart here: https://github.com/dagster-io/dagster/blob/master/helm/dagster/values.yaml#L535-L546 Or for individual jobs via tags, like in the example here: https://docs.dagster.io/deployment/guides/kubernetes/customizing-your-deployment#per-job-or-per-op-kubernetes-configuration
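Something roughly like this at the run launcher level (a sketch based on the linked values.yaml; double-check the exact keys against your chart version):
Copy code
runLauncher:
  type: K8sRunLauncher
  config:
    k8sRunLauncher:
      # default resources applied to every run pod that dagster launches
      resources:
        limits:
          cpu: 1000m
          memory: 2Gi
        requests:
          cpu: 100m
          memory: 128Mi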
s
Ah okay, that might be the reason. It didn't work with resources at the user-code-deployment level. Trying to bump them on the K8sRunLauncher now
@daniel Coming back here after a week of fighting. You were right: Argo CD was pruning those jobs/pods because by default they are seen as part of the application, not as ephemeral containers spawned by the application. Had to annotate them with Prune=false for Argo and everything clicked.
🎉 1
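For anyone hitting this later, the annotation we ended up putting on the run jobs was roughly this (a sketch; how you propagate it onto the dagster-launched jobs depends on how Argo CD tracks your resources):
Copy code
metadata:
  annotations:
    # tell Argo CD not to prune this resource even though it is not in the app manifest
    argocd.argoproj.io/sync-options: Prune=false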