# deployment-kubernetes

Eldan Hamdani

02/24/2022, 1:17 PM
Hi. I created a nodeAffinity rule, which already works, and now I’m trying to add topologySpreadConstraints, but it fails for me. Can someone point out where I went wrong?
tags={
    'dagster-k8s/config': {
        'container_config': {
            'resources': {
                'requests': {'cpu': '250m', 'memory': '64Mi'},
                'limits': {'cpu': '500m', 'memory': '2560Mi'},
            }
        },
        'pod_template_spec_metadata': {'annotations': {"cluster-autoscaler.kubernetes.io/safe-to-evict": "true"}},
        'pod_spec_config': {
            'affinity': {
                'nodeAffinity': {
                    'requiredDuringSchedulingIgnoredDuringExecution': {
                        'nodeSelectorTerms': [
                            {
                                'matchExpressions': [
                                    {
                                        'key': 'cloud.google.com/gke-nodepool',
                                        'operator': 'In',
                                        'values': ['immunai-pipeline-pool'],
                                    }
                                ]
                            }
                        ]
                    }
                }
            },
            'topologySpreadConstraints': [
                {
                    'maxSkew': 1,
                    'topologyKey': 'kubernetes.io/hostname',
                    'whenUnsatisfiable': 'DoNotSchedule',
                    'labelSelector': {
                        'matchLabels': {
                            'cloud.google.com/gke-nodepool': 'immunai-pipeline-pool'
                        }
                    },
                }
            ],
        },
    },
},

daniel

02/24/2022, 1:20 PM
Hi, how exactly does it fail? If there's an error would you mind posting the text and stack trace?

Eldan Hamdani

02/24/2022, 1:23 PM
Hmm, wait, maybe I forgot a comma.
Here’s the error:
TypeError: __init__() got an unexpected keyword argument 'topologySpreadConstraints'
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/utils.py", line 34, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/execution/launch_execution.py", line 16, in launch_pipeline_execution
    return _launch_pipeline_execution(graphene_info, execution_params)
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/execution/launch_execution.py", line 50, in _launch_pipeline_execution
    run = do_launch(graphene_info, execution_params, is_reexecuted)
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/execution/launch_execution.py", line 38, in do_launch
    workspace=graphene_info.context,
  File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/__init__.py", line 1434, in submit_run
    SubmitRunContext(run, workspace=workspace)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/run_coordinator/default_run_coordinator.py", line 32, in submit_run
    self._instance.launch_run(pipeline_run.run_id, context.workspace)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/__init__.py", line 1498, in launch_run
    self._run_launcher.launch_run(LaunchRunContext(pipeline_run=run, workspace=workspace))
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/launcher.py", line 312, in launch_run
    user_defined_k8s_config=user_defined_k8s_config,
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/job.py", line 616, in construct_dagster_k8s_job
    **user_defined_k8s_config.pod_spec_config,
@daniel do you support that case, using affinity and topologySpreadConstraints together?

daniel

02/24/2022, 1:57 PM
I think the rules for which fields you supply with underscores (topology_spread_constraints) and which you supply in camelCase (topologySpreadConstraints) are in a slightly confusing/inconsistent state right now. We actually have a PR for this (https://github.com/dagster-io/dagster/pull/6205) that I'll prioritize getting merged. Until that change lands, it's possible that changing topologySpreadConstraints to topology_spread_constraints and leaving everything else the same will work.
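A minimal sketch of that suggested rename, for readers following along. It assumes, based only on the traceback above (`**user_defined_k8s_config.pod_spec_config` passed to an `__init__()`), that this dagster-k8s version forwards the top-level `pod_spec_config` keys as keyword arguments, so only the top-level key changes. The name `run_tags` is hypothetical, and the `affinity` and `container_config` entries from the original tags are omitted for brevity.

```python
# Sketch only: top-level key renamed to snake_case, nested keys left exactly
# as in the original message. `run_tags` is a hypothetical name.
run_tags = {
    "dagster-k8s/config": {
        "pod_spec_config": {
            # was 'topologySpreadConstraints', which raised the TypeError
            "topology_spread_constraints": [
                {
                    "maxSkew": 1,
                    "topologyKey": "kubernetes.io/hostname",
                    "whenUnsatisfiable": "DoNotSchedule",
                    "labelSelector": {
                        "matchLabels": {
                            "cloud.google.com/gke-nodepool": "immunai-pipeline-pool"
                        }
                    },
                }
            ],
        },
    },
}
```

Whether this is still needed once the linked PR lands is not stated in the thread.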

Eldan Hamdani

02/24/2022, 2:19 PM
@daniel OK, I did that and ran 2 jobs in parallel, but they ran on the same node rather than on different nodes… do you know why?
@daniel what I’m trying to do is run 1 pod per node

daniel

02/24/2022, 2:22 PM
I might have to raise that question to others on the team with more k8s expertise (assuming that dagster is setting the config on the job/pod that you expected - if it is not, that I can help with)

Eldan Hamdani

02/24/2022, 2:25 PM
FYI @Igal Dahan
@daniel do you have an estimate of when you’ll get back to me?

daniel

02/24/2022, 2:31 PM
Sometime today with what we have, but since this is more of a general k8s question than a dagster specific question (if I understand it correctly) I can't promise you that we’ll have a clear answer for you. Will do our best though!

Eldan Hamdani

02/24/2022, 2:32 PM
OK, great! Thank you so much, Daniel! 😀 Waiting for your answer.

daniel

02/24/2022, 3:22 PM
Is it possible for you to post the output of "kubectl describe " on the pod that gets created? Does it have everything there that you expect based on what you set in dagster?

Eldan Hamdani

02/24/2022, 3:24 PM
eldanhamdani@Eldans-MacBook-Pro helm % kubectl describe pod dagster-run-308cd8a3-4459-4d15-a1af-e99d85d41979-kz697 
Name:         dagster-run-308cd8a3-4459-4d15-a1af-e99d85d41979-kz697
Namespace:    default
Priority:     0
Node:         gke-dagster-omic-dat-immunai-pipeline-ca9c3a0a-2nd6/10.128.0.39
Start Time:   Thu, 24 Feb 2022 16:16:37 +0200
Labels:       app.kubernetes.io/component=run_worker
              app.kubernetes.io/instance=dagster
              app.kubernetes.io/name=dagster
              app.kubernetes.io/part-of=dagster
              app.kubernetes.io/version=0.13.4
              controller-uid=d84eba91-377d-4daf-9b79-ab8cffea7826
              job-name=dagster-run-308cd8a3-4459-4d15-a1af-e99d85d41979
Annotations:  cluster-autoscaler.kubernetes.io/safe-to-evict: true
Status:       Succeeded
IP:           10.10.4.11
IPs:
  IP:           10.10.4.11
Controlled By:  Job/dagster-run-308cd8a3-4459-4d15-a1af-e99d85d41979
Containers:
  dagster:
    Container ID:  containerd://34ebd018bc9d5230346a21b3802c85030ce462cfbd22d7dd5a9d4c903244f05b
    Image:         gcr.io/immunai-registry-hub/panacea-ai/immunai-product-single_sample_pipeline:fb-topology-Spread-Constraints
    Image ID:      gcr.io/immunai-registry-hub/panacea-ai/immunai-product-single_sample_pipeline@sha256:4f670c50b45b673cb82a84bd37ca48a85efe215a165dcbd358618a910405b0d9
    Port:          <none>
    Host Port:     <none>
    Args:
      /usr/bin/python3
      -m
      dagster
      api
      execute_run
      {"__class__": "ExecuteRunArgs", "instance_ref": null, "pipeline_origin": {"__class__": "PipelinePythonOrigin", "pipeline_name": "staging_single_sample_job", "repository_origin": {"__class__": "RepositoryPythonOrigin", "code_pointer": {"__class__": "ModuleCodePointer", "fn_name": "staging_single_sample_repo", "module": "single_sample_pipeline"}, "container_image": "gcr.io/immunai-registry-hub/panacea-ai/immunai-product-single_sample_pipeline:fb-topology-Spread-Constraints", "executable_path": "/usr/bin/python3"}}, "pipeline_run_id": "308cd8a3-4459-4d15-a1af-e99d85d41979"}
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 24 Feb 2022 16:16:39 +0200
      Finished:     Thu, 24 Feb 2022 16:17:59 +0200
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  2560Mi
    Requests:
      cpu:     250m
      memory:  64Mi
    Environment Variables from:
      dagster-pipeline-env  ConfigMap  Optional: false
    Environment:
      DAGSTER_HOME:         /opt/dagster/dagster_home
      DAGSTER_PG_PASSWORD:  <set to the key 'postgresql-password' in secret 'dagster-postgresql-secret'>  Optional: false
      LD_LIBRARY_PATH:      
    Mounts:
      /opt/dagster/dagster_home/dagster.yaml from dagster-instance (rw,path="dagster.yaml")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p8glk (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  dagster-instance:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      dagster-instance
    Optional:  false
  kube-api-access-p8glk:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

daniel

02/24/2022, 3:31 PM
sorry, try this instead?
kubectl get pod dagster-run-308cd8a3-4459-4d15-a1af-e99d85d41979-kz697 -o yaml

Eldan Hamdani

02/24/2022, 3:31 PM
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
  creationTimestamp: "2022-02-24T14:16:37Z"
  generateName: dagster-run-308cd8a3-4459-4d15-a1af-e99d85d41979-
  labels:
    app.kubernetes.io/component: run_worker
    app.kubernetes.io/instance: dagster
    app.kubernetes.io/name: dagster
    app.kubernetes.io/part-of: dagster
    app.kubernetes.io/version: 0.13.4
    controller-uid: d84eba91-377d-4daf-9b79-ab8cffea7826
    job-name: dagster-run-308cd8a3-4459-4d15-a1af-e99d85d41979
  name: dagster-run-308cd8a3-4459-4d15-a1af-e99d85d41979-kz697
  namespace: default
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: dagster-run-308cd8a3-4459-4d15-a1af-e99d85d41979
    uid: d84eba91-377d-4daf-9b79-ab8cffea7826
  resourceVersion: "22674179"
  uid: 03b73058-96ea-4fa7-9ba4-6bbabb5cb682
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-nodepool
            operator: In
            values:
            - immunai-pipeline-pool
  containers:
  - args:
    - /usr/bin/python3
    - -m
    - dagster
    - api
    - execute_run
    - '{"__class__": "ExecuteRunArgs", "instance_ref": null, "pipeline_origin": {"__class__":
      "PipelinePythonOrigin", "pipeline_name": "staging_single_sample_job", "repository_origin":
      {"__class__": "RepositoryPythonOrigin", "code_pointer": {"__class__": "ModuleCodePointer",
      "fn_name": "staging_single_sample_repo", "module": "single_sample_pipeline"},
      "container_image": "gcr.io/immunai-registry-hub/panacea-ai/immunai-product-single_sample_pipeline:fb-topology-Spread-Constraints",
      "executable_path": "/usr/bin/python3"}}, "pipeline_run_id": "308cd8a3-4459-4d15-a1af-e99d85d41979"}'
    env:
    - name: DAGSTER_HOME
      value: /opt/dagster/dagster_home
    - name: DAGSTER_PG_PASSWORD
      valueFrom:
        secretKeyRef:
          key: postgresql-password
          name: dagster-postgresql-secret
    - name: LD_LIBRARY_PATH
    envFrom:
    - configMapRef:
        name: dagster-pipeline-env
    image: gcr.io/immunai-registry-hub/panacea-ai/immunai-product-single_sample_pipeline:fb-topology-Spread-Constraints
    imagePullPolicy: Always
    name: dagster
    resources:
      limits:
        cpu: 500m
        memory: 2560Mi
      requests:
        cpu: 250m
        memory: 64Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /opt/dagster/dagster_home/dagster.yaml
      name: dagster-instance
      subPath: dagster.yaml
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-p8glk
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: gcr-json-key
  nodeName: gke-dagster-omic-dat-immunai-pipeline-ca9c3a0a-2nd6
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: dagster
  serviceAccountName: dagster
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        cloud.google.com/gke-nodepool: immunai-pipeline-pool
    maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
  volumes:
  - configMap:
      defaultMode: 420
      name: dagster-instance
    name: dagster-instance
  - name: kube-api-access-p8glk
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-02-24T14:16:37Z"
    reason: PodCompleted
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-02-24T14:18:00Z"
    reason: PodCompleted
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-02-24T14:18:00Z"
    reason: PodCompleted
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-02-24T14:16:37Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://34ebd018bc9d5230346a21b3802c85030ce462cfbd22d7dd5a9d4c903244f05b
    image: gcr.io/immunai-registry-hub/panacea-ai/immunai-product-single_sample_pipeline:fb-topology-Spread-Constraints
    imageID: gcr.io/immunai-registry-hub/panacea-ai/immunai-product-single_sample_pipeline@sha256:4f670c50b45b673cb82a84bd37ca48a85efe215a165dcbd358618a910405b0d9
    lastState: {}
    name: dagster
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://34ebd018bc9d5230346a21b3802c85030ce462cfbd22d7dd5a9d4c903244f05b
        exitCode: 0
        finishedAt: "2022-02-24T14:17:59Z"
        reason: Completed
        startedAt: "2022-02-24T14:16:39Z"
  hostIP: 10.128.0.39
  phase: Succeeded
  podIP: 10.10.4.11
  podIPs:
  - ip: 10.10.4.11
  qosClass: Burstable
  startTime: "2022-02-24T14:16:37Z"

daniel

02/24/2022, 3:35 PM
OK, I see the topologySpreadConstraints being applied
I asked around a bit and I think this is a bit outside the dagster team's k8s expertise unfortunately 😕 Best advice we have is to look through the logs on the k8s scheduler to try to understand why the rules on the pod/job aren't being respected, or to ask in a more k8s-focused community like the k8s slack: https://slack.k8s.io/ Sorry we can't be more help, but happy to help if there are any other issues with the part where dagster applies config on the pod/job or if there are any missing features there
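The check daniel did by eye (is the constraint actually on the pod?) can also be scripted. The sketch below assumes the manifest was fetched as JSON, for example with `kubectl get pod <pod-name> -o json`, rather than the YAML shown in this thread. The function name `spread_constraints` and the trimmed sample manifest are illustrative, not part of dagster or kubectl.

```python
import json


def spread_constraints(pod_manifest_json: str) -> list:
    """Return spec.topologySpreadConstraints from a pod manifest, or [] if absent."""
    pod = json.loads(pod_manifest_json)
    return pod.get("spec", {}).get("topologySpreadConstraints", [])


# Trimmed stand-in for the manifest posted above (most fields omitted).
sample_manifest = json.dumps(
    {
        "kind": "Pod",
        "spec": {
            "topologySpreadConstraints": [
                {
                    "labelSelector": {
                        "matchLabels": {
                            "cloud.google.com/gke-nodepool": "immunai-pipeline-pool"
                        }
                    },
                    "maxSkew": 1,
                    "topologyKey": "kubernetes.io/hostname",
                    "whenUnsatisfiable": "DoNotSchedule",
                }
            ]
        },
    }
)

constraints = spread_constraints(sample_manifest)
for constraint in constraints:
    print(constraint["topologyKey"], constraint["whenUnsatisfiable"])
```

An empty list here would point at the dagster side (the config never reached the pod); a populated list, as in this thread, points at the scheduler or at the constraint itself.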

ba

02/24/2022, 4:05 PM
I looked a bit through the kubernetes github issues mentioning topologySpreadConstraints and found this issue: https://github.com/kubernetes/kubernetes/issues/107888 The issue creator suggests that they upgraded from Kubernetes v1.22 to v1.23 and no longer see the issue. It could still be coincidence as mentioned, as the scheduler is a very complex part of Kubernetes. The issue template asks for some important info like Kubernetes version, Cloud provider, OS version, Container runtime version, plugins, etc which may all play a role in troubleshooting. I'd suggest making a similar issue, and if possible, going through the scheduler logs and see if there is anything that stands out as relevant.
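One more thing that may be worth checking, though nobody in the thread confirmed it as the cause: in Kubernetes, the labelSelector of a topology spread constraint is matched against pod labels, not node labels. The selector used here, cloud.google.com/gke-nodepool, appears in this thread as a node-selection key, while the run pod shown above carries only the app.kubernetes.io/* labels plus controller-uid and job-name. The sketch below is a deliberately simplified model of the per-node counting, not the scheduler's real algorithm, and the node name and helper function are hypothetical.

```python
from collections import Counter


def count_matching_pods(pods, match_labels):
    """Count, per node, the pods whose labels contain every pair in match_labels."""
    counts = Counter()
    for pod in pods:
        if all(pod["labels"].get(key) == value for key, value in match_labels.items()):
            counts[pod["node"]] += 1
    return counts


# Two run pods that landed on the same (hypothetical) node, labeled like the
# pod in the manifest above.
pods = [
    {"node": "node-a", "labels": {"app.kubernetes.io/component": "run_worker"}},
    {"node": "node-a", "labels": {"app.kubernetes.io/component": "run_worker"}},
]

# Selector from the thread: a node-pool label that the pods do not carry.
by_node_pool_label = count_matching_pods(
    pods, {"cloud.google.com/gke-nodepool": "immunai-pipeline-pool"}
)

# Selector on a label the run pods do carry.
by_pod_label = count_matching_pods(pods, {"app.kubernetes.io/component": "run_worker"})

print(dict(by_node_pool_label))
print(dict(by_pod_label))
```

In this model the node-pool selector matches no pods, so every node counts as empty and there is no skew to correct, whereas a selector on a pod label sees both pods on node-a. If that turns out to be what is happening, pointing the selector at a label the run pods carry would be the thing to try.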

Igal Dahan

02/27/2022, 8:00 AM
@Eldan Hamdani what is the current version we use?