# announcements
j
Hello Dagster team, I am using the k8sRunLauncher w/out Celery. It works great when I just have a single node that matches the nodeSelector. However, if I have more than one node, it will also start the dagster jobs on those nodes, but it fails to pull the docker image. Seems that the imagePullSecrets are only getting loaded by the first node. Any ideas? Is this a Dagster issue or am I doing something wrong? Error from failed job pod:
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  69s                default-scheduler  Successfully assigned dagster/dagster-run-db90fb2b-df8a-4bb2-af55-4cedecb91cb7-nc55c to lke10936-18246-5fd2d5d8e086
  Normal   BackOff    36s (x2 over 67s)  kubelet            Back-off pulling image "docker.pkg.github.com/jeremyhermann/docker-repo/dagster:latest"
  Warning  Failed     36s (x2 over 67s)  kubelet            Error: ImagePullBackOff
  Normal   Pulling    24s (x3 over 68s)  kubelet            Pulling image "docker.pkg.github.com/jeremyhermann/docker-repo/dagster:latest"
  Warning  Failed     24s (x3 over 68s)  kubelet            Failed to pull image "docker.pkg.github.com/jeremyhermann/docker-repo/dagster:latest": rpc error: code = Unknown desc = Error response from daemon: Get https://docker.pkg.github.com/v2/jeremyhermann/docker-repo/dagster/manifests/latest: no basic auth credentials
  Warning  Failed     24s (x3 over 68s)  kubelet            Error: ErrImagePull
My Helm values.yaml file:
dagit:
  nodeSelector:
    lke.linode.com/pool-id: "18034"

postgresql:
  master:
    nodeSelector:
      lke.linode.com/pool-id: "18034"

  
k8sRunLauncher:
  enabled: true
  env_secrets:
    - "aws-secrets-env"
  nodeSelector:
    lke.linode.com/pool-id: "18386"

userDeployments:
  enabled: true 
  deployments:
    - name: "k8s-user-code"
      image:
        repository: "docker.pkg.github.com/jeremyhermann/docker-repo/dagster"
        tag: latest
        pullPolicy: Always
      dagsterApiGrpcArgs:
        - "-f"
        - "training_repo.py"
      replicaCount: 1  
      port: 3030
      env:
        ENV_VAR: ""
      env_config_maps:
        - ""
      env_secrets:
        - "aws-secrets-env"
      nodeSelector:
        lke.linode.com/pool-id: "18386"

      affinity: {}
      tolerations: []
      podSecurityContext: {}
      securityContext: {}
      resources: {}

imagePullSecrets:
  - name: dockerconfigjson-github-com

celery:
  enabled: false

rabbitmq:
  enabled: false
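Since the pull failure above is "no basic auth credentials", it is also worth confirming that the pull secret these values reference actually exists in the dagster namespace. A possible check (a sketch using the secret name from this values file; the GitHub username and token below are placeholders):

kubectl get secret dockerconfigjson-github-com -n dagster

# if it is missing, recreate it roughly like this
kubectl create secret docker-registry dockerconfigjson-github-com \
  --namespace dagster \
  --docker-server=docker.pkg.github.com \
  --docker-username=<github-username> \
  --docker-password=<github-token>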
n
What does the YAML for the failing pod look like?
j
i’ll need to repro and paste that here. assume you mean the yaml that i get from
kubectl get pod <pod name> -o yaml
n
Yes
Possibly with a -n <namespace> if needed
j
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 10.2.40.2/32
  creationTimestamp: "2020-12-14T03:56:42Z"
  generateName: dagster-run-7e0c0df5-c83e-4b8f-9d5c-4aec7f32dcc2-
  labels:
    app.kubernetes.io/component: run_coordinator
    app.kubernetes.io/instance: dagster
    app.kubernetes.io/name: dagster
    app.kubernetes.io/part-of: dagster
    app.kubernetes.io/version: 0.9.21
    controller-uid: 1c2fafc8-e299-4d8e-afda-87cb6be44e48
    job-name: dagster-run-7e0c0df5-c83e-4b8f-9d5c-4aec7f32dcc2
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/name: {}
          f:app.kubernetes.io/part-of: {}
          f:app.kubernetes.io/version: {}
          f:controller-uid: {}
          f:job-name: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"1c2fafc8-e299-4d8e-afda-87cb6be44e48"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:containers:
          k:{"name":"dagster-run-7e0c0df5-c83e-4b8f-9d5c-4aec7f32dcc2"}:
            .: {}
            f:args: {}
            f:command: {}
            f:env:
              .: {}
              k:{"name":"DAGSTER_HOME"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"DAGSTER_PG_PASSWORD"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:secretKeyRef:
                    .: {}
                    f:key: {}
                    f:name: {}
            f:envFrom: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:name: {}
            f:resources:
              .: {}
              f:requests:
                .: {}
                f:cpu: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
            f:volumeMounts:
              .: {}
              k:{"mountPath":"/opt/dagster/dagster_home/dagster.yaml"}:
                .: {}
                f:mountPath: {}
                f:name: {}
                f:subPath: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext: {}
        f:serviceAccount: {}
        f:serviceAccountName: {}
        f:terminationGracePeriodSeconds: {}
        f:volumes:
          .: {}
          k:{"name":"dagster-instance"}:
            .: {}
            f:configMap:
              .: {}
              f:defaultMode: {}
              f:name: {}
            f:name: {}
    manager: kube-controller-manager
    operation: Update
    time: "2020-12-14T03:56:42Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:cni.projectcalico.org/podIP: {}
    manager: calico
    operation: Update
    time: "2020-12-14T03:56:43Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"ContainersReady"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
          k:{"type":"Initialized"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Ready"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
        f:containerStatuses: {}
        f:hostIP: {}
        f:podIP: {}
        f:podIPs:
          .: {}
          k:{"ip":"10.2.40.2"}:
            .: {}
            f:ip: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    time: "2020-12-14T03:56:45Z"
  name: dagster-run-7e0c0df5-c83e-4b8f-9d5c-4aec7f32dcc2-pb6gp
  namespace: dagster
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: dagster-run-7e0c0df5-c83e-4b8f-9d5c-4aec7f32dcc2
    uid: 1c2fafc8-e299-4d8e-afda-87cb6be44e48
  resourceVersion: "2443601"
  selfLink: /api/v1/namespaces/dagster/pods/dagster-run-7e0c0df5-c83e-4b8f-9d5c-4aec7f32dcc2-pb6gp
  uid: 8ecf7420-8a72-48fc-9e28-33045a3b2c84
spec:
  containers:
  - args:
    - api
    - execute_run_with_structured_logs
    - '{"__class__": "ExecuteRunArgs", "instance_ref": null, "pipeline_origin": {"__class__":
      "PipelinePythonOrigin", "pipeline_name": "training_pipeline", "repository_origin":
      {"__class__": "RepositoryPythonOrigin", "code_pointer": {"__class__": "FileCodePointer",
      "fn_name": "training_repository", "python_file": "training_repo.py", "working_directory":
      "/"}, "executable_path": "/usr/local/bin/python"}}, "pipeline_run_id": "7e0c0df5-c83e-4b8f-9d5c-4aec7f32dcc2"}'
    command:
    - dagster
    env:
    - name: DAGSTER_HOME
      value: /opt/dagster/dagster_home
    - name: DAGSTER_PG_PASSWORD
      valueFrom:
        secretKeyRef:
          key: postgresql-password
          name: dagster-postgresql-secret
    envFrom:
    - configMapRef:
        name: dagster-pipeline-env
    image: docker.pkg.github.com/jeremyhermann/docker-repo/dagster:latest
    imagePullPolicy: IfNotPresent
    name: dagster-run-7e0c0df5-c83e-4b8f-9d5c-4aec7f32dcc2
    resources:
      requests:
        cpu: "5"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /opt/dagster/dagster_home/dagster.yaml
      name: dagster-instance
      subPath: dagster.yaml
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: dagster-token-spqmw
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: lke10936-18441-5fd6e111cdc5
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: dagster
  serviceAccountName: dagster
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - configMap:
      defaultMode: 420
      name: dagster-instance
    name: dagster-instance
  - name: dagster-token-spqmw
    secret:
      defaultMode: 420
      secretName: dagster-token-spqmw
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-12-14T03:56:42Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-12-14T03:56:42Z"
    message: 'containers with unready status: [dagster-run-7e0c0df5-c83e-4b8f-9d5c-4aec7f32dcc2]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-12-14T03:56:42Z"
    message: 'containers with unready status: [dagster-run-7e0c0df5-c83e-4b8f-9d5c-4aec7f32dcc2]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-12-14T03:56:42Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: docker.pkg.github.com/jeremyhermann/docker-repo/dagster:latest
    imageID: ""
    lastState: {}
    name: dagster-run-7e0c0df5-c83e-4b8f-9d5c-4aec7f32dcc2
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: Back-off pulling image "docker.pkg.github.com/jeremyhermann/docker-repo/dagster:latest"
        reason: ImagePullBackOff
  hostIP: 192.168.128.163
  phase: Pending
  podIP: 10.2.40.2
  podIPs:
  - ip: 10.2.40.2
  qosClass: Burstable
  startTime: "2020-12-14T03:56:42Z"
n
That doesn't appear to have any pull secrets on it
Are you sure the helm chart values are set correctly?
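For comparison, a pod that does have registry credentials attached carries an imagePullSecrets stanza in its spec, something like the following (a sketch, using the secret name from the values file above):

spec:
  imagePullSecrets:
  - name: dockerconfigjson-github-com
  containers:
  - name: dagster-run-...
    image: docker.pkg.github.com/jeremyhermann/docker-repo/dagster:latest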
j
strangely, the jobs that do work also don’t have the pull secrets in their yaml
i am applying the values.yaml file to the standard dagster chart using
helm upgrade dagster dagster/dagster --namespace dagster --create-namespace  -f ../infra/helm/dagster/values.yaml
is that what you mean by ‘chart values are set correctly’?
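One way to double-check which values the release was actually installed with is to ask Helm for the user-supplied values, e.g.:

helm get values dagster -n dagster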
n
No, I mean you showed what values you are using, but are you sure those are correct 🙂
Can you link whatever chart you're using?
j
The name of the imagePullSecret and docker image are correct. Not at all sure the rest is correct (except that it works on one node)
n
Again, not the data
But the values structure
You might have put things in the wrong place
And the one node that is working may have global docker creds installed
j
very possible that i have the structure wrong
n
Can you please link to whatever Helm chart you are using?
j
i installed using
helm repo add dagster https://dagster-io.github.io/helm
helm install dagster dagster/dagster
in the helm chart, all the references to imagePullSecrets look like this
$.Values.imagePullSecrets
so it seems I have those defined in the right place
i don't have any global docker creds installed anywhere. but the jobs that do work are on the same node as the k8sRunLauncher pod (the one running the grpc api server), and that pod does have the imagePullSecrets in its yaml. so that might be related. but i don't think that different pods running on the same node would see each other's imagePullSecrets
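For context, the usual Helm pattern is for a top-level imagePullSecrets value to be rendered into each pod template by the chart, roughly like this (a generic sketch of the pattern, not the Dagster chart's actual template):

# values.yaml
imagePullSecrets:
  - name: dockerconfigjson-github-com

# inside a pod or deployment template
      {{- with $.Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}

If a particular pod spec (here, the run job created by the launcher) never renders that value, the secret silently never reaches those pods.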
n
Looks about like I would expect. Dump the configmap containing the dagit config file; that should show you the options being passed to the launcher
j
this one?
$ kubectl get configmap dagster-instance -n dagster -o yaml
apiVersion: v1
data:
  dagster.yaml: "scheduler:\n  module: dagster_cron.cron_scheduler\n  class: SystemCronScheduler\n\nschedule_storage:\n
    \ module: dagster_postgres.schedule_storage\n  class: PostgresScheduleStorage\n
    \ config:\n    postgres_db:\n      username: test\n      password:\n        env:
    DAGSTER_PG_PASSWORD\n      hostname: dagster-postgresql\n      db_name:  test\n
    \     port: 5432\n\nrun_launcher:\n  module: dagster_k8s\n  class: K8sRunLauncher\n
    \ config:\n    load_incluster_config: true\n    kubeconfig_file: \n    job_namespace:
    dagster\n    service_account_name: dagster\n    dagster_home:\n      env: DAGSTER_HOME\n
    \   instance_config_map:\n      env: DAGSTER_K8S_INSTANCE_CONFIG_MAP\n    postgres_password_secret:\n
    \     env: DAGSTER_K8S_PG_PASSWORD_SECRET\n    env_config_maps:\n      - env:
    DAGSTER_K8S_PIPELINE_RUN_ENV_CONFIGMAP\n    env_secrets:\n\nrun_storage:\n  module:
    dagster_postgres.run_storage\n  class: PostgresRunStorage\n  config:\n    postgres_db:\n
    \     username: test\n      password:\n        env: DAGSTER_PG_PASSWORD\n      hostname:
    dagster-postgresql\n      db_name:  test\n      port: 5432\n\nevent_log_storage:\n
    \ module: dagster_postgres.event_log\n  class: PostgresEventLogStorage\n  config:\n
    \   postgres_db:\n      username: test\n      password:\n        env: DAGSTER_PG_PASSWORD\n
    \     hostname: dagster-postgresql\n      db_name:  test\n      port: 5432\n"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: dagster
    meta.helm.sh/release-namespace: dagster
  creationTimestamp: "2020-12-07T00:14:10Z"
  labels:
    app: dagster
    app.kubernetes.io/managed-by: Helm
    chart: dagster-0.9.21
    heritage: Helm
    release: dagster
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:dagster.yaml: {}
      f:metadata:
        f:annotations:
          .: {}
          f:meta.helm.sh/release-name: {}
          f:meta.helm.sh/release-namespace: {}
        f:labels:
          .: {}
          f:app: {}
          f:<http://app.kubernetes.io/managed-by|app.kubernetes.io/managed-by>: {}
          f:chart: {}
          f:heritage: {}
          f:release: {}
    manager: Go-http-client
    operation: Update
    time: "2020-12-07T21:41:03Z"
  name: dagster-instance
  namespace: dagster
  resourceVersion: "2155420"
  selfLink: /api/v1/namespaces/dagster/configmaps/dagster-instance
  uid: 167d0873-fc5d-474e-9e04-f12fe3003680
n
Bit hard to read through all that formatting, but I don't think I see an image_pull_secrets key on the launcher config
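Decoded, the run_launcher section of that escaped dagster.yaml reads roughly as follows (reconstructed from the configmap above; note there is no image_pull_secrets key anywhere in it):

run_launcher:
  module: dagster_k8s
  class: K8sRunLauncher
  config:
    load_incluster_config: true
    kubeconfig_file:
    job_namespace: dagster
    service_account_name: dagster
    dagster_home:
      env: DAGSTER_HOME
    instance_config_map:
      env: DAGSTER_K8S_INSTANCE_CONFIG_MAP
    postgres_password_secret:
      env: DAGSTER_K8S_PG_PASSWORD_SECRET
    env_config_maps:
      - env: DAGSTER_K8S_PIPELINE_RUN_ENV_CONFIGMAP
    env_secrets: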
j
correct. do i need to set the pull secrets property in the config? i was expecting it to get set from the helm side
but i can add it to the config if that’s the right place for it
n
Helm doesn't create the job pods, Dagit does
But Helm should be configuring Dagit
That's what that configmap does
That looks correct
And I think that matches your values
The $.Values is weird but should work
j
so you think the values from helm should land in the configmap, but they are not?
n
Oh wait that's the scheduler
Yeah if I scroll down to the right section
This is a bug in the helm chart
it doesn't pass through the config correctly
Nothing in there sets image_pull_secrets
j
yup
j
any idea why it works on that one node but not others? does it get the secrets somehow from the other pod on that same node?
n
Because the image is already pulled there from other deployments which do have the pull secret
So the kubelet doesn't try to pull it again
However on a node which doesn't have it already, it tries and fails
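Note that the run pod above has imagePullPolicy: IfNotPresent, so a node with a cached copy of the image never re-authenticates against the registry. One way to confirm which nodes already have the image cached (a sketch):

kubectl get node <node-name> -o jsonpath='{range .status.images[*]}{.names}{"\n"}{end}' | grep dagster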
j
oh right. so the image is already cached on that node
got it
ok - i’ll figure out how to get that value into the launcher config
n
You cannot without forking and fixing the helm chart
I mean you could hand edit the configmap but that will put it out of sync with Helm so I wouldn't recommend it
j
is there a way to set it from the python side?
i guess for now i can also run another dummy pod on that node to pull the image
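A pre-pull pod as a stopgap could look roughly like this (a sketch; the node selector and secret name are taken from the values above, and the command assumes a sleep binary exists in the image):

apiVersion: v1
kind: Pod
metadata:
  name: dagster-image-prepull
  namespace: dagster
spec:
  nodeSelector:
    lke.linode.com/pool-id: "18386"
  imagePullSecrets:
  - name: dockerconfigjson-github-com
  containers:
  - name: prepull
    image: docker.pkg.github.com/jeremyhermann/docker-repo/dagster:latest
    command: ["sleep", "3600"]

A DaemonSet with the same selector would cover every node in the pool rather than a single one.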
n
Or just fix the chart 🙂
It's a 3 line change
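Concretely, the goal is for the chart to render its top-level imagePullSecrets value into the K8sRunLauncher block of the dagster-instance configmap as image_pull_secrets, so the launcher config ends up with something like this (a sketch of the desired rendered result, not the chart's exact template code):

run_launcher:
  module: dagster_k8s
  class: K8sRunLauncher
  config:
    # ... existing config from above ...
    image_pull_secrets:
      - name: dockerconfigjson-github-com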
a
@rex