# deployment-kubernetes
j
Does anyone else have an issue where the Kubernetes (GKE) jobs aren't getting the right service account? They're using the default service account, which doesn't have access to the IO manager bucket, so my jobs keep failing. From the logs:
```
google.api_core.exceptions.Forbidden: 403 GET https://storage.googleapis.com/storage/v1/b/dagster-io-manager-artifacts?fields=name&prettyPrint=false: 1045208284552-compute@developer.gserviceaccount.com does not have storage.buckets.get access to the Google Cloud Storage bucket.
  File "/opt/conda/lib/python3.8/site-packages/dagster/core/errors.py", line 184, in user_code_error_boundary
    yield
  File "/opt/conda/lib/python3.8/site-packages/dagster/core/execution/resources_init.py", line 310, in single_resource_event_generator
    resource_def.resource_fn(context)
  File "/opt/conda/lib/python3.8/site-packages/dagster_gcp/gcs/io_manager.py", line 121, in gcs_pickle_io_manager
    pickled_io_manager = PickledObjectGCSIOManager(
  File "/opt/conda/lib/python3.8/site-packages/dagster_gcp/gcs/io_manager.py", line 21, in __init__
    check.invariant(self.bucket_obj.exists())
  File "/opt/conda/lib/python3.8/site-packages/google/cloud/storage/bucket.py", line 822, in exists
    client._get_resource(
  File "/opt/conda/lib/python3.8/site-packages/google/cloud/storage/client.py", line 349, in _get_resource
    return self._connection.api_request(
  File "/opt/conda/lib/python3.8/site-packages/google/cloud/storage/_http.py", line 80, in api_request
    return call()
  File "/opt/conda/lib/python3.8/site-packages/google/api_core/retry.py", line 283, in retry_wrapped_func
    return retry_target(
  File "/opt/conda/lib/python3.8/site-packages/google/api_core/retry.py", line 190, in retry_target
    return target()
  File "/opt/conda/lib/python3.8/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
    raise exceptions.from_http_response(response)
```
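To double-check which identity the run pod actually ends up with, I can exec into it and ask the GKE metadata server (a rough sketch - the pod/job names are placeholders, and it assumes curl is available in the image):
```
# Find the pod created for the failing run job
kubectl get pods -l job-name=<run-job-name>

# Ask the GKE metadata server which service account the pod authenticates as.
# With workload identity working, this prints the bound GSA rather than
# <project-number>-compute@developer.gserviceaccount.com.
kubectl exec <run-pod-name> -- curl -s -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email
```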
Looking at the job itself, the `serviceAccount` seems to be correct, but the managed fields are not:
```
managedFields:
  - apiVersion: batch/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:template:
          f:spec:
            f:serviceAccount: {} # 👈 
            f:serviceAccountName: {}
# ...snip...
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
      creationTimestamp: null
      labels:
    spec:
      containers:
      - image: my-cool-dagster-repo
        imagePullPolicy: Always
        name: dagster
      restartPolicy: Never
      serviceAccount: dagster # 👈
      serviceAccountName: dagster
```
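(For comparison, this is a quick way to pull just the service account off the launched run's Job and off the long-running Dagster deployments - a sketch, the job name is a placeholder:)
```
# Service account on the Job that Dagster launched for the run
kubectl get job <dagster-run-job> -o jsonpath='{.spec.template.spec.serviceAccountName}'

# Service accounts on the Dagit / daemon / user code deployments, for comparison
kubectl get deployments \
  -o custom-columns='NAME:.metadata.name,SERVICE_ACCOUNT:.spec.template.spec.serviceAccountName'
```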
And, in fact, the `helm.yaml` has the right service account 🤔
```
---
global:
  postgresqlSecretName: "dagster-postgresql-secret"
  dagsterHome: "/opt/dagster/dagster_home"
  # A service account name to use for this chart and all subcharts. If this is set, then
  # dagster subcharts will not create their own service accounts.
  serviceAccountName: "dagster"
I wonder if this has to do with upgrading to dagster 0.15? I'm not quite certain, but the problem seems to have started after I upgraded.
I've verified the workload identity binding between the GCP service account and the K8s service account is correct. If I SSH into the other deployments, I can use `gsutil ls` on the dagster-io-manager-artifacts bucket. It seems to only be an issue with jobs?
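Roughly how I checked the cluster side, for reference (a sketch - run it against the namespace Dagster is deployed in):
```
# The K8s service account should carry the workload identity annotation
# pointing at the GCP service account
kubectl get serviceaccount dagster \
  -o jsonpath="{.metadata.annotations['iam\.gke\.io/gcp-service-account']}"

# From a pod that uses this service account, the bucket should be listable
gsutil ls gs://dagster-io-manager-artifacts
```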
d
Hi Jeremy - if it happened starting in 0.15.0, it may have something to do with the `includeConfigInLaunchedRuns` flag, which moved from defaulting to false to defaulting to true (https://docs.dagster.io/changelog#extension-libraries). You could try changing it to false in your user code deployment and see if that makes it start working again - that would be somewhat surprising, but not impossible
If that fixes it, I'd be curious what the difference in k8s config is between a job when it's working and a job when it isn't
j
Huh, interesting! I'll give this a shot when I get back to my keyboard 🙏
m
I had this happen with my move to 0.15. I am using a separate user code deployment helm chart and had to add a line to use a global service account: https://github.com/xmarcosx/dagster-via-kubernetes/blob/master/user_code_values.yaml#L2
d
@marcos was the user code deployment helm chart running in a different namespace? I would have thought that the default service account that the user code deployment chart makes would still work for the launched run
m
Same namespace for both. I had a few Dagster deployments on GKE. I will upgrade a different one to 0.15.2 and see if it breaks in the same way.
d
huh. did you hit the same managedFields error that jeremy is describing here?
Or was the issue just that you had given the dagster serviceaccount certain permissions, so it switching to the dagster-user-deployments service account as the default caused problems?
m
I hit the same issue where my GCS IO manager started failing when trying to connect to the bucket. The YAML config showed that the jobs created by Dagster were not using the global service account I had configured in my Dagster Helm chart. I hadn't needed to also include the same global service account override in my user code deployment Helm config until after the upgrade to 0.15
d
what gave the dagster service account access to the io manager though? was that something that you previously configured?
access to the bucket, rather
m
In my GCP project I have a dagster service account, which has access to the bucket. Then in GKE I deploy Dagster and tell it to use a Kubernetes service account, `dagster`. Finally, I bind them together:
```
gcloud iam service-accounts add-iam-policy-binding \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:$GOOGLE_CLOUD_PROJECT.svc.id.goog[default/dagster]" \
  dagster@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com

kubectl annotate serviceaccount \
  dagster \
  iam.gke.io/gcp-service-account=dagster@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com
```
I hope that’s the question you’re asking!
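(If it helps, the binding can also be double-checked from the GCP side - a sketch using the same names; the bucket name is a placeholder:)
```
# The GSA's IAM policy should list the K8s service account as a
# roles/iam.workloadIdentityUser member, i.e.
# serviceAccount:$GOOGLE_CLOUD_PROJECT.svc.id.goog[default/dagster]
gcloud iam service-accounts get-iam-policy \
  dagster@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com

# And the bucket's IAM policy should grant that GSA storage access
gsutil iam get gs://<your-io-manager-bucket>
```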
d
that's it exactly. OK, that sounds, while annoying, like an expected consequence of the breaking change. Apologies for the trouble
you could also set `includeConfigInLaunchedRuns` to false if you want to change the default back and have it not take config from the user code deployment when launching the run
m
Perfect. Thank you for clarifying.
j
Ok, I added that field to the Helm values and it did nothing. But rolling back to Dagster 0.14.20 fixed it! I'm not sure I understand the problem well enough: I'm not using a separate user code deployment Helm chart. Should I regenerate the Helm values?
d
If making your user deployments look something like this:
```
dagster-user-deployments:
  enabled: true
  deployments:
    - name: "k8s-example-user-code-1"
      image:
        repository: "<http://docker.io/dagster/user-code-example|docker.io/dagster/user-code-example>"
        tag: latest
        pullPolicy: Always
      dagsterApiGrpcArgs:
        - "--python-file"
        - "/example_project/example_repo/repo.py"
      port: 3030
      includeConfigInLaunchedRuns:
        enabled: false
```
didn't fix it, then I'm not totally sure what's going on - changing the default there from false to true was the only significant change to the Helm chart that I'm aware of between 0.14.20 and 0.15.0
just to double-check, that's the field that you changed? and it was inside the deployments list?
j
Yep, here's the full section:
```
dagster-user-deployments:
  enabled: true
  enableSubchart: true
  deployments:
    - name: "dg-workspace"
      image:
        repository: "<http://gcr.io/foo/dagster|gcr.io/foo/dagster>"
        tag: "1.0.62"
        pullPolicy: Always
      envSecrets:
        - name: run-worker-secrets
        - name: slack-creds
      annotations: {}
      nodeSelector:
        iam.gke.io/gke-metadata-server-enabled: "true"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "api"
                operator: "In"
                values: ["yes"]
      tolerations:
      - key: "api"
        operator: "Equal"
        value: "yes"
        effect: "NoSchedule"
      podSecurityContext: {}
      securityContext: {}
      resources: {}
      includeConfigInLaunchedRuns:
        enabled: false  # 👈
      replicaCount: 1
      livenessProbe:
        initialDelaySeconds: 0
        periodSeconds: 20
        timeoutSeconds: 3
        successThreshold: 1
        failureThreshold: 3
      startupProbe:
        enabled: true
        initialDelaySeconds: 0
        periodSeconds: 10
        timeoutSeconds: 3
        successThreshold: 1
        failureThreshold: 3
      service:
        annotations: {}
      dagsterApiGrpcArgs: 
        - "--python-file"
        - "/opt/dagster/dagster_home/dg/repo.py"
        - "-p"
        - "4000"
      port: 4000
```
d
Hm, OK - and do you have the description of the job in 0.14.20 vs. 0.15.0 (with `enabled: false`)? I'd expect the YAML to be the same between the two, but it sounds like it is different
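Something like this would capture them for comparison (a sketch - the job names are placeholders):
```
# Dump the launched run Job spec under each Dagster version...
kubectl get job <run-job-on-0-14-20> -o yaml > job-0.14.20.yaml
kubectl get job <run-job-on-0-15-0> -o yaml > job-0.15.0.yaml

# ...and diff them; managedFields/status noise aside, the pod template spec
# (serviceAccountName, nodeSelector, env, etc.) is the interesting part
diff job-0.14.20.yaml job-0.15.0.yaml
```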
j
Hmmmmm
It seems to be working again but the managed fields are still blank:
```
            f:serviceAccount: {}
            f:serviceAccountName: {}
```
Weird!
d
Gotcha - yeah the managed fields may be a red herring
j
My mistake, seems like some jobs still fail
That's so bizarre because the resources seem to be initialized correctly in a different job
Ok, so it seems that part of the issue is that the job doesn't always land on a node with workload identity enabled
So even if the K8s service account is correct, it can't bind to the GCP IAM service account
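Roughly how I confirmed it (a sketch - the run pod name is a placeholder):
```
# Which node did the failing run pod land on?
kubectl get pod <run-pod-name> -o jsonpath='{.spec.nodeName}'

# Which nodes are actually running the GKE metadata server
# (i.e. have workload identity enabled)?
kubectl get nodes -L iam.gke.io/gke-metadata-server-enabled
```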