# deployment-kubernetes
j
Does anyone else have an issue where the Kubernetes (GKE) jobs aren't getting the right service account? They're using the default service account, which doesn't have access to the IO manager bucket, so my jobs keep failing. From the logs:
```
google.api_core.exceptions.Forbidden: 403 GET https://storage.googleapis.com/storage/v1/b/dagster-io-manager-artifacts?fields=name&prettyPrint=false: 1045208284552-compute@developer.gserviceaccount.com does not have storage.buckets.get access to the Google Cloud Storage bucket.
  File "/opt/conda/lib/python3.8/site-packages/dagster/core/errors.py", line 184, in user_code_error_boundary
    yield
  File "/opt/conda/lib/python3.8/site-packages/dagster/core/execution/resources_init.py", line 310, in single_resource_event_generator
    resource_def.resource_fn(context)
  File "/opt/conda/lib/python3.8/site-packages/dagster_gcp/gcs/io_manager.py", line 121, in gcs_pickle_io_manager
    pickled_io_manager = PickledObjectGCSIOManager(
  File "/opt/conda/lib/python3.8/site-packages/dagster_gcp/gcs/io_manager.py", line 21, in __init__
    check.invariant(self.bucket_obj.exists())
  File "/opt/conda/lib/python3.8/site-packages/google/cloud/storage/bucket.py", line 822, in exists
    client._get_resource(
  File "/opt/conda/lib/python3.8/site-packages/google/cloud/storage/client.py", line 349, in _get_resource
    return self._connection.api_request(
  File "/opt/conda/lib/python3.8/site-packages/google/cloud/storage/_http.py", line 80, in api_request
    return call()
  File "/opt/conda/lib/python3.8/site-packages/google/api_core/retry.py", line 283, in retry_wrapped_func
    return retry_target(
  File "/opt/conda/lib/python3.8/site-packages/google/api_core/retry.py", line 190, in retry_target
    return target()
  File "/opt/conda/lib/python3.8/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
    raise exceptions.from_http_response(response)
```
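To double-check which identity the run pod actually ends up with, I can exec into it and ask the GKE metadata server (a rough sketch - the pod/job names are placeholders, and it assumes curl is available in the image):
```
# Find the pod created for the failing run job
kubectl get pods -l job-name=<run-job-name>

# Ask the GKE metadata server which service account the pod authenticates as.
# With workload identity working, this prints the bound GSA rather than
# <project-number>-compute@developer.gserviceaccount.com.
kubectl exec <run-pod-name> -- curl -s -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email
```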
Looking at the job itself, the `serviceAccount` seems to be correct, but the managed fields are not:
```
managedFields:
  - apiVersion: batch/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:template:
          f:spec:
            f:serviceAccount: {} # 👈 
            f:serviceAccountName: {}
# ...snip...
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
      creationTimestamp: null
      labels:
    spec:
      containers:
      - image: my-cool-dagster-repo
        imagePullPolicy: Always
        name: dagster
      restartPolicy: Never
      serviceAccount: dagster # 👈
      serviceAccountName: dagster
```
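(For comparison, this is a quick way to pull just the service account off the launched run's Job and off the long-running Dagster deployments - a sketch, the job name is a placeholder:)
```
# Service account on the Job that Dagster launched for the run
kubectl get job <dagster-run-job> -o jsonpath='{.spec.template.spec.serviceAccountName}'

# Service accounts on the Dagit / daemon / user code deployments, for comparison
kubectl get deployments \
  -o custom-columns='NAME:.metadata.name,SERVICE_ACCOUNT:.spec.template.spec.serviceAccountName'
```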
And, in fact, the `helm.yaml` has the right service account 🤔
```
---
global:
  postgresqlSecretName: "dagster-postgresql-secret"
  dagsterHome: "/opt/dagster/dagster_home"
  # A service account name to use for this chart and all subcharts. If this is set, then
  # dagster subcharts will not create their own service accounts.
  serviceAccountName: "dagster"
I wonder if this has to do with upgrading to dagster 0.15? I'm not quite certain, but the problem seems to have started after I upgraded.
I've verified the workload identity binding between the GCP service account and the K8s service account is correct. If I SSH into the other deployments, I can use `gsutil ls` on the dagster-io-manager-artifacts bucket. It seems to only be an issue with jobs?
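Roughly how I checked the cluster side, for reference (a sketch - run it against the namespace Dagster is deployed in):
```
# The K8s service account should carry the workload identity annotation
# pointing at the GCP service account
kubectl get serviceaccount dagster \
  -o jsonpath="{.metadata.annotations['iam\.gke\.io/gcp-service-account']}"

# From a pod that uses this service account, the bucket should be listable
gsutil ls gs://dagster-io-manager-artifacts
```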
d
Hi Jeremy - if it happened starting in 0.15.0, it may have something to do with the `includeConfigInLaunchedRuns` flag, which moved from defaulting to false to defaulting to true (https://docs.dagster.io/changelog#extension-libraries). You could try changing it to false in your user code deployment and see if that makes it start working again - that would be somewhat surprising, but not impossible
If that fixes it, I'd be curious what the difference in k8s config is between a job when it's working and a job when it isn't
j
Huh, interesting! I'll give this a shot when I get back to my keyboard 🙏
m
I had this happen with my move to 0.15. I am using a separate user code deployment helm chart and had to add a line to use a global service account: https://github.com/xmarcosx/dagster-via-kubernetes/blob/master/user_code_values.yaml#L2
d
@marcos was the user code deployment helm chart running in a different namespace? I would have thought that the default service account that the user code deployment chart makes would still work for the launched run
m
Same namespace for both. I had a few Dagster deployments on GKE. I will upgrade a different one to 0.15.2 and see if it breaks in the same way.
d
huh. did you hit the same managedFields error that jeremy is describing here?
Or was the issue just that you had given the dagster serviceaccount certain permissions, so it switching to the dagster-user-deployments service account as the default caused problems?
m
I hit the same issue where my GCS IO manager started failing when trying to connect to the bucket. The YAML config showed that the jobs created by Dagster were not using the global service account I had configured in my Dagster Helm chart. I hadn't needed to also include the same global service account override in my user code deployment Helm config until after the upgrade to 0.15
d
what gave the dagster service account access to the io manager though? was that something that you previously configured?
access to the bucket, rather
m
In my GCP project I have a dagster service account, which has access to the bucket. Then in GKE I deploy Dagster and tell it to use a Kubernetes service account, `dagster`. Finally, I bind them together:
```
gcloud iam service-accounts add-iam-policy-binding \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:$GOOGLE_CLOUD_PROJECT.svc.id.goog[default/dagster]" \
  dagster@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com

kubectl annotate serviceaccount \
  dagster \
  iam.gke.io/gcp-service-account=dagster@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com
```
I hope that’s the question you’re asking!
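(If it helps, the binding can also be double-checked from the GCP side - a sketch using the same names; the bucket name is a placeholder:)
```
# The GSA's IAM policy should list the K8s service account as a
# roles/iam.workloadIdentityUser member, i.e.
# serviceAccount:$GOOGLE_CLOUD_PROJECT.svc.id.goog[default/dagster]
gcloud iam service-accounts get-iam-policy \
  dagster@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com

# And the bucket's IAM policy should grant that GSA storage access
gsutil iam get gs://<your-io-manager-bucket>
```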
d
that's it exactly. OK, that sounds, while annoying, like an expected consequence of the breaking change. Apologies for the trouble
you could also set `includeConfigInLaunchedRuns` to false if you want to change the default back and have it not take config from the user code deployment when launching the run
m
Perfect. Thank you for clarifying.
j
Ok, I added that field to the Helm values and it did nothing. But rolling back to Dagster 0.14.20 fixed it! I'm not sure I understand the problem well enough: I'm not using a separate user code deployment Helm chart. Should I regenerate the Helm values?
d
If making your user deployments look something like this:
```
dagster-user-deployments:
  enabled: true
  deployments:
    - name: "k8s-example-user-code-1"
      image:
        repository: "<http://docker.io/dagster/user-code-example|docker.io/dagster/user-code-example>"
        tag: latest
        pullPolicy: Always
      dagsterApiGrpcArgs:
        - "--python-file"
        - "/example_project/example_repo/repo.py"
      port: 3030
      includeConfigInLaunchedRuns:
        enabled: false
```
didn't fix it, then I'm not totally sure what's going on - changing the default there from false to true was the only significant change to the Helm chart that I'm aware of between 0.14.20 and 0.15.0
just to double-check, that's the field that you changed? and it was inside the deployments list?
j
Yep, here's the full section:
```
dagster-user-deployments:
  enabled: true
  enableSubchart: true
  deployments:
    - name: "dg-workspace"
      image:
        repository: "<http://gcr.io/foo/dagster|gcr.io/foo/dagster>"
        tag: "1.0.62"
        pullPolicy: Always
      envSecrets:
        - name: run-worker-secrets
        - name: slack-creds
      annotations: {}
      nodeSelector:
        iam.gke.io/gke-metadata-server-enabled: "true"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "api"
                operator: "In"
                values: ["yes"]
      tolerations:
      - key: "api"
        operator: "Equal"
        value: "yes"
        effect: "NoSchedule"
      podSecurityContext: {}
      securityContext: {}
      resources: {}
      includeConfigInLaunchedRuns:
        enabled: false  # 👈
      replicaCount: 1
      livenessProbe:
        initialDelaySeconds: 0
        periodSeconds: 20
        timeoutSeconds: 3
        successThreshold: 1
        failureThreshold: 3
      startupProbe:
        enabled: true
        initialDelaySeconds: 0
        periodSeconds: 10
        timeoutSeconds: 3
        successThreshold: 1
        failureThreshold: 3
      service:
        annotations: {}
      dagsterApiGrpcArgs: 
        - "--python-file"
        - "/opt/dagster/dagster_home/dg/repo.py"
        - "-p"
        - "4000"
      port: 4000
```
d
Hm, OK - and do you have the description of the job in 0.14.20 vs. 0.15.0 (with `enabled: false`)? I'd expect the YAML to be the same between the two, but it sounds like it is different
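Something like this would capture them for comparison (a sketch - the job names are placeholders):
```
# Dump the launched run Job spec under each Dagster version...
kubectl get job <run-job-on-0-14-20> -o yaml > job-0.14.20.yaml
kubectl get job <run-job-on-0-15-0> -o yaml > job-0.15.0.yaml

# ...and diff them; managedFields/status noise aside, the pod template spec
# (serviceAccountName, nodeSelector, env, etc.) is the interesting part
diff job-0.14.20.yaml job-0.15.0.yaml
```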
j
Hmmmmm
It seems to be working again but the managed fields are still blank:
```
            f:serviceAccount: {}
            f:serviceAccountName: {}
```
Weird!
d
Gotcha - yeah the managed fields may be a red herring
j
My mistake, seems like some jobs still fail
That's so bizarre because the resources seem to be initialized correctly in a different job
Ok, so it seems that part of the issue is that the job doesn't always land on a node with workload identity enabled
So even if the K8s service account is correct, it can't bind to the GCP IAM service account
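Roughly how I confirmed it (a sketch - the run pod name is a placeholder):
```
# Which node did the failing run pod land on?
kubectl get pod <run-pod-name> -o jsonpath='{.spec.nodeName}'

# Which nodes are actually running the GKE metadata server
# (i.e. have workload identity enabled)?
kubectl get nodes -L iam.gke.io/gke-metadata-server-enabled
```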