# ask-community
s
Super confused why a bunch of run pods are launched immediately upon deploy (and from a Helm chart that manages user code in a separate YAML, no less). In other words, it seems like the run launcher (or the daemon??) is generating pods even without a gRPC user code pod. Obviously we do need gRPC pods, but I'm trying to pinpoint why this is occurring. ANY help appreciated!!
j
Those pods would be from runs getting launched. If you open Dagit, do you have a lot of runs in your queue?
s
@johann I do not, I checked every job. All of our pipeline code is wrapped in @op, @job, etc., so nothing from our code/Docker image was instantiating right away or kicking off executions.
j
My best guess is that there was somehow a backfill queued up or that these were from another deployment or something like that.
`dagster-run` K8s Jobs are only created when Dagster launches runs, and it doesn't do that on startup.
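The point above can be checked directly: the K8sRunLauncher names each K8s Job after the run it launches, so listing Jobs (not just Pods) shows exactly which runs created them. A minimal sketch follows; the cluster commands are commented out since they need a live cluster, and the sample job name is hypothetical.

```shell
# List the Jobs Dagster created for runs (requires a cluster; illustrative):
#   kubectl get jobs | grep '^dagster-run-'

# Recover the run ID (the same ID shown in Dagit) from a Job name.
# This job name is a made-up example:
job_name="dagster-run-940053e8-9216-4616-a997-92e25ab6110d"
run_id="${job_name#dagster-run-}"   # strip the dagster-run- prefix
echo "$run_id"
```

The recovered ID can then be searched for in Dagit's run list to see whether the Job corresponds to a known run.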
s
No releases are on the cluster prior to a new deploy, I've made sure of that to troubleshoot: helm uninstall <pretty much everything>. They come up as soon as the Helm "infra" code (i.e. minus the user code deployments) boots up. The Dagit UI has nothing in the queue.
j
If you delete them they come back when you reinstall Dagster?
s
@johann yep, they all come back, all ~37 of them, within a couple of minutes
our “infra” helm chart:
busybox:
  image:
    pullPolicy: IfNotPresent
    repository: docker.io/busybox
    tag: "1.28"
computeLogManager:
  config: {}
  type: NoOpComputeLogManager
dagit:
  dbStatementTimeout: null
  enableReadOnly: false
  image:
    pullPolicy: Always
    repository: docker.io/dagster/dagster-celery-k8s
    tag: null
  livenessProbe: {}
  readinessProbe:
    failureThreshold: 3
    httpGet:
      path: /dagit_info
      port: 80
    periodSeconds: 20
    successThreshold: 1
    timeoutSeconds: 3
  replicaCount: 1
  service:
    annotations:
      cloud.google.com/backend-config: '{"default": "backend-config-default"}'
    port: 80
    type: ClusterIP
  startupProbe:
    enabled: false
  workspace:
    enabled: true
    servers:
    - host: dmg
      port: 3030
    - host: dbt
      port: 3030
dagster-user-deployments:
  enableSubchart: false
  enabled: true
dagsterDaemon:
  enabled: true
  heartbeatTolerance: 300
  image:
    pullPolicy: Always
    repository: docker.io/dagster/dagster-celery-k8s
    tag: null
  livenessProbe: {}
  readinessProbe: {}
  runCoordinator:
    config:
      queuedRunCoordinator:
        dequeueIntervalSeconds: null
        maxConcurrentRuns: null
        tagConcurrencyLimits: []
    enabled: false
    type: QueuedRunCoordinator
  runMonitoring:
    enabled: true
    pollIntervalSeconds: 120
    startTimeoutSeconds: 360
  startupProbe: {}
generatePostgresqlPasswordSecret: true
global:
  dagsterHome: /opt/dagster/dagster_home
  postgresqlSecretName: dagster-postgresql-secret
helm:
  versions: null
postgresql:
  enabled: true
  image:
    pullPolicy: IfNotPresent
    repository: library/postgres
    tag: 9.6.21
  postgresqlDatabase: test
  postgresqlHost: ""
  postgresqlParams: {}
  postgresqlPassword: test
  postgresqlUsername: test
  service:
    port: 5432
rbacEnabled: true
runLauncher:
  config:
    k8sRunLauncher:
      imagePullPolicy: Always
      loadInclusterConfig: true
      volumeMounts:
      - mountPath: /dmg-secrets/google
        name: keyjson-repo-dmg
        readOnly: true
      volumes:
      - name: keyjson-repo-dmg
        secret:
          secretName: tf-svc-dagster-repo-dmg
  type: K8sRunLauncher
scheduler:
  type: DagsterDaemonScheduler
serviceAccount:
  create: true
telemetry:
  enabled: false
our user code chart, deployed separate as `dagster-user-deployments`:
USER-SUPPLIED VALUES:
celeryConfigSecretName: dagster-celery-config-secret
dagsterHome: /opt/dagster/dagster_home
deployments:
- dagsterApiGrpcArgs:
  - --python-file
  - /mixer/dagster/workspace_cbh/repo_dmg/repo.py
  - --working-directory
  - /mixer/dagster/workspace_cbh/repo_dmg/
  image:
    pullPolicy: Always
    repository: ***
    tag: v1
  name: dmg
  port: 3030
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 20
    successThreshold: 1
    timeoutSeconds: 3
- dagsterApiGrpcArgs:
  - --python-file
  - /mixer/dagster/workspace_cbh/repo_dbt/repo.py
  - --working-directory
  - /mixer/dagster/workspace_cbh/repo_dbt/
  image:
    pullPolicy: Always
    repository: ***
    tag: latest
  name: dbt
  port: 3030
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 20
    successThreshold: 1
    timeoutSeconds: 3
helm:
  versions: null
postgresqlSecretName: dagster-postgresql-secret
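One thing worth noting about the two value files above: the `dagit.workspace.servers` list in the infra chart has to match the `name`/`port` pairs the user-code chart exposes, or Dagit can't reach the gRPC servers. A runnable sketch of that consistency check over the values shown (the host:port strings are copied from the YAML above):

```shell
# host:port pairs from dagit.workspace.servers in the infra chart:
workspace="dmg:3030 dbt:3030"
# name/port pairs from deployments[] in the user-code chart:
deployments="dmg:3030 dbt:3030"

# The two lists must agree for Dagit to load the user code locations.
[ "$workspace" = "$deployments" ] && echo "workspace matches user deployments"
```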
d
Hey solaris, do the runs show up in Dagit? (The run ID should show up in the pod name.)
are they repeats of runs that previously ran, or brand new runs?
j
If you’re removing them by deleting the K8s namespace, maybe reinstalling the Dagster helm chart stops the deletion. I’d be curious if the jobs show up if you install in a different namespace
s
@daniel I looked based on the ID (example: `940053e8-9216-4616-a997-92e25ab6110d`, corresponding to one of those mysterious pods); it doesn't show up in Dagit. I ran an ad hoc job and it worked like usual, and it shows up in the run logs as usual as well. These weird pods are stuck in perpetual `ContainerCreating`. Here's a comparison between my completed, well-behaved job on the left vs. one of the problem jobs on the right:
d
what if you use kubectl to look at the job rather than the pod, any clues there?
the k8s job, not the dagster job 🙂
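The suggestion above can be sketched as follows. A K8s Job keeps its pod template, so any pods it recreates still reference whatever volumes/secrets the template had when the Job was created, even if those were since removed from the chart. The cluster commands are illustrative (job and pod names are hypothetical); the runnable part filters sample `kubectl describe` output.

```shell
# Inspect the Job's stored pod template and the pod's events (illustrative):
#   kubectl get job dagster-run-<run_id> -o yaml
#   kubectl describe pod <pod-name>

# A missing secret/volume shows up as FailedMount events. Sample event text
# (made up to match this thread's volume/secret names):
events='Warning  FailedMount  MountVolume.SetUp failed for volume "keyjson-repo-dmg" : secret "tf-svc-dagster-repo-dmg" not found
Normal   Pulled       Container image pulled'

# Filter for the mount failure:
printf '%s\n' "$events" | grep FailedMount
```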
s
@johann we only use the default namespace, and it's tied to our ingress/domain/networking and all that, but if you think it's critical I can test in a bit
@daniel OK, good K8s job | bad job: the volume that's giving the bad job trouble doesn't exist in the Helm chart at all; I've run the gamut of docker prunes/helm removes/k8s cleanup. FYI, we removed the volume, which should be fine, as the good job (picking up the right specs) demonstrates @Diana Stan
And I feel like, since it's been hours since I deployed the corrected Helm chart without the volume mount mentioned above, and I've been deploying/removing Helm to troubleshoot, you'd think some new "weird" containers would have the right Helm specs and have moved to "Running" then "Completed"… All I know is I switched to user deployments in a separate Helm chart (following your online guidelines) late last night and made the associated config changes prior to the incident.
@daniel
`kubectl delete jobs --all`
aaaannnnd I'm not getting the issue anymore. I did not know about K8s Jobs! Wow, you guys fix everything. Must've gotten rid of the ghostliest of jobs. I think we're going to have to bake some cleanup scripts into our devops. Thank you for handing me the hammer; it's been cathartic.
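For a cleanup script like the one mentioned above, deleting only Dagster's run Jobs (rather than `--all`) avoids removing unrelated Jobs in the namespace. The cluster command is commented out; the runnable part demonstrates the name filter over a sample job list (the job names are made up).

```shell
# Delete only Dagster-created run Jobs (illustrative; needs a cluster):
#   kubectl get jobs -o name | grep 'dagster-run-' | xargs -r kubectl delete

# Runnable sketch of the same filter over sample `kubectl get jobs -o name`
# output -- only the dagster-run- entries survive:
printf '%s\n' job/dagster-run-abc123 job/other-batch-job job/dagster-run-def456 \
  | grep 'dagster-run-'
```

A longer-term alternative worth considering is letting Kubernetes garbage-collect finished Jobs via `ttlSecondsAfterFinished` rather than scripting deletion.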
d
Ah great! Not sure why those popped back up again, but glad it's sorted now.