
Dominik Liebler

11/16/2022, 8:42 AM
dagster-k8s not setting ttlSecondsAfterFinished
I use the K8sRunLauncher to run jobs (dagster 1.0.3, kubernetes 1.20.6), but it does not set .spec.ttlSecondsAfterFinished in the Job resources it creates, resulting in a lot of accumulated Jobs that are never deleted. I tried setting the dagster-k8s/config tag explicitly, as described in the documentation, but to no avail. I also tried updating to 1.0.17, but that didn’t help either. Is there something else I need to consider here?
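For context, here is a minimal sketch of the tag as I understand the docs (the op and job names are placeholders, not our actual pipeline, and the TTL value is just an example):

from dagster import job, op


@op
def do_work():
    ...


# The dagster-k8s/config tag passes fields through to the Kubernetes resources
# the run launcher creates; job_spec_config maps onto the Job's .spec, so
# ttl_seconds_after_finished should become .spec.ttlSecondsAfterFinished.
@job(
    tags={
        "dagster-k8s/config": {
            "job_spec_config": {
                "ttl_seconds_after_finished": 86400,  # clean up one day after completion
            },
        },
    },
)
def abc_etl():
    do_work()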

Adam Bloom

11/16/2022, 3:43 PM
Your k8s version may be too old. I can't find a good k8s version history of the TTL controller, but Stack Overflow claims it was still in alpha (and not enabled by default) in 1.20.

johann

11/17/2022, 9:13 PM
To clarify: does the setting not appear in the Job spec at all (if you run kubectl describe job)? Or is it there but not being respected (which would point to Adam’s suggestion)?

Dominik Liebler

11/18/2022, 6:49 AM
Thanks for your answers! It doesn’t even appear in the Job spec. When I set it manually after the Job has completed, the job is cleaned up as expected.

johann

11/18/2022, 4:59 PM
Gotcha, that’s strange. Here’s where in the code we specify ttl_seconds_after_finished: https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster-k8s/dagster_k8s/job.py#L695
Actually, I misspoke about kubectl describe job: it won’t appear there. It does show up for me, though, if I run kubectl get job <name> -o yaml. We set a 1-day TTL by default that I’d expect you to see in the spec.
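If it’s easier to check programmatically, here’s a rough sketch with the Kubernetes Python client (it assumes kubeconfig access to the cluster; the job name below is taken from a Dagster run, adjust as needed):

from kubernetes import client, config

# Load credentials from the local kubeconfig; inside the cluster you'd use
# config.load_incluster_config() instead.
config.load_kube_config()

batch = client.BatchV1Api()
run_job = batch.read_namespaced_job(
    name="dagster-run-01f11a33-c9c8-4ce9-bb0e-326b972fb72e",
    namespace="dagster",
)

# None means the field was never set on the Job spec; the dagster-k8s default
# would show up here as 86400 (one day).
print(run_job.spec.ttl_seconds_after_finished)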

Dominik Liebler

11/21/2022, 7:29 AM
Unfortunately, it’s not there. Here is an example (I stripped some sensitive credentials from it; otherwise that’s exactly what is being created on the cluster):
apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: "2022-11-16T06:10:00Z"
  labels:
    app.kubernetes.io/component: run_worker
    app.kubernetes.io/instance: dagster
    app.kubernetes.io/name: dagster
    app.kubernetes.io/part-of: dagster
    app.kubernetes.io/version: 1.0.17
    dagster/job: AbcETL
    dagster/run-id: 01f11a33-c9c8-4ce9-bb0e-326b972fb72e
  name: dagster-run-01f11a33-c9c8-4ce9-bb0e-326b972fb72e
  namespace: dagster
  resourceVersion: "270217520"
  uid: eafbdde6-2288-4db1-bc3f-5edc14889f9f
spec:
  backoffLimit: 0
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: eafbdde6-2288-4db1-bc3f-5edc14889f9f
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: run_worker
        app.kubernetes.io/instance: dagster
        app.kubernetes.io/name: dagster
        app.kubernetes.io/part-of: dagster
        app.kubernetes.io/version: 1.0.17
        controller-uid: eafbdde6-2288-4db1-bc3f-5edc14889f9f
        dagster/job: AbcETL
        dagster/run-id: 01f11a33-c9c8-4ce9-bb0e-326b972fb72e
        job-name: dagster-run-01f11a33-c9c8-4ce9-bb0e-326b972fb72e
      name: dagster-run-01f11a33-c9c8-4ce9-bb0e-326b972fb72e
    spec:
      containers:
      - args:
        - dagster
        - api
        - execute_run
        - ...
        env:
        - name: DAGSTER_HOME
          value: /opt/dagster/dagster_home
        - name: DAGSTER_PG_PASSWORD
          valueFrom:
            secretKeyRef:
              key: postgresql-password
              name: dagster-postgresql-secret
        envFrom:
        - configMapRef:
            name: dagster-dagster-user-deployments-etl-user-env
        - secretRef:
            name: dagster-slack-secret
        - secretRef:
            name: dagster-trino-credentials
        image: dagster-user-code:3.1.1
        imagePullPolicy: Always
        name: dagster
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: deployment-token-dagster
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: dagster-dagster-user-deployments-user-deployments
      serviceAccountName: dagster-dagster-user-deployments-user-deployments
      terminationGracePeriodSeconds: 30
status:
  completionTime: "2022-11-16T06:10:14Z"
  conditions:
  - lastProbeTime: "2022-11-16T06:10:14Z"
    lastTransitionTime: "2022-11-16T06:10:14Z"
    status: "True"
    type: Complete
  startTime: "2022-11-16T06:10:00Z"
  succeeded: 1

johann

11/21/2022, 3:36 PM
It looks like the K8s TTL controller was still in alpha in 1.20, then beta in 1.21, and GA in 1.23. I wonder if that’s causing your K8s distribution to ignore the setting? That doesn’t really square with “When I set it manually after the Job has completed, the job is cleaned up as expected”, though.
What Kubernetes provider are you using? Unfortunately I can’t reproduce this and don’t have much more to suggest. A suboptimal workaround would be to schedule some cleanups manually with a cron job. This example deletes Dagster jobs older than one day:
kubectl get job | grep -e dagster-run -e dagster-job | awk 'match($4,/[0-9]+d/) {print $1}' | xargs kubectl delete job
This deletes completed pods older than 1 day:
kubectl get pod | grep -e dagster-run -e dagster-job | awk 'match($3,/Completed/) {print $0}' | awk 'match($5,/[0-9]+d/) {print $1}' | xargs kubectl delete pod
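If a raw kubectl cron gets awkward, the same cleanup can be sketched with the Kubernetes Python client; the namespace and the one-day cutoff are assumptions to adjust, and it only deletes Jobs that report a completionTime:

from datetime import datetime, timedelta, timezone

from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

# Anything that finished more than a day ago is fair game.
cutoff = datetime.now(timezone.utc) - timedelta(days=1)

for run_job in batch.list_namespaced_job(namespace="dagster").items:
    name = run_job.metadata.name
    # Only touch Dagster run/step workers, mirroring the grep above.
    if not name.startswith(("dagster-run", "dagster-job")):
        continue
    completed_at = run_job.status.completion_time
    if completed_at is not None and completed_at < cutoff:
        # Background propagation removes the Job's pods as well.
        batch.delete_namespaced_job(
            name=name,
            namespace="dagster",
            propagation_policy="Background",
        )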

Dominik Liebler

11/22/2022, 7:17 AM
It’s a private cloud provider. Hopefully we’ll get an upgrade to a newer K8s version soon. Until then, I’ll use a cron job. Thanks for looking into it 👍
👍 1