
Oliver

06/22/2021, 9:25 AM
Hi, hitting an issue where k8s starts to choke and new job pods stop being created after a few hours of sustained load. I'm thinking it might be related to https://github.com/kubernetes/kubernetes/issues/95492. I'm running this on an EKS cluster. Has anyone else experienced anything similar?
This seems to be the case -- the attached graph shows the number of jobs on the y axis and time on the x axis. I tried modifying
dagster_k8s/job.py#22
to
K8S_JOB_TTL_SECONDS_AFTER_FINISHED = 5  # 5 seconds
in my user deployment, but that seems not to be the right place for it, as the jobs are still sticking around. I switched the daemon to use the user deployment image and jobs are now being deleted as expected; I will report back in a few hours with results. I also manually ran a job with the following manifest to confirm that the TTL feature was enabled in my cluster, and the job was deleted as expected:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  ttlSecondsAfterFinished: 5
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
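For reference, the same TTL check can be sketched in Python. This is a minimal sketch, not Dagster's actual job construction code: it just builds the manifest above as a plain dict (no Kubernetes client assumed, and nothing is applied to a cluster) to make explicit that `ttlSecondsAfterFinished` belongs at the Job's `spec` level, not inside the pod template.

```python
# Sketch: build the same batch/v1 Job manifest as a plain dict and check
# that ttlSecondsAfterFinished sits at the Job spec level (not in the
# pod template spec). Applying it to a cluster is out of scope here.

TTL_SECONDS = 5  # mirrors K8S_JOB_TTL_SECONDS_AFTER_FINISHED = 5 above


def make_pi_job(ttl_seconds: int) -> dict:
    """Return the pi Job manifest with the given TTL, as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "pi"},
        "spec": {
            # The TTL controller deletes the finished Job this many seconds
            # after completion (requires the TTLAfterFinished feature to be
            # enabled in the cluster, as verified manually above).
            "ttlSecondsAfterFinished": ttl_seconds,
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "pi",
                            "image": "perl",
                            "command": [
                                "perl",
                                "-Mbignum=bpi",
                                "-wle",
                                "print bpi(2000)",
                            ],
                        }
                    ],
                    "restartPolicy": "Never",
                }
            },
            "backoffLimit": 4,
        },
    }


job = make_pi_job(TTL_SECONDS)
assert job["spec"]["ttlSecondsAfterFinished"] == 5
```

A common mistake is nesting the field under `spec.template.spec`, where the TTL controller ignores it, so the jobs pile up exactly as described above.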