# ask-community
h
Hi team, is run retry the default setting now? Our job run pod failed due to OOM, and it seems that runs now get retried another 3 times, even though we have these settings and didn't specifically set retries for job runs.
runRetries:
    enabled: false
    maxRetries: 0
d
run retries should not be enabled by default, no. Can you check the Configuration tab in Dagit and see what it says under "run_retries"?
er I should clarify - the run retry feature is enabled by default, but the default number of retries is set to 0. And setting enabled: False in the values.yaml should still disable the feature.
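If those values take effect, the Configuration tab would be expected to show something along these lines under run_retries (a sketch using dagster.yaml-style keys, not output from the actual instance):
run_retries:
  enabled: false   # likely maps from dagsterDaemon.runRetries.enabled in the Helm values
  max_retries: 0   # likely maps from dagsterDaemon.runRetries.maxRetries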
h
Yeah... I found it quite strange. Here is our config (1.3.1):
local_artifact_storage:
  module: dagster._core.storage.root
  class: LocalArtifactStorage
  config:
    base_dir: /opt/dagster/dagster_home
run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      db_name: xyz
      hostname: xyz.us-west-2.rds.amazonaws.com
      params: {}
      password:
        env: xyz
      port: 5432
      username: xyz
event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      db_name: dagster_prod
      hostname: xyz.us-west-2.rds.amazonaws.com
      params: {}
      password:
        env: xyz
      port: 5432
      username: xyz
compute_logs:
  module: dagster._core.storage.noop_compute_log_manager
  class: NoOpComputeLogManager
  config: {}
schedule_storage:
  module: dagster_postgres.schedule_storage
  class: PostgresScheduleStorage
  config:
    postgres_db:
      db_name: xyz
      hostname: xyz.us-west-2.rds.amazonaws.com
      params: {}
      password:
        env: xyz
      port: 5432
      username: xyz
scheduler:
  module: dagster._core.scheduler
  class: DagsterDaemonScheduler
  config: {}
run_coordinator:
  module: dagster._core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 125
    tag_concurrency_limits:
    - key: dagster/sensor
      limit: 80
    - key: dagster/backfill
      limit: 65
    - key: databricks
      limit: 50
    - key: fabricator
      limit: 50
    - key: fabricator/source
      limit: 10
      value:
        applyLimitPerUniqueValue: true
    - key: fabricator/user
      limit: 12
      value:
        applyLimitPerUniqueValue: true
    - key: metrics-repo
      limit: 50
    - key: metrics-repo
      limit: 20
      value: event_source
    - key: metrics-repo
      limit: 25
      value: analysis_exposures
    - key: metrics-repo
      limit: 40
      value: metric_analysis
    - key: metrics-repo
      limit: 30
      value: analysis_quality_tests
    - key: simdash
      limit: 50
    - key: curator
      limit: 50
run_launcher:
  module: dagster_k8s
  class: K8sRunLauncher
  config:
    dagster_home: /opt/dagster/dagster_home
    image_pull_policy: Always
    instance_config_map: dagster-instance
    job_namespace: dagster
    load_incluster_config: true
    postgres_password_secret: dagster-postgresql-secret
    service_account_name: dagster
run_monitoring:
  enabled: true
  start_timeout_seconds: 3000
  poll_interval_seconds: 300
sensors:
  use_threads: true
  num_workers: 4
retention:
  sensor:
    purge_after_days:
      failure: 90
      skipped: 7
      started: -1
      success: -1
  schedule:
    purge_after_days: -1
telemetry:
  enabled: false
And we have pods like these
kubectl get po -n dagster | grep 09829157-20bc-4ff4-a734-6fb580ba19a3
dagster-run-09829157-20bc-4ff4-a734-6fb580ba19a3-1-cmqs9          0/1     OOMKilled         0               14h
dagster-run-09829157-20bc-4ff4-a734-6fb580ba19a3-2-slcrq          0/1     OOMKilled         0               13h
dagster-run-09829157-20bc-4ff4-a734-6fb580ba19a3-3-gxlgd          0/1     Completed         0               12h
d
It's possible that k8s is what's spinning up the pods, not dagster
if it was dagster retrying, it would be a different job
h
kubectl get job -n dagster | grep 09829157-20bc-4ff4-a734-6fb580ba19a3
dagster-run-09829157-20bc-4ff4-a734-6fb580ba19a3     0/1           14h        14h
dagster-run-09829157-20bc-4ff4-a734-6fb580ba19a3-1   0/1           14h        14h
dagster-run-09829157-20bc-4ff4-a734-6fb580ba19a3-2   0/1           13h        13h
dagster-run-09829157-20bc-4ff4-a734-6fb580ba19a3-3   1/1           16m        12h
d
although maybe not, given the -1/-2/-3 suffixes on those job names
are you certain that your daemon is also running 1.3.1?
h
Yep
d
and that your job wasn't tagged with the dagster/max_retries tag?
h
Yep. I am sure we are not setting it
Let me also check with our compute team to rule out k8s retries
d
Is this the config that you're setting in the Helm chart?
dagsterDaemon:
  runRetries:
    enabled: False
You didn't include the dagsterDaemon key in what you pasted, but I assume it's there.
Oh wait, I think I know what's happening here.
What do you have in your Helm chart under runMonitoring?
Also what version of helm are you using?
I think I have a PR that will likely fix this - https://github.com/dagster-io/dagster/pull/14323 - but I can't figure out how you would manage to get your Helm values into a state where you would need that PR,
since the Helm chart is supposed to always set max_resume_run_attempts unless you have explicitly set maxResumeRunAttempts to nil or ~ in your values.yaml.
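For reference, the run_monitoring block in the instance config pasted above has no max_resume_run_attempts key at all, so the monitoring daemon apparently falls back to a non-zero default - which would match the -1/-2/-3 jobs shown earlier. A rendered config with the cap set explicitly would look roughly like this (a sketch using the same snake_case keys as the rest of that dagster.yaml):
run_monitoring:
  enabled: true
  start_timeout_seconds: 3000
  poll_interval_seconds: 300
  max_resume_run_attempts: 0   # explicit cap on run-worker resume attempts; missing from the config above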
h
we have this for runMonitoring
runMonitoring:
    enabled: true
    startTimeoutSeconds: 3000
    pollIntervalSeconds: 300
    maxResumeRunAttempts: ~
d
Oh, heh, that'll do it. OK, the PR will help!
In the meantime, try setting maxResumeRunAttempts to 0 instead of ~
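A corrected values.yaml block would then look something like this (the snippet pasted above, with only the last line changed):
runMonitoring:
    enabled: true
    startTimeoutSeconds: 3000
    pollIntervalSeconds: 300
    maxResumeRunAttempts: 0   # an explicit 0 disables run-worker resumes; ~ drops the key so the daemon falls back to its default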
h
Got it. Thanks a lot Daniel!
actually, if I set maxResumeRunAttempts to 1, would it retry once then?
d
that's specifically for when the run worker crashes
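So to retry a failed run once, the lever would be the run retries feature from earlier in the thread rather than maxResumeRunAttempts - a sketch of the Helm values, reusing the keys from the snippets above (the per-job dagster/max_retries tag mentioned earlier is the other option):
dagsterDaemon:
  runRetries:
    enabled: true    # turn the daemon's run retry feature on
    maxRetries: 1    # retry each failed run at most once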