# ask-community
h
Hi team, is run retry the default setting now? Our job run pod failed due to OOM, and it seems that runs now get retried another 3 times, even though we have these settings and didn't specifically set retries for job runs.
runRetries:
    enabled: false
    maxRetries: 0
d
run retries should not be enabled by default, no. Can you check the Configuration tab in Dagit and see what it says under "run_retries"?
er I should clarify - the run retry feature is enabled by default, but the default number of retries is set to 0. And setting enabled: False in the values.yaml should still disable the feature.
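If those values take effect, the Configuration tab would be expected to show something along these lines under run_retries (a sketch using dagster.yaml-style keys, not output from the actual instance):
run_retries:
  enabled: false   # likely maps from dagsterDaemon.runRetries.enabled in the Helm values
  max_retries: 0   # likely maps from dagsterDaemon.runRetries.maxRetries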
h
Yeah... I found it quite strange. Here is our config (1.3.1):
local_artifact_storage:
  module: dagster._core.storage.root
  class: LocalArtifactStorage
  config:
    base_dir: /opt/dagster/dagster_home
run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      db_name: xyz
      hostname: xyz.us-west-2.rds.amazonaws.com
      params: {}
      password:
        env: xyz
      port: 5432
      username: xyz
event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      db_name: dagster_prod
      hostname: xyz.us-west-2.rds.amazonaws.com
      params: {}
      password:
        env: xyz
      port: 5432
      username: xyz
compute_logs:
  module: dagster._core.storage.noop_compute_log_manager
  class: NoOpComputeLogManager
  config: {}
schedule_storage:
  module: dagster_postgres.schedule_storage
  class: PostgresScheduleStorage
  config:
    postgres_db:
      db_name: xyz
      hostname: xyz.us-west-2.rds.amazonaws.com
      params: {}
      password:
        env: xyz
      port: 5432
      username: xyz
scheduler:
  module: dagster._core.scheduler
  class: DagsterDaemonScheduler
  config: {}
run_coordinator:
  module: dagster._core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 125
    tag_concurrency_limits:
    - key: dagster/sensor
      limit: 80
    - key: dagster/backfill
      limit: 65
    - key: databricks
      limit: 50
    - key: fabricator
      limit: 50
    - key: fabricator/source
      limit: 10
      value:
        applyLimitPerUniqueValue: true
    - key: fabricator/user
      limit: 12
      value:
        applyLimitPerUniqueValue: true
    - key: metrics-repo
      limit: 50
    - key: metrics-repo
      limit: 20
      value: event_source
    - key: metrics-repo
      limit: 25
      value: analysis_exposures
    - key: metrics-repo
      limit: 40
      value: metric_analysis
    - key: metrics-repo
      limit: 30
      value: analysis_quality_tests
    - key: simdash
      limit: 50
    - key: curator
      limit: 50
run_launcher:
  module: dagster_k8s
  class: K8sRunLauncher
  config:
    dagster_home: /opt/dagster/dagster_home
    image_pull_policy: Always
    instance_config_map: dagster-instance
    job_namespace: dagster
    load_incluster_config: true
    postgres_password_secret: dagster-postgresql-secret
    service_account_name: dagster
run_monitoring:
  enabled: true
  start_timeout_seconds: 3000
  poll_interval_seconds: 300
sensors:
  use_threads: true
  num_workers: 4
retention:
  sensor:
    purge_after_days:
      failure: 90
      skipped: 7
      started: -1
      success: -1
  schedule:
    purge_after_days: -1
telemetry:
  enabled: false
And we have pods like these
kubectl get po -n dagster | grep 09829157-20bc-4ff4-a734-6fb580ba19a3
dagster-run-09829157-20bc-4ff4-a734-6fb580ba19a3-1-cmqs9          0/1     OOMKilled         0               14h
dagster-run-09829157-20bc-4ff4-a734-6fb580ba19a3-2-slcrq          0/1     OOMKilled         0               13h
dagster-run-09829157-20bc-4ff4-a734-6fb580ba19a3-3-gxlgd          0/1     Completed         0               12h
d
It's possible that k8s is what's spinning up the pods, not dagster
if it was dagster retrying, it would be a different job
h
kubectl get job -n dagster | grep 09829157-20bc-4ff4-a734-6fb580ba19a3
dagster-run-09829157-20bc-4ff4-a734-6fb580ba19a3     0/1           14h        14h
dagster-run-09829157-20bc-4ff4-a734-6fb580ba19a3-1   0/1           14h        14h
dagster-run-09829157-20bc-4ff4-a734-6fb580ba19a3-2   0/1           13h        13h
dagster-run-09829157-20bc-4ff4-a734-6fb580ba19a3-3   1/1           16m        12h
d
although maybe not, given the -1/-2/-3 suffixes on those job names
are you certain that your daemon is also running 1.3.1?
h
Yep
d
and that your job wasn't tagged with the dagster/max_retries tag?
h
Yep. I am sure we are not setting it
Let me also check with our compute team to rule out k8s retries
d
Is this the config that you're setting in the Helm chart?
dagsterDaemon:
  runRetries:
    enabled: False
You didn't include the dagsterDaemon key in what you pasted, but I assume it's there.
Oh wait, I think I know what's happening here.
What do you have in your Helm chart under runMonitoring?
Also what version of helm are you using?
I think I have a PR that will likely fix this - https://github.com/dagster-io/dagster/pull/14323 - but I can't figure out how you would manage to get your Helm values into a state where you would need that PR,
since the Helm chart is supposed to always set max_resume_run_attempts unless you have explicitly set maxResumeRunAttempts to nil or ~ in your values.yaml.
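For reference, the run_monitoring block in the instance config pasted above has no max_resume_run_attempts key at all, so the monitoring daemon apparently falls back to a non-zero default - which would match the -1/-2/-3 jobs shown earlier. A rendered config with the cap set explicitly would look roughly like this (a sketch using the same snake_case keys as the rest of that dagster.yaml):
run_monitoring:
  enabled: true
  start_timeout_seconds: 3000
  poll_interval_seconds: 300
  max_resume_run_attempts: 0   # explicit cap on run-worker resume attempts; missing from the config above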
h
we have this for runMonitoring
runMonitoring:
    enabled: true
    startTimeoutSeconds: 3000
    pollIntervalSeconds: 300
    maxResumeRunAttempts: ~
d
Oh, heh, that'll do it. OK, the PR will help!
In the meantime, try setting maxResumeRunAttempts to 0 instead of ~
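A corrected values.yaml block would then look something like this (the snippet pasted above, with only the last line changed):
runMonitoring:
    enabled: true
    startTimeoutSeconds: 3000
    pollIntervalSeconds: 300
    maxResumeRunAttempts: 0   # an explicit 0 disables run-worker resumes; ~ drops the key so the daemon falls back to its default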
h
Got it. Thanks a lot Daniel!
actually, if I set maxResumeRunAttempts to 1, would it retry once then?
d
that's specifically for when the run worker crashes
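So to retry a failed run once, the lever would be the run retries feature from earlier in the thread rather than maxResumeRunAttempts - a sketch of the Helm values, reusing the keys from the snippets above (the per-job dagster/max_retries tag mentioned earlier is the other option):
dagsterDaemon:
  runRetries:
    enabled: true    # turn the daemon's run retry feature on
    maxRetries: 1    # retry each failed run at most once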