Hey there I recently asked about how dagster was ...
# ask-community
p
Hey there, I recently asked about how Dagster was resuming runs even though we had set maxResumeRunAttempts: 0 (https://dagster.slack.com/archives/C01U954MEER/p1680565054261259). However, we are still seeing the same thing happen after upgrading to 1.2.6. Would setting it to -1 work, or are we doing something wrong? Our values.yaml:
  # Experimental feature to add fault tolerance to Dagster runs. The new Monitoring Daemon will
  # perform health checks on run workers. If a run doesn't start within the timeout, it will be
  # marked as failed. If a run had started but then the run worker crashed, the daemon will attempt
  # to resume the run with a new run worker.
  runMonitoring:
    enabled: true
    # Timeout for runs to start (avoids runs hanging in STARTED)
    startTimeoutSeconds: 180
    # How often to check on in progress runs
    pollIntervalSeconds: 120
    # Max number of times to attempt to resume a run with a new run worker. Defaults to 3 if the
    # run launcher supports resuming runs, otherwise defaults to 0.
    maxResumeRunAttempts: 0
d
Hi Pablo - could you pass along the contents of your dagster-instance configmap so we can take a look? (A DM would be fine.) Just confirming: when you say "when we upgraded to 1.2.6", this refers to upgrading the Helm chart version, right? Not just the version of Dagster used in your images?
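For anyone following along, the point of looking at the instance configmap is to see whether the chart actually rendered `max_resume_run_attempts` under `run_monitoring` in the generated dagster.yaml. Below is a minimal sketch of that check, assuming the rendered YAML has already been dumped to a string. The helper name and both sample snippets are hypothetical and were not taken from this thread.

```python
import re
from typing import Optional


def rendered_max_resume_attempts(dagster_yaml: str) -> Optional[int]:
    """Return the rendered max_resume_run_attempts value, or None if the key is absent."""
    match = re.search(
        r"^[ \t]*max_resume_run_attempts:[ \t]*(-?\d+)[ \t]*$",
        dagster_yaml,
        re.MULTILINE,
    )
    return int(match.group(1)) if match else None


# Hypothetical rendered snippets (illustrative only).
with_key = """
run_monitoring:
  enabled: true
  start_timeout_seconds: 180
  poll_interval_seconds: 120
  max_resume_run_attempts: 0
"""

without_key = """
run_monitoring:
  enabled: true
  start_timeout_seconds: 180
  poll_interval_seconds: 120
"""

print(rendered_max_resume_attempts(with_key))
print(rendered_max_resume_attempts(without_key))
```

If the key is missing from the rendered config even though values.yaml sets it, the value was lost during templating rather than ignored by the daemon.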
p
Just sent you a DM, and yes, we upgraded both the Helm chart and the libraries to 1.2.6. Thank you!
So after changing the value of maxResumeRunAttempts to -1, the issue has not occurred again.
@daniel
d
@Pablo Beltran this is a very old post, but somebody else is running into something similar and I still can't explain what's happening, so one follow-up question: what helm command do you run to upgrade your chart?
Ah, never mind, the other report was explicitly setting maxResumeRunAttempts: ~
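One possible explanation for why 0 and ~ behaved as if the setting were unset while -1 did not, which nobody in this thread confirmed: Helm charts are Go templates, and a Go template `if` treats both 0 and null as empty. A chart that only emits the key when the value passes such a guard would drop 0 and ~ and leave the default in place. The sketch below emulates that behaviour in Python. The guard and both function names are hypothetical and are not taken from the Dagster chart.

```python
from typing import Optional


def go_template_truthy(value: Optional[int]) -> bool:
    """Mimic a Go template `if` for ints and null: 0 and nil count as empty."""
    return value is not None and value != 0


def render_setting(value: Optional[int]) -> str:
    """Hypothetical guard: emit the key only when the value is truthy."""
    if go_template_truthy(value):
        return f"max_resume_run_attempts: {value}"
    return ""  # key omitted, so whatever reads the config applies its own default


# values.yaml spellings from the thread: 0, ~ (YAML null), and -1
for label, value in [("0", 0), ("~", None), ("-1", -1)]:
    print(f"{label!r} -> {render_setting(value)!r}")
```

Under this assumption only -1 survives templating, which would match what was reported above, but checking the rendered configmap is the way to confirm it.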