Hey there I recently asked about how dagster was resuming ru dagster #ask-community

Pablo Beltran

04/14/2023, 6:50 PM

Hey there, I recently asked about how Dagster was resuming runs even though we had set maxResumeRunAttempts = 0 under the "Experimental feature to add fault tolerance to Dagster runs" section (https://dagster.slack.com/archives/C01U954MEER/p1680565054261259). However, we are still seeing the same thing happen after upgrading to 1.2.6. Would setting it to -1 work, or are we doing something wrong? Our values.yaml:

  # Experimental feature to add fault tolerance to Dagster runs. The new Monitoring Daemon will
  # perform health checks on run workers. If a run doesn't start within the timeout, it will be
  # marked as failed. If a run had started but then the run worker crashed, the daemon will attempt
  # to resume the run with a new run worker.
  runMonitoring:
    enabled: true
    # Timeout for runs to start (avoids runs hanging in STARTED)
    startTimeoutSeconds: 180
    # How often to check on in progress runs
    pollIntervalSeconds: 120
    # Max number of times to attempt to resume a run with a new run worker. Defaults to 3 if the
    # run launcher supports resuming runs, otherwise defaults to 0.
    maxResumeRunAttempts: 0

daniel

04/14/2023, 7:17 PM

Hi Pablo - could you pass along the contents of your dagster-instance configmap so we can take a look? (A DM would be fine.) Just confirming, when you say "when we upgraded to 1.2.6" - this refers to upgrading the Helm chart version, right? Not just the version of Dagster used in your images?

Pablo Beltran

04/14/2023, 7:43 PM

Just sent you a DM, and yes, we upgraded both the Helm chart and the libraries to 1.2.6. Thank you!

Pablo Beltran

04/18/2023, 5:41 PM

So after changing the value of maxResumeRunAttempts to -1, the issue has not occurred again.

Pablo Beltran

04/18/2023, 5:41 PM

@daniel

daniel

05/16/2023, 5:15 PM

@Pablo Beltran this is a very old post, but somebody else is running into something similar and I still can't explain what's happening, so one follow-up question - what helm command do you run to upgrade your chart?

daniel

05/16/2023, 6:11 PM

Ah, never mind, the other report was explicitly setting

maxResumeRunAttempts: ~
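
For reference, a rough sketch of the three values that came up in this thread, side by side. In YAML, ~ is null, so explicitly setting maxResumeRunAttempts: ~ most likely leaves the option unset and lets the documented default apply (3 when the run launcher supports resuming runs). That reading is an assumption and is not confirmed anywhere in this thread. Likewise, -1 is only what Pablo reported as working, not a documented value. The parent key that runMonitoring sits under in values.yaml is omitted here.

```yaml
runMonitoring:
  enabled: true

  # Option A: 0 is intended to disable resume attempts.
  # (Pablo still saw runs being resumed with this on 1.2.6.)
  # maxResumeRunAttempts: 0

  # Option B: -1 reportedly stopped the resumes in this thread.
  # Not a documented value; treat it as a workaround.
  maxResumeRunAttempts: -1

  # Option C: ~ is YAML null. Assumed (not confirmed) to behave as "unset",
  # so the default of 3 would apply if the run launcher supports resuming.
  # maxResumeRunAttempts: ~
```

Only one of the three lines should be active at a time; the other two are left commented out for comparison.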
