Hello! We have Dagster deployed with Helm on our OpenShift cluster. Everything works well except the user deployment pod. Neither the pod nor, it seems, the container dies, but the container's status constantly flips between ready and not ready. I see this error from the kubelet: "Readiness probe failed: command timed out", and on average it happens about twice per minute. This causes Dagit to error constantly (Dagster jobs run normally). I know I can adjust the readiness probe in the Dagster Helm values, but I wonder whether that could cause other issues. Any suggestions, ideas, or experience welcome!
06/10/2022, 3:24 PM
Hmm, my guess would be that the gRPC server is under enough load (do you have a lot of sensors/schedules?) that it’s not responding to the probe within the timeout.
Some options would be:
• making the readiness probe less strict as you mentioned. I don’t think this would have much of an adverse effect, though it doesn’t resolve the issue that the server seems to be getting overloaded
• increasing resource requests for the server
06/13/2022, 7:24 AM
We do have one schedule that runs jobs every 2 minutes and another that runs every 5 minutes. We did limit the resources too, though I eventually almost tripled them because the container was failing to start. The funny thing is that the metrics show no resource peaks.
I will try loosening the readiness probe and increasing the resources even further. Thanks for your help, I really appreciate it!
I increased timeoutSeconds from 3 to 5 and failureThreshold from 3 to 6 and so far it seems to have helped and nothing weird has happened. 👍
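For a rough sense of what that change buys, here is a minimal Python sketch that estimates how long the server can stay unresponsive before the kubelet marks the container not ready. Only `timeoutSeconds` and `failureThreshold` come from this thread. `periodSeconds` is not mentioned, so the Kubernetes default of 10 is assumed here and may not match this deployment. The `ProbeSettings` class and `seconds_until_not_ready` helper are hypothetical names used for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeSettings:
    """Subset of Kubernetes probe fields discussed in this thread."""
    timeout_seconds: int
    failure_threshold: int
    period_seconds: int  # not stated in the thread; assumed value


def seconds_until_not_ready(probe: ProbeSettings) -> int:
    """Approximate time from the start of the first failing probe until the
    container is marked not ready, assuming every probe times out."""
    # The first (failure_threshold - 1) failures are each followed by a wait
    # of one period; the last failure is only known once its timeout expires.
    return (probe.failure_threshold - 1) * probe.period_seconds + probe.timeout_seconds


ASSUMED_PERIOD_SECONDS = 10  # Kubernetes default; an assumption for this cluster

before = ProbeSettings(timeout_seconds=3, failure_threshold=3,
                       period_seconds=ASSUMED_PERIOD_SECONDS)
after = ProbeSettings(timeout_seconds=5, failure_threshold=6,
                      period_seconds=ASSUMED_PERIOD_SECONDS)

print("before:", seconds_until_not_ready(before), "s")
print("after:", seconds_until_not_ready(after), "s")
```

Under that assumed period, the tolerated window grows from roughly 23 to roughly 55 seconds. That makes the flapping less likely but, as noted earlier in the thread, does not address whatever is slowing the gRPC server's responses.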