dagster

Hi team, we're running in k8s, and from dagit we are getting a lot of sporadic timeouts when trying to reach the gRPC code repository. We increased the readiness probe timeout to 10 seconds, and it seems a little better, but we are still seeing this behavior. I can also still see runs in dagit even when the code repository isn't loaded, and it shows some jobs as running, so it seems the daemon and the jobs can still reach the code repository. The code repository pod isn't showing any errors. Is this one of our own k8s network issues, or is there something we are missing? Could running jobs be slowing down the code repository pod and making the probe time out?

How is your k8s cluster doing on cpu/mem? How about the gRPC server pod?
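
One rough way to tell a network problem apart from a slow pod is to sample connection latency to the gRPC server from another pod in the cluster while jobs are running. Below is a minimal sketch using only the Python standard library. The host name `my-user-code-service` and port `3030` are assumptions for illustration, so substitute whatever your deployment actually uses. Note that it only measures how long a TCP connection takes to open; it does not exercise the gRPC API itself, so a pod that accepts connections but answers requests slowly would still look healthy here.

```python
import socket
import time
from typing import List, Optional


def probe_tcp(host: str, port: int, timeout: float = 10.0) -> Optional[float]:
    """Return the seconds taken to open a TCP connection, or None on failure/timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return None


def sample_latency(host: str, port: int, attempts: int = 20,
                   interval: float = 1.0, timeout: float = 10.0) -> List[Optional[float]]:
    """Probe the endpoint repeatedly, sleeping `interval` seconds between attempts."""
    results = []
    for i in range(attempts):
        results.append(probe_tcp(host, port, timeout))
        if i < attempts - 1:
            time.sleep(interval)
    return results


def summarize(results: List[Optional[float]]) -> dict:
    """Count failed probes and report the slowest successful connection."""
    ok = [r for r in results if r is not None]
    return {
        "attempts": len(results),
        "failures": len(results) - len(ok),
        "max_seconds": max(ok) if ok else None,
    }


if __name__ == "__main__":
    # Hypothetical service name and port; replace with your own values.
    print(summarize(sample_latency("my-user-code-service", 3030)))
```

If connections open quickly every time but dagit still times out, that points more toward the server pod being starved for CPU or memory than toward the cluster network.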