Félix Tremblay

05/23/2023, 10:23 PM
Hello 👋, I've been experiencing some issues with failed runs before they start. In the logs I see
Agent xxxxxxxx detected that run worker failed.
Over the past 7 days, we have encountered three instances of such errors, which account for approximately 0.1% of the total runs. I was wondering if anyone has investigated the root causes behind these worker failures. Additionally, I wanted to suggest the possibility of implementing an automatic process wherein the agent can initiate the launch of a new worker when such failures occur. This could help mitigate the impact of these issues and ensure smoother operations. Thanks!


05/24/2023, 3:25 AM
Hi Félix - thanks for reporting this, we believe this occurs when the underlying Fargate cluster that we're using to launch serverless runs has problems provisioning isolated network resources. We're investigating ways to automatically retry when these classes of issues occur.
👍 1
Just landed a fix for this that will go out in the release next week
❤️ 1