Thanks Johann, good to have corroboration.
The latest from EKS (last night) is that they're pushing for retries: "From EKS side, we have reduced the defrag frequency - earlier it was once in 6 hours, now it is once in 24 hours. This should reduce the impact but I would request you to please check if the workload can retry upon a timeout to make it more resilient."
I'm not actually sure that reducing defrags from every 6 hours to every 24 hours will help: we are seeing an interruption only every 1-2 days anyway, and presumably a less frequent defrag will take longer each time it runs. So adding retries would be great!
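
In case it helps, here is a minimal sketch of what "retry upon a timeout" could look like on the workload side, assuming the client surfaces the stall as a Python `TimeoutError`. That exception type is a stand-in, and `retry_on_timeout` and its parameters are hypothetical names for illustration; the real client library may raise something different, so the `except` clause would need adjusting.

```python
import random
import time


def retry_on_timeout(call, attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Run call(); on TimeoutError, back off and retry up to `attempts` times.

    Hypothetical helper: TimeoutError stands in for whatever timeout
    exception the real API client raises.
    """
    for attempt in range(attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the timeout to the caller
            # The backoff window doubles per attempt and is capped at max_delay.
            # A random delay within the window (jitter) keeps many clients
            # from retrying in lockstep once the API server responds again.
            window = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, window))
```

This is only safe to wrap around requests that are idempotent (reads, or writes that can be repeated without harm), and the attempt count and delays above are placeholders rather than values tuned to how long a defrag actually blocks.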