:wave: Hello, team! I am having trouble executing...
# ask-community
👋 Hello, team! I am having trouble executing my deep learning model training pipeline using k8s-executor. The pipeline itself has 24 training ops executed simultaneously. These ops are executed on designated k8s nodes selected by NodeSelector label and they are scaled up and down by cluster autoscaler and an AWS autoscaling group depending on the cpu utilization. The problem is after sometime during the training the pipeline get terminated with this error:
Execution of run for "" failed. Execution was interrupted unexpectedly. No user initiated termination request was found, treating as failure.
I've no clue where does this termination comes from.
I’ve seen that type of log when the k8s node is scaled down and kills all pods running
👍 1
I suspect the same.
you may want to add
"<http://cluster-autoscaler.kubernetes.io/safe-to-evict|cluster-autoscaler.kubernetes.io/safe-to-evict>": "false"
to the pods/code location template to prevent node scaledown. the tricky thing here is the completed pods will still have this annotation and you probably need a mechanism to set that annotation to true