Michel Rouly (05/27/2022, 6:47 PM)
Has anyone dealt with the Kubernetes cluster autoscaler (CA) evicting Dagster pods (dagster-step) in order to scale down a node? 🧵

Michel Rouly (05/27/2022, 6:47 PM)
One option is adding the safe-to-evict: false annotation, but that kind of feels like a bandaid. Plus we may have other eviction forces beyond just the CA (e.g. the k8s descheduler), so I don't love that option.
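For reference, a sketch of what that bandaid could look like when applied through Dagster's per-op k8s config tags, assuming the dagster-k8s schema's pod_template_spec_metadata key (worth verifying against your dagster-k8s version):

safe_to_evict_tags = {
    "dagster-k8s/config": {
        "pod_template_spec_metadata": {
            "annotations": {
                # Tells the cluster autoscaler it may not evict this
                # pod in order to scale down a node.
                "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
            }
        }
    }
}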
Michel Rouly (05/27/2022, 6:48 PM)
We also tried a PodDisruptionBudget (maxUnavailable: 0), but it looks like this doesn't work for jobs (this message shows up on our PDB's status):

message: jobs.batch does not implement the scale subresource
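That status is k8s explaining the restriction: for pods whose controller doesn't implement the scale subresource (batch Jobs don't), a PDB may only use an integer minAvailable; maxUnavailable and percentage values are rejected. A sketch of the kind of PDB that produces that status (selector label is hypothetical):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dagster-runs
spec:
  maxUnavailable: 0    # rejected for job-backed pods: no scale subresource
  selector:
    matchLabels:
      app.kubernetes.io/instance: dagster    # hypothetical label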
Michel Rouly (05/27/2022, 6:49 PM)
The same concern applies to metrics-server and other pods that we do want evictable despite having local storage.
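The cluster-wide knob for local-storage pods is presumably the autoscaler's --skip-nodes-with-local-storage flag, which is all-or-nothing and so can't distinguish dagster pods from metrics-server; an illustrative Deployment fragment:

# cluster-autoscaler Deployment fragment (illustrative)
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler
          command:
            - ./cluster-autoscaler
            # false => pods with local storage no longer block
            # scale-down; applies to every pod in the cluster
            - --skip-nodes-with-local-storage=false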
Michel Rouly (05/27/2022, 7:20 PM)
It looks like these pods are managed by the Job controller, which CA won't evict.
fahad (05/27/2022, 8:24 PM)
clustered_extract_tumor_mutation_profile_workflow (06dfdcb4-ac74-4846-8894-a4f3157151a2) started a new run worker while the run was already in state DagsterRunStatus.STARTED. This most frequently happens when the run worker unexpectedly stops and is restarted by the cluster. Marking the run as failed.

Noticed weirdness with the dagster-run pod for this job. It definitely took a long time to spin up. Once the pod actually came up, it caused the error above to be printed and the run was killed.
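That "started a new run worker while the run was already in state DagsterRunStatus.STARTED" message comes from Dagster's run monitoring. If the chart version supports it, run monitoring can be tuned to tolerate slow startups, or to resume an interrupted run worker instead of failing the run; a sketch of Helm values, with field names assumed from the dagster chart and worth double-checking:

# values.yaml (Dagster Helm chart) -- assumed field names
dagsterDaemon:
  runMonitoring:
    enabled: true
    startTimeoutSeconds: 300    # tolerate slow pod startup before failing the run
    maxResumeRunAttempts: 3     # resume the run worker instead of failing the run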
fahad (05/27/2022, 8:26 PM)
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "dagster-run-06dfdcb4-ac74-4846-8894-a4f3157151a2--1-lw9g2": operation timeout: context deadline exceeded
Andrea Giardini (06/03/2022, 12:14 PM)

# Attaches a per-run scratch disk to the step pod as a generic
# ephemeral volume; `size` (in Gi) and `mount_point` are defined
# elsewhere in the surrounding code.
extra_tags = {
    "dagster-k8s/config": {
        "pod_spec_config": {
            "volumes": [
                {
                    "name": "scratch-data",
                    "ephemeral": {
                        "volumeClaimTemplate": {
                            "spec": {
                                "accessModes": ["ReadWriteOnce"],
                                "storageClassName": "ssd",
                                "resources": {
                                    "requests": {"storage": str(size) + "Gi"}
                                },
                            }
                        }
                    },
                }
            ]
        },
        "container_config": {
            "volume_mounts": [
                {"mountPath": mount_point, "name": "scratch-data"}
            ]
        },
    }
}
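Presumably those tags get attached per op (or per job); a minimal usage sketch with a hypothetical op name:

from dagster import op

@op(tags=extra_tags)
def extract_tumor_mutation_profile(context):
    # Heavy intermediate files go to the ephemeral scratch volume
    # mounted at `mount_point`.
    ...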
fahad (06/03/2022, 3:25 PM)
We ended up making two fixes. One was PodDisruptionBudgets to prevent the autoscaler and our k8s-descheduler from evicting dagster pods. The other was related to the disk storage symptom, in that we had a few pods set up with too-high memory limits that were causing memory pressure on the node.

Even after those two fixes we are still seeing the k8s control plane taint the nodes that the dagster-run pod lives on with a "NoSchedule" taint. Eventually k8s considers the node unschedulable as well, and the entire node is brought down.
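Since job-backed pods only support an absolute integer minAvailable in a PDB, one blunt way to block voluntary evictions is to set minAvailable higher than the matching pod count will ever reach, so disruptionsAllowed stays at 0. A sketch of that approach (selector label is hypothetical):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dagster-no-evict
spec:
  # Larger than the pod count will ever be => allowedDisruptions is
  # always 0, so the autoscaler and descheduler cannot evict matches.
  minAvailable: 100000
  selector:
    matchLabels:
      app.kubernetes.io/instance: dagster    # hypothetical label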
fahad (06/03/2022, 3:26 PM)
OpRetryPolicy
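Only the identifier survived here; if it refers to op-level retries, Dagster's API is RetryPolicy, applied per op — a minimal sketch:

from dagster import Backoff, RetryPolicy, op

@op(retry_policy=RetryPolicy(max_retries=3, delay=30, backoff=Backoff.EXPONENTIAL))
def flaky_step(context):
    # Retried up to 3 times, with exponential backoff starting at 30s.
    ...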