# deployment-kubernetes
hi, I'm running into an issue where occasionally our celery worker containers (using the CeleryK8sRunLauncher) restart. This leaves any runs that had steps being orchestrated by that worker in a stuck state, and they have to be force-terminated -- is this something anyone has seen before? Is a different run launcher recommended for production deployments? (more details in thread)
the last couple of log lines from the container that shut down:
```
2022-04-20T07:41:27.485987893Z [2022-04-20 07:41:27,485: INFO/MainProcess] Task execute_step_k8s_job[d0808b75-b22b-4417-9a3d-c7e417a379f4] received
2022-04-20T07:42:27.887700661Z [2022-04-20 07:42:27,887: INFO/ForkPoolWorker-16] Task execute_step_k8s_job[d0808b75-b22b-4417-9a3d-c7e417a379f4] succeeded in 60.40009546489455s: ['{"__class__": "DagsterEvent", "event_specific_data": {"__class__": "EngineEventData", "error": null, "marker_end": "celery_queue_wait", "marker_start": null, "metadata_entries": [{"__class__": "EventMetadataEntry", "description": null, "entry_data": {"__class__": "TextMetadataEntryData", "text": "alerts"}, "label": "Step key"}, {"__class__": "EventMetadataEntry", "description": null, "entry_data": {"__class__": "TextMetadataEntryData", "text": "dagster-step-5fb89fc8e1845e06b303fb521ec94a8a"}, "label": "Kubernetes Job name"}, {"__class__": "EventMetadataEntry", "description": null, "entry_data": {"__class__": "TextMetadataEntryData", "text": "<http://307185671274.dkr.ecr.us-west-2.amazonaws.com/dagster-repository:notebooks-8a9ba69f51c54d25d8ff42c4fa2f0958984ccbd6|307185671274.dkr.ecr.us-west-2.amazonaws.com/dagster-repository:notebooks-8a9ba69f51c54d25d8ff42c4fa2f0958984ccbd6>"}, "label": "Job image"}, {"__class__": "EventMetadataEntry", "description": null, "entry_data": {"__class__": "TextMetadataEntryData", "text": "Always"}, "label": "Image pull policy"}, {"__class__": "EventMetadataEntry", "description": null, "entry_data":...', ...]
2022-04-20T07:42:49.155264693Z [2022-04-20 07:42:49,155: INFO/MainProcess] Task execute_step_k8s_job[7d8364bc-85f3-4676-9451-1d417712e2b5] received
2022-04-20T08:08:17.572452561Z [2022-04-20 08:08:17,572: INFO/MainProcess] Task execute_step_k8s_job[5c8832da-ab3a-4b85-b57f-1dd467a924b9] received
2022-04-20T08:08:37.513113674Z worker: Warm shutdown (MainProcess)
```
the pod itself is still running fine, but it's seen 2 container restarts in the past 20 days
in kubernetes, I see that the container exited with an unknown status code:
```
Last Status
Reason: Unknown - exit code: 255
Started at: 2022-04-01T15:18:23Z
Finished at: 2022-04-20T08:09:08Z
```
it's weird to me that "warm shutdown" is logged, but the worker didn't actually wait for the tasks to finish
having used the celery k8s run launcher/executor a lot over the last 6 months, I can’t recommend it for production due to issues like this and other sharp edges. the k8s and multiprocess executors feel stable for medium workloads. For anything massively parallel (fan-outs over 100x) I recommend using something lower level, e.g. dask
got it, appreciate the input -- I'll look into moving over to the k8s launcher
Hmm I haven’t seen this particular error, and I’m not finding anything in the docs for exit code 255. Do you have any guesses what’s causing the warm shutdown?
In a similar vein to Alex’s suggestion, if you don’t need celery features, the K8sRunLauncher has fewer moving parts. Use the k8s executor to get the same pod-per-op behavior you have currently
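For reference, a minimal sketch of what switching the run launcher might look like in `dagster.yaml`. This assumes `dagster_k8s` is installed; the `job_namespace` / `service_account_name` values below are placeholders, and the exact set of config keys depends on your Dagster version and Helm setup:

```yaml
# dagster.yaml -- illustrative instance config, not a drop-in replacement
run_launcher:
  module: dagster_k8s
  class: K8sRunLauncher
  config:
    job_namespace: dagster         # namespace where run pods are created (placeholder)
    service_account_name: dagster  # service account with permission to create jobs (placeholder)
    image_pull_policy: Always
```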
looking at the pod metrics, I don't see any spikes in cpu/mem utilization, and the container is running well within the resource limits set for it -- so I'm also pretty stumped as to why the container decided to restart
we do have one pipeline w/ a fanout where we need to limit the concurrency of the fanout, but we might be able to solve that just by slapping a generous retry policy on the ops
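For the fan-out concurrency piece, one option is capping step parallelism in run config rather than relying on retries alone. A sketch, assuming the default multiprocess executor; the exact nesting of the `max_concurrent` key varies between Dagster's pipeline- and job-style APIs, so treat this as illustrative:

```yaml
# run config sketch: cap the fan-out at 8 concurrent op subprocesses
execution:
  config:
    multiprocess:
      max_concurrent: 8
```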
tbh I recently started running fan-outs with the multiprocess executor on r5.12xlarge machines or larger
works well as we’re using karpenter.sh so you can switch machine types quite flexibly