# deployment-kubernetes
w
hi, I'm running into an issue where occasionally our celery worker containers (using the CeleryK8sRunLauncher) are restarting. This is leaving any runs that had steps being orchestrated by that worker in a stuck state, where they have to be force-terminated -- is this something anyone has seen before? is it recommended to use a different run launcher in production deployments? (more details in thread)
the last couple logs from the container that shut down:
2022-04-20T07:41:27.485987893Z [2022-04-20 07:41:27,485: INFO/MainProcess] Task execute_step_k8s_job[d0808b75-b22b-4417-9a3d-c7e417a379f4] received
2022-04-20T07:42:27.887700661Z [2022-04-20 07:42:27,887: INFO/ForkPoolWorker-16] Task execute_step_k8s_job[d0808b75-b22b-4417-9a3d-c7e417a379f4] succeeded in 60.40009546489455s: ['{"__class__": "DagsterEvent", "event_specific_data": {"__class__": "EngineEventData", "error": null, "marker_end": "celery_queue_wait", "marker_start": null, "metadata_entries": [{"__class__": "EventMetadataEntry", "description": null, "entry_data": {"__class__": "TextMetadataEntryData", "text": "alerts"}, "label": "Step key"}, {"__class__": "EventMetadataEntry", "description": null, "entry_data": {"__class__": "TextMetadataEntryData", "text": "dagster-step-5fb89fc8e1845e06b303fb521ec94a8a"}, "label": "Kubernetes Job name"}, {"__class__": "EventMetadataEntry", "description": null, "entry_data": {"__class__": "TextMetadataEntryData", "text": "307185671274.dkr.ecr.us-west-2.amazonaws.com/dagster-repository:notebooks-8a9ba69f51c54d25d8ff42c4fa2f0958984ccbd6"}, "label": "Job image"}, {"__class__": "EventMetadataEntry", "description": null, "entry_data": {"__class__": "TextMetadataEntryData", "text": "Always"}, "label": "Image pull policy"}, {"__class__": "EventMetadataEntry", "description": null, "entry_data":...', ...]
2022-04-20T07:42:49.155264693Z [2022-04-20 07:42:49,155: INFO/MainProcess] Task execute_step_k8s_job[7d8364bc-85f3-4676-9451-1d417712e2b5] received
2022-04-20T08:08:17.572452561Z [2022-04-20 08:08:17,572: INFO/MainProcess] Task execute_step_k8s_job[5c8832da-ab3a-4b85-b57f-1dd467a924b9] received
2022-04-20T08:08:37.513093234Z
2022-04-20T08:08:37.513113674Z worker: Warm shutdown (MainProcess)
the pod itself is still running fine, but it's seen 2 container restarts in the past 20 days
in Kubernetes, I see that the container terminated with reason "Unknown" and exit code 255:
Last Status
terminated
Reason: Unknown - exit code: 255
Started at: 2022-04-01T15:18:23Z
Finished at: 2022-04-20T08:09:08Z
it's weird to me that "warm shutdown" is logged, but the worker didn't actually wait for the tasks to finish
a
having used the celery k8s run launcher/executor a lot over the last 6 months, I can’t recommend it for production due to issues like this and other sharp edges. the k8s and multiprocess executors feel stable for medium workloads. For anything massively parallel (fanouts over 100x) I recommend using something lower level, e.g. dask
w
got it, appreciate the input -- I'll look into moving over to the k8s launcher
j
Hmm I haven’t seen this particular error, and I’m not finding anything in the docs for exit code 255. Do you have any guesses what’s causing the warm shutdown?
In a similar vein to Alex’s suggestion, if you don’t need celery features, the K8sRunLauncher has fewer moving parts. Use the k8s_job_executor to get the same pod-per-op behavior you have currently
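(for anyone landing here later, a minimal sketch of what that swap could look like, assuming dagster-k8s is installed and the deployment's run launcher is the K8sRunLauncher -- the op/job names are placeholders, not from this thread)

# hypothetical example: run each op in its own Kubernetes Job via k8s_job_executor
from dagster import job, op
from dagster_k8s import k8s_job_executor


@op
def alerts():
    ...


# attaching the executor to the job gives the same pod-per-op behavior
@job(executor_def=k8s_job_executor)
def alerts_job():
    alerts()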
w
looking at the pod metrics, I don't see any spikes in cpu/mem utilization, and the container is running well within the resource limits set for it -- so I'm also pretty stumped as to why the container decided to restart
we do have one pipeline w/ a fanout where we need to limit the concurrency of the fanout, but we might be able to solve that just by slapping a generous retry policy on the ops
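(rough sketch of the retry-policy approach, using dagster's RetryPolicy with a dynamic-output fanout -- op/job names and retry numbers here are made up, not from this thread)

from dagster import DynamicOut, DynamicOutput, RetryPolicy, job, op


@op(out=DynamicOut())
def fan_out():
    # hypothetical fanout: one dynamic output per item
    for i in range(100):
        yield DynamicOutput(i, mapping_key=str(i))


# generous retry policy so transient step failures are retried instead of sticking
@op(retry_policy=RetryPolicy(max_retries=5, delay=60))
def process_item(item: int):
    ...


@job
def fanout_job():
    fan_out().map(process_item)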
a
tbh I recently started running fanouts with the multiprocess executor on r5.12xlarge machines or larger
works well as we’re using karpenter.sh so you can switch machine types quite flexibly