# deployment-kubernetes
Hey dagster team! We’ve been trying to debug this issue for a few weeks now on our end and are still running into it. The scenario is we have a
that does some computation. We use dynamic ops with map/collect to split the work across dagster step pods. The computation each step pod does is expensive: it runs about 2.5 hours in staging and about a day in production. The problem only shows up in production: the step worker pod is marked as complete in Kubernetes, but in dagit it still shows as running. The step pod gets to the expensive computation, but during it we see the following dagster log after our own `Running expensive computation` log:
```
Step worker started for "graph_name.op_name[partition_604062_612225]".
```
Somewhere, there is a mismatch between dagster and Kubernetes and we are not sure where or why. Here is the log in the dagster-step pod that is marked as complete in Kubernetes but running in dagster:
```json
{
  "__class__": "DagsterEvent",
  "event_specific_data": {
    "__class__": "EngineEventData",
    "error": null,
    "marker_end": "step_process_start",
    "marker_start": null,
    "metadata_entries": [
      {
        "__class__": "EventMetadataEntry",
        "description": null,
        "entry_data": {
          "__class__": "TextMetadataEntryData",
          "text": "14"
        },
        "label": "pid"
      }
    ]
  },
  "event_type_value": "STEP_WORKER_STARTED",
  "logging_tags": {},
  "message": "Step worker started for \"graph_name.op_name[partition_604062_612225]\".",
  "pid": 14,
  "pipeline_name": "__ASSET_JOB",
  "solid_handle": null,
  "step_handle": null,
  "step_key": "graph_name.op_name[partition_604062_612225]",
  "step_kind_value": null
}
```
Appreciate any and all help!
Strange. Does Kubernetes show the jobs as successfully exited?
Is this with the
Yeah it does. Shows as the completed status.
It is with the
It seems like the steps are getting interrupted somehow, but I’m not certain how that would happen without an error at least in the Kubernetes events
We have some changes planned that would at least surface the problem to dagster here (treating a run as failed if its k8s pods have exited; currently we only watch for k8s failures)
Fwiw we didn’t see any behavior like this in our dev environment. Could just be a resourcing issue
Looks like our autoscaler (Karpenter) was evicting the pods and trying to move them to other nodes. I do think the changes you mentioned y’all have planned would help surface the error better in the future