Jordan Wolinsky

03/17/2023, 7:30 PM
Hey dagster team! We’ve been trying to debug this issue for a few weeks now and are still running into it. The scenario: we have a @graph_asset that does some computation. We use dynamic ops with map/collect to split the work across dagster step pods. The computation each step pod does is expensive; it runs for about 2.5 hours in staging and about a day in production. The problem shows up when we run in production: the step worker pod is marked as complete in Kubernetes, but in dagit it still shows as running. The step pod gets to the expensive computation, but during it we see the following dagster log right after our own log, Running expensive computation:
Step worker started for "graph_name.op_name[partition_604062_612225]".
Somewhere there is a mismatch between dagster and Kubernetes, and we are not sure where or why. Here is the log from the dagster step pod that is marked as complete in Kubernetes but still shown as running in dagster:
{
  "__class__": "DagsterEvent",
  "event_specific_data": {
    "__class__": "EngineEventData",
    "error": null,
    "marker_end": "step_process_start",
    "marker_start": null,
    "metadata_entries": [
      {
        "__class__": "EventMetadataEntry",
        "description": null,
        "entry_data": {
          "__class__": "TextMetadataEntryData",
          "text": "14"
        },
        "label": "pid"
      }
    ]
  },
  "event_type_value": "STEP_WORKER_STARTED",
  "logging_tags": {},
  "message": "Step worker started for \"graph_name.op_name[partition_604062_612225]\".",
  "pid": 14,
  "pipeline_name": "__ASSET_JOB",
  "solid_handle": null,
  "step_handle": null,
  "step_key": "graph_name.op_name[partition_604062_612225]",
  "step_kind_value": null
}
Appreciate any and all help!
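For readers following along, here is a minimal sketch of the pattern Jordan describes: a @graph_asset whose work is fanned out across step pods with DynamicOut plus map/collect. This is not Jordan's actual code; the op names, partition ids, and chunking logic are hypothetical stand-ins.

from dagster import DynamicOut, DynamicOutput, graph_asset, op


@op(out=DynamicOut())
def split_work():
    # Emit one dynamic output per chunk of work; under the k8s_job_executor,
    # each mapped step runs in its own Kubernetes step pod.
    for partition_id in ["604062_612225", "612226_620389"]:
        yield DynamicOutput(partition_id, mapping_key=f"partition_{partition_id}")


@op
def expensive_computation(partition_id: str) -> int:
    # Placeholder for the expensive step (~2.5 hours in staging, ~1 day in production).
    return len(partition_id)


@op
def combine(results: list) -> int:
    return sum(results)


@graph_asset
def my_asset():
    # Fan out over the dynamic outputs, then collect the mapped results.
    results = split_work().map(expensive_computation)
    return combine(results.collect())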

johann

03/17/2023, 10:49 PM
Strange. Does Kubernetes show the jobs as successfully exited?
Is this with the k8s_job_executor?

Jordan Wolinsky

03/17/2023, 10:52 PM
Yeah it does. It shows the completed status.
It is with the k8s_job_executor.
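For context, a minimal sketch of attaching the k8s_job_executor to an asset job. The job name, the trivial asset, and the Definitions layout are assumptions for illustration, not Jordan's actual setup.

from dagster import Definitions, asset, define_asset_job
from dagster_k8s import k8s_job_executor


@asset
def my_asset():
    # Stand-in for the graph-backed asset in the earlier sketch.
    return 1


defs = Definitions(
    assets=[my_asset],
    jobs=[
        define_asset_job(
            "expensive_asset_job",
            selection="my_asset",
            # Run each step in its own Kubernetes job/pod.
            executor_def=k8s_job_executor,
        )
    ],
)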

johann

03/20/2023, 4:28 PM
It seems like the steps are getting interrupted somehow, but I’m not certain how that would happen without at least an error in the Kubernetes events.
We have some changes planned that would at least surface the problem to dagster here (treating a run as failed if its k8s pods have exited; currently we only watch for k8s failures).

Jordan Wolinsky

03/20/2023, 5:01 PM
Fwiw we didn’t see any behavior like this in our dev environment. It could simply be a resourcing issue.
It looks like our autoscaler (Karpenter) was evicting the pods and trying to move them around. I do think the changes you mentioned would help surface this kind of error better in the future.
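One possible mitigation, offered as an assumption rather than something confirmed in this thread: long-running step pods can be marked as non-evictable so Karpenter's consolidation leaves them alone. With the k8s_job_executor, per-step pod config can be supplied through the dagster-k8s/config tag on the op. The annotation name depends on the Karpenter release (karpenter.sh/do-not-evict on older versions, karpenter.sh/do-not-disrupt on newer ones), and the op name here is hypothetical.

from dagster import op


@op(
    tags={
        "dagster-k8s/config": {
            "pod_template_spec_metadata": {
                # Assumption: ask Karpenter not to voluntarily evict this step pod.
                "annotations": {"karpenter.sh/do-not-evict": "true"}
            }
        }
    }
)
def expensive_computation(partition_id: str) -> int:
    # Placeholder for the long-running work.
    return len(partition_id)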