# deployment-kubernetes
Hey dagster team! We’ve been trying to debug this issue for a few weeks now on our end and are still running into it. The scenario is we have a
that does some computation. We use dynamic ops with map/collect to split the work across dagster step pods. The computation each step pod does is expensive: it runs about 2.5 hours in staging and about a day in production. The problem only shows up in production: the step worker pod is marked as complete in Kubernetes, but in dagit it still shows as running. The step pod gets to the expensive computation, but during it we see the following dagster log after our own `Running expensive computation` log:
```
Step worker started for "graph_name.op_name[partition_604062_612225]".
```
Somewhere, there is a mismatch between dagster and Kubernetes and we are not sure where or why. Here is the log in the dagster-step pod that is marked as complete in Kubernetes but running in dagster:
```json
{
  "__class__": "DagsterEvent",
  "event_specific_data": {
    "__class__": "EngineEventData",
    "error": null,
    "marker_end": "step_process_start",
    "marker_start": null,
    "metadata_entries": [
      {
        "__class__": "EventMetadataEntry",
        "description": null,
        "entry_data": {
          "__class__": "TextMetadataEntryData",
          "text": "14"
        },
        "label": "pid"
      }
    ]
  },
  "event_type_value": "STEP_WORKER_STARTED",
  "logging_tags": {},
  "message": "Step worker started for \"graph_name.op_name[partition_604062_612225]\".",
  "pid": 14,
  "pipeline_name": "__ASSET_JOB",
  "solid_handle": null,
  "step_handle": null,
  "step_key": "graph_name.op_name[partition_604062_612225]",
  "step_kind_value": null
}
```
Appreciate any and all help!
Strange. Does Kubernetes show the jobs as successfully exited?
Is this with the
Yeah it does. Shows as the completed status.
It is with the
It seems like the steps are getting interrupted somehow, but I’m not certain how that would happen without an error at least in the Kubernetes events
We have some changes planned that would at least surface the problem to dagster here (treating a run as failed if its k8s pods have exited; currently we only watch for k8s failures)
Fwiw we didn’t see any behavior like this in our dev environment. Could just be a resourcing issue
Looks like our autoscaler (Karpenter) was evicting the pods and trying to move them to other nodes. I do think the changes you mentioned y’all have planned would help surface the error better in the future