Jordan Wolinsky
03/17/2023, 7:30 PM
We have a `@graph_asset` that does some computation. We use dynamic ops with map/collect to split the work across Dagster step pods. The computation each step pod performs is expensive: about 2.5 hours in staging and about a day in production. The problem appears when we run in production: the step worker pod is marked as complete in Kubernetes, but in Dagit it still shows as running. The step pod reaches the expensive computation, and during it we see the following Dagster log right after our own log line `Running expensive computation`:

Step worker started for "graph_name.op_name[partition_604062_612225]".

Somewhere there is a mismatch between Dagster and Kubernetes, and we are not sure where or why.
Here is the log in the dagster-step pod that is marked as complete in Kubernetes but running in Dagster:

```json
{
  "__class__": "DagsterEvent",
  "event_specific_data": {
    "__class__": "EngineEventData",
    "error": null,
    "marker_end": "step_process_start",
    "marker_start": null,
    "metadata_entries": [
      {
        "__class__": "EventMetadataEntry",
        "description": null,
        "entry_data": {
          "__class__": "TextMetadataEntryData",
          "text": "14"
        },
        "label": "pid"
      }
    ]
  },
  "event_type_value": "STEP_WORKER_STARTED",
  "logging_tags": {},
  "message": "Step worker started for \"graph_name.op_name[partition_604062_612225]\".",
  "pid": 14,
  "pipeline_name": "__ASSET_JOB",
  "solid_handle": null,
  "step_handle": null,
  "step_key": "graph_name.op_name[partition_604062_612225]",
  "step_kind_value": null
}
```
Appreciate any and all help!

johann
03/17/2023, 10:49 PM
`k8s_job_executor`?

Jordan Wolinsky
03/17/2023, 10:52 PM
`k8s_job_executor`
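For readers following along: wiring a graph to the `k8s_job_executor` (so each step runs in its own Kubernetes Job/pod, as described above) looks roughly like this. The graph body, namespace, and `max_concurrent` value are illustrative assumptions, not taken from this thread:

```python
from dagster import graph, op
from dagster_k8s import k8s_job_executor

@op
def expensive_computation():
    # Placeholder for the long-running work.
    ...

@graph
def graph_name():
    expensive_computation()

# Each op becomes its own Kubernetes Job; pods can outlive the
# executor's view of them, which is where K8s/Dagster state can diverge.
graph_job = graph_name.to_job(
    executor_def=k8s_job_executor.configured(
        {
            "job_namespace": "dagster",  # assumed namespace
            "max_concurrent": 10,        # assumed cap on parallel step pods
        }
    )
)
```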
johann
03/20/2023, 4:28 PM

Jordan Wolinsky
03/20/2023, 5:01 PM