# ask-community
l
I’m seeing a failed run but the op (running via k8s executor) is still running (see screenshots in thread 🧵) https://nautilus.dagster.cloud/dev2/instance/runs/f81320ba-084c-496e-884e-5319f2dd7a30?logKey=floor_indexing_loop_op&selection=%2A
dagster_cloud.errors.DagsterCloudHTTPError: Unexpected GraphQL response:
  File "/usr/local/lib/python3.7/site-packages/dagster/core/execution/api.py", line 789, in pipeline_execution_iterator
    for event in pipeline_context.executor.execute(pipeline_context, execution_plan):
  File "/usr/local/lib/python3.7/site-packages/dagster/core/executor/step_delegating/step_delegating_executor.py", line 186, in execute
    plan_context.run_id,
  File "/usr/local/lib/python3.7/site-packages/dagster/core/executor/step_delegating/step_delegating_executor.py", line 51, in _pop_events
    events = instance.logs_after(run_id, self._event_cursor, of_type=set(DagsterEventType))
  File "/usr/local/lib/python3.7/site-packages/dagster/utils/__init__.py", line 611, in inner
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/__init__.py", line 1248, in logs_after
    limit=limit,
  File "/usr/local/lib/python3.7/site-packages/dagster_cloud/storage/event_logs/storage.py", line 270, in get_logs_for_run
    "limit": limit,
  File "/usr/local/lib/python3.7/site-packages/dagster_cloud/storage/event_logs/storage.py", line 235, in _execute_query
    res = self._graphql_client.execute(query, variable_values=variables)
  File "/usr/local/lib/python3.7/site-packages/dagster_cloud/storage/client.py", line 63, in execute
    return self._execute_retry(query, variable_values)
  File "/usr/local/lib/python3.7/site-packages/dagster_cloud/storage/client.py", line 104, in _execute_retry
    raise DagsterCloudHTTPError(http_error) from http_error
The above exception was caused by the following exception:
requests.exceptions.HTTPError: Unexpected GraphQL response
  File "/usr/local/lib/python3.7/site-packages/dagster_cloud/storage/client.py", line 101, in _execute_retry
    raise requests.HTTPError("Unexpected GraphQL response", response=response)
dagster bot resolve to issue 1
still running
d
Hey Liezl - I filed an issue for this here: https://github.com/dagster-io/dagster/issues/8006 - agree that it should clean up the step resources. While we sort that out, terminating the lingering step pod yourself is probably the way to go (what would generally happen when a run fails like this is that any running step pods finish normally, but no new step pods are created, so the fact that the step pod is hanging is probably unrelated to the run failing)
By the way, this isn't the direct cause of the error, but if upgrading to 0.14.15 or later is an option: we made that executor poll a bit less aggressively in that release, which could help here (based on the stack trace, it looks like the call that failed was one of the ones that got dialed back)
l
gotcha, so
step resource = dagster step
step pod = underlying k8s step job
how do you know the step pod is hanging vs just chugging along normally?
d
In this case I just assumed, since it had been 70 hours. Unless that's expected for this job?
👍 1
l
anyways this run is designed to be long running (days) and continuously checkpointing, so I agree killing and restarting is ideal. I’ll try to upgrade!
d
Ahhh, OK - in that case it might not be hanging.
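Since the job in this thread checkpoints continuously, one rough way to tell "hanging" from "chugging along" is to look at how old the newest checkpoint is. The sketch below is only an illustration of that idea: the function name `looks_stalled`, the one-hour default threshold, and the timestamps are all hypothetical, and nothing here is a Dagster or Kubernetes API.

```python
from datetime import datetime, timedelta, timezone


def looks_stalled(last_checkpoint_at, now, max_gap=timedelta(hours=1)):
    """Return True when the newest checkpoint is older than max_gap.

    Hypothetical helper: how last_checkpoint_at is obtained (a file mtime,
    a database row, a log line) depends on how the op writes checkpoints.
    """
    return now - last_checkpoint_at > max_gap


# Made-up timestamps, purely for illustration.
now = datetime(2022, 5, 20, 12, 0, tzinfo=timezone.utc)
recent = now - timedelta(minutes=10)
old = now - timedelta(hours=3)

print(looks_stalled(recent, now))  # checkpoint 10 minutes ago
print(looks_stalled(old, now))     # checkpoint 3 hours ago
```

A pod whose last checkpoint is recent is probably still making progress, however long the run has been going overall; the threshold would need to match how often the op actually checkpoints.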
l
just want to make sure that I don’t have to explicitly tell dagster anywhere
that my job is > 70 hours
long running jobs are ok right?
d
Long running jobs are no problem, yeah. If this is the only op in the job, though, I'd recommend using the regular executor rather than the k8s_job_executor. There's a bit of overhead with the k8s_job_executor that you don't need.
🙏 1
👍 1