# ask-community
c
Any suggestion on how to handle "Failed to retry run"?
dagster._core.errors.DagsterInvariantViolationError: Unresolved ExecutionStep "load_destination[?]" is resolved by "compose_queries" which is not part of the current step selection

  File "/root/app/__pypackages__/3.10/lib/dagster/_grpc/impl.py", line 404, in get_external_execution_plan_snapshot
    create_execution_plan(
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/api.py", line 1005, in create_execution_plan
    return ExecutionPlan.build(
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/plan.py", line 1023, in build
    return plan_builder.build()
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/plan.py", line 238, in build
    plan = plan.build_subset_plan(
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/plan.py", line 814, in build_subset_plan
    executable_map, resolvable_map = _compute_step_maps(
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/plan.py", line 1449, in _compute_step_maps
    raise DagsterInvariantViolationError(
The error that the job failed on was this:
dagster._core.errors.DagsterExecutionInterruptedError

  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/execute_plan.py", line 224, in dagster_event_sequence_for_step
    for step_event in check.generator(step_events):
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/execute_step.py", line 319, in core_dagster_event_sequence_for_step
    for event_or_input_value in ensure_gen(
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/inputs.py", line 501, in load_input_object
    yield from _load_input_with_input_manager(input_manager, load_input_context)
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/inputs.py", line 857, in _load_input_with_input_manager
    with solid_execution_error_boundary(
  File "/usr/local/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/utils.py", line 41, in solid_execution_error_boundary
    with raise_execution_interrupts():
  File "/usr/local/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/errors.py", line 150, in raise_execution_interrupts
    with raise_interrupts_as(DagsterExecutionInterruptedError):
  File "/usr/local/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/root/app/__pypackages__/3.10/lib/dagster/_utils/interrupts.py", line 85, in raise_interrupts_as
    raise error_cls()
Pretty sure it was caused by GKE scaling down the node, but I haven't confirmed it yet.
Confirmed; here's the log immediately prior to the interrupted error:
{
  "insertId": "rh0qp1f7mk9do",
  "jsonPayload": {
    "involvedObject": {
      "uid": "56418917-e257-4a94-b4da-68ced239cfd1",
      "kind": "Node",
      "resourceVersion": "112831116",
      "name": "gk3-dagster-cloud-default-pool-71c0c832-zr8f",
      "apiVersion": "v1"
    },
    "source": {
      "component": "cluster-autoscaler"
    },
    "kind": "Event",
    "reportingComponent": "",
    "type": "Normal",
    "apiVersion": "v1",
    "reportingInstance": "",
    "metadata": {
      "resourceVersion": "276787",
      "name": "gk3-dagster-cloud-default-pool-71c0c832-zr8f.171231483531ccd8",
      "managedFields": [
        {
          "time": "2022-09-06T06:23:18Z",
          "fieldsV1": {
            "f:source": {
              "f:component": {}
            },
            "f:firstTimestamp": {},
            "f:message": {},
            "f:lastTimestamp": {},
            "f:involvedObject": {},
            "f:reason": {},
            "f:count": {},
            "f:type": {}
          },
          "manager": "cluster-autoscaler",
          "fieldsType": "FieldsV1",
          "operation": "Update",
          "apiVersion": "v1"
        }
      ],
      "namespace": "default",
      "creationTimestamp": "2022-09-06T06:23:18Z",
      "uid": "5a72ac05-3498-48b1-85c1-d74e87cfaece"
    },
    "reason": "ScaleDown",
    "eventTime": null,
    "message": "marked the node as toBeDeleted/unschedulable"
  },
  "resource": {
    "type": "k8s_node",
    "labels": {
      "node_name": "gk3-dagster-cloud-default-pool-71c0c832-zr8f",
      "project_id": "teamster-332318",
      "location": "us-central1",
      "cluster_name": "dagster-cloud"
    }
  },
  "timestamp": "2022-09-06T06:23:18Z",
  "severity": "INFO",
  "logName": "projects/teamster-332318/logs/events",
  "receiveTimestamp": "2022-09-06T06:23:23.664051690Z"
}
y
seems related to this known issue https://github.com/dagster-io/dagster/issues/8411
c
aha thanks for the link!
@yuhan I don't know if it's feasible for Dagster to handle, but the issue that kicked this all off was that GKE Autopilot decided to downscale the node the job was running on, and the pod got reassigned mid-run. Is there anything that can be enhanced on the executor side to handle this?
y
cc @daniel / @johann for better GKE expertise^
j
"GKE Autopilot decided to downscale the node that the job was running on"
At some point this is just a reality of running on k8s, and is why the retries are helpful. But you can set the annotation
"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
to avoid the K8s scheduler opting to stop your workloads: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler
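For illustration, here is a minimal sketch (an editorial addition, not from the thread) of one way to attach that annotation to Dagster-launched run pods, assuming the Kubernetes run launcher/executor is in use and honors the dagster-k8s/config tag; the op and job names are placeholders.
```python
from dagster import job, op


@op
def ping():
    return "ok"


# Hypothetical job that applies the safe-to-evict annotation to the run pod
# via the dagster-k8s/config tag (only relevant when running on Kubernetes).
@job(
    tags={
        "dagster-k8s/config": {
            "pod_template_spec_metadata": {
                "annotations": {
                    # ask the cluster-autoscaler not to evict the node running this pod
                    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
                }
            }
        }
    }
)
def annotated_job():
    ping()
```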
c
thanks for the tip @johann
I have retries set up, but with the multiprocess executor it seems to fail to retry dynamic jobs, as seen above. The k8s executor seems to be more resilient, so I switched over to that.
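For reference, a minimal sketch (editorial addition) of what switching to the k8s executor looks like, assuming dagster-k8s is installed; the op and job names are made up.
```python
from dagster import job, op
from dagster_k8s import k8s_job_executor


@op
def ping():
    return "ok"


# Each step runs in its own Kubernetes Job, so a single node eviction is less
# likely to take down every in-flight step along with the run worker process.
@job(executor_def=k8s_job_executor)
def k8s_steps_job():
    ping()
```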
j
Re: retries of dynamic jobs, one way to get around the limitation for now is to change to the ALL_STEPS retry strategy: https://docs.dagster.io/deployment/run-retries#retry-strategy
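A hedged sketch (editorial addition) of opting a job into that strategy via tags; it assumes run retries are already enabled for the deployment, and the dagster/max_retries value shown here is just an example.
```python
from dagster import job, op


@op
def ping():
    return "ok"


@job(
    tags={
        # retry the run from scratch instead of only the failed/unresolved steps,
        # which avoids re-resolving dynamic steps from a partial step selection
        "dagster/retry_strategy": "ALL_STEPS",
        # example cap on automatic retries for this job
        "dagster/max_retries": "3",
    }
)
def all_steps_job():
    ping()
```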
c
womp:
Error creating: admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Policy Controller rejected the request because it violates one or more policies: {"[denied by autogke-node-affinity-selector-limitation]":["Auto GKE disallows use of cluster-autoscaler.kubernetes.io/safe-to-evict=false annotation on workloads Requested by user: 'system:serviceaccount:kube-system:job-controller', groups: 'system:serviceaccounts,system:serviceaccounts:kube-system,system:authenticated'."]}
"<http://cluster-autoscaler.kubernetes.io/safe-to-evict|cluster-autoscaler.kubernetes.io/safe-to-evict>"
isn't allowed on Autopilot unfortunately
"dagster/retry_strategy": "ALL_STEPS"
appears to be the best solution for now