# ask-community
c
Any suggestion on how to handle "Failed to retry run"?
dagster._core.errors.DagsterInvariantViolationError: Unresolved ExecutionStep "load_destination[?]" is resolved by "compose_queries" which is not part of the current step selection

  File "/root/app/__pypackages__/3.10/lib/dagster/_grpc/impl.py", line 404, in get_external_execution_plan_snapshot
    create_execution_plan(
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/api.py", line 1005, in create_execution_plan
    return ExecutionPlan.build(
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/plan.py", line 1023, in build
    return plan_builder.build()
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/plan.py", line 238, in build
    plan = plan.build_subset_plan(
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/plan.py", line 814, in build_subset_plan
    executable_map, resolvable_map = _compute_step_maps(
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/plan.py", line 1449, in _compute_step_maps
    raise DagsterInvariantViolationError(
The error that the job failed on was this:
dagster._core.errors.DagsterExecutionInterruptedError

  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/execute_plan.py", line 224, in dagster_event_sequence_for_step
    for step_event in check.generator(step_events):
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/execute_step.py", line 319, in core_dagster_event_sequence_for_step
    for event_or_input_value in ensure_gen(
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/inputs.py", line 501, in load_input_object
    yield from _load_input_with_input_manager(input_manager, load_input_context)
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/inputs.py", line 857, in _load_input_with_input_manager
    with solid_execution_error_boundary(
  File "/usr/local/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/plan/utils.py", line 41, in solid_execution_error_boundary
    with raise_execution_interrupts():
  File "/usr/local/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/errors.py", line 150, in raise_execution_interrupts
    with raise_interrupts_as(DagsterExecutionInterruptedError):
  File "/usr/local/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/root/app/__pypackages__/3.10/lib/dagster/_utils/interrupts.py", line 85, in raise_interrupts_as
    raise error_cls()
Pretty sure it was caused by GKE scaling down the node, but I haven't confirmed it yet.
Confirmed; here's the log immediately prior to the interrupted error:
{
  "insertId": "rh0qp1f7mk9do",
  "jsonPayload": {
    "involvedObject": {
      "uid": "56418917-e257-4a94-b4da-68ced239cfd1",
      "kind": "Node",
      "resourceVersion": "112831116",
      "name": "gk3-dagster-cloud-default-pool-71c0c832-zr8f",
      "apiVersion": "v1"
    },
    "source": {
      "component": "cluster-autoscaler"
    },
    "kind": "Event",
    "reportingComponent": "",
    "type": "Normal",
    "apiVersion": "v1",
    "reportingInstance": "",
    "metadata": {
      "resourceVersion": "276787",
      "name": "gk3-dagster-cloud-default-pool-71c0c832-zr8f.171231483531ccd8",
      "managedFields": [
        {
          "time": "2022-09-06T06:23:18Z",
          "fieldsV1": {
            "f:source": {
              "f:component": {}
            },
            "f:firstTimestamp": {},
            "f:message": {},
            "f:lastTimestamp": {},
            "f:involvedObject": {},
            "f:reason": {},
            "f:count": {},
            "f:type": {}
          },
          "manager": "cluster-autoscaler",
          "fieldsType": "FieldsV1",
          "operation": "Update",
          "apiVersion": "v1"
        }
      ],
      "namespace": "default",
      "creationTimestamp": "2022-09-06T06:23:18Z",
      "uid": "5a72ac05-3498-48b1-85c1-d74e87cfaece"
    },
    "reason": "ScaleDown",
    "eventTime": null,
    "message": "marked the node as toBeDeleted/unschedulable"
  },
  "resource": {
    "type": "k8s_node",
    "labels": {
      "node_name": "gk3-dagster-cloud-default-pool-71c0c832-zr8f",
      "project_id": "teamster-332318",
      "location": "us-central1",
      "cluster_name": "dagster-cloud"
    }
  },
  "timestamp": "2022-09-06T06:23:18Z",
  "severity": "INFO",
  "logName": "projects/teamster-332318/logs/events",
  "receiveTimestamp": "2022-09-06T06:23:23.664051690Z"
}
y
seems related to this known issue https://github.com/dagster-io/dagster/issues/8411
c
aha thanks for the link!
@yuhan I don't know if it's feasible for Dagster to handle, but the issue that kicked this all off was that GKE Autopilot decided to downscale the node the job was running on, and the pod got reassigned mid-run. Is there anything that can be enhanced on the executor side to handle this?
y
cc @daniel / @johann for better GKE expertise^
j
"GKE Autopilot decided to downscale the node that the job was running on"
At some point this is just a reality of running on k8s, and is why the retries are helpful. But you can set the annotation
"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
to avoid the K8s scheduler opting to stop your workloads: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler
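For illustration, here is a minimal sketch (an editorial addition, not from the thread) of one way to attach that annotation to Dagster-launched run pods, assuming the Kubernetes run launcher/executor is in use and honors the dagster-k8s/config tag; the op and job names are placeholders.
```python
from dagster import job, op


@op
def ping():
    return "ok"


# Hypothetical job that applies the safe-to-evict annotation to the run pod
# via the dagster-k8s/config tag (only relevant when running on Kubernetes).
@job(
    tags={
        "dagster-k8s/config": {
            "pod_template_spec_metadata": {
                "annotations": {
                    # ask the cluster-autoscaler not to evict the node running this pod
                    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
                }
            }
        }
    }
)
def annotated_job():
    ping()
```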
c
thanks for the tip @johann
I have retries set up, but with the multiprocess executor it seems to fail to retry dynamic jobs, as seen above. The k8s executor seems to be more resilient, so I switched over to that.
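For reference, a minimal sketch (editorial addition) of what switching to the k8s executor looks like, assuming dagster-k8s is installed; the op and job names are made up.
```python
from dagster import job, op
from dagster_k8s import k8s_job_executor


@op
def ping():
    return "ok"


# Each step runs in its own Kubernetes Job, so a single node eviction is less
# likely to take down every in-flight step along with the run worker process.
@job(executor_def=k8s_job_executor)
def k8s_steps_job():
    ping()
```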
j
Re: retries of dynamic jobs, one way to get around the limitation for now is to change to the ALL_STEPS retry strategy: https://docs.dagster.io/deployment/run-retries#retry-strategy
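A hedged sketch (editorial addition) of opting a job into that strategy via tags; it assumes run retries are already enabled for the deployment, and the dagster/max_retries value shown here is just an example.
```python
from dagster import job, op


@op
def ping():
    return "ok"


@job(
    tags={
        # retry the run from scratch instead of only the failed/unresolved steps,
        # which avoids re-resolving dynamic steps from a partial step selection
        "dagster/retry_strategy": "ALL_STEPS",
        # example cap on automatic retries for this job
        "dagster/max_retries": "3",
    }
)
def all_steps_job():
    ping()
```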
c
womp:
Error creating: admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Policy Controller rejected the request because it violates one or more policies: {"[denied by autogke-node-affinity-selector-limitation]":["Auto GKE disallows use of cluster-autoscaler.kubernetes.io/safe-to-evict=false annotation on workloads Requested by user: 'system:serviceaccount:kube-system:job-controller', groups: 'system:serviceaccounts,system:serviceaccounts:kube-system,system:authenticated'."]}
"<http://cluster-autoscaler.kubernetes.io/safe-to-evict|cluster-autoscaler.kubernetes.io/safe-to-evict>"
isn't allowed on Autopilot unfortunately
"dagster/retry_strategy": "ALL_STEPS"
appears to be the best solution for now