a
Hi 👋 i've been bumping into this error occasionally using dagster_celery_k8s:
An exception was thrown during execution that is likely a framework error, rather than an error in user code.
dagster.check.CheckError: Invariant failed. Description: Pipeline run dev_volumeclass_pipeline (0329a1a3-4013-4dfc-8f84-d9ee13492b9e) in state PipelineRunStatus.STARTED, expected NOT_STARTED or STARTING
any idea why this happens? (dagster 0.11.3)
on a possibly related note: is there a way to automatically retry the solid if this error happens?
j
Hi @Alessandro Marrella - usually this error comes up because the run worker restarted. Currently we don’t support retries at the pipeline level, so we don’t expect the run to already be started. Could you confirm that the pod is named dagster-run-… (the run worker) as opposed to dagster-job-… (the step worker)?
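For reference on the retry part of the question: failures raised inside a solid's own code can be retried by raising RetryRequested, which is available in 0.11. A minimal sketch (the flaky work is simulated; this would not catch the framework-level invariant error above, since that happens outside any solid):

import random

from dagster import RetryRequested, solid


@solid
def resilient_solid(context):
    try:
        # Stand-in for transient work that sometimes fails.
        if random.random() < 0.5:
            raise RuntimeError("transient failure")
        return "ok"
    except RuntimeError as exc:
        # Ask Dagster to re-run this solid up to 3 times, waiting 30 seconds between attempts.
        raise RetryRequested(max_retries=3, seconds_to_wait=30) from exc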
a
correct, the issue is in the run container:
❯ kubectl logs dagster-run-0329a1a3-4013-4dfc-8f84-d9ee13492b9e-hbzkm -n dagster
{"__class__": "ExecuteRunArgsLoadComplete"}
{"__class__": "DagsterEvent", "event_specific_data": {"__class__": "EngineEventData", "error": {"__class__": "SerializableErrorInfo", "cause": null, "cls_name": "CheckError", "message": "dagster.check.CheckError: Invariant failed. Description: Pipeline run dev_volumeclass_pipeline (0329a1a3-4013-4dfc-8f84-d9ee13492b9e) in state PipelineRunStatus.STARTED, expected NOT_STARTED or STARTING\n", "stack": ["  File \"/usr/local/lib/python3.7/site-packages/dagster/grpc/impl.py\", line 76, in core_execute_run\n    yield from execute_run_iterator(recon_pipeline, pipeline_run, instance)\n", "  File \"/usr/local/lib/python3.7/site-packages/dagster/core/execution/api.py\", line 80, in execute_run_iterator\n    pipeline_run.pipeline_name, pipeline_run.run_id, pipeline_run.status\n", "  File \"/usr/local/lib/python3.7/site-packages/dagster/check/__init__.py\", line 169, in invariant\n    CheckError(\"Invariant failed. Description: {desc}\".format(desc=desc))\n", "  File \"/usr/local/lib/python3.7/site-packages/future/utils/__init__.py\", line 446, in raise_with_traceback\n    raise exc.with_traceback(traceback)\n"]}, "marker_end": null, "marker_start": null, "metadata_entries": []}, "event_type_value": "ENGINE_EVENT", "logging_tags": {}, "message": "An exception was thrown during execution that is likely a framework error, rather than an error in user code.", "pid": null, "pipeline_name": "dev_volumeclass_pipeline", "solid_handle": null, "step_handle": null, "step_key": null, "step_kind_value": null}
{"__class__": "DagsterEvent", "event_specific_data": null, "event_type_value": "PIPELINE_FAILURE", "logging_tags": {}, "message": "This pipeline run has been marked as failed from outside the execution context.", "pid": null, "pipeline_name": "dev_volumeclass_pipeline", "solid_handle": null, "step_handle": null, "step_key": null, "step_kind_value": null}
looking at kubernetes events, it doesn't look like the run pod was ever restarted though
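For context, the invariant trips because the run worker boots and finds the run already recorded as STARTED in run storage. A small diagnostic sketch for checking what status the instance has recorded for a run, assuming DAGSTER_HOME points at the same Postgres-backed instance the workers use:

from dagster import DagsterInstance

# Assumes DAGSTER_HOME points at the same instance the run workers write to.
instance = DagsterInstance.get()
run = instance.get_run_by_id("0329a1a3-4013-4dfc-8f84-d9ee13492b9e")
if run is not None:
    # execute_run_iterator expects NOT_STARTED or STARTING here; STARTED means
    # some process already began executing this run.
    print(run.status)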
j
How is the run being launched?
(schedule, from dagit, etc)
a
From Dagit playground
This doesn't always happen, and in this case it happened after a few steps, not immediately
j
For debugging I’d be interested to see
kubectl describe pod dagster-run-0329a1a3-4013-4dfc-8f84-d9ee13492b9e-hbzkm -n dagster
If you’re ok sharing, please check that it doesn’t have any secrets etc
a
nothing particularly interesting there i think:
❯ kubectl describe pod dagster-run-0329a1a3-4013-4dfc-8f84-d9ee13492b9e-hbzkm -n dagster
Name:         dagster-run-0329a1a3-4013-4dfc-8f84-d9ee13492b9e-hbzkm
Namespace:    dagster
Priority:     0
Node:         REDACTED
Start Time:   Wed, 07 Apr 2021 16:16:58 +0100
Labels:       app.kubernetes.io/component=run_coordinator
              app.kubernetes.io/instance=dagster
              app.kubernetes.io/name=dagster
              app.kubernetes.io/part-of=dagster
              app.kubernetes.io/version=0.11.3
              controller-uid=4b5c1c1f-0ec1-48ea-a790-af6dc114be03
              job-name=dagster-run-0329a1a3-4013-4dfc-8f84-d9ee13492b9e
Annotations:  <none>
Status:       Succeeded
IP:           REDACTED
IPs:
  IP:           REDACTED
Controlled By:  Job/dagster-run-0329a1a3-4013-4dfc-8f84-d9ee13492b9e
Containers:
  dagster-run-0329a1a3-4013-4dfc-8f84-d9ee13492b9e:
    Container ID:  docker://c8437d6c8e3ab0d971f5d2ea0f43b2961d36faaaf4eabc8a07bb6611e0d6c5fb
    Image:         REDACTED
    Image ID:      REDACTED
    Port:          <none>
    Host Port:     <none>
    Args:
      dagster
      api
      execute_run
      {"__class__": "ExecuteRunArgs", "instance_ref": null, "pipeline_origin": {"__class__": "PipelinePythonOrigin", "pipeline_name": "dev_volumeclass_pipeline", "repository_origin": {"__class__": "RepositoryPythonOrigin", "code_pointer": {"__class__": "FileCodePointer", "fn_name": "repository", "python_file": "bin/repository.py", "working_directory": "/app"}, "container_image": "REDACTED", "executable_path": "/usr/local/bin/python"}}, "pipeline_run_id": "0329a1a3-4013-4dfc-8f84-d9ee13492b9e"}
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 07 Apr 2021 16:17:00 +0100
      Finished:     Wed, 07 Apr 2021 16:17:15 +0100
    Ready:          False
    Restart Count:  0
    Environment Variables from:
      dagster-anomaly-pipeline-user-env  ConfigMap  Optional: false
      ses-keys                           Secret     Optional: false
    Environment:
      DAGSTER_HOME:           /opt/dagster/dagster_home
      DAGSTER_PG_PASSWORD:    <set to the key 'postgresql-password' in secret 'dagster-postgresql-secret'>  Optional: false
      DAGSTER_CURRENT_IMAGE:  REDACTED
    Mounts:
      /opt/dagster/dagster_home/dagster.yaml from dagster-instance (rw,path="dagster.yaml")
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-5wdps (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  dagster-instance:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      dagster-instance
    Optional:  false
  default-token-5wdps:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-5wdps
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
j
Sorry, could I get kubectl describe job dagster-run-0329a1a3-4013-4dfc-8f84-d9ee13492b9e -n dagster as well?
a
❯ kubectl describe job dagster-run-0329a1a3-4013-4dfc-8f84-d9ee13492b9e -n dagster
Name:           dagster-run-0329a1a3-4013-4dfc-8f84-d9ee13492b9e
Namespace:      dagster
Selector:       controller-uid=4b5c1c1f-0ec1-48ea-a790-af6dc114be03
Labels:         app.kubernetes.io/component=run_coordinator
                app.kubernetes.io/instance=dagster
                app.kubernetes.io/name=dagster
                app.kubernetes.io/part-of=dagster
                app.kubernetes.io/version=0.11.3
Annotations:    <none>
Parallelism:    1
Completions:    1
Start Time:     Wed, 07 Apr 2021 15:51:34 +0100
Completed At:   Wed, 07 Apr 2021 16:17:16 +0100
Duration:       25m
Pods Statuses:  0 Running / 1 Succeeded / 0 Failed
Pod Template:
  Labels:  app.kubernetes.io/component=run_coordinator
           app.kubernetes.io/instance=dagster
           app.kubernetes.io/name=dagster
           app.kubernetes.io/part-of=dagster
           app.kubernetes.io/version=0.11.3
           controller-uid=4b5c1c1f-0ec1-48ea-a790-af6dc114be03
           job-name=dagster-run-0329a1a3-4013-4dfc-8f84-d9ee13492b9e
  Containers:
   dagster-run-0329a1a3-4013-4dfc-8f84-d9ee13492b9e:
    Image:      REDACTED
    Port:       <none>
    Host Port:  <none>
    Args:
      dagster
      api
      execute_run
      {"__class__": "ExecuteRunArgs", "instance_ref": null, "pipeline_origin": {"__class__": "PipelinePythonOrigin", "pipeline_name": "dev_volumeclass_pipeline", "repository_origin": {"__class__": "RepositoryPythonOrigin", "code_pointer": {"__class__": "FileCodePointer", "fn_name": "repository", "python_file": "bin/repository.py", "working_directory": "/app"}, "container_image": "REDACTED", "executable_path": "/usr/local/bin/python"}}, "pipeline_run_id": "0329a1a3-4013-4dfc-8f84-d9ee13492b9e"}
    Environment Variables from:
      dagster-anomaly-pipeline-user-env  ConfigMap  Optional: false
      ses-keys                           Secret     Optional: false
    Environment:
      DAGSTER_HOME:           /opt/dagster/dagster_home
      DAGSTER_PG_PASSWORD:    <set to the key 'postgresql-password' in secret 'dagster-postgresql-secret'>  Optional: false
      DAGSTER_CURRENT_IMAGE:  REDACTED
    Mounts:
      /opt/dagster/dagster_home/dagster.yaml from dagster-instance (rw,path="dagster.yaml")
  Volumes:
   dagster-instance:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      dagster-instance
    Optional:  false
Events:        <none>
thanks for looking at this!
j
"This doesn’t always happen, and in this case it happened after a few steps, not immediately"
This is strange because the run worker is the process spawning steps. Really seems like it would be a duplicate run worker, but your manifests are showing only a single pod
E.g. this came up with another user’s custom run launcher that could create multiple run workers for the same run https://dagster.slack.com/archives/CCCR6P2UR/p1606923838039400
Not sure how that’s happening here
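One way to sanity-check the duplicate-run-worker theory is to look at the run-level engine events recorded for the run and the pids attached to them; seeing more than one pid there would hint (though not prove) that two processes drove the same run. A rough sketch against the instance's event log, again assuming DAGSTER_HOME points at the shared instance:

from dagster import DagsterInstance

instance = DagsterInstance.get()
records = instance.all_logs("0329a1a3-4013-4dfc-8f84-d9ee13492b9e")

# Run-level engine events (no step_key) come from the launcher and the run worker.
pids = set()
for record in records:
    event = record.dagster_event
    if event is not None and event.event_type_value == "ENGINE_EVENT" and event.step_key is None:
        print(record.timestamp, event.pid, event.message)
        if event.pid is not None:
            pids.add(event.pid)

print("distinct run-level engine-event pids:", pids)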
a
in my case i'm using the vanilla celery-k8s one. is it possible it tries to spawn a new run again, possibly due to some "blip"? i'm using quite small nodes and i didn't constrain resources yet, so i wonder if this happens because of that
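On the resource point: one thing that might help rule out node pressure is putting explicit requests/limits on the run and step pods via the dagster-k8s/config tag, which (if I recall the 0.11 docs correctly) both the celery-k8s run launcher and executor honor. A sketch with placeholder solid/pipeline names and placeholder values:

from dagster import pipeline, solid

K8S_RESOURCES = {
    "dagster-k8s/config": {
        "container_config": {
            "resources": {
                # Placeholder values -- tune to what the nodes can actually provide.
                "requests": {"cpu": "250m", "memory": "256Mi"},
                "limits": {"cpu": "500m", "memory": "1Gi"},
            }
        }
    }
}


@solid(tags=K8S_RESOURCES)  # applied to this solid's step worker pod
def my_solid(context):
    return "ok"


@pipeline(tags=K8S_RESOURCES)  # applied to the run worker pod
def my_pipeline():
    my_solid()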