# ask-community

Samuel Stütz

04/15/2022, 1:08 PM
Hi, I tried moving my locally tested pipelines to k8s. However, I am hitting failures with no debug output explaining why. How can I get more failure info?
Copy code
dagster-run-ac4a8856-0cca-477f-bb84-3dc2ea97c19d-92h9d dagster dagster.core.errors.DagsterExecutionInterruptedError
Stack Trace:
dagster-run-ac4a8856-0cca-477f-bb84-3dc2ea97c19d-92h9d dagster   File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/api.py", line 785, in pipeline_execution_iterator
There is no extra config on the asset step. It should just run with all defaults: same image, multiprocess executor. Service accounts and images all work; I got some basic op steps running, and the user-code examples execute. One issue is that all the asset jobs print ERRORs in the config pane of the Launchpad (and the View Config button). This may have to do with setting things up via gcs_parquet_asset_cached_io_manager.configured(…); does that only work locally? I also don't see where I can put the dagster-k8s/config tags on software-defined assets so I can have ops run in differently sized containers.

sandy

04/15/2022, 4:22 PM
@johann might be able to help out with the debug output on failures question.
configured should work non-locally. If you'd be able to share the errors you're seeing and the code that you're trying to run, I might be able to spot what's going on?
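For context, a minimal sketch of binding config to an IO manager with configured. The bucket/prefix values are made up, and gcs_pickle_io_manager from dagster-gcp stands in for the custom gcs_parquet_asset_cached_io_manager in the thread:
Copy code
from dagster import job, op
from dagster_gcp.gcs.io_manager import gcs_pickle_io_manager
from dagster_gcp.gcs.resources import gcs_resource

# configured() binds config at definition time and is environment-independent,
# so the same definition should behave identically locally and on k8s.
# Bucket and prefix here are placeholders.
cached_io_manager = gcs_pickle_io_manager.configured(
    {"gcs_bucket": "mybucket", "gcs_prefix": "dagster-io"}
)

@op
def do_something():
    return 1

@job(resource_defs={"io_manager": cached_io_manager, "gcs": gcs_resource})
def my_job():
    do_something()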
> I also don’t see where I can put the tags from dagster-k8s/config on software-defined assets so I can have ops run in differently sized containers.
That's currently not possible, but I'll post a PR to make it possible.
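For plain ops inside a @job, the tag already works; a sketch (resource numbers are illustrative, and per-step container sizing takes effect under a step-isolating executor such as k8s_job_executor):
Copy code
from dagster import op

# The dagster-k8s/config tag lets dagster-k8s apply raw k8s config
# to the container running this op's step.
@op(
    tags={
        "dagster-k8s/config": {
            "container_config": {
                "resources": {
                    "requests": {"cpu": "500m", "memory": "1Gi"},
                    "limits": {"cpu": "2", "memory": "4Gi"},
                }
            }
        }
    }
)
def big_op():
    ...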

johann

04/15/2022, 4:55 PM
kubectl describe job <job name from a dagit event>
might reveal why your run was interrupted. A likely cause is that the node running it went down.
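Since the Job and its pod can disappear quickly, pulling the namespace events may help too; a sketch, with the namespace and names to be filled in:
Copy code
kubectl -n <namespace> describe job <job name from the dagit event>
kubectl -n <namespace> get events --sort-by=.lastTimestamp   # events linger after the pod is gone (about an hour by default)
kubectl -n <namespace> describe pod <run pod name>           # Events section shows kills/evictions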

Samuel Stütz

04/19/2022, 7:58 AM
I did run describe job, which is a bit tricky since the Job gets cleaned up so fast.
Copy code
k describe job $(k get jobs -A | awk '{print $2}' | tail -n 1) -n argocd
...
Events:
  Type     Reason                Age   From            Message
  ----     ------                ----  ----            -------
  Normal   SuccessfulCreate      10s   job-controller  Created pod: dagster-run-a914f111-b42e-408b-ab40-b6214c607260-49qvj
  Warning  BackoffLimitExceeded  2s    job-controller  Job has reached the specified backoff limit
I do not understand the reason for the backoff, though. I noticed the same problem, or at least a very similar one, with the step_isolated_job:
Copy code
dagster.core.errors.DagsterExecutionInterruptedError: Execution was interrupted before completing the execution plan.
Steps pending processing: odict_keys(['count_letters'])
Steps pending action: ['multiply_the_word']

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/core/execution/api.py", line 785, in pipeline_execution_iterator
    for event in pipeline_context.executor.execute(pipeline_context, execution_plan):
  File "/usr/local/lib/python3.7/site-packages/dagster/core/executor/step_delegating/step_delegating_executor.py", line 234, in execute
    time.sleep(self._sleep_seconds)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/execution/plan/active.py", line 127, in __exit__
    f"Execution was interrupted before completing the execution plan. {state_str}"

johann

04/19/2022, 2:42 PM
Pod events may have the specific reason for the interrupt. Essentially your cluster is stopping the K8s Job for whatever reason. Ideally you should be able to avoid that happening by increasing resources/etc., but we also have some support for working around these failures: https://docs.dagster.io/deployment/run-monitoring
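If the interrupts can't be fully eliminated, run monitoring can detect and optionally resume crashed runs; a sketch of the Helm values, assuming the chart keys of this era (check the linked docs for the current schema):
Copy code
dagsterDaemon:
  runMonitoring:
    enabled: true
    startTimeoutSeconds: 300     # fail runs whose pods never start
    maxResumeRunAttempts: 3      # resume runs whose pods crash mid-flight
    pollIntervalSeconds: 120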

Samuel Stütz

04/19/2022, 3:05 PM
I found a possible issue now:
Copy code
Error: object "dagster"/"dagster-pipeline-env" not registered
Error: cannot find volume "dagster-instance" to mount into container "dagster"
Both of these exist as ConfigMaps in the same namespace, "dagster", except that DAGSTER_K8S_PIPELINE_RUN_IMAGE is set to the user-code example, which could be an issue, but it seems to fail before that.
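A couple of quick sanity checks that the launcher's expected resources really are visible in the run namespace (namespace name assumed to be dagster here):
Copy code
kubectl -n dagster get configmap dagster-pipeline-env dagster-instance
kubectl -n dagster describe pod <run pod name>   # Events section shows mount/registration failures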

johann

04/19/2022, 5:00 PM
Not sure I follow. Are you using the provided helm chart? And just running in a different namespace hence needing to add those volumes?

Samuel Stütz

04/20/2022, 7:18 AM
Below are my helm values. The only things outside the helm chart are the Google service account setup and the ingress. I am not sure about pipelineRun; it seems that if this has a purpose, then more than one repo only really works with separate helm charts, which I want to move to eventually. Update: I have my logs now and am checking whether I can find anything useful to figure this out. There definitely is randomness: one run succeeds normally, the next "re-execute all" does not even start (the logs say
Unable to attach or mount volumes: unmounted volumes=[dagster-instance dagster-token-ff9pt], unattached volumes=[dagster-instance dagster-token-ff9pt]: timed out waiting for the condition
), and the third fails with Interrupted.
Copy code
dagster-run-f46066d0-5666-49db-9863-57db01070c9c-rv7k8 "Stopping container dagster"
...reason: "Killing"
# will see if I can find the actual reason
Maybe Celery plus run monitoring would be more reliable, but I still have to understand the issue at hand here first.
Copy code
postgresql:
  enabled: false
  postgresqlHost: ...
  postgresqlUsername: ...
  postgresqlPassword: ...
  postgresqlDatabase: ...
redis:
  enabled: true
serviceAccount:
  create: true
  name: "dagster"
  annotations: 
    iam.gke.io/gcp-service-account: workflow@myproject.iam.gserviceaccount.com
pipelineRun:
  image:
    repository: "europe-docker.pkg.dev/myproject/dagster-user-code/example"
    tag: master
    pullPolicy: Always
computeLogManager:
  type: GCSComputeLogManager
  config:
    gcsComputeLogManager:
      bucket: mybucket
      prefix: logs/dagster
      # localDir: ~
      # jsonCredentialsEnvvar: ~
dagster-user-deployments:
  enabled: true
  enableSubchart: true
  deployments:
    - name: "example-code"
      image:
        repository: "europe-docker.pkg.dev/myproject/dagster-user-code/example"
        tag: master
        pullPolicy: Always
      dagsterApiGrpcArgs:
        - "--python-file"
        - "sample_repo.py"
      port: 3030
      nodeSelector:
        servicelevel: service
      startupProbe:
        enabled: false
    - name: "forecast-code"
      image:
        repository: "europe-docker.pkg.dev/myproject/dagster-user-code/forecast"
        tag: master
        pullPolicy: Always
      dagsterApiGrpcArgs:
        - "--python-file"
        - "forecast/repository.py"
      port: 3030
      nodeSelector:
        servicelevel: service
      readinessProbe:
        periodSeconds: 20
        timeoutSeconds: 3
        successThreshold: 1
        failureThreshold: 3
      startupProbe:
        enabled: false
    - name: "dagster-example"
      image:
        repository: "<http://docker.io/dagster/user-code-example|docker.io/dagster/user-code-example>"
        tag: ~
        pullPolicy: Always
      dagsterApiGrpcArgs:
        - "--python-file"
        - "/example_project/example_repo/repo.py"
      port: 3030
      nodeSelector:
        servicelevel: service
      readinessProbe:
        periodSeconds: 20
        timeoutSeconds: 3
        successThreshold: 1
        failureThreshold: 3
      startupProbe:
        enabled: false
I did find my solution after looking in the wrong place for too long: ArgoCD was messing with my namespace and very quickly cleaning up any resources Dagster attempted to create. Disabling Auto Sync in ArgoCD made it work properly again.
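For anyone hitting the same thing: rather than disabling auto-sync entirely, one option (a sketch against the standard ArgoCD Application schema) may be to keep auto-sync but turn off pruning, so ArgoCD stops deleting the run Jobs and pods that Dagster creates outside of git:
Copy code
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: dagster
spec:
  # ...source/destination as before...
  syncPolicy:
    automated:
      prune: false     # don't delete live resources that aren't in git
      selfHeal: false  # don't revert changes made outside of git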

johann

04/21/2022, 2:36 PM
Great!