#deployment-kubernetes

Charles Lariviere

06/30/2021, 2:58 PM
Hey 👋 I’m trying to use the new `k8s_job_executor` but my pipeline fails right after the following event (full logs in thread):
ENGINE EVENT Starting execution with step handler K8sStepHandler
This pipeline works when using the default executor, and the k8s job is recorded as `Completed`. I’m not sure where to look since I’m not getting any stack trace. The execution config is only defined as follows. Do I need to pass more than that? Dagit did not raise any errors with this config, so I assumed it was correct, since I wanted to use the same settings as the User Deployment:
execution:
  k8s:

alex

06/30/2021, 3:02 PM
@johann

johann

06/30/2021, 3:05 PM
Hi @Charles Lariviere - could you share the Kubernetes log for the job’s pod (same name with an additional suffix)?

Charles Lariviere

06/30/2021, 3:10 PM
I’m seeing some `ErrorSource.FRAMEWORK_ERROR` at the end if that’s helpful

johann

06/30/2021, 3:39 PM
Hmm, it seems like there may have been an error in one of the step computes that’s not getting surfaced. Could you grab status or logs from any step pods that started? Those have names like `dagster-job-<id>`.

Charles Lariviere

06/30/2021, 3:54 PM
Hmm, I’m not finding any pods that start with `dagster-job-*`. I might be looking in the wrong place though? I’m running `kubectl get jobs`, but everything starts with `dagster-run-*`.
Could it be that the `dagster-job-<id>` errored out before it could get started?

johann

06/30/2021, 5:24 PM
It seems like the executor must have hit an error before launching any steps. Looking into some possibilities.
Hi @Charles Lariviere - could I ask for a full debug log of this run? That would include all the event logs and might help me see what’s going on. It can be downloaded from the button on the runs page.
> I’m not finding any pods that start with `dagster-job-*`
Could you also check for any jobs other than the `dagster-run-…`? It’s possible they’re hitting an error creating the pod. Some investigation today revealed a bug that might be causing us to swallow an error here.
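In case it helps with that check, here is a rough Python sketch for sorting job names into run jobs, step jobs, and anything else. The two prefixes are the ones mentioned in this thread; `split_jobs` and the sample names are hypothetical, made up purely for illustration.

```python
RUN_JOB_PREFIX = "dagster-run-"   # run jobs, as seen in this thread
STEP_JOB_PREFIX = "dagster-job-"  # step jobs, as named in this thread


def split_jobs(names):
    """Group Kubernetes job names into run jobs, step jobs, and everything else."""
    groups = {"run": [], "step": [], "other": []}
    for raw in names:
        # `kubectl get jobs -o name` prints names like "job.batch/<name>";
        # keep only the part after the last slash.
        name = raw.strip().rsplit("/", 1)[-1]
        if not name:
            continue  # skip blank lines
        if name.startswith(RUN_JOB_PREFIX):
            groups["run"].append(name)
        elif name.startswith(STEP_JOB_PREFIX):
            groups["step"].append(name)
        else:
            groups["other"].append(name)
    return groups


# Hypothetical sample input; real names come from your cluster.
sample = [
    "job.batch/dagster-run-1a2b",
    "job.batch/dagster-job-3c4d",
    "some-other-job",
]
groups = split_jobs(sample)
print(groups)
```

The lines could come from `kubectl get jobs -o name` in the namespace the runs are launched in. If the "step" group is empty, that fits the theory that the executor failed before creating any step jobs.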

Charles Lariviere

07/02/2021, 1:25 PM
Absolutely! And thanks for investigating this one 🙏 Here are the logs.
And I ended up finding the `dagster-job`! I’ve attached the logs. It looks like it’s missing an environment variable (i.e. `DAGSTER_K8S_INSTANCE_CONFIG_MAP`). Is this something we should set in the executor config?

johann

07/02/2021, 2:47 PM
Ah, thanks for finding this. It seems like this is the error that was swallowed. Is your values.yaml referring to `DAGSTER_K8S_INSTANCE_CONFIG_MAP`? It might help if you could share that section of your values. From what I suspect is going on, you will need to pass that env var through to the step jobs. You can do so using the `env_config_maps` config and pointing to either your own config map, or the one we create: `<dagster name>-user-env`.
It’s an easy mistake to make today, because the user code repository locations have their environment set differently from the compute jobs. Definitely an area we’d like to improve.
Sorry for running into both of these, thanks for surfacing them. The failed error reporting will be fixed by next week’s release. The convoluted environment configuration will take longer to scope out, but we might be able to mitigate it with better docs.
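For illustration, a minimal sketch of building that executor config as a plain Python dict. The key layout (`execution` → `k8s` → `config` → `env_config_maps` / `env_secrets`) is taken from the config shared in this thread and is not validated against Dagster here. The helper `k8s_executor_run_config` and the name `my-dagster` are hypothetical placeholders.

```python
import json


def k8s_executor_run_config(env_config_maps, env_secrets=()):
    """Build the `execution` section of run config for the k8s executor.

    Assumption: the nesting mirrors the config posted in this thread;
    nothing here is checked against Dagster's config schema.
    """
    config = {"env_config_maps": list(env_config_maps)}
    if env_secrets:
        # Only include the key when there is something to pass through.
        config["env_secrets"] = list(env_secrets)
    return {"execution": {"k8s": {"config": config}}}


# Hypothetical placeholder; substitute your own "<dagster name>".
dagster_name = "my-dagster"
run_config = k8s_executor_run_config([f"{dagster_name}-user-env"])
print(json.dumps(run_config, indent=2))
```

The idea is just to make the config map containing the missing environment variable visible to the step jobs; which config map actually holds it depends on your deployment.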

Charles Lariviere

07/02/2021, 6:56 PM
Ah ok! I believe our `dagster-yaml` is defined by Dagster’s Helm chart. Here’s our `values.yaml` for `runLauncher`:
runLauncher:
  type: K8sRunLauncher
  config:
    kubeconfigFile: ~
    envConfigMaps: []
    envSecrets:
      - name: <secrets>
`DAGSTER_K8S_INSTANCE_CONFIG_MAP` is not something we added to our config either. I believe that might be coming from Dagster’s Helm chart as well? We’re also not defining our own config map. I tried with the following config, but now some solids work while others fail without an error message (similar to before).
execution:
  k8s:
    config:
      env_config_maps:
      - dagster-pipeline-env
      env_secrets:
      - <secrets>
The logs for the job don’t show anything suspicious, but `kubectl describe` shows this as the last event:
Warning  BackoffLimitExceeded  80s   job-controller  Job has reached the specified backoff limit
The failing solids might be due to something unrelated. I’ll investigate more. Thanks for all your help @johann — really appreciate it 🙏