#deployment-kubernetes

Leonardo Cotti

06/21/2021, 8:31 AM
Hi Team! We are running Dagster 0.11.12 in our main k8s cluster and we are having some problems with node shutdowns. (You can find the error logs in the thread) Basically, the k8s cluster sends a SIGTERM to the Dagster job pod and Dagster behaves in an unexpected way:
• Raises an ENGINE_EVENT exception when receiving the SIGTERM
• Returns 0 to the pod marking it as completed when it actually failed
• Stops any other process (e.g. running a failure hook to get notified)
That means that if a pod is sigtermed because of a node shutdown:
• The pipeline immediately fails telling k8s that the pod terminated successfully
• We don't get any notification from the failure hook
• There is no automatic retry (Dagster doesn't support re-running failed pipelines)
Unfortunately we were not able to find a solution for this issue, are we missing something? 🙂
Error logs
STEP_FAILURE
dagster.core.errors.DagsterExecutionInterruptedError
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_plan.py", line 193, in _dagster_event_sequence_for_step
    for step_event in check.generator(step_events):
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_step.py", line 305, in core_dagster_event_sequence_for_step
    for user_event in check.generator(
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_step.py", line 64, in _step_output_error_checked_user_event_sequence
    for user_event in user_event_sequence:
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_step.py", line 599, in _user_event_sequence_for_step_compute_fn
    for event in iterate_with_context(
  File "/usr/local/lib/python3.8/site-packages/dagster/utils/__init__.py", line 382, in iterate_with_context
    next_output = next(iterator)
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/compute.py", line 126, in execute_core_compute
    for step_output in _yield_compute_results(step_context, inputs, compute_fn):
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/compute.py", line 109, in _yield_compute_results
    for event in user_event_generator:
  File "/usr/local/lib/python3.8/site-packages/dagster_dbt/cli/solids.py", line 193, in dbt_cli_run
    cli_output = execute_cli(
  File "/usr/local/lib/python3.8/site-packages/dagster_dbt/cli/utils.py", line 65, in execute_cli
    for raw_line in process.stdout:
  File "/usr/local/lib/python3.8/site-packages/dagster/utils/interrupts.py", line 78, in _new_signal_handler
    raise error_cls()
PIPELINE_FAILURE
dagster.core.errors.DagsterExecutionInterruptedError
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/api.py", line 762, in pipeline_execution_iterator
    for event in pipeline_context.executor.execute(pipeline_context, execution_plan):
  File "/usr/local/lib/python3.8/site-packages/dagster/core/executor/in_process.py", line 38, in execute
    yield from iter(
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/api.py", line 841, in __iter__
    yield from self.iterator(
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_plan.py", line 69, in inner_plan_execution_iterator
    for step_event in check.generator(_dagster_event_sequence_for_step(step_context)):
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_plan.py", line 270, in _dagster_event_sequence_for_step
    raise interrupt_error
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_plan.py", line 193, in _dagster_event_sequence_for_step
    for step_event in check.generator(step_events):
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_step.py", line 305, in core_dagster_event_sequence_for_step
    for user_event in check.generator(
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_step.py", line 64, in _step_output_error_checked_user_event_sequence
    for user_event in user_event_sequence:
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_step.py", line 599, in _user_event_sequence_for_step_compute_fn
    for event in iterate_with_context(
  File "/usr/local/lib/python3.8/site-packages/dagster/utils/__init__.py", line 382, in iterate_with_context
    next_output = next(iterator)
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/compute.py", line 126, in execute_core_compute
    for step_output in _yield_compute_results(step_context, inputs, compute_fn):
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/compute.py", line 109, in _yield_compute_results
    for event in user_event_generator:
  File "/usr/local/lib/python3.8/site-packages/dagster_dbt/cli/solids.py", line 193, in dbt_cli_run
    cli_output = execute_cli(
  File "/usr/local/lib/python3.8/site-packages/dagster_dbt/cli/utils.py", line 65, in execute_cli
    for raw_line in process.stdout:
  File "/usr/local/lib/python3.8/site-packages/dagster/utils/interrupts.py", line 78, in _new_signal_handler
    raise error_cls()
ENGINE_EVENT
An exception was thrown during execution that is likely a framework error, rather than an error in user code.
dagster.check.CheckError: Invariant failed. Description: Pipeline run dbt_pipeline (348a5c19-3973-4d1c-afe5-12f926e42bb4) in state PipelineRunStatus.FAILURE, expected NOT_STARTED or STARTING

Stack Trace:
  File "/usr/local/lib/python3.8/site-packages/dagster/grpc/impl.py", line 86, in core_execute_run
    yield from execute_run_iterator(recon_pipeline, pipeline_run, instance)
  File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/api.py", line 74, in execute_run_iterator
    check.invariant(
  File "/usr/local/lib/python3.8/site-packages/dagster/check/__init__.py", line 167, in invariant
    raise CheckError(f"Invariant failed. Description: {desc}")
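
Editor's note: the behaviour asked for in the message above (on SIGTERM, run the failure notification and exit non-zero so Kubernetes does not record the pod as completed) can be sketched generically as below. This is not Dagster's implementation. `Interrupted`, `run_with_sigterm_exit_code`, and the `on_failure` callback are hypothetical names used only for illustration, and 143 is the conventional 128 + SIGTERM exit code rather than anything Dagster defines.

```python
import os
import signal
import time


class Interrupted(Exception):
    """Hypothetical exception raised in the main thread when SIGTERM arrives."""


def _raise_interrupted(signum, frame):
    # Turn the signal into an exception so normal try/except cleanup can run.
    raise Interrupted(f"received signal {signum}")


def run_with_sigterm_exit_code(work, on_failure):
    """Run work(); on SIGTERM call on_failure() and return a non-zero exit code."""
    previous = signal.signal(signal.SIGTERM, _raise_interrupted)
    try:
        work()
        return 0
    except Interrupted:
        on_failure()  # e.g. send the notification a failure hook would send
        return 128 + int(signal.SIGTERM)  # 143: tells the orchestrator the pod failed
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGTERM, previous)


if __name__ == "__main__":
    def fake_work():
        # Simulate a node shutdown by sending SIGTERM to ourselves mid-task.
        os.kill(os.getpid(), signal.SIGTERM)
        time.sleep(5)

    raise SystemExit(run_with_sigterm_exit_code(fake_work, lambda: print("run failed")))
```

In a container entrypoint, the returned code would be passed to `sys.exit` so that the Kubernetes Job sees a failed pod instead of a successful one.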

max

06/21/2021, 6:12 PM
@prha

Alessandro Marrella

06/22/2021, 8:42 AM
Hi @Leonardo Cotti, not sure if this entirely helps your case, and SIGTERM should probably be handled more gracefully, but as a workaround, if you set PodDisruptionBudgets for run pods, the node will wait to shut down until your job is completed (unless it's a spot / pre-emptible instance). Example: https://github.com/dagster-io/dagster/discussions/4295 (this is what I do in my cluster at least)
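
Editor's note: a minimal sketch of what such a PodDisruptionBudget could look like, built as a Python dict and printed as JSON (which `kubectl apply -f` accepts). It is not copied from the linked discussion. The budget name and the label selector are hypothetical and must be replaced with labels that actually appear on your run pods. `policy/v1` assumes Kubernetes 1.21 or newer; older clusters would use `policy/v1beta1`.

```python
import json


def run_pod_disruption_budget(name, match_labels, api_version="policy/v1"):
    """Build a PodDisruptionBudget manifest that blocks voluntary evictions."""
    return {
        "apiVersion": api_version,
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # With maxUnavailable 0, a node drain cannot evict matching pods.
            "maxUnavailable": 0,
            "selector": {"matchLabels": dict(match_labels)},
        },
    }


if __name__ == "__main__":
    manifest = run_pod_disruption_budget(
        name="dagster-run-pdb",  # hypothetical name
        match_labels={"example.com/dagster-run": "true"},  # hypothetical label
    )
    print(json.dumps(manifest, indent=2))
```

A budget like this only covers voluntary disruptions such as a node drain. It does not protect against spot or pre-emptible instance reclamation, as noted in the message above.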

Leonardo Cotti

06/24/2021, 8:30 AM
I had a catch up with our internal team that managed the k8s cluster and we found that using the annotation "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" in the pipeline tags (https://docs.dagster.io/deployment/guides/kubernetes/customizing-your-deployment#solid-or-pipeline-kubernetes-configuration) worked.
They also mentioned that if there is a force shutdown of the node, the annotation won't work and we might get the same issue. Although it is a rare event, is there any plan to support pod eviction? The main two problems are:
• No automatic retry of the pipeline
• The failure hooks don't work (no notifications)
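
Editor's note: a sketch of how the annotation mentioned above might be expressed as pipeline tags. The `dagster-k8s/config` tag and its `pod_template_spec_metadata` key follow the linked customization guide as I understand it, so check them against the docs for your Dagster version. The helper `no_evict_tags` is a hypothetical name, and the decorator usage is left in comments so the snippet runs without Dagster installed.

```python
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def no_evict_tags():
    """Tags asking the cluster autoscaler not to evict the run pod."""
    return {
        "dagster-k8s/config": {
            "pod_template_spec_metadata": {
                # Kubernetes annotation values must be strings, hence "false".
                "annotations": {SAFE_TO_EVICT: "false"},
            }
        }
    }


# Assumed usage with the Dagster 0.11-era API (not executed here):
#
#   from dagster import pipeline
#
#   @pipeline(tags=no_evict_tags())
#   def dbt_pipeline():
#       ...

if __name__ == "__main__":
    print(no_evict_tags())
```

As the message above points out, this annotation only influences the cluster autoscaler. A forced node shutdown would still terminate the pod.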

johann

06/16/2022, 5:44 PM
Hi Leonardo, late followup here but we just released automatic retries https://docs.dagster.io/deployment/run-retries