Leonardo Cotti
06/21/2021, 8:31 AM
We are running Dagster 0.11.12 in our main k8s cluster and we are having some problems with node shutdowns. (You can find the error logs in the thread.)
Basically, the k8s cluster sends a SIGTERM to the Dagster job pod and Dagster behaves in an unexpected way:
• Raises an ENGINE_EVENT exception when receiving the SIGTERM
• Exits with code 0, so the pod is marked as completed when the run actually failed
• Stops all further processing (e.g. the failure hook that would notify us never runs)
That means that if a pod receives a SIGTERM because of a node shutdown:
• The pipeline fails immediately, while telling k8s that the pod terminated successfully
• We don't get any notification from the failure hook
• There is no automatic retry (Dagster doesn't support re-running failed pipelines)
Unfortunately, we were not able to find a solution for this issue. Are we missing something? 🙂
STEP_FAILURE
dagster.core.errors.DagsterExecutionInterruptedError
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_plan.py", line 193, in _dagster_event_sequence_for_step
for step_event in check.generator(step_events):
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_step.py", line 305, in core_dagster_event_sequence_for_step
for user_event in check.generator(
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_step.py", line 64, in _step_output_error_checked_user_event_sequence
for user_event in user_event_sequence:
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_step.py", line 599, in _user_event_sequence_for_step_compute_fn
for event in iterate_with_context(
File "/usr/local/lib/python3.8/site-packages/dagster/utils/__init__.py", line 382, in iterate_with_context
next_output = next(iterator)
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/compute.py", line 126, in execute_core_compute
for step_output in _yield_compute_results(step_context, inputs, compute_fn):
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/compute.py", line 109, in _yield_compute_results
for event in user_event_generator:
File "/usr/local/lib/python3.8/site-packages/dagster_dbt/cli/solids.py", line 193, in dbt_cli_run
cli_output = execute_cli(
File "/usr/local/lib/python3.8/site-packages/dagster_dbt/cli/utils.py", line 65, in execute_cli
for raw_line in process.stdout:
File "/usr/local/lib/python3.8/site-packages/dagster/utils/interrupts.py", line 78, in _new_signal_handler
raise error_cls()
PIPELINE_FAILURE
dagster.core.errors.DagsterExecutionInterruptedError
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/api.py", line 762, in pipeline_execution_iterator
for event in pipeline_context.executor.execute(pipeline_context, execution_plan):
File "/usr/local/lib/python3.8/site-packages/dagster/core/executor/in_process.py", line 38, in execute
yield from iter(
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/api.py", line 841, in __iter__
yield from self.iterator(
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_plan.py", line 69, in inner_plan_execution_iterator
for step_event in check.generator(_dagster_event_sequence_for_step(step_context)):
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_plan.py", line 270, in _dagster_event_sequence_for_step
raise interrupt_error
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_plan.py", line 193, in _dagster_event_sequence_for_step
for step_event in check.generator(step_events):
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_step.py", line 305, in core_dagster_event_sequence_for_step
for user_event in check.generator(
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_step.py", line 64, in _step_output_error_checked_user_event_sequence
for user_event in user_event_sequence:
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/execute_step.py", line 599, in _user_event_sequence_for_step_compute_fn
for event in iterate_with_context(
File "/usr/local/lib/python3.8/site-packages/dagster/utils/__init__.py", line 382, in iterate_with_context
next_output = next(iterator)
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/compute.py", line 126, in execute_core_compute
for step_output in _yield_compute_results(step_context, inputs, compute_fn):
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/plan/compute.py", line 109, in _yield_compute_results
for event in user_event_generator:
File "/usr/local/lib/python3.8/site-packages/dagster_dbt/cli/solids.py", line 193, in dbt_cli_run
cli_output = execute_cli(
File "/usr/local/lib/python3.8/site-packages/dagster_dbt/cli/utils.py", line 65, in execute_cli
for raw_line in process.stdout:
File "/usr/local/lib/python3.8/site-packages/dagster/utils/interrupts.py", line 78, in _new_signal_handler
raise error_cls()
ENGINE_EVENT
An exception was thrown during execution that is likely a framework error, rather than an error in user code.
dagster.check.CheckError: Invariant failed. Description: Pipeline run dbt_pipeline (348a5c19-3973-4d1c-afe5-12f926e42bb4) in state PipelineRunStatus.FAILURE, expected NOT_STARTED or STARTING
Stack Trace:
File "/usr/local/lib/python3.8/site-packages/dagster/grpc/impl.py", line 86, in core_execute_run
yield from execute_run_iterator(recon_pipeline, pipeline_run, instance)
File "/usr/local/lib/python3.8/site-packages/dagster/core/execution/api.py", line 74, in execute_run_iterator
check.invariant(
File "/usr/local/lib/python3.8/site-packages/dagster/check/__init__.py", line 167, in invariant
raise CheckError(f"Invariant failed. Description: {desc}")
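The behaviour being asked for above (a failure notification plus a non-zero exit code when the pod is sent a SIGTERM) can be sketched in plain Python. This is only an illustration of the general technique and not Dagster's implementation: `InterruptedBySigterm`, `run_with_sigterm_exit_code` and the `on_failure` callback are hypothetical names, and the exit code 128 + 15 = 143 is the common shell convention for "terminated by SIGTERM", assumed here rather than taken from the thread.

```python
import signal


class InterruptedBySigterm(Exception):
    """Hypothetical stand-in for an interruption error raised on SIGTERM."""


def _raise_on_sigterm(signum, frame):
    # Python runs this handler in the main thread, so raising here
    # interrupts whatever the main thread is currently doing.
    raise InterruptedBySigterm()


def run_with_sigterm_exit_code(work, on_failure=None):
    """Run work() and return an exit code that reflects a SIGTERM interruption."""
    previous = signal.signal(signal.SIGTERM, _raise_on_sigterm)
    try:
        work()
        return 0
    except InterruptedBySigterm:
        # Notify first, then report failure through the exit code.
        if on_failure is not None:
            on_failure("interrupted by SIGTERM")
        return 128 + signal.SIGTERM
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGTERM, previous)


# Hypothetical usage in a pod entrypoint:
#     sys.exit(run_with_sigterm_exit_code(main, on_failure=send_alert))
```

With a non-zero exit code, Kubernetes would record the container as failed instead of completed. Any notification still has to finish within the pod's termination grace period, after which the container is killed.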
max
06/21/2021, 6:12 PM
Alessandro Marrella
06/22/2021, 8:42 AM
Leonardo Cotti
06/24/2021, 8:30 AM
Setting "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" in the pipeline tags (https://docs.dagster.io/deployment/guides/kubernetes/customizing-your-deployment#solid-or-pipeline-kubernetes-configuration) worked.
johann
06/16/2022, 5:44 PM
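For readers landing on this thread, the annotation reported as working above could be expressed roughly as follows. This is a sketch, assuming the `dagster-k8s/config` tag and its `pod_template_spec_metadata` key as described on the linked docs page; the helper `with_safe_to_evict_disabled` is a hypothetical name, and the block only builds the tags dictionary so that it runs without Dagster installed.

```python
import copy

SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def with_safe_to_evict_disabled(tags=None):
    """Return a copy of tags with the safe-to-evict pod annotation set to "false"."""
    merged = copy.deepcopy(tags) if tags else {}
    # Assumed layout: tag "dagster-k8s/config" -> pod_template_spec_metadata -> annotations
    k8s_config = merged.setdefault("dagster-k8s/config", {})
    metadata = k8s_config.setdefault("pod_template_spec_metadata", {})
    annotations = metadata.setdefault("annotations", {})
    annotations[SAFE_TO_EVICT] = "false"
    return merged


PIPELINE_TAGS = with_safe_to_evict_disabled()

# Hypothetical usage, assuming Dagster is installed:
#     @pipeline(tags=PIPELINE_TAGS)
#     def dbt_pipeline():
#         ...
```

This annotation is read by the cluster autoscaler, so it should only prevent scale-down evictions. Node shutdowns from other causes would presumably still deliver a SIGTERM to the pod.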