# deployment-kubernetes
k
Hello Guys, I am running a dagster pipeline on GKE ( Kubernetes ), and am hitting a problem where a "Keyboard Interrupt" event is being sent to random K8S Jobs.
Context:
• My pipeline ( dagster-version: 0.9.1 ) uses K8sRunLauncher; a script triggers 100 pipeline-run jobs against the dagit server using graphql.
• The pipeline has 4 solids; the first one takes a long time ( about an hour or more ).
Problem:
• If the script deploys all 100 jobs, some of them ( 10-15 jobs ) hit a "Keyboard Interrupt" event, followed by a "dagster.check.CheckError: Invariant failed." error.
• This error always happens while solid_1 is running a very long process step ( this step takes 40-50 minutes ).
• If I pick out the jobs that errored and run them separately, they do not hit this issue.
• I was on 0.8.6 and only saw the "dagster.check.CheckError: Invariant failed." error, so I switched to 0.9.1 and now get the "Keyboard Interrupt" event on top of the invariant failure.
Opinion:
• In 0.8.6, judging from the "dagster.check.CheckError: Invariant failed." stack trace, dagit seems to be restarting the job ( by calling the graphql ExecuteRunInProcessMutation ): the trace shows "DauphinExecuteRunInProcessMutation" being called while the pipeline has already started.
• In 0.9.1 there is the "Keyboard Interrupt" error, but I did not trigger any cancellation request. I suspect dagit & dagster-graphql have something to do with this, but am not sure how.
I am trying to figure out why this happens; it is hard to reproduce locally because I need to run lots of jobs and the error comes up randomly.
1. Has any Dagsterati ever met this issue before ?
2. Is there any mechanism in Dagit where it programmatically sends the graphql "ExecuteRunInProcessMutation" or a "Keyboard Interrupt" signal to a K8S Job without direct user interaction ?
Conclusion: the K8S Cluster Autoscaler was rescheduling the pod => re-triggering the pipeline runs.
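For context, a trigger script along these lines was used; this is only a minimal sketch, not the actual script — the endpoint URL, mutation text, and variable shape here are assumptions:

```python
import json
import urllib.request

# Hypothetical dagit endpoint; the real deployment's URL differs.
DAGIT_GRAPHQL_URL = "http://dagit.example.internal:3000/graphql"

# Hypothetical mutation text sketching a run-launch request against dagit.
LAUNCH_MUTATION = """
mutation LaunchRun($executionParams: ExecutionParams!) {
  launchPipelineExecution(executionParams: $executionParams) {
    __typename
  }
}
"""

def build_payload(pipeline_name, run_config):
    """Build the JSON body for one pipeline-run request."""
    return {
        "query": LAUNCH_MUTATION,
        "variables": {
            "executionParams": {
                "selector": {"name": pipeline_name},
                "runConfigData": run_config,
                "mode": "default",
            }
        },
    }

def trigger_runs(run_configs):
    """POST one GraphQL request per run config (100 of them in the report above)."""
    for cfg in run_configs:
        body = json.dumps(build_payload("imagery_pipeline", cfg)).encode()
        req = urllib.request.Request(
            DAGIT_GRAPHQL_URL,
            data=body,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # fire-and-forget; a real script may retry
```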
Error shown on dagit
Stack trace for Keyboard Interrupt error.
KeyboardInterrupt
  File "/usr/local/lib/python3.6/dist-packages/dagster/core/execution/plan/execute_plan.py", line 210, in _dagster_event_sequence_for_step
    for step_event in check.generator(step_events):
  File "/usr/local/lib/python3.6/dist-packages/dagster/core/execution/plan/execute_step.py", line 269, in core_dagster_event_sequence_for_step
    _step_output_error_checked_user_event_sequence(step_context, user_event_sequence)
  File "/usr/local/lib/python3.6/dist-packages/dagster/core/execution/plan/execute_step.py", line 53, in _step_output_error_checked_user_event_sequence
    for user_event in user_event_sequence:
  File "/usr/local/lib/python3.6/dist-packages/dagster/core/execution/plan/execute_step.py", line 399, in _user_event_sequence_for_step_compute_fn
    for event in gen:
  File "/usr/local/lib/python3.6/dist-packages/dagster/core/execution/plan/compute.py", line 102, in _execute_core_compute
    for step_output in _yield_compute_results(compute_context, inputs, compute_fn):
  File "/usr/local/lib/python3.6/dist-packages/dagster/core/execution/plan/compute.py", line 73, in _yield_compute_results
    for event in user_event_sequence:
  File "/usr/local/lib/python3.6/dist-packages/dagster/core/definitions/decorators/solid.py", line 220, in compute
    result = fn(context, **kwargs)
  File "/imagery_pipeline_source/solids/terravion_solids/solid_1_register_image/main.py", line 91, in register_image
    local_result_path = registrate(context, local_src_path, local_ref_image_path);
  File "/imagery_pipeline_source/solids/terravion_solids/solid_1_register_image/algo/__init__.py", line 191, in registrate
    date_register_if_not_exists(ctx, input_src_path, input_ref_path, day_reg_path)
  File "/imagery_pipeline_source/solids/terravion_solids/solid_1_register_image/algo/__init__.py", line 112, in date_register_if_not_exists
    register_date_fiji(source_path=source)
  File "/imagery_pipeline_source/solids/terravion_solids/solid_1_register_image/lib/__init__.py", line 63, in register_using_fiji
    run_sub_cmd(cmd_template)
  File "/imagery_pipeline_source/solids/terravion_solids/solid_1_register_image/lib/__init__.py", line 157, in run_sub_cmd
    logging.info(subprocess.check_output(cmd, shell=True, stderr=subprocess.STDOUT))
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 425, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib/python3.6/subprocess.py", line 850, in communicate
    stdout = self.stdout.read()
c
Hi Ken, thanks for reporting this! The only part of the dagit system that should be raising keyboard interrupts is when a user terminates a run. From reading through your description of when this occurs, I think it might be possible that the k8s cluster runs out of resources and is forced to spin down jobs?
This can also happen if the k8s container running the job exceeds its resource limits
k
Hello Cat, Thank you for the insight! I was also thinking it could be a resource-allocation issue. But when I checked the GKE Dashboard, the Job was barely reaching the resource-request threshold:
• CPU -> up to 1.2 out of 2
• Memory -> up to 4.9 Gb out of 10 Gb
The pod also completed with exit code 0; it did not show any OOM failure. No alerts were raised during the runs. My cluster is using node auto-provisioning with 12 Tb of memory + 1000 CPUs; the maximum resource usage in this case should reach up to 250 CPUs + 1.5 Tb of memory.
But it is possible that I missed some details. Will double check the resource limit.
c
Hmm, when you inspect the pod that corresponds to the failing solid, do you see any logs / errors?
k
The errored pod looks normal under
kubectl describe pod ...
; its logs and status are the same as the successful ones 🤔 I will submit a support ticket to the Google Support GKE team and see if there is any piece of configuration/quota-limit that I missed.
c
oh btw, to explain the difference in behavior between 0.8.6 -> 0.9.1: we remap SIGTERM to use the SIGINT handler, which is why there is a Keyboard Interrupt where there wasn’t before
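A minimal sketch of that remapping (not dagster’s actual code): installing Python’s default SIGINT handler for SIGTERM makes a plain pod termination surface as a `KeyboardInterrupt`:

```python
import os
import signal

# Route SIGTERM through Python's default SIGINT behavior, so a k8s pod
# termination (SIGTERM) raises KeyboardInterrupt just like Ctrl-C would.
signal.signal(signal.SIGTERM, signal.default_int_handler)

def run_until_terminated():
    try:
        os.kill(os.getpid(), signal.SIGTERM)  # simulate the pod being evicted
        return "finished"
    except KeyboardInterrupt:
        return "interrupted"

print(run_until_terminated())  # → interrupted
```

This is why an autoscaler eviction (which sends SIGTERM to the container) shows up in the 0.9.1 stack trace as a `KeyboardInterrupt` rather than a silent kill.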
👍 1
with respect to the check.invariant error, i think that means something is trying to re-trigger the pipeline run while re-using the same run_id, which is not supported/recommended since we store each pipeline run invocation separately. i don’t know of anything within the dagster system that would do that..
hmm, i’m not very familiar with node auto-provisioning settings but i wonder if they can accommodate sudden spikes in demand, or are better suited for gradual increases? also “When scaling down, cluster autoscaler honors a graceful termination period of 10 minutes for rescheduling the node’s Pods onto a different node before forcibly terminating the node.” could be interesting — not sure if that might be relevant?
k
Wow, your intuition 100% matched the Google Support Team, Catherine!! Thanks for pointing that out! The Autoscaler was the issue; Dagster works as expected. 😄 I can either disable the Autoscaler, or I have learned that the annotation
"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
can keep the autoscaler from rescheduling the pod. [1] https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node
😻 1
Knowing that, is there any work on the Dagster side at the moment to allow passing custom annotations into the V1PodTemplateSpec in the "dagster-k8s/job.py" module ? I wonder if we can have something similar to what we have in place now for
K8S_RESOURCE_REQUIREMENTS_KEY
, so that we could put something like
K8S_POD_TEMPLATE_ANNOTATION
in the pipeline tags and job.py would initialize the V1PodTemplateSpec with that annotation dictionary. If not, I can attempt a PR to implement it.
c
awesome, glad that we found the issue! i definitely think dagster should enable users to pass arbitrary config to their jobs / pods / containers but we haven’t prioritized it yet. i wonder whether it might be reasonable to set "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" for all run master jobs
we also love community contributions 😻 but the testing infra for the k8s side is pretty rough right now on the community side. also we’re thinking about whether resource requirements should move off pipeline tags into run config..
would you mind filing an issue for this on github?
k
Will do!
Posted here: https://github.com/dagster-io/dagster/issues/2812 Thank you for the feedback and helpful advice, Catherine, I really appreciate it 😄
🎉 1
also, not sure if it helps, but have you guys considered MicroK8s for local K8S testing and development ? https://kubernetes.io/blog/2019/11/26/running-kubernetes-locally-on-linux-with-microk8s/
c
thanks for the issue! sounds like other users are running into this too, and we’ll try to get it slotted in for the next release
we use Kind for k8s testing, but we still need to invest more time in developer tooling around debugging test failures / faster local test loop