# dagster-feedback

Simon

04/28/2022, 3:06 PM
Good day! We've been testing Dagster for a bit now, and one thing we noticed is that if the Pods created by the Kubernetes Jobs that the K8sRunLauncher creates fail, the Run fails. We expected this to be more robust; especially given the use of Kubernetes Jobs rather than bare Pods, we were expecting the Job to take care of intermittent failures. There are different scenarios where one might or might not want retries, of course, so I figured it would be a good idea to start with a design/discussion on what the desired (default) behavior should be and which parts of the Kubernetes Job configuration, if any, should be configurable. Does that sound like a good idea? Shall I just create an issue to start the discussion?
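(For context, a minimal sketch of the kind of Kubernetes Job configuration being discussed: setting a backoffLimit on the run worker Job through the dagster-k8s/config tag. The job_spec_config key and snake_case field name are assumptions to verify against the dagster-k8s docs for your version, and whether the run launcher handles retried Pods cleanly is exactly the open question in this thread.)

```python
# Sketch (unverified): pass raw Kubernetes Job spec fields to the run worker Job
# created by K8sRunLauncher via the dagster-k8s/config tag.
from dagster import job, op


@op
def do_work():
    ...


@job(
    tags={
        "dagster-k8s/config": {
            # Assumption: job_spec_config is forwarded to the Job's V1JobSpec, so
            # backoff_limit maps to the Job's backoffLimit (Pod retry budget).
            "job_spec_config": {"backoff_limit": 3},
        }
    }
)
def my_pipeline():
    do_work()
```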

johann

04/28/2022, 4:31 PM
Hi Simon, happy to discuss here or in GH. We're making strides away from failing when the underlying pod dies; you could take a look at https://docs.dagster.io/deployment/run-monitoring#resuming-runs-after-run-worker-crashes-experimental

Simon

04/28/2022, 5:00 PM
@johann thanks for the reply and the pointer to those docs. Resuming runs after worker crashes sounds like the right thing, but the docs state it only works for the K8sRunLauncher with the k8s_job_executor. I don't really understand why the k8s_job_executor part is relevant for failures of the Job/Pod created by the K8sRunLauncher, do you know? We're not using the k8s_job_executor, so going by the docs it seemed irrelevant to our use case.

johann

04/28/2022, 5:25 PM
It’s on our radar to expand this to multiprocess. The reason it’s different per executor is that when we’re resuming, the executor needs to figure out what state all the steps were left in when it died. That logic is just a bit different for multiprocess
👍 1
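(A minimal sketch of the setup the linked run-monitoring doc assumes: running a job with the k8s_job_executor from dagster-k8s, so each step gets its own Kubernetes Job. The job and op names are made up for illustration.)

```python
# Sketch: opt a job into the k8s_job_executor so steps run in their own
# Kubernetes Jobs, which is the combination the resuming-runs doc covers.
from dagster import job, op
from dagster_k8s import k8s_job_executor


@op
def my_step():
    ...


@job(executor_def=k8s_job_executor)
def my_resumable_job():
    my_step()
```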

Simon

04/29/2022, 12:32 PM
@johann I've created an issue to discuss/figure out how to handle this: https://github.com/dagster-io/dagster/issues/7641

Charlie Bini

04/29/2022, 8:53 PM
just got this error, is this related?
```
Exception: filtered (4f9e8b6c-abb8-47a6-997a-52cc322f9d59) started a new run while the run was already in state DagsterRunStatus.STARTED. This most frequently happens when the run worker unexpectedly stops and is restarted by the cluster.

Stack Trace:
  File "/root/app/__pypackages__/3.10/lib/dagster/grpc/impl.py", line 91, in core_execute_run
    yield from execute_run_iterator(
  File "/root/app/__pypackages__/3.10/lib/dagster/core/execution/api.py", line 90, in execute_run_iterator
    raise Exception(
```
the op raises a RetryRequested in case of a ConnectionError, but I can't tell from the logs if that's what happened
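(For reference, a sketch of the op-level retry pattern described here: raising RetryRequested on a ConnectionError. This retries just the step inside the same run, which is separate from the run worker being restarted by the cluster, which is what the exception above is about. call_external_service is a hypothetical placeholder.)

```python
from dagster import RetryRequested, op


@op
def fetch_data():
    try:
        # call_external_service is a made-up helper standing in for the real call
        return call_external_service()
    except ConnectionError as exc:
        # Retry this op up to 3 times, waiting 10s between attempts; this is an
        # op-level retry, not a restart of the run worker Pod/Job.
        raise RetryRequested(max_retries=3, seconds_to_wait=10) from exc
```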

Simon

04/29/2022, 10:38 PM
@Charlie Bini That error about starting a new run while the run was already in the STARTED state is indeed similar to (or possibly exactly the same as, I'd need to check) what happens in this case.
@johann Did you have any chance to look at this? I couldn't find an existing issue, but you mentioned that expanding this to multiprocess was on your radar. Are there maybe issues or tickets internal to Elementl/the Dagster team where you're already tracking similar things that aren't on GitHub? In any case it would be good to know what the plans and timelines are, so my team can assess whether we want to move forward with Dagster and possibly contribute the changes for this (together with https://github.com/dagster-io/dagster/issues/7623, which did not really get the response we were hoping for).

johann

05/04/2022, 4:35 PM
Left a comment on https://github.com/dagster-io/dagster/issues/7623. For https://github.com/dagster-io/dagster/issues/7623, our timeline is:
• Next week we'll ship automatic re-execution from failure. This retries a Job in a whole new Dagster run (the same experience as manually re-executing in Dagit). We still think that resuming the run in place is the optimal experience for K8s failures, but this will at least be a kind of brute-force solution.
• Improving resuming runs in place is on the radar, but doesn't have a set date.
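(A sketch of how the per-job knob for this "automatic re-execution from failure" feature could look: a dagster/max_retries tag on the job, with run retries also enabled at the instance level in dagster.yaml. Both names are assumptions based on how the feature landed in later Dagster releases, not something confirmed in this thread; check the current docs before relying on them.)

```python
# Sketch (names unverified at the time of this thread): opt a job into automatic
# run retries so a failed run is re-executed as a brand new run.
from dagster import job, op


@op
def flaky_step():
    ...


# Assumption: the dagster/max_retries tag caps the number of automatic re-executions;
# the instance also needs run retries enabled in dagster.yaml.
@job(tags={"dagster/max_retries": 2})
def retried_job():
    flaky_step()
```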

Simon

05/04/2022, 5:02 PM
@johann 👍 sounds good. We'll be eagerly trying out the re-execute changes. A rough fix like that is fine for us for now; improvements would be nice of course, but it's more important that it works and is stable than that it's as optimal as possible. Is there (another) issue or roadmap item somewhere tracking the improvements for resuming runs? I'd like to subscribe to that one if it exists 🙂

johann

05/05/2022, 1:48 PM
There isn't a good public tracking issue; I'll follow up here when I get some time to type up the current state and where we want to go.
👍 1

Simon

05/05/2022, 2:29 PM
OK, thanks for the effort!