# ask-community
c
hey all, i’m struggling to understand how to utilize `ReexecutionOptions` with a job run via a schedule…basically the job has a set of unreliable ops and i want a descendant op to always run even if something upstream fails…i’ve taken a look at the reexecution example but can’t quite wrap my head around it
j
hey @Caleb Overman the reexecution system is for re-running failed ops, not for continuing along the op graph. Just want to confirm that what you’re looking for is a way to always execute a downstream op, even if the upstream op fails?
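for reference, re-execution in code looks roughly like this — just a sketch loosely based on the reexecution example in the docs, with unreliable_job as a placeholder job:
```
from dagster import DagsterInstance, ReexecutionOptions, execute_job, job, op, reconstructable

@op
def flaky_op():
    raise Exception("simulated upstream failure")

@job
def unreliable_job():
    flaky_op()

if __name__ == "__main__":
    # requires DAGSTER_HOME to be set
    instance = DagsterInstance.get()

    # first run fails at flaky_op
    result = execute_job(reconstructable(unreliable_job), instance=instance)

    # re-execution re-launches the run starting from the point of failure --
    # it retries the failed op rather than skipping it and continuing downstream
    options = ReexecutionOptions.from_failure(run_id=result.run_id, instance=instance)
    execute_job(
        reconstructable(unreliable_job),
        instance=instance,
        reexecution_options=options,
    )
```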
c
yeah correct…i couldn’t find anything about continuing so thought reexecution might be an approach
j
re-execution would be specifically for you to manually launch the job again and it would start from the first point of failure. So in your case, that would be the failed op. Would having retries on the unreliable ops be a solution? https://docs.dagster.io/concepts/ops-jobs-graphs/op-retries#op-retries
if you did that, basically if an op in the job failed it would get automatically retried according to the policy
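something along these lines — a rough sketch, where flakey_function is just a stand-in for the unreliable work:
```
import random

from dagster import Backoff, RetryPolicy, op

def flakey_function():
    # stand-in for the unreliable work
    if random.random() < 0.5:
        raise Exception("transient failure")

@op(retry_policy=RetryPolicy(max_retries=3, delay=10, backoff=Backoff.EXPONENTIAL))
def unreliable_op():
    # any exception raised here triggers the retry policy before the op is marked failed
    flakey_function()
```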
c
we do have retries…basically the ops that fail do so consistently due to some underlying libraries we’re still enhancing, meaning a retry won’t succeed either…the ops are also dynamic so it’s not very straightforward to exclude certain ones
basically hoping to just allow the failed ops to fail and continue, which would buy us some time to address the other issues
j
yeah that makes sense
is it infeasible to put a try catch in the unreliable ops?
c
really like that idea! unfortunately these ops are also k8s jobs that can fail unexpectedly and succeed on retry (hence having that setup) so catching the failure exception would prevent the retries we do use
we’re in the middle of migrating from airflow to dagster and just trying to get things running so we can shut down airflow, and clearly not following best practices yet 😂
j
ok i see. this is tricky!
you could do the retry manually in the try/catch and then once it’s failed for real then continue on, like:
```
from dagster import op

@op
def unreliable():
    try:
        flakey_function()
    except Exception:
        # first attempt failed -- retry once before giving up
        try:
            flakey_function()
        except Exception:
            # swallow the failure so downstream ops still run
            ...
```
might be able to put that in a loop too
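roughly like this — just a sketch, with max_attempts as a placeholder:
```
from dagster import op

@op
def unreliable(context):
    max_attempts = 3  # placeholder retry count
    for attempt in range(1, max_attempts + 1):
        try:
            flakey_function()  # the unreliable call from the snippet above
            return
        except Exception:
            context.log.warning(f"attempt {attempt}/{max_attempts} failed")
    # falling through without re-raising lets downstream ops run anyway
```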
c
ooh i like it
j
if the downstream op can run even if the upstream fails, is there just like an implicit ordering (B runs after A) rather than data being passed around (A returns a value that B needs)?
c
kinda…basically the upstream ops build a partitioned parquet dataset by extracting data from another system, then downstream we convert that entire parquet file to a tableau extract…so we want the tableau extract to get created even if one of the partitions fails
j
ok - if it’s just a temp bandaid while you make the system more stable then i feel like the try catch thing should be fine. if it’s going to be more permanent then maybe there’s another combo of dagster concepts that’ll do it.
c
agreed it’s probably the best approach for us right now…really appreciate the help!