# ask-community
c
hey all, i’m struggling to understand how to utilize `ReexecutionOptions` with a job run via a schedule…basically the job has a set of unreliable ops and i want a descendant op to always run even if something upstream fails…i’ve taken a look at the reexecution example but can’t quite wrap my head around it
j
hey @Caleb Overman the reexecution system is for re-running failed ops, not for continuing along the op graph. Just want to confirm that what you’re looking for is a way to always execute a downstream op, even if the upstream op fails?
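for reference, re-execution in code looks roughly like this — just a sketch loosely based on the reexecution example in the docs, with unreliable_job as a placeholder job:
```
from dagster import DagsterInstance, ReexecutionOptions, execute_job, job, op, reconstructable

@op
def flaky_op():
    raise Exception("simulated upstream failure")

@job
def unreliable_job():
    flaky_op()

if __name__ == "__main__":
    # requires DAGSTER_HOME to be set
    instance = DagsterInstance.get()

    # first run fails at flaky_op
    result = execute_job(reconstructable(unreliable_job), instance=instance)

    # re-execution re-launches the run starting from the point of failure --
    # it retries the failed op rather than skipping it and continuing downstream
    options = ReexecutionOptions.from_failure(run_id=result.run_id, instance=instance)
    execute_job(
        reconstructable(unreliable_job),
        instance=instance,
        reexecution_options=options,
    )
```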
c
yeah correct…i couldn’t find anything about continuing so thought reexecution might be an approach
j
re-execution would be specifically for you to manually launch the job again and it would start from the first point of failure. So in your case, that would be the failed op. Would having retries on the unreliable ops be a solution? https://docs.dagster.io/concepts/ops-jobs-graphs/op-retries#op-retries
if you did that, basically if an op in the job failed it would get automatically retried according to the policy
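something along these lines — a rough sketch, where flakey_function is just a stand-in for the unreliable work:
```
import random

from dagster import Backoff, RetryPolicy, op

def flakey_function():
    # stand-in for the unreliable work
    if random.random() < 0.5:
        raise Exception("transient failure")

@op(retry_policy=RetryPolicy(max_retries=3, delay=10, backoff=Backoff.EXPONENTIAL))
def unreliable_op():
    # any exception raised here triggers the retry policy before the op is marked failed
    flakey_function()
```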
c
we do have retries…basically the ops that fail do so consistently due to some underlying libraries we’re still enhancing, meaning a retry won’t succeed either…the ops are also dynamic so it’s not very straightforward to exclude certain ones
basically hoping to just allow the failed ops to fail and continue, which would buy us some time to address the other issues
j
yeah that makes sense
is it infeasible to put a try catch in the unreliable ops?
c
really like that idea! unfortunately these ops are also k8s jobs that can fail unexpectedly and succeed on retry (hence having that setup) so catching the failure exception would prevent the retries we do use
we’re in the middle of migrating from airflow to dagster and just trying to get things running so we can shut down airflow, and clearly not following best practices yet 😂
j
ok i see. this is tricky!
you could do the retry manually in the try/catch and then once it’s failed for real then continue on, like:
```
from dagster import op

@op
def unreliable():
    try:
        flakey_function()
    except Exception:
        # first attempt failed -- retry once before giving up
        try:
            flakey_function()
        except Exception:
            # swallow the failure so downstream ops still run
            ...
```
might be able to put that in a loop too
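roughly like this — just a sketch, with max_attempts as a placeholder:
```
from dagster import op

@op
def unreliable(context):
    max_attempts = 3  # placeholder retry count
    for attempt in range(1, max_attempts + 1):
        try:
            flakey_function()  # the unreliable call from the snippet above
            return
        except Exception:
            context.log.warning(f"attempt {attempt}/{max_attempts} failed")
    # falling through without re-raising lets downstream ops run anyway
```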
c
ooh i like it
j
if the downstream op can run even if the upstream fails, is there just like an implicit ordering (B runs after A) rather than data being passed around (A returns a value that B needs)?
c
kinda…basically the upstream ops build a partitioned parquet dataset by extracting data from another system, then downstream we convert that entire parquet file to a tableau extract…so we want the tableau extract to get created even if one of the partitions fails
j
ok - if it’s just a temp bandaid while you make the system more stable then i feel like the try catch thing should be fine. if it’s going to be more permanent then maybe there’s another combo of dagster concepts that’ll do it.
c
agreed it’s probably the best approach for us right now…really appreciate the help!