# dagster-plus
s
Not sure if this is a cloud-specific question, but thought I'd ask here. We ran a job, and a single op failed due to some underlying data issues. We corrected the issues, and I tried to re-run that op in isolation, but it creates a different run and fails to locate the upstream dependencies for that op, so it won't succeed. This behavior seems different from dagit run locally, where I don't recall ever having dependency issues re-running failed ops in isolation.
d
Do you have a link to the job? If the isolated op has inputs that are outputs from other ops, I think it would need to pull in those inputs using the IO manager, even if it's just a single op being run
s
It does have a "dummy" input. This is probably an anti-pattern in Dagster, but it was the only way I found a number of months ago to pass multiple upstream deps to an op, and it percolated through the rest of my code.
d
What are you using as your io manager?
s
But when you re-run a failed op, why would it create a new run with no reference to the previous job that had failed?
d
I would expect it to have a reference to the previous job - do you have a link to the job in cloud?
s
Sure - I'll share in our private channel
d
it should show up in the run lineage on the right hand side of the runs page
s
sorry - do you want the url or the run ID?
d
either or
s
Job with op failure: 85ee54f8. Attempted re-runs: 2ab44583, 6d631844
d
if you click on the runs page for that re-run, do you see the run lineage like the one I posted on the right hand side linking it to the previous runs?
s
I don't actually
Happy to dig in with you if you'd like
d
You don't see something like this if you click through to the timeline view for the run?
s
ah yes
It has the original failed job in the upstream
So not sure why it can't grab the ref
d
You would need to be using an IO manager that persists the output somewhere a new run can find it - each run executes in its own ECS task, so that would need to be something like S3, rather than the default filesystem IO manager
alternatively, if you don't actually need the output/input since it's a dummy, we can see if there are ways of removing it
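(For reference, a minimal sketch of switching to the S3 pickle IO manager from dagster-aws; the op/job names and the bucket/prefix values are placeholders, not taken from this thread.)

```python
from dagster import job, op
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource


@op
def produce_value():
    return 42


@op
def consume_value(value):
    return value + 1


@job(
    resource_defs={
        # Persist op outputs to S3 so a step re-executed in a new run
        # (a separate ECS task) can still load its upstream inputs.
        "io_manager": s3_pickle_io_manager.configured(
            {"s3_bucket": "my-dagster-io", "s3_prefix": "dagster-io"}  # placeholder values
        ),
        "s3": s3_resource,
    }
)
def example_job():
    consume_value(produce_value())
```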
s
Ah - gotcha
I think it's probably worth going with the more general approach of adding an S3 IO manager
but yeahhh maybe cleaning up that pattern in code would be good too
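(For reference, one way to drop the dummy data input while keeping the execution ordering is Dagster's Nothing dependencies; this sketch uses hypothetical op names rather than code from this thread.)

```python
from dagster import In, Nothing, job, op


@op
def load_table_a():
    """Hypothetical upstream op: performs a side effect, no meaningful output."""


@op
def load_table_b():
    """Hypothetical upstream op: performs a side effect, no meaningful output."""


@op(ins={"after_a": In(Nothing), "after_b": In(Nothing)})
def downstream():
    """Runs only after both upstream ops succeed.

    Nothing inputs express ordering without passing data, so there is nothing
    for an IO manager to load when this op is re-executed on its own.
    """


@job
def ordering_only_job():
    downstream(after_a=load_table_a(), after_b=load_table_b())
```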
y
I assume since a new run gets created, it resets the status of the rest of the ops after re-running a specific op of a partition. Is there a way we can preserve the status from a previous run in the partition view? E.g. this is what it looks like after backfilling a specific op from a job that ran previously
d
Hi Yeachen - in each row it should be using the op from the most recent run that executes that op. If you have a link to a partitions page that isn't behaving that way, we would be happy to take a look
y
Oh strange, then maybe I'm doing something wrong? So I initially successfully ran a partition (e6260714), then created a backfill (03cb2642) using Step subset to select a specific (failed) op to re-run for that partition. That resulted in the green op status circles turning grey (i.e. the image above) for all the other successful ops from e6260714. The only one that's green now is the op that ran via step subset. Functionally, what I want is just to run a failed op again and see that all the ops ran successfully on the overview page, without having to run all the ops again.
d
Do you have a link handy to the partitions page? I can use that to pull it up in our logs
y
Ah sorry, I found this via search, didn't realise it was in the dagster-cloud channel. We're on open-source
d
Ah got it - would you mind making a new post either here or in #dagster-support? I can surface it to our support on-call
er, #dagster-support would be best actually if it's open-source