# dagster-plus
s
Not sure if this is a cloud-specific question, but thought I'd ask here. We ran a job, and a single op failed due to some underlying data issues. We corrected the issues, and I tried to re-run that op in isolation, but it creates a different run and fails to locate the upstream dependencies for that op, so it won't succeed. This behavior seems different from dagit run locally, where I don't recall ever having dependency issues re-running failed ops in isolation.
d
Do you have a link to the job? If the isolated op has inputs that are outputs from other ops, I think it would need to pull in those inputs using the IO manager, even if it's just a single op being run
s
It does have a "dummy" input. This is probably an anti-pattern in Dagster, but it was the only way I found a number of months ago to pass multiple upstream deps to an op, and it percolated through the rest of my code.
d
What are you using as your io manager?
s
But when you re-run a failed op, why would it create a new run with no reference to the previous job that had failed?
d
I would expect it to have a reference to the previous job - do you have a link to the job in cloud?
s
Sure - I'll share in our private channel
d
it should show up in the run lineage on the right hand side of the runs page
s
sorry - do you want the url or the run ID?
d
either or
s
Job with op failure: 85ee54f8. Attempted re-runs: 2ab44583, 6d631844
d
if you click on the runs page for that re-run, do you see the run lineage like the one I posted on the right hand side linking it to the previous runs?
s
I don't actually
Happy to dig in with you if you'd like
d
You don't see something like this if you click through to the timeline view for the run?
s
ah yes
It has the original failed job in the upstream
So not sure why it can't grab the ref
d
You would need to be using an IO manager that persists the output somewhere a new run can find it - each run executes in its own ECS task, so that would need to be something like S3, rather than the default filesystem IO manager
alternatively, if you don't actually need the output/input since it's a dummy, we can see if there are ways of removing it
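(For reference, a minimal sketch of switching to the S3 pickle IO manager from dagster-aws; the op/job names and the bucket/prefix values are placeholders, not taken from this thread.)

```python
from dagster import job, op
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource


@op
def produce_value():
    return 42


@op
def consume_value(value):
    return value + 1


@job(
    resource_defs={
        # Persist op outputs to S3 so a step re-executed in a new run
        # (a separate ECS task) can still load its upstream inputs.
        "io_manager": s3_pickle_io_manager.configured(
            {"s3_bucket": "my-dagster-io", "s3_prefix": "dagster-io"}  # placeholder values
        ),
        "s3": s3_resource,
    }
)
def example_job():
    consume_value(produce_value())
```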
s
Ah - gotcha
I think it's probably worth going with the more general approach of adding an S3 IO manager
but yeahhh maybe cleaning up that pattern in code would be good too
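(For reference, one way to drop the dummy data input while keeping the execution ordering is Dagster's Nothing dependencies; this sketch uses hypothetical op names rather than code from this thread.)

```python
from dagster import In, Nothing, job, op


@op
def load_table_a():
    """Hypothetical upstream op: performs a side effect, no meaningful output."""


@op
def load_table_b():
    """Hypothetical upstream op: performs a side effect, no meaningful output."""


@op(ins={"after_a": In(Nothing), "after_b": In(Nothing)})
def downstream():
    """Runs only after both upstream ops succeed.

    Nothing inputs express ordering without passing data, so there is nothing
    for an IO manager to load when this op is re-executed on its own.
    """


@job
def ordering_only_job():
    downstream(after_a=load_table_a(), after_b=load_table_b())
```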
y
I assume since a new run gets created, it resets the status of the rest of the ops after re-running a specific op of a partition. Is there a way we can preserve the status from a previous run in the partition view? E.g. this is what it looks like after backfilling a specific op from a job that ran previously
d
Hi Yeachen - in each row it should be using the op from the most recent run that executes that op. If you have a link to a partitions page that isn't behaving that way, we would be happy to take a look
y
Oh strange, then maybe I'm doing something wrong? So I initially successfully ran a partition (e6260714), then created a backfill (03cb2642) using Step subset to select a specific (failed) op to re-run for that partition. That resulted in the green op status circles turning grey (i.e. the image above) for all the other successful ops from e6260714. The only one that's green now is the op that ran via step subset. Functionally, what I want is just to run a failed op again and see that all the ops ran successfully on the overview page, without having to run all the ops again.
d
Do you have a link handy to the partitions page? I can use that to pull it up in our logs
y
Ah sorry, I found this via search, didn't realise it was in the dagster-cloud channel. We're on open-source
d
Ah got it - would you mind making a new post either here or in #dagster-support? I can surface it to our support on-call
er, #dagster-support would be best actually if it's open-source