I have a static partition with 300 or so partition...
# ask-community
j
I have a static partition with about 300 partitions. I launched my job, and about a third of the way through my laptop unexpectedly rebooted (not sure if it's related to the jobs; if I can reproduce it I'll officially report it). After the reboot, I went back to the partition, clicked 'materialize all', then clicked 'missing', and as expected it selected the 200 or so partitions that didn't get a chance to run. So far so good. In that set of jobs, about 5 died here and there, then the last 30 or so failed (I see errors in the dagit and dagster-daemon logs, but again, if it reproduces I'll make an official report). My issue is that when I went back to the partitions, I couldn't find a way to automatically select all of the failed jobs. What I tried:
• Click 'materialize all' again and click 'missing'.
  ◦ Expected result: it would figure out which jobs failed, then select those, like it did when I had it select the 200 missing jobs.
  ◦ Observed: it lets me click, but the button at bottom right says: Launch 0-Run Backfill.
    ▪︎ If I click it I get an unexpected error:
Exception: Backfill requested without specifying either 'AllPartitions' or 'partitionNames' arguments.
My current workaround was to go to the Status screen, then Backfills, click the backfill ID, select 'Runs', filter for the Failure runs, and re-execute them manually. I would have expected a simpler and more reliable way to determine which partitions failed, so I'm hoping I'm missing something basic.
c
Hi John, thanks for reporting this issue. I'll file an issue here; hopefully we can fix this soon.
@Dagster Bot issue auto-select failed jobs in asset backfill UI
d
b
hey @John Boyle thanks for filing this - I’m trying to repro this now. Do you know if you had more than one asset selected to rematerialize? I’m wondering if the failed runs in your first backfill partially ran before failing, materializing some of your selected assets but not all of them. (In that case, Dagit’s “Launch Assets Run” modal should still show you that state, but I’m not sure what the behavior of the “Missing” button is.)
j
Yes, this is a DAG of about 10 software-defined assets. The input is a string ID, and the first asset pings a REST API to pull the data I need for the downstream assets (which do some transformation or ping the API again to get other data). So you are probably correct that in the 'failed' situations I had a mix of assets that succeeded along with a bunch of downstream assets that failed (or didn't get a chance to start running).
It appears that when I clicked 'missing' in the 'rematerialize all' window, it may have only been selecting the jobs that were actually fully missing (i.e. none of the 10 assets had started). Due to the number of partitions, I may have been mistaken in thinking that Dagit was selecting the failed jobs too (i.e. I may have been mentally defining 'missing' to include 'failed' jobs). I definitely could see the failed partitions (red slices, but they were thin and sometimes difficult to mouse over).
I wonder if an additional 'failed' button on the partition materialization window (where failed means at least one asset in the group failed) would address what I'm looking for. In my case, I eventually created another asset to figure out which partitions were missing, then I simply copied and pasted the list of IDs into the materialization window to complete the backfill.
I'm going to talk more with @sandy in another thread (https://dagster.slack.com/archives/C01U954MEER/p1655493718806329) about whether I'm approaching static partitions and SDAs with the philosophy Dagster expects, so maybe my use case isn't typical. Please let me know if the above makes sense or if you would like additional context. I really appreciate the help.
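For anyone following along, the distinction between 'missing' (no asset materialized) and the 'failed' button I'm describing (at least one asset not materialized) can be sketched in plain Python. This is only an illustration under assumptions: the asset names, the partition IDs, and the `classify_partitions` helper are all made up, and it does not call any Dagster API. It assumes you already have, per partition, the set of assets that materialized successfully.

```python
from typing import Dict, List, Set

# Hypothetical asset names standing in for the ~10 connected assets.
ALL_ASSETS: Set[str] = {"raw_api_data", "cleaned", "enriched"}


def classify_partitions(
    materialized: Dict[str, Set[str]], all_assets: Set[str]
) -> Dict[str, List[str]]:
    """Group partition keys by how many of the expected assets materialized."""
    groups: Dict[str, List[str]] = {"missing": [], "partial": [], "complete": []}
    for key in sorted(materialized):
        done = materialized[key] & all_assets
        if not done:
            groups["missing"].append(key)   # nothing ran: what 'Missing' seems to select
        elif done == all_assets:
            groups["complete"].append(key)  # every asset materialized
        else:
            groups["partial"].append(key)   # some assets ran, some did not
    return groups


# Made-up example data: partition ID -> assets that materialized.
status = {
    "id_001": set(),
    "id_002": {"raw_api_data"},
    "id_003": {"raw_api_data", "cleaned", "enriched"},
    "id_004": {"raw_api_data", "cleaned"},
}

groups = classify_partitions(status, ALL_ASSETS)
# Partitions that still need work: fully missing plus partially materialized.
to_rerun = groups["missing"] + groups["partial"]
print(", ".join(to_rerun))
```

The printed comma-separated list is the kind of thing I pasted into the materialization window by hand.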
Hi @Ben Gotow. I had to rerun my static partition and ran into this issue again, but I think I now understand the behavior of the 'Missing' button. In my example of 10 connected SDAs, clicking 'Missing' will only select a partition if ALL of the steps (in my case, all 10) are missing. I was able to confirm this in my latest run, where I had multiple failures: only two of the partitions had all 10 missing, whereas the others that failed all had at least one successful step. Clicking the 'Missing' button selected the two.
My current workaround to select the failed jobs is to filter the runs using the backfill ID and then add a filter for status: CANCELED. If I have more than 25 failures, it means I have to:
1. filter
2. click the box to select all
3. click the button to execute the jobs (this can take a couple of minutes to complete)
4. click the button at the bottom to view older failed runs
So if I have more than 100 failures, it's a lot of clicks. But it works!
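The filtering steps above could, in principle, be done on exported run data instead of by clicking. Here is a rough sketch in plain Python. It assumes each run is available as a dict with a `status` string and a `tags` dict, and that the backfill ID and partition key are stored under the `dagster/backfill` and `dagster/partition` tags; that record shape, the `failed_partitions` helper, and the sample data are assumptions for illustration, not the output of a specific Dagster call. It also does not check whether a failed partition later succeeded on a retry.

```python
from typing import Dict, Iterable, List, Tuple


def failed_partitions(
    runs: Iterable[Dict],
    backfill_id: str,
    statuses: Tuple[str, ...] = ("FAILURE", "CANCELED"),
) -> List[str]:
    """Return unique partition keys of runs in a backfill with an unsuccessful status."""
    seen = set()
    keys: List[str] = []
    for run in runs:
        tags = run.get("tags", {})
        if tags.get("dagster/backfill") != backfill_id:
            continue  # run belongs to a different backfill (or to none)
        if run.get("status") not in statuses:
            continue  # run succeeded or is still in progress
        key = tags.get("dagster/partition")
        if key is not None and key not in seen:
            seen.add(key)
            keys.append(key)
    return keys


# Made-up run records for illustration only.
runs = [
    {"status": "SUCCESS", "tags": {"dagster/backfill": "abc", "dagster/partition": "id_001"}},
    {"status": "CANCELED", "tags": {"dagster/backfill": "abc", "dagster/partition": "id_002"}},
    {"status": "FAILURE", "tags": {"dagster/backfill": "abc", "dagster/partition": "id_003"}},
    {"status": "FAILURE", "tags": {"dagster/backfill": "abc", "dagster/partition": "id_003"}},
    {"status": "FAILURE", "tags": {"dagster/backfill": "other", "dagster/partition": "id_004"}},
]

rerun_keys = failed_partitions(runs, "abc")
print(", ".join(rerun_keys))
```

Deduplicating by partition key means a partition that failed more than once is only listed once, which keeps the pasted list short.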