Huib Keemink

02/08/2022, 4:19 PM
Is there a pattern to make backfills more efficient? When scheduling a single job, a 5 min/run overhead (waiting for Databricks to spin up a cluster, waiting for the job container to spin up, etc.) is no big deal, but when scheduling hundreds (daily runs over a few years) or thousands (hourly runs over a few years) this adds up quickly.
Good to note: I can’t run more than <x> operations at the same time due to resource limits, so just scaling is not really an option

yuhan

02/08/2022, 5:46 PM
cc @prha

Huib Keemink

02/08/2022, 6:48 PM
Some things that would work in my setting:
• Do more work in a single run. The output is partitioned by a field in the data, so I could run monthly jobs instead of daily when doing a backfill. The downside here is that this does not create the nice green lights in the UI 😉 (a rough sketch of this follows below)
• Do more runs in a single execution. Right now I’m scheduling a (tiny) job container in k8s that triggers a job execution in Databricks. While this level of isolation is nice when running a single daily job, it’s not really needed for backfills, which could share runtimes
If there’s other things I can try that will reduce the overhead, I’m all ears 🙂
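For illustration, the first idea above (coarser partitions during a backfill, so each run covers a month of days) could look roughly like the sketch below. The op name `process_range` and the helper `run_databricks_job_for_range` are made-up placeholders, not anything from this thread:

```python
from dagster import job, monthly_partitioned_config, op


def run_databricks_job_for_range(start: str, end: str):
    """Placeholder for the existing Databricks submission logic."""


@monthly_partitioned_config(start_date="2020-01-01")
def month_config(start, end):
    # Each monthly partition maps to a [start, end) date range for the op below.
    return {
        "ops": {
            "process_range": {
                "config": {
                    "start": start.strftime("%Y-%m-%d"),
                    "end": end.strftime("%Y-%m-%d"),
                }
            }
        }
    }


@op(config_schema={"start": str, "end": str})
def process_range(context):
    # One Databricks submission covers the whole month instead of one per day.
    run_databricks_job_for_range(context.op_config["start"], context.op_config["end"])


@job(config=month_config)
def monthly_backfill_job():
    process_range()
```

Each monthly partition still shows up as a single green box in Dagit, which is the downside mentioned above.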

yuhan

02/08/2022, 9:40 PM
are you primarily looking for ways to reduce mental overhead (i.e. avoid having to keep up with 1000s of schedules) or to reduce the total time spent on backfills (e.g. ways to run backfills in parallel)?
also, what do you think the bottleneck is? if it’s waiting for Databricks to spin up a cluster or waiting for the job container to spin up, i’d say the best bet for making those more efficient is to do more work in a single execution - in other words, reduce the number of Databricks spin-ups or new job spin-ups.

Huib Keemink

02/08/2022, 9:50 PM
A bit of both, I guess. Scheduling thousands of jobs at once is not great for Dagit (the UI gets slow, seeing what is happening is cumbersome), but those are minor gripes compared to waiting a week for a backfill to complete
the main overhead is in the usual suspects: scheduling a pod, waiting for databricks, and just time between things

yuhan

02/08/2022, 9:58 PM
got it. yeah, i’d recommend trying to do more work in one execution in that case then.

Huib Keemink

02/08/2022, 10:00 PM
is there a nice way to do this? Or would you recommend just setting the schedule to start today, and backfill manually by running the past year at once?
Or is there a way to tell Dagster that the 01-02-2021 : 02-02-2021 run has succeeded if the 01-01-2021 : 31-12-2021 run has succeeded?

Arun Kumar

02/08/2022, 10:15 PM
I had a similar request before. Currently this is not possible, as each job run is attributed to a single partition. However, it looks like the team is planning to solve this problem with Asset Jobs, where a single job run would be able to fill in multiple partitions of the asset. Not sure if Dagster exposes any APIs that you can call from your job run (for 01-01-2021 : 31-12-2021) that can let Dagster know that those partitions are already triggered.

Huib Keemink

02/08/2022, 10:27 PM
got it, that’s what I was afraid of 🙂
really glad this is on the radar though, would make for a killer feature
out of curiosity, how did you solve this in the end? I can just duplicate the jobs with larger partitions without a schedule (hourly -> weekly, daily -> monthly, etc), but this clutters the interface a bit

Arun Kumar

02/08/2022, 10:32 PM
Hmm, I evaluated that approach, but eventually ended up not using the backfill UI for backfilling our jobs. I pass the start_date and end_date config to the job and trigger it from the playground
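A minimal sketch of that config-driven pattern, assuming a job whose op takes a start_date/end_date run config; the names here are illustrative, not Arun's actual setup:

```python
from dagster import job, op


@op(config_schema={"start_date": str, "end_date": str})
def ingest_range(context):
    # The real ingestion (e.g. one Databricks submission for the whole range)
    # would go here; the range comes straight from run config.
    context.log.info(
        f"Backfilling {context.op_config['start_date']} -> {context.op_config['end_date']}"
    )


@job
def ingest_job():
    ingest_range()


# The same run config can be pasted into the Launchpad/playground:
# ops:
#   ingest_range:
#     config:
#       start_date: "2021-01-01"
#       end_date: "2021-01-31"
```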

Huib Keemink

02/08/2022, 10:36 PM
ah yeah, I assume you can just do 01-01-2020 -> 01-01-2022 without breaking anything then 😉 Unfortunately I can’t really pull all the data in at once without breaking things, and manually entering these details 52x for each year I want to backfill is a bit much
probably should’ve just kept things simple and stayed away from k8s and databricks, and just run dagster + jobs inside a beefy VM

Arun Kumar

02/08/2022, 10:39 PM
Yeah, we currently don't have use cases that backfill 2 years of data. If the range is wide, yeah, I would probably run them in batches

Huib Keemink

02/08/2022, 10:47 PM
Thanks, this was super insightful!
:partydagster: 1

geoHeil

02/09/2022, 8:17 AM
But I think you can keep the Databricks resource/cluster initialized outside of the backfilling -- so you would not need to constantly start/stop it or wait for it to spin up.
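A minimal sketch of that idea as a plain Dagster resource pointing at a pre-started cluster; the client class is a stand-in for whatever Databricks client the jobs already use, not the dagster-databricks API:

```python
from dagster import resource


class ExistingClusterClient:
    """Stand-in for the Databricks client the jobs already use."""

    def __init__(self, host: str, token: str, cluster_id: str):
        self.host, self.token, self.cluster_id = host, token, cluster_id


@resource(config_schema={"host": str, "token": str, "existing_cluster_id": str})
def databricks_existing_cluster(init_context):
    cfg = init_context.resource_config
    # Every run reuses the long-lived cluster instead of spinning up a new one,
    # which removes the per-run cluster startup wait during a backfill.
    return ExistingClusterClient(cfg["host"], cfg["token"], cfg["existing_cluster_id"])
```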

Huib Keemink

02/09/2022, 8:18 AM
yeah, using an existing cluster improves stuff a little, but there’s still a multi-minute lag

geoHeil

02/09/2022, 8:20 AM
I guess a potential solution will not be possible on the Dagster side alone. However, what about creating your Spark job in a way that it can be parametrized to process multiple partitions/dates (so you can backfill a full range), and somehow (I guess manually, using the Dagster API) register the results back to get the nice green lights in the UI.
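One way to "register the results back" from a single wide run is to log an AssetMaterialization per processed date; note this lights up the asset in Dagit rather than flipping a partitioned job's per-run status, and the asset key and op name below are made up:

```python
from datetime import date, timedelta

from dagster import AssetMaterialization, op


@op(config_schema={"start": str, "end": str})
def process_and_register(context):
    day = date.fromisoformat(context.op_config["start"])
    end = date.fromisoformat(context.op_config["end"])
    while day <= end:
        # ... one big Spark/Databricks submission would cover this day ...
        # Record a materialization per date so each partition shows up in Dagit.
        context.log_event(
            AssetMaterialization(asset_key="my_table", partition=day.isoformat())
        )
        day += timedelta(days=1)
```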

Huib Keemink

02/09/2022, 8:32 AM
Yeah, the job itself doesn’t care about the range as long as it’s full days (output data is partitioned by date). I haven’t found a way to tell Dagster “this run is green” from a job though