# ask-community
a
Hi, may I ask what the best practice is for running a full historical load for a partitioned job? We need to extract daily-partitioned data going back 2 years, and I feel that firing 365*2 executions is not a good idea.
Option 1: Should I have something like this? (I don't know how to get that GET_FIRST_PARTITION yet.)
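from datetime import datetime
from dagster import DailyPartitionsDefinition, asset

daily_partitions = DailyPartitionsDefinition(start_date=datetime(2022, 8, 15))

@asset(partitions_def=daily_partitions)
def my_asset(context):
    start_date, end_date = context.output_asset_partitions_time_window()
    # GET_FIRST_PARTITION and TWO_YEARS_AGO are placeholders for what I'd like to have
    if start_date == daily_partitions.GET_FIRST_PARTITION:
        start_date = TWO_YEARS_AGO
    ...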
Option 2: Or should I define a custom run_config schema with a Start_date, so that I can create a custom run with the start_date set to 2 years ago? If so, could you please point me to the documentation for that run_config?
Option 3 (the best, I think): have something similar to multi-assets, with one op that produces multiple partitions. I'm not sure whether something like this exists. Thanks!
s
I think backfilling 365*2 executions is exactly the pattern you'd want to implement. We have hourly partitioned jobs and have fired up backfills of >20k jobs multiple times (i.e. over several tables). Two things you want to make sure of:
1. You can control how many simultaneous jobs are put on your execution engine at a given time, so 1000 might get queued but only 50 run at a time. Don't starve all your resources.
2. Your jobs are idempotent 🙂
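To make the two points above concrete, here is a minimal plain-Python sketch, not Dagster code. The in-memory `warehouse` dict, `load_partition`, and `run_backfill` are hypothetical names made up for illustration. In Dagster itself the concurrency cap would normally be a run-queue setting on the instance rather than application code; the thread pool below only imitates "many queued, few running".

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

# Hypothetical in-memory "warehouse", keyed by partition date string.
warehouse = {}


def load_partition(day):
    """Idempotent load: replace the whole partition instead of appending."""
    key = day.isoformat()
    rows = [f"row-for-{key}"]  # stand-in for the real extract
    warehouse[key] = rows      # overwrite, so a rerun leaves the same state
    return key


def run_backfill(start, end, max_concurrent=50):
    """Queue one load per day, but run at most `max_concurrent` at a time."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(load_partition, days))
```

Because each load overwrites its own partition, rerunning a failed or duplicated partition does not change the final result, which is what makes a very large backfill safe to retry.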
s
Hey @Averell. As @Stephen Bailey pointed out, Dagster is built to be able to handle large backfills. However, there are also definitely situations where it can be more efficient to backfill everything in a single step. Dagster doesn't yet have great support for this, but I'm actually working on it at this moment. Here's the issue where we're tracking it: https://github.com/dagster-io/dagster/issues/8706. In the meantime, I think my best recommendation would be to have an op-based job that yields multiple AssetMaterializations, each with one of the partitions you're backfilling.
a
@Stephen Bailey @sandy thanks for your help