Curious if anyone has guidance on the best way to handle full-asset resyncs with Dagster.
For example, I am trying to build a series of assets that will regularly materialize transactional data from an Oracle instance to Snowflake. Given the volume of data, I have some of these assets set up with HourlyPartitions (resulting in tens of thousands of partitions to cover multiple years' worth of data). For normal scheduling this should work great - aside from how slow Dagit gets with that many partitions (which is another issue).
However, because the schemas for these source tables change fairly regularly, full resyncs are necessary to pick up historical data for any newly added fields.
Herein lies the problem: running 20k+ backfill partitions is simply not feasible, and I'm wary of the "Pass partition ranges to single run" option, because it would mean extracting and loading millions of records per asset in one go (and it requires maintaining very large partition definitions just to reach that far back).
Hoping those of you who have faced similar challenges can offer guidance on the best way to approach this within the Dagster framework.
Like maybe there is a way to have HourlyPartitions on an asset for its normal job/schedule, and then use a MonthlyPartition for ad hoc materializations? Just not sure what's possible, and I haven't come across much in the docs that addresses this sort of issue.
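(For context, the mapping I'd want between the two granularities is easy enough to compute outside of Dagster - a rough sketch in plain Python, assuming Dagster's default partition key formats, with the function name being my own:)

```python
from datetime import datetime, timedelta

def hourly_keys_for_month(monthly_key: str) -> list[str]:
    """Expand a monthly partition key ("YYYY-MM-01") into the hourly
    partition keys ("YYYY-MM-DD-HH:MM") that fall inside that month."""
    start = datetime.strptime(monthly_key, "%Y-%m-%d")
    # First hour of the following month is the exclusive end of the window.
    end = (start.replace(day=28) + timedelta(days=4)).replace(day=1)
    keys, cursor = [], start
    while cursor < end:
        keys.append(cursor.strftime("%Y-%m-%d-%H:%M"))
        cursor += timedelta(hours=1)
    return keys
```

So one ad hoc "monthly" run could in principle fan out to (or stand in for) the ~720 hourly partitions it covers - the hard part is getting Dagster to treat both as materializations of the same asset.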
02/22/2023, 5:29 PM
I don’t have a solution for you, but I have exactly the same problem. I haven’t found a good way to manage backfills of assets with lots of partitions. I’ve even considered writing a job that calls the Dagit GraphQL API to trigger runs for me to manage it!
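Something like this is what I had in mind - a totally untested sketch that batches partition keys and fires one request per batch at Dagit's GraphQL endpoint. The mutation name and argument shape here are assumptions; you'd want to verify them against your Dagit version's GraphQL playground:

```python
import json
from urllib import request

DAGIT_URL = "http://localhost:3000/graphql"  # hypothetical local Dagit endpoint

def chunked(items: list[str], size: int) -> list[list[str]]:
    """Split a long list of partition keys into fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def launch_backfill_batches(partition_keys: list[str], batch_size: int = 100) -> None:
    # NOTE: the mutation name and `LaunchBackfillParams` shape below are
    # guesses - check the actual schema in Dagit's GraphQL playground.
    mutation = """
    mutation LaunchPartitionBackfill($params: LaunchBackfillParams!) {
      launchPartitionBackfill(backfillParams: $params) { __typename }
    }
    """
    for batch in chunked(partition_keys, batch_size):
        payload = json.dumps({
            "query": mutation,
            "variables": {"params": {"partitionNames": batch}},
        }).encode()
        req = request.Request(
            DAGIT_URL, data=payload,
            headers={"Content-Type": "application/json"},
        )
        request.urlopen(req)  # fire one backfill request per batch
```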
02/22/2023, 5:46 PM
@Spencer Nelson relieved to hear I'm not alone! Trying to get creative here as well, just keep hitting roadblocks and thinking there has to be a simpler solution.
02/22/2023, 6:19 PM
🙋‍♂️ Similar problem here as well. I ended up avoiding hourly partitions for newer assets; there are just too many rough edges right now. I use daily partitioning instead, which creates other problems, but they’re more manageable since I can do it in my own code.
There was this issue that ended up creating the “Pass partition ranges to single run” feature, which is good, but probably not general enough. There’s this discussion that is related and might be a good place to hash out other ideas.
02/22/2023, 6:58 PM
Agree with all of the above feedback about the challenges that come with having many partitions at scale, and being able to backfill these effectively.
I don't have a great solution in mind, but we do have a couple of open issues that propose interesting ideas for making these partitions more tractable:
• Hierarchical time partitioned assets (bundling older time partitions into buckets by month or year) https://github.com/dagster-io/dagster/issues/12351
• Time window partitions definitions that can have continuous time ranges, so you can backfill any arbitrary start -> end date: https://github.com/dagster-io/dagster/discussions/11809
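In the meantime, the bucketing idea from the first issue can be approximated in user code - e.g. collapsing hourly keys older than some cutoff into monthly buckets while keeping recent ones hourly. A rough sketch (plain Python, key formats matching Dagster's defaults, function name my own):

```python
from datetime import datetime

def bucket_partitions(hourly_keys: list[str], cutoff: datetime) -> list[str]:
    """Collapse hourly keys ("YYYY-MM-DD-HH:MM") older than `cutoff`
    into monthly buckets ("YYYY-MM"); keep newer keys at hourly grain."""
    buckets: dict[str, None] = {}
    for key in hourly_keys:
        ts = datetime.strptime(key, "%Y-%m-%d-%H:%M")
        label = ts.strftime("%Y-%m") if ts < cutoff else key
        buckets.setdefault(label, None)  # dedupe while preserving order
    return list(buckets)
```

That collapses years of history into a handful of monthly runs while recent hours stay individually addressable - which is roughly the shape the hierarchical proposal would give you natively.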
02/22/2023, 7:29 PM
@claire thank you for the feedback! Both of these sound like they would be great solutions. I really like the idea of hierarchical partitioning, especially if there were a way to somewhat customize how the buckets are established. Will follow these issues and see how things progress!