Is there a limit to the number of Partitions for a...
# ask-community
j
Is there a limit to the number of partitions for an asset? I am considering creating a dynamic partition with a sensor that adds a partition (a timestamp) each time new data is detected in an incoming CDC source (Fivetran). Depending on the frequency of the ingestion, I could be creating:
```
4 partitions per hour (one sync every 15 minutes from the Fivetran connector)
96 partitions per day (4 per hour x 24 hours)
== let's round that up to 100 per day ==
36,500 per year (100 x 365 days)
73,000 after year 2
182,500 after year 5
... and continued growth of ~36K partitions per annum
```
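The growth arithmetic above can be checked with a quick sketch (the constant names are mine, not from the thread):

```python
# Partition growth arithmetic from the thread (one Fivetran sync every 15 minutes).
SYNCS_PER_HOUR = 4
SYNCS_PER_DAY = SYNCS_PER_HOUR * 24   # 96 exactly
PARTITIONS_PER_DAY = 100              # rounded up to 100, as in the thread

def partitions_after(years: int, per_day: int = PARTITIONS_PER_DAY) -> int:
    """Total dynamic partitions accumulated after `years` 365-day years."""
    return per_day * 365 * years

print(partitions_after(1))  # 36500
print(partitions_after(2))  # 73000
print(partitions_after(5))  # 182500
```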
Is there a sensible limit I should be considering? Will things slow down as I add more partitions? Are there memory resource implications (e.g. does Dagster code ever load the entire partition set into memory)?
t
Hey there!
• Q: Is there a sensible limit I should be considering?
  ◦ A: Yeah, there isn't a known "hard" limit, but a sensible limit is around 10k, which is where the UI starts slowing down.
• Q: Will things slow down as I add more partitions?
  ◦ A: Up until the 10k point, it should be fine.
• Q: Are there memory resource implications (e.g. does Dagster code ever load the entire partition set into memory)?
  ◦ A: By "entire partition set", do you mean the list of all partitions? I believe so, during the backfilling process.
Can I ask what you're looking for in needing to have a partition every 15 minutes that matches the Fivetran sync frequency?
j
@Tim Castillo I want to use a sensor to detect when Fivetran completes a job on its own schedule and there is new data to process in Snowflake. (This approach is different from the Fivetran Dagster integration, which I am not using. I don't want Dagster to manage Fivetran; I want Fivetran to run independently on its own and have Dagster trigger when it completes.)
I chose a dynamic partition because data can appear at any time. The partition key is still a timestamp, though, so I think of this as a time-based partition with disjoint partition intervals (i.e. not regular weekly or hourly intervals).
The goal is to sense "as soon as" data is available using polling (i.e. the sensor calls the Fivetran API, say, every minute). Another sensing strategy would be to use the Fivetran webhook, which pushes notifications of connector sync completion instead, but I don't know how this can be achieved in Dagster.
My current Fivetran connectors are actually running every 6 hours (4 syncs / partitions per day), but I know I can increase the frequency to every 15 minutes (96 syncs / partitions per day), hence my initial question.
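The polling idea described above boils down to cursor comparison, independent of Dagster's sensor API. A minimal sketch, assuming the sensor keeps a cursor of the last sync it recorded and that partition keys are formatted timestamps (the function name and key format are my assumptions, not from the thread):

```python
from datetime import datetime, timezone
from typing import Optional

def new_partition_key(last_seen: Optional[datetime],
                      latest_sync: Optional[datetime]) -> Optional[str]:
    """If the connector finished a sync we have not recorded yet, return a
    timestamp partition key for it; otherwise return None (a no-op tick)."""
    if latest_sync is None:
        return None  # connector has never completed a sync
    if last_seen is not None and latest_sync <= last_seen:
        return None  # nothing new since the last tick
    # Use the sync-completion time as the dynamic partition key.
    return latest_sync.strftime("%Y-%m-%dT%H:%M:%S")

# Example tick: a sync completed 15 minutes after the last one we recorded.
key = new_partition_key(
    datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc),
    datetime(2023, 5, 1, 12, 15, tzinfo=timezone.utc),
)
print(key)  # 2023-05-01T12:15:00
```

Inside a real Dagster sensor, a non-None key would translate into a request to add the dynamic partition plus a run request for it; the cursor would be persisted via the sensor context.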
@Tim Castillo what do you think of this approach?
t
Sorry for the delay; I reached out to the team for help and am waiting so I can build a cohesive response. The sensor makes sense to me. I'm hesitant about the quickly growing number of partitions, though. Does your data need to be updated in such frequent micro-batches?
c
Hi Johno. Also wondering the same thing as Tim above: is there a reason why you need to create a new partition every time a job completes? You could instead yield a run request if downstream dependencies need to be updated. Or, if it's for observability purposes, you could write compute logs.
Echoing what Tim mentioned above, you might come across performance constraints for partition definitions larger than 10K. If you did follow the current dynamic partitions route, one thing you could consider is deleting old partitions that are no longer being used. I.e. you could have a schedule that runs on a monthly basis to delete all dynamic partitions older than a month.
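The cleanup suggestion above reduces to a date comparison over partition keys. A minimal sketch of the selection logic, assuming timestamp-formatted keys (in Dagster the deletion itself would go through the instance inside a schedule's evaluation function; `KEY_FORMAT` and the function name are my assumptions):

```python
from datetime import datetime, timedelta
from typing import List

KEY_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed format of the timestamp partition keys

def stale_partition_keys(keys: List[str], now: datetime,
                         max_age_days: int = 30) -> List[str]:
    """Return the partition keys older than `max_age_days`, i.e. the ones a
    monthly cleanup schedule would delete."""
    cutoff = now - timedelta(days=max_age_days)
    return [k for k in keys if datetime.strptime(k, KEY_FORMAT) < cutoff]

# Example: with "now" at 2023-06-01, only the mid-April key is stale.
stale = stale_partition_keys(
    ["2023-04-15T00:00:00", "2023-05-20T06:00:00"],
    now=datetime(2023, 6, 1),
)
print(stale)  # ['2023-04-15T00:00:00']
```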
j
Deleting old partitions makes sense to me; yes, old ones are not very useful.
@claire @Tim Castillo in answering the questions above: my objective is to have the downstream models up to date as soon as data is available,
… and to use dynamic partitions to track the success of each run when new data is available.
I will then face a new problem - which I think I will ask as a new question in #dagster-support