Is there a limit to the number of Partitions for a...
# ask-community
j
Is there a limit to the number of partitions for an asset? I am considering creating a dynamic partition with a sensor that adds a partition (a timestamp) each time new data is detected in an incoming CDC source (Fivetran). Depending on the frequency of the ingestion, I could be creating:
```
4 partitions per hour (one sync every 15 minutes from the Fivetran connector)
96 partitions per day (4 per hour x 24 hours)
== let's round that up to 100 per day ==
36,500 per year (100 x 365 days)
73,000 after year 2
182,500 after year 5
... and continued growth of ~36K partitions per annum
```
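The growth arithmetic above can be checked with a quick sketch (the constant names are mine, not from the thread):

```python
# Partition growth arithmetic from the thread (one Fivetran sync every 15 minutes).
SYNCS_PER_HOUR = 4
SYNCS_PER_DAY = SYNCS_PER_HOUR * 24   # 96 exactly
PARTITIONS_PER_DAY = 100              # rounded up to 100, as in the thread

def partitions_after(years: int, per_day: int = PARTITIONS_PER_DAY) -> int:
    """Total dynamic partitions accumulated after `years` 365-day years."""
    return per_day * 365 * years

print(partitions_after(1))  # 36500
print(partitions_after(2))  # 73000
print(partitions_after(5))  # 182500
```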
Is there a sensible limit I should be considering? Will things slow down as I add more partitions? Are there memory resource implications (e.g. does Dagster code ever load the entire partition set into memory)?
t
Hey there!
• Q: Is there a sensible limit I should be considering?
  ◦ A: Yeah, there isn't a known "hard" limit, but a sensible limit is around 10k, which is where the UI starts slowing down.
• Q: Will things slow down as I add more partitions?
  ◦ A: Up until the 10k point, it should be fine.
• Q: Are there memory resource implications (e.g. does Dagster code ever load the entire partition set into memory)?
  ◦ A: By "entire partition set", do you mean the list of all partitions? I believe so, during the backfilling process.
Can I ask what you're looking for in needing to have a partition every 15 minutes that matches the Fivetran sync frequency?
j
@Tim Castillo I want to use a sensor to detect when Fivetran completes a job on its own schedule and there is new data to process in Snowflake. (This approach is different from the Fivetran Dagster integration, which I am not using. I don't want Dagster to manage Fivetran; I want Fivetran to run independently on its own and have Dagster trigger when it completes.)
I chose a dynamic partition because data can appear at any time. The partition key is still a timestamp, though, so I think of this as a time-based partition with disjoint partition intervals (i.e. not regular weekly or hourly intervals).
The goal is to sense "as soon as" data is available using polling (i.e. the sensor calls the Fivetran API, say, every minute). Another sensing strategy would be to use the Fivetran webhook, which pushes notifications of connector sync completion instead, but I don't know how this can be achieved in Dagster.
My current Fivetran connectors are actually running every 6 hours (4 syncs / partitions per day), but I know I can increase the frequency to every 15 minutes (96 syncs / partitions per day), hence my initial question.
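The polling idea described above boils down to cursor comparison, independent of Dagster's sensor API. A minimal sketch, assuming the sensor keeps a cursor of the last sync it recorded and that partition keys are formatted timestamps (the function name and key format are my assumptions, not from the thread):

```python
from datetime import datetime, timezone
from typing import Optional

def new_partition_key(last_seen: Optional[datetime],
                      latest_sync: Optional[datetime]) -> Optional[str]:
    """If the connector finished a sync we have not recorded yet, return a
    timestamp partition key for it; otherwise return None (a no-op tick)."""
    if latest_sync is None:
        return None  # connector has never completed a sync
    if last_seen is not None and latest_sync <= last_seen:
        return None  # nothing new since the last tick
    # Use the sync-completion time as the dynamic partition key.
    return latest_sync.strftime("%Y-%m-%dT%H:%M:%S")

# Example tick: a sync completed 15 minutes after the last one we recorded.
key = new_partition_key(
    datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc),
    datetime(2023, 5, 1, 12, 15, tzinfo=timezone.utc),
)
print(key)  # 2023-05-01T12:15:00
```

Inside a real Dagster sensor, a non-None key would translate into a request to add the dynamic partition plus a run request for it; the cursor would be persisted via the sensor context.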
@Tim Castillo what do you think of this approach?
t
Sorry for the delay; I reached out to the team for help and am waiting so I can build a cohesive response. The sensor makes sense to me. I'm hesitant about the quickly growing number of partitions, though. Does your data need to be updated in such frequent micro-batches?
c
Hi Johno. Also wondering the same thing as Tim above: is there a reason why you need to create a new partition every time a job completes? You could instead yield a run request if downstream dependencies need to be updated. Or, if it's for observability purposes, you could write compute logs.
Echoing what Tim mentioned above, you might come across performance constraints for partition definitions larger than 10K. If you did follow the current dynamic partitions route, one thing you could consider is deleting old partitions that are no longer being used. I.e. you could have a schedule that runs on a monthly basis to delete all dynamic partitions older than a month.
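The cleanup suggestion above reduces to a date comparison over partition keys. A minimal sketch of the selection logic, assuming timestamp-formatted keys (in Dagster the deletion itself would go through the instance inside a schedule's evaluation function; `KEY_FORMAT` and the function name are my assumptions):

```python
from datetime import datetime, timedelta
from typing import List

KEY_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed format of the timestamp partition keys

def stale_partition_keys(keys: List[str], now: datetime,
                         max_age_days: int = 30) -> List[str]:
    """Return the partition keys older than `max_age_days`, i.e. the ones a
    monthly cleanup schedule would delete."""
    cutoff = now - timedelta(days=max_age_days)
    return [k for k in keys if datetime.strptime(k, KEY_FORMAT) < cutoff]

# Example: with "now" at 2023-06-01, only the mid-April key is stale.
stale = stale_partition_keys(
    ["2023-04-15T00:00:00", "2023-05-20T06:00:00"],
    now=datetime(2023, 6, 1),
)
print(stale)  # ['2023-04-15T00:00:00']
```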
j
Deleting old partitions makes sense to me; yes, old ones are not very useful.
@claire @Tim Castillo in answering the questions above: my objective is to have the downstream models up to date as soon as data is available,
… and to use dynamic partitions to track the success of each run when new data is available.
I will then face a new problem - which I think I will ask as a new question in #dagster-support