Binoy Shah

11/16/2022, 1:32 PM
What is the best way to process a very large dataset by splitting it into smaller chunks? I have data coming in for different customers and I want to process each customer separately. The data is first loaded from the DB, and each customer's subset is large enough that processing runs for tens of hours. I want to process each customer's data separately, and in turn split each customer's data by date and process each date chunk separately. It's more of a multi-level fan-out structure.

Marc Keeling

11/16/2022, 3:13 PM
Hi @Binoy Shah, I've used Dynamic Outputs to do this before https://docs.dagster.io/concepts/ops-jobs-graphs/dynamic-graphs#using-dynamic-outputs
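
A minimal sketch of that Dynamic Outputs fan-out, assuming a hypothetical load_customer_ids op standing in for the real DB query:

```python
from dagster import DynamicOut, DynamicOutput, graph, op


@op(out=DynamicOut(str))
def load_customer_ids():
    # Hypothetical loader; in practice this would query the DB for customer IDs
    for customer_id in ["acme", "globex", "initech"]:
        yield DynamicOutput(customer_id, mapping_key=customer_id)


@op
def process_customer(context, customer_id: str) -> str:
    # Each mapped invocation handles one customer's data in its own step
    context.log.info(f"Processing data for customer {customer_id}")
    return customer_id


@op
def summarize(context, processed: list):
    context.log.info(f"Finished {len(processed)} customers")


@graph
def per_customer_fanout():
    customer_ids = load_customer_ids()
    results = customer_ids.map(process_customer)
    summarize(results.collect())


per_customer_job = per_customer_fanout.to_job()
```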

Binoy Shah

11/16/2022, 4:11 PM
Thanks for the leads @Marc Keeling. I was also wondering about a custom partition setup: Asset: Customers --> Asset: ApprovedCustomers --> Asset: CustomerReport (Partition: customerId) --> Asset: CustomerSummary (Partition: customerId). Is this possible via partitioned assets or partitioned jobs, where the partition key is a custom key in a Pandas DataFrame?

Marc Keeling

11/16/2022, 4:41 PM
I think you may be looking for static_partitioned_config https://docs.dagster.io/_apidocs/partitions#dagster.static_partitioned_config
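
A minimal sketch of that approach, assuming a hypothetical hard-coded customer list for the partition keys:

```python
from dagster import job, op, static_partitioned_config

# Hypothetical customer list; static partition keys must be known up front
CUSTOMERS = ["acme", "globex", "initech"]


@static_partitioned_config(partition_keys=CUSTOMERS)
def customer_config(partition_key: str):
    # Produce the run config for a single customer partition
    return {"ops": {"build_customer_report": {"config": {"customer_id": partition_key}}}}


@op(config_schema={"customer_id": str})
def build_customer_report(context):
    context.log.info(f"Building report for {context.op_config['customer_id']}")


@job(config=customer_config)
def customer_report_job():
    build_customer_report()
```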

claire

11/16/2022, 10:23 PM
Hi Binoy. I think something that might meet your use case is multidimensional partitions, e.g. having one dimension be customers and the other dimension be dates. Then there will be a partition for every unique combination of customer and date. Dagster currently has a MultiPartitionsDefinition abstraction to define this, and tomorrow's major release will introduce Dagit support for it.
👍 1
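
A minimal sketch of a two-dimensional (customer x date) partitioned asset, assuming a hypothetical static customer list and daily date partitions:

```python
from dagster import (
    DailyPartitionsDefinition,
    MultiPartitionsDefinition,
    StaticPartitionsDefinition,
    asset,
)

# Hypothetical dimensions: a static customer list and daily date partitions
customer_date_partitions = MultiPartitionsDefinition(
    {
        "customer": StaticPartitionsDefinition(["acme", "globex", "initech"]),
        "date": DailyPartitionsDefinition(start_date="2022-01-01"),
    }
)


@asset(partitions_def=customer_date_partitions)
def customer_report(context):
    # The multi-partition key carries one value per dimension
    keys = context.partition_key.keys_by_dimension
    context.log.info(f"Processing customer={keys['customer']} for date={keys['date']}")
```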

Binoy Shah

11/16/2022, 11:00 PM
Thanks @claire. I had 👍'd the GitHub issue speculating that it could be my answer. Excited to adopt it.
:rainbow-daggy: 1

Marc Keeling

11/17/2022, 1:11 AM
Can't wait to see this implemented!