# ask-community
Chris
What is the right pattern for processing an extremely large dataset in batches within an asset? Is it expected to use an input resource to load each batch and write it to an output resource in a loop? Or can an input resource be used to batch the data in, with an IOManager then used in a loop with multiple yields?
Guy McCombe
Hey, Chris. I used static partitions for something like this and it works great. I wrote an IO manager to read/write percentiles of the data into partitions
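A minimal sketch of the pattern Guy describes, assuming Dagster 1.x with pandas and parquet files on disk; the bucket names, file paths, and `BucketIOManager` are illustrative stand-ins, not his actual implementation:

```python
import pandas as pd
from dagster import (
    ConfigurableIOManager,
    Definitions,
    InputContext,
    OutputContext,
    SourceAsset,
    StaticPartitionsDefinition,
    asset,
)

# Hypothetical: ten fixed buckets; use whatever split fits your data.
bucket_partitions = StaticPartitionsDefinition([f"p{i}" for i in range(0, 100, 10)])

class BucketIOManager(ConfigurableIOManager):
    """Reads/writes a single bucket of the dataset per partitioned run."""

    def load_input(self, context: InputContext) -> pd.DataFrame:
        # Load only the slice belonging to the partition being materialized.
        return pd.read_parquet(f"/data/raw/{context.asset_partition_key}.parquet")

    def handle_output(self, context: OutputContext, df: pd.DataFrame) -> None:
        df.to_parquet(f"/data/scored/{context.asset_partition_key}.parquet")

# The upstream dataset, assumed to already be split into buckets on disk.
raw_rows = SourceAsset("raw_rows", partitions_def=bucket_partitions)

@asset(partitions_def=bucket_partitions)
def scored_rows(raw_rows: pd.DataFrame) -> pd.DataFrame:
    # `raw_rows` here is just this partition's bucket, never the full dataset.
    return raw_rows.assign(score=raw_rows["value"].rank(pct=True))

defs = Definitions(
    assets=[raw_rows, scored_rows],
    resources={"io_manager": BucketIOManager()},
)
```

Each partition materializes in its own run, so a backfill over all ten keys processes the dataset in bounded-memory chunks.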
Tim Castillo
It'll depend on your needs. If you'd like to do the computation in Python, then we recommend using partitions if they fit your use case. Partitioning your data means each run loads only the chunk that needs to be computed, rather than everything. If you have a separate platform to run the compute on, e.g. a data warehouse, then we recommend using Dagster to orchestrate that compute, e.g. by running a SQL query.
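For the second option, a sketch of pushing the work to the database engine, here using DuckDB as a stand-in warehouse; the table names (`raw_orders`, `aggregated_orders`) are hypothetical, and you'd swap in your Snowflake/BigQuery/etc. connection:

```python
import duckdb
from dagster import asset

@asset
def aggregated_orders() -> None:
    # DuckDB as a stand-in warehouse; assumes a `raw_orders` table exists.
    con = duckdb.connect("warehouse.duckdb")
    try:
        # The aggregation runs inside the database engine, so the full
        # dataset never has to fit in this process's memory.
        con.execute(
            """
            CREATE OR REPLACE TABLE aggregated_orders AS
            SELECT customer_id, SUM(amount) AS total_amount
            FROM raw_orders
            GROUP BY customer_id
            """
        )
    finally:
        con.close()
```

Returning None means no IO manager is involved; Dagster just records the materialization while the warehouse holds the data.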
Chris
Thanks @Tim Castillo and @Guy McCombe
z
partitioned assets are super nice for this. another pattern I've used is graph-backed assets with a dynamic graph, although you lose the nice idempotency tracking that partitioned assets give you
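A sketch of that dynamic-graph pattern, assuming Dagster's `DynamicOut`/`graph_asset` APIs; the ten fixed batches and the trivial transform are placeholders for a real batch listing and per-batch computation:

```python
from typing import List

from dagster import DynamicOut, DynamicOutput, graph_asset, op

@op(out=DynamicOut())
def emit_batches():
    # Hypothetical: ten fixed batches. In practice you'd derive these
    # from the dataset itself (file listing, key ranges, etc.).
    for i in range(10):
        yield DynamicOutput(i, mapping_key=f"batch_{i}")

@op
def process_batch(batch_id: int) -> int:
    # Stand-in for loading and transforming one batch.
    return batch_id * 2

@op
def merge_results(results: List[int]) -> int:
    # Fan-in: combine the per-batch results.
    return sum(results)

@graph_asset
def processed_dataset():
    # Each mapped `process_batch` runs as its own step, so batches can
    # execute in parallel and retry independently within one run.
    return merge_results(emit_batches().map(process_batch).collect())
```

Unlike partitions, Dagster won't track which batches succeeded across runs, which is the idempotency trade-off mentioned above.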