# ask-community
**Daniel:**
I’d appreciate some architectural advice. I have a list of thousands of items in a database, and I need to process each item using a containerized Go application running in Kubernetes. Processing takes from minutes to tens of minutes; once an item has completed processing, I update its status in the DB. I’d like to process as many items concurrently as possible. I currently have a graph containing:

1. A database query op, where I build a list of items. The op yields `DynamicOutput`-wrapped items.
2. An op that `map`s the generator result from the op above onto the sub-graph described below.
3. A sub-graph composed of (a) an op that schedules a k8s job using `execute_k8s_job`, and then (b) an op that updates the DB with the k8s job’s success status (actually, it sets a UUID field, but that’s not important).

The priorities of each op are set such that the DB update is prioritized over starting a new k8s job (a sketch of this layout follows below). The list size for each run and the number of concurrent op executions appear to be limited by available memory, with Dagster OOMing if the numbers are too high. Ideally, I’d like to trigger a run and have the entire batch processed in a memory-efficient manner. Is there a better approach to solving this problem than the above?
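
For concreteness, here is a minimal sketch of the layout Daniel describes, assuming Dagster 1.x with the `dagster-k8s` library. The op names, the image tag, and the `query_pending_item_ids` / `update_item_status` helpers are hypothetical placeholders, not his actual code:

```python
from dagster import DynamicOut, DynamicOutput, graph, job, op
from dagster_k8s import execute_k8s_job


def query_pending_item_ids():
    # Hypothetical stand-in for the real database query.
    return [101, 102, 103]


def update_item_status(item_id):
    # Hypothetical stand-in for the DB update (setting the UUID field).
    pass


@op(out=DynamicOut())
def fetch_items(context):
    # 1. Build the item list from the DB and yield one DynamicOutput per item.
    for item_id in query_pending_item_ids():
        yield DynamicOutput(item_id, mapping_key=str(item_id))


@op(tags={"dagster/priority": "0"})
def run_k8s_processing(context, item_id):
    # 3a. Launch a k8s job for this item and block until it finishes.
    execute_k8s_job(
        context,
        image="registry.example.com/processor:latest",  # hypothetical image
        args=["--item-id", str(item_id)],
    )
    return item_id


@op(tags={"dagster/priority": "1"})  # DB update outranks starting new k8s jobs
def mark_item_done(context, item_id):
    # 3b. Record the k8s job's success status in the DB.
    update_item_status(item_id)


@graph
def process_item(item_id):
    # Sub-graph: schedule the k8s job, then update the DB.
    mark_item_done(run_k8s_processing(item_id))


@job
def process_all_items():
    # 2. Fan out: map each dynamic output onto the per-item sub-graph.
    fetch_items().map(process_item)
```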
**sandy:**
Hey Daniel - do you know which process you're seeing the OOM in? Is it the process that yields the `DynamicOutput`s? As long as you are:

- not using an in-memory IO manager
- not defining op hooks
- not holding any references yourself

the dynamic output values should de-allocate as you yield (see the sketch below).
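
For illustration, a minimal sketch of the lazy-yield pattern sandy describes, assuming Dagster 1.x; `stream_item_ids` is a hypothetical stand-in for a streaming database cursor, not a real Dagster or database API:

```python
from dagster import DynamicOut, DynamicOutput, op


def stream_item_ids():
    # Hypothetical stand-in for a server-side DB cursor that streams
    # item ids instead of loading them all into memory at once.
    yield from range(3)


@op(out=DynamicOut())
def fetch_items(context):
    # Yield lazily and keep no local collection: appending each value to
    # a list here would hold a reference and pin every item in memory
    # for the whole run, defeating the per-yield de-allocation.
    for item_id in stream_item_ids():
        yield DynamicOutput(item_id, mapping_key=str(item_id))
```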
**Daniel:**
Thanks, @sandy. I wasn’t using the `multiprocess` executor. Now that I’m using it and have upped the Dagster container’s memory, my runs are far more stable.
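
For reference, a minimal sketch of what that fix looks like in code, assuming Dagster 1.x and reusing the hypothetical `fetch_items` and `process_item` definitions from the first sketch; the `max_concurrent` value of 10 is an illustrative assumption, not Daniel's actual setting:

```python
# Attach the multiprocess executor explicitly with a concurrency cap.
# Each op runs in its own subprocess, so memory used while processing
# one item is returned to the OS when that subprocess exits.
from dagster import job, multiprocess_executor


@job(executor_def=multiprocess_executor.configured({"max_concurrent": 10}))
def process_all_items():
    fetch_items().map(process_item)
```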