# ask-community
**Daniel:**
I’d appreciate some architectural advice. I have a list of thousands of items in a database, and I need to process each item using a containerized Go application running in Kubernetes. Processing takes from minutes to tens of minutes; once an item has completed processing, I update its status in the DB. I’d like to process as many items concurrently as possible. I currently have a graph containing:

1. A database query op, where I build a list of items. The op yields `DynamicOutput`-wrapped items.
2. An op that `map`s the generator result from the op above onto the sub-graph described below.
3. A sub-graph composed of (a) an op that schedules a k8s job using `execute_k8s_job`, and then (b) an op that updates the DB with the k8s job’s success status (actually, it sets a UUID field, but that’s not important).

The priorities of each op are set such that the DB update is prioritized over starting a new k8s job (a sketch of this layout follows below). The list size for each run and the number of concurrent op executions appear to be limited by available memory, with Dagster OOMing if the numbers are too high. Ideally, I’d like to trigger a run and have the entire batch processed in a memory-efficient manner. Is there a better approach to solving this problem than the above?
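
For concreteness, here is a minimal sketch of the layout Daniel describes, assuming Dagster 1.x with the `dagster-k8s` library. The op names, the image tag, and the `query_pending_item_ids` / `update_item_status` helpers are hypothetical placeholders, not his actual code:

```python
from dagster import DynamicOut, DynamicOutput, graph, job, op
from dagster_k8s import execute_k8s_job


def query_pending_item_ids():
    # Hypothetical stand-in for the real database query.
    return [101, 102, 103]


def update_item_status(item_id):
    # Hypothetical stand-in for the DB update (setting the UUID field).
    pass


@op(out=DynamicOut())
def fetch_items(context):
    # 1. Build the item list from the DB and yield one DynamicOutput per item.
    for item_id in query_pending_item_ids():
        yield DynamicOutput(item_id, mapping_key=str(item_id))


@op(tags={"dagster/priority": "0"})
def run_k8s_processing(context, item_id):
    # 3a. Launch a k8s job for this item and block until it finishes.
    execute_k8s_job(
        context,
        image="registry.example.com/processor:latest",  # hypothetical image
        args=["--item-id", str(item_id)],
    )
    return item_id


@op(tags={"dagster/priority": "1"})  # DB update outranks starting new k8s jobs
def mark_item_done(context, item_id):
    # 3b. Record the k8s job's success status in the DB.
    update_item_status(item_id)


@graph
def process_item(item_id):
    # Sub-graph: schedule the k8s job, then update the DB.
    mark_item_done(run_k8s_processing(item_id))


@job
def process_all_items():
    # 2. Fan out: map each dynamic output onto the per-item sub-graph.
    fetch_items().map(process_item)
```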
**sandy:**
Hey Daniel - do you know which process you're seeing the OOM in? Is it the process that yields the `DynamicOutput`s? As long as you are:

- not using an in-memory IO manager
- not defining op hooks
- not holding any references yourself

the dynamic output values should de-allocate as you yield (see the sketch below).
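
For illustration, a minimal sketch of the lazy-yield pattern sandy describes, assuming Dagster 1.x; `stream_item_ids` is a hypothetical stand-in for a streaming database cursor, not a real Dagster or database API:

```python
from dagster import DynamicOut, DynamicOutput, op


def stream_item_ids():
    # Hypothetical stand-in for a server-side DB cursor that streams
    # item ids instead of loading them all into memory at once.
    yield from range(3)


@op(out=DynamicOut())
def fetch_items(context):
    # Yield lazily and keep no local collection: appending each value to
    # a list here would hold a reference and pin every item in memory
    # for the whole run, defeating the per-yield de-allocation.
    for item_id in stream_item_ids():
        yield DynamicOutput(item_id, mapping_key=str(item_id))
```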
**Daniel:**
Thanks, @sandy. I wasn’t using the `multiprocess` executor. Now that I’m using it and have upped the Dagster container’s memory, my runs are far more stable.
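
For reference, a minimal sketch of what that fix looks like in code, assuming Dagster 1.x and reusing the hypothetical `fetch_items` and `process_item` definitions from the first sketch; the `max_concurrent` value of 10 is an illustrative assumption, not Daniel's actual setting:

```python
# Attach the multiprocess executor explicitly with a concurrency cap.
# Each op runs in its own subprocess, so memory used while processing
# one item is returned to the OS when that subprocess exits.
from dagster import job, multiprocess_executor


@job(executor_def=multiprocess_executor.configured({"max_concurrent": 10}))
def process_all_items():
    fetch_items().map(process_item)
```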