# ask-community
m
Hi team! I'd like to ask if it is a good idea to use DynamicOutput for big data sets (~10k records). I'm seeing very large overhead (even with `mem_io_manager`).
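For context, the shape of the job is roughly this (a minimal sketch; the op names and the record source are placeholders, the real fan-out comes from an HTTP API):

```python
from dagster import DynamicOut, DynamicOutput, job, op

@op(out=DynamicOut())
def fan_out():
    # placeholder: in reality this yields ~10k records pulled from the API
    for i in range(10_000):
        yield DynamicOutput(i, mapping_key=str(i))

@op
def process_record(record):
    # placeholder for the per-record work
    return record

@job
def traverse_api_job():
    fan_out().map(process_record)
```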
z
I've run jobs of similar size, but we kept max_concurrent_runs relatively low (~30 running in parallel at one time). I didn't have any real problems, aside from needing to make sure logging wasn't too verbose; I found that Dagit slowed down a lot after the logs piled up (although recent release notes suggest that may have improved).
m
@Zach which `executor_def` and `io_manager` did you use? And which type of `runLauncher`?
z
I used the MultiprocessExecutor, with an EcsRunLauncher on 4 vCPU / 8 GB ECS tasks. IO went through a custom tool for writing data to Delta tables. One important thing about my setup was that most of the actual compute took place in Databricks, using a custom version of the databricks_pyspark_step_launcher from dagster-databricks.
What kind of overhead are you seeing? All the steps are going to run on a single Dagster worker, so if you're not sending most of your compute to another provider like Databricks or EMR, you'll probably need to limit the parallelism and increase the resources for the worker. The `mem_io_manager` might actually make things worse for large fan-outs, since all the outputs have to be held in memory.
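Something like this is what I mean, as a sketch (the executor cap and the `fs_io_manager` swap are illustrative; 30 is just an example value, not a recommendation):

```python
from dagster import fs_io_manager, job, multiprocess_executor, op

@op
def do_work():
    return 1  # stand-in for a real step

# cap how many steps run at once within a single run, and write step outputs
# to disk instead of holding all of them in memory like mem_io_manager does
@job(
    executor_def=multiprocess_executor.configured({"max_concurrent": 30}),
    resource_defs={"io_manager": fs_io_manager},
)
def capped_job():
    do_work()
```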
m
My job just traverses an HTTP API collecting hierarchical data, so there's no heavy computation, just waiting for responses. I run it locally, so memory shouldn't be a problem, but we run it in K8s in prod, so maybe we'll need some tweaks there. The amount of logging is probably problem #1, since it runs in debug mode locally.
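For the local runs I'll try turning the console logger down from debug, something like this (sketch; `api_job` is a stand-in for my real job):

```python
from dagster import job, op

@op
def fetch():
    return 1  # stand-in for the HTTP calls

@job
def api_job():
    fetch()

if __name__ == "__main__":
    # quiet the built-in console logger for local runs
    api_job.execute_in_process(
        run_config={"loggers": {"console": {"config": {"log_level": "INFO"}}}}
    )
```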