# deployment-kubernetes
j
Looking at launching many ops in parallel using `DynamicOutput`, I run into scalability issues, as it puts `known_status`, as well as many other arguments, into the dagster api call:
```
kubectl describe pod -n gww dagster-step-0147b2ea447749e9fdf522f431ffe720-tz52b
```
shows that the pod args expand with every dynamic output, resulting in errors if the number of dynamic ops is too large:
```
Discovered failed Kubernetes job dagster-step-f4a04de627530ac6e01d43d0127dbfce for step dynamicStep[0029004]
```
Any ideas on how to mitigate this problem? I like the idea of Celery workers, but I want to be able to autoscale my cluster depending on the load.
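(For reference, a minimal sketch of the kind of dynamic fan-out being discussed; the op names and fan-out size are illustrative, but the `DynamicOut` / `.map()` / `.collect()` pattern is Dagster's standard dynamic mapping API. With the k8s executor, each mapped step runs as its own pod, which is where the growing step args show up.)
```python
from dagster import DynamicOut, DynamicOutput, job, op


@op(out=DynamicOut())
def fan_out():
    # Each DynamicOutput becomes its own mapped step; with the k8s executor
    # that means one pod per chunk, each launched via the dagster api call.
    for i in range(100):
        yield DynamicOutput(i, mapping_key=f"chunk_{i}")


@op
def process(chunk: int) -> int:
    return chunk * 2


@op
def merge(results):
    return sum(results)


@job
def dynamic_fan_out_job():
    merge(fan_out().map(process).collect())
```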
a
Hi Jaap, this issue also occurs with Celery K8s execution in my experience. IMO the best fan-out architecture with Dagster is to have a relatively low-resource graph (e.g. use multiprocessing on a large node to kick off the execution), then use a map-reduce solution for the high-dimensional work (e.g. a Ray cluster). Edit: or use the K8s executor and run a local Ray cluster in each pod.
❤️ 1
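(A minimal sketch of that approach, keeping the Dagster graph itself narrow and pushing the wide map-reduce into Ray inside a single step. The chunk count and the doubling/sum logic are placeholders, and `ray.init()` here starts a local cluster inside the pod rather than connecting to a remote one.)
```python
import ray
from dagster import job, op


@op
def map_reduce_in_ray() -> int:
    # Do the wide fan-out inside one Dagster step instead of as many steps:
    # start (or reuse) a local Ray cluster in this pod and let Ray handle
    # the per-chunk parallelism.
    ray.init(ignore_reinit_error=True)

    @ray.remote
    def process(chunk: int) -> int:
        return chunk * 2

    results = ray.get([process.remote(i) for i in range(10_000)])
    ray.shutdown()
    return sum(results)


@job
def ray_map_reduce_job():
    map_reduce_in_ray()
```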
j
Thanks for sharing your experience! I will rethink how I view the responsibilities of dagster in parallel workloads.
j
We have a fix coming for this particular issue. In general I agree with Alex's point though: for highly parallel workloads, leveraging another tool from within Dagster can give the best results.
j
Thanks, good to know.
j
Yep, there are some backcompat challenges there if the launching vs. launched code are on different versions, but we should be able to work through them (or at worst offer a toggle that turns on the feature and expects recent versions on both ends).
j
@johann I forgot to update this thread: I fixed the issue by reducing the size of the fan-out. It would still be cleaner if dagster-kubernetes could handle this by itself. The PR seems a bit stuck; has an alternative been pushed elsewhere?
j
Also forgot to update: we found a bug that was causing a bunch of empty data in the known state we’re passing around, and patched that in https://github.com/dagster-io/dagster/pull/8975, which seemed to make the other fix less urgent.