# ask-community
Hey guys, any advice on running big pipelines? I have around 10k parallel tasks which we need to run frequently; the tasks themselves are not big. Currently we have Dagster with Celery running on k8s (we're using standalone Celery) and KEDA for autoscaling the Celery pods. I've been playing around with concurrency settings and pod resources for some time already, but I still can't get it working: it's slow, laggy, sometimes stuck, etc. Any advice will be really appreciated.
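(For context on what "playing around with concurrency settings" looks like here: the Celery executor's worker-side concurrency can be tuned via `config_source`, which is passed straight through to Celery. A minimal sketch of a run config, assuming a RabbitMQ broker; the broker URL and the numbers are placeholders, and the exact top-level key can differ between Dagster versions:)

```yaml
execution:
  celery:
    config:
      broker: "pyamqp://guest@rabbitmq:5672//"   # placeholder broker URL
      config_source:
        worker_concurrency: 4          # Celery worker processes per pod
        worker_prefetch_multiplier: 1  # keep one pod from hoarding queued tasks
```

With KEDA scaling pods on queue length, `worker_prefetch_multiplier: 1` is the usual starting point so tasks stay visible in the queue for the autoscaler.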
I have some big jobs like that, although my parallel tasks can take a couple of hours each. I'd start with relatively low concurrency and scale up slowly in tests. Watch your resource usage if your cluster has limited resources, in case CPU/memory pressure is contributing to the slowdowns. Filter out most of your non-critical logs from collection; you can configure your log level in your run config: https://docs.dagster.io/concepts/logging/python-logging#configuring-a-python-log-level- I don't have a lot of other ideas beyond that. I can imagine high parallelism with short tasks would result in a lot of Dagster events flying around needing processing, so I'd want to allocate a decent amount of CPU for the user code and webserver components.
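(From that docs page, the built-in console logger's level can be raised in the run config so DEBUG/INFO chatter never gets collected; a minimal sketch:)

```yaml
loggers:
  console:
    config:
      log_level: WARNING  # drop INFO/DEBUG noise from run logs
```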
Another thing to consider is the startup time of each sub-process or task in the pipeline. I don't know if you're planning to spin up a lot of jobs, or a single job that fans out, but it's something to look out for. It's not something that I cared about when I worked on my daily pipelines, but now that I'm working on a mini-batch that runs every 5 minutes I'm starting to notice it.
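(Not Dagster-specific, but the fixed per-task startup cost is easy to demonstrate with plain `multiprocessing`; `tiny_task` below is a hypothetical stand-in for a short unit of work, so what's being measured is essentially pure process-spawn overhead:)

```python
import time
from multiprocessing import Process


def tiny_task():
    """Stand-in for a short-lived unit of work (does nothing)."""
    pass


def timed_run(n):
    """Spawn n processes sequentially and return total wall time in seconds."""
    start = time.perf_counter()
    for _ in range(n):
        p = Process(target=tiny_task)
        p.start()
        p.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Even a no-op task pays the full process startup cost every time,
    # so fixed overhead dominates when the tasks themselves are short.
    n = 20
    total = timed_run(n)
    print(f"~{total / n * 1000:.1f} ms average startup overhead per task")
```

At 10k short tasks, even a modest per-task overhead adds up to minutes of pure spawning, which can look exactly like "slow and laggy".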