# ask-community
Hey guys, any advice on running big pipelines? I have around 10k parallel tasks which we need to run frequently; the tasks themselves are not big. Currently we have Dagster with Celery running on k8s (we're using standalone Celery) and KEDA for autoscaling the Celery pods. I've been playing around with concurrency settings and pod resources for some time already, but I still can't get it working: it's slow, laggy, sometimes stuck, etc. Any advice will be really appreciated.
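(For context on what "playing around with concurrency settings" looks like here: the Celery executor's worker-side concurrency can be tuned via `config_source`, which is passed straight through to Celery. A minimal sketch of a run config, assuming a RabbitMQ broker; the broker URL and the numbers are placeholders, and the exact top-level key can differ between Dagster versions:)

```yaml
execution:
  celery:
    config:
      broker: "pyamqp://guest@rabbitmq:5672//"   # placeholder broker URL
      config_source:
        worker_concurrency: 4          # Celery worker processes per pod
        worker_prefetch_multiplier: 1  # keep one pod from hoarding queued tasks
```

With KEDA scaling pods on queue length, `worker_prefetch_multiplier: 1` is the usual starting point so tasks stay visible in the queue for the autoscaler.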
I have some big jobs like that, although my parallel tasks can take a couple of hours each. I'd start with relatively low concurrency and scale up slowly in tests. Watch your resource usage if your cluster has limited resources, in case CPU/memory pressure is contributing to the slowdowns. Filter out most of your non-critical logs from collection; you can configure your log level in your run config: https://docs.dagster.io/concepts/logging/python-logging#configuring-a-python-log-level- I don't have a lot of other ideas beyond that. I can imagine high parallelism with short tasks would result in a lot of Dagster events flying around needing processing, so I'd want to allocate a decent amount of CPU for the user code and webserver components.
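(From that docs page, the built-in console logger's level can be raised in the run config so DEBUG/INFO chatter never gets collected; a minimal sketch:)

```yaml
loggers:
  console:
    config:
      log_level: WARNING  # drop INFO/DEBUG noise from run logs
```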
Another thing to consider is the startup time of each sub-process or task in the pipeline. I don't know if you're planning to spin up a lot of jobs, or a single job that fans out, but it's something to look out for. It's not something that I cared about when I worked on my daily pipelines, but now that I'm working on a mini-batch that runs every 5 minutes I'm starting to notice it.
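(Not Dagster-specific, but the fixed per-task startup cost is easy to demonstrate with plain `multiprocessing`; `tiny_task` below is a hypothetical stand-in for a short unit of work, so what's being measured is essentially pure process-spawn overhead:)

```python
import time
from multiprocessing import Process


def tiny_task():
    """Stand-in for a short-lived unit of work (does nothing)."""
    pass


def timed_run(n):
    """Spawn n processes sequentially and return total wall time in seconds."""
    start = time.perf_counter()
    for _ in range(n):
        p = Process(target=tiny_task)
        p.start()
        p.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Even a no-op task pays the full process startup cost every time,
    # so fixed overhead dominates when the tasks themselves are short.
    n = 20
    total = timed_run(n)
    print(f"~{total / n * 1000:.1f} ms average startup overhead per task")
```

At 10k short tasks, even a modest per-task overhead adds up to minutes of pure spawning, which can look exactly like "slow and laggy".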