I'd like to ask about best practices for large DAGs.
A typical use case is fanning out thousands of independent tasks and running them in parallel. In my specific case, I have a parameter sweep that runs 10k ops across 50 workers (either k8s pods or plain Celery workers) in parallel; a simplified sketch of the fan-out is at the end of this post. However, I run into a few problems:
1. The Dagit UI becomes very slow even with the log button switched off (because the DAG graph itself is large) -> not a big deal, though, as I have a Prometheus + Grafana setup for monitoring.
2. See the screenshot below: the job is running fine, but the job status changed to the one shown after ~40 minutes. I can confirm from the worker status and various OS-level monitoring that those ops are in fact still being executed.
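
For reference, here is a minimal sketch of the kind of fan-out I mean, using Dagster's dynamic outputs (the op names, the parameter range, and the trivial per-point computation are placeholders, not my actual sweep):

```python
from dagster import DynamicOut, DynamicOutput, job, op


@op(out=DynamicOut())
def fan_out_params():
    # Emit one dynamic output per parameter in the sweep;
    # in the real job this is ~10k parameter combinations.
    for i in range(10_000):
        yield DynamicOutput(i, mapping_key=f"param_{i}")


@op
def run_sweep_point(param: int) -> float:
    # Placeholder for the actual work done for a single parameter.
    return param * 0.5


@job
def parameter_sweep():
    # Map the per-point op over every dynamic output, so the
    # executor (celery or k8s) spreads 10k independent ops
    # across the available workers.
    fan_out_params().map(run_sweep_point)
```

With this shape, every mapped op shows up as its own node, which is what makes the graph in Dagit so large.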