# ask-community
w
I'd like to ask what's the best practice for large DAGs? A typical use case is fanning out thousands of independent tasks and running them in parallel. In my particular case, I have a parameter sweep that runs 10k ops across 50 workers (either k8s pods or Celery workers) in parallel. However, I've run into a few problems:

1. The Dagit UI becomes extremely slow even with the log view switched off, because the DAG graph itself is large. Not a big deal, though, since I have a Prometheus + Grafana setup for monitoring.
2. See the screenshot below: the job is running fine, but its status changed to `Failure` after ~40 minutes. I can confirm from the worker status and various OS-level monitoring that those ops are in fact still being executed.
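The fan-out pattern described above maps to Dagster's dynamic outputs. Below is a minimal sketch, not the asker's actual code: the op names and the squaring workload are hypothetical stand-ins, and the sweep is shrunk to 100 parameters.

```python
from dagster import DynamicOut, DynamicOutput, job, op


@op(out=DynamicOut(int))
def emit_sweep_params():
    # One dynamic output per parameter set (10k in the original
    # use case; 100 here to keep the example small).
    for i in range(100):
        yield DynamicOutput(i, mapping_key=f"param_{i}")


@op
def run_experiment(param: int) -> int:
    # Stand-in for the real per-parameter work.
    return param * param


@op
def summarize(results: list) -> int:
    # Fan back in over all mapped results.
    return sum(results)


@job
def parameter_sweep():
    results = emit_sweep_params().map(run_experiment)
    summarize(results.collect())
```

Per-run parallelism would then be capped at the executor level rather than in the graph, e.g. via the multiprocess executor's `max_concurrent` run config; the Celery and k8s executors manage concurrency through their own settings.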
c
Hi William, thanks for reporting this issue. Which button are you referring to when you say you "switch off the log button"?
Yep, the issue you linked to is the cause of the UI slowness: when the run page is loaded in Dagit, all of the run's logs are loaded into memory.
For the job status changing to failure, would you mind sending over the debug logs for the job? You can download them at the top of the run page:
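If the run page itself is too slow to load, the same information can be pulled programmatically. A hedged sketch, assuming the script runs on the machine hosting the Dagster instance (with `DAGSTER_HOME` set) and using a made-up run ID:

```python
from dagster import DagsterInstance

# Connect to the local Dagster instance (reads DAGSTER_HOME).
instance = DagsterInstance.get()

RUN_ID = "4a2b6c8d"  # hypothetical; substitute the failing run's ID

# Check the status stored for the run.
run = instance.get_run_by_id(RUN_ID)
print(run.status)

# Walk the raw event log for the run; this is the data the
# debug export bundles up.
for entry in instance.all_logs(RUN_ID):
    print(entry.timestamp, entry.dagster_event_type, entry.user_message)
```

There is also, if memory serves, a `dagster debug export <run_id> <file>` CLI that produces the same gzipped debug file as the UI download; treat both of these as assumptions to verify against your Dagster version.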
w
Bump on this