I'd like to ask what's the best practice for large DAGs? (dagster #ask-community)

William

08/09/2022, 6:56 AM

I'd like to ask what's the best practice for large DAGs. A typical use case is to fan out thousands of independent tasks and run them in parallel. In my case, I have a parameter sweep that runs 10k ops on 50 workers in parallel (these could be k8s pods or just Celery workers). However, I've run into a few problems:
1. The Dagit UI becomes very slow even if I switch off the log button (because the DAG graph itself is large). This is not a big deal, though, as I have a Prometheus + Grafana setup for monitoring.
2. As the screenshot below shows, the job was running well, but the job status changed to Failure after ~40 minutes. I could confirm from the worker status and various OS monitoring that the ops were indeed still being executed.
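For readers unfamiliar with this workload shape, here is a minimal plain-Python sketch of a 10k-task fan-out over 50 workers. It is not Dagster code and not William's job: `run_one`, `run_sweep`, and the integer parameter grid are hypothetical stand-ins, and a thread pool is assumed in place of k8s pods or Celery workers.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_TASKS = 10_000   # sweep size mentioned in the message above
NUM_WORKERS = 50     # worker count mentioned in the message above


def run_one(param: int) -> int:
    """Hypothetical stand-in for one independent op in the sweep."""
    return param * param


def run_sweep(params, max_workers=NUM_WORKERS):
    """Fan out one task per parameter and gather results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps at most max_workers tasks running at once and
        # yields results in the same order as the inputs.
        return list(pool.map(run_one, params))


if __name__ == "__main__":
    results = run_sweep(range(NUM_TASKS))
    print(len(results))
```

Each task is independent of the others, which is why the corresponding DAG is very wide (one node per parameter) rather than deep.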

William

08/09/2022, 9:37 AM

https://github.com/dagster-io/dagster/issues/7821 might be related

claire

08/09/2022, 6:07 PM

Hi William, thanks for reporting this issue. Which log button are you referring to by "switching off the log button"?

claire

08/09/2022, 6:20 PM

Yep, the issue you linked to is the cause of the UI being slow. When the run page is loaded in Dagit, all logs are loaded into memory.

claire

08/09/2022, 6:46 PM

As for the job status changing to Failure, would you mind sending over the debug logs for the job? You can access them at the top of the run page.

William

08/10/2022, 2:48 PM

0a5ac408-653c-4747-bd4a-940a550fe1f6.gz

William

09/23/2022, 2:41 AM

Bump on this
