# announcements
n
Suppose I have around 5000 users, and each user has a specific task to run at a specific time, so a scheduler needs to be created. There are two approaches:

1. Create a separate scheduler per user for the same task. This isn't working: each scheduler's run takes so long that it doesn't seem feasible.
2. Create a single scheduler that requests multiple runs of the same task, one per user's config. This works great, but after some time it stops sending pipeline tasks to the queue, with a gRPC error that includes:
   - grpc_status:14
   - failed to connect to all addresses
   - Failed to pick subchannel
   - Failed to fetch execution plan for <scheduler>
   - StatusCode.UNAVAILABLE

What would be a better approach to handle such a task?
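For reference, approach 2 maps naturally onto a single Dagster schedule that yields one RunRequest per user. A minimal sketch, assuming hypothetical names throughout (fetch_users, user_task_pipeline, run_user_task, and the config shape are all illustrative):

```python
from dagster import RunRequest, schedule


def fetch_users():
    # Hypothetical helper: load the ~5000 user configs from wherever they live.
    return [{"id": i} for i in range(5000)]


@schedule(cron_schedule="0 * * * *", pipeline_name="user_task_pipeline")
def user_task_schedule(context):
    # A single schedule tick requests one run per user.
    for user in fetch_users():
        yield RunRequest(
            # A distinct run_key identifies each requested run.
            run_key=f"user-{user['id']}-{context.scheduled_execution_time}",
            run_config={
                "solids": {"run_user_task": {"config": {"user_id": user["id"]}}}
            },
        )
```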
d
Hi, do you have a full stack trace for the issue you’re running into with option 2?
If the issue comes from launching many runs at the same time, the features here may help: https://docs.dagster.io/overview/pipeline-runs/limiting-run-concurrency
I’d be curious if the problem goes away if you set a limit on the number of runs that can be happening at once
(of the two, #2 seems like the better option to me once we sort out the issue you're running into - 5000 schedules is a lot to manage)
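From the linked docs, that limit is configured on the instance's run coordinator in dagster.yaml; a minimal sketch, with an illustrative limit value:

```yaml
# dagster.yaml
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    # Upper bound on runs the daemon will dequeue and launch at once.
    max_concurrent_runs: 25
```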
n
Yeah, I've set the limit so that as few pipelines as possible run at once. But the failure happens while the scheduler is adding pipelines to the queue: it adds around 1500 pipelines, and then every subsequent pipeline that should have been added just fails with the gRPC error.
The current run limit for pipelines of this type is set to 2, and Dagit confirms that at most 2 are running at a time.
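For reference, a per-type limit of 2 like the one described can be expressed with tag_concurrency_limits on the QueuedRunCoordinator. The tag key and value below are hypothetical, and each run would need to carry the matching tag (e.g. via the tags argument of RunRequest):

```yaml
# dagster.yaml
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    tag_concurrency_limits:
      # At most 2 queued runs tagged pipeline_type=user_task run at once.
      - key: "pipeline_type"
        value: "user_task"
        limit: 2
```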
d
Got it - if you can send the full stack trace, that will help a lot with understanding what's going on. How many RunRequests are you returning from your schedule function at once?
n
5000
As for the stack trace, I don't have it at the moment; I'll send it by tomorrow or so. One thing I do remember (in addition to the error info above) is that the error comes from a function named create_pipeline_run or something similar.
d
Got it - yeah, the full stack trace will be really useful for understanding the exact place you're hitting a scaling limit. If it's in create_pipeline_run, that may indicate you should switch to a Postgres database for your Dagster storage instead of the default SQLite storage.
If you can provide the full logs from the daemon over the period when you were experiencing the issue, that would be the most useful.
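For reference, switching Dagster storage to Postgres is an instance-level change in dagster.yaml. A sketch with placeholder connection values, assuming the dagster-postgres package is installed:

```yaml
# dagster.yaml -- point run, event log, and schedule storage at Postgres
# instead of the default SQLite files (connection values are placeholders).
run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      username: dagster
      password: { env: DAGSTER_PG_PASSWORD }
      hostname: postgres.example.com
      db_name: dagster

event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      username: dagster
      password: { env: DAGSTER_PG_PASSWORD }
      hostname: postgres.example.com
      db_name: dagster

schedule_storage:
  module: dagster_postgres.schedule_storage
  class: PostgresScheduleStorage
  config:
    postgres_db:
      username: dagster
      password: { env: DAGSTER_PG_PASSWORD }
      hostname: postgres.example.com
      db_name: dagster
```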
n
Yeah, I'll send you the logs by tomorrow in the same thread. Meanwhile I'll test whether shifting to Postgres helps. Thanks a lot
d
np! sorry for the trouble with the system - reports like this are very helpful for us in identifying these bottlenecks.
n
No problem at all. The system works flawlessly and that's a great thing. Thanks
s
flawless might be hard to live up to 🙂 thanks for the kind words!
😅 1
d
Believe we found the issue here! Schedules or sensors that took more than 2 minutes to submit their runs were running into this problem. Fix should be going out today in the 0.11.0 release.
👍 3
n
Wow. That's great. Thanks a lot.