# announcements
n
Suppose I have around 5000 users, and each user has a specific task to run at a specific time, so a scheduler needs to be created. There are two approaches:

1. Create a separate scheduler per user for the same task. This isn't working: each scheduler's run takes so long that it doesn't seem feasible.
2. Create a single scheduler that requests multiple runs of the same task, one per user's config. This works great, but after some time it stops sending pipeline tasks to the queue, with a gRPC error that includes:
   - grpc_status:14
   - failed to connect to all addresses
   - Failed to pick subchannel
   - Failed to fetch execution plan for <scheduler>
   - StatusCode.UNAVAILABLE

What would be a better approach to handle such a task?
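For reference, approach 2 maps naturally onto a single Dagster schedule that yields one RunRequest per user. A minimal sketch, assuming hypothetical names throughout (fetch_users, user_task_pipeline, run_user_task, and the config shape are all illustrative):

```python
from dagster import RunRequest, schedule


def fetch_users():
    # Hypothetical helper: load the ~5000 user configs from wherever they live.
    return [{"id": i} for i in range(5000)]


@schedule(cron_schedule="0 * * * *", pipeline_name="user_task_pipeline")
def user_task_schedule(context):
    # A single schedule tick requests one run per user.
    for user in fetch_users():
        yield RunRequest(
            # A distinct run_key identifies each requested run.
            run_key=f"user-{user['id']}-{context.scheduled_execution_time}",
            run_config={
                "solids": {"run_user_task": {"config": {"user_id": user["id"]}}}
            },
        )
```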
d
Hi, do you have a full stack trace for the issue you’re running into with option 2?
If the issue comes from launching many runs at the same time, the features here may help: https://docs.dagster.io/overview/pipeline-runs/limiting-run-concurrency
I’d be curious if the problem goes away if you set a limit on the number of runs that can be happening at once
(of the two, #2 seems like the better option to me once we sort out the issue you're running into - 5000 schedules is a lot to manage)
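From the linked docs, that limit is configured on the instance's run coordinator in dagster.yaml; a minimal sketch, with an illustrative limit value:

```yaml
# dagster.yaml
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    # Upper bound on runs the daemon will dequeue and launch at once.
    max_concurrent_runs: 25
```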
n
Yeah, I've set the limit so that as few pipelines as possible run at once. But the failure happens while the scheduler is adding pipelines to the queue: it adds around 1500 pipelines, and then every subsequent pipeline that should have been added just fails with the gRPC error.
The current run limit for pipelines of this type is set to 2, and Dagit confirms that at most 2 are running at a time.
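For reference, a per-type limit of 2 like the one described can be expressed with tag_concurrency_limits on the QueuedRunCoordinator. The tag key and value below are hypothetical, and each run would need to carry the matching tag (e.g. via the tags argument of RunRequest):

```yaml
# dagster.yaml
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    tag_concurrency_limits:
      # At most 2 queued runs tagged pipeline_type=user_task run at once.
      - key: "pipeline_type"
        value: "user_task"
        limit: 2
```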
d
Got it - if you can send the full stack trace, that will help a lot with understanding what's going on. How many RunRequests are you returning from your schedule function at once?
n
5000
As for the stack trace, I don't have it at the moment; I'll send it by tomorrow or so. One thing I do remember (in addition to the error info above) is that the error comes from a function named create_pipeline_run or something similar.
d
Got it - yeah, the full stack trace will be really useful for understanding the exact place you're hitting a scaling limit. If it's in create_pipeline_run, that may indicate you should switch to a Postgres database for your Dagster storage instead of the default SQLite storage.
If you can provide the full logs from the daemon over the period when you were experiencing the issue, that would be the most useful.
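For reference, switching Dagster storage to Postgres is an instance-level change in dagster.yaml. A sketch with placeholder connection values, assuming the dagster-postgres package is installed:

```yaml
# dagster.yaml -- point run, event log, and schedule storage at Postgres
# instead of the default SQLite files (connection values are placeholders).
run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      username: dagster
      password: { env: DAGSTER_PG_PASSWORD }
      hostname: postgres.example.com
      db_name: dagster

event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      username: dagster
      password: { env: DAGSTER_PG_PASSWORD }
      hostname: postgres.example.com
      db_name: dagster

schedule_storage:
  module: dagster_postgres.schedule_storage
  class: PostgresScheduleStorage
  config:
    postgres_db:
      username: dagster
      password: { env: DAGSTER_PG_PASSWORD }
      hostname: postgres.example.com
      db_name: dagster
```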
n
Yeah, I'll send you the logs by tomorrow in the same thread. Meanwhile I'll test whether shifting to Postgres helps. Thanks a lot
d
np! sorry for the trouble with the system - reports like this are very helpful for us in identifying these bottlenecks.
n
No problem at all. The system works flawlessly and that's a great thing. Thanks
s
flawless might be hard to live up to 🙂 thanks for the kind words!
😅 1
d
Believe we found the issue here! Schedules or sensors that took more than 2 minutes to submit their runs were running into this problem. Fix should be going out today in the 0.11.0 release.
👍 3
n
Wow. That's great. Thanks a lot.