#ask-community

Eegan K

08/31/2022, 3:37 PM
Hey everyone. I have a QueuedRunCoordinator that limits how many runs of a specific op I can have at one time (which effectively limits the number of jobs). This is because I have a sensor that kicks off hundreds (possibly thousands in the future) of jobs at once. I’m currently running into an error where each job in the queue fails due to a “DagsterRepositoryLocationLoadError”, which itself is caused by an “_InactiveRpcError”. Any ideas how I could fix this?
I know that this isn't a ton of information, so please let me know what you need. One thing that might be relevant from googling: I am behind a corporate proxy, but not sure if this is relevant because the jobs do start and queue fine, and this consistently happens.
I’m about 90% sure that this error did not occur a few versions back (0.14?)

jamie

08/31/2022, 3:50 PM
@johann any ideas on this?

Dong Kim

08/31/2022, 3:52 PM
I may be wrong, but it seems to me that you specify `grpc_server` in `workspace.yaml` to load a repository. So the `grpc` server might not have started as expected, and therefore the grpc client cannot connect and load the user repository code.

johann

08/31/2022, 3:52 PM
`_InactiveRpcError` means the grpc server for your repository location didn’t respond

Eegan K

08/31/2022, 3:53 PM
Hmm ok. What's confusing me is that the grpc server does start and is used by the jobs actually running (not in the queue)

johann

08/31/2022, 3:54 PM
What deployment are you using?

Eegan K

08/31/2022, 3:54 PM
What do you mean? At the moment I’m working on a single, self-hosted machine.

johann

08/31/2022, 3:54 PM
And using the DefaultRunLauncher?

Eegan K

08/31/2022, 3:55 PM
yes

johann

08/31/2022, 3:55 PM
> is used by the jobs actually running (not in the queue)
Does the rpc error not happen every time? How did those jobs manage to start running?

Eegan K

08/31/2022, 3:56 PM
It does happen every time. I think it occurs when around 200 jobs are in the queue?
maybe more or less

johann

08/31/2022, 3:56 PM
One possibility is that with everything on one box, the grpc server is getting starved and doesn’t respond to new requests. This happens once you have a lot of runs in progress?

Eegan K

08/31/2022, 3:56 PM
yeah, 20-40
it's a beefy box but I know it's not beefy enough long term

johann

08/31/2022, 3:57 PM
You could try setting a lower limit on the number of runs and see if that helps the error
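For reference, the run limit is set in the `run_coordinator` section of `dagster.yaml`. A minimal sketch, assuming the module path documented around the 0.15/1.0 releases; the numeric limits and the tag key/value are placeholders, not values from this thread:

```yaml
# dagster.yaml (sketch; all limits and the tag key/value are hypothetical)
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    # Cap on runs in progress across the whole instance.
    max_concurrent_runs: 10
    # Optional per-tag cap, e.g. for runs that the sensor tags.
    tag_concurrency_limits:
      - key: "heavy_op"   # hypothetical tag key
        value: "true"     # hypothetical tag value
        limit: 5
```

Stepping `max_concurrent_runs` down until the error disappears would indicate whether the grpc server is being starved by the runs in progress.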

Eegan K

08/31/2022, 3:58 PM
ok! a bit confused because the error didn't occur before, but will definitely try that and get back to you!

johann

08/31/2022, 3:58 PM
Long term, if you want more parallelism you’ll want to look into other options for run launchers that kick off new containers

Dong Kim

08/31/2022, 3:58 PM
If the number of jobs is the issue, you could lower the number of jobs or run the grpc server in a separate container.
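A rough sketch of what pointing the workspace at a separately running grpc server could look like. The host, port, file name, and location name are all assumptions for illustration; nothing in this thread confirms the actual setup:

```yaml
# workspace.yaml (sketch; host, port, and location_name are hypothetical)
# Assumes the code server was started separately, for example with:
#   dagster api grpc --python-file repo.py --host 0.0.0.0 --port 4266
load_from:
  - grpc_server:
      host: localhost
      port: 4266
      location_name: "my_repository"
```

With this layout the server process can be given its own resources (or its own container), so it is less likely to compete with run worker processes on the same box.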

Eegan K

08/31/2022, 3:58 PM
Yeah, a k8s setup is possible for me, but I was using this for the short term.
will definitely look into the run launchers! thanks!