gRPC DEADLINE_EXCEEDED when starting a pipeline (Dagster 0.11.7)

paul.q

05/06/2021, 10:31 PM
Hi all. We've just stood up a new environment and installed Dagster 0.11.7 on it. The server is AWS Win2019 behind an ALB, with IIS as a reverse proxy in front of Dagit so we can restrict access to authorised users only. We have similar 0.10.9 environments running OK (minus the ALB). After deploying our app and trying to run a pipeline, we got this:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "{"created":"@1620302964.358000000","description":"Error received from peer ipv6:[::1]:54632","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>

  File "c:\program files\python37\lib\site-packages\dagster\grpc\client.py", line 350, in start_run
    serialized_execute_run_args=serialize_dagster_namedtuple(execute_run_args),
  File "c:\program files\python37\lib\site-packages\dagster\grpc\client.py", line 89, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "c:\program files\python37\lib\site-packages\grpc\_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "c:\program files\python37\lib\site-packages\grpc\_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
Does anyone know an obvious cause for this? I can investigate the AWS EC2 security group rules between the ALB and the server if it comes down to that. Thanks, Paul

daniel

05/06/2021, 10:34 PM
hi paul - do you have any logs from the gRPC server that you're executing the runs on? it looks like it's timing out, so there might be some clues on the server as to why

paul.q

05/06/2021, 10:43 PM
We're not using our own gRPC server, just relying on the default process. Is there a place I should look for the logs? The only logs I can see are the event logs. Unfortunately we've wrapped Dagit as a Windows service and we're not capturing stdout or stderr.

daniel

05/06/2021, 11:56 PM
Ah, my mistake. Are there any logs earlier in the output with any errors? It’s strange that just this one call would time out; there should have been other calls earlier in order for the pipelines to show up in Dagit, for example.

paul.q

05/07/2021, 12:58 AM
No, this is basically on pipeline start. The stack trace I've sent is the first in the chain.

daniel

05/07/2021, 3:39 AM
Very strange. We did add a 60-second timeout on gRPC calls in the 0.11.6 release, but 60 seconds should be plenty of time to start a run. One thing you could do, if it's not too much trouble, is see whether the 0.11.5 release has the same problem. If it doesn't, that would be a pretty big clue that the timeouts are having an unexpected effect in this environment for some reason.
One other question: is there anything in the event log for the run? I imagine it's stuck in a STARTING state if the start_run call is timing out?
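The timeout mechanism at work here can be sketched without Dagster at all: a client-side deadline turns any slow call into an error, regardless of whether the server eventually would have answered. This is a minimal illustration of that pattern using a worker thread, not Dagster's actual gRPC client code; `slow_start_run` is a stand-in for a `start_run` RPC that is slow to respond.

```python
import concurrent.futures
import time

def call_with_deadline(fn, timeout_seconds):
    """Run fn in a worker thread; raise TimeoutError if it misses the deadline."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(fn).result(timeout=timeout_seconds)

def slow_start_run():
    time.sleep(0.2)  # stand-in for an RPC that takes a while to answer
    return "run started"

print(call_with_deadline(slow_start_run, 1.0))   # generous deadline: succeeds
try:
    call_with_deadline(slow_start_run, 0.05)     # tight deadline: times out
except concurrent.futures.TimeoutError:
    print("deadline exceeded")
```

The point of testing 0.11.5 is exactly this: if the same call succeeds once the deadline is gone, the environment is merely slow (e.g. proxy or DNS resolution on localhost), not broken.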

Rebecca Sevaites

05/12/2021, 4:55 PM
I actually had this error pop up today as well. Unfortunately, I didn't save the logs and it hasn't reproduced yet. I'm standing up my service on a Kubernetes cluster and to fix it, I scaled the pods down to zero and then back up to one. After that, it was working as expected.
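For anyone hitting this on Kubernetes, the bounce described above can be done with two commands. The deployment name `dagit` here is an assumption; substitute whatever your release actually calls it.

```shell
# Scale the pods down to zero, then back up to one, to get a fresh pod.
kubectl scale deployment/dagit --replicas=0
kubectl scale deployment/dagit --replicas=1
kubectl get pods   # confirm the new pod reaches Running
```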

paul.q

05/13/2021, 3:58 AM
For me the error is probably due to some issue we had with running dagit as a Windows service. Not sure if it's a permission or path issue. When I ran dagit and dagster-daemon in the foreground under the correct account, everything was OK. Happy for it to be closed.