Using Docker compose and trying to increase gRPC code server timeout
# ask-community
n
Using Docker compose and trying to increase the gRPC code server timeout to something above 60 seconds. In the dagster.yaml available to the Dagster daemon and Dagit containers, I've included the following:
Copy code
code_servers:
  local_startup_timeout: 180
I even tried it in another dagster.yaml in the user code image, but nothing works. No matter what I try, I still see the error
dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 60 seconds to complete.
Any tips?
d
Hey Nicolas - where exactly are you seeing that error, does it come with a stack trace? That timeout covers starting up the servers / loading your code, not requests to the servers in general
n
Yes thanks! From the Dagster daemon container:
Copy code
docker_daemon_local               | 2023-03-28 17:10:00 +0000 - dagster.daemon.SchedulerDaemon - WARNING - Could not load location user_code_beam_dw_dagster to check for schedules due to the following error: dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 60 seconds to complete.
docker_daemon_local               |
docker_daemon_local               | Stack Trace:
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py", line 599, in _load_location
docker_daemon_local               |     location = self._create_location_from_origin(origin)
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py", line 519, in _create_location_from_origin
docker_daemon_local               |     return origin.create_location()
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_core/host_representation/origin.py", line 332, in create_location
docker_daemon_local               |     return GrpcServerRepositoryLocation(self)
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_core/host_representation/repository_location.py", line 637, in __init__
docker_daemon_local               |     self._external_repositories_data = sync_get_streaming_external_repositories_data_grpc(
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_api/snapshot_repository.py", line 25, in sync_get_streaming_external_repositories_data_grpc
docker_daemon_local               |     external_repository_chunks = list(
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 351, in streaming_external_repository
docker_daemon_local               |     for res in self._streaming_query(
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 186, in _streaming_query
docker_daemon_local               |     self._raise_grpc_exception(
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 137, in _raise_grpc_exception
docker_daemon_local               |     raise DagsterUserCodeUnreachableError(
docker_daemon_local               |
docker_daemon_local               | The above exception was caused by the following exception:
docker_daemon_local               | grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
docker_daemon_local               | 	status = StatusCode.DEADLINE_EXCEEDED
docker_daemon_local               | 	details = "Deadline Exceeded"
docker_daemon_local               | 	debug_error_string = "{"created":"@1680023292.241437667","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
docker_daemon_local               | >
docker_daemon_local               |
docker_daemon_local               | Stack Trace:
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 182, in _streaming_query
docker_daemon_local               |     yield from self._get_streaming_response(
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 171, in _get_streaming_response
docker_daemon_local               |     yield from getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 426, in __next__
docker_daemon_local               |     return self._next()
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 826, in _next
docker_daemon_local               |     raise self
docker_daemon_local               |
The code location has a huge dbt project in it
d
That’s an odd one - typically if the code is taking a long time to load the timeout would happen earlier. What version of dagster is this?
n
1.1.21
d
How huge a project are we talking?
n
~1800 models... there's also an upstream Fivetran instance with ~30 connectors
d
Could you possibly share your workspace.yaml file or pyproject.toml file?
n
Copy code
load_from:
  # Each entry here corresponds to a service in the docker-compose file that exposes user code.
  - grpc_server:
      host: docker_user_code_beam_dw_dagster
      port: 4004
      location_name: "user_code_beam_dw_dagster"
d
Ahh ok, so that's another reason that local_startup_timeout isn't going to help here - the servers aren't actually considered 'local' when you're running in docker. Let's see...
Is the user code container running?
n
Ahhhhh... okay
Yes
d
It has a line like "Started Dagster code server on port 4004 in process 7283" in its container output?
n
Copy code
docker_user_code_beam_dw_dagster  | 2023-03-28 17:26:03 +0000 - dagster - INFO - Started Dagster code server for package dagster_integrations on port 4004 in process 1
d
Got it - so after that happens, the daemon is still spewing that error? I could imagine it struggling while the user code container is starting up - the intended behavior would be that the error stops within a minute or so after the user code container finishes loading
n
And it's a little unpredictable... after a while it sometimes does load... but then it disappears from the Dagit UI
d
I'd check the Code Locations tab in Dagit when that happens, there might be an error there
(potentially this same error)
n
I get the error in Dagit code locations tab and Docker logs
d
ok, once we've gone through all that - there is a DAGSTER_GRPC_TIMEOUT_SECONDS env var that you could try setting to something like 180 in the dagit and daemon containers. We're usually hesitant to recommend that first since in many cases there's some underlying issue (the default of 60 seconds is already quite long). It's possible that your dbt project is large enough that it's actually taking more than 60 seconds just to stream all the data over, though - I'd be curious whether things get more stable after this or whether the sheer size of the data causes other problems
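For example, a minimal docker-compose.yaml sketch of that (the dagit service name and the images are placeholders; the only thing that matters here is the DAGSTER_GRPC_TIMEOUT_SECONDS environment variable on both services):
Copy code
services:
  docker_dagit_local:                      # placeholder name for the dagit service
    image: my-dagster-image:latest         # placeholder image
    environment:
      DAGSTER_GRPC_TIMEOUT_SECONDS: "180"  # timeout for gRPC calls to code servers, in seconds
  docker_daemon_local:                     # daemon service name from the logs above
    image: my-dagster-image:latest         # placeholder image
    environment:
      DAGSTER_GRPC_TIMEOUT_SECONDS: "180"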
n
I'll give this a shot rn
FWIW I've set
display_raw_sql=False
in
load_assets_from_dbt_manifest
to try to speed things up
d
Ah yes - maybe try that first
n
Sorry... I already have 🙃
d
waiting >60 seconds each time it needs to reload doesn't sound very enjoyable even once it's no longer timing out
n
Like a couple days ago
d
ah unfortunate
n
doesn't sound very enjoyable even once it's no longer timing out
You are correct
d
If you're comfortable sending us your metadata I could give you a script that prints out the serialized object that it's trying to fetch - we could see if there are other improvements like display_raw_sql that we need to make for these large asset graphs
n
Thanks! That's really generous... I'll hafta check with my team. Question: "metadata" == manifest.json?
d
It's basically 'what's displayed in dagit for your assets'
for your asset graph, rather
n
Env var
DAGSTER_GRPC_TIMEOUT_SECONDS
seems to've worked better
Sorry... just trying to clarify, our dbt target/manifest.json would be helpful? Is that right?
(Like what specifically do you mean by "metadata"?)
d
Oh I see - what would be helpful is for us to see the object that this particular API call is fetching (derived from your code and your manifest.json), since that's what's taking so long to generate
n
Ah okay... how would I retrieve that?
d
I was going to give you a quick script that basically runs the API call that's taking forever and prints the serialized object to stdout
n
That'd be great thx!
And again... thanks so much for your help on this
d
no problem - what's the name of your repository (if you're not using Definitions)?
n
I'm using Definitions
d
gotcha
Ok, here's a quick script that should work when run from your daemon or dagit container:
Copy code
from dagster._core.host_representation.origin import (
    GrpcServerRepositoryLocationOrigin,
)

# Host/port of the user code gRPC server, matching workspace.yaml.
host = "docker_user_code_beam_dw_dagster"
port = 4004

origin = GrpcServerRepositoryLocationOrigin(host=host, port=port)

# Creating the location performs the same repository-data fetch that dagit and the daemon do.
location = origin.create_location()

# Print the serialized repository data for each repository in the location.
for repo_name, external_repo in location.get_repositories().items():
    print(f"{repo_name}:")
    print(str(external_repo.external_repository_data))
(with that timeout increased so the call actually finishes)
n
I'll give it a shot, check with my team, and if we're all good I'll circle back to this thread if that's the best way
d
For sure - entirely optional and you're of course welcome to DM or email it to daniel@elementl.com instead of posting here
n
Oh great... Thanks!
@daniel Just giving you a heads-up that I sent a follow-up email a few seconds ago from nicolas.may@[my_employer_domain]... Thx!
d
Thanks nicolas - how's it performing after that timeout increase, are you seeing any other issues?
n
Totally missed your reply... sorry! The timeout increase has definitely improved things... and we scaled up the VM running the Docker compose deployment in production... so that's helped too... We still get InactiveRpcErrors on a pretty regular cadence, and I'm not sure how to improve/fix that
d
Can you post a stack trace for one of the latest InactiveRpcErrors?
n
Will do in a few min... thx!
Copy code
/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py:602: UserWarning: Error loading repository location user_code_beam_dw_dagster:dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE

Stack Trace:
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py", line 599, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py", line 519, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/host_representation/origin.py", line 332, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/host_representation/repository_location.py", line 603, in __init__
    list_repositories_response = sync_list_repositories_grpc(self.client)
d
Got it - and that's in the dagit container it looks like?
I'd associate that particular error with the user code container being down / entirely unreachable
n
I think it's coming from the Dagster daemon container... Fluent Bit is a sidecar container shipping the logs to Datadog
d
ok, if it's the daemon, one nice thing is that it'll periodically refresh - so if the code location is temporarily down, the daemon should be able to load it and try again in about a minute
n
And ya... that error definitely happens when things are first spinning up... but it also happens intermittently
d
and it will pause executing things until it's available again
would you know from your logs if the user code container went down?
n
Yes! That's great... love this feature
would you know from your logs if the user code container went down?
Hmm... I don't think I've set that up
I'll add this... we've got 2 user code repos... the other one that we've been running in prod for a few months isn't this noisy
d
could it be running out of memory or something possibly?
n
It's far less complex... has a bunch of op graphs but it's not a big hairy dependency ball like this problem repo w/ Fivetran + dbt + Census
Maybe it's a memory problem... we've just upgraded the VM... GCP CE e2-standard-4... 4 vCPUs and 16 GB mem
I thought that'd hack it...
• Dagster daemon container
• Dagit container
• 1 simple user code container
• 1 big gnarly user code container
• Fluent Bit sidecar
Morning @daniel... I just wanted to follow up on this. The reason this user code container was going down every few minutes and then coming back up was that (1) I'd set
--heartbeat-timeout 1200
(20 min) on the dagster api grpc command in docker-compose.yaml to try to fix the gRPC timeout problem, but hadn't removed the heartbeat flags, and (2) the docker compose service restart policy is
unless-stopped
. Only by looking at the service's logs and noticing repeated cycles of
Started Dagster code server for package ...
and
Shutting down Dagster code server for package ...
exactly 20 minutes apart did it occur to me that the heartbeat timeout flag was causing the problem. Once I got rid of the heartbeat timeout, this user code container (after it takes a while to spin up) works without a hitch... Thanks for helping me out with this!
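For anyone landing here later, a sketch of what the fixed user code service looks like in docker-compose.yaml (the image name is illustrative and the load flags depend on your setup; the key points are that the dagster api grpc command no longer passes --heartbeat / --heartbeat-timeout and the restart policy stays unless-stopped):
Copy code
services:
  docker_user_code_beam_dw_dagster:
    image: user_code_beam_dw:latest   # illustrative image name
    restart: unless-stopped
    # Note: no --heartbeat / --heartbeat-timeout flags on the command below, so the
    # code server is no longer shut down every 20 minutes and bounced by the
    # unless-stopped restart policy. The --package-name matches the package from the
    # startup log; use whatever load flags your image actually needs.
    command: >
      dagster api grpc
      --host 0.0.0.0
      --port 4004
      --package-name dagster_integrations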