# ask-community
n

Nicolas May

03/28/2023, 5:13 PM
Using Docker Compose and trying to increase the gRPC code server timeout to something above 60 seconds. In the dagster.yaml available to the Dagster daemon and Dagit containers, I've included the following:
Copy code
code_servers:
  local_startup_timeout: 180
I even tried it in another dagster.yaml in the user code image, but nothing works. No matter what I try, I still see the error
dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 60 seconds to complete.
Any tips?
d

daniel

03/28/2023, 5:22 PM
Hey Nicolas - where exactly are you seeing that error? Does it come with a stack trace? That timeout covers starting up the servers / loading your code, not requests to the servers in general
n

Nicolas May

03/28/2023, 5:24 PM
Yes thanks! From the Dagster daemon container:
Copy code
docker_daemon_local               | 2023-03-28 17:10:00 +0000 - dagster.daemon.SchedulerDaemon - WARNING - Could not load location user_code_beam_dw_dagster to check for schedules due to the following error: dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 60 seconds to complete.
docker_daemon_local               |
docker_daemon_local               | Stack Trace:
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py", line 599, in _load_location
docker_daemon_local               |     location = self._create_location_from_origin(origin)
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py", line 519, in _create_location_from_origin
docker_daemon_local               |     return origin.create_location()
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_core/host_representation/origin.py", line 332, in create_location
docker_daemon_local               |     return GrpcServerRepositoryLocation(self)
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_core/host_representation/repository_location.py", line 637, in __init__
docker_daemon_local               |     self._external_repositories_data = sync_get_streaming_external_repositories_data_grpc(
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_api/snapshot_repository.py", line 25, in sync_get_streaming_external_repositories_data_grpc
docker_daemon_local               |     external_repository_chunks = list(
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 351, in streaming_external_repository
docker_daemon_local               |     for res in self._streaming_query(
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 186, in _streaming_query
docker_daemon_local               |     self._raise_grpc_exception(
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 137, in _raise_grpc_exception
docker_daemon_local               |     raise DagsterUserCodeUnreachableError(
docker_daemon_local               |
docker_daemon_local               | The above exception was caused by the following exception:
docker_daemon_local               | grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
docker_daemon_local               | 	status = StatusCode.DEADLINE_EXCEEDED
docker_daemon_local               | 	details = "Deadline Exceeded"
docker_daemon_local               | 	debug_error_string = "{"created":"@1680023292.241437667","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
docker_daemon_local               | >
docker_daemon_local               |
docker_daemon_local               | Stack Trace:
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 182, in _streaming_query
docker_daemon_local               |     yield from self._get_streaming_response(
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 171, in _get_streaming_response
docker_daemon_local               |     yield from getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 426, in __next__
docker_daemon_local               |     return self._next()
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 826, in _next
docker_daemon_local               |     raise self
docker_daemon_local               |
The code location has a huge dbt project in it
d

daniel

03/28/2023, 5:26 PM
That’s an odd one - typically if the code is taking a long time to load the timeout would happen earlier. What version of dagster is this?
n

Nicolas May

03/28/2023, 5:26 PM
1.1.21
d

daniel

03/28/2023, 5:26 PM
How huge a project are we talking?
n

Nicolas May

03/28/2023, 5:27 PM
~1800 models... there's also an upstream Fivetran instance with ~30 connectors
d

daniel

03/28/2023, 5:30 PM
Could you possibly share your workspace.yaml file or pyproject.toml file?
n

Nicolas May

03/28/2023, 5:31 PM
Copy code
load_from:
  # Each entry here corresponds to a service in the docker-compose file that exposes user code.
  - grpc_server:
      host: docker_user_code_beam_dw_dagster
      port: 4004
      location_name: "user_code_beam_dw_dagster"
d

daniel

03/28/2023, 5:32 PM
Ahh ok, so that's another reason that local_startup_timeout isn't going to help here - the servers aren't actually considered 'local' when you're running in Docker. Let's see...
Is the user code container running?
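For context on the distinction daniel is drawing here: code_servers.local_startup_timeout in dagster.yaml only applies when Dagit and the daemon launch the code server themselves from a python_file or python_module workspace entry; with a grpc_server entry like the one above, the server is managed externally and that setting never comes into play. A minimal workspace.yaml sketch of the two shapes (the module name below is illustrative, not taken from this deployment):

# Illustrative sketch only - the module name is a placeholder.
# Shape 1: Dagit / the daemon spawn the code server as a local subprocess;
# this is the case that code_servers.local_startup_timeout in dagster.yaml covers.
load_from:
  - python_module:
      module_name: dagster_integrations

# Shape 2: the workspace points at a server you run yourself (e.g. a separate
# docker-compose service running `dagster api grpc`); local_startup_timeout
# does not apply, and the relevant knob is the client-side gRPC timeout instead.
# load_from:
#   - grpc_server:
#       host: docker_user_code_beam_dw_dagster
#       port: 4004
#       location_name: "user_code_beam_dw_dagster"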
n

Nicolas May

03/28/2023, 5:32 PM
Ahhhhh... okay
Yes
d

daniel

03/28/2023, 5:33 PM
It has a line like "Started Dagster code server on port 4004 in process 7283" in its container output?
n

Nicolas May

03/28/2023, 5:34 PM
Copy code
docker_user_code_beam_dw_dagster  | 2023-03-28 17:26:03 +0000 - dagster - INFO - Started Dagster code server for package dagster_integrations on port 4004 in process 1
d

daniel

03/28/2023, 5:35 PM
Got it - so after that happens, the daemon is still spewing that error? I could imagine it struggling while the user code container is starting up - the intended behavior would be that the error stops within a minute or so after the user code container finishes loading
n

Nicolas May

03/28/2023, 5:35 PM
And it's a little unpredictable... after a while it sometimes does load... but then it disappears from the Dagit UI
d

daniel

03/28/2023, 5:35 PM
I'd check the Code Locations tab in Dagit when that happens, there might be an error there
(potentially this same error)
n

Nicolas May

03/28/2023, 5:35 PM
I get the error in Dagit code locations tab and Docker logs
d

daniel

03/28/2023, 5:37 PM
ok, once we've gone through all that - there is a DAGSTER_GRPC_TIMEOUT_SECONDS env var that you could try setting to something like 180 in the Dagit and daemon containers. We're usually hesitant to recommend that first, since in many cases there's some underlying issue (the default of 60 seconds is already quite long). It's possible that your dbt project is large enough that it's actually taking more than 60 seconds just to stream all the data over, though - I'd be curious whether things get more stable after this or whether the sheer size of the data causes other problems
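A hedged sketch of where that environment variable could be set in the Compose file; the service and image names below are placeholders rather than the ones from this deployment:

# Sketch only - service and image names are placeholders.
# DAGSTER_GRPC_TIMEOUT_SECONDS raises the client-side gRPC deadline (default 60s)
# in the processes that call the code server, i.e. Dagit and the daemon.
services:
  docker_dagit:
    image: my_dagster_image              # placeholder
    environment:
      DAGSTER_GRPC_TIMEOUT_SECONDS: "180"
  docker_daemon:
    image: my_dagster_image              # placeholder
    entrypoint: ["dagster-daemon", "run"]
    environment:
      DAGSTER_GRPC_TIMEOUT_SECONDS: "180"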
n

Nicolas May

03/28/2023, 5:37 PM
I'll give this a shot right now.
FWIW I've set display_raw_sql=False in load_assets_from_dbt_manifest to try to speed things up
d

daniel

03/28/2023, 5:42 PM
Ah yes - maybe try that first
n

Nicolas May

03/28/2023, 5:42 PM
Sorry... I already have 🙃
d

daniel

03/28/2023, 5:42 PM
waiting >60 seconds each time it needs to reload doesn't sound very enjoyable even once it's no longer timing out
n

Nicolas May

03/28/2023, 5:42 PM
Like a couple days ago
d

daniel

03/28/2023, 5:42 PM
ah unfortunate
n

Nicolas May

03/28/2023, 5:43 PM
doesn't sound very enjoyable even once it's no longer timing out
You are correct
d

daniel

03/28/2023, 5:43 PM
If you're comfortable sending us your metadata I could give you a script that prints out the serialized object that it's trying to fetch - we could see if there are other improvements like display_raw_sql that we need to make for these large asset graphs
n

Nicolas May

03/28/2023, 5:45 PM
Thanks! That's really generous... I'll have to check with my team. Question: "metadata" == manifest.json?
d

daniel

03/28/2023, 5:45 PM
It's basically 'what's displayed in dagit for your assets'
for your asset graph, rather
n

Nicolas May

03/28/2023, 5:46 PM
The env var DAGSTER_GRPC_TIMEOUT_SECONDS seems to have worked better.
Sorry... just trying to clarify, our dbt target/manifest.json would be helpful? Is that right?
(Like what specifically do you mean by "metadata"?)
d

daniel

03/28/2023, 5:47 PM
Oh I see - what would be helpful is seeing the object that's fetched by this particular API call (derived from your code and your manifest.json), since that's what's taking so long to generate
n

Nicolas May

03/28/2023, 5:48 PM
Ah okay... how would I retrieve that?
d

daniel

03/28/2023, 5:48 PM
I was going to give you a quick script that basically runs the API call that's taking forever and prints the serialized object to stdout
n

Nicolas May

03/28/2023, 5:48 PM
That'd be great thx!
And again... thanks so much for your help on this
d

daniel

03/28/2023, 5:54 PM
no problem - what's the name of your repository (if you're not using Definitions)?
n

Nicolas May

03/28/2023, 5:55 PM
I'm using Definitions
d

daniel

03/28/2023, 5:55 PM
gotcha
Ok, here's a quick script that should work when run from your daemon or dagit container:
Copy code
from dagster._core.host_representation.origin import (
    GrpcServerRepositoryLocationOrigin,
)

host = "docker_user_code_beam_dw_dagster"
port = 4004

origin = GrpcServerRepositoryLocationOrigin(host=host, port=port)

# Connects to the gRPC code server and fetches the serialized repository data -
# the same call that has been timing out for the daemon and Dagit.
location = origin.create_location()

repositories = location.get_repositories()
for repo_name, repo in repositories.items():
    print(f"{repo_name}:")
    print(str(repo.external_repository_data))
(with that timeout increased so the call actually finishes)
n

Nicolas May

03/28/2023, 6:04 PM
I'll give it a shot, check with my team, and if we're all good I'll circle back to this thread if that's the best way
d

daniel

03/28/2023, 6:06 PM
For sure - entirely optional and you're of course welcome to DM or email it to daniel@elementl.com instead of posting here
n

Nicolas May

03/28/2023, 6:07 PM
Oh great... Thanks!
@daniel Just giving you a heads-up that I sent a follow-up email a few seconds ago from nicolas.may@[my_employer_domain]... Thx!
d

daniel

04/05/2023, 1:55 AM
Thanks Nicolas - how's it performing after that timeout increase? Are you seeing any other issues?
n

Nicolas May

04/06/2023, 3:01 PM
Totally missed your reply... sorry! The timeout increase has definitely improved things... and we scaled up the VM running the Docker Compose deployment in production, so that's helped too... We still get InactiveRpcErrors on a pretty regular cadence, and I'm not sure how to improve/fix that
d

daniel

04/06/2023, 3:01 PM
Can you post a stack trace for one of the latest InactiveRpcErrors?
n

Nicolas May

04/06/2023, 3:01 PM
Will do in a few min... thx!
Copy code
/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py:602: UserWarning: Error loading repository location user_code_beam_dw_dagster:dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE

Stack Trace:
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py", line 599, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py", line 519, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/host_representation/origin.py", line 332, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/host_representation/repository_location.py", line 603, in __init__
    list_repositories_response = sync_list_repositories_grpc(self.client)
d

daniel

04/06/2023, 5:38 PM
Got it - and that's in the dagit container it looks like?
I'd associate that particular error with the user code container being down / entirely unreachable
n

Nicolas May

04/06/2023, 7:00 PM
I think it's coming from the Dagster daemon container... Fluent Bit is a sidecar container sending logs to Datadog
d

daniel

04/06/2023, 7:01 PM
ok, if it's the daemon, one nice thing is that it'll periodically refresh - so if the code location is temporarily down, the daemon should be able to load it again and retry in about a minute
n

Nicolas May

04/06/2023, 7:01 PM
And ya... that error definitely happens when things are first spinning up... but it also happens intermittently
d

daniel

04/06/2023, 7:01 PM
and it will pause executing things until it's available again
would you know from your logs if the user code container went down?
n

Nicolas May

04/06/2023, 7:02 PM
Yes! That's great... love this feature
would you know from your logs if the user code container went down?
Hmm... I don't think I've set that up
I'll add this... we've got 2 user code repos... the other one that we've been running in prod for a few months isn't this noisy
d

daniel

04/06/2023, 7:04 PM
could it be running out of memory or something possibly?
n

Nicolas May

04/06/2023, 7:04 PM
It's far less complex... has a bunch of op graphs but it's not a big hairy dependency ball like this problem repo w/ Fivetran + dbt + Census
Maybe it's a memory problem... we've just upgraded the VM... GCP Compute Engine e2-standard-4... 4 vCPUs and 16 GB mem
I thought that'd hack it (rough sketch of the layout below):
• Dagster daemon container
• Dagit container
• 1 simple user code container
• 1 big gnarly user code container
• Fluent Bit sidecar
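A rough sketch of what that five-container Compose topology might look like; apart from the big user code service named earlier in the thread, every service and image name here is a placeholder:

# Sketch only - image names and most service names are placeholders.
services:
  docker_dagit:                        # Dagit UI
    image: my_dagster_image
  docker_daemon:                       # schedules, sensors, run queue
    image: my_dagster_image
    entrypoint: ["dagster-daemon", "run"]
  docker_user_code_simple:             # the smaller, quieter code location
    image: my_simple_user_code_image
  docker_user_code_beam_dw_dagster:    # the big Fivetran + dbt + Census location
    image: my_beam_dw_user_code_image
  fluent_bit:                          # log-shipping sidecar (to Datadog)
    image: fluent/fluent-bit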
Morning @daniel... I just wanted to follow up on this. The reason this user code container was going down every few minutes and then coming back up was that (1) I'd set --heartbeat-timeout 1200 (20 min) on the dagster api grpc command in the docker-compose.yaml to try to fix the gRPC timeout problem, but hadn't removed the heartbeat flags, and (2) the Docker Compose service restart policy is unless-stopped. Only by looking at the service's logs and noticing repeated cycles of "Started Dagster code server for package ..." and "Shutting down Dagster code server for package ..." exactly 20 minutes apart did it occur to me that the heartbeat timeout flag was causing the problem. Once I got rid of the heartbeat timeout, this user code container (after it takes a while to spin up) works without a hitch... Thanks for helping me out with this!
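For anyone hitting the same symptom, a hedged sketch of the user code service shape described above; apart from the port, the package name from the earlier log line, the restart policy, and the heartbeat flags, everything here is a placeholder:

# Sketch only - the image name is a placeholder.
services:
  docker_user_code_beam_dw_dagster:
    image: my_user_code_image          # placeholder
    restart: unless-stopped
    command:
      - dagster
      - api
      - grpc
      - --package-name
      - dagster_integrations
      - --host
      - "0.0.0.0"
      - --port
      - "4004"
      # These flags were the culprit: with --heartbeat set, the server shuts
      # itself down if no heartbeat arrives within --heartbeat-timeout seconds,
      # and the unless-stopped restart policy then starts it again - hence the
      # 20-minute stop/start cycle described above. Removing them was the fix.
      # - --heartbeat
      # - --heartbeat-timeout
      # - "1200"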