Using Docker compose and trying to increase gRPC code server timeout
# ask-community
n
Using Docker compose and trying to increase the gRPC code server timeout to something above 60 seconds. In the dagster.yaml available to the Dagster daemon and Dagit containers, I've included the following:
Copy code
code_servers:
  local_startup_timeout: 180
I even tried it in another dagster.yaml in the user code image, but nothing works. No matter what I try, I still see the error
dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 60 seconds to complete.
Any tips?
d
Hey Nicolas - where exactly are you seeing that error, does it come with a stack trace? That timeout covers starting up the servers / loading your code, not requests to the servers in general
n
Yes thanks! From the Dagster daemon container:
Copy code
docker_daemon_local               | 2023-03-28 17:10:00 +0000 - dagster.daemon.SchedulerDaemon - WARNING - Could not load location user_code_beam_dw_dagster to check for schedules due to the following error: dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 60 seconds to complete.
docker_daemon_local               |
docker_daemon_local               | Stack Trace:
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py", line 599, in _load_location
docker_daemon_local               |     location = self._create_location_from_origin(origin)
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py", line 519, in _create_location_from_origin
docker_daemon_local               |     return origin.create_location()
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_core/host_representation/origin.py", line 332, in create_location
docker_daemon_local               |     return GrpcServerRepositoryLocation(self)
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_core/host_representation/repository_location.py", line 637, in __init__
docker_daemon_local               |     self._external_repositories_data = sync_get_streaming_external_repositories_data_grpc(
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_api/snapshot_repository.py", line 25, in sync_get_streaming_external_repositories_data_grpc
docker_daemon_local               |     external_repository_chunks = list(
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 351, in streaming_external_repository
docker_daemon_local               |     for res in self._streaming_query(
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 186, in _streaming_query
docker_daemon_local               |     self._raise_grpc_exception(
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 137, in _raise_grpc_exception
docker_daemon_local               |     raise DagsterUserCodeUnreachableError(
docker_daemon_local               |
docker_daemon_local               | The above exception was caused by the following exception:
docker_daemon_local               | grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
docker_daemon_local               | 	status = StatusCode.DEADLINE_EXCEEDED
docker_daemon_local               | 	details = "Deadline Exceeded"
docker_daemon_local               | 	debug_error_string = "{"created":"@1680023292.241437667","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
docker_daemon_local               | >
docker_daemon_local               |
docker_daemon_local               | Stack Trace:
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 182, in _streaming_query
docker_daemon_local               |     yield from self._get_streaming_response(
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 171, in _get_streaming_response
docker_daemon_local               |     yield from getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 426, in __next__
docker_daemon_local               |     return self._next()
docker_daemon_local               |   File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 826, in _next
docker_daemon_local               |     raise self
docker_daemon_local               |
The code location has a huge dbt project in it
d
That’s an odd one - typically if the code is taking a long time to load the timeout would happen earlier. What version of dagster is this?
n
1.1.21
d
How huge a project are we talking?
n
~1800 models... there's also an upstream Fivetran instance with ~30 connectors
d
Could you possibly share your workspace.yaml file or pyproject.toml file?
n
Copy code
load_from:
  # Each entry here corresponds to a service in the docker-compose file that exposes user code.
  - grpc_server:
      host: docker_user_code_beam_dw_dagster
      port: 4004
      location_name: "user_code_beam_dw_dagster"
d
Ahh ok, so that's another reason that local_startup_timeout isn't going to help here - the servers aren't actually considered 'local' when you're running in docker. Let's see...
Is the user code container running?
n
Ahhhhh... okay
Yes
d
It has a line like "Started Dagster code server on port 4004 in process 7283" in its container output?
n
Copy code
docker_user_code_beam_dw_dagster  | 2023-03-28 17:26:03 +0000 - dagster - INFO - Started Dagster code server for package dagster_integrations on port 4004 in process 1
d
Got it - so after that happens, the daemon is still spewing that error? I could imagine it struggling while the user code container is starting up - the intended behavior would be that the error stops within a minute or so after the user code container finishes loading
n
And it's a little unpredictable... after a while it sometimes does load... but then it disappears from the Dagit UI
d
I'd check the Code Locations tab in Dagit when that happens, there might be an error there
(potentially this same error)
n
I get the error in Dagit code locations tab and Docker logs
d
ok, once we've gone through all that - there is a DAGSTER_GRPC_TIMEOUT_SECONDS env var that you could try setting to something like 180 in the dagit and daemon containers. We're usually hesitant to recommend that first since in many cases there's some underlying issue (the default of 60 seconds is already quite long). It's possible that your dbt project is large enough that it's actually taking more than 60 seconds just to stream all the data over, though - I'd be curious whether things get more stable after this or whether the sheer size of the data causes other problems
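For example, a minimal docker-compose.yaml sketch of that (the dagit service name and the images are placeholders; the only thing that matters here is the DAGSTER_GRPC_TIMEOUT_SECONDS environment variable on both services):
Copy code
services:
  docker_dagit_local:                      # placeholder name for the dagit service
    image: my-dagster-image:latest         # placeholder image
    environment:
      DAGSTER_GRPC_TIMEOUT_SECONDS: "180"  # timeout for gRPC calls to code servers, in seconds
  docker_daemon_local:                     # daemon service name from the logs above
    image: my-dagster-image:latest         # placeholder image
    environment:
      DAGSTER_GRPC_TIMEOUT_SECONDS: "180"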
n
I'll give this a shot rn
FWIW I've set
display_raw_sql=False
in
load_assets_from_dbt_manifest
to try to speed things up
d
Ah yes - maybe try that first
n
Sorry... I already have 🙃
d
waiting >60 seconds each time it needs to reload doesn't sound very enjoyable even once it's no longer timing out
n
Like a couple days ago
d
ah unfortunate
n
doesn't sound very enjoyable even once it's no longer timing out
You are correct
d
If you're comfortable sending us your metadata I could give you a script that prints out the serialized object that it's trying to fetch - we could see if there are other improvements like display_raw_sql that we need to make for these large asset graphs
n
Thanks! That's really generous... I'll hafta check with my team. Question: "metadata" == manifest.json?
d
It's basically 'what's displayed in dagit for your assets'
for your asset graph, rather
n
Env var
DAGSTER_GRPC_TIMEOUT_SECONDS
seems to've worked better
Sorry... just trying to clarify, our dbt target/manifest.json would be helpful? Is that right?
(Like what specifically do you mean by "metadata"?)
d
Oh I see - what would be helpful is for us to see the object that this particular API call is fetching (derived from your code and your manifest.json), since that's what's taking so long to generate
n
Ah okay... how would I retrieve that?
d
I was going to give you a quick script that basically runs the API call that's taking forever and prints the serialized object to stdout
n
That'd be great thx!
And again... thanks so much for your help on this
d
no problem - what's the name of your repository (if you're not using Definitions)?
n
I'm using Definitions
d
gotcha
Ok, here's a quick script that should work when run from your daemon or dagit container:
Copy code
from dagster._core.host_representation.origin import (
    GrpcServerRepositoryLocationOrigin,
)

# Host/port of the user code gRPC server, matching workspace.yaml.
host = "docker_user_code_beam_dw_dagster"
port = 4004

origin = GrpcServerRepositoryLocationOrigin(host=host, port=port)

# Creating the location performs the same repository-data fetch that dagit and the daemon do.
location = origin.create_location()

# Print the serialized repository data for each repository in the location.
for repo_name, external_repo in location.get_repositories().items():
    print(f"{repo_name}:")
    print(str(external_repo.external_repository_data))
(with that timeout increased so the call actually finishes)
n
I'll give it a shot, check with my team, and if we're all good I'll circle back to this thread if that's the best way
d
For sure - entirely optional and you're of course welcome to DM or email it to daniel@elementl.com instead of posting here
n
Oh great... Thanks!
@daniel Just giving you a heads-up that I sent a follow-up email a few seconds ago from nicolas.may@[my_employer_domain]... Thx!
d
Thanks nicolas - how's it performing after that timeout increase, are you seeing any other issues?
n
Totally missed your reply... sorry! The timeout increase has definitely improved things... and we scaled up the VM running the Docker compose deployment in production... so that's helped too... We still get InactiveRpcErrors on a pretty regular cadence, and I'm not sure how to improve/fix that
d
Can you post a stack trace for one of the latest InactiveRpcErrors?
n
Will do in a few min... thx!
Copy code
/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py:602: UserWarning: Error loading repository location user_code_beam_dw_dagster:dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE

Stack Trace:
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py", line 599, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py", line 519, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/host_representation/origin.py", line 332, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/host_representation/repository_location.py", line 603, in __init__
    list_repositories_response = sync_list_repositories_grpc(self.client)
d
Got it - and that's in the dagit container it looks like?
I'd associate that particular error with the user code container being down / entirely unreachable
n
I think it's coming from the Dagster daemon container... Fluent Bit is a sidecar container shipping the logs to Datadog
d
ok, if it's the daemon, one nice thing is that it'll periodically refresh - so if the code location is temporarily down, the daemon should be able to load it and try again in about a minute
n
And ya... that error definitely happens when things are first spinning up... but it also happens intermittently
d
and it will pause executing things until it's available again
would you know from your logs if the user code container went down?
n
Yes! That's great... love this feature
would you know from your logs if the user code container went down?
Hmm... I don't think I've set that up
I'll add this... we've got 2 user code repos... the other one that we've been running in prod for a few months isn't this noisy
d
could it be running out of memory or something possibly?
n
It's far less complex... has a bunch of op graphs but it's not a big hairy dependency ball like this problem repo w/ Fivetran + dbt + Census
Maybe it's a memory problem... we've just upgraded the VM... GCP CE e2-standard-4... 4 vCPUs and 16 GB mem
I thought that'd hack it...
• Dagster daemon container
• Dagit container
• 1 simple user code container
• 1 big gnarly user code container
• Fluent Bit sidecar
Morning @daniel... I just wanted to follow up on this. The reason this user code container was going down every few minutes and then coming back up was that (1) I'd set
--heartbeat-timeout 1200
(20 min) on the dagster api grpc command in docker-compose.yaml to try to fix the gRPC timeout problem, but hadn't removed the heartbeat flags, and (2) the docker compose service restart policy is
unless-stopped
. Only by looking at the service's logs and noticing repeated cycles of
Started Dagster code server for package ...
and
Shutting down Dagster code server for package ...
exactly 20 minutes apart did it occur to me that the heartbeat timeout flag was causing the problem. Once I got rid of the heartbeat timeout, this user code container (after it takes a while to spin up) works without a hitch... Thanks for helping me out with this!
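For anyone landing here later, a sketch of what the fixed user code service looks like in docker-compose.yaml (the image name is illustrative and the load flags depend on your setup; the key points are that the dagster api grpc command no longer passes --heartbeat / --heartbeat-timeout and the restart policy stays unless-stopped):
Copy code
services:
  docker_user_code_beam_dw_dagster:
    image: user_code_beam_dw:latest   # illustrative image name
    restart: unless-stopped
    # Note: no --heartbeat / --heartbeat-timeout flags on the command below, so the
    # code server is no longer shut down every 20 minutes and bounced by the
    # unless-stopped restart policy. The --package-name matches the package from the
    # startup log; use whatever load flags your image actually needs.
    command: >
      dagster api grpc
      --host 0.0.0.0
      --port 4004
      --package-name dagster_integrations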