# ask-community
h
Hi team, we constantly see flakiness with one of our user-code-deployment repos that has lots of jobs & sensors, on Dagit 0.14.5. For example, when viewing a sensor's status, the page flashes every few seconds between the correct page, a blank screen, and an exception: “Exception: Location fabricator not in workspace”. I am wondering if this is a problem with:
a) The Fabricator user-code-deployment not having enough pods/CPU to service the requests from Dagit. We only see about a 40% CPU usage spike on the pod.
b) Dagit not having enough pods/CPU
c) Some internal Dagit issue (maybe frequent auto-refresh?)
d
Hi Hebo - is the pod possibly restarting? Any logs from the pod you can share?
if it's only one location, that points pretty strongly to the problem being with that particular location's pod, I think
("the pod" here meaning the pod for that fabricator location)
h
Thanks Daniel! No, the pod doesn't restart. I checked the logs but it is pretty much just our info logs. It might be related to the location doing too many sensor checks. Is there a way to prevent the Dagit UI from refreshing so frequently? It seems to load the data but then quickly auto-refreshes within seconds.
d
I'll double check, but I don't think the sensor page hits the server when it refreshes. I think that's more likely to be a symptom of the root problem (the server becoming inaccessible for some other reason) than the cause.
If you go to the Workspace tab, does it show an error for the fabricator location while this problem is happening?
h
Yep, I think the root problem is definitely that the location is inaccessible every few secs. However, if Dagit didn't auto-refresh so frequently, we could at least still view the page. Right now, while I am on the page, it flashes between working, blank, and the exception every few seconds.
Let me try increasing the resources and replica count on the location.
d
We could address the refresh, but the server being down will cause other problems too. So I worry that that would bandaid over the root problem
Any clues on the Workspace tab by any chance?
I'm hoping that might include some more useful information about why the server is inaccessible
When you say "increase replica count on the location" - what's the current replica count?
I don't think we actually support multiple replicas of the user code deployments, but we absolutely could add that if it turns out that that is the blocker (it would be somewhat unusual for that to be the case, but not impossible)
h
The workspace tab is more stable but I also saw that the “Fabricator” location went completely missing.
d
Hmmmm, that's very unusual...
Do you have multiple dagit replicas? Any chance they could be working with different workspace.yaml files somehow?
h
Glad that I asked. The replica count is currently 1 for the Fabricator location. Let me just double the CPU to see if it helps
d
I think this missing location thing is much more likely to be related to the root cause
than the resources on the pod - it seems like sometimes dagit thinks the location is there, and sometimes it doesn't, which would cause all kinds of problems
h
right, we have 2 dagit replicas. We don’t define a workspace.yaml and I think Dagster auto-generates it?
d
Yeah, that's true. Is the fabricator location new?
It seems like one of your replicas knows about it and the other doesn't, which is unusual
these symptoms are consistent with something like adding a new location, running a helm upgrade, and the upgrade failing partway through somehow and only applying to one dagit replica
h
I see... hmmm, the location has existed for a while. Let me see if I can isolate the bad Dagit replica (yeah, this is probably the root cause)
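A minimal sketch of one way to do that isolation, not something from this thread: query each Dagit replica's GraphQL endpoint directly and compare which locations each one reports. The port-forwarded URLs and the workspaceOrError / locationEntries field names below are assumptions and may differ by Dagit version.

# Compare which repository locations each Dagit replica reports.
# Assumes each replica has been port-forwarded locally first, e.g.:
#   kubectl port-forward pod/<dagit-pod-1> 3001:80
#   kubectl port-forward pod/<dagit-pod-2> 3002:80
# (adjust the container port to match your deployment)
import requests

WORKSPACE_QUERY = """
query {
  workspaceOrError {
    ... on Workspace {
      locationEntries { name loadStatus }
    }
    ... on PythonError { message }
  }
}
"""

for replica in ["http://localhost:3001", "http://localhost:3002"]:
    resp = requests.post(f"{replica}/graphql", json={"query": WORKSPACE_QUERY})
    print(replica, resp.json())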
d
Very strange...
Every location in the workspace.yaml should resolve to either a location or an error, I can't think of a way that it would end up just missing from the list
absent a bug of course, but I haven't seen any other reports of this
h
I see. Thanks for the help Daniel! Let me look into this. Also got this error now, but I think it's probably a symptom, not the cause
dagster.core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
  File "/usr/local/lib/python3.7/site-packages/dagster/core/workspace/context.py", line 555, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/workspace/context.py", line 481, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/core/host_representation/origin.py", line 306, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/host_representation/repository_location.py", line 557, in __init__
    self,
  File "/usr/local/lib/python3.7/site-packages/dagster/api/snapshot_repository.py", line 25, in sync_get_streaming_external_repositories_data_grpc
    repository_name,
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 260, in streaming_external_repository
    external_repository_origin
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 119, in _streaming_query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNKNOWN details = "Exception iterating responses: 'fabricator_mx_selection_net_new_merchants_base_instance_partition_set'" debug_error_string = "{"created":"@1651544410.089130103","description":"Error received from peer ipv4:172.30.227.74:3030","file":"src/core/lib/surface/call.cc","file_line":903,"grpc_message":"Exception iterating responses: 'fabricator_mx_selection_net_new_merchants_base_instance_partition_set'","grpc_status":2}" >
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 117, in _streaming_query
    yield from response_stream
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
d
Where did you see that error?
h
In the workspace locations. The error goes away a few secs later.
d
That looks a bit more like what I would expect - an error trying to fetch a location that you think should be there, that then shows on the Workspace tab (that is more expected than the location being missing from the list altogether)
h
Hmmm... I restarted the dagit pods and bumped Fabricator CPU. Still seeing the problem. Maybe it IS something wrong with that “fabricator_mx_selection_net_new_stores_sales_prediction_base_partition_set” thing, as I only see problems with that single job
d
You're still seeing one dagit replica with the location in the list on the Workspace tab, but another dagit replica with the location not in the list?
h
Hmmm... we are still seeing this problem. How do we know if it's a dagit replica problem vs. the Fabricator location sometimes being unresponsive?
d
When you say "the problem" - we've discussed a couple of different problems here. Which problem exactly are you referring to?
the location missing from the list? Or the error showing up where it can't connect to the server?
h
1. It's showing up but occasionally fails and then auto-recovers. (I think this is getting more stable now.) 2. The sensor page now flashes between working and loading every few seconds. (No longer seeing the exception.)
d
and does 2) also manifest as the fabricator location totally missing from the Workspace tab? Or has that problem gone away?
Is the sensor page reloading more than once every 15 seconds? The expected behavior is that there's a timer in the upper right that counts down from :15, and it only refreshes when that timer hits 0
if it'd be easier to hop on a quick call to gather all the information, let me know
a
Just stumbled upon this thread. @Hebo Yang Is this happening only for the fabricator repo?
h
Thanks again Daniel for helping with this! It somehow became much, much more stable now. We doubled the instance count, CPU, and mem for Dagit. We also doubled Fabricator CPU. Let me observe further to see what the behavior is when the symptom shows up again
@Arun Kumar, yes, I think it's only a problem with Fabricator
a
@daniel When I opened one of the sensor tick pages from that repo, I saw an instigationStateOrError request being made every second. Is this expected?
thankyou 1
d
I'll check, that does seem like a lot (although that query shouldn't hit your user code server)
a
Also, at one point, I saw 100s of requests being made for a sensor when I loaded the page. This also coincides with the loading issue Hebo mentioned
d
Was this the instigationStateOrError query? Or a different query?
a
Yeah, this might not be related to the user code server issue. But very much related to the sensor loading issue
👍 1
The single bunch were mostly sensorsOrError queries. Not sure if that was done to fetch the tick history?
d
Thanks, this is helpful context
Arun are you able to reproduce the 100s of requests thing every time you load the page? Or has that gone away now?
a
Yes, I am able to reproduce it, but not consistently. Happens occasionally
h
When the sensor page is in the blank state, the whole sensor frame (including the counter) is gone. I agree with Arun that it's probably just making way too many requests on the backend to fetch all the states. The sensor page shows up with a partially loaded tick history, then goes into the blank state (probably trying to load the rest of the ticks). Most of the time when we go to the sensor page, we just need to enable the sensor or check the most recent Skipped reason. If it could fetch only 10 or so historical ticks by default, or somehow batch the fetches, that would probably help. Also, most of our sensors run at a 1m frequency, so there are tons of tick histories. The recent ticks and tick history are usually not very useful for this kind of sensor.
d
How many sensors do you have in the repo?
I'll surface this to our engineers focusing on dagit perf
h
Thanks Daniel! I think about 200+. Does the sensor view page (e.g. https://dagit.doordash.team/workspace/feature_jobs@fabricator/sensors/fabricator_features_sensor) fetch status for all sensors? I thought it would only fetch ticks for the particular one.
d
I'd expect it to just fetch from the one too - we'll see what we can do to make the performance here better, there's no reason we shouldn't be able to handle that many sensors
thankyou 1
Separately - it could be interesting to talk through the setup here in a bit more detail and verify that you actually need each of these to be a separate sensor
since even once we make it fast in the UI (which we absolutely should do) it might still be a lot for you all to manage and keep track of
h
Yeah, I think that's something we'd need to optimize. We are the MLPlatform team and we maintain the repo and service. Other teams add their ETL pipelines, and we dynamically create sensors for them to check upstream dependencies. Right now we create a sensor for each user pipeline. I think we could consolidate the sensors into a few. That would also reduce the load on the DB.
👍 1
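As a rough illustration of the setup Hebo describes, here is a hedged sketch (not the team's actual code) of generating one dependency-checking sensor per user pipeline via a factory; upstream_is_ready is a hypothetical stand-in for whatever dependency check is really used.

# Hedged sketch of a per-pipeline sensor factory, roughly matching the setup
# described above. All helper names here are hypothetical placeholders.
from dagster import RunRequest, SkipReason, sensor


def upstream_is_ready(table_name):
    # Stand-in for the real upstream dependency check (e.g. a metadata lookup).
    return False


def make_dependency_sensor(job, upstream_table):
    # One sensor per user pipeline/job; ~200 of these adds up to a lot of ticks.
    @sensor(job=job, name=f"{job.name}_upstream_sensor", minimum_interval_seconds=60)
    def _dependency_sensor(context):
        if upstream_is_ready(upstream_table):
            yield RunRequest(run_key=None)
        else:
            yield SkipReason(f"{upstream_table} has not landed yet")

    return _dependency_sensor

Consolidating, as discussed above, would mean fewer sensors, each evaluating many of these upstream checks in a single tick, instead of keeping one sensor and one tick history per pipeline.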