# ask-community
h
Hi team, we constantly see flakiness with one of our user-code-deployment repos that has lots of jobs & sensors, on Dagit 0.14.5. For example, when viewing a sensor's status, the page flashes every few seconds between the correct page, a blank screen, and an exception: “Exception: Location fabricator not in workspace”. I am wondering if this is a problem with:
a) The Fabricator user-code-deployment not having enough pods/CPU to service the requests from Dagit. We only see about a 40% CPU usage spike on the pod.
b) Dagit not having enough pods/CPU
c) Some internal Dagit issue (maybe frequent auto-refresh?)
d
Hi Hebo - is the pod possibly restarting? Any logs from the pod you can share?
if it's only one location, that points pretty strongly to the problem being with that particular location's pod, I think
("the pod" here meaning the pod for that fabricator location)
h
Thanks Daniel! No, the pod doesn't restart. I checked the logs but it is pretty much just our info logs. It might be related to the location doing too many sensor checks. Is there a way to prevent the Dagit UI from refreshing so frequently? It seems to load the data but then quickly auto-refreshes within seconds.
d
I'll double check, but I don't think the sensor page hits the server when it refreshes. I think that's more likely to be a symptom of the root problem (the server becoming inaccessible for some other reason) than the cause.
If you go to the Workspace tab, does it show an error for the fabricator location while this problem is happening?
h
Yep, I think the root problem is definitely that the location is inaccessible every few secs. However, if Dagit didn't auto-refresh so frequently, we could at least still view the page. Right now, while I am on the page, it flashes between working, blank, and the exception every few seconds.
Let me try increasing the resources and replica count on the location.
d
We could address the refresh, but the server being down will cause other problems too. So I worry that that would bandaid over the root problem
Any clues on the Workspace tab by any chance?
I'm hoping that might include some more useful information about why the server is inaccessible
When you say "increase replica count on the location" - what's the current replica count?
I don't think we actually support multiple replicas of the user code deployments, but we absolutely could add that if it turns out that that is the blocker (it would be somewhat unusual for that to be the case, but not impossible)
h
The workspace tab is more stable but I also saw that the “Fabricator” location went completely missing.
d
Hmmmm, that's very unusual...
Do you have multiple dagit replicas? Any chance they could be working with different workspace.yaml files somehow?
h
Glad that I asked. The replica count is currently 1 for the Fabricator location. Let me just double the CPU to see if it helps
d
I think this missing location thing is much more likely to be related to the root cause
than the resources on the pod - it seems like sometimes dagit thinks the location is there, and sometimes it doesn't, which would cause all kinds of problems
h
right, we have 2 dagit replicas. We don’t define a workspace.yaml and I think Dagster auto-generates it?
d
Yeah, that's true. Is the fabricator location new?
It seems like one of your replicas knows about it and the other doesn't, which is unusual
these symptoms are consistent with something like adding a new location, running a helm upgrade, and the upgrade failing partway through somehow and only applying to one dagit replica
h
I see... hmmm, the location has existed for a while. Let me see if I can isolate the bad Dagit replica (yeah, this is probably the root cause)
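A minimal sketch of one way to do that isolation, not something from this thread: query each Dagit replica's GraphQL endpoint directly and compare which locations each one reports. The port-forwarded URLs and the workspaceOrError / locationEntries field names below are assumptions and may differ by Dagit version.

# Compare which repository locations each Dagit replica reports.
# Assumes each replica has been port-forwarded locally first, e.g.:
#   kubectl port-forward pod/<dagit-pod-1> 3001:80
#   kubectl port-forward pod/<dagit-pod-2> 3002:80
# (adjust the container port to match your deployment)
import requests

WORKSPACE_QUERY = """
query {
  workspaceOrError {
    ... on Workspace {
      locationEntries { name loadStatus }
    }
    ... on PythonError { message }
  }
}
"""

for replica in ["http://localhost:3001", "http://localhost:3002"]:
    resp = requests.post(f"{replica}/graphql", json={"query": WORKSPACE_QUERY})
    print(replica, resp.json())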
d
Very strange...
Every location in the workspace.yaml should resolve to either a location or an error, I can't think of a way that it would end up just missing from the list
absent a bug of course, but I haven't seen any other reports of this
h
I see. Thanks for the help Daniel! Let me look into this. Also got this error now, but I think it's probably a symptom, not the cause
dagster.core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
  File "/usr/local/lib/python3.7/site-packages/dagster/core/workspace/context.py", line 555, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/workspace/context.py", line 481, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/core/host_representation/origin.py", line 306, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/host_representation/repository_location.py", line 557, in __init__
    self,
  File "/usr/local/lib/python3.7/site-packages/dagster/api/snapshot_repository.py", line 25, in sync_get_streaming_external_repositories_data_grpc
    repository_name,
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 260, in streaming_external_repository
    external_repository_origin
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 119, in _streaming_query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNKNOWN details = "Exception iterating responses: 'fabricator_mx_selection_net_new_merchants_base_instance_partition_set'" debug_error_string = "{"created":"@1651544410.089130103","description":"Error received from peer ipv4:172.30.227.74:3030","file":"src/core/lib/surface/call.cc","file_line":903,"grpc_message":"Exception iterating responses: 'fabricator_mx_selection_net_new_merchants_base_instance_partition_set'","grpc_status":2}" >
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 117, in _streaming_query
    yield from response_stream
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
d
Where did you see that error?
h
In the workspace locations. The error goes away a few secs later.
d
That looks a bit more like what I would expect - an error trying to fetch a location that you think should be there, that then shows on the Workspace tab (that is more expected than the location being missing from the list altogether)
h
Hmmm... I restarted the dagit pods and bumped Fabricator CPU. Still seeing the problem. Maybe it IS something wrong with that “fabricator_mx_selection_net_new_stores_sales_prediction_base_partition_set” thing, as I only see problems with that single job
d
You're still seeing one dagit replica with the location in the list on the Workspace tab, but another dagit replica with the location not in the list?
h
Hmmm... we are still seeing this problem. How do we know if it's a dagit replica problem vs. the Fabricator location sometimes being unresponsive?
d
When you say "the problem" - we've discussed a couple of different problems here. Which problem exactly are you referring to?
the location missing from the list? Or the error showing up where it can't connect to the server?
h
1. It's showing up but occasionally fails and then auto-recovers. (I think this is getting more stable now.) 2. The sensor page now flashes between working and loading every few seconds. (No longer seeing the exception.)
d
and does 2) also manifest as the fabricator location totally missing from the Workspace tab? Or has that problem gone away?
Is the sensor page reloading more than once every 15 seconds? The expected behavior is that there's a timer in the upper right that counts down from :15, and it only refreshes when that timer hits 0
if it'd be easier to hop on a quick call to gather all the information, let me know
a
Just stumbled upon this thread. @Hebo Yang Is this happening only for the fabricator repo?
h
Thanks again Daniel for helping with this! It somehow became much, much more stable now. We doubled the instance count, CPU, and mem for Dagit. We also doubled Fabricator CPU. Let me observe further to see what the behavior is when the symptom shows up again
@Arun Kumar, yes, I think it's only a problem with Fabricator
a
@daniel When I opened one of the sensor tick pages from that repo, I saw an instigationStateOrError request being made every second. Is this expected?
thankyou 1
d
I'll check, that does seem like a lot (although that query shouldn't hit your user code server)
a
Also, at one point, I saw 100s of requests being made for a sensor when I loaded the page. This also coincides with the loading issue Hebo mentioned
d
Was this the instigationStateOrError query? Or a different query?
a
Yeah, this might not be related to the user code server issue. But very much related to the sensor loading issue
👍 1
The single bunch were mostly sensorsOrError queries. Not sure if that was done to fetch the tick history?
d
Thanks, this is helpful context
Arun are you able to reproduce the 100s of requests thing every time you load the page? Or has that gone away now?
a
Yes, I am able to reproduce it, but not consistently. Happens occasionally
h
When the sensor page is in the blank state, the whole sensor frame (including the counter) is gone. I agree with Arun that it's probably just making way too many requests on the backend to fetch all the states. The sensor page shows up with a partially loaded tick history, then goes into the blank state (probably trying to load the rest of the ticks). Most of the time when we go to the sensor page, we just need to enable the sensor or check the most recent Skipped reason. If it could fetch only 10 or so historical ticks by default, or somehow batch the fetches, that would probably help. Also, most of our sensors run at a 1m frequency, so there are tons of tick histories. The recent ticks and tick history are usually not very useful for this kind of sensor.
d
How many sensors do you have in the repo?
I'll surface this to our engineers focusing on dagit perf
h
Thanks Daniel! I think about 200+. Does the sensor view page (e.g. https://dagit.doordash.team/workspace/feature_jobs@fabricator/sensors/fabricator_features_sensor) fetch status for all sensors? I thought it would only fetch ticks for the particular one.
d
I'd expect it to just fetch from the one too - we'll see what we can do to make the performance here better, there's no reason we shouldn't be able to handle that many sensors
thankyou 1
Separately - it could be interesting to talk through the setup here in a bit more detail and verify that you actually need each of these to be a separate sensor
since even once we make it fast in the UI (which we absolutely should do) it might still be a lot for you all to manage and keep track of
h
Yeah, I think that's something we'd need to optimize. We are the MLPlatform team and we maintain the repo and service. Other teams add their ETL pipelines, and we dynamically create sensors for them to check upstream dependencies. Right now we create a sensor for each user pipeline. I think we could consolidate the sensors into a few. That would also reduce the load on the DB.
👍 1
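As a rough illustration of the setup Hebo describes, here is a hedged sketch (not the team's actual code) of generating one dependency-checking sensor per user pipeline via a factory; upstream_is_ready is a hypothetical stand-in for whatever dependency check is really used.

# Hedged sketch of a per-pipeline sensor factory, roughly matching the setup
# described above. All helper names here are hypothetical placeholders.
from dagster import RunRequest, SkipReason, sensor


def upstream_is_ready(table_name):
    # Stand-in for the real upstream dependency check (e.g. a metadata lookup).
    return False


def make_dependency_sensor(job, upstream_table):
    # One sensor per user pipeline/job; ~200 of these adds up to a lot of ticks.
    @sensor(job=job, name=f"{job.name}_upstream_sensor", minimum_interval_seconds=60)
    def _dependency_sensor(context):
        if upstream_is_ready(upstream_table):
            yield RunRequest(run_key=None)
        else:
            yield SkipReason(f"{upstream_table} has not landed yet")

    return _dependency_sensor

Consolidating, as discussed above, would mean fewer sensors, each evaluating many of these upstream checks in a single tick, instead of keeping one sensor and one tick history per pipeline.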