# ask-community
h
Hi team, when the Dagit UI loads repo job definitions via GraphQL, does it make a request against the user code repository, or does it query the Postgres DB directly?
d
Hi Hebo - it makes a request against the gRPC server
h
Thanks Daniel! Do you mean the gRPC server on each user code repo?
d
I do yeah
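(For readers following along: a minimal sketch of checking that gRPC server directly. DagsterGrpcClient is a private dagster API, the same client that appears in the stack traces later in this thread, and the host/port here are placeholders for a user code deployment.)

from dagster._grpc.client import DagsterGrpcClient  # private API, shown for illustration

# Placeholder host/port for a user code deployment's gRPC server
client = DagsterGrpcClient(host="user-code-host", port=3030)
client.ping("hello")  # raises DagsterUserCodeUnreachableError if the server can't be reached
print("user code gRPC server is reachable")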
h
We are upgrading Dagit & the daemon from 0.15.9 to 1.2.4 while still running the user code repo on 0.15.9, and we noticed some strange behavior. The Launchpad UI calls the “ConfigPartitionsQuery” query. With 1.2.4 Dagit, this somehow becomes very slow, taking 20+ seconds to load.
However, the Partitions tab is able to load and render partitions right away. I am wondering if you know of any changes that might have caused this? (And if there is a way to solve it.)
d
Did you migrate your database as part of the upgrade?
h
Yep we did
d
Are you able to check your postgres DB and see if there's a particular query that's running slowly?
h
Not on individual queries... the overall CPU utilization seems to be constantly low (less than 20%).
a
@daniel Trying to find that SQL. In the meantime, here is the GraphQL query that is timing out:
query ConfigPartitionsQuery($repositorySelector: RepositorySelector!, $partitionSetName: String!, $assetKeys: [AssetKeyInput!]) {
  partitionSetOrError(
    repositorySelector: $repositorySelector
    partitionSetName: $partitionSetName
  ) {
    __typename
    ... on PartitionSet {
      id
      partitionsOrError {
        ... on Partitions {
          results {
            ...ConfigPartitionResult
            __typename
          }
          __typename
        }
        ...PythonErrorFragment
        __typename
      }
      __typename
    }
  }
  assetNodes(assetKeys: $assetKeys) {
    id
    partitionDefinition {
      name
      type
      __typename
    }
    __typename
  }
}

fragment ConfigPartitionResult on Partition {
  name
  __typename
}

fragment PythonErrorFragment on PythonError {
  __typename
  message
  stack
  errorChain {
    ...PythonErrorChain
    __typename
  }
}

fragment PythonErrorChain on ErrorChainLink {
  isExplicitLink
  error {
    message
    stack
    __typename
  }
  __typename
}
d
Some additional options:
• Run py-spy on the dagit process while this query is running: https://github.com/benfred/py-spy (requires setting some securityContext fields on the pod, described at that link)
• Use the GraphQL playground to try to identify the specific field in the query that is slow
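(For the first option above, a small helper along these lines could drive py-spy from inside the pod; this assumes py-spy is installed and a single dagit process is running, and is just a convenience wrapper around the CLI:)

import subprocess

# Locate the dagit process and dump its current Python stacks with py-spy.
pid = subprocess.check_output(["pgrep", "-f", "dagit"]).split()[0].decode()
print(subprocess.check_output(["py-spy", "dump", "--pid", pid]).decode())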
a
I tried playing with the GraphQL playground. Once I remove the assetNodes field, or the partitionDefinition field within assetNodes, from the request, the API call finishes quickly. Also, assetKeys is not passed as an input query variable, and I am seeing all the asset nodes in the response, including ones unrelated to the partitionSet and the repository.
d
Can you share the code of the partition set that's hitting this? Is there any way we can reproduce it ourselves?
I know you all just finished an upgrade, but the latest 1.2.x has no breaking changes since 1.2.4, so if that's an option I would expect an upgrade to 1.2.7 to be pretty quick and painless. I see "[UI] Performance improvement for loading asset partition statuses." in the changelog for 1.2.5: https://docs.dagster.io/changelog#new-2
a
Sorry, not sure what you mean by "code of the partition set"? However, having to fetch all the asset nodes in the Launchpad of a single job sounds like a bug, no?
Sure, we can upgrade to 1.2.7
d
That does sound like a bug, yeah
Let me know if it still happens after you update
a
Does upgrading to 1.3.0 require us to run a migration?
d
no migrations this time, no
1.3.0 is planned to be live later today
a
Ah got it. I will go with 1.2.7 then and will let you know
Looks like we are already on 1.2.6. The improvement that you mentioned should already be in.
d
Got it - any chance you could share the code of the job that's hitting it? Or some redacted subset of it that does? It seems like it's primarily related to the partition configuration of the job.
a
Sure, I can DM the code of the job definition. But it's happening in almost all the job run Launchpads, and probably has more to do with the scale of the assets in our deployment and the asset filter issue I pointed out above.
d
How many asset nodes do you have?
(in your deployment)
Ah ok, I think I have a lead here
a
The above GraphQL API returns close to 1,120 asset nodes. Again, the issue happens only when I request the partitionDefinition field for each of those assetNodes.
d
Are you also seeing timeouts on your asset catalog?
It's quite odd that adding partitionDefinition is what makes it time out - you're certain that's the case, right? Because that field doesn't look any more expensive than loading the asset node. If you were in Cloud we'd have a bunch of profiling tools on our side we could run, but we may need you to spin up py-spy while running this query to fully answer this.
c
Also wondering what your partition definitions look like: are they static, time-based, or multi-partitioned? And if they're time-based, do they contain offsets?
a
@claire They are time partitions. I don't think we have any offsets configured. @Hebo Yang Can you confirm whether fabricator assets have partition offsets?
h
This is what we have:
DailyPartitionsDefinition(
    start_date=PARTITION_START_DATE, timezone=str(TZ), end_offset=1
)
And yes, the Asset catalog seems to have become very slow to load with 1.2.6 now.
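(Aside: end_offset=1 above does count as an offset; it adds one partition past the end of the current time window, so the in-progress day is materializable. A self-contained sketch of such a definition, with stand-in values for PARTITION_START_DATE and TZ:)

from dagster import DailyPartitionsDefinition, asset

daily = DailyPartitionsDefinition(
    start_date="2022-01-01",         # stand-in for PARTITION_START_DATE
    timezone="America/Los_Angeles",  # stand-in for str(TZ)
    end_offset=1,  # one extra partition past "now": today gets a partition too
)

@asset(partitions_def=daily)
def my_daily_asset(context):
    # context.partition_key is the day being materialized, e.g. "2023-05-03"
    ...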
d
OK, we have some fixes in the works that we think will likely help with both the asset catalog and the launchpad
🙏 2
thankyou 2
One of those fixes just landed and will go out whenever our next dagit release is (next week at the latest)
a
@daniel Curious if that fix, which is going to land next week, will fix the partition Launchpad issue?
d
s/next week/in the next couple of hours
🌈 1
🙂
🌈 1
now live in 1.3.1!
❤️ 1
s
The Assets Catalog now loads in under 1 sec for us, down from 10 secs.
The Timeline is still slow (10+ secs), but maybe there's a future enhancement in the pipeline :)
h
Thanks @Stuart Robinson and @daniel. It seems that with 1.3.1, the code server request timeout problem in Dagit is coming back. Somehow 1.2.6 doesn't have this problem. Do you know of any changes between these versions that might have caused this?
/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py:593: UserWarning: Error loading repository location fabricator:dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 60 seconds to complete.

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 588, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 508, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 325, in create_location
    return GrpcServerCodeLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/code_location.py", line 632, in __init__
    self,
  File "/usr/local/lib/python3.7/site-packages/dagster/_api/snapshot_repository.py", line 29, in sync_get_streaming_external_repositories_data_grpc
    repository_name,
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 348, in streaming_external_repository
    defer_snapshots=defer_snapshots,
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 185, in _streaming_query
    e, timeout=timeout, custom_timeout_message=custom_timeout_message
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 138, in _raise_grpc_exception
    ) from e

The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1683137122.386091274","description":"Error received from peer ipv4:10.4.248.159:3030","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 181, in _streaming_query
    method, request=request_type(**kwargs), timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 169, in _get_streaming_response
    yield from getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 809, in _next
    raise self

  location_name=location_name, error_string=error.to_string()
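(Aside on the 60-second deadline in this trace: that is Dagster's default gRPC timeout, and it can be raised via the DAGSTER_GRPC_TIMEOUT_SECONDS environment variable on the Dagit/daemon side, which buys headroom while the underlying slowness is investigated. A minimal sketch; the value is arbitrary:)

import os

# Illustrative only: the variable must be present in the environment before
# the dagit/daemon process starts (e.g. set on the container spec in k8s).
os.environ["DAGSTER_GRPC_TIMEOUT_SECONDS"] = "120"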
d
Would it be possible to make a new post for this?
h
Yep. Let me do that