# ask-community
a
Hi everyone, I wanted to query missing partitions using the following GraphQL query in the GraphQL playground:
Copy code
query test(
  $repositorySelector: RepositorySelector!, 
  $partitionSetName: String!
)
  {
    partitionSetOrError(
      repositorySelector: $repositorySelector
      partitionSetName: $partitionSetName
    ) {
      ... on PartitionSet {
        id
        name
        pipelineName
        partitionsOrError {
          ... on Partitions {
            results {
              name
            }
          }
        }
        partitionStatusesOrError {
          __typename
          ... on PartitionStatuses {
            results {
              id
              partitionName
              runStatus
              runDuration
            }
          }
        }
      }
    }
  }
and I get the following response:
{"error": "Unexpected token '<', \"\n<html><hea\"... is not valid JSON"}
Am I missing something, or is it a known problem? I am using Dagster & Dagit version 1.0.17, deployed on K8s.
j
cc @dish
d
Hi Alexis, if you remove any elements of that query, does it respond correctly? Do other queries respond correctly?
a
I tried one of the sample queries in the docs; same result.
This one works, though:
Copy code
query FilteredRunsQuery {
  runsOrError(filter: { statuses: [FAILURE] }) {
    __typename
    ... on Runs {
      results {
        runId
        jobName
        status
        runConfigYaml
        stats {
          ... on RunStatsSnapshot {
            startTime
            endTime
            stepsFailed
          }
        }
      }
    }
  }
}
d
Can you open your browser dev tools and see what kind of http status code your request is returning?
The unparseable response is almost certainly coming from a non-200, so I’m wondering what kind of response code it actually has
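(For reference, a minimal sketch of sending the same request outside the browser so the HTTP status code is visible; the endpoint URL and the selector values below are placeholders, not values from this thread.)
Copy code
import requests

# Placeholder endpoint; point this at your dagit service's /graphql route.
DAGIT_GRAPHQL_URL = "http://localhost:3000/graphql"

QUERY = """
query test($repositorySelector: RepositorySelector!, $partitionSetName: String!) {
  partitionSetOrError(
    repositorySelector: $repositorySelector
    partitionSetName: $partitionSetName
  ) {
    ... on PartitionSet { id name }
  }
}
"""

variables = {
    "repositorySelector": {
        "repositoryName": "my_repository",        # placeholder
        "repositoryLocationName": "my_location",  # placeholder
    },
    "partitionSetName": "my_partition_set",       # placeholder
}

resp = requests.post(DAGIT_GRAPHQL_URL, json={"query": QUERY, "variables": variables})
print(resp.status_code)  # a non-200 (e.g. 502) explains the unparseable HTML body
print(resp.text[:200])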
a
I get a 502 HTTP status code
d
Perfect, thanks — cc @alex, any of these fields look suspicious to you in terms of leading to a 502?
a
how many failed runs do you have in your DB? I am guessing you are either timing out or OOM-ing your webserver.
You could check the logs on the webserver / the state of the k8s pods. You could also set a limit on your runsOrError call.
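(A minimal sketch of that bounded query, assuming the deployed GraphQL schema exposes a limit argument on runsOrError; worth confirming against your own schema in the playground. The endpoint URL is a placeholder.)
Copy code
import requests

DAGIT_GRAPHQL_URL = "http://localhost:3000/graphql"  # placeholder endpoint

# Same failed-runs check as the sample query above, capped at 25 results so the
# webserver does not have to serialize every failed run in one response.
LIMITED_FAILED_RUNS_QUERY = """
query FilteredRunsQuery {
  runsOrError(filter: { statuses: [FAILURE] }, limit: 25) {
    __typename
    ... on Runs {
      results { runId jobName status }
    }
  }
}
"""

resp = requests.post(DAGIT_GRAPHQL_URL, json={"query": LIMITED_FAILED_RUNS_QUERY})
print(resp.status_code)
print(resp.text[:500])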
a
I have around 700 failed runs in my DB. My original problem is related to the first query; the runsOrError one was just a test to see whether the issue was global or not.
Which kind of logs should I look for?
a
the 502 and the HTML response mean that the network ingress in your cluster, likely nginx, is serving the response, because the upstream dagit webserver that attempted to handle the request failed in some way
a
I have a lot of errors related to unreachable user code:
Copy code
Stack Trace:
 File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 535, in _load_location
 location = self._create_location_from_origin(origin)
 File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 460, in _create_location_from_origin
 return origin.create_location()
 File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 329, in create_location
 return GrpcServerRepositoryLocation(self)
 File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/repository_location.py", line 569, in __init__
 list_repositories_response = sync_list_repositories_grpc(self.client)
 File "/usr/local/lib/python3.7/site-packages/dagster/_api/list_repositories.py", line 19, in sync_list_repositories_grpc
 api_client.list_repositories(),
 File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 211, in list_repositories
 res = self._query("ListRepositories", api_pb2.ListRepositoriesRequest)
 File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 141, in _query
 raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
{}
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
 status = StatusCode.UNAVAILABLE
 details = "DNS resolution failed for elt:3031: C-ares status is not ARES_SUCCESS qtype=A name=elt is_balancer=0: Could not contact DNS servers"
 debug_error_string = "{"created":"@1671033666.514173394","description":"DNS resolution failed for elt:3031: C-ares status is not ARES_SUCCESS qtype=A name=elt is_balancer=0: Could not contact DNS servers","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}"
I think it may be related to another problem I have with my infrastructure. Can I mention you in the related thread?
And a lot of this one too:
Copy code
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "{"created":"@1671053109.330323489","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
I have more errors from the dagit pod; I am posting them in case they could be useful to you:
Copy code
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/graphql/execution/executor.py", line 452, in resolve_or_error
    return executor.execute(resolve_fn, source, info, **args)
  File "/usr/local/lib/python3.7/site-packages/graphql/execution/executors/sync.py", line 16, in execute
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/schema/external.py", line 195, in resolve_schedules
    for schedule in self._repository.get_external_schedules()
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/schema/external.py", line 195, in <listcomp>
    for schedule in self._repository.get_external_schedules()
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/loader.py", line 252, in get_schedule_state
    states = self._get(RepositoryDataType.SCHEDULE_STATES, schedule_name, 1)
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/loader.py", line 59, in _get
    self._fetch(data_type, limit)
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/loader.py", line 174, in _fetch
    instigator_type=InstigatorType.SCHEDULE,
  File "/usr/local/lib/python3.7/site-packages/dagster/_utils/__init__.py", line 640, in inner
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/instance/__init__.py", line 1979, in all_instigator_state
    repository_origin_id, repository_selector_id, instigator_type
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/storage/schedules/sql_schedule_storage.py", line 54, in all_instigator_state
    if self.has_instigators_table() and self.has_built_index(SCHEDULE_JOBS_SELECTOR_ID):
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/storage/schedules/sql_schedule_storage.py", line 237, in has_instigators_table
    return self._has_instigators_table(conn)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/storage/schedules/sql_schedule_storage.py", line 240, in _has_instigators_table
    table_names = db.inspect(conn).get_table_names()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/reflection.py", line 267, in get_table_names
    conn, schema, info_cache=self.info_cache
  File "<string>", line 2, in get_table_names
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/reflection.py", line 55, in cache
a
StatusCode.DEADLINE_EXCEEDED
this means the user code server took longer to respond than the timeout (default of 60 seconds). I would guess you have a slow/heavy schedule or sensor
a
Would you say that a partitioned schedule with a lot of partitions could be considered heavy? I initially tried to query a schedule with over 100k partitions (each asset is partitioned per 15 minutes since early 2022, with 4 assets in the scheduled job)
a
so it's specifically the @schedule / @sensor decorated function that generates RunRequests / config that is taking longer than 60 seconds
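(For illustration, a minimal sketch of such a decorated function; the names are hypothetical and this is not the code from this thread. Whatever happens in the function body must finish within the gRPC timeout on every schedule tick.)
Copy code
from dagster import RunRequest, schedule


# Hypothetical example: this body runs on the user code server each time the
# schedule is evaluated, and it must complete within the gRPC timeout.
@schedule(cron_schedule="*/15 * * * *", job_name="my_job")
def my_fifteen_minute_schedule(context):
    partition_key = context.scheduled_execution_time.strftime("%Y-%m-%d %H:%M")
    # Expensive work here (for example, enumerating a very large partition set)
    # is what leads to StatusCode.DEADLINE_EXCEEDED.
    return RunRequest(run_key=partition_key, run_config={})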
a
What can I do to fix it?
a
make the function in question take less than 60 seconds, or increase the timeout using the environment variable DAGSTER_GRPC_TIMEOUT_SECONDS
a
I am using build_schedule_from_partitioned_job to create the schedule; should I create my own implementation of it then?
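(A minimal sketch of that construction, assuming default arguments and the my_job defined in the next message; the schedule name here is hypothetical.)
Copy code
from dagster import build_schedule_from_partitioned_job

# Assumes my_job is the partitioned asset job shown below; Dagster generates
# the schedule's evaluation function from the job's partitions definition.
my_job_schedule = build_schedule_from_partitioned_job(my_job)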
a
ah, that's useful context - looking at the implementation I don't see anything jump out. How is the partitioned job defined?
a
First the job itself:
Copy code
from dagster import define_asset_job

# TAGS is defined elsewhere in the project; fifteen_minute_partitions is shown below.
my_job = define_asset_job(
    "my_job",
    tags=TAGS,
    selection=[
        "asset1",
        "asset2",
        "asset3",
        "asset4",
        "asset5",
        "asset6",
    ],
    partitions_def=fifteen_minute_partitions,
)
The partition definition:
Copy code
from datetime import datetime

from dagster import TimeWindowPartitionsDefinition

fifteen_minute_partitions = TimeWindowPartitionsDefinition(
    cron_schedule="*/15 * * * *",
    start=datetime(2022, 1, 1, 0, 0, 0),
    fmt="%Y-%m-%d %H:%M",
    timezone="Europe/Paris",
)
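(For scale, a rough count of how many 15-minute partitions that definition yields, assuming it is evaluated around mid-December 2022, when this thread took place.)
Copy code
from datetime import datetime

# Back-of-the-envelope: 15-minute windows from the start date to mid-Dec 2022.
elapsed = datetime(2022, 12, 14) - datetime(2022, 1, 1)
partitions_per_asset = int(elapsed.total_seconds() // (15 * 60))
print(partitions_per_asset)      # roughly 33,000 partitions per asset
print(partitions_per_asset * 4)  # well over 100k across the 4 assets mentioned above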
a
flagged for folks on the team who know this area to take a look. If you can repro locally, you could identify the problem with a profiler like py-spy.
a
I will try to do so, thank you for your help! 🙂
s
here's a PR that should speed this up considerably: https://github.com/dagster-io/dagster/pull/11147. I'll see if we can get it into tomorrow's release. If not, it might need to wait until the new year.
❤️ 1
a
thank you @sandy