# ask-community
a
Hi all, we're trying to upgrade from 0.12.15 to 0.13.18 and we're running into some problems. The scheduler daemon fails to connect to the grpc server, and opening the schedules page results in the following graphql error:
Operation name: InstanceSchedulesQuery

Message: (psycopg2.errors.QueryCanceled) canceling statement due to statement timeout

[SQL: SELECT anon_1.id, anon_1.run_body, anon_1.create_timestamp, anon_1.update_timestamp, anon_1.start_time, anon_1.end_time, anon_1.rank 
FROM (SELECT runs.id AS id, runs.run_body AS run_body, runs.create_timestamp AS create_timestamp, runs.update_timestamp AS update_timestamp, runs.start_time AS start_time, runs.end_time AS end_time, rank() OVER (PARTITION BY run_tags.value ORDER BY runs.id DESC) AS rank 
FROM runs JOIN run_tags ON runs.run_id = run_tags.run_id 
WHERE run_tags.key = %(key_1)s AND run_tags.value IN (%(value_1_1)s, %(value_1_2)s, %(value_1_3)s, %(value_1_4)s, %(value_1_5)s, %(value_1_6)s, %(value_1_7)s, %(value_1_8)s, %(value_1_9)s, %(value_1_10)s, %(value_1_11)s, %(value_1_12)s, %(value_1_13)s, %(value_1_14)s, %(value_1_15)s, %(value_1_16)s, %(value_1_17)s, %(value_1_18)s, %(value_1_19)s, %(value_1_20)s, %(value_1_21)s, %(value_1_22)s, %(value_1_23)s, %(value_1_24)s, %(value_1_25)s, %(value_1_26)s, %(value_1_27)s, %(value_1_28)s, %(value_1_29)s, %(value_1_30)s, %(value_1_31)s, %(value_1_32)s, %(value_1_33)s, %(value_1_34)s, %(value_1_35)s, %(value_1_36)s, %(value_1_37)s, %(value_1_38)s, %(value_1_39)s, %(value_1_40)s, %(value_1_41)s, %(value_1_42)s, %(value_1_43)s, %(value_1_44)s, %(value_1_45)s, %(value_1_46)s, %(value_1_47)s, %(value_1_48)s, %(value_1_49)s, %(value_1_50)s, %(value_1_51)s, %(value_1_52)s, %(value_1_53)s, %(value_1_54)s, %(value_1_55)s, %(value_1_56)s, %(value_1_57)s, %(value_1_58)s, %(value_1_59)s, %(value_1_60)s, %(value_1_61)s, %(value_1_62)s, %(value_1_63)s, %(value_1_64)s, %(value_1_65)s, %(value_1_66)s, %(value_1_67)s, %(value_1_68)s, %(value_1_69)s, %(value_1_70)s)) AS anon_1 
WHERE anon_1.rank <= %(rank_1)s ORDER BY anon_1.rank ASC]
[parameters: {'key_1': 'dagster/schedule_name', 'rank_1': 1, 'value_1_1': <... REDACTED LIST OF SCHEDULES ...> }]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

Path: ["repositoriesOrError","nodes",0,"schedules",0,"scheduleState","runs"]

Locations: [{"line":142,"column":3}]

Stack Trace:
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/graphql/execution/executor.py", line 452, in resolve_or_error
    return executor.execute(resolve_fn, source, info, **args)
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/graphql/execution/executors/sync.py", line 16, in execute
    return fn(*args, **kwargs)
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster_graphql/schema/instigation.py", line 320, in resolve_runs
    else self._batch_loader.get_run_records_for_schedule(self._job_state.name, limit)
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster_graphql/implementation/loader.py", line 125, in get_run_records_for_schedule
    return self._get(RepositoryDataType.SCHEDULE_RUNS, schedule_name, limit)
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster_graphql/implementation/loader.py", line 49, in _get
    self._fetch(data_type, limit)
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster_graphql/implementation/loader.py", line 70, in _fetch
    records = self._instance.get_run_records(
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster/utils/__init__.py", line 606, in inner
    return func(*args, **kwargs)
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster/core/instance/__init__.py", line 1137, in get_run_records
    return self._run_storage.get_run_records(
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster/core/storage/runs/sql_run_storage.py", line 393, in get_run_records
    rows = self.fetchall(query)
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster/core/storage/runs/sql_run_storage.py", line 71, in fetchall
    result_proxy = conn.execute(query)
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1286, in execute
    return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 325, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1478, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1842, in _execute_context
    self._handle_dbapi_exception(
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2023, in _handle_dbapi_exception
    util.raise_(
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1799, in _execute_context
    self.dialect.do_execute(
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
p
Hi Auster. Are you using the default sqlite run storage?
a
No, I'm using postgres
local_artifact_storage:
  module: dagster.core.storage.root
  class: LocalArtifactStorage
  config:
    base_dir: /opt/dagster/dagster-home/

run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      db_name:
        env: DAGSTER_POSTGRES_DB
      hostname:
        env: DAGSTER_POSTGRES_HOST
      password:
        env: DAGSTER_POSTGRES_PASSWORD
      port: 5432
      username:
        env: DAGSTER_POSTGRES_USER

event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      db_name:
        env: DAGSTER_POSTGRES_DB
      hostname:
        env: DAGSTER_POSTGRES_HOST
      password:
        env: DAGSTER_POSTGRES_PASSWORD
      port: 5432
      username:
        env: DAGSTER_POSTGRES_USER

compute_logs:
  module: dagster.core.storage.local_compute_log_manager
  class: LocalComputeLogManager
  config:
    base_dir: /opt/dagster/dagster-home/storage

schedule_storage:
  module: dagster_postgres.schedule_storage
  class: PostgresScheduleStorage
  config:
    postgres_db:
      db_name:
        env: DAGSTER_POSTGRES_DB
      hostname:
        env: DAGSTER_POSTGRES_HOST
      password:
        env: DAGSTER_POSTGRES_PASSWORD
      port: 5432
      username:
        env: DAGSTER_POSTGRES_USER

scheduler:
  module: dagster.core.scheduler
  class: DagsterDaemonScheduler
  config: {}

run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 25
    tag_concurrency_limits:
    - key: dagster/backfill
      limit: 1

run_launcher:
  module: dagster
  class: DefaultRunLauncher
  config: {}
I'm probably missing these local compute and artifact storages though.
a
one workaround you can apply is to override the statement timeout from dagit, with the --db-statement-timeout flag
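e.g. something like this (the value is in milliseconds, so 15000 is just an illustrative bump over the 5s default - pick whatever works for your DB):
dagit --db-statement-timeout 15000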
what resources does your PG database have (cpu,mem)?
a
2 cpus, 4 gb (t3.medium)
worth mentioning that if I use a fresh DB it works
a
ya this is a new DB query that is attempting to batch what was previously N separate round trips. It taking longer than 5s in the DB isn't expected though. how many schedules do you have?
70
?
a
yep
a
0.13.16 should be safe to upgrade to - it's before this batching was added
how long does this page normally take to load for you?
a
ok, the timeout parameter seems to have helped dagit at least... not sure if it'll help with the daemon though (I mangled my test db trying to delete my runs so it'll take a few minutes to restore it before I can check)
p
I’m pretty sure we didn’t use these batched queries in the daemon, so ping this thread if you’re hitting a different error with the daemon
a
7~10 seconds
ok, I'll report back later, thanks for the help guys
sorry for the delay, here's the daemon error:
2022-02-15 15:50:00 -0300 - dagster.daemon.SchedulerDaemon - ERROR - Scheduler caught an error for schedule magcache_frauds : dagster.core.scheduler.scheduler.DagsterSchedulerError: Unable to reach the user code server for schedule magcache_frauds. Schedule will resume execution once the server is available.

Stack Trace:
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster/scheduler/scheduler.py", line 130, in launch_scheduled_runs
    yield from launch_scheduled_runs_for_schedule(
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster/scheduler/scheduler.py", line 281, in launch_scheduled_runs_for_schedule
    raise DagsterSchedulerError(

The above exception was caused by the following exception:
dagster.core.errors.DagsterUserCodeUnreachableError: Could not reach user code server

Stack Trace:
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster/scheduler/scheduler.py", line 267, in launch_scheduled_runs_for_schedule
    yield from _schedule_runs_at_time(
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster/scheduler/scheduler.py", line 341, in _schedule_runs_at_time
    schedule_execution_data = repo_location.get_external_schedule_execution_data(
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster/core/host_representation/repository_location.py", line 740, in get_external_schedule_execution_data
    return sync_get_external_schedule_execution_data_grpc(
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster/api/snapshot_schedule.py", line 47, in sync_get_external_schedule_execution_data_grpc
    api_client.external_schedule_execution(
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster/grpc/client.py", line 273, in external_schedule_execution
    chunks = list(
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster/grpc/client.py", line 117, in _streaming_query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e

The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1644950999.109802731","description":"Error received from peer ipv4:127.0.0.1:4000","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>

Stack Trace:
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/dagster/grpc/client.py", line 115, in _streaming_query
    yield from response_stream
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/home/auster/repos/zarpo_dagster/venv/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _next
    raise self
@prha
d
that timeout happens when it takes more than 60 seconds to get results from the schedule function in your user code deployment - is there anything unique about your schedule that might make that more likely?
a
It takes a few seconds to load all schedules but not nearly that long. And the function that generates this schedule's config is pretty simple. I suspect the problem is related to the migration of my run history, since it works fine if I set up a brand new DB instead of migrating.
hmm, wait a sec, might be my partition selector function, since it lists past runs
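(for context, the selector does something roughly along these lines - an illustrative sketch with made-up names, not the actual code:)

# illustrative partition selector: it queries run history on every schedule tick,
# and that query executes inside the user code gRPC server
from dagster.core.storage.pipeline_run import PipelineRunsFilter


def last_unrun_partition_selector(context, partition_set):
    # fetch every past run of this partition set - on a large migrated run
    # table this scan is slow, which is what blows the 60s gRPC deadline
    runs = context.instance.get_runs(
        filters=PipelineRunsFilter(tags={"dagster/partition_set": partition_set.name})
    )
    already_run = {run.tags.get("dagster/partition") for run in runs}
    # pick the most recent partition that hasn't run yet
    for partition in reversed(partition_set.get_partitions()):
        if partition.name not in already_run:
            return partition
    return None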
d
Hm, I'm having trouble mapping that to this error though - this is specifically when the scheduler calls out to your gRPC server, that doesn't typically hit your storage
ah now we're talking 🙂
a
Adding updated_after to PipelineRunsFilter seems to have fixed the error above, at least when testing outside docker. But when I try to run my setup with docker-compose, I get the following error from the daemon:
2022-02-15 21:13:28 +0000 - dagster.daemon.SchedulerDaemon - ERROR - Scheduler caught an error for schedule magcache_frauds : dagster.core.errors.DagsterUserCodeUnreachableError: Could not reach user code server

Stack Trace:
  File "/usr/local/lib/python3.8/site-packages/dagster/scheduler/scheduler.py", line 129, in launch_scheduled_runs
    repo_location = workspace.get_location(origin)
  File "/usr/local/lib/python3.8/site-packages/dagster/core/workspace/dynamic_workspace.py", line 36, in get_location
    location = existing_location if existing_location else origin.create_location()
  File "/usr/local/lib/python3.8/site-packages/dagster/core/host_representation/origin.py", line 266, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.8/site-packages/dagster/core/host_representation/repository_location.py", line 528, in __init__
    list_repositories_response = sync_list_repositories_grpc(self.client)
  File "/usr/local/lib/python3.8/site-packages/dagster/api/list_repositories.py", line 14, in sync_list_repositories_grpc
    deserialize_json_to_dagster_namedtuple(api_client.list_repositories()),
  File "/usr/local/lib/python3.8/site-packages/dagster/grpc/client.py", line 162, in list_repositories
    res = self._query("ListRepositories", api_pb2.ListRepositoriesRequest)
  File "/usr/local/lib/python3.8/site-packages/dagster/grpc/client.py", line 108, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e

The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1644959608.655161067","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1644959608.655160487","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>

Stack Trace:
  File "/usr/local/lib/python3.8/site-packages/dagster/grpc/client.py", line 105, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
The grpc-health-check runs just fine from inside this container. And dagit, which runs off the same image, is able to connect to the grpc server. Any idea what's going on here?
Hi @daniel, could you take a look at this?
d
This is consistent with the daemon pod not being able to reach your gRPC server. Things I would try:
• double-checking that your workspace.yaml (port/host/etc.) maps to what you would expect; try running grpc-health-check from the daemon container with that hostname and port set
• Try turning the schedule off and on again in dagit to make sure it's set with the right values
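for example, if the user code server's docker-compose service is called user_code (an illustrative name - substitute whatever your service is actually called), the daemon's workspace.yaml should point at that hostname rather than localhost:

load_from:
  - grpc_server:
      host: user_code      # the docker-compose service name, not 127.0.0.1
      port: 4000
      location_name: user_code

and then from inside the daemon container, something like this should come back healthy:

dagster api grpc-health-check --host user_code --port 4000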