jasono

03/08/2022, 4:28 AM
Hi, I had a schedule set up to run a test job every 15 seconds. After confirming it worked, I deactivated the schedule (switched it off in the dagit UI), but the task still runs every 15 seconds. I closed dagster-daemon and dagit to stop that, but it starts again as soon as I restart dagster-daemon, and it repeats the gRPC error message below. How can I hard reset this so that this weird behavior stops?
2022-03-07 20:21:50 -0800 - dagster.daemon.SchedulerDaemon - WARNING - Could not load location auto_tests.py to check for schedules due to the following error: Exception: Timed out waiting for gRPC server to start with arguments: "E:\Data\dagster\venv\Scripts\python.exe -m dagster api grpc --lazy-load-user-code --port 62844 --heartbeat --heartbeat-timeout 120 --fixed-server-id 75ba823e-e008-4b26-a0b4-de63f6482425 --log-level WARNING --use-python-environment-entry-point -f E:\Data\dagster\repo/auto_tests.py". Most recent connection error: dagster.core.errors.DagsterUserCodeUnreachableError: Could not reach user code server

Stack Trace:
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 937, in wait_for_grpc_server
    client.ping("")
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\client.py", line 123, in ping
    res = self._query("Ping", api_pb2.PingRequest, echo=echo)
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\client.py", line 110, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e

The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1646713277.274000000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3135,"referenced_errors":[{"created":"@1646713277.274000000","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"

daniel

03/08/2022, 5:13 AM
Hi, would you mind posting the contents of your workspace.yaml file, or the arguments that you're starting your daemon with? Even if you don't have any schedules running, the daemon is still going to try to load the code in your workspace to see if there are any schedules there. It looks like that's the part that's failing.
If it's not sensitive, maybe you could post the contents of repo/auto_tests.py?

jasono

03/08/2022, 5:25 AM
Sure, please see below.
workspace.yaml:
load_from:
  - python_file: repo/hello_cereal.py
  - python_file: repo/auto_tests.py
  - python_file: repo/dynamic_tests.py
auto_tests.py:
from dagster import RunRequest, ScheduleDefinition, job, op, repository, sensor, DefaultScheduleStatus
import great_expectations as ge

@op
def hello_world():
    pass

@job
def ge_job():
    context = ge.data_context.DataContext(context_root_dir="y:/great_expectations/")
    checkpoint_result = context.run_checkpoint(
        checkpoint_name="my_checkpoint"
    )

job1_schedule = ScheduleDefinition(
    job=ge_job,
    cron_schedule="*/15 * * * *",
    execution_timezone="US/Pacific",
    default_status=DefaultScheduleStatus.RUNNING
)

@repository
def ge__auto_test__repository():
    return [ge_job, job1_schedule]
And finally, for the daemon, my command is the bare-bones
dagster-daemon run

daniel

03/08/2022, 5:53 AM
Ok, so you turned off the schedule in the dagit UI but the schedule never stopped? If your dagit and your daemon are using the same DAGSTER_HOME and are using the same workspace.yaml and running in the same python environment, stopping the schedule in dagit should work. If you don't mind, you could post or DM the contents of your schedules.db file in your DAGSTER_HOME folder and we could take a look
The most likely reason for the grpc error that I can think of is that if a run is getting launched every 15 seconds, eventually it's more runs than your machine can handle so things have trouble starting up
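For anyone following along, here is a minimal sketch of dumping that file with Python's built-in sqlite3 module. The location under DAGSTER_HOME (schedules/schedules.db) is an assumption about the default SQLite layout, and dump_sqlite is a made-up helper name. The script reads table names from sqlite_master instead of assuming them.

```python
import os
import sqlite3
from pathlib import Path


def dump_sqlite(db_path, max_rows=5):
    """Return {table_name: rows} with the first few rows of every table."""
    conn = sqlite3.connect(str(db_path))
    try:
        # Discover the tables instead of hard-coding their names.
        tables = [
            name
            for (name,) in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            table: conn.execute(
                f'SELECT * FROM "{table}" LIMIT ?', (max_rows,)
            ).fetchall()
            for table in tables
        }
    finally:
        conn.close()


if __name__ == "__main__":
    home = os.environ.get("DAGSTER_HOME")
    print("DAGSTER_HOME =", home)
    # Assumed default location of the schedule storage; adjust if yours differs.
    db_file = Path(home or ".") / "schedules" / "schedules.db"
    if db_file.exists():
        for table, rows in dump_sqlite(db_file).items():
            print(table)
            for row in rows:
                print("  ", row)
    else:
        print("No file at", db_file)
```

Running it once from the shell that starts dagit and once from the shell that starts dagster-daemon also shows whether both processes see the same DAGSTER_HOME.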

jasono

03/08/2022, 6:09 AM
1 602a2e394a98ddbe30f3f19bbde31399b6f0488b SUCCESS SCHEDULE 2022-03-08 02:00:00.000000 {"__class__": "TickData", "cursor": null, "error": null, "failure_count": 0, "job_name": "ge_job_schedule", "job_origin_id": "602a2e394a98ddbe30f3f19bbde31399b6f0488b", "job_type": {"__enum__": "InstigatorType.SCHEDULE"}, "origin_run_ids": [], "run_ids": ["57d91897-097f-4f34-857b-a39cff9ffe61"], "run_keys": [], "skip_reason": null, "status": {"__enum__": "TickStatus.SUCCESS"}, "timestamp": 1646704800.0} 2022-03-08 02:00:39 2022-03-08 02:00:39
2 602a2e394a98ddbe30f3f19bbde31399b6f0488b SUCCESS SCHEDULE 2022-03-08 02:15:00.000000 {"__class__": "TickData", "cursor": null, "error": null, "failure_count": 0, "job_name": "ge_job_schedule", "job_origin_id": "602a2e394a98ddbe30f3f19bbde31399b6f0488b", "job_type": {"__enum__": "InstigatorType.SCHEDULE"}, "origin_run_ids": [], "run_ids": ["a5aeb129-03bb-4db4-afe5-8577a65cdfd7"], "run_keys": [], "skip_reason": null, "status": {"__enum__": "TickStatus.SUCCESS"}, "timestamp": 1646705700.0} 2022-03-08 02:15:04 2022-03-08 02:15:04
3 602a2e394a98ddbe30f3f19bbde31399b6f0488b SUCCESS SCHEDULE 2022-03-08 02:30:00.000000 {"__class__": "TickData", "cursor": null, "error": null, "failure_count": 0, "job_name": "ge_job_schedule", "job_origin_id": "602a2e394a98ddbe30f3f19bbde31399b6f0488b", "job_type": {"__enum__": "InstigatorType.SCHEDULE"}, "origin_run_ids": [], "run_ids": ["2e602097-48f5-4b13-934e-cc5e1d2526f2"], "run_keys": [], "skip_reason": null, "status": {"__enum__": "TickStatus.SUCCESS"}, "timestamp": 1646706600.0} 2022-03-08 02:30:03 2022-03-08 02:30:03
4 602a2e394a98ddbe30f3f19bbde31399b6f0488b SUCCESS SCHEDULE 2022-03-08 03:00:00.000000 {"__class__": "TickData", "cursor": null, "error": null, "failure_count": 0, "job_name": "ge_job_schedule", "job_origin_id": "602a2e394a98ddbe30f3f19bbde31399b6f0488b", "job_type": {"__enum__": "InstigatorType.SCHEDULE"}, "origin_run_ids": [], "run_ids": ["9062d0eb-f100-417e-a235-26a69624c4af"], "run_keys": [], "skip_reason": null, "status": {"__enum__": "TickStatus.SUCCESS"}, "timestamp": 1646708400.0} 2022-03-08 03:00:26 2022-03-08 03:00:26
5 602a2e394a98ddbe30f3f19bbde31399b6f0488b SUCCESS SCHEDULE 2022-03-08 03:15:00.000000 {"__class__": "TickData", "cursor": null, "error": null, "failure_count": 0, "job_name": "ge_job_schedule", "job_origin_id": "602a2e394a98ddbe30f3f19bbde31399b6f0488b", "job_type": {"__enum__": "InstigatorType.SCHEDULE"}, "origin_run_ids": [], "run_ids": ["f9f3f5a6-bcc6-4bd2-9783-dbc1fc55fe5a"], "run_keys": [], "skip_reason": null, "status": {"__enum__": "TickStatus.SUCCESS"}, "timestamp": 1646709300.0} 2022-03-08 03:15:04 2022-03-08 03:15:04
Correction: the above is actually the content of the job-ticks table, not the job table.
Below is the job table.
602a2e394a98ddbe30f3f19bbde31399b6f0488b bb5b6fe5f352839b38265dfa66badc6a78fb22a0 STOPPED SCHEDULE {"__class__": "InstigatorState", "job_specific_data": {"__class__": "ScheduleInstigatorData", "cron_schedule": "*/15 * * * *", "start_timestamp": null}, "job_type": {"__enum__": "InstigatorType.SCHEDULE"}, "origin": {"__class__": "ExternalJobOrigin", "external_repository_origin": {"__class__": "ExternalRepositoryOrigin", "repository_location_origin": {"__class__": "ManagedGrpcPythonEnvRepositoryLocationOrigin", "loadable_target_origin": {"__class__": "LoadableTargetOrigin", "attribute": null, "executable_path": "E:\\Data\\dagster\\venv\\Scripts\\python.exe", "module_name": null, "package_name": null, "python_file": "E:\\Data\\dagster\\repo/auto_tests.py", "working_directory": null}, "location_name": "auto_tests.py"}, "repository_name": "ge__auto_test__repository"}, "job_name": "ge_job_schedule"}, "status": {"__enum__": "InstigatorStatus.STOPPED"}} 2022-03-08 03:38:59 2022-03-08 03:38:59
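As a quick sanity check on how often the schedule actually ticked, the timestamp fields from the job-ticks rows earlier in this message can be diffed. This is only a sketch, and tick_gaps is a made-up helper name. The gaps are in seconds, so 900 is 15 minutes, which matches the five-field cron string */15 * * * *.

```python
# "timestamp" values copied from the five TickData rows posted above.
tick_timestamps = [
    1646704800.0,
    1646705700.0,
    1646706600.0,
    1646708400.0,
    1646709300.0,
]


def tick_gaps(timestamps):
    """Return the number of seconds between consecutive ticks."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


gaps = tick_gaps(tick_timestamps)
print(gaps)  # [900.0, 900.0, 1800.0, 900.0]
```

The single 1800-second gap corresponds to the 02:45 tick, which is not among the pasted rows.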

daniel

03/08/2022, 3:00 PM
are you sure that your daemon and your dagit are using the same value of DAGSTER_HOME? From that table you posted it looks like the schedule is stopped. If it's not, could you post the output of the daemon process where it's saying that the schedule is still running?
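For reference, the status can be read out of the serialized column in the row posted above with nothing more than the json module. This is a sketch: instigator_status is a hypothetical helper, and the JSON string is the posted row with the "origin" and "job_name" fields trimmed for brevity.

```python
import json

# Serialized state from the job-table row above, trimmed to the fields used here.
job_body = (
    '{"__class__": "InstigatorState", '
    '"job_specific_data": {"__class__": "ScheduleInstigatorData", '
    '"cron_schedule": "*/15 * * * *", "start_timestamp": null}, '
    '"job_type": {"__enum__": "InstigatorType.SCHEDULE"}, '
    '"status": {"__enum__": "InstigatorStatus.STOPPED"}}'
)


def instigator_status(serialized):
    """Return the bare status name, e.g. 'STOPPED', from a serialized state."""
    state = json.loads(serialized)
    # The enum is stored as "InstigatorStatus.<NAME>"; keep only <NAME>.
    return state["status"]["__enum__"].split(".", 1)[1]


print(instigator_status(job_body))  # STOPPED
```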

jasono

03/08/2022, 5:02 PM
Yes, they are using the same DAGSTER_HOME. I checked the environment variables, and the daemon also appears in Dagit's daemon list in the Status menu.
Here is the daemon output. I terminated daemon last night, and just restarted it but it’s still having the issue.
$ dagster-daemon run
2022-03-08 09:02:46 -0800 - dagster.daemon - INFO - instance is configured with the following daemons: ['BackfillDaemon', 'SchedulerDaemon', 'SensorDaemon']
Calculating Metrics: 100%|########################################################################################################################| 15/15 [00:00<00:00, 68.57it/s]
E:\Data\dagster\venv\lib\site-packages\jinja2\environment.py:1088: DeprecationWarning: 'soft_unicode' has been renamed to 'soft_str'. The old name will be removed in MarkupSafe 2.1.
  return concat(self.root_render_func(self.new_context(vars)))
Calculating Metrics: 100%|########################################################################################################################| 15/15 [00:00<00:00, 96.00it/s]
2022-03-08 09:03:34 -0800 - dagster.daemon.SensorDaemon - WARNING - Could not load location auto_tests.py to check for sensors due to the following error: Exception: Timed out waiting for gRPC server to start with arguments: "E:\Data\dagster\venv\Scripts\python.exe -m dagster api grpc --lazy-load-user-code --port 50654 --heartbeat --heartbeat-timeout 120 --fixed-server-id 81c77561-3f33-4be9-afd4-71f36336169b --log-level WARNING --use-python-environment-entry-point -f E:\Data\dagster\repo/auto_tests.py". Most recent connection error: dagster.core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
 
Stack Trace:
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 937, in wait_for_grpc_server
    client.ping("")
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\client.py", line 123, in ping
    res = self._query("Ping", api_pb2.PingRequest, echo=echo)
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\client.py", line 110, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
 
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1646759002.877000000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3135,"referenced_errors":[{"created":"@1646759002.877000000","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
> 
 
Stack Trace:
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\client.py", line 107, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "E:\Data\dagster\venv\lib\site-packages\grpc\_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "E:\Data\dagster\venv\lib\site-packages\grpc\_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
 
 
Stack Trace:
  File "E:\Data\dagster\venv\lib\site-packages\dagster\core\host_representation\grpc_server_registry.py", line 207, in _get_grpc_endpoint
    server_process = GrpcServerProcess(
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 1083, in __init__
    self.server_process, self.port = open_server_process_on_dynamic_port(
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 1031, in open_server_process_on_dynamic_port
    server_process = open_server_process(
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 1008, in open_server_process
    wait_for_grpc_server(server_process, client, subprocess_args, timeout=startup_timeout)
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 943, in wait_for_grpc_server
    raise Exception(
 
2022-03-08 09:03:34 -0800 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2022-03-08 09:03:34 -0800 - dagster.daemon.SchedulerDaemon - WARNING - Could not load location auto_tests.py to check for schedules due to the following error: Exception: Timed out waiting for gRPC server to start with arguments: "E:\Data\dagster\venv\Scripts\python.exe -m dagster api grpc --lazy-load-user-code --port 50654 --heartbeat --heartbeat-timeout 120 --fixed-server-id 81c77561-3f33-4be9-afd4-71f36336169b --log-level WARNING --use-python-environment-entry-point -f E:\Data\dagster\repo/auto_tests.py". Most recent connection error: dagster.core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
 
Stack Trace:
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 937, in wait_for_grpc_server
    client.ping("")
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\client.py", line 123, in ping
    res = self._query("Ping", api_pb2.PingRequest, echo=echo)
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\client.py", line 110, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
 
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1646759002.877000000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3135,"referenced_errors":[{"created":"@1646759002.877000000","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
> 
 
Stack Trace:
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\client.py", line 107, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "E:\Data\dagster\venv\lib\site-packages\grpc\_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "E:\Data\dagster\venv\lib\site-packages\grpc\_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
 
 
Stack Trace:
  File "E:\Data\dagster\venv\lib\site-packages\dagster\core\host_representation\grpc_server_registry.py", line 207, in _get_grpc_endpoint
    server_process = GrpcServerProcess(
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 1083, in __init__
    self.server_process, self.port = open_server_process_on_dynamic_port(
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 1031, in open_server_process_on_dynamic_port
    server_process = open_server_process(
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 1008, in open_server_process
    wait_for_grpc_server(server_process, client, subprocess_args, timeout=startup_timeout)
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 943, in wait_for_grpc_server
    raise Exception(
 
2022-03-08 09:03:49 -0800 - dagster.daemon.SensorDaemon - WARNING - Could not load location auto_tests.py to check for sensors due to the following error: Exception: Timed out waiting for gRPC server to start with arguments: "E:\Data\dagster\venv\Scripts\python.exe -m dagster api grpc --lazy-load-user-code --port 50654 --heartbeat --heartbeat-timeout 120 --fixed-server-id 81c77561-3f33-4be9-afd4-71f36336169b --log-level WARNING --use-python-environment-entry-point -f E:\Data\dagster\repo/auto_tests.py". Most recent connection error: dagster.core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
 
Stack Trace:
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 937, in wait_for_grpc_server
    client.ping("")
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\client.py", line 123, in ping
    res = self._query("Ping", api_pb2.PingRequest, echo=echo)
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\client.py", line 110, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
 
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1646759002.877000000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3135,"referenced_errors":[{"created":"@1646759002.877000000","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
> 
 
Stack Trace:
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\client.py", line 107, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "E:\Data\dagster\venv\lib\site-packages\grpc\_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "E:\Data\dagster\venv\lib\site-packages\grpc\_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
 
 
Stack Trace:
  File "E:\Data\dagster\venv\lib\site-packages\dagster\core\host_representation\grpc_server_registry.py", line 207, in _get_grpc_endpoint
    server_process = GrpcServerProcess(
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 1083, in __init__
    self.server_process, self.port = open_server_process_on_dynamic_port(
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 1031, in open_server_process_on_dynamic_port
    server_process = open_server_process(
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 1008, in open_server_process
    wait_for_grpc_server(server_process, client, subprocess_args, timeout=startup_timeout)
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 943, in wait_for_grpc_server
    raise Exception(
 
2022-03-08 09:03:49 -0800 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2022-03-08 09:03:49 -0800 - dagster.daemon.SchedulerDaemon - WARNING - Could not load location auto_tests.py to check for schedules due to the following error: Exception: Timed out waiting for gRPC server to start with arguments: "E:\Data\dagster\venv\Scripts\python.exe -m dagster api grpc --lazy-load-user-code --port 50654 --heartbeat --heartbeat-timeout 120 --fixed-server-id 81c77561-3f33-4be9-afd4-71f36336169b --log-level WARNING --use-python-environment-entry-point -f E:\Data\dagster\repo/auto_tests.py". Most recent connection error: dagster.core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
 
Stack Trace:
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\server.py", line 937, in wait_for_grpc_server
    client.ping("")
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\client.py", line 123, in ping
    res = self._query("Ping", api_pb2.PingRequest, echo=echo)
  File "E:\Data\dagster\venv\lib\site-packages\dagster\grpc\client.py", line 110, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
 
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"

daniel

03/08/2022, 11:25 PM
So what that error is saying is that it's trying to spin up a process that loads your code, but that process is taking more than 60 seconds to start up. In the past I've seen this when the box running dagster is overworked or running out of resources. Is there a way to check whether that's the case? You could also try running the same command that it says is timing out and see whether it's able to start up a code server:
E:\Data\dagster\venv\Scripts\python.exe -m dagster api grpc --lazy-load-user-code --port 50654 --heartbeat --heartbeat-timeout 120 --fixed-server-id 81c77561-3f33-4be9-afd4-71f36336169b --log-level WARNING --use-python-environment-entry-point -f E:\Data\dagster\repo/auto_tests.py
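To put a number on how long that command takes, one option is a small timing wrapper like the sketch below. It starts the command and measures how long it takes until the port accepts a TCP connection, which is only a rough proxy for the Ping check in the stack trace. The helper name seconds_until_listening and the 0.5-second polling interval are assumptions made for illustration. The 60-second default mirrors the startup limit mentioned above.

```python
import socket
import subprocess
import time


def seconds_until_listening(command, port, host="localhost", timeout=60.0):
    """Start `command` and return how many seconds passed before `port`
    accepted a TCP connection, or None if that never happened in time."""
    started = time.monotonic()
    process = subprocess.Popen(command)
    try:
        while time.monotonic() - started < timeout:
            if process.poll() is not None:
                return None  # the process exited before it started listening
            try:
                with socket.create_connection((host, port), timeout=1.0):
                    return time.monotonic() - started
            except OSError:
                time.sleep(0.5)  # not listening yet; try again shortly
        return None
    finally:
        # Always stop the child process once the measurement is done.
        process.terminate()
        process.wait()
```

Called with the command above split into an argument list and port=50654, a result close to or above 60 seconds (or None) would be consistent with the timeout in the daemon log, while a result of a few seconds would point elsewhere.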