Harrison Conlin
03/20/2023, 3:46 AM
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE
File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_daemon/backfill.py", line 34, in execute_backfill_iteration
yield from execute_asset_backfill_iteration(backfill, workspace, instance)
File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_core/execution/asset_backfill.py", line 245, in execute_asset_backfill_iteration
submit_run_request(
File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_core/execution/asset_backfill.py", line 283, in submit_run_request
external_pipeline = repo_location.get_external_pipeline(pipeline_selector)
File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_core/host_representation/repository_location.py", line 141, in get_external_pipeline
subset_result = self.get_subset_external_pipeline_result(selector)
File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_core/host_representation/repository_location.py", line 773, in get_subset_external_pipeline_result
return sync_get_external_pipeline_subset_grpc(
File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_api/snapshot_pipeline.py", line 29, in sync_get_external_pipeline_subset_grpc
api_client.external_pipeline_subset(
File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 293, in external_pipeline_subset
res = self._query(
File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 159, in _query
self._raise_grpc_exception(
File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 142, in _raise_grpc_exception
raise DagsterUserCodeUnreachableError(
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1679283310.259159924","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1679283310.259158224","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
> File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 157, in _query
return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 132, in _get_response
return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
Tobias Pankrath
03/20/2023, 8:29 AM
Tobias Pankrath
03/20/2023, 8:29 AM
Tobias Pankrath
03/20/2023, 2:15 PM
Harrison Conlin
03/20/2023, 10:30 PM
daniel
03/21/2023, 12:59 AM
Tobias Pankrath
03/21/2023, 6:23 AM
> humor me, have you got a http_proxy env var set?
No proxy.
> Hi, are either of you using the default run launcher that launches each run in the same process? And are you using any kind of run queue settings to limit the maximum number of runs that can be happening at once? I'm wondering if this error could come from too many concurrent runs happening at the same time due to the backfill and overloading the code server
I am using the multiprocess executor that starts a process per run. But a lot of them (although CPU utilization was never very high).
Tobias Pankrath
03/21/2023, 6:43 AM
Tobias Pankrath
03/21/2023, 8:41 AM
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1679387303.729055401","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1679387303.729054773","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>
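Both traces end in a transient StatusCode.UNAVAILABLE, which clients typically survive by retrying with backoff rather than aborting the whole operation. A rough stdlib sketch of that idea (not Dagster's actual code; `UnavailableError` and `flaky` are hypothetical stand-ins):

```python
import random
import time

class UnavailableError(Exception):
    """Hypothetical stand-in for the gRPC UNAVAILABLE failure in the traces."""

def call_with_retries(fn, attempts=4, base_delay=0.5):
    """Retry a flaky RPC-style call with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except UnavailableError:
            if attempt == attempts - 1:
                raise  # retries exhausted; surface the error
            # back off base_delay, 2x, 4x, ... plus a little jitter
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)

calls = {"n": 0}
def flaky():
    """Simulated code server call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise UnavailableError()
    return "ok"

print(call_with_retries(flaky, base_delay=0.01))  # ok, after two transient failures
```

This only helps if the code server comes back; if the server process is actually dead, every retry fails the same way.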
Harrison Conlin
03/21/2023, 8:50 AM
Tobias Pankrath
03/21/2023, 12:30 PM
daniel
03/21/2023, 2:16 PM
Tobias Pankrath
03/21/2023, 2:19 PM
> Using the run queue to limit the number of runs that can happen at once: https://docs.dagster.io/deployment/run-coordinator#limiting-run-concurrency
I am already doing this and it doesn't help. The ability to just dagster dev and have it work is one of the key features of dagster for me. There is also no reason to stop an entire backfill if one such error occurs. I'll take a look into the docker stuff as well.
daniel
03/21/2023, 2:19 PM
Tobias Pankrath
03/21/2023, 2:20 PM
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 75
Had the same problem with 32.
Tobias Pankrath
03/21/2023, 2:21 PM
daniel
03/21/2023, 2:28 PM
daniel
03/21/2023, 2:29 PM
Tobias Pankrath
03/21/2023, 2:30 PM
> I assume there are no logs in the dagster dev output with any clues about why the process might be becoming unavailable, like some kind of error message shortly before the StatusCode.UNAVAILABLE messages start?
No, nothing suspicious. I looked at htop and it didn't look like it would run OOM, but I cannot be 100% sure.
Tobias Pankrath
03/21/2023, 2:30 PM
Tobias Pankrath
03/21/2023, 2:31 PM
Tobias Pankrath
03/21/2023, 2:33 PM
daniel
03/21/2023, 2:34 PM
Tobias Pankrath
03/21/2023, 2:35 PM
Tobias Pankrath
03/21/2023, 2:38 PM
Tobias Pankrath
03/21/2023, 2:40 PM
Tobias Pankrath
03/21/2023, 2:40 PM
Tobias Pankrath
03/21/2023, 2:42 PM
daniel
03/21/2023, 2:42 PM
daniel
03/21/2023, 2:42 PM
Tobias Pankrath
03/21/2023, 2:44 PM
Tobias Pankrath
03/21/2023, 2:58 PM
I0321 14:57:03.022587668 385241 subchannel.cc:956] subchannel 0x7f5a10373400 {address=unix:/tmp/tmplsdry2hf, args={grpc.client_channel_factory=0x1ff9470, grpc.default_authority=localhost, grpc.default_compression_algorithm=2, grpc.internal.channel_credentials=0x2109320, grpc.internal.security_connector=0x7f5a101bea40, grpc.internal.subchannel_pool=0x23dce80, grpc.max_receive_message_length=50000000, grpc.max_send_message_length=50000000, grpc.primary_user_agent=grpc-python/1.47.5, grpc.resource_quota=0x23ca5c0, grpc.server_uri=unix:/tmp/tmplsdry2hf}}: connect failed ({"created":"@1679410623.022503505","description":"No such file or directory","errno":2,"file":"src/core/lib/iomgr/tcp_client_posix.cc","file_line":297,"os_error":"No such file or directory","syscall":"connect","target_address":"unix:/tmp/tmplsdry2hf"}), backing off for 1000 ms
Tobias Pankrath
03/21/2023, 2:59 PM
GRPC_TRACE=true
Tobias Pankrath
03/21/2023, 3:01 PM
daniel
03/21/2023, 3:08 PM
daniel
03/21/2023, 3:08 PM
Tobias Pankrath
03/21/2023, 3:10 PM
daniel
03/21/2023, 4:02 PM
Tobias Pankrath
03/21/2023, 4:10 PM
Tobias Pankrath
03/21/2023, 4:10 PM
03/21/2023, 4:10 PM536940 /home/bcr88/space/repos/dagster-playground/.direnv/python-3.8.10/bin/python -m dagster api grpc --lazy-load-user-code --socket /tmp/tmp6zg0zdfp --heartbeat --heartbeat-timeout 120 --fixed-server-id aad1a72a-2970-4fce-a10f-2899d4ed4467 --log-level warning --inject-env-vars-from-instance --instance-ref {"__class__": "InstanceRef", "compute_logs_data": {"__class__": "ConfigurableClassData", "class_name": "LocalComputeLogManager", "config_yaml": "base_dir: /home/bcr88/space/dagster-home/storage\n", "module_name": "dagster.core.storage.local_compute_log_manager"}, "custom_instance_class_data": null, "event_storage_data": {"__class__": "ConfigurableClassData", "class_name": "SqliteEventLogStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home/history/runs/\n", "module_name": "dagster.core.storage.event_log"}, "local_artifact_storage_data": {"__class__": "ConfigurableClassData", "class_name": "LocalArtifactStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home\n", "module_name": "dagster.core.storage.root"}, "run_coordinator_data": {"__class__": "ConfigurableClassData", "class_name": "QueuedRunCoordinator", "config_yaml": "max_concurrent_runs: 75\n", "module_name": "dagster.core.run_coordinator"}, "run_launcher_data": {"__class__": "ConfigurableClassData", "class_name": "DefaultRunLauncher", "config_yaml": "{}\n", "module_name": "dagster"}, "run_storage_data": {"__class__": "ConfigurableClassData", "class_name": "SqliteRunStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home/history/\n", "module_name": "dagster.core.storage.runs"}, "schedule_storage_data": {"__class__": "ConfigurableClassData", "class_name": "SqliteScheduleStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home/schedules\n", "module_name": "dagster.core.storage.schedules"}, "scheduler_data": {"__class__": "ConfigurableClassData", "class_name": "DagsterDaemonScheduler", "config_yaml": "{}\n", "module_name": "dagster.core.scheduler"}, 
"secrets_loader_data": null, "settings": {}, "storage_data": {"__class__": "ConfigurableClassData", "class_name": "DagsterSqliteStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home\n", "module_name": "dagster.core.storage.sqlite_storage"}} --location-name intraday_events -m intraday_events -d /home/bcr88/space/repos/dagster-playground/intraday_events
Tobias Pankrath
03/21/2023, 4:11 PM
Tobias Pankrath
03/21/2023, 4:17 PM
daniel
03/21/2023, 4:19 PM
daniel
03/21/2023, 4:20 PM
daniel
03/21/2023, 4:22 PM
daniel
03/21/2023, 4:24 PM
Tobias Pankrath
03/21/2023, 4:30 PM
> You can control some aspects of the fork behavior of the multiprocess_executor via the start_method field: https://docs.dagster.io/concepts/ops-jobs-graphs/job-execution#default-job-executor
Can I configure this globally, e.g. in my dagster.yaml?
daniel
03/21/2023, 4:30 PM
Harrison Conlin
03/22/2023, 5:38 AM
Harrison Conlin
03/22/2023, 5:41 AM
Harrison Conlin
03/22/2023, 6:26 AM
Harrison Conlin
03/22/2023, 6:36 AM
Tobias Pankrath
03/22/2023, 6:44 AM
Tobias Pankrath
03/22/2023, 6:44 AM
Harrison Conlin
03/22/2023, 7:23 AM
Harrison Conlin
03/22/2023, 7:23 AM
dagster job backfill and it's humming away nicely
Harrison Conlin
03/22/2023, 7:29 AM
Tobias Pankrath
03/22/2023, 7:31 AM
Harrison Conlin
03/22/2023, 7:50 AM
Harrison Conlin
03/22/2023, 7:52 AM
daniel
03/22/2023, 8:45 AM
daniel
03/22/2023, 8:46 AM
Tobias Pankrath
03/22/2023, 8:47 AM
daniel
03/22/2023, 8:53 AM
Tobias Pankrath
03/22/2023, 8:54 AM
Tobias Pankrath
03/22/2023, 8:57 AM--all TEXT Specify to select all partitions to backfill.
Does all really take an argument?Tobias Pankrath
03/22/2023, 9:09 AMdagster dev
first and than issued a backfill via cliTobias Pankrath
03/22/2023, 9:40 AM
daniel
03/22/2023, 12:25 PM
daniel
03/22/2023, 2:47 PM
Tobias Pankrath
03/22/2023, 2:49 PM
daniel
03/22/2023, 2:58 PM
daniel
03/22/2023, 2:58 PM
daniel
03/22/2023, 4:16 PM
dagster dev running while running the CLI, although it won't hurt)
Tobias Pankrath
03/22/2023, 4:47 PM
Harrison Conlin
03/22/2023, 10:42 PM
Tobias Pankrath
03/23/2023, 10:11 AM
daniel
03/23/2023, 11:52 AM
Tobias Pankrath
03/23/2023, 11:57 AM
> Starting 1.1.18, users with a gRPC server that could not access the Dagster instance on user code deployments would see an error when launching backfills as the instance could not instantiate. This has been fixed.
I thought that this might be it, from the latest changelog.