# ask-community
h
I also seem to get these on backfills after a few minutes
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_daemon/backfill.py", line 34, in execute_backfill_iteration
    yield from execute_asset_backfill_iteration(backfill, workspace, instance)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_core/execution/asset_backfill.py", line 245, in execute_asset_backfill_iteration
    submit_run_request(
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_core/execution/asset_backfill.py", line 283, in submit_run_request
    external_pipeline = repo_location.get_external_pipeline(pipeline_selector)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_core/host_representation/repository_location.py", line 141, in get_external_pipeline
    subset_result = self.get_subset_external_pipeline_result(selector)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_core/host_representation/repository_location.py", line 773, in get_subset_external_pipeline_result
    return sync_get_external_pipeline_subset_grpc(
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_api/snapshot_pipeline.py", line 29, in sync_get_external_pipeline_subset_grpc
    api_client.external_pipeline_subset(
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 293, in external_pipeline_subset
    res = self._query(
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 159, in _query
    self._raise_grpc_exception(
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 142, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1679283310.259159924","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1679283310.259158224","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 157, in _query
    return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 132, in _get_response
    return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
t
I have the same issue
One big problem is that the backfill then stops after the first occurrence
This is hard-blocking me right now. Is there any other way to regenerate an asset? Deleting it and rematerializing it, or something like that?
h
humor me, have you got a http_proxy env var set?
d
Hi, are either of you using the default run launcher that launches each run in the same process? And are you using any kind of run queue settings to limit the maximum number of runs that can be happening at once? I’m wondering if this error could come from too many concurrent runs happening at the same time due to the backfill and overloading the code server
t
humor me, have you got a http_proxy env var set?
No proxy.
Hi, are either of you using the default run launcher that launches each run in the same process? And are you using any kind of run queue settings to limit the maximum number of runs that can be happening at once? I’m wondering if this error could come from too many concurrent runs happening at the same time due to the backfill and overloading the code server
I am using the multiprocess executor that starts a process per run. But a lot of them (although CPU utilization was never very high)
What would be a sensible value here?
Reduced it by half (32 on a 64 core machine). Still happening.
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1679387303.729055401","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1679387303.729054773","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>
h
multiprocess executor but only 4 runs at a time
t
For what it is worth: With a retry policy the problem seems to go away
d
The default run launcher does each run in a subprocess on the gRPC server, so there's a risk, when a high number of runs are happening at once with no run queue in place, that the server gets overloaded. A couple of things I would suggest here to mitigate this:
• Using the run queue to limit the number of runs that can happen at once: https://docs.dagster.io/deployment/run-coordinator#limiting-run-concurrency
• Using a different run launcher like the DockerRunLauncher (or k8s or ECS, although deploying to those is more involved), which launches each run in its own container that is fully isolated from the gRPC server: https://docs.dagster.io/deployment/guides/docker#launching-runs-in-containers
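For reference, a minimal dagster.yaml sketch of the DockerRunLauncher option (requires the dagster-docker package; the network and env var names below are assumptions to adapt):
run_launcher:
  module: dagster_docker
  class: DockerRunLauncher
  config:
    # Assumed Docker network shared with the webserver/daemon containers
    network: dagster_network
    # Names of environment variables to forward into each run container (examples)
    env_vars:
      - DAGSTER_POSTGRES_PASSWORD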
t
Thanks for your answer.
Using the run queue to limit the number of runs that can happen at once: https://docs.dagster.io/deployment/run-coordinator#limiting-run-concurrency
I am already doing this and it doesn't help. The ability to just
dagster dev
and have it work is one of the key features of dagster for me. There is also no reason to stop an entire backfill if one such error occurs. I'll take a look into the docker stuff as well.
d
What are you setting the limit to?
t
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 75
Had the same problem with 32.
(32 core machine)
d
While recognizing that this wouldn't work for you in production, what if you set it to a much lower number (say, 4 or 2)? Is the backfill able to submit all the runs? That would confirm the theory that it's the runs all happening in the same process that's causing the problem. You've mentioned the number of cores/CPU, but is there any chance that the runs are memory-intensive and the gRPC server might be running out of memory?
I assume there are no logs in the dagster dev output with any clues about why the process might be becoming unavailable, like some kind of error message shortly before the StatusCode.UNAVAILABLE messages start?
t
I assume there are no logs in the dagster dev output with any clues about why the process might be becoming unavailable, like some kind of error message shortly before the StatusCode.UNAVAILABLE messages start?
No, nothing suspicious. I looked at
htop
and it didn't look like it would run OOM, but I cannot be 100% sure.
I'll watch it if it happens again
I have one asset that's very memory intensive and one that isn't (it's currently running with 75 concurrent runs at 6/64G memory consumption) and it happened to both.
And it happened again, but I set a RetryPolicy, which seems to be a workaround
d
I'm having trouble understanding why a RetryPolicy would help, I don't believe that's checked during backfills
t
Yeah, my bad. Double checked it, doesn't help :(
Always fails around 500 partitions
I increased the file descriptor limit, maybe it's that, but then something would be leaking them.
(maybe my io managers)
but they should be running in a subprocess and thus be cleaned up automatically, shouldn't they?
d
every run would be happening in its own subprocess, yeah
(i'm not totally certain if it follows that all file descriptors would be cleaned up)
t
if you receive them in the child process (instead of e.g. inheriting them) then yes.
I0321 14:57:03.022587668  385241 subchannel.cc:956]          subchannel 0x7f5a10373400 {address=unix:/tmp/tmplsdry2hf, args={grpc.client_channel_factory=0x1ff9470, grpc.default_authority=localhost, grpc.default_compression_algorithm=2, grpc.internal.channel_credentials=0x2109320, grpc.internal.security_connector=0x7f5a101bea40, grpc.internal.subchannel_pool=0x23dce80, grpc.max_receive_message_length=50000000, grpc.max_send_message_length=50000000, grpc.primary_user_agent=grpc-python/1.47.5, grpc.resource_quota=0x23ca5c0, grpc.server_uri=unix:/tmp/tmplsdry2hf}}: connect failed ({"created":"@1679410623.022503505","description":"No such file or directory","errno":2,"file":"src/core/lib/iomgr/tcp_client_posix.cc","file_line":297,"os_error":"No such file or directory","syscall":"connect","target_address":"unix:/tmp/tmplsdry2hf"}), backing off for 1000 ms
from
GRPC_TRACE=true
So maybe it's a race condition between the creation of that unix socket and starting the subprocess
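For anyone reproducing this, the trace above can be captured with gRPC's standard debugging environment variables (generic gRPC settings, not Dagster-specific; the tracer list can be narrowed or widened as needed):
GRPC_VERBOSITY=DEBUG GRPC_TRACE=subchannel,connectivity_state dagster dev 2> grpc_trace.log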
d
Is there any indication of which grpc call that error is coming from? I wouldn't expect something like that to make the whole server unavailable
broadly i think moving the runs to a separate place from the gRPC server is likely to help here
t
All children seem to have their own socket, and only one seems to be affected at a time
d
the way that it works is each run happens in a subprocess (using a Python multiprocess context: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_grpc/server.py#L738-L748 ) - i wouldn't expect each run to interfere with the gRPC server machinery once it's started, but I would expect resource issues in one run to potentially affect other runs since they aren't particularly isolated (until you switch to something like docker or k8s)
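As a standalone illustration of the descriptor question above (not Dagster's actual code): a Python multiprocessing context started with "spawn" launches a fresh interpreter, so the child only sees resources passed to it explicitly, whereas "fork" copies the parent's process image, open file descriptors included.
import multiprocessing
import os

def child():
    # Under "spawn" this runs in a fresh interpreter; under "fork" it starts
    # as a copy of the parent, inheriting its open descriptors and sockets.
    print("child pid:", os.getpid())

if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")  # alternatives: "fork", "forkserver"
    p = ctx.Process(target=child)
    p.start()
    p.join()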
t
does the multi_process_executor fork without exec? because all child processes look like they're doing something with gRPC:
536940 /home/bcr88/space/repos/dagster-playground/.direnv/python-3.8.10/bin/python -m dagster api grpc --lazy-load-user-code --socket /tmp/tmp6zg0zdfp --heartbeat --heartbeat-timeout 120 --fixed-server-id aad1a72a-2970-4fce-a10f-2899d4ed4467 --log-level warning --inject-env-vars-from-instance --instance-ref {"__class__": "InstanceRef", "compute_logs_data": {"__class__": "ConfigurableClassData", "class_name": "LocalComputeLogManager", "config_yaml": "base_dir: /home/bcr88/space/dagster-home/storage\n", "module_name": "dagster.core.storage.local_compute_log_manager"}, "custom_instance_class_data": null, "event_storage_data": {"__class__": "ConfigurableClassData", "class_name": "SqliteEventLogStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home/history/runs/\n", "module_name": "dagster.core.storage.event_log"}, "local_artifact_storage_data": {"__class__": "ConfigurableClassData", "class_name": "LocalArtifactStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home\n", "module_name": "dagster.core.storage.root"}, "run_coordinator_data": {"__class__": "ConfigurableClassData", "class_name": "QueuedRunCoordinator", "config_yaml": "max_concurrent_runs: 75\n", "module_name": "dagster.core.run_coordinator"}, "run_launcher_data": {"__class__": "ConfigurableClassData", "class_name": "DefaultRunLauncher", "config_yaml": "{}\n", "module_name": "dagster"}, "run_storage_data": {"__class__": "ConfigurableClassData", "class_name": "SqliteRunStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home/history/\n", "module_name": "dagster.core.storage.runs"}, "schedule_storage_data": {"__class__": "ConfigurableClassData", "class_name": "SqliteScheduleStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home/schedules\n", "module_name": "dagster.core.storage.schedules"}, "scheduler_data": {"__class__": "ConfigurableClassData", "class_name": "DagsterDaemonScheduler", "config_yaml": "{}\n", "module_name": "dagster.core.scheduler"}, "secrets_loader_data": null, "settings": {}, "storage_data": {"__class__": "ConfigurableClassData", "class_name": "DagsterSqliteStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home\n", "module_name": "dagster.core.storage.sqlite_storage"}} --location-name intraday_events -m intraday_events -d /home/bcr88/space/repos/dagster-playground/intraday_events
All of them look like this
For the move to docker: can I combine the DockerRunLauncher with the in_process_executor?
d
You can, yeah
that would work with both the default run launcher and docker run launcher - executors control individual steps, run launchers control the job as a whole
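A minimal sketch of that split (hypothetical names, single placeholder asset): the executor decides how steps run inside one launched run, while the run launcher configured in dagster.yaml decides where that run lives.
from dagster import Definitions, asset, define_asset_job, in_process_executor

@asset
def example_asset():
    # Placeholder asset for illustration
    return 1

# Steps in this job run serially inside the run's own process; pairing it with
# the DockerRunLauncher would put each such run in its own container.
serial_job = define_asset_job(
    "serial_job",
    selection=[example_asset],
    executor_def=in_process_executor,
)

defs = Definitions(assets=[example_asset], jobs=[serial_job])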
You can control some aspects of the fork behavior of the multiprocess_executor via the start_method field: https://docs.dagster.io/concepts/ops-jobs-graphs/job-execution#default-job-executor
It uses the python multiprocessing context to spawn a new subprocess and passes in that start method, but i'm not immediately sure how that maps to fork() vs. exec()
t
You can control some aspects of the fork behavior of the multiprocess_executor via the start_method field: https://docs.dagster.io/concepts/ops-jobs-graphs/job-execution#default-job-executor
Can I configure this globally, e.g. in my dagster.yaml?
d
I don't believe that's currently possible, but you can configure a default executor on your Definitions object
❤️ 1
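A sketch of that Definitions-level default (placeholder asset; the executor config keys are assumptions to verify against the docs linked above):
from dagster import Definitions, asset, multiprocess_executor

@asset
def example_asset():
    # Placeholder asset for illustration
    return 1

# Used as the default executor for every job in these Definitions; "spawn"
# avoids fork-style inheritance of the parent's descriptors.
defs = Definitions(
    assets=[example_asset],
    executor=multiprocess_executor.configured(
        {"max_concurrent": 4, "start_method": {"spawn": {}}}
    ),
)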
h
so I was initially only seeing this on my linux box (pre-production environment) but am now seeing this on my windows development box. the one thing I do have in common with Tobias is a large number of partitions (400, 950 and 600 if I look at my current running jobs) and one just failed on a 43-partition run 😞
I'm wondering if the max_user_code_failure_retries option would help
we're down to only 1 concurrent run and still hitting the grpc timeout. ironically enough, max_concurrent_runs: 0 actually limits it to zero runs! so giving that a shot
and even with 0 runs, we're still hitting grpc timeouts. I can see the dagster daemon is using ~10% cpu and the api process is like ~5% cpu.
t
Changing the multiprocess behaviour from fork to spawn didn't help either.
Also increasing the ulimit for file descriptors to unlimited didn't help.
h
ok so I've got an interesting one, try launching the runs from the command line
I did a
dagster job backfill
and it's humming away nicely
it doesn't seem to come up on the backfills page of dagit but I can see the number of runs increasing
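For reference, an invocation along these lines (job and partition names are hypothetical; check the exact flag spellings against dagster job backfill --help, and add the usual workspace/module target flags for your project):
dagster job backfill --job my_partitioned_job --partitions 2023-01-01,2023-01-02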
t
Keep me posted
h
ok well that worked great, I'm going to try with another backfill job that was being a pain and dying early on
and it came through on the dagit backfills page, once it had finished queuing all the runs
🙏 1
d
If it consistently fails when running “dagster dev” or in the daemon but never fails when running “dagster job backfill” then that’s a very helpful clue
I’d be curious if you see that same behavior Tobias
t
I'll check later
d
Also a helpful clue that it still happens when no runs are being launched at all
t
How do I launch a backfill for an asset if I have no job for that asset?
--all TEXT                    Specify to select all partitions to backfill.
Does all really take an argument?
So, I did it for a job I have, although it's very long running. I started
dagster dev
first and then issued a backfill via the CLI
looks like this helps
d
Ok, with this information I have a theory about what this might be, I’ll see if I can reproduce the problem myself today
Trying to reproduce with ~400 partitions locally - how many assets are typically in a backfill where it fails?
t
fails at around 500
d
aha, I have reproduced the problem. Only a matter of time now :)
❤️ 1
happened at 897 for me
OK, here's a fix that I believe will squash this: https://github.com/dagster-io/dagster/pull/13085 - we should be able to get this out a week from today, thanks for reporting the problem. Running from the CLI should work in the meantime as a workaround (you shouldn't need to have
dagster dev
running while running the CLI, although it won't hurt)
❤️ 1
t
Thanks a lot
h
I go to bed, wake up and @daniel's got a pull request merged in. champion 🙂 thanks again Tobias for verifying it's not just a me problem 🙂
t
And it's already in the release.
d
It’s in master, it’ll be out in the release next Wednesday
t
Yup, noticed.
Starting 1.1.18, users with a gRPC server that could not access the Dagster instance on user code deployments would see an error when launching backfills as the instance could not instantiate. This has been fixed.
I thought that this was it, from the latest changelog.