# ask-community

Harrison Conlin

03/20/2023, 3:46 AM
I also seem to get these on backfills after a few minutes
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_daemon/backfill.py", line 34, in execute_backfill_iteration
    yield from execute_asset_backfill_iteration(backfill, workspace, instance)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_core/execution/asset_backfill.py", line 245, in execute_asset_backfill_iteration
    submit_run_request(
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_core/execution/asset_backfill.py", line 283, in submit_run_request
    external_pipeline = repo_location.get_external_pipeline(pipeline_selector)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_core/host_representation/repository_location.py", line 141, in get_external_pipeline
    subset_result = self.get_subset_external_pipeline_result(selector)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_core/host_representation/repository_location.py", line 773, in get_subset_external_pipeline_result
    return sync_get_external_pipeline_subset_grpc(
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_api/snapshot_pipeline.py", line 29, in sync_get_external_pipeline_subset_grpc
    api_client.external_pipeline_subset(
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 293, in external_pipeline_subset
    res = self._query(
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 159, in _query
    self._raise_grpc_exception(
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 142, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1679283310.259159924","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1679283310.259158224","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 157, in _query
    return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 132, in _get_response
    return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)

Tobias Pankrath

03/20/2023, 8:29 AM
I have the same issue
One big problem is that the backfill then stops after the first occurrence.
This is hard-blocking me right now. Is there any other way to regenerate an asset? Deleting it and rematerializing it, or something?

Harrison Conlin

03/20/2023, 10:30 PM
humor me, have you got a http_proxy env var set?

daniel

03/21/2023, 12:59 AM
Hi, are either of you using the default run launcher that launches each run in the same process? And are you using any kind of run queue settings to limit the maximum number of runs that can be happening at once? I’m wondering if this error could come from too many concurrent runs happening at the same time due to the backfill and overloading the code server

Tobias Pankrath

03/21/2023, 6:23 AM
> humor me, have you got a http_proxy env var set?
No proxy.
> Hi, are either of you using the default run launcher that launches each run in the same process? And are you using any kind of run queue settings to limit the maximum number of runs that can be happening at once? I’m wondering if this error could come from too many concurrent runs happening at the same time due to the backfill and overloading the code server
I am using the multiprocess executor that starts a process per run. But a lot of them (although CPU utilization was never very high).
What would be a sensible value here?
Reduced it by half (32 on a 64 core machine). Still happening.
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1679387303.729055401","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1679387303.729054773","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>

Harrison Conlin

03/21/2023, 8:50 AM
multiprocess executor but only 4 runs at a time

Tobias Pankrath

03/21/2023, 12:30 PM
For what it is worth: With a retry policy the problem seems to go away

daniel

03/21/2023, 2:16 PM
The default run launcher does each run in a subprocess on the gRPC server, so when a high number of runs happen at once with no run queue in place, there's a risk that the server gets overloaded. A couple of things I would suggest here to mitigate this:
• Using the run queue to limit the number of runs that can happen at once: https://docs.dagster.io/deployment/run-coordinator#limiting-run-concurrency
• Using a different run launcher like the DockerRunLauncher (or k8s or ECS, although deploying to those is more involved), which launches each run in its own container that is fully isolated from the gRPC server: https://docs.dagster.io/deployment/guides/docker#launching-runs-in-containers

Tobias Pankrath

03/21/2023, 2:19 PM
Thanks for your answer.
> Using the run queue to limit the number of runs that can happen at once: https://docs.dagster.io/deployment/run-coordinator#limiting-run-concurrency
I am already doing this and it doesn't help. The ability to just `dagster dev` and have it work is one of the key features of dagster for me. There is also no reason to stop an entire backfill if one such error occurs. I'll take a look into the docker stuff as well.

daniel

03/21/2023, 2:19 PM
What are you setting the limit to?

Tobias Pankrath

03/21/2023, 2:20 PM
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 75
Had the same problem with 32.
(32 core machine)

daniel

03/21/2023, 2:28 PM
While recognizing that this wouldn't work for you in production, what if you set it to a much lower number (say, 4 or 2)? Is the backfill able to submit all the runs? That would confirm the theory that the runs all happening in the same process is what's causing the problem. You've mentioned the number of cores/CPU, but is there any chance that the runs are memory-intensive and the gRPC server might be running out of memory?
I assume there are no logs in the dagster dev output with any clues about why the process might be becoming unavailable, like some kind of error message shortly before the StatusCode.UNAVAILABLE messages start?

Tobias Pankrath

03/21/2023, 2:30 PM
> I assume there are no logs in the dagster dev output with any clues about why the process might be becoming unavailable, like some kind of error message shortly before the StatusCode.UNAVAILABLE messages start?
No, nothing suspicious. I looked at `htop` and it didn't look like it would run OOM, but I cannot be 100% sure.
I'll watch it if it happens again.
I have one asset that's very memory intensive and one that isn't (it's currently running with 75 concurrent runs at 6/64G memory consumption) and it happened to both.
And it happened again, but I set a RetryPolicy, which seems to be a workaround

daniel

03/21/2023, 2:34 PM
I'm having trouble understanding why a RetryPolicy would help; I don't believe that's checked during backfills

Tobias Pankrath

03/21/2023, 2:35 PM
Yeah, my bad. Double checked it, doesn't help :(
Always fails around 500 partitions
I increased the file descriptor limit; maybe it's that, but then something would have to be leaking them.
(maybe my io managers)
but they should be running in a subprocess and thus cleaned up automatically, shouldn't they?

daniel

03/21/2023, 2:42 PM
every run would be happening in its own subprocess, yeah
(i'm not totally certain if it follows that all file descriptors would be cleaned up)
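
One way to check this empirically is to count the parent's open descriptors before and after a child process opens files and exits without closing them. The sketch below is only an illustration and is not Dagster code: it assumes Linux (it reads /proc/self/fd) and uses the fork start method, and the helper names are hypothetical.

```python
import multiprocessing as mp
import os
import tempfile


def count_open_fds() -> int:
    """Count this process's open file descriptors (Linux-only: reads /proc)."""
    return len(os.listdir("/proc/self/fd"))


def leaky_child(path: str) -> None:
    """Open many files in the child and exit without closing any of them."""
    handles = [open(path) for _ in range(100)]
    # These descriptors belong to the child; the OS releases them on exit.
    os._exit(0)


def fds_leaked_by_child() -> int:
    """Return how many descriptors the parent gained after the child ran."""
    with tempfile.NamedTemporaryFile() as tmp:
        before = count_open_fds()
        # Assumption: "fork" is available (POSIX only).
        proc = mp.get_context("fork").Process(target=leaky_child, args=(tmp.name,))
        proc.start()
        proc.join()
        proc.close()  # release the parent's handle on the child process
        return count_open_fds() - before
```

Descriptors opened inside the child do not show up in the parent's table. Descriptors the parent opens itself, including any it keeps per child, are a separate matter and would have to be closed by the parent.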

Tobias Pankrath

03/21/2023, 2:44 PM
if you receive them in the child process (instead of e.g. inheriting them) then yes.
I0321 14:57:03.022587668  385241 subchannel.cc:956]          subchannel 0x7f5a10373400 {address=unix:/tmp/tmplsdry2hf, args={grpc.client_channel_factory=0x1ff9470, grpc.default_authority=localhost, grpc.default_compression_algorithm=2, grpc.internal.channel_credentials=0x2109320, grpc.internal.security_connector=0x7f5a101bea40, grpc.internal.subchannel_pool=0x23dce80, grpc.max_receive_message_length=50000000, grpc.max_send_message_length=50000000, grpc.primary_user_agent=grpc-python/1.47.5, grpc.resource_quota=0x23ca5c0, grpc.server_uri=unix:/tmp/tmplsdry2hf}}: connect failed ({"created":"@1679410623.022503505","description":"No such file or directory","errno":2,"file":"src/core/lib/iomgr/tcp_client_posix.cc","file_line":297,"os_error":"No such file or directory","syscall":"connect","target_address":"unix:/tmp/tmplsdry2hf"}), backing off for 1000 ms
from `GRPC_TRACE=true`
So maybe it's a race condition between the creation of that unix socket and starting the sub process
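
The "No such file or directory" in that trace is what a client gets when it connects to a Unix socket path that nothing has bound yet. The following is a minimal, generic sketch of that race using only the standard library. It is not how Dagster or gRPC handle it (the trace shows gRPC already backing off and retrying on its own), and `connect_with_retry` and `demo` are hypothetical names.

```python
import os
import socket
import tempfile
import threading
import time


def connect_with_retry(path: str, attempts: int = 20, delay: float = 0.05) -> socket.socket:
    """Connect to a Unix socket, retrying while the server is not ready."""
    last_error = None
    for _ in range(attempts):
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            client.connect(path)
            return client
        except (FileNotFoundError, ConnectionRefusedError) as exc:
            # ENOENT: nothing has bound the path yet. ECONNREFUSED: the
            # file exists but no process is listening on it.
            client.close()
            last_error = exc
            time.sleep(delay)
    raise ConnectionError(f"could not connect to {path}") from last_error


def demo() -> str:
    """Start the server slightly after the client begins connecting."""
    path = os.path.join(tempfile.mkdtemp(), "server.sock")
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)

    def start_server_late() -> None:
        time.sleep(0.2)  # simulate the server losing the race
        server.bind(path)
        server.listen(1)
        conn, _ = server.accept()
        conn.sendall(b"ready")
        conn.close()

    thread = threading.Thread(target=start_server_late)
    thread.start()
    client = connect_with_retry(path)
    data = client.recv(16)
    client.close()
    thread.join()
    server.close()
    return data.decode()
```

A client that retries like this tolerates a slow-starting server; a client that treats the first ENOENT as fatal does not.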

daniel

03/21/2023, 3:08 PM
Is there any indication of which grpc call that error is coming from? I wouldn't expect something like that to make the whole server unavailable
broadly i think moving the runs somewhere other than the gRPC server is likely to help here

Tobias Pankrath

03/21/2023, 3:10 PM
All children seem to have their own socket and only one seems to be affected at a time

daniel

03/21/2023, 4:02 PM
the way that it works is each run happens in a subprocess (using a Python multiprocess context: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_grpc/server.py#L738-L748 ) - i wouldn't expect each run to interfere with the gRPC server machinery once it's started, but I would expect resource issues in one run to potentially affect other runs since they aren't particularly isolated (until you switch to something like docker or k8s)

Tobias Pankrath

03/21/2023, 4:10 PM
does the multi_process_executor fork without exec? because all child processes look like they do something grpc:
536940 /home/bcr88/space/repos/dagster-playground/.direnv/python-3.8.10/bin/python -m dagster api grpc --lazy-load-user-code --socket /tmp/tmp6zg0zdfp --heartbeat --heartbeat-timeout 120 --fixed-server-id aad1a72a-2970-4fce-a10f-2899d4ed4467 --log-level warning --inject-env-vars-from-instance --instance-ref {"__class__": "InstanceRef", "compute_logs_data": {"__class__": "ConfigurableClassData", "class_name": "LocalComputeLogManager", "config_yaml": "base_dir: /home/bcr88/space/dagster-home/storage\n", "module_name": "dagster.core.storage.local_compute_log_manager"}, "custom_instance_class_data": null, "event_storage_data": {"__class__": "ConfigurableClassData", "class_name": "SqliteEventLogStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home/history/runs/\n", "module_name": "dagster.core.storage.event_log"}, "local_artifact_storage_data": {"__class__": "ConfigurableClassData", "class_name": "LocalArtifactStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home\n", "module_name": "dagster.core.storage.root"}, "run_coordinator_data": {"__class__": "ConfigurableClassData", "class_name": "QueuedRunCoordinator", "config_yaml": "max_concurrent_runs: 75\n", "module_name": "dagster.core.run_coordinator"}, "run_launcher_data": {"__class__": "ConfigurableClassData", "class_name": "DefaultRunLauncher", "config_yaml": "{}\n", "module_name": "dagster"}, "run_storage_data": {"__class__": "ConfigurableClassData", "class_name": "SqliteRunStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home/history/\n", "module_name": "dagster.core.storage.runs"}, "schedule_storage_data": {"__class__": "ConfigurableClassData", "class_name": "SqliteScheduleStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home/schedules\n", "module_name": "dagster.core.storage.schedules"}, "scheduler_data": {"__class__": "ConfigurableClassData", "class_name": "DagsterDaemonScheduler", "config_yaml": "{}\n", "module_name": "dagster.core.scheduler"}, 
"secrets_loader_data": null, "settings": {}, "storage_data": {"__class__": "ConfigurableClassData", "class_name": "DagsterSqliteStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home\n", "module_name": "dagster.core.storage.sqlite_storage"}} --location-name intraday_events -m intraday_events -d /home/bcr88/space/repos/dagster-playground/intraday_events
All of them look like this
For the move to docker: can I combine the DockerRunLauncher with the in_process_executor?

daniel

03/21/2023, 4:19 PM
You can, yeah
that would work with both the default run launcher and docker run launcher - executors control individual steps, run launchers control the job as a whole
You can control some aspects of the fork behavior of the multiprocess_executor via the start_method field: https://docs.dagster.io/concepts/ops-jobs-graphs/job-execution#default-job-executor
It uses the python multiprocessing context to spawn a new subprocess and passes in that start method, but i'm not immediately sure how that maps to fork() vs. exec()
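
On the fork() vs. exec() question: in Python's standard multiprocessing module, the "fork" start method calls os.fork() and the child continues as a copy of the parent, while "spawn" starts a fresh interpreter that re-imports the module. The sketch below shows the difference with plain multiprocessing. It does not use Dagster's executor, and `STATE`, `report`, and `child_sees` are hypothetical names.

```python
import multiprocessing as mp

STATE = "set at import"


def report(queue) -> None:
    """Runs in the child: send back the value of STATE it observes."""
    queue.put(STATE)


def child_sees(start_method: str) -> str:
    """Start a child with the given start method and return what it saw."""
    global STATE
    STATE = "changed in parent"  # mutate module state after import
    ctx = mp.get_context(start_method)
    queue = ctx.Queue()
    proc = ctx.Process(target=report, args=(queue,))
    proc.start()
    value = queue.get(timeout=30)
    proc.join()
    return value
```

With "fork" (POSIX only) the child inherits the parent's memory and reports the changed value. With "spawn" the child re-imports the module and sees the import-time value instead, which is also why a script using "spawn" needs its entry point guarded by `if __name__ == "__main__":`.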

Tobias Pankrath

03/21/2023, 4:30 PM
> You can control some aspects of the fork behavior of the multiprocess_executor via the start_method field: https://docs.dagster.io/concepts/ops-jobs-graphs/job-execution#default-job-executor
Can I configure this globally, e.g. in my dagster.yaml?

daniel

03/21/2023, 4:30 PM
I don't believe that's currently possible, but you can configure a default executor on your Definitions object
❤️ 1

Harrison Conlin

03/22/2023, 5:38 AM
so I was initially only seeing this on my linux box (pre production environment) but am now seeing this on my windows development box. the one thing I do have in common with Tobias is a large number of partitions (400, 950 and 600 if I look at my currently running jobs), and one just failed on a 43-partition run 😞
I'm wondering if the max_user_code_failure_retries option would help
we're down to only 1 concurrent run and still hitting the grpc timeout. ironically enough, max_concurrent_runs: 0 actually limits it to zero runs! so giving that a shot
and even with 0 runs, we're still hitting grpc timeouts. I can see the dagster daemon is using ~10% cpu and the api process is at ~5% cpu.

Tobias Pankrath

03/22/2023, 6:44 AM
Changing the multiprocess behaviour from fork to spawn didn't help either.
Increasing the ulimit for file descriptors to unlimited didn't help either.

Harrison Conlin

03/22/2023, 7:23 AM
ok so I've got an interesting one: try launching the runs from the command line
I did a `dagster job backfill` and it's humming away nicely
it doesn't seem to come up on the backfills page of dagit but I can see the number of runs increasing

Tobias Pankrath

03/22/2023, 7:31 AM
Keep me posted

Harrison Conlin

03/22/2023, 7:50 AM
ok well that worked great, I'm going to try with another backfill job that was being a pain and dying early on
and it came through on the dagit backfills page once it had finished queuing all the runs
🙏 1

daniel

03/22/2023, 8:45 AM
If it consistently fails when running “dagster dev” or in the daemon but never fails when running “dagster job backfill” then that’s a very helpful clue
I’d be curious if you see that same behavior Tobias

Tobias Pankrath

03/22/2023, 8:47 AM
I'll check later

daniel

03/22/2023, 8:53 AM
Also a helpful clue that it still happens when no runs are being launched at all

Tobias Pankrath

03/22/2023, 8:54 AM
How do I launch a backfill for an asset if I have no job for that asset?
--all TEXT                    Specify to select all partitions to backfill.
Does --all really take an argument?
So, I did it for a job I have, although it's very long running. I started `dagster dev` first and then issued a backfill via the CLI.
Looks like this helps.

daniel

03/22/2023, 12:25 PM
Ok, with this information I have a theory about what this might be, I’ll see if I can reproduce the problem myself today
Trying to reproduce with ~400 partitions locally - how many assets are typically in a backfill where it fails?

Tobias Pankrath

03/22/2023, 2:49 PM
fails at around 500

daniel

03/22/2023, 2:58 PM
aha, I have reproduced the problem. Only a matter of time now :)
❤️ 1
happened at 897 for me
OK, here's a fix that I believe will squash this: https://github.com/dagster-io/dagster/pull/13085 - we should be able to get this out a week from today, thanks for reporting the problem. Running from the CLI should work in the meantime as a workaround (you shouldn't need to have `dagster dev` running while running the CLI, although it won't hurt)
❤️ 1

Tobias Pankrath

03/22/2023, 4:47 PM
Thanks a lot

Harrison Conlin

03/22/2023, 10:42 PM
I go to bed, wake up and @daniel's got a pull request merged in. champion 🙂 thanks again Tobias for verifying it's not just a me problem 🙂

Tobias Pankrath

03/23/2023, 10:11 AM
And it's already in the release.

daniel

03/23/2023, 11:52 AM
It’s in master, it’ll be out in the release next Wednesday

Tobias Pankrath

03/23/2023, 11:57 AM
Yup, noticed.
> Starting 1.1.18, users with a gRPC server that could not access the Dagster instance on user code deployments would see an error when launching backfills as the instance could not instantiate. This has been fixed.
I thought this entry from the latest changelog was the fix.