# ask-community
h
I also seem to get these on backfills after a few minutes
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_daemon/backfill.py", line 34, in execute_backfill_iteration
    yield from execute_asset_backfill_iteration(backfill, workspace, instance)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_core/execution/asset_backfill.py", line 245, in execute_asset_backfill_iteration
    submit_run_request(
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_core/execution/asset_backfill.py", line 283, in submit_run_request
    external_pipeline = repo_location.get_external_pipeline(pipeline_selector)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_core/host_representation/repository_location.py", line 141, in get_external_pipeline
    subset_result = self.get_subset_external_pipeline_result(selector)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_core/host_representation/repository_location.py", line 773, in get_subset_external_pipeline_result
    return sync_get_external_pipeline_subset_grpc(
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_api/snapshot_pipeline.py", line 29, in sync_get_external_pipeline_subset_grpc
    api_client.external_pipeline_subset(
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 293, in external_pipeline_subset
    res = self._query(
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 159, in _query
    self._raise_grpc_exception(
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 142, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1679283310.259159924","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1679283310.259158224","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 157, in _query
    return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/dagster/_grpc/client.py", line 132, in _get_response
    return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/redacted/virtualenvs/dagster-warehouse-MGHskhVC-py3.9/lib/python3.9/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
t
I have the same issue
One big problem is that the backfill then stops after the first occurrence
This is hard-blocking me right now. Is there any other way to regenerate an asset? Deleting it and rematerializing it, or something like that?
h
humor me, have you got a http_proxy env var set?
d
Hi, are either of you using the default run launcher that launches each run in the same process? And are you using any kind of run queue settings to limit the maximum number of runs that can be happening at once? I’m wondering if this error could come from too many concurrent runs happening at the same time due to the backfill and overloading the code server
t
humor me, have you got a http_proxy env var set?
No proxy.
Hi, are either of you using the default run launcher that launches each run in the same process? And are you using any kind of run queue settings to limit the maximum number of runs that can be happening at once? I’m wondering if this error could come from too many concurrent runs happening at the same time due to the backfill and overloading the code server
I am using the multiprocess executor that starts a process per run. But a lot of them (although CPU utilization was never very high)
What would be a sensible value here?
Reduced it by half (32 on a 64 core machine). Still happening.
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1679387303.729055401","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1679387303.729054773","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>
h
multiprocess executor but only 4 runs at a time
t
For what it is worth: With a retry policy the problem seems to go away
d
The default run launcher does each run in a subprocess on the gRPC server, so there's a risk, when a high number of runs are happening at once with no run queue in place, that the server gets overloaded. A couple of things I would suggest here to mitigate this:
• Using the run queue to limit the number of runs that can happen at once: https://docs.dagster.io/deployment/run-coordinator#limiting-run-concurrency
• Using a different run launcher like the DockerRunLauncher (or k8s or ECS, although deploying to those is more involved), which launches each run in its own container that is fully isolated from the gRPC server: https://docs.dagster.io/deployment/guides/docker#launching-runs-in-containers
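For reference, a minimal dagster.yaml sketch of the DockerRunLauncher option (requires the dagster-docker package; the network and env var names below are assumptions to adapt):
run_launcher:
  module: dagster_docker
  class: DockerRunLauncher
  config:
    # Assumed Docker network shared with the webserver/daemon containers
    network: dagster_network
    # Names of environment variables to forward into each run container (examples)
    env_vars:
      - DAGSTER_POSTGRES_PASSWORD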
t
Thanks for your answer.
Using the run queue to limit the number of runs that can happen at once: https://docs.dagster.io/deployment/run-coordinator#limiting-run-concurrency
I am already doing this and it doesn't help. The ability to just
dagster dev
and have it work is one of the key features of dagster for me. There is also no reason to stop an entire backfill if one such error occurs. I'll take a look into the docker stuff as well.
d
What are you setting the limit to?
t
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 75
Had the same problem with 32.
(32 core machine)
d
While recognizing that this wouldn't work for you in production, what if you set it to a much lower number (say, 4 or 2)? Is the backfill able to submit all the runs? That would confirm the theory that it's the runs all happening in the same process that's causing the problem. You've mentioned the number of cores/CPU, but is there any chance that the runs are memory-intensive and the gRPC server might be running out of memory?
I assume there are no logs in the dagster dev output with any clues about why the process might be becoming unavailable, like some kind of error message shortly before the StatusCode.UNAVAILABLE messages start?
t
I assume there are no logs in the dagster dev output with any clues about why the process might be becoming unavailable, like some kind of error message shortly before the StatusCode.UNAVAILABLE messages start?
No, nothing suspicious. I looked at
htop
and it didn't look like it would run OOM, but I cannot be 100% sure.
I'll watch it if it happens again
I have one asset that's very memory intensive and one that isn't (it's currently running with 75 concurrent runs at 6/64G memory consumption) and it happened to both.
And it happened again, but I set a RetryPolicy, which seems to be a workaround
d
I'm having trouble understanding why a RetryPolicy would help, I don't believe that's checked during backfills
t
Yeah, my bad. Double checked it, doesn't help :(
Always fails around 500 partitions
I increased the file descriptor limit, maybe it's that, but then something would be leaking them.
(maybe my io managers)
but they should be running in a subprocess and thus be cleaned up automatically, shouldn't they?
d
every run would be happening in its own subprocess, yeah
(i'm not totally certain if it follows that all file descriptors would be cleaned up)
t
if you receive them in the child process (instead of e.g. inheriting them) then yes.
I0321 14:57:03.022587668  385241 subchannel.cc:956]          subchannel 0x7f5a10373400 {address=unix:/tmp/tmplsdry2hf, args={grpc.client_channel_factory=0x1ff9470, grpc.default_authority=localhost, grpc.default_compression_algorithm=2, grpc.internal.channel_credentials=0x2109320, grpc.internal.security_connector=0x7f5a101bea40, grpc.internal.subchannel_pool=0x23dce80, grpc.max_receive_message_length=50000000, grpc.max_send_message_length=50000000, grpc.primary_user_agent=grpc-python/1.47.5, grpc.resource_quota=0x23ca5c0, grpc.server_uri=unix:/tmp/tmplsdry2hf}}: connect failed ({"created":"@1679410623.022503505","description":"No such file or directory","errno":2,"file":"src/core/lib/iomgr/tcp_client_posix.cc","file_line":297,"os_error":"No such file or directory","syscall":"connect","target_address":"unix:/tmp/tmplsdry2hf"}), backing off for 1000 ms
from
GRPC_TRACE=true
So maybe it's a race condition between the creation of that unix socket and starting the subprocess
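For anyone reproducing this, the trace above can be captured with gRPC's standard debugging environment variables (generic gRPC settings, not Dagster-specific; the tracer list can be narrowed or widened as needed):
GRPC_VERBOSITY=DEBUG GRPC_TRACE=subchannel,connectivity_state dagster dev 2> grpc_trace.log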
d
Is there any indication of which grpc call that error is coming from? I wouldn't expect something like that to make the whole server unavailable
broadly i think moving the runs to a separate place from the gRPC server is likely to help here
t
All children seem to have their own socket, and only one seems to be affected at a time
d
the way that it works is each run happens in a subprocess (using a Python multiprocess context: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_grpc/server.py#L738-L748 ) - i wouldn't expect each run to interfere with the gRPC server machinery once it's started, but I would expect resource issues in one run to potentially affect other runs since they aren't particularly isolated (until you switch to something like docker or k8s)
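As a standalone illustration of the descriptor question above (not Dagster's actual code): a Python multiprocessing context started with "spawn" launches a fresh interpreter, so the child only sees resources passed to it explicitly, whereas "fork" copies the parent's process image, open file descriptors included.
import multiprocessing
import os

def child():
    # Under "spawn" this runs in a fresh interpreter; under "fork" it starts
    # as a copy of the parent, inheriting its open descriptors and sockets.
    print("child pid:", os.getpid())

if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")  # alternatives: "fork", "forkserver"
    p = ctx.Process(target=child)
    p.start()
    p.join()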
t
does the multi_process_executor fork without exec? because all child processes look like they're doing something with gRPC:
536940 /home/bcr88/space/repos/dagster-playground/.direnv/python-3.8.10/bin/python -m dagster api grpc --lazy-load-user-code --socket /tmp/tmp6zg0zdfp --heartbeat --heartbeat-timeout 120 --fixed-server-id aad1a72a-2970-4fce-a10f-2899d4ed4467 --log-level warning --inject-env-vars-from-instance --instance-ref {"__class__": "InstanceRef", "compute_logs_data": {"__class__": "ConfigurableClassData", "class_name": "LocalComputeLogManager", "config_yaml": "base_dir: /home/bcr88/space/dagster-home/storage\n", "module_name": "dagster.core.storage.local_compute_log_manager"}, "custom_instance_class_data": null, "event_storage_data": {"__class__": "ConfigurableClassData", "class_name": "SqliteEventLogStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home/history/runs/\n", "module_name": "dagster.core.storage.event_log"}, "local_artifact_storage_data": {"__class__": "ConfigurableClassData", "class_name": "LocalArtifactStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home\n", "module_name": "dagster.core.storage.root"}, "run_coordinator_data": {"__class__": "ConfigurableClassData", "class_name": "QueuedRunCoordinator", "config_yaml": "max_concurrent_runs: 75\n", "module_name": "dagster.core.run_coordinator"}, "run_launcher_data": {"__class__": "ConfigurableClassData", "class_name": "DefaultRunLauncher", "config_yaml": "{}\n", "module_name": "dagster"}, "run_storage_data": {"__class__": "ConfigurableClassData", "class_name": "SqliteRunStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home/history/\n", "module_name": "dagster.core.storage.runs"}, "schedule_storage_data": {"__class__": "ConfigurableClassData", "class_name": "SqliteScheduleStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home/schedules\n", "module_name": "dagster.core.storage.schedules"}, "scheduler_data": {"__class__": "ConfigurableClassData", "class_name": "DagsterDaemonScheduler", "config_yaml": "{}\n", "module_name": "dagster.core.scheduler"}, "secrets_loader_data": null, "settings": {}, "storage_data": {"__class__": "ConfigurableClassData", "class_name": "DagsterSqliteStorage", "config_yaml": "base_dir: /home/bcr88/space/dagster-home\n", "module_name": "dagster.core.storage.sqlite_storage"}} --location-name intraday_events -m intraday_events -d /home/bcr88/space/repos/dagster-playground/intraday_events
All of them look like this
For the move to docker: can I combine the DockerRunLauncher with the in_process_executor?
d
You can, yeah
that would work with both the default run launcher and docker run launcher - executors control individual steps, run launchers control the job as a whole
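A minimal sketch of that split (hypothetical names, single placeholder asset): the executor decides how steps run inside one launched run, while the run launcher configured in dagster.yaml decides where that run lives.
from dagster import Definitions, asset, define_asset_job, in_process_executor

@asset
def example_asset():
    # Placeholder asset for illustration
    return 1

# Steps in this job run serially inside the run's own process; pairing it with
# the DockerRunLauncher would put each such run in its own container.
serial_job = define_asset_job(
    "serial_job",
    selection=[example_asset],
    executor_def=in_process_executor,
)

defs = Definitions(assets=[example_asset], jobs=[serial_job])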
You can control some aspects of the fork behavior of the multiprocess_executor via the start_method field: https://docs.dagster.io/concepts/ops-jobs-graphs/job-execution#default-job-executor
It uses the python multiprocessing context to spawn a new subprocess and passes in that start method, but i'm not immediately sure how that maps to fork() vs. exec()
t
You can control some aspects of the fork behavior of the multiprocess_executor via the start_method field: https://docs.dagster.io/concepts/ops-jobs-graphs/job-execution#default-job-executor
Can I configure this globally, e.g. in my dagster.yaml?
d
I don't believe that's currently possible, but you can configure a default executor on your Definitions object
❤️ 1
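A sketch of that Definitions-level default (placeholder asset; the executor config keys are assumptions to verify against the docs linked above):
from dagster import Definitions, asset, multiprocess_executor

@asset
def example_asset():
    # Placeholder asset for illustration
    return 1

# Used as the default executor for every job in these Definitions; "spawn"
# avoids fork-style inheritance of the parent's descriptors.
defs = Definitions(
    assets=[example_asset],
    executor=multiprocess_executor.configured(
        {"max_concurrent": 4, "start_method": {"spawn": {}}}
    ),
)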
h
so I was initially only seeing this on my linux box (pre-production environment) but am now seeing this on my windows development box. the one thing I do have in common with Tobias is a large number of partitions (400, 950 and 600 if I look at my current running jobs) and one just failed on a 43-partition run 😞
I'm wondering if the max_user_code_failure_retries option would help
we're down to only 1 concurrent run and still hitting the grpc timeout. ironically enough, max_concurrent_runs: 0 actually limits it to zero runs! so giving that a shot
and even with 0 runs, we're still hitting grpc timeouts. I can see the dagster daemon is using ~10% cpu and the api process is like ~5% cpu.
t
Changing the multiprocess behaviour from fork to spawn didn't help either.
Also increasing the ulimit for file descriptors to unlimited didn't help.
h
ok so I've got an interesting one, try launching the runs from the command line
I did a
dagster job backfill
and it's humming away nicely
it doesn't seem to come up on the backfills page of dagit but I can see the number of runs increasing
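For reference, an invocation along these lines (job and partition names are hypothetical; check the exact flag spellings against dagster job backfill --help, and add the usual workspace/module target flags for your project):
dagster job backfill --job my_partitioned_job --partitions 2023-01-01,2023-01-02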
t
Keep me posted
h
ok well that worked great, I'm going to try with another backfill job that was being a pain and dying early on
and it came through on the dagit backfills page, once it had finished queuing all the runs
🙏 1
d
If it consistently fails when running “dagster dev” or in the daemon but never fails when running “dagster job backfill” then that’s a very helpful clue
I’d be curious if you see that same behavior Tobias
t
I'll check later
d
Also a helpful clue that it still happens when no runs are being launched at all
t
How do I launch a backfill for an asset if I have no job for that asset?
--all TEXT                    Specify to select all partitions to backfill.
Does all really take an argument?
So, I did it for a job I have, although it's very long running. I started
dagster dev
first and then issued a backfill via the CLI
looks like this helps
d
Ok, with this information I have a theory about what this might be, I’ll see if I can reproduce the problem myself today
Trying to reproduce with ~400 partitions locally - how many assets are typically in a backfill where it fails?
t
fails at around 500
d
aha, I have reproduced the problem. Only a matter of time now :)
❤️ 1
happened at 897 for me
OK, here's a fix that I believe will squash this: https://github.com/dagster-io/dagster/pull/13085 - we should be able to get this out a week from today, thanks for reporting the problem. Running from the CLI should work in the meantime as a workaround (you shouldn't need to have
dagster dev
running while running the CLI, although it won't hurt)
❤️ 1
t
Thanks a lot
h
I go to bed, wake up and @daniel's got a pull request merged in. champion 🙂 thanks again Tobias for verifying it's not just a me problem 🙂
t
And it's already in the release.
d
It’s in master, it’ll be out in the release next Wednesday
t
Yup, noticed.
Starting 1.1.18, users with a gRPC server that could not access the Dagster instance on user code deployments would see an error when launching backfills as the instance could not instantiate. This has been fixed.
I thought that this was it, from the latest changelog.