Hi everyone we are having this error regularly on the Dagit dagster #announcements

Alexis M

01/11/2021, 7:45 AM

Hi everyone, we are having this error regularly on the Dagit UI when trying to launch a pipeline run. To temporarily fix it, we trigger a fresh deployment of our Dagster container, but the error reappears after an undetermined amount of time. We use the DefaultRunLauncher and deploy Dagster 0.9.19 in a Docker container hosted on an EC2 machine. What can we do to investigate this further?

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1610350697.550486745","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4142,"referenced_errors":[{"created":"@1610350697.550481483","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":397,"grpc_status":14}]}"
>
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/utils.py", line 14, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/execution/launch_execution.py", line 13, in launch_pipeline_reexecution
    return _launch_pipeline_execution(graphene_info, execution_params, is_reexecuted=True)
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/execution/launch_execution.py", line 48, in _launch_pipeline_execution
    run = do_launch(graphene_info, execution_params, is_reexecuted)
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/execution/launch_execution.py", line 36, in do_launch
    pipeline_run = create_valid_pipeline_run(graphene_info, external_pipeline, execution_params)
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/execution/run_lifecycle.py", line 21, in create_valid_pipeline_run
    step_keys_to_execute=step_keys_to_execute,
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/external.py", line 97, in get_external_execution_plan_or_raise
    step_keys_to_execute=None,
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/context.py", line 121, in get_external_execution_plan
    step_keys_to_execute=step_keys_to_execute,
  File "/usr/local/lib/python3.7/site-packages/dagster/core/host_representation/repository_location.py", line 372, in get_external_execution_plan
    step_keys_to_execute=step_keys_to_execute,
  File "/usr/local/lib/python3.7/site-packages/dagster/api/snapshot_execution_plan.py", line 38, in sync_get_external_execution_plan_grpc
    pipeline_snapshot_id=pipeline_snapshot_id,
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 117, in execution_plan_snapshot
    execution_plan_snapshot_args
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 73, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 923, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _end_unary_response_blocking
    raise _InactiveRpcError(state)

daniel

01/11/2021, 2:03 PM

Hi Alexis - is it possible to upload a full set of logs from the container? My suspicion is there might be a crash or exception earlier that could give a clue about what’s causing the problem here later

Alexis M

01/11/2021, 3:19 PM

Unfortunately I can't provide you those logs. However, I observed today that it happened right after the following error occurred:

OSError: [Errno 12] Cannot allocate memory
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/impl.py", line 68, in _core_execute_run
    for event in execute_run_iterator(recon_pipeline, pipeline_run, instance):
  File "/usr/local/lib/python3.7/site-packages/dagster/core/execution/api.py", line 728, in __iter__
    execution_plan=self.execution_plan, pipeline_context=self.pipeline_context,
  File "/usr/local/lib/python3.7/site-packages/dagster/core/execution/api.py", line 665, in _pipeline_execution_iterator
    for event in pipeline_context.executor.execute(pipeline_context, execution_plan):
  File "/usr/local/lib/python3.7/site-packages/dagster/core/executor/in_process.py", line 36, in execute
    for event in inner_plan_execution_iterator(pipeline_context, execution_plan):
  File "/usr/local/lib/python3.7/site-packages/dagster/core/execution/plan/execute_plan.py", line 55, in inner_plan_execution_iterator
    step_context.pipeline_run, step_context.step.key
  File "/usr/local/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/storage/compute_log_manager.py", line 56, in watch
    with self._watch_logs(pipeline_run, step_key):
  File "/usr/local/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/storage/local_compute_log_manager.py", line 47, in _watch_logs
    with mirror_stream_to_file(sys.stdout, outpath):
  File "/usr/local/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/execution/compute_logs.py", line 31, in mirror_stream_to_file
    with tail_to_stream(filepath, stream) as pids:
  File "/usr/local/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/execution/compute_logs.py", line 84, in tail_to_stream
    with execute_posix_tail(path, stream) as pids:
  File "/usr/local/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/execution/compute_logs.py", line 129, in execute_posix_tail
    tail_process = subprocess.Popen(tail_cmd, stdout=stream)
  File "/usr/local/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/usr/local/lib/python3.7/subprocess.py", line 1482, in _execute_child
    restore_signals, start_new_session, preexec_fn)

daniel

01/11/2021, 3:25 PM

Ah great - that definitely seems relevant. It seems like you might be hitting a memory limit in the container. Is it possible there's a memory leak in your pipeline, or is there some way to increase the memory limit of the container? One thing we could do is see if we run into similar problems with a much simpler pipeline, like the hello world example from the tutorial, to isolate whether it's a problem with this specific pipeline. One other idea that isn't a full fix but might be less annoying than a full re-deploy: there should be a reload icon next to your repository name on the left-hand side of Dagit. I'd expect pressing that to reload the repository and let you launch the pipeline again without a full re-deploy (or, if the memory issue is still there, to display some kind of error as soon as you press the button).
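
On the memory-leak question, one way to check is to measure which source lines retain the most memory while a step runs. The following is a minimal sketch using Python's standard `tracemalloc` module. The `suspect_step` function is a hypothetical stand-in for the real pipeline logic, not code from this thread, and `top_allocation_growth` is an illustrative helper name.

```python
import tracemalloc


def top_allocation_growth(workload, limit=5):
    """Run `workload` and return (result, the lines whose allocations grew most)."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        # Keep a reference to the result so memory it retains is still counted.
        result = workload()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to returns StatisticDiff entries, largest change first.
    stats = after.compare_to(before, "lineno")
    return result, stats[:limit]


def suspect_step():
    # Hypothetical stand-in for a step that holds on to a lot of data (~10 MB).
    return [bytes(1024) for _ in range(10_000)]


if __name__ == "__main__":
    _, growth = top_allocation_growth(suspect_step)
    for stat in growth:
        print(stat)
```

Each printed entry names a file and line number with the size difference between the two snapshots, which can help narrow down where memory is being held.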

daniel

01/11/2021, 3:26 PM

On our side we should add some more monitoring of the process that dagit spins up for your repository, to proactively tell you when it crashes rather than waiting for you to notice when you try to launch the pipeline. cc @sashank who has been working on some features along these lines.

Alexis M

01/11/2021, 5:13 PM

I will try pressing the reload button the next time I encounter the error. It's great to hear that this kind of troubleshooting will be easier to manage in the future 🙂. Many thanks for the time you dedicated to my problem!

sashank

01/11/2021, 5:14 PM

Hey @Alexis M, just so I understand - you’re using python repository locations in your workspace, not remote gRPC repository locations, correct?

Alexis M

01/11/2021, 5:18 PM

Hey, yes absolutely, here is our simple workspace.yaml:

load_from:
  - python_file:
      relative_path: repository.py
      working_directory: .

Alexis M

01/12/2021, 7:39 AM

I also noticed that the failing gRPC process prevents our schedules from starting. Here is an example log:

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Exception calling application: [Errno 12] Cannot allocate memory"
debug_error_string = "{"created":"@1610323262.013688394","description":"Error received from peer unix:/tmp/tmpxmhzmbkw","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Exception calling application: [Errno 12] Cannot allocate memory","grpc_status":2}"
>
  File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/__init__.py", line 1129, in launch_run
    self, run, external_pipeline=external_pipeline
  File "/usr/local/lib/python3.7/site-packages/dagster/core/launcher/default_run_launcher.py", line 97, in launch_run
    instance_ref=self._instance.get_ref(),
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 382, in start_run
    serialized_execute_run_args=serialize_dagster_namedtuple(execute_run_args),
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 73, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 923, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _end_unary_response_blocking
    raise _InactiveRpcError(state)

Alexis M

01/12/2021, 7:41 AM

However, it does work after I manually press the reload button in the UI. Is there any way to trigger the reload action periodically?

daniel

01/12/2021, 12:40 PM

There’s a GraphQL call that could be made to reload the repository location - but I think it may be worth investigating why the container is running out of memory as well. Otherwise it could very well crash in the middle of executing the pipeline, for example. Any chance it would be possible to post the pipeline code?
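
The thread does not show the GraphQL call itself, so the following is only a sketch of what such a request might look like, using the standard library. The mutation name `reloadRepositoryLocation`, its `repositoryLocationName` argument, the endpoint URL, and the location name are all assumptions that should be checked against the GraphQL schema exposed by your Dagit version before use.

```python
import json
import urllib.request

# Assumption: mutation and argument names must be verified against your
# Dagit GraphQL schema; they are not confirmed anywhere in this thread.
RELOAD_MUTATION = """
mutation ReloadLocation($name: String!) {
  reloadRepositoryLocation(repositoryLocationName: $name) {
    __typename
  }
}
"""


def build_reload_request(graphql_url, location_name):
    """Build (but do not send) the POST request for the reload mutation."""
    payload = json.dumps(
        {"query": RELOAD_MUTATION, "variables": {"name": location_name}}
    ).encode("utf-8")
    return urllib.request.Request(
        graphql_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reload_location(graphql_url, location_name, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_reload_request(graphql_url, location_name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical endpoint and location name; adjust both to your deployment.
    print(reload_location("http://localhost:3000/graphql", "repository.py"))
```

A scheduler such as cron could run a script like this periodically, though that only masks the underlying out-of-memory condition.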

Alexis M

01/12/2021, 1:14 PM

Unfortunately I can't share the pipeline code, but we have some pipelines that can load millions of rows into a pandas DataFrame in memory. I guess that is what causes the pipeline to crash occasionally.

daniel

01/12/2021, 1:40 PM

That makes sense - do you have any way to increase the memory limit of the container?

Alexis M

01/12/2021, 2:11 PM

I could do it, but I think that we could also optimize how we process this data on our side (like reading in chunks).
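
The chunked approach mentioned above can be sketched as follows. This is a library-agnostic illustration using only the standard library, assuming the data comes from a CSV source with a hypothetical `amount` column; neither is stated in the thread. With pandas, passing `chunksize=` to `read_csv` gives a similar iterator of smaller DataFrames.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most `chunk_size` rows from any row iterator."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def total_amount(csv_file, chunk_size=50_000):
    """Sum the hypothetical `amount` column without loading the whole file."""
    reader = csv.DictReader(csv_file)
    total = 0.0
    for chunk in iter_chunks(reader, chunk_size):
        # Only one chunk of rows is held in memory at a time.
        total += sum(float(row["amount"]) for row in chunk)
    return total


if __name__ == "__main__":
    sample = io.StringIO("id,amount\n1,2.5\n2,4.0\n3,1.5\n")
    print(total_amount(sample, chunk_size=2))
```

Peak memory then depends on the chunk size rather than on the total number of rows, provided each chunk's result is reduced or written out before the next one is read.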
