# ask-community
k
Hey all – I'm seeing a `dagster._core.executor.child_process_executor.ChildProcessCrashException` every time I run an asset materialization for the first time after a server restart. It's not an OOM issue – I'm processing on the order of ~200MB of data and the container has 32GB. Every time, retrying a second time succeeds. What could be the cause of this / how can I dig in further?
d
Hi Kevin - do you have a full stack trace from the error?
k
hey @daniel this is all i get unfortunately in the web UI
Multiprocess executor: child process for step seatgeek_events_raw was terminated by signal 7 (SIGBUS).
dagster._core.executor.child_process_executor.ChildProcessCrashException
Stack Trace:
File "/opt/conda/lib/python3.10/site-packages/dagster/_core/executor/multiprocess.py", line 240, in execute
event_or_none = next(step_iter)
,  File "/opt/conda/lib/python3.10/site-packages/dagster/_core/executor/multiprocess.py", line 357, in execute_step_out_of_process
for ret in execute_child_process_command(multiproc_ctx, command):
,  File "/opt/conda/lib/python3.10/site-packages/dagster/_core/executor/child_process_executor.py", line 174, in execute_child_process_command
raise ChildProcessCrashException(exit_code=process.exitcode)
d
SIGBUS is unusual... any chance you’re able to share code that reproduces the problem? The crash is very likely coming from the code in the body of the op - are you able to reproduce the crash when running the same code or dagster job locally?
Passing on retry is also odd - is this using the default run launcher or a different one? Any raw logs from the process where the run ran that might give more information about the crash?
k
@daniel I wasn't getting anything similar to this when I was running locally, this started when I followed the guide to get Docker working. Now it works sometimes and sometimes it doesn't. Using the default run launcher right now; I saw a bunch of errors when I used the Docker run launcher.
there's unfortunately not anything more helpful in the container logs either, it says the same stuff that's in the web UI. I feel like maybe an error in my code is occurring and it's getting masked by the dagster code as a `ChildProcessCrashException` – just a hunch though
most of my assets are using Spark but they were running fine without error locally
d
one thing you could do is try running it with the in_process executor (that runs everything in the same process) instead of the default multiprocess executor that runs each op in its own subprocess (e.g. by adding this to your run config in the launchpad):
execution:
  config:
    in_process:
i'd still expect it to crash, but there might be more information about the crash since it'd be happening in the 'main' process? or if it doesn't crash, that might be a clue
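For reference, a rough sketch of pinning the same executor choice in code rather than the launchpad – the job name and asset selection below are hypothetical, and the job would still need to be added to your Definitions/repository to appear in Dagit:

```python
from dagster import define_asset_job, in_process_executor

# Hypothetical one-off job targeting the crashing asset; swapping in the
# in-process executor means every step runs in the parent process instead
# of a forked child process.
debug_events_job = define_asset_job(
    name="debug_events_job",
    selection="seatgeek_events_raw",
    executor_def=in_process_executor,
)
```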
k
yeah I haven't been able to reproduce a crash reliably yet because I'm not getting any detailed errors that tell me what failed. I'm using assets for everything, do assets work with the launchpad? can't find it
d
Yeah, it's a bit hidden but if you shift click on the text of the Materialize button it will open the launchpad
k
oh nice ok let me try
ok so I ran it 3 times: for the 1st and 3rd I passed the block you suggested directly, for the 2nd I just clicked "re-execute" without going through the launchpad. The 1st and 3rd succeeded and the 2nd failed, but it seems like the 2nd ended up using the same config as the others anyway
[DagsterApiServer] Run execution process for cca2b48b-36db-4443-8f53-01a7934c3283 was terminated by signal 7 (SIGBUS).
14:33:20.459 - RUN_FAILURE - This run has been marked as failed from outside the execution context.
do I need a `true` or similar here? or is `null` expected
d
I think null is fine
k
is there a way to make the log output more verbose maybe
d
like specify a global python log level for the libraries accessed within the op you mean? You can add additional python loggers and configure their logging level here: https://docs.dagster.io/concepts/logging/python-logging#capturing-python-logs-
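A minimal sketch of that setup, assuming a hypothetical logger name that is also listed under python_logs.managed_python_loggers in dagster.yaml:

```python
import logging

from dagster import OpExecutionContext, asset

# Hypothetical logger name; if "my_pipeline" is listed under
# python_logs.managed_python_loggers in dagster.yaml (and python_log_level
# is set to DEBUG), these records show up in the Dagster event log
# alongside context.log output.
log = logging.getLogger("my_pipeline")


@asset
def seatgeek_events_raw(context: OpExecutionContext):
    log.debug("starting Spark read")
    context.log.info("context.log still works as before")
    ...
```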
k
i'm already doing some logging with `context.log.info` and similar in my ops – I meant is there a way for us to dig into what's actually causing the signal 7 SIGBUS thing. Like when I was running locally, if anything failed I'd always get a full stack trace
d
Yeah with most failures I would expect the stack trace to appear in Dagit and the container output - crashes are trickier, I wouldn't expect Dagster to make that any harder to diagnose than it would be in any other Python program but it can still be very annoying to track down
And I believe there are some ways to use pdb within dagster that could help here, I'll see if I can pull that up
Another team member pointed me to https://docs.python.org/3/library/faulthandler.html which is not Dagster-specific but could be useful - you can call faulthandler.enable() within your op or set the `PYTHONFAULTHANDLER` environment variable to True and it will hopefully give you more of a stack trace at least when the SIGBUS happens
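A minimal sketch of the faulthandler approach inside the asset body (the rest of the op is elided):

```python
import faulthandler
import sys

from dagster import OpExecutionContext, asset


@asset
def seatgeek_events_raw(context: OpExecutionContext):
    # On SIGSEGV/SIGBUS/SIGFPE/SIGABRT, dump the Python tracebacks of all
    # threads to stderr before the child process dies; equivalent to setting
    # PYTHONFAULTHANDLER in the environment.
    faulthandler.enable(file=sys.stderr, all_threads=True)
    # ... rest of the op body ...
```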
k
@daniel I did `faulthandler.enable()` and unfortunately I get the same thing
dagster._core.executor.child_process_executor.ChildProcessCrashException

Stack Trace:
  File "/opt/conda/lib/python3.10/site-packages/dagster/_core/executor/multiprocess.py", line 240, in execute
    event_or_none = next(step_iter)
,  File "/opt/conda/lib/python3.10/site-packages/dagster/_core/executor/multiprocess.py", line 357, in execute_step_out_of_process
    for ret in execute_child_process_command(multiproc_ctx, command):
,  File "/opt/conda/lib/python3.10/site-packages/dagster/_core/executor/child_process_executor.py", line 174, in execute_child_process_command
    raise ChildProcessCrashException(exit_code=process.exitcode)
d
Maybe try with the in process executor and faulthandler.enable?
k
that one gave me even less information, no stack trace
d
Nothing in the raw process logs for dagit in the terminal either?
You could also try calling that op that's crashing locally in a test or python script - either by calling the op function directly or by calling execute_in_process() on the job: https://docs.dagster.io/concepts/testing#testing-ops - which might give you more visibility into why it's crashing, or allow you to run it with a debugger like gdb or lldb that can set breakpoints on crashes
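A minimal sketch of that testing pattern (op and job names are placeholders):

```python
from dagster import job, op


@op
def suspect_op():
    ...  # the logic that appears to be crashing


@job
def suspect_job():
    suspect_op()


if __name__ == "__main__":
    # Runs every step in this process, so a crash surfaces as a normal
    # signal/core dump that gdb or lldb can catch, rather than a
    # ChildProcessCrashException from the multiprocess executor.
    result = suspect_job.execute_in_process()
    assert result.success
```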
k
nothing in the raw process logs for any of the containers other than what I pasted above @daniel – dagit seems happy
2023-02-25 14:53:24 2023-02-25 19:53:24 +0000 - dagit - INFO - Serving dagit on http://0.0.0.0:3003 in process 1
like I mentioned, this op was working fine when I just ran `dagster dev` locally, but in Docker it only works sometimes
can I call `execute_in_process` for an asset?
d
the assets version of execute_in_process is `materialize`: https://docs.dagster.io/_apidocs/execution#materializing-assets
k
ok gotcha – is that behavior different from when we executed with the launchpad config like this?
execution:
  config:
    in_process:
d
the main difference is it will be in memory in the process that calls the function - it won't go through the run launcher
so if you had a docker container that was similar to the one where it's SIGBUSing, you could try running materialize() within a container like that as a task and see if that also fails
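Something along these lines as a standalone script could be baked into the same image and run as a one-off container; the import path and compose service name are hypothetical:

```python
# debug_materialize.py
# e.g. docker compose run --rm <user_code_service> python debug_materialize.py
from dagster import materialize

from my_project.assets import events  # hypothetical import path

if __name__ == "__main__":
    # Materializes the asset in this process, bypassing the run launcher
    # and the multiprocess executor entirely.
    result = materialize([events])
    print("success:", result.success)
```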
k
that worked @daniel, I was able to do it both by calling the op directly and through materialize
from dagster import OpExecutionContext, asset, materialize

@asset(group_name="unified", retry_policy=retry_policy)  # retry_policy defined elsewhere
def events(context: OpExecutionContext):
    # ...
    # op definition
    # ...

# works
events(None)

# also works
materialize([events])
@daniel in case you care to chime in here as well (saw your name on the current `docker-compose.yml` example) – we'd like to improve the docs and hopefully alleviate problems for others https://dagster.slack.com/archives/C01U954MEER/p1677537966734039?thread_ts=1676958835.416799&cid=C01U954MEER
d
I did see that, I think that would be really great
k
I think the issues I'm hitting in this thread might be resolved if the actual code runs in another container rather than in the same one as the gRPC server
or at the very least, we might get better logs
d
Using the DockerRunLauncher seems like a good step, yeah - if nothing else it should isolate things quite a bit more
👍 1