# ask-community
k
Hey all – I'm seeing a `dagster._core.executor.child_process_executor.ChildProcessCrashException` every time I run an asset materialization for the first time after a server restart. It's not an OOM issue – I'm processing on the order of ~200MB of data and the container has 32GB. Every time, retrying a second time succeeds. What could be the cause of this / how can I dig in further?
d
Hi Kevin - do you have a full stack trace from the error?
k
hey @daniel this is all i get unfortunately in the web UI
Multiprocess executor: child process for step seatgeek_events_raw was terminated by signal 7 (SIGBUS).
dagster._core.executor.child_process_executor.ChildProcessCrashException
Stack Trace:
File "/opt/conda/lib/python3.10/site-packages/dagster/_core/executor/multiprocess.py", line 240, in execute
event_or_none = next(step_iter)
,  File "/opt/conda/lib/python3.10/site-packages/dagster/_core/executor/multiprocess.py", line 357, in execute_step_out_of_process
for ret in execute_child_process_command(multiproc_ctx, command):
,  File "/opt/conda/lib/python3.10/site-packages/dagster/_core/executor/child_process_executor.py", line 174, in execute_child_process_command
raise ChildProcessCrashException(exit_code=process.exitcode)
d
SIGBUS is unusual... any chance you’re able to share code that reproduces the problem? The crash is very likely coming from the code in the body of the op - are you able to reproduce the crash when running the same code or dagster job locally?
Passing on retry is also odd - is this using the default run launcher or a different one? Any raw logs from the process where the run ran that might give more information about the crash?
k
@daniel I wasn't getting anything similar to this when I was running locally, this started when I followed the guide to get Docker working. Now it works sometimes and sometimes it doesn't. Using the default run launcher right now; I saw a bunch of errors when I used the Docker run launcher.
there's unfortunately not anything more helpful in the container logs either, it says the same stuff that's in the web UI. I feel like maybe an error in my code is occurring and it's getting masked by the dagster code as a `ChildProcessCrashException` – just a hunch though
most of my assets are using Spark but they were running fine without error locally
d
one thing you could do is try running it with the in_process executor (that runs everything in the same process) instead of the default multiprocess executor that runs each op in its own subprocess (e.g. by adding this to your run config in the launchpad):
execution:
  config:
    in_process:
i'd still expect it to crash, but there might be more information about the crash since it'd be happening in the 'main' process? or if it doesn't crash, that might be a clue
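For reference, a rough sketch of pinning the same executor choice in code rather than the launchpad – the job name and asset selection below are hypothetical, and the job would still need to be added to your Definitions/repository to appear in Dagit:

```python
from dagster import define_asset_job, in_process_executor

# Hypothetical one-off job targeting the crashing asset; swapping in the
# in-process executor means every step runs in the parent process instead
# of a forked child process.
debug_events_job = define_asset_job(
    name="debug_events_job",
    selection="seatgeek_events_raw",
    executor_def=in_process_executor,
)
```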
k
yeah I haven't been able to reproduce a crash reliably yet because I'm not getting any detailed errors that tell me what failed. I'm using assets for everything, do assets work with the launchpad? can't find it
d
Yeah, it's a bit hidden but if you shift click on the text of the Materialize button it will open the launchpad
k
oh nice ok let me try
ok so I ran it 3 times: for the 1st and 3rd I passed the block you suggested directly, for the 2nd I just clicked "re-execute" without going through the launchpad. The 1st and 3rd succeeded and the 2nd failed, but it seems like the 2nd ended up using the same config as the others anyway
[DagsterApiServer] Run execution process for cca2b48b-36db-4443-8f53-01a7934c3283 was terminated by signal 7 (SIGBUS).
14:33:20.459 - RUN_FAILURE - This run has been marked as failed from outside the execution context.
do I need a `true` or similar here? or is `null` expected
d
I think null is fine
k
is there a way to make the log output more verbose maybe
d
like specify a global python log level for the libraries accessed within the op you mean? You can add additional python loggers and configure their logging level here: https://docs.dagster.io/concepts/logging/python-logging#capturing-python-logs-
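A minimal sketch of that setup, assuming a hypothetical logger name that is also listed under python_logs.managed_python_loggers in dagster.yaml:

```python
import logging

from dagster import OpExecutionContext, asset

# Hypothetical logger name; if "my_pipeline" is listed under
# python_logs.managed_python_loggers in dagster.yaml (and python_log_level
# is set to DEBUG), these records show up in the Dagster event log
# alongside context.log output.
log = logging.getLogger("my_pipeline")


@asset
def seatgeek_events_raw(context: OpExecutionContext):
    log.debug("starting Spark read")
    context.log.info("context.log still works as before")
    ...
```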
k
i'm already doing some logging with `context.log.info` and similar in my ops – I meant is there a way for us to dig into what's actually causing the signal 7 SIGBUS thing. Like when I was running locally, if anything failed I'd always get a full stack trace
d
Yeah with most failures I would expect the stack trace to appear in Dagit and the container output - crashes are trickier, I wouldn't expect Dagster to make that any harder to diagnose than it would be in any other Python program but it can still be very annoying to track down
And I believe there are some ways to use pdb within dagster that could help here, I'll see if I can pull that up
Another team member pointed me to https://docs.python.org/3/library/faulthandler.html which is not Dagster-specific but could be useful - you can call faulthandler.enable() within your op or set the `PYTHONFAULTHANDLER` environment variable to True and it will hopefully give you more of a stack trace at least when the SIGBUS happens
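A minimal sketch of the faulthandler approach inside the asset body (the rest of the op is elided):

```python
import faulthandler
import sys

from dagster import OpExecutionContext, asset


@asset
def seatgeek_events_raw(context: OpExecutionContext):
    # On SIGSEGV/SIGBUS/SIGFPE/SIGABRT, dump the Python tracebacks of all
    # threads to stderr before the child process dies; equivalent to setting
    # PYTHONFAULTHANDLER in the environment.
    faulthandler.enable(file=sys.stderr, all_threads=True)
    # ... rest of the op body ...
```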
k
@daniel I did `faulthandler.enable()` and unfortunately I get the same thing
dagster._core.executor.child_process_executor.ChildProcessCrashException

Stack Trace:
  File "/opt/conda/lib/python3.10/site-packages/dagster/_core/executor/multiprocess.py", line 240, in execute
    event_or_none = next(step_iter)
,  File "/opt/conda/lib/python3.10/site-packages/dagster/_core/executor/multiprocess.py", line 357, in execute_step_out_of_process
    for ret in execute_child_process_command(multiproc_ctx, command):
,  File "/opt/conda/lib/python3.10/site-packages/dagster/_core/executor/child_process_executor.py", line 174, in execute_child_process_command
    raise ChildProcessCrashException(exit_code=process.exitcode)
d
Maybe try with the in process executor and faulthandler.enable?
k
that one gave me even less information, no stack trace
d
Nothing in the raw process logs for dagit in the terminal either?
You could also try calling that op that's crashing locally in a test or python script - either by calling the op function directly or by calling execute_in_process() on the job: https://docs.dagster.io/concepts/testing#testing-ops - which might give you more visibility into why it's crashing, or allow you to run it with a debugger like gdb or lldb that can set breakpoints on crashes
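A minimal sketch of that testing pattern (op and job names are placeholders):

```python
from dagster import job, op


@op
def suspect_op():
    ...  # the logic that appears to be crashing


@job
def suspect_job():
    suspect_op()


if __name__ == "__main__":
    # Runs every step in this process, so a crash surfaces as a normal
    # signal/core dump that gdb or lldb can catch, rather than a
    # ChildProcessCrashException from the multiprocess executor.
    result = suspect_job.execute_in_process()
    assert result.success
```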
k
nothing in the raw process logs for any of the containers other than what I pasted above @daniel – dagit seems happy
2023-02-25 14:53:24 2023-02-25 19:53:24 +0000 - dagit - INFO - Serving dagit on http://0.0.0.0:3003 in process 1
like I mentioned, this op was working fine when I just ran `dagster dev` locally, but in Docker it only works sometimes
can I call `execute_in_process` for an asset?
d
the assets version of execute_in_process is `materialize`: https://docs.dagster.io/_apidocs/execution#materializing-assets
k
ok gotcha – is that behavior different from when we executed with the launchpad config like this?
execution:
  config:
    in_process:
d
the main difference is it will be in memory in the process that calls the function - it won't go through the run launcher
so if you had a docker container that was similar to the one where it's SIGBUSing, you could try running materialize() within a container like that as a task and see if that also fails
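Something along these lines as a standalone script could be baked into the same image and run as a one-off container; the import path and compose service name are hypothetical:

```python
# debug_materialize.py
# e.g. docker compose run --rm <user_code_service> python debug_materialize.py
from dagster import materialize

from my_project.assets import events  # hypothetical import path

if __name__ == "__main__":
    # Materializes the asset in this process, bypassing the run launcher
    # and the multiprocess executor entirely.
    result = materialize([events])
    print("success:", result.success)
```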
k
that worked @daniel, I was able to do it both by calling the op directly and through materialize
from dagster import OpExecutionContext, asset, materialize

@asset(group_name="unified", retry_policy=retry_policy)  # retry_policy defined elsewhere
def events(context: OpExecutionContext):
    # ...
    # op definition
    # ...

# works
events(None)

# also works
materialize([events])
@daniel in case you care to chime in here as well (saw your name on the current `docker-compose.yml` example) – we'd like to improve the docs and hopefully alleviate problems for others https://dagster.slack.com/archives/C01U954MEER/p1677537966734039?thread_ts=1676958835.416799&cid=C01U954MEER
d
I did see that, I think that would be really great
k
I think the issues I'm hitting in this thread might be resolved if the actual code runs in another container rather than in the same one as the gRPC server
or at the very least, we might get better logs
d
Using the DockerRunLauncher seems like a good step, yeah - if nothing else it should isolate things quite a bit more
👍 1