# ask-community
b
Sometimes dagster doesn’t show a stack trace after an error occurs. All it tells me is there’s an error in an asset:
Copy code
2023-03-01 12:24:01 -0900 - dagster - ERROR - epacems_job - ebe6e649-285d-4915-874e-b5adee4f0ce3 - 79821 - RUN_FAILURE - Execution of run for "epacems_job" failed. Steps failed: ['hourly_emissions_epacems'].
Has anyone else experienced this issue?
d
Hi Bennett - does this run also appear in Dagit? Are there any more useful logs there?
b
The run does not appear in Dagit. I’m using `build_reconstructable_job` to run jobs via a Python script. When I use `execute_in_process` I get the full stack trace.
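For context, a rough sketch of the two code paths being compared (the module path and factory function name below are illustrative placeholders, not the exact PUDL symbols):
Copy code
from dagster import DagsterInstance, build_reconstructable_job, execute_job

# Out-of-process path: build a reconstructable job from a module-level factory
# and run it with execute_job (this is the path that sometimes loses the trace).
reconstructable_job = build_reconstructable_job(
    "pudl.ferc_to_sqlite.cli",      # module that defines the job factory
    "ferc_to_sqlite_job_factory",   # placeholder factory function name
)
result = execute_job(reconstructable_job, instance=DagsterInstance.get())

# In-process path: calling execute_in_process() on the job object directly
# always surfaces the full stack trace, e.g. ferc_to_sqlite_job.execute_in_process()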
d
Is it possible to share that stack trace? That might help explain why it’s getting swallowed
b
Sure! It’s a job with two ops:
Copy code
2023-03-01 14:43:28 [    INFO] catalystcoop.pudl.extract.ferc1:653 Dropping the old FERC Form 1 SQLite DB if it exists.
2023-03-01 14:43:28 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 328c2bdb-24e8-41a8-96a3-fab9e0788198 - 98735 - xbrl2sqlite - STEP_START - Started execution of step "xbrl2sqlite".
2023-03-01 14:43:28 [    INFO] catalystcoop.pudl.extract.xbrl:84 Dropping the old FERC Form 1 XBRL derived SQLite DB if it exists.
2023-03-01 14:43:29 -0900 - dagster - ERROR - ferc_to_sqlite_job - 328c2bdb-24e8-41a8-96a3-fab9e0788198 - 98735 - xbrl2sqlite - STEP_FAILURE - Execution of step "xbrl2sqlite" failed.

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "xbrl2sqlite"::

AssertionError: You are attempting to drop your database without setting clobber to False

Stack Trace:
  File "/Users/bendnorman/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_core/execution/plan/utils.py", line 55, in op_execution_error_boundary
    yield
  File "/Users/bendnorman/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_utils/__init__.py", line 471, in iterate_with_context
    next_output = next(iterator)
  File "/Users/bendnorman/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py", line 121, in _coerce_solid_compute_fn_to_iterator
    result = invoke_compute_fn(
  File "/Users/bendnorman/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py", line 115, in invoke_compute_fn
    return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass)
  File "/Users/bendnorman/catalyst/pudl/src/pudl/extract/xbrl.py", line 143, in xbrl2sqlite
    sqlite_engine = _get_sqlite_engine(form.value, output_path, clobber)
  File "/Users/bendnorman/catalyst/pudl/src/pudl/extract/xbrl.py", line 91, in _get_sqlite_engine
    pudl.helpers.drop_tables(sqlite_engine, clobber=clobber)
  File "/Users/bendnorman/catalyst/pudl/src/pudl/helpers.py", line 1031, in drop_tables
    raise AssertionError(

2023-03-01 14:43:29 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 328c2bdb-24e8-41a8-96a3-fab9e0788198 - 98713 - ENGINE_EVENT - Multiprocess executor: parent process exiting after 7.74s (pid: 98713)
2023-03-01 14:43:29 -0900 - dagster - ERROR - ferc_to_sqlite_job - 328c2bdb-24e8-41a8-96a3-fab9e0788198 - 98713 - RUN_FAILURE - Execution of run for "ferc_to_sqlite_job" failed. Steps failed: ['dbf2sqlite', 'xbrl2sqlite'].
It occasionally spits out one of the failures, which is how I got this stack trace.
d
What function are you running to execute the job after build_reconstructable_job? To confirm, execute_in_process always includes the failures with stack traces, but this other code path using build_reconstructable_job only sometimes includes the failures?
b
I’m using execute_job to execute the job created by build_reconstructable_job. Yes, execute_in_process includes the stack traces but build_reconstructable_job does not.
d
I think I have an idea what's going on here (we probably need to be flushing our logs more in some places) - in the meantime, as a workaround, the output of execute_job is an object with an `all_events` property that's a list of events - iterating through those and printing them out would probably have the failure message in it every time
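A minimal version of that workaround might look like this (assuming the job object from build_reconstructable_job is already in hand; the STEP_FAILURE events carry the error info and stack trace):
Copy code
from dagster import DagsterInstance, execute_job

result = execute_job(reconstructable_job, instance=DagsterInstance.get())

# Print every event; STEP_FAILURE events include the otherwise-swallowed stack trace.
for event in result.all_events:
    print(event)

if not result.success:
    raise SystemExit(1)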
b
Thanks for the workaround! Should I open an issue for this?
d
I was just trying to reproduce this this morning, no luck so far but could you try one thing for me? Does the problem go away if you set PYTHONUNBUFFERED=1 before running your script?
that would help verify that it is indeed a log flushing problem
b
I’ll try it!
The problem did not go away after setting PYTHONUNBUFFERED=1
d
Hm, ok, one other thing to try: If you add this to the run_config that is optionally passed into execute_job, does it go away then?
Copy code
run_config={"execution": {"config": {"in_process": {}}}}
b
Yes, the full stack traces are reliably displayed when `"execution": {"config": {"in_process": {}}}` is added to `execute_job`.
d
And ok one other thing that would be helpful, hopefully the last for a bit - are you able to share a full stream of logs from a run where the problem is happening? Knowing the last log that each subprocess logs could help track this down
b
Sure! With the in_process executor config?
d
without the in_process config (so when the problem is reproducing)
b
Copy code
pudl_settings is being depcrated in favor of environment variablesPUDL_OUTPUT and PUDL_CACHE
pudl_settings is being depcrated in favor of environment variablesPUDL_OUTPUT and PUDL_CACHE. For more infosee: <https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html>
sqlite and parquet directories are no longer being used. Make sure there is a single directory named 'output' at the root of your workspace. For more info see: <https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html>
pudl_settings is being depcrated in favor of environment variablesPUDL_OUTPUT and PUDL_CACHE
pudl_settings is being depcrated in favor of environment variablesPUDL_OUTPUT and PUDL_CACHE. For more infosee: <https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html>
sqlite and parquet directories are no longer being used. Make sure there is a single directory named 'output' at the root of your workspace. For more info see: <https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html>
2023-03-02 10:17:33 [ WARNING] catalystcoop.pudl.workspace.setup:107 pudl_settings is being depcrated in favor of environment variablesPUDL_OUTPUT and PUDL_CACHE. For more infosee: <https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html>
2023-03-02 10:17:33 [ WARNING] catalystcoop.pudl.workspace.setup:128 sqlite and parquet directories are no longer being used. Make sure there is a single directory named 'output' at the root of your workspace. For more info see: <https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html>
/Users/bendnorman/catalyst/pudl/src/pudl/ferc_to_sqlite/cli.py:135: ExperimentalWarning: "build_reconstructable_job" is an experimental function. It may break in future versions, even between dot releases. To mute warnings for experimental functionality, invoke warnings.filterwarnings("ignore", category=dagster.ExperimentalWarning) or use one of the other methods described at <https://docs.python.org/3/library/warnings.html#describing-warning-filters>.
  ferc_to_sqlite_reconstructable_job = build_reconstructable_job(
No dagster instance configuration file (dagster.yaml) found at /Users/bendnorman/catalyst/pudl-work. Defaulting to loading and storing all metadata with /Users/bendnorman/catalyst/pudl-work. If this is the desired behavior, create an empty dagster.yaml file in /Users/bendnorman/catalyst/pudl-work.
/Users/bendnorman/catalyst/pudl/src/pudl/ferc_to_sqlite/cli.py:141: ExperimentalWarning: "execute_job" is an experimental function. It may break in future versions, even between dot releases. To mute warnings for experimental functionality, invoke warnings.filterwarnings("ignore", category=dagster.ExperimentalWarning) or use one of the other methods described at <https://docs.python.org/3/library/warnings.html#describing-warning-filters>.
  result = execute_job(
2023-03-02 10:17:34 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 8ff4c146-9654-4a8b-8c94-a9a8a76b79de - 97201 - RUN_START - Started execution of run for "ferc_to_sqlite_job".
2023-03-02 10:17:34 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 8ff4c146-9654-4a8b-8c94-a9a8a76b79de - 97201 - ENGINE_EVENT - Executing steps using multiprocess executor: parent process (pid: 97201)
2023-03-02 10:17:34 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 8ff4c146-9654-4a8b-8c94-a9a8a76b79de - 97201 - dbf2sqlite - STEP_WORKER_STARTING - Launching subprocess for "dbf2sqlite".
2023-03-02 10:17:34 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 8ff4c146-9654-4a8b-8c94-a9a8a76b79de - 97201 - xbrl2sqlite - STEP_WORKER_STARTING - Launching subprocess for "xbrl2sqlite".
pudl_settings is being depcrated in favor of environment variablesPUDL_OUTPUT and PUDL_CACHE
pudl_settings is being depcrated in favor of environment variablesPUDL_OUTPUT and PUDL_CACHE. For more infosee: <https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html>
sqlite and parquet directories are no longer being used. Make sure there is a single directory named 'output' at the root of your workspace. For more info see: <https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html>
pudl_settings is being depcrated in favor of environment variablesPUDL_OUTPUT and PUDL_CACHE
pudl_settings is being depcrated in favor of environment variablesPUDL_OUTPUT and PUDL_CACHE. For more infosee: <https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html>
sqlite and parquet directories are no longer being used. Make sure there is a single directory named 'output' at the root of your workspace. For more info see: <https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html>
pudl_settings is being depcrated in favor of environment variablesPUDL_OUTPUT and PUDL_CACHE
pudl_settings is being depcrated in favor of environment variablesPUDL_OUTPUT and PUDL_CACHE. For more infosee: <https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html>
sqlite and parquet directories are no longer being used. Make sure there is a single directory named 'output' at the root of your workspace. For more info see: <https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html>
pudl_settings is being depcrated in favor of environment variablesPUDL_OUTPUT and PUDL_CACHE
pudl_settings is being depcrated in favor of environment variablesPUDL_OUTPUT and PUDL_CACHE. For more infosee: <https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html>
sqlite and parquet directories are no longer being used. Make sure there is a single directory named 'output' at the root of your workspace. For more info see: <https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html>
2023-03-02 10:17:42 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 8ff4c146-9654-4a8b-8c94-a9a8a76b79de - 97213 - STEP_WORKER_STARTED - Executing step "xbrl2sqlite" in subprocess.
2023-03-02 10:17:42 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 8ff4c146-9654-4a8b-8c94-a9a8a76b79de - 97212 - STEP_WORKER_STARTED - Executing step "dbf2sqlite" in subprocess.
2023-03-02 10:17:42 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 8ff4c146-9654-4a8b-8c94-a9a8a76b79de - 97213 - xbrl2sqlite - RESOURCE_INIT_STARTED - Starting initialization of resources [datastore, ferc_to_sqlite_settings, io_manager].
2023-03-02 10:17:42 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 8ff4c146-9654-4a8b-8c94-a9a8a76b79de - 97212 - dbf2sqlite - RESOURCE_INIT_STARTED - Starting initialization of resources [datastore, ferc_to_sqlite_settings, io_manager].
2023-03-02 10:17:42 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 8ff4c146-9654-4a8b-8c94-a9a8a76b79de - 97213 - xbrl2sqlite - RESOURCE_INIT_SUCCESS - Finished initialization of resources [datastore, ferc_to_sqlite_settings, io_manager].
2023-03-02 10:17:42 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 8ff4c146-9654-4a8b-8c94-a9a8a76b79de - 97212 - dbf2sqlite - RESOURCE_INIT_SUCCESS - Finished initialization of resources [datastore, ferc_to_sqlite_settings, io_manager].
2023-03-02 10:17:42 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 8ff4c146-9654-4a8b-8c94-a9a8a76b79de - 97213 - LOGS_CAPTURED - Started capturing logs in process (pid: 97213).
2023-03-02 10:17:42 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 8ff4c146-9654-4a8b-8c94-a9a8a76b79de - 97213 - xbrl2sqlite - STEP_START - Started execution of step "xbrl2sqlite".
2023-03-02 10:17:42 [    INFO] catalystcoop.pudl.extract.xbrl:84 Dropping the old FERC Form 1 XBRL derived SQLite DB if it exists.
2023-03-02 10:17:42 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 8ff4c146-9654-4a8b-8c94-a9a8a76b79de - 97212 - LOGS_CAPTURED - Started capturing logs in process (pid: 97212).
2023-03-02 10:17:42 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 8ff4c146-9654-4a8b-8c94-a9a8a76b79de - 97212 - dbf2sqlite - STEP_START - Started execution of step "dbf2sqlite".
2023-03-02 10:17:42 [    INFO] catalystcoop.pudl.extract.ferc1:653 Dropping the old FERC Form 1 SQLite DB if it exists.
2023-03-02 10:17:44 -0900 - dagster - DEBUG - ferc_to_sqlite_job - 8ff4c146-9654-4a8b-8c94-a9a8a76b79de - 97201 - ENGINE_EVENT - Multiprocess executor: parent process exiting after 9.99s (pid: 97201)
2023-03-02 10:17:44 -0900 - dagster - ERROR - ferc_to_sqlite_job - 8ff4c146-9654-4a8b-8c94-a9a8a76b79de - 97201 - RUN_FAILURE - Execution of run for "ferc_to_sqlite_job" failed. Steps failed: ['dbf2sqlite', 'xbrl2sqlite'].
dag_debug.log
d
I assume you don't have any logging-related customization in your dagster.yaml?
b
No not yet. We’ve been using the default logging behavior.
d
And similarly your ops don't make any logging configuration changes? (logging.dictConfig, that kind of thing)
b
I do set the Python log level in the job that execute_job runs. This is why I switched from using `dagster.reconstructable` to `dagster.build_reconstructable_job()`. This PR explains the context.
d
Oh nice - if I wanted to try running your exact job locally myself would that be a whole adventure? didn't realize it was publicly available
b
haha potentially. It’s not that much setup. You’d need to clone the repo, set up a conda env, set a couple of env vars, then run the script.
Here are our setup docs, though they haven’t been updated to reflect our dagster changes.
d
I'd be curious if taking out that
Copy code
reconstructable_kwargs={"loglevel": args.loglevel, "logfile": args.logfile},
makes a difference (all this back and forth debugging is entirely optional btw, you've already gone above and beyond)
although looking at the code that might not be totally trivial to test
b
Ah, I’m getting more reliable stack traces after removing
Copy code
reconstructable_kwargs={"loglevel": args.loglevel, "logfile": args.logfile},
But then I lose my ability to set the Python log level for the script, which is my original reason for switching from `reconstructable` to `build_reconstructable_job`.
d
Got it - that may help me reproduce though at least, that's a helpful clue
i'll see if i can reproduce with a filehandler added - that would make a lot of sense actually
if it's doing more work, it's more likely that the lack of a flush hook at the end will cause problems
b
Sounds good! Thanks for all of the help!
d
what script do you run in your repo to trigger the job locally? I see a couple in the PR
b
I’ve been testing this with the `ferc_to_sqlite` command.
d
super odd, still no dice running that script locally in my terminal
i suppose you're passing in the argfile/loglevel? what script are you running exactly
b
Copy code
ferc_to_sqlite src/pudl/package_data/settings/etl_fast.yml
From the pudl repo
d
huh. me too. Mac Terminal?
b
it should throw an error if there is an existing sqlite database.
d
er wait, i'll try that exact yaml file
b
Mac iTerm
The latest commit to that branch includes the workaround you suggested.
d
How odd:
Copy code
2023-03-02 14:36:33 -0600 - dagster - DEBUG - ferc_to_sqlite_job - ba895ed4-8280-4453-8a80-e53f95906c8d - 7927 - STEP_WORKER_STARTED - Executing step "xbrl2sqlite" in subprocess.
2023-03-02 14:36:33 -0600 - dagster - DEBUG - ferc_to_sqlite_job - ba895ed4-8280-4453-8a80-e53f95906c8d - 7920 - STEP_WORKER_STARTED - Executing step "dbf2sqlite" in subprocess.
2023-03-02 14:36:33 -0600 - dagster - DEBUG - ferc_to_sqlite_job - ba895ed4-8280-4453-8a80-e53f95906c8d - 7927 - xbrl2sqlite - RESOURCE_INIT_STARTED - Starting initialization of resources [datastore, ferc_to_sqlite_settings, io_manager].
2023-03-02 14:36:33 -0600 - dagster - DEBUG - ferc_to_sqlite_job - ba895ed4-8280-4453-8a80-e53f95906c8d - 7920 - dbf2sqlite - RESOURCE_INIT_STARTED - Starting initialization of resources [datastore, ferc_to_sqlite_settings, io_manager].
2023-03-02 14:36:33 -0600 - dagster - DEBUG - ferc_to_sqlite_job - ba895ed4-8280-4453-8a80-e53f95906c8d - 7927 - xbrl2sqlite - RESOURCE_INIT_SUCCESS - Finished initialization of resources [datastore, ferc_to_sqlite_settings, io_manager].
2023-03-02 14:36:33 -0600 - dagster - DEBUG - ferc_to_sqlite_job - ba895ed4-8280-4453-8a80-e53f95906c8d - 7920 - dbf2sqlite - RESOURCE_INIT_SUCCESS - Finished initialization of resources [datastore, ferc_to_sqlite_settings, io_manager].
2023-03-02 14:36:33 -0600 - dagster - DEBUG - ferc_to_sqlite_job - ba895ed4-8280-4453-8a80-e53f95906c8d - 7927 - LOGS_CAPTURED - Started capturing logs in process (pid: 7927).
2023-03-02 14:36:33 -0600 - dagster - DEBUG - ferc_to_sqlite_job - ba895ed4-8280-4453-8a80-e53f95906c8d - 7920 - LOGS_CAPTURED - Started capturing logs in process (pid: 7920).
2023-03-02 14:36:33 -0600 - dagster - DEBUG - ferc_to_sqlite_job - ba895ed4-8280-4453-8a80-e53f95906c8d - 7927 - xbrl2sqlite - STEP_START - Started execution of step "xbrl2sqlite".
2023-03-02 14:36:33 [    INFO] catalystcoop.pudl.extract.xbrl:84 Dropping the old FERC Form 1 XBRL derived SQLite DB if it exists.
2023-03-02 14:36:33 -0600 - dagster - DEBUG - ferc_to_sqlite_job - ba895ed4-8280-4453-8a80-e53f95906c8d - 7920 - dbf2sqlite - STEP_START - Started execution of step "dbf2sqlite".
2023-03-02 14:36:33 [    INFO] catalystcoop.pudl.extract.ferc1:653 Dropping the old FERC Form 1 SQLite DB if it exists.
2023-03-02 14:36:33 [    INFO] catalystcoop.pudl.extract.ferc1:667 Creating a new database schema based on 2020.
2023-03-02 14:36:33 -0600 - dagster - ERROR - ferc_to_sqlite_job - ba895ed4-8280-4453-8a80-e53f95906c8d - 7927 - xbrl2sqlite - STEP_FAILURE - Execution of step "xbrl2sqlite" failed.

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "xbrl2sqlite"::

Exception: SIMULATED FAILURE

Stack Trace:
  File "/Users/dgibson/dagster/python_modules/dagster/dagster/_core/execution/plan/utils.py", line 55, in op_execution_error_boundary
    yield
  File "/Users/dgibson/dagster/python_modules/dagster/dagster/_utils/__init__.py", line 471, in iterate_with_context
    next_output = next(iterator)
  File "/Users/dgibson/dagster/python_modules/dagster/dagster/_core/execution/plan/compute_generator.py", line 121, in _coerce_solid_compute_fn_to_iterator
    result = invoke_compute_fn(
  File "/Users/dgibson/dagster/python_modules/dagster/dagster/_core/execution/plan/compute_generator.py", line 115, in invoke_compute_fn
    return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass)
  File "/Users/dgibson/pudl/src/pudl/extract/xbrl.py", line 145, in xbrl2sqlite
    raise Exception("SIMULATED FAILURE")
i still think the fix could resolve it, but it would be lovely to be able to reproduce the problem to verify...
b
Hmm, strange. The stack traces seem inconsistent. I just ran the script ~10 times: ~50% of the time the stack traces from both op failures were included, ~40% of the time one stack trace was included, and ~10% of the time no stack traces were included.
d
Dunno how enthused you are about installing dagster from source to test out a fix (obviously no expectations there) but I think the PR here has a good chance of fixing this by being sure to flush out loggers before the subprocesses end: https://github.com/dagster-io/dagster/pull/12670/files
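The general idea of the fix, sketched very roughly (this is not the actual PR code): flush every handler on every logger before a step subprocess exits, so buffered records aren't dropped.
Copy code
import logging

def flush_all_log_handlers():
    # Flush the root logger plus every named logger that has been created so far.
    loggers = [logging.getLogger()] + [
        logging.getLogger(name) for name in logging.root.manager.loggerDict
    ]
    for logger in loggers:
        for handler in logger.handlers:
            handler.flush()

# Called right before a step worker process exits, this would help ensure the
# STEP_FAILURE message (and its stack trace) actually reaches its destination.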
b
I can give it a go! I set up the env the other day to make a fix of my own, which I think you might be the reviewer on!
d
oh nice, sounds like a fair trade
👍 1
b
What’s the easiest way to install dagster from source? Can I just point pip to your branch?
d
Yeah, checking out that branch and then pip install -e python_modules/dagster should work
👍 1
a
"... logging to a single file from multiple processes is not supported, because there is no standard way to serialize access to a single file across multiple processes in Python"
https://docs.python.org/3/howto/logging-cookbook.html#logging-to-a-single-file-from-multiple-processes
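For reference, the pattern that cookbook section recommends instead of sharing one file handler across processes is to funnel records through a queue to a single listener; a minimal sketch:
Copy code
import logging
import logging.handlers
import multiprocessing

def worker(queue):
    # Worker processes only enqueue records; they never touch the log file themselves.
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    root.addHandler(logging.handlers.QueueHandler(queue))
    root.info("hello from a worker process")

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    # A single listener in the parent process owns the file handler.
    listener = logging.handlers.QueueListener(queue, logging.FileHandler("combined.log"))
    listener.start()
    p = multiprocessing.Process(target=worker, args=(queue,))
    p.start()
    p.join()
    listener.stop()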
b
Ah thank you @alex! I’ll remove the logfile option from the cli command.
@daniel looks like the code on `loggingshutdown` is still swallowing the stack traces.
Given you can’t reproduce it I think the work around should be fine for now!
d
Hmm alex's theory is probably the correct one then. Python logging strikes again