https://dagster.io/ logo
#dagster-support
Title
# dagster-support
g

Grigoriy Sterin

04/19/2022, 2:25 PM
I'm seeing a strange bug (?) when running Dagster job on ECS using ECSRunLauncher. I have a set of software-defined assets. I start an asset materialization job from Dagit. Then exception is thrown inside the asset function. Exception is legit - there's an error in the input data, so I see step failure in the log in Dagit
dagster.core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op...
. And here goes the most interesting part: underlying ECS jobs keep running and Dagster job also keeps running forever (Dagit shows it's status as
STARTED
), but nothing actually happens in this job, since it has already failed (but somehow Dagster doesn't know about it). Has anyone seen something like this before? Dagster version:
0.14.7
j

johann

04/19/2022, 4:05 PM
The ECS task keeps running indefinitely? Does Dagit show a STEP_FAILURE event?
g

Grigoriy Sterin

04/19/2022, 4:30 PM
Yes to both questions
j

johann

04/19/2022, 4:34 PM
Very strange… could you share the logs from the ECS task for the run
g

Grigoriy Sterin

04/19/2022, 4:35 PM
Sure. When I look into the ECS run logs CloudWatch I don't see the error there - it all finishes with resources' initialization for that failed asset:
Copy code
2022-04-18T09:44:23.810-04:00	./pipelines/data_ingestion_assets/assets/novogene_data_ingestion_assets.py:28: ExperimentalWarning: "asset" is an experimental decorator. It may break in future versions, even between dot releases. To mute warnings for experimental functionality, invoke warnings.filterwarnings("ignore", category=dagster.ExperimentalWarning) or use one of the other methods described at <https://docs.python.org/3/library/warnings.html#describing-warning-filters>.

2022-04-18T09:44:23.810-04:00	metadata=RUN_METADATA

2022-04-18T09:44:23.810-04:00	./pipelines/data_ingestion_assets/assets/novogene_data_ingestion_assets.py:45: ExperimentalWarning: "asset" is an experimental decorator. It may break in future versions, even between dot releases. To mute warnings for experimental functionality, invoke warnings.filterwarnings("ignore", category=dagster.ExperimentalWarning) or use one of the other methods described at <https://docs.python.org/3/library/warnings.html#describing-warning-filters>.

2022-04-18T09:44:23.810-04:00	@asset

2022-04-18T09:44:23.810-04:00	./pipelines/data_ingestion_assets/assets/novogene_data_ingestion_assets.py:55: ExperimentalWarning: "asset" is an experimental decorator. It may break in future versions, even between dot releases. To mute warnings for experimental functionality, invoke warnings.filterwarnings("ignore", category=dagster.ExperimentalWarning) or use one of the other methods described at <https://docs.python.org/3/library/warnings.html#describing-warning-filters>.

2022-04-18T09:44:23.810-04:00	metadata=RUN_METADATA

2022-04-18T09:44:23.810-04:00	2022-04-18 13:44:23 +0000 - dagster - DEBUG - data_ingestion_assets - f502119f-b348-43fd-a2aa-0741f89fefb3 - 1 - RUN_START - Started execution of run for "data_ingestion_assets".

2022-04-18T09:44:23.827-04:00	2022-04-18 13:44:23 +0000 - dagster - DEBUG - data_ingestion_assets - f502119f-b348-43fd-a2aa-0741f89fefb3 - 1 - ENGINE_EVENT - Executing steps using multiprocess executor: parent process (pid: 1)

2022-04-18T09:44:23.910-04:00	2022-04-18 13:44:23 +0000 - dagster - DEBUG - data_ingestion_assets - f502119f-b348-43fd-a2aa-0741f89fefb3 - 1 - new_novogene_data_folders_in_aggregation_bucket - ENGINE_EVENT - Launching subprocess for new_novogene_data_folders_in_aggregation_bucket

2022-04-18T09:44:28.635-04:00	./pipelines/data_ingestion_assets/assets/novogene_data_ingestion_assets.py:28: ExperimentalWarning: "asset" is an experimental decorator. It may break in future versions, even between dot releases. To mute warnings for experimental functionality, invoke warnings.filterwarnings("ignore", category=dagster.ExperimentalWarning) or use one of the other methods described at <https://docs.python.org/3/library/warnings.html#describing-warning-filters>.

2022-04-18T09:44:28.635-04:00	metadata=RUN_METADATA

2022-04-18T09:44:28.635-04:00	./pipelines/data_ingestion_assets/assets/novogene_data_ingestion_assets.py:45: ExperimentalWarning: "asset" is an experimental decorator. It may break in future versions, even between dot releases. To mute warnings for experimental functionality, invoke warnings.filterwarnings("ignore", category=dagster.ExperimentalWarning) or use one of the other methods described at <https://docs.python.org/3/library/warnings.html#describing-warning-filters>.

2022-04-18T09:44:28.635-04:00	@asset

2022-04-18T09:44:28.635-04:00	./pipelines/data_ingestion_assets/assets/novogene_data_ingestion_assets.py:55: ExperimentalWarning: "asset" is an experimental decorator. It may break in future versions, even between dot releases. To mute warnings for experimental functionality, invoke warnings.filterwarnings("ignore", category=dagster.ExperimentalWarning) or use one of the other methods described at <https://docs.python.org/3/library/warnings.html#describing-warning-filters>.

2022-04-18T09:44:28.635-04:00	metadata=RUN_METADATA

2022-04-18T09:44:28.635-04:00	2022-04-18 13:44:28 +0000 - dagster - DEBUG - data_ingestion_assets - f502119f-b348-43fd-a2aa-0741f89fefb3 - 12 - new_novogene_data_folders_in_aggregation_bucket - ENGINE_EVENT - Starting initialization of resources [root_manager, s3_aggregation_bucket_io_manager, sqs_source_asset_io_manager].

2022-04-18T09:44:28.816-04:00	2022-04-18 13:44:28 +0000 - dagster - DEBUG - data_ingestion_assets - f502119f-b348-43fd-a2aa-0741f89fefb3 - 12 - new_novogene_data_folders_in_aggregation_bucket - ENGINE_EVENT - Finished initialization of resources [root_manager, s3_aggregation_bucket_io_manager, sqs_source_asset_io_manager].

2022-04-18T09:44:29.896-04:00	2022-04-18 13:44:29 +0000 - dagster - DEBUG - data_ingestion_assets - f502119f-b348-43fd-a2aa-0741f89fefb3 - 12 - new_novogene_data_folders_in_aggregation_bucket - LOGS_CAPTURED - Started capturing logs for step: new_novogene_data_folders_in_aggregation_bucket.

2022-04-18T09:44:29.896-04:00	2022-04-18 13:44:29 +0000 - dagster - DEBUG - data_ingestion_assets - f502119f-b348-43fd-a2aa-0741f89fefb3 - 12 - new_novogene_data_folders_in_aggregation_bucket - STEP_START - Started execution of step "new_novogene_data_folders_in_aggregation_bucket".

2022-04-18T09:44:29.896-04:00	2022-04-18 13:44:29 +0000 - dagster - DEBUG - data_ingestion_assets - f502119f-b348-43fd-a2aa-0741f89fefb3 - 12 - new_novogene_data_folders_in_aggregation_bucket - LOADED_INPUT - Loaded input "new_folders_in_novogene_landing_bucket" using input manager "root_manager"

2022-04-18T09:44:30.896-04:00	2022-04-18 13:44:29 +0000 - dagster - DEBUG - data_ingestion_assets - f502119f-b348-43fd-a2aa-0741f89fefb3 - 12 - new_novogene_data_folders_in_aggregation_bucket - STEP_INPUT - Got input "new_folders_in_novogene_landing_bucket" of type "NewLandingData". (Type check passed).

2022-04-18T09:44:30.896-04:00	2022-04-18 13:44:29 +0000 - dagster - DEBUG - data_ingestion_assets - f502119f-b348-43fd-a2aa-0741f89fefb3 - 12 - new_novogene_data_folders_in_aggregation_bucket - STEP_OUTPUT - Yielded output "result" of type "NewAggregationData". (Type check passed).

2022-04-18T09:46:55.914-04:00	2022-04-18 13:46:55 +0000 - dagster - INFO - data_ingestion_assets - f502119f-b348-43fd-a2aa-0741f89fefb3 - new_novogene_data_folders_in_aggregation_bucket - Skipping fvl58-novogene-data-landing/PLT314-317 - already exists in aggregation bucket.

2022-04-18T09:46:55.914-04:00	2022-04-18 13:46:55 +0000 - dagster - INFO - data_ingestion_assets - f502119f-b348-43fd-a2aa-0741f89fefb3 - new_novogene_data_folders_in_aggregation_bucket - Writing the list of last moved folders in aggregation bucket to fvl58-novogene-data-landing/data_ingestion_assets-f502119f-b348-43fd-a2aa-0741f89fefb3.lst.

2022-04-18T09:46:57.095-04:00	2022-04-18 13:46:57 +0000 - dagster - DEBUG - data_ingestion_assets - f502119f-b348-43fd-a2aa-0741f89fefb3 - 1 - new_novogene_submission_sets - ENGINE_EVENT - Launching subprocess for new_novogene_submission_sets

2022-04-18T09:47:03.435-04:00	./pipelines/data_ingestion_assets/assets/novogene_data_ingestion_assets.py:28: ExperimentalWarning: "asset" is an experimental decorator. It may break in future versions, even between dot releases. To mute warnings for experimental functionality, invoke warnings.filterwarnings("ignore", category=dagster.ExperimentalWarning) or use one of the other methods described at <https://docs.python.org/3/library/warnings.html#describing-warning-filters>.

2022-04-18T09:47:03.435-04:00	metadata=RUN_METADATA

2022-04-18T09:47:03.435-04:00	./pipelines/data_ingestion_assets/assets/novogene_data_ingestion_assets.py:45: ExperimentalWarning: "asset" is an experimental decorator. It may break in future versions, even between dot releases. To mute warnings for experimental functionality, invoke warnings.filterwarnings("ignore", category=dagster.ExperimentalWarning) or use one of the other methods described at <https://docs.python.org/3/library/warnings.html#describing-warning-filters>.

2022-04-18T09:47:03.435-04:00	@asset

2022-04-18T09:47:03.435-04:00	./pipelines/data_ingestion_assets/assets/novogene_data_ingestion_assets.py:55: ExperimentalWarning: "asset" is an experimental decorator. It may break in future versions, even between dot releases. To mute warnings for experimental functionality, invoke warnings.filterwarnings("ignore", category=dagster.ExperimentalWarning) or use one of the other methods described at <https://docs.python.org/3/library/warnings.html#describing-warning-filters>.

2022-04-18T09:47:03.435-04:00	metadata=RUN_METADATA

2022-04-18T09:47:03.435-04:00	2022-04-18 13:47:03 +0000 - dagster - DEBUG - data_ingestion_assets - f502119f-b348-43fd-a2aa-0741f89fefb3 - 26 - new_novogene_submission_sets - ENGINE_EVENT - Starting initialization of resources [s3_aggregation_bucket_io_manager, s3_novogene_submission_sets_io_manager].

2022-04-18T09:47:03.614-04:00	2022-04-18 13:47:03 +0000 - dagster - DEBUG - data_ingestion_assets - f502119f-b348-43fd-a2aa-0741f89fefb3 - 26 - new_novogene_submission_sets - ENGINE_EVENT - Finished initialization of resources [s3_aggregation_bucket_io_manager, s3_novogene_submission_sets_io_manager].
j

johann

04/19/2022, 4:36 PM
Is this happening every time or a flake?
g

Grigoriy Sterin

04/19/2022, 4:37 PM
Every time
j

johann

04/19/2022, 4:43 PM
cc @alex in case he has any ideas for a hanging multi process executor
@Dagster Bot issue Multiprocess executor hanging on ECS
d

Dagster Bot

04/19/2022, 4:44 PM
a

alex

04/19/2022, 4:47 PM
does terminating / canceling the run work? how did you define your graph/job?
g

Grigoriy Sterin

04/19/2022, 4:49 PM
terminating a run works
Job is defined from an AssetGroup:
Copy code
novogene_data_ingestion_job = AssetGroup(
    assets=[new_novogene_data_folders_in_aggregation_bucket, new_novogene_submission_sets],
    source_assets=[new_folders_in_novogene_landing_bucket],
    resource_defs={
        "s3_aggregation_bucket_io_manager": s3_aggregation_bucket_io_manager,
        "sqs_source_asset_io_manager": sqs_source_asset_io_manager,
        "s3_novogene_submission_sets_io_manager": s3_novogene_submission_sets_io_manager,
    }
).build_job("data_ingestion_assets")
a

alex

04/19/2022, 4:55 PM
hmmm, could you send a debug file? theres a download debug file button on the run page
g

Grigoriy Sterin

04/19/2022, 4:55 PM
Sure. Where can I find it?
ok
Hi! Is there any update on this by any chance? Thank you
a

alex

04/21/2022, 6:05 PM
I tried reproing & looking at what you sent for a short while but was unable to find any leads
can you execute the same job locally and if yes do you get different behavior there?
g

Grigoriy Sterin

04/21/2022, 6:15 PM
I'll give it a try
a

alex

04/21/2022, 6:16 PM
a debug file of a canceled/terminated run may be useful too - hoping it may include information about where it was stuck
g

Grigoriy Sterin

04/21/2022, 6:21 PM
I was able to reproduce the bug locally
a

alex

04/21/2022, 6:50 PM
ok thats good to know so ECS probably not a factor, that leaves assets as the best lead since thats a newer feature @owen
g

Grigoriy Sterin

04/21/2022, 6:51 PM
One thing I noticed is that when this happens, the job is forever stuck in the
Started
state rather than
Running
. Could it be some race condition or some clause in the Dagster code like
if (exception == true and jobStatus == "Running"): terminateJob()
a

alex

04/21/2022, 6:53 PM
i dont believe we have a “running” state, so “started” is our equivalent to that since you can repro locally, is the process still running? can you
py-spy dump -p <pid>
to see where its stuck?
there are events in the log that communicate what the pid is for the process driving the run
g

Grigoriy Sterin

04/21/2022, 6:53 PM
ok, gimme a sec
I have several python process related to dagster. Which pid should I use?
Copy code
1127 pts/2    Sl+    0:13 /usr/bin/python3 /home/gsterin/.local/bin/dagit -f data_ingestion_assets/repo.py
 1129 pts/2    Sl+    0:04 /usr/bin/python3 -m dagster api grpc --lazy-load-user-code --socket /tmp/tmpzam80aki --heartbeat --heartbeat-timeout 45 --fixed-server-id 01eb010f-1058-41eb-b6a3-8a7a0ddb06da --log-level WARNING --use-python-environment-entry-point -f data_ingestion_assets/repo.py -d /mnt/c/projects/dagster-data-ingestion-pipelines-repo/pipelines
 2288 pts/2    S+     0:00 /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(13)
 2289 pts/2    Sl+    0:01 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=14, pipe_handle=16) --multiprocessing-fork
 2351 pts/2    Sl+    0:01 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=14, pipe_handle=17) --multiprocessing-fork
a

alex

04/21/2022, 6:56 PM
if you look at the dagit logs you should see something like
Executing steps using multiprocess executor: parent process (pid: 1)
at the top
g

Grigoriy Sterin

04/21/2022, 6:59 PM
Copy code
Process 2289: /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=14, pipe_handle=16) --multiprocessing-fork
Python v3.8.10 (/usr/bin/python3.8)

Thread 2289 (idle): "MainThread"
    poll (multiprocessing/popen_fork.py:27)
    wait (multiprocessing/popen_fork.py:47)
    join (multiprocessing/process.py:149)
    execute_child_process_command (dagster/core/executor/child_process_executor.py:165)
    execute_step_out_of_process (dagster/core/executor/multiprocess.py:323)
    execute (dagster/core/executor/multiprocess.py:210)
    pipeline_execution_iterator (dagster/core/execution/api.py:785)
    __iter__ (dagster/core/execution/api.py:879)
    core_execute_run (dagster/grpc/impl.py:91)
    _run_in_subprocess (dagster/grpc/impl.py:168)
    start_run_in_subprocess (dagster/grpc/impl.py:192)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:315)
    _main (multiprocessing/spawn.py:129)
    spawn_main (multiprocessing/spawn.py:116)
    <module> (<string>:1)
Thread 2294 (idle): "kill-on-event"
    wait (multiprocessing/synchronize.py:261)
    wait (multiprocessing/synchronize.py:349)
    _kill_on_event (dagster/utils/__init__.py:357)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 2295 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
a

alex

04/21/2022, 7:01 PM
hmm, and is the process for the step still running? pid is printed in message like
Executing step new_novogene_submission_sets in subprocess
g

Grigoriy Sterin

04/21/2022, 7:22 PM
Ok, looks like it was my fault. Process for the step was stuck trying to execute destructor for the custom io manager:
Copy code
Process 2351: /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=14, pipe_handle=17) --multiprocessing-fork
Python v3.8.10 (/usr/bin/python3.8)

Thread 0x7FD090C9F740 (active): "MainThread"
    _wait_for_tstate_lock (threading.py:1027)
    join (threading.py:1011)
    close (python_dynamodb_lock/python_dynamodb_lock.py:688)
    __del__ (data_ingestion_assets/io_managers/s3_aggregation_bucket_io_manager.py:36)
Thread 2357 (idle): "kill-on-event"
    wait (multiprocessing/synchronize.py:261)
    wait (multiprocessing/synchronize.py:349)
    _kill_on_event (dagster/utils/__init__.py:357)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 0x7FD086E21700 (active): "DynamoDBLockClient-HS-0dd9417a179b4812bb474c504ee3e360"
    _send_heartbeat_loop (python_dynamodb_lock/python_dynamodb_lock.py:167)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 0x7FD086620700 (active): "DynamoDBLockClient-HC-0dd9417a179b4812bb474c504ee3e360"
    _check_heartbeat_loop (python_dynamodb_lock/python_dynamodb_lock.py:284)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
D 1
After commenting out the
data_ingestion_assets/io_managers/s3_aggregation_bucket_io_manager.py:36
line, the error doesn't reproduce anymore. Thank you for help with debugging this
a

alex

04/21/2022, 7:23 PM
well at least we figured it out! and now you know how to debug similar problems if they ever arise not sure what we could do to help catch problems like this
destructor dead lock is nasty nasty business
💯 1
2 Views