# ask-community

Stefan Adelbert

04/11/2022, 1:05 AM
Job doesn't terminate I've noticed recently that some jobs don't terminate and instead get stuck in the "started" state. The logs show that the various failure hooks have completed, but nothing more (see first screenshot). This is in the case where there have been errors in the job run. I have to manually terminate the job. In the case where everything runs well, the hooks are skipped and the job terminates as expected (see second screenshot). It's not clear to me whether this situation occurs only when there are errors in a job, but it could be the case. I do suspect the failure hooks I have written. Could anyone give me advice on debugging this scenario to work out where the execution is blocking?

sandy

04/11/2022, 3:26 PM
@alex - any tips here?

alex

04/11/2022, 3:38 PM
Is this running or reproducing locally? Is the process in question still running? The event logs include the `pid` of the process where the run is happening; if it's still active you could use something like `py-spy dump` to see where it's stuck.
The "hook completed" message is fired from the place where we execute hooks with guards, so we can fire a "hook errored" in the case an exception occurs. I'm not sure what the hooks could be doing to cause execution to stall. It's also worth checking the stdout/stderr contents.
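The guard pattern alex describes can be sketched roughly as follows. This is a minimal illustration, not Dagster's actual internals; the function and message names are hypothetical:

```python
def execute_hook_with_guard(hook_fn, context, log):
    """Run a user-defined hook, guarding against exceptions so a
    failing hook is reported rather than crashing the run."""
    try:
        hook_fn(context)
        log("hook completed")
    except Exception as exc:
        log(f"hook errored: {exc}")
```

The important implication for this thread: "hook completed" only means the hook function returned. Any threads or child processes that a hook or resource started can still keep the run process alive afterwards.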

Stefan Adelbert

04/12/2022, 2:52 AM
@alex @sandy Thanks guys. Next time I see this I'll check whether the process is still running. I'll also check stdout/stderr for any clues. If I figure out a way to reliably replicate I'll let you know, of course.
@alex I've just spotted this scenario again in production. I can see that the process is still running, including child processes relating to resources. One of the resources spawns Firefox and the geckodriver, which should all be closed when the resource is destroyed. I'm guessing that either the resource isn't behaving properly or something else is blocking before the resource would normally be destroyed.
@alex Righto... I just killed one of those child processes and the job wrapped itself up successfully. So this fairly clearly indicates to me that the resource itself is blocking on destruction. I'll have to look more closely into this...

alex

04/14/2022, 2:34 PM
sounds like it could be a missing `timeout` arg where you are trying to join/wait for the subprocess
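Alex's suggestion can be illustrated with a standard-library sketch, assuming the resource manages its child with `subprocess` (adapt to whatever process API the resource actually uses):

```python
import subprocess

def shutdown_child(proc: subprocess.Popen, timeout: float = 10.0) -> int:
    """Wait for a child process without blocking forever: on timeout,
    escalate to terminate(), then kill(). Returns the exit code."""
    try:
        # Without timeout=, wait() can hang indefinitely if the child
        # never exits -- which matches the stuck-run symptom above.
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.terminate()  # polite shutdown first (SIGTERM on POSIX)
        try:
            return proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()   # last resort (SIGKILL on POSIX)
            return proc.wait()
```

The same idea applies to `threading.Thread.join(timeout=...)` or any other blocking wait in a resource's teardown path.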

Stefan Adelbert

04/28/2022, 11:15 PM
I managed to catch the problem in production and use `py-spy` to get a stack trace!
```
Thread 0x7F46B4B43740 (active): "MainThread"
    readinto (socket.py:704)
    _read_status (http/client.py:281)
    begin (http/client.py:320)
    getresponse (http/client.py:1377)
    _make_request (urllib3/connectionpool.py:444)
    urlopen (urllib3/connectionpool.py:703)
    urlopen (urllib3/poolmanager.py:376)
    request_encode_url (urllib3/request.py:96)
    request (urllib3/request.py:74)
    _request (selenium/webdriver/remote/remote_connection.py:355)
    execute (selenium/webdriver/remote/remote_connection.py:333)
    execute (selenium/webdriver/remote/webdriver.py:423)
    quit (selenium/webdriver/remote/webdriver.py:950)
    quit (selenium/webdriver/firefox/webdriver.py:192)
    headless_firefox_driver (lassio_dagster/resources/selenium.py:48)
    __exit__ (contextlib.py:126)
    _gen_resource (dagster/core/execution/resources_init.py:423)
```
You can see the custom resource (which wraps a Selenium Firefox webdriver) being destroyed. The call seems to block on an HTTP call, so the problem is not with Dagster, that's for sure. @alex Thanks for the tip on `py-spy`.
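One defensive pattern for a teardown call that can block, like `driver.quit()` here, is to run it in a helper thread and give up after a deadline. This is a sketch, not from the thread; `kill_fallback` is a hypothetical cleanup callback (e.g. force-killing the browser and driver child processes):

```python
import threading

def quit_with_timeout(driver, timeout: float = 15.0, kill_fallback=None) -> bool:
    """Call driver.quit() without letting a hung HTTP call block
    resource teardown forever. Returns True if quit() finished in time."""
    t = threading.Thread(target=driver.quit, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        # quit() is stuck (e.g. geckodriver not responding to the
        # shutdown request); fall back to force-killing the children.
        if kill_fallback is not None:
            kill_fallback()
        return False
    return True
```

Since the trace shows the call blocking inside urllib3's socket read, the root fix may also be to configure an HTTP timeout on the driver's remote connection rather than (or in addition to) guarding the teardown.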
👍 1