# ask-community
a
Hi team, we are running dagster 0.12.12 and have been facing complete downtime on dagit since this morning. This was initially caused by the liveness probe, which we then removed manually. Now the dagit pods are in a healthy state, but dagit is still not up. From our DB monitoring tool we see the frequent query below performing badly. Any thoughts on what else we could check?
Copy code
SELECT job_ticks.id, job_ticks.tick_body 
  FROM job_ticks 
 WHERE job_ticks.job_origin_id = $1 
 ORDER BY job_ticks.id DESC LIMIT $2
We are planning to upgrade to the latest version tomorrow. However, we would like to resolve the downtime right away.
cc @daniel, sorry for the tag. Any thoughts?
d
Hi arun - I'm confident this particular slow query will go away when you upgrade, since that query has been made more efficient. Seeing if there's any workaround we can provide...
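In the meantime, one possible stopgap (purely a sketch, assuming a Postgres backend; the index name and the $PG_CONN connection string below are placeholders, and it's worth confirming an equivalent index doesn't already exist) would be an index that matches the query's filter and sort:
Copy code
# CREATE INDEX CONCURRENTLY can't run inside a transaction, so issue it as a single statement;
# it builds the index without blocking writes to job_ticks
psql "$PG_CONN" -c \
  "CREATE INDEX CONCURRENTLY IF NOT EXISTS job_ticks_origin_id_idx ON job_ticks (job_origin_id, id DESC)"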
are you sure it's that slow query that is bringing down dagit?
a
I am actually not sure. For me the home page itself is not up, and I don't see any GraphQL queries being triggered from the browser
d
Are there any other logs in the dagit pod that might give some clues about why it's down?
a
Ah yes, we see this. Might be some issue with our internal routing?
Copy code
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/gevent/pywsgi.py", line 999, in handle_one_response
    self.run_application()
  File "/usr/local/lib/python3.7/site-packages/geventwebsocket/handler.py", line 87, in run_application
    return super(WebSocketHandler, self).run_application()
  File "/usr/local/lib/python3.7/site-packages/gevent/pywsgi.py", line 946, in run_application
    self.process_result()
  File "/usr/local/lib/python3.7/site-packages/gevent/pywsgi.py", line 932, in process_result
    self.write(data)
  File "/usr/local/lib/python3.7/site-packages/gevent/pywsgi.py", line 779, in write
    self._write_with_headers(data)
  File "/usr/local/lib/python3.7/site-packages/gevent/pywsgi.py", line 800, in _write_with_headers
    self._write(data)
  File "/usr/local/lib/python3.7/site-packages/gevent/pywsgi.py", line 762, in _write
    self._sendall(data)
  File "/usr/local/lib/python3.7/site-packages/gevent/pywsgi.py", line 736, in _sendall
    self.socket.sendall(data)
  File "/usr/local/lib/python3.7/site-packages/gevent/_socketcommon.py", line 699, in sendall
    return _sendall(self, data_memory, flags)
  File "/usr/local/lib/python3.7/site-packages/gevent/_socketcommon.py", line 409, in _sendall
    timeleft = __send_chunk(socket, chunk, flags, timeleft, end)
  File "/usr/local/lib/python3.7/site-packages/gevent/_socketcommon.py", line 338, in __send_chunk
    data_sent += socket.send(chunk, flags)
  File "/usr/local/lib/python3.7/site-packages/gevent/_socketcommon.py", line 729, in send
    return self._sock.send(data, flags)
TimeoutError: [Errno 110] Connection timed out
2022-04-21T19:09:47Z {'REMOTE_ADDR': '172.31.196.63', 'REMOTE_PORT': '49404', 'HTTP_HOST': 'dagit.doordash.team', (hidden keys: 48)} failed with TimeoutError
d
That's firing anytime anybody tries to load a page?
If it's not happening consistently, it's not necessarily the root cause of every page load failing
did anything in particular change around the time that it started failing?
a
I don't think so. It looks like the dagit web server itself is not healthy; a curl command to the server doesn't work from the dagit pod, even though I see the message below.
Copy code
Welcome to Dagster!

  If you have any questions or would like to engage with the Dagster team, please join us on Slack
  (https://bit.ly/39dvSsF).

Serving on http://0.0.0.0:80 in process 1
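To narrow it down, this is roughly what I can run from inside the pod (a sketch; it assumes curl and ss are available in the image) to tell a hard hang apart from a refused connection:
Copy code
# does the server respond at all within a few seconds, or just hang?
curl -v --max-time 5 http://localhost:80/
# is anything actually listening on port 80? (ss is part of iproute2 and may need installing)
ss -tlnp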
Is there any place where I can check the server logs?
d
And no network config changes happened on your side around then? It would be unusual for it to start rejecting all requests without some kind of change to the environment
a
Irrespective of the network changes, I should be able to curl the server from the dagit pod, right?
d
I would think so - but do you know if that was working yesterday?
I can see if it works from our dagit pod
what command are you running?
a
curl localhost:80
and
curl localhost:80/graphql
d
Yeah, when I run
curl localhost:80/graphql
in our dagit pod, I get
No GraphQL query found in the request
what exactly do you get?
(and just to confirm, this is by running
kubectl exec --stdin --tty <dagit pod> -- /bin/bash
and then installing curl since it wasn't installed)
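Roughly the full sequence, in case it helps (the label selector, package manager, and base image are assumptions; adjust them to your chart values):
Copy code
# find the dagit pod; the label selector may differ depending on your Helm values
POD=$(kubectl get pods -l component=dagit -o jsonpath='{.items[0].metadata.name}')
# open a shell inside it
kubectl exec --stdin --tty "$POD" -- /bin/bash
# inside the pod, assuming a Debian-based image where curl isn't preinstalled:
apt-get update && apt-get install -y curl
curl localhost:80/graphql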
If you kill the dagit pod and let it come back up, does it ever accept requests before getting stuck?
a
For me, it just hangs. Not seeing any response.
If you kill the dagit pod and let it come back up, does it ever accept requests before getting stuck?
Nope, I tried it. It never accepts any requests
d
And we have no leads on anything that might have changed around the time that it stopped working?
Going from accepting all requests to accepting no requests points strongly to some kind of network configuration change I think
Any chance you're able to pip install py-spy and run it on that machine?
Copy code
py-spy dump --pid 1
curl localhost:80/dagit_info
is a way to hit an endpoint that has absolutely no DB dependencies; it should just return a JSON dict
For comparison, this is what I see when I run py-spy dump --pid 1 on our dagit pod:
Copy code
root@user-deployment-dagster-dagit-86cb585b75-hc4pp:/# py-spy dump --pid 1
Process 1: /usr/local/bin/python /usr/local/bin/dagit -h 0.0.0.0 -p 80 -w /dagster-workspace/workspace.yaml
Python v3.8.7 (/usr/local/bin/python3.8)

Thread 1 (idle): "MainThread"
    run (asyncio/runners.py:44)
    run (uvicorn/server.py:60)
    run (uvicorn/main.py:463)
    host_dagit_ui_with_workspace_process_context (dagit/cli.py:148)
    dagit (dagit/cli.py:115)
    invoke (click/core.py:760)
    invoke (click/core.py:1404)
    main (click/core.py:1055)
    __call__ (click/core.py:1130)
    main (dagit/cli.py:162)
    <module> (dagit:33)
Thread 15 (idle): "grpc-server-watch"
    wait (threading.py:306)
    wait (threading.py:558)
    watch_for_changes (dagster/grpc/server_watcher.py:92)
    watch_grpc_server_thread (dagster/grpc/server_watcher.py:122)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 23 (idle): "grpc-server-watch"
    wait (threading.py:306)
    wait (threading.py:558)
    watch_for_changes (dagster/grpc/server_watcher.py:92)
    watch_grpc_server_thread (dagster/grpc/server_watcher.py:122)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 31 (idle): "grpc-server-watch"
    wait (threading.py:306)
    wait (threading.py:558)
    watch_for_changes (dagster/grpc/server_watcher.py:92)
    watch_grpc_server_thread (dagster/grpc/server_watcher.py:122)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 39 (idle): "telemetry-upload"
    wait (threading.py:306)
    wait (threading.py:558)
    upload_logs (dagster/core/telemetry_upload.py:85)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 378 (idle): "AnyIO worker thread"
    wait (threading.py:302)
    get (queue.py:170)
    run (anyio/_backends/_asyncio.py:744)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
root@user-deployment-dagster-dagit-86cb585b75-hc4pp:/#
I'm unreachable for the next hour or so, but tapping @alex or @johann in case any other questions come up here about the downtime
a
Sorry, was in a couple of meetings
Trying to run py-spy
No luck running py-spy.
Copy code
root@dagster-dagit-767fbb6bc5-hqkrj:/# py-spy dump --pid 1
Error: Operation not permitted (os error 1)
a
Yeah, you have to have certain security permissions enabled in k8s to be able to run that, same as gdb
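For reference, one way to grant that is to add the SYS_PTRACE capability to the container's securityContext. This is only a sketch: it assumes the dagit container is the first container in the deployment and that your cluster policy allows added capabilities, and note that patching triggers a rollout, so an already-wedged pod gets replaced rather than inspected:
Copy code
kubectl patch deployment dagster-dagit --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/securityContext",
   "value": {"capabilities": {"add": ["SYS_PTRACE"]}}}
]'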
a
does the dagit web server write any logs? Where can I find or tail them?
a
No written logs that I am aware of
a
Got it. Looks like dagit is back up again. I had to restart the entire deployment rather than individual pods
d
Huh, so a helm uninstall and then a helm upgrade fixed it?
a
I think
kubectl rollout restart deployment/dagster-dagit
just worked. Previously I was killing individual dagit pods
d
And now when you ssh into the pods, curl works again?
a
Yes, it does 😓
d
I don't think I've seen that before, very strange...
unlucky that it happened the day before the upgrade 😕
a
Yeah, hopefully the upgrade brings our deployment to a better state. Thanks a lot for your help!