# ask-community
a
Hi team, we are running dagster 0.12.12 and have been facing complete downtime on dagit since this morning. This was initially caused by the liveness probe, which we then removed manually. Now the dagit pods are in a healthy state, but dagit is still not up. From our DB monitoring tool we see the frequent query below performing badly. Any thoughts on what else we could check?
Copy code
SELECT job_ticks.id, job_ticks.tick_body 
  FROM job_ticks 
 WHERE job_ticks.job_origin_id = $1 
 ORDER BY job_ticks.id DESC LIMIT $2
We are planning to upgrade to the latest version tomorrow. However, we would like to resolve the downtime right away.
cc @daniel, sorry for the tag. Any thoughts?
d
Hi arun - I'm confident this particular slow query will go away when you upgrade, since that query has been made more efficient. Seeing if there's any workaround we can provide...
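In the meantime, one possible stopgap (purely a sketch, assuming a Postgres backend; the index name and the $PG_CONN connection string below are placeholders, and it's worth confirming an equivalent index doesn't already exist) would be an index that matches the query's filter and sort:
Copy code
# CREATE INDEX CONCURRENTLY can't run inside a transaction, so issue it as a single statement;
# it builds the index without blocking writes to job_ticks
psql "$PG_CONN" -c \
  "CREATE INDEX CONCURRENTLY IF NOT EXISTS job_ticks_origin_id_idx ON job_ticks (job_origin_id, id DESC)"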
are you sure it's that slow query that is bringing down dagit?
a
I am actually not sure. For me the home page itself is not up, and I don't see any GraphQL queries being triggered from the browser
d
Are there any other logs in the dagit pod that might give some clues about why it's down?
a
Ah yes, we see this. Might be some issue with our internal routing?
Copy code
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/gevent/pywsgi.py", line 999, in handle_one_response
    self.run_application()
  File "/usr/local/lib/python3.7/site-packages/geventwebsocket/handler.py", line 87, in run_application
    return super(WebSocketHandler, self).run_application()
  File "/usr/local/lib/python3.7/site-packages/gevent/pywsgi.py", line 946, in run_application
    self.process_result()
  File "/usr/local/lib/python3.7/site-packages/gevent/pywsgi.py", line 932, in process_result
    self.write(data)
  File "/usr/local/lib/python3.7/site-packages/gevent/pywsgi.py", line 779, in write
    self._write_with_headers(data)
  File "/usr/local/lib/python3.7/site-packages/gevent/pywsgi.py", line 800, in _write_with_headers
    self._write(data)
  File "/usr/local/lib/python3.7/site-packages/gevent/pywsgi.py", line 762, in _write
    self._sendall(data)
  File "/usr/local/lib/python3.7/site-packages/gevent/pywsgi.py", line 736, in _sendall
    self.socket.sendall(data)
  File "/usr/local/lib/python3.7/site-packages/gevent/_socketcommon.py", line 699, in sendall
    return _sendall(self, data_memory, flags)
  File "/usr/local/lib/python3.7/site-packages/gevent/_socketcommon.py", line 409, in _sendall
    timeleft = __send_chunk(socket, chunk, flags, timeleft, end)
  File "/usr/local/lib/python3.7/site-packages/gevent/_socketcommon.py", line 338, in __send_chunk
    data_sent += socket.send(chunk, flags)
  File "/usr/local/lib/python3.7/site-packages/gevent/_socketcommon.py", line 729, in send
    return self._sock.send(data, flags)
TimeoutError: [Errno 110] Connection timed out
2022-04-21T19:09:47Z {'REMOTE_ADDR': '172.31.196.63', 'REMOTE_PORT': '49404', 'HTTP_HOST': 'dagit.doordash.team', (hidden keys: 48)} failed with TimeoutError
d
That's firing anytime anybody tries to load a page?
If it's not happening consistently, it's not necessarily the root cause of every page load failing
did anything in particular change around the time that it started failing?
a
I don't think so. It looks like the dagit web server itself is not healthy; a curl command to the server doesn't work from the dagit pod, even though I see the message below.
Copy code
Welcome to Dagster!

  If you have any questions or would like to engage with the Dagster team, please join us on Slack
  (https://bit.ly/39dvSsF).

Serving on http://0.0.0.0:80 in process 1
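To narrow it down, this is roughly what I can run from inside the pod (a sketch; it assumes curl and ss are available in the image) to tell a hard hang apart from a refused connection:
Copy code
# does the server respond at all within a few seconds, or just hang?
curl -v --max-time 5 http://localhost:80/
# is anything actually listening on port 80? (ss is part of iproute2 and may need installing)
ss -tlnp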
Is there any place where I can check the server logs?
d
And no network config changes happened on your side around then? It would be unusual for it to start rejecting all requests without some kind of change to the environment
a
Irrespective of the network changes, I should be able to curl the server from the dagit pod, right?
d
I would think so - but do you know if that was working yesterday?
I can see if it works from our dagit pod
what command are you running?
a
curl localhost:80
and
curl localhost:80/graphql
d
Yeah, when I run
curl localhost:80/graphql
in our dagit pod, I get
No GraphQL query found in the request
what exactly do you get?
(and just to confirm, this is by running
kubectl exec --stdin --tty <dagit pod> -- /bin/bash
and then installing curl since it wasn't installed)
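Roughly the full sequence, in case it helps (the label selector, package manager, and base image are assumptions; adjust them to your chart values):
Copy code
# find the dagit pod; the label selector may differ depending on your Helm values
POD=$(kubectl get pods -l component=dagit -o jsonpath='{.items[0].metadata.name}')
# open a shell inside it
kubectl exec --stdin --tty "$POD" -- /bin/bash
# inside the pod, assuming a Debian-based image where curl isn't preinstalled:
apt-get update && apt-get install -y curl
curl localhost:80/graphql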
If you kill the dagit pod and let it come back up, does it ever accept requests before getting stuck?
a
For me, it just hangs. Not seeing any response.
If you kill the dagit pod and let it come back up, does it ever accept requests before getting stuck?
Nope, I tried it. It never accepts any requests
d
And we have no leads on anything that might have changed around the time that it stopped working?
Going from accepting all requests to accepting no requests points strongly to some kind of network configuration change I think
Any chance you're able to pip install py-spy and run it on that machine?
Copy code
py-spy dump --pid 1
curl localhost:80/dagit_info
is a way to hit an endpoint that has absolutely no DB dependencies; it should just return a JSON dict
For comparison, this is what I see when I run py-spy dump --pid 1 on our dagit pod:
Copy code
root@user-deployment-dagster-dagit-86cb585b75-hc4pp:/# py-spy dump --pid 1
Process 1: /usr/local/bin/python /usr/local/bin/dagit -h 0.0.0.0 -p 80 -w /dagster-workspace/workspace.yaml
Python v3.8.7 (/usr/local/bin/python3.8)

Thread 1 (idle): "MainThread"
    run (asyncio/runners.py:44)
    run (uvicorn/server.py:60)
    run (uvicorn/main.py:463)
    host_dagit_ui_with_workspace_process_context (dagit/cli.py:148)
    dagit (dagit/cli.py:115)
    invoke (click/core.py:760)
    invoke (click/core.py:1404)
    main (click/core.py:1055)
    __call__ (click/core.py:1130)
    main (dagit/cli.py:162)
    <module> (dagit:33)
Thread 15 (idle): "grpc-server-watch"
    wait (threading.py:306)
    wait (threading.py:558)
    watch_for_changes (dagster/grpc/server_watcher.py:92)
    watch_grpc_server_thread (dagster/grpc/server_watcher.py:122)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 23 (idle): "grpc-server-watch"
    wait (threading.py:306)
    wait (threading.py:558)
    watch_for_changes (dagster/grpc/server_watcher.py:92)
    watch_grpc_server_thread (dagster/grpc/server_watcher.py:122)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 31 (idle): "grpc-server-watch"
    wait (threading.py:306)
    wait (threading.py:558)
    watch_for_changes (dagster/grpc/server_watcher.py:92)
    watch_grpc_server_thread (dagster/grpc/server_watcher.py:122)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 39 (idle): "telemetry-upload"
    wait (threading.py:306)
    wait (threading.py:558)
    upload_logs (dagster/core/telemetry_upload.py:85)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 378 (idle): "AnyIO worker thread"
    wait (threading.py:302)
    get (queue.py:170)
    run (anyio/_backends/_asyncio.py:744)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
root@user-deployment-dagster-dagit-86cb585b75-hc4pp:/#
I'm unreachable for the next hour or so, but tapping @alex or @johann in case any other questions come up here about the downtime
a
Sorry, was in a couple of meetings
Trying to run py-spy
No luck running py-spy.
Copy code
root@dagster-dagit-767fbb6bc5-hqkrj:/# py-spy dump --pid 1
Error: Operation not permitted (os error 1)
a
Yeah, you have to have certain security permissions enabled in k8s to be able to run that, same as gdb
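For reference, one way to grant that is to add the SYS_PTRACE capability to the container's securityContext. This is only a sketch: it assumes the dagit container is the first container in the deployment and that your cluster policy allows added capabilities, and note that patching triggers a rollout, so an already-wedged pod gets replaced rather than inspected:
Copy code
kubectl patch deployment dagster-dagit --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/securityContext",
   "value": {"capabilities": {"add": ["SYS_PTRACE"]}}}
]'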
a
does the dagit web server write any logs? Where can I find or tail them?
a
No written logs that I am aware of
a
Got it. Looks like dagit is back up again. I had to restart the entire deployment rather than individual pods
d
Huh, so a helm uninstall and then a helm upgrade fixed it?
a
I think
kubectl rollout restart deployment/dagster-dagit
just worked. Previously I was killing individual dagit pods
d
And now when you ssh into the pods, curl works again?
a
Yes, it does 😓
d
I don't think I've seen that before, very strange...
unlucky that it happened the day before the upgrade 😕
a
Yeah, hopefully the upgrade brings our deployment to a better state. Thanks a lot for your help!