Arun Kumar

05/26/2021, 7:45 PM
Hi team, I just deployed Dagster to our staging environment with a user code deployment that contains some dummy pipelines. I saw the following error
WebSocket connection to '<wss://dagit.doorcrawl.com/graphql>' failed:
in the dev console of the UI. I learned that our infra team does not currently support websocket connections. My question is: does the entire UI depend on websocket connections, or only the live view? Currently I am not able to see the pipelines in Dagit at all

alex

05/26/2021, 7:49 PM
which version? A few releases ago we switched from doing everything over websockets to using websockets only for subscriptions

Arun Kumar

05/26/2021, 7:50 PM
Currently on 0.11.7. I just found out this was changed in 0.11.9?

alex

05/26/2021, 7:52 PM
yep

Arun Kumar

05/26/2021, 7:54 PM
Which part of the UI still uses subscriptions? I just want to understand what features we will be missing without websocket support. Not sure if the Status view will still work?

alex

05/26/2021, 8:00 PM
the live updating pipeline run viewer is the main one - there are also subscriptions that do things like auto-update when the workspace changes, which won't work either

Arun Kumar

05/26/2021, 8:30 PM
Thanks Alex. Will upgrade the version and check again.

max

05/26/2021, 10:17 PM
out of curiosity, if you can share, would love to know what reverse proxy you run that won't allow or doesn't yet support websocket connections -- we are flying kind of blind with respect to the typical setups out there in the wild and would love to better understand the networking environments we typically get deployed into

Arun Kumar

05/26/2021, 11:36 PM
Hi @max , not sure about the network specifics; that is maintained by our infra team. Our traffic usually flows Cloudflare -> ALB -> Nginx. Not sure which part of the flow lacks support for websockets. I have asked for more information and will post here once I hear back
🙏 1
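For context on the kind of fix the infra team would need: in a Cloudflare -> ALB -> Nginx chain, the Nginx hop must explicitly forward the websocket upgrade handshake. A minimal sketch of the relevant Nginx directives, with a hypothetical upstream (names and ports are illustrative, not from this thread):

```nginx
upstream dagit {
    # Hypothetical in-cluster address for the dagit service
    server dagit.dagster.svc.cluster.local:80;
}

server {
    listen 443 ssl;
    server_name dagit.doorcrawl.com;

    location /graphql {
        proxy_pass http://dagit;
        # Required for the websocket upgrade handshake:
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        # Long-lived subscription connections need a generous read timeout:
        proxy_read_timeout 86400s;
    }
}
```

The ALB and Cloudflare hops generally pass websockets through by default, which is consistent with max's guess below that the problem is most likely at the ALB or Nginx layer.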
@alex I tried to use the GraphQL Playground to check if I can fetch any data. I am always getting an empty response and I think there is something wrong with my user deployment. Any leads on how I can debug this or which logs to monitor?
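A minimal query for this kind of sanity check, to see whether Dagit can load any repositories at all (this assumes the pre-0.12 GraphQL schema; field names may differ across versions):

```graphql
query RepositoriesQuery {
  repositoriesOrError {
    ... on RepositoryConnection {
      nodes {
        name
        pipelines {
          name
        }
      }
    }
    ... on PythonError {
      message
      stack
    }
  }
}
```

If this returns a `PythonError`, the error message usually points at the failing user code deployment; an empty node list suggests the workspace is not loading the gRPC server at all.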

max

05/26/2021, 11:47 PM
thanks - I could believe that it is either the ALB or Nginx, tbh

Arun Kumar

05/27/2021, 12:25 AM
Update: It worked after the version upgrade.
Looks like a lot of functionality is still missing without websocket support (pipeline runs from sensors are failing, I could not see the logs for the runs, etc.). Also not sure why I am not able to launch runs from the Playground. Does that also depend on websockets?

daniel

05/27/2021, 1:26 AM
Could you share more about the runs from sensors failing / what error you're seeing? I wouldn't expect that to depend on websockets, so there may be something else going on here as well

Arun Kumar

05/27/2021, 1:34 AM
This is what I see on the UI
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1622076248.059276431","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5419,"referenced_errors":[{"created":"@1622076248.059272932","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":397,"grpc_status":14}]}"
>
  File "/usr/local/lib/python3.7/site-packages/dagster/daemon/sensor.py", line 212, in execute_sensor_iteration
    sensor_debug_crash_flags,
  File "/usr/local/lib/python3.7/site-packages/dagster/daemon/sensor.py", line 240, in _evaluate_sensor
    job_state.job_specific_data.last_run_key if job_state.job_specific_data else None,
  File "/usr/local/lib/python3.7/site-packages/dagster/core/host_representation/repository_location.py", line 702, in get_external_sensor_execution_data
    last_run_key,
  File "/usr/local/lib/python3.7/site-packages/dagster/api/snapshot_sensor.py", line 42, in sync_get_external_sensor_execution_data_grpc
    last_run_key=last_run_key,
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 294, in external_sensor_execution
    sensor_execution_args
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 97, in _streaming_query
    yield from response_stream
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self

daniel

05/27/2021, 1:42 AM
got it - I'm confident that's not a websocket issue. That seems like the daemon is having trouble reaching the gRPC server to call the sensor function. Would you mind sharing your workspace.yaml file?
the other thing I would check is the Sensors tab on the Status page (<your dagit URL>/instance/sensors) - if you see anything on that tab that says "Unloadable Sensors", that could potentially explain an error like this and you should be able to fix it by turning off the bad sensor (I can elaborate if that turns out to be the problem)
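For reference, the Helm chart generates a workspace.yaml that points Dagit and the daemon at each user code deployment's gRPC server. A sketch of what that file typically looks like (the deployment name and port here are illustrative placeholders, not taken from this thread):

```yaml
load_from:
  - grpc_server:
      host: user-code-deployment-1
      port: 3030
      location_name: user-code-deployment-1
```

If the host or port here does not match the user code deployment's Kubernetes service, the daemon would see exactly the "failed to connect to all addresses" gRPC error shown above.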

Arun Kumar

05/27/2021, 1:49 AM
Hmmm, looks like I don't have a workspace.yaml file in the user code. Is it mandatory to add one while running via Helm?

daniel

05/27/2021, 1:50 AM
oh, if you're running via helm it should be automatically set up for you, right.
👍 1
Is anything jumping out from the Sensors tab?

Arun Kumar

05/27/2021, 1:52 AM
I don't see any issues there. Also, the evaluations are working perfectly fine.

daniel

05/27/2021, 1:53 AM
Hmm, I see. Is the place where you're seeing the errors under "Daemon statuses" on the Status page? Next to "Sensors"?
(Or somewhere else?)

Arun Kumar

05/27/2021, 1:58 AM
I saw these errors by clicking the failures on the sensor graph page

daniel

05/27/2021, 2:01 AM
Ah got it - so it looks like there was one connection failure, then a bunch of skips - Are the skips expected?

Arun Kumar

05/27/2021, 2:04 AM
Daniel, I think I found the reason in the daemon logs. Sorry for not figuring this out earlier 🤦
2021-05-27 00:29:19 - QueuedRunCoordinatorDaemon - INFO - Retrieved 1 queued runs, checking limits.
2021-05-27 00:29:19 - QueuedRunCoordinatorDaemon - ERROR - Caught an error for run 90930cf4-f89b-4024-a1ee-ef03b6dce54a while removing it from the queue. Marking the run as failed and dropping it from the queue: kubernetes.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'f2275518-a506-4125-90b0-e09db9d909e9', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Thu, 27 May 2021 00:29:19 GMT', 'Content-Length': '311'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch is forbidden: User \"system:serviceaccount:dagster:dagster\" cannot create resource \"jobs\" in API group \"batch\" in the namespace \"dagster\"","reason":"Forbidden","details":{"group":"batch","kind":"jobs"},"code":403}



Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/daemon/run_coordinator/queued_run_coordinator_daemon.py", line 154, in run_iteration
    self._dequeue_run(instance, run, workspace)
  File "/usr/local/lib/python3.7/site-packages/dagster/daemon/run_coordinator/queued_run_coordinator_daemon.py", line 239, in _dequeue_run
    instance.launch_run(run.run_id, external_pipeline)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/instance/__init__.py", line 1322, in launch_run
    self._run_launcher.launch_run(run, external_pipeline=external_pipeline)
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/launcher.py", line 268, in launch_run
    self._batch_api.create_namespaced_job(body=job, namespace=self.job_namespace)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api/batch_v1_api.py", line 66, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api/batch_v1_api.py", line 175, in create_namespaced_job_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 397, in request
    body=body)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 280, in POST
    body=body)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
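The 403 above means the `dagster` service account lacks RBAC permission to create Jobs in the `dagster` namespace, which the K8sRunLauncher needs. A sketch of the kind of Role and RoleBinding that would grant it, using the namespace and service account names taken from the error message (the Role name itself is a hypothetical placeholder):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dagster-run-launcher
  namespace: dagster
rules:
  # Allow the run launcher to create and manage batch Jobs:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dagster-run-launcher
  namespace: dagster
subjects:
  - kind: ServiceAccount
    name: dagster
    namespace: dagster
roleRef:
  kind: Role
  name: dagster-run-launcher
  apiGroup: rbac.authorization.k8s.io
```

In practice the Helm chart normally creates this RBAC for you, so a missing binding like this usually indicates the chart's service account settings were overridden or the release was installed into a restricted namespace.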

max

05/27/2021, 2:05 AM
we should probably not crash so hard on stuff like this

daniel

05/27/2021, 2:09 AM
got it, no problem! I think that's likely separate from the sensor error (if the sensor is working as expected after that one failure, I'm hoping it was a transient connection thing) - but it does explain why runs would be failing to launch!
Seems like we still have a couple of mysteries left (you mentioned missing logs and the Launch button not working - the launch button in particular may depend on a websocket connection; I'll need to check with some others on the team to confirm)

Arun Kumar

05/27/2021, 2:15 AM
Yes, you are right that the sensor error is different from this one and it is transient (seeing this from the logs). The run logs are actually not missing; the page loads indefinitely with a bunch of websocket errors.

daniel

05/27/2021, 2:17 AM
Got it - once we have the logs issue figured out, the ApiException you saw in the daemon logs should appear in the logs at the bottom there.

Arun Kumar

05/27/2021, 2:18 AM
Yeah, I am working with our infra team on websocket support and it might take some time. Just trying to figure things out in the meantime. Thanks a lot for the continuous support 🙏
@daniel Just want to check if you have any updates on the logs for the pipeline runs?

daniel

05/27/2021, 11:06 PM
We’re looking into adding functionality to dagit that fetches them over HTTP instead of websockets, but that will take a bit of time to sort out. Let us know if it’s a hard blocker for you all to have dagit working without websockets - that’ll be a useful input into the prioritization

Arun Kumar

05/27/2021, 11:25 PM
Thanks for the update Daniel. Does that also include the ability to launch pipeline execution from dagit?

daniel

05/27/2021, 11:32 PM
That feature also currently assumes a websocket connection is available, yeah

Arun Kumar

05/28/2021, 12:05 AM
I see. Both launching pipeline executions and viewing pipeline runs are quite important for us to be able to use dagit. We are indeed working with our infra team on websocket support, but it might take some time to get there.
UPDATE (in case this helps your prioritization): Our infra team has agreed to start work on websocket support. We might have some working support within the next week.
@daniel Until then, where can I view the pipeline run logs apart from Dagit?

daniel

06/01/2021, 11:49 PM
We do have a CLI command that might work if that's an option - "dagster debug export <RUN_ID> <FILE>" - but that's a much worse experience than dagit; you then have to unzip the file, etc.

Arun Kumar

06/02/2021, 12:02 AM
Where do I need to run the dagster command? I am running Dagster on Kubernetes using Helm. Are these logs persisted by default to any of the deployments / the DB?

daniel

06/02/2021, 12:08 AM
The logs are persisted to the DB - 'kubectl exec' with 'dagster debug export' against the dagit pod would be one way to run the command, then 'kubectl cp' to copy the gzipped file that the CLI command produces off of the pod. I think it depends on what your goal is - that setup is probably too onerous if you just want to quickly inspect the logs from a run in dagit.
if it's a hard blocker for you to be able to view logs and launch runs without websocket support, then we could potentially add that to dagit, but it would probably be sometime later this week - so if websocket support will be added around the same time, it might not be that big of a help
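The workaround daniel describes can be sketched as the following commands, run against a cluster (the pod name, namespace, and run id are placeholders to fill in from your own deployment):

```shell
# Run the debug export inside the dagit pod (<dagit-pod> and <RUN_ID> are placeholders):
kubectl exec -n dagster <dagit-pod> -- dagster debug export <RUN_ID> /tmp/run.gzip

# Copy the resulting gzipped debug file off the pod:
kubectl cp dagster/<dagit-pod>:/tmp/run.gzip ./run.gzip
```

The exported file contains the run's event log, which you can then unzip and inspect locally.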

Arun Kumar

06/02/2021, 12:59 AM
Thanks, I will give it a try. Our infra team is just starting to work on it, and it might take at least 2 more weeks to get the support. If the changes are small on your side and it can be available within this week, it would surely be of great help 🙂

cat

06/03/2021, 11:25 PM
Hey @Arun Kumar, wanted to let you know that alex and dish landed a fix in the latest release (0.11.12) that adds graceful fallback for launching runs and viewing event logs when websockets aren’t available. Please let us know if you run into any issues!

Arun Kumar

06/03/2021, 11:26 PM
This is great! Thanks for the support and I will try it out and update here 🙂
👍 2
Just wanted to share an update: tried upgrading to 0.11.12, and both run launches and run log views are working fine now. Thank you so much team 🙂
:condagster: 1
🎉 3

cat

06/04/2021, 10:50 PM
awesome!! great to hear :partyblob: