Hey :wave: We have an op that sends a POST request...
# dagster-plus
s
Hey 👋 We have an op that sends a POST request to Google Cloud Run and waits for the response. We are seeing odd behavior in Dagster Cloud where the request call hangs even if the Cloud Run instance returns a valid response. It looks like the hangs occur only if Cloud Run execution takes longer than 10 minutes. As a workaround we have configured a 25-minute timeout for the POST request to ensure the op doesn't hang indefinitely. When running Dagit locally the op returns normally and doesn't hang. Got an idea why we're seeing this kind of behavior? What are we missing here?
j
Hi Simo - is it possible that the timeout is on the cloud run side? https://cloud.google.com/functions/docs/configuring/timeout Especially if it’s only happening on the long-running ones.
"When running Dagit locally the op returns normally and doesn’t hang."
Is this true for the long-running ones as well?
s
Hi Jordan! Nope, the Cloud Run service is configured to time out after 1 hour and it is returning 200 status codes as expected. Everything works when run on local Dagit, but in Dagster Cloud the op hangs.
a
Are you using Dagster Cloud Serverless, or do you run your own workloads?
j
I suspect we might be running into this: https://aws.amazon.com/blogs/networking-and-content-delivery/implementing-long-running-tcp-connections-within-vpc-networking/
Many network appliances define idle connection timeout to terminate connections after an inactivity period. For example, appliances like NAT Gateway, Amazon Virtual Private Cloud (Amazon VPC) Endpoints, and Network Load Balancer (NLB) currently have a fixed idle timeout of 350 seconds. Packets sent after the idle timeout expired aren’t delivered to the destination.
Is there a way to trigger these functions asynchronously and then poll for completion instead of leaving the connection open?
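A minimal sketch of the trigger-then-poll idea, assuming the remote job exposes some way to ask "are you done yet?". The `poll_until_done` helper and the `check` callable are hypothetical names used for illustration; nothing here is a Dagster or Cloud Run API. Each poll is a short request, so no single connection sits idle long enough to hit an idle timeout.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until_done(
    check: Callable[[], Optional[T]],
    interval_s: float = 30.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `check` repeatedly until it returns a non-None result.

    `check` is a hypothetical callable that makes one short status request
    and returns None while the job is still running.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = check()
        if result is not None:
            return result
        # Give up if the next sleep would push us past the deadline.
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"job did not finish within {timeout_s} seconds")
        sleep(interval_s)
```

In an op, `check` would wrap whatever status call the service offers, for example a short GET against a status endpoint, if one exists.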
s
@Andrea Giardini Forgot to mention that: We're using serverless.
@jordan Thanks for the link! This sounds plausible: There's a NAT gateway between Dagster Cloud and the internet and the gateway has a timeout of 350 seconds for idle connections. I think this should be documented in the Dagster Cloud Serverless docs. I'll check if we can utilize the TCP keepalive mechanism to keep the connection active.
@jordan TCP keepalive did the trick! We used HTTPConnection.default_socket_options to apply the fix:
import socket
from urllib3.connection import HTTPConnection


# Enable TCP Keep Alive to prevent Dagster Cloud NAT Gateway from closing long-running idle connections
# https://urllib3.readthedocs.io/en/stable/reference/urllib3.connection.html
HTTPConnection.default_socket_options += [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),  # Enable TCP keepalive
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300),  # First probe after 300 seconds
    (
        socket.IPPROTO_TCP,
        socket.TCP_KEEPINTVL,
        120,
    ),  # Probe every 120 seconds after initial probe
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 10),  # Max 10 probes
]
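To sanity-check settings like the ones above without making a real request, the same options can be applied to a plain socket and read back. This is a sketch using only the standard library; `keepalive_options` and `apply_and_read_back` are hypothetical helper names, and the `TCP_KEEP*` constants are not defined on every platform, so the sketch skips any that are missing. The 300-second idle value keeps the first probe under the 350-second NAT gateway timeout mentioned earlier.

```python
import socket


def keepalive_options(idle_s=300, interval_s=120, max_probes=10):
    """Build TCP keepalive socket options, skipping constants this platform lacks."""
    options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # seconds of idle time before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between subsequent probes
        ("TCP_KEEPCNT", max_probes),    # unanswered probes before the connection is dropped
    ):
        const = getattr(socket, name, None)
        if const is not None:
            options.append((socket.IPPROTO_TCP, const, value))
    return options


def apply_and_read_back(options):
    """Apply the options to an unconnected TCP socket and return what the OS reports."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        for level, optname, value in options:
            sock.setsockopt(level, optname, value)
        return {
            (level, optname): sock.getsockopt(level, optname)
            for level, optname, _ in options
        }


if __name__ == "__main__":
    for key, value in apply_and_read_back(keepalive_options()).items():
        print(key, value)
```

The list returned by `keepalive_options()` has the same shape as the entries appended to `HTTPConnection.default_socket_options` above, so it could be used in their place.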