# dagster-support

Jean-Pierre M

05/20/2021, 1:32 PM
Hi. I have a pipeline running on K8S with the K8sRunLauncher and QueuedRunCoordinator. I'm launching runs via the GraphQL Python client. The idea is that a Python script loops over 1000 files and launches a run for each. This part seems to work. Once all the runs are submitted and the QueuedRunCoordinator is working through them, eventually the dagit pods on K8S crash and the dagit UI becomes unresponsive. The dagit pods restart themselves on K8S but continue to be unresponsive and never recover. My only way out of this crash loop is to kill all the pods and restart from scratch. Any thoughts about why this is happening? I currently have the dagster daemon pod, the dagit pod (with 2 replicas), the user deployment pod and the postgresql pod.
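For context, a minimal sketch of the kind of submission loop described above, assuming the `dagster-graphql` Python client; the host, pipeline name, mode, and run_config shape are placeholders, and exact method names may vary by dagster version:

```python
import glob

from dagster_graphql import DagsterGraphQLClient

# Placeholders: adjust host/port, pipeline name, mode, and run_config shape to
# match the real deployment and pipeline.
client = DagsterGraphQLClient("dagit.example.svc.cluster.local", port_number=80)

run_ids = []
for path in sorted(glob.glob("/data/incoming/*")):  # the ~1000 input files
    run_id = client.submit_pipeline_execution(
        pipeline_name="my_pipeline",
        mode="default",
        run_config={"solids": {"process_file": {"config": {"path": path}}}},
    )
    run_ids.append(run_id)
    print(f"submitted run {run_id} for {path}")
```

With the QueuedRunCoordinator, each call only enqueues the run; the daemon dequeues and launches them afterwards, which matches the behavior described in the rest of the thread.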

daniel

05/20/2021, 2:13 PM
Hi Jean-Pierre - are there any logs from the crashing dagit pod that you could share?

Jean-Pierre M

05/20/2021, 2:23 PM
I'll capture the logs next time it happens.

daniel

05/20/2021, 2:25 PM
Great, thanks! If the dagit pods stay unresponsive after restarting, it's possible that the issue is with the database rather than with dagit (maybe something unexpected is happening with 1000 inserts into the queue in quick succession). Just confirming: you're using postgres storage, I imagine?

Jean-Pierre M

05/20/2021, 2:26 PM
Yes, I'm using postgres.

alex

05/20/2021, 2:31 PM
Hm, are you certain the python script has completed its requests before you see the bad behavior? Dagit has not yet been optimized for high web request throughput, so my best guess is that it's the 1000 web requests that are locking it up and putting it into a bad state. Theoretically the only thing that should be causing `dagit` problems is web requests - are there active users of the tool during this time? The background activity of the webserver when idle shouldn't change as a function of activity (unless we are doing something we don't mean to).
It could also be what daniel is thinking: that dagit is locking up because its DB requests are taking a long time.

Jean-Pierre M

05/20/2021, 2:37 PM
The python script completes and successfully submits all 1000 runs. The problem happens later as dagster gets through the queue.
ack 1

alex

05/20/2021, 2:42 PM
> dagit pods restart themselves on K8S but continue to be unresponsive and never recover
Can you be more precise when you say unresponsive? Do the static resources load and the data never shows up? Do you just get nothing? Does the web request time out?
cc @johann - depending on when dagit is going unresponsive, it could be the user deployment pod that's blocking things if we're dequeuing runs continuously

Jean-Pierre M

05/20/2021, 3:58 PM
@alex When it happens, the dagit UI stops refreshing and the refresh timer freezes on "Refreshing data..." The dagit K8S pods then get into a restart loop. I can no longer submit runs with the GraphQL Python client or the dagit UI. The pods for any runs already in progress just keep running and never terminate.
So no matter what I do now, dagster is holding up. I even submitted 10K runs and it hasn't crashed like before. I'll keep monitoring though and report back if it happens again. However, I am noticing some K8S pods for dagster runs that fail, but the failure is never passed back to dagster. (Not sure if it's related to my previous issue.) In dagit, the run remains perpetually in "STARTING" and the only way to stop it is to manually terminate it. From the K8S pod logs, it looks like the pod had a hard time connecting to the postgres database, probably because of too much concurrent traffic. This is only happening for a handful of runs in a batch of 1000, so it's not a big deal at all, but it would be nice if the error was shared upstream to dagit. I've attached the logs of a pod where this happened.
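A possible client-side workaround for spotting runs stuck like this, sketched under the same assumptions as the earlier snippet and assuming the installed client version exposes `get_run_status`:

```python
from dagster_graphql import DagsterGraphQLClient

client = DagsterGraphQLClient("dagit.example.svc.cluster.local", port_number=80)

def runs_still_starting(run_ids):
    """Return the run IDs whose status is still reported as STARTING.

    run_ids is whatever the submission loop recorded; get_run_status returns a
    run-status enum, so the comparison is on the member name.
    """
    return [
        run_id
        for run_id in run_ids
        if client.get_run_status(run_id).name == "STARTING"
    ]
```

Runs flagged this way could then be terminated manually in dagit, as described above.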

daniel

05/20/2021, 4:09 PM
Absolutely - thanks so much for those logs, that is helpful. Getting better about monitoring runs that fail in this way is high on our radar right now. It's a little tricky because a lot of the ways we would typically monitor and alert on this involve access to the very database that is inaccessible in this example, but we have some ideas to get around that.
👍 1

Jean-Pierre M

05/20/2021, 5:43 PM
Looks like I spoke too soon. Dagit crashed with a batch of 1000 images. It's now looping through restarts on K8S (it's at 7 restarts and counting). Attached is the `describe pod` output for the dagit pod and the logs (note that the logs were empty, but using the `--previous` flag in K8S gave me something).

alex

05/20/2021, 5:50 PM
What is your setup for your postgres DB? Do you have any monitoring on it? How much CPU/memory does it have?

Jean-Pierre M

05/20/2021, 6:02 PM
postgres is installed in k8s using the bitnami/postgresql image. There is a resource request of cpu=1000m and memory=512Mi. Other than that, there is max connections=200 and persistence size=15Gi. There is no specific monitoring, but I monitored its usage via `kubectl top pod` and noticed it reaches cpu > 4000m and memory > 4000Mi as it gets through the runs.

alex

05/20/2021, 6:06 PM
and I assume you don't have any config set for the `queuedRunCoordinator` to limit the max simultaneous runs?

Jean-Pierre M

05/20/2021, 6:10 PM
I do, max_concurrent_runs=100

alex

05/20/2021, 6:12 PM
ah ok interesting
which executor are you using?
if multiprocess - that could push you over the connection limit, since each process makes its own DB connections. Also, as observed, you may need greater resource amounts to handle the write volume and still support reads from that many simultaneous runs.
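To make the connection-limit point concrete, a back-of-the-envelope check using the numbers mentioned earlier in the thread; the per-run connection count is a hypothetical illustration, not something measured from this deployment:

```python
# Numbers from this thread: max_concurrent_runs=100 on the QueuedRunCoordinator
# and max connections=200 on the bitnami postgres. connections_per_run is a
# hypothetical figure for illustration only.
max_concurrent_runs = 100
pg_max_connections = 200
connections_per_run = 2  # e.g. run worker plus event-log writes, per run

worst_case = max_concurrent_runs * connections_per_run
print(f"~{worst_case} connections needed vs max_connections={pg_max_connections}")
# A multiprocess executor would multiply this further, since each step
# subprocess opens its own connections.
```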

Jean-Pierre M

05/20/2021, 6:18 PM
I'm using the default executor with K8sRunLauncher

alex

05/20/2021, 6:19 PM
Gotcha. In any case I think some mix of fewer simultaneous runs or more DB resources should get you back into a working state. We will work on making dagit handle this state more gracefully.
👍 1
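For the "fewer simultaneous runs" half of that mix, one option besides lowering `max_concurrent_runs` on the QueuedRunCoordinator is to throttle on the submitting side; a rough sketch under the same assumptions as the earlier snippets (the threshold and `submit_one` wrapper are placeholders):

```python
import time

from dagster_graphql import DagsterGraphQLClient

client = DagsterGraphQLClient("dagit.example.svc.cluster.local", port_number=80)

# Hypothetical throttle: keep at most MAX_IN_FLIGHT submitted-but-unfinished
# runs instead of handing all 1000 to the queue at once.
MAX_IN_FLIGHT = 25
TERMINAL = {"SUCCESS", "FAILURE", "CANCELED"}

def in_flight(run_ids):
    # Naive polling; fine for a sketch, but chatty against dagit for big batches.
    return [r for r in run_ids if client.get_run_status(r).name not in TERMINAL]

def submit_throttled(paths, submit_one):
    """submit_one(path) -> run_id, e.g. a wrapper around submit_pipeline_execution."""
    submitted = []
    for path in paths:
        while len(in_flight(submitted)) >= MAX_IN_FLIGHT:
            time.sleep(30)
        submitted.append(submit_one(path))
    return submitted
```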

Jean-Pierre M

05/20/2021, 6:20 PM
The resources are all set to scale on K8s. We set minimum requests, but as with postgres, if it needs more and there are resources available, it will get them.
I guess I can try reserving more dedicated resources for the postgres pod.

alex

05/20/2021, 6:25 PM
yeah I don't know how fast k8s can respond to a resource need spike like this - especially when a `StatefulSet` is involved. It can probably take over more resources on the `Node`, but past that, moving to a `Node` with more resources takes time. So an upfront request of more resources might get you onto a `Node` that can handle it. I guess it depends what's available in your `NodePool`.
👍 1

Jean-Pierre M

05/20/2021, 6:26 PM
I'll give that a shot. thanks!