# announcements
a
i'm observing something weird/interesting. a while ago, i asked about launching pipelines from other pipelines and I ended up issuing a POST to the /graphql endpoint. I'm on 0.8.1 now and it looks like if I issue multiple POST requests at the same time (around 5 or so), then the server goes into some deadlock-ish state and refuses to accept any more connections. Dagit becomes completely unusable. But if I issue pipeline launch requests using websockets to the same endpoint, I have no issues. Does this sound like something that's unique to my setup or does it sound like there's a problem somewhere?
Or actually, maybe I don't even understand what I'm seeing.
Is the graphql server single-threaded?
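For context, a minimal sketch of the kind of POST request described above, using only the Python standard library. The URL, the function names, and the shape of the variables are assumptions for illustration and do not come from this thread; the actual mutation text would be the LAUNCH_PIPELINE_EXECUTION_MUTATION shipped with dagster-graphql.

```python
import json
import urllib.request

# Hypothetical dagit address; adjust to your own deployment.
DAGIT_GRAPHQL_URL = "http://localhost:3000/graphql"


def build_graphql_request(query, variables, url=DAGIT_GRAPHQL_URL):
    """Build (but do not send) a POST request carrying a GraphQL document."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_graphql(query, variables, url=DAGIT_GRAPHQL_URL, timeout=30.0):
    """Send the request and return the decoded JSON response."""
    request = build_graphql_request(query, variables, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Setting an explicit timeout means a client does not hang indefinitely if the server stops accepting connections, which is the symptom reported here.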
m
i think dagit runs exclusively over websockets so am not shocked that the POST endpoint has issues
a
ah i was using dagster-graphql as guidance when I initially wrote this, since it too uses POST to /graphql.
is there anything to know about how many requests I can issue at a time to dagit, whether websockets or POST?
is there something underneath dagit (like flask?) that doesn't behave so well when I issue a large number of requests at the same time?
because even if I use websockets and start issuing a bunch of requests to /graphql, the dagit UI becomes unresponsive until all /graphql requests are ack-ed. I don't mind if my pipeline launch requests take a while, but I'd like the UI to still be responsive. I'm trying to think of a good workaround but nothing I come up with seems satisfactory
I assumed that if I issue LAUNCH_PIPELINE_EXECUTION_MUTATION to /graphql, dagit would ack the request in a few ms and move on. But for some reason that I don't yet understand, it takes seconds to ack the request. Given that the server processes just one request at a time, my hypothesis is that the delays from several LAUNCH_PIPELINE_EXECUTION_MUTATION requests add up, causing the dagit UI to freeze up
maybe i can just circumvent all of this with multiple dagit instances, and dedicate one to the UI and the rest for just the /graphql endpoint
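A rough sketch of how a client might spread launch requests over several dagit instances while leaving one untouched for the UI. The addresses and the helper name are hypothetical, and this only rotates through endpoints; it does nothing about contention in shared run or event storage.

```python
import itertools

# Hypothetical addresses: one instance reserved for the UI,
# the others used only for /graphql launch requests.
UI_DAGIT_URL = "http://localhost:3000"
LAUNCH_GRAPHQL_URLS = [
    "http://localhost:3001/graphql",
    "http://localhost:3002/graphql",
]


def endpoint_cycle(urls):
    """Return an endless iterator that hands out endpoints in round-robin order."""
    urls = list(urls)
    if not urls:
        raise ValueError("at least one endpoint is required")
    return itertools.cycle(urls)


# Usage: pick the next endpoint for each launch request.
picker = endpoint_cycle(LAUNCH_GRAPHQL_URLS)
first_target = next(picker)
second_target = next(picker)
```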
a
do you have a run launcher configured on your instance or are you using the default?
if default, the pipeline executions will happen in a subprocess on the dagit machine, so if you constrain against the number of CPUs you may see things start to grind to a halt
a
i have a run launcher and the pipeline executions run in celery
a
I have a run launcher
how does your run launcher work?
or rather which one are you using
a
sorry i misspoke. I use the default run launcher. I confused the executors and the run launcher.
a
what run / event storage are you using? postgres? That may be the resource under contention as well
a
yep, postgres for run, event log, and schedule storage. although i guess the schedule storage part is less intensive
a
one workaround you could consider is staggering the launches by sleeping a random smallish amount of time
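The staggering idea could look roughly like this on the client side. The function name, the default delay bound, and the injectable `sleep`/`rng` parameters are assumptions for illustration; the right bound depends on how many launches arrive at once.

```python
import random
import time


def launch_with_jitter(launch, max_delay_seconds=2.0, sleep=time.sleep, rng=random):
    """Sleep a random delay in [0, max_delay_seconds], then call launch().

    `launch` is any zero-argument callable that issues one launch request.
    """
    delay = rng.uniform(0.0, max_delay_seconds)
    sleep(delay)
    return launch()
```

Each concurrent caller draws its own delay, so requests that would otherwise arrive together are spread over the delay window.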
a
that makes sense though
a
I believe we would need profiler results to debug further to see what exact resource was under contention causing your problem
a
yea, i tried to see if staggering would work and tested it with some bash scripts that issue curl requests. I'm not really sure what the right amount is in my use case because the number of pipelines I execute depends on how many inputs I get.
I'll try to run it under some profiler and see what happens
meanwhile, do you see anything bad happening if I go down the path of having multiple dagit instances?
a
no problems i can predict. I guess if it's still locking up after you do that, it's likely a postgres contention issue
a
I guess I can also test it by temporarily switching to filesystem run and event storage and then re-running the tests
a
ya sqlite will have its own issues if you hammer it simultaneously so beware of that