Hi guys! Have you seen a situation where Dagit freezes after a few hundred pipeline executions? Neither the GraphQL API nor the Dagit web UI can be opened / they are not responding. Is there a way to figure out what is happening?
Is there any pipeline execution time limit? Or is something else going on?
12/11/2020, 9:25 PM
the pipeline executions all happen in subprocesses - not sure what is stalling the webserver
we do know that its use of
is not well tuned and needs to be vastly improved
12/12/2020, 10:55 AM
@alex do you have a test or anything else where you hit the GraphQL API from outside to start short pipelines (1-3 s) with concurrency (even low concurrency, say 2-3 sessions per Dagit instance)? Actually my problem is that I don't know what is happening, except that everything is frozen in a Dagit instance, even the web interface too... so my new health check always restarts the pod after 30 s, and after the restart it processes again for a minute or two, then it repeats. Another question: is there a queue-based (event-based) executor, where I can feed Dagit's queue and Dagit only pulls the next job from the queue when it feels ready to process it, always at its own pace?
hm... I have to retract that, I just noticed it's queue based... sorry... In that case I don't understand what is happening... Dagit stops responding after about a hundred pipeline starts and that seems to be the root cause... @alex is there any concurrency control in Dagit's pipeline execution context? I think I should control that too, not just the resources.
So it smells like this concurrency causes some kind of deadlock in Dagit, because it wants to start all queued runs at once (every Dagit instance should only execute a certain number of pipelines in_process, keeping enough resources for other services (scheduling, GraphQL API, etc.)).
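(Editor's note: the per-instance run cap asked for above maps to Dagster's queued run coordinator. A hedged sketch of the `dagster.yaml` instance config; the exact module path and available options are version-dependent, so check the docs for your Dagster release:)

```yaml
# dagster.yaml (instance config) - cap how many runs execute at once;
# runs beyond the limit wait in the queue instead of all launching together
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 4   # tune to leave headroom for Dagit itself
```

With this in place, submitting a run enqueues it, and the daemon dequeues runs only while fewer than `max_concurrent_runs` are in progress.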
One more thing: I was able to stop the producer, so Dagit is still responding, but there are a lot of pipelines (more than 50) stuck in the running state and it's not able to finish them, nothing is moving... The DB resources are not busy, Dagit's CPU usage is near zero, my distributed resource lock (in Redis) is also free... so it just seems to be stuck in this state.
A pipeline execution time limit would also be a good control 🙂, with a nice error context if it fails, so users can check why and where the pipeline context was cancelled.
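(Editor's note: Dagster did not expose such a limit at the time; a minimal generic sketch of the idea, using only the standard library - `run_with_timeout` and its parameters are hypothetical names, not a Dagster API:)

```python
import concurrent.futures


def run_with_timeout(fn, timeout_s):
    """Run fn in a worker thread; raise TimeoutError if it exceeds timeout_s.

    Note: the worker thread itself is not killed on timeout - Python offers
    no safe way to do that - so this only bounds how long the caller waits.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return executor.submit(fn).result(timeout=timeout_s)
    finally:
        executor.shutdown(wait=False)
```

A real implementation would also attach the pipeline context to the timeout error so users can see where the run was cancelled, as requested above.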
I have found a workaround: instead of waiting for the shared resource, if it is available I process the pipeline; if it is not, I just raise a RetryRequired exception and retry the run. It seems to be working, but I have to load test this with higher concurrency.
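(Editor's note: the workaround above is a non-blocking lock acquisition with retry instead of waiting. A self-contained sketch of the pattern using a local `threading.Lock` stand-in for the Redis lock; `RetryRequired` and `process_pipeline` are the thread author's names, not a Dagster API - Dagster's built-in equivalent, if your version has it, is the `RetryRequested` exception:)

```python
import threading
import time


class RetryRequired(Exception):
    """Raised when the shared resource is busy; the run should be retried."""


def process_pipeline(lock, work, max_attempts=3, backoff_s=0.05):
    """Do `work` only if the shared lock is free; never block waiting on it."""
    for attempt in range(max_attempts):
        if lock.acquire(blocking=False):  # don't park the run worker on the lock
            try:
                return work()
            finally:
                lock.release()
        time.sleep(backoff_s * (attempt + 1))  # brief backoff before retrying
    raise RetryRequired(f"shared resource busy after {max_attempts} attempts")
```

The key point is `blocking=False`: a worker that cannot get the resource fails fast and frees its slot, instead of sitting blocked inside the run and piling up stuck "running" pipelines.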
Well... the issue with Dagit responsiveness is still there, it seems... Based on the CPU utilization of that pod it's not doing anything, but GraphQL queries get no response, they just block...
12/14/2020, 3:25 PM
sounds like there is definitely some bug in dagit causing it to lock up - will have to reproduce it with a debugger or profiler attached to figure out what exactly is going wrong