# announcements
l
Hello, from time to time my Dagster daemon just stops. This time, I was able to retrieve the error. Could someone help me understand why it stops? On the line just before the one I pasted, the pipeline ran successfully like it usually does.
j
cc @daniel
d
Hi Laura - when you say it stops, does it hang and do nothing until you stop the process, or does it crash and stop on its own? Trying to figure out whether that last stack trace is from the daemon process itself or from you quitting it
The other question I have is whether your machine is running into any memory or other resource limits - that first error seems to indicate that it’s unable to start up a new subprocess to load your schedule code in, which can happen if your machine is overloaded in some way
l
It crashes and stops on its own. I'm going to check the resource usage. Could it be a lack of memory, for example?
d
That would be my first guess. How long does it usually run before it stops?
l
about a day and a half
the last time it stopped was at 8pm on Tuesday
d
Got it. We should absolutely get to the bottom of this (it’s possible there’s a slow memory leak in the daemon; we'll run some tests to try to verify), but in addition to that you may want to set up the daemon so that it restarts automatically on failure (e.g. by using supervisord or some other service that can automatically restart things) - it’s designed so that it should be able to restart cleanly if it fails and pick up where it left off.
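As a rough illustration (a sketch only - the command path, DAGSTER_HOME, and log paths below are placeholders you'd adjust for your setup), a supervisord program block for the daemon could look something like this:
```ini
; Minimal sketch of a supervisord program entry for the Dagster daemon.
; Paths are placeholders - point them at your own virtualenv and DAGSTER_HOME.
[program:dagster-daemon]
command=/opt/dagster/venv/bin/dagster-daemon run
environment=DAGSTER_HOME="/opt/dagster/dagster_home"
autostart=true
autorestart=true                                  ; restart the daemon automatically if it exits unexpectedly
stderr_logfile=/var/log/dagster-daemon.err.log
```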
l
we're working on that. Do you think deploying it on Docker with automatic restart would be better than supervisord?
Disk usage was very unstable before it stopped
d
Yeah, Docker would be our first recommendation if that’s an option. Huh, the disk thing is interesting. If there isn’t a clear cause for that in your pipelines, I wonder if we could be hitting some kind of SQLite limit (which is another thing switching to Docker would help with, since that setup requires using a Postgres DB)
l
We moved to Postgres; this is using Postgres already
d
Got it - must be something else then. The disk thing is a good clue, if the monitoring you’re using gives any indication of the source process or the files it was writing to or anything like that, that would be very useful.
It seems like the disk usage was slowly increasing as the daemon kept running
And the fact that it immediately went to zero after the crash points to it being something in the daemon process specifically
l
Yes, I've enabled that kind of monitoring now. We have an instance in GCP, so I don't have this info for this crash. We'll keep monitoring! Thanks a lot
d
edit: what alex said 🙂 (similar to the one you reported, but it would still be there after switching to Postgres)
a
did you upgrade to 0.10.9? We fixed a thread leak in that release that could be associated with this
d
Assuming you did, the other thing I'd be interested to know is what the process situation is on your machine - if there are a bunch of lingering processes when this happens, that would be a clue
l
I think we found the issue: there is a memory leak somewhere
d
Got it - we’ll see if we can reproduce this on our end too, thanks for checking
And you’re not seeing an increase in the number of processes right? Just that the daemon process itself is using more and more memory?
j
Nope, just memory
l
Hi @daniel, we are using Docker now. How can I set the daemon container to restart automatically? I'm using the deploy_docker example with docker-compose: https://docs.dagster.io/examples/deploy_docker
d
You can do that by setting a restart policy on the container as described here: https://docs.docker.com/config/containers/start-containers-automatically/
In docker-compose I believe that would be `restart: on-failure` in the YAML; we should add that to the example
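Roughly like this in the compose file (a sketch only - the daemon service name and the rest of its config depend on your docker-compose.yml):
```yaml
# Sketch: add a restart policy to whatever service runs the daemon.
services:
  dagster_daemon:          # name of the daemon service in your docker-compose.yml
    # ... existing build/image, environment, and networks config ...
    restart: on-failure    # restart the container automatically if it exits with an error
```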
While we try to reproduce the memory leak - could you send over your dagster.yaml so that we can make sure that we're in a similar environment as you? Also could you confirm the dagster version? This is on 0.10.9?
l
It is 0.10.9
@daniel I have one more question. I'm deploying it on Docker, but it creates a new pipeline container every time the scheduler runs the pipeline.
d
Hi (just found the memory leak btw! thanks for the report)
If you don't want a new container for each pipeline, you can edit the dagster.yaml and remove the run_launcher block - the default run launcher will run each pipeline as a subprocess in the same container that supplies the pipeline code
(whereas the DockerRunLauncher that's included in the example puts each run in its own new container)
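For reference, the block to remove looks roughly like this in the example's dagster.yaml (a sketch - the exact config values in the example may differ):
```yaml
# Deleting this block falls back to the default run launcher,
# which runs each pipeline as a subprocess in the user code container.
run_launcher:
  module: dagster_docker
  class: DockerRunLauncher
  config:
    network: docker_example_network   # Docker network used in the example; yours may be named differently
```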
l
Got it! Thanks!! And cool that you found the leak - is it going to be fixed in the next release?
d
Yeah, 0.11.0 next week will have it
j
It is running smoothly now 🙂
d
Nice, thank you for all of your reports!