# announcements
l
Hello, from time to time my Dagster daemon just stops. This time, I was able to retrieve the error. Could someone help me understand why it stops? On the line just before the one I pasted, the pipeline ran successfully like it usually does.
j
cc @daniel
d
Hi Laura - when you say it stops, does it hang and do nothing until you stop the process, or does it crash and stop on its own? Trying to figure out whether that last stack trace is from the daemon process itself or from you quitting it
The other question I have is whether your machine is running into any memory or other resource limits - that first error seems to indicate that it’s unable to start up a new subprocess to load your schedule code in, which can happen if your machine is overloaded in some way
l
It crashes and stops on its own. I'm going to check the resource usage. Could it be a lack of memory, for example?
d
That would be my first guess. How long does it usually run before it stops?
l
about a day and a half
the last time it stopped was at 8pm on Tuesday
d
Got it. We should absolutely get to the bottom of this (it’s possible there’s a slow memory leak in the daemon; we'll run some tests to try to verify), but in addition to that you may want to set up the daemon so that it restarts automatically on failure (e.g. by using supervisord or some other service that can automatically restart things) - it’s designed so that it should be able to restart cleanly if it fails and pick up where it left off.
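As a rough illustration (a sketch only - the command path, DAGSTER_HOME, and log paths below are placeholders you'd adjust for your setup), a supervisord program block for the daemon could look something like this:
```ini
; Minimal sketch of a supervisord program entry for the Dagster daemon.
; Paths are placeholders - point them at your own virtualenv and DAGSTER_HOME.
[program:dagster-daemon]
command=/opt/dagster/venv/bin/dagster-daemon run
environment=DAGSTER_HOME="/opt/dagster/dagster_home"
autostart=true
autorestart=true                                  ; restart the daemon automatically if it exits unexpectedly
stderr_logfile=/var/log/dagster-daemon.err.log
```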
l
we're working on that. Do you think deploying it on Docker with automatic restart would be better than supervisord?
Disk usage was very unstable before it stopped
d
Yeah, Docker would be our first recommendation if that’s an option. Huh, the disk thing is interesting. If there isn’t a clear cause for that in your pipelines, I wonder if we could be hitting some kind of SQLite limit (which is another thing switching to Docker would help with, since that setup requires using a Postgres DB)
l
We moved to Postgres; this is using Postgres already
d
Got it - must be something else then. The disk thing is a good clue, if the monitoring you’re using gives any indication of the source process or the files it was writing to or anything like that, that would be very useful.
It seems like the disk usage was slowly increasing as the daemon kept running
And the fact that it immediately went to zero after the crash points to it being something in the daemon process specifically
l
Yes, I've enabled that kind of monitoring now. We have an instance in GCP, so I don't have this info for this crash. We'll keep monitoring! Thanks a lot
d
edit: what alex said 🙂 (similar to the one you reported, but it would still be there after switching to Postgres)
a
did you upgrade to 0.10.9? We fixed a thread leak in that release that could be associated with this
d
Assuming you did, the other thing I'd be interested to know is what the process situation is on your machine - if there are a bunch of lingering processes when this happens, that would be a clue
l
I think we found the issue: there is a memory leak somewhere
d
Got it - we’ll see if we can reproduce this on our end too, thanks for checking
And you’re not seeing an increase in the number of processes right? Just that the daemon process itself is using more and more memory?
j
Nope, just memory
l
Hi @daniel, we are using Docker now. How can I set the daemon container to restart automatically? I'm using the deploy_docker example with docker-compose: https://docs.dagster.io/examples/deploy_docker
d
You can do that by setting a restart policy on the container as described here: https://docs.docker.com/config/containers/start-containers-automatically/
In docker-compose I believe that would be `restart: on-failure` in the YAML; we should add that to the example
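Roughly like this in the compose file (a sketch only - the daemon service name and the rest of its config depend on your docker-compose.yml):
```yaml
# Sketch: add a restart policy to whatever service runs the daemon.
services:
  dagster_daemon:          # name of the daemon service in your docker-compose.yml
    # ... existing build/image, environment, and networks config ...
    restart: on-failure    # restart the container automatically if it exits with an error
```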
While we try to reproduce the memory leak - could you send over your dagster.yaml so that we can make sure that we're in a similar environment as you? Also could you confirm the dagster version? This is on 0.10.9?
l
It is 0.10.9
@daniel I have one more question. I'm deploying it on Docker, but it creates a new pipeline container every time the scheduler runs the pipeline.
d
Hi (just found the memory leak btw! thanks for the report)
If you don't want a new container for each pipeline, you can edit the dagster.yaml and remove the run_launcher block - the default run launcher will run each pipeline as a subprocess in the same container that supplies the pipeline code
(whereas the DockerRunLauncher that's included in the example puts each run in its own new container)
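For reference, the block to remove looks roughly like this in the example's dagster.yaml (a sketch - the exact config values in the example may differ):
```yaml
# Deleting this block falls back to the default run launcher,
# which runs each pipeline as a subprocess in the user code container.
run_launcher:
  module: dagster_docker
  class: DockerRunLauncher
  config:
    network: docker_example_network   # Docker network used in the example; yours may be named differently
```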
l
Got it! Thanks!! And cool that you found the leak - is it going to be fixed in the next release?
d
Yeah, 0.11.0 next week will have it
j
It is running smoothly now 🙂
d
Nice, thank you for all of your reports!