Guys, is anybody else having problems with dagster...
# ask-community
i
Guys, is anybody else having problems with dagster-daemon? I'm having terrible issues where the daemon stops working. It starts up and runs fine, but as time passes the daemon stops, and I don't know what to do. This is ruining my jobs
d
Hi Ismael - are there any logs from the dagster-daemon process that might give more clues about why it is stopping?
i
Sadly no, it just stops. The last log registered is just a "no new runs for job_xxxx", which is a normal log
This is all I have
d
Is it possible to post the full set of logs from beginning to end from a time that it failed?
As a file?
i
Here
d
Would you mind sending it as a text file?
i
As you can see, the last log I had from the daemon was at 08:10, and now it's 10:15
d
Where's the line about it failing due to lack of heartbeats?
Is the process still running?
i
It doesn't show up for me. I assumed I had the same error as the guy in the screenshot I sent, because he also had 2 daemons not running
Just dagit is running; the daemon stopped
d
So the process is still running, but hanging? Or it is no longer running? How did you pull those logs?
i
Dagit is running, the daemon isn't; it stopped. I collected these logs from the OpenShift console
d
Does it say what time the process stopped and what the exit code was?
i
Sadly no
d
Typically there would be a stack trace or error message when the daemon stops
Could it have run out of memory?
i
Nope, I have 4GB of RAM and it's consuming just 1.3GB with a job running. As for storage, I have 500GB
d
Do you have any way of getting more information about when it stopped or what the exit code was? It looks almost like OpenShift might have stopped it rather than the dagster process stopping on its own
i
If there's a way to get more info, I'd love to know it too. And it's not OpenShift, because I tested locally using a Docker image and the same error happened. Maybe it's something with Docker...
d
Could you post a log file from when you ran it locally?
Local Docker should have a way of telling you what the exit code was
i
Yeah, that's what I thought too, but it's the same as the one I already sent you
No traceback, no exit code, just stops
i
It's almost like someone forgot to add "raise exception"
d
And would you mind sharing your Dockerfile?
To confirm, the container has stopped too?
i
No, the container keeps running with dagit on it, but the daemon stops
d
Can you run dagit and the daemon in separate containers?
i
I don't know how to do this :/
d
https://docs.dagster.io/deployment/guides/docker has some guides and examples of how we recommend deploying dagster on docker
here's an example that uses docker-compose to deploy dagit and the daemon separately: https://docs.dagster.io/deployment/guides/docker#example
generally docker recommends running a single process per container when possible - it can automatically restart it for you if it fails
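To make that concrete, here's a minimal sketch of what that separation could look like in docker-compose. The service names, the build setup, and the Postgres credentials below are placeholders, not taken from your setup, and the docs example linked above is the authoritative version:
Copy code
version: "3.7"

services:
  # Shared Postgres so dagit and the daemon read and write the same run,
  # event, and schedule storage (wired up via dagster.yaml in both containers).
  dagster_postgres:
    image: postgres:11
    environment:
      POSTGRES_USER: postgres_user
      POSTGRES_PASSWORD: postgres_password
      POSTGRES_DB: postgres_db

  # Dagit web UI in its own container, built from your existing Dockerfile.
  dagster_dagit:
    build: .
    entrypoint: ["dagit", "-h", "0.0.0.0", "-p", "3000", "-w", "workspace.yaml"]
    ports:
      - "3000:3000"
    depends_on:
      - dagster_postgres

  # The daemon (schedules, sensors, run queue) in its own container. If the
  # process dies, the container exits and Docker restarts it automatically.
  dagster_daemon:
    build: .
    entrypoint: ["dagster-daemon", "run"]
    restart: on-failure
    depends_on:
      - dagster_postgres
With something like that, a daemon crash shows up as a container exit with an exit code, and the compose logs for the daemon service give you just the daemon's output, separate from dagit's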
we also have a cloud product that will handle all this deployment stuff for you, but that is not free/open-source
i
Fine, I'll try this, then
a
I had a similar problem deploying dagster using docker-compose on an EC2 instance on AWS. Locally I have no problem with my code (running on an M1 Pro with the same docker-compose setup). However, running on the instance, I had the exact same problem as Ismael: the daemon would stop after ~3 min or so. No tracebacks, exceptions, etc. in the docker-compose logs; it just hangs. To rule out problems with the instance itself, I deployed this example from your repository and had no issues. Then I decided to upgrade my instance to have more available vCPUs, and that solved my issue. However, I'm not confident that the instance type itself was the problem; it may be something about the processes in dagster-daemon (should the number of PIDs be this high?!). A few more pieces of information: I had this problem before 1.0.7 (I was running 1.0.5). Also, my deployment has 3 sensors and 2 schedules.
❤️ 1
d
Andre any chance you have an example repo I could use to try to reproduce the problem? Like a GitHub repo with a docker-compose file that would hang after a few minutes?
Cc @Qwame since you reported a similar issue above - did this start happening for you in 1.0.3 but wasn't happening in 1.0.2?
Ismael, given André’s post above, you could try checking/increasing the number of CPUs available to your local Docker container and see if that helps
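If it helps, with Docker Compose V2 you can also pin an explicit CPU/memory limit on the service so you know exactly what the container is getting. This is just a sketch - the service name and numbers are placeholders, and note that on Docker Desktop the total CPU pool available to all containers is set in Docker Desktop's resource settings, not in the compose file:
Copy code
services:
  dagster:                 # placeholder name for the container running dagit + the daemon
    build: .
    deploy:
      resources:
        limits:
          cpus: "2.0"      # hard cap in CPUs; honored by docker compose (V2) outside swarm
          memory: 4G       # matches the 4GB you mentioned above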
q
I was running locally on Windows and never had this problem. Since I started running dagster on an M1, with v1.0.3, I've had this problem. I've been upgrading versions hoping this would be fixed, but no luck so far
d
What version were you running on windows when it was working?
q
It was before 1.0.3, I think
d
So two things have changed - Windows to Mac, and a version upgrade? It would be useful to know if the problem happens on Mac on the version that was working on Windows
q
I saw this here, on line 100. Does this help? https://dagster.phacility.com/differential/changeset/?ref=470506
That's where an error like that is raised.
d
A bit, but the full set of logs from the daemon process from a session when this happens would be the most useful (or a way for us to reproduce the problem ourselves)
q
Are there any other logs? Because the only logs I see are the ones that show up in the terminal.
d
The ones in the terminal are what I was referring to
i
Sure, I'll try, but it's funny, because I had already increased the number of CPUs. But okay, I'll increase it more
d
Would definitely still recommend running dagit and the daemon in separate containers if that's an option as well
q
I also run dagit in separate containers as recommended in the Docker deploy example, and I experienced the same issues
i
Okay, I'll try increasing the number of CPUs for now, and work on separating these two
d
It sounds like your issue might be different, Qwame, since your symptoms are slightly different
q
Full log for me
Copy code
Traceback (most recent call last):
  File "/Users/abdin/dagster/bin/dagster-daemon", line 8, in <module>
    sys.exit(main())
  File "/Users/abdin/dagster/lib/python3.9/site-packages/dagster/_daemon/cli/__init__.py", line 127, in main
    cli(obj={})  # pylint:disable=E1123
  File "/Users/abdin/dagster/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/abdin/dagster/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/abdin/dagster/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/abdin/dagster/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/abdin/dagster/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/abdin/dagster/lib/python3.9/site-packages/dagster/_daemon/cli/__init__.py", line 43, in run_command
    _daemon_run_command(instance, kwargs)
  File "/Users/abdin/dagster/lib/python3.9/site-packages/dagster/_core/telemetry.py", line 110, in wrap
    result = f(*args, **kwargs)
  File "/Users/abdin/dagster/lib/python3.9/site-packages/dagster/_daemon/cli/__init__.py", line 55, in _daemon_run_command
    controller.check_daemon_loop()
  File "/Users/abdin/dagster/lib/python3.9/site-packages/dagster/_daemon/controller.py", line 273, in check_daemon_loop
    self.check_daemon_heartbeats()
  File "/Users/abdin/dagster/lib/python3.9/site-packages/dagster/_daemon/controller.py", line 246, in check_daemon_heartbeats
    raise Exception(
Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SENSOR']
d
I was hoping for the full output of every line in the process, not just the error at the end
To try to identify the problem that happened earlier and stopped the heartbeating
(Attached as a text file if possible)
q
Do you mean the event logs?
d
I mean the terminal output from the daemon process that is stopping unexpectedly
Whatever is being logged to the terminal when it runs (probably a mixture of event logs from the runs and other things too)
q
Hopefully, this is it
d
Is it possible to post or DM the code/workspace.yaml/dagster.yaml that the daemon is running? We could try running it locally on a Mac and see if we can reproduce the daemon crashing
👍 2
I assume the Mac isn't going into sleep mode or anything like that...
Has it ever worked for you on a Mac?
q
This is the first time trying to run it on a Mac
The Mac can sometimes go to sleep, though. Let me turn that off and let you know how it goes
i
I increased the CPUs, but it's almost comical: the daemon stopped faster than before, lmao
d
have you had a chance to try running it in a separate container? I think that will make it a lot more clear how and why it fails when it fails
i
Not yet, I'm trying to understand how I can do this separation by reading the docs you sent me.
But it's my next goal
r
We have had this issue since the last update of the user_code container on a previously working ECS cluster.
This is version 1.0.5. Here is the log of the daemon ECS task before it stops:
Copy code
Telemetry:
  As an open source project, we collect usage statistics to inform development priorities. For more
  information, read <https://docs.dagster.io/install#telemetry>.
  We will not see or store solid definitions, pipeline definitions, modes, resources, context, or
  any data that is processed within solids and pipelines.
  To opt-out, add the following to $DAGSTER_HOME/dagster.yaml, creating that file if necessary:
    telemetry:
      enabled: false
  Welcome to Dagster!
  If you have any questions or would like to engage with the Dagster team, please join us on Slack
  (<https://bit.ly/39dvSsF>).
2022-09-21 10:46:03 +0000 - dagster.daemon - INFO - instance is configured with the following daemons: ['BackfillDaemon', 'QueuedRunCoordinatorDaemon', 'SchedulerDaemon', 'SensorDaemon']
2022-09-21 10:47:33 +0000 - dagster.daemon - ERROR - Thread for SENSOR did not shut down gracefully
2022-09-21 10:48:03 +0000 - dagster.daemon - ERROR - Thread for SCHEDULER did not shut down gracefully
Traceback (most recent call last):
  File "/usr/local/bin/dagster-daemon", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/dagster/_daemon/cli/__init__.py", line 127, in main
    cli(obj=
{}
)  # pylint:disable=E1123
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/dagster/_daemon/cli/__init__.py", line 43, in run_command
    _daemon_run_command(instance, kwargs)
  File "/usr/local/lib/python3.8/site-packages/dagster/_core/telemetry.py", line 110, in wrap
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/dagster/_daemon/cli/__init__.py", line 55, in _daemon_run_command
    controller.check_daemon_loop()
  File "/usr/local/lib/python3.8/site-packages/dagster/_daemon/controller.py", line 273, in check_daemon_loop
    self.check_daemon_heartbeats()
  File "/usr/local/lib/python3.8/site-packages/dagster/_daemon/controller.py", line 246, in check_daemon_heartbeats
    raise Exception(
Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SENSOR', 'SCHEDULER']
d
Hey Romain - happy to help out, but this thread is quite long already. Would you mind making a new post so that it shows up as a new issue in our support system?
👍 1