Guys, is anybody else having problems with dagster...
# ask-community
i
Guys, is anybody else having problems with dagster-daemon? I'm having terrible issues where the daemon stops working. It starts up and runs fine, but as time passes the daemon stops, and I don't know what to do. This is ruining my jobs
d
Hi Ismael - are there any logs from the dagster-daemon process that might give more clues about why it is stopping?
i
Sadly no, it just stops. The last log registered is just a "no new runs for job_xxxx", which is a normal log
This is all I have
d
Is it possible to post the full set of logs from beginning to end from a time that it failed?
As a file?
i
Here
d
Would you mind sending it as a text file?
i
As you can see, the last log I had from the daemon was at 08:10, and now it's 10:15
d
Where's the line about it failing due to lack of heartbeats?
Is the process still running?
i
It doesn't show up for me. I assumed I had the same error as the guy in the screenshot I sent, because he also had 2 daemons not running
Just dagit is running; the daemon stopped
d
So the process is still running, but hanging? Or it is no longer running? How did you pull those logs?
i
Dagit is running, the daemon isn't; it stopped. I collected these logs from the OpenShift console
d
Does it say what time the process stopped and what the exit code was?
i
Sadly no
d
Typically there would be a stack trace or error message when the daemon stops
Could it have run out of memory?
i
Nope, I have 4GB of RAM and it's consuming just 1.3GB with a job running. As for storage, I have 500GB
d
Do you have any way of getting more information about when it stopped or what the exit code was? It looks almost like OpenShift might have stopped it rather than the dagster process stopping on its own
i
If there's a way to get more info, I'd love to know it too. And it's not OpenShift, because I tested locally using a Docker image and the same error happened. Maybe it's something with Docker...
d
Could you post a log file from when you ran it locally?
Local Docker should have a way of telling you what the exit code was
i
Yeah, that's what I thought too, but it's the same as the one I already sent you
No traceback, no exit code, just stops
i
It's almost like someone forgot to add "raise exception"
d
And would you mind sharing your Dockerfile?
To confirm, the container has stopped too?
i
No, the container keeps running with dagit on it, but the daemon stops
d
Can you run dagit and the daemon in separate containers?
i
I don't know how to do this :/
d
https://docs.dagster.io/deployment/guides/docker has some guides and examples of how we recommend deploying dagster on docker
here's an example that uses docker-compose to deploy dagit and the daemon separately: https://docs.dagster.io/deployment/guides/docker#example
generally docker recommends running a single process per container when possible - it can automatically restart it for you if it fails
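To make that concrete, here's a minimal sketch of what that separation could look like in docker-compose. The service names, the build setup, and the Postgres credentials below are placeholders, not taken from your setup, and the docs example linked above is the authoritative version:
Copy code
version: "3.7"

services:
  # Shared Postgres so dagit and the daemon read and write the same run,
  # event, and schedule storage (wired up via dagster.yaml in both containers).
  dagster_postgres:
    image: postgres:11
    environment:
      POSTGRES_USER: postgres_user
      POSTGRES_PASSWORD: postgres_password
      POSTGRES_DB: postgres_db

  # Dagit web UI in its own container, built from your existing Dockerfile.
  dagster_dagit:
    build: .
    entrypoint: ["dagit", "-h", "0.0.0.0", "-p", "3000", "-w", "workspace.yaml"]
    ports:
      - "3000:3000"
    depends_on:
      - dagster_postgres

  # The daemon (schedules, sensors, run queue) in its own container. If the
  # process dies, the container exits and Docker restarts it automatically.
  dagster_daemon:
    build: .
    entrypoint: ["dagster-daemon", "run"]
    restart: on-failure
    depends_on:
      - dagster_postgres
With something like that, a daemon crash shows up as a container exit with an exit code, and the compose logs for the daemon service give you just the daemon's output, separate from dagit's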
we also have a cloud product that will handle all this deployment stuff for you, but that is not free/open-source
i
Fine, I'll try this, then
a
I had a similar problem deploying dagster using docker-compose on an EC2 instance on AWS. Locally I have no problem with my code (running on an M1 Pro with the same docker-compose setup). However, running on the instance, I had the exact same problem as Ismael: the daemon would stop after ~3 min or so. No tracebacks, exceptions, etc. in the docker-compose logs; it just hangs. To rule out problems with the instance itself, I deployed this example from your repository and had no issues. Then I decided to upgrade my instance to have more available vCPUs, and that solved my issue. However, I'm not confident that the instance type itself was the problem; it may be something about the processes in dagster-daemon (should the number of PIDs be this high?!). A few more pieces of information: I had this problem before 1.0.7 (I was running 1.0.5). Also, my deployment has 3 sensors and 2 schedules.
❤️ 1
d
Andre any chance you have an example repo I could use to try to reproduce the problem? Like a GitHub repo with a docker-compose file that would hang after a few minutes?
Cc @Qwame since you reported a similar issue above - did this start happening for you in 1.0.3 but wasn't happening in 1.0.2?
Ismael, given André’s post above, you could try checking/increasing the number of CPUs available to your local Docker container and see if that helps
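If it helps, with Docker Compose V2 you can also pin an explicit CPU/memory limit on the service so you know exactly what the container is getting. This is just a sketch - the service name and numbers are placeholders, and note that on Docker Desktop the total CPU pool available to all containers is set in Docker Desktop's resource settings, not in the compose file:
Copy code
services:
  dagster:                 # placeholder name for the container running dagit + the daemon
    build: .
    deploy:
      resources:
        limits:
          cpus: "2.0"      # hard cap in CPUs; honored by docker compose (V2) outside swarm
          memory: 4G       # matches the 4GB you mentioned above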
q
I was running locally on Windows and never had this problem. Since I started running dagster on an M1, with v1.0.3, I've had this problem. I've been upgrading versions hoping this would be fixed, but no luck so far
d
What version were you running on windows when it was working?
q
It was before 1.0.3, I think
d
So two things have changed - Windows to Mac, and a version upgrade? It would be useful to know if the problem happens on Mac on the version that was working on Windows
q
I saw this here, on line 100. Does this help? https://dagster.phacility.com/differential/changeset/?ref=470506
That's where an error like that is raised.
d
A bit, but the full set of logs from the daemon process from a session when this happens would be the most useful (or a way for us to reproduce the problem ourselves)
q
Are there any other logs? Because the only logs I see are the ones that show up in the terminal.
d
The ones in the terminal are what I was referring to
i
Sure, I'll try, but it's funny, because I had already increased the number of CPUs. But okay, I'll increase it more
d
Would definitely still recommend running dagit and the daemon in separate containers if that's an option as well
q
I also run dagit in separate containers as recommended in the Docker deploy example, and I experienced the same issues
i
Okay, I'll try increasing the number of CPUs for now, and work on separating these two
d
It sounds like your issue might be different, Qwame, since your symptoms are slightly different
q
Full log for me
Copy code
Traceback (most recent call last):
  File "/Users/abdin/dagster/bin/dagster-daemon", line 8, in <module>
    sys.exit(main())
  File "/Users/abdin/dagster/lib/python3.9/site-packages/dagster/_daemon/cli/__init__.py", line 127, in main
    cli(obj={})  # pylint:disable=E1123
  File "/Users/abdin/dagster/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/abdin/dagster/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/abdin/dagster/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/abdin/dagster/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/abdin/dagster/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/abdin/dagster/lib/python3.9/site-packages/dagster/_daemon/cli/__init__.py", line 43, in run_command
    _daemon_run_command(instance, kwargs)
  File "/Users/abdin/dagster/lib/python3.9/site-packages/dagster/_core/telemetry.py", line 110, in wrap
    result = f(*args, **kwargs)
  File "/Users/abdin/dagster/lib/python3.9/site-packages/dagster/_daemon/cli/__init__.py", line 55, in _daemon_run_command
    controller.check_daemon_loop()
  File "/Users/abdin/dagster/lib/python3.9/site-packages/dagster/_daemon/controller.py", line 273, in check_daemon_loop
    self.check_daemon_heartbeats()
  File "/Users/abdin/dagster/lib/python3.9/site-packages/dagster/_daemon/controller.py", line 246, in check_daemon_heartbeats
    raise Exception(
Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SENSOR']
d
I was hoping for the full output of every line in the process, not just the error at the end
To try to identify the problem that happened earlier and stopped the heartbeating
(Attached as a text file if possible)
q
Do you mean the event logs?
d
I mean the terminal output from the daemon process that is stopping unexpectedly
Whatever is being logged to the terminal when it runs (probably a mixture of event logs from the runs and other things too)
q
Hopefully, this is it
d
Is it possible to post or DM the code/workspace.yaml/dagster.yaml that the daemon is running? We could try running it locally on a Mac and see if we can reproduce the daemon crashing
👍 2
I assume the Mac isn't going into sleep mode or anything like that...
Has it ever worked for you on a Mac?
q
This is the first time trying to run it on a Mac
The Mac can sometimes go to sleep, though. Let me turn that off and let you know how it goes
i
I increased the CPUs, but it's almost comical: the daemon stopped faster than before, lmao
d
have you had a chance to try running it in a separate container? I think that will make it a lot more clear how and why it fails when it fails
i
Not yet, I'm trying to understand how I can do this separation by reading the docs you sent me.
But it's my next goal
r
We have had this issue since the last update of the user_code container on a previously working ECS cluster.
This is version 1.0.5. Here is the log of the daemon ECS task before it stops:
Copy code
Telemetry:
  As an open source project, we collect usage statistics to inform development priorities. For more
  information, read <https://docs.dagster.io/install#telemetry>.
  We will not see or store solid definitions, pipeline definitions, modes, resources, context, or
  any data that is processed within solids and pipelines.
  To opt-out, add the following to $DAGSTER_HOME/dagster.yaml, creating that file if necessary:
    telemetry:
      enabled: false
  Welcome to Dagster!
  If you have any questions or would like to engage with the Dagster team, please join us on Slack
  (<https://bit.ly/39dvSsF>).
2022-09-21 10:46:03 +0000 - dagster.daemon - INFO - instance is configured with the following daemons: ['BackfillDaemon', 'QueuedRunCoordinatorDaemon', 'SchedulerDaemon', 'SensorDaemon']
2022-09-21 10:47:33 +0000 - dagster.daemon - ERROR - Thread for SENSOR did not shut down gracefully
2022-09-21 10:48:03 +0000 - dagster.daemon - ERROR - Thread for SCHEDULER did not shut down gracefully
Traceback (most recent call last):
  File "/usr/local/bin/dagster-daemon", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/dagster/_daemon/cli/__init__.py", line 127, in main
    cli(obj=
{}
)  # pylint:disable=E1123
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/dagster/_daemon/cli/__init__.py", line 43, in run_command
    _daemon_run_command(instance, kwargs)
  File "/usr/local/lib/python3.8/site-packages/dagster/_core/telemetry.py", line 110, in wrap
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/dagster/_daemon/cli/__init__.py", line 55, in _daemon_run_command
    controller.check_daemon_loop()
  File "/usr/local/lib/python3.8/site-packages/dagster/_daemon/controller.py", line 273, in check_daemon_loop
    self.check_daemon_heartbeats()
  File "/usr/local/lib/python3.8/site-packages/dagster/_daemon/controller.py", line 246, in check_daemon_heartbeats
    raise Exception(
Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SENSOR', 'SCHEDULER']
d
Hey Romain - happy to help out, but this thread is quite long already. Would you mind making a new post so that it shows up as a new issue in our support system?
👍 1