Hi, my `dagster-daemon run` keeps crashing with an...
# ask-community
jasono
Hi, my `dagster-daemon run` keeps crashing with an error message that states `ERROR - Thread for SENSOR did not shut down gracefully` and `Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SENSOR', 'SCHEDULER']`. This occasionally happened in the past and somehow fixed itself, but this time the issue isn't going away. Here is the full stack trace.
```
$ dagster-daemon run
2022-05-07 23:34:35 -0700 - dagster.daemon - INFO - instance is configured with the following daemons: ['BackfillDaemon', 'SchedulerDaemon', 'SensorDaemon']
warnings.warn(warning_message, DeprecationWarning)
2022-05-07 23:36:05 -0700 - dagster.daemon - ERROR - Thread for SENSOR did not shut down gracefully
2022-05-07 23:36:35 -0700 - dagster.daemon - ERROR - Thread for SCHEDULER did not shut down gracefully
Traceback (most recent call last):
  File "C:\Program Files\Python\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "E:\Data\dagster\venv\Scripts\dagster-daemon.exe\__main__.py", line 7, in <module>
  File "Y:\dagster\venv\lib\site-packages\dagster\daemon\cli\__init__.py", line 142, in main
    cli(obj={})  # pylint:disable=E1123
  File "c:\Users\u12345\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "c:\Users\u12345\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "c:\Users\u12345\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\Users\u12345\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\Users\u12345\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "Y:\dagster\venv\lib\site-packages\dagster\daemon\cli\__init__.py", line 43, in run_command
    _daemon_run_command(instance, kwargs)
  File "Y:\dagster\venv\lib\site-packages\dagster\core\telemetry.py", line 110, in wrap
    result = f(*args, **kwargs)
  File "Y:\dagster\venv\lib\site-packages\dagster\daemon\cli\__init__.py", line 55, in _daemon_run_command
    controller.check_daemon_loop()
  File "Y:\dagster\venv\lib\site-packages\dagster\daemon\controller.py", line 263, in check_daemon_loop
    self.check_daemon_heartbeats()
  File "Y:\dagster\venv\lib\site-packages\dagster\daemon\controller.py", line 236, in check_daemon_heartbeats
    raise Exception(
Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SENSOR', 'SCHEDULER']
```
I'm on Windows Server 2012 R2 and Dagster 0.14.13.
daniel
Hi @jasono - it seems like the threads that the daemon spins up aren't even really able to start up. Is the machine where this is running close to resource limits? Did anything change in the environment between the last time it was working and when the problem started happening?
jasono
Hi @daniel, thanks for your response. There are enough resources as far as I can tell; it's running at 60% CPU and 40% memory, but it's still failing.
Also, I can't think of any unusual environment change recently.
daniel
Did you upgrade dagster? When was the last time it was working?
jasono
It was working at 0.14.13 and the version was the same when the issue started.
I then tried 0.14.14 hoping it might fix the issue, but it didn’t.
I also tried `dagster-daemon wipe`, which did wipe, but it didn't help with the issue.
I also tried `heartbeat write and read`, which ran okay.
daniel
You could try running the daemon with the `--empty-workspace` arg to see if it's something related to your workspace.yaml that is causing the problem.
jasono
trying right now
Wow, it’s not failing.
```
$ dagster-daemon run --empty-workspace
Y:\dagster\venv\lib\site-packages\dagster\core\definitions\job_definition.py:93: ExperimentalWarning: "VersionStrategy" is an experimental class. It may break in future versions, even between dot releases. To mute warnings for experimental functionality, invoke warnings.filterwarnings("ignore", category=dagster.ExperimentalWarning) or use one of the other methods described at https://docs.python.org/3/library/warnings.html#describing-warning-filters.
  super(JobDefinition, self).__init__(
2022-05-09 16:50:17 -0700 - dagster.daemon - INFO - instance is configured with the following daemons: ['BackfillDaemon', 'SchedulerDaemon', 'SensorDaemon']
2022-05-09 16:50:21 -0700 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2022-05-09 16:50:23 -0700 - dagster.daemon.SchedulerDaemon - WARNING - Schedule recon_atlas_gl_job_schedule was started from a location recon_atlas_gl.py that can no longer be found in the workspace. You can turn off this schedule in the Dagit UI from the Status tab.
2022-05-09 16:50:23 -0700 - dagster.daemon.SchedulerDaemon - WARNING - Schedule recon_me_scheduled_reports_job_schedule was started from a location recon_me_scheduled_reports.py that can no longer be found in the workspace. You can turn off this schedule in the Dagit UI from the Status tab.
```
daniel
Could you post your workspace.yaml?
jasono
It issues the above warnings, but it doesn't stop.
```yaml
load_from:
  - python_file:
      relative_path: repo/file_watcher.py
      working_directory: y:/datapipeline/file_load
  - python_file:
      relative_path: repo/recon_8510r.py
      working_directory: y:/datapipeline/me_recon_supports/recon_8510r
  - python_file:
      relative_path: repo/recon_atlas_gl.py
      working_directory: y:/datapipeline/me_recon_supports/recon_atlas_gl
  - python_file:
      relative_path: repo/recon_etracdw_atlas.py
      working_directory: y:/datapipeline/me_recon_supports
  - python_file:
      relative_path: repo/trend_prem_alltrans.py
      working_directory: y:/datapipeline/me_prem_trend
  - python_file:
      relative_path: repo/trend_prem_byclient.py
      working_directory: y:/datapipeline/me_prem_trend
  - python_file:
      relative_path: repo/trend_PL.py
      working_directory: y:/datapipeline/file_load
  - python_file:
      relative_path: repo/recon_me_scheduled_reports.py
      working_directory: y:/datapipeline/me_recon_supports/recon_me_scheduled_reports
  - python_file:
      relative_path: repo/check_epic_gov.py
      working_directory: y:/datapipeline/me_ng/check_epic_gov
  - python_file:
      relative_path: repo/je_dac_co61_reclass.py
      working_directory: y:/datapipeline/me_ng/
  - python_file:
      relative_path: repo/audit_data_files.py
      working_directory: y:/datapipeline/me_ng/
  - python_file:
      relative_path: repo/me_report_5332.py
      working_directory: y:/datapipeline/me_ng/
  - python_file:
      relative_path: repo/memoization_test.py
      working_directory: y:/datapipeline/me_ng/
```
daniel
When you run dagit with that workspace.yaml, how long does it take to start up?
jasono
It's been over 2 minutes and it's still not accessible from the browser.
Now it responds and has opened up in the browser.
Interestingly, I just restarted the daemon with no args and it’s not stopping anymore.
Perhaps running it with that arg somehow fixed the problem.
daniel
My guess from what you've described so far is that one of your repository locations is taking a really long time to load
And we could be handling it better in the daemon (especially the fact that there are no useful logs before it fails), but you could also try to figure out which module is very slow and see if there's any way to make it faster
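One quick way to narrow that down is to import each workspace file directly and time it. Below is a minimal sketch (not from the thread) assuming the paths from the workspace.yaml above; the file list and the `probe_module` name are placeholders to adjust:

```python
# Rough timing probe: imports each workspace file the way Dagster does
# (executing its top-level code) and prints how long each one takes.
import importlib.util
import sys
import time

# A couple of entries built from the workspace.yaml above (working_directory + relative_path);
# add the rest as needed. Run from the matching working_directory if a file
# relies on it for its own imports.
WORKSPACE_FILES = [
    r"y:/datapipeline/file_load/repo/file_watcher.py",
    r"y:/datapipeline/me_recon_supports/recon_atlas_gl/repo/recon_atlas_gl.py",
]

for path in WORKSPACE_FILES:
    start = time.monotonic()
    spec = importlib.util.spec_from_file_location("probe_module", path)
    module = importlib.util.module_from_spec(spec)
    sys.modules["probe_module"] = module
    spec.loader.exec_module(module)  # runs the file's top-level code
    print(f"{path}: {time.monotonic() - start:.1f}s")
    del sys.modules["probe_module"]
```

Alternatively, running `python -X importtime repo/file_watcher.py` from the matching working_directory prints a per-import timing breakdown, which can help spot a slow dependency.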
jasono
so Dagit executes the repo when it starts?
d
It loads the module so that it can display your jobs
And the daemon loads your code to check for schedules and sensors
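In other words, everything at the top level of each workspace file runs on every load, by both Dagit and the daemon. A minimal sketch of that pattern (hypothetical `hello_job` and schedule names, 0.14-era API), just to show where slow code would hurt:

```python
from dagster import ScheduleDefinition, job, op, repository

# Anything at module level here (reading files, opening DB connections, building
# large DataFrames, etc.) runs every time Dagit or the daemon loads this file,
# not just when a run starts.

@op
def say_hello():
    return "hello"

@job
def hello_job():
    say_hello()

hello_schedule = ScheduleDefinition(job=hello_job, cron_schedule="0 * * * *")

@repository
def my_repository():
    # The daemon loads this to discover schedules and sensors; Dagit loads it to list jobs.
    return [hello_job, hello_schedule]
```

Keeping expensive work inside ops rather than at module level keeps the load step itself fast.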
jasono
okay, I will try to load each repo module and see which one is slow
They are relatively small files with few dependencies, so I'm surprised loading them would take that long, but I'll try nonetheless.
Thanks for looking into this!!!