Hi, my `dagster-daemon run` keeps crashing with an...
# ask-community
jasono
Hi, my `dagster-daemon run` keeps crashing with an error message that states `ERROR - Thread for SENSOR did not shut down gracefully` and `Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SENSOR', 'SCHEDULER']`. This occasionally happened in the past and somehow fixed itself, but this time the issue isn't going away. Here is the full stack trace.
```
$ dagster-daemon run
2022-05-07 23:34:35 -0700 - dagster.daemon - INFO - instance is configured with the following daemons: ['BackfillDaemon', 'SchedulerDaemon', 'SensorDaemon']
warnings.warn(warning_message, DeprecationWarning)
2022-05-07 23:36:05 -0700 - dagster.daemon - ERROR - Thread for SENSOR did not shut down gracefully
2022-05-07 23:36:35 -0700 - dagster.daemon - ERROR - Thread for SCHEDULER did not shut down gracefully
Traceback (most recent call last):
  File "C:\Program Files\Python\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "E:\Data\dagster\venv\Scripts\dagster-daemon.exe\__main__.py", line 7, in <module>
  File "Y:\dagster\venv\lib\site-packages\dagster\daemon\cli\__init__.py", line 142, in main
    cli(obj={})  # pylint:disable=E1123
  File "c:\Users\u12345\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "c:\Users\u12345\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "c:\Users\u12345\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\Users\u12345\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\Users\u12345\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "Y:\dagster\venv\lib\site-packages\dagster\daemon\cli\__init__.py", line 43, in run_command
    _daemon_run_command(instance, kwargs)
  File "Y:\dagster\venv\lib\site-packages\dagster\core\telemetry.py", line 110, in wrap
    result = f(*args, **kwargs)
  File "Y:\dagster\venv\lib\site-packages\dagster\daemon\cli\__init__.py", line 55, in _daemon_run_command
    controller.check_daemon_loop()
  File "Y:\dagster\venv\lib\site-packages\dagster\daemon\controller.py", line 263, in check_daemon_loop
    self.check_daemon_heartbeats()
  File "Y:\dagster\venv\lib\site-packages\dagster\daemon\controller.py", line 236, in check_daemon_heartbeats
    raise Exception(
Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SENSOR', 'SCHEDULER']
```
I'm on Windows Server 2012 R2 and Dagster 0.14.13.
daniel
Hi @jasono - it seems like the threads that the daemon spins up aren't even really able to start up. Is the machine where this is running close to resource limits? Did anything change in the environment between the last time it was working and when the problem started happening?
jasono
Hi @daniel, thanks for your response. There are enough resources as far as I can tell; it's running at 60% CPU and 40% memory, but it's still failing.
Also, I can't think of any unusual environment change recently.
daniel
Did you upgrade dagster? When was the last time it was working?
jasono
It was working at 0.14.13 and the version was the same when the issue started.
I then tried 0.14.14 hoping it might fix the issue, but it didn’t.
I also tried `dagster-daemon wipe`, which did wipe, but it didn't help with the issue.
I also tried `heartbeat write and read`, which ran okay.
daniel
You could try running the daemon with the `--empty-workspace` arg to see if it's something related to your workspace.yaml that is causing the problem.
jasono
trying right now
Wow, it’s not failing.
```
$ dagster-daemon run --empty-workspace
Y:\dagster\venv\lib\site-packages\dagster\core\definitions\job_definition.py:93: ExperimentalWarning: "VersionStrategy" is an experimental class. It may break in future versions, even between dot releases. To mute warnings for experimental functionality, invoke warnings.filterwarnings("ignore", category=dagster.ExperimentalWarning) or use one of the other methods described at https://docs.python.org/3/library/warnings.html#describing-warning-filters.
  super(JobDefinition, self).__init__(
2022-05-09 16:50:17 -0700 - dagster.daemon - INFO - instance is configured with the following daemons: ['BackfillDaemon', 'SchedulerDaemon', 'SensorDaemon']
2022-05-09 16:50:21 -0700 - dagster.daemon.SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2022-05-09 16:50:23 -0700 - dagster.daemon.SchedulerDaemon - WARNING - Schedule recon_atlas_gl_job_schedule was started from a location recon_atlas_gl.py that can no longer be found in the workspace. You can turn off this schedule in the Dagit UI from the Status tab.
2022-05-09 16:50:23 -0700 - dagster.daemon.SchedulerDaemon - WARNING - Schedule recon_me_scheduled_reports_job_schedule was started from a location recon_me_scheduled_reports.py that can no longer be found in the workspace. You can turn off this schedule in the Dagit UI from the Status tab.
```
daniel
Could you post your workspace.yaml?
jasono
It issues the above warnings, but it doesn't stop.
```yaml
load_from:
  - python_file:
      relative_path: repo/file_watcher.py
      working_directory: y:/datapipeline/file_load
  - python_file:
      relative_path: repo/recon_8510r.py
      working_directory: y:/datapipeline/me_recon_supports/recon_8510r
  - python_file:
      relative_path: repo/recon_atlas_gl.py
      working_directory: y:/datapipeline/me_recon_supports/recon_atlas_gl
  - python_file:
      relative_path: repo/recon_etracdw_atlas.py
      working_directory: y:/datapipeline/me_recon_supports
  - python_file:
      relative_path: repo/trend_prem_alltrans.py
      working_directory: y:/datapipeline/me_prem_trend
  - python_file:
      relative_path: repo/trend_prem_byclient.py
      working_directory: y:/datapipeline/me_prem_trend
  - python_file:
      relative_path: repo/trend_PL.py
      working_directory: y:/datapipeline/file_load
  - python_file:
      relative_path: repo/recon_me_scheduled_reports.py
      working_directory: y:/datapipeline/me_recon_supports/recon_me_scheduled_reports
  - python_file:
      relative_path: repo/check_epic_gov.py
      working_directory: y:/datapipeline/me_ng/check_epic_gov
  - python_file:
      relative_path: repo/je_dac_co61_reclass.py
      working_directory: y:/datapipeline/me_ng/
  - python_file:
      relative_path: repo/audit_data_files.py
      working_directory: y:/datapipeline/me_ng/
  - python_file:
      relative_path: repo/me_report_5332.py
      working_directory: y:/datapipeline/me_ng/
  - python_file:
      relative_path: repo/memoization_test.py
      working_directory: y:/datapipeline/me_ng/
```
daniel
When you run dagit with that workspace.yaml, how long does it take to start up?
jasono
It's been over 2 minutes and it's still not accessible from the browser.
Now it responds and has opened up in the browser.
Interestingly, I just restarted the daemon with no args and it’s not stopping anymore.
Perhaps running it with that arg somehow fixed the problem.
daniel
My guess from what you've described so far is that one of your repository locations is taking a really long time to load
And we could be handling it better in the daemon (especially the fact that there are no useful logs before it fails), but you could also try to figure out which module is very slow and see if there's any way to make it faster
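One quick way to narrow that down is to import each workspace file directly and time it. Below is a minimal sketch (not from the thread) assuming the paths from the workspace.yaml above; the file list and the `probe_module` name are placeholders to adjust:

```python
# Rough timing probe: imports each workspace file the way Dagster does
# (executing its top-level code) and prints how long each one takes.
import importlib.util
import sys
import time

# A couple of entries built from the workspace.yaml above (working_directory + relative_path);
# add the rest as needed. Run from the matching working_directory if a file
# relies on it for its own imports.
WORKSPACE_FILES = [
    r"y:/datapipeline/file_load/repo/file_watcher.py",
    r"y:/datapipeline/me_recon_supports/recon_atlas_gl/repo/recon_atlas_gl.py",
]

for path in WORKSPACE_FILES:
    start = time.monotonic()
    spec = importlib.util.spec_from_file_location("probe_module", path)
    module = importlib.util.module_from_spec(spec)
    sys.modules["probe_module"] = module
    spec.loader.exec_module(module)  # runs the file's top-level code
    print(f"{path}: {time.monotonic() - start:.1f}s")
    del sys.modules["probe_module"]
```

Alternatively, running `python -X importtime repo/file_watcher.py` from the matching working_directory prints a per-import timing breakdown, which can help spot a slow dependency.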
jasono
so Dagit executes the repo when it starts?
d
It loads the module so that it can display your jobs
And the daemon loads your code to check for schedules and sensors
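In other words, everything at the top level of each workspace file runs on every load, by both Dagit and the daemon. A minimal sketch of that pattern (hypothetical `hello_job` and schedule names, 0.14-era API), just to show where slow code would hurt:

```python
from dagster import ScheduleDefinition, job, op, repository

# Anything at module level here (reading files, opening DB connections, building
# large DataFrames, etc.) runs every time Dagit or the daemon loads this file,
# not just when a run starts.

@op
def say_hello():
    return "hello"

@job
def hello_job():
    say_hello()

hello_schedule = ScheduleDefinition(job=hello_job, cron_schedule="0 * * * *")

@repository
def my_repository():
    # The daemon loads this to discover schedules and sensors; Dagit loads it to list jobs.
    return [hello_job, hello_schedule]
```

Keeping expensive work inside ops rather than at module level keeps the load step itself fast.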
jasono
okay, I will try to load each repo module and see which one is slow
They are relatively small files with few dependencies, so I'm surprised loading them would take that long, but I'll try nonetheless.
Thanks for looking into this!!!