
Benoit Perigaud

10/02/2021, 9:53 PM
We moved to daylight saving time overnight in Sydney, and I don't know if this is linked or not, but my Scheduler has been failing since then. I run the daemon as a service, but restarting it (or restarting my machine) didn't fix it. It also looks like, since then, the
dagster-daemon run
command is always eating 100% of one of my CPUs. Here is the error in journalctl (I'm on dagster 0.12.12):
Oct 03 08:49:03 raspberrypi bash[10714]: 2021-10-03 08:49:03 - dagster-daemon - ERROR - Thread for SCHEDULER did not shut down gracefully
Oct 03 08:49:03 raspberrypi bash[10714]: Traceback (most recent call last):
Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/bin/dagster-daemon", line 8, in <module>
Oct 03 08:49:03 raspberrypi bash[10714]:     sys.exit(main())
Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/dagster/daemon/cli/__init__.py", line 135, in main
Oct 03 08:49:03 raspberrypi bash[10714]:     cli(obj={})  # pylint:disable=E1123
Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/click/core.py", line 829, in __call__
Oct 03 08:49:03 raspberrypi bash[10714]:     return self.main(*args, **kwargs)
Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/click/core.py", line 782, in main
Oct 03 08:49:03 raspberrypi bash[10714]:     rv = self.invoke(ctx)
Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
Oct 03 08:49:03 raspberrypi bash[10714]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
Oct 03 08:49:03 raspberrypi bash[10714]:     return ctx.invoke(self.callback, **ctx.params)
Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/click/core.py", line 610, in invoke
Oct 03 08:49:03 raspberrypi bash[10714]:     return callback(*args, **kwargs)
Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/dagster/daemon/cli/__init__.py", line 48, in run_command
Oct 03 08:49:03 raspberrypi bash[10714]:     controller.check_daemon_loop()
Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/dagster/daemon/controller.py", line 237, in check_daemon_loop
Oct 03 08:49:03 raspberrypi bash[10714]:     self.check_daemon_heartbeats()
Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/dagster/daemon/controller.py", line 212, in check_daemon_heartbeats
Oct 03 08:49:03 raspberrypi bash[10714]:     failed_daemons=failed_daemons
Oct 03 08:49:03 raspberrypi bash[10714]: Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SCHEDULER']
Oct 03 08:49:04 raspberrypi systemd[1]: dagster-daemon.service: Main process exited, code=exited, status=1/FAILURE
Oct 03 08:49:04 raspberrypi systemd[1]: dagster-daemon.service: Failed with result 'exit-code'.
Oct 03 08:49:04 raspberrypi systemd[1]: dagster-daemon.service: Service RestartSec=100ms expired, scheduling restart.
Oct 03 08:49:04 raspberrypi systemd[1]: dagster-daemon.service: Scheduled restart job, restart counter is at 6.
Oct 03 08:49:04 raspberrypi systemd[1]: Stopped Daemon for dagster.
Oct 03 08:49:04 raspberrypi systemd[1]: Started Daemon for dagster.
The health page tells me: "Not running - No recent heartbeat"

daniel

10/02/2021, 10:11 PM
Hi Benoit - could you share what your execution timezone is and what the frequency of your schedule is (daily, weekly, etc.)?

Benoit Perigaud

10/02/2021, 10:14 PM
Sure. TZ is
execution_timezone="Australia/Sydney"
And I have multiple pipelines on a cron schedule (including one that runs every 15 minutes):
cron_schedule="*/15 * * * *"

daniel

10/02/2021, 10:18 PM
Got it. I did notice this issue with the cron library we use, which could potentially cause something like this: https://github.com/taichino/croniter/issues/137. One thing you could try as a workaround while we sort this out is turning your schedules off and on again - that could get the start time past the DST transition if something is going wrong there.
👍 1
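A minimal sketch of how to eyeball where a */15 schedule's ticks land around the Sydney DST start (illustrative only - this is not the daemon's internal logic, and it assumes croniter plus Python 3.9+ for zoneinfo):
# Print the next few */15 ticks starting just before the Sydney DST
# transition on 2021-10-03 (02:00 local time jumps straight to 03:00).
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+; use pytz on older interpreters

from croniter import croniter

start = datetime(2021, 10, 3, 1, 45, tzinfo=ZoneInfo("Australia/Sydney"))
ticks = croniter("*/15 * * * *", start)
for _ in range(4):
    print(ticks.get_next(datetime))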

Benoit Perigaud

10/02/2021, 10:28 PM
Thanks! Deactivating the schedule by itself didn't work, but the next restart of the daemon worked out!

daniel

10/02/2021, 10:51 PM
Thanks - if it's not a pain, would it be possible to share what the different cron strings you use are? I'm having some trouble reproducing the problem with */15.

Benoit Perigaud

10/02/2021, 10:54 PM
No worries. I just have 3 different schedules (the 15 min one has a
should_execute
filter as well):
sched_pipeline1 = ScheduleDefinition(cron_schedule="*/15 * * * *", execution_timezone="Australia/Sydney", should_execute=hour_filter, job=pipeline1)

sched_pipeline2 = ScheduleDefinition(cron_schedule="0 8,20 * * *", execution_timezone="Australia/Sydney", job=pipeline2)

sched_pipeline3 = ScheduleDefinition(cron_schedule="0 8 * * *", execution_timezone="Australia/Sydney", job=pipeline3)
Actually, it seems that it is not fixed. The Scheduler shows a "Running" status but my 15 min runs are not triggered. Should I wipe the heartbeats? I can also see that the daemon is still at 100% CPU.

daniel

10/03/2021, 12:14 AM
Is it possible to share what the should_execute filter is? That could possibly be contributing to the runs not being triggered

Benoit Perigaud

10/03/2021, 12:17 AM
I did a bit more troubleshooting: if I deactivate the schedule for my 15 min run and restart the daemon, it works fine. But if I reactivate the schedule, it breaks again and CPU goes to 100%.
My filter is quite simple:
from datetime import datetime

def hour_filter(_context):
    # Skip ticks between 01:00 and 05:59 local time; run for every other hour.
    hour = datetime.now().hour
    return hour >= 6 or hour == 0
I recreated the schedule without the
should_execute
parameter, but it is still failing.

daniel

10/03/2021, 12:39 AM
Got it - do the logs from the daemon give any indication of what it’s doing when trying that schedule?

Benoit Perigaud

10/03/2021, 12:43 AM
I can't see anything particular, no. I managed to fix my issue by adding a
name
parameter to my schedule and making it different from the original name. Heartbeats are working and the CPU is down to normal levels, even with the 15 min schedule on. I am losing the info about the previous schedule now, though. If I can help with further troubleshooting, let me know. I guess that if I change my schedule name back to the original one, the error will reappear.
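Roughly what that rename looks like in code - a sketch mirroring the */15 schedule from earlier, with a made-up new name:
from dagster import ScheduleDefinition

# Giving the schedule an explicit new name makes the daemon treat it as a
# brand-new schedule (at the cost of losing the old schedule's tick history).
# The name below is illustrative, not the actual one used.
sched_pipeline1 = ScheduleDefinition(
    name="sched_pipeline1_v2",
    cron_schedule="*/15 * * * *",
    execution_timezone="Australia/Sydney",
    should_execute=hour_filter,
    job=pipeline1,
)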

daniel

10/03/2021, 12:46 AM
If there's any way you'd be able to pass along a file with a dump of your jobs and job_ticks tables in your DB (I don't think there's anything particularly sensitive in those tables), that would help a lot for debugging how your system managed to get into this weird state.

Benoit Perigaud

10/03/2021, 12:48 AM
I'll look into it a bit later today and send it across 👍
🙏 1
Hi! Here are the dumps from my Postgres DB.

daniel

10/03/2021, 9:17 PM
Sweet, thanks - and can I ask the name of the bad schedule that you had to rename?

paul.q

10/03/2021, 10:18 PM
FWIW, we are using 0.12.3, also on the Australia/Sydney TZ, but on Windows. We didn't see the CPU issues reported here, but our (0 3 * * 1-5) schedule ran its pipeline at 02:00 AM and 03:00 AM today (Monday). I will leave the schedule on for now to see what happens tomorrow.

Benoit Perigaud

10/03/2021, 10:21 PM
The one with the issue is golf_pipeline_schedule, which I have now renamed to golf_pipeline_schedule_new.

paul.q

10/03/2021, 10:33 PM
Interestingly, when I select the 2 AM tick in the dagit tick history, it shows the run_id multiple times - normally it only shows once.

daniel

10/03/2021, 11:11 PM
Thanks Paul - if dumping your jobs and job_ticks tables is an option, that'd be another useful data point.

paul.q

10/04/2021, 3:23 AM
Here you go, Daniel - a zip including two table backups, taken via: pg_dump.exe <filename> --verbose --format=p --table "public.jobs"
Our 3AM schedule is behaving properly now without further intervention on our part. So I think there may be something that needs looking into when DST kicks in. My suspicion is that the "next tick" was already committed after the previous run without knowledge of a pending DST change. In our case, there was a Friday 3:00AM run (Oct 1) with a next pending tick of Monday 3:00AM (Oct 4) - treated as 72 hours later.
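A small illustration of the mismatch paul.q suspects, assuming pendulum 2.x - adding a fixed 72-hour duration across the transition no longer lands on the intended wall-clock hour, whereas calendar-aware addition does (this is just date arithmetic, not what dagster actually stores):
import pendulum

# Friday 03:00 Sydney time, two days before the DST start on 2021-10-03.
last_tick = pendulum.datetime(2021, 10, 1, 3, 0, tz="Australia/Sydney")

# An absolute 72-hour duration crosses the transition, so the result drifts
# off the 03:00 wall-clock hour that "0 3 * * 1-5" intends.
print(last_tick.add(hours=72))  # 2021-10-04T04:00:00+11:00

# Calendar-aware addition keeps the local time across the transition.
print(last_tick.add(days=3))    # 2021-10-04T03:00:00+11:00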

daniel

10/04/2021, 9:52 PM
@Benoit Perigaud I was able to verify that the last tick in your job_ticks table for that bad schedule is 2021-10-03 at 01:30 AM, so that's highly suggestive that it was in fact something about the DST transition. However, I'm simulating the transition locally and everything's working OK, so there may be something more subtle going on here. You've already gone above and beyond debugging-wise, but there are a couple of additional things that could help:
• Renaming the schedule one more time to the bad name to make the problem happen again, and then capturing the full logs from the dagster-daemon process. It should have logs like
Checking for new runs for the following schedules:
- that and the following lines would give a lot of insight into where exactly it's going wrong (for example, it might say
No run requests returned for golf_pipeline_schedule, skipping
), which would be instructive.
• Could you share what version of pendulum you have installed as part of your dagster install?
What I see when I simulate it in a test is more or less what I would expect:
2021-10-03 01:34:00 - SchedulerDaemon - INFO - Evaluating schedule `schedule_sydney` at 2021-10-03 01:30:00+1000
2021-10-03 01:34:00 - SchedulerDaemon - INFO - Completed scheduled launch of run 6afeac99-a325-4292-9c5b-7e62cac17cda for schedule_sydney
<30 minutes pass, crossing the DST transition>
2021-10-03 03:04:00 - SchedulerDaemon - INFO - Evaluating schedule `partitionless_schedule_sydney` at 2021-10-03 03:00:00+1100
2021-10-03 03:04:00 - SchedulerDaemon - INFO - Completed scheduled launch of run cb59ae9a-8f25-4e84-8095-d2cca2cc4750 for partitionless_schedule_sydney
But clearly that did not happen in your environment for some reason, very strange.
Would it be possible to share your cron string? (Edit: you posted it earlier, just realized)
@paul.q I do indeed see a very surprising row in your job_ticks table that maps to what you saw in the UI:
295594	0e55b0dc50d412446db9b66354a83a979b494b2d	SUCCESS	SCHEDULE	2021-10-03 15:00:00	{"__class__": "JobTickData", "cursor": null, "error": null, "job_name": "daily_upload_schedule", "job_origin_id": "0e55b0dc50d412446db9b66354a83a979b494b2d", "job_type": {"__enum__": "JobType.SCHEDULE"}, "origin_run_ids": [], "run_ids": ["12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", 
"12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0", "12c36386-04c2-4aa0-a001-8b3bc22ccec0"], "run_keys": ["2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30", "2021-09-30"], "skip_reason": null, "status": {"__enum__": "JobTickStatus.SUCCESS"}, "timestamp": 1633273200.0}	2021-10-03 15:00:02.324713	2021-10-03 15:00:02.324713

Benoit Perigaud

10/04/2021, 10:57 PM
@daniel I'm on pendulum 2.1.2 in my dagster venv. I will share the daemon output a bit later.
Here are the files. One with just stdout and the other with stdout + stderr (which contains warnings about the new API). It looks like the SchedulerDaemon runs only once and then stops/hangs.

daniel

10/05/2021, 2:59 AM
Thanks, that's very helpful. I'm pretty sure I know what function it's stuck in, I just need to figure out why it won't reproduce for me when I pass in the same inputs that you should be seeing.
Aha! Just had a small breakthrough. This only repros in pendulum 2, and I had pendulum 1 installed 🙂
(I posted a script for you to run earlier, but no need, I can finally repro 🙂)
OK! I have a PR that I'm optimistic will fix Benoit's issue: https://github.com/dagster-io/dagster/pull/5021 - based on what I'm seeing, I think it's also likely that pinning pendulum to 1.4.4 will fix it in the short term until we get the fix out. I'm less certain about your issue, Paul - I will need to look into that a bit more. But if you're on pendulum 2, it's possible that PR will help you too.

Benoit Perigaud

10/05/2021, 6:49 AM
Would you want me to try it out, or should I just wait for the next release?

daniel

10/05/2021, 12:26 PM
Up to you, really - I think we have a handle on it now if you want to wait. Thanks so much for all the debugging help.
@paul.q I just verified that either waiting for the linked PR to land or switching to pendulum==1.4.4 should resolve your issue where it fired at 2 AM and 3 AM (still figuring out what happened with the multiple-runs-per-tick issue).

Benoit Perigaud

10/05/2021, 9:06 PM
I'll just wait for the next version then. I'll report here if it works.

paul.q

10/05/2021, 9:40 PM
Thanks Daniel. Oddly I'm on pendulum 1.4.4 locally, but 2.1.2 in our production environment due to a site package version I wasn't aware of. Happy to wait for the next release. Testing will wait another year 🙂

Benoit Perigaud

10/08/2021, 10:21 PM
I moved to 0.12.14 and so far so good 🎉! I renamed my schedule to the original name and the daemon is humming properly and triggering my 15 min runs. Thanks for the help @daniel

daniel

10/08/2021, 10:21 PM
No problem, thank you for raising the alarm before the rest of the world hits DST 🙂