# ask-community
a
Hi, I did not find anything related in the Slack Archives so here is my problem (using 1.1.9 deployed on K8S): my Backfill daemon does not send heartbeats anymore and is in “Not Running” status. I found the following daemon logs:
Traceback (most recent call last):
  File "/usr/local/bin/dagster-daemon", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/cli/__init__.py", line 127, in main
    cli(obj={})  # pylint:disable=E1123
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/cli/__init__.py", line 43, in run_command
    _daemon_run_command(instance, kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/telemetry.py", line 110, in wrap
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/cli/__init__.py", line 55, in _daemon_run_command
    controller.check_daemon_loop()
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/controller.py", line 268, in check_daemon_loop
    self.check_daemon_heartbeats()
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/controller.py", line 239, in check_daemon_heartbeats
    raise Exception("Stopped dagster-daemon process due to thread heartbeat failure")
Exception: Stopped dagster-daemon process due to thread heartbeat failure
And:
Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['BACKFILL']
Shutting down daemon threads...
Thread for BACKFILL did not shut down gracefully.
For more context, it happened after I tried to launch a backfill of a few runs. How can I revert the daemon to a stable state?
🤖 1
d
Hi alexis - would it be possible to share the full logs from your daemon? This error indicates that the backfill daemon failed, which means the root cause of the issue likely happened before this message
a
Before this message, I only have logs of Dagster starting and the SchedulerDaemon / SensorDaemon doing some of their regular checks before the BackfillDaemon crash. All of this happens in the 2 minutes following the container start. I will try to do an export asap
logs.txt
d
and this happens reliably for you every time you run the daemon?
a
Yes, it happens on every restart of the container without exception
d
Any chance you could run py-spy dump in the container during the period of time when it's starting up? https://github.com/benfred/py-spy/blob/master/README.md#how-do-i-run-py-spy-in-docker Surprised that there's no logging if the thread is seemingly failing, but usually py-spy is helpful for understanding what's going on at the individual thread level
(If it's on k8s it's a bit trickier and often requires adding that securityContext field that they mention to the daemon pod: https://github.com/benfred/py-spy/blob/master/README.md#how-do-i-run-py-spy-in-kubernetes, but we have done it successfully in the past)
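A minimal sketch of that securityContext change in the Helm values.yaml, assuming the chart version in use exposes a container-level securityContext for the daemon under dagsterDaemon.securityContext (the key name is an assumption; patching the daemon Deployment directly achieves the same thing):

dagsterDaemon:
  securityContext:          # assumed key - check your chart's reference
    capabilities:
      add:
        - SYS_PTRACE        # lets py-spy attach to the running daemon process

Once the pod restarts with that capability (and with py-spy available in the container), running py-spy dump --pid 1 from a shell inside it (PID 1 assuming dagster-daemon is the container entrypoint) prints a stack trace for every thread.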
Do you recall what exactly happened between the last time this was working and when the problem started happening?
Were these asset backfills / how many partitions were being backfilled?
a
Unfortunately I can't ship the py-spy as this problem is occurring only on our production cluster and we do not want to put py-spy in production. I will try to reproduce the steps in another environment if I manage to do so. The only thing that happened was my attempt to backfill some partitions this morning. It was around 20-30 partitions of a massively partitioned asset (around 10k partitions for this asset, I think). After I filled the partitions in the partition selection modal, the message requesting a backfill appeared with its id, and the daemon stopped working a few moments after that. I remember spotting this issue because I was not seeing any runs being queued.
d
Is there any chance you'd be able to share the code of the partitioned asset with the body of the op removed? We could try on our end to see if we can reproduce the problem. Is there a chance that the daemon is hitting a memory or CPU limit / is that something you're able to monitor for spikes during that minute before it goes down?
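One way to make that kind of resource pressure visible is to pin explicit requests/limits on the daemon pod. A sketch for values.yaml, assuming the chart exposes a standard Kubernetes resources block under dagsterDaemon.resources (the numbers are placeholders, not recommendations):

dagsterDaemon:
  resources:
    requests:
      cpu: 500m          # placeholder values - size these to your workload
      memory: 1Gi
    limits:
      memory: 2Gi        # with a limit set, a memory spike shows up as an OOMKilled pod status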
We do have some performance improvements for large partitioned assets coming out in the release today, but I'm struggling to match them with the specific symptoms that you're seeing of the daemon thread crashing (if you were seeing timeouts or heartbeat failures, then sure - but I would not expect the thread to die)
a
Here is the asset with its partition:
# Imports added for completeness; "settings", "RequestParameters", and the "api"
# resource are project-specific and assumed to be defined elsewhere in the codebase
# (DataFrame is assumed to be pandas).
from datetime import datetime

from pandas import DataFrame

from dagster import TimeWindowPartitionsDefinition, asset

fifteen_minute_partitions = TimeWindowPartitionsDefinition(
    cron_schedule="*/15 * * * *",
    start=datetime(2022, 1, 1, 0, 0, 0),
    fmt="%Y-%m-%d %H:%M",
    timezone="Europe/Paris",
)


@asset(
    group_name="group_name",
    io_manager_key=settings.FS_IO_MANAGER,
    key_prefix="some_prefix",
    partitions_def=fifteen_minute_partitions,
    required_resource_keys={"api"},
)
def tmp_asset(context) -> DataFrame:
    """Some desc."""
    columns = [
        "field1",
        "field2",
    ]
    # Time window covered by the partition being materialized
    start_dt, end_dt = context.output_asset_partitions_time_window()
    req_params = RequestParameters(
        ...
    )
    return context.resources.api.fetch(context, req_params, columns)
s
@Alexis Manuel did you kick off the backfill from the asset graph or from the asset job page?
a
@sandy I honestly don't remember - I think it was from the asset graph, where I selected this particular asset out of the graph. @daniel I just checked and there was no problem with CPU or memory in the cluster during that period.
d
Oh, you know what, I misread the error message. The thread isn't dying, it's just taking a (very) long time to run. That is much less mysterious
I have a short-term workaround that may help here while we sort this out - if you set the DAGSTER_DAEMON_HEARTBEAT_TOLERANCE env var on your daemon pod to some larger number of seconds (say 7200), it will allow the backfill daemon to take longer to heartbeat without bringing down the whole daemon
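A sketch of that workaround in the Helm values.yaml, assuming the chart version in use accepts a map of plain key/value pairs under dagsterDaemon.env (some chart versions expect a Kubernetes-style env list instead, so treat the exact format as an assumption):

dagsterDaemon:
  env:
    DAGSTER_DAEMON_HEARTBEAT_TOLERANCE: "7200"   # seconds a daemon thread may go without heartbeating before the process is stopped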
b
Jumping on this thread because I am hardening my Kube deployment too. Daniel, was your prognosis to tweak the HEARTBEAT_TOLERANCE due to the error message "Stopped dagster-daemon process due to thread heartbeat failure"?
d
that's a short-term workaround... it will 'harden' it in the sense that it will leave the other threads running for longer when one of them runs into an issue
the env var in question is DAGSTER_DAEMON_HEARTBEAT_TOLERANCE
b
sorry, yes, I noted the full name but typed the short one here 😛. And just out of curiosity, why do you call it short term? Do you envision that there's a more "appropriate" fix?
d
Yeah, the appropriate fix would be to squash whatever is causing the backfill daemon to hang - but it's a good question whether we should have the daemon keep the other threads running when one of them runs into issues - seems like that could be a configurable setting at least
b
that would only work robustly if there was a built-in self-healing mechanism of some sort, otherwise that particular backfill thread will just keep erroring
d
today setting DAGSTER_DAEMON_HEARTBEAT_TOLERANCE to the largest number you can think of would essentially do that
👍 1
That's right - I think the thinking was that on some kind of transient failure we wouldn't want to leave the scheduler down forever. I think putting each daemon in its own pod would probably give us the best of both worlds here (at the expense of more pods/resources used)
b
good point, i think the scale of the subsequent partition count would justify some kind of load distribution on the daemon
d
@Alexis Manuel do you have this set to a certain value in your Helm chart values.yaml? I'm a little confused why it's giving up so quickly
dagsterDaemon:
  heartbeatTolerance:
OK, after digging into this a bit more, I think I understand why it's giving up so quickly, and there's a fix out for that issue - it appears to be specific to asset backfills. We think that some perf improvements coming out in 1.1.11 later today will help - and separately, setting that heartbeatTolerance value to a very high number may help as well. Thanks for reporting and bearing with us while we sort this out
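For reference, the same tolerance expressed through the chart setting shown above, using the 7200-second figure mentioned earlier in the thread:

dagsterDaemon:
  heartbeatTolerance: 7200   # seconds; raise it further if backfill evaluation still takes longer than this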
a
I will try to update to 1.1.11 to see if it resolves the problem, and if not I will go for the DAGSTER_DAEMON_HEARTBEAT_TOLERANCE env variable trick. Thanks for your amazing support!
Updating to 1.1.11 solved it and the daemon is now back to its normal state 🎉
🎉 1