# ask-community
a
Hi, I did not find anything related in the Slack Archives so here is my problem (using 1.1.9 deployed on K8S): my Backfill daemon does not send heartbeats anymore and is in “Not Running” status. I found the following daemon logs:
Traceback (most recent call last):
  File "/usr/local/bin/dagster-daemon", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/cli/__init__.py", line 127, in main
    cli(obj={})  # pylint:disable=E1123
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/cli/__init__.py", line 43, in run_command
    _daemon_run_command(instance, kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/telemetry.py", line 110, in wrap
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/cli/__init__.py", line 55, in _daemon_run_command
    controller.check_daemon_loop()
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/controller.py", line 268, in check_daemon_loop
    self.check_daemon_heartbeats()
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/controller.py", line 239, in check_daemon_heartbeats
    raise Exception("Stopped dagster-daemon process due to thread heartbeat failure")
Exception: Stopped dagster-daemon process due to thread heartbeat failure
And:
Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['BACKFILL']
Shutting down daemon threads...
Thread for BACKFILL did not shut down gracefully.
For more context, it happened after I tried to launch a backfill of a few runs. How can I revert the daemon to a stable state?
🤖 1
d
Hi alexis - would it be possible to share the full logs from your daemon? This error indicates that the backfill daemon failed, which means the root cause of the issue likely happened before this message
a
Before this message, I only have logs of Dagster starting and the SchedulerDaemon / SensorDaemon doing some of their regular checks before the BackfillDaemon crash. All of this happens in the 2 minutes following the container start. I will try to do an export asap
logs.txt
d
and this happens reliably for you every time you run the daemon?
a
Yes, it happens on every restart of the container without exception
d
Any chance you could run py-spy dump in the container during the period of time when it's starting up? https://github.com/benfred/py-spy/blob/master/README.md#how-do-i-run-py-spy-in-docker Surprised that there's no logging if the thread is seemingly failing, but usually py-spy is helpful for understanding what's going on at the individual thread level
(If it's on k8s it's a bit trickier and often requires adding that securityContext field that they mention to the daemon pod: https://github.com/benfred/py-spy/blob/master/README.md#how-do-i-run-py-spy-in-kubernetes, but we have done it successfully in the past)
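A minimal sketch of that securityContext change in the Helm values.yaml, assuming the chart version in use exposes a container-level securityContext for the daemon under dagsterDaemon.securityContext (the key name is an assumption; patching the daemon Deployment directly achieves the same thing):

dagsterDaemon:
  securityContext:          # assumed key - check your chart's reference
    capabilities:
      add:
        - SYS_PTRACE        # lets py-spy attach to the running daemon process

Once the pod restarts with that capability (and with py-spy available in the container), running py-spy dump --pid 1 from a shell inside it (PID 1 assuming dagster-daemon is the container entrypoint) prints a stack trace for every thread.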
Do you recall what exactly happened between the last time this was working and when the problem started happening?
Were these asset backfills / how many partitions were being backfilled?
a
Unfortunately I can't ship the py-spy as this problem is occurring only on our production cluster and we do not want to put py-spy in production. I will try to reproduce the steps in another environment if I manage to do so. The only thing that happened was my attempt to backfill some partitions this morning. It was around 20-30 partitions of a massively partitioned asset (around 10k partitions for this asset, I think). After I filled the partitions in the partition selection modal, the message requesting a backfill appeared with its id, and the daemon stopped working a few moments after that. I remember spotting this issue because I was not seeing any runs being queued.
d
Is there any chance you'd be able to share the code of the partitioned asset with the body of the op removed? We could try on our end to see if we can reproduce the problem. Is there a chance that the daemon is hitting a memory or CPU limit / is that something you're able to monitor for spikes during that minute before it goes down?
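One way to make that kind of resource pressure visible is to pin explicit requests/limits on the daemon pod. A sketch for values.yaml, assuming the chart exposes a standard Kubernetes resources block under dagsterDaemon.resources (the numbers are placeholders, not recommendations):

dagsterDaemon:
  resources:
    requests:
      cpu: 500m          # placeholder values - size these to your workload
      memory: 1Gi
    limits:
      memory: 2Gi        # with a limit set, a memory spike shows up as an OOMKilled pod status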
We do have some performance improvements for large partitioned assets coming out in the release today, but I'm struggling to match them with the specific symptoms that you're seeing of the daemon thread crashing (if you were seeing timeouts or heartbeat failures, then sure - but I would not expect the thread to die)
a
Here is the asset with its partition:
# Imports added for completeness; "settings", "RequestParameters", and the "api"
# resource are project-specific and assumed to be defined elsewhere in the codebase
# (DataFrame is assumed to be pandas).
from datetime import datetime

from pandas import DataFrame

from dagster import TimeWindowPartitionsDefinition, asset

fifteen_minute_partitions = TimeWindowPartitionsDefinition(
    cron_schedule="*/15 * * * *",
    start=datetime(2022, 1, 1, 0, 0, 0),
    fmt="%Y-%m-%d %H:%M",
    timezone="Europe/Paris",
)


@asset(
    group_name="group_name",
    io_manager_key=settings.FS_IO_MANAGER,
    key_prefix="some_prefix",
    partitions_def=fifteen_minute_partitions,
    required_resource_keys={"api"},
)
def tmp_asset(context) -> DataFrame:
    """Some desc."""
    columns = [
        "field1",
        "field2",
    ]
    # Time window covered by the partition being materialized
    start_dt, end_dt = context.output_asset_partitions_time_window()
    req_params = RequestParameters(
        ...
    )
    return context.resources.api.fetch(context, req_params, columns)
s
@Alexis Manuel did you kick off the backfill from the asset graph or from the asset job page?
a
@sandy I honestly don't remember - I think it was from the asset graph, where I selected this particular asset out of the graph. @daniel I just checked and there was no problem with CPU or memory in the cluster during that period.
d
Oh, you know what, I misread the error message. The thread isn't dying, it's just taking a (very) long time to run. That is much less mysterious
I have a short-term workaround that may help here while we sort this out - if you set the DAGSTER_DAEMON_HEARTBEAT_TOLERANCE env var on your daemon pod to some larger number of seconds (say 7200), it will allow the backfill daemon to take longer to heartbeat without bringing down the whole daemon
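A sketch of that workaround in the Helm values.yaml, assuming the chart version in use accepts a map of plain key/value pairs under dagsterDaemon.env (some chart versions expect a Kubernetes-style env list instead, so treat the exact format as an assumption):

dagsterDaemon:
  env:
    DAGSTER_DAEMON_HEARTBEAT_TOLERANCE: "7200"   # seconds a daemon thread may go without heartbeating before the process is stopped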
b
Jumping on this thread because I am hardening my Kube deployment too. Daniel, was your prognosis to tweak the HEARTBEAT_TOLERANCE due to the error message "Stopped dagster-daemon process due to thread heartbeat failure"?
d
that's a short-term workaround... it will 'harden' it in the sense that it will leave the other threads running for longer when one of them runs into an issue
the env var in question is DAGSTER_DAEMON_HEARTBEAT_TOLERANCE
b
sorry, yes, I noted the full name but typed the short one here 😛. And just out of curiosity, why do you call it short term? Do you envision that there's a more "appropriate" fix?
d
Yeah, the appropriate fix would be to squash whatever is causing the backfill daemon to hang - but it's a good question whether we should have the daemon keep the other threads running when one of them runs into issues - seems like that could be a configurable setting at least
b
that would only work robustly if there was a built-in self-healing mechanism of some sort, otherwise that particular backfill thread will just keep erroring
d
today setting DAGSTER_DAEMON_HEARTBEAT_TOLERANCE to the largest number you can think of would essentially do that
👍 1
That's right - I think the thinking was that on some kind of transient failure we wouldn't want to leave the scheduler down forever. I think putting each daemon in its own pod would probably give us the best of both worlds here (at the expense of more pods/resources used)
b
good point, i think the scale of the subsequent partition count would justify some kind of load distribution on the daemon
d
@Alexis Manuel do you have this set to a certain value in your Helm chart values.yaml? I'm a little confused why it's giving up so quickly
dagsterDaemon:
  heartbeatTolerance:
OK, after digging into this a bit more, I think I understand why it's giving up so quickly, and there's a fix out for that issue - it appears to be specific to asset backfills. We think that some perf improvements coming out in 1.1.11 later today will help - and separately, setting that heartbeatTolerance value to a very high number may help as well. Thanks for reporting and bearing with us while we sort this out
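For reference, the same tolerance expressed through the chart setting shown above, using the 7200-second figure mentioned earlier in the thread:

dagsterDaemon:
  heartbeatTolerance: 7200   # seconds; raise it further if backfill evaluation still takes longer than this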
a
I will try to update to 1.1.11 to see if it resolves the problem, and if not I will go for the DAGSTER_DAEMON_HEARTBEAT_TOLERANCE env variable trick. Thanks for your amazing support!
Updating to 1.1.11 solved it and the daemon is now back to its normal state 🎉
🎉 1