# ask-community
y
Hi, team. I want to ask a question about the Dagster daemon. We have installed the application (dagster/dagit) without containerization (bare metal, as a systemd daemon service). dagit.service constantly uses 1-1.5 GB of memory, while dagster-daemon.service keeps growing: at initialization/restart it consumes 1.8 GB, but some days later it reaches 6.6-8 GB. Is this expected?
Specs: dagit/dagster 1.0.9, GCP n1 VM.
This is the config YAML:
```yaml
telemetry:
  enabled: false

run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      username: xxxxx
      password: xxxxxx
      hostname: x.x.x.x
      db_name: xxxx
      port: xxxx

event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      username: xxxxx
      password: xxxxxx
      hostname: x.x.x.x
      db_name: xxxxxx
      port: xxx

schedule_storage:
  module: dagster_postgres.schedule_storage
  class: PostgresScheduleStorage
  config:
    postgres_db:
      username: xxxxxx
      password: xxxxxx
      hostname: x.x.x.x
      db_name: xxxxx
      port: xxxx

run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 30
    tag_concurrency_limits:
      - key: "dagster/priority"
        limit: 15
      - key: "job"
        value: "airbyte"
        limit: 5
      - key: "kind"
        value: "airbyte"
        limit: 5
      - key: "job"
        value: "dbt_asset_job"
        limit: 1

retention:
  schedule:
    purge_after_days: 60 # retain schedule ticks of all types for the latest 60 days
  sensor:
    purge_after_days:
      skipped: 7
      failure: 30
      success: 60 # retain success ticks for the latest 60 days
```
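For reference on how those tag_concurrency_limits take effect: the QueuedRunCoordinator matches them against the tags on each queued run, so they only throttle jobs that actually carry the corresponding tags. A minimal sketch of a job that would count against the airbyte limits above (the job and op names are hypothetical):

```python
from dagster import job, op

@op
def trigger_sync():
    # Placeholder for the actual sync work.
    ...

# Hypothetical job: the "job" and "kind" tags below are what the
# QueuedRunCoordinator compares against tag_concurrency_limits in
# dagster.yaml, so at most 5 runs tagged this way will be in
# progress at the same time.
@job(tags={"job": "airbyte", "kind": "airbyte"})
def airbyte_sync_job():
    trigger_sync()
```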
d
The most likely reason I can think of for this to happen would be a memory leak in a run or sensor. The main thing we recommend to help prevent user-code issues from affecting the system as a whole is to use containerization and a run launcher, so that code servers and runs each happen in their own isolated task or container
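To make that concrete, here is a hypothetical sketch (not taken from this deployment) of the kind of user-code leak that can inflate memory over days: module-level state that grows on every sensor tick. In a non-containerized setup, sensor evaluations run in a long-lived local code server spawned under the daemon service, so under systemd its memory can be accounted to dagster-daemon.service.

```python
from dagster import RunRequest, sensor

_seen = []  # module-level cache: never pruned, so it grows on every tick

@sensor(job_name="airbyte_sync_job")  # hypothetical job name
def leaky_sensor(context):
    # Stand-in for fetching new records from an external API each tick.
    new_rows = [{"id": len(_seen) + i} for i in range(100)]
    _seen.extend(new_rows)  # leak: the list survives between evaluations
    for row in new_rows:
        yield RunRequest(run_key=str(row["id"]))
```

The usual fix is to track progress with the sensor cursor (context.cursor / context.update_cursor) instead of accumulating state in the process, but isolating code servers as described above still limits the blast radius when something like this slips through.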
y
Ok, thanks for that @daniel. Maybe also because of that, the dagster-daemon components (SCHEDULER, SENSOR, BACKFILL, and QUEUED_RUN_COORDINATOR) often go down and need a manual restart. After digging into the components, I got this log:
```
<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1686486246.742005053","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1686486246.742003835","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>
Unable to connect to gRPC server: 0
```
Possible root causes:
1. The daemon doesn't send the heartbeat within the timeout interval (default 60 seconds), so the code server (gRPC) is shut down due to the timeout. https://docs.dagster.io/deployment/dagster-instance#grpc-servers
2. A resource limit is reached and the service is restarted, but not gracefully, so some components fail to spin up. (Looking at the journalctl/systemd status logs: we first spun up Dagster a month ago, but the service reports it started a week ago, which means it restarted automatically? The unit config is attached below.)
3. There is a hidden sync mechanism between the components (dagit, the Dagster gRPC server, the daemon) that produces resource-heavy operations, such as re-calling the code server on every load or during certain processes. (I am not sure about this, but I found something related in the latest bugfixes: https://docs.dagster.io/changelog#bugfixes)
```ini
[Unit]
Description=Dagster Daemon Service
Wants=network-online.target
After=network-online.target

[Service]
User=ubuntu
Group=ubuntu
Type=simple
Environment="DAGSTER_HOME=/opt/dagster/dagster_home"
WorkingDirectory=/opt/dagster/repo_sync
ExecStart=dagster-daemon run
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
```
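When the UNAVAILABLE error above appears, one way to tell "the code server is actually down" apart from "the daemon timed out talking to it" is to ping the server directly from the same host. A sketch, assuming the DagsterGrpcClient health check shown in Dagster's deployment docs (note the module is private in 1.x, and the port here is hypothetical):

```python
from dagster._grpc.client import DagsterGrpcClient

# Raises if the gRPC code server is unreachable; returns the echoed
# string if the server is up and responding.
DagsterGrpcClient(port=4266, host="localhost").ping("health-check")
```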
Temporary solutions:
1. Increase the code server timeout (to 240 seconds).
2. Change the deployment to use k8s.
d
My two recommendations are:
• Try upgrading to the latest Dagster version - as you noted, there have been some recent bugfixes related to the daemon reloading code.
• Try running each code server in its own pod/container/task using one of our recommended deployment schemes (e.g. the Dagster Helm chart).
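For the second recommendation, the code-server side of that isolation is the same repository code, just served as its own long-running process instead of being spawned by dagit and the daemon. A hypothetical minimal module (repo.py) that such a server would load:

```python
from dagster import job, op, repository

@op
def noop():
    ...

@job
def my_job():
    noop()

# A standalone gRPC server serves everything this repository returns.
@repository
def repo_sync():
    return [my_job]
```

It would be started with something like `dagster api grpc --python-file repo.py --host 0.0.0.0 --port 4266` and registered in workspace.yaml as a grpc_server entry, so dagit and the daemon connect to it rather than launching their own copies.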