I have an issue that the queuedruncoordinator didnt start at dagster #ask-ai


Jon Erik Kemi Warghed

02/15/2024, 8:11 AM

I have an issue where the QueuedRunCoordinator didn't start at 7:00. For 5:00 and 6:00 I get log entries like this: "2024-02-15 05:01:09 +0000 - dagster.daemon.QueuedRunCoordinatorDaemon - INFO - Launched 1 runs. 2024-02-15 05:01:16 +0000 - dagster.daemon.MonitoringDaemon - INFO - Collected 1 runs for monitoring". For 7 o'clock, however, there is no entry, and there are no errors or warnings in the logs either. This happens daily. We are running a deployment on Kubernetes.

Scout

02/15/2024, 8:11 AM

If the `QueuedRunCoordinatorDaemon` did not start as expected at 7:00 and there are no errors or warnings in the logs, there are a few steps you can take to troubleshoot the issue:

1. Check the Daemon Status in the Dagster UI:
   - Navigate to the Deployment section in the top navigation of the Dagster UI.
   - Click on the Daemons tab to check the status of the `dagster-daemon` process.
   - Each daemon should periodically write a heartbeat to your instance storage. If the `QueuedRunCoordinatorDaemon` doesn't show a recent heartbeat, this could indicate an issue.
2. Check the Logs:
   - Since you are running on Kubernetes, you should check the logs from the pod that's running the `dagster-daemon`.
   - Look for any logs around the time the coordinator should have started (7:00) to see if there are any indications of what might be going wrong.
3. Verify the Scheduler Configuration:
   - Ensure that the scheduler daemon is configured correctly and that it is enabled.
   - Verify that the `QueuedRunCoordinatorDaemon` is set up correctly in your instance configuration.
4. Check for Resource Constraints:
   - Since you are running on Kubernetes, it's possible that resource constraints or scheduling issues could be preventing the daemon from running as expected.
   - Check the Kubernetes events for the `dagster-daemon` pod to see if there are any scheduling issues.
5. Review the Schedule and Sensor Configuration:
   - If the runs are supposed to be launched by a schedule or sensor, review their configuration to ensure they are set up correctly.
   - Check the schedule or sensor code to make sure there are no issues that could prevent runs from being queued.
6. Monitor the System at the Time of the Issue:
   - If the issue occurs daily at the same time, you may want to actively monitor the system at that time to gather more information.
   - This could include checking system metrics, resource usage, and other relevant information that could help identify the cause.

If these steps do not reveal the cause of the issue, you may want to reach out for support with more detailed information about your deployment, configuration, and any other relevant details that could help diagnose the problem.
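
To make the log check less manual, here is a minimal sketch that scans saved daemon log text and reports the hours with no "Launched N runs" entry from the coordinator. It uses only the Python standard library and is not part of Dagster. The function names (`launch_hours`, `missing_hours`) and the sample log are hypothetical: the sample is modeled on the lines quoted in the question, and the regular expression assumes that exact line format, so it may need adjusting for your deployment.

```python
import re

# Matches ANSI colour codes such as "\x1b[32m"; the ESC byte is optional
# because it is often lost when logs are copied and pasted.
ANSI_RE = re.compile(r"\x1b?\[\d+m")

# Assumed line format, taken from the log lines quoted in the question.
# The colons in the time are optional in case they were stripped on paste.
LAUNCH_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):?\d{2}:?\d{2} [+-]\d{4}"
    r" - dagster\.daemon\.QueuedRunCoordinatorDaemon - INFO"
    r" - Launched (?P<count>\d+) runs"
)


def launch_hours(log_text):
    """Return the set of hours (0-23) in which the coordinator logged a launch."""
    hours = set()
    for raw_line in log_text.splitlines():
        line = ANSI_RE.sub("", raw_line)
        match = LAUNCH_RE.search(line)
        if match:
            hours.add(int(match.group("hour")))
    return hours


def missing_hours(log_text, expected_hours):
    """Return the expected hours that have no launch entry, sorted ascending."""
    return sorted(set(expected_hours) - launch_hours(log_text))


# Hypothetical sample: entries for 05:00 and 06:00, nothing for 07:00.
SAMPLE_LOG = (
    "\x1b[32m2024-02-15 05:01:09 +0000\x1b[0m - dagster.daemon.QueuedRunCoordinatorDaemon"
    " - \x1b[34mINFO\x1b[0m - Launched 1 runs.\n"
    "\x1b[32m2024-02-15 05:01:16 +0000\x1b[0m - dagster.daemon.MonitoringDaemon"
    " - \x1b[34mINFO\x1b[0m - Collected 1 runs for monitoring\n"
    "\x1b[32m2024-02-15 06:01:09 +0000\x1b[0m - dagster.daemon.QueuedRunCoordinatorDaemon"
    " - \x1b[34mINFO\x1b[0m - Launched 1 runs.\n"
)

print(missing_hours(SAMPLE_LOG, [5, 6, 7]))  # → [7]
```

You could feed it the output of the `dagster-daemon` pod's logs saved to a file. Note that a missing launch entry only shows that the coordinator dequeued nothing in that hour. It does not distinguish between "no run was queued" (a schedule or sensor issue) and "a run was queued but not launched" (a coordinator issue), so it is a starting point for steps 2 and 5 above rather than a diagnosis.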
