#ask-community

Craig Morris

07/18/2023, 4:28 PM
Hello. We have a sensor that looks at files in a Google Cloud Storage bucket and refreshes them when they have expired. It runs every minute (minimum_interval_seconds=60) and simply checks the metadata of each file in the bucket. Once it finds an expired file, it marks it as updateInProgress and then yields a RunRequest that will update the file. We are seeing more and more runs fail to start with the message “Ignoring a run worker that started after the run had already finished”. The UI does not show me that message; instead it directs me to run a kubectl command, which is how I found it. Dagster is running on a GKE cluster, deployed with the Helm chart at version 1.3.5. The code is built against Python client 1.3.13. (Guidance appreciated on the Python client version vs. the deployed version, i.e. is the code being ahead of the deployed version a problem?) Any guidance on the above error is appreciated, as I don't see any references to it when I search Google.
I understand the message is telling me the job finished before it started, but I see nothing in the logs indicating that this happened. How does it identify the job? I thought maybe it used run_key, so I set that to a UUID, which should make it unique for every RunRequest.
Full log message from the UI:
Run timed out due to taking longer than 300 seconds to start.
Debug information for pod dagster-run-62fa0cb3-526d-4662-b69f-22953c2efb5e-d4ngr:

Pod status: Running
Container 'dagster' status: Ready

No logs in pod.

No warning events for pod.
For more information about the failure, try running `kubectl describe pod dagster-run-62fa0cb3-526d-4662-b69f-22953c2efb5e-d4ngr`, `kubectl logs dagster-run-62fa0cb3-526d-4662-b69f-22953c2efb5e-d4ngr`, or `kubectl describe job dagster-run-62fa0cb3-526d-4662-b69f-22953c2efb5e` in your cluster.
Output from the kubectl logs command:
> kubectl logs dagster-run-62fa0cb3-526d-4662-b69f-22953c2efb5e-d4ngr
{"__class__": "DagsterEvent", "event_specific_data": {"__class__": "EngineEventData", "error": null, "marker_end": null, "marker_start": null, "metadata_entries": []}, "event_type_value": "ENGINE_EVENT", "logging_tags": {}, "message": "Ignoring a run worker that started after the run had already finished.", "pid": null, "pipeline_name": "sql_cache_updater", "solid_handle": null, "step_handle": null, "step_key": null, "step_kind_value": null}
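For reference, the sensor flow described in this message can be sketched roughly as below. This is a minimal, dependency-free illustration and not the poster's actual code: `FileMeta`, `expired_run_requests`, and the dictionary shape are hypothetical stand-ins, and in a real Dagster sensor each yielded item would be a `RunRequest` carrying the same `run_key`.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, Iterator


@dataclass
class FileMeta:
    """Hypothetical stand-in for the metadata of one object in the bucket."""
    name: str
    expires_at: datetime
    update_in_progress: bool = False


def expired_run_requests(files: Iterable[FileMeta], now: datetime) -> Iterator[Dict]:
    """Yield one request per expired file that is not already being updated."""
    for f in files:
        if f.update_in_progress or f.expires_at > now:
            continue  # still fresh, or a refresh was already requested
        f.update_in_progress = True  # mirrors the updateInProgress marker
        # A fresh UUID per request, as in the post; a real sensor would
        # yield RunRequest(run_key=..., ...) here instead of a dict.
        yield {"run_key": str(uuid.uuid4()), "tags": {"file": f.name}}


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    bucket = [
        FileMeta("fresh.sql", now + timedelta(hours=1)),
        FileMeta("stale.sql", now - timedelta(hours=1)),
    ]
    for request in expired_run_requests(bucket, now):
        print(request)
```

Note that a unique `run_key` only stops the sensor from deduplicating requests; it has no bearing on how long the launched pod takes to start.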

alex

07/20/2023, 5:57 PM
It is taking longer than the default timeout of 300 seconds (https://github.com/dagster-io/dagster/blame/master/helm/dagster/values.yaml#L1100) for your Kubernetes cluster to start the pod for the job. One possibility is that your cluster is overloaded, which is preventing the pod from being scheduled in time. The error message in the kubectl logs is emitted when the pod finally does come up and sees that the run has already been marked as failed by the run monitoring daemon. You can change the timeout via your Helm values if there is an expected reason for pod startup to take so long.
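The sequence described in this reply can be illustrated with a toy model. This is only a sketch of the ordering of events, not Dagster's implementation: `Run`, `monitor_tick`, and `worker_start` are hypothetical names, and the statuses are simplified. In the Helm chart the timeout is, to my understanding, set under `dagsterDaemon.runMonitoring.startTimeoutSeconds`; check the values.yaml for your chart version before relying on that key.

```python
from dataclasses import dataclass

START_TIMEOUT_SECONDS = 300  # default start timeout quoted in the reply

FINISHED = {"SUCCESS", "FAILURE", "CANCELED"}


@dataclass
class Run:
    """Hypothetical, simplified record of a launched run."""
    run_id: str
    launched_at: float  # seconds, on any monotonic clock
    status: str = "STARTING"


def monitor_tick(run: Run, now: float, timeout: float = START_TIMEOUT_SECONDS) -> None:
    """Fail a run that has been waiting to start for longer than the timeout."""
    if run.status == "STARTING" and now - run.launched_at > timeout:
        run.status = "FAILURE"


def worker_start(run: Run) -> str:
    """Called when the pod finally comes up; a finished run is left untouched."""
    if run.status in FINISHED:
        return "Ignoring a run worker that started after the run had already finished."
    run.status = "STARTED"
    return "Run started."


if __name__ == "__main__":
    # The pod needs 400 seconds to be scheduled, e.g. while a node spins up.
    slow = Run("example-run", launched_at=0.0)
    monitor_tick(slow, now=301.0)
    print(worker_start(slow))

    # The same delay with a longer timeout lets the run start normally.
    patient = Run("example-run", launched_at=0.0)
    monitor_tick(patient, now=301.0, timeout=600.0)
    print(worker_start(patient))
```

With the default timeout the monitor fails the run before the worker arrives, which matches the pair of messages in the original post; raising the timeout above the expected pod startup time avoids that ordering.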

Craig Morris

07/21/2023, 2:52 PM
OK, gotcha. Really appreciate the help. We try to run k8s lean and have tons of cron jobs starting and stopping, so it is likely attempting to spin up a node in order to run the job. Will extend the timeout.