# dagster-feedback
c
We often have cases where jobs aren't correctly terminated in dagster 1.1.21 - anyone having a similar experience?
d
Hi Casper - what run launcher are you using? Can you share more details about the sequence of events / what happens when you try to terminate it cleanly?
c
Eh, the Docker one. I think the cases were runs that were possibly manually cancelled, or something like that... maybe it just hung for 90-ish hours
d
Got it - dunno if you still have logs from the container that received the termination request, but if you do we could take a look and see why the signal wasn't picked up
Having our run monitoring feature apply to runs stuck in CANCELING as well as STARTING could help here too so at least it doesn't stay stuck for 90 hours
c
Traceback (most recent call last):
  File "/usr/local/bin/dagster", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/dagster/_cli/__init__.py", line 46, in main
    cli(auto_envvar_prefix=ENV_PREFIX)  # pylint:disable=E1123
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/dagster/_cli/api.py", line 73, in execute_run_command
    return_code = _execute_run_command_body(
  File "/usr/local/lib/python3.10/site-packages/dagster/_cli/api.py", line 150, in _execute_run_command_body
    instance.report_engine_event(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/instance/__init__.py", line 1877, in report_engine_event
    self.report_dagster_event(dagster_event, run_id=run_id, log_level=log_level)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/instance/__init__.py", line 1901, in report_dagster_event
    self.handle_new_event(event_record)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/instance/__init__.py", line 1816, in handle_new_event
    self._event_storage.store_event(event)
  File "/usr/local/lib/python3.10/site-packages/dagster_postgres/event_log/event_log.py", line 175, in store_event
    with self._connect() as conn:
  File "/usr/local/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/site-packages/dagster_postgres/utils.py", line 166, in create_pg_connection
    conn = retry_pg_connection_fn(engine.connect)
  File "/usr/local/lib/python3.10/site-packages/dagster_postgres/utils.py", line 130, in retry_pg_connection_fn
    raise DagsterPostgresException("too many retries for DB connection") from exc
dagster_postgres.utils.DagsterPostgresException: too many retries for DB connection
There we go - from the actual container:
port 5432 failed: FATAL:  SSL connection is required. Please specify SSL options and retry.
SSL errors normally occur when we have some network issues, e.g. firewall rules, so that might be the reason it broke? But I don't get why it wouldn't eventually get access again, and/or simply mark the run as terminated.
From what I can see in the logs, it posted the events I can see in the UI and then lost the database connection... weirdly.
d
ah interesting - and what state was the run in before you hit terminate the first time?
c
I can't tell from the Docker logs. I can only see that it never received anything from the termination request, so presumably it was just running?
The last message I got in the Docker container (I assume it terminated after that) is 180358, which is the final "too many retries". The last non-error message was:
2023-03-31T18:02:19.595391107Z 2023-03-31 18:02:18 +0000 - dagster - DEBUG - status_job - 86d088fc-d550-4560-9f2f-f9767f7851f1 - raw_dev__status__vis_data_quality - life_cycle_state='PENDING'
whereafter I got the first SSL error at
2023-03-31T18:02:24.595959509Z
- meaning it basically only retries for ~40 seconds before terminating. We normally retry for a few minutes. Also, maybe the Docker run launcher should check whether the specific container has "died".
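To be clear about what I'd expect: a bounded retry with backoff over a few minutes, along these lines. This is just a sketch of the general pattern, not dagster's actual retry code - connect_with_retry and the limits are made up:

import time

def connect_with_retry(connect_fn, max_wait_seconds=300, base_delay=1.0, max_delay=30.0):
    """Retry a flaky connection for up to ~max_wait_seconds with exponential backoff.

    connect_fn is whatever opens the connection (e.g. engine.connect); the names
    and limits here are illustrative only.
    """
    deadline = time.monotonic() + max_wait_seconds
    delay = base_delay
    while True:
        try:
            return connect_fn()
        except Exception:
            # give up only once we'd blow past the deadline, instead of after ~40 seconds
            if time.monotonic() + delay > deadline:
                raise
            time.sleep(delay)
            delay = min(delay * 2, max_delay)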
d
This feature should be able to detect crashed docker containers and spin them down, if you have it enabled in your dagster.yaml: https://docs.dagster.io/deployment/run-monitoring#run-monitoring
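Roughly, something like this in your dagster.yaml turns it on - a minimal sketch based on the docs linked above; the available fields and defaults may differ by Dagster version, so double-check against the docs for yours:

# dagster.yaml
run_monitoring:
  enabled: true
  # how long a run may sit in STARTING before monitoring marks it as failed
  start_timeout_seconds: 180
  # how often the daemon polls in-progress runs
  poll_interval_seconds: 120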
🙏 1
m
We also have been seeing cancelled jobs sticking around for many hours. Using the k8s launcher (step per pod, I can never remember the proper name). Here's an example run that's stuck now: https://formenergy.dagster.cloud/td-production/runs/b25b7ea5-18ae-4459-a9b8-dfd9dd766d28?logFileKey=bkejnqxe . It has a number of steps still pending, with messages like "Dependencies for step X were not executed: Y. Not executing." And some other pending steps say "Deleting Kubernetes job dagster-step-900d3f5361e3f01445289673e02208a9 for step" and when I look for a pod with a similar name to that job I don't see anything (maybe the pod/job never started?).