# dagster-feedback
c
We often have cases where jobs aren't correctly terminated in dagster 1.1.21 - anyone having a similar experience?
d
Hi Casper - what run launcher are you using? Can you share more details about the sequence of events / what happens when you try to terminate it cleanly?
c
Eh, the Docker one. I think the cases were runs that were possibly manually cancelled, or something like that... maybe it just hung for 90-ish hours
d
Got it - dunno if you still have logs from the container that received the termination request, but if you do we could take a look and see why the signal wasn't picked up
Having our run monitoring feature apply to runs stuck in CANCELING as well as STARTING could help here too so at least it doesn't stay stuck for 90 hours
c
Traceback (most recent call last):
  File "/usr/local/bin/dagster", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/dagster/_cli/__init__.py", line 46, in main
    cli(auto_envvar_prefix=ENV_PREFIX)  # pylint:disable=E1123
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/dagster/_cli/api.py", line 73, in execute_run_command
    return_code = _execute_run_command_body(
  File "/usr/local/lib/python3.10/site-packages/dagster/_cli/api.py", line 150, in _execute_run_command_body
    instance.report_engine_event(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/instance/__init__.py", line 1877, in report_engine_event
    self.report_dagster_event(dagster_event, run_id=run_id, log_level=log_level)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/instance/__init__.py", line 1901, in report_dagster_event
    self.handle_new_event(event_record)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/instance/__init__.py", line 1816, in handle_new_event
    self._event_storage.store_event(event)
  File "/usr/local/lib/python3.10/site-packages/dagster_postgres/event_log/event_log.py", line 175, in store_event
    with self._connect() as conn:
  File "/usr/local/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/site-packages/dagster_postgres/utils.py", line 166, in create_pg_connection
    conn = retry_pg_connection_fn(engine.connect)
  File "/usr/local/lib/python3.10/site-packages/dagster_postgres/utils.py", line 130, in retry_pg_connection_fn
    raise DagsterPostgresException("too many retries for DB connection") from exc
dagster_postgres.utils.DagsterPostgresException: too many retries for DB connection
There we go - from the actual container:
port 5432 failed: FATAL:  SSL connection is required. Please specify SSL options and retry.
SSL errors normally occur when we have some network issues, e.g. firewall rules, so that might be the reason it broke? But I don't get why it wouldn't eventually get access again, and/or simply mark the run as terminated.
From what I can see in the logs, it posted the events I can see in the UI and then lost the database connection... weirdly.
d
ah interesting - and what state was the run in before you hit terminate the first time?
c
I can't tell from the Docker logs. I can only see that it never received anything from the termination request, so presumably it was just running?
The last message I got in the Docker container (I assume it terminated after that) is 180358, which is the final "too many retries". The last non-error message was:
2023-03-31T18:02:19.595391107Z 2023-03-31 18:02:18 +0000 - dagster - DEBUG - status_job - 86d088fc-d550-4560-9f2f-f9767f7851f1 - raw_dev__status__vis_data_quality - life_cycle_state='PENDING'
whereafter I got the first SSL error at
2023-03-31T18:02:24.595959509Z
- meaning it basically only retries for ~40 seconds before terminating. We normally retry for a few minutes. Also, maybe the Docker run launcher should check whether the specific container has "died".
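To be clear about what I'd expect: a bounded retry with backoff over a few minutes, along these lines. This is just a sketch of the general pattern, not dagster's actual retry code - connect_with_retry and the limits are made up:

import time

def connect_with_retry(connect_fn, max_wait_seconds=300, base_delay=1.0, max_delay=30.0):
    """Retry a flaky connection for up to ~max_wait_seconds with exponential backoff.

    connect_fn is whatever opens the connection (e.g. engine.connect); the names
    and limits here are illustrative only.
    """
    deadline = time.monotonic() + max_wait_seconds
    delay = base_delay
    while True:
        try:
            return connect_fn()
        except Exception:
            # give up only once we'd blow past the deadline, instead of after ~40 seconds
            if time.monotonic() + delay > deadline:
                raise
            time.sleep(delay)
            delay = min(delay * 2, max_delay)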
d
This feature should be able to detect crashed docker containers and spin them down, if you have it enabled in your dagster.yaml: https://docs.dagster.io/deployment/run-monitoring#run-monitoring
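Roughly, something like this in your dagster.yaml turns it on - a minimal sketch based on the docs linked above; the available fields and defaults may differ by Dagster version, so double-check against the docs for yours:

# dagster.yaml
run_monitoring:
  enabled: true
  # how long a run may sit in STARTING before monitoring marks it as failed
  start_timeout_seconds: 180
  # how often the daemon polls in-progress runs
  poll_interval_seconds: 120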
🙏 1
m
We also have been seeing cancelled jobs sticking around for many hours. Using the k8s launcher (step per pod, I can never remember the proper name). Here's an example run that's stuck now: https://formenergy.dagster.cloud/td-production/runs/b25b7ea5-18ae-4459-a9b8-dfd9dd766d28?logFileKey=bkejnqxe . It has a number of steps still pending, with messages like "Dependencies for step X were not executed: Y. Not executing." And some other pending steps say "Deleting Kubernetes job dagster-step-900d3f5361e3f01445289673e02208a9 for step" and when I look for a pod with a similar name to that job I don't see anything (maybe the pod/job never started?).