# deployment-kubernetes
m
My pods are erroring with "429 Client Error: Too Many Requests for url: https://formenergy.agent.dagster.cloud/graphql". I'm running about 70 step pods concurrently across 2 nodes, and each one only takes a couple seconds. I assume this means I'm getting rate limited on the GraphQL API side. Is there a way to get a higher quota?
s
Hi Mark - we are seeing a large spike in the event logs written and these are getting rate limited. Usually `429` responses should auto-retry after a delay and we do not expect failures. This helps spread out the heavy load. Are you seeing failures in your logs, or do the runs eventually succeed? Is it also possible to reduce your usage of `context.log` writes? That is the only request that is hitting the rate limit.
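(For illustration, a minimal sketch of the retry-with-backoff pattern being described here: this is not the actual dagster_cloud client code, and the function name and backoff parameters are made up.)

```python
import time

import requests


# Illustrative sketch only, not the dagster_cloud client's implementation.
# On a 429, wait and retry with exponential backoff so concurrent writers
# spread out their load, and give up after a fixed number of attempts.
def post_with_429_retry(url, payload, max_retries=6, base_delay=1.0):
    for attempt in range(max_retries + 1):
        response = requests.post(url, json=payload)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        time.sleep(base_delay * (2 ** attempt))  # back off before retrying
    raise RuntimeError(f"Max retries ({max_retries}) exceeded, too many 429 responses")
```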
m
The pods are failing with errors. In the pod logs I do see retries, but then they end with 'too many retries'. In theory we could reduce log usage, but not easily. It would be a slow process to comb through our logs and try to find ones which we might not need. And generally we want all logs to be captured, since it's useful to see logs in Dagit. One of the main things we want out of Dagster Cloud is the ability to scale horizontally significantly. I'm worried that this logging bottleneck means we can't scale to more parallel tasks in our job. For example, in the future we might want 2-10x the number of tasks running concurrently -- those are guesses, but basically we want lots of headroom to grow, and that's our main use case for Dagster.
(Related, we've hit some slowness in the Dagit UI due to large log volume before: https://github.com/dagster-io/dagster/issues/7821 . We usually don't have a lot of logs per step, something like 10, but we do have a large fan-out of dynamic outputs, 1-5k depending on stage of the pipeline.)
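(For context, a minimal sketch of that kind of dynamic fan-out; the op and job names here are hypothetical, and the real pipeline is much larger.)

```python
from dagster import DynamicOut, DynamicOutput, job, op


# Hypothetical sketch of a dynamic fan-out. With the k8s executor, each mapped
# step runs in its own short-lived pod, so a few thousand dynamic outputs
# means a few thousand pods, each writing its own structured event logs.
@op(out=DynamicOut())
def emit_tasks():
    for i in range(1000):
        yield DynamicOutput(i, mapping_key=f"task_{i}")


@op
def normalize_task(context, value: int) -> int:
    context.log.info(f"normalizing task {value}")
    return value


@job
def fan_out_job():
    emit_tasks().map(normalize_task)
```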
s
> In theory we could reduce log usage, but not easily. It would be a slow process to comb through our logs and try to find ones which we might not need. And generally we want all logs to be captured, since it's useful to see logs in Dagit.
Sometimes we see accidental heavy log usage (e.g. dumping large data in a tight loop). Good to know this is not the case and these logs are actually useful. Note that scaling the log writes (and scaling the database in general) is an active project we are working on and a key priority for us over the next couple of months. For this specific job - I noticed that the spike of log writes stopped before 9am Pacific time today. Was this a one-off job or a scheduled job? If we increase the rate limit, are you able to retry this job?
m
Thanks -- yes, I can retry. This is a development job I'm not running on a schedule, but it is full scale / representative of runs that we'll be doing on a daily schedule. I'm working on making sure our full daily runs succeed in the cloud environment before we finally flip the switch from on-prem to Dagster Cloud, so these are just the kinds of errors I'm hoping to flush out.
d
Compute logs can also help here if you have a high volume of logs that you want displayed in Dagit - those have much higher limits at play.
m
What's the difference? Or, how would I issue a compute log instead of an event log?
Ah, OK. Yes, I think it would be fine to move our application logs over to just be visible in raw compute logs. I'm not sure what the change to the `python_logs` config would look like. What I want is (a) all logs go to the root logger, (b) the root logger has our two custom handlers attached. I guess I would replace the `managed_python_loggers` with a console handler?
```yaml
python_logs:
  python_log_level: INFO
  managed_python_loggers:
    # Capture logs from all Python loggers into Dagster's UI.
    # I assume this is what we'd change to go to compute logs,
    # but we still want context.log calls to funnel into the root logger.
    - root
  # Specify additional handlers. We want these to get all the log messages
  # from context.log and application logs / the Python root logger.
  dagster_handler_config:
    handlers:
      combined_json:
        class: processing.common.log.JsonFileHandler
        level: INFO
      otel_span:
        class: form_observability.OtelSpanEventHandler
        level: INFO
```
d
I don't think you need to do anything special to make things show up in compute logs as long as they're appearing in stdout/stderr. In fact I explicitly anti-recommend capturing the root logger - that will send things into the structured part, which is what's getting rate limited.
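(A rough sketch of the distinction, using a hypothetical op: stdout/stderr output is captured as compute logs, while `context.log` calls and any logger listed under `managed_python_loggers` flow through the structured event log, which is the rate-limited path.)

```python
import logging

from dagster import op

logging.basicConfig(level=logging.INFO)  # plain stdlib logging setup
app_logger = logging.getLogger("my_app")  # hypothetical application logger


@op
def example_op(context):
    # These go to stdout/stderr, so they show up as compute logs
    # (much higher limits), as long as "my_app"/root is not managed.
    print("a raw stdout line")
    app_logger.info("a plain Python logger line")

    # This goes through Dagster's structured event log, i.e. the
    # rate-limited GraphQL writes discussed above.
    context.log.info("a structured event log line")
```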
m
So Dagster will still attach my `JsonFileHandler` and `OtelSpanEventHandler` to the root logger even if it's not managed?
d
Er, I don't think it will - `dagster_handler_config` only applies to `context.log` calls / logs that are managed by Dagster. I can check if there's a way to change the logging config for all logs, even logs that aren't going into the structured Dagster logs.
If Dagster no longer needs to be in the mix for those handlers, is that something where you could use the regular Python logging configuration APIs?
m
Kinda. For our application logs, it sounds like I could switch from using `context.log` to some other `logging.getLogger(..)`. A big find-replace, but not hard. And then I would just attach these handlers in our own setup (maybe in a resource). There are some Dagster framework logs that are nice to see alongside the log messages we write, so I guess we might want to also attach the handlers via `dagster_handler_config` for those. It would be too bad not to see the logs interleaved normally in the compute log output, though. Overall, it sounds like switching to compute logs is a potential workaround, but if we're not really abusing the log system and raising the event log quota is possible, that would be easier and result in a bit more streamlined product experience.
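(A sketch of that direction, assuming the handler classes from the `python_logs` config above can be constructed without arguments; the resource name is made up, and this isn't necessarily the right place to do the setup.)

```python
import logging

from dagster import resource

# Handler classes taken from the dagster_handler_config above; their real
# constructor arguments may differ, so treat this as a sketch.
from processing.common.log import JsonFileHandler
from form_observability import OtelSpanEventHandler


@resource
def app_logging(_init_context):
    """Attach the custom handlers to the root logger with stdlib APIs,
    instead of routing everything through managed_python_loggers."""
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    # Guard against adding duplicate handlers if the resource is
    # initialized more than once in the same process.
    if not any(isinstance(h, (JsonFileHandler, OtelSpanEventHandler)) for h in root.handlers):
        for handler in (JsonFileHandler(), OtelSpanEventHandler()):
            handler.setLevel(logging.INFO)
            root.addHandler(handler)
    return root
```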
d
I certainly don't think you're abusing it in the sense of any kind of intentional abuse - but I think we have to make some improvements on our side before we can support high volumes of runs with tens or hundreds of thousands of structured log lines at scale.
Mark, do you have an example run where 429s failed a run? I wouldn't expect that to happen for a `context.log` failure.
d
Are you sure that the 429s are what's failing the Kubernetes jobs? I see a lot of "Discovered failed Kubernetes job", but I actually wouldn't expect the 429s to be what made them fail (very possible I'm mistaken).
m
When I described the failed pods, it looked like they had failed with exit status 1, and the pod logs ended with a 'too many retries for status 429', which is why I interpreted the 429s as the cause of the failure.
d
That does sound fairly conclusive
I’ll see if we can reproduce that
m
For example, step `data_pipe_graph_sasquatch_anode.raw_data_graph._normalize_task[0ue9r_plus_25]` ends with a "discovered failed pod", but right before that I see a "too many retries":
```
An exception was thrown during step execution that is likely a framework error, rather than an error in user code.
dagster_cloud_cli.core.errors.GraphQLStorageError: Max retries (6) exceeded, too many 429 error responses.

Stack Trace:
  File "/usr/local/lib/python3.10/site-packages/dagster/_cli/api.py", line 441, in _execute_step_command_body
    yield from execute_plan_iterator(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/api.py", line 1190, in __iter__
    yield from self.iterator(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_plan.py", line 114, in inner_plan_execution_iterator
    for step_event in check.generator(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_plan.py", line 333, in dagster_event_sequence_for_step
    yield step_failure_event_from_exc_info(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/objects.py", line 122, in step_failure_event_from_exc_info
    return DagsterEvent.step_failure_event(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/events/__init__.py", line 802, in step_failure_event
    return DagsterEvent.from_step(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/events/__init__.py", line 418, in from_step
    log_step_event(step_context, event)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/events/__init__.py", line 297, in log_step_event
    step_context.log.log_dagster_event(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/log_manager.py", line 387, in log_dagster_event
    self.log(level=level, msg=msg, extra={DAGSTER_META_KEY: dagster_event})
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/log_manager.py", line 402, in log
    self._log(level, msg, args, **kwargs)
  File "/usr/local/lib/python3.10/logging/__init__.py", line 1624, in _log
    self.handle(record)
  File "/usr/local/lib/python3.10/logging/__init__.py", line 1634, in handle
    self.callHandlers(record)
  File "/usr/local/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
    hdlr.handle(record)
  File "/data-pipeline/orchestration/log.py", line 96, in handle
    super().handle(record)
  File "/usr/local/lib/python3.10/logging/__init__.py", line 968, in handle
    self.emit(record)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/log_manager.py", line 267, in emit
    handler.handle(dagster_record)
  File "/usr/local/lib/python3.10/logging/__init__.py", line 968, in handle
    self.emit(record)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/instance/__init__.py", line 181, in emit
    self._instance.handle_new_event(event)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/instance/__init__.py", line 1587, in handle_new_event
    self._event_storage.store_event(event)
  File "/usr/local/lib/python3.10/site-packages/dagster_cloud/storage/event_logs/storage.py", line 371, in store_event
    self._execute_query(
  File "/usr/local/lib/python3.10/site-packages/dagster_cloud/storage/event_logs/storage.py", line 248, in _execute_query
    res = self._graphql_client.execute(
  File "/usr/local/lib/python3.10/site-packages/dagster_cloud_cli/core/graphql_client.py", line 111, in execute
    raise GraphQLStorageError(

The above exception was caused by the following exception:
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://formenergy.agent.dagster.cloud/graphql

Stack Trace:
  File "/usr/local/lib/python3.10/site-packages/dagster_cloud_cli/core/graphql_client.py", line 72, in execute
    return self._execute_retry(query, variable_values, headers)
  File "/usr/local/lib/python3.10/site-packages/dagster_cloud_cli/core/graphql_client.py", line 157, in _execute_retry
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)

11:36:00.058  Step data_pipe_graph_sasquatch_anode.raw_data_graph._normalize_task[0ue9r_plus_25] failed health check: Discovered failed Kubernetes job dagster-step-e5abc3ae8aad5ae2b7ebf2295ce0fde4 for step data_pipe_graph_sasquatch_anode.raw_data_graph._normalize_task[0ue9r_plus_25].
```
I confirmed that removing `root` from the `managed_python_loggers` allowed the run to complete successfully. (But I haven't done the other cleanup / verification to make sure the logs get to where we need them otherwise.)
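(For reference, a sketch of what the adjusted `python_logs` block might look like with `root` removed from `managed_python_loggers`; whether the `dagster_handler_config` handlers should stay depends on the handler-attachment question above.)

```yaml
python_logs:
  python_log_level: INFO
  # `root` removed from managed_python_loggers: application loggers now
  # surface only in compute logs (stdout/stderr), not the structured event log.
  dagster_handler_config:
    handlers:
      combined_json:
        class: processing.common.log.JsonFileHandler
        level: INFO
      otel_span:
        class: form_observability.OtelSpanEventHandler
        level: INFO
```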
Hope y'all had great holidays! Looking for next steps here - what's your recommendation? Sounds like the options are:
• Increase the quota for `context.log` calls to accommodate the big fan-out (easiest for me!)
• Dagster-side infrastructure improvements to handle larger log volume (what's the timeline on this?)
• Switch to compute logs; I'd need to figure out how to change our log configuration to attach handlers in the right place, and switch `context.log` calls to plain Python logger calls; some reduction in utility of the Dagit UI for searching logs (not my preference, but may be fastest)
d
Hi Mark! I think the last bullet is the way to go for now - log volume improvements are still a ways out, and when I looked at the volume it was high enough that I don't think the fix was as simple as just increasing the quota.