# deployment-kubernetes
m
My pods are erroring with "429 Client Error: Too Many Requests for url: https://formenergy.agent.dagster.cloud/graphql". I'm running about 70 step pods concurrently across 2 nodes, and each one only takes a couple seconds. I assume this means I'm getting rate limited on the GraphQL API side. Is there a way to get a higher quota?
s
Hi Mark - we are seeing a large spike in the event logs written and these are getting rate limited. Usually `429` responses should auto-retry after a delay and we do not expect failures. This helps spread out the heavy load. Are you seeing failures in your logs, or do the runs eventually succeed? Is it also possible to reduce your usage of `context.log` writes? That is the only request that is hitting the rate limit.
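(For illustration, a minimal sketch of the retry-with-backoff pattern being described here: this is not the actual dagster_cloud client code, and the function name and backoff parameters are made up.)

```python
import time

import requests


# Illustrative sketch only, not the dagster_cloud client's implementation.
# On a 429, wait and retry with exponential backoff so concurrent writers
# spread out their load, and give up after a fixed number of attempts.
def post_with_429_retry(url, payload, max_retries=6, base_delay=1.0):
    for attempt in range(max_retries + 1):
        response = requests.post(url, json=payload)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        time.sleep(base_delay * (2 ** attempt))  # back off before retrying
    raise RuntimeError(f"Max retries ({max_retries}) exceeded, too many 429 responses")
```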
m
The pods are failing with errors. In the pod logs I do see retries, but then they end with 'too many retries'. In theory we could reduce log usage, but not easily. It would be a slow process to comb through our logs and try to find ones which we might not need. And generally we want all logs to be captured, since it's useful to see logs in Dagit. One of the main things we want out of Dagster Cloud is the ability to scale horizontally significantly. I'm worried that this logging bottleneck means we can't scale to more parallel tasks in our job. For example, in the future we might want 2-10x the number of tasks running concurrently -- those are guesses, but basically we want lots of headroom to grow, and that's our main use case for Dagster.
(Related, we've hit some slowness in the Dagit UI due to large log volume before: https://github.com/dagster-io/dagster/issues/7821 . We usually don't have a lot of logs per step, something like 10, but we do have a large fan-out of dynamic outputs, 1-5k depending on stage of the pipeline.)
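(For context, a minimal sketch of that kind of dynamic fan-out; the op and job names here are hypothetical, and the real pipeline is much larger.)

```python
from dagster import DynamicOut, DynamicOutput, job, op


# Hypothetical sketch of a dynamic fan-out. With the k8s executor, each mapped
# step runs in its own short-lived pod, so a few thousand dynamic outputs
# means a few thousand pods, each writing its own structured event logs.
@op(out=DynamicOut())
def emit_tasks():
    for i in range(1000):
        yield DynamicOutput(i, mapping_key=f"task_{i}")


@op
def normalize_task(context, value: int) -> int:
    context.log.info(f"normalizing task {value}")
    return value


@job
def fan_out_job():
    emit_tasks().map(normalize_task)
```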
s
> In theory we could reduce log usage, but not easily. It would be a slow process to comb through our logs and try to find ones which we might not need. And generally we want all logs to be captured, since it's useful to see logs in Dagit.
Sometimes we see accidental heavy log usage (e.g. dumping large data in a tight loop). Good to know this is not the case and these logs are actually useful. Note that scaling the log writes (and scaling the database in general) is an active project we are working on and a key priority for us over the next couple of months. For this specific job - I noticed that the spike of log writes stopped before 9am Pacific time today. Was this a one-off job or a scheduled job? If we increase the rate limit, are you able to retry this job?
m
Thanks -- yes, I can retry. This is a development job I'm not running on a schedule, but it is full scale / representative of runs that we'll be doing on a daily schedule. I'm working on making sure our full daily runs succeed in the cloud environment before we finally flip the switch from on-prem to Dagster Cloud, so these are just the kinds of errors I'm hoping to flush out.
d
Compute logs can also help here if you have a high volume of logs that you want displayed in Dagit - those have much higher limits at play.
m
What's the difference? Or, how would I issue a compute log instead of an event log?
Ah, OK. Yes, I think it would be fine to move our application logs over to just be visible in raw compute logs. I'm not sure what the change to the `python_logs` config would look like. What I want is (a) all logs go to the root logger, (b) the root logger has our two custom handlers attached. I guess I would replace the `managed_python_loggers` with a console handler?
```yaml
python_logs:
  python_log_level: INFO
  managed_python_loggers:
    # Capture logs from all Python loggers into Dagster's UI.
    # I assume this is what we'd change to go to compute logs,
    # but we still want context.log calls to funnel into the root logger.
    - root
  # Specify additional handlers. We want these to get all the log messages
  # from context.log and application logs / the Python root logger.
  dagster_handler_config:
    handlers:
      combined_json:
        class: processing.common.log.JsonFileHandler
        level: INFO
      otel_span:
        class: form_observability.OtelSpanEventHandler
        level: INFO
```
d
I don't think you need to do anything special to make things show up in compute logs as long as they're appearing in stdout/stderr. In fact I explicitly anti-recommend capturing the root logger - that will send things into the structured part, which is what's getting rate limited.
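(A rough sketch of the distinction, using a hypothetical op: stdout/stderr output is captured as compute logs, while `context.log` calls and any logger listed under `managed_python_loggers` flow through the structured event log, which is the rate-limited path.)

```python
import logging

from dagster import op

logging.basicConfig(level=logging.INFO)  # plain stdlib logging setup
app_logger = logging.getLogger("my_app")  # hypothetical application logger


@op
def example_op(context):
    # These go to stdout/stderr, so they show up as compute logs
    # (much higher limits), as long as "my_app"/root is not managed.
    print("a raw stdout line")
    app_logger.info("a plain Python logger line")

    # This goes through Dagster's structured event log, i.e. the
    # rate-limited GraphQL writes discussed above.
    context.log.info("a structured event log line")
```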
m
So Dagster will still attach my `JsonFileHandler` and `OtelSpanEventHandler` to the root logger even if it's not managed?
d
Er, I don't think it will - `dagster_handler_config` only applies to `context.log` calls / logs that are managed by Dagster. I can check if there's a way to change the logging config for all logs, even logs that aren't going into the structured Dagster logs.
If Dagster no longer needs to be in the mix for those handlers, is that something where you could use the regular Python logging configuration APIs?
m
Kinda. For our application logs, it sounds like I could switch from using `context.log` to some other `logging.getLogger(..)`. A big find-replace, but not hard. And then I would just attach these handlers in our own setup (maybe in a resource). There are some Dagster framework logs that are nice to see alongside the log messages we write, so I guess we might want to also attach the handlers via `dagster_handler_config` for those. It would be too bad not to see the logs interleaved normally in the compute log output, though. Overall, it sounds like switching to compute logs is a potential workaround, but if we're not really abusing the log system and raising the event log quota is possible, that would be easier and result in a bit more streamlined product experience.
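(A sketch of that direction, assuming the handler classes from the `python_logs` config above can be constructed without arguments; the resource name is made up, and this isn't necessarily the right place to do the setup.)

```python
import logging

from dagster import resource

# Handler classes taken from the dagster_handler_config above; their real
# constructor arguments may differ, so treat this as a sketch.
from processing.common.log import JsonFileHandler
from form_observability import OtelSpanEventHandler


@resource
def app_logging(_init_context):
    """Attach the custom handlers to the root logger with stdlib APIs,
    instead of routing everything through managed_python_loggers."""
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    # Guard against adding duplicate handlers if the resource is
    # initialized more than once in the same process.
    if not any(isinstance(h, (JsonFileHandler, OtelSpanEventHandler)) for h in root.handlers):
        for handler in (JsonFileHandler(), OtelSpanEventHandler()):
            handler.setLevel(logging.INFO)
            root.addHandler(handler)
    return root
```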
d
I certainly don't think you're abusing it in the sense of any kind of intentional abuse - but I think we have to make some improvements on our side before we can support high volumes of runs with tens or hundreds of thousands of structured log lines at scale.
Mark, do you have an example run where 429s failed a run? I wouldn't expect that to happen for a `context.log` failure.
d
Are you sure that the 429s are what's failing the Kubernetes jobs? I see a lot of "Discovered failed Kubernetes job", but I actually wouldn't expect the 429s to be what made them fail (very possible I'm mistaken).
m
When I described the failed pods, it looked like they had failed with exit status 1, and the pod logs ended with a 'too many retries for status 429', which is why I interpreted the 429s as the cause of the failure.
d
That does sound fairly conclusive
I’ll see if we can reproduce that
m
For example, step `data_pipe_graph_sasquatch_anode.raw_data_graph._normalize_task[0ue9r_plus_25]` ends with a "discovered failed pod", but right before that I see a "too many retries":
```
An exception was thrown during step execution that is likely a framework error, rather than an error in user code.
dagster_cloud_cli.core.errors.GraphQLStorageError: Max retries (6) exceeded, too many 429 error responses.

Stack Trace:
  File "/usr/local/lib/python3.10/site-packages/dagster/_cli/api.py", line 441, in _execute_step_command_body
    yield from execute_plan_iterator(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/api.py", line 1190, in __iter__
    yield from self.iterator(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_plan.py", line 114, in inner_plan_execution_iterator
    for step_event in check.generator(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_plan.py", line 333, in dagster_event_sequence_for_step
    yield step_failure_event_from_exc_info(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/objects.py", line 122, in step_failure_event_from_exc_info
    return DagsterEvent.step_failure_event(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/events/__init__.py", line 802, in step_failure_event
    return DagsterEvent.from_step(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/events/__init__.py", line 418, in from_step
    log_step_event(step_context, event)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/events/__init__.py", line 297, in log_step_event
    step_context.log.log_dagster_event(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/log_manager.py", line 387, in log_dagster_event
    self.log(level=level, msg=msg, extra={DAGSTER_META_KEY: dagster_event})
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/log_manager.py", line 402, in log
    self._log(level, msg, args, **kwargs)
  File "/usr/local/lib/python3.10/logging/__init__.py", line 1624, in _log
    self.handle(record)
  File "/usr/local/lib/python3.10/logging/__init__.py", line 1634, in handle
    self.callHandlers(record)
  File "/usr/local/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
    hdlr.handle(record)
  File "/data-pipeline/orchestration/log.py", line 96, in handle
    super().handle(record)
  File "/usr/local/lib/python3.10/logging/__init__.py", line 968, in handle
    self.emit(record)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/log_manager.py", line 267, in emit
    handler.handle(dagster_record)
  File "/usr/local/lib/python3.10/logging/__init__.py", line 968, in handle
    self.emit(record)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/instance/__init__.py", line 181, in emit
    self._instance.handle_new_event(event)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/instance/__init__.py", line 1587, in handle_new_event
    self._event_storage.store_event(event)
  File "/usr/local/lib/python3.10/site-packages/dagster_cloud/storage/event_logs/storage.py", line 371, in store_event
    self._execute_query(
  File "/usr/local/lib/python3.10/site-packages/dagster_cloud/storage/event_logs/storage.py", line 248, in _execute_query
    res = self._graphql_client.execute(
  File "/usr/local/lib/python3.10/site-packages/dagster_cloud_cli/core/graphql_client.py", line 111, in execute
    raise GraphQLStorageError(

The above exception was caused by the following exception:
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://formenergy.agent.dagster.cloud/graphql

Stack Trace:
  File "/usr/local/lib/python3.10/site-packages/dagster_cloud_cli/core/graphql_client.py", line 72, in execute
    return self._execute_retry(query, variable_values, headers)
  File "/usr/local/lib/python3.10/site-packages/dagster_cloud_cli/core/graphql_client.py", line 157, in _execute_retry
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)

11:36:00.058  Step data_pipe_graph_sasquatch_anode.raw_data_graph._normalize_task[0ue9r_plus_25] failed health check: Discovered failed Kubernetes job dagster-step-e5abc3ae8aad5ae2b7ebf2295ce0fde4 for step data_pipe_graph_sasquatch_anode.raw_data_graph._normalize_task[0ue9r_plus_25].
```
I confirmed that removing `root` from the `managed_python_loggers` allowed the run to complete successfully. (But I haven't done the other cleanup / verification to make sure the logs get to where we need them otherwise.)
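(For reference, a sketch of what the adjusted `python_logs` block might look like with `root` removed from `managed_python_loggers`; whether the `dagster_handler_config` handlers should stay depends on the handler-attachment question above.)

```yaml
python_logs:
  python_log_level: INFO
  # `root` removed from managed_python_loggers: application loggers now
  # surface only in compute logs (stdout/stderr), not the structured event log.
  dagster_handler_config:
    handlers:
      combined_json:
        class: processing.common.log.JsonFileHandler
        level: INFO
      otel_span:
        class: form_observability.OtelSpanEventHandler
        level: INFO
```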
Hope y'all had great holidays! Looking for next steps here - what's your recommendation? Sounds like the options are:
• Increase the quota for `context.log` calls to accommodate the big fan-out (easiest for me!)
• Dagster-side infrastructure improvements to handle larger log volume (what's the timeline on this?)
• Switch to compute logs; I'd need to figure out how to change our log configuration to attach handlers in the right place, and switch `context.log` calls to plain Python logger calls; some reduction in utility of the Dagit UI for searching logs (not my preference, but may be fastest)
d
Hi Mark! I think the last bullet is the way to go for now - log volume improvements are still a ways out, and when I looked at the volume it was high enough that I don't think the fix was as simple as just increasing the quota.