# ask-community
m
Hello, has anyone experienced issues with logs not properly getting routed to the AWS log drivers via stdout for runs launched via the `EcsRunLauncher`, and in a seemingly nondeterministic way? We’re experiencing a problem where logs that would normally be sent to stdout and captured by our observability tooling in our other ECS-based services are not properly captured when running in a Dagster-launched task, even though they are successfully captured by the Dagster logging handler. Strangely, the issue seems to be related to the execution length of individual ops: when an op is artificially lengthened with `time.sleep`, logs tend to be captured at a higher rate. More details in thread 👇🧵
Here is an example of a job that exhibits this behavior when launched in ECS:
```python
import logging
from time import sleep
from typing import Generator

from dagster import (
    DynamicOut,
    DynamicOutput,
    fs_io_manager,
    job,
    in_process_executor,
    op,
)


logger = logging.getLogger("our-logger")


@op(out=DynamicOut())
def num_helloworlds(num_steps: int) -> Generator[DynamicOutput, None, None]:
    for i in range(num_steps):
        yield DynamicOutput(None, mapping_key=str(i))


@op
def log_and_sleep(result: None, num_rounds_per_step: int, sleep_len: float) -> None:
    for i in range(num_rounds_per_step):
        for level in (
            logging.DEBUG,
            logging.INFO,
            logging.WARNING,
            logging.ERROR,
            logging.CRITICAL,
        ):
            logger.log(level, f"hello world {i}")
            sleep(sleep_len)


@job(
    resource_defs={"io_manager": fs_io_manager},
    executor_def=in_process_executor,
    tags={
        "ecs/cpu": "2048",
        "ecs/memory": "4096",
    },
)
def helloworld(num_steps: int, num_rounds_per_step: int, sleep_len: float) -> None:
    helloworld_output = num_helloworlds(num_steps)
    helloworld_output.map(
        lambda dynamicout: log_and_sleep(dynamicout, num_rounds_per_step, sleep_len)
    )
```
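In case it helps to reproduce, this can be run locally with something roughly like the following (just a sketch, assuming a Dagster version recent enough to support top-level job inputs and the `input_values` argument to `execute_in_process`; the numbers are arbitrary):
```python
# Rough local smoke test for the job above; input values are arbitrary.
if __name__ == "__main__":
    result = helloworld.execute_in_process(
        input_values={
            "num_steps": 3,
            "num_rounds_per_step": 2,
            "sleep_len": 0.1,
        }
    )
    assert result.success
```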
Here is our in-house Python logging configuration, which would normally route all logs from `our-logger` to the container’s stdout for collection by the AWS log drivers. It is passed to `logging.config.dictConfig` when a process is launched:
```python
INFRA_LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "json": {
            "()": "pythonjsonlogger.jsonlogger.JsonFormatter",
            "format": """
                    asctime: %(asctime)s
                    created: %(created)f
                    filename: %(filename)s
                    funcName: %(funcName)s
                    levelname: %(levelname)s
                    levelno: %(levelno)s
                    lineno: %(lineno)d
                    message: %(message)s
                    module: %(module)s
                    msec: %(msecs)d
                    name: %(name)s
                    pathname: %(pathname)s
                    process: %(process)d
                    processName: %(processName)s
                    relativeCreated: %(relativeCreated)d
                    thread: %(thread)d
                    threadName: %(threadName)s
                    exc_info: %(exc_info)s
                """,
            "datefmt": "%Y-%m-%d %H:%M:%S",
        },
    },
    "handlers": {
        "stdout": {
            "class": "logging.StreamHandler",
            "stream": "<ext://sys.stdout>",
            "formatter": "json",
        },
    },
    "loggers": {
        "dagster": {"level": "DEBUG", "handlers": ["stdout"]},
        "dagit": {"level": "DEBUG", "handlers": ["stdout"]},
        "": {"level": "INFO", "handlers": ["stdout"]},
    },
}
```
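For completeness, this gets applied once at process startup along these lines (the import path for the config here is made up):
```python
# Applied once at process startup; the config's module path is illustrative.
import logging
import logging.config

from our_package.logging_config import INFRA_LOGGING_CONFIG  # hypothetical module

logging.config.dictConfig(INFRA_LOGGING_CONFIG)

logger = logging.getLogger("our-logger")
logger.info("logging configured; this should show up on stdout as JSON")
```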
p
Do you have a compute log manager configured? If you are grabbing logs using a sidecar, you may want to bypass the dagster-controlled log capture by configuring the `NoOpComputeLogManager`: https://docs.dagster.io/dagster-cloud/deployment/agents/customizing-configuration#disabling-compute-logs
By default, I think there’s some subprocess that routes output both to stdout and to a file, but it can hit some timing issues depending on the OS and the particulars of the run process
m
(for completeness—here’s the part of our Dagster config that controls run launches and logging)
```yaml
run_launcher:
  module: dagster_aws.ecs
  class: EcsRunLauncher
  config:
    include_sidecars: true
    task_definition: <our task definition ARN>
    container_name: staging-workflows-runs

python_logs:
  managed_python_loggers:
    - our-logger
```
And the behavior we observed was that logs would semi-nondeterministically get collected and routed to other sources via stdout (the typical path for ECS log routing), but would get picked up much more reliably when we launched a run with a large `sleep_len`, seemingly giving the steps more time to flush logs to stdout before finishing.
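One debugging idea on our side (just a sketch, not something we have relied on yet) is to dump which handlers are actually attached to `our-logger` from inside an op, to confirm the stdout `StreamHandler` is still in place alongside whatever Dagster attaches for `managed_python_loggers`:
```python
import logging

from dagster import op


@op
def inspect_logger_handlers() -> None:
    # Walk up the logger hierarchy so handlers inherited via propagation
    # (e.g. the root logger's stdout handler) are visible too.
    current = logging.getLogger("our-logger")
    while current is not None:
        print(
            f"logger={current.name or 'root'!r} handlers={current.handlers}",
            flush=True,
        )
        current = current.parent if current.propagate else None
```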
Thanks for the suggestion @prha—we can take a look at this as a start
p
Yeah, I’d bet that switching the compute log config would help with the reliability… On Unix systems I think there’s a child process that invokes `tee`, and on Windows it’s some polling interval that should flush.
We’d have to dig deeper into what’s going on in your case, but bypassing the compute log capture will at least not muck with the file handles in the OS
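For intuition, that kind of tee-style capture looks very roughly like the sketch below — purely an illustration of the general mechanism, not Dagster’s actual implementation (the capture path is made up):
```python
# Illustration only: redirect this process's stdout through `tee` so output
# goes both to a capture file and to the original stdout.
import os
import subprocess
import sys

capture_path = "/tmp/compute.log"  # made-up path

tee = subprocess.Popen(["tee", capture_path], stdin=subprocess.PIPE)
saved_stdout_fd = os.dup(1)
os.dup2(tee.stdin.fileno(), 1)  # fd 1 now points at tee's stdin

print("this line goes to both the capture file and the original stdout")

# If the process exits before flushing and letting tee drain, trailing
# output can be lost -- the sort of timing issue mentioned above.
sys.stdout.flush()
os.dup2(saved_stdout_fd, 1)
tee.stdin.close()
tee.wait()
```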
b
So, reviving Matt’s thread here! I’m picking up where he left off - we are using `NoOpComputeLogManager`, and while it does help, it doesn’t help with all of it:
```yaml
compute_logs:
  module: dagster.core.storage.noop_compute_log_manager
  class: NoOpComputeLogManager
```
For this op I expected to see up to 49 hello worlds sent to Datadog, and for some reason we are getting up to 28. They all share the same timestamp on Datadog, `Sep 12 12:42:41.370`, and were likely in the same batch. Log 29 in the Dagster logger starts at `12:42:41.347`, and we lose all logs until `12:42:41.932`. We do have the logs for the skipped hook and the final execution logs on Datadog:
```
2022-09-12 15:42:42 +0000 - dagster - DEBUG - helloworld_job - c4dee54d-e328-449c-aed3-969e394138c2 - hello_world - HOOK_SKIPPED - Skipped the execution of hook "store_exit_code". It did not meet its triggering condition during the execution of "hello_world".
2022-09-12 15:42:42 +0000 - dagster - DEBUG - helloworld_job - c4dee54d-e328-449c-aed3-969e394138c2 - 1 - ENGINE_EVENT - Finished steps in process (pid: 1) in 1.82s
2022-09-12 15:42:42 +0000 - dagster - DEBUG - helloworld_job - c4dee54d-e328-449c-aed3-969e394138c2 - 1 - RUN_SUCCESS - Finished execution of run for "helloworld_job".
```
And they are actually dated `Sep 12 12:42:42.089`, roughly a second after the last log was sent by the job. They also come from the same container, so the theory that the problem is in the log forwarder is losing strength. The SIGTERM to the aws-for-fluent-bit container was sent afterwards, too:
```
[2022/09/12 15:42:43] [engine] caught signal (SIGTERM)
[2022/09/12 15:42:43] [ info] [input] pausing forward.2
```
p
It’s strange that the logger is losing batches in the middle (because you’re getting the run termination event logs) and that everything is happening on the same container. Do you get the same behavior with cloudwatch? This isn’t something I’ve seen before…
b
From what we’ve tested before, yes! We lose the logs on CloudWatch too. It looks like the `stdout` handler is not handling those logs well, while the `dagster` handler is dealing with them properly.
One question @prha - is it possible that, even using `NoOpComputeLogManager` but still using `fs_io_manager`, we could be hitting the timing issues you mentioned in this thread?
p
@Bianca Rosa I don’t think so. The `fs_io_manager` is responsible for reading inputs / writing outputs to disk. The compute log manager is more about routing stdout/stderr to disk (and potentially uploading it somewhere, like S3). With the noop compute log manager, we don’t touch stdout/stderr at all, so the Datadog sidecar you have set up should work as intended. If you’re still running into issues, I wonder if there’s some debugging you could do with your AWS rep in terms of the container lifecycle? I saw this issue on SO (https://stackoverflow.com/questions/54656906/missing-log-lines-when-writing-to-cloudwatch-from-ecs-docker-containers) that doesn’t sound like exactly what you’re seeing, but maybe there’s a similar flushing issue that you’re hitting?
b
I didn’t think that was the case, because we have the container dependencies properly set and we don’t see this in any of our other apps, so I spent the day hooking our app into Datadog locally and was able to reproduce it in a local, containerized environment.
I’m losing roughly 10 logs on stdout, near the end of the execution of the hello_world job, and those logs do get sent to the Dagster UI.
Oh wait - one last thing to check: enable `NoOpComputeLogManager` locally
Ha! With `NoOpComputeLogManager` the behavior in fact doesn’t continue locally 😬
Update here for you @prha - indeed `NoOpComputeLogManager` was the way to go. I wonder if there is a way to make it explicit/default in the docs/code for ECS deployments. Our issue was made much harder by the fact that workflows_user_code was not picking up the config changes, and we went down a rabbit hole that was hard to get out of because we needed to update all three images with it, but our Dockerfile was not copying the right config into `dagster.yaml` for the workflows_user_code container. Anyway - things were quite hard to debug, but eventually we got there.