# ask-community
m
Hello, has anyone experienced issues with logs not properly getting routed to the AWS log drivers via stdout for runs launched via the `EcsRunLauncher`, and in a seemingly nondeterministic way? We’re experiencing a problem where logs that would normally be sent to stdout and captured by our observability tooling in our other ECS-based services are not properly captured when running in a Dagster-launched task, even though they are successfully captured by the Dagster logging handler. Strangely, the issue seems to be related to the execution length of individual ops: when an op is artificially lengthened with `time.sleep`, logs tend to be captured at a higher rate. More details in thread 👇🧵
Here is an example of a job that exhibits this behavior when launched in ECS:
```python
import logging
from time import sleep
from typing import Generator

from dagster import (
    DynamicOut,
    DynamicOutput,
    fs_io_manager,
    job,
    in_process_executor,
    op,
)


logger = logging.getLogger("our-logger")


@op(out=DynamicOut())
def num_helloworlds(num_steps: int) -> Generator[DynamicOutput, None, None]:
    for i in range(num_steps):
        yield DynamicOutput(None, mapping_key=str(i))


@op
def log_and_sleep(result: None, num_rounds_per_step: int, sleep_len: float) -> None:
    for i in range(num_rounds_per_step):
        for level in (
            logging.DEBUG,
            logging.INFO,
            logging.WARNING,
            logging.ERROR,
            logging.CRITICAL,
        ):
            logger.log(level, f"hello world {i}")
            sleep(sleep_len)


@job(
    resource_defs={"io_manager": fs_io_manager},
    executor_def=in_process_executor,
    tags={
        "ecs/cpu": "2048",
        "ecs/memory": "4096",
    },
)
def helloworld(num_steps: int, num_rounds_per_step: int, sleep_len: float) -> None:
    helloworld_output = num_helloworlds(num_steps)
    helloworld_output.map(
        lambda dynamicout: log_and_sleep(dynamicout, num_rounds_per_step, sleep_len)
    )
```
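In case it helps to reproduce, this can be run locally with something roughly like the following (just a sketch, assuming a Dagster version recent enough to support top-level job inputs and the `input_values` argument to `execute_in_process`; the numbers are arbitrary):
```python
# Rough local smoke test for the job above; input values are arbitrary.
if __name__ == "__main__":
    result = helloworld.execute_in_process(
        input_values={
            "num_steps": 3,
            "num_rounds_per_step": 2,
            "sleep_len": 0.1,
        }
    )
    assert result.success
```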
Here is our in-house Python logging configuration, which would normally route all logs from `our-logger` to the container’s stdout for collection by the AWS log drivers. It is passed to `logging.config.dictConfig` when a process is launched:
```python
INFRA_LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "json": {
            "()": "pythonjsonlogger.jsonlogger.JsonFormatter",
            "format": """
                    asctime: %(asctime)s
                    created: %(created)f
                    filename: %(filename)s
                    funcName: %(funcName)s
                    levelname: %(levelname)s
                    levelno: %(levelno)s
                    lineno: %(lineno)d
                    message: %(message)s
                    module: %(module)s
                    msec: %(msecs)d
                    name: %(name)s
                    pathname: %(pathname)s
                    process: %(process)d
                    processName: %(processName)s
                    relativeCreated: %(relativeCreated)d
                    thread: %(thread)d
                    threadName: %(threadName)s
                    exc_info: %(exc_info)s
                """,
            "datefmt": "%Y-%m-%d %H:%M:%S",
        },
    },
    "handlers": {
        "stdout": {
            "class": "logging.StreamHandler",
            "stream": "<ext://sys.stdout>",
            "formatter": "json",
        },
    },
    "loggers": {
        "dagster": {"level": "DEBUG", "handlers": ["stdout"]},
        "dagit": {"level": "DEBUG", "handlers": ["stdout"]},
        "": {"level": "INFO", "handlers": ["stdout"]},
    },
}
```
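For completeness, this gets applied once at process startup along these lines (the import path for the config here is made up):
```python
# Applied once at process startup; the config's module path is illustrative.
import logging
import logging.config

from our_package.logging_config import INFRA_LOGGING_CONFIG  # hypothetical module

logging.config.dictConfig(INFRA_LOGGING_CONFIG)

logger = logging.getLogger("our-logger")
logger.info("logging configured; this should show up on stdout as JSON")
```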
p
Do you have a compute log manager configured? If you are grabbing logs using a sidecar, you may want to bypass the dagster-controlled log capture by configuring the `NoOpComputeLogManager`: https://docs.dagster.io/dagster-cloud/deployment/agents/customizing-configuration#disabling-compute-logs
By default, I think there’s some subprocess that routes output both to stdout and to a file, but it can hit some timing issues depending on the OS and the particulars of the run process
m
(for completeness—here’s the part of our Dagster config that controls run launches and logging)
```yaml
run_launcher:
  module: dagster_aws.ecs
  class: EcsRunLauncher
  config:
    include_sidecars: true
    task_definition: <our task definition ARN>
    container_name: staging-workflows-runs

python_logs:
  managed_python_loggers:
    - our-logger
```
And the behavior we observed was that logs would semi-nondeterministically get collected and routed to other sources via stdout (the typical path for ECS log routing), but would get picked up much more reliably when we launched a run with a large `sleep_len`, seemingly giving the steps more time to flush logs to stdout before finishing.
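One debugging idea on our side (just a sketch, not something we have relied on yet) is to dump which handlers are actually attached to `our-logger` from inside an op, to confirm the stdout `StreamHandler` is still in place alongside whatever Dagster attaches for `managed_python_loggers`:
```python
import logging

from dagster import op


@op
def inspect_logger_handlers() -> None:
    # Walk up the logger hierarchy so handlers inherited via propagation
    # (e.g. the root logger's stdout handler) are visible too.
    current = logging.getLogger("our-logger")
    while current is not None:
        print(
            f"logger={current.name or 'root'!r} handlers={current.handlers}",
            flush=True,
        )
        current = current.parent if current.propagate else None
```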
Thanks for the suggestion @prha—we can take a look at this as a start
p
Yeah, I’d bet that switching the compute log config would help with the reliability… On Unix systems I think there’s a child process that invokes `tee`, and on Windows it’s some polling interval that should flush.
We’d have to dig deeper into what’s going on in your case, but bypassing the compute log capture will at least not muck with the file handles in the OS
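For intuition, that kind of tee-style capture looks very roughly like the sketch below — purely an illustration of the general mechanism, not Dagster’s actual implementation (the capture path is made up):
```python
# Illustration only: redirect this process's stdout through `tee` so output
# goes both to a capture file and to the original stdout.
import os
import subprocess
import sys

capture_path = "/tmp/compute.log"  # made-up path

tee = subprocess.Popen(["tee", capture_path], stdin=subprocess.PIPE)
saved_stdout_fd = os.dup(1)
os.dup2(tee.stdin.fileno(), 1)  # fd 1 now points at tee's stdin

print("this line goes to both the capture file and the original stdout")

# If the process exits before flushing and letting tee drain, trailing
# output can be lost -- the sort of timing issue mentioned above.
sys.stdout.flush()
os.dup2(saved_stdout_fd, 1)
tee.stdin.close()
tee.wait()
```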
b
So, reviving Matt’s thread here! I’m picking up where he left off - we are using `NoOpComputeLogManager`, and while it does help, it doesn’t help with all of it:
```yaml
compute_logs:
  module: dagster.core.storage.noop_compute_log_manager
  class: NoOpComputeLogManager
```
For this op I expected to see up to 49 hello worlds sent to Datadog, and for some reason we are getting up to 28. They all share the same timestamp on Datadog, `Sep 12 12:42:41.370`, and were likely in the same batch. Log 29 in the Dagster logger starts at `12:42:41.347`, and we lose all logs until `12:42:41.932`. We do have the logs for the skipped hook and the final execution logs on Datadog:
```
2022-09-12 15:42:42 +0000 - dagster - DEBUG - helloworld_job - c4dee54d-e328-449c-aed3-969e394138c2 - hello_world - HOOK_SKIPPED - Skipped the execution of hook "store_exit_code". It did not meet its triggering condition during the execution of "hello_world".
2022-09-12 15:42:42 +0000 - dagster - DEBUG - helloworld_job - c4dee54d-e328-449c-aed3-969e394138c2 - 1 - ENGINE_EVENT - Finished steps in process (pid: 1) in 1.82s
2022-09-12 15:42:42 +0000 - dagster - DEBUG - helloworld_job - c4dee54d-e328-449c-aed3-969e394138c2 - 1 - RUN_SUCCESS - Finished execution of run for "helloworld_job".
```
And they are actually dated `Sep 12 12:42:42.089`, roughly a second after the last log was sent by the job. They also come from the same container, so the theory that the problem is in the log forwarder is losing strength. The SIGTERM to the aws-for-fluent-bit container was sent afterwards, too:
```
[2022/09/12 15:42:43] [engine] caught signal (SIGTERM)
[2022/09/12 15:42:43] [ info] [input] pausing forward.2
```
p
It’s strange that the logger is losing batches in the middle (because you’re getting the run termination event logs) and that everything is happening on the same container. Do you get the same behavior with cloudwatch? This isn’t something I’ve seen before…
b
From what we’ve tested before, yes! We lose the logs on CloudWatch too. It looks like the `stdout` handler is not handling those logs well, while the `dagster` handler is dealing with them properly.
One question @prha - is it possible that, even using `NoOpComputeLogManager` but still using `fs_io_manager`, we could be hitting the timing issues you mentioned in this thread?
p
@Bianca Rosa I don’t think so. The `fs_io_manager` is responsible for reading inputs / writing outputs to disk. The compute log manager is more about routing stdout/stderr to disk (and potentially uploading it somewhere, like S3). With the noop compute log manager, we don’t touch stdout/stderr at all, so the Datadog sidecar you have set up should work as intended. If you’re still running into issues, I wonder if there’s some debugging you could do with your AWS rep in terms of the container lifecycle? I saw this issue on SO (https://stackoverflow.com/questions/54656906/missing-log-lines-when-writing-to-cloudwatch-from-ecs-docker-containers) that doesn’t sound like exactly what you’re seeing, but maybe there’s a similar flushing issue that you’re hitting?
b
I didn’t think that was the case, because we have the container dependencies properly set and we don’t see this in any of our other apps, so I spent the day hooking our app into Datadog locally and was able to reproduce it in a local, containerized environment.
I’m losing roughly 10 logs on stdout, near the end of the execution of the hello_world job, and those logs do get sent to the Dagster UI.
Oh wait - one last thing to check: enable `NoOpComputeLogManager` locally
Ha! With `NoOpComputeLogManager` the behavior in fact doesn’t continue locally 😬
Update here for you @prha - indeed `NoOpComputeLogManager` was the way to go. I wonder if there is a way to make it explicit/default in the docs/code for ECS deployments. Our issue was made much harder by the fact that workflows_user_code was not picking up the config changes, and we went down a rabbit hole that was hard to get out of because we needed to update all three images with it, but our Dockerfile was not copying the right config into `dagster.yaml` for the workflows_user_code container. Anyway - things were quite hard to debug, but eventually we got there.