Matt Millican
08/30/2022, 7:59 PM
Has anyone seen logs get dropped when launching runs with the EcsRunLauncher, and in a seemingly nondeterministic way?
We’re experiencing a problem where logs that would normally get sent to stdout and captured by our observability tooling in our other ECS-based services are not properly captured when running in a Dagster-launched task, even though they are successfully captured by the Dagster logging handler.
Strangely, the issue seems to be related to the length of execution of individual ops; when an op is artificially lengthened by use of time.sleep, logs tend to be captured at a higher rate.
More details in thread 👇🧵

import logging
from time import sleep
from typing import Generator

from dagster import (
    DynamicOut,
    DynamicOutput,
    fs_io_manager,
    job,
    in_process_executor,
    op,
)

logger = logging.getLogger("our-logger")


@op(out=DynamicOut())
def num_helloworlds(num_steps: int) -> Generator[DynamicOutput, None, None]:
    for i in range(num_steps):
        yield DynamicOutput(None, mapping_key=str(i))


@op
def log_and_sleep(result: None, num_rounds_per_step: int, sleep_len: float) -> None:
    for i in range(num_rounds_per_step):
        for level in (
            logging.DEBUG,
            logging.INFO,
            logging.WARNING,
            logging.ERROR,
            logging.CRITICAL,
        ):
            logger.log(level, f"hello world {i}")
        sleep(sleep_len)


@job(
    resource_defs={"io_manager": fs_io_manager},
    executor_def=in_process_executor,
    tags={
        "ecs/cpu": "2048",
        "ecs/memory": "4096",
    },
)
def helloworld(num_steps: int, num_rounds_per_step: int, sleep_len: float) -> None:
    helloworld_output = num_helloworlds(num_steps)
    helloworld_output.map(
        lambda dynamicout: log_and_sleep(dynamicout, num_rounds_per_step, sleep_len)
    )
Our logging configuration routes our-logger to the stdout of the container, for collection by the AWS logs driver. It is passed to logging.config.dictConfig when a process is launched:
INFRA_LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "json": {
            "()": "pythonjsonlogger.jsonlogger.JsonFormatter",
            "format": """
                asctime: %(asctime)s
                created: %(created)f
                filename: %(filename)s
                funcName: %(funcName)s
                levelname: %(levelname)s
                levelno: %(levelno)s
                lineno: %(lineno)d
                message: %(message)s
                module: %(module)s
                msec: %(msecs)d
                name: %(name)s
                pathname: %(pathname)s
                process: %(process)d
                processName: %(processName)s
                relativeCreated: %(relativeCreated)d
                thread: %(thread)d
                threadName: %(threadName)s
                exc_info: %(exc_info)s
            """,
            "datefmt": "%Y-%m-%d %H:%M:%S",
        },
    },
    "handlers": {
        "stdout": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",
            "formatter": "json",
        },
    },
    "loggers": {
        "dagster": {"level": "DEBUG", "handlers": ["stdout"]},
        "dagit": {"level": "DEBUG", "handlers": ["stdout"]},
        "": {"level": "INFO", "handlers": ["stdout"]},
    },
}
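(A minimal sketch of how a config like this might be applied at process start; the log message and placement are assumptions, not from the thread:)

# apply the config above before any Dagster code runs
import logging
import logging.config

logging.config.dictConfig(INFRA_LOGGING_CONFIG)  # INFRA_LOGGING_CONFIG as defined above

logger = logging.getLogger("our-logger")
logger.info("JSON logging configured; records now stream to stdout")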
prha
08/30/2022, 8:01 PM
Matt Millican
08/30/2022, 8:03 PM

run_launcher:
  module: dagster_aws.ecs
  class: EcsRunLauncher
  config:
    include_sidecars: true
    task_definition: <our task definition ARN>
    container_name: staging-workflows-runs
python_logs:
  managed_python_loggers:
    - our-logger
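(Side note, not from the thread: listing our-logger under managed_python_loggers is what lets Dagster's own handler capture records from that stdlib logger. A sketch of an op using Dagster's built-in get_dagster_logger instead, which is captured without any managed_python_loggers entry:)

from dagster import get_dagster_logger, op

@op
def log_with_dagster_logger(num_rounds: int) -> None:
    # get_dagster_logger() returns a stdlib logger whose records Dagster always
    # captures in the event log, independent of the managed_python_loggers setting.
    log = get_dagster_logger()
    for i in range(num_rounds):
        log.info(f"hello world {i}")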
Log capture rates improve as we increase sleep_len, seemingly giving the steps more time to flush logs to stdout before finishing.
prha
08/30/2022, 8:07 PM
Stdout capture goes through tee, and on Windows it's some polling interval that should flush.
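(One workaround idea, a sketch rather than anything suggested in the thread: flush the stdout logging handlers explicitly at the end of each op so buffered records are written before the step finishes:)

import logging
import sys

def flush_stdout_logging() -> None:
    # Flush every handler attached to the root logger, then stdout itself,
    # so nothing is left buffered when the step process moves on.
    for handler in logging.getLogger().handlers:
        handler.flush()
    sys.stdout.flush()

# e.g. call flush_stdout_logging() as the last line of log_and_sleep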
Bianca Rosa
09/12/2022, 4:08 PM
We switched to the NoOpComputeLogManager and while it does help, it doesn't help with all of it..

compute_logs:
  module: dagster.core.storage.noop_compute_log_manager
  class: NoOpComputeLogManager
for this op I expected to see up to 49 helloworlds sent to Datadog and for some reason we are getting up to 28. they all share the same timestamp on Datadog, Sep 12 12:42:41.370, and were likely in the same batch.
the dagster logger shows hello world 29 starting at 12:42:41.347, and we lose all logs until 12:42:41.932.
we do have the logs for hook skipped and the final execution logs on datadog:
2022-09-12 15:42:42 +0000 - dagster - DEBUG - helloworld_job - c4dee54d-e328-449c-aed3-969e394138c2 - hello_world - HOOK_SKIPPED - Skipped the execution of hook "store_exit_code". It did not meet its triggering condition during the execution of "hello_world".
2022-09-12 15:42:42 +0000 - dagster - DEBUG - helloworld_job - c4dee54d-e328-449c-aed3-969e394138c2 - 1 - ENGINE_EVENT - Finished steps in process (pid: 1) in 1.82s
2022-09-12 15:42:42 +0000 - dagster - DEBUG - helloworld_job - c4dee54d-e328-449c-aed3-969e394138c2 - 1 - RUN_SUCCESS - Finished execution of run for "helloworld_job".
and they are actually dated as Sep 12 12:42:42.089
, roughly a second after the last log was sent on the job. they also come from the same container, so any assumption that the error might be on the log forwarder is losing strength.
the SIGTERM to the aws-for-fluent-bit container was sent afterwards, too:
[2022/09/12 15:42:43] [engine] caught signal (SIGTERM)
[2022/09/12 15:42:43] [ info] [input] pausing forward.2
prha
09/12/2022, 5:13 PM
Bianca Rosa
09/12/2022, 6:38 PM
It seems like the stdout handler is not handling those logs well, while the dagster handler is dealing with them properly.
If we're on the NoOpComputeLogManager but still using fs_io_manager, could we be hitting the timing issues you mentioned in this thread?
prha
09/14/2022, 6:12 PM
fs_io_manager is responsible for reading inputs / writing outputs to disk. The compute log manager is more about routing stdout/stderr to disk (and potentially uploading it somewhere, like S3). With the noop compute log manager, we don't touch stdout/stderr at all, so the Datadog sidecar you have set up should work as intended. If you're still running into issues, I wonder if there's some debugging you could do with your AWS rep in terms of the container lifecycle? I saw this issue on SO (https://stackoverflow.com/questions/54656906/missing-log-lines-when-writing-to-cloudwatch-from-ecs-docker-containers) that doesn't sound like what you're seeing, but maybe there's some similar flushing issue that you're hitting?
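(To make that distinction concrete, a sketch reusing the settings already shown in this thread; the job and op names below are illustrative only:)

from dagster import fs_io_manager, job, op

@op
def say_hello() -> None:
    # print() goes to stdout; whether that stream is captured or redirected is
    # governed by the compute log manager, not the io_manager.
    print("hello world")

# io_manager: a per-job resource that persists op inputs/outputs (here, to local disk).
@job(resource_defs={"io_manager": fs_io_manager})
def example_job():
    say_hello()

# compute_logs: instance-level dagster.yaml config controlling stdout/stderr capture.
# With the noop manager, Dagster leaves the streams untouched:
#
# compute_logs:
#   module: dagster.core.storage.noop_compute_log_manager
#   class: NoOpComputeLogManager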
Bianca Rosa
09/15/2022, 11:18 PM
We're testing with the NoOpComputeLogManager locally.
With the NoOpComputeLogManager, in fact, the behavior doesn't continue locally 😬
The NoOpComputeLogManager was the way to go. I wonder if there is a way to make it explicit/default if you use ECS in the docs/code.
Our issue was so much harder due to the fact that workflows_user_code was not picking up the config changes, and we went into a rabbit hole that was hard to get out of because we needed to update all three images with it, but our Dockerfile was not copying the right config into dagster.yaml for the workflows_user_code container. Anyways - things were quite hard to debug, but eventually we got there.