# ask-community
k
Hi team, has anyone seen any issues with dagster-shell and simple bash scripts terminating unexpectedly? I'm running a test with a simple script (sleep 5) and I'm seeing the attached error. One thing to note is that this works fine inside our jobs that run with our docker run launcher; the failures are only present in jobs that execute through our dagster-celery-docker executor. Another note: ls seems to work, while sleep 5 and AWS CLI commands fail. I am able to get into the containers while these steps are running and execute the same commands from bash without issue. I'm not seeing any Docker OOM events, so I'm thinking the 137 is coming from the finally clause in the dagster-shell utils code that cleans up the subprocess.
One other thing to note is that if I manually execute the bash script, the tmp folder appears to disappear, along with the bash script itself.
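For context, a minimal sketch of the kind of shell op under test, assuming dagster-shell's create_shell_command_op; the op and job names here are placeholders, not taken from the thread:
```python
# Minimal repro sketch (names are placeholders): a dagster-shell op that just
# runs `sleep 5`. Executing in process first confirms the op itself is sound
# before layering on run launchers or executors.
from dagster import job
from dagster_shell import create_shell_command_op

sleep_op = create_shell_command_op("sleep 5", name="sleep_op")

@job
def sleep_job():
    sleep_op()

if __name__ == "__main__":
    result = sleep_job.execute_in_process()
    assert result.success
```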
j
Do you have logs from one of the Docker containers that failed?
k
I'll have to patch the run launcher to disable auto-remove and give it a shot; the containers have been deleted by the time I can get in. I'll give that a try and PM the logs to you when I can, thank you.
Part of the problem appears to be related to the environment variables not being passed to the step containers, so I corrected that. I was also able to set the container kwargs auto_remove to false so the containers stick around. @johann I shared a failure log with you; it looks like it writes the sleep line out to stdout and the job dies after that. I can also share a log from a different environment where it's working if that would help you narrow down the issue. @daniel actually recently helped resolve another issue where resources were printing to stdout and corrupting the Dagster event stream. I'm not sure if it's related, but I figured I'd mention that also.
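For anyone following along, a hedged sketch of the run-config shape described here, written as a Python dict; the exact keys should be checked against the celery_docker_executor config schema for your Dagster version, and the env var names are placeholders:
```python
# Sketch of the per-step docker config for celery_docker_executor (broker,
# backend, etc. omitted). container_kwargs is passed through to docker-py's
# containers.run(), so auto_remove=False keeps failed step containers around
# for inspection; env_vars forwards the named variables into each container.
run_config = {
    "execution": {
        "config": {
            "docker": {
                "env_vars": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],  # placeholders
                "container_kwargs": {"auto_remove": False},
            }
        }
    }
}
```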
j
Hmm, it does look like 137 can be a Linux exit code for OOM. Just to confirm, Docker shows the container with a 137 exit code and no error in the logs?
k
Yes, a docker inspect of the container shows it was not OOM-killed. I've also read that if a SIGKILL is sent, you can get a 128 + 9 exit code at the shell. I know that if the celery worker receives a SIGKILL, it will shut down immediately rather than waiting gracefully.
```json
"State": {
    "Status": "exited",
    "Running": false,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": false,
    "Dead": false,
    "Pid": 0,
    "ExitCode": 137,
    "Error": "",
    "StartedAt": "2022-05-20T15:28:16.270351159Z",
    "FinishedAt": "2022-05-20T15:28:25.628324723Z"
},
```
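As a quick, non-Dagster illustration of the 128 + 9 point above: a child process killed with SIGKILL shows up as returncode -9 in Python's subprocess, while a shell reports the same termination as 137.
```python
# Demonstrates the 128 + 9 = 137 arithmetic for a SIGKILLed process (POSIX only).
import signal
import subprocess
import time

proc = subprocess.Popen(["sleep", "30"])
time.sleep(1)
proc.kill()                    # sends SIGKILL on POSIX
proc.wait()
print(proc.returncode)         # -9: negative signal number from subprocess
print(128 + signal.SIGKILL)    # 137: what a shell would report for the same kill
```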
j
It's strange that this issue pops up with the `celery_docker_executor` but not the `DockerRunLauncher`, since in both cases the `sleep` CLI command is just running in the same Docker container.
Any chance logs from the celery worker show it being shut down / sending the SIGKILL?
k
I agree, it is strange. Just to confirm, I re-ran all sorts of shell commands (ls, aws s3 ls, aws s3 cp, aws s3 rm, cat, sleep) and they all seem to work fine with the docker run launcher and multiprocess executor. Is there something I can grep for in the worker logs that will surface what you're looking for?
j
So the latest is that when using the `celery_docker_executor`, `ls` works but `aws` and `sleep` don't?
You could try removing the celery dependency by using the plain `docker_executor`:
https://docs.dagster.io/_apidocs/libraries/dagster-docker#dagster_docker.docker_executor
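A minimal sketch of what that swap might look like, assuming dagster-docker is installed; job and op names are placeholders, and executor config (image, network, etc.) is omitted:
```python
# Sketch: attach the plain docker_executor (no celery) to a job to rule the
# celery worker in or out. Names are placeholders; executor config such as the
# image to use may still need to be supplied via run config.
from dagster import job
from dagster_docker import docker_executor
from dagster_shell import create_shell_command_op

sleep_op = create_shell_command_op("sleep 10", name="sleep_op")

@job(executor_def=docker_executor)
def sleep_docker_job():
    sleep_op()
```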
k
We are using the docker run launcher and multiprocess executor for most jobs; we introduced the celery worker to control how certain jobs execute, which is a requirement. In this case, if a shell script is combined with one of our jobs that synchronizes data, we need it to run in celery to control how many data synchronizers can run in parallel. We are generating our jobs from our application (they aren't hand-written by developers). I think that if we modify the logic in our app to generate the run config I got working manually for jobs that require the celery executor, it should work. The current issue is now this (again, only happening in the celery executor):
But yes, ls is still working. The error in the screenshot above occurs when we execute sleep 10 or aws s3 ls.
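Until the shell_op metadata issue is sorted out, one hedged workaround (purely illustrative, not from the thread) is a hand-rolled op that shells out with subprocess and logs decoded text instead of raw bytes:
```python
# Illustrative workaround sketch: run the shell command directly and decode
# stdout/stderr to str before logging, avoiding raw bytes in event metadata.
# The op name and command are placeholders.
import subprocess

from dagster import Failure, op

@op
def run_sleep(context) -> str:
    result = subprocess.run(["bash", "-c", "sleep 10"], capture_output=True)
    stdout = result.stdout.decode("utf-8", errors="replace")
    stderr = result.stderr.decode("utf-8", errors="replace")
    context.log.info(f"exit code {result.returncode}; stdout={stdout!r}; stderr={stderr!r}")
    if result.returncode != 0:
        raise Failure(description=f"shell command exited with code {result.returncode}")
    return stdout
```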
j
@Dagster Bot issue shell_op is raising “DagsterInvalidMetadata on type bytes”
👍 1
d
k
Thank you