Is there a good way to load a sidecar when running...
# deployment-kubernetes
o
Is there a good way to load a sidecar when running Dagster jobs in Kubernetes? From what I can tell, a sidecar would just keep the k8s Job up because it’s waiting for all containers to complete. In our case, some jobs need to access a Google CloudSQL instance, and the recommended way to do that is to use a cloud proxy sidecar.
a
I would avoid the sidecar pattern in this case and just use a separate cloudsql deployment. Then you can tell your jobs to connect to the DB using that deployment
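From the job side that's then just a normal database connection to the proxy's in-cluster address. A rough sketch, assuming the proxy Deployment is exposed behind a Service (the Service name, port, database, and credentials below are all placeholders):
Copy code
# Rough sketch: connecting to CloudSQL through a shared cloud-sql-proxy
# Deployment exposed behind a Service. Host, port, database, and credentials
# are placeholders.
import os

import psycopg2

conn = psycopg2.connect(
    host="cloudsql-proxy.default.svc.cluster.local",  # the proxy Service DNS name
    port=5432,
    dbname="mydb",
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)
with conn.cursor() as cur:
    cur.execute("SELECT 1")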
o
Yeah, I was hoping to avoid doing that 🙂 Is it possible instead to tweak the container startup scripts to start a sql proxy as a background job before starting dagster itself?
d
I have successfully done this before, but it was pretty painful for the reason you mention - it's frustrating that kubernetes doesn't support the sidecar pattern more natively (https://github.com/kubernetes/enhancements/issues/753). Here's some code I've used for this before to configure a dagster job (without officially recommending it 🙂 ):
Copy code
def docker_sidecar_k8s_config_tag(timeout: int = 60 * 60 * 4):
    """
    k8s config that sets up a sidecar container.
    Most of the complexity here is due to the poor support for cleaning up after sidecar
    containers in k8s when the main container has finished - we write to a file on process
    termination, and have logic in the sidecar container to shut down the sidecar as soon as
    that file exists.

    See <https://github.com/kubernetes/kubernetes/issues/25908#issuecomment-252089871>
    where this strategy was recommended by one of the k8s maintainers in 2016; no significant progress
    seems to have been made on this issue since then.
    """
    return {
        "container_config": {
            "volume_mounts": [
                {
                    # polled in the entrypoint to know when to shut down the container
                    "name": "sidecar-storage",
                    "mount_path": "/usr/share/pod",
                }
            ],
        },
        "pod_spec_config": {
            "containers": [
                {
                    "name": "sidecar",
                    "image": "sidecar_image_ehre",
                    "volume_mounts": [
                        {
                            "name": "sidecar-storage",
                            # polled in the entrypoint to know when to shut down the container
                            "mount_path": "/usr/share/pod",
                        }
                    ],
                    "command": ["/bin/sh", "-c"],
                    "args": [
                        f"""dockerd-entrypoint.sh &
sleep_interval=5
timeout={timeout}
i=0
while [ $i -le $(( timeout/$sleep_interval )) ]; do
    if test -f /usr/share/pod/done; then
       echo "Dagster pod finished, exiting"
       exit 0
    fi
    echo 'Waiting for the dagster pod to finish...'
    sleep $sleep_interval
    i=$(( i + 1 ))
done
echo "Timed out waiting for Dagster pod to finish"
exit 1"""
                    ],
                }
            ],
            "volumes": [{"name": "sidecar-storage", "empty_dir": {}}],
        },
    }
Then:
Copy code
@job(
    tags={"dagster-k8s/config": docker_sidecar_k8s_config_tag()},
)
def my_job():
    ...
Then:
Copy code
import os
from pathlib import Path


def signal_sidecar_finished() -> None:
    """Register this via atexit.register so the sidecar knows the main process has finished."""
    if os.path.exists("/usr/share/pod"):
        print("Signaling that the dagster process has finished")
        Path("/usr/share/pod/done").touch()
    else:
        print("No /usr/share/pod folder on dagster process cleanup")
it wasn't great
o
Thanks @daniel. Not great, but not terrible. I wonder if we can improve it a bit by having the dagster container also update a heartbeat file (in a subprocess, or a separate process that starts with the container as part of the entrypoint script). That way, instead of waiting for a potentially long timeout, the sidecar can exit as soon as the heartbeat goes stale. Unless the Dagster container already has some sort of heartbeat?
d
that could also work, yeah. The dagster container does call signal_sidecar_finished whenever it finishes, which shuts the sidecar down right away - so the timeout really only matters in the event of a crash where that cleanup doesn't get a chance to happen
(which I did by putting atexit.register(signal_sidecar_finished) in the code that loads the dagster definitions)
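i.e. roughly this at the bottom of the module that loads the definitions (a sketch):
Copy code
# Register the cleanup hook when the definitions module is loaded, so the
# process signals the sidecar on exit. Note that atexit only runs on a normal
# interpreter exit, not on a hard crash - that's what the sidecar timeout is for.
import atexit

atexit.register(signal_sidecar_finished)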
o
yup, I get it. But it means you need to be very aware of your expected run time. If you set it too high, wouldn’t it affect how long it takes before Dagster retries the job (if retries are set)?
d
oh right - yeah, that timeout argument would need to be higher than the longest you expect a job to run
a
@Oren Lederman
Yeah, I was hoping to avoid doing that
Why is that, if I can ask? I haven’t had any issues with this solution whenever I’ve used it in the past
o
@Andrea Giardini mostly because it feels less elegant than a sidecar. Depending on how much load and traffic you're planning for, there could be scaling issues, for example. Yes, it can autoscale, but sidecars scale linearly with the number of pods and are available right when the pods need them. The other reason is security. By default, the deployment is accessible by the entire cluster. You'd still need a user and password, but still. I could be wrong here, but it adds one more thing to worry about.
It's much simpler than making sure your sidecars terminate nicely though.
a
Interesting… Just a couple of ideas here…
Yes, it can autoscale, but sidecars scale linearly with the number of pods and are available right when the pods need them.
I actually think a deployment scales much better than sidecars in GKE. Every sidecar proxy holds on to a certain number of management connections, so (at least in my experience) you can run out of connections pretty fast with the sidecar pattern.
The other reason is security. By default, the deployment is accessible by the entire cluster.
That’s something that needs to be solved with NetworkPolicies
o
Good point about the connections. I was thinking more of CPU load and throughput for heavy ETL processes. Same with network policies: I figured you could do that, it's just something we usually don't need to deal with because we mostly use our cluster for running computational jobs, not deployments. In any case - thank you both for suggesting ways to handle this :) I'll relay this info back to our devs and infra team so we can decide what to do with it.
small update - I stumbled upon this discussion on the istio forum - https://discuss.istio.io/t/best-practices-for-jobs/4968/2 . I didn’t have time to dig into this, but their idea is to have an additional sidecar that monitors the main container. It uses the k8s API though, so it’ll need more permissions.
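If I'm reading it right, the monitoring sidecar would look roughly like this (just a sketch using the official kubernetes Python client - the POD_NAME/POD_NAMESPACE env vars via the downward API and the "dagster" container name are my assumptions, and the pod's service account needs permission to read pods):
Copy code
# Sketch of the istio-forum idea: a sidecar that polls the k8s API and exits
# once the main container has terminated. Env vars and container name are
# placeholders; requires RBAC to read pods in the namespace.
import os
import sys
import time

from kubernetes import client, config


def wait_for_main_container(main_container: str = "dagster", poll_seconds: int = 5) -> None:
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    pod_name = os.environ["POD_NAME"]          # injected via the downward API
    namespace = os.environ["POD_NAMESPACE"]    # injected via the downward API

    while True:
        pod = v1.read_namespaced_pod(name=pod_name, namespace=namespace)
        for status in pod.status.container_statuses or []:
            if status.name == main_container and status.state.terminated is not None:
                print("Main container terminated, exiting sidecar")
                sys.exit(0)
        time.sleep(poll_seconds)


if __name__ == "__main__":
    wait_for_main_container()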
@daniel we are trying out your approach in a job with multiple steps, but it seems that signal_sidecar_finished gets called after the first step is completed (based on the timestamps in the log). Can you think of a reason why this could happen?
d
How did you trigger signal_sidecar_finished?
oh right - the atexit.register may be getting called when the first subprocess that runs a step exits, since each step process also loads the definitions module and registers the hook
o
Yeah 🤷 . We are now trying my other idea of using a heartbeat. The sidecar looks at the modification date of a heartbeat file and exits if it hasn’t changed for X seconds. The Python code spins up a subprocess that touches the heartbeat file every X seconds.
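The writer side is tiny - something like this (just a sketch; the path and interval are placeholders, and the container entrypoint launches it as a separate process next to Dagster):
Copy code
# Sketch of the heartbeat writer the entrypoint starts alongside Dagster.
# Path and interval are placeholders; the sidecar exits once the file's
# modification time is older than its staleness threshold.
import time
from pathlib import Path

HEARTBEAT_FILE = Path("/usr/share/pod/heartbeat")
INTERVAL_SECONDS = 30


def write_heartbeat_forever() -> None:
    HEARTBEAT_FILE.parent.mkdir(parents=True, exist_ok=True)
    while True:
        HEARTBEAT_FILE.touch()  # bumps the mtime the sidecar is watching
        time.sleep(INTERVAL_SECONDS)


if __name__ == "__main__":
    write_heartbeat_forever()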
a
Sorry for insisting but… is it really that important to have cloudproxy as a sidecar? 🤔 Having additional sidecars for termination / custom scripts and endpoints / heartbeat & co sounds a bit… overkill?
o
@Andrea Giardini I agree that it adds complexity, but we still want to try it out. There are other sidecars (istio, for example) that we might need in the near future, so it’s an opportunity to test whether this solution would work.