# deployment-kubernetes
k
Hi team, I'm working with @Jim Nisivoccia and @Zack D. to get our app running on celery/k8s. We are close to having it working, and we are running into the issue in the attached screenshot. In our docker/celery version of the app we had to set the vhost = dagster in the environment. I don't see anything in the helm chart about the vhost or any secrets that are being setup for rabbit. Should we be creating a secret and starting rabbit with env vars, or is there a better way to set the vhost with helm?
Is this the wrong room to post dagster celery k8s related questions? Do they belong in the support room?
d
Hi Keith - this is an OK place to post - we're a bit slower than usual to respond with Dagster Day on the horizon, apologies. I'll confess that I'm not a RabbitMQ expert and I had to look up what a vhost was in the first place, but we use the RabbitMQ helm chart within our helm chart, and I see some information in their helm chart documentation about how to set the vhost: https://github.com/bitnami/charts/tree/master/bitnami/rabbitmq#configure-the-default-uservhost
You can add any additional fields from the rabbitmq helm chart in this part of the dagster helm chart: https://github.com/dagster-io/dagster/blob/master/helm/dagster/values.yaml#L715-L736. The only place you might run into trouble is if those params also affect what the resulting broker URL should be - we may need to make some tweaks to the helm chart to support that if it needs to take additional config from the rabbitmq helm chart into account.
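For illustration, a rough sketch of what that could look like in the dagster values.yaml, assuming the bundled bitnami chart accepts the extraConfiguration block described in the linked docs (the exact keys may differ between chart versions):

```yaml
rabbitmq:
  enabled: true
  # Values under this key are passed through to the bitnami rabbitmq subchart.
  # The default_vhost setting below is taken from the linked bitnami docs and is
  # an assumption here, not verified against the chart version bundled with dagster.
  extraConfiguration: |-
    default_vhost = dagster
    default_permissions.configure = .*
    default_permissions.read = .*
    default_permissions.write = .*
```

If the vhost also needs to appear in the broker URL that the helm chart constructs, that may still require the tweak mentioned above.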
k
Thanks @daniel, I appreciate the response - I'll take a look at their helm chart and see if we can add overrides to our values.yaml. We've really been enjoying all the flexibility dagster offers and we're looking forward to finishing our conversion from dagster/docker/celery to dagster/k8s/celery. I can imagine that the team is really busy with Dagster Day coming up, so thanks again for the help. If it causes issues with the broker URL, I'll post back here and let you know.
We actually ran into another issue once we manually set the vhost and got things running. We have a job with 5 ops, mostly sequential. Once we got things configured we could see the queues come up (we have two for now) and our worker pods started. We were able to launch runs, and the k8s job that represented the dagster job launched, which launched the first step job and then the second; both succeeded. After that it just kept waiting for the third op to start. I think we waited around 3 minutes before we terminated the run. I remember last time we spoke about the celery docker run launcher there were problems with resources writing to standard output, which corrupted the event stream. Is it possible this is occurring again with celery k8s? I can post a screenshot and a debug file tomorrow if that would help.
d
I don't think this would be the same problem - we fixed that one by bringing the celery docker executor more in line with the celery k8s executor
k
Yes, I remember the problem was solved in the k8s version and you refactored it to be shared with the docker version. I'll post more details tomorrow.
j
Hey Daniel, I've tried plugging in the rabbitmq configuration listed in the rabbit helm chart that you shared, and it didn't work with either of the two configurations I tried (shown in the images attached to this message). After clearing the volume rabbit stores its data in, when I do a helm install and bring up the new rabbitmq instance, the default vhost is still set to "/".
d
hm, we're rapidly approaching the boundaries of things I have expertise in, but the configuration you have there looks different from what I see in https://github.com/bitnami/charts/tree/master/bitnami/rabbitmq#configure-the-default-uservhost - it looks like they are using a multi-line string, but you're using a dict
Oh, the first one you have there is using the string, my mistake
j
No worries
During this recent execution, 5 of the 6 ops completed successfully; however, the last one is now not executing, similar to the problem Keith detailed above. I waited for a little over 5 minutes and then terminated the job. Our log lines were showing up throughout the run. I will send you the debug file for this run directly.
So we found that these workers weren't picking up the work because the livenessProbe was constantly failing and restarting them. We couldn't find a way to disable the liveness probe - we tried passing an enabled: false flag as well as commenting out the whole block, but neither worked. So I set it to just run a simple ls command so that it always passes, and our job was consistently able to pick up all ops for execution and run successfully. I believe the normal liveness probe was failing because it was trying to connect to the default "/" rabbitmq vhost instead of the dagster vhost we created.
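A minimal sketch of the override being described, with the probe's exec command replaced by one that always succeeds (the full values.yaml is pasted further below):

```yaml
celeryK8sRunLauncher:
  livenessProbe:
    initialDelaySeconds: 15
    periodSeconds: 10
    timeoutSeconds: 10
    successThreshold: 1
    failureThreshold: 3
    exec:
      command:
        - /bin/sh
        - -c
        - ls  # always exits 0, so the probe never fails and the worker is not restarted
```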
d
it still keeps the liveness probe even if you set livenessProbe to empty on the run launcher config?
j
Yes, I tried commenting out the entire liveness probe section and it was still spawning the worker containers with the default probe attached
d
This is on a recent dagster version?
j
We are running on 0.15.4 currently
d
Mind posting the exact text of the values.yaml that wasn't working? Curious why the code I posted earlier wouldn't kick in if it was empty
Seems like a fairly straightforward guard that leaves out any liveness probe if it's set as empty in the provided values.yaml
If the key was left out entirely I think it would fall back to the defaults in the dagster-provided values.yaml though
Cc @johann , seems like another instance of liveness probes causing more issues than they solve
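Roughly the kind of guard being described, sketched as a Helm template fragment - this is illustrative, not the chart's actual template, and the value path and indentation are assumptions:

```yaml
{{- with .Values.runLauncher.config.celeryK8sRunLauncher.livenessProbe }}
livenessProbe:
  {{- toYaml . | nindent 2 }}
{{- end }}
```

With a guard like this, an empty value (livenessProbe: ~ or {}) is falsy, so no probe is rendered; leaving the key out entirely would instead fall back to whatever default the chart's own values.yaml provides.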
j
```yaml
runLauncher:
  type: CeleryK8sRunLauncher
  config:
    # This configuration will only be used if the K8sRunLauncher is selected
    # k8sRunLauncher:
    #   # Change with caution! If you're using a fixed tag for pipeline run images, changing the
    #   # image pull policy to anything other than "Always" will use a cached/stale image, which is
    #   # almost certainly not what you want.
    #   imagePullPolicy: "Never"
    #   envSecrets:
    #     - name: dagster-secrets
    celeryK8sRunLauncher:
      # Change with caution! If you're using a fixed tag for pipeline run images, changing the
      # image pull policy to anything other than "Always" will use a cached/stale image, which is
      # almost certainly not what you want.
      imagePullPolicy: "Never"
      # # The Celery workers can be deployed with a fixed image (no user code included)
      # image:
      #   # When a tag is not supplied for a Dagster provided image,
      #   # it will default as the Helm chart version.
      #   repository: "datamaxdev1.jfrog.io/datamax/dev/docker_datamax_etl_pipelines"
      #   tag: ~
      #   pullPolicy: Always
      # Support overriding the name prefix of Celery worker pods
      nameOverride: "celery-workers"
      # Additional config options for Celery, applied to all queues.
      # These can be overridden per-queue below.
      # For available options, see:
      # https://docs.celeryq.dev/en/stable/userguide/configuration.html
      configSource: {}
      # Additional Celery worker queues can be configured here. When overriding, be sure to
      # provision a "dagster" worker queue, as this is the default queue used by Dagster.
      #
      # Optionally, labels and node selectors can be set on the Celery queue's workers.
      # Specifying a queue's node selector will override any existing node selector defaults.
      # configSource will be merged with the shared configSource above.
      workerQueues:
        - name: "dagster"
          replicaCount: 2
          labels: {}
          nodeSelector: {}
          configSource: {}
          additionalCeleryArgs: []
        - name: "data-sourcing"
          replicaCount: 2
          labels: {}
          nodeSelector: {}
          configSource: {}
          additionalCeleryArgs: []
      # Additional environment variables to set on the celery/job containers.
      # A Kubernetes ConfigMap will be created with these environment variables. See:
      # https://kubernetes.io/docs/concepts/configuration/configmap/
      #
      # Example:
      #
      # env:
      #   ENV_ONE: one
      #   ENV_TWO: two
      env: {}
      # Additional environment variables can be retrieved and set from ConfigMaps. See:
      # https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#configure-all-key-value-pairs-in-a-configmap-as-container-environment-variables
      #
      # Example:
      #
      # envConfigMaps:
      #   - name: config-map
      envConfigMaps: []
      # Additional environment variables can be retrieved and set from Secrets. See:
      # https://kubernetes.io/docs/concepts/configuration/secret/#use-case-as-container-environment-variables
      #
      # Example:
      #
      # envSecrets:
      #   - name: secret
      envSecrets:
        - name: dagster-secrets
      annotations: {}
      # Sets a node selector as a default for all Celery queues.
      #
      # See:
      # https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector
      nodeSelector: {}
      # Support affinity and tolerations for Celery pod assignment. See:
      # https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
      # https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
      affinity: {}
      tolerations: []
      podSecurityContext: {}
      securityContext: {}
      # Specify resources.
      # Example:
      #
      # resources:
      #   limits:
      #     cpu: 100m
      #     memory: 128Mi
      #   requests:
      #     cpu: 100m
      #     memory: 128Mi
      resources: {}
      # If livenessProbe does not contain the exec field, then we will default to using:
      # exec:
      #   command:
      #     - /bin/sh
      #     - -c
      #     - dagster-celery status -A dagster_celery_k8s.app -y {{ $.Values.global.dagsterHome }}/celery-config.yaml | grep "${HOSTNAME}:.*OK"
      livenessProbe:
        initialDelaySeconds: 15
        periodSeconds: 10
        timeoutSeconds: 10
        successThreshold: 1
        failureThreshold: 3
        exec:
          command:
            - /bin/sh
            - -c
            - ls
      # Additional volumes that should be included in the Job's Pod. See:
      # https://v1-18.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#volume-v1-core
      #
      # Example:
      #
      # volumes:
      #   - name: my-volume
      #     configMap: my-config-map
      volumes: []
      # Additional volume mounts that should be included in the container in the Job's Pod. See:
      # https://v1-18.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#volumemount-v1-core
      #
      # Example:
      #
      # volumeMounts:
      #   - name: test-volume
      #     mountPath: /opt/dagster/test_folder
      #     subPath: test_file.yaml
      volumeMounts: []
      # Additional labels that should be included in the Job's Pod. See:
      # https://kubernetes.io/docs/concepts/overview/working-with-objects/labels
      #
      # Example:
      # labels:
      #   my_label_key: my_label_value
      labels: {}
      # Whether the launched Kubernetes Jobs and Pods should fail if the Dagster run fails.
      failPodOnRunFailure: false
```
Here is the run launcher config from our values.yaml
d
Ah I bet livenessProbe: ~ would work better than commenting it out - commenting it out would fall back to the default
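In values.yaml terms, that would be something like the following sketch (an explicit null rather than an omitted key):

```yaml
celeryK8sRunLauncher:
  # key is present but set to YAML null, as opposed to commenting the block out
  livenessProbe: ~
```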
j
Let me give that a shot and get back to you
It seems this didn't work either - after bringing the containers back up with the attached config, the liveness probe is still spawned and is restarting the containers again.
d
and the liveness probe is that "dagster-celery status -A dagster_celery_k8s.app" command?
j
Yea it's the default one that is described in the comments
d
OK, we'll investigate what's going on there - seems like you have a workaround in the short-term though?
j
Thank you Daniel, I appreciate it. For the time being we should be alright just running the ls command in the liveness probe