https://dagster.io/ logo
#deployment-kubernetes
Title
# deployment-kubernetes
k

Kevin

08/21/2020, 12:10 AM
Hi Dagster folks! In another Dagster instance of mine (my team is kind of having a hunger games for Dagster instances 😄 ), I've come across an error trying to get GRPC to work. It's a K8s deployment at 0.9.2 with a modified dagster_aws module to allow boto3 endpoint_url. Attached is the error from dagit. Any help is appreciated! :D
With the very recent update to 0.9.3 the error has changed to something along the lines of:
[generate_number_19.compute]: (TypeError) - TypeError: _execute_step_k8s_job() got an unexpected keyword argument 'user_defined_k8s_config_dict
log file attached, again any help is appreciated :D
m

max

08/21/2020, 1:04 PM
is everything on the same dagster version?
k

Kevin

08/21/2020, 2:23 PM
oh you betcha! 0.9.3!
s

sashank

08/21/2020, 4:16 PM
@Kevin my guess is that your celery workers aren’t actually running 0.9.3, since
user_defined_k8s_config_dict
should be a valid arg in 0.9.3
There are three images you need to update when upgrading dagster: the
dagit
image, the
celery
image, and the
pipeline_run
image
Is there any way you can verify which version of dagster the
celery
images are running?
k

Kevin

08/21/2020, 4:48 PM
@sashank okay i'll double check that, but i'm pretty sure they are all 0.9.3 as my
dagit
, and
celery
images use the same image
hard to see how any of the dagster modules would not be the latest at 0.9.3 docker image:
Copy code
FROM python:3.7.7

ENV DAGSTER_VERSION=0.9.3

# Cron is required to use scheduling in Dagster
RUN apt-get update && apt-get install -yqq cron


# PIP install for git functions

RUN mkdir -p /opt/dagster/dagster_home /opt/dagster/app

RUN pip install \
     dagster==${DAGSTER_VERSION} \
     dagster-graphql==${DAGSTER_VERSION} \
     dagster-celery[flower,redis,kubernetes]==${DAGSTER_VERSION} \
     dagster-cron==${DAGSTER_VERSION} \
     dagit==${DAGSTER_VERSION} \
     dagster-postgres==${DAGSTER_VERSION} \
     dagster-pandas==${DAGSTER_VERSION} \
     dagster-gcp==${DAGSTER_VERSION} \
     dagster-k8s==${DAGSTER_VERSION} \
     dagster-airflow==${DAGSTER_VERSION} \
     dagster-celery-k8s==${DAGSTER_VERSION} \
     dagster-celery-docker==${DAGSTER_VERSION}

ADD build_cache/ /
RUN pip install -e dagster-aws

# Copy your pipeline code and entrypoint.sh to /opt/dagster/app
COPY *.py *.yaml /opt/dagster/app/
COPY .aws /root/.aws

# Copy dagster instance YAML to $DAGSTER_HOME
ENV DAGSTER_HOME=/opt/dagster/dagster_home
COPY dagster.yaml $DAGSTER_HOME

WORKDIR /opt/dagster/app
s

sashank

08/21/2020, 7:38 PM
That Dockerfile definitely looks right
Is there anyway you can ssh into the celery workers and check the dagster version?
For example, using
kubectl
, you can do:
Copy code
kubectl get pods
to get the name of the pod running the celery workers
Copy code
k exec {pod_name} dagster -- --version
To run the dagster version command in that pod
k

Kevin

08/24/2020, 4:59 PM
hi @sashank
k exec -n <namespace> <pod> dagster --version
dagster, version 0.9.3
it seems as though i have encountered a different issue with each of my attempts to fix it:
Copy code
FileNotFoundError: [Errno 2] No such file or directory: '/opt/dagster/app/celery_simple_pipeline.py'
  File "/usr/local/lib/python3.7/site-packages/dagster/cli/api.py", line 337, in _execute_run_command_body
    for event in execute_run_iterator(recon_pipeline, pipeline_run, instance):
  File "/usr/local/lib/python3.7/site-packages/dagster/core/execution/api.py", line 74, in execute_run_iterator
    step_keys_to_execute=pipeline_run.step_keys_to_execute,
  File "/usr/local/lib/python3.7/site-packages/dagster/core/execution/api.py", line 558, in create_execution_plan
    pipeline_def = pipeline.get_definition()
  File "/usr/local/lib/python3.7/site-packages/dagster/core/definitions/reconstructable.py", line 98, in get_definition
    self.repository.get_definition()
  File "/usr/local/lib/python3.7/site-packages/dagster/core/definitions/reconstructable.py", line 37, in get_definition
    return repository_def_from_pointer(self.pointer)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/definitions/reconstructable.py", line 341, in repository_def_from_pointer
    target = def_from_pointer(pointer)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/definitions/reconstructable.py", line 290, in def_from_pointer
    target = pointer.load_target()
  File "/usr/local/lib/python3.7/site-packages/dagster/core/code_pointer.py", line 250, in load_target
    module = load_python_file(self.python_file, self.working_directory)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/code_pointer.py", line 87, in load_python_file
    return import_module_from_path(module_name, python_file)
  File "/usr/local/lib/python3.7/site-packages/dagster/seven/__init__.py", line 115, in import_module_from_path
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 724, in exec_module
  File "<frozen importlib._bootstrap_external>", line 859, in get_code
  File "<frozen importlib._bootstrap_external>", line 916, in get_data
always something XD thank you again for your help!
s

sashank

08/24/2020, 5:20 PM
cc @alex @daniel any ideas?
Also @Kevin are you not running into the celery error anymore?
d

daniel

08/24/2020, 5:24 PM
Hi Kevin - just confirming, that file does in fact exist on the machine where the execution is happening (but Dagster is saying that it doesn't)?
and I guess another question (just understanding your setup) - you said you're trying to get gRPC to work, that means you have a gRPC server running as well I assume, is that in the same image as the workers that are doing the execution?
k

Kevin

08/24/2020, 6:16 PM
@daniel @sashank Should i exec onto the pod in question and perform:
dagster pipeline execute -f <pipeline.py>
it executes no problem, and the celery error doesn't appear anymore given my changes to the values.yaml and configmap-instance.yaml -- my attempts at fixing it any files you guys want to look at do let me know!
d

daniel

08/24/2020, 6:18 PM
thanks, could we see your workspace/dagster.yaml as well?
a

alex

08/24/2020, 6:19 PM
how are you deploying in to k8s, helm?
k

Kevin

08/24/2020, 6:30 PM
Hi 🙂 yes i am deploying via the helm chart from your repo -- with my own values.yaml here are the files 🙂
a

alex

08/24/2020, 6:34 PM
can we see what
values.yaml
you are using too?
k

Kevin

08/24/2020, 6:40 PM
ask and you shall receive ;)
a

alex

08/24/2020, 6:43 PM
ok some things that stand out to me are
userDeployments.enabled
is
false
and the
pipeline_run
section has been commented out
i think since you are trying to get grpc working - you want to set
userDeployments.enabled
to
true
k

Kevin

08/24/2020, 6:48 PM
Yeah, initially i tried setting
userDeployments.enabled
to
true
but that lead to the error i mentioned above of:
Copy code
[generate_number_0.compute]: (TypeError) - TypeError: _execute_step_k8s_job() got an unexpected keyword argument 'user_defined_k8s_config_dict'
i'll uncomment the pipeline_run section but i seem to remember an error when i did that
okay something different, but i can't say whether it is better 😄
Copy code
dagster.core.errors.DagsterSubprocessError: During celery execution errors occurred in workers:
[not_much_1.compute]: (ParameterCheckError) - dagster.check.ParameterCheckError: Param "text" is not a str. Got ApiException() which is type <class 'kubernetes.client.rest.ApiException'>.
a

alex

08/24/2020, 6:57 PM
so another concern i have is that if you are just using the “latest” tag - k8s isn’t going to redeploy the celery workers unless you are doing a full
helm uninstall
every time. Given the errors you have been seeing I am guessing this might be the case.
so I think you either want to: * use an explicit tagging scheme * do
helm uninstall
I believe a lot of the confusing errors from above have to do with stale nodes in the cluster since k8s doesn’t know you have published a new
genimind/dagster-kcore:latest
and so won’t redeploy the long standing services
dagit
or
celery
. The
alwaysPull: True
trick only works for the
job
resources since they pull the newest copy of
latest
when they are created for run/step execution.
k

Kevin

08/24/2020, 9:16 PM
hi @alex @sashank, @johann helped me with the issue -- it comes from multiple dagster instances trying to access the same resource in my redis cluster. Specifying a redis backend/broker number beyond the default '0' fixed the issue! thanks again for all the help!!
a

alex

08/24/2020, 9:53 PM
sweet glad that that worked out - we should maybe namespace our celery entries to defend against that as well
3 Views