# ask-community
I have dagster deployed in our kubernetes cluster. I want to run a particular job with a different image than usual. I tried adding the `image` tag inside the `container_config` of the graph. The job fails with:
```
dagster.core.errors.DagsterInvariantViolationError: Could not find pipeline 'scale_model_training'. Found: .
  File "/home/ubuntu/pyenv/versions/3.9.8/lib/python3.9/site-packages/dagster/grpc/impl.py", line 82, in core_execute_run
    recon_pipeline.get_definition()
  File "/home/ubuntu/pyenv/versions/3.9.8/lib/python3.9/site-packages/dagster/core/definitions/reconstruct.py", line 180, in get_definition
    defn = self.repository.get_definition().get_pipeline(self.pipeline_name)
  File "/home/ubuntu/pyenv/versions/3.9.8/lib/python3.9/site-packages/dagster/core/definitions/repository_definition.py", line 1102, in get_pipeline
    return self._repository_data.get_pipeline(name)
  File "/home/ubuntu/pyenv/versions/3.9.8/lib/python3.9/site-packages/dagster/core/definitions/repository_definition.py", line 850, in get_pipeline
    return self._pipelines.get_definition(pipeline_name)
  File "/home/ubuntu/pyenv/versions/3.9.8/lib/python3.9/site-packages/dagster/core/definitions/repository_definition.py", line 155, in get_definition
    raise DagsterInvariantViolationError(
```
where `scale_model_training` is the name of my job. Any ideas what could be wrong?
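For context, this is roughly what I had tried, as a sketch (the image name is made up; the real values come from my deployment):
```python
# Sketch of the failing attempt: overriding the container image for this one job
# via the dagster-k8s/config tag. The image name below is a placeholder.
scale_model_training_job = scale_model_training_graph.to_job(
    name="scale_model_training",
    tags={
        "dagster-k8s/config": {
            "container_config": {
                "image": "my-registry/scale-model-training:gpu",
            },
        },
    },
)
```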
I tried another way: to add `executor_def=k8s_job_executor` to the job definition and then specify which image to use in the run config. That works.
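For reference, the per-run image override looked roughly like this (a sketch; `job_image` is the k8s_job_executor config field I used, and the image name is a placeholder):
```python
# Sketch of the run config that worked for me: the k8s_job_executor accepts a
# job_image field, so the image can be swapped per run at launch time.
run_config = {
    "execution": {
        "config": {
            "job_image": "my-registry/scale-model-training:gpu",  # placeholder image
        }
    }
}
```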
Hmm, now I have a different problem though. The graph has a `pod_spec_config` with a `toleration` because I need to run it on a specific node pool. When I run the job, it creates the pod for the job on the correct node pool. However, the “step” job doesn’t seem to keep those graph annotations 😢 so the “step” job is scheduled on the default node pool.
I tried setting the tolerations via the `tags` argument in `.to_job()`, but it doesn’t seem to have an effect. The job is still scheduled on a default pod. Any idea what I can do?
```python
scale_model_training_job = scale_model_training_graph.to_job(
    name="scale_model_training",
    config=config_from_files(
        [
            file_relative_path(__file__, "scale_model_training.yaml"),
        ]
    ),
    executor_def=k8s_job_executor,
    tags={
        "dagster-k8s/config": {
            "container_config": {
                "resources": {
                    "requests": {"memory": "10Gi"},
                    "limits": {"memory": "10Gi"},
                },
            },
            "pod_spec_config": {
                "tolerations": [
                    {"key": "<http://nvidia.com/gpu|nvidia.com/gpu>", "operator": "Equal", "value": "present", "effect": "NoSchedule"}
                ],
            },
            "job_spec_config": {"ttl_seconds_after_finished": 3600},
        },
    },
)
```
Ok, if I put the same tags on the op then it works!
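Something along these lines, as a sketch (the op name and signature are placeholders; the tag values mirror the job-level tags above):
```python
from dagster import op

# Sketch: putting the same dagster-k8s/config tag on the op itself so the
# per-step Kubernetes job gets the resources and toleration, not just the run job.
@op(
    tags={
        "dagster-k8s/config": {
            "container_config": {
                "resources": {
                    "requests": {"memory": "10Gi"},
                    "limits": {"memory": "10Gi"},
                },
            },
            "pod_spec_config": {
                "tolerations": [
                    {
                        "key": "nvidia.com/gpu",
                        "operator": "Equal",
                        "value": "present",
                        "effect": "NoSchedule",
                    }
                ],
            },
            "job_spec_config": {"ttl_seconds_after_finished": 3600},
        },
    },
)
def scale_model_training_op(context):
    ...
```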