# ask-community
I have dagster deployed in our kubernetes cluster. I want to run a particular job with a different image than usual. I tried adding the `image` tag inside the `container_config` of the graph. The job fails with:
```
dagster.core.errors.DagsterInvariantViolationError: Could not find pipeline 'scale_model_training'. Found: .
  File "/home/ubuntu/pyenv/versions/3.9.8/lib/python3.9/site-packages/dagster/grpc/impl.py", line 82, in core_execute_run
    recon_pipeline.get_definition()
  File "/home/ubuntu/pyenv/versions/3.9.8/lib/python3.9/site-packages/dagster/core/definitions/reconstruct.py", line 180, in get_definition
    defn = self.repository.get_definition().get_pipeline(self.pipeline_name)
  File "/home/ubuntu/pyenv/versions/3.9.8/lib/python3.9/site-packages/dagster/core/definitions/repository_definition.py", line 1102, in get_pipeline
    return self._repository_data.get_pipeline(name)
  File "/home/ubuntu/pyenv/versions/3.9.8/lib/python3.9/site-packages/dagster/core/definitions/repository_definition.py", line 850, in get_pipeline
    return self._pipelines.get_definition(pipeline_name)
  File "/home/ubuntu/pyenv/versions/3.9.8/lib/python3.9/site-packages/dagster/core/definitions/repository_definition.py", line 155, in get_definition
    raise DagsterInvariantViolationError(
```
where `scale_model_training` is the name of my job. Any ideas what could be wrong?
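For context, this is roughly what I had tried, as a sketch (the image name is made up; the real values come from my deployment):
```python
# Sketch of the failing attempt: overriding the container image for this one job
# via the dagster-k8s/config tag. The image name below is a placeholder.
scale_model_training_job = scale_model_training_graph.to_job(
    name="scale_model_training",
    tags={
        "dagster-k8s/config": {
            "container_config": {
                "image": "my-registry/scale-model-training:gpu",
            },
        },
    },
)
```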
I tried another way: to add `executor_def=k8s_job_executor` to the job definition and then specify which image to use in the run config. That works.
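For reference, the per-run image override looked roughly like this (a sketch; `job_image` is the k8s_job_executor config field I used, and the image name is a placeholder):
```python
# Sketch of the run config that worked for me: the k8s_job_executor accepts a
# job_image field, so the image can be swapped per run at launch time.
run_config = {
    "execution": {
        "config": {
            "job_image": "my-registry/scale-model-training:gpu",  # placeholder image
        }
    }
}
```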
Hmm, now I have a different problem though. The graph has a `pod_spec_config` with a `toleration` because I need to run it on a specific node pool. When I run the job, it creates the pod for the job on the correct node pool. However, the “step” job doesn’t seem to keep those graph annotations 😢 so the “step” job is scheduled on the default node pool.
I tried setting the tolerations via the `tags` argument in `.to_job()`, but it doesn’t seem to have an effect. The job is still scheduled on a default pod. Any idea what I can do?
```python
scale_model_training_job = scale_model_training_graph.to_job(
    name="scale_model_training",
    config=config_from_files(
        [
            file_relative_path(__file__, "scale_model_training.yaml"),
        ]
    ),
    executor_def=k8s_job_executor,
    tags={
        "dagster-k8s/config": {
            "container_config": {
                "resources": {
                    "requests": {"memory": "10Gi"},
                    "limits": {"memory": "10Gi"},
                },
            },
            "pod_spec_config": {
                "tolerations": [
                    {"key": "<http://nvidia.com/gpu|nvidia.com/gpu>", "operator": "Equal", "value": "present", "effect": "NoSchedule"}
                ],
            },
            "job_spec_config": {"ttl_seconds_after_finished": 3600},
        },
    },
)
```
Ok, if I put the same tags on the op then it works!
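Something along these lines, as a sketch (the op name and signature are placeholders; the tag values mirror the job-level tags above):
```python
from dagster import op

# Sketch: putting the same dagster-k8s/config tag on the op itself so the
# per-step Kubernetes job gets the resources and toleration, not just the run job.
@op(
    tags={
        "dagster-k8s/config": {
            "container_config": {
                "resources": {
                    "requests": {"memory": "10Gi"},
                    "limits": {"memory": "10Gi"},
                },
            },
            "pod_spec_config": {
                "tolerations": [
                    {
                        "key": "nvidia.com/gpu",
                        "operator": "Equal",
                        "value": "present",
                        "effect": "NoSchedule",
                    }
                ],
            },
            "job_spec_config": {"ttl_seconds_after_finished": 3600},
        },
    },
)
def scale_model_training_op(context):
    ...
```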