# deployment-kubernetes
n
I did not see a way to configure a PV or PVC in the `dagster-user-deployments` chart. Should we manage it outside that Helm chart?
a
You need to use IOManagers to keep that data outside of the Op (S3 / GCS / etc…) https://docs.dagster.io/concepts/io-management/io-managers Otherwise, as soon as the k8s pod is gone, all the data on it is gone too.
n
Thanks Andrea. I know that I have to use an IOManager, but what's not clear to me is what the default implementation is if none is specified (in-memory or filesystem)?
Based on the error message, it looks like the filesystem one is the default IOManager. So I think it would make sense to be able to define a PVC through the Helm chart.
That PVC would be mounted at `$DAGSTER_HOME/storage` in the pod created by the Kubernetes Job.
Does that make sense, or am I missing something?
a
Which value of the Helm chart are you referring to exactly?
`volumes:` is mentioned multiple times in the values.yaml file. Anyway, I think that would give you a volume on the user-code deployments, but not in the jobs you run. I believe IOManagers are the best way to solve this, both in terms of scalability and complexity.
n
Hi Andrea, it looks like we're not understanding each other, so let me try to explain my thinking again.
When we define a job with two assets like this:
```python
import csv

import requests
from dagster import asset


@asset
def cereals():
    """All Cereals"""
    response = requests.get("https://docs.dagster.io/assets/cereal.csv")
    lines = response.text.split("\n")
    return [row for row in csv.DictReader(lines)]


@asset
def nabisco_cereals(cereals):
    """Cereals manufactured by Nabisco"""
    return [row for row in cereals if row["mfr"] == "N"]
```
but don't specifically define an IOManager, Dagster uses a default one. I guess data is persisted on disk by default, because the error message I got is
`No such file or directory`
If we materialize `cereals`, a Kubernetes Job starts a pod and stores the materialized asset on disk. If we then materialize `nabisco_cereals` in a second run, the second pod tries to read the `cereals` asset from disk, BUT because it's a different pod WITHOUT a persistent volume, it can't find the `cereals` asset.
To prevent that, we should mount a persistent volume in the pods.
At the moment the `dagster-user-deployments` chart allows mounting volumes BUT does not allow creating a volume using a PersistentVolumeClaim.
So I think the chart should allow to create a PVC like that
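(For illustration only: a rough sketch of the kind of PVC-backed setup being proposed here. The claim name, size, and mount path are hypothetical, not values the chart actually exposes.)

```yaml
# Hypothetical sketch: a PVC plus the volume/volumeMount that would be needed
# on the run pods to put $DAGSTER_HOME/storage on persistent storage.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dagster-storage          # hypothetical name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi              # hypothetical size
---
# The run pods would then need something like:
# volumes:
#   - name: dagster-storage
#     persistentVolumeClaim:
#       claimName: dagster-storage
# volumeMounts:
#   - name: dagster-storage
#     mountPath: /opt/dagster/dagster_home/storage
```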
a
Hi Nicolas 👋 I believe I understand your problem, but I think the solution you are suggesting is not optimal. Having a volume that moves around for every pipeline run is not recommended for a couple of reasons:
• Volumes need to be attached to the node where the pod is running. Every time a pod is scheduled on a different node, the volume must be attached to the new node. This takes time, and the detach/attach can happen for every step of your run.
• What about volume sizing? Is the amount of data you are generating even predictable in advance?
• What about parallel steps? You can't attach a volume to multiple pods (except for a few special volume types that tend to be NFS-like file shares, and are super slow).
For all the above reasons, I still think the best way of doing this is using IO managers. There is an IOManager for every major cloud provider. Instead of using local volumes, Dagster will transparently upload/download your data to/from the object storage of your choice. You won't need any volume magic and you won't face any of the problems mentioned above.
🙏 1
n
Fair enough! But I don't understand what the use of volume mounts in the `dagster-user-deployments` chart would be in that case.
Do we have to configure the IOManager at job level (in the code), or is it possible somehow to configure it when we configure the code location (using the `dagster-user-deployments` chart)?
a
> But I don't understand what the use of volume mounts in the `dagster-user-deployments` chart would be in that case
Well, a volume can also be a ConfigMap or a Secret! 🙂
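(For example, mounting a ConfigMap into the code-location container — a rough sketch; the code-location name, ConfigMap name, and mount path are hypothetical, and the exact placement of `volumes:`/`volumeMounts:` should be checked against the chart's values.yaml.)

```yaml
# Sketch only: a ConfigMap mounted into a user-code deployment, using the
# standard Kubernetes volume spec (verify the exact keys in values.yaml).
dagster-user-deployments:
  deployments:
    - name: my-code-location           # hypothetical code location
      # ... image, dagsterApiGrpcArgs, port, etc.
      volumes:
        - name: pipeline-config
          configMap:
            name: pipeline-config      # hypothetical ConfigMap
      volumeMounts:
        - name: pipeline-config
          mountPath: /opt/dagster/app/config
          readOnly: true
```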
n
True 😉
Has anyone run benchmarks on using an object store as the IOManager?
a
I don't know of any public benchmark… but in my experience it's pretty fast, since it uses the official libraries of the cloud provider. Also, you don't have the "will I have enough disk space for this file" problem… which is a huge plus for me 🙂 Which cloud provider are you using?
n
We are using GCP!
From your experience, should we have a single GCS bucket for all the pipelines, or create different ones per code location?
a
I use GCP as well 🙂 The single-bucket/multiple-bucket question is very subjective. If you have a lot of pipelines I would have multiple buckets, one per code location. If you are running fewer pipelines, a single bucket with different subfolders might be enough, but that's going to become an issue if you start using Dagster extensively.
n
I just saw we can define the IOManager at repository level here:
```python
import pandas as pd
from dagster import asset, repository, with_resources
from dagster_gcp.gcs import gcs_pickle_io_manager, gcs_resource


@asset
def asset1():
    # create df ...
    df = pd.DataFrame({"a": range(10)})  # placeholder data
    return df


@asset
def asset2(asset1):
    return asset1[:5]


@repository
def repo():
    return with_resources(
        [asset1, asset2],
        resource_defs={
            "io_manager": gcs_pickle_io_manager.configured(
                {"gcs_bucket": "my-cool-bucket", "gcs_prefix": "my-cool-prefix"}
            ),
            "gcs": gcs_resource,
        },
    )
```
D 1
Do you know if it's also possible using `Definitions`?
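(For reference, a minimal sketch of roughly the same wiring with `Definitions`, reusing the assets and GCS IO manager from the snippet above; the bucket and prefix are placeholders:)

```python
from dagster import Definitions
from dagster_gcp.gcs import gcs_pickle_io_manager, gcs_resource

# asset1 and asset2 defined as in the snippet above
defs = Definitions(
    assets=[asset1, asset2],
    resources={
        # "io_manager" is the default IO manager key used by all assets
        "io_manager": gcs_pickle_io_manager.configured(
            {"gcs_bucket": "my-cool-bucket", "gcs_prefix": "my-cool-prefix"}
        ),
        "gcs": gcs_resource,
    },
)
```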
n
Perfect 😛 Thanks a lot, I'm going to try that!
a
Happy to help !
n
One last question for you, since you are using GCP: did you configure the pods to use Workload Identity to access GCP services?
a
Yeah, I didn't have any particular issue with it. Are you having trouble?
n
No no, I just wanted to be sure I can set the service account with the Workload Identity annotation on the pods triggered by the K8s Jobs.
That's nice! Thanks again for your help.
a
Currently I use a single service account (with Workload Identity) for all the pods. Then you just add it here: https://github.com/dagster-io/dagster/blob/master/helm/dagster/values.yaml#L15 and it works 🙂 Multiple service accounts / workload identities should be manageable, but maybe a bit tricky.
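(Roughly, assuming the chart's serviceAccount block accepts annotations — the Kubernetes service account name, GCP service account, and project below are placeholders:)

```yaml
# Sketch only: values.yaml fragment binding the chart's service account to a
# GCP service account via GKE Workload Identity (all names are placeholders).
serviceAccount:
  create: true
  name: dagster                  # hypothetical KSA name
  annotations:
    iam.gke.io/gcp-service-account: dagster@my-project.iam.gserviceaccount.com
```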
n
Should it be set only at code-location level (here)?
a
That should work if you want to have a separate service account per code location, but I haven't tried it myself.
n
Oki. Thanks again !
a
Sure thing, I will send you a quick DM!
n
Sure !
Have a good one Andrea
👋 1