# deployment-kubernetes
n
I did not see a way to configure a PV or PVC in the `dagster-user-deployments` chart. Should we manage it outside that Helm chart?
a
You need to use IOManagers to keep that data outside of the Op (S3 / GCS / etc…) https://docs.dagster.io/concepts/io-management/io-managers Otherwise, as soon as the k8s pod is gone, all the data on it is gone too.
n
Thanks Andrea. I know that I have to use an IOManager, but what's not clear to me is what the default implementation is if none is specified (in-memory or filesystem)?
Based on the error message, it looks like the filesystem one is the default IOManager. So I think it would make sense to be able to define a PVC through the Helm chart.
That PVC would be mounted at `$DAGSTER_HOME/storage` in the pod created by the Kubernetes Job.
Does that make sense, or am I missing something?
a
Which value of the Helm chart are you referring to exactly?
`volumes:` is mentioned multiple times in the values.yaml file. Anyway, I think that would give you a volume on the user-code deployments, but not in the jobs you run. I believe IOManagers are the best way to solve this, both in terms of scalability and complexity.
n
Hi Andrea, it looks like we're not understanding each other, so let me try to explain my thinking again.
When we define a job with two assets like this:
```python
import csv

import requests
from dagster import asset


@asset
def cereals():
    """All Cereals"""
    response = requests.get("https://docs.dagster.io/assets/cereal.csv")
    lines = response.text.split("\n")
    return [row for row in csv.DictReader(lines)]


@asset
def nabisco_cereals(cereals):
    """Cereals manufactured by Nabisco"""
    return [row for row in cereals if row["mfr"] == "N"]
```
but don't specifically define an IOManager, Dagster uses a default one. I guess data is persisted on disk by default, because the error message I got is
`No such file or directory`
If we materialize `cereals`, a Kubernetes Job starts a pod and stores the materialized asset on disk. If we then materialize `nabisco_cereals` in a second run, the second pod tries to read the `cereals` asset from disk, BUT because it's a different pod WITHOUT a persistent volume, it can't find the `cereals` asset.
To prevent that, we should mount a persistent volume in the pods.
At the moment the `dagster-user-deployments` chart allows mounting volumes BUT does not allow creating a volume using a PersistentVolumeClaim.
So I think the chart should allow to create a PVC like that
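(For illustration only: a rough sketch of the kind of PVC-backed setup being proposed here. The claim name, size, and mount path are hypothetical, not values the chart actually exposes.)

```yaml
# Hypothetical sketch: a PVC plus the volume/volumeMount that would be needed
# on the run pods to put $DAGSTER_HOME/storage on persistent storage.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dagster-storage          # hypothetical name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi              # hypothetical size
---
# The run pods would then need something like:
# volumes:
#   - name: dagster-storage
#     persistentVolumeClaim:
#       claimName: dagster-storage
# volumeMounts:
#   - name: dagster-storage
#     mountPath: /opt/dagster/dagster_home/storage
```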
a
Hi Nicolas 👋 I believe I understand your problem, but I think the solution you are suggesting is not optimal. Having a volume that moves around for every pipeline run is not recommended for a couple of reasons:
• Volumes need to be attached to the node where the pod is running. Every time a pod is scheduled on a different node, the volume must be attached to the new node. This takes time, and the detach/attach can happen for every step of your run.
• What about volume sizing? Is the amount of data you are generating even predictable in advance?
• What about parallel steps? You can't attach a volume to multiple pods (except for a few special volume types that tend to be NFS-like file shares, and are super slow).
For all the above reasons, I still think the best way of doing this is using IO managers. There is an IOManager for every major cloud provider. Instead of using local volumes, Dagster will transparently upload/download your data to/from the object storage of your choice. You won't need any volume magic and you won't face any of the problems mentioned above.
🙏 1
n
Fair enough! But I don't understand what the use of volume mounts in the `dagster-user-deployments` chart would be in that case.
Do we have to configure the IOManager at job level (in the code), or is it possible somehow to configure it when we configure the code location (using the `dagster-user-deployments` chart)?
a
> But I don't understand what the use of volume mounts in the `dagster-user-deployments` chart would be in that case
Well, a volume can also be a ConfigMap or a Secret! 🙂
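(For example, mounting a ConfigMap into the code-location container — a rough sketch; the code-location name, ConfigMap name, and mount path are hypothetical, and the exact placement of `volumes:`/`volumeMounts:` should be checked against the chart's values.yaml.)

```yaml
# Sketch only: a ConfigMap mounted into a user-code deployment, using the
# standard Kubernetes volume spec (verify the exact keys in values.yaml).
dagster-user-deployments:
  deployments:
    - name: my-code-location           # hypothetical code location
      # ... image, dagsterApiGrpcArgs, port, etc.
      volumes:
        - name: pipeline-config
          configMap:
            name: pipeline-config      # hypothetical ConfigMap
      volumeMounts:
        - name: pipeline-config
          mountPath: /opt/dagster/app/config
          readOnly: true
```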
n
True 😉
Has anyone run benchmarks on using an object store as the IOManager?
a
I don't know of any public benchmark… but in my experience it's pretty fast, since it uses the official libraries of the cloud provider. Also, you don't have the "will I have enough disk space for this file" problem… which is a huge plus for me 🙂 Which cloud provider are you using?
n
We are using GCP!
From your experience, should we have a single GCS bucket for all the pipelines, or create different ones per code location?
a
I use GCP as well 🙂 The single-bucket/multiple-bucket question is very subjective. If you have a lot of pipelines I would have multiple buckets, one per code location. If you are running fewer pipelines, a single bucket with different subfolders might be enough, but that's going to become an issue if you start using Dagster extensively.
n
I just saw we can define the IOManager at repository level here:
```python
import pandas as pd
from dagster import asset, repository, with_resources
from dagster_gcp.gcs import gcs_pickle_io_manager, gcs_resource


@asset
def asset1():
    # create df ...
    df = pd.DataFrame({"a": range(10)})  # placeholder data
    return df


@asset
def asset2(asset1):
    return asset1[:5]


@repository
def repo():
    return with_resources(
        [asset1, asset2],
        resource_defs={
            "io_manager": gcs_pickle_io_manager.configured(
                {"gcs_bucket": "my-cool-bucket", "gcs_prefix": "my-cool-prefix"}
            ),
            "gcs": gcs_resource,
        },
    )
```
D 1
Do you know if it's also possible using `Definitions`?
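(For reference, a minimal sketch of roughly the same wiring with `Definitions`, reusing the assets and GCS IO manager from the snippet above; the bucket and prefix are placeholders:)

```python
from dagster import Definitions
from dagster_gcp.gcs import gcs_pickle_io_manager, gcs_resource

# asset1 and asset2 defined as in the snippet above
defs = Definitions(
    assets=[asset1, asset2],
    resources={
        # "io_manager" is the default IO manager key used by all assets
        "io_manager": gcs_pickle_io_manager.configured(
            {"gcs_bucket": "my-cool-bucket", "gcs_prefix": "my-cool-prefix"}
        ),
        "gcs": gcs_resource,
    },
)
```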
n
Perfect 😛 Thanks a lot, I'm going to try that!
a
Happy to help !
n
One last question for you, since you are using GCP: did you configure the pods to use Workload Identity to access GCP services?
a
Yeah, I didn't have any particular issue with it. Are you having trouble?
n
No no, I just wanted to be sure I can set the service account with the Workload Identity annotation on the pods triggered by the K8s Jobs.
That's nice! Thanks again for your help.
a
Currently I use a single service account (with Workload Identity) for all the pods. Then you just add it here: https://github.com/dagster-io/dagster/blob/master/helm/dagster/values.yaml#L15 and it works 🙂 Multiple service accounts / workload identities should be manageable, but maybe a bit tricky.
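(Roughly, assuming the chart's serviceAccount block accepts annotations — the Kubernetes service account name, GCP service account, and project below are placeholders:)

```yaml
# Sketch only: values.yaml fragment binding the chart's service account to a
# GCP service account via GKE Workload Identity (all names are placeholders).
serviceAccount:
  create: true
  name: dagster                  # hypothetical KSA name
  annotations:
    iam.gke.io/gcp-service-account: dagster@my-project.iam.gserviceaccount.com
```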
n
Should it be set only at code-location level (here)?
a
That should work if you want to have a separate service account per code location, but I haven't tried it myself.
n
Oki. Thanks again !
a
Sure thing, I will send you a quick DM!
n
Sure !
Have a good one Andrea
👋 1