# announcements
s
Stupid question to start with 😅: How do you normally develop with Spark in a Kubernetes environment? We have OpenShift with Kubernetes (Spark, Jupyter notebooks, Kafka etc.) and we have a Databricks cluster (another Spark, for fast ramp-up time). You would deploy a dagster Kubernetes pod (https://dagster.readthedocs.io/en/latest/sections/deploying/k8s.html), right? But how would you develop if you're not into Vim? I'd like to use VS Code locally, but all our stuff is in Kubernetes. What is best practice? I have now started to use dagster in a Databricks notebook (just to get the concept of solids and reusable patterns). But of course I cannot really visualise the DAG etc., and programming inside a notebook is not really fun... Any hints are much appreciated 🙈
a
There's a lot to unpack in that question, but I'll do my best. You should be able to start by authoring your dagster Python code locally and getting a version running that way against some sample data. You may be able to use `dagster-spark` for Spark and `dagstermill` for Jupyter notebooks. Then, using the `mode` and `resource` abstractions, you can figure out how to make that pipeline work in your Kubernetes infrastructure in addition to being runnable locally. You will build a Docker image containing the pipeline code and potentially use `dagster-k8s` to deploy it.
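To make the mode/resource idea a bit more concrete, here is a rough sketch in plain Python. It deliberately does not use dagster's actual API: `LocalStorage`, `ObjectStoreStorage`, `MODES`, `run_pipeline`, the bucket name and the file key are all hypothetical names made up for illustration. The point is only that the business logic talks to a resource interface, and the selected mode decides which implementation sits behind it.

```python
from typing import Callable, Dict, List


class LocalStorage:
    """Hypothetical resource: returns small in-memory sample rows for laptop runs."""

    def read_rows(self, key: str) -> List[dict]:
        return [{"key": key, "value": 1}, {"key": key, "value": 2}]


class ObjectStoreStorage:
    """Hypothetical resource: placeholder for an object-store-backed reader."""

    def __init__(self, bucket: str) -> None:
        self.bucket = bucket

    def read_rows(self, key: str) -> List[dict]:
        # Assumption: a real implementation would fetch the object from the bucket.
        raise NotImplementedError("wire up an object store client here")


# Each mode maps a resource name to a factory that builds that resource.
MODES: Dict[str, Dict[str, Callable[[], object]]] = {
    "local": {"storage": lambda: LocalStorage()},
    "k8s": {"storage": lambda: ObjectStoreStorage(bucket="my-bucket")},
}


def total_value(resources: Dict[str, object], key: str) -> int:
    """Plays the role of a solid: it only knows the resource interface."""
    rows = resources["storage"].read_rows(key)
    return sum(row["value"] for row in rows)


def run_pipeline(mode: str, key: str) -> int:
    """Builds the resources for the chosen mode, then runs the logic."""
    resources = {name: factory() for name, factory in MODES[mode].items()}
    return total_value(resources, key)


if __name__ == "__main__":
    print(run_pipeline("local", "flights.csv"))
```

The same `total_value` logic runs unchanged in both modes; only the entry in `MODES` differs. In dagster itself, the resources and modes would be declared with the library's own definitions rather than a plain dict.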
s
Thanks Alex, I will try this. The problem is that all my data is in the object store, and it's hard to create sample data for each file that I need. But I will try a bit more with your hints and get back to you in case it doesn't work at all. Thank you very much for the help!
n
what object store are you using?
s
Hi Nate, I'm using S3. We also use a gateway like MinIO or Zenko, but to begin with I will use S3 directly. I used your airline example, which is a good start for me. I believe you also used Spark there in a local environment. It didn't run in my environment yet (the Databricks extensions didn't install, probably a proxy error on my Windows/WSL installation at work 🙈. I'm trying now on my MacBook to see if it works here.) But the airline example uses Spark in local mode, right?
n
yep! airline is local spark
s
A local Spark deployment, or does it just mimic one? 🤔 Sorry for the beginner questions. But I'm trying to use S3 directly now, not via the `file_cache`, and hopefully deploy it to our Kubernetes cluster.