Stupid question to start with sweat smile How you normally d dagster #announcements

Simon Späti

02/27/2020, 2:06 PM

Stupid question to start with 😅: How do you normally develop with spark on a Kubernetes environment? We have OpenShift with Kubernetes (spark, jupyter notebooks, kafka etc.) and we have a databricks cluster (another spark, for fast ramp-up time). You would deploy a dagster Kubernetes pod (https://dagster.readthedocs.io/en/latest/sections/deploying/k8s.html), right? But how would you develop if you're not into Vim? I'd like to use VS Code locally, but all our stuff is in Kubernetes. What is best practice? I have now started to use dagster in a databricks notebook (just to get the concept of solids and reusable patterns). But of course I cannot really visualise the dag etc., and programming inside a notebook is not really fun.... any hints are much appreciated 🙈

alex

02/27/2020, 9:19 PM

There's a lot to unpack in that question but I'll do my best. You should be able to start by authoring your dagster python code locally and getting a version running that way against some sample data. You may be able to use dagster-spark for spark and dagstermill for jupyter notebooks. Then, using the mode and resource abstractions, you can figure out how to make that pipeline work in your kubernetes infrastructure in addition to being runnable locally. You will build a docker image containing the pipeline code and potentially use dagster-k8s to deploy it.
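
To make the mode and resource idea above more concrete, here is a rough sketch in plain Python. It is not Dagster's actual API: every name in it (LocalStore, ObjectStore, MODES, run_pipeline, the bucket and file names) is hypothetical and only illustrates the pattern of keeping the pipeline logic unchanged while swapping what it talks to per environment.

```python
from typing import Callable, Dict, List


class LocalStore:
    """Hypothetical resource: returns bundled sample rows, no cluster needed."""

    def read(self, key: str) -> List[str]:
        return ["sample-row-1", "sample-row-2"]


class ObjectStore:
    """Hypothetical placeholder for an S3-backed reader used in the cluster."""

    def __init__(self, bucket: str) -> None:
        self.bucket = bucket

    def read(self, key: str) -> List[str]:
        # Assumption: a real S3 client would be wired in here.
        raise NotImplementedError("connect a real object store client")


# Each "mode" maps a resource name to a factory that builds that resource.
MODES: Dict[str, Dict[str, Callable[[], object]]] = {
    "local": {"store": LocalStore},
    "k8s": {"store": lambda: ObjectStore(bucket="hypothetical-bucket")},
}


def count_rows(resources: Dict[str, object], key: str) -> int:
    """Pipeline step: only knows the resource name, not its implementation."""
    return len(resources["store"].read(key))


def run_pipeline(mode: str, key: str) -> int:
    """Build the resources for the chosen mode, then run the step."""
    resources = {name: factory() for name, factory in MODES[mode].items()}
    return count_rows(resources, key)


if __name__ == "__main__":
    # Develop locally against sample data; the "k8s" mode reuses the same step.
    print(run_pipeline("local", "flights.csv"))
```

The point of the design is that count_rows never changes between environments. Only the mode passed to run_pipeline does, which is what lets the same code run from a local editor and later from a docker image in the cluster.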

Simon Späti

02/28/2020, 12:35 PM

thanks alex. I will try this. The problem is that all my data is on the object store and it's hard to create sample data for each file that I need. But I will try a bit more with your hints and get back to you in case it doesn't work at all. Thank you very much for the help!

nate

02/28/2020, 4:29 PM

what object store are you using?

Simon Späti

02/28/2020, 8:19 PM

hi nate, i’m using s3. We’re also using a gateway like minio or zenko, but to begin with i will use s3 directly. I used your airline example, which is a good start for me. I believe you also used spark there in a local environment. I couldn’t run it on my environment yet (the databricks extensions didn’t install, probably a proxy error on my windows/WSL installation at work 🙈. I’m trying now on my macbook to see if it works here..). But the airline example uses spark in local mode, right?

nate

02/28/2020, 8:19 PM

yep! airline is local spark

Simon Späti

02/28/2020, 8:24 PM

a local spark deployment, or does it just mimic it? 🤔 sorry for the beginner questions. But I’m trying to use S3 directly now, not via the file_cache, and hopefully deploy it to our kubernetes cluster.
