Ivan Rivera

08/24/2020, 7:14 AM
Hi all 🙂 Is anyone here using Dagster with Databricks? I’ve seen a dagster-databricks module, but it’s not very well documented, so I’d be keen to hear about your setup


08/24/2020, 10:38 PM
yeah, it’s very new and community-contributed, so we’re still working on getting it documented. I think referencing the tests is one thing you can do for now
cc @sandy


08/24/2020, 10:39 PM
hey @Ivan Rivera - happy to answer any questions you have on how it works. I believe @Binh Pham has used it and may have some takeaways

Binh Pham

08/24/2020, 11:10 PM
I was able to set up dagster-databricks and got it to run successfully following the simple_pyspark example. But errors happening on the Databricks cluster were not being sent back to Dagster, which is unfortunate because I wanted to use Dagster for monitoring purposes. Unsure if this is a limitation of dagster-databricks or of the Databricks run-now API. I ended up using databricks-connect and creating a simple resource for it:
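from dagster import resource
from pyspark.sql import SparkSession

class PySparkResource(object):
    def __init__(self):
        # With databricks-connect installed, this session is backed by the remote cluster.
        self.spark_session = SparkSession.builder.getOrCreate()

# Register the factory as a Dagster resource; the init context is unused.
@resource
def pyspark_resource(_):
    return PySparkResource()
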
databricks-connect requires that you don't have any other version of pyspark installed, though. So to be able to use intermediate storage and provide a serialization plan for PySpark DataFrames, I just copied dagster_pyspark/ into my project. So far this has worked for me. But happy to hear if there is a better way to get the best of both worlds. 🙂
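On the errors not being sent back: a rough sketch of one possible workaround, assuming the job is launched through the Databricks Jobs API. The idea is to poll the run state and raise inside the solid when the run does not succeed, so Dagster marks the step as failed. `get_run` is a hypothetical callable (not part of dagster-databricks) that wraps `GET /api/2.0/jobs/runs/get` and returns the parsed JSON; `wait_for_run` and `DatabricksRunFailed` are likewise made-up names for illustration.

```python
import time

# Life-cycle states after which a Databricks run will not change any further.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


class DatabricksRunFailed(Exception):
    """Raised when a run finishes with any result other than SUCCESS."""


def wait_for_run(get_run, run_id, poll_seconds=10.0, sleep=time.sleep):
    """Poll until the run reaches a terminal state; raise unless it succeeded.

    get_run: hypothetical callable taking a run_id and returning the parsed
    JSON body of the Jobs API "runs/get" response.
    """
    while True:
        state = get_run(run_id).get("state", {})
        life_cycle = state.get("life_cycle_state")
        if life_cycle in TERMINAL_STATES:
            result = state.get("result_state")
            # SKIPPED and INTERNAL_ERROR runs carry no SUCCESS result,
            # so they are treated as failures here too.
            if result != "SUCCESS":
                message = state.get("state_message", "")
                raise DatabricksRunFailed(
                    "run %s ended as %s/%s: %s"
                    % (run_id, life_cycle, result, message)
                )
            return state
        sleep(poll_seconds)
```

Calling `wait_for_run` from the solid that submits the job means an uncaught `DatabricksRunFailed` fails that step like any other exception, which should show up in Dagster's run view.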

Ivan Rivera

08/25/2020, 6:03 AM
Awesome, thank you @alex, @sandy and @Binh Pham! That’s plenty of info for me to get started with