Ivan Rivera
08/24/2020, 7:14 AM

alex
08/24/2020, 10:38 PM

sandy
08/24/2020, 10:39 PM

Binh Pham
08/24/2020, 11:10 PM

from pyspark.sql import SparkSession

from dagster import resource

class PySparkResource(object):
    def __init__(self):
        # getOrCreate() picks up an already-active SparkSession
        # (e.g. the one databricks-connect provides) rather than always building a new one
        self.spark_session = SparkSession.builder.getOrCreate()

@resource
def pyspark_resource(_):
    return PySparkResource()
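For completeness, here's a rough sketch of how that resource could be wired into a pipeline with the 0.9-era ModeDefinition / solid API (count_rows and my_pipeline are just placeholder names for illustration):

from dagster import ModeDefinition, pipeline, solid

@solid(required_resource_keys={"pyspark"})
def count_rows(context):
    # Pull the shared SparkSession off the resource
    spark = context.resources.pyspark.spark_session
    context.log.info(f"row count: {spark.range(100).count()}")

@pipeline(mode_defs=[ModeDefinition(resource_defs={"pyspark": pyspark_resource})])
def my_pipeline():
    count_rows()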
databricks-connect requires that you not have any other version of pyspark installed, though: https://docs.databricks.com/dev-tools/databricks-connect.html#step-1-install-the-client
So, to be able to use intermediate storage and provide a serialization plan for PySpark DataFrames, I just copied dagster_pyspark/types.py into my project: https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster-pyspark/dagster_pyspark/types.py
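With that copied module in place, its DataFrame dagster type can annotate solid outputs so intermediate storage knows how to serialize them. A minimal sketch, assuming the file was copied to my_project/pyspark_types.py and exports DataFrame as the upstream module does (the path and load_numbers are just placeholders):

from dagster import OutputDefinition, solid

from my_project.pyspark_types import DataFrame  # copied dagster_pyspark/types.py

@solid(
    required_resource_keys={"pyspark"},
    output_defs=[OutputDefinition(DataFrame)],
)
def load_numbers(context):
    # Because the output is typed as DataFrame, the intermediate store
    # uses the type's serialization strategy to persist it between solids.
    return context.resources.pyspark.spark_session.range(10)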
So far this has worked for me. But happy to hear if there is a better way to get the best of both worlds. 🙂

Ivan Rivera
08/25/2020, 6:03 AM