# ask-community
d
Hi! I am trying to integrate PySpark code into Dagster so that I can launch steps on EMR. I am following the example provided in the docs. However, I am getting the following error:
Initialization of resources [pyspark, io_manager] failed.
RuntimeError: Java gateway process exited before sending its port number
File "/usr/local/lib/python3.7/site-packages/dagster/_core/errors.py", line 188, in user_code_error_boundary
yield
File "/usr/local/lib/python3.7/site-packages/dagster/_core/execution/resources_init.py", line 326, in single_resource_event_generator
if is_context_provided(resource_def.resource_fn)
File "/usr/local/lib/python3.7/site-packages/dagster_pyspark/resources.py", line 54, in pyspark_resource
return PySparkResource(init_context.resource_config["spark_conf"])
File "/usr/local/lib/python3.7/site-packages/dagster_pyspark/resources.py", line 21, in __init__
self._spark_session = spark_session_from_config(spark_conf)
File "/usr/local/lib/python3.7/site-packages/dagster_pyspark/resources.py", line 16, in spark_session_from_config
return builder.getOrCreate()
File "/usr/local/lib/python3.7/site-packages/pyspark/sql/session.py", line 269, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/usr/local/lib/python3.7/site-packages/pyspark/context.py", line 483, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/usr/local/lib/python3.7/site-packages/pyspark/context.py", line 195, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/usr/local/lib/python3.7/site-packages/pyspark/context.py", line 417, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/usr/local/lib/python3.7/site-packages/pyspark/java_gateway.py", line 106, in launch_gateway
raise RuntimeError("Java gateway process exited before sending its port number")
Do I need to install Java in my user code deployment?
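For context, the wiring follows the docs example and looks roughly like this (a sketch from memory; make_people, my_pyspark_job, and the spark_conf values are placeholders):
from dagster import job, op
from dagster_pyspark import pyspark_resource

@op(required_resource_keys={"pyspark"})
def make_people(context):
    # Grabbing the session is what launches the Java gateway that fails above
    spark = context.resources.pyspark.spark_session
    return spark.createDataFrame([("Alice", 30)], ["name", "age"])

@job(
    resource_defs={
        "pyspark": pyspark_resource.configured(
            {"spark_conf": {"spark.executor.memory": "2g"}}
        )
    }
)
def my_pyspark_job():
    make_people()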
m
I have these lines in my Dockerfile to use Spark:
# Point SPARK_HOME at the pip-installed pyspark package
ENV SPARK_HOME=/usr/local/lib/python3.10/site-packages/pyspark
# Add the AWS SDK and hadoop-aws jars so Spark can talk to S3 via s3a://
RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.1026/aws-java-sdk-bundle-1.11.1026.jar \
    -P $SPARK_HOME/jars
RUN wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.2/hadoop-aws-3.3.2.jar \
    -P $SPARK_HOME/jars
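To sanity-check the image, a throwaway local session works (a minimal sketch; the app name is arbitrary):
# Smoke test: if Java is missing from the image, this raises the same
# "Java gateway process exited before sending its port number" error
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("gateway-smoke-test")
    .getOrCreate()
)
print(spark.version)
spark.stop()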
c
^ did that resolve it for you?
d
I did this, but I also installed Java in the Dockerfile. I haven't tried it without installing Java. I can remove the Java installation, try it that way, and let you know 🙂
@chris I had to add the following to get it to work. It wouldn't work without installing Java:
# Install wget and a JDK; the pip-installed pyspark needs a local JVM to launch its gateway
RUN apt update -y && apt upgrade -y && apt install wget openjdk-11-jdk -y
ENV SPARK_HOME=/usr/local/lib/python3.7/site-packages/pyspark
# Same AWS jars as above, for s3a:// access
RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.1026/aws-java-sdk-bundle-1.11.1026.jar \
    -P $SPARK_HOME/jars
RUN wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.2/hadoop-aws-3.3.2.jar \
    -P $SPARK_HOME/jars
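Makes sense in hindsight: the pyspark package from PyPI ships the Python bindings and Spark jars but no JRE, so the container needs its own Java for Py4J to launch the gateway. A quick check inside the container (just a sketch):
# If this prints None, PySpark's launch_gateway will fail with the error above
import shutil
print(shutil.which("java"))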
m
👍