Hi, we want to run a Spark UDF on a YARN cluster, but we're having trouble configuring the Spark resource. Since we are running on Anaconda, we are trying to use the instructions from conda to configure Spark. When running outside of Dagster we can use the `spark.yarn.dist.archives` configuration pointing to an HDFS folder that is accessible to all the workers. We tried to define the `spark.submit.pyFiles` resource config, but we are still getting the error:

> Cannot run program "path/to/python/env/bin/python3.7" error=2, No such file or directory

We also tried `spark.driver.extraLibraryPath`, but we get the same error. What would be the best way to implement the `PYSPARK_PYTHON` and `archives` variables? Is it through `parse_spark_configs.py`?

Hi @sephi - if I understand correctly, the main error you're hitting is that "Cannot run program" error? Are you executing the Dagster pipeline within the YARN cluster? If so, are you able to set `PYSPARK_PYTHON` in your environment?

Yes, we are running within a YARN cluster. We have set `PYSPARK_PYTHON` in the Python file using the `os` module. I can try tomorrow to set `PYSPARK_PYTHON` a different way, possibly using `spark.pyspark.virtualenv.enabled`.

Setting `PYSPARK_PYTHON` in the Python file won't work, because `PYSPARK_PYTHON` is needed before the file is executed. Are you able to set it on the command line or in whatever script you use to launch Dagster?

Since we are running with conda, the conda documentation suggests using the `archives` config variable. We need to put the environment in a location that is accessible to all the workers (which in our case is HDFS).
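
For reference, the pattern the conda docs describe (a conda-pack'd environment shipped from HDFS) boils down to roughly two settings. A minimal sketch in raw Spark conf terms - the HDFS path and the `environment` alias are placeholders, and `spark.pyspark.python` is used here as the config equivalent of the `PYSPARK_PYTHON` environment variable:

```yaml
# Sketch of the conda-pack pattern (placeholder values).
# YARN unpacks the archive under the "#environment" alias in each
# container's working directory.
spark.yarn.dist.archives: hdfs:///path/to/env.tar.gz#environment
# Point PySpark at the interpreter inside the unpacked archive
# (equivalent to setting PYSPARK_PYTHON before launch).
spark.pyspark.python: ./environment/bin/python
```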

Ah, I see - so the error you're experiencing is on the executor side. Yes, I think the right approach would be to add `spark.yarn.dist.archives` to `python_modules/automation/automation/parse_spark_configs.py`.

Are you interested in a PR? If so, please define the scope.

I filed an issue for this: https://github.com/dagster-io/dagster/issues/2473. If you have the bandwidth to create a PR, that would definitely be appreciated. If not, let me know and I can look into it.

Where is the code that you are using for "parsing the Spark documentation"?

Circling back on this, I dug a little more and realized that the Spark config is permissive, so you should be able to use `spark.yarn.dist.archives` even without it being explicitly defined within Dagster. E.g., configuring pyspark as follows should work:

```yaml
resources:
  pyspark:
    config:
      spark_conf:
        spark.yarn.dist.archives: <some value of your choosing>
```
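
For anyone landing here later: combining that with the conda-pack pattern above, a fuller (hypothetical) version of the resource config might look like the sketch below - the archive path and the `environment` alias are placeholders, not values from this thread:

```yaml
resources:
  pyspark:
    config:
      spark_conf:
        # Ship the packed conda env from HDFS; YARN unpacks it under the
        # "#environment" alias in each container's working directory.
        spark.yarn.dist.archives: hdfs:///path/to/env.tar.gz#environment
        # Point PySpark at the interpreter inside the unpacked archive.
        spark.pyspark.python: ./environment/bin/python
```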
Thx - it works as you suggested!