sephi
05/18/2020, 10:00 AM
We are trying to run a spark UDF function on a yarn cluster, but are having trouble configuring the spark resource.
Since we are running on Anaconda, we are trying to use the instructions from conda to configure Spark.
When running outside of dagster, we can use the spark.yarn.dist.archives configuration pointing to an HDFS folder that is accessible to all the workers.
We tried to define the spark.submit.pyFiles resource config but we are still getting the error:
Cannot run program "path/to/python/env/bin/python3.7": error=2, No such file or directory
Additionally, we tried spark.driver.extraLibraryPath but are still getting the same error.
What would be the best way to implement the PYSPARK_PYTHON and archives variables? Is it through parse_spark_configs.py?
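
For reference, a minimal sketch of this kind of archives-based setup outside of dagster, assuming YARN client mode and a conda-pack style archive already uploaded to HDFS. The archive path, the #py37_env alias, the relative interpreter path, and the executorEnv setting are illustrative, not values taken from this thread:

from pyspark.sql import SparkSession

# Illustrative only: a packed conda environment already uploaded to HDFS.
# The "#py37_env" suffix is the directory name YARN unpacks the archive into.
archive = "hdfs:///shared/envs/py37_env.tar.gz#py37_env"

spark = (
    SparkSession.builder
    .master("yarn")
    # Ship the packed environment to every YARN container.
    .config("spark.yarn.dist.archives", archive)
    # Point executors at the Python inside the unpacked archive;
    # the path is relative to each container's working directory.
    .config("spark.executorEnv.PYSPARK_PYTHON", "./py37_env/bin/python")
    .getOrCreate()
)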

sandy
05/18/2020, 4:09 PM

sephi
05/18/2020, 6:58 PM
We tried setting PYSPARK_PYTHON in the python file using the os module in the python script.
I can try tomorrow to set PYSPARK_PYTHON in a different way, possibly using spark.pyspark.virtualenv.enabled.

sandy
05/18/2020, 7:01 PM
Setting PYSPARK_PYTHON in the python file won't work, because PYSPARK_PYTHON is needed before the file is executed. Are you able to set it on the command line or in whatever script you use to launch dagster?
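
A sketch of one way to do that with a small Python launcher, assuming the usual dagster/dagit command is simply wrapped; the interpreter path and the dagit invocation are illustrative:

import os
import subprocess

# Set the interpreter before any Spark/JVM process is started;
# exporting it on the command line before launching works the same way.
os.environ["PYSPARK_PYTHON"] = "/path/to/python/env/bin/python3.7"

# Hand off to whatever normally launches dagster; the child process
# inherits the environment variable set above (command is illustrative).
subprocess.run(["dagit", "-f", "my_pipelines.py"], check=True)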

sephi
05/18/2020, 7:06 PM
We are using conda - the documentation from conda suggests using the archives config variable. We need to set the environment to a location that is accessible by all the workers (which in our case is HDFS).
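
A sketch of the packing step the conda docs describe, assuming conda-pack is available; the environment name, output file, and HDFS destination are illustrative:

import conda_pack

# Pack the conda environment into a relocatable archive.
conda_pack.pack(name="py37_env", output="py37_env.tar.gz")

# The archive then has to live somewhere every worker can read it,
# e.g. copied to HDFS before spark.yarn.dist.archives points at it:
#   hdfs dfs -put py37_env.tar.gz /shared/envs/py37_env.tar.gz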

sandy
05/18/2020, 7:12 PM
spark.yarn.dist.archives could be added to python_modules/automation/automation/parse_spark_configs.py

sephi
05/18/2020, 7:16 PM

sandy
05/18/2020, 8:17 PM

sephi
05/19/2020, 4:02 AM

sandy
05/19/2020, 4:11 AM
You should be able to set spark.yarn.dist.archives even without it being explicitly defined within dagster. E.g. configuring pyspark as follows should work:
resources:
  pyspark:
    config:
      spark_conf:
        spark.yarn.dist.archives: <some value of your choosing>

sephi
06/16/2020, 9:46 AM