Hi, we want to run a Spark UDF on a YARN cluster, but we're having trouble configuring the Spark resource. Since we are running on Anaconda, we are trying to use the instructions from conda to configure Spark. When running outside of Dagster we can use the `spark.yarn.dist.archives` configuration pointing to an HDFS folder that is accessible to all the workers. We tried to define the `spark.submit.pyFiles` resource config, but we are still getting the error:

> Cannot run program "path/to/python/env/bin/python3.7" error=2, No such file or directory

We also tried `spark.driver.extraLibraryPath`, but we get the same error. What would be the best way to implement the `PYSPARK_PYTHON` and `archives` variables? Is it through `parse_spark_configs.py`?

Hi @sephi - if I understand correctly, the main error you're hitting is that "Cannot run program" error? Are you executing the Dagster pipeline within the YARN cluster? If so, are you able to set `PYSPARK_PYTHON` in your environment?

Yes, we are running within a YARN cluster. We have set `PYSPARK_PYTHON` in the Python file using the `os` module. I can try tomorrow to set `PYSPARK_PYTHON` a different way, possibly using `spark.pyspark.virtualenv.enabled`.

Setting `PYSPARK_PYTHON` in the Python file won't work, because `PYSPARK_PYTHON` is needed before the file is executed. Are you able to set it on the command line or in whatever script you use to launch Dagster?

Since we are running with conda, the conda documentation suggests using the `archives` config variable. We need to put the environment in a location that is accessible to all the workers (which in our case is HDFS).
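
For reference, the pattern the conda docs describe (a conda-pack'd environment shipped from HDFS) boils down to roughly two settings. A minimal sketch in raw Spark conf terms - the HDFS path and the `environment` alias are placeholders, and `spark.pyspark.python` is used here as the config equivalent of the `PYSPARK_PYTHON` environment variable:

```yaml
# Sketch of the conda-pack pattern (placeholder values).
# YARN unpacks the archive under the "#environment" alias in each
# container's working directory.
spark.yarn.dist.archives: hdfs:///path/to/env.tar.gz#environment
# Point PySpark at the interpreter inside the unpacked archive
# (equivalent to setting PYSPARK_PYTHON before launch).
spark.pyspark.python: ./environment/bin/python
```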

Ah, I see - so the error you're experiencing is on the executor side. Yes, I think the right approach would be to add `spark.yarn.dist.archives` to `python_modules/automation/automation/parse_spark_configs.py`.

Are you interested in a PR? If so, please define the scope.

I filed an issue for this: https://github.com/dagster-io/dagster/issues/2473. If you have the bandwidth to create a PR, that would definitely be appreciated. If not, let me know and I can look into it.

Where is the code that you are using for "parsing the Spark documentation"?

Circling back on this, I dug a little more and realized that the Spark config is permissive, so you should be able to use `spark.yarn.dist.archives` even without it being explicitly defined within Dagster. E.g., configuring pyspark as follows should work:

```yaml
resources:
  pyspark:
    config:
      spark_conf:
        spark.yarn.dist.archives: <some value of your choosing>
```
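
For anyone landing here later: combining that with the conda-pack pattern above, a fuller (hypothetical) version of the resource config might look like the sketch below - the archive path and the `environment` alias are placeholders, not values from this thread:

```yaml
resources:
  pyspark:
    config:
      spark_conf:
        # Ship the packed conda env from HDFS; YARN unpacks it under the
        # "#environment" alias in each container's working directory.
        spark.yarn.dist.archives: hdfs:///path/to/env.tar.gz#environment
        # Point PySpark at the interpreter inside the unpacked archive.
        spark.pyspark.python: ./environment/bin/python
```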
Thx - it works as you suggested!