# ask-community
p
Hi everyone, another question about dagster-pyspark, for which I looked through previous conversations but didn't find a solution. I am trying to use emr_pyspark_step_launcher with my EMR cluster, and to pass the pipeline path I am doing
"local_pipeline_package_path": str(Path(__file__).parents[1])
so that code.zip contains the parent of the parent directory, since I have some utility files that I need to run with pyspark. However, my run fails with
ModuleNotFoundError
Does anyone have any idea, or has anyone faced this issue? Thanks!
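For reference, a minimal sketch of what the run config might look like, assuming the launcher is wired in under a resource key named pyspark_step_launcher; the config keys come from the dagster-aws EMR launcher schema, while the cluster id, region, and bucket below are placeholders:

```python
from pathlib import Path

# Hypothetical run-config sketch for emr_pyspark_step_launcher.
# Cluster id, region, bucket, and resource key are placeholders.
run_config = {
    "resources": {
        "pyspark_step_launcher": {
            "config": {
                "cluster_id": "j-XXXXXXXXXXXX",
                "region_name": "us-east-1",
                "staging_bucket": "my-staging-bucket",
                "deploy_local_pipeline_package": True,
                # Parent of the parent directory, so sibling
                # utility packages land inside code.zip too.
                "local_pipeline_package_path": str(Path(__file__).parents[1]),
            }
        }
    }
}
```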
z
Where does the failure happen? If you're using the EMR launcher, it could fail either while trying to launch the job from the Dagster runner, or in the EMR step/job itself.
p
It fails in the EMR step.
z
What's the exact ModuleNotFoundError? When I got the EMR step launcher set up, I had to install many dependencies.
```shell
#!/bin/bash
sudo yum install -y python3-devel
sudo python3 -m pip install dagster dagster-aws boto3 dagster-spark
sudo python3 -m pip install dagster-pyspark --no-deps
```
Here's a snippet from the bootstrap script that I use with EMR. Not 100% sure this is what's happening to you, but hope it helps 🙂
p
I have installed all the dependencies. This error is for a utility method that I am importing from a local path.
Do you think I should also add the util files to the EMR cluster?
z
Yep, I was just trying to find out how I set mine up.
I don't have the "best setup" for it ATM, but it essentially posts the entire repo to EMR for me.
And the repo is like:
```
utils_generic/
dagster_repo_a/
dagster_repo_b/
```
Eventually I’ll need to change this to only be:
```
utils_generic/
dagster_repo_a/
```
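The reason a layout like this works: when the whole repo root goes into code.zip, the utility package sits at the top level of the zip, and Python can import packages straight off a zip on sys.path. A small self-contained simulation of that mechanism (directory and package names are made up for the demo):

```python
# Simulate how a code.zip built from the repo root makes a sibling
# utility package importable (mirrors the zip landing on sys.path
# when the EMR step runs).
import sys
import tempfile
import zipfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())

# Build a minimal repo layout: utils_generic/ at the repo root.
(tmp / "utils_generic").mkdir()
(tmp / "utils_generic" / "__init__.py").write_text("VALUE = 42\n")

# Zip it from the repo root, so utils_generic/ is top-level in the zip.
zip_path = tmp / "code.zip"
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.write(tmp / "utils_generic" / "__init__.py", "utils_generic/__init__.py")

# Python's zipimport can load packages directly from the archive.
sys.path.insert(0, str(zip_path))
import utils_generic
```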
p
Yeah, I want to post the entire repo; right now it is just uploading the pipeline file. It would be great to learn how to achieve this.
Do I need to change anything for this argument?
"local_pipeline_package_path"
Whenever you have some time to check the setup, please let me know. Thank you for your help, I really appreciate it!
z
I have mine set up to point to
str(Path(__file__).parent.parent)
p
I thought -
str(Path(__file__).parents[1])
would do the same thing, but I will try parent.parent as well.
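(For the record, the two spellings are equivalent; a quick stdlib check, with a made-up path:)

```python
from pathlib import PurePosixPath

p = PurePosixPath("/home/user/repo/dagster_repo_a/pipeline.py")

# parents[1] and parent.parent both name the grandparent directory.
assert p.parents[1] == p.parent.parent == PurePosixPath("/home/user/repo")
```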
Do you also set
"local_job_package_path"
?
z
I think those settings aren't compatible with one another; I do have
deploy_local_pipeline_package
set to
True
however.
I'm not sure what the exact outcome would be, TBH, and it probably depends heavily on how your repo is structured. One thing that's pretty great about Dagster is the ability to run it locally: you can run dagit via the VS Code debugger, set breakpoints with pdb to figure out exactly which values are being passed, and other great stuff.
p
That's helpful! I will try that. Thank you so much for your time and help 🙌
z
No worries! Hope you’re able to get it working shortly 🙂