# ask-community
Averell
Hello, I'm trying to use `emr_pyspark_step_launcher`, and found some issues with the spark-submit command. This is my folder structure, with the resource definitions in `resources/__init__.py`; my root module is `analytx` (so in my code, my imports look like `from analytx.assets import ...`):
```
analytx
├── __init__.py
├── assets
├── hooks
├── jobs
├── partitions
├── repo.py
├── resources
├── schedules
├── sensors
└── utils
```
The three config params I'm trying to deal with are `deploy_local_pipeline_package`, `s3_job_package_path`, and `local_pipeline_package_path` (a minimal sketch of the full resource config follows after this list).
1. With `deploy_local_pipeline_package = True`:
   a. If I set `local_pipeline_package_path` to `Path(__file__).parent.parent`, then code.zip has all those subdirs of analytx at the root level of the archive, and the EMR step fails with analytx not found.
   b. If I set `local_pipeline_package_path` to `Path(__file__).parent.parent.parent`, then the execution gets stuck while building the zip file.
   c. With `Path(__file__).parent.parent.parent / "analytx"`, I get the same result as (1a).
2. With `deploy_local_pipeline_package = False`, I expected the path in `s3_job_package_path` to be passed to spark-submit, but it wasn't. Also, no code.zip was built and copied to S3 (which was expected), but the spark-submit still refers to that file and, obviously, fails.
For (1a) and (1c) I could find a workaround. However, (1b) and (2) look more like bugs to me.
BTW, I'm using dagster 0.14.6
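For reference, here is a minimal sketch of the config being discussed, assuming it lives in `resources/__init__.py` as above. The `cluster_id`, `region_name`, and staging values are placeholders, not values from this thread, and the non-path field names should be double-checked against the dagster-aws docs for your version:
```python
from pathlib import Path

from dagster_aws.emr import emr_pyspark_step_launcher

# Minimal sketch of the launcher config under discussion.
emr_launcher = emr_pyspark_step_launcher.configured(
    {
        "cluster_id": "j-XXXXXXXXXXXXX",        # placeholder EMR cluster id
        "region_name": "us-east-1",             # placeholder region
        "staging_bucket": "my-staging-bucket",  # placeholder bucket
        "staging_prefix": "emr_staging",
        "wait_for_logs": False,
        # Scenario 1a: this zips the *contents* of analytx/, so the
        # subpackages land at the zip root and `import analytx...` fails
        # on the cluster.
        "deploy_local_pipeline_package": True,
        "local_pipeline_package_path": str(Path(__file__).parent.parent),
    }
)
```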
Claire
Hi Averell, thanks for the detail. To answer your question in part: #2 is an existing bug; we aren't currently using that argument. There is an issue tracking this: https://github.com/dagster-io/dagster/issues/3679
What's the error that you receive when the execution is stuck while building the zip file?
Averell
Hi Claire, thanks for the info. Regarding #1b, there's no error.
• In the happy scenario (#1a, #1c, where the EMR step is submitted successfully), per the debug logs, the main Python file is uploaded to S3 first (emr_step_main.py? Sorry, I don't have access to my deployment at the moment to confirm the file name), then code.zip, then a pickle file.
• In the error scenario #1b, the logs show that the main Python file was uploaded, with no more logs after that. Checking the temp folder in S3, I could only see the main Python file.
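One quick way to confirm the layout problem from #1a locally is to list the archive the launcher built. This is a plain-`zipfile` check, not a Dagster API, and assumes code.zip is in the current directory:
```python
import zipfile

# If the first entries are `assets/...`, `jobs/...`, etc. rather than
# `analytx/assets/...`, the root package is missing from the archive,
# matching the "analytx not found" failure in scenarios 1a/1c.
with zipfile.ZipFile("code.zip") as zf:
    for name in zf.namelist()[:20]:
        print(name)
```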
Claire
Hm, I replicated your directory setup and tried this, but I couldn't reproduce the error. Is there anything in the Dagster run logs that indicates why the zip file didn't upload to S3?
p
Hi @Averell, I have the same issue as your #1a scenario. I know this thread is 7 months old, but do you still recall the solution to this off the top of your head? Thank you!
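(Not a confirmed answer from this thread, but one way to get `analytx/` at the zip root without zipping the whole parent directory, which is what hung in #1b, is to build code.zip yourself. A sketch, with illustrative paths and a hypothetical `build_code_zip` helper:)
```python
import os
import zipfile

# Package only the analytx/ directory, keeping the `analytx/...` prefix
# inside the archive so `from analytx.assets import ...` resolves on EMR.
def build_code_zip(package_dir: str, out_path: str = "code.zip") -> None:
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for fname in files:
                full = os.path.join(root, fname)
                # arcname keeps the `analytx/...` prefix inside the zip
                zf.write(full, arcname=os.path.relpath(full, parent))

build_code_zip("analytx")
```
You would then upload the archive to S3 yourself and point `s3_job_package_path` at it with `deploy_local_pipeline_package: False`, which assumes the bug tracked in the issue linked above is fixed in your Dagster version.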