# ask-community
Averell
Hello, I'm trying to use `emr_pyspark_step_launcher`, and found some issues with the spark-submit command. This is my folder structure, with the resource definitions in `resources/__init__.py`; my root module is `analytx` (so in my code, my imports look like `from analytx.assets import ...`):
```
analytx
├── __init__.py
├── assets
├── hooks
├── jobs
├── partitions
├── repo.py
├── resources
├── schedules
├── sensors
└── utils
```
The three config params I'm trying to deal with are `deploy_local_pipeline_package`, `s3_job_package_path`, and `local_pipeline_package_path` (a minimal sketch of the full resource config follows after this list).
1. With `deploy_local_pipeline_package = True`:
   a. If I set `local_pipeline_package_path` to `Path(__file__).parent.parent`, then code.zip has all those subdirs of analytx at the root level of the archive, and the EMR step fails with analytx not found.
   b. If I set `local_pipeline_package_path` to `Path(__file__).parent.parent.parent`, then the execution gets stuck while building the zip file.
   c. With `Path(__file__).parent.parent.parent / "analytx"`, I get the same result as (1a).
2. With `deploy_local_pipeline_package = False`, I expected the path in `s3_job_package_path` to be passed to spark-submit, but it wasn't. Also, no code.zip was built and copied to S3 (which was expected), but the spark-submit still refers to that file and, obviously, fails.
For (1a) and (1c) I could find a workaround. However, (1b) and (2) look more like bugs to me.
BTW, I'm using dagster 0.14.6
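For reference, here is a minimal sketch of the config being discussed, assuming it lives in `resources/__init__.py` as above. The `cluster_id`, `region_name`, and staging values are placeholders, not values from this thread, and the non-path field names should be double-checked against the dagster-aws docs for your version:
```python
from pathlib import Path

from dagster_aws.emr import emr_pyspark_step_launcher

# Minimal sketch of the launcher config under discussion.
emr_launcher = emr_pyspark_step_launcher.configured(
    {
        "cluster_id": "j-XXXXXXXXXXXXX",        # placeholder EMR cluster id
        "region_name": "us-east-1",             # placeholder region
        "staging_bucket": "my-staging-bucket",  # placeholder bucket
        "staging_prefix": "emr_staging",
        "wait_for_logs": False,
        # Scenario 1a: this zips the *contents* of analytx/, so the
        # subpackages land at the zip root and `import analytx...` fails
        # on the cluster.
        "deploy_local_pipeline_package": True,
        "local_pipeline_package_path": str(Path(__file__).parent.parent),
    }
)
```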
Claire
Hi Averell, thanks for the detail. To answer your question in part: #2 is an existing bug; we aren't currently using that argument. There is an issue tracking this: https://github.com/dagster-io/dagster/issues/3679
What's the error that you receive when the execution is stuck while building the zip file?
Averell
Hi Claire, thanks for the info. Regarding #1b, there's no error.
• In the happy scenario (#1a, #1c, where the EMR step is submitted successfully), per the debug logs, the main Python file is uploaded to S3 first (emr_step_main.py? Sorry, I don't have access to my deployment at the moment to confirm the file name), then code.zip, then a pickle file.
• In the error scenario #1b, the logs show that the main Python file was uploaded, with no more logs after that. Checking the temp folder in S3, I could only see the main Python file.
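One quick way to confirm the layout problem from #1a locally is to list the archive the launcher built. This is a plain-`zipfile` check, not a Dagster API, and assumes code.zip is in the current directory:
```python
import zipfile

# If the first entries are `assets/...`, `jobs/...`, etc. rather than
# `analytx/assets/...`, the root package is missing from the archive,
# matching the "analytx not found" failure in scenarios 1a/1c.
with zipfile.ZipFile("code.zip") as zf:
    for name in zf.namelist()[:20]:
        print(name)
```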
Claire
Hm, I replicated your directory setup and tried this, but I couldn't reproduce the error. Is there anything in the Dagster run logs that indicates why the zip file didn't upload to S3?
p
Hi @Averell, I have the same issue as your #1a scenario. I know this thread is 7 months old, but do you still recall the solution to this off the top of your head? Thank you!
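(Not a confirmed answer from this thread, but one way to get `analytx/` at the zip root without zipping the whole parent directory, which is what hung in #1b, is to build code.zip yourself. A sketch, with illustrative paths and a hypothetical `build_code_zip` helper:)
```python
import os
import zipfile

# Package only the analytx/ directory, keeping the `analytx/...` prefix
# inside the archive so `from analytx.assets import ...` resolves on EMR.
def build_code_zip(package_dir: str, out_path: str = "code.zip") -> None:
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for fname in files:
                full = os.path.join(root, fname)
                # arcname keeps the `analytx/...` prefix inside the zip
                zf.write(full, arcname=os.path.relpath(full, parent))

build_code_zip("analytx")
```
You would then upload the archive to S3 yourself and point `s3_job_package_path` at it with `deploy_local_pipeline_package: False`, which assumes the bug tracked in the issue linked above is fixed in your Dagster version.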