# ask-community
Hey team, we are experiencing some issues when using the `emr_pyspark_step_launcher` in our Dagster deployment. We've gotten the step launcher working when all jobs, ops, and the main repository are in one file; however, this issue arises when we try to import methods from other modules into our jobs. It seems the EMR cluster is not able to find these directories, and we are wondering if this is due to the spark-submit command being sent by the step launcher.

If we compare our two spark-submit commands, we can see they both pass in the location of `code.zip`, `emr_step_main.py`, and `step_run_ref.pkl`. The difference is that in the `code.zip` of our working example, all of the files the repository needs are available in the parent directory of the zip location. In our failing example, the jobs imported into our repository call methods in other subfolders under that parent directory. This leads to an error in our EMR logs, which I have attached to this message.

Here is a sample of our repository structure and what we are passing to the step launcher as configuration:

Repo structure:

```
Root/
├── Enums.py
└── orchestration_manager/
    ├── jobs/
    │   └── job.py
    ├── ops/
    │   └── ops.py
    ├── hooks/
    │   └── hook.py
    ├── repositories/
    │   └── repository.py
    └── resources/
        └── resource.py
```

EMR pyspark step launcher config:

```python
{
    "cluster_id": "<CLUSTER_ID>",
    "local_pipeline_package_path": str(Path(__file__).parent.parent.parent),
    "deploy_local_pipeline_package": True,
    "region_name": "us-east-1",
    "staging_bucket": "BUCKET",
    "wait_for_logs": True,
}
```

If we go into `code.zip` on S3, we can see that our `orchestration_manager` folder as well as all of its subfolders are present. This is the sample spark-submit command being triggered by Dagster:

```
/usr/lib/spark/bin/spark-submit --master yarn --deploy-mode client --conf spark.app.name=pyspark_commands --py-files s3://BUCKET/emr_staging/9bd3de6e-6d92-4552-8373-b9a0b085abb4/pyspark_commands/code.zip s3://BUCKET/emr_staging/9bd3de6e-6d92-4552-8373-b9a0b085abb4/pyspark_commands/emr_step_main.py BUCKET emr_staging/9bd3de6e-6d92-4552-8373-b9a0b085abb4/pyspark_commands/step_run_ref.pk
```

Do we need to recursively reference every directory in our repo structure in the spark-submit command in order to be able to import from these files?

In the EMR pyspark step launcher code, I noticed this piece of documentation:

```python
"""
Synchronize the step run ref and pyspark code to an S3 staging bucket for use on EMR.

For the zip file, consider the following toy example:

    # Folder: my_pyspark_project/
    # a.py
    def foo():
        print(1)

    # b.py
    def bar():
        print(2)

    # main.py
    from a import foo
    from b import bar

    foo()
    bar()

This will zip up `my_pyspark_project/` as `my_pyspark_project.zip`. Then, when running
`spark-submit --py-files my_pyspark_project.zip emr_step_main.py` on EMR, this will print 1, 2.
"""
```

This implies that we should be able to zip up the root folder and interact with any of the files in that directory. However, the difference between our use case (where this error occurs) and the docstring is that we are trying to drill into subfolders, whereas in the example all of the imported modules live directly in the parent directory. I am new to pyspark, so any information on why this error is occurring would be greatly appreciated.
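For completeness, this is roughly how we attach the step launcher to our job. It's a simplified sketch: the op body and the resource key names are illustrative rather than our exact code, but the config dict is the one shown above.

```python
from pathlib import Path

from dagster import job, op
from dagster_aws.emr import emr_pyspark_step_launcher
from dagster_pyspark import pyspark_resource

# Same config as above; <CLUSTER_ID> and BUCKET are redacted placeholders.
emr_launcher = emr_pyspark_step_launcher.configured(
    {
        "cluster_id": "<CLUSTER_ID>",
        "local_pipeline_package_path": str(Path(__file__).parent.parent.parent),
        "deploy_local_pipeline_package": True,
        "region_name": "us-east-1",
        "staging_bucket": "BUCKET",
        "wait_for_logs": True,
    }
)


@op(required_resource_keys={"pyspark", "pyspark_step_launcher"})
def pyspark_commands_op(context):
    # Runs on EMR via the step launcher; this is where the imports from the
    # orchestration_manager subpackages fail for us (illustrative body).
    ...


@job(
    resource_defs={
        "pyspark": pyspark_resource,
        "pyspark_step_launcher": emr_launcher,
    }
)
def pyspark_commands():
    pyspark_commands_op()
```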
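To illustrate the mechanism we understand `--py-files` to rely on (the zip is put on `sys.path` and regular package imports are resolved through it), here is a minimal, self-contained sketch. This is not our real code, the module names are made up, and it assumes an `__init__.py` in every folder, which may or may not be relevant to our failure.

```python
"""Minimal local check: import from a nested package inside a zip."""
import os
import sys
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "code.zip")

# Mimic the nested layout: a package, a subpackage, and a module in the subpackage.
files = {
    "orchestration_manager/__init__.py": "",
    "orchestration_manager/ops/__init__.py": "",
    "orchestration_manager/ops/ops.py": "def answer():\n    return 42\n",
}

with zipfile.ZipFile(zip_path, "w") as zf:
    for name, body in files.items():
        zf.writestr(name, body)

# Roughly what --py-files does on the cluster: make the zip importable.
sys.path.insert(0, zip_path)

from orchestration_manager.ops.ops import answer  # noqa: E402

print(answer())  # prints 42
```

If importing from subfolders inside a zip works like this locally but not on EMR, that would suggest the problem is in how the step launcher builds or ships `code.zip` rather than in Python's zip imports themselves.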
👍 1