Hi, good morning everyone! I am trying to build a pipeline in Dagster that does the following:
1. Launch an EMR cluster using the <https://docs.dagster.io/_modules/dagster_aws/emr/emr#EmrJobRunner|EmrJobRunner> class, by using its run_job_flow function.
2. Add one or more steps to that cluster to process data in PySpark by using the <https://docs.dagster.io/_modules/dagster_aws/emr/pyspark_step_launcher#emr_pyspark_step_launcher|emr_pyspark_step_launcher> resource. 
3. Shut down the cluster once all steps are finished.
I first followed this <https://docs.dagster.io/integrations/spark#submitting-pyspark-ops-on-emr|tutorial>, which assumes that an EMR cluster is already running and that its cluster ID is hard-coded in the job specification. That worked: I could see my steps running on EMR. However, when I tried to automate the process, I noticed that PySpark was running locally and not on EMR. I tried wrapping the emr_pyspark_step_launcher in a resource that sets the cluster ID as part of the pipeline. The cluster ID can be obtained from a function in the EmrJobRunner class that returns the cluster ID for a given cluster name. I am trying to set the cluster ID dynamically during the job, after launching the cluster, but this isn't working as expected. Has anyone set up this type of pipeline successfully? If so, how did you do it? I posted my code <https://dagster.slack.com/archives/C01U954MEER/p1669989661482459|here>; any help would be appreciated! Thank you :slightly_smiling_face:
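
For reference, here is a minimal sketch of the launch / add steps / shut down lifecycle from the three steps listed above. It is not the code linked in the message, and it deliberately swaps in the plain boto3 EMR client calls (run_job_flow, add_job_flow_steps, the step_complete waiter, terminate_job_flows) instead of EmrJobRunner or emr_pyspark_step_launcher. The function name run_steps_on_transient_cluster and its parameters are hypothetical, and it assumes emr_client behaves like boto3.client("emr").

```python
from typing import Any, Dict, List


def run_steps_on_transient_cluster(
    emr_client: Any,
    cluster_config: Dict[str, Any],
    steps: List[Dict[str, Any]],
) -> List[str]:
    """Launch a cluster, run the given steps, and always terminate the cluster.

    Hypothetical helper: `emr_client` is assumed to expose the same methods
    as a boto3 EMR client; `cluster_config` is passed through unchanged.
    """
    # 1. Launch the cluster. The cluster ID only exists from this point on,
    #    so nothing configured before the run can know it.
    cluster_id = emr_client.run_job_flow(**cluster_config)["JobFlowId"]
    try:
        # 2. Submit the steps, then block until each one has completed.
        response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
        step_ids = response["StepIds"]
        waiter = emr_client.get_waiter("step_complete")
        for step_id in step_ids:
            waiter.wait(ClusterId=cluster_id, StepId=step_id)
        return step_ids
    finally:
        # 3. Shut the cluster down even if submitting or waiting failed.
        emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
```

With boto3 installed, the first argument would be something like boto3.client("emr", region_name=...). The try/finally is the main design choice here: the cluster is terminated whether or not the steps succeed, so a failed run does not leave a cluster running. How to hand the run-time cluster ID to emr_pyspark_step_launcher inside a Dagster job is a separate question that this sketch does not answer.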