What would be the preferred approach for managing a python asset with its op code in an on-prem git repo but running it in databricks as part of a pipeline? Would I create an op to post the file from the on-prem repo using the DBFS API to the Databricks workspace and then use the dagster_databricks.create_databricks_submit_run_op() to run the file in the Databricks cluster?
dagster bot responded by community 1
03/16/2023, 8:29 PM
You could just use the databricks_pyspark_step_launcher to transparently execute the code in Databricks for you. It moves the op code to Databricks at runtime, executes it via the /jobs/runs/submit endpoint, polls for the result, and continues the Dagster pipeline once the Databricks job has completed. This is by far my preferred method: you don't have to deploy the code you want to execute on Databricks separately yourself, Dagster handles all of that for you, and the executed code is never out of date.
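The submit-then-poll loop described above can be sketched in plain Python. This is not the step launcher's actual implementation; `submit_run` and `get_run_state` below are hypothetical stand-ins for the real HTTP calls to `/jobs/runs/submit` and `/jobs/runs/get`, injected as callables so the mechanism is clear:

```python
import time


def wait_for_run(submit_run, get_run_state, poll_interval=0.0, timeout=10.0):
    """Submit a Databricks run, then poll until it reaches a terminal state.

    submit_run: callable returning a run id (stands in for /jobs/runs/submit)
    get_run_state: callable(run_id) -> state string (stands in for /jobs/runs/get)
    """
    run_id = submit_run()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_run_state(run_id)
        if state in ("SUCCESS", "FAILED", "CANCELED"):
            # the Dagster pipeline would continue (or raise) from here
            return state
        time.sleep(poll_interval)
    raise TimeoutError(f"run {run_id} did not finish within {timeout}s")


# Demo with a fake API that reports PENDING -> RUNNING -> SUCCESS:
states = iter(["PENDING", "RUNNING", "SUCCESS"])
final_state = wait_for_run(lambda: 42, lambda run_id: next(states))
```

In the real step launcher, Dagster also uploads the packaged op code to cloud storage before submitting, so the cluster always runs the code from the current repo checkout.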
03/16/2023, 8:39 PM
Wonderful! I'll look at the docs closer since, for some reason, I missed that! Ok now I see it.
BTW, I am guessing that an S3/ADLS bucket needs to be visible to databricks for input data and results?
03/16/2023, 8:53 PM
yeah and for storing the run code. you could theoretically write your own IOManager for handling inputs / outputs in a different remote storage service, but S3/ADLS will still be necessary for storing the artifacts needed to run the op. and for IO it's pretty easy to slap the
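To illustrate the custom-IOManager route mentioned above: the method names `handle_output`/`load_input` match Dagster's `IOManager` interface, but this is a simplified stdlib-only sketch. A real one would subclass `dagster.IOManager`, receive context objects instead of raw keys, and write to your remote storage service; here a dict stands in for the bucket:

```python
class DictIOManager:
    """Sketch of the IOManager shape: handle_output stores an op's result,
    load_input fetches it for the downstream op. A dict stands in for the
    S3/ADLS bucket a real implementation would write to."""

    def __init__(self):
        self._store = {}  # stands in for remote object storage

    def _key(self, step_key, output_name):
        # real IOManagers derive this from the OutputContext/InputContext
        return f"{step_key}/{output_name}"

    def handle_output(self, step_key, output_name, obj):
        self._store[self._key(step_key, output_name)] = obj

    def load_input(self, step_key, output_name):
        return self._store[self._key(step_key, output_name)]
```

Whatever storage you swap in has to be reachable from both the Dagster host and the Databricks cluster, since the upstream op's output is written on one side and loaded on the other.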